
A computational model
for unsupervised childlike
speech acquisition
by
Holger Brandl
Submitted to the
Technische Fakultät
Universität Bielefeld
in partial fulfillment of the requirements for the degree of a
Doktor-Ingenieur

thesis supervisors
Dr.-Ing. Frank Joublin, Honda Research Europe GmbH
Dr.-Ing. Britta Wrede, Universität Bielefeld
July 2009

Acknowledgments
This thesis would not have been possible without Doreen, my beloved girlfriend and proud mother
of my little sunshine Simon. I would like to thank her and my parents for supporting me
throughout all the years of my studies and for all their interest in my research.
I thank my supervisor Frank Joublin for his constant support, his always honest feedback,
more inspiring ideas than I could ever implement, and for pushing my work forward by pluckily
questioning many things I’ve always taken for granted. I would like to express my deep thanks to
my supervisor Britta Wrede for supporting me with all her knowledge about speech recognition
and infant development, and her always positive, warm, and charming attitude whenever we met
to shape my fuzzy ideas into a thesis.
I thank my PhD fellows Miguel Vaz, Xavier Domont, Björn Schölling, Irene Ayllon and
Claudius Gläser for all the endless but always highly inspiring discussions, countless
Blockrunden, their sometimes sharp but always friendly and constructive feedback, and all the fun
we had together. Big thanks go to Jens Schmudderich for lifting me over the Audiotellerrand into the
wonderful world of multi-modal pattern recognition, the bracing hours in the gym, and his brave
wonderful world of multi-modal pattern recognition, the bracing hours in the gym, and his brave
attempts to make our life as PhD-students even more comfortable.
I would like to thank Tobias Rodemann for refocusing me whenever I drifted too far away
from my topic, and also for comforting me in hours of BBCM despair. Many thanks go
to Martin Heckmann for sharing his deep understanding of speech processing systems with me
whenever I was confused, for his wonderful ability to condense fuzzy discussions to the essence of
the matter, and for supporting my favorite time series modeling technique as long as we have not
yet found the perfect and ultimate replacement.
Many thanks go to the ALIS team for forcing my model from a beautiful annotated corpus-universe
into the rough seas of human-robot interaction. I would especially like to thank Christian
Goerick, who has raised my view on system design to a new and thrilling level. I would like to
thank Gerhard Sagerer, Edgar Körner, Andreas Richter and Franz Kummert for making this thesis
possible, surely somewhere in between Offenbach and Bielefeld, but always feeling at home and
welcome in both locations.
Finally, I would like to express my thanks to all reviewers of this thesis for their endless fights
against my not-yet-converged English syntax model, and their mindfulness in revealing all the typos
I have really worked hard to hide. It was their questions that helped me to catch at least a brief
shimmering glimpse of what a model for infant-inspired speech acquisition could look like.
Dresden, July 2009
Holger Brandl
Summary
Speech understanding requires the ability to parse spoken utterances into words. But this ability
is not innate and needs to be developed by infants within the first years of their life. So far,
almost all computational speech processing systems have neglected this bootstrapping process. Here we
propose a model for early infant speech structure acquisition, implemented as a layered architecture
comprising phones, syllables and words. Our model processes raw acoustic speech as input and
aims to capture its structure, unsupervised, on different levels of granularity.
Most previous models assumed different kinds of innate language-specific predispositions. We drop
such unlikely assumptions and propose a model that is developmentally plausible. We condense
findings from developmental psychology down to a few basic principles that our model aims
to reflect on a functional level. By doing so, our proposed model learns the structure of speech
through a multitude of coupled, self-regulated bootstrapping processes.
We evaluate our model on speech corpora that have some of the properties of infant-directed speech.
To further validate our approach, we outline how the proposed model integrates into an embodied
multi-modal learning and interaction framework running on Honda's Asimo robot. Finally, we
propose an integrated model for speech structure and imitation learning through interaction that
enables our robot to learn to speak with its own voice.
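The layered idea sketched above, with each granularity level building its unit inventory from the output of the level below, can be illustrated in a few lines of Python. This is a minimal toy sketch only: the class and function names are invented for illustration, the "re-segmentation" step is a trivial pairing stand-in, and none of the actual unsupervised clustering or self-regulation of the thesis model is implemented here.

```python
# Hypothetical sketch of a layered bootstrapping hierarchy.
# Names and interfaces are illustrative, not taken from the thesis.

class Layer:
    """One granularity level (phones, syllables, or words)."""

    def __init__(self, name):
        self.name = name
        self.inventory = set()  # units discovered so far, unsupervised

    def bootstrap(self, segments):
        """Add observed segments to this layer's unit inventory and pass
        a coarser re-segmentation up to the next layer.  As a stand-in
        for real segmentation, adjacent units are simply paired."""
        self.inventory.update(segments)
        return [a + b for a, b in zip(segments[::2], segments[1::2])]


def run_hierarchy(acoustic_units):
    """Feed raw units bottom-up through phones -> syllables -> words."""
    layers = [Layer("phones"), Layer("syllables"), Layer("words")]
    segments = acoustic_units
    for layer in layers:
        segments = layer.bootstrap(segments)
    return {layer.name: sorted(layer.inventory) for layer in layers}


print(run_hierarchy(["b", "a", "l", "l"]))
# → {'phones': ['a', 'b', 'l'], 'syllables': ['ba', 'll'], 'words': ['ball']}
```

The point of the sketch is only the coupling structure: each layer consumes the segmentation produced by the layer below, so improvements at one level propagate upward, which is the shape of the coupled bootstrapping processes the summary describes.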
Contents
Acknowledgments 3
Summary 5
Contents 9
1 Introduction 13
1.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 How language comes to children 17
2.1 Child directed speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Word segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Statistical Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.2 Metric segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.3 The principle of subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.4 Allophonic and articulatory cues . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Syllable segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Sonority sequencing principle . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Maximum onset principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.3 Phonotactic learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Phone segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Vocabulary acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 Pattern recognition background 29
3.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.1 Incremental clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.2 Self-organizing neural nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Probability density estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Parametric approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 MAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3 Non-parametric approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Information theory basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.1 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.2 Model adaption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Statistical language modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6 Automatic speech recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6.1 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6.2 Acoustic modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6.3 Keyword Spotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.6.4 Confidence Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.7 Neural networks for speech recognition . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 Symbolic models for speech acquisition 49
4.1 Symbolic sub-syllable learning of speech structure . . . . . . . . . . . . . . . . . . . 49
4.2 Symbolic syllable structure learning . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Symbolic word learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5 Acoustic speech modeling 55
5.1 Implicit model-based speech clustering . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Direct methods for speech segmentation . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2.1 Model-based change point detection . . . . . . . . . . . . . . . . . . . 58
5.2.2 Syllable segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3 Word acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.1 Acoustic model bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6 Model 65
6.1 Computational requirements and constraints . . . . . . . . . . . . . . . . . . . . . 66
6.1.1 Type of input speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.1.2 Speech representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.1.3 Order of bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.1.4 Processing principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.5 Coupling of speech unit representations . . . . . . . . . . . . . . . . . . . . 71
6.2 System overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.3 Phones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.3.1 Unsupervised phone cluster learning . . . . . . . . . . . . . . . . . . . . . . 75
6.3.2 Phonotactic Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.4 Syllables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.4.1 Syllable spotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.4.2 Training segment generation . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.4.3 Incremental clustering of syllable segments . . . . . . . . . . . . . . . . . . 80
6.4.4 Regulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.4.5 Syllable transition modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.5 Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.5.1 Top-down error correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.5.2 Basic syntax learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.6 Scientific contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.6.1 What this model is not . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7 Evaluation 91
7.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.1.1 Model Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.1.2 Segmentation quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.1.3 Kappa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.1.4 Segmentation vs. clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.1.5 Statistical learning quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.2 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.2.1 Phones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.2.2 Monosyllabic words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.2.3 Semi-synthetic speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.2.4 Discrete speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.2.5 Child-directed read speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.3 Phones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.3.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.3.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.3.3 Phone-distributed word models . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.3.4 Phone language model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.3.5 Phonotactic model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.3.6 Syllabic parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.4 Syllables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.4.1 Model initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.4.2 Clustering process properties . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.4.3 Spotting performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.4.4 Subtraction learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.4.5 Syllable Grammar learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.5 Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.5.1 Lexical Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8 Embodied speech acquisition 115
8.1 Grounded word . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.2 Linking speech perception and production . . . . . . . . . . . . . . . . . . . . . . . 117
8.2.1 Speech production architecture . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.2.2 Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
8.2.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9 Summary and discussion 125
9.1 Discussion and outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Bibliography 135