Chapter 9 Language Acquisition and Machine Learning

Xiaomeng Ma

The City University of New York – Graduate Center

9.1 Introduction

Child language acquisition and machine learning are two different topics, but their core questions are alike. Infants and algorithms are both exposed to linguistic input. They find patterns in the input and either produce input-like output (e.g., children start to speak) or perform other tasks based on the input (e.g., machine translation).

Language is a defining property of human beings. It is a cultural artifact and an important communication tool. Linguistically speaking, every language in the world can be defined as a collection of sound/meaning pairs. People hear a speech stream and make sense of it. Moreover, they can produce similar speech in order to communicate. Language is an extremely complex multi-modal system; nevertheless, typically developing infants acquire it in a seemingly effortless manner. For decades, linguists and psychologists have been trying to understand the mechanism of language acquisition. Chomsky (1986) divided the question of language acquisition into two parts: what constitutes knowledge of a language, and how is that knowledge acquired by its users? Theoretical linguists have been working on the first part of the question and psycholinguists on the second. The tension between the two parts was noted by Chomsky: “To achieve descriptive adequacy it often seems necessary to enrich the system of available devices, whereas to solve our case of Plato’s problem we must restrict the system of available devices so that only a few languages or just one are determined by the given data. It is the tension between these two tasks that makes the field an interesting one, in my view.” (Chomsky, 1986)

This tension between an adequately descriptive device and a restricted system is also reflected in infants’ language acquisition. Infants learn from the utterances of the people around them (typically their parents), but these utterances are finite, incomplete and idiosyncratic. They learn language without explicit instruction: nobody teaches a two-year-old to put a subject, a verb and an object in a sentence. They also learn rapidly; usually by the age of 5, children are able to communicate without difficulty. In addition, the outcome of acquisition is uniform: typically developing children usually achieve the same level of fluency in their native language. To summarize the paradox, children receive a finite and limited set of input and yet produce infinite and highly original output. To solve this problem, linguists need to find a class of representations that is sufficiently rich to account for the observed dependencies in natural language.

In machine learning tasks, most goals are achieved by learning from instances, often through a neural network, in order to build a representation of the instances and produce input-like output. Tom Mitchell (1997) defined machine learning as follows: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. In language-related machine learning tasks, the instances are typically sentences. In general, the more instances are available as input, the better the representation that can be formed. The representations the algorithm needs to mirror are linguistic concepts such as phonemes, words, or grammar.

Language acquisition and machine learning are therefore similar in that both children and algorithms have to process input data, find features, build a representation and perform the desired tasks based on that representation. Given a stream of linguistic input, an algorithm or a human brain incrementally learns a grammar that captures its statistical patterns, which can then be used to parse or generate new data. Putting language acquisition and machine learning together could therefore provide new perspectives on the challenges faced in both fields. In the rest of this chapter, two specific tasks (speech categorization and word categorization) are discussed from the perspectives of child language acquisition and machine learning.

9.2 Speech Categorization in Language Acquisition

An early and essential task for infants is to make sense of the speech that they hear. There are at least three levels of discrimination for infants to master: the phonemic/phonological level, the lexical level and the phrasal/sentence level. At the phonemic/phonological level, infants must learn to partition varied sounds into phonemic categories. For example, English-speaking infants need to know that /kæt/ (cat) and /hæt/ (hat) refer to different objects, but that /wɔt̬ɚ/ and /wɔtɚ/ (water) both mean the transparent liquid they drink when they feel thirsty. Infants acquiring a tonal language also need to differentiate between tones. For example, Chinese-speaking infants usually learn at a young age that /mā/ (mother) and /mǎ/ (horse) are totally different. Infants’ first task in language acquisition is thus to figure out the phonemic categories before trying to make sense of them. The capacity for this complex phonetic learning task appears to be partly innate in all typically developing infants. Previous studies have shown that young infants are especially sensitive to acoustic change and speech sounds. Brain imaging studies showed that infants as young as 4 months of age can discriminate speech sounds from non-speech sounds, and that they prefer languages they are familiar with to unfamiliar ones (Minagawa-Kawai, Cristià and Dupoux, 2011; Moon, Lagercrantz and Kuhl, 2013). Moreover, infants are also able to identify phonological boundaries and segment words based on the rich linguistic information in the auditory stream (Jusczyk, Cutler and Redanz, 1993; Jusczyk, Houston, and Newsome, 1999; Houston et al., 2000). As well as differentiating the audio input, infants must also learn to perceptually group different sounds into the same category. For example, when infants hear the word “cookie” said by their mother and by their father, the two instances of “cookie” sound perceptually different, yet infants have to integrate the different inputs and group them together. Kuhl (1985) describes this as the problem of speech categorization.

Speech categorization is a very difficult task. Human languages have finite sets of phonetic units, or combinations of vowels and consonants, but those phonetic units are difficult to define physically. Their features are influenced by talker variability, speech rate and context. When different speakers produce the same sound, the acoustic features vary widely (Figure 1). Likewise, talking fast or placing a sound in a different phonetic environment can affect the physical features of that sound.

9.3 Speech Categorization as a Machine Learning Task

Given the messy nature of speech categorization, speech recognition technologies face the same challenges as human infants. The last twenty years have witnessed rapid development of speech recognition technology, which enabled the birth of products like Siri and Alexa. However, the performance of Siri and Alexa is still not satisfactory. “Sorry I don’t understand what you said” is a constant frustration for Siri and Alexa users. Siri and Alexa are based on traditional speech technology, which relies on linguistic resources and textual information to build acoustic and speech models. Traditional speech technology depends on ever larger amounts of labeled data to train models. Traditional approaches to speech recognition include Hidden Markov Models (HMM) and Dynamic Time Warping (DTW), combined with artificial neural networks. HMM-based models are the most popular models in the speech recognition field. Speech signals can be viewed as piecewise stationary signals, which is what makes them amenable to HMM modeling. In an HMM-based system, the acoustic model and the language model are both trained on speech signals. One advantage of HMM-based models is that accuracy is closely related to training data size, so an HMM-based model is easy to improve as long as a large amount of training data is available. However, when several gigabytes of data or memory are not available, HMM-based models do not perform well. This is also the reason why Amazon and Apple both run Alexa and Siri in the cloud and require a network connection, instead of running them on the local device. More recent work has focused on end-to-end speech recognition models that combine all components of the speech recognizer and train them jointly. Recent attempts have successfully trained supervised systems using textual transcripts only (Hannun et al., 2014; Miao, Gowayyed and Metze, 2015).
However, this is still not the most efficient model of speech recognition, since there is an extra step of translating speech input into textual information. Computational linguists have therefore turned to a more efficient approach, namely the way human infants process speech. From babbling at 6 months of age to producing full sentences by the age of 3, young children learn how to talk before they know how to read and write, and with minimal instruction. Inspired by early language acquisition, zero-resource speech technologies were first proposed at the JHU CLSP Workshop in 2012, “with an aim to construct a system that learn an end-to-end Spoken Dialog (SD) system, in an unknown language, from scratch, using only information available to a language learning infant” (Jansen et al., 2013). “Zero resource” refers to the absence of labelled data in the training set, which is intended to imitate an unsupervised learning process. In 2015, the first Zero Resource Speech Challenge was organized in order to bring researchers together and compare their systems within a common open-source evaluation setting. The participants worked on the same data and were evaluated on the same criteria.

There are two major tasks in the Zero Resource Speech Challenge: subword modeling and spoken term discovery. Subword modeling requires building a representation of the speech signal that is robust across different talkers, speech rates and contexts. Since this is an unsupervised task, the definition of a subword is not confined to phonemes, sounds, or any other predefined linguistic category; instead, a subword is defined as a basic unit that distinguishes words. Subword modeling is thus similar to speech categorization in child language acquisition. Both tasks involve finding features in the sound stream that are linguistically relevant (e.g., phoneme structure) and discarding non-linguistic features (e.g., speaker identity). In the Zero Resource Speech Challenge, the participants are required to provide a feature representation that maximally discriminates speech units in the raw input. Evaluating such a representation usually involves training a phone classifier and measuring its classification accuracy. The Zero Resource Speech Challenge instead uses minimal-pair ABX tasks to evaluate feature representations, since these do not require any labelled training data (Schatz, 2013; Schatz, 2014). The minimal-pair ABX task is a match-to-sample task that measures the discriminability between two sound categories. If sound A and sound B belong to two separate categories, α and β, then given a new sound X, the task is to decide whether X belongs to α or β. ABX discriminability is defined as the probability that the Dynamic Time Warping (DTW) divergence between X and the sound from its own category is smaller than the DTW divergence between X and the sound from the other category. The frame-level dissimilarity underlying the DTW is calculated with either the cosine distance or the KL-divergence.
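The ABX measure described above can be sketched in a few lines. The following is a minimal illustration, not the challenge's official evaluation code: it assumes that each token is a list of feature frames, uses the cosine distance at the frame level, normalises the DTW cost by the summed sequence lengths (an assumption made here), and counts ties as half. The function names and the toy tokens are hypothetical.

```python
import math
from itertools import product


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two feature frames."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_divergence(seq_a, seq_b):
    """DTW alignment cost between two frame sequences.

    The cost is divided by len(seq_a) + len(seq_b); this normalisation
    is an assumption of the sketch.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def abx_discriminability(cat_alpha, cat_beta):
    """Share of (A, B, X) triplets with d(A, X) < d(B, X).

    A and X are distinct tokens of category alpha, B is a token of beta.
    Ties contribute 0.5.
    """
    score, total = 0.0, 0
    for (i, a), (j, x) in product(enumerate(cat_alpha), repeat=2):
        if i == j:
            continue  # A and X must be different tokens
        for b in cat_beta:
            d_ax = dtw_divergence(a, x)
            d_bx = dtw_divergence(b, x)
            score += 1.0 if d_ax < d_bx else 0.5 if d_ax == d_bx else 0.0
            total += 1
    return score / total


# Hypothetical two-dimensional "features" for two tokens of alpha and one of beta.
alpha = [
    [[1.0, 0.1], [0.9, 0.2]],
    [[1.0, 0.0], [0.8, 0.1], [0.9, 0.1]],
]
beta = [
    [[0.1, 1.0], [0.2, 0.9]],
]
print(abx_discriminability(alpha, beta))
```

A score near 1 means the representation separates the two categories well, while 0.5 corresponds to chance. The published challenge results report an error rate derived from a measure of this kind, so the numbers printed here are not directly comparable with the tables.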

In the 2015 Zero Resource Speech Challenge, two data sets were provided to participants: the Buckeye corpus of conversational English (Pitt et al., 2007) and the Xitsonga section of the NCHLT corpus of South Africa’s languages (de Vries et al., 2014). For the English corpus, 6 male and 6 female native speakers of English recorded a total of 4h59m05s of speech; for the Xitsonga section, 12 male and 12 female speakers recorded a total of 2h29m07s of speech (Versteegh et al., 2016). A total of 5 algorithms for subword modeling were accepted for publication. Their scores on ABX discriminability are shown in Table 1. The baseline feature representation consists of Mel-Frequency Cepstral Coefficients (MFCC). MFCCs are coefficients that collectively represent the short-term power spectrum of a sound. MFCCs are used as the baseline feature because they are not language-specific. The topline feature representation is the result of training on labeled data and is derived from a Kaldi GMM-HMM system. As shown in Table 1, for English most of the algorithms performed better than the baseline in both the across-speaker and the within-speaker task, and two of them even beat the topline in the within-speaker task. For Xitsonga, most of the algorithms performed better than the baseline, but none of them performed better than the topline.

The best-performing algorithm for the cross- and within-speaker tasks in English and the within-speaker task in Xitsonga is DPGMM (Chen et al., 2015). Chen and his colleagues applied a pipeline of talker-normalized MFCCs followed by a Dirichlet process Gaussian mixture model (DPGMM). DPGMM is a Bayesian nonparametric model that automatically learns the number of components from the observed data, and it had previously been applied successfully to the clustering of speech segments (Kamper, Jansen, King and Goldwater, 2014). This approach came very close to the topline in the across-speaker tasks, and in the within-speaker tasks it even outperformed the topline. These results indicate that speech recognition without labeled data is a plausible direction that is worth further pursuit. Badino et al. (2015) also applied feature space modeling in their algorithm. They used binarized auto-encoders and HMM encoders to learn input features. Their results were only slightly better than the MFCC baseline in the cross-speaker tasks and worse in the within-speaker tasks. One possible explanation is that phonological features are not binary in nature, so that applying a binary coder to non-binary data results in overrepresentation.

Instead of modeling the feature space, Renshaw’s and Thiolliere’s teams both exploited top-down information. They generated word-like pairs using an unsupervised discovery system and used the discovered pairs as input to a neural network. Renshaw’s team used a correspondence auto-encoder (CAE) to learn the patterns in the input. Thiolliere’s team used the discovered pairs to train a siamese network. Their system achieved the best results in the Xitsonga cross-speaker task.

Baljekar’s team applied articulatory information derived from a previously trained speech synthesis system for languages without a writing system. The results were worse than the baseline. They also compared the articulatory features with segment-based inferred phones, and found that the inferred phones had the worst performance in the Xitsonga tasks. Baljekar’s team did not build a strictly unsupervised system, since they relied on information from a partially supervised system. Their results are nevertheless interesting in that they demonstrate how supervised features interact with unsupervised systems.

In the 2017 Zero Resource Speech Challenge, there were two groups of data sets: the development data and the surprise data. The development data consists of English, French and Mandarin corpora, with phones force-aligned using Kaldi (Povey et al., 2011; Wang, Zhang and Zhang, 2015). The surprise data consists of German and Wolof corpora (Gauthier et al., 2016), but this was not revealed to the participants (Dunbar et al., 2017). The corpus statistics are described in Table 2. A total of 6 papers with 16 systems addressed subword modeling, about three times as many systems as in the previous challenge. All the systems were evaluated using minimal-pair ABX tasks, with a focus on phone triplet minimal pairs that differ in the central sound. For example, given A = beg (α) and B = bag (β), a new token X = bag’ should be categorized as β. The scores for each system are shown in Table 3. In general, most of the submitted models performed better on the development data than on the surprise data. The sixteen systems can be grouped into four strategies.

Heck et al. applied bottom-up frame-level clustering, inspired by the success of Chen et al. […] as well as learned feature transformations (LDA, […] neutralize talker variance. The training labels […] sound is the same as that of its left and right neighbors. The results showed that both P1 and P2 were successful, since both were better than the baseline. Comparing P1 and P2, re-estimating the centroids only slightly improved the results.

Chen et al. applied DPGMM to cluster frames separately for each language. The labels are then trained on MFCCs (C1) and transformed using unsupervised linear VTLN (C2). The results on both the development data and the surprise data outperformed the baseline. In the German within-speaker task, the algorithm outperformed the topline as well. Ansari et al. trained all five languages on two sets of features. The first set is a high-dimensional hidden layer trained on MFCC frames. The second set is a hidden layer trained on labels gathered by a Gaussian mixture model over speech frames. The inputs to the deep neural network are labels trained on MFCCs (A1), Gaussian-mixture-HMM (A2), auto-encoder features (A3), and HMM posteriorgram features (A4). All four models obtained better results than the baseline. The MFCC and Gaussian-mixture-HMM models performed better than the other two models.

The third strategy is to improve spoken term discovery. Inspired by Thiolliere et al. (2015) and Renshaw et al. (2015), Yuan (2017) obtained bottle-neck features through an unsupervised word-pair generating model and applied a spoken term discovery (STD) system to discover acoustic features of word pairs, either on English only (Y1) or on all five languages (Y2). They also created a supervised comparison, using transcribed pairs from the Switchboard corpus as labels to train the STD system (YS). The results are better than the baseline, and the results of the two unsupervised models were very similar to those of the supervised one.

The last strategy is to use supervised training on non-target languages. Shibata et al. generated features from a neural network acoustic model trained on Japanese as part of an HMM (S1). In S2, they trained an end-to-end convolutional network and bidirectional LSTM on ten other languages (including English, Mandarin and German). The model with ten languages outperformed the Japanese one. However, since target languages are also included in the training data, this is not a strict zero-resource speech recognition task.

The clear winner of the 2015 and 2017 Zero Resource Speech Challenges is the DPGMM model, as demonstrated in Chen et al. (2015) and Heck et al. (2017). The most successful strategy for speech unit categorization is bottom-up clustering, which holds for both monolingual and multilingual settings. Of all the algorithms, bottom-up clustering also most closely resembles how young children build mental representations of speech units. The success of bottom-up clustering is inspiring for the field of child language acquisition. For decades, psycholinguists have struggled to model the process of speech categorization. Successful machine learning algorithms like bottom-up clustering could be useful as a basis for building a model of child speech categorization. Meanwhile, in both years of the Zero Resource Speech Challenge, some unsupervised algorithms outperformed supervised ones, which might indicate that the mechanism of child speech categorization, like these unsupervised algorithms, requires no innate knowledge or structure.

9.4 Word Category Acquisition

After successfully differentiating speech units in the sound stream, young children also have to make sense of the combinations of speech units. This requires them to learn the grammatical word categories of the language. Past studies have investigated various hypotheses about how children learn grammatical categories. One proposal is that children learn grammatical categories through statistical learning, by tracking statistical information such as the frequencies and co-occurrence of certain sounds. Children could rely on distributional cues or sentence context to determine the category of a certain word (Mintz, Newport and Bever, 2002). In addition, studies on artificial language learning indicate that infants as young as 12 months of age can use distributional cues to group words that have no semantic meaning into categories (Gerken, Wilson and Lewis, 2005; Gomez and Lakusta, 2004; Lany and Gomez, 2008). Recent studies have suggested that infants use prosody and intonation information to determine syntactic and phrase boundaries in their first year of life (Pannekamp, Weber, and Friederici, 2006). Psycholinguists have also constructed computational models to further unravel the mechanisms of statistical learning. Mintz (2003) focused on the frequent frames in child-directed speech (e.g., you __ it, the __ one) and found a strong pattern of frames that could enhance children’s category acquisition. St. Clair, Monaghan and Christiansen (2010) applied computational models and further expanded the kinds of frames investigated in child-directed speech. They combined fixed-frame bigrams and trigrams (e.g., aX, aXb) into flexible frames (e.g., aX + Xb), which make fuller use of the training data. The accuracy of the combined flexible frames is considerably higher than that of bigrams or trigrams alone, suggesting that a less rigid distributional form may provide more information to children learning language.
Although these studies have contributed to our knowledge of children’s word category acquisition, they focused on particular structures that cannot explain all the categories or the environments of all common words. Moreover, distributional models cannot address the problem of ambiguous categories. About 11% of word types in English are grammatically ambiguous (e.g., cook (n./v.)) (DeRose, 1998). It is important to build a model that is able to assign more than one category to a word in order to represent the categorical ambiguity found in the real world.
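To make the distributional cues discussed above concrete, the following minimal sketch collects the words that occur in fixed trigram frames (a X b) and in the two bigram contexts (a X and X b) that a flexible frame combines. It is an illustration under stated assumptions rather than a reimplementation of the cited models: the function names and the toy utterances are hypothetical, and the flexible frame is approximated here as the simple sum of the two bigram counts.

```python
from collections import Counter, defaultdict


def collect_contexts(utterances):
    """Count target words X in fixed frames (a, b) and in bigram contexts."""
    fixed = defaultdict(Counter)     # (a, b) -> words seen between a and b
    after_a = defaultdict(Counter)   # a -> words seen right after a (aX)
    before_b = defaultdict(Counter)  # b -> words seen right before b (Xb)
    for utterance in utterances:
        words = utterance.lower().split()
        for a, x, b in zip(words, words[1:], words[2:]):
            fixed[(a, b)][x] += 1
        for first, second in zip(words, words[1:]):
            after_a[first][second] += 1
            before_b[second][first] += 1
    return fixed, after_a, before_b


def flexible_frame(a, b, after_a, before_b):
    """Approximate the flexible frame aX + Xb by summing the bigram counts."""
    return after_a[a] + before_b[b]


# Hypothetical child-directed utterances.
utterances = ["you want it", "you see it", "you eat the cookie", "push it"]
fixed, after_a, before_b = collect_contexts(utterances)

print(dict(fixed[("you", "it")]))
print(dict(flexible_frame("you", "it", after_a, before_b)))
```

In this toy corpus the fixed frame you __ it only groups the two verbs that occur in exactly that frame, whereas the flexible frame also picks up verbs that occur after "you" or before "it" alone. This is the sense in which a less rigid frame covers more of the input; it says nothing by itself about categorization accuracy.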

9.5 Word Categorization as a Machine Learning Problem

Word categorization, or part-of-speech tagging (POS tagging), is one of the most developed fields in Natural Language Processing. POS tagging is the process of assigning word categories to the words in a given input. The training input to POS tagging tasks usually consists of sets of word-tag pairs. Most modern language processing on English uses the Penn Treebank tag set, which has 45 tags (Marcus et al., 1993). The two most commonly used algorithms for tagging are the Hidden Markov Model (HMM) and the Maximum Entropy Markov Model (MEMM). The HMM is a generative model based on the probabilities of n-gram combinations. The MEMM is a discriminative sequence model that adapts logistic regression to sequences. The accuracy of POS tagging depends heavily on the labeled training dataset, which makes the approach inefficient.
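As an illustration of how an HMM tagger resolves an ambiguous word such as "cook", the sketch below runs Viterbi decoding over a bigram HMM. It is a toy example under stated assumptions: the four tags, the probability tables and the floor value for unseen events are invented for illustration and are not estimated from the Penn Treebank or any other corpus.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence under a bigram HMM (log space)."""

    def lp(p):
        # Unseen events get a small floor probability instead of log(0).
        return math.log(p if p > 0 else floor)

    # Each column maps a tag to (best log-probability, backpointer).
    best = [{
        t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
        for t in tags
    }]
    for word in words[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score + lp(emit_p[t].get(word, 0)), prev)
        best.append(column)

    # Follow the backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Hypothetical hand-set parameters; "cook" can be a noun or a verb.
tags = ["DET", "PRON", "NOUN", "VERB"]
start_p = {"DET": 0.5, "PRON": 0.4, "NOUN": 0.05, "VERB": 0.05}
trans_p = {
    "DET": {"NOUN": 0.9, "VERB": 0.05, "DET": 0.025, "PRON": 0.025},
    "PRON": {"VERB": 0.9, "NOUN": 0.05, "DET": 0.025, "PRON": 0.025},
    "NOUN": {"VERB": 0.6, "NOUN": 0.2, "DET": 0.1, "PRON": 0.1},
    "VERB": {"DET": 0.5, "NOUN": 0.2, "PRON": 0.2, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 1.0},
    "PRON": {"they": 1.0},
    "NOUN": {"cook": 0.5, "cookie": 0.5},
    "VERB": {"cook": 0.6, "eat": 0.4},
}

print(viterbi(["the", "cook"], tags, start_p, trans_p, emit_p))
print(viterbi(["they", "cook"], tags, start_p, trans_p, emit_p))
```

With these invented parameters the same word form receives a different tag depending on the preceding word, which is the kind of context sensitivity that the transition probabilities of a generative tagger provide. In a real tagger the tables would be estimated from labeled word-tag pairs.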

Before discussing how to build algorithms for word categories, it is important to understand how grammatical categories are defined. Pike (1967) discussed this question from the etic and emic perspectives: whether the categories were created to fit words, or words were sorted into different categories. This is similar to the difference between supervised and unsupervised learning tasks. Supervised learning tasks, such as POS tagging, use labeled data to train the algorithm, which is similar to the “category before words” point of view. In unsupervised learning tasks, by contrast, categories emerge from the words, which is in line with the “words before category” point of view.

In addition to speech segmentation, young children are also able to abstract syntactic and semantic information from the speech stream. The mechanism of syntactic acquisition is still open to debate. From the perspective of formal language theory and generative grammar, there is a finite set of words, denoted by Σ. A language L is a subset of Σ*, the set of all finite strings over Σ. L can be defined as the set of grammatical and semantically well-formed sentences of a language. L is, in principle, an infinite set, as there are infinitely many sentences in any language. A grammar is a generative device that represents L in a finite way. In Chomsky’s terms, these grammars are I-languages, which are innate and universal to all speakers. He also assumed that there is a Language Acquisition Device (LAD), which takes linguistic data as input and outputs a grammar of some kind. The existence of the LAD has provoked decades of debate between empiricists and nativists. In computational learning theory, the LAD is no longer a mysterious black box; it is seen as an algorithm, which can be studied and interpreted using mathematical tools and computational modeling. To crack the internal structure of the LAD, reverse engineering is needed. In software engineering or machine learning, an algorithm is designed to perform a desired task. In the case of the LAD, an algorithm that already performs the desired task needs to be interpreted and modeled to reveal its internal structure.

In machine learning, the generative approach could also be insightful, in that “generative models can be tested in its precision and recall, two customary measures of performance in natural language engineering, which can address perennial questions of model relevance and scalability” (Kolodny, Lotem and Edelman, 2015). Such an algorithm aims to imitate the generative character of language acquisition so that it can parse and produce new material. Existing statistically based models still struggle to parse new material, let alone produce novel utterances.

In the Zero Resource Speech Challenges, word categorization is addressed as spoken term discovery. As described in Versteegh et al. (2016), spoken term discovery is the task of finding speech fragments that ideally correspond to the word-like units of the language. Unlike the speech categorization tasks, which are evaluated with dissimilarity scores, a spoken term discovery algorithm is evaluated in three steps. The first step examines pairwise fragment discovery in the audio stream. The Normalized Edit Distance (NED) between paired speech intervals and the coverage (COV) are used to assess the acoustic/phonological discrimination of the pairwise fragments. The second step clusters all the discovered pairs into classes, and the clusters are evaluated against the gold lexicon. Finally, all the classes are used as labels to “parse” new input. This step is evaluated by how many word tokens were correctly segmented (Token scores) and how many word boundaries were correctly placed (Boundary scores). The baseline is provided by a randomized matching algorithm (Jansen and Van Durme, 2011), which scores well on NED (good matching) but has poor coverage.
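The two pairwise measures named above can be sketched as follows. This is a simplified illustration and not the official evaluation toolkit: it assumes that each discovered fragment has already been mapped to a sequence of gold phone symbols, that NED divides the edit distance by the length of the longer sequence and averages over pairs, and that coverage is the share of corpus positions lying inside at least one fragment. The function names and toy data are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phone symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pairs):
    """Mean edit distance per pair, each divided by the longer sequence length."""
    ratios = [edit_distance(a, b) / max(len(a), len(b)) for a, b in pairs]
    return sum(ratios) / len(ratios)


def coverage(intervals, corpus_length):
    """Share of corpus positions covered by at least one (start, end) interval."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end))  # end is exclusive
    return len(covered) / corpus_length


# Hypothetical discovered pairs, transcribed as phone sequences.
pairs = [
    (["k", "ae", "t"], ["k", "ae", "t"]),
    (["k", "ae", "t"], ["h", "ae", "t"]),
]
print(normalized_edit_distance(pairs))
print(coverage([(0, 3), (2, 5)], corpus_length=10))
```

Because NED is a distance, lower values indicate better matching, while higher coverage indicates that more of the corpus was discovered. A system can therefore do well on one measure and poorly on the other, as the baseline does.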

In the 2015 Zero Resource Speech Challenge, two papers on algorithms for spoken term discovery were accepted for publication. The results of the algorithms in the two papers are summarized in Table 4. Rasanen et al. proposed using syllable segmentation for spoken term discovery. They compared three systems for segmenting the speech stream into syllable units: Vseg, EnvMin and Osc. Using syllables to determine spoken terms is a highly original approach, since it relies on prior knowledge about speech. As shown in Table 4, this approach is effective, since it consistently beat the baseline results. The Osc algorithm seems to be the most effective of the syllable-based systems. Lyzinski et al. focused on the second step of the spoken term discovery process, which is clustering the discovered pairs into classes. The study used the baseline algorithm for matching pairs of segments, and applied three algorithms to cluster the pairs found by the baseline: simple Connected Components (CC) and two modularity-based algorithms, FG and Louvain. Although the performance of the three algorithms was similar to, or sometimes worse than, the baseline, the results differed among the three algorithms. This investigation provides insight into how the clustering algorithm can affect the performance of a spoken term discovery system.
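The simplest of the clustering options mentioned above, connected components, can be sketched with a union-find structure: every discovered pair links two fragments, and each connected group of fragments becomes one class. This is a minimal sketch with hypothetical fragment identifiers; it does not reproduce the cited system, and the modularity-based alternatives (FG and Louvain) are not shown.

```python
def connected_components(pairs):
    """Group fragment identifiers into classes by following pairwise links."""
    parent = {}

    def find(x):
        # Locate the representative of x, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two classes

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    # Sort the classes so that the output order is deterministic.
    return sorted(clusters.values(), key=sorted)


# Hypothetical matched fragment pairs from a pairwise discovery step.
matched = [("f1", "f2"), ("f2", "f3"), ("f4", "f5")]
print(connected_components(matched))
```

One design consequence is visible even in this toy case: a single spurious match is enough to merge two otherwise separate classes, which is one reason why the choice of clustering algorithm can change the downstream scores.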

9.6 References

Ansari, T. K., Kumar, R., Singh, S., & Ganapathy, S. (2017, December). Deep learning meth- ods for unsupervised acoustic modeling—Leap submission to ZeroSpeech challenge 2017. In Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE (pp. 754-761). IEEE.
Badino, L., Mereta, A., & Rosasco, L. (2015). Discovering discrete subword units with bina- rized autoencoders and hidden-markov-model encoders. In Sixteenth Annual Conference of the International Speech Communication Association.
Baljekar, P., Sitaram, S., Muthukumar, P. K., & Black, A. W. (2015). Using articulatory fea- tures and inferred phonological segments in zero resource speech processing. In Sixteenth Annual Conference of the International Speech Communication Association.
Chen, H., Leung, C. C., Xie, L., Ma, B., & Li, H. (2015). Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study. In Sixteenth Annual Conference of the International Speech Communication Association.
Chen, H., Leung, C. C., Xie, L., Ma, B., & Li, H. (2017). Multilingual bottle-neck feature learning from untranscribed speech.
Chomsky, N. (1986). Knowledge of language: Its nature, origin, and use. Greenwood Pub- lishing Group.
Clair, M. C. S., Monaghan, P., & Christiansen, M. H. (2010). Learning grammatical cate- gories from distributional cues: Flexible frames for language acquisition. Cognition, 116(3), 341-360.
DeRose, S. J. (1998, September). XQuery: A unified syntax for linking and querying general XML documents. In QL.
de Vries, N., Davel, M., Badenhorst, J., Basson, W., de Wet, F., Barnard, E., et al. A smart- phone-based ASR data collection tool for under-resourced languages. Speech Communication 2014;56:119–131.
Ellis, N. C. (2017). Cognition, Corpora, and Computing: Triangulating Research in Usage- Based Language Learning. Language Learning, 67(S1), 40-65.
Gauthier, E., Besacier, L., Voisin, S., Melese, M., & Elingui, U. P. (2016, May). Collecting resources in sub-saharan african languages for automatic speech recognition: a case study of wolof. In 10th Language Resources and Evaluation Conference (LREC 2016).
Gerken, L., Wilson, R., & Lewis, W. (2005). Infants can use distributional cues to form syn- tactic categories. Journal of child language, 32(2), 249-268.
Glass, J. (2012, July). Towards unsupervised speech processing. In Information Science, Signal Processing and their Applications (ISSPA), 2012 11th International Conference on (pp. 1-4). IEEE.
Gómez, R. L., & Lakusta, L. (2004). A first step in form-based category abstraction by 12- month-old infants. Developmental science, 7(5), 567-580.
Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., … & Ng, A. Y. (2014). Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv 1412.5567. 21
Heck, M., Sakti, S., & Nakamura, S. (2017, December). Feature optimized dpgmm cluster- ing for unsupervised subword modeling: A contribution to zerospeech 2017. In Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE (pp. 740-746). IEEE.
Jansen, A., Dupoux, E., Goldwater, S., Johnson, M., Khudanpur, S., Church, K., … & JHU CLSP Mini-Workshop Research Team. (2013). A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition.
Jansen, A., & Van Durme, B. (2011, December). Efficient spoken term discovery using ran- domized algorithms. In Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on (pp. 401-406). IEEE.
Kamper, H., Jansen, A., King, S., & Goldwater, S. (2014, December). Unsupervised lexical clustering of speech segments using fixed-dimensional acoustic embeddings. In Spoken Lan- guage Technology Workshop (SLT), 2014 IEEE(pp. 100-105). IEEE.
Kilgarriff, A. (2005). Language is never, ever, ever, random. Corpus linguistics and linguistic theory, 1(2), 263-276.
Kuhl, P. K. (1985). In J. Mehler & R. Fox (Eds.), Neonate Cognition: Beyond the Blooming Buzzing Confusion (pp. 231–262). Hillsdale, NJ: Lawrence Erlbaum Associates.
Kolodny, O., Lotem, A., & Edelman, S. (2015). Learning a Generative Probabilistic Grammar of Experience: A Process-Level Model of Language Acquisition. Cognitive Science, 39(2), 227-267.
Lany, J., & Gómez, R. L. (2008). Twelve-month-old infants benefit from prior experience in statistical learning. Psychological Science, 19(12), 1247-1252.
Manenti, C., Pellegrini, T., & Pinquier, J. (2017, October). Unsupervised Speech Unit Discovery Using K-means and Neural Networks. In International Conference on Statistical Language and Speech Processing (pp. 169-180). Springer, Cham.
Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational linguistics, 19(2), 313-330.
Miao, Y., Gowayyed, M., & Metze, F. (2015, December). EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on (pp. 167-174). IEEE.
Minagawa-Kawai, Y., Cristià, A., & Dupoux, E. (2011). Cerebral lateralization and early speech acquisition: A developmental scenario. Developmental Cognitive Neuroscience, 1(3), 217-232.
Mintz, T. H., Newport, E. L., & Bever, T. G. (2002). The distributional structure of grammatical categories in speech to young children. Cognitive Science, 26(4), 393-424.
Mintz, T. H. (2003). Frequent frames as a cue for grammatical categories in child directed speech. Cognition, 90(1), 91-117.
Mitchell, T. M. (1997). Machine learning. WCB.
Moon, C., Lagercrantz, H., & Kuhl, P. K. (2013). Language experienced in utero affects vowel perception after birth: A two-country study. Acta Paediatrica, 102(2), 156-160.
Monaghan, P., & Rowland, C. F. (2017). Combining language corpora with experimental and computational approaches for language acquisition research. Language Learning, 67(S1), 14-39.
Räsänen, O., Doyle, G., & Frank, M. C. (2015). Unsupervised word discovery from speech using automatic segmentation into syllable-like units. In Sixteenth Annual Conference of the International Speech Communication Association.
Renshaw, D., Kamper, H., Jansen, A., & Goldwater, S. (2015). A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge. In Sixteenth Annual Conference of the International Speech Communication Association.
Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H., & Dupoux, E. (2013). Evaluating speech features with the minimal-pair ABX task (I): Analysis of the classical MFC/PLP pipeline. In Proceedings of Interspeech.
Schatz, T., Peddinti, V., Cao, X. N., Bach, F., Hermansky, H., & Dupoux, E. (2014). Evaluating speech features with the minimal-pair ABX task (II): Resistance to noise. In Proceedings of Interspeech.
Pike, K. L. (1967). Language in relation to a unified theory of the structure of human behavior (Vol. 24). Walter de Gruyter GmbH & Co KG.
Pitt, M. A., Johnson, K., Hume, E., Kiesling, S., & Raymond, W. (2005). The Buckeye corpus of conversational speech: labeling conventions and a test of transcriber reliability. Speech Communication, 45(1), 89-95.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., … & Silovsky, J. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding (No. EPFL-CONF-192584). IEEE Signal Processing Society.
Thiolliere, R., Dunbar, E., Synnaeve, G., Versteegh, M., & Dupoux, E. (2015). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. In Sixteenth Annual Conference of the International Speech Communication Association.
Versteegh, M., Anguera, X., Jansen, A., & Dupoux, E. (2016). The zero resource speech challenge 2015: Proposed approaches and results. Procedia Computer Science, 81, 67-72.
Wang, D., & Zhang, X. (2015). Thchs-30: A free chinese speech corpus. arXiv preprint arXiv:1512.01882.
Yuan, Y., Leung, C. C., Xie, L., Chen, H., Ma, B., & Li, H. (2017). Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation.