34 pages

Method for creating phone duration models using very large, multi-speaker, automatically annotated speech corpus ; Garsų trukmių modelių kūrimo metodas, naudojant didelės apimties daugelio kalbėtojų garsyną

vytautas_magnus_university - Giedrius



Published by: vytautas_magnus_university
Published on: 1 January 2011

VYTAUTAS MAGNUS UNIVERSITY VILNIUS UNIVERSITY INSTITUTE OF MATHEMATICS AND INFORMATICS Giedrius Norkevičius METHOD FOR CREATING PHONE DURATION MODELS USING VERY LARGE MULTI-SPEAKER AUTOMATICALLY ANNOTATED SPEECH CORPUS Summary of Doctoral Dissertation Physical Sciences, Informatics (09 P) Kaunas, 2010



The research was carried out in 2004–2010 at Vytautas Magnus University.

Scientific supervisor from October 1st, 2004 till October 24th, 2007:
Assoc. Prof. Dr. Minija Tamošiūnaitė (Vytautas Magnus University, Physical Sciences, Informatics – 09P)

Scientific supervisor:
Assoc. Prof. Dr. Gailius Raškinis (Vytautas Magnus University, Physical Sciences, Informatics – 09P)

The dissertation is being defended at the Joint Council of Scientific Field of Informatics of Vytautas Magnus University and Vilnius University Institute of Mathematics and Informatics:

Chairman:
Prof. Dr. Habil. Vytautas Kaminskas (Vytautas Magnus University, Physical Sciences, Informatics – 09P)

Members:
Prof. Dr. Habil. Antanas Žilinskas (Vilnius University Institute of Mathematics and Informatics, Physical Sciences, Informatics – 09P)
Assoc. Prof. Dr. Vytautas Rudžionis (Vilnius University, Technological Sciences, Informatics Engineering – 07T)
Prof. Dr. Habil. Henrikas Pranevičius (Kaunas University of Technology, Physical Sciences, Informatics – 09P)
Prof. Dr. Habil. Edmundas Kazimieras Zavadskas (Vilnius Gediminas Technical University, Technological Sciences, Informatics Engineering – 07T)

Official Opponents:
Prof. Dr. Habil. Laimutis Telksnys (Vilnius University Institute of Mathematics and Informatics, Physical Sciences, Informatics – 09P)
Assoc. Prof. Dr. Ričardas Krikštolaitis (Vytautas Magnus University, Physical Sciences, Mathematics – 01P)

The dissertation will be defended at the public session of the Council of Scientific Field of Informatics at the V. Čepinskis Science Library of Vytautas Magnus University at 1 p.m. on January 24th, 2011. Address: Vileikos str. 8–605, LT-44404, Kaunas, Lithuania.

The summary of the doctoral dissertation was sent out on December 23rd, 2010.

The dissertation is available for review at the National M. Mažvydas Library and at the libraries of Vytautas Magnus University and Vilnius University Institute of Mathematics and Informatics.



VYTAUTO DIDIOJO UNIVERSITETAS VILNIAUS UNIVERSITETO MATEMATIKOS IR INFORMATIKOS INSTITUTAS Giedrius Norkevičius GARSTRUKMIMODELIKRIMO METODAS, NAUDOJANT DIDELS APIMTIES DAUGELIO KALBTOJGARSYNDaktaro disertacijos santrauka Fiziniai mokslai, informatika (09 P) Kaunas, 2010



Disertacija rengta 2004  2010 metais Vytauto Didiojo Universitete. Nuo 20041001 iki 20071024 mokslinvadov: doc. dr. Minija Tamoinait Didiojo universitetas, fiziniai (Vytauto mokslai, informatika  09 P) Mokslinis vadovas: dr. Gailius Rakinis (Vytauto Didiojo universitetas, fiziniai mokslai, informatika  09P) Disertacija ginama jungtinje Vytauto Didiojo universiteto ir Vilniaus universiteto Matematikos ir Informatikos instituto informatikos mokslo krypties taryboje:Pirmininkas: prof. habil. dr. Vytautas Kaminskas (Vytauto Didiojo universitetas, fiziniai mokslai, informatika 09P)  Nariai: prof. habil. dr. Antanas ilinskas (Vilniaus universiteto Matematikos ir informatikos institutas, fiziniai mokslai, informatika  09P) doc. dr. Vytautas Rudionis (Vilniaus universitetas, technologijos mokslai informatikos ininerija  07T) prof. habil. dr. Henrikas Pranevičius (Kauno Technologijos universitetas, fiziniai mokslai, informatika  09P) prof. habil. dr. Edmundas Kazimieras Zavadskas (Vilniaus Gedimino technikos universitetas, technologijos mokslai, informatikos ininerija  07T) Oponentai: prof. habil. dr. Laimutis Telksys (Vilniaus universiteto Matematikos ir informatikos institutas, fiziniai mokslai, informatika  09P) doc. dr. Ričardas Kriktolaitis (Vytauto Didiojo universitetas, fiziniai mokslai, matematika  01P)

Disertacija bus ginama vieame informatikos mokslo krypties tarybos posdyje 2011 m. sausio mn. 24 d. 13 val. Vytauto Didiojo universiteto V. Čepinskio tikslijmokslskaitykloje. Adresas: Vileikos g. 8  605, LT  44404, Kaunas, Lietuva. Disertacijos santrauka isiuntinta 2010 m. gruodio mn. 23 d. Disertaciją galima perirti nacionalinje M. Mavydo, Vytauto Didiojo universiteto bei Vilniaus Universiteto Matematikos ir informatikos instituto bibliotekose.



Chapter One - Introduction

Language technology research on Lithuanian started 20-30 years ago and has become more intense in recent years. Although experimental applications for word recognition and speech recognition have already been created, the overall quality of these systems is insufficient for practical purposes. Speech synthesis of Lithuanian has been developed for over 10 years, with grapheme-to-phoneme conversion and speech signal generation as the main research fields and with less attention paid to prosody analysis and modeling. As a consequence, all existing speech synthesis systems for Lithuanian still lack naturalness. It is generally accepted that, next to fundamental frequency, timing plays a crucial role in the naturalness of synthesized speech.

Two main aspects are taken into consideration in this research:
1. Construction of a language-independent method for creating phone duration models using a very large, multi-speaker, automatically annotated speech corpus.
2. Building a model capable of predicting phone durations of Lithuanian.

In addition to its main application, speech synthesis systems, this research could also support other language technology tasks such as speech recognition.

Goal and tasks of the work. The main goal of this work is to propose methods that make a very large, multi-speaker, automatically annotated continuous speech corpus usable in the phone duration modeling task. The following subtasks were set in order to accomplish the goal:
- Perform the analysis of the existing methods, features and evaluation criteria used for phone duration modeling.
- Choose and implement a modeling method; construct a feature set used for phone duration prediction.
- Propose a method for identifying and eliminating noisy data samples originating from automatic alignment inaccuracies (data noise reduction).
- Perform the analysis of using a multi-speaker¹ corpus and how it affects the objective model evaluation criteria.
- Propose methods to normalize speaker-specific phone durations (normalization of speaker-specific phone durations, henceforth referred to as NSSPD), thus improving the objective evaluation criteria.
- Subjectively evaluate the created phone duration models of Lithuanian by integrating these models into a speech synthesis system and performing listening tests.

¹ Speakers may have different pronunciation manners (only timing is taken into consideration).




Objects of this research are phone durations and phone duration models of Lithuanian. The process of creating phone duration models based on a very large, multi-speaker, automatically annotated corpus of continuous read speech is analyzed in this dissertation.

Scientific novelty. The main scientific novelty of this work is:
- The process of creating phone duration models for Lithuanian using a machine learning method was investigated and described for the first time.
- For the first time, the creation of phone duration models was performed using a very large, multi-speaker, automatically annotated continuous read speech corpus.
- Methods for normalization of speaker-specific phone durations were proposed and implemented. The application of the proposed methods leads to improved objective evaluation criteria of the models.
- Methods for identifying and eliminating noisy data samples were proposed and implemented.

Practical significance. Several practical results of the work can be noted:
- For the first time, phone duration models were built and evaluated for Lithuanian. The achieved results can be used for comparison in further phone duration research on Lithuanian.
- Phone duration dependencies on contextual factors were estimated and written in explicit form. This result can serve as source material for more general linguistic research on Lithuanian.
- The proposed methods of data noise reduction and normalization of speaker-specific phone durations enable the use of a multi-speaker, automatically annotated speech corpus in the phone duration modeling task. These methods can be used for other languages.
- The created phone duration models can be integrated into text-to-speech synthesis of Lithuanian, thus improving the naturalness of the synthesized speech.
- The created phone duration models can be integrated into automatic speech recognition systems for Lithuanian, possibly increasing the accuracy of such systems.
Publications of the results. The results of the research are published in 8 publications. One publication is published in the ISI indexed journal Informatica. One publication is published in the C.E.E.O.L and MLA indexed journal Kalbų studijos (Studies about Languages). Two are printed in the proceedings of the international conferences SPECOM 2005 and Human Language Technologies 2007. The others are published in the proceedings of local conferences.

Publication in the ISI indexed journal:
1. G. Norkevičius and G. Raškinis. Modeling Phone Duration of Lithuanian by Classification and Regression Trees, using Very Large Speech Corpus. Informatica, 19(2), p. 271-284, 2008.

Publication in the C.E.E.O.L and MLA indexed journal:
1. G. Norkevičius, G. Raškinis, A. Kazlauskienė. Bendrinės lietuvių kalbos daiktavardžių ir būdvardžių kirčiavimo struktūrinis modelis, algoritmas ir realizacija. Kalbų studijos, Nr. 6, p. 72-76 (ISSN 1648-2824), 2004.

Publications in the proceedings of the international conferences:
1. G. Norkevičius, G. Raškinis, A. Kazlauskienė. Knowledge-based grapheme-to-phoneme conversion of Lithuanian words. SPECOM 2005, 10th International Conference SPEECH and COMPUTER, 17-19 October [Patras, Greece], p. 235-238. ISBN 5-7452-0110-x, 2005.
2. G. Norkevičius and G. Raškinis. Inter-speaker Speech Rate Normalization for Phone Duration Modeling of Lithuanian. In Proceedings of the 3rd Baltic Conference on Human Language Technologies. Kaunas: Vytauto Didžiojo universitetas, p. 219-225. ISBN 978-9955-704-53-9, 2008.

Other publications:
1. G. Norkevičius, G. Raškinis. Garsų trukmės modeliavimas klasifikavimo ir regresijos medžiais, naudojant didelės apimties garsyną. Informacinės technologijos 2007, Kaunas: Technologija, p. 52-66, 2007.
2. G. Norkevičius, A. Kazlauskienė and G. Raškinis. Garsų trukmės modeliavimas naudojant klasifikavimo ir regresijos medžius. Informacinės technologijos 2006, Kaunas: Technologija, p. 82-85, 2006.
3. A. Kazlauskienė, G. Norkevičius, G. Raškinis. Automatizuotas lietuvių kalbos veiksmažodžių kirčiavimas: problemos ir jų sprendimo būdai. Baltų ir kitų kalbų fonetikos ir akcentologijos problemos, p. 166-173 (ISBN 9955-516-86-0), 2004. (Publications for the press are selected by the Science Council of Lithuania, adapted to the requirements for prestigious publications.)
4. G. Norkevičius. Lietuvių kalbos žodžio priešdėlių analizė. Konferencijos "Informacinė visuomenė ir universitetinės studijos" pranešimų medžiaga, p. 129-132, 2004.

Structure and size of the work. The dissertation is written in Lithuanian. It consists of 120 pages. The dissertation has 5 chapters including Introduction and Conclusions, a list of 88 references and 4 appendixes. The main text of the dissertation covers 92 pages, including 14 tables and 32 figures.

Chapter Two - Review of segment duration modeling methods and researches

This chapter gives a short formulation of the segment duration modeling task, shows the place of segment duration models in text-to-speech synthesis systems, and reviews existing segment duration research.

In general, a segment (phone, syllable or any other speech unit) duration model is defined as a prediction system which takes a feature set $f_1, f_2, \dots, f_N$, used to characterize the segment and its corresponding context, as input and gives the segment duration as output. More generally, a segment duration model maps the feature space $\mathbf{F} = f_1 \times f_2 \times \dots \times f_N$ into the set of real numbers:

$$D_{pred}: \mathbf{F} \rightarrow \mathbb{R} \tag{1}$$

As already noted, the main application of a segment duration model is text-to-speech synthesis. Text-to-speech synthesis can be defined as a system which automatically generates speech from text using letter-to-phoneme (or grapheme-to-phoneme) conversion (Dutoit, 1997). In order to show the place of segment duration modeling in a text-to-speech synthesis system, it is sensible to divide such a system into these consecutive steps:
- Natural language processing
- Prosody generation
  o Segment duration modeling
  o Fundamental frequency modeling
  o Segment intensity modeling
- Speech signal formation/processing

Existing segment duration modeling studies differ in many aspects: the segments being modeled, the features used to characterize segments and their corresponding context, the corpus used for modeling, the modeling methods and the model evaluation criteria.

Barbosa and Campbell state that higher-order speech segments, such as the syllable or inter perceptual center groups² (IPCG), are more important (Campbell 1991, 1992; Barbosa et al. 1994) and that, under the elasticity hypothesis, the durations of these segments do not merely affect but rather determine the durations of the underlying speech segments (phones). Van Santen agrees that a factor such as syllable structure undoubtedly affects phone duration, but also argues that, despite the many issues of choosing between modeling syllable duration and modeling IPCG duration, there is no evidence that modeling higher-order segments in general leads to better overall performance (Richard Sproat, 1998). He therefore, like the majority of other scientists, uses the phone as the modeling segment.

² Inter perceptual center groups are intervals between vowel starts.

There are many factors and factor interactions that affect phone duration, and this is the main reason why phone duration modeling is not an easy task. The construction of the feature set depends on several factors:
1. The expected application of the model being created. If the expected application is text-to-speech synthesis, then only text is available as input, so only features that can be extracted or calculated from text are appropriate. Otherwise, if the expected application is speech recognition or more general language technology research, then acoustic properties of the speech signal can also be used in phone duration modeling.
2. The availability of the results of exploratory statistics in the phonetics literature of the target language.
3. The availability of an annotated speech corpus.

Because the main application of the phone duration model in this research is speech synthesis systems, only features that can be extracted from text were taken into consideration. Although there is no single clear definition of which factors and factor interactions affect phone duration, it is obvious that, along with the phone's phonetic and prosodic features, contextual factors also play a crucial role.
Features used by other researchers for modeling and predicting phone duration can be grouped into levels according to the scope of the context:
- phone level (target phone identity, identities of preceding and following phones etc.),
- syllable level (syllable structure, number of phones in syllable, position of target phone in syllable etc.),
- word level (position of target phone in word, word length etc.),
- phrase/sentence level (phrase/sentence length, position of target phone or word in phrase/sentence, phrase/sentence type etc.).

As for the availability of exploratory statistics in the phonetics literature of Lithuanian, the situation is completely different for vowels and consonants. There is a fair amount of literature about vowel duration and the factors that affect it (Vaitkevičiūtė, 1960; Anusienė, 1983; Svecevičius, 1964; Dambrauskaitė-Urbelienė, 1967; Pakerys et al., 1970; Kazlauskienė, 1998; Kazlauskienė, 2002; Girdenis, 1974; Girdenis et al., 1982), but very little (Kazlauskienė, 2006) about consonant duration. Based on this kind of information, a researcher usually makes assumptions about factors, factor interactions and the particular way in which phone duration is affected. The more assumptions are made, the smaller the feature space and the fewer model parameters there are to estimate.
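The grouping of features by context level can be illustrated with a small sketch. This is not the feature set used in the dissertation: the class `PhoneFeatures`, the function `extract_features`, the boundary symbol `#` and the toy phone symbols are all assumptions made for illustration. The sketch only shows how text-derived features at the phone, syllable, word and phrase levels could be collected for each phone of an already syllabified phrase.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhoneFeatures:
    # Phone level
    phone: str
    prev_phone: str
    next_phone: str
    # Syllable level
    phones_in_syllable: int
    position_in_syllable: int
    # Word level
    position_in_word: int
    word_length: int
    # Phrase level
    word_position_in_phrase: int
    phrase_length: int


def extract_features(phrase: List[List[List[str]]]) -> List[PhoneFeatures]:
    """Build one feature record per phone.

    `phrase` is a list of words, each word a list of syllables,
    each syllable a list of phone symbols (hypothetical input format).
    """
    flat = [p for word in phrase for syl in word for p in syl]
    feats = []
    k = 0  # index of the current phone in the flattened phrase
    for w_idx, word in enumerate(phrase):
        word_len = sum(len(syl) for syl in word)
        pos_in_word = 0
        for syl in word:
            for s_idx, phone in enumerate(syl):
                feats.append(PhoneFeatures(
                    phone=phone,
                    prev_phone=flat[k - 1] if k > 0 else "#",
                    next_phone=flat[k + 1] if k + 1 < len(flat) else "#",
                    phones_in_syllable=len(syl),
                    position_in_syllable=s_idx,
                    position_in_word=pos_in_word,
                    word_length=word_len,
                    word_position_in_phrase=w_idx,
                    phrase_length=len(phrase),
                ))
                k += 1
                pos_in_word += 1
    return feats


# Toy two-word phrase with illustrative phone symbols and syllable breaks.
toy_phrase = [[["l", "a"], ["b", "a", "s"]], [["r", "i"], ["t", "a", "s"]]]
for f in extract_features(toy_phrase)[:3]:
    print(f)
```

Every field here can be computed from text alone, which matches the restriction to text-derived features stated above; acoustic features would need a different input.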




Conversely, the fewer assumptions are made, the bigger the feature space and the more model parameters there are to estimate. It is well known that in practice the growth of the feature space and of the number of model parameters causes data sparsity, that is, a demand for larger amounts of data. This problem is widely analyzed by Van Santen (1993). He observes that the number of very rare feature vectors is so large that even in small text samples one is assured of encountering at least one of them, and he calls this problem lopsided sparsity. Nevertheless, most researchers use a relatively small, single-speaker speech corpus that is manually annotated or at least validated by experts. As some researchers note (Cordoba et al., 2002; Pfitzinger, 2002; Carlson, 1991), using multi-speaker speech corpora is inappropriate because different speakers have different pronunciation manners and speak at different speech rates. Van Santen (1993) claims that automatically annotated corpora lack accuracy and thus are unusable in phone duration modeling tasks. It is worth mentioning that there is recent research on German phone duration modeling (Moers et al., 2010) which was carried out using an automatically annotated corpus.
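The arithmetic behind lopsided sparsity can be sketched as follows. This is not Van Santen's analysis: the function names, the toy corpus and the independence assumption (tokens of a new sample are drawn independently from the training distribution) are all assumptions made for illustration. The point is only that even when each rare feature vector is individually unlikely, their combined probability mass makes meeting at least one of them in a sample very likely.

```python
from collections import Counter
from typing import Hashable, Iterable


def rare_token_mass(vectors: Iterable[Hashable], max_count: int = 1) -> float:
    """Fraction of tokens whose feature vector occurs at most `max_count` times."""
    counts = Counter(vectors)
    total = sum(counts.values())
    rare = sum(c for c in counts.values() if c <= max_count)
    return rare / total


def prob_at_least_one_rare(rare_mass: float, sample_size: int) -> float:
    """Chance that a sample of `sample_size` tokens contains a rare vector,
    assuming independent draws (a simplifying assumption)."""
    return 1.0 - (1.0 - rare_mass) ** sample_size


# Toy corpus of (phone, stress) feature vectors; the values are made up.
corpus = (
    [("a", "stressed")] * 6
    + [("a", "unstressed")] * 2
    + [("u", "stressed"), ("e", "unstressed")]
)
mass = rare_token_mass(corpus)
print(mass, prob_at_least_one_rare(mass, 20))
```

In this toy corpus two of the ten tokens belong to vectors seen only once, so a 20-token sample already contains a rare vector with a probability of roughly 0.99.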
Summaries of this and many other studies are presented in Table 1:

Research carried out by Language    No. speakers  No. samples  Model  No. features  RMSE (ms)  CORR   Alignment    Remarks
----------------------------------------------------------------------------------------------------------------------------
Klabbers E, 2000        Dutch       1             16775        SOP    13            27         -      Manual       -
Van Santen, J., 1993    English     1             41588        SOP    8             -          0.9    Manual       42 models
Goubanova, O, 2000      English     1             10000        SOP    8             9          0.9    Unspecified  Vowel
Möbius B, 1996          German      1             23490        SOP    14            -          0.896  Manual       -
Cung H., 2002           Korean      1             19071        SOP    8             32.13      0.68   Manual       Vowel
Cung H., 2002           Korean      1             23032        SOP    8             28.86      0.54   Manual       Consonant
Venditti, 1998          Japanese    1             1039         SOP    8             -          0.89   Manual       Vowel
Brinckmann C, 2003      German      1             23133        CART   19            22.46      0.86   Manual       -
Krishna, N.S., 2004     Telugu      1             6846         CART   12            22.86      0.801  Manual       -
Batůšek, R., 2002       Czech       1             5081         CART   11            20.3       0.79   Manual       -
Demenko G, 2007         Polish      1             4 hours      CART   57            15.7       0.79   Manual       -
Cung H., 2002           Korean      1             19071        CART   15            27.51      0.78   Manual       Vowel
Krishna, N.S., 2004     Hindi       1             5104         CART   12            27.14      0.752  Manual       -
Cung H., 2002           Korean      1             23032        CART   15            24.2       0.71   Manual       Consonant
Goubanova, O, 2000      English     1             10000        CART   8             20         0.7    Unspecified  Vowel
Moers D., 2010          German      1             18487        CART   7             39.66      0.8    Automatic    -
Goubanova, O, 2000      English     1             10000        BTT    8             5          0.94   Unspecified  Vowel
Córdoba R, 2002         Spanish     1             15141        DNT    23            19.3       -      Manual       -
Teixeira J., 2008       Portuguese  1             18000        DNT    17            -          0.839  Manual       -

Table 1: Summary of phone duration modeling researches

As can be seen in Table 1, various researchers use different modeling methods. In general, the modeling methods used for phone duration prediction can be grouped into the following categories.

Modeling by expert rules where the parameters of the rules are estimated manually (rule-based methods). The dominant rule-based duration modeling approach was proposed by D. Klatt (1987). The essence of this approach lies in modifying some inherent duration (INHDUR) of a phone by successively applying a number of rules. Each rule represents an increase or decrease of phone duration in percent (PRCNT), depending on contextual factors. The duration cannot be compressed below some predefined minimal duration (MINDUR). This type of duration model was quite successfully applied to Swedish, English and French. However, rule-based models often tend to overgeneralize and cannot handle exceptions well without becoming exceedingly complicated. It has to be noted that, with increasing computational power, this method is becoming less attractive. The model is summarized by the formula:

$$DUR = MINDUR + \frac{(INHDUR - MINDUR) \cdot PRCNT}{100} \tag{2}$$

Modeling by expert rules where the parameters of the rules are estimated by statistical methods (semi-automatic methods). The most popular approach of this category for modeling phone duration is the so-called sums-of-products (SOP) model developed by Van Santen (Richard Sproat, 1998). SOP is summarized by the formula:

$$D_{pred}(\vec{f}) = \sum_{i \in K} \prod_{j \in I_i} S_{i,j}(f_j) \tag{3}$$

Here $\vec{f}$ is a feature vector describing a phone, $K$ is the set of indexes each corresponding to a product term, and $I_i$ is the set of feature indexes occurring in the i-th product term. The parameters $S_{i,j}$ are called factor scales. The capability of capturing an important phenomenon of phone duration, directional invariance, is identified as the main advantage of SOP. Directional invariance refers to the property that, holding everything else constant, the effects of a factor always have the same direction:

$$\text{IF } D_{pred}(f_1, f_2, \dots, f_N) > D_{pred}(f_1', f_2, \dots, f_N) \tag{4}$$

$$\text{THEN } D_{pred}(f_1, f_2', \dots, f_N') > D_{pred}(f_1', f_2', \dots, f_N') \tag{5}$$

To illustrate, in Lithuanian, if we find that [a] stressed with rising accent is longer than [u] stressed with rising accent, then the same will hold if we compare these vowels stressed with falling accent.
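Formulas (2) and (3) can be made concrete with a minimal sketch. Everything numeric below is invented for illustration, and the function names and data layout are assumptions rather than the dissertation's implementation. For the rule-based model, the sketch assumes that successive rules are combined by multiplying their percentages into a single PRCNT value before formula (2) is applied. For SOP, each product term is represented as a mapping from a feature name to a table of factor scales.

```python
from math import prod
from typing import Dict, List, Sequence


def klatt_duration(inhdur: float, mindur: float, rule_percents: Sequence[float]) -> float:
    """Formula (2): DUR = MINDUR + (INHDUR - MINDUR) * PRCNT / 100.

    Assumption: rules are applied successively by multiplying percentages,
    starting from PRCNT = 100 (no change).
    """
    prcnt = 100.0
    for rule in rule_percents:
        prcnt = prcnt * rule / 100.0
    return mindur + (inhdur - mindur) * prcnt / 100.0


# One product term: feature name -> {feature value -> factor scale S_ij}.
ProductTerm = Dict[str, Dict[str, float]]


def sop_duration(features: Dict[str, str], terms: List[ProductTerm]) -> float:
    """Formula (3): sum over product terms of the product of factor scales."""
    return sum(
        prod(table[features[name]] for name, table in term.items())
        for term in terms
    )


# Made-up rule set: shorten to 60 %, then lengthen to 140 %.
print(klatt_duration(inhdur=230.0, mindur=80.0, rule_percents=[60.0, 140.0]))

# Made-up SOP model with two product terms.
terms = [
    {"phone": {"a": 100.0, "u": 80.0}, "accent": {"rising": 1.5, "falling": 1.0}},
    {"accent": {"rising": 10.0, "falling": 5.0}},
]
for accent in ("rising", "falling"):
    d_a = sop_duration({"phone": "a", "accent": accent}, terms)
    d_u = sop_duration({"phone": "u", "accent": accent}, terms)
    print(accent, d_a, d_u, d_a > d_u)
```

With positive factor scales, as in this toy model, [a] comes out longer than [u] under both accents, which is the directional invariance described by (4) and (5).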
However, not all factor interactions are directionally invariant, so modeling duration by SOP is usually performed in three steps:
1. Building a category tree. The purpose of this step is to divide the feature space into categories in which phones are affected by similar factors and/or factor interactions and in which the property of directional invariance is satisfied. This step is usually performed with the help of linguistic experts.


