
Automatic Evaluation of Tracheoesophageal
Substitute Voices
Submitted to the Technische Fakultät of the
Universität Erlangen–Nürnberg
for the degree of
DOKTOR–INGENIEUR
by
Tino Haderlein

Erlangen — 2007

Approved as a dissertation by the
Technische Fakultät of the
Universität Erlangen–Nürnberg

Date of submission: 27.06.2007
Date of the doctoral examination: 31.10.2007
Dean: Prof. Dr.-Ing. habil. J. Huber
Reviewers: Prof. em. Dr.-Ing. H. Niemann
Prof. Dr. med. Dr. rer. nat. U. Eysholdt

Abstract
In 20 to 40 percent of all cases of laryngeal cancer, total laryngectomy has to be performed,
i.e. the removal of the entire larynx. For the patient, this means the loss of the natural voice
and thus the loss of the main means of communication. A popular method of voice restoration
involves a shunt valve (“voice prosthesis”) between the trachea and the pharyngoesophageal
segment, which establishes the tracheoesophageal (TE) substitute voice. From time to time, the
substitute voice has to be evaluated by the therapist in order to report on therapy progress. This
evaluation is subjective and therefore depends on the particular expert’s experience and similar
factors. Within the scope of this thesis, it was examined how automatic methods can be used to
provide an objective means of evaluating substitute voices.
There are some established objective measures, but they are restricted to the evaluation of
sustained vowels. This thesis takes the step from the automatic analysis of vowel recordings to
the analysis of text recordings. For judging speech quality objectively in a real communication
situation, the analysis of entire words and sentences is necessary, because the intelligibility of a
substitute voice in a dialogue is a substantial evaluation criterion. Automatic word recognition
methods were applied to a standard text that was read out by the test persons. Information about
the intelligibility of the individual speakers was gained by comparing their word recognition
rates with reference evaluation data from human experts. The use of a prosody module made it
possible not only to extract acoustic information about the speaker’s voice but also to measure
individual speaking characteristics.
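The word recognition rate used here can be made concrete as the word accuracy obtained from an edit-distance alignment of the recognized word sequence against the read reference text. The following minimal Python sketch uses purely hypothetical counts; neither the numbers nor the function name are taken from the thesis.

# Minimal sketch: word accuracy from the counts of an edit-distance alignment
# between recognizer output and reference text, i.e. WA = (N - S - D - I) / N.
def word_accuracy(n_ref_words, substitutions, deletions, insertions):
    """Word accuracy in percent; it can become negative for many insertions."""
    errors = substitutions + deletions + insertions
    return 100.0 * (n_ref_words - errors) / n_ref_words

# Hypothetical alignment counts for one tracheoesophageal speaker
print(word_accuracy(n_ref_words=108, substitutions=31, deletions=12, insertions=6))
# prints 54.62962962962963, i.e. a word accuracy of about 54.6 percent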
The inter-rater variability among humans was compared to the automatic analysis results, and
the main finding was that the correlation between human and automatic ratings was as good as
the agreement among the human rater group.
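How “as good as the agreement among the human rater group” can be quantified is illustrated by the following sketch, which uses the standard correlation coefficients discussed later (Chapter 3). All word accuracies and rater scores below are invented for illustration, and the rating scale (1 = very good, 5 = very poor) is an assumption.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: word accuracies (percent) for six speakers and
# intelligibility scores from three human raters (1 = very good, 5 = very poor).
word_acc = np.array([72.0, 35.5, 54.6, 61.2, 28.7, 80.3])
raters = np.array([
    [1.5, 4.0, 3.0, 2.5, 4.5, 1.0],   # rater A
    [2.0, 3.5, 3.0, 2.0, 5.0, 1.5],   # rater B
    [1.5, 4.5, 2.5, 3.0, 4.0, 1.0],   # rater C
])
mean_human = raters.mean(axis=0)

# Human-machine agreement: correlation between word accuracy and mean rating.
# The coefficient is negative here because a lower score means a better voice.
r_machine, _ = pearsonr(word_acc, mean_human)

# Inter-rater agreement: mean pairwise correlation among the human raters.
pairs = [(0, 1), (0, 2), (1, 2)]
r_human = np.mean([pearsonr(raters[i], raters[j])[0] for i, j in pairs])

print(f"human-machine r = {r_machine:.2f}, mean inter-rater r = {r_human:.2f}")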
The automatic recognition on distant-talking recordings could be slightly improved by the use
of μ-law features, which are modified Mel-Frequency Cepstrum Coefficients (MFCCs). Artificially
reverberated training data for the recognizer is another possibility to achieve better recognition
rates, even when the reverberation in the test data does not match the acoustic properties of the
training data. This is a step towards therapy sessions in which the patients will no longer be
required to wear a headset.
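The abstract only names these two techniques. The following minimal sketch illustrates them under two assumptions that the abstract does not confirm: that the μ-law features replace the logarithmic compression of the mel filterbank outputs before the cepstral transform, and that artificial reverberation means convolving clean training speech with a measured room impulse response. Function names and the value of μ are illustrative, not taken from the thesis.

# Hypothetical sketch, not the thesis implementation, of the two ideas above:
# (1) mu-law compression of mel filterbank energies instead of the logarithm
#     used in standard MFCC computation,
# (2) artificial reverberation of clean training speech by convolution with a
#     measured room impulse response (RIR).
import numpy as np
from scipy.fftpack import dct
from scipy.signal import fftconvolve

def mu_law(x, mu=255.0):
    """mu-law companding; x is assumed to be normalized to the range [0, 1]."""
    return np.log1p(mu * x) / np.log1p(mu)

def cepstral_features(mel_energies, n_ceps=12, compression=np.log):
    """DCT of compressed mel filterbank energies: compression=np.log yields
    standard MFCCs, compression=mu_law yields the modified features sketched here."""
    compressed = compression(np.maximum(mel_energies, 1e-10))
    return dct(compressed, type=2, norm='ortho', axis=-1)[..., :n_ceps]

def reverberate(clean_signal, room_impulse_response):
    """Simulate a distant-talking recording of close-talking training data."""
    wet = fftconvolve(clean_signal, room_impulse_response)
    return wet[: len(clean_signal)]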
Contents

List of Figures vii
List of Tables ix
1 Introduction 1
1.1 The Need for Objective Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Towards Screening in Natural Settings . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Contributions Made in this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Tracheoesophageal Substitute Voices 7
2.1 Laryngectomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 History of Substitute Voices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Different Kinds of Voice Rehabilitation . . . . . . . . . . . . . . . . . . 8
2.2.2 The Esophageal Substitute Voice . . . . . . . . . . . . . . . . . . . . . . 9
2.2.3 Electrical Sound Generators . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.4 Surgical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.5 The Tracheoesophageal (TE) Substitute Voice . . . . . . . . . . . . . . . 10
2.2.6 Stoma Filters and Stoma Valves . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Properties of Substitute Voices . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Dynamics of the PE Segment . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Aerodynamic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.3 Acoustic and Prosodic Properties . . . . . . . . . . . . . . . . . . . . . 16
2.4 Subjective Evaluation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Subjective Evaluation Criteria . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.2 The GRBAS and RBH Scale . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.3 Self-Evaluation Scales (VHI, V-RQOL, SF-36) . . . . . . . . . . . . . . 18
2.4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Objective Evaluation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.1 A Model for Alaryngeal Voices . . . . . . . . . . . . . . . . . . . . . . 20
2.5.2 Objective Measures and Analysis . . . . . . . . . . . . . . . . . . . . . 22
2.5.3 The Dysphonia Severity Index (DSI) . . . . . . . . . . . . . . . . . . . . 29
2.5.4 The Hoarseness Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 Agreement Measures 33
3.1 Correlation Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.1 Pearson’s Product-Moment Correlation Coefficient r . . . . . . . . . 33
3.1.2 Spearman’s Rank-Order Correlation Coefficient ρ . . . . . . . . . . 34
3.2 Cohen’s κ and its Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1 Chance vs. Competence . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.2 A Model for Agreement Measuring . . . . . . . . . . . . . . . . . . . . 35
3.2.3 Weighted κ Measures . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.4 Multi-Rater Agreement with κ Measures . . . . . . . . . . . . . . . 37
3.2.5 Restrictions of the κ Measure . . . . . . . . . . . . . . . . . . . . . 38
3.3 Krippendorff’s α . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.2 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 Speech Corpora 43
4.1 The EMBASSI Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.1 Influence of Reverberation on Human Perception . . . . . . . . . . . . . 43
4.1.2 EMBASSI Corpus Overview . . . . . . . . . . . . . . . . . . . . . . . 44
4.1.3 Training Data for the EMBASSI Baseline Recognizer EMB-base . . . . 45
4.1.4 Training with Distant-Talking EMBASSI Data . . . . . . . . . . . . . . 47
4.1.5 Artificial Reverberation of Speech Data . . . . . . . . . . . . . . . . . . 48
4.1.6 Selecting Room Impulse Responses . . . . . . . . . . . . . . . . . . . . 48
4.1.7 Artificially Reverberated Training Data in EMBASSI Recognizers . . . 49
4.2 The Fatigue Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 The VERBMOBIL Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.1 Training Data for the VERBMOBIL-based Recognizers . . . . . . . . . . 54
4.3.2 Test Sets for the VERBMOBIL-based Recognizers . . . . . . . . . . . . . 54
4.4 Recordings of Laryngectomized Speakers . . . . . . . . . . . . . . . . . . . . . 55
4.4.1 The Text “The North Wind and the Sun” . . . . . . . . . . . . . . . . . . 56
4.4.2 Speaker Groups laryng41 and laryng18 . . . . . . . . . . . . . . . . . . 57
4.4.3 Evaluation by Human Experts . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.4 Intra-Rater and Inter-Rater Correlation . . . . . . . . . . . . . . . . . . 59
4.5 Normal-Speaking Control Groups . . . . . . . . . . . . . . . . . . . . . . . . . 61
5 Automatic Speech Analysis 63
5.1 The Recognition System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1.2 Acoustic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.1.4 Language Model and Decoding . . . . . . . . . . . . . . . . . . . . . . 66
5.1.5 Recognizer Training Procedure . . . . . . . . . . . . . . . . . . . . . . 67
5.1.6 Speech Recognizers for the Evaluation of TE Speech . . . . . . . . . . . 67
5.2 Modified Features for Reverberated Environment . . . . . . . . . . . . . . . . . 68
5.2.1 Handling Acoustic Mismatch between Training and Test Data . . . . . . 69
5.2.2 The Root Cepstrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.3 μ-Law Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 Recognizer Adaptation to TE Voices . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3.1 Basic Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3.2 Linear Interpolation of Hidden Markov Models . . . . . . . . . . . . . . 72
5.3.3 Estimation of the Interpolation Weights . . . . . . . . . . . . . . . . . . 72
5.3.4 Determination of the Interpolation Partners . . . . . . . . . . . . . . . . 73
5.4 Visualization of Recognizer Adaptation . . . . . . . . . . . . . . . . . . . . . . 74
5.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4.2 A Distance Metric for Semi-Continuous HMMs . . . . . . . . . . . . . . 74
5.4.3 Sammon Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4.4 Mappings of Voice Disorders . . . . . . . . . . . . . . . . . . . . . . . . 76
5.5 Prosodic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.5.2 The Prosody Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6 Speech Recognition in Reverberated Environment 83
6.1 Experimental Results on Reverberated Training Data . . . . . . . . . . . . . . . 83
6.1.1 Experiments with the EMBASSI Baseline System EMB-base . . . . . . 83
6.1.2 The EMB-rev Recognizer with Distant-Talking Training Data . . . . . . 84
6.1.3 Artificially Reverberated Training Data in EMBASSI Recognizers . . . 85
6.1.4 Experiments on VERBMOBIL and Fatigue Data . . . . . . . . . . . . . . 86
6.2 Experimental Results on Modified Features . . . . . . . . . . . . . . . . . . . . 87
6.2.1 Root Cepstrum Features . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2.2 μ-Law Features in the EMBASSI Baseline System EMB-base . . . . . . 89
6.2.3 μ-Law Features and Artificially Reverberated EMBASSI Data . . . . . 93
6.2.4 μ-Law Features and Artificially Reverberated VERBMOBIL Data . . . . 93
6.2.5 Gaussianization of Feature Components . . . . . . . . . . . . . . . . . . 95
6.3 Results on Beamformed Test Data . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3.1 Removing Reverberation from Audio Signals . . . . . . . . . . . . . . . 97
6.3.2 Beamforming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.3 Experiments with the EMBASSI Baseline System EMB-base . . . . . . 98
6.3.4 Beamforming and Artificially Reverberated EMBASSI Data . . . . . . . 98
6.3.5 Results on the VERBMOBIL-Based Recognizers . . . . . . . . . . . . . . 100
6.3.6 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 101
7 Automatic Analysis of Tracheoesophageal Voices 103
7.1 Automatic Speech Recognition vs. Human Evaluation . . . . . . . . . . . . . . . 103
7.1.1 Baseline Recognition Results on the NW-base Recognizers . . . . . . . . 103
7.1.2 Correlation between NW-base Recognizers and Human Rating . . . . . . 104
7.2 Results of Recognizer Adaptation to TE Voices . . . . . . . . . . . . . . . . . . 105
7.2.1 Adaptation to Single Speakers . . . . . . . . . . . . . . . . . . . . . . . 107
7.2.2 Adaptation to the Entire laryng18 Speaker Group . . . . . . . . . . . . . 108
7.2.3 Correlation of the Word Accuracy Computed vs. the Reference Text . . . 109
7.2.4 Optimal Conversion of Word Accuracies to Integer Scores . . . . . . . . 112
7.2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.3 Prosodic Analysis vs. Human Evaluation . . . . . . . . . . . . . . . . . . . . . . 113
7.3.1 Prosodic Features on TE Speakers and Laryngeal Speakers . . . . . . . . 113
7.3.2 Correlation between Prosodic Features and Human Rating . . . . . . . . 114
7.3.3 Analysis of the Fundamental Frequency . . . . . . . . . . . . . . . . . . 118
7.3.4 Measuring the Match of Breath and Sense Units . . . . . . . . . . . . . . 119
7.3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.4 The Post-Laryngectomy Telephone Test (PLTT) . . . . . . . . . . . . . . . . . . 123
7.4.1 Initial Experiments with Telephone Speech Data . . . . . . . . . . . . . 123
7.4.2 Intelligibility Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.4.3 PLTT Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.4.4 Words and Sentences in the PLTT . . . . . . . . . . . . . . . . . . . . . 125
7.4.5 Test Data and Automatic Evaluation Results . . . . . . . . . . . . . . . . 126
7.5 Simulated Distant-Talking TE Recordings . . . . . . . . . . . . . . . . . . . . . 128
7.5.1 Test Data and Recognizers . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.6 Visualization of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.7 Selection of a Set of Objective Measures . . . . . . . . . . . . . . . . . . . . . . 133
7.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8 Discussion 137
9 Outlook 141
10 Summary 143
Bibliography 147
A Reading Material 179
A.1 The Text “The North Wind and the Sun” . . . . . . . . . . . . . . . . . . . . . . 179
A.2 The Reading Sheets for the PLTT . . . . . . . . . . . . . . . . . . . . . . . . . . 180
B Human Evaluation Results 183
C Recognition Results for Gaussianized Features 189
D Evaluation Environment for Voice Analysis 193
D.1 The Projects Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
D.2 The Recognizers Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
D.3 The Evaluation Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
D.3.1 Automatic Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . 195
D.3.2 Correlation between a Recognizer and Human Raters . . . . . . . . . . . 196
D.3.3 Correlation among Human Raters . . . . . . . . . . . . . . . . . . . . . 196
D.3.4 Computing “Word Hypotheses Graphs” (WHGs) . . . . . . . . . . . . . 197
D.3.5 Computing Prosodic Features . . . . . . . . . . . . . . . . . . . . . . . 197
D.3.6 Further Evaluation Scripts . . . . . . . . . . . . . . . . . . . . . 198
E German Translation of Introduction and Summary 201
E.1 Titel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
E.2 Inhaltsverzeichnis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
E.3 Einleitung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
E.3.1 Die Notwendigkeit objektiver Evaluierung . . . . . . . . . . . . . . . . 206
E.3.2 Auf das Screening in „natürlichen“ Situationen gerichtet . . . . . . . . 207
E.3.3 Beitrag dieser Arbeit zur Forschung . . . . . . . . . . . . . . . . . . . . 209
E.3.4 Übersicht . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
E.4 Zusammenfassung . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Index 217