Voice modeling methods for automatic speaker recognition [electronic resource] / submitted by Thilo Stadelmann

Published by: Philipps-Universität Marburg
Published on: 1 January 2010
Length: 237 pages
Language: English
Subject: Computer science (Informatik)

Voice Modeling Methods
for Automatic Speaker Recognition
Dissertation
for the degree of Doctor of Natural Sciences
(Dr. rer. nat.)
submitted to the Department of Mathematics and Computer Science
of Philipps-Universität Marburg
by
Thilo Stadelmann
born in Lemgo
Marburg, April 2010

Accepted as a dissertation by the Department of Mathematics and Computer
Science of Philipps-Universität Marburg on 15 April 2010.
First reviewer: Prof. Dr. Bernd Freisleben
Second reviewer: Prof. Dr. Alfred Ultsch
Date of the oral examination: 8 July 2010.

Abstract
Building a voice model means capturing the characteristics of a speaker’s voice
in a data structure. This data structure is then used by a computer for further
processing, such as comparison with other voices. Voice modeling is a vital step
in the process of automatic speaker recognition, which itself is the foundation of
several applied technologies: (a) biometric authentication, (b) speech recognition
and (c) multimedia indexing.
Several challenges arise in the context of automatic speaker recognition. First,
there is the problem of data shortage, i.e., the unavailability of sufficiently long
utterances for speaker recognition. It stems from the fact that the speech signal
conveys different aspects of the sound in a single, one-dimensional time series:
linguistic (what is said?), prosodic (how is it said?), individual (who said it?),
locational (where is the speaker?) and emotional features of the speech sound
itself (to name a few) are contained in the speech signal, as well as acoustic
background information. To analyze a specific aspect of the sound regardless of
the other aspects, analysis methods have to be applied to a specific time scale
(length) of the signal in which this aspect stands out from the rest. For example,
linguistic information (i.e., which phone or syllable has been uttered?) is found
in very short time spans of only milliseconds in length. In contrast,
speaker-specific information emerges more clearly the longer the analyzed sound
is. Long utterances, however, are not always available for analysis.
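The link between utterance length and the reliability of speaker-level information can be illustrated with a toy simulation. This is only a sketch under strong assumptions that are not taken from the thesis: each frame contributes a single hypothetical feature value drawn independently from a Gaussian around a speaker-specific mean, and the "voice model" is simply the average over all frames. Real speech frames are correlated and multi-dimensional, so the numbers are illustrative only.

```python
import random
import statistics

def utterance_mean(num_frames, rng, speaker_mean=1.0, frame_std=1.0):
    """Average a hypothetical 1-D frame feature over one utterance."""
    frames = [rng.gauss(speaker_mean, frame_std) for _ in range(num_frames)]
    return statistics.fmean(frames)

def estimate_spread(num_frames, trials=200, seed=0):
    """Standard deviation of the utterance-level estimate across many utterances."""
    rng = random.Random(seed)
    estimates = [utterance_mean(num_frames, rng) for _ in range(trials)]
    return statistics.stdev(estimates)

# Assuming a 10 ms frame step: 100 frames are about 1 s, 3000 frames about 30 s.
spread_short = estimate_spread(100)
spread_long = estimate_spread(3000)

# The estimate from long utterances scatters far less around the true mean.
print(spread_short, spread_long)
```

Under these assumptions the spread of the estimate shrinks roughly with the square root of the number of frames, which is one way to read the statement that speaker-specific information emerges more clearly from longer sounds.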
Second, the speech signal is easily corrupted by background sound sources
(noise, such as music or sound effects). If present, their characteristics tend to
dominate a voice model, so that a model comparison may be driven mainly by
background features instead of speaker characteristics.
Current automatic speaker recognition works well under relatively constrained
circumstances, such as studio recordings, or when prior knowledge on the number
and identity of occurring speakers is available. Under more adverse conditions,
such as in feature films or amateur material on the web, the achieved speaker
recognition scores drop below a rate that is acceptable for an end user or for
further processing. For example, the typical speaker turn duration of only one
second and the sound effect background in cinematic movies render most current
automatic analysis techniques useless.
In this thesis, methods for voice modeling that are robust with respect to short
utterances and background noise are presented. The aim is to facilitate movie
analysis with respect to occurring speakers. Therefore, algorithmic improvements
are suggested that (a) improve the modeling of very short utterances, (b)
facilitate voice model building even in the case of severe background noise and
(c) allow for efficient voice model comparison to support the indexing of large
multimedia archives. The proposed methods improve the state of the art in terms
of recognition rate and computational efficiency.
Going beyond selective algorithmic improvements, subsequent chapters also
investigate the question of what is lacking in principle in current voice modeling
methods. By reporting on a study with human participants, it is shown that the
exclusion of time coherence information from a voice model induces an artificial
upper bound on the recognition accuracy of automatic analysis methods. A
proof-of-concept implementation confirms the usefulness of exploiting this kind
of information by halving the error rate. This result questions the general speaker
modeling paradigm of the last two decades and presents a promising new way.
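What it means for a voice model to exclude time coherence information can be sketched with a small experiment. The example below is an illustration under stated assumptions, not the proof-of-concept implementation mentioned above: the frame feature is a hypothetical one-dimensional sine trajectory, and the "voice model" is reduced to the mean and variance of the frames. Such an order-free model gives the same result for the original sequence and for a randomly shuffled copy, so any information carried by the order of the frames is invisible to it.

```python
import math
import random
import statistics

def bag_of_frames_model(frames):
    """Order-free toy voice model: mean and variance of a 1-D frame feature."""
    return statistics.fmean(frames), statistics.pvariance(frames)

def mean_abs_delta(frames):
    """Order-aware statistic: average change between neighboring frames."""
    deltas = [abs(b - a) for a, b in zip(frames, frames[1:])]
    return statistics.fmean(deltas)

# Hypothetical smooth feature trajectory standing in for real speech features.
frames = [math.sin(0.1 * t) for t in range(200)]

# Same frames, but with their temporal order destroyed.
shuffled = frames[:]
random.Random(0).shuffle(shuffled)

model_original = bag_of_frames_model(frames)
model_shuffled = bag_of_frames_model(shuffled)

# The order-free model cannot tell the two sequences apart ...
print(math.isclose(model_original[0], model_shuffled[0], abs_tol=1e-12))
print(math.isclose(model_original[1], model_shuffled[1], abs_tol=1e-12))
# ... while even a very simple order-aware statistic can.
print(mean_abs_delta(frames) < mean_abs_delta(shuffled))
```

The frame-to-frame delta used here is just one simple order-aware statistic chosen for the illustration; the abstract does not specify how the thesis exploits temporal information.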
The approach taken to arrive at the previous results is based on a novel
methodology of algorithm design and development called "eidetic design". It
uses a human-in-the-loop technique that analyses existing algorithms in terms
of their abstract intermediate results. The aim is to detect flaws or failures in
them intuitively and to suggest solutions. The intermediate results often consist
of large matrices of numbers whose meaning is not clear to a human observer.
Therefore, the core of the approach is to transform them to a suitable domain of
perception (such as, e.g., the auditory domain of speech sounds in case of speech
feature vectors) where their content, meaning and flaws are intuitively clear to
the human designer. This methodology is formalized, and the corresponding
workflow is explicated by several use cases.
Finally, the use of the proposed methods in video analysis and retrieval is
presented. This shows the applicability of the developed methods and the
accompanying software library sclib by means of improved results using a
multimodal analysis approach. The library's source code is available to the
public upon request to the author. A summary of the contributions together with
an outlook to short- and long-term future work concludes this thesis.

Summary
A voice model captures the characteristic properties of a voice in a data
structure. This structure is used for further machine processing, e.g., for
comparison with other voices. This is a main step on the way to automatic speaker
recognition, which in turn is the core of several market-ready technologies:
(a) biometric authentication, (b) automatic speech recognition and (c) multimedia
search.
Automatic speaker recognition poses several challenges. First, there is the
problem of data shortage, i.e., of utterances that are too short. It arises from
the property of the speech signal of accommodating different aspects of the sound
in a single one-dimensional time series: linguistic (what was said?), prosodic
(how was it said?), individual (who said it?), locational (where is the speaker?)
and emotional features of the speech itself (to name only a few) are conveyed, as
is information about acoustic background noise. To analyze a particular aspect
independently of the others, the otherwise similar analysis techniques must be
calibrated to a specific temporal unit of the signal in which this aspect stands
out from the others. For example, linguistic information (which phoneme or which
syllable has just been uttered?) unfolds on a scale of only a few milliseconds.
Speaker-specific information, in contrast, can be extracted the better the longer
the analyzed speech segment is. Long, contiguous utterances, however, are not
always available.
Second, the speech signal is easily impaired by background noise such as music
or sound effects. The voice model then tends to represent the characteristics of
the background noise rather than those of the voice. A model comparison is then
wrongly based mainly on the background noise instead of on the voice.
Current systems for automatic speaker recognition work satisfactorily under
relatively controlled circumstances, such as low-noise studio recordings, or when
additional information is available, e.g., on the number and identity of the
occurring speakers. Under more adverse conditions, such as those found in movies
or amateur video material on the Internet, the recognition rate drops below the
threshold that is acceptable for end users or for further processing. For example,
the typical speech duration of about one second in cinematic movies and the sound
effects occurring there make it impossible to apply most current systems to such
data.
In this thesis, methods for voice modeling are investigated that are robust
against short utterances and background noise. The intended goal is the indexing
of movies with respect to the occurring speakers. To this end, algorithmic
improvements are presented that (a) allow the modeling of short speech segments,
(b) enable model building even under considerable background noise and (c) can
perform an efficient comparison of voice models in order to support the indexing
of large multimedia archives. The proposed methods significantly advance the
state of research in terms of recognition rate and computational speed.
In addition to these selective algorithmic improvements, the following chapters
also deal with fundamental weaknesses of current approaches to voice modeling.
By means of a study with human participants, it is shown that excluding temporal
context information from the voice model imposes an artificial upper bound on