Expression control using synthetic speech.
Brian Wyvill (blob@cs.uvic.ca) and David R. Hill (hill@cpsc.ucalgary.ca)
Department of Computer Science, University of Calgary.
2500 University Drive N.W.
Calgary, Alberta, Canada, T2N 1N4
Abstract
This tutorial paper presents a practical guide to animating facial expressions synchronised to a rule-based speech synthesiser. A description of speech synthesis by rules is given, together with an account of how a set of parameters that drives both the speech synthesis and the graphics is derived. An example animation is described along with the outstanding problems.
Key words: Computer Graphics, Animation, Speech Synthesis, Face-Animation.
© ACM, 1989. This is the authors’ version of the work. It is posted here by permission of ACM
for your personal use. Not for redistribution. The definitive version was published as Course #
22 of the Tutorial Section of ACM SIGGRAPH 89, Boston, Massachusetts, 31 July - 4 August
1989. DOI unknown.
Note (drh 2008): Appendix A was added, after publication of these tutorial notes by the ACM,
to flesh out some details of the parameter synthesis, and to provide a more complete acoustic
parameter table (the original garbled table headings have been corrected in the original paper
text that follows, but the data was still incomplete—intended for discussion in a tutorial).
Fairly soon after the animation work using the formant synthesiser was finished, a completely new articulatory speech synthesis system was developed by one of the authors and his colleagues. This system uses an acoustic tube model of the human vocal tract, with associated new posture databases (cast in terms of tube radii and excitation), new rules, and so on. Originally a
technology spin-off company product, the system was the first complete articulatory real-time
text-to-speech synthesis system in the world and was described in [Hill 95]. All the software
is now available from the GNU project gnuspeech under a General Public Licence. Originally
developed on the NeXT computer, much of the system has since been ported to the Macintosh
under OS/X, and work on a GNU/Linux version running under GNUStep is well under way
(http://savannah.gnu.org/projects/gnuspeech).
Reference
[Hill 95] David R. Hill, Leonard Manzara, Craig-Richard Schock. Real-time articulatory speech-synthesis-by-rule. Proc. AVIOS '95, the 14th Annual International Voice Technologies Applications Conference of the American Voice I/O Society, San Jose, September 11-14 1995. AVIOS: San Jose, 27-44.
1 Motivation
In traditional hand animation, synchronisation between graphics and speech has been achieved through a tedious process of analysing a speech sound track and drawing corresponding mouth positions (and expressions) at key frames. To achieve a more realistic correspondence, a live actor may be filmed to obtain the correct mouth positions. This method produces good results, but it must be repeated for each new speech, is time-consuming, and requires a great deal of specialised skill on the part of the animator. A common approach to computer animation uses a similar analysis to derive key sounds, from which parameters to drive a face model can be found (see [Parke 74]). Such an approach to animation is more flexible than the traditional hand method, since the parameters that drive such a face model correspond directly to the key measurements available from the photographs, rather than requiring the animator to design each expression as needed. However, the process is not automatic, requiring tedious manual procedures for recording and measuring the actor.
In our research we were interested in finding a fully automatic way of producing an animated
face to match speech. Given a recording of an actor speaking the appropriate script, it might
seem possible to design a machine procedure to recognise the individual sounds and to use
acoustic-phonetic and articulatory rules to derive sets of parameters to drive the Parke face
model. However, this would require a more sophisticated speech recognition program than is
currently available.
The simplest way for a computer animator to interact with such a system would be to type
in a line of text and have the synthesised speech and expressions automatically generated. This
was the approach we decided to try.
From the initial input, given the still incomplete state of knowledge concerning speech synthesis by rules, we wanted to allow some audio editing to improve the quality, with the corresponding changes to the expressions being made automatically. Synthetic speech by rules was the most appropriate choice since it can be generated from keyboard input; it is a very general approach which lends itself to the purely automatic generation of speech animation. The major drawback is that speech synthesised in this manner is far from perfect.
2 Background
2.1 The Basis for Synthesis by Rules
Acoustic-phonetic research into the composition of spoken English during the 1950s and 1960s led to the determination of the basic acoustic cues associated with forty or so sound classes. This early research was conducted at Haskins Laboratories in the US and elsewhere worldwide. The sound classes are by no means homogeneous, and we still do not have complete knowledge of all the variations and their causes. However, broadly speaking, each sound class can be identified
with a configuration of the vocal organs in making sounds in the class. We shall refer to this as
a speech posture. Thus, if the jaw is rotated a certain amount, and the lips held in a particular
position, with the tongue hump moved high or low, and back or forward, a vowel-like noise can be
produced that is characterised by the energy distribution in the frequency domain. This distribution
contains peaks, corresponding to the resonances of the tube-like vocal tract, called formants. As
the speaker articulates different sounds (the speech posture is thus varying dynamically and
continuously), the peaks will move up and down the frequency scale, and the sound emitted will
change. Figure 1 shows the parts of the articulatory system involved with speech production.
Figure 1: The Human Vocal Apparatus
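As a rough illustration of how formants arise (this sketch is ours, not part of the original tutorial), a uniform tract closed at the glottis and open at the lips behaves as a quarter-wave resonator, with resonances at Fn = (2n - 1) * c / (4L). The short C program below evaluates this for typical textbook values of tract length and sound speed.

    /* Illustrative only: first three formants of a uniform vocal tract
       modelled as a quarter-wave resonator (closed at the glottis, open
       at the lips).  Fn = (2n - 1) * c / (4 * L).  The constants are
       typical textbook values, not figures from this paper. */
    #include <stdio.h>

    int main(void)
    {
        const double c = 35000.0;  /* speed of sound in warm, moist air (cm/s) */
        const double L = 17.5;     /* typical adult vocal tract length (cm)    */

        for (int n = 1; n <= 3; n++) {
            double f = (2 * n - 1) * c / (4.0 * L);
            printf("F%d = %4.0f Hz\n", n, f);   /* prints 500, 1500, 2500 Hz */
        }
        return 0;
    }

For a neutral vowel posture this gives roughly 500, 1500 and 2500 Hz; rotating the jaw or moving the lips and tongue reshapes the tube and so moves these peaks, which is the variation described above.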
2.2 Vowel and Consonant Sounds
The movements are relatively slow during vowel and vowel-like articulations, but are often much
faster in consonant articulations, especially for plosive sounds like /b, d, g, p, t, and k/ (these
are more commonly called the stop consonants). The nasal sounds /m, n/ and the sound at the
end of “running”—/ŋ/, are articulated very much like the plosive sounds, and not only involve
quite rapid shifts in formant frequencies but also a sudden change in general spectral quality
because the nasal passage is very quickly connected and disconnected for nasal articulation by
the valve towards the back of the mouth that is formed by the soft palate (the velum)—hence the
phrase “nasal sounds”. Various hiss-like noises are associated with many consonants because
consonants are distinguished from vowels chiefly by a higher degree of constriction in the vocal
tract (completely stopped in the case of the stop consonants). This means that either during,
or just after the articulation of a consonant, air from the lungs is rushing through a relatively
narrow opening, in turbulent flow, generating random noise (sounds like /s/, or /f/). Whispered
speech also involves airflow noise as the sound medium, but, since the turbulence
occurs early in the vocal flow, it is shaped by the resonances and assumes many of the qualities
of ordinarily spoken sounds.
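The shaping of turbulence noise by the tract resonances can be made concrete with the standard two-pole digital resonator used as a building block in formant synthesisers. The fragment below is a sketch of ours, not code from the synthesiser used in this work; the sample rate, centre frequency and bandwidth are arbitrary illustrative values. It simply filters white noise through one such resonance.

    /* Sketch: shape random (turbulence-like) noise with one two-pole
       digital resonator, the basic building block of a formant
       synthesiser.  All constants are illustrative assumptions. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    int main(void)
    {
        const double PI = 3.14159265358979323846;
        const double fs = 16000.0;  /* sample rate (Hz)              */
        const double fc = 2500.0;   /* resonance (formant) centre    */
        const double bw = 200.0;    /* resonance bandwidth (Hz)      */

        /* Standard two-pole resonator coefficients. */
        double r  = exp(-PI * bw / fs);
        double b1 = 2.0 * r * cos(2.0 * PI * fc / fs);
        double b2 = -r * r;
        double a0 = 1.0 - b1 - b2;  /* normalise gain to unity at 0 Hz */

        double y1 = 0.0, y2 = 0.0;
        for (int n = 0; n < 200; n++) {
            double x = 2.0 * rand() / (double)RAND_MAX - 1.0; /* noise source */
            double y = a0 * x + b1 * y1 + b2 * y2;            /* resonator    */
            y2 = y1;
            y1 = y;
            if (n % 50 == 0)
                printf("y[%3d] = % .4f\n", n, y);
        }
        return 0;
    }

In a full formant synthesiser several such resonators, with time-varying centre frequencies, shape either the periodic glottal source or a noise source.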
2.3 Voiced and Voiceless
When a sound is articulated, the vocal folds situated in the larynx may be wide open and relaxed,
or held under tension. In the second case they will vibrate, imposing a periodic flow pattern on
the rush of air from the lungs (and making a noise much like a raspberry blown under similar
conditions at the lips). However, the energy in the noise from the vocal folds is redistributed
by the resonant properties of the vocal and nasal tracts, so that it doesn’t sound like a raspberry
by the time it gets out. Sounds in which the vocal folds are vibrating are termed voiced. Other
sounds are termed voiceless, although some further qualification is needed.
It is reasonable to say that the word cat is made up of the sounds /k æ t/. However, although a
sustained /æ/ can be produced, a sustained /k/ or /t/ cannot. Although stop sounds are articulated
as speech postures, the cues that allow us to hear them occur as a result of their environment.
When the characteristic posture of /t/ is formed, no sound is heard at all: the stop gap, or silence, is
only heard as a result of noises either side, especially the formant transitions (see 2.4 below).
The sounds /t/ and /d/ differ only in that the vocal folds vibrate during the /d/ posture, but not
during the /t/ posture. The /t/ is a voiceless alveolar stop, whereas the /d/ is a voiced alveolar
stop, the alveolar ridge being the place within the vocal tract where the point of maximum
constriction takes place, known as the place of articulation. The /k/ is a voiceless velar stop.
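These voicing, manner and place distinctions form a small feature table for the sounds mentioned so far. The sketch below is purely illustrative: the type and field names are ours, and the synthesiser's own posture data also carries acoustic parameter targets, as noted in the remark about Appendix A above.

    /* Illustrative classification of the speech postures mentioned in
       the text.  Type and field names are our own. */
    #include <stdio.h>

    typedef enum { VOWEL, STOP, NASAL, FRICATIVE } Manner;

    typedef struct {
        const char *symbol;   /* phonetic symbol              */
        Manner      manner;   /* manner of articulation       */
        int         voiced;   /* 1 if the vocal folds vibrate */
        const char *place;    /* place of articulation        */
    } Posture;

    static const Posture postures[] = {
        { "ae", VOWEL,     1, "low front"   },  /* vowel of "cat"           */
        { "t",  STOP,      0, "alveolar"    },
        { "d",  STOP,      1, "alveolar"    },
        { "k",  STOP,      0, "velar"       },
        { "m",  NASAL,     1, "bilabial"    },
        { "n",  NASAL,     1, "alveolar"    },
        { "ng", NASAL,     1, "velar"       },  /* final sound of "running" */
        { "s",  FRICATIVE, 0, "alveolar"    },
        { "f",  FRICATIVE, 0, "labiodental" },
    };

    int main(void)
    {
        for (size_t i = 0; i < sizeof postures / sizeof postures[0]; i++)
            printf("/%s/\t%-10s %s\n", postures[i].symbol,
                   postures[i].voiced ? "voiced" : "voiceless",
                   postures[i].place);
        return 0;
    }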
2.4 Aspiration
When a voiceless stop is articulated in normal speech, the vocal folds do not be
