Understanding Multimodal Deixis with Gaze and
Gesture in Conversational Interfaces
Thies Pfeiffer
A.I. Group
Faculty of Technology
Bielefeld University
P.O. Box 10 01 31
D-33501 Bielefeld
Germany
email: tpfeiffe@techfak.uni-bielefeld.de
This dissertation has been approved by the Faculty of Technology at Bielefeld University to obtain the academic degree of a Doctor rerum naturalium (Informatics).
Dean of the faculty: Prof. Dr. Jens Stoye
First reviewer: Prof. Dr. Ipke Wachsmuth
Second reviewer: Prof. Dr. Hannes Rieser
Submission of the thesis: April 28, 2010
Day of the disputation: October 8, 2010
The background of the model on the front page shows the scene "Venice" by Stefan John, copyright 2009.
The official print version has been printed on age-resistant paper according to DIN-ISO 9706.
Summary
When humans communicate, we use deictic expressions to refer to objects in our surroundings and put them in the context of our actions. In face-to-face interaction, we can complement verbal expressions with gestures and, hence, we do not need to be too precise in our verbal protocols. Our interlocutors hear our speech, see our gestures, and even read our eyes. They interpret our deictic expressions, try to identify the referents and, normally, they will understand. If only machines could do the same.
The driving vision behind the research in this thesis is multimodal conversational interfaces where humans are engaged in natural dialogues with
computer systems. The embodied conversational agent Max developed in the
A.I. group at Bielefeld University is an example of such an interface. Max
is already able to produce multimodal deictic expressions using speech, gaze
and gestures, but his capabilities to understand humans are not on par. If he were able to resolve multimodal deictic expressions, his understanding of
humans would increase and interacting with him would become more natural.
Following this vision, we as scientists are confronted with several challenges.
First, accurate models for human pointing have to be found. Second, precise
data on multimodal interactions has to be collected, integrated and analyzed
in order to create these models. This data is multimodal (transcripts, voice
and video recordings, annotations) and not directly accessible for analysis
(voice and video recordings). Third, technologies have to be developed to
support the integration and the analysis of the multimodal data. Fourth, the
created models have to be implemented, evaluated and optimized until they
allow a natural interaction with the conversational interface.
To these ends, this work aims to deepen our knowledge of human non-verbal deixis, specifically of manual and gaze pointing, and to apply this knowledge
in conversational interfaces. At the core of the theoretical and empirical
investigations of this thesis are models for the interpretation of pointing
gestures to objects. These models address the following questions: When
are we pointing? Where are we pointing to? Which objects are we pointing
at? With respect to these questions, this thesis makes the following three
contributions:
First, gaze-based interaction technology for 3D environments: Gaze plays
an important role in human communication, not only in deictic reference.
Yet, technology for gaze interaction is still less developed than technology for
manual interaction. In this thesis, we have developed components for real-time
tracking of eye movements and of the point of regard in 3D space and integrated
them in a framework for Deictic Reference In Virtual Environments (DRIVE).
DRIVE provides viable information about human communicative behavior in
real-time. This data can be used to investigate and to design processes on
higher cognitive levels, such as turn-taking, check-backs, shared attention and
resolving deictic reference.
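
To make the notion of a point of regard in 3D space more concrete, the following minimal sketch illustrates one common way such a point can be estimated: the gaze ray is cast from the eye position along the measured gaze direction and intersected with the scene geometry, here simplified to spheres. This is only an illustrative sketch under these assumptions; the function and object names are hypothetical and it is not the actual DRIVE implementation.

```python
# Minimal sketch (not the actual DRIVE code): estimating a 3D point of
# regard by intersecting the gaze ray with simplified scene geometry.
# Objects are modelled as spheres; all names and structures are assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class Sphere:
    name: str
    center: np.ndarray  # (3,) world coordinates in metres
    radius: float

def point_of_regard(eye_pos, gaze_dir, scene):
    """Return (object, hit point) of the closest gaze-ray/sphere hit, or None."""
    d = gaze_dir / np.linalg.norm(gaze_dir)
    best = None
    for obj in scene:
        oc = eye_pos - obj.center
        b = 2.0 * np.dot(d, oc)
        c = np.dot(oc, oc) - obj.radius ** 2
        disc = b * b - 4.0 * c
        if disc < 0:
            continue  # gaze ray misses this sphere
        t = (-b - np.sqrt(disc)) / 2.0  # distance to the nearer intersection
        if t > 0 and (best is None or t < best[0]):
            best = (t, obj)
    if best is None:
        return None
    t, obj = best
    return obj, eye_pos + t * d

# Example: gaze from the origin along +z hits an object one metre ahead.
scene = [Sphere("red block", np.array([0.0, 0.0, 1.0]), 0.1)]
print(point_of_regard(np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]), scene))
```

The resulting hit point and hit object can then feed the higher-level processes named above, such as shared attention or the resolution of deictic reference.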
Second, data-driven modeling: We answer the theoretical questions about
timing, direction, accuracy and dereferential power of pointing by data-driven
modeling. As an empirical basis for the simulations, we created a substantial corpus with high-precision data from an extensive study on multimodal pointing. Two further studies complemented this effort with substantial data
on gaze pointing in 3D. Based on this data, we have developed several models
of pointing and successfully created a model for the interpretation of manual
pointing that achieves a human-like performance level.
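
To give a sense of what such an interpretation model has to accomplish, the following sketch resolves a pointing gesture with a simple geometric baseline: candidate objects are ranked by the angle between the pointing ray and the direction towards each object, and the best candidate is accepted if it lies within a tolerance. The names, the 10-degree threshold and the point-like object representation are assumptions for illustration; this is not the data-driven model developed in the thesis.

```python
# Minimal baseline sketch of pointing interpretation: rank candidate objects
# by the angle between the pointing ray and the direction to each object and
# accept the best one below a tolerance. Names, the 10-degree threshold and
# the point-like objects are illustrative assumptions only.
import numpy as np

def resolve_pointing(origin, direction, objects, max_angle_deg=10.0):
    """objects: dict name -> (3,) position. Returns the best matching name or None."""
    d = direction / np.linalg.norm(direction)
    best_name, best_angle = None, np.inf
    for name, pos in objects.items():
        to_obj = pos - origin
        to_obj = to_obj / np.linalg.norm(to_obj)
        angle = np.degrees(np.arccos(np.clip(np.dot(d, to_obj), -1.0, 1.0)))
        if angle < best_angle:
            best_name, best_angle = name, angle
    return best_name if best_angle <= max_angle_deg else None

# Example: the ray from the index finger tip passes close to one candidate.
objects = {"green bolt": np.array([0.5, 0.0, 1.0]),
           "blue bar":   np.array([-0.4, 0.2, 1.2])}
print(resolve_pointing(np.array([0.0, 0.0, 0.0]),
                       np.array([0.45, 0.0, 1.0]), objects))
```

Where such a baseline uses a fixed ray and a fixed tolerance, the empirically grounded models address how the effective direction and spatial extension of pointing should be derived from the recorded data.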
Third, new methodologies for research on multimodal deixis in the fields
of linguistics and computer science: The experimental-simulative approach
to modeling, which we follow in this thesis, requires large collections of
heterogeneous data to be recorded, integrated, analyzed and resimulated. To
support the researcher in these tasks, we developed the Interactive Augmented
Data Explorer (IADE). IADE is an innovative tool for research on multimodal
interaction based on virtual reality technology. It allows researchers to literally immerse themselves in multimodal data and interactively explore it in real-time and
in virtual space. With IADE we have also extended established approaches
for scientific visualization of linguistic data to 3D, which previously existed
only for 2D methods of analysis (e.g. video recordings or computer screen
experiments). By this means, we extended McNeill’s 2D depiction of the
gesture space to gesture space volumes expanding in time and space. Similarly,
we created attention volumes, a new way to visualize the distribution of
attention in 3D environments.
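
To illustrate what an attention volume captures, the following sketch accumulates 3D points of regard into a voxel grid, spreading each sample with a Gaussian kernel so that frequently inspected regions receive high density values. The grid resolution, the kernel width and all names are assumptions made for this example; the actual visualization developed in the thesis may differ.

```python
# Minimal sketch of an "attention volume": accumulate 3D points of regard
# into a voxel grid, spreading each sample with a Gaussian kernel so that
# densely looked-at regions receive high values. Grid size, sigma and all
# names are illustrative assumptions.
import numpy as np

def attention_volume(points, grid_shape=(32, 32, 32),
                     bounds=((-1, 1), (-1, 1), (0, 2)), sigma=0.05):
    """points: (N, 3) array of 3D points of regard. Returns a density grid."""
    grid = np.zeros(grid_shape)
    axes = [np.linspace(lo, hi, n) for (lo, hi), n in zip(bounds, grid_shape)]
    X, Y, Z = np.meshgrid(*axes, indexing="ij")
    for p in points:
        sq_dist = (X - p[0]) ** 2 + (Y - p[1]) ** 2 + (Z - p[2]) ** 2
        grid += np.exp(-sq_dist / (2.0 * sigma ** 2))
    return grid

# Example: many fixations near one object produce a pronounced hot spot there.
samples = np.random.normal(loc=[0.2, 0.0, 1.0], scale=0.02, size=(200, 3))
volume = attention_volume(samples)
print(volume.max(), np.unravel_index(volume.argmax(), volume.shape))
```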
Contents
List of Figures
List of Tables
List of Acronyms
Acknowledgement

1 Introduction
1.1 Motivation
1.2 Thesis Scope and Objectives
1.3 Thesis Structure

2 Interdisciplinary Background
2.1 Gaze and Manual Gesture in Communication
2.2 Reference and Deixis
2.3 Manual Pointing
2.4 Gaze Pointing
2.5 Coupling of Gesture and Gaze
2.6 Summary

3 Related Work in Human-Computer Interaction
3.1 Multimodal Interaction with Gesture and Gaze
3.2 Detecting Pointing in Gaze and Manual Gestures
3.3 Interpreting Pointing
3.4 Integrating Multimodal Deixis
3.5 Summary

4 Manual Pointing
4.1 Deixis in Construction Dialogues
4.2 Study Objectives
4.3 Study Design
4.4 Domain of Possible Referents
4.5 Data Acquisition
4.6 The Interactive Augmented Data Explorer (IADE)
4.7 Annotation
4.8 Simulative Analysis and Visualization with IADE
4.9 Results
4.10 Visualizing Gesture Space in 3D
4.11 Visualizing Reference Volumes for Manual Pointing
4.12 Summary

5 Gaze Pointing
5.1 Study 1: Direction-based Pointing
5.2 Study 1: Hardware Set-Up
5.3 Study 1: Visual Ping
5.4 Study 1: Results
5.5 Study 1: Discussion
5.6 Study 2: Location-based Pointing
5.7 Study 2: Hypotheses
5.8 Study 2: Scenario
5.9 Study 2: Results
5.10 Study 2: Discussion
5.11 Visualizing the Point of Regard in 3D
5.12 Summary

6 Modeling the Extension of Gaze and Manual Pointing
6.1 Study on Manual Pointing Reconsidered
6.2 Modeling the Direction of Manual Pointing
6.3 Modeling the Spatial Extension of Manual Pointing
6.4 Modeling Gaze Pointing
6.5 Integrating Pointing Models with a Conversational Interface
6.6 Summary

7 Applications and Conclusion
7.1 Applications with DRIVE
7.2 Resume
7.3 Further Perspectives
