Situated Computer Vision
Sven Wachsmuth
Habilitationsschrift
(Date of habilitation: 30.1.2009)
Universität Bielefeld
Technische Fakultät
March 24, 2010

Contents
1 Situated Perception 11
1.1 Perspectives on computer vision . . . . . . . . . . . . . . . . . 12
1.2 Situation models . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.1 Storage and retrieval structures . . . . . . . . . . . . . 17
1.2.2 Situation models as a dynamical representation . . . . 19
1.3 Context in human vision . . . . . . . . . . . . . . . . . . . . 21
1.3.1 Results from eye-tracking experiments . . . . . . . . . 22
1.3.2 The object-detection paradigm . . . . . . . . . . . . . 23
1.3.3 Neurophysiological results . . . . . . . . . . . . . . . . 24
1.4 Summary and conclusion . . . . . . . . . . . . . . . . . . . . . 27
2 Perception of Scenes 29
2.1 Why context? . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.1.1 Aspects of contextual modeling . . . . . . . . . . . . . 33
2.1.2 Contextual modeling for scene understanding . . . . . 35
2.1.3 Contextual modeling for system control . . . . . . . . . 38
2.2 Recognizing global scene contexts . . . . . . . . . . . . . . . . 41
2.2.1 Holistic scene classification . . . . . . . . . . . . . . 43
2.2.2 Scenes as a configuration of parts . . . . . . . . . . . . 52
2.3 Using context in object recognition . . . . . . . . . . . . . . . 65
2.3.1 Combining holistic context and object detection . . . . 67
2.3.2 Detecting semantic object-scene inconsistencies . . . . 69
2.3.3 Understanding objects in 3D scenes . . . . . . . . . . . 71
2.3.4 Integrating visual and verbal object descriptions . . . . 77
2.4 Summary and conclusion . . . . . . . . . . . . . . . . . . . . . 81
3 Perception of Scene Dynamics 83
3.1 What is an action? . . . . . . . . . . . . . . . . . . . . . . . . 83
3.1.1 Using context in action recognition . . . . . . . . . . . 84
3.2 Action as a symbolic sequence of state-changes . . . . . . . . . 85
3.2.1 Event logic . . . . . . . . . . . . . . . . . . . . . . . . 86
3.2.2 Constraint networks . . . . . . . . . . . . . . . . . . . 90
3.3 Action as a stochastic process . . . . . . . . . . . . . . . . . . 92
3.3.1 Probabilistic motion models . . . . . . . . . . . . . . . 93
3.3.2 Using context in motion models . . . . . . . . . . . . . 96
3.4 Scene evolution . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.5 Summary and conclusion . . . . . . . . . . . . . . . . . . . . . 106
4 Cross-situational Learning 109
4.1 Parallel datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.2 Statistical translation models . . . . . . . . . . . . . . . . . . 113
4.2.1 Parameter estimation . . . . . . . . . . . . . . . . . . . 115
4.2.2 Applying translation models to captionized images . . 116
4.3 Co-occurrence statistics . . . . . . . . . . . . . . . . . . . . . . 123
4.3.1 Mixture models and clustering methods . . . . . . . . . 125
4.3.2 Likelihood ratio testing . . . . . . . . . . . . . . . . . . 128
4.4 Mutual information methods . . . . . . . . . . . . . . . . . . . 133
4.4.1 Learning an audio-visual lexicon . . . . . . . . . . . . . 133
4.4.2 Detecting non-compositional compounds . . . . . . . . . 135
4.5 Summary and conclusion . . . . . . . . . . . . . . . . . . . . . 140
5 System Control Strategies 143
5.1 Aspects of system control . . . . . . . . . . . . . . . . . . . . 143
5.1.1 Control theory . . . . . . . . . . . . . . . . . . . . . . 147
5.1.2 Rational agents . . . . . . . . . . . . . . . . . . . . . . 150
5.1.3 Coordination of multiple control processes . . . . . . . 151
5.1.4 User interaction and situation awareness . . . . . . . . 152
5.2 Production systems . . . . . . . . . . . . . . . . . . . . . . . . 153
5.2.1 Coding context in rules . . . . . . . . . . . . . . . . . . 154
5.2.2 Problem spaces . . . . . . . . . . . . . . . . . . . . . . 157
5.3 Frame-based systems . . . . . . . . . . . . . . . . . . . . . . . 159
5.3.1 Schema theory . . . . . . . . . . . . . . . . . . . . . . 160
5.3.2 Semantic networks . . . . . . . . . . . . . . . . . . . . 161
5.4 Utility-based approaches . . . . . . . . . . . . . . . . . . . 164
5.4.1 Utility-based classification . . . . . . . . . . . . . . 165
5.4.2 Markov Decision Processes . . . . . . . . . . . . . . . . 170
5.5 Summary and conclusion . . . . . . . . . . . . . . . . . . . . . 174
6 System Integration 175
6.1 Requirements for integrated systems . . . . . . . . . . . . . . 175
6.2 Behavior modules . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.3 Situation controller . . . . . . . . . . . . . . . . . . . . . . . . 180
6.4 Service-oriented architectures . . . . . . . . . . . . . . . . . . 183
6.5 Data-driven process coordination . . . . . . . . . . . . . . . . 184
6.6 Blackboards . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
6.7 Active memories . . . . . . . . . . . . . . . . . . . . . . . . . 187
6.7.1 The Active Memory infrastructure . . . . . . . . . . . 191
6.7.2 Coordinating memory processes in larger systems . . . 194
6.8 Summary and conclusion . . . . . . . . . . . . . . . . . . . . . 196
7 Summary and Outlook 199
A Mathematical details 205
A.1 Learning translation models . . . . . . . . . . . . . . . . . . . 205
A.2 Mutual information measures . . . . . . . . . . . . . . . . . 207

Preface
We are currently witnessing a dramatic change in the way people interact
with computing machinery. Although most people still use keyboard and
mouse for their desktop personal computers, more and more computational
units that cannot be easily accessed by traditional means of human-computer
interfaces are invading our daily life. The interaction space grows
from the easily controlled virtual desktop to an uncontrolled physical envi-
ronment which is the domain of natural human-human communication. One
reason for this is miniaturization: cell phones, for example, are becoming
more and more computationally powerful. Another reason is distribution, or
pervasiveness: computational units are integrated everywhere in our
environment without being noticed, and there is no physical interface for
plugging in a monitor or a keyboard. A third reason is embodiment. The appearance
and movement of robotic toys, like the Sony AIBO, of robotic interfacing
agents, like the Philips iCat, or even of humanoid robots, like the Honda
Asimo, mimic human-like or animal-like characters. They are situated in the
physical world and not in a digital world. Their embodiment and character
style cause a significant degree of anthropomorphism, i.e. the attribution of
uniquely human characteristics and qualities. An important part of this is the
expectation of communicating with these technical platforms in a human-like
fashion. A fourth reason is the availability of technology. Many of the devices
mentioned have built-in cameras and microphones, which resemble the
most important and richest sensory modalities of humans. As a consequence,
there is an economic and social pressure to use them and to produce devices
that are more fun to interact with.
Human-human communication has many facets. It is not only based on
language but an inherently multi-modal affair that involves every sense we
have. Both sender and receiver make extensive use of them, in encoding as
well as in decoding a communicative goal or intention. This can be
verified in everyone's own personal experience. We even do it when the
communication channel does not transmit the multi-modal content, which might
cause some irritation on the receiver's side. For example, a person presenting
a talk via laptop and projector points at his computer screen although
nobody can see it; a young child telephoning with her Grandma shows her
newest toy when asked about her birthday presents although Grandma can
only hear her; two persons communicating through a closed window verbally
comment on what they show to each other although nobody can hear
anything. It looks irritating because the different multi-modal cues
produced by the actors relate and refer to each other. They are situated
in that they do not encode the full meaning: other cues from the current
situation are needed to complete their interpretation.
This does not only hold for communicative situations, but is a more
general principle. Many computer vision techniques turn out to be fragile when
they are taken even slightly out of the application scenario they were designed
for. This is an inherent problem that was noticed a long time
ago. Solutions can be based on different principles. First, we can add explicit
contextual knowledge to the system that is used to control the application of
image operators. Secondly, we can add contextual features that influence or
bias the classification decision of some interpretation process. Thirdly, we can
actively shape the capturing process in order to make the interpretation
process more invariant to context. Fourthly, we can provide appropriate feedback
about the current system performance so that a potential user can change the
current situation to improve system performance. All these strategies
are different variations of situated computer vision approaches. They become
especially important if computer vision …
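
To make the second of these strategies concrete, consider a minimal Python
sketch. It is purely illustrative: all labels, priors, and scores below are
invented for this example and do not come from the thesis. It combines an
appearance-based detector score with a context-conditioned prior over object
labels, in the spirit of the holistic-context methods surveyed in Chapter 2:

    # Sketch of the second strategy: contextual features biasing a
    # classification decision. Hypothetical numbers throughout.
    # The combination rule is a simple product:
    #   p(object | appearance, context) is proportional to
    #   p(appearance | object) * p(object | context)

    # Assumed context-conditioned priors p(object | context).
    CONTEXT_PRIORS = {
        "office":  {"monitor": 0.5, "cup": 0.3, "toaster": 0.2},
        "kitchen": {"monitor": 0.1, "cup": 0.4, "toaster": 0.5},
    }

    def rescore(detector_scores, context):
        """Bias appearance-based detector scores by the scene-context prior."""
        priors = CONTEXT_PRIORS[context]
        joint = {label: score * priors.get(label, 1e-6)
                 for label, score in detector_scores.items()}
        norm = sum(joint.values())
        return {label: value / norm for label, value in joint.items()}

    # The same ambiguous appearance scores are disambiguated by context:
    scores = {"monitor": 0.35, "cup": 0.30, "toaster": 0.35}
    print(rescore(scores, "kitchen"))   # "toaster" now dominates
    print(rescore(scores, "office"))    # "monitor" now dominates

Under this simple product rule the winning label depends on the scene
context, which is exactly the biasing effect described above.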
