Situated Computer Vision
Sven Wachsmuth
Habilitationsschrift
(Tag der Habilitation: 30.1.2009)
Universität Bielefeld
Technische Fakultät
March 24, 2010

Contents
1 Situated Perception 11
1.1 Perspectives on computer vision . . . . . . . . . . . . . . . . . 12
1.2 Situation models . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.1 Storage and retrieval structures . . . . . . . . . . . . . 17
1.2.2 Situation models as a dynamical representation . . . . 19
1.3 Context in Human vision . . . . . . . . . . . . . . . . . . . . . 21
1.3.1 Results from eye-tracking experiments . . . . . . . . . 22
1.3.2 The object-detection paradigm . . . . . . . . . . . . . 23
1.3.3 Neurophysiological results . . . . . . . . . . . . . . . . 24
1.4 Summary and conclusion . . . . . . . . . . . . . . . . . . . . . 27
2 Perception of Scenes 29
2.1 Why context? . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.1.1 Aspects of contextual modeling . . . . . . . . . . . . . 33
2.1.2 Contextual modeling for scene understanding . . . . . 35
2.1.3 Contextual modeling for system control . . . . . . . . . 38
2.2 Recognizing global scene contexts . . . . . . . . . . . . . . . . 41
2.2.1 Holistic scene classification . . . . . . . . . . . . . . 43
2.2.2 Scenes as a configuration of parts . . . . . . . . . . . 52
2.3 Using context in object recognition . . . . . . . . . . . . . . . 65
2.3.1 Combining holistic context and object detection . . . . 67
2.3.2 Detecting semantic object-scene inconsistencies . . . . 69
2.3.3 Understanding objects in 3D scenes . . . . . . . . . . . 71
2.3.4 Integrating visual and verbal object descriptions . . . . 77
2.4 Summary and conclusion . . . . . . . . . . . . . . . . . . . . . 81
3 Perception of Scene Dynamics 83
3.1 What is an action? . . . . . . . . . . . . . . . . . . . . . . . . 83
3.1.1 Using context in action recognition . . . . . . . . . . . 84
3.2 Action as a symbolic sequence of state-changes . . . . . . . . . 85
3.2.1 Event logic . . . . . . . . . . . . . . . . . . . . . . . . 86
3.2.2 Constraint networks . . . . . . . . . . . . . . . . . . . 90
3.3 Action as a stochastic process . . . . . . . . . . . . . . . . . . 92
3.3.1 Probabilistic motion models . . . . . . . . . . . . . . . 93
3.3.2 Using context in motion models . . . . . . . . . . . . . 96
3.4 Scene evolution . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.5 Summary and conclusion . . . . . . . . . . . . . . . . . . . . . 106
4 Cross-situational Learning 109
4.1 Parallel datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.2 Statistical translation models . . . . . . . . . . . . . . . . . . 113
4.2.1 Parameter estimation . . . . . . . . . . . . . . . . . . . 115
4.2.2 Applying translation models to captionized images . . 116
4.3 Co-occurrence statistics . . . . . . . . . . . . . . . . . . . . . . 123
4.3.1 Mixture models and clustering methods . . . . . . . . . 125
4.3.2 Likelihood ratio testing . . . . . . . . . . . . . . . . . . 128
4.4 Mutual information methods . . . . . . . . . . . . . . . . . . . 133
4.4.1 Learning an audio-visual lexicon . . . . . . . . . . . . . 133
4.4.2 Non-compositional compounds . . . . . . . . . . . . . . 135
4.5 Summary and conclusion . . . . . . . . . . . . . . . . . . . . . 140
5 System Control Strategies 143
5.1 Aspects of system control . . . . . . . . . . . . . . . . . . . . 143
5.1.1 Control theory . . . . . . . . . . . . . . . . . . . . . . 147
5.1.2 Rational agents . . . . . . . . . . . . . . . . . . . . . . 150
5.1.3 Coordination of multiple control processes . . . . . . . 151
5.1.4 User interaction and situation awareness . . . . . . . . 152
5.2 Production systems . . . . . . . . . . . . . . . . . . . . . . . . 153
5.2.1 Coding context in rules . . . . . . . . . . . . . . . . . . 154
5.2.2 Problem spaces . . . . . . . . . . . . . . . . . . . . . . 157
5.3 Frame-based systems . . . . . . . . . . . . . . . . . . . . . . . 159
5.3.1 Schema theory . . . . . . . . . . . . . . . . . . . . . . 160
5.3.2 Semantic networks . . . . . . . . . . . . . . . . . . . . 161
5.4 Utility-based approaches . . . . . . . . . . . . . . . . . . . 164
5.4.1 Utility-based classification . . . . . . . . . . . . . . 165
5.4.2 Markov Decision Processes . . . . . . . . . . . . . . . . 170
5.5 Summary and conclusion . . . . . . . . . . . . . . . . . . . . . 174
6 System Integration 175
6.1 Requirements for integrated systems . . . . . . . . . . . . . . 175
6.2 Behavior modules . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.3 Situation controller . . . . . . . . . . . . . . . . . . . . . . . . 180
6.4 Service-oriented architectures . . . . . . . . . . . . . . . . . . 183
6.5 Data-driven process coordination . . . . . . . . . . . . . . . . 184
6.6 Blackboards . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
6.7 Active memories . . . . . . . . . . . . . . . . . . . . . . . . . 187
6.7.1 The Active Memory infrastructure . . . . . . . . . . . 191
6.7.2 Coordinating memory processes in larger systems . . . 194
6.8 Summary and conclusion . . . . . . . . . . . . . . . . . . . . . 196
7 Summary and Outlook 199
A Mathematical details 205
A.1 Learning translation models . . . . . . . . . . . . . . . . . . . 205
A.2 Mutual information measures . . . . . . . . . . . . . . . . 207

Preface
We are currently witnessing a dramatic change in how people interact
with computing machinery. Although most people still use keyboard and
mouse with their desktop personal computers, more and more computational
devices that cannot easily be accessed through traditional human-computer
interfaces are entering our daily life. The interaction space grows from
the easily controlled virtual desktop to an uncontrolled physical
environment, which is the domain of natural human-human communication.
One reason for this is miniaturization: cell phones, for example, are
becoming more and more computationally powerful. Another reason is
distribution, or pervasiveness: computational units are integrated
everywhere in our environment without being noticed, and there is no
physical port for plugging in a monitor or a keyboard. A third reason is
embodiment. The appearance and movement of robotic toys, like the Sony
AIBO, of robotic interface agents, like the Philips iCat, or even of
humanoid robots, like the Honda Asimo, mimic human-like or animal-like
characters. They are situated in the physical world, not in a digital
world. Their embodiment and character style cause significant degrees of
anthropomorphism, i.e., the attribution of uniquely human characteristics
and qualities. An important part of this is the expectation to communicate
with these technical platforms in a human-style fashion. A fourth reason
is the availability of technology. Many of the devices mentioned have
built-in cameras and microphones, which resemble the most important and
richest sensory modalities of humans. As a consequence, there is an
economic and social pressure to use them and to produce devices that are
more fun to interact with.
Human-human communication has many facets. It is not only based on
language but is an inherently multi-modal affair that involves every sense
we have. Both the sender and the receiver make extensive use of these
senses in encoding as well as decoding a communicative goal or intention.
This can be verified in everyone's own personal experience. We even do it
when the communicative channel does not transmit the multi-modal content,
which might cause some irritation on the receiver's side. For example, a
person presenting a talk via laptop and projector points at his computer
screen although nobody can see it; a young child on the telephone with her
grandma shows her newest toy when asked about her birthday presents
although grandma can only hear her; two persons communicating through a
closed window verbally comment on what they show to each other although
nobody can hear anything. This looks irritating because the different
multi-modal cues produced by the actors relate and refer to each other.
They are situated in that they do not encode the full meaning on their
own; other cues from the current situation are needed in order to complete
their understanding.
This does not only hold for communicative situations; it is a more general
principle. Many computer vision techniques turn out to be fragile as soon
as they are taken even slightly outside the application scenario they were
designed for. This is an inherent problem that was noticed long ago.
Solutions can be based on different principles. First, we can add explicit
contextual knowledge to the system that is used to control the application
of image operators. Secondly, we can add contextual features that
influence or bias the classification decision of some interpretation
process. Thirdly, we can actively shape the capturing process in order to
make the interpretation process more invariant to context. Fourthly, we
can provide appropriate feedback about the current system performance so
that a potential user can change the current situation and thereby improve
system performance. All these strategies are different variations of
situated computer vision approaches. They become especially important if
computer vision results need to be communicated to a human user. In this
case, not all computer vision results matter; only some selective aspects
of a scene are of interest. For a complete interpretation, these need to
be related to the user's expectations. Thus, the visual interpretation
process becomes embedded in a kind of user-system dialog that can be
shaped by verbal statements as well as various other non-verbal contextual
cues. One example is joint attention, which leads to coupled capturing
processes between communication partners. Another example is prompting the
user with computer vision results. This establishes feedback loops that
give an idea of a successful or unsuccessful information exchange.
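The second strategy mentioned above, adding contextual features that bias
a classification decision, can be illustrated with a minimal sketch. It
assumes a simple Bayesian fusion of an appearance-based likelihood with a
scene-dependent prior; the class names and probabilities are invented for
illustration and are not taken from any particular system in this thesis.

```python
# Sketch: biasing an object classifier with a contextual scene prior.
# Assumes conditional independence of appearance and scene given the class.

def fuse_with_context(appearance_likelihood, scene_prior):
    """Combine per-class appearance scores P(image | class) with a
    scene-dependent prior P(class | scene) into a normalized posterior."""
    unnormalized = {c: appearance_likelihood[c] * scene_prior.get(c, 0.0)
                    for c in appearance_likelihood}
    z = sum(unnormalized.values())
    return {c: p / z for c, p in unnormalized.items()}

# Ambiguous appearance: the local detector cannot separate the two classes.
appearance = {"computer": 0.5, "toaster": 0.5}

# An office scene makes "computer" far more plausible than "toaster".
office_prior = {"computer": 0.9, "toaster": 0.1}

posterior = fuse_with_context(appearance, office_prior)
assert posterior["computer"] > posterior["toaster"]
```

With a fully ambiguous appearance score, the context alone decides the
outcome; with a confident detector, the prior merely nudges the result.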
The modeling of computer vision as a situated process is the general topic
of this thesis. First, we will relate it to general trends in the computer
vision community and to what has been found for the human perceptual
system. Psychological experiments have shown that context is used
extensively in the human brain. Visual understanding and language
production/perception influence each other at a very early stage of
processing. These aspects are discussed in Chapter 1.
Chapter 2 takes a more technical standpoint and discusses several tech-
niques for considering static scenes and their relations to objects. This is
continued in Chapter 3 for dynamic scenes. The interpretation of human
actions inherently involves context because its physical performance is di-
rected towards an environmental state change.
Situations group previously unrelated items into a coherent context.
Interpretation processes can exploit this grouping by making relations
between these items explicit. However, we can also focus on the dual
process, which infers the relations from many occurrences of situational
groupings. Thus, context is exploited for model acquisition and learning.
Semantic relations between words and visual items are a typical example,
and documents containing both are omnipresent, e.g., in the world wide
web. This topic is treated in Chapter 4.
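The idea of inferring word-object relations from many situational
groupings can be sketched with co-occurrence statistics. The snippet below
scores candidate word-label pairs by pointwise mutual information over a
toy set of "captioned images"; the data and scoring choice are invented
for illustration, not a description of a specific method from Chapter 4.

```python
# Sketch: cross-situational learning from parallel word/label data via
# pointwise mutual information (PMI) over document-level co-occurrence.
import math
from collections import Counter

def pmi_pairs(documents):
    """documents: list of (words, visual_labels) observed together."""
    n = len(documents)
    word_count, label_count, pair_count = Counter(), Counter(), Counter()
    for words, labels in documents:
        for w in set(words):
            word_count[w] += 1
        for v in set(labels):
            label_count[v] += 1
        for w in set(words):
            for v in set(labels):
                pair_count[(w, v)] += 1
    # PMI(w, v) = log P(w, v) / (P(w) P(v)), estimated from counts.
    return {(w, v): math.log((c / n) /
                             ((word_count[w] / n) * (label_count[v] / n)))
            for (w, v), c in pair_count.items()}

docs = [
    (["the", "red", "ball"], ["BALL"]),
    (["a", "ball", "rolls"], ["BALL", "TABLE"]),
    (["the", "cup"], ["CUP"]),
    (["cup", "on", "table"], ["CUP", "TABLE"]),
]
scores = pmi_pairs(docs)
# "ball" should associate more strongly with BALL than with TABLE.
assert scores[("ball", "BALL")] > scores[("ball", "TABLE")]
```

Even this crude estimator separates genuine word-object pairings from
incidental co-occurrences once enough situations have been observed.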
Frequently, context and situativity are also a matter of control. A system
that is embedded in the physical world continuously needs to react to
changing environmental conditions. It needs to make decisions that
irreversibly change its own environment. An optimal decision depends on
the current situation and might lead to a new situation. These aspects are
treated in Chapter 5. However, systems that perceive and interact with
their environment are too complex to be monolithic. They need to deal with
many things in parallel. Typically, control is distributed over several
components. Although several frameworks have been proposed that simplify
the component-based construction of larger systems, the system integration
task is frequently underestimated. It involves more aspects than control
alone. Chapter 6 discusses different principles and frameworks that keep
situated systems manageable.