Robust video content analysis via transductive learning methods [electronic resource] / by Ralph Ewerth

philipps-universitat_marburg - Ewerth

282 pages

English

Description

Subjects

Computer science

Information

Published by	philipps-universitat_marburg
Published on	01 January 2008
Number of reads	8
Language	English
File size	13 MB

Excerpt

Robust Video Content Analysis via
Transductive Learning Methods

Dissertation

for the degree of
Doctor of Natural Sciences
(Dr. rer. nat.)
submitted to the
Department of Mathematics and Computer Science
of Philipps-Universität Marburg

by
Ralph Ewerth
from Hanau

Marburg/Lahn, 2008

First reviewer: Prof. Dr. Bernd Freisleben
Second reviewer: Prof. Dr. Otthein Herzog

Date of oral examination:

TO MY MOTHER
ABSTRACT
Several technological innovations, such as increased hard disk capacities, improved network
bandwidth and mobile multimedia devices, have fostered an enormous increase in multimedia data
in recent years. The growing amount of multimedia data raises the question of how to efficiently
index, summarize and retrieve multimedia content. The automatic understanding of multimedia
content that this requires remains an unsolved problem in practice. In addition, the variability
of multimedia sources and content is enormous, and this also holds for video databases.
The research question addressed by this thesis is how to build robust video content analysis and
indexing approaches that work reliably on arbitrary videos. Many video content analysis approaches
are considered to be “robust” by their inventors. However, in most cases this means that an
algorithm or system has proven to work well on one or more (hopefully large) test sets.
Furthermore, a single classification model or decision threshold is often applied to all test videos in
the same way; this may be a model learned with machine learning techniques or a set of predefined
parameters that have been estimated empirically. This is a problem unless video databases are
restricted in some respect, since videos can vary in many ways: in terms of
recording devices, recording circumstances, the compression technology used, editing layout,
genre and, of course, content. Hence, there is a need for the development of algorithms
that work reliably and independently of the factors mentioned above – a robust video content
analysis and indexing algorithm should automatically adapt to a particular video with respect to the
video content, editing layout and so on, and its indexing quality should not depend on compression
artefacts that are present in a given video.
This thesis investigates how a high-quality analysis and indexing result can be obtained or improved
for a particular given video by considering the context of content and compression appropriately.
One of the major contributions of this thesis is to consider the analysis process for a particular
video as a setting that is well suited for transductive learning. Transductive learning is not aimed at
obtaining a general classification function for all possible test data points (as in inductive learning)
but at obtaining a specific classification for the given test data only. The idea is that the desired
classification function has to be optimal for the unlabeled test data only and not in general (as in the
case of inductive learning). In this thesis, this idea is applied to achieve robust video content
analysis: the unlabeled data of a particular, previously unseen video are incorporated into the
learning and classification process. For this purpose, a self-/semi-supervised learning ensemble
framework is presented that exploits an initial classification (or clustering) result to improve its
quality for a particular video. The proposed framework is based on feature selection and ensemble
classification; it is called self-supervised when the baseline approach relies on unsupervised learning,
and it is called semi-supervised when the baseline approach relies on supervised learning. Within the
scope of this thesis, solutions for several video content analysis and video indexing problems are
presented. Apart from the solutions that are based on the proposed learning framework, some
proposals in this thesis employ unsupervised learning or deal with compression artefacts
adequately. Overall, the following tasks are considered: shot boundary detection, estimation of
camera motion, face recognition, semantic concept detection and semantic indexing of computer
game sequences. Several strategies are investigated to utilize the transductive setting in order to
obtain better results for different video content analysis tasks: (1) dealing with compression
artefacts (video cut detection, camera motion estimation); (2) estimating parameters automatically
(video cut detection); (3) applying self-supervised learning (video cut detection, face
recognition/clustering); (4) applying semi-supervised learning (semantic video retrieval, semantic
video indexing); (5) applying transductive support vector machines (SVM) (semantic video
retrieval). Experimental results on large test sets (which are publicly available in most cases)
demonstrate the very good performance of the proposed approaches. In particular, the ensemble
version of the proposed framework works reliably for all considered video content analysis tasks, in
contrast to the realization of the framework using a single self-/semi-supervised classifier and in
contrast to the transductive SVM. Finally, the thesis concludes with a summary of the
contributions and an outline of areas for future work.
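
The transductive idea described in the abstract, reusing an initial classification of one video to
train a classifier specific to that video, can be illustrated with a minimal self-training sketch.
This is not the framework proposed in the thesis: it omits the feature selection and ensemble
steps, and it swaps in a simple nearest-centroid classifier. The function name
`self_train_on_video`, the `confident` cut-off and the `rounds` parameter are hypothetical and
chosen only for illustration.

```python
from statistics import mean


def centroid(rows):
    """Component-wise mean of a list of equally long feature vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def self_train_on_video(features, baseline_scores, confident=0.8, rounds=3):
    """Relabel the samples of ONE video using its own unlabeled data.

    features:        one feature vector per sample (e.g. per frame or shot)
    baseline_scores: scores in [0, 1] from a generic baseline classifier
    confident:       hypothetical cut-off; only scores >= confident or
                     <= 1 - confident are trusted as initial pseudo-labels
    """
    # Step 1: keep only the confident baseline decisions as pseudo-labels.
    labels = [
        1 if s >= confident else 0 if s <= 1 - confident else None
        for s in baseline_scores
    ]
    for _ in range(rounds):
        pos = [f for f, lab in zip(features, labels) if lab == 1]
        neg = [f for f, lab in zip(features, labels) if lab == 0]
        if not pos or not neg:
            # No usable pseudo-labels for one class: fall back to the baseline.
            return [1 if s >= 0.5 else 0 for s in baseline_scores]
        # Step 2: fit a video-specific nearest-centroid model.
        c_pos, c_neg = centroid(pos), centroid(neg)
        # Step 3: reclassify every sample of this video with that model.
        labels = [
            1 if sq_dist(f, c_pos) < sq_dist(f, c_neg) else 0 for f in features
        ]
    return labels


if __name__ == "__main__":
    # Toy video: the baseline is unsure (scores near 0.5) about four samples.
    feats = [[0.0], [0.2], [0.1], [5.0], [5.2], [4.9], [0.3], [4.8]]
    scores = [0.1, 0.05, 0.5, 0.9, 0.95, 0.5, 0.4, 0.6]
    print(self_train_on_video(feats, scores))
```

The resulting labels apply only to the given video. In line with the transductive setting, no
attempt is made to obtain a classifier that generalizes to other videos.
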
ZUSAMMENFASSUNG (SUMMARY)

In recent years, the amount of multimedia data in computer applications and on the Internet has
grown strongly. This is a consequence of several technological developments that led to larger
hard disk capacities, better network bandwidths and more efficient compression methods for
multimedia data. Not least, mobile devices with multimedia functions (e.g. mobile phones, digital
cameras) have become more widespread; these devices can themselves generate, receive and send
multimedia data. With the steadily growing amount of multimedia data, however, the need to
efficiently index, summarize and search such data by content grows as well. Yet the
computer-based automatic understanding of arbitrary multimedia content that this requires remains
an unsolved problem, owing to the wide variability of multimedia content and sources. The possible
variations are also manifold in the case of video recordings; this holds both for videos made
available for download on the Internet and for film and television recordings. Videos can belong
to different genres, the recording and compression quality can differ greatly, and, not least,
arbitrary and very different content can be presented by means of a video. This leads directly to
the question of whether and how algorithms for automatic video analysis and video indexing can be
developed so that their annotation quality is the best possible for a single video with arbitrary
content. Many systems proposed in the literature are called “robust” by their respective authors.
In most cases, however, this merely means that such a system achieved a good result on a
sufficiently large test set of videos. As a rule, a classification model created by machine
learning, or an empirically determined threshold, is applied in the same way to every video of
such a test set. Obviously, this cannot always lead to the best possible results if, for instance,
the analyzed videos do not share the same properties or do not come from the same genre. This
shows the need for algorithms that work reliably for any video, regardless of which genre a video
belongs to or how it was compressed, etc.: a truly robust video analysis algorithm should adapt
automatically to the respective characteristics of a video with respect to content, editing
artefacts, compression and so on.
This dissertation investigates how a high-quality initial analysis or indexing result for a video
of arbitrary content can be achieved or improved by taking into account the context of content and
compression. One of the key ideas of this work is to regard the analysis process of a video as a
transductive learning scenario: transductive learning aims at creating a learning and
classification model that the
given