Instance-based ontology matching and the evaluation of matching systems [electronic resource] / submitted by Katrin Simone Zaiß

heinrich-heine-universitat_dusseldorf

131 pages

English

Description

Subjects

Computer science

Information

Published by	heinrich-heine-universitat_dusseldorf
Published on	1 January 2010
Number of reads	17
Language	English
File size	1 MB

Excerpt

Instance-Based Ontology
Matching and the Evaluation of
Matching Systems
Inaugural dissertation
for the attainment of the doctoral degree of the
Faculty of Mathematics and Natural Sciences
of Heinrich Heine University Düsseldorf
submitted by
Katrin Simone Zaiß
from Düsseldorf
November 2010

From the Institute of Computer Science
of Heinrich Heine University Düsseldorf
Printed with the permission of the
Faculty of Mathematics and Natural Sciences of
Heinrich Heine University Düsseldorf
Referee: Prof. Dr. Stefan Conrad
Co-referee: Prof. Dr. Martin Lercher
Date of the oral examination: 10.12.2010

The Answer to the Ultimate Question of
Life, the Universe, and Everything:
101010
(loosely based on Douglas Adams’ novel
The Hitchhiker’s Guide to the Galaxy)

Acknowledgements
This thesis is the result of three and a half years of research at the Databases and
Information Systems Group of the Department of Computer Science at the Heinrich
Heine University of Düsseldorf.
First of all, I thank my advisor and first referee Prof. Dr. Stefan Conrad for
supporting me in my research and for creating a motivating and comfortable working
atmosphere. I am very thankful for the opportunity to work under his supervision
and to learn from his experience. I also thank the second reviewer of this thesis,
Prof. Dr. Martin Lercher, for his interest in my work and his willingness to be the
second referee.
My special thanks go to my colleagues and friends, Sadet Alcic and Tim Schlüter,
who helped me with their technical knowledge and their patience, and who brightened
each working day with their great sense of humor. It was a pleasure to share the office
with Sadet, and I thank Tim for the joint work. Additionally, I extend my thanks
to my new colleagues, Ludmila Himmelspach, Jiwu Zhao and Thomas Scholz, and to
my former colleague, Johanna Vompras, who supported me with her experience.
Special thanks to Guido Königstein, Marga Potthoff, and Sabine Freese for their
guidance in technical and administrative issues.
My deepest and warmest thanks go to my family, my parents Sabine and Kurt Zaiß,
my brother Max and my sister Alexa. They support me in every situation, and without
their love and moral support this thesis would not have been possible.
Last but not least, I deeply want to thank Andre Teloo for supporting me each and
every day with his patience, his understanding and his ability to always look on the
bright side of life.
Düsseldorf, Germany
November 2010
Katrin Simone Zaiß

Abstract
The matching of heterogeneous information sources is a crucial task in many different
domains. Matching systems have been developed to find relations between pieces of
information that are annotated using different structures and formats. Over the past
two decades, ontologies have become increasingly important as a way to represent the
semantics of information in a machine-readable and machine-processable way.
Hence, many ontology matching systems have been developed as well, which make use
of the different parts of ontologies to resolve the heterogeneities. Most systems focus on
exploiting schema or structure information, but ontologies also provide instances,
which express the semantics of a concept independently of its meta information.
Current instance-based matching methods leave room for improvement in several aspects.
Matching systems also need to be evaluated using appropriate test data, and existing
benchmarks are not sufficient for testing instance-based methods. In this thesis, we
focus on the development of instance-based matching methods, their combination with
schema- and structure-based methods, and their evaluation.
We introduce two novel instance-based matching methods. The first method uses
regular expressions or sample values to characterize the concepts of an ontology
by their instance sets. The second approach uses the instance sets to calculate many
different features, such as the average length or the set of frequent values. Both
approaches finally compare the characterizations, i.e. the regular expressions or the
features, to obtain similarities between the entity sets of two (or more) ontologies.
An alignment between the ontologies is then obtained by examining the similarity set.
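
To make the idea of characterizing concepts by their instance sets more concrete, the following is a minimal sketch and not the method developed in the thesis. The two features (average length, share of purely numeric values), the similarity formula, the threshold of 0.8 and all function names are assumptions chosen for illustration only.

```python
import re
from statistics import mean


def features(instances):
    """Characterize a concept by two illustrative features of its instances."""
    values = [str(v) for v in instances]
    if not values:
        raise ValueError("a concept needs at least one instance")
    return {
        "avg_length": mean(len(v) for v in values),
        "numeric_ratio": sum(v.isdigit() for v in values) / len(values),
    }


def feature_similarity(instances_a, instances_b):
    """Compare two feature characterizations; returns a value in [0, 1]."""
    fa, fb = features(instances_a), features(instances_b)
    longest = max(fa["avg_length"], fb["avg_length"])
    # Ratio of the shorter to the longer average length (1.0 if both are 0).
    len_sim = min(fa["avg_length"], fb["avg_length"]) / longest if longest else 1.0
    num_sim = 1.0 - abs(fa["numeric_ratio"] - fb["numeric_ratio"])
    return (len_sim + num_sim) / 2


def regex_coverage(pattern, instances):
    """Fraction of instances that are fully matched by a regular expression."""
    compiled = re.compile(pattern)
    values = [str(v) for v in instances]
    return sum(compiled.fullmatch(v) is not None for v in values) / len(values)


def match(onto_a, onto_b, threshold=0.8):
    """Map each concept of onto_a to its most similar concept of onto_b.

    Both ontologies are given as {concept name: list of instance values}.
    Pairs below the (assumed) threshold are left out of the alignment.
    """
    alignment = {}
    for name_a, inst_a in onto_a.items():
        scores = {
            name_b: feature_similarity(inst_a, inst_b)
            for name_b, inst_b in onto_b.items()
        }
        if not scores:
            continue
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            alignment[name_a] = best
    return alignment


zip_a = ["40225", "40210", "50667"]
plz_b = ["10115", "80331", "20095"]
name_b = ["Anna", "Bernd", "Clara"]

print(regex_coverage(r"\d{5}", plz_b))                        # 1.0
print(match({"zip": zip_a}, {"plz": plz_b, "name": name_b}))  # {'zip': 'plz'}
```

In this toy example the concepts `zip` and `plz` are aligned although their labels differ, because their instance values look alike. That is the kind of evidence instance-based methods can add to schema- and structure-based ones.
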
In order to test single matching methods or complex matching systems, well-defined
test benchmarks have to be available, preferably including the correct alignments to
facilitate the evaluation. Current benchmarks do not enable extensive studies of
instance-based methods, because the number of instances is far too low. We
present an additional benchmark, ONTOBI, which can be used to test instance-based
methods as well as all other kinds of matching algorithms or systems.
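
Why a benchmark should ship the correct alignments can be illustrated with a short sketch. It compares a computed alignment with a reference alignment using precision, recall and F-measure, which are commonly used for this purpose. The abstract does not say which measures the thesis uses, so the measures, the function name `evaluate` and the sample correspondences are assumptions.

```python
def evaluate(found, reference):
    """Return (precision, recall, f_measure) of a found alignment.

    Both alignments are collections of (entity_a, entity_b) pairs.
    """
    found, reference = set(found), set(reference)
    correct = found & reference
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical correspondences, for illustration only.
found = {("zip", "plz"), ("name", "title")}
reference = {("zip", "plz"), ("name", "label"), ("city", "ort")}

precision, recall, f_measure = evaluate(found, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f={f_measure:.2f}")
# precision=0.50 recall=0.33 f=0.40
```

Without a reference alignment, the set `reference` would have to be built by hand for every test case, which is why its inclusion in a benchmark simplifies evaluation.
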
Finally, we present MICU, a complex matching system which unifies the advantages of
instance-, schema- and structure-based matching methods combined with efficient
user feedback interaction. In order to speed up the process, alignments of previous
matching cycles are reused.

Summary (Zusammenfassung)
Heterogeneous information sources are found in many different fields, and matching
these sources is a frequently needed process. Matching systems have been developed to
find the connections between different pieces of information, which may be formulated
and structured differently. Over the past two decades, ontologies have become
increasingly popular as a structure for representing knowledge in a machine-readable
and machine-processable way. Consequently, many ontology matching systems have been
developed as well; they examine the different elements of ontologies in order to
resolve the heterogeneities between them. Most systems rely mainly on schema and
structure information, although ontologies also contain instances, which describe the
meaning of concepts independently of any meta information. The instance-based methods
that exist so far still leave some room for improvement. This thesis deals with the
development of new instance-based methods, their combination with schema- and
structure-based methods, and their evaluation.
First, two new instance-based methods are presented. The first approach uses regular
expressions or sample values to characterize the concepts of an ontology by means of
their respective instance sets. The second method computes various features from the
instance set, such as the average length or the set of most frequent values. In both
cases, the characteristics, i.e. the regular expressions or the feature values, are
compared in order to compute a similarity between the different elements of the two
ontologies. The pairwise similarities are then used to find the correspondences
between the ontologies.
Testing individual methods or complex matching systems requires suitable test data
sets, which should ideally also include the set of reference correspondences. However,
the benchmarks available so far do not allow instance-based methods to be tested
extensively, because they do not contain enough instances. With ONTOBI we present an
additional benchmark that can be used to test instance-based methods as well as all
other kinds of matching methods or systems.
Finally, with MICU we present a complex matching system that combines the advantages
of instance-based methods with those of schema- and structure-based methods while
cooperating efficiently with the user. In addition, the results of earlier matching
runs are reused.