Extraction and integration of Web query interfaces [Electronic resource] / Thomas Kabisch. Reviewers: Ulf Leser; Felix Naumann; Eberhard Rahm

139 pages

English

Published by: humboldt-universitat_zu_berlin
Published on: 01 January 2011

Extraction and Integration of Web Query Interfaces
DISSERTATION
submitted for the academic degree of
Doctor of Natural Sciences
(Dr. rer. nat.)
in Computer Science
to the
Mathematisch-Naturwissenschaftliche Fakultät II
Humboldt-Universität zu Berlin
by
Dipl.-Inf. Thomas Kabisch
President of Humboldt-Universität zu Berlin:
Prof. Dr. Jan-Hendrik Olbertz
Dean of the Mathematisch-Naturwissenschaftliche Fakultät II:
Prof. Dr. Peter Frensch
Reviewers:
1. Prof. Dr. Ulf Leser, Humboldt-Universität zu Berlin
2. Prof. Dr. Felix Naumann, Hasso-Plattner-Institut Potsdam
3. Prof. Dr. Eberhard Rahm, Universität Leipzig
Submitted on: 24.01.2011
Date of oral examination: 13.07.2011

Abstract
Databases on the Web offer large amounts of structured content from various
domains. Many popular Web applications, such as comparison shopping systems
or search engines, rely on programmatic access to, and/or integration of, the
content of such Web databases. With the rapid increase in the amount of data
available this way, techniques that support seamless programmatic access to Web
databases become increasingly important.
In contrast to relational databases, Web databases do not provide interfaces that
directly support programmatic access to the databases' content. Instead, the
interfaces are designed for human users. Therefore, in comparison with classical
database integration, the integration of Web databases requires additional effort
to transform Web interfaces into a machine-readable representation. Realizing this
transformation step is challenging because these interfaces in general do not
provide sufficient meta information about their elements, because they lack a
common structure, and because logical elements are almost indistinguishable from
representational elements.
This thesis focuses on the integration of Web query interfaces. We model the
integration process in several steps. First, unknown interfaces have to be
classified with respect to their application domain (classification); only then is
a domain-wise treatment possible. Second, interfaces must be transformed into a
machine-readable format (extraction) to allow their automated analysis. Third, as a
prerequisite to integration across databases, pairs of semantically similar elements
among multiple interfaces need to be identified (matching). Only when all these
tasks have been solved can systems that provide an integrated view of several data
sources be set up.
This thesis presents new algorithms for each of these steps. We developed a novel
extraction algorithm that exploits a small set of commonsense design rules to derive
a hierarchical schema for query interfaces. In contrast to prior solutions, which
mainly use flat schema representations, the hierarchical schema better represents
the structure of the interfaces, leading to better accuracy in the integration step.
Next, we describe a multi-step matching method for query interfaces that builds on
the hierarchical schema representation. It uses methods from the theory of bipartite
graphs to globally optimize the matching result. As a third contribution, we present
a new method for the domain classification problem of unknown interfaces that, for
the first time, combines lexical and structural properties of schemas. All our new
methods have been evaluated on real-life datasets and outperform previous work in
their respective fields. Additionally, we present the system VisQI, which implements
all introduced algorithmic steps and provides a comfortable graphical user interface
to support the integration process.
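
To illustrate why a global optimization over a bipartite graph can differ from picking the best pair first, the following is a minimal sketch and not the algorithm developed in the thesis. It assumes that the elements of two interfaces have already been scored pairwise; the similarity values in `sim`, the function names, and the exhaustive search over assignments are all hypothetical choices made for illustration and are practical only for very small inputs.

```python
from itertools import permutations

# Hypothetical similarity scores: sim[i][j] is the similarity between
# element i of interface A and element j of interface B (invented values).
sim = [
    [0.90, 0.80],
    [0.85, 0.10],
]

def greedy_matching(sim):
    """Local strategy: repeatedly take the highest remaining score."""
    scored = sorted(
        ((s, i, j) for i, row in enumerate(sim) for j, s in enumerate(row)),
        reverse=True,
    )
    used_rows, used_cols, pairs = set(), set(), []
    for _, i, j in scored:
        if i not in used_rows and j not in used_cols:
            pairs.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(pairs)

def global_matching(sim):
    """Global strategy: maximum-weight bipartite matching by exhaustive search.

    Assumes the number of rows does not exceed the number of columns.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    best = max(
        permutations(range(n_cols), n_rows),
        key=lambda cols: sum(sim[i][j] for i, j in enumerate(cols)),
    )
    return [(i, j) for i, j in enumerate(best)]

def total(sim, pairs):
    """Sum of the similarity scores of the chosen pairs."""
    return sum(sim[i][j] for i, j in pairs)

if __name__ == "__main__":
    for name, pairs in (("greedy", greedy_matching(sim)),
                        ("global", global_matching(sim))):
        print(name, pairs, round(total(sim, pairs), 2))
```

With these invented scores, the greedy strategy commits to the single best pair and is then forced into a poor second pair, whereas the global strategy accepts two slightly weaker pairs with a higher total. For realistic interface sizes, a polynomial-time assignment algorithm would replace the exhaustive search.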

Zusammenfassung (Summary)
Web databases contain large amounts of high-quality structured content. Many
popular applications, such as product comparison systems or search engines, require
methods for programmatic database access and for integrating the underlying
content. As the amount of data in Web databases grows rapidly, this problem becomes
increasingly important.
In contrast to relational databases, for example, Web databases generally do not
support programmatic access to their content through suitable interfaces.
Approaches that provide automated access to Web databases can only use the
interfaces designed for human interaction. Additional effort is therefore required
to transform the Web interfaces into a machine-readable form. Realizing this step is
complex because the Web interfaces provide no meta information about their elements
and have no uniform structure.
We distinguish two kinds of Web interfaces: the query interface (Web form) and the
result interface. The query interface lets the user interactively define parameters
for a database query. The result interface presents the database results in a form
suitable for the Web. This thesis focuses on the integration of query interfaces.
We identify several steps in the integration process. In the first step, unknown
query interfaces are analyzed with respect to their application domain to enable
domain-wise processing in the subsequent steps. In the second step, the query
interfaces are transformed into a machine-readable format (extraction). In the third
step, pairs of semantically equivalent elements are identified among the different
query interfaces to be integrated (matching). These steps form the basis for setting
up systems that offer an integrated view of the different data sources.
This thesis describes novel solutions for all three of these steps. The first
central contribution is an extraction algorithm that uses a small number of design
rules to derive schema trees. Compared with earlier solutions, which generally offer
only a flat schema representation, the schema tree is semantically richer because it
captures structural information in addition to the elements. The extraction
algorithm achieves better element-extraction quality than its predecessor methods.
The second contribution of the thesis is the development of a new matching method.
Here, representing the interfaces as schema trees improves on previous methods
because structural aspects also feed into the matching algorithm. In addition, a
global optimization is performed that builds on the theory of bipartite graphs.
As its third contribution, the thesis develops an algorithm for classifying
interfaces by application domain on the basis of the schema trees and the derived
matches. In addition, the system VisQI is presented, which implements the developed
algorithms and offers a comfortable graphical user interface to support the
integration process.

Contents
1 Introduction
1.1 Goals of this Thesis
1.1.1 Extraction of Query Interfaces
1.1.2 Matching of Query In
1.1.3 Domain Classification of Deep Web Sources
1.2 Example Step-by-Step
1.3 Contributions and Prior Work
1.3.1 Contributions of this Thesis
1.3.2 Assignment of Contributions to Authors
1.4 Structure of this Thesis
2 Fundamentals of Deep Web Integration
2.1 Information Integration
2.1.1 Architectures of Virtual Integration Systems
2.1.2 Schema Management
2.1.3 Query Processing
2.2 The Deep Web
2.2.1 Definitions
2.2.2 Technologies of the Web
2.2.3 Tec versus Reality
2.2.4 Challenges of Deep Web Integration
2.3 Related Aspects of Deep Web Integration
2.3.1 Structural Result Wrapping
2.3.2 Building Unified Query Interfaces
2.3.3 Federated Web Information Systems
2.3.4 Evolution of Web Pages
2.3.5 Deep Web Crawlers
2.3.6 Mashups and Entity Search