La lecture en ligne est gratuite
Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres
Télécharger Lire

Large-scale pattern-based information extraction from the world wide web [Elektronische Ressource] / von Sebastian Blohm

De
223 pages
Ajouté le : 01 janvier 2010
Lecture(s) : 12
Signaler un abus

Large-ScalePattern-BasedInformationExtraction
fromtheWorldWideWeb
ZurErlangungdesakademischenGradeseines
DoktorsderWirtschaftswissenschaften
(Dr. rer. pol.)
vonderFakulta¨tfu¨r
Wirtschaftswissenschaften
am KarlsruherInstitutfu¨rTechnologie
vorgelegte
DISSERTATION
von
M.Sc. SebastianBlohm
Tagdermu¨ndlichenPru¨fung: 22. Januar2010
Refernt: Prof. Dr. RudiStuder
Korreferent: Prof. Dr. Dr. LarsSchmidt-Thieme
2010KarlsruheiiAbstract
Extracting information from text is the task of obtaining structured, machine-
processable facts from information that is mentioned in an unstructured manner. It
thus allows systems to automatically aggregate information for further analysis, effi-
cient retrieval, automaticvalidation, or appropriatevisualization. InformationExtrac-
tionsystemsrequireamodelthatdescribeshowto identifyrelevanttargetinformation
in texts. Thesemodelsneed to beadaptedto theexactnatureof thetargetinformation
and to the nature of the textual input, which is typically accomplished by means of
MachineLearningtechniquesthatgeneratesuch modelsbasedon examples. Onepar-
ticulartypeofInformationExtractionmodelsaretextualpatterns. Textualpatternsare
underspecifiedexplicitdescriptionsoftextfragments. Theautomaticinductionofsuch
patternsfromexampletextfragmentswhichareknowntocontaintargetinformationis
acommonwaytolearnthistypeofextractionmodels.
This thesis explores the potential of using textual patterns for InformationExtrac-
tionfromtheWorldWideWeb. Wereviewanddiscussalargebodyofrelatedworkby
describing it within a common framework. Then, we empirically analyze the effects
of a multitude of design choices in pattern-based Information Extraction systems. In
particular,weinvestigatehowpatternscanbefilteredappropriately.Weshowhowcor-
poraofdifferentnaturecanbeexploitedbeneficiallyandhowthenatureofthepatterns
influencesextraction quality. Finally, we present new ways of mining textual patterns
bymodellingpatterninductionasawell-understoodtypeofDataMiningproblems.
iiiivAcknowledgements
I am indebted to many people who guided and supported me during working towards
myPh.D.andwritingthisthesis. Mostprominently,thesearemyadvisorsRudiStuder
and Philipp Cimiano. Rudi Studer gave me the chance to do this research and the
guidance, trust and freedom I needed to complete it and learn a lot. Philipp Cimiano
madethisworkpossiblethroughinvaluablediscussions,ideasandoptimism.
Furthermore, I would like to thank my colleagues at the AIFB institute and in the
X-Media project. The friendly, focused environment and the chance to build on a
large body of previous experience was a great asset to me. Most of all, I would like
to thank Johanna Vo¨lker, Krisztian Buza and Frank Dengler for their commitment,
trustand patienceduringourcollaborationsandSebastian Rudolphfor commentsand
discussionsduringtheproductionofthisthesis.
Additionally,I am gratefulto YunyaoLi, ThomasHampp and Shiv Vaithyanathan
and their colleagues at IBM for an intense collaboration during my stay at the Un-
structured Information Mining group in Almaden which taught me entirely different
perspectivesonmyresearch.
Most of the text in this thesis was written during my stay at TU Delft. I would
like to thank Ursula and Philipp Cimiano and the Web InformationSystems group of
Geert-JanHoubenfortheirhospitality.
I owe a lot to Kathrin Heuser, Olesya Isaenko, Maria Maleshkova, Tobias Hauth,
Stefan Kittler, Pascal Kretschmann,Egon Stemle, Ju¨rgenUmbrich and AndreasWag-
ner who contributed their ideas and a lot of labor to my research and project work as
thesisstudentsorassistants inourlab.
Finally, I thank my parents and my sisters for supporting and motivating me not
onlybutmoreintenselyduringmyworktowardsthisthesis.
My Ph.D. studies were financially supported by two generous Ph.D. Fellowship
Awards from IBM and a travel grant from the Karlsruhe House of Young Scientists.
Duringmy Ph.D. studies I worked in the X-Mediaprojectsponsoredby the European
CommissionaspartoftheInformationSociety Technologies(IST)programunderEC
grantnumberIST-FP6-026978.
vviContents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 ProblemStatement . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 WhatisSpecialaboutOperatingatWebScale? . . . . . . . . . . . . 4
1.4 Trendsin theFieldofInformationExtraction . . . . . . . . . . . . . 4
1.5 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Reader’sGuide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 PublishedResults . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
I Preliminaries 9
2 MethodologicalandTechnicalFoundations 11
2.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 NaturalLanguageProcessing . . . . . . . . . . . . . . . . . . . . . . 14
2.3 MachineLearningandDataMining . . . . . . . . . . . . . . . . . . 20
2.4 InformationRetrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 InformationExtractionTasks 31
3.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 DimensionsofInformationExtractionTasks . . . . . . . . . . . . . . 32
3.3 Prominentextractiontasks . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 ChallengesinIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5 FocusofthisThesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4 ApproachestoInformationExtraction 39
4.1 ApplicationsandEvaluation . . . . . . . . . . . . . . . . . . . . . . 40
4.2 MachineLearningforInformationExtraction . . . . . . . . . . . . . 44
4.3 InformationExtractionandtheSemanticWeb . . . . . . . . . . . . . 54
II Large-ScaleExtractionMethods 59
5 TheIterativePatternInductionFramework 61
5.1 FrameworkOverview . . . . . . . . . . . . . . . . . . . . . . . . . . 62
viiviii CONTENTS
5.2 PatternsforRelationExtraction. . . . . . . . . . . . . . . . . . . . . 63
5.3 TheAlgorithmicFramework . . . . . . . . . . . . . . . . . . . . . . 65
5.4 AssumptionsandChallenges . . . . . . . . . . . . . . . . . . . . . . 67
5.5 TheProntoSystem . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.6 Related ExtractionSystems . . . . . . . . . . . . . . . . . . . . . . . 72
5.7 EvaluationParadigms . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.8 PerformanceofSystemsintheLiterature . . . . . . . . . . . . . . . 91
6 ControllingtheQualityofInducedPatterns 93
6.1 FilteringFunctions . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2 ExperimentalSetup . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 AnalysisofResults . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7 TextCorpusandExtractionDynamics 113
7.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.2 TheProblemofLowRedundancy . . . . . . . . . . . . . . . . . . . 114
7.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4 ExperimentalEvaluation . . . . . . . . . . . . . . . . . . . . . . . . 121
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
8 EfficientPatternInductionwithDMMethods 127
8.1 PatternInductionasFrequentItemsetMining . . . . . . . . . . . . . 129
8.2 ExperimentalEvaluation . . . . . . . . . . . . . . . . . . . . . . . . 135
8.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
9 PatternExpressivity 141
9.1 TheRoleofPatternExpressivity . . . . . . . . . . . . . . . . . . . . 143
9.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
9.3 TaxonomicSequentialPatterns . . . . . . . . . . . . . . . . . . . . . 146
9.4 PatternMining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
9.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
9.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
III Applications 163
10 Web-wideIEforMarketAnalysis 165
10.1 TheCompetitorScenarioForecastTask . . . . . . . . . . . . . . . . 165
10.2 InformationExtractionforCSF . . . . . . . . . . . . . . . . . . . . . 166
10.3 PracticalExperience . . . . . . . . . . . . . . . . . . . . . . . . . . 168
10.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171CONTENTS ix
11 CommunitiesGeneratingStructuredKnowledge 173
11.1 CombiningHumanandMachineIntelligence . . . . . . . . . . . . . 174
11.2 SystemDesignandImplementation . . . . . . . . . . . . . . . . . . 177
11.3 PracticalExperiences . . . . . . . . . . . . . . . . . . . . . . . . . . 179
11.4 RelatedWork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
11.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
IV Conclusion 183
12 SynopsisofResults 185
12.1 ControllingQualityofIterativePatternInduction . . . . . . . . . . . 185
12.2 SupervisionandRedundancy . . . . . . . . . . . . . . . . . . . . . . 186
12.3 Rich PatternsandScalableInduction . . . . . . . . . . . . . . . . . . 186
12.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
13 Outlook 189
13.1 ApplicationScenarios . . . . . . . . . . . . . . . . . . . . . . . . . . 189
13.2 AdvancingIEMethods . . . . . . . . . . . . . . . . . . . . . . . . . 190
Appendix 195
References 197x CONTENTS