Knowledge-intensive subgroup mining [Elektronische Ressource] : techniques for automatic and interactive discovery / vorgelegt von Martin Atzmüller

julius-maximilians-universitat_wurzburg

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

222 pages

English

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

A propos
Informations
Extrait

Description

Sujets

Informatik

Informations

Publié par	julius-maximilians-universitat_wurzburg
Publié le	01 janvier 2006
Nombre de lectures	14
Langue	English
Poids de l'ouvrage	7 Mo

Extrait

Knowledge IntensiveSubgroupMining–
TechniquesforAutomatic
andInteractiveDiscovery
DissertationzurErlangungdes
naturwissenschaftlichenDoktorgrades
derBayerischenJulius–Maximilians–Universitat¨ Wurzb¨ urg
vorgelegtvon
MartinAtzmuller¨
aus
BadKissingen
Wurzb¨ urg,2006Eingereichtam: 01.06.2006
beiderFakultat¨ fur¨ MathematikundInformatik
1. Gutachter: Prof. Dr. FrankPuppe
2. Prof. Dr. KatharinaMorik
3. Gutachter: Prof. Dr. NadaLavrac
Tagdermundlichen¨ Prufung:¨ 15.12.2006Abstract
Data mining has proved its signiﬁcance in various domains and applications. As an im
portant subﬁeld of the general data mining task, subgroup mining can be used, e.g., for
marketing purposes in business domains, or for quality proﬁling and analysis in medical
domains. The goal is to efﬁciently discover novel, potentially useful and ultimately inter-
esting knowledge. However, in real world situations these requirements often cannot be
fulﬁlled,e.g.,iftheappliedmethodsdonotscaleforlargedatasets,iftoomanyresultsare
presentedtotheuser,orifmanyofthediscoveredpatternsarealreadyknowntotheuser.
Thisthesisproposesacombinationofseveraltechniquesinordertocopewiththesketched
problems: We discuss automatic methods, including heuristic and exhaustive approaches,
and especially present the novel SD Map algorithm for exhaustive subgroup discovery
that is fast and effective. For an interactive approach we describe techniques for sub
groupintrospectionandanalysis,andwepresentadvancedvisualizationmethods,e.g.,the
zoomtable that directly shows the most important parameters of a subgroup and that can
beusedforoptimizationandexploration. Wealsodescribevariousvisualizationsforsub
group comparison and evaluation in order to support the user during these essential steps.
Furthermore,weproposetoincludepossiblyavailablebackgroundknowledgethatiseasy
to formalize into the mining process. We can utilize the knowledge in many ways: To fo
custhesearchprocess,torestrictthesearchspace,andultimatelytoincreasetheefﬁciency
of the discovery method. We especially present background knowledge to be applied for
ﬁlteringtheelementsoftheproblemdomain,forconstructingabstractions,foraggregating
values of attributes, and for the post processing of the discovered set of patterns. Finally,
thetechniquesarecombinedintoaknowledge intensiveprocesssupportingbothautomatic
and interactive methods for subgroup mining. The practical signiﬁcance of the proposed
approach strongly depends on the available tools. We introduce the VIKAMINE system
asahighly integratedenvironmentforknowledge intensiveactivesubgroupmining.
Also,wepresentanevaluationconsistingoftwoparts: Withrespecttoobjectiveevaluation
criteria, i.e., comparing the efﬁciency and the effectiveness of the subgroup discovery
methods, we provide an experimental evaluation using generated data. For that task we
present a novel data generator that allows a simple and intuitive speciﬁcation of the data
characteristics. TheresultsoftheexperimentalevaluationindicatethatthenovelSD Map
methodoutperformstheotherdescribedalgorithmsusingdatasetssimilartotheintended
application concerning the efﬁciency, and also with respect to precision and recall for the
heuristicmethods. Subjectiveevaluationcriteriaincludetheuseracceptance,thebeneﬁtof
the approach, and the interestingness of the results. We present ﬁve case studies utilizing
the presented techniques: The approach has been successfully implemented in medical
andtechnicalapplicationsusingreal worlddatasets. Themethodwasverywellaccepted
bytheusersthatwereabletodiscovernovel,useful,andinterestingknowledge.Acknowledgments
My work was supported by many people: Without their helpful advice and encourage
ment, it would have been impossible to ﬁnish this thesis. First, I would like to thank my
supervisorProf.Dr.FrankPuppefromtheUniversityofWurzb¨ urg: Heprovidedmewitha
perfectenvironment of research withopen doorsand openminds. Inspiring mein fruitful
discussions, he has been a source of invaluable ideas. Furthermore, I want to thank the
reviewers Prof. Dr. Nada Lavrac from the Jozef Stefan Institute, Ljubljana, Slovenia, and
Prof.Dr.KatharinaMorikfromtheUniversityofDortmund,Germany.
Additionally, I want to thank my colleagues of the AI working group at the University of
¨Wurzburg, especially Joachim Baumeister for being my partner in the learning methods
dream team. He supported my work by contributing ideas, discussing countless issues,
andwritingpublicationswithme. AlexanderHornlein,¨ ManuelFehler,NormanBrummer¨ ,
MartinSchuhmann,RainerHerrler,FranziskaKlugl Frohnme¨ yerandmyothercolleagues
also supported me during the years. Last but not least, I want to thank Petra Braun -
the good heart of our group - who especially helped me cope with various problems of
bureaucracy and other things of ordinary life. The practical parts of my work would have
notbeenpossiblewithoutProf.Dr.Hans PeterBuscher(DRK KlinikenBerlin K openick)¨
and Achim Hemsing (Department of Prosthodontics, University of Wurzb¨ urg) who were
mypartnersinthemedicalprojects. Iamgratefulfortheirvaluableandconstructiveinput
tomywork. FurtherthanksgotoGeorgBuscher,forhissupportwiththemedicaldatasets.
I would especially like to thank Julia Eischer, Joachim Baumeister, Alexander Hornlein,¨
ManuelFehler,andNormanBrummer¨ forreadingandpolishingpartsofmythesis.
I thank my family for their encouragement, their support, and for providing me with an
anchor of life, especially my father and mother, and my brother. Finally, I owe the most
important gratitude to my wife Irene. She supported me in our private life and always
encouragedmewhileﬁnishingthiswork.
Ithankyouall.
MartinAtzmuller¨
Wurzb¨ urg,2006Contents
I Introduction 1
1 Motivation 3
1.1 ResearchGoalandContext . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 OverviewoftheApproach . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 AlgorithmsforSubgroupDiscovery . . . . . . . . . . . . . . . . 5
1.2.2 ExploitingBackgroundKnowledge . . . . . . . . . . . . . . . . 5
1.2.3 SubgroupIntrospectionandAnalysis . . . . . . . . . . . . . . . 6
1.2.4 VisualizationMethods . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.5 Knowledge IntensiveActiveSubgroupMining . . . . . . . . . . 7
1.3 CriteriaandProblemsofanEvaluation . . . . . . . . . . . . . . . . . . . 8
1.3.1 Predictivevs. DescriptiveDataMining . . . . . . . . . . . . . . 8
1.3.2 EvaluatingtheEffectivenessandtheEfﬁciency . . . . . . . . . . 8
1.3.3 UserAcceptanceoftheDiscoveredPatterns . . . . . . . . . . . . 9
1.3.4 IncludingSubjectiveInterestingnessMeasures . . . . . . . . . . 10
1.4 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 StructureofthisWork . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 TheKnowledge IntensiveActiveSubgroupMiningProcess 15
2.1 MotivationforKnowledge IntensiveActiveSubgroupMining . . . . . . 15
2.2 SubgroupMining–AnOverview . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 RelationtoKnowledgeDiscoveryinDatabases . . . . . . . . . . 15
2.2.2 SubgroupDiscovery . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.3 TheGeneralSubgroupMiningProcess . . . . . . . . . . . . . . 18
2.3 ProcessModelforKnowledge IntensiveActiveSubgroupMining . . . . 19
2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
II Methods 21
3 SubgroupDiscovery 23
3.1 TheSubgroupDiscoveryApproach . . . . . . . . . . . . . . . . . . . . 23
3.1.1 DeﬁnitionofBasicOntologicalKnowledge . . . . . . . . . . . . 24
3.1.2 TargetVariable . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.3 SubgroupDiscoveryContext . . . . . . . . . . . . . . . . . . . . 25
3.1.4DescriptionLanguage . . . . . . . . . . . . . . . . . . 25vi Contents
3.1.5 SubgroupQualityandInterestingnessMeasures . . . . . . . . . . 27
3.1.6 SearchStrategiesforSubgroupDiscovery . . . . . . . . . . . . . 30
3.2 PruningtheSearchSpace . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 RedundancyManagementofSetsofSubgroups . . . . . . . . . . . . . . 34
3.3.1 GroupingSubgroupsbyClustering . . . . . . . . . . . . . . . . 36
3.3.2 SubgroupSelectionbyWeightedCovering . . . . . . . . . . . . 36
3.3.3 CausalSubgroupAnalysisandSelection . . . . . . . . . . . . . 38
3.4 SubgroupDiscoveryAlgorithms . . . . . . . . . . . . . . . . . . . . . . 42
3.4.1 SizeoftheSearchSpace . . . . . . . . . . . . . . . . . . . . . . 42
3.4.2 Exhaustive . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.3 BeamSearch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.4 PRIM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.5 Apriori SD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.6 SD Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4 Knowledge IntensiveSubgroupMining 55
4.1 ProblemDeﬁnition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 TypesandClassesofBackgroundKnowledge . . . . . . . . . . . . . . . 56
4.2.1 ConstraintKnowledge . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.2 OntologicalKno . . . . . . . . . . . . . . . . . . . . . . 58
4.2.3 AbstractionKnowledge . . . . . . . . .