La lecture à portée de main
Découvre YouScribe en t'inscrivant gratuitement
Je m'inscrisDécouvre YouScribe en t'inscrivant gratuitement
Je m'inscrisDescription
Sujets
Informations
Publié par | julius-maximilians-universitat_wurzburg |
Publié le | 01 janvier 2006 |
Nombre de lectures | 14 |
Langue | English |
Poids de l'ouvrage | 7 Mo |
Extrait
Knowledge IntensiveSubgroupMining–
TechniquesforAutomatic
andInteractiveDiscovery
DissertationzurErlangungdes
naturwissenschaftlichenDoktorgrades
derBayerischenJulius–Maximilians–Universitat¨ Wurzb¨ urg
vorgelegtvon
MartinAtzmuller¨
aus
BadKissingen
Wurzb¨ urg,2006Eingereichtam: 01.06.2006
beiderFakultat¨ fur¨ MathematikundInformatik
1. Gutachter: Prof. Dr. FrankPuppe
2. Prof. Dr. KatharinaMorik
3. Gutachter: Prof. Dr. NadaLavrac
Tagdermundlichen¨ Prufung:¨ 15.12.2006Abstract
Data mining has proved its significance in various domains and applications. As an im
portant subfield of the general data mining task, subgroup mining can be used, e.g., for
marketing purposes in business domains, or for quality profiling and analysis in medical
domains. The goal is to efficiently discover novel, potentially useful and ultimately inter-
esting knowledge. However, in real world situations these requirements often cannot be
fulfilled,e.g.,iftheappliedmethodsdonotscaleforlargedatasets,iftoomanyresultsare
presentedtotheuser,orifmanyofthediscoveredpatternsarealreadyknowntotheuser.
Thisthesisproposesacombinationofseveraltechniquesinordertocopewiththesketched
problems: We discuss automatic methods, including heuristic and exhaustive approaches,
and especially present the novel SD Map algorithm for exhaustive subgroup discovery
that is fast and effective. For an interactive approach we describe techniques for sub
groupintrospectionandanalysis,andwepresentadvancedvisualizationmethods,e.g.,the
zoomtable that directly shows the most important parameters of a subgroup and that can
beusedforoptimizationandexploration. Wealsodescribevariousvisualizationsforsub
group comparison and evaluation in order to support the user during these essential steps.
Furthermore,weproposetoincludepossiblyavailablebackgroundknowledgethatiseasy
to formalize into the mining process. We can utilize the knowledge in many ways: To fo
custhesearchprocess,torestrictthesearchspace,andultimatelytoincreasetheefficiency
of the discovery method. We especially present background knowledge to be applied for
filteringtheelementsoftheproblemdomain,forconstructingabstractions,foraggregating
values of attributes, and for the post processing of the discovered set of patterns. Finally,
thetechniquesarecombinedintoaknowledge intensiveprocesssupportingbothautomatic
and interactive methods for subgroup mining. The practical significance of the proposed
approach strongly depends on the available tools. We introduce the VIKAMINE system
asahighly integratedenvironmentforknowledge intensiveactivesubgroupmining.
Also,wepresentanevaluationconsistingoftwoparts: Withrespecttoobjectiveevaluation
criteria, i.e., comparing the efficiency and the effectiveness of the subgroup discovery
methods, we provide an experimental evaluation using generated data. For that task we
present a novel data generator that allows a simple and intuitive specification of the data
characteristics. TheresultsoftheexperimentalevaluationindicatethatthenovelSD Map
methodoutperformstheotherdescribedalgorithmsusingdatasetssimilartotheintended
application concerning the efficiency, and also with respect to precision and recall for the
heuristicmethods. Subjectiveevaluationcriteriaincludetheuseracceptance,thebenefitof
the approach, and the interestingness of the results. We present five case studies utilizing
the presented techniques: The approach has been successfully implemented in medical
andtechnicalapplicationsusingreal worlddatasets. Themethodwasverywellaccepted
bytheusersthatwereabletodiscovernovel,useful,andinterestingknowledge.Acknowledgments
My work was supported by many people: Without their helpful advice and encourage
ment, it would have been impossible to finish this thesis. First, I would like to thank my
supervisorProf.Dr.FrankPuppefromtheUniversityofWurzb¨ urg: Heprovidedmewitha
perfectenvironment of research withopen doorsand openminds. Inspiring mein fruitful
discussions, he has been a source of invaluable ideas. Furthermore, I want to thank the
reviewers Prof. Dr. Nada Lavrac from the Jozef Stefan Institute, Ljubljana, Slovenia, and
Prof.Dr.KatharinaMorikfromtheUniversityofDortmund,Germany.
Additionally, I want to thank my colleagues of the AI working group at the University of
¨Wurzburg, especially Joachim Baumeister for being my partner in the learning methods
dream team. He supported my work by contributing ideas, discussing countless issues,
andwritingpublicationswithme. AlexanderHornlein,¨ ManuelFehler,NormanBrummer¨ ,
MartinSchuhmann,RainerHerrler,FranziskaKlugl Frohnme¨ yerandmyothercolleagues
also supported me during the years. Last but not least, I want to thank Petra Braun -
the good heart of our group - who especially helped me cope with various problems of
bureaucracy and other things of ordinary life. The practical parts of my work would have
notbeenpossiblewithoutProf.Dr.Hans PeterBuscher(DRK KlinikenBerlin K openick)¨
and Achim Hemsing (Department of Prosthodontics, University of Wurzb¨ urg) who were
mypartnersinthemedicalprojects. Iamgratefulfortheirvaluableandconstructiveinput
tomywork. FurtherthanksgotoGeorgBuscher,forhissupportwiththemedicaldatasets.
I would especially like to thank Julia Eischer, Joachim Baumeister, Alexander Hornlein,¨
ManuelFehler,andNormanBrummer¨ forreadingandpolishingpartsofmythesis.
I thank my family for their encouragement, their support, and for providing me with an
anchor of life, especially my father and mother, and my brother. Finally, I owe the most
important gratitude to my wife Irene. She supported me in our private life and always
encouragedmewhilefinishingthiswork.
Ithankyouall.
MartinAtzmuller¨
Wurzb¨ urg,2006Contents
I Introduction 1
1 Motivation 3
1.1 ResearchGoalandContext . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 OverviewoftheApproach . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 AlgorithmsforSubgroupDiscovery . . . . . . . . . . . . . . . . 5
1.2.2 ExploitingBackgroundKnowledge . . . . . . . . . . . . . . . . 5
1.2.3 SubgroupIntrospectionandAnalysis . . . . . . . . . . . . . . . 6
1.2.4 VisualizationMethods . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.5 Knowledge IntensiveActiveSubgroupMining . . . . . . . . . . 7
1.3 CriteriaandProblemsofanEvaluation . . . . . . . . . . . . . . . . . . . 8
1.3.1 Predictivevs. DescriptiveDataMining . . . . . . . . . . . . . . 8
1.3.2 EvaluatingtheEffectivenessandtheEfficiency . . . . . . . . . . 8
1.3.3 UserAcceptanceoftheDiscoveredPatterns . . . . . . . . . . . . 9
1.3.4 IncludingSubjectiveInterestingnessMeasures . . . . . . . . . . 10
1.4 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 StructureofthisWork . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 TheKnowledge IntensiveActiveSubgroupMiningProcess 15
2.1 MotivationforKnowledge IntensiveActiveSubgroupMining . . . . . . 15
2.2 SubgroupMining–AnOverview . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 RelationtoKnowledgeDiscoveryinDatabases . . . . . . . . . . 15
2.2.2 SubgroupDiscovery . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.3 TheGeneralSubgroupMiningProcess . . . . . . . . . . . . . . 18
2.3 ProcessModelforKnowledge IntensiveActiveSubgroupMining . . . . 19
2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
II Methods 21
3 SubgroupDiscovery 23
3.1 TheSubgroupDiscoveryApproach . . . . . . . . . . . . . . . . . . . . 23
3.1.1 DefinitionofBasicOntologicalKnowledge . . . . . . . . . . . . 24
3.1.2 TargetVariable . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.3 SubgroupDiscoveryContext . . . . . . . . . . . . . . . . . . . . 25
3.1.4DescriptionLanguage . . . . . . . . . . . . . . . . . . 25vi Contents
3.1.5 SubgroupQualityandInterestingnessMeasures . . . . . . . . . . 27
3.1.6 SearchStrategiesforSubgroupDiscovery . . . . . . . . . . . . . 30
3.2 PruningtheSearchSpace . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 RedundancyManagementofSetsofSubgroups . . . . . . . . . . . . . . 34
3.3.1 GroupingSubgroupsbyClustering . . . . . . . . . . . . . . . . 36
3.3.2 SubgroupSelectionbyWeightedCovering . . . . . . . . . . . . 36
3.3.3 CausalSubgroupAnalysisandSelection . . . . . . . . . . . . . 38
3.4 SubgroupDiscoveryAlgorithms . . . . . . . . . . . . . . . . . . . . . . 42
3.4.1 SizeoftheSearchSpace . . . . . . . . . . . . . . . . . . . . . . 42
3.4.2 Exhaustive . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.3 BeamSearch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.4 PRIM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.5 Apriori SD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.6 SD Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4 Knowledge IntensiveSubgroupMining 55
4.1 ProblemDefinition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 TypesandClassesofBackgroundKnowledge . . . . . . . . . . . . . . . 56
4.2.1 ConstraintKnowledge . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.2 OntologicalKno . . . . . . . . . . . . . . . . . . . . . . 58
4.2.3 AbstractionKnowledge . . . . . . . . .