
Data-mining techniques for call-graph-based software-defect localisation [electronic resource] / by Frank Eichinger

193 pages
Data-Mining Techniques for Call-Graph-Based Software-Defect Localisation

Dissertation approved by the Department of Informatics of the Karlsruhe Institute of Technology (KIT) for the academic degree of Doktor der Ingenieurwissenschaften (Doctor of Engineering)

by Frank Eichinger, from Hannover

Date of the oral examination: 31 May 2011
First reviewer: Prof. Dr.-Ing. Klemens Böhm
Second reviewer: Prof. Dr. rer. nat. Ralf H. Reussner

KIT – University of the State of Baden-Württemberg and National Research Centre of the Helmholtz Association
www.kit.edu


Data-Mining Techniques for Call-Graph-Based Software-Defect Localisation

Motivation and Goal

Software is rarely entirely free of defects, and manual debugging is a time-consuming and therefore costly task. Automatic defect-localisation techniques are thus highly desirable. This holds in particular for large software projects, where manual defect localisation is possible only with considerable effort.

A still relatively young approach to (semi-)automatic defect localisation is the application of data-mining techniques to call graphs of programme executions. Such graphs usually represent methods as nodes and method calls as edges. The corresponding defect-localisation approaches work with defects that occur in some, but not all, programme executions. Concretely, graph-mining techniques can be used that work with call graphs labelled as correct or failing executions. Graph mining finds patterns in such graphs that are typical for failing executions. From these, a defect localisation can be derived, possibly in combination with further data-mining techniques. A software developer can then use this information to find and fix the defect considerably faster.

The goal of this dissertation in applied data mining is, on the one hand, to develop data-mining techniques for the specific application problem of defect localisation in software. This comprises both specifying suitable data representations, such as different types of call graphs, and developing analysis processes tailored to them. On the other hand, the goal is to advance graph-mining techniques. As far as possible, these techniques should be applicable beyond the concrete use case, independently of the application domain.

Contributions and Approach

This work considers different aspects of defect localisation with call graphs and thereby makes four main contributions. The first contribution lays the foundations; the further contributions extend them by localising additional kinds of defects, by investigating scalable approaches for large software projects, or by advancing the data-mining technique itself:

Defect localisation with weighted call graphs. Since a reduction of call graphs before the analysis is indispensable for reasons of scalability, much information is lost at this point. To compensate for this, this work presents an approach that represents call frequencies of methods as edge weights in call graphs. As no dedicated graph-mining technique for weighted graphs exists so far, this work proposes a combined approach: conventional graph mining and a numerical data-mining technique are used together. This procedure makes it possible in particular to localise defects that affect the call frequency of methods.

Hierarchical defect localisation with call graphs. Graph-mining algorithms do not scale to large graphs. Therefore, despite the reductions employed, it is not possible to apply the techniques developed so far directly to large software projects. This work pursues a different approach. It first deals with graph representations at different levels of granularity (package, class and method level) and investigates for the first time their suitability for localising defects. Based on such graphs, hierarchical analysis procedures are then developed. They localise defects by starting at a coarse level of granularity, identifying potentially defective regions, and then analysing graphs of finer granularity for these regions. These relatively small subgraphs lead to scalability problems considerably less often.

Defect localisation with dataflow-annotated call graphs. With existing call-graph-based techniques it is not possible to localise defects that only change the dataflow. The reason is that no existing call-graph representation contains dataflows. Finding such a representation is difficult, however, since a single edge typically represents very many method calls and thus very many dataflows. This work proposes a call-graph representation and analysis technique that contains abstractions of dataflows, generated by a discretisation approach. This makes it possible to localise, among others, above all those defects that primarily affect the dataflow.

Weight-constraint-based approximate graph mining. Another way to counter scalability problems, besides the hierarchical procedures described above, is to formulate constraints or to admit approximate result sets. Existing constraint-based graph-mining algorithms give guarantees regarding the completeness of the result sets, but they do not consider weighted graphs. This is because there is no law-like relationship between graph topology and weights. Since weighted call graphs have nevertheless proven to be a useful concept in this work, the use of weight-based constraints is investigated here. It is shown both that they lead to faster algorithms and that completeness guarantees can be dispensed with in practical application.

Results and Outlook

In this dissertation, different data-mining-based techniques for defect localisation in software are developed, and graph-mining techniques are advanced. In the evaluation, the defect localisation yields results that are on average twice as precise as those of a related call-graph-based technique. The results can be improved further by taking dataflows into account. Moreover, this work is the first to successfully evaluate a call-graph-based technique with real-world defects from a large software project. For weight-constraint-based graph mining, this work achieves a speed-up of execution by a factor of 3.5 at unchanged precision in defect localisation. To demonstrate the generality of this approach, it is additionally evaluated with graph data from an entirely different domain, transport logistics.

Like all defect-localisation techniques, the techniques proposed in this work are not able to localise all kinds of defects. Their strengths lie in localising defects that are reflected in the call graphs or in the dataflows. Complementing them with other techniques is therefore sensible in order to cover as broad a spectrum of defect types as possible.

An increasingly important development in software engineering is the development of multithreaded software for multiprocessor systems. In such environments, specific kinds of defects occur (e.g., synchronisation defects) that are particularly difficult to localise. This is mainly because they occur non-deterministically. Among other things, this work shows that some of these defects can already be localised with call-graph-based techniques. With extended graph representations and localisation techniques that explicitly take the specifics of parallel executions into account, the localisation of further defect types (e.g., certain race conditions) can be expected.

Acknowledgements

First of all, I would like to express my deep appreciation to Professor Dr. Klemens Böhm, the supervisor of my dissertation. Throughout the five years at the Karlsruhe Institute of Technology (KIT), Klemens provided me with constructive suggestions and comments, pushed me towards publishing results at highly reputable venues and provided me with all the assistance I needed to conduct my research. In particular, Klemens taught me how to write research papers. When a deadline was approaching, I received his feedback within hours, no matter at which time of day or night.

At the KIT, I was working with a team of students, who were doing their projects and theses under my supervision or were contributing as working students to my research. I have discussed many of the ideas underlying this dissertation with these team members, and many results have been achieved with the help of these persons. In particular, I would like to express my gratitude to the following students who provided most valuable help: Matthias Huber contributed many ideas in call-graph-based defect localisation and constraint-based mining and put them into practice. Roland Klug worked with me on the localisation of dataflow-affecting bugs and did implementation and experiments. Roland also conducted the first experiments on the hierarchical defect localisation. Christopher Oßner continued this work. Besides implementation and experiments, Chris contributed important ideas on hierarchical defect localisation. My colleague Philipp W. L. Große worked on defect localisation in multithreaded programmes and ran first experiments. These were continued by Alexander Bieleš, who provided valuable help and conducted many experiments. Jonas Reinsch contributed by reimplementing approaches from the related work and conducting experiments.

Besides the persons who worked with me on the research directly related to my dissertation, I would like to thank my colleagues from Klemens's group at the Institute for Programme Structures and Data Organisation (IPD). All of them have contributed to a pleasant working experience. Furthermore, they provided me with the possibility to discuss questions regarding my research and helped me with technical and organisational issues. In particular, I would like to mention my officemates, Chris and Dr. Mirco Stern, as well as Dr. Erik Buchmann and Jutta Mülle. Further, I would like to express my sincere thanks to my colleagues Matthias Bracht, Dr. Stephan Schosser and, again, Mirco. We gave the lecture on foundations of database systems and the practical course on data warehousing and mining together, and I enjoyed our pleasant cooperation very much. When writing research papers, a number of further colleagues provided essential and helpful comments, in particular Dr. Thorben Burghardt, Dr. Björn-Oliver Hartmann, Martin Heine, Dr. Emmanuel Müller, Heiko Schepperle and Dr. Silvia von Stackelberg.

Not only the members of Klemens's group have contributed to my work, but also further colleagues at the IPD. In particular, the cooperation with Dr. Klaus Krogmann on localising dataflow-affecting bugs and with Dr. Victor Pankratius on localising defects in multithreaded programmes has been very inspiring and helped me to look at research problems from different perspectives. Further thanks go to a number of scholars I have discussed my work with, in particular to Professor Dr. Xifeng Yan from the University of California, USA, and Professor Dr. Andreas Zeller from Saarland University, Germany.

Special thanks go to my second supervisor, Professor Dr. Ralf H. Reussner. He provided valuable insights on localising dataflow-affecting bugs, gave feedback on my research papers and helped me design experiments.

At the end, I would like to thank my parents and my friends. My parents supported me throughout the years and gave me the possibility to study at the university. Many of my friends supported me all the time and gave me a life outside university. The last credits I would like to dedicate to a very special person. Birte, you are wonderful!

Karlsruhe, May 2011

Contents

1 Introduction
  1.1 Localising Defects in Software
  1.2 Call-Graph Mining for Defect Localisation
  1.3 Contributions of this Dissertation
  1.4 Outline of this Dissertation
2 Background
  2.1 Graph Theory
    2.1.1 Graphs
    2.1.2 Trees
  2.2 Software Engineering
    2.2.1 Graphs in Software Engineering
    2.2.2 Bugs, Defects, Infections and Failures in Software
    2.2.3 Software Testing and Debugging
  2.3 Data Mining
    2.3.1 The Data-Mining Process and Applied Data Mining
    2.3.2 Data-Mining Techniques for Tabular Data
    2.3.3 Frequent-Pattern-Mining Techniques
3 Related Work
  3.1 Defect Localisation
    3.1.1 Static Approaches
    3.1.2 Dynamic Approaches
    3.1.3 Defect Localisation in Multithreaded Programmes
  3.2 Data Mining
    3.2.1 Weighted Subgraph Mining
    3.2.2 Mining Significant Subgraphs
    3.2.3 Constraint-Based Subgraph Mining
4 Call-Graph Representations
  4.1 Call Graphs at the Method Level
    4.1.1 Total Reduction
    4.1.2 Reduction of Iterations
    4.1.3 Temporal Order in Call Graphs
    4.1.4 Reduction of Recursions
    4.1.5 Comparison
  4.2 Call Graphs at Different Levels of Granularity
  4.3 Call Graphs of Multithreaded Programmes
  4.4 Derivation of Call Graphs
  4.5 Subsumption
5 Call-Graph-Based Defect Localisation
  5.1 Overview
  5.2 Existing Structural Approaches
    5.2.1 The Approach from Di Fatta et al.
    5.2.2 The Approach from Liu et al.
    5.2.3 The Approach from Cheng et al.
  5.3 Frequency-Based and Combined Approaches
    5.3.1 Frequency-Based Approach
    5.3.2 Combined Approaches
  5.4 Experimental Evaluation
    5.4.1 Experimental Setup
    5.4.2 Experimental Results
    5.4.3 Comparison to Related Work
  5.5 Subsumption
6 Hierarchical Defect Localisation
  6.1 Overview
  6.2 Dynamic Call Graphs at Different Levels
    6.2.1 Call Graphs at the Method Level
    6.2.2 Call Graphs at the Class Level
    6.2.3 Call Graphs at the Package Level
    6.2.4 The Zoom-In Operation for Call Graphs
  6.3 Hierarchical Defect Localisation
    6.3.1 Defect Localisation in General
    6.3.2 Hierarchical Procedures
  6.4 Evaluation with Real Software Defects
    6.4.1 Target Programme and Defects: Mozilla Rhino
    6.4.2 Evaluation Measures
    6.4.3 Experimental Results (Different Levels)
    6.4.4 Experimental Results (Hierarchical)
  6.5 Subsumption
7 Localisation of Dataflow-Affecting Bugs
  7.1 Overview
  7.2 Dataflow-Enabled Call Graphs
    7.2.1 Derivation of Programme Traces
    7.2.2 Dataflow Abstractions
    7.2.3 Construction of Dataflow-Enabled Call Graphs
  7.3 Localising Dataflow-Affecting Bugs
    7.3.1 Overview
    7.3.2 Frequent Subgraph Mining
    7.3.3 Entropy-Based Defect Localisation
    7.3.4 Follow-Up-Infection Detection
    7.3.5 Improvements for Structure-Affecting Bugs
    7.3.6 Incorporation of Static Information
  7.4 Experimental Evaluation
    7.4.1 Experimental Setting
    7.4.2 Experimental Results
    7.4.3 Supplementary Experiments
  7.5 Subsumption
8 Constraint-Based Mining of Weighted Graphs
  8.1 Overview
  8.2 Weight-Based Constraints
  8.3 Weight-Based Mining
  8.4 Weighted Graph Mining Applied
    8.4.1 Software-Defect Localisation
    8.4.2 Weighted-Graph Classification
    8.4.3 Explorative Mining
  8.5 Experimental Evaluation
    8.5.1 Datasets
    8.5.2 Experimental Settings
    8.5.3 Experimental Results
  8.6 Subsumption
9 Conclusions and Future Research Directions
  9.1 Summary of this Dissertation
  9.2 Lessons Learned
  9.3 Future Research Directions

Appendix
A Multithreading Defect Localisation
  A.1 Overview
  A.2 Multithreading Defect Localisation
    A.2.1 Overview
    A.2.2 Calculating Defectiveness Likelihoods
  A.3 Experimental Evaluation
    A.3.1 Benchmark Programmes and Defects
    A.3.2 Experimental Setting
    A.3.3 Accuracy Measures for Defect-Localisation Results
    A.3.4 Results
  A.4 A Detailed Example
  A.5 Result Comparisons with Related Work
  A.6 Subsumption

1 Introduction

Software is rarely free from defects that cause failing behaviour. On the one side, failures experienced by the users are annoying, and they cost the economy billions of dollars annually [RTI02]. This is in particular severe when failures occur after the software was released. On the other side, manual debugging of software can be extremely expensive, too. More concretely, localising defects is considered to be the most time-consuming and difficult activity in this context [DLZ05, JH05], and studies have shown that 35% of the overall development time is spent for debugging activities [RTI02]. Automated means to localise defects and to guide developers debugging a programme are therefore more than desirable [ZNZ08]. If a developer obtains some hints where defects might be located, debugging becomes more efficient. Certainly, the respective techniques should localise a defect as precisely as possible. More specifically, they should exclude most of the code from being analysed by humans. Furthermore, a defect-localisation technique should be able to deal with a wide range of defects. However, research has shown that none of the existing techniques for defect localisation is perfect, i.e., is able to localise any kind of defect [RAF04, SJYH09]. It is therefore still worthwhile to investigate further directions of defect-localisation techniques that support developers in eliminating failing behaviour.

One way to localise defects in software is to analyse dynamic call graphs with graph-mining techniques [CLZ+09, DFLS06, LYY+05]. Such graphs are representations of programme executions. Analysing call graphs aims at finding anomalies of failing executions. Graph mining in turn is a general technique for the analysis of graph structures, and it is one of the more recent developments in data mining [AW10c, CH06]. Graph mining bears the potential to produce very precise mining results, in particular compared to more traditional techniques that rely on data representations that are less complex. The rationale is that many real-world artefacts – such as programme executions – can be represented very precisely by means of graph structures. The power of analysing such structures has impressively been demonstrated by the PageRank algorithm [BP98] for ranking results in web search, as well as by many further link-mining applications [YHF10].

Software engineering and defect localisation in particular have been identified as a rewarding and challenging area for applied data mining [DDG+08, HG08, XTLL09]. Further, tackling challenging application problems – such as defect localisation in software – might lead to innovations in the data-analysis domain and in the application domain as well [HCXY07]. In this dissertation, we elaborately investigate graph-mining techniques for the analysis of dynamic call graphs and ultimately for the localisation of defects in software. This direction of research has been of interest in both scientific communities, data mining [AW10b, LYY+05] and software engineering [CLZ+09, DFLS06]. This dissertation in applied data mining likewise is motivated, solves challenges and contributes in both fields, data mining and software engineering. In particular, this includes advances in defect localisation, problem-oriented graph data representations and analysis techniques.

1.1 Localising Defects in Software

Research in the field of software reliability has been extensive, and various techniques have been developed for defect localisation – some of them building on data mining. Techniques for defect localisation are either static or dynamic, i.e., they deal with source code only, or they analyse programme executions, respectively.

Static techniques typically rely on code-quality measures or on identifying typical defect-prone programming patterns. Programme components with suspicious values of the measures or patterns identified to be defect prone are then good hints for localising defects. However, static approaches typically lead to many false-positive warnings, and they have difficulties discovering important classes of hard-to-find defects [RAF04].

Dynamic techniques in turn analyse programme executions and typically compare the characteristics from correct and failing executions. This helps to identify anomalies in the executions, which localisation techniques then suspect to refer to defects. The different approaches use different information derived from executions, as well as different methodologies to derive defect localisations. Two of the best dynamic approaches, which have outperformed a number of competitors, are SOBER [LFY+06] and Tarantula [JH05] with its variations [AZGvG09]. However, even though these techniques have proven to detect certain defects very well, they do not analyse all kinds of information that could be obtained from programme executions and is potentially of relevance. To name one example, Tarantula only makes use of the information whether a certain piece of code is executed or not. Certain defects however might alter the number of times a piece of code is executed, which these approaches do not consider.

Example 1.1: The example Java programme given in Listing 1.1 could have a defective loop condition in Line 16. This would be a call-frequency-affecting bug, as such a defect would affect the execution frequency of Line 17 and thus the call frequency of method a. Approaches such as Tarantula would not notice this effect.
Analysing dynamic call graphs is a relatively recent dynamic defect-localisation approach [CLZ+09, DFLS06, LYY+05]. It is promising, since such graphs contain much detailed and fine-grained information regarding programme executions, which can hardly be found in any other representation. In particular, call graphs reflect the structure of method invocations of an execution – or the relationship of more fine-grained or more coarse-grained programme components, as we will see. In method-level call graphs, methods are represented as nodes and method calls as edges.

     1  public class Example {
     2    static java.util.Random generator;
     3
     4    public static void main(String[] args) {
     5      generator = new java.util.Random();
     6      if (generator.nextInt(100) < 99)
     7        a(0);
     8      b(3);
     9    }
    10
    11    private static void a(int x) {
    12      // some application code
    13    }
    14
    15    private static void b(int y) {
    16      for (int i = 0; i < y; i++)
    17        a(generator.nextInt(100));
    18    }
    19  }

Listing 1.1: An example Java programme.
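The call-frequency-affecting bug of Example 1.1 can be made concrete with a small sketch. It is not taken from the dissertation: the class FrequencyVsCoverage, the method runB and the concrete defect (a loop condition of i <= y instead of i < y) are assumptions chosen for illustration. The sketch shows that a purely coverage-based view sees no difference between the correct and the defective variant, whereas the call frequency does change.

```java
import java.util.Random;

// Hypothetical illustration of Example 1.1; not code from the dissertation.
class FrequencyVsCoverage {
    static int callsToA; // how often a() was called from the loop in b()

    static void a(int x) {
        callsToA++;
    }

    // Runs the loop of method b from Listing 1.1.
    // 'defective' switches to the assumed faulty loop condition "i <= y".
    static int runB(int y, boolean defective) {
        callsToA = 0;
        Random generator = new Random(42);
        for (int i = 0; defective ? i <= y : i < y; i++) {
            a(generator.nextInt(100));
        }
        return callsToA;
    }

    public static void main(String[] args) {
        int correct = runB(3, false);
        int defective = runB(3, true);
        // Coverage only records whether the call in the loop body ran at all.
        System.out.println("covered: " + (correct > 0) + " / " + (defective > 0));
        // The call frequency differs between the two variants.
        System.out.println("call frequency: " + correct + " / " + defective);
    }
}
```

Under these assumptions, both variants execute the loop body at least once, so the statement is covered in either case, while method a is called three times in the correct variant and four times in the defective one.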
Example 1.2: In a typical execution of the example programme given in Listing 1.1, method main calls method a once, before it calls method b. Method b then calls method a three times. The call graph in Figure 1.1(a) reflects this behaviour.

Besides the advantages of call-graph analysis, mining graphs is much more complex than many other analysis techniques. Therefore, to cope with the size of call graphs, they are typically reduced to compact representations where one edge stands for a number of method calls. However, call-graph-based defect localisation can still be computationally expensive and can lead to scalability problems.

While related work in call-graph-based defect localisation has investigated basic call-graph representations, we extend call graphs with more information relevant for the localisation of defects. In particular, this information refers to the context of method invocations, execution frequencies and dataflows. These extensions aim at broadening the range of detectable defects.

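[Figure 1.1, three call graphs: (a) unreduced graph in which main calls a and b, and b calls a three times; (b) reduced graph with edge weights 1 (main to a), 1 (main to b) and 3 (b to a); (c) reduced graph with edge tuples 1,1,0,0 (main to a), 1,1,0,0 (main to b) and 3,0,1,2 (b to a).]

Figure 1.1: Example call graphs referring to the programme given in Listing 1.1.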

Example 1.3: Figure 1.1(b) continues Example 1.2. It is an example for a reduced representation of the graph given in Figure 1.1(a). Further, it includes call frequencies as numerical edge weights. Fluctuating frequency values can be identified by analysis techniques and might be a hint for a defect.
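The reduction with call frequencies as edge weights can be sketched as follows. This is a simplified illustration under stated assumptions, not the reduction technique of the dissertation: the class WeightedCallGraph and its method reduce are hypothetical, and edges are identified only by caller and callee name, which ignores the calling context that a real call-graph reduction would have to respect.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: collapses a call trace into frequency-weighted edges.
class WeightedCallGraph {
    // Each trace entry is a {caller, callee} pair; the result maps
    // "caller->callee" to the number of times this call was observed.
    static Map<String, Integer> reduce(List<String[]> trace) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        for (String[] call : trace) {
            weights.merge(call[0] + "->" + call[1], 1, Integer::sum);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Trace of the typical execution described in Example 1.2.
        List<String[]> trace = List.of(
            new String[] {"main", "a"},
            new String[] {"main", "b"},
            new String[] {"b", "a"},
            new String[] {"b", "a"},
            new String[] {"b", "a"});
        System.out.println(reduce(trace)); // {main->a=1, main->b=1, b->a=3}
    }
}
```

The resulting weights 1, 1 and 3 correspond to the edge weights of Figure 1.1(b). A failing execution in which b calls a four times would yield a weight of 4 on the last edge, which is the kind of fluctuation an analysis technique could pick up.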
The graph in Figure 1.1(c) contains additional information related to the dataflow. It is annotated with tuples of weights at the edges and can be analysed by mining techniques. In this simplified example, the first tuple element is the call frequency, as before. The other three tuple elements stand for the number of method calls with parameter values falling into the intervals 'low', 'medium' and 'high'. Concretely, imagine that method b calls method a with values 98, 83 and 50 for parameter x. The tuple (3, 0, 1, 2) then stands for three calls in total, zero calls with a low value, one call with a medium value and two calls with a high value.
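How such an edge tuple might be computed is sketched below. The interval boundaries are an assumption: the text does not state them, so the sketch uses three equal-width intervals over the value range 0 to 99 (low below 34, medium below 67, high otherwise). The class DataflowTuple and the method annotate are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical sketch of a dataflow annotation for one call-graph edge.
class DataflowTuple {
    // Returns {calls, low, medium, high} for the observed parameter values.
    // The boundaries 34 and 67 are assumed; the source does not specify them.
    static int[] annotate(int[] parameterValues) {
        int[] tuple = new int[4];
        for (int value : parameterValues) {
            tuple[0]++;                 // total call frequency
            if (value < 34) {
                tuple[1]++;             // 'low' interval
            } else if (value < 67) {
                tuple[2]++;             // 'medium' interval
            } else {
                tuple[3]++;             // 'high' interval
            }
        }
        return tuple;
    }

    public static void main(String[] args) {
        // Parameter values of the three calls from b to a in the example.
        System.out.println(Arrays.toString(annotate(new int[] {98, 83, 50}))); // [3, 0, 1, 2]
    }
}
```

With these assumed boundaries, the values 98 and 83 count as high and 50 as medium, which reproduces the tuple (3, 0, 1, 2) from the text.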
Until today, call-graph-based defect localisation has not been studied extensively. Therefore, many questions concerning such techniques are currently unanswered. This includes the question what kind of defects can be localised and how well the techniques scale. In this dissertation, we investigate the potential of call-graph-based techniques and different call-graph representations for defect localisation. Therefore, it is not the primary aim to develop a technique which rules out any existing technique, but to comprehensively investigate the usage of call graphs for defect localisation.

1.2 Call-Graph Mining for Defect Localisation

Mining call graphs as described before introduces two main challenges for data analysis, that partly depend on each other:

1. Finding adequate data representations
2. Analysing the resulting graphs

Data Representations. Finding adequate data representations is an inherent part of the knowledge-discovery process [CCK+00, FPSS96]. This non-trivial process is the general data-analysis procedure, aiming at the discovery of "valid, novel, potentially useful and ultimately understandable patterns in data" [FPSS96]. Data mining is only one step within the process; the other steps range from understanding the application domain to the deployment of the analysis technique. Finding an adequate data representation and acquiring this data are the steps preceding the actual data-mining step. In particular, problem-specific data representations have been identified to be key for the success of any applied data-mining problem [HG08]. In the software-engineering application domain of this dissertation, call graphs are the dedicated data representation. However, it is not obvious how exactly to represent the call-graph structure (topology) in order not to lose any important information and to obtain graphs of a manageable size. Other aspects are the granularity of call graphs and the question how to incorporate more domain-specific information such as information regarding calls of methods that belong to the programming language. Graphs at granularities different from the method level have rarely been investigated in a defect-localisation context, and finding representations is challenging as one would like not to lose too much information in coarse graph representations. Further, it is demanding to come up with adequate representations for call frequencies and – more importantly – for dataflows, which we identify to be crucial for the localisation of certain defects. Another challenging area is the definition of call graphs for multithreaded programmes, where parts of the programme are executed in parallel.

Mining Weighted Call Graphs and Localising Defects. Besides the data representation, defining the actual analysis procedure is the other main challenge for call-graph-based defect localisation. It leads to three further subproblems:

• How to mine call graphs that are weighted?
• How to deal with scalability issues caused by large graphs?
• How to derive actual defect localisations?

Weighted subgraph mining. As we will see, we identify different types of weighted graphs to be natural and adequate representations for our mining problem. However, weighted-subgraph mining has not been investigated comprehensively, and there are no obvious ways for the analysis of our weighted call graphs. Most techniques that have been proposed for mining weighted graphs are very specific for the respective mining problem and application domain, and they cannot be applied for defect localisation. It is therefore an unsolved problem how subgraph mining with weighted graphs can be achieved in general. This problem is difficult to solve, since it brings together the domain of graph structures (topologies) and the domain of numerical weights. These two domains are in general not connected by means of a guaranteed law. This makes it difficult to deal with both kinds of information within the same algorithm.

Scalability issues. Scalability of subgraph-mining algorithms is challenging, since frequent subgraph mining inherently involves subgraph-isomorphism problems. This problem is known to be NP-complete [GJ79]. Therefore, finding efficient algorithms is not easy. For instance, approximate and constraint-based algorithms might solve the scalability problem, but bear a trade-off between scalability and possibly worse defect-localisation results. This is because such techniques lead to smaller result sets that might contain less information relevant for defect localisation. Besides adapted mining techniques, the scalability problem can be tackled by falling back to the data-representation problem. When doing so, the aim is to find suitable graph representations that can be mined more easily. However, this bears a similar trade-off.

Deriving defect localisations. There are many different ways to derive a defect localisation based on results from mining weighted subgraphs. Such a localisation technique should be efficiently computable, should cover as wide a range of different types of defects as possible and should ultimately be useful for software developers. Thus, finding a technique that fulfils all these characteristics is difficult.

1.3 Contributions of this Dissertation

In order to solve the challenges mentioned in Sections 1.1 and 1.2, this dissertation features contributions in both domains: in software engineering and at the different stages of the knowledge-discovery process. The contributions described in the following two paragraphs are our basic approach for defect localisation with weighted call graphs. The following paragraphs build on this approach and extend it in order to broaden the range of detectable defects and to scale for larger software projects. These extensions deal with both the data representation and the mining techniques. The last paragraph subsumes the results in defect localisation.

Weighted-Call-Graph Representations. Reducing the size of call graphs as directly obtained from programme executions is mandatory, caused by scalability problems. However, this leads to a loss of information which might be relevant for defect localisation. In this dissertation, we propose an approach that reduces the size of the graphs. It does so to an extent that keeps important structural information. Further, our approach annotates call frequencies as numeric edge weights. This information would be lost otherwise. This call-graph representation allows in particular for the localisation of an important class of defects, call-frequency-affecting bugs.

Data-Mining-Based Defect Localisation with Weighted Call Graphs. To analyse the weighted call graphs proposed – in the absence of a suitable out-of-the-box technique for weighted graph mining – we propose a combined approach: It utilises vanilla frequent-subgraph-mining techniques in a first step and employs a traditional data-mining technique, feature selection, in a subsequent analysis step. To broaden the range of detectable defects, we further propose combination strategies to incorporate the detection of another class of defects, structure-affecting bugs. Ultimately, we derive a ranking of methods, ordered by their likelihood to be defective. A software developer can then use this ranking to investigate the methods, starting with the one suspected to be most suspicious.

Hierarchical Defect Localisation with Graphs at Different Granularities. Graph-mining
algorithms do not scale well for large graphs, even if aggressive call-graph-reduction
techniques are applied. Therefore, it is not possible to apply existing call-graph-based
defect-localisation techniques to large software projects. In order to apply the developed
defect-localisation techniques to such large projects, we develop hierarchical procedures in
this dissertation. To this end, we first propose novel call-graph representations at different
levels of granularity, i.e., at the package, class and method level. We then investigate their
usefulness for defect localisation and propose various hierarchical analysis procedures. These
procedures localise defects starting at the most coarse-grained call-graph representation.
There they identify potentially defective regions in the code. Then, they proceed with
finer-grained graphs of the previously identified regions, and so on. Such graphs, representing
small regions of the whole graph, lead to scalability issues in far fewer cases.

Defect Localisation with Dataflow-Enabled Call Graphs. Existing call-graph-based
defect-localisation techniques do not allow for the localisation of defects that affect the
dataflow of a programme execution rather than the method-call structure. This is because such
techniques can only detect defects that influence the call graph, which is not the case with
such defects. Finding a respective call-graph representation is difficult, as edges in a call
graph typically represent huge numbers of method calls and correspondingly huge numbers of
dataflows. In this dissertation, we propose dataflow-enabled call graphs that extend call
graphs with abstractions referring to the dataflow. We derive the graphs using discretisation
techniques. Furthermore, we extend the defect-localisation technique to deal with the resulting
graphs. With these extensions, we are able to localise defects that primarily affect the
dataflow, besides other classes of defects.

Mining Weighted Graphs with Weight-Based Constraints. Besides the aforementioned hierarchical
approach, constraint-based mining is a further approach with the potential to increase
scalability. Such algorithms lead to smaller result sets and make use of pruning opportunities
in the mining algorithms. However, existing constraint-based graph-mining algorithms do not
deal with weighted graphs. This is because most weight-based constraints do not fulfil certain
properties – most importantly anti-monotonicity – which theoretically forbids their usage as a
pruning criterion. In this dissertation, we do develop weight-based constraints and integrate
them into pattern-growth algorithms for frequent subgraph mining. We do so as weight-based
constraints seem to be a well-suited general approach for mining weighted graphs. As mentioned,
weight-based constraints cannot be employed for pruning – in theory. In this dissertation, we
do so nevertheless and study the effects. The rationale for this investigation is that there is
evidence that weights and graph structures are frequently correlated in real-world graphs
[MAF08]. As mining with such constraints can lead to approximate results, i.e., to incomplete
result sets, we study the completeness and the usefulness of such constraints. The result is
that weight-based constraints lead to both a better performance of mining algorithms and good
results in practice. Concretely, we demonstrate that guaranteeing completeness of mining
results is dispensable in the analysis problems investigated – not only in our
software-engineering application. Besides defect localisation, we evaluate our approach with
datasets from transportation logistics and consider different analysis problems, i.e., graph
classification and explorative mining. We do so to demonstrate the broad applicability of the
weight-based constraints proposed.

Results in Software-Defect Localisation. The results of defect localisation using the
call-graph representations and localisation techniques developed in this dissertation are
encouraging: Compared to existing call-graph-based techniques, the approaches developed display
on average a doubled localisation precision. These results can be improved when employing
dataflow-enabled call graphs. Compared to more established approaches from the
software-engineering domain [AZGvG09, JH05, LFY+06], our approach was able to derive better
defect-localisation results in 12 out of 14 cases in our test suite. Further, for the first
time, we successfully apply call-graph-mining-based defect localisation to real-world defects
from a real and relatively large software project (Mozilla Rhino, ≈ 49k lines of code). In our
setup, our approach narrows down the amount of code a developer has to examine to about 6% of
the whole project on average. In constraint-based mining, we achieve a speed-up of 3.5 while
obtaining even slightly better defect-localisation results.

1.4 Outline of this Dissertation

We now describe the contents of the remainder of this dissertation. Chapters 2 and 3 introduce
the background and the related work, respectively. Chapters 4 and 5 describe our basic
defect-localisation approach, and Chapters 6–8 are extensions thereof. Chapters 5–8 include
evaluations of the respective techniques. Chapter 9 concludes.

In Chapter 2, we describe the background of this dissertation. In particular, we introduce the
backgrounds from graph theory, software engineering and data mining. These descriptions are
limited to the extent needed to follow the descriptions in the succeeding chapters.

In Chapter 3, we discuss related work. This chapter is divided into two parts, one on defect
localisation and one on data mining. In the defect-localisation part, we discuss the different
existing approaches for defect localisation (not including call-graph-mining-based techniques)
and contrast them to the techniques developed in this dissertation. In the data-mining part, we
discuss existing techniques for mining weighted graphs, for mining significant subgraphs
(including approximative techniques) and for constraint-based subgraph mining. These techniques
are related, as we propose different ways for mining weighted graphs in this dissertation,
including an approximate constraint-based technique.

Chapter 4 is about call-graph representations. This includes representations proposed by other
authors from the closely related work, as well as the call-graph representations that are new
in this dissertation. In particular, we introduce different kinds of weighted call graphs. We
discuss all these graph representations within the same chapter, as they are closely related to
each other, and as this allows for a better comparison. In particular, we focus on call graphs
at the method level in this chapter, i.e., one node in a call graph represents a method. Then,
we comment on call-graph representations on other levels of granularity, we propose graph
representations for multithreaded programmes, and we say how we actually derive call graphs
from programme executions.

In Chapter 5, we describe defect localisation based on the call-graph representations discussed
before. Again, we discuss closely related techniques dealing with traditional graph
representations within the same chapter as the techniques newly proposed in this dissertation.
That is, we discuss existing structural techniques for defect localisation, followed by novel
frequency-based approaches. We also propose possibilities to combine both kinds of approaches
in order to be able to detect a broader range of defects. Then, we present an evaluation that
compares selected graph representations and mining techniques. Besides the techniques described
in this chapter, we also compare our newly proposed approach to established approaches from the
related work in software engineering.
Chapter 6 is about hierarchical defect localisation. That is, we generalise our call-graph
representations to deal with call graphs at several levels of granularity. This allows us to
propose hierarchical mining procedures that start with call graphs at coarse levels of
granularity, before zooming in on regions of the call graphs suspected to be defective. This
aims at a scalable technique for defect localisation. We evaluate this technique with a
relatively large software project along with real defects.

In Chapter 7, we deal with the localisation of dataflow-affecting bugs. We first introduce
dataflow-enabled call graphs, which are call graphs incorporating abstractions referring to the
dataflow. Then we say how we derive dataflow-enabled call graphs by means of tracing and
supervised discretisation. In order to localise dataflow-affecting bugs along with other types
of defects, we adapt our mining technique from the preceding chapters. Finally, we evaluate
this new approach.
Chapter 8 is about constraint-based mining of weighted graphs. This technique is motivated by
our defect-localisation problem, but is actually a more general technique for mining weighted
graphs. Concretely, we introduce weight-based constraints, and we explain how to integrate them
into pattern-growth-based frequent subgraph mining. In this chapter, we also explain different
analysis settings where mining with weight-based constraints is of relevance, including the
application to defect localisation. Ultimately, we evaluate the different analysis settings
with graph data from software engineering and transportation logistics.

Chapter 9 concludes this dissertation. Besides a short summary, we highlight the lessons
learned, and we explain some directions for future work.

Portions of the whole work have been published in [EB10, EBH08a, EBH08b] (weighted call-graph
representations and basic defect-localisation techniques, Chapters 4 and 5), [EB09, EOB11]
(hierarchical defect localisation, Chapter 6), [EKKB10] (localisation of dataflow-affecting
bugs, Chapter 7), [EHB10a, EHB10b] (constraint-based mining of weighted graphs, Chapter 8) and
[EPGB10] (multithreading defect localisation, Appendix A).


2 Background

This dissertation is about applied data mining, it has a dedicated field of application,
software defect localisation, and it focuses on techniques for the analysis of call graphs.
This chapter therefore first introduces the formal graph-theoretic backgrounds (Section 2.1).
Then it discusses important concepts from the field of application, software engineering
(Section 2.2), and finally, it introduces the relevant data-mining techniques (Section 2.3).

2.1 Graph Theory

We now introduce the basic concepts of graphs and trees from a graph-theoretic point of view
that are relevant in this dissertation (Sections 2.1.1 and 2.1.2, respectively).

2.1.1 Graphs

In this dissertation, graphs are typically labelled:

Definition 2.1 (Labelled graphs)
A labelled graph is a four-tuple: G := (V, E, L, l). V is the set of vertices (in this
dissertation, we use vertex and node interchangeably), E ⊆ V × V the set of edges, L a set of
categorical labels and l : V ∪ E → L a labelling function. E(G) denotes the set of edges of G,
V(G) the set of vertices and L(G) the set of labels.

Sometimes we do not explicitly mention the labels of edges. In this case, all edges have the
same default label. Further, graphs can be weighted:

Definition 2.2 (Labelled weighted graphs)
A labelled weighted graph is a six-tuple: G := (V, E, L, l, W, w). V, E, L and l are as in
Definition 2.1, W ⊆ ℝ is the domain of the weights, and w : E → W is a function which assigns
weights to edges.

All techniques discussed in this dissertation can easily be extended to cover nodes that are
weighted (w : V ∪ E → W). Further, tuples of weights can be handled with the following
variation: W ⊆ ℝ^n, n ∈ ℕ.
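
To make Definitions 2.1 and 2.2 more tangible, the following Java sketch shows one possible in-memory form of a labelled weighted graph. It is an illustration under assumptions, not a data structure taken from this dissertation: the names LabelledWeightedGraph, Edge, addVertex, addEdge and weight are hypothetical, vertices are identified by strings, every edge receives the same default label, and W is simply taken to be the Java doubles.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch of Definition 2.2: G := (V, E, L, l, W, w).
// All names are assumptions made for illustration only.
final class LabelledWeightedGraph {

    /** A directed edge (from, to), i.e. one element of E, a subset of V x V. */
    static final class Edge {
        final String from;
        final String to;

        Edge(String from, String to) {
            this.from = from;
            this.to = to;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Edge)) {
                return false;
            }
            Edge e = (Edge) o;
            return from.equals(e.from) && to.equals(e.to);
        }

        @Override
        public int hashCode() {
            return Objects.hash(from, to);
        }
    }

    static final String DEFAULT_EDGE_LABEL = "default";

    // l restricted to V: vertex -> label (the key set plays the role of V).
    final Map<String, String> vertexLabels = new HashMap<>();
    // l restricted to E: edge -> label (the key set plays the role of E).
    final Map<Edge, String> edgeLabels = new HashMap<>();
    // w: E -> W, with W assumed to be the doubles.
    final Map<Edge, Double> edgeWeights = new HashMap<>();

    void addVertex(String id, String label) {
        vertexLabels.put(id, label);
    }

    void addEdge(String from, String to, double weight) {
        if (!vertexLabels.containsKey(from) || !vertexLabels.containsKey(to)) {
            throw new IllegalArgumentException("both endpoints must be in V");
        }
        Edge e = new Edge(from, to);
        edgeLabels.put(e, DEFAULT_EDGE_LABEL);
        edgeWeights.put(e, weight);
    }

    double weight(String from, String to) {
        Double w = edgeWeights.get(new Edge(from, to));
        if (w == null) {
            throw new IllegalArgumentException("no such edge");
        }
        return w;
    }

    public static void main(String[] args) {
        LabelledWeightedGraph g = new LabelledWeightedGraph();
        g.addVertex("v1", "main");
        g.addVertex("v2", "a");
        g.addEdge("v1", "v2", 3.0);
        System.out.println(g.weight("v1", "v2"));
    }
}
```

Keeping the labelling and the weight function in separate maps mirrors the tuple components l and w; weighted nodes or weight tuples, as mentioned above, would need a correspondingly extended map.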


Notation 2.1 (Properties of graphs)
All graphs can be directed or undirected (e ∈ E is an ordered tuple or an unordered set,
respectively). Further, all graphs can be connected (any two nodes are connected by at least
one path) or unconnected (there exists at least one pair of nodes that is not connected by a
path; the graph consists of several components). See [Die06] for further details.

If not mentioned explicitly, we deal with directed and connected graphs in this dissertation.
To explicitly distinguish graphs from trees (see Section 2.1.2), we also call graphs that might
include cycles general graphs.
Definition 2.3 (Subgraphs (see [Die06]))
A labelled graph G′ is a subgraph of a labelled graph G (and G a supergraph of G′) if and only
if V(G′) ⊆ V(G), E(G′) ⊆ E(G), L(G′) ⊆ L(G), and G′ preserves the labelling defined in G.
G′ ⊆ G denotes such a subgraph-supergraph relationship. If G′ ⊆ G and G′ ≠ G, G′ is called a
proper subgraph of G, denoted G′ ⊂ G.

Note that weights are not considered for the definition of subgraphs.

Definition 2.4 (Subgraph-isomorphism problem)
The question whether a given graph G′ is a subgraph of another given graph G (G′ ⊆ G) is called
the subgraph-isomorphism problem.

The subgraph-isomorphism problem as defined before for general graphs is known to be
NP-complete [GJ79].
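
The containment conditions of Definition 2.3 can be sketched in Java as follows. The sketch rests on simplifying assumptions: a graph is given by two hypothetical maps (vertex identifier to label, and an edge key of the form "u->v" to label), and vertices of G′ and G are matched by identical identifiers. It therefore only checks the set-containment and label-preservation conditions; it does not search for a mapping between differently named vertices, which is the part that makes the subgraph-isomorphism problem hard. As stated above, weights play no role.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the conditions in Definition 2.3 for graphs whose
// vertices are matched by identical identifiers (no isomorphism search).
final class SubgraphCheck {

    // True if G' (subVertices, subEdges) is a subgraph of G (vertices, edges).
    static boolean isSubgraph(Map<String, String> subVertices,
                              Map<String, String> subEdges,
                              Map<String, String> vertices,
                              Map<String, String> edges) {
        return preserved(subVertices, vertices) && preserved(subEdges, edges);
    }

    // True if every key of 'small' occurs in 'big' with the same label.
    private static boolean preserved(Map<String, String> small,
                                     Map<String, String> big) {
        for (Map.Entry<String, String> entry : small.entrySet()) {
            if (!entry.getValue().equals(big.get(entry.getKey()))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> vertices = new HashMap<>();
        vertices.put("v1", "main");
        vertices.put("v2", "a");
        vertices.put("v3", "b");
        Map<String, String> edges = new HashMap<>();
        edges.put("v1->v2", "call");
        edges.put("v1->v3", "call");

        Map<String, String> subVertices = new HashMap<>();
        subVertices.put("v1", "main");
        subVertices.put("v3", "b");
        Map<String, String> subEdges = new HashMap<>();
        subEdges.put("v1->v3", "call");

        System.out.println(isSubgraph(subVertices, subEdges, vertices, edges));
    }
}
```

Because labels are compared entry by entry, the condition L(G′) ⊆ L(G) holds automatically whenever the check succeeds.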

2.1.2 Trees

Trees are variants of graphs. As they are relevant in the software-engineering domain, too, we
now briefly introduce the most important notions.

Definition 2.5 (Trees (see [Die06]))
An acyclic connected graph, i.e., a connected graph whose edges do not form a cycle, is called
a tree (see Definition 2.1 and Notation 2.1 for the definition of connected graphs). In this
dissertation, trees are (as graphs) always labelled, and they can be weighted, too (in this
case, Definition 2.2 applies accordingly).

Notation 2.2 (Properties of trees)
Nodes having only one outgoing/incoming edge are called leaves, nodes that are connected to the
same node are called siblings. When trees are undirected (and unordered, see Notation 2.3),
they are also called free trees [CMNK05]. When trees are directed and two nodes are connected
by an edge, one calls them parent and child. A tree T with a dedicated root node r ∈ V(T) is
called a rooted tree.

Notation 2.3 (Unordered and ordered trees)
As with graphs, trees are by default unordered, as V and E are unordered sets. Rooted trees can
also be ordered. In this case, one has to define an order between all siblings that are
children of the same parent node.

In the context of software engineering, we typically deal with both labelled and directed
rooted unordered trees and labelled and directed rooted ordered trees.

Definition 2.6 (Subtrees (see [CMNK05]))
The definition of subtrees is the same as the one for subgraphs given in Definition 2.3. When
dealing with ordered trees, the ordering among the siblings in the subtree has in addition to
be a subordering of the corresponding vertices in the supertree.

Chi et al. [CMNK05] describe further variations for the definition of subtrees besides the one
for induced subtrees given in Definition 2.6.

2.2 Software Engineering

Having covered the theoretical background of graphs and trees, we now first discuss the most
important graphs in software engineering (Section 2.2.1). We then clarify our notion of failing
behaviour in software (Section 2.2.2) and introduce the foundations of software testing and
debugging (Section 2.2.3).

2.2.1 Graphs in Software Engineering

Graphs have been used for a long time in different subdisciplines of software engineering. The
most important distinction is whether the graphs are static or dynamic, i.e., whether they
represent aspects from the source code or from programme executions, respectively. In the
following, we introduce the graphs that are relevant in this dissertation.

Control-Flow Graphs (CFGs)

Control-flow graphs (CFGs) [All70] are static representations of source code, frequently used
in compiler technology. In a CFG, the source code is divided into several so-called basic
blocks. Each basic block consists of all statements that are always executed together, i.e.,
new blocks start when the control flow changes (due to, e.g., an if or for statement). CFGs are
unweighted general graphs:

Notation 2.4 (Control-flow graphs (CFGs, see [All70]))
In a CFG, each basic block is represented as a node, and control dependencies are represented
as edges connecting these nodes.

Example 2.1: Figure 2.1(a) is the control-flow graph (CFG) from the example source code given
in Listing 2.1.

Programme-Dependence Graphs (PDGs)

Programme-dependence graphs (PDGs) [OO84] are static graphs as well, and they are typically
used in programme slicing [KL88] and optimisation [FOW87]. While CFGs reflect the pure control
structure of a programme, PDGs additionally incorporate dataflow-related information. To this
end, they require a finer level of granularity than CFGs, as dataflows might occur between the
individual statements within a basic block in a CFG. Like CFGs, PDGs are unweighted general
graphs:

Notation 2.5 (Programme-dependence graphs (PDGs, see [OO84]))
In a PDG, every statement forms its own node (with few exceptions). Further, there is a
dedicated entry node, and there are extra nodes representing every parameter of a method (in
this dissertation, in a software context, we use method interchangeably with function). A
control edge connects a node a with a node b if and only if the execution of node b depends on
node a. Besides control edges, nodes in PDGs are connected by means of edges of another type
(technically of another label; say, ‘data’ instead of ‘control’) when there is a dataflow
between the nodes.

Example 2.2: Figure 2.1(b) is the programme-dependence graph (PDG) from the example source code
given in Listing 2.1. Control dependencies are displayed by solid lines, data dependencies by
dashed lines.

Static and Dynamic Call Graphs

Call graphs can be either static or dynamic [GKM82]. A static call graph [All74] can be
obtained from the source code. It represents all methods of a programme as nodes and all
possible method invocations as edges. We deal with dynamic call graphs (sometimes also called
call trees) in this dissertation. They represent an execution of a particular programme and
reflect the actual invocation structure of the particular execution. Chapter 4 provides
detailed definitions for the various variants of call graphs. Without any further treatment, an
(unreduced) call graph is an unweighted rooted ordered tree. The main method of a programme is
the root, and the methods invoked directly are its children. Originally, the siblings are
ordered by the time of execution. However, unreduced call graphs become very large and are
typically reduced to smaller call graphs, which are weighted or unweighted general graphs.

In Chapter 4 (and as well in Chapters 6 and 7), we discuss call graphs in more detail,
including the different reduction techniques, further variants of the graphs and the question
how to actually derive such graphs from programme executions. As we will see in Section 4.2 and
Chapter 6, call graphs can also be defined at levels of granularity different from the method
level. For instance, basic blocks, classes or packages might form the nodes of a call graph.

    public static int mult(int a, int b) {
        int res = 0;
        int i = 1;
        while (i <= a) {
            res += b;
            i++;
        }
        return res;
    }

Listing 2.1: Example Java method performing an integer multiplication.

Figure 2.1: A control-flow graph (CFG) and a programme-dependence graph (PDG) for the method
int mult(int a, int b) as given in Listing 2.1. Panels: (a) CFG, (b) PDG.

Figure 2.2: (a) An unreduced dynamic call graph, (b) a call graph with a structure-affecting
bug and (c) with a frequency-affecting bug.
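
The step from an unreduced call tree to a smaller weighted general graph can be sketched as follows. This is a deliberately simple illustration and not one of the reduction techniques defined in Chapter 4: it assumes that a programme execution is available as a list of (caller, callee) invocation events, collapses all invocations between the same pair of methods into one edge, and annotates that edge with the call frequency. The class name CallGraphReduction, the method reduce, the edge key format "caller->callee" and the example trace are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: collapse a trace of method invocations into a
// weighted call graph with one edge per (caller, callee) pair.
final class CallGraphReduction {

    // Each trace entry is a two-element array {caller, callee}.
    // The result maps the edge key "caller->callee" to its call frequency.
    static Map<String, Integer> reduce(List<String[]> trace) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        for (String[] call : trace) {
            weights.merge(call[0] + "->" + call[1], 1, Integer::sum);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up trace: main calls a and b once, b calls a three times.
        List<String[]> trace = Arrays.asList(
                new String[] {"main", "a"},
                new String[] {"main", "b"},
                new String[] {"b", "a"},
                new String[] {"b", "a"},
                new String[] {"b", "a"});
        System.out.println(reduce(trace));
    }
}
```

The ordering of siblings by execution time is lost in this form, which is one instance of the information loss that any such reduction has to trade against graph size.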

2.2.2 Bugs, Defects, Infections and Failures in Software

In the field of debugging, one usually avoids the terms fault, bug and error, but distinguishes
between defects, infections and failures [Zel09]. In this frequently cited classification,
these terms have the following meaning:

• Defects are the places in the source code which cause a problem.
• Infections are incorrect programme states (usually triggered by defects).
• Failures are observable incorrect programme behaviour (e.g., a user experiences wrong
  calculations).

In this dissertation, we use the term bug when referring to different types of failing
behaviour. We now introduce a more detailed differentiation of our own (unless otherwise
stated), which is in particular useful when dealing with call-graph-based defect localisation:

• Crashing and non-crashing bugs [LYY+05]: Crashing bugs lead to an unexpected termination of
  the programme. Prominent examples include null-pointer exceptions and divisions by zero. In
  many cases, e.g., depending on the programming language, the respective defects are not hard
  to find: A stack trace is usually shown which gives hints where the infection occurred.
  Harder to cope with are non-crashing bugs, i.e., failures which lead to wrong results without
  any hint that something went wrong during the execution [LYY+05, LCH+09].
  As non-crashing bugs are hard to find, all approaches to localise defects with call-graph
  mining (including the ones proposed in this dissertation) focus on them and leave aside
  crashing bugs. However, when call graphs can be generated from crashing programme executions,
  there are no obstacles in localising the respective defects with call-graph-based techniques
  in the same way as non-crashing bugs.

• Occasional and non-occasional bugs: Occasional bugs are failures which occur with some but
  not with all input data. In the context of multithreaded programmes, occasional bugs can also
  arise when the programme input remains the same, but different thread schedules are executed.
  Finding occasional bugs is particularly difficult, as they are harder to reproduce, and more
  programme executions are necessary for debugging. Furthermore, they occur more frequently, as
  non-occasional bugs are usually detected early, and occasional bugs might only be found by
  means of extensive testing.
  As all call-graph-based defect-localisation techniques (including the ones proposed in this
  dissertation) rely on comparing call graphs of failing and correct programme executions, they
  deal with occasional bugs only. In other words, besides examples of failing programme
  executions, there needs to be a certain number of correct executions.
• Structure and call-frequency-affecting bugs (call-graph-affecting bugs): This distinction is
  particularly useful when designing call-graph-based defect-localisation techniques.
  Structure-affecting bugs are defects resulting in different structures (topologies) of the
  call graph where some parts are missing or occur additionally in failing executions. In
  contrast, call-frequency-affecting bugs (frequency-affecting bugs for short) are defects
  which lead to a change in the number of calls of a certain subtree in failing executions,
  rather than to completely missing or new substructures. In general, it happens frequently
  that a structure-affecting bug also affects the call frequencies (as a side effect) and vice
  versa. See Example 2.3 for an illustration of both kinds of defects. We also refer to the
  class of both kinds of defects, structure and frequency-affecting bugs, as
  call-graph-affecting bugs.
  While the call-graph-based techniques from the related work focus on structure-affecting
  bugs, we develop a defect-localisation technique in Chapter 5 that is able to localise both
  structure and frequency-affecting bugs.

• Call-graph and dataflow-affecting bugs: In contrast to call-graph-affecting bugs,
  dataflow-affecting bugs manifest themselves by infected data exchanged between programme
  components. In this dissertation, we focus on infected data values exchanged via method-call
  parameters or return values, e.g., cases where a method returns a wrong value.
  Dataflow-affecting bugs might affect the call graph as a side effect. Chapter 7 provides more
  details and examples.
  Pure dataflow-affecting bugs are usually not covered by call-graph-based defect localisation,
  but we present a technique in Chapter 7 which is able to discover both call-graph and
  dataflow-affecting bugs.

Example 2.3: The graphs in Figure 2.2 are representations from executions of the programme
given in Listing 1.1.

Figure 2.2(b) is a call graph where the call of method a from method main is missing, compared
to the original graph in Figure 2.2(a). This is an example of a structure-affecting bug. The
original cause for the infection might be a defective if-condition in the main method.

In the graph given in Figure 2.2(c), a defective loop condition or a defective if-condition
inside a loop in method b are typical causes for the increased number of calls of method a.
This is an example of a frequency-affecting bug.
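
The distinction drawn in Example 2.3 can be sketched as a pairwise comparison of two weighted call graphs, one from a correct and one from a failing execution. The sketch is an illustration only and not the localisation technique of this dissertation, which mines sets of graphs rather than comparing two of them. It assumes that each graph is given as a hypothetical map from an edge key "caller->callee" to a call frequency; the class name CallGraphDiff and the edge counts are made up and do not reproduce the exact graphs of Figure 2.2.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: classify how the call graph of a failing execution
// differs from that of a correct execution.
final class CallGraphDiff {

    enum Difference { NONE, STRUCTURE, FREQUENCY }

    static Difference compare(Map<String, Integer> correct,
                              Map<String, Integer> failing) {
        // Edges missing or occurring additionally: the topology differs.
        if (!correct.keySet().equals(failing.keySet())) {
            return Difference.STRUCTURE;
        }
        // Same edges, but at least one call frequency differs.
        if (!correct.equals(failing)) {
            return Difference.FREQUENCY;
        }
        return Difference.NONE;
    }

    public static void main(String[] args) {
        Map<String, Integer> correct = new HashMap<>();
        correct.put("main->a", 1);
        correct.put("main->b", 1);
        correct.put("b->a", 2);

        // The call of a from main is missing.
        Map<String, Integer> missingCall = new HashMap<>(correct);
        missingCall.remove("main->a");

        // Method b calls a far more often than in the correct execution.
        Map<String, Integer> moreCalls = new HashMap<>(correct);
        moreCalls.put("b->a", 20);

        System.out.println(compare(correct, missingCall));
        System.out.println(compare(correct, moreCalls));
    }
}
```

The check reports a structural difference first, so a bug that changes both the topology and the frequencies, which the text above notes happens frequently, is classified as structural here.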

2.2.3 Software Testing and Debugging

Software testing is an inherent part of the software development process [Som10]. The overall
aim of testing is to ensure that programmes (the programme examined is typically called the
programme under test) provide the functionality specified before, without leading to any
failures. The aim of debugging is to find and fix the defects that cause deviations from the
specification (failures). Software quality assurance (or validation and verification), which
includes testing and debugging, is a wide field of research of its own. We now briefly explain
some fundamental terms and techniques that are needed to understand the techniques discussed in
this dissertation.

Different Testing Approaches

Testing plays an important role in the whole software-development process, as it takes place at
all stages of the process, from coding to final tests before the software is released. In the
different stages of the software-development process, one does unit testing, component testing,
integration testing and system testing [Bei90]. Unit testing takes place during coding, and it
ensures that the smallest testable pieces of a programme produce the expected results.
Component testing does the same for larger agglomerations of units. Integration testing ensures
the correct functionality of several components. System testing looks at the functionality of a
whole software system. This can consist of one large programme or of a collection of
programmes; we consider mainly the first case in this dissertation, as each programme leads to
its own call graph.

Regression testing is performed when previous versions of a programme are available, along with
their tests [Bei90]. When only small parts of a programme are changed between two versions, one
can expect that most tests from the old version pass in the new version as well. Tests are
supposed to fail only where functionality was changed between the versions – all other failing
tests can be hints for real failing behaviour in a programme.

In this dissertation, we rely on system tests examining the executions of entire programmes.
However, when sets of call graphs from a smaller component than the whole programme can be
derived, there are no obstacles in principle to applying the call-graph-based techniques
developed in this dissertation. In real-world software projects, regression tests will
typically be used to perform system tests (and can be used to drive our defect-localisation
techniques), as tests from previous versions are frequently available.

Performing Software Tests

Software tests are formal procedures, consisting of programme inputs and expected outputs
[Bei90]. The programme inputs include system configurations, programme parameters as well as
files and user input read by the programme. The expected output includes everything that is
produced by the programme, such as files written and output displayed on screen. Designing
software tests is its own field of research, typically aiming at covering many different (but
non-overlapping) executions of a programme which execute parts of the source code that are as
large as possible.

To derive the expected output and to compare it with the actual output, one typically relies on
test oracles [How78]. Their purpose is to decide whether a certain execution yields any
observable problems, i.e., failures. Such an oracle can be a programme that produces the
correct result and compares it to the output from the programme under test, or it can be the
data itself that the programme under test is supposed to calculate, e.g., calculated manually.
Besides unexpected output, other kinds of observable problems such as deadlocks can be
considered to be a failure. Test oracles should be able to detect such behaviour as well.
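
A test oracle of the first kind, i.e., a programme that computes the correct result and compares it with the output of the programme under test, can be sketched as follows for the method mult from Listing 2.1. The class name MultOracle, the method passes and the chosen inputs are hypothetical. The reference computation a * b is an assumption about the intended specification, which the listing itself does not state.

```java
// Hypothetical sketch of a test oracle for the method mult of Listing 2.1.
final class MultOracle {

    // Programme under test, copied from Listing 2.1.
    static int mult(int a, int b) {
        int res = 0;
        int i = 1;
        while (i <= a) {
            res += b;
            i++;
        }
        return res;
    }

    // Oracle: compares the actual output with a reference result.
    // Returns true for a correct execution and false for a failure.
    static boolean passes(int a, int b) {
        int expected = a * b; // assumed specification
        return mult(a, b) == expected;
    }

    public static void main(String[] args) {
        int[][] inputs = { {0, 5}, {3, 4}, {7, -2} };
        for (int[] in : inputs) {
            String verdict = passes(in[0], in[1]) ? "correct" : "failing";
            System.out.println("mult(" + in[0] + ", " + in[1] + "): " + verdict);
        }
    }
}
```

For a negative first argument the loop in mult is never entered, so this oracle would report a failure; whether that counts as a defect depends on whether such inputs are within the specification. The verdicts of such an oracle are what separates correct from failing executions in the call-graph-based techniques discussed here.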

In this dissertation, we assume that both test cases and test oracles are available, as we
focus on the later defect-localisation step. This assumption is reasonable, since testing is an
inherent part of modern software development [Som10]. Furthermore, in most software projects,
(regression) system tests are available, including both test cases and test oracles.

Debugging

Debugging is a field of research of its own as well [Zel09]. It includes everything from
dealing with test cases, observing programme executions, localising defects and ultimately
fixing them. It has also been described as the process of relating a failure to an infection to
a defect [Zel09].

In this dissertation, we develop techniques for the defect-localisation part of debugging. That
is, we aim at helping software developers in localising defects in order to fix them once a
failing behaviour has been experienced.

2.3 Data Mining

We now introduce the foundations of data mining that are relevant in this dissertation. We
discuss the data-mining process and applied data mining (Section 2.3.1), selected data-mining
techniques for tabular data (Section 2.3.2) and finally frequent-pattern-mining techniques,
including graph mining (Section 2.3.3).

2.3.1 The Data-Mining Process and Applied Data Mining

The literature has proposed a number of data-mining process models (e.g., [CCK+00, FPSS96]).
Sometimes, the term data mining stands for a single step within a larger framework, frequently
called the process of knowledge discovery in databases.

The CRISP-DM Data-Mining Process Model

A well-known representative of data-mining process models is CRISP-DM (CRoss-Industry Standard
Process for Data Mining) [CCK+00], which has been proposed by an industry consortium. It
describes an iterative process with a number of loops going back to earlier stages: business
understanding, data understanding, data preparation, modelling (the actual data-mining step),
evaluation and deployment (see Figure 2.3). This process illustrates that data mining or
knowledge discovery consists of a number of stages apart from the actual modelling: At first,
one needs an understanding of the business or the domain of the application. Then, one needs to
understand the data one is working with or plans to collect. Next, one needs to prepare the
data so that it is suited for the data-mining algorithm chosen. Only when the chosen algorithm
leads to good evaluation results can the whole process be deployed.

Figure 2.3: The CRISP-DM data-mining process model [CCK+00]. Stages shown around the central
data: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation,
Deployment.

In this dissertation, the first five steps from the CRISP-DM process model are of relevance.
More specifically, the main contribution of this dissertation is not only in the modelling
part, but also in the data-preparation part of the process:

• Business understanding: At first, we have to develop an understanding for the principles of
  software technology and the nature of the various defects, infections and failures.
• Data understanding: When we know the domain, we have to understand which data is available
  and – as we are not faced with an industry project where the aim is to analyse data that is
  already available – which data we can collect, e.g., from programme executions.
• Data preparation: When we know which data is available and can be collected, we have to
  decide how to represent the data. In this dissertation, we develop a number of call-graph
  representations.
• Modelling: Depending on the data representation chosen, we can choose – or develop – an
  analysis technique or a combination of different techniques, such as frequent subgraph mining
  and feature selection.
• Evaluation: When all previous steps are done, we have to evaluate our approach consisting of
  all previous steps.

The next step would be to deploy the whole process, possibly in an industrial environment.
However, this part of the process is not in the focus of this dissertation.

Applied Data Mining

Besides research on the general data-mining process model, an increasingly important direction
of research is applied data mining, also called domain-specific mining [HG08] or domain-driven
data mining (D3M) [CYZZ10]. This direction of research partly builds on the recognition that
data-mining research in the past has mainly focussed on building new, faster and more precise
techniques, and that relatively little attention has been paid to actual real-world
applications [CYZZ10]. Another important recognition is that not much attention has been paid
to the integration of sophisticated scientific and engineering domain knowledge. Thus, specific
data representations as well as dedicated analysis techniques are deemed to be essential for
the success of applied data mining in any domain [HG08].

This dissertation is in the field of applied data mining in software engineering. The
call-graph representations developed in this dissertation are specific data representations
that incorporate domain knowledge relevant for the analysis problem. The analysis techniques
developed – either as a combination of existing techniques or as a new analysis technique – are
specific for the data representations developed beforehand.

2.3.2 Data-Mining Techniques for Tabular Data

Most (conventional) data-mining techniques deal with tabular data, i.e., the data to be
analysed is stored in tables as in relational databases, and one tuple (a row in the table)
refers to one object in the real world. A typical example is customer data, where one tuple
refers to one person, and the columns describe numerical or categorical properties such as age,
gross income and sex. Based on such data, various data-mining tasks can be defined, such as
classification, regression and cluster analysis – for every task there are many different
algorithms and implementations available (see, e.g., [BBHK10, HK00, HMS01, Mit97, WF05]). In
classification, the task is to predict an unknown categorical class of a tuple, e.g., if a
person is creditworthy or not, based on a collection of data from the past. In regression, the
task is to predict a numerical attribute. In cluster analysis, the task is to group (or
partition) the tuples into previously unknown groups (or partitions) of tuples that share the
same properties and have properties different from the other groups (or partitions).

Despite the big success of the data-mining tasks and techniques dealing with tabular data, not
every kind of real-world object can adequately be described using tuples in such a table. As an
example, chemical molecules can intuitively be described as a graph structure, where atoms are
the nodes, and bonds are the edges of a labelled graph. Of course, based on such a
representation, a number of measures can be derived and can be stored in a tuple of numerical
and categorical values. As an example, one could derive the number of nodes, the information
whether the graph contains cycles and maybe as well the total weight of a molecule. However,
although such a representation might be suited for certain applications, it does not keep all
information encoded in the corresponding graph representation. Therefore, the tabular
representation might not be suited for certain applications, or might not allow analyses of a
certain precision. The same applies to the software-engineering domain, where call graphs can
represent a programme execution more adequately than a table that contains, e.g., different
measures corresponding to information such as the number of methods called during an execution.

In this dissertation, we rely on techniques for the analysis of graphs (see Section 2.3.3) and make use of one conventional data-mining technique, feature selection, which we describe in the following.

Feature Selection

Many data-mining techniques suffer from the so-called "curse of dimensionality": They either do not scale well for high-dimensional data, or the data becomes less meaningful with an increasing number of dimensions [BGRS99]. Feature-selection techniques can help to reduce the number of dimensions in tabular data. They aim at finding subsets of attributes (tuple elements or columns in a table) that still describe the data well, or they aim at scoring these attributes by assigning them a score that measures their usefulness. In the following, we focus on such usefulness-scoring-based feature-selection techniques, as they are relevant for the analysis techniques developed in this dissertation.
Typically, the usefulness in feature selection is based on the attribute's ability to predict another column, e.g., a categorical class attribute. For the categorical case, this ability is also called discriminativeness. Respective measures are frequently used internally in decision-tree-induction algorithms, as they have to decide which attribute is best suited to build the next split on, in order to perform a good classification [BK98, Qui93]. Another source of such discriminativeness measures are techniques from statistics that measure the correlation between attributes. In the following, we introduce the information gain (InfoGain) and information-gain ratio (GainRatio) discriminativeness measures [Qui93] as two representatives of feature-selection algorithms with a high relevance in practice:
Definition 2.7 (Information Gain and Information-Gain Ratio (see [Qui93]))
Let $D$ be a data table. $C$ is one column in $D$ that associates each row (tuple) to a class. $D_C$ is the domain of $C$, and $D_{C=i}$ denotes the set of rows that belong to the $i$-th class ($i \in D_C$). Let $A$ denote any other column different from $C$, consisting of numerical values. The information gain (InfoGain) is a measure based on entropy (Info), and the information-gain ratio (GainRatio) in turn is based on InfoGain. Both measures measure the discriminativeness of an attribute $A$ when values $v \in A$ partition the dataset $D$. The partitioning is done in a way that the InfoGain of $A$ is maximised. This requires a discretisation of $A$'s values into $n$ intervals (see, e.g., [ER97] for more information on the discretisation), where $D_A$ is the domain of the discrete intervals of $A$ ($n = |D_A|$). $D_{A \in j}$ denotes the partition consisting of the set of rows of $D$ that belong to the $j$-th interval of $A$ ($j \in D_A$). The GainRatio normalises the InfoGain value by SplitInfo, which is the entropy (Info) of the discretisation of the attribute into intervals:


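\[ \mathit{Info}(D) := -\sum_{i \in D_C} \frac{|D_{C=i}|}{|D|} \cdot \log_2\!\left(\frac{|D_{C=i}|}{|D|}\right) \]

\[ \mathit{InfoGain}(A, D) := \mathit{Info}(D) - \sum_{j \in D_A} \frac{|D_{A \in j}|}{|D|} \cdot \mathit{Info}(D_{A \in j}) \]

\[ \mathit{SplitInfo}(A, D) := -\sum_{j \in D_A} \frac{|D_{A \in j}|}{|D|} \cdot \log_2\!\left(\frac{|D_{A \in j}|}{|D|}\right) \]

\[ \mathit{GainRatio}(A, D) := \frac{\mathit{InfoGain}(A, D)}{\mathit{SplitInfo}(A, D)} \]

Possible values of InfoGain and GainRatio are in the interval [0, 1]. Value 1 means that an attribute discriminates perfectly between the classes; at 0, an attribute has no influence on class discrimination. Opposed to GainRatio, the maximum value of InfoGain can only be 1 if the distribution of classes in $C$ is equal. In skewed class distributions, the maximum value of InfoGain is lower, even if the attribute perfectly discriminates.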
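The measures of Definition 2.7 can be illustrated with a minimal Python sketch. It assumes that the attribute A has already been discretised, so each row carries an interval index; the function names and the toy data are hypothetical and not taken from [Qui93]. Returning 0 when SplitInfo is 0 is a convention chosen for this sketch only.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy Info(D) of a list of labels (classes or interval indices)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(intervals, classes):
    """InfoGain(A, D) for an attribute A that is already discretised."""
    total = len(classes)
    remainder = 0.0
    for j in set(intervals):
        # Classes of all rows that fall into the j-th interval of A.
        part = [c for a, c in zip(intervals, classes) if a == j]
        remainder += (len(part) / total) * info(part)
    return info(classes) - remainder


def split_info(intervals):
    """SplitInfo(A, D): entropy of the discretisation itself."""
    return info(intervals)


def gain_ratio(intervals, classes):
    """GainRatio(A, D); defined as 0 in this sketch if SplitInfo is 0."""
    si = split_info(intervals)
    return info_gain(intervals, classes) / si if si > 0 else 0.0


# Hypothetical toy data: A discriminates perfectly in both cases.
balanced_a, balanced_c = [0, 0, 1, 1], ["fail", "fail", "pass", "pass"]
skewed_a, skewed_c = [0, 1, 1, 1], ["fail", "pass", "pass", "pass"]

print(info_gain(balanced_a, balanced_c), gain_ratio(balanced_a, balanced_c))
print(info_gain(skewed_a, skewed_c), gain_ratio(skewed_a, skewed_c))
```

In the skewed case the attribute still separates the classes perfectly, yet InfoGain stays below 1 while GainRatio reaches 1, which matches the remark on skewed class distributions.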

2.3.3 Frequent-Pattern-Mining Techniques
Opposed to tabular-data-mining techniques as introduced in Section 2.3.2, frequent-pattern-mining techniques [HCXY07] discover frequent or interesting patterns in databases of, e.g., itemsets, sequences, trees and graphs. These techniques can be seen as a hierarchy of mining techniques, as sequences generalise itemsets, trees generalise sequences, and graphs generalise trees. In the following, we introduce these techniques and some of their variations, as well as the foundations of constraint-based mining.

Itemset Mining

Itemsets and Association Rules. Itemset mining has been introduced in the context of association-rule mining [AMS+96]; probably the most prominent example for this task is market-basket analysis. In itemset mining, one analyses a database of transactions, and a transaction consists of one or more binary items. As an example, a supermarket transaction consists of a number of products bought. These products are called items. The idea of itemset mining is to identify items that were frequently bought together, where the notion of frequency is given by a user-defined minimum-support value ($supp_{min}$; the support might be either measured absolutely or as a ratio or percentage): Find all sets of items that are subsets of at least $supp_{min}$ transactions in a given database. A famous example [SA96a] for the itemset-mining problem is the discovery of a frequent itemset consisting of beer and diapers: {beer, diapers}.


In association-rule mining, one first generates frequent itemsets before these itemsets are split into association rules. As an example, the aforementioned itemset could be split as follows: {diapers} → {beer}, saying that people who buy diapers also buy beer. Besides the support value, association rules have a confidence value. This is the probability that an association rule holds, i.e., the number of transactions including all items from both sides of the rule divided by the number of transactions including all items from the left side. Legend has it [SA96a] that the confidence of the aforementioned rule could be remarkably high, while the overall support of the rule (the support from the union of both sides) would be relatively low.
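As a small illustration of support and confidence, the following Python sketch evaluates the rule {diapers} → {beer} on a hypothetical five-transaction database. The data and the function names are assumptions made for this example; the sketch also assumes that the left side of the rule occurs at least once, as it would otherwise divide by zero.

```python
def support(itemset, transactions):
    """Absolute support: number of transactions containing all items."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(left, right, transactions):
    """Confidence of the association rule left -> right."""
    return support(left | right, transactions) / support(left, transactions)


# Hypothetical toy database of supermarket transactions.
transactions = [
    {"beer", "diapers", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"bread"},
]

rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

Here two of the three transactions with diapers also contain beer, so the rule holds in two thirds of the relevant cases although only two of five transactions support it.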

Itemset-Mining Algorithms. The first and probably easiest algorithm for itemset mining is the a-priori algorithm [AMS+96]. It builds on the idea that the support of a subset of some itemset cannot be smaller than the support of its superset. The algorithm uses this idea in a level-wise approach: It first generates all frequent itemsets that consist of a single item (1-itemsets). Then, it uses these 1-itemsets to combinatorially generate all 2-itemsets. These 2-itemset candidates are then searched in the database in order to determine their actual support. The remaining 2-itemsets that are actually frequent are then saved and used to generate all potential 3-itemsets etc. Due to the repeated candidate generation and test for frequency, it is also said that the algorithm follows the generate-and-test paradigm.
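The level-wise generate-and-test scheme can be sketched in Python as follows. This is a simplified illustration under the assumption that the whole database fits in memory as a list of sets; the function name `apriori` and the toy data are hypothetical, and the sketch is not the optimised algorithm of [AMS+96].

```python
from itertools import combinations


def apriori(transactions, supp_min):
    """Return {itemset: support} for all itemsets with support >= supp_min."""

    def count(candidates):
        # One database scan: determine the actual support of each candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {item for t in transactions for item in t}
    singles = count({frozenset([i]) for i in items})
    level = {c: s for c, s in singles.items() if s >= supp_min}
    frequent = dict(level)
    k = 2
    while level:
        # Generate: join frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Test: keep only candidates that are actually frequent.
        level = {c: s for c, s in count(candidates).items() if s >= supp_min}
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical toy database.
db = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"milk", "bread"},
]
result = apriori(db, supp_min=2)
print(result)
```

Each pass of the loop corresponds to one level of the search, and every level needs a scan of the database to determine the support of its candidates.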
Despite its relatively simple approach for generating all frequent itemsets, the a-priori algorithm [AMS+96] has been criticised as it does not scale well. This is due to the potentially high number of costly database scans for determining the actual support of the itemsets. A number of further algorithms try to overcome this challenge, by different data representations and algorithm designs: The Eclat algorithm [Zak00] organises the transaction database into a subset lattice and performs a depth-first search in this data structure. The FP-growth algorithm [HPY00] makes use of a prefix-tree structure (frequent-pattern tree, FP-tree) for the transaction database. The algorithm then follows a divide-and-conquer approach to derive all frequent itemsets. One of the advantages of the FP-growth algorithm is that it relies on the so-called pattern-growth method, which replaces the costly generate-and-test approach: Only itemsets that occur at least once in the database are tested if they fulfil the $supp_{min}$ criterion. This is done efficiently in a subtree of the FP-tree storing the transaction database.

Quantitative Association Rules. Itemset mining and association-rule mining deal by default with binary data. That is, a certain item is part of an itemset or it is not. In market-basket analysis, as an example, it is not considered whether a certain product is contained in a transaction once or a hundred times. Quantitative association rules [SA96a] introduce numerical weights to items and discretise this information in order to deal with it in an extended association-rule-mining algorithm.


Sequence Mining

Sequence mining is a generalisation of itemset mining: Instead of analysing transactions consisting of sets of items, it analyses sequences of such transactions. In more detail, the task is to find all sequences that are subsequences of at least $supp_{min}$ sequences in a database of sequences [DP07]. One of the first applications was again market-basket analysis: When one is able to track customer purchases over time, the idea is to find frequent sequences of (sets of) items. For instance, it could be a frequent pattern that some customers first buy a digital camera and a camera bag, some time later a new lens and later on some filters for the new lens. Other applications include DNA-sequence analysis in biology and log-file analysis from web servers.

The first algorithm for sequence mining, AprioriAll [AS95], is a generalisation of the a-priori algorithm and has been proposed by the same authors. The GSP algorithm [SA96b] is an improvement and a generalisation for hierarchies of items, proposed by the same authors as well.
As GSP [SA96b] still follows the generate-and-test paradigm, it bears the same efficiency problems. Therefore, a number of different sequence-mining algorithms have been developed that aim at overcoming this challenge and/or propose further enhancements. One of the well-known variations discovers frequent episodes (i.e., partially ordered collections of events occurring together) instead of frequent subsequences [MTV97]. The SPADE algorithm [Zak01] relies on a vertical database format which allows for an optimised lattice-based search space. A sequence-mining algorithm that follows the pattern-growth approach is PrefixSpan [PHMA+04]. The authors have shown that their algorithm performs better than all aforementioned algorithms for sequence mining.

Frequent Subtree Mining

Frequent subtree mining is the next generalisation of sequence mining, or a special case of frequent subgraph mining (as introduced in the following). The idea is to discover frequent subtrees in a database of trees. As there are different kinds of trees (see Section 2.1.2), there are different techniques for mining them:


• Rooted ordered trees can be mined with the FREQT algorithm [AAK+02].
• Rooted unordered trees can be mined with the HybridTreeMiner [CYM04], with the algorithm Unot [AAUN03] and with uFREQT [NK03] (the two last-mentioned algorithms are based on FREQT).
• Unrooted unordered trees can be mined with the FreeTreeMiner [CYM03] and with the HybridTreeMiner [CYM04] as well. Furthermore, such trees can also be mined with arbitrary graph miners, as trees are special cases of graphs. In particular, Gaston [NK04] is suited for the analysis of trees, as this graph miner internally mines for trees before it extends them to general graphs.

In general, dedicated tree-mining algorithms can (but do not necessarily do) decrease the runtime of mining compared to the usage of general graph-mining algorithms. Algorithms for rooted ordered trees benefit in general most from the specifics of the respective trees, and those for rooted unordered trees less than those for rooted ordered trees. Algorithms for unrooted unordered trees benefit less than those for rooted unordered trees. Besides the tree-mining algorithms mentioned, there are more algorithms dedicated to deviating definitions of subgraph relationships and other special cases. Chi et al. present a comprehensive survey of tree-mining algorithms [CMNK05].

Frequent Subgraph Mining

Problem Definition and Algorithms. Frequent subgraph mining is the generalisation of all aforementioned frequent-pattern-mining techniques: Roughly speaking, itemsets are graphs without any edges ($E = \emptyset$), sequences are graphs consisting of paths only, and trees are graphs without cycles. Therefore, graph mining can be used for many applications, but suffers from the NP-complete subgraph-isomorphism problem (see Definition 2.4). As we rely on graph-mining techniques in this dissertation, we define the task more formally than the other pattern-mining techniques mentioned before:

Definition 2.8 (Frequent subgraph mining)
Let $D := \{g_1, \ldots, g_{|D|}\}$ be a graph database. Frequent subgraph mining is the task of finding all subgraph patterns $f \in F$ with a support of at least $supp_{min}$ in $D$. The support of a graph $f$ is $support(f, D) := |\{g \mid g \in D \wedge f \subseteq g\}|$. In short, $f \in F \iff support(f, D) \geq supp_{min}$.
Frequent subgraph mining (and other frequent-pattern-mining algorithms as well) is often used as a building block of some higher-level analysis task such as cluster analysis [AW10a] or graph classification [CYH10a]. With the latter, frequent subgraph patterns are mined from a set of classified graphs. A standard classifier is then learned on the subgraph features discovered.
Many algorithms have been proposed for frequent subgraph mining. The first algorithms, AGM [IWM00] and FSG [KK01], rely on the generate-and-test paradigm known from the a-priori algorithm. They therefore implicitly follow a breadth-first-search strategy. All more recent algorithms rely on a depth-first search. These algorithms include FFSM [HWP03], gSpan [YH02] (see also the following paragraph about pattern-growth algorithms) and its extension CloseGraph [YH03] (see also the section on closed mining), Gaston [NK04], MoFa [BB02] and its extension MoSS [BMB05]. Four of the more recent algorithms mentioned have been compared experimentally by independent scientists using a number of different datasets [WMFP05]. The result is that Gaston and gSpan are mostly the algorithms with the best runtime behaviour, depending on both the nature of the graph databases analysed and the memory architecture of the machine used for the execution. We focus on gSpan and its variations in the following, as it performs well and is more widely used in the scientific community than Gaston.

The comparison in [WMFP05] and the survey in [YH06] contain more information about the frequent-subgraph-mining algorithms mentioned. We also look at some more recent algorithms in the related-work chapter (Section 3.2.2).

Pattern-Growth Algorithms. Algorithm 2.1 depicts the basic steps of a generic pattern-growth-based frequent-subgraph-mining algorithm [YH06]. The idea is that starting from an empty graph pattern p, the current pattern is in each step extended in several ways by exactly one edge, leading to new frequent subgraphs. They are then processed recursively, corresponding to a depth-first search. Concretely, Lines 1–2 check if the current graph pattern is already contained in the result set, Line 4 adds patterns to the result set, and Line 5 extends the current pattern, leading to a set of frequent patterns P. The algorithm then processes them recursively in Lines 6–7 and stops in Line 9 when P is empty.
Algorithm 2.1 pattern-growth(p, D, supp_min, F)
Input: current pattern p, database D, supp_min
Output: result set F
1: if p ∈ F then
2:     return
3: end if
4: F = F ∪ {p}
5: P = extend-by-one-edge(p, D, supp_min)
6: for all p′ ∈ P do
7:     pattern-growth(p′, D, supp_min, F)
8: end for
9: return

Algorithm 2.1 performs a depth-first search, whose search space is visualised in Figure 2.4. In this search space, the root is the empty graph ($V = \emptyset$, $E = \emptyset$). Each other node corresponds to a non-empty graph, and Algorithm 2.1 is called once for each node, while it generates the children of a node and calls itself recursively. The leaves are not extended further due to the $supp_{min}$ criterion. In the generic Algorithm 2.1, the same graph might be generated several times at different places within the search space. We call such graphs duplicates.

Figure 2.4: A pattern-growth search space. (The figure shows a search tree rooted at the empty graph {}, with levels labelled 0-edge, 1-edge and 2-edge, and two marked nodes s and s′.)

Example 2.4: Imagine that node s in Figure 2.4 stands for the graph a→b→c and was generated from the graph a→b (its parent node) by extending it with edge b→c. Node s′ stands for graph a→b→c as well, but was generated from graph b→c. Node s′ is therefore a duplicate. Lines 1–2 in Algorithm 2.1 check for duplicates and prune the search space. Thus, the child from node s′ is actually not generated.
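The generic pattern-growth scheme and the duplicate from Example 2.4 can be reproduced with a short Python sketch. It makes a strong simplifying assumption that is not part of Algorithm 2.1: node labels are unique, so a graph is a frozenset of directed edges and the subgraph test reduces to a subset test, which avoids subgraph isomorphism entirely. The `duplicates` list is a hypothetical addition that only records pruned patterns for illustration.

```python
def support(pattern, database):
    """Number of database graphs that contain every edge of the pattern."""
    return sum(1 for g in database if pattern <= g)


def extend_by_one_edge(pattern, database, supp_min):
    """All frequent patterns that add exactly one connected edge."""
    nodes = {n for edge in pattern for n in edge}
    extensions = set()
    for g in database:
        if not pattern <= g:
            continue
        for edge in g - pattern:
            # The empty pattern may grow by any edge; otherwise the new
            # edge has to touch a node of the current pattern.
            if not pattern or nodes & set(edge):
                candidate = pattern | {edge}
                if support(candidate, database) >= supp_min:
                    extensions.add(candidate)
    return extensions


def pattern_growth(pattern, database, supp_min, result, duplicates):
    if pattern in result:                # Lines 1-2: duplicate check
        duplicates.append(pattern)
        return
    result.add(pattern)                  # Line 4
    children = extend_by_one_edge(pattern, database, supp_min)  # Line 5
    for child in children:               # Lines 6-7
        pattern_growth(child, database, supp_min, result, duplicates)


# Hypothetical toy database with two graphs.
g1 = frozenset({("a", "b"), ("b", "c")})
g2 = frozenset({("a", "b"), ("b", "c"), ("c", "d")})
result, duplicates = set(), []
pattern_growth(frozenset(), [g1, g2], 2, result, duplicates)
print(len(result), duplicates)
```

The pattern a→b→c is reached once from a→b and once from b→c; the second visit is caught by the duplicate check and ends up in `duplicates`.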
Although Lines 1–2 in Algorithm 2.1 identify duplicates, which avoids processing the same graphs repeatedly, this check for duplicates is computationally expensive and should be avoided in order to construct a fast algorithm. The extension of graphs should therefore be as conservative as possible, while it still has to guarantee to generate all graphs. Different algorithms use different strategies for this expansion of graphs, and we focus on the strategy from the gSpan algorithm [YH02] in the following, as we use this algorithm in this dissertation.
gSpan [YH02] uses a strategy for generating (extending) graphs that is based on depth-first search (DFS) in graphs. In general, one can perform different depth-first searches in the same graph, resulting in different depth-first-search trees (DFS trees). Such a DFS tree can unambiguously be represented as an ordered list of edges (ordered by the discovery time during search). gSpan uses a set of rules for generating (extending) graphs, which relies on the order in a depth-first search and on extending the graph only along the rightmost path in a DFS tree: the DFS-lexicographic order. This ensures that one graph is always traversed the same way. When graphs are constructed in this way, it is guaranteed that the frequent-pattern-mining procedure does not extend graphs already discovered.

Closed Mining

Closed mining is an important concept in frequent-pattern mining, as it bears the potential of generating result sets with less redundancy in a faster runtime than non-closed algorithms. In the following, we define closed mining with graphs, but the concept exists for all frequent-pattern-mining techniques. As two examples, the CloSpan algorithm [YHA03] performs closed sequence mining, and the CMTreeMiner algorithm [XY05] performs closed mining for rooted unordered trees.
Definition 2.9 (Closed Graph Mining)
Closed-frequent-subgraph-mining algorithms discover only subgraph patterns which are closed. A graph $f$ is closed if and only if no other graph pattern $f'$ is part of the result set $F$ which has exactly the same support and is a proper supergraph of $f$ ($f \subset f'$).
Closed mining algorithms produce result sets that might be more concise (smaller), for the following reason: The result sets are free of redundancy in the sense that the complete set of frequent subgraphs can be derived from the set of closed subgraphs. In concrete terms, the complete set can be obtained by systematically removing edges (along with the incident nodes when they become unconnected) from all graphs in the closed result set and adding these new graphs to the non-closed result.
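The closedness criterion of Definition 2.9 can be illustrated by filtering an already mined result set. The sketch below is not how CloseGraph works internally; it is a post-processing filter under the simplifying assumption that each pattern is a frozenset of uniquely labelled edges, so that "proper supergraph" becomes "proper superset". The pattern names and support values are hypothetical.

```python
def closed_patterns(frequent):
    """Keep only patterns without a proper superset of identical support.

    frequent maps each frequent pattern (a frozenset) to its support.
    """
    return {
        f: s
        for f, s in frequent.items()
        if not any(f < other and s == s_other
                   for other, s_other in frequent.items())
    }


# Hypothetical result set of a frequent-pattern miner.
frequent = {
    frozenset({"a->b"}): 3,
    frozenset({"b->c"}): 2,
    frozenset({"a->b", "b->c"}): 2,
}
closed = closed_patterns(frequent)
print(closed)
```

The pattern b→c is dropped because its supergraph a→b→c has the same support, whereas a→b is kept because it occurs more often than any of its supergraphs.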
The CloseGraph algorithm [YH03] is an extension of gSpan [YH02] that makes use of pruning opportunities and speeds up mining in many situations. However, there are cases where the result set from closed mining is not (or not much) different from the set of all frequent subgraphs. In such cases, the additional effort for checking for closedness and no (or only few) pruning opportunities might slow down the algorithm. In general, the probability for such situations increases with increasing size of the underlying graph database. This is because the probability that two subgraphs have the same support decreases when graph databases increase in size. In this dissertation, we use CloseGraph for mining databases of call graphs. We do so as our graph databases are relatively small, and we therefore do not expect to suffer from the effect described before. Furthermore, in preliminary experiments, CloseGraph has produced defect-localisation results that are not worse than those when employing gSpan, in a runtime that has indeed been faster.

Constraint-Based Mining

Constraint-based mining is another important concept in frequent-pattern mining, as it allows for faster runtimes and result sets focused on the user's needs. However, constraint-based mining requires the user to specify a constraint, and not all constraints can be easily integrated into mining algorithms. Constraint-based mining has originally been introduced for itemset mining [NLHP98], and has been carried forward to sequences (e.g., [GRS99, PHW02]) and graphs (see Section 3.2.3).
Definition 2.10 (Constraint-Based Mining)
A constraint $c$ in constraint-based mining is a Boolean predicate which any $f \in F$ must fulfil, where $F$ is the result set. Formally, in constraint-based mining, $f \in F \iff (support(f, D) \geq supp_{min} \wedge c(f) = \mathit{true})$, where $D$ is the database.

Constraint predicates can be categorised into several classes. We now introduce the most important one, anti-monotonicity, and briefly mention some further constraint classes:

Definition 2.11 (Anti-Monotone Constraints (see [NLHP98]))
A constraint $c$ is anti-monotone $\iff (\forall f' \subseteq f : c(f) = \mathit{true} \Rightarrow c(f') = \mathit{true})$, where $F$ is the result set.
Example 2.5: A prominent example of anti-monotone constraints is the frequency criterion: If a graph has a support of at least $supp_{min}$, all its subgraphs have the same or a larger support. Therefore, anti-monotone constraints are the basis for all a-priori and pattern-growth mining algorithms: They stop extending patterns when the current one does not satisfy the constraint, without missing any patterns.
The class of monotone constraints [NLHP98] as a complement to anti-monotone constraints exists as well (but there are constraints that are neither anti-monotone nor monotone). However, monotone constraints are less useful for pruning.

Another class are succinct constraints [NLHP98], which are orthogonal to the aforementioned constraint classes. Respective patterns that fulfil such constraints can be enumerated before the support is counted in a graph database.
Example 2.6: In constraint-based itemset mining, a succinct constraint could be $c(f) := x \in f$, where $f \in F$. That is, only itemsets that include item $x$ are supposed to be in the result set. In a-priori-style algorithms, this can be tested before support counting, which speeds up mining significantly.
Another class of constraints are convertible constraints [PHL04]. They have been introduced for itemsets, and focus on aggregate constraints that build on functions such as average, median and sum, referring to numeric annotations of the items. These annotations are fixed for every item, no matter in which transaction they occur; an example would be the price of a certain item. Although convertible constraints are less suited to prune the search space, they can be used to speed up the FP-growth algorithm [HPY00] for itemset mining [PHL04].


3 Related Work

This dissertation is about domain-specific data mining, in particular about software defect localisation. Therefore, we describe related work in the application domain, i.e., various defect-localisation techniques (Section 3.1), as well as related data-mining techniques, in particular different approaches for subgraph mining (Section 3.2). We furthermore discuss related work that is closely related to ours in Chapters 4 and 5, i.e., other call-graph-based defect localisation techniques.

3.1 Defect Localisation

Defect-localisation techniques are either static or dynamic [Bin07]. Dynamic techniques rely on the analysis of programme runs while static techniques do not require any execution. An example for a static technique is source-code analysis. It can be based on code metrics or different graphs representing the source code, e.g., static call graphs, control-flow graphs or programme-dependence graphs (see Section 2.2.1). Dynamic techniques usually trace some information during a programme execution which is then analysed. This can be information on the values of variables, branches taken during execution or code segments executed. A further distinction of defect-localisation techniques is the level of granularity: While some techniques identify classes or methods with an increased likelihood to be defective, other techniques identify defects at a finer level of granularity, e.g., statements, lines of code or blocks of statements.

It is worth mentioning that no defect-localisation technique is perfect in the sense that it is able to localise any kind of defect. A study comparing different static approaches by Rutar et al. [RAF04] came to the conclusion that none of the tools they have investigated strictly subsumes one of the others. The same applies to dynamic techniques: Santelices et al. [SJYH09] have compared several dynamic approaches and similarly came to the conclusion that no single approach performs best for all kinds of defects. The different defect-localisation techniques described in this section, as well as the ones proposed in this dissertation, can therefore be considered to be orthogonal to each other. A combination of different techniques will probably be the most effective way to do defect localisation in practice.

In the remainder of this section we discuss a selection of different static and dynamic defect-localisation techniques (Sections 3.1.1 and 3.1.2, respectively). We then briefly introduce some related work on localising defects in multithreaded programmes (Section 3.1.3).

3.1.1 Static Approaches

Mining Software Metrics and Software Repositories

Software-complexity metrics are measures derived from the source code, describing, e.g., the complexity, quality or maintainability of a programme or its methods. The software-engineering community has been very active in defining such metrics [HS95, Jon08], but they are typically not intended to facilitate a defect localisation. However, in many cases, complexity metrics correlate with defects in software [NBZ06, ZNZ08].

A standard technique in the field of mining software repositories is to map post-release failures from a bug database to defects in static source code from a version-management system. Such a mapping has been done, for instance, by Nagappan et al. [NBZ06]. The authors derive standard complexity metrics from source code and build principal-component models based on them and on the information if the software entities considered contain defects. The principal-component models can then predict post-release failures for new pieces of software. However, the authors discover that every project has its specific set of complexity metrics well suited for defect localisation. These sets of metrics can only be used within newer versions of the same project or within very similar projects. In particular, there is no universal set of metrics or even a single metric that is suited for defect predictions for any software project.

Studies related to the one of Nagappan et al. [NBZ06] are, for instance, the ones by Knab et al. and Schröter et al. Knab et al. [KPB06] use decision trees to predict failure probabilities. The approach by Schröter et al. [SZZ06] uses regression techniques to predict the likelihood of defects based on static usage relationships between software components.

All these approaches give hints on code quality issues rather than pinpointing actual defects. Furthermore, they require a large collection of defects and version history. Concerning the level of granularity, different software-repository-mining approaches focus on different levels of abstraction. However, many complexity metrics are defined at the method level, thus deriving defect localisations at this level of granularity.

Syntactic Defect-Pattern Detection

As a number of typical defect-prone programming patterns are known, there are a number of tools that heuristically search for syntactic defect patterns [RAF04]. FindBugs [AHM+08] from Ayewah et al., for instance, is a well-known representative. It analyses static Java source code and generates a number of warnings presented to the user. The tool can seamlessly be deployed within integrated development environments (IDEs) such as Eclipse. However, FindBugs typically does not find more sophisticated logical defects and it frequently produces a sheer number of warnings, leading to a high rate of false positives [RAF04]. Nevertheless, FindBugs can help to discipline programmers writing less defect-prone code, and the tool has been successfully used in a large-scale industrial setting [AP10]. FindBugs delivers defect localisations at a very fine granularity, identifying statements or lines which might be defective.

Mining of Programme-Dependence Graphs (PDGs)

The work of Chang et al. [CPY08] focuses on discovering neglected conditions. They are a class of defects which are in many cases non-crashing occasional bugs. An example of a neglected condition is a forgotten case in a switch statement. Chang et al. work with static programme-dependence graphs (PDGs, see Section 2.2.1) and utilise graph-mining techniques. PDGs are graphs describing both control and data dependencies (edges) between elements (nodes) of a method or of an entire programme.

The idea behind [CPY08] is to first determine conditional rules in a software project. These are rules (derived from PDGs, as we will see) occurring frequently within a project, representing fault-free patterns. Then, rule violations are searched, which are considered to be neglected conditions. This is based on the assumption that the more a certain pattern is used, the more likely it is to be a valid rule. To put these ideas into practice, the authors develop a heuristic frequent-subgraph-mining algorithm and apply it to a database of PDGs. In their approach, an expert has to confirm and possibly edit the rules found by the algorithm. Finally, a heuristic graph-matching algorithm, which is developed by the authors as well, searches the PDGs to find the rule violations in question. This leads to fine-grained defect localisations at the statement level.

From a technical point of view, it is notable that there are no guarantees for the two heuristic algorithms: It remains unclear in which cases graphs are not found by the algorithms. Furthermore, the approach requires an expert to examine the rules, typically hundreds, by hand. The algorithms do work well in the evaluation of the authors, but are not compared to related work. Though graph-mining techniques similar to dynamic-call-graph mining as investigated in this dissertation are used in [CPY08], the approaches are not related. The work of Chang et al. relies on static PDGs. They do not require any programme execution, as dynamic call graphs do.


3.1.2 Dynamic Approaches

Many dynamic defect-localisation approaches have been evaluated with a set of small C programmes, ranging from 200 to 700 lines of code (LOC), which were originally introduced by Siemens Corporate Research [HFGO94]. These so-called Siemens Programmes provide a number of artificially introduced defects along with a number of test cases. They can be seen as a standard benchmark, although the programmes are rather small and the defects are realistic but artificial.

Delta Debugging

Delta debugging is a general strategy invented by Zeller [Zel99] for systematically searching for causes of failures, following the trial-and-error principle. It does so by determining the relevant difference between two configurations with respect to a given test. A configuration in this context can be, e.g., a programme input, user interactions, a thread schedule, code changes or programme states.

Looking at delta debugging with programme inputs as an example [ZH02], one searches for the minimal difference between an input that leads to a failure and an input that leads to a correct execution. To this end, one has to provide a test oracle (see Section 2.2.3) that decides whether a programme execution is correct or failing, as well as a failing and a passing programme input. Delta debugging then finds two programme inputs leading to correct and failing results with a minimal difference. This information can be used to ease manual debugging. As an example, when the programme investigated is a compiler and the input data is some source code, the difference in the input source code is probably related to some statement. The defect is then likely to be located in the parts of the compiler handling this kind of statement.
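To give an impression of the trial-and-error principle, the following Python sketch shows a simplified input-minimising variant of delta debugging (often called ddmin). It is an assumption-laden illustration, not the procedure described above: instead of isolating a minimal difference between a passing and a failing input, it only shrinks a failing input while a user-supplied oracle keeps reporting a failure. The function name, the oracle and the toy input are hypothetical.

```python
def ddmin(failing_input, test):
    """Shrink failing_input (a list) while test(...) still returns 'fail'."""
    data = list(failing_input)
    n = 2  # current number of chunks
    while len(data) >= 2:
        chunk = max(len(data) // n, 1)
        subsets = [data[i:i + chunk] for i in range(0, len(data), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if test(subset) == "fail":
                # A single chunk already fails: continue with it.
                data, n, reduced = subset, 2, True
                break
            if test(complement) == "fail":
                # Removing one chunk keeps the failure: drop that chunk.
                data, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(data):
                break  # finest granularity reached, nothing can be removed
            n = min(n * 2, len(data))
    return data


def oracle(chars):
    """Hypothetical test oracle: the run fails if both 'x' and 'y' occur."""
    return "fail" if "x" in chars and "y" in chars else "pass"


minimal = ddmin(list("abxcdyef"), oracle)
print(minimal)
```

The remaining elements are those that cannot be removed individually without making the failure disappear, which is the kind of hint that eases manual debugging.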
Multithreaded software introduces indeterminism to a programme execution (see Section 3.1.3). That is, there are a huge number of possible thread interleavings, and failures might only occur when internally a certain interleaving is executed. In [CZ02], the authors present a delta-debugging approach which is able to identify failure-inducing thread interleavings. In concrete terms, they use the DEJAVU capture/replay tool [CS98] in order to record the thread interleaving and to replay it deterministically. By systematically varying these replays, delta debugging localises infections, i.e., locations where a thread switch causes the programme to fail. This gives hints where the actual defect might be located within the source code, without directly pinpointing this location. However, Tzoref et al. [TUYT07] have shown that approaches building on varying thread interleavings and delta debugging do not scale for large software projects.
Similarly to programme inputs and thread interleavings, delta debugging has been applied to programme changes [Zel99]: When one version of a programme is available that executes correctly and a version that fails, e.g., from a version-control system, delta debugging can reveal the actual change that causes the failing behaviour. This requires the availability of different versions of the same programme.
The delta-debugging technique whose setting is probably closest to the defect-localisation approaches developed in this dissertation is [CZ05]. It does not rely on programme inputs, different thread interleavings or programme versions, but requires a programme with an occasional bug along with respective test cases only. The technique extends earlier work of the authors [Zel02]: It applies delta debugging to programme states, represented by the current variable values. To this end, it represents the variable values by means of so-called memory graphs. Then, the approach systematically modifies the memory graphs, i.e., the programme states of running programmes, using a debugger. To do so, it employs the delta-debugging strategy to compute minimal differences of memory graphs of correct and failing executions, i.e., variables. In [CZ05], the authors then investigate cause transitions. They provide a means to localise the defect in the source code which later leads to the infected variable identified by delta debugging on memory graphs. This leads to defect localisations at the fine-grained granularity level of statements.

The authors evaluate the approach based on cause transitions and delta debugging [CZ05] with the Siemens Programmes. It outperforms another more basic defect-localisation approach. However, as we will see in the following, different complementary dynamic approaches outperform delta debugging on the same benchmark programmes.

From Coverage Analysis to Sequence Analysis

Statement-Coverage Analysis. Coverage analysis can be seen as the basis for many dynamic approaches, including the call-graph-based ones discussed in this dissertation. Tarantula from Jones et al. [JHS02] is such a technique, using tracing and visualisation. To localise defects, it utilises a ranking of programme components which are executed more often in failing executions. This is then used to visualise the source code for the programmer, using different colours and intensities. In more detail, a programme component is a basic block in a control-flow graph (see Section 2.2.1), i.e., a sequence of statements always executed conjunctively. Tarantula calculates the defect-likelihood for a basic block e as follows:
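\[ P_{\mathrm{Tarantula}}(e) := \frac{\dfrac{\mathit{failed}(e)}{\mathit{totalfailed}}}{\dfrac{\mathit{passed}(e)}{\mathit{totalpassed}} + \dfrac{\mathit{failed}(e)}{\mathit{totalfailed}}} \]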
where passed(e) is the number of correct executions that have executed basic block e at least once, and failed(e) similarly refers to failing executions. totalpassed and totalfailed are the total numbers of programme executions that are correct and failing, respectively.
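The suspiciousness score can be computed directly from coverage counts, as the following Python sketch illustrates. The block names and counts are hypothetical, and scoring a block that was never executed with 0 is an assumption of this sketch, since the formula itself is undefined in that case.

```python
def tarantula(passed, failed, total_passed, total_failed):
    """Suspiciousness P_Tarantula(e) of one basic block e."""
    fail_ratio = failed / total_failed
    pass_ratio = passed / total_passed
    if fail_ratio + pass_ratio == 0:
        return 0.0  # assumption: never-executed blocks are not suspicious
    return fail_ratio / (pass_ratio + fail_ratio)


# Hypothetical coverage data: block -> (passed(e), failed(e)),
# gathered from 8 correct and 2 failing executions.
coverage = {"b1": (8, 2), "b2": (0, 2), "b3": (8, 0)}
scores = {
    block: tarantula(p, f, total_passed=8, total_failed=2)
    for block, (p, f) in coverage.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Block b2 is executed by failing runs only and therefore ranks first, while b3 is executed by correct runs only and ranks last.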


For the source-code visualisation, Tarantula uses different colours depending on the suspiciousness value $P_{\mathrm{Tarantula}}$. For instance, source code with value 1 is visualised in red. For a further differentiation of the visualisation, Tarantula uses an additional brightness score. However, in the experiments by the authors, they only use the suspiciousness value $P_{\mathrm{Tarantula}}$ [JH05, JHS02] to rank the statements. The brightness score is calculated as follows:

\[ \mathit{brightness}(e) := \max\!\left(\frac{\mathit{passed}(e)}{\mathit{totalpassed}}, \frac{\mathit{failed}(e)}{\mathit{totalfailed}}\right) \]

While the Tarantula technique is relatively simple, it produces good defect-localisation results. In an evaluation conducted by the authors [JH05] based on the Siemens Programmes, it has outperformed five competing approaches, including a delta-debugging approach [CZ05]. However, it does not take into account how often a statement is executed within one programme run. This might miss certain defects such as frequency-affecting bugs. In general, Tarantula derives defect localisations at a basic-block level, but the authors also describe how to map these results to the method level [JH05].
Abreu et al. [AZGvG09] aim at improving Tarantula [JHS02] by evaluating different scoring functions besides $P_{\mathrm{Tarantula}}$ within the same framework. Most importantly, they have investigated the Jaccard coefficient known from statistics and the Ochiai coefficient which is typically used in molecular biology:

$$P_{\mathrm{Jaccard}}(e) := \frac{\mathit{failed}(e)}{\mathit{totalfailed} + \mathit{passed}(e)}$$
$$P_{\mathrm{Ochiai}}(e) := \frac{\mathit{failed}(e)}{\sqrt{\mathit{totalfailed} \cdot (\mathit{failed}(e) + \mathit{passed}(e))}}$$
Based on experiments with the Siemens Programmes, Abreu et al. [AZGvG09] have found that the Jaccard coefficient performs better than Tarantula and the Ochiai coefficient performs even better than the Jaccard coefficient.
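For comparison, the two coefficients can be written down in a few lines of Python. This is only a sketch of the formulas as stated in the text; the function names and the example counts are hypothetical, and returning 0.0 for a zero denominator is our own convention.

```python
import math


def jaccard(passed_e, failed_e, total_failed):
    """Jaccard coefficient of basic block e."""
    denominator = total_failed + passed_e
    return failed_e / denominator if denominator else 0.0


def ochiai(passed_e, failed_e, total_failed):
    """Ochiai coefficient of basic block e."""
    denominator = math.sqrt(total_failed * (failed_e + passed_e))
    return failed_e / denominator if denominator else 0.0


# Hypothetical block: covered by 1 correct and 2 failing executions,
# with 2 failing executions in total.
j = jaccard(passed_e=1, failed_e=2, total_failed=2)
o = ochiai(passed_e=1, failed_e=2, total_failed=2)
```

Both functions only need passed(e), failed(e) and totalfailed, so they can replace the Tarantula score in the same ranking framework without further instrumentation.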

Sequence-Analysis Approaches. Dallmeier et al. present AMPLE [DLZ05]. The technique refines coverage analysis and analyses sequences of method calls. The authors demonstrate that the temporal order of calls is more promising to analyse than statement coverage only. More concretely, AMPLE compares object-specific sequences of incoming and outgoing object calls, using a sliding-window approach. Then, it derives a ranking at the granularity level of classes, which is much coarser than methods, basic blocks or statements. This ranking is based on the information which objects (i.e., instances of classes) differ the most between passing and failing runs regarding their statement sequences.
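The sliding-window idea can be sketched as follows. This is a strongly simplified illustration, not the weighting scheme of AMPLE [DLZ05]: the helper names, the window length and the score (share of windows that occur only in the failing run) are assumptions made for this example.

```python
def windows(calls, k):
    """Set of all length-k windows over a call sequence."""
    return {tuple(calls[i:i + k]) for i in range(len(calls) - k + 1)}


def rank_classes(passing_runs, failing_run, k=2):
    """Rank classes by the share of windows seen only in the failing run.

    passing_runs: list of dicts, class name -> call sequence
    failing_run:  dict, class name -> call sequence
    """
    scores = {}
    for cls, calls in failing_run.items():
        seen = set()
        for run in passing_runs:
            seen |= windows(run.get(cls, []), k)
        failing = windows(calls, k)
        new = failing - seen
        scores[cls] = len(new) / len(failing) if failing else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical traces: call sequences per class.
passing = [{"Stack": ["push", "push", "pop"], "Log": ["open", "write", "close"]}]
failing = {"Stack": ["push", "pop", "pop"], "Log": ["open", "write", "close"]}
result = rank_classes(passing, failing)
```

In this toy data, the window (pop, pop) occurs only in the failing run, so the class Stack is ranked above Log.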


3.1. DEFECT LOCALISATION

A fairly recent approach is RAPID from Hsu et al. [HJO08]. It directly extends the Tarantula approach [JHS02]. RAPID first calculates $P_{\mathrm{Tarantula}}$ values for all statements and then filters all statements having a value of less than 0.6. Based on the remaining statements having an increased likelihood to be defective, it derives maximum common subsequences in the programme execution traces. To this end, RAPID utilises the BIDE sequence-mining algorithm [WH04]. Finally, RAPID presents the sequences to the user, starting with those containing the highest ranked statements according to $P_{\mathrm{Tarantula}}$. This aims at providing contextual information referring to execution sequences for the developer, making defect localisation easier than only providing possibly defective statements or lines. However, even though the technique seems to be promising, to our knowledge, it has never been evaluated comprehensively. Lo et al. [LCH+09] also deal with sequences, but present a failure-detection approach. That is, it does not localise defects, but decides whether an execution is correct or not.

Dataflow-Path Analysis. Masri [Mas09] proposes a dataflow-focused approach which has some similarities to sequence analysis. He performs a dynamic analysis of dataflows between statements to detect defects in source code. To this end, he works with dataflow paths, which are similar to sequences. They comprise frequency, source and target types (e.g., branch, statement) and the length of the executed dataflow path. Specifically, Masri compares sub-paths of dataflows of correct and failing executions to rank defect positions at the granularity level of statements, with a mechanism similar to the one in [JHS02]. However, Santelices et al. [SJYH09] describe that the monitoring of dataflows as done by Masri [Mas09] is much more expensive than more lightweight approaches such as Tarantula [JHS02]. Dataflow-enabled call graphs as proposed in Chapter 7 cover dataflow information besides the control-flow-related call-graph structure. However, this is done differently than in [Mas09]. The approach by Masri [Mas09] is therefore complementary to the work presented in this dissertation.

Subsumption. Both approaches, statement-coverage analysis and sequence analysis, can be seen as a basis for the more sophisticated call-graph-based techniques we focus on in this dissertation: The usage of sequences instead of statement coverage is a generalisation which takes more structural information into account. This structural information, also referred to as a context, eases the manual debugging process [HJO08]. Call-graph-based techniques in turn cover more complex structural information (encoded in the graphs) than sequences. Likewise, a subgraph context provided besides a more fine-grained defect localisation likely eases the manual debugging activities.


Statistical Defect Localisation
Statistical defect localisation is a family of dynamic techniques which make use of more detailed information than coverage-analysis techniques. Such techniques are based on instrumentation of the source code, which allows capturing the values of predicates or other execution-specific information, so that patterns can be detected among the variable values. Daikon from Ernst et al. [ECGN01] uses such an approach to discover programme invariants. This problem is somewhat different from defect localisation and can therefore hardly be compared to such techniques. However, the authors claim that defects can be detected when unexpected invariants appear in failing executions or when expected invariants do not appear.

The Approach from Liblit et al. Liblit et al. [LNZ+05] rely on the statistical analysis of programme predicates, building on earlier work from the authors [LAZJ03], which uses predicates and regression techniques. The more recent approach considers a large number of Boolean programme predicates, most importantly predicates that are evaluated within condition statements (e.g., if, for, while) and predicates referring to return values of functions. Concretely, return-value predicates indicate whether the returned value is < 0, ≤ 0, > 0, ≥ 0, = 0 or ≠ 0. For each predicate in a programme, the authors calculate the likelihood that its evaluation to true correlates with failing executions. This is used to rank predicates. Predicates are typically used to perform decisions relevant for control-flow branches within a programme. Therefore, predicates can be mapped to basic blocks [SJYH09], and predicate-based analysis is a fine-grained defect-localisation technique.
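To make the return-value predicates concrete, the following sketch records which of the six predicates held at least once per execution and relates this to failing executions. It is a simplification under stated assumptions: the data layout and function names are hypothetical, and the plain ratio used here is only the simplest ingredient of such a score, not the complete ranking of [LNZ+05].

```python
import operator

# The six return-value predicates named in the text.
RETURN_PREDICATES = {
    "< 0": operator.lt, "<= 0": operator.le, "> 0": operator.gt,
    ">= 0": operator.ge, "== 0": operator.eq, "!= 0": operator.ne,
}


def observed_true(return_values):
    """Names of the predicates that held at least once in one execution."""
    return {name for name, op in RETURN_PREDICATES.items()
            if any(op(value, 0) for value in return_values)}


def failure_ratio(passing_runs, failing_runs):
    """Share of failing executions among those where a predicate was true."""
    scores = {}
    for name in RETURN_PREDICATES:
        f = sum(name in observed_true(run) for run in failing_runs)
        s = sum(name in observed_true(run) for run in passing_runs)
        if f + s:
            scores[name] = f / (f + s)
    return scores


# Hypothetical return values of one function, one list per execution.
passing = [[1, 2], [3], [0, 5]]
failing = [[-1, 2], [4, -3]]
scores = failure_ratio(passing, failing)
```

The predicate "> 0" is true at least once in every execution of this toy data, so its ratio equals the overall share of failing executions. This mirrors the limitation regarding frequency-affecting bugs: how often a predicate is true within one execution is not recorded.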
Liu et al. [LFY+06] have shown with experiments using the Siemens Programmes that [LNZ+05] performs consistently better than delta debugging [CZ05], and that it performs similarly to Tarantula [JHS02]. Despite these good results, the approach from Liblit et al. [LNZ+05] inherently bears the risk of missing certain defects: As the likelihood calculation only considers whether a predicate has been evaluated as true at least once in a programme execution, it might not localise frequency-affecting bugs. In particular, if a predicate is evaluated to true at least once in every execution, the method considers the predicate to be completely unsuspicious.

The SOBER Method. Liu et al. propose a similar approach, called SOBER [LFY+06], that overcomes some problems of [LNZ+05]. It makes use of a subset of the predicates analysed in [LNZ+05], which includes the condition-statement and return-value predicates discussed before. It then uses a more sophisticated statistical method to calculate defect likelihoods: It models the predicate evaluations to true and false in both correct and failing programme executions and uses the model difference as the defect likelihood of predicate p as follows:
$$P_{\mathrm{SOBER}}(p) := -\log(L(p))$$
where L is a function which calculates the similarity of the predicate evaluation models. See [LFY+06] for all details on how the similarity functions are chosen and $P_{\mathrm{SOBER}}$ is calculated exactly.
The authors show that the SOBER method is able to detect the infections identified by suspicious invariants mentioned before [ECGN01] as well. As SOBER uses predicate analysis, the granularity is, as in [LNZ+05], the level of predicates or basic blocks. The evaluation conducted by the authors [LFY+06] based on the Siemens Programmes has shown that SOBER performs almost consistently better than Tarantula [JHS02] and the approach from Liblit et al. [LNZ+05], and consistently better than delta debugging [CZ05].

Subsumption. As opposed to the call-graph-based techniques discussed in this dissertation, the statistical-defect-localisation approaches do not take structural properties of call graphs into account. Hence, detecting structure-affecting bugs is more difficult.
Another known issue is that statistical defect localisation might possibly miss some defects. This is caused by the usual practice (e.g., as done in [LAZJ03, LNZ+05]) not to observe every value during an execution, but to consider sampled values. [LAZJ03] partly overcomes this issue by collecting information from productive code on large numbers of machines via the Internet. However, this does not facilitate the discovery of defects before the software is released.
Compared to the dataflow-analysis approach in call graphs proposed in Chapter 7, Liblit et al. [LNZ+05] and SOBER by Liu et al. [LFY+06] consider only three intervals for return values of methods, i.e., (−∞, 0), [0, 0] and (0, ∞). A variable number of dynamically identified intervals might be better suited to capture defects that do not manifest themselves in the fixed intervals given. Furthermore, [LNZ+05, LFY+06] do not consider dataflows in the method-call parameters, which might contain important defect-related information, too.

Defect Localisation with Graphical Models
A fairly recent technique is the application of graphical models to defect localisation. Graphical models are a machine-learning technique relying on statistics, bringing together concepts from graph theory and probability theory [Jor99]. Well-known representatives of graphical models are Bayesian networks, also known as directed acyclic graphical models or belief networks [Jen09].
Dietz et al. [DDZS09] make use of graphical models and apply them for defect localisation. They train so-called Bernoulli graph models with data obtained from programme executions. More concretely, the authors generate models for every method of a programme execution, where nodes refer to statements within the methods. Once the models are generated, the authors use them for Bayesian inference to calculate probabilities of transitions between nodes in a new programme execution. Based on these probabilities, they derive defect localisations at the statement level.
The authors evaluate their technique with real defects from large software programmes, originating from an early version of the iBUGS project [DZ09]. In the experiments, they outperform Tarantula [JHS02] almost consistently. However, the evaluation covers only situations where a software developer has to investigate up to 1% of the source code in order to find the defect. The study does not cover the localisation of defects that are harder to detect, i.e., where one has to investigate more than 1% of the code.

Dynamic Programme Slicing
Dynamic programme slicing [KL88] can be very useful for debugging although it is not exactly a defect-localisation technique. It helps searching for the exact cause of a failure, i.e., the defect, if the programmer already has some clue which parts of the programme state are infected or knows where the failure appears, e.g., if a stack trace is available. Programme slicing gives hints which parts of a programme might have contributed to a faulty execution. This is done by exploring data dependencies and revealing which statements might have affected the data used at the location where the failure appeared.

3.1.3 Defect Localisation in Multithreaded Programmes
Multicore computers with several cores on a single chip have become ubiquitous. They provide developers with new opportunities to increase performance, but applications need to be multithreaded to exploit the hardware potential [Pan10]. One drawback of multithreaded software development, compared to the sequential case, is that programmers are additionally confronted with non-determinism and parallel-programming errors. Non-determinism arises as the operating system might assign different thread schedules to different executions of the same programme [CS98]. Parallel-programming errors such as atomicity violations, race conditions (i.e., uncontrolled concurrent access of memory objects from different threads) and deadlocks [FNU03, LPSZ08] are frequently triggered by this effect.
In this dissertation, our focus is on the localisation of defects in sequential programmes. However, as described above, multithreaded programming leads to new kinds of defects, and multithreaded programmes seem to be more defect-prone than sequential software. Therefore, we briefly summarise the most important research directions in the field of defect localisation in multithreaded programmes in the following. Concretely, we comment on a selection of static and dynamic approaches and conclude this section with a brief subsumption.


Static Approaches
Tools employing static analysis, such as ESC/Java by Flanagan et al. [FLL+02] or RacerX by Engler and Ashcraft [EA03], investigate the source code without execution, but, similarly to the single-threaded case, might produce large numbers of false-positive warnings. Furthermore, some tools such as ESC/Java require programmer annotations to reduce the number of warnings, which are tedious to create. The intuition behind many static approaches is to discover situations in which variables or objects are accessed concurrently without explicit monitoring. This possibly leads to race conditions. ESC/Java, for instance, lets the user specify which data objects should always be accessed in a controlled way. The tool then generates predicates from the annotations, and a theorem prover derives whether these predicates hold for the entire source code. This information is then used to derive warnings. RacerX does not rely on annotations, but employs a set of heuristics to decide whether a memory access might be accidentally unmonitored. While such an approach is more convenient to use, it also tends to produce more false-positive warnings.

Dynamic Approaches
Dynamic race detectors such as Eraser by Savage et al. [SBN+97] instrument programmes and analyse the runtime behaviour of the memory access of each thread. Eraser, for instance, monitors the locks each thread currently holds. Based on observed sets of locks per variable, it identifies situations that possibly lead to race conditions. Other dynamic approaches rely on the analysis of happens-before relations: When no synchronisation constructs protect a read and a write access or two write accesses from two threads, a race condition is likely. To derive such happens-before relations, logical Lamport clocks [Lam78] and vector clocks [Mat89] have been used. Hybrid race detectors such as the one by O’Callahan and Choi [OC03] combine different dynamic techniques to improve race detection. The IBM Multicore SDK by Qi et al. [QDLT09] is an implementation of this approach. It also makes use of some static analysis: It analyses the source code to improve the runtime of dynamic analysis by identifying variables that can safely be excluded from further consideration.
However, deriving the information needed for the dynamic approaches at runtime implies a possibly huge overhead. Further, dynamic approaches can influence a programme under test and change its timing, which can make a race condition disappear. This effect is known as the probe effect [Gai86].
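The lock-monitoring idea can be illustrated with a small sketch that replays a recorded event trace. It is a strong simplification and not the Eraser implementation [SBN+97]: the trace format and all names are hypothetical, and a variable is reported as soon as at least two threads have accessed it and no lock was held during all of its accesses.

```python
def lockset_warnings(events):
    """Report variables whose accesses share no common lock.

    events: list of (thread, action, target) tuples, where action is
    "acquire", "release" (target is a lock) or "access" (target is a
    variable). This trace format is an assumption for illustration.
    """
    held = {}        # thread -> locks currently held
    candidates = {}  # variable -> locks held during every access so far
    accessors = {}   # variable -> threads that accessed it
    warnings = []
    for thread, action, target in events:
        locks = held.setdefault(thread, set())
        if action == "acquire":
            locks.add(target)
        elif action == "release":
            locks.discard(target)
        elif action == "access":
            accessors.setdefault(target, set()).add(thread)
            if target not in candidates:
                candidates[target] = set(locks)
            else:
                candidates[target] &= locks
            shared = len(accessors[target]) > 1
            if shared and not candidates[target] and target not in warnings:
                warnings.append(target)
    return warnings


trace = [
    ("t1", "acquire", "m"), ("t1", "access", "x"), ("t1", "release", "m"),
    ("t2", "acquire", "m"), ("t2", "access", "x"), ("t2", "release", "m"),
    ("t1", "access", "y"),
    ("t2", "acquire", "m"), ("t2", "access", "y"), ("t2", "release", "m"),
]
warnings = lockset_warnings(trace)
```

Variable x is always accessed under lock m and stays unreported, whereas y is accessed once without any lock and is flagged. Recording such a trace requires instrumentation, which is where the overhead and the probe effect mentioned above come from.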
Another general problem of dynamic race detectors is that a race might manifest itself only when certain thread schedules occur. As scheduling is done by the operating system, developers have limited influence on reproducing a race. Addressing this problem, ConTest by Farchi et al. [FNU03] executes a multithreaded Java programme several times and influences thread schedules by inserting certain statements (e.g., sleep()) into a programme. Chess, developed by Musuvathi et al. [MQB07] for C#, has an additional refinement: a modified thread scheduler exhaustively tries out every possible thread interleaving. On top of that, a delta-debugging strategy [Zel99] as described in Section 3.1.2 can be used to automatically localise a defect. Such an approach has been followed, e.g., in [CZ02]. However, as mentioned before, Tzoref et al. [TUYT07] have shown that such approaches do not scale well.

Subsumption
All of the tools mentioned in this section on defect localisation in multithreaded programmes focus on identifying atomicity violations, race conditions or deadlocks. These tools are specialised on a particular class of parallel programming errors that are due to wrong or missing usage of synchronisation constructs in parallel programming languages. However, failures of multithreaded programmes might have other causes, too. For instance, they might originate from non-parallel constructs that trigger wrong parallel programme behaviour.
Example 3.1: Suppose that a programmer forgets or incorrectly specifies a condition when she or he writes the code creating threads in a thread pool. This slip affects parallel behaviour and might lead to an unbounded creation of threads, wrong control flow and incorrect programme outputs.
Such situations might be better tackled by analysing anomalies of executions such as differences between call graphs from correct and failing executions of a programme. In this dissertation, we discuss some ideas concerning call-graph representations for multithreaded programmes in Section 4.3, and we present the results from a first study on defect localisation with such graphs in Appendix A. We furthermore come to the conclusion that the field of defect localisation with multithreaded programmes bears much potential for future investigations. We present some ideas in Section 4.3 and Chapter 9.

3.2 Data Mining
In this section, we discuss related data-mining techniques. In particular, we consider weighted subgraph mining (Section 3.2.1), mining significant subgraphs (Section 3.2.2) and constraint-based subgraph mining (Section 3.2.3).

3.2.1 Weighted Subgraph Mining
Weighted graphs are ubiquitous in the real world. For instance, think of transportation networks, where numerical weights attached to edges might stand for the load, the average speed, the time, the distance etc. Software call graphs as investigated in this dissertation for defect localisation can be annotated with weights as well: Edge weights might represent call frequencies or abstractions of the dataflow. However, we are only aware of a few studies analysing weighted graphs with frequent subgraph mining. Most studies focus on the specific analysis problem, rather than proposing general weighted-subgraph-mining techniques. In the following, we review some work based on discretisation, and we discuss approaches building on the concept of weighted support.

Discretisation-Based Approaches
Logistic Networks. Jiang et al. [JVB+05] investigate frequent subgraph mining in logistic networks where edges represent single transports and are annotated with several weights such as distance between two nodes and the weight of the load. With each weight, a different weighted graph can be constructed. In order to derive labels which are suited for graph mining from the edge weights, the authors use a binning strategy. Each weight is partitioned into ranges of the same size, giving a few (7 to 10) distinct labels. The binning strategy for discretisation may curb result accuracy, for two reasons: (1) The particular scheme does not take the distribution of values into account. Thus, close values may be assigned to different bins. (2) The discretisation leads to a number of ordered (ordinal) intervals, but the authors treat them as unordered categorical values. For example, the information that ‘medium’ is between ‘low’ and ‘high’ is lost.
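Both drawbacks of equal-width binning can be reproduced with a few lines of Python. The value range, the number of bins and the label names are hypothetical choices for this illustration and are not taken from [JVB+05].

```python
def equal_width_bin(value, low, high, bins):
    """Index of the equal-width bin that `value` falls into."""
    width = (high - low) / bins
    index = int((value - low) / width)
    return min(index, bins - 1)  # the maximum value goes into the last bin


labels = ["low", "medium", "high"]
weights = [9.9, 10.1, 19.9]  # hypothetical edge weights in [0, 30]
binned = [labels[equal_width_bin(w, 0, 30, 3)] for w in weights]
```

The close values 9.9 and 10.1 end up with different labels, while 10.1 and 19.9 share one. Moreover, once the labels are treated as categorical strings, a miner can no longer tell that 'medium' lies between 'low' and 'high'.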

Image Analysis. Nowozin et al. [NTU+07] do discretisation as well before it comes to frequent subgraph mining. They study image-analysis problems, and images are represented as weighted graphs. The authors represent each point of interest by one vertex and connect all vertices. They assign each edge a vector consisting of image-analysis-specific measures. Then they discretise the weights, but with a method more sophisticated than binning. The weight vectors are clustered, resulting in categorical labels of edges with similar weight vectors. However, the risk of losing potentially important information by discretisation is not eliminated: (1) It might still happen that close points in an n-dimensional space fall into different clusters. (2) Even when value distributions are considered, the authors do so in the context of the original graphs. When frequent subgraph mining is applied afterwards, the distributions within the different subgraphs can be very different, and other discretisations could be more appropriate.

Subsumption. In this dissertation, we deal with software call graphs that are weighted. For the analysis of these graphs, we propose two kinds of approaches that are different from discretisation: In Chapters 5–7 we investigate a postprocessing approach, in Chapter 8 a constraint-based mining approach. Both proposals avoid the shortcomings of discretisation mentioned. They analyse numerical weights instead of discrete intervals.

Weighted-Frequent Subgraph Mining
The Approaches by Jiang et al. Jiang et al. [JCSZ10] deal with a text-classification task, formulated as a weighted-frequent-subgraph-mining problem. This is based on the concept of weighted support formulated by the authors. This concept builds on the assumption that certain edges within a graph are considered to be more significant than others, and that the significance is reflected in the edge-weight values (i.e., a significant edge displays a high value)¹. Concretely, the authors calculate the weighted support wsup of a subgraph g as follows:
$$\mathit{wsup}(g) := \mathit{sup}(g) \cdot \sum_{e \in E(g)} w(e)$$
That is, the weighted support of a certain subgraph is high when it has a high support and contains edges having high weight values. Correspondingly, weighted-frequent-subgraph mining as defined by the authors discovers subgraphs satisfying a certain user-defined minimum-weighted-support threshold. However, the minimum weighted support criterion is not anti-monotone and can therefore not be used to prune the search space in pattern-growth-based frequent-subgraph-mining algorithms. The authors therefore make use of an alternative but weaker concept to prune the search space and implement their technique as an extension of gSpan [YH02] (see Section 2.3.3). In [JCZ10], the authors present variations of the approach, including two further weight-based criteria that are anti-monotone.
Using their approaches, the authors achieve good results not only in the text-classification application [JCSZ10], but also when applied to (medical) image-analysis problems [ECJ+10, JCSZ08] and certain problems from logistics [JCZ10].
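Why the weighted-support criterion is not anti-monotone can be seen from a small numeric sketch. The support values, edge weights and the threshold are hypothetical, and the support is assumed to be given as a relative frequency.

```python
def weighted_support(support, edge_weights):
    """wsup(g) = sup(g) * sum of the edge weights of g."""
    return support * sum(edge_weights)


# Hypothetical pattern g1 and its extension g2 (g1 plus one heavy edge).
wsup_g1 = weighted_support(0.9, [0.1])       # frequent, but light
wsup_g2 = weighted_support(0.5, [0.1, 0.9])  # rarer, but heavy

min_wsup = 0.3  # assumed user-defined threshold
g1_ok = wsup_g1 >= min_wsup
g2_ok = wsup_g2 >= min_wsup
```

Here g1 misses the threshold although its supergraph g2 satisfies it. A pattern-growth miner that discarded g1 and all its extensions would therefore lose g2, which is why a weaker pruning concept is needed.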

The Approaches from Shinoda et al. Shinoda et al. [SOO09] present an approach similar to the ones from Jiang et al. [JCSZ10, JCZ10]. They consider graphs with weighted nodes and edges (referred to as internal weights), and their graphs themselves are assigned a weight as well (referred to as external weights). They define the internal weighted support $\mathit{wsup}_{\mathrm{int}}$ similar to Jiang et al., but they consider the total internal weight of the graph database D (in the denominator):

$$\mathit{wsup}_{\mathrm{int}}(g) := \frac{\text{sum of all internal weights of } g \text{ in all graphs } d \in D \text{ where } g \subseteq d}{\text{sum of all internal weights of all graphs in } D}$$
If there are several embeddings of g in a graph d, the one with the maximum weight is chosen.
¹ Jiang et al. consider only weighted edges, but claim that their concepts can be easily transferred to weighted nodes.


The authors define the external weighted support $\mathit{wsup}_{\mathrm{ext}}$ similarly as follows:

$$\mathit{wsup}_{\mathrm{ext}}(g) := \frac{\text{sum of the external weights of all graphs } d \in D \text{ where } g \subseteq d}{\text{sum of all external weights of all graphs in } D}$$
Finally, they define a general weighted support $\mathit{wsup}_{\mathrm{gen}}$, based on a user-defined parameter λ (0 ≤ λ ≤ 1):
$$\mathit{wsup}_{\mathrm{gen}}(g) := \lambda \cdot \mathit{wsup}_{\mathrm{ext}}(g) + (1 - \lambda) \cdot \mathit{wsup}_{\mathrm{int}}(g)$$
Based on a user-defined minimum general-weighted-support value and parameter λ, the authors define the general-weighted-subgraph-mining problem. Their solution to this problem is similar to the one of Jiang et al. [JCSZ10]: As the minimum general-weighted-support criterion is not anti-monotone, they rely on a weaker pruning criterion for mining with a pattern-growth-based subgraph-mining algorithm. The authors also propose a related problem, mining external weighted subgraphs under internal weight constraints, which is solved similarly within the same framework. In their experiments, Shinoda et al. [SOO09] achieve good results with synthetic data, communication graphs and chemical compound graphs.
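The three support measures can be computed on a toy database as follows. The dictionary layout is an assumption made for this sketch: each graph carries its external weight, the sum of its internal weights and, if it contains g, the internal weight of the heaviest embedding of g.

```python
def wsup_int(db):
    """Internal weighted support of g over the graph database db."""
    contained = sum(d["g_weight"] for d in db if d["g_weight"] is not None)
    return contained / sum(d["internal"] for d in db)


def wsup_ext(db):
    """External weighted support of g over the graph database db."""
    contained = sum(d["external"] for d in db if d["g_weight"] is not None)
    return contained / sum(d["external"] for d in db)


def wsup_gen(db, lam):
    """General weighted support with user-defined parameter lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * wsup_ext(db) + (1.0 - lam) * wsup_int(db)


# Hypothetical database of three graphs; g occurs in the first and third.
db = [
    {"external": 2.0, "internal": 10.0, "g_weight": 4.0},
    {"external": 1.0, "internal": 6.0, "g_weight": None},
    {"external": 1.0, "internal": 4.0, "g_weight": 2.0},
]
internal = wsup_int(db)
external = wsup_ext(db)
general = wsup_gen(db, lam=0.5)
```

With λ = 1 only the external weights count, with λ = 0 only the internal ones; values in between blend the two measures linearly.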

Subsumption. While mining for weighted frequent subgraphs (or mining using the variations from Shinoda et al.) is adequate for certain applications, it relies on the assumption that high weight values identify significant components. This does not hold in every domain. For instance, in software-defect localisation, high (or low) edge-weight values are in general not related to defects. Therefore, weighted-frequent-subgraph mining cannot be used for every problem and offers less flexibility than constraints on arbitrary measures as investigated in Chapter 8 of this dissertation. Furthermore, to our knowledge, the weighted-frequent-subgraph-mining techniques presented in this section have never been evaluated systematically nor compared to alternative approaches.

3.2.2 Mining Significant Subgraphs
In many settings, frequent subgraph mining is followed by a feature-selection step. This is to ease subsequent processes such as graph classification [CYH10a] and to identify the most significant features. The different proposals use various objective functions for feature selection. Among others, Cheng et al. [CYH10b] have identified this two-step approach of mining and selecting to be the computational bottleneck in many graph-mining applications: On the one side, generating large numbers of frequent subgraphs to choose from is expensive and in certain applications even infeasible. On the other side, the selection process can be expensive as well.


A number of studies investigate scalable subgraph-mining algorithms [CHS+08, RS09, SKT08, SNK+09, TCG+10, YCHY08]. They deal with the direct mining of subgraphs satisfying an objective function, instead of following the two-step approach. In other words, the subgraph sets mined might be incomplete with regard to the frequency criterion, but contain all (or most) graphs with regard to some other objective function. One can consider these functions to be constraints, as they narrow down the mining results. However, they do not necessarily fall into any of the constraint classes introduced in Section 2.3.3. Objective functions are either based on their ability to discriminate between classes or numerical values associated with the graphs [SKT08, SNK+09, TCG+10], on some other measure of significance [CHS+08, RS09] or leave this choice to the user by allowing for interchangeable measures [YCHY08]. In the following, we look at the approaches mentioned in a little more detail.

Boosting-Based Approaches
The approach from Saigo et al. [SNK+09], gBoost, builds on a boosting technique with decision-stump classifiers. In each iteration, they search for the most promising classifier, consisting of a single discriminative subgraph. These promising subgraphs are found by repeatedly calculating structural objective functions measuring the discriminativeness. They do so in a pattern-growth search space similar to the one from gSpan [YH02] (see Section 2.3.3). The authors use their discriminativeness measure to refine pruning bounds in the search space in each iteration.
Saigo et al. [SKT08] refine their approach in the gPLS algorithm. It makes use of the same boosting technique and pattern search space, but relies on partial least-squares regression (PLS) to prune the search space and to select the most promising subgraphs.

A Leap-Search-Based Approach
Yan et al. [YCHY08] present the LEAP algorithm. It allows for the integration of different kinds of objective functions that are not anti-monotone. The idea of the algorithm is not to prune the search space, but to leap in this space. This is in contrast to performing a (pruned) stringent depth-first search as done by algorithms such as gSpan [YH02] (see Section 2.3.3). Thereby it makes use of the observation that structurally similar subgraphs tend to have similar support values and statistical significance scores. Therefore, the authors rely on a strategy that mines with an exponentially decreasing minimum support threshold. This leads to a fast discovery of (near-)optimal subgraphs. In the evaluation, the authors use the G-test as well as information gain as objective functions. The G-test is a measure of statistical significance, and the information gain measures the discriminativeness of a subgraph (see Definition 2.7). They successfully apply their technique to several datasets from the chemical domain. Furthermore, Cheng et al. [CLZ+09] employ the LEAP algorithm for call-graph-based defect localisation (see Chapter 5).

Mining with Optimality Guarantees
Both boosting-based approaches [SNK+09, SKT08] as well as the LEAP algorithm [YCHY08] have proven to work well in the respective settings and evaluations. However, they do not provide optimality guarantees. Thoma et al. [TCG+10] present an approach, CORK, which integrates an objective function into the pattern-growth-based frequent subgraph miner gSpan [YH02] (see Section 2.3.3). They use this function to greedily prune the search space. The distinctiveness of their approach is that the objective function has the submodularity property, and the authors show that such functions used for pruning ensure near-optimal results. That is, CORK provides the optimality guarantee that almost all discriminative subgraphs useful for classification are found.

A Partitioning-Based Approach
Ranu and Singh [RS09] investigate a setting that relies on significance (with respect to the statistical p-value measure) rather than on the ability to discriminate between classes. They observe that significant subgraphs might have any support value. In particular, significant subgraphs might have a support that is too low to be mined efficiently. This is as frequent-subgraph-mining algorithms roughly scale exponentially with decreasing minimum support values. Based on this observation, they develop the GraphSig technique which builds on two main steps: In the first step, they partition all graphs into sets such that all graphs in a set are likely to contain a common significant subgraph with a high support. They do so by using a technique similar to a sliding-window approach on the graphs, based on random walks. This generates a set of feature vectors for each graph. The authors then mine closed subfeature vectors which are significant and use them to group all graphs containing a subfeature vector into a group. In the second step, the authors make use of these groups of graphs. As these groups are relatively small, they apply a frequent-subgraph-mining technique on every set of graphs with a very small minimum support value. This procedure allows for finding significant subgraphs with a low support which cannot be discovered by traditional techniques due to scalability issues. In the evaluation, the authors demonstrate that their significant subgraphs are well-suited for graph-classification applications.

Mining Representative Subgraphs
Chaoji et al. [CHS+08] do not measure the significance of subgraphs nor their discriminativeness. They are concerned about finding subgraphs that are representative for the complete set of frequent subgraphs (i.e., not similar to the graphs in the result set) with regard to the graph structure. To this end, the authors introduce parameter α ∈ [0, 1]: Frequent subgraphs have to have a similarity to graphs in the result set below value α. Furthermore, they introduce a parameter β ∈ [0, 1]: For every frequent subgraph that is not part of the result set, there has to be at least one subgraph in the result set having a similarity of at least value β. In the ORIGAMI algorithm, the authors measure the similarity between two graphs by calculating the relative size of their maximum common subgraph. For mining frequent subgraphs which comply with the restrictions defined by the two parameters α, β, the authors mine a set of subgraphs in a first step. Instead of enumerating the complete set of such graphs, they adopt a random-walk approach which enumerates a subset of diverse subgraphs. In a second step, they extract the result set complying with the parameters. They do so by mapping the problem to a maximum-clique problem which they again solve with a randomised algorithm.

Subsumption
Various researchers have studied scalable mining of subgraph patterns, with much success. However, they have not taken weights into account. In this dissertation, in particular in Chapter 8, we use measures building on edge weights as objective functions, to decide which graphs are significant. The usage of weights allows for a more detailed analysis as compared to the graph structure only. Like the previous approaches, ours does not necessarily produce graph sets which are complete with regard to frequency or some other hard constraint.

3.2.3 Constraint-Based Subgraph Mining
Constraint-based mining allows the user to formulate constraints describing the patterns she or he is interested in. The mining algorithms in turn may make use of these constraints by narrowing down their internal search space and thus speeding up the algorithm. In Section 2.3.3, we have presented the constraint classes anti-monotonicity, monotonicity and succinctness, as originally introduced by Ng et al. [NLHP98].
More recently, constraint-based graph mining has been proposed. Wang et al. [WZW+05] build on the constraint classes introduced in [NLHP98] and categorise various graph-based constraints into these classes. Then the authors develop a framework to integrate the different constraint classes into a pattern-growth-based graph-mining algorithm. They use anti-monotone constraints to prune the search space and monotone constraints to speed up the evaluation of further constraints. Further, they use the succinctness property to reduce the size of the graph database. Wang et al. also propose a way to deal with some weight-based constraints. For the average-weight constraint, they propose to omit nodes and edges with outlier values from the graphs in the database. They do so to shrink the graph size and to avoid the evaluation of such ‘unfavourable’ elements. This can lead to incomplete result sets. Furthermore, situations where such constraints lead to significant speedups are rare, according to the evaluation of the authors with one artificial dataset, and they do not make any statements regarding result quality.
In [ZYHY07], Zhu et al. extend [WZW+05] by refining the classes of constraints, and they integrate them into mining algorithms. However, they do not consider weights either. Although the techniques proposed work well with monotone, anti-monotone or succinct constraints and their derivations, most weight-based constraints do not fall into these categories [WZW+05]. They are not convertible (see Section 2.3.3) either, even if such constraints might seem to be similar. The weights considered in convertible constraints stay the same for every item in all transactions, while weights in graphs can be different in every graph in D. Therefore, the established constraint-based-mining schemes cannot use weight-based constraints for pruning while guaranteeing completeness.
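A tiny numeric sketch illustrates why a weight-based constraint such as a minimum average edge weight is neither anti-monotone nor monotone. The patterns, weights and the threshold are hypothetical and serve only as an illustration; they are not taken from [WZW+05] or [ZYHY07].

```python
def average_weight(edge_weights):
    """Average edge weight of a pattern, given its list of edge weights."""
    return sum(edge_weights) / len(edge_weights)


def satisfies(edge_weights, threshold):
    """Minimum-average-weight constraint."""
    return average_weight(edge_weights) >= threshold


# A hypothetical chain of growing patterns within one graph.
small = [1.0]             # one-edge pattern
medium = [1.0, 9.0]       # extended by a heavy edge
large = [1.0, 9.0, 2.0]   # extended again by a light edge
outcomes = [satisfies(p, 4.5) for p in (small, medium, large)]
```

The constraint is violated, then satisfied, then violated again while the pattern grows. Neither the violation of a subgraph nor the satisfaction of a subgraph therefore permits conclusions about its supergraphs, so the constraint cannot be used for pruning without risking incomplete results.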


4 Call-Graph Representations

Call-graph-based defect localisation naturally relies on call graphs. Such graphs are representations of programme executions. Raw call graphs typically become much too large for graph-mining algorithms, as programmes might be executed for a long period and frequently call other parts of the programme, which adds information to the graph. Therefore, it is essential to compress the graphs; we also call this process reduction. It is usually done by a lossy compression technique. This involves the trade-off between keeping as much information as possible and a strong compression. The literature has proposed a number of different call-graph representations [CLZ+09, DFLS06, LYY+05], standing for different degrees of reduction and different types and amounts of information encoded in the graphs. In this dissertation, we make further proposals for call graph compressions and for encoding additional information by means of numerical annotations at the edges. To ease presentation,
wediscusstherelatedapproachesfromtheliterature(inparticularthecall-graphrep-
resentationsRtmp,RordandRblock;RtotalandRunordaresimplifiedvariantsthereof)
alongwiththetotalnew01mproposals(totalRsubtree)orvariations01m(Rwtotal,Rtotalmult)inthisdisserta-
tion.Besidesthegraphrepresentationsdiscussedinthischapter,weintroducefurther
graphrepresentationsinChapters6and7.Theyfocusonspecificgraphrepresenta-
tionsforcallgraphsatdifferentlevelsofabstractionandontheincorporationof
dataflow-relatedinformation,respectively.
InSection4.1,wediscusscall-graphrepresentationsatthemethodlevel.InSec-
tion4.2,webrieflyexplaincallgraphsatotherlevelsofgranularitythanthemethod
level.InSection4.3,wepresentcall-graphrepresentationsformultithreadedpro-
grammes.InSection4.4,weexplainhowwetechnicallyderivecallgraphsfrom
Javaprogrammeexecutions.Section4.5subsumesthischapter.

4.1 Call Graphs at the Method Level

We now discuss call-graph representations at the method level. The basis for all such representations are unreduced call graphs, sometimes also called call trees, as obtained from tracing programme executions (in Section 4.4 we give some details on tracing):

Notation 4.1 (Unreduced call graphs)
Unreduced call graphs can be obtained by tracing a programme execution. They are rooted ordered trees. Nodes stand for methods and one edge stands for each method invocation. The order of the nodes is the temporal order in which the methods were executed.
Example 4.1: Figure 4.1(a) is an example of such a graph. Even if not depicted in the figure, the siblings in the graph are ordered by execution time from left to right. When we want to emphasise the temporal order, we express the order by increasing integers attached to the nodes. Figure 4.4(a) is the same graph featuring this representation.
In Section 4.1.1, we describe the total reduction scheme. In Section 4.1.2, we introduce various techniques for the reduction of iteratively executed structures. As some techniques make use of the temporal order of method calls during reduction, we describe these aspects in Section 4.1.3. We provide some ideas on the reduction of recursion in Section 4.1.4 and conclude with a brief comparison in Section 4.1.5.

4.1.1 Total Reduction
The total reduction technique is probably the easiest technique, and it yields good compression. In the following, we introduce two variants.
Notation 4.2 (Total reduction, R_total)
In totally reduced graphs at the method level, every distinct method is represented by exactly one node. When one method has called another method at least once in an execution, a directed edge connects the corresponding nodes.
Note that the total reduction may give way to the existence of loops in R_total graphs (i.e., the output is a regular graph), and it limits the size of the graph (in terms of nodes) to the number of methods of the programme. In defect localisation, Liu et al. [LYY+05] have introduced this technique, along with a temporal extension (see Section 4.1.3).
In this dissertation, we extend the plain total-reduction scheme (R_total) to include call frequencies. We do so as this eases the discovery of frequency-affecting bugs, as we will see.
Notation 4.3 (Total reduction with edge weights, R^w_total)
Building on R_total graphs as defined in Notation 4.2, every edge is annotated with a numerical weight. It represents the total number of calls of the callee method from the caller method.
Even though the extension in the R^w_total graphs is quite simple, we are not aware of any studies using weighted call graphs for defect localisation. Furthermore, these weights allow for more detailed analyses, in particular regarding the localisation of frequency-affecting bugs.

Figure 4.1: Total reduction techniques. Panels: (a) unreduced, (b) R_total, (c) R^w_total.

Example 4.2: Figure 4.1 contains examples of the total reduction techniques: (a) is an unreduced call graph, (b) its total reduction (R_total) and (c) its total reduction with edge weights (R^w_total).
In general, total reduction (R_total and R^w_total) reduces the graphs quite significantly. Therefore, it allows graph-mining-based defect localisation with software projects larger than other reduction techniques. On the other hand, much information on the programme execution is lost. This concerns frequencies of the executions of methods (R_total only) as well as information on different structural patterns within the graphs (R_total and R^w_total). In particular, the information is lost in which context (at which position within a graph) a certain substructure is executed.
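
The two total reductions can be illustrated with a minimal Java sketch. It is not the tooling used in this dissertation: the class names TraceNode and TotalReduction and the edge-name format "caller->callee" are assumptions made for illustration only. The key set of the returned map corresponds to the edges of an R_total graph, and the values are the edge weights of the R^w_total graph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical node of an unreduced call tree (Notation 4.1).
final class TraceNode {
    final String method;
    final List<TraceNode> callees = new ArrayList<>();

    TraceNode(String method, TraceNode... kids) {
        this.method = method;
        this.callees.addAll(Arrays.asList(kids));
    }
}

final class TotalReduction {
    // Collapses every invocation of the same caller/callee pair into one
    // weighted edge; the weight is the number of calls observed.
    static Map<String, Integer> weightedEdges(TraceNode root) {
        Map<String, Integer> weights = new TreeMap<>();
        collect(root, weights);
        return weights;
    }

    private static void collect(TraceNode caller, Map<String, Integer> weights) {
        for (TraceNode callee : caller.callees) {
            weights.merge(caller.method + "->" + callee.method, 1, Integer::sum);
            collect(callee, weights);
        }
    }
}
```

For a tree in which a calls b and c once and c calls b four times and d once, the sketch yields four edges, with weight 4 on the edge from c to b.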

4.1.2 Reduction of Iterations
Next to total reduction, reduction based on the compression of iteratively executed structures (i.e., caused by loops) is promising. This is due to the frequent usage of iterations in today's software. Furthermore, as described before, the relatively severe total-reduction techniques give way to the assumption that they lose much information originally available in unreduced call graphs. In the following, we introduce two variants that encode more structural information than totally reduced graphs.
Notation 4.4 (Unordered zero-one-many reduction, R^unord_01m)
Unordered zero-one-many reduced graphs are rooted (unordered) trees where nodes represent methods and edges method invocations. In contrast to unreduced call graphs (see Notation 4.1), such graphs ignore the order and omit isomorphic substructures which occur more than twice below the same parent node.
The R^unord_01m reduction ensures that many equal substructures called within a loop do not lead to call graphs of an extreme size. In contrast, the information that some substructure is executed several times is still encoded in the graph structure, but without exact numbers. This is indicated by doubled substructures within the call graph (only substructures occurring more than twice are not included). Compared to total reduction (R_total), more information on a programme execution is kept. The downside is that R^unord_01m call graphs generally are much larger.

Figure 4.2: Reduction techniques based on iterations. Panels: (a) unreduced, (b) R^unord_01m, (c) R_subtree.
The R^unord_01m reduction is a simplified variant of the one from Di Fatta et al. [DFLS06] (see R^ord_01m in Section 4.1.3). The difference is that R^unord_01m graphs do not take the temporal order of the method executions into account. We use this representation in this dissertation for comparisons with other techniques which do not make use of temporal information.
Notation 4.5 (Subtree reduction, R_subtree)
Subtree-reduced graphs are rooted (unordered) trees where nodes represent methods and edges method invocations. This reduction ignores the order and reduces subtrees executed iteratively by deleting all but one isomorphic subtree below the same parent node in an unreduced call tree (see Notation 4.1). The edges are weighted and numerical weights represent call frequencies. Algorithm 4.1 describes the reduction procedure in detail.
The R_subtree reduction is newly proposed in this dissertation. It leads to smaller graphs than R^unord_01m. The edge weights allow for a detailed analysis; they serve as the basis of our analysis technique described in Chapter 5. We discuss details of the reduction technique in the remainder of this section.
Example 4.3: Figure 4.2 illustrates the two iteration-based reduction techniques: (a) is an unreduced call graph, (b) its zero-one-many reduction without temporal order (R^unord_01m) and (c) its subtree reduction (R_subtree). Note that the four calls of b from c are reduced to two calls with R^unord_01m and to one edge with the weight 4 with R_subtree. Further, the graph resulting from R_subtree has one node more than the one obtained from R^w_total in Figure 4.1(c), but the same number of edges.

Figure 4.3: A raw call tree, its first and second transformation step. Panels: (a), (b), (c); the nodes are arranged in levels 1 to 3.

Note that with R_total, and with R^unord_01m in most cases as well, the graphs of a correct and a failing execution with a frequency-affecting bug are reduced to exactly the same graph. With R_subtree (and with R^w_total), the edge weights would be different when call frequency-affecting bugs occur. Analysis techniques can discover this (see Chapter 5).

Subtree-Reduction Procedure
For the subtree reduction (R_subtree), we organise the call tree into n horizontal levels. The root node is at level 1. All other nodes are in levels numbered with the distance to the root. A naïve approach to reduce the example call tree in Figure 4.3(a) would be to start at level 1 with node a. There, one would find two child subtrees with a different structure – one could not merge anything. Therefore, we proceed level by level, starting from level n−1, as described in Algorithm 4.1.

Algorithm 4.1 Subtree reduction algorithm.
Input: a call tree organised in n levels
for level = n−1 to 1 do
    for each node in level do
        merge all isomorphic child-subtrees of node,
        sum up corresponding edge weights
    end for
end for

Example 4.4: Suppose we want to reduce the graph given in Figure 4.3(a). We start in level 2. The left node b has two different children. Thus, nothing can be merged there. In the right b, the two children c are merged by adding the edge weights of the merged edges, yielding the tree in Figure 4.3(b). In the next level, level 1, we process the root node a. Here, the structure of the two successor subtrees is the same. Therefore, they are merged, resulting in the tree in Figure 4.3(c).
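
The bottom-up merging of Algorithm 4.1 can be sketched in Java as follows. This is an illustrative sketch under assumptions, not the implementation used in this dissertation: the classes WeightedCall and SubtreeReduction are hypothetical, recursion (children first) stands in for the explicit level-by-level loop, and isomorphism is tested with a simple canonical string that ignores sibling order and edge weights (it assumes method names contain no parentheses or commas).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical call-tree node; weight is the weight of the edge from the parent.
final class WeightedCall {
    final String method;
    int weight = 1;
    final List<WeightedCall> children = new ArrayList<>();

    WeightedCall(String method, WeightedCall... kids) {
        this.method = method;
        this.children.addAll(Arrays.asList(kids));
    }
}

final class SubtreeReduction {
    // Canonical form of a subtree: ignores sibling order and edge weights.
    static String shape(WeightedCall n) {
        List<String> parts = new ArrayList<>();
        for (WeightedCall c : n.children) parts.add(shape(c));
        Collections.sort(parts);
        return n.method + "(" + String.join(",", parts) + ")";
    }

    // Reduces the deeper levels first, then merges isomorphic child subtrees.
    static void reduce(WeightedCall n) {
        for (WeightedCall c : n.children) reduce(c);
        Map<String, WeightedCall> kept = new LinkedHashMap<>();
        for (WeightedCall c : n.children) {
            WeightedCall first = kept.putIfAbsent(shape(c), c);
            if (first != null) addWeights(first, c);
        }
        n.children.clear();
        n.children.addAll(kept.values());
    }

    // Both subtrees are already reduced and isomorphic, so the children of
    // each node have pairwise distinct shapes and can be matched by shape.
    private static void addWeights(WeightedCall into, WeightedCall from) {
        into.weight += from.weight;
        Map<String, WeightedCall> byShape = new HashMap<>();
        for (WeightedCall c : into.children) byShape.put(shape(c), c);
        for (WeightedCall c : from.children) addWeights(byShape.get(shape(c)), c);
    }
}
```

As in Example 4.4, two subtrees rooted at b only become mergeable at the root once their own children have been merged, which is why the deeper levels are processed first.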


4.1.3 Temporal Order in Call Graphs
So far, the call graphs described just represent the occurrence of method calls. Even though, say, Figure 4.2(c) might suggest that b is called before c in the root node a, this information is not encoded in the graphs. As this might be relevant for discriminating faulty and correct programme executions, the defect-localisation techniques proposed in [DFLS06, LYY+05] take the temporal order of method calls within one call graph into account. In the following, we introduce the corresponding reductions.
Notation 4.6 (Total reduction with temporal edges, R^tmp_total)
In addition to the total reduction (R_total, see Notation 4.2), totally reduced graphs with temporal edges have so-called temporal edges that are directed. Such an edge connects two methods which are executed consecutively and are invoked from the same method. Technically, temporal edges are directed edges with another label, e.g., 'temp', compared to other edges which are labelled, say, 'call'.
The R^tmp_total reduction has been introduced by Liu et al. [LYY+05], and the resulting graphs are also known as software-behaviour graphs. As the graph-mining algorithms used for further analysis can handle edges labelled differently, the analysis of R^tmp_total graphs does not give way to any special challenges, except for an increased number of edges. In consequence, the totally reduced graphs lose their main advantage, their small size. However, taking the temporal order into account might help discovering certain defects.
Notation 4.7 (Ordered zero-one-many reduction, R^ord_01m)
Ordered zero-one-many-reduced graphs are, like unreduced call graphs (see Notation 4.1), rooted ordered trees. To include the temporal order, the reduction technique differs from the R^unord_01m reduction (see Notation 4.4) as follows: While R^unord_01m omits any isomorphic substructure which is invoked more than twice from the same node, here only substructures are removed which are executed more than twice in direct sequence.
The R^ord_01m reduction has been introduced by Di Fatta et al. [DFLS06]. As the resulting graphs are rooted ordered trees, they can be analysed with an order-aware tree mining algorithm. The fact that substructures are only removed when they occur in direct sequence ensures that all temporal relationships are retained. For instance, in the reduction of the sequence b, b, b, d, b (see Figure 4.4) only the third b is removed, and it is still encoded that b is called after d once.
Depending on the actual execution, the R^ord_01m technique might lead to extreme sizes of call trees. For example, if within a loop a method a is called followed by two calls of b, the reduction leads to the repeated sequence a, b, b, which is not reduced at all. The rooted ordered tree miner in [DFLS06] partly compensates the additional effort for mining algorithms caused by such sizes, which are huge compared to R^unord_01m. Rooted ordered tree mining algorithms scale significantly better than usual graph-mining algorithms [CMNK05], as they make use of the order.
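
The rule of Notation 4.7 can be made concrete with a small Java sketch. It follows our reading of the notation rather than the original implementation of [DFLS06]: the classes OrderedCall and ZeroOneMany are hypothetical, and two subtrees count as equal when their order-sensitive canonical strings match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical node of a rooted ordered call tree.
final class OrderedCall {
    final String method;
    final List<OrderedCall> children = new ArrayList<>();

    OrderedCall(String method, OrderedCall... kids) {
        this.method = method;
        this.children.addAll(Arrays.asList(kids));
    }

    // Order-sensitive canonical form of the subtree rooted at this node.
    String shape() {
        StringBuilder sb = new StringBuilder(method).append('(');
        for (OrderedCall c : children) sb.append(c.shape()).append(',');
        return sb.append(')').toString();
    }
}

final class ZeroOneMany {
    // Keeps at most two copies of a substructure within each run of
    // directly consecutive, identical sibling subtrees.
    static void reduceOrdered(OrderedCall n) {
        for (OrderedCall c : n.children) reduceOrdered(c);
        List<OrderedCall> kept = new ArrayList<>();
        String last = null;
        int run = 0;
        for (OrderedCall c : n.children) {
            String s = c.shape();
            run = s.equals(last) ? run + 1 : 1;
            last = s;
            if (run <= 2) kept.add(c);
        }
        n.children.clear();
        n.children.addAll(kept);
    }
}
```

Applied to the child sequence b, b, b, d, b, the sketch removes only the third b, while the looped sequence a, b, b, a, b, b stays unchanged, which mirrors the two observations made in the text.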

Figure 4.4: Temporal information in call-graph reductions. Panels: (a) unreduced, (b) R^tmp_total, (c) R^ord_01m.

Example 4.5: Figure 4.4 illustrates the two graph reductions which are aware of the temporal order. (The integers attached to the nodes represent the invocation order.) (a) is an unreduced call graph, (b) its total reduction with temporal edges (dashed, R^tmp_total) and (c) is its ordered zero-one-many reduction (R^ord_01m). Note that, compared to R^unord_01m, R^ord_01m keeps a third node b called from c, as the direct sequence of nodes labelled b is interrupted.

4.1.4 Reduction of Recursions
Another challenge with the potential to reduce the size of call graphs is recursion. The total reductions (R_total, R^w_total and R^tmp_total) implicitly handle recursion as they reduce both iteration and recursion. For instance, when every method is collapsed to a single node, (self-)loops implicitly represent recursion. Besides that, recursion has not been investigated much in the context of call-graph reduction and in particular not as a starting point for reductions in addition to iterations. The reason for that is, as we will see in the following, that the reduction of recursion is less obvious than reducing iterations and might finally result in the same graphs as with a total reduction. Furthermore, in compute-intensive applications, programmers frequently replace recursions with iterations, as this avoids costly method calls. Nevertheless, we have investigated recursion-based reduction of call graphs to a certain extent and present some approaches in the following. Two types of recursion can be distinguished:
• Direct recursion. When a method calls itself directly, such a method call is called a direct recursion. An example is given in Figure 4.5(a) where method b calls itself. Figure 4.5(b) presents a possible reduction represented with a self-loop at node b. In Figure 4.5(b), edge weights as in R_subtree represent both frequencies of iterations and the depth of direct recursion.
• Indirect recursion. It may happen that some method calls another method which in turn calls the first one again. This leads to a chain of method calls as in the example in Figure 4.5(c) where b calls c which again calls b etc. Such chains can be of arbitrary length. Obviously, such indirect recursions can be reduced as shown in Figures 4.5(c) and 4.5(d). This leads to the existence of loops.

Figure 4.5: Examples for reduction based on recursion. Panels: (a) to (f).

Both types of recursion are challenging when it comes to reduction. Figures 4.5(e) and 4.5(f) illustrate one way of reducing direct recursions. While the subsequent reflexive calls of a are merged into a single node with a weighted self-loop, b, c and d become siblings. As with total reductions, this leads to new structures which do not occur in the original graph. In defect localisation, one might want to avoid such artefacts. For instance, d called from exactly the same method as b could be a structure-affecting bug which cannot be found when such artefacts occur. The problem with indirect recursion is that it is hard to detect and becomes expensive to detect all occurrences of long-chained recursion. To conclude, when reducing recursions, one has to be aware that, as with total reduction, some artefacts may occur.
In this dissertation, we focus on the reduction of iterations (R_subtree) or fall back to total reduction with weights (R^w_total). This fall back has the advantage that we deal with smaller graphs making graph mining easier and that recursions are treated without any extra effort.

4.1.5 Comparison
To compare reduction techniques, we must look at the level of compression they achieve on call graphs. Table 4.1 contains the sizes of the resulting graphs (increasing in the number of edges) when different reduction techniques are applied to the same call graph. The call graph used here is obtained from an execution of the Java diff tool taken from [Dar04] used in the evaluation in Chapter 5. Clearly, the effect of the reduction techniques varies extremely depending on the kind of programme and the data processed. However, the small programme used illustrates the effect of the various techniques. Furthermore, it can be expected that the differences in call-graph compressions become more significant with increasing call-graph sizes. This is because larger graphs tend to offer more possibilities for reductions.

reduction              nodes    edges
R_total, R^w_total        22       30
R_subtree                 36       35
R^tmp_total               22       47
R^unord_01m               62       61
R^ord_01m                117      116
unreduced              2,199    2,198
Table 4.1: Examples for the effect of call-graph-reduction techniques.

Obviously, the total reduction (R_total and R^w_total) achieves the strongest compression and yields a reduction by two orders of magnitude. As 22 nodes remain, the programme has executed exactly this number of different methods. The subtree reduction (R_subtree) has significantly more nodes but only five more edges. As – roughly speaking – graph-mining algorithms scale with the number of edges, this seems to be tolerable. We expect the small increase in the number of edges to be compensated by the increase in structural information encoded. The unordered zero-one-many reduction technique (R^unord_01m) again yields somewhat larger graphs. This is because repetitions are represented as doubled substructures instead of edge weights. With the total reduction with temporal edges (R^tmp_total), the number of edges increases by roughly 50% due to the temporal information, while the ordered zero-one-many reduction (R^ord_01m) almost doubles this number. Chapter 5 assesses the effectiveness of defect localisation with the different reduction techniques along with the localisation methods.
Clearly, some call-graph-reduction techniques also are expensive in terms of runtime. However, we do not compare the runtimes, as the subsequent graph mining step usually is significantly more expensive.
To summarise, different authors have proposed different reduction techniques, each one together with a localisation technique (see Chapter 5): the total reduction (R^tmp_total) in [LYY+05], the zero-one-many reduction (R^ord_01m) in [DFLS06] and the subtree reduction (R_subtree) proposed in this dissertation. Some of the reductions can be used or at least be varied in order to work together with a defect-localisation technique different from the original one. In Chapter 5, we present original and varied combinations.

4.2 Call Graphs at Different Levels of Granularity
So far, we have considered call graphs at the method level. However, call graphs can be more fine grained or more coarse grained. Finer levels of granularity allow for more detailed defect localisations, but the graphs are typically much larger. Coarser granularities are less detailed, but lead to smaller graphs. In the following, we look at finer levels of granularity, in particular at basic-block-level call graphs as defined by Cheng et al. [CLZ+09]. In Chapter 6, we look at coarser call-graph representations, i.e., at the class level and at the package level.
Cheng et al. [CLZ+09] rely on the method-level call graphs with total reduction and temporal edges as introduced by Liu et al. [LYY+05] (R^tmp_total, see Notation 4.6). Besides these graphs, they also introduce basic-block-level call graphs (R^block_total), aiming at more fine-grained defect localisations:
Notation 4.8 (Basic-block-level call graphs, R^block_total)
Each basic block as known from static control-flow graphs (see Section 2.2.1) forms a node in the dynamic R^block_total call graph. Three kinds of differently labelled directed edges connect these nodes. Edges of the type 'call' correspond to method calls, edges of the type 'trans' to transitions between two basic blocks and edges of the type 'return' to method returns.
Example 4.6: Listing 4.1 is an example Java source code of a programme containing a simple integer-multiplication method (known from Examples 2.1 and 2.2) and a main method that calls the multiplication method once. Figure 4.6 is the corresponding basic-block-level call graph (R^block_total), representing a single execution of the programme.
Note that Cheng et al. [CLZ+09] do not make use of any weights in their R^block_total graphs. However, introducing weights corresponding to call/transition/return frequencies would be easy.

4.3 Call Graphs of Multithreaded Programmes
So far, we have considered call graphs from single-threaded programmes. However, as motivated in Section 3.1.3, defect localisation in multithreaded programmes is a challenging field. Although multithreaded programmes are not in the focus of this dissertation, we discuss some specialities of such programmes and possible call-graph representations in this section. We limit these discussions to the method-level case, although respective graphs can be defined similarly for other levels of granularity. In Appendix A, we evaluate the usefulness of the graphs developed here for defect localisation in multithreaded programmes.

Unreduced and Totally-Reduced Multithreaded Call Graphs. In the multithreaded case, every method can be executed several times in more than one thread. Therefore, in unreduced call graphs, nodes are initially labelled with a prefix consisting of the respective thread ID and method name. Figure 4.7(a) illustrates an example of such a call graph. This example represents the method calls of one programme execution, without any reductions. To achieve a strong reduction of call graphs from potentially large multithreaded programmes, we consider a total reduction of the graphs in the following (see Section 4.1.1). As in all total reduction variants, each method is uniquely represented by exactly one node, for the moment identified by the method name prefixed with the thread ID. Two nodes are connected by an edge if the corresponding methods call each other at least once. Furthermore, we use edge weights as in the R^w_total graph representations to represent call frequencies. Figure 4.7(c) is an example for such a totally reduced graph representation; it is the reduced version of the call graph in Figure 4.7(a).

    public static void main(String[] args) {
        System.out.println(mult(3, 4));
    }

    public static int mult(int a, int b) {
        int res = 0;
        int i = 1;
        while (i <= a) {
            res += b;
            i++;
        }
        return res;
    }

Listing 4.1: Example Java programme performing an integer multiplication.

Figure 4.6: A basic-block-level call graph (R^block_total) [CLZ+09], representing the execution of the programme from Listing 4.1. Dashed lines stand for 'call' edges, solid lines for 'trans' edges and dotted lines for 'return' edges.

Figure 4.7: Example graphs illustrating alternative choices for call-graph representations for multithreaded applications. Panels: (a) unreduced, (b) R^mult_total, (c), (d), (e).

Temporal Relationships. For the localisation of defects in multithreaded software, it seems to be natural to encode temporal information in call graphs, e.g., to tackle race conditions caused by varying thread schedules. The call graphs such as the one in Figure 4.7(a) do not encode any order of execution of the different threads and methods. One straightforward approach to include such information uses temporal edges (see Section 4.1.3). The problem with this idea, however, is that the overhead to obtain such information might be large and requires sophisticated tracing techniques. Furthermore, it may significantly influence programme behaviour – possibly making a failure disappear. In addition, increasing the amount of information in the call graph makes the mining process more difficult and time-consuming. We therefore propose a more lightweight approach without temporal information encoded in the graphs.

Non-Deterministic Thread Names. Figure 4.7(c) illustrates our totally reduced call-graph representation that contains the thread IDs in the node labels. This is awkward, as threads are allocated dynamically by the runtime environment or the operating system. Therefore, various correct executions could lead to threads with different IDs for the same method call, even for a programme using the same parameters and input data. We therefore would not be able to compare several programme executions based on the node labels. Omitting this information would result in the graph shown in Figure 4.7(e), which is directly derived from the one in Figure 4.7(c).

Replicated Tasks and Varying Thread Interleavings. Graphs such as the ones in Figures 4.7(c), (d) and (e) suffer from two problems: (1) They might contain a high degree of redundancy that does not help finding defects. For example, a programme using thread pools could have a large number of threads with similar calls due to the execution of replicated tasks (and therefore similar method calls). This typically produces a call graph with several identical and large subtrees, which contain no meaningful information for defect localisation. (2) The call frequencies (i.e., the edge weights) might not be useful for defect localisation, either. Different execution schedules of the same programme can lead to graphs with widely differing edge weights. Example 4.7 illustrates how this effect can disturb data-mining analyses, as such differences are not related to infections.
Example 4.7: Think of method a in Figure 4.7(c) as the run() method, calling the worker task method b, which takes work from a task pool. Sometimes, thread 1 and thread 2 would both call method b twice, as in Figure 4.7(c). In other cases as in Figure 4.7(d), depending on the scheduling, thread 1 could call method b three times, while thread 2 would only call it once or vice versa.

Proposed Graph Representation. Based on the observations discussed so far, we propose a graph representation that avoids repeated substructures as follows:
Notation 4.9 (Total reduction for multithreaded programmes, R^mult_total)
Similar to R_total graphs as described in Notation 4.2 for single-threaded programmes, R^mult_total graphs do not consider the thread names or IDs, either. That is, all nodes referring to the same method are merged into a single node, even if it is called within different threads.
Example 4.8: Figure 4.7(b) is an example for R^mult_total graphs. It is the reduced version of the multithreaded call graph in Figure 4.7(a).
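
The merging step of Notation 4.9 can be sketched in Java as follows. The sketch rests on assumptions: node labels have the form "T<id>:<method>" as in Figure 4.7, edges are named "caller->callee", method names contain neither ':' nor "->", and the class name MultithreadedTotalReduction is hypothetical. The input is a thread-aware, totally reduced edge list with weights; the output sums the weights of all edges that coincide once the thread prefixes are dropped.

```java
import java.util.Map;
import java.util.TreeMap;

final class MultithreadedTotalReduction {
    // Drops the assumed "T<id>:" prefix from a node label.
    static String stripThread(String node) {
        int colon = node.indexOf(':');
        return colon < 0 ? node : node.substring(colon + 1);
    }

    // Merges thread-specific edges and sums up their weights.
    static Map<String, Integer> merge(Map<String, Integer> threadedEdges) {
        Map<String, Integer> merged = new TreeMap<>();
        for (Map.Entry<String, Integer> e : threadedEdges.entrySet()) {
            String[] ends = e.getKey().split("->");
            String key = stripThread(ends[0]) + "->" + stripThread(ends[1]);
            merged.merge(key, e.getValue(), Integer::sum);
        }
        return merged;
    }
}
```

With illustrative weights in the style of Figure 4.7(c), where two threads each execute a once and call b twice from it, the two thread-specific edges from a to b collapse into one edge with weight 4.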
The representation proposed is robust in the sense that different schedules do not influence the graph structure. The reason is that methods executed in different threads are mapped to the same nodes. The downside of this representation is that graph structures from different executions rarely differ. Consequently, a structural analysis of the call graphs as in other approaches (e.g., [LYY+05, DFLS06]) is less promising. However, the edge weights introduced aim to compensate for this effect and allow for detailed analyses.

Possible Extensions. As mentioned, our proposal for reduced call graphs of multithreaded programmes as described in the previous paragraph does not lead to many differences in the graph structure. We therefore present some ideas for possible extensions in the following. Clearly, they should deal with issues such as those related to non-deterministic thread names, varying thread interleavings and replicated tasks as discussed before. As one example, graph representations can have distinct substructures for the different types of threads. A possible solution for the problem of indeterministic thread IDs is the introduction of thread classes. Each of these classes stands for a source-code context, i.e., a position in the source code where new threads are created. As an example, one class could stand for GUI-related threads and one for database-access-related threads. Further information to enhance the expressiveness of call graphs could be information on locks on certain objects. This information could be included as an annotation of nodes or edges.

4.4 Derivation of Call Graphs

In order to derive call graphs from programme executions, we have to trace the executions and to store the relevant information. As we rely on Java in this dissertation, we employ AspectJ [KHH+01] to weave tracing functionality into the Java programmes considered. AspectJ is an aspect-oriented programming (AOP) language [KLM+97] which provides cross-cutting concern functionality for Java. The basic functionality of AspectJ is to define so-called pointcuts which allow for the addition of extra functionality at certain points of a programme execution. Listing 4.2 contains the essence of the aspects we use to generate call graphs.

    1  public aspect tracing {
    2      pointcut getMethod() : execution(* *(..)) && !execution(* AspectJ..*(..));
    3      before() : getMethod() {
    4          // derive callingMethod and calleeMethod
    5          edgeName = callingMethod + "->" + calleeMethod;
    6          callGraph.add(edgeName);
    7      }
    8  }

Listing 4.2: AspectJ code.

As usual in AspectJ, we first declare an aspect, tracing in our case, in Line 1 of Listing 4.2. We then define a pointcut which catches all method invocations (execution(* *(..))) in Line 2. The second part of this line (the part behind &&) prevents AspectJ-specific methods from being traced and becoming part of our call graphs. What follows is the definition of an advice, starting in Line 3. An advice describes what has to be done within a pointcut and at which exact point of the execution. In this case, Lines 4 to 6 are executed before a method matched by the pointcut is actually invoked (this is controlled by the keyword before() in Line 3). In the body of the advice, we first derive the names of the methods involved (Line 4). We derive the callee method by accessing the special variable thisJoinPointStaticPart in AspectJ (which involves reflection) and the calling method with a stack we maintain ourselves. We then assemble an edge name based on the two method names (Line 5) and store it in an internal data structure (Line 6). This data structure counts the occurrences of all edges in an edge list and it can easily be used to derive call graphs in arbitrary representations. We use a dedicated pointcut at the end of a programme execution to write the call graph into a file.
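
The bookkeeping behind Lines 4 to 6 of Listing 4.2 can be sketched in plain Java. This is only an illustration under assumptions, not the aspect used in this dissertation: the class EdgeCountingTracer and its methods enter() and exit() are hypothetical, the sketch is single-threaded, and it expects exit() to be called whenever a traced method returns.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

final class EdgeCountingTracer {
    // Stack of the methods currently being executed (innermost on top).
    private final Deque<String> stack = new ArrayDeque<>();
    // Edge list with occurrence counts, keyed by "caller->callee".
    private final Map<String, Integer> edgeCounts = new TreeMap<>();

    // To be called before a traced method is executed.
    void enter(String calleeMethod) {
        String callingMethod = stack.peek();   // null for the entry method
        if (callingMethod != null) {
            edgeCounts.merge(callingMethod + "->" + calleeMethod, 1, Integer::sum);
        }
        stack.push(calleeMethod);
    }

    // To be called when a traced method returns.
    void exit() {
        stack.pop();
    }

    Map<String, Integer> edgeCounts() {
        return edgeCounts;
    }
}
```

Because the structure stores counts per edge name, totally reduced graphs with weights can be read off directly, which is one reading of why the text calls the edge list easy to convert into other representations.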

4.5 Subsumption

In this chapter, we have introduced various call-graph representations that are the basis for the defect-localisation techniques we introduce in the following Chapter 5. We have focused on method-level call graphs with their different variants, and we have compared these graphs from a descriptive point of view. In Chapter 5, we will shed light on their usefulness for defect localisation. Besides method-level graphs, we have discussed some graph representations at different levels of granularity (and will do so more extensively in Chapter 6) as well as representations for multithreaded programmes. Further, we have explained how we actually derive call graphs from programme executions.


5 Call-Graph-Based Defect Localisation

This chapter focuses on the actual defect-localisation process. The related work [CLZ+09, DFLS06, LYY+05] as well as this dissertation suggest a number of different approaches for this process, relying on various call-graph representations (see Chapter 4). In this dissertation, we distinguish between various structural approaches [CLZ+09, DFLS06, LYY+05] and novel frequency-based and combined approaches. In this chapter, we first present an overview in Section 5.1. To ease presentation, we then discuss existing related approaches in Section 5.2, directly followed by the novel approaches in Section 5.3. We then present an experimental evaluation in Section 5.4 and a summary in Section 5.5.
The approach presented in this chapter (in particular in Section 5.3) serves as a basis for the more sophisticated approaches in Chapters 6 and 7 focusing on scalability issues and defects that affect the dataflow, respectively. In Chapter 8, we present an approach for constraint-based subgraph mining which is a further development of the approach presented in Section 5.3.

5.1 Overview

We now give an overview of the procedure of call-graph-based defect localisation. This is a generic procedure which applies to the techniques that are new in this dissertation (Section 5.3) as well as to most related studies (Section 5.2). Algorithm 5.1 first assigns a class (correct, failing) to every programme trace (Line 3), using a test oracle (see Section 2.2.3). The approaches discussed in this dissertation require such an oracle, and such oracles are typically available in the software development process [JH05]. Then every trace is reduced (Line 4), which leads to smaller call graphs (see Chapter 4). Now frequent subgraphs are mined (Line 6). For this step, several algorithms, e.g., tree mining or graph mining in different variants, can be used. The last step calculates a likelihood of containing a defect. This can be at different levels of granularity, typically at the method level (as shown in Line 7). The calculation of the likelihood is based on the frequent subgraphs mined and facilitates a ranking of the methods, which can then be given to the software developer.

Algorithm 5.1 Generic graph-mining-based defect-localisation procedure.
Input: a collection of programme traces t ∈ T
1: G = ∅ // initialise a collection of reduced graphs
2: for all traces t ∈ T do
3:     assign a class ∈ {correct, failing} to t
4:     G = G ∪ {reduce(t)}
5: end for
6: SG = frequent_subgraph_mining(G)
7: calculate P(m) for all methods m, based on SG
5.2 Existing Structural Approaches
Structural approaches for defect localisation can localise structure-affecting bugs in particular. In some cases, a likelihood P(m) that method m contains a defect is calculated, for every method. This likelihood is then used to rank the methods. In the following, we refer to it as a score. In Sections 5.2.1–5.2.3 we introduce and discuss the different structural scoring approaches.

5.2.1TheApproachfromDiFattaetal.
ordDirootedFattaetorderedal.tree[DFLS06]mineruseFREQTtheR01m[AAK+call-graph02]tofindreductionfrequent(seesubtreesChapter4)(Lineand6thein
Algorithm5.1).Thecalltreesanalysedarelargeandleadtoscalabilityproblems.
Hence,theauthorslimitthesizeofthesubtreessearchedtoamaximumoffour
nodes.Basedontheresultsoffrequentsubtreemining,theydefinethespecificneigh-
exbourhoodecutions(SNwhich).Itareisnotthesetfrequentofallincallsubgraphsgraphsofcontainedcorrectinexallcallecutions:graphsoffailing

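\[ SN := \{\, sg \mid (\mathrm{support}(sg, D_{\mathrm{fail}}) = 100\%) \wedge \neg(\mathrm{support}(sg, D_{\mathrm{corr}}) \geq \mathrm{supp}_{\min}) \,\} \]

where support(g, D) denotes the support of a graph g, i.e., the fraction of graphs in a graph database D containing g. D_fail and D_corr denote the sets of call graphs of failing and correct executions. [DFLS06] uses a minimum support supp_min of 85%.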
Based on the specific neighbourhood, Di Fatta et al. define a structural score P_SN which can be used to guide the following manual debugging process:

\[ P_{\mathrm{SN}}(m) := \frac{\mathrm{support}(g_m, SN)}{\mathrm{support}(g_m, SN) + \mathrm{support}(g_m, D_{\mathrm{corr}})} \]

where g_m denotes all graphs containing method m. Note that P_SN assigns value 0 to methods which do not occur within SN and value 1 to methods which occur in SN but not in correct programme executions D_corr.
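
The two definitions above can be illustrated with a minimal Python sketch. It is not the implementation of [DFLS06]: it assumes that subgraph containment has already been resolved by the miner, so every call graph is given simply as the set of identifiers of the subgraphs it contains. All names (`support`, `specific_neighbourhood`, `p_sn`, `methods_of`, the toy data) are hypothetical.

```python
def support(item, collection):
    """Fraction of sets in `collection` that contain `item`."""
    return sum(item in s for s in collection) / len(collection) if collection else 0.0


def specific_neighbourhood(subgraphs, d_fail, d_corr, supp_min=0.85):
    """Subgraphs contained in every failing graph but not frequent in correct ones."""
    return {sg for sg in subgraphs
            if support(sg, d_fail) == 1.0 and not support(sg, d_corr) >= supp_min}


def p_sn(method, sn, methods_of, corr_methods):
    """Structural score of `method`; None if SN is empty (score undefined)."""
    if not sn:
        return None
    supp_sn = support(method, [methods_of[sg] for sg in sn])
    supp_corr = support(method, corr_methods)
    total = supp_sn + supp_corr
    return supp_sn / total if total else 0.0


# Toy data (assumed): subgraph ids contained in each failing / correct call graph.
d_fail = [{"A", "B"}, {"A", "B", "C"}]
d_corr = [{"B"}, {"B", "C"}]
methods_of = {"A": {"a", "d"}, "B": {"a", "b"}, "C": {"c"}}   # methods per subgraph
corr_methods = [{"a", "b"}, {"a", "b", "c"}]                  # methods per correct graph

sn = specific_neighbourhood(methods_of, d_fail, d_corr)
scores = {m: p_sn(m, sn, methods_of, corr_methods) for m in "abcd"}
```

In this toy setting only subgraph A is in SN. Method d occurs in SN but in no correct execution and therefore receives the maximum score, while methods outside SN receive 0.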


5.2.2 The Approach from Liu et al.
Although [LYY+05] is the first study which applies graph-mining techniques to dynamic call graphs to localise non-crashing bugs, this work from Liu et al. is not directly compatible with the approach from Di Fatta et al. [DFLS06]. In [LYY+05], defect localisation is achieved by a rather complex classification process, and it does not generate a ranking of methods suspected to contain a defect, but a set of such methods.

The work is based on the R_total^tmp reduction technique and works with totally reduced graphs with temporal edges (see Chapter 4). The call graphs are mined with a variant of the CloseGraph algorithm [YH03] (see Section 2.3.3). This step results in frequent subgraphs which are turned into binary features characterising a programme execution: A binary feature vector represents every execution. In this vector, every element indicates if a certain subgraph is included in the corresponding call graph. Using those feature vectors, a support-vector machine (SVM) classifier [Vap95] is learned which decides if a programme execution is correct or failing. More precisely, for every method, two classifiers are learned: one based on call graphs including the respective method and one based on graphs without this method. If the precision rises significantly when adding graphs containing a certain method, this method is deemed more likely to contain a defect. Such methods are added to the so-called bug-relevant function set. Its functions usually line up in a form similar to a stack trace which is presented to a user when a programme crashes. Therefore, the bug-relevant function set serves as the output of the whole approach. This set is given to a software developer who can use it to localise defects more easily. However, the approach does not provide any ranking, which makes it hard to compare the results to other works.

5.2.3 The Approach from Cheng et al.
The study from Cheng et al. [CLZ+09] builds on the same graphs as used by Liu et al. [LYY+05]: totally reduced graphs with temporal edges (R_total^tmp). However, it relies on discriminative subgraph mining with the LEAP algorithm [YCHY08] (see Section 3.2.2). Cheng et al. first apply a heuristic graph filtering procedure to classified R_total^tmp call graphs. This aims at shrinking the graph sizes by removing edges with a lower likelihood to be related to a defect. However, the authors do not provide any guarantees that this does not lose parts of the graphs that are actually relevant for defect localisation. Then, the authors apply the LEAP algorithm to the filtered graphs, resulting in the top-k discriminative subgraphs (discriminative with respect to correct, failing), i.e., subgraphs having an increased likelihood to be related to defects. The authors then report these subgraphs to the user to ease the manual debugging process. As with the approach from Liu et al. [LYY+05], the results cannot directly be compared to other approaches, as no method ranking is generated. Besides the R_total^tmp graph representation, Cheng et al. also apply their approach to more fine-grained basic-block-level call graphs (R_total^block, see Section 4.2).

5.3 Frequency-Based and Combined Approaches
As mentioned before, the structural approaches for defect localisation have their strengths in localising structure-affecting bugs (see Section 5.2). In particular, the totally reduced graphs used in [CLZ+09, LYY+05] lose all information about the frequency of method calls (except the information whether a certain method is called or not). This makes it hard to impossible to localise frequency-affecting bugs. However, these techniques might find such defects when the infection leads to side effects that change the structure of the call graphs.
We now develop a novel technique that specialises in the localisation of frequency-affecting bugs in Section 5.3.1. In order to be able to localise a possibly broad range of defects, we then present novel approaches for the combination of structural and frequency-based techniques in Section 5.3.2.

5.3.1 Frequency-Based Approach
We now develop a technique that is able to localise frequency-affecting bugs. To do so, it is natural to analyse the call frequencies that are included as edge weights in some of the call-graph representations proposed in Chapter 4. As discussed before (see Section 3.2.1), there are no weighted subgraph-mining approaches that can be used directly for defect localisation. We therefore present a postprocessing approach in the following. It builds on frequent subgraph mining and feature selection to analyse the edge weights. Similarly to the structural approaches (see Section 5.2), the aim is to calculate a score, i.e., a likelihood to contain a defect, for every method. In the following we describe the individual steps.

Graph Mining
After having reduced the call graphs gained from correct and failing programme executions using the R_subtree technique (see Chapter 4), we search for frequent closed subgraphs SG in the graph dataset G using the CloseGraph algorithm [YH03] (Line 6 in Algorithm 5.1; see Section 2.3.3). For this step, we employ the ParSeMiS graph mining suite [PWDW09]. Closed mining reduces the number of graphs in the result set significantly and increases the performance of the mining algorithm. Furthermore, the usage of a general subgraph-mining algorithm instead of a tree miner allows for comparative experiments with other graph-reduction techniques such as R_total (see Section 5.4). We use the subgraphs obtained from this frequent-subgraph-mining step as different contexts and perform all further analyses for every subgraph context separately. This aims at a higher precision than an analysis without such contexts and allows us to localise defects that only occur in a certain context.

Example 5.1: A failure might occur when method a is called from method b, only when method c is called as well. Then, the defect might be localised only in the context of call graphs containing all methods mentioned, but not in graphs without method c.

Analysis of Weights
We now consider the edge weights. As an example, a frequency-affecting bug increases the frequency of a certain method invocation and therefore the weight of the corresponding edge. To find the defect, one has to search for edge weights which are increased in failing executions. To do so, we focus on frequent subgraphs which occur in both correct and failing executions. The goal is to develop an approach which automatically discovers which edge weights of call graphs from a programme are most significant to discriminate between correct and failing.

To identify discriminative edges, one possibility is to consider different edge types, e.g., edges having the same calling method m_s (start) and the same callee method m_e (end). However, edges of one type can appear more than once within one subgraph and, of course, in several different subgraphs. Therefore, we analyse every edge in every such location, which we refer to as a context. This aims at a high probability to reveal a defect. In doing so, we typically investigate every edge weight in many different contexts. To specify the exact location of an edge in its context within a certain subgraph, we do not use the method names, as they may occur more than once. Instead, we use a unique id for the calling node (id_s) and another one for the callee method (id_e). All ids are valid within their subgraph. To sum up, we reference an edge in its context in a certain subgraph sg with the following tuple: (sg, id_s, id_e). A certain defect does not affect all method calls (edges) of the same type, but method calls of the same type in the same context. To allow for a more detailed analysis, we take this information into account, and we assemble a comprehensive feature table as follows:

Notation 5.1 (Feature tables for defect localisation with R_subtree graphs)
The feature tables have the following structure: The rows stand for all programme executions, represented by their call graphs. For every edge in every frequent subgraph, there is one column. The table cells contain the edge weights, except for the very last column, which contains the class ∈ {correct, failing}. Graphs (rows) can contain a certain subgraph not just once, but several times at different locations. In this case, averages are used in the corresponding cells of the table. If a subgraph is not contained in a call graph, the corresponding cells have value 0.


exec. | a→b (sg1, id1, id2) | a→b (sg1, id1, id3) | a→b (sg2, id1, id2) | a→c (sg2, id1, id3) | ⋯ | class
g1 | 0 | 0 | 13 | 6513 | ⋯ | correct
g2 | 512 | 41 | 8 | 12479 | ⋯ | failing
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮
Table 5.1: Example table used as input for feature-selection algorithms.

Example 5.2: Table 5.1 serves as an example. The first column contains a reference to the programme execution or, more precisely, to its reduced call graph g_i ∈ G. The second column corresponds to the first subgraph (sg1) and the edge from id1 (method a) to id2 (method b). The third column corresponds to the same subgraph (sg1) but to the edge from id1 to id3. Note that both id2 and id3 represent method b. The fourth column represents an edge from id1 to id2 in the second subgraph (sg2). The fifth column represents another edge in sg2. Note that ids have different meanings in different subgraphs. The last column contains the class correct or failing. g1 does not contain sg1, and the respective cells have value 0.
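
A feature table of this kind could be assembled as in the following minimal Python sketch. It assumes that the mining step has already produced, for every execution, the edge weights of each embedding of each subgraph edge; the function name, the data layout and all numbers are illustrative assumptions, not part of the original tooling.

```python
from statistics import mean


def build_feature_table(executions, columns):
    """One row per execution: averaged edge weights per column, then the class.

    executions: list of (embeddings, label), where `embeddings` maps a column
                key (sg, id_s, id_e) to the weights observed at each embedding.
    columns:    ordered list of column keys (sg, id_s, id_e).
    """
    rows = []
    for embeddings, label in executions:
        # Average over several embeddings; 0 if the subgraph is not contained.
        weights = [mean(embeddings[c]) if embeddings.get(c) else 0 for c in columns]
        rows.append(weights + [label])
    return rows


columns = [("sg1", 1, 2), ("sg1", 1, 3), ("sg2", 1, 2), ("sg2", 1, 3)]
g1 = ({("sg2", 1, 2): [13], ("sg2", 1, 3): [6513]}, "correct")
g2 = ({("sg1", 1, 2): [500, 524],          # sg1 embedded twice: cell holds the average
       ("sg1", 1, 3): [41],
       ("sg2", 1, 2): [8],
       ("sg2", 1, 3): [12479]}, "failing")

table = build_feature_table([g1, g2], columns)
```

The first execution does not contain sg1, so its first two cells are 0. The second execution contains sg1 at two locations, so the corresponding cell holds the average of the two weights.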
The table structure described allows for a detailed analysis of edge weights in different contexts within a subgraph. Algorithm 5.2 describes all subsequent steps in this section. After putting together the table, we deploy a standard feature-selection algorithm, information gain (InfoGain, see Definition 2.7), to calculate the discriminativeness of the columns in the table and thus the different edges. We use the implementation from the Weka data-mining suite [HFH+09] to calculate the InfoGain with respect to the class of the executions (correct or failing) for every column (Line 1 in Algorithm 5.2). We interpret the values as a likelihood of being responsible for defects. Columns with an InfoGain of 0, i.e., the edges always have the same weights in both classes, are discarded immediately (Line 2 in Algorithm 5.2).

Algorithm 5.2 Procedure to calculate P_freq(m_s, m_e) and P_freq(m).
Input: a set of edges e ∈ E, e = (sg, id_s, id_e)
1: assign every e ∈ E its information gain InfoGain
2: E = E \ {e ∣ e.InfoGain = 0}
3: // remove follow-up infections:
   E = E \ {e ∣ ∃p : p ∈ E, p.sg = e.sg, p.id_e = e.id_s, p.InfoGain = e.InfoGain}
4: E_(m_s,m_e) = {e ∣ e ∈ E ∧ e.id_s.label = m_s ∧ e.id_e.label = m_e}
5: P_freq(m_s, m_e) = max_{e ∈ E_(m_s,m_e)} (e.InfoGain)
6: E_m = {e ∣ e ∈ E ∧ e.id_s.label = m}
7: P_freq(m) = max_{e ∈ E_m} (e.InfoGain)
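
The steps of Algorithm 5.2 can be sketched in Python as follows. This is a simplified illustration, not the Weka-based implementation used in the experiments: the information gain of a numeric column is approximated here by the best binary threshold split, which is an assumption of this sketch, and the names `Edge`, `info_gain` and `p_freq` are hypothetical.

```python
import math
from collections import Counter, namedtuple

# An edge in its context: subgraph, node ids and the method labels of both nodes.
Edge = namedtuple("Edge", "sg id_s id_e label_s label_e")


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels):
    """Information gain of the best binary split `value <= t` (assumed discretisation)."""
    base, n, best = entropy(labels), len(labels), 0.0
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        best = max(best, gain)
    return best


def p_freq(columns, labels):
    """columns: dict Edge -> list of edge weights (one per execution)."""
    gain = {e: info_gain(w, labels) for e, w in columns.items()}        # Line 1
    gain = {e: g for e, g in gain.items() if g > 0}                     # Line 2
    kept = {e: g for e, g in gain.items()                               # Line 3
            if not any(p.sg == e.sg and p.id_e == e.id_s and gain[p] == g
                       for p in gain)}
    by_call, by_method = {}, {}
    for e, g in kept.items():
        call = (e.label_s, e.label_e)
        by_call[call] = max(by_call.get(call, 0.0), g)                  # Lines 4-5
        by_method[e.label_s] = max(by_method.get(e.label_s, 0.0), g)    # Lines 6-7
    return by_call, by_method


# Toy data: a calls d too often in failing runs; d's outgoing edges scale along.
labels = ["correct", "correct", "failing", "failing"]
columns = {
    Edge(1, 1, 2, "a", "d"): [2, 2, 20, 20],
    Edge(1, 2, 3, "d", "e"): [2, 2, 20, 20],
    Edge(1, 2, 4, "d", "f"): [6, 6, 60, 60],
    Edge(1, 1, 5, "a", "b"): [1, 1, 1, 1],
}
by_call, by_method = p_freq(columns, labels)
```

In the toy data, the edge a→b never changes its weight and is dropped in Line 2, the edges leaving d share the score of their parent edge a→d and are dropped in Line 3, and only the invocation a→d (and thus method a) remains in the ranking.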


Besides the information gain (InfoGain, see Definition 2.7), we could have chosen various different algorithms originally designed for feature selection. In preliminary experiments, we have evaluated a number of such techniques with the result that those based on entropy are best suited for defect localisation, and that information gain produces the best results for the particular dataset we use in Section 5.4. Concretely, we have run experiments with the following feature-selection algorithms besides information gain (1–3 are based on entropy, too):
1. Information-gain ratio (GainRatio, see Definition 2.7)
2. Symmetrical uncertainty [WF05]
3. The OneR decision-stump classifier [WF05]
4. The chi-squared statistic (see, e.g., [WF05])
5. Relief [Kon94]
6. A support-vector-machine (SVM) based algorithm [GWBV02]

Follow-Up Infections
Call graphs of failing executions frequently contain infection-like patterns which are caused by a preceding infection. We call such patterns follow-up infections and remove them from our ranked list of features. Figure 5.1 illustrates a follow-up infection: (a) represents a defect-free version, (b) contains a defect in method a where it calls method d. Here, this method is called 20 times instead of twice. Following our reduction technique, this leads to the same (or a proportional) increase in the number of calls in method d. In our entropy-based ranking, the edges d→e and d→f inherit the score from a→d if the scaling of the weights is proportional. Thus, we interpret these two edges as follow-up infections and remove them from our ranking. More formally, we remove edges if the edge leading to its direct parent within the same subgraph has the same entropy score (Line 3 in Algorithm 5.2). In case of more than one defect in a programme, this way of follow-up infection detection might not find all such infections, but preliminary experiments have shown that it does detect common cases efficiently. We leave aside the pathological case that this technique classifies a real infection as follow-up infection. This is acceptable, since the probability of a certain entropy value is the same for every defect. Therefore, it is very unlikely that two unrelated defects lead to exactly the same entropy value, which would lead to a false positive classification.

From the Invocation-Level to the Method-Level
Until now, we calculate likelihoods of method invocations to be defective for every invocation (described by a calling method m_s and a method called m_e). We call this score P_freq(m_s, m_e), as it is based on the call frequencies. To do the calculation, we first determine sets E_(m_s,m_e) of edges e ∈ E for every method invocation in Line 4 of Algorithm 5.2. In Line 5, we use the max function to calculate P_freq(m_s, m_e), the maximum InfoGain of all edges (method invocations) in E. In general, there are many edges in E with the same method invocation, as an invocation can occur in different contexts. With the max function, we assign every invocation the score from the context ranked highest. Other invocations with lower values might not be related to the defect.

[Figure 5.1: Follow-up infections. (a) Defect-free version; (b) version with a defect in method a.]

Example 5.3: An edge from a to b is contained in two subgraphs. In one subgraph, this edge a→b has a low InfoGain value of 0.1. In the other subgraph, and therefore in another context, the same edge has a high InfoGain value of 0.8, i.e., a defect is relatively likely. As one is interested in these cases, lower scores for the same invocation are less important, and only the maximum is considered.
At the moment, the ranking does not only provide the score for a method invocation, P_freq(m_s, m_e), but also the subgraphs where it occurs and the exact embeddings. This information might be important for a software developer. We report this information additionally. To ease comparison with other approaches not providing this information, we also calculate P_freq(m) for every calling method m in Lines 6 and 7 of Algorithm 5.2. The explanation is analogous to the one of the calculation of P_freq(m_s, m_e) in Lines 4 and 5.

5.3.2 Combined Approaches
As discussed before, structural approaches are well suited for the localisation of structure-affecting bugs, while frequency-based approaches focus on call-frequency-affecting bugs. To be able to localise a broader range of defects, it seems to be promising to combine both approaches. In the following we first introduce a new structural score for combinations before we discuss combination strategies.


A Structural Score for Combination
The notion of the specific neighbourhood (SN) as introduced by Di Fatta et al. [DFLS06] (see Section 5.2.1) has the problem that no support can be calculated when the SN is empty.¹ Furthermore, preliminary experiments of ours have revealed that the P_SN scoring only works well if a significant number of graphs is contained in SN. This depends on the graph reduction and mining techniques and has not always been the case in the experiments. Thus, to complement the frequency-based scoring (see Section 5.3.1), we define another structural score. It is based on the set of frequent subgraphs which occur in failing executions only, SG_fail. We calculate the structural score P_fail as the support of m in SG_fail:

\[ P_{\mathrm{fail}}(m) := \mathrm{support}(g_m, SG_{\mathrm{fail}}) \]

This is the support of all graphs containing method m in SG_fail.

Combination Strategies
As a first combination strategy, we combine the frequency-based approach with the P_SN score (see Section 5.2.1). In order to calculate the resulting score, we use the approach from Di Fatta et al. [DFLS06] without temporal order: We use the R_01m^unord reduction with a general graph miner, gSpan [YH02] (see Section 2.3.3), in order to calculate the structural P_SN score. We derive the frequency-based P_freq score as described before after mining the same call graphs but with the R_subtree reduction and the CloseGraph algorithm [YH03] (as described before). In order to combine the two scores derived from the results of two graph-mining runs, we calculate the arithmetic mean of the normalised scores:

\[ P^{\mathrm{SN}}_{\mathrm{comb}}(m) := \frac{P_{\mathrm{freq}}(m)}{2 \max\limits_{n \in V(sg),\; sg \subseteq g \in D} P_{\mathrm{freq}}(n)} + \frac{P_{\mathrm{SN}}(m)}{2 \max\limits_{n \in V(sg),\; sg \subseteq g \in D} P_{\mathrm{SN}}(n)} \]

where n is a method in a subgraph sg of the database of all call graphs D.
As this combined approach requires two costly graph-mining executions, we have introduced the structural score P_fail as a basis for a simpler combined defect-localisation approach. It requires only one graph-mining execution: We combine the frequency-based score with the P_fail score, both based on the results from one CloseGraph execution. Concretely, we combine the results with the arithmetic mean, as before:

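\[ P^{\mathrm{subtree}}_{\mathrm{comb}}(m) := \frac{P_{\mathrm{freq}}(m)}{2 \max\limits_{n \in V(sg),\; sg \subseteq g \in D} P_{\mathrm{freq}}(n)} + \frac{P_{\mathrm{fail}}(m)}{2 \max\limits_{n \in V(sg),\; sg \subseteq g \in D} P_{\mathrm{fail}}(n)} \]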
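
Both combined scores follow the same pattern: each input score is normalised by its maximum over all methods, and the two normalised values are averaged. A minimal Python sketch of this pattern is given below. The function name `combine`, the treatment of methods missing from one ranking (score 0) and the example values are assumptions of this sketch.

```python
def combine(p_freq, p_struct):
    """Arithmetic mean of two max-normalised score dictionaries (method -> score)."""
    # Guard against empty or all-zero inputs to avoid a division by zero.
    max_freq = max(p_freq.values(), default=0) or 1
    max_struct = max(p_struct.values(), default=0) or 1
    methods = set(p_freq) | set(p_struct)
    return {m: p_freq.get(m, 0) / (2 * max_freq) + p_struct.get(m, 0) / (2 * max_struct)
            for m in methods}


# Illustrative values only.
p_freq_scores = {"a": 0.8, "b": 0.4}
p_fail_scores = {"a": 0.5, "c": 1.0}
combined = combine(p_freq_scores, p_fail_scores)
ranking = sorted(combined, key=combined.get, reverse=True)
```

With these values, method a leads the ranking because it scores in both inputs, followed by c (highest structural score) and b.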
¹ [DFLS06] uses a simplistic fall-back approach to deal with this effect.


5.4 Experimental Evaluation
We now evaluate the different proposals for call-graph reductions (see Chapter 4) and localisation techniques introduced in this section. In Section 5.4.1, we describe the experimental setup, and in Section 5.4.2 we present the experimental comparison of call-graph-based techniques. In Section 5.4.3, we compare these techniques to related work from software engineering.

5.4.1 Experimental Setup

Methodology
Many of the defect-localisation techniques as described in this chapter produce ordered lists of methods. Someone doing a code review would start with the first method in such a list. The maximum number of methods to be checked to find the defect therefore is the position of the faulty method in the list. This position is our measure of result accuracy. Under the assumption that all methods have the same size and that the same effort is needed to localise a defect within a method, this measure linearly quantifies the intellectual effort to find a defect. Sometimes two or more subsequent positions have the same score. As the intuition is to count the maximum number of methods to be checked, all positions with the same score have the number of the last position with this score. This is in line with the methodology of related studies (e.g., [JH05]). If the first defect is, say, reported at the third position, this is a fairly good result, depending on the total number of methods of the target programme. A software developer only has to do a code review of maximally three methods.
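
The accuracy measure described above, including the treatment of ties, can be expressed compactly. The following Python sketch is an illustration under the assumption that a ranking is given as a mapping from method names to scores; the function name and the example scores are hypothetical.

```python
def ranking_position(scores, defective):
    """Worst-case number of methods to review before reaching `defective`.

    Methods sharing a score all receive the last position of their tie group,
    so the result counts every method scoring at least as high as the target.
    Returns None if the defective method does not appear in the ranking.
    """
    if defective not in scores:
        return None
    target = scores[defective]
    return sum(1 for s in scores.values() if s >= target)


scores = {"a": 0.9, "b": 0.7, "c": 0.7, "d": 0.1}
```

Here, methods b and c share a score, so a defect in either of them is reported at position 3, while a defect in a is reported at position 1.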

Programme under Test and Defects
As we rely on Java and AspectJ instrumentations in this dissertation, our experiments feature a Java programme. Concretely, we use a well-known diff tool taken from [Dar04], consisting of 25 methods and 706 lines of code (LOC). We instrumented this programme with 14 different defects which are artificial, but mimic defects which occur in reality and are similar to the defects used in related work. In particular, we have examined the Siemens Programmes [HFGO94] which are used in many related publications on dynamic defect localisation (see Section 3.1.2) and have identified five types of defects which are most frequent within them:

1. Wrong variable used
2. Off-by-one (e.g., i+1 instead of i or vice versa)
3. Wrong comparison operator (e.g., >= instead of >)
4. Additional conditions
5. Missing conditions

Our programme versions contain these five types of defects. The Siemens Programmes mostly contain defects in single lines and just a few programmes with more than one defect. To mimic the Siemens Programmes as closely as possible, we have instrumented only two out of the 14 versions (defects 7 and 8) with more than one defect. We give an overview of the kinds of defects used in Table 5.2.

We have executed each version of the programme 100 times with different input data. Then we have classified the executions as correct or failing with a test oracle based on a defect-free reference programme.

Design of the Experiments
The experiments are designed to answer the following questions:

1. How do frequency-based approaches perform compared to structural ones? How can combined approaches improve the results?

2. In Section 4.1.5, we have compared reduction techniques based on the compression ratio achieved. How do the different reduction techniques perform in terms of defect-localisation precision?

3. Some approaches make use of the temporal order of method calls. The call-graph representations tend to be much larger than without. Do such graph representations improve precision?

version | description
defect 1, defect 10 | wrong variable used
defect 2, defect 11 | additional or-condition
defect 3 | >= instead of !=
defect 4, defect 12 | i+1 instead of i in array access
defect 5, defect 13 | >= instead of >
defect 6 | > instead of <
defect 7 | a combination of defect 2 and defect 4 (in the same line)
defect 8 | i+1 instead of i in array access + additional or-condition
defect 9, defect 14 | missing condition
Table 5.2: Defects used in the evaluation.


In concrete terms, we compare the following five alternatives:

E_01m: The structural P_SN-scoring approach [DFLS06] (see Section 5.2), based on the unordered R_01m^unord reduction.

E_subtree: Our frequency-based P_freq-scoring approach (see Section 5.3.1) based on the R_subtree reduction.

E_comb^SN: Our combined approach with the P_comb^SN scoring (see Section 5.3.2), based on the R_01m^unord and R_subtree reductions.

E_comb^subtree: Our combined approach with the P_comb^subtree scoring (see Section 5.3.2), solely based on the R_subtree reduction.

E_total: The combined approach as before, but with the R_total^w reduction [LYY+05] (with weights but without temporal edges, see Section 5.2).
For all experiments relying on the CloseGraph algorithm we use a minimum support supp_min of 3. This allows for relatively large result sets, even when the graph database is relatively small. Large result sets prevent the approaches relying on the P_fail score (experiments E_comb^subtree and E_total) from assigning the same score to many methods, which would lower the quality of the ranking.

5.4.2 Experimental Results
We present the results (the number of the first position in which a defect is found) of the five experiments for all 14 defects in Table 5.3. We represent a defect which is not discovered with the respective approach with ‘-’. Note that with the frequency-based and the combined method rankings, there usually is additional information available where a defect is located within a method, and in the context of which subgraph it appears. The following comparisons leave aside this additional information.

Structural, Frequency-Based and Combined Approaches
When comparing the results from E_01m and E_subtree, the frequency-based approach (E_subtree) performs almost always as good as or better than the structural one (E_01m). This demonstrates that analysing numerical call frequencies is adequate to localise defects. Defects 1, 9 and 13 illustrate that both approaches alone cannot find certain defects. Defect 9 cannot be found by comparing call frequencies (E_subtree). This is because defect 9 is a modified condition which always leads to the invocation of a certain method. In consequence, the call frequency is always the same. Defects 1 and 13 are not found with the purely structural approach (E_01m). Both are typical call-frequency-affecting defects: Defect 1 is in an if-condition inside a loop and leads to more invocations of a certain method. In defect 13, a modified for-condition slightly changes the call frequency of a method inside the loop. With the R_01m^unord reduction technique used in E_01m, defect 2 and 13 have the same graph structure both with correct and with failing executions. Thus, it is difficult to impossible to identify structural differences.

Exp. / defect | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14
E_01m | - | 3 | 1 | 3 | 2 | 4 | 3 | 1 | 1 | 6 | 4 | 4 | - | 4
E_subtree | 3 | 3 | 1 | 1 | 1 | 3 | 3 | 1 | - | 2 | 3 | 3 | 3 | 5
E_comb^SN | 1 | 3 | 1 | 2 | 2 | 1 | 2 | 1 | 3 | 1 | 2 | 4 | 8 | 3
E_comb^subtree | 3 | 2 | 1 | 1 | 1 | 2 | 2 | 1 | 18 | 2 | 2 | 3 | 3 | 3
E_total | 1 | 5 | 1 | 4 | 3 | 5 | 5 | 2 | - | 2 | 5 | 4 | 6 | 3
Table 5.3: Experimental results.

The combined approaches in E_comb^SN and E_comb^subtree are intended to take structural information into account as well to improve the results from E_subtree. We do achieve this goal: When comparing E_subtree and E_comb^subtree, we retain the already good results from E_subtree in nine cases and improve them in five.

When looking at the two combination strategies, it is hard to say which one is better. E_comb^SN turns out to be better in four cases while E_comb^subtree is better in six ones. Thus, the technique in E_comb^subtree is slightly better, but not with every defect. Furthermore, the technique in E_comb^SN is less efficient as it requires two graph-mining runs.

Reduction Techniques

Looking at the call-graph-reduction techniques, the results from the experiments discussed so far reveal that the subtree-reduction technique with edge weights (R_subtree) used in E_subtree as well as in both combined approaches is superior to the zero-one-many reduction (R_01m^unord). Besides the increased precision of the localisation techniques based on the reduction, R_subtree also produces smaller graphs than R_01m^unord, which is good for scalability and runtime (see Section 4.1.5).

E_total evaluates the total reduction technique. We use R_total^w as an instance of the total reduction family. The rationale is that this one can be used in the same setup as E_comb^subtree. In most cases, the total reduction (E_total) performs worse than the subtree reduction (E_comb^subtree). This confirms that the subtree-reduction technique is reasonable, and that it is worthwhile to keep more structural information than the total reduction does. However, in cases where the subtree reduction produces graphs which are too large for efficient mining, and the total reduction produces sufficiently small graphs, R_total^w can be an alternative to R_subtree.


Temporal Order
The experimental results listed in Table 5.3 do not shed any light on the influence of the temporal order. When applied to the defective programmes used in our comparisons, the total reduction with temporal edges (R_total^tmp) produces graphs of a size which cannot be mined in a reasonable time. This already shows that the representation of the temporal order with additional edges might lead to graphs whose size is not manageable any more. In preliminary experiments of ours, we have repeated E_01m with the R_01m^ord reduction and the FREQT [AAK+02] rooted ordered tree miner in order to evaluate the usefulness of the temporal order. Although we systematically varied the different mining parameters, the results of these experiments in general are not better than those in E_01m. Only in two of the 14 defects the temporal-aware approach has performed better than in E_01m, in the other cases it has performed worse. In a comparison with the R_subtree reduction and the gSpan algorithm [YH02] (see Section 2.3.3), the R_01m^ord reduction with the ordered tree miner displayed a significantly increased runtime by a factor of 4.8 on average.² Therefore, our preliminary result based on the defects used in this section is that the incorporation of the temporal order does not increase the precision of defect localisations.

5.4.3 Comparison to Related Work
So far, the existing call-graph-based techniques [CLZ+09, DFLS06, LYY+05] have not been compared to the well-known techniques from software engineering discussed in Chapter 3.³ We now compare our best-performing approach, E_comb^subtree, to the Tarantula technique [JHS02], to two of its improvements [AZGvG09] and to the SOBER method [LFY+06] (see Section 3.1.2 for details). These techniques can be seen as established defect-localisation techniques as they have outperformed a number of competitive approaches (see Section 3.1.2).

For the experiments in this section, we have implemented Tarantula, its improvements and SOBER for our programme used in the evaluations in this chapter. We have done so as no complete implementations are publicly available. For SOBER, there is MATLAB source code available from the authors that performs the statistical calculations. However, there is no tool for the instrumentation of predicates available. For SOBER we have therefore implemented an automatic instrumentation, and we have reimplemented the statistical calculations in Java. For Tarantula and its improvements, we have implemented both steps, automated instrumentation and the calculations.

² In this comparison, FREQT was restricted as in [DFLS06] to find subtrees of a maximum size of four nodes. Such a restriction was not set in gSpan. Furthermore, we expect a further significant speedup when CloseGraph is used instead of gSpan.
³ Only [CLZ+09] has been compared to the sequence-mining-based approach RAPID [HJO08] which partly builds on the well-known Tarantula technique [JHS02].


Both techniques, Tarantula and SOBER, work on granularities that are finer than the method level used by our approach (see Section 3.1.2). However, the Tarantula authors describe a means of mapping the results to the method level [JH05]. Concretely, the authors assign the score from its highest ranked basic block to the method. For our comparisons we rely on this mapping, and we do the same for SOBER, which originally works on the predicate level.
In our experiments with Tarantula, we have noticed that it happens frequently that methods have the same likelihood score, which worsens the results from the approach. Although the authors have not applied this technique in the original evaluations [JH05, JHS02], we use the brightness score from Tarantula (see Section 3.1.2) as a secondary ranking criterion. That is, we let this criterion decide the ranking position in case the original score is the same for some methods. This approach seems to be natural, as the brightness would be a secondary source of information for a developer who uses the original visualisation from Tarantula.
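
The method-level mapping and the brightness tie-break can be sketched as follows. The sketch uses the suspiciousness, brightness, Jaccard and Ochiai formulas as they are commonly stated in the coverage-based fault-localisation literature, which is an assumption with respect to the exact definitions in Section 3.1.2. It also assumes at least one passing and one failing run; all function names and the coverage counts are hypothetical.

```python
import math


def tarantula(passed, failed, total_passed, total_failed):
    """Tarantula suspiciousness of a block from its coverage counts."""
    p, f = passed / total_passed, failed / total_failed
    return f / (p + f) if p + f else 0.0


def brightness(passed, failed, total_passed, total_failed):
    """Secondary criterion: the larger of the two coverage ratios."""
    return max(passed / total_passed, failed / total_failed)


def jaccard(passed, failed, total_passed, total_failed):
    return failed / (total_failed + passed)


def ochiai(passed, failed, total_passed, total_failed):
    denom = math.sqrt(total_failed * (failed + passed))
    return failed / denom if denom else 0.0


def rank_methods(blocks, total_passed, total_failed):
    """blocks: (method, passed, failed) per basic block; returns methods, best first.

    Every method inherits the key of its highest-ranked block; ties in the
    suspiciousness are broken by the brightness.
    """
    best = {}
    for method, passed, failed in blocks:
        key = (tarantula(passed, failed, total_passed, total_failed),
               brightness(passed, failed, total_passed, total_failed))
        best[method] = max(best.get(method, key), key)
    return sorted(best, key=best.get, reverse=True)


# Illustrative coverage counts for 10 passing and 10 failing runs.
blocks = [("m1", 0, 5), ("m2", 0, 10), ("m2", 10, 10), ("m3", 10, 2)]
order = rank_methods(blocks, total_passed=10, total_failed=10)
```

In this example, m1 and m2 both reach the maximum suspiciousness of 1.0, and the brightness places m2 first because its block is covered by all failing runs.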
When looking at Tarantula and our approach from a theoretical perspective, our approach considers more data than code coverage as utilised by Tarantula, but at the coarser method level. The information analysed by our approach additionally to the information analysed by Tarantula includes (1) call frequencies, (2) subgraph contexts and (3) the information which method has called another one. This data is potentially relevant for defect localisation, e.g., to localise frequency-affecting bugs (1) and structure-affecting bugs (2, 3). We therefore expect good results from our approach in comparison to Tarantula.
As discussed before (see Section 3.1.2), SOBER overcomes some of the shortcomings of previous approaches mentioned. It analyses the frequencies of predicate evaluations and is therefore better suited than Tarantula to localise frequency-affecting bugs. However, it does not analyse subgraph contexts as our approach does, but its predicate analysis takes information into account that we do not consider (e.g., return-value predicates). It is therefore hard to formulate theoretical expectations whether SOBER or our approach will perform better.
In the following, we compare our E_comb^subtree approach (values taken from Table 5.3) to Tarantula in experiments E_Tarantula and E_Tarantula^b (without and with the brightness score), to the Jaccard coefficient variations in experiments E_Jaccard and E_Jaccard^b, to the Ochiai coefficient variations in experiments E_Ochiai and E_Ochiai^b and to SOBER in experiment E_SOBER.

exp. / defect | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | ∅
E_comb^subtree | 3 | 2 | 1 | 1 | 1 | 2 | 2 | 1 | 18 | 2 | 2 | 3 | 3 | 3 | 3.1
E_Tarantula | 6 | 8 | 5 | 6 | 7 | 9 | 8 | 6 | 5 | 6 | 9 | 9 | 11 | 9 | 7.6
E_Tarantula^b | 1 | 6 | 1 | 3 | 7 | 7 | 6 | 3 | 5 | 6 | 7 | 7 | 11 | 7 | 5.5
E_Jaccard | 1 | 6 | 1 | 3 | 6 | 7 | 6 | 4 | 4 | 5 | 7 | 7 | 11 | 7 | 5.4
E_Jaccard^b | 1 | 6 | 1 | 3 | 6 | 7 | 6 | 4 | 4 | 5 | 7 | 7 | 11 | 7 | 5.4
E_Ochiai | 1 | 6 | 1 | 3 | 6 | 7 | 6 | 4 | 4 | 5 | 7 | 7 | 11 | 7 | 5.4
E_Ochiai^b | 1 | 6 | 1 | 3 | 6 | 7 | 6 | 4 | 4 | 5 | 7 | 7 | 11 | 7 | 5.4
E_SOBER | 3 | 3 | 6 | 3 | 3 | 4 | 3 | 3 | 4 | 5 | 4 | 4 | 9 | 4 | 4.1
Table 5.4: Comparison to related work.

Table 5.4 contains the results from the comparison. The table clearly shows that our approach (E_comb^subtree) performs best in 12 out of the 14 defects and is best on average as well. Only for defect 1 some of the other approaches perform a little better, and for defect 9 all other approaches perform better than E_comb^subtree. The explanation for the latter is as before: Defect 9 does not affect the call frequencies at all, which are the most important evidence for our approach. The other comparisons are as expected: E_Tarantula^b leads to better results than E_Tarantula, and E_Jaccard and E_Ochiai perform better than E_Tarantula (as in [AZGvG09]). In our case, the brightness extension does not improve the results of E_Jaccard and E_Ochiai. This can be explained by the fact that E_Jaccard and E_Ochiai are initially a lot better than E_Tarantula, which has more potential for improvements. In our experiments, the results of E_Jaccard and E_Ochiai do not differ. This is not unexpected either, as in [AZGvG09] the Ochiai coefficient is only in very few cases better than the Jaccard coefficient. E_SOBER in turn performs on average better than all Tarantula variations, although there are a few defects where E_SOBER performs worse. This is consistent with the results in [LFY+06].
Looking at the average values, a developer has to consider 3.1 methods when using our approach (E_comb^subtree). In contrast, when using the best performing approach from the related work considered in this comparison, SOBER (E_SOBER), the developer would have to consider 4.1 methods. Based on the benchmark defects used in these experiments, our approach therefore reduces the effort for defect localisation by 24% compared to SOBER.

Besides Tarantula and SOBER, we have also experimented with the static defect-localisation tool FindBugs [AHM+08] (see Section 3.1.1). However, FindBugs was not able to localise any defect in our 14 defective programme versions. This is not surprising, as the defects of our benchmark (listed in Table 5.2) mostly represent defects affecting the programme logic rather than defect-prone programming patterns that can be identified by FindBugs.
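
The average positions and the reported effort reduction relative to SOBER can be re-derived with a few lines of Python. The two lists are the ranking positions of E_comb^subtree and E_SOBER for defects 1 to 14 as read from Table 5.4; the variable names are illustrative, and the reduction is computed from the rounded averages, as in the text.

```python
from statistics import mean

# Ranking positions for defects 1..14 (rows of Table 5.4).
e_comb_subtree = [3, 2, 1, 1, 1, 2, 2, 1, 18, 2, 2, 3, 3, 3]
e_sober = [3, 3, 6, 3, 3, 4, 3, 3, 4, 5, 4, 4, 9, 4]

avg_ours = round(mean(e_comb_subtree), 1)    # methods to review with our approach
avg_sober = round(mean(e_sober), 1)          # methods to review with SOBER
reduction_percent = round((1 - avg_ours / avg_sober) * 100)
```

The averages come out as 3.1 and 4.1 methods, which corresponds to a reduction of roughly 24%.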
5.5 Subsumption
The experiments in this chapter have shown that our approach performs well (regarding localisation precision) compared to both related approaches based on call-graph mining and established approaches from software engineering. However, as in the related work [CLZ+09, DFLS06, LYY+05], our evaluation is based on a relatively small number of defects and it is hard to draw conclusions for arbitrary defects in arbitrary programmes. Nonetheless, the defects in our evaluation serve as a benchmark. According to Zeller [Zel09], it is likely that a technique that performs better than another one on a benchmark will perform better on other programmes, too.

More concretely, our experiments presented in this chapter as well as the ones in the closely related work [CLZ+09, DFLS06, LYY+05] suffer from two issues related to the question if the results can be generalised:

• The experiments are based on artificially seeded defects. Although these defects mimic typical defects as they occur in reality, a study with real defects from an existing software project would emphasise the validity of the techniques. (This also applies to most of the techniques described in Section 3.1.2, as the evaluations rely on the Siemens Programmes [HFGO94] featuring artificial defects.)

• All experiments feature rather small programmes containing the defects (i.e., roughly ranging from 200 to 700 LOC). The programmes rarely consist of more than one class and represent situations where defects could be found relatively easily by a manual investigation as well. (This also applies to most of the techniques described in Section 3.1.2.) The approaches considered here will probably not scale without any further effort for programmes that are much larger than the programmes considered currently.

In the remainder of this dissertation we tackle these two issues. In Chapter 6, we investigate a scalable solution that builds on the techniques proposed in this chapter. We evaluate the approach with real defects from an open-source software project that is two orders of magnitude larger than the programmes in the evaluations considered so far. Furthermore, we present a constraint-based approach in Chapter 8, which leads to a better scalability of the underlying graph-mining algorithms.
So far, we have not considered multithreaded programmes in our evaluations. In Appendix A, we present and evaluate a variation of the technique presented in this chapter for the localisation of defects in such programmes.


6 Hierarchical Defect Localisation

In the previous chapter, we have presented our approach for defect localisation (Section 5.3). Despite good results, we have identified two issues of this approach, namely poor scalability with the size of the software project and a desired evaluation with real defects (see Section 5.5). Both issues also apply to the closely related work [CLZ+09, DFLS06, LYY+05]. In this chapter, we aim at generalising our approach for defect localisation to scale for larger software projects. To this end, we propose a hierarchical procedure that works with call graphs at different levels of granularity. We furthermore evaluate our new approach with real defects from a real-world software project.
We first present an introductory overview in Section 6.1. Sections 6.2 and 6.3 explain the call-graph representations we use in this chapter and defect localisation based on them, respectively. Section 6.4 contains the evaluation, and Section 6.5 is a subsumption of this chapter.

6.1 Overview

In this chapter, we aim at a scalable method for call-graph-mining-based defect localisation and at an evaluation with real defects. Solving the scalability issues is challenging, as seemingly possible solutions have issues: (1) Using increased computing capabilities or distributed algorithms is not feasible due to exploding computational costs. We have experienced this effect in preliminary experiments as well. Further, spending a lot of computing time for graph mining might be inappropriate for defect localisation. (2) Solving the scalability issue with approximate graph-mining algorithms might be a solution, but might miss patterns which are important for defect localisation. For instance, [CLZ+09] (see Section 5.2.3) does not report any problems caused by the better-scaling LEAP algorithm [YCHY08] (see Section 3.2.2), but does not analyse large programmes either.
A different starting point to deal with the scalability problem in call-graph-based defect localisation is the graph representation. In this chapter, we investigate graph representations at coarser abstractions than the method level (see Section 4.1), i.e., the package level and the class level, and we start at such a coarse abstraction before zooming-in into a suspicious region of the call graphs. These graphs are a lot smaller than conventional method-level call graphs, and they cause scalability problems in far fewer cases. However, this idea leads to new challenges:


1. Call-graph representations have not yet been studied for levels of abstraction higher than the method level. What do representations well-suited for defect localisation look like?
2. When zooming-in into defect-free regions by accident, the following question arises: How to design hierarchical defect localisation in a way that minimises the amount of source code to be inspected by humans?
3. It is unclear which defects can indeed be localised in coarse graph representations.

Our approach for hierarchical defect localisation builds on the zoom-in idea and solves these challenges. It relies on weighted call graphs, making the localisation of certain defects a lot easier. In more detail, this chapter makes the following contributions:

Granularities of Call Graphs. We define call graphs at different levels of granularity, featuring edge-weight tuples that provide further information besides the graph structure (challenge 1). We do so by taking the specifics of defect localisation into account: We explicitly consider API calls as well as inter-/intra-package and inter-/intra-class method calls.

Hierarchical Defect Localisation. We describe the zoom-in operation for call graphs, present a methodology for defect localisation for the graphs at each level and describe hierarchical procedures for defect localisation (challenge 2). In concrete terms, we present different variants of a depth-first search strategy to hierarchically mine a software project.

Evaluation with a Large Software Project. An essential part of this chapter is the evaluation featuring real programming defects in Mozilla Rhino (challenge 3). To this end we use the iBUGS repository [DZ09] and the original test suite. Rhino consists of ≈ 49k LOC, and the defects in the repository were obtained by joining information from a bug-tracking system with data and source code from a revision-control system.

Ideas related to zooming-in into call graphs, namely Graph OLAP, have been described in [CYZ+09]. The authors propose data-warehousing operations to analyse graphs, e.g., drill-down and roll-up operations, similar to our zoom-in proposal. However, [CYZ+09] does not help in defect localisation, as it aims at interactive analyses, and it does not consider specific requirements (e.g., API calls).


6.2 Dynamic Call Graphs at Different Levels

In this section, we propose and define totally-reduced call-graph representations for the method, class and package level (Sections 6.2.1–6.2.3). These representations build on the totally-reduced weighted call graphs ($R^{w}_{\mathrm{total}}$) we have introduced in Section 4.1.1. Then we introduce the zoom-in operation for call graphs (Section 6.2.4).
The call graphs introduced in this section can easily be extended in either direction: More coarse-grained meta-package-level call graphs could rely on the hierarchical organisation of the packages and would allow analysing even larger projects. Graphs more detailed than the method level, e.g., at the level of basic blocks (see Section 4.2), would allow for a finer defect localisation.
As in Chapter 4, we rely on AspectJ [KHH+01] to weave tracing functionality into Java programmes and to derive call graphs from programme executions. This yields an unreduced call-graph representation at the method level. (Figure 6.1(a) is an example.) This is the basis for all reduced representations we discuss in the following. In concrete terms, our tracing functionality internally stores unreduced call graphs in a pre-aggregated space-efficient manner. This lets us derive call-graph representations at any level of granularity.

6.2.1 Call Graphs at the Method Level

We now propose total graph reductions that are weighted, where exactly one node represents a method. Furthermore, we do not make use of any temporal information. All this leads to a compact graph representation (see Section 4.1).
As an innovation, we consider calls of methods belonging to the Java class library (API) in all graphs. We do so as we believe that some defects might affect the calls of such methods. To our knowledge, no previous study has considered such method calls. However, to keep the instrumentation overhead to a minimum, we do not consider API-internal method calls. In the graph representation, we use one node (API) to represent all methods belonging to the class library.

Notation 6.1 (Method-level call graphs, $R^{\mathrm{method}}_{\mathrm{total}}$)
In method-level call graphs, every method is represented by exactly one node, directed edges represent method invocations, and edge weights stand for the frequencies of the calls represented by the edges. The API node represents all methods of the class library and does not have any outgoing edges.

Example 6.1: Figure 6.1(b) is a method-level call graph. It is the reduced version of the graph in Figure 6.1(a). The API nodes in Figure 6.1(a) represent two API methods, a and b, represented by one node in Figure 6.1(b). Both graphs do not have any self-loops, as the corresponding programme execution does not involve any recursive method calls.
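
The reduction of Notation 6.1 can be illustrated with a minimal Python sketch. It assumes that a trace is simply a list of (caller, callee) pairs named class.method and that class-library methods carry an `API.` prefix; these conventions and the function name `reduce_method_level` are illustrative assumptions, not the tracing functionality described in this section.

```python
from collections import Counter

API = "API"  # single node standing for all class-library methods


def reduce_method_level(trace):
    """Totally reduce a trace to a weighted method-level call graph.

    `trace` is a list of (caller, callee) pairs, e.g. ("A.a", "B.c").
    Returns a Counter mapping (caller, callee) edges to call frequencies.
    """
    edges = Counter()
    for caller, callee in trace:
        if caller.startswith(API + "."):
            # API-internal calls are not considered, so the API node
            # never gets an outgoing edge.
            continue
        if callee.startswith(API + "."):
            callee = API  # collapse all API methods into one node
        edges[(caller, callee)] += 1
    return edges


# Hypothetical trace: A.a calls B.c once and B.a twice; B.a uses two API methods.
trace = [("A.a", "B.c"), ("A.a", "B.a"), ("A.a", "B.a"),
         ("B.a", "API.a"), ("B.a", "API.b")]
graph = reduce_method_level(trace)
```

In this toy trace, the two calls to different API methods end up on one edge from B.a to the API node with weight two.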

[Figure 6.1 comprises four panels: (a) unreduced, (b) method level, $R^{\mathrm{method}}_{\mathrm{total}}$, (c) class level, $R^{\mathrm{class}}_{\mathrm{total}}$, (d) package level, $R^{\mathrm{package}}_{\mathrm{total}}$.]
Figure 6.1: An unreduced call graph and its total reduced representations at the method level, class level and package level. Notation: class.method; class A forms package P1, classes B and C form package P2.


6.2.2 Call Graphs at the Class Level

We now propose class-level call graphs with tuples of weights. The rationale is to include some more information, which would otherwise be lost by the more rigorous compression.

Notation 6.2 (Class-level call graphs, $R^{\mathrm{class}}_{\mathrm{total}}$)
In class-level call graphs, every class is represented by exactly one node, and edges represent inter-class method calls (or intra-class calls in case of self-loops). The API node is as in $R^{\mathrm{method}}_{\mathrm{total}}$ graphs. An edge is annotated with a tuple of weights: (t, u, v). t refers to the total number of method calls represented by the edge (as in $R^{\mathrm{method}}_{\mathrm{total}}$ graphs), u is the number of different methods invoked, and v is the number of different methods that invoke methods.

Example 6.2: Figure 6.1(c) is a class-level call graph; it is the compression of the graphs in Figures 6.1(a) and (b). Class-level call graphs may include self-loops (except for the API node), even if there is no recursion.
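
As a hedged illustration of Notation 6.2, the following sketch computes the (t, u, v) tuples from a list of (caller, callee) pairs. The trace format and the name `reduce_class_level` are assumptions made for this example; the class name is taken to be the part before the dot, so API methods named `API.x` fall into the API node automatically.

```python
from collections import defaultdict


def reduce_class_level(trace):
    """Return {(caller_class, callee_class): (t, u, v)} for a trace of
    (caller, callee) pairs named "Class.method"."""
    calls = defaultdict(int)     # t: total number of calls on the edge
    invoked = defaultdict(set)   # u: distinct methods invoked
    invoking = defaultdict(set)  # v: distinct methods that invoke
    for caller, callee in trace:
        edge = (caller.split(".")[0], callee.split(".")[0])
        calls[edge] += 1
        invoked[edge].add(callee)
        invoking[edge].add(caller)
    return {e: (calls[e], len(invoked[e]), len(invoking[e])) for e in calls}


# Hypothetical trace: an intra-class call B.a -> B.b yields a self-loop on B.
trace = [("A.a", "B.c"), ("A.a", "B.a"), ("A.a", "B.a"), ("B.a", "B.b")]
tuples = reduce_class_level(trace)
```

The self-loop on B arises without any recursion, simply because one method of B calls another method of the same class.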

6.2.3 Call Graphs at the Package Level

The reduction for this level is analogous to the previous ones, but to capture more information, we extend the edge-weight tuples by two elements:

Notation 6.3 (Package-level call graphs, $R^{\mathrm{package}}_{\mathrm{total}}$)
In package-level call graphs, there is one node for each package, and there is an additional API node. The edge-weight tuples are as follows:

$(t_m, u_c, u_m, v_c, v_m)$

where $u_c$ is the number of different classes called, $v_c$ the number of different classes calling, and $t_m$, $u_m$, $v_m$ are as t, u, v in Notation 6.2. ('m' stands for method, 'c' for class.)

Example 6.3: We assume that class A in Figure 6.1 forms package P1, that classes B and C are in package P2, and that methods API.a and API.b belong to the same class. Figure 6.1(d) then is a package-level call graph, representing the call graphs from Figures 6.1(a)–(c).
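
The five-element tuples of Notation 6.3 can be derived in the same style. The sketch assumes a hypothetical mapping `package_of` from class names to package names (with the API node mapped to itself); neither the mapping nor the function name stems from the tooling used in this dissertation.

```python
from collections import defaultdict


def reduce_package_level(trace, package_of):
    """Return {(caller_pkg, callee_pkg): (t_m, u_c, u_m, v_c, v_m)}."""
    t_m = defaultdict(int)
    called_classes, called_methods = defaultdict(set), defaultdict(set)
    calling_classes, calling_methods = defaultdict(set), defaultdict(set)
    for caller, callee in trace:
        caller_cls, callee_cls = caller.split(".")[0], callee.split(".")[0]
        edge = (package_of[caller_cls], package_of[callee_cls])
        t_m[edge] += 1
        called_classes[edge].add(callee_cls)    # counted as u_c
        called_methods[edge].add(callee)        # counted as u_m
        calling_classes[edge].add(caller_cls)   # counted as v_c
        calling_methods[edge].add(caller)       # counted as v_m
    return {e: (t_m[e], len(called_classes[e]), len(called_methods[e]),
                len(calling_classes[e]), len(calling_methods[e]))
            for e in t_m}


# Hypothetical package layout and trace, chosen only for illustration.
package_of = {"A": "P1", "B": "P2", "C": "P2", "API": "API"}
trace = [("A.a", "B.c"), ("A.a", "B.a"), ("A.a", "B.a"),
         ("A.a", "C.a"), ("A.a", "C.b")]
tuples = reduce_package_level(trace, package_of)
```

For this toy trace, the edge from P1 to P2 carries five calls to four different methods in two classes, all issued by one method of one class.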

6.2.4 The Zoom-In Operation for Call Graphs

Before we discuss the zoom-in operation for call graphs, we first define an auxiliary function:


Definition 6.1 (The generate function)
The generate function is of the following type:

$\mathit{generate}_{\mathit{level}} : (G_{\mathrm{unreduced}}, V) \rightarrow G_{\mathit{level}}$

where $G_{\mathrm{unreduced}}$ stands for unreduced call graphs, $G_{\mathit{level}}$ for call graphs of the level specified by level ∈ {method, class, package} and V for sets of vertices. A ∈ V specifies the area to be included in the graph to be generated, by means of a set of vertices of the package level (package names) in case 'level = class' or of the class level (class names) in case 'level = method'. In case 'level = package', A = ☆ selects all packages.
From a given unreduced graph, the function generates a subgraph at the level specified, containing all nodes contained in A (all nodes if A = ☆) and edges connecting these nodes. If A ≠ ☆, the function introduces a new node labelled 'Dummy' in the subgraph generated that stands for all nodes not selected by A.

In the generate function, we treat the API nodes separately from other nodes. They do not have to be explicitly contained in A, but are contained in the resulting graphs by default, as described in Notations 6.2 and 6.3. As the generate function selects certain areas of the graph, it obviously omits other areas. This is a conscious decision, as small graphs tend to make graph mining scalable. As calls of methods in the omitted areas might indicate defects nevertheless, the generate function introduces the Dummy nodes to keep some information about these methods.
To zoom-in to a finer level of granularity, say into a certain package $p \in V(G_p)$ of a package-level call graph $G_p$ to obtain a class-level call graph $G_c$, one calls the generate function as follows: $G_c := \mathit{generate}_{\mathrm{class}}(G_u, \{p\})$, where $G_u$ is the unreduced call graph of $G_p$. Zooming from a class-level call graph to a method-level call graph is analogous.
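
To make the zoom-in step more tangible, here is a minimal sketch of a class-level generate function restricted to an area of packages. It tracks only call frequencies rather than full weight tuples, and it drops calls that stay entirely outside the selected area; both simplifications, as well as the names `generate_class` and `package_of`, are assumptions of this example and not part of Definition 6.1.

```python
from collections import Counter


def generate_class(trace, package_of, area):
    """Class-level graph restricted to the packages in `area`.

    Classes outside `area` are folded into one 'Dummy' node; the API node
    is always kept. `trace` is a list of (caller, callee) method pairs.
    """
    def node(method):
        cls = method.split(".")[0]
        if cls == "API" or package_of[cls] in area:
            return cls
        return "Dummy"

    edges = Counter()
    for caller, callee in trace:
        src, dst = node(caller), node(callee)
        if src == dst == "Dummy":
            continue  # assumption: calls fully outside the area are dropped
        edges[(src, dst)] += 1
    return edges


# Hypothetical example: zoom into package P2 of a small trace.
package_of = {"A": "P1", "B": "P2", "C": "P2"}
trace = [("A.a", "B.a"), ("B.a", "C.a"), ("B.a", "API.a"), ("A.a", "A.b")]
zoomed = generate_class(trace, package_of, {"P2"})
```

Class A lies outside the selected package, so its call into B survives only as an edge from the Dummy node.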

6.3 Hierarchical Defect Localisation

We now describe our hierarchical approach for defect localisation. At first, we introduce defect localisation without considering the hierarchical procedure, i.e., we describe how defect localisation works for call graphs at any selected level of granularity (Section 6.3.1). Note that this is a generalisation of the procedure described in Section 5.3.1. We then present different approaches for turning this technique into a hierarchical procedure (Section 6.3.2), which are further generalisations.

6.3.1 Defect Localisation in General

We now discuss defect localisation with call graphs at arbitrary levels of granularity. This is in principle a synopsis of our approach in Section 5.3.1 with generalisations for arbitrary levels of abstraction and with adjustments for the graphs introduced in Section 6.2. After a short overview of the approach, we describe subgraph mining and defect localisation based on edge-weight tuples. Finally, we discuss the incorporation of information from static source-code analysis.

Overview

Algorithm 6.1 works with unreduced call graphs U (traces), representing programme executions. More specifically, it deals with graphs at a user-defined level, describing a certain subgraph of the graphs (parameter A). For the time being, we consider the package level (A = ☆), i.e., without restricting the area. The algorithm first assigns a class ∈ {correct, failing} to every graph u ∈ U (Line 3), using a test oracle. Such oracles are typically available [JH05]. Then the procedure generates reduced call graphs, from every graph u (Line 4). Next, the procedure derives frequent subgraphs of these graphs, which provide different contexts (Line 6). The last step calculates a likelihood of containing a defect, for every software entity e at the level specified (i.e., a package, class or a method; Line 7). We do so by deriving a discriminativeness measure for the edge-weight-tuple values, in each context separately. The P values for all entities of a certain level form a ranking of the entities, which can be given to software developers. They would then review the suspicious entities manually, starting with the one which is most likely to be defective. Alternatively, this result can be the basis for a zoom-in into a finer level of granularity, as described in Section 6.3.2.

Algorithm 6.1 Procedure of defect localisation.
Input: a set of unreduced call graphs U, a level ∈ {package, class, method}, an area A
Output: a ranking based on each software entity's likelihood to be defective P(e)
1: G = ∅ // initialise a set of call graphs
2: for all graphs u ∈ U do
3:   check if u refers to a correct execution, and assign a class ∈ {correct, failing} to u
4:   G = G ∪ {generate_level(u, A)}
5: end for
6: SG = frequent_subgraph_mining(G)
7: calculate P(e) for all software entities e at the level specified, based on SG

In this chapter focussing on hierarchical mining, we do not rely on any structural scores nor combinations as we have done in Section 5.3.2. We do so as preliminary experiments have revealed that structural scores do not work so well with totally reduced graphs from the particular software project used in the evaluation of this chapter. This is as the call graphs from several executions of the same programme tend to frequently have the same topology. Compared to the graphs we have used before (e.g., in the $R_{\mathrm{subtree}}$ representation), the graphs we use now are less interesting from a structural point of view, but encode relevant information in the edge-weight tuples.

Subgraph Mining

As in our approach in Section 5.3.1, the frequent-subgraph-mining step (Line 6 in Algorithm 6.1) mines the pure graph structure and ignores the edge-weight tuples for the moment. Later steps will make use of them. As before, we use the subgraphs obtained as different contexts and perform all further analyses for every subgraph context separately.
For subgraph mining, we rely on the ParSeMiS implementation [PWDW09] of the CloseGraph algorithm [YH03], which we have already used in Section 5.3.1. We now use a minimum support value of $\min(|G_{\mathrm{corr}}|, |G_{\mathrm{fail}}|)/2$, where $G_{\mathrm{corr}}$ and $G_{\mathrm{fail}}$ are the sets of call graphs of correct and failing executions, respectively ($G = G_{\mathrm{corr}} \cup G_{\mathrm{fail}}$). This ensures that no structure occurring in at least half of all executions belonging to the smaller class is missed. Preliminary experiments have shown that this minimum support allows for both short runtimes and good results.
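
The minimum support can be worked through for a small, made-up class distribution. The helper `minimum_support` and the numbers are purely illustrative.

```python
def minimum_support(classes):
    """Absolute minimum support: min(|G_corr|, |G_fail|) / 2."""
    n_corr = sum(1 for c in classes if c == "correct")
    n_fail = sum(1 for c in classes if c == "failing")
    return min(n_corr, n_fail) / 2


# Hypothetical data set: 90 correct and 10 failing executions.
classes = ["correct"] * 90 + ["failing"] * 10
support = minimum_support(classes)
```

With 90 correct and 10 failing executions, the threshold is five graphs, so any structure shared by at least half of the failing executions is frequent even if it never occurs in a correct one.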
The API and Dummy nodes as well as self-loops (⤾) require a special treatment during subgraph mining:

• API nodes: As almost all methods call API methods, almost every node in a call graph has a connection to the API node. This increases the number of edges in a call graph significantly, compared to a graph without API nodes, possibly leading to scalability issues. At the same time, as almost every node has an edge to an API node, these edges usually are not interesting for defect localisation. We therefore omit these edges during graph mining, but keep the edge-weight tuples for the subsequent analysis step. That is, only nodes and edges drawn with solid lines in Figure 6.1 are considered.

• Dummy nodes: We treat Dummy nodes in the same way as we treat API nodes, as their structural analysis with subgraph mining does not seem to be promising. Dummy nodes tend to be connected to many other nodes as well, leading to unnecessarily large graphs.
• Self-loops (⤾): Such edges result from recursion at the method level. However, at the package and class level, a self-loop represents calls within the same entity, which happens frequently. Therefore, self-loops enlarge the graph significantly while not bearing much information. We therefore treat self-loops at the package and class level as API and Dummy nodes: We omit them during graph mining and keep the edge-weight tuples for subsequent analysis.

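      | sg1                                                   | sg2
exec. | A→B   | A→C   | A⤾    | B⤾    | C⤾    | A→API | C→API | B→C   | (⤾, API) | ⋯ | class
      | t u v | t u v | t u v | t u v | t u v | t u v | t u v | t u v | ⋯        |   |
g1    | 3 2 1 | 2 2 2 | 1 1 1 | 1 1 1 | 1 1 1 | 1 1 1 | 4 2 2 | 1 1 1 | ⋯        | ⋯ | correct
⋮     | ⋮     | ⋮     | ⋮     | ⋮     | ⋮     | ⋮     | ⋮     | ⋮     | ⋱        | ⋱ | ⋮
gn    | 9 1 1 | 2 2 2 | 1 1 1 | 1 1 1 | 1 1 1 | 1 1 1 | 4 2 2 | - - - | ⋯        | ⋯ | failing
Table 6.1: Example feature table for class-level call graphs.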

Edge-Weight-Based Defect Localisation

When graph mining is completed, we calculate the likelihood that a method contains a defect (Line 7 in Algorithm 6.1). This is analogous to our approach in Section 5.3.1. Note that the description of an edge is now easier (i.e., without node ids), as we deal with totally reduced graphs where node names are unique. This also leads to the effect that each subgraph has maximally one embedding in a graph. We therefore do not have to use any average values. Concretely, we assemble a feature table as follows:

Notation 6.4 (Feature tables for defect localisation at arbitrary levels of abstraction)
Our feature tables have the following structure: The rows stand for all programme executions, represented by their call graphs. For every edge in every frequent subgraph, there is one column for every edge-weight-tuple element (i.e., a single call frequency t or tuples of values, depending on the granularity level of the call graph considered, see Section 6.2). For all edges leading to API and Dummy nodes as well as for all self-loops (⤾), there are further columns for the edge-weight-tuple elements; again, for each subgraph separately. The table cells contain the edge-weight-tuple values, except for the very last column, which contains the class ∈ {correct, failing}. If a subgraph is not contained in a call graph, the corresponding cells have a null value ('-').

We do not include Dummy nodes in the tables when considering the method level, as preliminary experiments have shown that this does not lead to any benefit. However, we include API nodes and self-loops at all levels.
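
A simplified sketch of how one row of such a feature table could be assembled for a class-level graph follows. It assumes that a subgraph is given as a list of edges and is contained in a graph if all of its edges are present (which suffices here because node names are unique); the additional columns for API edges, Dummy edges and self-loops are left out for brevity, and all names are hypothetical.

```python
def feature_row(graph, subgraphs):
    """One feature-table row for a class-level call graph.

    `graph` maps edges (src, dst) to (t, u, v) tuples; `subgraphs` maps a
    subgraph name to the list of edges of that frequent subgraph.
    Cells of subgraphs not contained in the graph are None (the '-' value).
    """
    row = {}
    for name, edges in subgraphs.items():
        contained = all(edge in graph for edge in edges)
        for edge in edges:
            values = graph[edge] if contained else (None, None, None)
            for element, value in zip("tuv", values):
                row[(name, edge, element)] = value
    return row


# Hypothetical graph that contains sg1 but not sg2.
subgraphs = {"sg1": [("A", "B"), ("A", "C")], "sg2": [("B", "C")]}
g_n = {("A", "B"): (9, 1, 1), ("A", "C"): (2, 2, 2)}
row = feature_row(g_n, subgraphs)
```

Each row would be completed with the class of the execution before the table is handed to the feature-selection step.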

Example 6.4: Table 6.1 is a feature table corresponding to class-level call graphs, such as the one in Figure 6.1(c). (This graph is execution g1 in the table.) Suppose that the preceding graph-mining step has found two subgraphs, sg1 (B ← A → C) and sg2 (B → C). The very first column lists the call graphs g ∈ G. The next column corresponds to sg1 and edge A → B with the total call frequency t. The following two columns correspond to the remaining two edge-weight tuple elements u and v (see Notation 6.2). Then follows the second edge in the same subgraph (A → C) with its edge-weight tuple (t, u, v). Next, all self-loops (A⤾, B⤾, C⤾) and API calls (A → API, C → API) in sg1 are listed. (Dummy nodes would be listed here as well, but do not exist in this example.) The same columns for subgraph sg2 and finally the class of the execution follow. Graph gn does not contain sg2, which is indicated by '-'.

After assembling the feature table, we employ the information-gain feature-selection algorithm (InfoGain, see Definition 2.7) in its Weka implementation [HFH+09] to calculate the discriminativeness of the columns and thus of the different edge-weight-tuple values. This is again analogous to our approach in Section 5.3.1.
So far, we have derived defect likelihoods for every column in the table. However, we are interested in likelihoods for software entities (i.e., packages, classes or methods), and every software entity corresponds to more than one column in general. To obtain the defect likelihood P(e) of software entity e, we assign every column to the calling software entity. We then calculate P(e) as the maximum of the InfoGain values of the columns assigned to e. By doing so, we identify the defect likelihood of a software entity by its most suspicious invocation. The call context of a likely defective software entity and suspicious columns are supplementary information which we report to software developers to ease debugging.

Example 6.5: The graphs g1 (see Figure 6.1) and gn in Table 6.1 display similar values, but refer to a correct and a failing execution. Suppose that method A.a contains a defect with the implications that (1) method B.c will not be called at all, and (2) that method B.a will be called nine times instead of twice. This is reflected in columns 2–4, referring to (t, u, v) of A → B in sg1. t increases from three (1×B.c, 2×B.a) to nine (9×B.a), u decreases from two (B.c, B.a) to one (B.a), and v stays the same, as only method a invokes other methods in class A. The InfoGain measure will recognise fluctuating values of t and u, leading to a high ranking of class A.
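
The scoring step can be sketched as follows. The sketch computes the textbook information gain of a column by treating every distinct cell value as a category, and then takes the maximum per calling entity. It is not the Weka implementation used in this dissertation, which may handle numeric columns differently; the column names and values are invented for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def info_gain(column, labels):
    """Information gain of a column whose values are treated as categories."""
    by_value = defaultdict(list)
    for value, label in zip(column, labels):
        by_value[value].append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in by_value.values())
    return entropy(labels) - remainder


def entity_likelihoods(columns, labels):
    """P(e) = maximum InfoGain over the columns assigned to calling entity e.

    `columns` maps (calling_entity, column_name) to a list of cell values.
    """
    scores = {}
    for (entity, _name), values in columns.items():
        scores[entity] = max(scores.get(entity, 0.0), info_gain(values, labels))
    return scores


# Hypothetical feature table with four executions.
labels = ["correct", "correct", "failing", "failing"]
columns = {
    ("A", "sg1:A->B:t"): [3, 3, 9, 9],  # separates the two classes
    ("A", "sg1:A->B:v"): [1, 1, 1, 1],  # constant, carries no information
    ("B", "sg2:B->C:t"): [1, 2, 1, 2],  # unrelated to the class
}
P = entity_likelihoods(columns, labels)
```

Class A is ranked first here because one of its outgoing columns separates correct from failing executions, which mirrors the effect described in Example 6.5.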

Incorporation of Static Information

The edge-weight and InfoGain-based ranking procedure sometimes has the minor drawback that two or more entities (i.e., packages, classes or methods) have the same ranking position. In such cases, we fall back to a second ranking criterion: We sort such entities decreasingly by their size in (normalised) lines of code (LOC) derived with LOCC [Joh00]. The rationale is that the size frequently correlates with the defectiveness likelihood [NBZ06] (see Section 3.1.1). That is, large methods tend to be more defective.
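
The two-stage ordering can be expressed as a single sort with a compound key. The function name and the numbers are assumptions chosen for illustration.

```python
def rank_entities(likelihood, loc):
    """Order entities by P(e) descending; ties are broken by LOC descending."""
    return sorted(likelihood, key=lambda e: (-likelihood[e], -loc[e]))


# Hypothetical scores: A.a and B.a tie, so the larger method B.a comes first.
likelihood = {"A.a": 0.8, "B.a": 0.8, "C.c": 0.3}
loc = {"A.a": 12, "B.a": 40, "C.c": 95}
ranking = rank_entities(likelihood, loc)
```

Note that size only decides between entities with equal likelihood; the large but unsuspicious method C.c stays at the end.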

6.3.2 Hierarchical Procedures

The defect-localisation procedure described in Section 6.3.1 can already guide a manual debugging process: A developer can first do defect localisation at the package level. She or he can then decide to zoom-in into certain suspicious packages. The developer would continue with our defect-localisation technique at the class level, proceeding with the method level etc. However, it might happen that the developer zooms-in into an area where no defect is located. In this case, the developer would backtrack and zoom-in elsewhere etc. This manual process, guided by our technique, bears the potential that important background knowledge known to the developer can easily be included.
In this section, we describe how to turn the manually-guided debugging process into semi-automatic procedures for defect localisation. We present a depth-first-search-based (DFS-based) procedure, a so-called merge-based variant and a parameter-free variant. We also propose a technique that partitions large packages and classes.

DFS-Based Defect Localisation

Our DFS-based procedure follows the idea to manually investigate the most suspicious method in the most suspicious class in the most suspicious package first. If this first method turns out to not be defective, we go to the second most suspicious method in the same class. If all methods in this class are investigated, we backtrack to the next class etc. We further propose the parameters k, l, m. They limit the number of software entities to be investigated at each stage, to k packages, l classes and m methods. Algorithm 6.2 formalises this approach. The parameters k, l, m can be set to infinity in order to obtain a parameter-free algorithm; at the end of this section we also present a means to set these parameters.
Algorithm 6.2 iterates through three loops, one for packages, one for classes and one for methods (Lines 3, 6 and 9). In each loop, the algorithm calculates a defectiveness likelihood P for the respective software entities. That is, Lines 1–2, 4–5 and 7–8 comprise the graph-mining step (Line 6 in Algorithm 6.1) and the step that calculates P (Line 7 in Algorithm 6.1), as described in Section 6.3.1. These lines make use of the generate function (see Definition 6.1), each with the area selection based on the currently selected software entity at the respective coarser level. Ultimately, the algorithm presents suspected methods to the user and terminates in case the user has identified a defect (Lines 10–12).
The DFS-based procedure described works interactively. That is, the potentially expensive graph-mining step as well as the calculation of P are done only when needed, and the algorithm might terminate before all packages and classes have been analysed. The suspected methods are presented to the user in an on-line manner. This avoids long runtimes before a developer actually can start debugging. However, it is of course possible to skip Lines 10–12 in Algorithm 6.2 and to save the current method to an ordered list of suspected methods. This leads to a ranking as described in Section 6.3.1. To ease experiments, we follow this approach in our evaluation.
The proposed approach obviously has the drawback that the user has to set the parameters k, l, m. When the values are too low, the technique might miss a defect. Based on our experience, it is not hard to set appropriate parameters based on empirical values derived from debugging other defects in the same project. Furthermore, we will present an automated choice of optimal parameter values in the following paragraphs.

Algorithm 6.2 DFS-Based Defect Localisation.
Input: a set of classified (correct, failing) unreduced call graphs U, parameters k, l, m
Output: a defective method
1: SG = frequent_subgraph_mining({generate_package(u, ☆) | u ∈ U})
2: calculate P(package), based on SG
3: for all package ∈ top_k(P(package)), ordered decreasingly by P(package) do
4:   SG = frequent_subgraph_mining({generate_class(u, {package}) | u ∈ U})
5:   calculate P(class), based on SG
6:   for all class ∈ top_l(P(class)), ordered decreasingly by P(class) do
7:     SG = frequent_subgraph_mining({generate_method(u, {class}) | u ∈ U})
8:     calculate P(method), based on SG
9:     for all method ∈ top_m(P(method)), ordered decreasingly by P(method) do
10:      present method to the user
11:      if method is defective then
12:        return method
13:      end if
14:    end for
15:  end for
16: end for
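
The control flow of Algorithm 6.2 can be sketched in Python under two assumptions: a hypothetical callable `rank(level, area)` stands for the mining and scoring steps and returns the entities of the zoomed-in graph ordered decreasingly by P, and a callable `is_defective(method)` stands for the developer's judgement. Passing `None` for k, l or m removes the respective limit.

```python
from itertools import islice


def dfs_localise(rank, is_defective, k, l, m):
    """Depth-first search over packages, classes and methods.

    rank(level, area) is only called when the search actually reaches
    `area`, so expensive mining steps are performed on demand.
    """
    for package in islice(rank("package", None), k):
        for cls in islice(rank("class", package), l):
            for method in islice(rank("method", cls), m):
                if is_defective(method):
                    return method
    return None  # no defect found within the limits k, l, m


# Hypothetical precomputed rankings standing in for mining and scoring.
rankings = {
    ("package", None): ["P2", "P1"],
    ("class", "P2"): ["C", "B"],
    ("class", "P1"): ["A"],
    ("method", "C"): ["C.a"],
    ("method", "B"): ["B.b", "B.a"],
    ("method", "A"): ["A.a"],
}
inspected = []


def developer(method):
    inspected.append(method)
    return method == "B.a"


found = dfs_localise(lambda level, area: rankings[(level, area)],
                     developer, k=2, l=2, m=2)
```

In this toy run, the search backtracks from class C to class B and stops after three inspected methods, without ever analysing package P1.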

Merge-Based Variant of DFS-Based Defect Localisation

This technique is an alternative to the DFS-based one. Instead of presenting the results to the user in an on-line manner, it replaces Lines 10–12 in Algorithm 6.2 with code that saves all methods processed (along with their likelihood P) in a result set. Then, right after Line 12 in Algorithm 6.2, it sorts all methods decreasingly by their defect likelihood.
The drawback of this procedure is that the algorithm has to terminate before one can actually start debugging. On the other hand, we hypothesise that the defect localisations obtained by this merge-based variant are better than the ones with the first approach. We evaluate this hypothesis in Section 6.4.
Concerning the parameters k, l, m, the merge-based variant is more robust. As the merged result set is sorted at the very end, large parameter values usually do not lead to worse localisation results. They only affect the runtime.
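
A minimal sketch of the merge-based variant follows, again assuming a hypothetical callable `rank(level, area)` that stands for mining and scoring and here returns (entity, P) pairs ordered decreasingly by P. With `None` as a limit, the respective parameter is unbounded, which corresponds to setting it to infinity.

```python
from itertools import islice


def merge_localise(rank, k, l, m):
    """Collect all methods reached within the limits, then sort once by P."""
    results = {}
    for package, _ in islice(rank("package", None), k):
        for cls, _ in islice(rank("class", package), l):
            for method, p in islice(rank("method", cls), m):
                results[method] = max(p, results.get(method, 0.0))
    return sorted(results.items(), key=lambda item: -item[1])


# Hypothetical rankings: the highest-scoring method sits in the
# second-ranked package.
rankings = {
    ("package", None): [("P2", 0.9), ("P1", 0.4)],
    ("class", "P2"): [("B", 0.7)],
    ("class", "P1"): [("A", 0.6)],
    ("method", "B"): [("B.a", 0.5)],
    ("method", "A"): [("A.a", 0.8)],
}
merged = merge_localise(lambda level, area: rankings[(level, area)],
                        None, None, None)
```

In this toy example, method A.a ends up first in the merged ranking although its package is ranked second, which a purely depth-first presentation would only reach after backtracking.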

Parameter-Free Variant of Merge-Based Defect Localisation

As yet another variant, we propose parameter-free defect localisation. Here we set the parameters k, l, m in the merge-based variant to infinity. This promises to not miss any defective method. In addition, if one uses this variant several times with a certain software project, one can use it to empirically set the parameter values. This allows for an efficient usage of the interactive (on-line) DFS-based procedure without parameters that are too high or to speed up the regular merge-based variant.

Partitioning Approach

The hierarchical procedures investigated in this dissertation analyse small zoomed-in call graphs at several granularities. However, a number of software projects, especially large ones and those with a long history, have imbalanced sizes of packages and classes. This might lead to large graphs that cause scalability issues, even if we are considering a zoomed-in subgraph only. It is an open research question how to overcome such situations. For now, we present a sampling-based partitioning approach for such cases.
Whenever a certain call graph at the package or class level is too large to be handled, we partition the graph into two (or, if needed, more) partitions. We do so by randomly sampling nodes from the graph. We keep the edges connecting two nodes within the same partition. As not all edges connect nodes belonging to the same partition, we would lose a lot of information. To compensate for this effect, we introduce a dummy node $\mathit{Dummy}_{\mathit{part}}$ in each partition, representing all nodes in other partitions. We treat $\mathit{Dummy}_{\mathit{part}}$ nodes in exactly the same way as Dummy nodes, i.e., we omit them during graph mining and include the edge-weight-tuple values in the feature tables.
When graph partitions are generated and $\mathit{Dummy}_{\mathit{part}}$ nodes are inserted, we do defect localisation as described before with each partition separately. Then, similarly to the merge-based variant, we merge the rankings obtained from the different partitions and obtain a defectiveness ranking ordered by the P values of the software entities. This lets us proceed with any manual or automated hierarchical defect localisation procedure, as described before.
This partitioning approach for large packages and classes has worked well in preliminary experiments. However, there might be cases where a loss of relevant information exists, and defect localisation might not work. For instance, think of a defect which occurs in a certain subgraph context that is distributed over several partitions. In such situations, the defect-localisation procedure can be repeated with a different partitioning, either based on the expertise of a software developer or by using another seed for random partitioning.
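
One way to realise such a random partitioning is sketched here. It works on a graph given as a mapping from edges to call frequencies (full weight tuples are omitted), assigns shuffled nodes to partitions in round-robin fashion and redirects boundary-crossing edges to a `Dummy_part` node. The function name, the round-robin assignment and the uniform treatment of all nodes are assumptions of this example.

```python
import random
from collections import Counter


def partition_graph(edges, n_parts=2, seed=0):
    """Randomly split the nodes of a weighted graph into `n_parts` partitions.

    `edges` maps (source, target) to a call frequency. Edges that cross a
    partition boundary are redirected to that partition's 'Dummy_part' node.
    """
    nodes = sorted({n for edge in edges for n in edge})
    rng = random.Random(seed)  # another seed yields another partitioning
    rng.shuffle(nodes)
    part_of = {node: i % n_parts for i, node in enumerate(nodes)}

    parts = [Counter() for _ in range(n_parts)]
    for (src, dst), weight in edges.items():
        if part_of[src] == part_of[dst]:
            parts[part_of[src]][(src, dst)] += weight
        else:
            parts[part_of[src]][(src, "Dummy_part")] += weight
            parts[part_of[dst]][("Dummy_part", dst)] += weight
    return part_of, parts


# Hypothetical class-level graph with four nodes.
edges = {("A", "B"): 3, ("B", "C"): 1, ("C", "D"): 2, ("D", "A"): 4}
part_of, parts = partition_graph(edges, n_parts=2, seed=1)
```

Rerunning the function with a different `seed` gives the alternative partitioning mentioned for cases in which a relevant context is split across partitions.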

6.4 Evaluation with Real Software Defects

We now evaluate our defect-localisation techniques in order to demonstrate their effectiveness and usefulness for large software projects. After a description of the target programme and the defects (Section 6.4.1) we explain the evaluation measures used (Section 6.4.2). Then we focus on defect localisation at the different levels in isolation (Section 6.4.3). Finally, we evaluate the hierarchical defect-localisation approaches (Section 6.4.4).

6.4.1 Target Programme and Defects: Mozilla Rhino

For our evaluation we rely on Mozilla Rhino, as published in the iBUGS project [DZ09]. Rhino is an open-source JavaScript interpreter, consisting of nine packages, 146 classes and 1,561 methods or ≈ 49k LOC (normalised 37k LOC). iBUGS provides a number of original defects that were obtained by joining information from the bug-tracking system of the project with data and source code from its revision-control system. Furthermore, it contains the original test cases along with the test oracles. See [DZ07] for details on how the data was obtained. All in all, Rhino from the iBUGS repository provides a realistic test scenario for defect localisation in a large software project. At least compared to programmes used in related evaluations [CLZ+09, DFLS06, LYY+05] and in Section 5.4 of this dissertation that are two orders of magnitude smaller, Rhino can be considered to be a relatively large software project.

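level/defect | 85880 | 114491 | 114493 | 137181 | 157509 | 159334 | 177314 | 179068 | 181654 | 181834 | 184107 | 185165 | 191668 | 194364 | ∅
package      | 2     | 1      | 1      | 4      | 2      | 1      | 1      | 1      | 3      | 1      | 3      | 1      | 3      | 2      | 1.9
class        | 18    | 8      | 16     | 20     | 1      | 1      | 2      | 5      | 5      | 1      | 1      | 2      | 3      | 3      | 6.1
method       | 1     | 2      | 2      | -      | 1      | 1      | 10     | 1      | 3      | 3      | 9      | 10     | 1      | 2      | 3.5
Table 6.2: Defect-ranking positions for the three levels separately.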

Concretely, we make use of 14 defects (Table 6.2 lists the defect numbers) from the iBUGS Rhino repository which have associated test cases and represent occasional bugs. These defects represent different real programming errors, and they are hard to localise: They occur occasionally and have been checked-in into the revision-control system before a failing behaviour has been discovered. See the iBUGS repository [DZ09] for more details. In addition, iBUGS provides about 1,200 test cases consisting of some JavaScript code to be executed by Rhino, together with the corresponding oracles. As in many software projects, there are only a few failing test cases for each defect, besides many passing cases. To obtain a sufficient number of failing cases, we have generated new ones by varying existing ones. In concrete terms, we have merged JavaScript code from correct and failing test cases.

6.4.2 Evaluation Measures

In order to assess the precision of our techniques, we consider the ranking positions of the actual defects. These positions quantify the number of software entities (i.e., packages, classes and methods) a software developer has to investigate in order to find the defect. As the sizes of methods can vary significantly, we deem it more adequate to assess the hierarchical approaches by considering the normalised LOC rather than only the number of methods involved. We therefore provide the percentage of LOC to examine in addition to the ranking position. We calculate the percentage as the ratio of methods that have to be examined in the software project, i.e., the sum of LOC of all methods with a ranking position smaller than or equal to the position reported, divided by the total LOC.
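
The measure can be written down directly. The helper name and the numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def percent_loc_to_examine(ranking, loc, position, total_loc):
    """Share of code inspected when reading the top `position` methods.

    `ranking` lists methods from most to least suspicious, `loc` maps each
    method to its (normalised) lines of code.
    """
    inspected = sum(loc[method] for method in ranking[:position])
    return 100.0 * inspected / total_loc


# Hypothetical ranking in a project of 1,000 normalised LOC.
ranking = ["B.a", "A.a", "C.c"]
loc = {"B.a": 40, "A.a": 10, "C.c": 50}
share = percent_loc_to_examine(ranking, loc, position=2, total_loc=1000)
```

If the defective method is at position two, the developer reads 40 + 10 = 50 of 1,000 lines, i.e., 5% of the code.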

6.4.3 Experimental Results (Different Levels)

We now present the defect-localisation results for the three different levels. That is, we consider the complete package-level call graphs and the call graphs at the class and method level, zoomed-in into the correct package (and class). We do so in order to assess the defect-localisation abilities for every level in isolation.
Table 6.2 contains the experimental results, the ranking positions for all defects investigated, separately for the three levels. Figure 6.2 provides a graphical representation of the same data. It plots the number of defects localised when a developer examines a certain number of the top-ranked entities. For example, the third triangular point from the left means that 10 out of 14 defects are localised when examining up to three methods.

[Figure 6.2 plots the number of defects localised (y-axis) against the number of packages/classes/methods to examine (x-axis), with one series each for package, class and method.]
Figure 6.2: The numbers of defects localised when examining a certain number of packages/classes/methods.

At the package level, the defective package is ranked at position one or two in 10 out of 14 cases, i.e., localisation is precise. The explanation for such good results at the coarsest level is the small number of nine packages in Rhino. At the class level, the results look a little worse at first sight. However, eight defects can be localised when examining three classes or less (out of 146). Only three defects are hard to localise, i.e., a developer has to inspect 15 or more classes. At the method level, 13 of the defects can be localised by examining 10 methods or less (out of 1,561), 10 of them with three methods or less. Only one defect, number 137181, cannot be localised at all. This defect does not affect the call-graph structure, nor the call frequencies.
All in all, the call-graph representations at the different levels, as well as the localisation technique, localise most defects with a high precision. However, when using package-level call graphs to manually zoom-in into a package, packages ranked at position three or four might be misleading. This is not unexpected, as it is well known that many defects have effects only in their close neighbourhood [DZ07]. This might not affect a package-level call graph at all. The hierarchical approaches, in particular the merge-based ones, try to overcome this effect by investigating several packages systematically.
We use the results from this section to set the parameters k, l, m for the hierarchical approaches. The maximum localisation precision in Figure 6.2 is reached at four packages, 20 classes or 10 methods. When using these values as parameters, the hierarchical approaches do not miss any defects they could actually localise while avoiding the examination of more source code than necessary.

1008060delisalocstecfedfo%E3parameterfree
40E1DFSbased
ergebasedmE220010098969492908886848280
%ofsourcecodethatneednotbeexamined
Figure6.3:Thepercentageofdefectslocalisedwhennotexaminingacertainper-
code.sourceofcentage

6.4.4 Experimental Results (Hierarchical)
We now present the results from three experiments with the different hierarchical approaches (see Section 6.3.2):

E1 DFS-based defect localisation
E2 Merge-based variant of DFS-based defect localisation
E3 Parameter-free variant of merge-based defect localisation

Table 6.3 contains the numerical results in two variants: the ranking positions at the method level and the corresponding percentage of source code. As before, Figure 6.3 is a graphical representation of this data. Similarly to related work (e.g., [JH05, LFY+06]), it represents the percentage of defects localised versus the percentage of source code that does not need to be examined.
In line with our hypothesis (see Section 6.3.2), the merge-based variant (E2) performs better than the pure DFS-based approach (E1) in all but four data points in Figure 6.3. The average values in Table 6.3 reflect this as well. With the merge-based variant (E2), one finds a defect by examining 6.1% of the source code on average. Not surprisingly, parameter-free defect localisation (E3) always performs worse than or equal to the parameterised variant (E2). However, it still allows a developer to find defects by inspecting 7.5% of the source code on average, without having to set any parameters.
Focusing on the best approach, the merge-based variant (E2), two defects are pinpointed directly, and six defects can be localised by investigating fewer than 10 methods. Only one defect cannot be localised at all (as before), and for only two defects 100 or more methods need to be inspected. All in all, we deem these results very helpful: On average, almost 94% of the source code can be excluded from manual debugging, and to find 86% of all defects, one can skip 89% of the code.


Table 6.3: Hierarchical defect-localisation results. E1: DFS-based defect localisation; E2: merge-based variant thereof; E3: parameter-free defect localisation. Top: method-ranking position; bottom: LOC to examine.

E3E2E1E3E2E1xp.e
/defect1.2%8.2%6.7%54523858804.5%3.4%2.6%1812911449110.3%5.4%5.4%5631144933

------13718110.5%10.3%20.6%17049461575092.6%2.6%2.6%1593341131.3%1.3%1.6%1773141158.1%7.5%9.8%7768120
1790681.9%0.3%5.1%146418165494.5%4.4%4.4%18183453317.8%10.4%7.5%1551005418410715.2%20.1%3.1%21015429185165

5.9%5.9%11.6%88551916681.5%6.7%8.3%7.5%6.1%6.4%48.23137.12545.115∅194364

6.5 Subsumption

In this chapter, we have brought forward call-graph-mining-based defect localisation (see Chapter 5) to a hierarchical and scalable procedure. Our evaluation has shown that it is able to localise defects from the field in a relatively large software project, Mozilla Rhino. The result from our experiments is that the amount of source code a developer has to examine manually can be reduced to about 6% on average. This shows that our call-graph-based approach is able to detect real defects from the field. Furthermore, the results show that we are able to reduce the source code to be investigated significantly. However, 6% in Rhino still refers to ≈ 3,000 LOC. When applied in the field, we expect that the domain knowledge of a software developer can further reduce the amount of code to be investigated. For instance, a developer might be able to exclude certain packages from inspection as she or he knows that the code is not related to the kind of failure.
In Section 5.4.3, we have compared our basic approach using a small programme to both related approaches/concepts that rely on call-graph mining [CLZ+09, DFLS06, LYY+05] and well-known and proven approaches from the software-engineering community [AZGvG09, JHS02, LFY+06]. The experiments led to the conclusion that our approach performs well compared to the other approaches. It would certainly be interesting to compare the performance of our hierarchical approach from this chapter to alternative approaches, too. This could be done within a more comprehensive evaluation of defect-localisation techniques with software repositories from large projects (see Chapter 9). However, regarding the related work based on call-graph mining, such a comparison would not be possible due to scalability problems. This would at least not be possible as long as one does not extend these approaches with a hierarchical procedure similar to the one proposed in this chapter. Regarding the defect-localisation techniques from software engineering, a comparison would be difficult, as no complete implementations are available (see Section 5.4.3). Furthermore, at least for the SOBER method [LFY+06], it is unclear if it would scale for software projects of the size of Rhino. Predicate-based instrumentation is expensive in terms of runtime, and we are not aware of any evaluations of SOBER featuring programmes of this size.
As Rhino was released as a benchmark for defect-localisation tools within the iBUGS suite [DZ09], we expect that more and more evaluations in the future will be based on Rhino and can be compared to our evaluation. So far we are only aware of one study featuring the Rhino dataset: The approach based on graphical models from Dietz et al. [DDZS09] (see Section 3.1.2) has used the same benchmark, but in an earlier version. However, this approach is rather unknown compared to defect-localisation techniques such as Tarantula and SOBER. Furthermore, as mentioned in Section 3.1.2, the results can hardly be compared to ours: The evaluation by the authors covers only situations where one considers up to 1% of the source code in


order to find a defect. Besides that, the published results suggest that their approach might be better than our approach in this particular situation.
In the following, we aim at improving the defect-localisation precision further. In Chapter 7, we develop a technique that is able to localise an additional class of defects, namely those that affect the dataflow of a programme. This also helps in improving the defect-localisation precision of defects that can already be localised.


7 Localisation of Dataflow-Affecting Bugs

An important characteristic of the call-graph-based defect-localisation techniques discussed so far (both from the related work and introduced in this dissertation) is that they merely analyse the call-graph structure and the call frequencies. They can only localise defects which affect the call graph of a programme execution (simplified, the control flow). While this is an important class of defects, Cheng et al. [CLZ+09] point out that the current techniques are agnostic regarding defects that influence the dataflow. In this chapter, we present a technique to localise dataflow-affecting bugs by extending call graphs with information regarding the dataflow. For the graph representation and the localisation technique we build on concepts from the preceding chapters.
We first present an introductory overview in Section 7.1. Sections 7.2 and 7.3 then introduce dataflow-enabled call graphs (DEC graphs) and explain how we use them for defect localisation. Section 7.4 contains the experimental evaluation. Section 7.5 is a subsumption of this chapter.

7.1 Overview
In this chapter, we present a call-graph-based technique which localises both dataflow-affecting and call-graph-affecting bugs. Dataflow-affecting bugs influence the data exchanged between methods. For example, think of a method which wrongly calculates some value, and which needs to be localised. A call-graph-based technique can only recognise such a defect if the infected value affects a control statement. Although this happens frequently, it might occur in methods which are actually defect-free, leading to erroneous localisations. In such cases, the incorporation of dataflow-related information into the call graphs and thus the analysis process can increase the localisation precision. In other cases, where defects affect the dataflow only, the incorporation of dataflow information is the sole possibility to capture such defects.
The specification of graphs that incorporate dataflow-related information is not obvious: On the one hand, a call graph is a compact representation of an execution. On the other hand, dataflow-related information refers to values of many method calls within one execution. This information needs to be available at a level of detail which allows defects to be located. To illustrate the difficulties, an edge in a call graph typically represents thousands to millions of method calls. Annotating each edge with the method-call parameters and method-return values of all invocations corresponding to it incurs huge annotations and is not practical. In this chapter, we propose dataflow-enabled call graphs (DEC graphs) which incorporate concise numeric dataflow information.

[Figure 7.1: two call graphs over the methods void main(), void a(), void b() and int c(int p1, int p2). (a) Call graph with call frequencies (not dataflow enabled). (b) Dataflow-enabled call graph (DEC graph), in which the edge from b to c carries the tuple (3, 3, 0, 1, 0, 2, 0, 3).]
Figure 7.1: Example call graphs.

DEC graphs are augmentations of call graphs with abstractions of method-call parameters and of method-return values. To obtain DEC graphs, we treat different data types differently. In particular, we discretise numerical parameter and return values. Figure 7.1(b) is a DEC graph corresponding to Figure 7.1(a). The call from method b to method c is attributed with a tuple of integers, containing the total number of calls and the numbers of calls with parameter and return values falling into different intervals. When the DEC graphs are assembled, we do frequent subgraph mining with the graphs, not considering the dataflow abstractions for the moment. We then analyse the tuples of integers assigned to the edges as before with a feature-selection algorithm in the different subgraphs mined separately. Finally, we derive a likelihood of defectiveness for every method in the programme considered.
All in all, our technique for defect localisation that allows for the localisation of dataflow-affecting bugs features contributions at different stages of the analysis process and in the application domain:

Dataflow-Enabled Call Graphs. We introduce DEC graphs as sketched before, featuring dataflow abstractions. We describe an efficient implementation of their generation for Java programmes.


A Defect-Localisation Approach for Dataflow-Affecting Bugs. We present a defect-localisation technique for DEC graphs. Similar to the previous chapters, it is an application of weighted graph mining, which ultimately identifies defective methods.

Results in Software Engineering. We demonstrate the appropriateness and precision of our DEC-graph-based approach for the localisation of defects. In a case study we evaluate the approach using defects introduced into the Weka machine-learning suite [HFH+09].

7.2 Dataflow-Enabled Call Graphs
In this section, we introduce and specify dataflow-enabled call graphs (DEC graphs) and explain how we obtain them. These graphs and their analysis (described in the following Section 7.3) are the core of our approach to localise dataflow-affecting bugs.
The basic idea of DEC graphs is to extend edges in call graphs with tuples which are abstractions of method parameters and return values. Obtaining these abstractions is a data-mining problem by itself: Huge amounts of values from method-call monitoring need to be condensed to enable a later analysis and ultimately the localisation of defects. We address this problem by means of discretisation.
In the following, we first explain how we derive programme traces from programme executions (Section 7.2.1). We then explain the dataflow abstractions and say why they are useful for defect localisation (Section 7.2.2). Finally, we explain how we obtain the graphs from programme traces and give a concrete example (Section 7.2.3).

7.2.1 Derivation of Programme Traces
As in the preceding chapters, we employ the aspect-oriented programming language AspectJ [KHH+01] to weave tracing functionality into Java programmes (see Section 4.4). For each method invocation, we log call frequency and data values (parameters and return values) that occur at runtime. Finally, we use this data to build call graphs.
When logging data values, we log primitive data types as they are, capture arrays and collections by their size and reduce strings to their length. Such an abstraction from concrete dataflow has been used successfully before in the area of software performance prediction, e.g., [KKR10]. Certainly, these simplifications can be severe, but logging the full data would result in overly large amounts of data. Our evaluation (Section 7.4) primarily studies primitive data types. A systematic evaluation of


arrays, collections and strings as well as techniques for complex data types is beyond the scope of this dissertation, but is an interesting direction of future work.
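The abstraction of logged values described above can be sketched as follows. This is only an illustration in Python, whereas the tracing in this dissertation is woven into Java programmes with AspectJ; the function name is hypothetical, and returning None for values of other types is an assumption of the sketch.

```python
from collections.abc import Sized
from numbers import Number


def abstract_value(value):
    """Reduce a traced parameter or return value to a single number.

    Numbers are kept as they are; strings, arrays and collections are
    represented by their length or size. Anything else is not logged
    in this sketch (None).
    """
    if isinstance(value, Number):
        return value
    if isinstance(value, Sized):  # str, list, tuple, set, dict, ...
        return len(value)
    return None


print(abstract_value(42))         # 42
print(abstract_value("weka"))     # 4
print(abstract_value([3, 1, 2]))  # 3
```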
Based on the experience from the previous chapters, we decide to make use of a total-reduction variant of call graphs. See Section 4.1.1 for details on the total-reduction scheme.

7.2.2 Dataflow Abstractions
As mentioned before, we use discretisation in order to find an abstraction of method parameters and return values based on the values monitored. Discretisation gives us a number of intervals for every parameter and for the return value (we discuss respective techniques in the following). We then count the number of method invocations falling into the intervals determined and attribute these counts to the edges.
Notation 7.1 (Edge-Weight Tuples)
An edge-weight tuple in a dataflow-enabled call graph (DEC graph) consists of the counts of method calls falling into the respective intervals:

\[ (t,\ p^1_{i_1}, p^1_{i_2}, \ldots, p^1_{i_{n_1}},\ p^2_{i_1}, p^2_{i_2}, \ldots, p^2_{i_{n_2}},\ \ldots,\ p^m_{i_1}, p^m_{i_2}, \ldots, p^m_{i_{n_m}},\ r_{i_1}, r_{i_2}, \ldots, r_{i_{n_r}}) \]

where $t$ is the total number of calls, $p^1, p^2, \ldots, p^m$ are the method-call parameters, $r$ is the method-return value and $i_1, i_2, \ldots, i_{n_x}$ ($n_x$ denotes the number of intervals of parameter/return value $x$) are the intervals of the parameters/return values.
The idea is that values referring to an infection tend to fall into different intervals than values which are not infected. For example, infected values might always be lower than correct values. Alternatively, infected values might be outliers which do not fall into the intervals of correct values as well. In order to be suited for defect localisation, intervals must respect correct and failing programme executions as well as distributions of values. Generally, it might be counter-productive to divide a value range like integer into intervals of equal size. Groups of close-by values of the same class might fall into different intervals, which would complicate defect localisation.
With the formal notation of edge-weight tuples (Notation 7.1), we are now able to introduce DEC graphs that are totally reduced graphs at the method level:
Notation 7.2 (Dataflow-Enabled Call Graphs (DEC Graphs))
In DEC graphs, every distinct method is represented by exactly one node. When one method has called another method at least once in an execution, a directed edge connects the corresponding nodes. These edges are annotated with numerical edge-weight tuples as introduced in Notation 7.1.
As we will see in the following, DEC graphs can only be derived for a number of executions, as meaningful discretisations need to be found that hold for all programme executions considered. Figure 7.1(b) is an example DEC graph; we illustrate its construction in Example 7.1.

7.2.3 Construction of Dataflow-Enabled Call Graphs

We now explain how we derive the edge-weight tuples and construct dataflow-enabled call graphs (DEC graphs). The core task for the construction of DEC graphs is the discretisation of traced data values from a number of executions. The CAIM (class-attribute interdependence maximisation) algorithm [KC04] suits our requirements for intelligent discretisation: It (1) discretises single numerical attributes of a dataset, (2) takes classes associated with tuples into account (i.e., correct and failing executions in our scenario) and (3) automatically determines a (possibly) minimal number of intervals. Internally, the algorithm maximises the attribute-class interdependence. Comparative experiments by the CAIM inventors have demonstrated a high accuracy in classification settings.
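To make the class-aware discretisation more tangible, the following Python code is a simplified greedy sketch in the spirit of CAIM, applied to the values of Table 7.1(a). It is not the pre-aggregating implementation used in this chapter. The function names are hypothetical, and two points are assumptions based on our reading of [KC04]: candidate boundaries are the midpoints between adjacent distinct values, and the criterion is the mean over all intervals of max_r^2 / M_r, where max_r is the largest per-class count and M_r the total count in interval r.

```python
def caim_value(values, classes, boundaries):
    """CAIM criterion for the intervals [b0, b1], (b1, b2], ..."""
    inner = boundaries[1:-1]
    n_intervals = len(boundaries) - 1
    counts = [dict() for _ in range(n_intervals)]
    for v, c in zip(values, classes):
        r = sum(v > b for b in inner)  # index of the interval containing v
        counts[r][c] = counts[r].get(c, 0) + 1
    total = 0.0
    for per_class in counts:
        m = sum(per_class.values())
        if m:
            total += max(per_class.values()) ** 2 / m
    return total / n_intervals


def caim_discretise(values, classes):
    """Greedily add the boundary that maximises the CAIM criterion."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    boundaries = [distinct[0], distinct[-1]]
    n_classes = len(set(classes))
    best_global, k = 0.0, 1
    while candidates:
        score, cut = max(
            (caim_value(values, classes, sorted(boundaries + [c])), c)
            for c in candidates
        )
        # Accept while the criterion improves, or while there are still
        # fewer intervals than classes.
        if score > best_global or k < n_classes:
            boundaries = sorted(boundaries + [cut])
            candidates.remove(cut)
            best_global, k = score, k + 1
        else:
            break
    return boundaries


# Values of p1, p2 and r from Table 7.1(a), in row order.
CLASSES = ["correct"] * 3 + ["failing"] * 4 + ["correct"] * 2
P1 = [2, 1, 3, 12, 23, 15, 16, 6, 11]
P2 = [43, 44, 4, 33, 27, 28, 23, 2, 47]
R = [12, 11, 9, 8, 6, 5, 7, 10, 13]

print(caim_discretise(P1, CLASSES))  # [1, 11.5, 23]
print(caim_discretise(P2, CLASSES))  # [2, 13.5, 38.0, 47]
print(caim_discretise(R, CLASSES))   # [5, 8.5, 13]
```

On this small example, the sketch reproduces the interval boundaries listed in Table 7.1(b).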
In concrete terms, we let CAIM find intervals for every method parameter and return value of every method call corresponding to a certain edge. We do so for all edges in all call graphs belonging to the programme executions considered. We then assemble the edge-weight tuples as described in Notation 7.1. Example 7.1 illustrates the discretisation. As we are faced with millions of method calls from hundreds to thousands of programme executions, frequently consisting of duplicate values, we pre-aggregate values during the execution. To avoid scalability problems, we then utilise a proprietary implementation of CAIM which is able to handle large amounts of data in pre-aggregated form. Note that the dataflow abstractions in DEC graphs can only be derived for a set of executions, as discretisation for a single execution is not meaningful.
Example 7.1: We consider the call of method c from method b in Figure 7.1(a) (execution 1 in Table 7.1) and three further programme executions (executions 2–4) invoking the same method with a frequency of one to three. Method c has two parameters p1, p2 and returns value r. A discretisation of p1, p2 and r based on the example values given in Table 7.1(a) leads to two intervals for p1 and r ($p^1_{i_1}, p^1_{i_2}$ and $r_{i_1}, r_{i_2}$) and three for p2 ($p^2_{i_1}, p^2_{i_2}, p^2_{i_3}$). See Table 7.1(b) for the exact intervals. The occurrences of elements of edge-weight tuples can then be counted easily; see Table 7.1(c), the discretised version of Table 7.1(a). The edge-weight tuple of b → c in execution 1 then is as displayed in Figure 7.1(b), referring to $(t, p^1_{i_1}, p^1_{i_2}, p^2_{i_1}, p^2_{i_2}, p^2_{i_3}, r_{i_1}, r_{i_2})$.


(a) Example call data.
exec.  p1  p2  r   class
1      2   43  12  correct
1      1   44  11  correct
1      3   4   9   correct
2      12  33  8   failing
3      23  27  6   failing
3      15  28  5   failing
3      16  23  7   failing
4      6   2   10  correct
4      11  47  13  correct

(b) Intervals generated.
value  intervals
p1     i1: [1, 11.5]   i2: (11.5, 23]
p2     i1: [2, 13.5]   i2: (13.5, 38]   i3: (38, 47]
r      i1: [5, 8.5]    i2: (8.5, 13]

(c) Discretised data.
exec.  p1  p2  r
1      i1  i3  i2
1      i1  i3  i2
1      i1  i1  i2
2      i2  i2  i1
3      i2  i2  i1
3      i2  i2  i1
3      i2  i2  i1
4      i1  i1  i2
4      i1  i3  i2

Table 7.1: Example discretisation for the call of int c(int p1, int p2).
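The counting step of Example 7.1 can be retraced with a short Python sketch. The function names are hypothetical; the interval boundaries and call values are those of Table 7.1, with the first interval closed on both sides and all further intervals open on the left.

```python
from bisect import bisect_left


def interval_index(value, boundaries):
    """Index of the interval [b0, b1], (b1, b2], ... that contains value."""
    return bisect_left(boundaries[1:-1], value)


def edge_weight_tuple(calls, boundaries):
    """Build (t, p1_i1, ..., r_in) for one edge in one execution.

    calls: one (p1, ..., pm, r) tuple per invocation along the edge.
    boundaries: one list of interval boundaries per position in such a tuple.
    """
    result = [len(calls)]  # t, the total number of calls
    for pos, bounds in enumerate(boundaries):
        counts = [0] * (len(bounds) - 1)
        for call in calls:
            counts[interval_index(call[pos], bounds)] += 1
        result.extend(counts)
    return tuple(result)


# Intervals from Table 7.1(b) for p1, p2 and r.
BOUNDS = [[1, 11.5, 23], [2, 13.5, 38, 47], [5, 8.5, 13]]
# Calls b -> c of executions 1 and 3 from Table 7.1(a).
exec1 = [(2, 43, 12), (1, 44, 11), (3, 4, 9)]
exec3 = [(23, 27, 6), (15, 28, 5), (16, 23, 7)]

print(edge_weight_tuple(exec1, BOUNDS))  # (3, 3, 0, 1, 0, 2, 0, 3)
print(edge_weight_tuple(exec3, BOUNDS))  # (3, 0, 3, 0, 3, 0, 3, 0)
```

The first tuple is the one annotated at the edge b → c in Figure 7.1(b); the failing execution 3 populates exactly the complementary intervals of p1 and r.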

7.3 Localising Dataflow-Affecting Bugs
We now explain how to derive defect localisations from DEC graphs. This is in principle the approach from Section 5.3.1, with adaptations for the dataflow abstractions as introduced in Section 7.2. We first give an overview (Section 7.3.1), then we describe subgraph mining (Section 7.3.2) and the actual defect localisation (Section 7.3.3). Finally, we introduce three extensions to our approach (Sections 7.3.4, 7.3.5 and 7.3.6).

7.3.1 Overview
As in the earlier chapters, Algorithm 7.1 works with a set of traces T of programme executions. At first, it assigns a class (correct, failing) to every trace t ∈ T (Line 3), using a test oracle. Then the procedure generates DEC graphs from every trace t (Line 4). Next, the procedure derives frequent subgraphs of these graphs which are used as contexts where defects are located (Line 6). The last step calculates a likelihood of containing a defect for every method m (Line 7). This facilitates a ranking of the methods, which can be given to software developers. They would then review the suspicious methods manually, starting with the one which is most likely to be defective.

7.3.2 Frequent Subgraph Mining
As shown in Line 6 in Algorithm 7.1 and as in the previous chapters, we use frequent subgraph mining to derive subgraphs which are frequent within the call graphs considered. We use these subgraphs as contexts for a more detailed analysis.


Algorithm 7.1 Procedure of defect localisation with DEC graphs.
Input: a set of programme traces t ∈ T
Output: a ranking based on each method's likelihood to be defective P(m)
1: G = ∅ // initialise a set of DEC graphs
2: for all traces t ∈ T do
3:   check if t was a correct execution and assign a class ∈ {correct, failing} to t
4:   G = G ∪ {derive_dataflow-enabled_call_graph(t)}
5: end for
6: SG = frequent_subgraph_mining(G)
7: calculate P(m) for all methods m, based on SG

Again, we rely on the ParSeMiS implementation [PWDW09] of CloseGraph [YH03] for frequent subgraph mining. For the minimum-support value, we use, as in Section 6.3.1, $\min(|G_{corr}|, |G_{fail}|)/2$, where $G_{corr}$ and $G_{fail}$ are the sets of call graphs of correct and failing executions, respectively ($G = G_{corr} \cup G_{fail}$).

7.3.3 Entropy-Based Defect Localisation
Next, we calculate the likelihood that a method contains a defect (Line 7 in Algorithm 7.1). This is analogous to the previous chapters, with the exception that we now analyse the dataflow annotations, too. To this end, we assemble a feature table as follows:
Notation 7.3 (Feature tables for defect localisation with DEC graphs)
Our feature tables have the following structure: The rows stand for all programme executions, represented by their DEC graphs. For every edge in every frequent subgraph, there is one column for every edge-weight-tuple element, i.e., one column for the total call frequency t and columns for all interval frequencies. These frequencies are normalised: They are divided by the corresponding t in order to obtain the ratio of calls falling into each interval. The table cells contain the call-frequency values and the normalised interval-frequency values. The very last column contains the class ∈ {correct, failing}. If a subgraph is not contained in a call graph, the corresponding cells have value 0.
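The assembly of one feature-table row can be sketched as follows. The function name and data layout are hypothetical. The sketch relies on the fact that every method is exactly one node in a DEC graph, so a subgraph is treated as contained in a call graph when all of its edges are present. The tuple used for g1 is the one from Figure 7.1(b); the counts for gn are an assumption, chosen only so that they match the ratios shown in Table 7.2.

```python
def feature_row(graph_edges, subgraphs, tuple_lengths, label):
    """Build one row of the feature table for a single DEC graph.

    graph_edges: edge -> edge-weight tuple of this execution.
    subgraphs: subgraph name -> list of its edges (column order).
    tuple_lengths: edge -> length of its edge-weight tuple.
    """
    row = []
    for edges in subgraphs.values():
        contained = all(e in graph_edges for e in edges)
        for e in edges:
            if contained:
                t, *counts = graph_edges[e]
                # t stays absolute; interval counts become ratios of t.
                row += [t] + [round(c / t, 2) for c in counts]
            else:
                row += [0] * tuple_lengths[e]
    return row + [label]


SUBGRAPHS = {
    "sg1": [("main", "b"), ("b", "c")],
    "sg2": [("main", "a")],
}
LENGTHS = {("main", "b"): 1, ("b", "c"): 8, ("main", "a"): 1}

g1 = {("main", "b"): (2,), ("b", "c"): (3, 3, 0, 1, 0, 2, 0, 3), ("main", "a"): (67,)}
gn = {("main", "b"): (2,), ("b", "c"): (9, 9, 0, 3, 0, 6, 6, 3)}  # assumed counts

print(feature_row(g1, SUBGRAPHS, LENGTHS, "correct"))
# [2, 3, 1.0, 0.0, 0.33, 0.0, 0.67, 0.0, 1.0, 67, 'correct']
print(feature_row(gn, SUBGRAPHS, LENGTHS, "failing"))
# [2, 9, 1.0, 0.0, 0.33, 0.0, 0.67, 0.67, 0.33, 0, 'failing']
```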

Example 7.2: Table 7.2 is an example table which assumes that two subgraphs were found in the previous graph mining step, sg1 (main → b → c) and sg2 (main → a). The first column lists the call graphs g ∈ G. The second column corresponds to sg1 and edge main → b with the total call frequency t. The following eight columns correspond to the second edge in this subgraph. Besides the total call frequency t, these columns represent intervals and are derived from the frequencies of parameter


sgsg21mainbbcmaina⋯class
ec.xettp1i1p1i2p2i1p2i2p2i3ri1ri2t
ttttttt
g1231.000.000.330.000.670.001.0067⋯correct
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋱⋮
gn291.000.000.330.000.670.670.330⋯failing
Table7.2:Examplefeaturetable.g1referstoexecution1fromExample7.1(Fig-
7.1(b)).ure

and return values. The very last column contains the class correct or failing. gn does not contain sg2, and the corresponding cells have value 0.
After assembling the table, we employ the information-gain-ratio feature-selection algorithm (GainRatio, see Definition 2.7) in its Weka implementation [HFH+09] to calculate the discriminativeness of the columns and thus of the different edge-weight-tuple values. We have already successfully used the GainRatio technique in Section 5.3.1. In comparison to InfoGain, GainRatio always reaches value 1 when a column can perfectly tell classes apart. InfoGain only reaches value 1 when in addition the class distribution is equal (see Section 2.3.2). This is an advantage of GainRatio compared to InfoGain, as it makes it easier to interpret the value as a probability. In Section 7.4.3, we evaluate the usage of different feature-selection techniques.
So far, we have derived defect likelihoods for every column in the table. However, we are interested in likelihoods for methods m, and every method corresponds to more than one column in general. This is due to the fact that a method can call several other methods and might itself be invoked from various other methods, in the context of different subgraphs. Furthermore, methods might have several parameters and a return value, each with possibly several intervals. To obtain the method likelihood P(m), we assign every column containing a total frequency $t$ or a parameter-interval frequency $p_i$ to the calling method and every return-value-interval frequency $r_i$ to the callee method. We then calculate P(m) as the maximum of the GainRatio values of the columns assigned to method m. By doing so, we identify the defect likelihood of a method by its most suspicious invocation and the most suspicious element of its tuple. Other invocations are less important, as they might not be related to a defect. The call context of a likely defective method and suspicious data values are supplementary information which we report to software developers to ease debugging.
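The assignment of columns to methods and the maximum over their scores can be sketched in a few lines of Python. The function name, the column encoding and the GainRatio scores below are hypothetical and serve only as an illustration.

```python
def method_likelihoods(columns):
    """Derive P(m) from per-column GainRatio scores.

    columns: (caller, callee, kind, score) per feature-table column, where
    kind is "t" (total frequency), "p" (parameter interval) or
    "r" (return-value interval).
    """
    likelihood = {}
    for caller, callee, kind, score in columns:
        # Return-value columns are attributed to the callee, all others
        # to the calling method.
        method = callee if kind == "r" else caller
        likelihood[method] = max(likelihood.get(method, 0.0), score)
    return likelihood


# Hypothetical GainRatio scores for some columns of Table 7.2.
columns = [
    ("main", "b", "t", 0.0),
    ("b", "c", "t", 0.1),
    ("b", "c", "p", 0.0),
    ("b", "c", "r", 0.9),
    ("main", "a", "t", 0.2),
]
P = method_likelihoods(columns)
print(sorted(P, key=P.get, reverse=True))  # ['c', 'main', 'b']
```

With these assumed scores, the suspicious return-value column of b → c puts method c at the top of the ranking.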
Example 7.3: The graphs g1 and gn in Table 7.2 display very similar values, but refer to a correct and a failing execution. Assume that method c contains a defect which occasionally leads to a wrongly calculated return value. This is reflected in


the columns $r_{i_1}/t$ and $r_{i_2}/t$ of b → c in sg1. The GainRatio measure will recognise fluctuating values in these columns, leading to a high ranking of method c.
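The difference between InfoGain and GainRatio mentioned in this section can be reproduced with a small self-contained Python sketch. It is not the Weka implementation: the function names are hypothetical, and the columns are assumed to be already discrete, so any treatment of numeric columns is not modelled here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def info_gain(column, classes):
    """Reduction of class entropy obtained by knowing the column value."""
    n = len(classes)
    conditional = 0.0
    for value in set(column):
        subset = [c for x, c in zip(column, classes) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(classes) - conditional


def gain_ratio(column, classes):
    """Information gain normalised by the entropy of the column itself."""
    split_info = entropy(column)
    return info_gain(column, classes) / split_info if split_info else 0.0


# A column that separates the classes perfectly, with unequal class sizes.
classes = ["correct", "correct", "correct", "failing"]
column = ["low", "low", "low", "high"]

print(round(info_gain(column, classes), 3))   # 0.811
print(round(gain_ratio(column, classes), 3))  # 1.0
```

Although the column tells the classes apart perfectly, InfoGain stays below 1 because the class distribution is unequal, whereas GainRatio reaches 1.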
The preceding example has illustrated how our technique is able to localise dataflow-affecting bugs based on the ratios of executions falling into the different intervals of the method parameters and return values. Furthermore, it localises frequency-affecting bugs based on the call frequencies in the edge-weight tuples. In addition, our technique is able to localise most structure-affecting bugs as well: (1) The call structure is implicitly contained in the feature tables (e.g., Table 7.2); value 0 indicates subgraphs not supported by an execution. (2) Such defects are frequently caused by control statements (e.g., if, for) evaluating previously wrongly calculated values. Our analysis based on dataflow can detect such situations more directly.

7.3.4 Follow-Up-Infection Detection
Call graphs of failing executions frequently contain infection-like patterns which are caused by a preceding infection. As in Section 5.3.1, we employ a simple strategy to detect certain follow-up infections to enhance the method ranking. This strategy is an extension for Line 7 in Algorithm 7.1: We remove methods within the same subgraph belonging to a method call m2 → m3 from the ranking when the following conditions hold: (1) GainRatio(m1 → m2) = GainRatio(m2 → m3) (we consider the GainRatio values from columns belonging to total call frequencies and parameters), and (2) m1 → m2 → m3 is not part of a cycle within any g ∈ G. Condition (2) is necessary as the origin of an infection cannot be determined within a cycle. (Note that cycles can occur in totally reduced graphs but not in R_subtree graphs as used in Chapter 5.) As in Section 5.3.1, our detection is a heuristic, but it is helpful in practice (see Section 7.4).

7.3.5 Improvements for Structure-Affecting Bugs
The subgraphs mined in Line 6 in Algorithm 7.1 can be used for an enhanced localisation of structure-affecting bugs. There are two kinds of such bugs: (1) those which lead to additional structures and (2) those leading to missing structures. To deal with both of them, we use the support supp of every subgraph sg in $G_{corr}$ and $G_{fail}$ separately to define two intermediate rankings. The rationale is that methods in subgraphs having a high support in either correct or failing executions are more likely to be defective. We again use the maximum:

\[ P_{corr}(m) := \max_{sg \in SG,\ m \in V(sg)} \big( supp(sg, G_{corr}) \big) \]
\[ P_{fail}(m) := \max_{sg \in SG,\ m \in V(sg)} \big( supp(sg, G_{fail}) \big) \]


With these two values, we define a structural score as follows:

\[ P_{struct}(m) := \left| P_{corr}(m) - P_{fail}(m) \right| \]

Preliminary experiments have revealed that this kind of structural scoring leads to better results with the totally reduced graphs used in this chapter than the structural scoring function used in Section 5.3.2. To integrate $P_{struct}$ into our GainRatio-based method ranking P(m) (in Line 7 in Algorithm 7.1), we calculate the average:

\[ P_{comb}(m) := \frac{P(m) + P_{struct}(m)}{2} \]

7.3.6 Incorporation of Static Information
As in the previous chapter, static information can be used to improve the ranking accuracy. The starting point is the handling of methods with the same defect likelihood. As in related studies [JH05], we use the worst ranking position for all methods which have the same defect likelihood by default. As an extension, a second static ranking criterion helps to distinguish methods with the same defect likelihood: We sort such methods decreasingly by their size in normalised lines of code (LOC)¹. Research has shown that the size in LOC frequently correlates with the defectiveness likelihood [NBZ06].
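The structural score from Section 7.3.5, its combination with P(m) and the two ways of handling equal likelihoods described here can be put together in a short Python sketch. All function names, support values, likelihoods and LOC figures are hypothetical, and supports are assumed to be relative values between 0 and 1.

```python
def structural_score(method, subgraphs, supp_corr, supp_fail):
    """|P_corr(m) - P_fail(m)| with maxima over the subgraphs containing m."""
    containing = [sg for sg, nodes in subgraphs.items() if method in nodes]
    p_corr = max((supp_corr[sg] for sg in containing), default=0.0)
    p_fail = max((supp_fail[sg] for sg in containing), default=0.0)
    return abs(p_corr - p_fail)


def combined_score(p, p_struct):
    """Average of the GainRatio-based and the structural likelihood."""
    return (p + p_struct) / 2


def worst_case_positions(scores):
    """Default: tied methods all receive the worst position of their group."""
    return {m: sum(1 for s in scores.values() if s >= score)
            for m, score in scores.items()}


def rank_with_loc(scores, loc):
    """Extension: break ties by decreasing method size in LOC."""
    return sorted(scores, key=lambda m: (-scores[m], -loc[m]))


subgraphs = {"sg1": {"main", "b", "c"}, "sg2": {"main", "a"}}
supp_corr = {"sg1": 1.0, "sg2": 1.0}
supp_fail = {"sg1": 1.0, "sg2": 0.25}  # sg2 is often missing in failing runs

print(structural_score("a", subgraphs, supp_corr, supp_fail))  # 0.75
print(combined_score(0.5, 0.75))                               # 0.625

scores = {"a": 0.6, "b": 0.6, "c": 0.9}
print(worst_case_positions(scores))                        # {'a': 3, 'b': 3, 'c': 1}
print(rank_with_loc(scores, {"a": 10, "b": 25, "c": 5}))   # ['c', 'b', 'a']
```

Without the static criterion, a and b both count as position three; with it, the larger method b is inspected before a.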

7.4 Experimental Evaluation
To investigate the defect-localisation capabilities of our approach, we use the Weka machine-learning suite [HFH+09], manually add a number of defects to it, instrument the code and execute it using test-input data. Finally, we compare the defect ranking returned by our approach with the de-facto defect locations. Overall, we carry out six experiments:
E1 Application of the new approach featuring DEC graphs,
E2 the same with follow-up-infection detection,
E3 the same with follow-up-infection detection and structural ranking,
E4 the same approach with call graphs that are not dataflow enabled,
E5 E4 with follow-up-infection detection and
E6 E4 with follow-up-infection detection and structural ranking.
Experiments E4–6 essentially are a comparison to the technique presented in Section 5.3 using R^w_total graphs. We use the same localisation technique as with the DEC graphs for a fair comparison.
¹ In this dissertation, we use "method lines of code", the sum of non-blank and non-comment LOC inside method bodies, as derived with the Metrics eclipse plugin [Sau05].


We now describe the experimental setting in detail (Section 7.4.1) before we present the experimental results (Section 7.4.2). We further present some supplementary experiments (Section 7.4.3).

7.4.1 Experimental Setting
Weka is a data-intensive open-source application with a total of 19,938 methods and 255k lines of code (LOC). We now use Weka as our programme under test, as it heavily deals with data passed between methods, which is not the case in the previous evaluations in this dissertation. As we have done in Section 5.4.1, we introduce five different kinds of defects. They are of the same types as the defects in related evaluations, e.g., the Siemens programmes [HFGO94], which are often used to evaluate defect-localisation techniques for C programmes (see Section 3.1.2).
The defect types introduced to Weka are typical programming mistakes, are non-crashing, occasional and dataflow-affecting and/or call-graph-affecting. In total, we evaluate ten separate defects (defects 1–10) as well as six combinations of two of these defects (defects 11–16). These combinations mimic typical situations where a programme contains more than one defect.
We have introduced all defects in weka.classifiers.trees.DecisionStump. This class is the implementation of a decision-tree algorithm which comprises 18 methods or 471 LOC. We emphasise that we instrument all 19,938 methods of Weka, and all of them are potential subjects to defect locations. A typical execution of DecisionStump involves a total of 30 methods. This is the reason why we can analyse this rather large project without any hierarchical procedure (see Chapter 6).
We execute each defective version of Weka with 90 sets of sampled data from the UCI machine-learning repository [FA10] and classify correct and failing executions of the programme. To this end, we first execute a correct reference version of Weka with all 90 UCI data sets. After that, we execute the defective versions with the same data. We then interpret any deviation in the output of the two versions as a failure. The number of correct executions is in the same range as the number of failing ones. They differ by a factor of 2.7 on average and by 5.3 in the worst case.

7.4.2 Experimental Results
We present the results (the ranking position which pinpoints the actual defect) of the six experiments for all sixteen defects in Table 7.3. This position quantifies the number of methods a software developer has to review in order to find the defect. We compare the experimental results pairwise between DEC graphs (E1–3) and non-DEC graphs (E4–6), as indicated by the arcs. A grey-coloured cell means worse results, non-coloured cells mean same or improved results. Bold-face rankings indicate same or improved results compared to the preceding row (separately for


DEC/non-DEC graphs). In programmes with more than one defect (i.e., defects 11–16), we present numbers corresponding to the defect ranked best. This reflects that a developer would first fix one defect, before applying our technique again. Sometimes two or more methods have the same defect likelihood. In this case, we use the worst ranking position for all the methods with the same likelihood. This is in line with the methodology of related studies [JH05]. (We look at the results featuring the incorporation of static information as described in Section 7.3.6 at the end of this experimental evaluation section.)

The experiments clearly show the improved defect-localisation capabilities of the new approach based on DEC graphs. Even without extensions (E1), a top ranking is obtained in 15 out of 16 cases. We consider a method ranked top when a developer has to investigate only 3 methods out of the 30 ones actually executed. With non-DEC graphs (E4), only 6 defects are ranked top. In only 5 out of 48 measurement points, compared to 26 out of 48 ones, the DEC-graph-based approach is worse than the reference. DEC graphs have reached a top ranking in 44 cases, whereas non-DEC graphs had a top ranking in only 28 cases. When directly comparing DEC graphs (E1) with non-DEC graphs (E4) without extensions, the defect localisation was better in 13 out of 16 cases. Furthermore, looking at the average values ('∅'), the number of methods to be investigated could be reduced by more than half.

Using the follow-up detection (E2/5), the ranking could be improved in all cases or has generated results of the same quality compared to the respective initial approach. This is remarkable, as the follow-up-infection detection is a heuristic approach. The use of both the follow-up and structural extension (E3/6) results in further improvements. For DEC graphs (E3) in comparison to (E2), the extension improves the ranking in 9 cases and lowers the ranking in 3 cases, i.e., better overall results. For non-DEC graphs (E6) in comparison to (E5), the picture is similar: 10 improved cases and 3 worse ones.

Regarding the Weka versions with two defects (defects 11–16), defect localisation always works better on average than for versions with only one defect (defects 1–10). Our explanation is that defect localisation has a higher chance to be correct when two methods have a defect.

Overall, the experiments show a large improvement of the ranking with the new approach. In combination with follow-up detections and the structural ranking (E3), results are best. Using the structural ranking leads to a slightly worse ranking for some defects. The experiments also show that only 1.6 out of the 19,938 methods of Weka (of which 30 methods are actually executed) must be investigated on average in order to find a defect (E3). The results promise a strong reduction of time spent on defect localisation in software-engineering projects.


7.4. EXPERIMENTAL EVALUATION

∅   2.3   1.9   1.6   6.1   2.8   2.0

163211031

111831
1514221321
13111111
12221841
221321
11115646
109113958

83211031

712911372

6221321

52261047

43221352

31111141

2321111

1321111

exp./defect   DEC graphs (E1)   DEC graphs (E2)   DEC graphs (E3)   Non-DEC graphs (E4)   Non-DEC graphs (E5)   Non-DEC graphs (E6)

Table 7.3: Defect-localisation results. (E2/3/5/6) incl. follow-up, (E3/6) incl. structural ranking.


CHAPTER 7. LOCALISATION OF DATAFLOW-AFFECTING BUGS

Improved Experimental Results using Static Analysis
When we apply the secondary static ranking criterion (see Section 7.3.6) to our experiments, we can observe an improvement of the average ranking position as follows: 2.3 to 1.9 (E1), 1.9 to 1.7 (E2), 1.6 to 1.5 (E3), 6.1 to 3.6 (E4), 2.8 to 2.6 (E5) and 2.0 to 1.9 (E6). Although the additional static ranking criterion leads to improvements in all experiments, the non-DEC graphs (E4–6) benefit from the improved ranking to a larger extent. As feature selection for non-DEC graphs considers fewer columns, the defect likelihood of methods has fewer different values than for DEC graphs, and this more frequently leads to equal rankings. However, even after the combination with static analysis, defect localisation with DEC graphs is always better on average than with non-DEC graphs. The same observations as described in the preceding paragraphs hold.

7.4.3 Supplementary Experiments
We now present supplementary experiments that are not intended to demonstrate the usefulness of DEC graphs, but evaluate selected aspects from the defect-localisation technique. Concretely, we evaluate the feature-selection technique employed, and we evaluate one aspect of the feature tables. This aspect concerns the question whether null values or value 0 in the feature tables lead to better defect-localisation results.

In the preceding chapters, we have used the information-gain technique for feature selection (InfoGain), while we have used a related technique in this chapter, information-gain ratio (GainRatio). In Section 5.3, we have already described that both techniques lead to very similar results. However, gain ratio has the nice property that it always reaches value 1 when a column discriminates perfectly. Information gain in turn additionally requires a balanced class distribution to reach value 1 (see Section 2.3.2). This property of gain ratio might make it easier for software developers to interpret the resulting values as a probability to contain a defect.
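
The difference between the two measures can be sketched with a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the column is assumed to be already discretised, and the data (8 correct and 2 failing executions, separated perfectly by the column) are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(column, labels):
    """Entropy of the labels minus the expected entropy after splitting on the column."""
    n = len(labels)
    remainder = 0.0
    for v in set(column):
        subset = [l for c, l in zip(column, labels) if c == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(column, labels):
    """Information gain normalised by the entropy of the column (split information)."""
    split_info = entropy(column)
    return info_gain(column, labels) / split_info if split_info > 0 else 0.0

# Hypothetical, unbalanced data: the column separates the two classes perfectly.
labels = ["correct"] * 8 + ["failing"] * 2
column = [0] * 8 + [1] * 2

ig = info_gain(column, labels)   # bounded by the class entropy, here below 1
gr = gain_ratio(column, labels)  # reaches 1 for a perfectly discriminating column
print(round(ig, 3), round(gr, 3))
```

With this unbalanced class distribution, the information gain stays at the class entropy (about 0.72), whereas the gain ratio is 1.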
In the feature tables used for defect localisation, it happens that certain columns cannot be filled with values when a certain call graph (a row in the table) does not embed a certain subgraph (corresponding to columns). In these situations we have used a zero ('0') in this chapter and in Chapter 5, while we have used a null value ('-') in Chapter 6. Both alternatives are reasonable, as one can argue that a null value refers to non-existing embeddings, and as one can likewise argue that a zero stands for zero method calls. We now evaluate these two alternatives.

In our supplementary experiments, we focus on defects 1–10 from the previous evaluation in this chapter. We do so, as defects 11–14 are combinations thereof, and as we want to study the pure results from defect localisation in the standard case, with one defect. Table 7.4 contains the results from the supplementary experiments (the first line is taken from Table 7.3).


exp./defect                          1   2   3   4   5   6   7   8   9  10     ∅
(E1) with GainRatio (as before)      3   3   1   3   2   2  12   3   1   1   3.1
(E1) with InfoGain                   4   1   2   4   2   3  12   1   1   1   3.1
(E1) with GainRatio & null values    4   4   5   9   2   7   8   6   1   1   4.7
Table 7.4: Supplementary experimental results.

Regarding GainRatio and InfoGain, the results deviate a little between the individual defects. On average, the defect-localisation precision of both alternatives is equal. This is in line with our results in Section 5.3. Therefore, we consider both alternatives to be equally suited for defect localisation. However, the GainRatio results might be a little more intuitive as they are always in the same interval (between 0 and 1).

Regarding the influence of null values instead of zeros in the feature tables, the picture is different. Except for defect 7, where the null values lead to better defect localisations, the variant presented in this chapter (the first line in Table 7.4, referring to the usage of zeros in the feature tables) performs equally well or better. This is clearly indicated by the increased average values for the null-value variant. The advantage of zeros can be explained by the fact that tuples containing null values are ignored when InfoGain or GainRatio is calculated. Therefore, more information that is potentially important for defect localisation is considered when using zeros.

7.5 Subsumption
The defect-localisation techniques investigated in the preceding chapters of this dissertation are agnostic regarding the dataflow. That is, they are not able to localise defects that affect the dataflow only, and they have difficulties localising defects that affect primarily the dataflow. In this chapter, we have extended our call-graph representations (see Chapter 4) with abstractions referring to the dataflow, resulting in dataflow-enabled call graphs (DEC graphs). Further, we have adapted our defect-localisation scheme (see Chapter 5) to deal with DEC graphs. With these extensions and adaptations we are able to localise a broader range of defects. DEC graphs can also be used within a hierarchical defect-localisation scheme as introduced in Chapter 6 without any special challenges.

Besides the good defect-localisation results achieved with DEC graphs, there are a number of possible improvements for the technique:

• As mentioned before in Section 7.2.1, we have primarily studied dataflows through primitive data types. This is partly caused by the absence of respective defective programmes featuring other situations. However, a systematic evaluation of dataflows related to arrays, collections and strings would substantiate the results from this chapter. Furthermore, as mentioned, we currently do not deal with complex data types. However, one can define heuristics to incorporate such dataflows. Then, complex data types can be handled with our technique in the same way as we handle other data types.

• Besides the question which data types to investigate, not all kinds of dataflows are directly related to method calls. For instance, a dataflow can also be realised by interchanging data through global variables. Currently, our approach does not cover such situations. However, they might be integrated into our approach as follows: Static code analysis could help to identify relevant variables that are read within a method. They can then be treated like additional method-call parameters.

• Another starting point for further investigations is the evaluation of different discretisation algorithms. As described in Section 7.2.3, we have decided to employ the CAIM algorithm [KC04], as it suits our requirements and has outperformed a number of alternative algorithms in the evaluations by the authors. Furthermore, we have achieved good results with this kind of discretisation in our evaluation (see Section 7.4). However, other supervised discretisation techniques (i.e., discretisation of numerical data with respect to a class) have been described in the literature [CWC95, DKS95, FI93, Ker92, LS97, WC87, Wu96] and are in principle suited for our approach, too. Although we do not expect significant improvements in result accuracy, these alternatives could be evaluated. Besides the discretisation algorithms mentioned, decision-tree-induction algorithms [BK98, Qui93] with different parametrisations could be used for this task as well. When applied to one attribute only, they partition the value range into intervals containing homogeneous values referring to the same class with an increased likelihood.

8 Constraint-Based Mining of Weighted Graphs

In the previous chapters, we have focused on software-defect localisation with call graphs. Concretely, we have discussed various data representations (call graphs) and data-mining techniques for their analysis. For the latter, we have so far followed a post-processing approach for mining weighted call graphs: We have analysed the weights in an analysis step that follows subgraph mining. We now investigate an integrated approach for weighted subgraph mining that brings together subgraph mining and the analysis of edge weights. We do so by proposing a constraint-based approach and by investigating its difficulties. We show that this approach can generally be used for various applications, including our software-defect-localisation setting.

In this chapter, we first present an introductory overview in Section 8.1. In Section 8.2, we introduce weight-based constraints, and in Section 8.3 we explain their integration into mining algorithms. Section 8.4 describes application settings. Section 8.5 contains the evaluation. Section 8.6 is a subsumption of this chapter.

8.1 Overview
Two general approaches for subgraph mining with weighted graphs are preprocessing and postprocessing. These strategies refer to the analysis of the weights: Are they analysed before or after the mining of the graph structure? However, both of these variants have issues: As discussed in Section 3.2.1, discretising numerical values during preprocessing might lose important information. Postprocessing (as investigated in Chapters 5–7) in turn is not always efficient: The mining algorithm first ignores the weights and might generate a huge number of subgraphs. The second step however discards most of them. Cheaper ways to perform frequent subgraph mining with weights are approximate graph mining (see Section 3.2.2) and constraint-based mining (see Section 3.2.3). In this chapter, we investigate approximate frequent subgraph mining with weight-based constraints. That is, we analyse the weights during the mining of the graph structure. Such a constraint-based approach is promising, since various higher-level analysis tasks imply meaningful weight-based constraints. In a classification scenario, to give an example, a natural constraint would demand weights in the subgraph patterns with a high discriminativeness. While constraints lead to smaller result sets, we hypothesise that those application-specific constraints


Figure 8.1: Example graphs. Panels (a)–(c): graphs of the database D; (d): pattern f; (e): pattern f′.

do not lower the result quality of the higher-level problem. The same principle applies to our software-defect-localisation scenario. However, not every constraint is good for pruning in a straightforward way. Literature has introduced anti-monotone constraints (see Section 2.3.3 and Section 3.2). When using them for pruning, the algorithm still finds all patterns. However, most weight-based constraints are not anti-monotone, for the following reason: Graph topology and weights are independent of each other, at least in theory. Example 8.1 illustrates that weight-based properties of graphs may behave unpredictably when the support changes. Thus, pruning a pattern at a certain point bears the risk of missing elements of the result.
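
The effect of a changing support can be made concrete with a small sketch. The three graphs and their edge weights below are assumptions for illustration, chosen so that the averages agree with the values 7 and 10 discussed in Example 8.1; node labels are assumed to be unique, so that a graph supports a pattern if it contains all of its edges. The minimum-average threshold is hypothetical as well.

```python
from statistics import mean

# Hypothetical graph database D: each graph maps an edge (source, target) to its weight.
D = {
    "a": {("a", "b"): 12, ("b", "c"): 3},
    "b": {("a", "b"): 8, ("b", "c"): 3},
    "c": {("a", "b"): 1},
}

def supporting(pattern, database):
    """Graphs that contain every edge of the pattern."""
    return [g for g in database.values() if all(e in g for e in pattern)]

def avg_weight(edge, pattern, database):
    """Average weight of one edge over all graphs supporting the pattern."""
    return mean(g[edge] for g in supporting(pattern, database))

f = [("a", "b")]                    # supported by all three graphs
f_prime = [("a", "b"), ("b", "c")]  # graph "c" no longer supports the pattern

avg_f = avg_weight(("a", "b"), f, D)
avg_f_prime = avg_weight(("a", "b"), f_prime, D)

# A hypothetical constraint "average weight of a -> b is at least 8" is violated
# by f but satisfied by its extension f', so pruning at f would miss f'.
t_min = 8
print(avg_f >= t_min, avg_f_prime >= t_min)
```

The average weight of the same edge rises from 7 to 10 although the pattern only grows, which is why a threshold test on f says nothing reliable about its extensions.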
Example 8.1: Think of an upper-bound constraint defined as a numerical threshold t_u on the average weight of a certain edge a→b in all supporting graphs: avg(a→b) ≤ t_u. This would prevent mining from expanding a pattern f where avg(a→b) > t_u. Now consider the graph database D consisting of (a)–(c) as well as pattern f in Figure 8.1. f is annotated with the average weights of the edges in D. If we now extend f by one edge, resulting in pattern f′, the average weight increases from 7 to 10. Graph (c) causes this effect. It does not support f′, and its weight value is below average.

Despite this adverse characteristic, we study frequent subgraph mining with non-anti-monotone weight-based constraints in this chapter. The rationale is that certain characteristics of real-world graphs give way to the expectation that results are good. Namely, there frequently is a correlation between the graph topology and the weights in real-world weighted graphs.

Example 8.2: Consider a road-map graph where every edge is attributed with the maximum speed allowed. Large cities, having a higher node degree (a topological property), tend to have more highway connections (high edge-weight values) than smaller towns. This is a positive correlation.
In software engineering, a similar observation holds: Think of a node in a weighted call graph representing a small method consisting of a loop. This method tends to invoke a few different methods only (low degree), but with high frequency (high weights). This is a negative correlation.
McGlohon et al. [MAF08] have studied a number of weighted graphs from different domains such as citation networks, social networks and computer-network-traffic networks. They have observed similar correlations as in Example 8.2. Concretely, they have formulated the so-called weight power law (WPL) and the snapshot power law (SPL). The WPL links the total weight of a graph to the number of edges and to the number of nodes in the graph, each following a power law with exponents that are specific for a graph dataset. Even more interestingly, similar to our road-map example, the SPL describes a proportional relationship between the weights of outgoing edges and the out-degree of a certain node (and accordingly for incoming edges). This is again a power-law relationship with exponents that are specific for the graph dataset. However, all these observations in real-world graphs are in contrast to the property sketched before: In theory, weights might be independent from the graph structure. Therefore, although there is strong evidence that certain relationships between weights and graph topology exist, such relationships cannot be guaranteed for arbitrary graph datasets.

Motivated by the examples given in Example 8.2 and the observations from McGlohon et al. [MAF08] referring to real-world graphs, we propose the following approach for weighted subgraph mining:

Approach 8.1 (Approximate weight-constraint-based frequent subgraph mining)
Given a database of weighted graphs, find subgraphs satisfying a minimum frequency constraint and user-defined constraints referring to weights.

Note that the subgraphs returned are unweighted; weights are considered only in the constraints. In this chapter, we compose a constraint-based mining technique by integrating constraints referring to weights (that are not anti-monotone) into frequent-subgraph-mining algorithms. This leads to approximate results. We then investigate the following problem:

Problem 8.1
What is the completeness and the usefulness of results obtained from approximate weight-constraint-based frequent subgraph mining?

In concrete terms, we study the degree of completeness of mining results compared to non-constrained results. To assess the usefulness of an approximate result, we consider the result quality of higher-level analysis tasks, based on approximate graph-mining results as input.

To deal with this problem, this chapter features the following points:

Weight-Constraint-Based Subgraph Mining. We say how to extend standard pattern-growth algorithms for frequent subgraph mining with pruning based on weight-based constraints. We do so for gSpan [YH02] and CloseGraph [YH03] (see Section 2.3.3).

Application to Real-World Problems. Besides our defect-localisation application, we describe further data-analysis problems that build on weighted graphs. We say how to employ weight-constraint-based subgraph mining to solve these problems.

Evaluation. We report on the outcomes of an evaluation featuring different domains and analysis settings. This includes our software-defect-localisation scenario as well as data and analysis problems from logistics. A fundamental result is that the correlation of weights with the graph structure indeed exists, and we can exploit it in real-world analysis problems.

8.2 Weight-Based Constraints
In this section, we define the weight-based constraints we investigate in this chapter. We do not deal with anti-monotone constraints, since we are interested in investigating approximate mining results from non-anti-monotone constraints. However, the techniques would work with anti-monotone constraints as well.

Definition 8.1 (Weight-based measures)
A weight-based measure is a function E(p) → R which assigns every edge of a graph pattern p a numerical value. The function takes the weights of the corresponding edges in all embeddings of p in all graphs in a graph database D into account.

Depending on the actual problem, one can assign some numerical or categorical value such as a class label to each graph. In our software-defect-localisation scenario, these labels stand for correct and failing executions. Measures like InfoGain and PMCC make use of such values, in addition to the weights. If labels are not unique, subgraphs can be embedded at several positions within a graph. We consider every single embedding of a subgraph to calculate a measure for an edge.
Definition 8.2 (Weight-based constraints)
A lower bound predicate c_l for a pattern p is a predicate with the following structure:
c_l(p) := (∃ e1 ∈ E(p) : measure(e1) > t_l) ∨ (|p| < size_min)
An upper bound predicate c_u in turn is as follows:
c_u(p) := (∄ e2 ∈ E(p) : measure(e2) > t_u) ∨ (|p| < size_min)
A weight-based constraint, applied to a pattern p, is a set containing c_l, c_u, or both, conjunctively connected.
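
The two predicates can be written down almost literally. The following sketch assumes that the per-edge measure values of a pattern are already available as a list and that |p| is the number of edges; the function and parameter names are illustrative, and the defaults 0 and ∞ stand for thresholds that are not set.

```python
import math

def c_l(measures, size, t_l=0.0, size_min=0):
    """Lower bound: at least one edge exceeds t_l, or the pattern is still too small."""
    return any(m > t_l for m in measures) or size < size_min

def c_u(measures, size, t_u=math.inf, size_min=0):
    """Upper bound: no edge exceeds t_u, or the pattern is still too small."""
    return not any(m > t_u for m in measures) or size < size_min

def satisfies(measures, size, t_l=0.0, t_u=math.inf, size_min=0):
    """Weight-based constraint: conjunction of both predicates."""
    return c_l(measures, size, t_l, size_min) and c_u(measures, size, t_u, size_min)

# Hypothetical measure values for the edges of three patterns.
print(satisfies([0.2, 0.3], size=2, t_l=0.1, t_u=0.4))   # both bounds hold
print(satisfies([0.2, 0.5], size=2, t_l=0.1, t_u=0.4))   # upper bound violated
print(satisfies([0.05], size=1, t_l=0.1, size_min=2))    # too small to be judged
```

Note that a pattern smaller than size_min satisfies both predicates regardless of its measure values, which is how the parameter delays pruning.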


The lower- and upper-bound predicates let the user specify a minimum and maximum interestingness based on the measure chosen. We comment on the two predicates as well as on parameter size_min in Section 8.3. Note that Definition 8.2 requires considering all edges of a pattern p. This is necessary, as illustrated in Example 8.1. The value of the measure of any edge of p can change when the set of graphs supporting p changes.

Weight-Based Measures
Any function on a set of numbers can be used as a measure. We have chosen to evaluate three measures with a high relevance in real data-analysis problems: InfoGain, PMCC and variance. None of these measures is anti-monotone. Two of them, InfoGain and PMCC, require the existence of a class associated with each graph. Such classes are available, e.g., in any graph-classification task, and the goal of the mining process is to derive subgraph patterns for a good discrimination between the classes. Variance does not depend on any class. It is useful in explorative mining scenarios where one is interested in subgraphs with varying weights.

Example 8.3: If one wants to search for patterns p with a certain minimum variance of weights, one would specify the measure 'variance', the threshold value t_l and set size_min to 0. The constraint then is '∃ e : variance(e) > t_l'. This could be useful when analysing logistics data, where one wants to find subgraphs with unbalanced load or highly varying transportation times.

Although we have dealt with some of the measures in earlier chapters, we give a short summary of the three measures chosen in the following. Besides these measures, many further measures from statistics and data analysis can be used similarly to build weight-based constraints. This includes, say, different attribute-selection measures known from decision-tree induction [BK98].

Information Gain. The InfoGain (see Definition 2.7) is a measure in the interval [0,1] and quantifies the ability of an attribute A to discriminate between classes in a dataset (without a restriction to binary classes). In the context of weighted graphs, A refers to the weights of a certain edge of a subgraph pattern in all embeddings in all graphs in the graph database D.

Pearson's Product-Moment Correlation Coefficient (PMCC). The correlation coefficient is widely used to quantify the strength of the linear dependence between two variables (see, e.g., [WF05]). In our graph-mining context, these two variables are the weight of a certain edge in a subgraph pattern in all embeddings in graphs in D and their binary classes. For our purposes, positive and negative correlation have the same importance, and we use the absolute value. Then PMCC is in the interval [0,1] as well.

Figure 8.2: A pattern-growth search space with conventional isomorphism-based pruning (s′) and constraint-based pruning (t, new in this dissertation).
Variance. The variance quantifies the variation of the values of a random variable Y. It is in the interval [0,∞). In our scenarios, Y is the set of weights of a certain edge in all subgraph patterns in all embeddings in D.
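
As an illustration of the two measures that need no discretisation, the following sketch computes the absolute PMCC and the variance for the weights of a single edge. The weights and the 0/1 class encoding are hypothetical, and the population variance is used here as an assumption; the text does not state which variance estimator is employed.

```python
from math import sqrt
from statistics import mean, pvariance

def abs_pmcc(weights, classes):
    """Absolute Pearson correlation between edge weights and binary classes (0/1)."""
    mw, mc = mean(weights), mean(classes)
    cov = sum((w - mw) * (c - mc) for w, c in zip(weights, classes))
    norm = sqrt(sum((w - mw) ** 2 for w in weights)
                * sum((c - mc) ** 2 for c in classes))
    return abs(cov / norm) if norm > 0 else 0.0

# Hypothetical weights of one edge in six embeddings and the classes of their graphs
# (0 = correct execution, 1 = failing execution).
weights = [2, 3, 2, 40, 38, 45]
classes = [0, 0, 0, 1, 1, 1]

correlation = abs_pmcc(weights, classes)  # close to 1: weights separate the classes
spread = pvariance(weights)               # class-independent, in [0, infinity)
print(round(correlation, 3), round(spread, 1))
```

Because the absolute value is taken, swapping the two class labels leaves the PMCC value unchanged.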

8.3 Weight-Based Mining
We now describe how to integrate weight-based constraints into pattern-growth-based frequent subgraph mining. We first focus on vanilla pattern-growth algorithms before turning to closed mining. The basic idea is to use weight-based constraints, even if they are not anti-monotone, to prune the search space.

Example 8.4: Figure 8.2 illustrates pattern-growth mining with and without weight-based constraints. Without such constraints, s′ and its successors are pruned, as s′ is isomorphic to s. With weight-based constraints, the search is additionally pruned at pattern t. The dashed edge extends its parent, and t including the new edge violates a weight-based constraint. Note that it is not necessarily the newly added edge itself which violates the constraint, but any edge in t.

In concrete terms, we treat the lower- and upper-bound predicates c_l and c_u (as defined in Definition 8.2) in weight-constraint-based mining as follows:

Approach 8.2
When a pattern p does not satisfy c_l or c_u, the search is pruned. If it is c_u that is not satisfied, p is added to the mining result, otherwise not.


Upper Bounds. The rationale behind an upper bound is to speed up mining by pruning the search when a sufficiently interesting edge weight is found. Therefore, we use it to prune the search, but save the current pattern. For example, if the user wants to use the graph patterns mined for classification, a pattern with one edge with a very discriminative weight will be fair enough. Clearly, larger graphs can still be more discriminative. Setting the threshold therefore involves a trade-off between efficient pruning and finding relevant graphs. Section 8.5.3 will show that small changes in the upper bound do not change the results significantly. It is therefore sufficient to rely on few different threshold values to obtain satisfactory results.

Lower Bounds. With a lower bound, the user specifies a minimal interestingness. This bound stops mining when the value specified is not reached. The rationale is that one does not expect to find any patterns which are more interesting. However, this might miss patterns. The parameter size_min (see Definition 8.2) controls this effect.

Pattern-Growth Algorithms
Algorithm 8.1 describes the integration into pattern-growth-based frequent-subgraph-mining algorithms (see Section 2.3.3). The algorithm works recursively, and the steps in the algorithm are executed for every node in Figure 8.2. Lines 1–2, 9–13 and 20 are the generic steps in pattern-growth-based graph mining [YH06]. They perform the isomorphism test (Lines 1–2), add patterns to the result set (Line 9) and extend the current pattern (Line 11), leading to a set of frequent patterns P. The algorithm then processes them recursively (Lines 12–13) and stops depth-first search when P is empty (Line 20).

Lines 4–7 and 15–17 are new in our extension. Instead of directly adding the current pattern p into the result set F, the algorithm first checks the size_min parameter (Line 4). Only if the minimum size is reached, it calculates the weight-based measures (Line 5). Line 7 checks the constraints (if c_l or c_u is not set, the thresholds are 0 or ∞, respectively; see Definition 8.2). If they are not violated, or the minimum size is not reached, the algorithm saves the pattern to the result set (Line 9) and continues as in generic pattern growth (Lines 12–13). Otherwise, the algorithm prunes the search, i.e., it does not continue the search in that branch. Note that this step is critical, as it determines both the speedup and the result quality. As mentioned before, we always save the last pattern before we prune due to upper bounds (Lines 16–17). This leads to result sets which are larger than those from standard graph mining when the constraints are applied in a postprocessing step.
One can realise constraints on more than one measure in the same way, by evaluating several constraints instead of one, at the same step of the algorithm. As mentioned before, mining with weight-based constraints produces a result set with unweighted subgraph patterns. In case one needs weighted subgraphs in the result set, arbitrary


Algorithm 8.1 pattern-growth(p, D, supp_min, t_l, t_u, size_min, F)
Input: current pattern p, database D, supp_min, parameters measure, t_l, t_u and size_min
Output: result set F
 1: if p ∈ F then
 2:     return
 3: end if
 4: if |p| ≥ size_min then
 5:     calculate weight-based measures for all edges
 6: end if
 7: if (∃ e1 : measure(e1) > t_l ∧ ∄ e2 : measure(e2) > t_u) ∨ (|p| < size_min) then
 8:     if (algorithm ≠ CloseGraph ∨ p is closed) then
 9:         F = F ∪ {p}
10:     end if
11:     P = extend-by-one-edge(p, D, supp_min)
12:     for all p′ ∈ P do
13:         pattern-growth(p′, D, supp_min, t_l, t_u, size_min, F)
14:     end for
15: else
16:     if ∃ e : measure(e) > t_u then
17:         F = F ∪ {p}
18:     end if
19: end if
20: return
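
The control flow of Algorithm 8.1 can be tried out with a toy miner. The sketch below is a strong simplification and not the gSpan or CloseGraph extension itself: node labels are assumed to be unique, so a pattern is just a set of edges and the isomorphism test becomes a membership test; the closedness check of Line 8 is omitted; variance serves as the measure; and all helper names as well as the example database are hypothetical. The lower bound is disabled with t_l = -1, since a variance of 0 would not exceed a threshold of 0.

```python
import math
from statistics import pvariance

def support(p, D):
    """Graphs of D that contain every edge of pattern p."""
    return [g for g in D if all(e in g for e in p)]

def edge_measures(p, D):
    """Illustrative measure: variance of each edge's weight over the supporting graphs."""
    graphs = support(p, D)
    return [pvariance([g[e] for g in graphs]) for e in p]

def extend_by_one_edge(p, D, supp_min):
    """All frequent patterns obtained by adding one edge adjacent to p."""
    nodes = {n for e in p for n in e}
    candidates = {e for g in support(p, D) for e in g
                  if e not in p and nodes & set(e)}
    return [p | {e} for e in candidates if len(support(p | {e}, D)) >= supp_min]

def pattern_growth(p, D, supp_min, t_l, t_u, size_min, F):
    if p in F:                                     # Lines 1-2: pattern already handled
        return
    small = len(p) < size_min
    values = [] if small else edge_measures(p, D)  # Lines 4-6
    lower_ok = any(v > t_l for v in values)
    upper_ok = not any(v > t_u for v in values)
    if (lower_ok and upper_ok) or small:           # Line 7
        F.add(p)                                   # Line 9
        for q in extend_by_one_edge(p, D, supp_min):   # Lines 11-13
            pattern_growth(q, D, supp_min, t_l, t_u, size_min, F)
    elif not upper_ok:                             # Lines 16-17: save, then prune
        F.add(p)

def mine(D, supp_min, t_l=0.0, t_u=math.inf, size_min=0):
    F = set()
    for e in {e for g in D for e in g}:            # start from every single edge
        p = frozenset([e])
        if len(support(p, D)) >= supp_min:
            pattern_growth(p, D, supp_min, t_l, t_u, size_min, F)
    return F

# Hypothetical database: three path graphs a -> b -> c -> d; only b -> c varies.
D = [
    {("a", "b"): 1, ("b", "c"): 5, ("c", "d"): 2},
    {("a", "b"): 1, ("b", "c"): 50, ("c", "d"): 2},
    {("a", "b"): 1, ("b", "c"): 5, ("c", "d"): 2},
]
unconstrained = mine(D, supp_min=2, t_l=-1.0)
constrained = mine(D, supp_min=2, t_l=-1.0, t_u=10.0)
print(len(unconstrained), len(constrained))
```

Without an upper bound, all six connected subgraphs of the path are found. With the upper bound on variance, every pattern containing the edge b → c is saved but not extended, so the three-edge pattern is never reached and five patterns remain: an approximate result obtained with less search.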


functions, e.g., the average, can be used to derive weights from the supporting graphs in the graph database.

Closed Mining
Closed mining returns closed graph patterns only (see Section 2.3.3). When dealing with weight-based constraints, we deviate from this characteristic. We favour graphs which are interesting (according to the measures) over graphs which are closed. This is because the weight-based constraints might stop mining when 'interesting enough' patterns are found. Extending the CloseGraph [YH03] algorithm is slightly more complicated than pattern growth as described before. CloseGraph performs further tests in order to check for closedness (Line 8 in Algorithm 8.1). In our extension, these tests are done after weight-based pruning. Therefore, when the search is pruned due to a constraint, it might happen that the algorithm misses a larger closed pattern. In this case it adds patterns to the result set which are not closed.

Implementation
The extensions we describe here are compatible with any pattern-growth graph miner. We for our part use the ParSeMiS graph-mining suite [PWDW09] with its gSpan [YH02] and CloseGraph [YH03] implementations (see Section 2.3.3).

8.4 Weighted Graph Mining Applied
We now say how to exploit the information contained in the weights of graphs in different application scenarios building on weight-constraint-based frequent subgraph mining. Concretely, we first review our software-defect-localisation scenario from Chapter 5 (Section 8.4.1). Then we introduce weighted graph classification (Section 8.4.2) and explorative graph mining (Section 8.4.3).

8.4.1 Software-Defect Localisation
In order to localise defects with our weight-constraint-based frequent-subgraph-mining technique, we alter the defect-localisation approach from Section 5.3.1 as follows: Instead of employing two separate analysis steps for frequent subgraph mining and weight analysis (Lines 6 and 7 in Algorithm 5.1, respectively), we perform a single weight-constraint-based subgraph-mining step. Our implementation calculates the values of the employed measure for all edges in all frequent subgraphs by default, and we interpret these values as defectiveness likelihoods. We now use the InfoGain measure instead of gain ratio, as InfoGain leads to results of the same quality (see Section 7.4.3) and can be calculated more efficiently (see Definition 2.7).


As in Section 5.3.1, we then use the maximum value from all outgoing edges in all subgraphs in the result set as the weight-based likelihood of a method m:

\[ P_w(m) := \max\bigl(\mathit{measure}(\{(m,x) \mid (m,x) \in E \wedge x \in V\})\bigr) \]

where V and E are the unions of the vertex and edge sets of all subgraph patterns in the result set, and measure applied to a set calculates the measure of every element separately.

Similar to our combined approach in Section 5.3.2, we look at the subgraph structures as well. The result sets mined with weight-based constraints let us define another likelihood based on support. They contain a higher number of interesting graphs with interesting edges (according to the measure chosen) than a result set from vanilla graph mining. Therefore, it seems promising not only to give a high likelihood to edges with interesting weights. We additionally consider nodes (methods) occurring frequently in the graph patterns in the result set. We calculate this structural likelihood similar to a support in the result set F:

\[ P_s(m) := \frac{|\{f \mid f \in F \wedge m \in f\}|}{|F|} \]

The next step is to combine the two likelihoods. We do this by averaging the normalised values:

\[ P_{comb}^{constr}(m) := \frac{P_w(m)}{2 \max\limits_{n \in V(sg),\, sg \subseteq g \in D} (P_w(n))} + \frac{P_s(m)}{2 \max\limits_{n \in V(sg),\, sg \subseteq g \in D} (P_s(n))} \]

where n is a method in a subgraph sg of the database of all call graphs D.

For the evaluation of this technique, one can use the measures we have used in the previous chapters. In particular, a suitable evaluation measure is the amount of methods or source code a developer has to investigate when the debugging process is guided by the ranking obtained with P_comb^constr.
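
The three likelihoods can be computed in a few lines. The sketch below rests on assumptions for illustration: the result set F is a list of patterns given as sets of (caller, callee) edges, there is one measure value per edge, methods without outgoing edges get a weight-based likelihood of 0, and the maxima are taken over all methods in the result set. The function name and the example values are hypothetical.

```python
def likelihoods(F, measure):
    """Return (P_w, P_s, P_comb) as dictionaries indexed by method name."""
    methods = {n for f in F for e in f for n in e}
    edges = {e for f in F for e in f}
    # Weight-based likelihood: maximum measure over all outgoing edges of a method.
    p_w = {m: max((measure[e] for e in edges if e[0] == m), default=0.0)
           for m in methods}
    # Structural likelihood: fraction of result patterns in which the method occurs.
    p_s = {m: sum(any(m in e for e in f) for f in F) / len(F) for m in methods}
    # Combination: average of the two values, each normalised by its maximum.
    max_w = max(p_w.values()) or 1.0   # guard against division by zero
    max_s = max(p_s.values()) or 1.0
    p_comb = {m: p_w[m] / (2 * max_w) + p_s[m] / (2 * max_s) for m in methods}
    return p_w, p_s, p_comb

# Hypothetical result set with two patterns and measure values for their edges.
F = [{("a", "b")}, {("a", "b"), ("b", "c")}]
measure = {("a", "b"): 0.8, ("b", "c"): 0.4}

p_w, p_s, p_comb = likelihoods(F, measure)
ranking = sorted(p_comb, key=p_comb.get, reverse=True)
print(ranking)
```

In this example, method a has the highest combined likelihood, followed by b and c, so a developer would inspect a first.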

8.4.2 Weighted-Graph Classification
Subgraph patterns from weighted graphs cannot directly be used for classification. With unweighted graphs, it is common to use binary feature vectors, indicating which subgraph is included in a graph [CYH10a]. Every such vector corresponds to a graph in the graph database. In the following, we explain how we assemble feature vectors including weights to use them for classification. We use one feature in the vector for every edge in every frequent subgraph mined. These features are numerical and stand for the corresponding weight in the original graph. If a graph does not contain a certain subgraph, the corresponding features are null values.


Figure 8.3: Two typical fragments from a small unconnected graph in the logistics dataset.

Example 8.5: We construct a feature vector for the graph in Figure 8.3. Imagine that there are two frequent subgraphs, A → E → F and L → M. The vector consists of the values of the edges A → E, E → F and L → M: (25, 29, 53).

In cases where labels in the subgraph patterns are not unique, the position of an edge in a subgraph describes a certain edge. In case of multiple embeddings of a pattern, we use aggregates of the weights from all embeddings. This encoding allows analysing every edge weight in the context of every subgraph.
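
The encoding of Example 8.5 can be reproduced with a short sketch. It assumes unique node labels (so that no aggregation over multiple embeddings is needed) and represents a graph as a dictionary from edges to weights; only the three edge weights named in the example are used, and the second graph is hypothetical. Null values are represented by None.

```python
def feature_vector(graph, subgraphs):
    """One feature per edge of every frequent subgraph; None if the subgraph is absent."""
    vector = []
    for sg in subgraphs:
        present = all(e in graph for e in sg)
        vector.extend(graph[e] if present else None for e in sg)
    return vector

# Frequent subgraphs A -> E -> F and L -> M, each given as an ordered list of edges.
subgraphs = [[("A", "E"), ("E", "F")], [("L", "M")]]

# Edge weights taken from Example 8.5.
graph = {("A", "E"): 25, ("E", "F"): 29, ("L", "M"): 53}
# Hypothetical second graph that does not contain the first subgraph completely.
other = {("A", "E"): 7, ("L", "M"): 9}

print(feature_vector(graph, subgraphs))
print(feature_vector(other, subgraphs))
```

The first call yields the vector (25, 29, 53) from the example. In the second graph, both features of the first subgraph are null, although the edge A → E is present, because the subgraph as a whole is not contained.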
Finally, any classifier featuring numerical attributes and null values can work with the vectors, to learn a model or to make predictions. Arbitrary evaluation measures for classification can quantify the predictive quality of the weighted-graph-classification problem. We for our part use the established measures accuracy and AUC (area under the ROC curve; see, e.g., [WF05]).

8.4.3 Explorative Mining
Besides automated analysis steps following graph mining, another important application is explorative mining. Here, the results are interpreted directly by humans. One is interested in deriving useful information from a dataset. In our weight-constraint-based scenario, such information is represented as subgraphs with certain edge-weight properties in line with the constraints. For instance, the logistics dataset is well suited for explorative mining. As motivated in Example 8.3, one might be interested in subgraphs featuring edges with high or low variance.

Evaluation in this context is difficult, as it is supposed to provide information for humans. Therefore, it is hard to define a universal measure. In this study, we focus on basic properties of the dataset mined, in particular the size of the subgraphs. This size can be seen as a measure of expressiveness, as larger subgraphs tend to be more significant.


8.5 Experimental Evaluation
We now investigate the characteristics of pruning with several non-anti-monotone constraints, given the real-world analysis problems described before. We do so by comparing different application-specific quality criteria with the speedup in runtime as well as by assessing the completeness of approximate result sets. While other solutions to the real-world problems (without weighted graph mining) might be conceivable as well, studying them to a larger extent is not the concern of this dissertation. (In Section 5.4.3, we have compared call-graph-based software-defect localisation to alternative proposals from the literature.) We first describe the datasets in Section 8.5.1. We then present the experimental settings in Section 8.5.2 and the results in Section 8.5.3.

8.5.1 Datasets
Software-Defect Localisation
We investigate the dataset we have already used in Chapter 5 (see Section 5.4.1), which consists of classified weighted call graphs. It consists of 14 defective versions of a Java programme. Every version was executed exactly 100 times with different input data, resulting in roughly the same number of graphs representing correct and failing executions. The graphs are quite homogeneous; the following numbers describe one of the 14 datasets. The mean number of nodes is 19.6 (standard deviation σ = 1.9), the mean number of edges is 23.8 (σ = 4.6), but the edge weights are quite diverse with a mean value of 227.6 (σ = 434.5).
Logistics
This dataset is the one from [JVB+05]. It is origin-destination data from a logistics company, attributed with different information. The graphs are as follows: Transports fall into two classes with full truckload (TL) and less than truckload (LTL). The transports from the two classes form two sets of graphs, which we label accordingly. We further arrange transports (edges) with a similar weight of the load in one graph. Next, as the spatial coordinates in the dataset are fine grained, we combine locations close to each other to a single node, e.g., locations from the same town. We use the time needed to get from origin to destination as edge weight. The duration is a crucial parameter in transportation logistics, and there is no obvious connection to the class label. The dataset describes a weighted-graph-classification problem, i.e., predict if a graph contains fully or partly-loaded transports.

Finally, the dataset consists of 51 graphs. The two class labels are evenly distributed, the mean number of nodes is 234.3 (σ = 517.1), and the mean number of edges is 616.1 (σ = 2,418.6). As is indicated by the high standard deviations, this is a very diverse dataset, containing some very large graphs. The large graphs are not problematic for mining algorithms in this case, as most graphs are unconnected, and the fragments are quite small. Besides heterogeneous structural properties, the edge weights with a mean value of 73.2 (σ = 50.9) are quite close to each other. Figure 8.3 is a part of one of the logistics graphs.

8.5.2 Experimental Settings
In our experiments we compare a regular CloseGraph implementation to ours with weight-based constraints. We evaluate the quality of the results with scenario-specific evaluation measures (see Section 8.4) along with the runtime. We use a single core of an AMD Opteron 2218 with 2.6 GHz and 8 GB RAM for all experiments. We mine with a supp_min of 3 in all experiments with the defect-localisation dataset and with a supp_min of 8 in all experiments with the logistics data. We set the size_min to 0 in all experiments, as we are interested in the pure results with the different lower and upper bounds.

Software-Defect Localisation
In this scenario, we compare our results based on edge-weight-based pruning with a vanilla graph-mining technique. To be fair, we repeat the experiments from Chapter 5 with slight revisions¹ and the same supp_min (3). We use upper-bound constraints on the two class-aware measures.

Weighted-Graph Classification

For classification experiments, we use both datasets. In the software-defect-localisation dataset, we predict the class labels correct or failing, in the logistics dataset the truck-load labels TL and LTL (see Section 8.5.1). We mine the graph databases with different upper-bound-constraint thresholds on the two class-aware measures and assemble feature vectors, as described in Section 8.4. We then use them along with the corresponding class labels in a 10-fold-cross-validation setting with standard algorithms. In concrete terms, we use the Weka implementation [HFH+09] of the C4.5 decision-tree classifier [Qui93] and the LIBSVM support-vector machine [CL01] with standard parameters. For scalability reasons, we employ a standard chi-squared feature-selection implementation [HFH+09] for dimensionality reduction before applying LIBSVM.

¹ In Chapter 5, a zero in the feature vectors indicates that a certain call does not occur. We now use null values, as this allows for a fair comparison to the new approach.

[Figure 8.4: Experimental results. (a) Runtimes for the software dataset. (b) Runtimes for the logistics dataset. (c) Average defect position. (d) Average position for each defect. Panels (a) and (b) plot the runtime in seconds against the upper bound threshold, panel (c) plots the defect position against the upper bound threshold, and panel (d) plots the defect position against the dataset number. The series are ‘no pruning’, InfoGain and PMCC.]

Explorative Mining

For explorative-mining experiments, we investigate different lower-bound-constraint thresholds on variance in the logistics dataset. We compare their quality and runtime with mining runs without constraints.

8.5.3 Experimental Results

Software-Defect Localisation

Figure 8.4(a) displays the runtimes of InfoGain and PMCC with different upper-bound thresholds on all 14 versions of the dataset. Mining with the InfoGain constraint is always faster than mining without pruning, irrespective of the threshold. For low threshold values (0.01 to 0.04), InfoGain reaches speedups of around 3.5. PMCC in turn always performs better than InfoGain and reaches speedups of up to 5.2. This is natural, as the calculations to be done during mining in order to derive the measures are more complicated for InfoGain (involving logarithms) than for PMCC. For high thresholds (0.32 to 0.8) on both measures, the runtime increases significantly. This is caused by less pruning with such thresholds.
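
To make the difference in computational effort between the two measures concrete, the following Python sketch computes both for a single edge-weight feature and a binary class label. It is a minimal illustration and not the implementation used in the experiments: the function names, the toy weights and the choice to evaluate InfoGain over binary splits of the form `weight <= t` are assumptions made here.

```python
import math


def pmcc(xs, ys):
    """Pearson's product-moment correlation coefficient of two sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here for constant sequences
    return cov / math.sqrt(var_x * var_y)


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * math.log2(p)
    return result


def info_gain(weights, labels):
    """Largest information gain over all binary splits 'weight <= t' (assumed split scheme)."""
    base = entropy(labels)
    best = 0.0
    for t in sorted(set(weights))[:-1]:  # the largest weight would leave one side empty
        left = [l for w, l in zip(weights, labels) if w <= t]
        right = [l for w, l in zip(weights, labels) if w > t]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        best = max(best, base - remainder)
    return best


# Hypothetical weights of one edge in four graphs; label 1 = failing, 0 = correct.
weights = [40, 12, 38, 10]
labels = [1, 0, 1, 0]
print(pmcc(weights, labels), info_gain(weights, labels))
```

In this sketch, only `info_gain` needs logarithms and a scan over candidate split points, whereas `pmcc` is a single pass of sums. This is in line with the explanation given above for the runtime difference.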
Figure 8.4(c) contains the results in defect localisation without pruning and with InfoGain and PMCC pruning with various upper bounds. The figure shows the average position of the defect in the returned ranking of suspicious methods, averaged for all 14 versions. InfoGain almost always performs a little bit (a fifth ranking position for the two lowest thresholds) better than the baseline (‘no pruning’). As the baseline approach uses InfoGain as well, we explain this effect by the improved structural likelihood computation (P_s, see Section 8.4), which takes advantage of the edge-weight-based pruning. The PMCC curve is worse in most situations. This is as expected, as we know that entropy-based measures perform well in defect localisation (see Section 5.3.1). Figure 8.4(d) contains the defect-localisation results for the 14 different versions. We use the average of the three executions with the best runtime (thresholds 0.01 to 0.04). The figure reveals that the precision of localisations varies for the different defects, and the curve representing the InfoGain pruning is best in all but two cases. Concerning the threshold values, observe that small changes always lead to very small changes in the resulting defect-localisation precision, with mild effects on runtime.

Next to the defect-localisation results, the performance of classifiers learned with the software dataset is very high. The values with InfoGain-pruning only vary slightly for the different thresholds on both classifiers, the SVM (accuracy: 0.982–0.986; AUC: 0.972–0.979) and the decision tree (accuracy: 0.989–0.994; AUC: 0.989–0.994). Although the variance is very low, higher thresholds yield slightly higher values in most cases. This is as expected, as less pruning leads to larger graphs, encapsulating potentially more information. With PMCC, the values are very close to those before, and one can make the same observation.

Logistics

Figure 8.4(b) shows the runtimes of both measures with different upper-bound thresholds. With an upper bound of up to 0.10 on InfoGain or PMCC, our extension runs about 2.9 times faster than the reference without pruning. For larger upper bounds on PMCC, graph mining with our extension still needs only half of the runtime. InfoGain becomes less efficient for larger values, and for a high threshold of 0.75 it needs the same time as the algorithm without edge-weight-based pruning. As before, PMCC performs better than InfoGain.

In the experiments, the performance of classifiers does not depend on the upper-bound threshold. We evaluated the same values as in Figure 8.4(b). For the InfoGain measure, accuracy and AUC of the SVM are 0.902 and 0.898, and they are a little lower with the decision tree: 0.863 and 0.840. For PMCC, the results are the same for most upper bounds. Only for the bounds 0.50 and 0.75, where less pruning takes place and more subgraphs are generated, the results are slightly better (decision tree only). Next to classification performance, the runtimes change only slightly when the threshold values change.

These results demonstrate that the edge weights in this dataset are well suited for classification. Further, the degree of edge-weight-based pruning did not influence the results significantly. Therefore, InfoGain and PMCC obviously are appropriate measures. With low upper-bound values on both measures, the runtime can be improved by a factor of about 2.9, while the classifiers have almost the same quality. On the other side, these results also show that the graph structure of this particular dataset is less important to solve the classification problem than the edge weights.

[Figure 8.5: Experimental results. (a) Runtimes & quality, logistics dataset. (b) Comparison of approximate result sets. Panel (a) plots the runtime in seconds and the pattern size against the lower bound threshold, including ‘no pruning’. Panel (b) plots the number of patterns against the upper bound threshold for the series ‘reference’, ‘const.’ and ‘const. & postproc.’.]

Besides the performance of classification, we also evaluate the variance measure in an explorative mining setting on the logistics dataset. Figure 8.5(a) shows the runtimes with several lower bounds along with the corresponding averaged subgraph-pattern sizes (in edges) in the result set. At the lowest threshold (50), the runtime already decreases to 73% of the runtime without pruning. At the highest value (5,000), the runtime decreases to 7% only, which is a speedup of 13. At the same time, the average subgraph size decreases from 7 to 1. Therefore, values between 250 and 1,000 might be good choices for this dataset (depending on the user requirements), as the runtime is 3 to 7 times faster, while the average subgraph size decreases moderately from 7.4 to 6.1 and 4.6.

Completeness of Approximate Result Sets

We now investigate the completeness of our result sets and look at the defect-localisation experiments with InfoGain-constraints another time. Figure 8.5(b) refers to these experiments with the approximate constraint-based CloseGraph algorithm, but displays the sizes of result sets (averaged for all 14 versions). We compare these results with a non-approximate reference, obtained from a non-constrained execution, where we remove all subgraph patterns violating an upper bound afterwards. Our constraint-based mining algorithms save all patterns violating upper bounds before pruning the search (see Section 8.3). For comparison, we apply the same postprocessing as with the reference and present two variants of constraint-based mining: The pure variant (‘const.’) and the postprocessed one (‘const. & postproc.’). Comparing the two postprocessed curves, for thresholds of 0.64 and larger, constraint-based result sets have the same size as the reference and are smaller for thresholds of 0.32 and lower. Preliminary experiments with different supp_min values have revealed that the difference between the curves decreases (supp_min of around 20 instead of 3) or vanishes (supp_min of 70). The pure result sets (those we used in the experiments before), are always larger than closed mining, even if no constraints are applied. To conclude, our approximate result sets contain less than half of the patterns as the non-approximate reference, for small supp_min and upper bound values. However, the pure result sets obtained from constraint-based mining in a shorter runtime (see Figure 8.4(a)) contain many more interesting subgraph patterns (see curve ‘const.’), which is beneficial for the applications.
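
The comparison of the three result sets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: patterns are represented by hypothetical names with made-up measure values, a pattern is taken to violate the upper bound when its value exceeds it, and the toy data assumes that pattern "ABC" can only be generated by extending "AB".

```python
def postprocess(patterns, upper_bound):
    """Keep only patterns whose measure value does not exceed the upper bound."""
    return {name: value for name, value in patterns.items() if value <= upper_bound}


# Hypothetical measure values per subgraph pattern from an unconstrained run.
unconstrained = {"A": 0.10, "AB": 0.50, "AC": 0.05, "ABC": 0.20}

# Constraint-based run: the violating pattern "AB" is saved, but the search is
# pruned there, so its extension "ABC" is never generated.
constrained = {"A": 0.10, "AB": 0.50, "AC": 0.05}

UPPER_BOUND = 0.30
reference = postprocess(unconstrained, UPPER_BOUND)     # curve 'reference'
const_postproc = postprocess(constrained, UPPER_BOUND)  # curve 'const. & postproc.'
missing = set(reference) - set(const_postproc)          # patterns lost by pruning

print(len(reference), len(constrained), len(const_postproc), sorted(missing))
```

In the toy data, the postprocessed constraint-based set is smaller than the reference because "ABC" satisfies the bound but lies behind a pruned pattern, while the pure constraint-based set still contains the violating pattern "AB".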

8.6 Subsumption

In this chapter, we have dealt with mining of weighted graphs. We have integrated non-anti-monotone constraints based on weights into pattern-growth frequent-subgraph-mining algorithms. This has led to improved runtime and approximate results. The goal of our study was to investigate the quality of these results. Besides an assessment of result completeness, we have evaluated its usefulness, i.e., the result quality of higher-level real-world analysis problems based on this data. The evaluation shows that a correlation of weights with the graph structure exists and can be exploited by means of faster analyses. Frequent subgraph mining with weight-based constraints has proven to be useful – at least for the problems investigated.

Besides the hierarchical approach presented in Chapter 6, weight-constraint-based approximate mining is another contribution to scalable software-defect localisation. It allows to perform faster analyses than with our approach presented in Chapter 5. Alternatively, constraint-based approximate mining allows for analyses of larger software projects. At the same time, the results in defect localisation even are a little more precise.

The constraint-based approximate-mining approach presented in this chapter can be employed in a hierarchical-mining scenario (see Chapter 6) and with dataflow-enabled call graphs (DEC graphs, see Chapter 7) without any special challenges. Both proposals (hierarchical mining and DEC graphs) feature graphs with edges annotated with tuples of weights, and our approach can deal with several constraints at the same time. Therefore, constraints can be defined on all tuple elements. However, when measures need to be calculated for an increased number of weights, this will increase runtime. Depending on the number of tuple elements and the nature of the dataset investigated, the post-processing approaches used in Chapters 6 and 7 might be faster than the constraint-based variant. However, there are also possibilities to increase the efficiency of the implementation presented in this chapter. For instance, incremental techniques could be used for the calculation of the measures during mining.


9 Conclusions and Future Research Directions

Defect localisation is an important problem in software engineering. In this dissertation, we have investigated call-graph-mining-based software defect localisation, a relatively recent direction of research in defect localisation. Respective approaches aim at supporting software developers by providing hints where defects might be located, in order to reduce the amount of code a developer has to inspect manually. They rely on the analysis of dynamic call graphs, which are representations from correct and failing programme executions. In this dissertation, we have investigated call-graph-mining-based techniques to draw conclusions on their suitability to derive useful defect localisations. To this end, we have extended the state-of-the-art in call-graph-based defect localisation in various ways. This leads to a broader range of detectable defects, to an increased localisation precision and finally to the conclusion that dynamic call graphs are indeed a suitable data representation for defect localisation.

From a data-mining point of view, mining dynamic call graphs bears a number of challenges. Most importantly, graphs need to be represented adequately, techniques for mining weighted graphs and to derive defect localisations need to be developed, and respective techniques should scale for the analysis of large software projects. In this dissertation, we have dealt with all these challenges, resulting in several call-graph representations, various techniques for defect localisation with weighted call graphs and a technique for graph mining with weight-based constraints. The weight-constraint-based technique in particular is not limited to the software-engineering application domain, but is a general approach for constraint-based mining of weighted graphs.

In the following, we review the different contributions of this dissertation in more detail (Section 9.1), discuss the lessons learned (Section 9.2) and present some interesting opportunities of future work (Section 9.3).

9.1 Summary of this Dissertation

At the beginning of this dissertation, we have observed that related call-graph-based defect-localisation techniques localise structure-affecting bugs well, but have difficulties in localising frequency-affecting bugs (Chapter 5). This is as the graphs analysed do not encode the information needed to derive good defect localisations. Therefore, we have proposed graph representations that include such information: call frequencies annotated as numerical edge weights (Chapter 4). In order to use the respective graphs for defect localisation, we have developed a technique that analyses both the graph structure (topology) and the numerical edge weights (Chapter 5). To this end we have developed a combined approach that consists of frequent subgraph mining and feature selection. Besides this, we have identified that the relatively severe total-reduction techniques for call graphs used in the related work lead to a loss of structural information. We have therefore also defined call-graph representations that are a little larger than totally reduced graphs and encode more structural information. In order to be able to localise both kinds of defects, structure-affecting bugs and frequency-affecting bugs, we have proposed combined approaches for defect localisation.

In a first evaluation (Chapter 5) with defects we have artificially seeded into a small programme, we have shown that our call-graph representations and analysis techniques are indeed useful for the localisation of defects. Concretely, we have achieved defect-localisations with a doubled precision compared to related call-graph-based techniques. We have also shown that our approach can detect defects that other approaches cannot detect in principle. Further, we have demonstrated that the numerical information kept with our call-graph representations is important for good results. Besides the comparison to closely related techniques, we have also compared our technique to established techniques from software engineering. The result has been, based on our admittedly relatively small test suite, that our technique performs better than these approaches in most cases.
The next step in this dissertation has dealt with scalability and with a broader evaluation with a real software project and defects from the field (Chapter 6). Based on the observation that our approach proposed so far has difficulties scaling to larger software projects (the approaches from the related work are faced with the same problem), we have proposed a hierarchical procedure: Starting with novel call-graph representations at coarse levels of abstraction, our approach identifies suspicious regions in the call graphs and then zooms in into these regions. There, it applies the same technique to graphs of a more fine-grained abstraction etc. In an evaluation with real defects from a relatively large open-source project, we have shown that our new call-graph abstractions, as well as our hierarchical mining procedure, are well suited to localise defects and scale for larger software projects. In particular, in our experiments, the source code a software developer has to investigate could be limited to 6% of the whole software project on average. To our knowledge, this is the first study applying call-graph mining to a software project of this size.
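
The zoom-in idea of the hierarchical procedure can be sketched as follows. This is a simplified illustration and not the procedure from Chapter 6: the hierarchy, the unit names and the suspiciousness scores are hypothetical, and the `score` function stands in for the call-graph mining and ranking that would be applied at each level of abstraction.

```python
def localise_hierarchically(root, children_of, score, top_k=1):
    """Rank units level by level, descending only into the top_k most suspicious ones."""
    ranked = [root]
    while True:
        # Zoom into the most suspicious regions of the current level.
        candidates = [c for unit in ranked[:top_k] for c in children_of(unit)]
        if not candidates:
            return ranked  # finest level reached
        ranked = sorted(candidates, key=score, reverse=True)


# Hypothetical project hierarchy: package -> class -> method.
hierarchy = {
    "project": ["pkg.a", "pkg.b"],
    "pkg.a": ["pkg.a.X"],
    "pkg.b": ["pkg.b.Y", "pkg.b.Z"],
    "pkg.a.X": ["pkg.a.X.m1"],
    "pkg.b.Y": ["pkg.b.Y.m2", "pkg.b.Y.m3"],
    "pkg.b.Z": ["pkg.b.Z.m4"],
}
# Hypothetical suspiciousness values, as a ranking step might produce them.
suspiciousness = {
    "pkg.a": 0.1, "pkg.b": 0.9,
    "pkg.b.Y": 0.8, "pkg.b.Z": 0.2,
    "pkg.b.Y.m2": 0.3, "pkg.b.Y.m3": 0.7,
}

ranking = localise_hierarchically(
    "project",
    children_of=lambda unit: hierarchy.get(unit, []),
    score=lambda unit: suspiciousness.get(unit, 0.0),
)
print(ranking)
```

Only the units below the most suspicious package and class are ever scored at method level, which is the reason why such a procedure can keep the graphs to be mined small.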
A principal problem of all call-graph-based approaches for defect localisation – including our techniques described so far – is that they analyse the graph structure, but are agnostic regarding the dataflow. Therefore, they are unable to detect defects that influence the dataflow only. In order to be able to capture those defects as well, we have proposed dataflow-enabled call graphs that include abstractions referring to the dataflow (Chapter 7). As well, we have adapted our mining technique and have evaluated our procedure with various defects. The result is that considering the dataflow information allows us to localise defects that cannot be localised otherwise with techniques relying on call graphs. Furthermore, with these enhancements, we are able to increase the defect-localisation precision of a number of further defects.
In most parts of this dissertation, we have relied on combined approaches for mining weighted graphs and ultimately for the localisation of defects. That is, caused by the absence of suitable techniques for mining weighted graphs directly, we have employed frequent subgraph mining in a first analysis step and feature selection in a subsequent postprocessing step. At the end of this dissertation (Chapter 8), we have proposed a unified approach for mining weighted graphs in a single analysis step. Concretely, we have pushed the postprocessing step into the mining algorithm by formulating and processing weight-based constraints. These constraints consider the weights of the graph and allow for pruning the internal search space of frequent subgraph mining. This leads to speed-ups of the mining algorithm (e.g., 3.5 times for the defect-localisation dataset), while obtaining results of a comparable quality. For defect localisation, we have even obtained mining results that are a little more precise. Besides the application to defect localisation, mining with weight-based constraints is a universal approach, and we have as well successfully evaluated it with data from a completely different domain, transportation logistics.

In this dissertation, we have focused on the localisation of defects that occur in single-threaded sequential programmes. However, there are certain classes of defects related to the parallel execution of several threads within the same programme. In order to show that call-graph-based techniques are in principle as well suited to localise certain defects referring to parallel executions, we have performed a first study with a call-graph representation for multithreaded programmes and an adapted localisation technique (Appendix A). The result is that certain defects can be localised well, but that further investigations will probably lead to more sophisticated call-graph representations that allow for the localisation of a broader range of defects related to parallel executions.

9.2 Lessons Learned

Throughout the research conducted for this dissertation, we have experienced and learned many things. In the following, we highlight the most important lessons we have learned.

Data representations are key for good results. Although it seems to be quite obvious, a data-mining technique can only find patterns, predict behaviour or localise defects if there is respective evidence in the data. In the context of this dissertation, we have investigated several call-graph variations as data representations. Our experience is that particularly weighted call graphs are key to successfully localise a broad range of defects (see Chapters 5 and 7). Weighted call graphs are annotated with numerical information such as call frequencies and dataflow abstractions. Further, finding suitable graph topologies is key for both good results (see Chapter 5) and scalable defect localisation. While frequent subgraph mining does not scale for method-level call graphs from large software projects, it can be used for graphs at coarser levels of granularity or for cut-outs of call graphs (see Chapter 6). To sum up, finding the right data representation and acquiring the data needed to solve the analysis problem is essential – maybe even more important than the actual analysis technique. These observations confirm the more general literature on the data-mining process and on applied data mining [CCK+00, FPSS96, HG08].

Dynamic call graphs are a suitable abstraction for defect localisation. In this dissertation, we have investigated the suitability of dynamic call graphs for defect localisation in software. As shown by the evaluations in the different chapters, call-graph mining does lead to defect localisations that are useful. That is, the amount of code that needs to be investigated manually can be reduced significantly. Furthermore, the comparative evaluation in Chapter 5 has shown that our technique can compete with state-of-the-art approaches that do not rely on call graphs, at least using the test suite considered. More concretely, our technique has outperformed the other approaches in 12 out of 14 cases in our test suite, and we have shown that there are types of defects that can be localised with our technique, but not with the other techniques considered. Our conclusion that dynamic call graphs are a suitable abstraction for defect localisation holds for the call-graph representations considered in this dissertation. That is, in particular graphs that are annotated with numerical information referring to call frequencies or dataflow values are well suited – along with an analysis technique that makes use of both structural and numerical evidence encoded in the graphs.

Even if graph mining is expensive, it leads to good results. For the analysis of call graphs, we have successfully followed approaches that employ frequent-subgraph-mining techniques. This allows for a very detailed analysis of numerical weights in the context of the different subgraphs and for the derivation of structural scoring measures. However, graph mining is computationally expensive, and it is the bottleneck of our proposed analysis techniques. Even if instrumenting source code leads to moderate runtime overheads, and running feature-selection algorithms needs some time as well, our experience is that graph mining is the most expensive step. However, with our hierarchical procedures (see Chapter 6), the runtime of graph-mining algorithms is in the range of a few minutes. We consider such runtimes to be acceptable for defect localisation. However, there might be other approaches that analyse the call-graph representations proposed and lead to good localisation results, too. Investigating all such possible approaches was not the aim of this dissertation.

Pruning with non-anti-monotone constraints is useful for applications. Most weight-based constraints are not anti-monotone. Thus, using them for pruning does not guarantee the completeness of the mining results and leads to approximate results. This is as there are no guaranteed laws that relate the graph structure (topology) to the weights attached to the graphs. As topology and weights are nevertheless correlated in many real-world graphs [MAF08], we have proposed weight-based constraints and have integrated them into frequent-subgraph-mining algorithms. Then, we have investigated the effect of approximate (incomplete) result sets on real-world analysis problems. The result is that approximate results have very mild effects on the final result quality, while achieving good speed-ups in runtime.
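
A small Python sketch illustrates why weight-based constraints are typically not anti-monotone. It uses a toy graph database and two simple measures chosen for illustration only: the support of a pattern (a set of edges), and the largest variance of one of its edge weights across the supporting graphs. The data, the pattern representation and the function names are assumptions made here.

```python
from statistics import pvariance

# Toy graph database: each graph maps an edge (caller, callee) to its weight.
graphs = [
    {("a", "b"): 10, ("b", "c"): 1},
    {("a", "b"): 10, ("b", "c"): 50},
    {("a", "b"): 10},
]


def supporting(pattern):
    """All graphs that contain every edge of the pattern."""
    return [g for g in graphs if all(edge in g for edge in pattern)]


def support(pattern):
    return len(supporting(pattern))


def max_weight_variance(pattern):
    """Largest variance of any pattern edge's weight across the supporting graphs."""
    gs = supporting(pattern)
    return max(pvariance([g[edge] for g in gs]) for edge in pattern)


small = (("a", "b"),)
large = (("a", "b"), ("b", "c"))  # extension of 'small' by one edge

print(support(small), support(large))
print(max_weight_variance(small), max_weight_variance(large))
```

Support can only shrink when a pattern is extended, so pruning on support is safe. The variance measure, in contrast, grows from the smaller to the larger pattern in this toy data. Pruning the smaller pattern because it misses a lower bound on variance would therefore also discard an extension that satisfies the bound, which is the reason why the result sets become approximate.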

9.3 Future Research Directions

There are a number of problems with respect to call-graph-mining-based defect localisation that we did not address in order to focus the scope of this dissertation. Some possible extensions of the proposed techniques have already been discussed in the subsumption sections of the individual chapters. We now highlight some more general directions of possible future research. They build on – or arise from – the techniques, results and lessons learned in this dissertation.

Clustering Call Graphs. Caused by the class hierarchy of a grown software project, class and package sizes frequently are very imbalanced. This can lead to a limited applicability of the hierarchical mining approach proposed in Chapter 6 and thus to scalability issues. Furthermore, the manual assignment of software entities to larger units, as typically done by the software developer (e.g., of a class to a package), is often arbitrary. To overcome such problems, it would be helpful to have natural and balanced hierarchies. Such hierarchies could be obtained by means of (weighted) graph clustering [AW10a] on call graphs. More concretely, clustering algorithms could be applied to (sets of) large call graphs. The clusters identified would then be the first hierarchy level, and the same technique could be applied within the individual clusters in order to find more fine-grained clusters. Alternatively, hierarchical clustering methods could be employed. Such an approach has recently been proposed in the context of mining for community structures in (social) networks [HSH+10]. However, it is unclear if clustering techniques can be identified that would result in the desired balanced call-graph hierarchies.

From a general data-mining perspective, clustering call graphs would be interesting as well, because our setting would provide an objective evaluation framework. In this context, ‘objective’ means that cluster-analysis results of different quality are expected to yield results with different defect-localisation precision as well. This is in contrast to numerous evaluations where domain experts have decided how good the various clustering results are.

Weighted Subgraph Mining. In this dissertation, we have proposed two principal approaches for mining weighted call graphs:

1. Postprocessing, i.e., we analyse weighted call graphs by means of a two-step approach, consisting of frequent subgraph mining and feature selection (Chapters 5–7).

2. Constraint-based mining, i.e., we let the user specify constraints based on weights and integrate the two steps from the approach mentioned before into a single analysis step (Chapter 8).

Besides the two approaches mentioned, the discretisation-based approaches presented in Section 3.2.1 analyse weights in a preprocessing step. As a drawback of such approaches, we have identified a loss of information, which would possibly lead to worse defect-localisation results. However, it would be interesting to develop such a preprocessing-based approach that relies on discretisation and is tailored for localising defects with weighted call graphs. Even if discretisation leads to a loss of information, this effect could be minimised by employing supervised discretisation techniques (see Section 7.5). Further, such an approach might lead to other positive properties that compensate for this effect, such as decreased runtime.

Besides the techniques based on preprocessing, postprocessing and weight-based constraints discussed so far – and a small number of further studies presented in Section 3.2.1 – weighted subgraph mining has not drawn a lot of attention. In particular, it has never been studied systematically, and most available approaches deal with very specific analysis problems for dedicated applications. As respective algorithms could be used in many domains where weighted graphs are present, and as they promise to achieve good results, it would be rewarding to systematically investigate weighted subgraph mining. It would in particular be desirable to propose techniques that complement the ones proposed in this dissertation – specifically weight-based constraints – and can be applied to a broad field of applications.

Evaluations with Software Repositories from Large Projects. This dissertation contains a number of sections that evaluate the defect-localisation techniques proposed. Some of these evaluations build on relatively small programmes and on defects that have been seeded artificially into them. However, as it is desirable that localisation techniques scale for large programmes, evaluations with larger software projects substantiate the results of an evaluation. Further, when an evaluation features defects that actually occurred in a real software project, the evaluation is much more credible. In Chapter 6, we have presented an evaluation that features both a real and relatively large project and defects from the field. In order to draw more substantial conclusions about the effectiveness of defect-localisation techniques – not only the ones proposed in this dissertation – it would be desirable if future evaluations would feature even larger software projects, more defects, more kinds of defects and a broader comparison of different defect-localisation techniques. This would require to assemble more test suites that fulfil (parts of) the aspects mentioned before and include enough test cases that lead to both correct and failing executions. Similarly to the iBUGS repository [DZ07, DZ09] we have used in Chapter 6, such test suites could be derived from the test cases and repositories of real (open-source) software projects. Such repositories, in particular bug-tracking systems and revision-control systems, are used in most large software projects and contain lots of interesting data to derive test suites for defect-localisation tools.

Controlled User Experiments with Software Developers. All defect-localisation techniques that have been discussed or have been newly proposed in this dissertation have been evaluated with quantitative evaluation measures. These measures refer directly or indirectly to the amount of code a software developer still has to investigate to find the defect when the respective localisation technique was employed. Thus, they rely on the assumption that all kinds of defects can be identified with the same effort when the same hints are given by a defect-localisation technique. However, this assumption might not hold in reality. This is as developers might have background knowledge that can hardly be assessed, and different defect-localisation results that refer to the same amount of source code to be investigated might be more or less helpful for the developer. Therefore, it would be an interesting experiment to let software developers having the same level of experience find (and fix) defects with the aid of different defect-localisation techniques.

Localising Defects in Multithreaded Programmes. This dissertation has focused on the localisation of defects in single-threaded programmes. However, multithreaded programmes are a challenging field for defect localisation, as respective defects are notoriously hard to localise. In Section 4.3, we have already presented some call-graph representations for multithreaded programmes, and we have conducted a first study on call-graph-based defect localisation in Appendix A. Due to a number of issues related to multithreaded executions, we have employed a relatively simple call-graph representation in this study. However, we believe that alternatives with more sophisticated graph representations that overcome the problems discussed in Section 4.3 are worth being investigated and might substantiate the encouraging results. This is motivated by the experiments with single-threaded programmes in Chapter 5, where graphs more sophisticated than the total reduction have localised defects more precisely. Graph representations for multithreaded programmes can, for instance, include additional information on thread IDs, as well as information about synchronisation constructs used. Section 4.3 contains some more concrete ideas for possible call-graph extensions. Further, the study in Appendix A does not exploit dataflow-related information. A dataflow extension for call graphs from multithreaded programmes, similar to the one in Chapter 7 for the single-threaded case, is likely to make race detection more accurate. This is because unsynchronised threads incorrectly alter data and affect the values in the dataflow in typical race situations. All these ideas – as well as further proposals for call-graph representations – are worth being investigated along with respective defect-localisation techniques to a larger extent.

To conclude this dissertation, we have developed different techniques for call-graph-mining-based defect localisation, and we have shown that they are useful. With the mentioned directions for future work in mind, we feel that more software projects and data-mining problems will benefit from this dissertation, and that data-mining-based defect localisation remains an exciting field of research.


Appendix


A Multithreading Defect Localisation

This dissertation focuses on call-graph-based defect localisation in sequential programmes. Apart from that, debugging multithreaded programmes is an important and challenging field of research of its own (see Section 3.1.3). We have introduced call-graph representations for multithreaded programmes in Section 4.3 and present a first study on localising defects with such graphs in this appendix. It is a variation from our approach in Chapter 5. The result is that call-graph-based defect localisation can be used to localise typical defects in multithreaded programmes. However, there are open questions remaining, and we describe some ideas how to extend call-graph-based defect localisation to adequately deal with defects in multithreaded programmes (see Section 4.3 and Chapter 9).

In this appendix, we first present an introductory overview in Section A.1. Section A.2 introduces a simple approach for defect localisation. Section A.3 evaluates the approach. Section A.4 shows a detailed example. Section A.5 compares our technique with other approaches. Section A.6 is a subsumption of this appendix.

A.1 Overview

Debugging multithreaded programmes is an important and challenging problem. Debugging aids for multithreaded programmes that are available today focus on identifying atomicity violations, race conditions or deadlocks (see Section 3.1.3). These tools are specialised on a particular class of parallel programming errors that are due to wrong usage of synchronisation constructs. In this appendix, we investigate further causes, i.e., anomalies in the execution that might produce wrong parallel programme behaviour. Let us consider another example besides the one presented in Example 3.1:

Example A.1: Think of a programmer who incorrectly uses a sequential memory allocator in a multithreaded context in a language without automatic garbage collection. In rare cases, different threads could allocate overlapping parts of the memory and perform concurrent accesses, which leads to races. Even though race detectors would be able to intervene and show a report when a race occurs on a particular memory location, many tools offer little insight on the real cause of the problem.


The examples illustrate that there is a need for more general defect-localisation techniques to deal with such situations. This appendix addresses this problem area and investigates the usage of call graphs for defect localisation in multithreaded shared-memory programmes. The approach presented aims to detect a wider range of defects that affect parallel execution rather than just race conditions. The controlled experiments with typical applications presented in this appendix show that mining of call graphs works and that it finds defects in multithreaded programmes.

A.2 Multithreading Defect Localisation
As in the other parts of this dissertation, the overall aim of the defect-localisation procedure presented here is to derive a ranking of potentially defective methods. We present an overview of the defect-localisation procedure in Section A.2.1 and then more details on the localisation technique in Section A.2.2.

A.2.1 Overview
Algorithm A.1 works with a set T of traces (obtained from programme executions). Using a test oracle, the algorithm assigns a class (correct or failing) to every trace t ∈ T. Then the algorithm reduces every t to obtain a new call graph (using the R_total^mult call-graph reduction, see Section 4.3), which is assigned to a class of either correct or failing executions. Based on these R_total^mult graphs, the last step calculates for every method m its likelihood of being defective. The likelihood is used to rank the order of potentially defective methods.

Algorithm A.1 Overview of call-graph-based defect localisation.
Input: a set of programme traces t ∈ T
Output: a ranking based on each method's likelihood to be defective P(m)
1: G = ∅ // initialise a set of reduced graphs
2: for all traces t ∈ T do
3:     check if t refers to a correct execution, and assign a class ∈ {correct, failing} to t
4:     G = G ∪ {reduce(t)}
5: end for
6: calculate P(m) for all methods m in G
We employ a test oracle to decide whether a programme execution is correct or not (Line 3 in Algorithm A.1). Such oracles are specific for the examined programme, and their purpose is to decide if a certain execution yields any observable problems (i.e., a failure). An observable problem can be a wrong output or other erroneous behaviour such as a deadlock.
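The control flow of Lines 1 to 5 of Algorithm A.1 can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the prototype used in this appendix: the class `AlgorithmA1Sketch`, the representation of a trace as a list of "caller->callee" strings and the counting reduction are hypothetical stand-ins, and the actual R_total^mult reduction is the one defined in Section 4.3.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of Lines 1 to 5 of Algorithm A.1; all names are illustrative. */
final class AlgorithmA1Sketch {

    /** A reduced call graph (edge "caller->callee" mapped to its weight) plus its class. */
    static final class LabelledGraph {
        final Map<String, Integer> edgeWeights;
        final boolean failing;

        LabelledGraph(Map<String, Integer> edgeWeights, boolean failing) {
            this.edgeWeights = edgeWeights;
            this.failing = failing;
        }
    }

    /** Stand-in reduction (an assumption): count how often each caller->callee pair occurs. */
    static Map<String, Integer> reduce(List<String> trace) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        for (String call : trace) {
            weights.merge(call, 1, Integer::sum);
        }
        return weights;
    }

    /** Labels every trace with the programme-specific oracle and reduces it. */
    static List<LabelledGraph> classifyAndReduce(List<List<String>> traces,
                                                 Predicate<List<String>> oracleSaysFailing) {
        List<LabelledGraph> graphs = new ArrayList<>();            // Line 1: G = empty set
        for (List<String> t : traces) {                            // Line 2: for all traces
            boolean failing = oracleSaysFailing.test(t);           // Line 3: assign a class
            graphs.add(new LabelledGraph(reduce(t), failing));     // Line 4: add reduce(t)
        }
        return graphs;
    }
}
```

The oracle is passed in as a predicate because, as described above, it is specific to the examined programme. Line 6 (the calculation of P(m)) is deliberately left out of this sketch.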


       a→b    b→c    a→d    ⋯    Class
g1     445    445      7    ⋯    failing
g2     128    256      0    ⋯    correct
⋮       ⋮      ⋮      ⋮     ⋱    ⋮
Table A.1: Example of a feature table.

A.2.2 Calculating Defectiveness Likelihoods
We now describe how to calculate the defect likelihood of a method (Line 6 in Algorithm A.1). In contrast to the methods presented in the earlier parts of this dissertation, we now follow a relatively simple approach: We analyse the edge weights of the R_total^mult call graphs (see Section 4.3) without employing any graph-mining technique. We do so as the programmes investigated in this appendix are rather small and the resulting call graphs do not deviate much in between the different executions. Concretely, we create a feature table as follows:

Notation A.1 (Feature tables for defect localisation in multithreaded programmes)
The feature tables have the following structure: The rows stand for all programme executions, represented by their reduced call graphs. For every edge, there is one column. The table cells contain the edge weights, except for the very last column, which contains the class ∈ {correct, failing}. If an edge is not contained in a call graph, the corresponding cells have value 0.

Example A.2: Table A.1 serves as an example. The first column in Table A.1 corresponds to the edge from method a to method b, the second column to the edge from b to c, and the third column represents an edge from a to d. The last column contains the class correct or failing. Graph g2 does not possess edge a→d; therefore, the respective cell has value 0.
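Building a feature table in the sense of Notation A.1 could look as follows. The sketch assumes that each reduced call graph is available as a map from an edge label such as "a->b" to its weight; the class `FeatureTableSketch` and this representation are illustrative choices, and the class column is assumed to be kept separately by the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the feature table from Notation A.1 (class column kept separately). */
final class FeatureTableSketch {

    /** One column per edge that occurs in at least one graph, in order of first appearance. */
    static List<String> columns(List<Map<String, Integer>> graphs) {
        Set<String> edges = new LinkedHashSet<>();
        for (Map<String, Integer> g : graphs) {
            edges.addAll(g.keySet());
        }
        return new ArrayList<>(edges);
    }

    /** One row per execution; a cell holds the edge weight, or 0 if the graph lacks the edge. */
    static int[][] rows(List<Map<String, Integer>> graphs, List<String> columns) {
        int[][] table = new int[graphs.size()][columns.size()];
        for (int r = 0; r < graphs.size(); r++) {
            for (int c = 0; c < columns.size(); c++) {
                table[r][c] = graphs.get(r).getOrDefault(columns.get(c), 0);
            }
        }
        return table;
    }
}
```

For two graphs in the style of Example A.2, where the second graph has no edge from a to d, the corresponding cell of the second row is 0.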
We analyse the edge weights in tables as introduced in Notation A.1. Concretely, we employ the information-gain-ratio measure (GainRatio, see Definition 2.7) in its Weka implementation [HFH+09] to calculate the strength of discrimination of columns. We then use these values as defect likelihoods for every column in the table, i.e., for method calls. However, we are interested in likelihoods for methods m. As a method can call several other methods, we assign every column to the calling method. We then calculate the method likelihood P(m) as the maximum of the GainRatio values of the columns assigned to method m. We use the maximum because it refers to the most suspicious invocation of a method. Other invocations are less important, as they might not be related to a defect. However, the information which specific invocation within method m is most suspicious (the column with the highest likelihood) can be important for a software developer to find and fix the defect. We therefore report this additional information to the user.
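The step from column scores to method likelihoods can be illustrated with the following sketch. It is not the Weka implementation referred to above: as a simplifying assumption, each numeric column is scored by the best gain ratio over all binary splits of the form "weight <= threshold", and all class and method names (`GainRatioSketch`, `gainRatio`, `methodLikelihoods`) are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: gain ratio per column, then P(m) as the maximum per calling method. */
final class GainRatioSketch {

    /** Entropy (in bits) of a two-way partition with the given counts. */
    static double entropy(int a, int b) {
        int n = a + b;
        double h = 0.0;
        for (int k : new int[] {a, b}) {
            if (k > 0) {
                double p = (double) k / n;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /** Best gain ratio over all binary splits "weight <= threshold" of one column. */
    static double gainRatio(int[] weights, boolean[] failing) {
        int n = weights.length;
        int totalFail = 0;
        for (boolean f : failing) {
            if (f) totalFail++;
        }
        double classEntropy = entropy(totalFail, n - totalFail);
        int[] thresholds = Arrays.stream(weights).distinct().sorted().toArray();
        double best = 0.0;
        // The largest value is skipped: it would put every row on the same side.
        for (int t = 0; t < thresholds.length - 1; t++) {
            int left = 0, leftFail = 0;
            for (int i = 0; i < n; i++) {
                if (weights[i] <= thresholds[t]) {
                    left++;
                    if (failing[i]) leftFail++;
                }
            }
            int right = n - left, rightFail = totalFail - leftFail;
            double conditional = ((double) left / n) * entropy(leftFail, left - leftFail)
                    + ((double) right / n) * entropy(rightFail, right - rightFail);
            double splitInfo = entropy(left, right);   // > 0, as both sides are non-empty
            best = Math.max(best, (classEntropy - conditional) / splitInfo);
        }
        return best;
    }

    /** P(m): maximum score over all columns (edges "m->callee") whose calling method is m. */
    static Map<String, Double> methodLikelihoods(String[] edges, int[][] columns, boolean[] failing) {
        Map<String, Double> p = new LinkedHashMap<>();
        for (int c = 0; c < edges.length; c++) {
            String caller = edges[c].substring(0, edges[c].indexOf("->"));
            p.merge(caller, gainRatio(columns[c], failing), Double::max);
        }
        return p;
    }
}
```

Here `columns[c]` holds the weights of edge `edges[c]` across all executions. Keeping the per-column scores alongside P(m) would allow reporting the most suspicious invocation within a method, as described above.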


Programme                        #M   LOC    #T   Source    Description
AllocationVector (Test)           6   133     2   [EU04]    Allocation of memory
GarageManager                    30   475     4   [EU04]    Simulation of a garage
Liveness (BugGen)                 8   120   100   [EU04]    Client-server simulation
MergeSort                        11   201     4   [EU04]    Recursive sorting implementation
ThreadTest (random divisions)    12   101    50   [EU04]    CPU benchmark
Tornado                         122   632   100   [C+09]    HTTP Server
Weblech                          88   802    10   [PH+02]   Website download/mirror tool
Table A.2: Programmes considered (#M/#T is the number of methods/threads).

A.3 Experimental Evaluation
We now present the experimental results to validate our approach. This section describes the benchmark programmes and their defects (Section A.3.1), the experimental setting (Section A.3.2), the metrics used to interpret the results (Section A.3.3) and the actual results (Section A.3.4). Section A.5 presents comparisons to related techniques.

A.3.1 Benchmark Programmes and Defects
Our benchmark contains a range of different multithreaded programmes. The benchmark covers a broad range of tasks, from basic sorting algorithms and various client-server settings to memory allocators, which are fundamental constructs in many programmes [BMBW00]. As our prototype is implemented in AspectJ, all benchmark programmes are in Java. Most of these programmes have been used in previous studies and were developed in student assignments [EU04]. We slightly modified some of the programmes; for example, in the GarageManager application, we replaced different println() statements with methods containing code simulating the assignment of work to different tasks. Furthermore, we included two typical client-server applications from the open-source community in our benchmark. These programmes are larger and represent an important class of real applications. Table A.2 lists all programmes along with their size in terms of methods and normalised lines of code (LOC)¹.

The authors of the benchmark programmes have seeded known defects into the programmes. In the two open-source programmes, we manually inserted typical synchronisation defects. All defects are representative for common multithreaded programming errors, e.g., forgotten synchronisation for some variable, and are occasional. The defects cover a broad range of error patterns, such as atomicity violations/race conditions, on one or several correlated variables, deadlocks, but also other kinds of programming errors, e.g., originating from non-parallel constructs, that can influence parallel programme behaviour.

¹ In this appendix, we use the sum of non-blank and non-comment LOC inside method bodies.
We categorise the defect patterns in the programmes of our evaluation as follows, according to the classification by Farchi et al. [FNU03]:

1. AllocationVector, defect pattern: “two-stage access”.
   Two steps of finding and allocating blocks for memory access are not executed atomically, even though the individual steps are synchronised. Thus, two threads might allocate the same memory and cause incorrect interference.

2. GarageManager, defect pattern: “blocking critical section”.
   The defect itself is a combination of an incorrectly calculated value due to a forgotten switch case. When this situation occurs, no task is assigned to a particular thread, while a global variable is treated as if work had been assigned. Thus, fewer than the number of threads recorded as active are active. This makes the programme deadlock. We illustrate the GarageManager programme in more detail in Section A.4.

3. Liveness, defect pattern: similar to the “orphaned thread” pattern.
   When the maximum number of clients is reached, the next requesting client is added to a stack. Although this data structure and a global counter are synchronised, it can happen that the server becomes available while the client is added to the stack. In this case, the client will never resume and will not finish its task.

4. MergeSort, defect pattern: “two-stage access”.
   Although methods working on global thread counters are synchronised, the variables themselves are not, which might lead to atomicity violations. In particular, threads ask how many subthreads they are allowed to generate. When two threads apply at the same time, more threads than allowed are generated. This can lead to situations in which parts of the data are not sorted.

5. ThreadTest, defect pattern: “blocking critical section”.
   The generation of new threads and checking a global variable for the maximum number of threads currently available is not done correctly in case of exceptions, which occur randomly in ThreadTest, due to divisions by zero. This leads to a deadlock when all threads encounter this situation. We classify an execution as failing when at least one thread encounters this problem, due to reduced performance.

6. Tornado, defect pattern: “no lock”.
   Synchronisation statements are removed in one method. This leads to a race condition and ultimately, in the context of Tornado, to unanswered HTTP requests.

7. Weblech, defect pattern: “no lock”.
   Removed synchronisation statements as in Tornado, resulting in Web pages that are not downloaded.

For the Weblech programme, we have two versions: Weblech.orig and Weblech.inj. In Weblech.inj, we introduced a defect in method run() by removing all synchronized statements (Listing A.1 shows an excerpt of this method with one such statement), aiming to simulate a typical programming error. During our experiments, we realised that the original non-injected version (Weblech.orig) led to failures in very rare cases as well. The failure occurred in only 5 out of 5,000 executions; we used a sample of the correct executions in the experiments. Thus, Weblech.inj contains the original defect besides the injected defects. With our tool, we were able to localise the real defect by investigating two methods only. The result is that two global unsynchronised variables (downloadsInProgress and running) are modified in run(), occasionally causing race conditions. To fix the defect in order to produce a defect-free reference, we added the volatile keyword to the variable declaration in the class header.

    while ((queueSize() > 0 || downloadsInProgress > 0)
            && quit == false) {
        // ...
        synchronized (queue) {
            nextURL = queue.getNextInQueue();
            downloadsInProgress++;
        }
        // ...
    }
    running--;

Listing A.1: Method void weblech.spider.run() (excerpt).

A.3.2 Experimental Setting
Number of Executions. Our defect-localisation technique requires that we execute every programme several times and that we ensure that there are sufficiently many examples for correct and failing executions. This is necessary since we focus on occasional bugs (see Chapter 2), i.e., failures whose occurrence depends on input data, random components or non-deterministic thread interleavings. Furthermore, we tried to achieve stable results, i.e., analysing more executions would not lead to significant changes. We used this criterion to determine the number of executions required, in addition to obtaining enough correct and failing cases. Table A.3 lists the number of correct and failing executions for each benchmark programme.

Varying Execution Traces. In order to obtain different execution traces from the same programme, we rely on the original test cases that are provided in the benchmark suite. MergeSort, for instance, comes with a generator creating random arrays as input data. Some programmes have an internal random component as part of the programme logic, i.e., they automatically lead to varying executions. GarageManager, for instance, simulates varying processes in a garage. Other programmes produce different executions due to different thread interleavings that can lead to observable failures occasionally. For the two open-source programmes, we constructed typical test cases ourselves; for the Tornado Web server, we start a number of scripts simultaneously downloading files from the server. For Weblech, we download a number of files from a (defect-free) Web server.

Test Oracles. We use individual test oracles that come with every benchmark programme. For the two open-source programmes, we compose test oracles that automatically compare the actual output of a programme to the expected one. For example, we compare the files downloaded with Weblech to the original ones.
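An output-comparing oracle of this kind could be sketched as follows. The class `FileOracleSketch` and its method are hypothetical and only illustrate the idea of comparing downloaded files byte by byte with their originals; the oracles actually used in the experiments are not shown in this appendix.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Map;

/** Hypothetical sketch of a test oracle that compares downloaded files with the originals. */
final class FileOracleSketch {

    /**
     * Returns true (execution classified as correct) if every original file has a
     * byte-identical downloaded counterpart, and false (failing) otherwise.
     * The map assigns each original file to the path where its download is expected.
     */
    static boolean isCorrect(Map<Path, Path> originalToDownloaded) throws IOException {
        for (Map.Entry<Path, Path> e : originalToDownloaded.entrySet()) {
            Path downloaded = e.getValue();
            if (!Files.isRegularFile(downloaded)) {
                return false;                       // file was never downloaded
            }
            byte[] expected = Files.readAllBytes(e.getKey());
            byte[] actual = Files.readAllBytes(downloaded);
            if (!Arrays.equals(expected, actual)) {
                return false;                       // content differs from the original
            }
        }
        return true;
    }
}
```

Reading whole files into memory keeps the sketch short; for large downloads, a streaming comparison would be the more economical design.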

Testing Environment. We ran all experiments on a standard HP workstation with an AMD Athlon 64 X2 dual-core processor 4800+. We employed a standard Sun Java 6 virtual machine on Microsoft Windows XP.

A.3.3 Accuracy Measures for Defect-Localisation Results
As in the earlier parts of this dissertation, the locations of the actual defects are known, so the report of a method containing a defect can be directly compared to the known location. If there is more than one location which can be altered to fix a defect, we refer to the position of the first of such methods in the ranking. For cases as in Weblech.orig where the defect can be fixed outside a method body (e.g., in the class header), one can still identify methods that can be altered to fix the erroneous behaviour.

In order to evaluate the accuracy of the results, we report the position of the defective method in an ordered result list, as before. Similar to the approach investigated in Chapters 6 and 7, we now always use a second static ranking criterion: We sort the methods with the same likelihood decreasingly by their size in LOC. We provide the percentage of LOC to review in addition to the ranking position. This is calculated as the share of the code that has to be considered in the programme, i.e., the sum of LOC of all methods having a ranking position smaller than or equal to the position reported in the table, divided by the total LOC (see Table A.2).

                     Executions            Defect Localisation
Programme            #correct   #failing   Ranking Pos.   %LOC to Review
AllocationVector        383        117          1             17.3%
GarageManager            74         26          1             14.2%
Liveness                149        153          1             44.2%
MergeSort               668        332          1             25.9%
ThreadTest              207        193          1             18.8%
Tornado                 362          8         14             23.3%
Weblech.orig            494          5          2             23.3%
Weblech.inj             985         15          5             21.8%
Table A.3: Defect-localisation results.
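The ranking with the static tie-breaker and the percentage of LOC to review, as described in this section, can be sketched as follows. The class `LocToReviewSketch` and its members are hypothetical names introduced for illustration only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the ranking and of the "%LOC to review" measure. */
final class LocToReviewSketch {

    static final class Method {
        final String name;
        final double likelihood;
        final int loc;

        Method(String name, double likelihood, int loc) {
            this.name = name;
            this.likelihood = likelihood;
            this.loc = loc;
        }
    }

    /** Sorts by likelihood (descending); ties are broken by size in LOC (descending). */
    static List<Method> rank(List<Method> methods) {
        List<Method> ranked = new ArrayList<>(methods);
        ranked.sort(Comparator.comparingDouble((Method m) -> m.likelihood).reversed()
                .thenComparing(Comparator.comparingInt((Method m) -> m.loc).reversed()));
        return ranked;
    }

    /** Percentage of LOC in all methods ranked at or above the defective method. */
    static double percentLocToReview(List<Method> ranked, String defectiveMethod) {
        int total = 0;
        int reviewed = 0;
        boolean found = false;
        for (Method m : ranked) {
            total += m.loc;
            if (!found) {
                reviewed += m.loc;
                found = m.name.equals(defectiveMethod);
            }
        }
        if (!found) {
            throw new IllegalArgumentException("method not in ranking: " + defectiveMethod);
        }
        return 100.0 * reviewed / total;
    }
}
```

With three methods of 40, 20 and 40 LOC, where the two most suspicious ones share the same likelihood, the larger of the two is inspected first; if the smaller one is defective, 60 of 100 LOC (60%) have to be reviewed.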

A.3.4 Results
We present our results in Table A.3. The numbers are encouraging: In all five benchmark programmes from [EU04], the defective method is ranked first. The ranking position is lower only in the two large programmes. However, taking the size of these programmes into account, the quality of defect localisation is within the same range (see column “%LOC to Review”).

Overall, the average ranking position for methods containing the defects is 3.3. Nevertheless, as Table A.2 shows, a developer has to review only 7.1% of all methods, or 23.6% of the normalised source code, to find the defects, which is low. In other words, a developer has to consider on average less than a quarter of the source code of our programmes in order to find a defect in the worst case. This reduces the percentage of methods (code) to review by a factor of seven (code: by more than half) when compared to an average expected amount of 50% of methods (code) to review. Note that all these values are obtained without any possible prior knowledge of the developer, which might further narrow down the code to be inspected. Furthermore, they are maximum values, for two reasons: (1) Usually not all lines of a method need to be inspected, in particular due to the additionally reported information on which call within a method is most suspicious. (2) The methods ranked highest frequently are good hints for the defect, even if the defective method itself is ranked lower. We know from our experience that non-defective methods that are ranked high often are in the vicinity of the defective method, e.g., they might be invoked from the defective method.
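The aggregate figures quoted in this section can be re-derived from the two tables. The following worked calculation is a sketch under stated assumptions: it reads the ranking positions in Table A.3 as 1, 1, 1, 1, 1, 14, 2 and 5, reads the method counts in Table A.2 as 6, 30, 8, 11, 12, 122 and 88, and counts the 88 methods of Weblech once for each of its two versions.

```latex
\begin{align*}
\text{average ranking position} &= \frac{1+1+1+1+1+14+2+5}{8} = \frac{26}{8} = 3.25 \approx 3.3 \\[4pt]
\text{methods to review} &= \frac{26}{6+30+8+11+12+122+88+88} = \frac{26}{365} \approx 7.1\,\% \\[4pt]
\text{code to review} &= \frac{17.3+14.2+44.2+25.9+18.8+23.3+23.3+21.8}{8}\,\% = \frac{188.8}{8}\,\% = 23.6\,\%
\end{align*}
```

Compared to an expected 50% of methods to review, 7.1% corresponds to a reduction by a factor of about seven.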

[Plot: % of defects localised (vertical axis, 0 to 100) against % of source code that need not be examined (horizontal axis, 100 down to 0).]
Figure A.1: The percentage of defects localised when not examining a certain percentage of source code.

Figure A.1 provides an illustration of the percentage of localised defects versus the percentage of source code that does not need to be examined. In our case, it shows that we can skip the inspection of 50% of the code and still find 100% of the defects. If we skip inspecting 70%, we would still find more than 80% of the defects. This is a significant gain in programmer productivity.

A.4 A Detailed Example
We now illustrate a typical defect and the process of its localisation with our approach using excerpts from the GarageManager programme [EU04]:

The Defect. In our example, the calculation of the taskNumber variable can produce a negative value, which is read in method GoToWork() (see Listing A.2) to calculate its modulo-8 value, which is then fed into a switch-case block. This block, however, expects values between 0 and 7. Negative values can result when Java calculates the modulo operation on a negative number. There are two alternative positions where a developer can modify the code to fix the bug: (1) The switch-case block, by adding negative cases or a default case; (2) The parts of the source code where taskNumber is calculated (method SetTaskToWorker()).
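The behaviour of Java's remainder operator that underlies this defect can be reproduced in isolation. The following sketch contrasts `taskNumber % 8` with `Math.floorMod`, which is available from Java 8 onwards and therefore postdates the Java 6 environment of the experiments. It is shown as one conceivable repair, not as the fix discussed above, and the class `ModuloSketch` is a hypothetical name.

```java
/** Hypothetical sketch of the index computation behind the GarageManager defect. */
final class ModuloSketch {

    /** Same expression as in the switch statement: the sign follows the dividend. */
    static int remainderIndex(int taskNumber) {
        return taskNumber % 8;
    }

    /** Possible repair (an assumption, requires Java 8+): the result is always in 0..7. */
    static int floorModIndex(int taskNumber) {
        return Math.floorMod(taskNumber, 8);
    }

    public static void main(String[] args) {
        for (int taskNumber : new int[] {11, -3}) {
            System.out.println(taskNumber + ": % gives " + remainderIndex(taskNumber)
                    + ", floorMod gives " + floorModIndex(taskNumber));
        }
    }
}
```

For a negative taskNumber such as -3, the remainder operator yields -3, which matches none of the cases 0 to 7, whereas `Math.floorMod` yields 5.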

From the Defect to an Infection. We now look at the call graph from a failing execution in more detail, shown in Figure A.2. The call of run() generates five threads: Four “worker” threads calling methods WaitForManager(), GoToWork() and PrintCard() and one “manager” thread calling the remaining methods. In WorkingOn() (a defective method), the programme state becomes infected: Three threads evaluate their switch statement to 0, 1 and 7, but the fourth thread has a negative value, thus causing the thread not to call any further methods.

[Call graph with the nodes main, TakeWorkersFromAgency, GiveTasksToWorkers, GetParametersFromUser, GetWorkersNames, SetTaskToWorker, OpenOutputFile, AdjustBugProbability, PrintWorkersNames, run, AllWorkersFinished, PrintOutput, ManagerArrived, PrintCard, GoToWork, WaitForManager, WorkingOn, IsManagerArrived, WorkerFinishedTask, workOnBreaks, changingTires and fixGears; the edges are annotated with call frequencies.]
Figure A.2: Call graph from a failing GarageManager execution.

    switch (taskNumber % 8) {
        case 0:
            WorkingOn("Cleaning", 1000);
            break;
        // similar for cases 1 to 5 ...
        case 6:
            WorkingOn("Working on breaks", 2200);
            break;
        case 7:
            WorkingOn("Fixing engines", 2400);
            break;
    }

Listing A.2: Method void GoToWork() (excerpt).

From an Infection to a Failure. The aforementioned infection causes the fourth thread not to call WorkerFinishedTask(). This method decreases a variable of the global status object. This object is queried by AllWorkersFinished() in method run() (see Listing A.3). AllWorkersFinished() will never be true, as status will always indicate that only three out of four “worker” threads have finished their tasks. This causes an infinite loop in run(). We manually stopped the loop after 3,574 iterations. In other words, the infection has caused a deadlock, an observable programme behaviour, which we consider a failure.

    System.out.println("Manager arrived!");
    synchronized (status) {
        status.ManagerArrived();
    }
    boolean tasksNotFinished = true, printedOutput = false;
    while (tasksNotFinished) {
        synchronized (status) {
            printedOutput = PrintOutput(printedOutput);
            if (status.AllWorkersFinished())
                tasksNotFinished = false;
            else
                yield();
        }
    }

Listing A.3: Method void run() (excerpt).


Localising the Defect. In our experiments, our approach has found the three methods GoToWork(), WorkingOn() and run() (ordered by increasing ranking position) to be most likely defective. Thus, the defect was pinpointed directly. The high likelihood for WorkingOn() is due to a follow-up infection, as it is always called from GoToWork(). The run() method has a high likelihood as well, caused by the huge number of method calls in the infinite loop, compared to correct executions. Both methods are inherently connected to the defect.

A.5 Result Comparisons with Related Work
We now compare our approach with two applicable techniques from the related work. Our experiments with the IBM Multicore SDK [QDLT09] applied to all programme versions from our evaluation (see Section A.3) reveal that it is not able to find any of the defects. From the eight versions, the Multicore SDK incorrectly classified seven versions as defect-free, while producing a false-positive warning for the eighth version.

We applied FindBugs [AHM+08] to all programmes in our benchmark. We observed that FindBugs did not directly report any of the defects. At the same time, FindBugs produces false-positive warnings: On average, there are 5.8 warnings per programme that on average affect 4.5 different methods. The warnings refer to the correct method names in just four out of eight programmes. Further, the warnings are not prioritised, so a developer would have to inspect the entire code of all methods with warnings. In each of the four programmes, inspection amounts to 47.5%, 36.8%, 29.2% and 29.2% of the source code, respectively. If FindBugs were improved by a method ranking technique, such as inspecting larger methods first (as in this appendix), then developers could save time finding the respective defects and reduce the amount of reviewed code to 14.2%, 25.9%, 25.4% and 25.4%, respectively. In contrast, inspecting up to 25.9% of the source code with our technique finds seven out of the eight defects (see the last column in Table A.3). These results are better than those of FindBugs. In contrast to our approach, FindBugs does not offer the developer any hint on finding the remaining four defects, as they are not reported at all.

A.6 Subsumption
In this appendix, we have presented and evaluated a variation of our frequency-based defect-localisation approach (see Section 5.3.1) for multithreaded programmes. Although the call-graph variations investigated (see Section 4.3) are rather simple, we were able to achieve good results in localising defects in multithreaded programmes. The evaluation shows that mining call graphs is an effective approach to detect a wide range of errors that affect parallel programme behaviour. These errors include race conditions, deadlocks and errors originating from the wrong usage of non-parallel language constructs. This is in contrast to existing multithreading debugging aids that concentrate on detecting specific situations such as race conditions. Notably, the approach presented was able to localise a previously unknown (and undocumented) defect in an open-source tool. However, certain defects in a multithreaded environment might not be captured by the approach presented in this appendix. Extensions to the call-graph representations and to the mining technique might help to broaden the range of detectable defects (see Section 4.3 and Chapter 9).

Bibliography

[AAK+02] Tatsuya Asai, Kenji Abe, Shinji Kawasoe, Hiroki Arimura, Hiroshi Sakamoto and Setsuo Arikawa. Efficient Substructure Discovery from Large Semi-structured Data. Proceedings of the 2nd SIAM International Conference on Data Mining (SDM). 2002.

[AAUN03] Tatsuya Asai, Hiroki Arimura, Takeaki Uno and Shin-Ichi Nakano. Discovering Frequent Substructures in Large Unordered Trees. Proceedings of the 6th International Conference on Discovery Science (DS). 2003.

[AHM+08] Nathaniel Ayewah, David Hovemeyer, J. David Morgenthaler, John Penix and William Pugh. Using Static Analysis to Find Bugs. IEEE Software, 25(5):22–29, 2008.

[All70] Frances E. Allen. Control Flow Analysis. ACM SIGPLAN Notices, 5(7):1–19, 1970.

[All74] Frances E. Allen. Interprocedural Data Flow Analysis. Proceedings of the IFIP Congress. 1974.

[AMS+96] Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen and A. Inkeri Verkamo. Fast Discovery of Association Rules. In Fayyad et al. [FPSSU96], chap. 12, pp. 307–328.

[AP10] Nathaniel Ayewah and William Pugh. The Google FindBugs Fixit. Proceedings of the 19th International Symposium on Software Testing and Analysis (ISSTA). 2010.

[AS95] Rakesh Agrawal and Ramakrishnan Srikant. Mining Sequential Patterns. Proceedings of the 11th International Conference on Data Engineering (ICDE). 1995.

[AW10a] Charu C. Aggarwal and Haixun Wang. A Survey of Clustering Algorithms for Graph Data. In Managing and Mining Graph Data [AW10c], chap. 9, pp. 275–301.

[AW10b] Charu C. Aggarwal and Haixun Wang. Graph Data Management and Mining: A Survey of Algorithms and Applications. In Managing and Mining Graph Data [AW10c], chap. 2, pp. 13–68.

[AW10c] Charu C. Aggarwal and Haixun Wang, eds. Managing and Mining Graph Data, vol. 40 of Advances in Database Systems. Springer, 2010.

[AZGvG09] Rui Abreu, Peter Zoeteweij, Rob Golsteijn and Arjan J. C. van Gemund. A Practical Evaluation of Spectrum-Based Fault Localization. Journal of Systems and Software, 82(11):1780–1792, 2009.

[BB02] Christian Borgelt and Michael R. Berthold. Mining Molecular Fragments: Finding Relevant Substructures of Molecules. Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM). 2002.

[BBHK10] Michael R. Berthold, Christian Borgelt, Frank Höppner and Frank Klawonn. Guide to Intelligent Data Analysis: How to Intelligently Make Sense of Real Data, vol. 42 of Texts in Computer Science. Springer, 2010.

[Bei90] Boris Beizer. Software Testing Techniques. Van Nostrand Reinhold Co., 2nd edn., 1990.

[BGRS99] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan and Uri Shaft. When Is “Nearest Neighbor” Meaningful? Proceedings of the 7th International Conference on Database Theory (ICDT). 1999.

[Bin07] David Binkley. Source Code Analysis: A Road Map. Proceedings of the 29th International Conference on Software Engineering (ICSE). 2007.

[BK98] Christian Borgelt and Rudolf Kruse. Attributauswahlmaße für die Induktion von Entscheidungsbäumen: Ein Überblick. In Gholamreza Nakhaeizadeh, ed., Data Mining: Theoretische Aspekte und Anwendungen, Beiträge zur Wirtschaftsinformatik, pp. 77–98. Physica, 1998.

[BMB05] Christian Borgelt, Thorsten Meinl and Michael R. Berthold. MoSS: A Program for Molecular Substructure Mining. Proceedings of the Workshop on Open Source Data Mining Software (OSDM). 2005.

[BMBW00] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe and Paul R. Wilson. Hoard: A Scalable Memory Allocator for Multithreaded Applications. SIGPLAN Notices, 35(11):117–128, 2000.

[BP98] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.

[C+09] Neil Conway et al. Tornado HTTP Server, 2009. Software available at http://tornado.sourceforge.net/.

[CCK+00] Pete Chapman, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas Reinartz, Colin Shearer and Rüdiger Wirth. CRISP-DM 1.0 – Step-by-Step Data Mining Guide. The CRISP-DM Consortium, 2000.

[CH06] Diane J. Cook and Lawrence B. Holder, eds. Mining Graph Data. John Wiley & Sons, 2006.

[CHS+08] Vineet Chaoji, Mohammad Al Hasan, Saeed Salem, Jeremy Besson and Mohammed J. Zaki. ORIGAMI: A Novel and Effective Approach for Mining Representative Orthogonal Graph Patterns. Statistical Analysis and Data Mining, 1(2):67–84, 2008.

[CL01] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A Library for Support Vector Machines. Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

[CLZ+09] Hong Cheng, David Lo, Yang Zhou, Xiaoyin Wang and Xifeng Yan. Identifying Bug Signatures Using Discriminative Graph Mining. Proceedings of the 18th International Symposium on Software Testing and Analysis (ISSTA). 2009.

[CMNK05] Yun Chi, Richard R. Muntz, Siegfried Nijssen and Joost N. Kok. Frequent Subtree Mining – An Overview. Fundamenta Informaticae, 66(1–2):161–198, 2005.

[CPY08] Ray-Yaung Chang, Andy Podgurski and Jiong Yang. Discovering Neglected Conditions in Software by Mining Dependence Graphs. IEEE Transactions on Software Engineering, 34(5):579–596, 2008.

[CS98] Jong-Deok Choi and Harini Srinivasan. Deterministic Replay of Java Multithreaded Applications. Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT). 1998.

[CWC95] John Y. Ching, Andrew K. C. Wong and Keith C. C. Chan. Class-Dependent Discretization for Inductive Learning from Continuous and Mixed-Mode Data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7):641–651, 1995.

[CYH10a] Hong Cheng, Xifeng Yan and Jiawei Han. Discriminative Frequent Pattern-Based Graph Classification. In Yu et al. [YHF10], chap. 9, pp. 237–262.

[CYH10b] Hong Cheng, Xifeng Yan and Jiawei Han. Mining Graph Patterns. In Aggarwal and Wang [AW10c], chap. 9, pp. 365–392.

[CYM03] Yun Chi, Yirong Yang and Richard R. Muntz. Indexing and Mining Free Trees. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM). 2003.

[CYM04] Yun Chi, Yirong Yang and Richard R. Muntz. HybridTreeMiner: An Efficient Algorithm for Mining Frequent Rooted Trees and Free Trees Using Canonical Forms. Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM). 2004.

[CYZ+09] Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han and Philip S. Yu. Graph OLAP: A Multi-Dimensional Framework for Graph Data Analysis. Knowledge and Information Systems, 21(1):41–63, 2009.

[CYZZ10] Longbing Cao, Philip S. Yu, Chengqi Zhang and Yanchang Zhao. Domain Driven Data Mining. Springer, 2010.

[CZ02] Jong-Deok Choi and Andreas Zeller. Isolating Failure Inducing Thread Schedules. Proceedings of the 11th International Symposium on Software Testing and Analysis (ISSTA). 2002.

[CZ05] Holger Cleve and Andreas Zeller. Locating Causes of Program Failures. Proceedings of the 27th International Conference on Software Engineering (ICSE). 2005.

[Dar04] Ian F. Darwin. Java Cookbook. O’Reilly & Associates, 2004.

[DDG+08] Thomas G. Dietterich, Pedro Domingos, Lise Getoor, Stephen Muggleton and Prasad Tadepalli. Structured Machine Learning: The Next Ten Years. Machine Learning, 73(1):3–23, 2008.

[DDZS09] Laura Dietz, Valentin Dallmeier, Andreas Zeller and Tobias Scheffer. Localizing Bugs in Program Executions with Graphical Models. Proceedings of the 23rd Conference on Neural Information Processing Systems (NIPS). 2009.

[DFLS06] Giuseppe Di Fatta, Stefan Leue and Evghenia Stegantova. Discriminative Pattern Mining in Software Fault Detection. Proceedings of the 3rd International Workshop on Software Quality Assurance (SOQUA). 2006.

[Die06] Reinhard Diestel. Graph Theory. Springer, 3rd edn., 2006.

[DKS95] James Dougherty, Ron Kohavi and Mehran Sahami. Supervised and Unsupervised Discretization of Continuous Features. Proceedings of the 12th International Conference on Machine Learning (ICML). 1995.

[DLZ05] Valentin Dallmeier, Christian Lindig and Andreas Zeller. Lightweight Defect Localization for Java. Proceedings of the 19th European Conference on Object-Oriented Programming (ECOOP). 2005.

[DP07] Guozhu Dong and Jian Pei. Sequence Data Mining, vol. 33 of Advances in Database Systems. Springer, 2007.

[DZ07] Valentin Dallmeier and Thomas Zimmermann. Extraction of Bug Localization Benchmarks from History. Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE). 2007.

[DZ09] Valentin Dallmeier and Thomas Zimmermann. iBUGS – Bug Repositories Extracted from Project History. Department of Computer Science, Saarland University, Saarbrücken, Germany, 2009. Repository available at http://www.st.cs.uni-saarland.de/ibugs/.

[EA03] Dawson Engler and Ken Ashcraft. RacerX: Effective, Static Detection of Race Conditions and Deadlocks. Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP). 2003.

[EB09] Frank Eichinger and Klemens Böhm. Towards Scalability of Graph-Mining Based Bug Localisation. Proceedings of the 7th International Workshop on Mining and Learning with Graphs (MLG). 2009.

[EB10] Frank Eichinger and Klemens Böhm. Software-Bug Localization with Graph Mining. In Aggarwal and Wang [AW10c], chap. 17, pp. 515–546. © Springer Science+Business Media, LLC 2010. The original publication is available at http://www.springerlink.com/.

[EBH08a] Frank Eichinger, Klemens Böhm and Matthias Huber. Improved Software Fault Detection with Graph Mining. Proceedings of the 6th International Workshop on Mining and Learning with Graphs (MLG). 2008.

[EBH08b] Frank Eichinger, Klemens Böhm and Matthias Huber. Mining Edge-Weighted Call Graphs to Localise Software Bugs. Proceedings of the 8th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). 2008. © Springer-Verlag Berlin Heidelberg 2008. The original publication is available at http://www.springerlink.com/.

[ECGN01] Michael D. Ernst, Jake Cockrell, William G. Griswold and David Notkin. Dynamically Discovering Likely Program Invariants to Support Program Evolution. IEEE Transactions on Software Engineering, 27(2):99–123, 2001.

[ECJ+10] Ashraf Elsayed, Frans Coenen, Chuntao Jiang, Marta García-Fiñana and Vanessa Sluming. Corpus Callosum MR Image Classification. Knowledge-Based Systems, 23(4):330–336, 2010.

[EHB10a] Frank Eichinger, Matthias Huber and Klemens Böhm. On the Usefulness of Weight-Based Constraints in Frequent Subgraph Mining. Proceedings of the 30th BCS SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence (AI). 2010.

[EHB10b] Frank Eichinger, Matthias Huber and Klemens Böhm. On the Usefulness of Weight-Based Constraints in Frequent Subgraph Mining. Karlsruhe Reports in Informatics 2010,10, Department of Informatics, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany, 2010.

[EKKB10] Frank Eichinger, Klaus Krogmann, Roland Klug and Klemens Böhm. Software-Defect Localisation by Mining Dataflow-Enabled Call Graphs. Proceedings of the 10th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). 2010. © Springer-Verlag Berlin Heidelberg 2010. The original publication is available at http://www.springerlink.com/.

[EOB11] Frank Eichinger, Christopher Oßner and Klemens Böhm. Scalable Software-Defect Localisation by Hierarchical Mining of Dynamic Call Graphs. Proceedings of the 11th SIAM International Conference on Data Mining (SDM). 2011.

[EPGB10] Frank Eichinger, Victor Pankratius, Philipp W. L. Große and Klemens Böhm. Localizing Defects in Multithreaded Programs by Mining Dynamic Call Graphs. Proceedings of the 5th Testing: Academic and Industrial Conference – Practice and Research Techniques (TAIC PART). 2010. © Springer-Verlag Berlin Heidelberg 2010. The original publication is available at http://www.springerlink.com/.

[ER97] Tapio Elomaa and Juho Rousu. Efficient Multisplitting on Numerical Data. Proceedings of the 1st European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD). 1997.

[EU04] Yaniv Eytani and Shmuel Ur. Compiling a Benchmark of Documented Multi-Threaded Bugs. Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS). 2004.

[FA10] Andrew Frank and Arthur Asuncion. UCI Machine Learning Repository. School of Information and Computer Sciences, University of California, Irvine, USA, 2010. Repository available at http://archive.ics.uci.edu/ml/.

[FI93] Usama M. Fayyad and Keki B. Irani. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence. 1993.

[FLL+02] Cormac Flanagan, K. Rustan M. Leino, Mark Lillibridge, Greg Nelson, James B. Saxe and Raymie Stata. Extended Static Checking for Java. Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 2002.

[FNU03] Eitan Farchi, Yarden Nir and Shmuel Ur. Concurrent Bug Patterns and How to Test Them. Proceedings of the 1st Workshop on Parallel and Distributed Systems: Testing and Debugging (PADTAD). 2003.

[FOW87] Jeanne Ferrante, Karl J. Ottenstein and Joe D. Warren. The Program Dependence Graph and Its Use in Optimization. ACM Transactions on Programming Languages and Systems, 9(3):319–349, 1987.

[FPSS96] Usama M. Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth. From Data Mining to Knowledge Discovery: An Overview. In Fayyad et al. [FPSSU96], chap. 1, pp. 1–34.

[FPSSU96] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth and Ramasamy Uthurusamy, eds. Advances in Knowledge Discovery and Data Mining. AAAI Press / MIT Press, 1996.

[Gai86] Jason Gait. A Probe Effect in Concurrent Programs. Software: Practice and Experience, 16(3):225–233, 1986.

[GJ79] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., 1979.

171

[GKM82] Susan L. Graham, Peter B. Kessler and Marshall K. McKusick. gprof: A Call Graph Execution Profiler. Proceedings of the ACM SIGPLAN Symposium on Compiler Construction. 1982.
[GRS99] Minos N. Garofalakis, Rajeev Rastogi and Kyuseok Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. Proceedings of the 25th International Conference on Very Large Data Bases (VLDB). 1999.
[GWBV02] Isabelle Guyon, Jason Weston, Stephen Barnhill and Vladimir Vapnik. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 46(1–3):389–422, 2002.
[HCXY07] Jiawei Han, Hong Cheng, Dong Xin and Xifeng Yan. Frequent Pattern Mining: Current Status and Future Directions. Data Mining and Knowledge Discovery, 15(1):55–86, 2007.
[HFGO94] Monica Hutchins, Herb Foster, Tarak Goradia and Thomas Ostrand. Experiments on the Effectiveness of Dataflow- and Controlflow-Based Test Adequacy Criteria. Proceedings of the 16th International Conference on Software Engineering (ICSE). 1994.
[HFH+09] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann and Ian H. Witten. The WEKA Data Mining Software: An Update. SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[HG08] Jiawei Han and Jing Gao. Research Challenges for Data Mining in Science and Engineering. In Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani and Vipin Kumar, eds., Next Generation of Data Mining, Data Mining and Knowledge Discovery, chap. 1, pp. 3–27. Chapman & Hall/CRC, 2008.
[HJO08] Hwa-You Hsu, James A. Jones and Alessandro Orso. RAPID: Identifying Bug Signatures to Support Debugging Activities. Proceedings of the 23rd IEEE/ACM International Conference on Automated Software Engineering (ASE). 2008.
[HK00] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, 2nd edn., 2000.
[HMS01] David Hand, Heikki Mannila and Padhraic Smyth. Principles of Data Mining. Adaptive Computation and Machine Learning. MIT Press, 2001.

[How78] William E. Howden. A Survey of Dynamic Analysis Methods. In Edward Miller and William E. Howden, eds., Software Testing and Validation Techniques, pp. 184–206. IEEE Computer Society Press, 1978.
[HPY00] Jiawei Han, Jian Pei and Yiwen Yin. Mining Frequent Patterns without Candidate Generation. Proceedings of the ACM SIGMOD International Conference on Management of Data. 2000.
[HS95] Brian Henderson-Sellers. Object-Oriented Metrics: Measures of Complexity. Prentice Hall, 1995.
[HSH+10] Jianbin Huang, Heli Sun, Jiawei Han, Hongbo Deng, Yizhou Sun and Yaguang Liu. SHRINK: A Structural Clustering Algorithm for Detecting Hierarchical Communities in Networks. Proceedings of the 19th International Conference on Information and Knowledge Management (CIKM). 2010.
[HWP03] Jun Huan, Wei Wang and Jan Prins. Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM). 2003.
[IWM00] Akihiro Inokuchi, Takashi Washio and Hiroshi Motoda. An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data. Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD). 2000.
[JCSZ08] Chuntao Jiang, Frans Coenen, Robert Sanderson and Michele Zito. Graph-Based Image Classification by Weighting Scheme. Proceedings of the 28th BCS SGAI International Conference on Artificial Intelligence (AI). 2008.
[JCSZ10] Chuntao Jiang, Frans Coenen, Robert Sanderson and Michele Zito. Text Classification using Graph Mining-Based Feature Extraction. Knowledge-Based Systems, 23(4):302–308, 2010.
[JCZ10] Chuntao Jiang, Frans Coenen and Michele Zito. Frequent Sub-graph Mining on Edge Weighted Graphs. Proceedings of the 12th International Conference on Data Warehousing and Knowledge Discovery (DAWAK). 2010.
[Jen09] Finn V. Jensen. Bayesian Networks. Interdisciplinary Reviews: Computational Statistics, 1(3):307–315, 2009.
[JH05] James A. Jones and Mary Jean Harrold. Empirical Evaluation of the Tarantula Automatic Fault-Localization Technique. Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (ASE). 2005.
[JHS02] James A. Jones, Mary Jean Harrold and John Stasko. Visualization of Test Information to Assist Fault Localization. Proceedings of the 24th International Conference on Software Engineering (ICSE). 2002.
[Joh00] Philip M. Johnson. A Comparative Review of LOCC and CodeCount. Tech. Rep. CSDL-00-10, Department of Information and Computer Sciences, University of Hawaii, Honolulu, USA, 2000.
[Jon08] Capers Jones. Applied Software Measurement: Assuring Productivity and Quality. McGraw-Hill, 2008.
[Jor99] Michael I. Jordan, ed. Learning in Graphical Models. MIT Press, 1999.
[JVB+05] Wei Jiang, Jaideep Vaidya, Zahir Balaporia, Chris Clifton and Brett Banich. Knowledge Discovery from Transportation Network Data. Proceedings of the 21st International Conference on Data Engineering (ICDE). 2005.
[KC04] Lukasz A. Kurgan and Krzysztof J. Cios. CAIM Discretization Algorithm. IEEE Transactions on Knowledge and Data Engineering, 16(2):145–153, 2004.
[Ker92] Randy Kerber. ChiMerge: Discretization of Numeric Attributes. Proceedings of the 10th National Conference on Artificial Intelligence (AAAI). 1992.
[KHH+01] Gregor Kiczales, Erik Hilsdale, Jim Hugunin, Mik Kersten, Jeffrey Palm and William G. Griswold. An Overview of AspectJ. Proceedings of the 15th European Conference on Object-Oriented Programming (ECOOP). 2001.
[KK01] Michihiro Kuramochi and George Karypis. Frequent Subgraph Discovery. Proceedings of the 1st IEEE International Conference on Data Mining (ICDM). 2001.
[KKR10] Klaus Krogmann, Michael Kuperberg and Ralf Reussner. Using Genetic Search for Reverse Engineering of Parametric Behaviour Models for Performance Prediction. IEEE Transactions on Software Engineering, 36(6):865–877, 2010.
[KL88] Bogdan Korel and Janusz Laski. Dynamic Program Slicing. Information Processing Letters, 29(3):155–163, 1988.

[KLM+97] Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris Maeda, Cristina Videira Lopes, Jean-Marc Loingtier and John Irwin. Aspect-Oriented Programming. Proceedings of the 11th European Conference on Object-Oriented Programming (ECOOP). 1997.
[Kon94] Igor Kononenko. Estimating Attributes: Analysis and Extensions of RELIEF. Proceedings of the 7th European Conference on Machine Learning (ECML). 1994.
[KPB06] Patrick Knab, Martin Pinzger and Abraham Bernstein. Predicting Defect Densities in Source Code Files with Decision Tree Learners. Proceedings of the International Workshop on Mining Software Repositories (MSR) at ICSE. 2006.
[Lam78] Leslie Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7):558–565, 1978.
[LAZJ03] Ben Liblit, Alex Aiken, Alice X. Zheng and Michael I. Jordan. Bug Isolation via Remote Program Sampling. ACM SIGPLAN Notices, 38(5):141–154, 2003.
[LCH+09] David Lo, Hong Cheng, Jiawei Han, Siau-Cheng Khoo and Chengnian Sun. Classification of Software Behaviors for Failure Detection: A Discriminative Pattern Mining Approach. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2009.
[LFY+06] Chao Liu, Long Fei, Xifeng Yan, Jiawei Han and Samuel P. Midkiff. Statistical Debugging: A Hypothesis Testing-Based Approach. IEEE Transactions on Software Engineering, 32(10):831–848, 2006.
[LNZ+05] Ben Liblit, Mayur Naik, Alice X. Zheng, Alex Aiken and Michael I. Jordan. Scalable Statistical Bug Isolation. Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 2005.
[LPSZ08] Shan Lu, Soyeon Park, Eunsoo Seo and Yuanyuan Zhou. Learning from Mistakes – A Comprehensive Study on Real World Concurrency Bug Characteristics. SIGARCH Computer Architecture News, 36(1):329–339, 2008.
[LS97] Huan Liu and Rudy Setiono. Feature Selection via Discretization. IEEE Transactions on Knowledge and Data Engineering, 9(4):642–645, 1997.

[LYY+05] Chao Liu, Xifeng Yan, Hwanjo Yu, Jiawei Han and Philip S. Yu. Mining Behavior Graphs for “Backtrace” of Noncrashing Bugs. Proceedings of the 5th SIAM International Conference on Data Mining (SDM). 2005.
[MAF08] Mary McGlohon, Leman Akoglu and Christos Faloutsos. Weighted Graphs and Disconnected Components: Patterns and a Generator. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2008.
[Mas09] Wes Masri. Fault Localization Based on Information Flow Coverage. Software Testing, Verification and Reliability, 20(2):121–147, 2009.
[Mat89] Friedemann Mattern. Virtual Time and Global States of Distributed Systems. Proceedings of the International Workshop on Parallel and Distributed Algorithms. 1989.
[Mit97] Tom Mitchell. Machine Learning. McGraw Hill, 1997.
[MQB07] Madanlal Musuvathi, Shaz Qadeer and Thomas Ball. CHESS: A Systematic Testing Tool for Concurrent Software. Tech. Rep. MSR-TR-2007-149, Microsoft Research, 2007.
[MTV97] Heikki Mannila, Hannu Toivonen and A. Inkeri Verkamo. Discovery of Frequent Episodes in Event Sequences. Data Mining and Knowledge Discovery, 1(3):259–289, 1997.
[NBZ06] Nachiappan Nagappan, Thomas Ball and Andreas Zeller. Mining Metrics to Predict Component Failures. Proceedings of the 28th International Conference on Software Engineering (ICSE). 2006.
[NK03] Siegfried Nijssen and Joost N. Kok. Efficient Discovery of Frequent Unordered Trees. Proceedings of the 1st International Workshop on Mining Graphs, Trees and Sequences (MGTS) at ECML/PKDD. 2003.
[NK04] Siegfried Nijssen and Joost N. Kok. A Quickstart in Frequent Structure Mining Can Make a Difference. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2004.
[NLHP98] Raymond T. Ng, Laks V. S. Lakshmanan, Jiawei Han and Alex Pang. Exploratory Mining and Pruning Optimizations of Constrained Associations Rules. Proceedings of the ACM SIGMOD International Conference on Management of Data. 1998.

[NTU+07] Sebastian Nowozin, Koji Tsuda, Takeaki Uno, Taku Kudo and Gökhan Bakir. Weighted Substructure Mining for Image Analysis. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). 2007.
[OC03] Robert O’Callahan and Jong-Deok Choi. Hybrid Dynamic Data Race Detection. SIGPLAN Notices, 38(10):167–178, 2003.
[OO84] Karl J. Ottenstein and Linda M. Ottenstein. The Program Dependence Graph in a Software Development Environment. SIGSOFT Software Engineering Notes, 9(3):177–184, 1984.
[Pan10] Victor Pankratius. Software Engineering in the Era of Parallelism. In Victor Pankratius and Samuel Kounev, eds., Emerging Research Directions in Computer Science – Contributions from the Young Informatics Faculty in Karlsruhe, pp. 45–52. KIT Scientific Publishing, 2010.
[PH+02] Brian Pitcher, Tom Hey et al. WebLech URL Spider, 2002. Software available at http://weblech.sourceforge.net/.
[PHL04] Jian Pei, Jiawei Han and Laks V. S. Lakshmanan. Pushing Convertible Constraints in Frequent Itemset Mining. Data Mining and Knowledge Discovery, 8(3):227–252, 2004.
[PHMA+04] Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Jianyong Wang, Helen Pinto, Qiming Chen, Umeshwar Dayal and Mei-Chun Hsu. Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. IEEE Transactions on Knowledge and Data Engineering, 16(10):1424–1440, 2004.
[PHW02] Jian Pei, Jiawei Han and Wei Wang. Mining Sequential Patterns with Constraints in Large Databases. Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM). 2002.
[PWDW09] Michael Philippsen, Marc Wörlein, Alexander Dreweke and Tobias Werth. ParSeMiS – The Parallel and Sequential Mining Suite. Department of Computer Science, School of Engineering, Friedrich-Alexander University of Erlangen-Nürnberg, Erlangen, Germany, 2009. Software available at http://www2.informatik.uni-erlangen.de/EN/research/ParSeMiS/.
[QDLT09] Yao Qi, Raja Das, Zhi Da Luo and Martin Trotter. MulticoreSDK: A Practical and Efficient Data Race Detector for Real-World Applications. Proceedings of the 7th Workshop on Parallel and Distributed Systems (PADTAD). 2009.

[Qui93] John Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[RAF04] Nick Rutar, Christian B. Almazan and Jeffrey S. Foster. A Comparison of Bug Finding Tools for Java. Proceedings of the 15th International Symposium on Software Reliability Engineering (ISSRE). 2004.
[RS09] Sayan Ranu and Ambuj K. Singh. GraphSig: A Scalable Approach to Mining Significant Subgraphs in Large Graph Databases. Proceedings of the 25th International Conference on Data Engineering (ICDE). 2009.
[RTI02] Research Triangle Institute RTI. The Economic Impacts of Inadequate Infrastructure for Software Testing. Planning Report 02-3, National Institute of Standards and Technology (NIST), Gaithersburg, USA, 2002.
[SA96a] Ramakrishnan Srikant and Rakesh Agrawal. Mining Quantitative Association Rules in Large Relational Tables. Proceedings of the ACM SIGMOD International Conference on Management of Data. 1996.
[SA96b] Ramakrishnan Srikant and Rakesh Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. Proceedings of the 5th International Conference on Extending Database Technology (EDBT). 1996.
[Sau05] Frank Sauer. Eclipse Metrics Plugin, 2005. Software available at http://metrics.sourceforge.net/.
[SBN+97] Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro and Thomas Anderson. Eraser: A Dynamic Data Race Detector for Multithreaded Programs. ACM Transactions on Computer Systems, 15(4):391–411, 1997.
[SJYH09] Raul Andres Santelices, James A. Jones, Yanbing Yu and Mary Jean Harrold. Lightweight Fault-Localization Using Multiple Coverage Types. Proceedings of the 31st International Conference on Software Engineering (ICSE). 2009.
[SKT08] Hiroto Saigo, Nicole Krämer and Koji Tsuda. Partial Least Squares Regression for Graph Mining. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2008.
[SNK+09] Hiroto Saigo, Sebastian Nowozin, Tadashi Kadowaki, Taku Kudo and Koji Tsuda. gBoost: A Mathematical Programming Approach to Graph Classification and Regression. Machine Learning, 75:69–89, 2009.

[Som10] Ian Sommerville. Software Engineering. Pearson Education, 9th edn., 2010.
[SOO09] Masaki Shinoda, Tomonobu Ozaki and Takenao Ohkawa. Weighted Frequent Subgraph Mining in Weighted Graph Databases. Proceedings of the 3rd International Workshop on Domain-Driven Data Mining (DDDM). 2009.
[SZZ06] Adrian Schröter, Thomas Zimmermann and Andreas Zeller. Predicting Component Failures at Design Time. Proceedings of the 5th International Symposium on Empirical Software Engineering. 2006.
[TCG+10] Marisa Thoma, Hong Cheng, Arthur Gretton, Jiawei Han, Hans-Peter Kriegel, Alex Smola, Le Song, Philip S. Yu, Xifeng Yan and Karsten M. Borgwardt. Discriminative Frequent Subgraph Mining with Optimality Guarantees. Statistical Analysis and Data Mining, 3(5):302–318, 2010.
[TUYT07] Rachel Tzoref, Shmuel Ur and Elad Yom-Tov. Instrumenting Where it Hurts – An Automatic Concurrent Debugging Technique. Proceedings of the 16th International Symposium on Software Testing and Analysis (ISSTA). 2007.
[Vap95] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[WC87] Andrew K. C. Wong and David K. Y. Chiu. Synthesizing Statistical Knowledge from Incomplete Mixed-Mode Data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(6):796–805, 1987.
[WF05] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Series in Data Management Systems. Morgan Kaufmann, 2nd edn., 2005.
[WH04] Jianyong Wang and Jiawei Han. BIDE: Efficient Mining of Frequent Closed Sequences. Proceedings of the 20th International Conference on Data Engineering (ICDE). 2004.
[WMFP05] Marc Wörlein, Thorsten Meinl, Ingrid Fischer and Michael Philippsen. A Quantitative Comparison of the Subgraph Miners MoFa, gSpan, FFSM, and Gaston. Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD). 2005.
[Wu96] Xindong Wu. A Bayesian Discretizer for Real-Valued Attributes. The Computer Journal, 39(8):688–691, 1996.

[WZW+05] Chen Wang, Yongtai Zhu, Tianyi Wu, Wei Wang and Baile Shi. Constraint-Based Graph Mining in Large Database. Proceedings of the 7th Asia-Pacific Web Conference (APWeb). 2005.
[XTLL09] Tao Xie, Suresh Thummalapenta, David Lo and Chao Liu. Data Mining for Software Engineering. Computer, 42(8):55–62, 2009.
[XY05] Yi Xia and Yirong Yang. Mining Closed and Maximal Frequent Subtrees from Databases of Labeled Rooted Trees. IEEE Transactions on Knowledge and Data Engineering (TKDE), 17(2):190–202, 2005.
[YCHY08] Xifeng Yan, Hong Cheng, Jiawei Han and Philip S. Yu. Mining Significant Graph Patterns by Leap Search. Proceedings of the ACM SIGMOD International Conference on Management of Data. 2008.
[YH02] Xifeng Yan and Jiawei Han. gSpan: Graph-Based Substructure Pattern Mining. Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM). 2002.
[YH03] Xifeng Yan and Jiawei Han. CloseGraph: Mining Closed Frequent Graph Patterns. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2003.
[YH06] Xifeng Yan and Jiawei Han. Discovery of Frequent Substructures. In Cook and Holder [CH06], chap. 5, pp. 99–115.
[YHA03] Xifeng Yan, Jiawei Han and Ramin Afshar. CloSpan: Mining Closed Sequential Patterns in Large Databases. Proceedings of the SIAM International Conference on Data Mining (SDM). 2003.
[YHF10] Philip S. Yu, Jiawei Han and Christos Faloutsos, eds. Link Mining: Models, Algorithms, and Applications. Springer, 2010.
[Zak00] Mohammed J. Zaki. Scalable Algorithms for Association Mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372–390, 2000.
[Zak01] Mohammed J. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 42(1–2):31–60, 2001.
[Zel99] Andreas Zeller. Yesterday, my Program Worked. Today, it Does Not. Why? Proceedings of the 7th European Software Engineering Conference and the 7th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (ESEC/FSE). 1999.

[Zel02] Andreas Zeller. Isolating Cause-Effect Chains from Computer Programs. ACM SIGSOFT Software Engineering Notes, 27(6):1–10, 2002.
[Zel09] Andreas Zeller. Why Programs Fail: A Guide to Systematic Debugging. Morgan Kaufmann, 2nd edn., 2009.
[ZH02] Andreas Zeller and Ralf Hildebrandt. Simplifying and Isolating Failure-Inducing Input. IEEE Transactions on Software Engineering, 28:183–200, 2002.
[ZNZ08] Thomas Zimmermann, Nachiappan Nagappan and Andreas Zeller. Predicting Bugs from History. In Tom Mens and Serge Demeyer, eds., Software Evolution, chap. 4, pp. 69–88. Springer, 2008.
[ZYHY07] Feida Zhu, Xifeng Yan, Jiawei Han and Philip S. Yu. gPrune: A Constraint Pushing Framework for Graph Pattern Mining. Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). 2007.