La lecture en ligne est gratuite
Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres
Télécharger Lire

The Design and Analysis of Benchmark Experiments

De
22 pages
The Design and Analysis of Benchmark
Experiments
Torsten Hothorn Friedrich Leisch Achim Zeileis
Friedrich-Alexander-Universitat Technische Universitat Wien Wirtschaftsuniversitat Wien
Erlangen-Nurnberg
Kurt Hornik
Wirtschaftsuniversitat Wien
Abstract
The assessment of the performance of learners by means of benchmark experiments is an
established exercise. In practice, benchmark studies are a tool to compare the performance of
several competing algorithms for a certain learning problem. Cross-validation or resampling
techniques are commonly used to derive point estimates of the performances which are com-
pared to identify algorithms with good properties. For several benchmarking problems, test
procedures taking the variability of those point estimates into account have been suggested.
Most of the recently proposed inference procedures are based on special variance estimators
for the cross-validated performance.
We introduce a theoretical framework for inference problems in benchmark experiments
and show that standard statistical test procedures can be used to test for di erences in the
performances. The theory is based on well de ned distributions of performance measures
which can be compared with established tests. To demonstrate the usefulness in practice,
the theoretical results are applied to regression and classi cation benchmark studies based on
arti cial and real world data.
Keywords: model comparison, performance, hypothesis testing, cross-validation, bootstrap ...
Voir plus Voir moins
TheDesignandAnalysisofBenchmarkExperimentsTorstenHothornFriedrichLeischAchimZeileisFriedrich-Alexander-Universita¨tTechnischeUniversita¨tWienWirtschaftsuniversita¨tWienErlangen-Nu¨rnbergKurtHornikWirtschaftsuniversita¨tWienAbstractTheassessmentoftheperformanceoflearnersbymeansofbenchmarkexperimentsisanestablishedexercise.Inpractice,benchmarkstudiesareatooltocomparetheperformanceofseveralcompetingalgorithmsforacertainlearningproblem.Cross-validationorresamplingtechniquesarecommonlyusedtoderivepointestimatesoftheperformanceswhicharecom-paredtoidentifyalgorithmswithgoodproperties.Forseveralbenchmarkingproblems,testprocedurestakingthevariabilityofthosepointestimatesintoaccounthavebeensuggested.Mostoftherecentlyproposedinferenceproceduresarebasedonspecialvarianceestimatorsforthecross-validatedperformance.Weintroduceatheoreticalframeworkforinferenceproblemsinbenchmarkexperimentsandshowthatstandardstatisticaltestprocedurescanbeusedtotestfordifferencesintheperformances.Thetheoryisbasedonwelldefineddistributionsofperformancemeasureswhichcanbecomparedwithestablishedtests.Todemonstratetheusefulnessinpractice,thetheoreticalresultsareappliedtoregressionandclassificationbenchmarkstudiesbasedonartificialandrealworlddata.Keywords:modelcomparison,performance,hypothesistesting,cross-validation,bootstrap.1.IntroductionInstatisticallearningwerefertoabenchmarkstudyastoanempiricalexperimentwiththeaimofcomparinglearnersoralgorithmswithrespecttoacertainperformancemeasure.Thequalityofseveralcandidatealgorithmsisusuallyassessedbypointestimatesoftheirperformancesonsomedatasetorsomedatageneratingprocessofinterest.Althoughnowadayscommonlyusedintheabovesense,theterm“benchmarking”hasitsrootingeology.Patterson(1992)describestheoriginalmeaninginlandsurveyingasfollows:Abenchmarkinthiscontextisamark,whichwasmountedonarock,abuildingorawall.Itwasareferencemarktodefinethepositionortheheightintopographicsurveyingortodeterminethetimefordislocation.Inanalogytotheoriginalmeaning,wemeasureperformancesinalandscapeoflearningalgo-rithmswhilestandingonareferencepoint,thedatageneratingprocessofinterest,inbenchmarkexperiments.Butincontrasttogeologicalmeasurementsofheightsordistancesthestatisticalmeasurementsofperformancearenotsufficientlydescribedbypointestimatesastheyarein-fluencedbyvarioussourcesofvariability.Hence,wehavetotakethisstochasticnatureofthemeasurementsintoaccountwhenmakingdecisionsabouttheshapeofouralgorithmlandscape,thatis,decidingwhichlearnerperformsbestonagivendatageneratingprocess.ThisisapreprintofanarticlepublishedinJournalofComputationalandGraphicalStatistics,Volume14,Number3,Pages675–699.Copyright c2005AmericanStatisticalAssociation,InstituteofMathematicalStatistics,andInterfaceFoundationofNorthAmerica
2TheDesignandAnalysisofBenchmarkExperimentsTheassessmentofthequalityofanalgorithmwithrespecttoacertainperformancemeasure,forexamplemisclassificationormeansquarederrorinsupervisedclassificationandregression,hasbeenaddressedinmanyresearchpapersofthelastthreedecades.Theestimationofthegeneralisationerrorbymeansofsomeformofcross-validationstartedwiththepioneeringworkofStone(1974)andmajorimprovementswerepublishedbyEfron(1983,1986)andEfronandTibshirani(1997);foranoverviewwerefertoSchiavoandHand(2000).Thetopicisstillamatterofcurrentinterest,asindicatedbyrecentempirical(WolpertandMacready1999;Bylander2002),algorithmic(BlockeelandStruyf2002)andtheoretical(DudoitandvanderLaan2005)investigations.However,themajorgoalofbenchmarkexperimentsisnotonlytheperformanceassessmentofdifferentcandidatealgorithmsbuttheidentificationofthebestamongthem.Thecomparisonofalgorithmswithrespecttopointestimatesofperformancemeasures,forexamplecomputedviacross-validation,isanestablishedexercise,atleastamongstatisticiansinfluencedbythe“algo-rithmicmodellingculture”(Breiman2001b).Infact,manyofthepopularbenchmarkproblemsfirstcameupinthestatisticalliterature,suchastheOzoneandBostonHousingproblems(byBreimanandFriedman1985).Friedman(1991)contributedthestandardartificialregressionprob-lems.OtherwellknowndatasetslikethePimaindiandiabetesdataortheforensicglassproblemplayamajorroleintextbooksinthisfield(e.g.,Ripley1996).Furtherexamplesarerecentbench-markstudies(asforexampleMeyer,Leisch,andHornik2003),orresearchpapersillustratingthegainsofrefinementstothebaggingprocedure(Breiman2001a;HothornandLausen2003).However,theproblemofidentifyingasuperioralgorithmisstructurallydifferentfromtheper-formanceassessmenttask,althoughwenoticethatasymptoticargumentsindicatethatcross-validationisabletoselectthebestalgorithmwhenprovidedwithinfinitivelylargelearningsamples(DudoitandvanderLaan2005)becausethevariabilitytendstozero.Anyway,thecomparisonofrawpointestimatesinfinitesamplesituationsdoesnottaketheirvariabilityintoaccount,thusleadingtouncertaindecisionswithoutcontrollinganyerrorprobability.Whilemanysolutionstotheinstabilityproblemsuggestedinthelastyearsareextremelysuc-cessfulinreducingthevarianceofalgorithmsbyturningweakintostronglearners,especiallyensemblemethodslikeboosting(FreundandSchapire1996),bagging(Breiman1996a)orrandomforests(Breiman2001a),thevariabilityofperformancemeasuresandassociatedtestprocedureshasreceivedlessattention.ThetaxonomyofinferenceproblemsinthespecialcaseofsupervisedclassificationproblemsdevelopedbyDietterich(1998)ishelpfultodistinguishbetweenseveralproblemclassesandapproaches.Foradatageneratingprocessunderstudy,wemayeitherwanttoselectthebestoutofasetofcandidatealgorithmsortochooseoneoutofasetofpredefinedfittedmodels(“classifiers”).Dietterich(1998)distinguishesbetweensituationswherewearefacedwithlargeorsmalllearningsamples.Standardstatisticaltestproceduresareavailableforcompar-ingtheperformanceoffittedmodelswhenanindependenttestsampleisavailable(questions3and4inDietterich1998)andsomebenchmarkstudiesrestrictthemselvestothoseapplications(BauerandKohavi1999).Theproblemwhethersomeoutofasetofcandidatealgorithmsoutperformallothersinalarge(question7)andsmallsamplesituation(question8)iscommonlyaddressedbythederivationofspecialvarianceestimatorsandassociatedtests.EstimatesofthevariabilityofthenaivebootstrapestimatorofmisclassificationerroraregiveninEfronandTibshirani(1997).Someproceduresforsolvingquestion8suchasthe5×2cvtestaregivenbyDietterich(1998),furtherinvestigatedbyAlpaydin(1999)andappliedinabenchmarkstudyonensemblemethods(Dietterich2000).Pizarro,Guerrero,andGalindo(2002)suggesttousesomeclassicalmultipletestproceduresforsolvingthisproblem.Mixedmodelsareappliedforthecomparisonofalgo-rithmsacrossbenchmarkproblems(forexampleLim,Loh,andShih2000;KimandLoh2003).Abasicproblemcommontotheseapproachesisthatthecorrelationbetweeninternalperformanceestimates,suchasthosecalculatedforeachfoldink-foldcross-validation,violatestheassumptionofindependence.Thisfactiseitherignoredwhenthedistributionofnewlysuggestedteststatisticsunderthenullhypothesisofequalperformancesisinvestigated(forexampleinDietterich1998;Alpaydin1999;VehtariandLampinen2002)orspecialvarianceestimatorstakingthiscorrelationintoaccountarederived(NadeauandBengio2003).Copyright c2005AmericanStatisticalAssociation,InstituteofMathematicalStatistics,andInterfaceFoundationofNorthAmerica
TorstenHothorn,FriedrichLeisch,AchimZeileis,KurtHornik3Inthispaper,weintroduceasoundandflexibletheoreticalframeworkforthecomparisonofcan-didatealgorithmsandalgorithmselectionforarbitrarylearningproblems.Theapproachtotheinferenceprobleminbenchmarkstudiespresentedhereisfundamentallydifferentfromthepro-cedurescitedabove:Weshowhowonecansamplefromawelldefineddistributionofacertainperformancemeasure,conditionalonadatageneratingprocess,inanindependentway.Conse-quently,standardstatisticaltestprocedurescanbeusedtotestmanyhypothesesofinterestinbenchmarkstudiesandnospecialpurposeproceduresarenecessary.Thedefinitionofappropriatesamplingproceduresmakesspecial“aposteriori”adjustmentstovarianceestimatorsunnecessary.Moreover,norestrictionsoradditionalassumptions,neithertothecandidatealgorithms(likelin-earityinvariableselection,seeGeorge2000,foranoverview)nortothedatageneratingprocessarerequired.ThroughoutthepaperweassumethatalearningsampleofnobservationsL={z1,...,zn}isgivenandasetofcandidatealgorithmsaspotentialproblemsolversisavailable.Eachofthosecandidatesisatwostepalgorithma:InthefirststepamodelisfittedbasedonalearningsampleLyieldingafunctiona(∙|L)which,inasecondstep,canbeusedtocomputecertainobjectsofinterest.Forexample,inasupervisedlearningproblem,thoseobjectsofinterestarepredictionsoftheresponsebasedoninputvariablesor,inunsupervisedsituationslikedensityestimation,a(∙|L)mayreturnanestimateddensity.Whenwesearchforthebestsolution,thecandidatesneedtobecomparedbysomeproblemspecificperformancemeasure.Suchameasuredependsonthealgorithmandthelearningsampledrawnfromsomedatageneratingprocess:Thefunctionp(a,L)assessestheperformanceofthefunctiona(∙|L),thatistheperformanceofalgorithmabasedonlearningsampleL.SinceLisarandomlearningsample,p(a,L)isarandomvariablewhosevariabilityisinducedbythevariabilityoflearningsamplesfollowingthesamedatageneratingprocessasL.Itisthereforenaturaltocomparethedistributionoftheperformancemeasureswhenweneedtodecidewhetheranyofthecandidatealgorithmsperformssuperiortoalltheothers.Theideaistodrawindependentrandomsamplesfromthedistributionoftheperformancemeasureforanalgorithmabyevaluatingp(a,L),wherethelearningsampleLfollowsaproperlydefineddatageneratingprocesswhichreflectsourknowledgeabouttheworld.Byusingappropriateandwellinvestigatedstatisticaltestproceduresweareabletotesthypothesesaboutthedistributionsoftheperformancemeasuresofasetofcandidatesand,consequently,weareinthepositiontocontroltheerrorprobabilityoffalselydeclaringanyofthecandidatesasthewinner.WederivethetheoreticalbasisofourproposalinSection2andfocusonthespecialcaseofregressionandclassificationproblemsinSection3.Onceappropriaterandomsamplesfromtheperformancedistributionhavebeendrawn,theestablishedstatisticaltestandanalysisprocedurescanbeappliedandweshortlyreviewthemostinterestingoftheminSection4.Especially,wefocusontestsforsomeinferenceproblemswhichareaddressedintheapplicationspresentedinSection5.2.ComparingperformancemeasuresInthissectionweintroduceageneralframeworkforthecomparisonofcandidatealgorithms.Independentsamplesfromthedistributionsoftheperformancemeasuresaredrawnconditionallyonthedatageneratingprocessofinterest.Weshowhowstandardstatisticaltestprocedurescanbeusedinbenchmarkstudies,forexampleinordertotestthehypothesisofequalperformances.SupposethatBindependentandidenticallydistributedlearningsampleshavebeendrawnfromsomedatageneratingprocessDGP L1=z11,...,zn1DGP,. LB=z1B,...,znBDGP,Copyright c2005AmericanStatisticalAssociation,InstituteofMathematicalStatistics,andInterfaceFoundationofNorthAmerica
4TheDesignandAnalysisofBenchmarkExperimentswhereeachofthelearningsamplesLb(b=1,...,B)consistsofnobservations.FurthermoreweassumethatthereareK>1potentialcandidatealgorithmsak(k=1,...,K)availableforthesolutionoftheunderlyingproblem.Foreachalgorithmakthefunctionak(∙|Lb)isbasedontheobservationsfromthelearningsampleLb.Hence,itisarandomvariabledependingonLbandhasitselfadistributionAkonthefunctionspaceofakwhichdependsonthedatageneratingprocessoftheLb:ak(∙|Lb)∼Ak(DGP),k=1,...,K.Foralgorithmsakwithdeterministicfittingprocedure(forexamplehistogramsorlinearmodels)thefunctionak(∙|Lb)isfixedwhereasforalgorithmsinvolvingnon-deterministicfittingorwherethefittingisbasedonthechoiceofstartingvaluesorhyperparameters(forexampleneuralnetworksorrandomforests)itisarandomvariable.Notethatak(∙|Lb)isapredictionfunctionthatmustnotdependonhyperparametersanymore:Thefittingprocedureincorporatesbothtuningaswellasthefinalmodelfittingitself.AssketchedinSection1,theperformanceofthecandidatealgorithmakwhenprovidedwiththelearningsampleLbismeasuredbyascalarfunctionp:pkb=p(ak,Lb)∼Pk=Pk(DGP).TherandomvariablepkbfollowsadistributionfunctionPkwhichagaindependsonthedatageneratingprocessDGP.Foralgorithmswithnon-deterministicfittingprocedurethisimpliesthatitmaybeappropriatetointegratewithrespecttoitsdistributionAkwhenevaluatingitsperformance.TheKdifferentrandomsamples{pk1,...,pkB}withBindependentandidenticallydistributedobservationsaredrawnfromthedistributionsPk(DGP)foralgorithmsak(k=1,...,K).Theseperformancedistributionscanbecomparedbybothexploratorydataanalysistoolsaswellasformalinferenceprocedures.ThenullhypothesisofinterestformostproblemsistheequalityofthecandidatealgorithmswithrespecttothedistributionoftheirperformancemeasureandcanbeformulatedbywritingH0:P1=∙∙∙=PK.Inparticular,thishypothesisimpliestheequalityoflocationandvariabilityoftheperformances.Inordertospecifyanappropriatetestprocedureforthehypothesisaboveoneneedstodefineanalternativetotestagainst.Thealternativedependsontheoptimalitycriterionofinterest,whichweassessusingascalarfunctionalφ:Analgorithmakisbetterthananalgorithmak0withrespecttoaperformancemeasurepandafunctionalφiffφ(Pk)(Pk0).Theoptimalitycriterionmostcommonlyusedisbasedonsomelocationparametersuchastheexpectationφ(Pk)=E(Pk)orthemedianoftheperformancedistribution,thatis,theaverageexpectedloss.Inthiscaseweareinterestedindetectingdifferencesinmeanperformances:H0:E(P1)=∙∙∙=E(PK)vs.H1:i,j∈{1,...,K}:E(Pi)6=E(Pj).Otheralternativesmaybederivedfromoptimalitycriteriafocusingonthevariabilityoftheper-formancemeasures.Underanycircumstances,theinferenceisconditionalonthedatageneratingprocessofinterest.Examplesforappropriatechoicesofsamplingproceduresforthespecialcaseofsupervisedlearningproblemsaregiveninthenextsection.3.RegressionandclassificationInthissectionweshowhowthegeneralframeworkfortestingtheequalityofalgorithmsderivedintheprevioussectioncanbeappliedtothespecialbutimportantcaseofsupervisedstatisticallearningproblems.Moreover,wefocusonapplicationsthatcommonlyoccurinpracticalsituations.Copyright c2005AmericanStatisticalAssociation,InstituteofMathematicalStatistics,andInterfaceFoundationofNorthAmerica
TorstenHothorn,FriedrichLeisch,AchimZeileis,KurtHornik53.1.ComparingpredictorsInsupervisedlearningproblems,theobservationszinthelearningsampleareoftheformz=(y,x)whereydenotestheresponsevariableandxdescribesavectorofinputvariables.Theaimofthelearningtaskistoconstructpredictorswhich,basedoninputvariablesonly,provideuswithinformationabouttheunknownresponsevariable.Consequently,thefunctionconstructedbyeachoftheKcandidatealgorithmsisoftheformyˆ=ak(x|Lb).Inclassificationproblemsyˆmaybethepredictedclassforobservationswithinputxorthevectoroftheestimatedconditionalclassprobabilities.Insurvivalanalysistheconditionalsurvivalcurveforobservationswithinputstatusxisofspecialinterest.ThediscrepancybetweenthetrueresponseyandthepredictedvalueyˆforonesingleobservationismeasuredbyascalarlossfunctionL(y,yˆ).TheperformancemeasurepisdefinedbysomefunctionalµofthedistributionofthelossfunctionandthedistributionofpkbdependsonthedatageneratingprocessDGPonly:pkb=p(ak,Lb)=µLy,akx|Lb∼Pk(DGP).Consequently,therandomnessofz=(y,x)andtherandomnessinducedbyalgorithmsakwithnon-deterministicfittingareremovedbyappropriateintegrationwithrespecttotheassociateddistributionfunctions.Again,theexpectationisacommonchoiceforthefunctionalµunderquadraticlossL(y,yˆ)=(yyˆ)2andtheperformancemeasureisgivenbythesocalledconditionalrisk2pkb=EakEz=(y,x)Ly,akx|Lb=EakEz=(y,x)yakx|Lb,(1)wherez=(y,x)isdrawnfromthesamedistributionastheobservationsinalearningsampleL.Otherconceivablechoicesofµarethemedian,correspondingtoabsoluteloss,oreventhesupre-mumortheoreticalquantilesofthelossfunctions.3.2.SpecialproblemsThedistributionsoftheperformancemeasurePk(DGP)foralgorithmsak(k=1,...,K)dependonthedatageneratingprocessDGP.Consequently,thewaywedrawrandomsamplesfromPk(DGP)isdeterminedbytheknowledgeaboutthedatageneratingprocessavailabletous.Insupervisedlearningproblems,onecandistinguishtwosituations:Eitherthedatageneratingprocessisknown,whichisthecaseinsimulationexperimentswithartificiallygenerateddataorincaseswherewearepracticallyabletodrawinfinitivelymanysamples(e.g.,networkdata),ortheinformationaboutthedatageneratingprocessisdeterminedbyafinitelearningsampleL.InthiscasetheempiricaldistributionfunctionofLtypicallyrepresentsthecompleteknowledgeaboutthedatageneratingprocessweareprovidedwith.InthefollowingweshowhowrandomsamplesfromthedistributionoftheperformancemeasurePk(DGP)foralgorithmakcanbedrawninthreebasicproblems:Thedatageneratingprocessisknown(simulation),alearningsampleaswellasatestsampleareavailable(competition)oronesinglelearningsampleisprovidedonly(realworld).Specialchoicesofthefunctionalµappropriateineachofthethreeproblemswillbediscussed.ThesimulationproblemArtificialdataaregeneratedfromsomedistributionfunctionZ,whereeachobservationzi(i=1,...,n)inalearningsampleisdistributedaccordingtoZ.ThelearningsampleLconsistsofnindependentobservationsfromZwhichwedenotebyL∼Zn.InthissituationthedatageneratingprocessDGP=Znisused.ThereforeweareabletodrawasetofBindependentlearningsamplesfromZn:L1,...,LB∼Zn.WeassesstheperformanceofeachalgorithmakCopyright c2005AmericanStatisticalAssociation,InstituteofMathematicalStatistics,andInterfaceFoundationofNorthAmerica
6TheDesignandAnalysisofBenchmarkExperimentsonalllearningsamplesLb(b=1,...,B)yieldingarandomsampleofBobservationsfromtheperformancedistributionPk(Zn)bycalculatingpkb=p(ak,Lb)=µLy,akx|Lb,b=1,...,B.TheassociatedhypothesisundertestisconsequentlyH0:P1(Zn)=∙∙∙=PK(Zn).IfwearenotabletocalculateµanalyticallywecanapproximateituptoanydesiredaccuracybydrawingatestsampleT∼ZmofmindependentobservationsfromZwheremislargeandcalculatingpˆkb=pˆ(ak,Lb)=µTLy,akx|Lb.HereµTdenotestheempiricalanalogueofµforthetestobservationsz=(y,x)T.Whenµisdefinedastheexpectationwithrespecttotestsampleszasin(1)(weassumeadeterministicakforthesakeofsimplicityhere),thisreducestothemeanofthelossfunctionevaluatedforeachobservationinthelearningsampleXpˆkb=pˆ(ak,Lb)=m1Ly,akx|Lb.z=(y,x)TAnalogously,thesupremumwouldbereplacedbythemaximumandtheoreticalquantilesbytheirempiricalcounterpart.ThecompetitionproblemInmostpracticalapplicationsnopreciseknowledgeaboutthedatageneratingprocessisavailablebutinsteadweareprovidedwithonelearningsampleL∼ZnofnobservationsfromsomedistributionfunctionZ.TheempiricaldistributionfunctionZˆncoversallknowledgethatwehaveaboutthedatageneratingprocess.Therefore,wemimicthedatageneratingprocessbyusingtheempiricaldistributionfunctionofthelearningsample:DGP=Zˆn.Nowweareabletodrawindependentandidenticallydistributedrandomsamplesfromthisemulateddatageneratingprocess.Inacompletelynon-parametricsetting,thenon-parametricorBayesianbootstrapcanbeappliedhereor,iftherestrictiontocertainparametricfamiliesisappropriate,theparametricbootstrapcanbeusedtodrawsamplesfromthedatageneratingprocess.ForanoverviewofthoseissueswerefertoEfronandTibshirani(1993).Undersomecircumstances,anadditionaltestsampleT∼Zmofmobservationsisgiven,forexampleinmachinelearningcompetitions.Inthissituation,theperformanceneedstobeassessedwithrespecttoTonly.Again,wewouldliketodrawarandomsampleofBobservationsfromPˆk(Zˆn),whichinthissetupispossiblebybootstrappingL1,...,LB∼Zˆn,wherePˆdenotesthedistributionfunctionoftheperformancemeasureevaluatedusingT,thatis,theperformancemeasureiscomputedbypˆkb=pˆ(ak,Lb)=µTLy,akx|LbwhereµTisagaintheempiricalanalogueofµforallz=(y,x)T.ThehypothesisweareinterestedinisH0:Pˆ1(Zˆn)=∙∙∙=PˆK(Zˆn),wherePˆkcorrespondstotheperformancemeasureµT.SincetheperformancemeasureisdefinedintermsofonesingletestsampleT,itshouldbenotedthatwemayfavouralgorithmsthatperformwellonthatparticulartestsampleTbutworseonothertestsamplesjustbychance.Copyright c2005AmericanStatisticalAssociation,InstituteofMathematicalStatistics,andInterfaceFoundationofNorthAmerica
TorstenHothorn,FriedrichLeisch,AchimZeileis,KurtHornik7TherealworldproblemThemostcommonsituationweareconfrontedwithindailyroutineistheexistenceofonesinglelearningsampleL∼Znwithnodedicatedindependenttestsamplebeingavailable.Again,wemimicthedatageneratingprocessbytheempiricaldistributionfunctionofthelearningsample:DGP=Zˆn.WeredrawBindependentlearningsamplesfromtheempiricaldistributionfunctionbybootstrapping:L1,...,LB∼Zˆn.Thecorrespondingperformancemeasureiscomputedbypˆkb=pˆ(ak,Lb)=µˆLy,akx|Lbwhereµˆisanappropriateempiricalversionofµ.Therearemanypossibilitiesofchoosingµˆandthemostobviousonesaregiveninthefollowing.Ifnislarge,onecandividethelearningsampleintoasmallerlearningsampleandatestsampleL={L0,T}andproceedwithµTasinthecompetitionproblem.Ifnisnotlargeenoughforthistobefeasible,thefollowingapproachisafirstnaivechoice:Inthesimulationproblem,themodelsarefittedonsamplesfromZnandtheirperformanceisevaluatedonsamplesfromZ.Here,themodelsaretrainedonsamplesfromtheempiricaldistributionfunctionZˆnandsowecouldwanttoassesstheirperformanceonZˆwhichcorrespondstoemulatingµTbyusingthelearningsampleLastestsample,i.e.,foreachmodelfittedonabootstrapsample,theoriginallearningsampleLitselfisusedastestsampleT.Exceptforalgorithmsabletocompute‘honest’predictionsfortheobservationsinthelearningsample(forexamplebagging’sout-of-bagpredictions,Breiman1996b),thischoiceleadstooverfit-tingproblems.Thosecanbeaddressedbywellknowncross-validationstrategies.ThetestsampleTcanbedefinedintermsoftheout-of-bootstrapobservationswhenevaluatingµT:RW-OOB.ForeachbootstrapsampleLb(b=1,...,B)theout-of-bootstrapobservationsL\Lbareusedastestsample.Notethatusingtheout-of-bootstrapobservationsastestsampleleadstonon-independentob-servationsoftheperformancemeasure,however,theircorrelationvanishesasntendstoinfinity.Anotherwayistochooseacross-validationestimatorofµ:RW-CV.EachbootstrapsampleLbisdividedintokfoldsandtheperformancepˆkbisdefinedastheaverageoftheperformancemeasureoneachofthosefolds.SinceitispossiblethatoneobservationfromtheoriginallearningsampleLispartofboththelearningfoldsandthevalidationfoldduetosamplingn-out-of-nwithreplacement,thoseobservationsareremovedfromthevalidationfoldinordertopreventanybias.Suchbiasmaybeinducedforsomealgorithmsthatperformbetteronobservationsthatarepartofbothlearningsampleandtestsample.Commontoallchoicesinthissetupisthatonesinglelearningsampleprovidesallinformation.Therefore,wecannotcomputethetheoreticalperformancemeasuresandhencecannottesthy-pothesesabouttheseasthiswouldrequiremoreknowledgeaboutthedatageneratingprocess.Thestandardapproachistocomputesomeempiricalperformancemeasure,suchasthosesug-gestedhere,insteadtoapproximatethetheoreticalperformance.Foranyempiricalperformancemeasure,thehypothesisneedstobeformulatedbyH0:Pˆ1(Zˆn)=∙∙∙=PˆK(Zˆn),meaningthattheinferenceisconditionalontheperformancemeasureunderconsideration.4.TestproceduresAsoutlinedintheprevioussections,theproblemofcomparingKalgorithmswithrespecttoanyperformancemeasurereducestotheproblemofcomparingKnumericdistributionfunctionsorCopyright c2005AmericanStatisticalAssociation,InstituteofMathematicalStatistics,andInterfaceFoundationofNorthAmerica
8TheDesignandAnalysisofBenchmarkExperimentscertaincharacteristics,suchastheirexpectation.Alotofattentionhasbeenpaidtothisandsimilarproblemsinthestatisticalliteratureandsoarichtoolboxcanbeappliedhere.Wewillcommentonappropriatetestproceduresforonlythemostimportanttestproblemscommonlyaddressedinbenchmarkexperimentsandrefertothestandardliteratureotherwise.4.1.ExperimentaldesignsAmatchedpairsordependentKsamplesdesignisthemostnaturalchoiceforthecomparisonofalgorithmssincetheperformanceofallKalgorithmsisevaluatedusingthesamerandomsamplesL1,...,LB.WethereforeusethisdesignforthederivationsintheprevioussectionsandtheexperimentsinSection5andcomparethealgorithmsbasedonthesamesetoflearningsamples.TheapplicationofanindependentKsampledesignmaybemorecomfortablefromastatisticalpointofview.EspeciallythederivationofconfidenceintervalsforparameterslikethedifferenceofthemisclassificationerrorsoftwoalgorithmsorthevisualisationoftheperformancedistributionsisstraightforwardintheindependentKsamplessetup.4.2.AnalysisAsensibleteststatisticforcomparingtwoperformancedistributionswithrespecttotheirlocationsinamatchedpairsdesignisformulatedintermsoftheaveraged¯ofthedierencesdb=p1bp2b(b=1,...,B)fortheobservationsp1bandp2bofalgorithmsa1anda2.Underthenullhypothesisofequalityoftheperformancedistributions,thestudentizedstatistic¯dt=Br(2)(B1)1Pdbd¯2bisasymptoticallynormalandfollowsat-distributionwithB1degreesoffreedomwhenthedifferencesaredrawnfromanormaldistribution.Theunconditionaldistributionofthisandothersimilarteststatisticsisderivedundersomeparametricassumption,suchassymmetryornormality,aboutthedistributionsoftheunderlyingobservations.However,wedoubtthatsuchparametricassumptionstoperformancedistributionsareeverappropriate.Thequestionwhetherconditionalorunconditionaltestproceduresshouldbeappliedhassomephilosophicalaspectsandisoneofthecontroversialquestionsinrecentdiscussions(seeBerger2000;Berger,Lunneborg,Ernst,andLevine2002,forexample).Inthecompetitionandrealworldproblemshowever,theinferenceisconditionalonanobservedlearningsampleanyway,thusconditionaltestprocedures,wherethenulldistributionisdeterminedfromthedataactuallyseen,arenaturaltouseforthetestproblemsaddressedhere.Sinceweareabletodrawasmanyrandomsamplesfromtheperformancedistributionsundertestasrequired,theapplicationoftheasymptoticdistributionoftheteststatisticsofthecorrespondingpermutationtestsispossibleincaseswherethedeterminationoftheexactconditionaldistributionisdifficult.MaybethemostprominentproblemistotestwhetherK>2algorithmsperformequallywellagainstthealternativethatatleastoneofthemoutperformsallothercandidates.InadependentKsamplesdesign,theteststatistic2! PPPB1pˆkb(BK)1pˆkbb,kbkt?= !2(3)PPPPpˆkbK1pˆkbB1pˆkb+(BK)1pˆkbk,bkbk,bcanbeusedtoconstructapermutationtest,wherethedistributionoft?isobtainedbypermutingthelabels1,...,KofthealgorithmsforeachsampleLb(b=1,...,B)independently(Pesarin2001).Copyright c2005AmericanStatisticalAssociation,InstituteofMathematicalStatistics,andInterfaceFoundationofNorthAmerica
TorstenHothorn,FriedrichLeisch,AchimZeileis,KurtHornik9OncetheglobalhypothesisofequalityofK>2performancescouldberejectedinadependentKsamplesdesign,itisofspecialinteresttoidentifythealgorithmsthatcausedtherejection.Allpartialhypothesescanbetestedatlevelαfollowingtheclosedtestingprinciple,wherethehierarchicalhypothesesformulatesubsetsoftheglobalhypothesisandcanbetestedatlevelα,foradescriptionseeHochbergandTamhane(1987).However,closedtestingproceduresarecom-putationallyexpensiveformorethanK=4algorithms.Inthiscase,onecanapplysimultaneoustestproceduresorconfidenceintervalsdesignedfortheindependentKsamplescasetothealignedperformancemeasures(Ha´jek,Sˇida´k,andSen1999).ItisimportanttonotethatoneisabletodetectverysmallperformancedifferenceswithveryhighpowerwhenthenumberoflearningsamplesBislarge.Therefore,practicalrelevanceinsteadofstatistialsignificanceneedstobeassessed,forexamplebyshowingrelevantsuperioritybymeansofconfidenceintervals.FurthercommentsonthoseissuescanbefoundinSection6.5.IllustrationsandapplicationsAlthoughthetheoreticalframeworkpresentedinSections2and3coversawiderrangeofapplica-tions,werestrictourselvestoafewexamplesfromregressionandclassificationinordertoillustratethebasicconcepts.AsoutlinedinSection3.2,thedegreeofknowledgeaboutthedatageneratingprocessavailabletotheinvestigatordetermineshowwellwecanapproximatethetheoreticalper-formancebyusingempiricalperformancemeasures.Forsimpleartificialdatageneratingprocessesfromaunivariateregressionrelationshipandatwo-classclassificationproblemwewillstudythepoweroftestsbasedontheempiricalperformancemeasuresforthesimulation,competitionandrealworldproblems.Maybethemostinterestingquestionaddressedinbenchmarkingexperimentsis“Arethereanydifferencesbetweenstate–of–the–artalgorithmswithrespecttoacertainperformancemeasure?”.Fortherealworldproblemweinvestigatethisforsomeestablishedandrecentlysuggestedsuper-visedlearningalgorithmsbymeansofthreerealworldlearningsamplesfromtheUCIrepository(BlakeandMerz1998).AllcomputationswereperformedwithintheRsystemforstatisticalcomputing(IhakaandGentleman1996;RDevelopmentCoreTeam2004),version1.9.1.5.1.NestedlinearmodelsInordertocomparethemeansquarederroroftwonestedlinearmodelsconsiderthedatagener-atingprocessfollowingaunivariateregressionequationy=β1x+β2x2+ε(4)wheretheinputxisdrawnfromauniformdistributionontheinterval[0,5]andtheerrortermsareindependentrealisationsfromastandardnormaldistribution.Wefixtheregressioncoefficientβ1=2andthenumberofobservationsinalearningsampleton=150.Twopredictivemodelsarecompared:a1:asimplelinearregressiontakingxasinputandthereforenotincludingaquadratictermdnaa2:asimplequadraticregressiontakingbothxandx2asinputs.Consequently,theregres-sioncoefficientβ2isestimated.Thediscrepancybetweenapredictedvalueyˆ=ak(x|L),k=1,2andtheresponseyismeasuredbysquarederrorlossL(y,yˆ)=(yyˆ)2.Basically,weareinterestedtocheckifalgorithma1performsbetterthanalgorithma2forvaluesofβ2varyinginarangebetween0and0.16.AsdescribedindetailinSection3.2,boththeperformancemeasureandthesamplingfromtheperformancedistributiondependonthedegreeCopyright c2005AmericanStatisticalAssociation,InstituteofMathematicalStatistics,andInterfaceFoundationofNorthAmerica
01TheDesignandAnalysisofBenchmarkExperimentsSimulationCompetitionRealWorldβ2m=2000m=150OOBOOB-2nCV0.0000.0000.0000.0720.0540.0590.0540.0200.0290.2870.1860.1140.1740.1090.0400.8350.6090.4510.2970.4990.2790.0600.9970.7640.6830.5540.8400.52300..01800011..00000000..98373500..98132300..97275800..99977300..9727670.1201.0000.9710.9530.9841.0000.9780.1401.0000.9880.9810.9961.0000.9960.1601.0000.9970.9901.0001.0001.000wTaorblldep1r:obRleegmrsesfsoironvaeryxipnegrivmaleunetss:ofPtohweerregorfestshieontecsotesfocirenttheβ2siomfutlhaetioqnu,adcroamtipcetteitrimo.nLaenadrnrienaglsamplesareofsizen=150.ofknowledgeavailableandwethereforedistinguishbetweenthreedifferentproblemsdiscussedabove.SimulationThedatageneratingprocessZnisknownbyequation(4)andweareabletodrawasmanylearningsamplesofn=150observationsaswewouldlike.Itis,inprinciple,possibletocalculatethemeansquarederrorofthepredictivefunctionsa1(∙|Lb)anda2(∙|Lb)whenlearningsampleLbwasobserved.Consequently,weareabletoformulatethetestproblemintermsoftheperformancedistributiondependingonthedatageneratingprocessinaone-sidedway:H0:E(P1(Zn))E(P2(Zn))vs.H1:E(P1(Zn))>E(P2(Zn)).(5)However,closedformsolutionsareonlypossibleinverysimplecasesandwethereforeapproximatePk(Zn)byPˆk(Zn)usingalargetestsampleT,inourcasewithm=2000observations.Somealgebrashowsthat,forβ2=0andmodela1,thevarianceoftheperformanceapproximatedbyatestsampleofsizemisV(pb1)=m1(350/3(2βˆ1)4+25/2(2βˆ1)2+2).Inordertostudythegoodnessoftheapproximationwe,inaddition,chooseasmallertestsamplewithm=150.Notethatinthissetuptheinferenceisconditionalunderthetestsample.CompetitionWearefacedwithalearningsampleLwithn=150observations.TheperformanceofanyalgorithmistobemeasuredbyanadditionaltestsampleTconsistingofm=150observations.Again,theinferenceisconditionalundertheobservedtestsampleandwemay,justbychance,observeatestsamplefavouringaquadraticmodelevenifβ2=0.ThedatageneratingprocessisemulatedbytheempiricaldistributionfunctionDGP=Zˆnandweresamplebyusingthenon-parametricbootstrap.RealWorldMostinterestingandmostcommonisthesituationwheretheknowledgeaboutthedatageneratingprocessiscompletelydescribedbyonesinglelearningsampleandthenon-parametricbootstrapisusedtoredrawlearningsamples.Severalperformancemeasuresarepossibleandweinvesti-gatethosebasedontheout-of-bootstrapobservations(RW-OOB)andcross-validation(RW-CV)suggestedinSection3.2.Forcross-validation,theperformancemeasureisobtainedfroma5-foldcross-validationestimator.Eachbootstrapsampleisdividedintofivefoldsandthemeansquarederroroneachofthesefoldsisaveraged.Observationswhichareelementsoftrainingandvali-dationfoldareremovedfromthelatter.Inaddition,wecomparetheout-of-bootstrapempiricalCopyright c2005AmericanStatisticalAssociation,InstituteofMathematicalStatistics,andInterfaceFoundationofNorthAmerica
Un pour Un
Permettre à tous d'accéder à la lecture
Pour chaque accès à la bibliothèque, YouScribe donne un accès à une personne dans le besoin