The Design and Analysis of Benchmark Experiments

Torsten Hothorn (Friedrich-Alexander-Universität Erlangen-Nürnberg), Friedrich Leisch (Technische Universität Wien), Achim Zeileis (Wirtschaftsuniversität Wien), Kurt Hornik (Wirtschaftsuniversität Wien)

Abstract

The assessment of the performance of learners by means of benchmark experiments is an established exercise. In practice, benchmark studies are a tool to compare the performance of several competing algorithms for a certain learning problem. Cross-validation or resampling techniques are commonly used to derive point estimates of the performances which are compared to identify algorithms with good properties. For several benchmarking problems, test procedures taking the variability of those point estimates into account have been suggested. Most of the recently proposed inference procedures are based on special variance estimators for the cross-validated performance. We introduce a theoretical framework for inference problems in benchmark experiments and show that standard statistical test procedures can be used to test for differences in the performances. The theory is based on well defined distributions of performance measures which can be compared with established tests. To demonstrate the usefulness in practice, the theoretical results are applied to regression and classification benchmark studies based on artificial and real world data.

Keywords: model comparison, performance, hypothesis testing, cross-validation, bootstrap.

1. Introduction

In statistical learning we refer to a benchmark study as an empirical experiment with the aim of comparing learners or algorithms with respect to a certain performance measure. The quality of several candidate algorithms is usually assessed by point estimates of their performances on some data set or some data generating process of interest. Although nowadays commonly used in the above sense, the term "benchmarking" has its root in geology. Patterson (1992) describes the original meaning in land surveying as follows: A benchmark in this context is a mark, which was mounted on a rock, a building or a wall. It was a reference mark to define the position or the height in topographic surveying or to determine the time for dislocation. In analogy to the original meaning, we measure performances in a landscape of learning algorithms while standing on a reference point, the data generating process of interest, in benchmark experiments. But in contrast to geological measurements of heights or distances the statistical measurements of performance are not sufficiently described by point estimates as they are influenced by various sources of variability. Hence, we have to take this stochastic nature of the measurements into account when making decisions about the shape of our algorithm landscape, that is, deciding which learner performs best on a given data generating process.

This is a preprint of an article published in Journal of Computational and Graphical Statistics, Volume 14, Number 3, Pages 675–699. Copyright © 2005 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America.
The assessment of the quality of an algorithm with respect to a certain performance measure, for example misclassification or mean squared error in supervised classification and regression, has been addressed in many research papers of the last three decades. The estimation of the generalisation error by means of some form of cross-validation started with the pioneering work of Stone (1974) and major improvements were published by Efron (1983, 1986) and Efron and Tibshirani (1997); for an overview we refer to Schiavo and Hand (2000). The topic is still a matter of current interest, as indicated by recent empirical (Wolpert and Macready 1999; Bylander 2002), algorithmic (Blockeel and Struyf 2002) and theoretical (Dudoit and van der Laan 2005) investigations.

However, the major goal of benchmark experiments is not only the performance assessment of different candidate algorithms but the identification of the best among them. The comparison of algorithms with respect to point estimates of performance measures, for example computed via cross-validation, is an established exercise, at least among statisticians influenced by the "algorithmic modelling culture" (Breiman 2001b). In fact, many of the popular benchmark problems first came up in the statistical literature, such as the Ozone and Boston Housing problems (by Breiman and Friedman 1985). Friedman (1991) contributed the standard artificial regression problems. Other well known datasets like the Pima Indian diabetes data or the forensic glass problem play a major role in textbooks in this field (e.g., Ripley 1996). Further examples are recent benchmark studies (as for example Meyer, Leisch, and Hornik 2003), or research papers illustrating the gains of refinements to the bagging procedure (Breiman 2001a; Hothorn and Lausen 2003). However, the problem of identifying a superior algorithm is structurally different from the performance assessment task, although we notice that asymptotic arguments indicate that cross-validation is able to select the best algorithm when provided with infinitely large learning samples (Dudoit and van der Laan 2005) because the variability tends to zero. Anyway, the comparison of raw point estimates in finite sample situations does not take their variability into account, thus
leading to uncertain decisions without controlling any error probability.

While many solutions to the instability problem suggested in the last years are extremely successful in reducing the variance of algorithms by turning weak into strong learners, especially ensemble methods like boosting (Freund and Schapire 1996), bagging (Breiman 1996a) or random forests (Breiman 2001a), the variability of performance measures and associated test procedures has received less attention. The taxonomy of inference problems in the special case of supervised classification problems developed by Dietterich (1998) is helpful to distinguish between several problem classes and approaches. For a data generating process under study, we may either want to select the best out of a set of candidate algorithms or to choose one out of a set of predefined fitted models ("classifiers"). Dietterich (1998) distinguishes between situations where we are faced with large or small learning samples. Standard statistical test procedures are available for comparing the performance of fitted models when an independent test sample is available (questions 3 and 4 in Dietterich 1998) and some benchmark studies restrict themselves to those applications (Bauer and Kohavi 1999). The problem whether some out of a set of candidate algorithms outperform all others in a large (question 7) and small sample situation (question 8) is commonly addressed by the derivation of special variance estimators and associated tests. Estimates of the variability of the naive bootstrap estimator of misclassification error are given in Efron and Tibshirani (1997). Some procedures for solving question 8 such as the 5 × 2 cv test are given by Dietterich (1998), further investigated by Alpaydin (1999) and applied in a benchmark study on ensemble methods (Dietterich 2000). Pizarro, Guerrero, and Galindo (2002) suggest to use some classical multiple test procedures for solving this problem. Mixed models are applied for the comparison of algorithms across benchmark problems (for example Lim, Loh, and Shih 2000; Kim and Loh 2003). A basic problem common to these approaches is that the correlation between internal performance estimates, such as those calculated for each fold in k-fold cross-validation, violates the assumption of independence. This fact is either ignored when the distribution of newly suggested test statistics under the null hypothesis of equal performances is investigated (for example in Dietterich 1998; Alpaydin 1999; Vehtari and Lampinen 2002) or special variance estimators taking this correlation into account are derived (Nadeau and Bengio 2003).
In this paper, we introduce a sound and flexible theoretical framework for the comparison of candidate algorithms and algorithm selection for arbitrary learning problems. The approach to the inference problem in benchmark studies presented here is fundamentally different from the procedures cited above: We show how one can sample from a well defined distribution of a certain performance measure, conditional on a data generating process, in an independent way. Consequently, standard statistical test procedures can be used to test many hypotheses of interest in benchmark studies and no special purpose procedures are necessary. The definition of appropriate sampling procedures makes special "a posteriori" adjustments to variance estimators unnecessary. Moreover, no restrictions or additional assumptions, neither to the candidate algorithms (like linearity in variable selection, see George 2000, for an overview) nor to the data generating process are required.

Throughout the paper we assume that a learning sample of n observations L = {z_1, ..., z_n} is given and a set of candidate algorithms as potential problem solvers is available. Each of those candidates is a two step algorithm a: In the first step a model is fitted based on a learning sample L yielding a function a(· | L) which, in a second step, can be used to compute certain objects of interest. For example, in a supervised learning problem, those objects of interest are predictions of the response based on input variables or, in unsupervised situations like density estimation, a(· | L) may return an estimated density. When we search for the best solution, the candidates need to be compared by some problem specific performance measure. Such a measure depends on the algorithm and the learning sample drawn from some data generating process: The function p(a, L) assesses the performance of the function a(· | L), that is the performance of algorithm a based on learning sample L. Since L is a random learning sample, p(a, L) is a random variable whose variability is induced by the variability of learning samples following the same data generating process as L. It is therefore natural to compare the distributions of the performance measures when we need to decide whether any of the candidate algorithms performs superior to all the others. The idea is to draw independent random samples from the distribution of the performance measure for an algorithm a by evaluating p(a, L), where the learning sample L follows a properly defined data generating process which reflects our knowledge about the world. By using appropriate and well investigated statistical test procedures we are able to test hypotheses about the distributions of the performance measures of a set of candidates and, consequently, we are in the position to control the error probability of falsely declaring any of the candidates as the winner.

We derive the theoretical basis of our proposal in Section 2 and focus on the special case of regression and classification problems in Section 3. Once appropriate random samples from the performance distribution have been drawn, the established statistical test and analysis procedures can be applied and we shortly review the most interesting of them in Section 4. Especially, we focus on tests for some inference problems which are addressed in the applications presented in Section 5.

2. Comparing performance measures

In this section we introduce a general framework for the comparison of candidate algorithms. Independent samples from the distributions of the performance measures are drawn conditionally on the data generating process of interest. We show how standard statistical test procedures can be used in benchmark studies, for example in order to test the hypothesis of equal performances.

Suppose that B independent and identically distributed learning samples have been drawn from some data generating process DGP:

    L^1 = {z_1^1, ..., z_n^1} ~ DGP,
    ...
    L^B = {z_1^B, ..., z_n^B} ~ DGP,
where each of the learning samples L^b (b = 1, ..., B) consists of n observations. Furthermore we assume that there are K > 1 potential candidate algorithms a_k (k = 1, ..., K) available for the solution of the underlying problem. For each algorithm a_k the function a_k(· | L^b) is based on the observations from the learning sample L^b. Hence, it is a random variable depending on L^b and has itself a distribution A_k on the function space of a_k which depends on the data generating process of the L^b:

    a_k(· | L^b) ~ A_k(DGP),   k = 1, ..., K.

For algorithms a_k with deterministic fitting procedure (for example histograms or linear models) the function a_k(· | L^b) is fixed whereas for algorithms involving non-deterministic fitting or where the fitting is based on the choice of starting values or hyper parameters (for example neural networks or random forests) it is a random variable. Note that a_k(· | L^b) is a prediction function that must not depend on hyper parameters anymore: The fitting procedure incorporates both tuning as well as the final model fitting itself.

As sketched in Section 1, the performance of the candidate algorithm a_k when provided with the learning sample L^b is measured by a scalar function p:

    p_kb = p(a_k, L^b) ~ P_k = P_k(DGP).

The random variable p_kb follows a distribution function P_k which again depends on the data generating process DGP. For algorithms with non-deterministic fitting procedure this implies that it may be appropriate to integrate with respect to its distribution A_k when evaluating its performance. The K different random samples {p_k1, ..., p_kB} with B independent and identically distributed observations are drawn from the distributions P_k(DGP) for algorithms a_k (k = 1, ..., K). These performance distributions can be compared by both exploratory data analysis tools as well as formal inference procedures. The null hypothesis of interest for most problems is the equality of the candidate algorithms with respect to the distribution of their performance measure and can be formulated by writing

    H_0: P_1 = ··· = P_K.

In particular, this hypothesis implies the equality of location and variability of the performances. In order to specify an appropriate test procedure for the hypothesis above one needs to define an alternative to test against. The alternative depends on the optimality criterion of interest, which we assess using a scalar functional φ: An algorithm a_k is better than an algorithm a_k' with respect to a performance measure p and a functional φ iff φ(P_k) < φ(P_k'). The optimality criterion most commonly used is based on some location parameter such as the expectation φ(P_k) = E(P_k) or the median of the performance distribution, that is, the average expected loss. In this case we are interested in detecting differences in mean performances:

    H_0: E(P_1) = ··· = E(P_K)  vs.  H_1: ∃ i, j ∈ {1, ..., K}: E(P_i) ≠ E(P_j).

Other alternatives may be derived from optimality criteria focusing on the variability of the performance measures. Under any circumstances, the inference is conditional on the data generating process of interest. Examples for appropriate choices of sampling procedures for the special case of supervised learning problems are given in the next section.

3. Regression and classification

In this section we show how the general framework for testing the equality of algorithms derived in the previous section can be applied to the special but important case of supervised statistical learning problems. Moreover, we focus on applications that commonly occur in practical situations.
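The computations reported in the paper were carried out in R; purely as an illustration of the sampling scheme above, the following Python sketch draws B learning samples from a hypothetical data generating process, evaluates K = 2 candidate algorithms on each, and estimates φ(P_k) = E(P_k) by the empirical means of the performance samples. The toy DGP and the candidate algorithms are assumptions made for this sketch only, not taken from the paper.

```python
import random

rng = random.Random(0)

def dgp(n):
    # hypothetical data generating process: y = 2x + standard normal noise,
    # with x uniform on [0, 5] (an illustrative assumption)
    return [(2 * x + rng.gauss(0, 1), x)
            for x in (rng.uniform(0, 5) for _ in range(n))]

def a_constant(L):
    # candidate a_1: step 1 fits the overall mean of y; step 2 ignores x
    mean_y = sum(y for y, _ in L) / len(L)
    return lambda x: mean_y

def a_slope(L):
    # candidate a_2: least-squares slope through the origin
    beta = sum(x * y for y, x in L) / sum(x * x for _, x in L)
    return lambda x, b=beta: b * x

def performance(algorithm, L, T):
    # p(a, L): mean squared error of the fitted a(. | L), approximated on a
    # large fresh test sample T from the same DGP
    predict = algorithm(L)
    return sum((y - predict(x)) ** 2 for y, x in T) / len(T)

B, n = 50, 150
candidates = [a_constant, a_slope]
# p[k][b] is one independent draw p_kb from the performance distribution P_k(DGP)
p = [[performance(a, dgp(n), dgp(2000)) for _ in range(B)] for a in candidates]
phi = [sum(row) / B for row in p]   # empirical estimates of phi(P_k) = E(P_k)
```

Because each p_kb is computed on its own independently drawn learning sample, the B observations per algorithm are i.i.d., which is exactly what permits the standard tests discussed in Section 4.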
3.1. Comparing predictors

In supervised learning problems, the observations z in the learning sample are of the form z = (y, x) where y denotes the response variable and x describes a vector of input variables. The aim of the learning task is to construct predictors which, based on input variables only, provide us with information about the unknown response variable. Consequently, the function constructed by each of the K candidate algorithms is of the form ŷ = a_k(x | L^b). In classification problems ŷ may be the predicted class for observations with input x or the vector of the estimated conditional class probabilities. In survival analysis the conditional survival curve for observations with input status x is of special interest. The discrepancy between the true response y and the predicted value ŷ for one single observation is measured by a scalar loss function L(y, ŷ). The performance measure p is defined by some functional µ of the distribution of the loss function and the distribution of p_kb depends on the data generating process DGP only:

    p_kb = p(a_k, L^b) = µ( L(y, a_k(x | L^b)) ) ~ P_k(DGP).

Consequently, the randomness of z = (y, x) and the randomness induced by algorithms a_k with non-deterministic fitting are removed by appropriate integration with respect to the associated distribution functions. Again, the expectation is a common choice for the functional µ under quadratic loss L(y, ŷ) = (y − ŷ)^2 and the performance measure is given by the so called conditional risk

    p_kb = E_{a_k} E_{z=(y,x)} L(y, a_k(x | L^b)) = E_{a_k} E_{z=(y,x)} ( y − a_k(x | L^b) )^2,   (1)

where z = (y, x) is drawn from the same distribution as the observations in a learning sample L. Other conceivable choices of µ are the median, corresponding to absolute loss, or even the supremum or theoretical quantiles of the loss functions.

3.2. Special problems

The distributions of the performance measure P_k(DGP) for algorithms a_k (k = 1, ..., K) depend on the data generating process DGP. Consequently, the way we draw random samples from P_k(DGP) is determined by the knowledge about the data generating process available to us. In supervised learning problems, one can distinguish two situations: Either the data generating process is known, which is the case in simulation experiments with artificially generated data or in cases where we are practically able to draw infinitely many samples (e.g., network data), or the information about the data generating process is determined by a finite learning sample L. In this case the empirical distribution function of L typically represents the complete knowledge about the data generating process we are provided with.

In the following we show how random samples from the distribution of the performance measure P_k(DGP) for algorithm a_k can be drawn in three basic problems: The data generating process is known (simulation), a learning sample as well as a test sample are available (competition) or one single learning sample is provided only (real world). Special choices of the functional µ appropriate in each of the three problems will be discussed.

The simulation problem

Artificial data are generated from some distribution function Z, where each observation z_i (i = 1, ..., n) in a learning sample is distributed according to Z. The learning sample L consists of n independent observations from Z which we denote by L ~ Z^n. In this situation the data generating process DGP = Z^n is used. Therefore we are able to draw a set of B independent learning samples from Z^n: L^1, ..., L^B ~ Z^n. We assess the performance of each algorithm a_k
on all learning samples L^b (b = 1, ..., B) yielding a random sample of B observations from the performance distribution P_k(Z^n) by calculating

    p_kb = p(a_k, L^b) = µ( L(y, a_k(x | L^b)) ),   b = 1, ..., B.

The associated hypothesis under test is consequently H_0: P_1(Z^n) = ··· = P_K(Z^n). If we are not able to calculate µ analytically we can approximate it up to any desired accuracy by drawing a test sample T ~ Z^m of m independent observations from Z where m is large and calculating

    p̂_kb = p̂(a_k, L^b) = µ_T( L(y, a_k(x | L^b)) ).

Here µ_T denotes the empirical analogue of µ for the test observations z = (y, x) ∈ T. When µ is defined as the expectation with respect to test samples z as in (1) (we assume a deterministic a_k for the sake of simplicity here), this reduces to the mean of the loss function evaluated for each observation in the test sample:

    p̂_kb = p̂(a_k, L^b) = m^{-1} Σ_{z=(y,x) ∈ T} L(y, a_k(x | L^b)).

Analogously, the supremum would be replaced by the maximum and theoretical quantiles by their empirical counterpart.

The competition problem

In most practical applications no precise knowledge about the data generating process is available but instead we are provided with one learning sample L ~ Z^n of n observations from some distribution function Z. The empirical distribution function Ẑ_n covers all knowledge that we have about the data generating process. Therefore, we mimic the data generating process by using the empirical distribution function of the learning sample: DGP = Ẑ_n. Now we are able to draw independent and identically distributed random samples from this emulated data generating process. In a completely non-parametric setting, the non-parametric or Bayesian bootstrap can be applied here or, if the restriction to certain parametric families is appropriate, the parametric bootstrap can be used to draw samples from the data generating process. For an overview of those issues we refer to Efron and Tibshirani (1993).

Under some circumstances, an additional test sample T ~ Z^m of m observations is given, for example in machine learning competitions. In this situation, the performance needs to be assessed with respect to T only. Again, we would like to draw a random sample of B observations from P̂_k(Ẑ_n), which in this setup is possible by bootstrapping L^1, ..., L^B ~ Ẑ_n, where P̂ denotes the distribution function of the performance measure evaluated using T, that is, the performance measure is computed by

    p̂_kb = p̂(a_k, L^b) = µ_T( L(y, a_k(x | L^b)) )

where µ_T is again the empirical analogue of µ for all z = (y, x) ∈ T. The hypothesis we are interested in is H_0: P̂_1(Ẑ_n) = ··· = P̂_K(Ẑ_n), where P̂_k corresponds to the performance measure µ_T. Since the performance measure is defined in terms of one single test sample T, it should be noted that we may favour algorithms that perform well on that particular test sample T but worse on other test samples just by chance.
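The competition setup can be sketched in a few lines: bootstrap learning samples are drawn from the empirical distribution function of the one observed learning sample, while every refitted model is judged on the same fixed test sample T. As before, this Python sketch (the paper used R) relies on an assumed toy distribution and an assumed candidate algorithm, chosen only for illustration.

```python
import random

rng = random.Random(1)

def draw(n):
    # stand-in for the unknown distribution Z (assumed form, illustration only)
    return [(2 * x + rng.gauss(0, 1), x)
            for x in (rng.uniform(0, 5) for _ in range(n))]

L = draw(150)    # the single observed learning sample, L ~ Z^n
T = draw(150)    # the dedicated competition test sample, T ~ Z^m

def bootstrap(sample):
    # one bootstrap learning sample L^b ~ Z_hat_n: n-out-of-n with replacement
    return [sample[rng.randrange(len(sample))] for _ in range(len(sample))]

def a_slope(Lb):
    # example candidate algorithm: least-squares slope through the origin
    beta = sum(x * y for y, x in Lb) / sum(x * x for _, x in Lb)
    return lambda x, b=beta: b * x

def mu_T(predict, test_sample):
    # empirical analogue mu_T of mu: mean squared loss over the fixed T
    return sum((y - predict(x)) ** 2 for y, x in test_sample) / len(test_sample)

B = 100
# B draws from the performance distribution, all evaluated on the same T
perf = [mu_T(a_slope(bootstrap(L)), T) for _ in range(B)]
```

Because every draw reuses the same T, the sketch also makes the caveat above concrete: an algorithm that happens to fit the quirks of this particular test sample is favoured in all B replications.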
The real world problem

The most common situation we are confronted with in daily routine is the existence of one single learning sample L ~ Z^n with no dedicated independent test sample being available. Again, we mimic the data generating process by the empirical distribution function of the learning sample: DGP = Ẑ_n. We redraw B independent learning samples from the empirical distribution function by bootstrapping: L^1, ..., L^B ~ Ẑ_n. The corresponding performance measure is computed by

    p̂_kb = p̂(a_k, L^b) = µ̂( L(y, a_k(x | L^b)) )

where µ̂ is an appropriate empirical version of µ. There are many possibilities of choosing µ̂ and the most obvious ones are given in the following. If n is large, one can divide the learning sample into a smaller learning sample and a test sample L = {L′, T} and proceed with µ_T as in the competition problem. If n is not large enough for this to be feasible, the following approach is a first naive choice: In the simulation problem, the models are fitted on samples from Z^n and their performance is evaluated on samples from Z. Here, the models are trained on samples from the empirical distribution function Ẑ_n and so we could want to assess their performance on Ẑ, which corresponds to emulating µ_T by using the learning sample L as test sample, i.e., for each model fitted on a bootstrap sample, the original learning sample L itself is used as test sample T. Except for algorithms able to compute 'honest' predictions for the observations in the learning sample (for example bagging's out-of-bag predictions, Breiman 1996b), this choice leads to overfitting problems. Those can be addressed by well known cross-validation strategies. The test sample T can be defined in terms of the out-of-bootstrap observations when evaluating µ_T:

RW-OOB. For each bootstrap sample L^b (b = 1, ..., B) the out-of-bootstrap observations L \ L^b are used as test sample.

Note that using the out-of-bootstrap observations as test sample leads to non-independent observations of the performance measure, however, their correlation vanishes as n tends to infinity. Another way is to choose a cross-validation estimator of µ:

RW-CV. Each bootstrap sample L^b is divided into k folds and the performance p̂_kb is defined as the average of the performance measure on each of those folds. Since it is possible that one observation from the original learning sample L is part of both the learning folds and the validation fold due to sampling n-out-of-n with replacement, those observations are removed from the validation fold in order to prevent any bias. Such bias may be induced for some algorithms that perform better on observations that are part of both learning sample and test sample.

Common to all choices in this setup is that one single learning sample provides all information. Therefore, we cannot compute the theoretical performance measures and hence cannot test hypotheses about these as this would require more knowledge about the data generating process. The standard approach is to compute some empirical performance measure, such as those suggested here, instead to approximate the theoretical performance. For any empirical performance measure, the hypothesis needs to be formulated by H_0: P̂_1(Ẑ_n) = ··· = P̂_K(Ẑ_n), meaning that the inference is conditional on the performance measure under consideration.

4. Test procedures

As outlined in the previous sections, the problem of comparing K algorithms with respect to any performance measure reduces to the problem of comparing K numeric distribution functions or
certain characteristics, such as their expectation. A lot of attention has been paid to this and similar problems in the statistical literature and so a rich toolbox can be applied here. We will comment on appropriate test procedures for only the most important test problems commonly addressed in benchmark experiments and refer to the standard literature otherwise.

4.1. Experimental designs

A matched pairs or dependent K samples design is the most natural choice for the comparison of algorithms since the performance of all K algorithms is evaluated using the same random samples L^1, ..., L^B. We therefore use this design for the derivations in the previous sections and the experiments in Section 5 and compare the algorithms based on the same set of learning samples. The application of an independent K sample design may be more comfortable from a statistical point of view. Especially the derivation of confidence intervals for parameters like the difference of the misclassification errors of two algorithms or the visualisation of the performance distributions is straightforward in the independent K samples setup.

4.2. Analysis

A sensible test statistic for comparing two performance distributions with respect to their locations in a matched pairs design is formulated in terms of the average d̄ of the differences d_b = p_1b − p_2b (b = 1, ..., B) for the observations p_1b and p_2b of algorithms a_1 and a_2. Under the null hypothesis of equality of the performance distributions, the studentized statistic

    t = √B d̄ / √( (B − 1)^{-1} Σ_b (d_b − d̄)^2 )   (2)

is asymptotically normal and follows a t-distribution with B − 1 degrees of freedom when the differences are drawn from a normal distribution. The unconditional distribution of this and other similar test statistics is derived under some parametric assumption, such as symmetry or normality, about the distributions of the underlying observations. However, we doubt that such parametric assumptions to performance distributions are ever appropriate. The question whether conditional or unconditional test procedures should be applied has some philosophical aspects and is one of the controversial questions in recent discussions (see Berger 2000; Berger, Lunneborg, Ernst, and Levine 2002, for example). In the competition and real world problems however, the inference is conditional on an observed learning sample anyway, thus conditional test procedures, where the null distribution is determined from the data actually seen, are natural to use for the test problems addressed here. Since we are able to draw as many random samples from the performance distributions under test as required, the application of the asymptotic distribution of the test statistics of the corresponding permutation tests is possible in cases where the determination of the exact conditional distribution is difficult.

Maybe the most prominent problem is to test whether K > 2 algorithms perform equally well against the alternative that at least one of them outperforms all other candidates. In a dependent K samples design, the test statistic

    t* = Σ_k ( B^{-1} Σ_b p̂_kb − (BK)^{-1} Σ_{k,b} p̂_kb )^2 /
         Σ_{k,b} ( p̂_kb − K^{-1} Σ_k p̂_kb − B^{-1} Σ_b p̂_kb + (BK)^{-1} Σ_{k,b} p̂_kb )^2   (3)

can be used to construct a permutation test, where the distribution of t* is obtained by permuting the labels 1, ..., K of the algorithms for each sample L^b (b = 1, ..., B) independently (Pesarin 2001).
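For K = 2 the matched pairs permutation test is particularly simple: exchanging the two algorithm labels within one learning sample flips the sign of the corresponding difference d_b. A minimal Python sketch of statistic (2) and a Monte Carlo approximation of its conditional permutation distribution follows (the synthetic performance samples are illustrative assumptions, not data from the paper):

```python
import math
import random

rng = random.Random(2)

def studentized(d):
    # statistic (2): sqrt(B) * mean(d) / sample standard deviation of d
    B = len(d)
    dbar = sum(d) / B
    s = math.sqrt(sum((x - dbar) ** 2 for x in d) / (B - 1))
    return math.sqrt(B) * dbar / s

def permutation_test(p1, p2, nperm=5000):
    # matched pairs design: exchanging the two algorithm labels within one
    # learning sample L^b flips the sign of d_b = p_1b - p_2b
    d = [a - b for a, b in zip(p1, p2)]
    t_obs = abs(studentized(d))
    exceed = sum(
        abs(studentized([x if rng.random() < 0.5 else -x for x in d])) >= t_obs
        for _ in range(nperm))
    return exceed / nperm   # two-sided Monte Carlo permutation p-value

# synthetic performance samples: algorithm 2 is better by 0.2 on average
p1 = [1.2 + rng.gauss(0, 0.4) for _ in range(50)]
p2 = [1.0 + rng.gauss(0, 0.4) for _ in range(50)]
pval = permutation_test(p1, p2)
```

Here the null distribution is generated from the observed differences themselves, which is exactly the conditional point of view advocated above: no normality assumption on the performance distributions is needed.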
Once the global hypothesis of equality of K > 2 performances could be rejected in a dependent K samples design, it is of special interest to identify the algorithms that caused the rejection. All partial hypotheses can be tested at level α following the closed testing principle, where the hierarchical hypotheses formulate subsets of the global hypothesis and can be tested at level α; for a description see Hochberg and Tamhane (1987). However, closed testing procedures are computationally expensive for more than K = 4 algorithms. In this case, one can apply simultaneous test procedures or confidence intervals designed for the independent K samples case to the aligned performance measures (Hájek, Šidák, and Sen 1999). It is important to note that one is able to detect very small performance differences with very high power when the number of learning samples B is large. Therefore, practical relevance instead of statistical significance needs to be assessed, for example by showing relevant superiority by means of confidence intervals. Further comments on those issues can be found in Section 6.

5. Illustrations and applications

Although the theoretical framework presented in Sections 2 and 3 covers a wider range of applications, we restrict ourselves to a few examples from regression and classification in order to illustrate the basic concepts. As outlined in Section 3.2, the degree of knowledge about the data generating process available to the investigator determines how well we can approximate the theoretical performance by using empirical performance measures. For simple artificial data generating processes from a univariate regression relationship and a two-class classification problem we will study the power of tests based on the empirical performance measures for the simulation, competition and real world problems.

Maybe the most interesting question addressed in benchmarking experiments is "Are there any differences between state-of-the-art algorithms with respect to a certain performance measure?". For the real world problem we investigate this for some established and recently suggested supervised learning algorithms by means of three real world learning samples from the UCI repository (Blake and Merz 1998). All computations were performed within the R system for statistical computing (Ihaka and Gentleman 1996; R Development Core Team 2004), version 1.9.1.

5.1. Nested linear models

In order to compare the mean squared error of two nested linear models consider the data generating process following a univariate regression equation

    y = β_1 x + β_2 x^2 + ε   (4)

where the input x is drawn from a uniform distribution on the interval [0, 5] and the error terms are independent realisations from a standard normal distribution. We fix the regression coefficient β_1 = 2 and the number of observations in a learning sample to n = 150. Two predictive models are compared:

a_1: a simple linear regression taking x as input and therefore not including a quadratic term, and
a_2: a simple quadratic regression taking both x and x^2 as inputs. Consequently, the regression coefficient β_2 is estimated.

The discrepancy between a predicted value ŷ = a_k(x | L), k = 1, 2, and the response y is measured by squared error loss L(y, ŷ) = (y − ŷ)^2. Basically, we are interested to check if algorithm a_1 performs better than algorithm a_2 for values of β_2 varying in a range between 0 and 0.16. As described in detail in Section 3.2, both the performance measure and the sampling from the performance distribution depend on the degree
of knowledge available and we therefore distinguish between the three different problems discussed above.

    β_2     Simulation            Competition    Real World
            m = 2000   m = 150                   OOB      OOB-2n   CV
    0.000   0.000      0.000     0.072           0.054    0.059    0.054
    0.020   0.029      0.287     0.186           0.114    0.174    0.109
    0.040   0.835      0.609     0.451           0.297    0.499    0.279
    0.060   0.997      0.764     0.683           0.554    0.840    0.523
    0.080   1.000      0.875     0.833           0.778    0.973    0.777
    0.100   1.000      0.933     0.912           0.925    0.997    0.926
    0.120   1.000      0.971     0.953           0.984    1.000    0.978
    0.140   1.000      0.988     0.981           0.996    1.000    0.996
    0.160   1.000      0.997     0.990           1.000    1.000    1.000

Table 1: Regression experiments: Power of the tests for the simulation, competition and real world problems for varying values of the regression coefficient β_2 of the quadratic term. Learning samples are of size n = 150.

Simulation

The data generating process Z^n is known by equation (4) and we are able to draw as many learning samples of n = 150 observations as we would like. It is, in principle, possible to calculate the mean squared error of the predictive functions a_1(· | L^b) and a_2(· | L^b) when learning sample L^b was observed. Consequently, we are able to formulate the test problem in terms of the performance distribution depending on the data generating process in a one-sided way:

    H_0: E(P_1(Z^n)) ≤ E(P_2(Z^n))  vs.  H_1: E(P_1(Z^n)) > E(P_2(Z^n)).   (5)

However, closed form solutions are only possible in very simple cases and we therefore approximate P_k(Z^n) by P̂_k(Z^n) using a large test sample T, in our case with m = 2000 observations. Some algebra shows that, for β_2 = 0 and model a_1, the variance of the performance approximated by a test sample of size m is

    V(p̂_1b) = m^{-1} ( 350/3 (2 − β̂_1)^4 + 25/2 (2 − β̂_1)^2 + 2 ).

In order to study the goodness of the approximation we, in addition, choose a smaller test sample with m = 150. Note that in this setup the inference is conditional under the test sample.

Competition

We are faced with a learning sample L with n = 150 observations. The performance of any algorithm is to be measured by an additional test sample T consisting of m = 150 observations. Again, the inference is conditional under the observed test sample and we may, just by chance, observe a test sample favouring a quadratic model even if β_2 = 0. The data generating process is emulated by the empirical distribution function DGP = Ẑ_n and we resample by using the non-parametric bootstrap.

Real World

Most interesting and most common is the situation where the knowledge about the data generating process is completely described by one single learning sample and the non-parametric bootstrap is used to redraw learning samples. Several performance measures are possible and we investigate those based on the out-of-bootstrap observations (RW-OOB) and cross-validation (RW-CV) suggested in Section 3.2. For cross-validation, the performance measure is obtained from a 5-fold cross-validation estimator. Each bootstrap sample is divided into five folds and the mean squared error on each of these folds is averaged. Observations which are elements of training and validation fold are removed from the latter. In addition, we compare the out-of-bootstrap empirical