The Design and Analysis of Benchmark Experiments
Torsten Hothorn (Friedrich-Alexander-Universität Erlangen-Nürnberg)
Friedrich Leisch (Technische Universität Wien)
Achim Zeileis (Wirtschaftsuniversität Wien)
Kurt Hornik (Wirtschaftsuniversität Wien)

Abstract
The assessment of the performance of learners by means of benchmark experiments is an established exercise. In practice, benchmark studies are a tool to compare the performance of several competing algorithms for a certain learning problem. Cross-validation or resampling techniques are commonly used to derive point estimates of the performances which are compared to identify algorithms with good properties. For several benchmarking problems, test procedures taking the variability of those point estimates into account have been suggested. Most of the recently proposed inference procedures are based on special variance estimators for the cross-validated performance.

We introduce a theoretical framework for inference problems in benchmark experiments and show that standard statistical test procedures can be used to test for differences in the performances. The theory is based on well defined distributions of performance measures which can be compared with established tests. To demonstrate the usefulness in practice, the theoretical results are applied to regression and classification benchmark studies based on artificial and real world data.

Keywords: model comparison, performance, hypothesis testing, cross-validation, bootstrap.

This is a preprint of an article published in the Journal of Computational and Graphical Statistics, Volume 14, Number 3, Pages 675–699. Copyright © 2005 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America.

1. Introduction

In statistical learning we refer to a benchmark study as an empirical experiment with the aim of comparing learners or algorithms with respect to a certain performance measure. The quality of several candidate algorithms is usually assessed by point estimates of their performances on some data set or some data generating process of interest.

Although nowadays commonly used in the above sense, the term "benchmarking" has its root in geology. Patterson (1992) describes the original meaning in land surveying as follows: A benchmark in this context is a mark, which was mounted on a rock, a building or a wall. It was a reference mark to define the position or the height in topographic surveying or to determine the time for dislocation.

In analogy to the original meaning, in benchmark experiments we measure performances in a landscape of learning algorithms while standing on a reference point, the data generating process of interest. But in contrast to geological measurements of heights or distances, the statistical measurements of performance are not sufficiently described by point estimates, as they are influenced by various sources of variability. Hence, we have to take this stochastic nature of the measurements into account when making decisions about the shape of our algorithm landscape, that is, when deciding which learner performs best on a given data generating process.
The assessment of the quality of an algorithm with respect to a certain performance measure, for example misclassification or mean squared error in supervised classification and regression, has been addressed in many research papers of the last three decades. The estimation of the generalisation error by means of some form of cross-validation started with the pioneering work of Stone (1974), and major improvements were published by Efron (1983, 1986) and Efron and Tibshirani (1997); for an overview we refer to Schiavo and Hand (2000). The topic is still a matter of current interest, as indicated by recent empirical (Wolpert and Macready 1999; Bylander 2002), algorithmic (Blockeel and Struyf 2002) and theoretical (Dudoit and van der Laan 2005) investigations.

However, the major goal of benchmark experiments is not only the performance assessment of different candidate algorithms but the identification of the best among them. The comparison of algorithms with respect to point estimates of performance measures, for example computed via cross-validation, is an established exercise, at least among statisticians influenced by the "algorithmic modelling culture" (Breiman 2001b). In fact, many of the popular benchmark problems first came up in the statistical literature, such as the Ozone and Boston Housing problems (by Breiman and Friedman 1985). Friedman (1991) contributed the standard artificial regression problems. Other well known data sets like the Pima Indian diabetes data or the forensic glass problem play a major role in textbooks in this field (e.g., Ripley 1996). Further examples are recent benchmark studies (as for example Meyer, Leisch, and Hornik 2003), or research papers illustrating the gains of refinements to the bagging procedure (Breiman 2001a; Hothorn and Lausen 2003).

However, the problem of identifying a superior algorithm is structurally different from the performance assessment task, although we note that asymptotic arguments indicate that cross-validation is able to select the best algorithm when provided with infinitely large learning samples (Dudoit and van der Laan 2005), because the variability tends to zero. In any case, the comparison of raw point estimates in finite sample situations does not take their variability into account, thus leading to uncertain decisions without controlling any error probability.

While many solutions to the instability problem suggested in the last years are extremely successful in reducing the variance of algorithms by turning weak into strong learners, especially ensemble methods like boosting (Freund and Schapire 1996), bagging (Breiman 1996a) or random forests (Breiman 2001a), the variability of performance measures and associated test procedures has received less attention. The taxonomy of inference problems in the special case of supervised classification problems developed by Dietterich (1998) is helpful to distinguish between several problem classes and approaches. For a data generating process under study, we may either want to select the best out of a set of candidate algorithms or to choose one out of a set of predefined fitted models ("classifiers"). Dietterich (1998) distinguishes between situations where we are faced with large or small learning samples. Standard statistical test procedures are available for comparing the performance of fitted models when an independent test sample is available (questions 3 and 4 in Dietterich 1998), and some benchmark studies restrict themselves to those applications (Bauer and Kohavi 1999). The problem whether some out of a set of candidate algorithms outperform all others in a large (question 7) and small sample situation (question 8) is commonly addressed by the derivation of special variance estimators and associated tests. Estimates of the variability of the naive bootstrap estimator of misclassification error are given in Efron and Tibshirani (1997). Some procedures for solving question 8, such as the 5 × 2 cv test, are given by Dietterich (1998), further investigated by Alpaydin (1999) and applied in a benchmark study on ensemble methods (Dietterich 2000). Pizarro, Guerrero, and Galindo (2002) suggest using some classical multiple test procedures for solving this problem. Mixed models are applied for the comparison of algorithms across benchmark problems (for example Lim, Loh, and Shih 2000; Kim and Loh 2003).

A basic problem common to these approaches is that the correlation between internal performance estimates, such as those calculated for each fold in k-fold cross-validation, violates the assumption of independence. This fact is either ignored when the distribution of newly suggested test statistics under the null hypothesis of equal performances is investigated (for example in Dietterich 1998; Alpaydin 1999; Vehtari and Lampinen 2002), or special variance estimators taking this correlation into account are derived (Nadeau and Bengio 2003).
In this paper, we introduce a sound and flexible theoretical framework for the comparison of candidate algorithms and for algorithm selection in arbitrary learning problems. The approach to the inference problem in benchmark studies presented here is fundamentally different from the procedures cited above: We show how one can sample from a well defined distribution of a certain performance measure, conditional on a data generating process, in an independent way. Consequently, standard statistical test procedures can be used to test many hypotheses of interest in benchmark studies, and no special purpose procedures are necessary. The definition of appropriate sampling procedures makes special "a posteriori" adjustments to variance estimators unnecessary. Moreover, no restrictions or additional assumptions are required, neither on the candidate algorithms (like linearity in variable selection, see George 2000, for an overview) nor on the data generating process.

Throughout the paper we assume that a learning sample of n observations L = {z_1, ..., z_n} is given and that a set of candidate algorithms is available as potential problem solvers. Each of those candidates is a two step algorithm a: In the first step a model is fitted based on a learning sample L, yielding a function a(· | L) which, in a second step, can be used to compute certain objects of interest. For example, in a supervised learning problem, those objects of interest are predictions of the response based on input variables; in unsupervised situations like density estimation, a(· | L) may return an estimated density. When we search for the best solution, the candidates need to be compared by some problem specific performance measure. Such a measure depends on the algorithm and on the learning sample drawn from some data generating process: The function p(a, L) assesses the performance of the function a(· | L), that is, the performance of algorithm a based on learning sample L. Since L is a random learning sample, p(a, L) is a random variable whose variability is induced by the variability of learning samples following the same data generating process as L. It is therefore natural to compare the distributions of the performance measures when we need to decide whether any of the candidate algorithms performs better than all the others.

The idea is to draw independent random samples from the distribution of the performance measure for an algorithm a by evaluating p(a, L), where the learning sample L follows a properly defined data generating process which reflects our knowledge about the world. By using appropriate and well investigated statistical test procedures we are able to test hypotheses about the distributions of the performance measures of a set of candidates and, consequently, we are in the position to control the error probability of falsely declaring any of the candidates the winner.

We derive the theoretical basis of our proposal in Section 2 and focus on the special case of regression and classification problems in Section 3. Once appropriate random samples from the performance distribution have been drawn, established statistical test and analysis procedures can be applied, and we briefly review the most interesting of them in Section 4. In particular, we focus on tests for some inference problems which are addressed in the applications presented in Section 5.

2. Comparing performance measures

In this section we introduce a general framework for the comparison of candidate algorithms. Independent samples from the distributions of the performance measures are drawn conditionally on the data generating process of interest. We show how standard statistical test procedures can be used in benchmark studies, for example in order to test the hypothesis of equal performances.

Suppose that B independent and identically distributed learning samples have been drawn from some data generating process DGP,

$$L^1 = \{z_1^1, \ldots, z_n^1\} \sim DGP, \quad \ldots, \quad L^B = \{z_1^B, \ldots, z_n^B\} \sim DGP,$$

where each of the learning samples L^b (b = 1, ..., B) consists of n observations.
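Read as an algorithm, this framework amounts to a simple loop. The following R sketch (R being the system used for the computations in Section 5) illustrates it; the helpers `dgp`, `algorithms` and `perf` are hypothetical placeholders for a concrete data generating process, candidate list and performance measure, not functions defined in the paper.

```r
## Generic benchmark experiment: draw B learning samples from a data
## generating process and evaluate each of K candidate algorithms on
## every sample. `dgp(n)` returns a learning sample of n observations,
## `algorithms` is a named list of fitting functions, and
## `perf(model, dgp)` computes the performance p(a_k, L^b).
benchmark <- function(dgp, algorithms, perf, n, B) {
    K <- length(algorithms)
    p <- matrix(NA, nrow = B, ncol = K,
                dimnames = list(NULL, names(algorithms)))
    for (b in 1:B) {
        learn <- dgp(n)                                   # L^b ~ DGP
        for (k in 1:K)
            p[b, k] <- perf(algorithms[[k]](learn), dgp)  # p_kb ~ P_k(DGP)
    }
    p  # B x K matrix of performance samples
}
```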
Furthermore we assume that there are K > 1 potential candidate algorithms a_k (k = 1, ..., K) available for the solution of the underlying problem. For each algorithm a_k the function a_k(· | L^b) is based on the observations of the learning sample L^b. Hence, it is a random variable depending on L^b and has itself a distribution A_k on the function space of a_k which depends on the data generating process of the L^b:

$$a_k(\cdot \mid L^b) \sim A_k(DGP), \quad k = 1, \ldots, K.$$

For algorithms a_k with a deterministic fitting procedure (for example histograms or linear models) the function a_k(· | L^b) is fixed, whereas for algorithms involving non-deterministic fitting, or where the fitting is based on the choice of starting values or hyper parameters (for example neural networks or random forests), it is a random variable. Note that a_k(· | L^b) is a prediction function that must not depend on hyper parameters any more: The fitting procedure incorporates both the tuning and the final model fitting itself.

As sketched in Section 1, the performance of the candidate algorithm a_k when provided with the learning sample L^b is measured by a scalar function p:

$$p_{kb} = p(a_k, L^b) \sim P_k = P_k(DGP).$$

The random variable p_kb follows a distribution function P_k which again depends on the data generating process DGP. For algorithms with a non-deterministic fitting procedure this implies that it may be appropriate to integrate with respect to the distribution A_k when evaluating the performance. The K different random samples {p_k1, ..., p_kB} with B independent and identically distributed observations are drawn from the distributions P_k(DGP) for the algorithms a_k (k = 1, ..., K).

These performance distributions can be compared by both exploratory data analysis tools and formal inference procedures. The null hypothesis of interest for most problems is the equality of the candidate algorithms with respect to the distributions of their performance measures and can be formulated by writing

$$H_0: P_1 = \cdots = P_K.$$

In particular, this hypothesis implies the equality of location and variability of the performances. In order to specify an appropriate test procedure for the hypothesis above, one needs to define an alternative to test against. The alternative depends on the optimality criterion of interest, which we assess using a scalar functional φ: An algorithm a_k is better than an algorithm a_k′ with respect to a performance measure p and a functional φ if and only if φ(P_k) < φ(P_k′). The optimality criterion most commonly used is based on some location parameter such as the expectation φ(P_k) = E(P_k) or the median of the performance distribution, that is, the average expected loss. In this case we are interested in detecting differences in mean performances:

$$H_0: E(P_1) = \cdots = E(P_K) \quad \text{vs.} \quad H_1: \exists\, i, j \in \{1, \ldots, K\}: E(P_i) \neq E(P_j).$$

Other alternatives may be derived from optimality criteria focusing on the variability of the performance measures. Under any circumstances, the inference is conditional on the data generating process of interest. Examples of appropriate choices of sampling procedures for the special case of supervised learning problems are given in the next section.

3. Regression and classification

In this section we show how the general framework for testing the equality of algorithms derived in the previous section can be applied to the special but important case of supervised statistical learning problems. Moreover, we focus on applications that commonly occur in practical situations.
3.1. Comparing predictors

In supervised learning problems, the observations z in the learning sample are of the form z = (y, x), where y denotes the response variable and x describes a vector of input variables. The aim of the learning task is to construct predictors which, based on input variables only, provide us with information about the unknown response variable. Consequently, the function constructed by each of the K candidate algorithms is of the form ŷ = a_k(x | L^b). In classification problems ŷ may be the predicted class for observations with input x or the vector of estimated conditional class probabilities. In survival analysis the conditional survival curve for observations with input status x is of special interest. The discrepancy between the true response y and the predicted value ŷ for one single observation is measured by a scalar loss function L(y, ŷ). The performance measure p is defined by some functional µ of the distribution of the loss function, and the distribution of p_kb depends on the data generating process DGP only:

$$p_{kb} = p(a_k, L^b) = \mu\big(L(y, a_k(x \mid L^b))\big) \sim P_k(DGP).$$

Consequently, the randomness of z = (y, x) and the randomness induced by algorithms a_k with non-deterministic fitting are removed by appropriate integration with respect to the associated distribution functions. Again, the expectation is a common choice for the functional µ; under quadratic loss L(y, ŷ) = (y − ŷ)², the performance measure is given by the so called conditional risk

$$p_{kb} = E_{a_k} E_{z=(y,x)} L\big(y, a_k(x \mid L^b)\big) = E_{a_k} E_{z=(y,x)} \big(y - a_k(x \mid L^b)\big)^2, \qquad (1)$$

where z = (y, x) is drawn from the same distribution as the observations in a learning sample L. Other conceivable choices of µ are the median, corresponding to absolute loss, or even the supremum or theoretical quantiles of the loss functions.
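For squared error loss, the conditional risk (1) can be approximated by the mean squared error on a test sample. A minimal sketch, assuming the fitted model obeys R's usual `predict()` convention and that `test` is a data frame containing the response `y`:

```r
## Empirical version of the conditional risk (1) under quadratic loss:
## the mean squared error of the predictions on a test sample `test`.
risk <- function(model, test) {
    yhat <- predict(model, newdata = test)
    mean((test$y - yhat)^2)
}
```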
3.2. Special problems

The distributions of the performance measures P_k(DGP) for the algorithms a_k (k = 1, ..., K) depend on the data generating process DGP. Consequently, the way we draw random samples from P_k(DGP) is determined by the knowledge about the data generating process available to us. In supervised learning problems, one can distinguish two situations: Either the data generating process is known, which is the case in simulation experiments with artificially generated data or in cases where we are practically able to draw infinitely many samples (e.g., network data), or the information about the data generating process is determined by a finite learning sample L. In the latter case the empirical distribution function of L typically represents the complete knowledge about the data generating process that we are provided with.

In the following we show how random samples from the distribution of the performance measure P_k(DGP) for algorithm a_k can be drawn in three basic problems: The data generating process is known (simulation), a learning sample as well as a test sample are available (competition), or only one single learning sample is provided (real world). Special choices of the functional µ appropriate in each of the three problems will be discussed.

The simulation problem

Artificial data are generated from some distribution function Z, where each observation z_i (i = 1, ..., n) in a learning sample is distributed according to Z. The learning sample L consists of n independent observations from Z, which we denote by L ~ Z^n. In this situation the data generating process DGP = Z^n is used. Therefore we are able to draw a set of B independent learning samples from Z^n: L^1, ..., L^B ~ Z^n. We assess the performance of each algorithm a_k on all learning samples L^b (b = 1, ..., B), yielding a random sample of B observations from the performance distribution P_k(Z^n), by calculating

$$p_{kb} = p(a_k, L^b) = \mu\big(L(y, a_k(x \mid L^b))\big), \quad b = 1, \ldots, B.$$

The associated hypothesis under test is consequently H_0: P_1(Z^n) = ⋯ = P_K(Z^n). If we are not able to calculate µ analytically, we can approximate it up to any desired accuracy by drawing a test sample T ~ Z^m of m independent observations from Z, where m is large, and calculating

$$\hat p_{kb} = \hat p(a_k, L^b) = \mu_T\big(L(y, a_k(x \mid L^b))\big).$$

Here µ_T denotes the empirical analogue of µ for the test observations z = (y, x) ∈ T. When µ is defined as the expectation with respect to test samples z as in (1) (we assume a deterministic a_k for the sake of simplicity here), this reduces to the mean of the loss function evaluated for each observation in the test sample:

$$\hat p_{kb} = \hat p(a_k, L^b) = m^{-1} \sum_{z=(y,x) \in T} L\big(y, a_k(x \mid L^b)\big).$$

Analogously, the supremum would be replaced by the maximum and theoretical quantiles by their empirical counterparts.

The competition problem

In most practical applications no precise knowledge about the data generating process is available, but instead we are provided with one learning sample L ~ Z^n of n observations from some distribution function Z. The empirical distribution function Ẑ_n covers all knowledge that we have about the data generating process. Therefore, we mimic the data generating process by using the empirical distribution function of the learning sample: DGP = Ẑ_n. Now we are able to draw independent and identically distributed random samples from this emulated data generating process. In a completely non-parametric setting, the non-parametric or Bayesian bootstrap can be applied here; if the restriction to certain parametric families is appropriate, the parametric bootstrap can be used to draw samples from the data generating process. For an overview of those issues we refer to Efron and Tibshirani (1993).

Under some circumstances, an additional test sample T ~ Z^m of m observations is given, for example in machine learning competitions. In this situation, the performance needs to be assessed with respect to T only. Again, we would like to draw a random sample of B observations from P̂_k(Ẑ_n), which in this setup is possible by bootstrapping, L^1, ..., L^B ~ Ẑ_n, where P̂ denotes the distribution function of the performance measure evaluated using T; that is, the performance measure is computed by

$$\hat p_{kb} = \hat p(a_k, L^b) = \mu_T\big(L(y, a_k(x \mid L^b))\big),$$

where µ_T is again the empirical analogue of µ for all z = (y, x) ∈ T. The hypothesis we are interested in is H_0: P̂_1(Ẑ_n) = ⋯ = P̂_K(Ẑ_n), where P̂_k corresponds to the performance measure µ_T. Since the performance measure is defined in terms of one single test sample T, it should be noted that we may favour algorithms that perform well on that particular test sample T but worse on other test samples, just by chance.
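In the competition problem the sampling step therefore combines the non-parametric bootstrap with evaluation on the fixed test sample T. A sketch under the same hypothetical interface as above (`algorithm` fits a model on a data frame; `risk` is the helper from Section 3.1):

```r
## Competition problem: draw B learning samples from the empirical
## distribution function of `learn` (non-parametric bootstrap) and
## evaluate every refitted model on the fixed test sample `test`.
competition_perf <- function(algorithm, learn, test, B) {
    n <- nrow(learn)
    sapply(1:B, function(b) {
        boot <- learn[sample(n, n, replace = TRUE), ]  # L^b ~ Z^hat_n
        risk(algorithm(boot), test)                    # mu_T of the loss
    })
}
```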
The real world problem

The most common situation we are confronted with in daily routine is the existence of one single learning sample L ~ Z^n, with no dedicated independent test sample being available. Again, we mimic the data generating process by the empirical distribution function of the learning sample: DGP = Ẑ_n. We redraw B independent learning samples from the empirical distribution function by bootstrapping: L^1, ..., L^B ~ Ẑ_n. The corresponding performance measure is computed by

$$\hat p_{kb} = \hat p(a_k, L^b) = \hat\mu\big(L(y, a_k(x \mid L^b))\big),$$

where µ̂ is an appropriate empirical version of µ. There are many possibilities for choosing µ̂, and the most obvious ones are given in the following. If n is large, one can divide the learning sample into a smaller learning sample and a test sample, L = {L′, T}, and proceed with µ_T as in the competition problem. If n is not large enough for this to be feasible, the following approach is a first naive choice: In the simulation problem, the models are fitted on samples from Z^n and their performance is evaluated on samples from Z. Here, the models are trained on samples from the empirical distribution function Ẑ_n, and so we could want to assess their performance on Ẑ, which corresponds to emulating µ_T by using the learning sample L as test sample; i.e., for each model fitted on a bootstrap sample, the original learning sample L itself is used as test sample T. Except for algorithms able to compute 'honest' predictions for the observations in the learning sample (for example bagging's out-of-bag predictions, Breiman 1996b), this choice leads to overfitting problems. Those can be addressed by well known cross-validation strategies. The test sample T can be defined in terms of the out-of-bootstrap observations when evaluating µ_T:

RW-OOB. For each bootstrap sample L^b (b = 1, ..., B) the out-of-bootstrap observations L \ L^b are used as test sample.

Note that using the out-of-bootstrap observations as test sample leads to non-independent observations of the performance measure; however, their correlation vanishes as n tends to infinity. Another way is to choose a cross-validation estimator of µ:

RW-CV. Each bootstrap sample L^b is divided into k folds and the performance p̂_kb is defined as the average of the performance measure on each of those folds. Since it is possible that one observation from the original learning sample L is part of both the learning folds and the validation fold, due to sampling n-out-of-n with replacement, those observations are removed from the validation fold in order to prevent any bias. Such bias may be induced for some algorithms that perform better on observations that are part of both learning sample and test sample.

Common to all choices in this setup is that one single learning sample provides all the information. Therefore, we cannot compute the theoretical performance measures and hence cannot test hypotheses about them, as this would require more knowledge about the data generating process. The standard approach is instead to compute some empirical performance measure, such as those suggested here, to approximate the theoretical performance. For any empirical performance measure, the hypothesis needs to be formulated as H_0: P̂_1(Ẑ_n) = ⋯ = P̂_K(Ẑ_n), meaning that the inference is conditional on the performance measure under consideration.
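The RW-OOB choice is straightforward to implement. The sketch below, under the same hypothetical interface as before, uses the observations not drawn into a bootstrap sample as test sample; the duplicate-removal step required for RW-CV is omitted here.

```r
## Real world problem, RW-OOB: the out-of-bootstrap observations
## L \ L^b serve as test sample for the model fitted on L^b.
rw_oob_perf <- function(algorithm, learn, B) {
    n <- nrow(learn)
    sapply(1:B, function(b) {
        idx  <- sample(n, n, replace = TRUE)
        boot <- learn[idx, ]            # bootstrap sample L^b
        oob  <- learn[-unique(idx), ]   # observations not in L^b
        risk(algorithm(boot), oob)
    })
}
```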
4. Test procedures

As outlined in the previous sections, the problem of comparing K algorithms with respect to any performance measure reduces to the problem of comparing K numeric distribution functions, or certain characteristics of them, such as their expectations. A lot of attention has been paid to this and similar problems in the statistical literature, and so a rich toolbox can be applied here. We will comment on appropriate test procedures only for the most important test problems commonly addressed in benchmark experiments and refer to the standard literature otherwise.

4.1. Experimental designs

A matched pairs or dependent K samples design is the most natural choice for the comparison of algorithms, since the performance of all K algorithms is evaluated using the same random samples L^1, ..., L^B. We therefore use this design for the derivations in the previous sections and for the experiments in Section 5 and compare the algorithms based on the same set of learning samples. The application of an independent K samples design may be more comfortable from a statistical point of view. In particular, the derivation of confidence intervals for parameters like the difference of the misclassification errors of two algorithms, or the visualisation of the performance distributions, is straightforward in the independent K samples setup.

4.2. Analysis

A sensible test statistic for comparing two performance distributions with respect to their locations in a matched pairs design is formulated in terms of the average d̄ of the differences d_b = p_1b − p_2b (b = 1, ..., B) for the observations p_1b and p_2b of algorithms a_1 and a_2. Under the null hypothesis of equality of the performance distributions, the studentized statistic

$$t = \sqrt{B}\, \frac{\bar d}{\sqrt{(B-1)^{-1} \sum_b (d_b - \bar d)^2}} \qquad (2)$$

is asymptotically normal and follows a t-distribution with B − 1 degrees of freedom when the differences are drawn from a normal distribution. The unconditional distribution of this and other similar test statistics is derived under some parametric assumption, such as symmetry or normality, about the distributions of the underlying observations. However, we doubt that such parametric assumptions are ever appropriate for performance distributions. The question whether conditional or unconditional test procedures should be applied has some philosophical aspects and is one of the controversial questions in recent discussions (see Berger 2000; Berger, Lunneborg, Ernst, and Levine 2002, for example). In the competition and real world problems, however, the inference is conditional on an observed learning sample anyway; thus conditional test procedures, where the null distribution is determined from the data actually seen, are natural to use for the test problems addressed here. Since we are able to draw as many random samples from the performance distributions under test as required, the application of the asymptotic distribution of the test statistics of the corresponding permutation tests is possible in cases where the determination of the exact conditional distribution is difficult.

Maybe the most prominent problem is to test whether K > 2 algorithms perform equally well, against the alternative that at least one of them outperforms all other candidates. In a dependent K samples design, the test statistic

$$t^\star = \frac{\sum_k \left( B^{-1} \sum_b \hat p_{kb} - (BK)^{-1} \sum_{b,k} \hat p_{kb} \right)^2}{\sum_{k,b} \left( \hat p_{kb} - K^{-1} \sum_k \hat p_{kb} - B^{-1} \sum_b \hat p_{kb} + (BK)^{-1} \sum_{k,b} \hat p_{kb} \right)^2} \qquad (3)$$

can be used to construct a permutation test, where the distribution of t^⋆ is obtained by permuting the labels 1, ..., K of the algorithms for each sample L^b (b = 1, ..., B) independently (Pesarin 2001).
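Given the B × K matrix of performance values (rows indexed by learning samples, columns by algorithms), both statistics are a few lines of R. The matrix `p` below is assumed to come from one of the sampling sketches above; the number of permutations is an arbitrary illustrative choice.

```r
## Matched pairs comparison of two algorithms via the statistic (2):
## a paired t-test on the differences d_b = p_1b - p_2b.
d <- p[, 1] - p[, 2]
t.test(d)   # studentized statistic with B - 1 degrees of freedom

## Statistic (3) for the dependent K samples design.
tstar <- function(p) {
    effect <- colMeans(p) - mean(p)   # algorithm effects around grand mean
    resid  <- sweep(sweep(p, 1, rowMeans(p)), 2, colMeans(p)) + mean(p)
    sum(effect^2) / sum(resid^2)
}
## Permutation distribution: permute the algorithm labels within each
## learning sample (each row of p) independently.
tperm <- replicate(9999, tstar(t(apply(p, 1, sample))))
mean(c(tperm, tstar(p)) >= tstar(p))   # permutation p-value
```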
Once the global hypothesis of equality of K > 2 performances has been rejected in a dependent K samples design, it is of special interest to identify the algorithms that caused the rejection. All partial hypotheses can be tested at level α following the closed testing principle, where the hierarchical hypotheses formulate subsets of the global hypothesis and can be tested at level α; for a description see Hochberg and Tamhane (1987). However, closed testing procedures are computationally expensive for more than K = 4 algorithms. In this case, one can apply simultaneous test procedures or confidence intervals designed for the independent K samples case to the aligned performance measures (Hájek, Šidák, and Sen 1999).

It is important to note that one is able to detect very small performance differences with very high power when the number of learning samples B is large. Therefore, practical relevance instead of statistical significance needs to be assessed, for example by showing relevant superiority by means of confidence intervals. Further comments on those issues can be found in Section 6.

5. Illustrations and applications

Although the theoretical framework presented in Sections 2 and 3 covers a wider range of applications, we restrict ourselves to a few examples from regression and classification in order to illustrate the basic concepts. As outlined in Section 3.2, the degree of knowledge about the data generating process available to the investigator determines how well we can approximate the theoretical performance by using empirical performance measures. For simple artificial data generating processes from a univariate regression relationship and a two-class classification problem, we will study the power of tests based on the empirical performance measures for the simulation, competition and real world problems.

Maybe the most interesting question addressed in benchmarking experiments is "Are there any differences between state-of-the-art algorithms with respect to a certain performance measure?". For the real world problem we investigate this for some established and recently suggested supervised learning algorithms by means of three real world learning samples from the UCI repository (Blake and Merz 1998). All computations were performed within the R system for statistical computing (Ihaka and Gentleman 1996; R Development Core Team 2004), version 1.9.1.

5.1. Nested linear models

In order to compare the mean squared error of two nested linear models, consider the data generating process following the univariate regression equation

$$y = \beta_1 x + \beta_2 x^2 + \varepsilon, \qquad (4)$$

where the input x is drawn from a uniform distribution on the interval [0, 5] and the error terms are independent realisations from a standard normal distribution. We fix the regression coefficient β_1 = 2 and the number of observations in a learning sample to n = 150. Two predictive models are compared:

a_1: a simple linear regression taking x as input and therefore not including a quadratic term, and

a_2: a simple quadratic regression taking both x and x² as inputs. Consequently, the regression coefficient β_2 is estimated.

The discrepancy between a predicted value ŷ = a_k(x | L), k = 1, 2, and the response y is measured by squared error loss, L(y, ŷ) = (y − ŷ)². Basically, we are interested in checking whether algorithm a_1 performs better than algorithm a_2 for values of β_2 varying in a range between 0 and 0.16.
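The data generating process (4) and the two candidates translate directly into R. Whether the fitted models include an intercept is not spelled out in the extract, so the sketch below simply uses `lm()` with its defaults.

```r
## Data generating process (4): y = beta1 * x + beta2 * x^2 + eps with
## x ~ U[0, 5], eps ~ N(0, 1) and beta1 = 2 fixed.
dgp <- function(n = 150, beta2 = 0) {
    x <- runif(n, min = 0, max = 5)
    data.frame(y = 2 * x + beta2 * x^2 + rnorm(n), x = x)
}
a1 <- function(learn) lm(y ~ x, data = learn)           # linear model
a2 <- function(learn) lm(y ~ x + I(x^2), data = learn)  # quadratic model
```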
As described in detail in Section 3.2, both the performance measure and the sampling from the performance distribution depend on the degree of knowledge available, and we therefore distinguish between the three different problems discussed above.

Simulation

The data generating process Z^n is known by equation (4), and we are able to draw as many learning samples of n = 150 observations as we would like. It is, in principle, possible to calculate the mean squared error of the predictive functions a_1(· | L^b) and a_2(· | L^b) when learning sample L^b was observed. Consequently, we are able to formulate the test problem in terms of the performance distribution depending on the data generating process in a one-sided way:

$$H_0: E(P_1(Z^n)) \le E(P_2(Z^n)) \quad \text{vs.} \quad H_1: E(P_1(Z^n)) > E(P_2(Z^n)). \qquad (5)$$

However, closed form solutions are only possible in very simple cases, and we therefore approximate P_k(Z^n) by P̂_k(Z^n) using a large test sample T, in our case with m = 2000 observations. Some algebra shows that, for β_2 = 0 and model a_1, the variance of the performance approximated by a test sample of size m is

$$V(\hat p_{1b}) = m^{-1}\left( 350/3 \cdot (2 - \hat\beta_1)^4 + 25/2 \cdot (2 - \hat\beta_1)^2 + 2 \right).$$

In order to study the goodness of the approximation we, in addition, choose a smaller test sample with m = 150. Note that in this setup the inference is conditional on the test sample.

Competition

We are faced with a learning sample L of n = 150 observations. The performance of any algorithm is to be measured by an additional test sample T consisting of m = 150 observations. Again, the inference is conditional on the observed test sample, and we may, just by chance, observe a test sample favouring a quadratic model even if β_2 = 0. The data generating process is emulated by the empirical distribution function, DGP = Ẑ_n, and we resample by using the non-parametric bootstrap.

Real World

Most interesting and most common is the situation where the knowledge about the data generating process is completely described by one single learning sample, and the non-parametric bootstrap is used to redraw learning samples. Several performance measures are possible, and we investigate those based on the out-of-bootstrap observations (RW-OOB) and cross-validation (RW-CV) suggested in Section 3.2. For cross-validation, the performance measure is obtained from a 5-fold cross-validation estimator. Each bootstrap sample is divided into five folds and the mean squared error on each of these folds is averaged. Observations which are elements of both training and validation fold are removed from the latter. In addition, we compare the out-of-bootstrap empirical …

         Simulation           Competition   Real World
β_2      m = 2000   m = 150                 OOB     OOB-2n   CV
0.000    0.000      0.000     0.072         0.054   0.059    0.054
0.020    0.029      0.287     0.186         0.114   0.174    0.109
0.040    0.835      0.609     0.451         0.297   0.499    0.279
0.060    0.997      0.764     0.683         0.554   0.840    0.523
0.080    1.000      0.875     0.833         0.778   0.973    0.777
0.100    1.000      0.933     0.912         0.925   0.997    0.926
0.120    1.000      0.971     0.953         0.984   1.000    0.978
0.140    1.000      0.988     0.981         0.996   1.000    0.996
0.160    1.000      0.997     0.990         1.000   1.000    1.000

Table 1: Regression experiments: Power of the tests for the simulation, competition and real world problems for varying values of the regression coefficient β_2 of the quadratic term. Learning samples are of size n = 150.
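To make the connection between Table 1 and the procedures explicit, here is a compressed sketch of how a single power entry for the simulation problem might be approximated, using the `dgp`, `a1`, `a2` and `risk` helpers from above. The number of learning samples B and the idea of repeating the whole experiment to estimate the rejection rate are illustrative choices, not the settings used for Table 1.

```r
## Simulation problem: one-sided paired comparison (5) for a fixed beta2,
## with a large fixed test sample approximating the theoretical risk.
B <- 250
beta2 <- 0.04
test <- dgp(n = 2000, beta2 = beta2)      # fixed test sample T ~ Z^m
p1 <- p2 <- numeric(B)
for (b in 1:B) {
    learn <- dgp(n = 150, beta2 = beta2)  # L^b ~ Z^n
    p1[b] <- risk(a1(learn), test)
    p2[b] <- risk(a2(learn), test)
}
## Reject if the mean risk of a_1 significantly exceeds that of a_2.
t.test(p1, p2, paired = TRUE, alternative = "greater")
## Repeating this experiment many times and recording the rejection rate
## yields a power estimate comparable to one cell of Table 1.
```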