CS535D Project: Bayesian Logistic Regression through Auxiliary Variables
Mark Schmidt

Abstract

This project deals with the estimation of Logistic Regression parameters. We first review the binary logistic regression model and the multinomial extension, including standard MAP parameter estimation with a Gaussian prior. We then turn to the case of Bayesian Logistic Regression under this same prior. We review the canonical approach of performing Bayesian Probit Regression through auxiliary variables, and extensions of this technique to Bayesian Logistic Regression and Bayesian Multinomial Regression. We then turn to the task of feature selection, outlining a trans-dimensional MCMC approach to variable selection in Bayesian Logistic Regression. Finally, we turn to the case of estimating MAP parameters and performing Bayesian Logistic Regression under L1 penalties and other sparsity-promoting priors.

1 Introduction

In this project, we examined the highly popular Logistic Regression model. This model has traditionally been appealing due to its performance in classification, the potential to use its outputs as probabilistic estimates since they are in the range [0, 1], and the interpretation of the coefficients in terms of the 'log-odds' ratio [1]. It is especially popular in biostatistical applications, where binary classification tasks occur frequently [1]. In this first part of the report, we review this model, its multi-class generalization, and standard methods of performing maximum likelihood (ML) or maximum a posteriori (MAP) parameter estimation under a zero-mean Gaussian prior for the regression coefficients.

We then turn to the case of obtaining Bayesian posterior density estimates of the regression coefficients. In particular, we examine the recently proposed extensions of the Bayesian Probit Regression auxiliary variable model to the Logistic Regression and Multinomial Regression scenarios.
Finally, we turn to the challenging task of incorporating feature selection into these models, focusing on trans-dimensional sampling methods, and MAP and/or Bayesian estimation under priors that encourage sparsity.

1.1 Binary Logistic Regression Model

We use X to denote the n-by-p design matrix, containing p features measured for n instances. We use y to denote the length-n class label vector, where the values take on either +1 or -1, corresponding to the class label for each instance. Finally, we will use w to represent the length-p vector of parameters of the model. Primarily for ease of presentation, we will not address the 'bias' term w_0 in this document, but all techniques herein are easily modified to include a bias term. Under the standard (binary) Logistic Regression model, we express the probability that an instance i belongs to the class +1 as:

    \pi(y_i = +1 | x_i, w) = \frac{1}{1 + \exp(-w^T x_i)}    (1)

For binary responses, we can compute the probability of the 'negative' class using the sum rule of probability: \pi(y_i = -1 | x_i, w) = 1 - \pi(y_i = +1 | x_i, w). We typically assume independent Gaussian priors with means of 0 and variance of v on the coefficients of the model:

    w_i \sim N(0, v)    (2)

To perform MAP parameter estimation, we take the log of the likelihood (1) over all examples, times the prior (2) over all parameters (ignoring the normalizing constant), to give the following objective function:

    f = -\sum_{i=1}^{n} \log(1 + \exp(-y_i w^T x_i)) - \frac{1}{2v} w^T w    (3)

From this expression, we see that the Maximum Likelihood estimate is obtained by setting v to \infty.
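As a concrete illustration of (1)-(3), the following minimal Matlab sketch (our own variable names, not the report's LOGREG code) evaluates the class probabilities and the penalized log-likelihood:

    % Logistic probabilities (1) and MAP objective (3).
    % X: n-by-p design matrix; y: n-by-1 labels in {-1,+1};
    % w: p-by-1 weights; v: prior variance. A production version
    % would guard the exp() calls against overflow.
    function [p, f] = logregObjective(X, y, w, v)
      margins = y .* (X * w);                % y_i * (w' * x_i)
      p = 1 ./ (1 + exp(-(X * w)));          % pi(y_i = +1 | x_i, w), eq. (1)
      f = -sum(log(1 + exp(-margins))) ...   % log-likelihood term of eq. (3)
          - (w' * w) / (2 * v);              % Gaussian prior term of eq. (3)
    end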
Differentiating the above with respect to w, we obtain the following expressions for the gradient and Hessian (using \sigma to denote the sigmoid function \sigma(x) = 1/(1 + \exp(-x))):

    g = \sum_{i=1}^{n} (1 - \sigma(y_i w^T x_i)) y_i x_i - \frac{w}{v}    (4)

    H = -\sum_{i=1}^{n} \sigma(w^T x_i) (1 - \sigma(w^T x_i)) x_i x_i^T - \frac{1}{v} I_p    (5)

We note that the Hessian is negative-definite, and subsequently that the original function is (log-)concave, indicating that any local maximizer of this objective will be a global maximizer. A simple method to maximize this objective is to repeat Newton iterations, starting from an initial value of w, until the norm of the gradient is sufficiently small (noting that the gradient will be 0 at a maximizer). This results in a simple fixed-point iterative update as follows:

    w = w - \alpha H_m^{-1} g    (6)

where H_m is a modification of the Hessian to be sufficiently negative-definite, or a negative-definite approximation to the (inverse) Hessian (see [2]). The step size \alpha can be set to 1, but convergence may be hastened by using line search methods satisfying sufficient descent conditions (see [3] or [4]). We have implemented an approach of this type making use of Matlab's 'fminunc' function in the directory LOGREG, and an example calling this code is included as exampleLOGREG.m. Other methods for optimizing this objective are discussed and compared in [4].
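For reference, the update (6) with \alpha = 1 and H_m = H can be sketched in a few lines of Matlab; this is a minimal illustration, not the fminunc-based LOGREG code (the implicit expansion in the Hessian line requires Matlab R2016b or later):

    % Newton iterations for the concave objective (3), using (4)-(6).
    function w = logregNewtonSketch(X, y, v, maxIter, tol)
      [~, p] = size(X);
      w = zeros(p, 1);
      for iter = 1:maxIter
        s = 1 ./ (1 + exp(-y .* (X * w)));      % sigma(y_i * w' * x_i)
        g = X' * ((1 - s) .* y) - w / v;        % gradient, eq. (4)
        if norm(g) < tol, break; end            % gradient is 0 at the maximizer
        mu = 1 ./ (1 + exp(-(X * w)));          % sigma(w' * x_i)
        H = -X' * (X .* (mu .* (1 - mu))) - eye(p) / v;   % Hessian, eq. (5)
        w = w - H \ g;                          % Newton step, eq. (6), alpha = 1
      end
    end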
1.2 Multinomial Logistic Regression Model

The binary Logistic Regression model has a natural extension to the case where the number of classes, K, is greater than 2. This is done using the softmax generalization [1]:

    \pi(y_{i,k} | x_i, w) = \frac{\exp(w_k^T x_i)}{\sum_{j=1}^{K} \exp(w_j^T x_i)}    (7)

In this case, we have a matrix of target labels y that is n by K, and y(i,j) is set to +1 if instance i has class j, and 0 otherwise. The weights are now expanded to a p-by-K matrix, and we now have an individual weight vector corresponding to each class. Note that writing the class probabilities in this way makes it clear that we are employing an exponential family distribution. By observing that the normalizing denominator enforces that the probabilities summed over the classes must be equal to 1, we can set the parameter vector for one of the classes to be a vector of zeros. Using this, we can see that the softmax likelihood will be identical to the binary logistic regression case when we have two classes. Note also that the coefficients used in a softmax function retain their interpretability in terms of changes to the log-odds, but these changes are now relative to the class whose parameters are set to zero [1]. Again assuming an independent Gaussian prior on the elements of w, we can write the multi-class penalized log-likelihood for use in MAP estimation as follows:

    f = \sum_{i=1}^{n} \left[ w_{y_i}^T x_i - \log \sum_{k=1}^{K} \exp(w_k^T x_i) \right] - \frac{1}{2v} \sum_{j=1}^{K} w_j^T w_j    (8)

Above, we have introduced y_i as an indicator to select the appropriate column of w for the instance i. We see that the log-likelihood term has the familiar (numerator-denominator) form; subsequently, we expect the gradient and Hessian to contain moments of the distribution. If we use SM(i,k) to denote the softmax probability of instance i for class k, and \delta_{k=j} to denote the Kronecker delta function for k and j, we express the gradient for the parameters of class k, and the Hessian for the parameters of classes k and j, as follows:

    g_k = \sum_{i=1}^{n} x_i (y_{i,k} - SM(i,k)) - \frac{w_k}{v}    (9)

    H_{kj} = -\sum_{i=1}^{n} x_i x_i^T SM(i,k) (\delta_{k=j} - SM(i,j)) - \frac{\delta_{k=j}}{v} I_p    (10)

The Hessian remains negative-definite, but now has (pK)^2 elements instead of p^2, making computing and/or inverting the Hessian much more expensive. It is noteworthy that in the softmax case we can (and will) have a higher degree of correlation between variables than we did for the binary case, since we have additional correlation between the classes.
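To make (7) and (9) concrete, here is a short vectorized Matlab sketch under our own naming (not the MLOGREG implementation described below); Y is the n-by-K indicator matrix described above, and the row-wise operations use implicit expansion (R2016b or later):

    % Softmax probabilities (7) and the per-class gradients (9).
    function [SM, G] = softmaxGradSketch(X, Y, W, v)
      A = X * W;                    % A(i,k) = w_k' * x_i
      A = A - max(A, [], 2);        % subtract row maxima for numerical stability
      E = exp(A);
      SM = E ./ sum(E, 2);          % SM(i,k), eq. (7)
      G = X' * (Y - SM) - W / v;    % column k of G is g_k from eq. (9)
    end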
We have implemented MAP estimation for multinomial regression, making use of Matlab's 'fminunc' function (and hence using updates based on the gradient and an inverse Hessian approximation, as discussed for the binary case), in the directory 'MLOGREG', and an example calling this code is included as exampleMLOGREG.m. Note that this code is not vectorized, so it could be made much more efficient.

2 Bayesian Auxiliary Variable Methods

Above, we have described the logistic and multinomial regression models in detail, and overviewed some straightforward methods to perform MAP parameter estimation in such models under a Gaussian prior. However, we would much rather be doing Bayesian parameter estimation in these models, in order to obtain posterior distributions of the model parameters. We now turn to Bayesian methods of estimating posterior distributions in logistic regression models. In particular, we will focus on the Gibbs sampling method proposed in [5], which employs auxiliary variables and joint updates to the regression coefficients and auxiliary variables.

2.1 Binary Probit Regression

As discussed in class, we can derive conjugate priors for the logistic regression likelihood function, but they are not terribly intuitive. Fortunately, we can transform the model into an equivalent formulation that includes auxiliary variables and admits standard conjugate priors to the likelihood function. This method is an extension of the well-known auxiliary variable method for Binary Probit Regression of [6]. Before discussing logistic regression, we will first review the simpler Bayesian Binary Probit Regression model, as presented in [5]. Using \Phi to denote the Gaussian cumulative distribution function, Binary Probit Regression uses the following likelihood:

    \pi(y_i = 1 | x_i) = \Phi(x_i^T w)    (11)
Since there is no conjugate prior to the Gaussian cumulative distribution function, we introduce a set of n auxiliary variables z_i with:

    z_i = x_i^T w + \epsilon_i    (12)

Here, \epsilon_i \sim N(0, 1), and y_i takes the value of 1 iff z_i is positive. Introducing the auxiliary variables gives an equivalent model, but this model is more amenable to sampling, since w is removed from the likelihood. In the particular case of a Gaussian prior on w, it admits a straightforward Gibbs sampling strategy, where each z_i is sampled from an independent (univariate) truncated Gaussian distribution and w can be sampled from a multivariate Gaussian distribution. Specifically, a straightforward Gibbs sampling strategy with \pi(w) \sim N(b, v) can be implemented using the following [5]:

    z_i | w \sim N(x_i^T w, 1) I(z_i > 0)  if y_i = 1    (13)
    z_i | w \sim N(x_i^T w, 1) I(z_i \leq 0)  otherwise    (14)

    w | z, y \sim N(B, V)    (15)
    B = V (v^{-1} b + X^T z)    (16)
    V = (v^{-1} + X^T X)^{-1}    (17)

As before, we will assume that the mean b of the prior on the regression coefficients is zero. Following from this, we note that B above is the solution of the normal equations in Least Squares estimation, and that V is the corresponding inverse Hessian (corresponding to the inverse covariance or precision matrix). The Gibbs sampler produced by the above strategy is trivial to implement (given currently available linear algebra software), since it only requires sampling from a multivariate Gaussian and from truncated univariate Gaussians. Unfortunately, as discussed in class, this straightforward Gibbs sampling approach is inefficient, since the elements of w are highly correlated with the elements of z.
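The componentwise sampler (13)-(17) is short enough to sketch in full. The following minimal Matlab illustration assumes b = 0 and is our own code, not PROBREGSAMP; for brevity it draws the truncated normals by inverse-CDF transforms (norminv and normcdf are from the Statistics Toolbox), rather than by the rejection sampling used in our implementation:

    % Basic Gibbs sampler for Bayesian probit regression, eqs. (13)-(17).
    function Wsamp = probitGibbsSketch(X, y, v, nSamples)
      [n, p] = size(X);
      V = inv(eye(p) / v + X' * X);            % eq. (17), scalar prior variance v
      L = chol(V, 'lower');
      w = zeros(p, 1);
      z = zeros(n, 1);
      Wsamp = zeros(p, nSamples);
      for s = 1:nSamples
        m = X * w;
        u = rand(n, 1);
        pos = (y == 1);                        % z_i > 0 when y_i = 1, eq. (13)
        F = normcdf(-m);                       % mass of N(m_i, 1) below zero
        z(pos) = m(pos) + norminv(F(pos) + u(pos) .* (1 - F(pos)));
        z(~pos) = m(~pos) + norminv(u(~pos) .* F(~pos));   % z_i <= 0, eq. (14)
        B = V * (X' * z);                      % eq. (16) with b = 0
        w = B + L * randn(p, 1);               % eq. (15)
        Wsamp(:, s) = w;
      end
    end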
To combat the correlation inherent in w and z in the above model, [5] proposed a method to update w and z jointly, using the product rule to decompose the joint probability of the model as follows:

    \pi(w, z | y) = \pi(z | y) \pi(w | z)    (19)

The proposed method samples each z_i from a Gaussian distribution with means and variances derived from a leave-one-out marginal predictive density (see (5) in [5]), updating the conditional means after each update to z_i, and then samples w from its conditional normal distribution after all of the z_i have been sampled. Although results are presented showing that this joint updating strategy offers an advantage, we will not review it in detail, since there exists a simpler method to facilitate joint updating in logistic regression. Nevertheless, we have implemented this block updating scheme (based on the pseudocode presented in the paper) in the directory 'PROBREGSAMP'. For our implementations, we used a simple rejection sampling approach for sampling from truncated distributions, where each rejected sample restricts the sampling density envelope. An example showing how to run this code is at examplePROBREGSAMP.

2.2 Binary Logistic Regression

Beginning from the binary Bayesian Probit Regression model above, [5] propose to perform binary Bayesian Logistic Regression by replacing the independent Gaussian priors on the noise terms \epsilon_i with independent logistic distributions. Unfortunately, this significantly complicates the simple sampling strategies above. To facilitate straightforward sampling of this model, [5] introduce an additional set of auxiliary variables \lambda_{1:n}, and modify the noise function to be a scale mixture of normals with a marginal logistic distribution, as follows (where KS denotes the Kolmogorov-Smirnov distribution):

    \epsilon_i \sim N(0, \lambda_i)    (20)
    \lambda_i = (2 \psi_i)^2    (21)
    \psi_i \sim KS    (22)

If we temporarily view \lambda as constant, we see that this is identical to the Probit model above, except that each value of z_i has an individual term \lambda_i for its noise variance. Subsequently, we know how to sample from this model for fixed \lambda: it simply involves using Weighted Least Squares instead of Least Squares (with the associated inverse Hessian), and sampling from individual truncated normals that have different variances. We can therefore implement a Gibbs sampler if we are able to sample from the KS distribution. Fortunately, [5] outline a rejection sampling method to simulate from the KS distribution, using the Generalized Inverse Gaussian as the sampling density. Thus, we can implement a straightforward Gibbs sampler for binary Bayesian Logistic Regression using this rejection sampling approach, in addition to the following:

    z_i | w, \lambda \sim N(x_i^T w, \lambda_i) I(z_i > 0)  if y_i = 1    (23)
    z_i | w, \lambda \sim N(x_i^T w, \lambda_i) I(z_i \leq 0)  otherwise    (24)

    w | z, y, \lambda \sim N(B, V)    (25)
    B = V (v^{-1} b + X^T W z)    (26)
    V = (v^{-1} + X^T W X)^{-1}    (27)
    W = diag(\lambda)^{-1}    (28)

Two approaches are presented in [5] to perform block sampling of the parameters. The first uses the same strategy as in the Probit model, where w and z are updated jointly in the same way, using the above modifications to the conditionals, followed by an update to the additional auxiliary variables \lambda. We implemented this strategy (based on the pseudocode from the paper) in the directory 'LOGREGSAMP'; exampleLOGRESAMP is an example script calling this function.
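For fixed \lambda, the w update (25)-(28) is simply a Bayesian Weighted Least Squares step. A minimal Matlab sketch with b = 0, assuming X, z, lambda, v, and p are already in scope (our own code, not LOGREGSAMP):

    W = diag(1 ./ lambda);                     % W = diag(lambda)^{-1}, eq. (28)
    V = inv(eye(p) / v + X' * W * X);          % eq. (27)
    B = V * (X' * W * z);                      % eq. (26) with b = 0
    w = B + chol(V, 'lower') * randn(p, 1);    % draw from N(B, V), eq. (25)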
The second (and the authors' preferred) strategy for block sampling presented in [5] updates z and \lambda jointly, followed by an update to w. Sampling w and \lambda remains identical in this approach, but sampling z_i becomes easier: in this approach, z_i | w, \lambda follows a truncated logistic distribution with mean x_i^T w and a scale of 1. Not only does this obviate the need for computing marginal predictive densities, but the inverse of the cumulative distribution function of the logistic distribution has a closed form, and it is subsequently trivial to sample from (although we again used a simple adaptive rejection sampling technique in our implementation). We implemented this strategy (based on the pseudocode from the paper) in the directory LOGREGSAMP; exampleLOGRESAMP2 is an example script calling this function.
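Because the logistic inverse CDF is available in closed form, such a truncated draw takes only a few lines. The following Matlab fragment is a sketch under our own naming (m = x_i' * w, yi the label), not the paper's pseudocode or our adaptive rejection sampler:

    % Inverse-CDF draw from a logistic(m, 1) truncated by the sign constraint.
    F0 = 1 / (1 + exp(m));          % logistic CDF at zero: F(0) = 1/(1 + exp(m))
    if yi == 1
      u = F0 + rand * (1 - F0);     % restrict to z > 0, as in eq. (23)
    else
      u = rand * F0;                % restrict to z <= 0, as in eq. (24)
    end
    z = m + log(u / (1 - u));       % closed-form logistic inverse CDF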
2.3 Multinomial Logistic Regression

Unlike the Probit Regression case, the binary Logistic Regression sampling techniques above have a trivial extension to the multi-class scenario. In addition to having a y and w variable for each class, as we saw in Section 1, we now have a z and \lambda vector for each class. The Gibbs sampler presented in [5] simply loops over the classes, performing the binary logistic regression sampling technique for the current class while keeping all other classes fixed. We implemented this strategy (based on the pseudocode from the paper) in the directory MLOGREGSAMP; exampleMLOGRESAMP is an example script calling this function. Unfortunately, we found that this does not make an especially effective sampling strategy: the technique stayed in areas of the distribution that were far from the MAP estimate, and did not produce accurate classification results. We hypothesize that this is due to several factors. The first factor is simply the larger number of parameters in this model. Another factor is that, as discussed previously, there can inherently be a much higher degree of correlation in the softmax case than in the binary scenario. Finally, we note that the sampling strategy of looping over the classes and running the binary sampler is not especially clever about dealing with these correlations, since it requires separate sampling of the z values for each class in addition to the w values for each class; a joint update would likely improve the performance.

Before moving on to feature selection, I would like to outline some extensions of the above models that I would have liked to explore, had I had more time. One idea with significant potential for improving the sampling strategies is to integrate out parameters. Given the high degree of correlation between variables (especially in the multi-class case), this would likely improve the sampling strategies significantly. Another area of exploration is to not view the covariance or hyper-parameters as fixed, and to explore posterior estimates with priors on these distributions. This is especially relevant from the point of view of model generalization, since the covariance and hyper-parameters can significantly affect the classification performance of the model.

3 Feature Selection

A major appeal of Logistic Regression, besides its intuitive multi-class generalization, is the interpretation of its coefficients. As discussed in [1], researchers often explore different combinations of the features in order to produce a parsimonious regression model that still provides effective prediction performance. In this section, we discuss automated approaches to this feature selection problem. We first present an extension to the above models that incorporates feature selection through trans-dimensional sampling. We then turn our focus to priors that encourage sparsity in the final model.

3.1 Trans-Dimensional Sampling

Focusing on the binary logistic regression scenario, one method to incorporate feature selection into the procedure is to add yet another set of auxiliary variables, \gamma_{1:p}. Specifically, if the binary variable \gamma_i is set to 1, then the corresponding variable is included in the model; if \gamma_i is set to 0, then the corresponding variable is excluded from the model (i.e., set to 0). [5] proposes this model, and suggests using the model presented earlier for binary logistic regression with these auxiliary variables, with joint updates to {z, \lambda} and to {\gamma, w}. They propose that \gamma | z can be sampled using (Reversible-Jump) Metropolis-Hastings steps. Specifically, sampling z, \lambda, and w remains the same (but using only the active covariate set), and we accept a trans-dimensional step from \gamma to \gamma^* (under a symmetric proposal) using the following acceptance probability:

    \alpha = \min \left\{ 1, \frac{|V_{\gamma^*}|^{1/2} |v_{\gamma}|^{1/2} \exp(0.5 B_{\gamma^*}^T V_{\gamma^*}^{-1} B_{\gamma^*})}{|V_{\gamma}|^{1/2} |v_{\gamma^*}|^{1/2} \exp(0.5 B_{\gamma}^T V_{\gamma}^{-1} B_{\gamma})} \right\}    (30)

[5] uses a simple proposal distribution: flip the value of a randomly chosen element of \gamma. We implemented sampling from the above model, in the case of binary logistic regression with feature selection, in the directory LOGREGSAMPFS; an example running this routine is exampleLOGREGSAMPFS.
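In log space, the acceptance test (30) is straightforward to compute. Here is a Matlab sketch under our own naming, not the LOGREGSAMPFS code: it assumes B and V from (26)-(27) have been computed for the current model (Bg, Vg) and for the proposal (Bs, Vs), with vg and vs the prior covariances restricted to the corresponding active sets:

    % Log acceptance ratio for eq. (30); log |V|^(1/2) is computed from the
    % Cholesky factor to avoid overflow in the determinants.
    logNum = sum(log(diag(chol(Vs)))) + sum(log(diag(chol(vg)))) ...
             + 0.5 * (Bs' * (Vs \ Bs));
    logDen = sum(log(diag(chol(Vg)))) + sum(log(diag(chol(vs)))) ...
             + 0.5 * (Bg' * (Vg \ Bg));
    if log(rand) < min(0, logNum - logDen)
      gamma = gammaStar;           % accept the trans-dimensional flip
    end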
3.2 Priors Encouraging Sparsity

Although the above strategy to incorporate feature selection into the model is a simple extension of the logistic regression model, it has major drawbacks. Specifically, updating single components causes very slow exploration of the space of 2^p variable subsets. As discussed in class, we could jointly update correlated components to significantly improve the results. An alternate strategy, especially relevant when p is very large, is to use priors that encourage sparsity.

3.3 MAP Estimation of the Logistic LASSO

The LASSO prior advocated in [7] (but utilized earlier under the name 'Basis Pursuit' [8]) is currently a popular strategy for enforcing sparsity in the regression coefficients. From the point of view of optimization, the LASSO prior consists of using a scaled value of the L1-norm of the weights as the penalty/regularization function, instead of the squared L2-norm discussed earlier. Specifically, our objective function becomes:

    f = -\sum_{i=1}^{n} \log(1 + \exp(-y_i w^T x_i)) - \frac{1}{v} ||w||_1    (31)

Although the above objective is still concave, a major disadvantage of this objective function is that it is non-differentiable at points where any w_i is zero. Hence, we need to use slightly less generic optimization approaches for finding the MAP estimates. Furthermore, we cannot use efficient methods such as the one presented in [9] for Least Squares estimation under an L1 penalty in order to optimize the logistic regression likelihood function. The most widely used method for optimizing the logistic
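As a concrete illustration of the objective in (31) and its non-differentiability, the following minimal Matlab sketch (our own naming, with X, y, w, and v as in Section 1.1) evaluates f and one valid subgradient:

    % L1-penalized objective (31) and a subgradient. Matlab's sign(0) = 0
    % selects the zero element of the subdifferential at w_i = 0.
    margins = y .* (X * w);
    f = -sum(log(1 + exp(-margins))) - norm(w, 1) / v;    % eq. (31)
    s = 1 ./ (1 + exp(-margins));
    g = X' * ((1 - s) .* y) - sign(w) / v;                % subgradient of f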