Statistical Science
2006, Vol. 21, No. 3, 352–357
DOI: 10.1214/088342306000000466
Main article DOI: 10.1214/088342306000000493
© Institute of Mathematical Statistics, 2006

Comment

Trevor Hastie and Ji Zhu
Trevor Hastie is Professor, Department of Statistics, Stanford University, Stanford, California 94305, USA (e-mail: hastie@stanford.edu). Ji Zhu is Assistant Professor, Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109, USA (e-mail: jizhu@umich.edu).

We congratulate the authors for a well written and thoughtful survey of some of the literature in this area. They are mainly concerned with the geometry and the computational learning aspects of the support vector machine (SVM). We will therefore complement their review by discussing it from the statistical function estimation perspective. In particular, we will elaborate on the following points:

• Kernel regularization is essentially a generalized ridge penalty in a certain feature space.
• In practice, the effective dimension of the data kernel matrix is not always equal to n, even when the implicit dimension of the feature space is infinite; hence, the training data are not always perfectly separable.
• Appropriate regularization plays an important role in the success of the SVM.
• The SVM is not fundamentally different from many statistical tools that statisticians are familiar with, for example, penalized logistic regression.

We acknowledge that many of these comments are based on our earlier paper, Hastie, Rosset, Tibshirani and Zhu (2004).

KERNEL REGULARIZATION AND THE GENERALIZED RIDGE PENALTY

Given a positive definite kernel K(x, x′), where x, x′ belong to a certain domain X, we consider the general function estimation problem

(1)   \min_{\beta_0, f} \sum_{i=1}^{n} \ell\big(y_i, \beta_0 + f(x_i)\big) + \frac{\lambda}{2} \|f\|_{H_K}^2 .

Here ℓ(·,·) is a convex loss function that describes the "closeness" between the observed data and the fitted model, and f is an element in the span of {K(·, x), x ∈ X}. More precisely, f ∈ H_K is a function in the reproducing kernel Hilbert space H_K (RKHS) generated by K(·,·) (see Burges, 1998; Evgeniou, Pontil and Poggio, 2000; and Wahba, 1999, for details).

Suppose the positive definite kernel K(·,·) has a (possibly finite) eigenexpansion,

K(x, x') = \sum_{j=1}^{\infty} \delta_j \phi_j(x) \phi_j(x'),

where δ_1 ≥ δ_2 ≥ ··· ≥ 0 are the eigenvalues and the φ_j(x) are the corresponding eigenfunctions. Elements of H_K have an expansion in terms of these eigenfunctions,

(2)   f(x) = \sum_{j=1}^{\infty} \beta_j \phi_j(x),

with the constraint that

\|f\|_{H_K}^2 \stackrel{\mathrm{def}}{=} \sum_{j=1}^{\infty} \frac{\beta_j^2}{\delta_j} < \infty,

where ‖f‖_{H_K} is the norm induced by K(·,·). Then we can rewrite (1) as

(3)   \min_{\beta_0, \beta} \sum_{i=1}^{n} \ell\Big(y_i, \beta_0 + \sum_{j=1}^{\infty} \beta_j \phi_j(x_i)\Big) + \lambda \sum_{j=1}^{\infty} \frac{\beta_j^2}{\delta_j},

and we can see that the regularization term ‖f‖²_{H_K} in (1) can be interpreted as a generalized ridge penalty, where eigenfunctions with small eigenvalues in the expansion (2) get penalized more, and vice versa.

Formulation (3) seems to be an infinite-dimensional optimization problem, but according to the representer theorem (Kimeldorf and Wahba, 1971; Wahba, 1990), the solution is finite-dimensional and has the form

f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i).

Using the reproducing property of H_K, that is, ⟨K(·, x_i), K(·, x_{i'})⟩ = K(x_i, x_{i'}), (3) also reduces to a finite-dimensional criterion,

(4)   \min_{\beta_0, \alpha} L(y, \beta_0 + K\alpha) + \lambda \alpha^T K \alpha.

Here we use vector notation: K is the n × n data kernel matrix with elements K(x_i, x_{i'}), i, i' = 1, ..., n, and L(y, β_0 + Kα) = \sum_{i=1}^{n} \ell(y_i, \beta_0 + f(x_i)).
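To make the finite-dimensional criterion (4) concrete, here is a minimal sketch (ours, not from the comment) that uses the squared-error loss in place of the SVM's hinge loss, for which (4) has a closed-form solution; the function names and the choice of the radial kernel are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Radial kernel K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def kernel_ridge_fit(X, y, gamma=1.0, lam=1.0):
    """Minimize ||y - K alpha||^2 + lam * alpha^T K alpha, i.e. criterion (4)
    with squared-error loss and no intercept.  The stationarity condition
    K(K alpha - y) + lam * K alpha = 0 gives alpha = (K + lam I)^{-1} y
    when K is positive definite."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(X_train, X_new, alpha, gamma=1.0):
    """f(x) = sum_i alpha_i K(x, x_i), the representer-theorem form."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Example usage on one-dimensional inputs (reshaped to a column):
# X = np.random.default_rng(0).standard_normal((50, 1))
# y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(1).standard_normal(50)
# alpha = kernel_ridge_fit(X, y, gamma=1.0, lam=0.1)
```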
We reparametrize (4) using the eigendecomposition of K = UDUᵀ, where D is diagonal and U is orthogonal. Let Kα = Uβ, where β = DUᵀα. Then (4) becomes

(5)   \min_{\beta_0, \beta} L(y, \beta_0 + U\beta) + \lambda \beta^T D^{-1} \beta.

Now the columns of U are unit-norm basis functions that span the column space of K; and again, we see that those members that correspond to small eigenvalues (the elements of the diagonal matrix D) get heavily penalized, and vice versa.

In the machine learning community, people tend to view the kernel as providing an implicit map of x from X to a certain high-dimensional feature space, and K(·,·) computes inner products in this (possibly infinite-dimensional) feature space. Specifically, the features are

h_j(x) = \sqrt{\delta_j}\, \phi_j(x)   or   h(x) = \big(h_1(x), h_2(x), \dots\big)^T,

and we have

K(x, x') = \langle h(x), h(x') \rangle.

Furthermore, let θ_j = β_j/√δ_j and H = UD^{1/2}. Then (3) and (5) become

(6)   \min_{\beta_0, \theta} \sum_{i=1}^{n} \ell\Big(y_i, \beta_0 + \sum_{j=1}^{\infty} \theta_j h_j(x_i)\Big) + \lambda \sum_{j=1}^{\infty} \theta_j^2

and

(7)   \min_{\beta_0, \theta} L(y, \beta_0 + H\theta) + \lambda \theta^T \theta,

respectively. This shows kernel regularization as an exact ridge penalty in the feature space, but unlike (3) and (5), it hides the fact that eigenfunctions are differentially penalized according to their corresponding eigenvalues.

To illustrate the point, we consider a simple example. The data x_i are one-dimensional and were generated from the standard Gaussian distribution (n = 50). The radial kernel function K(x, x′) = exp(−γ‖x − x′‖²) was used, with γ = 1. Figure 1 shows the eigenvalues of the kernel matrix K. The left panel of Figure 2 shows the first 16 eigenvectors of K (columns of U) and the right panel shows the corresponding features (columns of H). As we can see from the left panel of Figure 2, eigenvectors with large eigenvalues (hence penalized less) tend to be smooth, while eigenvectors with small eigenvalues (hence penalized more) tend to be wiggly; therefore, ‖f‖²_{H_K} is also often interpreted as a roughness measure of the function f. We can also see from the right panel of Figure 2 that many of the features are "norm challenged," that is, they are squashed down dramatically by their eigenvalues.

FIG. 1. Eigenvalues (on the log scale) of the data kernel matrix K.

EFFECTIVE DIMENSION OF THE DATA KERNEL MATRIX

As we have seen in the previous section, the kernel K(·,·) maps x from its original input space to some high-dimensional feature space h(x). In the case of classification, it is sometimes argued that the implicit feature space can be infinite-dimensional (e.g., via the radial basis kernel), which suggests that perfect separation of the training data is always possible. However, this is not always true in practice.

To illustrate the point, we consider a two-class classification example. The data were generated from a pair of mixture Gaussian densities, described in detail by Hastie, Tibshirani and Friedman (2001). The radial kernel function K(x, x′) = exp(−γ‖x − x′‖²) was used. Four different values of γ (0.1, 0.5, 1 and 5) were tried. For each value of γ, the SVM was fitted for a sequence of values of λ, ranging from the most regularized model to the least regularized model.

We were at first surprised to discover that not all these sequences achieved zero training errors on the 200 training data points at their least regularized fit. The minimal training errors and the corresponding values of γ are summarized in Table 1. The second row of the table shows the effective rank of the data kernel matrix K (which we defined to be the number of eigenvalues greater than 10^{-12}). This 200 × 200 matrix has elements K(x_i, x_{i'}), i, i' = 1, ..., 200. In this example, a full-rank K is required to achieve perfect separation. Similar observations have also appeared in Williams and Seeger (2000) and Bach and Jordan (2002).
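The one-dimensional example above is easy to reproduce numerically. The following sketch (ours; it assumes NumPy and a random seed of our choosing) forms the radial kernel matrix for n = 50 standard Gaussian draws with γ = 1, eigendecomposes it, and scales the eigenvectors into the features H = UD^{1/2}, so the rapid eigenvalue decay and the "norm-challenged" features can be inspected directly.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is our choice, not from the paper

# One-dimensional data from the standard Gaussian, n = 50, as in the example.
n, gamma = 50, 1.0
x = rng.standard_normal(n)

# Radial kernel matrix with K(x, x') = exp(-gamma * (x - x')^2).
K = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)

# Eigendecomposition K = U D U^T (eigh returns ascending order; flip it).
d, U = np.linalg.eigh(K)
d, U = d[::-1], U[:, ::-1]

# The features are the columns of H = U D^{1/2}; columns with tiny
# eigenvalues are "norm challenged" -- squashed toward zero.
H = U * np.sqrt(np.clip(d, 0.0, None))

print("largest eigenvalues:", np.round(d[:5], 3))
print("smallest eigenvalues:", d[-5:])          # many are numerically ~0
print("effective rank (eigenvalues > 1e-12):", int((d > 1e-12).sum()))
```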
FIG. 2. The left panel shows the eigenvectors of the data kernel matrix K and the right panel shows the corresponding features (eigenvectors scaled by the corresponding eigenvalues).
TABLE 1
Results for the mixture simulation example

γ                   5      1    0.5    0.1
Training errors     0     12     21     33
Effective rank    200    177    143     76
This emphasizes again the fact that not all features in the feature map implied by K(·,·) are of equal stature (see the right panel in Figure 2); many of them are shrunken way down to zero. The regularization in (3) and (5) penalizes unit-norm eigenvectors by the inverse of their eigenvalues, which effectively annihilates some, depending on γ. Small γ implies wide, flat kernels and a suppression of wiggly, rough functions. Figure 3 shows the eigenvalues of K for the four values of γ. The larger eigenvalues correspond in this case to smoother eigenfunctions, the smaller ones to rougher ones. The rougher eigenfunctions get penalized exponentially more than the smoother ones. Hence, for smaller values of γ, the effective dimension of the feature space is truncated.
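A sketch of the effective-rank computation behind Table 1, using the stated threshold of 10^{-12}, is given below. The actual mixture data of Hastie, Tibshirani and Friedman (2001) are not reproduced here, so we use stand-in Gaussian inputs; the numbers will therefore differ from Table 1, although the truncation of the effective rank as γ decreases should still be visible.

```python
import numpy as np

def effective_rank(K, tol=1e-12):
    """Number of eigenvalues of the kernel matrix greater than tol,
    the definition used for the second row of Table 1."""
    return int((np.linalg.eigvalsh(K) > tol).sum())

# Stand-in for the 200 two-dimensional training inputs of the mixture
# example (the actual data are described in Hastie, Tibshirani and
# Friedman, 2001); here we simply draw 200 Gaussian points.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
for gamma in (5, 1, 0.5, 0.1):
    K = np.exp(-gamma * sq_dists)
    print(f"gamma = {gamma}: effective rank = {effective_rank(K)}")
```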
FIG. 3. The eigenvalues (on the log scale) for the data kernel matrices K that correspond to the four values of γ.
THE NEED FOR CAREFUL REGULARIZATION

The SVM has been very successful for the classification problem and has gained a lot of attention in the machine learning community in the past ten years. Many papers have been published to explain why it performs so well. Most of this literature concentrates on the concept of margin. Various misclassification error bounds have been derived based on the margin (Vapnik, 1995; Bartlett and Shawe-Taylor, 1999; Shawe-Taylor and Cristianini, 1999).

However, our view is a little different from that based on the concept of margin. Several researchers have noted the relationship between the SVM and regularized function estimation in RKHS (Evgeniou, Pontil and Poggio, 2000; Wahba, 1999). The regularized function estimation problem contains two parts: a loss function and a penalty term [e.g., (1)]. The SVM uses the so-called hinge loss function (see Figure 5). The margin-maximizing property of the SVM derives from the hinge loss function. Hence margin maximization is by nature a non-regularized objective, and solving it in a high-dimensional space is likely to lead to overfitting and bad prediction performance. This has been observed in practice by many researchers, in particular Breiman (1999) and Marron and Todd (2002).

The loss + penalty formulation emphasizes the role of regularization. In many situations we have sufficient features (e.g., gene expression arrays) to guarantee separation. We may nevertheless avoid the maximum margin separator (λ → 0) in favor of a more regularized solution.

Figure 4 shows the test error as a function of λ for the mixture data example. Here we see a dramatic range in the correct choice of λ. When γ = 5, the most regularized model is called for. On the other hand, when γ = 0.1, we would want to choose among the least regularized models. Depending on the value of γ, the optimal λ can occur at either end of the spectrum or anywhere in between, emphasizing the need for careful selection.

FIG. 4. Test error curves for the mixture example, using four different values of the radial kernel parameter γ. Large values of λ correspond to heavy regularization, small values of λ to light regularization. Depending on the value of γ, the optimal λ can occur at either end of the spectrum or anywhere in between, emphasizing the need for careful selection.
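A rough sketch of the kind of experiment summarized in Figure 4 follows, assuming scikit-learn's SVC and stand-in two-class data (the paper's mixture data are not reproduced); SVC's cost parameter C acts as an inverse regularization parameter, roughly 1/λ, so the grid runs from heavy to light regularization. The data generator and grids are ours, so the resulting error curves are only qualitatively comparable.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in two-class data (the paper uses the mixture data of Hastie,
# Tibshirani and Friedman, 2001); 200 training and 1000 test points.
def make_data(n):
    X = rng.standard_normal((n, 2))
    y = (X[:, 0] + 0.5 * rng.standard_normal(n) > 0).astype(int)
    return X, y

X_train, y_train = make_data(200)
X_test, y_test = make_data(1000)

C_grid = np.logspace(-2, 4, 13)   # heavy -> light regularization (C ~ 1/lambda)
for gamma in (5, 1, 0.5, 0.1):
    errors = []
    for C in C_grid:
        clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X_train, y_train)
        errors.append(1 - clf.score(X_test, y_test))
    best = int(np.argmin(errors))
    print(f"gamma = {gamma}: best test error {errors[best]:.3f} "
          f"at C = {C_grid[best]:.3g}")
```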
CONNECTION WITH OTHER STATISTICAL TOOLS

Last, we would like to comment on the connection between the SVM and some statistical tools that statisticians are familiar with.

As we have seen in the previous sections, what is special about the SVM is not the regularization term, but rather the loss function, that is, the hinge loss. Lin (2002) pointed out that the hinge loss is Bayes consistent, that is, the population minimizer of the loss function agrees with the Bayes rule in terms of classification. This is important in explaining the success of the SVM, because it implies that the SVM is trying to implement the Bayes rule.

On the other hand, notice that the hinge loss and the binomial deviance have very similar shapes (see Figure 5): both increase linearly as yf gets very small (negative), and both encourage y and f to have the same sign. Hence it is reasonable to conjecture that by replacing the hinge loss of the SVM with the binomial deviance, which is also Bayes consistent, we should be able to get a fitted model that performs similarly to the SVM. In fact, in Zhu and Hastie (2005), we show that under certain conditions the classification boundary of the resulting penalized logistic regression (using the binomial deviance) and that of the SVM coincide. Penalized logistic regression has been studied by many statisticians (see Green and Silverman, 1994; Wahba, Gu, Wang and Chappell, 1995; and Lin et al., 2000, for details). We understand why it can work well. The same reasoning could be applied to the SVM.

FIG. 5. Comparing the hinge loss and the binomial deviance, y ∈ {−1, 1}.

Penalized logistic regression is not the only model that performs similarly to the SVM; replacing the hinge loss with any sensible loss function will give a similar result, for example, the exponential loss function of boosting (Freund and Schapire, 1997) and the squared error loss (Zhang and Oles, 2001; Bühlmann and Yu, 2003). These loss functions are all Bayes consistent. The binomial deviance and the exponential loss are margin-maximizing loss functions, but the squared error loss is not. Distance weighted discrimination (Marron and Todd, 2002) is designed specifically not to maximize the margin and works well with high-dimensional data, which in a way also supports the view that margin maximization is not the key to the success of the SVM.
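As a small numerical sketch (ours) of the comparison in Figure 5, the snippet below evaluates the hinge loss and the binomial deviance as functions of the margin yf under the y ∈ {−1, 1} coding; the scaling of the deviance is a presentational choice.

```python
import numpy as np

def hinge_loss(margin):
    """SVM hinge loss (1 - yf)_+ as a function of the margin yf."""
    return np.maximum(0.0, 1.0 - margin)

def binomial_deviance(margin):
    """Binomial deviance (negative log-likelihood) log(1 + exp(-yf)),
    up to scaling, for the y in {-1, 1} coding."""
    return np.log1p(np.exp(-margin))

for m in np.linspace(-3, 3, 7):
    print(f"yf = {m:+.1f}: hinge = {hinge_loss(m):.3f}, "
          f"deviance = {binomial_deviance(m):.3f}")
# Both losses grow roughly linearly for very negative yf and both
# encourage y and f to have the same sign; the deviance is smooth.
```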
ACKNOWLEDGMENTS

Hastie is partially supported by NSF Grant DMS-05-05676. Zhu is partially supported by NSF Grant DMS-05-05432.
REFERENCES

BACH, F. and JORDAN, M. (2002). Kernel independent component analysis. J. Mach. Learn. Res. 3 1–48. MR1966051
BARTLETT, P. and SHAWE-TAYLOR, J. (1999). Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods – Support Vector Learning (B. Schölkopf, C. J. C. Burges and A. J. Smola, eds.) 43–54. MIT Press, Cambridge, MA.
BREIMAN, L. (1999). Prediction games and arcing algorithms. Neural Computation 11 1493–1517.
BÜHLMANN, P. and YU, B. (2003). Boosting with the L2 loss: Regression and classification. J. Amer. Statist. Assoc. 98 324–340. MR1995709
BURGES, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2 121–167.
EVGENIOU, T., PONTIL, M. and POGGIO, T. (2000). Regularization networks and support vector machines. Adv. Comput. Math. 13 1–50. MR1759187
FREUND, Y. and SCHAPIRE, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119–139. MR1473055
GREEN, P. and SILVERMAN, B. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall, London. MR1270012
HASTIE, T., ROSSET, S., TIBSHIRANI, R. and ZHU, J. (2004). The entire regularization path for the support vector machine. J. Mach. Learn. Res. 5 1391–1415.
HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2001). The Elements of Statistical Learning. Springer, New York. MR1851606
KIMELDORF, G. and WAHBA, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 33 82–95. MR0290013
LIN, X., WAHBA, G., XIANG, D., GAO, F., KLEIN, R. and KLEIN, B. (2000). Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. Ann. Statist. 28 1570–1600. MR1835032
LIN, Y. (2002). Support vector machines and the Bayes rule in classification. Data Min. Knowl. Discov. 6 259–275. MR1917926
MARRON, J. and TODD, M. (2002). Distance weighted discrimination. Technical report, School of Operations Research and Industrial Engineering, Cornell Univ.
SHAWE-TAYLOR, J. and CRISTIANINI, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory. Lecture Notes in Artificial Intelligence 1572 263–273. Springer, Berlin.
VAPNIK, V. (1995). The Nature of Statistical Learning Theory. Springer, New York. MR1367965
WAHBA, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia. MR1045442
WAHBA, G. (1999). Support vector machines, reproducing kernel Hilbert space and randomized GACV. In Advances in Kernel Methods – Support Vector Learning (B. Schölkopf, C. J. C. Burges and A. J. Smola, eds.) 69–88. MIT Press, Cambridge, MA.
WAHBA, G., GU, C., WANG, Y. and CHAPPELL, R. (1995). Soft classification, a.k.a. risk estimation, via penalized log likelihood and smoothing spline analysis of variance. In The Mathematics of Generalization (D. Wolpert, ed.) 329–360. Addison–Wesley, Reading, MA.
WILLIAMS, C. and SEEGER, M. (2000). The effect of the input density distribution on kernel-based classifiers. In Proc. Seventeenth International Conference on Machine Learning 1159–1166. Morgan Kaufmann, San Francisco.
ZHANG, T. and OLES, F. (2001). Text categorization based on regularized linear classification methods. Information Retrieval 4 5–31.
ZHU, J. and HASTIE, T. (2005). Kernel logistic regression and the import vector machine. J. Comput. Graph. Statist. 14 185–205. MR2137897