
Predictive lowrank
decomposition for kernel methods
Francis R. Bach CentredeMorphologieMath´ematique,EcoledesMinesdeParis 35rueSaintHonor´e,77300Fontainebleau,France Michael I. Jordan Computer Science Division and Department of Statistics University of California, Berkeley, CA 94720, USA
Abstract Lowrank matrix decompositions are essen tial tools in the application of kernel meth ods to largescale learning problems. These decompositions have generally been treated as black boxes—the decomposition of the kernel matrix that they deliver is indepen dent of the specific learning task at hand— and this is a potentially significant source of inefficiency. In this paper, we present an algorithm that can exploit side informa tion (e.g., classification labels, regression re sponses) in the computation of lowrank de compositions for kernel matrices. Our al gorithm has the same favorable scaling as stateoftheart methods such as incomplete Cholesky decomposition—it is linear in the number of data points and quadratic in the rank of the approximation. We present simulation results that show that our algo rithm yields decompositions of significantly smaller rank than those found by incomplete Cholesky decomposition.
1. Introduction

Kernel methods provide a unifying framework for the design and analysis of machine learning algorithms (Schölkopf and Smola, 2001; Shawe-Taylor and Cristianini, 2004). A key step in any kernel method is the reduction of the data to a kernel matrix, also known as a Gram matrix. Given the kernel matrix, generic matrix-based algorithms are available for solving learning problems such as classification, prediction, anomaly detection, clustering, and dimensionality reduction.
There are two principal advantages to this division of labor: (1) any reduction that yields a positive semidefinite kernel matrix is allowed, a fact that opens the door to specialized transformations that exploit domain-specific knowledge; and (2) expressed in terms of the kernel matrix, learning problems often take the form of convex optimization problems, and powerful algorithmic techniques from the convex optimization literature can be brought to bear in their solution.
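To make this first step concrete, here is a minimal sketch in Python with NumPy of reducing a data set to its Gram matrix; the Gaussian RBF kernel, the bandwidth sigma, and the helper name rbf_gram_matrix are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def rbf_gram_matrix(X, sigma=1.0):
    """Return the n x n Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)   # guard against negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Reduce a small data set to its kernel matrix; downstream kernel methods
# (classification, clustering, ...) then operate only on K.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
K = rbf_gram_matrix(X)   # shape (200, 200), symmetric positive semidefinite
```

Any other positive semidefinite kernel could be substituted here, which is exactly the flexibility that point (1) above refers to.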
An apparent drawback of kernel methods is the naive computational complexity associated with manipulating kernel matrices. Given a set of $n$ data points, the kernel matrix $K$ is of size $n \times n$. This suggests a computational complexity of at least $O(n^2)$; in fact, most kernel methods have at their core operations such as matrix inversion or eigenvalue decomposition, which scale as $O(n^3)$. Moreover, some kernel algorithms make use of sophisticated tools such as semidefinite programming and have even higher-order polynomial complexities (Lanckriet et al., 2004).

These generic worst-case complexities can often be skirted, and this fact is one of the major reasons for the practical success of kernel methods. The underlying issue is that the kernel matrices often have a spectrum that decays rapidly and are thus of small numerical rank (Williams and Seeger, 2000). Standard algorithms from numerical linear algebra can thus be exploited to compute an approximation of the form $L = GG^\top$, where $G \in \mathbb{R}^{n \times m}$, and where the rank $m$ is generally significantly smaller than $n$. Moreover, it is often possible to reformulate kernel-based learning algorithms to make use of $G$ instead of $K$. The resulting computational complexity generally scales as $O(m^3 + m^2 n)$. This linear scaling in $n$ makes kernel-based methods viable for large-scale problems. To achieve this desirable result, it is of course necessary that the underlying numerical linear algebra routines …
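The sketch below, again in Python with NumPy and again only an illustration, shows the kind of routine this refers to: a pivoted incomplete Cholesky factorization that greedily expands $G$ from the column with the largest residual diagonal, so that $K \approx GG^\top$ at a cost linear in $n$ and quadratic in the rank, followed by an example of using $G$ in place of $K$ (a regularized linear solve via the Woodbury identity). The pivoting rule shown is the standard task-independent one, not the predictive, side-information-driven variant this paper develops; the rank bound m, kernel bandwidth, and ridge parameter lam are assumptions made for the demo.

```python
import numpy as np

def incomplete_cholesky(K, m, tol=1e-12):
    """Pivoted incomplete Cholesky: return G (n x k, k <= m) with K approximately G @ G.T.

    Each step reads one column of K and updates length-n vectors, so the
    cost is O(k^2 n): linear in n, quadratic in the rank of the approximation.
    """
    n = K.shape[0]
    G = np.zeros((n, m))
    d = np.array(np.diag(K), dtype=float)      # residual diagonal of K - G G^T
    for k in range(m):
        i = int(np.argmax(d))                  # greedy pivot: largest residual diagonal
        if d[i] <= tol:                        # remaining trace error is negligible
            return G[:, :k]
        G[:, k] = (K[:, i] - G[:, :k] @ G[i, :k]) / np.sqrt(d[i])
        d -= G[:, k] ** 2
        np.maximum(d, 0.0, out=d)              # guard against round-off
    return G

# A kernel matrix of low numerical rank: Gaussian RBF kernel on 1-D data.
rng = np.random.default_rng(0)
x = rng.standard_normal(500)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

G = incomplete_cholesky(K, m=40)
rel_err = np.linalg.norm(K - G @ G.T) / np.linalg.norm(K)
print(G.shape, rel_err)                        # small rank, small relative error

# Using G in place of K: solve (K + lam*I) a = y approximately via the
# Woodbury identity, in O(m^3 + m^2 n) rather than O(n^3).
lam, y = 1e-2, rng.standard_normal(x.size)
m_eff = G.shape[1]
a = (y - G @ np.linalg.solve(G.T @ G + lam * np.eye(m_eff), G.T @ y)) / lam
```

The Woodbury step is what turns the low-rank factor into an end-to-end saving: only m-by-m systems are ever formed, so the overall cost matches the $O(m^3 + m^2 n)$ figure quoted above.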