Advanced Probabilistic Models for
Clustering and Projection
Dissertation in Computer Science
at the Faculty of Mathematics, Computer Science and Statistics
of Ludwig-Maximilians-Universität München
by
Shipeng Yu
Date of submission: 21 June 2006
Date of oral examination: 29 September 2006
Reviewers:
Prof. Dr. Hans-Peter Kriegel, Ludwig-Maximilians-Universität München
Prof. Dr. Thomas Hofmann, Technische Universität Darmstadt
Dr. Volker Tresp, Siemens AG, München

To my parents

Abstract
Probabilistic modeling is a fundamental research area for data mining and machine learning. The general approach is to assume a generative model underlying the observed data and to estimate the model parameters via likelihood maximization. It is grounded in probability theory and draws on a large body of methods from statistical learning, sampling theory, and Bayesian statistics. In this thesis we study several advanced probabilistic models for data clustering and feature projection, two important unsupervised learning problems.
The goal of clustering is to group similar data points together to uncover the underlying clusters. While numerous methods exist for various clustering tasks, one important question remains: how to determine the number of clusters automatically. The first part of the thesis answers this question from a mixture modeling perspective. A finite mixture model is first introduced for clustering, in which each mixture component is assumed, for generality, to be an exponential family distribution. The model is then extended to an infinite mixture model, and its strong connection to the Dirichlet process (DP), a non-parametric Bayesian framework, is uncovered. From this insight a variational Bayesian algorithm called VBDMA is derived to learn the number of clusters automatically, and empirical studies on several 2D data sets and an image data set verify the effectiveness of this algorithm.
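As a rough illustration of the infinite-mixture idea, consider the following minimal Python sketch (ours, not the thesis's VBDMA implementation; the truncation level T, the concentration alpha, and all variable names are assumptions made for illustration). It draws data from a truncated stick-breaking construction of a DP Gaussian mixture:

import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_weights(alpha, T, rng):
    # Truncated stick-breaking: v_k ~ Beta(1, alpha),
    # pi_k = v_k * prod_{j<k} (1 - v_j).
    v = rng.beta(1.0, alpha, size=T)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

alpha, T, n = 1.0, 20, 500             # concentration, truncation, sample size
pi = stick_breaking_weights(alpha, T, rng)
pi /= pi.sum()                         # renormalize the truncated weights
mu = rng.normal(0.0, 5.0, size=T)      # one Gaussian mean per candidate component
z = rng.choice(T, size=n, p=pi)        # mixture (cluster) assignments
x = rng.normal(mu[z], 1.0)             # one-dimensional observations

# Only a handful of the T candidate components end up occupied: the data
# and alpha, not a fixed K, determine the effective number of clusters.
print("occupied clusters:", len(np.unique(z)))

In the limit of infinite truncation this construction recovers the DP itself; VBDMA works instead with a Dirichlet-multinomial allocation approximation, as developed in Chapter 3.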
In feature projection we are interested in dimensionality reduction, aiming to find a low-dimensional feature representation of the data. We first review the well-known principal component analysis (PCA) and its probabilistic interpretation (PPCA), and then generalize PPCA to a novel probabilistic model for non-linear projection, in the spirit of kernel PCA. An expectation-maximization (EM) algorithm is derived for kernel PCA, making it fast and applicable to large data sets. We then propose a novel supervised projection method called MORP, which takes the output information into account in a supervised learning context. Empirical studies on various data sets show much better results than unsupervised projection and other supervised projection methods. Finally, we generalize MORP probabilistically to obtain SPPCA for supervised projection, and naturally extend the model to S²PPCA, a semi-supervised projection method. This allows us to incorporate both the label information and the unlabeled data into the projection process.
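For reference, the classical EM updates for standard PPCA (the starting point the thesis generalizes) fit x = Wz + mu + noise with closed-form steps per iteration. This is a minimal sketch of those well-known updates; the symbols W, sigma2, and q are our notation, not the thesis's:

import numpy as np

def ppca_em(X, q, n_iter=100, seed=0):
    # Fit x ~ N(W z + mu, sigma2 * I), z ~ N(0, I_q), by EM.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    W = rng.normal(size=(d, q))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables.
        M = W.T @ W + sigma2 * np.eye(q)       # q x q matrix
        Minv = np.linalg.inv(M)
        Ez = Xc @ W @ Minv                     # row i is E[z_i]
        EzzT = n * sigma2 * Minv + Ez.T @ Ez   # sum_i E[z_i z_i^T]
        # M-step: closed-form updates for W and sigma2.
        W = Xc.T @ Ez @ np.linalg.inv(EzzT)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum((Xc @ W) * Ez)
                  + np.trace(EzzT @ W.T @ W)) / (n * d)
    return W, mu, sigma2

Each iteration costs O(ndq) rather than the O(nd^2) needed to form and decompose an explicit covariance matrix, which is the same property that lets the EM treatment of kernel PCA above scale to large data sets.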
In the third part of the thesis, we introduce a unified probabilistic model which handles data clustering and feature projection jointly. The model can be viewed both as a clustering model with projected features and as a projection model with structured documents. A variational Bayesian learning algorithm is derived, which turns out to iterate clustering operations and projection operations until convergence. Superior performance is obtained for both clustering and projection.
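A schematic of this alternation, using PCA and k-means as stand-ins (our own illustration, not the thesis's variational algorithm; cluster_and_project and its parameters are hypothetical names):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_and_project(X, n_clusters=3, q=2, n_iter=10, seed=0):
    Z = PCA(n_components=q).fit_transform(X)   # initial projection
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # "Clustering step": cluster in the current projected space.
        new_labels = KMeans(n_clusters=n_clusters, n_init=10,
                            random_state=seed).fit_predict(Z)
        if np.array_equal(new_labels, labels):
            break                              # assignments stable: stop
        labels = new_labels
        # "Projection step": re-fit the projection on cluster-centered
        # data, standing in for the variational projection update.
        centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(n_clusters)])
        Z = PCA(n_components=q).fit_transform(X - centers[labels])
    return labels, Z

In the unified model both steps fall out of a single variational objective, so each iteration is guaranteed not to decrease one common bound; the sketch above only mimics the control flow.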
Acknowledgements

This dissertation is based on research I carried out as a Ph.D. student
in a joint Ph.D. program between the KDD group at University of Munich (LMU) and
the Learning Systems Department (CT IC 4, previously named Neural Computation) of
Siemens AG. During the past three years, I have been extremely fortunate to have the
guidance, support, and friendship of a number of people who helped me grow academically
and personally.
First, I would like to thank Prof. Hans-Peter Kriegel, my supervisor, for his encouragement, constructive suggestions, and constant support during this research. His door was always open whenever I needed his help. I was impressed by the open-mindedness and academic guidance that make the KDD group at LMU so successful.
I am also greatly thankful to Prof. Thomas Hofmann, who kindly agreed to devote his time to supervising my thesis despite his extremely busy research and teaching schedule. I would also like to thank Prof. Martin Wirsing and Prof. Martin Hofmann for their very patient guidance on my oral examination.
My co-supervisor at Siemens AG, Dr. Volker Tresp, is the person who had the greatest role in my intellectual development. He introduced me to the field of statistical machine learning. His enthusiasm, sharp thoughts, open-mindedness, and humor made my research a truly memorable and joyful journey.
I am indebted to Dr. Kai Yu for his friendship and fellowship; he helped me a great deal in both academic research and personal life. I was amazed not only by his solid knowledge of machine learning, but also by his way of thinking: he can always find the critical point of a problem, and we had a very effective and memorable cooperation during my Ph.D. work.
I am grateful to Prof. Bernd Schürmann, the leader of the Learning Systems Department at Siemens AG, for his constant support of my research. I appreciate his emphasis on both scientific research and real-world applications, which has greatly influenced my career plans.
Finishing a thesis is not a one-person job. Here I wish to thank the following people: Prof. Christian Böhm, Prof. Zoubin Ghahramani, Zhao Xu, Yi Huang, Mrs. Susanne Grienberger, Franz Krojer, Dr. Matthias Schubert, Dr. Peer Kröger, Mrs. Christa Singer, Mrs. Christine Herzog, Dr. Kai Heesche, Dr. Christoph Tietz, Dr. Dengyong Zhou,
Dr. Mingrui Wu, Karsten Borgwardt, Arthur Zimek, Markus Bundschus, Stefan Reckow, Christof Störmann, Huaien Gao, Dr. Hugo Zaragoza, Dr. Ralf Herbrich, Dr. Thore Graepel, Dr. Ji-Rong Wen, Dr. Wei-Ying Ma, ...
Of course, I am grateful to my parents, for their patience and love. Without them this
work would never have come into existence.
Shipeng Yu
Munich, Germany
May 2006

Table of Contents
Abstract i
Acknowledgements iii
List of Algorithms ix
List of Figures xi
List of Tables xv
Preface xvii
I Probabilistic Clustering Models 1
1 Overview of Clustering Models 3
1.1 Flat Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 Model-based Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Similarity-based Clustering . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Organization of the Following Chapters . . . . . . . . . . . . . . . . . . . . 7
2 Finite Mixture Models for Clustering 9
2.1 Mixture of Exponential Family Distributions . . . . . . . . . . . . . . . . . 10
2.1.1 Exponential Family Distributions and the Conjugate Family . . . . 10
2.1.2 Mixture of Exponential Family Distributions . . . . . . . . . . . . . 12
2.1.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Model Inference and Parameter Estimation . . . . . . . . . . . . . . . . . . 16
2.2.1 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2 Variational Bayesian Solution . . . . . . . . . . . . . . . . . . . . . . 18
2.2.3 Model Fitting as Clustering . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.4 Variational Bayes for Example Models . . . . . . . . . . . . . . . . . 24
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Infinite Mixture Models 29
3.1 Infinite Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Dirichlet Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Inference and Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 Gibbs Sampling with Pólya Urn Process . . . . . . . . . . . . . . 35
3.3.2 Truncated Dirichlet Process . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.3 Dirichlet-Multinomial Allocation . . . . . . . . . . . . . . . . . . . . 39
3.3.4 Comparison of Variational Bayesian Methods for DP . . . . . . . . . 41
3.4 Empirical Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.1 Old Faithful Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.2 Image Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
II Probabilistic Projection Models 53
4 Overview of Projection Models 55
4.1 Projection Models for Continuous Data . . . . . . . . . . . . . . . . . . . . 56
4.1.1 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . . 56
4.1.2 Kernel Principal Component Analysis (Kernel PCA) . . . . . . . . . 57
4.1.3 Local Projection Methods . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Projection Methods for Discrete Data . . . . . . . . . . . . . . . .
