Statistical issues in machine learning [Elektronische Ressource] : towards reliable split selection and variable importance measures / vorgelegt von: Carolin Strobl

ludwig-maximilians-universitat_munchen - Carolin Strobl

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

198 pages

English

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

A propos
Informations
Extrait

Description

Sujets

Statistical Issues in Machine Learning –
Towards Reliable Split Selection and
Variable Importance Measures
Dissertation
am
Institut fur¨ Statistik
der
Fakult¨at fu¨r Mathematik, Informatik und Statistik
der
Ludwig-Maximilians-Universit¨at Munc¨ hen
Vorgelegt von: Carolin Strobl
Munc¨ hen, den 26. Mai 2008Erstgutachter: Prof. Dr. Thomas Augustin
Zweitgutachter: Prof. Dr. Gerhard Tutz
Externer Gutachter: Prof. Dr. Kurt Hornik
Rigorosum: 2. Juli 2008Abstract
Recursivepartitioningmethodsfrommachinelearningarebeingwidelyappliedinmanyscientiﬁc
ﬁeldssuchas, e.g., geneticsandbioinformatics. Thepresentworkisconcernedwiththetwomain
problems that arise in recursive partitioning, instability and biased variable selection, from a
statistical point of view. With respect to the ﬁrst issue, instability, the entire scope of methods
from standard classiﬁcation trees over robustiﬁed classiﬁcation trees and ensemble methods such
as TWIX, bagging and random forests is covered in this work. While ensemble methods prove to
be much more stable than single trees, they also loose most of their interpretability. Therefore an
adaptive cutpoint selection scheme is suggested with which a TWIX ensemble reduces to a single
tree if the partition is suﬃciently stable. With respect to the second issue, variable selection
bias, the statistical sources of this artifact in single trees and a new form of bias inherent in
ensemble methods based on bootstrap samples are investigated. For single trees, one unbiased
split selection criterion is evaluated and another one newly introduced here. Based on the results
for single trees and further ﬁndings on the eﬀects of bootstrap sampling on association measures,
it is shown that, in addition to using an unbiased split selection criterion, subsampling instead of
bootstrap sampling should be employed in ensemble methods to be able to reliably compare the
variable importance scores of predictor variables of diﬀerent types. The statistical properties and
the null hypothesis of a test for the random forest variable importance are critically investigated.
Finally, a new, conditional importance measure is suggested that allows for a fair comparison in
the case of correlated predictor variables and better reﬂects the null hypothesis of interest.Zusammenfassung
Die Anwendung von Methoden des rekursiven Partitionierens aus dem maschinellen Lernen ist
in vielen Forschungsgebieten, wie z.B. in der Genetik und Bioinformatik, weit verbreitet. Die
vorliegende Arbeit setzt sich aus statistischer Sicht mit den zwei Hauptproblemen des rekursiven
Partitionierens, Instabilit¨at und verzerrter Variablenselektion, auseinander. Im Hinblick auf das
erste Thema, die Instabilit¨at, wird das gesamte Methodenspektrum von herk¨ommlichen Klassi-
ﬁkationsb¨aumen u¨ber robustiﬁzierte Klassiﬁkationsb¨aume und Ensemble Methoden wie TWIX,
Bagging und Random Forests in dieser Arbeit abgedeckt. Ensemble Methoden erweisen sich im
VergleichzueinzelnenKlassiﬁkationsb¨aumenalsdeutlichstabiler,verlierenaberauchgr¨oßtenteils
ihre Interpretierbarkeit. Deshalb wird ein adaptives Bruchpunkt-Selektionskriterium vorgeschla-
gen, mit dem ein TWIX-Ensemble auf einen einzelnen Klassiﬁkationsbaum reduziert wird, falls
diePartitionstabilgenugist. ImHinblickaufdaszweiteThema,dieverzerrteVariablenselektion,
werden die statistischen Ursachen fur¨ dieses Artefakt in einzelnen B¨aumen und eine neue Form
von Verzerrung, die in Ensemble Methoden auftritt die auf Bootstrap-Stichproben beruhen, un-
tersucht. Fu¨r einzelne Bau¨ me wird ein unverzerrtes Selektionskriterien evaluiert und ein anderes
hier neu eingefuh¨ rt. Anhand der Ergebnisse fu¨r einzelne B¨aume und weiteren Untersuchungen zu
den Auswirkungen von Bootstrap-Stichprobenverfahren auf Assoziationsmaße wird gezeigt dass,
neben der Verwendung von unverzerrten Selektionskriterien, Teilstichprobenverfahren anstelle
von Bootstrap-Stichprobenverfahren in Ensemble Methoden verwendet werden sollten, um die
Variable Importance-Werte von Pradikt¨ orvariablen unterschiedlicher Art zuverl¨assig vergleichen
zu k¨onnen. Die statistischen Eigenschaften und die Nullhypothese eines Test fu¨r die Variable
Importance von Random Forests werden kritisch untersucht. Abschliessend wird eine neue, be-
dingte Variable Importance vorgeschlagen, die im Fall von korrelierten Pr¨adiktorvariablen einen
fairen Vergleich erlaubt und die interessierende Nullhypothese besser widerspiegelt.Contents
Scope of this work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Classiﬁcation trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Split selection and stopping rules . . . . . . . . . . . . . . . . . . . 5
1.1.2 Prediction and interpretation . . . . . . . . . . . . . . . . . . . . . 10
1.1.3 Variable selection bias and instability . . . . . . . . . . . . . . . . . 13
1.2 Robust classiﬁcation trees and ensemble methods . . . . . . . . . . . . . . 16
1.3 Characteristics and caveats . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3.1 “Small n large p” applicability . . . . . . . . . . . . . . . . . . . . . 19
1.3.2 Out-of-bag error estimation . . . . . . . . . . . . . . . . . . . . . . 21
1.3.3 Missing value handling . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.3.4 Randomness and stability . . . . . . . . . . . . . . . . . . . . . . . 22
2. Variable selection bias in classiﬁcation trees . . . . . . . . . . . . . . . . . 25
2.1 Entropy estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.1.1 Binary splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28ii Contents
2.1.2 k-ary splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2 Multiple comparisons in cutpoint selection . . . . . . . . . . . . . . . . . . 34
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3. Evaluation of an unbiased variable selection criterion . . . . . . . . . . . 37
3.1 Optimally selected statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 Null case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.2 Power case I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.3 Power case II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Application to veterinary data . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.1 Variable selection ranking . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.2 Selected splitting variables . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4. Robust and unbiased variable selection in k-ary splitting. . . . . . . . . 54
4.1 Classiﬁcation trees based on imprecise probabilities . . . . . . . . . . . . . 55
4.1.1 Total impurity criteria . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.2 Split selection procedure . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.3 Characteristics of the total impurity criterion TU2 . . . . . . . . . 60
4.2 Empirical entropy measures in split selection . . . . . . . . . . . . . . . . . 64
4.2.1 Estimation bias for the empirical Shannon entropy . . . . . . . . . 64
4.2.2 Eﬀects in classiﬁcation trees based on imprecise probabilities . . . . 65Contents iii
4.2.3 Suggested corrections based on the IDM . . . . . . . . . . . . . . . 67
4.3 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5. Adaptive cutpoint selection in TWIX ensembles . . . . . . . . . . . . . . 77
5.1 Building TWIX ensembles . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1.1 Instability of cutpoint selection in recursive partitioning. . . . . . . 80
5.1.2 Selecting extra cutpoints . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 A new, adaptive criterion for selecting extra cutpoints . . . . . . . . . . . . 83
5.2.1 Adding virtual observations . . . . . . . . . . . . . . . . . . . . . . 84
5.2.2 Recomputation of the split criterion . . . . . . . . . . . . . . . . . . 85
5.3 Behavior of the adaptive criterion . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.1 Application to olives data . . . . . . . . . . . . . . . . . . . . . . . 89
5.3.2 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4 Outlook on credal prediction and aggregation schemes . . . . . . . . . . . . 93
5.4.1 Credal prediction rules . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4.2 Aggregation schemes . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6. Unbiased variable importance in random forests and bagging . . . . . . 99
6.1 Random forest variable importance measures . . . . . . . . . . . . . . . . . 100
6.2 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.2.1 Null case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105iv Contents
6.2