
Chapter 1

PARAMETRIC LINK MODELS FOR KNOWLEDGE TRANSFER IN STATISTICAL LEARNING

Beninel F.¹, Biernacki C.², Bouveyron C.³, Jacques J.² and Lourme A.⁴

¹ CREST-ENSAI, Bruz, France
² Université Lille 1 & CNRS & INRIA, Lille, France
³ Université Paris 1 Panthéon-Sorbonne, Paris, France
⁴ Université de Pau et des Pays de l'Adour, Pau, France

Abstract

When a statistical model is designed for prediction purposes, a major assumption is the absence of evolution in the modeled phenomenon between the training and the prediction stages. Thus, training and future data must lie in the same feature space and must have the same distribution. Unfortunately, this assumption often turns out to be false in real-world applications. For instance, biological motivations could lead to classifying individuals from a given species when only individuals from another species are available for training. In regression, we may sometimes use a predictive model for data that do not have exactly the same distribution as the training data used for estimating the model. This chapter presents techniques for transferring a statistical model estimated from a source population to a target population. Three tasks of statistical learning are considered: probabilistic classification (parametric and semi-parametric), linear regression (including mixture of regressions) and model-based clustering (Gaussian and Student).


Key Words: Adaptive estimation, link between populations, transfer learning, classification, regression, clustering, EM algorithm, applications.

AMS Subject Classification: 62H30, 62J99.

E-mail address: julien.jacques@polytech-lille.fr

1. Introduction

Statistical learning [18] is a key tool for many science and application areas, since it allows one to explain and to predict diverse phenomena from the observation of related data. It leads to a wide variety of methods, depending on the particular problem at hand. Examples of such problems are numerous:

• Examples E1: In Credit Scoring, predict the behavior of borrowers to pay back a loan, on the basis of information known about these customers; in Medicine, predict the risk of lung cancer recurrence for a patient treated for a first cancer, on the basis of the type of treatment used for the first cancer and of clinical and demographic measurements for that patient.

• Examples E2: In Economics, predict the housing price on the basis of several housing descriptive variables; in Finance, predict the profitability of a financial asset six months after purchase.

• Examples E3: In Marketing, create customer groups according to their purchase history in order to target a marketing campaign; in Biology, identify groups in a sample of birds described by some biometric features, which finally reveal the presence of different genders.

In a typical statistical learning problem, a response variable y ∈ Y has to be predicted from a set of d feature variables (or covariates) x = (x_1, ..., x_d) ∈ X. The spaces X and Y are usually quantitative or categorical. It is also possible to have heterogeneity in the feature variables (both quantitative and categorical, for instance). The analysis always relies on a training dataset S = (x, y), in which the response and feature variables are observed for a set of n individuals, respectively denoted by x = (x_1, ..., x_n) and y = (y_1, ..., y_n). Using S, a predictive model is built in order to predict the response variable for a new individual, for which the covariates x are observed but not the response y. This typical situation is called supervised learning. In particular, if Y is a categorical space, it corresponds to a discriminant analysis situation, which aims to solve problems like Examples E1. If Y is a quantitative space, it corresponds to a regression situation and aims to solve problems similar to Examples E2. Note also that if y is only partially known in S, this is what is called semi-supervised learning.

Another typical statistical learning problem consists in predicting the whole set of responses y without ever having observed them. In this case only the feature variables are known, thus S = x, and it corresponds to an unsupervised learning situation. If Y is restricted to a categorical space (the most frequent case), the goal is clustering, related problems being illustrated by Examples E3.

In this chapter, we focus on statistical modeling for solving both supervised and unsupervised learning problems. Many classical probabilistic methods exist and we will give useful references, when necessary, throughout the chapter; the reader interested in such references is invited to have a look at the related sections below.

A main assumption in supervised learning is the absence of evolution in the modeled phenomenon between the training of the model and the prediction of the response for a new individual. More precisely, the new individual is assumed to arise from the same statistical population as the training one. In unsupervised learning, it is also implicitly assumed that all individuals arise from the same population. Unfortunately, such classical hypotheses may not hold in many realistic situations, as reflected by the revisited Examples E1* to E3*:

• Examples E1*: In Credit Scoring, the statistical scoring model has been trained on a dataset of customers but is used to predict the behavior of non-customers; in Medicine, the risk of lung cancer recurrence is learned on European patients but will be applied to Asian patients.

• Examples E2*: In Economics, a real-estate agency established for a long time on the US East Coast aims to conquer new markets by opening several agencies on the West Coast, but both markets are quite different; in Finance, expertise in a financial asset from the past year is surely different from the current one.

• Examples E3*: In Marketing, the customers to be classified correspond in fact to a pooled panel of new and older customers; in Biology, different subspecies of birds are pooled together and may consequently have highly different features for the same gender.

In the supervised setting, the question is Q1: Is it necessary to collect new training data and to build a new statistical learning model, or can the previous training data still be useful? In the unsupervised setting, the question is Q2: Is it better to perform a unique clustering on the whole dataset or to perform several independent clusterings on some identified subsets?

Question Q1 is addressed by transfer learning, and a general overview is given in [31]. Transfer learning techniques aim to transfer the knowledge learned on a source population Ω to a target population Ω*, in which this knowledge will be used for prediction. These techniques divide into two important situations: the transfer of a model either does or does not need to observe some response variables in the target domain. The first case is known as inductive transfer learning, whereas the second one is known as transductive transfer learning. Usually, the classification purpose described in Examples E1* can be solved by either transductive or inductive transfer learning, the choice depending on the model at hand (generative or predictive models). Contrariwise, the regression purpose described in Examples E2* can only be solved by inductive transfer learning, since only predictive models are involved. Question Q2 is addressed by unsupervised transfer learning; it corresponds to the simultaneous clustering of several samples and thus concerns Examples E3*.

A common expected advantage of all these transfer learning techniques is a real predictive benefit, since knowledge learned on the source population is used in addition to the available information on the target population. However, the common challenge is to establish a transfer function between the source and the target populations. In this chapter, we focus on parametric statistical models. Besides being good competitors to nonparametric models in terms of prediction, these models have the advantage of being easily interpreted by practitioners. Since parametric models are used, it is natural to model the transfer function by some parametric links. Thus, in addition to a predictive benefit, the interpretability of the link parameters gives practitioners useful information on the evolution and the differences between the source and target populations.

This chapter is organized as follows. Section 2. presents transfer learning for different discriminant analysis contexts: Gaussian model (continuous covariates), Bernoulli model (binary covariates) and logistic model (continuous or binary covariates). Section 3. considers the transfer of regression models for a quantitative response variable in two situations: usual regression and mixture of regressions. Finally, Section 4. proposes models to cluster simultaneously a source and a target population, again in two situations: mixtures of Gaussian and of Student distributions. Each section starts with a presentation of the classical statistical model before presenting the corresponding transfer techniques, and concludes with an application on real data.

A useful notation: In the following, the notation * will refer to the target population.

2. Parametric transfer learning in discriminant analysis

Discriminant analysis is a large methodological field covering machine learning techniques dealing with data where individuals are described by the same set of d covariates, or feature vector, x, and a categorical response variable y ∈ Y = {1, ..., K} related to K classes, where y = k if the individual described by x belongs to the kth class. In a statistical setting, the couple (x, y) is assumed to be a realization of a random vector (X, Y), where X = (X_1, ..., X_d). Then the n-sample S = (x, y) is assumed to consist of n i.i.d. realizations of (X, Y).

The purpose of discriminant analysis is to predict the group membership y on the sole basis of the covariates x. Discriminant analysis proceeds as follows: using S, an allocation rule is built in order to classify non-labeled individuals. Many books explain in detail the numerous techniques related to discriminant analysis [16, 18, 30, 32], among which the main families are the parametric, semi-parametric, non-parametric and borderline-based ones. In this section, we are interested only in parametric (Gaussian and Bernoulli distributions) and semi-parametric (logistic regression) methods.
2.1. Gaussian discriminant analysis

2.1.1. The statistical model

Gaussian discriminant analysis assumes that, conditionally on the group y, the feature variables x ∈ X = R^d arise from a random vector X distributed according to a d-variate Gaussian distribution

X | Y = k ~ N_d(μ_k, Σ_k),

where μ_k ∈ R^d and Σ_k ∈ R^{d×d} are respectively the associated mean and covariance matrix. The probability density of X conditionally on Y = k is

f_k(·; μ_k, Σ_k) = (2π)^{−d/2} |Σ_k|^{−1/2} exp( −(1/2) (· − μ_k)′ Σ_k^{−1} (· − μ_k) ).
The marginal distribution of X is then a mixture of Gaussian distributions

f(·; θ) = Σ_{k=1}^{K} π_k f_k(·; μ_k, Σ_k),

where (π_1, ..., π_K) are the mixing proportions (π_k > 0 and Σ_{k=1}^{K} π_k = 1) and θ = {(π_k, μ_k, Σ_k) : k = 1, ..., K} is the whole parameter. When the costs of misclassification are assumed to be symmetric, the Maximum A Posteriori (MAP) rule consists in assigning a new individual x to the group ŷ maximizing the membership conditional probability t_y(x; θ):

ŷ = argmax_{k ∈ {1,...,K}} t_k(x; θ),    (1)

where

t_k(x; θ) = P(Y = k | X = x; θ) = π_k f_k(x; α_k) / f(x; θ).    (2)
In the general heteroscedastic situation (quadratic discriminant analysis, or QDA), θ is estimated by its classical empirical estimates:

π̂_k = n_k / n,   μ̂_k = (1/n_k) Σ_{i: y_i = k} x_i,   Σ̂_k = (1/(n_k − 1)) Σ_{i: y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)′,

where n_k = card{i : y_i = k} is the number of individuals of the training sample S belonging to the group k. In the restricted homoscedastic situation Σ_k = Σ for all k (linear discriminant analysis, or LDA), the common covariance matrix is estimated by

Σ̂ = (1/(n − K)) Σ_{k=1}^{K} Σ_{i: y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)′.
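As an illustration, the plug-in estimates above and the MAP rule (1) can be sketched in a few lines of NumPy. This is a minimal sketch with our own (hypothetical) function names, not code from the chapter:

```python
import numpy as np

def fit_qda(X, y, K):
    """Empirical estimates (pi_k, mu_k, Sigma_k) for classes k = 0, ..., K-1."""
    n, d = X.shape
    params = []
    for k in range(K):
        Xk = X[y == k]
        nk = len(Xk)
        pi_k = nk / n                       # hat(pi)_k = n_k / n
        mu_k = Xk.mean(axis=0)              # hat(mu)_k
        # hat(Sigma)_k with the 1/(n_k - 1) denominator used in the chapter
        Sigma_k = (Xk - mu_k).T @ (Xk - mu_k) / (nk - 1)
        params.append((pi_k, mu_k, Sigma_k))
    return params

def log_gaussian(x, mu, Sigma):
    """Log of the d-variate Gaussian density f_k(x; mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    return -0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1]
                   + diff @ np.linalg.solve(Sigma, diff))

def map_classify(x, params):
    """MAP rule (1): argmax_k of log(pi_k) + log f_k(x)."""
    scores = [np.log(pi) + log_gaussian(x, mu, S) for pi, mu, S in params]
    return int(np.argmax(scores))
```

For LDA, the per-class covariance estimates would simply be replaced by the pooled estimate Σ̂ above.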
2.1.2. The transfer learning and its estimation

We now assume that the data consist of two samples: a first labeled n-sample S = (x, y), drawn from a source population Ω, and a second unlabeled n*-sample S* = x*, drawn from a target population Ω*. Our goal is to build a classification rule for the target population using both samples S and S*. An extension to a partially-labeled target sample S* will also be presented later. The source labeled sample S is composed of n pairs (x_i, y_i), assumed to be i.i.d. realizations of the random couple (X, Y) with distribution

X | Y = k ~ N_d(μ_k, Σ_k) and Y ~ M_1(π_1, ..., π_K),

where M_1 is the one-order multinomial distribution. The target unlabeled sample S* is composed of n* points x*_i, i.i.d. realizations of X* with the following Gaussian mixture distribution:

X* ~ f(·; θ*).

In order to use both samples S and S* for the classification of the S* sample (or of any new individual x from Ω*), the approach developed in [3] consists in establishing a stochastic relationship φ_k (R^d → R^d) between the feature vectors of both populations conditionally on the groups, i.e.

X* | Y* = k =_D φ_k(X | Y = k) = [φ_k1(X | Y = k), ..., φ_kd(X | Y = k)],

where =_D means that the equality holds in distribution, and each φ_kj (j = 1, ..., d) is a map from R^d to R. Two natural assumptions are considered:

• A1: The jth component φ_kj(X | Y = k) only depends on the jth component of X | Y = k;

• A2: Each φ_kj is C^1.

As a consequence of the previous assumptions, [10] derives the K relations

X* | Y* = k =_D D_k (X | Y = k) + b_k   (k = 1, ..., K)    (3)

with D_k a d × d real diagonal matrix and b_k a d-dimensional real vector. Therefore, we establish the following relations between the parameters of the Gaussian distributions related to the populations Ω and Ω*:

μ*_k = D_k μ_k + b_k and Σ*_k = D_k Σ_k D_k.    (4)

Such relations allow one to determine the allocation rules for the population Ω* using the parameters of the feature vector distribution for individuals of Ω. Indeed, if the K pairs (D_k, b_k) are known, it is easy to derive the pairs (μ*_k, Σ*_k) from (μ_k, Σ_k) by plug-in. In what follows we discuss situations where the pairs (D_k, b_k) are unknown, and we propose several scenarios for estimating them.

Constrained models
For identifiability reasons we impose that b_k = 0 for all k = 1, ..., K. This assumption is discussed in the seminal article on transfer learning in Gaussian discriminant analysis [3], and validated on the biological application analysed in that article. The case without constraints on b_k is treated in [24], which provides a specific computational approach for avoiding identifiability problems (see also Section 4. of the present chapter).

In order to define parsimonious and meaningful models, constraints are now imposed on the transfer parameters D_k (k = 1, ..., K):

• Model M1: D_k = I_d: the K distributions are the same (I_d: identity matrix of R^{d×d}).

• Model M2: D_k = a I_d: transformations are feature and group independent.

• Model M3: D_k = D: transformations are only group independent.

• Model M4: D_k = a_k I_d: transformations are only feature independent.

• Model M5: D_k is unconstrained, i.e. it is the most general situation.

Model M1 consists in using allocation rules on Ω* based only on S, i.e. we deal here with classical discriminant analysis. Models M2 and M3 preserve homoscedasticity and consequently a possible linearity of the rule: if Σ_1 = ... = Σ_K for Ω, then Σ*_1 = ... = Σ*_K for Ω*. The last models, M4 and M5, may transform linear allocation rules into quadratic ones on Ω*, with few parameters to estimate.

For each model, an additional assumption on the mixing proportions is made: they are either the same in both populations or have to be estimated in the target population. The corresponding models are denoted by M_j and respectively πM_j (1 ≤ j ≤ 5). The number of free parameters for each model is given in Table 1.

Model       M1    M2    M3    M4    M5    πM1    πM2    πM3      πM4     πM5
Parameters  0     1     d     K     dK    K−1    K      d+K−1    2K−1    dK+K−1

Table 1. Number of estimated parameters for each model.
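To make the plug-in concrete, relation (4) under the identifiability constraint b_k = 0 can be sketched as follows (our own illustration with hypothetical values; the function name is ours):

```python
import numpy as np

def transfer_parameters(mu_k, Sigma_k, d_k, b_k=None):
    """Relation (4): mu*_k = D_k mu_k + b_k and Sigma*_k = D_k Sigma_k D_k.
    d_k holds the diagonal of D_k; b_k defaults to 0 (the identifiability constraint)."""
    D = np.diag(d_k)
    b = np.zeros_like(mu_k) if b_k is None else b_k
    return D @ mu_k + b, D @ Sigma_k @ D

# Example under model M4 (D_k = a_k I_d), with a hypothetical a_k = 2:
mu_star, Sigma_star = transfer_parameters(np.array([1.0, 2.0]), np.eye(2), 2.0 * np.ones(2))
```

Model M5 would pass an arbitrary positive diagonal for d_k, while models M2 and M3 would share the same d_k across the K groups.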

Parameter estimation. A sequential plug-in procedure is used to estimate the matrices D_1, ..., D_K (and possibly π*_1, ..., π*_K). The corresponding estimators depend on the parameter θ of the population Ω. When the latter is unknown, it is simply replaced by its estimate. Estimating all the π*_k and all the D_k is performed by maximizing the following likelihood, under the constraints given in (4) and under the constraints of one of the previous parsimonious models M_j or πM_j (j = 1, ..., 5):

L(θ*) = Π_{i=1}^{n*} f(x*_i; θ*).    (5)
A usual way to maximize the likelihood when the group memberships y*_i are unknown is to use an EM algorithm [11], which consists in iterating the two following steps:

• E step: Estimation of the group membership y*_i by its expectation conditional on the observed data:

ŷ*_i = argmax_{k ∈ {1,...,K}} t_k(x*_i; θ*).

• M step: Computation of the parameter θ* maximizing, under the constraints given in (4) and under the constraint of a given parsimonious model (M_j or πM_j), the following completed log-likelihood:

ℓ_c(θ*) = Σ_{k=1}^{K} Σ_{i: ŷ*_i = k} ln[π*_k f_k(x*_i; α*_k)].

The EM algorithm stops when the growth of the likelihood is lower than a fixed threshold.
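The alternation above can be sketched for the simplest link model M2 (D_k = a I_d, b_k = 0, source proportions kept), where the M step reduces to a one-dimensional maximization that we perform here by a crude grid search. This is our own schematic illustration under those simplifying assumptions, not the authors' implementation:

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Row-wise log-density of N_d(mu, Sigma) for a sample X of shape (n, d)."""
    d = len(mu)
    diff = X - mu
    sol = np.linalg.solve(Sigma, diff.T).T
    return -0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1]
                   + np.sum(diff * sol, axis=1))

def completed_loglik(a, Xstar, z, pis, mus, Sigmas):
    """Completed log-likelihood for model M2: mu*_k = a mu_k, Sigma*_k = a^2 Sigma_k."""
    ll = 0.0
    for k in range(len(pis)):
        Xk = Xstar[z == k]
        if len(Xk):
            ll += np.sum(np.log(pis[k]) + log_gauss(Xk, a * mus[k], a**2 * Sigmas[k]))
    return ll

def em_link_m2(Xstar, pis, mus, Sigmas, grid=np.linspace(0.1, 5.0, 200), n_iter=50):
    a = 1.0
    for _ in range(n_iter):
        # E step: hard assignment by the MAP rule, as in the chapter
        logt = np.stack([np.log(pis[k]) + log_gauss(Xstar, a * mus[k], a**2 * Sigmas[k])
                         for k in range(len(pis))])
        z = np.argmax(logt, axis=0)
        # M step: maximize the completed log-likelihood over the scalar a (grid search)
        a = max(grid, key=lambda g: completed_loglik(g, Xstar, z, pis, mus, Sigmas))
    return a, z
```

For the richer models M3 to M5 and πM_j, the M step is a constrained multivariate maximization and a numerical optimizer would replace the grid search.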
In order to choose between the several constrained models, the BIC criterion (Bayesian Information Criterion, [34]) is used:

BIC = −2 ℓ̂ + |θ*| ln n*    (6)

where ℓ̂ is the maximum log-likelihood value and |θ*| denotes the number of continuous model parameters in θ*. The model leading to the minimum BIC value is retained. Note that the BIC criterion is faster to compute than any cross-validation criterion.
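In code, criterion (6) is a one-liner; as a hypothetical example, |θ*| would be d for model M3 and d + K − 1 for πM3 (cf. Table 1). The helper names below are ours:

```python
import numpy as np

def bic(max_loglik, n_free_params, n_star):
    """Criterion (6): -2 * (max log-likelihood) + |theta*| * ln(n*); smaller is better."""
    return -2.0 * max_loglik + n_free_params * np.log(n_star)

def select_model(candidates):
    """candidates: dict name -> (max_loglik, n_free_params, n_star). Keeps minimal BIC."""
    return min(candidates, key=lambda m: bic(*candidates[m]))
```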
2.1.3. A biological application

Data. The data are related to seabirds from the Cory's Shearwater Calonectris diomedea species, breeding in the Mediterranean and the North Atlantic, where presumably contrasted oceanographic conditions have led to the existence of marked subspecies differing in size as well as in coloration and behavior [35]. The subspecies are borealis, living in the Atlantic islands (the Azores, Canaries, etc.), diomedea, living in the Mediterranean islands (Balearics, Corsica, etc.), and edwardsii, from the Cape Verde Islands.

A sample of borealis (n = 206, 45% females) was measured using skins in several National Museums. Five morphological variables were measured: culmen (bill length), tarsus, wing and tail lengths, and culmen depth. Similarly, a sample of subspecies diomedea (n* = 38, 58% females) was measured using the same set of variables. Figure 1 plots culmen depth and tarsus length for the borealis and diomedea samples.
Figure 1. Borealis and diomedea samples for the variables culmen depth and tarsus length.
In the following, we consider the borealis sample as the source labeled population, and the diomedea sample as the target population (non-labeled or partially-labeled). In reality, both of our samples are sexed, but the sex of diomedea will only be used to measure the quality of the results provided by the proposed method.

Results in the non-sexed case. We consider in this section that all diomedea specimens are non-sexed. The linear discriminant analysis model is selected for the borealis population. We apply the parameters estimated on the borealis sample, using the 10 models, to the non-sexed diomedea sample. The results, empirical error rate (deduced from the true partition of diomedea) and BIC value, are given for each model in Table 2. Moreover, the empirical error rate of the cluster analysis situation is reported in the last column of Table 2. The clustering procedure (see for instance [9]) consists in estimating the Gaussian mixture parameters of the non-sexed diomedea sample without using the borealis sample.
High error rates are generally obtained with standard discriminant analysis (models M1 and πM1) and with standard cluster analysis, as compared to the other transfer learning models. The best model selected by the empirical error rate is πM3. This model preserves homoscedasticity, a relevant property since both discriminant rules selected by a cross-validation criterion separately on each sample S and S* were homoscedastic (LDA). Moreover, it indicates that the proportion of females is not the same in the two samples. The model selected by the BIC criterion is M3, and its error rate is the second best value. So, the transformation from borealis to diomedea seems to be sex-independent but not variable-independent. It should also be noted that the BIC value for πM3 is very close to the one for M3.

model   M1       M2       M3       M4       M5
error   42.11    31.58    18.43    28.95    21.06
BIC     -753.49  -502.13  -451.51  -503.74  -457.69

model   πM1      πM2      πM3      πM4      πM5      clustering
error   42.11    42.11    15.79    42.11    21.06    44.73
BIC     -725.24  -489.43  -453.20  -491.23  -459.51

Table 2. Empirical error rate (error) and BIC value (BIC) in the non-sexed case.

Results in the partially-sexed case. We consider in this section that two labels (therefore 5.26% of the dataset) are known in the diomedea sample; thus a part of y* is known, and S* now consists of x* together with this known part of y*. The empirical error rate is computed on the 36 a priori non-sexed birds. The two labels are chosen at random 30 times, leading to 30 partially-sexed samples. The 10 models and cluster analysis (also using this new sex information, which leads to a semi-supervised situation) are applied successively to the 30 partially-sexed diomedea samples. The means of the error rate and of the BIC criterion are displayed in Table 3.

model   M1       M2       M3       M4       M5
error   42.41    31.94    18.70    29.91    18.98
BIC     -753.49  -502.13  -451.56  -503.92  -457.95

model   πM1      πM2      πM3      πM4      πM5      clustering
error   42.41    42.69    15.37    42.69    20.93    21.13
BIC     -725.99  -489.95  -453.32  -491.77  -460.74

Table 3. Mean over the 30 samples of the empirical error rate (error) and the BIC value (BIC) in the partially-sexed case.

The partial information on sex provides lower error rates for models πM3, πM5, M5 and the clustering method, with model πM3 still being the best. The BIC criterion still selects model M3 (with a low error rate) and then πM3. We note that, except for model M5, only the adapted models improve thanks to this new label knowledge. Moreover, the more complex the model is, the more strongly the classification error decreases. This is the case for clustering: it shows a large improvement in this example, moving from the last rank to a level close to πM5.

2.2. Discriminant analysis for binary data

2.2.1. The statistical model

We now consider discriminant analysis for binary feature variables, so X = {0, 1}^d. While the Gaussian assumption is common for quantitative feature variables, a binary feature x_j is commonly assumed to arise from a random variable X_j having, conditionally on Y, a Bernoulli distribution B(α_kj) of parameter α_kj (0 < α_kj < 1):

X_j | Y = k ~ B(α_kj)   (j = 1, ..., d).    (7)

Using the assumption of conditional independence of the explanatory variables [8, 13], the probability density function of X, conditionally on Y, is

f_k(x; α_k) = Π_{j=1}^{d} α_kj^{x_j} (1 − α_kj)^{1 − x_j},    (8)

where α_k = (α_k1, ..., α_kd). The mixing proportions π_k and the whole parameter θ = {(π_k, α_k), k = 1, ..., K} are then defined similarly to the previous Gaussian situation. Maximum likelihood (ML) estimates of all the α_kj are simply given by the following relative empirical frequencies:

α̂_kj = card{i : y_i = k, x_ij = 1} / n_k.

The estimation of any y is then obtained by the MAP principle given in (1), θ being replaced by its plug-in estimate (estimates of the mixing proportions π_k are the same as in the Gaussian situation).
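Under the conditional independence assumption, fitting and applying this model is straightforward. The following sketch (our own illustration, with hypothetical names) computes the relative frequencies α̂_kj and classifies by the MAP rule (1) with density (8):

```python
import numpy as np

def fit_bernoulli(X, y, K, eps=1e-9):
    """ML estimates: pi_k and alpha_kj = card{i: y_i = k, x_ij = 1} / n_k."""
    pis = np.array([np.mean(y == k) for k in range(K)])
    alphas = np.array([X[y == k].mean(axis=0) for k in range(K)])
    # Clipping keeps the logs finite when a frequency is exactly 0 or 1
    # (a numerical safeguard of ours, not part of the model).
    return pis, np.clip(alphas, eps, 1 - eps)

def map_classify_bernoulli(x, pis, alphas):
    """MAP rule (1) with the conditional-independence density (8), in log form."""
    logp = np.log(pis) + np.log(alphas) @ x + np.log(1 - alphas) @ (1 - x)
    return int(np.argmax(logp))
```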
2.2.2. The transfer learning

Defining a transfer function. The feature variables in the target population Ω* are assumed to have the same distribution as (7), but with possibly different parameters α*_kj:

X*_j | Y* = k ~ B(α*_kj).

In the multinormal context, the transfer learning challenge was met by considering a linear stochastic relationship between the source Ω and the target Ω*. This link was not only justified (under very few assumptions) but also intuitive [3]. In the binary context, such an intuitive relationship seems more difficult to exhibit. The idea developed in [22] is to assume that the binary variables result from the discretization of some latent Gaussian variables. From a stochastic link between the latent variables analogous to (3), the following link between the parameters α*_kj of Ω* and α_kj of Ω is obtained:

α*_kj = Φ(δ_kj Φ^{−1}(α_kj) + λ_j γ_kj),    (9)

where Φ is the cumulative distribution function of N(0, 1), δ_kj ∈ R+\{0}, λ_j ∈ {−1, 1} and γ_kj ∈ R. Note that this relationship corresponds to a linear link between the probit functions of both α_kj and α*_kj.
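Link (9) is easy to evaluate with the standard normal distribution; a minimal sketch, with hypothetical parameter values:

```python
from statistics import NormalDist

def probit_link(alpha_kj, delta_kj, lam_j, gamma_kj):
    """Link (9): alpha*_kj = Phi(delta_kj * Phi^-1(alpha_kj) + lam_j * gamma_kj),
    with delta_kj > 0, lam_j in {-1, +1} and gamma_kj real."""
    nd = NormalDist()  # standard N(0, 1): cdf is Phi, inv_cdf is Phi^-1
    return nd.cdf(delta_kj * nd.inv_cdf(alpha_kj) + lam_j * gamma_kj)
```

With δ_kj = 1 and γ_kj = 0 the link is the identity, i.e. α*_kj = α_kj, which recovers the no-evolution case.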
Conditionally on the fact that the α_kj are known (they will be estimated in practice), estimation of the Kd continuous parameters α*_kj is thus obtained from estimates of the link