in density estimation

Dissertation

for the attainment of the doctoral degree Dr. rer. nat.

of the Faculty of Mathematics and Economics

of Universität Ulm

submitted by

Christian Wagner

from

Öhringen

2009

Acting Dean: Prof. Dr. Werner Kratz

First reviewer: Prof. Dr. Ulrich Stadtmüller

Second reviewer: Prof. Dr. Volker Schmidt

Date of the doctoral examination: June 8, 2009

Contents

Introduction and Summary
    Motivation
    Concepts in density estimation
    Focus of this work
    Deconvolution of densities
    Results about approximative deconvolution estimators
    Results about contaminated-data-only models
    Aggregated data models and corresponding results
    Conclusions

1 Density Estimation Methods
    1.1 Direct density estimation
        1.1.1 Quality measures
        1.1.2 Asymptotic properties of the direct density estimation
    1.2 Density deconvolution
        1.2.1 Classical consistency results
        1.2.2 Supersmooth target densities
        1.2.3 Approximative deconvolution methods
    1.3 Aggregated data models

2 Unknown Error Density
    2.1 TAYLEX and SIMEX estimators
        2.1.1 Justification of the appearing bias reduction
        2.1.2 Consistency results
        2.1.3 Simulations
        2.1.4 Proofs
    2.2 Modified variance
        2.2.1 Model and estimator
        2.2.2 Consistency results
        2.2.3 Simulations
        2.2.4 Proof of Theorem 2.2.1
    2.3 Additional error
        2.3.1 Model and estimator
        2.3.2 Consistency results
        2.3.3 Proof of Theorem 2.3.1

3 Aggregated Data Models
    3.1 Estimators and assumptions
    3.2 Consistency and minimax rates
    3.3 Properties of the unweighted estimator
    3.4 Aggregated data models in density deconvolution
    3.5 Proofs
        3.5.1 Proofs of Lemmas 3.2.1 and 3.2.2
        3.5.2 Proofs of Theorems 3.2.1 and 3.2.2
        3.5.3 Proof of Theorem 3.2.3
        3.5.4 Proof of Theorem 3.3.1
        3.5.5 Proof of Theorem 3.4.1

A Appendix
    A.1 Spaces of continuous functions
    A.2 Integration theory
    A.3 Fourier transforms
    A.4 Characteristic functions
    A.5 Sobolev spaces
    A.6 Inequalities

B Auxiliary Results

List of Figures

List of Symbols

Bibliography

German Summary

Introduction and Summary

Motivation

In many circumstances repeated measurements of a quantity are observed and one would like to

gain as much information as possible from these observations. Examples are the log return per

day of a company share, or the policy holders’ lifespans for life insurances. Other quantities of

interest could be participants’ blood pressures in a clinical study or households’ consumptions

of electricity per year. There are a variety of other situations where repeated observations are

possible, for instance in biology, geology, other fields of science, or social science. In statistics these

repeated measurements are often modeled as realisations of random objects, so-called random

variables. Under this assumption the ﬁrst characteristics of the data one might consider are the

mean or the variance. Yet, both values contain only partial information about the distribution

of a random variable, whereas complete information is given by the cumulative distribution

function, with its empirical counterpart, the so-called empirical distribution function. However,

when plotting the empirical distribution function, one only receives a step function and it is

hard to obtain more detailed information from this graph. In case of an absolutely continuous

random variable, the density function of the quantity of interest can be estimated to circumvent

this downside. Using an appropriate estimate of the density function allows for information

about modes, symmetry, and frequent values of the random variable to be gathered. Even more

information about modes or the change in frequency of the random variable’s values should be

attainable through an estimate of the density’s derivatives. Both the estimation of a density and

its derivatives will be addressed in this work. However, since both problems lead to comparable

considerations, the focus is on density estimation in the introduction.

Concepts in density estimation

There are two basically diﬀerent approaches for estimating a density. The ﬁrst is the parametric

density estimation approach, where one assumes that the observations come from a parametric

family of densities that has to be speciﬁed in advance. Then, the task is to estimate the pa-

rameters that ﬁt the data best. The second approach is the nonparametric density estimation,

where one does not impose a certain functional form on the density. Instead, one tries to ﬁnd an

estimate using only minor assumptions such as some smoothness of the density. The parametric

estimation procedure has advantages but also large drawbacks. Foremost, it is diﬃcult to specify

an appropriate parametric family, since this information might often not be directly accessible from the situation in which the data was observed. Moreover, a misspecification of the family of

densities will lead to an estimate that does not capture important structures of the density; a

fact that contradicts the objective of obtaining as much information from the data as possible.

Additionally, even if one insists on using a parametric approach, a nonparametric estimate will

give a good starting point to ﬁnd an appropriate model. For this reason, nonparametric density

estimation will be studied in more detail here.


There are various density estimation procedures in nonparametric density estimation like

orthogonal series methods and histograms with ﬁxed or random partition, among others. His-

tograms with ﬁxed partition were for example studied in Révész [1971], for an introduction to

the other mentioned methods see for instance Prakasa Rao [1983]. However, the estimators most

commonly used in nonparametric density estimation are the so-called kernel density estimators

introduced in Rosenblatt [1956] and studied in more detail in Parzen [1962]. An estimator of this

type has already been studied in a less general setting in Akaike [1954]. Due to its importance

the explicit formula will be given here. For $n$ independent random variables $X_1, \ldots, X_n$ that are identically distributed as $X$, the kernel density estimator of the density $f_X$ at the point $\xi$ is given by

$$\hat f_X(\xi) = \frac{1}{nh} \sum_{j=1}^{n} K\left(\frac{\xi - X_j}{h}\right),$$

where $h$ is a positive real number, the so-called bandwidth, and $K(y)$ is a so-called kernel function.

For instance, the kernel could be a standard normal density.
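A minimal numerical sketch of this estimator, using a standard normal kernel; the sample, grid, and bandwidth $h = 0.3$ below are illustrative choices, not prescribed by the text:

```python
import numpy as np

def gaussian_kernel(y):
    """Standard normal density, used as the kernel K."""
    return np.exp(-0.5 * y**2) / np.sqrt(2.0 * np.pi)

def kde(xs, xi, h):
    """Kernel density estimate at each point of xi, from the sample xs with bandwidth h."""
    return gaussian_kernel((xi - xs[:, None]) / h).mean(axis=0) / h

rng = np.random.default_rng(0)
sample = rng.standard_normal(2000)     # X_1, ..., X_n drawn from N(0, 1)
grid = np.linspace(-3.0, 3.0, 7)
estimate = kde(sample, grid, h=0.3)    # approximates the N(0, 1) density on the grid
```

For a sample of this size the estimate already lies close to the true standard normal density at every grid point.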

The kernel density estimator is not very sensitive to the choice of the kernel, in contrast

to the choice of the bandwidth h. The bandwidth’s importance is justiﬁed by the observation

that the kernel density estimator is in general not an unbiased estimator of $f_X$, which means that its expected value is not the value $f_X(\xi)$ itself. However, it is common for estimators in

nonparametric estimation procedures to be biased. Because of this bias, there are two opposing objectives in finding an appropriate bandwidth. On the one hand, it would be good to choose $h$

small such that only observations very close to ξ have an impact on the density estimate at the

point ξ. This approach will usually give an estimate with a small bias, whereas it has a large

variance since only very few observations determine the value of the estimator. On the other

hand, it would therefore be good to utilize a large $h$ in order to use many observations, resulting in a small variance but a large bias. This so-called bias-variance tradeoff is an intrinsic difficulty

of nonparametric estimation procedures. Since this tradeoﬀ can be controlled by the choice of

the bandwidth, it is very important to ﬁnd appropriate ones.
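The tradeoff can be made visible numerically. In the following toy simulation (an assumed setup with standard normal data, not taken from the text), the integrated squared error is computed for an undersmoothed, a moderate, and an oversmoothed bandwidth:

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.standard_normal(500)              # X_1, ..., X_500 drawn from N(0, 1)
grid = np.linspace(-4.0, 4.0, 801)
true = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)

def kde(xs, xi, h):
    """Gaussian-kernel density estimate on a grid of points xi."""
    z = (xi - xs[:, None]) / h
    return (np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)).mean(axis=0) / h

dx = grid[1] - grid[0]
ise = {h: ((kde(sample, grid, h) - true) ** 2).sum() * dx
       for h in (0.02, 0.3, 3.0)}   # undersmoothed / moderate / oversmoothed
```

With the undersmoothed bandwidth the variance term dominates, with the oversmoothed one the bias term dominates; the moderate bandwidth yields the smallest integrated squared error.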

In finite samples, there are various procedures to find reasonable bandwidths. A popular approach in this context is least squares cross-validation, developed independently in Rudemo [1982] and Bowman [1984]. Another commonly used technique is plug-in bandwidth selection, in which an approximative formula for the so-called mean integrated squared error (MISE), defined in (1.1.5) below, is derived and subsequently minimized in $h$. Further methods are, among others, smoothed cross-validation, see Müller [1985] and Staniswalis [1989], or empirical-bias bandwidth selection (EBBS), see Ruppert [1997].
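To make the least squares cross-validation idea concrete, here is a minimal sketch assuming a Gaussian kernel, for which the term $\int \hat f^2$ has a closed form via pairwise differences; the criterion shape and the grid of candidate bandwidths are illustrative assumptions:

```python
import numpy as np

def normal_pdf(x, s):
    """Density of N(0, s^2) evaluated at x."""
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def lscv(xs, h):
    """Least squares cross-validation criterion
    CV(h) = int f_hat^2 - (2/n) sum_i f_hat_{-i}(X_i),
    where f_hat_{-i} is the estimate computed without the i-th observation."""
    n = len(xs)
    d = xs[:, None] - xs[None, :]                       # all pairwise differences
    int_fhat_sq = normal_pdf(d, h * np.sqrt(2.0)).sum() / n**2
    loo = (normal_pdf(d, h).sum() - n * normal_pdf(0.0, h)) / (n * (n - 1))
    return int_fhat_sq - 2.0 * loo

rng = np.random.default_rng(2)
xs = rng.standard_normal(300)
hs = np.linspace(0.05, 1.5, 30)
h_cv = hs[np.argmin([lscv(xs, h) for h in hs])]         # data-driven bandwidth
```

Minimizing CV over the candidate grid selects a bandwidth between the extreme under- and oversmoothing regimes.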

In the context of this work, however, asymptotically optimal bandwidth choices are relevant. This means that a sequence of bandwidths $h_n$ with good properties has to be found. Thus, the aim is to find sequences $h_n$ such that the distance of the resulting estimate to the true underlying density $f$ tends to zero at the fastest possible rate.

Therefore, some concept of distance between an estimator $\hat f_n$ and the true density $f$ is required. One possible approach is to define the error only at a single point, but since one estimates a function defined on the whole real line, it is preferable to measure the distance on the whole real line as well. The two most studied measures of deviation on the real line are the MISE, which gives the expected value of the squared $L^2$-distance between the estimator $\hat f_n$ and the target density $f$, and the mean integrated absolute error (MIAE), defined in (1.1.7) below, which utilizes the $L^1$-distance. Although using the $L^1$-norm seems to give the correct distance,

considering that the distance between densities has to be evaluated, the MISE is more commonly

used. This popularity is justified by the facts that the MISE allows a direct decomposition into a bias and a variance part, see (1.1.6) below, and that it is computationally easier to handle.

Results about the MIAE as a measure of quality in density estimation are for instance given in

Devroye and Györﬁ [1985], Devroye [1987], and Eggermont and LaRiccia [2001].

In order to ﬁnd the fastest possible rates of convergence for the distances introduced above,

it is necessary to restrict the considerations to subsets of all densities, so-called density classes,

such that the difficulty of the density estimation is comparable over the whole subset. The optimal

rates of convergence of the MISE for diﬀerent density classes were ﬁrst studied in Watson and

Leadbetter [1963]. In Davis [1977], it is proved that the usual kernel density estimator reaches

these rates when using a special kernel, the so-called sinc-kernel deﬁned in (1.1.8) below. It is

not clear in advance, however, that by choosing appropriate bandwidths $h_n$ the kernel estimator can also reach optimal rates of convergence for more general kernels. Nonetheless, e.g. for the MIAE in Devroye [1987], see Theorem 1.1.2 below, a bandwidth choice can be found such that

the kernel estimator reaches the fastest attainable rates.

The choice of the optimal bandwidths and the optimal convergence rates usually depend on

n and parameters inﬂuenced by the kernel K and the underlying density f that are in general

unknown in advance. Hence, it could be argued that for the finite sample case the asymptotically

optimal bandwidths choices are not helpful. However, to analyse the convergence properties of

an estimator for growing sample size n and to compare the derived rates to the best attainable

ones in the corresponding situation are important questions in their own right. This importance

is justiﬁed by the fact that for growing sample size these rates indicate how large the error

improvement is that one can hope for.

The parameter that essentially determines the optimal convergence rates is the smoothness of the density $f$. In short, the smoother $f$ is, the faster is the attainable rate of convergence. Yet, for common smoothness classes the best attainable rates of the MISE that can be reached uniformly over the whole class are slower than the usual parametric rate $1/n$, see Theorem 1.1.4 below. It is nonetheless possible to reach the rate $1/n$ here too, as proved in Watson and Leadbetter [1963]. In this reference it is shown that for densities $f$ with a characteristic function with bounded

support the rate 1/n is attainable. Moreover, it is proved there that this rate is the best rate any

density estimator can reach for arbitrary densities f. It is important to note that to reach the

rate $1/n$ in parametric density estimation it is mandatory that the density follows a parametric model and that the correct one was specified beforehand.

Focus of this work

Although density estimation is well studied, there are many settings where classical approaches

are not directly applicable. From the introductory examples it can already be seen that in

many practical applications direct access to the data of interest might not be possible. In such

situations, the target density $f$ has to be restored from the data. With the purpose of explaining

this problem further, the example of the blood pressure from above is considered in more detail.

There, the observations might additionally depend on the time of day the blood pressure was

measured, the person that performed the measurement, or some plain measurement error. Yet,

of interest is only the participant’s true blood pressure. Models in which the data are not directly observable but only contaminated with some unobservable additive effects are so-called

errors-in-variables models. In this situation a so-called deconvolution problem for the densities

has to be solved, which will be explained a little later.

Another possible situation, where reconstruction of the density is crucial, is exhibited in the

model of electricity consumption. Here one might not only be interested in the consumption

per household but also in the consumption per individual. In this setting, however, a large

amount of the data is not a direct observation of an individual person’s consumption but the

consumption of a larger household. Hence, these observations are the sum of the consumptions

of more than one individual. Here, the data obtained from the larger households cannot be used

directly, whereas from a statistical point of view it would be desirable to include this data in

an estimation procedure. This type of model is called an aggregated data model; it will also be introduced in more detail in the next sections.

The interest in this work is on introducing estimators for the diﬀerent studied settings that

ﬁt realistic datasets better than the classical approaches. For all estimators their respective con-

vergence properties will be analysed and, in particular, optimal rates will be derived if possible.

Therefore, all proved rates will be compared to the optimal rates known under additional assumptions, and in one situation a minimax rate of convergence will be proved. Since

the interest is on asymptotic properties, data dependent choices of the parameters will not be

addressed here.

Deconvolution of densities

As explained above, for many realistic datasets an errors-in-variables model is useful. Further

examples and justiﬁcations for such models can for instance be found in Carroll et al. [1995]. In

these models, the observable quantity is usually modeled as a random variable $W$, which can be written as the sum of the random variable $X$ of interest and the error variable $\varepsilon$. Consequently, the observable random variable is given by $W = X + \varepsilon$, where $X$ and $\varepsilon$ are assumed to be independent. Hence, one can only observe a sample from the convolved distribution and, assuming $X$ and $\varepsilon$ to be absolutely continuous, from the convolved density $f_W = f_X * f_\varepsilon$. Thus, being interested in the density $f_X$ and requiring $f_\varepsilon$ to be known, a deconvolution problem has to be solved. Usually these problems are easier to solve in the Fourier domain.

To find an estimator for $f_X$, one uses the fact that a convolution becomes a multiplication in the Fourier domain, i.e. $\varphi_W(t) = \varphi_X(t)\varphi_\varepsilon(t)$, where $\varphi_W(t)$ and $\varphi_\varepsilon(t)$ denote the characteristic functions of the random variables $W$ and $\varepsilon$, respectively. Hence, if $\varphi_\varepsilon(t)$ is nonzero on the whole real line, it is possible to evaluate $\varphi_X$ from $\varphi_X(t) = \varphi_W(t)/\varphi_\varepsilon(t)$. A commonly used idea is to find an estimator of $\varphi_W(t)$, called $\hat\varphi_W(t)$. Afterwards, this estimator is divided by $\varphi_\varepsilon(t)$ and Fourier inversion is applied to define an estimator for $f_X$. In general, the quotient $\hat\varphi_W(t)/\varphi_\varepsilon(t)$ is not integrable, so some regularization technique for the inverse Fourier transform is needed. The amount of necessary regularization depends heavily upon the behaviour of $\varphi_\varepsilon(t)$ as $|t|$ approaches infinity, the so-called tail behaviour. In order to distinguish different tail behaviours, two different types of density classes are usually studied. First, for so-called supersmooth random variables $\varepsilon$ the characteristic function $\varphi_\varepsilon$ is supposed to have exponential decay, see (1.2.6) below for an exact definition. This exponential decay implies that $f_\varepsilon$ is infinitely often differentiable, see Theorem A.5.3 below. The second density class studied consists of so-called ordinary smooth random variables $\varepsilon$, where the characteristic function $\varphi_\varepsilon$ is supposed to have polynomial decay, see (1.2.7) below for an exact definition. Here the decay implies the existence of finitely many derivatives of $f_\varepsilon$, see again Theorem A.5.3.
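A numerical sketch of this Fourier-inversion recipe in the ordinary smooth case; the Laplace error, its known characteristic function, the truncation of the Fourier domain at $|t| \le 1/h$ (a sinc-kernel regularization), and all numerical settings are assumptions made for this illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.standard_normal(n)                  # unobserved X ~ N(0, 1), density to recover
eps = rng.laplace(scale=0.4, size=n)        # ordinary smooth error with known density
w = x + eps                                 # only W = X + eps is observed

def phi_eps(t, scale=0.4):
    """Characteristic function of the Laplace(0, scale) error, assumed known."""
    return 1.0 / (1.0 + (scale * t) ** 2)

def deconvolve(ws, xgrid, h):
    """Deconvolution estimator: Fourier-invert phi_W_hat / phi_eps,
    truncating the Fourier domain at |t| <= 1/h as regularization."""
    t = np.linspace(-1.0 / h, 1.0 / h, 801)
    phi_w_hat = np.exp(1j * t[:, None] * ws[None, :]).mean(axis=1)   # empirical char. fn
    integrand = np.exp(-1j * np.outer(xgrid, t)) * (phi_w_hat / phi_eps(t))
    return (integrand.sum(axis=1) * (t[1] - t[0])).real / (2.0 * np.pi)

grid = np.linspace(-3.0, 3.0, 13)
f_hat = deconvolve(w, grid, h=0.4)          # should approximate the N(0, 1) density
```

Since the Laplace error is only ordinary smooth, the division by $\varphi_\varepsilon$ amplifies the noise in the empirical characteristic function merely polynomially, and the truncated inversion recovers the target density reasonably well even at this moderate sample size.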

For the best attainable convergence rates it is very important whether the error density is

ordinary smooth or supersmooth. In case of an ordinary smooth density $f_X$ the optimal rates are algebraic if the known error density is also ordinary smooth, see Theorem 1.2.6 below for a precise statement, whereas the best attainable rates in case of a known supersmooth error density $f_\varepsilon$ are only logarithmic, see also Theorem 1.2.6. These logarithmic rates are rather unpleasant since many popular densities are supersmooth, like the normal density for instance. The first proofs of lower bounds for the rates were given in Fan [1991a]. More precisely, it is shown in this reference that the convergence rates mentioned before are optimal for the estimation of a density and its derivatives at a point $\xi$. The same result for $L^p$-norms over bounded intervals is shown in Fan [1993]. Yet, in case the density $f_X$ is supersmooth, faster rates are attainable. There in