//img.uscri.be/pth/18d7de3f656788d9700650778632d063ab2881ce
La lecture en ligne est gratuite
Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres
Télécharger Lire

20050421-smoothing-tutorial

33 pages
NLP Lunch Tutorial: SmoothingBill MacCartney21 April 2005Preface Everything is from this great paper by Stanley F. Chen and JoshuaGoodman (1998), “An Empirical Study of Smoothing Techniquesfor Language Modeling”, which I read yesterday. Everything is presented in the context of n-gram language models,but smoothing is needed in many problem contexts, and most ofthe smo methods we’ll look at generalize without difficulty.1The Plan Motivation– the problem– an example All the smoothing methods– formula after formula– intuitions for each So which one is the best?– (answer: modified Kneser-Ney) Excel “demo” for absolute discounting and Good-Turing?2Probabilistic modeling You have some kind of probabilistic model, which is a distributionp(e) over an event space E. You want to estimate the parameters of your model distribution pfrom data. In principle, you might to like to use maximum likelihood (ML)estimates, so that your model isc(x)p (x)=PMLc(e)eBut...3Problem: data sparsity But, you have insufficient data: there are many events x such thatc(x)=0, so that the ML estimate is p (x)=0.ML In problem settings where the event space E is unbounded (e.g.most NLP problems), this is generally undesirable. Ex: a language model which gives probability 0 to unseen words. Just because an event has never been observed in training data doesnot mean it cannot occur in test data. So if c(x)=0, what should p(x) be? If data sparsity isn’t a ...
Voir plus Voir moins
NLP
Lunch
Tutorial:
Bill
21
Smo
MacCartney
April
2005
othing
Preface
Everything is from this great paper by Stanley F. Chen and Joshua Goodman (1998), “An Empirical Study of Smoothing Techniques for Language Modeling”, which I read yesterday.
Everything is presented in the context ofn-gram language models, but smoothing is needed in many problem contexts, and most of the smoothing methods we’ll look at generalize without difficulty.
1
The Plan
Motivation
the problem an example
All the smoothing methods
formula after formula intuitions for each
So which one is the best?
(answer: modified Kneser-Ney)
Excel “demo” for absolute discounting and Good-Turing?
2
Probabilistic modeling
You have some kind of probabilistic model, which is a distribution p(e) over an event spaceE.
You want to estimate the parameters of your model distributionp from data.
 youIn principle, might to like to use maximum likelihood (ML) estimates, so that your model is
But...
p =( )c(x) M LxPec(e)
3
Problem: data sparsity
 are many events thereBut, you have insufficient data:xsuch that c(x so that the ML estimate is) = 0,pM L(x) = 0.
In problem settings where the event spaceEis unbounded (e.g. most NLP problems), this is generally undesirable.
language model which gives probability 0 to unseen words. Ex: a
Just because an event has never been observed in training data does not mean it cannot occur in test data.
So ifc(x what should) = 0,p(x) be?
If data sparsity isn’t a problem for you, your model is too simple!
4
“Whenever data sparsity is an issue, smoothing can help performance, and data sparsity is almost always an issue in statistical modeling. In the extreme case where there is so much training data that all parameters can be accurately trained without smoothing, one can almost always expand the model, such as by moving to a highern-gram model, to achieve improved performance. With more parameters data sparsity becomes an issue again, but with proper smoothing the models are usually more accurate than the original models. Thus, no matter how much data one has, smoothing can almost always help performace, and for a relatively small effort.”
Chen & Goodman (1998)
5
Example:
bigram model
JOHN READ MOBY DICK MARY READ A DIFFERENT BOOK SHE READ A BOOK BY CHER
p(wi|wi1)
p(s)
=
=
c(wi1wi) Pwic(wi1wi) l+1 Yp(wi|wi1) i=1
6
JOHN READ MOBY DICK MARY READ A DIFFERENT BOOK SHE READ A BOOK BY CHER
p(JOHN READ A BOOK)
=p(JONH|)p(READ|OJNH)p(A|READ)p(BOOK|A)
c(HNJO)c(JOHN READ)c(READ A)c(A BOOK) =Pwc(w)PwcNOH(Jw)PwcDAER(w)Pwc(Aw) 1 =3131122
0.06
p(|KOOB)
cOBKO() PwcOK(OBw) 1 2
7
JOHN READ MOBY DICK MARY READ A DIFFERENT BOOK SHE READ A BOOK BY CHER
p(CHER READ A BOOK)
=p(CHER|)p(READ|CHER)p(A|READ)p(OKBO|A)
=c(CH(EwR))c(PCwHcERREA(CHERw))DPcw(cDREADA)R(AEw)c(A BOOK) PwcPwc(Aw) =0021231 3
= 0
p(|KBOO)
cOKBO() PwcKOOB(w) 1 2
8
Add-one smoothing
wi) 1 +c(wi1wi) = p(wi|wi1) =Pw1i1[++c(cw(iwi11wi)]|V|+Pwic(wi1wi)
Originally due to Laplace.
Typically, we assumeV={w:c(w)>0} ∪ {UNK}
smoothing is generally a horrible choice.Add-one
9
JOHN READ MOBY DICK MARY READ A DIFFERENT BOOK SHE READ A BOOK BY CHER
p(JOHN READ A BOOK)
=1+1111+1+132+11+11++112 1+3 11
0.0001
p(CHER READ A BOOK)
=1+0 1+0 1+2 1+1 11+3 11+1 11+3 11+2
0.00003
1+1 11+2
1+1 11+2
10