Likelihood theory: Maximum likelihood estimation (An overview)
N. Torelli, G. Di Credico, R. Macrì Demartino
Fall 2025
University of Trieste
The likelihood function¹
Maximum likelihood estimation: theory
Some numerical aspects

¹ Agresti, Kateri: sec. 4.2
Introduced by Sir Ronald Fisher, the likelihood function for a certain statistical model fθ(y) for the data y is given by the following function of the parameter θ:

L : Θ → ℝ⁺, θ ↦ c(y) fθ(y),

where c(y) > 0 is an arbitrary constant of proportionality.
We may write L(θ; y) to stress the fact that the data enter the function, though its argument is given by θ.
The likelihood function assigns support (credibility) to possible values of θ, meaning that if L(θ1) > L(θ2) then θ1 is more supported by the observed data than θ2.

So the likelihood ratio L(θ1)/L(θ2) allows for the comparison between θ1 and θ2; note that the constant c(y) cancels out.
A mathematical justification for the above interpretation is given by the Wald inequality: if θt is the true parameter value, then

Eθt{log L(θt; Y)} > Eθt{log L(θ; Y)} for every θ ≠ θt.

The above fact can be proven by a straightforward application of Jensen's inequality.
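In the notation above, the Jensen argument can be written in one line: the logarithm is strictly concave, so (for fθ ≠ fθt)

```latex
\mathbb{E}_{\theta_t}\!\left\{\log\frac{L(\theta;Y)}{L(\theta_t;Y)}\right\}
< \log \mathbb{E}_{\theta_t}\!\left\{\frac{f_{\theta}(Y)}{f_{\theta_t}(Y)}\right\}
= \log \int f_{\theta}(y)\, f_{\theta_t}(y)^{-1} f_{\theta_t}(y)\,dy
= \log \int f_{\theta}(y)\,dy = \log 1 = 0 .
```

The constant c(y) cancels in the ratio, so it plays no role in the inequality.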
In the previous slide the log likelihood function has been introduced, which is simply the logarithm of L(θ), namely ℓ(θ) = log L(θ).

The log likelihood function carries the same information as the likelihood function, but it is much more manageable. Indeed, for a random sample it is a sum of independent contributions, ℓ(θ) = Σᵢ log fθ(yᵢ).

Notice that ℓ(θ) is defined up to an additive constant, depending only on the data y.
For a random sample y1, ..., yn, with Yi ∼ P(λ) i.i.d., we readily get

L(λ) = λ^(Σᵢ yᵢ) exp{−nλ} / Πᵢ yᵢ!,

so that

ℓ(λ) = log(λ) Σᵢ yᵢ − nλ,

neglecting the term which does not depend on λ.
Assume that for a sample of size n = 10 we observe Σᵢ yᵢ = 90.
lik_pois <- function(lam, n, sumy) log(lam) * sumy - n * lam
xx <- seq(6.5, 12, l = 30)
ll <- sapply(xx, lik_pois, sumy = 90, n = 10)
par(pty = "s")
plot(xx, ll - max(ll), type = "l", xlab = expression(lambda),
     ylab = expression(l(lambda) - max(l(lambda))), cex.lab = 2)
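As a sanity check of the plot, the maximiser of lik_pois over the same range can be located numerically with optimize(); it lands at the sample mean 90/10 = 9 (a minimal sketch reusing lik_pois from the snippet above):

```r
# Log likelihood from the slide, up to a constant not depending on lambda
lik_pois <- function(lam, n, sumy) log(lam) * sumy - n * lam

# Numerical maximisation over the plotted range, for n = 10, sum(y) = 90
opt <- optimize(lik_pois, interval = c(6.5, 12), maximum = TRUE,
                n = 10, sumy = 90)
opt$maximum  # close to the sample mean 90 / 10 = 9
```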
For a random sample y1, ..., yn, with Yi ∼ N(µ, σ²) i.i.d.,

L(µ, σ²) = Πᵢ (2πσ²)^(−1/2) exp{−(yᵢ − µ)²/(2σ²)},

and then with some simple algebra

ℓ(µ, σ²) = −(n/2) log σ² − Σᵢ (yᵢ − µ)²/(2σ²),

neglecting the terms which do not depend on (µ, σ²).
The definition of sufficient statistic, given in the probability part, can be re-interpreted for the log likelihood function: t(y) is sufficient for θ if L(θ) can be written as

L(θ; y) = c(y) g(t(y); θ),

so that the likelihood depends on the data only through t(y).
The minimal sufficient statistic allows for the maximal reduction of dimensionality, in the sense that a minimal sufficient statistic is a function of every other sufficient statistic.
For the Poisson model, Σᵢ yᵢ (or, equivalently, the sample mean ȳ) is sufficient for λ, whereas for the normal model the sufficient statistic is given by the pair (Σᵢ yᵢ, Σᵢ yᵢ²) (or, equivalently, by the pair (ȳ, s²)).
These two statistical models are an instance of an exponential family, an important model class that also includes other important distributions, such as the binomial. Exponential families play an important role in the theory of generalized linear models.
Given the interpretation of the (log) likelihood, the maximum of ℓ(θ) is the value of the parameter which is most supported by the data.

A natural step is to take it as the point estimate, the maximum likelihood estimate (MLE) of θ:

θ̂ = argmax over θ ∈ Θ of ℓ(θ).

Notice that since ℓ(θ) is also a function of y, the MLE is a statistic.
For the Poisson model, simple calculus gives λ̂ = ȳ.

For the normal model, we need to maximize a function of two variables, and we get µ̂ = ȳ and σ̂² = Σᵢ (yᵢ − ȳ)²/n.
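The normal MLEs can be verified numerically; the sketch below (the simulated data and the helper negll are illustrative, not from the slides) minimises the negative log likelihood with optim() and compares the result with ȳ and Σᵢ (yᵢ − ȳ)²/n:

```r
set.seed(1)
y <- rnorm(50, mean = 2, sd = 3)

# Negative log likelihood in (mu, log sigma^2); the log scale keeps sigma^2 > 0
negll <- function(par, y) {
  mu <- par[1]; s2 <- exp(par[2])
  0.5 * length(y) * log(s2) + sum((y - mu)^2) / (2 * s2)
}
fit <- optim(c(0, 0), negll, y = y, control = list(reltol = 1e-12))

c(fit$par[1], exp(fit$par[2]))      # numerical MLEs
c(mean(y), mean((y - mean(y))^2))   # closed-form MLEs: they agree closely
```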
Maximum likelihood estimation has a central role in modern statistics (and machine learning). There are several reasons for this:

1. The MLE algorithm is automatic: given a parametric statistical model for the data, the MLE follows from the chosen model.
2. The MLE of a function of a parameter ψ = g(θ) is defined by the simple plug-in rule ψ̂ = g(θ̂), which is very convenient for applications.
3. The MLE has excellent properties, which we illustrate in what follows.
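For instance (a minimal illustration, using the Poisson sample with n = 10 and Σᵢ yᵢ = 90 from before), the MLE of ψ = P(Y = 0) = exp(−λ) is obtained by simply plugging in λ̂:

```r
lambda_hat <- 90 / 10        # Poisson MLE: the sample mean
psi_hat <- exp(-lambda_hat)  # plug-in MLE of psi = P(Y = 0) = exp(-lambda)
psi_hat
```

No separate maximisation over ψ is needed: the plug-in rule is exact.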
The first two derivatives of ℓ(θ) play an important role.

The vector of first derivatives is called the score function:

U(θ) = U(θ; y) = ∂ℓ(θ)/∂θ.

The matrix of second derivatives, with negative sign, is called the observed information matrix:

J(θ) = −∂²ℓ(θ)/(∂θ ∂θᵀ).
The derivatives of the log likelihood function satisfy some important properties, provided that some regularity conditions hold (we shall return to them later on).

The proofs are simple, and they are reported in the CS book.
1. Zero expected score: Eθ{U(θ; Y)} = 0.
2. 2nd Bartlett identity: varθ{U(θ; Y)} = Eθ{J(θ; Y)}.

The expected value I(θ) of the observed information matrix is called the Fisher information matrix (or just the expected information matrix).
The third property is important, and we first state it for a one-parameter model (scalar θ).

3. The Cramér-Rao lower bound: the variance of any unbiased estimator θ̃ cannot be smaller than the reciprocal of the expected information:

varθ{θ̃(Y)} ≥ 1/I(θ).

Actually, by differentiation of the unbiasedness condition with respect to θ it follows that covθ{θ̃, U(θ; Y)} = 1, which readily implies the Cramér-Rao lower bound.

The extension to multiparameter models is given by the condition that the matrix cov(θ̃) − I(θ)⁻¹ is positive semi-definite.
We are ready to state the first crucial property of the MLE: maximum likelihood estimators are usually consistent, that is, if the sample size tends to infinity, θ̂ tends to θt, the true parameter value.

A justification for the result is given by the fact that in regular situations ℓ(θ)/n → Eθt{ℓ(θ; Y)}/n as n → ∞, so that eventually the maxima of ℓ(θ) and Eθt{ℓ(θ; Y)} must coincide at θt by the Wald inequality.

The formal proof (typically) employs the law of large numbers. Consistency can fail if the number of parameters increases with the sample size.
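A quick simulated illustration (not from the slides): for the Poisson model λ̂ = ȳ, and the estimate approaches the true value λt = 3 as n grows:

```r
set.seed(123)
lambda_t <- 3
# The Poisson MLE (the sample mean) concentrates around lambda_t as n grows
for (n in c(10, 1000, 100000)) {
  cat("n =", n, "  lambda_hat =", mean(rpois(n, lambda_t)), "\n")
}
```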
We establish it by a Taylor expansion for the score function:

U(θ̂) ≈ U(θt) − J(θt)(θ̂ − θt),

with equality when n → ∞ since θ̂ − θt → 0.

From the definition of θ̂, we get U(θ̂) = 0. Under mild assumptions J(θt) can be replaced by I(θt), whereas U(θt) is a random vector with mean vector 0 and covariance matrix I(θt).

In the large sample limit

θ̂ ·∼ N(θt, I(θt)⁻¹).

In the case when the sample is formed by independent observations, it follows that the log likelihood is the sum of independent contributions: under mild conditions the central limit theorem applies, and in the large sample limit the same normal approximation holds with J(θt)⁻¹ as covariance matrix.

Notice that whenever this holds, it would be possible (and recommendable, in some sense) to use J(θt) in place of I(θt).
Again, since θt is unknown, we replace it by θ̂, obtaining the following estimated standard error for the k-th component of θ̂:

SE(θ̂k) = {[J(θ̂)⁻¹]kk}^(1/2),

the square root of the k-th diagonal element of J(θ̂)⁻¹.

Note: for regular models (see next slide), the observed information is positive definite at θ̂, so that the SE above is well defined.
We end the summary of the theory by mentioning the regularity conditions, which are some assumptions on the statistical model required for the previous results to be valid.

The CS book lists the following ones:

1. The pdfs of y defined by different values of θ are distinct, namely the model is identifiable.
2. The true parameter value θt is interior to Θ.
3. Within some neighbourhood of θt, the first three derivatives of ℓ exist and are bounded, while the expected information satisfies the 2nd Bartlett identity, and is positive definite and finite.

These are mild conditions, which are generally valid in most cases.
The previous results have illustrated that:

1. The MLE is a consistent estimator.
2. The MLE is asymptotically efficient, since its asymptotic variance attains the Cramér-Rao lower bound.
3. The large sample distribution (aka the approximate distribution) of the MLE is multivariate normal, with standard error that can be estimated by the observed information evaluated at the parameter estimate.
Here λ̂ = ȳ, and consistency follows from the law of large numbers, in agreement with likelihood theory. Furthermore, the CLT states that for large n

λ̂ ·∼ N(λ, λ/n).

This result can be obtained also from likelihood theory. Indeed, we get

J(λ) = Σᵢ yᵢ/λ²,

so that I(λ) = n/λ and I(λ)⁻¹ = λ/n.

The implication is that λ̂ ·∼ N(λ, λ/n), in agreement with the CLT. Moreover, since λ̂ = ȳ, here J(λ̂) = n/ȳ = I(λ̂): the two estimated standard errors, based on the observed and the expected information, coincide and are equal to √(λ̂/n).
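For the earlier numerical example (n = 10, Σᵢ yᵢ = 90) this gives:

```r
n <- 10; sumy <- 90
lambda_hat <- sumy / n           # MLE: the sample mean
I_hat <- n / lambda_hat          # expected information at the MLE
J_hat <- sumy / lambda_hat^2     # observed information at the MLE
c(I_hat, J_hat)                  # they coincide for the Poisson model
sqrt(1 / I_hat)                  # estimated SE, equal to sqrt(lambda_hat / n)
```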
The algorithmic nature of the MLE estimation method translates the statistical model into an optimisation problem: once a (sensible) statistical model has been specified for the data, we obtain parameter estimates with excellent properties by maximizing the log likelihood.

In some simple settings, like in the examples above, it is possible to find the analytical expression for the MLE, but in general we must resort to numerical optimisation of the log likelihood.

There are indeed several methods available for the task. Some knowledge of the most important issues related to it turns out particularly useful even for the application of off-the-shelf routines in R (or other environments).
Newton's method for optimisation is commonly used for minimization, in this case of the objective function f(θ) = −ℓ(θ).

The theory is well described in the CS book; here we mention the most important aspects. The idea is to locally approximate f(θ) as a quadratic function, which is repeatedly minimised. The resulting method consists in an iterative algorithm, which is started with k = 0 and a guesstimate θ[0], and iterates the following steps:

1. Evaluate ℓ(θ[k]), U(θ[k]) and J(θ[k]).
2. If U(θ[k]) ≈ 0 and J(θ[k]) is positive definite, then stop.
3. If H = J(θ[k]) is not positive definite, perturb it so that it is.
4. Solve H δ = U(θ[k]) for the search direction δ.
5. If ℓ(θ[k] + δ) is not > ℓ(θ[k]), repeatedly halve δ until it is (this is the step-length control).
6. Set θ[k+1] = θ[k] + δ, increment k by one and return to step 1.
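The steps above can be sketched for the scalar Poisson example (n = 10, Σᵢ yᵢ = 90); a minimal sketch, with hypothetical helpers ll_p, u_p, j_p (analytic log likelihood, score and observed information) and the step-halving control of step 5:

```r
ll_p <- function(lam) 90 * log(lam) - 10 * lam  # log likelihood, up to a constant
u_p  <- function(lam) 90 / lam - 10             # score
j_p  <- function(lam) 90 / lam^2                # observed information (here > 0)

lam <- 2                                        # guesstimate lambda[0]
for (k in 1:50) {
  u <- u_p(lam)
  if (abs(u) < 1e-10) break                     # score approximately 0: stop (step 2)
  delta <- u / j_p(lam)                         # solve J delta = U (step 4)
  while (ll_p(lam + delta) < ll_p(lam))         # step-length control (step 5)
    delta <- delta / 2
  lam <- lam + delta                            # update (step 6)
}
lam                                             # converges to the MLE, 9
```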
Whenever available, it is always a good idea to replace the observed information with the expected information I(θ[k]) in Newton's method. The resulting algorithm has a long successful tradition in statistics; it is called Fisher scoring and, indeed, it has better convergence properties.

Another variant avoids the computation of either J(θ[k]) or I(θ[k]), by building an approximation to the second derivative of ℓ(θ) as the optimization proceeds. This is the approach of the quasi-Newton methods, such as the widely used BFGS algorithm. Quasi-Newton methods are implemented in several R functions and packages; see the CRAN Task View for Optimization (https://cran.r-project.org/web/views/Optimization.html).
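As a quick illustration of an off-the-shelf quasi-Newton routine (a sketch, not from the slides), optim() with method = "BFGS" applied to the same Poisson example: optim() minimises, so we pass the negative log likelihood, and working on the log(λ) scale keeps λ positive during the search:

```r
nll <- function(theta) -(90 * theta - 10 * exp(theta))  # theta = log(lambda)
fit <- optim(par = log(5), fn = nll, method = "BFGS")
exp(fit$par)  # close to the MLE, 9
```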
We follow the MASS book for a simple example on a dose-response model.

Namely, we assume that yi is the number of dead budworms (out of 20) for a dose of insecticide x*i. In particular, the statistical model is

Yi ∼ Bi(20, πi), i = 1, ..., 12, independent,

with

log{πi/(1 − πi)} = α + β xi,

with xi = log(x*i).

This is a simple instance of a logistic regression model.

There are two observations at each dose (M/F budworms), but here for the sake of simplicity we ignore the different sexes.
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
plot(ldose, numdead / 20, pch = 16)
With some simple algebra we get:
loglik <- function(theta, data) {
  eta <- theta[1] + theta[2] * data$x
  out <- sum(data$y * eta - 20 * log(1 + exp(eta)))
  return(out)
}
score <- function(theta, data) {
  prob <- plogis(theta[1] + theta[2] * data$x)
  out <- c(sum(data$y - prob * 20), sum((data$y - prob * 20) * data$x))
  return(out)
}
info <- function(theta, data) {
  prob <- plogis(theta[1] + theta[2] * data$x)
  info11 <- sum(20 * prob * (1 - prob))
  info12 <- sum(20 * prob * (1 - prob) * data$x)
  info22 <- sum(20 * prob * (1 - prob) * data$x^2)
  out <- matrix(c(info11, info12, info12, info22), 2, 2)
  return(out)
}
Let's start from α = β = 0: we obtain
theta0 <- c(0, 0); budw <- data.frame(y = numdead, x = ldose)
loglik(theta0, budw)
## [1] -166.3553
score(theta0, budw)
## [1]  -9 105
info(theta0, budw)
##      [,1] [,2]
## [1,]   60  150
## [2,]  150  550
H <- info(theta0, budw)
u0 <- score(theta0, budw)
delta <- solve(H, u0)
theta1 <- theta0 + delta
theta1
## [1] -1.9714286  0.7285714
loglik(theta1, budw)
## [1] -114.7219

which is clearly an improvement.
## k =  1  theta = -1.971429 0.7285714  loglik = -114.7219
## k =  2  theta = -2.621436 0.9572079  loglik = -111.8192
## k =  3  theta = -2.760585 1.004947   loglik = -111.734
## k =  4  theta = -2.766079 1.006804   loglik = -111.7339
## k =  5  theta = -2.766087 1.006807   loglik = -111.7339
## k =  6  theta = -2.766087 1.006807   loglik = -111.7339
## k =  7  theta = -2.766087 1.006807   loglik = -111.7339
## k =  8  theta = -2.766087 1.006807   loglik = -111.7339
## k =  9  theta = -2.766087 1.006807   loglik = -111.7339
## k = 10  theta = -2.766087 1.006807   loglik = -111.7339
The algorithm converges quickly, and actually after 10 iterations

cat(score(theta10, budw), det(info(theta10, budw)),
    sqrt(diag(solve(info(theta10, budw)))))
## 3.552714e-15 3.552714e-15 2361.462 0.3701342 0.1235889
budworm.lg0 <- glm(cbind(y, 20 - y) ~ x, binomial, budw)
summary(budworm.lg0, cor = FALSE)
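As a cross-check (a sketch; the data frame budw is rebuilt so the snippet is self-contained), glm() should reproduce the Newton solution α̂ = −2.766087, β̂ = 1.006807 found above, since its IRLS fitting is a Fisher scoring algorithm:

```r
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
budw <- data.frame(y = numdead, x = ldose)

# glm() fits the same logistic regression by iteratively reweighted least squares
fit <- glm(cbind(y, 20 - y) ~ x, family = binomial, data = budw)
coef(fit)  # approximately (-2.766087, 1.006807)
```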