Likelihood theory: Maximum likelihood estimation (An overview)
N. Torelli, G. Di Credico, R. Macrì Demartino
Fall 2025
University of Trieste
The likelihood function¹
Maximum likelihood estimation: theory
Some numerical aspects

¹ Agresti, Kateri: sec. 4.2
Introduced by Sir Ronald Fisher, the likelihood function for a certain statistical model fθ(y) for the data y is given by the following function of the parameter θ:

L : Θ → ℝ⁺, θ ↦ c(y) fθ(y),

where c(y) > 0 is an arbitrary constant of proportionality.
We may write L(θ; y) to stress the fact that the data enter the function, though its argument is given by θ.
The likelihood function assigns support (credibility) to possible values of θ, meaning that if L(θ1) > L(θ2) then θ1 is more supported by the observed data than θ2.

So the likelihood ratio L(θ1)/L(θ2) allows for the comparison between θ1 and θ2; note that the constant c(y) cancels out.
A mathematical justification for the above interpretation is given by the Wald inequality: if θt is the true parameter value, then

Eθt{log L(θt; Y)} > Eθt{log L(θ; Y)} for every θ ≠ θt.

The above fact can be proven by a straightforward application of Jensen's inequality.
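In the notation above, the Jensen argument can be written in one line: the logarithm is strictly concave, so (for fθ ≠ fθt)

```latex
\mathbb{E}_{\theta_t}\!\left\{\log\frac{L(\theta;Y)}{L(\theta_t;Y)}\right\}
< \log \mathbb{E}_{\theta_t}\!\left\{\frac{f_{\theta}(Y)}{f_{\theta_t}(Y)}\right\}
= \log \int f_{\theta}(y)\, f_{\theta_t}(y)^{-1} f_{\theta_t}(y)\,dy
= \log \int f_{\theta}(y)\,dy = \log 1 = 0 .
```

The constant c(y) cancels in the ratio, so it plays no role in the inequality.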
In the previous slide the log likelihood function has been introduced, which is simply the logarithm of L(θ), namely ℓ(θ) = log L(θ).

The log likelihood function carries the same information as the likelihood function, but it is much more manageable. Indeed, for a random sample it is a sum of independent contributions, ℓ(θ) = Σᵢ log fθ(yᵢ).

Notice that ℓ(θ) is defined up to an additive constant, depending only on the data y.
For a random sample y1, ..., yn, with Yi ∼ P(λ) i.i.d., we readily get

L(λ) = λ^(Σᵢ yᵢ) exp{−nλ} / Πᵢ yᵢ!,

so that

ℓ(λ) = log(λ) Σᵢ yᵢ − nλ,

neglecting the term which does not depend on λ.
Assume that for a sample of size n = 10 we observe Σᵢ yᵢ = 90.
lik_pois <- function(lam, n, sumy) log(lam) * sumy - n * lam
xx <- seq(6.5, 12, l = 30)
ll <- sapply(xx, lik_pois, sumy = 90, n = 10)
par(pty = "s")
plot(xx, ll - max(ll), type = "l", xlab = expression(lambda),
     ylab = expression(l(lambda) - max(l(lambda))), cex.lab = 2)
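As a sanity check of the plot, the maximiser of lik_pois over the same range can be located numerically with optimize(); it lands at the sample mean 90/10 = 9 (a minimal sketch reusing lik_pois from the snippet above):

```r
# Log likelihood from the slide, up to a constant not depending on lambda
lik_pois <- function(lam, n, sumy) log(lam) * sumy - n * lam

# Numerical maximisation over the plotted range, for n = 10, sum(y) = 90
opt <- optimize(lik_pois, interval = c(6.5, 12), maximum = TRUE,
                n = 10, sumy = 90)
opt$maximum  # close to the sample mean 90 / 10 = 9
```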
For a random sample y1, ..., yn, with Yi ∼ N(µ, σ²) i.i.d.,

L(µ, σ²) = Πᵢ (2πσ²)^(−1/2) exp{−(yᵢ − µ)²/(2σ²)},

and then with some simple algebra

ℓ(µ, σ²) = −(n/2) log σ² − Σᵢ (yᵢ − µ)²/(2σ²),

neglecting the terms which do not depend on (µ, σ²).
The definition of sufficient statistic, given in the probability part, can be re-interpreted for the log likelihood function: t(y) is sufficient for θ if L(θ) can be written as

L(θ; y) = c(y) g(t(y); θ),

so that the likelihood depends on the data only through t(y).
The minimal sufficient statistic allows for the maximal reduction of dimensionality, in the sense that a minimal sufficient statistic is a function of every other sufficient statistic.
For the Poisson model, Σᵢ yᵢ (or, equivalently, the sample mean ȳ) is sufficient for λ, whereas for the normal model the sufficient statistic is given by the pair (Σᵢ yᵢ, Σᵢ yᵢ²) (or, equivalently, by the pair (ȳ, s²)).
These two statistical models are an instance of an exponential family, an important model class that also includes other important distributions, such as the binomial. Exponential families play an important role in the theory of generalized linear models.
Given the interpretation of the (log) likelihood, the maximum of ℓ(θ) is the value of the parameter which is most supported by the data.

A natural step is to take it as the point estimate, the maximum likelihood estimate (MLE) of θ:

θ̂ = argmax over θ ∈ Θ of ℓ(θ).

Notice that since ℓ(θ) is also a function of y, the MLE is a statistic.
For the Poisson model, simple calculus gives λ̂ = ȳ.

For the normal model, we need to maximize a function of two variables, and we get µ̂ = ȳ and σ̂² = Σᵢ (yᵢ − ȳ)²/n.
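The normal MLEs can be verified numerically; the sketch below (the simulated data and the helper negll are illustrative, not from the slides) minimises the negative log likelihood with optim() and compares the result with ȳ and Σᵢ (yᵢ − ȳ)²/n:

```r
set.seed(1)
y <- rnorm(50, mean = 2, sd = 3)

# Negative log likelihood in (mu, log sigma^2); the log scale keeps sigma^2 > 0
negll <- function(par, y) {
  mu <- par[1]; s2 <- exp(par[2])
  0.5 * length(y) * log(s2) + sum((y - mu)^2) / (2 * s2)
}
fit <- optim(c(0, 0), negll, y = y, control = list(reltol = 1e-12))

c(fit$par[1], exp(fit$par[2]))      # numerical MLEs
c(mean(y), mean((y - mean(y))^2))   # closed-form MLEs: they agree closely
```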
Maximum likelihood estimation has a central role in modern statistics (and machine learning). There are several reasons for this:

1. The MLE algorithm is automatic: given a parametric statistical model for the data, the MLE follows from the chosen model.
2. The MLE of a function of a parameter ψ = g(θ) is defined by the simple plug-in rule ψ̂ = g(θ̂), which is very convenient for applications.
3. The MLE has excellent properties, which we illustrate in what follows.
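For instance (a minimal illustration, using the Poisson sample with n = 10 and Σᵢ yᵢ = 90 from before), the MLE of ψ = P(Y = 0) = exp(−λ) is obtained by simply plugging in λ̂:

```r
lambda_hat <- 90 / 10        # Poisson MLE: the sample mean
psi_hat <- exp(-lambda_hat)  # plug-in MLE of psi = P(Y = 0) = exp(-lambda)
psi_hat
```

No separate maximisation over ψ is needed: the plug-in rule is exact.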
The first two derivatives of ℓ(θ) play an important role.

The vector of first derivatives is called the score function:

U(θ) = U(θ; y) = ∂ℓ(θ)/∂θ.

The matrix of second derivatives, with negative sign, is called the observed information matrix:

J(θ) = −∂²ℓ(θ)/(∂θ ∂θᵀ).
The derivatives of the log likelihood function satisfy some important properties, provided that some regularity conditions hold (we shall return to them later on).

The proofs are simple, and they are reported in the CS book.
1. Zero expected score: Eθ{U(θ; Y)} = 0.
2. 2nd Bartlett identity: varθ{U(θ; Y)} = Eθ{J(θ; Y)}.

The expected value I(θ) of the observed information matrix is called the Fisher information matrix (or just the expected information matrix).
The third property is important, and we first state it for a one-parameter model (scalar θ).

3. The Cramér-Rao lower bound: the variance of any unbiased estimator θ̃ cannot be smaller than the reciprocal of the expected information:

varθ{θ̃(Y)} ≥ 1/I(θ).

Actually, by differentiation of the unbiasedness condition with respect to θ it follows that covθ{θ̃, U(θ; Y)} = 1, which readily implies the Cramér-Rao lower bound.

The extension to multiparameter models is given by the condition that the matrix cov(θ̃) − I(θ)⁻¹ is positive semi-definite.
We are ready to state the first crucial property of the MLE: maximum likelihood estimators are usually consistent, that is, if the sample size tends to infinity, θ̂ tends to θt, the true parameter value.

A justification for the result is given by the fact that in regular situations ℓ(θ)/n → Eθt{ℓ(θ; Y)}/n as n → ∞, so that eventually the maxima of ℓ(θ) and Eθt{ℓ(θ; Y)} must coincide at θt by the Wald inequality.

The formal proof (typically) employs the law of large numbers. Consistency can fail if the number of parameters increases with the sample size.
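A quick simulated illustration (not from the slides): for the Poisson model λ̂ = ȳ, and the estimate approaches the true value λt = 3 as n grows:

```r
set.seed(123)
lambda_t <- 3
# The Poisson MLE (the sample mean) concentrates around lambda_t as n grows
for (n in c(10, 1000, 100000)) {
  cat("n =", n, "  lambda_hat =", mean(rpois(n, lambda_t)), "\n")
}
```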
We establish it by a Taylor expansion for the score function:

U(θ̂) ≈ U(θt) − J(θt)(θ̂ − θt),

with equality when n → ∞ since θ̂ − θt → 0.

From the definition of θ̂, we get U(θ̂) = 0. Under mild assumptions J(θt) can be replaced by I(θt), whereas U(θt) is a random vector with mean vector 0 and covariance matrix I(θt).

In the large sample limit

θ̂ ·∼ N(θt, I(θt)⁻¹).

In the case when the sample is formed by independent observations, it follows that the log likelihood is the sum of independent contributions: under mild conditions the central limit theorem applies, and in the large sample limit the same normal approximation holds with J(θt)⁻¹ as covariance matrix.

Notice that whenever this holds, it would be possible (and recommendable, in some sense) to use J(θt) in place of I(θt).
Again, since θt is unknown, we replace it by θ̂, obtaining the following estimated standard error for the k-th component of θ̂:

SE(θ̂k) = {[J(θ̂)⁻¹]kk}^(1/2),

the square root of the k-th diagonal element of J(θ̂)⁻¹.

Note: for regular models (see next slide), the observed information is positive definite at θ̂, so that the SE above is well defined.
We end the summary of the theory by mentioning the regularity conditions, which are some assumptions on the statistical model required for the previous results to be valid.

The CS book lists the following ones:

1. The pdfs of y defined by different values of θ are distinct, namely the model is identifiable.
2. The true parameter value θt is interior to Θ.
3. Within some neighbourhood of θt, the first three derivatives of ℓ exist and are bounded, while the expected information satisfies the 2nd Bartlett identity, and is positive definite and finite.

These are mild conditions, which are generally valid in most cases.
The previous results have illustrated that:

1. The MLE is a consistent estimator.
2. The MLE is asymptotically efficient, since its asymptotic variance attains the Cramér-Rao lower bound.
3. The large sample distribution (aka the approximate distribution) of the MLE is multivariate normal, with standard error that can be estimated by the observed information evaluated at the parameter estimate.
Here λ̂ = ȳ, and consistency follows from the law of large numbers, in agreement with likelihood theory. Furthermore, the CLT states that for large n

λ̂ ·∼ N(λ, λ/n).

This result can be obtained also from likelihood theory. Indeed, we get

J(λ) = Σᵢ yᵢ/λ²,

so that I(λ) = n/λ and I(λ)⁻¹ = λ/n.

The implication is that λ̂ ·∼ N(λ, λ/n), in agreement with the CLT. Moreover, since λ̂ = ȳ, here J(λ̂) = n/ȳ = I(λ̂): the two estimated standard errors, based on the observed and the expected information, coincide and are equal to √(λ̂/n).
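For the earlier numerical example (n = 10, Σᵢ yᵢ = 90) this gives:

```r
n <- 10; sumy <- 90
lambda_hat <- sumy / n           # MLE: the sample mean
I_hat <- n / lambda_hat          # expected information at the MLE
J_hat <- sumy / lambda_hat^2     # observed information at the MLE
c(I_hat, J_hat)                  # they coincide for the Poisson model
sqrt(1 / I_hat)                  # estimated SE, equal to sqrt(lambda_hat / n)
```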
The algorithmic nature of the MLE estimation method translates the statistical model into an optimisation problem: once a (sensible) statistical model has been specified for the data, we obtain parameter estimates with excellent properties by maximizing the log likelihood.

In some simple settings, like in the examples above, it is possible to find the analytical expression for the MLE, but in general we must resort to numerical optimisation of the log likelihood.

There are indeed several methods available for the task. Some knowledge of the most important issues related to it turns out particularly useful even for the application of off-the-shelf routines in R (or other environments).
Newton's method for optimisation is commonly used for minimization, in this case of the objective function f(θ) = −ℓ(θ).

The theory is well described in the CS book; here we mention the most important aspects. The idea is to locally approximate f(θ) as a quadratic function, which is repeatedly minimised. The resulting method consists in an iterative algorithm, which is started with k = 0 and a guesstimate θ[0], and iterates the following steps:

1. Evaluate ℓ(θ[k]), U(θ[k]) and J(θ[k]).
2. If U(θ[k]) ≈ 0 and J(θ[k]) is positive definite, then stop.
3. If H = J(θ[k]) is not positive definite, perturb it so that it is.
4. Solve H δ = U(θ[k]) for the search direction δ.
5. If ℓ(θ[k] + δ) is not > ℓ(θ[k]), repeatedly halve δ until it is (this is the step-length control).
6. Set θ[k+1] = θ[k] + δ, increment k by one and return to step 1.
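The steps above can be sketched for the scalar Poisson example (n = 10, Σᵢ yᵢ = 90); a minimal sketch, with hypothetical helpers ll_p, u_p, j_p (analytic log likelihood, score and observed information) and the step-halving control of step 5:

```r
ll_p <- function(lam) 90 * log(lam) - 10 * lam  # log likelihood, up to a constant
u_p  <- function(lam) 90 / lam - 10             # score
j_p  <- function(lam) 90 / lam^2                # observed information (here > 0)

lam <- 2                                        # guesstimate lambda[0]
for (k in 1:50) {
  u <- u_p(lam)
  if (abs(u) < 1e-10) break                     # score approximately 0: stop (step 2)
  delta <- u / j_p(lam)                         # solve J delta = U (step 4)
  while (ll_p(lam + delta) < ll_p(lam))         # step-length control (step 5)
    delta <- delta / 2
  lam <- lam + delta                            # update (step 6)
}
lam                                             # converges to the MLE, 9
```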
Whenever available, it is always a good idea to replace the observed information with the expected information I(θ[k]) in Newton's method. The resulting algorithm has a long successful tradition in statistics; it is called Fisher scoring and, indeed, it has better convergence properties.

Another variant avoids the computation of either J(θ[k]) or I(θ[k]), by building an approximation to the second derivative of ℓ(θ) as the optimization proceeds. This is the approach of the quasi-Newton methods, such as the widely used BFGS algorithm. Quasi-Newton methods are implemented in several R functions and packages; see the CRAN Task View for Optimization (https://cran.r-project.org/web/views/Optimization.html).
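As a quick illustration of an off-the-shelf quasi-Newton routine (a sketch, not from the slides), optim() with method = "BFGS" applied to the same Poisson example: optim() minimises, so we pass the negative log likelihood, and working on the log(λ) scale keeps λ positive during the search:

```r
nll <- function(theta) -(90 * theta - 10 * exp(theta))  # theta = log(lambda)
fit <- optim(par = log(5), fn = nll, method = "BFGS")
exp(fit$par)  # close to the MLE, 9
```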
We follow the MASS book for a simple example on a dose-response model.

Namely, we assume that yi is the number of dead budworms (out of 20) for a dose of insecticide x*i. In particular, the statistical model is

Yi ∼ Bi(20, πi), i = 1, ..., 12, independent,

with

log{πi/(1 − πi)} = α + β xi,

with xi = log(x*i).

This is a simple instance of a logistic regression model.

There are two observations at each dose (M/F budworms), but here for the sake of simplicity we ignore the different sexes.
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
plot(ldose, numdead / 20, pch = 16)
With some simple algebra we get:
loglik <- function(theta, data) {
  eta <- theta[1] + theta[2] * data$x
  out <- sum(data$y * eta - 20 * log(1 + exp(eta)))
  return(out)
}
score <- function(theta, data) {
  prob <- plogis(theta[1] + theta[2] * data$x)
  out <- c(sum(data$y - prob * 20), sum((data$y - prob * 20) * data$x))
  return(out)
}
info <- function(theta, data) {
  prob <- plogis(theta[1] + theta[2] * data$x)
  info11 <- sum(20 * prob * (1 - prob))
  info12 <- sum(20 * prob * (1 - prob) * data$x)
  info22 <- sum(20 * prob * (1 - prob) * data$x^2)
  out <- matrix(c(info11, info12, info12, info22), 2, 2)
  return(out)
}
Let's start from α = β = 0: we obtain
theta0 <- c(0, 0); budw <- data.frame(y = numdead, x = ldose)
loglik(theta0, budw)
## [1] -166.3553
score(theta0, budw)
## [1]  -9 105
info(theta0, budw)
##      [,1] [,2]
## [1,]   60  150
## [2,]  150  550
H <- info(theta0, budw)
u0 <- score(theta0, budw)
delta <- solve(H, u0)
theta1 <- theta0 + delta
theta1
## [1] -1.9714286  0.7285714
loglik(theta1, budw)
## [1] -114.7219

which is clearly an improvement.
## k =  1  theta = -1.971429 0.7285714  loglik = -114.7219
## k =  2  theta = -2.621436 0.9572079  loglik = -111.8192
## k =  3  theta = -2.760585 1.004947   loglik = -111.734
## k =  4  theta = -2.766079 1.006804   loglik = -111.7339
## k =  5  theta = -2.766087 1.006807   loglik = -111.7339
## k =  6  theta = -2.766087 1.006807   loglik = -111.7339
## k =  7  theta = -2.766087 1.006807   loglik = -111.7339
## k =  8  theta = -2.766087 1.006807   loglik = -111.7339
## k =  9  theta = -2.766087 1.006807   loglik = -111.7339
## k = 10  theta = -2.766087 1.006807   loglik = -111.7339
The algorithm converges quickly, and actually after 10 iterations

cat(score(theta10, budw), det(info(theta10, budw)),
    sqrt(diag(solve(info(theta10, budw)))))
## 3.552714e-15 3.552714e-15 2361.462 0.3701342 0.1235889
budworm.lg0 <- glm(cbind(y, 20 - y) ~ x, binomial, budw)
summary(budworm.lg0, cor = FALSE)
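As a cross-check (a sketch; the data frame budw is rebuilt so the snippet is self-contained), glm() should reproduce the Newton solution α̂ = −2.766087, β̂ = 1.006807 found above, since its IRLS fitting is a Fisher scoring algorithm:

```r
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
budw <- data.frame(y = numdead, x = ldose)

# glm() fits the same logistic regression by iteratively reweighted least squares
fit <- glm(cbind(y, 20 - y) ~ x, family = binomial, data = budw)
coef(fit)  # approximately (-2.766087, 1.006807)
```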