

Likelihood theory: Maximum likelihood estimation (An overview)

Fall 2025

University of Trieste

The likelihood function¹

Maximum likelihood estimation: theory

Some numerical aspects

¹ Agresti, Kateri: sec. 4.2

The likelihood function

Introduced by Sir Ronald Fisher, the likelihood function for a certain statistical model fθ(y) for the data y is given by the following function of the parameter θ:

L : Θ → R⁺, θ ↦ c(y) fθ(y),

where c(y) > 0 is an arbitrary constant of proportionality.

We may write L(θ; y) to stress the fact that the data enter the function, though its argument is θ.

Interpreting the likelihood function

The likelihood function assigns support (credibility) to possible values of θ: if L(θ1) > L(θ2), then θ1 is more supported by the observed data than θ2.

So the likelihood ratio L(θ1)/L(θ2) allows for the comparison between θ1 and θ2; note that the constant c(y) cancels out.

A mathematical justification for the above interpretation is given by the Wald inequality: if θt is the true parameter value, then

Eθt{log L(θt; Y)} > Eθt{log L(θ; Y)} for every θ ≠ θt.

The above fact can be proven by a straightforward application of Jensen's inequality.
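For reference, the Jensen step can be written out as follows (a sketch; the strict concavity of the logarithm is what makes the inequality strict whenever fθ differs from fθt, and the constant c(y) cancels in the ratio):

```latex
\mathrm{E}_{\theta_t}\!\left\{ \log \frac{L(\theta; Y)}{L(\theta_t; Y)} \right\}
  < \log \mathrm{E}_{\theta_t}\!\left\{ \frac{f_{\theta}(Y)}{f_{\theta_t}(Y)} \right\}
  = \log \int f_{\theta}(y)\, dy = \log 1 = 0 .
```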

The loglikelihood function

In the previous slide the loglikelihood function has been introduced, which is simply the logarithm of L(θ), namely ℓ(θ) = log L(θ).

The loglikelihood function carries the same information as the likelihood function, but it is much more manageable. Indeed, for a random sample y1, ..., yn it becomes a sum of individual contributions:

ℓ(θ) = Σ_{i=1}^n log fθ(yi).

Notice that ℓ(θ) is defined up to an additive constant, depending only on the data y.

Example 1: the Poisson model

For a random sample y1, ..., yn, with Yi ∼ P(λ) i.i.d., we readily get

L(λ) = λ^{Σ_{i=1}^n yi} exp{−nλ} / Π_{i=1}^n yi!,

so that

ℓ(λ) = log(λ) Σ_{i=1}^n yi − nλ,

neglecting the term which does not depend on λ.

R lab: the Poisson loglikelihood

Assume that for a sample of size n = 10 we observe Σi yi = 90.

lik_pois <- function(lam, n, sumy) log(lam) * sumy - n * lam

xx <- seq(6.5, 12, l = 30)

ll <- sapply(xx, lik_pois, sumy = 90, n = 10)

par(pty = "s")

plot(xx, ll - max(ll), type = "l", xlab = expression(lambda),
     ylab = expression(l(lambda) - max(l(lambda))), cex.lab = 2)

Example 2: the normal model

For a random sample y1, ..., yn, with Yi ∼ N(µ, σ²) i.i.d.,

L(µ, σ²) = Π_{i=1}^n (2πσ²)^{−1/2} exp{−(yi − µ)² / (2σ²)},

and then with some simple algebra

ℓ(µ, σ²) = −(n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (yi − µ)²,

neglecting the additive constant.

Sufficient statistics

The definition of sufficient statistic, given in the probability part, can be re-interpreted for the loglikelihood function: t(y) is sufficient for θ if L(θ) can be written as

L(θ) = c(y) g(t(y); θ),

so that the likelihood depends on the data only through t(y).

The minimal sufficient statistic allows for the maximal reduction of dimensionality, in the sense that a minimal sufficient statistic is a function of every other sufficient statistic.

For the Poisson model, Σi yi (or, equivalently, the sample mean ȳ) is sufficient for λ, whereas for the normal model the sufficient statistic is given by the pair (Σi yi, Σi yi²) (or, equivalently, by the pair (ȳ, s²)).

These two statistical models are an instance of an exponential family, an important model class that also includes other important members, such as the binomial distribution. Exponential families play an important role in the theory of generalized linear models.
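The sufficiency of Σi yi in the Poisson model can be checked numerically: two samples sharing the same value of the sufficient statistic produce loglikelihoods that differ only by an additive constant (a quick sketch, not from the slides; the sample values are made up for the illustration).

```r
# Two hypothetical Poisson samples with the same sufficient statistic sum(y):
y1 <- c(2, 3, 7)   # sum = 12
y2 <- c(4, 4, 4)   # sum = 12

lam <- seq(0.5, 8, by = 0.5)
ll1 <- sapply(lam, function(l) sum(dpois(y1, l, log = TRUE)))
ll2 <- sapply(lam, function(l) sum(dpois(y2, l, log = TRUE)))

# The two loglikelihoods differ by a constant not depending on lambda,
# so they carry the same information about lambda:
range(ll1 - ll2)
```

The constant difference is exactly the additive constant that ℓ(θ) is defined up to.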

Maximum likelihood estimation

Given the interpretation of the (log)likelihood, the maximum of ℓ(θ) is the value of the parameter which is most supported by the data.

A natural step is to take it as the point estimate, the maximum likelihood estimate (MLE) of θ:

θ̂ = arg max_{θ ∈ Θ} ℓ(θ).

Notice that since ℓ(θ) is also a function of y, the MLE is a statistic.

The MLE in the two examples

For the Poisson model, simple calculus gives λ̂ = ȳ.

For the normal model, we need to maximize a function of two variables, and we get µ̂ = ȳ and σ̂² = (1/n) Σ_{i=1}^n (yi − ȳ)².

Maximum likelihood estimation has a central role in modern statistics (and machine learning). There are several reasons for this:

1. The MLE algorithm is automatic: given a parametric statistical model for the data, the MLE follows from the chosen model.

2. The MLE of a function of a parameter ψ = g(θ) is given by the simple plug-in rule ψ̂ = g(θ̂), which is very convenient for applications.

3. The MLE has excellent properties, which we illustrate in what follows.
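The plug-in rule of point 2 can be checked numerically for the Poisson model, taking ψ = P(Y = 0) = exp(−λ): maximizing the likelihood directly over ψ gives the same answer as plugging λ̂ into g (a sketch, not from the slides; the data vector is invented for the illustration).

```r
# Invented Poisson-like data for the illustration
y <- c(2, 0, 3, 1, 4, 2, 0, 1, 3, 2)

lam_hat <- mean(y)         # MLE of lambda (see the Poisson example)
psi_hat <- exp(-lam_hat)   # plug-in MLE of psi = P(Y = 0)

# Direct maximisation of the loglikelihood reparametrised in psi,
# using lambda = -log(psi):
nll_psi <- function(psi) -sum(dpois(y, -log(psi), log = TRUE))
psi_opt <- optimize(nll_psi, c(1e-6, 1 - 1e-6), tol = 1e-10)$minimum

c(psi_hat, psi_opt)   # the two values agree
```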

Maximum likelihood estimation: theory

Likelihood quantities

The first two derivatives of ℓ(θ) play an important role.

The vector of first derivatives is called the score function:

U(θ) = U(θ; y) = ∂ℓ(θ)/∂θ.

The matrix of second derivatives, with negative sign, is called the observed information matrix:

J(θ) = J(θ; y) = −∂²ℓ(θ)/∂θ ∂θᵀ.

Some properties

The derivatives of the loglikelihood function satisfy some important properties, provided that some regularity conditions hold (we shall return to them later on).

The proofs are simple, and they are reported in the CS book.

1. Zero expected score: Eθ{U(θ; Y)} = 0.

2. 2nd Bartlett identity: Eθ{U(θ; Y) U(θ; Y)ᵀ} = Eθ{J(θ; Y)}.

The expected value I(θ) of the observed information matrix is called the Fisher information matrix (or just the expected information matrix).
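The two identities above can be verified by simulation for the Poisson model, where the score of a single observation is U(λ) = y/λ − 1 and the unit Fisher information is 1/λ (a sketch, not from the slides; sample size and parameter value are arbitrary).

```r
# Simulate many Poisson observations and check that the score has mean 0
# (first identity) and variance equal to the unit Fisher information
# 1/lambda (second identity).
set.seed(1)
lambda <- 3
y <- rpois(1e5, lambda)
u <- y / lambda - 1          # score of a single observation
c(mean(u), var(u), 1 / lambda)
```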

The Cramér-Rao lower bound

The third property is important, and we first state it for a one-parameter model (scalar θ).

3. The Cramér-Rao lower bound: the variance of any unbiased estimator θ̃ cannot be smaller than the reciprocal of the expected information:

varθ{θ̃(Y)} ≥ 1/I(θ).

Actually, by differentiation of the unbiasedness condition with respect to θ it follows that covθ{θ̃, U(θ; Y)} = 1, which readily implies the Cramér-Rao lower bound via the Cauchy-Schwarz inequality.

The extension to multiparameter models is given by the condition that the matrix cov(θ̃) − I(θ)⁻¹ is positive semi-definite.

We are ready to state the first crucial property of the MLE: maximum likelihood estimators are usually consistent, that is, as the sample size tends to infinity, θ̂ tends to θt, the true parameter value.

A justification for the result is given by the fact that in regular situations

ℓ(θ)/n → Eθt{ℓ(θ)}/n as n → ∞,

so that eventually the maxima of ℓ(θ) and Eθt{ℓ(θ)} must coincide at θt by the Wald inequality.

The formal proof (typically) employs the law of large numbers. Consistency can fail if the number of parameters increases with the sample size.

Large-sample distribution of MLE

We establish it by a Taylor expansion of the score function around θt:

U(θ̂) ≐ U(θt) − J(θt)(θ̂ − θt),

with equality when n → ∞ since θ̂ − θt → 0.

From the definition of θ̂, we get U(θ̂) = 0. Under mild assumptions, J(θt) can be replaced by its expected value I(θt), whereas U(θt) is a random vector with mean vector 0 and covariance matrix I(θt).

In the large sample limit, therefore,

θ̂ − θt ≐ I(θt)⁻¹ U(θt).

Large-sample normality of MLE

In the case when the sample is formed by independent observations, it follows that the loglikelihood is the sum of independent contributions: under mild conditions the central limit theorem applies, and in the large sample limit

θ̂ ·∼ N(θt, I(θt)⁻¹).

Notice that whenever this holds, it would be possible (and recommendable, in some sense) to use J(θt) in place of I(θt).

Again, since θt is unknown, we replace it by θ̂, obtaining the following estimated standard error for the k-th component of θ̂:

SE(θ̂k) = √{ [J(θ̂)⁻¹]kk }.

Note: for regular models (see next slide), the observed information is positive definite at θ̂, so that the SE above is well defined.
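The large-sample result can be illustrated by simulation for the Poisson model, where θ̂ = ȳ and I(λ)⁻¹ = λ/n (a sketch, not from the slides; sample size and parameter value are arbitrary).

```r
# Sampling distribution of the Poisson MLE versus the normal approximation:
# simulate many samples of size n and compare the variance of lambda_hat
# with I(lambda)^{-1} = lambda/n.
set.seed(2)
n <- 50
lambda <- 4
lam_hat <- replicate(5000, mean(rpois(n, lambda)))

c(mean = mean(lam_hat), var = var(lam_hat), approx_var = lambda / n)
```

The simulated variance is close to λ/n = 0.08, and a histogram of `lam_hat` would look very close to a normal density.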

Regularity conditions

We end the summary of the theory by mentioning the regularity conditions, which are some assumptions on the statistical model required for the previous results to be valid.

The CS book lists the following ones:

1. The pdfs of y defined by different values of θ are distinct, namely the model is identifiable.

2. The true parameter value θt is interior to Θ.

3. Within some neighbourhood of θt, the first three derivatives of ℓ exist and are bounded, while the expected information satisfies the 2nd Bartlett identity, and is positive definite and finite.

These are mild conditions, which are generally valid in most cases.


The previous results have illustrated that

1. The MLE is a consistent estimator.

2. The MLE is asymptotically efficient, since its asymptotic variance attains the Cramér-Rao lower bound.

3. The large sample distribution (aka the approximate distribution) of the MLE is multivariate normal, with standard errors that can be estimated by the observed information evaluated at the parameter estimate.

Example 1: Poisson model

Here λ̂ = ȳ, and consistency follows from the law of large numbers, in agreement with likelihood theory.

Furthermore, the CLT states that for large n

λ̂ ·∼ N(λ, λ/n).

This result can also be obtained from likelihood theory. Indeed, we get

U(λ) = Σ_{i=1}^n yi / λ − n, J(λ) = Σ_{i=1}^n yi / λ²,

so that I(λ) = n/λ and I(λ)⁻¹ = λ/n.

Example 2: normal model

Here we get

I(µ, σ²) = diag( n/σ², n/(2σ⁴) ).

The implication is that, in the large sample limit,

µ̂ ·∼ N(µ, σ²/n), σ̂² ·∼ N(σ², 2σ⁴/n),

and the two estimated standard errors are

SE(µ̂) = σ̂/√n, SE(σ̂²) = σ̂² √(2/n).
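A short sketch of these standard errors in R, using the standard expressions SE(µ̂) = σ̂/√n and SE(σ̂²) = σ̂²√(2/n) (not from the slides; the simulated data are arbitrary):

```r
# Estimated standard errors in the normal model, from the Fisher information.
set.seed(3)
y <- rnorm(40, mean = 5, sd = 2)
n <- length(y)

mu_hat <- mean(y)
s2_hat <- mean((y - mu_hat)^2)   # MLE of sigma^2 (divisor n, not n - 1)

se_mu <- sqrt(s2_hat / n)        # estimated SE of mu_hat
se_s2 <- s2_hat * sqrt(2 / n)    # estimated SE of sigma^2_hat

c(mu_hat = mu_hat, se_mu = se_mu, s2_hat = s2_hat, se_s2 = se_s2)
```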

Some numerical aspects

Numerical optimisation

The algorithmic nature of the MLE estimation method translates the statistical model into an optimisation problem: once a (sensible) statistical model has been specified for the data, we obtain parameter estimates with excellent properties by maximizing the loglikelihood.

In some simple settings, like in the examples above, it is possible to find the analytical expression for the MLE, but in general we must resort to numerical optimisation of the loglikelihood.

There are indeed several methods available for the task. Some knowledge of the most important issues involved turns out to be particularly useful, even for the application of off-the-shelf routines in R (or other environments).

Newton's method for optimisation is commonly used for minimization, in this case of the objective function f(θ) = −ℓ(θ).

The theory is well described in the CS book; here we mention the most important aspects. The idea is to locally approximate f(θ) by a quadratic function, which is repeatedly minimised.

The resulting method is an iterative algorithm, which is started with k = 0 and a guesstimate θ[0], and iterates the following steps:

1. Evaluate ℓ(θ[k]), U(θ[k]) and J(θ[k]).

2. If U(θ[k]) ≐ 0 and J(θ[k]) is positive definite, then stop.

3. If H = J(θ[k]) is not positive definite, perturb it so that it is.

4. Solve Hδ = U(θ[k]) for the search direction δ.

5. If ℓ(θ[k] + δ) is not > ℓ(θ[k]), repeatedly halve δ until it is (this is the step-length control).

6. Set θ[k+1] = θ[k] + δ, increment k by one and return to step 1.
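The six steps above can be sketched as a generic R function (a sketch, not the slides' code: the positive-definiteness perturbation of step 3 is omitted, invalid parameter values are not guarded against, and the argument names are illustrative).

```r
# Generic Newton iteration with step-halving, following the steps above.
newton_ml <- function(loglik, score, info, theta0, data,
                      tol = 1e-8, maxit = 50) {
  theta <- theta0
  for (k in 1:maxit) {
    u <- score(theta, data)                 # step 1 (score)
    if (sqrt(sum(u^2)) < tol) break         # step 2: score approximately zero
    H <- info(theta, data)                  # step 3 omitted: H assumed p.d.
    delta <- solve(H, u)                    # step 4: search direction
    while (loglik(theta + delta, data) < loglik(theta, data))
      delta <- delta / 2                    # step 5: step-length control
    theta <- theta + delta                  # step 6: update
  }
  theta
}
```

For instance, applied to the Poisson loglikelihood ℓ(λ) = log(λ) Σ yi − nλ, the function recovers λ̂ = ȳ.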

Whenever available, it is always a good idea to replace the observed information with the expected information I(θ[k]) in Newton's method. The resulting algorithm has a long and successful tradition in statistics: it is called Fisher scoring and, indeed, it has better convergence properties.

Another variant avoids the computation of either J(θ[k]) or I(θ[k]), by building an approximation to the second derivative of ℓ(θ) as the optimization proceeds. This is the approach of the Quasi-Newton methods, such as the widely used BFGS algorithm.

Quasi-Newton methods are implemented in several R functions and packages; see the CRAN Task View for Optimization (https://cran.r-project.org/web/views/Optimization.html).

An example: logistic regression

We follow the MASS book for a simple example on a dose-response model.

Namely, we assume that yi is the number of dead budworms (out of 20) for a dose of insecticide x*i. In particular, the statistical model is

Yi ∼ Bi(20, πi), i = 1, ..., 12, independent,

with

log{πi / (1 − πi)} = α + β xi,

with xi = log(x*i).

This is a simple instance of a logistic regression model.

R lab: budworm data

There are two observations at each dose (M/F budworms), but here for the sake of simplicity we ignore the different sexes.

ldose <- rep(0:5, 2)

numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)

plot(ldose, numdead / 20, pch = 16)

With some simple algebra we get:

ℓ(α, β) = Σ_{i=1}^{12} [ yi ηi − 20 log{1 + exp(ηi)} ], with ηi = α + β xi,

neglecting additive constants.

R lab: likelihood and score functions

loglik <- function(theta, data) {
  eta <- theta[1] + theta[2] * data$x
  out <- sum(data$y * eta - 20 * log(1 + exp(eta)))
  return(out)
}

score <- function(theta, data) {
  prob <- plogis(theta[1] + theta[2] * data$x)
  out <- c(sum(data$y - prob * 20), sum((data$y - prob * 20) * data$x))
  return(out)
}

info <- function(theta, data) {
  prob <- plogis(theta[1] + theta[2] * data$x)
  info11 <- sum(20 * prob * (1 - prob))
  info12 <- sum(20 * prob * (1 - prob) * data$x)
  info22 <- sum(20 * prob * (1 - prob) * data$x^2)
  out <- matrix(c(info11, info12, info12, info22), 2, 2)
  return(out)
}

Let's start from α = β = 0: we obtain

theta0 <- c(0, 0); budw <- data.frame(y = numdead, x = ldose)

loglik(theta0, budw)

## [1] -166.3553

score(theta0, budw)

## [1]  -9 105

info(theta0, budw)

##      [,1] [,2]
## [1,]   60  150
## [2,]  150  550

H <- info(theta0, budw)

u0 <- score(theta0, budw)

delta <- solve(H, u0)

theta1 <- theta0 + delta

theta1

## [1] -1.9714286  0.7285714

loglik(theta1, budw)

## [1] -114.7219

which is clearly an improvement.
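The trace below can be reproduced with a loop along these lines (a sketch that restates the data and the functions from the R lab so that the chunk runs on its own; it also defines theta10, used afterwards):

```r
# Newton iteration for the budworm logistic regression, printing a trace.
budw <- data.frame(y = c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16),
                   x = rep(0:5, 2))
loglik <- function(theta, data) {
  eta <- theta[1] + theta[2] * data$x
  sum(data$y * eta - 20 * log(1 + exp(eta)))
}
score <- function(theta, data) {
  prob <- plogis(theta[1] + theta[2] * data$x)
  c(sum(data$y - 20 * prob), sum((data$y - 20 * prob) * data$x))
}
info <- function(theta, data) {
  prob <- plogis(theta[1] + theta[2] * data$x)
  w <- 20 * prob * (1 - prob)
  matrix(c(sum(w), sum(w * data$x), sum(w * data$x), sum(w * data$x^2)), 2, 2)
}

theta <- c(0, 0)
for (k in 1:10) {
  theta <- theta + solve(info(theta, budw), score(theta, budw))
  cat("## k=", k, "theta=", theta, "loglik=", loglik(theta, budw), "\n")
}
theta10 <- theta
```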

## k= 1 theta= -1.971429 0.7285714 loglik= -114.7219
## k= 2 theta= -2.621436 0.9572079 loglik= -111.8192
## k= 3 theta= -2.760585 1.004947 loglik= -111.734
## k= 4 theta= -2.766079 1.006804 loglik= -111.7339
## k= 5 theta= -2.766087 1.006807 loglik= -111.7339
## k= 6 theta= -2.766087 1.006807 loglik= -111.7339
## k= 7 theta= -2.766087 1.006807 loglik= -111.7339
## k= 8 theta= -2.766087 1.006807 loglik= -111.7339
## k= 9 theta= -2.766087 1.006807 loglik= -111.7339
## k= 10 theta= -2.766087 1.006807 loglik= -111.7339

The algorithm converges quickly; indeed, after 10 iterations we can check the score, the determinant of the information matrix, and the estimated standard errors:

cat(score(theta10, budw), det(info(theta10, budw)),
    sqrt(diag(solve(info(theta10, budw)))))

## 3.552714e-15 3.552714e-15 2361.462 0.3701342 0.1235889

budworm.lg0 <- glm(cbind(y, 20 - y) ~ x, binomial, budw)

summary(budworm.lg0, cor = FALSE)
