Chapter 17 Appendix C: Maximum Likelihood Theory
Chapter Preview. Appendix Chapter 15 introduced maximum likelihood theory for estimating parameters of a parametric family. This appendix gives more specific examples and expands some of the concepts. Section 17.1 reviews the definition of the likelihood function and introduces its properties. Section 17.2 reviews maximum likelihood estimators and extends their large-sample properties to the case where there are multiple parameters in the model. Section 17.3 reviews statistical inference based on maximum likelihood estimators, with specific examples involving multiple parameters.
17.1 Likelihood Function
In this section, you learn
- the definitions of the likelihood function and the log-likelihood function
- the properties of the likelihood function.
From Appendix Chapter 15, the likelihood function is a function of the parameters given the observed data. Here, we review the concept of the likelihood function and introduce the properties that form the basis for maximum likelihood inference.
17.1.1 Likelihood and Log-likelihood Functions
Here, we give a brief review of the likelihood function and the log-likelihood function from Appendix Chapter 15. Let \(f(\cdot|\boldsymbol\theta)\) be the probability function of \(X\), the probability mass function (pmf) if \(X\) is discrete or the probability density function (pdf) if it is continuous. The likelihood is a function of the parameters (\(\boldsymbol \theta\)) given the data (\(\mathbf{x}\)). Hence, it is a function of the parameters with the data being fixed, rather than a function of the data with the parameters being fixed. The vector of data \(\mathbf{x}\) is usually a realization of a random sample as defined in Appendix Chapter 15.
Given a realization of a random sample \(\mathbf{x}=(x_1,x_2,\cdots,x_n)\) of size \(n\), the likelihood function is defined as \[L(\boldsymbol{\theta}|\mathbf{x})=f(\mathbf{x}|\boldsymbol{\theta})=\prod_{i=1}^nf(x_i|\boldsymbol{\theta}),\] with the corresponding log-likelihood function given by
\[ l(\boldsymbol{\theta}|\mathbf{x})=\log L(\boldsymbol{\theta}|\mathbf{x})=\sum_{i=1}^n\log f(x_i|\boldsymbol{\theta}), \]
where \(f(\mathbf{x}|\boldsymbol{\theta})\) denotes the joint probability function of \(\mathbf{x}\). The log-likelihood function leads to an additive structure that is easy to work with.
In Appendix Chapter 15, we used the normal distribution to illustrate the concepts of the likelihood function and the log-likelihood function. Here, we derive the likelihood and corresponding log-likelihood functions when the population distribution is from the Pareto distribution family.
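As an illustration (a sketch assuming a common actuarial parameterization of the Pareto Type II distribution with shape \(\alpha\) and scale \(\theta\)), suppose \(X\) has pdf
\[ f(x|\alpha,\theta)=\frac{\alpha\theta^{\alpha}}{(x+\theta)^{\alpha+1}}, \qquad x>0,\ \alpha>0,\ \theta>0. \]
Then, for a realization \(\mathbf{x}=(x_1,\ldots,x_n)\) of a random sample, the likelihood function is
\[ L(\alpha,\theta|\mathbf{x})=\prod_{i=1}^n \frac{\alpha\theta^{\alpha}}{(x_i+\theta)^{\alpha+1}}, \]
and the corresponding log-likelihood function is
\[ l(\alpha,\theta|\mathbf{x})=n\log\alpha+n\alpha\log\theta-(\alpha+1)\sum_{i=1}^n\log(x_i+\theta). \]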
17.1.2 Properties of Likelihood Functions
In mathematical statistics, the first derivative of the log-likelihood function with respect to the parameters, \(u(\boldsymbol\theta)=\partial l(\boldsymbol \theta|\mathbf{x})/\partial \boldsymbol \theta\), is referred to as the score function, or the score vector when there are multiple parameters in \(\boldsymbol\theta\). The score function or score vector can be written as
\[ u(\boldsymbol\theta)=\frac{ \partial}{\partial \boldsymbol \theta} l(\boldsymbol \theta|\mathbf{x}) =\frac{ \partial}{\partial \boldsymbol \theta} \log \prod_{i=1}^n f(x_i;\boldsymbol \theta ) =\sum_{i=1}^n \frac{ \partial}{\partial \boldsymbol \theta} \log f(x_i;\boldsymbol \theta ), \]
where \(u(\boldsymbol\theta)=(u_1(\boldsymbol\theta),u_2(\boldsymbol\theta),\cdots,u_p(\boldsymbol\theta))\) when \(\boldsymbol\theta=(\theta_1,\cdots,\theta_p)\) contains \(p\geq 2\) parameters, with the element \(u_k(\boldsymbol\theta)=\partial l(\boldsymbol \theta|\mathbf{x})/\partial \theta_k\) being the partial derivative with respect to \(\theta_k\) (\(k=1,2,\cdots,p\)).
The likelihood function has the following properties:
- One basic property of the likelihood function is that the expectation of the score function with respect to \(\mathbf{x}\) is 0. That is,
\[ \mathrm{E}[u(\boldsymbol\theta)]=\mathrm{E} \left[ \frac{ \partial}{\partial \boldsymbol \theta} l(\boldsymbol \theta|\mathbf{x}) \right] = \mathbf 0 . \]
To show this, note that
\[ \begin{aligned} \mathrm{E} \left[ \frac{ \partial}{\partial \boldsymbol \theta} l(\boldsymbol \theta|\mathbf{x}) \right] &= \mathrm{E} \left[ \frac{\frac{\partial}{\partial \boldsymbol \theta}f(\mathbf{x};\boldsymbol \theta)}{f(\mathbf{x};\boldsymbol \theta )} \right] = \int\frac{\partial}{\partial \boldsymbol \theta} f(\mathbf{y};\boldsymbol \theta ) d \mathbf y \\ &= \frac{\partial}{\partial \boldsymbol \theta} \int f(\mathbf{y};\boldsymbol \theta ) d \mathbf y = \frac{\partial}{\partial \boldsymbol \theta} 1 = \mathbf 0,\end{aligned} \]
where the interchange of differentiation and integration is justified under the regularity conditions, and the integral is replaced by a sum when the distribution is discrete.
Denote by \({ \partial^2 l(\boldsymbol \theta|\mathbf{x}) }/{\partial \boldsymbol \theta\partial \boldsymbol \theta^{\prime}}={ \partial^2 l(\boldsymbol \theta|\mathbf{x}) }/{\partial \theta^{2}}\) the second derivative of the log-likelihood function when \(\boldsymbol\theta\) contains a single parameter, or by \({ \partial^2 l(\boldsymbol \theta|\mathbf{x}) }/{\partial \boldsymbol \theta\partial \boldsymbol \theta^{\prime}}=(h_{jk})=\left({ \partial^2 l(\boldsymbol \theta|\mathbf{x}) }/{\partial \theta_j\partial \theta_k}\right)\) the Hessian matrix of the log-likelihood function when it contains multiple parameters. Similarly, let \([{ \partial l(\boldsymbol \theta|\mathbf{x})}/{\partial\boldsymbol \theta}][{ \partial l(\boldsymbol \theta|\mathbf{x})}/{\partial\boldsymbol \theta^{\prime}}]=u^2(\boldsymbol \theta)\) when \(\boldsymbol\theta\) is a single parameter, or let \([{ \partial l(\boldsymbol \theta|\mathbf{x})}/{\partial\boldsymbol \theta}][{ \partial l(\boldsymbol \theta|\mathbf{x})}/{\partial\boldsymbol \theta^{\prime}}]=(uu_{jk})\) be a \(p\times p\) matrix when \(\boldsymbol\theta\) contains a total of \(p\) parameters, with element \(uu_{jk}=u_j(\boldsymbol \theta)u_k(\boldsymbol \theta)\), where \(u_j(\boldsymbol \theta)\) is the \(j\)th element of the score vector as defined earlier. Another basic property of the likelihood function is that the sum of the expectation of the Hessian matrix and the expectation of the outer product of the score vector with its transpose is \(\mathbf 0\). That is, \[\mathrm{E} \left( \frac{ \partial^2 }{\partial \boldsymbol \theta\partial \boldsymbol \theta^{\prime}} l(\boldsymbol \theta|\mathbf{x}) \right) + \mathrm{E} \left( \frac{ \partial l(\boldsymbol \theta|\mathbf{x})}{\partial\boldsymbol \theta} \frac{ \partial l(\boldsymbol \theta|\mathbf{x})}{\partial\boldsymbol \theta^{\prime}}\right) = \mathbf 0.\]
Define the Fisher information matrix as
\[ \mathcal{I}(\boldsymbol \theta) = \mathrm{E} \left( \frac{ \partial l(\boldsymbol \theta|\mathbf{x})}{\partial \boldsymbol \theta} \frac{ \partial l(\boldsymbol \theta|\mathbf{x})}{\partial \boldsymbol \theta^{\prime}} \right) = -\mathrm{E} \left( \frac{ \partial^2}{\partial \boldsymbol \theta \partial \boldsymbol \theta^{\prime}} l(\boldsymbol \theta|\mathbf{x}) \right). \]
As the sample size \(n\) goes to infinity, the score function (vector) converges in distribution to a normal distribution (or multivariate normal distribution when \(\boldsymbol \theta\) contains multiple parameters) with mean 0 and variance (or covariance matrix in the multivariate case) given by \(\mathcal{I}(\boldsymbol \theta)\).
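As a simple illustration of these properties (a brief sketch added here, not taken from Appendix Chapter 15), consider a random sample from the exponential distribution with rate parameter \(\theta\), so that \(f(x|\theta)=\theta e^{-\theta x}\) and \(\mathrm{E}(X)=1/\theta\). Then
\[ l(\theta|\mathbf{x}) = n\log\theta - \theta\sum_{i=1}^n x_i, \qquad u(\theta)=\frac{n}{\theta}-\sum_{i=1}^n x_i, \]
so that \(\mathrm{E}[u(\theta)] = n/\theta - n\,\mathrm{E}(X) = 0\), consistent with the first property. Moreover,
\[ \mathcal{I}(\theta) = -\mathrm{E}\left(\frac{\partial^2}{\partial\theta^2}\,l(\theta|\mathbf{x})\right) = \frac{n}{\theta^2}, \]
which can also be obtained as \(\mathrm{E}[u^2(\theta)] = \mathrm{Var}\left(\sum_{i=1}^n x_i\right) = n/\theta^2\), consistent with the second property.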
17.2 Maximum Likelihood Estimators
In this section, you learn
- the definition and derivation of the maximum likelihood estimator (mle) for parameters from a specific distribution family
- the properties of maximum likelihood estimators that ensure valid large-sample inference of the parameters
- why mle-based methods are used, and what cautions need to be taken.
In statistics, maximum likelihood estimators are the values of the parameters \(\boldsymbol \theta\) under which the observed data are most likely to have been produced.
17.2.1 Definition and Derivation of MLE
Based on the definition given in Appendix Chapter 15, the value of \(\boldsymbol \theta\), say \(\hat{\boldsymbol \theta}_{MLE}\), that maximizes the likelihood function, is called the maximum likelihood estimator (mle) of \(\boldsymbol \theta\).
Because the logarithm \(\log(\cdot)\) is a strictly increasing function, we can also determine \(\hat{\boldsymbol{\theta}}_{MLE}\) by maximizing the log-likelihood function, \(l(\boldsymbol \theta|\mathbf{x})\). That is, the mle is defined as
\[ \hat{\boldsymbol \theta}_{MLE} = {\mbox{argmax}}_{\boldsymbol{\theta}\in\Theta}~l(\boldsymbol{\theta}|\mathbf{x}). \]
Given the analytical form of the likelihood function, the mle can be obtained by taking the first derivative of the log-likelihood function with respect to \(\boldsymbol{\theta}\) and setting the partial derivatives equal to zero. That is, the mle is a solution of the equations
\[ \frac{\partial l(\hat{\boldsymbol{\theta}}|\mathbf{x})}{\partial\hat{\boldsymbol{\theta}}}=\mathbf 0. \]
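For example (a short sketch continuing the exponential illustration above), with \(f(x|\theta)=\theta e^{-\theta x}\) the score equation \(u(\theta)=n/\theta-\sum_{i=1}^n x_i=0\) yields the closed-form solution
\[ \hat{\theta}_{MLE} = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar{x}}, \]
and the second derivative \(-n/\theta^2<0\) confirms that this critical point is indeed a maximum.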
17.2.2 Asymptotic Properties of MLE
From Appendix Chapter 15, the mle has some nice large-sample properties, under certain regularity conditions. We presented the results for a single parameter in Appendix Chapter 15, but the results also hold when \(\boldsymbol{\theta}\) contains multiple parameters. In particular, we have the following results in the general case where \(\boldsymbol{\theta}=(\theta_1,\theta_2,\cdots,\theta_p)\).
The mle of a parameter \(\boldsymbol{\theta}\), \(\hat{\boldsymbol{\theta}}_{MLE}\), is a consistent estimator. That is, the mle \(\hat{\boldsymbol{\theta}}_{MLE}\) converges in probability to the true value \(\boldsymbol{\theta}\), as the sample size \(n\) goes to infinity.
The mle has the asymptotic normality property, meaning that the estimator will converge in distribution to a multivariate normal distribution centered around the true value, when the sample size goes to infinity. Namely, \[\sqrt{n}(\hat{\boldsymbol{\theta}}_{MLE}-\boldsymbol{\theta})\rightarrow N\left(\mathbf 0,\,\boldsymbol{V}\right),\quad \mbox{as}\quad n\rightarrow \infty,\] where \(\boldsymbol{V}\) denotes the asymptotic variance (or covariance matrix) of the estimator. Hence, the mle \(\hat{\boldsymbol{\theta}}_{MLE}\) has an approximate normal distribution with mean \(\boldsymbol{\theta}\) and variance (covariance matrix when \(p>1\)) \(\boldsymbol{V}/n\), when the sample size is large.
The mle is efficient, meaning that its asymptotic variance \(\boldsymbol{V}\) attains the smallest possible value, commonly referred to as the Cramér–Rao lower bound. In particular, the Cramér–Rao lower bound is the inverse of the Fisher information (matrix) \(\mathcal{I}(\boldsymbol{\theta})\) defined earlier in this appendix. Hence, \(\mathrm{Var}(\hat{\boldsymbol{\theta}}_{MLE})\) can be estimated based on the observed Fisher information.
Based on the above results, we may perform statistical inference based on the procedures defined in Appendix Chapter 15.
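The following is a minimal sketch (illustrative, not part of the original example) of how these large-sample results can be used in practice. It simulates losses from the two-parameter Pareto (Lomax) distribution of Section 17.1.1, computes the mle numerically, and approximates standard errors from the observed Fisher information, i.e. the Hessian of the negative log-likelihood at the mle. The simulated data, seed, starting values, and the helper names `negloglik` and `numerical_hessian` are assumptions made for this sketch.

```python
# Sketch: numerical MLE and standard errors for a two-parameter Pareto sample,
# assuming the pdf f(x | alpha, theta) = alpha * theta^alpha / (x + theta)^(alpha + 1).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import lomax  # Lomax = Pareto Type II, shape c and scale theta

# Simulated loss data (hypothetical true parameters alpha = 3, theta = 2000)
x = lomax.rvs(c=3.0, scale=2000.0, size=500, random_state=2024)

def negloglik(params):
    """Negative Pareto log-likelihood: -(n log a + n a log t - (a+1) sum log(x+t))."""
    a, t = params
    if a <= 0 or t <= 0:
        return np.inf
    return -np.sum(np.log(a) + a * np.log(t) - (a + 1) * np.log(x + t))

fit = minimize(negloglik, x0=np.array([1.0, 1000.0]), method="Nelder-Mead")

def numerical_hessian(f, x0, rel_step=1e-4):
    """Central finite-difference Hessian of f at x0 (observed information here)."""
    p = len(x0)
    h = rel_step * np.maximum(1.0, np.abs(x0))
    H = np.zeros((p, p))
    for j in range(p):
        for k in range(p):
            ej = np.zeros(p); ej[j] = h[j]
            ek = np.zeros(p); ek[k] = h[k]
            H[j, k] = (f(x0 + ej + ek) - f(x0 + ej - ek)
                       - f(x0 - ej + ek) + f(x0 - ej - ek)) / (4 * h[j] * h[k])
    return H

obs_info = numerical_hessian(negloglik, fit.x)   # observed Fisher information
cov = np.linalg.inv(obs_info)                    # approximate covariance of the mle
se = np.sqrt(np.diag(cov))
print("mle (alpha, theta):", fit.x)
print("standard errors:   ", se)
```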
17.2.3 Use of Maximum Likelihood Estimation
The method of maximum likelihood has many advantages over alternative methods such as the method of moments introduced in Appendix Chapter 15.
- It is a general tool that works in many situations. For example, we may be able to write out a closed-form likelihood function for censored and truncated data. Maximum likelihood estimation can also be used for regression models with covariates, such as survival regression, generalized linear models, and mixed models, including models with time-dependent covariates.
- Because the mle is efficient, it is optimal in the sense that, for large sample sizes, it has the smallest variance among the class of all unbiased estimators.
- From the results on the asymptotic normality of the mle, we can obtain a large-sample distribution for the estimator, allowing users to assess the variability in the estimation and perform statistical inference on the parameters. The approach is less computationally intensive than re-sampling methods, which require a large number of model fits.
Despite its numerous advantages, maximum likelihood estimation has drawbacks in cases, such as generalized linear models, where the mle does not have a closed analytical form. In such cases, maximum likelihood estimators are computed iteratively using numerical optimization methods. For example, we may use the Newton-Raphson algorithm or its variations to obtain the mle. Iterative algorithms require starting values, and for some problems the choice of a starting value close to the optimum is critical, particularly when the likelihood function has local minima or maxima. Hence, there may be convergence issues when the starting value is far from the maximum. It is therefore important to start from different values across the parameter space and compare the maximized likelihood or log-likelihood values to make sure the algorithm has converged to a global maximum.
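As a brief illustration (continuing the Pareto sketch above, so `negloglik` and the imports are assumed to be in scope), one simple safeguard is to run the optimizer from several starting values and keep the fit with the largest maximized log-likelihood:

```python
# Try several (hypothetical) starting values and keep the best fit,
# guarding against convergence to a local optimum or a failed run.
starts = [np.array([0.5, 500.0]), np.array([1.0, 1000.0]), np.array([5.0, 5000.0])]
best = None
for start in starts:
    res = minimize(negloglik, x0=start, method="Nelder-Mead")
    if res.success and (best is None or res.fun < best.fun):
        best = res
print("maximized log-likelihood:", -best.fun, "at parameters", best.x)
```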
17.3 Statistical Inference Based on Maximum Likelihood Estimation
In this section, you learn how to
- perform hypothesis testing based on mle for cases where there are multiple parameters in \(\boldsymbol\theta\)
- perform likelihood ratio tests for cases where there are multiple parameters in \(\boldsymbol\theta\)
In Appendix Chapter 15, we have introduced maximum likelihood-based methods for statistical inference when \(\boldsymbol\theta\) contains a single parameter. Here, we will extend the results to cases where there are multiple parameters in \(\boldsymbol\theta\).
17.3.1 Hypothesis Testing
In Appendix Chapter 15, we defined hypothesis testing concerning the null hypothesis, a statement on the parameter(s) of a distribution or model. One important type of inference is to assess whether a parameter estimate is statistically significant, that is, whether the true value of the parameter differs from zero.
We have learned earlier that the mle \(\hat{\boldsymbol{\theta}}_{MLE}\) has a large-sample normal distribution with mean \(\boldsymbol \theta\) and variance-covariance matrix \(\mathcal{I}^{-1}(\boldsymbol \theta)\). Based on the multivariate normal distribution, the \(j\)th element of \(\hat{\boldsymbol{\theta}}_{MLE}\), say \(\hat{\theta}_{MLE,j}\), has a large-sample univariate normal distribution.
Define \(se(\hat{\theta}_{MLE,j})\), the standard error (estimated standard deviation), to be the square root of the \(j\)th diagonal element of \(\mathcal{I}^{-1}(\hat{\boldsymbol \theta}_{MLE})\). To assess the null hypothesis that \(\theta_j=\theta_0\), we define the \(t\)-statistic or \(t\)-ratio to be \(t(\hat{\theta}_{MLE,j})=(\hat{\theta}_{MLE,j}-\theta_0)/se(\hat{\theta}_{MLE,j})\).
Under the null hypothesis, it has a Student-\(t\) distribution with degrees of freedom equal to \(n-p\), with \(p\) being the dimension of \(\boldsymbol{\theta}\).
For most actuarial applications, we have a large sample size \(n\), so the \(t\)-distribution is very close to the (standard) normal distribution. In the case when \(n\) is very large or when the standard error is known, the \(t\)-statistic can be referred to as a \(z\)-statistic or \(z\)-score.
Based on the results from Appendix Chapter 15, if the \(t\)-statistic \(t(\hat{\theta}_{MLE,j})\) exceeds a cut-off (in absolute value), then the test for the \(j\)th parameter \(\theta_j\) is said to be statistically significant. If \(\theta_j\) is the regression coefficient of the \(j\)th independent variable, then we say that the \(j\)th variable is statistically significant.
For example, if we use a 5% significance level, then the cut-off value is 1.96, using a normal distribution approximation for cases with a large sample size. More generally, using a \(100\alpha\%\) significance level, the cut-off is the \(100(1-\alpha/2)\%\) quantile from a Student-\(t\) distribution with \(n-p\) degrees of freedom.
Another useful concept in hypothesis testing is the \(p\)-value, shorthand for probability value. From the mathematical definition in Appendix Chapter 15, a \(p\)-value is defined as the smallest significance level for which the null hypothesis would be rejected. Hence, the \(p\)-value is a useful summary statistic for the data analyst to report because it allows the reader to understand the strength of statistical evidence concerning the deviation from the null hypothesis.
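To make this concrete, the following sketch computes a \(t\)-ratio and its two-sided \(p\)-value, reusing `fit.x`, `se`, and the sample `x` from the Pareto sketch above; the null value tested is a hypothetical choice for illustration.

```python
# Sketch: t-ratio and two-sided p-value for H0: alpha = 2.5 (hypothetical null value),
# using the mle and standard errors computed in the Pareto sketch above.
from scipy import stats

theta_0 = 2.5                              # hypothesized value of the first parameter
t_ratio = (fit.x[0] - theta_0) / se[0]
df = len(x) - len(fit.x)                   # degrees of freedom: n - p
p_value = 2 * stats.t.sf(abs(t_ratio), df=df)
print(f"t = {t_ratio:.3f}, p-value = {p_value:.4f}")
```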
17.3.2 MLE and Model Validation
In addition to the hypothesis testing and interval estimation introduced in Appendix Chapter 15 and the previous subsection, another important type of inference is the selection of a model from two choices, where one choice is a special case of the other with certain parameters being restricted. For two such models, with one nested in the other, we introduced the likelihood ratio test (LRT) in Appendix Chapter 15. Here, we briefly review the process of performing an LRT based on a specific example of two alternative models.
Suppose that we have a (large) model under which we derive the maximum likelihood estimator, \(\hat{\boldsymbol{\theta}}_{MLE}\). Now assume that some of the \(p\) elements in \(\boldsymbol \theta\) are equal to zero and determine the maximum likelihood estimator over the remaining set, with the resulting estimator denoted \(\hat{\boldsymbol{\theta}}_{Reduced}\).
Based on the definition in Appendix Chapter 15, the statistic \(LRT= 2 \left( l(\hat{\boldsymbol{\theta}}_{MLE}) - l(\hat{\boldsymbol{\theta}}_{Reduced}) \right)\) is called the likelihood ratio statistic. Under the null hypothesis that the reduced model is correct, the likelihood ratio statistic has a chi-square distribution with degrees of freedom equal to \(d\), the number of parameters set to zero.
Such a test allows us to judge which of the two models is more likely to be correct, given the observed data. If the statistic \(LRT\) is large relative to the critical value from the chi-square distribution, then we reject the reduced model in favor of the larger one. Details regarding the critical value and alternative methods based on information criteria are given in Appendix Chapter 15.
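The mechanics can be sketched as follows; the maximized log-likelihood values and the number of restricted parameters below are hypothetical placeholders, and in practice they come from fitting the full and reduced models.

```python
# Sketch: likelihood ratio test statistic and p-value for nested models.
from scipy import stats

def likelihood_ratio_test(loglik_full, loglik_reduced, d):
    """LRT = 2 * (l_full - l_reduced); compare with a chi-square on d degrees of freedom."""
    lrt = 2.0 * (loglik_full - loglik_reduced)
    p_value = stats.chi2.sf(lrt, df=d)
    return lrt, p_value

# Hypothetical maximized log-likelihoods, with d = 2 parameters set to zero
lrt, p = likelihood_ratio_test(loglik_full=-120.5, loglik_reduced=-124.1, d=2)
print(f"LRT = {lrt:.2f}, p-value = {p:.4f}")
```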
Contributors
- Lei (Larry) Hua, Northern Illinois University, and Edward W. (Jed) Frees, University of Wisconsin-Madison, are the principal authors of the initial version of this chapter. Email: lhua@niu.edu or jfrees@bus.wisc.edu for chapter comments and suggested improvements.