2.3 Maximum Likelihood
2.3.1 Mechanics
Suppose we have a random sample from a distribution \(f(x \mid \theta)\). We find the maximum likelihood (ML) estimator \(\hat{\theta}\) of \(\theta\) by maximizing the likelihood of the observed data with respect to \(\theta\).
In short, we take the likelihood from Section 2.1.1.1 and find the parameter \(\theta\) that maximizes it.
In practice, to make the math and/or computation a bit easier, we manipulate the likelihood function in two ways:
- Relabel the likelihood function \(f(x \mid \theta) = L(\theta)\), since it’s weird to maximize with respect to a conditioning variable.
- Work with \(\log L(\theta)\) rather than \(L(\theta)\). Because \(\log()\) is a monotonically increasing function, the \(\theta\) that maximizes \(L(\theta)\) also maximizes \(\log L(\theta)\).
Suppose we have samples \(x_1, x_2, ..., x_N\) from \(f(x \mid \theta)\). Then the likelihood is \(f(x \mid \theta) = \prod_{n = 1}^N f(x_n \mid \theta)\) and \(\log L(\theta) = \sum_{n = 1}^N \log \left[ f(x_n \mid \theta) \right]\). The ML estimator \(\hat{\theta}\) of \(\theta\) is \(\arg \max \log L(\theta)\).
In applied problems, we might be able to simplify \(\log L\) substantially. Occasionally, we can find a nice analytical maximum. In many cases, we have a computer find the parameter that maximizes \(\log L\).
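As a rough sketch of the computational approach, we can write the log-likelihood as an R function and hand it to `optim()`; the normal model and the simulated data below are purely illustrative.

# a sketch of the computational approach: maximize a log-likelihood with optim()
# (the normal model and the simulated data are purely illustrative)
set.seed(1234)
x <- rnorm(100, mean = 2, sd = 3)  # illustrative data

# the log-likelihood as a function of the parameter vector theta = (mu, log sigma)
calc_log_lik <- function(theta, x) {
  mu <- theta[1]
  sigma <- exp(theta[2])  # work on the log scale so sigma stays positive
  sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

# optim() minimizes by default; fnscale = -1 flips it to maximization
est <- optim(par = c(0, 0), fn = calc_log_lik, x = x, control = list(fnscale = -1))
c(mu_hat = est$par[1], sigma_hat = exp(est$par[2]))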
2.3.2 Example: Toothpaste Cap Problem
For the toothpaste cap problem, we have the following likelihood, which I’m borrowing directly from Section 2.1.1.1. \[ \text{the likelihood: } f(x | \pi) = \pi^{k} (1 - \pi)^{(N - k)}, \text{where } k = \sum_{n = 1}^N x_n \\ \] Then, we relabel.
\[ \text{the likelihood: } L(\pi) = \pi^{k} (1 - \pi)^{(N - k)}\\ \] Then, we take the log and simplify.
\[ \text{the log-likelihood: } \log L(\pi) = k \log (\pi) + (N - k) \log(1 - \pi)\\ \] To find the ML estimator, we find \(\hat{\pi}\) that maximizes \(\log L\). In this case, the analytical optimum is easy.
\[ \begin{aligned} \frac{d \log L}{d\hat{\pi}} = k \left( \frac{1}{\hat{\pi}}\right) + (N - k) \left( \frac{1}{1 - \hat{\pi}}\right)(-1) &= 0\\ \frac{k}{\hat{\pi}} - \frac{N - k}{1 - \hat{\pi}} &= 0 \\ \frac{k}{\hat{\pi}} &= \frac{N - k}{1 - \hat{\pi}} \\ k(1 - \hat{\pi}) &= (N - k)\hat{\pi} \\ k - k\hat{\pi} &= N\hat{\pi} - k\hat{\pi} \\ k &= N\hat{\pi} \\ \hat{\pi} &= \frac{k}{N} = \text{avg}(x)\\ \end{aligned} \] The ML estimator for the Bernoulli distribution is the fraction of successes, or, equivalently, the average of the Bernoulli trials.
The collected data consist of 150 trials and 8 successes, so the ML estimate of \(\pi\) is \(\frac{8}{150} \approx 0.053\). Compare the ML estimate with the posterior mean, median, and mode above.
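As a quick numerical sketch, we can also let `optimize()` maximize the Bernoulli log-likelihood directly and confirm that it lands (approximately) on \(k/N\).

# a quick numerical check of the toothpaste cap result with optimize()
k <- 8    # number of successes (caps landing up)
N <- 150  # number of trials
calc_log_lik <- function(pi) k*log(pi) + (N - k)*log(1 - pi)
optimize(calc_log_lik, interval = c(0.001, 0.999), maximum = TRUE)$maximum  # about 8/150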
2.3.3 Example: German Tank Problem
The German tank problem is an excellent example of an ML estimator because one can think through the logic intuitively rather than mathematically.
Recall that the Allies observe \(N\) serial numbers of German tanks. By treating these serial numbers as samples without replacement from a set of sequential serial numbers \(S = \{1, 2, ..., \nu\}\), they can estimate the total number of German tanks \(\nu\).
Because we’re sampling without replacement, the likelihood is quite complicated.
\[ f(x \mid \nu) = f(x_1) f(x_2 \mid x_1) ... f(x_N | x_{N - 1}, ..., x_2, x_1), \\ \text{where } x_i \neq x_j \text{ for } i \neq j. \]
Remember that \(\nu\) is the total number of tanks. So the chance that we observe \(x_1\) first is \(\frac{1}{\nu}\). The chance that we observe \(x_2\) second, given that we observed \(x_1\) first, is \(\frac{1}{\nu - 1}\). Continuing in this way, the chance that we observe \(x_N\) \(N\)th is \(\frac{1}{\nu - (N - 1)}\).
\[ L(\nu) = \frac{1}{\nu} \times \frac{1}{\nu - 1} \times ... \times \frac{1}{\nu - (N - 1)} \] It’s clear, then, that in order to make \(L\) as large as possible, we need to make \(\nu\) as small as possible. However, notice that \(\nu\) cannot be less than \(\max \{x_1, ..., x_N\}\) (i.e., the largest serial number must be greater than or equal to the largest observed serial number).
So what’s the smallest value of \(\nu\) that satisfies \(\nu \geq \max \{x_1, ..., x_N\}\)? Of course it’s \(\hat{\nu} = \max \{x_1, ..., x_N\}\).
To illustrate, I simulate the same five data sets from above and compute the MM and ML estimates. Take a look at the five data sets and the two estimates for each. Would you say that one estimator seems better? Can the ML estimate ever be too large? Can the MM estimate ever be too small?
# the estimators
calc_nu_hat_mm <- function(x) {
  N <- length(x)
  sum_x <- sum(x)
  nu_hat <- ((2 / N) * sum_x) - 1
  return(nu_hat)
}
calc_nu_hat_ml <- function(x) {
  nu_hat <- max(x)
  return(nu_hat)
}

# simulation parameters
n_sims <- 5  # the number of repeated samples
N <- 5       # the number of observations in each sample
nu <- 100    # the true value of nu

# do the simulation
set.seed(12345)
for (i in 1:n_sims) {
  # take the ith sample
  x <- sample(1:nu, size = N, replace = FALSE)
  x <- sort(x)  # sort x for convenience
  # compute the estimates
  nu_hat_mm <- calc_nu_hat_mm(x)
  nu_hat_ml <- calc_nu_hat_ml(x)
  # print the results
  data <- paste0(stringr::str_pad(x, 3, pad = "0"), collapse = ", ")
  mm_estimate <- paste0("MM estimate = ", round(nu_hat_mm, 2))
  ml_estimate <- paste0("ML estimate = ", round(nu_hat_ml, 2))
  cat("Sim. #", i, ": ", data, " --> ", ml_estimate, "; ", mm_estimate, "\n", sep = "")
}
## Sim. #1: 014, 051, 080, 090, 092 --> ML estimate = 92; MM estimate = 129.8
## Sim. #2: 024, 058, 075, 093, 096 --> ML estimate = 96; MM estimate = 137.4
## Sim. #3: 002, 038, 075, 086, 088 --> ML estimate = 88; MM estimate = 114.6
## Sim. #4: 010, 032, 040, 081, 094 --> ML estimate = 94; MM estimate = 101.8
## Sim. #5: 001, 030, 038, 039, 076 --> ML estimate = 76; MM estimate = 72.6
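As a rough way to press on these questions, we can repeat the sampling many more times and compare how each estimator behaves on average. The sketch below reuses `calc_nu_hat_mm()` and `calc_nu_hat_ml()` and the values of `N` and `nu` from above; the number of repetitions is arbitrary.

# a rough follow-up: repeat the sampling many times and compare the estimators
# (reuses N and nu from above; the number of repetitions is arbitrary)
set.seed(6789)
n_sims <- 10000
nu_hat_mm <- numeric(n_sims)
nu_hat_ml <- numeric(n_sims)
for (i in 1:n_sims) {
  x <- sample(1:nu, size = N, replace = FALSE)
  nu_hat_mm[i] <- calc_nu_hat_mm(x)
  nu_hat_ml[i] <- calc_nu_hat_ml(x)
}
# average error and spread of each estimator around the true nu
c(mm_bias = mean(nu_hat_mm) - nu, ml_bias = mean(nu_hat_ml) - nu)
c(mm_sd = sd(nu_hat_mm), ml_sd = sd(nu_hat_ml))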
2.3.4 Example: Poisson Distribution
Suppose we collect \(N\) random samples \(x = \{x_1, x_2, ..., x_N\}\) and model each draw as a random variable \(X \sim \text{Poisson}(\lambda)\). Find the ML estimator of \(\lambda\).
\[ \begin{aligned} \text{Poisson likelihood: } f(x \mid \lambda) &= \prod_{n = 1}^N \frac{\lambda^{x_n} e^{-\lambda}}{x_n!} \\ L(\lambda) &= \prod_{n = 1}^N \frac{\lambda^{x_n} e^{-\lambda}}{x_n!} \\ \log L(\lambda) &= \sum_{n = 1}^N \log \left[ \frac{\lambda^{x_n} e^{-\lambda}}{x_n!} \right]\\ &= \sum_{n = 1}^N \left[ x_n \log \lambda + (-\lambda) \log e - \log x_n! \right]\\ &= \log \lambda \left[ \sum_{n = 1}^N x_n \right] - N\lambda - \sum_{n = 1}^N \log (x_n!) \\ \end{aligned} \] To find the ML estimator, we find \(\hat{\lambda}\) that maximizes \(\log L\). In this case, the analytical optimum is easy.
\[ \begin{aligned} \frac{d \log L}{d\hat{\lambda}} = \frac{1}{\hat{\lambda}} \left[ \sum_{n = 1}^N x_n \right] - N &= 0\\ \frac{1}{\hat{\lambda}} \left[ \sum_{n = 1}^N x_n \right] &= N \\ \left[ \sum_{n = 1}^N x_n \right] &= N \hat{\lambda} \\ \hat{\lambda} &= \frac{ \sum_{n = 1}^N x_n }{N} = \text{avg}(x) \\ \end{aligned} \] The ML estimator for the Poisson distribution is just the average of the samples.
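As a quick numerical sketch, maximizing the Poisson log-likelihood with `optimize()` on simulated data (illustrative only) returns the sample average.

# a numerical check: the Poisson ML estimate matches the sample average
# (the simulated data are purely illustrative)
set.seed(2468)
x <- rpois(100, lambda = 4)
calc_log_lik <- function(lambda) sum(dpois(x, lambda = lambda, log = TRUE))
optimize(calc_log_lik, interval = c(0.01, 20), maximum = TRUE)$maximum
mean(x)  # should (roughly) match the numerical maximum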
2.3.5 Remarks
ML estimators are extremely common in political science because they are general, fast, and work well across a wide range of problems. Lots of models that you’ve heard of, such as logistic regression, are estimated with ML.
We can even obtain ML estimates for the linear regression model. We assume that the observed data are samples from a normal distribution with mean \(\mu_n = \alpha + \beta x_n\) and variance \(\sigma^2\). For this model, the least-squares estimate that we learned earlier is also the ML estimate.
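As a rough sketch of this equivalence, we can maximize the normal log-likelihood with `optim()` and compare the result to `lm()`; the simulated data below are purely illustrative.

# a sketch: ML estimates for the linear regression model match least squares from lm()
# (the simulated data are purely illustrative)
set.seed(1357)
n <- 200
x <- rnorm(n)
y <- 1 + 2*x + rnorm(n, sd = 1.5)

calc_log_lik <- function(theta) {
  alpha <- theta[1]; beta <- theta[2]; sigma <- exp(theta[3])
  mu <- alpha + beta*x
  sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
est <- optim(par = c(0, 0, 0), fn = calc_log_lik, control = list(fnscale = -1))
est$par[1:2]     # ML estimates of alpha and beta
coef(lm(y ~ x))  # least-squares estimates for comparison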
2.3.6 Exercises
Exercise 2.5 Suppose we collect \(N\) random samples \(x = \{x_1, x_2, ..., x_N\}\) and model each draw as a random variable \(X \sim \text{exponential}(\lambda)\) with pdf \(f(x_n \mid \lambda) = \lambda e^{-\lambda x_n}\). Find the maximum likelihood estimator of \(\lambda\).
Solution
The math follows the Poisson example closely. However, the solution is the inverse: \(\hat{\lambda} = \frac{N}{\sum_{n = 1}^N x_n } = \frac{1}{\text{avg}(x)}\).

Exercise 2.6 Model the data from Exercise 2.2 as draws from an exponential distribution with rate parameter \(\lambda\). Use the maximum likelihood estimator from Exercise 2.5 to estimate \(\lambda\).
What if you want to estimate the mean duration \(\mu = \frac{1}{\lambda}\) rather than the failure rate \(\lambda\)? Can you just use \(\hat{\mu} = \frac{1}{\hat{\lambda}}\)?
Solution
x <- c(142, 773, 333, 432, 1823)/365  # rescale to years
1/mean(x)  # ML estimator of lambda
## [1] 0.520982
Like MM estimators, ML estimators are invariant under transformation, though this is not as obvious as it is for MM estimators. For this problem, \(\hat{\mu} = \frac{1}{\hat{\lambda}}\).
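Because of this invariance, the ML estimate of \(\mu\) is just the sample average itself. As a quick check with the `x` defined in the solution above:

mean(x)  # ML estimate of the mean duration mu, in years (equals 1/0.520982)

## [1] 1.919452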