2.2 Method of Moments
2.2.1 Mechanics
Suppose we have a random variable \(X\) with distribution \(f\). Then we refer to \(E(X^k)\) as the \(k\)-th moment of the distribution or population. Similarly, we refer to \(\text{avg}(x^k)\) as the \(k\)-th sample moment.
For example, recall that \(V(X) = E \left(X^2 \right) - \left[ E(X)\right]^2\). That is, the variance of \(X\) is the difference between the second moment and the square of the first moment.
Recall that the law of large numbers guarantees that \(\text{avg}(x) \xrightarrow[]{p} E(X)\). Thus, the first sample moment (the average) converges in probability to the first moment of \(f\) (the expected value or mean).
By the law of large numbers applied to \(X^k\) (together with the law of the unconscious statistician), we can similarly guarantee that \(\text{avg}(x^k) \xrightarrow[]{p} E(X^k)\). Thus, the sample moments converge in probability to the corresponding moments of \(f\).
Now suppose that \(f\) has parameters \(\theta_1, \theta_2, ..., \theta_k\) so that \(X \sim f(\theta_1, \theta_2, ..., \theta_k)\). We know (or can solve for) the moments of \(f\), so that \(E(X^1) = g_1(\theta_1, \theta_2, ..., \theta_k)\), \(E(X^2) = g_2(\theta_1, \theta_2, ..., \theta_k)\), and so on.
To use the method of moments, set the first \(k\) sample moments equal to the first \(k\) moments of \(f\) and relabel \(\theta_i\) as \(\hat{\theta}_i\). Solve the system of equations for each \(\hat{\theta}_i\).
2.2.2 Example: Gamma Distribution
Suppose we have \(N\) samples with replacement \(x_n\) for \(n \in \{1, 2, ..., N\}\) from a \(\text{gamma}(\alpha, \beta)\) distribution. We can use the method of moments to derive estimators of \(\alpha\) and \(\beta\). We have two parameters, so we use the first two moments of \(f\). For the gamma distribution, \(E(X) = \frac{\alpha}{\beta}\) and \(E(X^2) = V(X) + E(X)^2 = \frac{\alpha}{\beta^2} + \left( \frac{\alpha}{\beta} \right)^2 = \frac{\alpha(\alpha + 1)}{\beta^2}\).
Since we have two parameters, we set the first two sample moments equal to the first two moments of the gamma distribution, add hats, and solve.
\[ \begin{aligned} \displaystyle \frac{1}{N} \sum_{n = 1}^{N} x_n &= \frac{\hat{\alpha}}{\hat{\beta}}\\ \displaystyle \frac{1}{N} \sum_{n = 1}^{N} x_n^2 &= \frac{\hat{\alpha}(\hat{\alpha} + 1)}{\hat{\beta}^2}\\ \end{aligned} \]
For simplicity, let \(m_1 = \frac{1}{N} \sum_{n = 1}^{N} x_n\) and \(m_2 = \frac{1}{N} \sum_{n = 1}^{N} x_n^2\). Then, solving for \(\hat{\alpha}\) and \(\hat{\beta}\), we obtain
\[ \begin{aligned} \displaystyle \hat{\alpha} &= \frac{m_1^2}{m_2 - m_1^2}\text{ and }\\ \displaystyle \hat{\beta} &= \frac{m_1}{m_2 - m_1^2}\text{.}\\ \end{aligned} \]
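Here is a minimal sketch in R that checks these estimators on simulated draws (the function name calc_gamma_mm and the simulation settings are our own, and we assume the rate parameterization used above):
# method of moments estimators for gamma(alpha, beta), rate parameterization
calc_gamma_mm <- function(x) {
  m1 <- mean(x)    # first sample moment
  m2 <- mean(x^2)  # second sample moment
  alpha_hat <- m1^2 / (m2 - m1^2)
  beta_hat <- m1 / (m2 - m1^2)
  c(alpha_hat = alpha_hat, beta_hat = beta_hat)
}

# check on simulated draws with alpha = 2 and beta = 3
set.seed(1234)
x <- rgamma(10000, shape = 2, rate = 3)
calc_gamma_mm(x)
With 10,000 draws, both estimates should land close to the true values of 2 and 3.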
2.2.3 Example: Toothpaste Cap Problem
We have a toothpaste cap, one with a wide bottom and a narrow top. We’re going to toss the toothpaste cap. It can end up lying on its side, its (wide) bottom, or its (narrow) top.
We want to estimate the probability of the toothpaste cap landing on its top.
We can model each toss as a Bernoulli trial, thinking of each toss as a random variable \(X\) where \(X \sim \text{Bernoulli}(\pi)\). If the cap lands on its top, we think of the outcome as 1. If not, as 0.
Suppose we toss the cap \(N\) times and observe \(k\) tops. How can we estimate the parameter \(\pi\)?
We can use the method of moments to estimate this single parameter.
- Set the first moment of the sample to the first moment of the Bernoulli distribution.
- Add a hat to the quantities to estimate.
- Solve.
This process is nearly trivial for the Bernoulli distribution: the first moment is \(E(X) = \pi\), and the sample average of the 0/1 outcomes is \(\frac{k}{N}\), so \(\hat{\pi} = \frac{k}{N}\).
Thus, the fraction of the tosses that land on the top is the method of moments estimator of the probability of getting a top.
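As a minimal sketch (the 150 tosses and the true probability of 0.15 below are made-up values for illustration), we can simulate the procedure in R:
# simulate N tosses of the cap, where 1 = lands on its top
set.seed(1234)
N <- 150
tosses <- rbinom(N, size = 1, prob = 0.15)  # pretend the true pi is 0.15
k <- sum(tosses)  # number of tops observed
pi_hat <- k / N   # method of moments estimate of pi
pi_hat
With this many tosses, pi_hat should land reasonably close to the value used to simulate the data.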
2.2.4 Example: Population Average
Suppose we have \(N\) samples with replacement \(x_n\) for \(n \in \{1, 2, ..., N\}\) from a population. We want to estimate the population average.
To estimate the population average, we would simply set the first sample moment \(m_1 = \displaystyle \frac{1}{N} \sum_{n = 1}^{N} x_n^1\) equal to the first population moment \(\mu_1 = E(X^1)\). Then we equate the sample moments to the population moments, add hats to the quantities of interest, and solve. Because \(E(X^1)\) is the population average, there’s not much work to do.
\[ \begin{aligned} m_1 &= \mu_1\\ \displaystyle \frac{1}{N} \sum_{n = 1}^{N} x_n &= \widehat{E(X)}\\ &= \text{estimate of population average}\\ \end{aligned} \]
Once we solve for the quantity of interest, we can use that solution as the estimate. In this example, the sample average \(\displaystyle \frac{1}{N} \sum_{n = 1}^{N} x_n\) is the method of moments estimator of the population average.
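For a concrete sketch (the population below is simulated purely for illustration), the sample average recovers the population average:
# a made-up, right-skewed population with average near 10
set.seed(1234)
population <- rexp(100000, rate = 1/10)
x <- sample(population, size = 500, replace = TRUE)
mean(x)  # method of moments estimate of the population average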
2.2.5 Example: Population Average and SD
Suppose we also want the population SD. We set the second sample moment \(m_2 = \displaystyle \frac{1}{N} \sum_{n = 1}^{N} x_n^2\) equal to the second population moment \(\mu_2 = E(X^2)\). Then we equate the sample moments to the population moments and solve for the quantities of interest.
\[ \begin{aligned} \displaystyle \frac{1}{N} \sum_{n = 1}^{N} x_n &= \widehat{E(X)}\\ \displaystyle \frac{1}{N} \sum_{n = 1}^{N} x_n^2 &= \widehat{E(X^2)}\\ \end{aligned} \] Recall that the variance is \(V(X) = E \left( X^2 \right) - E \left( X \right)^2\) and the population SD is \(\text{SD}(X) = \sqrt{V(X)}\).
\[ \begin{aligned} \displaystyle \sqrt{\frac{1}{N} \sum_{n = 1}^{N} x_n^2 - \left[ \displaystyle \frac{1}{N} \sum_{n = 1}^{N} x_n \right]^2} &= \sqrt{\widehat{E(X^2)} - \widehat{E(X)}^2}\\ &= \text{estimate of population SD} \end{aligned} \] In this case, the sample SD is the method of moments estimator for the population SD.
Notes
- This is the same formula at the top of p. 74 of FPP.
- Some authors recommend dividing by \(N - 1\) rather than \(N\), using \(\sqrt{\frac{1}{N - 1} \sum_{n = 1}^{N} \left( x_n - \text{avg}(x) \right)^2}\). However, the formula using \(N - 1\) is not the method of moments estimator of the population SD (see the comparison in the sketch below).
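Here is a small sketch (the data are simulated purely for illustration) comparing the method of moments SD, which divides by \(N\), with R's built-in sd(), which divides by \(N - 1\):
# method of moments estimator of the population SD (divides by N)
calc_sd_mm <- function(x) {
  sqrt(mean(x^2) - mean(x)^2)
}

set.seed(1234)
x <- rnorm(10, mean = 50, sd = 5)
calc_sd_mm(x)  # method of moments estimate
sd(x)          # conventional estimate using N - 1; slightly larger
The two estimates differ noticeably for small \(N\), but the difference vanishes as \(N\) grows.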
2.2.6 Example: German Tank Problem
In World War II, the Germans manufactured some unknown number of tanks \(\nu\), but gave each a sequential serial number, so the set of serial numbers is \(S = \{1, 2, ..., \nu\}\). It’s of great strategic importance to know \(\nu\), the total number of tanks manufactured.
When the Allies would destroy or capture a tank, they would record the serial number. Soon they had a data set of \(N\) observed serial numbers \(\{x_1, x_2, ..., x_N\}\), which they can treat as a random sample without replacement from \(S\).
Using the MOM, we can develop an estimator \(\hat{\nu}_{MM}\) of the parameter \(\nu\). We have only a single parameter \(\nu\), so we match just the first sample moment \(\frac{1}{N}\sum_{n = 1}^{N} x_n = \text{avg}(x)\) with the first population moment \(E(X) = \frac{\nu + 1}{2}\).
Then we have
\[ \begin{aligned} \frac{1}{N} \sum_{n = 1}^{N} x_n &= \frac{\hat{\nu}_{MM} + 1}{2}\\ \frac{2}{N} \sum_{n = 1}^{N} x_n &= \hat{\nu}_{MM} + 1\\ \frac{2}{N} \left( \sum_{n = 1}^{N} x_n \right) - 1& = \hat{\nu}_{MM} \end{aligned} \]
Here’s a small simulation to demonstrate the estimator in repeated samples:
# the estimator
calc_nu_hat <- function(x) {
  N <- length(x)
  sum_x <- sum(x)
  nu_hat <- ((2 / N) * sum_x) - 1
  return(nu_hat)
}

# simulation parameters
n_sims <- 5  # the number of repeated samples
N <- 5       # the number of observations in each sample
nu <- 100    # the true value of nu

# do the simulation
set.seed(12345)
for (i in 1:n_sims) {
  # take the ith sample
  x <- sample(1:nu, size = N, replace = FALSE)
  x <- sort(x)  # sort x for convenience
  # compute the estimate
  nu_hat <- calc_nu_hat(x)
  # print the results
  data <- paste0(stringr::str_pad(x, 3, pad = "0"), collapse = ", ")
  estimate <- paste0("nu_hat = ", round(nu_hat, 2))
  cat("Sim. #", i, ": ", data, " --> ", estimate, "\n", sep = "")
}
## Sim. #1: 014, 051, 080, 090, 092 --> nu_hat = 129.8
## Sim. #2: 024, 058, 075, 093, 096 --> nu_hat = 137.4
## Sim. #3: 002, 038, 075, 086, 088 --> nu_hat = 114.6
## Sim. #4: 010, 032, 040, 081, 094 --> nu_hat = 101.8
## Sim. #5: 001, 030, 038, 039, 076 --> nu_hat = 72.6
Take a look at the 5 data sets above and the associated MOM estimates. Supposing you didn’t know \(\nu\), would these estimates seem reasonable? Can you spot any problems? (Look closely at Simulation #5.)
2.2.7 Remarks
The method of moments has some advantages:
- MOM is intuitive; it just makes sense that the sample mean is a reasonable estimate of the population mean.
- MOM is numerically easy, usually. Because we’re just computing (and sometimes then transforming) sample moments, we can usually compute these estimates quickly.
- MOM demands few assumptions. We don’t need to specify the full distribution of the data. It’s sufficient, for example, to declare that the data have a mean \(\mu\) and a variance \(\sigma^2\) to obtain MOM estimators of these quantities.
However, the method has three disadvantages:
- MOM estimators are rarely better and sometimes worse than maximum likelihood estimators.
- MOM estimates sometimes conflict with the observed data (see Simulation #5 of the German tank problem above, where the estimate falls below the largest observed serial number).
- MOM estimates sometimes fall outside the parameter space altogether (e.g., negative estimates of parameters that can only be positive).
Because of these disadvantages, researchers rarely use the method of moments. However, there’s a general, intuitive principle here: (functions of) sample moments are reasonable estimates of (functions of) population moments. In particular,
- \(\frac{1}{N} \sum_{n = 1}^{N} x_n\) is a reasonable estimate of a population mean, and
- \(\sqrt{\frac{1}{N} \sum_{n = 1}^{N} x_n^2 - \left[ \frac{1}{N} \sum_{n = 1}^{N} x_n \right]^2}\) is a reasonable estimate of the population SD.
2.2.8 Exercises
Exercise 2.3 Suppose we collect \(N\) random samples \(x = \{x_1, x_2, ..., x_N\}\) and model each draw as a random variable \(X \sim \text{exponential}(\lambda)\) with pdf \(f(x_n | \lambda) = \lambda e^{-\lambda x_n}\). Find the method of moments estimator of \(\lambda\). Hint: the mean of the exponential distribution is \(\frac{1}{\lambda}\).
Solution
Since we have a single parameter to estimate, we just set the first sample moment (the average) equal to the first moment of the distribution (the mean, \(\frac{1}{\lambda}\)) and add a hat. Thus, \(\text{avg}(x) = \frac{1}{\hat{\lambda}}\) and therefore \(\hat{\lambda} = \frac{1}{\text{avg}(x)}\).
Exercise 2.4 Model the data from Exercise 2.2 as draws from an exponential distribution with rate parameter \(\lambda\). Use the method of moments estimator from Exercise 2.3 to estimate \(\lambda\).
What if you want to estimate the mean duration \(\mu = \frac{1}{\lambda}\) rather than the failure rate \(\lambda\)? Can you just use \(\hat{\mu} = \frac{1}{\hat{\lambda}}\)?
Solution
x <- c(142, 773, 333, 432, 1823) / 365  # rescale to years
1 / mean(x)  # mm estimator of lambda
## [1] 0.520982
By construction, MOM estimators are invariant under transformation, so the MOM estimate of \(g(\theta)\) can be found by transforming the MOM estimate of \(\theta\), so \(\widehat{g(\theta)} = g(\hat{\theta})\). For this problem, \(\hat{\mu} = \frac{1}{\hat{\lambda}}\).
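Continuing the code above, the estimate of the mean duration is just the sample average, since \(\hat{\mu} = \frac{1}{\hat{\lambda}} = \text{avg}(x)\):
mean(x)  # mm estimator of mu, the mean duration in years
## [1] 1.919452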