class: center, middle, inverse, title-slide

.title[
# Point Estimates
]
.date[
### updated: 2022-11-14
]

---
class: middle
layout: true

---
class: center

### King, Keohane, and Verba (1996, p. 46) define *inference* as "the process of using the facts we know to learn about facts we do not know."

Consider the following three targets of inference:

---

## The Average Treatment Effect

- Suppose you conduct an experiment in which you assign `\(N\)` subjects to either treatment or control.
- For each subject `\(n\)`, you observe either the outcome under the treatment condition `\(Y^{T}_n\)` or the outcome under control `\(Y^{C}_n\)`.
- Define the *average treatment effect* or *ATE* as `\(\displaystyle \frac{1}{N}\sum_{n = 1}^{N}\left( Y^{T}_n - Y^{C}_n \right)\)`.
- Because we cannot place subject `\(n\)` in both treatment and control, we cannot observe the ATE; we can only estimate it.

---

## Features of a Population, Using a Sample

- Suppose we have a random sample of `\(N\)` observations from a much larger population.
- We can use the sample to estimate features of the population, such as the average of a variable or the correlation between two variables.
- Because we cannot (or perhaps choose not to) observe each member of the population, we cannot observe the features of the population directly; we can only estimate them.

---

## Parameters of a Stochastic Model

- Suppose the outcome variable `\(y\)` is a collection of samples from a distribution `\(f(\theta)\)`.
- We cannot observe `\(\theta\)` directly, but we can use the observed samples `\(y\)` to estimate `\(\theta\)`.
- For example, suppose you have a binary outcome variable `\(y\)` that you model as draws from a Bernoulli distribution, so that `\(y_i \sim \text{Bernoulli}(\pi)\)`. Your inferential target would not be the proportion of ones in the sample, but the value of `\(\pi\)`.

---

Modeling these three targets of inference is identical in some situations and very similar in many others.

**I focus on estimating the parameters of a stochastic model.** In this situation, we observe a sample `\(y\)` from a particular distribution and use the sample to estimate the parameters of that distribution.

---

We consider two types of estimates:

1. *Point Estimates*: using the observed data to calculate a *single value* or best guess for the unobservable quantity of interest.
1. *Interval Estimates*: using the observed data to calculate a *range of values* for the unobservable quantity of interest.

For each type of estimate, we consider:

1. How to *develop* an estimator.
1. How to *evaluate* an estimator.

---

# Three Types

Today, we'll discuss three different types of point estimates:

1. Bayesian Point Estimates
1. Method of Moments
1. Maximum Likelihood

---
class: middle

# Bayesian Point Estimates

Bayesian inference follows a simple recipe:

1. Choose a distribution for the data.
1. Choose a distribution to describe your prior beliefs.
1. Update the prior distribution upon observing the data by computing the posterior distribution.
1. Summarize the **location** of the posterior distribution.

---

# Mechanics of Bayesian Inference

Suppose we have a random sample from a distribution `\(f(x; \theta)\)` that depends on the unknown parameter `\(\theta\)`.

Bayesian inference models our *beliefs* about the unknown parameter `\(\theta\)` as a distribution. It answers the question: what should we believe about `\(\theta\)`, given the observed samples `\(x = \{x_1, x_2, ..., x_n\}\)` from `\(f(x; \theta)\)`?
These beliefs are simply the conditional distribution `\(f(\theta \mid x)\)`. By Bayes' rule, `\(\displaystyle f(\theta \mid x) = \frac{f(x \mid \theta)f(\theta)}{f(x)} = \frac{f(x \mid \theta)f(\theta)}{\displaystyle \int_{-\infty}^\infty f(x \mid \theta)f(\theta) d\theta}\)`.

`\(\displaystyle \underbrace{f(\theta \mid x)}_{\text{posterior}} = \frac{\overbrace{f(x \mid \theta)}^{\text{likelihood}} \times \overbrace{f(\theta)}^{\text{prior}}}{\displaystyle \underbrace{\int_{-\infty}^\infty f(x \mid \theta)f(\theta) d\theta}_{\text{normalizing constant}}}\)`

---

# Mechanics of Bayesian Inference

There are four parts to a Bayesian analysis:

`\(\displaystyle \underbrace{f(\theta \mid x)}_{\text{posterior}} = \frac{\overbrace{f(x \mid \theta)}^{\text{likelihood}} \times \overbrace{f(\theta)}^{\text{prior}}}{\displaystyle \underbrace{\int_{-\infty}^\infty f(x \mid \theta)f(\theta) d\theta}_{\text{normalizing constant}}}\)`

1. `\(f(\theta \mid x)\)`. "The posterior;" what we're trying to find. This distribution models our beliefs about the parameter `\(\theta\)` given the data `\(x\)`.
1. `\(f(x \mid \theta)\)`. "The likelihood." This distribution models the conditional density/probability of the data `\(x\)` given the parameter `\(\theta\)`. We need to invert the conditioning in order to find the posterior.
1. `\(f(\theta)\)`. "The prior;" our beliefs about `\(\theta\)` prior to observing the sample `\(x\)`.
1. `\(f(x) =\int_{-\infty}^\infty f(x \mid \theta)f(\theta) d\theta\)`. A normalizing constant. Recall that the role of the normalizing constant is to force the distribution to integrate or sum to one. Therefore, we can safely ignore this constant until the end, and then find the proper normalizing constant.

It's convenient to choose a **conjugate** prior distribution that, when combined with the likelihood, produces a posterior from the same family.

---

# The Toothpaste Cap Problem

As a running example, we use the **toothpaste cap problem**:

> We have a toothpaste cap--one with a wide bottom and a narrow top. We're going to toss the toothpaste cap. It can either end up lying on its side, its (wide) bottom, or its (narrow) top.

> We want to estimate the probability of the toothpaste cap landing on its top.

> We can model each toss as a Bernoulli trial, thinking of each toss as a random variable `\(X\)` where `\(X \sim \text{Bernoulli}(\pi)\)`. If the cap lands on its top, we think of the outcome as 1. If not, as 0.

> Suppose we toss the cap `\(N\)` times and observe `\(k\)` tops. What is the posterior distribution of `\(\pi\)`?

---

# The Likelihood (The Boring Part)

According to the model, `\(f(x_i \mid \pi) = \pi^{x_i} (1 - \pi)^{(1 - x_i)}\)`.

Because the samples are iid, we can find the *joint* distribution `\(f(x) = f(x_1) \times ... \times f(x_N) = \prod_{i = 1}^N f(x_i)\)`.

We're just multiplying `\(k\)` `\(\pi\)`s (i.e., each of the `\(k\)` ones has probability `\(\pi\)`) and `\((N - k)\)` `\((1 - \pi)\)`s (i.e., each of the `\(N - k\)` zeros has probability `\(1 - \pi\)`), so that `\(f(x | \pi) = \pi^{k} (1 - \pi)^{(N - k)}\)`.

$$
\text{the likelihood: } f(x | \pi) = \pi^{k} (1 - \pi)^{(N - k)}, \text{where } k = \sum_{n = 1}^N x_n \\
$$
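---

# The Likelihood (The Boring Part)

To make this concrete, here's a small sketch in R. The data are hypothetical (10 tosses with 2 tops) and are used only to show the shape of the likelihood:

```r
# likelihood of the Bernoulli model, evaluated on a grid of candidate pi values
N <- 10   # hypothetical number of tosses
k <- 2    # hypothetical number of tops

pi_grid <- seq(0.01, 0.99, by = 0.01)
likelihood <- pi_grid^k * (1 - pi_grid)^(N - k)

# the likelihood peaks at k/N; more data makes the peak sharper
plot(pi_grid, likelihood, type = "l", xlab = "pi", ylab = "likelihood")
```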
---

# The Prior (The Awkward Part)

The prior describes your beliefs about `\(\pi\)` *before* observing the data.

Here are some questions that we might ask ourselves:

1. What's the most likely value of `\(\pi\)`?
1. Are our beliefs best summarized by a distribution that's skewed to the left or right?
1. `\(\pi\)` is about _____, give or take _____ or so.
1. There's a 25% chance that `\(\pi\)` is less than ____.
1. There's a 25% chance that `\(\pi\)` is greater than ____.

---

# The Prior (The Awkward Part)

The prior describes your beliefs about `\(\pi\)` *before* observing the data.

Here are some questions that we might ask ourselves:

1. What's the most likely value of `\(\pi\)`? *Perhaps 0.15.*
1. Are our beliefs best summarized by a distribution that's skewed to the left or right? *To the right.*
1. `\(\pi\)` is about _____, give or take _____ or so. *Perhaps 0.17 and 0.10.*
1. There's a 25% chance that `\(\pi\)` is less than ____. *Perhaps 0.05.*
1. There's a 25% chance that `\(\pi\)` is greater than ____. *Perhaps 0.20.*

---

# The Prior (The Awkward Part)

Given these answers, we can sketch the pdf of the prior distribution for `\(\pi\)`.

![](point-slides_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

# The Prior (The Awkward Part)

We need to find a density function that matches these prior beliefs.

For the Bernoulli model, the *beta distribution* is the conjugate prior. (The beta prior produces a beta posterior.) While a conjugate prior is not crucial in general, it makes the math much more tractable.

---

# The Prior (The Awkward Part)

So then what beta distribution captures our prior beliefs?

See the code snippet linked in today's HW assignment. Go find *your* prior density.

---

# The Prior (The Awkward Part)

I find that setting the parameters `\(\alpha\)` and `\(\beta\)` of the beta distribution to 3 and 15, respectively, captures my prior beliefs about the probability of getting a top.

![](point-slides_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

The pdf of the beta distribution is `\(f(x) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1}(1 - x)^{\beta - 1}\)`. Remember that `\(B()\)` is the beta function, so `\(\frac{1}{B(\alpha, \beta)}\)` is a constant.

---

# The Prior (The Awkward Part)

Let's denote our chosen values of `\(\alpha = 3\)` and `\(\beta = 15\)` as `\(\alpha^*\)` and `\(\beta^*\)`. As we'll see in a moment, it's convenient to distinguish the parameters in the prior distribution from other parameters.

`\(f(\pi) = \frac{1}{B(\alpha^*, \beta^*)} \pi^{\alpha^* - 1}(1 - \pi)^{\beta^* - 1}\)`
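---

# The Prior (The Awkward Part)

One way to check a candidate prior against the answers above is to compute its summaries directly. A minimal sketch in base R (the values in the comments are approximate):

```r
alpha_star <- 3
beta_star  <- 15

# mean and standard deviation ("about ____, give or take ____")
alpha_star / (alpha_star + beta_star)              # roughly 0.17
sqrt((alpha_star * beta_star) /
     ((alpha_star + beta_star)^2 * (alpha_star + beta_star + 1)))  # roughly 0.09

# mode (the most likely value)
(alpha_star - 1) / (alpha_star + beta_star - 2)    # 0.125

# quartiles: a 25% chance below the first value, a 25% chance above the second
qbeta(c(0.25, 0.75), alpha_star, beta_star)
```

If these summaries are too far from your answers, adjust `\(\alpha\)` and `\(\beta\)` and check again.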
---

# The Posterior (The Fun Part)

We need to compute the posterior by multiplying the likelihood times the prior (and then finding the normalizing constant).

`\(\displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} = \frac{\overbrace{f(x \mid \pi)}^{\text{likelihood}} \times \overbrace{f(\pi)}^{\text{prior}}}{\displaystyle \underbrace{\int_{-\infty}^\infty f(x \mid \pi)f(\pi) d\pi}_{\text{normalizing constant}}}\)`

We plug in the likelihood, plug in the prior, and denote the normalizing constant as `\(C_1\)` to remind ourselves that it's just a constant.

`\(\displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} = \frac{\overbrace{\left[ \pi^{k} (1 - \pi)^{(N - k) }\right] }^{\text{likelihood}} \times \overbrace{ \left[ \frac{1}{B(\alpha^*, \beta^*)} \pi^{\alpha^* - 1}(1 - \pi)^{\beta^* - 1} \right] }^{\text{prior}}}{\displaystyle \underbrace{C_1}_{\text{normalizing constant}}}\)`

---

# The Posterior (The Fun Part)

`\(\displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} = \frac{\overbrace{\left[ \pi^{k} (1 - \pi)^{(N - k) }\right] }^{\text{likelihood}} \times \overbrace{ \left[ \frac{1}{B(\alpha^*, \beta^*)} \pi^{\alpha^* - 1}(1 - \pi)^{\beta^* - 1} \right] }^{\text{prior}}}{\displaystyle \underbrace{C_1}_{\text{normalizing constant}}}\)`

We need to simplify the right-hand side. First, notice that the term `\(\frac{1}{B(\alpha^*, \beta^*)}\)` in the numerator is just a constant. We can incorporate that constant term with `\(C_1\)` by multiplying top and bottom by `\(B(\alpha^*, \beta^*)\)` and letting `\(C_2 = C_1 \times B(\alpha^*, \beta^*)\)`.

`\(\displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} = \frac{\overbrace{\left[ \pi^{k} (1 - \pi)^{(N - k) }\right] }^{\text{likelihood}} \times \overbrace{ \left[ \pi^{\alpha^* - 1}(1 - \pi)^{\beta^* - 1} \right] }^{\text{kernel of former prior}} }{\displaystyle \underbrace{C_2}_{\text{new normalizing constant}}}\)`

---

# The Posterior (The Fun Part)

`\(\displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} = \frac{\overbrace{\left[ \pi^{k} (1 - \pi)^{(N - k) }\right] }^{\text{likelihood}} \times \overbrace{ \left[ \pi^{\alpha^* - 1}(1 - \pi)^{\beta^* - 1} \right] }^{\text{kernel of former prior}} }{\displaystyle \underbrace{C_2}_{\text{new normalizing constant}}}\)`

Now we can collect the exponents with base `\(\pi\)` and the exponents with base `\((1 - \pi)\)`.

`\(\displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} = \frac{\left[ \pi^{k} \times \pi^{\alpha^* - 1} \right] \times \left[ (1 - \pi)^{(N - k) } \times (1 - \pi)^{\beta^* - 1} \right] }{ C_2}\)`

---

# The Posterior (The Fun Part)

`\(\displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} = \frac{\left[ \pi^{k} \times \pi^{\alpha^* - 1} \right] \times \left[ (1 - \pi)^{(N - k) } \times (1 - \pi)^{\beta^* - 1} \right] }{ C_2}\)`

Recalling that `\(x^a \times x^b = x^{a + b}\)`, we combine the powers.

`\(\displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} = \frac{\left[ \pi^{(\alpha^* + k) - 1} \right] \times \left[ (1 - \pi)^{[\beta^* + (N - k)] - 1} \right] }{ C_2}\)`

---

# The Posterior (The Fun Part)

`\(\displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} = \frac{\left[ \pi^{(\alpha^* + k) - 1} \right] \times \left[ (1 - \pi)^{[\beta^* + (N - k)] - 1} \right] }{ C_2}\)`

Except it might not integrate to one (and, in this case, it doesn't). We just need to find `\(C_2\)` so that it does integrate to one.

Because we're clever, we notice that this is *almost* a beta distribution with `\(\alpha = (\alpha^* + k)\)` and `\(\beta = [\beta^* + (N - k)]\)`. If `\(C_2 = B(\alpha^* + k, \beta^* + (N - k))\)`, then it *would* integrate to one, because the posterior is then *exactly* a `\(\text{beta}(\alpha^* + k, \beta^* + [N - k])\)` distribution.
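---

# The Posterior (The Fun Part)

Here's a quick numerical check of that algebra in R. This is a sketch only: the data (10 tosses, 2 tops) are hypothetical, and the grid approximation is used just for the comparison:

```r
a_star <- 3; b_star <- 15   # the prior chosen above
N <- 10; k <- 2             # hypothetical data: 10 tosses, 2 tops

pi_grid <- seq(0.001, 0.999, by = 0.001)

# unnormalized posterior: likelihood times prior
unnorm <- pi_grid^k * (1 - pi_grid)^(N - k) * dbeta(pi_grid, a_star, b_star)

# normalize numerically so the grid version integrates to (approximately) one
post_grid <- unnorm / (sum(unnorm) * 0.001)

# the conjugate answer: beta(alpha* + k, beta* + (N - k))
post_beta <- dbeta(pi_grid, a_star + k, b_star + (N - k))

max(abs(post_grid - post_beta))  # very close to zero: the two curves agree
```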
---

# The Posterior (The Fun Part)

This is completely expected. We chose a beta distribution for the prior because it would give us a beta posterior distribution.

For simplicity, we can denote the parameters of the beta posterior as `\(\alpha^\prime\)` and `\(\beta^\prime\)`, so that `\(\alpha^\prime = \alpha^* + k\)` and `\(\beta^\prime = \beta^* + [N - k]\)`.

$$
`\begin{aligned}
\displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} &= \frac{ \pi^{\overbrace{(\alpha^* + k)}^{\alpha^\prime} - 1} \times (1 - \pi)^{\overbrace{[\beta^* + (N - k)]}^{\beta^\prime} - 1} }{ B(\alpha^* + k, \beta^* + [N - k])} \\
&= \frac{ \pi^{\alpha^\prime - 1} \times (1 - \pi)^{\beta^\prime - 1} }{ B(\alpha^\prime, \beta^\prime)}, \text{where } \alpha^\prime = \alpha^* + k \text{ and } \beta^\prime = \beta^* + [N - k]
\end{aligned}`
$$

---

# The Posterior (The Fun Part)

$$
`\begin{aligned}
\displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} &= \frac{ \pi^{\overbrace{(\alpha^* + k)}^{\alpha^\prime} - 1} \times (1 - \pi)^{\overbrace{[\beta^* + (N - k)]}^{\beta^\prime} - 1} }{ B(\alpha^* + k, \beta^* + [N - k])} \\
&= \frac{ \pi^{\alpha^\prime - 1} \times (1 - \pi)^{\beta^\prime - 1} }{ B(\alpha^\prime, \beta^\prime)}, \text{where } \alpha^\prime = \alpha^* + k \text{ and } \beta^\prime = \beta^* + [N - k]
\end{aligned}`
$$

This is an elegant, simple solution. To obtain the parameters for the beta posterior distribution, we just...

- add the number of tops (Bernoulli successes) to the prior value for `\(\alpha\)`, and
- add the number of not-tops (sides and bottoms; Bernoulli failures) to the prior value for `\(\beta\)`.

---

# The Posterior (The Fun Part)

Suppose that I tossed the toothpaste cap 150 times and got 8 tops.

![](point-slides_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

# Method of Moments

Suppose a random variable `\(X\)`. Then we refer to `\(E(X^k)\)` as the `\(k\)`-th **moment** of the distribution or population. Similarly, we refer to `\(\text{avg}(x^k)\)` as the `\(k\)`-th **sample moment**.

For example, recall that `\(V(X) = E \left(X^2 \right) - \left[ E(X)\right]^2\)`. In this example, the variance of `\(X\)` is the difference between the second moment and the square of the first moment.

---

# Method of Moments

Recall that the law of large numbers guarantees that `\(\text{avg}(x) \xrightarrow[]{p} E(X)\)`. Thus, the first sample moment (the average) converges in probability to the first moment of `\(f\)` (the expected value or mean).

By the law of the unconscious statistician, we can similarly guarantee that `\(\text{avg}(x^k) \xrightarrow[]{p} E(X^k)\)`. Thus, the sample moments converge in probability to the moments of `\(f\)`.

---

# Method of Moments

Now suppose that `\(f\)` has parameters `\(\theta_1, \theta_2, ..., \theta_k\)` so that `\(X \sim f(\theta_1, \theta_2, ..., \theta_k)\)`. We know (or can solve for) the moments of `\(f\)`, so that `\(E(X^1) = g_1(\theta_1, \theta_2, ..., \theta_k)\)`, `\(E(X^2) = g_2(\theta_1, \theta_2, ..., \theta_k)\)`, and so on.

To use the method of moments, set the first `\(k\)` sample moments equal to the first `\(k\)` moments of `\(f\)` and relabel `\(\theta_i\)` as `\(\hat{\theta}_i\)`. Solve the system of equations for each `\(\hat{\theta}_i\)`.

---

# Example: Gamma Distribution

Suppose we have `\(N\)` independent samples `\(x_n\)` for `\(n \in \{1, 2, ..., N\}\)` from a `\(\text{gamma}(\alpha, \beta)\)` distribution.

Take a second to understand the gamma distribution. (Wikipedia is great!)

---

# Example: Gamma Distribution

We can use the method of moments to derive estimators of `\(\alpha\)` and `\(\beta\)`.
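As a concrete setup, here's a small sketch in R (the sample size and the shape and rate values are assumptions chosen only for illustration) that simulates gamma data and computes the two sample moments the derivation below will use:

```r
set.seed(1234)

# hypothetical data: 1,000 draws from a gamma distribution with shape 2 and rate 0.5
x <- rgamma(1000, shape = 2, rate = 0.5)

# the first two sample moments
m1 <- mean(x)    # estimates E(X) = alpha / beta = 4
m2 <- mean(x^2)  # estimates E(X^2) = alpha * (alpha + 1) / beta^2 = 24
```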
Since we have two parameters, we set the first two sample moments equal to the first two moments of the gamma distribution, add hats, and solve.

For the gamma distribution:

`\(E(X) = \frac{\alpha}{\beta}\)`

`\(E(X^2) = V(X) + E(X)^2 = \frac{\alpha}{\beta^2} + \left( \frac{\alpha}{\beta} \right)^2 = \frac{\alpha(\alpha + 1)}{\beta^2}\)`.

---

# Example: Gamma Distribution

Moments of the gamma distribution:

$$
`\begin{aligned}
\displaystyle E(X) & = \frac{\alpha}{\beta}\\
\displaystyle E(X^2) & = \frac{\alpha(\alpha + 1)}{\beta^2}\\
\end{aligned}`
$$

Set the first two sample moments equal to the first two moments of the gamma distribution and add hats.

$$
`\begin{aligned}
\displaystyle \frac{1}{N} \sum_{n = 1}^{N} x_n &= \frac{\hat{\alpha}}{\hat{\beta}}\\
\displaystyle \frac{1}{N} \sum_{n = 1}^{N} x_n^2 &= \frac{\hat{\alpha}(\hat{\alpha} + 1)}{\hat{\beta}^2}\\
\end{aligned}`
$$

---

# Example: Gamma Distribution

$$
`\begin{aligned}
\displaystyle \frac{1}{N} \sum_{n = 1}^{N} x_n &= \frac{\hat{\alpha}}{\hat{\beta}}\\
\displaystyle \frac{1}{N} \sum_{n = 1}^{N} x_n^2 &= \frac{\hat{\alpha}(\hat{\alpha} + 1)}{\hat{\beta}^2}\\
\end{aligned}`
$$

For simplicity, let `\(m_1 = \frac{1}{N} \sum_{n = 1}^{N} x_n\)` and `\(m_2 = \frac{1}{N} \sum_{n = 1}^{N} x_n^2\)`. Then, solving for `\(\hat{\alpha}\)` and `\(\hat{\beta}\)`, we obtain

$$
`\begin{aligned}
\displaystyle \hat{\alpha} &= \frac{m_1^2}{m_2 - m_1^2}\text{ and }\\
\displaystyle \hat{\beta} &= \frac{m_1}{m_2 - m_1^2}\text{.}\\
\end{aligned}`
$$

---

# Example: Toothpaste Cap Problem

1. Set the first moment of the sample to the first moment of the Bernoulli distribution.
1. Add a hat to the quantities to estimate.
1. Solve.

---

# Maximum Likelihood

Suppose we have a random sample from a distribution `\(f(x \mid \theta)\)`. We find the maximum likelihood (ML) estimator `\(\hat{\theta}\)` of `\(\theta\)` by maximizing the likelihood of the observed data with respect to `\(\theta\)`.

In short, we take the likelihood from earlier and find the parameter `\(\theta\)` that maximizes it.

`\(f(x | \pi) = \pi^{k} (1 - \pi)^{(N - k)}, \text{where } k = \sum_{n = 1}^N x_n\)`

---

# Maximum Likelihood

`\(f(x | \pi) = \pi^{k} (1 - \pi)^{(N - k)}, \text{where } k = \sum_{n = 1}^N x_n\)`

In practice, to make the math and/or computation a bit easier, we manipulate the likelihood function in two ways:

1. Relabel the likelihood function `\(f(x \mid \theta) = L(\theta)\)`, since it's weird to maximize with respect to a conditioning variable.
1. Work with `\(\log L(\theta)\)` rather than `\(L(\theta)\)`. Because `\(\log()\)` is a monotonically increasing function, the `\(\theta\)` that maximizes `\(L(\theta)\)` also maximizes `\(\log L(\theta)\)`.

---

# Maximum Likelihood

Suppose we have **independent** samples `\(x_1, x_2, ..., x_N\)` from `\(f(x \mid \theta)\)`. Then the likelihood is `\(f(x \mid \theta) = \prod_{n = 1}^N f(x_n \mid \theta)\)`. (Why can we just multiply these?)

Then `\(\log L(\theta) = \sum_{n = 1}^N \log \left[ f(x_n \mid \theta) \right]\)`.

The ML estimator `\(\hat{\theta}\)` of `\(\theta\)` is `\(\arg \max \log L(\theta)\)`.

In applied problems, we might be able to simplify `\(\log L\)` substantially. Occasionally, we can find a nice analytical maximum. In many cases, we have a computer find the parameter that maximizes `\(\log L\)`.

---

# Example: Toothpaste Cap Problem

For the toothpaste cap problem, we have the following likelihood, which I'm borrowing directly from earlier.
`\(f(x | \pi) = \pi^{k} (1 - \pi)^{(N - k)}, \text{where } k = \sum_{n = 1}^N x_n\)`

Relabel.

`\(L(\pi) = \pi^{k} (1 - \pi)^{(N - k)}\)`

Take the log and simplify.

`\(\log L(\pi) = k \log (\pi) + (N - k) \log(1 - \pi)\)`

---

# Example: Toothpaste Cap Problem

`\(\log L(\pi) = k \log (\pi) + (N - k) \log(1 - \pi)\)`

To find the ML estimator, we find `\(\hat{\pi}\)` that maximizes `\(\log L\)`. In this case, the analytical optimum is easy.

$$
`\begin{aligned}
\frac{d \log L}{d\hat{\pi}} = k \left( \frac{1}{\hat{\pi}}\right) + (N - k) \left( \frac{1}{1 - \hat{\pi}}\right)(-1) &= 0\\
\frac{k}{\hat{\pi}} - \frac{N - k}{1 - \hat{\pi}} &= 0 \\
\frac{k}{\hat{\pi}} &= \frac{N - k}{1 - \hat{\pi}} \\
k(1 - \hat{\pi}) &= (N - k)\hat{\pi} \\
k - k\hat{\pi} &= N\hat{\pi} - k\hat{\pi} \\
k &= N\hat{\pi} \\
\hat{\pi} &= \frac{k}{N} = \text{avg}(x)\\
\end{aligned}`
$$

$$
`\begin{aligned}
\hat{\pi} &= \frac{k}{N} = \text{avg}(x)\\
\end{aligned}`
$$

The ML estimator for the Bernoulli distribution is the fraction of successes or, equivalently, the average of the Bernoulli trials.

The collected data consist of 150 trials and 8 successes, so the ML estimate of `\(\pi\)` is `\(\frac{8}{150} \approx 0.053\)`. Compare the ML estimate with the posterior mean, median, and mode above.

---

# Evaluating Point Estimates

Three frequentist criteria:

1. bias
1. consistency
1. MVUE or BUE

---

# Bias

Imagine repeatedly sampling and computing the estimate `\(\hat{\theta}\)` of the parameter `\(\theta\)` for each sample. In this thought experiment, `\(\hat{\theta}\)` is a random variable.

We say that `\(\hat{\theta}\)` is **biased** if `\(E(\hat{\theta}) \neq \theta\)`.

We say that `\(\hat{\theta}\)` is **unbiased** if `\(E(\hat{\theta}) = \theta\)`.

We say that the **bias** of `\(\hat{\theta}\)` is `\(E(\hat{\theta}) - \theta\)`.

---

# Example: The Toothpaste Cap Problem

For example, we can compute the bias of our ML estimator of `\(\pi\)` in the toothpaste cap problem.

$$
`\begin{aligned}
E\left[ \frac{k}{N}\right] &= \frac{1}{N} E(k) = \frac{1}{N} E \overbrace{ \left( \sum_{n = 1}^N x_n \right) }^{\text{recall } k = \sum_{n = 1}^N x_n } = \frac{1}{N} \sum_{n = 1}^N E(x_n) = \frac{1}{N} \sum_{n = 1}^N \pi = \frac{1}{N}N\pi \\
&= \pi
\end{aligned}`
$$

Thus, `\(\hat{\pi}^{ML}\)` is an unbiased estimator of `\(\pi\)` in the toothpaste cap problem.

---

# Example: The Toothpaste Cap Problem

What about the posterior mean? Is it biased?

$$
`\begin{aligned}
E\left[ \frac{\alpha^* + k}{\alpha^* + \beta^* + N}\right] &= \frac{1}{\alpha^* + \beta^* + N} E(k + \alpha^*) = \frac{1}{\alpha^* + \beta^* + N} \left[ E(k) + \alpha^* \right] \\
& = \frac{1}{\alpha^* + \beta^* + N} \left[ \sum_{n = 1}^N E(x_n) + \alpha^* \right] \\
& = \frac{1}{\alpha^* + \beta^* + N} \left[ \sum_{n = 1}^N \pi + \alpha^* \right] \\
& = \frac{N\pi + \alpha^*}{\alpha^* + \beta^* + N}
\end{aligned}`
$$

In general, `\(\frac{N\pi + \alpha^*}{\alpha^* + \beta^* + N} \neq \pi\)` (they are equal only if `\(\pi = \frac{\alpha^*}{\alpha^* + \beta^*}\)`), so the posterior mean is a biased estimator of `\(\pi\)`.

---

# Consistency

Imagine taking a sample of size `\(N\)` and computing the estimate `\(\hat{\theta}_N\)` of the parameter `\(\theta\)`.

We say that `\(\hat{\theta}\)` is a **consistent** estimator of `\(\theta\)` if `\(\hat{\theta}\)` converges in probability to `\(\theta\)`. Intuitively, this means the following:

1. For a large enough sample, the estimator returns (essentially) the right answer.
1. For a large enough sample, the estimate `\(\hat{\theta}\)` does not vary any more, but collapses onto a single point, and that point is `\(\theta\)`.

---

# Consistency

Under weak, but somewhat technical, assumptions that usually hold, ML estimators are consistent.

Under even weaker assumptions, MM estimators are consistent.
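---

# Consistency

A small simulation sketch of what consistency looks like for the ML estimator `\(\hat{\pi} = k/N\)` (the true value of `\(\pi\)` below is an assumption chosen for illustration):

```r
set.seed(1234)
true_pi <- 0.15

# for each sample size, compute 1,000 ML estimates and see how much they vary
for (N in c(10, 100, 1000, 10000)) {
  pi_hat <- rbinom(1000, size = N, prob = true_pi) / N
  cat("N =", N, " mean =", round(mean(pi_hat), 3), " sd =", round(sd(pi_hat), 4), "\n")
}
```

As `\(N\)` grows, the estimates collapse onto the true value.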
---

# Consistency

Given that we always have finite samples, why is consistency valuable?

---

# Consistency

It does not follow that consistent estimators work well in small samples.

Consider the estimator `\(\hat{\pi}^{Bayes}\)`. By choosing appropriate (i.e., large) values for `\(\alpha^*\)` and `\(\beta^*\)`, we can make `\(E(\hat{\pi}^{Bayes})\)` whatever we like in the `\((0, 1)\)` interval. But `\(\hat{\pi}^{Bayes}\)` is consistent *regardless* of the values we choose for `\(\alpha^*\)` and `\(\beta^*\)`. Even though the posterior mean is consistent, it can be **highly biased** for finite samples.

However, as a rough guideline, consistent estimators work well in small samples; whether they actually work well in any particular situation requires a more careful investigation.

---

# MVUE

Imagine repeatedly sampling and computing the estimate `\(\hat{\theta}\)` of the parameter `\(\theta\)` for each sample. In this thought experiment, `\(\hat{\theta}\)` is a random variable.

We refer to `\(E \left[ (\hat{\theta} - \theta)^2 \right]\)` as the **mean-squared error** (MSE) of `\(\hat{\theta}\)`. Some people use the square root of the MSE, which we refer to as the root-mean-squared error (RMSE).

In general, we prefer estimators with a smaller MSE to estimators with a larger MSE.

---

# MVUE

Notice that an estimator can have a larger MSE because (1) it's more variable or (2) it's more biased. To see this, we can decompose the MSE into two components.

$$
`\begin{aligned}
E \left[ (\hat{\theta} - \theta)^2 \right] &= E \left[ \left( \hat{\theta} - E(\hat{\theta}) \right)^2 \right] + \left[ E(\hat{\theta}) - \theta \right]^2\\
\text{MSE}(\hat{\theta}) &= \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2
\end{aligned}`
$$

---

# MVUE

When designing an estimator, we usually follow this process:

1. Eliminate biased estimators, if an unbiased estimator exists.
1. Among the remaining unbiased estimators, select the one with the smallest variance. (The variance equals the MSE for unbiased estimators.)

This process does not necessarily result in the estimator with the smallest MSE, but it does give us the estimator with the smallest MSE *among the unbiased estimators.*

This seems tricky---how do we know we've got the estimator with the smallest MSE among the unbiased estimators? Couldn't there always be another, better unbiased estimator that we haven't considered?

It turns out that we have a theoretical lower bound on the variance of an estimator. No unbiased estimator can have a variance below the **Cramér-Rao Lower Bound**. If an unbiased estimator's variance equals the Cramér-Rao Lower Bound, then we say that the estimator **attains** the Cramér-Rao Lower Bound. We refer to an estimator that attains the Cramér-Rao Lower Bound as the **minimum-variance unbiased estimator** (MVUE) or the **best unbiased estimator** (BUE).

---

# MVUE

An MVUE is the gold standard. It is possible, though, that a *biased* alternative estimator has a smaller MSE (e.g., the posterior mean with the right prior distribution).

It's beyond our scope to establish whether particular estimators are the MVUE. However, in general, the *sample average* is the MVUE of the expected value of a distribution. Equivalently, the sample average is the MVUE of the population average.

If you are using `\(\hat{\theta} = \text{avg}(x)\)` to estimate `\(\theta = E(X)\)`, then `\(\hat{\theta}\)` is an MVUE.
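---

# MVUE

To make the MSE decomposition concrete, here's a small simulation sketch comparing the two toothpaste-cap estimators (the true `\(\pi\)` and the sample size are assumptions chosen only for illustration):

```r
set.seed(1234)
true_pi <- 0.15
N <- 50
a_star <- 3; b_star <- 15

# 10,000 repeated samples; compute both estimators for each sample
k <- rbinom(10000, size = N, prob = true_pi)
ml_est    <- k / N                                 # ML estimator (unbiased)
bayes_est <- (a_star + k) / (a_star + b_star + N)  # posterior mean (biased)

mse <- function(est) mean((est - true_pi)^2)
c(ml = mse(ml_est), bayes = mse(bayes_est))

# check that the MSE is (approximately) the variance plus the squared bias
c(ml    = var(ml_est)    + (mean(ml_est)    - true_pi)^2,
  bayes = var(bayes_est) + (mean(bayes_est) - true_pi)^2)
```

With a prior centered near the truth, the biased posterior mean can have a smaller MSE than the unbiased ML estimator; with a prior centered far from the truth, it may not.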