2.1 Bayesian Inference

Bayesian inference follows a simple recipe:

  1. Choose a distribution for the data.
  2. Choose a distribution to describe your prior beliefs.
  3. Update the prior distribution upon observing the data by computing the posterior distribution.

2.1.1 Mechanics

Suppose we have a random sample from a distribution \(f(x; \theta)\) that depends on the unknown parameter \(\theta\).

Bayesian inference models our beliefs about the unknown parameter \(\theta\) as a distribution. It answers the question: what should we believe about \(\theta\), given the observed samples \(x = \{x_1, x_2, ..., x_n\}\) from \(f(x; \theta)\)? These beliefs are simply the conditional distribution \(f(\theta \mid x)\).

By Bayes’ rule, \(\displaystyle f(\theta \mid x) = \frac{f(x \mid \theta)f(\theta)}{f(x)} = \frac{f(x \mid \theta)f(\theta)}{\displaystyle \int_{-\infty}^\infty f(x \mid \theta)f(\theta) d\theta}\).

\[ \displaystyle \underbrace{f(\theta \mid x)}_{\text{posterior}} = \frac{\overbrace{f(x \mid \theta)}^{\text{likelihood}} \times \overbrace{f(\theta)}^{\text{prior}}}{\displaystyle \underbrace{\int_{-\infty}^\infty f(x \mid \theta)f(\theta) d\theta}_{\text{normalizing constant}}} \] There are four parts to a Bayesian analysis.

  1. \(f(\theta \mid x)\). “The posterior;” what we’re trying to find. This distribution models our beliefs about parameter \(\theta\) given the data \(x\).
  2. \(f(x \mid \theta)\). “The likelihood.” This distribution models the conditional density/probability of the data \(x\) given the parameter \(\theta\). We need to invert the conditioning in order to find the posterior.
  3. \(f(\theta)\). “The prior;” our beliefs about \(\theta\) prior to observing the sample \(x\).
  4. \(f(x) =\int_{-\infty}^\infty f(x \mid \theta)f(\theta) d\theta\). A normalizing constant. Recall that the role of the normalizing constant is to force the distribution to integrate or sum to one. Therefore, we can safely ignore this constant until the end and then find the proper normalizing constant, as the sketch after this list illustrates.
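To see what “ignore the constant, then normalize at the end” looks like in practice, here is a minimal grid-approximation sketch in R. The numbers anticipate the toothpaste cap example below (a beta(3, 15) prior and 8 tops in 150 tosses); the grid itself is just for illustration.

# a grid approximation: compute likelihood x prior over a grid of parameter
# values, ignoring all constants, then normalize at the end
theta <- seq(0.001, 0.999, by = 0.001)  # grid of parameter values

# 8 successes in 150 Bernoulli trials with a beta(3, 15) prior (see below)
unnormalized <- theta^8 * (1 - theta)^(150 - 8) * theta^(3 - 1) * (1 - theta)^(15 - 1)

# normalize at the end so the discrete approximation sums to one
posterior <- unnormalized/sum(unnormalized)

# a point estimate from the grid: the (approximate) posterior mean
sum(theta*posterior)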

It’s convenient to choose a conjugate prior distribution that, when combined with the likelihood, produces a posterior from the same family.

As a running example, we use the toothpaste cap problem:

We have a toothpaste cap–one with a wide bottom and a narrow top. We’re going to toss the toothpaste cap. It can either end up lying on its side, its (wide) bottom, or its (narrow) top.

We want to estimate the probability of the toothpaste cap landing on its top.

We can model each toss as a Bernoulli trial, thinking of each toss as a random variable \(X\) where \(X \sim \text{Bernoulli}(\pi)\). If the cap lands on its top, we think of the outcome as 1. If not, as 0.
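Before we do any math, it can help to see this data-generating process in code. Here is a small simulation sketch; the value of \(\pi\) used to generate the tosses (0.15) is hypothetical, chosen only for illustration.

# simulate hypothetical toothpaste cap tosses as Bernoulli trials
set.seed(1234)

pi_true <- 0.15  # a hypothetical "true" chance of landing on the top
N <- 150         # number of tosses

x <- rbinom(N, size = 1, prob = pi_true)  # 1 = top, 0 = side or bottom
sum(x)  # number of tops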

Suppose we toss the cap \(N\) times and observe \(k\) tops. What is the posterior distribution of \(\pi\)?

2.1.1.1 The Likelihood

According to the model, \(f(x_i \mid \pi) = \pi^{x_i} (1 - \pi)^{(1 - x_i)}\). Because the samples are iid, we can find the joint distribution \(f(x \mid \pi) = f(x_1 \mid \pi) \times ... \times f(x_N \mid \pi) = \prod_{i = 1}^N f(x_i \mid \pi)\). We’re just multiplying \(k\) \(\pi\)s (i.e., each of the \(k\) ones has probability \(\pi\)) and \((N - k)\) \((1 - \pi)\)s (i.e., each of the \(N - k\) zeros has probability \(1 - \pi\)), so that \(f(x \mid \pi) = \pi^{k} (1 - \pi)^{(N - k)}\).

\[ \text{the likelihood: } f(x | \pi) = \pi^{k} (1 - \pi)^{(N - k)}, \text{where } k = \sum_{n = 1}^N x_n \\ \]
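To see what this likelihood looks like, we can plot it as a function of \(\pi\) for fixed data. The sketch below uses the data from later in this section (N = 150 tosses, k = 8 tops) and the same ggplot2 style as the other figures.

# plot the Bernoulli likelihood as a function of pi for fixed data
library(ggplot2)

N <- 150  # tosses (the data used later in this section)
k <- 8    # tops

ggplot() + xlim(0, 1) + 
  stat_function(fun = function(pi) pi^k*(1 - pi)^(N - k)) + 
  labs(title = "likelihood", x = "pi", y = "likelihood")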

2.1.1.2 The Prior

The prior describes your beliefs about \(\pi\) before observing the data.

Here are some questions that we might ask ourselves:

  1. What’s the most likely value of \(\pi\)? Perhaps 0.15.
  2. Are our beliefs best summarized by a distribution that’s skewed to the left or right? To the right.
  3. \(\pi\) is about _____, give or take _____ or so. Perhaps 0.17 and 0.10.
  4. There’s a 25% chance that \(\pi\) is less than ____. Perhaps 0.05.
  5. There’s a 25% chance that \(\pi\) is greater than ____. Perhaps 0.20.

Given these answers, we can sketch the pdf of the prior distribution for \(\pi\).

Now we need to find a density function that matches these prior beliefs. For this Bernoulli model, the beta distribution is the conjugate prior. While a conjugate prior is not crucial in general, it makes the math much more tractable.

So then what beta distribution captures our prior beliefs?

There’s a code snippet here to help you explore different beta distributions (also in the appendix).
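A sketch along those lines: overlay a few candidate beta densities and compare their shapes against the prior beliefs above. The particular \(\alpha\) and \(\beta\) values below are just starting points to try.

# compare a few candidate beta priors for pi
library(ggplot2)

ggplot() + xlim(0, 1) + 
  stat_function(aes(color = "beta(2, 10)"), fun = dbeta, args = list(shape1 = 2, shape2 = 10)) + 
  stat_function(aes(color = "beta(3, 15)"), fun = dbeta, args = list(shape1 = 3, shape2 = 15)) + 
  stat_function(aes(color = "beta(5, 25)"), fun = dbeta, args = list(shape1 = 5, shape2 = 25)) + 
  labs(x = "pi", y = "prior density", color = "candidate prior")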

After some exploration, we find that setting the parameters \(\alpha\) and \(\beta\) of the beta distribution to 3 and 15, respectively, captures our prior beliefs about the probability of getting a top.
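As a quick numerical check, we can compare the beta(3, 15) distribution to the answers above: its mean, standard deviation, and quartiles should be roughly in line with our give-or-take statements. A minimal sketch:

# check the beta(3, 15) prior against our stated beliefs
alpha_prior <- 3
beta_prior <- 15

alpha_prior/(alpha_prior + beta_prior)  # prior mean; about 0.17
sqrt((alpha_prior*beta_prior)/((alpha_prior + beta_prior)^2*(alpha_prior + beta_prior + 1)))  # prior sd
qbeta(c(0.25, 0.75), alpha_prior, beta_prior)  # 25th and 75th percentiles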

The pdf of the beta distribution is \(f(x) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1}(1 - x)^{\beta - 1}\). Remember that \(B()\) is the beta function, so \(\frac{1}{B(\alpha, \beta)}\) is a constant.

Let’s denote our chosen values of \(\alpha = 3\) and \(\beta = 15\) as \(\alpha^*\) and \(\beta^*\). As we see in a moment, it’s convenient to distinguish the parameters of the prior distribution from other parameters.

\[ \text{the prior: } f(\pi) = \frac{1}{B(\alpha^*, \beta^*)} \pi^{\alpha^* - 1}(1 - \pi)^{\beta^* - 1} \]

2.1.1.3 The Posterior

Now we need to compute the posterior by multiplying the likelihood times the prior and then finding the normalizing constant. \[ \text{the posterior: } \displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} = \frac{\overbrace{f(x \mid \pi)}^{\text{likelihood}} \times \overbrace{f(\pi)}^{\text{prior}}}{\displaystyle \underbrace{\int_{-\infty}^\infty f(x \mid \pi)f(\pi) d\pi}_{\text{normalizing constant}}} \\ \] Now we plug in the likelihood, plug in the prior, and denote the normalizing constant as \(C_1\) to remind ourselves that it’s just a constant.

\[ \text{the posterior: } \displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} = \frac{\overbrace{\left[ \pi^{k} (1 - \pi)^{(N - k) }\right] }^{\text{likelihood}} \times \overbrace{ \left[ \frac{1}{B(\alpha^*, \beta^*)} \pi^{\alpha^* - 1}(1 - \pi)^{\beta^* - 1} \right] }^{\text{prior}}}{\displaystyle \underbrace{C_1}_{\text{normalizing constant}}} \\ \]

Now we need to simplify the right-hand side.

First, notice that the term \(\frac{1}{B(\alpha^*, \beta^*)}\) in the numerator is just a constant. We can incorporate that constant term with \(C_1\) by multiplying top and bottom by \(B(\alpha^*, \beta^*)\) and letting \(C_2 = C_1 \times B(\alpha^*, \beta^*)\).

\[ \text{the posterior: } \displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} = \frac{\overbrace{\left[ \pi^{k} (1 - \pi)^{(N - k) }\right] }^{\text{likelihood}} \times \left[ \pi^{\alpha^* - 1}(1 - \pi)^{\beta^* - 1} \right] }{\displaystyle \underbrace{C_2}_{\text{new normalizing constant}}} \\ \]

Now we can collect the exponents with base \(\pi\) and the exponents with base \((1 - \pi)\).

\[ \text{the posterior: } \displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} = \frac{\left[ \pi^{k} \times \pi^{\alpha^* - 1} \right] \times \left[ (1 - \pi)^{(N - k) } \times (1 - \pi)^{\beta^* - 1} \right] }{ C_2} \\ \] Recalling that \(x^a \times x^b = x^{a + b}\), we combine the powers.

\[ \text{the posterior: } \displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} = \frac{\left[ \pi^{(\alpha^* + k) - 1} \right] \times \left[ (1 - \pi)^{[\beta^* + (N - k)] - 1} \right] }{ C_2} \\ \]

Because we’re clever, we notice that this is almost a beta distribution with \(\alpha = (\alpha^* + k)\) and \(\beta = [\beta^* + (N - k)]\). If \(C_2 = B(\alpha^* + k, \beta^* + (N - k))\), then the posterior is exactly a \(\text{beta}(\alpha^* + k, \beta^* + [N - k])\) distribution.

This is completely expected. We chose a beta distribution for the prior because it would give us a beta posterior distribution. For simplicity, we can denote the parameters of the beta posterior as \(\alpha^\prime\) and \(\beta^\prime\), so that \(\alpha^\prime = \alpha^* + k\) and \(\beta^\prime = \beta^* + [N - k]\).

\[ \begin{aligned} \text{the posterior: } \displaystyle \underbrace{f(\pi \mid x)}_{\text{posterior}} &= \frac{ \pi^{\overbrace{(\alpha^* + k)}^{\alpha^\prime} - 1} \times (1 - \pi)^{\overbrace{[\beta^* + (N - k)]}^{\beta^\prime} - 1} }{ B(\alpha^* + k, \beta^* + [N - k])} \\ &= \frac{ \pi^{\alpha^\prime - 1} \times (1 - \pi)^{\beta^\prime - 1} }{ B(\alpha^\prime, \beta^\prime)}, \text{where } \alpha^\prime = \alpha^* + k \text{ and } \beta^\prime = \beta^* + [N - k] \end{aligned} \]

This is an elegant, simple solution. To obtain the parameters for the beta posterior distribution, we just add the number of tops (Bernoulli successes) to the prior value for \(\alpha\) and the number of not-tops (sides and bottoms; Bernoulli failures) to the prior value for \(\beta\).

Suppose that I tossed the toothpaste cap 150 times and got 8 tops.

# load packages
library(ggplot2)  # for ggplot() and stat_function()

# prior parameters
alpha_prior <- 3
beta_prior <- 15

# data 
k <- 8
N <- 150

# posterior parameters
alpha_posterior <- alpha_prior + k
beta_posterior <- beta_prior + N - k

# plot prior and posterior
gg_prior <- ggplot() + 
  stat_function(fun = dbeta, args = list(shape1 = alpha_prior, shape2 = beta_prior)) + 
  labs(title = "prior distribution", x = "pi", y = "prior density")
gg_posterior <- ggplot() + 
  stat_function(fun = dbeta, args = list(shape1 = alpha_posterior, shape2 = beta_posterior)) + 
  labs(title = "posterior distribution", x = "pi", y = "posterior density")

library(patchwork)
gg_prior + gg_posterior

2.1.1.4 Bayesian Point Estimates

In this section, we want point estimates—a best guess at the parameter—not a full posterior distribution.

We have three options:

  1. The posterior mean. The posterior mean minimizes a squared-error loss function. That is, the cost of guessing \(a\) when the truth is \(\pi\) is \((a - \pi)^2\). In the case of the beta posterior, it’s just \(\dfrac{\alpha^\prime}{\alpha^\prime + \beta^\prime}\). For our prior and data, we have \(\dfrac{3 + 8}{(3 + 8) + (15 + 150 - 8)} \approx 0.065\).
  2. The posterior median. The posterior median minimizes an absolute loss function, where the cost of guessing \(a\) when the truth is \(\pi\) is \(|a - \pi|\). Intuitively, there’s a 50% chance that \(\pi\) falls above the posterior median and a 50% chance it falls below. In the case of the beta posterior, it’s approximately \(\dfrac{\alpha^\prime - \frac{1}{3}}{\alpha^\prime + \beta^\prime - \frac{2}{3}}\) (for \(\alpha^\prime, \beta^\prime > 1\)). For our prior and data, we have \(\dfrac{3 + 8 -\frac{1}{3}}{(3 + 8) + (15 + 150 - 8) - \frac{2}{3}} \approx 0.064\).
  3. The posterior mode. The posterior mode is the most likely value of \(\pi\), so it minimizes a loss function that penalizes all misses equally. In the case of the beta posterior, it’s just \(\dfrac{\alpha^\prime - 1}{\alpha^\prime + \beta^\prime - 2}\) (for \(\alpha^\prime, \beta^\prime > 1\)). For our prior and data, we have \(\dfrac{3 + 8 - 1}{(3 + 8) + (15 + 150 - 8) - 2} \approx 0.060\). The sketch after this list computes all three estimates in R.
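A minimal sketch, using the posterior parameters from the toothpaste cap example above:

# three Bayesian point estimates from the beta posterior
alpha_posterior <- 3 + 8        # alpha* + k
beta_posterior <- 15 + 150 - 8  # beta* + (N - k)

alpha_posterior/(alpha_posterior + beta_posterior)            # posterior mean
qbeta(0.5, alpha_posterior, beta_posterior)                   # posterior median (exact)
(alpha_posterior - 1)/(alpha_posterior + beta_posterior - 2)  # posterior mode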

2.1.2 Example: Poisson Distribution

Suppose we collect \(N\) random samples \(x = \{x_1, x_2, ..., x_N\}\) and model each draw as a random variable \(X \sim \text{Poisson}(\lambda)\). Find the posterior distribution of \(\lambda\) for the gamma prior distribution. Hint: the gamma distribution is the conjugate prior for the Poisson likelihood.

\[ \begin{aligned} \text{Poisson likelihood: } f(x \mid \lambda) &= \prod_{n = 1}^N \frac{\lambda^{x_n} e^{-\lambda}}{x_n!} \\ &= \displaystyle \left[ \frac{1}{\prod_{n = 1}^N x_n !} \right]e^{-N\lambda}\lambda^{\sum_{n = 1}^N x_n} \end{aligned} \]

\[ \text{Gamma prior: } f( \lambda; \alpha^*, \beta^*) = \frac{{\beta^*}^{\alpha^*}}{\Gamma(\alpha^*)} \lambda^{\alpha^* - 1} e^{-\beta^*\lambda} \] To find the posterior, we multiply the likelihood times the prior and normalize. Because the gamma prior distribution is the conjugate prior for the Poisson likelihood, we know that the posterior will be a gamma distribution.

\[ \begin{aligned} \text{Gamma posterior: } f( \lambda \mid x) &= \frac{\left( \displaystyle \left[ \frac{1}{\prod_{n = 1}^N x_n !} \right]e^{-N\lambda}\lambda^{\sum_{n = 1}^N x_n}\right) \times \left( \left[ \frac{{\beta^*}^{\alpha^*}}{\Gamma(\alpha^*)} \right] \lambda^{\alpha^* - 1} e^{-\beta^*\lambda}\right)}{C_1} \\ \end{aligned} \] Because \(x\), \(\alpha^*\), and \(\beta^*\) are fixed, the terms in square brackets are constants, so we can safely consider them part of the normalizing constant.

\[ \begin{aligned} &= \frac{\left( \displaystyle e^{-N\lambda}\lambda^{\sum_{n = 1}^N x_n}\right) \times \left( \lambda^{\alpha^* - 1} e^{-\beta^*\lambda}\right)}{C_2} \\ \end{aligned} \] Now we can collect the exponents with the same base.

\[ \begin{aligned} &= \frac{\left( \lambda^{\alpha^* - 1} \times \lambda^{\sum_{n = 1}^N x_n}\right) \times \left( \displaystyle e^{-N\lambda} \times e^{-\beta^*\lambda} \right)}{C_2} \\ &= \frac{\lambda^{ \overbrace{\left[ \alpha^* + \sum_{n = 1}^N x_n \right]}^{\alpha^\prime} - 1} e^{-\overbrace{[\beta^* + N]}^{\beta^\prime}\lambda} }{C_2} \\ \end{aligned} \]

We recognize this as almost a Gamma distribution with parameters \(\alpha^\prime = \alpha^* + \sum_{n = 1}^N x_n\) and \(\beta^\prime = \beta^* + N\). Indeed, if \(\frac{1}{C_2} = \frac{{\beta^\prime}^{\alpha^\prime}}{\Gamma(\alpha^{\prime})}\), then we have exactly a gamma distribution.

\[ \begin{aligned} &= \frac{{\beta^\prime}^{\alpha^\prime}}{\Gamma(\alpha^{\prime})} \lambda^{ \alpha^\prime - 1} e^{-\beta^\prime\lambda}, \text{where } \alpha^\prime = \alpha^* + \sum_{n = 1}^N x_n \text{ and } \beta^\prime = \beta^* + N \end{aligned} \]

Like the Bernoulli likelihood with the beta prior, the Poisson likelihood with the gamma prior gives a nice result. We start with values \(\alpha = \alpha^*\) and \(\beta = \beta^*\) of the gamma distribution so that the gamma prior distribution describes our prior beliefs about the parameter \(\lambda\) of the Poisson distribution. Then we add the sum of the data \(x\) to \(\alpha^*\) and the number of samples \(N\) to \(\beta^*\) to obtain the parameters of the gamma posterior distribution.

The code below computes the posterior and plots the prior and posterior distributions, then computes the point estimates.

# load packages
library(ggplot2)    # for ggplot() and stat_function()
library(patchwork)  # for combining plots with +

# set seed to make reproducible
set.seed(1234)

# prior parameters
alpha_prior <- 3
beta_prior <- 3

# create an "unknown" value of lambda to estimate
lambda <- 2

# generate a data set
N <- 5  # number of samples
x <- rpois(N, lambda = lambda)
print(x)  # print the data set
## [1] 0 2 2 2 4
# posterior parameters
alpha_posterior <- alpha_prior + sum(x)
beta_posterior <- beta_prior + N

# plot prior and posterior
gg_prior <- ggplot() + xlim(0, 5) + 
  stat_function(fun = dgamma, args = list(shape = alpha_prior, rate = beta_prior)) + 
  labs(title = "prior distribution", x = "lambda", y = "prior density")
gg_posterior <- ggplot() + xlim(0, 5) + 
  stat_function(fun = dgamma, args = list(shape = alpha_posterior, rate = beta_posterior)) + 
  labs(title = "posterior distribution", x = "lambda", y = "posterior density")
gg_prior + gg_posterior  # uses patchwork package

# posterior mean: alpha/beta
alpha_posterior/beta_posterior
## [1] 1.625
# posterior median: no closed form, so simulate
post_sims <- rgamma(10000, alpha_posterior, beta_posterior)
median(post_sims)
## [1] 1.586689
# posterior mode: (alpha - 1)/beta for alpha > 1
(alpha_posterior - 1)/beta_posterior
## [1] 1.5
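As a check on the simulated median above, the gamma quantile function gives it directly; a one-line sketch:

# posterior median from the gamma quantile function (no simulation needed)
qgamma(0.5, shape = alpha_posterior, rate = beta_posterior)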

2.1.3 Remarks

Bayesian inference presents two difficulties.

  1. Choosing a prior.
    1. It can be hard to actually construct a prior distribution. It’s challenging even when dealing with a single parameter, and it becomes much more difficult when dealing with several or many parameters.
    2. Priors are subjective, so one researcher’s prior might not work for another.
  2. Computing the posterior. Especially for many-parameter problems and non-conjugate priors, computing the posterior can be nearly intractable.

However, there are several practical solutions to these difficulties.

  1. Choosing a prior.
    1. We can use an “uninformative” or constant prior. Sometimes, we can use an improper prior that doesn’t integrate to one but places equal prior weight on all values.
    2. We can use an extremely diffuse prior. For example, if we wanted to estimate the average height in a population in inches, we might use a normal distribution centered at zero with an SD of 10,000. This prior says: “The average height is about zero, give or take 10,000 inches or so.”
    3. We can use an informative prior, but conduct careful robustness checks to assess whether the conclusions depend on the particular prior.
    4. We can use a weakly informative prior that places meaningful prior weight on all the plausible values and little prior weight only on the most implausible values. As a guideline, you might create a weakly informative prior by doubling or tripling the SD of the informative prior.
  2. Computing the posterior.
    1. While analytically deriving the posterior becomes intractable for most applied problems, it’s relatively easy to sample from the posterior distribution for many models.
    2. Algorithms like the Gibbs sampler, Metropolis-Hastings, and Hamiltonian Monte Carlo (all forms of MCMC) make this sampling procedure straightforward for a given model.
    3. Software such as Stan makes sampling easy to set up and very fast. Post-processing R packages such as tidybayes make it easy to work with the posterior simulations (see the sketch below).
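Even without special software, the simulation workflow is easy to see with a conjugate model, because we can draw directly from the known posterior. A minimal sketch using the beta posterior from the toothpaste cap example:

# work with simulations from the beta posterior for the toothpaste cap problem
set.seed(1234)
post_sims <- rbeta(10000, shape1 = 3 + 8, shape2 = 15 + 150 - 8)

mean(post_sims)                     # posterior mean
quantile(post_sims, c(0.05, 0.95))  # 90% credible interval
mean(post_sims > 0.1)               # posterior probability that pi > 0.1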

2.1.4 Exercises

Exercise 2.1 Suppose we collect \(N\) random samples \(x = \{x_1, x_2, ..., x_N\}\) and model each draw as a random variable \(X \sim \text{exponential}(\lambda)\) with pdf \(f(x_n | \lambda) = \lambda e^{-\lambda x_n}\). Find the posterior distribution of \(\lambda\) for the gamma prior distribution. Hint: the gamma distribution is the conjugate prior for the exponential likelihood.

Hint: Use the Poisson example from above. Because they share a prior, the math works quite similarly.
Solution

\[ \begin{aligned} \text{exponential likelihood: } f(x \mid \lambda) &= \prod_{n = 1}^N \lambda e^{-\lambda x_n} \\ &= \lambda^N e^{-\lambda \sum_{n = 1}^N x_n} \end{aligned} \] \[ \text{Gamma prior: } f( \lambda; \alpha^*, \beta^*) = \frac{{\beta^*}^{\alpha^*}}{\Gamma(\alpha^*)} \lambda^{\alpha^* - 1} e^{-\beta^*\lambda} \] \[ \begin{aligned} \text{Gamma posterior: } f( \lambda \mid x) &= \frac{\left( \lambda^N e^{-\lambda \sum_{n = 1}^N x_n}\right) \times \left( \left[ \frac{{\beta^*}^{\alpha^*}}{\Gamma(\alpha^*)} \right] \lambda^{\alpha^* - 1} e^{-\beta^*\lambda}\right)}{C_1} \\ &= \frac{\left( \lambda^N e^{-\lambda \sum_{n = 1}^N x_n}\right) \times \left( \lambda^{\alpha^* - 1} e^{-\beta^*\lambda}\right)}{C_2} \\ &= \frac{ \lambda^{ \overbrace{\left[ \alpha^* + N \right]}^{\alpha^\prime} - 1} e^{- \overbrace{\left[ \beta^* + \sum_{n = 1}^N x_n \right]}^{\beta^\prime} \lambda}}{C_2} \\ & = \frac{{\beta^\prime}^{\alpha^\prime}}{\Gamma(\alpha^\prime)} \lambda^{\alpha^\prime - 1} e^{-\beta^\prime\lambda}\text{ where } \alpha^\prime = \alpha^* + N \text{ and } \beta^\prime = \beta^* + \sum_{n = 1}^N x_n \end{aligned} \]
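As a quick check of this result, we can simulate a small exponential data set and plot the implied gamma prior and posterior. The prior parameters and the “true” \(\lambda\) below are arbitrary, chosen only for illustration; the plotting code mirrors the Poisson example above.

# quick check of the exponential-gamma result with simulated data
library(ggplot2)
library(patchwork)
set.seed(1234)

# hypothetical prior parameters and "true" rate
alpha_prior <- 3
beta_prior <- 3
lambda <- 2

# generate a data set
N <- 5
x <- rexp(N, rate = lambda)

# posterior parameters from the result above
alpha_posterior <- alpha_prior + N
beta_posterior <- beta_prior + sum(x)

# plot prior and posterior
gg_prior <- ggplot() + xlim(0, 5) + 
  stat_function(fun = dgamma, args = list(shape = alpha_prior, rate = beta_prior)) + 
  labs(title = "prior distribution", x = "lambda", y = "prior density")
gg_posterior <- ggplot() + xlim(0, 5) + 
  stat_function(fun = dgamma, args = list(shape = alpha_posterior, rate = beta_posterior)) + 
  labs(title = "posterior distribution", x = "lambda", y = "posterior density")
gg_prior + gg_posterior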

Exercise 2.2 You are a researcher interested in government duration in parliamentary democracies. You decide to model duration as an exponential distribution. To estimate the parameter \(\lambda\) of the exponential distribution, you collect data on the last five UK governments.

  • The first Johnson ministry lasted 142 days.
  • The second May ministry lasted 773 days.
  • The first May ministry lasted 333 days.
  • The second Cameron ministry lasted 432 days.
  • The first Cameron ministry (Cameron-Clegg coalition) lasted 1,823 days.
  1. Rescale the data to years–that helps with the interpretation.
  2. Create three gamma prior distributions.
    1. One that describes your prior beliefs.
    2. One that’s weakly informative.
    3. One that’s as uninformative as possible.
  3. Use what you learned from Exercise 2.1 to plot each prior and posterior.
  4. For each posterior, compute the mean, median, and mode of \(\lambda\). Interpret. Hint: \(\lambda\) is a rate. If the durations are measured in years, then it’s the failures per year. The expected duration in years is \(\frac{1}{\lambda}\), but remember that \(E \left( \frac{1}{\lambda}\right) \neq \frac{1} {E(\lambda)}\). If you want to find the posterior distribution of the expected duration (rather than the failures/year), then you can simulate many draws of \(\lambda\) from the posterior and transform each draw into \(\frac{1}{\lambda}\). The mean of the transformed draws is the posterior mean of the expected duration.
  5. Assess the model. Do you think we have a good model? Hint: the exponential distribution is memoryless. Does that make sense in this application?
Solutions
# load packages
library(tidyverse)
library(stringr)
library(kableExtra)

# set seed for reproducibility
set.seed(1234)

# choose informative  prior distribution
# - I expect govts to last, on avg, 3 years, so fail at a 
#   rate of lambda = 1/3 per year, give-or-take about 1/10.
# - I know that the mean of the gamma is a/b and the sd is 
#   sqrt{a/(b^2)}, so I set a/b = 1/3 and sqrt{a/(b^2)} = 1/10
#   and solve. You could also just experiment with different 
#   values
alpha_inf <- 100/9
beta_inf <-  3*alpha_inf

# plot informative prior
ggplot() + xlim(0, 1) +
  stat_function(fun = dgamma, args = list(shape = alpha_inf, rate = beta_inf)) + 
  labs(title = "prior distribution", x = "lambda; expected failures/year", y = "prior density")

# choose **weakly informative** prior distribution
# - For this, I repeated the same process, but doubled my
#   give-or-take from 1/10 to 1/5,
alpha_weak <- 25/9
beta_weak <- 3*alpha_weak

# plot weakly informative prior
ggplot() + xlim(0, 1) +
  stat_function(fun = dgamma, args = list(shape = alpha_weak, rate = beta_weak)) + 
  labs(title = "prior distribution", x = "lambda; expected failures/year", y = "prior density")

# choose flattest prior distribution
# - I just want to make this prior as close to uniform as possible.
# - the failure rate (failures/year) is *surely* between 0 and 100
#   (lambda = 100 means that 100 governments are failing per year), 
#   so I plot the prior from zero to 100--it's basically flat   
alpha_flat <- 1
beta_flat <- 0.01

# plot flattish prior
ggplot() + xlim(0, 100) + ylim(0, NA) + 
  stat_function(fun = dgamma, args = list(shape = alpha_flat, rate = beta_flat)) + 
  labs(title = "prior distribution", x = "lambda; expected failures/year", y = "prior density")

# data
x <- c(142, 773, 333, 432, 1823)/365  # rescale to years

# make a data frame with the posterior and prior parameters
posts <- data.frame(prior_type = c("Informative", 
                                   "Weakly Informative",
                                   "Flat-ish"),
                    alpha_prior = c(alpha_inf, 
                                    alpha_weak,
                                    alpha_flat),
                    beta_prior = c(beta_inf, 
                                    beta_weak,
                                    beta_flat)) %>%
  # add posterior parameters
  mutate(alpha_posterior = alpha_prior + length(x),
         beta_posterior = beta_prior + sum(x)) %>%
  # create one row per parameter
  pivot_longer(cols = alpha_prior:beta_posterior) %>%
  # separate parameter from distribution type
  separate(name, into = c("parameter", "distribution_type")) %>%
  # put parameters into separate columns
  pivot_wider(names_from = parameter, values_from = value) %>%
  mutate(distribution_type = str_to_title(distribution_type),
         distribution_type = factor(distribution_type, levels = c("Prior", "Posterior"))) %>%
  mutate(prior_type = factor(prior_type, levels = c("Informative", "Weakly Informative", "Flat-ish")))

# compute point estimates and make a table
point_estimates <- posts %>% 
  #filter(distribution_type == "posterior") %>%
  group_by(prior_type, distribution_type) %>%
  mutate(mean = alpha/beta,
         median = median(rgamma(100000, alpha, beta)),
         mode = (alpha - 1)/beta,
         mean_expected_duration = mean(1/rgamma(100000, shape = alpha, rate = beta))) 

point_estimates %>%
  select(-alpha, -beta) %>%
  arrange(distribution_type, prior_type) %>%
  rename(`Prior` = prior_type, 
         `Distribution` = distribution_type,
         `Mean of lambda` = mean,
         `Median of lambda` = median,
         `Mode of lambda` = mode,
         `Mean of 1/lambda` = mean_expected_duration) %>%
  kable(format = "markdown", digits = 2)
| Prior              | Distribution | Mean of lambda | Median of lambda | Mode of lambda | Mean of 1/lambda |
|--------------------|--------------|---------------:|-----------------:|---------------:|-----------------:|
| Informative        | Prior        |           0.33 |             0.32 |           0.30 |             3.30 |
| Weakly Informative | Prior        |           0.33 |             0.29 |           0.21 |             4.70 |
| Flat-ish           | Prior        |         100.00 |            69.38 |           0.00 |             0.12 |
| Informative        | Posterior    |           0.38 |             0.37 |           0.35 |             2.84 |
| Weakly Informative | Posterior    |           0.43 |             0.42 |           0.38 |             2.65 |
| Flat-ish           | Posterior    |           0.62 |             0.59 |           0.52 |             1.92 |
# plot distributions
gg_posts <- posts %>%
  # add in lambda values for which to compute the prior and posterior density
  full_join(tibble(lambda = seq(0, 2, length.out = 100)), 
            by = character()) %>%
  # compute the posterior and prior density for each lambda
  mutate(density = dgamma(lambda, shape = alpha, rate = beta))

ggplot(gg_posts, aes(x = lambda, y = density, 
                  color = prior_type, 
                  linetype = distribution_type)) + 
  geom_line()