class: center, middle, inverse, title-slide

# Regression as a Conditional Average

### Carlisle Rainey

### 2020-10-22 (updated: 2021-10-20)

---

class: center, middle

# The Regression Equation

---

## Observed Values

Let's start by describing a scatterplot using a line. Indeed, we can think of the regression equation as an equation for a scatterplot.

First, let's agree that we won't encounter a scatterplot where all the **observed values** `\((x_i, y_i)\)` fall exactly along a line. As such, we need notation that allows us to distinguish the regression line from the observed values.

---

![](regression-as-conditional-average_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

## Fitted Values and Coefficients

We commonly refer to the values along the line as the **fitted values** (or "predicted values" or "predictions") and to the observations themselves as the "observed values" or "observations."

We use `\(y_i\)` to denote the `\(i\)`th observation of `\(y\)` and use `\(\hat{y}_i\)` to denote the fitted value (usually *given* `\(x_i\)`). We write the equation for the line as `\(\hat{y} = \alpha + \beta x\)` and the fitted values as `\(\hat{y}_i = \alpha + \beta x_i\)` (note the presence and absence of the subscript).

We refer to the intercept `\(\alpha\)` and the slope `\(\beta\)` as **coefficients**.

---

![](regression-as-conditional-average_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

## Residuals

We refer to the difference between the observed value `\(y_i\)` and the fitted value `\(\hat{y}_i = \alpha + \beta x_i\)` as the **residual** `\(r_i = y_i - \hat{y}_i\)`.

Thus, for *any* `\(\alpha\)` and `\(\beta\)`, we can write `\(y_i = \alpha + \beta x_i + r_i\)` for the observations `\(i = \{1, 2, ..., n\}\)`.

Notice that we can break each `\(y_i\)` into two components:

1. the linear function of `\(x_i\)`: `\(\alpha + \beta x_i\)` (i.e., the regression line)
1. the residual `\(r_i\)`

In short, we can describe any scatterplot using the model `\(y_i = \alpha + \beta x_i + r_i\)`.

---

![](regression-as-conditional-average_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

class: inverse, center, middle

## But how do we choose the `\(\alpha\)` and `\(\beta\)` that *best* fit the observed values?

---

class: center, middle

# An Exercise in Line Fitting

---

class: inverse, center, middle

# The Regression Line as a Conditional Average

---

## The Conditional Average

![](regression-as-conditional-average_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

Have a look at the scatterplot above. What's the portfolio share of a party in a coalition government with a seat share of 25%?

---

## The Conditional Average

![](regression-as-conditional-average_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

Your eyes probably immediately begin examining a vertical strip above 25%.

---

## The Conditional Average

![](regression-as-conditional-average_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

You probably estimate that the average is a little more than 25%; call it 27%. You can see that the SD is about 10% because you'd need to go out about 10 percentage points above and below the average to grab about 2/3rds of the data.

---

Now you're informed by the data and ready to answer the question.

- Q: What's the portfolio share of a party in a coalition government with a seat share of 25%?
- A: It's about 27%, give or take 10 percentage points or so.
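---

## The Conditional Average

We can mimic this "vertical strip" reasoning in code. Here's a minimal sketch using made-up data; the data frame `gov` and its columns `seat_share` and `portfolio_share` are illustrative names, not the actual dataset behind the figures.


```r
# made-up data that mimics the pattern in the scatterplot (for illustration only)
set.seed(1234)
gov <- data.frame(seat_share = runif(200, min = 5, max = 60))
gov$portfolio_share <- 2 + gov$seat_share + rnorm(200, sd = 10)

# keep only the observations in a narrow vertical strip around a 25% seat share
strip <- subset(gov, abs(seat_share - 25) < 2.5)

# the conditional average and SD of portfolio share, given a seat share near 25%
mean(strip$portfolio_share)
sd(strip$portfolio_share)
```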
---

## The Conditional Average

![](regression-as-conditional-average_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

If we break the data into many small windows, we can visually create an average (and an SD) for each. Freedman, Pisani, and Purves (2008) refer to this as a "graph of averages." Fox (2008) calls this "naive nonparametric regression." It's a conceptual tool to help us understand regression.

---

## The Conditional Average

![](regression-as-conditional-average_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

For some datasets, these averages will fall roughly along a line. In that case, we can describe the average value of `\(y\)` for each value of `\(x\)`--that is, the *conditional* average of `\(y\)`--with a line.

---

class: inverse, center, middle

# The Best Line

---

## The Best Point

Before we talk about a good *line*, let's talk about a good *point*. Suppose you have a dataset `\(y = \{y_1, y_2, ... , y_n\}\)` and you want to predict these observations with a single point `\(\theta\)`. Use calculus to find the `\(\theta\)` that minimizes the r.m.s. of the residuals `\(r_i = y_i - \theta\)` or that minimizes `\(f(\theta) = \sqrt{\dfrac{\sum_{i = 1}^n(y_i - \theta)^2}{n}}\)`.

--

**Hint 1**: Realize that the `\(\theta\)` that minimizes `\(f(\theta) = \sqrt{\dfrac{\displaystyle \sum_{i = 1}^n (y_i - \theta)^2}{n}}\)` also minimizes `\(g(\theta) = \dfrac{\displaystyle \sum_{i = 1}^n(y_i - \theta)^2}{n}\)`. We know this because the square root function is monotonically increasing (for nonnegative values, which these must always be), so it preserves the ordering. In other words, the `\(\theta\)` that produces the smallest RMS of the deviations also produces the smallest MS of the deviations.

---

## The Best Point

Before we talk about a good *line*, let's talk about a good *point*. Suppose you have a dataset `\(y = \{y_1, y_2, ... , y_n\}\)` and you want to predict these observations with a single point `\(\theta\)`. Use calculus to find the `\(\theta\)` that minimizes the r.m.s. of the residuals `\(r_i = y_i - \theta\)` or that minimizes `\(f(\theta) = \sqrt{\dfrac{\sum_{i = 1}^n(y_i - \theta)^2}{n}}\)`.

**Hint 2**: Realize that the `\(\theta\)` that minimizes `\(g(\theta) = \dfrac{\displaystyle \sum_{i = 1}^n(y_i - \theta)^2}{n}\)` also minimizes `\(h(\theta) = \displaystyle \sum_{i = 1}^n(y_i - \theta)^2\)`. Dividing by the constant `\(n\)` just rescales the curve; it does not change the `\(\theta\)` that minimizes the curve. So work with `\(h(\theta) = \displaystyle \sum_{i = 1}^n(y_i - \theta)^2\)`.

---

## The Best Point

Before we talk about a good *line*, let's talk about a good *point*. Suppose you have a dataset `\(y = \{y_1, y_2, ... , y_n\}\)` and you want to predict these observations with a single point `\(\theta\)`. Use calculus to find the `\(\theta\)` that minimizes the r.m.s. of the residuals `\(r_i = y_i - \theta\)` or that minimizes `\(f(\theta) = \sqrt{\dfrac{\sum_{i = 1}^n(y_i - \theta)^2}{n}}\)`.

**Hint 3**: To make things easier, expand `\(h(\theta) = \displaystyle \sum_{i = 1}^n(y_i - \theta)^2\)` to `\(h(\theta) = \displaystyle \sum_{i = 1}^n(y_i^2 - 2\theta y_i + \theta^2)\)`.
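---

## The Best Point

Before working through the remaining hints, you can check the answer numerically. Here's a minimal sketch with made-up data: evaluate `\(f(\theta)\)` over a fine grid of candidate points and see which `\(\theta\)` produces the smallest RMS of the deviations.


```r
# made-up data for illustration
set.seed(42)
y <- rnorm(10, mean = 50, sd = 10)

# the RMS of the deviations from a single predicting point theta
f <- function(theta) sqrt(mean((y - theta)^2))

# evaluate f() along a fine grid of candidate points and keep the best one
grid <- seq(min(y), max(y), length.out = 10000)
grid[which.min(sapply(grid, f))]

# compare with a familiar summary of y
mean(y)
```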
---

## The Best Point

Before we talk about a good *line*, let's talk about a good *point*. Suppose you have a dataset `\(y = \{y_1, y_2, ... , y_n\}\)` and you want to predict these observations with a single point `\(\theta\)`. Use calculus to find the `\(\theta\)` that minimizes the r.m.s. of the residuals `\(r_i = y_i - \theta\)` or that minimizes `\(f(\theta) = \sqrt{\dfrac{\sum_{i = 1}^n(y_i - \theta)^2}{n}}\)`.

**Hint 4**: Distribute the summation operator to obtain `\(h(\theta) = \displaystyle \sum_{i = 1}^n y_i^2 - 2 \theta \sum_{i = 1}^n y_i + n\theta^2\)`.

---

## The Best Point

Before we talk about a good *line*, let's talk about a good *point*. Suppose you have a dataset `\(y = \{y_1, y_2, ... , y_n\}\)` and you want to predict these observations with a single point `\(\theta\)`. Use calculus to find the `\(\theta\)` that minimizes the r.m.s. of the residuals `\(r_i = y_i - \theta\)` or that minimizes `\(f(\theta) = \sqrt{\dfrac{\sum_{i = 1}^n(y_i - \theta)^2}{n}}\)`.

**Hint 5**: Now take the derivative of `\(h(\theta)\)` w.r.t. `\(\theta\)`, set that derivative equal to zero, and solve for `\(\theta\)`. The result should be familiar.

---

## The Best Line

So far, we have two results:

1. The average is the point that minimizes the RMS of the deviations.
1. We want a line that captures the conditional average.

Just as the average minimizes the RMS of the deviations, perhaps the regression line should minimize the RMS of the residuals...

--

that's exactly what we do. **We want the pair of coefficients `\((\hat{\alpha}, \hat{\beta})\)` that minimizes the RMS of the residuals**, or

`\(\DeclareMathOperator*{\argmin}{arg\,min}\)`

`\begin{equation} (\hat{\alpha}, \hat{\beta}) = \displaystyle \argmin_{( \alpha, \, \beta ) \, \in \, \mathbb{R}^2} \sqrt{\frac{\sum_{i = 1}^n r_i^2}{n}} \end{equation}`

---

## The Best Line

Let's explore three methods to find the coefficients that minimize the RMS of the residuals.

1. grid search
1. numerical optimization, to get additional intuition and preview more advanced methods
1. analytical optimization

---

# Grid Search

![](img/grid-search.gif)
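---

# Grid Search

Here's a minimal sketch of a grid search with made-up data: evaluate the RMS of the residuals at many candidate `\((\alpha, \beta)\)` pairs and keep the pair with the smallest value. The grid limits and step size below are arbitrary choices.


```r
# made-up data for illustration
set.seed(1234)
x <- runif(100, min = 0, max = 10)
y <- 1 + 2 * x + rnorm(100, sd = 3)

# the RMS of the residuals for a candidate pair (alpha, beta)
rms <- function(alpha, beta) sqrt(mean((y - (alpha + beta * x))^2))

# evaluate rms() over a grid of candidate coefficient pairs
grid <- expand.grid(alpha = seq(-5, 5, by = 0.05),
                    beta  = seq( 0, 4, by = 0.05))
grid$rms <- mapply(rms, grid$alpha, grid$beta)

# the pair with the smallest RMS of the residuals wins
grid[which.min(grid$rms), ]

# lm() solves the same problem exactly
coef(lm(y ~ x))
```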
---

# Numerical Optimization

Remember that we simply need to minimize the function `\(f(\alpha, \beta) = \sqrt{\frac{\sum_{i = 1}^n r_i^2}{n}} = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} = \sqrt{\frac{ \sum_{i = 1}^n [y_i - (\alpha + \beta x_i)]^2}{n}}\)`.

---

# Numerical Optimization

Hill-climbing algorithms, such as Newton-Raphson, find the optimum *numerically* by investigating the shape of `\(f\)` at the current location, taking a step uphill, and then repeating. When no step leads uphill, the algorithm has found the optimum.

Under meaningful restrictions (e.g., no local optima), these algorithms find the *global* optimum.

---

# Analytical Optimization

Remember that we simply need to minimize the function `\(f(\alpha, \beta) = \displaystyle \sqrt{\frac{\sum_{i = 1}^n [y_i - (\alpha + \beta x_i)]^2}{n}}\)`.

This is equivalent to minimizing `\(h(\alpha, \beta) = \sum_{i = 1}^n(y_i - \alpha - \beta x_i)^2\)`. We sometimes refer to this quantity as the SSR or "sum of squared residuals."

To minimize `\(h(\alpha, \beta)\)`, remember that we need to solve `\(\frac{\partial h}{\partial \alpha} = 0\)` and `\(\frac{\partial h}{\partial \beta} = 0\)` (i.e., the first-order conditions).

---

# Analytical Optimization

Using the chain rule, we have the partial derivatives

`\(\frac{\partial h}{\partial \alpha} = \sum_{i = 1}^n [2 \times (y_i - \alpha - \beta x_i) \times (-1)] = -2 \sum_{i = 1}^n(y_i - \alpha - \beta x_i)\)`

and

`\(\frac{\partial h}{\partial \beta} = \sum_{i = 1}^n 2 \times (y_i - \alpha - \beta x_i) \times (-x_i) = -2 \sum_{i = 1}^n(y_i - \alpha - \beta x_i)x_i\)`

and the two first-order conditions

`\(-2 \sum_{i = 1}^n(y_i - \hat{\alpha} - \hat{\beta} x_i) = 0\)`

and

`\(-2 \sum_{i = 1}^n(y_i - \hat{\alpha} - \hat{\beta} x_i)x_i = 0\)`

---

class: center

# Analytical Optimization

### 1st First-Order Condition

`\(-2 \sum_{i = 1}^n(y_i - \hat{\alpha} - \hat{\beta} x_i) = 0\)`

`\(\sum_{i = 1}^n(y_i - \hat{\alpha} - \hat{\beta} x_i) = 0\)` (divide both sides by -2)

`\(\sum_{i = 1}^n y_i - \sum_{i = 1}^n \hat{\alpha} - \sum_{i = 1}^n \hat{\beta} x_i = 0\)` (distribute the sum)

`\(\sum_{i = 1}^n y_i - n \hat{\alpha} - \hat{\beta}\sum_{i = 1}^n x_i = 0\)` (move the constant `\(\hat{\beta}\)` in front and realize that `\(\sum_{i = 1}^n \hat{\alpha} = n\hat{\alpha}\)`)

`\(\sum_{i = 1}^n y_i = n \hat{\alpha} + \hat{\beta}\sum_{i = 1}^n x_i\)` (rearrange)

`\(\frac{\sum_{i = 1}^n y_i}{n} = \hat{\alpha} + \hat{\beta} \frac{\sum_{i = 1}^n x_i}{n}\)` (divide both sides by `\(n\)`)

`\(\overline{y} = \hat{\alpha} + \hat{\beta} \overline{x}\)` (recognize the average of `\(y\)` and of `\(x\)`)

**The 1st first-order condition implies that the regression line `\(\hat{y} = \hat{\alpha} + \hat{\beta}x\)` equals `\(\overline{y}\)` when `\(x = \overline{x}\)`. Thus, the regression line must go through the point `\((\overline{x}, \overline{y})\)` or "the point of averages."**

---

class: center

# Analytical Optimization

### 1st First-Order Condition

`\(-2 \sum_{i = 1}^n(y_i - \hat{\alpha} - \hat{\beta} x_i) = 0\)`

...

`\(\overline{y} = \hat{\alpha} + \hat{\beta} \overline{x}\)`

**The 1st first-order condition implies that the regression line `\(\hat{y} = \hat{\alpha} + \hat{\beta}x\)` equals `\(\overline{y}\)` when `\(x = \overline{x}\)`. Thus, the regression line must go through the point `\((\overline{x}, \overline{y})\)` or "the point of averages."**
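---

# Analytical Optimization

We can check both first-order conditions numerically. Here's a minimal sketch with made-up data, using `lm()` to compute the least-squares coefficients: the residuals sum to zero (so the fitted line passes through the point of averages), and the residuals are orthogonal to `\(x\)`.


```r
# made-up data for illustration
set.seed(1234)
x <- runif(100, min = 0, max = 10)
y <- 1 + 2 * x + rnorm(100, sd = 3)

fit <- lm(y ~ x)  # least-squares coefficients

# 1st first-order condition: the residuals sum to zero,
# so the fitted line passes through the point of averages
sum(residuals(fit))                              # about 0, up to rounding error
coef(fit)[1] + coef(fit)[2] * mean(x) - mean(y)  # also about 0

# 2nd first-order condition: the residuals are orthogonal to x
sum(residuals(fit) * x)                          # about 0, up to rounding error
```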
---

class: center

# Analytical Optimization

### 2nd First-Order Condition

[Do some similar, but a bit more complicated, math.]

`\(\hat{\beta} = r \times \dfrac{\text{SD of } y}{\text{SD of }x} = \dfrac{\sum_{i = 1}^n (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i = 1}^n (x_i - \overline{x})^2} = \dfrac{\sum_{i = 1}^n x_i y_i - n \overline{x} \overline{y}}{\sum_{i = 1}^n x_i^2 - n \overline{x}^2}\)`

---

class: center

# The Best Line

`\(\hat{\beta} = r \times \dfrac{\text{SD of } y}{\text{SD of }x}\)`

`\(\hat{\alpha} = \overline{y} - \hat{\beta}\overline{x}\)`

---

# The RMS of the Residuals

The RMS of the residuals is sometimes called the "RMS error (of the regression)" or the "standard error of the regression" and is often denoted `\(\hat{\sigma}\)`.

Here's how we can think of a regression model as a **description** of a scatterplot: the points/observations `\(y_i\)` are about `\(\hat{\alpha} + \hat{\beta} x_i\)`, give or take the RMS of the residuals or so.

![](regression-as-conditional-average_files/figure-html/unnamed-chunk-10-1.png)<!-- -->
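---

# The Best Line in Code

To wrap up, here's a minimal sketch with made-up data that turns the formulas above into code: compute `\(\hat{\beta}\)` and `\(\hat{\alpha}\)`, then describe the scatterplot as "about `\(\hat{\alpha} + \hat{\beta} x_i\)`, give or take the RMS of the residuals or so."


```r
# made-up data for illustration
set.seed(1234)
x <- runif(100, min = 0, max = 10)
y <- 1 + 2 * x + rnorm(100, sd = 3)

# the coefficient formulas from the slides
beta_hat  <- cor(x, y) * sd(y) / sd(x)
alpha_hat <- mean(y) - beta_hat * mean(x)

# the RMS of the residuals: the "give or take" around the regression line
res <- y - (alpha_hat + beta_hat * x)
rms <- sqrt(mean(res^2))

c(alpha_hat = alpha_hat, beta_hat = beta_hat, rms = rms)

# lm() recovers the same coefficients
coef(lm(y ~ x))
```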