Data Wrangling

# Data Wrangling

---

# to **wrangle** is to **act**

---

# wrangling is **hard**

there are 1000s of tasks; you must learn each task

it's not general like ggplot()

---

## follow along by cloning the `wrangling` repository from our POS 5737 organization

---

# to **wrangle** is to...

`read_***()`, 
`glimpse()`, 
`filter()`, 
`select()`, 
`rename()`, 
`mutate()`, 
`pivot_longer()`,
`left_join()`, and 
`write_***()`

![](https://upload.wikimedia.org/wikipedia/commons/6/6f/Twemoji2_1f93a.svg)

---
class: bottom, center

# what we want 
--

### 50 states

### four quarters

### percent change in GDP

### state political ideology

---
class: middle, center

# what we want

### 50 states

### four quarters

### percent change in GDP

### state political ideology

---
class: middle, center

---

# data on GDP in the U.S. states

--
* On Sept. 11, 2018, I went to the BEA's website and grabbed their [current release](https://www.bea.gov/data/gdp/gdp-state) of GDP by state.

--
* To preserve the raw data file, I changed nothing and uploaded [`qgdpstate0718_0.xlsx`](https://docs.google.com/spreadsheets/d/1y10LLeyfMir4e2uz87frYb2PVXtS41hXXBqMz828GFQ/edit#gid=556942308) to Google Drive as a Google Sheet.

--
* To use the data locally, I downloaded `qgdpstate0718_0.xlsx` using *File* > *Download as...* > *Comma-separated values (.csv, current sheet)*. I added it to the `data/` subdirectory.

---
class: middle, center

---

background-image: url("https://upload.wikimedia.org/wikipedia/commons/1/11/Noto_Emoji_Oreo_1f92c.svg")
background-size: cover

---

background-image: url("https://upload.wikimedia.org/wikipedia/commons/6/63/Twemoji2_1f914.svg")
background-size: cover

---
class: middle, center

# reading the data into R

### `read_csv()`

---

```r
# load packages
library(tidyverse)
```

```
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
```

```
## ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
```

```
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
```

---

```r
# read data
gdp_df_raw <- read_csv("data/qgdpstate0718_0 - Table 1.csv")
```

```
## New names:
## Rows: 65 Columns: 8
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (7): Table 1. Percent Change in Real Gross Domestic Product (GDP) by Sta... dbl
## (1): ...2
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
```

---
class: middle, center

---

```r
# read data, again
gdp_df_raw <- read_csv("data/qgdpstate0718_0 - Table 1.csv", 
                       skip = 3)
```

```
## New names:
## Rows: 62 Columns: 8
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...8 dbl (6): ...2, Q1...3, Q2, Q3, Q4, Q1...7
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
## • `Q1` -> `Q1...3`
## • `Q1` -> `Q1...7`
## • `` -> `...8`
```

---

```r
# quick look
glimpse(gdp_df_raw)
```

```
## Rows: 62
## Columns: 8
## $ ...1   <chr> "United States1 …………..", "New England ……………", "Connecticut ……………
## $ ...2   <dbl> 2.1, 1.6, -0.2, 1.4, 2.6, 1.9, 1.6, 1.1, 1.3, 1.6, 1.7, 1.5, 0.…
## $ Q1...3 <dbl> 0.9, 2.2, -5.5, 3.0, 5.1, 9.5, 1.2, 2.0, 0.8, 2.4, 2.8, -2.9, -…
## $ Q2     <dbl> 2.7, 1.2, 3.1, -2.8, 1.0, -2.9, 4.2, 0.8, 0.7, -9.7, -1.7, 2.6,…
## $ Q3     <dbl> 3.3, 5.1, 4.6, 6.8, 5.2, 7.9, 4.5, 0.5, 4.1, 13.1, 12.2, 5.4, 2…
## $ Q4     <dbl> 2.7, 2.5, 2.4, 2.6, 2.4, 2.5, 2.7, 2.3, 1.1, 0.2, 1.2, 1.2, 2.1…
## $ Q1...7 <dbl> 1.8, 1.5, 1.6, 0.6, 1.5, 1.3, 1.3, 2.6, 1.5, 1.3, 2.0, 1.5, 1.6…
## $ ...8   <chr> ".......", ".......", "23", "46", "29", "36", "35", "8", ".....…
```

---
class: center, middle
# filtering rows

### `filter()`

---

```r
# filter the rows we want
gdp_df <- filter(gdp_df_raw, ...8 != ".......")

# quick look
glimpse(gdp_df)
```

```
## Rows: 50
## Columns: 8
## $ ...1   <chr> "Connecticut …………….", "Maine ……………………", "Massachusetts …………", "…
## $ ...2   <dbl> -0.2, 1.4, 2.6, 1.9, 1.6, 1.1, 1.6, 1.5, 0.9, 1.1, 1.8, 1.2, 2.…
## $ Q1...3 <dbl> -5.5, 3.0, 5.1, 9.5, 1.2, 2.0, 2.4, -2.9, -2.3, 3.5, -0.5, -2.1…
## $ Q2     <dbl> 3.1, -2.8, 1.0, -2.9, 4.2, 0.8, -9.7, 2.6, 3.4, -1.3, 3.2, 1.9,…
## $ Q3     <dbl> 4.6, 6.8, 5.2, 7.9, 4.5, 0.5, 13.1, 5.4, 2.8, 3.0, 4.5, 3.6, 1.…
## $ Q4     <dbl> 2.4, 2.6, 2.4, 2.5, 2.7, 2.3, 0.2, 1.2, 2.1, 0.0, 2.5, 2.7, 2.7…
## $ Q1...7 <dbl> 1.6, 0.6, 1.5, 1.3, 1.3, 2.6, 1.3, 1.5, 1.6, 1.1, 2.0, 1.4, 1.3…
## $ ...8   <chr> "23", "46", "29", "36", "35", "8", "38", "30", "25", "41", "19"…
```

---
class: center, middle

# selecting columns

### `select()`

---

```r
# select the columns we want
gdp2_df <- select(gdp_df, -...2, -Q1...7, -...8)

# quick look
glimpse(gdp2_df)
```

```
## Rows: 50
## Columns: 5
## $ ...1   <chr> "Connecticut …………….", "Maine ……………………", "Massachusetts …………", "…
## $ Q1...3 <dbl> -5.5, 3.0, 5.1, 9.5, 1.2, 2.0, 2.4, -2.9, -2.3, 3.5, -0.5, -2.1…
## $ Q2     <dbl> 3.1, -2.8, 1.0, -2.9, 4.2, 0.8, -9.7, 2.6, 3.4, -1.3, 3.2, 1.9,…
## $ Q3     <dbl> 4.6, 6.8, 5.2, 7.9, 4.5, 0.5, 13.1, 5.4, 2.8, 3.0, 4.5, 3.6, 1.…
## $ Q4     <dbl> 2.4, 2.6, 2.4, 2.5, 2.7, 2.3, 0.2, 1.2, 2.1, 0.0, 2.5, 2.7, 2.7…
```

---
class: center, middle

# renaming columns

### `rename()`

---

```r
# rename the poorly named variables
gdp3_df <- rename(gdp2_df, state = ...1, 
                  Q1 = Q1...3)

# quick look
glimpse(gdp3_df)
```

```
## Rows: 50
## Columns: 5
## $ state <chr> "Connecticut …………….", "Maine ……………………", "Massachusetts …………", "N…
## $ Q1    <dbl> -5.5, 3.0, 5.1, 9.5, 1.2, 2.0, 2.4, -2.9, -2.3, 3.5, -0.5, -2.1,…
## $ Q2    <dbl> 3.1, -2.8, 1.0, -2.9, 4.2, 0.8, -9.7, 2.6, 3.4, -1.3, 3.2, 1.9, …
## $ Q3    <dbl> 4.6, 6.8, 5.2, 7.9, 4.5, 0.5, 13.1, 5.4, 2.8, 3.0, 4.5, 3.6, 1.4…
## $ Q4    <dbl> 2.4, 2.6, 2.4, 2.5, 2.7, 2.3, 0.2, 1.2, 2.1, 0.0, 2.5, 2.7, 2.7,…
```

---
class: middle, center

---
class: middle, center

# three **actions** so far

### `filter()`
### `select()`
### `rename()`

---

# All Together Now!

```r
# read data
gdp_df_raw <- read_csv("data/qgdpstate0718_0 - Table 1.csv", 
                       skip = 3)

# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......")
gdp2_df <- select(gdp_df, -...2, -Q1...7, -...8)
gdp3_df <- rename(gdp2_df, state = ...1, 
                  Q1 = Q1...3)

# quick look
glimpse(gdp3_df)
```

---
class: middle, center

# This whole `gdp_df`, `gdp2_df`, `gdp3_df`, ... isn't very satisfying.

---
class: middle

# Notice that the data frame output from one function is the input to the next...

```
# read data
gdp_df_raw <- read_csv("data/qgdpstate0718_0 - Table 1.csv", 
                       skip = 3)

# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......")
gdp2_df <- select(gdp_df, -...2, -Q1...7, -...8)
gdp3_df <- rename(gdp2_df, state = ...1, 
                  Q1 = Q1...3)
```

---
class: middle, center

# enter the pipe

---

# All Together Now! With a Pipe...  😎

```r
# read data
gdp_df_raw <- read_csv("data/qgdpstate0718_0 - Table 1.csv", 
                       skip = 3)

# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3)
```
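---

### the pipe, on a toy data frame

A minimal sketch of what `%>%` does, assuming a made-up `toy_df` (the states and growth values are hypothetical, not the BEA data). The pipe passes the data frame on its left as the first argument of the function on its right, so the nested and piped versions should give the same result.

```r
library(tidyverse)

# hypothetical toy data (not from the BEA file)
toy_df <- tibble(state  = c("Maine", "Ohio", "Utah"),
                 growth = c(1.4, -0.3, 2.8))

# without the pipe: read from the inside out
nested <- select(filter(toy_df, growth > 0), state)

# with the pipe: read from top to bottom
piped <- toy_df %>%
  filter(growth > 0) %>%
  select(state)

# compare the two results
all.equal(nested, piped)
```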

---
class: middle, center

# mutating variables

### `mutate()`

---

```r
# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3) %>% 
  # mutate state (remove all non-word characters, i.e., the spaces and dots)
  mutate(state = str_remove_all(state, pattern = "\\W")) %>%
  glimpse()
```

```
## Rows: 50
## Columns: 5
## $ state <chr> "Connecticut", "Maine", "Massachusetts", "NewHampshire", "RhodeI…
## $ Q1    <dbl> -5.5, 3.0, 5.1, 9.5, 1.2, 2.0, 2.4, -2.9, -2.3, 3.5, -0.5, -2.1,…
## $ Q2    <dbl> 3.1, -2.8, 1.0, -2.9, 4.2, 0.8, -9.7, 2.6, 3.4, -1.3, 3.2, 1.9, …
## $ Q3    <dbl> 4.6, 6.8, 5.2, 7.9, 4.5, 0.5, 13.1, 5.4, 2.8, 3.0, 4.5, 3.6, 1.4…
## $ Q4    <dbl> 2.4, 2.6, 2.4, 2.5, 2.7, 2.3, 0.2, 1.2, 2.1, 0.0, 2.5, 2.7, 2.7,…
```
---

```r
# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3) %>% 
  mutate(state = str_remove_all(state, pattern = "\\W")) %>%
  # add a space before an upper-case letter preceded by a lower-case letter
  mutate(state = str_replace(state, 
                             pattern = "([a-z])([A-Z])", 
                             replacement = "\\1 \\2")) %>%
  glimpse()
```

```
## Rows: 50
## Columns: 5
## $ state <chr> "Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode…
## $ Q1    <dbl> -5.5, 3.0, 5.1, 9.5, 1.2, 2.0, 2.4, -2.9, -2.3, 3.5, -0.5, -2.1,…
## $ Q2    <dbl> 3.1, -2.8, 1.0, -2.9, 4.2, 0.8, -9.7, 2.6, 3.4, -1.3, 3.2, 1.9, …
## $ Q3    <dbl> 4.6, 6.8, 5.2, 7.9, 4.5, 0.5, 13.1, 5.4, 2.8, 3.0, 4.5, 3.6, 1.4…
## $ Q4    <dbl> 2.4, 2.6, 2.4, 2.5, 2.7, 2.3, 0.2, 1.2, 2.1, 0.0, 2.5, 2.7, 2.7,…
```
---
### details

--
```
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3) %>% 
  mutate(state = str_remove_all(state, pattern = "\\W")) %>%
  mutate(state = str_replace(state, 
                             pattern = "([a-z])([A-Z])", 
                             replacement = "\\1 \\2")) %>%
  glimpse()
```

--
* `mutate()` changes a variable in a data frame, but **returns a data frame**.

--
* `str_remove_all()` looks at every character in each element of a character vector and removes the characters that match a pattern. **It outputs a character vector the same length as the input.**

--
* `pattern = "\\W"` is regex for "non-word characters" (anything other than a letter, digit, or underscore).

--
* `str_replace()` looks at each element of a character vector and replaces the **first** match of a pattern (use `str_replace_all()` to replace every match). **It outputs a character vector the same length as the input.**

--
* `pattern = "([a-z])([A-Z])"` is regex for "lower-case letter followed by upper-case letter."

--
* `replacement = "\\1 \\2"` is regex for "the first group in parentheses (e.g., w), a space, and the second group in parentheses (e.g., H)."
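
---

### details, on a toy vector

A small sketch of the two regex steps, assuming a made-up character vector `x` that mimics the dot-padded state labels (the vector is hypothetical; only the `stringr` calls come from the slides above).

```r
library(stringr)

# hypothetical labels padded with spaces and dots
x <- c("New Hampshire .....", "Maine ........", "Rhode Island ....")

# step 1: remove every non-word character (spaces and dots)
squished <- str_remove_all(x, pattern = "\\W")
squished

# step 2: put a space back between a lower-case and an upper-case letter
cleaned <- str_replace(squished,
                       pattern = "([a-z])([A-Z])",
                       replacement = "\\1 \\2")
cleaned
```

Both steps return a character vector with the same length as `x`, which is why they work inside `mutate()`.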

---
class: middle, center

# whenever you need to use regex, remember the following...

---

---

# gathering columns

### ~~`gather()`~~

### `pivot_longer()`

---

```r
# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3) %>% 
  mutate(state = str_remove_all(state, pattern = "\\W")) %>%
  mutate(state = str_replace(state, pattern = "([a-z])([A-Z])", 
                             replacement = "\\1 \\2")) %>%
  # gather columns Q1-Q4 into a single variable
  # gather(quarter, percent_change_in_gdp, Q1:Q4) %>%  # gather(): old approach
  pivot_longer(cols = Q1:Q4,                           # pivot_longer(): new approach
               names_to = "quarter", 
               values_to = "percent_change_in_gdp") %>%
  glimpse()
```

```
## Rows: 200
## Columns: 3
## $ state                 <chr> "Connecticut", "Connecticut", "Connecticut", "Co…
## $ quarter               <chr> "Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4", …
## $ percent_change_in_gdp <dbl> -5.5, 3.1, 4.6, 2.4, 3.0, -2.8, 6.8, 2.6, 5.1, 1…
```
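---

### `pivot_longer()` on a toy data frame

A minimal sketch of the reshaping step, assuming a made-up `wide_df` with two states and two quarters (hypothetical values). Each quarter column becomes a set of rows, so two states times two quarters should give four rows.

```r
library(tidyverse)

# hypothetical wide data: one row per state, one column per quarter
wide_df <- tibble(state = c("Maine", "Ohio"),
                  Q1    = c(3.0, 2.4),
                  Q2    = c(-2.8, 1.0))

# long data: one row per state-quarter
long_df <- pivot_longer(wide_df,
                        cols = Q1:Q2,
                        names_to = "quarter",
                        values_to = "percent_change_in_gdp")
long_df
```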
---
class: middle, center

# mutating variables (again)

### `mutate()`

---

```r
# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3) %>% 
  mutate(state = str_remove_all(state, pattern = "\\W")) %>%
  mutate(state = str_replace(state, pattern = "([a-z])([A-Z])", 
                             replacement = "\\1 \\2")) %>%
  pivot_longer(cols = Q1:Q4, 
               names_to = "quarter", 
               values_to = "percent_change_in_gdp") %>%  
  # mutate quarter from character into factor
  mutate(quarter = factor(quarter)) %>%  
  glimpse()
```

```
## Rows: 200
## Columns: 3
## $ state                 <chr> "Connecticut", "Connecticut", "Connecticut", "Co…
## $ quarter               <fct> Q1, Q2, Q3, Q4, Q1, Q2, Q3, Q4, Q1, Q2, Q3, Q4, …
## $ percent_change_in_gdp <dbl> -5.5, 3.1, 4.6, 2.4, 3.0, -2.8, 6.8, 2.6, 5.1, 1…
```

---
class: center, middle

# joining new columns

### `left_join()`

---

![](screenshots/fording-website.png)

---

---

# data on the ideology of U.S. states

--
* On Sept. 12, 2018, I went to Richard Fording's [website](https://rcfording.wordpress.com/state-ideology-data/) and grabbed the [latest version](https://www.dropbox.com/s/rx3fkoq8q61xqmh/stateideology_v2018.dta?dl=0) of [Berry, Ringquist, Fording, and Hanson's](https://rcfording.files.wordpress.com/2018/06/brfh_2015.pdf) measure of state political ideology.

--
* To preserve the raw data file, I changed nothing and uploaded [`stateideology_v2018.dta`](https://drive.google.com/file/d/1y4sgRqf8yq5WNKJ3AOCzrpcM6rKTxtCK/view?usp=sharing) to Google Drive.

--
* To use the data locally, I downloaded `stateideology_v2018.dta` using the *Download* button. I added it to the `data/` subdirectory.

---

```r
# load packages
library(haven)

# read data
fording_df_raw <- read_dta("data/stateideology_v2018.dta") %>%
  glimpse()
```

```
## Rows: 2,901
## Columns: 5
## $ statename    <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "A…
## $ state        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ year         <dbl> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 196…
## $ citi6016     <dbl> 41.192535, 40.845158, 40.705376, 37.155907, 34.978466, 10…
## $ inst6017_nom <dbl> 60.18840, 60.18840, 60.18840, 60.69920, 60.69920, 59.1549…
```
---

```r
# clean data
fording_df <- fording_df_raw %>%
  # keep only the observations from 2017
  filter(year == 2017) %>%
  glimpse()
```

```
## Rows: 50
## Columns: 5
## $ statename    <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California",…
## $ state        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ year         <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201…
## $ citi6016     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ inst6017_nom <dbl> 24.47462, 40.99030, 18.87136, 26.67673, 70.38421, 56.3490…
```
---

```r
# clean data
fording_df <- fording_df_raw %>%
  filter(year == 2017) %>%
  # select the two variables we need
  select(statename, inst6017_nom) %>%  
  glimpse()
```

```
## Rows: 50
## Columns: 2
## $ statename    <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California",…
## $ inst6017_nom <dbl> 24.47462, 40.99030, 18.87136, 26.67673, 70.38421, 56.3490…
```
---

```r
# clean data
fording_df <- fording_df_raw %>%
  filter(year == 2017) %>%
  select(statename, inst6017_nom) %>%
  # rename the variables more sensibly (and to match gdp_df)
  rename(state = statename, ideology = inst6017_nom) %>%
  glimpse()
```

```
## Rows: 50
## Columns: 2
## $ state    <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Co…
## $ ideology <dbl> 24.47462, 40.99030, 18.87136, 26.67673, 70.38421, 56.34900, 6…
```

---

```r
# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3) %>% 
  mutate(state = str_remove_all(state, pattern = "\\W")) %>%
  mutate(state = str_replace(state, pattern = "([a-z])([A-Z])", 
                             replacement = "\\1 \\2")) %>%
  pivot_longer(cols = Q1:Q4, 
               names_to = "quarter", 
               values_to = "percent_change_in_gdp") %>%  
  mutate(quarter = factor(quarter)) %>%
  # join fording_df onto the GDP data (keeps every row on the left)
  left_join(fording_df) %>%
  glimpse()
```

```
## Joining, by = "state"
```

```
## Rows: 200
## Columns: 4
## $ state                 <chr> "Connecticut", "Connecticut", "Connecticut", "Co…
## $ quarter               <fct> Q1, Q2, Q3, Q4, Q1, Q2, Q3, Q4, Q1, Q2, Q3, Q4, …
## $ percent_change_in_gdp <dbl> -5.5, 3.1, 4.6, 2.4, 3.0, -2.8, 6.8, 2.6, 5.1, 1…
## $ ideology              <dbl> 62.50349, 62.50349, 62.50349, 62.50349, 46.28472…
```
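---

### checking a `left_join()` on toy data

`left_join()` keeps every row of the left data frame and fills in `NA` wherever the right data frame has no matching key, so a misspelled state name fails silently. A hedged sketch of one way to check for that, assuming two made-up data frames `left_df` and `right_df` (hypothetical values); `anti_join()` lists the left rows without a match.

```r
library(tidyverse)

# hypothetical data: "New Hampshire" is missing from right_df
left_df  <- tibble(state  = c("Maine", "New Hampshire", "Ohio"),
                   growth = c(1.4, 1.9, 0.9))
right_df <- tibble(state    = c("Maine", "Ohio"),
                   ideology = c(46.3, 30.0))

# every left row survives; the unmatched row gets NA for ideology
joined <- left_join(left_df, right_df, by = "state")
joined

# rows of left_df with no partner in right_df
unmatched <- anti_join(left_df, right_df, by = "state")
unmatched
```

Supplying `by = "state"` names the key explicitly, which also quiets the "Joining, by" message.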

---
class: center, middle

# writing the new data set to a file

### `write_csv()`

and

### `write_rds()`

---

```r
# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3) %>% 
  mutate(state = str_remove_all(state, pattern = "\\W")) %>%
  mutate(state = str_replace(state, pattern = "([a-z])([A-Z])", 
                             replacement = "\\1 \\2")) %>%
  gather(quarter, percent_change_in_gdp, Q1:Q4) %>%  # old approach; rows come out ordered by quarter
  mutate(quarter = factor(quarter)) %>%
  left_join(fording_df) %>%
  # write clean data set to file
  write_rds("data/gdp-2017.rds") %>%  # preserve factors
  write_csv("data/gdp-2017.csv") %>%  # browse on github
  glimpse()
```

```
## Rows: 200
## Columns: 4
## $ state                 <chr> "Connecticut", "Maine", "Massachusetts", "New Ha…
## $ quarter               <fct> Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, …
## $ percent_change_in_gdp <dbl> -5.5, 3.0, 5.1, 9.5, 1.2, 2.0, 2.4, -2.9, -2.3, …
## $ ideology              <dbl> 62.50349, 46.28472, 61.21511, 33.10439, 69.30332…
```
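---

### why write both files?

A small sketch of the difference, assuming a made-up `toy_df` written to temporary files (the paths and values are hypothetical). An `.rds` file stores the R object itself, so the factor should come back as a factor; a `.csv` file is plain text, so the same column should come back as character.

```r
library(tidyverse)

# hypothetical data with a factor column
toy_df <- tibble(quarter = factor(c("Q1", "Q2")),
                 growth  = c(1.4, -0.3))

# temporary files, so nothing in data/ is touched
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

# both writers return their input, so they can sit in a pipeline
toy_df %>%
  write_rds(rds_path) %>%
  write_csv(csv_path)

# read each file back and compare the class of quarter
from_rds <- read_rds(rds_path)
from_csv <- read_csv(csv_path, show_col_types = FALSE)
class(from_rds$quarter)
class(from_csv$quarter)
```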

---

# 🧐 quick check

```r
summary(gdp_df)
```

```
##     state           quarter percent_change_in_gdp    ideology    
##  Length:200         Q1:50   Min.   :-9.700        Min.   :17.78  
##  Class :character   Q2:50   1st Qu.: 0.400        1st Qu.:24.47  
##  Mode  :character   Q3:50   Median : 2.300        Median :37.05  
##                     Q4:50   Mean   : 2.115        Mean   :39.44  
##                             3rd Qu.: 3.525        3rd Qu.:51.93  
##                             Max.   :17.800        Max.   :70.38
```

# 👍 it looks okay
---

```r
# fit model
fit <- lm(percent_change_in_gdp ~ ideology, data = gdp_df)

# print estimates
arm::display(fit, detail = TRUE)
```

```
## lm(formula = percent_change_in_gdp ~ ideology, data = gdp_df)
##             coef.est coef.se t value Pr(>|t|)
## (Intercept) 1.88     0.64    2.93    0.00    
## ideology    0.01     0.02    0.39    0.69    
## ---
## n = 200, k = 2
## residual sd = 3.52, R-Squared = 0.00
```

---

```r
ggplot(gdp_df, aes(x = ideology, y = percent_change_in_gdp)) + 
  geom_smooth(method = "lm") + 
  geom_point() + 
  facet_wrap(vars(quarter))
```

```
## `geom_smooth()` using formula 'y ~ x'
```

![](03-sl-wrangling_files/figure-html/unnamed-chunk-22-1.png)

---

# pwning factors

---

```r
cg_df <- read_rds("data/parties.rds") %>%
  glimpse()
```

```
## Rows: 555
## Columns: 10
## $ country              <chr> "Albania", "Albania", "Albania", "Argentina", "Ar…
## $ year                 <int> 1992, 1996, 1997, 1946, 1951, 1954, 1958, 1960, 1…
## $ average_magnitude    <dbl> 1.00, 1.00, 1.00, 10.53, 10.53, 4.56, 8.13, 4.17,…
## $ eneg                 <dbl> 1.10693, 1.10693, 1.10693, 1.34210, 1.34210, 1.34…
## $ enep                 <dbl> 2.190, 2.785, 2.870, 5.750, 1.970, 1.930, 2.885, …
## $ upper_tier           <dbl> 28.57, 17.86, 25.80, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ en_pres              <dbl> 0.00, 0.00, 0.00, 2.09, 1.96, 1.96, 2.65, 2.65, 3…
## $ proximity            <dbl> 0.00, 0.00, 0.00, 1.00, 1.00, 0.20, 1.00, 0.20, 1…
## $ social_heterogeneity <fct> Bottom 3rd of ENEG, Bottom 3rd of ENEG, Bottom 3r…
## $ electoral_system     <fct> Single-Member District, Single-Member District, S…
```

---

```r
ggplot(cg_df, aes(x = enep, y = country)) + 
  geom_point()
```

![](03-sl-wrangling_files/figure-html/unnamed-chunk-24-1.png)

---

### `reorder()` existing factors

```r
new_cg_df <- cg_df %>%
  mutate(country = reorder(country, average_magnitude))

ggplot(new_cg_df, aes(x = enep, y = country)) + 
  geom_point()
```

![](03-sl-wrangling_files/figure-html/unnamed-chunk-25-1.png)
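
---

### `reorder()` on a toy vector

A minimal sketch of what `reorder()` does to the levels, assuming made-up vectors `country` and `magnitude` (hypothetical values). By default, `reorder()` sorts the levels by the mean of the second argument within each level.

```r
# hypothetical data: country "A" appears twice
country   <- c("A", "A", "B", "C")
magnitude <- c(5, 7, 1, 3)

# default (alphabetical) levels
levels(factor(country))

# levels sorted by mean magnitude: B (1), C (3), A (6)
reordered <- reorder(country, magnitude)
levels(reordered)
```

ggplot2 draws a discrete axis in level order, which is why reordering the factor reorders the plot.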

---

### `cut_*()` into new factors

| Function         | Action                                                              |
|------------------|---------------------------------------------------------------------|
| `cut_interval()` | makes `n` groups with equal range                                   |
| `cut_number()`   | makes `n` groups with (approximately) equal numbers of observations |
| `cut_width()`    | makes groups of width `width`                                       |
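
---

### `cut_*()` on a toy vector

A hedged sketch comparing the three helpers, assuming a made-up numeric vector `x` with one large outlier (hypothetical values). With equal-range groups, the outlier should leave most values in the first group; with equal-number groups, the counts should be roughly balanced.

```r
library(ggplot2)

# hypothetical data with one outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# three groups with equal range
by_interval <- cut_interval(x, n = 3)
table(by_interval)

# three groups with (approximately) equal numbers of observations
by_number <- cut_number(x, n = 3)
table(by_number)

# groups that are each 25 units wide
by_width <- cut_width(x, width = 25)
table(by_width)
```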

---

```r
new_cg_df <- cg_df %>%
  mutate(social_heterogeneity_alt = cut_number(eneg, n = 3))