class: center, middle, inverse, title-slide

# Data Wrangling

---
class: middle, center

# to **wrangle** is to **act**

---
class: middle, center

# wrangling is **hard**

there are 1000s of tasks; you must learn each task

it's not general like `ggplot()`

---
class: middle, center

## follow along by cloning `wrangling` from our POS 5737 organization

---

# to **wrangle** is to...

`read_***()`, `glimpse()`, `filter()`, `select()`, `rename()`, `mutate()`, `pivot_longer()`, `left_join()`, and `write_***()`

![](https://upload.wikimedia.org/wikipedia/commons/6/6f/Twemoji2_1f93a.svg)

---
class: bottom, center

# what we want

--

### 50 states

--

### four quarters

--

### percent change in GDP

--

### state political ideology

---
class: middle, center

# what we want

### 50 states

### four quarters

### percent change in GDP

### state political ideology

---
class: middle, center

<img src="screenshots/clean-data.png" style="width: 100%"/>

---

# data on GDP in the U.S. states

--

* On Sept. 11, 2018, I went to the BEA's website and grabbed their [current release](https://www.bea.gov/data/gdp/gdp-state) of GDP by state.

--

* To preserve the raw data file, I changed nothing and uploaded [`qgdpstate0718_0.xlsx`](https://docs.google.com/spreadsheets/d/1y10LLeyfMir4e2uz87frYb2PVXtS41hXXBqMz828GFQ/edit#gid=556942308) to Google Drive as a Google Sheet.

--

* To use the data locally, I downloaded `qgdpstate0718_0.xlsx` using *File* > *Download as...* > *Comma-separated values (.csv, current sheet)*. I added it to the `data/` subdirectory.
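---

A spreadsheet export like the one described above often carries title rows above the real header. As a minimal sketch (the file contents and the names `raw_lines`, `tmp`, and `toy_df` are made up for illustration and are not the BEA file), one option is to peek at the raw text with `read_lines()` before parsing, then use `skip` to drop the title rows.

```r
library(readr)

# hypothetical stand-in for a spreadsheet export with three title rows
raw_lines <- c(
  "Table 1. A made-up title,,",
  "[Percent change],,",
  ",2017,",
  ",Q1,Q2",
  "Alpha ....,1.5,2.0",
  "Beta .....,-0.3,0.7"
)

# write the toy file to a temporary location
tmp <- tempfile(fileext = ".csv")
write_lines(raw_lines, tmp)

# peek at the first few raw lines before parsing anything
read_lines(tmp, n_max = 4)

# skip the three title rows so the fourth line becomes the header
toy_df <- read_csv(tmp, skip = 3, show_col_types = FALSE)
toy_df
```

In this toy file the first column has no header, so readr fills in a placeholder name such as `...1`.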
---
class: middle, center

<img src="screenshots/raw-data-google.png" style="width: 100%"/>

---
background-image: url("https://upload.wikimedia.org/wikipedia/commons/1/11/Noto_Emoji_Oreo_1f92c.svg")
background-size: cover

---
background-image: url("https://upload.wikimedia.org/wikipedia/commons/6/63/Twemoji2_1f914.svg")
background-size: cover

---
class: middle, center

# reading the data into R

### `read_csv()`

---

```r
# load packages
library(tidyverse)
```

```
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
```

```
## ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
```

```
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
```

---

```r
# read data
gdp_df_raw <- read_csv("data/qgdpstate0718_0 - Table 1.csv")
```

```
## New names:
## Rows: 65 Columns: 8
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (7): Table 1. Percent Change in Real Gross Domestic Product (GDP) by Sta... dbl
## (1): ...2
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
```

---
class: middle, center

<img src="screenshots/raw-data-google.png" style="width: 100%"/>

---

```r
# read data, again
gdp_df_raw <- read_csv("data/qgdpstate0718_0 - Table 1.csv", skip = 3)
```

```
## New names:
## Rows: 62 Columns: 8
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...8 dbl (6): ...2, Q1...3, Q2, Q3, Q4, Q1...7
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
## • `Q1` -> `Q1...3`
## • `Q1` -> `Q1...7`
## • `` -> `...8`
```

---

```r
# quick look
glimpse(gdp_df_raw)
```

```
## Rows: 62
## Columns: 8
## $ ...1   <chr> "United States1 …………..", "New England ……………", "Connecticut ……………
## $ ...2   <dbl> 2.1, 1.6, -0.2, 1.4, 2.6, 1.9, 1.6, 1.1, 1.3, 1.6, 1.7, 1.5, 0.…
## $ Q1...3 <dbl> 0.9, 2.2, -5.5, 3.0, 5.1, 9.5, 1.2, 2.0, 0.8, 2.4, 2.8, -2.9, -…
## $ Q2     <dbl> 2.7, 1.2, 3.1, -2.8, 1.0, -2.9, 4.2, 0.8, 0.7, -9.7, -1.7, 2.6,…
## $ Q3     <dbl> 3.3, 5.1, 4.6, 6.8, 5.2, 7.9, 4.5, 0.5, 4.1, 13.1, 12.2, 5.4, 2…
## $ Q4     <dbl> 2.7, 2.5, 2.4, 2.6, 2.4, 2.5, 2.7, 2.3, 1.1, 0.2, 1.2, 1.2, 2.1…
## $ Q1...7 <dbl> 1.8, 1.5, 1.6, 0.6, 1.5, 1.3, 1.3, 2.6, 1.5, 1.3, 2.0, 1.5, 1.6…
## $ ...8   <chr> ".......", ".......", "23", "46", "29", "36", "35", "8", ".....…
```

---
class: center, middle

# filtering rows

### `filter()`

---

```r
# filter the rows we want
gdp_df <- filter(gdp_df_raw, ...8 != ".......")

# quick look
glimpse(gdp_df)
```

```
## Rows: 50
## Columns: 8
## $ ...1   <chr> "Connecticut …………….", "Maine ……………………", "Massachusetts …………", "…
## $ ...2   <dbl> -0.2, 1.4, 2.6, 1.9, 1.6, 1.1, 1.6, 1.5, 0.9, 1.1, 1.8, 1.2, 2.…
## $ Q1...3 <dbl> -5.5, 3.0, 5.1, 9.5, 1.2, 2.0, 2.4, -2.9, -2.3, 3.5, -0.5, -2.1…
## $ Q2     <dbl> 3.1, -2.8, 1.0, -2.9, 4.2, 0.8, -9.7, 2.6, 3.4, -1.3, 3.2, 1.9,…
## $ Q3     <dbl> 4.6, 6.8, 5.2, 7.9, 4.5, 0.5, 13.1, 5.4, 2.8, 3.0, 4.5, 3.6, 1.…
## $ Q4     <dbl> 2.4, 2.6, 2.4, 2.5, 2.7, 2.3, 0.2, 1.2, 2.1, 0.0, 2.5, 2.7, 2.7…
## $ Q1...7 <dbl> 1.6, 0.6, 1.5, 1.3, 1.3, 2.6, 1.3, 1.5, 1.6, 1.1, 2.0, 1.4, 1.3…
## $ ...8   <chr> "23", "46", "29", "36", "35", "8", "38", "30", "25", "41", "19"…
```

---
class: center, middle

# selecting columns

### `select()`

---

```r
# select the columns we want
gdp2_df <- select(gdp_df, -...2, -Q1...7, -...8)

# quick look
glimpse(gdp2_df)
```

```
## Rows: 50
## Columns: 5
## $ ...1   <chr> "Connecticut …………….", "Maine ……………………", "Massachusetts …………", "…
## $ Q1...3 <dbl> -5.5, 3.0, 5.1, 9.5, 1.2, 2.0, 2.4, -2.9, -2.3, 3.5, -0.5, -2.1…
## $ Q2     <dbl> 3.1, -2.8, 1.0, -2.9, 4.2, 0.8, -9.7, 2.6, 3.4, -1.3, 3.2, 1.9,…
## $ Q3     <dbl> 4.6, 6.8, 5.2, 7.9, 4.5, 0.5, 13.1, 5.4, 2.8, 3.0, 4.5, 3.6, 1.…
## $ Q4     <dbl> 2.4, 2.6, 2.4, 2.5, 2.7, 2.3, 0.2, 1.2, 2.1, 0.0, 2.5, 2.7, 2.7…
```

---
class: center, middle

# renaming columns

### `rename()`

---

```r
# rename the poorly named variables
gdp3_df <- rename(gdp2_df, state = ...1, Q1 = Q1...3)

# quick look
glimpse(gdp3_df)
```

```
## Rows: 50
## Columns: 5
## $ state <chr> "Connecticut …………….", "Maine ……………………", "Massachusetts …………", "N…
## $ Q1    <dbl> -5.5, 3.0, 5.1, 9.5, 1.2, 2.0, 2.4, -2.9, -2.3, 3.5, -0.5, -2.1,…
## $ Q2    <dbl> 3.1, -2.8, 1.0, -2.9, 4.2, 0.8, -9.7, 2.6, 3.4, -1.3, 3.2, 1.9, …
## $ Q3    <dbl> 4.6, 6.8, 5.2, 7.9, 4.5, 0.5, 13.1, 5.4, 2.8, 3.0, 4.5, 3.6, 1.4…
## $ Q4    <dbl> 2.4, 2.6, 2.4, 2.5, 2.7, 2.3, 0.2, 1.2, 2.1, 0.0, 2.5, 2.7, 2.7,…
```

---
class: middle, center

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a3/Pause-thin-rounded-button.svg/2000px-Pause-thin-rounded-button.svg.png" style="width: 70%"/>

---
class: middle, center

# three **actions** so far

### `filter()`

### `select()`

### `rename()`

---

# All Together Now!

```r
# read data
gdp_df_raw <- read_csv("data/qgdpstate0718_0 - Table 1.csv", skip = 3)

# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......")
gdp2_df <- select(gdp_df, -...2, -Q1...7, -...8)
gdp3_df <- rename(gdp2_df, state = ...1, Q1 = Q1...3)

# quick look
glimpse(gdp3_df)
```

```
## Rows: 50
## Columns: 5
## $ state <chr> "Connecticut …………….", "Maine ……………………", "Massachusetts …………", "N…
## $ Q1    <dbl> -5.5, 3.0, 5.1, 9.5, 1.2, 2.0, 2.4, -2.9, -2.3, 3.5, -0.5, -2.1,…
## $ Q2    <dbl> 3.1, -2.8, 1.0, -2.9, 4.2, 0.8, -9.7, 2.6, 3.4, -1.3, 3.2, 1.9, …
## $ Q3    <dbl> 4.6, 6.8, 5.2, 7.9, 4.5, 0.5, 13.1, 5.4, 2.8, 3.0, 4.5, 3.6, 1.4…
## $ Q4    <dbl> 2.4, 2.6, 2.4, 2.5, 2.7, 2.3, 0.2, 1.2, 2.1, 0.0, 2.5, 2.7, 2.7,…
```

---
class: middle, center

# This whole `gdp_df`, `gdp2_df`, `gdp3_df`, ... isn't very satisfying.

---
class: middle

# Notice that the data frame output from one function is the input to the next...

```
# read data
gdp_df_raw <- read_csv("data/qgdpstate0718_0 - Table 1.csv", skip = 3)

# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......")
gdp2_df <- select(gdp_df, -...2, -Q1...7, -...8)
gdp3_df <- rename(gdp2_df, state = ...1, Q1 = Q1...3)
```

---
class: middle, center

# enter the pipe

<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/pipe.png" style="width: 50%"/>

---

# All Together Now! With a Pipe... 😎

```r
# read data
gdp_df_raw <- read_csv("data/qgdpstate0718_0 - Table 1.csv", skip = 3)

# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3)
```

---
class: middle, center

# mutating variables

### `mutate()`

---

```r
# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3) %>%
  # mutate state (remove all non-word characters, e.g., spaces and dots)
  mutate(state = str_remove_all(state, pattern = "\\W")) %>%
  glimpse()
```

```
## Rows: 50
## Columns: 5
## $ state <chr> "Connecticut", "Maine", "Massachusetts", "NewHampshire", "RhodeI…
## $ Q1    <dbl> -5.5, 3.0, 5.1, 9.5, 1.2, 2.0, 2.4, -2.9, -2.3, 3.5, -0.5, -2.1,…
## $ Q2    <dbl> 3.1, -2.8, 1.0, -2.9, 4.2, 0.8, -9.7, 2.6, 3.4, -1.3, 3.2, 1.9, …
## $ Q3    <dbl> 4.6, 6.8, 5.2, 7.9, 4.5, 0.5, 13.1, 5.4, 2.8, 3.0, 4.5, 3.6, 1.4…
## $ Q4    <dbl> 2.4, 2.6, 2.4, 2.5, 2.7, 2.3, 0.2, 1.2, 2.1, 0.0, 2.5, 2.7, 2.7,…
```

---

```r
# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3) %>%
  mutate(state = str_remove_all(state, pattern = "\\W")) %>%
  # add a space before an upper-case letter preceded by a lower-case letter
  mutate(state = str_replace(state, pattern = "([a-z])([A-Z])",
                             replacement = "\\1 \\2")) %>%
  glimpse()
```

```
## Rows: 50
## Columns: 5
## $ state <chr> "Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode…
## $ Q1    <dbl> -5.5, 3.0, 5.1, 9.5, 1.2, 2.0, 2.4, -2.9, -2.3, 3.5, -0.5, -2.1,…
## $ Q2    <dbl> 3.1, -2.8, 1.0, -2.9, 4.2, 0.8, -9.7, 2.6, 3.4, -1.3, 3.2, 1.9, …
## $ Q3    <dbl> 4.6, 6.8, 5.2, 7.9, 4.5, 0.5, 13.1, 5.4, 2.8, 3.0, 4.5, 3.6, 1.4…
## $ Q4    <dbl> 2.4, 2.6, 2.4, 2.5, 2.7, 2.3, 0.2, 1.2, 2.1, 0.0, 2.5, 2.7, 2.7,…
```

---

### details

--

```
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3) %>%
  mutate(state = str_remove_all(state, pattern = "\\W")) %>%
  mutate(state = str_replace(state, pattern = "([a-z])([A-Z])",
                             replacement = "\\1 \\2")) %>%
  glimpse()
```

--

* `mutate()` changes a variable in a data frame, but **returns a data frame**.

--

* `str_remove_all()` looks at every character in each element of a character vector and removes the characters that match a pattern. **It outputs a character vector the same length as the input.**

--

* `pattern = "\\W"` is regex for "non-word characters" (anything other than letters, digits, and underscores).

--

* `str_replace()` looks at each element of a character vector and replaces the first match of a pattern. **It outputs a character vector the same length as the input.**

--

* `pattern = "([a-z])([A-Z])"` is regex for "lower-case letter followed by upper-case letter."

--

* `replacement = "\\1 \\2"` is regex for "the first match in parentheses (e.g., w), a space, and the second match in parentheses (e.g., H)."

---
class: middle, center

# whenever you need to use regex, remember the following...

---
class: middle, center

<img src="https://upload.wikimedia.org/wikipedia/commons/2/2f/Google_2015_logo.svg" style="width: 100%"/>

---
class: middle, center

# gathering columns

### ~~`gather()`~~

### `pivot_longer()`

---

```r
# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3) %>%
  mutate(state = str_remove_all(state, pattern = "\\W")) %>%
  mutate(state = str_replace(state, pattern = "([a-z])([A-Z])",
                             replacement = "\\1 \\2")) %>%
  # gather columns Q1-Q4 into a single variable
  # gather(quarter, percent_change_in_gdp, Q1:Q4) %>%  # gather(): old approach
  pivot_longer(cols = Q1:Q4,  # pivot_longer(): new approach
               names_to = "quarter",
               values_to = "percent_change_in_gdp") %>%
  glimpse()
```

```
## Rows: 200
## Columns: 3
## $ state                 <chr> "Connecticut", "Connecticut", "Connecticut", "Co…
## $ quarter               <chr> "Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4", …
## $ percent_change_in_gdp <dbl> -5.5, 3.1, 4.6, 2.4, 3.0, -2.8, 6.8, 2.6, 5.1, 1…
```

---
class: middle, center

# mutating variables (again)

### `mutate()`

---

```r
# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3) %>%
  mutate(state = str_remove_all(state, pattern = "\\W")) %>%
  mutate(state = str_replace(state, pattern = "([a-z])([A-Z])",
                             replacement = "\\1 \\2")) %>%
  pivot_longer(cols = Q1:Q4,
               names_to = "quarter",
               values_to = "percent_change_in_gdp") %>%
  # mutate quarter from character into factor
  mutate(quarter = factor(quarter)) %>%
  glimpse()
```

```
## Rows: 200
## Columns: 3
## $ state                 <chr> "Connecticut", "Connecticut", "Connecticut", "Co…
## $ quarter               <fct> Q1, Q2, Q3, Q4, Q1, Q2, Q3, Q4, Q1, Q2, Q3, Q4, …
## $ percent_change_in_gdp <dbl> -5.5, 3.1, 4.6, 2.4, 3.0, -2.8, 6.8, 2.6, 5.1, 1…
```

---
class: center, middle

# joining new columns

### `left_join()`

---
class: center, middle

![](screenshots/fording-website.png)

---
class: center, middle

<img src="screenshots/fording-codebook.png" style="width: 90%"/>

---

# data on the ideology of U.S. states

--

* On Sept. 12, 2018, I went to Richard Fording's [website](https://rcfording.wordpress.com/state-ideology-data/) and grabbed the [latest version](https://www.dropbox.com/s/rx3fkoq8q61xqmh/stateideology_v2018.dta?dl=0) of [Berry, Ringquist, Fording, and Hanson's](https://rcfording.files.wordpress.com/2018/06/brfh_2015.pdf) measure of state political ideology.

--

* To preserve the raw data file, I changed nothing and uploaded [`stateideology_v2018.dta`](https://drive.google.com/file/d/1y4sgRqf8yq5WNKJ3AOCzrpcM6rKTxtCK/view?usp=sharing) to Google Drive.

--

* To use the data locally, I downloaded `stateideology_v2018.dta` using the *Download* button. I added it to the `data/` subdirectory.
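---

Because the ideology measure will be joined to the GDP data by state name, the names in the two data frames need to match exactly. As a rough sketch with made-up data (the tibbles `gdp_toy` and `ideo_toy` and all of their values are hypothetical), `left_join()` keeps every row of the left data frame and fills in `NA` where no match exists, and `anti_join()` is one way to list the keys that failed to match.

```r
library(dplyr)
library(tibble)

# hypothetical toy data; the values are invented for illustration
gdp_toy <- tibble(state  = c("Maine", "New Hampshire", "Ohio"),
                  growth = c(3.0, 9.5, 1.1))
ideo_toy <- tibble(state    = c("Maine", "New Hampshire", "Texas"),
                   ideology = c(46.3, 33.1, 20.0))

# keep every row of gdp_toy; add ideology where the state names match
joined <- left_join(gdp_toy, ideo_toy, by = "state")
joined

# rows of gdp_toy with no partner in ideo_toy (candidates for name cleanup)
unmatched <- anti_join(gdp_toy, ideo_toy, by = "state")
unmatched
```

Supplying `by = "state"` explicitly also avoids relying on the join guessing the key column.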
---

```r
# load packages
library(haven)

# read data
fording_df_raw <- read_dta("data/stateideology_v2018.dta") %>%
  glimpse()
```

```
## Rows: 2,901
## Columns: 5
## $ statename    <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "A…
## $ state        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ year         <dbl> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 196…
## $ citi6016     <dbl> 41.192535, 40.845158, 40.705376, 37.155907, 34.978466, 10…
## $ inst6017_nom <dbl> 60.18840, 60.18840, 60.18840, 60.69920, 60.69920, 59.1549…
```

---

```r
# clean data
fording_df <- fording_df_raw %>%
  # keep only the observations from 2017
  filter(year == 2017) %>%
  glimpse()
```

```
## Rows: 50
## Columns: 5
## $ statename    <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California",…
## $ state        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ year         <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201…
## $ citi6016     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ inst6017_nom <dbl> 24.47462, 40.99030, 18.87136, 26.67673, 70.38421, 56.3490…
```

---

```r
# clean data
fording_df <- fording_df_raw %>%
  filter(year == 2017) %>%
  # select the two variables we need
  select(statename, inst6017_nom) %>%
  glimpse()
```

```
## Rows: 50
## Columns: 2
## $ statename    <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California",…
## $ inst6017_nom <dbl> 24.47462, 40.99030, 18.87136, 26.67673, 70.38421, 56.3490…
```

---

```r
# clean data
fording_df <- fording_df_raw %>%
  filter(year == 2017) %>%
  select(statename, inst6017_nom) %>%
  # rename the variables more sensibly (and to match gdp_df)
  rename(state = statename, ideology = inst6017_nom) %>%
  glimpse()
```

```
## Rows: 50
## Columns: 2
## $ state    <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Co…
## $ ideology <dbl> 24.47462, 40.99030, 18.87136, 26.67673, 70.38421, 56.34900, 6…
```

---

```r
# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3) %>%
  mutate(state = str_remove_all(state, pattern = "\\W")) %>%
  mutate(state = str_replace(state, pattern = "([a-z])([A-Z])",
                             replacement = "\\1 \\2")) %>%
  pivot_longer(cols = Q1:Q4,
               names_to = "quarter",
               values_to = "percent_change_in_gdp") %>%
  mutate(quarter = factor(quarter)) %>%
  # join fording_df onto the GDP data (the left data frame)
  left_join(fording_df) %>%
  glimpse()
```

```
## Joining, by = "state"
```

```
## Rows: 200
## Columns: 4
## $ state                 <chr> "Connecticut", "Connecticut", "Connecticut", "Co…
## $ quarter               <fct> Q1, Q2, Q3, Q4, Q1, Q2, Q3, Q4, Q1, Q2, Q3, Q4, …
## $ percent_change_in_gdp <dbl> -5.5, 3.1, 4.6, 2.4, 3.0, -2.8, 6.8, 2.6, 5.1, 1…
## $ ideology              <dbl> 62.50349, 62.50349, 62.50349, 62.50349, 46.28472…
```

---
class: center, middle

# writing the new data set to a file

### `write_csv()` and

### `write_rds()`

---

```r
# clean data
gdp_df <- filter(gdp_df_raw, ...8 != ".......") %>%
  select(-...2, -Q1...7, -...8) %>%
  rename(state = ...1, Q1 = Q1...3) %>%
  mutate(state = str_remove_all(state, pattern = "\\W")) %>%
  mutate(state = str_replace(state, pattern = "([a-z])([A-Z])",
                             replacement = "\\1 \\2")) %>%
  gather(quarter, percent_change_in_gdp, Q1:Q4) %>%
  mutate(quarter = factor(quarter)) %>%
  left_join(fording_df) %>%
  # write clean data set to file
  write_rds("data/gdp-2017.rds") %>%  # preserve factors
  write_csv("data/gdp-2017.csv") %>%  # browse on github
  glimpse()
```

```
## Rows: 200
## Columns: 4
## $ state                 <chr> "Connecticut", "Maine", "Massachusetts", "New Ha…
## $ quarter               <fct> Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, …
## $ percent_change_in_gdp <dbl> -5.5, 3.0, 5.1, 9.5, 1.2, 2.0, 2.4, -2.9, -2.3, …
## $ ideology              <dbl> 62.50349, 46.28472, 61.21511, 33.10439, 69.30332…
```

---

# 🧐quick check

```r
summary(gdp_df)
```

```
##     state           quarter percent_change_in_gdp    ideology
##  Length:200         Q1:50   Min.   :-9.700        Min.   :17.78
##  Class :character   Q2:50   1st Qu.: 0.400        1st Qu.:24.47
##  Mode  :character   Q3:50   Median : 2.300        Median :37.05
##                     Q4:50   Mean   : 2.115        Mean   :39.44
##                             3rd Qu.: 3.525        3rd Qu.:51.93
##                             Max.   :17.800        Max.   :70.38
```

# 👍it looks okay

---

```r
# fit model
fit <- lm(percent_change_in_gdp ~ ideology, data = gdp_df)

# print estimates
arm::display(fit, detail = TRUE)
```

```
## lm(formula = percent_change_in_gdp ~ ideology, data = gdp_df)
##             coef.est coef.se t value Pr(>|t|)
## (Intercept) 1.88     0.64    2.93    0.00
## ideology    0.01     0.02    0.39    0.69
## ---
## n = 200, k = 2
## residual sd = 3.52, R-Squared = 0.00
```

---

```r
ggplot(gdp_df, aes(x = ideology, y = percent_change_in_gdp)) +
  geom_smooth(method = "lm") +
  geom_point() +
  facet_wrap(vars(quarter))
```

```
## `geom_smooth()` using formula 'y ~ x'
```

![](03-sl-wrangling_files/figure-html/unnamed-chunk-22-1.png)<!-- -->

---
class: middle, center

# pwning factors

---

```r
cg_df <- read_rds("data/parties.rds") %>%
  glimpse()
```

```
## Rows: 555
## Columns: 10
## $ country              <chr> "Albania", "Albania", "Albania", "Argentina", "Ar…
## $ year                 <int> 1992, 1996, 1997, 1946, 1951, 1954, 1958, 1960, 1…
## $ average_magnitude    <dbl> 1.00, 1.00, 1.00, 10.53, 10.53, 4.56, 8.13, 4.17,…
## $ eneg                 <dbl> 1.10693, 1.10693, 1.10693, 1.34210, 1.34210, 1.34…
## $ enep                 <dbl> 2.190, 2.785, 2.870, 5.750, 1.970, 1.930, 2.885, …
## $ upper_tier           <dbl> 28.57, 17.86, 25.80, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ en_pres              <dbl> 0.00, 0.00, 0.00, 2.09, 1.96, 1.96, 2.65, 2.65, 3…
## $ proximity            <dbl> 0.00, 0.00, 0.00, 1.00, 1.00, 0.20, 1.00, 0.20, 1…
## $ social_heterogeneity <fct> Bottom 3rd of ENEG, Bottom 3rd of ENEG, Bottom 3r…
## $ electoral_system     <fct> Single-Member District, Single-Member District, S…
```

---

```r
ggplot(cg_df, aes(x = enep, y = country)) +
  geom_point()
```

![](03-sl-wrangling_files/figure-html/unnamed-chunk-24-1.png)<!-- -->

---

### `reorder()` existing factors

```r
new_cg_df <- cg_df %>%
  mutate(country = reorder(country, average_magnitude))

ggplot(new_cg_df, aes(x = enep, y = country)) +
  geom_point()
```

![](03-sl-wrangling_files/figure-html/unnamed-chunk-25-1.png)<!-- -->

---

### `cut_*()` into new factors

| Function         | Action                                                              |
|------------------|---------------------------------------------------------------------|
| `cut_interval()` | makes `n` groups with equal range                                   |
| `cut_number()`   | makes `n` groups with (approximately) equal numbers of observations |
| `cut_width()`    | makes groups of width `width`                                       |

---

```r
new_cg_df <- cg_df %>%
  mutate(social_heterogeneity_alt = cut_number(eneg, n = 3))

ggplot(new_cg_df, aes(x = enep)) +
  geom_histogram() +
  facet_wrap(vars(social_heterogeneity_alt))
```

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

![](03-sl-wrangling_files/figure-html/unnamed-chunk-26-1.png)<!-- -->

---

```r
new_cg_df <- cg_df %>%
  mutate(social_heterogeneity_alt = cut_number(eneg, n = 5))

ggplot(new_cg_df, aes(x = enep)) +
  geom_histogram() +
  facet_wrap(vars(social_heterogeneity_alt))
```

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

![](03-sl-wrangling_files/figure-html/unnamed-chunk-27-1.png)<!-- -->

---

```r
new_cg_df <- cg_df %>%
  mutate(social_heterogeneity_alt = cut_number(eneg, n = 2))

ggplot(new_cg_df, aes(x = enep)) +
  geom_histogram() +
  facet_wrap(vars(social_heterogeneity_alt))
```

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

![](03-sl-wrangling_files/figure-html/unnamed-chunk-28-1.png)<!-- -->

---

```r
new_cg_df <- cg_df %>%
  mutate(social_heterogeneity_alt = cut_interval(eneg, n = 3))

ggplot(new_cg_df, aes(x = enep)) +
  geom_histogram() +
  facet_wrap(vars(social_heterogeneity_alt))
```

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

![](03-sl-wrangling_files/figure-html/unnamed-chunk-29-1.png)<!-- -->

---

```r
new_cg_df <- cg_df %>%
  mutate(social_heterogeneity_alt = cut_width(eneg, width = 2))

ggplot(new_cg_df, aes(x = enep)) +
  geom_histogram() +
  facet_wrap(vars(social_heterogeneity_alt))
```

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

![](03-sl-wrangling_files/figure-html/unnamed-chunk-30-1.png)<!-- -->

---
class: middle, center

### more factor pwnage: forcats package (see the cheatsheet)

---

### `case_when()` for manual factor creation

```r
starwars %>%
  select(name:mass, gender, species) %>%
  mutate(type = case_when(height > 200 | mass > 200 ~ "large",
                          species == "Droid" ~ "robot",
                          TRUE ~ "other")) %>%
  glimpse()
```

```
## Rows: 87
## Columns: 6
## $ name    <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organ…
## $ height  <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 228,…
## $ mass    <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.0, …
## $ gender  <chr> "masculine", "masculine", "masculine", "masculine", "feminine"…
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Human",…
## $ type    <chr> "other", "robot", "robot", "large", "other", "other", "other",…
```
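---

Note that `case_when()` as used above returns a character vector (the `<chr>` in the output), not a factor. A minimal sketch with made-up data (the tibble `toy` and its values are hypothetical) shows two details that may be worth keeping in mind: conditions are checked in order, so the first match wins, and wrapping the result in `factor()` is one way to get an actual factor with a chosen level order.

```r
library(dplyr)
library(tibble)

# hypothetical toy data, loosely shaped like starwars
toy <- tibble(
  name    = c("a", "b", "c", "d"),
  height  = c(210, 100, 170, NA),
  species = c("Droid", "Droid", "Human", "Human")
)

toy2 <- toy %>%
  mutate(
    # conditions are evaluated top to bottom; a tall droid is "large"
    type = case_when(height > 200 ~ "large",
                     species == "Droid" ~ "robot",
                     TRUE ~ "other"),
    # convert the character result into a factor with explicit levels
    type = factor(type, levels = c("other", "robot", "large"))
  )

toy2
levels(toy2$type)
```

A missing `height` makes the first condition `NA` rather than `TRUE`, so that row falls through to the later conditions.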