Week 3: Data Wrangling

Conceptual Homework

Describe, in words, what the following data wrangling functions do. For each function, note (1) the purpose/job of the function, (2) the main arguments, and (3) the package that contains the function. You should be able to find all of this information in the help file for the function (e.g., ?read_csv).

read_csv(), read_dta(), read_rds(), read_excel()
glimpse()
rename()
select()
filter()
mutate()
separate()
pivot_longer()
pivot_wider()
left_join(), right_join(), inner_join(), full_join()
bind_rows()
bind_cols()

Optional: For some of the more subtle functions (e.g., pivot_longer()) you might create a small example that illustrates how the function works. You can use an example from the help file or make your own.
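
As one possible small example of this kind, the sketch below reshapes a made-up data set with pivot_longer() and then undoes the change with pivot_wider(). The data frame `wide` and the column names `country`, `year`, and `gdp` are hypothetical and chosen only for illustration; the help files contain other examples.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per country, one column per year.
wide <- tribble(
  ~country, ~`2019`, ~`2020`,
  "A",      10,      12,
  "B",      20,      25
)

# pivot_longer() stacks the year columns into name/value pairs,
# giving one row per country-year.
long <- pivot_longer(
  wide,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "gdp"
)
long

# pivot_wider() reverses the operation: one column per year again.
wide_again <- pivot_wider(long, names_from = year, values_from = gdp)
wide_again
```

Note that pivot_longer() stores the old column names as character values, so `year` would need a further mutate() step to become numeric.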

Computational Homework

In this computational homework, we are going to transform the data set you identified in HW 1 from a (probably) huge mess into a tidy, usable data set (see, e.g., Broman and Woo’s rules). You will use this tidy data set for the exercises moving forward, so make sure to get a variety of variables (numeric, categorical, binary, dates, names, etc.) that you’re interested in.

A large fraction (maybe 80-90%) of applied statistics is data wrangling. Data wrangling is hard because there are hundreds of potential tasks and you must learn each separately. There isn’t a general pattern, as there is with ggplot.

Whenever a data set isn’t quite how you want it, you must identify a way to get it slightly closer. Once you identify the small, desired improvement, use Google extensively, try examples, experiment, etc. Get it working, then make another small improvement. The optimal path from messy to pristine isn’t always clear, but if you repeatedly get closer, you’ll find the way.

Hadley’s book R for Data Science has a whole section on wrangling. Stat 545 at UBC has a nice collection of notes, especially for managing factors.

This week, your task is straightforward-to-describe but formidable-to-complete. Expect to spend several frustrating hours struggling through it. Depending on your application, you might have a clean, tidy data set at the end of the week. Or you might have made a bit of progress toward that goal. As you work, refer to some of the resources below:

data wrangling (combines tidyr and dplyr; deprecated, but still my fav) [pdf]
data import (tidyr) [download pdf]
data transformation (dplyr) [download pdf]
working with strings and regex (stringr) [download pdf]
working with factors (forcats) [download pdf]
working with dates and/or times (lubridate) [download pdf]
there’s more here!

Feel free to copy over any files or text from HW 1 as you need. You will use this data set over the next several weeks, so make smart project-management choices (use good names, keep code well-documented, commit at reasonable intervals, etc.).

Depending on the number of obstacles you encounter, you might sail smoothly through all of this assignment and easily create a wonderfully tidy data set. Or you might struggle and make only a little progress. The latter is fine if you’re asking questions and making progress.

Do the following:

Prepare.
1. Open RStudio and start a new RStudio Project. This is Homework 3, so name it appropriately and stay organized. Initialize git.
2. Open GitHub Desktop, add the local repo you just created, and publish the initial files up to GitHub as a part of the pos5737 organization. Regardless of what you called your project directory above (on your local computer), please name the repo hw03-first-last on GitHub. (Note: The project directory on your computer and the repo name on GitHub do not need to match. In practice, they usually will. To help us find each other’s work, we’re naming them differently this one time.)
Get to Work!
1. Create a data/raw/ subdirectory of your project. Add your raw data sets (probably just copying them over from HW 1) to the directory. Do not change anything about these data sets, including the filenames. If you want to be especially careful, lock the raw data sets to prevent accidental changes. Add notes to README.md explaining how you obtained (and how others can obtain) the raw data sets (include links!). It’s usually helpful to include the dates that you downloaded the raw data. Document this well for your future self and others checking your work.
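
The directory setup and the optional locking step can be sketched in base R as follows. This is only an illustration under stated assumptions: the temporary directory stands in for your project root, the file name `survey-raw.csv` and its contents are made up, and Sys.chmod() is just one way to lock a file (on Windows it only toggles the read-only flag).

```r
# Stand-in for the project root (assumption for this sketch); in your own
# work, run the code from the top level of your RStudio Project instead.
project_dir <- file.path(tempdir(), "hw03-example")

# Create data/raw/ (and data/ along the way).
raw_dir <- file.path(project_dir, "data", "raw")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)

# Hypothetical raw file; normally you would copy your real raw data here
# without changing its contents or its filename.
raw_file <- file.path(raw_dir, "survey-raw.csv")
writeLines(c("id,score", "1,3.5", "2,4.0"), raw_file)

# Lock the raw file by making it read-only.
Sys.chmod(raw_file, mode = "0444")

# Inspect the file permissions.
file.info(raw_file)$mode
```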
Wrangle. In a thoroughly commented R script, do the following:
1. Wrangle the data into a clean, tidy data frame.
  - Hint 1: This might be the hardest, most frustrating thing you’ve ever done in school. Break it into steps, fight through it, and ask questions. Wrangling requires knowing a thousand different tools to accomplish a thousand frustratingly-simple tasks. It’s like a super-complicated maze. You’ll just have to experiment until you get something working. If you know what you need to do, but aren’t sure what function you should use, try Googling. If that doesn’t work, ask me for help. The packages tidyr and dplyr contain most of the functions we’ll want. stringr helps us work with strings, lubridate with dates, and forcats with factors.
  - Hint 2: In order to use a function, you need to understand how it works, so take time to study the help file and the examples before trying to use it. Use Google to find more examples.
  - Hint 3: Use glimpse() a lot to understand how you’re changing your data frame. Include descriptions as comments to help you check and update your work later. It’s probably good practice to have a glimpse() before and after your last action in the pipe sequence so you can easily see exactly what you are changing.
  - Hint 4: The cut_number() and cut_interval() functions might be helpful if you want to convert a numeric variable into a factor. (You can facet, color, and fill histograms by factors, for example.)
  - Hint 5: You’ll ultimately want to use a piped workflow, but you might find it easier to get an action working outside the pipe and then add it in once it works.
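2. Save the clean, tidy data set as .rds and .csv files in data/. (Use write_rds() then write_csv(), in that order. A CSV file stores factors as plain text, so the factor levels and their order are lost when the CSV is read back in; the .rds file preserves them.)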
3. Make sure you have an up-to-date CSV version of your data dictionary.
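
The wrangling and saving steps above might look something like the following sketch. Everything specific here is an assumption for illustration: the data frame `raw`, its columns, the file names `clean.rds` and `clean.csv`, and the temporary output directory that stands in for data/. Your own script will need different steps.

```r
library(dplyr)
library(readr)
library(ggplot2)  # provides cut_number()

# Hypothetical messy data standing in for your raw data set.
raw <- tibble(
  `Respondent ID` = 1:6,
  AGE = c(23, 35, 47, 51, 62, 29),
  party = c("dem", "rep", "ind", "dem", "rep", "ind")
)

glimpse(raw)  # look at the data before changing it

clean <- raw %>%
  # use short, consistent variable names
  rename(id = `Respondent ID`, age = AGE) %>%
  mutate(
    # store party as a factor with a deliberate level order
    party = factor(party, levels = c("dem", "rep", "ind")),
    # convert age into a factor with three roughly equal-sized groups
    age_group = cut_number(age, n = 3)
  )

glimpse(clean)  # look again after the change

# Stand-in for data/ (assumption for this sketch).
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE)

# Save both versions.
write_rds(clean, file.path(out_dir, "clean.rds"))
write_csv(clean, file.path(out_dir, "clean.csv"))

# Read both back in and compare how the factor survives.
from_rds <- read_rds(file.path(out_dir, "clean.rds"))
from_csv <- read_csv(file.path(out_dir, "clean.csv"), show_col_types = FALSE)

class(from_rds$party)  # "factor"
class(from_csv$party)  # "character"
```

The last two lines show why the .rds version is worth keeping: the factor and its level order come back intact from the .rds file, while the CSV returns plain character data.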
Check your work. Make sure that you (1) have several variables of a variety of types, (2) follow Broman and Woo’s rules, and (3) include a CSV data set and data dictionary.
Plot. In a separate R script, make a small plot. Don’t worry about polishing it or finding exactly the best figure. I just want an initial, preliminary effort to understand your data. Don’t spend much time on this.
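
A preliminary plotting script can be very short. The sketch below is one possibility, not a required approach: the data frame `clean` is a made-up stand-in (in your script you would read in your own .rds file instead), and the output file name `preliminary-figure.png` and the temporary directory are assumptions.

```r
library(ggplot2)

# Hypothetical stand-in for the clean data; in your own script, load the
# .rds file you saved in data/ instead.
clean <- data.frame(
  age = c(23, 35, 47, 51, 62, 29),
  party = factor(c("dem", "rep", "ind", "dem", "rep", "ind"))
)

# A quick histogram of a numeric variable, faceted by a factor.
p <- ggplot(clean, aes(x = age)) +
  geom_histogram(bins = 5) +
  facet_wrap(vars(party))

# Save the figure as a PNG. The temporary directory is a stand-in; in your
# project, save the file somewhere inside the repo.
out_file <- file.path(tempdir(), "preliminary-figure.png")
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 150)
```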
Write. In the README, briefly describe your data set. Link to the CSV version of the data dictionary. Link to the CSV version of the data set. If you like the figure you made above, you can include it in the README.md (see the “Images” section here).

Check your work. When you are done, make sure you have the following files, all pushed up to GitHub:

the original, unchanged, raw data sets in data/raw/
careful notes about the raw data in README.md
an R script that cleans and tidies the data (or that makes progress toward that goal, with the questions you asked along the way)
a CSV and an RDS version of the clean, tidy data in data/
an R script that creates a preliminary figure
a PNG of the figure
README.md that describes your work