Describe, in words, what the following data wrangling functions do.
For each function, note (1) the purpose/job of the function, (2) the
main arguments, and (3) the package that contains the function. You
should be able to find all of this information in the help file for the
function (e.g., ?read_csv).
read_csv(), read_dta(), read_rds(), read_excel()
glimpse()
rename()
select()
filter()
mutate()
separate()
pivot_longer()
pivot_wider()
left_join(), right_join(), inner_join(), full_join()
bind_rows()
bind_cols()
Optional: For some of the more subtle functions (e.g., pivot_longer()), you might create a small example that
illustrates how the function works. You can use an example from the help
file or make your own.
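As one possible illustration of such a small example, the sketch below reshapes a tiny made-up table with pivot_longer() and then undoes the change with pivot_wider(). The data and the column names (country, year, cases) are invented for this sketch and do not come from any real data set.

```r
library(tibble)
library(tidyr)

# A made-up "wide" table: one row per country, one column per year.
wide <- tribble(
  ~country, ~`2019`, ~`2020`,
  "A",      10,      12,
  "B",      20,      25
)

# pivot_longer() stacks the year columns into name/value pairs,
# giving one row per country-year combination.
long <- pivot_longer(
  wide,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "cases"
)
long

# pivot_wider() reverses the operation: one column per year again.
wide_again <- pivot_wider(long, names_from = year, values_from = cases)
wide_again
```

Printing the data frame before and after each call is often the quickest way to see what a reshaping function actually did.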
In this computational homework, we are going to transform the data set you identified in HW 1 from a (probably) huge mess into a tidy, usable data set (e.g., one that follows Broman and Woo’s rules). You will use this tidy data set for the exercises moving forward, so make sure to get a variety of variables (numeric, categorical, binary, dates, names, etc.) that you’re interested in.
A large fraction (maybe 80-90%) of applied statistics is data wrangling. Data wrangling is hard, because there are hundreds of potential tasks and you must learn each separately. There isn’t a general pattern like there is with ggplot.
Whenever a data set isn’t quite how you want it, you must identify a way to get it slightly closer. Once you identify the small, desired improvement, use Google extensively, try examples, experiment, etc. Get it working, then make another small improvement. The optimal path from messy to pristine isn’t always clear, but if you repeatedly get closer, you’ll find the way.
Hadley’s book R for Data Science has a whole section on wrangling. Stat 545 at UBC has a nice collection of notes, especially for managing factors.
This week, your task is straightforward-to-describe but formidable-to-complete. Expect to spend several frustrating hours struggling through it. Depending on your application, you might have a clean, tidy data set at the end of the week. Or you might have made a bit of progress toward that goal. As you work, refer to the resources mentioned above.
Feel free to copy over any files or text from HW 1 as you need. You will use this data set over the next several weeks, so make smart project-management choices (use good names, keep code well-documented, commit at reasonable intervals, etc.).
Depending on the number of obstacles you encounter, you might sail smoothly through all of this assignment and easily create a wonderfully tidy data set. Or you might struggle and make only a little progress. The latter is fine if you’re asking questions and making progress.
Do the following:
Create a repo named hw03-first-last on GitHub. (Note: The project directory on your computer and the repo name
on GitHub do not need to match. In practice, they usually will. To help
us find each other’s work, we’re naming them differently this one
time.)
Create a data/raw/ subdirectory of your project. Add
your raw data sets (probably just copying them over from HW 1) to the
directory. Do not change anything about these data sets,
including the filenames. If you want to be especially careful,
lock the raw data sets to prevent accidental changes. Add notes to README.md
explaining how you obtained (and how others can
obtain) the raw data sets (include links!). It’s usually helpful to
include the dates that you downloaded the raw data. Document
this well for your future self and others checking your
work.
Use glimpse() a lot to
understand how you’re changing your data frame. Include descriptions as
comments to help you check and update your work later. It’s probably
good practice to have a glimpse() before and after your
last action in the pipe sequence so you can easily see exactly what you
are changing.
The cut_number() and cut_interval() functions might be helpful if you want to
convert a numeric variable into a factor. (You can facet, color, and
fill histograms by factors, for example.)
Save your tidy data set as .rds and .csv files in data/. (Use write_rds() then write_csv(), in that order; a .csv file does not preserve factors, while an .rds file does.)
README.md (see the “Images” section here).
Check your work. When you are done, make sure you have the following files, all pushed up to GitHub:
data/raw/
README.md
data/
README.md that describes your work
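The wrangling steps described above (a glimpse() before and after a change, cut_number() to turn a numeric variable into a factor, and writing .rds then .csv) might be sketched roughly as follows. Everything specific here is an assumption made for illustration: the toy data frame stands in for a real raw data set, the column names id, Income, and income_group are invented, and the files are written to a temporary directory rather than data/.

```r
library(dplyr)
library(ggplot2)  # cut_number() and cut_interval() are ggplot2 functions
library(readr)
library(tibble)

# Hypothetical stand-in for a raw data set. In a real project this would be
# read from a file in data/raw/ instead of typed in by hand.
raw <- tibble(
  id = 1:6,
  Income = c(12, 35, 47, 58, 73, 90)
)

glimpse(raw)  # before: check names and types

tidy <- raw %>%
  rename(income = Income) %>%                        # consistent lower-case name
  mutate(income_group = cut_number(income, n = 3))   # numeric -> 3-level factor

glimpse(tidy)  # after: confirm the rename and the new factor column

# Written to a temporary directory for this sketch; the assignment uses data/.
out_dir <- tempdir()
write_rds(tidy, file.path(out_dir, "tidy.rds"))
write_csv(tidy, file.path(out_dir, "tidy.csv"))

# Read both files back to compare how the factor column survives.
from_rds <- read_rds(file.path(out_dir, "tidy.rds"))
from_csv <- read_csv(file.path(out_dir, "tidy.csv"), show_col_types = FALSE)

class(from_rds$income_group)  # "factor"
class(from_csv$income_group)  # "character"
```

Reading both files back shows why the .rds copy is worth keeping: the factor comes back as a factor from the .rds file but as plain text from the .csv file.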