Become more efficient through automation and better organization

2020-06-17 automation efficiency purrr organization

Motivation

Recently, while reading through a book about creating R packages, I was especially struck by a few lines concerning automation: anything that can be automated, should be automated. Do as little as possible by hand. Do as much as possible with functions. This philosophy has vast implications, and it has caused me to reflect on different ways that I can reduce inefficiency in my work.

In addition to automating where possible, keeping your projects well organized will also make you more efficient. I am certainly not the only one who has sifted through poorly named files and folders trying to find some data or script.

The goal of this post is to save you both time and headaches. I will highlight a few practices and methods that I have found useful, and I would encourage you to seek out and share similar tips with the community! I’ll assume you have some basic experience with the tidyverse.

# install.packages(c("tidyverse","fs"))
library(tidyverse) # all the things
library(fs) # for working with files

Let’s dive right in.


Autocomplete and intentional naming of variables

Take advantage of RStudio’s fabulous autocomplete! Be intentional with naming objects and functions. While working as a tutor in a lab for R programming and data science, I would often see the most interesting and inconsistent variable names:

# bad

WednesdayHousingData_Project
Data_That_I_Cleaned
ABBYLANEmanor2

Generally, variable names should strike a nice balance between being concise and descriptive. If anything, lean towards more descriptive names, since RStudio’s autocomplete makes longer names cheap to type.

Pick a naming convention and stick with it. I personally use snake case (all lowercase letters separated by underscores). snake_case_is_arguably_easier_to_read thanCamelCase. You can choose whatever you want, just be consistent.

Personally, I always name datasets dat (for data), or give them dat as a prefix. Some people prefer df, dt, or something else. Consistency with a name or prefix makes it a lot easier to type and find objects.

# good

dat
dat_tidy
dat_train
dat_nested

# even longer names are okay because of autocomplete

dat_linkedin_profiles
dat_canvas_long
dat_canvas_wide

To be fair, dat itself isn’t super descriptive, but it is great when working with just one dataset or when used as a prefix for modified versions of a dataset. With RStudio’s autocomplete, even if you forget what something is called, you can always type dat… and all your datasets will appear.

# simple example of a workflow

dat_courses <- read_csv("path-to-data/data1.csv")

dat_sections <- read_csv("path-to-data/data2.csv")

dat <- dat_courses %>% 
    left_join(dat_sections)

dat %>% glimpse()

# comments about what I did to go from dat to dat2
# blah blah blah

dat2 <- dat %>%
    select(column1, column2, column3) %>% 
    filter(column3 > 50) %>% 
    mutate(
        # lots of code
    )

dat2 %>% glimpse()

# comments about what I did to go from dat2 to dat_final
# blah blah blah

dat_final <- dat2 %>% 
    group_by(column1) %>% 
    summarise(
        # all sorts of summaries
    )

dat_final %>% glimpse()
    
write_rds(dat_final, "path/data.rds")

Functions and iteration

For a long time, writing functions and iterating with the purrr package just seemed too formidable to me as an R user. After a lot of study and practice, however, I am far more confident in those areas. If you want to learn how to leverage these tools too, now is the time!

I will outline the basics here, and point you towards additional resources at the end of this post. I mainly want to highlight some examples that you may find helpful.

Functions

# basic outline for writing a function

name <- function(variables) {
    # code here
}
dat <- mtcars %>% as_tibble()
dat %>% glimpse()
## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17...
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,...
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3....
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,...
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,...
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,...

Say there is some code that I run frequently during an analysis:

dat %>% 
  group_by(cyl) %>% 
  summarize(
    n = n(), sd = sd(mpg), median = median(mpg), mean = mean(mpg)
  )
## # A tibble: 3 x 5
##     cyl     n    sd median  mean
##   <dbl> <int> <dbl>  <dbl> <dbl>
## 1     4    11  4.51   26    26.7
## 2     6     7  1.45   19.7  19.7
## 3     8    14  2.56   15.2  15.1

I use it so much, that I want to wrap it up into a function that I’ll name get_summary. I should look at my code and ask myself: what are the inputs? In this case, I have the data (dat), a variable that I want to group the summary by (cyl), and the variable that I actually want a summary of (mpg). I would list those inputs as arguments inside of function(), and give them easy-to-understand names. Then I’d swap out the hard-coded variable names with the argument names I just came up with. Thus, cyl becomes grouping_var and mpg becomes summary_var, both in function() and the main code body.

Since I am writing a function with tidyverse verbs, I also need to wrap the variables in the main body of code with {{ }}. More on that later. Lastly, remember that the main body of the code is nestled between curly braces {}. With all of that said, we end up with this:

get_summary <- function(data, grouping_var, summary_var) {
  data %>% 
    group_by( {{ grouping_var }} ) %>% 
    summarize(
      n = n(),
      sd = sd( {{ summary_var }} ),
      median = median( {{ summary_var }} ),
      mean = mean( {{ summary_var }} )
    )
}

dat %>% get_summary(cyl, mpg)
## # A tibble: 3 x 5
##     cyl     n    sd median  mean
##   <dbl> <int> <dbl>  <dbl> <dbl>
## 1     4    11  4.51   26    26.7
## 2     6     7  1.45   19.7  19.7
## 3     8    14  2.56   15.2  15.1
dat %>% get_summary(am, wt)
## # A tibble: 2 x 5
##      am     n    sd median  mean
##   <dbl> <int> <dbl>  <dbl> <dbl>
## 1     0    19 0.777   3.52  3.77
## 2     1    13 0.617   2.32  2.41

Now I can use the get_summary function with all sorts of variables! A detailed explanation as to why we use the curly-curly {{ }} operator here is beyond the scope of this humble post, but I would point you towards the two best resources that I have found that explain why we need them: Programming with dplyr and rlang 0.4.0. In short, using them allows us to write things like cyl and mpg when calling the function, instead of dat$cyl or dat$mpg. Having to use {{ }} may seem slightly painful at first, but it allows us to create functions that are much more user friendly, just like functions from the tidyverse.
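To see why the embrace matters, here is a hypothetical naive version written without {{ }} (the name get_summary_naive is made up for illustration). Because the argument is not embraced, group_by() goes looking for a column literally named grouping_var and fails:

```r
library(dplyr)

# hypothetical naive version, for illustration only: without {{ }},
# group_by() searches the data for a column literally named "grouping_var"
get_summary_naive <- function(data, grouping_var, summary_var) {
  data %>%
    group_by(grouping_var) %>%
    summarize(mean = mean(summary_var))
}

# this errors, because mtcars has no column called "grouping_var"
res <- try(mtcars %>% get_summary_naive(cyl, mpg), silent = TRUE)
```

The curly-curly version from above succeeds on exactly the same call.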

Iteration

Now, onto iterating with purrr. Introducing map.

map is basically a for loop in a function. It allows you to apply a function to many things at once.

# the map function

map(my_list, my_function)

# my_list is a list, vector, or data.frame/tibble that you want to iterate over
# my_function is a function that will be applied to each element of my_list
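To make the “for loop in a function” idea concrete, here is a rough base-R equivalent of map(my_list, my_function), using made-up example values for my_list and my_function:

```r
# example inputs (hypothetical, just for illustration)
my_list <- list(1:3, 4:6, 7:9)
my_function <- sum

# what map(my_list, my_function) does under the hood, roughly:
results <- vector("list", length(my_list))
for (i in seq_along(my_list)) {
  results[[i]] <- my_function(my_list[[i]])
}
# results holds one element per input, just like purrr::map()
```

map saves you from writing (and debugging) that boilerplate every time.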

Instead of this:

mean(dat$mpg)
mean(dat$disp)
mean(dat$hp)
# and so on...

do this:

dat %>% map(mean) # or map(dat, mean)
## $mpg
## [1] 20.09062
## 
## $cyl
## [1] 6.1875
## 
## $disp
## [1] 230.7219
## 
## $hp
## [1] 146.6875
## 
## $drat
## [1] 3.596563
## 
## $wt
## [1] 3.21725
## 
## $qsec
## [1] 17.84875
## 
## $vs
## [1] 0.4375
## 
## $am
## [1] 0.40625
## 
## $gear
## [1] 3.6875
## 
## $carb
## [1] 2.8125

Hmm, I don’t love that the output is a list of numbers. Let’s use map_dbl to return a vector of doubles instead. map always returns a list, but there are other map_* variants that give you more control over the output.

Run ?purrr::map in the console for more info.

dat %>% map_dbl(mean)
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500

map and its friends can iterate over data.frames and tibbles because the columns are just vectors of equal length. Also, notice we just type mean and not mean(). Alternatively we could write:

dat %>% map(~ mean(.x))

Use the ~ to signal that you are writing an anonymous function and the pronoun .x to represent each element of the list. For those familiar with for loops, .x plays a role similar to the i in for (i in ...).
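A quick sketch of the equivalent ways to hand a function to map, using mtcars as the example data:

```r
library(purrr)

dat <- mtcars  # map works on data.frames too; a tibble isn't required

# three equivalent ways to supply a function to map_dbl()
m1 <- map_dbl(dat, mean)                 # a named function
m2 <- map_dbl(dat, ~ mean(.x))           # the ~ formula shorthand
m3 <- map_dbl(dat, function(x) mean(x))  # a full anonymous function
# all three return the same named vector of column means
```

The ~ shorthand earns its keep once the function call has extra arguments, e.g. map_dbl(dat, ~ mean(.x, na.rm = TRUE)).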

You can also use map2 to iterate over two vectors or lists at the same time. ?map2

a <- 1:5
b <- 5:1
a
## [1] 1 2 3 4 5
b
## [1] 5 4 3 2 1
map2(a, b, ~ .x * .y)
## [[1]]
## [1] 5
## 
## [[2]]
## [1] 8
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 8
## 
## [[5]]
## [1] 5
map2_dbl(a, b, ~ .x * .y)
## [1] 5 8 9 8 5

In this case, .x represents the first list, and .y represents the second. See ?pmap for help with three or more arguments.
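For three or more inputs, pmap takes a single list of inputs, and the pronouns ..1, ..2, ..3 refer to them in order. A minimal sketch with made-up vectors:

```r
library(purrr)

a <- 1:3
b <- 4:6
d <- 7:9

# pmap iterates over all inputs in parallel;
# ..1, ..2, ..3 stand for the first, second, and third input
out <- pmap_dbl(list(a, b, d), ~ ..1 + ..2 + ..3)
out
## [1] 12 15 18
```

With many inputs, a named function like function(a, b, d) a + b + d is usually clearer than the ..1 pronouns.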

Some personal examples

You can use helper functions that you have created in combination with functions from purrr to reduce duplication in your code and automate processes that are more manual.

The other day I found myself doing this a lot:

dat %>% count(cyl)
## # A tibble: 3 x 2
##     cyl     n
##   <dbl> <int>
## 1     4    11
## 2     6     7
## 3     8    14
dat %>% count(vs)
## # A tibble: 2 x 2
##      vs     n
##   <dbl> <int>
## 1     0    18
## 2     1    14
dat %>% count(am)
## # A tibble: 2 x 2
##      am     n
##   <dbl> <int>
## 1     0    19
## 2     1    13

Unfortunately, using map with count is not as straightforward as map(dat, count), so I wrote a function to make things easier.

count_values <- function(data, ...) {
    # the ... lets you select columns just like in dplyr::select
    select(data, ...) %>%
        # dplyr::across is from dplyr 1.0.0;
        # dplyr::mutate_all(as.character) would do the same thing
        mutate(across(everything(), as.character)) %>%
        # purrr::map_dfr row-binds the results into a single tibble
        map_dfr(~ count(tibble(value = .x), value, sort = TRUE), .id = "variable")
}

dat %>% count_values(cyl, vs, am)
## # A tibble: 7 x 3
##   variable value     n
##   <chr>    <chr> <int>
## 1 cyl      8        14
## 2 cyl      4        11
## 3 cyl      6         7
## 4 vs       0        18
## 5 vs       1        14
## 6 am       0        19
## 7 am       1        13

Automate the tedious

One way to stay organized is by keeping different scripts and data in appropriately named folders. When I start fresh in a new R project, I usually create the same folders every time. I wrote a function to make this easier:

get_started <- function() {
  dirs <- list("data-raw", "data-derived", "eda", "scripts", "models", "helpers")
  purrr::walk(dirs, fs::dir_create)
}
# walk is like map, but only calls the function for its side-effect
# dir_create creates a new directory

Running get_started() creates six empty folders in the project directory: data-raw, data-derived, eda, scripts, models, and helpers.

If you wanted more control over which folders to create, you could use get_started2 and just input what you want.

get_started2 <- function(...) {
  dirs <- list(...)
  purrr::walk(dirs, fs::dir_create)
}

get_started2("data", "reports", "personal-folders", "analysis")

Hold on to your helpers

Over the course of an analysis, or just your R user life in general, you will probably write a bunch of helper functions to, ahem, help you. If you don’t create your own R package to keep those functions, the next best thing is to keep them each in their own script with some comments documenting what they do and how to use them.

If I had all of my helper functions in a directory called helpers, I could source all of them with the code below:

fs::dir_ls("helpers") %>% walk(source)

# dir_ls lists all of the files in a directory

Files and folders

From the names of the directories in get_started and get_started2, can you guess what sort of files should go in each? I hope so. Names of files and folders should be easy to understand for both humans and machines. Stick with numbers, lowercase letters, _, and -. That’s it. Avoid funky characters and spaces. Can you imagine how much better the world would be if everyone did just that?

I generally like to use _ for separating groups of things that are bigger ideas or are not necessarily related, and - for everything else. Here is an example of a typical project directory organized with files and folders:

  • data-raw
    • 2020-03-26_canvas-assignments.csv
    • 2020-04-05_canvas-assignments.csv
    • 2020-04-18_canvas-assignments.csv
    • student-total.csv
  • data-derived
    • canvas-tidy-current.rds
    • student-seniors.rds
  • scripts
    • canvas-tidy-current.R
    • student-seniors.R
  • helpers
    • count-values.R
    • get-started.R
  • models
    • canvas-decision-tree.R
    • canvas-random-forest.R
    • student-seniors-log-regr.R
    • student-seniors-xgboost.R
  • eda
    • 01_canvas-api.R
    • 02_canvas-courses.R
    • 03_canvas-assignments.R

Resources for learning more

Here are some of my favorite books and tutorials that helped me to learn more about writing my own functions and iterating with purrr:

For more help with a project-oriented workflow:

For more help with managing file paths:


Thank you for reading!

I truly hope that something in this post has helped you! Investing time and effort into learning how to write functions, iterate with purrr, and stay well organized will pay off greatly. Stay safe and happy coding!
