Effective exploratory data analysis (EDA) is crucial to almost any data science project. For a great first look at how to do EDA in R, check out the 7th chapter of R for Data Science. This post points you towards some useful tools that make some aspects of EDA easier and faster.
Prepare to be primed
Over the past few months I have found myself using a few packages or functions over and over again whenever I get my hands on a new dataset. I also have recently stumbled upon some new beauties that I think are worth sharing. This post is meant to be more of a primer than a real deep dive into any one package. Links to learn more about each package/function are included throughout.
# if needed:
# install.packages(c("tidyverse", "janitor", "DataExplorer", "skimr", "trelliscopejs", "gapminder"))
library(tidyverse) # (dplyr, ggplot2, %>%)
library(janitor)
library(DataExplorer)
library(skimr)
library(trelliscopejs)
library(gapminder)

dat <- ggplot2::diamonds
# learn more about diamonds dataset: ?diamonds
An oldie but a goodie
When I first look at a new dataset, I really just want to take a peek, or a glimpse, at the data.
glimpse from dplyr is perfect for just that. It shows you all the basics of your dataset: the number of rows and columns, the names and types of the variables, and the first several values in each column.
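The call that prints such a summary is short. A minimal sketch, assuming `dat` is the diamonds data loaded in the setup code:

```r
library(dplyr)

# Assumption: `dat` is the diamonds dataset loaded in the setup code.
dat <- ggplot2::diamonds

# Print a compact, column-by-column overview of the data.
glimpse(dat)
```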
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
glimpse simply gives you more information. Also, when working in an R script, using glimpse is often nicer than using View, because View takes you out of the script and can slightly disrupt your flow. I have also had View crash R when used on very large datasets.
glimpse also works great with the pipe, %>%. I like to use it at the end of a series of manipulations on data as a sort of sanity check.
# random dplyr code
dat %>%
  rename(length = x, width = y) %>%
  mutate(price_euro = price * .91) %>%
  filter(carat > .7) %>%
  select(carat, cut, price_euro) %>%
  glimpse()
## Rows: 26,778
## Columns: 3
## $ carat      <dbl> 0.86, 0.71, 0.78, 0.96, 0.73, 0.80, 0.75, 0.75, 0.74, 0....
## $ cut        <ord> Fair, Very Good, Very Good, Fair, Very Good, Premium, Ve...
## $ price_euro <dbl> 2508.87, 2510.69, 2510.69, 2510.69, 2511.60, 2511.60, 25...
The clean_names() function from the janitor package is awesome for cleaning up annoying column names. You just pipe in your data and it converts your column names to snake case. There are other options, too; see ?clean_names. Okay, maybe this function doesn’t really have much to do with EDA, but quickly standardizing all of your column names sure makes working with them easier.
dat_ugly_names <- tribble(
  ~"BAD Column", ~"Good name?", ~"This-hurts_Me",
  "a", 1, "fruit",
  "b", 2, "taco",
  "c", 3, "corona virus"
)

dat_ugly_names %>% clean_names()
## # A tibble: 3 x 3
##   bad_column good_name this_hurts_me
##   <chr>          <dbl> <chr>
## 1 a                  1 fruit
## 2 b                  2 taco
## 3 c                  3 corona virus
# I use clean_names just about every time I import data from somewhere!
# import dataset from the wild:
# dat <- read_csv("some-crazy-data-from-the-wild.csv") %>% clean_names()
# oh wow, now this wild dataset at least has some tame column names
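One of those other options is the `case` argument. A small sketch, assuming a janitor version whose clean_names() accepts `case`; the `messy` data frame is made up for illustration:

```r
library(janitor)

# Hypothetical data with awkward column names
# (check.names = FALSE keeps the names exactly as written).
messy <- data.frame(
  "BAD Column" = c("a", "b"),
  "Good name?" = c(1, 2),
  check.names = FALSE
)

# Default behaviour: snake_case column names.
snake_names <- names(clean_names(messy))

# Alternative casing requested through the `case` argument.
loud_names <- names(clean_names(messy, case = "screaming_snake"))

snake_names
loud_names
```

The default is usually what I want, but `case` is handy when a team or database already follows a different naming convention.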
Data the Explora-data
I am not proud of that subheading. Enter the DataExplorer package to redeem myself. This is a package I plan on diving deeper into myself. Lots of golden nuggets here. From this package, I have used plot_bar for quite a while. Just this last week (at the time of writing this) I learned about create_report. Finally, while writing this post, I learned about another useful function called introduce. Both introduce and profile_missing just take data as an argument, while the other functions allow you to customize the outputs a bit more if desired.
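As a rough illustration of that customization, here is a minimal sketch using a few arguments of the plotting functions (`ncol`, `title`, `maxcat`). The titles and values are arbitrary choices of mine, not recommendations:

```r
library(DataExplorer)

# Assumption: `dat` is the diamonds dataset from the setup code.
dat <- ggplot2::diamonds

# Histograms of the continuous variables, laid out three per row.
plot_histogram(dat, ncol = 3L, title = "Continuous variables in diamonds")

# Bar charts of the discrete variables; variables with more than
# `maxcat` categories are skipped.
plot_bar(dat, maxcat = 10, title = "Discrete variables in diamonds")
```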
- plot_histogram() - Creates histograms for all continuous variables in a dataset.
dat %>% plot_histogram()
- plot_bar() - Creates bar charts for all discrete variables in a dataset.
dat %>% plot_bar()
- profile_missing() - Tells you the number and percentage of NA values from each of the columns in a dataset.
datasets::airquality %>% profile_missing() # datasets from base R
##   feature num_missing pct_missing
## 1   Ozone          37  0.24183007
## 2 Solar.R           7  0.04575163
## 3    Wind           0  0.00000000
## 4    Temp           0  0.00000000
## 5   Month           0  0.00000000
## 6     Day           0  0.00000000
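To see what those columns mean, the same numbers can be reproduced with base R alone. This is only a sketch of the idea, not how DataExplorer computes them internally. Note that pct_missing is a proportion between 0 and 1 rather than a percentage:

```r
# Base R only: count and share of NA values per column of airquality.
aq <- datasets::airquality

missing_summary <- data.frame(
  feature = names(aq),
  num_missing = unname(colSums(is.na(aq))),   # NA count per column
  pct_missing = unname(colMeans(is.na(aq)))   # NA share per column (0 to 1)
)

missing_summary
```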
- create_report() - Compiles a whole bunch of data profiling statistics (including outputs from the three functions above, correlation between variables, etc.) into an HTML report. It looks like you can customize it a bunch, but the default report has been sufficient for me (except for setting y so that a response variable can be included in some of the plotting functions). I won’t include an example here because it produces an HTML document, but you could run create_report(mtcars) or something similar in your console to see what it outputs. Lots of good stuff here.
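If you do want to try the `y` argument mentioned above, a minimal sketch might look like this. It assumes rmarkdown and pandoc are available on your machine, and the file name and output directory are arbitrary choices for illustration:

```r
library(DataExplorer)

# Assumptions: rmarkdown and pandoc are installed; the output file name
# and directory below are illustrative only.
create_report(
  mtcars,
  y = "mpg",                           # treat mpg as the response variable
  output_file = "mtcars_report.html",
  output_dir = tempdir()
)
```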
- introduce() - Describes basic info about the data.
dat %>% introduce()
## # A tibble: 1 x 9
##    rows columns discrete_columns continuous_colu~ all_missing_col~
##   <int>   <int>            <int>            <int>            <int>
## 1 53940      10                3                7                0
## # ... with 4 more variables: total_missing_values <int>, complete_rows <int>,
## #   total_observations <int>, memory_usage <dbl>
Hmm. Seeing how the output is formatted, I don’t like it as much. Too wide. I think I would rather use it in combination with glimpse:
dat %>% introduce() %>% glimpse()
## Rows: 1
## Columns: 9
## $ rows                 <int> 53940
## $ columns              <int> 10
## $ discrete_columns     <int> 3
## $ continuous_columns   <int> 7
## $ all_missing_columns  <int> 0
## $ total_missing_values <int> 0
## $ complete_rows        <int> 53940
## $ total_observations   <int> 539400
## $ memory_usage         <dbl> 3457760
Skim a bit right off the top
From the skimr package, “skim() is an alternative to summary(), quickly providing a broad overview of a data frame.” Now that I use functions from DataExplorer a lot, I don’t use skim as much, but some people might like it.
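A minimal sketch of a basic call, assuming the built-in iris data (the name `iris_skim` is mine):

```r
library(skimr)

# Summarise every column of the built-in iris data,
# with results grouped by variable type.
iris_skim <- skim(iris)

iris_skim
```

The result has one row per variable, so it can be filtered or piped like any other data frame.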
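| Data summary | |
|---|---|
| Number of rows | 150 |
| Number of columns | 5 |
| Column type frequency: | |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Species | 0 | 1 | FALSE | 3 | set: 50, ver: 50, vir: 50 |

Variable type: numeric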
You could also pipe this into summary:
skim(iris) %>% summary()
| Data summary | |
|---|---|
| Number of rows | 150 |
| Number of columns | 5 |
| Column type frequency: | |
Wow, just wow. Go here to check out how to use trelliscopejs some more. I love what people come up with.
I hope some of this has been useful to you. Most of the functions I mentioned can probably help get you started with EDA. They are especially useful when you take an initial look at a dataset, and perhaps you could continue to use some of these functions during the EDA process. However, simple functions like these do not replace best practices that you have been taught. Hopefully they just support you in whatever your process looks like.
Thank you so much for reading my first post! Feel free to share this with anyone who might find it helpful or leave a comment pointing towards other useful packages.