Introduction
Data scientists spend close to 70% (if not more) of their time cleaning, massaging and preparing data. That’s no secret – multiple surveys have confirmed that number. I can attest to it as well – it is simply the most time-consuming aspect of a data science project.
Unfortunately, it is also among the least interesting things we do as data scientists. There is no getting around it, though. It is an inevitable part of our role. We simply cannot build powerful and accurate models without ensuring our data is well prepared.
So how can we make this phase of our job interesting?
Welcome to the wonderful world of Tidyverse! It is the most powerful collection of R packages for preparing, wrangling and visualizing data. Tidyverse has completely changed the way I work with messy data – it has actually made data cleaning and massaging fun!
If you’re a data scientist and have not yet come across Tidyverse, this article will blow your mind. I will show you the top R packages bundled within Tidyverse that make data preparation an enjoyable experience. We’ll also look at code snippets for each package to help you get started.
You can also check out my pick of the top eight useful R packages you should incorporate into your data science work.
Table of contents
- What is Tidyverse?
- Core R Packages in Tidyverse
  - Data Wrangling and Transformation
    - dplyr
    - tidyr
    - stringr
    - forcats
  - Data Import and Management
    - tibble
    - readr
  - Functional Programming
    - purrr
  - Data Visualization and Exploration
    - ggplot2
- Some More Tidyverse Packages
What is Tidyverse?
Tidyverse is a collection of essential R packages for data science. The packages under the tidyverse umbrella help us work with and interact with data. There are a whole host of things you can do with your data, such as subsetting, transforming, visualizing, etc.
Tidyverse was created by the great Hadley Wickham and his team with the aim of providing all these utilities to clean and work with data.
Let’s now look at some versatile Tidyverse libraries that the majority of data scientists use to manage and streamline their data workflows.
Core R Packages in Tidyverse
Ready to explore the tidyverse? Go ahead and install it directly from within RStudio:
install.packages("tidyverse")
We’ll be working on the food demand forecasting challenge in this article. I have taken a random 10% sample from the train file for faster computation. You can take the entire dataset if you want (and if your machine can support it!).
Let’s begin!
Data Wrangling and Transformation
dplyr
dplyr is one of my all-time favorite packages. It is simply the most useful package in R for data manipulation. One of the greatest advantages of this package is you can use the pipe operator “%>%” to chain different functions in R. From filtering to grouping the data, this package does it all.
Here are the key functions dplyr offers:
- select(): Select columns from your dataset
- filter(): Keep only the rows that meet your criteria
- group_by(): Group observations by one or more variables. The underlying data does not change; only grouping information is added, so that later operations run once per group
- summarise(): Collapse the data (or each group) into summary statistics such as a mean or a count
- arrange(): Arrange your rows in ascending or descending order of a column
- left_join(), right_join(), full_join() and inner_join(): Perform left, right, full, and inner joins in R
- mutate(): Create new columns while preserving the existing variables
Let’s look at an example to understand how to use these different functions in R.
Open up the food forecasting dataset we downloaded earlier. We have 2 other files apart from the training set. We can join them with our train file to add more features. Let’s use dplyr and merge all the files. Again, I’m just using 10% of the overall data to make the computation faster.
Output:
       id week center_id meal_id checkout_price base_price emailer_for_promotion homepage_featured
1 1448490    1        55    2631         243.50     242.50                     0                 0
2 1446016    1        55    2290         311.43     310.43                     0                 0
3 1313873    1        55    2306         243.50     340.53                     0                 0
4 1440008    1        55    1962         582.03     612.13                     1                 0
5 1107611    1        24    1770         340.53     486.03                     0                 0
6 1298505    1        24    1198         147.50     191.09                     0                 0
  num_orders city_code region_code center_type op_area
1         40        NA          NA        <NA>      NA
2        162        NA          NA        <NA>      NA
3         28        NA          NA        <NA>      NA
4        231        NA          NA        <NA>      NA
5         54        NA          NA        <NA>      NA
6        148        NA          NA        <NA>      NA
Note: We see a lot of NAs here. This is because we randomly chose samples from each of the three files and then merged them. If you use the whole dataset, you will not observe this amount of missing values.
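The original merge code is not shown here, so the following is only a minimal sketch of the kind of join described above. The three small data frames and the `category` column are made-up stand-ins for the competition files, not the real data:

```r
library(dplyr)

# Hypothetical stand-ins for the train, center and meal files
train <- data.frame(
  id         = c(1, 2, 3),
  center_id  = c(55, 55, 24),
  meal_id    = c(2631, 2290, 1770),
  num_orders = c(40, 162, 54)
)
center <- data.frame(
  center_id   = 55,
  center_type = "TYPE_A"
)
meal <- data.frame(
  meal_id  = c(2631, 1770),
  category = c("Beverages", "Salad")
)

# left_join() keeps every row of train and fills unmatched keys with NA
merged <- train %>%
  left_join(center, by = "center_id") %>%
  left_join(meal, by = "meal_id")

print(merged)
```

Rows whose key has no match in the lookup table (center 24 and meal 2290 here) come back with NA, which is the same effect the note above describes for the sampled files.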
Next, let’s chain three dplyr functions to summarise the data. Here, we’ll filter the rows where the ‘center_type’ variable is ‘TYPE_A’ and calculate the mean of the ‘num_orders’ variable for this type of center:
Here, %>% is called the pipe operator. It passes the result of one function on to the next, which comes in handy when we want to chain two or more functions together.
Output:
     avg_A
1 286.3757
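As a rough illustration of such a pipeline, here is a sketch on a small made-up data frame. Which three functions the original used is an assumption; filter(), select() and summarise() are one plausible combination:

```r
library(dplyr)

# Hypothetical sample; the real data comes from the merged competition files
orders <- data.frame(
  center_type = c("TYPE_A", "TYPE_A", "TYPE_B", "TYPE_C"),
  num_orders  = c(100, 300, 50, 80)
)

result <- orders %>%
  filter(center_type == "TYPE_A") %>%    # keep only TYPE_A centers
  select(num_orders) %>%                 # keep only the column we need
  summarise(avg_A = mean(num_orders))    # collapse to a single mean

print(result)
```

Each step receives the output of the previous one, so the pipeline reads top to bottom in the order the operations happen.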
Go ahead and try out the other functions. Trust me, they will completely change the way you do data preparation.
tidyr
The tidyr package complements dplyr perfectly. It boosts the power of dplyr for data manipulation and pre-processing. Below is the list of functions tidyr offers:
- gather(): The function “gathers” multiple columns from your dataset and converts them into key-value pairs
- spread(): This takes two columns and “spreads” them into multiple columns
- separate(): As the name suggests, this function helps in separating or splitting a single column into numerous columns
- unite(): Works completely opposite to the separate() function. It helps in combining two or more columns into one
Let’s see a quick example of how to use tidyr. We’ll unite two binary variables and create only one column for both:
Output:
       id week center_id meal_id checkout_price base_price email_home num_orders city_code region_code
1 1448490    1        55    2631         243.50     242.50        0_0         40        NA          NA
2 1446016    1        55    2290         311.43     310.43        0_0        162        NA          NA
3 1313873    1        55    2306         243.50     340.53        0_0         28        NA          NA
4 1440008    1        55    1962         582.03     612.13        1_0        231        NA          NA
5 1107611    1        24    1770         340.53     486.03        0_0         54        NA          NA
6 1298505    1        24    1198         147.50     191.09        0_0        148        NA          NA
  center_type op_area
1        <NA>      NA
2        <NA>      NA
3        <NA>      NA
4        <NA>      NA
5        <NA>      NA
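A minimal sketch of how unite() could produce a column like email_home, assuming a two-row made-up data frame in place of the merged dataset:

```r
library(tidyr)

# Hypothetical rows carrying the two binary promotion flags
promo <- data.frame(
  id                    = c(1448490, 1440008),
  emailer_for_promotion = c(0, 1),
  homepage_featured     = c(0, 0)
)

# Paste the two flags into one column; the inputs are dropped by default
united <- unite(promo, email_home, emailer_for_promotion, homepage_featured, sep = "_")

print(united)
```

The new column holds values such as "0_0" and "1_0", and separate() can split it back into two columns if needed.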
Here’s another example of how tidyr works:
Output:
  variable1 variable2 num
1         A   factor1   1
2         A   factor2   2
3         A   factor3   3
4         B   factor1   4
5         B   factor2   5
6         B   factor3   6
> spread(data,variable2,num)
  variable1 factor1 factor2 factor3
1         A       1       2       3
2         B       4       5       6
3         C       7       8       9
We easily converted the factor variables into a table that can be swiftly interpreted without much pre-processing.
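The data frame behind this output is not shown, so the sketch below rebuilds one that is consistent with it (nine rows, three per group). It also shows pivot_wider(), which current tidyr releases recommend over spread() for the same reshaping:

```r
library(tidyr)

# Reconstructed input: an assumption based on the printed output
data <- data.frame(
  variable1 = rep(c("A", "B", "C"), each = 3),
  variable2 = rep(c("factor1", "factor2", "factor3"), times = 3),
  num       = 1:9
)

# Long to wide with spread(), as in the article
wide <- spread(data, variable2, num)

# The same reshaping with the newer pivot_wider() interface
wide_new <- pivot_wider(data, names_from = variable2, values_from = num)

print(wide)
```

Both calls give one row per value of variable1 and one column per value of variable2.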
stringr
Dealing with string variables is a tricky challenge. They can often trip up our final analysis because we skipped over those variables initially, thinking they won’t affect our model. That’s a mistake.
stringr is my go-to package in R for such situations. It plays a big role in processing raw data into a cleaner and an easily understandable format. stringr contains a variety of functions that make working with string data really easy.
Some basic functions that you can perform with the stringr package are:
- str_sub(): Extract substrings from a character vector
- str_trim(): Trim leading and trailing white space
- str_length(): Return the length of the string
- str_to_lower()/str_to_upper(): Convert the string to lower case or upper case
There are many more functions inside the stringr package. Let’s look at a couple of functions:
Output:
> str_to_lower(x)
[1] "analytics vidhya 001"
> str_to_upper(x)
[1] "ANALYTICS VIDHYA 001"
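The definition of x is not shown, so the sketch below assumes a mixed-case string that would give the output above, and tries the other basic functions on it too:

```r
library(stringr)

# Assumed input; the original value of x is not shown in the article
x <- "Analytics Vidhya 001"

lower   <- str_to_lower(x)             # all lower case
upper   <- str_to_upper(x)             # all upper case
n_chars <- str_length(x)               # number of characters
first   <- str_sub(x, 1, 9)            # characters 1 to 9
trimmed <- str_trim("  padded text  ") # strip surrounding white space

print(c(lower, upper, first, trimmed))
print(n_chars)
```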
Combine two strings:
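The snippet for this step is missing, so here is a minimal sketch that assumes str_c() was the function intended; the two input strings are made up:

```r
library(stringr)

# Hypothetical inputs
first_word  <- "Analytics"
second_word <- "Vidhya"

# str_c() joins strings; sep sets the text placed between them
combined <- str_c(first_word, second_word, sep = " ")

print(combined)
```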
forcats
The forcats package is dedicated to dealing with categorical variables or factors. Anyone who has worked with categorical data knows what a nightmare they can be. forcats feels like a godsend.
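It is quite frustrating when a factor appears in a place where we least expect it, for instance when strings are silently converted to factors on import. If we’re using the tibble format, we don’t need to worry about this issue, because tibbles never convert strings to factors. The aim of forcats is to fill in the missing pieces so we can access the power of factors with minimum effort.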
Use the following example to experiment with factors in your data:
Output:
# A tibble: 4 x 2
  f          n
  <fct>  <int>
1 TYPE_A  1890
2 TYPE_B   569
3 TYPE_C   537
4 NA     42657
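Output with columns f and n is what fct_count() returns, so the sketch below assumes that function was used. The factor is a made-up sample standing in for the center_type column:

```r
library(forcats)

# Hypothetical factor with one missing value
center_type <- factor(c("TYPE_A", "TYPE_B", "TYPE_A", NA, "TYPE_C", "TYPE_A"))

# Count how often each level occurs
counts <- fct_count(center_type)

# Reorder the levels from most to least frequent
by_freq <- fct_infreq(center_type)

print(counts)
print(levels(by_freq))
```

fct_infreq() is handy before plotting, since bar charts then appear sorted by frequency.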
Data Import and Management
readr
We have plenty of ways to read data in R. So why use the readr package? The readr package solves the problem of parsing a flat file into a tibble. It is an improvement over the standard file importing methods and is typically much faster on large files.
You can easily read a .CSV file in the following way:
read_delim("filename.csv",delim=",")
Use this function and you’ll automatically see the difference in the time RStudio takes to read in huge data files.
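To try this without the competition files, the sketch below writes a tiny CSV to a temporary file and reads it back. The file contents are made up:

```r
library(readr)

# Write a small hypothetical CSV to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(
  c("id,week,checkout_price",
    "1448490,1,243.50",
    "1446016,1,311.43"),
  path
)

# read_delim() guesses a type for each column and returns a tibble
df <- read_delim(path, delim = ",")

print(df)
```

For comma-separated files, read_csv(path) is a shorthand for the same call.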
tibble
We work with dataframes in R. It’s one of the first things we learn about R – convert your data into a dataframe before proceeding with any sort of data science steps.
Tibble is a type of dataframe in R. It truly stands out when we’re trying to detect anomalies in our dataset. How? Tibble does not change variable names or types. It also complains more than a regular dataframe, for example by warning when we try to access a variable that does not exist.
Tibble’s print() method also helps in easy handling of big datasets containing complex objects, since it shows only the first few rows and the columns that fit on screen. Such features enable us to treat the inherent data issues early on, hence producing cleaner code and data.
library(tibble)
data <- as_tibble(train)  # as.tibble() is deprecated in favour of as_tibble()
head(data)
Notice how the data type is mentioned along with the column names. This is a very useful way to present data. Using the above example we can easily see how R gives a “tibble” output to the users:
Output:
# A tibble: 456,548 x 9
        id  week center_id meal_id checkout_price base_price emailer_for_pro~ homepage_featur~
     <int> <int>     <int>   <int>          <dbl>      <dbl>            <int>            <int>
 1  1.38e6     1        55    1885           137.       152.                0                0
 2  1.47e6     1        55    1993           137.       136.                0                0
 3  1.35e6     1        55    2539           135.       136.                0                0
 4  1.34e6     1        55    2139           340.       438.                0                0
 5  1.45e6     1        55    2631           244.       242.                0                0
 6  1.27e6     1        55    1248           251.       252.                0                0
 7  1.19e6     1        55    1778           183.       184.                0                0
 8  1.50e6     1        55    1062           182.       183.                0                0
 9  1.03e6     1        55    2707           193.       192.                0                0
10  1.05e6     1        55    1207           326.       384.                0                1
# ... with 456,538 more rows, and 1 more variable: num_orders <int>
The train file that we converted to the tibble format now gives us a clearer look at the data types and number of variables. Looks pretty neat and tidy, right?
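To see the "does not change variable names" behaviour in isolation, here is a small sketch that builds the same made-up column with data.frame() and with tibble():

```r
library(tibble)

# Base R rewrites non-syntactic column names by default
df <- data.frame(`checkout price` = c(243.5, 311.43))

# tibble() keeps the name exactly as written
tb <- tibble(`checkout price` = c(243.5, 311.43))

print(names(df))
print(names(tb))

# An existing data frame converts with as_tibble()
converted <- as_tibble(df)
print(converted)
```

The data frame ends up with a column called checkout.price, while the tibble keeps the original name with the space.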
Functional Programming
purrr
The purrr package in R provides a complete toolkit for enhancing R’s functional programming. We can use the functions provided by purrr to avoid many loops with just one line of code.
Which function do you typically use to check the mean of every column in your data? Most data scientists using R tend to lean on the summary() function. It gives us the descriptive statistics for each column.
An even better way to compute just the mean of each column, without using any ugly loops, is to use the “map” family of functions. Let’s see how we can do that using our training set:
map_dbl(train,~mean(.x))
Output:
                   id                  week             center_id               meal_id
         1.250096e+06          7.476877e+01          8.210580e+01          2.024337e+03
       checkout_price            base_price emailer_for_promotion     homepage_featured
         3.322389e+02          3.541566e+02          8.115247e-02          1.091999e-01
           num_orders
         2.618728e+02
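The same call can be tried on a small made-up data frame. This sketch assumes every column is numeric, since map_dbl() must get exactly one number back per column:

```r
library(purrr)

# Hypothetical numeric sample in place of the training set
train_sample <- data.frame(
  week       = c(1, 2, 3),
  num_orders = c(40, 162, 28)
)

# Apply mean() to each column and collect the results in a named numeric vector
col_means <- map_dbl(train_sample, mean)

print(col_means)
```

The result is a named vector with one entry per column, which is easier to reuse in later code than the printed output of summary().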
Data Visualization and Exploration
ggplot2
I’m sure you must have heard of ggplot2. It is far and away the best visualization package I have ever used. Data scientists universally love using ggplot2 to produce their charts and visualizations. It’s such a useful and popular package that it has even been ported to Python!
There is so much we can do with this package. Whether it’s building box plots, density plots, violin plots, tile plots, time series plots – you name it and ggplot2 has a function for it.
Let’s see a few examples of how to create some really informative plots with ggplot2 in R.
‘num_orders’ is the target variable in our food forecasting dataset. Let’s look at its distribution by generating a density chart:
As you can see above, the dependent variable is right-skewed.
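The plotting code and the chart itself are not reproduced here. A density chart of this kind can be sketched as follows, using simulated right-skewed values as a stand-in for num_orders:

```r
library(ggplot2)

# Simulated right-skewed data; an assumption, not the real target variable
set.seed(42)
sim <- data.frame(num_orders = rlnorm(500, meanlog = 5, sdlog = 1))

# One variable on the x axis, with a smoothed density drawn over it
p_density <- ggplot(sim, aes(x = num_orders)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  labs(title = "Distribution of num_orders", x = "num_orders", y = "Density")

print(p_density)
```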
Now, how about drawing up a violin plot? It’s a nice alternative to boxplots for detecting outliers:
Woah. There are plenty of outliers in our data. Don’t you love how a simple visualization offers up so many insights?
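Here is a sketch of a violin plot on simulated data. Splitting num_orders by center_type is an assumption, since the variables used in the original chart are not shown:

```r
library(ggplot2)

# Simulated stand-in: three center types with skewed order counts
set.seed(42)
sim <- data.frame(
  center_type = rep(c("TYPE_A", "TYPE_B", "TYPE_C"), each = 100),
  num_orders  = rlnorm(300, meanlog = 5, sdlog = 1)
)

# One violin per category; the width shows where the values concentrate
p_violin <- ggplot(sim, aes(x = center_type, y = num_orders)) +
  geom_violin(fill = "tomato", alpha = 0.6) +
  labs(x = "center_type", y = "num_orders")

print(p_violin)
```

Long thin tails at the top of a violin point to a small number of very large values.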
Next, plot a scatterplot to check the relationship between the checkout price and the base price:
Interestingly, there seems to be a pretty strong linear relationship between the two variables. We can certainly dig deeper into this when we’re working on this challenge to understand how these variables affect our overall model building strategy.
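A scatterplot of this kind can be sketched as below. The prices are simulated so that the two columns are roughly linearly related, which is an assumption made purely for illustration:

```r
library(ggplot2)

# Simulated prices: checkout_price is base_price plus some noise
set.seed(42)
base <- runif(200, min = 100, max = 600)
sim <- data.frame(
  base_price     = base,
  checkout_price = base + rnorm(200, mean = -20, sd = 30)
)

# One point per row, base price against checkout price
p_scatter <- ggplot(sim, aes(x = base_price, y = checkout_price)) +
  geom_point(alpha = 0.5) +
  labs(x = "base_price", y = "checkout_price")

print(p_scatter)
```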
The power of visualization never ceases to amaze me.
Some More Tidyverse Packages
These packages belong to the wider tidyverse but are not among the core packages this article has covered so far, so library(tidyverse) may not load them for you. Hence, I have provided the installation and loading commands for each package in this section.
Importing Data
- readxl: This package is very useful when you want to import Excel sheets in R:
install.packages("readxl")
library(readxl)
data <- read_xlsx("filename.xlsx")
- haven: For importing SPSS, STATA and SAS data:
install.packages("haven")
library(haven)
dat = read_sas("path to file", "path to formats catalog")
- googledrive: For importing files from Google Drive.
Data Wrangling
- lubridate: The best R package for working with date-time data. lubridate provides a series of functions that are a permutation of the letters “m”, “d” and “y” to represent the ordering of month, day and year:
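The original snippet is missing. In the sketch below the input strings are assumptions, chosen so that each parser returns one of the dates listed in the output:

```r
library(lubridate)

# Each parser is named after the order of the components in its input
d1 <- ymd("20190111")     # year, month, day
d2 <- mdy("09/12/2018")   # month, day, year
d3 <- dmy("01-04-2019")   # day, month, year

dates <- c(d1, d2, d3)
print(as.character(dates))
```

All three calls return Date objects in the standard year-month-day form, whatever the layout of the input string.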
Output:
"2019-01-11" "2018-09-12" "2019-04-01"
- hms: This package works similarly to lubridate but only with time-of-day values:
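The original snippet is missing here as well. Output of the form "9H 10M 1S" is how lubridate's hms() parser prints a period, so this sketch assumes that function was used and shows the hms package's own constructor beside it. The input strings are made up:

```r
# Namespace-qualified calls avoid the clash between the two hms() functions

# lubridate::hms() parses "hours:minutes:seconds" strings into periods
periods <- lubridate::hms(c("09:10:01", "09:10:02", "09:10:03"))
print(periods)

# hms::hms() builds a time-of-day value from its components
time_of_day <- hms::hms(seconds = 1, minutes = 10, hours = 9)
print(time_of_day)
```

The hms package prints its values in clock form such as 09:10:01, while the lubridate period prints as hours, minutes and seconds.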
Output:
"9H 10M 1S" "9H 10M 2S" "9H 10M 3S"
Pretty awesome!
End Notes
Tidyverse is the most popular collection of R packages, which isn’t all that surprising given how useful and easy to use they are. You’re definitely missing out on saving time and making your work much more efficient if you aren’t using the Tidyverse packages.
Have you used these R packages before? Are there any other packages you feel should be incorporated into Tidyverse? I want to hear your thoughts, feedback, and experience with Tidyverse. Let me know in the comments section below!
And if you get stuck at any point while using these packages, I’ll be happy to help you out.
We have summarised the use of every package under tidyverse in this amazing cheatsheet. You can access it here.