Introduction
R users (mostly beginners) struggle helplessly while dealing with large data sets. They get haunted by repetitive warnings, error messages of insufficient memory usage. Most of them come to an immediate conclusion, that their machine specification isn’t powerful enough. It’s time to upgrade the RAM or work on a new machine. Have you ever thought this way?
If you have seriously worked on data sets, I’m sure you would have. Even, I did too when I participated in The Black Friday. The data set contained more than 400,000 rows. I was totally clueless. Honestly, it was frustrating to see RStudio taking hours to execute one line of code. As we say, ‘necessity is the mother of all inventions’. I was in need of a solution.
After 2 hours of internet research, I came across an interesting set of R packages and APIs, specially made to work with large data sets without compromising with execution speed. One such package is data.table.
In this article, I’ve shared a smart approach which you should use when you work on large data sets. As you scroll down, you will come across the type of changes you can make to improve your R coding. It’s time to write codes which are fast and short. Consider it as a quick tutorial on data.table package.
Note: This article is best suited to beginners in data science using R who mainly work on data sets using data.frame() .If you are already a proficient user of data.table, this might not interest you.
Why does your machine fail to work with large data sets?
It’s important to understand the factors which deters your R code performance. Many a times, the incompetency of your machine is directly correlated with the type of work you do while running R code. Below are some practices which impedes R’s performance on large data sets:
- Using read.csv function to load large files.
- Using Google chrome: Opening several tabs in chrome consumes a significant amount of system’s memory. This can be checked using Shift + Esc key in chrome browser. (same applies to Mozilla web browser as well)
- Machine Specification: R reads entire data set into RAM at once. That is, R objects live in memory entirely. If you are still working on a 2GB RAM machine, you are technically disabled. With 2GB RAM, there isn’t enough free RAM space available which could seamlessly work with large data. Hence, It is strongly recommend to work on atleast 4GB RAM machines.
- Working in hot temperature: The processing speed slows the once the machine get heats up. During extreme summers, it evolves into a serious issue.
Note: My system specification is Intel(R) Core (TM) i5-3230M CPU @ 2.60GHz, 2 Core(s), 4 Logical Processors with 8GB RAM.
What is data.table ?
The package data.table is written by Matt Dowle in year 2008.
Think of data.table as an advanced version of data.frame. It inherits from data.frame and works perfectly even when data.frame syntax is applied on data.table. This package is good to use with any other package which accepts data.frame.
The syntax of data.table is quite similar to SQL. Therefore, if you’ve worked on SQL you would quickly understand it. The general form of syntax is:
DT[i, j, by]
where:
DT
is referred to the data table.i <=> where
: refers to the row indexing takes place i.e. put the row condition here.j <=> select
: refers to the column indexing takes place i.e. put the conditions (to filter, to summarise) on columns here.by <=> group_by
: refers to any categorical variable i.e. put the variable on the basis of which the grouping shall be executed.
For example:
#creating a dummy data table
DT <- data.table( ID = 1:50,
Capacity = sample(100:1000, size = 50, replace = F),
Code = sample(LETTERS[1:4], 50, replace = T),
State = rep(c("Alabama","Indiana","Texas","Nevada"), 50))
#simple data.table command
> DT[Code == "C", mean(Capacity), State]
Let’s see how does this command work. After the data table is created, I asked data table to filter the rows whose code is C. Then I asked it to calculate the mean capacity of the rows which have code C for every state separately. It’s not necessary that you always mention all the three parts of the syntax. Try doing the following commands at your end :
DT[Code == "D"]
DT[, mean(Capacity), by = State]
DT[Code == "A", mean(Capacity)]
Write your answers in the comments! Let’s see how quickly you are getting this concept.
Why should you use data.table instead of data.frame?
After I delved deeper into data.table, I found several aspects at which data.table package outperforms data.frame. Therefore, I would recommend every R beginner to use data.table as much as they can. There is a lot to explore. The earlier you start, the better you’ll become. You should use data.table because:
1. It provides blazing fast speed when it comes to loading data. With the fread
function in data.table package, loading large data sets need just few seconds. For example: I checked the loading time using a data set which contains 439,541 rows. Let’s see how fast is fread –
> system.time(dt <- read.csv("data.csv"))
user system elapsed
11.46 0.21 11.69
> system.time(dt <- fread("data.csv"))
user system elapsed
0.66 0.00 0.66
> dim(dt)
[1] 439541 18
As you saw, loading data with fread is 16x faster than the base function read.csv. fread() is faster than read.csv() because, read.csv() tries to first read rows into memory as character and then tries to convert them into integer and factor as data types. On the other hand, fread() simply reads everything as character.
2. It is even faster than the popular dplyr, plyr packages used for data manipulation. data.table provides enough room for tasks such as aggregating, filtering, merging, grouping and other related tasks. For example:
> system.time(dt %>% group_by(Store_ID) %>% filter(Gender == "F") %>% summarise(sum(Transaction_Amount), mean(Var2))) #with dplyr
user system elapsed
0.13 0.02 0.21
> system.time(dt[Gender == "F", .(sum(Transaction_Amount), mean(Var2)), by = Store_ID])
user system elapsed
0.02 0.00 0.01
data.table has processed this task 20x faster than dplyr. It happened because it avoids allocating memory to the intermediate steps such as filtering. Also, dplyr creates deep copies of the entire data frame where as data.table does a shallow copy of the data frame. Shallow copy means that the data is not physically copied in system’s memory. It’s just a copy of column pointers (names). Deep copy copies the entire data to another location in the memory. Hence, with memory efficiency, the speed of computation is enhanced.
3. Not just reading files, writing the files using data.table is much faster than write.csv()
. This packages provides fwrite()
function enabled with parallelised fast writing ability. So, next time you get to write 1 million rows, try this function.
4. In built features such as automatic indexing, rolling joins, overlapping range joins further enhances the user experience while working on large data sets.
Therefore, you see there is nothing wrong with data.frame, it just lacks the wide range of features and operations that data.table is enabled with.
Important Data Manipulation Commands
The idea of this tutorial is to provide you handy commands which can speed up your modeling process. Actually, there is so much to explore in this packages, chances are you might get puzzled from where to start, which command to stick with and when to use a particular command. Here, I provide answer to some of the most common questions which you come across while doing data exploration / manipulation.
The data set used below can be download from here: download. The data set contains 1714258 rows of 12 columns. It will be interesting to see, how long does the data.table takes in loading this data. Time for action!
Note: The data set contains uneven distribution of observations i.e. blank columns and NA values. The reason of taking this data is to check the performance of data.table on large data sets.
#set working directory
> setwd(".../AV/desktop/Data/")
#load data
> DT <- fread("GB_full.csv")
Read 1714258 rows and 12 (of 12) columns from 0.189 GB file in 00:00:07
It took only 7 seconds to read this file. Do try at your end.
1. How to subset rows & columns?
#subsetting rows
> sub_rows <- DT[V4 == "England" & V3 == "Beswick"]
#subsetting columns
> sub_columns <- DT[,.(V2,V3,V4)]
In a data table, columns are referred to as variables. Therefore, we don’t need to refer to variables as DT$column name
, column name alone works just fine. If you do DT[,c(V2,V3,V4)]
, it would return a vector of values. Using .()
symbol, wraps the variables within list()
and returns data table. In fact, every data table or data frame is a compilation of list of equal length and different data types. Isn’t it?
Subsetting data can be done even faster setting keys
in data table. Keys are nothing but supercharged rownames. A part of it has been demonstrated below.
2. How to order variables in ascending or descending order?
#ordering columns
> dt_order <- DT[order(V4, -V8)]
Order function is data table is much faster than base function order(). Reason being, order in data table uses radix order sort which impart additional boost. -
sign results in descending order.
3. How to add / update / delete a column or values in a data set?
#add a new column
> DT[, V_New := V10 + V11]
We did not assign the results back to DT. This is because,
:= operator modifies the input object by reference. It results in shallow copies in R which leads to better performance with less memory usage. The result is return invisibly.
#update row values
> DT[V8 == "Aberdeen City", V8 := "Abr City"]
With this line of code, we’ve updated Aberdeen City to Abr City in column V8.
#delete a column
> DT [,c("V6","V7") := NULL ]
Check View(DT).
We see that the data contains blank columns in the data set. It
can be removed using the code above. In fact, all the three steps can be done in command as well. This is known as chaining of commands.
> DT[V8 == "Aberdeen City", V8 := "Abr City"][, V_New := V10 + V11][,c("V6","V7") := NULL]
4. How to compute functions on variables based on grouping a column?
Let’s calculate mean of V10
variable on the bases of V4
(showing country).
#compute the average
> DT[, .(average = mean(V1o)), by = V4]
#compute the count
> DT[, .N, by = V4]
.N is a special variable in data.table used to calculate the count of values in a variable. If you wish to obtain the order of the variable specified with by
option, you can replace by
with keyby
. keyby
automatically orders the grouping variable in ascending order.
5. How to use keys for subsetting data ?
keys
in data table delivers incredibly fast results. We usually set keys on column names which can be of any type i.e. numeric, factor, integer, character. Once a key
is set of a variable, it reorders the column observations in increasing order. Setting a key is helpful, specially when you know that you need to make multiple computations on one variable.
#setting a key
> setkey(DT, V4)
Once, the key is set, we can subset any value from the key. For example:
#subsetting England from V4
> DT[.("England")]
Once the key is set, we no longer need to provide the column name again and again. If we were to look for multiple values in a column, we can write it as:
> DT[.(c("England", "Scotland"))]
Similarly, we can set multiple keys as well. This can be done using:
> setkey(DT, V3, V4)
We can again, subset value from these two columns simultaneously using:
> DT[.("Shetland South Ward","Scotland")]
There are several other modifications which can be done in the 5 steps demonstrated above. These 5 steps illustrated above will help you to perform the basic data manipulation tasks using data.table. To learn more, I would suggest you to start using this package in your every day R work. You’d face various hurdles and that’s where your learning curve would accelerate. You can also check the official data.table guide here.
End Notes
This article is written to provide you a path using which you can easily deal with large data sets. No longer, you need to spend money on upgrading your machines, instead it’s time to upgrade your knowledge of dealing with such situations. Apart from data.table, there are several other packages for parallel computing available. But, I don’t see any need to any other package for data manipulation, once you become proficient with data.table.
In this article, I discussed about some important aspects which every beginner in R must know while working on large data sets. After data manipulation, the very next hurdle which comes is model building. With large data sets, packages like caret, random forest, xgboost takes a lot of time in computation. Has it occurred to you?
I plan to provide an interesting solution in my post next week! Do let me know your pain points in dealing with large data stets. Did you like reading this article? Which other package do you use when dealing with large data sets? Drop your suggestions / opinions in the comments.