Introduction
Lately, I’ve been reading the book Data Scientists at Work to draw some inspiration from successful data scientists. Among other things, I found that most of them emphasized the evolution of Spark and its remarkable computational power.
This piqued my interest in Spark. Since then, I’ve researched the topic extensively and gathered every bit of information I could find.
Fortunately, Spark provides APIs for several programming languages. As an R user, I was naturally inclined towards SparkR.
After I finished my research, I realized there is no structured learning path available for SparkR. I even connected with people who are keen to learn SparkR, but none of them had come across one. Have you faced the same difficulty? If yes, here’s your answer.
This inspired me to create this step-by-step learning path. I’ve listed the best resources available on SparkR. If you complete the 7 steps thoroughly, you can expect to reach an intermediate level of proficiency in Spark. However, the journey from intermediate to expert will require hours of practice. You knew that, right? Let’s begin!
Step 1: What is Spark? Why do we need it?
Spark is an Apache project promoted as “lightning fast cluster computing”. Its headline claim is that it runs programs up to 100x faster than Hadoop MapReduce in memory and 10x faster on disk. For large-scale data processing, Spark has become a popular choice among data scientists and engineers.
Amazon, eBay, Yahoo, Facebook and many others use Spark for data processing on insanely large data sets. Apache Spark has one of the fastest growing big data communities, with more than 750 contributors from 200+ companies worldwide. According to the 2015 Data Science Salary Survey by O’Reilly, Apache Spark skills added $11,000 to the median salary.
To explore the amazing world of Spark in detail, you can refer to this article.
You can also watch this video to learn more about the value that Spark has added to the business world:
However, if you prefer reading, you can skip the video and check this recommended blog.
Interesting Read: Apache officially sets a new record in large scale sorting
Step 2: What is SparkR?
As R users, let’s focus on SparkR.
R is one of the most widely used programming languages in data science. With its simple syntax and ability to run complex algorithms, it is probably the first choice of language for beginners.
But R suffers from a problem: its data processing capacity is limited to the memory of a single node. This limits the amount of data you can process with R, and it is why R runs out of memory when you attempt to work on large data sets. To overcome this memory problem, we can use SparkR.
Along with R, Apache Spark provides APIs for other languages such as Python, Scala, Java and SQL. These APIs act as a bridge connecting these languages with Spark.
For a detailed view of SparkR, this is a must watch video:
Note: SparkR has a limitation. Currently, it only supports linear predictive models. Therefore, if you were excited to run a boosting algorithm on SparkR, you might have to wait until a later version is rolled out.
Step 3: Setting up your Machine
If you are still reading, I presume that this new technology has sparked your curiosity and that you are determined to complete this journey. So, let’s move on to setting up the machine:
To install SparkR, we first need to install Spark on our system, since it runs at the backend.
The following resources will help you install Spark on your respective OS:
Once Spark is installed, it takes just a few extra steps to initiate SparkR. The following resources will help you initiate SparkR locally:
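As a rough guide to what those steps look like, here is a minimal sketch of starting SparkR on a local machine. It assumes Spark 2.x or later, where `sparkR.session()` is the entry point (Spark 1.x releases used `sparkR.init()` and `sparkRSQL.init()` instead), and it assumes the `SPARK_HOME` environment variable points at your Spark installation. The application name is just a placeholder.

```r
# Assumption: SPARK_HOME points at the unpacked Spark directory, which
# ships the SparkR package under R/lib. Add it to the library search path.
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

# Start a local Spark session that uses all available cores.
sparkR.session(master = "local[*]", appName = "sparkr-learning-path")

# Turn a built-in R data frame into a distributed SparkDataFrame.
df <- as.DataFrame(faithful)

# Inspect the schema and the first few rows.
printSchema(df)
head(df)

# Count the rows; this runs as a Spark job and returns a local number.
n_rows <- count(df)

# Stop the session when you are done.
sparkR.session.stop()
```

If the session starts and the row count comes back, your setup is working and you can move on to the next step.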
Step 4: Getting the Basics Right
Start with R: I assume that you already know R if you are interested in working with big data. However, if R is not your domain, this course by DataCamp will help you get started with R.
Exercise: Install the swirl package in R and complete its full set of exercises.
Database handling with SQL: SQL is widely used in SparkR to express data operations with simple commands. This reduces the lines of code you have to write and can also speed up operations. If you are not familiar with SQL, you should do this course by Codecademy.
Exercise: Practice 1 and Practice 2
Step 5: Data Exploration with SparkR and SQL
Once your basics are in place, it’s time to learn to work with SparkR and SQL.
SparkR lets us perform a number of data exploration operations using a combination of R and SQL. The most common ones are select, collect, group_by, summarize, subset and arrange. You can learn these operations with this article.
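To give a feel for these operations before you dive into the resources, here is a small sketch that applies each of them to R’s built-in `mtcars` data set and then repeats the summary in SQL. It assumes the Spark 2.x-style API and a local installation reachable through `SPARK_HOME`. The view name `cars` and the application name are my own placeholders.

```r
# Assumption: SPARK_HOME points at a local Spark installation.
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sparkR.session(master = "local[*]", appName = "sparkr-exploration")

df <- as.DataFrame(mtcars)

# select: keep only the columns of interest.
cars <- select(df, "mpg", "cyl", "hp")

# subset: keep rows that satisfy a condition, then count them.
strong <- subset(cars, cars$hp > 100)
n_strong <- count(strong)

# group_by + summarize: average mpg and number of cars per cylinder count.
by_cyl <- summarize(group_by(cars, cars$cyl),
                    avg_mpg = avg(cars$mpg),
                    n_cars = n(cars$cyl))

# arrange: sort the summary by cylinder count.
sorted <- arrange(by_cyl, by_cyl$cyl)

# collect: pull the (small) result back into a local R data frame.
local_summary <- collect(sorted)

# The same summary expressed in SQL against a temporary view.
createOrReplaceTempView(cars, "cars")
sql_summary <- collect(sql(
  "SELECT cyl, AVG(mpg) AS avg_mpg, COUNT(*) AS n_cars
   FROM cars GROUP BY cyl ORDER BY cyl"))

sparkR.session.stop()
```

Note that nothing is brought into R’s memory until `collect` is called, so it is best reserved for results that are already small.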
Exercise: Do this exercise by AMPLab, Berkeley
Dataset used in above exercise: Download
Step 6: Building Predictive Models (Linear) on SparkR
As mentioned above, SparkR only supports linear modeling algorithms such as regression. However, I expect this constraint to be temporary, and an updated version that supports non-linear models should roll out soon.
SparkR implements linear modeling using the function glm. On the other hand, Spark itself currently has a machine learning library known as MLlib (for more info on MLlib, click here), which supports non-linear modeling.
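As a taste of what a call to `glm` looks like, here is a minimal sketch that fits a linear regression on R’s built-in `faithful` data set. It assumes the Spark 2.x-style session API and a local installation reachable through `SPARK_HOME`. The choice of data set and formula is mine, purely for illustration.

```r
# Assumption: SPARK_HOME points at a local Spark installation.
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sparkR.session(master = "local[*]", appName = "sparkr-glm")

df <- as.DataFrame(faithful)

# Linear regression: predict eruption length from waiting time.
model <- glm(eruptions ~ waiting, data = df, family = "gaussian")
summary(model)

# Score the training data; fitted values land in a "prediction" column.
preds <- predict(model, df)
pred_cols <- columns(preds)
head(select(preds, "eruptions", "waiting", "prediction"))

sparkR.session.stop()
```

For a logistic regression, the same call takes `family = "binomial"` with a binary response column.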
Learn and Practice: To build your first linear regression model on SparkR, follow this link. To build a logistic regression model, follow this link.
Step 7: Integrating SparkR with Hive for Faster Computation
SparkR works even faster with Apache Hive for database management.
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Integrating Hive with SparkR helps run queries even faster and more efficiently.
If you want to step into big data, Hive would be a great advantage for efficient data processing. You can install Hive by following the links given for your respective OS:
After you’ve installed Hive successfully, you can start integrating it with SparkR using the steps demonstrated in this video. Alternatively, if you are more comfortable reading, the video’s content is also available in text format on this blog.
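For orientation, here is a minimal sketch of what a Hive-enabled SparkR session can look like. It assumes the Spark 2.x-style API, where `sparkR.session()` accepts an `enableHiveSupport` argument, and a Spark build that includes Hive support. It is not the exact procedure from the video, and the table name in the commented-out query is hypothetical.

```r
# Assumption: SPARK_HOME points at a local Spark build with Hive support.
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

# Ask for a session that can talk to the Hive metastore.
sparkR.session(master = "local[*]", appName = "sparkr-hive",
               enableHiveSupport = TRUE)

# List the tables visible to this session as a local R data frame.
tables <- collect(sql("SHOW TABLES"))
print(tables)

# Querying a Hive table then works like any other SQL call, for example:
# results <- collect(sql("SELECT * FROM my_hive_table LIMIT 10"))  # hypothetical table

sparkR.session.stop()
```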
For a quick overview of SparkR, you can also follow its official documentation.
End Notes
I hope that I have made the learning path clear enough to accelerate your journey into data science using SparkR.
SparkR is often seen as an intermediate step for R users moving into big data. I learned SparkR because I used to find it immensely difficult to work on large data sets in R. SparkR gave me a convenient and cost-free way to continue my learning.
In addition, SparkR can give a head start to an R user who wishes to transition into the big data industry. It is much more powerful than what I have explored so far.
Did you find this article helpful? Have you worked on SparkR? Do share your suggestions and experience in the comments section below.