Data science is an interdisciplinary endeavor whose purpose is to extract insight from varied sources of information. Various communities come together at data science conferences to share their knowledge and promote innovation. It is not surprising, then, that the tools showcased by data scientists at ODSC East are myriad, but what are the most valued and popular programming languages in a data scientist’s toolbox?
A 2014 KDnuggets poll [1] suggests that R, Python, SAS and SQL are among the top contenders. Almost every posting on the topic will mention at least three of these tools. There even seems to be a not-so-silent [2] competition between the R and Python communities over crowning one or the other as ‘best’ for performing data science tasks. A nice post by Martijn Theuwissen [3] in 2015 summarizes the state of the art in R and Python comparisons. So, is R or Python better for data science? Well, that depends, Martijn suggests. This is, of course, exactly correct. R and Python are defined by their differences, and both have unique advantages.
Most businesses house their data in databases: relational systems such as MySQL and PostgreSQL, other SQL-accessible systems, or NoSQL stores. As such, being able to interact with these systems by writing SQL-based queries to extract data is an important and incredibly valuable skill for most analysts and data scientists.
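As a minimal sketch of what extracting data with a SQL query looks like in practice, the example below uses Python's built-in `sqlite3` module with an in-memory database. The `orders` table, its columns, the sample rows, and the `top_customers` helper are all hypothetical and exist only for illustration; a production system would more likely be MySQL or PostgreSQL, reached through its own driver.

```python
import sqlite3


def top_customers(conn, limit=3):
    """Return (customer, total_spent) rows, largest totals first."""
    query = """
        SELECT customer, SUM(amount) AS total_spent
        FROM orders
        GROUP BY customer
        ORDER BY total_spent DESC
        LIMIT ?
    """
    # The limit is passed as a bound parameter rather than formatted into the SQL.
    return conn.execute(query, (limit,)).fetchall()


# Build a tiny in-memory database with made-up rows (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("alice", 30.0),
        ("bob", 12.5),
        ("alice", 20.0),
        ("carol", 45.0),
        ("bob", 7.5),
    ],
)
conn.commit()

rows = top_customers(conn, limit=2)
print(rows)
```

The same `SELECT ... GROUP BY ... ORDER BY` pattern carries over largely unchanged to other SQL-accessible systems, which is part of why the skill transfers so well.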
As datasets become vastly larger, deep knowledge of distributed storage, computation and querying is becoming one of the most valuable skill sets a data scientist can have. Many of the interfaces to these systems are SQL-like, and Big Data architectures are rapidly becoming the norm for business storage solutions. Among the more popular interfaces to Big Data architectures are Pig, Hive, and SparkSQL, but there are some interesting new developments like Apache Drill, which bills itself as a “Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage”.
A very nice blog posting by Jesse Steinweg-Woods [4] demonstrates how to use Beautiful Soup and other Python tools to scrape Indeed.com for information on data science job postings. We used this approach to look at the top 5 cities for data science jobs and find out what the top 5 data science tools are.
The top 5 tools by job listing are:
1. Python
2. R
3. SQL
4. Hadoop
5. Java
R and Python are reliably the top two programming languages that companies want their data scientists to know. SQL comes in a clear third, with Java, Hadoop, Spark, Pig and Hive trailing behind. Surprising to us was the prevalence of Excel, MATLAB, SAS, SPSS and Tableau, which are not always thought of as the most popular toolsets among data scientists. Julia makes a surprise showing. It is still young in its development cycle, but Julia seems to be an increasingly popular language for data science tasks.
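A ranking like this comes down to counting how many postings mention each tool. The sketch below shows only that counting step, using the standard library; it is not the code from the blog post cited above, and it assumes the posting text has already been scraped and stripped of HTML. The tool list, the sample postings, and the `count_tool_mentions` helper are hypothetical.

```python
import re
from collections import Counter

# Hypothetical list of tool names to look for.
TOOLS = ["Python", "R", "SQL", "Hadoop", "Java"]


def count_tool_mentions(postings, tools=TOOLS):
    """Count how many postings mention each tool at least once."""
    counts = Counter()
    for text in postings:
        for tool in tools:
            # Require that the name is not embedded in a longer token, so
            # "Java" does not match "JavaScript" and "R" does not match
            # the letter r inside ordinary words.
            pattern = r"(?<![\w+#])" + re.escape(tool) + r"(?![\w+#])"
            if re.search(pattern, text, flags=re.IGNORECASE):
                counts[tool] += 1
    return counts


# Made-up posting text, for illustration only.
postings = [
    "Seeking a data scientist fluent in Python and SQL.",
    "Experience with R, Python, and Hadoop required.",
    "JavaScript developer wanted; SQL a plus.",
]

counts = count_tool_mentions(postings)
for tool, n in counts.most_common():
    print(tool, n)
```

This matching is crude: a single-letter name such as R will still match strings like "R&D", so real counts need more careful handling of short names.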
Summary
The top tools and languages for data science are rather consistent, varying only slightly across regions, and knowledge of R, Python and SQL is in general the most sought-after skill set. What is clear is that understanding and skill in languages that interact with Big Data storage and compute architectures are quickly becoming a must for practicing data scientists. Julia is now on the job-market map, and it will be interesting to see how quickly it spreads in the data sciences.
1. http://www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html
2. http://www.infoworld.com/article/2951779/application-development/in-data-science-the-r-language-is-swallowing-python.html
3. http://www.kdnuggets.com/2015/05/r-vs-python-data-science.html
4. https://jessesw.com/Data-Science-Skills/