Introduction
Do I have the necessary skills to take part in Kaggle Competitions?
Have you ever asked yourself this question? I certainly did as a sophomore, when I feared Kaggle just from imagining how difficult it would be. This fear was similar to my fear of water, which kept me from taking swimming classes. Later, though, I learnt: “Until you step into the water, you can’t tell how deep it is.” The same philosophy applies to Kaggle. Don’t conclude until you try!
Kaggle, the home of data science, provides a global platform for competitions, customer solutions and a job board. Here’s the Kaggle catch: these competitions not only make you think out of the box, but also offer handsome prize money.
Yet, people hesitate to participate in these competitions. Here are some major reasons:
- They underestimate the skills, knowledge and techniques they have acquired.
- Irrespective of their skill level, they choose the problem offering the highest prize money.
- They fail to match their skill level to the difficulty of the problem.
I reckon this issue stems from Kaggle itself. Kaggle.com doesn’t provide any information that helps people choose the problem best matched to their skill set. As a result, it has become an arduous task for beginners and intermediates to decide on a suitable problem to begin with.
Table of contents
- Introduction
- Top 8 Kaggle Problems
- Different stages of life to start your Kaggle journey!
- Case 1 : I have a background in coding but am new to machine learning.
- Case 2 : I have been in the analytics industry for more than 2 years, but am not comfortable with R / Python
- Case 3 : I am good with coding and machine learning, need something challenging to work on
- Case 4 : I am a newbie to both machine learning and coding, but I want to learn
- A few hacks to be a fair competitor on Kaggle
- Conclusion
Top 8 Kaggle Problems
1. Titanic : Machine Learning from disaster
Objective: A classic, popular problem to start your journey with machine learning. You are given a set of attributes of the passengers on board and you need to predict who survived after the ship sank.
Difficulty level on each of the attributes:
a) Machine Learning Skills – Easy
b) Coding skills – Easy
c) Acquiring Domain Skills – Easy
d) Tutorials available – Very comprehensive
2. First Step with Julia
Objective: This is a problem to identify characters in Google Street View pictures using Julia, an up-and-coming language.
Difficulty level on each of the attributes :
a) Machine Learning Skills – Easy
b) Coding skills – Medium
c) Acquiring Domain Skills – Easy
d) Tutorial available – Comprehensive
3. Digit Recognizer
Objective: You are given pixel data for handwritten digits and you need to say conclusively which digit each one is. This is a classic problem for a latent Markov model.
Difficulty level on each of the attributes :
a) Machine Learning Skills – Medium
b) Coding skills – Medium
c) Acquiring Domain Skills – Easy
d) Tutorial available – Available but no hand holding
4. Bag of Words meet Bag of Popcorn
Objective: You are given a set of movie reviews, and you need to find the sentiment hidden in these statements. The objective of this problem is to introduce you to Google’s Word2Vec package.
It is a fantastic package which helps you map words to vectors in a finite-dimensional space. This way we can build analogies by looking only at the vectors. One very simple example is that your algorithm can bring out analogies like: King – Male + Female will give you Queen.
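To see why vector arithmetic can produce an analogy, here is a minimal sketch in plain Python. It does not use Word2Vec itself: the five vectors below are hand-picked, hypothetical 3-dimensional toys, whereas real embeddings are learned from text and have far more dimensions. The helper names `cosine` and `analogy` are also made up for this illustration.

```python
import math

# Hypothetical toy vectors, hand-picked for illustration only.
# Assumed dimensions: (royalty, maleness, femaleness).
vectors = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "male":   [0.1, 0.9, 0.1],
    "female": [0.1, 0.1, 0.9],
    "apple":  [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative).

    The input words themselves are excluded from the candidates.
    """
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    candidates = (w for w in vectors if w not in positive and w not in negative)
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# King - Male + Female
result = analogy(positive=["king", "female"], negative=["male"], vectors=vectors)
print(result)
```

With these toy numbers the nearest remaining word is "queen". With learned embeddings the result depends on the training text, so treat this only as an intuition for the arithmetic.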
Difficulty level on each of the attributes :
a) Machine Learning Skills – Difficult
b) Coding skills – Medium
c) Acquiring Domain Skills – Easy
d) Tutorial available – Available but no hand holding
5. Denoising Dirty Documents
Objective: You might know about a technology known as OCR. It converts scanned documents, printed or handwritten, into digital text. However, it is not perfect, especially on noisy scans. Your job here is to use machine learning to clean up dirty documents so that they can be read reliably.
Difficulty level on each of the attributes :
a) Machine Learning Skills – Difficult
b) Coding skills – Difficult
c) Acquiring Domain Skills – Difficult
d) Tutorial available – No
6. San Francisco Crime Classification
Objective: Predict the category of crimes that occurred in the city by the bay.
Difficulty level on each of the attributes :
a) Machine Learning Skills – Very Difficult
b) Coding skills – Very Difficult
c) Acquiring Domain Skills – Difficult
d) Tutorial available – No
7. Taxi Trajectory Prediction Time / Location
Objective: There are two problems based on the same dataset. You are given the coordinates a taxi has visited so far, and you are supposed to predict either where the taxi is going or the time it will take to complete the journey.
Difficulty level on each of the attributes :
a) Machine Learning Skills – Easy
b) Coding skills – Difficult
c) Acquiring Domain Skills – Medium
d) Tutorial available – A few benchmark codes available
8. Facebook Recruiting – Human or bot
Objective: If you have a knack for understanding new domains, you have got to solve this one. You are given bidding data and are expected to classify each bidder as a bot or a human. This has the richest data source available out of all problems on Kaggle.
Difficulty level on each of the attributes :
a) Machine Learning Skills – Medium
b) Coding skills – Medium
c) Acquiring Domain Skills – Medium
d) Tutorial available – No support available as it is a recruiting contest
Note: I have not covered the Kaggle contests offering prize money in this article as they are all related to a specific domain. Let me know your take on them in the comment section below.
Different stages of life to start your Kaggle journey!
We have defined the correct approach to take up a Kaggle problem for the following cases:
Case 1 : I have a background in coding but am new to machine learning.
Step 1: The first Kaggle problem you should take up is Taxi Trajectory Prediction. The problem has a complex dataset in which one column holds JSON listing the coordinates the taxi has visited. If you can break this down, getting an initial estimate of the destination or the trip time does not need any machine learning. Hence, you can use your coding strength to find your value in this industry.
Step 2: Your next step should be Titanic. By now you already understand how to handle complex datasets, so this is the perfect time to take a shot at a pure machine learning problem. With an abundance of solutions and scripts available, you will be able to build a good solution.
Step 3: You are now ready for something big. Try Facebook Recruiting. It will help you appreciate how understanding the domain can help you get the best out of machine learning.
Once you have all these pieces in place, you are good to try any problem on Kaggle.
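To make Step 1 of this case concrete, here is a minimal sketch of breaking down a JSON coordinate column without any machine learning. The layout of the column (a JSON list of [longitude, latitude] pairs), the 15-second sampling interval, the sample row and the function names are all assumptions for illustration, so check the competition's data description before relying on them.

```python
import json

# Assumed sampling interval between consecutive GPS points (not from the article).
SAMPLE_INTERVAL_SECONDS = 15

def parse_trace(raw):
    """Turn the JSON text of one row into a list of (lon, lat) tuples."""
    return [tuple(point) for point in json.loads(raw)]

def naive_estimates(raw):
    """Rule-based guesses for one trip, with no machine learning involved.

    - last_point: the most recent known position, used as a crude
      first guess for the destination.
    - elapsed_seconds: time covered by the trace so far, assuming a
      fixed sampling interval.
    """
    trace = parse_trace(raw)
    if not trace:
        return {"last_point": None, "elapsed_seconds": 0}
    return {
        "last_point": trace[-1],
        "elapsed_seconds": (len(trace) - 1) * SAMPLE_INTERVAL_SECONDS,
    }

# Made-up example row, not real competition data.
row = "[[-8.618, 41.141], [-8.620, 41.142], [-8.622, 41.144]]"
estimates = naive_estimates(row)
print(estimates)
```

A baseline like this gives you a score to beat before you bring in any model, which is exactly where a coding background pays off.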
Case 2 : I have been in the analytics industry for more than 2 years, but am not comfortable with R / Python
Step 1: You should begin by taking a shot at Titanic. You already understand how to build a predictive algorithm, so you should now strive to learn languages like R and Python. With an abundance of solutions and scripts available, you will be able to build different kinds of models in both R and Python. This problem will also help you understand a few advanced machine learning algorithms.
Step 2: The next step should be Facebook Recruiting. Given the simplicity of the data structure and the richness of the content, you will be able to join the right tables and build a predictive algorithm on this one. It will also help you appreciate how understanding the domain can help you get the best out of machine learning.
Suggestions: You are now ready for something very different from your comfort zone. Read problems like Diabetic Retinopathy Detection, Avito Context Ad Clicks and Crime Classification, and find the domain that interests you. Now try applying whatever you have learned so far.
Now is the time to try something more complex to code. Try Taxi Trajectory Prediction or Denoising Dirty Documents. Once you have all these pieces in place, you can try any problem on Kaggle.
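The "join the right tables" idea from Step 2 can be sketched in plain Python as an aggregate followed by a left join. The table contents, the column names (`bidder_id`, `is_bot`, `auction`, `time`) and the three derived features are hypothetical and chosen only to illustrate the pattern; the real competition files may look different.

```python
from collections import defaultdict

# Hypothetical rows; column names are assumptions for illustration only.
bidders = [
    {"bidder_id": "b1", "is_bot": 0},
    {"bidder_id": "b2", "is_bot": 1},
    {"bidder_id": "b3", "is_bot": 0},  # a bidder with no bids at all
]
bids = [
    {"bidder_id": "b1", "auction": "a1", "time": 10},
    {"bidder_id": "b2", "auction": "a1", "time": 11},
    {"bidder_id": "b2", "auction": "a1", "time": 12},
    {"bidder_id": "b2", "auction": "a2", "time": 13},
]

def build_features(bidders, bids):
    """Aggregate the bids table per bidder, then left-join onto bidders."""
    bid_count = defaultdict(int)
    auctions = defaultdict(set)
    for bid in bids:
        bid_count[bid["bidder_id"]] += 1
        auctions[bid["bidder_id"]].add(bid["auction"])

    rows = []
    for bidder in bidders:
        key = bidder["bidder_id"]
        n_bids = bid_count.get(key, 0)
        n_auctions = len(auctions.get(key, ()))
        rows.append({
            **bidder,
            "n_bids": n_bids,
            "n_auctions": n_auctions,
            # Guard against bidders that never placed a bid.
            "bids_per_auction": n_bids / n_auctions if n_auctions else 0.0,
        })
    return rows

features = build_features(bidders, bids)
for row in features:
    print(row)
```

If you come from an analytics background this is the same group-by and join you would write in SQL; the exercise is mostly about expressing it in a new language.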
Case 3 : I am good with coding and machine learning, need something challenging to work on
Step 1: You have many options on Kaggle. The first option is to master a new language like Julia. You can start with First Step with Julia. This will give you additional exposure to what Julia can do compared with Python or R.
Step 2: The second option is to develop skills in an additional domain. You can try Avito Context, Search Relevance or Facebook – Human vs. Bot.
Case 4 : I am a newbie to both machine learning and coding, but I want to learn
Step 1: You should begin your Kaggle journey with Titanic. The first step for you is to learn languages like R and Python. With an abundance of solutions and scripts available, you will be able to build different kinds of models in both R and Python. This problem will also help you understand a few machine learning algorithms.
Step 2: You should then take up Facebook Recruiting. Given the simplicity of the data structure and the richness of the content, you will be able to join the right tables and build a predictive algorithm on this one. It will also help you appreciate how understanding the domain can help you get the best out of machine learning.
Once you are done with these, you can then take up problems as per your interest.
A few hacks to be a fair competitor on Kaggle
This is not a comprehensive list of hacks, but it is meant to give you a good start. A comprehensive list deserves a post of its own:
- Make sure you submit a solution (even the sample submission will do) before the entry deadline if you wish to keep participating in the competition later.
- Understand the domain before you get into the data. For instance, in the bot vs. human problem, you need to understand how an online bidding platform works before you start working with the data.
- Build your own evaluation routine that mimics the Kaggle test score. A simple 10-fold cross-validation generally works fine.
- Try to carve out as many features as possible from the training data. Feature engineering is usually the part that pushes you from the top 40 percent to the top 10 percent.
- A single model generally does not get you into the top 10. You need to build many models and ensemble them. These can be models with different algorithms or with different sets of variables.
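Two of the hacks above, local 10-fold cross-validation and ensembling, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: accuracy stands in for whatever metric a given competition uses, the "model" is a placeholder that always predicts the most common training label, the labels are made up, and every function name here is hypothetical.

```python
import random
from collections import Counter

def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices and split them into k nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(rows, labels, fit, predict, k=10):
    """Average accuracy over k held-out folds (assumes len(rows) >= k)."""
    scores = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, rows[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

def majority_vote(predictions_per_model):
    """Ensemble by voting: combine label predictions row by row."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]

# Placeholder model: always predict the most common training label.
def fit_majority(train_rows, train_labels):
    return Counter(train_labels).most_common(1)[0][0]

def predict_majority(model, row):
    return model

# Made-up data: 100 rows, 25% positive labels.
rows = list(range(100))
labels = [1 if i % 4 == 0 else 0 for i in range(100)]
score = cross_validate(rows, labels, fit_majority, predict_majority, k=10)
print(round(score, 2))

# Three hypothetical models voting on three rows.
combined = majority_vote([[1, 0, 1], [1, 1, 0], [0, 1, 1]])
print(combined)
```

Swap the placeholder `fit` and `predict` functions for your real model and replace accuracy with the competition's metric; the closer your local score tracks the leaderboard, the fewer submissions you waste.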
Conclusion
I have realized multiple benefits from working on Kaggle problems. I learnt R / Python on the fly, and I believe that is the best way to learn them. Interacting with people on the discussion forums for various problems will also help you get a deeper understanding of machine learning and of the domain.
In this article, we looked at various Kaggle problems and rated their essential attributes by level of difficulty. We also took up various real-life cases and laid out the right approach to participating in Kaggle for each.
Have you participated in any Kaggle problem? Did you see any significant benefits by doing the same? Do let us know your thoughts about this guide in the comments section below.