Introduction
Surprises arrive when you least expect them.
The Seer’s Accuracy turned out to be a challenging surprise for data scientists. So, what changed this time? There was no train or test file. Participants were given just one file to download. Would you believe it? Everyone was puzzled. Some gave up at the beginning, but the determined ones stayed till the end and learned something new!
Did you miss out on this experience? If you didn’t, great! But if you did, you missed a wonderful opportunity to learn something new. I can’t bring back the thrilling experience, but I can give you one more chance to learn (the data set is live again).
The Seer’s Accuracy was held from 29th April to 1st May 2016. The competition attracted more than 2,200 participants from across the world. In this 72-hour battle, the first thing participants had to do was create the train and test files themselves. Only after that could the race to become the seer start.
Once again, XGBoost and ensemble modeling helped the winners arrive at highly accurate solutions. Below are the winning solutions of the top 3 winners, along with a short interview with each of them highlighting the approach and thought process that got them into the top 3.
If you participated in this competition, it’s time to analyze your hits and misses and become better for the next one.
Note: R was extensively used by winning team members. Special thanks to these winners for their immense cooperation in sharing their experience and knowledge.
The Competition
The participants were required to help “ElecMart”, a chain of electronic superstores looking to increase its sales from existing customers. The evaluation metric used was AUC – ROC.
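For readers new to the metric, here is a minimal sketch of what AUC measures: the probability that a randomly chosen repeat buyer receives a higher score than a randomly chosen non-buyer. The function name `roc_auc` and the example values are illustrative assumptions, not part of the competition material, and the pairwise loop is written for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical example: 1 = customer purchased again, 0 = customer did not.
example_labels = [1, 0, 1, 0]
example_scores = [0.9, 0.3, 0.4, 0.6]
example_auc = roc_auc(example_labels, example_scores)
```

Because only the ordering of the scores matters, any transformation that preserves the ranking of customers leaves the AUC unchanged.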
Problem Statement
About ElecMart
ElecMart, as the name suggests, is a supermarket for electronics. It serves the needs of both retail clients and various corporate clients. Customers not only get to see and feel a wide range of products, they also receive exciting discounts and excellent customer service. ElecMart started in 1999 and launched a customer loyalty program in 2003.
ElecMart aims to be the largest electronics superstore across the nation, but it has a big hurdle ahead!
The problem – Where are the recurring buyers?
The loyalty program is meant for customers who want to benefit from repeat purchases and who register at the time of purchase. They need to present the loyalty card at the point of sale at the time of purchase, and the benefits are non-transferable. Corporate sales automatically get the benefits of the loyalty program.
In a recent benchmarking activity and market survey sponsored by ElecMart, it was found that the “repeat purchase rate”, i.e. the share of these customers who come back for further purchases, is very low compared to competitors. Increasing sales to these customers is the only way to run a successful loyalty program.
Data provided
ElecMart has shared all the transactions it has had with its loyalty program customers since the loyalty program started. It wants to run focused campaigns with these customers, highlighting the benefits of continued shopping with ElecMart. You are expected to identify the probability of each customer (in the loyalty program) making a purchase in the next 12 months.
You are expected to upload the solution in the format of “sample_submission.csv”. The public-private split is 20:80.
Note: For practice, the data set is currently available for download on Link. Please note that the data set is available for your practice purpose and will be accessible until 12th May 2016.
Winners!
Rank 3: Bishwarup Bhattacharjee, Kolkata, India
Bishwarup is an entrepreneur and is currently the CEO of Alphinite Analytics. He is a Kaggle Master and is currently ranked 13 on Data Hack. He won INR 20,000 ($300).
He said:
The data for this particular competition was a bit different from conventional ML problems. It had no target column and no explicit separation between the training and test sets. So I found that there was more than one potential way to tackle the problem.
However, since the evaluation metric for the competition was the area under the ROC curve (AUC), I preferred to first formulate the problem as a case of supervised learning which I think majority of the participants did as well.
I used the data from 2003-2005 as my training set and matched the customers who repeated in 2006 to derive the labels for my data. That was pretty straightforward. Just formulating the problem in this way and using a very simple xgboost model, I could get > 0.83 on the public leaderboard.
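The labelling step described above can be sketched as follows. This is only an illustration under stated assumptions: the transaction log, the customer IDs, and the helper `build_labels` are hypothetical and do not come from the winning solution, which was written in R.

```python
from datetime import date

# Hypothetical transaction log: (customer_id, purchase_date).
transactions = [
    ("C1", date(2004, 3, 14)),
    ("C1", date(2006, 7, 2)),
    ("C2", date(2005, 11, 30)),
    ("C3", date(2003, 6, 1)),
    ("C3", date(2006, 1, 20)),
    ("C4", date(2006, 5, 5)),  # first seen in 2006, so not in the training population
]


def build_labels(transactions, target_year):
    """Label customers seen before target_year: 1 if they bought in target_year, else 0."""
    history = {cid for cid, day in transactions if day.year < target_year}
    repeaters = {cid for cid, day in transactions if day.year == target_year}
    return {cid: int(cid in repeaters) for cid in sorted(history)}


labels = build_labels(transactions, target_year=2006)
```

Customers who appear only in the target year are left out, because they have no earlier history from which features could be built.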
Then, feature engineering played a huge role in my success. Since we were ultimately supposed to predict the probability of a repeat purchase on a per-user basis, I summarized the multiple records of each user in the training data into one single training instance. The features which helped me are as follows:
- Age of the customer as of 2007-01-01
- Creating user-store association matrix
- Creating user-product category association matrix
- Creating user-sales executive association matrix (and dropping extremely sparse columns)
- Creating user-payment method association matrix
- Creating user-lead source category association matrix
- Average popularity of all the sales executives who attended a particular customer
- Entropy of price range offered to a customer compared to general price range of a particular product category
- Number of transactions in last 1 year
- Number of transactions in last 6 months
- Number of transactions in last 3 months
- Number of transactions in the last month of the training period
- First store visited by a customer
- Total number of stores visited by a customer
- Min, Max, Mean, Range of transaction amount for a customer
- Time to last purchase in days
- Median EMI
- Number of unique stores / Number of previous purchases
- Number of unique product categories / Number of previous purchases
There were more features which I derived, but they did not help my model’s accuracy.
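To make the "one row per customer" idea concrete, here is a minimal sketch that computes a handful of the features listed above (transaction counts in recent windows, amount statistics, days since last purchase, and the stores-per-purchase ratio). The column layout, the cutoff date, and all names are assumptions chosen for illustration; the actual solution was written in R and used many more features.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical rows: (customer_id, purchase_date, store_id, amount).
rows = [
    ("C1", date(2005, 12, 20), "S1", 1200.0),
    ("C1", date(2005, 3, 1), "S2", 300.0),
    ("C1", date(2004, 1, 15), "S1", 900.0),
    ("C2", date(2003, 8, 9), "S3", 500.0),
]
cutoff = date(2006, 1, 1)  # assumed end of the training period


def customer_features(rows, cutoff):
    """Collapse all pre-cutoff transactions of each customer into one feature dict."""
    grouped = defaultdict(list)
    for cid, day, store, amount in rows:
        if day < cutoff:
            grouped[cid].append((day, store, amount))

    features = {}
    for cid, txns in grouped.items():
        days_ago = [(cutoff - day).days for day, _, _ in txns]
        amounts = [amount for _, _, amount in txns]
        stores = {store for _, store, _ in txns}
        features[cid] = {
            "n_txn": len(txns),
            "n_txn_last_365d": sum(d <= 365 for d in days_ago),
            "n_txn_last_90d": sum(d <= 90 for d in days_ago),
            "days_since_last": min(days_ago),
            "amount_min": min(amounts),
            "amount_max": max(amounts),
            "amount_mean": mean(amounts),
            "amount_range": max(amounts) - min(amounts),
            "n_stores": len(stores),
            "stores_per_txn": len(stores) / len(txns),
        }
    return features


features = customer_features(rows, cutoff)
```

Filtering on the cutoff before aggregating keeps information from the label period out of the features, which would otherwise leak the target.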
In the end, I trained two xgboost models, each on a different subset of the above features, and their rank average got me to 3rd position in this competition with an AUC of 0.874409.
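Rank averaging replaces each model's raw predictions with their ranks before averaging, so that models with differently scaled outputs contribute equally. The sketch below shows one simple way to do it; the helper names and the two prediction vectors are hypothetical, and the original solution may have handled ties or scaling differently.

```python
def ranks(scores):
    """Return 1-based ranks; tied scores share the average of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = average_rank
        start = end + 1
    return result


def rank_average(*model_scores):
    """Average the per-model ranks and rescale the result into (0, 1]."""
    n = len(model_scores[0])
    per_model = [ranks(s) for s in model_scores]
    return [sum(r[i] for r in per_model) / (len(per_model) * n) for i in range(n)]


# Hypothetical predictions from two models for the same four customers.
model_a = [0.90, 0.20, 0.55, 0.40]
model_b = [0.70, 0.10, 0.30, 0.60]
blended = rank_average(model_a, model_b)
```

Since AUC depends only on the ordering of the predictions, blending ranks rather than raw probabilities loses nothing for this metric.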
My Solution: Link
Rank 2: Oleksii Renov (Dnipro, Ukraine) and Thakur Raj Anand (Hyderabad, India)
This was the first time a team (Team Or) managed to secure a position in the top 3. This team won INR 35,000 ($500).
Thakur Raj Anand (DataGeek) is a data science analyst with a Master's in Quantitative Finance, based in Hyderabad. He mostly uses R and Python for data science competitions. Oleksii Renov (orenov) is a data scientist based in Dnipro, Ukraine. He loves programming in Python, R and Scala.
They said:
We spent 40% of the time exploring the data and converting the problem into a supervised learning problem.
We generated negative samples by assigning 0 to those IDs which had no transaction in the year 2006 but had a history before 2006. We constructed 4 different representations of the data, with the idea of capturing different signals from different representations.
For modeling, we mainly used XGBoost, but we also tried Random Forest and ExtraTrees, which unfortunately didn’t improve the accuracy of our final predictions.
Oleksii has a habit of looking for unusual patterns in data. He found that the predictions from the tree model and the linear model were very different, and that averaging them gave a significant boost in CV as well as on the LB.
We kept exploring different styles and finally built 4 tree models and 1 linear model using XGBoost. We built the linear model only on the final representation of the data, the one on which XGBoost was giving the best CV. We finished at Rank 2 with an AUC of 0.876660.
In this competition, we learned a lot about sparse matrices. We decided to learn simple things like aggregation and transformation on sparse matrices, which is very helpful for exploring large data sets efficiently.
In the end, we would like to tell young aspiring data science folks to never give up. Every time you feel like giving up, try to make a different representation of data and try different models on them.
Our Solution: Link
Rank 1: Rohan Rao, Mumbai, India
Rohan Rao is currently working as a Data Scientist at AdWyze. He is a Kaggle Master and currently ranked 6 on Data Hack. He is a three time National Sudoku Champion and currently ranked 14th in the world. He won INR 70,000 ($1000).
He said:
Hackathons might be meant for quick and smart modelling, but this one restored my faith in focusing on smart.
I’ve been regularly participating in competitions at Data Hack. More than anything, I’ve learned many new things. I am glad I finally got my maiden win!
The road to achieve a seer’s accuracy turned out to be interesting. Unlike a majority of predictive modelling competitions, this hackathon did not have the standard train/test data format.
I started off with understanding how best to build a machine-learning based solution with the data, along with setting up a stable validation framework.
Since the CV and LB scores from my XGBoost model were well in sync, I explored each variable and started working on feature engineering. I could see that there was subtle but good scope for creating new variables.
My final model was an ensemble of 3 XGBoost models, each having a different set of data points, features and parameters. The ensemble was mainly to ensure more stability in the predictions. I explored a few other ML-based models, but none were as good as XGBoost. Even ensembling them with XGBoost did not help. This way I won the competition with an AUC of 0.88002.
I feel it is always wonderful to work with clean datasets that are designed around a good problem statement. And this hackathon was very well organized. The stable CV-LB correlation was a huge plus because it enabled me to focus on feature engineering, which is the most exciting part of building machine learning models.
It was nice to see and compete with many of the top data scientists in India, and at the end, I’m glad I finished 1st to win my maiden competition on AnalyticsVidhya.
The biggest learning for me from this competition was the importance of drilling down into understanding the problem statement inside out and building a robust and solid solution step-by-step. And, then practice more so that one can do these as quickly as possible. It might sound cliche but it actually works!
Finally, some of the tips I would like to give to aspiring data scientists:
- Always trust your Cross-Validation (CV) score. And to trust your CV, you need to build the right validation method depending on the problem, data and evaluation.
- During the competition, explore and try out as many ideas as possible. You’d be surprised to know that sometimes, the simplest algorithm or the least obvious ones could also work out.
- In the end, always be ready to learn from others and never hesitate in asking for help. There’s always something to learn for everyone.
My Solution: Link
Key Takeaways from this Competition
This competition came with a clean, well-structured data set, so no effort was required for data cleaning. Instead, problem framing (which most of us overlook) paved the way to success. Moving away from the conventional ML competition format turned out to be challenging for participants, but it eventually gave them something new to learn. Below are the key takeaways from our top 3 participants:
- Understand the Problem: Before you start working on data, make sure you clearly understand what has been asked for. This will avoid all sorts of confusion and help you to start with a definite goal.
- Feature Engineering: Once again, feature engineering played a crucial part in modeling. Your aim should be to derive new features that supply unique information to the algorithm.
- Boosting & Ensemble: The choice of ML algorithm is entirely up to the participant. But in this competition, boosting (XGBoost) performed well enough that none of the winners gained anything from other ML algorithms. Ensembling at the end helped further improve the predictions. You must learn boosting and ensembling to perform better in competitions. You can start here.
End Notes
Some of you might have sought motivation from this article, and some of you might take away knowledge. If you have read the winners’ talk thoroughly, you will have realized that winning this competition didn’t require any extraordinary technique. It wasn’t about knowing advanced machine learning algorithms; it required the simple approach of understanding the problem.
Therefore, the next time you take up a challenge, make sure you’ve understood what has been asked for, and only then start working on predictive modeling. This way you’ll have more confidence while working. Last but not least, learn about cross validation, xgboost and feature engineering.
Did you find this article helpful? Were you able to analyze your hits and misses? Don’t worry, there is always a next time. Winning is a good habit. Coming up soon is Mini Data Hack.