Introduction
One of the most common questions asked these days is what makes a good data scientist. The simple answer – it depends. The long answer – someone who can lead all the phases of a data science project. For an even longer answer, read on.
A Data Science project is not just a hackathon competition where a ready-made dataset is provided and the success metric or the error to optimize is clearly laid out.
So what’s different? Well, there are various phases in a data science project – Getting the context of the problem, understanding the data, deep diving into it, understanding implementations and coding shortcomings, figuring out the right set of algorithms to use, coding those algos, performance of those algorithms from an engineering and a data science perspective and optimization.
As you can imagine, a data science skillset is a mixture of what was traditionally called computer science, and business analytics. Sometimes, given the breadth and depth of the work, you might be unlikely to find a person who knows all these aspects (let alone being good at them). Instead, its better to build a team that has a mix of people who specialize in different areas required for the data science project.
In this article, we will look at what types of data scientists are there, how to find them, what the current process is and what can be further improved.
Table of Contents
- Types of Scientists
- Strategy of the talent pool
- How to find these super scientists
- The Current Process
- What can be improved?
Types of scientists
Given this prelude, I am going to help us understand and categorize the existing talent pool in the market into different categories of skill sets and knowledge based on three dimensions – Context, Coding and Concept.
Context:
The context in which the problem is set.
- Where does the problem fit in the overall scheme of things?
- Who are the right stakeholders?
- What is success and is the project worth all the effort?
- Given the knowledge, what would be the right algorithm to use?
- Understanding data and shaping the technical problem (For example: which is the dependant variable and are there any post processing steps after the initial modelling?)
Coding:
Simply put, R/Python or any other open source data science tool with which the person can analyze, create features and build a model. Work with the implementation team to get the codes to a production environment.
Concept:
The depth of understanding the technical solution. Ability to understand the algorithm in detail. Some knowledge of literature in this area. Ability to do a lit survey and differentiate or adopt the solution to the given problem.
The size of the bubble in the above chart purely measures knowledge and depth of understanding the algorithms.
Strategy of the talent pool
Given that we now understand how data science talents are, how do you, as a start up or a mid-level company or an enterprise, match the right pool to the available job? Whom do you choose and what weights do you give to each of the 3Cs at what stage of your company? Let’s examine this in a bit more detail.
Start-ups
If you are starting up or building a data science pool in your organization, chances are that the problem is not well defined and is still very blurry. The need of the hour could be breadth rather than depth. Maybe the balance could be more of geeky business analysts, data scientists and data engineers than the algo Specialists. Depending on the nature of the problem it could be a mix 30 – 40% Geeky Business Analysts and the rest divided between data scientists and engineers
Enterprises
Here I would assume that the problem is well defined. There may be existing data science solutions based either on machine learning or some other technique. The need of the hour may be to upgrade the solutions and get more of the solutions into deployment mode. I would recommend this – 40% of data scientists, 20% of data engineers, 20% Algorithm Specialists and another 20% of Geeky Business analysts.
For R&D
For organizations that wants to have a research division, the mix could shift towards algorithm specialists. They can afford to have fewer Data Scientists and Business Analysts. The idea here is that the organization aims to contribute more to research journals and wants to mark its space in certain areas or specializations.
But sometimes during this search for talent, we also come across what I like to call “Super Scientists”. Finding a super data scientist is 10 times tougher than a full stack developer. This is why there is no industry tag to them. There is also a fundamental mistake of evaluating data scientists only in terms of knowledge of ML or Python (or any other tool). This yardstick only effectively measures the efficiency of modeling process to model delivery and leaves the other parts to mere chance. Salary is also not a yardstick while finding these super scientists as very few companies realize their potential and hence would have given a premium to them.
How to find these super scientists
Before we see how to find them, let’s take a look at what a super scientist is capable of doing.
- Independently understand the business problem – for example, they don’t think they are not working on recommended system but rather working to solve a consumer engagement problem
- Go beyond the problem specified to them to where the issue is – sometimes you are building a model, because the business or a stakeholder or your manager thought so as opposed to whether the problem really needs it
- Understand Data – the unfortunate mundane task of writing multiple select * statements and just living with the data for a couple of weeks
- Identify the algorithm or come up with one. With time in hand little bit of literature review to know how the world is handling a similar problem
- Start the core modeling work of working with data and algorithms. Trying to be open about what fits the as opposed to what fits your resume
- Final execution code that can be delivered to deployment team
- Reporting and making sure things are running right if not the drive to make the model or the algorithm better
- Presenting the results to whomsoever the point needs to be proven
- Final rollout and monitoring
- Optimization of the results and ensuring progress
As you can see, all 10 steps are important.
The Current Process
Currently, most hiring organizations evaluate data scientists only on point 4-5 in the form of an interview discussion. There too the focus ends up being too much on the knowledge and too little on the application itself. How do get your code into production? Can you streamline your pipeline to work with the existing hardware (and even software) that exists in the organization? These are critical questions I feel are not asked enough in interviews.
More or less the rest of the 8 steps are left to chance. Its important we start innovating on how we test the usefulness of a person to a job than how much the person knows
What can be improved?
Case Studies can be a key instrument in testing all 9 points. Case studies can be presented as real data science problems that would show up the job. For example, instead of interviewing on collaborative filtering, one can give a statement that we want to show or send right items to the right set of users. Then we can evaluate how the candidate arrives at a solution and how does the person think of success metrics, KPIs, etc. Create a scenario where the interviewer plays the role of a business or problem owner and see how the candidate reacts to constraints – be it data or implementations. Then deep dive into programming and algorithms.
End Notes
This is my humble attempt on building a data science team and how to recruit evasive super scientists . Now, time to find the needle in the haystack!
If you have ever been in a hiring role, what has your experience been like? On the other hand, folks looking for a data scientist role – what are some of the challenges you have faced in your journey? Use the comments section below to let me know!
About the Author
Mathangi is currently building a Data Science team at PhonePe. She has 13+ years of proven track record in building world-class data sciences solutions and products. She has extensively worked on building chatbots and productizing text mining insights. She has 6 Patent grants and 20+ patents pending in the area of inuitive customer service,indoor positioning and user profiles. She is adept across machine learning , text mining NLP technologies & tools.