This article was published as a part of the Data Science Blogathon.
Introduction
Data science job interviews call for a particular set of skills. The candidates who land jobs are often not the ones with the strongest technical abilities but those who can pair such abilities with interview acumen.
Although data science is a broad field, a few specific questions come up often in interviews. I have compiled a list of seven commonly asked data science interview questions and their answers.
Data Science Interview Questions
Question 1: How does XGBoost handle the bias-variance tradeoff?
Answer: XGBoost is an implementation of gradient boosting, so it manages bias and variance much as any other boosting method does. Boosting is an ensemble meta-algorithm that trains weak models one after another and combines them in a weighted sum. Each new model concentrates on the errors left by the models before it, so the error (and hence the bias) decreases as the iterations proceed. Variance is kept in check because the final prediction aggregates many weak models rather than relying on a single one. XGBoost also provides regularization, shrinkage (a learning rate), and row and column subsampling (an idea borrowed from bagging) to limit overfitting.
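To make the bias-reduction step concrete, here is a minimal sketch of boosting with squared error, written in plain Python. It is not XGBoost itself: the helper names (`fit_stump`, `boost`), the synthetic data, and the hyperparameter values are assumptions chosen for illustration, and the real library adds regularization and many optimizations on top of this idea.

```python
import random

def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals of a 1-D feature."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)           # start from a constant prediction
    preds = [base] * len(ys)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, mse_history

# Synthetic data (an assumption for this sketch): a noisy quadratic.
random.seed(0)
xs = [i / 10 for i in range(100)]
ys = [x ** 2 + random.gauss(0, 1) for x in xs]

_, _, mse_history = boost(xs, ys)
print(f"training MSE after 1 round:   {mse_history[0]:.2f}")
print(f"training MSE after 50 rounds: {mse_history[-1]:.2f}")
```

The training error falls as rounds are added, which is the bias reduction described above. The small learning rate is one of the levers that trades a slower fit for lower variance.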
Question 2: You must build a predictive model using multiple regression. Describe how you would validate this model.
Answer: There are two primary methods for doing this:
A) Adjusted R-squared: R-squared is a statistic that indicates how much of the variance in the dependent variable can be accounted for by the independent variables. In essence, R-squared shows the scatter around the line of best fit, while coefficients estimate trends.
A model with many independent variables may seem to fit the data better even when it doesn’t, because each extra independent variable raises (or at least never lowers) the R-squared value of the model. This is where adjusted R-squared enters the picture. It penalizes each extra independent variable and rises only if the new variable improves the model by more than would be expected by chance. Given that we are building a multiple regression model, this is important.
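The penalty for extra variables can be seen in a short sketch. It uses the standard formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables. The function names and the example numbers are illustrative assumptions.

```python
def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalise R-squared for the number of independent variables."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Same R-squared, same sample size, increasing number of predictors.
for p in (2, 5, 10):
    print(p, round(adjusted_r_squared(0.80, n_samples=30, n_predictors=p), 3))
```

With R² held at 0.80 and 30 observations, the adjusted value drops as predictors are added, so a new variable has to earn its place.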
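B) Cross-Validation: Cross-validation estimates how well the model generalizes by evaluating it on data it was not trained on. A common approach divides the data into training, validation, and test sets. In k-fold cross-validation, the data are split into k folds and each fold serves as the validation set once.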
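One widely used variant is k-fold cross-validation. The sketch below only generates the index splits; fitting and scoring the regression on each split is left out. The function name `k_fold_indices` and its parameters are assumptions made for this example.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)         # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]    # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

for train_idx, val_idx in k_fold_indices(10, k=5):
    print("validate on", sorted(val_idx), "train on", len(train_idx), "rows")
```

Every observation is used for validation exactly once, and the k validation scores are usually averaged into a single estimate.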
Question 3: What distinguishes batch learning from online learning?
Answer: When a model learns from a whole group of observations at once, the process is called batch learning or offline learning. Most people are familiar with this kind of learning, where you gather a dataset and create a model using the entire dataset in one go.
Online learning, on the other hand, ingests data one observation at a time and updates the model after each one. Online learning is data-efficient because, in principle, an observation does not need to be retained once the model has learned from it.
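A minimal sketch of online learning follows: a one-feature linear model updated by stochastic gradient descent, one observation at a time. The class name, the simulated data stream (y ≈ 3x + 1 plus noise), and the learning rate are all assumptions made for illustration.

```python
import random

class OnlineLinearRegressor:
    """Linear model y = w*x + b, updated one observation at a time."""

    def __init__(self, learning_rate=0.05):
        self.w = 0.0
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def learn_one(self, x, y):
        # One gradient step on the squared error of this single observation.
        error = self.predict(x) - y
        self.w -= self.lr * error * x
        self.b -= self.lr * error

def stream(n, seed=0):
    """Simulated data stream; each observation is generated, used, and discarded."""
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.uniform(0, 1)
        yield x, 3.0 * x + 1.0 + rng.gauss(0, 0.1)

model = OnlineLinearRegressor()
for x, y in stream(5000):
    model.learn_one(x, y)   # nothing is stored after this call

print(f"w = {model.w:.2f}, b = {model.b:.2f}")
```

A batch learner would instead collect all 5,000 observations first and fit the model on the full dataset in one pass.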
Question 4: Suggest some strategies for handling null values.
Answer: There are several methods for dealing with null values, including the ones listed below:
– You can completely omit rows containing null values.
– You can replace null values with a measure of central tendency (mean, median, or mode) or with a new category (like “None”).
– You can predict the null values from other variables. For example, if a row has a height value but no weight value, you can replace the missing weight with the average weight for that height.
– Finally, if you use a machine learning model that handles null values automatically, you can leave the null values in place.
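The first two strategies can be sketched in a few lines of plain Python. The small table below is hypothetical, with `None` standing in for a null value.

```python
import statistics

# Hypothetical records; None marks a missing value.
rows = [
    {"height": 170, "weight": 65.0, "city": "Pune"},
    {"height": 182, "weight": None, "city": "Delhi"},
    {"height": 165, "weight": 58.0, "city": None},
    {"height": 175, "weight": 72.0, "city": "Pune"},
]

# Strategy 1: omit rows that contain any null value.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

# Strategy 2: replace nulls with a central tendency (numeric column)
# or with a new category (categorical column).
known_weights = [r["weight"] for r in rows if r["weight"] is not None]
median_weight = statistics.median(known_weights)
imputed_rows = [
    {
        **r,
        "weight": median_weight if r["weight"] is None else r["weight"],
        "city": "None" if r["city"] is None else r["city"],
    }
    for r in rows
]

print(len(complete_rows), "complete rows")
print(imputed_rows[1]["weight"], imputed_rows[2]["city"])
```

Dropping rows is the simplest option but discards half of this tiny table, which is why imputation is often preferred when data are scarce.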
Question 5: Is it appropriate to impute mean values for missing data? Why or why not?
Answer: Mean imputation is the practice of replacing null values with the mean of the observed values.
Because it ignores relationships between features, mean imputation is often not a good idea. Consider a table that lists age and fitness score, where the fitness score for an 80-year-old is missing. If the average fitness score across ages from 15 to 80 is used, the 80-year-old will appear to have a considerably higher fitness score than is plausible.
Second, mean imputation adds bias to the data and artificially reduces its variance. The understated variance results in a less accurate model and in confidence intervals that are narrower than they should be.
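Both problems show up in a small sketch of the age and fitness example. The numbers are invented for illustration.

```python
import statistics

# Hypothetical data: fitness score falls with age; the 80-year-old's score is missing.
ages = [15, 25, 35, 45, 55, 65, 80]
fitness = [90, 85, 80, 72, 65, 55, None]

observed = [f for f in fitness if f is not None]
mean_fitness = statistics.mean(observed)
imputed = [mean_fitness if f is None else f for f in fitness]

# Problem 1: the imputed score ignores age entirely.
print("imputed score for the 80-year-old:", imputed[-1])
print("observed score for the 65-year-old:", fitness[-2])

# Problem 2: adding a value that sits exactly at the mean shrinks the variance.
print("variance before imputation:", round(statistics.variance(observed), 1))
print("variance after imputation: ", round(statistics.variance(imputed), 1))
```

The 80-year-old ends up with a higher score than the 65-year-old, and the sample variance drops even though no new information was added.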
Question 6: How do you detect outliers?
Answer: There are several methods for locating outliers, including:
Z-score/standard deviations: For normally distributed data, about 99.7% of the values fall within three standard deviations of the mean. We can therefore compute the standard deviation, multiply it by three, and flag the data points that fall outside this range around the mean. Equivalently, a point whose z-score has an absolute value of 3 or more is treated as an outlier.
It should be noted that this method has a few limitations: it assumes the data are normally distributed, it does not work for very small data sets, and too many outliers can distort the mean and standard deviation and make the z-scores unreliable.
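A minimal z-score check might look like the sketch below. The function name, the sample data, and the choice of the population standard deviation (`statistics.pstdev`) are assumptions for this example.

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score is at least the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population standard deviation
    return [v for v in values if abs((v - mean) / std) >= threshold]

# Hypothetical sample: 19 ordinary readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 10, 12, 11, 95]
print(z_score_outliers(data))

# Small-sample limitation: with only five points, no z-score can reach 3,
# so even an extreme value goes undetected.
print(z_score_outliers([10, 11, 12, 11, 500]))
```

The second call returns an empty list because a single extreme value in n points can have a population z-score of at most the square root of n - 1, which is 2 for five points.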
Interquartile Range (IQR): The IQR, the idea behind boxplot construction, can also be used to spot outliers. It is the difference between the third and first quartiles (IQR = Q3 - Q1). A point is treated as an outlier if it is greater than Q3 + 1.5*IQR or less than Q1 - 1.5*IQR. For normally distributed data, these fences lie about 2.698 standard deviations from the mean.
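The IQR rule can be sketched with the standard library as follows. The function name and sample data are assumptions. Note that quartile conventions differ slightly between tools, so the fences may not match another library exactly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample: 19 ordinary readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 10, 12, 11, 95]
print(iqr_outliers(data))

# Where the 2.698 figure comes from: for a standard normal distribution,
# Q3 is about 0.6745 and IQR = 2 * Q3, so the upper fence is Q3 + 1.5 * IQR = 4 * Q3.
q3_normal = statistics.NormalDist().inv_cdf(0.75)
upper_fence_in_sd = q3_normal + 1.5 * (2 * q3_normal)
print(round(upper_fence_in_sd, 3))
```

Unlike the z-score, the quartiles are barely affected by the extreme value itself, which makes this rule more robust on skewed or contaminated data.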
Other methods include Isolation Forests, Robust Random Cut Forests, and DBSCAN clustering.
Question 7: Is it appropriate to impute mean values for missing data? Why or why not?
Answer: The process of substituting the mean of the dataset for any null values is called mean imputation.
Mean imputation is usually not a good idea because it doesn’t consider relationships between features. For example, let’s say we have a table that lists age and fitness score, and the fitness score for an 80-year-old is missing. If the average fitness score across ages from 15 to 80 is used, the 80-year-old will appear to have a much higher fitness score than is plausible.
Mean imputation also adds bias to the data and artificially reduces its variance. Consequently, the model is less accurate, and the confidence intervals are narrower than they should be.
Conclusion
In this article, we covered seven data science interview questions, and the following are the key takeaways:
- XGBoost is an implementation of gradient boosting, so it manages bias and variance like any other boosting strategy. Boosting is an ensemble meta-algorithm that combines a weighted sum of weak models, each correcting the errors of the previous ones, to decrease bias while keeping variance in check.
- Adjusted R-squared and cross-validation can be used to validate a predictive model built with multiple regression.
- When a model learns from a whole group of observations at once, this process is called batch learning or offline learning. Online learning, on the other hand, ingests data one observation at a time.
- Z-scores/standard deviations and the Interquartile Range (IQR) can be used to detect outliers.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.