Feature Engineering Techniques to follow in Machine Learning

20 July 2024


This article was published as a part of the Data Science Blogathon

Feature Engineering Techniques — Photo by Firmbee.com on Unsplash

What is a feature, and why do we need to engineer it? In general, all machine learning algorithms use some form of input data to generate outputs. This input data consists of features, which are usually in the form of structured columns. Algorithms require features with specific characteristics to work properly. This is where the need for feature engineering arises.

I believe that feature engineering efforts are primarily motivated by two objectives:

Preparing input data that is compatible with the machine learning algorithm’s requirements.
Improving the performance of machine learning models.

The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering.
— Luca Massaron

According to a Forbes survey, data scientists spend 80% of their time preparing data.

This metric demonstrates the significance of feature engineering in data science. As a result, I decided to write this article, which summarises the main techniques of feature engineering and provides brief descriptions of each.

I have also included some simple Python scripts for each technique. To use them, you must first import the pandas and NumPy libraries.

Some of the techniques listed below may work better with specific algorithms or datasets, while others may be helpful in all cases. This post does not intend to delve too deeply into this topic. A separate post could be written for each of the methods listed below, so I have tried to keep the explanations brief and informative.

Practising different techniques on different datasets and observing their effect on model performance is the best way to gain expertise in feature engineering.

Let’s dive into techniques:

Imputation

Missing values are one of the most common issues that arise when attempting to prepare data for machine learning. Human errors, interruptions in the data flow, privacy concerns, and other factors could be the reason for missing values. Missing values, for whatever reason, have an impact on the performance of machine learning models.

Some machine learning platforms automatically drop rows with missing values during the model training phase, which reduces model performance due to the reduced training size. On the other hand, most algorithms reject datasets with missing values and return an error.

The most straightforward way to deal with missing values is to drop the rows or the entire column. There is no optimal threshold for dropping, but you can take 80% as an example and drop the rows and columns whose missing value rate is higher than that proportion.

threshold_value = 0.8

#Dropping columns with missing value rate higher than threshold
data = data[data.columns[data.isnull().mean() < threshold_value]]

#Dropping rows with missing value rate higher than threshold
data = data.loc[data.isnull().mean(axis=1) < threshold_value]

Numerical Imputation

Imputation is preferable to dropping because it preserves the size of the data. However, what you impute in place of the missing values is an important choice. I recommend starting by considering a sensible default value for missing values in the column. For example, if you have a column with only 1 and NA, the NA rows likely correspond to 0. As another example, if you have a column that shows the “customer visit count in the last month,” the missing values can be replaced with 0, as long as you think that is a reasonable assumption.

Another cause of missing numbers is combining tables of various sizes, and in this situation, replacing 0 may be appropriate as well.

Apart from cases where a default value makes sense, I believe the best imputation method is to use the column medians. Column averages are sensitive to outlier values, whereas medians are more robust in this regard.

#Filling all missing values with 0
data = data.fillna(0)
#Alternatively, filling missing values with the medians of the numeric columns
data = data.fillna(data.median(numeric_only=True))
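
To illustrate why the median is the more stable choice, here is a minimal sketch on a made-up table. The column names and values are assumptions for illustration only; the point is that a single outlier pulls the mean far away from the typical value, while the median barely moves.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'income' has one missing value and one outlier
df = pd.DataFrame({
    "income": [30.0, 32.0, np.nan, 35.0, 1000.0],
    "city": ["A", "B", "A", None, "B"],
})

# Medians are computed for numeric columns only
medians = df.median(numeric_only=True)

# fillna with a Series fills each column using the matching label,
# so the non-numeric 'city' column is left untouched
df_filled = df.fillna(medians)

print("median:", medians["income"])    # stays close to the typical values
print("mean:  ", df["income"].mean())  # pulled upwards by the outlier
```

Because only the numeric columns receive a median, the missing value in the text column still has to be handled separately.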

Categorical Imputation

For categorical columns, replacing missing values with the most frequent value (the mode) is a good option. However, if there is no dominant value and the values are roughly uniformly distributed, imputing a category like “unknown” is more sensible, because in that case imputing the mode amounts to little more than a random selection.

#Max fill function for categorical columns
data['column_name'] = data['column_name'].fillna(
    data['column_name'].value_counts().idxmax())
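
The two options for categorical columns can be compared side by side. The following is only a sketch on an invented column (the name `colour` and its values are assumptions): one new column is filled with the mode, the other with an explicit placeholder category.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
df = pd.DataFrame({"colour": ["red", "blue", "red", np.nan, "red", np.nan]})

# Option 1: fill with the most frequent label (mode ignores missing values)
most_frequent = df["colour"].mode().iloc[0]
df["colour_mode"] = df["colour"].fillna(most_frequent)

# Option 2: fill with an explicit placeholder category
df["colour_unknown"] = df["colour"].fillna("Unknown")

print(df)
```

Which option suits a given column depends on whether a dominant label really exists, as discussed above.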

Binning

Binning can be applied to both numerical and categorical data.

#Numerical Bin example
Value     Bin
0-30    -> Low
31-70   -> Mid
71-100  -> High
#Categorical Bin example
Value       Bin
Spain   -> Europe
Italy   -> Europe
Chile   -> South America

The main reason for binning is to make the model more robust and to prevent overfitting; however, it comes at a cost in terms of performance. Every time you bin something, you give up information and make your data more regularised. (For more information, see regularisation in machine learning.)

The key consideration in binning is the trade-off between performance and overfitting. In my opinion, binning may be redundant for numerical columns with some types of algorithms, except in some obvious cases of overfitting, because of its effect on model performance.

However, for categorical columns, labels with low frequencies are likely to harm the robustness of statistical models. Assigning a general category to these less frequent values therefore helps to keep the model robust. For example, if your data set contains 10,000 rows, it might be a good idea to group labels with a count of less than 100 into a new category such as “Other.”

#Numerical Binning Example
data['bin'] = pd.cut(data['value'], bins=[0,30,70,100], labels=["Low", "Mid", "High"]) 
   value   bin
0      2   Low
1     45   Mid
2      7   Low
3     85  High
4     28   Low
#Categorical Binning Example 
     Country
0      Spain
1      Chile
2  Australia
3      Italy
4     Brazil
conditions = [
    data['Country'].str.contains('Spain'),
    data['Country'].str.contains('Italy'),
    data['Country'].str.contains('Chile'),
    data['Country'].str.contains('Brazil')]
choices = ['Europe', 'Europe', 'South America', 'South America']
data['Continent'] = np.select(conditions, choices, default='Other')
     Country      Continent
0      Spain         Europe
1      Chile  South America
2  Australia          Other
3      Italy         Europe
4     Brazil  South America
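
The idea of grouping low-frequency labels, mentioned above for categorical columns, can be sketched as follows. The data, the threshold `min_count`, and the catch-all label "Other" are all assumptions chosen for a tiny example; on a real dataset the threshold would be much larger.

```python
import pandas as pd

# Hypothetical column in which two labels appear only once
s = pd.Series(["Spain"] * 5 + ["Italy"] * 4 + ["Chile", "Peru"], name="Country")

min_count = 2  # assumed threshold for this toy example
counts = s.value_counts()
rare_labels = counts[counts < min_count].index

# Keep frequent labels as they are, replace rare ones with a catch-all label
grouped = s.where(~s.isin(rare_labels), "Other")

print(grouped.value_counts())
```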

Outliers Handling

Before discussing how to handle outliers, I’d like to point out that visualising the data is the best way to detect outliers. All other statistical methodologies are prone to error, whereas visualising outliers allows for a more precise decision.

Statistical methodologies, as previously stated, are less precise, but they have an advantage in that they are fast. In this section, I will discuss two approaches to dealing with outliers. These will detect them through the use of standard deviation and percentiles.

Outlier Detection with Standard Deviation

If a value’s distance from the average is more than x * standard deviation, it is considered an outlier. So, what should x be?

There is no simple solution for x, but a value between 2 and 4 seems reasonable.

#Dropping the outlier rows with standard deviation
factor = 3
upper_lim = data['column'].mean() + data['column'].std() * factor
lower_lim = data['column'].mean() - data['column'].std() * factor
data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]

Alternatively, the z-score can be used in place of the formula above. The z-score (or standard score) standardises the distance between a value and the mean using the standard deviation.
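
As a rough sketch of the z-score variant, the snippet below computes the standard score of every value and keeps the rows whose absolute z-score is below the chosen factor. The data is invented for illustration, and the factor of 3 is just one value from the range suggested earlier.

```python
import pandas as pd

# Hypothetical data: twenty ordinary values and one extreme value
data = pd.DataFrame({"column": [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                                12, 10, 11, 13, 12, 11, 10, 12, 13, 11, 100]})

factor = 3

# z-score: distance from the mean expressed in standard deviations
z_scores = (data["column"] - data["column"].mean()) / data["column"].std()

# Keep only the rows whose absolute z-score is below the factor
filtered = data[z_scores.abs() < factor]

print(len(data), "->", len(filtered))
```

Note that with very small samples a single extreme value inflates the standard deviation so much that its own z-score may never exceed the factor, so this check is more meaningful on larger datasets.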

Outlier Detection with Percentiles

Using percentiles is another statistical method for detecting outliers. You can treat a certain percentage of the values at the top or the bottom as outliers. The key point is once again choosing the percentage, which depends on the distribution of your data.

A common mistake is to compute percentiles from the range of the data. In other words, if your data ranges from 0 to 100, your top 5% is not the values between 96 and 100. The top 5% means the values that lie above the 95th percentile of the data.

#Dropping the outlier rows with Percentiles
upper_lim = data['column'].quantile(.95)
lower_lim = data['column'].quantile(.05)
data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]
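
The difference between a range-based limit and a true percentile is easy to see on skewed data. Here is a small sketch with made-up values between 0 and 100; the variable names are my own and only serve the illustration.

```python
import pandas as pd

# Hypothetical skewed data in the range 0..100: most values are small
values = pd.Series([0, 1, 2, 2, 3, 3, 4, 4, 5, 5,
                    6, 6, 7, 8, 9, 10, 12, 15, 20, 100])

# Mistaken approach: 95% of the way through the data range
range_based_limit = values.min() + 0.95 * (values.max() - values.min())

# Percentile approach: the value below which 95% of the observations fall
percentile_limit = values.quantile(0.95)

print("range-based limit:", range_based_limit)
print("95th percentile:  ", percentile_limit)
```

On this data the range-based limit sits far above the 95th percentile, so only the percentile reflects where the bulk of the observations actually lies.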

Log Transform

The logarithm transformation (or log transform) is one of the most commonly used mathematical transformations in feature engineering. The advantages of the log transform are:

It helps to handle skewed data; after the transformation, the distribution becomes closer to normal.
The order of magnitude of the data may vary within its range. For example, the difference between ages 20 and 25 is not equivalent to the difference between ages 60 and 65. In terms of years they are identical, but a 5-year difference at a young age represents a larger relative difference. The log transform normalises magnitude differences like this.
It decreases the effect of outliers, because it compresses large values, and so makes the model more robust.

Note: The data you apply the log transform to must contain only positive values; otherwise you will get an error or undefined (NaN) results. You can also add 1 to your data before transforming it, which ensures that the output of the transformation is non-negative for non-negative inputs.

Log(X+1)

#Log Transform Example
data = pd.DataFrame({'value':[2, 45, -23, 85, 28, 2, 35, -12]})
data['log(x+1)'] = (data['value'] + 1).transform(np.log)
#Negative Values Handling
#Note that the values are different
data['log(x-min(x)+1)'] = (data['value'] - data['value'].min() + 1).transform(np.log)
    value  log(x+1)     log(x-min(x)+1)
0      2   1.09861          3.25810
1     45   3.82864          4.23411
2    -23       nan          0.00000
3     85   4.45435          4.69135
4     28   3.36730          3.95124
5      2   1.09861          3.25810
6     35   3.58352          4.07754
7    -12       nan          2.48491
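
NumPy also provides `np.log1p`, which computes log(1 + x) directly. The sketch below, on invented non-negative values, checks that it matches the explicit formula and compares the skewness before and after the transform.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, non-negative data
data = pd.DataFrame({"value": [0, 1, 9, 99, 999]})

# np.log1p(x) computes log(1 + x)
data["log1p"] = np.log1p(data["value"])
data["log_plus_1"] = np.log(data["value"] + 1)

# Sample skewness before and after the transform
skew_before = data["value"].skew()
skew_after = data["log1p"].skew()

print(data)
print("skewness before:", skew_before, "after:", skew_after)
```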

One-Hot Encoding

It is one of the most common encoding methods in machine learning. This method spreads the values in a column across multiple flag columns and assigns 0 or 1 to them. These binary values express the relationship between the original and the encoded columns.

Categorical data is difficult for algorithms to work with. This encoding converts it to a numerical format and lets you group categorical data without losing any information.

If you have N distinct values in the column, it is enough to map them to N-1 binary columns, because the missing value can be deduced from the other columns. If all the columns at hand are 0, the missing value must be 1. The name one-hot encoding comes from the fact that exactly one of the N flags is 1 (“hot”) for each row.

Here’s an example using the get_dummies function of pandas, which maps all the values in a column to multiple columns.

encoded = pd.get_dummies(data['column'])
data = data.join(encoded).drop('column', axis=1)
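
The example above creates N columns. If you prefer the N-1 representation described earlier, `get_dummies` accepts a `drop_first` argument. The following is a sketch on an invented `city` column; the `prefix` and `dtype` arguments are optional and used here only to make the output easier to read.

```python
import pandas as pd

# Hypothetical column with three distinct values
data = pd.DataFrame({"city": ["Roma", "Madrid", "Madrid", "Istanbul", "Roma"]})

# drop_first=True keeps N-1 flag columns; the dropped category
# (the first one in sorted order) is implied when all flags are 0
encoded = pd.get_dummies(data["city"], prefix="city", drop_first=True, dtype=int)
data = data.join(encoded).drop("city", axis=1)

print(data)
```

Here the row that originally held "Istanbul" ends up with 0 in both remaining columns.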

Splitting Feature

Splitting features is a good way to make them useful for machine learning. Datasets very often contain string columns that violate tidy data principles. By extracting the useful parts of a column into new features, we:

Enable machine learning algorithms to understand them.
Make it possible to bin and group them.
Improve model performance by uncovering potential information.

Splitting features is a smart choice, but there is no one-size-fits-all solution. How to split the column is determined by the column’s attributes. Let’s start with a couple of examples. For starters, here’s a simple split method for a regular name column:

data.name
0  Luther N. Gonzalez
1    Charles M. Young
2        Terry Lawson
3        Taylor White
4      Thomas Logsdon
#Extracting first names
data.name.str.split(" ").map(lambda x: x[0])
0     Luther
1    Charles
2      Terry
3     Taylor
4     Thomas
#Extracting last names
data.name.str.split(" ").map(lambda x: x[-1])
0    Gonzalez
1       Young
2      Lawson
3       White
4     Logsdon

Because the example above takes only the first and last elements, it also handles names with more than two words, which makes the function robust to corner cases that should be kept in mind when manipulating strings like this.

Another use of the split function is to extract a string segment between two characters. The following example does this by using two split functions in a row.

#String extraction example
data.title.head()
0                      Toy Story (1995)
1                        Jumanji (1995)
2               Grumpier Old Men (1995)
3              Waiting to Exhale (1995)
4    Father of the Bride Part II (1995)
data.title.str.split("(", n=1, expand=True)[1].str.split(")", n=1, expand=True)[0]
0    1995
1    1995
2    1995
3    1995
4    1995
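
As an alternative to chaining two splits, a regular expression can pull out the same segment in one step. This is a different technique from the split approach above, shown here only as a sketch; the pattern assumes that the year is a four-digit number in parentheses at the end of the title.

```python
import pandas as pd

data = pd.DataFrame({"title": ["Toy Story (1995)", "Jumanji (1995)",
                               "Father of the Bride Part II (1995)"]})

# Capture four digits enclosed in parentheses at the end of the string
data["year"] = data["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# The extracted text can then be converted to a numerical feature
data["year"] = data["year"].astype(int)

print(data)
```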

Grouping

Every instance is represented by a row, and every column holds a different feature of the instance. This kind of data is called tidy.

When an instance is spread over multiple rows, we group the data by instance so that every instance is represented by only one row.


Photo by @thiszun from Pexels

The key point of group by operations is deciding the aggregation functions of the features. For numerical features, average and sum functions are usually convenient options, whereas it is more complicated for categorical features.

I suggest two ways of aggregating categorical columns:

The first option is to select the label with the highest frequency. In other words, this is the max operation for categorical columns, but ordinary max functions generally do not return this value, so you need to use a lambda function for this purpose.

data.groupby('id').agg(lambda x: x.value_counts().index[0])

The second option is to apply a group by function after one-hot encoding. This method preserves all the data and, at the same time, transforms the encoded column from categorical to numerical.
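
Grouping after one-hot encoding might look like the sketch below. The table, its column names, and the choice of mean for the numerical column and sum for the flag columns are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical transactions: several rows per customer id
data = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "city": ["Roma", "Madrid", "Roma", "Istanbul", "Istanbul"],
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# One-hot encode the categorical column first
encoded = pd.get_dummies(data, columns=["city"], dtype=int)

# Then aggregate per id: mean for the numerical column,
# sum for the flags (how many times each city occurs per id)
grouped = encoded.groupby("id").agg(
    {"amount": "mean", "city_Istanbul": "sum", "city_Madrid": "sum", "city_Roma": "sum"}
)

print(grouped)
```

Each id now occupies exactly one row, and the city information survives as counts instead of being reduced to a single label.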

Scaling

In most cases, the numerical features of a dataset do not have a fixed range, and they differ from one another. In real life, it makes no sense to expect the age and income columns to have the same range. But how can these two columns be compared from the standpoint of machine learning?

This issue is solved by scaling. After a scaling operation, the continuous features become similar in terms of range. Although this step is not a must for many algorithms, it’s still a good idea to do so. Distance-based algorithms like k-NN and k-Means, on the other hand, require scaled continuous features as model input.

Normalization

Normalisation (or min-max normalisation) scales all values into a fixed range between 0 and 1. This transformation does not change the shape of the feature’s distribution, but because the standard deviation shrinks, the effects of outliers increase. As a result, it’s a good idea to deal with outliers before normalisation.

data = pd.DataFrame({'feature':[2, 45, -23, 85, 28, 2, 35, -12]})
data['normalized'] = (data['feature'] - data['feature'].min()) / (data['feature'].max() - data['feature'].min()) 
   feature  normalized
0      2        0.23
1     45        0.63
2    -23        0.00
3     85        1.00
4     28        0.47
5      2        0.23
6     35        0.54
7    -12        0.10

Standardization

Standardization (also known as z-score normalisation) scales the values while taking the standard deviation into account. If the standard deviations of the features differ, their ranges will also differ from each other. This reduces the effect of outliers in the features.

data = pd.DataFrame({'feature':[2,45, -23, 85, 28, 2, 35, -12]})
data['standardized'] = (data['feature'] - data['feature'].mean()) / data['feature'].std() 
  feature  standardized
0      2         -0.52
1     45          0.70
2    -23         -1.23
3     85          1.84
4     28          0.22
5      2         -0.52
6     35          0.42
7    -12         -0.92
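
Both scaling methods can be applied to several columns at once, because pandas aligns the column-wise statistics automatically. Here is a sketch on an invented table with age and income columns, the two features mentioned at the start of this section.

```python
import pandas as pd

# Hypothetical features on very different scales
data = pd.DataFrame({
    "age": [22, 35, 47, 58, 63],
    "income": [18000, 42000, 55000, 61000, 250000],
})

# Min-max normalisation: every column ends up in the range [0, 1]
normalized = (data - data.min()) / (data.max() - data.min())

# Standardisation: every column gets mean 0 and sample standard deviation 1
standardized = (data - data.mean()) / data.std()

print(normalized.round(2))
print(standardized.round(2))
```

After either transformation the two columns are on comparable scales, which is what distance-based algorithms need.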

Date Extraction

Even though date columns usually provide valuable information about the model target, they are either neglected as an input or used nonsensically by machine learning algorithms. This may be because dates come in a variety of formats, which makes them difficult for algorithms to interpret, even when they are simplified to a format like “01–01–2020.”

If you leave the date columns unprocessed, it is very difficult for a machine learning algorithm to build an ordinal relationship between the values. Here are three forms of date preprocessing that I recommend:

Extracting the parts of the date into separate columns: year, month, day, and so forth.
Extracting the time period between the current date and the date column in years, months, days, and other units.
Extracting specific features from the date, such as the name of the weekday, whether it is a weekend, whether it is a holiday, and so on.

When you convert the date column into the extracted columns, as shown above, the information contained inside them is revealed, and machine learning algorithms can readily comprehend it.

from datetime import date

data = pd.DataFrame({'date':['01-01-2017','04-12-2008','23-06-2010','25-08-2005','20-02-2020']})
#Transform string to date
data['date'] = pd.to_datetime(data.date, format="%d-%m-%Y")
#Extracting Year
data['year'] = data['date'].dt.year
#Extracting Month
data['month'] = data['date'].dt.month
#Extracting passed years since the date
data['passed_years'] = date.today().year - data['date'].dt.year
#Extracting passed months since the date
data['passed_months'] = (date.today().year - data['date'].dt.year) * 12 + date.today().month - data['date'].dt.month
#Extracting the weekday name of the date
data['day_name'] = data['date'].dt.day_name()

        date  year  month  passed_years  passed_months   day_name
0 2017-01-01  2017      1             4             54     Sunday
1 2008-12-04  2008     12            13            151   Thursday
2 2010-06-23  2010      6            11            133  Wednesday
3 2005-08-25  2005      8            16            191   Thursday
4 2020-02-20  2020      2             1             17   Thursday
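
The weekend flag mentioned in the list above can be derived from the day of the week. A minimal sketch, assuming Saturday and Sunday count as the weekend (the column names are my own choice):

```python
import pandas as pd

data = pd.DataFrame({"date": ["01-01-2017", "04-12-2008", "23-06-2010"]})
data["date"] = pd.to_datetime(data["date"], format="%d-%m-%Y")

# dayofweek numbers the days from Monday=0 to Sunday=6
data["day_of_week"] = data["date"].dt.dayofweek

# Saturday (5) and Sunday (6) are treated as the weekend here
data["is_weekend"] = data["day_of_week"] >= 5

print(data)
```

A holiday flag would need an external calendar for the relevant country, so it is left out of this sketch.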

Conclusion

These techniques are not magic. Try them out and extract the key information from your features to improve the performance of your model.

I hope you’ve found this article useful and that it helps you in your feature engineering process.

Frequently Asked Questions

Q1. What is “binning” in feature engineering?

Binning in feature engineering is like sorting data into groups to make it easier for computers to understand.

Q2. What is feature engineering in image processing?

In image processing, feature engineering is about helping computers recognize important things in pictures, like edges, shapes, and colors. It’s like teaching computers to understand what’s in the images.

Q3. Are there tools for feature engineering?

Yes, there are tools like Featuretools and TPOT that make feature engineering faster and easier.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
