In this article, we will learn how to convert a categorical variable into a numeric one using pandas.
When we encounter categorical data, the first question is how to handle it, because most machine learning algorithms work on numeric values and cannot use raw text labels directly. So, to build predictive models, we have to convert categorical data into numeric form.
Method 1: Using replace() method
Replacing is one way to convert categorical values into numbers. For example, we will take a dataset of people's salaries based on their level of education. Education level is an ordinal categorical variable, meaning its categories have a natural order, so each level can be mapped to an integer.
Syntax:
replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
Consider the given data:
Python3
# import pandas
import pandas as pd

# read the CSV file
df = pd.read_csv('data.csv')

# replace each education level with an integer code and assign the
# result back to the column
df['Education'] = df['Education'].replace(['Under-Graduate', 'Diploma'], [0, 1])
Output:
In the above program, we have replaced "Under-Graduate" with 0 and "Diploma" with 1.
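The same encoding can be tried without a CSV file. The following is a minimal sketch, not the article's exact program: the rows and the `Education_code` column name are made up for illustration, and `Series.map()` with a dictionary is swapped in for `replace()` so that each label sits next to its code.

```python
import pandas as pd

# Hypothetical stand-in for data.csv; the rows are invented for illustration
df = pd.DataFrame({
    "Education": ["Under-Graduate", "Diploma", "Under-Graduate", "Diploma"],
    "Salary": [30000, 25000, 32000, 27000],
})

# Label -> integer code, using the same codes as the article
codes = {"Under-Graduate": 0, "Diploma": 1}

# map() looks up every value in the dict and returns a new Series;
# the original text column is kept so the result can be compared
df["Education_code"] = df["Education"].map(codes)

print(df)
```

One difference worth knowing: with `map()`, a label that is missing from the dictionary becomes NaN, while `replace()` leaves it unchanged. The NaN makes a forgotten category easy to spot.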
Method 2: Using get_dummies()
Replacing values by hand does not scale well when a column has many categories, and the integer codes imply an order that nominal categories do not have. pandas provides a function called get_dummies() that returns dummy variable (one-hot) columns instead.
Syntax: pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
Stepwise Implementation
Step 1: Create dummy columns
We call get_dummies() and pass it the column to encode. It returns one dummy column per distinct category, marking the rows where that category occurs. In this case, the Education column has three categories, so three columns are returned.
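To see what Step 1 returns, here is a small sketch on in-memory data. The rows are invented, and "Graduate" is an assumed name for the third education level, which the article does not spell out.

```python
import pandas as pd

# Hypothetical stand-in for salary.csv; "Graduate" is an assumed third level
df = pd.DataFrame({
    "Education": ["Under-Graduate", "Diploma", "Graduate", "Diploma"],
    "Salary": [30000, 25000, 40000, 27000],
})

# One indicator column per distinct category, in sorted label order
dummies = pd.get_dummies(df["Education"])

print(dummies)
```

Each row has exactly one marked column, the one matching that row's original label. Depending on the pandas version, the indicator values are shown as True/False or as 1/0.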
Step 2: Concatenate
Syntax: pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
The next step is to concatenate the dummy columns onto the original DataFrame. pandas provides the concat() function for this: pass it a list of the DataFrames to join and the axis to join along (axis='columns' places them side by side). It returns the merged DataFrame.
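A short sketch of the concatenation step, again on invented rows with "Graduate" as an assumed third level:

```python
import pandas as pd

# Hypothetical stand-in for salary.csv; "Graduate" is an assumed third level
df = pd.DataFrame({
    "Education": ["Under-Graduate", "Diploma", "Graduate", "Diploma"],
    "Salary": [30000, 25000, 40000, 27000],
})
dummies = pd.get_dummies(df["Education"])

# axis="columns" joins side by side; rows are matched on the index,
# which both frames share because dummies was derived from df
merged = pd.concat([df, dummies], axis="columns")

print(merged.columns.tolist())
```

The merged frame keeps the two original columns and gains the three dummy columns on the right.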
Step 3: Drop columns
We drop the original 'Education' column, because the dummy columns now carry the same information and the model cannot use the text column. We also drop one of the dummy columns to avoid the dummy variable trap: the dummy columns always sum to 1 in each row, so any one of them can be predicted exactly from the others, and this perfect multicollinearity can destabilize linear models. After dropping these columns, we have the desired DataFrame.
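The drop step can be sketched as follows, on invented rows with "Graduate" as an assumed third level. The second half shows a shorter route that uses the `columns` and `drop_first` parameters listed in the get_dummies() syntax. It is offered as an alternative, not as the article's method, and it drops the first category in sorted order rather than 'Under-Graduate'.

```python
import pandas as pd

# Hypothetical stand-in for salary.csv; "Graduate" is an assumed third level
df = pd.DataFrame({
    "Education": ["Under-Graduate", "Diploma", "Graduate", "Diploma"],
    "Salary": [30000, 25000, 40000, 27000],
})

# Manual route: encode, concatenate, then drop the text column and one dummy.
# drop() returns a new DataFrame, so the result is stored in a new variable.
dummies = pd.get_dummies(df["Education"])
merged = pd.concat([df, dummies], axis="columns")
final = merged.drop(["Education", "Under-Graduate"], axis="columns")

# Shorter route: get_dummies() encodes the named column in place of the
# original and, with drop_first=True, omits the first category itself
compact = pd.get_dummies(df, columns=["Education"], drop_first=True)

print(final.columns.tolist())
print(compact.columns.tolist())
```

In both results, a row with no marked dummy column belongs to the dropped category, so no information is lost.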
The following program puts the three steps together:
Python3
# import pandas
import pandas as pd

# read the CSV file
df = pd.read_csv('salary.csv')

# get the dummy columns and store them in a variable
dummies = pd.get_dummies(df['Education'])

# concatenate the dummy columns onto the original DataFrame
merged = pd.concat([df, dummies], axis='columns')

# drop the text column and one dummy column; drop() returns a new
# DataFrame, so the result must be assigned back
merged = merged.drop(['Education', 'Under-Graduate'], axis='columns')

# print the DataFrame
print(merged)
Output: