This article was published as a part of the Data Science Blogathon.
Introduction
We all know that Data Cleaning and Preprocessing take up the most time in a Data Science project. While executing various pre-processing approaches, we may find several issues that can primarily be solved by a single library – Pandas. Pandas is a well-known Python module that handles everything from data pre-processing to data analysis. Pandas’ extensive function set enables users to complete jobs considerably faster than traditional Excel.
Source: Image by ThisIsEngineering from Pexels
This article will look at some of the simple yet powerful Pandas operations that may be useful and, more importantly, will come in handy when we need them. These bookmark-worthy operations we will discuss are mostly generic, and one can modify them depending on the use cases.
Operations in Pandas
1. To replace NaN values with Random Values from a List
To replace the NaN values in a Pandas DataFrame, generally, we use the .fillna() method on a DataFrame. To fill these NaN values randomly from a list of numbers (including both float and int), or strings, we can use the .loc() method on the DataFrame.
Let’s see an example below:
Importing the Libraries
import pandas as pd import numpy as np
Creating a Pandas DataFrame with dummy data
df = pd.DataFrame({ 'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'], 'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan], 'Height (in cm)': [167.7, 182.3, 178.7, 166.2], 'Salary': [12134343, 21312324, 421324554, 234434325] })
On Executing this code, we get the Pandas DataFrame, as attached in the above image. In this dataset, we purposefully added a few NaN values as we would need those to treat in this operation.
Setting the seed value:
Let’s set the seed value as we will be replacing NaN values with random data generated from the NumPy library, and we want the same set of results every time we run this code.
np.random.seed(124)
Replacing the NaN values
Now that we have our seed set, we will use the .loc() method on our DataFrame to perform our operation.
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())]
Executing this will assign a list of values to the NaN values in the column ‘Favourite Sport’. This list would have randomly picked values from a list of [‘Volleyball’, ‘Football’, ‘Basketball’, ‘Cricket’], and the number of the values to be picked will be equal to the number of NaNs in the selected column, here Favourite Sport.
Putting it all together
# Import the Libraries import pandas as pd import numpy as np # Creating the DataFrame df = pd.DataFrame({ 'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'], 'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan], 'Height (in cm)': [167.7, 182.3, 178.7, 166.2], 'Salary': [12134343, 21312324, 421324554, 234434325] }) np.random.seed(124) df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())] print(df)
On executing this code, our final dataset would look something like this:
As explained, for indexes 0 and 3, we see the NaNs are replaced with randomly picked values i.e., ‘Basketball’ and ‘Volleyball’ from the mentioned sports list.
2. To Map Values in a Categorical Column into Codes
Mapping values to numeric codes is a helpful method that might come in handy when we want numeric data in our DataFrame, but it needs to be unique and relate to something else. One use of this function may be the automatic assignment of Roll Numbers in a class from a list of student names.
Let’s begin by repeating the prerequisites from the preceding step: importing the libraries, generating a Pandas DataFrame, and executing the (1.) operation.
Performing the Prerequisites
# Importing the libraries import pandas as pd import numpy as np # Creating the DataFrame df = pd.DataFrame({ 'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'], 'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan], 'Height (in cm)': [167.7, 182.3, 178.7, 166.2], 'Salary': [12134343, 21312324, 421324554, 234434325] }) # Replacing the NaN values np.random.seed(124) df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())]
Creating a list of Codes (from the ‘Name’ column)
list(pd.Categorical(df['Name'], ordered = True).codes)
On executing this, we get the following:
Here, we used Pandas’ Categorical() method and passed the ‘Name’ column from our DataFrame. We also passed the value ‘True’ to the parameter ‘ordered,’ and thus, we get the list of numbers based on the alphabetically ordered ‘Name’ column. So when the Name ‘Alex’, the code assigned is ‘0’, while for the name ‘Jimmy,’ the code assigned will be ‘2’ as the name ‘Jimmy’ comes in the 3rd position among the four names in the ‘Name’ column, alphabetically. We passed this entire code into a list to get a list of values.
We can also pass this list of values into the DataFrame as a column.
Creating New Column from Codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes)
Executing this will create a new column named ‘Roll Number.’
Putting this all together
# Import the Libraries import pandas as pd import numpy as np # Creating the DataFrame df = pd.DataFrame({ 'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'], 'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan], 'Height (in cm)': [167.7, 182.3, 178.7, 166.2], 'Salary': [12134343, 21312324, 421324554, 234434325] }) # Replacing the NaN values np.random.seed(124) df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())] # Mapping 'Name' column into numeric codes df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes) print(df)
On executing this code, our DataFrame would look like this:
3. To format the Integers in a DataFrame
This process aids in improving the readability of the numbers for users. We frequently encounter numbers with numerous digits in a DataFrame, which causes confusion and misinterpretation.
We will format the values in the ‘Salary’ column in the following example.
Let’s begin by completing the requirements from the primary operations: importing libraries building the Pandas DataFrame, and the previous two operations.
Performing the prerequisites
# Import the Libraries import pandas as pd import numpy as np # Creating the DataFrame df = pd.DataFrame({ 'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'], 'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan], 'Height (in cm)': [167.7, 182.3, 178.7, 166.2], 'Salary': [12134343, 21312324, 421324554, 234434325] }) # Replacing the NaN values np.random.seed(124) df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())] # Mapping 'Name' column into numeric codes df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes)
Formatting the ‘Salary’ column
df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d'))
Putting it all together
# Import the Libraries import pandas as pd import numpy as np # Creating the DataFrame df = pd.DataFrame({ 'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'], 'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan], 'Height (in cm)': [167.7, 182.3, 178.7, 166.2], 'Salary': [12134343, 21312324, 421324554, 234434325] }) # Replacing the NaN values np.random.seed(124) df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())] # Mapping 'Name' column into numeric codes df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes) # Format values in 'Salary' column df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d')) print(df)
On executing this code, we get the following:
Here, we passed each value in the ‘Salary’ column in the built-in format() method, using the .apply() method of Pandas.
A potential caveat in performing this operation is that the value becomes object-type or categorical when formatting an integer because of the commas between the digits.
4. To Extract Rows if a Certain Categorical Column has a Given Substring
Sometimes we wish to remove rows that fulfill a specific requirement. This operation is often performed on the categorical columns of a DataFrame. We will perform a similar operation on one of our categorical columns below.
In our DataFrame, we will extract all the rows where the person has a ball game as their favorite Sport. To carry out this process, we will use our Favorite Sport column.
We will start with the prerequisites, including library imports, DataFrame construction, and the previously completed operations.
Performing the Prerequisites
# Import the Libraries import pandas as pd import numpy as np # Creating the DataFrame df = pd.DataFrame({ 'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'], 'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan], 'Height (in cm)': [167.7, 182.3, 178.7, 166.2], 'Salary': [12134343, 21312324, 421324554, 234434325] }) # Replacing the NaN values np.random.seed(124) df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())] # Mapping the 'Name' column into numeric codes df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes) # Format values in the 'Salary' column df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d'))
Extracting the Rows of Interest
print(df[df['Favourite Sport'].str.contains('ball')])
Executing this will extract all the rows where the Favourite Sport of a person has the text ‘ball’ in it.
Putting it all together
# Import the Libraries import pandas as pd import numpy as np # Creating the DataFrame df = pd.DataFrame({ 'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'], 'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan], 'Height (in cm)': [167.7, 182.3, 178.7, 166.2], 'Salary': [12134343, 21312324, 421324554, 234434325] }) # Replacing the NaN values np.random.seed(124) df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())] # Mapping the 'Name' column into numeric codes df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes) # Format values in the 'Salary' column df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d')) # Checking if 'ball' is in the 'Favourite Sport' column print(df[df['Favourite Sport'].str.contains('ball')])
On executing this code, we get the following:
Here, as explained, we got all the rows from our DataFrame having substring ‘ball’ in the column ‘Favourite Sport’.
Conclusions
This article discusses four simple yet powerful Pandas operations that can be used in various situations. All the actions described are carried out in the simplest way possible; however, there may be other ways to carry out these operations. This simplicity makes the article worth bookmarking because it saves time looking for similar solutions on StackOverflow.
Key Takeaways:
- We saw how we could replace NaNs with randomized values – numbers or strings.
- We also saw how to code the strings into numerics based on the alphabetical arrangement of strings.
- In the third operation, we learned how to format the integers and improve the readability for the user.
- We also have seen how formatting this can change the column’s datatype from int to str.
- In the fourth operation, we understood how to extract rows when a given substring is found in one of the columns specified.
Connect with me on LinkedIn. Check out my other Articles Here.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.