Saturday, September 21, 2024
Google search engine
HomeData Modelling & AIData Manipulation Using Pandas you need to know!

Data Manipulation Using Pandas you need to know!

This article was published as a part of the Data Science Blogathon

Pandas

Pandas is an open-source data analysis and data manipulation library written in python. Pandas provide you with data structures and functions to work on structured data seamlessly. The name Pandas refer to “Panel Data”, which means a structured dataset. Pandas have two main classes to work on, DataFrame and Series. Let us explore more on this later in this article.

Key Features of Pandas

  • Perform Group by operation seamlessly
  • Datasets are mutable using pandas which means we can add new rows and columns to them.
  • Easy to handle missing data
  • Merge and join datasets
  • Indexing and subsetting data

Installation

Install via pip using the following command,

pip install pandas

Install via anaconda using the following command,

conda install pandas

DataFrame in Pandas

A DataFrame is a two-dimensional table in pandas. Each column can have different data types like int, float, or string. Each column is of class Series in pandas, we’ll discuss this later in this article.

 

Creating a DataFrame in Pandas

# import the library as pd
import pandas as pd
df = pd.DataFrame(
    {
        'Name': ['Srivignesh', 'Hari'],
        'Age': [22, 11],
        'Country': ['India', 'India']
    }
) 
print(df)
# output
#         Name  Age Country
# 0  Srivignesh   22   India
# 1        Hari   11   India

pd.DataFrame is a class available in pandas. Here we provide a dictionary whose keys are the column names (‘Name’, ‘Age’, ‘Country’) and the values are the values in those columns. Here each column is of class pandas.Series. Series is a one-dimensional data used in pandas.

# accessing the column 'Name' in df
print(df['Name'])
# Output
# 0    Srivignesh
# 1          Hari
# Name: Name, dtype: object
print(type(df['Name']))
# Output
# <class 'pandas.core.series.Series'>

Let’s get started with Data Manipulation using Pandas!

For this purpose, we are going to use Titanic Dataset which is available on Kaggle.

import pandas as pd
path_to_data = 'path/to/titanic_dataset'
# read the csv data using pd.read_csv function
data = pd.read_csv(path_to_data)
data.head()
data.head | Data manipulation

Dropping columns in the data

df_dropped = data.drop('Survived', axis=1)
df_dropped.head()
dropping columns

The ‘Survived’ column is dropped in the data. The axis=1 denotes that it ‘Survived’ is a column, so it searches ‘Survived’ column-wise to drop.

Drop multiple columns using the following code,

df_dropped_multiple = data.drop(['Survived', 'Name'], axis=1)
df_dropped_multiple.head()

The columns ‘Survived’ and ‘Name’ are dropped in the data.

Dropping rows in the data

df_row_dropped = data.drop(2, axis=0)
df_row_dropped.head()
dropping rows

The row with index 2 is dropped in the data. The axis=0 denotes that index 2 is a row, so it searches the index 2 column-wise.

Drop multiple rows using the following code,

df_row_dropped_multiple = data.drop([2, 3], axis=0)
df_row_dropped_multiple.head()
drop multiple rows | Data manipulation

The rows with indexes 2 and 3 are dropped in the data.

Renaming a column in the dataset

data.columns
# Output
# Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
#        'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
#       dtype='object')
df_renamed = data.rename(columns={'PassengerId': 'Id'})
df_renamed.head()
rename column | Data manipulation

The column ‘PassengerId’ is renamed to ‘Id’ in the data. Do not forget to mention the dictionary inside the columns parameter.

Rename multiple columns using the following code,

df_renamed_multiple = data.rename(
    columns={
        'PassengerId': 'Id', 
        'Sex': 'Gender',
    }
)
df_renamed_multiple.head()
rename multiple columns

The columns ‘PassengerId’ and ‘Sex’ are renamed to ‘Id’ and ‘Gender’ respectively.

Select columns with specific data types

integer_data = data.select_dtypes('int')
integer_data.head()
selct columns

The above code selects all columns with integer data types.

float_data = data.select_dtypes('float')
float_data.head()
select by value

The above code selects all columns with float data types.

Slicing the dataset

data.iloc[:5, 0]
slicing the data | Data manipulation

The above code returns the first five rows of the first column. The ‘:5’ in the iloc denotes the first five rows and the number 0 after the comma denotes the first column, iloc is used to locate the data using numbers or integers.

data.loc[:5, 'PassengerId']
data.loc | Data manipulation

The above code does the same but we can use the column names directly using loc in pandas. Here the index 5 is inclusive.

Handle Duplicates in Dataset

Since there are no duplicate data in the titanic dataset, let us first add a duplicated row into the data and handle it.

df_dup = data.copy()
# duplicate the first row and append it to the data
row = df_dup.iloc[:1]
df_dup = df_dup.append(row, ignore_index=True)
df_dup
duplicates
df_dup[df_dup.duplicated()]
handle duplicate

The above code returns the duplicated rows in the data.

df_dup.drop_duplicates()
drop duplicate | Data manipulation

The above code drops the duplicated rows in the data.

Select specific values in the column

data[data['Pclass'] == 1]
select specific value in column

The above code returns the values which are equal to one in the column ‘Pclass’ in the data.

Select multiple values in the column using the following code,

data[data['Pclass'].isin([1, 0])]

The above code returns the values which are equal to one and zero in the column ‘Pclass’ in the data.

Group by in DataFrame

data.groupby('Sex').agg({'PassengerId': 'count'})
groupby | Data manipulation

The above code groups the values of the column ‘Sex’ and aggregates the column ‘PassengerId’ by the count of that column.

data.groupby('Sex').agg({'Age':'mean'})

The above code groups the values of the column ‘Sex’ and aggregates the column ‘Age’ by mean of that column.

Group multiple columns using the following code,

data.groupby(['Pclass', 'Sex']).agg({'PassengerId': 'count'})
groupby

Map in Pandas

data['Survived'].map(lambda x: 'Survived' if x==1 else 'Not-Survived')
map in pandas | Data manipulation

The above code maps the values 0 to ‘Not-Survived’ and 1 to ‘Survived’. You can alternatively use the following code to obtain the same results.

data['Survived'].map({1: 'Survived', 0: 'Not-Survived'})

Replacing values in a DataFrame

data['Sex'].replace(['male', 'female'], ["M", "F"])
replacing values

The above code replaces ‘male’ as ‘M’ and ‘female’ as ‘F’.

Save the DataFrame as a CSV file

data.to_csv('/path/to/save/the/data.csv', index=False)

The index=False argument does not save the index as a separate column in the CSV.

References

[1] Pandas Documentation

Thank you!

The media shown in this article on Deploying Machine Learning Models leveraging CherryPy and Docker are not owned by Analytics Vidhya and are used at the Author’s discretion.

Srivignesh R

23 Jun 2021

Machine Learning Engineer @ Zoho Corporation

Dominic Rubhabha-Wardslaus
Dominic Rubhabha-Wardslaushttp://wardslaus.com
infosec,malicious & dos attacks generator, boot rom exploit philanthropist , wild hacker , game developer,
RELATED ARTICLES

Most Popular

Recent Comments