Filter pandas DataFrame by substring criteria

25 July 2024

4

Introduction

Pandas is a popular Python library for data analysis and manipulation. The DataFrame is one of the key data structures in Pandas, providing a way to store and work with structured data in a tabular format. DataFrames are useful for organizing and storing data in a consistent format, allowing you to perform operations on the data such as filtering, grouping, and aggregation.

DataFrames can be created from various sources, such as a list of lists, a dictionary, or an existing DataFrame. Here is an example of how to create a DataFrame from a list of dictionaries:

Python3

import pandas as pd 
  
# List of dictionaries containing data 
data = [ 
    {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"}, 
    {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"}, 
    {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"}, 
    {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"}, 
    {"Name": "Bob Johnson", "Age": 28, 
        "Address": "654 Cedar Blvd, Union City, NJ 07087"}, 
    {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"}, 
    {"Name": "Bill Williams", "Age": 33, 
        "Address": "245 Pine Rd,West New York, NJ 07093"}, 
    {"Name": "Mary Johnson", "Age": 25, 
        "Address": "369 Birch Ave, Guttenberg, NJ 07093"}, 
    {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"}, 
] 
  
# Create the DataFrame 
df = pd.DataFrame(data) 
  
print(df) 

Output:

	Name	Age	Address
0	John Smith	35	123 Main St, New York, NY 10001
1	Jane Doe	28	456 Park Ave, Newark, NJ 70004
2	Joe Schmo	55	789 Broad Way, Jersey City, NJ 07306
3	Sally Smith	42	321 Maple St, Hoboken, NJ 07030
4	Bob Johnson	28	654 Cedar Blvd, Union City, NJ 07087
5	Sue Johnson	29	912 Oak St, Weehawken, NJ 07086
6	Bill Williams	33	245 Pine Rd, West New York, NJ 07093
7	Mary Johnson	25	369 Birch Ave, Guttenberg, NJ 07093
8	Tom Williams	44	159 Willow St, Hoboken, NJ 07030

Filter the Dataframe

Now suppose you need to find the list of users with the age of 28. You can do the same by writing the following code:

Python3

df = df.loc[df["Age"] == 28] 
  
print(df) 

Output:

This would filter the DataFrame to only include rows where the “Age” is 28 and produce the following output:

	Name	Age	Address
1	Jane Doe	28	456 Park Ave, Newark, NJ 70004
4	Bob Johnson	28	654 Cedar Blvd, Union City, NJ 07087

This seems pretty easy. So what if we need to find the users with addresses in New York? Since the address column contains information like street, city, zip code, etc. we need to use the substring filter operations for the same

Method 1: Using loc with str.contains

To filter the DataFrame using a substring in the “Address” column, you can use the .loc[] method and specify the desired substring in the filter criteria using the .str.contains() method:

Python3

import pandas as pd 
  
# List of dictionaries containing data 
data = [ 
    {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"}, 
    {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"}, 
    {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"}, 
    {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"}, 
    {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"}, 
    {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"}, 
    {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"}, 
    {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"}, 
    {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"}, 
] 
  
# Create the DataFrame 
df = pd.DataFrame(data) 
  
# Filter the DataFrame to only include rows where the Address contains the substring "New York" 
df = df.loc[df["Address"].str.contains("New York")] 
  
print(df)

Output:

This would filter the DataFrame to only include rows where the “Address” column contains the substring “New York” and produce the output as follows:

	Name	Age	Address
0	John Smith	35	123 Main St, New York, NY 10001
6	Bill Williams	33	245 Pine Rd, West New York, NJ 07093

Method 2: Using `query` method

Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .query() method and specify the desired substring in the filter criteria:

Python3

import pandas as pd 
  
# List of dictionaries containing data 
data = [ 
    {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"}, 
    {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"}, 
    {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"}, 
    {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"}, 
    {"Name": "Bob Johnson", "Age": 28, 
        "Address": "654 Cedar Blvd, Union City, NJ 07087"}, 
    {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"}, 
    {"Name": "Bill Williams", "Age": 33, 
        "Address": "245 Pine Rd, West New York, NJ 07093"}, 
    {"Name": "Mary Johnson", "Age": 25, 
        "Address": "369 Birch Ave, Guttenberg, NJ 07093"}, 
    {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"}, 
] 
  
# Create the DataFrame 
df = pd.DataFrame(data) 
  
# Filter the DataFrame to only include rows where the Address contains the substring "New York" 
df = df.query("Address.str.contains('New York', case=False)") 
  
print(df) 

Output:

	Name	Age	Address
0	John Smith	35	123 Main St, New York, NY 10001
6	Bill Williams	33	245 Pine Rd, West New York, NJ 07093

Method 3: Using `apply` method

Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .apply() method and specify a custom function that returns True if the “Address” column contains the desired substring, and False otherwise:

Python3

import pandas as pd 
  
# List of dictionaries containing data 
data = [ 
    {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"}, 
    {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"}, 
    {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"}, 
    {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"}, 
    {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"}, 
    {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"}, 
    {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"}, 
    {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"}, 
    {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"}, 
] 
  
# Create the DataFrame 
df = pd.DataFrame(data) 
  
# Define a custom function to check if the Address contains the substring "New York" 
def contains_new_york(address): 
    return "New York" in address 
  
# Filter the DataFrame to only include rows where the Address contains the substring "New York" 
df = df[df["Address"].apply(contains_new_york)] 
  
print(df) 

Output:

	Name	Age	Address
0	John Smith	35	123 Main St, New York, NY 10001
6	Bill Williams	33	245 Pine Rd, West New York, NJ 07093

Method 4: Using map method

Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .map() method and specify a custom function that returns True if the “Address” column contains the desired substring, and False otherwise:

Python3

import pandas as pd 
  
# List of dictionaries containing data 
data = [ 
    {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"}, 
    {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"}, 
    {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"}, 
    {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"}, 
    {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"}, 
    {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"}, 
    {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"}, 
    {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"}, 
    {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"}, 
] 
  
# Create the DataFrame 
df = pd.DataFrame(data) 
  
# Define a custom function to check if the Address contains the substring "New York" 
def contains_new_york(address): 
    return "New York" in address 
  
# Map the custom function to the Address column to create a new column containing the result of the function 
df = df[df["Address"].map(contains_new_york)] 
  
print(df)

Output:

	Name	Age	Address
0	John Smith	35	123 Main St, New York, NY 10001
6	Bill Williams	33	245 Pine Rd, West New York, NJ 07093

Method 5: Using `filter` method

Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .filter() method and specify the desired substring in the filter criteria using the like operator:

Python3

import pandas as pd 
  
# List of dictionaries containing data 
data = [ 
    {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"}, 
    {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"}, 
    {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"}, 
    {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"}, 
    {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"}, 
    {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"}, 
    {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"}, 
    {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"}, 
    {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"}, 
] 
  
# Create the DataFrame 
df = pd.DataFrame(data) 
  
# Filter the DataFrame to only include the mentioned columns 
df = df.filter(like="Address") 
  
print(df) 

Output:

Address

123 Main St, New York, NY 10001

456 Park Ave, Newark, NJ 70004

789 Broad Way, Jersey City, NJ 07306

321 Maple St, Hoboken, NJ 07030

654 Cedar Blvd, Union City, NJ 07087

912 Oak St, Weehawken, NJ 07086

245 Pine Rd, West New York, NJ 07093

369 Birch Ave, Guttenberg, NJ 07093

159 Willow St, Hoboken, NJ 07030

The .filter() method is similar to the .query() method in that it allows you to specify filter criteria. However, the .filter() method only filters columns, whereas the .query() method can filter both columns and rows.

Method 6: Using `isin` method

Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .isin() method and specify the desired substring in the filter criteria:

Python3

import pandas as pd 
  
# List of dictionaries containing data 
data = [ 
    {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"}, 
    {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"}, 
    {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"}, 
    {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"}, 
    {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"}, 
    {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"}, 
    {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"}, 
    {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"}, 
    {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"}, 
] 
  
# Create the DataFrame 
df = pd.DataFrame(data) 
  
# Filter the DataFrame to only include rows where the Address contains the mentioned string 
df = df[df["Address"].isin(["123 Main St, New York, NY 10001"])] 
  
print(df) 

Output:

	Name	Age	Address
0	John Smith	35	123 Main St, New York, NY 10001

The .isin() method allows you to specify a list of values that the “Address” column should contain in order for the row to be included in the resulting DataFrame. In this case, the list contains only the value “123 Main St, New York, NY 10001”, so only rows where the “Address” column contains that exact value would be included in the resulting DataFrame.

To filter a Pandas DataFrame using a substring in any specific column data, you can use one of several methods, including the .loc[], .query(), .filter(), .isin(), .apply(), and .map() methods. The specific method you choose will depend on your personal preference and the specific requirements of your project.

Filter pandas DataFrame by substring criteria

Introduction

Python3

Filter the Dataframe

Python3

Method 1: Using loc with str.contains

Python3

Method 2: Using `query` method

Python3

Method 3: Using `apply` method

Python3

Method 4: Using map method

Python3

Method 5: Using `filter` method

Python3

Method 6: Using `isin` method

Python3

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

5 Best VPNs for Brunei in 2025: Surf & Stream Privately by Raven Wu

NordVPN vs. Mullvad VPN 2025: Which VPN Is Better? by Gjurgjica Panova

Surfshark vs. Atlas VPN 2025: Which VPN Is Better? by Gjurgjica Panova

PureVPN vs. Private Internet Access 2025: Which Is Better? by Gjurgjica Panova

Recent Comments

EDITOR PICKS

5 Best VPNs for Brunei in 2025: Surf & Stream Privately by Raven Wu

NordVPN vs. Mullvad VPN 2025: Which VPN Is Better? by Gjurgjica Panova

Surfshark vs. Atlas VPN 2025: Which VPN Is Better? by Gjurgjica Panova

POPULAR POSTS

5 Best VPNs for Brunei in 2025: Surf & Stream Privately by Raven Wu

NordVPN vs. Mullvad VPN 2025: Which VPN Is Better? by Gjurgjica Panova

Surfshark vs. Atlas VPN 2025: Which VPN Is Better? by Gjurgjica Panova

POPULAR CATEGORY

ABOUT US

FOLLOW US