Introduction
Pandas is a popular Python library for data analysis and manipulation. The DataFrame is one of the key data structures in Pandas, providing a way to store and work with structured data in a tabular format. DataFrames are useful for organizing and storing data in a consistent format, allowing you to perform operations on the data such as filtering, grouping, and aggregation.
DataFrames can be created from various sources, such as a list of lists, a dictionary, or an existing DataFrame. Here is an example of how to create a DataFrame from a list of dictionaries:
Python3
import pandas as pd # List of dictionaries containing data data = [ { "Name" : "John Smith" , "Age" : 35 , "Address" : "123 Main St, New York, NY 10001" }, { "Name" : "Jane Doe" , "Age" : 28 , "Address" : "456 Park Ave, Newark, NJ 70004" }, { "Name" : "Joe Schmo" , "Age" : 55 , "Address" : "789 Broad Way, Jersey City, NJ 07306" }, { "Name" : "Sally Smith" , "Age" : 42 , "Address" : "321 Maple St, Hoboken, NJ 07030" }, { "Name" : "Bob Johnson" , "Age" : 28 , "Address" : "654 Cedar Blvd, Union City, NJ 07087" }, { "Name" : "Sue Johnson" , "Age" : 29 , "Address" : "912 Oak St, Weehawken, NJ 07086" }, { "Name" : "Bill Williams" , "Age" : 33 , "Address" : "245 Pine Rd,West New York, NJ 07093" }, { "Name" : "Mary Johnson" , "Age" : 25 , "Address" : "369 Birch Ave, Guttenberg, NJ 07093" }, { "Name" : "Tom Williams" , "Age" : 44 , "Address" : "159 Willow St, Hoboken, NJ 07030" }, ] # Create the DataFrame df = pd.DataFrame(data) print (df) |
Output:
Name | Age | Address | |
0 | John Smith | 35 | 123 Main St, New York, NY 10001 |
1 | Jane Doe | 28 | 456 Park Ave, Newark, NJ 70004 |
2 | Joe Schmo | 55 | 789 Broad Way, Jersey City, NJ 07306 |
3 | Sally Smith | 42 | 321 Maple St, Hoboken, NJ 07030 |
4 | Bob Johnson | 28 | 654 Cedar Blvd, Union City, NJ 07087 |
5 | Sue Johnson | 29 | 912 Oak St, Weehawken, NJ 07086 |
6 | Bill Williams | 33 | 245 Pine Rd, West New York, NJ 07093 |
7 | Mary Johnson | 25 | 369 Birch Ave, Guttenberg, NJ 07093 |
8 | Tom Williams | 44 | 159 Willow St, Hoboken, NJ 07030 |
Filter the Dataframe
Now suppose you need to find the list of users with the age of 28. You can do the same by writing the following code:
Python3
df = df.loc[df[ "Age" ] = = 28 ] print (df) |
Output:
This would filter the DataFrame to only include rows where the “Age” is 28 and produce the following output:
Name | Age | Address | |
1 | Jane Doe | 28 | 456 Park Ave, Newark, NJ 70004 |
4 | Bob Johnson | 28 | 654 Cedar Blvd, Union City, NJ 07087 |
This seems pretty easy. So what if we need to find the users with addresses in New York? Since the address column contains information like street, city, zip code, etc. we need to use the substring filter operations for the same
Method 1: Using loc with str.contains
To filter the DataFrame using a substring in the “Address” column, you can use the .loc[] method and specify the desired substring in the filter criteria using the .str.contains() method:
Python3
import pandas as pd # List of dictionaries containing data data = [ { "Name" : "John Smith" , "Age" : 35 , "Address" : "123 Main St, New York, NY 10001" }, { "Name" : "Jane Doe" , "Age" : 28 , "Address" : "456 Park Ave, Newark, NJ 70004" }, { "Name" : "Joe Schmo" , "Age" : 55 , "Address" : "789 Broad Way, Jersey City, NJ 07306" }, { "Name" : "Sally Smith" , "Age" : 42 , "Address" : "321 Maple St, Hoboken, NJ 07030" }, { "Name" : "Bob Johnson" , "Age" : 28 , "Address" : "654 Cedar Blvd, Union City, NJ 07087" }, { "Name" : "Sue Johnson" , "Age" : 29 , "Address" : "912 Oak St, Weehawken, NJ 07086" }, { "Name" : "Bill Williams" , "Age" : 33 , "Address" : "245 Pine Rd, West New York, NJ 07093" }, { "Name" : "Mary Johnson" , "Age" : 25 , "Address" : "369 Birch Ave, Guttenberg, NJ 07093" }, { "Name" : "Tom Williams" , "Age" : 44 , "Address" : "159 Willow St, Hoboken, NJ 07030" }, ] # Create the DataFrame df = pd.DataFrame(data) # Filter the DataFrame to only include rows where the Address contains the substring "New York" df = df.loc[df[ "Address" ]. str .contains( "New York" )] print (df) |
Output:
This would filter the DataFrame to only include rows where the “Address” column contains the substring “New York” and produce the output as follows:
Name | Age | Address | |
0 | John Smith | 35 | 123 Main St, New York, NY 10001 |
6 | Bill Williams | 33 | 245 Pine Rd, West New York, NJ 07093 |
Method 2: Using `query` method
Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .query() method and specify the desired substring in the filter criteria:
Python3
import pandas as pd # List of dictionaries containing data data = [ { "Name" : "John Smith" , "Age" : 35 , "Address" : "123 Main St, New York, NY 10001" }, { "Name" : "Jane Doe" , "Age" : 28 , "Address" : "456 Park Ave, Newark, NJ 70004" }, { "Name" : "Joe Schmo" , "Age" : 55 , "Address" : "789 Broad Way, Jersey City, NJ 07306" }, { "Name" : "Sally Smith" , "Age" : 42 , "Address" : "321 Maple St, Hoboken, NJ 07030" }, { "Name" : "Bob Johnson" , "Age" : 28 , "Address" : "654 Cedar Blvd, Union City, NJ 07087" }, { "Name" : "Sue Johnson" , "Age" : 29 , "Address" : "912 Oak St, Weehawken, NJ 07086" }, { "Name" : "Bill Williams" , "Age" : 33 , "Address" : "245 Pine Rd, West New York, NJ 07093" }, { "Name" : "Mary Johnson" , "Age" : 25 , "Address" : "369 Birch Ave, Guttenberg, NJ 07093" }, { "Name" : "Tom Williams" , "Age" : 44 , "Address" : "159 Willow St, Hoboken, NJ 07030" }, ] # Create the DataFrame df = pd.DataFrame(data) # Filter the DataFrame to only include rows where the Address contains the substring "New York" df = df.query( "Address.str.contains('New York', case=False)" ) print (df) |
Output:
Name | Age | Address | |
0 | John Smith | 35 | 123 Main St, New York, NY 10001 |
6 | Bill Williams | 33 | 245 Pine Rd, West New York, NJ 07093 |
Method 3: Using `apply` method
Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .apply() method and specify a custom function that returns True if the “Address” column contains the desired substring, and False otherwise:
Python3
import pandas as pd # List of dictionaries containing data data = [ { "Name" : "John Smith" , "Age" : 35 , "Address" : "123 Main St, New York, NY 10001" }, { "Name" : "Jane Doe" , "Age" : 28 , "Address" : "456 Park Ave, Newark, NJ 70004" }, { "Name" : "Joe Schmo" , "Age" : 55 , "Address" : "789 Broad Way, Jersey City, NJ 07306" }, { "Name" : "Sally Smith" , "Age" : 42 , "Address" : "321 Maple St, Hoboken, NJ 07030" }, { "Name" : "Bob Johnson" , "Age" : 28 , "Address" : "654 Cedar Blvd, Union City, NJ 07087" }, { "Name" : "Sue Johnson" , "Age" : 29 , "Address" : "912 Oak St, Weehawken, NJ 07086" }, { "Name" : "Bill Williams" , "Age" : 33 , "Address" : "245 Pine Rd, West New York, NJ 07093" }, { "Name" : "Mary Johnson" , "Age" : 25 , "Address" : "369 Birch Ave, Guttenberg, NJ 07093" }, { "Name" : "Tom Williams" , "Age" : 44 , "Address" : "159 Willow St, Hoboken, NJ 07030" }, ] # Create the DataFrame df = pd.DataFrame(data) # Define a custom function to check if the Address contains the substring "New York" def contains_new_york(address): return "New York" in address # Filter the DataFrame to only include rows where the Address contains the substring "New York" df = df[df[ "Address" ]. apply (contains_new_york)] print (df) |
Output:
Name | Age | Address | |
0 | John Smith | 35 | 123 Main St, New York, NY 10001 |
6 | Bill Williams | 33 | 245 Pine Rd, West New York, NJ 07093 |
Method 4: Using map method
Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .map() method and specify a custom function that returns True if the “Address” column contains the desired substring, and False otherwise:
Python3
import pandas as pd # List of dictionaries containing data data = [ { "Name" : "John Smith" , "Age" : 35 , "Address" : "123 Main St, New York, NY 10001" }, { "Name" : "Jane Doe" , "Age" : 28 , "Address" : "456 Park Ave, Newark, NJ 70004" }, { "Name" : "Joe Schmo" , "Age" : 55 , "Address" : "789 Broad Way, Jersey City, NJ 07306" }, { "Name" : "Sally Smith" , "Age" : 42 , "Address" : "321 Maple St, Hoboken, NJ 07030" }, { "Name" : "Bob Johnson" , "Age" : 28 , "Address" : "654 Cedar Blvd, Union City, NJ 07087" }, { "Name" : "Sue Johnson" , "Age" : 29 , "Address" : "912 Oak St, Weehawken, NJ 07086" }, { "Name" : "Bill Williams" , "Age" : 33 , "Address" : "245 Pine Rd, West New York, NJ 07093" }, { "Name" : "Mary Johnson" , "Age" : 25 , "Address" : "369 Birch Ave, Guttenberg, NJ 07093" }, { "Name" : "Tom Williams" , "Age" : 44 , "Address" : "159 Willow St, Hoboken, NJ 07030" }, ] # Create the DataFrame df = pd.DataFrame(data) # Define a custom function to check if the Address contains the substring "New York" def contains_new_york(address): return "New York" in address # Map the custom function to the Address column to create a new column containing the result of the function df = df[df[ "Address" ]. map (contains_new_york)] print (df) |
Output:
Name | Age | Address | |
0 | John Smith | 35 | 123 Main St, New York, NY 10001 |
6 | Bill Williams | 33 | 245 Pine Rd, West New York, NJ 07093 |
Method 5: Using `filter` method
Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .filter() method and specify the desired substring in the filter criteria using the like operator:
Python3
import pandas as pd # List of dictionaries containing data data = [ { "Name" : "John Smith" , "Age" : 35 , "Address" : "123 Main St, New York, NY 10001" }, { "Name" : "Jane Doe" , "Age" : 28 , "Address" : "456 Park Ave, Newark, NJ 70004" }, { "Name" : "Joe Schmo" , "Age" : 55 , "Address" : "789 Broad Way, Jersey City, NJ 07306" }, { "Name" : "Sally Smith" , "Age" : 42 , "Address" : "321 Maple St, Hoboken, NJ 07030" }, { "Name" : "Bob Johnson" , "Age" : 28 , "Address" : "654 Cedar Blvd, Union City, NJ 07087" }, { "Name" : "Sue Johnson" , "Age" : 29 , "Address" : "912 Oak St, Weehawken, NJ 07086" }, { "Name" : "Bill Williams" , "Age" : 33 , "Address" : "245 Pine Rd, West New York, NJ 07093" }, { "Name" : "Mary Johnson" , "Age" : 25 , "Address" : "369 Birch Ave, Guttenberg, NJ 07093" }, { "Name" : "Tom Williams" , "Age" : 44 , "Address" : "159 Willow St, Hoboken, NJ 07030" }, ] # Create the DataFrame df = pd.DataFrame(data) # Filter the DataFrame to only include the mentioned columns df = df. filter (like = "Address" ) print (df) |
Output:
Address |
123 Main St, New York, NY 10001 |
456 Park Ave, Newark, NJ 70004 |
789 Broad Way, Jersey City, NJ 07306 |
321 Maple St, Hoboken, NJ 07030 |
654 Cedar Blvd, Union City, NJ 07087 |
912 Oak St, Weehawken, NJ 07086 |
245 Pine Rd, West New York, NJ 07093 |
369 Birch Ave, Guttenberg, NJ 07093 |
159 Willow St, Hoboken, NJ 07030 |
The .filter() method is similar to the .query() method in that it allows you to specify filter criteria. However, the .filter() method only filters columns, whereas the .query() method can filter both columns and rows.
Method 6: Using `isin` method
Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .isin() method and specify the desired substring in the filter criteria:
Python3
import pandas as pd # List of dictionaries containing data data = [ { "Name" : "John Smith" , "Age" : 35 , "Address" : "123 Main St, New York, NY 10001" }, { "Name" : "Jane Doe" , "Age" : 28 , "Address" : "456 Park Ave, Newark, NJ 70004" }, { "Name" : "Joe Schmo" , "Age" : 55 , "Address" : "789 Broad Way, Jersey City, NJ 07306" }, { "Name" : "Sally Smith" , "Age" : 42 , "Address" : "321 Maple St, Hoboken, NJ 07030" }, { "Name" : "Bob Johnson" , "Age" : 28 , "Address" : "654 Cedar Blvd, Union City, NJ 07087" }, { "Name" : "Sue Johnson" , "Age" : 29 , "Address" : "912 Oak St, Weehawken, NJ 07086" }, { "Name" : "Bill Williams" , "Age" : 33 , "Address" : "245 Pine Rd, West New York, NJ 07093" }, { "Name" : "Mary Johnson" , "Age" : 25 , "Address" : "369 Birch Ave, Guttenberg, NJ 07093" }, { "Name" : "Tom Williams" , "Age" : 44 , "Address" : "159 Willow St, Hoboken, NJ 07030" }, ] # Create the DataFrame df = pd.DataFrame(data) # Filter the DataFrame to only include rows where the Address contains the mentioned string df = df[df[ "Address" ].isin([ "123 Main St, New York, NY 10001" ])] print (df) |
Output:
Name | Age | Address | |
0 | John Smith | 35 | 123 Main St, New York, NY 10001 |
The .isin() method allows you to specify a list of values that the “Address” column should contain in order for the row to be included in the resulting DataFrame. In this case, the list contains only the value “123 Main St, New York, NY 10001”, so only rows where the “Address” column contains that exact value would be included in the resulting DataFrame.
To filter a Pandas DataFrame using a substring in any specific column data, you can use one of several methods, including the .loc[], .query(), .filter(), .isin(), .apply(), and .map() methods. The specific method you choose will depend on your personal preference and the specific requirements of your project.