Split a String into columns using regex in pandas DataFrame

27 July 2024

3

Given some mixed data containing multiple values as a string, let’s see how can we divide the strings using regex and make multiple columns in Pandas DataFrame.

Method #1:
In this method we will use re.search(pattern, string, flags=0). Here pattern refers to the pattern that we want to search. It takes in a string with the following values:

\w matches alphanumeric characters
\d matches digits, which means 0-9
\s matches whitespace characters
\S matches non-whitespace characters
. matches any character except the new line character \n
* matches 0 or more instances of a pattern

# import the regex library 
import pandas as pd 
import re 
  
# Create a list with all the strings 
movie_data = ["Name: The_Godfather Year: 1972 Rating: 9.2", 
            "Name: Bird_Box Year: 2018 Rating: 6.8", 
            "Name: Fight_Club Year: 1999 Rating: 8.8"] 
  
# Create a dictionary with the required columns  
# Used later to convert to DataFrame 
movies = {"Name":[], "Year":[], "Rating":[]} 
  
for item in movie_data: 
      
    # For Name field 
    name_field = re.search("Name: .*",item) 
      
    if name_field is not None: 
        name = re.search('\w*\s\w*',name_field.group()) 
    else: 
        name = None
    movies["Name"].append(name.group()) 
      
    # For Year field 
    year_field = re.search("Year: .*",item) 
    if year_field is not None: 
        year = re.search('\s\d\d\d\d',year_field.group()) 
    else: 
        year = None
    movies["Year"].append(year.group().strip()) 
      
    # For rating field 
    rating_field = re.search("Rating: .*",item) 
    if rating_field is not None:  
        rating = re.search('\s\d.\d',rating_field.group()) 
    else:  
        rating - None
    movies["Rating"].append(rating.group().strip()) 
  
# Creating DataFrame 
df = pd.DataFrame(movies) 
print(df) 

Output:

Explanation:

In the code above, we use a for loop to iterate through movie data so we can work with each movie in turn. We create a dictionary, movies, that will hold all the details of each detail, such as the rating and name.
We then find the entire Name field using the re.search() function. The . means any character except \n, and * extends it to the end of the line. Assign this to the variable name_field.
But, data isn’t always straightforward. It can contain surprises. For instance, what if there’s no Name: field? The script would throw an error and break. We pre-empt errors from this scenario and check for a not None case.
Again we use the re.search() function to extract the final required string from the name_field. For the name we use \w* to represent the first word, \s to represent the space in between and \w* for the second word.
Do the same for year and rating and get the final required dictionary.

Method #2:
To break up the string we will use Series.str.extract(pat, flags=0, expand=True) function. Here pat refers to the pattern that we want to search for.

import pandas as pd 
  
dict = {'movie_data':['The Godfather 1972 9.2', 
                    'Bird Box 2018 6.8', 
                    'Fight Club 1999 8.8'] } 
  
# Convert the dictionary to a dataframe 
df = pd.DataFrame(dict) 
  
# Extract name from the string  
df['Name'] = df['movie_data'].str.extract('(\w*\s\w*)', expand=True) 
   
# Extract year from the string  
df['Year'] = df['movie_data'].str.extract('(\d\d\d\d)', expand=True) 
  
# Extract rating from the string  
df['Rating'] = df['movie_data'].str.extract('(\d\.\d)', expand=True) 
print(df) 

Output:

Split a String into columns using regex in pandas DataFrame

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

5 Best VPNs for Minecraft in 2025: Fast & Secure by Manual Thomas

5 Best VPNs for Crypto Trading in 2025: 100% Safe Trading by Tyler Cross

5 Best VPNs for ITVX in 2025: Fast & Easy to Use by Gjurgjica Panova

5 Best VPNs for Iran in 2025: Fast & Secure by Penka Hristovska

Recent Comments

EDITOR PICKS

5 Best VPNs for Minecraft in 2025: Fast & Secure by Manual Thomas

5 Best VPNs for Crypto Trading in 2025: 100% Safe Trading by Tyler Cross

5 Best VPNs for ITVX in 2025: Fast & Easy to Use by Gjurgjica Panova

POPULAR POSTS

5 Best VPNs for Minecraft in 2025: Fast & Secure by Manual Thomas

5 Best VPNs for Crypto Trading in 2025: 100% Safe Trading by Tyler Cross

5 Best VPNs for ITVX in 2025: Fast & Easy to Use by Gjurgjica Panova

POPULAR CATEGORY

ABOUT US

FOLLOW US