Tuesday, November 19, 2024
Google search engine
HomeLanguagesHow to merge many TSV files by common key using Python Pandas?

How to merge many TSV files by common key using Python Pandas?

For data analysis the most important thing is data and we need to prepare it before we can use it for analysis. Sometimes required data can be scattered in multiple files and we need to merge them. In this article, we are going to merge multiple TSV (Tab Separated Values) files with a common key. This can be possible by using the merge method of the pandas Python library. This method allows us to combine files by using a common key.

Approach:

  • Import pandas library
  • Then read first two tsv files and merge them using pd.merge() function by setting the ‘on’ parameter to the common column present in both files. Then store the result in a new dataframe called ‘Output_df’.
  • Store remaining files in a list.
  • Run a loop that will iterate over these file names. Read these files one by one and merge them with ‘Output_df’  dataframe
  • Save ‘Output_df’ in tsv file

Example 1:

In this example, we will merge tsv files by using an inner join. We have taken four tsv files for this example as follows.

Used file: Customer.tsv, Account.tsv, Branch.tsv, Loan.tsv

Python3




# Import pandas library
import pandas as pd
 
# Read first two csv files with '\t' separator
tsv1 = pd.read_csv("Documents/Customer.tsv", sep='\t')
tsv2 = pd.read_csv("Documents/Account.tsv", sep='\t')
 
# store the result in Output_df dataframe.
# Here common column is 'ID' column
Output_df = pd.merge(tsv1, tsv2, on='ID',
                     how='inner')
 
# store remaining file names in list
tsv_files = ["Branch.tsv", "Loan.tsv"]
 
# One by one read tsv files and merge with
# 'Output_df' dataframe and again store
# the final result in Output_df
for i in tsv_files:
    path = "Documents/"+i
    tsv = pd.read_csv(path, sep='\t')
    Output_df = pd.merge(Output_df, tsv,
                         on='ID', how='inner')
 
# Now store the 'Output_df'
# in tsv file 'Output.tsv'
Output_df.to_csv("Documents/Output.tsv",
                 sep="\t", header=True,
                 index=False)


Output:

Output.tsv

Example 2:

In this example, we will merge tsv files by using an outer join. We have taken four tsv files for this example as follows.

Used file: Course.tsv, Teacher.tsv, Credits.tsv, Marks.tsv

Python3




# Import pandas library
import pandas as pd
 
# Read first two csv files with '\t' separator
tsv3 = pd.read_csv("Documents/Course.tsv", sep='\t')
tsv4 = pd.read_csv("Documents/Teacher.tsv", sep='\t')
 
# store the result in Output_df dataframe.
# Here common column is 'Course_ID' column
Output_df2 = pd.merge(tsv3, tsv4, on='Course_ID', how='outer')
 
# store remaining file names in list
tsv_files = ["Credits.tsv", "Marks.tsv"]
 
# One by one read tsv files and merge with
# 'Output_df2' dataframe and again store
# the final result in 'Output_df2'
for i in tsv_files:
    path = "Documents/"+i
    tsv = pd.read_csv(path, sep='\t')
    Output_df2 = pd.merge(Output_df2, tsv,
                          on='Course_ID', how='outer')
 
 
# Now store the 'Output_df2' in tsv file 'Output_outer.tsv'
# Here we replacing nan values with NA
Output_df2.to_csv("Documents/Output_outer.tsv", sep="\t",
                  header=True, index=False, na_rep="NA")


Output:

Dominic Rubhabha-Wardslaus
Dominic Rubhabha-Wardslaushttp://wardslaus.com
infosec,malicious & dos attacks generator, boot rom exploit philanthropist , wild hacker , game developer,
RELATED ARTICLES

Most Popular

Recent Comments