For data analysis the most important thing is data and we need to prepare it before we can use it for analysis. Sometimes required data can be scattered in multiple files and we need to merge them. In this article, we are going to merge multiple TSV (Tab Separated Values) files with a common key. This can be possible by using the merge method of the pandas Python library. This method allows us to combine files by using a common key.
Approach:
- Import pandas library
- Then read first two tsv files and merge them using pd.merge() function by setting the ‘on’ parameter to the common column present in both files. Then store the result in a new dataframe called ‘Output_df’.
- Store remaining files in a list.
- Run a loop that will iterate over these file names. Read these files one by one and merge them with ‘Output_df’ dataframe
- Save ‘Output_df’ in tsv file
Example 1:
In this example, we will merge tsv files by using an inner join. We have taken four tsv files for this example as follows.
Used file: Customer.tsv, Account.tsv, Branch.tsv, Loan.tsv
Python3
# Import pandas library import pandas as pd # Read first two csv files with '\t' separator tsv1 = pd.read_csv( "Documents/Customer.tsv" , sep = '\t' ) tsv2 = pd.read_csv( "Documents/Account.tsv" , sep = '\t' ) # store the result in Output_df dataframe. # Here common column is 'ID' column Output_df = pd.merge(tsv1, tsv2, on = 'ID' , how = 'inner' ) # store remaining file names in list tsv_files = [ "Branch.tsv" , "Loan.tsv" ] # One by one read tsv files and merge with # 'Output_df' dataframe and again store # the final result in Output_df for i in tsv_files: path = "Documents/" + i tsv = pd.read_csv(path, sep = '\t' ) Output_df = pd.merge(Output_df, tsv, on = 'ID' , how = 'inner' ) # Now store the 'Output_df' # in tsv file 'Output.tsv' Output_df.to_csv( "Documents/Output.tsv" , sep = "\t" , header = True , index = False ) |
Output:
Example 2:
In this example, we will merge tsv files by using an outer join. We have taken four tsv files for this example as follows.
Used file: Course.tsv, Teacher.tsv, Credits.tsv, Marks.tsv
Python3
# Import pandas library import pandas as pd # Read first two csv files with '\t' separator tsv3 = pd.read_csv( "Documents/Course.tsv" , sep = '\t' ) tsv4 = pd.read_csv( "Documents/Teacher.tsv" , sep = '\t' ) # store the result in Output_df dataframe. # Here common column is 'Course_ID' column Output_df2 = pd.merge(tsv3, tsv4, on = 'Course_ID' , how = 'outer' ) # store remaining file names in list tsv_files = [ "Credits.tsv" , "Marks.tsv" ] # One by one read tsv files and merge with # 'Output_df2' dataframe and again store # the final result in 'Output_df2' for i in tsv_files: path = "Documents/" + i tsv = pd.read_csv(path, sep = '\t' ) Output_df2 = pd.merge(Output_df2, tsv, on = 'Course_ID' , how = 'outer' ) # Now store the 'Output_df2' in tsv file 'Output_outer.tsv' # Here we replacing nan values with NA Output_df2.to_csv( "Documents/Output_outer.tsv" , sep = "\t" , header = True , index = False , na_rep = "NA" ) |
Output: