In this article, we will discuss how to merge Pandas DataFrame based on the closest DateTime. To learn how to merge DataFrames first you have to learn that how to create a DataFrame for that you have to refer to the article Creating a Pandas DataFrame. After creating DataFrames need to merge them and to merge the Dataframe there’s a function named merge_asof() when it comes to writing this then it can be written as:
pandas.merge_asof(left, right, on=None, left_on=None, right_on=None, left_index=False, right_index=False, by=None, left_by=None, right_by=None, suffixes=(‘_x’, ‘_y’), tolerance=None, allow_exact_matches=True, direction=’backward’)
Note:
- To know more about this function refer to the article pandas.merge_asof() function in Python
- Dataframes must be sorted by the key.
Step-by-step Approach
Step 1: Import pandas library
To complete this task we have to import the library named Pandas.
import pandas as pd
Step 2: Create the Dataframe
In this step, we have to create DataFrames using the function “pd.DataFrame()”. In this, we created 2 data frames one is named left and another is named right because our last goal is to merge 2 data frames based on the closest DateTime. It can be written as:
left = pd.DataFrame( {
“time”: [pd.Timestamp(“2020-03-25 13:30:00.023”),
pd.Timestamp(“2020-03-25 13:30:00.023”),
pd.Timestamp(“2020-03-25 13:30:00.030”),
pd.Timestamp(“2020-03-25 13:30:00.041”),
pd.Timestamp(“2020-03-25 13:30:00.048”),
pd.Timestamp(“2020-03-25 13:30:00.049”),
pd.Timestamp(“2020-03-25 13:30:00.072”),
pd.Timestamp(“2020-03-25 13:30:00.075”)
],
“ticker”: [“GOOG”,”MSFT”,”MSFT”,”MSFT”,”GOOG”,”AAPL”,”GOOG”,”MSFT”],
“bid”: [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
“ask”: [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03]
})
right = pd.DataFrame( {
“time”: [
pd.Timestamp(“2020-03-25 13:30:00.023”),
pd.Timestamp(“2020-03-25 13:30:00.038”),
pd.Timestamp(“2020-03-25 13:30:00.048”),
pd.Timestamp(“2020-03-25 13:30:00.048”),
pd.Timestamp(“2020-03-25 13:30:00.048”)
],
“ticker”: [“MSFT”, “MSFT”, “GOOG”, “GOOG”, “AAPL”],
“price”: [51.95, 51.95, 720.77, 720.92, 98.0],
“quantity”: [75, 155, 100, 100, 100]
})
Step 3: Merge the Dataframes and print them
In this step, the data frames are going to be merged using the function “pd.merge_asof()”. The result of the merge_asof() function is stored in a variable and then the variable is printed by using “print()”.
Python3
# Importing the required package import pandas as pd # Creating the DataFrame of left side left = pd.DataFrame({ "time" : [pd.Timestamp( "2020-03-25 13:30:00.023" ), pd.Timestamp( "2020-03-25 13:30:00.023" ), pd.Timestamp( "2020-03-25 13:30:00.030" ), pd.Timestamp( "2020-03-25 13:30:00.041" ), pd.Timestamp( "2020-03-25 13:30:00.048" ), pd.Timestamp( "2020-03-25 13:30:00.049" ), pd.Timestamp( "2020-03-25 13:30:00.072" ), pd.Timestamp( "2020-03-25 13:30:00.075" ) ], "ticker" : [ "GOOG" , "MSFT" , "MSFT" , "MSFT" , "GOOG" , "AAPL" , "GOOG" , "MSFT" ], "bid" : [ 720.50 , 51.95 , 51.97 , 51.99 , 720.50 , 97.99 , 720.50 , 52.01 ], "ask" : [ 720.93 , 51.96 , 51.98 , 52.00 , 720.93 , 98.01 , 720.88 , 52.03 ] }) # Creating the Dataframe of right side right = pd.DataFrame({ "time" : [ pd.Timestamp( "2020-03-25 13:30:00.023" ), pd.Timestamp( "2020-03-25 13:30:00.038" ), pd.Timestamp( "2020-03-25 13:30:00.048" ), pd.Timestamp( "2020-03-25 13:30:00.048" ), pd.Timestamp( "2020-03-25 13:30:00.048" ) ], "ticker" : [ "MSFT" , "MSFT" , "GOOG" , "GOOG" , "AAPL" ], "price" : [ 51.95 , 51.95 , 720.77 , 720.92 , 98.0 ], "quantity" : [ 75 , 155 , 100 , 100 , 100 ] }) # Applying merge_asof on data and store it # in a variable merged_dataframe = pd.merge_asof(right, left, on = "time" , by = "ticker" ) # print the variable print (merged_dataframe) |
Output :
Example 1: Now we change the position of the left and right Dataframe in the merge_asof function.
Python3
# Importing the required package import pandas as pd # Creating the DataFrame of left side left = pd.DataFrame({ "time" : [pd.Timestamp( "2020-03-25 13:30:00.023" ), pd.Timestamp( "2020-03-25 13:30:00.023" ), pd.Timestamp( "2020-03-25 13:30:00.030" ), pd.Timestamp( "2020-03-25 13:30:00.041" ), pd.Timestamp( "2020-03-25 13:30:00.048" ), pd.Timestamp( "2020-03-25 13:30:00.049" ), pd.Timestamp( "2020-03-25 13:30:00.072" ), pd.Timestamp( "2020-03-25 13:30:00.075" ) ], "ticker" : [ "GOOG" , "MSFT" , "MSFT" , "MSFT" , "GOOG" , "AAPL" , "GOOG" , "MSFT" ], "bid" : [ 720.50 , 51.95 , 51.97 , 51.99 , 720.50 , 97.99 , 720.50 , 52.01 ], "ask" : [ 720.93 , 51.96 , 51.98 , 52.00 , 720.93 , 98.01 , 720.88 , 52.03 ] }) # Creating the Dataframe of right side right = pd.DataFrame({ "time" : [ pd.Timestamp( "2020-03-25 13:30:00.023" ), pd.Timestamp( "2020-03-25 13:30:00.038" ), pd.Timestamp( "2020-03-25 13:30:00.048" ), pd.Timestamp( "2020-03-25 13:30:00.048" ), pd.Timestamp( "2020-03-25 13:30:00.048" ) ], "ticker" : [ "MSFT" , "MSFT" , "GOOG" , "GOOG" , "AAPL" ], "price" : [ 51.95 , 51.95 , 720.77 , 720.92 , 98.0 ], "quantity" : [ 75 , 155 , 100 , 100 , 100 ] }) # Applying merge_asof on data and store it # in a variable merged_dataframe = pd.merge_asof(left, right, on = "time" , by = "ticker" ) # print the variable print (merged_dataframe) |
Output:
Note: So, it is clear from our 2 outputs that when we put right DataFrame in the first place then the number of rows in output is 5 equals to the number of rows in right DataFrame and when the left DataFrame is put on the first place then the number of rows in output is equal to the number of rows in left DataFrame. If we look at both outputs and compare them then we can easily say that the merge_asof() is similar to the left-join except that we match on the nearest key rather than equal keys.
Example 2: We only asof within 2ms between the quoted time and the trade time.
Python3
# Importing the required package import pandas as pd # Creating the DataFrame of left side left = pd.DataFrame({ "time" : [pd.Timestamp( "2020-03-25 13:30:00.023" ), pd.Timestamp( "2020-03-25 13:30:00.023" ), pd.Timestamp( "2020-03-25 13:30:00.030" ), pd.Timestamp( "2020-03-25 13:30:00.041" ), pd.Timestamp( "2020-03-25 13:30:00.048" ), pd.Timestamp( "2020-03-25 13:30:00.049" ), pd.Timestamp( "2020-03-25 13:30:00.072" ), pd.Timestamp( "2020-03-25 13:30:00.075" ) ], "ticker" : [ "GOOG" , "MSFT" , "MSFT" , "MSFT" , "GOOG" , "AAPL" , "GOOG" , "MSFT" ], "bid" : [ 720.50 , 51.95 , 51.97 , 51.99 , 720.50 , 97.99 , 720.50 , 52.01 ], "ask" : [ 720.93 , 51.96 , 51.98 , 52.00 , 720.93 , 98.01 , 720.88 , 52.03 ] }) # Creating the Dataframe of right side right = pd.DataFrame({ "time" : [ pd.Timestamp( "2020-03-25 13:30:00.023" ), pd.Timestamp( "2020-03-25 13:30:00.038" ), pd.Timestamp( "2020-03-25 13:30:00.048" ), pd.Timestamp( "2020-03-25 13:30:00.048" ), pd.Timestamp( "2020-03-25 13:30:00.048" ) ], "ticker" : [ "MSFT" , "MSFT" , "GOOG" , "GOOG" , "AAPL" ], "price" : [ 51.95 , 51.95 , 720.77 , 720.92 , 98.0 ], "quantity" : [ 75 , 155 , 100 , 100 , 100 ] }) # Applying merge_asof on data and store it # in a variable merged_dataframe = pd.merge_asof(left, right, on = "time" , by = "ticker" , tolerance = pd.Timedelta( "2ms" )) # print the variable print (merged_dataframe) |
Output :
Example 3: We only asof within 10ms between the quoted time and the trade time, and we exclude exact matches on time. However, prior data will propagate forward.
Python3
# Importing the required package import pandas as pd # Creating the DataFrame of left side left = pd.DataFrame({ "time" : [pd.Timestamp( "2020-03-25 13:30:00.023" ), pd.Timestamp( "2020-03-25 13:30:00.023" ), pd.Timestamp( "2020-03-25 13:30:00.030" ), pd.Timestamp( "2020-03-25 13:30:00.041" ), pd.Timestamp( "2020-03-25 13:30:00.048" ), pd.Timestamp( "2020-03-25 13:30:00.049" ), pd.Timestamp( "2020-03-25 13:30:00.072" ), pd.Timestamp( "2020-03-25 13:30:00.075" ) ], "ticker" : [ "GOOG" , "MSFT" , "MSFT" , "MSFT" , "GOOG" , "AAPL" , "GOOG" , "MSFT" ], "bid" : [ 720.50 , 51.95 , 51.97 , 51.99 , 720.50 , 97.99 , 720.50 , 52.01 ], "ask" : [ 720.93 , 51.96 , 51.98 , 52.00 , 720.93 , 98.01 , 720.88 , 52.03 ] }) # Creating the Dataframe of right side right = pd.DataFrame({ "time" : [ pd.Timestamp( "2020-03-25 13:30:00.023" ), pd.Timestamp( "2020-03-25 13:30:00.038" ), pd.Timestamp( "2020-03-25 13:30:00.048" ), pd.Timestamp( "2020-03-25 13:30:00.048" ), pd.Timestamp( "2020-03-25 13:30:00.048" ) ], "ticker" : [ "MSFT" , "MSFT" , "GOOG" , "GOOG" , "AAPL" ], "price" : [ 51.95 , 51.95 , 720.77 , 720.92 , 98.0 ], "quantity" : [ 75 , 155 , 100 , 100 , 100 ] }) # Applying merge_asof on data and store it # in a variable merged_dataframe = pd.merge_asof(left, right, on = "time" , by = "ticker" , tolerance = pd.Timedelta( "2ms" ), allow_exact_matches = False ) # print the variable print (merged_dataframe) |
Output :
Example 4: When the same DataFrame is used in both places. In this left Dataframe is used on both sides.
Python3
# Importing the required package import pandas as pd # Creating the DataFrame of left side left = pd.DataFrame({ "time" : [pd.Timestamp( "2020-03-25 13:30:00.023" ), pd.Timestamp( "2020-03-25 13:30:00.023" ), pd.Timestamp( "2020-03-25 13:30:00.030" ), pd.Timestamp( "2020-03-25 13:30:00.041" ), pd.Timestamp( "2020-03-25 13:30:00.048" ), pd.Timestamp( "2020-03-25 13:30:00.049" ), pd.Timestamp( "2020-03-25 13:30:00.072" ), pd.Timestamp( "2020-03-25 13:30:00.075" ) ], "ticker" : [ "GOOG" , "MSFT" , "MSFT" , "MSFT" , "GOOG" , "AAPL" , "GOOG" , "MSFT" ], "bid" : [ 720.50 , 51.95 , 51.97 , 51.99 , 720.50 , 97.99 , 720.50 , 52.01 ], "ask" : [ 720.93 , 51.96 , 51.98 , 52.00 , 720.93 , 98.01 , 720.88 , 52.03 ] }) # Creating the Dataframe of right side right = pd.DataFrame({ "time" : [ pd.Timestamp( "2020-03-25 13:30:00.023" ), pd.Timestamp( "2020-03-25 13:30:00.038" ), pd.Timestamp( "2020-03-25 13:30:00.048" ), pd.Timestamp( "2020-03-25 13:30:00.048" ), pd.Timestamp( "2020-03-25 13:30:00.048" ) ], "ticker" : [ "MSFT" , "MSFT" , "GOOG" , "GOOG" , "AAPL" ], "price" : [ 51.95 , 51.95 , 720.77 , 720.92 , 98.0 ], "quantity" : [ 75 , 155 , 100 , 100 , 100 ] }) # Applying merge_asof on data and store it # in a variable merged_dataframe = pd.merge_asof(left, left, on = "time" , by = "ticker" ) # print the variable print (merged_dataframe) |
Output :
It created the same data frame into 2 frames one is indicated as x and another is created as y i.e. bid_x, bid_y, ask_x, ask_y.