Wednesday, December 25, 2024
Google search engine
HomeLanguagesMerge two Pandas DataFrames based on closest DateTime

Merge two Pandas DataFrames based on closest DateTime

In this article, we will discuss how to merge Pandas DataFrame based on the closest DateTime. To learn how to merge DataFrames first you have to learn that how to create a DataFrame for that you have to refer to the article Creating a Pandas DataFrame. After creating DataFrames need to merge them and to merge the Dataframe there’s a function named merge_asof() when it comes to writing this then it can be written as:

pandas.merge_asof(left, right, on=None, left_on=None, right_on=None, left_index=False, right_index=False, by=None, left_by=None, right_by=None, suffixes=(‘_x’, ‘_y’), tolerance=None, allow_exact_matches=True, direction=’backward’)

Note:

Step-by-step Approach

Step 1: Import pandas library

To complete this task we have to import the library named Pandas.

import pandas as pd

Step 2: Create the Dataframe

In this step, we have to create DataFrames using the function “pd.DataFrame()”. In this, we created 2 data frames one is named left and another is named right because our last goal is to merge 2 data frames based on the closest DateTime. It can be written as:

left = pd.DataFrame( {

       “time”: [pd.Timestamp(“2020-03-25 13:30:00.023”),

           pd.Timestamp(“2020-03-25 13:30:00.023”),

           pd.Timestamp(“2020-03-25 13:30:00.030”),

           pd.Timestamp(“2020-03-25 13:30:00.041”),

           pd.Timestamp(“2020-03-25 13:30:00.048”),

           pd.Timestamp(“2020-03-25 13:30:00.049”),

           pd.Timestamp(“2020-03-25 13:30:00.072”),

           pd.Timestamp(“2020-03-25 13:30:00.075”)

       ],

       “ticker”: [“GOOG”,”MSFT”,”MSFT”,”MSFT”,”GOOG”,”AAPL”,”GOOG”,”MSFT”],

          “bid”: [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],

          “ask”: [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03]

   })

right = pd.DataFrame( {

          “time”: [

              pd.Timestamp(“2020-03-25 13:30:00.023”),

              pd.Timestamp(“2020-03-25 13:30:00.038”),

              pd.Timestamp(“2020-03-25 13:30:00.048”),

              pd.Timestamp(“2020-03-25 13:30:00.048”),

              pd.Timestamp(“2020-03-25 13:30:00.048”)

          ],

          “ticker”: [“MSFT”, “MSFT”, “GOOG”, “GOOG”, “AAPL”],

          “price”: [51.95, 51.95, 720.77, 720.92, 98.0],

          “quantity”: [75, 155, 100, 100, 100]

      })

Step 3: Merge the Dataframes and print them

In this step, the data frames are going to be merged using the function “pd.merge_asof()”. The result of the merge_asof() function is stored in a variable and then the variable is printed by using “print()”.

Python3




# Importing the required package
import pandas as pd
 
# Creating the DataFrame of left side
left = pd.DataFrame({
   
    "time": [pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.030"),
             pd.Timestamp("2020-03-25 13:30:00.041"),
             pd.Timestamp("2020-03-25 13:30:00.048"),
             pd.Timestamp("2020-03-25 13:30:00.049"),
             pd.Timestamp("2020-03-25 13:30:00.072"),
             pd.Timestamp("2020-03-25 13:30:00.075")
             ],
   
    "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG",
               "AAPL", "GOOG", "MSFT"],
   
    "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99,
            720.50, 52.01],
   
    "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01,
            720.88, 52.03]
})
 
# Creating the Dataframe of right side
right = pd.DataFrame({
    "time": [
        pd.Timestamp("2020-03-25 13:30:00.023"),
        pd.Timestamp("2020-03-25 13:30:00.038"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048")
    ],
    "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
   
    "price": [51.95, 51.95, 720.77, 720.92, 98.0],
   
    "quantity": [75, 155, 100, 100, 100]
})
 
# Applying merge_asof on data and store it
# in a variable
merged_dataframe = pd.merge_asof(right, left, on="time",
                                 by="ticker")
 
# print the variable
print(merged_dataframe)


Output : 

Example 1: Now we change the position of the left and right Dataframe in the merge_asof function.

Python3




# Importing the required package
import pandas as pd
# Creating the DataFrame of left side
left = pd.DataFrame({
    "time": [pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.030"),
             pd.Timestamp("2020-03-25 13:30:00.041"),
             pd.Timestamp("2020-03-25 13:30:00.048"),
             pd.Timestamp("2020-03-25 13:30:00.049"),
             pd.Timestamp("2020-03-25 13:30:00.072"),
             pd.Timestamp("2020-03-25 13:30:00.075")
             ],
    "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG",
               "AAPL", "GOOG", "MSFT"],
   
    "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99,
            720.50, 52.01],
   
    "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01,
            720.88, 52.03]
})
 
# Creating the Dataframe of right side
right = pd.DataFrame({
    "time": [
        pd.Timestamp("2020-03-25 13:30:00.023"),
        pd.Timestamp("2020-03-25 13:30:00.038"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048")
    ],
    "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
   
    "price": [51.95, 51.95, 720.77, 720.92, 98.0],
   
    "quantity": [75, 155, 100, 100, 100]
})
 
# Applying merge_asof on data and store it
# in a variable
merged_dataframe = pd.merge_asof(left, right, on="time",
                                 by="ticker")
 
# print the variable
print(merged_dataframe)


Output: 

Note: So, it is clear from our 2 outputs that when we put right DataFrame in the first place then the number of rows in output is 5 equals to the number of rows in right DataFrame and when the left DataFrame is put on the first place then the number of rows in output is equal to the number of rows in left DataFrame. If we look at both outputs and compare them then we can easily say that the merge_asof() is similar to the left-join except that we match on the nearest key rather than equal keys.

Example 2: We only asof within 2ms between the quoted time and the trade time.

Python3




# Importing the required package
import pandas as pd
 
# Creating the DataFrame of left side
left = pd.DataFrame({
    "time": [pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.030"),
             pd.Timestamp("2020-03-25 13:30:00.041"),
             pd.Timestamp("2020-03-25 13:30:00.048"),
             pd.Timestamp("2020-03-25 13:30:00.049"),
             pd.Timestamp("2020-03-25 13:30:00.072"),
             pd.Timestamp("2020-03-25 13:30:00.075")
             ],
    "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG",
               "AAPL", "GOOG", "MSFT"],
   
    "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99,
            720.50, 52.01],
   
    "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01,
            720.88, 52.03]
})
 
# Creating the Dataframe of right side
right = pd.DataFrame({
    "time": [
        pd.Timestamp("2020-03-25 13:30:00.023"),
        pd.Timestamp("2020-03-25 13:30:00.038"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048")
    ],
    "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
   
    "price": [51.95, 51.95, 720.77, 720.92, 98.0],
   
    "quantity": [75, 155, 100, 100, 100]
})
 
# Applying merge_asof on data and store it
# in a variable
merged_dataframe = pd.merge_asof(left, right, on="time", by="ticker",
                                 tolerance=pd.Timedelta("2ms"))
 
# print the variable
print(merged_dataframe)


Output : 

Example 3: We only asof within 10ms between the quoted time and the trade time, and we exclude exact matches on time. However, prior data will propagate forward.

Python3




# Importing the required package
import pandas as pd
 
# Creating the DataFrame of left side
left = pd.DataFrame({
    "time": [pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.030"),
             pd.Timestamp("2020-03-25 13:30:00.041"),
             pd.Timestamp("2020-03-25 13:30:00.048"),
             pd.Timestamp("2020-03-25 13:30:00.049"),
             pd.Timestamp("2020-03-25 13:30:00.072"),
             pd.Timestamp("2020-03-25 13:30:00.075")
             ],
    "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG",
               "AAPL", "GOOG", "MSFT"],
   
    "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99,
            720.50, 52.01],
   
    "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01,
            720.88, 52.03]
})
 
# Creating the Dataframe of right side
right = pd.DataFrame({
    "time": [
        pd.Timestamp("2020-03-25 13:30:00.023"),
        pd.Timestamp("2020-03-25 13:30:00.038"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048")
    ],
   
    "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
   
    "price": [51.95, 51.95, 720.77, 720.92, 98.0],
   
    "quantity": [75, 155, 100, 100, 100]
})
 
# Applying merge_asof on data and store it
# in a variable
merged_dataframe = pd.merge_asof(left, right, on="time", by="ticker",
                                 tolerance=pd.Timedelta("2ms"),
                                 allow_exact_matches=False)
 
# print the variable
print(merged_dataframe)


Output : 

Example 4: When the same DataFrame is used in both places. In this left Dataframe is used on both sides.

Python3




# Importing the required package
import pandas as pd
 
# Creating the DataFrame of left side
left = pd.DataFrame({
    "time": [pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.023"),
             pd.Timestamp("2020-03-25 13:30:00.030"),
             pd.Timestamp("2020-03-25 13:30:00.041"),
             pd.Timestamp("2020-03-25 13:30:00.048"),
             pd.Timestamp("2020-03-25 13:30:00.049"),
             pd.Timestamp("2020-03-25 13:30:00.072"),
             pd.Timestamp("2020-03-25 13:30:00.075")
             ],
   
    "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG",
               "AAPL", "GOOG", "MSFT"],
   
    "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99,
            720.50, 52.01],
   
    "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01,
            720.88, 52.03]
})
 
# Creating the Dataframe of right side
right = pd.DataFrame({
    "time": [
        pd.Timestamp("2020-03-25 13:30:00.023"),
        pd.Timestamp("2020-03-25 13:30:00.038"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048"),
        pd.Timestamp("2020-03-25 13:30:00.048")
    ],
   
    "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
   
    "price": [51.95, 51.95, 720.77, 720.92, 98.0],
   
    "quantity": [75, 155, 100, 100, 100]
})
 
# Applying merge_asof on data and store it
# in a variable
merged_dataframe = pd.merge_asof(left, left, on="time",
                                 by="ticker")
 
# print the variable
print(merged_dataframe)


Output :  

It created the same data frame into 2 frames one is indicated as x and another is created as y i.e. bid_x, bid_y, ask_x, ask_y.

RELATED ARTICLES

Most Popular

Recent Comments