Random Forest for Time Series Forecasting

20 July 2024


This article was published as a part of the Data Science Blogathon

Introduction

Random Forest is a popular supervised machine learning algorithm. It is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the mode of the trees' classes (classification) or the mean of their predictions (regression), so it can be used for both classification and regression problems. It can also be used for time series forecasting, on both univariate and multivariate datasets, by manually creating lag variables and seasonal component variables.
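To make the "mean of the individual trees" idea concrete, here is a minimal sketch on synthetic data (not the jeans dataset used later). The arrays `X` and `y` are made up for illustration; the only library behaviour relied on is that a fitted `RandomForestRegressor` exposes its trees through `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=60)

forest = RandomForestRegressor(n_estimators=25, random_state=1).fit(X, y)

# Collect each tree's prediction, then average across the trees
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# The forest's own prediction should match the hand-computed average
print(np.allclose(manual_mean, forest.predict(X)))
```

The regression forecast used in the rest of this article is this kind of average, taken over trees that were each trained on a bootstrap sample of the rows.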

No algorithm works best for all datasets, so depending on the data you can try various algorithms and choose the best one. I have tried ARIMA, SARIMA, ETS, LSTM, Random Forest, XGBoost, and fbprophet for time series forecasting, and each of these algorithms worked best for one category of data or another. In my experience, Random Forest, XGBoost, and fbprophet performed best on multivariate and intermittent data.

Intermittent data:

Intermittent data has a very irregular pattern; demand data is a typical example. The series has a non-zero value in periods where there is demand and is zero where there is none. Intermittent demand data is usually customer demand or sales data for an item that is not sold in every period.
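As a rough way to see how intermittent a series is, the sketch below reports the share of zero periods and the average size of the non-zero demands. The helper name `intermittency_summary` is hypothetical, and the input is just the first six rows of the jeans data shown later in this article.

```python
import numpy as np

def intermittency_summary(sales):
    """Summarise how often a demand series is zero (hypothetical helper)."""
    sales = np.asarray(sales, dtype=float)
    nonzero = sales[sales != 0]
    return {
        # Fraction of periods with no demand at all
        "zero_share": float(np.mean(sales == 0)),
        # Average demand in the periods where something was sold
        "mean_nonzero_demand": float(nonzero.mean()) if nonzero.size else 0.0,
    }

summary = intermittency_summary([1683, 1321, 1447, 0, 86, 1165])
print(summary)
```

A high zero share combined with large non-zero values is the pattern described above; this article does not propose any particular threshold for calling a series intermittent.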

In this tutorial, you will learn how to develop a Random forest model for time series forecasting.

After completing this tutorial, you will know:

How to develop a Random Forest model for univariate/multivariate time series data.
How to limit the number of independent variables to a certain value.
How to forecast for multiple date points e.g. for the coming 4 months or 4 weeks.

Let’s get started.

Problem: Forecast demand for a jeans brand for the coming 6 months.

Data: We have monthly sales quantity available for 2 years (from May 2019 to May 2021) in the CSV file.

Import all required Packages

import pandas as pd
from sklearn.feature_selection import RFE

from sklearn.ensemble import RandomForestRegressor
from pandas import DataFrame
import numpy as np
from datetime import timedelta
import calendar
jeans_data=pd.read_csv('jeans_data.csv')
jeans_data.head()

date	SaleQty
2019-05-01	1683
2019-06-01	1321
2019-07-01	1447
2019-08-01	0
2019-09-01	86
2019-10-01	1165

Check if the data is stationary

from statsmodels.tsa.stattools import adfuller
# Augmented Dickey-Fuller test on the sales column; result[1] is the p-value
result = adfuller(jeans_data.SaleQty.dropna())
print('p-value: %f' % result[1])

p-value: 0.024419

Since the p-value is below 0.05, we reject the test's null hypothesis of a unit root and treat the data as stationary, so we can proceed without any transformation.

Create lag variables

dataframe = DataFrame()
for i in range(12, 0, -1):
   dataframe['t-' + str(i)] = jeans_data.SaleQty.shift(i)
final_data = pd.concat([jeans_data, dataframe], axis=1)
final_data.dropna(inplace=True)

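You can use any value in place of 12, depending on your time interval and the number of lags you want to create. A good starting point is one full year of lags, that is 12 for monthly data and 52 for weekly data; you can limit the number of independent variables later. Keep in mind that every lag costs one row at the start of the series, because rows with missing lags are dropped.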
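To see what the lag columns look like, here is a small sketch on a toy series with two lags instead of twelve. The helper `make_lags` and the toy values are illustrative assumptions; the shifting logic mirrors the loop above.

```python
import pandas as pd

def make_lags(series, n_lags):
    """Return the series plus n_lags shifted copies, dropping incomplete rows."""
    lagged = pd.DataFrame({'y': series})
    for i in range(n_lags, 0, -1):
        # Column 't-i' holds the value observed i periods earlier
        lagged['t-' + str(i)] = series.shift(i)
    return lagged.dropna()

toy = pd.Series([10, 20, 30, 40, 50])
lag_frame = make_lags(toy, n_lags=2)
print(lag_frame)
```

The first two rows disappear because they have no complete history, which is the row cost of each extra lag.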

Add seasonal variable

Create a variable that takes a different value for each month. This adds a seasonal component to the model, which may help improve the forecast.

final_data['date'] = pd.to_datetime(final_data['date'], format='%Y-%m-%d')
final_data['month'] = final_data['date'].dt.month

Or we can add dummy variables for each month:

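# The prefix gives string column names (month_1, month_2, ...), which avoids mixing integer and string names
dummy = pd.get_dummies(final_data['month'], prefix='month')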
final_data = pd.concat([final_data, dummy], axis=1)
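The two seasonal encodings can be compared on a toy frame. This is a sketch with made-up dates; `prefix='month'` is used so that all column names are strings.

```python
import pandas as pd

dates = pd.to_datetime(['2021-01-01', '2021-02-01', '2021-03-01', '2022-01-01'])
toy = pd.DataFrame({'date': dates})

# Option 1: a single numeric month column (1 to 12)
toy['month'] = toy['date'].dt.month

# Option 2: one indicator column per month that appears in the data
month_dummies = pd.get_dummies(toy['month'], prefix='month')
toy = pd.concat([toy, month_dummies], axis=1)
print(toy)
```

Note that dummy columns are created only for months present in the data, so future rows must be encoded with the same set of columns that the model was trained on.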

Train the model:

We will take the most recent 6 months of data as the test dataset and the rest of the data as the training dataset.

finaldf = final_data.drop(['date'], axis=1)
finaldf = finaldf.reset_index(drop=True)
test_length=6
end_point = len(finaldf)
x = end_point - test_length
finaldf_train = finaldf.loc[:x - 1, :]
finaldf_test = finaldf.loc[x:, :]
finaldf_test_x = finaldf_test.loc[:, finaldf_test.columns != 'SaleQty']
finaldf_test_y = finaldf_test['SaleQty']
finaldf_train_x = finaldf_train.loc[:, finaldf_train.columns != 'SaleQty']
finaldf_train_y = finaldf_train['SaleQty']
print("Starting model train..")
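# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=1), n_features_to_select=4)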
fit = rfe.fit(finaldf_train_x, finaldf_train_y)
y_pred = fit.predict(finaldf_test_x)

I have used RFE (recursive feature elimination) to limit the number of independent variables/features to 4. You can change this value and choose the one that gives the least error. I have set n_estimators (the number of trees in the forest) to 100, which is the default value.
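One way to "choose the value that gives the least error" is to loop over a few candidate feature counts and compare the error on held-out rows. The sketch below uses synthetic data and a mean absolute error; the candidate values (2, 4, 6) and the 60/20 split are arbitrary assumptions, not settings from this article.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic data: only the first two of six features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=80)

# Keep the split ordered: earlier rows train, later rows validate
X_train, X_val = X[:60], X[60:]
y_train, y_val = y[:60], y[60:]

errors = {}
for k in (2, 4, 6):
    selector = RFE(RandomForestRegressor(n_estimators=50, random_state=1),
                   n_features_to_select=k)
    selector.fit(X_train, y_train)
    # Mean absolute error on the held-out rows for this feature count
    errors[k] = float(np.mean(np.abs(y_val - selector.predict(X_val))))

best_k = min(errors, key=errors.get)
print(errors, best_k)
```

With only a handful of monthly rows, as in the jeans data, the error estimates for each candidate will be noisy, so treat the chosen value as a rough guide.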

Evaluating the Algorithm:

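y_true = np.array(finaldf_test_y)
sumvalue=np.sum(y_true)
# Total absolute error as a percentage of total actual sales (a weighted MAPE, so zero-sale months do not cause a division by zero)
mape=np.sum(np.abs((y_true - y_pred)))/sumvalue*100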
accuracy=100-mape
print('Accuracy:', round(accuracy,2),'%.')

Accuracy: 89.42 %.
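The metric above can be wrapped in a small function and checked by hand. This is a sketch; the name `wape_accuracy` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def wape_accuracy(y_true, y_pred):
    """100 minus the total absolute error as a percentage of total actual sales."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    total = y_true.sum()
    if total == 0:
        raise ValueError("total actual sales are zero, so the metric is undefined")
    wape = np.abs(y_true - y_pred).sum() / total * 100
    return 100 - wape

# Absolute errors are 10 + 10 + 10 = 30 against total sales of 150, i.e. 20% error
acc = wape_accuracy([100, 0, 50], [90, 10, 60])
print(round(acc, 2))
```

Because the errors are divided by the total rather than month by month, a month with zero actual sales still contributes its error without breaking the calculation.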

Predict for Future:

We will predict the sale quantity for the next 6 months. The lags are null for future dates, so we have to predict one month at a time and use each predicted sale to create the lags for the next month's prediction, and so on. Please note that we use the predicted sale only to create the lag variables; we do not train the model again.

def create_lag(df3):
    # Build the 12 lag columns (t-12 ... t-1) and drop rows with incomplete lags
    dataframe = DataFrame()
    for i in range(12, 0, -1):
        dataframe['t-' + str(i)] = df3.SaleQty.shift(i)
    df4 = pd.concat([df3, dataframe], axis=1)
    df4.dropna(inplace=True)
    return df4

yhat = []
n = 6
future_dataframe = jeans_data.copy()
future_dataframe['date'] = pd.to_datetime(future_dataframe['date'], format='%Y-%m-%d')
last_date = future_dataframe['date'].iloc[-1]
# Append one placeholder row per future month, with SaleQty set to 0 for now
future_rows = DataFrame({
    'date': [last_date + pd.DateOffset(months=i + 1) for i in range(n)],
    'SaleQty': 0.0,
})
future_dataframe = pd.concat([future_dataframe, future_rows], ignore_index=True)
future_dataframe['month'] = future_dataframe['date'].dt.month
future_dataframe = future_dataframe.drop(['date'], axis=1)
future_dataframe_end = len(future_dataframe)
for i in range(n, 0, -1):
    row = future_dataframe_end - i
    # Rebuild the lags so they include the predictions made so far
    lagged = create_lag(future_dataframe)
    # Use the training column order (assumes 'month' is the only seasonal feature)
    pred_set = lagged.loc[[row], finaldf_train_x.columns]
    pred = fit.predict(pred_set)
    future_dataframe.at[row, 'SaleQty'] = pred[0]
    yhat.append(pred[0])
predicted_value = np.array(yhat)
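Stripped of the DataFrame bookkeeping, the one-step-at-a-time idea looks like the sketch below. The function `recursive_forecast` and the stand-in model (the mean of the last three values) are hypothetical; in the article's setting `predict_one` would be the fitted random forest.

```python
import numpy as np

def recursive_forecast(history, predict_one, n_lags, steps):
    """Forecast `steps` values, feeding each prediction back in as a lag."""
    values = list(history)
    forecasts = []
    for _ in range(steps):
        # The most recent n_lags values (actual or predicted) are the features
        lags = np.array(values[-n_lags:], dtype=float)
        next_value = float(predict_one(lags))
        forecasts.append(next_value)
        # The prediction becomes part of the history; the model is not retrained
        values.append(next_value)
    return forecasts

def mean_of_lags(lags):
    # Stand-in for a trained model, used only to make the recursion visible
    return lags.mean()

forecasts = recursive_forecast([1, 2, 3], mean_of_lags, n_lags=3, steps=3)
print(forecasts)
```

Since later forecasts are built on earlier ones, errors can accumulate as the horizon grows.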

You can add any other independent variables available like promotions, special_days, weekends, start_of_month, etc.
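Calendar-based variables of this kind can usually be derived from the date column. The sketch below is an illustration on toy dates; the column names `is_weekend`, `start_of_month`, and `promotion`, and the promotion date itself, are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'date': pd.to_datetime(['2021-05-01', '2021-05-03', '2021-05-09'])})

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
toy['is_weekend'] = (toy['date'].dt.dayofweek >= 5).astype(int)

# Flag rows that fall on the first day of a month
toy['start_of_month'] = toy['date'].dt.is_month_start.astype(int)

# Promotions are not derivable from the calendar; a made-up list stands in here
promo_dates = pd.to_datetime(['2021-05-03'])
toy['promotion'] = toy['date'].isin(promo_dates).astype(int)
print(toy)
```

For forecasting, such variables are only useful if their future values are known in advance, as they are for calendar flags and planned promotions.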

Find below the complete code:

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from pandas import DataFrame
import numpy as np
import calendar
from datetime import timedelta
def add_month(df, forecast_length, forecast_period):
    end_point = len(df)
    # Add forecast_length empty rows (DataFrame.append was removed in pandas 2.0)
    df = df.reset_index(drop=True)
    df = df.reindex(range(end_point + forecast_length))
    x = df.at[end_point - 1, 'date']
    x = pd.to_datetime(x, format='%Y-%m-%d')
    days_in_month=calendar.monthrange(x.year, x.month)[1]
    if forecast_period == 'Week':
        for i in range(forecast_length):
            df.at[df.index[end_point + i], 'date'] = x + timedelta(days=7 + 7 * i)
            df.at[df.index[end_point + i], 'SaleQty'] = 0
    elif forecast_period == 'Month':
        for i in range(forecast_length):
            # Step by calendar months; a fixed number of days can skip or repeat a month
            df.at[df.index[end_point + i], 'date'] = x + pd.DateOffset(months=i + 1)
            df.at[df.index[end_point + i], 'SaleQty'] = 0
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
    df['month'] = df['date'].dt.month
    df = df.drop(['date'], axis=1)
    return df
def create_lag(df3):
    dataframe = DataFrame()
    for i in range(12, 0, -1):
        dataframe['t-' + str(i)] = df3.SaleQty.shift(i)
    df4 = pd.concat([df3, dataframe], axis=1)
    df4.dropna(inplace=True)
    return df4
def randomForest(df1, forecast_length, forecast_period):
    df3 = df1[['SaleQty', 'date']]
    df3 = add_month(df3, forecast_length, forecast_period)
    finaldf = create_lag(df3)
    finaldf = finaldf.reset_index(drop=True)
    n = forecast_length
    end_point = len(finaldf)
    x = end_point - n
    finaldf_train = finaldf.loc[:x - 1, :]
    finaldf_train_x = finaldf_train.loc[:, finaldf_train.columns != 'SaleQty']
    finaldf_train_y = finaldf_train['SaleQty']
    print("Starting model train..")
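    # n_features_to_select must be passed by keyword in recent scikit-learn versions
    rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=1), n_features_to_select=4)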
    fit = rfe.fit(finaldf_train_x, finaldf_train_y)
    print("Model train completed..")
    print("Creating forecasted set..")
    yhat = []
    end_point = len(finaldf)
    n = forecast_length
    df3_end = len(df3)
    for i in range(n, 0, -1):
        y = end_point - i
        inputfile = finaldf.loc[y:end_point, :]
        inputfile_x = inputfile.loc[:, inputfile.columns != 'SaleQty']
        pred_set = inputfile_x.head(1)
        pred = fit.predict(pred_set)
        df3.at[df3.index[df3_end - i], 'SaleQty'] = pred[0]
        finaldf = create_lag(df3)
        finaldf = finaldf.reset_index(drop=True)
        yhat.append(pred)
    yhat = np.array(yhat)
    print("Forecast complete..")
    return yhat
predicted_value=randomForest(jeans_data, 6, 'Month')

Random forest is an ensemble learning method that bootstraps the observations, so the training rows are sampled randomly and their original order is not preserved. For that reason it might not perform well on many time series, but it does perform well on intermittent data because it captures the probability of a sale for a product that often sells nothing.
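The effect of bootstrapping on row order can be seen with a toy intermittent series. In this sketch (made-up values, a single lag), rows are resampled with replacement much as each tree's training sample would be; the time information that survives is whatever the lag columns carry inside each row.

```python
import pandas as pd

series = pd.Series([5, 0, 7, 0, 0, 9, 4, 0])
lagged = pd.DataFrame({'y': series, 't-1': series.shift(1)}).dropna()

# Sample rows with replacement: order is lost and some rows repeat
boot = lagged.sample(n=len(lagged), replace=True, random_state=1)

# Each sampled row still pairs its target with the value that preceded it
intact = all(row['t-1'] == series[idx - 1] for idx, row in boot.iterrows())
print(boot)
print(intact)
```

This is why the lag and seasonal variables matter so much in this approach: the forest never sees the sequence itself, only the history that has been written into each row.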

Please let me know your queries and suggestions if any.



The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
