
Random Forest for Time Series Forecasting

This article was published as a part of the Data Science Blogathon

Introduction

Random Forest is a popular supervised machine learning algorithm. It is an ensemble method that builds many decision trees at training time and outputs the mode of the individual trees' predicted classes (classification) or the mean of their predictions (regression), so it can be used for both classification and regression problems. It can also be used for time series forecasting on both univariate and multivariate datasets, provided you create the lag variables and seasonal component variables manually.

No algorithm works best for all datasets, so depending on the data you can try several algorithms and choose the one that performs best. I have tried ARIMA, SARIMA, ETS, LSTM, Random Forest, XGBoost, and fbprophet for time series forecasting, and each of these algorithms worked best for one category of data or another. In my experience, Random Forest, XGBoost, and fbprophet performed best on multivariate and intermittent data.

Intermittent data:

Intermittent data is data with a sporadic, hard-to-predict pattern; demand data is a typical example. The series has a non-zero value in periods where there is demand and is zero in periods where there is none. Intermittent demand data usually comes from customer demand or sales records for an item that is not sold in every period.
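
To make the idea concrete, the short sketch below measures how intermittent a series is. The helper name intermittency_summary and the sample values are illustrative assumptions, not part of the jeans dataset; the average interval computed here is simply the number of periods divided by the number of non-zero periods.

```python
def intermittency_summary(demand):
    """Return the share of zero-demand periods and the average
    number of periods per non-zero demand (hypothetical helper)."""
    n_periods = len(demand)
    n_nonzero = sum(1 for d in demand if d != 0)
    zero_share = (n_periods - n_nonzero) / n_periods
    # Average inter-demand interval: periods per demand occurrence
    avg_interval = n_periods / n_nonzero if n_nonzero else float('inf')
    return zero_share, avg_interval

# Illustrative values only
sample_demand = [5, 0, 0, 3, 0, 7, 0, 0, 0, 2]
zero_share, avg_interval = intermittency_summary(sample_demand)
print(zero_share, avg_interval)
```

A high share of zeros together with a long average interval suggests the series is intermittent.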

In this tutorial, you will learn how to develop a Random forest model for time series forecasting.

After completing this tutorial, you will know:

  • How to develop a Random Forest model for univariate/multivariate time series data.
  • How to limit the number of independent variables to a certain value.
  • How to forecast for multiple date points e.g. for the coming 4 months or 4 weeks.

Let’s get started.

Problem: Forecast demand for a jeans brand for the coming 6 months.

Data: We have monthly sales quantity available for 2 years (from May 2019 to May 2021) in the CSV file.

Import all required Packages

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from pandas import DataFrame
import numpy as np
from datetime import timedelta
import calendar
jeans_data = pd.read_csv('jeans_data.csv')
jeans_data.head()
date SaleQty
2019-05-01 1683
2019-06-01 1321
2019-07-01 1447
2019-08-01 0
2019-09-01 86
2019-10-01 1165

 

 

Check if the data is stationary

from statsmodels.tsa.stattools import adfuller
# Augmented Dickey-Fuller test on the sales series
result = adfuller(jeans_data.SaleQty.dropna())
print('p-value: %f' % result[1])

p-value: 0.024419

Since the p-value is below 0.05, we can reject the null hypothesis of a unit root and treat the data as stationary, so we can proceed without any transformation.
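
If the p-value had come out above 0.05, a common next step is to difference the series and run the test again. The sketch below shows first-order differencing with pandas on made-up values; the series named sales is an assumption for illustration only, and the differenced series would be passed to adfuller in the same way as the original one.

```python
import pandas as pd

# Made-up monthly values with an upward trend (illustration only)
sales = pd.Series([100, 120, 145, 160, 190, 205])

# First-order differencing: each value minus the previous one
sales_diff = sales.diff().dropna()
print(sales_diff.tolist())
```

Differencing costs one observation per order, which matters when the series is as short as the one used here.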

 

Create lag variables

dataframe = DataFrame()
for i in range(12, 0, -1):
   dataframe['t-' + str(i)] = jeans_data.SaleQty.shift(i)
final_data = pd.concat([jeans_data, dataframe], axis=1)
final_data.dropna(inplace=True)

You can use any value in place of 12, depending on your time interval and the number of lags you want to create. A full year of lags is a reasonable starting point, which means 12 for monthly data and about 52 for weekly data; the number of independent variables can be limited later.
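
The effect of the lag window on the usable sample is easy to check on a toy series. The sketch below uses a hypothetical helper, make_lags, that wraps the same shift logic as the loop above; with 6 observations and 2 lags, the first 2 rows have no complete history and are dropped. By the same arithmetic, 25 monthly observations with 12 lags leave 13 rows for training and testing.

```python
import pandas as pd

def make_lags(series, n_lags):
    """Return a frame with the target and columns t-n_lags ... t-1
    (hypothetical helper mirroring the shift-based loop)."""
    frame = pd.DataFrame({'y': series})
    for i in range(n_lags, 0, -1):
        frame['t-' + str(i)] = series.shift(i)
    # Rows without a full lag history contain NaN and are removed
    return frame.dropna()

toy = pd.Series([10, 20, 30, 40, 50, 60])
lagged = make_lags(toy, n_lags=2)
print(lagged)
```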

 

Add seasonal variable

Create a variable that takes a different value for each month. This adds a seasonal component to the model, which may help improve the forecast.

final_data['date'] = pd.to_datetime(final_data['date'], format='%Y-%m-%d')
final_data['month'] = final_data['date'].dt.month

Or we can add dummy variables for each month:

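# prefix gives string column names (month_1, month_2, ...), which scikit-learn expects
dummy = pd.get_dummies(final_data['month'], prefix='month')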
final_data = pd.concat([final_data, dummy], axis=1)
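
As a quick illustration of the two encodings, the sketch below builds both on three made-up dates. The prefix='month' argument is used so that the dummy columns get string names such as month_5; the dates and the frame name toy are assumptions for the example only.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2019-05-01', '2019-06-01', '2019-07-01']))
toy = pd.DataFrame({'date': dates})

# Option 1: a single numeric month column
toy['month'] = toy['date'].dt.month

# Option 2: one indicator column per month present in the data
dummies = pd.get_dummies(toy['month'], prefix='month')
toy = pd.concat([toy, dummies], axis=1)
print(toy.columns.tolist())
```

The single month column keeps the feature count low, while the dummies let the trees treat each month independently at the cost of up to 12 extra columns.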

Train the model:

We will take the most recent 6 months of data as the test dataset and the rest as the training dataset.

finaldf = final_data.drop(['date'], axis=1)
finaldf = finaldf.reset_index(drop=True)
test_length=6
end_point = len(finaldf)
x = end_point - test_length
finaldf_train = finaldf.loc[:x - 1, :]
finaldf_test = finaldf.loc[x:, :]
finaldf_test_x = finaldf_test.loc[:, finaldf_test.columns != 'SaleQty']
finaldf_test_y = finaldf_test['SaleQty']
finaldf_train_x = finaldf_train.loc[:, finaldf_train.columns != 'SaleQty']
finaldf_train_y = finaldf_train['SaleQty']
print("Starting model train..")
rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=1), n_features_to_select=4)
fit = rfe.fit(finaldf_train_x, finaldf_train_y)
y_pred = fit.predict(finaldf_test_x)

I have used RFE (recursive feature elimination) to limit the number of independent variables/features to 4. You can change this value and choose the one that gives the least error. I have set n_estimators (the number of trees in the forest) to 100, which is the default value.
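
To see which features RFE actually kept, the fitted selector exposes support_ (a boolean mask) and ranking_ (1 for selected features). The sketch below runs on synthetic data in which only two columns drive the target; the column names and data are assumptions for illustration, and a smaller forest is used to keep it quick.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
# Synthetic features: six columns, only 'f0' and 'f1' influence y
X = pd.DataFrame(rng.normal(size=(80, 6)),
                 columns=['f' + str(i) for i in range(6)])
y = 3 * X['f0'] - 2 * X['f1'] + rng.normal(scale=0.1, size=80)

selector = RFE(RandomForestRegressor(n_estimators=20, random_state=1),
               n_features_to_select=4)
selector.fit(X, y)

# Names of the features that survived elimination
selected = X.columns[selector.support_].tolist()
print(selected)
print(dict(zip(X.columns, selector.ranking_.tolist())))
```

Printing the selected names for the real lag features is a simple way to check whether the model leans on recent lags, the same month last year, or the month variable.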

 

Evaluating the Algorithm:

y_true = np.array(finaldf_test_y)
sumvalue = np.sum(y_true)
# Total absolute error as a percentage of total actual sales
mape = np.sum(np.abs(y_true - y_pred)) / sumvalue * 100
accuracy = 100 - mape
print('Accuracy:', round(accuracy, 2), '%.')

Accuracy: 89.42 %.
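
The error computed above divides the total absolute error by the total actual sales. This is often called the weighted absolute percentage error (WAPE), as opposed to the classic MAPE, which averages per-point percentage errors. The distinction matters for intermittent data, where actual values of zero make the per-point ratio undefined. A small sketch with made-up numbers (the function name wape and the values are assumptions for illustration):

```python
import numpy as np

def wape(y_true, y_pred):
    """Total absolute error as a percentage of total actuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sum(np.abs(y_true - y_pred)) / np.sum(y_true) * 100

# Made-up values; the zero actual would break a per-point MAPE
actual = [100, 0, 50]
forecast = [90, 10, 60]
error = wape(actual, forecast)
print(round(error, 2), round(100 - error, 2))
```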

Predict for Future:

We will predict the sale quantity for the next 6 months. The lags are not available for future dates, so we have to predict one month at a time and use each predicted sale to create the lags for the following month's prediction, and so on. Please note that we use the predicted sale only to create the lag variables; we do not train the model again.

def create_lag(df3):
    # Build the 12 lag columns t-12 ... t-1 from SaleQty
    dataframe = DataFrame()
    for i in range(12, 0, -1):
        dataframe['t-' + str(i)] = df3.SaleQty.shift(i)
    df4 = pd.concat([df3, dataframe], axis=1)
    df4.dropna(inplace=True)
    return df4

yhat = []
n = 6
future_dataframe = jeans_data.copy()
future_dataframe['date'] = pd.to_datetime(future_dataframe['date'], format='%Y-%m-%d')
last_date = future_dataframe['date'].iloc[-1]
# Append n placeholder rows (SaleQty = 0) for the months to be forecast
future_rows = DataFrame({'date': [last_date + pd.DateOffset(months=i + 1) for i in range(n)],
                         'SaleQty': 0.0})
future_dataframe = pd.concat([future_dataframe, future_rows], ignore_index=True)
future_dataframe['month'] = future_dataframe['date'].dt.month
future_dataframe = future_dataframe.drop(['date'], axis=1)
future_dataframe_end = len(future_dataframe)
finaldf = create_lag(future_dataframe).reset_index(drop=True)
end_point = len(finaldf)
for i in range(n, 0, -1):
    y = end_point - i
    # Same columns, in the same order, as used in training
    # (assumes the model was trained with the 'month' column, not the dummies)
    pred_set = finaldf.loc[[y], finaldf_train_x.columns]
    pred = fit.predict(pred_set)
    # Write the prediction back so it feeds the lags of the following months
    future_dataframe.at[future_dataframe.index[future_dataframe_end - i], 'SaleQty'] = pred[0]
    finaldf = create_lag(future_dataframe).reset_index(drop=True)
    yhat.append(pred)
predicted_value = np.array(yhat)
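
Stripped of the DataFrame bookkeeping, the loop above is a recursive one-step-ahead forecast: predict one period, append the prediction to the history, rebuild the lags, and repeat. The sketch below shows that pattern in isolation. The function recursive_forecast and the stand-in model (the mean of the lag window) are assumptions for illustration; in this tutorial the fitted random forest plays the role of predict_one.

```python
def recursive_forecast(history, predict_one, n_lags, steps):
    """Forecast `steps` periods ahead, feeding each prediction
    back in as the newest lag (the model is never refitted)."""
    values = list(history)
    forecasts = []
    for _ in range(steps):
        lags = values[-n_lags:]          # most recent n_lags values
        next_value = predict_one(lags)   # one-step-ahead prediction
        forecasts.append(next_value)
        values.append(next_value)        # becomes a lag for the next step
    return forecasts

# Stand-in model: average of the lag window (illustration only)
def mean_model(lags):
    return sum(lags) / len(lags)

forecasts = recursive_forecast([2, 4, 6], mean_model, n_lags=3, steps=2)
print(forecasts)
```

Because each step consumes earlier predictions, errors can compound as the horizon grows, which is worth keeping in mind when forecasting many periods ahead.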

You can add any other independent variables available like promotions, special_days, weekends, start_of_month, etc.
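
As a sketch of how such variables could be derived from a date column with pandas: the daily dates, the column names, and the promo_dates list below are all hypothetical, and real promotion data would come from your own calendar.

```python
import pandas as pd

# Hypothetical daily data for illustration
df = pd.DataFrame({'date': pd.date_range('2021-05-01', periods=5, freq='D')})
promo_dates = [pd.Timestamp('2021-05-03')]  # assumed promotion calendar

df['start_of_month'] = df['date'].dt.is_month_start.astype(int)
df['weekend'] = (df['date'].dt.dayofweek >= 5).astype(int)  # Sat=5, Sun=6
df['promotion'] = df['date'].isin(promo_dates).astype(int)
print(df)
```

For future dates these variables must be known in advance (calendars and planned promotions are), otherwise they cannot be filled in at prediction time.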

Find below the complete code:

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from pandas import DataFrame
import numpy as np
import calendar
from datetime import timedelta
def add_month(df, forecast_length, forecast_period):
    # Append forecast_length placeholder rows (SaleQty = 0) dated after the last observation
    df = df.copy()
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
    x = df['date'].iloc[-1]
    if forecast_period == 'Week':
        future_dates = [x + timedelta(days=7 * (i + 1)) for i in range(forecast_length)]
    elif forecast_period == 'Month':
        # DateOffset steps by calendar month, so the dates do not drift
        future_dates = [x + pd.DateOffset(months=i + 1) for i in range(forecast_length)]
    else:
        raise ValueError("forecast_period must be 'Week' or 'Month'")
    df1 = DataFrame({'SaleQty': 0.0, 'date': future_dates})
    # DataFrame.append was removed in pandas 2.0; use concat instead
    df = pd.concat([df, df1], ignore_index=True)
    df['month'] = df['date'].dt.month
    df = df.drop(['date'], axis=1)
    return df
def create_lag(df3):
    dataframe = DataFrame()
    for i in range(12, 0, -1):
        dataframe['t-' + str(i)] = df3.SaleQty.shift(i)
    df4 = pd.concat([df3, dataframe], axis=1)
    df4.dropna(inplace=True)
    return df4
def randomForest(df1, forecast_length, forecast_period):
    df3 = df1[['SaleQty', 'date']]
    df3 = add_month(df3, forecast_length, forecast_period)
    finaldf = create_lag(df3)
    finaldf = finaldf.reset_index(drop=True)
    n = forecast_length
    end_point = len(finaldf)
    x = end_point - n
    finaldf_train = finaldf.loc[:x - 1, :]
    finaldf_train_x = finaldf_train.loc[:, finaldf_train.columns != 'SaleQty']
    finaldf_train_y = finaldf_train['SaleQty']
    print("Starting model train..")
    rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=1), n_features_to_select=4)
    fit = rfe.fit(finaldf_train_x, finaldf_train_y)
    print("Model train completed..")
    print("Creating forecasted set..")
    yhat = []
    end_point = len(finaldf)
    n = forecast_length
    df3_end = len(df3)
    for i in range(n, 0, -1):
        y = end_point - i
        inputfile = finaldf.loc[y:end_point, :]
        inputfile_x = inputfile.loc[:, inputfile.columns != 'SaleQty']
        pred_set = inputfile_x.head(1)
        pred = fit.predict(pred_set)
        df3.at[df3.index[df3_end - i], 'SaleQty'] = pred[0]
        finaldf = create_lag(df3)
        finaldf = finaldf.reset_index(drop=True)
        yhat.append(pred)
    yhat = np.array(yhat)
    print("Forecast complete..")
    return yhat
predicted_value=randomForest(jeans_data, 6, 'Month')

Random forest is an ensemble learning method that bootstraps the observations, so each tree is trained on a random sample of the training set. This sampling ignores the order of the data points, so the model might not perform well on many time series. It does perform well on intermittent data, however, because it captures the probability of demand/sale for an item that often sells zero units.

Please let me know your queries and suggestions if any.

To connect with me on LinkedIn, please click here

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Tanushree Biswal

19 Apr 2023
