This article was published as a part of the Data Science Blogathon
Introduction
Random Forest is a popular supervised machine learning algorithm. It is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the mode of the classes predicted by the individual trees (classification) or the mean of their predictions (regression). It can therefore be used for both classification and regression problems. It can also be used for time series forecasting, on both univariate and multivariate datasets, by manually creating lag variables and seasonal component variables.
No algorithm works best for all datasets, so depending on the data you can try various algorithms and choose the best one. I have tried ARIMA, SARIMA, ETS, LSTM, Random Forest, XGBoost, and fbprophet for time series forecasting, and each of these algorithms worked best for one category of data or another. Random Forest, XGBoost, and fbprophet performed best on multivariate and intermittent data.
Intermittent data:
Intermittent data has a highly irregular pattern; demand data is a typical example. The series has a non-zero value in periods with demand and is zero in periods without demand. Intermittent demand data usually comes from customer demand or sales records for an item that is not sold in every period.
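To get a feel for how intermittent a series is, you can measure how often it is zero and how far apart the non-zero periods are. The snippet below is a minimal sketch; the demand values are invented for illustration and are not taken from the jeans data.

```python
import pandas as pd

# Hypothetical monthly demand for a slow-moving item (values invented for illustration)
demand = pd.Series([0, 0, 5, 0, 12, 0, 0, 3, 0, 0, 7, 0])

# Share of periods with no demand at all
zero_share = (demand == 0).mean()

# Average number of periods between consecutive periods that do have demand
nonzero_positions = demand[demand > 0].index.to_series()
avg_gap = nonzero_positions.diff().dropna().mean()

print(f"Share of zero-demand periods: {zero_share:.2f}")
print(f"Average gap between sales (periods): {avg_gap:.2f}")
```

A high share of zeros combined with irregular gaps is a hint that the series behaves intermittently.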
In this tutorial, you will learn how to develop a Random forest model for time series forecasting.
After completing this tutorial, you will know:
- How to develop a Random Forest model for univariate/multivariate time series data.
- How to limit the number of independent variables to a certain value.
- How to forecast for multiple date points e.g. for the coming 4 months or 4 weeks.
Let’s get started.
Problem: Forecast demand for a jeans brand for the coming 6 months.
Data: We have monthly sales quantity available for 2 years (from May 2019 to May 2021) in the CSV file.
Import all required Packages
import calendar
from datetime import timedelta

import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Load the monthly sales data and preview the first rows
jeans_data = pd.read_csv('jeans_data.csv')
jeans_data.head()
date | SaleQty |
2019-05-01 | 1683 |
2019-06-01 | 1321 |
2019-07-01 | 1447 |
2019-08-01 | 0 |
2019-09-01 | 86 |
2019-10-01 | 1165 |
Check if the data is stationary
from statsmodels.tsa.stattools import adfuller

# Augmented Dickey-Fuller test on the sales series
result = adfuller(jeans_data.SaleQty.dropna())
print('p-value: %f' % result[1])
p-value: 0.024419
Since the p-value is below 0.05, we reject the null hypothesis of a unit root and treat the data as stationary, so we can proceed without any transformation.
Create lag variables
# Create lag columns t-12 ... t-1 from the sales quantity
dataframe = DataFrame()
for i in range(12, 0, -1):
    dataframe['t-' + str(i)] = jeans_data.SaleQty.shift(i)
final_data = pd.concat([jeans_data, dataframe], axis=1)
# Drop the first 12 rows, which do not have a complete set of lags
final_data.dropna(inplace=True)
You can use any value in place of 12, depending on your time interval and the number of lags you want to create. A good starting point is 12 for monthly data and 52 for weekly data (about one year of history); you can limit the number of independent variables later.
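To see what the lag columns look like, here is a minimal sketch on a toy series. The values and the choice of 2 lags are assumptions made only to keep the output small.

```python
import pandas as pd

# Toy series standing in for SaleQty (values invented for illustration)
sales = pd.DataFrame({"SaleQty": [10, 20, 30, 40, 50]})

n_lags = 2  # assumed small lag count so the result is easy to read
for i in range(n_lags, 0, -1):
    # shift(i) moves the series down by i rows, so row t holds the value from t-i
    sales["t-" + str(i)] = sales["SaleQty"].shift(i)

# The first n_lags rows have no complete history, so they are dropped
lagged = sales.dropna().reset_index(drop=True)
print(lagged)
```

Each remaining row pairs the current value with the values from the previous periods. This also shows the cost of lagging: the more lags you create, the fewer rows are left for training.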
Add seasonal variable
Create a variable that takes a different value for each month. This adds a seasonal component to the model, which may help improve the forecast.
final_data['date'] = pd.to_datetime(final_data['date'], format='%Y-%m-%d')
final_data['month'] = final_data['date'].dt.month
Or we can add dummy variables for each month:
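# The prefix gives the dummy columns string names (month_1, month_2, ...),
# which recent scikit-learn versions require when other column names are strings
dummy = pd.get_dummies(final_data['month'], prefix='month')
final_data = pd.concat([final_data, dummy], axis=1)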
Train the model:
We will take the most recent 6 months of data as the test dataset and the rest of the data as the training dataset.
finaldf = final_data.drop(['date'], axis=1)
finaldf = finaldf.reset_index(drop=True)

# Hold out the last test_length rows for testing
test_length = 6
end_point = len(finaldf)
x = end_point - test_length
finaldf_train = finaldf.loc[:x - 1, :]
finaldf_test = finaldf.loc[x:, :]

# Separate the features from the target (SaleQty)
finaldf_test_x = finaldf_test.loc[:, finaldf_test.columns != 'SaleQty']
finaldf_test_y = finaldf_test['SaleQty']
finaldf_train_x = finaldf_train.loc[:, finaldf_train.columns != 'SaleQty']
finaldf_train_y = finaldf_train['SaleQty']

print("Starting model train..")
# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=1), n_features_to_select=4)
fit = rfe.fit(finaldf_train_x, finaldf_train_y)
y_pred = fit.predict(finaldf_test_x)
I have used RFE (recursive feature elimination) to limit the number of independent variables/features to 4. You can change this number and choose the value that gives the least error. I have set n_estimators (the number of trees in the forest) to 100, which is the default value.
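To check which features RFE actually kept, you can inspect the fitted selector's support_ and ranking_ attributes. The sketch below uses synthetic data (an assumption for illustration, not the jeans data) in which only two of six features drive the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic data (an assumption for illustration): only f0 and f1 drive the target
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
y = 5 * X["f0"] - 3 * X["f1"] + rng.normal(scale=0.1, size=200)

selector = RFE(RandomForestRegressor(n_estimators=50, random_state=1),
               n_features_to_select=2)
selector.fit(X, y)

# support_ is a boolean mask of the kept features; ranking_ is 1 for kept features
selected = list(X.columns[selector.support_])
print("Selected features:", selected)
print("Ranking:", dict(zip(X.columns, selector.ranking_)))
```

On your own data, you could repeat the fit for several values of n_features_to_select and keep the one with the lowest error on the test set.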
Evaluating the Algorithm:
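# finaldf_test_y is already the SaleQty Series
y_true = np.array(finaldf_test_y)
sumvalue = np.sum(y_true)
# Total absolute error as a percentage of total actual sales
mape = np.sum(np.abs(y_true - y_pred)) / sumvalue * 100
accuracy = 100 - mape
print('Accuracy:', round(accuracy, 2), '%.')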
Accuracy: 89.42 %.
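The error measure used above divides the total absolute error by the total actual sales, which is often called weighted MAPE. Unlike a per-point MAPE, it stays defined when individual months have zero sales, which matters for intermittent data. Here is a minimal sketch of it as a reusable function; the function name and the numbers are invented for illustration.

```python
import numpy as np

def weighted_mape(y_true, y_pred):
    """Total absolute error as a percentage of total actual demand."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    total = np.sum(y_true)
    if total == 0:
        raise ValueError("Sum of actual values is zero; the metric is undefined.")
    return np.sum(np.abs(y_true - y_pred)) / total * 100

# Invented numbers, including a zero-demand month
actual = [100, 0, 50, 50]
forecast = [90, 10, 60, 40]
error_pct = weighted_mape(actual, forecast)
print("Weighted MAPE:", round(error_pct, 2), "%")
print("Accuracy:", round(100 - error_pct, 2), "%")
```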
Predict for Future:
We will predict the sale quantity for the next 6 months. The lags are unknown for future dates, so we have to predict one month at a time and use each predicted sale to build the lags for the next month’s prediction, and so on. Note that the predicted sales are used only to create the lag variables; the model is not trained again.
def create_lag(df3):
    # Build lag columns t-12 ... t-1 and drop rows with incomplete lags
    dataframe = DataFrame()
    for i in range(12, 0, -1):
        dataframe['t-' + str(i)] = df3.SaleQty.shift(i)
    df4 = pd.concat([df3, dataframe], axis=1)
    df4.dropna(inplace=True)
    return df4

yhat = []
n = 6
future_dataframe = jeans_data.copy()
future_dataframe['date'] = pd.to_datetime(future_dataframe['date'], format='%Y-%m-%d')
# Use floats so that predictions can be written into the column later
future_dataframe['SaleQty'] = future_dataframe['SaleQty'].astype(float)

# Append n placeholder rows (SaleQty = 0), one per future month.
# Month steps are approximated by the length of the last observed month.
x = future_dataframe.at[len(future_dataframe) - 1, 'date']
days_in_month = calendar.monthrange(x.year, x.month)[1]
future_rows = DataFrame({
    'date': [x + timedelta(days=days_in_month + days_in_month * i) for i in range(n)],
    'SaleQty': 0.0,
})
future_dataframe = pd.concat([future_dataframe, future_rows], ignore_index=True)
future_dataframe['month'] = future_dataframe['date'].dt.month
future_dataframe = future_dataframe.drop(['date'], axis=1)
future_dataframe_end = len(future_dataframe)

finaldf = create_lag(future_dataframe)
finaldf = finaldf.reset_index(drop=True)
end_point = len(finaldf)
for i in range(n, 0, -1):
    y = end_point - i
    inputfile = finaldf.loc[y:end_point, :]
    # Keep the same feature columns, in the same order, as used for training
    # (assumes the model was trained with the month column, without the dummy variables)
    inputfile_x = inputfile[finaldf_train_x.columns]
    pred_set = inputfile_x.head(1)
    pred = fit.predict(pred_set)
    # Write the prediction back so it feeds the lags of the following months
    future_dataframe.at[future_dataframe.index[future_dataframe_end - i], 'SaleQty'] = pred[0]
    finaldf = create_lag(future_dataframe)
    finaldf = finaldf.reset_index(drop=True)
    yhat.append(pred)
predicted_value = np.array(yhat)
You can add any other available independent variables, such as promotions, special_days, weekends, or start_of_month.
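The recursive idea above, stripped of the DataFrame bookkeeping, can be sketched as follows. The helper names and the stand-in "model" (which simply averages the lag window) are assumptions for illustration; in practice the fitted Random Forest would take its place.

```python
import numpy as np

def recursive_forecast(history, predict_one, n_lags, steps):
    """Forecast `steps` values, feeding each prediction back in as a lag."""
    window = list(history[-n_lags:])
    forecasts = []
    for _ in range(steps):
        next_value = predict_one(np.array(window))
        forecasts.append(next_value)
        # Slide the window: drop the oldest lag, append the new prediction
        window = window[1:] + [next_value]
    return forecasts

# Stand-in "model" (an assumption): predicts the mean of the lag window
def mean_of_lags(lags):
    return float(lags.mean())

history = [10.0, 20.0, 30.0]
forecasts = recursive_forecast(history, mean_of_lags, n_lags=3, steps=3)
print(forecasts)
```

Because every step builds on earlier predictions, errors can accumulate as the horizon grows, so it is worth checking accuracy separately for the later months.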
Find below the complete code:
import calendar
from datetime import timedelta

import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE


def add_month(df, forecast_length, forecast_period):
    # Append forecast_length placeholder rows (SaleQty = 0) and add a month column
    df = df.copy()
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
    # Use floats so that predictions can be written into the column later
    df['SaleQty'] = df['SaleQty'].astype(float)
    x = df['date'].iloc[-1]
    days_in_month = calendar.monthrange(x.year, x.month)[1]
    if forecast_period == 'Week':
        step = 7
    elif forecast_period == 'Month':
        # Month steps are approximated by the length of the last observed month
        step = days_in_month
    else:
        raise ValueError("forecast_period must be 'Week' or 'Month'")
    # DataFrame.append was removed in pandas 2.0, so the rows are added with concat
    future = DataFrame({
        'SaleQty': 0.0,
        'date': [x + timedelta(days=step + step * i) for i in range(forecast_length)],
    })
    df = pd.concat([df, future], ignore_index=True)
    df['month'] = df['date'].dt.month
    df = df.drop(['date'], axis=1)
    return df


def create_lag(df3):
    # Build lag columns t-12 ... t-1 and drop rows with incomplete lags
    dataframe = DataFrame()
    for i in range(12, 0, -1):
        dataframe['t-' + str(i)] = df3.SaleQty.shift(i)
    df4 = pd.concat([df3, dataframe], axis=1)
    df4.dropna(inplace=True)
    return df4


def randomForest(df1, forecast_length, forecast_period):
    df3 = df1[['SaleQty', 'date']]
    df3 = add_month(df3, forecast_length, forecast_period)
    finaldf = create_lag(df3)
    finaldf = finaldf.reset_index(drop=True)
    n = forecast_length
    end_point = len(finaldf)
    x = end_point - n
    # Train on every row except the placeholder rows for the future periods
    finaldf_train = finaldf.loc[:x - 1, :]
    finaldf_train_x = finaldf_train.loc[:, finaldf_train.columns != 'SaleQty']
    finaldf_train_y = finaldf_train['SaleQty']
    print("Starting model train..")
    rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=1), n_features_to_select=4)
    fit = rfe.fit(finaldf_train_x, finaldf_train_y)
    print("Model train completed..")
    print("Creating forecasted set..")
    yhat = []
    df3_end = len(df3)
    for i in range(n, 0, -1):
        y = end_point - i
        inputfile = finaldf.loc[y:end_point, :]
        inputfile_x = inputfile.loc[:, inputfile.columns != 'SaleQty']
        pred_set = inputfile_x.head(1)
        pred = fit.predict(pred_set)
        # Write the prediction back so it feeds the lags of the following periods
        df3.at[df3.index[df3_end - i], 'SaleQty'] = pred[0]
        finaldf = create_lag(df3)
        finaldf = finaldf.reset_index(drop=True)
        yhat.append(pred)
    yhat = np.array(yhat)
    print("Forecast complete..")
    return yhat


jeans_data = pd.read_csv('jeans_data.csv')
predicted_value = randomForest(jeans_data, 6, 'Month')
Random Forest is an ensemble learning method that bootstraps the observations: each tree is trained on a random sample of the training set. Because this sampling ignores the order of the data points, it might not perform well on many time series. It does, however, perform well on intermittent data, as it captures the probability of a sale for a product that often sells zero units.
Please let me know your queries and suggestions if any.