For retailers and businesses, the ability to forecast future sales accurately is a coveted skill, one that can spell the difference between profitability and loss. In this era of data-driven decision-making, predictive modeling techniques have become invaluable tools in deciphering the complex patterns of retail sales.
One such technique that has gained prominence in recent years is the random forest algorithm, a powerful and versatile tool in data science and machine learning. This article shows how to use it to forecast the next core retail sales value.
Core Retail Sales
Retail sales refer to the total revenues generated by businesses from selling goods and services directly to consumers for their personal use or consumption. These sales occur through physical (brick-and-mortar) stores, e-commerce websites, catalogs, and other distribution channels.
Core retail sales refer to a subset of retail sales that excludes certain categories of goods or products. The specific definition of core retail sales may vary depending on the source and context, but it generally excludes items like automobiles, gasoline, and building materials. The purpose of focusing on core retail sales is to get a clearer picture of the underlying trends and consumer spending patterns in the retail sector.
Core retail sales are used as an economic indicator and can provide insights into consumer spending trends, economic health, inflation, and monetary policy.
If core retail sales are growing, it may suggest that consumers are confident and spending more, which can be a positive sign for economic growth.
The following chart shows US core retail sales since 1994. The series appears stationary, which makes it suitable as input for a machine learning algorithm. As a reminder, stationary data has statistical properties, such as mean and variance, that remain relatively constant over time, which matters for statistical forecasting.
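A rough way to see what stationarity means in practice is to compare the mean and standard deviation of the first and second halves of a series. The sketch below uses synthetic data (not actual retail sales figures) and a hypothetical helper named `split_half_stats`; a formal test such as the augmented Dickey-Fuller test would be more rigorous.

```python
import numpy as np

def split_half_stats(series):
    """Return (mean, std) for the first half and for the second half of a series."""
    series = np.asarray(series, dtype=float)
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return (first.mean(), first.std()), (second.mean(), second.std())

# Synthetic example (assumption: these are NOT real retail sales numbers)
rng = np.random.default_rng(0)
changes = 0.5 + rng.normal(0.0, 1.0, size=300)  # stationary period-to-period changes
levels = np.cumsum(changes)                     # trending level series built from them

(level_m1, _), (level_m2, _) = split_half_stats(levels)
(change_m1, _), (change_m2, _) = split_half_stats(changes)

print(f"Levels : mean goes from {level_m1:.2f} to {level_m2:.2f}")
print(f"Changes: mean goes from {change_m1:.2f} to {change_m2:.2f}")
```

The level series drifts upward, so its two halves have very different means, while the changes keep roughly the same mean throughout. This is why a series of changes is usually the safer input for a forecasting model.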
Creating the Algorithm
The aim of the algorithm is to run an autoregressive task on the values of US core retail sales in order to predict the next monthly change. For this task, we can follow this framework:
Download the US core retail sales data from here, and save the file in your Python working directory.
Import the file to Python and split it into a training set and a test set.
Select the lags that will be used to forecast the current values, then fit the model and generate predictions.
Evaluate the model and plot the results.
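To make the lag step concrete, here is a minimal sketch on a toy series showing what the inputs and targets of an autoregressive setup look like with three lags. The helper name `make_lagged` is hypothetical, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_lagged(series, num_lags):
    """Build (inputs, targets) where each target follows a window of num_lags values."""
    series = np.asarray(series)
    # All windows of num_lags consecutive values; the last window is dropped
    # because no observation follows it to serve as its target.
    x = sliding_window_view(series, num_lags)[:-1]
    y = series[num_lags:]
    return x, y

toy = np.array([1, 2, 3, 4, 5, 6])
x, y = make_lagged(toy, num_lags=3)
print(x)  # rows: [1 2 3], [2 3 4], [3 4 5]
print(y)  # [4 5 6]
```

Each row of `x` holds the three previous values and the matching entry of `y` is the value that came next. Because this is a time series, the train/test split should be chronological (first 80% of rows for training, the rest for testing) rather than shuffled.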
Run the following code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
def data_preprocessing(data, num_lags, train_test_split):
    # Prepare the data for training
    x = []
    y = []
    for i in range(len(data) - num_lags):
        x.append(data[i:i + num_lags])
        y.append(data[i + num_lags])
    # Convert the data to numpy arrays
    x = np.array(x)
    y = np.array(y)
    # Split the data into training and testing sets (chronological, no shuffling)
    split_index = int(train_test_split * len(x))
    x_train = x[:split_index]
    y_train = y[:split_index]
    x_test = x[split_index:]
    y_test = y[split_index:]
    return x_train, y_train, x_test, y_test
# importing the time series
data = np.reshape(pd.read_excel('CRS.xlsx').values, (-1))
'''
plt.plot(data, label = 'US Core Retail Sales')
plt.axhline(y = 0, color = 'black')
plt.grid()
plt.legend()
'''
# Creating the training and test sets
x_train, y_train, x_test, y_test = data_preprocessing(data, 3, 0.80)
# Fitting the model
model = RandomForestRegressor(max_depth = 100, random_state = 0)
model.fit(x_train, y_train)
# Predicting in-sample
y_pred_train = model.predict(x_train)
# Predicting out-of-sample
y_pred = model.predict(x_test)
# Performance evaluation
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE of Test: {rmse_test}")
same_sign_count = np.sum(np.sign(y_pred) == np.sign(y_test)) / len(y_test) * 100
print('Directional Accuracy | Test = ', same_sign_count, '%')
same_sign_count_train = np.sum(np.sign(y_pred_train) == np.sign(y_train)) / len(y_train) * 100
print('Directional Accuracy | Train = ', same_sign_count_train, '%')
plt.plot(y_pred, color = 'orange', label = 'Predictions')
plt.plot(y_test, label = 'Real')
plt.axhline(y = 0, color = 'black')
plt.grid()
plt.legend()
The following chart shows the comparison between predictions and real values in the test set.
The performance evaluation produced the following results:
RMSE of Test: 0.031
Directional Accuracy | Test = 64.0 %
Directional Accuracy | Train = 87.0 %
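Directional accuracy is easier to interpret next to a naive baseline. If a series rises more often than it falls, always predicting "up" can already score above 50%. The following sketch uses made-up numbers and two hypothetical helpers, `directional_accuracy` and `majority_sign_baseline`, to show the comparison; it does not reproduce the results above.

```python
import numpy as np

def directional_accuracy(y_true, y_pred):
    """Percentage of observations where prediction and actual value share a sign."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.sign(y_true) == np.sign(y_pred)) * 100

def majority_sign_baseline(y_train, n):
    """Predict, for all n test points, the sign seen most often in training."""
    y_train = np.asarray(y_train, dtype=float)
    majority = 1.0 if np.sum(y_train > 0) >= len(y_train) / 2 else -1.0
    return np.full(n, majority)

# Made-up values for illustration only
y_train_toy = np.array([0.4, 0.2, -0.1, 0.3, 0.5, -0.2])
y_test_toy = np.array([0.3, -0.4, 0.1, 0.2])

baseline_pred = majority_sign_baseline(y_train_toy, len(y_test_toy))
print(directional_accuracy(y_test_toy, baseline_pred))  # 75.0
```

Running the same baseline on the real `y_train` and `y_test` would show how much of the model's test accuracy goes beyond simply guessing the most common direction.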
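With a directional accuracy above 50% on the test set, the model appears to pick up on some of the changes in this economic indicator, although the gap between training accuracy (87%) and test accuracy (64%) suggests some overfitting. Correctly forecasting the indicator's direction may help you spot interesting trading opportunities.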
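The evaluation above scores predictions on past data. To produce the actual next-value forecast, the most recent lagged observations can be passed to the fitted model. The sketch below is one way to do this under stated assumptions: it uses a synthetic series in place of the real data, three lags as in the code above, and refits on the full sample, which is a design choice rather than a requirement.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

num_lags = 3

# Synthetic stand-in for the monthly changes (assumption: not real data)
rng = np.random.default_rng(0)
series = rng.normal(0.3, 0.5, size=200)

# Build lagged inputs and targets the same way data_preprocessing does
x = np.array([series[i:i + num_lags] for i in range(len(series) - num_lags)])
y = series[num_lags:]

model = RandomForestRegressor(max_depth=100, random_state=0)
model.fit(x, y)

# The last num_lags observations are the input for the not-yet-observed period
latest_window = series[-num_lags:].reshape(1, -1)
next_value = model.predict(latest_window)[0]
print(f"Forecast for the next period: {next_value:.3f}")
```

With the real data, `series` would be replaced by the `data` array loaded from the Excel file, and the sign of `next_value` would give the forecast direction.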