The Power of Synergy: Combining Two Machine Learning Models for Enhanced Predictive Accuracy
Exploring the Benefits of Model Fusion in Machine Learning
Time series forecasting stands as a cornerstone for a wide array of applications. Whether it’s predicting stock prices, demand for a product, or weather patterns, accurate time series forecasting is critical for informed decision-making. Over the years, various modeling techniques have been employed to tackle the intricacies of time series data. Among these methods, two have gained significant popularity: linear regression and XGBoost.
Linear regression is a simple yet robust technique that has been widely used for time series forecasting. On the other hand, XGBoost, an ensemble learning algorithm, has gained acclaim for its exceptional predictive power.
In this article, we delve into the world of time series forecasting by combining the strengths of linear regression and XGBoost. We aim to create a hybrid model that harnesses the simplicity of linear regression and the predictive prowess of XGBoost to provide accurate, robust, and interpretable forecasts for time series data.
Introduction to XGBoost Regression
Imagine you’re trying to solve a complex puzzle. Each piece of the puzzle represents a small part of the solution. XGBoost is like having a group of experts, each specializing in a particular type of puzzle piece. They work together to solve the puzzle. The algorithm is all about improving, or boosting, the performance of a model. It starts with a simple model, like a decision tree, and gradually makes it better.
XGBoost pays extreme attention to its mistakes. It looks at the pieces of the puzzle it got wrong and focuses on solving those first. Instead of relying on a single expert to solve the entire puzzle, it uses many experts (decision trees). Each expert gives their opinion on how to solve the puzzle, and they vote together to make the final decision.
XGBoost then goes through a lot of practice puzzles (training data) to train its experts. It learns from its mistakes and gets better over time. It also constantly checks how well it’s doing and makes adjustments. It’s like each expert is given a chance to reevaluate their opinion and improve their piece of the puzzle.
The final solution is the combination of all the expert opinions. This combination often leads to a much better result than just one expert could achieve.
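To make the boosting intuition concrete, here is a minimal, purely illustrative sketch (not the model built later in this article) in which a second shallow tree is fit on the residuals, the mistakes, of a first one, and the two "experts" are summed:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: a noisy sine wave (purely illustrative)
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

# First "expert": a shallow tree fit on the raw targets
tree_1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
residuals = y - tree_1.predict(X)

# Second "expert": a tree fit on the first tree's mistakes (the residuals)
tree_2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)

# The boosted prediction is the sum of the experts' contributions
boosted = tree_1.predict(X) + tree_2.predict(X)
print('MSE, one tree  :', np.mean((y - tree_1.predict(X)) ** 2))
print('MSE, two trees :', np.mean((y - boosted) ** 2))

Libraries like XGBoost automate exactly this loop, adding many trees, each one correcting what the ensemble so far gets wrong.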
Introduction to Linear Regression
Linear regression is a simple yet powerful concept used in statistics and machine learning to understand and predict relationships between two things. Let’s break it down in layman’s terms.
Imagine you’re trying to figure out how one thing is related to another. For example, you might wonder how the amount of time you spend studying (let’s call it “X”) affects your test scores (let’s call it “Y”). Linear regression helps you find a straight line that best represents this relationship.
Linear regression finds a straight line that fits the data points as closely as possible. This line is called the line of best fit. The line of best fit can be represented by a simple equation: Y = mX + b.
In this equation, “Y” is the thing you’re trying to predict (test scores), “X” is the thing you’re using to make the prediction (study time), “m” is the slope of the line (how much Y changes when X changes), and “b” is the intercept (where the line crosses the Y-axis). Once you’ve found the line of best fit, you can use it to make predictions. For example, if you know how much time you plan to study (X), you can predict what your test score (Y) is likely to be based on the line.
Linear regression helps you understand how well your line of best fit predicts the actual outcomes. It measures how closely the predicted values match the real data. The goal is to make the line as close to the data points as possible.
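As a quick, hypothetical illustration of the idea above, with made-up study-time and test-score numbers, the line of best fit can be recovered in a few lines of scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied (X) and test scores (Y)
X = np.array([[1], [2], [3], [4], [5]])
Y = np.array([55, 62, 71, 77, 86])

line = LinearRegression().fit(X, Y)
# m (slope) and b (intercept) from Y = mX + b
print('m =', line.coef_[0], ' b =', line.intercept_)
# Predict the score for 6 hours of study
print('predicted Y for X = 6:', line.predict([[6]])[0])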
Creating the Model and Evaluating the Results
The aim of this article is to forecast a certain time series with both models separately, then combine their predictions and compare the combined forecast with each individual one. The time series that we will forecast is the COT values of the Japanese yen. For the evaluation, we will use a simple hit ratio (accuracy) metric, that is, the percentage of forecasts whose sign matches the sign of the actual change.
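Since the hit ratio drives the whole evaluation, here is a minimal sketch of the computation on two hypothetical arrays; a forecast counts as a hit whenever its sign matches the sign of the realized change (the same logic appears in the full code below):

import numpy as np

# Hypothetical predicted and realized weekly changes
y_pred = np.array([ 0.5, -1.2,  0.3, -0.4,  0.9])
y_true = np.array([ 0.8, -0.7, -0.2, -0.6,  1.1])

# Hit ratio: share of forecasts whose sign matches the true change
hit_ratio = np.sum(np.sign(y_pred) == np.sign(y_true)) / len(y_true) * 100
print('Hit Ratio =', hit_ratio, '%')  # 80.0 % (4 of 5 signs match)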
The COT report, or Commitments of Traders report, is a weekly publication by the U.S. Commodity Futures Trading Commission (CFTC) that provides information on the positions of various market participants, including commercial hedgers, large speculators, and small traders, in the futures and options markets. It offers valuable insights into the sentiment and positioning of these groups, helping traders and investors gauge potential market trends and reversals. The report is commonly used to analyze commodity and financial futures markets and make informed trading decisions.
The following chart shows the weekly values of the COT — JPY. This time series has a positive correlation with JPYUSD (and consequently, a negative correlation with USDJPY).
The framework of this study is as follows:
Download and import the COT JPY data from here.
Take the difference of the data in order to make it stationary (see the short sketch after this list).
Split the data into training and test sets (while using lagged values as features or predictors).
Fit and predict using the linear regression algorithm and the XGBoost algorithm separately.
Combine the forecasts using a simple averaging.
Evaluate the three models.
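Before the full code, here is a short, hypothetical sketch of the differencing step: np.diff turns levels into week-over-week changes (the stationary series we actually forecast), and a cumulative sum can undo the transformation if levels are ever needed again:

import numpy as np

# Hypothetical COT levels
levels = np.array([10.0, 12.0, 11.0, 15.0])

# First difference: week-over-week changes (one observation is lost)
changes = np.diff(levels)  # [ 2. -1.  4.]

# Inverting the difference recovers the levels from the first value
recovered = levels[0] + np.concatenate(([0.0], np.cumsum(changes)))
print(np.allclose(recovered, levels))  # True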
Use the following code to create the linear regression algorithm:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

def data_preprocessing(data, num_lags, train_test_split):
    # Prepare the data for training: each row of x holds num_lags lagged
    # values, and y holds the next value to be predicted
    x = []
    y = []
    for i in range(len(data) - num_lags):
        x.append(data[i:i + num_lags])
        y.append(data[i + num_lags])
    # Convert the data to numpy arrays
    x = np.array(x)
    y = np.array(y)
    # Split the data into training and testing sets
    split_index = int(train_test_split * len(x))
    x_train = x[:split_index]
    y_train = y[:split_index]
    x_test = x[split_index:]
    y_test = y[split_index:]
    return x_train, y_train, x_test, y_test

# Import the COT JPY data and flatten it into a one-dimensional array
data = np.reshape(pd.read_excel('COT_JPY.xlsx').values, (-1))
# Difference the series to make it stationary
data = np.diff(data)
# Use 80 lagged values as features and an 80/20 train-test split
x_train, y_train, x_test, y_test = data_preprocessing(data, 80, 0.80)
# Create the model
model = LinearRegression()
# Fit the model to the data
model.fit(x_train, y_train)
# Predict on the test set
y_pred_lr = model.predict(x_test)
# Hit ratio: percentage of forecasts whose sign matches the true change
same_sign_count = np.sum(np.sign(y_pred_lr) == np.sign(y_test)) / len(y_test) * 100
print('Hit Ratio Linear Regression = ', same_sign_count, '%')
Use the following code to create the XGBoost algorithm:
from xgboost import XGBRegressor

# Import, difference, and preprocess the data as before
data = np.reshape(pd.read_excel('COT_JPY.xlsx').values, (-1))
data = np.diff(data)
x_train, y_train, x_test, y_test = data_preprocessing(data, 80, 0.80)
# Create the model
model = XGBRegressor(random_state=0, n_estimators=16, max_depth=16)
# Fit the model to the data
model.fit(x_train, y_train)
# Predict on the test set
y_pred_xgb = model.predict(x_test)
# Plot the last 100 predictions of both models against the true data
plt.plot(y_pred_lr[-100:], label='Predicted Data | LR', linestyle='--', marker='.', color='red')
plt.plot(y_pred_xgb[-100:], label='Predicted Data | XGBoost', linestyle='--', marker='.', color='orange')
plt.plot(y_test[-100:], label='True Data', marker='.', alpha=0.7, color='blue')
plt.legend()
plt.grid()
plt.axhline(y=0, color='black', linestyle='--')
# Hit ratio for XGBoost
same_sign_count = np.sum(np.sign(y_pred_xgb) == np.sign(y_test)) / len(y_test) * 100
print('Hit Ratio XGBoost = ', same_sign_count, '%')
The following chart shows the different predictions and the real data in blue.
Use the following code to create the averaged forecasts algorithm:
# Combine the two forecasts with a simple (equal-weight) average
averaged_forecasts = (y_pred_xgb + y_pred_lr) / 2
# Hit ratio for the averaged forecasts
same_sign_count = np.sum(np.sign(averaged_forecasts) == np.sign(y_test)) / len(y_test) * 100
print('Hit Ratio Averaged Forecasts = ', same_sign_count, '%')
# Plot the individual and averaged forecasts against the true data
plt.plot(y_pred_lr[-100:], label='Predicted Data | LR', linestyle='--', marker='.', color='red', alpha=0.5)
plt.plot(y_pred_xgb[-100:], label='Predicted Data | XGBoost', linestyle='--', marker='.', color='orange', alpha=0.5)
plt.plot(averaged_forecasts[-100:], label='Predicted Data | Averaged Forecasts', linewidth=2, marker='.', color='black')
plt.plot(y_test[-100:], label='True Data', marker='.', alpha=0.7, color='blue')
plt.legend()
plt.grid()
plt.axhline(y=0, color='black', linestyle='--')
The following chart shows the different predictions and the real data in blue.
The following are the results of the three models:
Hit Ratio Linear Regression = 54.05 %
Hit Ratio XGBoost = 62.16 %
Hit Ratio Averaged Forecasts = 64.86 %
On this dataset, the combination of the two models has provided better accuracy than either model alone. This opens the door to exploiting this type of fusion more widely. The bottom line is that two machines can be better than one.
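As one possible illustration of that fusion direction, not something evaluated above, the plain mean can be generalized to a weighted average; the weight w below is arbitrary and would have to be tuned, for example on a validation slice:

# Hypothetical extension: weight the two forecasts instead of taking a plain mean
w = 0.6  # illustrative weight for XGBoost; to be tuned, not a recommendation
weighted_forecasts = w * y_pred_xgb + (1 - w) * y_pred_lr
same_sign_count = np.sum(np.sign(weighted_forecasts) == np.sign(y_test)) / len(y_test) * 100
print('Hit Ratio Weighted Forecasts = ', same_sign_count, '%')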