Mastering Anomaly Detection in Python
A Comprehensive Guide to Isolation Forests: Detecting Anomalies in Time Series
Anomalies, also known as outliers, are data points or patterns in time series data that deviate significantly from the expected or normal behavior. These deviations can be caused by various factors, such as errors in data collection, equipment failures, or unexpected events.
In this article, we will delve deeper into the workings of isolation forests, explore practical examples, and provide insights into their implementation.
Isolation Forest Algorithm and Anomaly Detection
Identifying anomalies or outliers within datasets is a critical task with far-reaching implications. Anomalies can represent fraudulent transactions in financial data, manufacturing defects in quality control, health anomalies in medical records, or even outsized market returns. One powerful technique that has emerged as an effective tool in this context is the isolation forest.
Anomalies are data points that significantly differ from the majority of other data points in a dataset. They can manifest as rare events, errors, or unexpected observations. Detecting anomalies is challenging because they often hide within the noise and complexities of large datasets. Traditional statistical methods and clustering algorithms are not always suitable for handling these challenges effectively.
The isolation forest is an efficient machine learning algorithm designed specifically for anomaly detection. It was introduced by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in a research paper published in 2008. The core idea behind the isolation forest is to isolate anomalies by recursively partitioning the dataset into subsets. Let’s look at the details:
The algorithm begins by randomly selecting a feature and then choosing a random value within the range of that feature. This creates a partition, effectively splitting the dataset into two smaller subsets.
The process of random partitioning is repeated recursively until each data point is isolated in its own subset or until a predefined stopping criterion is met.
Anomalies are identified as data points that require fewer partitioning steps to be isolated. In other words, anomalies are typically found in smaller, less dense partitions, making them stand out from the rest of the data.
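To make these steps concrete, here is a toy, from-scratch sketch of the isolation idea. This is an illustration only, not scikit-learn's implementation: the original paper converts the expected path length into a normalized score, but for intuition the raw average path length over many random trees already separates outliers, since they need fewer splits to isolate.

import numpy as np

def isolation_path_length(x, data, rng, max_depth=10, depth=0):
    # Stop when the point is isolated or the depth limit is reached
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Randomly pick a feature and a split value within its range
    feature = rng.integers(data.shape[1])
    lo, hi = data[:, feature].min(), data[:, feature].max()
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Follow the side of the split that contains x
    mask = data[:, feature] < split
    subset = data[mask] if x[feature] < split else data[~mask]
    return isolation_path_length(x, subset, rng, max_depth, depth + 1)

rng = np.random.default_rng(0)
points = rng.normal(0, 1, size=(200, 1))
avg_path = lambda x: np.mean([isolation_path_length(x, points, rng) for _ in range(100)])

# The outlier needs far fewer random splits to isolate than a typical point
print(f"outlier: {avg_path(np.array([6.0])):.2f}, inlier: {avg_path(points[0]):.2f}")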
Isolation forests are highly scalable and perform well on large datasets. They are also efficient and handle high-dimensional data well.
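A large part of that scalability comes from subsampling: scikit-learn's implementation grows each tree on a small random subset of the data (at most 256 points by default), so training cost grows slowly with dataset size. A minimal illustration of the relevant parameters:

from sklearn.ensemble import IsolationForest

# Each of the 100 trees is grown on a random subsample of at most 256 points,
# which keeps training fast even on large datasets
forest = IsolationForest(n_estimators=100, max_samples=256, n_jobs=-1)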
We will also discuss best practices for using Isolation Forests effectively in anomaly detection tasks, helping you harness the power of this versatile algorithm to protect your data and uncover hidden insights.
Creating the Isolation Forest Algorithm
Let’s create a simple algorithm in Python that imports the isolation forest algorithm from the sklearn library. The plan of attack is as follows:
Generate synthetic data and inject anomalies into it.
Fit the isolation forest algorithm to the data.
Use the algorithm to detect the anomalies, and plot the results.
Use this code to detect anomalies in synthetically generated data with manually injected anomalies:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
# Generate a synthetic time series dataset with anomalies
np.random.seed(42)
n_samples = 500
time = np.arange(n_samples)
data = 0.5 * np.sin(0.2 * time) + np.random.normal(0, 0.1, n_samples)
data[160:170] += 3.0  # Inject a short upward level shift
data[188:194] += 2.0  # Inject a second, smaller upward shift
data[400:476] -= 3.0  # Inject a longer downward regime shift
# Reshape the data into a column vector (required by Isolation Forest)
data = data.reshape(-1, 1)
# Create and fit the Isolation Forest model
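# contamination=0.05 tells the model to expect roughly 5% of the points to be
# anomalous; this sets the score threshold separating inliers from outliers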
isolation_forest = IsolationForest(contamination=0.05, random_state=42)
isolation_forest.fit(data)
# Compute anomaly scores (negative values indicate likely anomalies)
anomaly_scores = isolation_forest.decision_function(data)
# Plot the time series and highlight anomalies
plt.figure(figsize=(12, 6))
plt.plot(time, data)
plt.scatter(time, data, c=np.where(anomaly_scores < 0, 'red', 'blue'), marker='o')
plt.xlabel('Time')
plt.ylabel('Value')
plt.title('Time Series Anomaly Detection with Isolation Forest')
plt.show()
The following is the output graph. The red dots mark the points where the model has detected anomalies.
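Note that the code above flags anomalies by thresholding decision_function at zero. An equivalent route, shown below, is the model's predict method, which labels points using the threshold implied by the contamination setting:

# predict() labels each point: -1 for anomalies, 1 for inliers
labels = isolation_forest.predict(data)
anomaly_indices = np.where(labels == -1)[0]
print(f"{len(anomaly_indices)} points flagged as anomalies")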
Detecting Anomalies in the S&P 500 Index
Let’s do the same exercise but on the returns of the S&P 500 index, a US equity benchmark:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
import pandas_datareader as pdr
start_date = '2020-01-01'
end_date = '2023-01-01'
# Import the data
data = pdr.get_data_fred('SP500', start=start_date, end=end_date).dropna()
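# Work with daily first differences so anomalies are detected in the changes
# rather than in the trending price level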
data_diff = data.diff().dropna()
time = np.arange(len(data_diff))
# Reshape the data into a column vector (required by Isolation Forest)
data_diff = np.array(data_diff).reshape(-1, 1)
# Create and fit the Isolation Forest model
isolation_forest = IsolationForest(contamination=0.05, random_state=0)
isolation_forest.fit(data_diff)
# Compute anomaly scores (negative values indicate likely anomalies)
anomaly_scores = isolation_forest.decision_function(data_diff)
plt.figure(figsize=(12, 6))
plt.plot(time[-100:], data_diff[-100:], color='blue', linewidth=1)
plt.scatter(time[-100:], data_diff[-100:], c=np.where(anomaly_scores[-100:] < 0, 'red', 'blue'), marker='o')
plt.xlabel('Time')
plt.ylabel('Value')
plt.title('Time Series Anomaly Detection with Isolation Forest')
plt.axhline(y=0, color='black')
plt.grid()
plt.show()
The following is the resulting chart:
The isolation forest algorithm is a good starting point for experimenting with anomaly detection in time series. One way to exploit the detected anomalies is to assume that a market correction may follow shortly after.
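As a rough illustration of that idea, here is a minimal sketch, continuing from the variables above, that compares the average next-day change following anomaly days with the overall average change. This is a quick sanity check on a small sample, not a backtest, and it reuses the same zero-threshold convention on the scores:

# Compare the average change on days following an anomaly with the overall
# average change; a markedly different value would hint at mean reversion
flags = anomaly_scores < 0
next_day = data_diff[1:].flatten()
print(f"Avg change after anomaly days: {next_day[flags[:-1]].mean():.2f}")
print(f"Avg change overall: {next_day.mean():.2f}")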