Demystifying K-Nearest Neighbors: A Beginner’s Guide in Python
A Gentle Introduction to the KNN Algorithm for Newcomers to Machine Learning
In this article, we embark on a journey to demystify KNN, the algorithm that makes predictions based on the company it keeps, quite literally. We will explore how KNN takes inspiration from our everyday experiences, where we often seek advice from our nearest neighbors, friends, or colleagues.
We will dive into two key aspects of KNN: classification and regression. In the world of classification, KNN helps us assign labels or categories to new data points based on their resemblance to previously observed instances. It is a vital tool in tasks like spam detection, image recognition, and disease diagnosis. In the task of regression, KNN enables us to predict continuous values, making it indispensable for applications such as real estate price prediction, stock market forecasting, and more.
Intuition of the KNN Algorithm
K-Nearest Neighbors (KNN) is a simple and intuitive machine learning algorithm used for both classification and regression tasks. But first, what do classification and regression mean?
Classification is a type of supervised learning where the goal is to categorize data points into predefined classes or labels. In classification, the model learns to assign a class or category to an input based on its features. The output in classification is discrete and represents a category or class label. For example, classifying emails as spam or not spam.
Regression, on the other hand, is a type of supervised learning where the goal is to predict a continuous numeric value. In regression, the model learns to establish a relationship between the input features and the output, which is a real-valued number. The output in regression is a continuous range of values, and it represents a quantity or a numerical value. For example, predicting the price of a house based on its features and forecasting stock prices.
KNN is a type of lazy learning, which means that it doesn’t build a model during training but instead memorizes the entire training dataset and makes predictions based on the similarity between new data points and the existing data points.
So, the central idea behind KNN is that objects (data points) with similar characteristics are close to each other in the feature space. It makes predictions by finding the K training examples that are closest to a given test data point in the feature space and then assigns a label or value to the test point based on the labels or values of its nearest neighbors. K is therefore a hyperparameter that you choose yourself, rather than something the algorithm learns.
In layman’s terms, K represents the number of nearest neighbors that the algorithm considers when making a prediction. It specifies how many of your closest neighbors you’ll consult.
Take a look at the following illustration.
There are two classes, class A (with the bananas) and class B (with the apples). We are trying to identify the class (or label) of the blue object. The nearest neighbors seem to be the bananas, hence class A. That’s how the KNN algorithm works.
KNN is commonly used for classification tasks (such as the above example). In a KNN classifier, the goal is to assign a class label to a new data point based on the class labels of its K-nearest neighbors. The steps for KNN classification are as follows:
1. Compute the distance between the test data point and all data points in the training set. Common distance metrics include Euclidean distance, Manhattan distance, or other similarity measures.
2. Select the K training data points with the smallest distances to the test point.
3. Count the number of data points in each class among the K neighbors.
4. Assign the class label of the test point as the class with the highest count among its K neighbors.
KNN can also be used for regression tasks. In KNN regression, the goal is to predict a continuous value for a new data point based on the values of its K-nearest neighbors. The steps for KNN regression are similar to KNN classification, but instead of counting class labels, you compute the average or weighted average of the target values of the K nearest neighbors to predict the value of the test point.
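To make these steps concrete, here is a minimal from-scratch sketch of both variants, using NumPy and Euclidean distance. The function names (knn_classify and knn_regress) and the tiny toy arrays are purely illustrative; in practice you would rely on scikit-learn, as we do in the examples below.
import numpy as np
from collections import Counter
def knn_classify(x_train, y_train, x_new, k):
    # Step 1: Euclidean distance from the new point to every training point
    distances = np.linalg.norm(x_train - x_new, axis = 1)
    # Step 2: indices of the K smallest distances
    nearest = np.argsort(distances)[:k]
    # Steps 3 and 4: count the labels among the K neighbors and return the majority class
    return Counter(y_train[nearest]).most_common(1)[0][0]
def knn_regress(x_train, y_train, x_new, k):
    # Same neighbor search, but average the neighbors' target values instead of voting
    distances = np.linalg.norm(x_train - x_new, axis = 1)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()
# Toy data with two features per observation
x_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
labels = np.array([0, 0, 1, 1])
targets = np.array([1.1, 0.9, 4.1, 3.9])
print(knn_classify(x_train, labels, np.array([1.1, 0.9]), k = 3)) # majority vote gives class 0
print(knn_regress(x_train, targets, np.array([4.1, 4.1]), k = 2)) # average of the two nearest targets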
A Regression Example in Python
Let’s try to code a KNN regression example. The task is to predict the daily returns of EURUSD. Use the following code to do the task:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
import yfinance as yf
start_date = "2000-01-01"
end_date = "2023-11-01"
data = yf.download('EURUSD=X', start = start_date, end = end_date)
data = data['Close']
data = np.array(data)
data = np.diff(data)
def data_preprocessing(data, num_lags, train_test_split):
    # Prepare the data for training
    x = []
    y = []
    for i in range(len(data) - num_lags):
        x.append(data[i:i + num_lags])
        y.append(data[i + num_lags])
    # Convert the data to numpy arrays
    x = np.array(x)
    y = np.array(y)
    # Split the data into training and testing sets
    split_index = int(train_test_split * len(x))
    x_train = x[:split_index]
    y_train = y[:split_index]
    x_test = x[split_index:]
    y_test = y[split_index:]
    return x_train, y_train, x_test, y_test
# Setting the hyperparameters
num_lags = 500
train_test_split = 0.80
x_train, y_train, x_test, y_test = data_preprocessing(data, num_lags, train_test_split)
# Create the model
model = KNeighborsRegressor(n_neighbors = 4)
# Fit the model to the data
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
# Plotting
plt.plot(y_pred[-100:], label='Predicted Data', linestyle='--', marker = '.', color = 'red')
plt.plot(y_test[-100:], label='True Data', marker = '.', alpha = 0.7, color = 'blue')
plt.legend()
plt.grid()
plt.axhline(y = 0, color = 'black', linestyle = '--')
# Hit ratio: percentage of predictions with the same sign as the actual return
same_sign_count = np.sum(np.sign(y_pred) == np.sign(y_test)) / len(y_test) * 100
print('Hit Ratio = ', same_sign_count, '%')
The following chart shows the comparison between actual and predicted data:
The hit ratio (accuracy) of the algorithm using a K of 4 and 500 lagged returns as inputs is as follows:
Hit Ratio = 51.96 %
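Since the hit ratio only measures whether the predicted direction was right, it can also be useful to look at standard regression error metrics on the same predictions. The following lines assume the y_test and y_pred arrays produced by the code above:
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Error metrics computed on the arrays from the regression example above
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('MAE = ', mae)
print('RMSE = ', rmse)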
A Classification Example in Python
Let’s now switch to a KNN classification example. The main task is to predict the classes of a synthetic classification problem generated with make_classification from sklearn. Use the following code to do the task:
# Import the necessary libraries
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import numpy as np
# Generate a synthetic dataset for binary classification
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=0)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create a KNN classifier with K=20 (you can change K as needed)
knn_classifier = KNeighborsClassifier(n_neighbors=20)
# Fit the classifier to the training data
knn_classifier.fit(X_train, y_train)
# Make predictions on the test data
y_pred = knn_classifier.predict(X_test)
# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred) * 100
print('Hit Ratio = ', accuracy, '%')
# Create a mesh grid to plot the decision boundary
h = 0.02 # Step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict the class labels for each point in the mesh grid
Z = knn_classifier.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the decision boundary
plt.contourf(xx, yy, Z, alpha=0.5, cmap=plt.cm.RdYlBu)
# Plot the training points
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=plt.cm.RdYlBu, marker='o', label="Training Data")
# Plot the test points
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=plt.cm.RdYlBu, marker='x', label="Test Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("KNN Classification with Decision Boundary")
plt.legend(loc='best')
plt.show()
The following graph shows the decision boundary together with the data. The x markers are the test data: if a blue x falls inside the blue region, it was correctly classified; otherwise, it was misclassified. The o markers are the training data, read the same way.
The hit ratio (accuracy) of the algorithm using a K of 20 is as follows:
Hit Ratio = 86.67 %
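Accuracy alone does not show where the errors come from. A quick confusion matrix and classification report on the same test predictions (again assuming y_test and y_pred from the code above) reveal how the misclassified points split between the two classes:
from sklearn.metrics import confusion_matrix, classification_report
# Breakdown of correct and incorrect predictions per class
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))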
Remember that the choice of the value of K is crucial in KNN. A smaller K can lead to noisy predictions, while a larger K can lead to overly smoothed predictions. The distance metric used for measuring similarity can also have a significant impact on the results.
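A common way to choose K (and the distance metric) is a simple cross-validated grid search. The sketch below reuses the synthetic X_train and y_train from the classification example; the candidate values in the grid are illustrative rather than recommendations:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
# Search over candidate values of K and two distance metrics
param_grid = {'n_neighbors': [3, 5, 10, 20, 30], 'metric': ['euclidean', 'manhattan']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv = 5, scoring = 'accuracy')
grid.fit(X_train, y_train)
print('Best parameters: ', grid.best_params_)
print('Best cross-validated accuracy: ', grid.best_score_)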
Furthermore, KNN is sensitive to the scale and distribution of features, so feature scaling and normalization are often necessary. KNN is a non-parametric algorithm, meaning it doesn’t make any assumptions about the underlying data distribution.
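Since KNN works directly on distances, a common remedy is to put a scaler in front of the model, for example with a scikit-learn Pipeline. Here is a minimal sketch, once more reusing the synthetic classification data from above:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Scale features to zero mean and unit variance before the neighbor search
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors = 20))
scaled_knn.fit(X_train, y_train)
print('Hit Ratio = ', scaled_knn.score(X_test, y_test) * 100, '%')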
You can also check out my other newsletter The Weekly Market Analysis Report that sends tactical directional views every weekend to highlight the important trading opportunities using technical analysis that stem from modern indicators. The newsletter is free.
If you liked this article, do not hesitate to like and comment, to further the discussion!