K-Nearest-Neighbors Classification Algorithm in Practice

What the KNN algorithm is and how it is used in classification tasks.


Introduction

The K-Nearest-Neighbors (KNN) algorithm is a supervised classification algorithm: given a set of data points and their corresponding categories (class labels), it predicts which category a new query point belongs to. From the perspective of classification tasks, KNN is similar in purpose to Naive Bayes but differs in technicality and implementation. In Naive Bayes, we compute the likelihood by probabilistic methods, whereas in KNN, we compute distance measurements (along with a few optional extensions).

One great advantage of KNN is that, although it is primarily used for classification tasks, it can also be extended to regression tasks. In this article, we will try to understand and implement KNN from scratch.

Note: If you want to know how Naive Bayes works, then I would strongly recommend checking out my blog post - Naive Bayes in Practice.

Credits of Cover Image - Photo by Tom Rumble on Unsplash


General Intuition

Let's imagine we have two sets of points, each belonging to a different category. Now, given a new point, how can we predict or guess which category it belongs to?

knn-intuit.png

Assigning the query point to the right class label can be tricky. We cannot just assign a class label at random.

knn-intuit-1.png

Mathematically, we have to compute the distance between the query point and every data point, and then select the points with the smallest distances. The number of nearest points we keep is the value of k; let's say k is 5.

knn-intuit-2.png

Among the 5 nearest points, 3 belong to the blue category and 2 belong to the orange category. When we apply the majority voting technique, the class label of the query point will be blue, since 3 is greater than 2.

knn-intuit-3.png

The same concept is used in developing the algorithm. Now that we have understood the theory, let's implement this totally from scratch.
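Before building the full classifier, here is a minimal sketch of that idea in plain NumPy. The function name knn_predict and its arguments are illustrative only (it assumes X_train and y_train are NumPy arrays), not the implementation we build later.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=5):
    # Distance from the query point to every training point (Euclidean here)
    dists = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]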


Distance Measurements - Theory

To compute the distance between two points, we have various methods. Here, we will cover a few commonly used distance metrics.

Types of distance metrics

  1. Euclidean Distance
  2. Manhattan Distance
  3. Minkowski Distance
  4. Hamming Distance
  5. Cosine Distance

1. Euclidean Distance

Euclidean distance is the length of the line segment between any two points in the coordinate system. This metric computes the shortest (straight-line) distance between two points. This is also referred to as the L2 norm.

euclidean_dist.png

Mathematical formula

Let

$$p = [x_1, y_1]$$

and

$$q = [x_2, y_2]$$

the distance between p and q can be represented as -

$$pq = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

The same pattern is followed in the case of higher-dimensional vectors.
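For example, with p = [1, 2] and q = [4, 6]:

$$pq = \sqrt{(1 - 4)^2 + (2 - 6)^2} = \sqrt{9 + 16} = 5$$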

2. Manhattan Distance

Manhattan distance follows the concept of taxicab geometry. Instead of the straight-line Euclidean distance, the distance between two points is the sum of the absolute differences of their coordinates. This is also referred to as the L1 norm.

manhattan_dist.png

Mathematical formula

Let

$$p = [x_1, y_1]$$

and

$$q = [x_2, y_2]$$

the distance between p and q can be represented as -

$$pq = |x_1 - x_2| + |y_1 - y_2|$$

The same pattern is followed in the case of higher-dimensional vectors.
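Using the same example points p = [1, 2] and q = [4, 6]:

$$pq = |1 - 4| + |2 - 6| = 3 + 4 = 7$$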

3. Minkowski Distance

Minkowski distance is the generalized form of both the Euclidean and Manhattan distance metrics. This is also called the Lp norm. If p is 1 it reduces to the Manhattan distance, and if p is 2 it reduces to the Euclidean distance. (Since p here denotes the order of the norm, the two points below are named r and q.)

Mathematical formula

Let

$$r = [x_1, y_1]$$

and

$$q = [x_2, y_2]$$

the distance between r and q can be represented as -

$$rq = \sqrt[p]{|x_1 - x_2|^p + |y_1 - y_2|^p}$$

The same pattern is followed in the case of higher-dimensional vectors.
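For instance, with p = 3 and the same example points r = [1, 2] and q = [4, 6]:

$$rq = \sqrt[3]{|1 - 4|^3 + |2 - 6|^3} = \sqrt[3]{27 + 64} = \sqrt[3]{91} \approx 4.498$$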

4. Hamming Distance

The Hamming distance metric is highly recommended when we are dealing with binary vectors; if it is used on non-binary vectors, it produces poor results. Given two vectors of the same length, it is the number of bit positions at which the corresponding bits differ.

Mathematical formula

Let

$$p = [0, 1, 1, 0, 1, 0, 0]$$

and

$$q = [1, 0, 1, 0, 1, 0, 1]$$

the distance between p and q can be represented as -

hamming_dist.png

the number of positions at which the two vectors differ. Here, p and q differ in the first, second, and seventh positions, so the Hamming distance is 3.

5. Cosine Distance

In order to find the angular (cosine) distance, we first have to find the cosine similarity. Cosine similarity measures how similar two non-zero vectors are by means of their inner product, which gives the cosine of the angle between them.

cosine_dist.png

Mathematical formula

Let

$$p = [x_1, y_1]$$

and

$$q = [x_2, y_2]$$

the cosine similarity between p and q can be represented as -

$$\text{cosine-sim}(p, q) = \frac{p \cdot q}{\|p\|_2 \, \|q\|_2} = \frac{p \cdot q}{\sqrt{p \cdot p} \, \sqrt{q \cdot q}}$$

and cosine distance between p and q can be represented as -

$$\text{cosine-dist}(p, q) = 1 - \text{cosine-sim}(p, q)$$

The same pattern is followed in the case of higher-dimensional vectors.
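For example, with p = [1, 0] and q = [1, 1]:

$$\text{cosine-sim}(p, q) = \frac{1 \cdot 1 + 0 \cdot 1}{\sqrt{1} \, \sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.7071, \qquad \text{cosine-dist}(p, q) \approx 0.2929$$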

Distance Measurements - Code

Coding the above distance metrics becomes very easy with the help of NumPy, an excellent package for numerical computation and linear algebra. Once it is installed on the system, we can use it simply by importing the package.

Full code

import numpy as np

class DistanceMeasures():
    def __init__(self, point1, point2):
        """
        :param array point1: Point
        :param array point2: Point
        """
        self.point1 = np.array(point1)
        self.point2 = np.array(point2)

    def euclidean_measure(self):
        # L2 norm of the difference vector
        if len(self.point1) == len(self.point2):
            dist = np.linalg.norm(self.point1 - self.point2)
            return round(dist, 4)
        return None

    def manhattan_measure(self):
        # Sum of absolute coordinate-wise differences (L1 norm)
        if len(self.point1) == len(self.point2):
            dist = np.abs(self.point1 - self.point2).sum()
            return round(dist, 4)
        return None

    def minkowski_measure(self, p):
        # Generalized Lp norm; reduces to Manhattan (p=1) or Euclidean (p=2)
        if (len(self.point1) != len(self.point2)) or (p <= 0):
            return None
        if p == 1:
            return self.manhattan_measure()
        if p == 2:
            return self.euclidean_measure()
        dist = np.sum(np.abs(self.point1 - self.point2) ** p) ** (1 / p)
        return round(dist, 4)

    def hamming_measure(self):
        # Number of positions at which the two vectors differ
        if len(self.point1) == len(self.point2):
            return np.sum(self.point1 != self.point2)
        return None

    def cosine_measure(self):
        # 1 minus the cosine of the angle between the two vectors
        if len(self.point1) == len(self.point2):
            nume = np.dot(a=self.point1, b=self.point2)
            denome = np.sqrt(np.dot(a=self.point1, b=self.point1)) * np.sqrt(np.dot(a=self.point2, b=self.point2))
            return 1 - (nume / denome)
        return None
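As a quick sanity check, the class can be used on the example points from the theory section like this (output values are approximate):

dm = DistanceMeasures(point1=[1, 2], point2=[4, 6])

print(dm.euclidean_measure())     # 5.0
print(dm.manhattan_measure())     # 7
print(dm.minkowski_measure(p=3))  # about 4.498
print(dm.cosine_measure())        # about 0.0077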

Custom KNN - Code

We have already understood the intuition of how this algorithm works. Now, we just need to apply that concept in the code. Let's do that.

Note: This article doesn't cover the performance metrics or hyperparameter tuning of the model.

Libraries

import numpy as np
import pandas as pd
import plotly.graph_objects as go

from plotly.subplots import make_subplots
from sklearn.datasets import make_classification

Construction

class KNN():

The classifier is a class named KNN, inside which we define the methods described below.

__init__() Method

    def __init__(self, n_neighbors, train_df, test_df, label, metric='euclidean'):
        """
        :param int n_neighbors: Number of nearest neighbors
        :param str metric: Distance measurement metric. 
        Accepts `euclidean`, `manhattan`, `minkowski`, `hamming`, and `cosine` methods
        """
        self.n_neighbors = n_neighbors
        self.available_metrics = ['euclidean', 'manhattan', 'minkowski', 'hamming', 'cosine']
        self.metric = metric if metric in self.available_metrics else 'euclidean'

        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)

        self.X_train = self.X_train.values
        self.y_train = self.y_train.values
        self.X_test = self.X_test.values
        self.y_test = self.y_test.values

The above method is a constructor that takes five parameters -

  • n_neighbors → refers to the number of top nearest neighbors that are used to predict the class label.
  • train_df → refers to the subset of the data that is used to train the classifier.
  • test_df → refers to the subset of the data that is used to test the classifier.
  • label → refers to the name of the column that holds the class labels.
  • metric → refers to the type of distance metric that is used to compute the distance between points or vectors.

split_features_targets() Method

    def split_features_targets(self, df, label):
        """
        :param DataFrame df: The main dataset
        :param str label: The column name which signifies class labels
        :return: X, y
        """
        X = df.drop(columns=[label], axis=1)
        y = df[label]
        return X, y

The above method is used to separate features and targets from the data. It takes two parameters -

  • df → refers to the entire dataset that is passed for classifying.
  • label → refers to the name of the column in df that holds the class labels.

most_freq() Method

    def most_freq(self, l):
        """
        :param list l: List of values (can have duplicates)
        :return: Most occurred element from l
        """
        return max(set(l), key=l.count)

The above method returns the most frequently occurring element in the given list; for example, most_freq(l=[1, 0, 1, 1]) returns 1. Basically, this method replicates the majority voting technique that predicts the class label. It takes one parameter -

  • l → refers to the list of elements (class labels).

find_nearest_neighbors() Method

    def find_nearest_neighbors(self, X_new):
        """
        :param array X_new: Array/List of each feature
        :return: distance_labels - dictionary
        """
        distances = []

        for X_row in self.X_train:
            dm = DistanceMeasures(point1=X_row, point2=X_new)

            if (self.metric == 'euclidean'):
                distances.append(dm.euclidean_measure())
            elif (self.metric == 'manhattan'):
                distances.append(dm.manhattan_measure())
            elif (self.metric == 'minkowski'):
                distances.append(dm.minkowski_measure(p=3))
            elif (self.metric == 'hamming'):
                distances.append(dm.hamming_measure())
            else:
                distances.append(dm.cosine_measure())

        distance_labels = {d : l for (d, l) in zip(distances, self.y_train)}
        distance_labels = dict(
            sorted(distance_labels.items(), key=lambda x:x[0])[:self.n_neighbors]
        )
        return distance_labels

The above method is used to find the top nearest neighbors of the query point with respect to all the data points present in the training data; an illustrative example of its return value follows the parameter description below. It takes one parameter -

  • X_new → refers to the query point for which the class label is predicted.
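For instance, with n_neighbors=3 the returned dictionary maps each of the three smallest distances to the class label of the corresponding training point. The values below are purely illustrative:

# Hypothetical return value of find_nearest_neighbors() for n_neighbors=3
# {distance: class label of the matching training point}
{0.1423: 1, 0.2051: 1, 0.3187: 0}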

predict() Method

    def predict(self):
        if (len(self.X_test) == 1):
            # Pass the single row (1-D array), not the 2-D X_test array itself
            odl = self.find_nearest_neighbors(X_new=self.X_test[0])
            return self.most_freq(l=list(odl.values()))

        preds = []
        for test in self.X_test:
            odl = self.find_nearest_neighbors(X_new=test)
            preds.append(self.most_freq(l=list(odl.values())))
        return np.array(preds)

The above method is used to predict the class labels for the X_test data points. It takes no parameters but uses the most_freq() and find_nearest_neighbors() methods in order to complete the task.

score() Method

    def score(self, preds, with_plot=False):
        """
        :param array preds: Predictions
        :param boolean with_plot: True/False
        :return: accuracy level
        """
        if (len(self.y_test) == len(preds)):
            if with_plot:
                self.plot_it(preds=preds)
            return sum([1 if (i == j) else 0 for (i, j) in zip(self.y_test, preds)]) / len(preds)
        return "Lengths do not match"

The above method is used to compute the accuracy score, which indicates how well the model is performing. It is the fraction of correctly classified data points out of the total number of test data points (formalized just after the parameter list below). Generally, a model whose accuracy is greater than 0.80 (80%) is considered to be a good model. It takes two parameters -

  • preds → refers to an array of predictions (class labels).
  • with_plot → refers to a boolean flag that decides whether the data is visualized.
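Expressed as a formula, the score returned is -

$$\text{accuracy} = \frac{\text{number of correctly classified test points}}{\text{total number of test points}}$$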

plot_it() Method

    def plot_it(self, preds):
        """
        :param array preds: Predictions
        :return: None
        """
        fig = make_subplots(rows=1, cols=2)

        x_ = list(self.X_train[:, 0])
        y_ = list(self.X_train[:, 1])
        c_ = list(self.y_train)
        fig.add_trace(
            go.Scatter(x=x_, y=y_, mode='markers', marker=dict(color=c_), name='Training'),
            row=1, col=1
        )

        x_ = list(self.X_test[:, 0])
        y_ = list(self.X_test[:, 1])
        c_ = list(preds)
        fig.add_trace(
            go.Scatter(x=x_, y=y_, mode='markers', marker=dict(color=c_), name='Testing'),
            row=1, col=2
        )

        title = 'Training {} Testing'.format(' '*68)
        fig.update_layout(
            title=title, height=300, width=900, 
            margin=dict(l=0, b=0, t=40, r=0), showlegend=False
        )

        fig.show()
        return None

The above method is used to visualize the data. Both training and testing data are visualized separately. It takes one parameter -

  • preds → refers to an array of predictions (class labels).

Full code

class KNN():
    def __init__(self, n_neighbors, train_df, test_df, label, metric='euclidean'):
        """
        :param int n_neighbors: Number of nearest neighbors
        :param str metric: Distance measurement metric. 
        Accepts `euclidean`, `manhattan`, `minkowski`, `hamming`, and `cosine` methods
        """
        self.n_neighbors = n_neighbors
        self.available_metrics = ['euclidean', 'manhattan', 'minkowski', 'hamming', 'cosine']
        self.metric = metric if metric in self.available_metrics else 'euclidean'

        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)

        self.X_train = self.X_train.values
        self.y_train = self.y_train.values
        self.X_test = self.X_test.values
        self.y_test = self.y_test.values

    def split_features_targets(self, df, label):
        """
        :param DataFrame df: The main dataset
        :param str label: The column name which signifies class labels
        :return: X, y
        """
        X = df.drop(columns=[label], axis=1)
        y = df[label]
        return X, y

    def most_freq(self, l):
        """
        :param list l: List of values (can have duplicates)
        :return: Most occurred element from l
        """
        return max(set(l), key=l.count)

    def find_nearest_neighbors(self, X_new):
        """
        :param array X_new: Array/List of each feature
        :return: distance_labels - dictionary
        """
        distances = []

        for X_row in self.X_train:
            dm = DistanceMeasures(point1=X_row, point2=X_new)

            if (self.metric == 'euclidean'):
                distances.append(dm.euclidean_measure())
            elif (self.metric == 'manhattan'):
                distances.append(dm.manhattan_measure())
            elif (self.metric == 'minkowski'):
                distances.append(dm.minkowski_measure(p=3))
            elif (self.metric == 'hamming'):
                distances.append(dm.hamming_measure())
            else:
                distances.append(dm.cosine_measure())

        distance_labels = {d : l for (d, l) in zip(distances, self.y_train)}
        distance_labels = dict(
            sorted(distance_labels.items(), key=lambda x:x[0])[:self.n_neighbors]
        )
        return distance_labels

    def predict(self):
        if (len(self.X_test) == 1):
            # Pass the single row (1-D array), not the 2-D X_test array itself
            odl = self.find_nearest_neighbors(X_new=self.X_test[0])
            return self.most_freq(l=list(odl.values()))

        preds = []
        for test in self.X_test:
            odl = self.find_nearest_neighbors(X_new=test)
            preds.append(self.most_freq(l=list(odl.values())))
        return np.array(preds)

    def score(self, preds, with_plot=False):
        """
        :param array preds: Predictions
        :param boolean with_plot: True/False
        :return: accuracy level
        """
        if (len(self.y_test) == len(preds)):
            if with_plot:
                self.plot_it(preds=preds)
            return sum([1 if (i == j) else 0 for (i, j) in zip(self.y_test, preds)]) / len(preds)
        return "Lengths do not match"

    def plot_it(self, preds):
        """
        :param array preds: Predictions
        :return: None
        """
        fig = make_subplots(rows=1, cols=2)

        x_ = list(self.X_train[:, 0])
        y_ = list(self.X_train[:, 1])
        c_ = list(self.y_train)
        fig.add_trace(
            go.Scatter(x=x_, y=y_, mode='markers', marker=dict(color=c_), name='Training'),
            row=1, col=1
        )

        x_ = list(self.X_test[:, 0])
        y_ = list(self.X_test[:, 1])
        c_ = list(preds)
        fig.add_trace(
            go.Scatter(x=x_, y=y_, mode='markers', marker=dict(color=c_), name='Testing'),
            row=1, col=2
        )

        title = 'Training {} Testing'.format(' '*68)
        fig.update_layout(
            title=title, height=300, width=900, 
            margin=dict(l=0, b=0, t=40, r=0), showlegend=False
        )

        fig.show()
        return None

Custom KNN - Testing

Before testing the model, we need to first create a toy dataset and then split it into training and testing subsets.

Data Creation

X, y = make_classification(
    n_samples=300, 
    n_features=2, 
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    n_classes=3,
    random_state=60
)
df = pd.DataFrame(dict(col1=X[:,0], col2=X[:,1], label=y))

The first five rows of df look like -

     col1       col2  label
 0.774027  -1.654300      0
 0.487983  -0.202421      0
 0.762589  -0.440150      0
-1.920213   0.264316      1
 0.773207  -0.014049      2

Data splitter

def splitter(dframe, percentage=0.8, random_state=True):
    """
    :param DataFrame dframe: Pandas DataFrame
    :param float percentage: Percentage value to split the data
    :param boolean random_state: True/False
    :return: train_df, test_df
    """
    if random_state:
        dframe = dframe.sample(frac=1)

    thresh = round(len(dframe) * percentage)
    train_df = dframe.iloc[:thresh]
    test_df = dframe.iloc[thresh:]

    return train_df, test_df

The above function is used to divide the data into two parts.

train_df, test_df = splitter(dframe=df)

Object Creation

neigh = KNN(
    train_df=train_df, 
    test_df=test_df, 
    label='label', 
    n_neighbors=5, 
    metric='cosine'
)

Prediction

preds = neigh.predict()

Accuracy Score

acc = neigh.score(preds=preds, with_plot=True)
print(acc)

acc-score.PNG

The accuracy comes out to be at least 95%, which is a decent score, and hence the model performs well.


Challenges

  • It is better to use cross-validation techniques to find the optimal value of k and obtain better results. We did not discuss that in this article.

  • If the data is huge, classification can be a very tough job for the computer, as there will be time and space constraints; KNN happens to be slow compared to other models. Of course, we can make use of methods like KD-Trees and LSH (Locality Sensitive Hashing) to speed up the model.

  • Oftentimes, the plain majority voting technique may not work effectively. To avoid this problem, we can use weighted KNN, which works really well.

  • It is always good to use package implementations rather than writing our own algorithms, since the algorithms that we write may not be optimized enough. A short comparison using scikit-learn is sketched below.
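As a hedged point of comparison, here is a minimal sketch of how the same train/test split could be fed to scikit-learn's KNeighborsClassifier. The column names col1, col2, and label come from the toy dataset above, and the exact score will vary from run to run:

from sklearn.neighbors import KNeighborsClassifier

# Fit scikit-learn's KNN on the same toy training data
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(train_df[['col1', 'col2']], train_df['label'])

# Mean accuracy on the held-out test data
print(clf.score(test_df[['col1', 'col2']], test_df['label']))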

References

  1. KNN - Wikipedia article → bit.ly/3w4cUAW
  2. TDS blog article → bit.ly/3x5f3ML
  3. Distance metrics → bit.ly/3ctuBlL

End