This post was originally published by Doug Steen at Towards Data Science
k-Nearest Neighbors (KNN) is a supervised machine learning algorithm that can be used for either regression or classification tasks. KNN is non-parametric, which means that the algorithm does not make assumptions about the underlying distributions of the data. This is in contrast to a technique like linear regression, which is parametric, and requires us to find a function that describes the relationship between dependent and independent variables.
KNN has the advantage of being quite intuitive to understand. When used for classification, a query point (or test point) is classified based on the k labeled training points that are closest to that query point.
For a simplified example, see the figure below. The left panel shows a 2-d plot of sixteen data points — eight are labeled as green, and eight are labeled as purple. Now, the right panel shows how we would classify a new point (the black cross), using KNN when k=3. We find the three closest points, and count up how many ‘votes’ each color has within those three points. In this case, two of the three points are purple — so, the black cross will be labeled as purple.
2-d Classification using KNN when k=3
The distance between points is determined by using one of several versions of the Minkowski distance equation. The generalized formula for Minkowski distance can be represented as follows:
$$D(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{p} \right)^{1/p}$$
where X and Y are data points with coordinates x_i and y_i, n is the number of dimensions, and p is the Minkowski power parameter. When p=1, the distance is known as the Manhattan (or Taxicab) distance, and when p=2 the distance is known as the Euclidean distance. In two dimensions, the Manhattan and Euclidean distances between two points are easy to visualize (see the graph below). At higher orders of p, however, the Minkowski distance becomes more abstract.
Manhattan and Euclidean distances in 2-d
To implement my own version of the KNN classifier in Python, I’ll first want to import a few common libraries to help out.
To test the KNN classifier, I’m going to use the iris data set from sklearn.datasets. The data set has measurements (Sepal Length, Sepal Width, Petal Length, Petal Width) for 150 iris plants, split evenly among three species (0 = setosa, 1 = versicolor, and 2 = virginica). Below, I load the data and store it in a dataframe.
I’ll also separate the data into features (X) and the target variable (y), which is the species label for each plant.
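The loading code is not reproduced in this copy of the post. A minimal sketch of what these two steps might look like, assuming pandas and scikit-learn's `load_iris`; the names `df`, `X`, and `y` follow the text, while the `species` column name is an assumption made here for illustration:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris measurements and the species labels (0, 1, 2)
iris = load_iris()

# Store the four measurements and the label together in one dataframe
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target  # column name is an assumption

# Separate the features (X) from the target variable (y)
X = df.drop(columns="species")
y = df["species"]

print(X.shape)      # (150, 4)
print(y.nunique())  # 3
```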
Building out the KNN Framework
Creating a functioning KNN classifier can be broken down into several steps. While KNN includes a bit more nuance than this, here’s my bare-bones to-do list:
1. Define a function to calculate the distance between two points
2. Use the distance function to get the distance between a test point and all known data points
3. Sort distance measurements to find the points closest to the test point (i.e., find the nearest neighbors)
4. Use majority class labels of those closest points to predict the label of the test point
5. Repeat steps 1 through 4 until all test data points are classified
1. Define a function to calculate distance between two points
First, I define a function called minkowski_distance that takes two data points (a and b) and a Minkowski power parameter p, and returns the distance between the two points. Note that this function calculates distance exactly like the Minkowski formula I mentioned earlier. By making p an adjustable parameter, I can decide whether I want to calculate Manhattan distance (p=1), Euclidean distance (p=2), or some higher order of the Minkowski distance.
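The function body is not shown here, so the following is only a sketch of how such a function could be written in plain Python. The name and the arguments `a`, `b`, and `p` come from the text; the default of `p=2` and the length check are assumptions added for this example:

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Sum |a_i - b_i|^p over every dimension, then take the p-th root
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)


print(minkowski_distance([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
print(minkowski_distance([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)
```

The two calls use the same pair of points: walking along the axes costs 3 + 4 = 7, while the straight line is the familiar 3-4-5 right triangle.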
2. Use the distance function to get distance between a test point and all known data points
For step 2, I simply compute the minkowski_distance from the test point to every labeled point in X and store the resulting distances in a dataframe.
3. Sort distance measurements to find the points closest to the test point
In step 3, I use the pandas .sort_values() method to sort by distance in ascending order, and keep only the first 5 rows, which are the 5 points closest to the test point.
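Steps 2 and 3 can be sketched together as follows. This is an illustration under stated assumptions rather than the original code: the test point `test_pt` is a made-up measurement, the column name `dist` and the variables `df_dists` and `df_nn` are hypothetical, and `p=2` is chosen arbitrarily.

```python
import pandas as pd
from sklearn.datasets import load_iris


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)


iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Hypothetical test point: sepal length, sepal width, petal length, petal width
test_pt = [4.8, 2.7, 2.5, 0.7]

# Step 2: distance from the test point to every labeled point in X
distances = [minkowski_distance(test_pt, row, p=2) for row in X.values]
df_dists = pd.DataFrame({"dist": distances}, index=X.index)

# Step 3: sort ascending and keep the 5 smallest distances
df_nn = df_dists.sort_values(by="dist").head(5)
print(df_nn)
```

Keeping the original index on `df_dists` matters: the index of `df_nn` is what links each nearest neighbor back to its species label.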
4. Use majority class labels of those closest points to predict the label of the test point
For this step, I use collections.Counter to keep track of the labels that coincide with the nearest neighbor points. I then use the .most_common() method to return the most commonly occurring label. Note: if there is a tie between two or more labels for the title of “most common” label, the one that was first encountered by the Counter() object will be the one that gets returned.
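A small sketch of the voting step, including the tie behavior described above. The label lists are hypothetical values chosen for illustration, not output from the iris data:

```python
from collections import Counter

# Hypothetical labels of the 5 nearest neighbors, closest first
neighbor_labels = [1, 1, 2, 1, 2]

counter = Counter(neighbor_labels)
# most_common(1) returns a list with one (label, count) pair
prediction = counter.most_common(1)[0][0]
print(prediction)  # 1

# Tie: labels 2 and 1 both appear twice, and 2 was encountered first
tie_winner = Counter([2, 1, 1, 2]).most_common(1)[0][0]
print(tie_winner)  # 2
```

Because the neighbors are listed closest first, this tie rule favors the label of the nearer neighbor, which is a reasonable default for a bare-bones classifier.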
5. Repeat steps 1 through 4 until all test data points are classified
In this step, I put the code I’ve already written to work and write a function to classify the data using KNN. First, I perform a train_test_split on the data (75% train, 25% test), and then scale the data using StandardScaler(). Since KNN is distance-based, it is important to make sure that the features are scaled properly before feeding them into the algorithm.
Additionally, to avoid data leakage, it is good practice to scale the features after the train_test_split has been performed. First, fit the scaler on the training set only and scale it (scaler.fit_transform(X_train)), and then use those fitted parameters to scale the test set (scaler.transform(X_test)). This way, I can ensure that no information outside of the training data is used to create the model.
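The split-then-scale order described above might look like the sketch below. The 75/25 split and `StandardScaler` come from the text; `random_state=1` is an assumption added only to make the example repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Split first (75% train, 25% test); random_state is an assumption
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training set only
X_train = scaler.fit_transform(X_train)
# Reuse those training statistics to scale the test set
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)
```

After scaling, each training feature has mean 0 and standard deviation 1. The test features generally do not, because they are scaled with the training statistics.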
Next, I define a function called knn_predict that takes in all of the training and test data, k, and p, and returns the predictions my KNN classifier makes for the test set (y_hat_test). This function doesn’t really include anything new — it is simply applying what I’ve already worked through above. The function should return a list of label predictions containing only 0’s, 1’s and 2’s.
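One way such a function could be put together is sketched below. The name `knn_predict`, the arguments `k` and `p`, and the result `y_hat_test` come from the text; everything else is an assumption, including `random_state=1`, `p=2`, and the choice to leave the test labels out of the signature since prediction does not need them.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)


def knn_predict(X_train, X_test, y_train, k, p):
    """Predict a label for each row of X_test by majority vote of k neighbors."""
    y_hat_test = []
    for test_point in X_test:
        # Distance from this test point to every training point
        distances = [minkowski_distance(test_point, tp, p) for tp in X_train]
        # Indices of the k smallest distances, closest first
        nearest = np.argsort(distances, kind="stable")[:k]
        # Majority vote among the labels of those neighbors
        votes = Counter(y_train[i] for i in nearest)
        y_hat_test.append(int(votes.most_common(1)[0][0]))
    return y_hat_test


X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1  # random_state is an assumption
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

y_hat_test = knn_predict(X_train, X_test, y_train, k=5, p=2)
print(y_hat_test)
```

The exact predictions depend on how the data was split, so a run of this sketch need not reproduce the list reported in the post.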
[0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0, 2, 1, 0, 0, 1, 2, 1, 2, 1, 2, 2, 0, 1, 0]
And there they are! These are the predictions that this home-brewed KNN classifier has made on the test set. Let’s see how well it worked:
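Accuracy here is simply the fraction of test points whose predicted label matches the true label, which is the quantity `sklearn.metrics.accuracy_score` computes by default. A minimal plain-Python sketch, using short hypothetical label lists rather than the actual iris results:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 4 of the 5 predictions are correct
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]
print(accuracy(y_true, y_pred))  # 0.8
```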
Looks like the classifier achieved 97% accuracy on the test set (37 of the 38 test points classified correctly). Not too bad at all! But how do I know if it actually worked correctly? Let’s check the result of sklearn’s KNeighborsClassifier on the same data:
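A sketch of that comparison, assuming the same 75/25 split and scaling as before. `n_neighbors=5` matches the k used in this post; `p=2` and `random_state=1` are assumptions, since the values actually used are not shown here.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1  # random_state is an assumption
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# k = 5 neighbors; p = 2 (Euclidean) is an assumption
clf = KNeighborsClassifier(n_neighbors=5, p=2)
clf.fit(X_train, y_train)
y_pred_test = clf.predict(X_test)

sk_accuracy = accuracy_score(y_test, y_pred_test)
print(f"Sklearn KNN Accuracy: {sk_accuracy}")
```

The number printed by this sketch depends on the assumed split, so it may differ from the figure the author obtained.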
Sklearn KNN Accuracy: 0.9736842105263158
Nice! sklearn’s implementation of the KNN classifier gives us the exact same accuracy score.
Exploring the effect of varying k
My KNN classifier performed quite well with the selected value of k = 5. KNN doesn’t have as many tunable parameters as other algorithms like Decision Trees or Random Forests, but k is one of them. Let’s see how the classification accuracy changes when I vary k:
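The experiment can be sketched as a loop over k. To keep the example short, this version swaps in sklearn's `KNeighborsClassifier` instead of the home-brewed function; the range of k values, `p=2`, and `random_state=1` are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1  # random_state is an assumption
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Test accuracy for k = 1, 6, 11, ..., 96 (k cannot exceed the 112 training points)
accuracies = {}
for k in range(1, 101, 5):
    clf = KNeighborsClassifier(n_neighbors=k, p=2)
    clf.fit(X_train, y_train)
    accuracies[k] = accuracy_score(y_test, clf.predict(X_test))

for k, acc in accuracies.items():
    print(f"k={k:3d}  accuracy={acc:.3f}")
```

Plotting `accuracies` against k (for example with matplotlib) gives a curve that can be compared with the behavior described in the post.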
In this case, using nearly any k value less than 20 results in great (>95%) classification accuracy on the test set. However, when k becomes greater than about 60, accuracy really starts to drop off. This makes sense, because the data set only has 150 observations, with just 50 plants per species. When k is that high, the neighbors must include plants from other species, so the classifier is considering labeled training data points that are way too far from the test points.
Every neighbor gets a vote — or do they?
In writing my own KNN classifier, I chose to overlook one clear hyperparameter tuning opportunity: the weight that each of the k nearest points has in classifying a point. In sklearn’s KNeighborsClassifier, this is the weights parameter, and it can be set to ‘uniform’, ‘distance’, or a user-defined function.
When set to ‘uniform’, each of the k nearest neighbors gets an equal vote in labeling a new point. When set to ‘distance’, the neighbors closest to the new point are weighted more heavily than the neighbors farther away. There are certainly cases where weighting by ‘distance’ would produce better results, and the only way to find out is through hyperparameter tuning.
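To make the difference concrete, here is a small plain-Python sketch of the two voting schemes. It weights each neighbor by the inverse of its distance, which is how sklearn documents the ‘distance’ option; the function names, the neighbor list, and the shortcut for a zero distance are assumptions made for this example, not sklearn's implementation.

```python
from collections import Counter, defaultdict


def uniform_vote(neighbors):
    """Each (distance, label) neighbor gets one equal vote."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def distance_weighted_vote(neighbors):
    """Each neighbor votes with weight 1 / distance."""
    scores = defaultdict(float)
    for dist, label in neighbors:
        if dist == 0:
            # Simplification: an exact match decides the label outright
            return label
        scores[label] += 1.0 / dist
    return max(scores, key=scores.get)


# Hypothetical neighbors: one very close purple point, two distant green points
neighbors = [(0.5, "purple"), (2.0, "green"), (2.5, "green")]

print(uniform_vote(neighbors))            # green
print(distance_weighted_vote(neighbors))  # purple
```

With equal votes, green wins 2 to 1. With inverse-distance weights, purple scores 1/0.5 = 2.0 against 1/2.0 + 1/2.5 = 0.9 for green, so the single close neighbor outweighs the two distant ones.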
Now, make no mistake — sklearn’s implementation is undoubtedly more efficient and more user-friendly than what I’ve cobbled together here. However, I found it a valuable exercise to work through KNN from ‘scratch’, and it has only solidified my understanding of the algorithm. I hope it did the same for you!