KNN Algorithm in Machine Learning

Updated on Mar 24, 2023 16:57 IST

In this article, we will briefly discuss the KNN algorithm in machine learning, how to find the value of K, and how to build a KNN classifier, and finally we will cover the pros and cons of KNN.

2022_05_feature-images_KNN.jpg

As a Machine Learning practitioner working with labelled training datasets, one of the simplest algorithms you will encounter is K Nearest Neighbors (KNN). This algorithm works for both classification and regression problems. You might have heard the popular phrase “Birds of a feather flock together”, which basically means you are known by the company you keep. Our behavior and characteristics tend to be affected by the people around us. Similarly, the KNN algorithm in machine learning determines the characteristics of a data point from the data points surrounding it.

In this article, we will focus on how KNN in machine learning is used for classification.

Quick Introduction to KNN

In the KNN model, learning is based on the nature of the data points (neighbors) that are present close to the query data point in the training dataset. The number of training examples or nearest neighbors is given by ‘K’.

Let’s understand with an analogy – Say you have a single close friend you hang out with all the time. You will probably end up sharing similar interests with them. In KNN, that corresponds to K = 1.

Similarly, if you have a group of four friends, your characteristics will tend to be the average of them all. That is KNN with a value of K = 4.

Once the value of K is determined,

  • A KNN classifier determines the class of a data point through a majority vote of nearest neighbors.
  • A KNN regressor predicts a numeric value by calculating the mean of the target values of the nearest neighbors.
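To make the difference between the two concrete, here is a minimal sketch using Scikit-learn’s KNeighborsClassifier and KNeighborsRegressor. The toy data and the variable names are made up for illustration and are not part of the diabetes example later in this article.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy 1-D training data (made up for illustration)
X_train = [[1], [2], [3], [10], [11], [12]]
y_class = [0, 0, 0, 1, 1, 1]                  # class labels
y_value = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # numeric targets

query = [[2.5]]

# Classifier: majority vote among the 3 nearest neighbors (the points 1, 2 and 3)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_class)
predicted_class = clf.predict(query)[0]

# Regressor: mean target value of the same 3 neighbors
reg = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_value)
predicted_value = reg.predict(query)[0]

print(predicted_class, predicted_value)
```

All three neighbors of the query belong to class 0, so the classifier predicts 0, while the regressor returns the mean of 1.0, 2.0 and 3.0.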

The KNN algorithm is an instance-based method and is called a lazy learner. Lazy because it doesn’t explicitly learn from the training data. It just memorizes the training instances which are used as “knowledge” during prediction.

How to find the optimal value of K?

The value of K plays a significant role in the performance of the model. The chosen K value should be neither too large nor too small. As K increases, the error usually goes down, then stabilizes, and starts rising again when K becomes too large.

If K is too small, the model follows the noise in the training data: it might perform very well on the training data but fail badly on the testing data (overfitting).

If K is too large, the resulting model is too generalized, which results in underfitting. A large K value would also increase the computational expense of your model.
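The two failure modes described above can be seen in a small experiment. This is only a sketch on synthetic data generated with make_classification (a stand-in, not the diabetes dataset used later); the sample sizes, the noise level and the three K values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy two-class data (a stand-in, not the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Compare a very small K, a moderate K, and the largest possible K
scores = {}
for k in (1, 15, len(X_tr)):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"K={k:>3}  train={scores[k][0]:.2f}  test={scores[k][1]:.2f}")
```

With K=1 every training point is its own nearest neighbor, so the training accuracy is perfect even though the labels are noisy. With K equal to the size of the training set, every query gets the same prediction (the overall majority class), which is the extreme case of underfitting.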

2022_05_image-221.jpg

The optimal value of K depends on the dataset being used and largely depends on trial-and-error. Domain knowledge plays a vital role in this.

It is preferable to choose an odd number for K: in a two-class problem, this rules out ties in the majority vote of a KNN classifier. With more than two classes, ties remain possible even when K is odd.
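As a quick illustration of why an odd K helps in a two-class problem, the sketch below counts votes with Python’s collections.Counter. The vote() helper and the neighbor labels are hypothetical and exist only for this example.

```python
from collections import Counter

def vote(labels):
    """Return (winner, is_tie) for a list of neighbor labels."""
    counts = Counter(labels).most_common()
    is_tie = len(counts) > 1 and counts[0][1] == counts[1][1]
    return counts[0][0], is_tie

# Labels of the nearest neighbors, ordered by increasing distance (made up)
neighbor_labels = ["diabetic", "healthy", "healthy", "diabetic", "healthy"]

print(vote(neighbor_labels[:4]))  # K=4: two votes each, so the vote is tied
print(vote(neighbor_labels[:5]))  # K=5: three votes to two, a clear winner
```

When the vote is tied, the "winner" is decided by an arbitrary rule (here, whichever label was seen first), which is exactly what an odd K avoids for two classes.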

More ways that can help with the estimation of the K value are:

  • Square Root Method: Use the square root of the number of data points in the training dataset as a starting estimate for K.
  • Cross-Validation Method: Start with K=1, run cross-validation (5 to 10-fold), measure the accuracy, and repeat for increasing values of K until the results become consistent.
  • Elbow Method: As stated above, as the value of K increases, the error stabilizes at some point before it rises again. We choose the value at the beginning of the stable zone as the optimal value of K.
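The square-root and cross-validation approaches listed above can be sketched as follows. This assumes synthetic data from make_classification rather than a real dataset, and the range of candidate K values (odd numbers up to 31) is an arbitrary choice for illustration.

```python
import math

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

# Square-root heuristic: a starting point, nudged to an odd number
k_sqrt = int(math.sqrt(len(X_demo)))
if k_sqrt % 2 == 0:
    k_sqrt += 1

# Cross-validation: mean 5-fold accuracy for each odd K up to 31
candidate_ks = list(range(1, 32, 2))
cv_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5).mean()
    for k in candidate_ks
]
best_k = candidate_ks[cv_scores.index(max(cv_scores))]

print("Square-root starting point:", k_sqrt)
print("Best K by 5-fold CV:", best_k)
```

Plotting cv_scores against candidate_ks would give the curve on which the elbow (the start of the stable zone) can be read off.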

How to find the K Nearest Neighbors?

To determine which data points are close enough to be considered nearest neighbors, we commonly use the following distance measuring techniques:

2022_05_image-222.jpg
  • Euclidean distance (most commonly used method)
  • Manhattan distance
  • Minkowski distance

The figure below illustrates how to calculate the Euclidean distance between two points in a 2D space:

2022_05_image-223.jpg
2022_05_euclidean-distance.jpg
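The three distance measures are closely related: Minkowski distance of order p reduces to Manhattan distance for p = 1 and to Euclidean distance for p = 2. The following sketch shows this in plain Python; the function names and the two sample points are chosen for illustration only.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    return minkowski(a, b, 2)   # p = 2

def manhattan(a, b):
    return minkowski(a, b, 1)   # p = 1

p1, p2 = (1, 2), (4, 6)

print(euclidean(p1, p2))   # sqrt(3**2 + 4**2)
print(manhattan(p1, p2))   # 3 + 4
```

For these two points the coordinate differences are 3 and 4, so the Euclidean distance is 5 and the Manhattan distance is 7.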

How Does KNN Work?

  • Step 1: Determine the value of K.
  • Step 2: Calculate the distance from the query data point to every data point in the training set, using one of the distance measures stated above.
  • Step 3: Rank the data points by increasing distance. The K points closest to the query point are its nearest neighbors.
  • Step 4: Use these K neighbors together to make the prediction for the query point:
    • KNN Classifier model: count the neighbors in each category and predict the majority class.
    • KNN Regressor model: predict the average of the neighbors’ target values.
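The steps above can be sketched from scratch in a few lines of Python. This is a minimal illustration that assumes Euclidean distance and a brute-force search; the knn_predict() function and the sample points are hypothetical, and library implementations such as Scikit-learn’s are considerably more optimized.

```python
from collections import Counter
import math

def knn_predict(X_train, y_train, query, k, task="classification"):
    """Predict for one query point by following Steps 2 to 4."""
    # Step 2: Euclidean distance from the query to every training point
    distances = [math.dist(x, query) for x in X_train]

    # Step 3: indices of the k training points closest to the query
    nearest = sorted(range(len(X_train)), key=lambda i: distances[i])[:k]
    neighbor_targets = [y_train[i] for i in nearest]

    # Step 4: majority vote (classifier) or mean (regressor)
    if task == "classification":
        return Counter(neighbor_targets).most_common(1)[0][0]
    return sum(neighbor_targets) / k

# Made-up 2-D points with two classes
X_train = [(1, 1), (1, 2), (2, 2), (6, 6), (7, 7), (6, 7)]
y_train = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(X_train, y_train, query=(2, 1), k=3))
print(knn_predict(X_train, y_train, query=(6, 5), k=3))
```

The first query sits among the "A" points and the second among the "B" points, so the two calls return "A" and "B" respectively. Note that math.dist requires Python 3.8 or later.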

Building a KNN Classifier in Python

Problem Statement:

For demonstration, we are going to build a classifier model using the K Nearest Neighbors algorithm to predict whether a patient has diabetes or not, based on the features in the given data. We will also find the optimal value of K using GridSearchCV from the Scikit-learn library.

So, let’s get started!

Dataset Description:

The dataset has 8 features and one target column, as given below:

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index
  • DiabetesPedigreeFunction: Diabetes pedigree function
  • Age: Age in years
  • Outcome: Class variable (0 or 1)

The Outcome column is our target variable.

Tasks to be performed:

  1. Load the data
  2. Perform feature scaling
  3. Perform label encoding
  4. Split the data into training and testing sets
  5. Fit the KNN Classifier model
  6. Generate an accuracy plot
  7. Create a KNN Classifier with K=7
  8. Fit the Classifier and get the Accuracy Score
  9. Create a confusion matrix
  10. Perform cross-validation

Step 1 – Load the data

 
```python
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# Load the dataset
data = pd.read_csv('diabetes.csv')
data.head()
```
2022_05_image-244.jpg

Step 2 – Perform feature scaling

We will use StandardScaler() to standardize the features (zero mean and unit variance), so that they are on a common scale when the classifier computes distances:

 
```python
from sklearn.preprocessing import StandardScaler

# Perform feature scaling
sc_X = StandardScaler()
X = pd.DataFrame(sc_X.fit_transform(data.drop(['Outcome'], axis=1).values))
y = data['Outcome'].values
```
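The snippet above scales the whole dataset before it is split, which means the test rows influence the mean and standard deviation used for scaling. A common alternative, sketched here under the assumption of synthetic stand-in data, is to wrap the scaler and the classifier in a Scikit-learn pipeline so that the scaler is fitted on the training rows only. The variable names and K=7 are illustrative choices, not part of the original walkthrough.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 diabetes features (not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only,
# then applies the same transformation to the test rows
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
model.fit(X_tr, y_tr)

scaler = model.named_steps["standardscaler"]
print("Test accuracy:", model.score(X_te, y_te))
```

The fitted scaler’s statistics come from the training rows alone, so no information from the test set leaks into the preprocessing.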

Step 3 – Perform label encoding

We will use LabelEncoder() to convert the target variable class labels into a numeric form so they become machine-readable. In this dataset the Outcome column is already 0 or 1, so the encoding leaves the labels unchanged:

 
```python
from sklearn.preprocessing import LabelEncoder

# Perform label encoding
le = LabelEncoder()
y = le.fit_transform(y)
```

Step 4 – Split the data into training and testing sets

We will split the data into 70% training and 30% testing sets:

 
```python
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
```

Step 5 – Fit the KNN Classifier model

 
```python
from sklearn.neighbors import KNeighborsClassifier

# Set up arrays to store training and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    # Set up a KNN classifier with k neighbors
    knn = KNeighborsClassifier(n_neighbors=k)

    # Fit the model
    knn.fit(X_train, y_train)

    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)

    # Compute accuracy on the test set
    test_accuracy[i] = knn.score(X_test, y_test)
```

Step 6 – Generate an accuracy plot

 
```python
# Generate plot
plt.title('k-NN Varying number of neighbors')
plt.plot(neighbors, test_accuracy, label='Testing Accuracy')
plt.plot(neighbors, train_accuracy, label='Training accuracy')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()
```
2022_05_image-247.jpg

From the above plot, we can observe that the testing accuracy is highest at K=7 and K=8. As discussed above, we prefer an odd value for K, so we choose K=7.

Step 7 – Create a KNN Classifier with K=7

So, let’s create a KNeighborsClassifier with 7 nearest neighbors. The loop in Step 5 leaves the variable knn set to the last value tried (K=8), so the classifier is created afresh at the top of the code in Step 8.

Step 8 – Fit the Classifier and get the Accuracy Score

```python
from sklearn.metrics import accuracy_score

# Create a KNN classifier with 7 neighbors
knn = KNeighborsClassifier(n_neighbors=7)

# Fit the model
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Calculate model accuracy
print("Accuracy of test set=", accuracy_score(y_test, y_pred) * 100)
```
2022_05_image-248.jpg

Step 9 – Create a confusion matrix

A confusion matrix describes the performance of a classifier model on a set of test data for which the true values are known. We will calculate the confusion matrix using the confusion_matrix() function of Scikit-learn:

```python
from sklearn.metrics import confusion_matrix

# Confusion matrix (rows: true class, columns: predicted class)
# The result is stored in cm so that it does not shadow the imported function
cm = confusion_matrix(y_test, y_pred)
cm
```
2022_05_image-249.jpg

Considering the obtained confusion matrix, we have:

  • True negative = 131
  • False positive = 19
  • True positive = 42
  • False negative = 39
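From these four counts we can derive the usual summary metrics by hand. The sketch below simply plugs the numbers listed above into the standard formulas for accuracy, precision, recall and F1 score; the variable names are chosen for illustration.

```python
# Counts from the confusion matrix above
tn, fp, fn, tp = 131, 19, 39, 42

total = tn + fp + fn + tp
accuracy = (tp + tn) / total            # share of all predictions that are correct
precision = tp / (tp + fp)              # share of predicted positives that are correct
recall = tp / (tp + fn)                 # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The accuracy works out to about 75%, but the recall is only about 52%: the model misses nearly half of the diabetic patients in the test set, which the accuracy score alone does not reveal.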

Step 10 – Perform cross-validation

Cross-validation is a technique to evaluate predictive models by dividing the dataset into several folds. Each fold is used once as a test set to evaluate the model while the remaining folds are used to train it, and the scores are averaged.

We will use Scikit-learn’s GridSearchCV class, i.e., Grid Search cross-validation, which runs this procedure for every candidate value of K, as shown:

 
```python
from sklearn.model_selection import GridSearchCV

# In KNN, the parameter to be tuned is n_neighbors
param_grid = {'n_neighbors': np.arange(1, 50)}

knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, param_grid, cv=5)
knn_cv.fit(X, y)
```
2022_05_image-250.jpg
knn_cv.best_score_
2022_05_image-251.jpg
knn_cv.best_params_

So, our KNN classifier with 17 nearest neighbors achieves the best mean cross-validated accuracy of about 76%.

Pros and Cons of KNN

Pros –

  • Simple and intuitive algorithm.
  • It is a non-parametric algorithm, which means it makes no assumptions about the underlying data distribution.
  • There is no explicit training phase, so new data points can be added at any time, which allows the algorithm to respond quickly to real-time input changes.
  • Works well with multiclass data in classification problems.
  • Can also be used for regression problems.
  • It can handle non-linear problems, since it does not assume a linear decision boundary.

Cons –

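  • Prediction is slow on large datasets, because the distance to every training point is computed for each query.
  • Sensitive to outliers.
  • Suffers from the curse of dimensionality.
  • Features need to be on comparable scales for the algorithm to predict accurately, which is why we scaled them in Step 2.
  • No built-in capability to deal with missing values in the data.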

Conclusion

The KNN algorithm in machine learning is a simple, yet versatile supervised algorithm that can be used to solve both classification and regression problems. Machine learning & intelligence are rapidly growing areas in the IT industry and have a huge impact on big businesses across the globe.

About the Author

This is a collection of insightful articles from domain experts in the fields of Cloud Computing, DevOps, AWS, Data Science, Machine Learning, AI, and Natural Language Processing. The range of topics caters to upski... Read Full Bio