KNN algorithm simple explanation with code
KNN ALGORITHM
K-nearest neighbors is a supervised machine learning algorithm that is non-parametric (it makes no underlying assumptions about the data).
KNN can be used for both regression and classification, though it is mostly used for classification.
KNN is a lazy algorithm: it stores all the training data and uses it to classify new data points, which may lead to excessive use of computer memory.
Although there are many classification algorithms, it is usually better to learn KNN first, as it provides a basis for more complex algorithms.
KNN can lead to overfitting if a small value is chosen for k and underfitting if a large value is used for k.
It is usually advisable to select an odd number for the value of k to avoid ties when the nearest neighbors vote on a class.
A dataset with many outliers is usually classified better if a large k value is chosen, since more neighbors smooth out the influence of noisy points.
K is the number of nearest neighbors considered when classifying a new point; the distance from the test data points to the training data points can be measured using the Euclidean, Manhattan, Minkowski, and Hamming distances. Points that are close to one another are put in the same class.
EUCLIDEAN DISTANCE
This is also called the Pythagorean distance as it applies the principles of Pythagoras' theorem. The following example will help you understand.
Let's say we have the coordinate points (2,8) and (5,12):
x1 = 2
x2 = 5
y1 = 8
y2 = 12
The Euclidean distance can be calculated as follows:
d = √((x2−x1)² + (y2−y1)²)
= √((5−2)² + (12−8)²)
= √(9 + 16)
Therefore the Euclidean distance is √25 = 5.
Euclidean distance is mostly used when the data is continuous.
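As a quick check of this worked example, here is a minimal sketch in Python (the point variables p1 and p2 are my own names):
import numpy as np
p1 = np.array([2, 8])   #the first coordinate point
p2 = np.array([5, 12])  #the second coordinate point
#Euclidean distance: square the differences, sum them and take the square root
d = np.sqrt(np.sum((p2 - p1) ** 2))
print(d)  #prints 5.0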
MINKOWSKI DISTANCE
This is a generalization of the Euclidean and Manhattan distances. It defines a norm on a vector space, as the distance is determined between vectors. The Minkowski distance can be calculated as follows:
M = (Σ |xi − yi|^p)^(1/p)
MANHATTAN DISTANCE
When the p value in the Minkowski distance is equal to one, it becomes the Manhattan distance; when p is two, it becomes the Euclidean distance.
Another distance, the supremum (Chebyshev) distance, arises as the p value in the Minkowski formula tends to infinity.
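To make the relationship between these distances concrete, here is a minimal sketch reusing the two points from the Euclidean example (minkowski is my own helper function, not a library call):
import numpy as np
def minkowski(a, b, p):
    #Minkowski distance: the sum of |a_i - b_i|^p, raised to the power 1/p
    return np.sum(np.abs(a - b) ** p) ** (1 / p)
a = np.array([2, 8])
b = np.array([5, 12])
print(minkowski(a, b, 1))     #p=1 gives the Manhattan distance: 3 + 4 = 7
print(minkowski(a, b, 2))     #p=2 gives the Euclidean distance: 5.0
print(np.max(np.abs(a - b)))  #the Chebyshev distance, the limit as p grows: 4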
KNN IN PYTHON.
I shall begin by importing the essential libraries for this tutorial:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import seaborn as sns
I am going to use a dataset which can be downloaded from Kaggle. This dataset classifies sales based on the age of customers.
df=pd.read_csv(r"C:\Users\OWITI H\Documents\mathe.csv") #reading the dataset from where it is stored on my PC
df.head() #observing what the data looks like by printing the first 5 rows
Output
   Performance  Age  Sickoff Gender Citizen Povert_index          Sales
0           78   45       14      M   local         poor           good
1           46   21        1      M   local         poor        average
2           54   33        4      F   exper         poor  below average
3           49   34        2      M   exper     well_off           good
4           40   22        0      F   local     well_off           good
df1=df.drop(['Performance','Sickoff','Gender','Citizen','Povert_index'],axis='columns') #dropping the Performance, Sickoff, Gender, Citizen and Povert_index columns in order to remain with fewer columns for simplicity
Now let's look at the dataframe:
df1.head()
Output
   Age          Sales
0   45           good
1   21        average
2   33  below average
3   34           good
4   22           good
Plotting a scatter plot to see the classification of sales based on age:
sns.relplot(x=df1.Sales, y=df1.Age) #newer seaborn versions require x and y as keyword arguments
x=df1['Age']
y=df1['Sales']
#Letting x value be age and y value be sales
#splitting the dataset into training and test sets
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=4)
#creating a KNN classifier object called kn with k set to 3
kn=KNeighborsClassifier(n_neighbors=3)
#fitting the model: ages are the features and sales classes are the labels
kn.fit(np.array(x_train).reshape(-1,1),y_train)
#predicting the classes of the test data points
pred=kn.predict(np.array(x_test).reshape(-1,1))
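The accuracy_score function imported at the start can now be used to check how well the model does on the held-out test set. Here is a minimal sketch (the exact scores will depend on the dataset); it also tries a few odd values of k, as discussed earlier:
#comparing the predicted sales classes against the true test labels
print(accuracy_score(y_test, pred))
#trying a few odd k values to see which one generalizes best
for k in [1, 3, 5, 7]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(np.array(x_train).reshape(-1,1), y_train)
    print(k, model.score(np.array(x_test).reshape(-1,1), y_test))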
Now that we have looked at KNN, let's look at KMeans clustering. Although the two are often confused, the main difference is that KNN is a supervised algorithm (classification) while KMeans is unsupervised (clustering). Supervised means it uses labeled data; unsupervised uses unlabeled data.
PYTHON CODE FOR KMEANS
I shall use the same Kaggle dataset that I used earlier with KNN.
Most of the libraries I imported earlier remain the same, except that sklearn.neighbors is replaced with sklearn.cluster.
#importing the relevant library
from sklearn.cluster import KMeans
#constructing an elbow plot to enable selection of an accurate value for k
#first we construct an empty list called error, then loop over values of k and plot
error=[]
K=range(1,5)
for k in K:
    kn1=KMeans(n_clusters=k)
    kn1.fit(np.array(x_train).reshape(-1,1)) #KMeans is unsupervised, so only the features are passed
    error.append(kn1.inertia_) #inertia_ is the sum of squared distances to the nearest cluster centre
plt.plot(K, error) #k on the x-axis, error on the y-axis
plt.xlabel('k')
plt.ylabel('error')
plt.show()
#the elbow plot will look like this
From the elbow plot we can see that the best k value is 3.
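With k = 3 chosen from the elbow plot, the final clustering step could look like the following sketch (km, ages and clusters are my own variable names):
km = KMeans(n_clusters=3, random_state=4)
ages = np.array(x_train).reshape(-1, 1)  #the same age values used for the elbow plot
clusters = km.fit_predict(ages)          #fit the model and assign each age to a cluster
print(clusters[:5])                      #cluster label (0, 1 or 2) for the first five ages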