KNN algorithm simple explanation with code

 

 KNN ALGORITHM

K nearest neighbor (KNN) is a supervised machine learning algorithm that is non-parametric (it makes no underlying assumptions about the distribution of the data).

KNN can be used for both regression and classification, though it is mostly used for classification.

KNN is a lazy algorithm: it stores all the training data and only uses it when a new data point needs to be classified, which may lead to excessive use of computer memory.

Although there are many classification algorithms, it is usually better to learn KNN first as it provides a basis for more complex algorithms.

KNN can lead to overfitting if a small value is chosen for k and underfitting if a large value is used for k.

It is usually advisable to select an odd number for the value of k to avoid ties when the neighbours vote on the class of a new point.

A dataset with many outliers is usually classified better if a larger k value is chosen, since more neighbours smooth out the noise.

K is the number of nearest neighbours considered when classifying a test point. The distance from the test data point to the training data points can be measured using the Euclidean, Manhattan, Minkowski or Hamming distance, and points that are close to one another are put in the same class.
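To make this voting procedure concrete, here is a minimal from-scratch sketch using hypothetical toy data (the Euclidean distance it relies on is explained in the next section):

import numpy as np
from collections import Counter

def knn_classify(train_X,train_y,test_point,k=3):
    #distance from the test point to every training point
    dists=np.linalg.norm(train_X-test_point,axis=1)
    #indices of the k nearest training points
    nearest=np.argsort(dists)[:k]
    #majority vote among the k nearest neighbours
    return Counter(train_y[nearest]).most_common(1)[0][0]

#toy data: class 'A' near the origin, class 'B' further away
X=np.array([[1,2],[2,1],[1,1],[8,9],[9,8],[8,8]])
y=np.array(['A','A','A','B','B','B'])
print(knn_classify(X,y,np.array([2,2]))) #'A'
print(knn_classify(X,y,np.array([9,9]))) #'B'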

EUCLIDEAN DISTANCE

This is also called the Pythagorean distance as it applies the principles of Pythagoras' theorem. The following example will help you understand.

Let's say we have the coordinate points (2, 8) and (5, 12):

x₁ = 2

x₂ = 5

y₁ = 8

y₂ = 12

The Euclidean distance can be calculated as follows:

d = √((x₂ − x₁)² + (y₂ − y₁)²)

= √((5 − 2)² + (12 − 8)²)

= √(9 + 16)

Therefore the Euclidean distance is √25 = 5.

Euclidean distance is mostly used when the data is continuous. 
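To verify the worked example above, here is a quick sketch in numpy:

import numpy as np

a=np.array([2,8])
b=np.array([5,12])
print(np.sqrt(np.sum((b-a)**2))) #5.0, computed directly from the formula
print(np.linalg.norm(b-a)) #5.0, the same result using numpy's built-in norm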

MINKOWSKI DISTANCE

This is a generalisation of both the Euclidean and Manhattan distances. It is defined on a normed vector space, as the distance is determined between vectors. The Minkowski distance can be calculated as follows:

M = (Σᵢ |xᵢ − yᵢ|^p)^(1/p)

MANHATTAN DISTANCE

When the p value in the Minkowski distance is equal to one, it becomes the Manhattan distance; when p is equal to two, it becomes the Euclidean distance.

Other distances such as the supremum/Chebyshev distance arise in the limit as the p value in the Minkowski formula tends to infinity.
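The relationship between these metrics is easy to demonstrate in code; here is a small sketch using plain numpy with the same two points as before:

import numpy as np

def minkowski(x,y,p):
    #Minkowski distance: (sum of |xi - yi|^p)^(1/p)
    return np.sum(np.abs(x-y)**p)**(1/p)

a=np.array([2,8])
b=np.array([5,12])
print(minkowski(a,b,1)) #p=1, Manhattan distance: 7.0
print(minkowski(a,b,2)) #p=2, Euclidean distance: 5.0
print(np.max(np.abs(a-b))) #limit as p tends to infinity, Chebyshev/supremum distance: 4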

KNN IN PYTHON.

 

I shall begin by importing the essential libraries for this tutorial:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

import seaborn as sns

I am going to use a dataset which can be downloaded from Kaggle.

This dataset classifies sales based on the age of customers.

df=pd.read_csv(r"C:\Users\OWITI H\Documents\mathe.csv") #reading the dataset from its stored location on my PC

df.head() #observing what the data looks like by printing the first five rows

Output

   Performance  Age  Sickoff Gender Citizen Povert_index          Sales
0           78   45       14      M   local         poor           good
1           46   21        1      M   local         poor        average
2           54   33        4      F   exper         poor  below average
3           49   34        2      M   exper     well_off           good
4           40   22        0      F   local     well_off           good

df1=df.drop(['Performance','Sickoff','Gender','Citizen','Povert_index'],axis='columns') #dropping the Performance, Sickoff, Gender, Citizen and Povert_index columns in order to keep only a few columns for simplicity

Now let's look at the dataframe:

df1.head()

Output

   Age          Sales
0   45           good
1   21        average
2   33  below average
3   34           good
4   22           good

Plotting a scatter plot to see the classification of sales based on age:

sns.relplot(x='Sales',y='Age',data=df1) #keyword arguments work across seaborn versions

x=df1['Age']

y=df1['Sales'] #letting x (the feature) be age and y (the label) be sales

#splitting the dataset into training and test sets

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=4)

#creating a classifier object called kn with k = 3

kn=KNeighborsClassifier(n_neighbors=3)

#fitting the model; reshape(-1,1) turns the single feature column into the 2-D array sklearn expects

m=kn.fit(np.array(x_train).reshape(-1,1),y_train)

#predicting the classes of the test data points

pred=kn.predict(np.array(x_test).reshape(-1,1))
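accuracy_score was imported at the start but not yet used; here is a minimal sketch to check how well the classifier performs on the test set (the exact scores depend on the dataset). Looping over a few odd values of k also shows the overfitting/underfitting effect described earlier:

#fraction of test points classified correctly
print(accuracy_score(y_test,pred))

#comparing a few odd values of k to see how accuracy changes
for k in range(1,10,2):
    model=KNeighborsClassifier(n_neighbors=k)
    model.fit(np.array(x_train).reshape(-1,1),y_train)
    print(k,accuracy_score(y_test,model.predict(np.array(x_test).reshape(-1,1))))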

 

Now that we have looked at KNN, let's look at KMeans clustering. Both assign data points to groups, but the main difference is that KNN is a supervised algorithm while KMeans is unsupervised: supervised meaning it learns from labelled data, unsupervised meaning it works on unlabelled data.

PYTHON CODE FOR KMEANS

I shall use the same Kaggle dataset that I used earlier with KNN.

The libraries imported earlier remain the same, except that sklearn.neighbors is replaced with sklearn.cluster.

#importing relevant library

from sklearn.cluster import KMeans

#constructing an elbow plot to help select a suitable value for k

#first we construct an empty list called error, then loop over values of k and plot

error=[]

K=range(1,5)

for k in K:

    kn1=KMeans(n_clusters=k)

    kn1.fit(np.array(x_train).reshape(-1,1)) #KMeans is unsupervised, so only the feature values are passed

    error.append(kn1.inertia_)

plt.plot(K,error) #k on the x-axis, within-cluster error (inertia) on the y-axis

plt.xlabel('k')

plt.ylabel('inertia')

plt.show()


#the elbow plot will look like this

[Elbow plot: inertia falls as k increases, with a clear bend at k = 3]
From the elbow plot we can see that the optimal value of k is 3.
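Having settled on k = 3 from the elbow plot, here is a minimal sketch (reusing the Age values split earlier in the KNN example) that fits the final model and inspects the clusters:

km=KMeans(n_clusters=3)
km.fit(np.array(x_train).reshape(-1,1))
print(km.labels_) #the cluster assigned to each training point
print(km.cluster_centers_) #the centre (mean age) of each of the three clusters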
