KNN algorithm simple explanation with code
KNN ALGORITHM
K-nearest neighbors is a supervised machine learning algorithm that is non-parametric (it makes no underlying assumptions about the data).
KNN can be used for both regression and classification, though it is mostly used for classification.
KNN is a lazy algorithm: it stores all the training data and uses it to classify new data points, which may lead to excessive use of computer memory.
Although there are many classification algorithms, it is usually better to learn KNN first, as it provides a basis for more complex algorithms.
KNN can lead to overfitting if a small value is chosen for k and underfitting if a large value is used for k.
It is usually advisable to select an odd number for the value of k to avoid ties when the nearest neighbors vote on a class.
A dataset with many outliers is usually classified better if a large k value is chosen, since more neighbors smooth out the influence of noisy points.
K is the number of nearest neighbors considered when classifying a new point; the distance from the test data points to the training data points can be measured using the Euclidean, Manhattan, Minkowski, and Hamming distances. Points that are close to one another are put in the same class.
EUCLIDEAN DISTANCE
This is also called the Pythagorean distance as it applies the principles of Pythagoras' theorem. The following example will help you understand.
Let's say we have the coordinate points (2,8) and (5,12):
x1 = 2
x2 = 5
y1 = 8
y2 = 12
The Euclidean distance can be calculated as follows:
d = √((x2−x1)² + (y2−y1)²)
= √((5−2)² + (12−8)²)
= √(9 + 16)
Therefore the Euclidean distance is √25 = 5.
Euclidean distance is mostly used when the data is continuous.
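As a quick check of this worked example, here is a minimal sketch in Python (the point variables p1 and p2 are my own names):
import numpy as np
p1 = np.array([2, 8])   #the first coordinate point
p2 = np.array([5, 12])  #the second coordinate point
#Euclidean distance: square the differences, sum them and take the square root
d = np.sqrt(np.sum((p2 - p1) ** 2))
print(d)  #prints 5.0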
MINKOWSKI DISTANCE
This is a generalization of the Euclidean and Manhattan distances. It defines a norm on a vector space, as the distance is determined between vectors. The Minkowski distance can be calculated as follows:
M = (Σ |xi − yi|^p)^(1/p)
MANHATTAN DISTANCE
When the p value in the Minkowski distance is equal to one, it becomes the Manhattan distance; when p is two, it becomes the Euclidean distance.
Another distance, the supremum (Chebyshev) distance, arises as the p value in the Minkowski formula tends to infinity.
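To make the relationship between these distances concrete, here is a minimal sketch reusing the two points from the Euclidean example (minkowski is my own helper function, not a library call):
import numpy as np
def minkowski(a, b, p):
    #Minkowski distance: the sum of |a_i - b_i|^p, raised to the power 1/p
    return np.sum(np.abs(a - b) ** p) ** (1 / p)
a = np.array([2, 8])
b = np.array([5, 12])
print(minkowski(a, b, 1))     #p=1 gives the Manhattan distance: 3 + 4 = 7
print(minkowski(a, b, 2))     #p=2 gives the Euclidean distance: 5.0
print(np.max(np.abs(a - b)))  #the Chebyshev distance, the limit as p grows: 4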
KNN IN PYTHON.
I shall begin by importing the essential libraries for this tutorial:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import seaborn as sns
I am going to use a dataset which can be downloaded from Kaggle. This dataset classifies sales based on the age of customers.
df=pd.read_csv(r"C:\Users\OWITI H\Documents\mathe.csv") #reading the dataset from where it is stored on my PC
df.head() #observing what the data looks like by printing the first 5 rows
Output
   Performance  Age  Sickoff Gender Citizen Povert_index          Sales
0           78   45       14      M   local         poor           good
1           46   21        1      M   local         poor        average
2           54   33        4      F   exper         poor  below average
3           49   34        2      M   exper     well_off           good
4           40   22        0      F   local     well_off           good
df1=df.drop(['Performance','Sickoff','Gender','Citizen','Povert_index'],axis='columns') #dropping the Performance, Sickoff, Gender, Citizen and Povert_index columns in order to remain with fewer columns for simplicity
Now let's look at the dataframe:
df1.head()
Output
   Age          Sales
0   45           good
1   21        average
2   33  below average
3   34           good
4   22           good
Plotting a scatter plot to see the classification of sales based on age:
sns.relplot(x=df1.Sales, y=df1.Age) #newer seaborn versions require x and y as keyword arguments
x=df1['Age']
y=df1['Sales']
#Letting x value be age and y value be sales
#splitting the dataset into training and test sets
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=4)
#creating a KNN classifier object called kn with k set to 3
kn=KNeighborsClassifier(n_neighbors=3)
#fitting the model: ages are the features and sales classes are the labels
kn.fit(np.array(x_train).reshape(-1,1),y_train)
#predicting the classes of the test data points
pred=kn.predict(np.array(x_test).reshape(-1,1))
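The accuracy_score function imported at the start can now be used to check how well the model does on the held-out test set. Here is a minimal sketch (the exact scores will depend on the dataset); it also tries a few odd values of k, as discussed earlier:
#comparing the predicted sales classes against the true test labels
print(accuracy_score(y_test, pred))
#trying a few odd k values to see which one generalizes best
for k in [1, 3, 5, 7]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(np.array(x_train).reshape(-1,1), y_train)
    print(k, model.score(np.array(x_test).reshape(-1,1), y_test))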
Now that we have looked at KNN, let's look at KMeans clustering. Although the two are often confused, the main difference is that KNN is a supervised algorithm (classification) while KMeans is unsupervised (clustering). Supervised means it uses labeled data; unsupervised uses unlabeled data.
PYTHON CODE FOR KMEANS
I shall use the same Kaggle dataset that I used earlier with KNN.
Most of the libraries I imported earlier remain the same, except that sklearn.neighbors is replaced with sklearn.cluster.
#importing the relevant library
from sklearn.cluster import KMeans
#constructing an elbow plot to enable selection of an accurate value for k
#first we construct an empty list called error, then loop over values of k and plot
error=[]
K=range(1,5)
for k in K:
    kn1=KMeans(n_clusters=k)
    kn1.fit(np.array(x_train).reshape(-1,1)) #KMeans is unsupervised, so only the features are passed
    error.append(kn1.inertia_) #inertia_ is the sum of squared distances to the nearest cluster centre
plt.plot(K, error) #k on the x-axis, error on the y-axis
plt.xlabel('k')
plt.ylabel('error')
plt.show()
#the elbow plot will look like this
From the elbow plot we can see that the best k value is 3.
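With k = 3 chosen from the elbow plot, the final clustering step could look like the following sketch (km, ages and clusters are my own variable names):
km = KMeans(n_clusters=3, random_state=4)
ages = np.array(x_train).reshape(-1, 1)  #the same age values used for the elbow plot
clusters = km.fit_predict(ages)          #fit the model and assign each age to a cluster
print(clusters[:5])                      #cluster label (0, 1 or 2) for the first five ages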