ML

KNN

Posted by freeCookie🍪 on September 24, 2019

Classifiers

An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors. KNN stores all of the training data and does not learn an explicit model.

KNN has no model other than the stored dataset, so no training is required. Efficient implementations store the data in structures such as k-d trees so that looking up and matching new patterns at prediction time is fast.
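
As a minimal sketch (assuming scikit-learn is available; array shapes and names here are purely illustrative), a k-d tree can be built once over the stored training points and then queried for nearest neighbors:

from sklearn.neighbors import KDTree
import numpy as np

X_train = np.random.rand(1000, 3)                   # the stored training points
tree = KDTree(X_train, leaf_size=40)                # index built once
dist, ind = tree.query(np.random.rand(5, 3), k=3)   # 3 nearest neighbors per query point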

Distance measures:

  • Euclidean Distance: The straight-line distance between real-valued vectors (the square root of the sum of squared differences).

  • Hamming Distance: The distance between binary vectors, counted as the number of positions at which they differ.

  • Manhattan Distance: The distance between real-valued vectors, calculated as the sum of their absolute differences.

  • Minkowski Distance: A generalization of the Euclidean and Manhattan distances, parameterized by an order p.
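
For illustration, here are plain-Python sketches of the four measures above (assuming equal-length numeric sequences):

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    # number of positions at which two binary vectors differ
    return sum(x != y for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # p = 1 gives Manhattan, p = 2 gives Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)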

K: a hyperparameter to tune; a common starting range is 1 to 21 (odd values help avoid ties), as in the sketch below.
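
A hedged sketch of this tuning loop using scikit-learn (assumed available) and cross-validation; the dataset and variable names are only illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {}
for k in range(1, 22, 2):                    # odd K from 1 to 21
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()
best_k = max(scores, key=scores.get)         # K with the best cross-validated accuracy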

KNN for Classification

The output can be calculated as the class with the highest frequency from the K-most similar instances. Class probabilities can be calculated as the normalized frequency of samples that belong to each class in the set of K most similar instances for a new data instance.
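
A small sketch of that calculation (assuming, as in the implementation further below, that each neighbor's label is stored as its last element):

from collections import Counter

def predict_with_proba(neighbors):
    # neighbors: the K most similar training instances, label in the last position
    votes = Counter(n[-1] for n in neighbors)
    k = len(neighbors)
    probs = {label: count / k for label, count in votes.items()}   # normalized frequencies
    predicted = votes.most_common(1)[0][0]                         # highest-frequency class
    return predicted, probs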

Data Preparation

  • Rescale data: normalizing or standardizing features so that no single attribute dominates the distance
  • Address missing data: exclude or impute missing values
  • Lower dimensionality: KNN is best suited to lower-dimensional data, so consider feature selection (see the sketch after this list)
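
A hedged scikit-learn sketch of these preparation steps chained in front of KNN (the imputation strategy and the number of selected features are illustrative choices):

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier

pipe = make_pipeline(
    SimpleImputer(strategy="mean"),       # impute missing values
    StandardScaler(),                     # rescale to zero mean, unit variance
    SelectKBest(f_classif, k=10),         # keep the 10 most informative features
    KNeighborsClassifier(n_neighbors=5),
)
# usage: pipe.fit(X_train, y_train); pipe.predict(X_test)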

Implementation

Implementation of KNN

Tutorial to implement KNN

Pros

  1. It is extremely easy to implement
  2. As noted earlier, it is a lazy learning algorithm, so it requires no training prior to making predictions. This makes KNN much faster to get running than algorithms that need a training phase, e.g. SVM, linear regression, etc.
  3. Since the algorithm requires no training before making predictions, new data can be added seamlessly.
  4. Only two parameters are required to implement KNN: the value of K and the distance function (e.g. Euclidean or Manhattan).

Cons

  1. The KNN algorithm doesn’t work well with high-dimensional data: with a large number of dimensions, distances between points become less informative and more expensive to compute.
  2. The KNN algorithm has a high prediction cost for large datasets, because the distance between a new point and every existing point must be calculated at prediction time.
  3. Finally, the KNN algorithm doesn’t work well with categorical features, since it is difficult to define a meaningful distance over categorical dimensions; such features usually need to be encoded (e.g. one-hot) first.
import math
import operator

def euclideanDistance(instance1, instance2, length):
    # Euclidean distance over the first `length` attributes (the label is excluded)
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)
 
def getNeighbors(trainingSet, testInstance, k):
    # Return the k training instances closest to testInstance.
    distances = []
    length = len(testInstance) - 1    # x[-1] is the label, so skip it
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    # O(n log n) full sort; alternatives: O(n log k) max-heap, O(n) average quickselect
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])    # the training instance itself
    return neighbors

def quickSelect(distances, k):
    # Partially order `distances` so that the k smallest entries occupy the first
    # k positions, in O(n) average time (quickselect). Assumes k < len(distances).
    l = 0
    r = len(distances) - 1
    pivot = find(distances, l, r)
    while pivot != k:
        if pivot < k:
            l = pivot + 1
        else:
            r = pivot - 1
        pivot = find(distances, l, r)
    return distances

def find(distances, l, r):
    # Lomuto-style partition of distances[l..r] around the middle element's distance;
    # returns the pivot's final index.
    p = (r + l) // 2
    distances[r], distances[p] = distances[p], distances[r]
    idx = l
    while idx < r:
        if distances[idx][1] < distances[r][1]:
            distances[idx], distances[l] = distances[l], distances[idx]
            l += 1
        idx += 1
    distances[l], distances[r] = distances[r], distances[l]
    return l
    
def getResponse(neighbors):
    # Majority vote over the neighbors' labels; O(k log k) due to the sort
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def Boyer_Moore_majority_vote(neighbours):
    # O(k) single pass; note this only finds the most common label when one
    # class holds a strict majority (> k/2) of the votes
    res = -1
    cnt = 0
    for x in range(len(neighbours)):
        cur = neighbours[x][-1]
        if cnt == 0:
            res = cur
            cnt = 1
        elif res == cur:
            cnt += 1
        else:
            cnt -= 1
    return res
 
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0
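
A hedged end-to-end usage sketch of the functions above; scikit-learn is assumed only for loading and splitting the iris data, and the variable names are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
data = [list(row) + [label] for row, label in zip(X, y)]    # label stored as the last element
trainingSet, testSet = train_test_split(data, test_size=0.3, random_state=1)

k = 5
predictions = [getResponse(getNeighbors(trainingSet, t, k)) for t in testSet]
print("Accuracy: %.2f%%" % getAccuracy(testSet, predictions))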