ML

KNN

Posted by freeCookie🍪 on September 24, 2019

Classifiers

An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors. KNN stores all of the training data and does not learn an explicit model.

KNN has no model other than the stored dataset, so no training is required. Efficient implementations store the data in structures such as k-d trees so that looking up and matching new patterns at prediction time is fast.
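
As a minimal sketch (assuming scikit-learn is available; array shapes and names here are purely illustrative), a k-d tree can be built once over the stored training points and then queried for nearest neighbors:

from sklearn.neighbors import KDTree
import numpy as np

X_train = np.random.rand(1000, 3)                   # the stored training points
tree = KDTree(X_train, leaf_size=40)                # index built once
dist, ind = tree.query(np.random.rand(5, 3), k=3)   # 3 nearest neighbors per query point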

Distance measures:

  • Euclidean Distance: The straight-line distance between real-valued vectors (the square root of the sum of squared differences).

  • Hamming Distance: The distance between binary vectors, counted as the number of positions at which they differ.

  • Manhattan Distance: The distance between real-valued vectors, calculated as the sum of their absolute differences.

  • Minkowski Distance: A generalization of the Euclidean and Manhattan distances, parameterized by an order p.
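
For illustration, here are plain-Python sketches of the four measures above (assuming equal-length numeric sequences):

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    # number of positions at which two binary vectors differ
    return sum(x != y for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # p = 1 gives Manhattan, p = 2 gives Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)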

K: a hyperparameter to tune; a common starting range is 1 to 21 (odd values help avoid ties), as in the sketch below.
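
A hedged sketch of this tuning loop using scikit-learn (assumed available) and cross-validation; the dataset and variable names are only illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {}
for k in range(1, 22, 2):                    # odd K from 1 to 21
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()
best_k = max(scores, key=scores.get)         # K with the best cross-validated accuracy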

KNN for Classification

The output can be calculated as the class with the highest frequency from the K-most similar instances. Class probabilities can be calculated as the normalized frequency of samples that belong to each class in the set of K most similar instances for a new data instance.
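
A small sketch of that calculation (assuming, as in the implementation further below, that each neighbor's label is stored as its last element):

from collections import Counter

def predict_with_proba(neighbors):
    # neighbors: the K most similar training instances, label in the last position
    votes = Counter(n[-1] for n in neighbors)
    k = len(neighbors)
    probs = {label: count / k for label, count in votes.items()}   # normalized frequencies
    predicted = votes.most_common(1)[0][0]                         # highest-frequency class
    return predicted, probs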

Data Preparation

  • Rescale data: normalizing or standardizing features so that no single attribute dominates the distance
  • Address missing data: exclude or impute missing values
  • Lower dimensionality: KNN is best suited to lower-dimensional data, so consider feature selection (see the sketch after this list)
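
A hedged scikit-learn sketch of these preparation steps chained in front of KNN (the imputation strategy and the number of selected features are illustrative choices):

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier

pipe = make_pipeline(
    SimpleImputer(strategy="mean"),       # impute missing values
    StandardScaler(),                     # rescale to zero mean, unit variance
    SelectKBest(f_classif, k=10),         # keep the 10 most informative features
    KNeighborsClassifier(n_neighbors=5),
)
# usage: pipe.fit(X_train, y_train); pipe.predict(X_test)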

Implementation

Implementation of KNN

Tutorial to implement KNN

Pros

  1. It is extremely easy to implement
  2. As noted earlier, it is a lazy learning algorithm, so it requires no training prior to making predictions. This makes KNN much faster to get running than algorithms that need a training phase, e.g. SVM, linear regression, etc.
  3. Since the algorithm requires no training before making predictions, new data can be added seamlessly.
  4. Only two parameters are required to implement KNN: the value of K and the distance function (e.g. Euclidean or Manhattan).

Cons

  1. The KNN algorithm doesn’t work well with high-dimensional data: with a large number of dimensions, distances between points become less informative and more expensive to compute.
  2. The KNN algorithm has a high prediction cost for large datasets, because the distance between a new point and every existing point must be calculated at prediction time.
  3. Finally, the KNN algorithm doesn’t work well with categorical features, since it is difficult to define a meaningful distance over categorical dimensions; such features usually need to be encoded (e.g. one-hot) first.
import math
import operator

def euclideanDistance(instance1, instance2, length):
    # Euclidean distance over the first `length` attributes (the label is excluded)
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)
 
def getNeighbors(trainingSet, testInstance, k):
    # Return the k training instances closest to testInstance.
    distances = []
    length = len(testInstance) - 1    # x[-1] is the label, so skip it
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    # O(n log n) full sort; alternatives: O(n log k) max-heap, O(n) average quickselect
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])    # the training instance itself
    return neighbors

def quickSelect(distances, k):
    # Partially order `distances` so that the k smallest entries occupy the first
    # k positions, in O(n) average time (quickselect). Assumes k < len(distances).
    l = 0
    r = len(distances) - 1
    pivot = find(distances, l, r)
    while pivot != k:
        if pivot < k:
            l = pivot + 1
        else:
            r = pivot - 1
        pivot = find(distances, l, r)
    return distances

def find(distances, l, r):
    # Lomuto-style partition of distances[l..r] around the middle element's distance;
    # returns the pivot's final index.
    p = (r + l) // 2
    distances[r], distances[p] = distances[p], distances[r]
    idx = l
    while idx < r:
        if distances[idx][1] < distances[r][1]:
            distances[idx], distances[l] = distances[l], distances[idx]
            l += 1
        idx += 1
    distances[l], distances[r] = distances[r], distances[l]
    return l
    
def getResponse(neighbors):
    # Majority vote over the neighbors' labels; O(k log k) due to the sort
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def Boyer_Moore_majority_vote(neighbours):
    # O(k) single pass; note this only finds the most common label when one
    # class holds a strict majority (> k/2) of the votes
    res = -1
    cnt = 0
    for x in range(len(neighbours)):
        cur = neighbours[x][-1]
        if cnt == 0:
            res = cur
            cnt = 1
        elif res == cur:
            cnt += 1
        else:
            cnt -= 1
    return res
 
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0
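
A hedged end-to-end usage sketch of the functions above; scikit-learn is assumed only for loading and splitting the iris data, and the variable names are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
data = [list(row) + [label] for row, label in zip(X, y)]    # label stored as the last element
trainingSet, testSet = train_test_split(data, test_size=0.3, random_state=1)

k = 5
predictions = [getResponse(getNeighbors(trainingSet, t, k)) for t in testSet]
print("Accuracy: %.2f%%" % getAccuracy(testSet, predictions))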