2. KNN (K-Nearest Neighbors)

Advantages

2.1 Distance Calculation

<aside> 💡 Euclidean distance is influenced more by a single large difference in one feature than by many small differences spread across a set of features.

</aside>
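A quick numeric illustration of this point (the values below are chosen for illustration and are not from the original notes): four small differences of 1 each produce a smaller Euclidean distance than a single difference of 3.

```python
import numpy as np

a = np.zeros(4)
many_small = np.array([1.0, 1.0, 1.0, 1.0])  # four differences of 1
one_large = np.array([3.0, 0.0, 0.0, 0.0])   # a single difference of 3

# sqrt(1+1+1+1) = 2.0 vs sqrt(9) = 3.0: the one large gap dominates
d_small = np.sqrt(np.sum((many_small - a) ** 2))
d_large = np.sqrt(np.sum((one_large - a) ** 2))
print(d_small, d_large)  # 2.0 3.0
```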

2.1.1 Euclidean Distance

$$ d(p,q) = \sqrt {\sum \limits_{i=1}^{n} (q_{i} - p_{i})^{2}} $$

'''
Euclidean distance between test_instance and each row of instances
'''
import numpy as np

instances = np.array([ [5, 2.5, 3],
                       [2.75, 7.50, 4],
                       [9.10, 4.5, 4],
                       [8.9, 2.3, 6]])
test_instance = instances[0]
distances = []

for instance in instances:
    distance = np.sqrt(np.sum((instance - test_instance)**2))
    distances.append(distance)
print(distances)
'''
Euclidean distance between column vectors
'''
import numpy as np

instances = np.array([ [5, 2.5, 3],
                       [2.75, 7.50, 4],
                       [9.10, 4.5, 4],
                       [8.9, 2.3, 6]])
print(instances)
test_instance = instances[:, 0]

n_cols = instances.shape[1]  # number of columns
# instances.shape is the tuple (4, 3): 4 rows by 3 columns
distances = []

for col_idx in range(n_cols):
    instance = instances[:, col_idx]  # extract the column vector
    distance = np.sqrt(np.sum((instance - test_instance)**2))
    distances.append(distance)
print(distances)
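The explicit loops above can also be collapsed with NumPy broadcasting; this vectorized sketch (not in the original notes) reproduces the row-wise distances from the first example in a single expression:

```python
import numpy as np

instances = np.array([[5, 2.5, 3],
                      [2.75, 7.50, 4],
                      [9.10, 4.5, 4],
                      [8.9, 2.3, 6]])
test_instance = instances[0]

# broadcasting subtracts test_instance from every row at once;
# axis=1 sums the squared differences within each row
distances = np.sqrt(np.sum((instances - test_instance) ** 2, axis=1))
print(distances)
```

Since `test_instance` is the first row, the first distance is 0, and the result matches the loop version element for element.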

2.1.2 Manhattan Distance

$$ d_{L_1}(\vec{v}, \vec{u}) = \sum \limits_{i=1}^{n} |v_{i} - u_{i}| $$
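As a minimal sketch of the formula (reusing the first two rows of the `instances` array from the Euclidean examples), Manhattan distance sums the absolute differences per feature instead of squaring them:

```python
import numpy as np

v = np.array([5, 2.5, 3])
u = np.array([2.75, 7.50, 4])

# |5 - 2.75| + |2.5 - 7.5| + |3 - 4| = 2.25 + 5.0 + 1.0
manhattan = np.sum(np.abs(v - u))
print(manhattan)  # 8.25
```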