2022-10-14 Data Mining_6

2022. 10. 14. 22:34 · Undergraduate Lectures/Data Mining

1. K-means

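A representative unsupervised learning method.

It partitions the data into k groups.

The k-means clustering algorithm groups the given data into k clusters.

It works by minimizing the within-cluster variance, that is, the squared distances between each point and the center of its cluster.

The cost function is the sum of squared distances between each group's centroid and the data objects in that group. Clustering proceeds by repeatedly updating each data object's group membership in the direction that lowers this cost.

The algorithm is a form of unsupervised learning: it assigns labels to input data that has none.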

Source: https://ko.wikipedia.org/wiki/K-평균_알고리즘
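The assign-then-update loop described above can be sketched in a few lines of R. This is only a minimal illustration on made-up toy data with hand-picked starting centroids (both are assumptions, not part of the lecture), and it follows the plain Lloyd-style iteration, which is not necessarily what kmeans() runs by default.

```r
# Toy data: two well-separated groups in 2-D (made up for illustration).
x <- rbind(c(1, 1), c(1.5, 2), c(1, 1.5),
           c(8, 8), c(8.5, 9), c(9, 8))
centers <- x[c(1, 4), , drop = FALSE]   # hypothetical initial centroids

# Cost: sum of squared distances from each point to its assigned centroid.
cost <- function(x, centers, cluster) {
  sum((x - centers[cluster, , drop = FALSE])^2)
}

for (i in 1:10) {
  # Assignment step: squared distance from every point to every centroid.
  d <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums((x - matrix(centers[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
  })
  cluster <- max.col(-d, ties.method = "first")   # index of the nearest centroid

  # Update step: move each centroid to the mean of its members.
  new_centers <- t(sapply(seq_len(nrow(centers)), function(j) {
    colMeans(x[cluster == j, , drop = FALSE])
  }))
  if (all(new_centers == centers)) break          # assignments have stabilized
  centers <- new_centers
}

cluster                      # cluster label for each row of x
cost(x, centers, cluster)    # final value of the cost function
```

The sketch does not handle a cluster that becomes empty; real implementations need a rule for that case.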

A. kmeans()

kmeans(x, centers, iter.max = 10, nstart = 1,
algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
"MacQueen"), trace=FALSE)

Performs k-means clustering on a data matrix.

Arguments

Source: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/kmeans
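As a rough illustration of how the arguments in the signature above are passed, here is a call on synthetic data. The data, the seed, and the particular argument values are assumptions chosen for the example, not recommendations from the lecture.

```r
set.seed(1)   # kmeans() starts from randomly chosen centers, so fix the seed

# Synthetic data: 20 points around (0, 0) and 20 points around (5, 5).
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

fit <- kmeans(x,
              centers   = 2,        # number of clusters k (or a matrix of initial centers)
              iter.max  = 20,       # maximum number of iterations per run
              nstart    = 10,       # number of random starts; the best run is kept
              algorithm = "Lloyd")  # one of the variants listed in the signature

fit$size      # number of points assigned to each cluster
fit$centers   # estimated cluster centers
```

Raising nstart is a common way to reduce the chance of ending in a poor local minimum, at the price of more computation.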

B. Exercise 1

car_cc.csv

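df <- read.csv('car_cc.csv')   # load the data set

str(df)                        # inspect the columns and their types

plot(df)                       # scatter plot of the variables

km <- kmeans(df, 3)            # cluster into k = 3 groups (all columns must be numeric)

km                             # print cluster sizes, centers, and assignments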

Value

kmeans returns an object of class "kmeans" which has a print and a fitted method. It is a list with at least the following components:

cluster : A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers : A matrix of cluster centres.
totss : The total sum of squares.
withinss : Vector of within-cluster sum of squares, one component per cluster. This measures cohesion; lower is generally better.
tot.withinss : Total within-cluster sum of squares.
betweenss : The between-cluster sum of squares. This measures separation; higher is generally better.
size : The number of points in each cluster.
iter : The number of (outer) iterations.
ifault : integer: indicator of a possible algorithm problem -- for experts.

(Source: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/kmeans)
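The components listed above can be read straight off the returned object. The sketch below uses the numeric columns of R's built-in iris data as a stand-in, since car_cc.csv is not reproduced here; that substitution and the seed are assumptions.

```r
set.seed(42)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

fit$withinss       # cohesion per cluster (lower is generally better)
fit$tot.withinss   # sum of the values above
fit$betweenss      # separation between clusters (higher is generally better)

# The total sum of squares splits into the within and between parts.
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)

# Share of the total variation explained by the clustering.
fit$betweenss / fit$totss
```

Because tot.withinss always falls as k grows, these numbers are mainly useful for comparing runs with the same k, or for seeing where adding clusters stops helping.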

plot(df, col = km$cluster)

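col : points in the same cluster are drawn in the same color, and different clusters are drawn in different colors.

Interpreting the resulting clusters is left to the human analyst.

Reference: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/kmeans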
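One way to support that interpretation is to summarize each variable per cluster. The data frame below is a made-up stand-in with hypothetical column names (price, cc), so treat it as a sketch of the approach rather than an analysis of car_cc.csv.

```r
set.seed(7)

# Hypothetical stand-in data: two groups that differ in price and cc.
df <- data.frame(price = c(rnorm(10, 100, 5),   rnorm(10, 300, 5)),
                 cc    = c(rnorm(10, 1000, 50), rnorm(10, 2000, 50)))

km <- kmeans(df, 2, nstart = 5)

# Mean of every variable within each cluster.
summary_by_cluster <- aggregate(df, by = list(cluster = km$cluster), FUN = mean)
summary_by_cluster

table(km$cluster)   # how many points fell into each cluster
```

Comparing the per-cluster means shows which variables actually differ between groups, which is usually the starting point for naming or describing a cluster.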

C. Exercise 2

Run kmeans on data you collected yourself and analyze the result.

bike.csv

df <- read.csv('bike.csv')
km <- kmeans(df, 3)
plot(df, col = km$cluster)

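My own reading of the result:

Engine displacement (cc) has little relation to price, model year, or odometer mileage.
The clusters separate well along price and odometer mileage. → For used motorcycles, odometer mileage appears to have the largest influence on price.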
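One caveat about this reading: kmeans relies on Euclidean distance, so columns with large numeric ranges (such as price or mileage) can dominate the clustering simply because of their units. A possible check is to standardize the columns with scale() and compare the two results. The data frame below is a made-up stand-in for bike.csv with hypothetical column names and values.

```r
set.seed(3)

# Hypothetical stand-in for bike.csv; column names and values are made up.
df <- data.frame(cc      = rnorm(30, 125, 20),
                 price   = rnorm(30, 2000000, 500000),
                 mileage = rnorm(30, 20000, 8000))

km_raw    <- kmeans(df, 3, nstart = 10)          # raw units: price dominates the distances
km_scaled <- kmeans(scale(df), 3, nstart = 10)   # each column rescaled to mean 0, sd 1

# Cross-tabulate the two assignments to see how much they disagree.
table(raw = km_raw$cluster, scaled = km_scaled$cluster)

plot(df, col = km_scaled$cluster)                # same plot as before, scaled clustering
```

If the clusters change substantially after scaling, the apparent importance of price and mileage may partly reflect their units rather than the structure of the data.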
