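Contents
- 1. Import and inspect the data
- 2. Determine the optimal number of clusters
- 3. K-means clustering
- 3.1 Inspect the clustering result
- 3.2 Extract cluster labels and merge them with the original data
- 3.3 Count the observations in each cluster
- 3.4 Visualize the clusters
- 4. Hierarchical clustering
- 4.1 Visualization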
1. Import and inspect the data
# Install the package first if it is not already installed
# install.packages("factoextra")
# Load the package
library(factoextra)
# Standardize the data (USArrests ships with base R)
df <- scale(USArrests)
# Show the first five rows
head(df, n = 5)
               Murder   Assault   UrbanPop         Rape
Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
Arizona    0.07163341 1.4788032  0.9989801  1.042878388
Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144  1.7589234  2.067820292
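Before clustering, it can be worth confirming that the standardization did what we expect. The following is a minimal sketch using base R only; the variable names `has_na`, `col_means`, and `col_sds` are made up for this illustration and are not part of the original walkthrough.

```r
# Sanity checks on the standardized data (base R only)
df <- scale(USArrests)

# k-means cannot handle missing values, so check for them first
has_na <- anyNA(df)

# After scale(), each column should have mean ~0 and standard deviation 1
col_means <- colMeans(df)
col_sds <- apply(df, 2, sd)

print(has_na)
print(round(col_means, 10))
print(col_sds)
```

Standardizing matters here because Assault is measured on a much larger numeric scale than the other variables and would otherwise dominate the Euclidean distances.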
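2. Determine the optimal number of clusters
# Plot the total within-cluster sum of squares (WSS) against the number of clusters
fviz_nbclust(df, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2)
# Four clusters looks like a reasonable choice here, although there is no single correct answer: by the elbow criterion, pick the number of clusters beyond which the curve flattens out.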
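To make the elbow criterion less of a black box, here is a rough base-R sketch of the quantity that the "wss" method plots: the total within-cluster sum of squares for each candidate number of clusters. The range 1 to 10 and `nstart = 25` are assumptions chosen for this illustration, and the names `k_values` and `wss` are hypothetical.

```r
# Compute the total within-cluster sum of squares for k = 1..10 by hand
df <- scale(USArrests)
set.seed(123)

k_values <- 1:10
wss <- sapply(k_values, function(k) {
  # Keep the best of several random starts for each k
  kmeans(df, centers = k, nstart = 25)$tot.withinss
})

# Plot the curve and mark the candidate elbow at k = 4
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
abline(v = 4, lty = 2)
```

For k = 1 the value equals the total sum of squares of the data, so the curve always starts at its maximum; what matters is where the drop levels off.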
3. K-means clustering
# Set the random seed so the result is reproducible
set.seed(123)
# Run k-means with 4 clusters, keeping the best of 24 random starts
km_result <- kmeans(df, 4, nstart = 24)
3.1 Inspect the clustering result
print(km_result)
K-means clustering with 4 clusters of sizes 13, 16, 13, 8

Cluster means:
      Murder    Assault   UrbanPop        Rape
1 -0.9615407 -1.1066010 -0.9301069 -0.96676331
2 -0.4894375 -0.3826001  0.5758298 -0.26165379
3  0.6950701  1.0394414  0.7226370  1.27693964
4  1.4118898  0.8743346 -0.8145211  0.01927104

Clustering vector:
(omitted)

Within cluster sum of squares by cluster:
[1] 11.952463 16.212213 19.922437  8.316061
 (between_SS / total_SS =  71.2 %)

Available components:
[1] "cluster"      "centers"      "totss"
[4] "withinss"     "tot.withinss" "betweenss"
[7] "size"         "iter"         "ifault"
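The `between_SS / total_SS` figure in the printed summary can be recomputed from the listed components. The sketch below refits the model so that it runs on its own; `ratio` is a name invented for this example.

```r
# Recompute the variance decomposition reported by print(km_result)
df <- scale(USArrests)
set.seed(123)
km_result <- kmeans(df, 4, nstart = 24)

# Share of the total sum of squares explained by the cluster structure
ratio <- km_result$betweenss / km_result$totss
print(round(100 * ratio, 1))

# The total splits into a within-cluster and a between-cluster part
print(all.equal(km_result$totss,
                km_result$tot.withinss + km_result$betweenss))

# tot.withinss is the sum of the per-cluster values
print(all.equal(km_result$tot.withinss, sum(km_result$withinss)))
```

A higher ratio means tighter, better separated clusters, but it always grows as k increases, so it should not be used on its own to choose k.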
3.2 Extract cluster labels and merge them with the original data
dd <- cbind(USArrests, cluster = km_result$cluster)
head(dd)
           Murder Assault UrbanPop Rape cluster
Alabama      13.2     236       58 21.2       4
Alaska       10.0     263       48 44.5       3
Arizona       8.1     294       80 31.0       3
Arkansas      8.8     190       50 19.5       4
California    9.0     276       91 40.6       3
Colorado      7.9     204       78 38.7       3
3.3 Count the observations in each cluster
table(dd$cluster)
 1  2  3  4
13 16 13  8
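Once the labels sit next to the original (unscaled) columns, one natural next step is to profile each cluster in the original units. This is only a sketch of one way to do it, using `aggregate()` from base R; `cluster_means` is a hypothetical name.

```r
# Average of each original variable within each cluster
df <- scale(USArrests)
set.seed(123)
km_result <- kmeans(df, 4, nstart = 24)
dd <- cbind(USArrests, cluster = km_result$cluster)

# One row per cluster, one column per variable, in the original units
cluster_means <- aggregate(. ~ cluster, data = dd, FUN = mean)
print(cluster_means)

# Cluster sizes, for reference alongside the means
print(table(dd$cluster))
```

Reading the means in arrests per 100,000 residents (and percent urban population) is usually easier than interpreting the standardized cluster centers.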
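3.4 Visualize the clusters
fviz_cluster(km_result, data = df,
             palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),  # one color per cluster
             ellipse.type = "euclid",   # Euclidean-distance ellipse around each cluster
             star.plot = TRUE,          # draw segments from each centroid to its points
             repel = TRUE,              # keep text labels from overlapping
             ggtheme = theme_minimal()
)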
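With more than two variables, `fviz_cluster()` draws the observations on the first two principal components. The sketch below reproduces that idea roughly with base R graphics; it is not the function's actual implementation, and `scores` and `var_explained` are names made up here.

```r
# Project the standardized data onto its first two principal components
df <- scale(USArrests)
set.seed(123)
km_result <- kmeans(df, 4, nstart = 24)

pca <- prcomp(df)          # the data are already centered and scaled
scores <- pca$x[, 1:2]     # coordinates on PC1 and PC2

# Proportion of variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scatter plot colored by k-means cluster, with state names as labels
plot(scores, col = km_result$cluster, pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.1f%%)", 100 * var_explained[2]))
text(scores, labels = rownames(df), pos = 3, cex = 0.6)
```

Because the plot is a two-dimensional projection, clusters that overlap on screen may still be separated along the remaining components.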
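4. Hierarchical clustering
# First compute the pairwise Euclidean distances between observations (a dissimilarity matrix)
result <- dist(df, method = "euclidean")
# Cluster with hclust using Ward's minimum-variance method
result_hc <- hclust(d = result, method = "ward.D2")
# Draw a first dendrogram
fviz_dend(result_hc, cex = 0.6)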
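4.1 Visualization
fviz_dend(result_hc, k = 4,            # cut the tree into 4 groups
          cex = 0.5,                   # label size
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE,    # color the labels by group
          rect = TRUE                  # draw a rectangle around each group
)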
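The dendrogram only shows the groups; to work with them as labels, the tree can be cut with `cutree()`. The sketch below also cross-tabulates the result against the k-means labels. The names `hc_cluster` and `comparison` are hypothetical, and both models are refit so the example runs on its own.

```r
# Cut the Ward dendrogram into 4 groups and compare with k-means
df <- scale(USArrests)

result_hc <- hclust(dist(df, method = "euclidean"), method = "ward.D2")
hc_cluster <- cutree(result_hc, k = 4)   # named integer vector, one label per state

set.seed(123)
km_result <- kmeans(df, 4, nstart = 24)

# Rows: k-means clusters, columns: hierarchical clusters
comparison <- table(kmeans = km_result$cluster, hclust = hc_cluster)
print(comparison)
```

The cluster numbers themselves are arbitrary in both methods, so agreement shows up as one dominant cell per row rather than as matching labels on the diagonal.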