- 1: Unsupervised Learning
- (1) What is unsupervised learning
- (2) Clustering and classification
- 2: The K-Means Algorithm
- (1) Algorithm overview
- (2) Algorithm workflow
- (3) Python implementation
- (4) Clustering results
- A: Iris dataset (works well)
- B: Three-circles dataset (works poorly)
- 3: Further Discussion of K-Means
- (1) Distance metrics
- (2) Clustering evaluation metrics
- A: Inertia (within-cluster sum of squares)
- B: Silhouette coefficient
1: Unsupervised Learning
K-Means is an unsupervised clustering algorithm.
(1) What is unsupervised learning
Unsupervised learning: during training, the algorithm is never told which class a given sample belongs to; instead, it groups "similar" samples together through some procedure and treats each group as a class.
- Image segmentation is one example: to the computer, the pixels of an image are initially indistinguishable, but a clustering algorithm can group them into rough classes and thereby segment the image.
(2) Clustering and classification
Classification and clustering are two different kinds of machine learning algorithms. In short:
- Classification: samples are assigned to existing classes based on their features or attributes, so the classes are known in advance.
- Clustering: the number of classes is not known beforehand; the algorithm itself discovers the groups.
2: The K-Means Algorithm
(1) Algorithm overview
Formal definition: given a sample set $D=\{x_{1},x_{2},...,x_{N}\}$, K-Means produces a set of clusters $C=\{C_{1},C_{2},...,C_{k}\}$ that minimizes the squared error
$$E=\sum_{i=1}^{k}\sum_{x\in C_{i}}\|x-\mu_{i}\|_{2}^{2}$$
where $\mu_{i}=\frac{1}{|C_{i}|}\sum_{x\in C_{i}}x$ is the mean of cluster $C_{i}$. The smaller $E$ is, the more similar the samples within each cluster are.
Informal definition: K-Means partitions a set of $N$ samples into $K$ disjoint clusters; the mean of all samples in a cluster is called that cluster's centroid.
- Cluster: the output of the clustering; samples in the same cluster are treated as the same class.
- Centroid: the mean of all samples in a cluster.
In this algorithm, the number of clusters $K$ is a hyperparameter that must be chosen before the algorithm runs.
- Taking image segmentation as an example, $K$ is set to roughly the number of regions you expect the image to divide into.
The core task of K-Means is therefore: given the chosen $K$, find the $K$ optimal centroids and assign each sample to the cluster of its nearest centroid.
Clustering assumes that samples in the same cluster are similar while samples in different clusters differ, so the goal pursued by K-Means (and other clustering algorithms) is small within-cluster differences and large between-cluster differences. This difference is measured by the distance from each sample to the centroid of its cluster: for a given cluster, the smaller the sum of distances from its samples to its centroid, the more similar (and less varied) the samples in that cluster.
When measuring distance, K-Means most often uses the Euclidean distance:
$$d(x,\mu)=\sqrt{\sum_{i=1}^{n}(x_{i}-\mu_{i})^{2}}$$
where
- $x$ is a sample in the cluster
- $\mu$ is the cluster's centroid
- $n$ is the number of features per sample
- $i$ indexes the features of point $x$
(2) Algorithm workflow
The K-Means algorithm proceeds as follows:
- Choose $k$ initial samples as the initial cluster centers $\mu=\mu_{1},\mu_{2},...,\mu_{k}$.
- For each sample $x_{i}$ in the dataset, compute its distance to each of the $k$ cluster centers and assign it to the class of the nearest center.
- For each class $C_{j}$, recompute its centroid from the samples assigned to it: $\mu_{j}=\frac{1}{|C_{j}|}\sum_{x\in C_{j}}x$.
- Repeat steps 2 and 3 until some stopping condition is reached (e.g., a fixed number of iterations, or the error falling below some threshold).
A walk-through of the algorithm:
1: Suppose we have the dataset below; randomly pick $k$ points as centroids ($k=3$: red, green, blue).
2: Compute the distance from every other point in the dataset to each centroid, and give each point the class of its nearest centroid.
3: After this step the data fall into three clusters; recompute the centroid within each cluster.
4: Partition the points again.
5: Repeat.
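The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the full implementation given below; the function name `kmeans_sketch`, the handling of empty clusters, and the convergence check are my own additions:

```python
import numpy as np


def kmeans_sketch(data, k, max_iterations=50, seed=0):
    """Minimal K-Means loop: init, assign, recompute, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random samples as the initial centroids
    centroids = data[rng.permutation(len(data))[:k]]
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iterations):
        # Step 2: assign each sample to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned samples
        # (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # stop once centroids settle
            break
        centroids = new_centroids
    return centroids, labels
```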
(3) Python implementation
KMeans.py: the K-Means implementation

```python
import numpy as np


class KMeans:
    def __init__(self, data_set, k):  # data_set: dataset; k: number of clusters
        self.data_set = data_set
        self.k = k

    def kmeans(self, max_iterations):  # max_iterations: maximum number of iterations
        examples_num = self.data_set.shape[0]  # number of samples
        centroids = KMeans.centroids_init(self.data_set, self.k)  # randomly initialize k centroids
        # Records which centroid each sample is closest to, e.g.
        # closest_centroids_ids[0] = 2 means sample 0 is closest to centroid 2
        closest_centroids_ids = np.empty((examples_num, 1))
        # Iterate
        for _ in range(max_iterations):
            # First find the nearest centroid for each sample
            closest_centroids_ids = KMeans.centroids_find_closest(self.data_set, centroids)
            # Then update the centroids
            centroids = KMeans.centroids_compute(self.data_set, closest_centroids_ids, self.k)
        return centroids, closest_centroids_ids

    @staticmethod
    def centroids_init(data_set, k):  # randomly initialize the centroids
        examples_num = data_set.shape[0]  # number of samples
        random_ids = np.random.permutation(examples_num)  # random permutation of sample indices
        centroids = data_set[random_ids[:k], :]  # take the first k
        return centroids

    @staticmethod
    def centroids_find_closest(data_set, centroids):  # nearest centroid for each sample
        examples_num = data_set.shape[0]  # number of samples
        centroids_num = centroids.shape[0]  # number of centroids
        closest_centroids_ids = np.zeros((examples_num, 1))
        for examples_index in range(examples_num):
            distance = np.zeros((centroids_num, 1))  # distances from this sample to each centroid
            for centroids_index in range(centroids_num):
                # Euclidean distance
                distance_diff = data_set[examples_index, :] - centroids[centroids_index, :]
                distance[centroids_index] = np.sqrt(np.sum(np.power(distance_diff, 2)))
            closest_centroids_ids[examples_index] = np.argmin(distance)  # index of the nearest centroid
        return closest_centroids_ids

    @staticmethod
    def centroids_compute(data_set, closest_centroids_ids, k):
        features_num = data_set.shape[1]  # number of features (attributes)
        centroids = np.zeros((k, features_num))  # one centroid per cluster, each with features_num attributes
        for centroid_id in range(k):
            closest_ids = closest_centroids_ids == centroid_id  # boolean mask (see NumPy boolean indexing)
            centroids[centroid_id] = np.mean(data_set[closest_ids.flatten(), :], axis=0)
        return centroids
```
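A side note on the implementation: the double loop that finds each sample's nearest centroid can be replaced by a single broadcast operation, which is considerably faster on large datasets. A sketch of the idea (the function name is mine; it returns a flat array of indices rather than a column vector):

```python
import numpy as np


def find_closest_vectorized(data_set, centroids):
    # (N, 1, d) - (1, k, d) broadcasts to (N, k, d): all pairwise differences at once
    diffs = data_set[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    distances = np.sqrt(np.sum(diffs ** 2, axis=2))  # (N, k) Euclidean distances
    return np.argmin(distances, axis=1)  # nearest-centroid index for each sample
```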
(4) Clustering results
A: Iris dataset (works well)
Iris.csv: the iris dataset
Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species 1,5.1,3.5,1.4,0.2,Iris-setosa 2,4.9,3.0,1.4,0.2,Iris-setosa 3,4.7,3.2,1.3,0.2,Iris-setosa 4,4.6,3.1,1.5,0.2,Iris-setosa 5,5.0,3.6,1.4,0.2,Iris-setosa 6,5.4,3.9,1.7,0.4,Iris-setosa 7,4.6,3.4,1.4,0.3,Iris-setosa 8,5.0,3.4,1.5,0.2,Iris-setosa 9,4.4,2.9,1.4,0.2,Iris-setosa 10,4.9,3.1,1.5,0.1,Iris-setosa 11,5.4,3.7,1.5,0.2,Iris-setosa 12,4.8,3.4,1.6,0.2,Iris-setosa 13,4.8,3.0,1.4,0.1,Iris-setosa 14,4.3,3.0,1.1,0.1,Iris-setosa 15,5.8,4.0,1.2,0.2,Iris-setosa 16,5.7,4.4,1.5,0.4,Iris-setosa 17,5.4,3.9,1.3,0.4,Iris-setosa 18,5.1,3.5,1.4,0.3,Iris-setosa 19,5.7,3.8,1.7,0.3,Iris-setosa 20,5.1,3.8,1.5,0.3,Iris-setosa 21,5.4,3.4,1.7,0.2,Iris-setosa 22,5.1,3.7,1.5,0.4,Iris-setosa 23,4.6,3.6,1.0,0.2,Iris-setosa 24,5.1,3.3,1.7,0.5,Iris-setosa 25,4.8,3.4,1.9,0.2,Iris-setosa 26,5.0,3.0,1.6,0.2,Iris-setosa 27,5.0,3.4,1.6,0.4,Iris-setosa 28,5.2,3.5,1.5,0.2,Iris-setosa 29,5.2,3.4,1.4,0.2,Iris-setosa 30,4.7,3.2,1.6,0.2,Iris-setosa 31,4.8,3.1,1.6,0.2,Iris-setosa 32,5.4,3.4,1.5,0.4,Iris-setosa 33,5.2,4.1,1.5,0.1,Iris-setosa 34,5.5,4.2,1.4,0.2,Iris-setosa 35,4.9,3.1,1.5,0.1,Iris-setosa 36,5.0,3.2,1.2,0.2,Iris-setosa 37,5.5,3.5,1.3,0.2,Iris-setosa 38,4.9,3.1,1.5,0.1,Iris-setosa 39,4.4,3.0,1.3,0.2,Iris-setosa 40,5.1,3.4,1.5,0.2,Iris-setosa 41,5.0,3.5,1.3,0.3,Iris-setosa 42,4.5,2.3,1.3,0.3,Iris-setosa 43,4.4,3.2,1.3,0.2,Iris-setosa 44,5.0,3.5,1.6,0.6,Iris-setosa 45,5.1,3.8,1.9,0.4,Iris-setosa 46,4.8,3.0,1.4,0.3,Iris-setosa 47,5.1,3.8,1.6,0.2,Iris-setosa 48,4.6,3.2,1.4,0.2,Iris-setosa 49,5.3,3.7,1.5,0.2,Iris-setosa 50,5.0,3.3,1.4,0.2,Iris-setosa 51,7.0,3.2,4.7,1.4,Iris-versicolor 52,6.4,3.2,4.5,1.5,Iris-versicolor 53,6.9,3.1,4.9,1.5,Iris-versicolor 54,5.5,2.3,4.0,1.3,Iris-versicolor 55,6.5,2.8,4.6,1.5,Iris-versicolor 56,5.7,2.8,4.5,1.3,Iris-versicolor 57,6.3,3.3,4.7,1.6,Iris-versicolor 58,4.9,2.4,3.3,1.0,Iris-versicolor 59,6.6,2.9,4.6,1.3,Iris-versicolor 60,5.2,2.7,3.9,1.4,Iris-versicolor 61,5.0,2.0,3.5,1.0,Iris-versicolor 
62,5.9,3.0,4.2,1.5,Iris-versicolor 63,6.0,2.2,4.0,1.0,Iris-versicolor 64,6.1,2.9,4.7,1.4,Iris-versicolor 65,5.6,2.9,3.6,1.3,Iris-versicolor 66,6.7,3.1,4.4,1.4,Iris-versicolor 67,5.6,3.0,4.5,1.5,Iris-versicolor 68,5.8,2.7,4.1,1.0,Iris-versicolor 69,6.2,2.2,4.5,1.5,Iris-versicolor 70,5.6,2.5,3.9,1.1,Iris-versicolor 71,5.9,3.2,4.8,1.8,Iris-versicolor 72,6.1,2.8,4.0,1.3,Iris-versicolor 73,6.3,2.5,4.9,1.5,Iris-versicolor 74,6.1,2.8,4.7,1.2,Iris-versicolor 75,6.4,2.9,4.3,1.3,Iris-versicolor 76,6.6,3.0,4.4,1.4,Iris-versicolor 77,6.8,2.8,4.8,1.4,Iris-versicolor 78,6.7,3.0,5.0,1.7,Iris-versicolor 79,6.0,2.9,4.5,1.5,Iris-versicolor 80,5.7,2.6,3.5,1.0,Iris-versicolor 81,5.5,2.4,3.8,1.1,Iris-versicolor 82,5.5,2.4,3.7,1.0,Iris-versicolor 83,5.8,2.7,3.9,1.2,Iris-versicolor 84,6.0,2.7,5.1,1.6,Iris-versicolor 85,5.4,3.0,4.5,1.5,Iris-versicolor 86,6.0,3.4,4.5,1.6,Iris-versicolor 87,6.7,3.1,4.7,1.5,Iris-versicolor 88,6.3,2.3,4.4,1.3,Iris-versicolor 89,5.6,3.0,4.1,1.3,Iris-versicolor 90,5.5,2.5,4.0,1.3,Iris-versicolor 91,5.5,2.6,4.4,1.2,Iris-versicolor 92,6.1,3.0,4.6,1.4,Iris-versicolor 93,5.8,2.6,4.0,1.2,Iris-versicolor 94,5.0,2.3,3.3,1.0,Iris-versicolor 95,5.6,2.7,4.2,1.3,Iris-versicolor 96,5.7,3.0,4.2,1.2,Iris-versicolor 97,5.7,2.9,4.2,1.3,Iris-versicolor 98,6.2,2.9,4.3,1.3,Iris-versicolor 99,5.1,2.5,3.0,1.1,Iris-versicolor 100,5.7,2.8,4.1,1.3,Iris-versicolor 101,6.3,3.3,6.0,2.5,Iris-virginica 102,5.8,2.7,5.1,1.9,Iris-virginica 103,7.1,3.0,5.9,2.1,Iris-virginica 104,6.3,2.9,5.6,1.8,Iris-virginica 105,6.5,3.0,5.8,2.2,Iris-virginica 106,7.6,3.0,6.6,2.1,Iris-virginica 107,4.9,2.5,4.5,1.7,Iris-virginica 108,7.3,2.9,6.3,1.8,Iris-virginica 109,6.7,2.5,5.8,1.8,Iris-virginica 110,7.2,3.6,6.1,2.5,Iris-virginica 111,6.5,3.2,5.1,2.0,Iris-virginica 112,6.4,2.7,5.3,1.9,Iris-virginica 113,6.8,3.0,5.5,2.1,Iris-virginica 114,5.7,2.5,5.0,2.0,Iris-virginica 115,5.8,2.8,5.1,2.4,Iris-virginica 116,6.4,3.2,5.3,2.3,Iris-virginica 117,6.5,3.0,5.5,1.8,Iris-virginica 118,7.7,3.8,6.7,2.2,Iris-virginica 
119,7.7,2.6,6.9,2.3,Iris-virginica 120,6.0,2.2,5.0,1.5,Iris-virginica 121,6.9,3.2,5.7,2.3,Iris-virginica 122,5.6,2.8,4.9,2.0,Iris-virginica 123,7.7,2.8,6.7,2.0,Iris-virginica 124,6.3,2.7,4.9,1.8,Iris-virginica 125,6.7,3.3,5.7,2.1,Iris-virginica 126,7.2,3.2,6.0,1.8,Iris-virginica 127,6.2,2.8,4.8,1.8,Iris-virginica 128,6.1,3.0,4.9,1.8,Iris-virginica 129,6.4,2.8,5.6,2.1,Iris-virginica 130,7.2,3.0,5.8,1.6,Iris-virginica 131,7.4,2.8,6.1,1.9,Iris-virginica 132,7.9,3.8,6.4,2.0,Iris-virginica 133,6.4,2.8,5.6,2.2,Iris-virginica 134,6.3,2.8,5.1,1.5,Iris-virginica 135,6.1,2.6,5.6,1.4,Iris-virginica 136,7.7,3.0,6.1,2.3,Iris-virginica 137,6.3,3.4,5.6,2.4,Iris-virginica 138,6.4,3.1,5.5,1.8,Iris-virginica 139,6.0,3.0,4.8,1.8,Iris-virginica 140,6.9,3.1,5.4,2.1,Iris-virginica 141,6.7,3.1,5.6,2.4,Iris-virginica 142,6.9,3.1,5.1,2.3,Iris-virginica 143,5.8,2.7,5.1,1.9,Iris-virginica 144,6.8,3.2,5.9,2.3,Iris-virginica 145,6.7,3.3,5.7,2.5,Iris-virginica 146,6.7,3.0,5.2,2.3,Iris-virginica 147,6.3,2.5,5.0,1.9,Iris-virginica 148,6.5,3.0,5.2,2.0,Iris-virginica 149,6.2,3.4,5.4,2.3,Iris-virginica 150,5.9,3.0,5.1,1.8,Iris-virginica
test_Iris.py: testing on the dataset

```python
import pandas as pd
import matplotlib.pyplot as plt
from KMeans import KMeans

Iris_types = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']  # flower species
Iris_data = pd.read_csv('./Iris.csv')
x_axis = 'PetalLengthCm'  # petal length
y_axis = 'PetalWidthCm'  # petal width
# x_axis = 'SepalLengthCm'  # sepal length
# y_axis = 'SepalWidthCm'  # sepal width
examples_num = Iris_data.shape[0]  # number of samples
train_data = Iris_data[[x_axis, y_axis]].values.reshape(examples_num, 2)  # arrange the data

# Training parameters
K = 3  # number of clusters
max_iterations = 50  # maximum number of iterations

test = KMeans(train_data, K)
centroids, closest_centroids_ids = test.kmeans(max_iterations)

plt.figure(figsize=(12, 5), dpi=80)
# First plot: the known labels
plt.subplot(1, 2, 1)
for Iris_type in Iris_types:
    plt.scatter(Iris_data[x_axis][Iris_data['Species'] == Iris_type],
                Iris_data[y_axis][Iris_data['Species'] == Iris_type], label=Iris_type)
plt.title('known labels')
plt.legend()
# Second plot: the clustering result
plt.subplot(1, 2, 2)
for centroid_id, centroid in enumerate(centroids):
    current_examples_index = (closest_centroids_ids == centroid_id).flatten()
    plt.scatter(Iris_data[x_axis][current_examples_index],
                Iris_data[y_axis][current_examples_index],
                label=Iris_data['Species'][current_examples_index].values[0])
for centroid_id, centroid in enumerate(centroids):
    plt.scatter(centroid[0], centroid[1], c='red', marker='x')
plt.title('kmeans labels')
plt.legend()
plt.show()
```
Results:
- Petal length vs. petal width
- Sepal length vs. sepal width
B: Three-circles dataset (works poorly)
threecircles.csv: the three-circles dataset
4.54E-01,5.15E-01,0.00E+00 4.55E-01,5.14E-01,0.00E+00 4.48E-01,5.18E-01,0.00E+00 4.36E-01,5.21E-01,0.00E+00 4.64E-01,5.26E-01,0.00E+00 4.46E-01,5.24E-01,0.00E+00 4.53E-01,5.30E-01,0.00E+00 4.54E-01,5.31E-01,0.00E+00 4.69E-01,5.27E-01,0.00E+00 4.40E-01,5.46E-01,0.00E+00 4.61E-01,5.36E-01,0.00E+00 4.29E-01,5.47E-01,0.00E+00 4.54E-01,5.45E-01,0.00E+00 4.37E-01,5.28E-01,0.00E+00 4.42E-01,5.46E-01,0.00E+00 4.30E-01,5.14E-01,0.00E+00 4.49E-01,5.19E-01,0.00E+00 4.36E-01,5.41E-01,0.00E+00 4.36E-01,5.28E-01,0.00E+00 4.35E-01,5.34E-01,0.00E+00 4.43E-01,5.31E-01,0.00E+00 4.42E-01,5.18E-01,0.00E+00 4.37E-01,5.34E-01,0.00E+00 4.32E-01,5.41E-01,0.00E+00 4.22E-01,5.27E-01,0.00E+00 4.44E-01,5.22E-01,0.00E+00 4.37E-01,5.12E-01,0.00E+00 4.33E-01,5.15E-01,0.00E+00 4.43E-01,5.28E-01,0.00E+00 4.38E-01,5.28E-01,0.00E+00 4.30E-01,5.16E-01,0.00E+00 4.05E-01,5.19E-01,0.00E+00 4.20E-01,5.18E-01,0.00E+00 4.25E-01,5.13E-01,0.00E+00 4.17E-01,5.06E-01,0.00E+00 4.28E-01,4.96E-01,0.00E+00 4.51E-01,5.11E-01,0.00E+00 4.15E-01,4.95E-01,0.00E+00 4.49E-01,5.20E-01,0.00E+00 4.47E-01,5.04E-01,0.00E+00 4.29E-01,4.93E-01,0.00E+00 4.31E-01,5.05E-01,0.00E+00 4.40E-01,5.02E-01,0.00E+00 4.30E-01,5.20E-01,0.00E+00 4.40E-01,4.99E-01,0.00E+00 4.48E-01,4.90E-01,0.00E+00 4.39E-01,4.96E-01,0.00E+00 4.54E-01,4.85E-01,0.00E+00 4.56E-01,4.98E-01,0.00E+00 4.50E-01,5.09E-01,0.00E+00 4.48E-01,4.91E-01,0.00E+00 4.52E-01,5.20E-01,0.00E+00 4.55E-01,5.17E-01,0.00E+00 4.63E-01,5.08E-01,0.00E+00 4.52E-01,5.17E-01,0.00E+00 4.47E-01,5.21E-01,0.00E+00 4.66E-01,5.06E-01,0.00E+00 4.65E-01,5.13E-01,0.00E+00 4.53E-01,5.23E-01,0.00E+00 4.54E-01,5.18E-01,0.00E+00 4.52E-01,5.14E-01,0.00E+00 5.76E-01,5.15E-01,1.00E+00 5.74E-01,5.13E-01,1.00E+00 5.74E-01,5.31E-01,1.00E+00 5.85E-01,5.08E-01,1.00E+00 5.79E-01,5.51E-01,1.00E+00 5.82E-01,5.49E-01,1.00E+00 5.88E-01,5.52E-01,1.00E+00 5.69E-01,5.72E-01,1.00E+00 5.71E-01,5.61E-01,1.00E+00 5.69E-01,5.81E-01,1.00E+00 5.54E-01,5.64E-01,1.00E+00 5.44E-01,5.68E-01,1.00E+00 5.47E-01,5.63E-01,1.00E+00 
5.56E-01,5.94E-01,1.00E+00 5.54E-01,6.07E-01,1.00E+00 5.51E-01,6.02E-01,1.00E+00 5.42E-01,6.06E-01,1.00E+00 5.51E-01,6.04E-01,1.00E+00 5.48E-01,6.09E-01,1.00E+00 5.17E-01,6.48E-01,1.00E+00 5.44E-01,6.15E-01,1.00E+00 5.16E-01,6.36E-01,1.00E+00 5.07E-01,6.44E-01,1.00E+00 5.02E-01,6.36E-01,1.00E+00 5.11E-01,6.45E-01,1.00E+00 5.11E-01,6.31E-01,1.00E+00 4.95E-01,6.59E-01,1.00E+00 4.75E-01,6.45E-01,1.00E+00 4.82E-01,6.50E-01,1.00E+00 4.78E-01,6.50E-01,1.00E+00 4.72E-01,6.47E-01,1.00E+00 4.61E-01,6.21E-01,1.00E+00 4.62E-01,6.42E-01,1.00E+00 4.46E-01,6.45E-01,1.00E+00 4.43E-01,6.67E-01,1.00E+00 4.33E-01,6.54E-01,1.00E+00 4.47E-01,6.66E-01,1.00E+00 4.30E-01,6.55E-01,1.00E+00 4.20E-01,6.51E-01,1.00E+00 4.24E-01,6.51E-01,1.00E+00 4.24E-01,6.44E-01,1.00E+00 4.05E-01,6.42E-01,1.00E+00 4.01E-01,6.38E-01,1.00E+00 3.93E-01,6.45E-01,1.00E+00 4.03E-01,6.41E-01,1.00E+00 3.91E-01,6.26E-01,1.00E+00 3.60E-01,6.24E-01,1.00E+00 3.74E-01,6.14E-01,1.00E+00 3.79E-01,6.16E-01,1.00E+00 3.61E-01,6.19E-01,1.00E+00 3.68E-01,6.27E-01,1.00E+00 3.46E-01,6.21E-01,1.00E+00 3.38E-01,6.15E-01,1.00E+00 3.56E-01,6.09E-01,1.00E+00 3.52E-01,5.92E-01,1.00E+00 3.38E-01,6.05E-01,1.00E+00 3.36E-01,5.84E-01,1.00E+00 3.39E-01,5.88E-01,1.00E+00 3.15E-01,5.94E-01,1.00E+00 3.19E-01,5.84E-01,1.00E+00 3.23E-01,5.82E-01,1.00E+00 3.01E-01,5.66E-01,1.00E+00 3.26E-01,5.55E-01,1.00E+00 3.26E-01,5.54E-01,1.00E+00 3.07E-01,5.47E-01,1.00E+00 3.24E-01,5.32E-01,1.00E+00 3.04E-01,5.52E-01,1.00E+00 3.32E-01,5.25E-01,1.00E+00 3.07E-01,5.20E-01,1.00E+00 3.23E-01,5.40E-01,1.00E+00 3.17E-01,5.30E-01,1.00E+00 3.09E-01,5.02E-01,1.00E+00 3.08E-01,5.00E-01,1.00E+00 3.09E-01,4.97E-01,1.00E+00 3.10E-01,4.95E-01,1.00E+00 2.97E-01,4.63E-01,1.00E+00 3.23E-01,4.50E-01,1.00E+00 3.31E-01,4.64E-01,1.00E+00 3.19E-01,4.76E-01,1.00E+00 3.35E-01,4.59E-01,1.00E+00 3.37E-01,4.63E-01,1.00E+00 3.30E-01,4.63E-01,1.00E+00 3.35E-01,4.49E-01,1.00E+00 3.44E-01,4.34E-01,1.00E+00 3.43E-01,4.30E-01,1.00E+00 3.43E-01,4.42E-01,1.00E+00 3.51E-01,4.13E-01,1.00E+00 
3.53E-01,4.20E-01,1.00E+00 3.49E-01,4.27E-01,1.00E+00 3.63E-01,4.15E-01,1.00E+00 3.61E-01,4.01E-01,1.00E+00 3.73E-01,4.05E-01,1.00E+00 3.86E-01,3.95E-01,1.00E+00 3.79E-01,3.99E-01,1.00E+00 4.09E-01,4.10E-01,1.00E+00 3.87E-01,4.20E-01,1.00E+00 3.93E-01,3.98E-01,1.00E+00 4.08E-01,4.25E-01,1.00E+00 3.94E-01,4.15E-01,1.00E+00 4.28E-01,4.04E-01,1.00E+00 4.42E-01,3.85E-01,1.00E+00 4.21E-01,3.97E-01,1.00E+00 4.23E-01,4.02E-01,1.00E+00 4.57E-01,3.97E-01,1.00E+00 4.47E-01,3.91E-01,1.00E+00 4.52E-01,3.97E-01,1.00E+00 4.57E-01,3.77E-01,1.00E+00 4.69E-01,3.94E-01,1.00E+00 4.82E-01,3.85E-01,1.00E+00 4.88E-01,3.98E-01,1.00E+00 4.75E-01,4.07E-01,1.00E+00 4.72E-01,3.90E-01,1.00E+00 4.88E-01,3.89E-01,1.00E+00 4.94E-01,3.97E-01,1.00E+00 4.93E-01,4.11E-01,1.00E+00 5.05E-01,4.08E-01,1.00E+00 5.21E-01,4.20E-01,1.00E+00 5.13E-01,4.06E-01,1.00E+00 5.22E-01,4.14E-01,1.00E+00 5.37E-01,4.09E-01,1.00E+00 5.45E-01,4.17E-01,1.00E+00 5.32E-01,4.22E-01,1.00E+00 5.61E-01,4.43E-01,1.00E+00 5.38E-01,4.45E-01,1.00E+00 5.53E-01,4.40E-01,1.00E+00 5.61E-01,4.48E-01,1.00E+00 5.49E-01,4.56E-01,1.00E+00 5.64E-01,4.48E-01,1.00E+00 5.60E-01,4.56E-01,1.00E+00 5.55E-01,4.77E-01,1.00E+00 5.53E-01,4.66E-01,1.00E+00 5.73E-01,4.80E-01,1.00E+00 5.92E-01,4.81E-01,1.00E+00 5.59E-01,4.92E-01,1.00E+00 5.79E-01,4.87E-01,1.00E+00 5.67E-01,5.07E-01,1.00E+00 5.77E-01,5.17E-01,1.00E+00 5.98E-01,5.07E-01,1.00E+00 5.95E-01,5.12E-01,1.00E+00 7.82E-01,5.14E-01,2.00E+00 7.79E-01,5.20E-01,2.00E+00 7.87E-01,5.79E-01,2.00E+00 7.88E-01,5.78E-01,2.00E+00 7.70E-01,5.90E-01,2.00E+00 7.45E-01,6.13E-01,2.00E+00 7.62E-01,6.43E-01,2.00E+00 7.51E-01,6.70E-01,2.00E+00 7.30E-01,6.77E-01,2.00E+00 7.06E-01,7.12E-01,2.00E+00 7.11E-01,7.20E-01,2.00E+00 6.90E-01,7.46E-01,2.00E+00 6.91E-01,7.46E-01,2.00E+00 6.59E-01,7.78E-01,2.00E+00 6.44E-01,7.61E-01,2.00E+00 6.21E-01,7.89E-01,2.00E+00 6.05E-01,8.07E-01,2.00E+00 6.08E-01,8.08E-01,2.00E+00 5.73E-01,8.07E-01,2.00E+00 5.60E-01,8.25E-01,2.00E+00 5.54E-01,8.28E-01,2.00E+00 5.27E-01,8.43E-01,2.00E+00 
5.05E-01,8.63E-01,2.00E+00 4.57E-01,8.49E-01,2.00E+00 4.69E-01,8.60E-01,2.00E+00 4.24E-01,8.60E-01,2.00E+00 4.15E-01,8.43E-01,2.00E+00 4.00E-01,8.44E-01,2.00E+00 3.57E-01,8.46E-01,2.00E+00 3.49E-01,8.27E-01,2.00E+00 3.07E-01,8.33E-01,2.00E+00 3.02E-01,8.22E-01,2.00E+00 2.83E-01,8.14E-01,2.00E+00 2.82E-01,7.98E-01,2.00E+00 2.48E-01,7.83E-01,2.00E+00 2.26E-01,7.72E-01,2.00E+00 2.14E-01,7.53E-01,2.00E+00 1.87E-01,7.55E-01,2.00E+00 1.80E-01,7.52E-01,2.00E+00 1.81E-01,7.24E-01,2.00E+00 1.49E-01,7.03E-01,2.00E+00 1.50E-01,6.75E-01,2.00E+00 1.39E-01,6.46E-01,2.00E+00 1.41E-01,6.26E-01,2.00E+00 1.19E-01,6.17E-01,2.00E+00 1.35E-01,5.96E-01,2.00E+00 1.22E-01,5.70E-01,2.00E+00 1.04E-01,5.51E-01,2.00E+00 1.25E-01,5.34E-01,2.00E+00 9.68E-02,5.18E-01,2.00E+00 1.20E-01,4.85E-01,2.00E+00 1.11E-01,4.70E-01,2.00E+00 1.23E-01,4.78E-01,2.00E+00 1.19E-01,4.48E-01,2.00E+00 1.29E-01,4.14E-01,2.00E+00 1.24E-01,3.84E-01,2.00E+00 1.48E-01,3.59E-01,2.00E+00 1.69E-01,3.69E-01,2.00E+00 1.51E-01,3.25E-01,2.00E+00 1.83E-01,3.05E-01,2.00E+00 1.96E-01,3.01E-01,2.00E+00 1.95E-01,2.78E-01,2.00E+00 2.12E-01,2.65E-01,2.00E+00 2.35E-01,2.64E-01,2.00E+00 2.61E-01,2.40E-01,2.00E+00 2.63E-01,2.30E-01,2.00E+00 2.98E-01,2.23E-01,2.00E+00 2.97E-01,2.09E-01,2.00E+00 3.36E-01,2.06E-01,2.00E+00 3.56E-01,1.94E-01,2.00E+00 3.55E-01,1.70E-01,2.00E+00 3.96E-01,1.91E-01,2.00E+00 3.91E-01,1.79E-01,2.00E+00 4.21E-01,1.71E-01,2.00E+00 4.51E-01,1.90E-01,2.00E+00 4.72E-01,2.02E-01,2.00E+00 4.93E-01,1.82E-01,2.00E+00 5.03E-01,1.63E-01,2.00E+00 5.31E-01,1.99E-01,2.00E+00 5.55E-01,1.93E-01,2.00E+00 6.02E-01,2.06E-01,2.00E+00 5.83E-01,2.34E-01,2.00E+00 5.99E-01,2.23E-01,2.00E+00 6.25E-01,2.47E-01,2.00E+00 6.34E-01,2.67E-01,2.00E+00 6.63E-01,2.68E-01,2.00E+00 6.75E-01,2.76E-01,2.00E+00 6.97E-01,3.06E-01,2.00E+00 7.00E-01,3.00E-01,2.00E+00 7.36E-01,3.57E-01,2.00E+00 7.20E-01,3.65E-01,2.00E+00 7.49E-01,3.46E-01,2.00E+00 7.44E-01,3.93E-01,2.00E+00 7.44E-01,4.09E-01,2.00E+00 7.54E-01,4.18E-01,2.00E+00 7.76E-01,4.59E-01,2.00E+00 
7.66E-01,4.88E-01,2.00E+00 7.87E-01,5.01E-01,2.00E+00 7.76E-01,5.25E-01,2.00E+00
test_threecircles.py: testing on the dataset

```python
import pandas as pd
import matplotlib.pyplot as plt
from KMeans import KMeans

data_types = [0.0, 1.0, 2.0]
raw_data = pd.read_csv('./threecircles.csv', header=None)
raw_data.columns = ['X', 'Y', 'Type']
x_axis = 'X'
y_axis = 'Y'
examples_num = raw_data.shape[0]
train_data = raw_data[[x_axis, y_axis]].values.reshape(examples_num, 2)

# Training parameters
K = 3
max_iterations = 50

test = KMeans(train_data, K)
centroids, closest_centroids_ids = test.kmeans(max_iterations)

plt.figure(figsize=(12, 5), dpi=80)
# First plot: the known labels
plt.subplot(1, 2, 1)
for data_types_index in data_types:
    plt.scatter(raw_data[x_axis][raw_data['Type'] == data_types_index],
                raw_data[y_axis][raw_data['Type'] == data_types_index],
                label=str(data_types_index))
plt.title('known labels')
plt.legend()
# Second plot: the clustering result
plt.subplot(1, 2, 2)
for centroid_id, centroid in enumerate(centroids):
    current_examples_index = (closest_centroids_ids == centroid_id).flatten()
    plt.scatter(raw_data[x_axis][current_examples_index],
                raw_data[y_axis][current_examples_index],
                label=raw_data['Type'][current_examples_index].values[0])
for centroid_id, centroid in enumerate(centroids):
    plt.scatter(centroid[0], centroid[1], c='red', marker='x')
plt.title('kmeans labels')
plt.legend()
plt.show()
```
Results
3: Further Discussion of K-Means
(1) Distance metrics
Many metrics can measure the distance from a sample to a centroid in K-Means; the three main ones are:
- Euclidean distance (used in this article): $d(x,\mu)=\sqrt{\sum_{i=1}^{n}(x_{i}-\mu_{i})^{2}}$
- Manhattan distance (fast to compute): $d(x,\mu)=\sum_{i=1}^{n}|x_{i}-\mu_{i}|$
- Cosine distance (more stable in high dimensions): $\cos(\theta)=\frac{A\cdot B}{\|A\|\,\|B\|}=\frac{\sum_{i=1}^{n}A_{i}B_{i}}{\sqrt{\sum_{i=1}^{n}A_{i}^{2}}\sqrt{\sum_{i=1}^{n}B_{i}^{2}}}$
A few points deserve attention.
①: On cosine distance:
- Cosine distance is 1 minus the cosine similarity, where cosine similarity is the cosine of the angle between two vectors.
- Note in particular that cosine distance takes values in [0, 2], which keeps it non-negative.
②: Cosine distance vs. Euclidean distance:
- When the vectors are normalized to unit length, the two distances are monotonically related: $\|A-B\|_{2}=\sqrt{2(1-\cos(A,B))}$. In that setting, picking the nearest neighbor (smallest distance, largest similarity) gives the same result whether you use cosine similarity or Euclidean distance.
- Euclidean distance reflects absolute differences in magnitude, while cosine distance reflects relative differences in direction.
- Euclidean distance is the straight-line distance between two points in space, so it is strongly affected by the numeric scale of each dimension; cosine distance measures the difference in direction, which can carry more information and be more stable. Which one to use depends on the scenario.
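The relation between the two distances is easy to check numerically; a small sketch (the variable names are mine):

```python
import numpy as np

# Verify ||A - B||_2 = sqrt(2 * (1 - cos(A, B))) for unit-normalized vectors
rng = np.random.default_rng(0)
A = rng.random(5)
B = rng.random(5)
A /= np.linalg.norm(A)  # normalize to unit length
B /= np.linalg.norm(B)
cos_sim = A @ B  # cosine similarity of unit vectors is just the dot product
lhs = np.linalg.norm(A - B)
rhs = np.sqrt(2 * (1 - cos_sim))
print(np.isclose(lhs, rhs))  # True
```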
③: Manhattan distance vs. Euclidean distance (see the figure):
- Red line: Manhattan distance
- Green line: Euclidean distance
- Blue and yellow lines: equivalent to the Manhattan distance
④: Implementing these distances in Python:

```python
import numpy as np
from scipy.spatial import distance


# Euclidean distance
def distEclud(vecA, vecB):
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))


# Manhattan distance
def distManhattan(vecA, vecB):
    return np.sum(np.abs(vecA - vecB))


# Cosine distance (via scipy): distance.cosine already returns
# 1 - cosine similarity, i.e. the cosine distance itself
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
cos_dist = distance.cosine(a, b)
```

(2) Clustering evaluation metrics
A: Inertia (within-cluster sum of squares)
When the Euclidean distance is used, we can compute the sum of squared distances from all samples in a cluster to its centroid, known as the Cluster Sum of Squares (CSS), or inertia:
$$CSS=\sum_{j=1}^{m}\sum_{i=1}^{n}(x_{ji}-\mu_{i})^{2}$$
where
- $m$ is the number of samples in the cluster, and $j$ indexes those samples
- $n$ is the number of features, with $x_{ji}$ and $\mu_{i}$ the $i$-th feature of sample $j$ and of the centroid, respectively
Summing the within-cluster sum of squares over all clusters of a dataset gives the Total Cluster Sum of Squares (TCSS), also called total inertia:
$$TCSS=\sum_{l=1}^{k}CSS_{l}$$
The smaller the total inertia, the more similar the samples within each cluster and the better the clustering, so K-Means seeks the centroids that minimize the within-cluster sum of squares. As the centroids are updated over the iterations, the total sum of squares keeps decreasing; when it reaches its minimum, the centroids stop changing.
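To make the definitions concrete, here is total inertia computed by hand on a tiny made-up dataset (the numbers are invented purely for illustration):

```python
import numpy as np

# Two tight clusters of two points each, with a fixed assignment
data = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels = np.array([0, 0, 1, 1])

# Centroid of each cluster = mean of its assigned samples
centroids = np.array([data[labels == j].mean(axis=0) for j in range(2)])

# Total inertia = sum over clusters of squared distances to the centroid
inertia = sum(np.sum((data[labels == j] - centroids[j]) ** 2) for j in range(2))
print(inertia)  # 0.08 (up to floating-point error)
```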
Here we use sklearn's KMeans to compute the inertia of the iris clustering:

```python
# Import the required modules
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
X = iris.data[:]
# print(X)
print(X.shape)

# Build a clusterer with 3 clusters
estimator = KMeans(n_clusters=3, random_state=0)
estimator.fit(X)
# Get the labels
label_pred = estimator.labels_
# print(label_pred)
x0 = X[label_pred == 0]
x1 = X[label_pred == 1]
x2 = X[label_pred == 2]

# Plot the clustering result (columns 0 and 2: sepal length vs. petal length)
plt.scatter(x0[:, 0], x0[:, 2], c='red', marker='o', label='label0')
plt.scatter(x1[:, 0], x1[:, 2], c='green', marker='*', label='label1')
plt.scatter(x2[:, 0], x2[:, 2], c='blue', marker='v', label='label2')
centers = estimator.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 2], marker='x', c='black', alpha=1, s=300)
plt.xlabel('sepal length')
plt.ylabel('petal length')
plt.legend(loc=2)
print(estimator.inertia_)  # the inertia
plt.show()
```
B: Silhouette coefficient
For a clustering task we want clusters that are internally tight and well separated from each other. The silhouette coefficient measures exactly this balance of cohesion and separation:
$$s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}}$$
where
- $a(i)$ (smaller is better): the average distance from sample $i$ to the other samples in its own cluster; the smaller $a(i)$ is, the more sample $i$ belongs in that cluster. $a(i)$ is called the intra-cluster dissimilarity of sample $i$.
- $b(i)$ (larger is better): let $b_{ij}$ denote the average distance from sample $i$ to all samples of another cluster $C_{j}$, called the dissimilarity between sample $i$ and cluster $C_{j}$; then $b(i)=\min\{b_{i1},b_{i2},...,b_{ik}\}$.
The silhouette coefficient $s(i)$ ranges over [-1, 1]:
- $s(i)$ close to 1 ($a(i)<b(i)$): sample $i$ is clustered well (the larger the silhouette coefficient, the better)
- $s(i)$ close to -1 ($a(i)>b(i)$): sample $i$ should belong to a different cluster; the clustering is poor
- $s(i)$ close to 0 ($a(i)=b(i)$): sample $i$ lies on the boundary between two clusters
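As an illustration, here is the silhouette value of a single sample computed by hand on a toy dataset (the data and variable names are invented); with two tight, well-separated clusters, $s(i)$ comes out close to 1:

```python
import numpy as np

# Two toy clusters, far apart
cluster_a = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
cluster_b = np.array([[10.0, 10.0], [10.0, 11.0]])

x = cluster_a[0]  # the sample whose silhouette we compute
a_i = np.mean([np.linalg.norm(x - p) for p in cluster_a[1:]])  # intra-cluster dissimilarity
b_i = np.mean([np.linalg.norm(x - p) for p in cluster_b])      # dissimilarity to the other cluster
s_i = (b_i - a_i) / max(a_i, b_i)
print(round(s_i, 3))  # close to 1: the sample is well clustered
```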