Date: 2022/5/11
- K-Means Algorithm
- 0. Dataset Analysis
- 1. Algorithm Idea
- 2. Implementation Process
- 3. Algorithm Characteristics
- 4. Complete Implementation Code
The test dataset is again the classic iris dataset. It has four attributes: sepal length (sepallength), sepal width (sepalwidth), petal length (petallength), and petal width (petalwidth). The decision attribute is the species (setosa, versicolor, virginica).
1. Algorithm Idea
K-Means is one of the simpler clustering methods and belongs to the partitioning family. It is a centroid-based technique: each cluster is represented by the mean of its members. The main idea is: given K and K initial cluster centers, assign each point (i.e., each data record) to the cluster whose center is nearest; once all points are assigned, recompute each cluster's center as the mean of its points; then iterate the assignment and update steps until the centers no longer change or a specified number of iterations is reached. Several optimized variants build on this scheme, including K-Means++ (better initialization), elkan K-Means (faster distance computation), and Mini Batch K-Means (for large datasets).
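Of these variants, K-Means++ changes only the seeding step: the first center is picked uniformly at random, and each subsequent center is picked with probability proportional to its squared distance to the nearest center chosen so far. A minimal sketch on a plain double[][] dataset, independent of the Weka-based class in this post (class and method names are hypothetical):

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of K-Means++ seeding on a plain double[][] dataset.
public class KMeansPlusPlusSeed {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Pick k initial centers: the first uniformly at random, each later one
    // with probability proportional to its squared distance to the nearest
    // center chosen so far.
    static double[][] seed(double[][] data, int k, Random random) {
        double[][] centers = new double[k][];
        centers[0] = data[random.nextInt(data.length)];
        double[] minDist = new double[data.length];
        Arrays.fill(minDist, Double.MAX_VALUE);
        for (int c = 1; c < k; c++) {
            double total = 0;
            for (int i = 0; i < data.length; i++) {
                minDist[i] = Math.min(minDist[i], squaredDistance(data[i], centers[c - 1]));
                total += minDist[i];
            }
            // Sample an index with probability minDist[i] / total.
            double r = random.nextDouble() * total;
            int chosen = data.length - 1;
            for (int i = 0; i < data.length; i++) {
                r -= minDist[i];
                if (r <= 0) {
                    chosen = i;
                    break;
                }
            }
            centers[c] = data[chosen];
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };
        System.out.println(Arrays.deepToString(seed(data, 2, new Random())));
    }
}
```

Because far-away points are favored, the initial centers tend to be spread out, which usually reduces the number of iterations and the chance of a bad local optimum.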
2. Implementation Process
Pseudocode of the main part of the algorithm:
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster's center is represented by the mean of the objects in the cluster.
Input: K, the number of clusters; D, a dataset of N objects.
Output: a set of K clusters.
Method:
(1) Arbitrarily choose K objects from D as the initial cluster centers;
(2) repeat
(3)   assign each object to the most similar cluster, based on the mean of the objects in the cluster;
(4)   update the cluster means, i.e., recompute the mean of the objects in each cluster;
(5) until no change;
For step 5, the teacher's version decides whether anything changed by checking whether the partition (the cluster assignment of each point) differs before and after the iteration:
public void clustering() {
    int[] tempOldClusterArray = new int[dataset.numInstances()];
    tempOldClusterArray[0] = -1;
    int[] tempClusterArray = new int[dataset.numInstances()];
    Arrays.fill(tempClusterArray, 0);
    double[][] tempCenters = new double[numClusters][dataset.numAttributes() - 1];

    // Step 1. Initialize centers.
    int[] tempRandomOrders = getRandomIndices(dataset.numInstances());
    for (int i = 0; i < numClusters; i++) {
        for (int j = 0; j < tempCenters[0].length; j++) {
            tempCenters[i][j] = dataset.instance(tempRandomOrders[i]).value(j);
        } // Of for j
    } // Of for i

    int[] tempClusterLengths = null;
    while (!Arrays.equals(tempOldClusterArray, tempClusterArray)) {
        System.out.println("New loop ...");
        tempOldClusterArray = tempClusterArray;
        tempClusterArray = new int[dataset.numInstances()];

        // Step 2.1 Minimization. Assign cluster to each instance.
        int tempNearestCenter;
        double tempNearestDistance;
        double tempDistance;
        for (int i = 0; i < dataset.numInstances(); i++) {
            tempNearestCenter = -1;
            tempNearestDistance = Double.MAX_VALUE;
            for (int j = 0; j < numClusters; j++) {
                tempDistance = distance(i, tempCenters[j]);
                if (tempNearestDistance > tempDistance) {
                    tempNearestDistance = tempDistance;
                    tempNearestCenter = j;
                } // Of if
            } // Of for j
            tempClusterArray[i] = tempNearestCenter;
        } // Of for i

        // Step 2.2 Mean. Find new centers.
        tempClusterLengths = new int[numClusters];
        Arrays.fill(tempClusterLengths, 0);
        double[][] tempNewCenters = new double[numClusters][dataset.numAttributes() - 1];
        for (int i = 0; i < dataset.numInstances(); i++) {
            for (int j = 0; j < tempNewCenters[0].length; j++) {
                tempNewCenters[tempClusterArray[i]][j] += dataset.instance(i).value(j);
            } // Of for j
            tempClusterLengths[tempClusterArray[i]]++;
        } // Of for i

        // Step 2.3 Now average.
        for (int i = 0; i < tempNewCenters.length; i++) {
            for (int j = 0; j < tempNewCenters[0].length; j++) {
                tempNewCenters[i][j] /= tempClusterLengths[i];
            } // Of for j
        } // Of for i

        System.out.println("Now the new centers are: " + Arrays.deepToString(tempNewCenters));
        tempCenters = tempNewCenters;
    } // Of while

    // Step 3. Form clusters.
    clusters = new int[numClusters][];
    int[] tempCounters = new int[numClusters];
    for (int i = 0; i < numClusters; i++) {
        clusters[i] = new int[tempClusterLengths[i]];
    } // Of for i
    for (int i = 0; i < tempClusterArray.length; i++) {
        clusters[tempClusterArray[i]][tempCounters[tempClusterArray[i]]] = i;
        tempCounters[tempClusterArray[i]]++;
    } // Of for i

    System.out.println("The clusters are: " + Arrays.deepToString(clusters));
} // Of clustering
My version instead checks whether the cluster centers changed between iterations. When the dataset is large, the first approach needs more comparisons than the second: comparing partitions costs one comparison per instance, while comparing centers costs only K times the number of attributes. Since K is never very large, the second approach is faster.
public void Clustering() {
    // Step 1. Initialize the center matrix.
    double[][] tempClusterCenters = new double[numClusters][datasets.numAttributes() - 1];
    // Whether any cluster center changed in the last iteration.
    boolean changeFlag = true;
    // Take the first numClusters entries of a random permutation so that the
    // initial centers are distinct instances.
    int[] tempRandomOrders = shuffle(datasets.numInstances());
    for (int i = 0; i < numClusters; i++) {
        for (int j = 0; j < datasets.numAttributes() - 1; j++) {
            tempClusterCenters[i][j] = datasets.instance(tempRandomOrders[i]).value(j);
        }
    }
    // Cluster sizes.
    int[] tempClusterLength = new int[numClusters];
    // Nearest center and distances.
    int tempNearestCenter;
    double tempNearestDistance;
    double tempDistance;
    // Cluster assignment of each instance, recorded by index.
    int[] tempClusterArray = new int[datasets.numInstances()];
    while (changeFlag) {
        // Step 2. Reset the clusters.
        changeFlag = false;
        Arrays.fill(tempClusterArray, 0);
        Arrays.fill(tempClusterLength, 0);
        // Step 3. Assign each instance to its nearest center.
        for (int i = 0; i < datasets.numInstances(); i++) {
            tempNearestDistance = Double.MAX_VALUE;
            tempNearestCenter = 0;
            for (int j = 0; j < numClusters; j++) {
                tempDistance = distance(i, tempClusterCenters[j]);
                if (tempDistance < tempNearestDistance) {
                    tempNearestDistance = tempDistance;
                    tempNearestCenter = j;
                }
            }
            tempClusterArray[i] = tempNearestCenter;
            tempClusterLength[tempNearestCenter]++;
        }
        // Step 4. Compute the new cluster centers.
        double[][] tempNewCenter = new double[numClusters][datasets.numAttributes() - 1];
        for (int i = 0; i < datasets.numInstances(); i++) {
            for (int j = 0; j < datasets.numAttributes() - 1; j++) {
                tempNewCenter[tempClusterArray[i]][j] += datasets.instance(i).value(j);
            }
        }
        // Step 5. Average, and check whether any center changed.
        for (int i = 0; i < numClusters; i++) {
            for (int j = 0; j < datasets.numAttributes() - 1; j++) {
                tempNewCenter[i][j] /= tempClusterLength[i];
            }
            if (!Arrays.equals(tempClusterCenters[i], tempNewCenter[i])) {
                changeFlag = true;
            }
        }
        // Step 6. The new means become the cluster centers.
        tempClusterCenters = tempNewCenter;
    }
    // Form the clusters from the final assignment.
    clusters = new int[numClusters][];
    int[] tempCounters = new int[numClusters];
    for (int i = 0; i < numClusters; i++) {
        clusters[i] = new int[tempClusterLength[i]];
    }
    for (int i = 0; i < tempClusterArray.length; i++) {
        clusters[tempClusterArray[i]][tempCounters[tempClusterArray[i]]] = i;
        tempCounters[tempClusterArray[i]]++;
    }
    for (int[] cluster : clusters) {
        System.out.println("The clusters are: " + Arrays.toString(cluster));
    }
}
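One caveat of the center-comparison test above is that `Arrays.equals` on `double[]` checks bit-exact equality. In k-means this usually works, because once the partition stops changing the recomputed means are identical, but a tolerance-based test is a more defensive stopping rule. A sketch, with hypothetical class and method names:

```java
// Sketch: tolerance-based convergence test for cluster centers, as a more
// defensive alternative to bit-exact equality via Arrays.equals.
public class CenterConvergence {
    static boolean converged(double[][] oldCenters, double[][] newCenters, double epsilon) {
        for (int i = 0; i < oldCenters.length; i++) {
            for (int j = 0; j < oldCenters[i].length; j++) {
                if (Math.abs(oldCenters[i][j] - newCenters[i][j]) > epsilon) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[][] a = { {1.0, 2.0}, {3.0, 4.0} };
        double[][] b = { {1.0 + 1e-9, 2.0}, {3.0, 4.0} };
        System.out.println(converged(a, b, 1e-6)); // prints true
    }
}
```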
3. Algorithm Characteristics
Main advantages of K-Means:
1) The principle is simple, implementation is easy, and convergence is fast.
2) The clustering quality is relatively good.
3) The algorithm is highly interpretable.
4) The only main parameter to tune is the number of clusters k.
Main disadvantages of K-Means:
1) The value of K is hard to choose.
2) Convergence is difficult on non-convex datasets.
3) If the latent classes are imbalanced, e.g., their sizes differ severely or their variances differ, the clustering quality is poor.
4) As an iterative method, it only reaches a local optimum.
5) It is sensitive to noise and outliers.
6) It cannot cluster irregularly shaped data.
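For drawback 1), a common heuristic is the elbow method: run k-means for several values of k, record the sum of squared errors (SSE) for each, and pick the k where the SSE curve bends. The sketch below shows only the SSE computation for a given partition; the class and method names are hypothetical:

```java
// Hypothetical sketch: SSE of a clustering, the quantity plotted against k
// in the elbow method.
public class SseDemo {
    // Sum of squared distances from each point to its assigned center.
    static double sse(double[][] data, double[][] centers, int[] assignment) {
        double total = 0;
        for (int i = 0; i < data.length; i++) {
            double[] c = centers[assignment[i]];
            for (int j = 0; j < data[i].length; j++) {
                double d = data[i][j] - c[j];
                total += d * d;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {0, 2}, {10, 10} };
        double[][] centers = { {0, 1}, {10, 10} };
        int[] assignment = { 0, 0, 1 };
        System.out.println(sse(data, centers, assignment)); // prints 2.0
    }
}
```

SSE always decreases as k grows (with k = N it reaches 0), so one looks for the point of diminishing returns rather than the minimum.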
4. Complete Implementation Code
package swpu.zjy.ML.Kmeans;

import weka.core.Instances;

import java.io.FileReader;
import java.util.Arrays;
import java.util.Random;

public class myKMeans {
    // Manhattan distance.
    public static final int MANHATTAN = 0;
    // Euclidean distance (kept squared; the square root does not affect nearest-center comparisons).
    public static final int EUCLIDEAN = 1;
    // The selected distance measure.
    public int distanceMeasure = EUCLIDEAN;
    // The dataset.
    Instances datasets;
    // The number of clusters.
    int numClusters = 2;
    int[][] clusters;
    // Random number generator.
    public static final Random random = new Random();

    public myKMeans(String dataSetFileName) {
        try {
            FileReader fileReader = new FileReader(dataSetFileName);
            datasets = new Instances(fileReader);
            fileReader.close();
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("Cannot read the file: " + dataSetFileName + "\r\n");
        }
    }

    public void setDistanceMeasure(int distanceMeasure) {
        this.distanceMeasure = distanceMeasure;
    }

    public void setNumClusters(int numClusters) {
        this.numClusters = numClusters;
    }

    /**
     * Return a random permutation of 0 .. numInstance - 1.
     */
    public static int[] shuffle(int numInstance) {
        // Build and initialize the index array.
        int[] tempIndices = new int[numInstance];
        for (int i = 0; i < numInstance; i++) {
            tempIndices[i] = i;
        }
        // Shuffle by random swaps.
        int tempFirst, tempSecond, tempIndex;
        for (int i = 0; i < numInstance; i++) {
            tempFirst = random.nextInt(numInstance);
            tempSecond = random.nextInt(numInstance);
            tempIndex = tempIndices[tempFirst];
            tempIndices[tempFirst] = tempIndices[tempSecond];
            tempIndices[tempSecond] = tempIndex;
        }
        return tempIndices;
    }

    /**
     * Distance between instance dataA and a center array, under the selected measure.
     */
    public double distance(int dataA, double[] paraArray) {
        double resultDistance = 0;
        double tempDistance;
        switch (distanceMeasure) {
            case MANHATTAN:
                for (int i = 0; i < datasets.numAttributes() - 1; i++) {
                    tempDistance = datasets.instance(dataA).value(i) - paraArray[i];
                    if (tempDistance < 0) {
                        resultDistance -= tempDistance;
                    } else {
                        resultDistance += tempDistance;
                    }
                }
                break;
            case EUCLIDEAN:
                for (int i = 0; i < datasets.numAttributes() - 1; i++) {
                    tempDistance = datasets.instance(dataA).value(i) - paraArray[i];
                    resultDistance += tempDistance * tempDistance;
                }
                break;
            default:
                System.out.println("Unknown distance measure.");
                System.exit(0);
        }
        return resultDistance;
    }

    public void Clustering() {
        // Step 1. Initialize the center matrix.
        double[][] tempClusterCenters = new double[numClusters][datasets.numAttributes() - 1];
        // Whether any cluster center changed in the last iteration.
        boolean changeFlag = true;
        // Take the first numClusters entries of a random permutation so that the
        // initial centers are distinct instances.
        int[] tempRandomOrders = shuffle(datasets.numInstances());
        for (int i = 0; i < numClusters; i++) {
            for (int j = 0; j < datasets.numAttributes() - 1; j++) {
                tempClusterCenters[i][j] = datasets.instance(tempRandomOrders[i]).value(j);
            }
        }
        // Cluster sizes.
        int[] tempClusterLength = new int[numClusters];
        // Nearest center and distances.
        int tempNearestCenter;
        double tempNearestDistance;
        double tempDistance;
        // Cluster assignment of each instance, recorded by index.
        int[] tempClusterArray = new int[datasets.numInstances()];
        while (changeFlag) {
            // Step 2. Reset the clusters.
            changeFlag = false;
            Arrays.fill(tempClusterArray, 0);
            Arrays.fill(tempClusterLength, 0);
            // Step 3. Assign each instance to its nearest center.
            for (int i = 0; i < datasets.numInstances(); i++) {
                tempNearestDistance = Double.MAX_VALUE;
                tempNearestCenter = 0;
                for (int j = 0; j < numClusters; j++) {
                    tempDistance = distance(i, tempClusterCenters[j]);
                    if (tempDistance < tempNearestDistance) {
                        tempNearestDistance = tempDistance;
                        tempNearestCenter = j;
                    }
                }
                tempClusterArray[i] = tempNearestCenter;
                tempClusterLength[tempNearestCenter]++;
            }
            // Step 4. Compute the new cluster centers.
            double[][] tempNewCenter = new double[numClusters][datasets.numAttributes() - 1];
            for (int i = 0; i < datasets.numInstances(); i++) {
                for (int j = 0; j < datasets.numAttributes() - 1; j++) {
                    tempNewCenter[tempClusterArray[i]][j] += datasets.instance(i).value(j);
                }
            }
            // Step 5. Average, and check whether any center changed.
            for (int i = 0; i < numClusters; i++) {
                for (int j = 0; j < datasets.numAttributes() - 1; j++) {
                    tempNewCenter[i][j] /= tempClusterLength[i];
                }
                if (!Arrays.equals(tempClusterCenters[i], tempNewCenter[i])) {
                    changeFlag = true;
                }
            }
            // Step 6. The new means become the cluster centers.
            tempClusterCenters = tempNewCenter;
        }
        // Form the clusters from the final assignment.
        clusters = new int[numClusters][];
        int[] tempCounters = new int[numClusters];
        for (int i = 0; i < numClusters; i++) {
            clusters[i] = new int[tempClusterLength[i]];
        }
        for (int i = 0; i < tempClusterArray.length; i++) {
            clusters[tempClusterArray[i]][tempCounters[tempClusterArray[i]]] = i;
            tempCounters[tempClusterArray[i]]++;
        }
        for (int[] cluster : clusters) {
            System.out.println("The clusters are: " + Arrays.toString(cluster));
        }
    }

    public static void main(String[] args) {
        myKMeans tempKMeans = new myKMeans("E:\\DataSet\\iris.arff");
        tempKMeans.setNumClusters(3);
        tempKMeans.Clustering();
    }
}



