1. K-means clustering
2. Gaussian mixture model
3. Bisecting k-means
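
MLlib supports relatively few clustering models. The main ones are k-means, Gaussian mixture models (GMM), bisecting k-means, and latent Dirichlet allocation (LDA).

1. K-means clustering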
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Reuse the active SparkSession, or create one if none exists
spark = SparkSession.builder.getOrCreate()

# Load the data
dfdata = spark.read.format("libsvm").load("data/sample_kmeans_data.txt")

# Train the K-means model
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dfdata)

# Make predictions
dfpredictions = model.transform(dfdata)

# Evaluate the model with the silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(dfpredictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Print the cluster centers
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Silhouette with squared euclidean distance = 0.9997530305375207
Cluster Centers:
[9.1 9.1 9.1]
[0.1 0.1 0.1]

2. Gaussian mixture model
from pyspark.ml.clustering import GaussianMixture
dfdata = spark.read.format("libsvm").load("data/sample_kmeans_data.txt")
gmm = GaussianMixture().setK(2).setSeed(538009335)
model = gmm.fit(dfdata)
print("Gaussians shown as a Dataframe: ")
model.gaussiansDF.show(truncate=True)
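
What the fit estimates can be illustrated with a small expectation-maximisation (EM) loop in plain Python. This is only a sketch and not Spark's implementation: it works on a single feature, uses a crude initialisation, and assumes the sample file holds the six points 0.0, 0.1, 0.2, 9.0, 9.1 and 9.2 (repeated across three features). The names `xs`, `normal_pdf` and `fit_gmm_1d` are made up for the illustration.

```python
import math

# Assumed first-feature values of the six sample points (not shown in the post).
xs = [0.0, 0.1, 0.2, 9.0, 9.1, 9.2]


def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(xs, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    # Crude initialisation chosen for this sketch only.
    means = [min(xs), max(xs)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = sum(r[k] * (x - means[k]) ** 2
                               for r, x in zip(resp, xs)) / nk
            weights[k] = nk / len(xs)
    return weights, means, variances


weights, means, variances = fit_gmm_1d(xs)
print(means)      # close to 0.1 and 9.1
print(variances)  # both close to 0.02 / 3
```

Under these assumptions the two means settle near 0.1 and 9.1 and each variance near 0.0067, which is in line with the `mean` and `cov` columns that Spark reports. The Spark job itself prints the following.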

Gaussians shown as a Dataframe:
+--------------------+--------------------+
|                mean|                 cov|
+--------------------+--------------------+
|[0.10000000000001...|0.006666666666806...|
|[9.09999999999998...|0.006666666666812...|
+--------------------+--------------------+

3. Bisecting k-means
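Bisecting k-means is a top-down (divisive) hierarchical clustering algorithm. All sample points start in a single cluster, and clusters are then repeatedly split in two with k-means until the requested number of clusters is reached.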
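
The splitting procedure can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: the rule used here for picking the next cluster to split (largest within-cluster sum of squared errors) is an assumption made for the sketch, the helper names are hypothetical, and the six sample points are assumed to be the contents of the data file.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)


def two_means(points, rng, iterations=20):
    """Split one cluster into two halves with plain k-means (k=2)."""
    centers = rng.sample(points, 2)
    halves = ([], [])
    for _ in range(iterations):
        halves = ([], [])
        for p in points:
            nearest = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            halves[nearest].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split, give up
        centers = [mean(halves[0]), mean(halves[1])]
    return halves


def bisecting_kmeans(points, k, seed=1):
    """Start with one cluster and keep bisecting until there are k clusters."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        # Selection rule assumed for this sketch: split the cluster with the largest SSE.
        target = max(splittable, key=sse)
        left, right = two_means(target, rng)
        if not left or not right:
            break
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters


# Assumed contents of data/sample_kmeans_data.txt.
points = [
    (0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
    (9.0, 9.0, 9.0), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2),
]
clusters = bisecting_kmeans(points, k=2)
for cluster in clusters:
    print(mean(cluster))
```

With k=2 only one split is needed, so on this data the sketch behaves like ordinary 2-means. The hierarchy only becomes visible for larger k, when an already split cluster is bisected again.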
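from pyspark.ml.clustering import BisectingKMeans

# Load the data
dfdata = spark.read.format("libsvm").load("data/sample_kmeans_data.txt")

# Train the bisecting k-means model
bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(dfdata)

# Evaluate the model: within-set sum of squared errors.
# Note: computeCost is deprecated as of Spark 3.0; ClusteringEvaluator is the suggested replacement.
cost = model.computeCost(dfdata)
print("Within Set Sum of Squared Errors = " + str(cost))

# Print the cluster centers
print("Cluster Centers: ")
centers = model.clusterCenters()
for center in centers:
    print(center)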

Within Set Sum of Squared Errors = 0.11999999999994547
Cluster Centers:
[0.1 0.1 0.1]
[9.1 9.1 9.1]
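
Both quality figures printed in this post, the silhouette from section 1 and the within-set sum of squared errors from section 3, can be checked by hand. The sketch below is plain Python and assumes that the sample file contains the six points listed in `points`. The file contents are not shown in the post, so treat them as an assumption. The function names are made up for the illustration.

```python
# Assumed contents of data/sample_kmeans_data.txt (six 3-D points).
points = [
    (0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
    (9.0, 9.0, 9.0), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2),
]
# Cluster assignment implied by the two reported centers.
labels = [0, 0, 0, 1, 1, 1]
centers = [(0.1, 0.1, 0.1), (9.1, 9.1, 9.1)]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette(points, labels):
    """Mean silhouette coefficient based on squared Euclidean distance."""
    scores = []
    for i, p in enumerate(points):
        # Mean squared distance from p to each cluster's members, excluding p itself.
        mean_dist = {}
        for c in set(labels):
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(sq_dist(p, q) for q in members) / len(members)
        if labels[i] not in mean_dist:
            scores.append(0.0)  # a point alone in its cluster scores 0
            continue
        a = mean_dist[labels[i]]                                     # cohesion
        b = min(d for c, d in mean_dist.items() if c != labels[i])   # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def wssse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


score = silhouette(points, labels)
cost = wssse(points, centers)
print(score)  # about 0.99975
print(cost)   # about 0.12
```

A silhouette close to 1 together with a very small sum of squared errors says the same thing from two angles: the points sit tightly around their own center and far from the other one.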

Download link for the data file:
sample_kmeans_data.txt



