Chapter 1: Classification Models
Table of Contents
- PySpark model training demo
- 1. Logistic Regression Model
- Normalize the features
- Step 5:
1. Logistic Regression Model
Step 1: build the raw model data with pandas and createDataFrame:
# spark version 3.0.1
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
import pandas as pd

spark = SparkSession.builder.appName("lr-demo").getOrCreate()

# model data
pandas_df = pd.DataFrame({
    'a': [1, 1, 0, 1, 0],
    'b': [1, 0, 1, 1, 1],
    'c': [0, 1, 0, 0, 0],
    'y': [0, 0, 0, 1, 1],
    'id': ['A001', 'A002', 'A003', 'A004', 'A005']
})
df = spark.createDataFrame(pandas_df).select("id", "a", "b", "c", "y")
df.show()
+----+---+---+---+---+
|  id|  a|  b|  c|  y|
+----+---+---+---+---+
|A001|  1|  1|  0|  0|
|A002|  1|  0|  1|  0|
|A003|  0|  1|  0|  0|
|A004|  1|  1|  0|  1|
|A005|  0|  1|  0|  1|
+----+---+---+---+---+
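Before the pandas frame is handed to Spark, it can be worth a quick sanity check of its shape and column order; a pandas-only sketch (no Spark session required):

```python
import pandas as pd

# same data as above, built with pandas only
pandas_df = pd.DataFrame({
    'a': [1, 1, 0, 1, 0],
    'b': [1, 0, 1, 1, 1],
    'c': [0, 1, 0, 0, 0],
    'y': [0, 0, 0, 1, 1],
    'id': ['A001', 'A002', 'A003', 'A004', 'A005'],
})
print(pandas_df.shape)           # 5 rows, 5 columns
print(list(pandas_df.columns))   # columns come out in dict insertion order
```

Column order from the dict is not the order the demo wants (`id` last), which is why the `select("id","a","b","c","y")` above reorders the columns on the Spark side.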
Step 2: assemble the features into a vector and normalize them
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Normalizer

# assemble the features column
vecAss = VectorAssembler(inputCols=['a', 'b', 'c'], outputCol='features')
df_features = vecAss.transform(df)
df_features.show()
+----+---+---+---+---+-------------+
|  id|  a|  b|  c|  y|     features|
+----+---+---+---+---+-------------+
|A001|  1|  1|  0|  0|[1.0,1.0,0.0]|
|A002|  1|  0|  1|  0|[1.0,0.0,1.0]|
|A003|  0|  1|  0|  0|[0.0,1.0,0.0]|
|A004|  1|  1|  0|  1|[1.0,1.0,0.0]|
|A005|  0|  1|  0|  1|[0.0,1.0,0.0]|
+----+---+---+---+---+-------------+

Normalize the features
Norm = Normalizer(inputCol="features", outputCol="normFeatures", p=1.0)
df_norm_features = Norm.transform(df_features)
df_norm_features.show()
+----+---+---+---+---+-------------+-------------+
|  id|  a|  b|  c|  y|     features| normFeatures|
+----+---+---+---+---+-------------+-------------+
|A001|  1|  1|  0|  0|[1.0,1.0,0.0]|[0.5,0.5,0.0]|
|A002|  1|  0|  1|  0|[1.0,0.0,1.0]|[0.5,0.0,0.5]|
|A003|  0|  1|  0|  0|[0.0,1.0,0.0]|[0.0,1.0,0.0]|
|A004|  1|  1|  0|  1|[1.0,1.0,0.0]|[0.5,0.5,0.0]|
|A005|  0|  1|  0|  1|[0.0,1.0,0.0]|[0.0,1.0,0.0]|
+----+---+---+---+---+-------------+-------------+
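With p=1.0, Normalizer divides each row vector by its L1 norm (the sum of absolute values), which is what produces the 0.5 entries above. A plain-Python check of the first two rows, without a Spark session:

```python
# L1 normalization: divide each component by the sum of absolute values.
# This mirrors what pyspark.ml.feature.Normalizer(p=1.0) does per row.
def l1_normalize(v):
    norm = sum(abs(x) for x in v)
    return [x / norm for x in v] if norm else list(v)

print(l1_normalize([1.0, 1.0, 0.0]))  # [0.5, 0.5, 0.0]  (row A001)
print(l1_normalize([1.0, 0.0, 1.0]))  # [0.5, 0.0, 0.5]  (row A002)
```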
Step 3: train the model
# train the model
model = LogisticRegression(featuresCol='normFeatures', labelCol='y', maxIter=100,
                           tol=1e-06, threshold=0.5, predictionCol='prediction',
                           probabilityCol='probability', rawPredictionCol='rawPrediction',
                           standardization=True).fit(df_norm_features)

print(model.coefficients)
# [1.8029996152867545,1.803003434834563,-36.96577573215852]
print(model.intercept)
# -1.80300332247
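As a sanity check, the printed coefficients and intercept reproduce the model's probability for row A001 by hand: logistic regression computes a sigmoid over the dot product of weights and features. Plain Python, with the values copied from the output above:

```python
import math

# values copied from the training output above
coefficients = [1.8029996152867545, 1.803003434834563, -36.96577573215852]
intercept = -1.80300332247
norm_features = [0.5, 0.5, 0.0]  # normFeatures of row A001

# linear score, then sigmoid to get P(y=1)
z = sum(c * x for c, x in zip(coefficients, norm_features)) + intercept
p_y1 = 1.0 / (1.0 + math.exp(-z))
print(p_y1)  # just under 0.5, so prediction is 0.0 at threshold=0.5
```

This matches the prediction table below: A001's probability of class 1 sits marginally below 0.5, so with `threshold=0.5` it is classified as 0.0.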
Step 4: make predictions
# make predictions
result = model.transform(df_norm_features)
result.show()
+----+---+---+---+---+-------------+-------------+--------------------+--------------------+----------+
|  id|  a|  b|  c|  y|     features| normFeatures|       rawPrediction|         probability|prediction|
+----+---+---+---+---+-------------+-------------+--------------------+--------------------+----------+
|A001|  1|  1|  0|  0|[1.0,1.0,0.0]|[0.5,0.5,0.0]|[1.79741316608250...|[0.50000044935329...|       0.0|
|A002|  1|  0|  1|  0|[1.0,0.0,1.0]|[0.5,0.0,0.5]|[19.3843913809097...|[0.99999999618525...|       0.0|
|A003|  0|  1|  0|  0|[0.0,1.0,0.0]|[0.0,1.0,0.0]|[-1.1236073826914...|[0.49999997190981...|       1.0|
|A004|  1|  1|  0|  1|[1.0,1.0,0.0]|[0.5,0.5,0.0]|[1.79741316608250...|[0.50000044935329...|       0.0|
|A005|  0|  1|  0|  1|[0.0,1.0,0.0]|[0.0,1.0,0.0]|[-1.1236073826914...|[0.49999997190981...|       1.0|
+----+---+---+---+---+-------------+-------------+--------------------+--------------------+----------+
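Reading `y` against `prediction` in the table above, 3 of the 5 rows are classified correctly. In practice one would score the model with `pyspark.ml.evaluation` classes such as BinaryClassificationEvaluator, but on this toy table the accuracy can be checked by hand:

```python
# columns copied from the prediction table above
labels      = [0, 0, 0, 1, 1]  # column y
predictions = [0, 0, 1, 0, 1]  # column prediction

# fraction of rows where label and prediction agree
accuracy = sum(l == p for l, p in zip(labels, predictions)) / len(labels)
print(accuracy)  # 0.6 (A001, A002, A005 correct; A003, A004 wrong)
```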
result.printSchema()

Step 5:
[1] https://spark.apache.org/docs/3.0.0/api/python/pyspark.ml.html#pyspark.ml.Model



