PySpark Classification: GBTClassifier

GBTClassifier

class pyspark.ml.classification.GBTClassifier(featuresCol='features', labelCol='label', predictionCol='prediction', maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, lossType='logistic', maxIter=20, stepSize=0.1, seed=None, subsamplingRate=1.0, featureSubsetStrategy='all')

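Gradient-Boosted Trees (GBT) learning algorithm for classification. It supports binary labels, as well as both continuous and categorical features.

Notes on Gradient Boosting vs. TreeBoost:
- This implementation is for Stochastic Gradient Boosting, not for TreeBoost.
- Both algorithms learn tree ensembles by minimizing loss functions.
- TreeBoost (Friedman, 1999) additionally modifies the outputs at tree leaf nodes based on the loss function, whereas the original gradient boosting method does not.

Multiclass labels are not currently supported.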

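stepSize = Param(parent='undefined', name='stepSize', doc='Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator.')

subsamplingRate = Param(parent='undefined', name='subsamplingRate', doc='Fraction of the training data used for learning each decision tree, in range (0, 1].')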

supportedFeatureSubsetStrategies = ['auto', 'all', 'onethird', 'sqrt', 'log2']

supportedLossTypes = ['logistic']

lossType = Param(parent='undefined', name='lossType', doc='Loss function which GBT tries to minimize (case-insensitive). Supported options: logistic')

maxBins = Param(parent='undefined', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.')

maxDepth = Param(parent='undefined', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.')
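The depth examples in the maxDepth description follow from simple arithmetic on binary trees. The sketch below is plain Python rather than part of the PySpark API; `max_nodes` and `max_leaves` are hypothetical helper names, and it assumes every split is binary, as the doc string's examples imply.

```python
def max_leaves(depth):
    """Upper bound on leaf nodes: each extra level can at most double them."""
    return 2 ** depth

def max_nodes(depth):
    """Upper bound on total nodes (internal + leaf) in a binary tree."""
    return 2 ** (depth + 1) - 1

for d in (0, 1, 2, 5):
    # depth 0 -> 1 node (a single leaf); depth 1 -> 3 nodes (1 internal + 2 leaves)
    print(d, max_nodes(d), max_leaves(d))
```

These are upper bounds only. A tree can stop growing earlier when no candidate split satisfies minInfoGain or minInstancesPerNode.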

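maxIter = Param(parent='undefined', name='maxIter', doc='Max number of iterations (>= 0).')

maxMemoryInMB = Param(parent='undefined', name='maxMemoryInMB', doc='Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size.')

minInfoGain = Param(parent='undefined', name='minInfoGain', doc='Minimum information gain for a split to be considered at a tree node.')

minInstancesPerNode = Param(parent='undefined', name='minInstancesPerNode', doc='Minimum number of instances each child must have after a split. If a split causes the left or right child to have fewer than minInstancesPerNode instances, the split is discarded as invalid. Should be >= 1.')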

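model.featureImportances: an estimate of the importance of each feature. Each feature's importance is the average of its importance across all trees in the ensemble. The importance vector is normalized to sum to 1. This method was suggested by Hastie et al. (Hastie, Tibshirani, Friedman. "The Elements of Statistical Learning, 2nd Edition." 2001.) and follows the implementation in scikit-learn.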
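To make the averaging and normalization concrete, here is a simplified sketch in plain Python. It illustrates only the two steps named in the description (average across trees, then normalize to sum to 1). The per-tree scores are invented for illustration, `ensemble_importances` is a hypothetical helper, and how Spark derives each tree's scores internally is not covered here.

```python
def ensemble_importances(per_tree, num_features):
    """Average per-tree importance scores, then normalize to sum to 1."""
    totals = [0.0] * num_features
    for scores in per_tree:
        for feature, value in scores.items():
            totals[feature] += value
    averaged = [t / len(per_tree) for t in totals]
    norm = sum(averaged)
    # Guard against an ensemble in which no tree ever splits.
    return [a / norm for a in averaged] if norm > 0 else averaged

# Hypothetical scores for three trees over two features (feature index -> score).
per_tree = [{0: 3.0, 1: 1.0}, {0: 1.0}, {0: 2.0, 1: 1.0}]
importances = ensemble_importances(per_tree, num_features=2)
print(importances)  # roughly [0.75, 0.25]
```

With a single feature, as in the walkthrough in this article, the normalized vector can only be [1.0].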

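model.treeWeights: returns the weight of each tree

model.totalNumNodes: the total number of nodes, summed over all trees in the ensemble

model.toDebugString: the full description of the model

The **getFeatureSubsetStrategy()** method of a GBTClassifier object: gets the value of featureSubsetStrategy or its default value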

01. Create the session and generate the data

from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StringIndexer
from numpy import allclose
# The parentheses let the builder chain span several lines.
spark = (SparkSession.builder.config("spark.driver.host", "192.168.1.10")
    .config("spark.ui.showConsoleProgress", "false").appName("GBTClassifier")
    .master("local[*]").getOrCreate())
# Two rows: one dense and one (empty) sparse one-dimensional feature vector.
df = spark.createDataFrame([
    (1.0, Vectors.dense(1.0)),
    (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
df.show()

Output:

+-----+---------+
|label| features|
+-----+---------+
|  1.0|    [1.0]|
|  0.0|(1,[],[])|
+-----+---------+

02. Index the label column as numeric values

stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(df)
td = si_model.transform(df)
td.show()

Output:

+-----+---------+-------+
|label| features|indexed|
+-----+---------+-------+
|  1.0|    [1.0]|    1.0|
|  0.0|(1,[],[])|    0.0|
+-----+---------+-------+

03. Construct the classifier and get the value of the feature subset strategy (or its default)

gbt = GBTClassifier(maxIter=5, maxDepth=2, labelCol="indexed", seed=42)
print(gbt.getFeatureSubsetStrategy())

Output:

all

04. Fit the model and use it to transform the original data

model = gbt.fit(td)
model.transform(td).show()
print(model.transform(td).head(2))

Output:

+-----+---------+-------+--------------------+--------------------+----------+
|label| features|indexed|       rawPrediction|         probability|prediction|
+-----+---------+-------+--------------------+--------------------+----------+
|  1.0|    [1.0]|    1.0|[-1.1696739081251...|[0.08791619739525...|       1.0|
|  0.0|(1,[],[])|    0.0|[1.16967390812510...|[0.91208380260474...|       0.0|
+-----+---------+-------+--------------------+--------------------+----------+

[Row(label=1.0, features=DenseVector([1.0]), indexed=1.0, rawPrediction=DenseVector([-1.1697, 1.1697]), probability=DenseVector([0.0879, 0.9121]), prediction=1.0),
 Row(label=0.0, features=SparseVector(1, {}), indexed=0.0, rawPrediction=DenseVector([1.1697, -1.1697]), probability=DenseVector([0.9121, 0.0879]), prediction=0.0)]

05. View the feature importances

print(model.featureImportances)

Output:

(1,[0],[1.0])

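06. Return the array of per-tree weights and use numpy's allclose to compare it with the expected values:

allclose: returns True if two arrays are element-wise equal within a tolerance.
 The tolerance values are positive, typically very small numbers. The
 relative difference (`rtol` * abs(`b`)) and the absolute difference
 `atol` are added together and compared against the absolute difference
 between `a` and `b`.
 By default, if either array contains one or more NaNs, False is returned.
 Infs are treated as equal if they are in the same place and of the same
 sign in both arrays.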
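The comparison rule described above can be written out directly. The following is a minimal sketch in plain Python, not NumPy's actual implementation: `allclose_sketch` is a hypothetical name, it handles only flat lists of equal length, and it assumes NumPy's documented defaults of rtol=1e-05 and atol=1e-08.

```python
import math

def allclose_sketch(a, b, rtol=1e-05, atol=1e-08):
    """Element-wise check of |a - b| <= atol + rtol * |b|."""
    if len(a) != len(b):
        return False  # this sketch requires equal lengths
    for x, y in zip(a, b):
        if math.isnan(x) or math.isnan(y):
            return False              # NaN never compares as close here
        if math.isinf(x) or math.isinf(y):
            if x != y:                # infinities must match in sign
                return False
            continue
        if abs(x - y) > atol + rtol * abs(y):
            return False
    return True

print(allclose_sketch([1.0, 0.1], [1.0, 0.1 + 1e-9]))   # True
print(allclose_sketch([1.0, 0.1], [1.0, 0.11]))         # False
print(allclose_sketch([float("inf")], [float("inf")]))  # True
print(allclose_sketch([float("nan")], [float("nan")]))  # False
```

Because the relative term uses abs(`b`) only, the test is not symmetric in `a` and `b`, so it matters which argument holds the reference values.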

print(model.treeWeights)
print(allclose(model.treeWeights, [1.0, 0.1, 0.1, 0.1, 0.1]))

Output:

[1.0, 0.1, 0.1, 0.1, 0.1]
True

07. Build two test DataFrames and predict with the fitted model

test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
test0.show()
print(model.transform(test0).head().prediction)
test1.show()
print(model.transform(test1).head().prediction)

Output:

+--------+
|features|
+--------+
|  [-1.0]|
+--------+

0.0

+-------------+
|     features|
+-------------+
|(1,[0],[1.0])|
+-------------+

1.0

08. View the total number of nodes in the model

print(model.totalNumNodes)

Output: 15

09. View the full description of the model

print(model.toDebugString)

Output:

GBTClassificationModel (uid=GBTClassifier_e74ed365080e) with 5 trees
  Tree 0 (weight 1.0):
    If (feature 0 <= 0.5)
     Predict: -1.0
    Else (feature 0 > 0.5)
     Predict: 1.0
  Tree 1 (weight 0.1):
    If (feature 0 <= 0.5)
     Predict: -0.4768116880884702
    Else (feature 0 > 0.5)
     Predict: 0.4768116880884702
  Tree 2 (weight 0.1):
    If (feature 0 <= 0.5)
     Predict: -0.4381935810427206
    Else (feature 0 > 0.5)
     Predict: 0.4381935810427206
  Tree 3 (weight 0.1):
    If (feature 0 <= 0.5)
     Predict: -0.4051496802845983
    Else (feature 0 > 0.5)
     Predict: 0.4051496802845983
  Tree 4 (weight 0.1):
    If (feature 0 <= 0.5)
     Predict: -0.3765841318352991
    Else (feature 0 > 0.5)
     Predict: 0.3765841318352991
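The tree outputs printed above are enough to reproduce by hand the rawPrediction and probability values shown in step 04. The sketch below is plain Python and rests on two assumptions that the numbers in this article are consistent with, but that the article itself does not state: the raw margin is the weighted sum of the tree predictions, and the positive-class probability is 1 / (1 + exp(-2 * margin)). `predict_margin` is a hypothetical helper.

```python
import math

# (weight, prediction if feature 0 <= 0.5, prediction if feature 0 > 0.5),
# copied from the toDebugString output.
trees = [
    (1.0, -1.0, 1.0),
    (0.1, -0.4768116880884702, 0.4768116880884702),
    (0.1, -0.4381935810427206, 0.4381935810427206),
    (0.1, -0.4051496802845983, 0.4051496802845983),
    (0.1, -0.3765841318352991, 0.3765841318352991),
]

def predict_margin(x):
    """Weighted sum of the five tree outputs for a single feature value x."""
    return sum(w * (left if x <= 0.5 else right) for w, left, right in trees)

margin = predict_margin(1.0)
prob_positive = 1.0 / (1.0 + math.exp(-2.0 * margin))
print(margin)         # about 1.1697, the second entry of rawPrediction
print(prob_positive)  # about 0.9121, the second entry of probability
```

The same arithmetic shows the role of stepSize: the first tree has weight 1.0 and every later tree is scaled by 0.1, so the later trees only nudge the margin set by the first one.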
