Straight to the code.
I used GBDT; the code and its prediction results are below.
Only the key code is shown.
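    from pyspark.ml.classification import GBTClassifier

    gbdt = GBTClassifier(featuresCol="features", labelCol="y", predictionCol="prediction")
    # Assumed step: the original snippet does not show how `model` was obtained.
    model = gbdt.fit(df_train)
    # Score the training set and show the label next to the three output columns.
    df_train_eval = model.transform(df_train)
    df_train_eval.select("y", "rawPrediction", "probability", "prediction").show(truncate=False)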
    +---+----------------------------------------+-----------------------------------------+----------+
    |y  |rawPrediction                           |probability                              |prediction|
    +---+----------------------------------------+-----------------------------------------+----------+
    |1  |[-1.5435020027249835,1.5435020027249835]|[0.04364652142729318,0.9563534785727068] |1.0       |
    |1  |[-1.5435020027249835,1.5435020027249835]|[0.04364652142729318,0.9563534785727068] |1.0       |
    |0  |[1.5435020027249835,-1.5435020027249835]|[0.9563534785727067,0.043646521427293306]|0.0       |
    |0  |[1.5435020027249835,-1.5435020027249835]|[0.9563534785727067,0.043646521427293306]|0.0       |
    |0  |[1.5435020027249835,-1.5435020027249835]|[0.9563534785727067,0.043646521427293306]|0.0       |
    |0  |[1.5435020027249835,-1.5435020027249835]|[0.9563534785727067,0.043646521427293306]|0.0       |
    |0  |[1.5435020027249835,-1.5435020027249835]|[0.9563534785727067,0.043646521427293306]|0.0       |
    |1  |[-1.5435020027249835,1.5435020027249835]|[0.04364652142729318,0.9563534785727068] |1.0       |
    |1  |[-1.5435020027249835,1.5435020027249835]|[0.04364652142729318,0.9563534785727068] |1.0       |
    |0  |[1.5435020027249835,-1.5435020027249835]|[0.9563534785727067,0.043646521427293306]|0.0       |
    |1  |[-1.5435020027249835,1.5435020027249835]|[0.04364652142729318,0.9563534785727068] |1.0       |
    |1  |[-1.5435020027249835,1.5435020027249835]|[0.04364652142729318,0.9563534785727068] |1.0       |
    +---+----------------------------------------+-----------------------------------------+----------+
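y: the original label
rawPrediction: the raw prediction score for each class
probability: the predicted probability for each class
prediction: the predicted label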
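Why do I say this is softmax rather than sigmoid?
Reason one:
Look at the probability column: every row sums to 1, which matches softmax's assumption that the classes are mutually exclusive.
Reason two: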
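    import numpy as np

    def softmax(x):
        # Normalize the exponentials so the outputs sum to 1.
        e_x = np.exp(x)
        return e_x / e_x.sum()

    def sigmoid(x):
        # Element-wise logistic function.
        return 1. / (1 + np.exp(-x))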
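Plugging the raw prediction values in rawPrediction into both functions confirms it: applying softmax to the two-element rawPrediction vector reproduces the probability column, whereas applying sigmoid to each element does not.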
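As a quick numeric check of reason two, here is a minimal sketch that feeds the first row of the output table into both functions. It is written with only the standard library `math` module, so `softmax` and `sigmoid` are re-implemented here rather than reusing the NumPy versions; the variable names `raw`, `expected`, `soft`, `sig` and `sig_2m` are just for illustration.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Values copied from the first row of the table (y = 1).
raw = [-1.5435020027249835, 1.5435020027249835]        # rawPrediction
expected = [0.04364652142729318, 0.9563534785727068]   # probability

soft = softmax(raw)                # softmax over the whole vector
sig = [sigmoid(x) for x in raw]    # sigmoid applied to each element
sig_2m = sigmoid(2 * raw[1])       # sigmoid of twice the positive-class score

print("softmax     :", soft)
print("sigmoid     :", sig)
print("sigmoid(2m) :", sig_2m)
print("probability :", expected)
```

The softmax result agrees with the probability column, while the element-wise sigmoid comes out at roughly 0.18 and 0.82. One thing worth noting: for a two-class vector of the form [-m, m], softmax is algebraically the same as sigmoid(2m), so the probability column can equally be read as a sigmoid of the doubled score. What the check rules out is a plain sigmoid applied directly to rawPrediction.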



