
Decision Trees — Two Implementations of a Decision Tree (Python)



From Zhou Zhihua's "Machine Learning" we can take the following decision-tree construction algorithm based on the information-gain criterion:

  Input: training set D = {(x1, y1), (x2, y2), ..., (xm, ym)}
         attribute set A = {a1, a2, ..., ad}
  Process: function createTree(D, A)
  1.  generate node;
  2.  if all samples in D belong to the same class C then
  3.      mark node as a class-C leaf node; return
  4.  end if
  5.  if A = ∅ OR all samples in D take identical values on A then
  6.      mark node as a leaf node labeled with the majority class in D; return
  7.  end if
  8.  select the optimal splitting attribute a* from A (by information gain or another criterion);
  9.  for each value a*^v of a* do
  10.     generate a branch for node; let Dv denote the subset of samples in D taking value a*^v on a*;
  11.     if Dv is empty then
  12.         mark the branch node as a leaf labeled with the majority class in D; return
  13.     else
  14.         take createTree(Dv, A \ {a*}) as the branch node
  15.     end if
  16. end for
   Output: a decision tree rooted at node
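Step 8's information-gain criterion rests on Shannon entropy, Ent(D) = -Σ_k p_k·log2(p_k), and Gain(D, a) = Ent(D) - Σ_v (|Dv|/|D|)·Ent(Dv). As a quick worked check on the toy dataset used in the listing below (two 'yes' samples, three 'no' samples), the numbers can be computed directly — a minimal sketch, independent of the full implementation:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Toy dataset: 2 'yes' and 3 'no' samples
base = entropy(['yes', 'yes', 'no', 'no', 'no'])   # Ent(D) ~ 0.971

# Splitting on the first attribute: value 1 -> {yes, yes, no}, value 0 -> {no, no}
split = (3/5) * entropy(['yes', 'yes', 'no']) + (2/5) * entropy(['no', 'no'])
gain = base - split                                 # Gain ~ 0.420
print(round(base, 3), round(gain, 3))
```

Since 0.420 is larger than the gain of the second attribute, step 8 picks the first attribute as the root split.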

Accordingly, the information-gain-based decision tree can be implemented as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Oct 29 20:15:21 2018

@author: lxh
"""
# Decision tree implementation

from math import log
import time

def createDataSet():
    dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    labels = ['Tree', 'leaves']
    return dataSet, labels

# Compute the Shannon entropy of a dataset
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

# Keep the samples with featVec[axis] == value, dropping the already-decided attribute
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

# Select the best attribute to split on, by information gain
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column of each sample is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
# The tree is built by consuming attributes, so the attributes may run out before
# the samples are fully separated; in that case the node's class is decided by
# majority vote.
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    return max(classCount, key=classCount.get)  # plain max(classCount) would return the largest key, not the majority class

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):  # all samples share one class: stop splitting
        return classList[0]
    if len(dataSet[0]) == 1:  # all attributes consumed: fall back to majority vote
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy so the recursion does not mutate the caller's list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
     
def main():
    data, label = createDataSet()
    t1 = time.perf_counter()  # time.clock() was removed in Python 3.8
    myTree = createTree(data, label)
    t2 = time.perf_counter()
    print(myTree)
    print('execution time:', t2 - t1)


if __name__ == '__main__':
    main()
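The listing builds the tree but never uses it to label a new sample. A small lookup helper, not part of the original, might look like the sketch below; the nested-dict literal is the tree the listing produces for its toy dataset (hand-derived):

```python
def classify(tree, featLabels, sample):
    # Leaves are plain class labels; internal nodes are {attribute: {value: subtree}}
    while isinstance(tree, dict):
        feat = next(iter(tree))                           # attribute tested at this node
        tree = tree[feat][sample[featLabels.index(feat)]]
    return tree

# Nested dict produced by createTree on the toy dataset above
toyTree = {'Tree': {0: 'no', 1: {'leaves': {0: 'no', 1: 'yes'}}}}
print(classify(toyTree, ['Tree', 'leaves'], [1, 1]))  # -> yes
print(classify(toyTree, ['Tree', 'leaves'], [0, 1]))  # -> no
```

A sample whose value for an attribute never appeared in training would raise a KeyError here; handling unseen values is left to the reader.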
    

【Decision Tree Implementation Based on scikit-learn】

The scikit-learn library provides ready-made interfaces for decision trees, so we can implement the algorithm by calling them. Besides the decision tree itself, this part also covers random forests, decision-tree visualization, and automatic parameter selection. The code below is from a Jupyter notebook:

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_california_housing  # sklearn.datasets.california_housing was removed in newer releases
housing = fetch_california_housing()
print(housing.DESCR)

housing.data.shape
housing.data[0]

from sklearn import tree
dtr = tree.DecisionTreeRegressor(max_depth=2)
dtr.fit(housing.data[:, [6, 7]], housing.target)

# To visualize the tree, first install graphviz: http://www.graphviz.org/Download..php
dot_data = tree.export_graphviz(
    dtr,
    out_file=None,
    feature_names=housing.feature_names[6:8],
    filled=True,
    impurity=False,
    rounded=True
)

# Render the decision tree
# pip install pydotplus
import pydotplus
graph = pydotplus.graph_from_dot_data(dot_data)
graph.get_nodes()[7].set_fillcolor("#FFF2DD")
from IPython.display import Image
Image(graph.create_png())

# Save the image
graph.write_png("dtr_white_background.png")

# Random forest
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = \
    train_test_split(housing.data, housing.target, test_size=0.1, random_state=42)
dtr = tree.DecisionTreeRegressor(random_state=42)
dtr.fit(data_train, target_train)

dtr.score(data_test, target_test)

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(random_state=42)
rfr.fit(data_train, target_train)
rfr.score(data_test, target_test)

# Automatic parameter search
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in newer releases
tree_param_grid = {'min_samples_split': [3, 6, 9], 'n_estimators': [10, 50, 100]}
grid = GridSearchCV(RandomForestRegressor(), param_grid=tree_param_grid, cv=5)
grid.fit(data_train, target_train)
grid.cv_results_, grid.best_params_, grid.best_score_  # grid_scores_ was renamed cv_results_

rfr = RandomForestRegressor(min_samples_split=3, n_estimators=100, random_state=42)
rfr.fit(data_train, target_train)
rfr.score(data_test, target_test)

pd.Series(rfr.feature_importances_, index=housing.feature_names).sort_values(ascending=False)
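GridSearchCV above fits one forest per parameter combination per cross-validation fold and keeps the best scorer. Since fetch_california_housing downloads data on first use, here is a self-contained sketch of the same kind of search on synthetic data (the array shapes and parameter values are illustrative, not from the original):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data: the target depends mainly on the first feature
rng = np.random.RandomState(42)
X = rng.rand(120, 3)
y = 3.0 * X[:, 0] + 0.1 * rng.randn(120)

# 3 x 2 = 6 parameter combinations, each scored with 3-fold CV
param_grid = {'min_samples_split': [3, 6, 9], 'n_estimators': [10, 50]}
grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
print(round(grid.best_score_, 3))  # mean cross-validated R^2 of the best combination
```

After the search, grid.best_estimator_ is already refit on the full training data, so it can be used directly instead of constructing a fresh RandomForestRegressor by hand as in the cell above.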

The decision-tree visualization result is shown below:

Reprint notice: this article originally appeared at www.mshxw.com — https://www.mshxw.com/it/220485.html