I believe this is possible by modifying the estimators_ and n_estimators attributes of the RandomForestClassifier object. Each tree in the forest is stored as a DecisionTreeClassifier object, and the list of these trees is stored in the estimators_ attribute. To keep everything consistent, it also makes sense to update n_estimators when you change the list.
The advantage of this approach is that you can build a bunch of small forests in parallel across multiple machines and then combine them.
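As a minimal sketch of that parallel build-and-merge idea, you could train the small forests in separate processes with joblib (which sklearn already depends on) and then fold them together. The function names here (train_small_forest, merge) are my own, not part of sklearn:

```python
from functools import reduce
from joblib import Parallel, delayed
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def train_small_forest(X, y, seed):
    # each worker trains a small 5-tree forest with its own random seed
    rf = RandomForestClassifier(n_estimators=5, random_state=seed)
    rf.fit(X, y)
    return rf

def merge(rf_a, rf_b):
    # graft the trees of rf_b onto rf_a and keep n_estimators in sync
    rf_a.estimators_ += rf_b.estimators_
    rf_a.n_estimators = len(rf_a.estimators_)
    return rf_a

iris = load_iris()
# train 4 small forests in parallel (2 worker processes here)
forests = Parallel(n_jobs=2)(
    delayed(train_small_forest)(iris.data, iris.target, seed)
    for seed in range(4)
)
big_rf = reduce(merge, forests)
print(big_rf.n_estimators)  # 4 forests x 5 trees = 20
```

In a real multi-machine setup you would replace joblib with whatever distribution mechanism you have (each machine pickles and ships its fitted forest), but the merge step is the same.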
Here is an example using the iris dataset:
from functools import reduce

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

def generate_rf(X_train, y_train, X_test, y_test):
    rf = RandomForestClassifier(n_estimators=5, min_samples_leaf=3)
    rf.fit(X_train, y_train)
    print("rf score", rf.score(X_test, y_test))
    return rf

def combine_rfs(rf_a, rf_b):
    rf_a.estimators_ += rf_b.estimators_
    rf_a.n_estimators = len(rf_a.estimators_)
    return rf_a

iris = load_iris()
X, y = iris.data[:, [0, 1, 2]], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)
# in the line below, we create 10 random forest classifier models
rfs = [generate_rf(X_train, y_train, X_test, y_test) for i in range(10)]
# in this step below, we combine the list of random forest models into one giant model
rf_combined = reduce(combine_rfs, rfs)
# the combined model scores better than *most* of the component models
print("rf combined score", rf_combined.score(X_test, y_test))



