Dataset: housing.csv (regression problem)
Reference book: Machine Learning Mastery With Python: Understand Your Data, Create Accurate Models and Work Projects End-to-End
Download link: https://github.com/aoyinke/ML_learner
This article covers three performance metrics for regression problems (including the underlying concepts, caveats, and how to compute each metric with sklearn).
Regression Metrics: Mean Absolute Error, Mean Squared Error, R^2
Note that the mean absolute error and mean squared error values reported below are inverted (negative). This is a quirk of the cross_val_score() function: it follows the convention that a larger score is always better, so error metrics are returned negated (hence the neg_ prefix in the scoring names).
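To illustrate the sign convention, here is a minimal sketch. It assumes synthetic data from make_regression as a stand-in for housing.csv, and the names X_demo, y_demo, neg_mse and mse are made up for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data stands in for housing.csv (an assumption for this sketch).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

kfold = KFold(n_splits=5)
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=kfold, scoring='neg_mean_squared_error')

# The scorer returns the negated error so that "larger is better" holds.
mse = -neg_mse  # flip the sign to recover the ordinary, non-negative MSE
print("negated MSE per fold:", neg_mse)
print("MSE per fold:", mse)
```

Flipping the sign only matters when reporting; for comparing models, the negated scores can be used as they are.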
Data preparation
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r'\s+' reads whitespace-delimited columns (delim_whitespace is deprecated in recent pandas)
dataframe = read_csv(filename, sep=r'\s+', names=names)
array = dataframe.values
X = array[:,0:13]  # the 13 input features
Y = array[:,13]    # the target column, MEDV
Mean Absolute Error(MAE)
Concept:
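Absolute error: the absolute value of the difference between the predicted value and the true value. MAE is the mean of these absolute errors:
MAE = (1/n) Σ |xi – x|
n = the number of errors,
Σ = summation symbol (which means “add them all up”),
|xi – x| = the absolute errors.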
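The definition can be checked by hand on a few numbers. This is a minimal sketch; the values in y_true and y_pred are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical true values and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

abs_errors = np.abs(y_pred - y_true)        # |xi - x| for each sample
mae = abs_errors.sum() / len(abs_errors)    # (1/n) * sum of the absolute errors
print(abs_errors)  # the four absolute errors: 0.5, 0.0, 1.5, 1.0
print(mae)         # 0.75
```

sklearn.metrics.mean_absolute_error(y_true, y_pred) should return the same value.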
Code:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
kfold = KFold(n_splits=10)  # no shuffling, so no random_state (scikit-learn rejects random_state when shuffle=False)
model = LinearRegression()
scoring = 'neg_mean_absolute_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MAE: %.3f (%.3f)" % (results.mean(), results.std()) )
# MAE: -4.005 (2.084)
Mean Squared Error(MSE)
Concept:
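MSE is the mean of the squared differences between the observed and predicted values:
MSE = (1/n) Σ (Yi – Ŷi)^2
n = the number of data points,
Yi = original or observed y-value,
Ŷi (Y-hat) = y-value from regression.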
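A small worked example makes the effect of squaring visible. This is a minimal sketch; the values in y_true and y_pred are hypothetical.

```python
import numpy as np

# Hypothetical observed values (Yi) and regression outputs (Y-hat).
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

squared_errors = (y_true - y_pred) ** 2   # (Yi - Y-hat)^2 for each sample
mse = squared_errors.mean()               # average of the squared errors
print(squared_errors)  # the four squared errors: 0.25, 0.0, 2.25, 1.0
print(mse)             # 0.875
```

The single error of 1.5 contributes 2.25 of the 3.5 total, which shows how MSE weights large errors more heavily than MAE does.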
Code:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
kfold = KFold(n_splits=10)  # no shuffling, so no random_state (scikit-learn rejects random_state when shuffle=False)
model = LinearRegression()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MSE: %.3f (%.3f)" % (results.mean(), results.std()) )
# MSE: -34.705 (45.574)
R Squared
Concept:
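R-squared (R2) is a statistical measure of the proportion of the variance in the dependent variable that is explained by the independent variable or variables in a regression model. Correlation describes the strength of the relationship between an independent and a dependent variable, whereas R-squared describes how much of the variance of one variable is explained by the variance of the other. So if a model's R2 is 0.50, roughly half of the observed variation can be explained by the model's inputs.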
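The "proportion of variance explained" reading follows from the usual definition R2 = 1 - SS_res / SS_tot, which is also what the 'r2' scorer computes. A minimal sketch, using hypothetical values for y_true and y_pred:

```python
import numpy as np

# Hypothetical observed values and model predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot
print(round(r2, 3))  # 0.724
```

Because nothing forces SS_res to be smaller than SS_tot on held-out data, R2 can be negative for an individual cross-validation fold.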
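What is a good R2:
This depends on the field of study, since expectations for R2 differ from one field to another. For example, in the social sciences an R2 of 0.5 may already be considered good, whereas in finance an R2 of about 0.7 or higher is usually expected.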
Limitations:
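- A high or low R2 score does not tell you whether the chosen model is good or bad, nor does it show whether the data and predictions are biased.
- You cannot use R2 to judge whether you have chosen the right regression algorithm.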
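The limitations above can be seen in a small experiment. This is a sketch under assumed conditions: the data are synthetic (y = x^2) and a straight line is fitted with np.polyfit, so the model is wrong by construction.

```python
import numpy as np

x = np.linspace(0.0, 10.0, 50)
y = x ** 2                               # the true relationship is quadratic

slope, intercept = np.polyfit(x, y, 1)   # least-squares straight line
y_hat = slope * x + intercept
residuals = y - y_hat

r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
print("R2:", round(r2, 3))
# Residual sign at the left end, the middle, and the right end of the range.
print(np.sign(residuals[[0, 25, -1]]))
```

R2 comes out above 0.9, yet the residuals are positive at both ends and negative in the middle. That systematic pattern shows the straight line is the wrong model, which the R2 score alone does not reveal.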
Code:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
kfold = KFold(n_splits=10)  # no shuffling, so no random_state (scikit-learn rejects random_state when shuffle=False)
model = LinearRegression()
scoring = 'r2'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("R^2: %.3f (%.3f)" % (results.mean(), results.std()) )
# R^2: 0.203 (0.595)



