
Employee Attrition Probability Prediction

From Big Data to Deep Data: Overview

Today is the era of "data technology" and "big data". The Internet of Things supplies massive volumes of data, and this abundance is what makes the current boom in artificial intelligence possible. Big-data technology has opened up many new possibilities: relying on the enormous computing power of machines, people extract "experience" and distill "patterns" from many thousands of records, and use them to guide practice.

Against this backdrop, effectively analyzing messy, heterogeneous data is critical. Raw big data has limitations in many application scenarios: the collected data is often noisy and of questionable reliability, so results derived from it can be challenged. Data cleaning addresses this by capturing usable information, removing redundancy, and discarding invalid records, turning the raw data into high-quality, usable, analysis-ready data. Refining low-value-density big data in this way yields deep data (high-quality data focused on a specific prediction target). Analyzing data on this basis supports more precise modeling, reduces processing time, improves learning accuracy, and produces more reliable experimental results.

Literature Review

The paper "From Big Data to Deep Data to Support People Analytics for Employee Attrition Prediction" analyzes the causes of employee attrition in a deep-data-driven way. It builds a mixed-method attrition model that summarizes the data, turns it into actionable information, extracts the factors that influence attrition, and uses the learned patterns to predict other employees' probability of leaving, closing with policy recommendations for employee retention.

For the human-resources data, the paper applies Recursive Feature Elimination (RFE) and the SelectKBest algorithm for feature selection. Both methods pick out the attributes that most strongly influence attrition; those attributes are then analyzed so that the uncertainty of the attrition problem shifts from purely quantitative figures to a qualitative description of each employee's likelihood of leaving.
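The two selection methods named in the paper are available in scikit-learn. A minimal sketch is shown below; the stand-in data and the choice of keeping 10 features are illustrative assumptions, not values taken from the paper:

```python
# Sketch of the paper's two feature-selection methods with scikit-learn.
# Stand-in data; in practice X and y would be the encoded HR features
# and the Attrition labels, and k would be tuned.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=25, random_state=0)

# Recursive Feature Elimination: repeatedly fit the estimator and drop
# the weakest feature until the requested number remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X, y)
rfe_mask = rfe.support_          # boolean mask of the kept features

# SelectKBest: keep the k features with the highest univariate
# ANOVA F-score against the target
skb = SelectKBest(score_func=f_classif, k=10)
skb.fit(X, y)
skb_mask = skb.get_support()

print(rfe_mask.sum(), skb_mask.sum())  # both keep 10 features
```

The two masks need not agree: RFE scores features jointly through the estimator, while SelectKBest scores each feature against the target in isolation.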

This article analyzes the IBM HR Analytics Employee Attrition & Performance dataset, using logistic regression and SVM for prediction. The data is first cleaned by removing uninformative attributes and duplicate rows. To keep the details simple, no feature evaluation is performed as in the paper; all remaining attributes are used as features. The data is then encoded and the models are trained to obtain results.

Importing the Data

import numpy as np
import pandas as pd
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
df

Data Cleaning

df.nunique().nsmallest(10)  # check for uninformative (constant) columns
# EmployeeCount, Over18 and StandardHours each hold a single value
# drop the constant columns
df.drop(['StandardHours', 'EmployeeCount', 'Over18'], axis=1, inplace=True)
df.isnull().sum()  # check for missing values
df[df.duplicated()]  # check for duplicate rows
corrmat = df.corr()  # correlation matrix of the numeric columns
# plot the correlation coefficients as a heatmap
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(corrmat, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 11}, cmap="YlGnBu", 
            linewidths=0.5, linecolor='blue')

# the full heatmap is too large, so split it into blocks
h1 = corrmat.loc['Age':'NumCompaniesWorked', 'Age':'NumCompaniesWorked']
h2 = corrmat.loc['PercentSalaryHike':, 'Age':'NumCompaniesWorked']
h3 = corrmat.loc['PercentSalaryHike':, 'PercentSalaryHike':]

sns.set(rc = {'figure.figsize':(15,8)})
sns.heatmap(h1, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 11}, cmap="YlGnBu", 
            linewidths=0.5, linecolor='blue')
# MonthlyIncome is strongly correlated with JobLevel
sns.heatmap(h2, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 11}, cmap="YlGnBu", 
            linewidths=0.5, linecolor='blue')
# TotalWorkingYears is strongly correlated with JobLevel
# TotalWorkingYears is strongly correlated with MonthlyIncome
sns.heatmap(h3, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 11}, cmap="YlGnBu", 
            linewidths=0.5, linecolor='blue')
# PercentSalaryHike is strongly correlated with PerformanceRating
# YearsInCurrentRole is strongly correlated with YearsAtCompany
# YearsWithCurrManager is strongly correlated with YearsAtCompany

# strongly correlated variables carry largely redundant information,
# so drop one of each pair to reduce the number of variables
df_final = df.drop(['JobLevel','TotalWorkingYears','YearsInCurrentRole', 'YearsWithCurrManager' , 'PercentSalaryHike'], axis=1)
df_final
df_final.columns
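The manual drop above can also be automated. A minimal sketch that prunes one column from every pair whose absolute correlation exceeds a threshold; the 0.7 cutoff and the toy frame are illustrative assumptions, not values from the original analysis:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.7):
    """Drop one column from each pair of numeric columns whose
    absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr(numeric_only=True).abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# toy example: b is an exact multiple of a, so b is dropped
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
print(drop_correlated(toy).columns.tolist())  # ['a', 'c']
```

A threshold rule is blunt (it does not know which column of a pair is more predictive), so inspecting the heatmap as done above remains a reasonable complement.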
# data preprocessing: label-encode the categorical columns
from sklearn import preprocessing
def preprocessor(df):
    res_df = df.copy()
    le = preprocessing.LabelEncoder()
    cat_cols = ['BusinessTravel', 'Department', 'Education', 'EducationField',
                'JobRole', 'Gender', 'MaritalStatus', 'OverTime', 'Attrition']
    for col in cat_cols:
        res_df[col] = le.fit_transform(res_df[col])
    return res_df
encoded_df = preprocessor(df_final)
encoded_df
# feature extraction: separate the features from the target
X = encoded_df.drop(['Attrition'],axis =1)
y = encoded_df['Attrition']
y
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)
mean = np.mean(X, axis=0)
print('mean')
print(mean)
standard_deviation = np.std(X, axis=0)
print('standard deviation')
print(standard_deviation)
# split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
# performance analysis
# confusion matrix
from sklearn import metrics

y_pred=logreg.predict(X_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
class_names=[0,1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
# accuracy, precision and recall
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
# ROC curve
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc_score = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr, tpr, label="data 1, auc=" + str(auc_score))
plt.legend(loc=4)
plt.show()
# SVM
from sklearn.svm import SVC
from sklearn import metrics
# default parameters (RBF kernel)
svc=SVC()
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))
# Linear Kernel
svc=SVC(kernel='linear')
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))
# Polynomial Kernel
svc=SVC(kernel='poly')
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))
# Radial Kernel
svc=SVC(kernel='rbf')
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))
# hyper-parameter tuning with GridSearchCV
svm_model = SVC()
# a Python dict literal silently keeps only the last value for a repeated
# key, so each candidate grid must be its own dict in a list
tuned_parameters = [
    {'C': np.arange(2, 3, 0.1), 'kernel': ['linear', 'poly', 'rbf']},
    {'C': np.arange(2, 3, 0.1), 'gamma': [0.01, 0.02, 0.03, 0.04, 0.05],
     'kernel': ['linear', 'poly', 'rbf']},
    {'C': np.arange(2, 3, 0.1), 'degree': [2, 3, 4], 'gamma': [0.01, 0.1, 1],
     'kernel': ['poly', 'linear', 'rbf']},
]
from sklearn.model_selection import GridSearchCV
model_svm = GridSearchCV(svm_model, tuned_parameters,cv=10,scoring='accuracy')
model_svm.fit(X_train, y_train)
print(model_svm.best_score_)
print(model_svm.best_params_)
# confusion matrix
y_pred=model_svm.predict(X_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
class_names=[0,1]  # names of the classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
# accuracy, precision and recall
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
Next steps: study and practice other methods.
Source: https://www.mshxw.com/it/389009.html