
Employee Attrition Probability Prediction

From Big Data to Deep Data: Overview

Today is the era of "data technology" and "big data". The Internet of Things supplies massive volumes of data, and this abundance is what makes the current boom in artificial intelligence possible. Big-data technology has opened up many new possibilities: relying on the enormous computing power of machines, people extract "experience" and distill "patterns" from many thousands of records, and use them to guide practice.

Against this backdrop, effectively analyzing messy, heterogeneous data is critical. Raw big data has limitations in many application scenarios: the collected data is often noisy and of questionable reliability, so results derived from it can be challenged. Data cleaning addresses this by capturing usable information, removing redundancy, and discarding invalid records, turning the raw data into high-quality, usable, analysis-ready data. Refining low-value-density big data in this way yields deep data (high-quality data focused on a specific prediction target). Analyzing data on this basis supports more precise modeling, reduces processing time, improves learning accuracy, and produces more reliable experimental results.

Literature Review

The paper "From Big Data to Deep Data to Support People Analytics for Employee Attrition Prediction" analyzes the causes of employee attrition in a deep-data-driven way. It builds a mixed-method attrition model that summarizes the data, turns it into actionable information, extracts the factors that influence attrition, and uses the learned patterns to predict other employees' probability of leaving, closing with policy recommendations for employee retention.

For the human-resources data, the paper applies Recursive Feature Elimination (RFE) and the SelectKBest algorithm for feature selection. Both methods pick out the attributes that most strongly influence attrition; those attributes are then analyzed so that the uncertainty of the attrition problem shifts from purely quantitative figures to a qualitative description of each employee's likelihood of leaving.
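The two selection methods named in the paper are available in scikit-learn. A minimal sketch is shown below; the stand-in data and the choice of keeping 10 features are illustrative assumptions, not values taken from the paper:

```python
# Sketch of the paper's two feature-selection methods with scikit-learn.
# Stand-in data; in practice X and y would be the encoded HR features
# and the Attrition labels, and k would be tuned.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=25, random_state=0)

# Recursive Feature Elimination: repeatedly fit the estimator and drop
# the weakest feature until the requested number remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X, y)
rfe_mask = rfe.support_          # boolean mask of the kept features

# SelectKBest: keep the k features with the highest univariate
# ANOVA F-score against the target
skb = SelectKBest(score_func=f_classif, k=10)
skb.fit(X, y)
skb_mask = skb.get_support()

print(rfe_mask.sum(), skb_mask.sum())  # both keep 10 features
```

The two masks need not agree: RFE scores features jointly through the estimator, while SelectKBest scores each feature against the target in isolation.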

This article analyzes the IBM HR Analytics Employee Attrition & Performance dataset, using logistic regression and SVM for prediction. The data is first cleaned by removing uninformative attributes and duplicate rows. To keep the details simple, no feature evaluation is performed as in the paper; all remaining attributes are used as features. The data is then encoded and the models are trained to obtain results.

Importing the Data

import numpy as np
import pandas as pd
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
df

Data Cleaning

df.nunique().nsmallest(10)  # check for uninformative (constant) columns
# EmployeeCount, Over18 and StandardHours each hold a single value
# drop the constant columns
df.drop(['StandardHours', 'EmployeeCount', 'Over18'], axis=1, inplace=True)
df.isnull().sum()  # check for missing values
df[df.duplicated()]  # check for duplicate rows
corrmat = df.corr()  # correlation matrix of the numeric columns
# plot the correlation coefficients as a heatmap
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(corrmat, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 11}, cmap="YlGnBu", 
            linewidths=0.5, linecolor='blue')

# the full heatmap is too large, so split it into blocks
h1 = corrmat.loc['Age':'NumCompaniesWorked', 'Age':'NumCompaniesWorked']
h2 = corrmat.loc['PercentSalaryHike':, 'Age':'NumCompaniesWorked']
h3 = corrmat.loc['PercentSalaryHike':, 'PercentSalaryHike':]

sns.set(rc = {'figure.figsize':(15,8)})
sns.heatmap(h1, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 11}, cmap="YlGnBu", 
            linewidths=0.5, linecolor='blue')
# MonthlyIncome is strongly correlated with JobLevel
sns.heatmap(h2, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 11}, cmap="YlGnBu", 
            linewidths=0.5, linecolor='blue')
# TotalWorkingYears is strongly correlated with JobLevel
# TotalWorkingYears is strongly correlated with MonthlyIncome
sns.heatmap(h3, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 11}, cmap="YlGnBu", 
            linewidths=0.5, linecolor='blue')
# PercentSalaryHike is strongly correlated with PerformanceRating
# YearsInCurrentRole is strongly correlated with YearsAtCompany
# YearsWithCurrManager is strongly correlated with YearsAtCompany

# strongly correlated variables carry largely redundant information,
# so drop one of each pair to reduce the number of variables
df_final = df.drop(['JobLevel','TotalWorkingYears','YearsInCurrentRole', 'YearsWithCurrManager' , 'PercentSalaryHike'], axis=1)
df_final
df_final.columns
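The manual drop above can also be automated. A minimal sketch that prunes one column from every pair whose absolute correlation exceeds a threshold; the 0.7 cutoff and the toy frame are illustrative assumptions, not values from the original analysis:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.7):
    """Drop one column from each pair of numeric columns whose
    absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr(numeric_only=True).abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# toy example: b is an exact multiple of a, so b is dropped
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
print(drop_correlated(toy).columns.tolist())  # ['a', 'c']
```

A threshold rule is blunt (it does not know which column of a pair is more predictive), so inspecting the heatmap as done above remains a reasonable complement.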
# data preprocessing: label-encode the categorical columns
from sklearn import preprocessing
def preprocessor(df):
    res_df = df.copy()
    le = preprocessing.LabelEncoder()
    cat_cols = ['BusinessTravel', 'Department', 'Education', 'EducationField',
                'JobRole', 'Gender', 'MaritalStatus', 'OverTime', 'Attrition']
    for col in cat_cols:
        res_df[col] = le.fit_transform(res_df[col])
    return res_df
encoded_df = preprocessor(df_final)
encoded_df
# feature extraction: separate the features from the target
X = encoded_df.drop(['Attrition'],axis =1)
y = encoded_df['Attrition']
y
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)
mean = np.mean(X, axis=0)
print('mean')
print(mean)
standard_deviation = np.std(X, axis=0)
print('standard deviation')
print(standard_deviation)
# split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
# performance analysis
# confusion matrix
from sklearn import metrics

y_pred=logreg.predict(X_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
class_names=[0,1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
# accuracy, precision and recall
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
# ROC curve
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc_score = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr, tpr, label="data 1, auc=" + str(auc_score))
plt.legend(loc=4)
plt.show()
# SVM
from sklearn.svm import SVC
from sklearn import metrics
# default parameters (RBF kernel)
svc=SVC()
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))
# Linear Kernel
svc=SVC(kernel='linear')
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))
# Polynomial Kernel
svc=SVC(kernel='poly')
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))
# Radial Kernel
svc=SVC(kernel='rbf')
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))
# hyper-parameter tuning with GridSearchCV
svm_model = SVC()
# a Python dict literal silently keeps only the last value for a repeated
# key, so each candidate grid must be its own dict in a list
tuned_parameters = [
    {'C': np.arange(2, 3, 0.1), 'kernel': ['linear', 'poly', 'rbf']},
    {'C': np.arange(2, 3, 0.1), 'gamma': [0.01, 0.02, 0.03, 0.04, 0.05],
     'kernel': ['linear', 'poly', 'rbf']},
    {'C': np.arange(2, 3, 0.1), 'degree': [2, 3, 4], 'gamma': [0.01, 0.1, 1],
     'kernel': ['poly', 'linear', 'rbf']},
]
from sklearn.model_selection import GridSearchCV
model_svm = GridSearchCV(svm_model, tuned_parameters,cv=10,scoring='accuracy')
model_svm.fit(X_train, y_train)
print(model_svm.best_score_)
print(model_svm.best_params_)
# confusion matrix
y_pred=model_svm.predict(X_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
class_names=[0,1]  # names of the classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
# accuracy, precision and recall
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
Next steps: study and practice other methods.
Source: https://www.mshxw.com/it/389009.html