Python
Before training a model, we usually need to apply sensible feature engineering based on what each variable looks like. After reading the literature and experimenting on my own, I have put together some notes on feature engineering for multi-class categorical variables.
Data source (the Adult dataset): https://archive.ics.uci.edu/ml/datasets/Adult
You can also download the version I have already tidied up:
Link: https://pan.baidu.com/s/1UhGTfvZqPHUC6jnukfTcRg
Extraction code: j4C9
First, let's take a look at the basics of the dataset.
import pandas as pd
import numpy as np
file = 'C:/Varian/Data_of_training_model/adult/train.csv'
data = pd.read_csv(file, sep=',')
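# First, use the function from the previous post to split the columns into continuous and categorical variables
def classify(dataframe):
    continuous_variables = []
    categorical_variables = []
    for i in dataframe.columns:
        # Check the function argument rather than the global `data`
        if dataframe[i].dtype == object:
            categorical_variables.append(i)
        else:
            continuous_variables.append(i)
    return continuous_variables, categorical_variables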
continuous_variables, categorical_variables = classify(data)
print(continuous_variables)
'''
['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
'''
print(categorical_variables)
'''
['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'income']
'''
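As a cross-check, pandas can produce the same split with `DataFrame.select_dtypes`. The following is only a minimal sketch on a small made-up frame (`toy` and its values are assumptions for illustration, not the Adult data):

```python
import pandas as pd

# Hypothetical toy frame standing in for the Adult data
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'hours-per-week': [40, 13, 40],
    'race': [' White', ' White', ' Black'],
    'sex': [' Male', ' Male', ' Female'],
})

# Numeric columns are treated as continuous, everything else as categorical
continuous = toy.select_dtypes(include='number').columns.tolist()
categorical = toy.select_dtypes(exclude='number').columns.tolist()

print(continuous)
print(categorical)
```

Selecting by "number" versus "not number" avoids comparing each dtype to `object` by hand, which can matter if string columns are stored with a dedicated string dtype.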
As you can see, this dataset contains a large number of categorical variables.
# Next, check how many categories each categorical variable has
for i in categorical_variables:
    print('variable_name: {}\n{}\ncategory_number: {}\n'.format(i, data[i].value_counts(), len(data[i].value_counts())))
variable_name: workclass
Private             33906
Self-emp-not-inc     3862
Local-gov            3136
?                    2799
State-gov            1981
Self-emp-inc         1695
Federal-gov          1432
Without-pay            21
Never-worked           10
Name: workclass, dtype: int64
category_number: 9

variable_name: education
HS-grad         15784
Some-college    10878
Bachelors        8025
Masters          2657
Assoc-voc        2061
11th             1812
Assoc-acdm       1601
10th             1389
7th-8th           955
Prof-school       834
9th               756
12th              657
Doctorate         594
5th-6th           509
1st-4th           247
Preschool          83
Name: education, dtype: int64
category_number: 16

variable_name: marital-status
Married-civ-spouse       22379
Never-married            16117
Divorced                  6633
Separated                 1530
Widowed                   1518
Married-spouse-absent      628
Married-AF-spouse           37
Name: marital-status, dtype: int64
category_number: 7

variable_name: occupation
Prof-specialty       6172
Craft-repair         6112
Exec-managerial      6086
Adm-clerical         5611
Sales                5504
Other-service        4923
Machine-op-inspct    3022
?                    2809
Transport-moving     2355
Handlers-cleaners    2072
Farming-fishing      1490
Tech-support         1446
Protective-serv       983
Priv-house-serv       242
Armed-Forces           15
Name: occupation, dtype: int64
category_number: 15

variable_name: relationship
Husband           19716
Not-in-family     12583
Own-child          7581
Unmarried          5125
Wife               2331
Other-relative     1506
Name: relationship, dtype: int64
category_number: 6

variable_name: race
White                 41762
Black                  4685
Asian-Pac-Islander     1519
Amer-Indian-Eskimo      470
Other                   406
Name: race, dtype: int64
category_number: 5

variable_name: sex
Male      32650
Female    16192
Name: sex, dtype: int64
category_number: 2

variable_name: native-country
United-States                 43832
Mexico                          951
?                               857
Philippines                     295
Germany                         206
Puerto-Rico                     184
Canada                          182
El-Salvador                     155
India                           151
Cuba                            138
England                         127
China                           122
South                           115
Jamaica                         106
Italy                           105
Dominican-Republic              103
Japan                            92
Guatemala                        88
Poland                           87
Vietnam                          86
Columbia                         85
Haiti                            75
Portugal                         67
Taiwan                           65
Iran                             59
Nicaragua                        49
Greece                           49
Peru                             46
Ecuador                          45
France                           38
Ireland                          37
Hong                             30
Thailand                         30
Cambodia                         28
Trinadad&Tobago                  27
Outlying-US(Guam-USVI-etc)       23
Yugoslavia                       23
Laos                             23
Scotland                         21
Honduras                         20
Hungary                          19
Holand-Netherlands                1
Name: native-country, dtype: int64
category_number: 42

variable_name: income
<=50K     24720
<=50K.    12435
>50K       7841
>50K.      3846
Name: income, dtype: int64
category_number: 4
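If only the number of categories is needed, without the full frequency tables, `DataFrame.nunique()` gives a compact overview. A small sketch, assuming a made-up frame `toy` rather than the real data:

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({
    'sex': [' Male', ' Female', ' Male', ' Male'],
    'income': [' <=50K', ' <=50K.', ' >50K', ' >50K.'],
})

# Number of distinct categories per column, largest first
category_counts = toy.nunique().sort_values(ascending=False)
print(category_counts)
```

Sorting the counts puts the high-cardinality variables, which are the ones most likely to need merging, at the top.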
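Looking at the result, wow: most of the categorical variables have a lot of categories. Also, because I merged the training and validation sets, the target variable income has four categories instead of the expected two. This can be cleaned up with DataFrame.replace(to_replace=None, value=None, ...).
Modifying the values in a column
# Assign the result back instead of calling replace(..., inplace=True) on a selected column, which recent pandas versions may not propagate to data
data['income'] = data['income'].replace(' >50K.', ' >50K')
data['income'] = data['income'].replace(' <=50K.', ' <=50K')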
print(data['income'].value_counts())
'''
<=50K 37155
>50K 11687
Name: income, dtype: int64
'''
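Both substitutions can also be written as a single call by passing a dictionary to `Series.replace`. This is just a sketch on a hand-made Series (`income` here is illustrative, not the real column):

```python
import pandas as pd

# Hypothetical sample of the four raw income labels
income = pd.Series([' <=50K', ' <=50K.', ' >50K', ' >50K.', ' <=50K'])

# Each key is an old label and each value its replacement;
# labels that are not keys of the dict are left unchanged
income_clean = income.replace({' >50K.': ' >50K', ' <=50K.': ' <=50K'})

print(income_clean.value_counts())
```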
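Merging categories with a dictionary
The variable education also has many categories, but some of them can reasonably be grouped together. Series.map(arg, na_action=None) combined with a dictionary does exactly that.
Original category -> New category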
Preschool -> Dropout
10th -> Dropout
11th -> Dropout
12th -> Dropout
1st-4th -> Dropout
5th-6th -> Dropout
7th-8th -> Dropout
9th -> Dropout
HS-grad -> HighGrad
Some-college -> Community
Assoc-acdm -> Community
Assoc-voc -> Community
Bachelors -> Bachelors
Masters -> Masters
Prof-school -> Masters
Doctorate -> PhD
data['education'] = data['education'].map({
    ' Preschool': ' Dropout',
    ' 10th': ' Dropout',
    ' 11th': ' Dropout',
    ' 12th': ' Dropout',
    ' 1st-4th': ' Dropout',
    ' 5th-6th': ' Dropout',
    ' 7th-8th': ' Dropout',
    ' 9th': ' Dropout',
    ' HS-goad': ' HighGrad',  # HS-grad is deliberately misspelled as HS-goad to produce NaN for the demonstration later on
    ' Some-college': ' Community',
    ' Assoc-acdm': ' Community',
    ' Assoc-voc': ' Community',
    ' Bachelors': ' Bachelors',
    ' Masters': ' Masters',
    ' Prof-school': ' Masters',
    ' Doctorate': ' PhD'})
print(data['education'].value_counts())
'''
Community 14540
Bachelors 8025
Dropout 6408
Masters 3491
PhD 594
Name: education, dtype: int64
'''
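Any category that is missing from the dictionary becomes NaN after `map`, and `value_counts()` leaves NaN out by default, which is why no HighGrad row shows up in the counts. The sketch below, on a made-up Series, shows one way to spot uncovered categories and, optionally, fall back to the original label (the names `mapping`, `unmapped` and `mapped_keep` are illustrative):

```python
import pandas as pd

# Hypothetical sample of raw education labels
education = pd.Series([' Bachelors', ' HS-grad', ' 9th', ' Masters'])

# Deliberately incomplete mapping: ' HS-grad' is not covered
mapping = {' Bachelors': ' Bachelors', ' 9th': ' Dropout', ' Masters': ' Masters'}

mapped = education.map(mapping)

# Original labels whose mapped value is NaN, i.e. not covered by the dict
unmapped = education[mapped.isna()].unique().tolist()
print(unmapped)

# Optional: keep the original label wherever the dict had no entry
mapped_keep = mapped.fillna(education)
print(mapped_keep.tolist())
```

Checking `unmapped` right after the mapping is a cheap guard against typos in the dictionary keys.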
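Converting to dummy variables
When converting to dummy variables, the thing to watch is whether multicollinearity matters. If you plan to fit a regression model, it does: besides the conversion itself you also need to drop one of the generated dummy columns (pass drop_first=True). Otherwise, a plain conversion is enough and no extra step is needed.
# For a regression model
# (the dummies go into a new variable: get_dummies returns several columns, which cannot be assigned to the single column data['education'])
education_dummies = pd.get_dummies(data['education'], prefix='education', drop_first=True)
print(education_dummies.head(10))
'''
# Compared with the full output further down, the column for the category Bachelors has been dropped
   education_ Community  education_ Dropout  education_ Masters  education_ PhD
0                     0                   0                   0               0
1                     0                   0                   0               0
2                     0                   0                   0               0
3                     0                   1                   0               0
4                     0                   0                   0               0
5                     0                   0                   1               0
6                     0                   1                   0               0
7                     0                   0                   0               0
8                     0                   0                   1               0
9                     0                   0                   0               0
'''
# For other models
education_dummies = pd.get_dummies(data['education'], prefix='education')
print(education_dummies.head(10))
'''
   education_ Bachelors  education_ Community  education_ Dropout  education_ Masters  education_ PhD
0                     1                     0                   0                   0               0
1                     1                     0                   0                   0               0
2                     0                     0                   0                   0               0
3                     0                     0                   1                   0               0
4                     1                     0                   0                   0               0
5                     0                     0                   0                   1               0
6                     0                     0                   1                   0               0
7                     0                     0                   0                   0               0
8                     0                     0                   0                   1               0
9                     1                     0                   0                   0               0
'''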
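To actually use the dummy columns in a model, they usually have to be joined back onto the rest of the data. A minimal sketch of that step, assuming a small made-up frame `toy` (labels without leading spaces, for brevity):

```python
import pandas as pd

# Hypothetical toy frame with one categorical column
toy = pd.DataFrame({
    'age': [39, 53, 37],
    'education': ['Bachelors', 'Dropout', 'Masters'],
})

# One column per category, minus the first one in sorted order ('Bachelors')
dummies = pd.get_dummies(toy['education'], prefix='education', drop_first=True)

# Replace the original column with its dummy columns
toy_encoded = pd.concat([toy.drop(columns='education'), dummies], axis=1)
print(toy_encoded.columns.tolist())
```

The dropped category acts as the reference level: a row with zeros in every remaining dummy column belongs to it.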
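Generating a new column from the values of several columns using logical rules
This refers to logic that needs several levels of decisions, the kind that would normally call for a dedicated (possibly recursive) function. Simple logic (anything a single if/else can handle) can be done with np.where or np.select. Complex logic can also be written with np.select, but it requires you to spell out every case in full, which makes the code long and ugly. As for np.where, calls can be nested to express several levels of simple logic, but I ran into errors as soon as a condition contained or (|) or and (&) (an example is given later).
So how do we handle complex, multi-level logic? At the time I was short on time and could not look things up (that is, copy code), and I was also asked to avoid writing new functions where possible, so I had to start from first principles: no matter how complicated the logic is, generating a new column always comes down to producing its elements one by one, row by row.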
# Pick three columns for the demonstration
selected_col = ['age', 'race', 'education']
test_data = data[selected_col]
# Only look at the first ten rows
test_data = test_data.head(10)
test_data
'''
   age    race   education
0   39   White   Bachelors
1   50   White   Bachelors
2   38   White         NaN
3   53   Black     Dropout
4   28   Black   Bachelors
5   37   White     Masters
6   49   Black     Dropout
7   52   White         NaN
8   31   White     Masters
9   42   White   Bachelors
'''
'''
Generate the new column test_col with the following logic:
1. When age > 37:
   if race = White: test_col = Out if education = NaN, otherwise test_col = Good;
   if race = Black: test_col = Out if education = Dropout or NaN, otherwise test_col = Good.
2. When age <= 37:
   if education = Masters, test_col = Best;
   if education = Bachelors, test_col = Impressive.
# The rules are arbitrary and purely for demonstration; no racial prejudice is intended
'''
# Start with an empty list that will hold the elements of test_col
test_col = []
# Generating a new column from several columns is, at heart, a row-by-row logical test
for i in range(len(test_data)):
    if test_data.iloc[i]['age'] > 37:  # older than 37
        if test_data.iloc[i]['race'] == ' White':  # and White
            if pd.isnull(test_data.iloc[i]['education']):  # education is missing
                test_col.append('Out')
            else:  # every other case
                test_col.append('Good')
        else:  # otherwise, i.e. Black in this sample
            if pd.isnull(test_data.iloc[i]['education']) or test_data.iloc[i]['education'] == ' Dropout':  # education is missing or Dropout
                test_col.append('Out')
            else:  # anything other than missing or Dropout
                test_col.append('Good')
    else:  # otherwise, i.e. age <= 37
        if test_data.iloc[i]['education'] == ' Masters':  # education is Masters
            test_col.append('Best')
        else:  # anything other than Masters
            test_col.append('Impressive')
# Convert the list to a DataFrame (note the capital F in DataFrame)
test_col = pd.DataFrame(test_col, columns=['test_col'])
# Concatenate the two DataFrames column-wise
test_data = pd.concat([test_data, test_col], axis=1)
print(test_data)
'''
   age    race   education    test_col
0   39   White   Bachelors        Good
1   50   White   Bachelors        Good
2   38   White         NaN         Out
3   53   Black     Dropout         Out
4   28   Black   Bachelors  Impressive
5   37   White     Masters        Best
6   49   Black     Dropout         Out
7   52   White         NaN         Out
8   31   White     Masters        Best
9   42   White   Bachelors        Good
'''
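To summarise:
- Compute the length of the DataFrame and loop over it with the index i
- Use DataFrame.iloc[i]['colname'] to get the element in row i, column colname
- Use nested if/else blocks for the logic
Note: the generated column must be the same length as the original data; otherwise the rows will not line up (pd.concat fills the gap with NaN, and direct column assignment raises an error).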
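Where writing a helper function is acceptable, the same row-by-row logic can be handed to `DataFrame.apply(axis=1)` instead of managing the index and list by hand. This is only a sketch: `label_row` is a hypothetical helper, and the ten demo rows are retyped by hand so the block runs on its own.

```python
import numpy as np
import pandas as pd

# The ten demo rows, retyped by hand
test_data = pd.DataFrame({
    'age': [39, 50, 38, 53, 28, 37, 49, 52, 31, 42],
    'race': [' White', ' White', ' White', ' Black', ' Black',
             ' White', ' Black', ' White', ' White', ' White'],
    'education': [' Bachelors', ' Bachelors', np.nan, ' Dropout', ' Bachelors',
                  ' Masters', ' Dropout', np.nan, ' Masters', ' Bachelors'],
})

def label_row(row):
    """Hypothetical helper mirroring the nested if/else of the loop."""
    if row['age'] > 37:
        if row['race'] == ' White':
            return 'Out' if pd.isnull(row['education']) else 'Good'
        # Any other race: missing education or Dropout counts as Out
        if pd.isnull(row['education']) or row['education'] == ' Dropout':
            return 'Out'
        return 'Good'
    return 'Best' if row['education'] == ' Masters' else 'Impressive'

# apply(axis=1) calls label_row once per row and returns an aligned Series
test_data['test_col'] = test_data.apply(label_row, axis=1)
print(test_data)
```

Because `apply` returns a Series that shares the frame's index, the length and alignment concerns of the manual loop do not arise.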
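Now let's see how the same thing can be done with np.where(condition, yes, no)
selected_col = ['age', 'race', 'education']
test_data = data[selected_col]
# .copy() so that adding a column below does not trigger SettingWithCopyWarning
test_data = test_data.head(10).copy()
test_data['test_col'] = np.where(
    test_data['age'] > 37,
    np.where(test_data['race'] == ' White',
             np.where(pd.isnull(test_data['education']), 'Out', 'Good'),
             # Note: I left out the `or` check for null here, yet the result happens to match the loop version,
             # because no non-White row over 37 in this sample has a missing education
             np.where(test_data['education'] == ' Dropout', 'Out', 'Good')),
    np.where(test_data['education'] == ' Masters', 'Best',
             np.where(test_data['education'] == ' Bachelors', 'Impressive', 'else')))
# Concise, but not very readable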
print(test_data)
'''
age race education test_col
0 39 White Bachelors Good
1 50 White Bachelors Good
2 38 White NaN Out
3 53 Black Dropout Out
4 28 Black Bachelors Impressive
5 37 White Masters Best
6 49 Black Dropout Out
7 52 White NaN Out
8 31 White Masters Best
9 42 White Bachelors Good
'''
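For comparison, here is a sketch of the same rules written with `np.select`. Naming the boolean masks first keeps the list of cases fairly short; the mask names and the hand-typed copy of the ten demo rows are assumptions made so the block is self-contained.

```python
import numpy as np
import pandas as pd

# The ten demo rows, retyped by hand
test_data = pd.DataFrame({
    'age': [39, 50, 38, 53, 28, 37, 49, 52, 31, 42],
    'race': [' White', ' White', ' White', ' Black', ' Black',
             ' White', ' Black', ' White', ' White', ' White'],
    'education': [' Bachelors', ' Bachelors', np.nan, ' Dropout', ' Bachelors',
                  ' Masters', ' Dropout', np.nan, ' Masters', ' Bachelors'],
})

# Named boolean masks, one per elementary condition
older = test_data['age'] > 37
white = test_data['race'] == ' White'
edu_missing = test_data['education'].isna()
dropout = test_data['education'] == ' Dropout'
masters = test_data['education'] == ' Masters'

# Conditions are checked in order; the first one that is True wins
conditions = [
    older & white & edu_missing,               # Out
    older & ~white & (edu_missing | dropout),  # Out
    older,                                     # remaining rows over 37: Good
    masters,                                   # 37 or younger with Masters: Best
]
choices = ['Out', 'Out', 'Good', 'Best']

# Rows matching no condition fall through to the default
test_data['test_col'] = np.select(conditions, choices, default='Impressive')
print(test_data)
```

Since earlier conditions take priority, the catch-all `older` entry only applies to rows that the two more specific rules did not claim.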
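Next, let's test whether it supports expressions containing or (|) and and (&):
selected_col = ['age', 'race', 'education']
test_data = data[selected_col]
test_data = test_data.head(10)
print(test_data)
'''
   age    race   education
0   39   White   Bachelors
1   50   White   Bachelors
2   38   White         NaN
3   53   Black     Dropout
4   28   Black   Bachelors
5   37   White     Masters
6   49   Black     Dropout
7   52   White         NaN
8   31   White     Masters
9   42   White   Bachelors
'''
test_data['new_col'] = np.where(test_data['race']==' White'| pd.isnull(test_data['education']), 'Yes', 'No')
# Error message
# TypeError: ufunc 'bitwise_or' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
# ----------
test_data['new_col'] = np.where(test_data['race']==' White'& pd.isnull(test_data['education']), 'Yes', 'No')
# Error message
# TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
The errors come from operator precedence rather than from np.where itself: | and & bind more tightly than ==, so Python first tries to evaluate ' White' | pd.isnull(...). Wrapping each comparison in parentheses fixes it:
result = np.where((test_data['race'] == ' White') | pd.isnull(test_data['education']), 'Yes', 'No')
print(result)
'''
['Yes' 'Yes' 'Yes' 'No' 'No' 'Yes' 'No' 'Yes' 'Yes' 'Yes']
'''
So the conclusion is: np.where does handle or and and, but only when each comparison is parenthesised and the pieces are combined with | or &.
A row-by-row loop, on the other hand, can use the plain or and and keywords directly: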
# The list must exist before the loop appends to it
new_col = []
for i in range(len(test_data)):
    if test_data.iloc[i]['race'] == ' White' or pd.isnull(test_data.iloc[i]['education']):
        new_col.append('Yes')
    else:
        new_col.append('No')
new_col = pd.DataFrame(new_col, columns=['new_col'])
test_data = pd.concat([test_data, new_col], axis=1)
print(test_data)
'''
age race education new_col
0 39 White Bachelors Yes
1 50 White Bachelors Yes
2 38 White NaN Yes
3 53 Black Dropout No
4 28 Black Bachelors No
5 37 White Masters Yes
6 49 Black Dropout No
7 52 White NaN Yes
8 31 White Masters Yes
9 42 White Bachelors Yes
'''
P.S. If only Python had a function that behaved exactly like R's ifelse (╯▔皿▔)╯
R
Doing the same things in R with the dplyr package is a lot simpler.
select_if(data, function) picks out the columns for which function returns TRUE
library(dplyr)
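# stringsAsFactors = TRUE is needed on R >= 4.0, where read.csv no longer converts strings to factors by default
data = read.csv('C:/Varian/Data_of_training_model/adult/train.csv', sep = ',', header = TRUE, stringsAsFactors = TRUE)
# Continuous variables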
continuous = select_if(data, is.numeric)
colnames(continuous)
[1] "age" "fnlwgt" "education.num" "capital.gain"
[5] "capital.loss" "hours.per.week"
# Categorical variables
categorical = select_if(data, is.factor)
colnames(categorical)
[1] "workclass" "education" "marital.status" "occupation"
[5] "relationship" "race" "sex" "native.country"
[9] "income"
Modifying the values in a column
Method 1:
Use mutate() to modify a column, with case_when() handling each of the column's category values
data = data %>%
mutate(income = case_when(
income == ' >50K.'|income == ' >50K' ~ '>50k',
income == ' <=50K.'|income == ' <=50K' ~ '<=50k'
))
table(data$income)
<=50k >50k
37155 11687
Method 2:
Use ifelse(condition, yes, no) (returns yes where condition holds and no otherwise)
data = read.csv('C:/Varian/Data_of_training_model/adult/train.csv', sep = ',', header = TRUE)
data$income = ifelse(data$income==' >50K.'|data$income==' >50K', '>50k',
                     ifelse(data$income==' <=50K.'|data$income==' <=50K', '<=50k', 'else'))
table(data$income)
<=50k >50k
37155 11687
Merging several categories into one
Original category -> New category
Preschool -> Dropout
10th -> Dropout
11th -> Dropout
12th -> Dropout
1st-4th -> Dropout
5th-6th -> Dropout
7th-8th -> Dropout
9th -> Dropout
HS-grad -> HighGrad
Some-college -> Community
Assoc-acdm -> Community
Assoc-voc -> Community
Bachelors -> Bachelors
Masters -> Masters
Prof-school -> Masters
Doctorate -> PhD
Method 1:
data = data%>%
mutate(education = case_when(
education == ' Masters'|education == ' Prof-school' ~ 'Master',
education == ' Bachelors' ~ 'Bachelors',
education == ' Assoc-voc'|education == ' Assoc-acdm' |education == ' Some-college' ~ 'Community',
education == ' HS-grad' ~ 'HighGrad',
education == ' Preschool'|education == ' 10th'|education ==' 11th'|education == ' 12th'|education == ' 1st-4th'|education ==' 5th-6th'|education == ' 7th-8th'|education == ' 9th' ~ 'dropout',
education == ' Doctorate' ~ 'PHD'
))
table(data$education)
Bachelors Community dropout HighGrad Master PHD
8025 14540 6408 15784 3491 594
Method 2:
Use ifelse(condition, yes, no) (returns yes where condition holds and no otherwise)
data <- data %>%
  mutate(education = factor(
    ifelse(education == " Preschool" | education == " 10th" | education == " 11th" | education == " 12th" |
           education == " 1st-4th" | education == " 5th-6th" | education == " 7th-8th" | education == " 9th", " dropout",
    ifelse(education == " HS-grad", " HighGrad",
    ifelse(education == " Some-college" | education == " Assoc-acdm" | education == " Assoc-voc", "Community",
    ifelse(education == " Bachelors", "Bachelors",
    ifelse(education == " Masters" | education == " Prof-school", "Master", "PhD")))))))
table(data$education)
dropout HighGrad Bachelors Community Master PhD
6408 15784 8025 14540 3491 594
Generating a new column from the values of several columns using logical rules
Once again, ifelse(condition, yes, no) does the job
ifelse is the best, seriously!!! What kind of junk is np.where! (╯▔皿▔)╯
data = read.csv('C:/Varian/Data_of_training_model/adult/train.csv', sep = ',', header = TRUE)
data = data%>%
mutate(education = case_when(
education == ' Masters'|education == ' Prof-school' ~ ' Masters',
education == ' Bachelors' ~ ' Bachelors',
education == ' Assoc-voc'|education == ' Assoc-acdm' |education == ' Some-college' ~ 'Community',
education == ' HS-goad' ~ 'HighGrad', # As in the Python part, deliberately misspelled to produce NA for the demonstration below
education == ' Preschool'|education == ' 10th'|education ==' 11th'|education == ' 12th'|education == ' 1st-4th'|education ==' 5th-6th'|education == ' 7th-8th'|education == ' 9th' ~ ' Dropout',
education == ' Doctorate' ~ 'PHD'
))
test_data = subset(data, select = c('age', 'race', 'education'))
test_data = test_data[1:10,]
test_data
test_data$new_col = ifelse(test_data$age>37,
ifelse(test_data$race==' White',
ifelse(is.na(test_data$education), 'Out', 'Good'),
ifelse(test_data$education==' Dropout'|is.na(test_data$education), 'Out', 'Good')),
ifelse(test_data$education==' Masters', 'Best', 'Impressive'))
print(test_data)
    X age   race  education    new_col
1   1  39  White  Bachelors       Good
2   2  50  White  Bachelors       Good
3   3  38  White       <NA>        Out
4   4  53  Black    Dropout        Out
5   5  28  Black  Bachelors Impressive
6   6  37  White    Masters       Best
7   7  49  Black    Dropout        Out
8   8  52  White       <NA>        Out
9   9  31  White    Masters       Best
10 10  42  White  Bachelors       Good
# Same result as the Python part! Perfect!
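Converting to dummy variables
In logistic regression, glm(..., family = 'binomial') automatically converts every variable of type factor into dummy variables and removes the collinearity (by dropping the first level) before fitting. I have not yet looked closely at other regression models.
Alternatively, the dummy function from the dummies package can generate dummy variables: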
library(dummies)
test = c(1,3,3,1,1,1,2)
dummy(test, sep = '_')
test_1 test_2 test_3
[1,] 1 0 0
[2,] 0 0 1
[3,] 0 0 1
[4,] 1 0 0
[5,] 1 0 0
[6,] 1 0 0
[7,] 0 1 0



