[R & Python] Feature Engineering for Multi-Category Discrete Variables

Before training a model, we usually need to apply feature engineering suited to each variable. After reading the literature and experimenting on my own, I have put together some notes on feature engineering for multi-category variables.
Data source (the adult dataset): https://archive.ics.uci.edu/ml/datasets/Adult
You can also download the copy I prepared:
Link: https://pan.baidu.com/s/1UhGTfvZqPHUC6jnukfTcRg
Extraction code: j4C9

Python

First, let's look at the basic structure of the dataset.

import pandas as pd
import numpy as np
file = 'C:/Varian/Data_of_training_model/adult/train.csv'
data = pd.read_csv(file, sep=',')

# First, split the columns into continuous and categorical with the function from the previous post
def classify(dataframe):
    continuous_variables = []
    categorical_variables = []
    for col in dataframe.columns:
        # Use the function argument here, not the global `data`
        if dataframe[col].dtype == object:
            categorical_variables.append(col)
        else:
            continuous_variables.append(col)
    return continuous_variables, categorical_variables

continuous_variables, categorical_variables = classify(data)
print(continuous_variables)
'''
['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
'''
print(categorical_variables)
'''
['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'income']
'''

As you can see, this dataset contains a large number of categorical variables.
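As a side note, pandas can do the same split without a hand-written loop through `DataFrame.select_dtypes`. The snippet below is only a sketch on a made-up frame (the `toy` data and its four columns are assumptions that imitate the adult data, not the real file):

```python
import pandas as pd

# Hypothetical miniature of the adult data, used only for illustration
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'hours-per-week': [40, 13, 40],
    'workclass': [' State-gov', ' Self-emp-not-inc', ' Private'],
    'sex': [' Male', ' Male', ' Female'],
})

# Numeric columns are treated as continuous, everything else as categorical
continuous_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(continuous_cols)
print(categorical_cols)
```

Selecting by "number" rather than by `object` also keeps working if the text columns use a dedicated string dtype.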

# Next, check how many categories each categorical variable contains
for i in categorical_variables:
    print('variable_name: {} \n {} \n category_number: {} \n'.format(i, data[i].value_counts(), len(data[i].value_counts())))

 variable_name:  workclass 
  Private             33906
 Self-emp-not-inc     3862
 Local-gov            3136
 ?                    2799
 State-gov            1981
 Self-emp-inc         1695
 Federal-gov          1432
 Without-pay            21
 Never-worked           10
Name: workclass, dtype: int64 
 category_number:  9 

variable_name:  education 
  HS-grad         15784
 Some-college    10878
 Bachelors        8025
 Masters          2657
 Assoc-voc        2061
 11th             1812
 Assoc-acdm       1601
 10th             1389
 7th-8th           955
 Prof-school       834
 9th               756
 12th              657
 Doctorate         594
 5th-6th           509
 1st-4th           247
 Preschool          83
Name: education, dtype: int64 
 category_number:  16 

variable_name:  marital-status 
  Married-civ-spouse       22379
 Never-married            16117
 Divorced                  6633
 Separated                 1530
 Widowed                   1518
 Married-spouse-absent      628
 Married-AF-spouse           37
Name: marital-status, dtype: int64 
 category_number:  7 

variable_name:  occupation 
  Prof-specialty       6172
 Craft-repair         6112
 Exec-managerial      6086
 Adm-clerical         5611
 Sales                5504
 Other-service        4923
 Machine-op-inspct    3022
 ?                    2809
 Transport-moving     2355
 Handlers-cleaners    2072
 Farming-fishing      1490
 Tech-support         1446
 Protective-serv       983
 Priv-house-serv       242
 Armed-Forces           15
Name: occupation, dtype: int64 
 category_number:  15 

variable_name:  relationship 
  Husband           19716
 Not-in-family     12583
 Own-child          7581
 Unmarried          5125
 Wife               2331
 Other-relative     1506
Name: relationship, dtype: int64 
 category_number:  6 

variable_name:  race 
  White                 41762
 Black                  4685
 Asian-Pac-Islander     1519
 Amer-Indian-Eskimo      470
 Other                   406
Name: race, dtype: int64 
 category_number:  5 

variable_name:  sex 
  Male      32650
 Female    16192
Name: sex, dtype: int64 
 category_number:  2 

variable_name:  native-country 
  United-States                 43832
 Mexico                          951
 ?                               857
 Philippines                     295
 Germany                         206
 Puerto-Rico                     184
 Canada                          182
 El-Salvador                     155
 India                           151
 Cuba                            138
 England                         127
 China                           122
 South                           115
 Jamaica                         106
 Italy                           105
 Dominican-Republic              103
 Japan                            92
 Guatemala                        88
 Poland                           87
 Vietnam                          86
 Columbia                         85
 Haiti                            75
 Portugal                         67
 Taiwan                           65
 Iran                             59
 Nicaragua                        49
 Greece                           49
 Peru                             46
 Ecuador                          45
 France                           38
 Ireland                          37
 Hong                             30
 Thailand                         30
 Cambodia                         28
 Trinadad&Tobago                  27
 Outlying-US(Guam-USVI-etc)       23
 Yugoslavia                       23
 Laos                             23
 Scotland                         21
 Honduras                         20
 Hungary                          19
 Holand-Netherlands                1
Name: native-country, dtype: int64 
 category_number:  42 

variable_name:  income 
  <=50K     24720
 <=50K.    12435
 >50K       7841
 >50K.      3846
Name: income, dtype: int64 
 category_number:  4

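The output shows that most of the categorical variables have a lot of categories. In addition, because I merged the training and validation sets, the dependent variable income contains four categories instead of the expected two (two of the labels differ only by a trailing period). This can be fixed with DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad').

Modifying the values of a column

# Assign the result back instead of calling replace(..., inplace=True) on a selected column,
# which recent pandas versions may not propagate to `data`
data['income'] = data['income'].replace({' >50K.': ' >50K', ' <=50K.': ' <=50K'})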
print(data['income'].value_counts())
'''
 <=50K    37155
 >50K     11687
Name: income, dtype: int64
'''
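A more general way to clean labels like these is to use the string methods of a Series. The following is a minimal sketch on a made-up Series whose values imitate the four income labels; note that, unlike the replace call above, it also removes the leading space:

```python
import pandas as pd

# Hypothetical sample imitating the four raw income labels
income = pd.Series([' <=50K', ' <=50K.', ' >50K', ' >50K.'])

# Strip surrounding whitespace first, then any trailing period
cleaned = income.str.strip().str.rstrip('.')

print(cleaned.value_counts())
```

This avoids listing every variant by hand, at the cost of changing the label text slightly.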

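Merging categories with a dictionary

The variable education also has many categories, some of which can reasonably be grouped together. Series.map(arg, na_action=None) combined with a dictionary does exactly this:

Original category -> New category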
Preschool -> Dropout
10th -> Dropout
11th -> Dropout
12th -> Dropout
1st-4th -> Dropout
5th-6th -> Dropout
7th-8th -> Dropout
9th -> Dropout
HS-grad -> HighGrad
Some-college -> Community
Assoc-acdm -> Community
Assoc-voc -> Community
Bachelors -> Bachelors
Masters -> Masters
Prof-school -> Masters
Doctorate -> PhD

data['education'] = data['education'].map({
    ' Preschool': ' Dropout',
    ' 10th': ' Dropout',
    ' 11th': ' Dropout',
    ' 12th': ' Dropout',
    ' 1st-4th': ' Dropout',
    ' 5th-6th': ' Dropout',
    ' 7th-8th': ' Dropout',
    ' 9th': ' Dropout',
    ' HS-goad': ' HighGrad',  # HS-grad is deliberately misspelled as HS-goad so that map produces NaN for the demonstration below
    ' Some-college': ' Community',
    ' Assoc-acdm': ' Community',
    ' Assoc-voc': ' Community',
    ' Bachelors': ' Bachelors',
    ' Masters': ' Masters',
    ' Prof-school': ' Masters',
    ' Doctorate': ' PhD',
})
print(data['education'].value_counts())
'''
 Community    14540
 Bachelors     8025
 Dropout       6408
 Masters       3491
 PhD            594
Name: education, dtype: int64
'''
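Notice that HighGrad is missing from the counts: map turns every value that has no key in the dictionary into NaN. When the typo is not intentional, it helps to list the values that fell through. Here is a small sketch on made-up data (the `education` Series and the shortened `mapping` dictionary are assumptions for illustration, and the check assumes the original column has no missing values of its own):

```python
import pandas as pd

# Hypothetical sample of raw education labels
education = pd.Series([' Bachelors', ' HS-grad', ' 11th', ' HS-grad', ' Masters'])

# Shortened mapping that keeps the deliberate HS-goad typo
mapping = {' Bachelors': ' Bachelors', ' 11th': ' Dropout',
           ' Masters': ' Masters', ' HS-goad': ' HighGrad'}

mapped = education.map(mapping)

# Original labels whose mapped value is NaN, i.e. labels with no dictionary key
unmapped = sorted(education[mapped.isna()].unique())

print(unmapped)
print(mapped.isna().sum())
```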

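Converting to dummy variables

When converting to dummy variables, the main question is whether multicollinearity matters. If you plan to fit a regression model, it does: besides the conversion you also need to drop one of the generated columns (pass drop_first=True). For other models, the plain conversion is enough.

# For a regression model
# get_dummies returns one column per category, so keep the result in its own DataFrame
# rather than assigning it to the single column data['education']
education_dummies = pd.get_dummies(data['education'], prefix='education', drop_first=True, dtype=int)  # dtype=int keeps the 0/1 display on pandas >= 2.0
print(education_dummies.head(10))
'''
# Compared with the full output further down, the column for the Bachelors category has been dropped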
   education_ Community  education_ Dropout  education_ Masters  
0                     0                   0                   0   
1                     0                   0                   0   
2                     0                   0                   0   
3                     0                   1                   0   
4                     0                   0                   0   
5                     0                   0                   1   
6                     0                   1                   0   
7                     0                   0                   0   
8                     0                   0                   1   
9                     0                   0                   0   

   education_ PhD  
0               0  
1               0  
2               0  
3               0  
4               0  
5               0  
6               0  
7               0  
8               0  
9               0 
'''

# For other models
education_dummies = pd.get_dummies(data['education'], prefix='education', dtype=int)
print(education_dummies.head(10))
'''
   education_ Bachelors  education_ Community  education_ Dropout  
0                     1                     0                   0   
1                     1                     0                   0   
2                     0                     0                   0   
3                     0                     0                   1   
4                     1                     0                   0   
5                     0                     0                   0   
6                     0                     0                   1   
7                     0                     0                   0   
8                     0                     0                   0   
9                     1                     0                   0   

   education_ Masters  education_ PhD  
0                   0               0  
1                   0               0  
2                   0               0  
3                   0               0  
4                   0               0  
5                   1               0  
6                   0               0  
7                   0               0  
8                   1               0  
9                   0               0
'''
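To use the dummy columns in a model they still have to be joined back to the rest of the data. One possible way, sketched on a made-up three-row frame (the `toy` data is an assumption, not the adult file), is to drop the original column and concatenate the dummies:

```python
import pandas as pd

# Hypothetical three-row sample
toy = pd.DataFrame({'age': [39, 53, 37],
                    'education': [' Bachelors', ' Dropout', ' Masters']})

# One 0/1 column per category, minus the first one (Bachelors)
dummies = pd.get_dummies(toy['education'], prefix='education',
                         drop_first=True, dtype=int)

# Replace the original column with its dummy columns
toy_encoded = pd.concat([toy.drop(columns='education'), dummies], axis=1)

print(toy_encoded)
```

`pd.get_dummies(toy, columns=['education'], drop_first=True)` should give a similar result in a single call.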

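Creating a new column from several columns with conditional logic

This refers to logic that needs several levels of checks, the kind that usually calls for nested conditionals. Simple logic (anything a single if/else can express) can be handled with np.where or np.select. np.select can also express complex logic, but it requires you to spell out every case, which makes the code long and hard to read. np.where can be nested, so it handles several levels of simple conditions; conditions joined with | (or) and & (and) need each comparison wrapped in parentheses, otherwise they raise an error (an example appears later).

So how do we handle complex, multi-level logic? At the time I was short on time and could not look things up (~~copy code~~), and I was asked to avoid writing new functions where possible, so I went back to basics: however complex the logic is, creating a new column always comes down to producing its elements one row at a time.

# Pick three columns for the demonstration
selected_col = ['age', 'race', 'education']
test_data = data[selected_col]
# Keep only the first ten rows
test_data = test_data.head(10)
test_data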
'''
   age    race   education
0   39   White   Bachelors
1   50   White   Bachelors
2   38   White         NaN
3   53   Black     Dropout
4   28   Black   Bachelors
5   37   White     Masters
6   49   Black     Dropout
7   52   White         NaN
8   31   White     Masters
9   42   White   Bachelors
'''

'''
Create the new column test_col with the following logic:
1. If age > 37:
     if race = White:  education = NaN            -> test_col = Out
                       otherwise                  -> test_col = Good
     if race = Black:  education = Dropout or NaN -> test_col = Out
                       otherwise                  -> test_col = Good
2. If age <= 37:
     education = Masters   -> test_col = Best
     education = Bachelors -> test_col = Impressive

# The logic is arbitrary and purely for demonstration; no racial discrimination is intended
'''
# Create an empty list to hold the elements of test_col
test_col = []
# Building a column from several columns boils down to evaluating the logic row by row
for i in range(len(test_data)):
    if test_data.iloc[i]['age'] > 37:  # older than 37
        if test_data.iloc[i]['race'] == ' White':  # and White
            if pd.isnull(test_data.iloc[i]['education']):  # education is missing
                test_col.append('Out')
            else:  # every other case
                test_col.append('Good')
        else:  # otherwise (Black, in this sample)
            if pd.isnull(test_data.iloc[i]['education']) or test_data.iloc[i]['education'] == ' Dropout':  # missing or Dropout
                test_col.append('Out')
            else:  # anything except missing and Dropout
                test_col.append('Good')
    else:  # age <= 37
        if test_data.iloc[i]['education'] == ' Masters':
            test_col.append('Best')
        else:  # anything except Masters
            test_col.append('Impressive')
# Convert the list to a DataFrame
test_col = pd.DataFrame(test_col, columns=['test_col'])
# Concatenate the two DataFrames column-wise
test_data = pd.concat([test_data, test_col], axis=1)
print(test_data)
'''
   age    race   education    test_col
0   39   White   Bachelors        Good
1   50   White   Bachelors        Good
2   38   White         NaN         Out
3   53   Black     Dropout         Out
4   28   Black   Bachelors  Impressive
5   37   White     Masters        Best
6   49   Black     Dropout         Out
7   52   White         NaN         Out
8   31   White     Masters        Best
9   42   White   Bachelors        Good
'''

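To summarize:

Compute the length of the DataFrame and loop over it with index i
Use DataFrame.iloc[i]['colname'] to get the element in row i, column colname
Use nested if/else statements to apply the logic

Note: the generated column must have exactly as many elements as the original data has rows; otherwise direct assignment raises an error and pd.concat leaves NaN gaps.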
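The same row-by-row idea can be written a little more compactly by iterating over the columns with zip and assigning the finished list directly, which avoids the repeated iloc lookups and the concat step. This is only a sketch: the `sample` frame below is typed in by hand to imitate the ten rows shown above.

```python
import numpy as np
import pandas as pd

# Hand-typed copy of the ten demonstration rows
sample = pd.DataFrame({
    'age': [39, 50, 38, 53, 28, 37, 49, 52, 31, 42],
    'race': [' White', ' White', ' White', ' Black', ' Black',
             ' White', ' Black', ' White', ' White', ' White'],
    'education': [' Bachelors', ' Bachelors', np.nan, ' Dropout', ' Bachelors',
                  ' Masters', ' Dropout', np.nan, ' Masters', ' Bachelors'],
})

labels = []
for age, race, education in zip(sample['age'], sample['race'], sample['education']):
    if age > 37:
        if race == ' White':
            labels.append('Out' if pd.isnull(education) else 'Good')
        else:
            missing_or_dropout = pd.isnull(education) or education == ' Dropout'
            labels.append('Out' if missing_or_dropout else 'Good')
    else:
        labels.append('Best' if education == ' Masters' else 'Impressive')

# A list with one element per row can be assigned as a column directly
sample['test_col'] = labels
print(sample)
```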

Now let's see how to do the same thing with np.where(condition, yes, no):

selected_col = ['age', 'race', 'education']
test_data = data[selected_col]
test_data = test_data.head(10).copy()  # copy so that adding a column does not trigger SettingWithCopyWarning
test_data['test_col'] = np.where(test_data['age']>37, 
                                 np.where(test_data['race']==' White', 
                                          np.where(pd.isnull(test_data['education']), 'Out', 'Good'), 
                                          np.where(test_data['education']==' Dropout', 'Out', 'Good')), # Note: there is no `or` check for missing values here; the result is still the same because no Black row in this sample has a missing education
                                 np.where(test_data['education']==' Masters', 'Best', 
                                          np.where(test_data['education']==' Bachelors', 'Impressive', 'else')))
# Concise, but harder to read
print(test_data)
'''
   age    race   education    test_col
0   39   White   Bachelors        Good
1   50   White   Bachelors        Good
2   38   White         NaN         Out
3   53   Black     Dropout         Out
4   28   Black   Bachelors  Impressive
5   37   White     Masters        Best
6   49   Black     Dropout         Out
7   52   White         NaN         Out
8   31   White     Masters        Best
9   42   White   Bachelors        Good
'''
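For comparison, here is how the same logic might look with np.select, which takes a list of conditions and a list of matching values and uses the first condition that is true for each row. It is a sketch on a hand-typed copy of the ten rows (the `sample` frame and the helper names `older`, `white`, `missing`, `dropout` and `masters` are made up for this illustration):

```python
import numpy as np
import pandas as pd

# Hand-typed copy of the ten demonstration rows
sample = pd.DataFrame({
    'age': [39, 50, 38, 53, 28, 37, 49, 52, 31, 42],
    'race': [' White', ' White', ' White', ' Black', ' Black',
             ' White', ' Black', ' White', ' White', ' White'],
    'education': [' Bachelors', ' Bachelors', np.nan, ' Dropout', ' Bachelors',
                  ' Masters', ' Dropout', np.nan, ' Masters', ' Bachelors'],
})

older = sample['age'] > 37
white = sample['race'] == ' White'
missing = sample['education'].isna()
dropout = sample['education'] == ' Dropout'
masters = sample['education'] == ' Masters'

# Order matters: for each row the first true condition decides the value
conditions = [
    older & white & missing,        # older, White, no education recorded
    older & white,                  # remaining older White rows
    older & (missing | dropout),    # older, not White, missing or Dropout
    older,                          # remaining older rows
    masters,                        # age <= 37 with a Masters
]
choices = ['Out', 'Good', 'Out', 'Good', 'Best']

sample['test_col'] = np.select(conditions, choices, default='Impressive')
print(sample)
```

Every branch has to be listed explicitly, which is the verbosity mentioned earlier, but each rule sits on its own line.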

Next, let's try expressions that use | (or) and & (and):

selected_col = ['age', 'race', 'education']
test_data = data[selected_col]
test_data = test_data.head(10)
print(test_data)
'''
   age    race   education
0   39   White   Bachelors
1   50   White   Bachelors
2   38   White         NaN
3   53   Black     Dropout
4   28   Black   Bachelors
5   37   White     Masters
6   49   Black     Dropout
7   52   White         NaN
8   31   White     Masters
9   42   White   Bachelors
'''
test_data['new_col'] = np.where(test_data['race']==' White'| pd.isnull(test_data['education']), 'Yes', 'No')
# Error message
TypeError: ufunc 'bitwise_or' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
----------
test_data['new_col'] = np.where(test_data['race']==' White'& pd.isnull(test_data['education']), 'Yes', 'No')
# Error message
TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

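So the conclusion is that np.where fails on these | and & expressions as written. The cause is operator precedence rather than np.where itself: | and & bind more tightly than ==, so Python evaluates ' White' | pd.isnull(...) first. Wrapping each comparison in parentheses, as in (test_data['race'] == ' White') | pd.isnull(test_data['education']), avoids the error.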
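A minimal sketch of the parenthesized form, on a made-up four-row frame (the `sample` data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample
sample = pd.DataFrame({'race': [' White', ' Black', ' Black', ' White'],
                       'education': [' Bachelors', np.nan, ' Dropout', np.nan]})

# Parentheses force the comparison to run before | and &
either = np.where((sample['race'] == ' White') | sample['education'].isna(), 'Yes', 'No')
both = np.where((sample['race'] == ' White') & sample['education'].isna(), 'Yes', 'No')

print(either.tolist())
print(both.tolist())
```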

The row-by-row loop, by contrast, can use Python's plain or and and directly:

new_col = []  # collects one value per row
for i in range(len(test_data)):
    if test_data.iloc[i]['race'] == ' White' or pd.isnull(test_data.iloc[i]['education']):
        new_col.append('Yes')
    else:
        new_col.append('No')
new_col = pd.DataFrame(new_col, columns=['new_col'])
test_data = pd.concat([test_data, new_col], axis=1)
print(test_data)
'''
   age    race   education new_col
0   39   White   Bachelors     Yes
1   50   White   Bachelors     Yes
2   38   White         NaN     Yes
3   53   Black     Dropout      No
4   28   Black   Bachelors      No
5   37   White     Masters     Yes
6   49   Black     Dropout      No
7   52   White         NaN     Yes
8   31   White     Masters     Yes
9   42   White   Bachelors     Yes
'''

P.S. If only Python had a function that works exactly like R's ifelse (╯▔皿▔)╯

R

Doing the same things in R with the dplyr package is much simpler.

select_if(data, function) selects the columns for which function returns TRUE.

library(dplyr)
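data = read.csv('C:/Varian/Data_of_training_model/adult/train.csv', sep = ',', header = TRUE, stringsAsFactors = TRUE)  # since R 4.0, strings are no longer read as factors by default
# Continuous variables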
continuous = select_if(data, is.numeric)
colnames(continuous)
[1] "age"            "fnlwgt"         "education.num"  "capital.gain"  
[5] "capital.loss"   "hours.per.week"
# Categorical variables
categorical = select_if(data, is.factor)
colnames(categorical)
[1] "workclass"      "education"      "marital.status" "occupation"    
[5] "relationship"   "race"           "sex"            "native.country"
[9] "income"

Modifying the values of a column

Method 1:

Use mutate() to modify a column, with case_when() handling each of the category values.

data = data %>%
  mutate(income = case_when(
    income == ' >50K.'|income == ' >50K' ~ '>50k',
    income == ' <=50K.'|income == ' <=50K' ~ '<=50k'
  ))
 table(data$income)
<=50k  >50k 
37155 11687

Method 2:

Use ifelse(condition, yes, no), which returns yes where condition holds and no otherwise.

data = read.csv('C:/Varian/Data_of_training_model/adult/train.csv', sep = ',', header = TRUE)
data$income = ifelse(data$income==' >50K.'|data$income==' >50K', '>50k',
                     ifelse(data$income==' <=50K.'|data$income==' <=50K', '<=50k', 'else'))
table(data$income)
<=50k  >50k 
37155 11687

Merging several categories into one

Original category -> New category
Preschool -> Dropout
10th -> Dropout
11th -> Dropout
12th -> Dropout
1st-4th -> Dropout
5th-6th -> Dropout
7th-8th -> Dropout
9th -> Dropout
HS-grad -> HighGrad
Some-college -> Community
Assoc-acdm -> Community
Assoc-voc -> Community
Bachelors -> Bachelors
Masters -> Masters
Prof-school -> Masters
Doctorate -> PhD

Method 1:

data = data%>%
  mutate(education = case_when(
    education == ' Masters'|education == ' Prof-school' ~ 'Master',
    education == ' Bachelors' ~ 'Bachelors',
    education == ' Assoc-voc'|education == ' Assoc-acdm' |education == ' Some-college' ~ 'Community',
    education == ' HS-grad' ~ 'HighGrad', 
    education == ' Preschool'|education == ' 10th'|education ==' 11th'|education == ' 12th'|education == ' 1st-4th'|education ==' 5th-6th'|education == ' 7th-8th'|education == ' 9th' ~ 'dropout',
    education == ' Doctorate' ~ 'PHD'
      ))
table(data$education)
Bachelors Community   dropout  HighGrad    Master       PHD 
     8025     14540      6408     15784      3491       594

Method 2:

Use ifelse(condition, yes, no), which returns yes where condition holds and no otherwise.

data <- data %>%
  mutate(education = factor(
    ifelse(education == " Preschool" | education == " 10th" | education == " 11th" | education == " 12th" | education == " 1st-4th" | education == " 5th-6th" | education == " 7th-8th" | education == " 9th", " dropout",
    ifelse(education == " HS-grad", " HighGrad",
    ifelse(education == " Some-college" | education == " Assoc-acdm" | education == " Assoc-voc", "Community",
    ifelse(education == " Bachelors", "Bachelors",
    ifelse(education == " Masters" | education == " Prof-school", "Master", "PhD")))))))
table(data$education)
 dropout  HighGrad Bachelors Community    Master       PhD 
    6408     15784      8025     14540      3491       594

Creating a new column from several columns with conditional logic

ifelse(condition, yes, no) works here as well.

I can't praise ifelse enough!!! np.where is garbage by comparison! (╯▔皿▔)╯

data = read.csv('C:/Varian/Data_of_training_model/adult/train.csv', sep = ',', header = TRUE)
data = data%>%
  mutate(education = case_when(
    education == ' Masters'|education == ' Prof-school' ~ ' Masters',
    education == ' Bachelors' ~ ' Bachelors',
    education == ' Assoc-voc'|education == ' Assoc-acdm' |education == ' Some-college' ~ 'Community',
    education == ' HS-goad' ~ 'HighGrad',               # Deliberately misspelled, as in the Python part, to produce NA for the demonstration below
    education == ' Preschool'|education == ' 10th'|education ==' 11th'|education == ' 12th'|education == ' 1st-4th'|education ==' 5th-6th'|education == ' 7th-8th'|education == ' 9th' ~ ' Dropout',
    education == ' Doctorate' ~ 'PHD'
  ))
test_data = subset(data, select = c('age', 'race', 'education'))
test_data = test_data[1:10,]
test_data
test_data$new_col = ifelse(test_data$age>37, 
                           ifelse(test_data$race==' White',
                                  ifelse(is.na(test_data$education), 'Out', 'Good'),
                                  ifelse(test_data$education==' Dropout'|is.na(test_data$education), 'Out', 'Good')),
                           ifelse(test_data$education==' Masters', 'Best', 'Impressive'))
print(test_data)
    X age   race  education    new_col
1   1  39  White  Bachelors       Good
2   2  50  White  Bachelors       Good
3   3  38  White               Out
4   4  53  Black    Dropout        Out
5   5  28  Black  Bachelors Impressive
6   6  37  White    Masters       Best
7   7  49  Black    Dropout        Out
8   8  52  White               Out
9   9  31  White    Masters       Best
10 10  42  White  Bachelors       Good

# Same result as in the Python part. Perfect!

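Converting to dummy variables

In logistic regression with glm(..., family = 'binomial'), the model automatically converts every variable of type factor into dummy variables and removes the collinearity (by dropping the first column) before fitting. I have not yet looked into other regression models in depth.

You can also generate dummy variables with the dummy function from the dummies package: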

library(dummies)
test = c(1,3,3,1,1,1,2)
dummy(test, sep = '_')
     test_1 test_2 test_3
[1,]      1      0      0
[2,]      0      0      1
[3,]      0      0      1
[4,]      1      0      0
[5,]      1      0      0
[6,]      1      0      0
[7,]      0      1      0
