数据透视表
import pandas as pd
example = pd.Dataframe({'Amount': [74., 235., 175., 100., 115., 245., 180., 90., 88., 129., 273., 300.],
'Category': ['Transportation', 'Grocery', 'Household', 'Entertainment',
'Transportation', 'Grocery', 'Household', 'Entertainment',
'Transportation', 'Grocery', 'Household', 'Entertainment'],
'Month': ['January', 'January', 'January', 'January',
'February', 'February', 'February', 'February',
'Marth', 'Marth', 'Marth', 'Marth']})
example
| Amount | Category | Month |
|---|
| 0 | 74.0 | Transportation | January |
|---|
| 1 | 235.0 | Grocery | January |
|---|
| 2 | 175.0 | Household | January |
|---|
| 3 | 100.0 | Entertainment | January |
|---|
| 4 | 115.0 | Transportation | February |
|---|
| 5 | 245.0 | Grocery | February |
|---|
| 6 | 180.0 | Household | February |
|---|
| 7 | 90.0 | Entertainment | February |
|---|
| 8 | 88.0 | Transportation | Marth |
|---|
| 9 | 129.0 | Grocery | Marth |
|---|
| 10 | 273.0 | Household | Marth |
|---|
| 11 | 300.0 | Entertainment | Marth |
|---|
上面这个数据表感觉非常的杂乱,数据透视的意思就是按照我们的意思重新组织这张数据表
new_example = example.pivot(index='Category', columns='Month', values='Amount')
index 表示以原先数据的哪个属性的值为行坐标进行分类 columns 表示以原先数据的哪个属性的值为新的属性
values 表示以原先数据的哪个属性的值填充前面两个 index 和 columns 定义出的新表格
new_example
| Month | February | January | Marth |
|---|
| Category | | | |
|---|
| Entertainment | 90.0 | 100.0 | 300.0 |
|---|
| Grocery | 245.0 | 235.0 | 129.0 |
|---|
| Household | 180.0 | 175.0 | 273.0 |
|---|
| Transportation | 115.0 | 74.0 | 88.0 |
|---|
new_example.sum(axis=1)
Category
Entertainment 490.0
Grocery 609.0
Household 628.0
Transportation 277.0
dtype: float64
new_example.sum(axis=0)
Month
February 630.0
January 584.0
Marth 790.0
dtype: float64
经过重新组织后,数据表的呈现方式更有意义
df = pd.read_csv('../../datasets/titanic/test.csv')
df.head(5)
| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
|---|
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
|---|
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
|---|
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
|---|
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
|---|
需求: 统计男女在不同船舱等级的票价
df.pivot_table(index='Sex', columns='Pclass', values='Fare') # 默认就是求平均值
| Pclass | 1 | 2 | 3 |
|---|
| Sex | | | |
|---|
| female | 115.591168 | 26.438750 | 13.735129 |
|---|
| male | 75.586551 | 20.184654 | 11.826350 |
|---|
df.pivot_table(index='Sex', columns='Pclass', values='Fare', aggfunc='min') # 求最小值
| Pclass | 1 | 2 | 3 |
|---|
| Sex | | | |
|---|
| female | 25.7 | 10.5000 | 6.9500 |
|---|
| male | 0.0 | 9.6875 | 3.1708 |
|---|
需求: 统计不同船舱等级里男女的平均年龄
df.pivot_table(index='Pclass', columns='Sex', values='Age')
| Sex | female | male |
|---|
| Pclass | | |
|---|
| 1 | 41.333333 | 40.520000 |
|---|
| 2 | 24.376552 | 30.940678 |
|---|
| 3 | 23.073400 | 24.525104 |
|---|