pandas入门十大例题
训练集数据资源地址:https://download.csdn.net/download/qq_41166909/30414605
本项目在AIStudio的地址https://aistudio.baidu.com/aistudio/projectdetail/2446775?contributionType=1&shared=1https://aistudio.baidu.com/aistudio/projectdetail/2446775?contributionType=1&shared=1
练习 1. 了解数据--某段时间内股票销售情况(stock)。 按照每题的要求,输入正确的代码并运行。
1 )- 导入正确的库——请导入Pandas。
In [12]:
import pandas as pd
2 )- 从指定地址导入数据,并存入一个名为shopper的框架中。(data.csv)
In [13]:
shopper=pd.read_csv('data/data111257/data.csv')
3 )- 查看前15行内容。
In [14]:
shopper.head(15)
Out[14]:
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE metaL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNIOn FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 5 | 536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 12/1/2010 8:26 | 7.65 | 17850.0 | United Kingdom |
| 6 | 536365 | 21730 | GLASS STAR FROSTED T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 4.25 | 17850.0 | United Kingdom |
| 7 | 536366 | 22633 | HAND WARMER UNIOn JACK | 6 | 12/1/2010 8:28 | 1.85 | 17850.0 | United Kingdom |
| 8 | 536366 | 22632 | HAND WARMER RED POLKA DOT | 6 | 12/1/2010 8:28 | 1.85 | 17850.0 | United Kingdom |
| 9 | 536367 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 32 | 12/1/2010 8:34 | 1.69 | 13047.0 | United Kingdom |
| 10 | 536367 | 22745 | POPPY'S PLAYHOUSE BEDROOM | 6 | 12/1/2010 8:34 | 2.10 | 13047.0 | United Kingdom |
| 11 | 536367 | 22748 | POPPY'S PLAYHOUSE KITCHEN | 6 | 12/1/2010 8:34 | 2.10 | 13047.0 | United Kingdom |
| 12 | 536367 | 22749 | FELTCRAFT PRINCESS CHARLOTTE DOLL | 8 | 12/1/2010 8:34 | 3.75 | 13047.0 | United Kingdom |
| 13 | 536367 | 22310 | IVORY KNITTED MUG COSY | 6 | 12/1/2010 8:34 | 1.65 | 13047.0 | United Kingdom |
| 14 | 536367 | 84969 | BOX OF 6 ASSORTED COLOUR TEASPOONS | 6 | 12/1/2010 8:34 | 4.25 | 13047.0 | United Kingdom |
4 )- 数据集中有多少列(columns)
In [15]:
number_of_rows = shopper.shape[1] number_of_rows
Out[15]:
8
5 )- 打印全部列名称。
In [16]:
shopper.columns
Out[16]:
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
'UnitPrice', 'CustomerID', 'Country'],
dtype='object')
6 )- 数据集的索引是怎样的。
In [17]:
shopper.index
Out[17]:
RangeIndex(start=0, stop=36997, step=1)
7 )- 被下单最多StockCode是什么?
In [18]:
a=shopper[['StockCode','Quantity']].groupby(by='StockCode').agg('sum') #根据股票编码分组
a=a.sort_values(['Quantity'],ascending=False)
a.head(1)
Out[18]:
| Quantity | |
|---|---|
| StockCode | |
| 84077 | 4948 |
8 )- 在StockCode这一列中,一共有多少只股票被下单?
In [19]:
pd.unique(shopper['StockCode']).size
Out[19]:
2743
9 )- 在Description列中,下单最多的是哪种股票?
In [20]:
shopper['StockCode'].value_counts().head(1)
Out[20]:
85123A 209 Name: StockCode, dtype: int64
10 )- 一共有多少股票被下单?
In [21]:
shopper['Quantity'].sum()
Out[21]:
298210
11 )- 每一张单据(InvoiceNo)对应一笔交易,总共有多少笔交易?
In [22]:
shopper['InvoiceNo'].unique().size
Out[22]:
1771
12 ).每一笔对应的平均总价是多少?
In [23]:
#求总收入 shopper['sub_total'] = round(shopper['UnitPrice'] * shopper['Quantity'],2) shopper['sub_total'].sum()
Out[23]:
648004.1099999999
In [24]:
#求每一笔订单的平均总价是多少
shopper[['InvoiceNo','sub_total']].groupby(by=['InvoiceNo']
).agg({'sub_total':'sum'})['sub_total'].mean()
Out[24]:
365.8972953133823
练习 2. 数据过滤与排序——NBA比赛情况统计(NBA_stats.xlsx)。按照每步的要求,输入正确的代码并运行。
1 ) - 导入必要的库
In [25]:
import pandas as pd
2 ) - 从以下地址导入数据集(NBA_stats.xlsx)
In [26]:
NBA2019=pd.read_excel('data/data111257/NBA_stats.xlsx')
3 )- 将数据集命名为NBA2019
In [27]:
NBA2019=pd.read_excel('data/data111257/NBA_stats.xlsx')
NBA2019
Out[27]:
| 排名 | 球队 | 投篮命中率 | 投篮命中 | 投篮出手 | 三分命中率 | 三分命中 | 三分出手 | 罚球命中率 | 罚球命中 | 罚球出手 | 总篮板 | 进攻 | 防守 | 助攻 | 失误 | 抢断 | 盖帽 | 犯规 | 得分 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 雄鹿 | 0.487 | 44.7 | 91.8 | 0.389 | 14.4 | 37.1 | 0.760 | 16.2 | 21.4 | 48.1 | 10.3 | 37.8 | 25.5 | 13.82 | 8.13 | 4.64 | 17.28 | 120.1 |
| 1 | 2 | 篮网 | 0.494 | 43.1 | 87.3 | 0.392 | 14.2 | 36.1 | 0.804 | 18.1 | 22.5 | 44.4 | 8.9 | 35.5 | 26.8 | 13.54 | 6.72 | 5.26 | 19.04 | 118.6 |
| 2 | 3 | 奇才 | 0.475 | 43.2 | 90.9 | 0.351 | 10.2 | 29.0 | 0.769 | 20.1 | 26.2 | 45.2 | 9.7 | 35.5 | 25.5 | 14.40 | 7.33 | 4.13 | 21.60 | 116.6 |
| 3 | 4 | 爵士 | 0.468 | 41.3 | 88.1 | 0.389 | 16.7 | 43.0 | 0.799 | 17.2 | 21.5 | 48.3 | 10.6 | 37.6 | 23.7 | 14.21 | 6.58 | 5.15 | 18.54 | 116.4 |
| 4 | 5 | 开拓者 | 0.453 | 41.3 | 91.1 | 0.385 | 15.7 | 40.8 | 0.823 | 17.8 | 21.6 | 44.5 | 10.6 | 33.9 | 21.3 | 11.10 | 6.89 | 5.04 | 18.92 | 116.1 |
| 5 | 6 | 太阳 | 0.490 | 43.3 | 88.3 | 0.378 | 13.1 | 34.6 | 0.834 | 15.6 | 18.7 | 42.9 | 8.8 | 34.2 | 26.9 | 12.53 | 7.18 | 4.33 | 19.08 | 115.3 |
| 6 | 7 | 步行者 | 0.474 | 43.3 | 91.2 | 0.364 | 12.3 | 34.0 | 0.792 | 16.4 | 20.7 | 42.7 | 9.0 | 33.7 | 27.4 | 13.54 | 8.49 | 6.39 | 20.18 | 115.3 |
| 7 | 8 | 掘金 | 0.485 | 43.2 | 89.2 | 0.377 | 12.9 | 34.2 | 0.803 | 15.7 | 19.5 | 44.4 | 10.5 | 33.9 | 26.8 | 13.50 | 8.08 | 4.49 | 19.08 | 115.1 |
| 8 | 9 | 鹈鹕 | 0.477 | 42.5 | 89.1 | 0.348 | 10.6 | 30.4 | 0.729 | 19.0 | 26.1 | 47.4 | 11.7 | 35.7 | 26.0 | 14.61 | 7.57 | 4.38 | 17.99 | 114.6 |
| 9 | 10 | 快船 | 0.482 | 41.8 | 86.7 | 0.411 | 14.3 | 34.7 | 0.839 | 16.2 | 19.3 | 44.2 | 9.4 | 34.7 | 24.4 | 13.19 | 7.07 | 4.10 | 19.21 | 114.0 |
| 10 | 11 | 勇士 | 0.468 | 41.3 | 88.2 | 0.376 | 14.6 | 38.7 | 0.785 | 16.6 | 21.1 | 43.0 | 8.0 | 35.1 | 27.7 | 15.00 | 8.15 | 4.75 | 21.19 | 113.7 |
| 11 | 12 | 老鹰 | 0.468 | 40.8 | 87.2 | 0.373 | 12.4 | 33.4 | 0.812 | 19.7 | 24.2 | 45.6 | 10.6 | 35.1 | 24.1 | 13.24 | 6.99 | 4.75 | 19.33 | 113.7 |
| 12 | 13 | 国王 | 0.481 | 42.6 | 88.6 | 0.364 | 12.1 | 33.3 | 0.745 | 16.4 | 22.0 | 41.4 | 9.4 | 32.0 | 25.5 | 13.38 | 7.54 | 4.97 | 19.44 | 113.7 |
| 13 | 14 | 76人 | 0.476 | 41.4 | 86.9 | 0.374 | 11.3 | 30.1 | 0.767 | 19.6 | 25.5 | 45.1 | 10.0 | 35.0 | 23.7 | 14.44 | 9.10 | 6.21 | 20.22 | 113.6 |
| 14 | 15 | 灰熊 | 0.467 | 42.8 | 91.8 | 0.356 | 11.2 | 31.4 | 0.771 | 16.4 | 21.3 | 46.5 | 11.2 | 35.3 | 26.9 | 13.29 | 9.10 | 5.06 | 18.74 | 113.3 |
| 15 | 16 | 凯尔特人 | 0.466 | 41.5 | 88.9 | 0.374 | 13.6 | 36.4 | 0.775 | 16.1 | 20.8 | 44.3 | 10.6 | 33.6 | 23.5 | 14.06 | 7.72 | 5.32 | 20.43 | 112.6 |
| 16 | 17 | 独行侠 | 0.470 | 41.1 | 87.3 | 0.362 | 13.8 | 38.1 | 0.778 | 16.5 | 21.2 | 43.3 | 9.1 | 34.2 | 22.9 | 12.07 | 6.25 | 4.32 | 19.39 | 112.4 |
| 17 | 18 | 森林狼 | 0.448 | 40.7 | 90.9 | 0.349 | 13.1 | 37.6 | 0.761 | 17.6 | 23.1 | 43.5 | 10.5 | 33.0 | 25.6 | 14.26 | 8.78 | 5.53 | 20.93 | 112.1 |
| 18 | 19 | 猛龙 | 0.448 | 39.7 | 88.7 | 0.368 | 14.5 | 39.3 | 0.815 | 17.4 | 21.3 | 41.6 | 9.4 | 32.1 | 24.1 | 13.22 | 8.58 | 5.40 | 21.19 | 111.3 |
| 19 | 20 | 马刺 | 0.462 | 41.9 | 90.5 | 0.350 | 9.9 | 28.4 | 0.792 | 17.4 | 22.0 | 43.9 | 9.3 | 34.6 | 24.4 | 11.40 | 7.01 | 5.08 | 17.96 | 111.1 |
| 20 | 21 | 公牛 | 0.476 | 42.2 | 88.6 | 0.370 | 12.6 | 34.0 | 0.791 | 13.8 | 17.5 | 45.0 | 9.6 | 35.3 | 26.8 | 15.13 | 6.69 | 4.22 | 18.92 | 110.7 |
| 21 | 22 | 湖人 | 0.472 | 40.6 | 86.1 | 0.354 | 11.1 | 31.2 | 0.739 | 17.2 | 23.3 | 44.2 | 9.7 | 34.6 | 24.7 | 15.21 | 7.81 | 5.36 | 19.13 | 109.5 |
| 22 | 23 | 黄蜂 | 0.455 | 39.9 | 87.8 | 0.369 | 13.7 | 37.0 | 0.761 | 15.9 | 20.9 | 43.8 | 10.6 | 33.2 | 26.8 | 14.85 | 7.85 | 4.78 | 18.03 | 109.5 |
| 23 | 24 | 火箭 | 0.444 | 39.2 | 88.5 | 0.339 | 13.8 | 40.6 | 0.740 | 16.5 | 22.3 | 42.6 | 9.3 | 33.3 | 23.6 | 14.72 | 7.58 | 5.01 | 19.54 | 108.8 |
| 24 | 25 | 热火 | 0.468 | 39.2 | 83.7 | 0.358 | 12.9 | 36.2 | 0.790 | 16.7 | 21.1 | 41.5 | 8.0 | 33.5 | 26.3 | 14.07 | 7.90 | 3.97 | 18.93 | 108.1 |
| 25 | 26 | 尼克斯 | 0.456 | 39.4 | 86.5 | 0.392 | 11.8 | 30.0 | 0.784 | 16.4 | 20.9 | 45.1 | 9.7 | 35.5 | 21.4 | 12.94 | 7.04 | 5.07 | 20.46 | 107.0 |
| 26 | 27 | 活塞 | 0.452 | 38.7 | 85.6 | 0.351 | 11.6 | 32.9 | 0.759 | 17.8 | 23.4 | 42.7 | 9.6 | 33.1 | 24.2 | 14.93 | 7.38 | 5.15 | 20.51 | 106.6 |
| 27 | 28 | 雷霆 | 0.441 | 38.8 | 88.0 | 0.339 | 11.9 | 35.1 | 0.725 | 15.5 | 21.3 | 45.6 | 9.9 | 35.7 | 22.1 | 16.14 | 7.00 | 4.39 | 18.13 | 105.0 |
| 28 | 29 | 魔术 | 0.429 | 38.2 | 89.2 | 0.343 | 10.9 | 31.8 | 0.775 | 16.6 | 21.4 | 45.4 | 10.4 | 35.1 | 21.8 | 12.83 | 6.89 | 4.42 | 17.18 | 104.0 |
| 29 | 30 | 骑士 | 0.450 | 38.6 | 85.8 | 0.336 | 10.0 | 29.7 | 0.743 | 16.7 | 22.4 | 42.8 | 10.4 | 32.3 | 23.8 | 15.47 | 7.76 | 4.51 | 18.17 | 103.8 |
4 )-只选取 “得分” 这一列
In [28]:
NBA2019['得分']
Out[28]:
0 120.1 1 118.6 2 116.6 3 116.4 4 116.1 5 115.3 6 115.3 7 115.1 8 114.6 9 114.0 10 113.7 11 113.7 12 113.7 13 113.6 14 113.3 15 112.6 16 112.4 17 112.1 18 111.3 19 111.1 20 110.7 21 109.5 22 109.5 23 108.8 24 108.1 25 107.0 26 106.6 27 105.0 28 104.0 29 103.8 Name: 得分, dtype: float64
5 )- 有多少球队参与了比赛?
In [29]:
NBA2019['球队'].nunique
Out[29]:
6 )-该数据集中一共有多少列(columns)?
In [30]:
NBA2019.shape[1]
Out[30]:
20
7 )- 将数据集中的列“投篮命中率”、“投篮命中”和“投篮出手”单独存为一个名叫“投篮”的数据框
In [31]:
shots=NBA2019[['投篮命中率','投篮命中','投篮出手']] shots
Out[31]:
| 投篮命中率 | 投篮命中 | 投篮出手 | |
|---|---|---|---|
| 0 | 0.487 | 44.7 | 91.8 |
| 1 | 0.494 | 43.1 | 87.3 |
| 2 | 0.475 | 43.2 | 90.9 |
| 3 | 0.468 | 41.3 | 88.1 |
| 4 | 0.453 | 41.3 | 91.1 |
| 5 | 0.490 | 43.3 | 88.3 |
| 6 | 0.474 | 43.3 | 91.2 |
| 7 | 0.485 | 43.2 | 89.2 |
| 8 | 0.477 | 42.5 | 89.1 |
| 9 | 0.482 | 41.8 | 86.7 |
| 10 | 0.468 | 41.3 | 88.2 |
| 11 | 0.468 | 40.8 | 87.2 |
| 12 | 0.481 | 42.6 | 88.6 |
| 13 | 0.476 | 41.4 | 86.9 |
| 14 | 0.467 | 42.8 | 91.8 |
| 15 | 0.466 | 41.5 | 88.9 |
| 16 | 0.470 | 41.1 | 87.3 |
| 17 | 0.448 | 40.7 | 90.9 |
| 18 | 0.448 | 39.7 | 88.7 |
| 19 | 0.462 | 41.9 | 90.5 |
| 20 | 0.476 | 42.2 | 88.6 |
| 21 | 0.472 | 40.6 | 86.1 |
| 22 | 0.455 | 39.9 | 87.8 |
| 23 | 0.444 | 39.2 | 88.5 |
| 24 | 0.468 | 39.2 | 83.7 |
| 25 | 0.456 | 39.4 | 86.5 |
| 26 | 0.452 | 38.7 | 85.6 |
| 27 | 0.441 | 38.8 | 88.0 |
| 28 | 0.429 | 38.2 | 89.2 |
| 29 | 0.450 | 38.6 | 85.8 |
In [32]:
NBA2019.columns
Out[32]:
Index(['排名', '球队', '投篮命中率', '投篮命中', '投篮出手', '三分命中率', '三分命中', '三分出手', '罚球命中率',
'罚球命中', '罚球出手', '总篮板', '进攻', '防守', '助攻', '失误', '抢断', '盖帽', '犯规', '得分'],
dtype='object')
8 ) - 对数据框“投篮”按照先“投篮命中率”再“投篮出手”进行排序
In [33]:
shots.sort_values(['投篮命中率','投篮出手'],ascending=[False,False])
Out[33]:
| 投篮命中率 | 投篮命中 | 投篮出手 | |
|---|---|---|---|
| 1 | 0.494 | 43.1 | 87.3 |
| 5 | 0.490 | 43.3 | 88.3 |
| 0 | 0.487 | 44.7 | 91.8 |
| 7 | 0.485 | 43.2 | 89.2 |
| 9 | 0.482 | 41.8 | 86.7 |
| 12 | 0.481 | 42.6 | 88.6 |
| 8 | 0.477 | 42.5 | 89.1 |
| 20 | 0.476 | 42.2 | 88.6 |
| 13 | 0.476 | 41.4 | 86.9 |
| 2 | 0.475 | 43.2 | 90.9 |
| 6 | 0.474 | 43.3 | 91.2 |
| 21 | 0.472 | 40.6 | 86.1 |
| 16 | 0.470 | 41.1 | 87.3 |
| 10 | 0.468 | 41.3 | 88.2 |
| 3 | 0.468 | 41.3 | 88.1 |
| 11 | 0.468 | 40.8 | 87.2 |
| 24 | 0.468 | 39.2 | 83.7 |
| 14 | 0.467 | 42.8 | 91.8 |
| 15 | 0.466 | 41.5 | 88.9 |
| 19 | 0.462 | 41.9 | 90.5 |
| 25 | 0.456 | 39.4 | 86.5 |
| 22 | 0.455 | 39.9 | 87.8 |
| 4 | 0.453 | 41.3 | 91.1 |
| 26 | 0.452 | 38.7 | 85.6 |
| 29 | 0.450 | 38.6 | 85.8 |
| 17 | 0.448 | 40.7 | 90.9 |
| 18 | 0.448 | 39.7 | 88.7 |
| 23 | 0.444 | 39.2 | 88.5 |
| 27 | 0.441 | 38.8 | 88.0 |
| 28 | 0.429 | 38.2 | 89.2 |
9 )- 计算每个球队“进攻”的平均值
In [34]:
discipline=NBA2019[['进攻']] discipline
Out[34]:
| 进攻 | |
|---|---|
| 0 | 10.3 |
| 1 | 8.9 |
| 2 | 9.7 |
| 3 | 10.6 |
| 4 | 10.6 |
| 5 | 8.8 |
| 6 | 9.0 |
| 7 | 10.5 |
| 8 | 11.7 |
| 9 | 9.4 |
| 10 | 8.0 |
| 11 | 10.6 |
| 12 | 9.4 |
| 13 | 10.0 |
| 14 | 11.2 |
| 15 | 10.6 |
| 16 | 9.1 |
| 17 | 10.5 |
| 18 | 9.4 |
| 19 | 9.3 |
| 20 | 9.6 |
| 21 | 9.7 |
| 22 | 10.6 |
| 23 | 9.3 |
| 24 | 8.0 |
| 25 | 9.7 |
| 26 | 9.6 |
| 27 | 9.9 |
| 28 | 10.4 |
| 29 | 10.4 |
In [35]:
round(discipline['进攻'].mean())
Out[35]:
10.0
10 )- 找到“罚球命中”小于16.3(不含)的球队数据
In [36]:
NBA2019[NBA2019.罚球命中>16.3]
Out[36]:
| 排名 | 球队 | 投篮命中率 | 投篮命中 | 投篮出手 | 三分命中率 | 三分命中 | 三分出手 | 罚球命中率 | 罚球命中 | 罚球出手 | 总篮板 | 进攻 | 防守 | 助攻 | 失误 | 抢断 | 盖帽 | 犯规 | 得分 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 篮网 | 0.494 | 43.1 | 87.3 | 0.392 | 14.2 | 36.1 | 0.804 | 18.1 | 22.5 | 44.4 | 8.9 | 35.5 | 26.8 | 13.54 | 6.72 | 5.26 | 19.04 | 118.6 |
| 2 | 3 | 奇才 | 0.475 | 43.2 | 90.9 | 0.351 | 10.2 | 29.0 | 0.769 | 20.1 | 26.2 | 45.2 | 9.7 | 35.5 | 25.5 | 14.40 | 7.33 | 4.13 | 21.60 | 116.6 |
| 3 | 4 | 爵士 | 0.468 | 41.3 | 88.1 | 0.389 | 16.7 | 43.0 | 0.799 | 17.2 | 21.5 | 48.3 | 10.6 | 37.6 | 23.7 | 14.21 | 6.58 | 5.15 | 18.54 | 116.4 |
| 4 | 5 | 开拓者 | 0.453 | 41.3 | 91.1 | 0.385 | 15.7 | 40.8 | 0.823 | 17.8 | 21.6 | 44.5 | 10.6 | 33.9 | 21.3 | 11.10 | 6.89 | 5.04 | 18.92 | 116.1 |
| 6 | 7 | 步行者 | 0.474 | 43.3 | 91.2 | 0.364 | 12.3 | 34.0 | 0.792 | 16.4 | 20.7 | 42.7 | 9.0 | 33.7 | 27.4 | 13.54 | 8.49 | 6.39 | 20.18 | 115.3 |
| 8 | 9 | 鹈鹕 | 0.477 | 42.5 | 89.1 | 0.348 | 10.6 | 30.4 | 0.729 | 19.0 | 26.1 | 47.4 | 11.7 | 35.7 | 26.0 | 14.61 | 7.57 | 4.38 | 17.99 | 114.6 |
| 10 | 11 | 勇士 | 0.468 | 41.3 | 88.2 | 0.376 | 14.6 | 38.7 | 0.785 | 16.6 | 21.1 | 43.0 | 8.0 | 35.1 | 27.7 | 15.00 | 8.15 | 4.75 | 21.19 | 113.7 |
| 11 | 12 | 老鹰 | 0.468 | 40.8 | 87.2 | 0.373 | 12.4 | 33.4 | 0.812 | 19.7 | 24.2 | 45.6 | 10.6 | 35.1 | 24.1 | 13.24 | 6.99 | 4.75 | 19.33 | 113.7 |
| 12 | 13 | 国王 | 0.481 | 42.6 | 88.6 | 0.364 | 12.1 | 33.3 | 0.745 | 16.4 | 22.0 | 41.4 | 9.4 | 32.0 | 25.5 | 13.38 | 7.54 | 4.97 | 19.44 | 113.7 |
| 13 | 14 | 76人 | 0.476 | 41.4 | 86.9 | 0.374 | 11.3 | 30.1 | 0.767 | 19.6 | 25.5 | 45.1 | 10.0 | 35.0 | 23.7 | 14.44 | 9.10 | 6.21 | 20.22 | 113.6 |
| 14 | 15 | 灰熊 | 0.467 | 42.8 | 91.8 | 0.356 | 11.2 | 31.4 | 0.771 | 16.4 | 21.3 | 46.5 | 11.2 | 35.3 | 26.9 | 13.29 | 9.10 | 5.06 | 18.74 | 113.3 |
| 16 | 17 | 独行侠 | 0.470 | 41.1 | 87.3 | 0.362 | 13.8 | 38.1 | 0.778 | 16.5 | 21.2 | 43.3 | 9.1 | 34.2 | 22.9 | 12.07 | 6.25 | 4.32 | 19.39 | 112.4 |
| 17 | 18 | 森林狼 | 0.448 | 40.7 | 90.9 | 0.349 | 13.1 | 37.6 | 0.761 | 17.6 | 23.1 | 43.5 | 10.5 | 33.0 | 25.6 | 14.26 | 8.78 | 5.53 | 20.93 | 112.1 |
| 18 | 19 | 猛龙 | 0.448 | 39.7 | 88.7 | 0.368 | 14.5 | 39.3 | 0.815 | 17.4 | 21.3 | 41.6 | 9.4 | 32.1 | 24.1 | 13.22 | 8.58 | 5.40 | 21.19 | 111.3 |
| 19 | 20 | 马刺 | 0.462 | 41.9 | 90.5 | 0.350 | 9.9 | 28.4 | 0.792 | 17.4 | 22.0 | 43.9 | 9.3 | 34.6 | 24.4 | 11.40 | 7.01 | 5.08 | 17.96 | 111.1 |
| 21 | 22 | 湖人 | 0.472 | 40.6 | 86.1 | 0.354 | 11.1 | 31.2 | 0.739 | 17.2 | 23.3 | 44.2 | 9.7 | 34.6 | 24.7 | 15.21 | 7.81 | 5.36 | 19.13 | 109.5 |
| 23 | 24 | 火箭 | 0.444 | 39.2 | 88.5 | 0.339 | 13.8 | 40.6 | 0.740 | 16.5 | 22.3 | 42.6 | 9.3 | 33.3 | 23.6 | 14.72 | 7.58 | 5.01 | 19.54 | 108.8 |
| 24 | 25 | 热火 | 0.468 | 39.2 | 83.7 | 0.358 | 12.9 | 36.2 | 0.790 | 16.7 | 21.1 | 41.5 | 8.0 | 33.5 | 26.3 | 14.07 | 7.90 | 3.97 | 18.93 | 108.1 |
| 25 | 26 | 尼克斯 | 0.456 | 39.4 | 86.5 | 0.392 | 11.8 | 30.0 | 0.784 | 16.4 | 20.9 | 45.1 | 9.7 | 35.5 | 21.4 | 12.94 | 7.04 | 5.07 | 20.46 | 107.0 |
| 26 | 27 | 活塞 | 0.452 | 38.7 | 85.6 | 0.351 | 11.6 | 32.9 | 0.759 | 17.8 | 23.4 | 42.7 | 9.6 | 33.1 | 24.2 | 14.93 | 7.38 | 5.15 | 20.51 | 106.6 |
| 28 | 29 | 魔术 | 0.429 | 38.2 | 89.2 | 0.343 | 10.9 | 31.8 | 0.775 | 16.6 | 21.4 | 45.4 | 10.4 | 35.1 | 21.8 | 12.83 | 6.89 | 4.42 | 17.18 | 104.0 |
| 29 | 30 | 骑士 | 0.450 | 38.6 | 85.8 | 0.336 | 10.0 | 29.7 | 0.743 | 16.7 | 22.4 | 42.8 | 10.4 | 32.3 | 23.8 | 15.47 | 7.76 | 4.51 | 18.17 | 103.8 |
11) - 选取球队名字含有“人”字的球队数据
In [37]:
NBA2019[NBA2019.球队.str.startswith('人')]
Out[37]:
12) - 选取前5列
In [38]:
NBA2019.iloc[:,0:5]
Out[38]:
| 排名 | 球队 | 投篮命中率 | 投篮命中 | 投篮出手 | |
|---|---|---|---|---|---|
| 0 | 1 | 雄鹿 | 0.487 | 44.7 | 91.8 |
| 1 | 2 | 篮网 | 0.494 | 43.1 | 87.3 |
| 2 | 3 | 奇才 | 0.475 | 43.2 | 90.9 |
| 3 | 4 | 爵士 | 0.468 | 41.3 | 88.1 |
| 4 | 5 | 开拓者 | 0.453 | 41.3 | 91.1 |
| 5 | 6 | 太阳 | 0.490 | 43.3 | 88.3 |
| 6 | 7 | 步行者 | 0.474 | 43.3 | 91.2 |
| 7 | 8 | 掘金 | 0.485 | 43.2 | 89.2 |
| 8 | 9 | 鹈鹕 | 0.477 | 42.5 | 89.1 |
| 9 | 10 | 快船 | 0.482 | 41.8 | 86.7 |
| 10 | 11 | 勇士 | 0.468 | 41.3 | 88.2 |
| 11 | 12 | 老鹰 | 0.468 | 40.8 | 87.2 |
| 12 | 13 | 国王 | 0.481 | 42.6 | 88.6 |
| 13 | 14 | 76人 | 0.476 | 41.4 | 86.9 |
| 14 | 15 | 灰熊 | 0.467 | 42.8 | 91.8 |
| 15 | 16 | 凯尔特人 | 0.466 | 41.5 | 88.9 |
| 16 | 17 | 独行侠 | 0.470 | 41.1 | 87.3 |
| 17 | 18 | 森林狼 | 0.448 | 40.7 | 90.9 |
| 18 | 19 | 猛龙 | 0.448 | 39.7 | 88.7 |
| 19 | 20 | 马刺 | 0.462 | 41.9 | 90.5 |
| 20 | 21 | 公牛 | 0.476 | 42.2 | 88.6 |
| 21 | 22 | 湖人 | 0.472 | 40.6 | 86.1 |
| 22 | 23 | 黄蜂 | 0.455 | 39.9 | 87.8 |
| 23 | 24 | 火箭 | 0.444 | 39.2 | 88.5 |
| 24 | 25 | 热火 | 0.468 | 39.2 | 83.7 |
| 25 | 26 | 尼克斯 | 0.456 | 39.4 | 86.5 |
| 26 | 27 | 活塞 | 0.452 | 38.7 | 85.6 |
| 27 | 28 | 雷霆 | 0.441 | 38.8 | 88.0 |
| 28 | 29 | 魔术 | 0.429 | 38.2 | 89.2 |
| 29 | 30 | 骑士 | 0.450 | 38.6 | 85.8 |
13) - 选取除了最后4列之外的全部列
In [39]:
NBA2019.iloc[:,:-4]
Out[39]:
| 排名 | 球队 | 投篮命中率 | 投篮命中 | 投篮出手 | 三分命中率 | 三分命中 | 三分出手 | 罚球命中率 | 罚球命中 | 罚球出手 | 总篮板 | 进攻 | 防守 | 助攻 | 失误 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 雄鹿 | 0.487 | 44.7 | 91.8 | 0.389 | 14.4 | 37.1 | 0.760 | 16.2 | 21.4 | 48.1 | 10.3 | 37.8 | 25.5 | 13.82 |
| 1 | 2 | 篮网 | 0.494 | 43.1 | 87.3 | 0.392 | 14.2 | 36.1 | 0.804 | 18.1 | 22.5 | 44.4 | 8.9 | 35.5 | 26.8 | 13.54 |
| 2 | 3 | 奇才 | 0.475 | 43.2 | 90.9 | 0.351 | 10.2 | 29.0 | 0.769 | 20.1 | 26.2 | 45.2 | 9.7 | 35.5 | 25.5 | 14.40 |
| 3 | 4 | 爵士 | 0.468 | 41.3 | 88.1 | 0.389 | 16.7 | 43.0 | 0.799 | 17.2 | 21.5 | 48.3 | 10.6 | 37.6 | 23.7 | 14.21 |
| 4 | 5 | 开拓者 | 0.453 | 41.3 | 91.1 | 0.385 | 15.7 | 40.8 | 0.823 | 17.8 | 21.6 | 44.5 | 10.6 | 33.9 | 21.3 | 11.10 |
| 5 | 6 | 太阳 | 0.490 | 43.3 | 88.3 | 0.378 | 13.1 | 34.6 | 0.834 | 15.6 | 18.7 | 42.9 | 8.8 | 34.2 | 26.9 | 12.53 |
| 6 | 7 | 步行者 | 0.474 | 43.3 | 91.2 | 0.364 | 12.3 | 34.0 | 0.792 | 16.4 | 20.7 | 42.7 | 9.0 | 33.7 | 27.4 | 13.54 |
| 7 | 8 | 掘金 | 0.485 | 43.2 | 89.2 | 0.377 | 12.9 | 34.2 | 0.803 | 15.7 | 19.5 | 44.4 | 10.5 | 33.9 | 26.8 | 13.50 |
| 8 | 9 | 鹈鹕 | 0.477 | 42.5 | 89.1 | 0.348 | 10.6 | 30.4 | 0.729 | 19.0 | 26.1 | 47.4 | 11.7 | 35.7 | 26.0 | 14.61 |
| 9 | 10 | 快船 | 0.482 | 41.8 | 86.7 | 0.411 | 14.3 | 34.7 | 0.839 | 16.2 | 19.3 | 44.2 | 9.4 | 34.7 | 24.4 | 13.19 |
| 10 | 11 | 勇士 | 0.468 | 41.3 | 88.2 | 0.376 | 14.6 | 38.7 | 0.785 | 16.6 | 21.1 | 43.0 | 8.0 | 35.1 | 27.7 | 15.00 |
| 11 | 12 | 老鹰 | 0.468 | 40.8 | 87.2 | 0.373 | 12.4 | 33.4 | 0.812 | 19.7 | 24.2 | 45.6 | 10.6 | 35.1 | 24.1 | 13.24 |
| 12 | 13 | 国王 | 0.481 | 42.6 | 88.6 | 0.364 | 12.1 | 33.3 | 0.745 | 16.4 | 22.0 | 41.4 | 9.4 | 32.0 | 25.5 | 13.38 |
| 13 | 14 | 76人 | 0.476 | 41.4 | 86.9 | 0.374 | 11.3 | 30.1 | 0.767 | 19.6 | 25.5 | 45.1 | 10.0 | 35.0 | 23.7 | 14.44 |
| 14 | 15 | 灰熊 | 0.467 | 42.8 | 91.8 | 0.356 | 11.2 | 31.4 | 0.771 | 16.4 | 21.3 | 46.5 | 11.2 | 35.3 | 26.9 | 13.29 |
| 15 | 16 | 凯尔特人 | 0.466 | 41.5 | 88.9 | 0.374 | 13.6 | 36.4 | 0.775 | 16.1 | 20.8 | 44.3 | 10.6 | 33.6 | 23.5 | 14.06 |
| 16 | 17 | 独行侠 | 0.470 | 41.1 | 87.3 | 0.362 | 13.8 | 38.1 | 0.778 | 16.5 | 21.2 | 43.3 | 9.1 | 34.2 | 22.9 | 12.07 |
| 17 | 18 | 森林狼 | 0.448 | 40.7 | 90.9 | 0.349 | 13.1 | 37.6 | 0.761 | 17.6 | 23.1 | 43.5 | 10.5 | 33.0 | 25.6 | 14.26 |
| 18 | 19 | 猛龙 | 0.448 | 39.7 | 88.7 | 0.368 | 14.5 | 39.3 | 0.815 | 17.4 | 21.3 | 41.6 | 9.4 | 32.1 | 24.1 | 13.22 |
| 19 | 20 | 马刺 | 0.462 | 41.9 | 90.5 | 0.350 | 9.9 | 28.4 | 0.792 | 17.4 | 22.0 | 43.9 | 9.3 | 34.6 | 24.4 | 11.40 |
| 20 | 21 | 公牛 | 0.476 | 42.2 | 88.6 | 0.370 | 12.6 | 34.0 | 0.791 | 13.8 | 17.5 | 45.0 | 9.6 | 35.3 | 26.8 | 15.13 |
| 21 | 22 | 湖人 | 0.472 | 40.6 | 86.1 | 0.354 | 11.1 | 31.2 | 0.739 | 17.2 | 23.3 | 44.2 | 9.7 | 34.6 | 24.7 | 15.21 |
| 22 | 23 | 黄蜂 | 0.455 | 39.9 | 87.8 | 0.369 | 13.7 | 37.0 | 0.761 | 15.9 | 20.9 | 43.8 | 10.6 | 33.2 | 26.8 | 14.85 |
| 23 | 24 | 火箭 | 0.444 | 39.2 | 88.5 | 0.339 | 13.8 | 40.6 | 0.740 | 16.5 | 22.3 | 42.6 | 9.3 | 33.3 | 23.6 | 14.72 |
| 24 | 25 | 热火 | 0.468 | 39.2 | 83.7 | 0.358 | 12.9 | 36.2 | 0.790 | 16.7 | 21.1 | 41.5 | 8.0 | 33.5 | 26.3 | 14.07 |
| 25 | 26 | 尼克斯 | 0.456 | 39.4 | 86.5 | 0.392 | 11.8 | 30.0 | 0.784 | 16.4 | 20.9 | 45.1 | 9.7 | 35.5 | 21.4 | 12.94 |
| 26 | 27 | 活塞 | 0.452 | 38.7 | 85.6 | 0.351 | 11.6 | 32.9 | 0.759 | 17.8 | 23.4 | 42.7 | 9.6 | 33.1 | 24.2 | 14.93 |
| 27 | 28 | 雷霆 | 0.441 | 38.8 | 88.0 | 0.339 | 11.9 | 35.1 | 0.725 | 15.5 | 21.3 | 45.6 | 9.9 | 35.7 | 22.1 | 16.14 |
| 28 | 29 | 魔术 | 0.429 | 38.2 | 89.2 | 0.343 | 10.9 | 31.8 | 0.775 | 16.6 | 21.4 | 45.4 | 10.4 | 35.1 | 21.8 | 12.83 |
| 29 | 30 | 骑士 | 0.450 | 38.6 | 85.8 | 0.336 | 10.0 | 29.7 | 0.743 | 16.7 | 22.4 | 42.8 | 10.4 | 32.3 | 23.8 | 15.47 |
14) - 找到“快船”和“国王”的“得分”
In [40]:
NBA2019.loc[NBA2019.球队.isin(['快船','国王']),['球队','得分']]
Out[40]:
| 球队 | 得分 | |
|---|---|---|
| 9 | 快船 | 114.0 |
| 12 | 国王 | 113.7 |
练习3. 数据分组----探索消费数据(OnlineRetail1.csv)
1)- 导入必要的库
In [41]:
import pandas as pd
2)- 从以下地址导入数据
In [42]:
path3='data/data111257/OnlineRetail.csv'
3)- 将数据框命名为ProductSales
In [43]:
ProductSales=pd.read_csv('data/data111257/OnlineRetail.csv')
4)- 哪个国家(Country)销售收入(Quantity*UnitPrice)更多?
In [44]:
ProductSales['TotalCost'] = ProductSales['UnitPrice'] * ProductSales['Quantity']
cost_per_country = ProductSales.groupby('Country')['TotalCost'].sum()
cost_per_country.idxmax()
Out[44]:
'Canada'
In [45]:
ProductSales.head()
Out[45]:
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | TotalCost | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2021/8/21 | 2.55 | 17850 | United Kingdom | 15.30 |
| 1 | 536365 | 71053 | WHITE metaL LANTERN | 6 | 2021/8/12 | 3.39 | 17850 | United Kingdom | 20.34 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2021/7/24 | 2.75 | 17850 | United Kingdom | 22.00 |
| 3 | 536365 | 84029G | KNITTED UNIOn FLAG HOT WATER BOTTLE | 6 | 2021/8/10 | 3.39 | 17850 | United Kingdom | 20.34 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2021/9/12 | 3.39 | 17850 | United Kingdom | 20.34 |
5)- 哪个月销售收入更多?
In [46]:
ProductSales['Month'] = ProductSales['InvoiceDate'].apply(lambda x:x[:6]).tolist()
In [47]:
ProductSales.head()
Out[47]:
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | TotalCost | Month | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2021/8/21 | 2.55 | 17850 | United Kingdom | 15.30 | 2021/8 |
| 1 | 536365 | 71053 | WHITE metaL LANTERN | 6 | 2021/8/12 | 3.39 | 17850 | United Kingdom | 20.34 | 2021/8 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2021/7/24 | 2.75 | 17850 | United Kingdom | 22.00 | 2021/7 |
| 3 | 536365 | 84029G | KNITTED UNIOn FLAG HOT WATER BOTTLE | 6 | 2021/8/10 | 3.39 | 17850 | United Kingdom | 20.34 | 2021/8 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2021/9/12 | 3.39 | 17850 | United Kingdom | 20.34 | 2021/9 |
In [48]:
ProductSales['TotalCost'] = ProductSales['UnitPrice'] * ProductSales['Quantity']
cost_per_country = ProductSales.groupby('Month')['TotalCost'].sum()
cost_per_country
Out[48]:
Month 2021/7 431.35 2021/8 1570.94 2021/9 667.43 Name: TotalCost, dtype: float64
In [49]:
cost_per_country.idxmax()
Out[49]:
'2021/8'
6)- 打印出每个国家销售数量的平均值,最大值和最小值
In [50]:
ProductSales.groupby('Country').Quantity.agg(['mean', 'min', 'max'])
Out[50]:
| mean | min | max | |
|---|---|---|---|
| Country | |||
| Australia | 10.500000 | 5 | 20 |
| Belgium | 7.000000 | 2 | 24 |
| Canada | 9.625000 | 1 | 16 |
| Channel Islands | 11.222222 | 1 | 24 |
| France | 22.500000 | 3 | 36 |
| Germany | 10.750000 | 4 | 24 |
| Italy | 10.750000 | 1 | 24 |
| USA | 6.750000 | 2 | 12 |
| United Kingdom | 9.263158 | 2 | 80 |
练习 4. 运用函数--分析美国某次人口普查状况。 按照每题的要求,输入正确的代码并运行。
1)-导入必要的库
In [51]:
import pandas as pd
2)-从以下地址导入数据集
In [52]:
path4='data/data111257/adult.csv'
3)-将数据框命名为census
In [53]:
census=pd.read_csv('data/data111257/adult.csv')
In [54]:
census.head()
Out[54]:
| year | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1966 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 1971 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 1977 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 1977 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 1973 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
4)-每一列(column)的数据类型是什么样的?
In [55]:
census.info()
RangeIndex: 32561 entries, 0 to 32560 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 year 32561 non-null int64 1 age 32561 non-null int64 2 workclass 32561 non-null object 3 fnlwgt 32561 non-null int64 4 education 32561 non-null object 5 education-num 32074 non-null float64 6 marital-status 32561 non-null object 7 occupation 32049 non-null object 8 relationship 32561 non-null object 9 race 32561 non-null object 10 sex 32561 non-null object 11 capital-gain 32561 non-null int64 12 capital-loss 32561 non-null int64 13 hours-per-week 32561 non-null int64 14 native-country 32561 non-null object 15 salary 32561 non-null object dtypes: float64(1), int64(6), object(9) memory usage: 4.0+ MB
5)-将year的数据类型转换为 datetime64
In [56]:
census.year = pd.to_datetime(census.year, format='%Y') census.info()
RangeIndex: 32561 entries, 0 to 32560 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 year 32561 non-null datetime64[ns] 1 age 32561 non-null int64 2 workclass 32561 non-null object 3 fnlwgt 32561 non-null int64 4 education 32561 non-null object 5 education-num 32074 non-null float64 6 marital-status 32561 non-null object 7 occupation 32049 non-null object 8 relationship 32561 non-null object 9 race 32561 non-null object 10 sex 32561 non-null object 11 capital-gain 32561 non-null int64 12 capital-loss 32561 non-null int64 13 hours-per-week 32561 non-null int64 14 native-country 32561 non-null object 15 salary 32561 non-null object dtypes: datetime64[ns](1), float64(1), int64(5), object(9) memory usage: 4.0+ MB
6)-将列year设置为数据框的索引
In [57]:
census = census.set_index('year', drop = True)
census.head()
Out[57]:
| age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| year | |||||||||||||||
| 1966-01-01 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1971-01-01 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 1977-01-01 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 1977-01-01 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 1973-01-01 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
7)-删除名为fnlwgt的列
In [58]:
del census['fnlwgt'] census.head()
Out[58]:
| age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| year | ||||||||||||||
| 1966-01-01 | 49 | Private | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1971-01-01 | 44 | Private | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 1977-01-01 | 38 | Private | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 1977-01-01 | 38 | Self-emp-inc | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 1973-01-01 | 42 | Self-emp-not-inc | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
8)-按照year对数据框进行分组,并对hours-per-week求和
ProductSales['TotalCost'] = ProductSales['UnitPrice'] * ProductSales['Quantity'] cost_per_country = ProductSales.groupby('InvoiceDate')['TotalCost'].sum() cost_per_country.idxmax()
In [59]:
census.groupby('year')['hours-per-week'].agg(sum)
Out[59]:
year
1925-01-01 1583
1927-01-01 120
1928-01-01 2
1929-01-01 40
1930-01-01 88
...
1994-01-01 24505
1995-01-01 24307
1996-01-01 21843
1997-01-01 14252
1998-01-01 8440
Name: hours-per-week, Length: 73, dtype: int64
9)-哪个年龄(age)已婚( Married-civ-spouse)人士最多
In [60]:
marry=census[census['marital-status'].str.contains('Married-civ-spouse')]
marry.age.value_counts().head(1)
Out[60]:
38 499 Name: age, dtype: int64
练习 5. 合并 1)-导入必要的库
In [61]:
import pandas as pd
2)-按照如下的元数据内容创建数据框
raw_data_1={
'store_id':['a','b','c','d','e'],
'item_name':['book','rule','glue','tape','pen']
'sales':[100,10,30,4,60]}
raw_data_2={
'store_id':['d',e','f','g','h'],
'item_name':['stapler','notebook','pencil','eraser','sharpener']
'sales':[20,60,150,40,50]}
raw_data_3={
'store_id':['a','b','c','d',e','f','g','h','i','j'],
'score':[80,79,68,99,60,84,75,93,59,60] }
In [62]:
raw_data_1={
'store_id':['a','b','c','d','e'],
'item_name':['book','rule','glue','tape','pen'],
'sales':[100,10,30,4,60]}
raw_data_2={
'store_id':['d','e','f','g','h'],
'item_name':['stapler','notebook','pencil','eraser','sharpener'],
'sales':[20,60,150,40,50]}
raw_data_3={
'store_id':['a','b','c','d','e','f','g','h','i','j'],
'score':[80,79,68,99,60,84,75,93,59,60] }
In [63]:
raw_data_1
Out[63]:
{'store_id': ['a', 'b', 'c', 'd', 'e'],
'item_name': ['book', 'rule', 'glue', 'tape', 'pen'],
'sales': [100, 10, 30, 4, 60]}
3)-将上述的数据框分别命名为data1, data2, data3
In [64]:
data1 = pd.Dataframe(raw_data_1, columns = ['store_id', 'item_name', 'sales']) data2 = pd.Dataframe(raw_data_2, columns = ['store_id', 'item_name', 'sales']) data3 = pd.Dataframe(raw_data_3, columns = ['store_id','score'])
In [65]:
data1
Out[65]:
| store_id | item_name | sales | |
|---|---|---|---|
| 0 | a | book | 100 |
| 1 | b | rule | 10 |
| 2 | c | glue | 30 |
| 3 | d | tape | 4 |
| 4 | e | pen | 60 |
4)-将data1和data2两个数据框按照列的维度进行合并,命名为all_data
In [66]:
all_data=pd.concat([data1,data2]) all_data
Out[66]:
| store_id | item_name | sales | |
|---|---|---|---|
| 0 | a | book | 100 |
| 1 | b | rule | 10 |
| 2 | c | glue | 30 |
| 3 | d | tape | 4 |
| 4 | e | pen | 60 |
| 0 | d | stapler | 20 |
| 1 | e | notebook | 60 |
| 2 | f | pencil | 150 |
| 3 | g | eraser | 40 |
| 4 | h | sharpener | 50 |
5)-将data1和data2两个数据框按照列的维度进行合并,命名为all_data_col
In [67]:
all_data_col=pd.concat([data1,data2],axis=1) all_data_col
Out[67]:
| store_id | item_name | sales | store_id | item_name | sales | |
|---|---|---|---|---|---|---|
| 0 | a | book | 100 | d | stapler | 20 |
| 1 | b | rule | 10 | e | notebook | 60 |
| 2 | c | glue | 30 | f | pencil | 150 |
| 3 | d | tape | 4 | g | eraser | 40 |
| 4 | e | pen | 60 | h | sharpener | 50 |
6)-打印data3
In [68]:
data3
Out[68]:
| store_id | score | |
|---|---|---|
| 0 | a | 80 |
| 1 | b | 79 |
| 2 | c | 68 |
| 3 | d | 99 |
| 4 | e | 60 |
| 5 | f | 84 |
| 6 | g | 75 |
| 7 | h | 93 |
| 8 | i | 59 |
| 9 | j | 60 |
7)-按照store_id的值对all_data和data3作合并
In [69]:
pd.merge(all_data,data3,on='store_id')
Out[69]:
| store_id | item_name | sales | score | |
|---|---|---|---|---|
| 0 | a | book | 100 | 80 |
| 1 | b | rule | 10 | 79 |
| 2 | c | glue | 30 | 68 |
| 3 | d | tape | 4 | 99 |
| 4 | d | stapler | 20 | 99 |
| 5 | e | pen | 60 | 60 |
| 6 | e | notebook | 60 | 60 |
| 7 | f | pencil | 150 | 84 |
| 8 | g | eraser | 40 | 75 |
| 9 | h | sharpener | 50 | 93 |
8)-对‘data1’和‘data2’按照‘store_id’作连接
In [70]:
pd.merge(data1,data2,on='store_id',how='inner')
Out[70]:
| store_id | item_name_x | sales_x | item_name_y | sales_y | |
|---|---|---|---|---|---|
| 0 | d | tape | 4 | stapler | 20 |
| 1 | e | pen | 60 | notebook | 60 |
9)-找到 data1 和 data2 合并之后的所有匹配结果
In [71]:
pd.merge(data1, data2, on='store_id', how='outer')
Out[71]:
| store_id | item_name_x | sales_x | item_name_y | sales_y | |
|---|---|---|---|---|---|
| 0 | a | book | 100.0 | NaN | NaN |
| 1 | b | rule | 10.0 | NaN | NaN |
| 2 | c | glue | 30.0 | NaN | NaN |
| 3 | d | tape | 4.0 | stapler | 20.0 |
| 4 | e | pen | 60.0 | notebook | 60.0 |
| 5 | f | NaN | NaN | pencil | 150.0 |
| 6 | g | NaN | NaN | eraser | 40.0 |
| 7 | h | NaN | NaN | sharpener | 50.0 |
练习 6. 统计----气象数据(austin_weather.csv)
1-导入必要的库
In [72]:
import pandas as pd import datetime
2)- 从以下地址导入数据集
In [73]:
path6='data/data111257/austin_weather.csv'
In [74]:
df6=pd.read_csv('data/data111257/austin_weather.csv')
In [75]:
df6
Out[75]:
| Year | Month | Day | Location | TempHighF | TempLowF | DewPointHighF | DewPointAvgF | DewPointLowF | HumidityHighPercent | ... | SeaLevelPressureAvgInches | SeaLevelPressureLowInches | VisibilityHighMiles | VisibilityAvgMiles | VisibilityLowMiles | WindHighMPH | WindAvgMPH | WindGustMPH | PrecipitationSumInches | Events | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 12 | 21 | 2.0 | 74 | 45 | 67 | 49 | 43 | 93 | ... | 29.68 | 29.59 | 10 | 7 | 2 | 20 | 4 | 31 | 0.46 | Rain , Thunderstorm |
| 1 | 2013 | 12 | 22 | 7.0 | 56 | 39 | 43 | 36 | 28 | 93 | ... | 30.13 | 29.87 | 10 | 10 | 5 | 16 | 6 | 25 | 0 | |
| 2 | 2013 | 12 | 23 | 6.0 | 58 | 32 | 31 | 27 | 23 | 76 | ... | 30.49 | 30.41 | 10 | 10 | 10 | 8 | 3 | 12 | 0 | |
| 3 | 2013 | 12 | 24 | 7.0 | 61 | 31 | 36 | 28 | 21 | 89 | ... | 30.45 | 30.3 | 10 | 10 | 7 | 12 | 4 | 20 | 0 | |
| 4 | 2013 | 12 | 25 | 9.0 | 58 | 41 | 44 | 40 | 36 | 86 | ... | 30.33 | 30.27 | 10 | 10 | 7 | 10 | 2 | 16 | T | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1314 | 2017 | 7 | 27 | 2.0 | 103 | 75 | 71 | 67 | 61 | 82 | ... | 29.97 | 29.88 | 10 | 10 | 10 | 12 | 5 | 21 | 0 | |
| 1315 | 2017 | 7 | 28 | 8.0 | 105 | 76 | 71 | 64 | 55 | 87 | ... | 29.9 | 29.81 | 10 | 10 | 10 | 14 | 5 | 20 | 0 | |
| 1316 | 2017 | 7 | 29 | 9.0 | 107 | 77 | 72 | 64 | 55 | 82 | ... | 29.86 | 29.79 | 10 | 10 | 10 | 12 | 4 | 17 | 0 | |
| 1317 | 2017 | 7 | 30 | 4.0 | 106 | 79 | 70 | 68 | 63 | 69 | ... | 29.91 | 29.87 | 10 | 10 | 10 | 13 | 4 | 20 | 0 | |
| 1318 | 2017 | 7 | 31 | 9.0 | 99 | 77 | 66 | 61 | 54 | 64 | ... | 29.97 | 29.91 | 10 | 10 | 10 | 12 | 4 | 20 | 0 |
1319 rows × 23 columns
3)- 将数据作存储并且设置前三列(Year, Month, Day)为合适的索引Date
In [76]:
data6=pd.read_csv(path6,parse_dates=['Year','Month','Day']) data6.head()
Out[76]:
| Year | Month | Day | Location | TempHighF | TempLowF | DewPointHighF | DewPointAvgF | DewPointLowF | HumidityHighPercent | ... | SeaLevelPressureAvgInches | SeaLevelPressureLowInches | VisibilityHighMiles | VisibilityAvgMiles | VisibilityLowMiles | WindHighMPH | WindAvgMPH | WindGustMPH | PrecipitationSumInches | Events | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | 12 | 21 | 2.0 | 74 | 45 | 67 | 49 | 43 | 93 | ... | 29.68 | 29.59 | 10 | 7 | 2 | 20 | 4 | 31 | 0.46 | Rain , Thunderstorm |
| 1 | 2013-01-01 | 12 | 22 | 7.0 | 56 | 39 | 43 | 36 | 28 | 93 | ... | 30.13 | 29.87 | 10 | 10 | 5 | 16 | 6 | 25 | 0 | |
| 2 | 2013-01-01 | 12 | 23 | 6.0 | 58 | 32 | 31 | 27 | 23 | 76 | ... | 30.49 | 30.41 | 10 | 10 | 10 | 8 | 3 | 12 | 0 | |
| 3 | 2013-01-01 | 12 | 24 | 7.0 | 61 | 31 | 36 | 28 | 21 | 89 | ... | 30.45 | 30.3 | 10 | 10 | 7 | 12 | 4 | 20 | 0 | |
| 4 | 2013-01-01 | 12 | 25 | 9.0 | 58 | 41 | 44 | 40 | 36 | 86 | ... | 30.33 | 30.27 | 10 | 10 | 7 | 10 | 2 | 16 | T |
5 rows × 23 columns
4)- 2034、2065年?我们真的有这一年的数据?创建一个函数并用它去修复这个bug
In [77]:
def fix_century(x):
year = x.year - 100 if x.year > 1989 else x.year
return datetime.date(year, x.month, x.day)
# apply the function fix_century on the column and replace the values to the right ones
data6['Year'] = data6['Year'].apply(fix_century)
# data.info()
data6.head()
Out[77]:
| Year | Month | Day | Location | TempHighF | TempLowF | DewPointHighF | DewPointAvgF | DewPointLowF | HumidityHighPercent | ... | SeaLevelPressureAvgInches | SeaLevelPressureLowInches | VisibilityHighMiles | VisibilityAvgMiles | VisibilityLowMiles | WindHighMPH | WindAvgMPH | WindGustMPH | PrecipitationSumInches | Events | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1913-01-01 | 12 | 21 | 2.0 | 74 | 45 | 67 | 49 | 43 | 93 | ... | 29.68 | 29.59 | 10 | 7 | 2 | 20 | 4 | 31 | 0.46 | Rain , Thunderstorm |
| 1 | 1913-01-01 | 12 | 22 | 7.0 | 56 | 39 | 43 | 36 | 28 | 93 | ... | 30.13 | 29.87 | 10 | 10 | 5 | 16 | 6 | 25 | 0 | |
| 2 | 1913-01-01 | 12 | 23 | 6.0 | 58 | 32 | 31 | 27 | 23 | 76 | ... | 30.49 | 30.41 | 10 | 10 | 10 | 8 | 3 | 12 | 0 | |
| 3 | 1913-01-01 | 12 | 24 | 7.0 | 61 | 31 | 36 | 28 | 21 | 89 | ... | 30.45 | 30.3 | 10 | 10 | 7 | 12 | 4 | 20 | 0 | |
| 4 | 1913-01-01 | 12 | 25 | 9.0 | 58 | 41 | 44 | 40 | 36 | 86 | ... | 30.33 | 30.27 | 10 | 10 | 7 | 10 | 2 | 16 | T |
5 rows × 23 columns
5)- 将日期(Date)设为索引,注意数据类型,应该是datetime64
In [78]:
data6["Year"] = pd.to_datetime(data6["Year"])
In [79]:
data6 = data6.set_index('Year')
In [80]:
data6.head()
Out[80]:
| Month | Day | Location | TempHighF | TempLowF | DewPointHighF | DewPointAvgF | DewPointLowF | HumidityHighPercent | HumidityAvgPercent | ... | SeaLevelPressureAvgInches | SeaLevelPressureLowInches | VisibilityHighMiles | VisibilityAvgMiles | VisibilityLowMiles | WindHighMPH | WindAvgMPH | WindGustMPH | PrecipitationSumInches | Events | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Year | |||||||||||||||||||||
| 1913-01-01 | 12 | 21 | 2.0 | 74 | 45 | 67 | 49 | 43 | 93 | 75 | ... | 29.68 | 29.59 | 10 | 7 | 2 | 20 | 4 | 31 | 0.46 | Rain , Thunderstorm |
| 1913-01-01 | 12 | 22 | 7.0 | 56 | 39 | 43 | 36 | 28 | 93 | 68 | ... | 30.13 | 29.87 | 10 | 10 | 5 | 16 | 6 | 25 | 0 | |
| 1913-01-01 | 12 | 23 | 6.0 | 58 | 32 | 31 | 27 | 23 | 76 | 52 | ... | 30.49 | 30.41 | 10 | 10 | 10 | 8 | 3 | 12 | 0 | |
| 1913-01-01 | 12 | 24 | 7.0 | 61 | 31 | 36 | 28 | 21 | 89 | 56 | ... | 30.45 | 30.3 | 10 | 10 | 7 | 12 | 4 | 20 | 0 | |
| 1913-01-01 | 12 | 25 | 9.0 | 58 | 41 | 44 | 40 | 36 | 86 | 71 | ... | 30.33 | 30.27 | 10 | 10 | 7 | 10 | 2 | 16 | T |
5 rows × 22 columns
6)- 对应每一个DewPointAvgF,一共有多少数据值缺失
In [81]:
data6.isnull().sum()
Out[81]:
Month 0 Day 0 Location 12 TempHighF 0 TempLowF 0 DewPointHighF 0 DewPointAvgF 0 DewPointLowF 0 HumidityHighPercent 0 HumidityAvgPercent 0 HumidityLowPercent 0 SeaLevelPressureHighInches 0 SeaLevelPressureAvgInches 0 SeaLevelPressureLowInches 0 VisibilityHighMiles 0 VisibilityAvgMiles 0 VisibilityLowMiles 0 WindHighMPH 0 WindAvgMPH 0 WindGustMPH 0 PrecipitationSumInches 0 Events 0 dtype: int64
7)- 对应每一个DewPointAvgF,一共有多少完整的数据值
In [82]:
data6.shape[0] - data6.isnull().sum()
Out[82]:
Month 1319 Day 1319 Location 1307 TempHighF 1319 TempLowF 1319 DewPointHighF 1319 DewPointAvgF 1319 DewPointLowF 1319 HumidityHighPercent 1319 HumidityAvgPercent 1319 HumidityLowPercent 1319 SeaLevelPressureHighInches 1319 SeaLevelPressureAvgInches 1319 SeaLevelPressureLowInches 1319 VisibilityHighMiles 1319 VisibilityAvgMiles 1319 VisibilityLowMiles 1319 WindHighMPH 1319 WindAvgMPH 1319 WindGustMPH 1319 PrecipitationSumInches 1319 Events 1319 dtype: int64
8)- 对于全体数据,计算温度(TempF)的平均值
In [83]:
a68=data6.mean() a68
Out[83]:
Month inf Day inf Location 5.435348 TempHighF 80.862775 TempLowF 59.902957 dtype: float64
In [84]:
a68[3],a68[4]
Out[84]:
(80.86277482941622, 59.902956785443514)
In [85]:
avg_TempF=0.5*(a68[3]+a68[4]) avg_TempF
Out[85]:
70.38286580742987
9)- 创建一个名为loc_stats的数据框去计算并存储每个location多年湿度(Humidity)的最小值,最大值,平均值和标准差
In [86]:
loc_stats=pd.Dataframe() loc_stats['min']=data6.min() loc_stats['max']=data6.max() loc_stats['mean']=data6.mean() loc_stats['std']=data6.std()#standard deviations loc_stats
Out[86]:
| min | max | mean | std | |
|---|---|---|---|---|
| Month | 1 | 9 | inf | NaN |
| Day | 1 | 9 | inf | NaN |
| Location | 1 | 10 | 5.435348 | 2.830496 |
| TempHighF | 32 | 107 | 80.862775 | 14.766523 |
| TempLowF | 19 | 81 | 59.902957 | 14.190648 |
| DewPointHighF | - | 80 | NaN | NaN |
| DewPointAvgF | - | 8 | NaN | NaN |
| DewPointLowF | - | 9 | NaN | NaN |
| HumidityHighPercent | - | 97 | NaN | NaN |
| HumidityAvgPercent | - | 97 | NaN | NaN |
| HumidityLowPercent | - | 93 | NaN | NaN |
| SeaLevelPressureHighInches | - | 30.83 | NaN | NaN |
| SeaLevelPressureAvgInches | - | 30.74 | NaN | NaN |
| SeaLevelPressureLowInches | - | 30.61 | NaN | NaN |
| VisibilityHighMiles | - | 9 | NaN | NaN |
| VisibilityAvgMiles | - | 9 | NaN | NaN |
| VisibilityLowMiles | - | 9 | NaN | NaN |
| WindHighMPH | - | 9 | NaN | NaN |
| WindAvgMPH | - | 9 | NaN | NaN |
| WindGustMPH | - | 9 | NaN | NaN |
| PrecipitationSumInches | 0 | T | NaN | NaN |
| Events | Thunderstorm | NaN | NaN |
10)- 创建一个名为day_stats的数据框去计算并存储所有location中风速最小值,最大值,平均值和标准差
In [87]:
day_status=pd.Dataframe() day_status['min']=data6.min(axis=1) day_status['max']=data6.max(axis=1) day_status['mean']=data6.mean(axis=1) day_status['std']=data6.std(axis=1) day_status.head()
Out[87]:
| min | max | mean | std | |
|---|---|---|---|---|
| Year | ||||
| 1913-01-01 | 2.0 | 74.0 | 40.333333 | 36.226142 |
| 1913-01-01 | 7.0 | 56.0 | 34.000000 | 24.879711 |
| 1913-01-01 | 6.0 | 58.0 | 32.000000 | 26.000000 |
| 1913-01-01 | 7.0 | 61.0 | 33.000000 | 27.055499 |
| 1913-01-01 | 9.0 | 58.0 | 36.000000 | 24.879711 |
11)- 对于每一个location,计算多年的平均SeaLevelPressure值。
In [88]:
data6['SeaLevelPressureAvgInches'] = data6['SeaLevelPressureAvgInches'].apply(pd.to_numeric, errors='coerce').fillna(0.0) data6.groupby(['Location','Year'])['SeaLevelPressureAvgInches'].mean()
Out[88]:
Location Year
1.0 1914-01-01 30.021818
1915-01-01 29.987059
1916-01-01 29.999512
1917-01-01 29.922500
2.0 1913-01-01 29.680000
1914-01-01 29.975667
1915-01-01 30.020303
1916-01-01 30.048750
1917-01-01 29.988750
3.0 1913-01-01 30.360000
1914-01-01 30.083333
1915-01-01 29.390000
1916-01-01 30.029000
1917-01-01 30.025455
4.0 1914-01-01 30.079535
1915-01-01 30.023023
1916-01-01 30.073438
1917-01-01 29.999412
5.0 1914-01-01 30.004848
1915-01-01 30.023750
1916-01-01 30.031667
1917-01-01 29.986316
6.0 1913-01-01 30.253333
1914-01-01 30.022045
1915-01-01 30.032250
1916-01-01 30.021579
1917-01-01 30.021364
7.0 1913-01-01 30.290000
1914-01-01 30.024242
1915-01-01 28.356216
1916-01-01 30.016216
1917-01-01 30.004444
8.0 1914-01-01 30.035517
1915-01-01 30.041304
1916-01-01 29.996667
1917-01-01 29.954762
1934-01-01 29.870000
9.0 1913-01-01 30.365000
1914-01-01 30.003889
1915-01-01 30.050270
1916-01-01 30.074848
1917-01-01 29.948182
1965-01-01 30.060000
10.0 1914-01-01 30.036389
1915-01-01 30.063438
1916-01-01 30.012368
1917-01-01 29.898571
Name: SeaLevelPressureAvgInches, dtype: float64
练习 7. 可视化----某班成绩可视化分析(StudentsPerformance.csv) 1)-导入必要的库
In [89]:
import pandas as pd import numpy as np2)-从以下地址导入数据集
In [90]:
path7='data/data111257/StudentsPerformance.csv'
3)-将数据框命名为studentperformance
In [91]:
studentperformance=pd.read_csv('data/data111257/StudentsPerformance.csv')
4)-将studentID设置为索引
In [92]:
studentperformance=studentperformance.set_index('studentID',drop=True)
studentperformance.head()
Out[92]:
| gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
|---|---|---|---|---|---|---|---|---|
| studentID | ||||||||
| 1 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 2 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 3 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 4 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 5 | male | group C | some college | standard | none | 76 | 78 | 75 |
5)-绘制一个展示男女生比例的扇形图
In [93]:
#统计性别数量 aa=studentperformance.groupby(['gender'])['race/ethnicity'].count() aa
Out[93]:
gender female 518 male 482 Name: race/ethnicity, dtype: int64
In [94]:
#计算性别比例 female_rate=aa[0]/aa.sum() male_rate=aa[1]/aa.sum() female_rate,male_rate
Out[94]:
(0.518, 0.482)
In [128]:
#-*- coding: utf-8 -*-
from matplotlib import pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei'] #解决中文乱码
plt.figure(figsize=(6,9)) #调节图形大小
labels = [u'female_rate',u'male_rate',] #定义标签
sizes = [female_rate,male_rate] #每块值
colors = ['blue', 'red'] #每块颜色定义
explode = (0,0) #将某一块分割出来,值越大分割出的间隙越大
patches,text1,text2 = plt.pie(sizes,
explode=explode,
labels=labels,
colors=colors,
autopct = '%3.2f%%', #数值保留固定小数位
shadow = False, #无阴影设置
startangle =90, #逆时针起始角度设置
pctdistance = 0.2) #数值距圆心半径倍数距离
#patches饼图的返回值,texts1饼图外label的文本,texts2饼图内部的文本
# x,y轴刻度设置一致,保证饼图为圆形
plt.axis('equal')
plt.legend() # 右上角显示
plt.show()
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/font_manager.py:1331: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans (prop.get_family(), self.defaultFamily[fontext]))
方法2:
#绘制学生性别比例扇形图
Male = (studentperformance.gender == 'male').sum()
Female = (studentperformance.gender == 'female').sum()
proportions = [Male,Female]
plt.pie(proportions, labels=['Male','Female'],shadow=True,
autopct='%1.1f%%',startangle=90,explode=(0.15,0))
plt.axis('equal')
plt.title('Sex Proportion')
plt.tight_layout()
plt.show()
6)-绘制一个展示math score成绩, 与学生性别gender和父母受教育程度parental level of education的散点图
散点图
fig = plt.figure() ax=fig.add_subplot(1,1,1) ax.scatter(df['Chinese'],df['Math']) plt.xlabel('Chinese') plt.ylabel('Math') plt.title('语文和数学成绩散点图') plt.show()
In [96]:
#散点图 修改 fig = plt.figure()
In [97]:
import seaborn as sns lm = sns.lmplot(x='gender',y='math score',data=studentperformance,hue='parental level of education',fit_reg=False) lm.set(title='Math Score x Age') axes = lm.axes axes[0,0].set_ylim(0,100) axes[0,0].set_xlim(-5,5)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/font_manager.py:1331: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans (prop.get_family(), self.defaultFamily[fontext])) /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/font_manager.py:1331: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans (prop.get_family(), self.defaultFamily[fontext]))
Out[97]:
(-5, 5)
7)-绘制一个展示reading score成绩的直方图
In [98]:
#直方图
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.hist(studentperformance['reading score'],bins=30)
plt.title('reading score')
plt.xlabel('score')
plt.ylabel('Total number')
plt.show()
练习 8.创建数据框
In [ ]:
1)-导入必要的库
In [99]:
import pandas as pd
2)-创建一个数据字典
raw_data = {
"Alcoholic": [0,0,0,0,0.0976],
"Country": ['United Kingdom','China','Algeria','Bangladesh','Korea South'],
"Eggs": [1.0778,2.3066,1.2171,1.2664,1.3524],
"Fish": [0.5176,0.9817,0.2008,2.1646,1.9147],
"Spices": [0.1276,0.0676,0.1568,1.0602,0.0372]
}
In [100]:
raw_data = {
"Alcoholic": [0,0,0,0,0.0976],
"Country": ['United Kingdom','China','Algeria','Bangladesh','Korea South'],
"Eggs": [1.0778,2.3066,1.2171,1.2664,1.3524],
"Fish": [0.5176,0.9817,0.2008,2.1646,1.9147],
"Spices": [0.1276,0.0676,0.1568,1.0602,0.0372]
}
3)-将数据字典存为一个名叫diet的数据框中
In [101]:
diet = pd.Dataframe(raw_data) diet.head()
Out[101]:
| Alcoholic | Country | Eggs | Fish | Spices | |
|---|---|---|---|---|---|
| 0 | 0.0000 | United Kingdom | 1.0778 | 0.5176 | 0.1276 |
| 1 | 0.0000 | China | 2.3066 | 0.9817 | 0.0676 |
| 2 | 0.0000 | Algeria | 1.2171 | 0.2008 | 0.1568 |
| 3 | 0.0000 | Bangladesh | 1.2664 | 2.1646 | 1.0602 |
| 4 | 0.0976 | Korea South | 1.3524 | 1.9147 | 0.0372 |
4)-数据框的列排序是字母顺序,请重新修改为Country, Spices, Fish, Alcoholic, Eggs这个顺序
In [102]:
diet=diet[['Country','Spices','Fish','Alcoholic','Eggs']]
5)-添加一个列Health
diet['Health']=['yes','yes','no','yes','no']
In [103]:
diet['Health']=['yes','yes','no','yes','no']
6)-查看每个列的数据类型
In [104]:
diet.dtypes
Out[104]:
Country object Spices float64 Fish float64 Alcoholic float64 Eggs float64 Health object dtype: object
练习 9.时间序列(hyundi.csv) 1)-导入必要的库
In [105]:
import pandas as pd import numpy as np #visualization import matplotlib.pyplot as plt %matplotlib inline2)-从以下地址导入数据集
In [106]:
path9='data/data111257/hyundi.csv'
3)-读取数据并存为一个名叫hyundai的数据框
In [107]:
hyundai = pd.read_csv(path9) hyundai.head()
Out[107]:
| model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize | record | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I20 | 2017 | 7999 | Manual | 17307 | Petrol | 145 | 58.9 | 1.2 | 2021/6/20 |
| 1 | Tucson | 2016 | 14499 | Automatic | 25233 | Diesel | 235 | 43.5 | 2.0 | 2016/3/5 |
| 2 | Tucson | 2016 | 11399 | Manual | 37877 | Diesel | 30 | 61.7 | 1.7 | 2016/6/18 |
| 3 | I10 | 2016 | 6499 | Manual | 23789 | Petrol | 20 | 60.1 | 1.0 | 2019/1/2 |
| 4 | IX35 | 2015 | 10199 | Manual | 33177 | Diesel | 160 | 51.4 | 2.0 | 2020/1/7 |
4)-查看每一列的数据类型
In [108]:
hyundai.dtypes
Out[108]:
model object year int64 price int64 transmission object mileage int64 fuelType object tax int64 mpg float64 engineSize float64 record object dtype: object
5)-将record这个列转换为datetime类型
In [109]:
hyundai.record = pd.to_datetime(hyundai.record) hyundai['record'].head()
Out[109]:
0 2021-06-20 1 2016-03-05 2 2016-06-18 3 2019-01-02 4 2020-01-07 Name: record, dtype: datetime64[ns]
6)-将record设置为索引
In [110]:
hyundai=hyundai.set_index('record')
hyundai.head()
Out[110]:
| model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize | |
|---|---|---|---|---|---|---|---|---|---|
| record | |||||||||
| 2021-06-20 | I20 | 2017 | 7999 | Manual | 17307 | Petrol | 145 | 58.9 | 1.2 |
| 2016-03-05 | Tucson | 2016 | 14499 | Automatic | 25233 | Diesel | 235 | 43.5 | 2.0 |
| 2016-06-18 | Tucson | 2016 | 11399 | Manual | 37877 | Diesel | 30 | 61.7 | 1.7 |
| 2019-01-02 | I10 | 2016 | 6499 | Manual | 23789 | Petrol | 20 | 60.1 | 1.0 |
| 2020-01-07 | IX35 | 2015 | 10199 | Manual | 33177 | Diesel | 160 | 51.4 | 2.0 |
7)-有重复的日期吗?
In [111]:
hyundai.index.is_unique
Out[111]:
False
8)-将index设置为升序
In [112]:
hyundai.sort_index(ascending = True).head()
Out[112]:
| model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize | |
|---|---|---|---|---|---|---|---|---|---|
| record | |||||||||
| 2016-03-05 | Tucson | 2016 | 14499 | Automatic | 25233 | Diesel | 235 | 43.5 | 2.0 |
| 2016-06-18 | Tucson | 2016 | 11399 | Manual | 37877 | Diesel | 30 | 61.7 | 1.7 |
| 2018-03-04 | I10 | 2016 | 7500 | Manual | 6000 | Petrol | 20 | 57.6 | 1.2 |
| 2018-03-17 | I10 | 2018 | 7100 | Manual | 35000 | Petrol | 145 | 60.1 | 1.0 |
| 2018-05-09 | I40 | 2015 | 7830 | Manual | 59508 | Diesel | 30 | 65.7 | 1.7 |
9)-找到每个月的最后一个记录日(record)
In [113]:
hyundai_month = hyundai.resample('BM').sum()
hyundai_month.head()
Out[113]:
| year | price | mileage | tax | mpg | engineSize | |
|---|---|---|---|---|---|---|
| record | ||||||
| 2016-03-31 | 2016 | 14499 | 25233 | 235 | 43.5 | 2.0 |
| 2016-04-29 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 2016-05-31 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 2016-06-30 | 2016 | 11399 | 37877 | 30 | 61.7 | 1.7 |
| 2016-07-29 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
10)-数据集中最早的记录日期(record)和最晚的记录日期相差多少天?
In [114]:
(hyundai.index.max() - hyundai.index.min()).days
Out[114]:
1951
11)-在record数据中一共有多少个月?
In [115]:
hyundai_month = hyundai.resample('BM').mean()
len(hyundai_month.index)
Out[115]:
65
12)-按照record时间顺序可视化mpg值
In [116]:
hyundai_open=hyundai['mpg'].plot(title='hyundai record') fig=hyundai_open.get_figure() fig.set_size_inches(13.5,9)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/font_manager.py:1331: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans (prop.get_family(), self.defaultFamily[fontext]))
练习 10.删除数据(drug200.csv) 1)-导入必要的库
In [117]:
import pandas as pd import numpy as np2)-从以下地址导入数据集
In [118]:
path10='data/data111257/drug200.csv'
3)-将数据集存成变量drug
In [119]:
drug=pd.read_csv(path10) drug.head()
Out[119]:
| 23 | F | HIGH | HIGH.1 | 25.355 | DrugY | |
|---|---|---|---|---|---|---|
| 0 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 1 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 2 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 3 | 61 | F | LOW | HIGH | 18.043 | DrugY |
| 4 | 22 | F | NORMAL | HIGH | 8.607 | drugX |
4)-创建数据框的列名称
Age
Sex
Blood Pressure Levels (BP)
Cholesterol Levels
Na to Potassium Ration(Na_to_K)
Drug Class
In [120]:
drug=pd.read_csv(path10,names=['Age','Sex','Blood_Pressure_Levels (BP)','Cholesterol_Levels','Na_to_Potassium_Ration(Na_to_K)','Drug_Class']) drug.head()
Out[120]:
| Age | Sex | Blood_Pressure_Levels (BP) | Cholesterol_Levels | Na_to_Potassium_Ration(Na_to_K) | Drug_Class | |
|---|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | 25.355 | DrugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | DrugY |
5)-数据框中有缺失值吗?
In [121]:
pd.isnull(drug).sum()
Out[121]:
Age 0 Sex 0 Blood_Pressure_Levels (BP) 0 Cholesterol_Levels 0 Na_to_Potassium_Ration(Na_to_K) 0 Drug_Class 0 dtype: int64
6)-将列Cholesterol的第10到19行设置为缺失值
In [122]:
drug.iloc[10:20,3:4]=np.nan drug.head(20)
Out[122]:
| Age | Sex | Blood_Pressure_Levels (BP) | Cholesterol_Levels | Na_to_Potassium_Ration(Na_to_K) | Drug_Class | |
|---|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | 25.355 | DrugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | DrugY |
| 5 | 22 | F | NORMAL | HIGH | 8.607 | drugX |
| 6 | 49 | F | NORMAL | HIGH | 16.275 | DrugY |
| 7 | 41 | M | LOW | HIGH | 11.037 | drugC |
| 8 | 60 | M | NORMAL | HIGH | 15.171 | DrugY |
| 9 | 43 | M | LOW | NORMAL | 19.368 | DrugY |
| 10 | 47 | F | LOW | NaN | 11.767 | drugC |
| 11 | 34 | F | HIGH | NaN | 19.199 | DrugY |
| 12 | 43 | M | LOW | NaN | 15.376 | DrugY |
| 13 | 74 | F | LOW | NaN | 20.942 | DrugY |
| 14 | 50 | F | NORMAL | NaN | 12.703 | drugX |
| 15 | 16 | F | HIGH | NaN | 15.516 | DrugY |
| 16 | 69 | M | LOW | NaN | 11.455 | drugX |
| 17 | 43 | M | HIGH | NaN | 13.972 | drugA |
| 18 | 23 | M | LOW | NaN | 7.298 | drugC |
| 19 | 32 | F | HIGH | NaN | 25.974 | DrugY |
7)-将缺失值全部替换为1.0
In [123]:
drug.Cholesterol_Levels.fillna(1,inplace=True) drug
Out[123]:
| Age | Sex | Blood_Pressure_Levels (BP) | Cholesterol_Levels | Na_to_Potassium_Ration(Na_to_K) | Drug_Class | |
|---|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | 25.355 | DrugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | DrugY |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | 56 | F | LOW | HIGH | 11.567 | drugC |
| 196 | 16 | M | LOW | HIGH | 12.006 | drugC |
| 197 | 52 | M | NORMAL | HIGH | 9.894 | drugX |
| 198 | 23 | M | NORMAL | NORMAL | 14.020 | drugX |
| 199 | 40 | F | LOW | NORMAL | 11.349 | drugX |
200 rows × 6 columns
8)-删除列Cholesterol Levels,并打印删除后的数据框架
In [124]:
del drug['Cholesterol_Levels'] drug.head()
Out[124]:
| Age | Sex | Blood_Pressure_Levels (BP) | Na_to_Potassium_Ration(Na_to_K) | Drug_Class | |
|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | 25.355 | DrugY |
| 1 | 47 | M | LOW | 13.093 | drugC |
| 2 | 47 | M | LOW | 10.114 | drugC |
| 3 | 28 | F | NORMAL | 7.798 | drugX |
| 4 | 61 | F | LOW | 18.043 | DrugY |
9)-将数据框前三行设置为缺失值
In [125]:
drug.iloc[0:3,:]=np.nan drug.head()
Out[125]:
| Age | Sex | Blood_Pressure_Levels (BP) | Na_to_Potassium_Ration(Na_to_K) | Drug_Class | |
|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN |
| 3 | 28.0 | F | NORMAL | 7.798 | drugX |
| 4 | 61.0 | F | LOW | 18.043 | DrugY |
10)-删除有缺失值的行
In [126]:
drug=drug.dropna(how='any') drug.head()
Out[126]:
| Age | Sex | Blood_Pressure_Levels (BP) | Na_to_Potassium_Ration(Na_to_K) | Drug_Class | |
|---|---|---|---|---|---|
| 3 | 28.0 | F | NORMAL | 7.798 | drugX |
| 4 | 61.0 | F | LOW | 18.043 | DrugY |
| 5 | 22.0 | F | NORMAL | 8.607 | drugX |
| 6 | 49.0 | F | NORMAL | 16.275 | DrugY |
| 7 | 41.0 | M | LOW | 11.037 | drugC |
11)-重新设置索引
In [127]:
drug=drug.reset_index(drop=True) drug.head()
Out[127]:
| Age | Sex | Blood_Pressure_Levels (BP) | Na_to_Potassium_Ration(Na_to_K) | Drug_Class | |
|---|---|---|---|---|---|
| 0 | 28.0 | F | NORMAL | 7.798 | drugX |
| 1 | 61.0 | F | LOW | 18.043 | DrugY |
| 2 | 22.0 | F | NORMAL | 8.607 | drugX |
| 3 | 49.0 | F | NORMAL | 16.275 | DrugY |
| 4 | 41.0 | M | LOW | 11.037 | drugC |
分组求和
http://www.crazyant.net/2594.html



