pandas入门十大例题

训练集数据资源地址：https://download.csdn.net/download/qq_41166909/30414605
本项目在AIStudio的地址https://aistudio.baidu.com/aistudio/projectdetail/2446775?contributionType=1&shared=1https://aistudio.baidu.com/aistudio/projectdetail/2446775?contributionType=1&shared=1

练习 1. 了解数据--某段时间内股票销售情况(stock)。按照每题的要求，输入正确的代码并运行。

1 )- 导入正确的库——请导入Pandas。

In [12]:

import pandas as pd

2 )- 从指定地址导入数据，并存入一个名为shopper的框架中。（data.csv）

In [13]:

shopper=pd.read_csv('data/data111257/data.csv')

3 )- 查看前15行内容。

In [14]:

shopper.head(15)

Out[14]:

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	12/1/2010 8:26	2.55	17850.0	United Kingdom
1	536365	71053	WHITE metaL LANTERN	6	12/1/2010 8:26	3.39	17850.0	United Kingdom
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	12/1/2010 8:26	2.75	17850.0	United Kingdom
3	536365	84029G	KNITTED UNIOn FLAG HOT WATER BOTTLE	6	12/1/2010 8:26	3.39	17850.0	United Kingdom
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	12/1/2010 8:26	3.39	17850.0	United Kingdom
5	536365	22752	SET 7 BABUSHKA NESTING BOXES	2	12/1/2010 8:26	7.65	17850.0	United Kingdom
6	536365	21730	GLASS STAR FROSTED T-LIGHT HOLDER	6	12/1/2010 8:26	4.25	17850.0	United Kingdom
7	536366	22633	HAND WARMER UNIOn JACK	6	12/1/2010 8:28	1.85	17850.0	United Kingdom
8	536366	22632	HAND WARMER RED POLKA DOT	6	12/1/2010 8:28	1.85	17850.0	United Kingdom
9	536367	84879	ASSORTED COLOUR BIRD ORNAMENT	32	12/1/2010 8:34	1.69	13047.0	United Kingdom
10	536367	22745	POPPY'S PLAYHOUSE BEDROOM	6	12/1/2010 8:34	2.10	13047.0	United Kingdom
11	536367	22748	POPPY'S PLAYHOUSE KITCHEN	6	12/1/2010 8:34	2.10	13047.0	United Kingdom
12	536367	22749	FELTCRAFT PRINCESS CHARLOTTE DOLL	8	12/1/2010 8:34	3.75	13047.0	United Kingdom
13	536367	22310	IVORY KNITTED MUG COSY	6	12/1/2010 8:34	1.65	13047.0	United Kingdom
14	536367	84969	BOX OF 6 ASSORTED COLOUR TEASPOONS	6	12/1/2010 8:34	4.25	13047.0	United Kingdom

4 )- 数据集中有多少列(columns)

In [15]:

number_of_rows = shopper.shape[1]
number_of_rows

Out[15]:

5 )- 打印全部列名称。

In [16]:

shopper.columns

Out[16]:

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

6 )- 数据集的索引是怎样的。

In [17]:

shopper.index

Out[17]:

RangeIndex(start=0, stop=36997, step=1)

7 )- 被下单最多StockCode是什么？

In [18]:

a=shopper[['StockCode','Quantity']].groupby(by='StockCode').agg('sum') #根据股票编码分组
a=a.sort_values(['Quantity'],ascending=False)
a.head(1)

Out[18]:

	Quantity
StockCode
84077	4948

8 )- 在StockCode这一列中，一共有多少只股票被下单？

In [19]:

pd.unique(shopper['StockCode']).size

Out[19]:

9 )- 在Description列中，下单最多的是哪种股票？

In [20]:

shopper['StockCode'].value_counts().head(1)

Out[20]:

85123A    209
Name: StockCode, dtype: int64

10 )- 一共有多少股票被下单？

In [21]:

shopper['Quantity'].sum()

Out[21]:

11 )- 每一张单据(InvoiceNo)对应一笔交易，总共有多少笔交易？

In [22]:

shopper['InvoiceNo'].unique().size

Out[22]:

12 ).每一笔对应的平均总价是多少？

In [23]:

#求总收入
shopper['sub_total'] = round(shopper['UnitPrice'] * shopper['Quantity'],2)
shopper['sub_total'].sum()

Out[23]:

648004.1099999999

In [24]:

#求每一笔订单的平均总价是多少
shopper[['InvoiceNo','sub_total']].groupby(by=['InvoiceNo']
).agg({'sub_total':'sum'})['sub_total'].mean()

Out[24]:

365.8972953133823

练习 2. 数据过滤与排序——NBA比赛情况统计(NBA_stats.xlsx)。按照每步的要求，输入正确的代码并运行。

1 ) - 导入必要的库

In [25]:

import pandas as pd

2 ) - 从以下地址导入数据集(NBA_stats.xlsx)

In [26]:

NBA2019=pd.read_excel('data/data111257/NBA_stats.xlsx')

3 )- 将数据集命名为NBA2019

In [27]:

NBA2019=pd.read_excel('data/data111257/NBA_stats.xlsx')
NBA2019

Out[27]:

	排名	球队	投篮命中率	投篮命中	投篮出手	三分命中率	三分命中	三分出手	罚球命中率	罚球命中	罚球出手	总篮板	进攻	防守	助攻	失误	抢断	盖帽	犯规	得分
0	1	雄鹿	0.487	44.7	91.8	0.389	14.4	37.1	0.760	16.2	21.4	48.1	10.3	37.8	25.5	13.82	8.13	4.64	17.28	120.1
1	2	篮网	0.494	43.1	87.3	0.392	14.2	36.1	0.804	18.1	22.5	44.4	8.9	35.5	26.8	13.54	6.72	5.26	19.04	118.6
2	3	奇才	0.475	43.2	90.9	0.351	10.2	29.0	0.769	20.1	26.2	45.2	9.7	35.5	25.5	14.40	7.33	4.13	21.60	116.6
3	4	爵士	0.468	41.3	88.1	0.389	16.7	43.0	0.799	17.2	21.5	48.3	10.6	37.6	23.7	14.21	6.58	5.15	18.54	116.4
4	5	开拓者	0.453	41.3	91.1	0.385	15.7	40.8	0.823	17.8	21.6	44.5	10.6	33.9	21.3	11.10	6.89	5.04	18.92	116.1
5	6	太阳	0.490	43.3	88.3	0.378	13.1	34.6	0.834	15.6	18.7	42.9	8.8	34.2	26.9	12.53	7.18	4.33	19.08	115.3
6	7	步行者	0.474	43.3	91.2	0.364	12.3	34.0	0.792	16.4	20.7	42.7	9.0	33.7	27.4	13.54	8.49	6.39	20.18	115.3
7	8	掘金	0.485	43.2	89.2	0.377	12.9	34.2	0.803	15.7	19.5	44.4	10.5	33.9	26.8	13.50	8.08	4.49	19.08	115.1
8	9	鹈鹕	0.477	42.5	89.1	0.348	10.6	30.4	0.729	19.0	26.1	47.4	11.7	35.7	26.0	14.61	7.57	4.38	17.99	114.6
9	10	快船	0.482	41.8	86.7	0.411	14.3	34.7	0.839	16.2	19.3	44.2	9.4	34.7	24.4	13.19	7.07	4.10	19.21	114.0
10	11	勇士	0.468	41.3	88.2	0.376	14.6	38.7	0.785	16.6	21.1	43.0	8.0	35.1	27.7	15.00	8.15	4.75	21.19	113.7
11	12	老鹰	0.468	40.8	87.2	0.373	12.4	33.4	0.812	19.7	24.2	45.6	10.6	35.1	24.1	13.24	6.99	4.75	19.33	113.7
12	13	国王	0.481	42.6	88.6	0.364	12.1	33.3	0.745	16.4	22.0	41.4	9.4	32.0	25.5	13.38	7.54	4.97	19.44	113.7
13	14	76人	0.476	41.4	86.9	0.374	11.3	30.1	0.767	19.6	25.5	45.1	10.0	35.0	23.7	14.44	9.10	6.21	20.22	113.6
14	15	灰熊	0.467	42.8	91.8	0.356	11.2	31.4	0.771	16.4	21.3	46.5	11.2	35.3	26.9	13.29	9.10	5.06	18.74	113.3
15	16	凯尔特人	0.466	41.5	88.9	0.374	13.6	36.4	0.775	16.1	20.8	44.3	10.6	33.6	23.5	14.06	7.72	5.32	20.43	112.6
16	17	独行侠	0.470	41.1	87.3	0.362	13.8	38.1	0.778	16.5	21.2	43.3	9.1	34.2	22.9	12.07	6.25	4.32	19.39	112.4
17	18	森林狼	0.448	40.7	90.9	0.349	13.1	37.6	0.761	17.6	23.1	43.5	10.5	33.0	25.6	14.26	8.78	5.53	20.93	112.1
18	19	猛龙	0.448	39.7	88.7	0.368	14.5	39.3	0.815	17.4	21.3	41.6	9.4	32.1	24.1	13.22	8.58	5.40	21.19	111.3
19	20	马刺	0.462	41.9	90.5	0.350	9.9	28.4	0.792	17.4	22.0	43.9	9.3	34.6	24.4	11.40	7.01	5.08	17.96	111.1
20	21	公牛	0.476	42.2	88.6	0.370	12.6	34.0	0.791	13.8	17.5	45.0	9.6	35.3	26.8	15.13	6.69	4.22	18.92	110.7
21	22	湖人	0.472	40.6	86.1	0.354	11.1	31.2	0.739	17.2	23.3	44.2	9.7	34.6	24.7	15.21	7.81	5.36	19.13	109.5
22	23	黄蜂	0.455	39.9	87.8	0.369	13.7	37.0	0.761	15.9	20.9	43.8	10.6	33.2	26.8	14.85	7.85	4.78	18.03	109.5
23	24	火箭	0.444	39.2	88.5	0.339	13.8	40.6	0.740	16.5	22.3	42.6	9.3	33.3	23.6	14.72	7.58	5.01	19.54	108.8
24	25	热火	0.468	39.2	83.7	0.358	12.9	36.2	0.790	16.7	21.1	41.5	8.0	33.5	26.3	14.07	7.90	3.97	18.93	108.1
25	26	尼克斯	0.456	39.4	86.5	0.392	11.8	30.0	0.784	16.4	20.9	45.1	9.7	35.5	21.4	12.94	7.04	5.07	20.46	107.0
26	27	活塞	0.452	38.7	85.6	0.351	11.6	32.9	0.759	17.8	23.4	42.7	9.6	33.1	24.2	14.93	7.38	5.15	20.51	106.6
27	28	雷霆	0.441	38.8	88.0	0.339	11.9	35.1	0.725	15.5	21.3	45.6	9.9	35.7	22.1	16.14	7.00	4.39	18.13	105.0
28	29	魔术	0.429	38.2	89.2	0.343	10.9	31.8	0.775	16.6	21.4	45.4	10.4	35.1	21.8	12.83	6.89	4.42	17.18	104.0
29	30	骑士	0.450	38.6	85.8	0.336	10.0	29.7	0.743	16.7	22.4	42.8	10.4	32.3	23.8	15.47	7.76	4.51	18.17	103.8

4 )-只选取 “得分” 这一列

In [28]:

NBA2019['得分']

Out[28]:

0     120.1
1     118.6
2     116.6
3     116.4
4     116.1
5     115.3
6     115.3
7     115.1
8     114.6
9     114.0
10    113.7
11    113.7
12    113.7
13    113.6
14    113.3
15    112.6
16    112.4
17    112.1
18    111.3
19    111.1
20    110.7
21    109.5
22    109.5
23    108.8
24    108.1
25    107.0
26    106.6
27    105.0
28    104.0
29    103.8
Name: 得分, dtype: float64

5 )- 有多少球队参与了比赛？

In [29]:

NBA2019['球队'].nunique

Out[29]:

6 )-该数据集中一共有多少列(columns)?

In [30]:

NBA2019.shape[1]

Out[30]:

7 )- 将数据集中的列“投篮命中率”、“投篮命中”和“投篮出手”单独存为一个名叫“投篮”的数据框

In [31]:

shots=NBA2019[['投篮命中率','投篮命中','投篮出手']]
shots

Out[31]:

	投篮命中率	投篮命中	投篮出手
0	0.487	44.7	91.8
1	0.494	43.1	87.3
2	0.475	43.2	90.9
3	0.468	41.3	88.1
4	0.453	41.3	91.1
5	0.490	43.3	88.3
6	0.474	43.3	91.2
7	0.485	43.2	89.2
8	0.477	42.5	89.1
9	0.482	41.8	86.7
10	0.468	41.3	88.2
11	0.468	40.8	87.2
12	0.481	42.6	88.6
13	0.476	41.4	86.9
14	0.467	42.8	91.8
15	0.466	41.5	88.9
16	0.470	41.1	87.3
17	0.448	40.7	90.9
18	0.448	39.7	88.7
19	0.462	41.9	90.5
20	0.476	42.2	88.6
21	0.472	40.6	86.1
22	0.455	39.9	87.8
23	0.444	39.2	88.5
24	0.468	39.2	83.7
25	0.456	39.4	86.5
26	0.452	38.7	85.6
27	0.441	38.8	88.0
28	0.429	38.2	89.2
29	0.450	38.6	85.8

In [32]:

NBA2019.columns

Out[32]:

Index(['排名', '球队', '投篮命中率', '投篮命中', '投篮出手', '三分命中率', '三分命中', '三分出手', '罚球命中率',
       '罚球命中', '罚球出手', '总篮板', '进攻', '防守', '助攻', '失误', '抢断', '盖帽', '犯规', '得分'],
      dtype='object')

8 ) - 对数据框“投篮”按照先“投篮命中率”再“投篮出手”进行排序

In [33]:

shots.sort_values(['投篮命中率','投篮出手'],ascending=[False,False])

Out[33]:

	投篮命中率	投篮命中	投篮出手
1	0.494	43.1	87.3
5	0.490	43.3	88.3
0	0.487	44.7	91.8
7	0.485	43.2	89.2
9	0.482	41.8	86.7
12	0.481	42.6	88.6
8	0.477	42.5	89.1
20	0.476	42.2	88.6
13	0.476	41.4	86.9
2	0.475	43.2	90.9
6	0.474	43.3	91.2
21	0.472	40.6	86.1
16	0.470	41.1	87.3
10	0.468	41.3	88.2
3	0.468	41.3	88.1
11	0.468	40.8	87.2
24	0.468	39.2	83.7
14	0.467	42.8	91.8
15	0.466	41.5	88.9
19	0.462	41.9	90.5
25	0.456	39.4	86.5
22	0.455	39.9	87.8
4	0.453	41.3	91.1
26	0.452	38.7	85.6
29	0.450	38.6	85.8
17	0.448	40.7	90.9
18	0.448	39.7	88.7
23	0.444	39.2	88.5
27	0.441	38.8	88.0
28	0.429	38.2	89.2

9 )- 计算每个球队“进攻”的平均值

In [34]:

discipline=NBA2019[['进攻']]
discipline

Out[34]:

	进攻
0	10.3
1	8.9
2	9.7
3	10.6
4	10.6
5	8.8
6	9.0
7	10.5
8	11.7
9	9.4
10	8.0
11	10.6
12	9.4
13	10.0
14	11.2
15	10.6
16	9.1
17	10.5
18	9.4
19	9.3
20	9.6
21	9.7
22	10.6
23	9.3
24	8.0
25	9.7
26	9.6
27	9.9
28	10.4
29	10.4

In [35]:

round(discipline['进攻'].mean())

Out[35]:

10.0

10 )- 找到“罚球命中”小于16.3（不含）的球队数据

In [36]:

NBA2019[NBA2019.罚球命中>16.3]

Out[36]:

	排名	球队	投篮命中率	投篮命中	投篮出手	三分命中率	三分命中	三分出手	罚球命中率	罚球命中	罚球出手	总篮板	进攻	防守	助攻	失误	抢断	盖帽	犯规	得分
1	2	篮网	0.494	43.1	87.3	0.392	14.2	36.1	0.804	18.1	22.5	44.4	8.9	35.5	26.8	13.54	6.72	5.26	19.04	118.6
2	3	奇才	0.475	43.2	90.9	0.351	10.2	29.0	0.769	20.1	26.2	45.2	9.7	35.5	25.5	14.40	7.33	4.13	21.60	116.6
3	4	爵士	0.468	41.3	88.1	0.389	16.7	43.0	0.799	17.2	21.5	48.3	10.6	37.6	23.7	14.21	6.58	5.15	18.54	116.4
4	5	开拓者	0.453	41.3	91.1	0.385	15.7	40.8	0.823	17.8	21.6	44.5	10.6	33.9	21.3	11.10	6.89	5.04	18.92	116.1
6	7	步行者	0.474	43.3	91.2	0.364	12.3	34.0	0.792	16.4	20.7	42.7	9.0	33.7	27.4	13.54	8.49	6.39	20.18	115.3
8	9	鹈鹕	0.477	42.5	89.1	0.348	10.6	30.4	0.729	19.0	26.1	47.4	11.7	35.7	26.0	14.61	7.57	4.38	17.99	114.6
10	11	勇士	0.468	41.3	88.2	0.376	14.6	38.7	0.785	16.6	21.1	43.0	8.0	35.1	27.7	15.00	8.15	4.75	21.19	113.7
11	12	老鹰	0.468	40.8	87.2	0.373	12.4	33.4	0.812	19.7	24.2	45.6	10.6	35.1	24.1	13.24	6.99	4.75	19.33	113.7
12	13	国王	0.481	42.6	88.6	0.364	12.1	33.3	0.745	16.4	22.0	41.4	9.4	32.0	25.5	13.38	7.54	4.97	19.44	113.7
13	14	76人	0.476	41.4	86.9	0.374	11.3	30.1	0.767	19.6	25.5	45.1	10.0	35.0	23.7	14.44	9.10	6.21	20.22	113.6
14	15	灰熊	0.467	42.8	91.8	0.356	11.2	31.4	0.771	16.4	21.3	46.5	11.2	35.3	26.9	13.29	9.10	5.06	18.74	113.3
16	17	独行侠	0.470	41.1	87.3	0.362	13.8	38.1	0.778	16.5	21.2	43.3	9.1	34.2	22.9	12.07	6.25	4.32	19.39	112.4
17	18	森林狼	0.448	40.7	90.9	0.349	13.1	37.6	0.761	17.6	23.1	43.5	10.5	33.0	25.6	14.26	8.78	5.53	20.93	112.1
18	19	猛龙	0.448	39.7	88.7	0.368	14.5	39.3	0.815	17.4	21.3	41.6	9.4	32.1	24.1	13.22	8.58	5.40	21.19	111.3
19	20	马刺	0.462	41.9	90.5	0.350	9.9	28.4	0.792	17.4	22.0	43.9	9.3	34.6	24.4	11.40	7.01	5.08	17.96	111.1
21	22	湖人	0.472	40.6	86.1	0.354	11.1	31.2	0.739	17.2	23.3	44.2	9.7	34.6	24.7	15.21	7.81	5.36	19.13	109.5
23	24	火箭	0.444	39.2	88.5	0.339	13.8	40.6	0.740	16.5	22.3	42.6	9.3	33.3	23.6	14.72	7.58	5.01	19.54	108.8
24	25	热火	0.468	39.2	83.7	0.358	12.9	36.2	0.790	16.7	21.1	41.5	8.0	33.5	26.3	14.07	7.90	3.97	18.93	108.1
25	26	尼克斯	0.456	39.4	86.5	0.392	11.8	30.0	0.784	16.4	20.9	45.1	9.7	35.5	21.4	12.94	7.04	5.07	20.46	107.0
26	27	活塞	0.452	38.7	85.6	0.351	11.6	32.9	0.759	17.8	23.4	42.7	9.6	33.1	24.2	14.93	7.38	5.15	20.51	106.6
28	29	魔术	0.429	38.2	89.2	0.343	10.9	31.8	0.775	16.6	21.4	45.4	10.4	35.1	21.8	12.83	6.89	4.42	17.18	104.0
29	30	骑士	0.450	38.6	85.8	0.336	10.0	29.7	0.743	16.7	22.4	42.8	10.4	32.3	23.8	15.47	7.76	4.51	18.17	103.8

11) - 选取球队名字含有“人”字的球队数据

In [37]:

NBA2019[NBA2019.球队.str.startswith('人')]

Out[37]:

12) - 选取前5列

In [38]:

NBA2019.iloc[:,0:5]

Out[38]:

	排名	球队	投篮命中率	投篮命中	投篮出手
0	1	雄鹿	0.487	44.7	91.8
1	2	篮网	0.494	43.1	87.3
2	3	奇才	0.475	43.2	90.9
3	4	爵士	0.468	41.3	88.1
4	5	开拓者	0.453	41.3	91.1
5	6	太阳	0.490	43.3	88.3
6	7	步行者	0.474	43.3	91.2
7	8	掘金	0.485	43.2	89.2
8	9	鹈鹕	0.477	42.5	89.1
9	10	快船	0.482	41.8	86.7
10	11	勇士	0.468	41.3	88.2
11	12	老鹰	0.468	40.8	87.2
12	13	国王	0.481	42.6	88.6
13	14	76人	0.476	41.4	86.9
14	15	灰熊	0.467	42.8	91.8
15	16	凯尔特人	0.466	41.5	88.9
16	17	独行侠	0.470	41.1	87.3
17	18	森林狼	0.448	40.7	90.9
18	19	猛龙	0.448	39.7	88.7
19	20	马刺	0.462	41.9	90.5
20	21	公牛	0.476	42.2	88.6
21	22	湖人	0.472	40.6	86.1
22	23	黄蜂	0.455	39.9	87.8
23	24	火箭	0.444	39.2	88.5
24	25	热火	0.468	39.2	83.7
25	26	尼克斯	0.456	39.4	86.5
26	27	活塞	0.452	38.7	85.6
27	28	雷霆	0.441	38.8	88.0
28	29	魔术	0.429	38.2	89.2
29	30	骑士	0.450	38.6	85.8

13) - 选取除了最后4列之外的全部列

In [39]:

NBA2019.iloc[:,:-4]

Out[39]:

	排名	球队	投篮命中率	投篮命中	投篮出手	三分命中率	三分命中	三分出手	罚球命中率	罚球命中	罚球出手	总篮板	进攻	防守	助攻	失误
0	1	雄鹿	0.487	44.7	91.8	0.389	14.4	37.1	0.760	16.2	21.4	48.1	10.3	37.8	25.5	13.82
1	2	篮网	0.494	43.1	87.3	0.392	14.2	36.1	0.804	18.1	22.5	44.4	8.9	35.5	26.8	13.54
2	3	奇才	0.475	43.2	90.9	0.351	10.2	29.0	0.769	20.1	26.2	45.2	9.7	35.5	25.5	14.40
3	4	爵士	0.468	41.3	88.1	0.389	16.7	43.0	0.799	17.2	21.5	48.3	10.6	37.6	23.7	14.21
4	5	开拓者	0.453	41.3	91.1	0.385	15.7	40.8	0.823	17.8	21.6	44.5	10.6	33.9	21.3	11.10
5	6	太阳	0.490	43.3	88.3	0.378	13.1	34.6	0.834	15.6	18.7	42.9	8.8	34.2	26.9	12.53
6	7	步行者	0.474	43.3	91.2	0.364	12.3	34.0	0.792	16.4	20.7	42.7	9.0	33.7	27.4	13.54
7	8	掘金	0.485	43.2	89.2	0.377	12.9	34.2	0.803	15.7	19.5	44.4	10.5	33.9	26.8	13.50
8	9	鹈鹕	0.477	42.5	89.1	0.348	10.6	30.4	0.729	19.0	26.1	47.4	11.7	35.7	26.0	14.61
9	10	快船	0.482	41.8	86.7	0.411	14.3	34.7	0.839	16.2	19.3	44.2	9.4	34.7	24.4	13.19
10	11	勇士	0.468	41.3	88.2	0.376	14.6	38.7	0.785	16.6	21.1	43.0	8.0	35.1	27.7	15.00
11	12	老鹰	0.468	40.8	87.2	0.373	12.4	33.4	0.812	19.7	24.2	45.6	10.6	35.1	24.1	13.24
12	13	国王	0.481	42.6	88.6	0.364	12.1	33.3	0.745	16.4	22.0	41.4	9.4	32.0	25.5	13.38
13	14	76人	0.476	41.4	86.9	0.374	11.3	30.1	0.767	19.6	25.5	45.1	10.0	35.0	23.7	14.44
14	15	灰熊	0.467	42.8	91.8	0.356	11.2	31.4	0.771	16.4	21.3	46.5	11.2	35.3	26.9	13.29
15	16	凯尔特人	0.466	41.5	88.9	0.374	13.6	36.4	0.775	16.1	20.8	44.3	10.6	33.6	23.5	14.06
16	17	独行侠	0.470	41.1	87.3	0.362	13.8	38.1	0.778	16.5	21.2	43.3	9.1	34.2	22.9	12.07
17	18	森林狼	0.448	40.7	90.9	0.349	13.1	37.6	0.761	17.6	23.1	43.5	10.5	33.0	25.6	14.26
18	19	猛龙	0.448	39.7	88.7	0.368	14.5	39.3	0.815	17.4	21.3	41.6	9.4	32.1	24.1	13.22
19	20	马刺	0.462	41.9	90.5	0.350	9.9	28.4	0.792	17.4	22.0	43.9	9.3	34.6	24.4	11.40
20	21	公牛	0.476	42.2	88.6	0.370	12.6	34.0	0.791	13.8	17.5	45.0	9.6	35.3	26.8	15.13
21	22	湖人	0.472	40.6	86.1	0.354	11.1	31.2	0.739	17.2	23.3	44.2	9.7	34.6	24.7	15.21
22	23	黄蜂	0.455	39.9	87.8	0.369	13.7	37.0	0.761	15.9	20.9	43.8	10.6	33.2	26.8	14.85
23	24	火箭	0.444	39.2	88.5	0.339	13.8	40.6	0.740	16.5	22.3	42.6	9.3	33.3	23.6	14.72
24	25	热火	0.468	39.2	83.7	0.358	12.9	36.2	0.790	16.7	21.1	41.5	8.0	33.5	26.3	14.07
25	26	尼克斯	0.456	39.4	86.5	0.392	11.8	30.0	0.784	16.4	20.9	45.1	9.7	35.5	21.4	12.94
26	27	活塞	0.452	38.7	85.6	0.351	11.6	32.9	0.759	17.8	23.4	42.7	9.6	33.1	24.2	14.93
27	28	雷霆	0.441	38.8	88.0	0.339	11.9	35.1	0.725	15.5	21.3	45.6	9.9	35.7	22.1	16.14
28	29	魔术	0.429	38.2	89.2	0.343	10.9	31.8	0.775	16.6	21.4	45.4	10.4	35.1	21.8	12.83
29	30	骑士	0.450	38.6	85.8	0.336	10.0	29.7	0.743	16.7	22.4	42.8	10.4	32.3	23.8	15.47

14) - 找到“快船”和“国王”的“得分”

In [40]:

NBA2019.loc[NBA2019.球队.isin(['快船','国王']),['球队','得分']]

Out[40]:

	球队	得分
9	快船	114.0
12	国王	113.7

练习3. 数据分组----探索消费数据（OnlineRetail1.csv）

1)- 导入必要的库

In [41]:

import pandas as pd

2)- 从以下地址导入数据

In [42]:

path3='data/data111257/OnlineRetail.csv'

3)- 将数据框命名为ProductSales

In [43]:

ProductSales=pd.read_csv('data/data111257/OnlineRetail.csv')

4)- 哪个国家(Country)销售收入(Quantity*UnitPrice)更多？

In [44]:

ProductSales['TotalCost'] = ProductSales['UnitPrice'] * ProductSales['Quantity']
cost_per_country = ProductSales.groupby('Country')['TotalCost'].sum()
cost_per_country.idxmax()

Out[44]:

'Canada'

In [45]:

ProductSales.head()

Out[45]:

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country	TotalCost
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2021/8/21	2.55	17850	United Kingdom	15.30
1	536365	71053	WHITE metaL LANTERN	6	2021/8/12	3.39	17850	United Kingdom	20.34
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2021/7/24	2.75	17850	United Kingdom	22.00
3	536365	84029G	KNITTED UNIOn FLAG HOT WATER BOTTLE	6	2021/8/10	3.39	17850	United Kingdom	20.34
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2021/9/12	3.39	17850	United Kingdom	20.34

5)- 哪个月销售收入更多？

In [46]:

ProductSales['Month'] = ProductSales['InvoiceDate'].apply(lambda x:x[:6]).tolist()

In [47]:

ProductSales.head()

Out[47]:

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country	TotalCost	Month
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2021/8/21	2.55	17850	United Kingdom	15.30	2021/8
1	536365	71053	WHITE metaL LANTERN	6	2021/8/12	3.39	17850	United Kingdom	20.34	2021/8
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2021/7/24	2.75	17850	United Kingdom	22.00	2021/7
3	536365	84029G	KNITTED UNIOn FLAG HOT WATER BOTTLE	6	2021/8/10	3.39	17850	United Kingdom	20.34	2021/8
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2021/9/12	3.39	17850	United Kingdom	20.34	2021/9

In [48]:

ProductSales['TotalCost'] = ProductSales['UnitPrice'] * ProductSales['Quantity']
cost_per_country = ProductSales.groupby('Month')['TotalCost'].sum()
cost_per_country

Out[48]:

Month
2021/7     431.35
2021/8    1570.94
2021/9     667.43
Name: TotalCost, dtype: float64

In [49]:

cost_per_country.idxmax()

Out[49]:

'2021/8'

6)- 打印出每个国家销售数量的平均值，最大值和最小值

In [50]:

ProductSales.groupby('Country').Quantity.agg(['mean', 'min', 'max'])

Out[50]:

	mean	min	max
Country
Australia	10.500000	5	20
Belgium	7.000000	2	24
Canada	9.625000	1	16
Channel Islands	11.222222	1	24
France	22.500000	3	36
Germany	10.750000	4	24
Italy	10.750000	1	24
USA	6.750000	2	12
United Kingdom	9.263158	2	80

练习 4. 运用函数--分析美国某次人口普查状况。按照每题的要求，输入正确的代码并运行。

1)-导入必要的库

In [51]:

import pandas as pd

2)-从以下地址导入数据集

In [52]:

path4='data/data111257/adult.csv'

3)-将数据框命名为census

In [53]:

census=pd.read_csv('data/data111257/adult.csv')

In [54]:

census.head()

Out[54]:

	year	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
0	1966	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k
1	1971	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k
2	1977	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
3	1977	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k
4	1973	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k

4)-每一列(column)的数据类型是什么样的？

In [55]:

census.info()


RangeIndex: 32561 entries, 0 to 32560
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   year            32561 non-null  int64  
 1   age             32561 non-null  int64  
 2   workclass       32561 non-null  object 
 3   fnlwgt          32561 non-null  int64  
 4   education       32561 non-null  object 
 5   education-num   32074 non-null  float64
 6   marital-status  32561 non-null  object 
 7   occupation      32049 non-null  object 
 8   relationship    32561 non-null  object 
 9   race            32561 non-null  object 
 10  sex             32561 non-null  object 
 11  capital-gain    32561 non-null  int64  
 12  capital-loss    32561 non-null  int64  
 13  hours-per-week  32561 non-null  int64  
 14  native-country  32561 non-null  object 
 15  salary          32561 non-null  object 
dtypes: float64(1), int64(6), object(9)
memory usage: 4.0+ MB

5)-将year的数据类型转换为 datetime64

In [56]:

census.year = pd.to_datetime(census.year, format='%Y')
census.info()


RangeIndex: 32561 entries, 0 to 32560
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   year            32561 non-null  datetime64[ns]
 1   age             32561 non-null  int64         
 2   workclass       32561 non-null  object        
 3   fnlwgt          32561 non-null  int64         
 4   education       32561 non-null  object        
 5   education-num   32074 non-null  float64       
 6   marital-status  32561 non-null  object        
 7   occupation      32049 non-null  object        
 8   relationship    32561 non-null  object        
 9   race            32561 non-null  object        
 10  sex             32561 non-null  object        
 11  capital-gain    32561 non-null  int64         
 12  capital-loss    32561 non-null  int64         
 13  hours-per-week  32561 non-null  int64         
 14  native-country  32561 non-null  object        
 15  salary          32561 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(5), object(9)
memory usage: 4.0+ MB

6)-将列year设置为数据框的索引

In [57]:

census = census.set_index('year', drop = True)
census.head()

Out[57]:

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
year
1966-01-01	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k
1971-01-01	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k
1977-01-01	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
1977-01-01	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k
1973-01-01	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k

7)-删除名为fnlwgt的列

In [58]:

del census['fnlwgt']
census.head()

Out[58]:

	age	workclass	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
year
1966-01-01	49	Private	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k
1971-01-01	44	Private	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k
1977-01-01	38	Private	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
1977-01-01	38	Self-emp-inc	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k
1973-01-01	42	Self-emp-not-inc	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k

8)-按照year对数据框进行分组，并对hours-per-week求和

ProductSales['TotalCost'] = ProductSales['UnitPrice'] * ProductSales['Quantity'] cost_per_country = ProductSales.groupby('InvoiceDate')['TotalCost'].sum() cost_per_country.idxmax()

In [59]:

census.groupby('year')['hours-per-week'].agg(sum)

Out[59]:

year
1925-01-01     1583
1927-01-01      120
1928-01-01        2
1929-01-01       40
1930-01-01       88
              ...  
1994-01-01    24505
1995-01-01    24307
1996-01-01    21843
1997-01-01    14252
1998-01-01     8440
Name: hours-per-week, Length: 73, dtype: int64

9)-哪个年龄(age)已婚( Married-civ-spouse)人士最多

In [60]:

marry=census[census['marital-status'].str.contains('Married-civ-spouse')]
marry.age.value_counts().head(1)

Out[60]:

38    499
Name: age, dtype: int64

练习 5. 合并 1)-导入必要的库

In [61]:

import pandas as pd

2)-按照如下的元数据内容创建数据框

raw_data_1={
        'store_id':['a','b','c','d','e'],
        'item_name':['book','rule','glue','tape','pen']
        'sales':[100,10,30,4,60]}
raw_data_2={
        'store_id':['d',e','f','g','h'],
        'item_name':['stapler','notebook','pencil','eraser','sharpener']
        'sales':[20,60,150,40,50]}
raw_data_3={
        'store_id':['a','b','c','d',e','f','g','h','i','j'],
        'score':[80,79,68,99,60,84,75,93,59,60] }

In [62]:

raw_data_1={
		'store_id':['a','b','c','d','e'],
		'item_name':['book','rule','glue','tape','pen'],
		'sales':[100,10,30,4,60]}
raw_data_2={
		'store_id':['d','e','f','g','h'],
		'item_name':['stapler','notebook','pencil','eraser','sharpener'],
		'sales':[20,60,150,40,50]}
raw_data_3={
		'store_id':['a','b','c','d','e','f','g','h','i','j'],
		'score':[80,79,68,99,60,84,75,93,59,60] }

In [63]:

raw_data_1

Out[63]:

{'store_id': ['a', 'b', 'c', 'd', 'e'],
 'item_name': ['book', 'rule', 'glue', 'tape', 'pen'],
 'sales': [100, 10, 30, 4, 60]}

3)-将上述的数据框分别命名为data1, data2, data3

In [64]:

data1 = pd.Dataframe(raw_data_1, columns = ['store_id', 'item_name', 'sales'])
data2 = pd.Dataframe(raw_data_2, columns = ['store_id', 'item_name', 'sales'])
data3 = pd.Dataframe(raw_data_3, columns = ['store_id','score'])

In [65]:

data1

Out[65]:

	store_id	item_name	sales
0	a	book	100
1	b	rule	10
2	c	glue	30
3	d	tape	4
4	e	pen	60

4)-将data1和data2两个数据框按照列的维度进行合并，命名为all_data

In [66]:

all_data=pd.concat([data1,data2])
all_data

Out[66]:

	store_id	item_name	sales
0	a	book	100
1	b	rule	10
2	c	glue	30
3	d	tape	4
4	e	pen	60
0	d	stapler	20
1	e	notebook	60
2	f	pencil	150
3	g	eraser	40
4	h	sharpener	50

5)-将data1和data2两个数据框按照列的维度进行合并，命名为all_data_col

In [67]:

all_data_col=pd.concat([data1,data2],axis=1)
all_data_col

Out[67]:

	store_id	item_name	sales	store_id	item_name	sales
0	a	book	100	d	stapler	20
1	b	rule	10	e	notebook	60
2	c	glue	30	f	pencil	150
3	d	tape	4	g	eraser	40
4	e	pen	60	h	sharpener	50

6)-打印data3

In [68]:

data3

Out[68]:

	store_id	score
0	a	80
1	b	79
2	c	68
3	d	99
4	e	60
5	f	84
6	g	75
7	h	93
8	i	59
9	j	60

7)-按照store_id的值对all_data和data3作合并

In [69]:

pd.merge(all_data,data3,on='store_id')

Out[69]:

	store_id	item_name	sales	score
0	a	book	100	80
1	b	rule	10	79
2	c	glue	30	68
3	d	tape	4	99
4	d	stapler	20	99
5	e	pen	60	60
6	e	notebook	60	60
7	f	pencil	150	84
8	g	eraser	40	75
9	h	sharpener	50	93

8)-对‘data1’和‘data2’按照‘store_id’作连接

In [70]:

pd.merge(data1,data2,on='store_id',how='inner')

Out[70]:

	store_id	item_name_x	sales_x	item_name_y	sales_y
0	d	tape	4	stapler	20
1	e	pen	60	notebook	60

9)-找到 data1 和 data2 合并之后的所有匹配结果

In [71]:

pd.merge(data1, data2, on='store_id', how='outer')

Out[71]:

	store_id	item_name_x	sales_x	item_name_y	sales_y
0	a	book	100.0	NaN	NaN
1	b	rule	10.0	NaN	NaN
2	c	glue	30.0	NaN	NaN
3	d	tape	4.0	stapler	20.0
4	e	pen	60.0	notebook	60.0
5	f	NaN	NaN	pencil	150.0
6	g	NaN	NaN	eraser	40.0
7	h	NaN	NaN	sharpener	50.0

练习 6. 统计----气象数据(austin_weather.csv)

1-导入必要的库

In [72]:

import pandas as pd
import datetime

2)- 从以下地址导入数据集

In [73]:

path6='data/data111257/austin_weather.csv'

In [74]:

df6=pd.read_csv('data/data111257/austin_weather.csv')

In [75]:

df6

Out[75]:

	Year	Month	Day	Location	TempHighF	TempLowF	DewPointHighF	DewPointAvgF	DewPointLowF	HumidityHighPercent	...	SeaLevelPressureAvgInches	SeaLevelPressureLowInches	VisibilityHighMiles	VisibilityAvgMiles	VisibilityLowMiles	WindHighMPH	WindAvgMPH	WindGustMPH	PrecipitationSumInches	Events
0	2013	12	21	2.0	74	45	67	49	43	93	...	29.68	29.59	10	7	2	20	4	31	0.46	Rain , Thunderstorm
1	2013	12	22	7.0	56	39	43	36	28	93	...	30.13	29.87	10	10	5	16	6	25	0
2	2013	12	23	6.0	58	32	31	27	23	76	...	30.49	30.41	10	10	10	8	3	12	0
3	2013	12	24	7.0	61	31	36	28	21	89	...	30.45	30.3	10	10	7	12	4	20	0
4	2013	12	25	9.0	58	41	44	40	36	86	...	30.33	30.27	10	10	7	10	2	16	T
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1314	2017	7	27	2.0	103	75	71	67	61	82	...	29.97	29.88	10	10	10	12	5	21	0
1315	2017	7	28	8.0	105	76	71	64	55	87	...	29.9	29.81	10	10	10	14	5	20	0
1316	2017	7	29	9.0	107	77	72	64	55	82	...	29.86	29.79	10	10	10	12	4	17	0
1317	2017	7	30	4.0	106	79	70	68	63	69	...	29.91	29.87	10	10	10	13	4	20	0
1318	2017	7	31	9.0	99	77	66	61	54	64	...	29.97	29.91	10	10	10	12	4	20	0

1319 rows × 23 columns

3)- 将数据作存储并且设置前三列(Year, Month, Day)为合适的索引Date

In [76]:

data6=pd.read_csv(path6,parse_dates=['Year','Month','Day'])
data6.head()

Out[76]:

	Year	Month	Day	Location	TempHighF	TempLowF	DewPointHighF	DewPointAvgF	DewPointLowF	HumidityHighPercent	...	SeaLevelPressureAvgInches	SeaLevelPressureLowInches	VisibilityHighMiles	VisibilityAvgMiles	VisibilityLowMiles	WindHighMPH	WindAvgMPH	WindGustMPH	PrecipitationSumInches	Events
0	2013-01-01	12	21	2.0	74	45	67	49	43	93	...	29.68	29.59	10	7	2	20	4	31	0.46	Rain , Thunderstorm
1	2013-01-01	12	22	7.0	56	39	43	36	28	93	...	30.13	29.87	10	10	5	16	6	25	0
2	2013-01-01	12	23	6.0	58	32	31	27	23	76	...	30.49	30.41	10	10	10	8	3	12	0
3	2013-01-01	12	24	7.0	61	31	36	28	21	89	...	30.45	30.3	10	10	7	12	4	20	0
4	2013-01-01	12	25	9.0	58	41	44	40	36	86	...	30.33	30.27	10	10	7	10	2	16	T

5 rows × 23 columns

4)- 2034、2065年？我们真的有这一年的数据？创建一个函数并用它去修复这个bug

In [77]:

def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)

# apply the function fix_century on the column and replace the values to the right ones
data6['Year'] = data6['Year'].apply(fix_century)

# data.info()
data6.head()

Out[77]:

	Year	Month	Day	Location	TempHighF	TempLowF	DewPointHighF	DewPointAvgF	DewPointLowF	HumidityHighPercent	...	SeaLevelPressureAvgInches	SeaLevelPressureLowInches	VisibilityHighMiles	VisibilityAvgMiles	VisibilityLowMiles	WindHighMPH	WindAvgMPH	WindGustMPH	PrecipitationSumInches	Events
0	1913-01-01	12	21	2.0	74	45	67	49	43	93	...	29.68	29.59	10	7	2	20	4	31	0.46	Rain , Thunderstorm
1	1913-01-01	12	22	7.0	56	39	43	36	28	93	...	30.13	29.87	10	10	5	16	6	25	0
2	1913-01-01	12	23	6.0	58	32	31	27	23	76	...	30.49	30.41	10	10	10	8	3	12	0
3	1913-01-01	12	24	7.0	61	31	36	28	21	89	...	30.45	30.3	10	10	7	12	4	20	0
4	1913-01-01	12	25	9.0	58	41	44	40	36	86	...	30.33	30.27	10	10	7	10	2	16	T

5 rows × 23 columns

5)- 将日期(Date)设为索引，注意数据类型，应该是datetime64

In [78]:

data6["Year"] = pd.to_datetime(data6["Year"])

In [79]:

data6 = data6.set_index('Year')

In [80]:

data6.head()

Out[80]:

	Month	Day	Location	TempHighF	TempLowF	DewPointHighF	DewPointAvgF	DewPointLowF	HumidityHighPercent	HumidityAvgPercent	...	SeaLevelPressureAvgInches	SeaLevelPressureLowInches	VisibilityHighMiles	VisibilityAvgMiles	VisibilityLowMiles	WindHighMPH	WindAvgMPH	WindGustMPH	PrecipitationSumInches	Events
Year
1913-01-01	12	21	2.0	74	45	67	49	43	93	75	...	29.68	29.59	10	7	2	20	4	31	0.46	Rain , Thunderstorm
1913-01-01	12	22	7.0	56	39	43	36	28	93	68	...	30.13	29.87	10	10	5	16	6	25	0
1913-01-01	12	23	6.0	58	32	31	27	23	76	52	...	30.49	30.41	10	10	10	8	3	12	0
1913-01-01	12	24	7.0	61	31	36	28	21	89	56	...	30.45	30.3	10	10	7	12	4	20	0
1913-01-01	12	25	9.0	58	41	44	40	36	86	71	...	30.33	30.27	10	10	7	10	2	16	T

5 rows × 22 columns

6)- 对应每一个DewPointAvgF，一共有多少数据值缺失

In [81]:

data6.isnull().sum()

Out[81]:

Month                          0
Day                            0
Location                      12
TempHighF                      0
TempLowF                       0
DewPointHighF                  0
DewPointAvgF                   0
DewPointLowF                   0
HumidityHighPercent            0
HumidityAvgPercent             0
HumidityLowPercent             0
SeaLevelPressureHighInches     0
SeaLevelPressureAvgInches      0
SeaLevelPressureLowInches      0
VisibilityHighMiles            0
VisibilityAvgMiles             0
VisibilityLowMiles             0
WindHighMPH                    0
WindAvgMPH                     0
WindGustMPH                    0
PrecipitationSumInches         0
Events                         0
dtype: int64

7)- 对应每一个DewPointAvgF，一共有多少完整的数据值

In [82]:

data6.shape[0] - data6.isnull().sum()

Out[82]:

Month                         1319
Day                           1319
Location                      1307
TempHighF                     1319
TempLowF                      1319
DewPointHighF                 1319
DewPointAvgF                  1319
DewPointLowF                  1319
HumidityHighPercent           1319
HumidityAvgPercent            1319
HumidityLowPercent            1319
SeaLevelPressureHighInches    1319
SeaLevelPressureAvgInches     1319
SeaLevelPressureLowInches     1319
VisibilityHighMiles           1319
VisibilityAvgMiles            1319
VisibilityLowMiles            1319
WindHighMPH                   1319
WindAvgMPH                    1319
WindGustMPH                   1319
PrecipitationSumInches        1319
Events                        1319
dtype: int64

8)- 对于全体数据，计算温度(TempF)的平均值

In [83]:

a68=data6.mean()
a68

Out[83]:

Month              inf
Day                inf
Location      5.435348
TempHighF    80.862775
TempLowF     59.902957
dtype: float64

In [84]:

a68[3],a68[4]

Out[84]:

(80.86277482941622, 59.902956785443514)

In [85]:

avg_TempF=0.5*(a68[3]+a68[4])
avg_TempF

Out[85]:

70.38286580742987

9)- 创建一个名为loc_stats的数据框去计算并存储每个location多年湿度（Humidity）的最小值，最大值，平均值和标准差

In [86]:

loc_stats=pd.Dataframe()

loc_stats['min']=data6.min()
loc_stats['max']=data6.max()
loc_stats['mean']=data6.mean()
loc_stats['std']=data6.std()#standard deviations
loc_stats

Out[86]:

	min	max	mean	std
Month	1	9	inf	NaN
Day	1	9	inf	NaN
Location	1	10	5.435348	2.830496
TempHighF	32	107	80.862775	14.766523
TempLowF	19	81	59.902957	14.190648
DewPointHighF	-	80	NaN	NaN
DewPointAvgF	-	8	NaN	NaN
DewPointLowF	-	9	NaN	NaN
HumidityHighPercent	-	97	NaN	NaN
HumidityAvgPercent	-	97	NaN	NaN
HumidityLowPercent	-	93	NaN	NaN
SeaLevelPressureHighInches	-	30.83	NaN	NaN
SeaLevelPressureAvgInches	-	30.74	NaN	NaN
SeaLevelPressureLowInches	-	30.61	NaN	NaN
VisibilityHighMiles	-	9	NaN	NaN
VisibilityAvgMiles	-	9	NaN	NaN
VisibilityLowMiles	-	9	NaN	NaN
WindHighMPH	-	9	NaN	NaN
WindAvgMPH	-	9	NaN	NaN
WindGustMPH	-	9	NaN	NaN
PrecipitationSumInches	0	T	NaN	NaN
Events		Thunderstorm	NaN	NaN

10)- 创建一个名为day_stats的数据框去计算并存储所有location中风速最小值，最大值，平均值和标准差

In [87]:

day_status=pd.Dataframe()

day_status['min']=data6.min(axis=1)
day_status['max']=data6.max(axis=1)
day_status['mean']=data6.mean(axis=1)
day_status['std']=data6.std(axis=1)

day_status.head()

Out[87]:

	min	max	mean	std
Year
1913-01-01	2.0	74.0	40.333333	36.226142
1913-01-01	7.0	56.0	34.000000	24.879711
1913-01-01	6.0	58.0	32.000000	26.000000
1913-01-01	7.0	61.0	33.000000	27.055499
1913-01-01	9.0	58.0	36.000000	24.879711

11)- 对于每一个location，计算多年的平均SeaLevelPressure值。

In [88]:

data6['SeaLevelPressureAvgInches'] = data6['SeaLevelPressureAvgInches'].apply(pd.to_numeric, errors='coerce').fillna(0.0)
data6.groupby(['Location','Year'])['SeaLevelPressureAvgInches'].mean()

Out[88]:

Location  Year      
1.0       1914-01-01    30.021818
          1915-01-01    29.987059
          1916-01-01    29.999512
          1917-01-01    29.922500
2.0       1913-01-01    29.680000
          1914-01-01    29.975667
          1915-01-01    30.020303
          1916-01-01    30.048750
          1917-01-01    29.988750
3.0       1913-01-01    30.360000
          1914-01-01    30.083333
          1915-01-01    29.390000
          1916-01-01    30.029000
          1917-01-01    30.025455
4.0       1914-01-01    30.079535
          1915-01-01    30.023023
          1916-01-01    30.073438
          1917-01-01    29.999412
5.0       1914-01-01    30.004848
          1915-01-01    30.023750
          1916-01-01    30.031667
          1917-01-01    29.986316
6.0       1913-01-01    30.253333
          1914-01-01    30.022045
          1915-01-01    30.032250
          1916-01-01    30.021579
          1917-01-01    30.021364
7.0       1913-01-01    30.290000
          1914-01-01    30.024242
          1915-01-01    28.356216
          1916-01-01    30.016216
          1917-01-01    30.004444
8.0       1914-01-01    30.035517
          1915-01-01    30.041304
          1916-01-01    29.996667
          1917-01-01    29.954762
          1934-01-01    29.870000
9.0       1913-01-01    30.365000
          1914-01-01    30.003889
          1915-01-01    30.050270
          1916-01-01    30.074848
          1917-01-01    29.948182
          1965-01-01    30.060000
10.0      1914-01-01    30.036389
          1915-01-01    30.063438
          1916-01-01    30.012368
          1917-01-01    29.898571
Name: SeaLevelPressureAvgInches, dtype: float64

练习 7. 可视化----某班成绩可视化分析（StudentsPerformance.csv） 1)-导入必要的库

In [89]:

import pandas as pd
import numpy as np

2)-从以下地址导入数据集

In [90]:

path7='data/data111257/StudentsPerformance.csv'

3)-将数据框命名为studentperformance

In [91]:

studentperformance=pd.read_csv('data/data111257/StudentsPerformance.csv')

4)-将studentID设置为索引

In [92]:

studentperformance=studentperformance.set_index('studentID',drop=True)
studentperformance.head()

Out[92]:

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
studentID
1	female	group B	bachelor's degree	standard	none	72	72	74
2	female	group C	some college	standard	completed	69	90	88
3	female	group B	master's degree	standard	none	90	95	93
4	male	group A	associate's degree	free/reduced	none	47	57	44
5	male	group C	some college	standard	none	76	78	75

5)-绘制一个展示男女生比例的扇形图

In [93]:

#统计性别数量
aa=studentperformance.groupby(['gender'])['race/ethnicity'].count()
aa

Out[93]:

gender
female    518
male      482
Name: race/ethnicity, dtype: int64

In [94]:

#计算性别比例
female_rate=aa[0]/aa.sum()
male_rate=aa[1]/aa.sum()
female_rate,male_rate

Out[94]:

(0.518, 0.482)

In [128]:

#-*- coding: utf-8 -*-
from matplotlib import pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei'] #解决中文乱码
plt.figure(figsize=(6,9)) #调节图形大小

labels = [u'female_rate',u'male_rate',] #定义标签
sizes = [female_rate,male_rate] #每块值
colors = ['blue', 'red'] #每块颜色定义
explode = (0,0) #将某一块分割出来，值越大分割出的间隙越大
patches,text1,text2 = plt.pie(sizes,
                      explode=explode,
                      labels=labels,
                      colors=colors,
                      autopct = '%3.2f%%', #数值保留固定小数位
                      shadow = False, #无阴影设置
                      startangle =90, #逆时针起始角度设置
                      pctdistance = 0.2) #数值距圆心半径倍数距离
#patches饼图的返回值，texts1饼图外label的文本，texts2饼图内部的文本
# x，y轴刻度设置一致，保证饼图为圆形
plt.axis('equal')
plt.legend() # 右上角显示
plt.show()

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/font_manager.py:1331: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))

方法2：

#绘制学生性别比例扇形图
Male = (studentperformance.gender == 'male').sum()
Female = (studentperformance.gender == 'female').sum()
proportions = [Male,Female]

plt.pie(proportions, labels=['Male','Female'],shadow=True,
        autopct='%1.1f%%',startangle=90,explode=(0.15,0))
plt.axis('equal')
plt.title('Sex Proportion')
plt.tight_layout()
plt.show()

6)-绘制一个展示math score成绩, 与学生性别gender和父母受教育程度parental level of education的散点图

散点图

fig = plt.figure() ax=fig.add_subplot(1,1,1) ax.scatter(df['Chinese'],df['Math']) plt.xlabel('Chinese') plt.ylabel('Math') plt.title('语文和数学成绩散点图') plt.show()

In [96]:

#散点图 修改
fig = plt.figure()

In [97]:

import seaborn as sns
lm = sns.lmplot(x='gender',y='math score',data=studentperformance,hue='parental level of education',fit_reg=False)
lm.set(title='Math Score x Age')
axes = lm.axes
axes[0,0].set_ylim(0,100)
axes[0,0].set_xlim(-5,5)

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/font_manager.py:1331: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/font_manager.py:1331: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))

Out[97]:

(-5, 5)

7)-绘制一个展示reading score成绩的直方图

In [98]:

#直方图
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.hist(studentperformance['reading score'],bins=30)
plt.title('reading score')
plt.xlabel('score')
plt.ylabel('Total number')
plt.show()

练习 8.创建数据框

In [ ]:

1)-导入必要的库

In [99]:

import pandas as pd

2)-创建一个数据字典

raw_data = {
            "Alcoholic": [0,0,0,0,0.0976],
            "Country": ['United Kingdom','China','Algeria','Bangladesh','Korea South'],
            "Eggs": [1.0778,2.3066,1.2171,1.2664,1.3524],
            "Fish": [0.5176,0.9817,0.2008,2.1646,1.9147],
            "Spices": [0.1276,0.0676,0.1568,1.0602,0.0372]
        }

In [100]:

raw_data = {
			"Alcoholic": [0,0,0,0,0.0976],
			"Country": ['United Kingdom','China','Algeria','Bangladesh','Korea South'],
			"Eggs": [1.0778,2.3066,1.2171,1.2664,1.3524],
			"Fish": [0.5176,0.9817,0.2008,2.1646,1.9147],
			"Spices": [0.1276,0.0676,0.1568,1.0602,0.0372]
        }

3)-将数据字典存为一个名叫diet的数据框中

In [101]:

diet = pd.Dataframe(raw_data)
diet.head()

Out[101]:

	Alcoholic	Country	Eggs	Fish	Spices
0	0.0000	United Kingdom	1.0778	0.5176	0.1276
1	0.0000	China	2.3066	0.9817	0.0676
2	0.0000	Algeria	1.2171	0.2008	0.1568
3	0.0000	Bangladesh	1.2664	2.1646	1.0602
4	0.0976	Korea South	1.3524	1.9147	0.0372

4)-数据框的列排序是字母顺序，请重新修改为Country, Spices, Fish, Alcoholic, Eggs这个顺序

In [102]:

diet=diet[['Country','Spices','Fish','Alcoholic','Eggs']]

5)-添加一个列Health

    diet['Health']=['yes','yes','no','yes','no']

In [103]:

diet['Health']=['yes','yes','no','yes','no']

6)-查看每个列的数据类型

In [104]:

diet.dtypes

Out[104]:

Country       object
Spices       float64
Fish         float64
Alcoholic    float64
Eggs         float64
Health        object
dtype: object

练习 9.时间序列(hyundi.csv) 1)-导入必要的库

In [105]:

import pandas as pd
import numpy as np 
#visualization
import matplotlib.pyplot as plt 

%matplotlib inline

2)-从以下地址导入数据集

In [106]:

path9='data/data111257/hyundi.csv'

3)-读取数据并存为一个名叫hyundai的数据框

In [107]:

hyundai = pd.read_csv(path9)
hyundai.head()

Out[107]:

	model	year	price	transmission	mileage	fuelType	tax	mpg	engineSize	record
0	I20	2017	7999	Manual	17307	Petrol	145	58.9	1.2	2021/6/20
1	Tucson	2016	14499	Automatic	25233	Diesel	235	43.5	2.0	2016/3/5
2	Tucson	2016	11399	Manual	37877	Diesel	30	61.7	1.7	2016/6/18
3	I10	2016	6499	Manual	23789	Petrol	20	60.1	1.0	2019/1/2
4	IX35	2015	10199	Manual	33177	Diesel	160	51.4	2.0	2020/1/7

4)-查看每一列的数据类型

In [108]:

hyundai.dtypes

Out[108]:

model            object
year              int64
price             int64
transmission     object
mileage           int64
fuelType         object
tax               int64
mpg             float64
engineSize      float64
record           object
dtype: object

5)-将record这个列转换为datetime类型

In [109]:

hyundai.record = pd.to_datetime(hyundai.record)
hyundai['record'].head()

Out[109]:

0   2021-06-20
1   2016-03-05
2   2016-06-18
3   2019-01-02
4   2020-01-07
Name: record, dtype: datetime64[ns]

6)-将record设置为索引

In [110]:

hyundai=hyundai.set_index('record')
hyundai.head()

Out[110]:

	model	year	price	transmission	mileage	fuelType	tax	mpg	engineSize
record
2021-06-20	I20	2017	7999	Manual	17307	Petrol	145	58.9	1.2
2016-03-05	Tucson	2016	14499	Automatic	25233	Diesel	235	43.5	2.0
2016-06-18	Tucson	2016	11399	Manual	37877	Diesel	30	61.7	1.7
2019-01-02	I10	2016	6499	Manual	23789	Petrol	20	60.1	1.0
2020-01-07	IX35	2015	10199	Manual	33177	Diesel	160	51.4	2.0

7)-有重复的日期吗？

In [111]:

hyundai.index.is_unique

Out[111]:

False

8)-将index设置为升序

In [112]:

hyundai.sort_index(ascending = True).head()

Out[112]:

	model	year	price	transmission	mileage	fuelType	tax	mpg	engineSize
record
2016-03-05	Tucson	2016	14499	Automatic	25233	Diesel	235	43.5	2.0
2016-06-18	Tucson	2016	11399	Manual	37877	Diesel	30	61.7	1.7
2018-03-04	I10	2016	7500	Manual	6000	Petrol	20	57.6	1.2
2018-03-17	I10	2018	7100	Manual	35000	Petrol	145	60.1	1.0
2018-05-09	I40	2015	7830	Manual	59508	Diesel	30	65.7	1.7

9)-找到每个月的最后一个记录日(record)

In [113]:

hyundai_month = hyundai.resample('BM').sum()
hyundai_month.head()

Out[113]:

	year	price	mileage	tax	mpg	engineSize
record
2016-03-31	2016	14499	25233	235	43.5	2.0
2016-04-29	0	0	0	0	0.0	0.0
2016-05-31	0	0	0	0	0.0	0.0
2016-06-30	2016	11399	37877	30	61.7	1.7
2016-07-29	0	0	0	0	0.0	0.0

10)-数据集中最早的记录日期(record)和最晚的记录日期相差多少天？

In [114]:

(hyundai.index.max() - hyundai.index.min()).days

Out[114]:

11)-在record数据中一共有多少个月？

In [115]:

hyundai_month = hyundai.resample('BM').mean()
len(hyundai_month.index)

Out[115]:

12)-按照record时间顺序可视化mpg值

In [116]:

hyundai_open=hyundai['mpg'].plot(title='hyundai record')

fig=hyundai_open.get_figure()
fig.set_size_inches(13.5,9)

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/font_manager.py:1331: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))

练习 10.删除数据(drug200.csv) 1)-导入必要的库

In [117]:

import pandas as pd
import numpy as np

2)-从以下地址导入数据集

In [118]:

path10='data/data111257/drug200.csv'

3)-将数据集存成变量drug

In [119]:

drug=pd.read_csv(path10)
drug.head()

Out[119]:

	23	F	HIGH	HIGH.1	25.355	DrugY
0	47	M	LOW	HIGH	13.093	drugC
1	47	M	LOW	HIGH	10.114	drugC
2	28	F	NORMAL	HIGH	7.798	drugX
3	61	F	LOW	HIGH	18.043	DrugY
4	22	F	NORMAL	HIGH	8.607	drugX

4)-创建数据框的列名称

Age
Sex
Blood Pressure Levels (BP)
Cholesterol Levels
Na to Potassium Ration（Na_to_K）
Drug Class

In [120]:

drug=pd.read_csv(path10,names=['Age','Sex','Blood_Pressure_Levels (BP)','Cholesterol_Levels','Na_to_Potassium_Ration（Na_to_K）','Drug_Class'])
drug.head()

Out[120]:

	Age	Sex	Blood_Pressure_Levels (BP)	Cholesterol_Levels	Na_to_Potassium_Ration（Na_to_K）	Drug_Class
0	23	F	HIGH	HIGH	25.355	DrugY
1	47	M	LOW	HIGH	13.093	drugC
2	47	M	LOW	HIGH	10.114	drugC
3	28	F	NORMAL	HIGH	7.798	drugX
4	61	F	LOW	HIGH	18.043	DrugY

5)-数据框中有缺失值吗？

In [121]:

pd.isnull(drug).sum()

Out[121]:

Age                                0
Sex                                0
Blood_Pressure_Levels (BP)         0
Cholesterol_Levels                 0
Na_to_Potassium_Ration（Na_to_K）    0
Drug_Class                         0
dtype: int64

6)-将列Cholesterol的第10到19行设置为缺失值

In [122]:

drug.iloc[10:20,3:4]=np.nan
drug.head(20)

Out[122]:

	Age	Sex	Blood_Pressure_Levels (BP)	Cholesterol_Levels	Na_to_Potassium_Ration（Na_to_K）	Drug_Class
0	23	F	HIGH	HIGH	25.355	DrugY
1	47	M	LOW	HIGH	13.093	drugC
2	47	M	LOW	HIGH	10.114	drugC
3	28	F	NORMAL	HIGH	7.798	drugX
4	61	F	LOW	HIGH	18.043	DrugY
5	22	F	NORMAL	HIGH	8.607	drugX
6	49	F	NORMAL	HIGH	16.275	DrugY
7	41	M	LOW	HIGH	11.037	drugC
8	60	M	NORMAL	HIGH	15.171	DrugY
9	43	M	LOW	NORMAL	19.368	DrugY
10	47	F	LOW	NaN	11.767	drugC
11	34	F	HIGH	NaN	19.199	DrugY
12	43	M	LOW	NaN	15.376	DrugY
13	74	F	LOW	NaN	20.942	DrugY
14	50	F	NORMAL	NaN	12.703	drugX
15	16	F	HIGH	NaN	15.516	DrugY
16	69	M	LOW	NaN	11.455	drugX
17	43	M	HIGH	NaN	13.972	drugA
18	23	M	LOW	NaN	7.298	drugC
19	32	F	HIGH	NaN	25.974	DrugY

7)-将缺失值全部替换为1.0

In [123]:

drug.Cholesterol_Levels.fillna(1,inplace=True)
drug

Out[123]:

	Age	Sex	Blood_Pressure_Levels (BP)	Cholesterol_Levels	Na_to_Potassium_Ration（Na_to_K）	Drug_Class
0	23	F	HIGH	HIGH	25.355	DrugY
1	47	M	LOW	HIGH	13.093	drugC
2	47	M	LOW	HIGH	10.114	drugC
3	28	F	NORMAL	HIGH	7.798	drugX
4	61	F	LOW	HIGH	18.043	DrugY
...	...	...	...	...	...	...
195	56	F	LOW	HIGH	11.567	drugC
196	16	M	LOW	HIGH	12.006	drugC
197	52	M	NORMAL	HIGH	9.894	drugX
198	23	M	NORMAL	NORMAL	14.020	drugX
199	40	F	LOW	NORMAL	11.349	drugX

200 rows × 6 columns

8)-删除列Cholesterol Levels，并打印删除后的数据框架

In [124]:

del drug['Cholesterol_Levels']
drug.head()

Out[124]:

	Age	Sex	Blood_Pressure_Levels (BP)	Na_to_Potassium_Ration（Na_to_K）	Drug_Class
0	23	F	HIGH	25.355	DrugY
1	47	M	LOW	13.093	drugC
2	47	M	LOW	10.114	drugC
3	28	F	NORMAL	7.798	drugX
4	61	F	LOW	18.043	DrugY

9)-将数据框前三行设置为缺失值

In [125]:

drug.iloc[0:3,:]=np.nan
drug.head()

Out[125]:

	Age	Sex	Blood_Pressure_Levels (BP)	Na_to_Potassium_Ration（Na_to_K）	Drug_Class
0	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN
3	28.0	F	NORMAL	7.798	drugX
4	61.0	F	LOW	18.043	DrugY

10)-删除有缺失值的行

In [126]:

drug=drug.dropna(how='any')
drug.head()

Out[126]:

	Age	Sex	Blood_Pressure_Levels (BP)	Na_to_Potassium_Ration（Na_to_K）	Drug_Class
3	28.0	F	NORMAL	7.798	drugX
4	61.0	F	LOW	18.043	DrugY
5	22.0	F	NORMAL	8.607	drugX
6	49.0	F	NORMAL	16.275	DrugY
7	41.0	M	LOW	11.037	drugC

11)-重新设置索引

In [127]:

drug=drug.reset_index(drop=True)
drug.head()

Out[127]:

	Age	Sex	Blood_Pressure_Levels (BP)	Na_to_Potassium_Ration（Na_to_K）	Drug_Class
0	28.0	F	NORMAL	7.798	drugX
1	61.0	F	LOW	18.043	DrugY
2	22.0	F	NORMAL	8.607	drugX
3	49.0	F	NORMAL	16.275	DrugY
4	41.0	M	LOW	11.037	drugC

分组求和

http://www.crazyant.net/2594.html

python数据计算(pandas)100题

pandas入门十大例题

训练集数据资源地址：https://download.csdn.net/download/qq_41166909/30414605 本项目在AIStudio的地址https://aistudio.baidu.com/aistudio/projectdetail/2446775?contributionType=1&shared=1https://aistudio.baidu.com/aistudio/projectdetail/2446775?contributionType=1&shared=1

练习 1. 了解数据--某段时间内股票销售情况(stock)。 按照每题的要求，输入正确的代码并运行。

1 )- 导入正确的库——请导入Pandas。 In [12]: import pandas as pd

2 )- 从指定地址导入数据，并存入一个名为shopper的框架中。（data.csv） In [13]: shopper=pd.read_csv('data/data111257/data.csv')

4 )- 数据集中有多少列(columns) In [15]: number_of_rows = shopper.shape[1] number_of_rows Out[15]: 8

5 )- 打印全部列名称。 In [16]: shopper.columns Out[16]: Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country'], dtype='object')

6 )- 数据集的索引是怎样的。 In [17]: shopper.index Out[17]: RangeIndex(start=0, stop=36997, step=1)

7 )- 被下单最多StockCode是什么？ In [18]: a=shopper[['StockCode','Quantity']].groupby(by='StockCode').agg('sum') #根据股票编码分组 a=a.sort_values(['Quantity'],ascending=False) a.head(1) Out[18]: QuantityStockCode840774948

8 )- 在StockCode这一列中，一共有多少只股票被下单？ In [19]: pd.unique(shopper['StockCode']).size Out[19]: 2743

9 )- 在Description列中，下单最多的是哪种股票？ In [20]: shopper['StockCode'].value_counts().head(1) Out[20]: 85123A 209 Name: StockCode, dtype: int64

10 )- 一共有多少股票被下单？ In [21]: shopper['Quantity'].sum() Out[21]: 298210

11 )- 每一张单据(InvoiceNo)对应一笔交易，总共有多少笔交易？ In [22]: shopper['InvoiceNo'].unique().size Out[22]: 1771

练习 2. 数据过滤与排序——NBA比赛情况统计(NBA_stats.xlsx)。按照每步的要求，输入正确的代码并运行。

1 ) - 导入必要的库 In [25]: import pandas as pd

2 ) - 从以下地址导入数据集(NBA_stats.xlsx) In [26]: NBA2019=pd.read_excel('data/data111257/NBA_stats.xlsx')

5 )- 有多少球队参与了比赛？ In [29]: NBA2019['球队'].nunique Out[29]:

6 )-该数据集中一共有多少列(columns)? In [30]: NBA2019.shape[1] Out[30]: 20

11) - 选取球队名字含有“人”字的球队数据 In [37]: NBA2019[NBA2019.球队.str.startswith('人')] Out[37]:

14) - 找到“快船”和“国王”的“得分” In [40]: NBA2019.loc[NBA2019.球队.isin(['快船','国王']),['球队','得分']] Out[40]: 球队得分9快船114.012国王113.7

练习3. 数据分组----探索消费数据（OnlineRetail1.csv）

1)- 导入必要的库 In [41]: import pandas as pd

2)- 从以下地址导入数据 In [42]: path3='data/data111257/OnlineRetail.csv'

3)- 将数据框命名为ProductSales In [43]: ProductSales=pd.read_csv('data/data111257/OnlineRetail.csv')

练习 4. 运用函数--分析美国某次人口普查状况。 按照每题的要求，输入正确的代码并运行。

1)-导入必要的库 In [51]: import pandas as pd

2)-从以下地址导入数据集 In [52]: path4='data/data111257/adult.csv'

9)-哪个年龄(age)已婚( Married-civ-spouse)人士最多 In [60]: marry=census[census['marital-status'].str.contains('Married-civ-spouse')] marry.age.value_counts().head(1) Out[60]: 38 499 Name: age, dtype: int64

练习 5. 合并 1)-导入必要的库 In [61]: import pandas as pd

4)-将data1和data2两个数据框按照列的维度进行合并，命名为all_data In [66]: all_data=pd.concat([data1,data2]) all_data Out[66]: store_iditem_namesales0abook1001brule102cglue303dtape44epen600dstapler201enotebook602fpencil1503geraser404hsharpener50

6)-打印data3 In [68]: data3 Out[68]: store_idscore0a801b792c683d994e605f846g757h938i599j60

7)-按照store_id的值对all_data和data3作合并 In [69]: pd.merge(all_data,data3,on='store_id') Out[69]: store_iditem_namesalesscore0abook100801brule10792cglue30683dtape4994dstapler20995epen60606enotebook60607fpencil150848geraser40759hsharpener5093

8)-对‘data1’和‘data2’按照‘store_id’作连接 In [70]: pd.merge(data1,data2,on='store_id',how='inner') Out[70]: store_iditem_name_xsales_xitem_name_ysales_y0dtape4stapler201epen60notebook60

练习 6. 统计----气象数据(austin_weather.csv)

1-导入必要的库 In [72]: import pandas as pd import datetime

练习 7. 可视化----某班成绩可视化分析（StudentsPerformance.csv） 1)-导入必要的库 In [89]: import pandas as pd import numpy as np 2)-从以下地址导入数据集 In [90]: path7='data/data111257/StudentsPerformance.csv'

3)-将数据框命名为studentperformance In [91]: studentperformance=pd.read_csv('data/data111257/StudentsPerformance.csv')

6)-绘制一个展示math score成绩, 与学生性别gender和父母受教育程度parental level of education的散点图

7)-绘制一个展示reading score成绩的直方图 In [98]: #直方图 fig = plt.figure() ax = fig.add_subplot(1,1,1) ax.hist(studentperformance['reading score'],bins=30) plt.title('reading score') plt.xlabel('score') plt.ylabel('Total number') plt.show()

练习 8.创建数据框 In [ ]: 1)-导入必要的库 In [99]: import pandas as pd

4)-数据框的列排序是字母顺序，请重新修改为Country, Spices, Fish, Alcoholic, Eggs这个顺序 In [102]: diet=diet[['Country','Spices','Fish','Alcoholic','Eggs']]

5)-添加一个列Health diet['Health']=['yes','yes','no','yes','no'] In [103]: diet['Health']=['yes','yes','no','yes','no']

6)-查看每个列的数据类型 In [104]: diet.dtypes Out[104]: Country object Spices float64 Fish float64 Alcoholic float64 Eggs float64 Health object dtype: object

练习 9.时间序列(hyundi.csv) 1)-导入必要的库 In [105]: import pandas as pd import numpy as np #visualization import matplotlib.pyplot as plt %matplotlib inline 2)-从以下地址导入数据集 In [106]: path9='data/data111257/hyundi.csv'

4)-查看每一列的数据类型 In [108]: hyundai.dtypes Out[108]: model object year int64 price int64 transmission object mileage int64 fuelType object tax int64 mpg float64 engineSize float64 record object dtype: object

5)-将record这个列转换为datetime类型 In [109]: hyundai.record = pd.to_datetime(hyundai.record) hyundai['record'].head() Out[109]: 0 2021-06-20 1 2016-03-05 2 2016-06-18 3 2019-01-02 4 2020-01-07 Name: record, dtype: datetime64[ns]

7)-有重复的日期吗？ In [111]: hyundai.index.is_unique Out[111]: False

10)-数据集中最早的记录日期(record)和最晚的记录日期相差多少天？ In [114]: (hyundai.index.max() - hyundai.index.min()).days Out[114]: 1951

11)-在record数据中一共有多少个月？ In [115]: hyundai_month = hyundai.resample('BM').mean() len(hyundai_month.index) Out[115]: 65

练习 10.删除数据(drug200.csv) 1)-导入必要的库 In [117]: import pandas as pd import numpy as np 2)-从以下地址导入数据集 In [118]: path10='data/data111257/drug200.csv'

3)-将数据集存成变量drug In [119]: drug=pd.read_csv(path10) drug.head() Out[119]: 23FHIGHHIGH.125.355DrugY047MLOWHIGH13.093drugC147MLOWHIGH10.114drugC228FNORMALHIGH7.798drugX361FLOWHIGH18.043DrugY422FNORMALHIGH8.607drugX

5)-数据框中有缺失值吗？ In [121]: pd.isnull(drug).sum() Out[121]: Age 0 Sex 0 Blood_Pressure_Levels (BP) 0 Cholesterol_Levels 0 Na_to_Potassium_Ration（Na_to_K） 0 Drug_Class 0 dtype: int64

9)-将数据框前三行设置为缺失值 In [125]: drug.iloc[0:3,:]=np.nan drug.head() Out[125]: AgeSexBlood_Pressure_Levels (BP)Na_to_Potassium_Ration（Na_to_K）Drug_Class0NaNNaNNaNNaNNaN1NaNNaNNaNNaNNaN2NaNNaNNaNNaNNaN328.0FNORMAL7.798drugX461.0FLOW18.043DrugY

10)-删除有缺失值的行 In [126]: drug=drug.dropna(how='any') drug.head() Out[126]: AgeSexBlood_Pressure_Levels (BP)Na_to_Potassium_Ration（Na_to_K）Drug_Class328.0FNORMAL7.798drugX461.0FLOW18.043DrugY522.0FNORMAL8.607drugX649.0FNORMAL16.275DrugY741.0MLOW11.037drugC

Python相关栏目本月热门文章

训练集数据资源地址：https://download.csdn.net/download/qq_41166909/30414605
本项目在AIStudio的地址https://aistudio.baidu.com/aistudio/projectdetail/2446775?contributionType=1&shared=1https://aistudio.baidu.com/aistudio/projectdetail/2446775?contributionType=1&shared=1

练习 1. 了解数据--某段时间内股票销售情况(stock)。按照每题的要求，输入正确的代码并运行。

1 )- 导入正确的库——请导入Pandas。
In [12]:

import pandas as pd

2 )- 从指定地址导入数据，并存入一个名为shopper的框架中。（data.csv）
In [13]:

shopper=pd.read_csv('data/data111257/data.csv')

4 )- 数据集中有多少列(columns)
In [15]:

number_of_rows = shopper.shape[1] number_of_rows

Out[15]:

8

5 )- 打印全部列名称。
In [16]:

shopper.columns

Out[16]:

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country'], dtype='object')

6 )- 数据集的索引是怎样的。
In [17]:

shopper.index

Out[17]:

RangeIndex(start=0, stop=36997, step=1)

7 )- 被下单最多StockCode是什么？
In [18]:

a=shopper[['StockCode','Quantity']].groupby(by='StockCode').agg('sum') #根据股票编码分组 a=a.sort_values(['Quantity'],ascending=False) a.head(1)

Out[18]:

Quantity
StockCode
84077 4948

8 )- 在StockCode这一列中，一共有多少只股票被下单？
In [19]:

pd.unique(shopper['StockCode']).size

Out[19]:

2743

9 )- 在Description列中，下单最多的是哪种股票？
In [20]:

shopper['StockCode'].value_counts().head(1)

Out[20]:

85123A 209 Name: StockCode, dtype: int64

10 )- 一共有多少股票被下单？
In [21]:

shopper['Quantity'].sum()

Out[21]:

298210

11 )- 每一张单据(InvoiceNo)对应一笔交易，总共有多少笔交易？
In [22]:

shopper['InvoiceNo'].unique().size

Out[22]:

1771

1 ) - 导入必要的库
In [25]:

import pandas as pd

2 ) - 从以下地址导入数据集(NBA_stats.xlsx)
In [26]:

NBA2019=pd.read_excel('data/data111257/NBA_stats.xlsx')

5 )- 有多少球队参与了比赛？
In [29]:

NBA2019['球队'].nunique

Out[29]:

6 )-该数据集中一共有多少列(columns)?
In [30]:

NBA2019.shape[1]

Out[30]:

20

11) - 选取球队名字含有“人”字的球队数据
In [37]:

NBA2019[NBA2019.球队.str.startswith('人')]

Out[37]:

14) - 找到“快船”和“国王”的“得分”
In [40]:

NBA2019.loc[NBA2019.球队.isin(['快船','国王']),['球队','得分']]

Out[40]:

球队得分
9 快船 114.0
12 国王 113.7

1)- 导入必要的库
In [41]:

import pandas as pd

2)- 从以下地址导入数据
In [42]:

path3='data/data111257/OnlineRetail.csv'

3)- 将数据框命名为ProductSales
In [43]:

ProductSales=pd.read_csv('data/data111257/OnlineRetail.csv')

练习 4. 运用函数--分析美国某次人口普查状况。按照每题的要求，输入正确的代码并运行。

1)-导入必要的库
In [51]:

import pandas as pd

2)-从以下地址导入数据集
In [52]:

path4='data/data111257/adult.csv'

9)-哪个年龄(age)已婚( Married-civ-spouse)人士最多
In [60]:

marry=census[census['marital-status'].str.contains('Married-civ-spouse')] marry.age.value_counts().head(1)

Out[60]:

38 499 Name: age, dtype: int64

练习 5. 合并 1)-导入必要的库
In [61]:

import pandas as pd

6)-打印data3
In [68]:

data3

Out[68]:

store_id score
0 a 80
1 b 79
2 c 68
3 d 99
4 e 60
5 f 84
6 g 75
7 h 93
8 i 59
9 j 60

8)-对‘data1’和‘data2’按照‘store_id’作连接
In [70]:

pd.merge(data1,data2,on='store_id',how='inner')

Out[70]:

store_id item_name_x sales_x item_name_y sales_y
0 d tape 4 stapler 20
1 e pen 60 notebook 60

1-导入必要的库
In [72]:

import pandas as pd import datetime

练习 7. 可视化----某班成绩可视化分析（StudentsPerformance.csv） 1)-导入必要的库
In [89]:

import pandas as pd import numpy as np
2)-从以下地址导入数据集
In [90]:

path7='data/data111257/StudentsPerformance.csv'

3)-将数据框命名为studentperformance
In [91]:

studentperformance=pd.read_csv('data/data111257/StudentsPerformance.csv')

7)-绘制一个展示reading score成绩的直方图
In [98]:

#直方图 fig = plt.figure() ax = fig.add_subplot(1,1,1) ax.hist(studentperformance['reading score'],bins=30) plt.title('reading score') plt.xlabel('score') plt.ylabel('Total number') plt.show()

练习 8.创建数据框
In [ ]:

1)-导入必要的库
In [99]:

import pandas as pd

4)-数据框的列排序是字母顺序，请重新修改为Country, Spices, Fish, Alcoholic, Eggs这个顺序
In [102]:

diet=diet[['Country','Spices','Fish','Alcoholic','Eggs']]

5)-添加一个列Health
diet['Health']=['yes','yes','no','yes','no']

In [103]:

diet['Health']=['yes','yes','no','yes','no']

6)-查看每个列的数据类型
In [104]:

diet.dtypes

Out[104]:

Country object Spices float64 Fish float64 Alcoholic float64 Eggs float64 Health object dtype: object

练习 9.时间序列(hyundi.csv) 1)-导入必要的库
In [105]:

import pandas as pd import numpy as np #visualization import matplotlib.pyplot as plt %matplotlib inline
2)-从以下地址导入数据集
In [106]:

path9='data/data111257/hyundi.csv'

4)-查看每一列的数据类型
In [108]:

hyundai.dtypes

Out[108]:

model object year int64 price int64 transmission object mileage int64 fuelType object tax int64 mpg float64 engineSize float64 record object dtype: object

5)-将record这个列转换为datetime类型
In [109]:

hyundai.record = pd.to_datetime(hyundai.record) hyundai['record'].head()

Out[109]:

0 2021-06-20 1 2016-03-05 2 2016-06-18 3 2019-01-02 4 2020-01-07 Name: record, dtype: datetime64[ns]

7)-有重复的日期吗？
In [111]:

hyundai.index.is_unique

Out[111]:

False

10)-数据集中最早的记录日期(record)和最晚的记录日期相差多少天？
In [114]:

(hyundai.index.max() - hyundai.index.min()).days

Out[114]:

1951

11)-在record数据中一共有多少个月？
In [115]:

hyundai_month = hyundai.resample('BM').mean() len(hyundai_month.index)

Out[115]:

65

练习 10.删除数据(drug200.csv) 1)-导入必要的库
In [117]:

import pandas as pd import numpy as np
2)-从以下地址导入数据集
In [118]:

path10='data/data111257/drug200.csv'

3)-将数据集存成变量drug
In [119]:

drug=pd.read_csv(path10) drug.head()

Out[119]:

23 F HIGH HIGH.1 25.355 DrugY
0 47 M LOW HIGH 13.093 drugC
1 47 M LOW HIGH 10.114 drugC
2 28 F NORMAL HIGH 7.798 drugX
3 61 F LOW HIGH 18.043 DrugY
4 22 F NORMAL HIGH 8.607 drugX

5)-数据框中有缺失值吗？
In [121]:

pd.isnull(drug).sum()

Out[121]:

Age 0 Sex 0 Blood_Pressure_Levels (BP) 0 Cholesterol_Levels 0 Na_to_Potassium_Ration（Na_to_K） 0 Drug_Class 0 dtype: int64