栏目分类:
子分类:
返回
名师互学网用户登录
快速导航关闭
当前搜索
当前分类
子分类
实用工具
热门搜索
名师互学网 > IT > 软件开发 > 后端开发 > Python

python数据计算(pandas)100题

Python 更新时间: 发布时间: IT归档 最新发布 模块sitemap 名妆网 法律咨询 聚返吧 英语巴士网 伯小乐 网商动力

python数据计算(pandas)100题

pandas入门十大例题

训练集数据资源地址:https://download.csdn.net/download/qq_41166909/30414605

本项目在AIStudio的地址https://aistudio.baidu.com/aistudio/projectdetail/2446775?contributionType=1&shared=1https://aistudio.baidu.com/aistudio/projectdetail/2446775?contributionType=1&shared=1

练习 1. 了解数据--某段时间内股票销售情况(stock)。 按照每题的要求,输入正确的代码并运行。

1 )- 导入正确的库——请导入Pandas。

In [12]:

import pandas as pd

2 )- 从指定地址导入数据,并存入一个名为shopper的框架中。(data.csv)

In [13]:

shopper=pd.read_csv('data/data111257/data.csv')

3 )- 查看前15行内容。

In [14]:

shopper.head(15)

Out[14]:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER612/1/2010 8:262.5517850.0United Kingdom
153636571053WHITE metaL LANTERN612/1/2010 8:263.3917850.0United Kingdom
253636584406BCREAM CUPID HEARTS COAT HANGER812/1/2010 8:262.7517850.0United Kingdom
353636584029GKNITTED UNIOn FLAG HOT WATER BOTTLE612/1/2010 8:263.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.612/1/2010 8:263.3917850.0United Kingdom
553636522752SET 7 BABUSHKA NESTING BOXES212/1/2010 8:267.6517850.0United Kingdom
653636521730GLASS STAR FROSTED T-LIGHT HOLDER612/1/2010 8:264.2517850.0United Kingdom
753636622633HAND WARMER UNIOn JACK612/1/2010 8:281.8517850.0United Kingdom
853636622632HAND WARMER RED POLKA DOT612/1/2010 8:281.8517850.0United Kingdom
953636784879ASSORTED COLOUR BIRD ORNAMENT3212/1/2010 8:341.6913047.0United Kingdom
1053636722745POPPY'S PLAYHOUSE BEDROOM612/1/2010 8:342.1013047.0United Kingdom
1153636722748POPPY'S PLAYHOUSE KITCHEN612/1/2010 8:342.1013047.0United Kingdom
1253636722749FELTCRAFT PRINCESS CHARLOTTE DOLL812/1/2010 8:343.7513047.0United Kingdom
1353636722310IVORY KNITTED MUG COSY612/1/2010 8:341.6513047.0United Kingdom
1453636784969BOX OF 6 ASSORTED COLOUR TEASPOONS612/1/2010 8:344.2513047.0United Kingdom

4 )- 数据集中有多少列(columns)

In [15]:

number_of_rows = shopper.shape[1]
number_of_rows

Out[15]:

8

5 )- 打印全部列名称。

In [16]:

shopper.columns

Out[16]:

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

6 )- 数据集的索引是怎样的。

In [17]:

shopper.index

Out[17]:

RangeIndex(start=0, stop=36997, step=1)

7 )- 被下单最多StockCode是什么?

In [18]:

a=shopper[['StockCode','Quantity']].groupby(by='StockCode').agg('sum') #根据股票编码分组
a=a.sort_values(['Quantity'],ascending=False)
a.head(1)

Out[18]:

Quantity
StockCode
840774948

8 )- 在StockCode这一列中,一共有多少只股票被下单?

In [19]:

pd.unique(shopper['StockCode']).size  

Out[19]:

2743

9 )- 在Description列中,下单最多的是哪种股票?

In [20]:

shopper['StockCode'].value_counts().head(1)

Out[20]:

85123A    209
Name: StockCode, dtype: int64

10 )- 一共有多少股票被下单?

In [21]:

shopper['Quantity'].sum()

Out[21]:

298210

11 )- 每一张单据(InvoiceNo)对应一笔交易,总共有多少笔交易?

In [22]:

shopper['InvoiceNo'].unique().size

Out[22]:

1771

12 ).每一笔对应的平均总价是多少?

In [23]:

#求总收入
shopper['sub_total'] = round(shopper['UnitPrice'] * shopper['Quantity'],2)
shopper['sub_total'].sum()

Out[23]:

648004.1099999999

In [24]:

#求每一笔订单的平均总价是多少
shopper[['InvoiceNo','sub_total']].groupby(by=['InvoiceNo']
).agg({'sub_total':'sum'})['sub_total'].mean()

Out[24]:

365.8972953133823

练习 2. 数据过滤与排序——NBA比赛情况统计(NBA_stats.xlsx)。按照每步的要求,输入正确的代码并运行。

1 ) - 导入必要的库

In [25]:

import pandas as pd

2 ) - 从以下地址导入数据集(NBA_stats.xlsx)

In [26]:

NBA2019=pd.read_excel('data/data111257/NBA_stats.xlsx')

3 )- 将数据集命名为NBA2019

In [27]:

NBA2019=pd.read_excel('data/data111257/NBA_stats.xlsx')
NBA2019

Out[27]:

排名球队投篮命中率投篮命中投篮出手三分命中率三分命中三分出手罚球命中率罚球命中罚球出手总篮板进攻防守助攻失误抢断盖帽犯规得分
01雄鹿0.48744.791.80.38914.437.10.76016.221.448.110.337.825.513.828.134.6417.28120.1
12篮网0.49443.187.30.39214.236.10.80418.122.544.48.935.526.813.546.725.2619.04118.6
23奇才0.47543.290.90.35110.229.00.76920.126.245.29.735.525.514.407.334.1321.60116.6
34爵士0.46841.388.10.38916.743.00.79917.221.548.310.637.623.714.216.585.1518.54116.4
45开拓者0.45341.391.10.38515.740.80.82317.821.644.510.633.921.311.106.895.0418.92116.1
56太阳0.49043.388.30.37813.134.60.83415.618.742.98.834.226.912.537.184.3319.08115.3
67步行者0.47443.391.20.36412.334.00.79216.420.742.79.033.727.413.548.496.3920.18115.3
78掘金0.48543.289.20.37712.934.20.80315.719.544.410.533.926.813.508.084.4919.08115.1
89鹈鹕0.47742.589.10.34810.630.40.72919.026.147.411.735.726.014.617.574.3817.99114.6
910快船0.48241.886.70.41114.334.70.83916.219.344.29.434.724.413.197.074.1019.21114.0
1011勇士0.46841.388.20.37614.638.70.78516.621.143.08.035.127.715.008.154.7521.19113.7
1112老鹰0.46840.887.20.37312.433.40.81219.724.245.610.635.124.113.246.994.7519.33113.7
1213国王0.48142.688.60.36412.133.30.74516.422.041.49.432.025.513.387.544.9719.44113.7
131476人0.47641.486.90.37411.330.10.76719.625.545.110.035.023.714.449.106.2120.22113.6
1415灰熊0.46742.891.80.35611.231.40.77116.421.346.511.235.326.913.299.105.0618.74113.3
1516凯尔特人0.46641.588.90.37413.636.40.77516.120.844.310.633.623.514.067.725.3220.43112.6
1617独行侠0.47041.187.30.36213.838.10.77816.521.243.39.134.222.912.076.254.3219.39112.4
1718森林狼0.44840.790.90.34913.137.60.76117.623.143.510.533.025.614.268.785.5320.93112.1
1819猛龙0.44839.788.70.36814.539.30.81517.421.341.69.432.124.113.228.585.4021.19111.3
1920马刺0.46241.990.50.3509.928.40.79217.422.043.99.334.624.411.407.015.0817.96111.1
2021公牛0.47642.288.60.37012.634.00.79113.817.545.09.635.326.815.136.694.2218.92110.7
2122湖人0.47240.686.10.35411.131.20.73917.223.344.29.734.624.715.217.815.3619.13109.5
2223黄蜂0.45539.987.80.36913.737.00.76115.920.943.810.633.226.814.857.854.7818.03109.5
2324火箭0.44439.288.50.33913.840.60.74016.522.342.69.333.323.614.727.585.0119.54108.8
2425热火0.46839.283.70.35812.936.20.79016.721.141.58.033.526.314.077.903.9718.93108.1
2526尼克斯0.45639.486.50.39211.830.00.78416.420.945.19.735.521.412.947.045.0720.46107.0
2627活塞0.45238.785.60.35111.632.90.75917.823.442.79.633.124.214.937.385.1520.51106.6
2728雷霆0.44138.888.00.33911.935.10.72515.521.345.69.935.722.116.147.004.3918.13105.0
2829魔术0.42938.289.20.34310.931.80.77516.621.445.410.435.121.812.836.894.4217.18104.0
2930骑士0.45038.685.80.33610.029.70.74316.722.442.810.432.323.815.477.764.5118.17103.8

4 )-只选取 “得分” 这一列

In [28]:

NBA2019['得分']

Out[28]:

0     120.1
1     118.6
2     116.6
3     116.4
4     116.1
5     115.3
6     115.3
7     115.1
8     114.6
9     114.0
10    113.7
11    113.7
12    113.7
13    113.6
14    113.3
15    112.6
16    112.4
17    112.1
18    111.3
19    111.1
20    110.7
21    109.5
22    109.5
23    108.8
24    108.1
25    107.0
26    106.6
27    105.0
28    104.0
29    103.8
Name: 得分, dtype: float64

5 )- 有多少球队参与了比赛?

In [29]:

NBA2019['球队'].nunique

Out[29]:

6 )-该数据集中一共有多少列(columns)?

In [30]:

NBA2019.shape[1]

Out[30]:

20

7 )- 将数据集中的列“投篮命中率”、“投篮命中”和“投篮出手”单独存为一个名叫“投篮”的数据框

In [31]:

shots=NBA2019[['投篮命中率','投篮命中','投篮出手']]
shots

Out[31]:

投篮命中率投篮命中投篮出手
00.48744.791.8
10.49443.187.3
20.47543.290.9
30.46841.388.1
40.45341.391.1
50.49043.388.3
60.47443.391.2
70.48543.289.2
80.47742.589.1
90.48241.886.7
100.46841.388.2
110.46840.887.2
120.48142.688.6
130.47641.486.9
140.46742.891.8
150.46641.588.9
160.47041.187.3
170.44840.790.9
180.44839.788.7
190.46241.990.5
200.47642.288.6
210.47240.686.1
220.45539.987.8
230.44439.288.5
240.46839.283.7
250.45639.486.5
260.45238.785.6
270.44138.888.0
280.42938.289.2
290.45038.685.8

In [32]:

NBA2019.columns

Out[32]:

Index(['排名', '球队', '投篮命中率', '投篮命中', '投篮出手', '三分命中率', '三分命中', '三分出手', '罚球命中率',
       '罚球命中', '罚球出手', '总篮板', '进攻', '防守', '助攻', '失误', '抢断', '盖帽', '犯规', '得分'],
      dtype='object')

8 ) - 对数据框“投篮”按照先“投篮命中率”再“投篮出手”进行排序

In [33]:

shots.sort_values(['投篮命中率','投篮出手'],ascending=[False,False])

Out[33]:

投篮命中率投篮命中投篮出手
10.49443.187.3
50.49043.388.3
00.48744.791.8
70.48543.289.2
90.48241.886.7
120.48142.688.6
80.47742.589.1
200.47642.288.6
130.47641.486.9
20.47543.290.9
60.47443.391.2
210.47240.686.1
160.47041.187.3
100.46841.388.2
30.46841.388.1
110.46840.887.2
240.46839.283.7
140.46742.891.8
150.46641.588.9
190.46241.990.5
250.45639.486.5
220.45539.987.8
40.45341.391.1
260.45238.785.6
290.45038.685.8
170.44840.790.9
180.44839.788.7
230.44439.288.5
270.44138.888.0
280.42938.289.2

9 )- 计算每个球队“进攻”的平均值

In [34]:

discipline=NBA2019[['进攻']]
discipline

Out[34]:

进攻
010.3
18.9
29.7
310.6
410.6
58.8
69.0
710.5
811.7
99.4
108.0
1110.6
129.4
1310.0
1411.2
1510.6
169.1
1710.5
189.4
199.3
209.6
219.7
2210.6
239.3
248.0
259.7
269.6
279.9
2810.4
2910.4

In [35]:

round(discipline['进攻'].mean())

Out[35]:

10.0

10 )- 找到“罚球命中”小于16.3(不含)的球队数据

In [36]:

NBA2019[NBA2019.罚球命中>16.3]

Out[36]:

排名球队投篮命中率投篮命中投篮出手三分命中率三分命中三分出手罚球命中率罚球命中罚球出手总篮板进攻防守助攻失误抢断盖帽犯规得分
12篮网0.49443.187.30.39214.236.10.80418.122.544.48.935.526.813.546.725.2619.04118.6
23奇才0.47543.290.90.35110.229.00.76920.126.245.29.735.525.514.407.334.1321.60116.6
34爵士0.46841.388.10.38916.743.00.79917.221.548.310.637.623.714.216.585.1518.54116.4
45开拓者0.45341.391.10.38515.740.80.82317.821.644.510.633.921.311.106.895.0418.92116.1
67步行者0.47443.391.20.36412.334.00.79216.420.742.79.033.727.413.548.496.3920.18115.3
89鹈鹕0.47742.589.10.34810.630.40.72919.026.147.411.735.726.014.617.574.3817.99114.6
1011勇士0.46841.388.20.37614.638.70.78516.621.143.08.035.127.715.008.154.7521.19113.7
1112老鹰0.46840.887.20.37312.433.40.81219.724.245.610.635.124.113.246.994.7519.33113.7
1213国王0.48142.688.60.36412.133.30.74516.422.041.49.432.025.513.387.544.9719.44113.7
131476人0.47641.486.90.37411.330.10.76719.625.545.110.035.023.714.449.106.2120.22113.6
1415灰熊0.46742.891.80.35611.231.40.77116.421.346.511.235.326.913.299.105.0618.74113.3
1617独行侠0.47041.187.30.36213.838.10.77816.521.243.39.134.222.912.076.254.3219.39112.4
1718森林狼0.44840.790.90.34913.137.60.76117.623.143.510.533.025.614.268.785.5320.93112.1
1819猛龙0.44839.788.70.36814.539.30.81517.421.341.69.432.124.113.228.585.4021.19111.3
1920马刺0.46241.990.50.3509.928.40.79217.422.043.99.334.624.411.407.015.0817.96111.1
2122湖人0.47240.686.10.35411.131.20.73917.223.344.29.734.624.715.217.815.3619.13109.5
2324火箭0.44439.288.50.33913.840.60.74016.522.342.69.333.323.614.727.585.0119.54108.8
2425热火0.46839.283.70.35812.936.20.79016.721.141.58.033.526.314.077.903.9718.93108.1
2526尼克斯0.45639.486.50.39211.830.00.78416.420.945.19.735.521.412.947.045.0720.46107.0
2627活塞0.45238.785.60.35111.632.90.75917.823.442.79.633.124.214.937.385.1520.51106.6
2829魔术0.42938.289.20.34310.931.80.77516.621.445.410.435.121.812.836.894.4217.18104.0
2930骑士0.45038.685.80.33610.029.70.74316.722.442.810.432.323.815.477.764.5118.17103.8

11) - 选取球队名字含有“人”字的球队数据

In [37]:

NBA2019[NBA2019.球队.str.startswith('人')]

Out[37]:

12) - 选取前5列

In [38]:

NBA2019.iloc[:,0:5]

Out[38]:

排名球队投篮命中率投篮命中投篮出手
01雄鹿0.48744.791.8
12篮网0.49443.187.3
23奇才0.47543.290.9
34爵士0.46841.388.1
45开拓者0.45341.391.1
56太阳0.49043.388.3
67步行者0.47443.391.2
78掘金0.48543.289.2
89鹈鹕0.47742.589.1
910快船0.48241.886.7
1011勇士0.46841.388.2
1112老鹰0.46840.887.2
1213国王0.48142.688.6
131476人0.47641.486.9
1415灰熊0.46742.891.8
1516凯尔特人0.46641.588.9
1617独行侠0.47041.187.3
1718森林狼0.44840.790.9
1819猛龙0.44839.788.7
1920马刺0.46241.990.5
2021公牛0.47642.288.6
2122湖人0.47240.686.1
2223黄蜂0.45539.987.8
2324火箭0.44439.288.5
2425热火0.46839.283.7
2526尼克斯0.45639.486.5
2627活塞0.45238.785.6
2728雷霆0.44138.888.0
2829魔术0.42938.289.2
2930骑士0.45038.685.8

13) - 选取除了最后4列之外的全部列

In [39]:

NBA2019.iloc[:,:-4]

Out[39]:

排名球队投篮命中率投篮命中投篮出手三分命中率三分命中三分出手罚球命中率罚球命中罚球出手总篮板进攻防守助攻失误
01雄鹿0.48744.791.80.38914.437.10.76016.221.448.110.337.825.513.82
12篮网0.49443.187.30.39214.236.10.80418.122.544.48.935.526.813.54
23奇才0.47543.290.90.35110.229.00.76920.126.245.29.735.525.514.40
34爵士0.46841.388.10.38916.743.00.79917.221.548.310.637.623.714.21
45开拓者0.45341.391.10.38515.740.80.82317.821.644.510.633.921.311.10
56太阳0.49043.388.30.37813.134.60.83415.618.742.98.834.226.912.53
67步行者0.47443.391.20.36412.334.00.79216.420.742.79.033.727.413.54
78掘金0.48543.289.20.37712.934.20.80315.719.544.410.533.926.813.50
89鹈鹕0.47742.589.10.34810.630.40.72919.026.147.411.735.726.014.61
910快船0.48241.886.70.41114.334.70.83916.219.344.29.434.724.413.19
1011勇士0.46841.388.20.37614.638.70.78516.621.143.08.035.127.715.00
1112老鹰0.46840.887.20.37312.433.40.81219.724.245.610.635.124.113.24
1213国王0.48142.688.60.36412.133.30.74516.422.041.49.432.025.513.38
131476人0.47641.486.90.37411.330.10.76719.625.545.110.035.023.714.44
1415灰熊0.46742.891.80.35611.231.40.77116.421.346.511.235.326.913.29
1516凯尔特人0.46641.588.90.37413.636.40.77516.120.844.310.633.623.514.06
1617独行侠0.47041.187.30.36213.838.10.77816.521.243.39.134.222.912.07
1718森林狼0.44840.790.90.34913.137.60.76117.623.143.510.533.025.614.26
1819猛龙0.44839.788.70.36814.539.30.81517.421.341.69.432.124.113.22
1920马刺0.46241.990.50.3509.928.40.79217.422.043.99.334.624.411.40
2021公牛0.47642.288.60.37012.634.00.79113.817.545.09.635.326.815.13
2122湖人0.47240.686.10.35411.131.20.73917.223.344.29.734.624.715.21
2223黄蜂0.45539.987.80.36913.737.00.76115.920.943.810.633.226.814.85
2324火箭0.44439.288.50.33913.840.60.74016.522.342.69.333.323.614.72
2425热火0.46839.283.70.35812.936.20.79016.721.141.58.033.526.314.07
2526尼克斯0.45639.486.50.39211.830.00.78416.420.945.19.735.521.412.94
2627活塞0.45238.785.60.35111.632.90.75917.823.442.79.633.124.214.93
2728雷霆0.44138.888.00.33911.935.10.72515.521.345.69.935.722.116.14
2829魔术0.42938.289.20.34310.931.80.77516.621.445.410.435.121.812.83
2930骑士0.45038.685.80.33610.029.70.74316.722.442.810.432.323.815.47

14) - 找到“快船”和“国王”的“得分”

In [40]:

NBA2019.loc[NBA2019.球队.isin(['快船','国王']),['球队','得分']]

Out[40]:

球队得分
9快船114.0
12国王113.7

练习3. 数据分组----探索消费数据(OnlineRetail1.csv)

1)- 导入必要的库

In [41]:

import pandas as pd

2)- 从以下地址导入数据

In [42]:

path3='data/data111257/OnlineRetail.csv'

3)- 将数据框命名为ProductSales

In [43]:

ProductSales=pd.read_csv('data/data111257/OnlineRetail.csv')

4)- 哪个国家(Country)销售收入(Quantity*UnitPrice)更多?

In [44]:

ProductSales['TotalCost'] = ProductSales['UnitPrice'] * ProductSales['Quantity']
cost_per_country = ProductSales.groupby('Country')['TotalCost'].sum()
cost_per_country.idxmax()

Out[44]:

'Canada'

In [45]:

ProductSales.head()

Out[45]:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountryTotalCost
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62021/8/212.5517850United Kingdom15.30
153636571053WHITE metaL LANTERN62021/8/123.3917850United Kingdom20.34
253636584406BCREAM CUPID HEARTS COAT HANGER82021/7/242.7517850United Kingdom22.00
353636584029GKNITTED UNIOn FLAG HOT WATER BOTTLE62021/8/103.3917850United Kingdom20.34
453636584029ERED WOOLLY HOTTIE WHITE HEART.62021/9/123.3917850United Kingdom20.34

5)- 哪个月销售收入更多?

In [46]:

ProductSales['Month'] = ProductSales['InvoiceDate'].apply(lambda x:x[:6]).tolist()

In [47]:

ProductSales.head()

Out[47]:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountryTotalCostMonth
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62021/8/212.5517850United Kingdom15.302021/8
153636571053WHITE metaL LANTERN62021/8/123.3917850United Kingdom20.342021/8
253636584406BCREAM CUPID HEARTS COAT HANGER82021/7/242.7517850United Kingdom22.002021/7
353636584029GKNITTED UNIOn FLAG HOT WATER BOTTLE62021/8/103.3917850United Kingdom20.342021/8
453636584029ERED WOOLLY HOTTIE WHITE HEART.62021/9/123.3917850United Kingdom20.342021/9

In [48]:

ProductSales['TotalCost'] = ProductSales['UnitPrice'] * ProductSales['Quantity']
cost_per_country = ProductSales.groupby('Month')['TotalCost'].sum()
cost_per_country

Out[48]:

Month
2021/7     431.35
2021/8    1570.94
2021/9     667.43
Name: TotalCost, dtype: float64

In [49]:

cost_per_country.idxmax()

Out[49]:

'2021/8'

6)- 打印出每个国家销售数量的平均值,最大值和最小值

In [50]:

ProductSales.groupby('Country').Quantity.agg(['mean', 'min', 'max'])

Out[50]:

meanminmax
Country
Australia10.500000520
Belgium7.000000224
Canada9.625000116
Channel Islands11.222222124
France22.500000336
Germany10.750000424
Italy10.750000124
USA6.750000212
United Kingdom9.263158280

练习 4. 运用函数--分析美国某次人口普查状况。 按照每题的要求,输入正确的代码并运行。

1)-导入必要的库

In [51]:

import pandas as pd

2)-从以下地址导入数据集

In [52]:

path4='data/data111257/adult.csv'

3)-将数据框命名为census

In [53]:

census=pd.read_csv('data/data111257/adult.csv')

In [54]:

census.head()

Out[54]:

yearageworkclassfnlwgteducationeducation-nummarital-statusoccupationrelationshipracesexcapital-gaincapital-losshours-per-weeknative-countrysalary
0196649Private101320Assoc-acdm12.0Married-civ-spouseNaNWifeWhiteFemale0190240United-States>=50k
1197144Private236746Masters14.0DivorcedExec-managerialNot-in-familyWhiteMale10520045United-States>=50k
2197738Private96185HS-gradNaNDivorcedNaNUnmarriedBlackFemale0032United-States<50k
3197738Self-emp-inc112847Prof-school15.0Married-civ-spouseProf-specialtyHusbandAsian-Pac-IslanderMale0040United-States>=50k
4197342Self-emp-not-inc822977th-8thNaNMarried-civ-spouseOther-serviceWifeBlackFemale0050United-States<50k

4)-每一列(column)的数据类型是什么样的?

In [55]:

census.info()

RangeIndex: 32561 entries, 0 to 32560
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   year            32561 non-null  int64  
 1   age             32561 non-null  int64  
 2   workclass       32561 non-null  object 
 3   fnlwgt          32561 non-null  int64  
 4   education       32561 non-null  object 
 5   education-num   32074 non-null  float64
 6   marital-status  32561 non-null  object 
 7   occupation      32049 non-null  object 
 8   relationship    32561 non-null  object 
 9   race            32561 non-null  object 
 10  sex             32561 non-null  object 
 11  capital-gain    32561 non-null  int64  
 12  capital-loss    32561 non-null  int64  
 13  hours-per-week  32561 non-null  int64  
 14  native-country  32561 non-null  object 
 15  salary          32561 non-null  object 
dtypes: float64(1), int64(6), object(9)
memory usage: 4.0+ MB

5)-将year的数据类型转换为 datetime64

In [56]:

census.year = pd.to_datetime(census.year, format='%Y')
census.info()

RangeIndex: 32561 entries, 0 to 32560
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   year            32561 non-null  datetime64[ns]
 1   age             32561 non-null  int64         
 2   workclass       32561 non-null  object        
 3   fnlwgt          32561 non-null  int64         
 4   education       32561 non-null  object        
 5   education-num   32074 non-null  float64       
 6   marital-status  32561 non-null  object        
 7   occupation      32049 non-null  object        
 8   relationship    32561 non-null  object        
 9   race            32561 non-null  object        
 10  sex             32561 non-null  object        
 11  capital-gain    32561 non-null  int64         
 12  capital-loss    32561 non-null  int64         
 13  hours-per-week  32561 non-null  int64         
 14  native-country  32561 non-null  object        
 15  salary          32561 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(5), object(9)
memory usage: 4.0+ MB

6)-将列year设置为数据框的索引

In [57]:

census = census.set_index('year', drop = True)
census.head()

Out[57]:

ageworkclassfnlwgteducationeducation-nummarital-statusoccupationrelationshipracesexcapital-gaincapital-losshours-per-weeknative-countrysalary
year
1966-01-0149Private101320Assoc-acdm12.0Married-civ-spouseNaNWifeWhiteFemale0190240United-States>=50k
1971-01-0144Private236746Masters14.0DivorcedExec-managerialNot-in-familyWhiteMale10520045United-States>=50k
1977-01-0138Private96185HS-gradNaNDivorcedNaNUnmarriedBlackFemale0032United-States<50k
1977-01-0138Self-emp-inc112847Prof-school15.0Married-civ-spouseProf-specialtyHusbandAsian-Pac-IslanderMale0040United-States>=50k
1973-01-0142Self-emp-not-inc822977th-8thNaNMarried-civ-spouseOther-serviceWifeBlackFemale0050United-States<50k

7)-删除名为fnlwgt的列

In [58]:

del census['fnlwgt']
census.head()

Out[58]:

ageworkclasseducationeducation-nummarital-statusoccupationrelationshipracesexcapital-gaincapital-losshours-per-weeknative-countrysalary
year
1966-01-0149PrivateAssoc-acdm12.0Married-civ-spouseNaNWifeWhiteFemale0190240United-States>=50k
1971-01-0144PrivateMasters14.0DivorcedExec-managerialNot-in-familyWhiteMale10520045United-States>=50k
1977-01-0138PrivateHS-gradNaNDivorcedNaNUnmarriedBlackFemale0032United-States<50k
1977-01-0138Self-emp-incProf-school15.0Married-civ-spouseProf-specialtyHusbandAsian-Pac-IslanderMale0040United-States>=50k
1973-01-0142Self-emp-not-inc7th-8thNaNMarried-civ-spouseOther-serviceWifeBlackFemale0050United-States<50k

8)-按照year对数据框进行分组,并对hours-per-week求和

ProductSales['TotalCost'] = ProductSales['UnitPrice'] * ProductSales['Quantity'] cost_per_country = ProductSales.groupby('InvoiceDate')['TotalCost'].sum() cost_per_country.idxmax()

In [59]:

census.groupby('year')['hours-per-week'].agg(sum)

Out[59]:

year
1925-01-01     1583
1927-01-01      120
1928-01-01        2
1929-01-01       40
1930-01-01       88
              ...  
1994-01-01    24505
1995-01-01    24307
1996-01-01    21843
1997-01-01    14252
1998-01-01     8440
Name: hours-per-week, Length: 73, dtype: int64

9)-哪个年龄(age)已婚( Married-civ-spouse)人士最多

In [60]:

marry=census[census['marital-status'].str.contains('Married-civ-spouse')]
marry.age.value_counts().head(1)

Out[60]:

38    499
Name: age, dtype: int64

练习 5. 合并 1)-导入必要的库

In [61]:

import pandas as pd

2)-按照如下的元数据内容创建数据框
raw_data_1={
        'store_id':['a','b','c','d','e'],
        'item_name':['book','rule','glue','tape','pen']
        'sales':[100,10,30,4,60]}
raw_data_2={
        'store_id':['d',e','f','g','h'],
        'item_name':['stapler','notebook','pencil','eraser','sharpener']
        'sales':[20,60,150,40,50]}
raw_data_3={
        'store_id':['a','b','c','d',e','f','g','h','i','j'],
        'score':[80,79,68,99,60,84,75,93,59,60] }          

In [62]:

raw_data_1={
		'store_id':['a','b','c','d','e'],
		'item_name':['book','rule','glue','tape','pen'],
		'sales':[100,10,30,4,60]}
raw_data_2={
		'store_id':['d','e','f','g','h'],
		'item_name':['stapler','notebook','pencil','eraser','sharpener'],
		'sales':[20,60,150,40,50]}
raw_data_3={
		'store_id':['a','b','c','d','e','f','g','h','i','j'],
		'score':[80,79,68,99,60,84,75,93,59,60] }  

In [63]:

raw_data_1

Out[63]:

{'store_id': ['a', 'b', 'c', 'd', 'e'],
 'item_name': ['book', 'rule', 'glue', 'tape', 'pen'],
 'sales': [100, 10, 30, 4, 60]}

3)-将上述的数据框分别命名为data1, data2, data3

In [64]:

data1 = pd.Dataframe(raw_data_1, columns = ['store_id', 'item_name', 'sales'])
data2 = pd.Dataframe(raw_data_2, columns = ['store_id', 'item_name', 'sales'])
data3 = pd.Dataframe(raw_data_3, columns = ['store_id','score'])

In [65]:

data1

Out[65]:

store_iditem_namesales
0abook100
1brule10
2cglue30
3dtape4
4epen60

4)-将data1和data2两个数据框按照列的维度进行合并,命名为all_data

In [66]:

all_data=pd.concat([data1,data2])
all_data

Out[66]:

store_iditem_namesales
0abook100
1brule10
2cglue30
3dtape4
4epen60
0dstapler20
1enotebook60
2fpencil150
3geraser40
4hsharpener50

5)-将data1和data2两个数据框按照列的维度进行合并,命名为all_data_col

In [67]:

all_data_col=pd.concat([data1,data2],axis=1)
all_data_col

Out[67]:

store_iditem_namesalesstore_iditem_namesales
0abook100dstapler20
1brule10enotebook60
2cglue30fpencil150
3dtape4geraser40
4epen60hsharpener50

6)-打印data3

In [68]:

data3

Out[68]:

store_idscore
0a80
1b79
2c68
3d99
4e60
5f84
6g75
7h93
8i59
9j60

7)-按照store_id的值对all_data和data3作合并

In [69]:

pd.merge(all_data,data3,on='store_id')

Out[69]:

store_iditem_namesalesscore
0abook10080
1brule1079
2cglue3068
3dtape499
4dstapler2099
5epen6060
6enotebook6060
7fpencil15084
8geraser4075
9hsharpener5093

8)-对‘data1’和‘data2’按照‘store_id’作连接

In [70]:

pd.merge(data1,data2,on='store_id',how='inner')

Out[70]:

store_iditem_name_xsales_xitem_name_ysales_y
0dtape4stapler20
1epen60notebook60

9)-找到 data1 和 data2 合并之后的所有匹配结果

In [71]:

pd.merge(data1, data2, on='store_id', how='outer')

Out[71]:

store_iditem_name_xsales_xitem_name_ysales_y
0abook100.0NaNNaN
1brule10.0NaNNaN
2cglue30.0NaNNaN
3dtape4.0stapler20.0
4epen60.0notebook60.0
5fNaNNaNpencil150.0
6gNaNNaNeraser40.0
7hNaNNaNsharpener50.0

练习 6. 统计----气象数据(austin_weather.csv)

1-导入必要的库

In [72]:

import pandas as pd
import datetime

2)- 从以下地址导入数据集

In [73]:

path6='data/data111257/austin_weather.csv'

In [74]:

df6=pd.read_csv('data/data111257/austin_weather.csv')

In [75]:

df6

Out[75]:

YearMonthDayLocationTempHighFTempLowFDewPointHighFDewPointAvgFDewPointLowFHumidityHighPercent...SeaLevelPressureAvgInchesSeaLevelPressureLowInchesVisibilityHighMilesVisibilityAvgMilesVisibilityLowMilesWindHighMPHWindAvgMPHWindGustMPHPrecipitationSumInchesEvents
0201312212.0744567494393...29.6829.591072204310.46Rain , Thunderstorm
1201312227.0563943362893...30.1329.8710105166250
2201312236.0583231272376...30.4930.4110101083120
3201312247.0613136282189...30.4530.310107124200
4201312259.0584144403686...30.3330.271010710216T
..................................................................
131420177272.01037571676182...29.9729.88101010125210
131520177288.01057671645587...29.929.81101010145200
131620177299.01077772645582...29.8629.79101010124170
131720177304.01067970686369...29.9129.87101010134200
131820177319.0997766615464...29.9729.91101010124200

1319 rows × 23 columns

3)- 将数据作存储并且设置前三列(Year, Month, Day)为合适的索引Date

In [76]:

data6=pd.read_csv(path6,parse_dates=['Year','Month','Day'])
data6.head()

Out[76]:

YearMonthDayLocationTempHighFTempLowFDewPointHighFDewPointAvgFDewPointLowFHumidityHighPercent...SeaLevelPressureAvgInchesSeaLevelPressureLowInchesVisibilityHighMilesVisibilityAvgMilesVisibilityLowMilesWindHighMPHWindAvgMPHWindGustMPHPrecipitationSumInchesEvents
02013-01-0112212.0744567494393...29.6829.591072204310.46Rain , Thunderstorm
12013-01-0112227.0563943362893...30.1329.8710105166250
22013-01-0112236.0583231272376...30.4930.4110101083120
32013-01-0112247.0613136282189...30.4530.310107124200
42013-01-0112259.0584144403686...30.3330.271010710216T

5 rows × 23 columns

4)- 2034、2065年?我们真的有这一年的数据?创建一个函数并用它去修复这个bug

In [77]:

def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)

# apply the function fix_century on the column and replace the values to the right ones
data6['Year'] = data6['Year'].apply(fix_century)

# data.info()
data6.head()

Out[77]:

YearMonthDayLocationTempHighFTempLowFDewPointHighFDewPointAvgFDewPointLowFHumidityHighPercent...SeaLevelPressureAvgInchesSeaLevelPressureLowInchesVisibilityHighMilesVisibilityAvgMilesVisibilityLowMilesWindHighMPHWindAvgMPHWindGustMPHPrecipitationSumInchesEvents
01913-01-0112212.0744567494393...29.6829.591072204310.46Rain , Thunderstorm
11913-01-0112227.0563943362893...30.1329.8710105166250
21913-01-0112236.0583231272376...30.4930.4110101083120
31913-01-0112247.0613136282189...30.4530.310107124200
41913-01-0112259.0584144403686...30.3330.271010710216T

5 rows × 23 columns

5)- 将日期(Date)设为索引,注意数据类型,应该是datetime64

In [78]:

data6["Year"] = pd.to_datetime(data6["Year"])

In [79]:

data6 = data6.set_index('Year')

In [80]:

data6.head()

Out[80]:

MonthDayLocationTempHighFTempLowFDewPointHighFDewPointAvgFDewPointLowFHumidityHighPercentHumidityAvgPercent...SeaLevelPressureAvgInchesSeaLevelPressureLowInchesVisibilityHighMilesVisibilityAvgMilesVisibilityLowMilesWindHighMPHWindAvgMPHWindGustMPHPrecipitationSumInchesEvents
Year
1913-01-0112212.074456749439375...29.6829.591072204310.46Rain , Thunderstorm
1913-01-0112227.056394336289368...30.1329.8710105166250
1913-01-0112236.058323127237652...30.4930.4110101083120
1913-01-0112247.061313628218956...30.4530.310107124200
1913-01-0112259.058414440368671...30.3330.271010710216T

5 rows × 22 columns

6)- 对应每一个DewPointAvgF,一共有多少数据值缺失

In [81]:

data6.isnull().sum()

Out[81]:

Month                          0
Day                            0
Location                      12
TempHighF                      0
TempLowF                       0
DewPointHighF                  0
DewPointAvgF                   0
DewPointLowF                   0
HumidityHighPercent            0
HumidityAvgPercent             0
HumidityLowPercent             0
SeaLevelPressureHighInches     0
SeaLevelPressureAvgInches      0
SeaLevelPressureLowInches      0
VisibilityHighMiles            0
VisibilityAvgMiles             0
VisibilityLowMiles             0
WindHighMPH                    0
WindAvgMPH                     0
WindGustMPH                    0
PrecipitationSumInches         0
Events                         0
dtype: int64

7)- 对应每一个DewPointAvgF,一共有多少完整的数据值

In [82]:

data6.shape[0] - data6.isnull().sum()

Out[82]:

Month                         1319
Day                           1319
Location                      1307
TempHighF                     1319
TempLowF                      1319
DewPointHighF                 1319
DewPointAvgF                  1319
DewPointLowF                  1319
HumidityHighPercent           1319
HumidityAvgPercent            1319
HumidityLowPercent            1319
SeaLevelPressureHighInches    1319
SeaLevelPressureAvgInches     1319
SeaLevelPressureLowInches     1319
VisibilityHighMiles           1319
VisibilityAvgMiles            1319
VisibilityLowMiles            1319
WindHighMPH                   1319
WindAvgMPH                    1319
WindGustMPH                   1319
PrecipitationSumInches        1319
Events                        1319
dtype: int64

8)- 对于全体数据,计算温度(TempF)的平均值

In [83]:

a68=data6.mean()
a68

Out[83]:

Month              inf
Day                inf
Location      5.435348
TempHighF    80.862775
TempLowF     59.902957
dtype: float64

In [84]:

a68[3],a68[4]

Out[84]:

(80.86277482941622, 59.902956785443514)

In [85]:

avg_TempF=0.5*(a68[3]+a68[4])
avg_TempF

Out[85]:

70.38286580742987

9)- 创建一个名为loc_stats的数据框去计算并存储每个location多年湿度(Humidity)的最小值,最大值,平均值和标准差

In [86]:

loc_stats=pd.Dataframe()

loc_stats['min']=data6.min()
loc_stats['max']=data6.max()
loc_stats['mean']=data6.mean()
loc_stats['std']=data6.std()#standard deviations
loc_stats

Out[86]:

minmaxmeanstd
Month19infNaN
Day19infNaN
Location1105.4353482.830496
TempHighF3210780.86277514.766523
TempLowF198159.90295714.190648
DewPointHighF-80NaNNaN
DewPointAvgF-8NaNNaN
DewPointLowF-9NaNNaN
HumidityHighPercent-97NaNNaN
HumidityAvgPercent-97NaNNaN
HumidityLowPercent-93NaNNaN
SeaLevelPressureHighInches-30.83NaNNaN
SeaLevelPressureAvgInches-30.74NaNNaN
SeaLevelPressureLowInches-30.61NaNNaN
VisibilityHighMiles-9NaNNaN
VisibilityAvgMiles-9NaNNaN
VisibilityLowMiles-9NaNNaN
WindHighMPH-9NaNNaN
WindAvgMPH-9NaNNaN
WindGustMPH-9NaNNaN
PrecipitationSumInches0TNaNNaN
EventsThunderstormNaNNaN

10)- 创建一个名为day_stats的数据框去计算并存储所有location中风速最小值,最大值,平均值和标准差

In [87]:

day_status=pd.Dataframe()

day_status['min']=data6.min(axis=1)
day_status['max']=data6.max(axis=1)
day_status['mean']=data6.mean(axis=1)
day_status['std']=data6.std(axis=1)

day_status.head()

Out[87]:

minmaxmeanstd
Year
1913-01-012.074.040.33333336.226142
1913-01-017.056.034.00000024.879711
1913-01-016.058.032.00000026.000000
1913-01-017.061.033.00000027.055499
1913-01-019.058.036.00000024.879711

11)- 对于每一个location,计算多年的平均SeaLevelPressure值。

In [88]:

data6['SeaLevelPressureAvgInches'] = data6['SeaLevelPressureAvgInches'].apply(pd.to_numeric, errors='coerce').fillna(0.0)
data6.groupby(['Location','Year'])['SeaLevelPressureAvgInches'].mean()

Out[88]:

Location  Year      
1.0       1914-01-01    30.021818
          1915-01-01    29.987059
          1916-01-01    29.999512
          1917-01-01    29.922500
2.0       1913-01-01    29.680000
          1914-01-01    29.975667
          1915-01-01    30.020303
          1916-01-01    30.048750
          1917-01-01    29.988750
3.0       1913-01-01    30.360000
          1914-01-01    30.083333
          1915-01-01    29.390000
          1916-01-01    30.029000
          1917-01-01    30.025455
4.0       1914-01-01    30.079535
          1915-01-01    30.023023
          1916-01-01    30.073438
          1917-01-01    29.999412
5.0       1914-01-01    30.004848
          1915-01-01    30.023750
          1916-01-01    30.031667
          1917-01-01    29.986316
6.0       1913-01-01    30.253333
          1914-01-01    30.022045
          1915-01-01    30.032250
          1916-01-01    30.021579
          1917-01-01    30.021364
7.0       1913-01-01    30.290000
          1914-01-01    30.024242
          1915-01-01    28.356216
          1916-01-01    30.016216
          1917-01-01    30.004444
8.0       1914-01-01    30.035517
          1915-01-01    30.041304
          1916-01-01    29.996667
          1917-01-01    29.954762
          1934-01-01    29.870000
9.0       1913-01-01    30.365000
          1914-01-01    30.003889
          1915-01-01    30.050270
          1916-01-01    30.074848
          1917-01-01    29.948182
          1965-01-01    30.060000
10.0      1914-01-01    30.036389
          1915-01-01    30.063438
          1916-01-01    30.012368
          1917-01-01    29.898571
Name: SeaLevelPressureAvgInches, dtype: float64

练习 7. 可视化----某班成绩可视化分析(StudentsPerformance.csv) 1)-导入必要的库

In [89]:

import pandas as pd
import numpy as np
2)-从以下地址导入数据集

In [90]:

path7='data/data111257/StudentsPerformance.csv'

3)-将数据框命名为studentperformance

In [91]:

studentperformance=pd.read_csv('data/data111257/StudentsPerformance.csv')

4)-将studentID设置为索引

In [92]:

studentperformance=studentperformance.set_index('studentID',drop=True)
studentperformance.head()

Out[92]:

genderrace/ethnicityparental level of educationlunchtest preparation coursemath scorereading scorewriting score
studentID
1femalegroup Bbachelor's degreestandardnone727274
2femalegroup Csome collegestandardcompleted699088
3femalegroup Bmaster's degreestandardnone909593
4malegroup Aassociate's degreefree/reducednone475744
5malegroup Csome collegestandardnone767875

5)-绘制一个展示男女生比例的扇形图

In [93]:

#统计性别数量
aa=studentperformance.groupby(['gender'])['race/ethnicity'].count()
aa

Out[93]:

gender
female    518
male      482
Name: race/ethnicity, dtype: int64

In [94]:

#计算性别比例
female_rate=aa[0]/aa.sum()
male_rate=aa[1]/aa.sum()
female_rate,male_rate

Out[94]:

(0.518, 0.482)

In [128]:

#-*- coding: utf-8 -*-
from matplotlib import pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei'] #解决中文乱码
plt.figure(figsize=(6,9)) #调节图形大小

labels = [u'female_rate',u'male_rate',] #定义标签
sizes = [female_rate,male_rate] #每块值
colors = ['blue', 'red'] #每块颜色定义
explode = (0,0) #将某一块分割出来,值越大分割出的间隙越大
patches,text1,text2 = plt.pie(sizes,
                      explode=explode,
                      labels=labels,
                      colors=colors,
                      autopct = '%3.2f%%', #数值保留固定小数位
                      shadow = False, #无阴影设置
                      startangle =90, #逆时针起始角度设置
                      pctdistance = 0.2) #数值距圆心半径倍数距离
#patches饼图的返回值,texts1饼图外label的文本,texts2饼图内部的文本
# x,y轴刻度设置一致,保证饼图为圆形
plt.axis('equal')
plt.legend() # 右上角显示
plt.show()
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/font_manager.py:1331: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))

方法2:

#绘制学生性别比例扇形图
Male = (studentperformance.gender == 'male').sum()
Female = (studentperformance.gender == 'female').sum()
proportions = [Male,Female]

plt.pie(proportions, labels=['Male','Female'],shadow=True,
        autopct='%1.1f%%',startangle=90,explode=(0.15,0))
plt.axis('equal')
plt.title('Sex Proportion')
plt.tight_layout()
plt.show()

6)-绘制一个展示math score成绩, 与学生性别gender和父母受教育程度parental level of education的散点图

散点图

fig = plt.figure() ax=fig.add_subplot(1,1,1) ax.scatter(df['Chinese'],df['Math']) plt.xlabel('Chinese') plt.ylabel('Math') plt.title('语文和数学成绩散点图') plt.show()

In [96]:

#散点图 修改
fig = plt.figure()

In [97]:

import seaborn as sns
lm = sns.lmplot(x='gender',y='math score',data=studentperformance,hue='parental level of education',fit_reg=False)
lm.set(title='Math Score x Age')
axes = lm.axes
axes[0,0].set_ylim(0,100)
axes[0,0].set_xlim(-5,5)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/font_manager.py:1331: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/font_manager.py:1331: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))

Out[97]:

(-5, 5)

7)-绘制一个展示reading score成绩的直方图

In [98]:

#直方图
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.hist(studentperformance['reading score'],bins=30)
plt.title('reading score')
plt.xlabel('score')
plt.ylabel('Total number')
plt.show()

练习 8.创建数据框

In [ ]:

 
1)-导入必要的库 

In [99]:

import pandas as pd

2)-创建一个数据字典
raw_data = {
            "Alcoholic": [0,0,0,0,0.0976],
            "Country": ['United Kingdom','China','Algeria','Bangladesh','Korea South'],
            "Eggs": [1.0778,2.3066,1.2171,1.2664,1.3524],
            "Fish": [0.5176,0.9817,0.2008,2.1646,1.9147],
            "Spices": [0.1276,0.0676,0.1568,1.0602,0.0372]
        }

In [100]:

raw_data = {
			"Alcoholic": [0,0,0,0,0.0976],
			"Country": ['United Kingdom','China','Algeria','Bangladesh','Korea South'],
			"Eggs": [1.0778,2.3066,1.2171,1.2664,1.3524],
			"Fish": [0.5176,0.9817,0.2008,2.1646,1.9147],
			"Spices": [0.1276,0.0676,0.1568,1.0602,0.0372]
        }

3)-将数据字典存为一个名叫diet的数据框中

In [101]:

diet = pd.Dataframe(raw_data)
diet.head()

Out[101]:

AlcoholicCountryEggsFishSpices
00.0000United Kingdom1.07780.51760.1276
10.0000China2.30660.98170.0676
20.0000Algeria1.21710.20080.1568
30.0000Bangladesh1.26642.16461.0602
40.0976Korea South1.35241.91470.0372

4)-数据框的列排序是字母顺序,请重新修改为Country, Spices, Fish, Alcoholic, Eggs这个顺序

In [102]:

diet=diet[['Country','Spices','Fish','Alcoholic','Eggs']]

5)-添加一个列Health
    diet['Health']=['yes','yes','no','yes','no']

In [103]:

diet['Health']=['yes','yes','no','yes','no']

6)-查看每个列的数据类型

In [104]:

diet.dtypes

Out[104]:

Country       object
Spices       float64
Fish         float64
Alcoholic    float64
Eggs         float64
Health        object
dtype: object

练习 9.时间序列(hyundi.csv) 1)-导入必要的库

In [105]:

import pandas as pd
import numpy as np 
#visualization
import matplotlib.pyplot as plt 

%matplotlib inline
2)-从以下地址导入数据集

In [106]:

path9='data/data111257/hyundi.csv'

3)-读取数据并存为一个名叫hyundai的数据框

In [107]:

hyundai = pd.read_csv(path9)
hyundai.head()

Out[107]:

modelyearpricetransmissionmileagefuelTypetaxmpgengineSizerecord
0I2020177999Manual17307Petrol14558.91.22021/6/20
1Tucson201614499Automatic25233Diesel23543.52.02016/3/5
2Tucson201611399Manual37877Diesel3061.71.72016/6/18
3I1020166499Manual23789Petrol2060.11.02019/1/2
4IX35201510199Manual33177Diesel16051.42.02020/1/7

4)-查看每一列的数据类型

In [108]:

hyundai.dtypes

Out[108]:

model            object
year              int64
price             int64
transmission     object
mileage           int64
fuelType         object
tax               int64
mpg             float64
engineSize      float64
record           object
dtype: object

5)-将record这个列转换为datetime类型

In [109]:

hyundai.record = pd.to_datetime(hyundai.record)
hyundai['record'].head()

Out[109]:

0   2021-06-20
1   2016-03-05
2   2016-06-18
3   2019-01-02
4   2020-01-07
Name: record, dtype: datetime64[ns]

6)-将record设置为索引

In [110]:

hyundai=hyundai.set_index('record')
hyundai.head()

Out[110]:

modelyearpricetransmissionmileagefuelTypetaxmpgengineSize
record
2021-06-20I2020177999Manual17307Petrol14558.91.2
2016-03-05Tucson201614499Automatic25233Diesel23543.52.0
2016-06-18Tucson201611399Manual37877Diesel3061.71.7
2019-01-02I1020166499Manual23789Petrol2060.11.0
2020-01-07IX35201510199Manual33177Diesel16051.42.0

7)-有重复的日期吗?

In [111]:

hyundai.index.is_unique

Out[111]:

False

8)-将index设置为升序

In [112]:

hyundai.sort_index(ascending = True).head()

Out[112]:

modelyearpricetransmissionmileagefuelTypetaxmpgengineSize
record
2016-03-05Tucson201614499Automatic25233Diesel23543.52.0
2016-06-18Tucson201611399Manual37877Diesel3061.71.7
2018-03-04I1020167500Manual6000Petrol2057.61.2
2018-03-17I1020187100Manual35000Petrol14560.11.0
2018-05-09I4020157830Manual59508Diesel3065.71.7

9)-找到每个月的最后一个记录日(record)

In [113]:

hyundai_month = hyundai.resample('BM').sum()
hyundai_month.head()

Out[113]:

yearpricemileagetaxmpgengineSize
record
2016-03-312016144992523323543.52.0
2016-04-2900000.00.0
2016-05-3100000.00.0
2016-06-30201611399378773061.71.7
2016-07-2900000.00.0

10)-数据集中最早的记录日期(record)和最晚的记录日期相差多少天?

In [114]:

(hyundai.index.max() - hyundai.index.min()).days

Out[114]:

1951

11)-在record数据中一共有多少个月?

In [115]:

hyundai_month = hyundai.resample('BM').mean()
len(hyundai_month.index)

Out[115]:

65

12)-按照record时间顺序可视化mpg值

In [116]:

hyundai_open=hyundai['mpg'].plot(title='hyundai record')

fig=hyundai_open.get_figure()
fig.set_size_inches(13.5,9)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/font_manager.py:1331: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))

练习 10.删除数据(drug200.csv) 1)-导入必要的库

In [117]:

import pandas as pd
import numpy as np 
2)-从以下地址导入数据集

In [118]:

path10='data/data111257/drug200.csv'

3)-将数据集存成变量drug

In [119]:

drug=pd.read_csv(path10)
drug.head()

Out[119]:

23FHIGHHIGH.125.355DrugY
047MLOWHIGH13.093drugC
147MLOWHIGH10.114drugC
228FNORMALHIGH7.798drugX
361FLOWHIGH18.043DrugY
422FNORMALHIGH8.607drugX

4)-创建数据框的列名称
Age
Sex
Blood Pressure Levels (BP)
Cholesterol Levels
Na to Potassium Ration(Na_to_K)
Drug Class

In [120]:

drug=pd.read_csv(path10,names=['Age','Sex','Blood_Pressure_Levels (BP)','Cholesterol_Levels','Na_to_Potassium_Ration(Na_to_K)','Drug_Class'])
drug.head()

Out[120]:

AgeSexBlood_Pressure_Levels (BP)Cholesterol_LevelsNa_to_Potassium_Ration(Na_to_K)Drug_Class
023FHIGHHIGH25.355DrugY
147MLOWHIGH13.093drugC
247MLOWHIGH10.114drugC
328FNORMALHIGH7.798drugX
461FLOWHIGH18.043DrugY

5)-数据框中有缺失值吗?

In [121]:

pd.isnull(drug).sum()

Out[121]:

Age                                0
Sex                                0
Blood_Pressure_Levels (BP)         0
Cholesterol_Levels                 0
Na_to_Potassium_Ration(Na_to_K)    0
Drug_Class                         0
dtype: int64

6)-将列Cholesterol的第10到19行设置为缺失值

In [122]:

drug.iloc[10:20,3:4]=np.nan
drug.head(20)

Out[122]:

AgeSexBlood_Pressure_Levels (BP)Cholesterol_LevelsNa_to_Potassium_Ration(Na_to_K)Drug_Class
023FHIGHHIGH25.355DrugY
147MLOWHIGH13.093drugC
247MLOWHIGH10.114drugC
328FNORMALHIGH7.798drugX
461FLOWHIGH18.043DrugY
522FNORMALHIGH8.607drugX
649FNORMALHIGH16.275DrugY
741MLOWHIGH11.037drugC
860MNORMALHIGH15.171DrugY
943MLOWNORMAL19.368DrugY
1047FLOWNaN11.767drugC
1134FHIGHNaN19.199DrugY
1243MLOWNaN15.376DrugY
1374FLOWNaN20.942DrugY
1450FNORMALNaN12.703drugX
1516FHIGHNaN15.516DrugY
1669MLOWNaN11.455drugX
1743MHIGHNaN13.972drugA
1823MLOWNaN7.298drugC
1932FHIGHNaN25.974DrugY

7)-将缺失值全部替换为1.0

In [123]:

drug.Cholesterol_Levels.fillna(1,inplace=True)
drug

Out[123]:

AgeSexBlood_Pressure_Levels (BP)Cholesterol_LevelsNa_to_Potassium_Ration(Na_to_K)Drug_Class
023FHIGHHIGH25.355DrugY
147MLOWHIGH13.093drugC
247MLOWHIGH10.114drugC
328FNORMALHIGH7.798drugX
461FLOWHIGH18.043DrugY
.....................
19556FLOWHIGH11.567drugC
19616MLOWHIGH12.006drugC
19752MNORMALHIGH9.894drugX
19823MNORMALNORMAL14.020drugX
19940FLOWNORMAL11.349drugX

200 rows × 6 columns

8)-删除列Cholesterol Levels,并打印删除后的数据框架

In [124]:

del drug['Cholesterol_Levels']
drug.head()

Out[124]:

AgeSexBlood_Pressure_Levels (BP)Na_to_Potassium_Ration(Na_to_K)Drug_Class
023FHIGH25.355DrugY
147MLOW13.093drugC
247MLOW10.114drugC
328FNORMAL7.798drugX
461FLOW18.043DrugY

9)-将数据框前三行设置为缺失值

In [125]:

drug.iloc[0:3,:]=np.nan
drug.head()

Out[125]:

AgeSexBlood_Pressure_Levels (BP)Na_to_Potassium_Ration(Na_to_K)Drug_Class
0NaNNaNNaNNaNNaN
1NaNNaNNaNNaNNaN
2NaNNaNNaNNaNNaN
328.0FNORMAL7.798drugX
461.0FLOW18.043DrugY

10)-删除有缺失值的行

In [126]:

drug=drug.dropna(how='any')
drug.head()

Out[126]:

AgeSexBlood_Pressure_Levels (BP)Na_to_Potassium_Ration(Na_to_K)Drug_Class
328.0FNORMAL7.798drugX
461.0FLOW18.043DrugY
522.0FNORMAL8.607drugX
649.0FNORMAL16.275DrugY
741.0MLOW11.037drugC

11)-重新设置索引

In [127]:

drug=drug.reset_index(drop=True)
drug.head()

Out[127]:

AgeSexBlood_Pressure_Levels (BP)Na_to_Potassium_Ration(Na_to_K)Drug_Class
028.0FNORMAL7.798drugX
161.0FLOW18.043DrugY
222.0FNORMAL8.607drugX
349.0FNORMAL16.275DrugY
441.0MLOW11.037drugC

分组求和

http://www.crazyant.net/2594.html

转载请注明:文章转载自 www.mshxw.com
本文地址:https://www.mshxw.com/it/309074.html
我们一直用心在做
关于我们 文章归档 网站地图 联系我们

版权所有 (c)2021-2022 MSHXW.COM

ICP备案号:晋ICP备2021003244-6号