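pandas provides data structures and manipulation tools that make cleaning and analyzing data fast and simple.
pandas is often used together with numerical computing tools such as NumPy and SciPy, analysis libraries such as statsmodels and scikit-learn, and visualization libraries such as matplotlib. Because pandas is built on NumPy arrays, it can often process large amounts of data without explicit loops.
pandas is well suited to tabular data and large data sets, while NumPy is best suited to large numeric arrays.
The import convention used here is: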
#!python
import pandas as pd
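The claim that pandas can often avoid explicit loops can be illustrated with a small sketch. The price and quantity values below are made up for illustration; the point is only that one expression over whole Series gives the same result as an element-by-element loop.

```python
import pandas as pd

# Hypothetical data: unit prices and quantities for four orders.
prices = pd.Series([9.5, 20.0, 3.25, 14.0])
quantities = pd.Series([2, 1, 4, 3])

# Loop version: one Python-level multiplication per element.
totals_loop = [p * q for p, q in zip(prices, quantities)]

# Vectorized version: a single expression over the whole Series.
totals_vec = prices * quantities

print(totals_vec.tolist())
```

Both versions produce the same numbers; the vectorized form is shorter and pushes the work down to NumPy.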
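The two main data structures are Series and DataFrame.

Series

A Series is a one-dimensional array-like object made up of a sequence of values (of NumPy-like data types) and an associated array of data labels, called its index. The simplest Series is created from just an array of data: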
#!python
In [2]: import pandas as pd

In [3]: obj = pd.Series([4, 7, -5, 3])

In [4]: obj
Out[4]:
0    4
1    7
2   -5
3    3
dtype: int64

In [5]: obj.values
Out[5]: array([ 4,  7, -5,  3])

# Recent pandas versions display RangeIndex(start=0, stop=4, step=1) here.
In [6]: obj.index
Out[6]: Int64Index([0, 1, 2, 3], dtype='int64')
Specifying an index:
#!python
In [2]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [3]: obj2
Out[3]:
d    4
b    7
a   -5
c    3
dtype: int64

In [4]: obj2.index
Out[4]: Index(['d', 'b', 'a', 'c'], dtype='object')

In [10]: obj2['a']
Out[10]: -5

In [11]: obj2['d'] = 6

In [12]: obj2[['c', 'a', 'd']]
Out[12]:
c    3
a   -5
d    6
dtype: int64
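As you can see, unlike a plain NumPy array, a Series lets you select values by index label.
NumPy functions and NumPy-like operations (such as filtering with a boolean array, scalar multiplication, or applying math functions) preserve the link between index and values: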
#!python
In [13]: obj2[obj2 > 0]
Out[13]:
d    6
b    7
c    3
dtype: int64

In [14]: obj2 * 2
Out[14]:
d    12
b    14
a   -10
c     6
dtype: int64

In [15]: obj2
Out[15]:
d    6
b    7
a   -5
c    3
dtype: int64

In [17]: import numpy as np

In [18]: np.exp(obj2)
Out[18]:
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [19]: 'b' in obj2
Out[19]: True

In [20]: 'e' in obj2
Out[20]: False
A Series can therefore be thought of as a fixed-length, ordered dictionary. It can also be created from a dictionary:
#!python
In [21]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [22]: obj3 = pd.Series(sdata)

In [23]: obj3
Out[23]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [24]: states = ['California', 'Ohio', 'Oregon', 'Texas']

In [25]: obj4 = pd.Series(sdata, index=states)

In [26]: obj4
Out[26]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [27]: pd.isnull(obj4)
Out[27]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [28]: pd.notnull(obj4)
Out[28]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [29]: obj4.isnull()
Out[29]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [32]: obj4.notnull()
Out[32]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Addition aligns the two Series by index label:
#!python
In [33]: obj3
Out[33]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [34]: obj4
Out[34]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [35]: obj3 + obj4
Out[35]:
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

In [36]: obj4.name = 'population'

In [37]: obj4.index.name = 'state'

In [38]: obj4
Out[38]:
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [40]: obj = pd.Series([4, 7, -5, 3])

In [41]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [42]: obj
Out[42]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
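When two Series are added, a label present in only one operand yields NaN. If that is not what you want, the add method with fill_value is one possible alternative. The sketch below uses made-up numbers (the California value in particular is hypothetical) and is only meant to contrast the two behaviors.

```python
import pandas as pd

# Hypothetical populations; labels only partly overlap.
a = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Utah': 5000})
b = pd.Series({'California': 10000, 'Ohio': 35000, 'Texas': 71000})

# Plain addition: labels found in only one operand become NaN.
plain = a + b

# add with fill_value: a label missing from one side is treated as 0.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```

Whether treating a missing label as 0 is appropriate depends on the data; NaN is the safer default when absence means "unknown".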
Code for this article: https://github.com/china-testing/python-api-tesing/
Latest version of this article: http://t.cn/R8tJ9JH
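DataFrame

A DataFrame is a rectangular, table-like data structure containing an ordered collection of columns, each of which can have a different type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dictionary of Series that all share the same index. Internally, the data is stored as one or more two-dimensional blocks.
There are many ways to construct a DataFrame. The most common is to pass a dictionary of equal-length lists or NumPy arrays. As with a Series, the DataFrame gets an index automatically, and the columns are placed in sorted order (recent pandas versions keep the dictionary's insertion order instead).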
#!python
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
   ...:         'year': [2000, 2001, 2002, 2001, 2002, 2003],
   ...:         'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [4]: frame = pd.DataFrame(data)

In [5]: frame
Out[5]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
5  3.2  Nevada  2003

In [6]: frame.head()
Out[6]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

In [7]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
Out[7]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2

In [8]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
   ...:                       index=['one', 'two', 'three', 'four', 'five', 'six'])

In [9]: frame2
Out[9]:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN

In [10]: frame2['state']
Out[10]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

As shown, the columns argument fixes the column order and the index argument sets the row labels. As with a Series, a requested column that is not found in the data is filled with NaN. A column can be retrieved as a Series either with dictionary-style indexing or as an attribute. The returned Series has the same index as the DataFrame, and its name attribute is set accordingly.
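The two ways of retrieving a column differ in one practical respect, sketched below with a small made-up frame. Attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute; a column called 'pop', like the one used in this article, collides with the DataFrame.pop method.

```python
import pandas as pd

# Small illustrative frame; the values are made up.
frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'year': [2000, 2001],
                      'pop': [1.5, 2.4]})

by_key = frame['state']   # dictionary-style: works for any column name
by_attr = frame.year      # attribute-style: works for identifier-like names

# frame.pop is the DataFrame.pop method, not the 'pop' column,
# so the column has to be fetched with dictionary-style indexing.
pop_is_method = callable(frame.pop)
pop_col = frame['pop']

print(by_key.name, by_attr.tolist(), pop_is_method)
```

For that reason dictionary-style indexing is the more predictable choice in scripts.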
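Rows can be retrieved by label with the loc attribute (use iloc for integer position). Columns can be modified by assignment.
When you assign a list or array to a column, its length must match the length of the DataFrame. If you assign a Series, its labels are aligned exactly to the DataFrame's index, and any rows without a matching label are filled with NaN: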
#!python
In [11]: frame2.loc['three']
Out[11]:
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [12]: frame2['debt'] = 16.5

In [13]: frame2
Out[13]:
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5

In [14]: frame2['debt'] = np.arange(6.)

In [15]: frame2
Out[15]:
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0

In [16]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [17]: frame2['debt'] = val

In [18]: frame2
Out[18]:
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
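The difference between assigning a list and assigning a Series can be seen in a short sketch. The frame and values below are made up, and the label 'zzz' is a deliberately non-matching label chosen for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]},
                     index=['one', 'two', 'three'])

# A list of the wrong length is rejected with a ValueError.
try:
    frame['debt'] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc

# A Series is aligned by label: labels not in the frame ('zzz') are
# ignored, and rows without a matching label are filled with NaN.
frame['debt'] = pd.Series([-1.2, 9.9], index=['two', 'zzz'])

print(frame)
```

Alignment by label means the order of the values in the assigned Series does not matter, only their labels.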
Assigning to a column that does not exist creates a new column. The del keyword deletes a column:
#!python
In [19]: frame2['eastern'] = frame2['state'] == 'Ohio'

In [20]: frame2
Out[20]:
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False

In [21]: del frame2['eastern']

In [22]: frame2.columns
Out[22]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
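In the pandas versions this article was written against, the column returned by indexing is a view on the underlying data, not a copy, so any in-place modification of that Series is reflected in the source DataFrame. (With Copy-on-Write, the default from pandas 3.0, such modifications no longer propagate.) To copy a column explicitly, use the Series copy method.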
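An explicit copy behaves the same way regardless of pandas version, which is why it is the safer choice when you intend to modify a column independently. A minimal sketch with made-up values:

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]},
                     index=['one', 'two', 'three'])

# copy() returns an independent Series.
col_copy = frame['pop'].copy()
col_copy['one'] = 99.0   # changes only the copy

print(frame.loc['one', 'pop'])
print(col_copy['one'])
```

The source frame keeps its original value because the copy owns its own data.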
Another common form of data is a nested dictionary. The outer keys become the columns and the inner keys become the row index:
#!python
In [23]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
   ....:        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [24]: frame3 = pd.DataFrame(pop)

In [25]: frame3
Out[25]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

In [26]: frame3.T
Out[26]:
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6

In [27]: pd.DataFrame(pop, index=[2001, 2002, 2003])
Out[27]:
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN

In [28]: pdata = {'Ohio': frame3['Ohio'][:-1], 'Nevada': frame3['Nevada'][:2]}

In [29]: pdata
Out[29]:
{'Ohio': 2000    1.5
 2001    1.7
 Name: Ohio, dtype: float64, 'Nevada': 2000    NaN
 2001    2.4
 Name: Nevada, dtype: float64}

In [30]: pd.DataFrame(pdata)
Out[30]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7

In [31]: frame3.index.name = 'year'; frame3.columns.name = 'state'

In [32]: frame3
Out[32]:
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

In [33]: frame3.values
Out[33]:
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

In [34]: frame2.values
Out[34]:
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

As shown, a DataFrame can be transposed, and a dictionary of Series is treated much like a nested dictionary. If the index and columns of a DataFrame have their name attributes set, those names are displayed as well. As with a Series, the values attribute returns the data, here as a two-dimensional ndarray. If the columns have different data types, the dtype of the values array is chosen to accommodate all of them.
The DataFrame constructor accepts the following types: 2D ndarray; dict of arrays, lists, or tuples; NumPy structured/record array; dict of Series; dict of dicts; list of dicts or Series; list of lists or tuples; another DataFrame; NumPy MaskedArray.
Further reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
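A few of the accepted input types can be sketched as follows. The column names, row labels, and values are hypothetical and chosen only to show how missing entries are handled.

```python
import numpy as np
import pandas as pd

# 2D ndarray: column labels are supplied separately.
from_ndarray = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['a', 'b', 'c'])

# List of dicts: each dict is a row; keys missing from a row become NaN.
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'c': 4}])

# Dict of Series: the indexes are unioned to form the row index.
from_series = pd.DataFrame({'x': pd.Series([1, 2], index=['r1', 'r2']),
                            'y': pd.Series([3], index=['r2'])})

print(from_ndarray)
print(from_records)
print(from_series)
```

In each case pandas fills positions that have no data with NaN rather than raising an error.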
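Index objects

pandas Index objects hold the axis labels and other metadata (such as axis names). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index.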
#!python
In [35]: obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [36]: index = obj.index

In [37]: index
Out[37]: Index(['a', 'b', 'c'], dtype='object')

In [38]: index[1:]
Out[38]: Index(['b', 'c'], dtype='object')

In [39]: index[1] = 'd'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in ()
----> 1 index[1] = 'd'

/usr/local/lib/python3.5/dist-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   1722
   1723     def __setitem__(self, key, value):
-> 1724         raise TypeError("Index does not support mutable operations")
   1725
   1726     def __getitem__(self, key):

TypeError: Index does not support mutable operations

In [40]: labels = pd.Index(np.arange(3))

In [41]: labels
Out[41]: Int64Index([0, 1, 2], dtype='int64')

In [42]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)

In [43]: obj2
Out[43]:
0    1.5
1   -2.5
2    0.0
dtype: float64

In [44]: obj2.index is labels
Out[44]: True

In [45]: frame3
Out[45]:
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

In [46]: frame3.columns
Out[46]: Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [47]: 'Ohio' in frame3.columns
Out[47]: True

In [48]: 2003 in frame3.index
Out[48]: False

In [49]: dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

In [50]: dup_labels
Out[50]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
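Index objects are immutable and cannot be modified by the user, which makes it safe to share an Index among several data structures. In addition to being array-like, an Index behaves like a fixed-size set, although unlike a Python set it can contain duplicate labels.
Index methods and properties include append, difference, intersection, union, isin, delete, drop, insert, is_monotonic (is_monotonic_increasing in recent pandas versions), and unique.
Further reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html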
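The set-like methods listed above can be tried on two small Index objects. The labels are arbitrary; the sketch only shows that each method returns a new Index (or a boolean array for isin) and leaves the original untouched.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

both = left.intersection(right)      # labels present in both
either = left.union(right)           # labels present in at least one
only_left = left.difference(right)   # labels in left but not in right
mask = left.isin(['a', 'c'])         # boolean array, one entry per label
extended = left.insert(1, 'z')       # new Index; left itself is unchanged

print(list(both), list(either), list(only_left))
print(mask.tolist(), list(extended), list(left))
```

Because an Index is immutable, methods such as insert, delete, and drop never modify it in place.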
Essential functionality

In this section I walk through the basic mechanics of working with the data in a Series or DataFrame.
Reindexing

#!python
In [51]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [52]: obj
Out[52]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

# Calling reindex rearranges the data according to the new index.
# Labels that are not already present get NaN.
In [53]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [54]: obj2
Out[54]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [55]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [56]: obj3
Out[56]:
0      blue
2    purple
4    yellow
dtype: object

# For ordered data such as time series, you may want to fill in values
# when reindexing. The method option does this; for example, ffill
# forward-fills the values:
In [57]: obj3.reindex(range(6), method='ffill')
Out[57]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

# With a DataFrame, reindex can alter the rows, the columns, or both
In [58]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
   ....:                      index=['a', 'c', 'd'],
   ....:                      columns=['Ohio', 'Texas', 'California'])

In [59]: frame
Out[59]:
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

In [60]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [61]: frame2
Out[61]:
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

In [62]: states = ['Texas', 'Utah', 'California']

In [63]: frame.reindex(columns=states)
Out[63]:
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

In [69]: frame2 = frame.reindex(['a', 'b', 'c', 'd'], columns=states)

In [70]: frame2
Out[70]:
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
The parameters of reindex include index, method, fill_value, limit, tolerance, level, and copy.
Further reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
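Two of the reindex parameters, fill_value and limit, are easy to overlook. The following sketch uses made-up data to show one plausible use of each: fill_value replaces the NaN that reindex would otherwise introduce, and limit caps how many consecutive positions a forward fill may cover.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3], index=['d', 'b', 'a'])

# Default: the new label 'c' gets NaN.
with_nan = obj.reindex(['a', 'b', 'c', 'd'])

# fill_value: the new label 'c' gets 0.0 instead.
with_zero = obj.reindex(['a', 'b', 'c', 'd'], fill_value=0.0)

# limit=1: forward-fill at most one consecutive missing position.
steps = pd.Series(['blue', 'purple'], index=[0, 2])
limited = steps.reindex(range(5), method='ffill', limit=1)

print(with_nan)
print(with_zero)
print(limited)
```

In the last result, position 3 is filled from position 2, but position 4 stays missing because it is two steps away from the last known value.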
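Dropping entries from an axis

Dropping one or more entries from an axis is easy if you have an index array or list of the labels to remove. Because this requires some data munging and set logic, the drop method returns a new object with the indicated values deleted from the given axis: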
#!python
In [71]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [72]: obj
Out[72]:
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [73]: new_obj = obj.drop('c')

In [74]: new_obj
Out[74]:
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [75]: obj
Out[75]:
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [76]: obj.drop(['d', 'c'])
Out[76]:
a    0.0
b    1.0
e    4.0
dtype: float64

In [77]: obj
Out[77]:
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [78]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   ....:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
   ....:                     columns=['one', 'two', 'three', 'four'])

In [79]: data
Out[79]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [80]: data.drop(['Colorado', 'Ohio'])
Out[80]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

# axis passed positionally; pandas 2.0 and later require the keyword form axis=1
In []: data.drop('two',1)
Out[57]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

In []: data.drop('two', axis=1)
Out[58]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

In []: data.drop(['two', 'four'], axis='columns')
Out[59]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

# inplace=True modifies obj itself instead of returning a new object.
# Note: the output below corresponds to the obj from the reindexing
# section (index d, b, a, c), not the obj defined in In [71].
In []: obj.drop('c', inplace=True)

In []: obj
Out[61]:
d    4.5
b    7.2
a   -5.3
dtype: float64
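As a complement to the axis argument, drop also accepts the index and columns keywords, and an errors option that controls what happens when a label is missing. The sketch below uses a small made-up frame, and 'missing' is a deliberately absent label.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape(2, 3),
                    index=['Ohio', 'Utah'],
                    columns=['one', 'two', 'three'])

rows_dropped = data.drop(index='Ohio')      # same as data.drop('Ohio', axis=0)
cols_dropped = data.drop(columns=['two'])   # same as data.drop(['two'], axis=1)

# By default, dropping a label that does not exist raises KeyError.
try:
    data.drop('missing')
    missing_error = None
except KeyError as exc:
    missing_error = exc

# errors='ignore' skips labels that are not found.
tolerant = data.drop('missing', errors='ignore')

print(rows_dropped)
print(cols_dropped)
```

None of these calls change data itself; each returns a new DataFrame, consistent with the behavior described in this section.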
Author: python作业AI毕业设计
Link: https://www.jianshu.com/p/252932b6d25b



