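A really close read, with no detail skipped! Discussion is welcome. I'll keep working through the book, so feel free to follow along.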
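- Getting Started with pandas
- Introduction to pandas Data Structures
- Series
- Series Summary
- DataFrame
- **DataFrame Summary**
- Index Objects
- Essential Functionality
- Reindexing: the reindex method
- Dropping Entries from an Axis: the drop method
- Indexing, Selection, and Filtering
- Summary
- Integer Indexes
- Arithmetic and Data Alignment
- Function Application and Mapping: apply()
- Sorting and Ranking: sort_index(), sort_values(), rank()
- Axis Indexes with Duplicate Labels
- Summarizing and Computing Descriptive Statistics
- Correlation and Covariance: corr, cov
- Unique Values, Value Counts, and Membership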
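Series
A Series is a one-dimensional array-like object that holds a sequence of values together with an array of data labels, called its index.
1. Creating a Series
2. A Series has two attributes: values and index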
import pandas as pd
obj=pd.Series([4,7,-5,3])
obj
Out[3]:
0 4
1 7
2 -5
3 3
dtype: int64
obj.values
Out[4]: array([ 4, 7, -5, 3], dtype=int64)
obj.index
Out[5]: RangeIndex(start=0, stop=4, step=1)
3. Labeling the index when creating a Series
You can then select values with a single label or a list of labels
obj2=pd.Series([4,7,-5,3],index=['d','b','a','c'])
obj2
Out[7]:
d 4
b 7
a -5
c 3
dtype: int64
obj2.index
Out[8]: Index(['d', 'b', 'a', 'c'], dtype='object')
obj2['a']
Out[15]: -5
obj2['d']
Out[16]: 4
# index with a list of labels
obj2[['c','a','d']]
Out[17]:
c 3
a -5
d 4
dtype: int64
4. NumPy functions and NumPy-style operations work on a Series
obj2[obj2>0]
Out[9]:
d 4
b 7
c 3
dtype: int64
obj2*2
Out[10]:
d 8
b 14
a -10
c 6
dtype: int64
import numpy as np
np.exp(obj2)
Out[13]:
d 54.598150
b 1096.633158
a 0.006738
c 20.085537
dtype: float64
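5. Thinking of a Series as a dict
A Series can be created from a dict, and dict-style membership tests with in work on its index labels
When creating it, you can pass the dict keys in the order you want via index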
'b' in obj2
Out[18]: True
'e' in obj2
Out[19]: False
sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
obj3=pd.Series(sdata)
obj3
Out[24]:
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
# Pass the keys to the constructor in the order you want; the resulting index follows that order
states=['California','Ohio','Oregon','Texas']
obj4=pd.Series(sdata,index=states)
obj4
Out[27]:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
6. Checking for missing values
pd.isnull(obj4)
Out[28]:
California True
Ohio False
Oregon False
Texas False
dtype: bool
pd.notnull(obj4)
Out[29]:
California False
Ohio True
Oregon True
Texas True
dtype: bool
obj4.isnull()
Out[30]:
California True
Ohio False
Oregon False
Texas False
dtype: bool
7. Automatic alignment on index labels
obj3
Out[31]:
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
obj4
Out[32]:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
obj3+obj4
Out[33]:
California NaN
Ohio 70000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
8. The name attribute of a Series
obj4.name='population'
obj4.index.name='state'
obj4
Out[36]:
state
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
9. A Series's index can be replaced by assignment (labels are assigned by position)
obj
Out[37]:
0 4
1 7
2 -5
3 3
dtype: int64
obj.index=['Bob','Steve','Jeff','Ryan']
obj
Out[39]:
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
Series Summary
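1. Series attributes: values, index, name
The index also has a name attribute
2. Checking for missing values
3. Automatic alignment on index labels
4. Creating a Series:
specify index labels and select by label / replace the index by assignment / create from a dict / specify the order when creating from a dict
5. NumPy functions and NumPy-style operations are supported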
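To tie the summary together, here is a minimal sketch that reuses the sdata dict from above. The variable names by_state, missing, and total are my own and are only there for illustration.

```python
import numpy as np
import pandas as pd

# Build a Series from a dict; passing index selects and orders the keys.
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
by_state = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# 'California' is not a key of sdata, so its value is missing (NaN).
missing = by_state.isnull()

# Arithmetic aligns on labels: a label present on only one side gives NaN.
total = pd.Series(sdata) + by_state

# Both the Series and its index can carry a name.
by_state.name = 'population'
by_state.index.name = 'state'
```

Note that the sum has five labels (the union of both indexes), even though each input only has four.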
DataFrame
1. Creating a DataFrame from a dict, and specifying the column order with columns
2. The head() method of a DataFrame
data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada','Nevada'],'year':[2000,2001,2002,2001,2002,2003],'pop':[1.5,1.7,3.6,2.4,2.9,3.2]}
frame=pd.DataFrame(data)
frame
Out[45]:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2
frame.head()
Out[46]:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
frame=pd.DataFrame(data,columns=['year','state','pop'])
frame
Out[48]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
5 2003 Nevada 3.2
3. You can specify both index and columns for a DataFrame; a column that is not in the data is filled with NaN
frame2=pd.DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five','six'])
frame2
Out[50]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN
frame2.columns
Out[51]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
4. Selecting a column
frame2[column] works for any column name, but frame2.column only works when the column name is a valid Python variable name.
The selected column is a Series;
its index is the same as the original DataFrame's index,
and its name attribute comes from the DataFrame's column name.
frame2['state']
Out[52]:
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
frame2.year
Out[53]:
one 2000
two 2001
three 2002
four 2001
five 2002
six 2003
Name: year, dtype: int64
s1=frame2.year
s1.name
Out[55]: 'year'
s1.index
Out[56]: Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')
type(s1)
Out[57]: pandas.core.series.Series
type(frame2['year'])
Out[58]: pandas.core.series.Series
Rows can be selected by label with the loc attribute
frame2.loc['three']
Out[7]:
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
5. Modifying a column
A column can be modified directly by assigning to it
frame2['debt']=16.5
frame2
Out[15]:
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
six 2003 Nevada 3.2 16.5
frame2['debt']=np.arange(6.)
frame2
Out[17]:
year state pop debt
one 2000 Ohio 1.5 0.0
two 2001 Ohio 1.7 1.0
three 2002 Ohio 3.6 2.0
four 2001 Nevada 2.4 3.0
five 2002 Nevada 2.9 4.0
six 2003 Nevada 3.2 5.0
When you assign a Series to a DataFrame column, its labels are realigned to the DataFrame's index, and missing values are inserted wherever a label is absent
val=pd.Series([-1.2,-1.5,-1.7],index=['two','four','five'])
val
Out[19]:
two -1.2
four -1.5
five -1.7
dtype: float64
frame2['debt']=val
frame2
Out[21]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
six 2003 Nevada 3.2 NaN
Use frame2['eastern'] to create a new column; the attribute syntax frame2.eastern cannot create one.
frame2['eastern']=frame2.state=='Ohio'
frame2
Out[23]:
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
six 2003 Nevada 3.2 NaN False
6. Deleting a column
del frame2['eastern']
frame2
Out[25]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
six 2003 Nevada 3.2 NaN
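A column selected from a DataFrame is a view on the underlying data, not a copy, so in-place modifications to it are reflected in the original DataFrame; to get an independent copy, use the copy method. (This describes the pandas version used in these notes; pandas releases with Copy-on-Write enabled behave differently.)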
7. Creating a DataFrame from a nested dict (outer keys become columns, inner keys become the row index)
pop={'Nevada':{2001:2.4,2002:2.9},'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
frame3=pd.DataFrame(pop)
frame3
Out[30]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
# Transpose, with syntax similar to NumPy
frame3.T
Out[31]:
2001 2002 2000
Nevada 2.4 2.9 NaN
Ohio 1.7 3.6 1.5
pd.DataFrame(pop,index=[2001,2002,2003])
Out[32]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
# Build a DataFrame from a dict of Series
pdata={'Ohio':frame3['Ohio'][:-1],'Nevada':frame3['Nevada'][:2]}
pd.DataFrame(pdata)
Out[34]:
Ohio Nevada
2001 1.7 2.4
2002 3.6 2.9
8. DataFrame attributes
index.name and columns.name; values
frame3.index.name='year'
frame3.columns.name='state'
frame3
Out[38]:
state Nevada Ohio
year
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
frame3.values
Out[40]:
array([[2.4, 1.7],
[2.9, 3.6],
[nan, 1.5]])
frame2.values
Out[42]:
array([[2000, 'Ohio', 1.5, nan],
[2001, 'Ohio', 1.7, -1.2],
[2002, 'Ohio', 3.6, nan],
[2001, 'Nevada', 2.4, -1.5],
[2002, 'Nevada', 2.9, -1.7],
[2003, 'Nevada', 3.2, nan]], dtype=object)
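DataFrame Summary
1. Creation: from a dict and similar structures
2. Attributes: index, columns, values; index.name, columns.name
3. Selecting a column (a view), deleting a column with del, copying a column with copy, creating a new column, modifying a column
4. Selecting a row with loc
5. Index: frame3.index=[2000,2001,2003] replaces the index labels directly;
passing index when building a new DataFrame from data that already has an index selects and reorders the existing rows instead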
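The column operations in this summary can be sketched on a small frame as follows. The data is a shortened version of the example above, and names such as eastern_flags and pop_copy are hypothetical helpers for illustration only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {'state': ['Ohio', 'Ohio', 'Nevada'], 'pop': [1.5, 1.7, 2.4]},
    index=['one', 'two', 'three'],
)

# Assigning a Series aligns it to the DataFrame's index; gaps become NaN.
frame['debt'] = pd.Series([-1.2, -1.5], index=['two', 'three'])

# Bracket syntax creates a new column; del removes it again.
frame['eastern'] = frame['state'] == 'Ohio'
eastern_flags = frame['eastern'].tolist()
del frame['eastern']

# copy() gives an independent column, so editing it leaves frame untouched.
pop_copy = frame['pop'].copy()
pop_copy['one'] = 99.0

# loc selects a row by label and returns it as a Series.
row = frame.loc['two']
```

Taking an explicit copy() is the safe choice whenever you plan to modify a selected column, whichever pandas version you are on.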
Index Objects
1. Index objects are immutable and cannot be modified (my understanding: you cannot change the contents of an Index object, you can only replace the whole Index object)
obj=pd.Series(range(3),index=['a','b','c'])
index=obj.index
index
Out[52]: Index(['a', 'b', 'c'], dtype='object')
index[1:]
Out[53]: Index(['b', 'c'], dtype='object')
index[1]='d'
Traceback (most recent call last):
File "", line 1, in
index[1]='d'
File "D:\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 4081, in __setitem__
raise TypeError("Index does not support mutable operations")
TypeError: Index does not support mutable operations
2. Creating an Index object
labels=pd.Index(np.arange(3))
labels
Out[56]: Int64Index([0, 1, 2], dtype='int64')
obj2=pd.Series([1.5,-2.5,0],index=labels)
obj2
Out[59]:
0 1.5
1 -2.5
2 0.0
dtype: float64
obj2.index is labels
Out[60]: True
3. An Index behaves like a fixed-size set, but unlike a set it can contain duplicate labels
frame3
Out[61]:
state Nevada Ohio
2000 2.4 1.7
2001 2.9 3.6
2003 NaN 1.5
frame3.columns
Out[62]: Index(['Nevada', 'Ohio'], dtype='object', name='state')
'Ohio' in frame3.columns
Out[64]: True
2003 in frame3.index
Out[65]: True
2004 in frame3.index
Out[66]: False
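4. Index methods and properties
| Method | Description |
|---|---|
| append | Concatenate additional Index objects onto the original, producing a new Index |
| difference | Compute the set difference of two indexes |
| intersection | Compute the set intersection of two indexes |
| union | Compute the set union of two indexes |
| isin | Compute a boolean array indicating whether each value is contained in the passed collection |
| delete | Delete the element at position i, producing a new Index |
| drop | Delete the passed values, producing a new Index |
| insert | Insert an element at position i, producing a new Index |
| is_monotonic | Property; True if each element is greater than or equal to the previous one (spelled is_monotonic_increasing in newer pandas versions) |
| is_unique | Property; True if the Index has no duplicate values |
| unique | Compute the array of unique values in the Index |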
#append
labels.append(pd.Index([4]))
Out[72]: Int64Index([0, 1, 2, 4], dtype='int64')
#isin
frame3.index.isin([2004])
Out[73]: array([False, False, False])
#delete
labels
Out[75]: Int64Index([0, 1, 2], dtype='int64')
labels.delete(1)
Out[76]: Int64Index([0, 2], dtype='int64')
#drop
labels
Out[78]: Int64Index([0, 1, 2], dtype='int64')
labels.drop(2)
Out[79]: Int64Index([0, 1], dtype='int64')
#insert
labels.insert(1,99)
Out[80]: Int64Index([0, 99, 1, 2], dtype='int64')
labels
Out[81]: Int64Index([0, 1, 2], dtype='int64')
#is_unique
labels.is_unique
Out[86]: True
#is_monotonic
labels.is_monotonic
Out[87]: True
#unique
labels.unique()
Out[88]: Int64Index([0, 1, 2], dtype='int64')
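The set-style methods from the table are not demonstrated above, so here is a small sketch of them. The index values are made up for illustration, and is_monotonic_increasing is used in place of is_monotonic because, as far as I know, newer pandas releases only provide the longer name.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

# Set-like operations each return a new Index.
both = left.intersection(right)
either = left.union(right)
only_left = left.difference(right)

# An Index is immutable: item assignment raises TypeError.
try:
    left[1] = 'z'
    mutated = True
except TypeError:
    mutated = False

# Unlike a Python set, an Index may hold duplicate labels.
dup = pd.Index(['x', 'x', 'y'])
increasing = left.is_monotonic_increasing
```

None of these calls changes left or right; every method that "modifies" an Index really returns a new one.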
Essential Functionality
Reindexing: the reindex method
obj=pd.Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
obj
Out[90]:
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
obj2=obj.reindex(['a','b','c','d','e'])
# The data is rearranged to follow the new index; a label that did not exist before gets a missing value
obj2
Out[92]:
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
obj3= pd.Series(['blue','purple','yellow'],index=[0,2,4])
obj3
Out[94]:
0 blue
2 purple
4 yellow
dtype: object
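# For ordered data such as time series, you may want to interpolate or fill values when reindexing;
# method='ffill' forward-fills while reindexing
obj3.reindex(range(6),method='ffill')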
Out[96]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
# reindex can alter the row index, the column index, or both
frame=pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','b','c'],columns=['Ohio','Texas','California'])
frame
Out[98]:
Ohio Texas California
a 0 1 2
b 3 4 5
c 6 7 8
frame2=frame.reindex(['a','b','c','d'])
frame2
Out[100]:
Ohio Texas California
a 0.0 1.0 2.0
b 3.0 4.0 5.0
c 6.0 7.0 8.0
d NaN NaN NaN
frame=pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'],columns=['Ohio','Texas','California'])
frame
Out[104]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
frame2=frame.reindex(['a','b','c','d'])
frame2
Out[106]:
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
states=['Texas','Utah','California']
frame.reindex(columns=states)
Out[108]:
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
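Parameters of the reindex method
| Parameter | Description |
|---|---|
| index | New sequence to use as the index; can be an Index instance or any other sequence-like structure |
| fill_value | Substitute value to use when reindexing introduces missing data |
| limit | When forward- or backfilling, the maximum size of gap (in number of elements) to fill |
| tolerance | When forward- or backfilling, the maximum size of gap (in absolute numeric distance) to fill for inexact matches |
| level | Match a simple Index on a level of a MultiIndex; otherwise select a subset |
| copy | If True, always copy the underlying data even if the new index is equivalent to the old one; if False, do not copy the data when the indexes are equivalent |
(I don't fully understand the last few parameters yet; I'll update this when I get to use them.)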
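To make fill_value and limit a bit more concrete, here is a small sketch. The colors Series matches obj3 above; the sparse Series and the variable names are my own additions for illustration.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# method='ffill' carries the last valid value forward into new labels.
filled = colors.reindex(range(6), method='ffill')

# limit caps how many consecutive new labels a value may be carried into.
sparse = pd.Series([1.0, 2.0], index=[0, 4])
limited = sparse.reindex(range(6), method='ffill', limit=1)

# fill_value supplies a constant for labels that did not exist before.
zeros = sparse.reindex(range(3), fill_value=0.0)
```

With limit=1, label 1 is filled from label 0, but labels 2 and 3 stay missing because they are more than one step past the last real value.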
Dropping Entries from an Axis: the drop method
For a Series:
obj=pd.Series(np.arange(5.),index=['a','b','c','d','e'])
obj
Out[134]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
new_obj=obj.drop('c')
new_obj
Out[136]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
obj  # by default the original object is not modified
Out[137]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
For a DataFrame:
The axis parameter:
0 (default): drop rows
1: drop columns
data=pd.DataFrame(np.arange(16).reshape(4,4),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
data
Out[140]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data.drop(['Colorado','Ohio'])
Out[141]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
data
Out[142]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data.drop('two',axis=1)
Out[143]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
data.drop(['two','four'],axis='columns')
Out[144]:
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
obj
Out[145]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
obj.drop('c',inplace=True)
obj
Out[147]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
The inplace parameter:
False (default): return a new object and leave the original unchanged
True: modify the original object in place
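As a compact recap of drop, the sketch below reuses the data frame from above. Reassigning the result is shown as an alternative to inplace=True; the names rows_dropped, cols_dropped, and shape_before are only for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

rows_dropped = data.drop(['Colorado', 'Ohio'])           # axis=0 is the default
cols_dropped = data.drop(['two', 'four'], axis='columns')

# Neither call touched data; reassigning is an alternative to inplace=True.
shape_before = data.shape
data = data.drop('Utah')
```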
Indexing, Selection, and Filtering
Series
obj=pd.Series(np.arange(4.),index=['a','b','c','d'])
obj
Out[149]:
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
obj['b']
Out[150]: 1.0
obj[1]
Out[151]: 1.0
obj[2:4]
Out[152]:
c 2.0
d 3.0
dtype: float64
obj[['b','a','d']]
Out[153]:
b 1.0
a 0.0
d 3.0
dtype: float64
obj[[1,3]]
Out[154]:
b 1.0
d 3.0
dtype: float64
obj[obj<2]
Out[155]:
a 0.0
b 1.0
dtype: float64
# Slicing with labels: note that the end label is included!
obj['b':'c']
Out[156]:
b 1.0
c 2.0
dtype: float64
obj['b':'c']=5
obj
Out[158]:
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
obj[1:2]
Out[159]:
b 5.0
dtype: float64
DataFrame
data=pd.DataFrame(np.arange(16).reshape(4,4),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
data
Out[162]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
# Selecting columns: index by column name
data['two']
Out[163]:
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
data[['three','one']]
Out[164]:
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
# Selecting rows with a slice
data[:2]
Out[165]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
# Boolean indexing
data[data['three']>5]
Out[166]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data<5
Out[168]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
data[data<5]=0
data
Out[170]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
loc and iloc
data.loc['Colorado',['two','three']]
Out[175]:
two 5
three 6
Name: Colorado, dtype: int32
data.iloc[2,[3,0,1]]
Out[176]:
four 11
one 8
two 9
Name: Utah, dtype: int32
data.iloc[2]
Out[177]:
one 8
two 9
three 10
four 11
Name: Utah, dtype: int32
data.iloc[[1,2],[3,0,1]]
Out[178]:
four one two
Colorado 7 0 5
Utah 11 8 9
data.loc[:'Utah','two']  # the end label is included
Out[179]:
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int32
data.iloc[:,:3][data.three>0]
Out[185]:
one two three
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14
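DataFrame indexing options
| Type | Description |
|---|---|
| df[val] | Select a single column or a sequence of columns from the DataFrame; also filters rows with a boolean array and slices rows |
| df.loc[val] | Select a single row or a subset of rows by label |
| df.loc[:,val] | Select a single column or a subset of columns by label |
| df.loc[val1,val2] | Select a subset of both rows and columns by label |
| df.iloc[where] | Select a single row or a subset of rows by integer position |
| df.iloc[:,where] | Select a single column or a subset of columns by integer position |
| df.iloc[where_i,where_j] | Select a subset of both rows and columns by integer position |
| df.at[label_i,label_j] | Select a single scalar value by row and column label |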
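Summary:
1. Series:
Select by integer position or by index label; boolean indexing also works.
Slicing by index label includes the end point; slicing by integer position does not.
2. DataFrame
① Column selection, row slicing, and boolean-array filtering.
② Use loc (axis labels) and iloc (integer positions) for selection.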
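The two points that are easiest to get wrong, inclusive label slices and loc versus iloc, can be checked side by side. This is a minimal sketch using the same shapes of data as above; the result names are my own.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slices include the end label ...
by_label = obj.loc['b':'c']
# ... while positional slices exclude the end position.
by_position = obj.iloc[1:2]

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# loc takes labels, iloc takes integer positions; both pick the same cells here.
via_loc = data.loc['Colorado', ['two', 'three']]
via_iloc = data.iloc[1, [1, 2]]

# A boolean mask filters rows.
big_rows = data[data['three'] > 5]
```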
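Integer Indexes
An integer index means that the index labels of a Series or DataFrame are themselves integers, which makes selection ambiguous:
when I index with an integer, should it be read as a position or as an index label?
This is something to watch out for, so loc and iloc are recommended: loc is always label-based and iloc is always position-based.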
ser=pd.Series(np.arange(3.))
ser
Out[21]:
0 0.0
1 1.0
2 2.0
dtype: float64
ser[-1]
Traceback (most recent call last):
File "D:\anaconda3\lib\site-packages\pandas\core\indexes\range.py", line 355, in get_loc
return self._range.index(new_key)
ValueError: -1 is not in range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "", line 1, in
ser[-1]
File "D:\anaconda3\lib\site-packages\pandas\core\series.py", line 882, in __getitem__
return self._get_value(key)
File "D:\anaconda3\lib\site-packages\pandas\core\series.py", line 989, in _get_value
loc = self.index.get_loc(label)
File "D:\anaconda3\lib\site-packages\pandas\core\indexes\range.py", line 357, in get_loc
raise KeyError(key) from err
KeyError: -1
ser2=pd.Series(np.arange(3.),index=['a','b','c'])
ser2
Out[24]:
a 0.0
b 1.0
c 2.0
dtype: float64
ser2[-1]
Out[25]: 2.0
ser[:1]
Out[26]:
0 0.0
dtype: float64
ser.loc[:1]
Out[27]:
0 0.0
1 1.0
dtype: float64
ser.iloc[:1]
Out[28]:
0 0.0
dtype: float64
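Arithmetic and Data Alignment
This is similar to an outer join on the index labels in a database. Labels that are not in both objects get missing values;
if two Series or DataFrames share no labels at all, the result is entirely missing (there is no overlap, after all).
# For Series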
s1=pd.Series([7.3,-2.5,3.4,1.5],index=['a','c','d','e'])
s2=pd.Series([-2.1,3.6,1.5,4,3.1],index=['a','c','e','f','g'])
s1
Out[6]:
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
s2
Out[7]:
a -2.1
c 3.6
e 1.5
f 4.0
g 3.1
dtype: float64
s1+s2
Out[8]:
a 5.2
c 1.1
d NaN
e 3.0
f NaN
g NaN
dtype: float64
# For DataFrame
df1=pd.DataFrame(np.arange(9.).reshape(3,3),columns=list('bcd'),index=['Ohio','Texas','Colorado'])
df2=pd.DataFrame(np.arange(12.).reshape(4,3),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
df1
Out[12]:
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
df2
Out[13]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
df1+df2
Out[14]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
1. Specifying a fill value to avoid NaN
# As described above, using the + operator directly
df1=pd.DataFrame(np.arange(12.).reshape(3,4),columns=list('abcd'))
df2=pd.DataFrame(np.arange(20).reshape(4,5),columns=list('abcde'))
df2.loc[1,'b']=np.nan
df1
Out[32]:
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
df2
Out[33]:
a b c d e
0 0 1.0 2 3 4
1 5 NaN 7 8 9
2 10 11.0 12 13 14
3 15 16.0 17 18 19
df1+df2
Out[34]:
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 NaN 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
# Using the fill_value argument of the add method
df1.add(df2,fill_value=0)
Out[36]:
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 5.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
# fill_value can also be used when reindexing with reindex
df1.reindex(columns=df2.columns)
Out[37]:
a b c d e
0 0.0 1.0 2.0 3.0 NaN
1 4.0 5.0 6.0 7.0 NaN
2 8.0 9.0 10.0 11.0 NaN
df1.reindex(columns=df2.columns,fill_value=0)
Out[38]:
a b c d e
0 0.0 1.0 2.0 3.0 0
1 4.0 5.0 6.0 7.0 0
2 8.0 9.0 10.0 11.0 0
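Flexible arithmetic methods
| Method | Description |
|---|---|
| add, radd | Addition (+) |
| sub, rsub | Subtraction (-) |
| div, rdiv | Division (/) |
| floordiv, rfloordiv | Floor division (//) |
| mul, rmul | Multiplication (*) |
| pow, rpow | Exponentiation (**) |
The only difference between a method and its r counterpart is that the two operands are swapped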
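A quick sketch of the operand swap, together with fill_value. The two Series have the same values as ser and ser1 in the session that follows, but are renamed a and b here so the block stands on its own.

```python
import pandas as pd

a = pd.Series([0.0, 1.0, 2.0])
b = pd.Series([66, 88, 99], index=[1, 2, 3])

diff = a.sub(b)     # same as a - b
rdiff = a.rsub(b)   # same as b - a

# fill_value stands in for a label missing on one side before computing.
filled = a.sub(b, fill_value=0)
```

Labels 0 and 3 exist on only one side, so they are NaN in diff and rdiff but get real values once fill_value=0 is supplied.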
ser
Out[45]:
0 0.0
1 1.0
2 2.0
dtype: float64
ser1
Out[46]:
1 66
2 88
3 99
dtype: int64
#ser-ser1
ser.sub(ser1)
Out[47]:
0 NaN
1 -65.0
2 -86.0
3 NaN
dtype: float64
#ser1-ser
ser.rsub(ser1)
Out[48]:
0 NaN
1 65.0
2 86.0
3 NaN
dtype: float64
NumPy universal functions can also be used for the computation
np.subtract(ser1,ser)
Out[51]:
0 NaN
1 65.0
2 86.0
3 NaN
dtype: float64
np.sqrt(ser1)
Out[52]:
1 8.124038
2 9.380832
3 9.949874
dtype: float64
np.square(ser)
Out[53]:
0 0.0
1 1.0
2 4.0
dtype: float64
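2. Operations between DataFrame and Series
Arithmetic between a DataFrame and a Series is similar to broadcasting in NumPy,
but by default a Series is treated as a row of the DataFrame: its index is matched against the DataFrame's columns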
frame=pd.DataFrame(np.arange(12.).reshape(4,3),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
series=frame.iloc[0]
frame
Out[57]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
series
Out[58]:
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
frame-series
Out[59]:
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
# When the labels differ, the result is reindexed to the union of the labels
series2=pd.Series(range(3),index=['b','e','f'])
frame+series2
Out[61]:
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
series3=frame['b']
series3
Out[63]:
Utah 0.0
Ohio 3.0
Texas 6.0
Oregon 9.0
Name: b, dtype: float64
series3+frame
Out[64]:
Ohio Oregon Texas Utah b d e
Utah NaN NaN NaN NaN NaN NaN NaN
Ohio NaN NaN NaN NaN NaN NaN NaN
Texas NaN NaN NaN NaN NaN NaN NaN
Oregon NaN NaN NaN NaN NaN NaN NaN
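# This happens because a Series is by default treated as a row of the DataFrame, so its index is matched against the columns
# DataFrame-with-DataFrame arithmetic aligns on both axes automatically, so the row/column question does not arise.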
series4=frame[['b','d']]
series4
Out[66]:
b d
Utah 0.0 1.0
Ohio 3.0 4.0
Texas 6.0 7.0
Oregon 9.0 10.0
series4+frame
Out[67]:
b d e
Utah 0.0 2.0 NaN
Ohio 6.0 8.0 NaN
Texas 12.0 14.0 NaN
Oregon 18.0 20.0 NaN
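To broadcast a Series over the columns of a DataFrame (matching on the row labels instead),
you must use one of the arithmetic methods and pass axis='index'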
series3
Out[68]:
Utah 0.0
Ohio 3.0
Texas 6.0
Oregon 9.0
Name: b, dtype: float64
frame
Out[69]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
frame.sub(series3,axis='index')
Out[70]:
b d e
Utah 0.0 1.0 2.0
Ohio 0.0 1.0 2.0
Texas 0.0 1.0 2.0
Oregon 0.0 1.0 2.0
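The two broadcasting directions can be put next to each other in one short sketch. It uses the same frame as above; minus_row and minus_col are names I made up for the two results.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Default: the Series index is matched against the columns,
# so the row is subtracted from every row.
row = frame.iloc[0]
minus_row = frame - row

# axis='index' matches the Series index against the row labels instead,
# so the column is subtracted from every column.
col = frame['b']
minus_col = frame.sub(col, axis='index')
```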
Function Application and Mapping: apply()
1. NumPy functions can be used
frame=pd.DataFrame(np.random.randn(4,3),columns=list('bde'),index=['Utha','Ohio','Texas','Oregon'])
frame
Out[74]:
b d e
Utha -0.511257 -1.060258 0.254645
Ohio 0.589232 -0.993371 -0.404250
Texas 0.204766 -0.333545 0.718460
Oregon -1.497646 -0.271655 0.413346
np.abs(frame)
Out[75]:
b d e
Utha 0.511257 1.060258 0.254645
Ohio 0.589232 0.993371 0.404250
Texas 0.204766 0.333545 0.718460
Oregon 1.497646 0.271655 0.413346
np.sum(frame)
Out[76]:
b -1.214905
d -2.658828
e 0.982200
dtype: float64
np.sum(frame,axis=1)
Out[77]:
Utha -1.316871
Ohio -0.808389
Texas 0.589682
Oregon -1.355955
dtype: float64
# One difference: on a plain NumPy array the default reduces over all elements, while on a DataFrame the default reduces along axis=0
np.mean(frame)
Out[78]:
b -0.303726
d -0.664707
e 0.245550
dtype: float64
2. The apply method: applies a function to the one-dimensional array formed by each column or each row
# Function returning a scalar
f=lambda x:x.max()-x.min()
frame.apply(f)
Out[82]:
b 2.086878
d 0.788604
e 1.122710
dtype: float64
frame.apply(f,axis='columns')
Out[83]:
Utha 1.314903
Ohio 1.582603
Texas 1.052004
Oregon 1.910992
dtype: float64
# Function returning a Series with multiple values
def f(x):
return pd.Series([x.min(),x.max()],index=['min','max'])
frame.apply(f)
Out[85]:
b d e
min -1.497646 -1.060258 -0.40425
max 0.589232 -0.271655 0.71846
# Element-wise function
format=lambda x:'%.2f' % x
frame.applymap(format)
Out[87]:
b d e
Utha -0.51 -1.06 0.25
Ohio 0.59 -0.99 -0.40
Texas 0.20 -0.33 0.72
Oregon -1.50 -0.27 0.41
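Because the frame above is random, here is the same idea on fixed numbers so the results can be checked by hand. The values and the spread helper are made up for illustration. For the element-wise step I swapped in Series.map on a single column, since, as far as I know, newer pandas releases deprecate applymap.

```python
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0, 3.5], [4.0, 0.5, -1.0]],
                     columns=list('bde'), index=['Utah', 'Ohio'])

def spread(x):
    """Range of a one-dimensional slice: max minus min."""
    return x.max() - x.min()

per_column = frame.apply(spread)                  # called once per column
per_row = frame.apply(spread, axis='columns')     # called once per row

# Series.map applies a function to each element of one column.
formatted = frame['e'].map(lambda v: '%.2f' % v)
```

Note that axis='columns' means "go across the columns", so the function receives one row at a time.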
Sorting and Ranking: sort_index(), sort_values(), rank()
1. Series
Sorting by index: sort_index() returns a new, sorted object; ascending=False sorts in descending order, and ascending order is the default
Sorting by value: sort_values(); by default, missing values are placed at the end
# sort by index
obj=pd.Series(range(4),index=['d','a','b','c'])
obj_new=obj.sort_index()
obj_new
Out[90]:
a 1
b 2
c 3
d 0
dtype: int64
# sort by value
obj=pd.Series([4,7,-3,2])
obj.sort_values()
Out[97]:
2 -3
3 2
0 4
1 7
dtype: int64
obj=pd.Series([4,np.nan,7,np.nan,-3,2])
obj.sort_values()
Out[101]:
4 -3.0
5 2.0
0 4.0
2 7.0
1 NaN
3 NaN
dtype: float64
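2. DataFrame
Sorting by index: sort_index(); ascending=False sorts in descending order (ascending is the default), and axis chooses which axis to sort
Sorting by value: sort_values(); the by argument selects one or more columns as sort keys
# sort by index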
frame=pd.DataFrame(np.arange(8).reshape(2,4),index=['three','one'],columns=['d','a','b','c'])
frame.sort_index()
Out[93]:
d a b c
one 4 5 6 7
three 0 1 2 3
frame.sort_index(axis=1)
Out[94]:
a b c d
three 1 2 3 0
one 5 6 7 4
frame.sort_index(axis=1,ascending=False)
Out[95]:
d c b a
three 0 3 2 1
one 4 7 6 5
# sort by value
frame=pd.DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
frame
Out[103]:
b a
0 4 0
1 7 1
2 -3 0
3 2 1
frame.sort_values(by='b')
Out[104]:
b a
2 -3 0
3 2 1
0 4 0
1 7 1
frame.sort_values(by=['a','b'])
Out[105]:
b a
2 -3 0
0 4 0
3 2 1
1 7 1
3. Ranking with rank()
Ascending or descending order
Tie-breaking methods
obj=pd.Series([7,-5,7,4,2,0,4])
obj.rank()
Out[107]:
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
obj.rank(method='first')
Out[108]:
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
obj.rank(ascending=False,method='max')
Out[109]:
0 2.0
1 7.0
2 2.0
3 4.0
4 5.0
5 6.0
6 4.0
dtype: float64
frame=pd.DataFrame({'b':[4.3,7,-3,2],'a':[0,1,0,1],'c':[-2,5,8,-2.5]})
frame
Out[111]:
b a c
0 4.3 0 -2.0
1 7.0 1 5.0
2 -3.0 0 8.0
3 2.0 1 -2.5
frame.rank(axis='columns')
Out[112]:
b a c
0 3.0 2.0 1.0
1 3.0 1.0 2.0
2 1.0 2.0 3.0
3 3.0 2.0 1.0
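Tie-breaking methods: options for the method argument of rank()
| Method | Description |
|---|---|
| 'average' | Default: assign the average rank to each entry in a group of equal values |
| 'min' | Use the minimum rank for the whole group |
| 'max' | Use the maximum rank for the whole group |
| 'first' | Assign ranks in the order the values appear in the data |
| 'dense' | Like method='min', but ranks always increase by 1 between groups rather than by the number of equal elements in a group |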
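To compare all five options on the same data, the sketch below collects the ranks into a dict. The Series is the one used above; the ranks dict is just a convenience of mine.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One list of ranks per tie-breaking method, in the original element order.
ranks = {m: obj.rank(method=m).tolist()
         for m in ['average', 'min', 'max', 'first', 'dense']}
```

The two 7s are the largest values: 'average' gives both 6.5, 'min' gives both 6, 'max' gives both 7, 'first' gives 6 and 7 in order of appearance, and 'dense' gives both 5 because no rank is skipped after the tied 4s.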
# Understanding dense
obj
Out[114]:
0 7
1 -5
2 7
3 4
4 2
5 0
6 4
dtype: int64
# min: because of the ties, there is no rank 5 and no rank 7
obj.rank(method='min')
Out[115]:
0 6.0
1 1.0
2 6.0
3 4.0
4 3.0
5 2.0
6 4.0
dtype: float64
# dense
obj.rank(method='dense')
Out[116]:
0 5.0
1 1.0
2 5.0
3 4.0
4 3.0
5 2.0
6 4.0
dtype: float64
Axis Indexes with Duplicate Labels
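1. The is_unique property tells you whether the labels are unique
obj=pd.Series(range(5),index=['a','a','b','b','c'])
obj
Out[118]:
a 0
a 1
b 2
b 3
c 4
dtype: int64
obj.index.is_unique
Out[119]: False
2. Selecting a label that occurs more than once returns a Series; selecting a unique label returns a scalar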
obj['a']
Out[120]:
a 0
a 1
dtype: int64
obj['c']
Out[121]: 4
df=pd.DataFrame(np.random.randn(4,3),index=['a','a','b','b'])
df
Out[123]:
0 1 2
a 0.950662 -0.016734 0.096477
a 1.221462 -0.341776 0.162987
b -1.557950 -2.673330 0.029102
b 1.615951 0.336764 -0.717436
df.loc['b']
Out[124]:
0 1 2
b -1.557950 -2.673330 0.029102
b 1.615951 0.336764 -0.717436
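Summarizing and Computing Descriptive Statistics
Descriptive and summary statistics
| Method | Description |
|---|---|
| count | Number of non-NA values |
| describe | Compute a set of summary statistics for a Series or for each DataFrame column |
| min, max | Compute the minimum and maximum values |
| argmin, argmax | Compute the index positions (integers) at which the minimum or maximum value is found, respectively |
| idxmin, idxmax | Compute the index labels at which the minimum or maximum value is found, respectively |
| quantile | Compute the sample quantile, ranging from 0 to 1 |
| sum | Sum of values |
| mean | Mean of values |
| median | Median of values |
| mad | Mean absolute deviation from the mean value |
| prod | Product of all values |
| var | Sample variance of values |
| std | Sample standard deviation of values |
| skew | Sample skewness (third moment) of values |
| kurt | Sample kurtosis (fourth moment) of values |
| cumsum | Cumulative sum of values |
| cummin, cummax | Cumulative minimum or maximum of values, respectively |
| cumprod | Cumulative product of values |
| diff | Compute the first arithmetic difference (useful for time series) |
| pct_change | Compute percent changes |
Let's try them out.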
df
Out[130]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
df.count()
Out[131]:
one 3
two 2
dtype: int64
df.describe()
Out[136]:
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
df.min()
Out[137]:
one 0.75
two -4.50
dtype: float64
df.min(skipna=False)
Out[138]:
one NaN
two NaN
dtype: float64
# argmax() is only available on a Series; calling it on a DataFrame raised an error
df['one'].argmax()
Out[143]: 1
df['two'].argmin()
Out[144]: 1
df['two'].idxmin()
Out[145]: 'b'
df.idxmin()
Out[146]:
one d
two b
dtype: object
df.quantile()
Out[148]:
one 1.4
two -2.9
Name: 0.5, dtype: float64
df.sum()
Out[149]:
one 9.25
two -5.80
dtype: float64
df.mean()
Out[150]:
one 3.083333
two -2.900000
dtype: float64
df.median()
Out[151]:
one 1.4
two -2.9
dtype: float64
# The mean absolute deviation is the arithmetic mean of the absolute deviations of each value from the arithmetic mean
df.mad()
Out[152]:
one 2.677778
two 1.600000
dtype: float64
df.prod()
Out[153]:
one 7.455
two 5.850
dtype: float64
df.var()
Out[154]:
one 12.205833
two 5.120000
dtype: float64
df.std()
Out[155]:
one 3.493685
two 2.262742
dtype: float64
df.skew()
Out[156]:
one 1.664846
two NaN
dtype: float64
df.kurt()
Out[158]:
one NaN
two NaN
dtype: float64
df
Out[160]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
df.cumsum()
Out[161]:
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
df.cummin()
Out[162]:
one two
a 1.40 NaN
b 1.40 -4.5
c NaN NaN
d 0.75 -4.5
df.cumprod()
Out[163]:
one two
a 1.400 NaN
b 9.940 -4.50
c NaN NaN
d 7.455 5.85
df.cumprod(skipna=False)
Out[164]:
one two
a 1.40 NaN
b 9.94 NaN
c NaN NaN
d NaN NaN
df.pct_change()
Out[165]:
one two
a NaN NaN
b 4.071429 NaN
c 0.000000 0.000000
d -0.894366 -0.711111
# describe on non-numeric data
obj=pd.Series(['a','a','b','c']*4)
obj.describe()
Out[167]:
count 16
unique 3
top a
freq 8
dtype: object
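Reduction methods such as sum() and mean() (the book uses the term without defining it; a reduction collapses each column or row to a single value) skip missing values by default, i.e. skipna=True;
you can also set skipna=False explicitly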
df=pd.DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index=['a','b','c','d'],columns=['one','two'])
df
Out[126]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
df.sum(axis='columns')
Out[128]:
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
df.mean(axis='columns',skipna=False)
Out[129]:
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
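The effect of skipna can be seen by running the same reduction both ways. The sketch below uses the df from above; it also recomputes the mean absolute deviation by hand, because I believe mad() is no longer available in newer pandas releases (treat that as an assumption).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

col_sum = df.sum()                                    # NaN skipped
row_mean = df.mean(axis='columns')                    # NaN skipped per row
strict_mean = df.mean(axis='columns', skipna=False)   # any NaN gives NaN

# Mean absolute deviation from the mean, computed by hand for column 'one'.
mad_one = (df['one'] - df['one'].mean()).abs().mean()
```

Row a has one real value, so its mean is 1.4 with skipna=True and NaN with skipna=False; row c is all NaN, so its mean is NaN either way.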
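Correlation and Covariance: corr, cov
The dataset used in the book is no longer available, so simple random data is used for the demonstration
1. Two Series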
df=pd.DataFrame(np.random.randn(4,4),index=['a','b','c','d'],columns=['h','i','j','k'])
df
Out[10]:
h i j k
a -0.759614 -0.276090 0.183978 0.324912
b 0.834184 0.432033 0.776634 0.289982
c 1.049343 1.220511 -0.949492 -0.878830
d 0.185257 -0.609921 -0.413074 -0.523556
df.h.corr(df.i)
Out[11]: 0.751479294774673
df.h.cov(df.i)
Out[12]: 0.49565978237291075
2. A whole DataFrame
df.corr()
Out[13]:
h i j k
h 1.000000 0.751479 -0.238332 -0.523827
i 0.751479 1.000000 -0.300089 -0.387645
j -0.238332 -0.300089 1.000000 0.935527
k -0.523827 -0.387645 0.935527 1.000000
df.cov()
Out[14]:
h i j k
h 0.659945 0.495660 -0.144402 -0.255452
i 0.495660 0.659213 -0.181719 -0.188936
j -0.144402 -0.181719 0.556260 0.418854
k -0.255452 -0.188936 0.418854 0.360358
3. A DataFrame against a single column
df.corrwith(df.i)
Out[15]:
h 0.751479
i 1.000000
j -0.300089
k -0.387645
dtype: float64
By default corrwith works column by column; passing axis='columns' makes it work row by row instead
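To see what cov and corr actually compute, the sketch below reproduces them by hand on a tiny fixed dataset. The numbers are made up for illustration, and I am assuming the defaults: sample covariance with n - 1 in the denominator, and Pearson correlation.

```python
import pandas as pd

df = pd.DataFrame({'h': [1.0, 2.0, 3.0, 4.0],
                   'i': [2.0, 4.1, 5.9, 8.2],
                   'j': [4.0, 3.0, 2.5, 1.0]})

# Sample covariance by hand: sum of products of deviations, divided by n - 1.
dh = df['h'] - df['h'].mean()
di = df['i'] - df['i'].mean()
cov_manual = (dh * di).sum() / (len(df) - 1)

# Pearson correlation is the covariance scaled by both standard deviations.
corr_manual = cov_manual / (df['h'].std() * df['i'].std())

cov_pandas = df['h'].cov(df['i'])
corr_pandas = df['h'].corr(df['i'])

# corrwith correlates every column with one Series.
with_i = df.corrwith(df['i'])
```

Column j falls while i rises, so its entry in with_i is negative; the entry for i itself is 1.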
Unique Values, Value Counts, and Membership
1. unique() returns the unique values in a Series
2. value_counts() counts how often each value occurs in a Series; sort=True (the default) sorts by count
3. isin() performs a membership check and returns a boolean Series, which can be used to filter out a subset
4. Index.get_indexer(values) returns, for each element of values, its integer position in the Index
(that is, values is converted into an array of integers, each giving the position in the Index of the matching label); this is easier to follow with the example
obj=pd.Series(['c','a','d','a','a','b','b','c','c'])
uniques=obj.unique()
uniques
Out[20]: array(['c', 'a', 'd', 'b'], dtype=object)
obj.value_counts()
Out[21]:
c 3
a 3
b 2
d 1
dtype: int64
pd.value_counts(obj.values,sort=False)
Out[23]:
a 3
b 2
c 3
d 1
dtype: int64
obj
Out[24]:
0 c
1 a
2 d
3 a
4 a
5 b
6 b
7 c
8 c
dtype: object
obj.isin(['b','c'])
Out[25]:
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
mask=obj.isin(['b','c'])
obj[mask]
Out[27]:
0 c
5 b
6 b
7 c
8 c
dtype: object
to_match=pd.Series(['c','a','b','b','c','a'])
unique_values=pd.Series(['c','b','a'])
pd.Index(unique_values).get_indexer(to_match)
# pd.Index(unique_values) creates an Index object
# .get_indexer(to_match) maps each value in to_match to its position in that Index
Out[30]: array([0, 2, 1, 1, 0, 2], dtype=int64)
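One detail the example above does not show is what get_indexer does with a value that is not in the Index. Here is a minimal sketch; the extra 'z' value and the name lookup are my own additions.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = obj.unique()          # order of first appearance
counts = obj.value_counts()     # counts per distinct value
mask = obj.isin(['b', 'c'])     # boolean Series for filtering

# get_indexer maps each value to its position in the Index (-1 if absent).
lookup = pd.Index(['c', 'b', 'a'])
positions = lookup.get_indexer(['c', 'a', 'b', 'b', 'c', 'a', 'z'])
```

The -1 for 'z' is a sentinel, not a valid position, so check for it before using the result to index into another array.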