Advanced Pandas: Working with Data

1. Reindexing

1.1 Series

1.1.1 Custom index

import numpy as np
import pandas as pd
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s.reindex(['e', 'a', 'c', 'b', 'g']))  # 'g' is not in s, so it becomes NaN

e   -1.019419
a   -1.969583
c   -0.622471
b    0.018110
g         NaN
dtype: float64

1.1.2 Aligning with another object

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s1 = pd.Series(np.random.randn(7), index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
print(s1.reindex_like(s))

# Before alignment (s1):
a    0.858259
b   -0.473204
c   -1.610665
d   -0.032633
e   -1.232306
f   -0.095461
g   -1.742095
dtype: float64

# After alignment:
a    0.858259
b   -0.473204
c   -1.610665
d   -0.032633
e   -1.232306
dtype: float64

You can also align two objects by passing one object's axis labels to reindex().

s = pd.Series(np.random.randn(7), index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
s1 = pd.Series(np.random.randn(3), index=['a', 'c', 'f'])
print(s1.reindex(s.index))

a   -0.148640
b         NaN
c    0.336717
d         NaN
e         NaN
f    0.329316
g         NaN
dtype: float64
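
As a quick check of the two approaches above, the sketch below compares reindex_like(other) with reindex(other.index) on a Series. It uses fixed values instead of random data so the comparison is reproducible; the variable names are illustrative only.

```python
import pandas as pd

# Fixed values so the result is reproducible
s = pd.Series([1.0, 2.0, 3.0], index=['a', 'c', 'f'])
other = pd.Series(range(7), index=list('abcdefg'))

via_like = s.reindex_like(other)     # align to other's index
via_index = s.reindex(other.index)   # the same alignment, spelled out

# equals() treats NaN values in the same positions as equal
print(via_like.equals(via_index))
```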

1.2 DataFrame

1.2.1 Custom index

1.2.1.1 Rows

df = pd.DataFrame({'one': [2, 6, 4], 'two': [5, 7, 10], 'three': [4, 8, 8]}, index=['a', 'b', 'c'])
print(df.reindex(['c', 'h', 'a'], axis='index'))

   one   two  three
c  4.0  10.0    8.0
h  NaN   NaN    NaN
a  2.0   5.0    4.0

1.2.1.2 Columns

print(df.reindex(['three', 'one', 'two'], axis='columns'))

   three  one  two
a      4    2    5
b      8    6    7
c      8    4   10

1.2.1.3 Rows and columns

print(df.reindex(index=['c', 'a', 'b'], columns=['three', 'two', 'one']))

   three  two  one
c      8   10    4
a      4    5    2
b      8    7    6

1.2.2 Aligning with another object

df = pd.DataFrame({'one': [2, 6, 4], 'two': [5, 7, 10], 'three': [4, 8, 8]}, index=['a', 'b', 'c'])
df1 = pd.DataFrame({'one': [2, 6, 4, 12], 'two': [5, 7, 10, 45], 'three': [4, 8, 9, 23], 'four': [5, 19, 2, 89]}, index=['a', 'b', 'c', 'd'])
print(df1.reindex_like(df))

# Before alignment (df1):
   one  two  three  four
a    2    5      4     5
b    6    7      8    19
c    4   10      9     2
d   12   45     23    89 

# After alignment:
   one  two  three
a    2    5      4
b    6    7      8
c    4   10      9

2. Filling while reindexing

2.1 Series

2.1.1 Fill direction

2.1.1.1 Index must be sorted

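Method	Action
pad / ffill	Fill forward
bfill / backfill	Fill backward
nearest	Fill from the nearest index value

The index of s1 (the Series being reindexed) must be sorted in increasing or decreasing order.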

s = pd.Series(np.random.randn(7), index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
s1 = pd.Series(np.random.randn(3), index=['a', 'c', 'f'])
print(s1.reindex(s.index))                  # no fill
print(s1.reindex(s.index, method='ffill'))  # forward fill
print(s1.reindex(s.index, method='bfill'))  # backward fill

# No fill:
a   -0.035560
b         NaN
c    1.289840
d         NaN
e         NaN
f   -0.887376
g         NaN
dtype: float64

# Forward fill:
a   -0.035560
b   -0.035560
c    1.289840
d    1.289840
e    1.289840
f   -0.887376
g   -0.887376
dtype: float64

# Backward fill:
a   -0.035560
b    1.289840
c    1.289840
d   -0.887376
e   -0.887376
f   -0.887376
g         NaN
dtype: float64

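method='nearest' does not work with the string labels used here, for reasons that are not clear to me; it works once the index is switched to numbers. A likely cause is that nearest has to measure the distance between labels, which is defined for numbers but not for strings.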
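
A minimal sketch of the numeric case, assuming that nearest needs to compare distances between labels and therefore wants a numeric index. The data is made up for illustration, and the labels are chosen so that no new label is equally close to two existing ones.

```python
import pandas as pd

s1 = pd.Series([10.0, 20.0, 30.0], index=[0, 3, 6])

# Each new label takes the value of the closest existing label
nearest = s1.reindex(list(range(7)), method='nearest')
print(nearest)
```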

2.1.1.2 Index need not be sorted

print(s1.reindex(s.index).ffill())  # fillna(method='ffill') is deprecated in recent pandas

a   -0.120874
b   -0.120874
c    0.339105
d    0.339105
e    0.339105
f    0.734595
g    0.734595
dtype: float64
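
To make the difference between the two approaches concrete, here is a small sketch with a deliberately unsorted index. It assumes that reindex(method=...) rejects a non-monotonic index with a ValueError, while reindexing first and calling ffill() afterwards simply fills along the order of the new index. The values are illustrative.

```python
import pandas as pd

# s1's index is deliberately NOT sorted
s1 = pd.Series([1.0, 2.0, 3.0], index=['c', 'a', 'f'])
target = list('abcdefg')

try:
    s1.reindex(target, method='ffill')
    raised = False
except ValueError as exc:
    raised = True
    print('reindex(method=...) failed:', exc)

# Reindex first, then fill the gaps in the order of the new index
filled = s1.reindex(target).ffill()
print(filled)
```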

2.1.2 Fill limit

2.1.2.1 limit

limit caps the number of consecutive labels that may be filled.

print(s1.reindex(s.index, method='ffill', limit=1))

a    0.176983
b    0.176983
c   -1.678991
d   -1.678991
e         NaN
f   -0.075107
g   -0.075107
dtype: float64

2.1.2.2 tolerance
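
As a hedged sketch of what this parameter does: tolerance caps the distance allowed between a new label and the existing label it is matched to, and labels farther away than that stay NaN. Because it compares label distances, the example assumes a numeric index; the data is illustrative.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=[0, 3, 6])

# Forward-fill only when the matched label is at most 1 away
filled = s1.reindex(list(range(7)), method='ffill', tolerance=1)
print(filled)
```

Labels 2 and 5 are two steps away from the previous existing label, so they are left unfilled.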

2.2 DataFrame

3. Renaming

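rename() relabels an axis based on a mapping (a dict or a Series) or an arbitrary function.

3.1 Series

When a function is passed, it must return a single value for every label it is called with, and the resulting labels must be unique.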

print(s.rename(str.upper))

A   -1.752828
B   -0.158358
C   -1.238692
D    0.128135
E   -2.204282
F    0.703956
G   -0.039073
dtype: float64

rename() also accepts a scalar (or another hashable value, such as a tuple) to change the Series.name attribute.

print(s.rename('series-1'))

a   -0.871909
b    0.714525
c    0.896795
d    1.439972
e    1.023890
f   -0.524506
g   -0.519692
Name: series-1, dtype: float64
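
The three forms of Series.rename() described in this section can be compared side by side. This is a small sketch on a fixed Series; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

by_dict = s.rename({'a': 'x', 'c': 'z'})  # labels missing from the dict are kept
by_func = s.rename(str.upper)             # function applied to every label
named = s.rename('series-1')              # scalar changes Series.name only

print(list(by_dict.index), list(by_func.index), named.name)
```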

3.2 DataFrame

3.2.1 Columns

print(df.rename({'one': 'fir', 'two': 'sec', 'three': 'thi'}, axis=1))

   fir  sec  thi
a    2    5    4
b    6    7    8
c    4   10    8

3.2.2 Rows

print(df.rename({'a': 1, 'b': 2, 'c': 3}, axis=0))

   one  two  three
1    2    5      4
2    6    7      8
3    4   10      8

3.2.3 Rows and columns

print(df.rename(index={'a': 1, 'b': 2, 'c': 3}, columns={'one': 'fir', 'two': 'sec', 'three': 'thi'}))

   fir  sec  thi
1    2    5    4
2    6    7    8
3    4   10    8

4. Iteration

4.1 Series

4.2 DataFrame

The examples below iterate over this df:

   one  two  three
a    2    5      4
b    6    7      8
c    4   10      8

4.2.1 items()

Iterates column by column; each step yields the column label first and the column's data (a Series) second.

for a, b in df.items():
    print(a)
    print(b)

one
a    2
b    6
c    4
Name: one, dtype: int64
two
a     5
b     7
c    10
Name: two, dtype: int64
three
a    4
b    8
c    8
Name: three, dtype: int64

4.2.2 iterrows()

Iterates row by row; the first value returned is the row label and the second is that row's data as a Series.

for a, b in df.iterrows():
    print(a)
    print(b)

a
one      2
two      5
three    4
Name: a, dtype: int64
b
one      6
two      7
three    8
Name: b, dtype: int64
c
one       4
two      10
three     8
Name: c, dtype: int64

It can also be used to transpose the frame.

df1 = pd.DataFrame({a: b for a, b in df.iterrows()})
print(df1)

       a  b   c
one    2  6   4
two    5  7  10
three  4  8   8
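
For comparison, pandas also has a built-in transpose, df.T. The sketch below rebuilds the frame row by row as above and prints it next to df.T so the two can be checked against each other; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'one': [2, 6, 4], 'two': [5, 7, 10], 'three': [4, 8, 8]},
                  index=['a', 'b', 'c'])

# Row-by-row rebuild: each row Series becomes a column
rebuilt = pd.DataFrame({label: row for label, row in df.iterrows()})

# Built-in transpose
transposed = df.T

print(rebuilt)
print(transposed)
```

In practice df.T is the shorter spelling and avoids the Python-level loop.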

4.2.3 itertuples()

Iterates row by row and returns a namedtuple holding each row's data. Unlike iterrows(), it does not convert the row into a Series.

for a in df.itertuples():
    print(a)

Pandas(Index='a', one=2, two=5, three=4)
Pandas(Index='b', one=6, two=7, three=8)
Pandas(Index='c', one=4, two=10, three=8)
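
One practical consequence, sketched with a made-up mixed-dtype frame: because iterrows() packs each row into a single Series, integer values are upcast to float when the row also contains floats, whereas itertuples() takes each value from its own column. The column names qty and price are illustrative only.

```python
import pandas as pd

# Hypothetical frame: one integer column, one float column
df = pd.DataFrame({'qty': [1, 2], 'price': [1.5, 2.5]}, index=['a', 'b'])

_, first_row = next(df.iterrows())    # (label, Series) pair
first_tuple = next(df.itertuples())   # namedtuple with Index, qty, price

print(first_row.dtype)           # dtype shared by the whole row Series
print(type(first_row['qty']))    # value taken from the upcast row
print(type(first_tuple.qty))     # value taken straight from the column
```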

5. Sorting

5.1 Series

5.1.1 Sorting by index

s1 = pd.Series(np.random.randn(5), index=['b', 'c', 'a', 'd', 'e'])
print(s1.sort_index())

a    1.706496
b   -0.594099
c   -1.233872
d   -0.574698
e   -0.094711
dtype: float64

5.1.2 Sorting by value

s1 = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s1.sort_values())

e   -1.799285
b   -1.652728
c   -0.540896
d    0.100869
a    1.750447
dtype: float64

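5.1.3 Searching sorted values (searchsorted)

searchsorted() returns the positions at which the given values would have to be inserted to keep an already sorted Series sorted; side chooses whether that position falls to the left or to the right of equal values.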

s = pd.Series([1, 2, 3, 4, 5, 6, 7], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
print(s.searchsorted([0, 5], side='left'))
print(s.searchsorted([0, 5], side='right'))

[0 4]
[0 5]
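
The effect of side is easiest to see when the Series contains repeated values. A minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3])  # must already be sorted

left = s.searchsorted(2, side='left')    # position before the run of 2s
right = s.searchsorted(2, side='right')  # position after the run of 2s
print(left, right)
```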

5.1.4 Largest and smallest values

s1 = pd.Series(np.random.randn(5), index=['b', 'c', 'a', 'd', 'e'])
print(s1.nlargest(3))
print(s1.nsmallest(3))

c    1.925964
a    0.351660
e    0.086050
dtype: float64

d   -2.126804
b   -0.702141
e    0.086050
dtype: float64
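
On data without ties, nlargest(n) gives the same result as sorting in descending order and taking the first n rows. The sketch below checks this on fixed values (chosen here for reproducibility, not taken from the output above).

```python
import pandas as pd

s = pd.Series([0.3, -1.2, 2.5, 0.9, -0.4], index=['a', 'b', 'c', 'd', 'e'])

top3 = s.nlargest(3)
top3_sorted = s.sort_values(ascending=False).head(3)  # same result, spelled out
bottom2 = s.nsmallest(2)

print(top3)
print(bottom2)
```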

5.2 DataFrame

5.2.1 Sorting by index

5.2.1.1 Sort order

Ascending:

df = pd.DataFrame({'one': [2, 6, 4], 'two': [5, 7, 10], 'three': [4, 8, 8]}, index=['b', 'a', 'c'])
print(df.sort_index())

   one  two  three
a    6    7      8
b    2    5      4
c    4   10      8

Descending:

print(df.sort_index(ascending=False))

   one  two  three
c    4   10      8
b    2    5      4
a    6    7      8

5.2.1.2 Sort axis

By default the row labels are sorted; pass axis=1 to sort the column labels instead.

df = pd.DataFrame({'three': [2, 6, 4], 'two': [5, 7, 10], 'one': [4, 8, 8]}, index=['b', 'a', 'c'])
print(df.sort_index(axis=1))

   one  three  two
b    4      2    5
a    8      6    7
c    8      4   10

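5.2.2 Sorting by value

by specifies which column(s) to sort on; it accepts a single column label or a list of labels. With a list, later columns only break ties in earlier ones.

df = pd.DataFrame({'three': [12, 6, 4], 'two': [15, 7, 10], 'one': [4, 8, 8]}, index=['b', 'a', 'c'])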
print(df.sort_values(by=['two', 'three']))

   three  two  one
a      6    7    8
c      4   10    8
b     12   15    4
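
When sorting on several columns, each column can be given its own direction by passing a list to ascending. A short sketch on the same frame, sorting one in descending order and using two in ascending order to break ties:

```python
import pandas as pd

df = pd.DataFrame({'three': [12, 6, 4], 'two': [15, 7, 10], 'one': [4, 8, 8]},
                  index=['b', 'a', 'c'])

# 'one' descending first; ties on 'one' are broken by 'two' ascending
result = df.sort_values(by=['one', 'two'], ascending=[False, True])
print(result)
```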

5.2.3 Mixed sorting

5.2.4 Largest and smallest values

df = pd.DataFrame({'three': [12, 6, 4], 'two': [15, 7, 10], 'one': [4, 8, 8]}, index=['b', 'a', 'c'])
print(df.nlargest(3, 'one'))
print(df.nsmallest(3, ['one', 'two']))

   three  two  one
a      6    7    8
c      4   10    8
b     12   15    4

   three  two  one
b     12   15    4
a      6    7    8
c      4   10    8
