Getting Started with Pandas: Basic Usage

1. Head and Tail

Both methods return 5 rows by default. head() takes rows from the start of the object, tail() from the end.

import numpy as np
import pandas as pd
long_series = pd.Series(np.random.randn(100))
print(long_series.head(3))

0   -1.032864
1    0.769729
2    0.327083
dtype: float64

print(long_series.tail(3))

97   -0.707512
98    0.649685
99   -0.497407
dtype: float64
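As a quick check of the default row count, the sketch below calls both methods without an argument. The Series of consecutive integers is an illustrative stand-in for the random data above, chosen so the result is deterministic.

```python
import pandas as pd

# Illustrative data: the integers 0..99 instead of random numbers.
s = pd.Series(range(100))

first = s.head()  # no argument: the first 5 rows
last = s.tail()   # no argument: the last 5 rows

print(first.tolist())  # [0, 1, 2, 3, 4]
print(last.tolist())   # [95, 96, 97, 98, 99]
```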

2. Boolean Reductions

2.1 any()/all()

2.1.1 Series

For a Series, any() returns a scalar that says whether at least one element is True.

print(pd.Series([True, False]).any())

True
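The heading also names all(), which is the counterpart of any(): it asks whether every element is True. A minimal sketch on the same two-element Series:

```python
import pandas as pd

s = pd.Series([True, False])

has_any = s.any()  # at least one element is True
has_all = s.all()  # every element is True

print(has_any)  # True
print(has_all)  # False
```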

2.1.2 DataFrame

2.1.2.1 Columns

Whether each column contains at least one truthy (non-zero) element.

df = pd.DataFrame({'one': [1, 0], 'two': [0, 0], 'three': [3, 5]}, index=['a', 'b'])
print(df.any(axis='index'))

one       True
two      False
three     True
dtype: bool

2.1.2.2 Rows

Whether each row contains at least one truthy (non-zero) element.

print(df.any(axis='columns'))

a    True
b    True
dtype: bool

2.1.2.3 The Whole DataFrame

print(df.any(axis=None))

True
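For comparison, here is a sketch of all() on the same kind of DataFrame, using the three axis settings shown above for any(). The data is rebuilt here so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 0], 'two': [0, 0], 'three': [3, 5]}, index=['a', 'b'])

per_column = df.all(axis='index')  # is every value in the column non-zero?
per_row = df.all(axis='columns')   # is every value in the row non-zero?
overall = df.all(axis=None)        # is every value in the frame non-zero?

print(per_column.to_dict())  # {'one': False, 'two': False, 'three': True}
print(per_row.to_dict())     # {'a': False, 'b': False}
print(overall)               # False
```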

2.2 The empty Attribute

The empty attribute tells you whether a pandas object is empty. A DataFrame that contains only NaN values is still not considered empty.

print(df.empty)

False
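The NaN case is easy to get wrong, so here is a small sketch of it. The frames nan_only and no_rows are illustrative names, not part of the original example.

```python
import numpy as np
import pandas as pd

# Two rows exist, even though both hold NaN, so the frame is not empty.
nan_only = pd.DataFrame({'one': [np.nan, np.nan]})
print(nan_only.empty)  # False

# Dropping the NaN rows leaves zero rows, which does count as empty.
dropped = nan_only.dropna()
print(dropped.empty)  # True

# A frame with column labels but no rows is empty as well.
no_rows = pd.DataFrame(columns=['one', 'two'])
print(no_rows.empty)  # True
```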

3. Statistics

3.1 Series

3.1.1 value_counts()

s = pd.Series([20, 9, 6], index=['a', 'b', 'c'])
print(s)
print(s.value_counts())

a    20
b     9
c     6
dtype: int64

20    1
9     1
6     1
dtype: int64
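Every value above occurs exactly once, so the counts are not very telling. The sketch below uses illustrative data with repeats, and also shows the normalize parameter, which returns relative frequencies instead of counts.

```python
import pandas as pd

# Illustrative data with repeated values.
s = pd.Series(['x', 'y', 'x', 'x'])

counts = s.value_counts()                # sorted by count, most frequent first
shares = s.value_counts(normalize=True)  # fractions that sum to 1

print(counts.to_dict())  # {'x': 3, 'y': 1}
print(shares.to_dict())  # {'x': 0.75, 'y': 0.25}
```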

3.1.2 mode()

s = pd.Series([20, 20, 6], index=['a', 'b', 'c'])
print(s)
print(s.mode())

a    20
b    20
c     6
dtype: int64

0    20
dtype: int64
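mode() returns a Series rather than a scalar because several values can tie for most frequent. A short sketch with illustrative data in which two values tie:

```python
import pandas as pd

# 20 and 6 both occur twice, so both are modes.
s = pd.Series([20, 20, 6, 6, 1])

modes = s.mode()  # all tied values, sorted in ascending order

print(modes.tolist())  # [6, 20]
```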

3.2 DataFrame

3.2.1 Aggregations such as mean(): the result is smaller than the original data

3.2.1.1 Columns

df = pd.DataFrame({'one': [1, np.nan], 'two': [6, 0], 'three': [3, 5]}, index=['a', 'b'])
print(df.mean(axis='index', skipna=True))

one      1.0
two      3.0
three    4.0
dtype: float64

3.2.1.2 Rows

print(df.mean(axis='columns', skipna=True))

a    3.333333
b    2.500000
dtype: float64

An aggregate can be combined with the original data through broadcasting.

print(df-df.mean())

   one  two  three
a  0.0  3.0   -1.0
b  NaN -3.0    1.0

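Note, however, that the alignment axis has to stay consistent: a result computed along one axis cannot be broadcast back along the other.

In the example below, the mean of each row is computed first (axis=1), so the result is indexed by the row labels. The subtraction can then no longer align on the columns, which is the default, and has to align on the row index (axis=0) instead.

# df.sub(df.mean(axis=1)) raises no error, but it matches the row labels a, b against the column names and returns all NaN
print(df.sub(df.mean(axis=1), axis=0))  # correct: align on the row index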

        one       two     three
a -2.333333  2.666667 -0.333333
b       NaN -2.500000  2.500000
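To make the alignment issue concrete, the sketch below runs both variants on the same data and inspects the results. The variable names row_means, misaligned and aligned are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1, np.nan], 'two': [6, 0], 'three': [3, 5]}, index=['a', 'b'])
row_means = df.mean(axis=1)  # indexed by the row labels 'a' and 'b'

# Default axis='columns': the labels a, b are matched against one, two, three.
# Nothing matches, so the result is the union of the labels, filled with NaN.
misaligned = df.sub(row_means)

# axis=0 matches row_means against the row index instead.
aligned = df.sub(row_means, axis=0)

print(sorted(misaligned.columns))           # ['a', 'b', 'one', 'three', 'two']
print(misaligned.isna().to_numpy().all())   # True
print(round(aligned.loc['a', 'two'], 6))    # 2.666667
```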

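3.2.2 cumsum() and similar methods: the result has the same size as the original data

Along the columns axis (axis=1), each row accumulates from left to right.

df = pd.DataFrame({'one': [1, 2], 'two': [6, 4], 'three': [3, 5]}, index=['a', 'b'])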
print(df.cumsum(axis=1))

   one  two  three
a    1    7     10
b    2    6     11

Along the index axis (axis=0), each column accumulates from top to bottom.

print(df.cumsum(axis=0))

   one  two  three
a    1    6      3
b    3   10      8
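cumsum() has siblings that follow the same axis convention. As a sketch, here are cumprod() and cummax() on the same data; both return a frame with the same shape as the input.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [6, 4], 'three': [3, 5]}, index=['a', 'b'])

running_product = df.cumprod(axis=0)  # down each column
running_max = df.cummax(axis=1)       # across each row, left to right

print(running_product.loc['b'].tolist())  # [2, 24, 15]
print(running_max.loc['a'].tolist())      # [1, 6, 6]
print(running_max.loc['b'].tolist())      # [2, 4, 5]
```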

3.2.3 Summarizing with describe()

3.2.3.1 Numeric Data

df = pd.DataFrame({'one': [1, 2], 'two': [6, 4], 'three': [3, 5]}, index=['a', 'b'])
print(df.describe())

            one       two     three
count  2.000000  2.000000  2.000000
mean   1.500000  5.000000  4.000000
std    0.707107  1.414214  1.414214
min    1.000000  4.000000  3.000000
25%    1.250000  4.500000  3.500000
50%    1.500000  5.000000  4.000000
75%    1.750000  5.500000  4.500000
max    2.000000  6.000000  5.000000

You can also choose which percentiles appear in the output.

print(df.describe(percentiles=[.05, .25,]))

            one       two     three
count  2.000000  2.000000  2.000000
mean   1.500000  5.000000  4.000000
std    0.707107  1.414214  1.414214
min    1.000000  4.000000  3.000000
5%     1.050000  4.100000  3.100000
25%    1.250000  4.500000  3.500000
50%    1.500000  5.000000  4.000000
max    2.000000  6.000000  5.000000

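3.2.3.2 Non-numeric Data

If a column contains strings, the whole column is treated as non-numeric, including any numbers in it. For such columns describe() returns the count, the number of unique values, the most frequent value, and how often that value occurs.

df = pd.DataFrame({'one': ['a', 'c'], 'two': ['f', 'h'], 'three': ['yes', 'no']}, index=['a', 'b'])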
print(df.describe())

       one two three
count    2   2     2
unique   2   2     2
top      a   f   yes
freq     1   1     1

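3.2.3.3 Mixed Data

When a DataFrame mixes numeric and non-numeric columns, describe() summarizes only the numeric columns by default. The include and exclude parameters control which data types are reported.

df = pd.DataFrame({'one': [2, 6], 'two': ['f', 'h'], 'three': ['yes', 'no']}, index=['a', 'b'])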
print(df.describe())

            one
count  2.000000
mean   4.000000
std    2.828427
min    2.000000
25%    3.000000
50%    4.000000
75%    5.000000
max    6.000000

print(df.describe(include=['object']))

       two three
count    2     2
unique   2     2
top      f   yes
freq     1     1

print(df.describe(include='all'))

             one  two three
count   2.000000    2     2
unique       NaN    2     2
top          NaN    f   yes
freq         NaN    1     1
mean    4.000000  NaN   NaN
std     2.828427  NaN   NaN
min     2.000000  NaN   NaN
25%     3.000000  NaN   NaN
50%     4.000000  NaN   NaN
75%     5.000000  NaN   NaN
max     6.000000  NaN   NaN
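The examples above only use include. The sketch below shows exclude as well, selecting by np.number so that the choice does not depend on how strings are stored.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [2, 6], 'two': ['f', 'h'], 'three': ['yes', 'no']}, index=['a', 'b'])

numeric = df.describe(include=[np.number])      # numeric columns only
non_numeric = df.describe(exclude=[np.number])  # everything except numeric columns

print(list(numeric.columns))       # ['one']
print(list(non_numeric.columns))   # ['two', 'three']
print(numeric.loc['mean', 'one'])  # 4.0
```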

mode()

df = pd.DataFrame({'one': [2, 6, 4], 'two': [20, 10, 20], 'three': [4, 8, 8]}, index=['a', 'b', 'c'])
print(df.mode())

   one   two  three
0    2  20.0    8.0
1    4   NaN    NaN
2    6   NaN    NaN

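3.2.4 Index of the Minimum and Maximum

idxmax() returns the index label of the maximum value, and idxmin() the index label of the minimum value.

df = pd.DataFrame({'one': [2, 6], 'two': [20, 10], 'three': [4, 8]}, index=['a', 'b'])
print(df)
print(df.idxmin())
print(df.idxmax())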

   one  two  three
a    2   20      4
b    6   10      8

one      a
two      b
three    a
dtype: object

one      b
two      a
three    b
dtype: object
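Both methods work down each column by default. As a sketch, passing axis=1 searches across each row instead and returns the column label of the extreme value.

```python
import pandas as pd

df = pd.DataFrame({'one': [2, 6], 'two': [20, 10], 'three': [4, 8]}, index=['a', 'b'])

col_of_max = df.idxmax(axis=1)  # per row: the column holding the largest value
col_of_min = df.idxmin(axis=1)  # per row: the column holding the smallest value

print(col_of_max.to_dict())  # {'a': 'two', 'b': 'two'}
print(col_of_min.to_dict())  # {'a': 'one', 'b': 'one'}
```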
