Dataframe 常用操作手册

官方文档：https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Dataframe.dtypes.html

1 df的构造

pandas有两种重要的数据结构，分别是Series和Dataframe

Series: 类似一个一维数组，一个Series对应Dataframe的一列
Dataframe:类似一个二维数组，一个Dataframe由几个Series列构成。

1.1 读取excel.csv

# 读取excel，csv文件
import pandas as pd
df = pd.read_csv('df.csv')
df = pd.read_excel('df.xlsx')

关于excel/csv的读取，其参数为：

io ：excel 路径；
sheetname：默认是sheetname为0，返回多表使用sheetname=[0,1]，若sheetname=None是返回全表。注意：int/string返回的是dataframe，而none和list返回的是dict of dataframe。
header ：指定作为列名的行，默认0，即取第一行，数据为列名行以下的数据；若数据不含列名，则设定 header = None；
skiprows：省略指定行数的数据
skip_footer：省略从尾部数的行数据
index_col ：指定列为索引列，也可以使用 u’string’
names：指定列的名字，传入一个list数据

def read_excel(io, sheet_name=0, header=0, skiprows=None,skip_footer=0,
    index_col=None, names=None, usecols=None, parse_dates=False,
    date_parser=None, na_values=None, thousands=None,
    convert_float=True, converters=None, dtype=None,
    true_values=None, false_values=None, engine=None,
    squeeze=False, **kwds)

1.2 字典/series转df

# 字典->Dataframe
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.Dataframe(data=d)

	col1	col2
0	1	3
1	2	4

# series ->DF
df = pd.Dataframe(np.array([[1, 2, 3], [4, 5, 6], [7,8, 9]]),
                   columns=['a', 'b', 'c'])
df

	a	b	c
0	1	2	3
1	4	5	6
2	7	8	9

# df->excel/csv

df.to_excel('df.xlsx',index = False,header=None)  # 不保留索引和列名，保留列名就是header = 1

1.3 多个df操作

# 两个df合并
df = pd.concat([df1,df2])

1.4 df 属性值 df.info()

打印关于Dataframe的信息，包括索引dtype和列、非空值和内存使用情况
df.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None, null_counts=None)

df.info()


Index: 4 entries, Ohio to New York
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   one     4 non-null      int32
 1   two     4 non-null      int32
 2   three   4 non-null      int32
 3   four    4 non-null      int32
dtypes: int32(4)
memory usage: 256.0+ bytes

df.head(5)

查看几行数据

df.head(5)

df.index

提取index名

df.index.tolist()

['Ohio', 'Colorado', 'Utah', 'New York']

df.columns

提取列名

df.columns.tolist()

['one', 'two', 'three', 'four']

df.shape

显示维度

df.shape

(4, 4)

df.describe()

显示基础统计信息

df.describe()

	one	two	three	four
count	4.000000	4.000000	4.000000	4.000000
mean	6.000000	7.000000	8.000000	9.000000
std	5.163978	5.163978	5.163978	5.163978
min	0.000000	1.000000	2.000000	3.000000
25%	3.000000	4.000000	5.000000	6.000000
50%	6.000000	7.000000	8.000000	9.000000
75%	9.000000	10.000000	11.000000	12.000000
max	12.000000	13.000000	14.000000	15.000000

df.items()

遍历Dataframe列，返回一个列名和内容为Series的元组

for label, content in df.items():
    print(label)
    print(content)

one
Ohio         0
Colorado     4
Utah         8
New York    12
Name: one, dtype: int32
two
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32
three
Ohio         2
Colorado     6
Utah        10
New York    14
Name: three, dtype: int32
four
Ohio         3
Colorado     7
Utah        11
New York    15
Name: four, dtype: int32

2 df 数据类型

df = pd.Dataframe(
    {
        "a": pd.Series([1, 2, 3], dtype=np.dtype("int32")),
        "b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
        "c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
        "d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
        "e": pd.Series([10, 1, 20], dtype=np.dtype("int64")),
        "f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("float")),
    }
)
df

	a	b	c	d	e	f
0	1	x	True	h	10	NaN
1	2	y	False	i	1	100.5
2	3	z	NaN	NaN	20	200.0

df.dtypes

查看df每一列数据类型

df.dtypes

a      int32
b     object
c     object
d     object
e      int64
f    float64
dtype: object

df.astype()

改变列的数据类型

### Cast all columns to int32:
df.astype('O').dtypes

### 指定列名转换
df.astype({'c':'bool'}).dtypes

a      int32
b     object
c       bool
d     object
e      int64
f    float64
dtype: object

df.select_dtypes

根据所需要的类型返回子集

df.select_dtypes(include='bool')


0
1
2

# 反向选择 exclude
df.select_dtypes(exclude=['int64'])

	a	b	c	d	f
0	1	x	True	h	NaN
1	2	y	False	i	100.5
2	3	z	NaN	NaN	200.0

pandas.Dataframe.convert_dtypes

自动将object类型转换为合适的类型。

dfn = df.convert_dtypes()
dfn.dtypes

a      Int32
b     string
c    boolean
d     string
e      Int64
f    float64
dtype: object

df->array(df.values)

df.values

array([[1, 'x', True, 'h', 10, nan],
       [2, 'y', False, 'i', 1, 100.5],
       [3, 'z', nan, nan, 20, 200.0]], dtype=object)

3 df 索引

df = pd.Dataframe([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=[4, 5, 6], columns=['A', 'B', 'C'])
df

	A	B	C
4	0	2	3
5	0	4	1
6	10	20	30

df.at

与loc类似,都是基于index的标签来进行,就是index的具体值

df.at[4,'B']

df.loc

# 通过标签访问 一组行/列数据
df.loc[4]
df.loc[4:5]  # 等价于 df.loc[[4,5] 
df.loc[[4,5],['A','B']]

	A	B
4	0	2
5	0	1

# 按照列数据筛选子集
df.loc[df['B']>3,['B','C']]

	A	B	C
4	0	2	3
5	0	1	1
6	10	1	1

df.iat

就是根据index来索引，而不是index的标签值

df.iat[1,2]

df.iloc

基于index进行，行和列都是只能使用index,可以提前根据column名提取，在进行iloc

df[['B','C']].iloc[0]

B    2
C    3
Name: 4, dtype: int64

根据索引结果进行修改

在查找的基础上，后接=value进行修改

例如：

df.loc[df['B']>3,['B','C']] = 100
df

	A	B	C
4	0	2	3
5	0	100	100
6	10	100	100

4 df切片

index = ['Ohio','Colorado','Utah','New York']
column = ['one','two','three','four']
df = pd.Dataframe(np.arange(16).reshape(4,4),index=index,columns=column)
df

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

选取某一列

3种方法

df['one']
df.one
df.iloc[:,0]

Ohio         0
Colorado     4
Utah         8
New York    12
Name: one, dtype: int32

选某几列

df[['one','two']]
df.iloc[:,:2]

	one	two
Ohio	0	1
Colorado	4	5
Utah	8	9
New York	12	13

选某一行

df.iloc[0:1]

	one	two	three	four
Ohio	0	1	2	3

df.iloc[0]

one      0
two      1
three    2
four     3
Name: Ohio, dtype: int32

df.loc['Ohio']

one      0
two      1
three    2
four     3
Name: Ohio, dtype: int32

选某几行

df.iloc[0:2]

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7

df.iloc[[1,3]]

	one	two	three	four
Colorado	4	5	6	7
New York	12	13	14	15

df.loc[['Colorado','New York']]

	one	two	three	four
Colorado	4	5	6	7
New York	12	13	14	15

多行多列

df.iloc[0:3,[0,2]]

	one	three
Ohio	0	2
Colorado	4	6
Utah	8	10

df.loc[['Ohio','Utah'],['one','three']]

	one	three
Ohio	0	2
Utah	8	10

5 删除 5.1 删除列

df.drop(['one'],axis = 1)  # axis =1 就是列，axis = 0就是行
# df.drop(columns=['one'])

	two	three	four
Ohio	1	2	3
Colorado	5	6	7
Utah	9	10	11
New York	13	14	15

5.2 删除行

df.drop(['Ohio'],axis = 0)  # 此时 axis可以隐藏不写

	one	two	three	four
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

6 填充

df = pd.Dataframe([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))

df

	A	B	C	D
0	NaN	2.0	NaN	0
1	3.0	4.0	NaN	1
2	NaN	NaN	NaN	5
3	NaN	3.0	NaN	4

fillna

对空值nan进行填充

df.fillna(0)
# 也可以对不同列使用不同的填充
values = {"A": 0, "B": 1, "C": 2, "D": 3}
df.fillna(value=values)

# 限制第一个空值填充
df.fillna(value=values, limit=1)

	A	B	C	D
0	0.0	2.0	2.0	0
1	3.0	4.0	NaN	1
2	NaN	1.0	NaN	5
3	NaN	3.0	NaN	4

Dataframe 常用操作手册

Python相关栏目本月热门文章