官方文档:https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Dataframe.dtypes.html
1 df的构造
pandas有两种重要的数据结构,分别是Series和Dataframe
- Series: 类似一个一维数组,一个Series对应Dataframe的一列
- Dataframe:类似一个二维数组,一个Dataframe由几个Series列构成。
1.1 读取excel.csv
# 读取excel,csv文件
import pandas as pd
df = pd.read_csv('df.csv')
df = pd.read_excel('df.xlsx')
关于excel/csv的读取,其参数为:
- io :excel 路径;
- sheetname:默认是sheetname为0,返回多表使用sheetname=[0,1],若sheetname=None是返回全表 。注意:int/string返回的是dataframe,而none和list返回的是dict of dataframe。
- header :指定作为列名的行,默认0,即取第一行,数据为列名行以下的数据;若数据不含列名,则设定 header = None;
- skiprows:省略指定行数的数据
- skip_footer:省略从尾部数的行数据
- index_col :指定列为索引列,也可以使用 u’string’
- names:指定列的名字,传入一个list数据
def read_excel(io, sheet_name=0, header=0, skiprows=None,skip_footer=0,
index_col=None, names=None, usecols=None, parse_dates=False,
date_parser=None, na_values=None, thousands=None,
convert_float=True, converters=None, dtype=None,
true_values=None, false_values=None, engine=None,
squeeze=False, **kwds)
1.2 字典/series转df
# 字典->Dataframe
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.Dataframe(data=d)
# series ->DF
df = pd.Dataframe(np.array([[1, 2, 3], [4, 5, 6], [7,8, 9]]),
columns=['a', 'b', 'c'])
df
# df->excel/csv
df.to_excel('df.xlsx',index = False,header=None) # 不保留索引和列名,保留列名就是header = 1
1.3 多个df操作
# 两个df合并
df = pd.concat([df1,df2])
1.4 df 属性值
df.info()
打印关于Dataframe的信息,包括索引dtype和列、非空值和内存使用情况
df.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None, null_counts=None)
df.info()
Index: 4 entries, Ohio to New York
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 one 4 non-null int32
1 two 4 non-null int32
2 three 4 non-null int32
3 four 4 non-null int32
dtypes: int32(4)
memory usage: 256.0+ bytes
df.head(5)
查看几行数据
df.head(5)
df.index
提取index名
df.index.tolist()
['Ohio', 'Colorado', 'Utah', 'New York']
df.columns
提取列名
df.columns.tolist()
['one', 'two', 'three', 'four']
df.shape
显示维度
df.shape
(4, 4)
df.describe()
显示基础统计信息
df.describe()
| one | two | three | four |
|---|
| count | 4.000000 | 4.000000 | 4.000000 | 4.000000 |
|---|
| mean | 6.000000 | 7.000000 | 8.000000 | 9.000000 |
|---|
| std | 5.163978 | 5.163978 | 5.163978 | 5.163978 |
|---|
| min | 0.000000 | 1.000000 | 2.000000 | 3.000000 |
|---|
| 25% | 3.000000 | 4.000000 | 5.000000 | 6.000000 |
|---|
| 50% | 6.000000 | 7.000000 | 8.000000 | 9.000000 |
|---|
| 75% | 9.000000 | 10.000000 | 11.000000 | 12.000000 |
|---|
| max | 12.000000 | 13.000000 | 14.000000 | 15.000000 |
|---|
df.items()
遍历Dataframe列,返回一个列名和内容为Series的元组
for label, content in df.items():
print(label)
print(content)
one
Ohio 0
Colorado 4
Utah 8
New York 12
Name: one, dtype: int32
two
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
three
Ohio 2
Colorado 6
Utah 10
New York 14
Name: three, dtype: int32
four
Ohio 3
Colorado 7
Utah 11
New York 15
Name: four, dtype: int32
2 df 数据类型
df = pd.Dataframe(
{
"a": pd.Series([1, 2, 3], dtype=np.dtype("int32")),
"b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
"c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
"d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
"e": pd.Series([10, 1, 20], dtype=np.dtype("int64")),
"f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("float")),
}
)
df
| a | b | c | d | e | f |
|---|
| 0 | 1 | x | True | h | 10 | NaN |
|---|
| 1 | 2 | y | False | i | 1 | 100.5 |
|---|
| 2 | 3 | z | NaN | NaN | 20 | 200.0 |
|---|
df.dtypes
查看df每一列数据类型
df.dtypes
a int32
b object
c object
d object
e int64
f float64
dtype: object
df.astype()
改变列的数据类型
### Cast all columns to int32:
df.astype('O').dtypes
### 指定列名转换
df.astype({'c':'bool'}).dtypes
a int32
b object
c bool
d object
e int64
f float64
dtype: object
df.select_dtypes
根据所需要的类型返回子集
df.select_dtypes(include='bool')
# 反向选择 exclude
df.select_dtypes(exclude=['int64'])
| a | b | c | d | f |
|---|
| 0 | 1 | x | True | h | NaN |
|---|
| 1 | 2 | y | False | i | 100.5 |
|---|
| 2 | 3 | z | NaN | NaN | 200.0 |
|---|
pandas.Dataframe.convert_dtypes
自动将object类型转换为合适的类型。
dfn = df.convert_dtypes()
dfn.dtypes
a Int32
b string
c boolean
d string
e Int64
f float64
dtype: object
df->array(df.values)
df.values
array([[1, 'x', True, 'h', 10, nan],
[2, 'y', False, 'i', 1, 100.5],
[3, 'z', nan, nan, 20, 200.0]], dtype=object)
3 df 索引
df = pd.Dataframe([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
index=[4, 5, 6], columns=['A', 'B', 'C'])
df
df.at
与loc类似,都是基于index的标签来进行,就是index的具体值
df.at[4,'B']
2
df.loc
# 通过标签访问 一组行/列数据
df.loc[4]
df.loc[4:5] # 等价于 df.loc[[4,5]
df.loc[[4,5],['A','B']]
# 按照列数据筛选子集
df.loc[df['B']>3,['B','C']]
df.iat
就是根据index来索引,而不是index的标签值
df.iat[1,2]
1
df.iloc
基于index进行,行和列都是只能使用index,可以提前根据column名提取,在进行iloc
df[['B','C']].iloc[0]
B 2
C 3
Name: 4, dtype: int64
根据索引结果进行修改
在查找的基础上,后接=value进行修改
例如:
df.loc[df['B']>3,['B','C']] = 100
df
4 df切片
index = ['Ohio','Colorado','Utah','New York']
column = ['one','two','three','four']
df = pd.Dataframe(np.arange(16).reshape(4,4),index=index,columns=column)
df
| one | two | three | four |
|---|
| Ohio | 0 | 1 | 2 | 3 |
|---|
| Colorado | 4 | 5 | 6 | 7 |
|---|
| Utah | 8 | 9 | 10 | 11 |
|---|
| New York | 12 | 13 | 14 | 15 |
|---|
选取某一列
3种方法
df['one']
df.one
df.iloc[:,0]
Ohio 0
Colorado 4
Utah 8
New York 12
Name: one, dtype: int32
选某几列
df[['one','two']]
df.iloc[:,:2]
| one | two |
|---|
| Ohio | 0 | 1 |
|---|
| Colorado | 4 | 5 |
|---|
| Utah | 8 | 9 |
|---|
| New York | 12 | 13 |
|---|
选某一行
df.iloc[0:1]
df.iloc[0]
one 0
two 1
three 2
four 3
Name: Ohio, dtype: int32
df.loc['Ohio']
one 0
two 1
three 2
four 3
Name: Ohio, dtype: int32
选某几行
df.iloc[0:2]
| one | two | three | four |
|---|
| Ohio | 0 | 1 | 2 | 3 |
|---|
| Colorado | 4 | 5 | 6 | 7 |
|---|
df.iloc[[1,3]]
| one | two | three | four |
|---|
| Colorado | 4 | 5 | 6 | 7 |
|---|
| New York | 12 | 13 | 14 | 15 |
|---|
df.loc[['Colorado','New York']]
| one | two | three | four |
|---|
| Colorado | 4 | 5 | 6 | 7 |
|---|
| New York | 12 | 13 | 14 | 15 |
|---|
多行多列
df.iloc[0:3,[0,2]]
| one | three |
|---|
| Ohio | 0 | 2 |
|---|
| Colorado | 4 | 6 |
|---|
| Utah | 8 | 10 |
|---|
df.loc[['Ohio','Utah'],['one','three']]
5 删除
5.1 删除列
df.drop(['one'],axis = 1) # axis =1 就是列,axis = 0就是行
# df.drop(columns=['one'])
| two | three | four |
|---|
| Ohio | 1 | 2 | 3 |
|---|
| Colorado | 5 | 6 | 7 |
|---|
| Utah | 9 | 10 | 11 |
|---|
| New York | 13 | 14 | 15 |
|---|
5.2 删除行
df.drop(['Ohio'],axis = 0) # 此时 axis可以隐藏不写
| one | two | three | four |
|---|
| Colorado | 4 | 5 | 6 | 7 |
|---|
| Utah | 8 | 9 | 10 | 11 |
|---|
| New York | 12 | 13 | 14 | 15 |
|---|
6 填充
df = pd.Dataframe([[np.nan, 2, np.nan, 0],
[3, 4, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, 4]],
columns=list("ABCD"))
df
| A | B | C | D |
|---|
| 0 | NaN | 2.0 | NaN | 0 |
|---|
| 1 | 3.0 | 4.0 | NaN | 1 |
|---|
| 2 | NaN | NaN | NaN | 5 |
|---|
| 3 | NaN | 3.0 | NaN | 4 |
|---|
fillna
对空值nan进行填充
df.fillna(0)
# 也可以对不同列使用不同的填充
values = {"A": 0, "B": 1, "C": 2, "D": 3}
df.fillna(value=values)
# 限制第一个空值填充
df.fillna(value=values, limit=1)
| A | B | C | D |
|---|
| 0 | 0.0 | 2.0 | 2.0 | 0 |
|---|
| 1 | 3.0 | 4.0 | NaN | 1 |
|---|
| 2 | NaN | 1.0 | NaN | 5 |
|---|
| 3 | NaN | 3.0 | NaN | 4 |
|---|