[Python Data Analysis] A Summary of pandas (Comprehensive)

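Table of contents

- Series
  - Creating from an array
  - Creating from a dictionary
  - Indexing
  - Data alignment
  - Handling missing values
- DataFrame
  - Creating from a dictionary
  - Creating from an array
  - Indexing
    - Getting column values
    - Getting row values
    - A special note
  - Data alignment
  - Handling missing values
  - Sorting
  - Common functions
- Time series
  - Ordinary date and time handling
  - Time handling in pandas
    - Generating time series
    - Using time series
    - Reading CSV files and converting to time series
    - Handling missing values in CSV files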

Series

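A Series is a one-dimensional labeled array.

Creating from an array

import string

import numpy as np
import pandas as pd

# An explicit index can be passed
sr1 = pd.Series(np.arange(10), index=list(string.ascii_uppercase[:10]))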
print(sr1)
# A    0
# B    1
# C    2
# D    3
# E    4
# F    5
# G    6
# H    7
# I    8
# J    9
# dtype: int32

# If no index is given, a default integer index starting at 0 is used
sr2 = pd.Series(np.arange(10))
print(sr2)
# 0    0
# 1    1
# 2    2
# 3    3
# 4    4
# 5    5
# 6    6
# 7    7
# 8    8
# 9    9
# dtype: int32

Creating from a dictionary

d = {"name": "zdz", "age": 13, "tel": "10000"}
sr1 = pd.Series(d)
print(sr1)
# name      zdz
# age        13
# tel     10000
# dtype: object

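Indexing

A Series can be indexed like a NumPy array, by integer, or like a dictionary, with sr[key], to get the value of a row.

The index attribute returns all row labels.

There are also two explicit indexers: iloc indexes by row position, while loc indexes by row label.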

s = pd.Series(np.arange(10))
print(s)
print(s[0])  # 0; the row labeled 0
print(s[[1, 2, 3]])  # the rows labeled 1, 2 and 3
# 1    1
# 2    2
# 3    3

sr1 = pd.Series(np.arange(10), index=list(string.ascii_uppercase[:10]))
print(sr1.index)  # the row labels
# Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], dtype='object')

sr = pd.Series(np.arange(20))
sr2 = sr[10:]
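print(sr2[10])  # 10; the integer in sr2[...] is interpreted as a label (key), not as a position within sr2

# When the labels are integers, use loc and iloc to say explicitly which one you mean
# loc[key] interprets the number as a key, that is, a label
print(sr2.loc[10])  # 10
# print(sr2.loc[0])  # KeyError: there is no label 0
# Label slices include both endpoints
print(sr2.loc[10:13])  # 10 11 12 13

# iloc[idx] interprets the number as a position within the Series
print(sr2.iloc[9])  # 19
print(sr2.iloc[0])  # 10
# Position slices work as well: the first 4 rows, start included, end excluded
print(sr2.iloc[0:4])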
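
The difference between the two slice conventions can be checked directly. The following is a minimal sketch; the variable names are illustrative and not part of the original listing.

```python
import numpy as np
import pandas as pd

# A Series whose integer labels start at 10 instead of 0
s = pd.Series(np.arange(20))[10:]

by_label = s.loc[10:13]     # label slice: both endpoints are included
by_position = s.iloc[0:4]   # position slice: the end is excluded

print(len(by_label), len(by_position))  # 4 4
print(by_label.equals(by_position))     # True
```

Both expressions select the same four rows; only the meaning of the numbers differs.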

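Data alignment

Adding two Series aligns them by index label, even when the labels appear in a different order: values that share a label are added together.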

sr1 = pd.Series([12, 23, 34], index=['c', 'a', 'd'])
sr2 = pd.Series([1, 2, 3], index=['d', 'c', 'a'])
sr = sr1 + sr2
print(sr)
# a    26
# c    14
# d    35
# dtype: int64

sr3 = pd.Series([1, 2, 3, 4], index=['d', 'c', 'a', 'b'])
sr = sr1 + sr3
print(sr)
# a    26.0
# b     NaN
# c    14.0
# d    35.0
# dtype: float64
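# A label present in one Series but missing from the other produces NaN
# The add, sub, div and mul methods accept fill_value=0 to use 0 in place of the missing value
sr = sr1.add(sr3)  # same result as sr1 + sr3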
print(sr)
# a    26.0
# b     NaN
# c    14.0
# d    35.0
# dtype: float64
sr = sr1.add(sr3, fill_value=0)
print(sr)
# a    26.0
# b     4.0
# c    14.0
# d    35.0
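
The other arithmetic methods take fill_value in the same way. A short sketch reusing the values of sr1 and sr3 from above (the names diff and prod are illustrative):

```python
import pandas as pd

sr1 = pd.Series([12, 23, 34], index=['c', 'a', 'd'])
sr3 = pd.Series([1, 2, 3, 4], index=['d', 'c', 'a', 'b'])

# 'b' is missing from sr1, so fill_value stands in for it
diff = sr1.sub(sr3, fill_value=0)   # 'b' becomes 0 - 4
prod = sr1.mul(sr3, fill_value=1)   # 'b' becomes 1 * 4

print(diff['b'])  # -4.0
print(prod['b'])  # 4.0
```

Choosing 1 rather than 0 as the fill value for multiplication keeps the value of the Series that does have the label.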

Handling missing values

sr = sr1.add(sr3)  # the sum from the previous section; 'b' is NaN
print(sr.isnull())
# a    False
# b     True
# c    False
# d    False
print(sr.notnull())
# a     True
# b    False
# c     True
# d     True
# dtype: bool
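# These two methods can be used to filter out missing values

print(sr[sr.notnull()])
print(sr.dropna())  # same as above: drop the NaN entries
print(sr.fillna(0))  # replace NaN with 0
print(sr.fillna(sr.mean()))  # replace NaN with the mean of the non-NaN values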
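
To see that mean() skips NaN before the result is used as a fill value, here is a small sketch with the same numbers as above (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

sr = pd.Series([26.0, np.nan, 14.0, 35.0], index=list('abcd'))

mean_without_nan = sr.mean()          # NaN is skipped: (26 + 14 + 35) / 3
filled = sr.fillna(mean_without_nan)

print(mean_without_nan)    # 25.0
print(filled['b'])         # 25.0
print(sr.isnull().sum())   # 1
```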

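DataFrame

A DataFrame is a two-dimensional table.

A DataFrame is a container of Series: each column is a Series.

Creating from a dictionary

When building a DataFrame from a dictionary, each key becomes a column name and each value is a list holding that column's data.

a = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
print(a)  # the dictionary keys become the column names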
#    a  b
# 0  1  5
# 1  2  6
# 2  3  7
# 3  4  8

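Creating from an array

Row and column labels can both be specified; if omitted, they default to integers starting at 0.

t = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("ABCD"))
# index sets the row labels, columns sets the column labels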
print(t)
#    A  B   C   D
# a  0  1   2   3
# b  4  5   6   7
# c  8  9  10  11

# Stored under a different name so that t keeps its labels for the indexing examples
t2 = pd.DataFrame(np.arange(12).reshape(3, 4))
print(t2)
#    0  1   2   3
# 0  0  1   2   3
# 1  4  5   6   7
# 2  8  9  10  11
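
Because a DataFrame is a container of Series, it can also be built from a dictionary of Series, in which case the columns are aligned on their row labels. A minimal sketch with made-up column names and values:

```python
import pandas as pd

# Two Series with partly different row labels (illustrative data)
open_price = pd.Series([1000, 1001, 1002], index=['a', 'b', 'c'])
close_price = pd.Series([2000, 2001], index=['a', 'b'])

df = pd.DataFrame({'open': open_price, 'close': close_price})
print(df)                  # row 'c' has NaN in the 'close' column
print(type(df['open']))    # <class 'pandas.core.series.Series'>
```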

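Indexing

Indexing with [] directly returns column data.

iloc selects rows and columns by position.

loc selects rows and columns by label.

Getting column values

print(t['A'])  # column 'A'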
# a    0
# b    4
# c    8
# Name: A, dtype: int32
print(type(t['A']))  # <class 'pandas.core.series.Series'>

print(t.loc[:, 'A'])  # column 'A' again, this time through loc
# a    0
# b    4
# c    8
# Name: A, dtype: int32

print(t.iloc[:, 1])  # the column at position 1
# a    1
# b    5
# c    9
# Name: B, dtype: int32

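Getting row values

print(t.loc['a'])  # row 'a'
print(type(t.loc['a']))  # <class 'pandas.core.series.Series'>
print(t.loc['a'].iloc[1])  # 1; select row 'a' first, then index the resulting Series by position
print(t.loc['a']['B'])  # 1; select row 'a' first, then take the value labeled 'B'
# print(t.loc['a', 1])  # KeyError: loc expects a column label, not a position
print(t.loc['a', 'B'])  # 1; with loc both the row and the column must be given as labels

print(t.loc['a', :])  # same as t.loc['a']
print(type(t.loc['a', :]))  # <class 'pandas.core.series.Series'>

print(t.loc['a':'c'])  # rows 'a' through 'c'; label slices include both endpoints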
#    A  B   C   D
# a  0  1   2   3
# b  4  5   6   7
# c  8  9  10  11
print(type(t.loc['a':'c']))  # <class 'pandas.core.frame.DataFrame'>

print(t.loc['a':'c', 'A':'C'])  # rows 'a' through 'c', columns 'A' through 'C'
#    A  B   C
# a  0  1   2
# b  4  5   6
# c  8  9  10

print(t.loc[['a', 'c'], ['A', 'C']])  # rows 'a' and 'c', columns 'A' and 'C'
#    A   C
# a  0   2
# c  8  10

# Selecting rows by position: iloc stands for "integer location"
print(t.iloc[0])  # equivalent to t.loc['a']
# A    0
# B    1
# C    2
# D    3
# Name: a, dtype: int32

print(t.iloc[0, 1])  # 1; the value at row position 0, column position 1

print(t.iloc[0, :])  # the row at position 0
# A    0
# B    1
# C    2
# D    3

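A special note

Selecting with a list of labels returns a different type than selecting with a single label: a single label gives a Series, while a one-element list gives a DataFrame.

print(t.loc['a'])
print(type(t.loc['a']))
# A    0
# B    1
# C    2
# D    3
# Name: a, dtype: int32
# <class 'pandas.core.series.Series'>

print(t.loc[['a']])
print(type(t.loc[['a']]))
#    A  B  C  D
# a  0  1  2  3
# <class 'pandas.core.frame.DataFrame'>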

Data alignment

df1 = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [5, 6, 7, 8]}, index=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame({'x': [0, 0, 1, 1], 'y': [2, 2, 3, 3]}, index=['a', 'b', 'c', 'd'])
df = df1 + df2
print(df1)
print(df2)
print(df)
#    x   y
# a  1   7
# b  2   8
# c  4  10
# d  5  11

df1.loc['d', 'y'] = np.nan
df1.loc['d', 'x'] = np.nan
print(df1 + df2)  # NaN propagates: any sum that involves NaN is NaN
#      x     y
# a  1.0   7.0
# b  2.0   8.0
# c  4.0  10.0
# d  NaN   NaN

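Handling missing values

# df3 is not defined in the original listing. The outputs below match the sum
# from the previous section with one more missing value at ('c', 'x').
df3 = df1 + df2
df3.loc['c', 'x'] = np.nan

print(df3.dropna(how="any", axis=0))
# Defaults: how="any" and axis=0, so every row that contains a NaN is dropped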
#      x    y
# a  1.0  7.0
# b  2.0  8.0

print(df3.dropna(how='all'))
# With how='all', a row is dropped only when all of its values are NaN
#      x     y
# a  1.0   7.0
# b  2.0   8.0
# c  NaN  10.0

print(df3.dropna(how="all", axis=1))
#      x     y
# a  1.0   7.0
# b  2.0   8.0
# c  NaN  10.0
# d  NaN   NaN

print(df3.dropna(axis=0, how="all").dropna(axis=1))
#       y
# a   7.0
# b   8.0
# c  10.0
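
Dropping is not the only option: fillna works on a DataFrame just as it does on a Series. The sketch below rebuilds the same values as df3 so that it runs on its own; the names filled_zero and filled_mean are illustrative.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({'x': [1.0, 2.0, np.nan, np.nan],
                    'y': [7.0, 8.0, 10.0, np.nan]},
                   index=['a', 'b', 'c', 'd'])

filled_zero = df3.fillna(0)            # every NaN becomes 0
filled_mean = df3.fillna(df3.mean())   # each column is filled with its own mean

print(filled_zero.loc['d', 'y'])   # 0.0
print(filled_mean.loc['c', 'x'])   # 1.5
```

Passing a Series such as df3.mean() fills each column with the entry that carries the column's name.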

Sorting

ac = df.sort_values(ascending=True, by='x')  # sort the rows by the values in column 'x'
print(ac)
#    x   y
# a  1   7
# b  2   8
# c  4  10
# d  5  11
ar = df.sort_values(ascending=False, by='a', axis=1)  # sort the columns by the values in row 'a'
print(ar)
#     y  x
# a   7  1
# b   8  2
# c  10  4
# d  11  5
ai = df.sort_index(ascending=False)  # sort by row labels
print(ai)
#    x   y
# d  5  11
# c  4  10
# b  2   8
# a  1   7

ai = df.sort_index(ascending=False, axis=1)  # sort by column labels
print(ai)
#     y  x
# a   7  1
# b   8  2
# c  10  4
# d  11  5

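Common functions

Pay attention to what the axis argument does.

print(df3.mean(axis=1))  # aggregates across the columns: the mean of each row
print(df3.mean(axis=0))  # the default; aggregates down the rows: the mean of each column, skipping NaN
print(df3.sum())
print(df3.sum(axis=1))  # NaN is skipped, so it contributes nothing to the sum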
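
The effect of axis and of NaN skipping can be verified on concrete numbers. This sketch rebuilds the same values as df3 so that it is self-contained; skipna=False is shown only for contrast.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({'x': [1.0, 2.0, np.nan, np.nan],
                    'y': [7.0, 8.0, 10.0, np.nan]},
                   index=['a', 'b', 'c', 'd'])

col_means = df3.mean(axis=0)                 # one value per column
row_means = df3.mean(axis=1)                 # one value per row
row_sums = df3.sum(axis=1)                   # NaN is skipped
strict_sums = df3.sum(axis=1, skipna=False)  # NaN propagates instead

print(col_means['x'])    # 1.5
print(row_means['a'])    # 4.0
print(row_sums['d'])     # 0.0
print(strict_sums['c'])  # nan
```

A row that holds only NaN has a sum of 0.0 but a mean of NaN, because there is nothing to average.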

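Time series

Ordinary date and time handling

import datetime

import dateutil.parser  # importing dateutil alone does not reliably expose the parser submodule

t = datetime.datetime.strptime('2021-9-24', '%Y-%m-%d')
print(t)  # the string parsed into a datetime
# dateutil can parse dates without an explicit format string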
d = dateutil.parser.parse('2020-9-24')
print(d)
d = dateutil.parser.parse('2020/09/08')
print(d)
d = dateutil.parser.parse('2020.09.08')
print(d)

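Time handling in pandas

Generating time series

pd.to_datetime parses time strings in many common formats.

# A list of time strings is parsed into a DatetimeIndex.
# Note: pandas 2.0 and later infer a single format from the first element,
# so a list that mixes formats, like this one, needs format='mixed' there.
pdate = pd.to_datetime(['2020-09-24', '2021/09/24'])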
print(pdate)
# DatetimeIndex(['2020-09-24', '2021-09-24'], dtype='datetime64[ns]', freq=None)
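
When the format is known in advance, passing it explicitly avoids any guessing. The sketch below also shows errors='coerce', which is not used in the original listing; the input strings are illustrative.

```python
import pandas as pd

# Explicit format: every string must match it
dates = pd.to_datetime(['2020/09/24', '2021/09/24'], format='%Y/%m/%d')
print(dates)

# errors='coerce' turns unparseable entries into NaT instead of raising
cleaned = pd.to_datetime(['2020-09-24', 'not a date'], errors='coerce')
print(cleaned)
```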

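Generating the dates within a range

x = pd.date_range('2020-01-01', '2020-3-2')  # start and end dates, both included
print(x)

Starting from a given date, generate a fixed number of periods; the freq parameter sets the spacing between them.

# Starting from a given date, generate consecutive days
x = pd.date_range('2021-9-24', periods=30)
print(x)  # 30 days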

# One timestamp per hour, 10 in total (pandas 2.2 and later prefer the lowercase alias 'h')
x = pd.date_range('2021-09-24', freq='H', periods=10)
print(x)
# DatetimeIndex(['2021-09-24 00:00:00', '2021-09-24 01:00:00',
#                '2021-09-24 02:00:00', '2021-09-24 03:00:00',
#                '2021-09-24 04:00:00', '2021-09-24 05:00:00',
#                '2021-09-24 06:00:00', '2021-09-24 07:00:00',
#                '2021-09-24 08:00:00', '2021-09-24 09:00:00'],
#               dtype='datetime64[ns]', freq='H')

# Weekly frequency: one date per week, on Sundays
x = pd.date_range('2020-09-24', periods=10, freq='W-SUN')
print(x)
# DatetimeIndex(['2020-09-27', '2020-10-04', '2020-10-11', '2020-10-18',
#                '2020-10-25', '2020-11-01', '2020-11-08', '2020-11-15',
#                '2020-11-22', '2020-11-29'],
#               dtype='datetime64[ns]', freq='W-SUN')

# One date per week, on Mondays
x = pd.date_range('2020-09-24', periods=10, freq='W-MON')
print(x)
# DatetimeIndex(['2020-09-28', '2020-10-05', '2020-10-12', '2020-10-19',
#                '2020-10-26', '2020-11-02', '2020-11-09', '2020-11-16',
#                '2020-11-23', '2020-11-30'],
#               dtype='datetime64[ns]', freq='W-MON')

# Business days only: weekends are skipped
x = pd.date_range('2021-09-24', periods=10, freq='B')  # B = business day
print(x)
# DatetimeIndex(['2021-09-24', '2021-09-27', '2021-09-28', '2021-09-29',
#                '2021-09-30', '2021-10-01', '2021-10-04', '2021-10-05',
#                '2021-10-06', '2021-10-07'],
#               dtype='datetime64[ns]', freq='B')

# One timestamp every 1 hour 20 minutes
x = pd.date_range('2021-09-24', periods=10, freq='1h20min')
print(x)
# DatetimeIndex(['2021-09-24 00:00:00', '2021-09-24 01:20:00',
#                '2021-09-24 02:40:00', '2021-09-24 04:00:00',
#                '2021-09-24 05:20:00', '2021-09-24 06:40:00',
#                '2021-09-24 08:00:00', '2021-09-24 09:20:00',
#                '2021-09-24 10:40:00', '2021-09-24 12:00:00'],
#               dtype='datetime64[ns]', freq='80T')
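
A few properties of these ranges can be checked without reading the full printout. A minimal sketch with illustrative variable names; the alias '80min' is used because it should be accepted by both older and newer pandas releases.

```python
import pandas as pd

days = pd.date_range('2020-01-01', '2020-03-02')               # both endpoints included
business = pd.date_range('2021-09-24', periods=10, freq='B')   # weekdays only
steps = pd.date_range('2021-09-24', periods=10, freq='80min')  # 1 h 20 min apart

print(len(days))                  # 62 (31 + 29 + 2; 2020 is a leap year)
print(business.dayofweek.max())   # 4, i.e. Friday is the latest weekday present
print(steps[1] - steps[0])        # 0 days 01:20:00
```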

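Using time series

A time series can serve as the row index of one-dimensional or multi-dimensional data.

sr = pd.Series(np.arange(100), index=pd.date_range('2020-01-01', periods=100))

With a date index you can, for example, select only the data for February 2020, or only the data for 2020.

print(sr['2020-2'])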
# 2020-02-01    31
# 2020-02-02    32
# 2020-02-03    33
# ...
# 2020-02-27    57
# 2020-02-28    58
# 2020-02-29    59
# Freq: D, dtype: int32
print(sr['2020'])

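Slicing by date returns the data within a time range.

# The output below comes from a longer series than the 100-day one above.
# Assumption: 1000 daily values starting on 2016-01-01, which reproduces the numbers shown.
sr = pd.Series(np.arange(1000), index=pd.date_range('2016-01-01', periods=1000))
print(sr['2017':'2018-02'])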
# 2017-01-01    366
# 2017-01-02    367
# 2017-01-03    368
# 2017-01-04    369
# 2017-01-05    370
#              ...
# 2018-02-24    785
# 2018-02-25    786
# 2018-02-26    787
# 2018-02-27    788
# 2018-02-28    789
# Freq: D, Length: 424, dtype: int32

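Resampling aggregates the data at a new frequency, for example a weekly sum, a monthly sum, or a mean.

sw = sr.resample('W').sum()  # weekly sum; each bin is labeled with the Sunday that ends it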
print(sw)
# 2016-01-03       3
# 2016-01-10      42
# 2016-01-17      91
# 2016-01-24     140
# 2016-01-31     189
#               ...
# 2018-09-02    6804
# 2018-09-09    6853
# 2018-09-16    6902
# 2018-09-23    6951

sm = sr.resample('M').sum()  # monthly sum (pandas 2.2 and later prefer the alias 'ME')
print(sm)
# 2016-01-31      465
# 2016-02-29     1305
# 2016-03-31     2325
# 2016-04-30     3165
# 2016-05-31     4216
# 2016-06-30     4995
# 2016-07-31     6107
# 2016-08-31     7068
# 2016-09-30     7755
# 2016-10-31     8959
# 2016-11-30     9585
# 2016-12-31    10850
# 2017-01-31    11811
# 2017-02-28    11494
# 2017-03-31    13640
# 2017-04-30    14115
# 2017-05-31    15531
# 2017-06-30    15945
# 2017-07-31    17422
# 2017-08-31    18383
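
Resampling is easier to follow on a series small enough to add up by hand. This sketch uses two full weeks of made-up data starting on a Monday:

```python
import numpy as np
import pandas as pd

# 14 daily values; 2020-01-06 is a Monday
sr = pd.Series(np.arange(14), index=pd.date_range('2020-01-06', periods=14))

weekly_sum = sr.resample('W').sum()    # bins end on Sunday
weekly_mean = sr.resample('W').mean()

print(weekly_sum.tolist())    # [21, 70]
print(weekly_mean.tolist())   # [3.0, 10.0]
print(weekly_sum.index[0])    # 2020-01-12 00:00:00
```

The first bin covers 0 through 6 and the second 7 through 13, which gives the sums 21 and 70.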

Reading CSV files and converting to time series

test.csv

	date	open	close
0	2020/3/1	1000	2000
1	2020/3/2	1001	2001
2	2020/3/3	1002	2002
3	2020/3/4	1003	2003
4	2020/3/5	1004	2004
5	2020/3/6	1005	2005
6	2020/3/7	1006	2006
7	2020/3/8	1007	2007
8	2020/3/9	1008	2008
9	2020/3/10	1009	2009

Reading the CSV file

csv = pd.read_csv('test.csv')
print(csv)
#    Unnamed: 0       date    open   close
# 0           0   2020/3/1  1000.0  2000.0
# 1           1   2020/3/2  1001.0  2001.0
# 2           2   2020/3/3  1002.0  2002.0
# 3           3   2020/3/4  1003.0  2003.0
# 4           4   2020/3/5  1004.0  2004.0
# 5           5   2020/3/6  1005.0  2005.0
# 6           6   2020/3/7  1006.0  2006.0
# 7           7   2020/3/8  1007.0  2007.0
# 8           8   2020/3/9  1008.0  2008.0
# 9           9  2020/3/10  1009.0  2009.0
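# We would like the first column to be the index, but by default it is read as a data column named 'Unnamed: 0'

Using a column as the row index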

csv = pd.read_csv('test.csv', index_col='date')
print(csv)
#           Unnamed: 0    open   close
# date
# 2020/3/1            0  1000.0  2000.0
# 2020/3/2            1  1001.0  2001.0
# 2020/3/3            2  1002.0  2002.0
# 2020/3/4            3  1003.0  2003.0
# 2020/3/5            4  1004.0  2004.0
# 2020/3/6            5  1005.0  2005.0
# 2020/3/7            6  1006.0  2006.0
# 2020/3/8            7  1007.0  2007.0
# 2020/3/9            8  1008.0  2008.0
# 2020/3/10           9  1009.0  2009.0

print(csv.index)  # the index is a plain Index of strings, not dates
# Index(['2020/3/1', '2020/3/2', '2020/3/3', '2020/3/4', '2020/3/5', '2020/3/6',
#        '2020/3/7', '2020/3/8', '2020/3/9', '2020/3/10'],
#       dtype='object', name='date')

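To make later date indexing easier, convert the dates in the CSV file into a DatetimeIndex.
Pass parse_dates to read_csv, either as True (parse the index) or as a list containing the name of the date column (such as date).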

c = pd.read_csv('test.csv', index_col='date', parse_dates=True)
c = pd.read_csv('test.csv', index_col='date', parse_dates=['date'])
print(c.index)
# DatetimeIndex(['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04',
#                '2020-03-05', '2020-03-06', '2020-03-07', '2020-03-08',
#                '2020-03-09', '2020-03-10'],
#               dtype='datetime64[ns]', name='date', freq=None)
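
The same call can be tried without a file on disk by reading from an in-memory string. This is a sketch under the assumption of a comma-separated file with a header row; the sample rows are illustrative.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk (illustrative content)
text = (
    "date,open,close\n"
    "2020/3/1,1000,2000\n"
    "2020/3/2,1001,2001\n"
    "2020/3/10,1009,2009\n"
)

df = pd.read_csv(io.StringIO(text), index_col='date', parse_dates=True)

print(type(df.index))                              # a DatetimeIndex
print(df.loc[pd.Timestamp('2020-03-02'), 'open'])  # 1001
print(len(df.loc['2020-03']))                      # 3 rows fall in March 2020
```

Once the index is a DatetimeIndex, partial date strings such as '2020-03' select whole months.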

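To assign your own column names, set header=None and pass the names through the names parameter. The output below assumes a file that holds only the three data columns, with no header row and no index column.

c = pd.read_csv('test.csv', header=None, names=list('abc'))
print(c)
#            a     b     c
# 0   2020/3/1  1000  2000
# 1   2020/3/2  1001  2001
# 2   2020/3/3  1002  2002
# 3   2020/3/4  1003  2003
# 4   2020/3/5  1004  2004
# 5   2020/3/6  1005  2005
# 6   2020/3/7  1006  2006
# 7   2020/3/8  1007  2007
# 8   2020/3/9  1008  2008
# 9  2020/3/10  1009  2009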

Handling missing values in CSV files

a.csv

0	2020/3/1	1000	2000
1	2020/3/2	1001	2001
2	2020/3/3	1002	2002
3	2020/3/4	NaN	2003
4	2020/3/5	1004	2004
5	2020/3/6	1005	2005
6	2020/3/7	1006	2006
7	2020/3/8	1007	2007
8	2020/3/9	1008	2008
9	2020/3/10	1009	2009

c = pd.read_csv('a.csv', header=None)  # without a header row the column labels default to integers
print(c)
#    0          1     2     3
# 0  0   2020/3/1  1000  2000
# 1  1   2020/3/2  1001  2001
# 2  2   2020/3/3  1002  2002
# 3  3   2020/3/4  NaN   2003
# 4  4   2020/3/5  1004  2004
# 5  5   2020/3/6  1005  2005
# 6  6   2020/3/7  1006  2006
# 7  7   2020/3/8  1007  2007
# 8  8   2020/3/9  1008  2008
# 9  9  2020/3/10  1009  2009

print(c[2])  # the column labeled 2
# 0    1000.0
# 1    1001.0
# 2    1002.0
# 3       NaN
# 4    1004.0
# 5    1005.0
# 6    1006.0
# 7    1007.0
# 8    1008.0
# 9    1009.0
# Name: 2, dtype: float64
# NaN is a float, so the whole column becomes float64

# After replacing NaN with None in a.csv and reading the file again:
print(c[2])
# 0    1000
# 1    1001
# 2    1002
# 3    None
# 4    1004
# 5    1005
# 6    1006
# 7    1007
# 8    1008
# 9    1009
# Name: 2, dtype: object
# None is read as text, so the column dtype is object
print(type(c[2][0]))
# <class 'str'>; because one entry is text, every value in the column is read as a str

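To have None in the CSV file read as NaN, list it in na_values. Recent pandas releases already include None among the default missing-value markers, so there the argument mainly makes the intent explicit.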

c = pd.read_csv('a.csv', header=None, na_values=['None'])
print(c)
#    0          1       2     3
# 0  0   2020/3/1  1000.0  2000
# 1  1   2020/3/2  1001.0  2001
# 2  2   2020/3/3  1002.0  2002
# 3  3   2020/3/4     NaN  2003

print(c[2])  # None has become NaN and the column dtype is now float64
# 0    1000.0
# 1    1001.0
# 2    1002.0
# 3       NaN
# Name: 2, dtype: float64

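Once None has been read as NaN, the data can be written back to a CSV file with to_csv. By default NaN is written as an empty field, so pass na_rep to choose the text written in its place; here it is the string null.

# Write the data to a CSV file
c.to_csv('out.csv', columns=[0, 1, 2, 3], header=False, index=False, na_rep='null')
# NaN is written as an empty field by default; na_rep sets the text written in its place
# columns selects the columns to write, header controls whether column names are written,
# and index controls whether row labels are written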

out.csv

0	2020/3/1	1000	2000
1	2020/3/2	1001	2001
2	2020/3/3	1002	2002
3	2020/3/4	null	2003
4	2020/3/5	1004	2004
5	2020/3/6	1005	2005
6	2020/3/7	1006	2006
7	2020/3/8	1007	2007
8	2020/3/9	1008	2008
9	2020/3/10	1009	2009
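
The round trip can be checked in memory by writing to a string buffer and reading the result back. A minimal sketch with illustrative data; the buffer stands in for out.csv.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['2020/3/3', '2020/3/4'],
                   'open': [1002.0, np.nan]})

# Write to an in-memory buffer instead of a file
buf = io.StringIO()
df.to_csv(buf, index=False, na_rep='null')
text = buf.getvalue()
print(text)   # the missing value appears as the word null

# Read it back, declaring 'null' as a missing-value marker
back = pd.read_csv(io.StringIO(text), na_values=['null'])
print(back['open'].isnull().tolist())   # [False, True]
```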
