Pandas Data Structures: Series and DataFrame

This article covers Section 3.2. The source code and data are available at the following link:
https://download.csdn.net/download/Ahaha_biancheng/83338868

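Contents

3.2 Pandas Data Structures

3.2.1 Series Objects
3.2.2 Selecting Series Data

(1) Query (2) Modify (3) Add (4) Delete (5) Change the index

3.2.3 DataFrame Objects

Creating a DataFrame
Accessing DataFrame Data

(1) Access (2) Add (3) Modify (4) Delete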

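3.2 Pandas Data Structures

Structured data analysis is a mature process and technology, and relational databases are the usual store for structured data.

pandas is a data analysis toolkit built on Python's NumPy library, and it is well suited to processing relational (tabular) data.

◆  The Series data structure handles one-dimensional data

◆  The DataFrame data structure handles two-dimensional data, and higher-dimensional data as well

◆  Combine data from multiple sources and handle missing data

◆  Slice, aggregate, and summarize data

◆  Visualize data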

import numpy as np
import pandas as pd

from pandas import DataFrame, Series

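3.2.1 Series Objects

Creating a Series

Series([data, index, ...])
data: a Python list or a one-dimensional NumPy ndarray
index: a list; if omitted, integer labels 0 to n-1 are generated automatically

Example 3-1: Create a Series object height for the heights of five basketball players. The values are the heights,
and the index is the jersey number (numeric strings used as the index).

# data=np.array([187,190,185,178,185])
height=Series([187,190,185,178,185],index=['13','14','7','2','9']) # index is optional, but it is best to provide it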
height

13    187
14    190
7     185
2     178
9     185
dtype: int64

# Without an index argument, a default integer index is used
height2=Series([187,190,185,178,185])
height2

0    187
1    190
2    185
3    178
4    185
dtype: int64

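A Series is similar to a dictionary: each pair of elements at the same position in the index and values arrays can be viewed as a key-value pair. When a Series is created from a dictionary, the dictionary keys become the index:

height3 = Series({'13':187, '14':190}) # don't forget the curly braces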
height3

13    187
14    190
dtype: int64
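
To make the dictionary analogy concrete, here is a minimal sketch. It reuses the height3 data from above; the other variable names are chosen only for illustration.

```python
import pandas as pd

height3 = pd.Series({'13': 187, '14': 190})

# Membership tests check the index labels (the "keys"), not the values
has_13 = '13' in height3
has_187 = 187 in height3

# index and values are the two parallel arrays behind the Series
labels = list(height3.index)
numbers = height3.values.tolist()

# Convert back to a plain dictionary
as_dict = height3.to_dict()
print(has_13, has_187, labels, numbers, as_dict)
```

As with a dictionary, the in operator looks at the labels, which is why the value 187 is not found.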

3.2.2 Selecting Series Data

(1) Query

# Select a single value by index label

height['13']

# Select multiple values by index label

height[['13','2']]

13    187
2     178
dtype: int64

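# Select by integer position (use iloc; plain height[4] on a string index is deprecated in recent pandas)

height.iloc[4]

# Slice by integer position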

height[0:3]

13    187
14    190
7     185
dtype: int64

# Filter by condition
height[height.values>=185]

13    187
14    190
7     185
9     185
dtype: int64

height=Series([187,190,185,178,185], index = ['13','14','7','2','9'])
height

13    187
14    190
7     185
2     178
9     185
dtype: int64

height.values>=185

array([ True,  True,  True, False,  True])

height[[ True,  True,  True, False,  True]]

13    187
14    190
7     185
9     185
dtype: int64
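
Boolean filters can also be combined. The sketch below assumes the same height data as Example 3-1; the threshold values are arbitrary and chosen only for illustration.

```python
import pandas as pd

height = pd.Series([187, 190, 185, 178, 185], index=['13', '14', '7', '2', '9'])

# Each comparison must be wrapped in parentheses; & means "and", | means "or"
mid_range = height[(height >= 185) & (height < 190)]
extremes = height[(height < 180) | (height >= 190)]

print(mid_range)
print(extremes)
```

Python's and/or keywords do not work element by element on a Series, which is why the & and | operators are used here.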

(2) Modify

Select first, then assign.

height['13'] = 180
height

13    180
14    190
7     185
2     178
9     185
dtype: int64

height[['13','14']] = 180
height

13    180
14    180
7     185
2     178
9     185
dtype: int64

height[:] = 180
height

13    180
14    180
7     180
2     180
9     180
dtype: int64
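
The same "select, then assign" pattern works with a boolean condition. This is a small sketch on a copy of the Example 3-1 data; the cutoff of 185 is an arbitrary choice for illustration.

```python
import pandas as pd

height = pd.Series([187, 190, 185, 178, 185], index=['13', '14', '7', '2', '9'])

# Work on a copy so the original Series keeps its values
adjusted = height.copy()

# Select with a boolean mask, then assign: only the matching entries change
adjusted.loc[adjusted < 185] = 185

print(adjusted)
```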

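(3) Add

append() does not add data to a Series in place.

Instead, the append() function concatenates two Series and produces a new Series.

The original Series is not changed. Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat() takes its place.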

height.append({'3':191}) # raises an error: the argument must be a Series, not a dict

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
----> 1 height.append({'3':191}) # raises an error

TypeError: cannot concatenate object of type '<class 'dict'>'; only Series and DataFrame objs are valid

# First create a new Series, a
a = Series([191, 182], index=['3','0'])
a

3    191
0    182
dtype: int64

# append() concatenates two Series into a new Series (pandas < 2.0; in pandas 2.0+ use pd.concat([height, a]))
new = height.append(a)
new

13    180
14    180
7     180
2     180
9     180
3     191
0     182
dtype: int64

# The original Series is unchanged

height

13    180
14    180
7     180
2     180
9     180
dtype: int64
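
For pandas 2.0 and later, where Series.append() no longer exists, the same result can be obtained with pd.concat(). The following is a minimal sketch that uses the Example 3-1 heights rather than the modified values shown above.

```python
import pandas as pd

height = pd.Series([187, 190, 185, 178, 185], index=['13', '14', '7', '2', '9'])
a = pd.Series([191, 182], index=['3', '0'])

# pd.concat takes a list of Series and returns a new, longer Series
new = pd.concat([height, a])

# A dict can be added by first wrapping it in a Series
new2 = pd.concat([height, pd.Series({'3': 191})])

print(new)
print(len(height))  # the original Series still has 5 entries
```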

(4) Delete

height.drop(['13'])

14    180
7     180
2     180
9     180
dtype: int64

height.drop('14')

13    180
7     180
2     180
9     180
dtype: int64

# drop() does not change the original data
height

13    180
14    180
7     180
2     180
9     180
dtype: int64

# so save the result if you need it
new = height.drop(['13'])
new

14    180
7     180
2     180
9     180
dtype: int64
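
drop() also accepts several labels at once. The sketch below uses the Example 3-1 heights; the label '99' is a deliberately missing label chosen for illustration.

```python
import pandas as pd

height = pd.Series([187, 190, 185, 178, 185], index=['13', '14', '7', '2', '9'])

# Drop several labels at once; the result is a new Series
shorter = height.drop(['13', '9'])

# A missing label raises KeyError by default; errors='ignore' skips it instead
same = height.drop(['99'], errors='ignore')

print(shorter)
print(same.equals(height))
```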

(5) Change the index

Simply assign a new list to index.

height.index = [5, 6, 7, 8, 9]

height

5    180
6    180
7    180
8    180
9    180
dtype: int64

height[[5, 6]]

5    180
6    180
dtype: int64

# The Series index is now numeric, so position-based access must use iloc

height.iloc[0]
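
The difference between label-based and position-based access is easiest to see when the labels are integers that do not start at 0. In this sketch the values 180 to 184 are made up so that each element can be told apart.

```python
import pandas as pd

height = pd.Series([180, 181, 182, 183, 184], index=[5, 6, 7, 8, 9])

by_label = height.loc[5]      # label 5 is the first element
by_position = height.iloc[0]  # position 0 is also the first element
last = height.iloc[-1]        # negative positions count from the end

# With an integer index, height[0] is looked up as a label, and no label 0 exists
try:
    height[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, last, label_zero_found)
```

Using loc for labels and iloc for positions avoids having to remember which rule plain square brackets follow.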

3.2.3 DataFrame Objects

A DataFrame consists of three parts: the values (values), the row index (index), and the column index (columns).

Creating a DataFrame:

DataFrame(data, index=[...], columns=[...])

* data: a list or a two-dimensional NumPy ndarray
* index, columns: lists; if omitted, integer labels 0 to n-1 are generated automatically

data = np.array([[19,170,68],[20,165,65],[18, 175, 65]])
st = DataFrame(data, index=[11,12,13], columns=['age','height','weight'])
st

	age	height	weight
11	19	170	68
12	20	165	65
13	18	175	65
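
A DataFrame can also be built from a dictionary of columns, which is often more readable than a 2D array. This sketch rebuilds the same student table; the name st2 is chosen only for illustration.

```python
import pandas as pd

# Each dictionary key becomes a column label; each list holds that column's values
st2 = pd.DataFrame(
    {'age': [19, 20, 18], 'height': [170, 165, 175], 'weight': [68, 65, 65]},
    index=[11, 12, 13],
)

print(st2.columns.tolist())
print(st2.index.tolist())
print(st2.shape)  # (number of rows, number of columns)
```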

Accessing DataFrame Data

(1) Access

st

	age	height	weight
11	19	170	68
12	20	165	65
13	18	175	65

# Select one column as a DataFrame: df[[col]]

st[['age']]

	age
11	19
12	20
13	18
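
Single and double brackets return different types, which matters for later operations. A short sketch using the same student table:

```python
import numpy as np
import pandas as pd

data = np.array([[19, 170, 68], [20, 165, 65], [18, 175, 65]])
st = pd.DataFrame(data, index=[11, 12, 13], columns=['age', 'height', 'weight'])

age_series = st['age']    # single brackets: a one-dimensional Series
age_frame = st[['age']]   # double brackets: a DataFrame with one column

print(type(age_series).__name__, age_series.shape)
print(type(age_frame).__name__, age_frame.shape)
```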

# Select multiple columns: df[[col1, col2]]

st[['age','height']]

	age	height
11	19	170
12	20	165
13	18	175

# Select rows with a slice: df[0:2]
st[0:2]
st.iloc[0:2, :] # the ":" for the columns can be omitted

	age	height	weight
11	19	170	68
12	20	165	65

# Select a row by index label: df.loc[label]

st.loc[11]

age        19
height    170
weight     68
Name: 11, dtype: int32

# Select rows and columns by label: df.loc[index, column]

st.loc[[11,13],['age','height']]

	age	height
11	19	170
13	18	175

st

	age	height	weight
11	19	170	68
12	20	165	65
13	18	175	65

# Select rows and columns by integer position: df.iloc[rows, cols]

st.iloc[[0,1],[0,1]]

	age	height
11	19	170
12	20	165

# Select rows and columns with slices: df.iloc[0:2, 0:2]

st.iloc[0:2, 0:2]

	age	height
11	19	170
12	20	165

# Filter rows with a boolean expression: df.loc[bool_vec, cols] (compare with a database query)

st.loc[st['age']>=19, ['height']]

	height
11	170
12	165
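
To extend the database analogy, several conditions can be combined in one filter. The sketch below uses the same student table; the SQL in the comment is only a rough equivalent for orientation.

```python
import numpy as np
import pandas as pd

data = np.array([[19, 170, 68], [20, 165, 65], [18, 175, 65]])
st = pd.DataFrame(data, index=[11, 12, 13], columns=['age', 'height', 'weight'])

# Roughly: SELECT age, height FROM st WHERE age >= 19 AND weight < 68
result = st.loc[(st['age'] >= 19) & (st['weight'] < 68), ['age', 'height']]

print(result)
```

Only student 12 satisfies both conditions, so the result has a single row.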

(2) Add

A new column can be added to a DataFrame by assignment. New rows are usually added by combining two DataFrame objects (see Section 3.5).

st

	age	height	weight
11	19	170	68
12	20	165	65
13	18	175	65

st['expense'] = [1100, 1000, 900] # if the column label does not exist, a new column is added; if it exists, its values are modified
st

	age	height	weight	expense
11	19	170	68	1100
12	20	165	65	1000
13	18	175	65	900
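
As a preview of adding rows by combining two DataFrame objects, here is a minimal sketch using pd.concat(). The new student with label 14 and its values are made up for illustration, and Section 3.5 may use a different approach.

```python
import pandas as pd

st = pd.DataFrame(
    {'age': [19, 20, 18], 'height': [170, 165, 175], 'weight': [68, 65, 65]},
    index=[11, 12, 13],
)

# The new row is itself a one-row DataFrame with the same columns (hypothetical data)
new_row = pd.DataFrame({'age': [21], 'height': [172], 'weight': [70]}, index=[14])

# pd.concat stacks the rows and returns a new DataFrame; st itself is unchanged
st_more = pd.concat([st, new_row])

print(st_more)
```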

(3) Modify

# Locate by index, then assign

st['age'] = st['age'] + 1 # the column label exists, so this modifies its values
st

	age	height	weight	expense
11	20	170	68	1100
12	21	165	65	1000
13	19	175	65	900

st['expense'] = 1200
st

	age	height	weight	expense
11	20	170	68	1200
12	21	165	65	1200
13	19	175	65	1200

# Locate by index, then assign from a list
st['expense'] = [1300, 1400, 1500]
st

	age	height	weight	expense
11	20	170	68	1300
12	21	165	65	1400
13	19	175	65	1500

# Modify the data of the first student (label 11) by assigning a list
st.loc[[11]] = [21,180,70,20]
st

	age	height	weight	expense
11	21	180	70	20
12	21	165	65	1400
13	19	175	65	1500

# Filter first, then assign

st.loc[st['expense']<800, 'expense'] = 800
st

	age	height	weight	expense
11	21	180	70	800
12	21	165	65	1400
13	19	175	65	1500

st.loc[st['expense']==800, 'expense'] = 80

# Step 1: build the boolean array of rows that satisfy the condition
mask = st['expense']<800
mask

11     True
12    False
13    False
Name: expense, dtype: bool

# Step 2: select the matching rows, then the expense column, and assign
st.loc[mask, 'expense'] = 900
st

	age	height	weight	expense
11	21	180	70	900
12	21	165	65	1400
13	19	175	65	1500
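
The same loc pattern covers single cells and computed updates. This sketch uses a reduced two-column table with made-up values so that the effect of each line is easy to follow.

```python
import pandas as pd

st = pd.DataFrame(
    {'age': [19, 20, 18], 'expense': [1100, 1000, 900]},
    index=[11, 12, 13],
)

# One cell: row label 12, column 'expense'
st.loc[12, 'expense'] = 1050

# at is a scalar-only accessor for the same kind of label lookup
st.at[13, 'age'] = 19

# Raise every expense below 1000 by 100, using the mask on both sides
mask = st['expense'] < 1000
st.loc[mask, 'expense'] = st.loc[mask, 'expense'] + 100

print(st)
```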

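(4) Delete

drop() does not modify the original object. To delete rows or columns from the original object itself, set the parameter inplace=True.
axis = 0 refers to rows and axis = 1 refers to columns (axis is pronounced ˈaksəs).

# Delete a row
st.drop(11, axis=0)

	age	height	weight	expense
12	21	165	65	1400
13	19	175	65	1500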

# Delete a column
st.drop('age', axis=1)

	height	weight	expense
11	180	70	900
12	165	65	1400
13	175	65	1500

# Delete multiple columns
st.drop(['height','age'], axis=1)

	weight	expense
11	70	900
12	65	1400
13	65	1500

# The original data is unchanged
st

	age	height	weight	expense
11	21	180	70	900
12	21	165	65	1400
13	19	175	65	1500

# To change the original data
st.drop([13], axis=0, inplace=True)
st

	age	height	weight	expense
11	21	180	70	900
12	21	165	65	1400
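
As an alternative to the axis argument, drop() also accepts the index= and columns= keywords, which some readers find easier to remember. A minimal sketch on the original student table:

```python
import pandas as pd

st = pd.DataFrame(
    {'age': [19, 20, 18], 'height': [170, 165, 175], 'weight': [68, 65, 65]},
    index=[11, 12, 13],
)

# index= names row labels to remove (same as axis=0)
no_row = st.drop(index=11)

# columns= names column labels to remove (same as axis=1)
no_cols = st.drop(columns=['age', 'height'])

print(no_row)
print(no_cols)
print(st.shape)  # st is unchanged because inplace was not set
```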
