import pandas as pd
pd.DataFrame( )
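- Represents a rectangular table of data: an ordered collection of columns
- Each column can hold values of a different type
Creating a DataFrame from a dictionary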
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'popu': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
An index is assigned automatically, just as for a Series
frame = pd.DataFrame(data)
frame
| | state | year | popu |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |
head() shows only the first 5 rows by default
frame.head()
| | state | year | popu |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
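head also accepts the number of rows to return, and tail does the same from the end of the table. A minimal sketch, rebuilding the same frame so that it runs on its own (the variable names first_two and last_two are only illustrative):

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "popu": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame = pd.DataFrame(data)

first_two = frame.head(2)  # the first two rows
last_two = frame.tail(2)   # the last two rows
print(first_two)
print(last_two)
```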
The column order can be specified at creation time
pd.DataFrame(data, columns = ['year', 'state', 'popu']).head()
| | year | state | popu |
|---|---|---|---|
| 0 | 2000 | Ohio | 1.5 |
| 1 | 2001 | Ohio | 1.7 |
| 2 | 2002 | Ohio | 3.6 |
| 3 | 2001 | Nevada | 2.4 |
| 4 | 2002 | Nevada | 2.9 |
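- To be clear: index refers to the row labels (the index), while columns refers to the column names, which will later stand for attribute (feature) names
- A name passed in columns that is not a key of the dict shows up as a column of missing values
- So an extra name in columns is simply filled with NaN, but an extra label in index is rejected when the dict values are plain lists, because the index length must match the data length. This mirrors real data: an extra sample with no values would be odd, whereas adding an attribute is routine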
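The asymmetry described in the bullets above can be checked directly. This is a small sketch on a shortened, made-up version of the data; the names extra_col and index_error are only illustrative:

```python
import pandas as pd

data = {"state": ["Ohio", "Nevada"], "year": [2000, 2001]}

# An unknown column name is accepted: the new column is entirely NaN.
extra_col = pd.DataFrame(data, columns=["year", "state", "debt"])
print(extra_col["debt"].isna().all())

# Three index labels for two rows of list data are rejected.
try:
    pd.DataFrame(data, index=["one", "two", "three"])
    index_error = None
except ValueError as exc:
    index_error = exc
print(type(index_error).__name__)
```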
frame2 = pd.DataFrame(data, columns = ['year', 'state', 'popu', 'debt'], index = ['one', 'two', 'three', 'four', 'five', 'six'])
frame2.columns
Index(['year', 'state', 'popu', 'debt'], dtype='object')
A DataFrame column can be retrieved with dict-style indexing or with attribute (dot) access
frame2['state']
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
frame2.year
one 2000
two 2001
three 2002
four 2001
five 2002
six 2003
Name: year, dtype: int64
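Attribute access only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method; bracket indexing always works. The sketch below uses a hypothetical column named pop, which clashes with the DataFrame.pop method, to show the difference:

```python
import pandas as pd

df = pd.DataFrame({"pop": [1.5, 1.7], "year": [2000, 2001]})

by_key = df["pop"]  # bracket indexing always returns the column
by_attr = df.pop    # dot access finds the DataFrame.pop method instead
print(type(by_key).__name__)
print(callable(by_attr))
print(df.year.equals(df["year"]))  # no clash here, so both forms agree
```

This may be why the notes name the column popu rather than pop.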
To select a row, use .loc with the row label
frame2
| | year | state | popu | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | NaN |
| two | 2001 | Ohio | 1.7 | NaN |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | NaN |
| five | 2002 | Nevada | 2.9 | NaN |
| six | 2003 | Nevada | 3.2 | NaN |
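The notes mention .loc for row selection without showing a call. Here is a minimal sketch on a shortened copy of the table, so that it runs on its own:

```python
import pandas as pd

frame2 = pd.DataFrame(
    {
        "year": [2000, 2001, 2002],
        "state": ["Ohio", "Ohio", "Ohio"],
        "popu": [1.5, 1.7, 3.6],
    },
    index=["one", "two", "three"],
)

row = frame2.loc["three"]            # a single label returns the row as a Series
rows = frame2.loc[["one", "three"]]  # a list of labels returns a DataFrame
print(row)
print(rows)
```

The returned Series is indexed by the column names and carries the row label as its name.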
A row can be modified; note how the assigned value is adapted to each column's type
frame2.loc['three'] = 1
frame2
| | year | state | popu | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | NaN |
| two | 2001 | Ohio | 1.7 | NaN |
| three | 1 | 1 | 1.0 | 1 |
| four | 2001 | Nevada | 2.4 | NaN |
| five | 2002 | Nevada | 2.9 | NaN |
| six | 2003 | Nevada | 3.2 | NaN |
# float('nan') is a real missing value; the string 'NaN' would be stored as text
frame2.loc['three'] = [2002, 'Ohio', 3.6, float('nan')]
frame2
| | year | state | popu | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | NaN |
| two | 2001 | Ohio | 1.7 | NaN |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | NaN |
| five | 2002 | Nevada | 2.9 | NaN |
| six | 2003 | Nevada | 3.2 | NaN |
A column can be modified as well
frame2.debt = 16.5
frame2
| | year | state | popu | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | 16.5 |
| two | 2001 | Ohio | 1.7 | 16.5 |
| three | 2002 | Ohio | 3.6 | 16.5 |
| four | 2001 | Nevada | 2.4 | 16.5 |
| five | 2002 | Nevada | 2.9 | 16.5 |
| six | 2003 | Nevada | 3.2 | 16.5 |
When an array is assigned to a column, its length must match the length of the DataFrame
import numpy as np
frame2['debt'] = np.arange(6.)
frame2
| | year | state | popu | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | 0.0 |
| two | 2001 | Ohio | 1.7 | 1.0 |
| three | 2002 | Ohio | 3.6 | 2.0 |
| four | 2001 | Nevada | 2.4 | 3.0 |
| five | 2002 | Nevada | 2.9 | 4.0 |
| six | 2003 | Nevada | 3.2 | 5.0 |
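To see what the length requirement means in practice, the sketch below assigns one array of the right length and one that is too long. The small frame and the name length_error are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001, 2002]}, index=["one", "two", "three"])

df["debt"] = np.arange(3.0)      # length 3 matches the 3 rows
try:
    df["debt"] = np.arange(5.0)  # length 5 does not match
    length_error = None
except ValueError as exc:
    length_error = exc
print(type(length_error).__name__)
print(df["debt"].tolist())       # the earlier, valid assignment is still in place
```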
A Series can also be assigned to a column; its labels are realigned to the DataFrame's index, and any index label missing from the Series gets NaN
val = pd.Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])
frame2.debt = val
frame2
| | year | state | popu | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | NaN |
| two | 2001 | Ohio | 1.7 | -1.2 |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | -1.5 |
| five | 2002 | Nevada | 2.9 | -1.7 |
| six | 2003 | Nevada | 3.2 | NaN |
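Assigning to a column that does not exist creates a new column; a column can be deleted with del, as with a dict
For creation, dot syntax (.eastern = ...) cannot create a new column
For deletion, dot syntax does not work either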
frame2['eastern'] = frame2.state == 'Ohio'
frame2
| | year | state | popu | debt | eastern |
|---|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | NaN | True |
| two | 2001 | Ohio | 1.7 | -1.2 | True |
| three | 2002 | Ohio | 3.6 | NaN | True |
| four | 2001 | Nevada | 2.4 | -1.5 | False |
| five | 2002 | Nevada | 2.9 | -1.7 | False |
| six | 2003 | Nevada | 3.2 | NaN | False |
del frame2['eastern']
print(frame2.columns)
print(frame2.index)
Index(['year', 'state', 'popu', 'debt'], dtype='object')
Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')
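The restriction on dot syntax for creating and deleting columns can be checked with a short experiment. This is a sketch on a made-up two-row frame. It assumes that dot assignment only sets a plain Python attribute (pandas emits a warning, silenced here) and that del with dot syntax raises AttributeError:

```python
import warnings

import pandas as pd

df = pd.DataFrame({"state": ["Ohio", "Nevada"], "year": [2000, 2001]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")     # pandas warns about this pattern
    df.eastern = df["state"] == "Ohio"  # sets an attribute, not a column
created_by_dot = "eastern" in df.columns

df["eastern"] = df["state"] == "Ohio"   # bracket syntax creates the column
created_by_key = "eastern" in df.columns

try:
    del df.year                         # dot syntax cannot delete a column
except AttributeError:
    pass
year_survives = "year" in df.columns

del df["eastern"]                       # dict-style deletion works
print(created_by_dot, created_by_key, year_survives, list(df.columns))
```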
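Note: in pandas versions without Copy-on-Write, a column selected from a DataFrame can be a view on the underlying data, so in-place changes to that Series may also change the DataFrame. With Copy-on-Write (optional in pandas 2.x, the default from pandas 3.0) such changes no longer propagate. To get an independent copy in either case, copy explicitly with .copy()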
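A minimal sketch of the explicit copy, on a made-up one-column frame. It only relies on .copy() returning an independent Series, which holds regardless of the pandas version:

```python
import pandas as pd

df = pd.DataFrame({"popu": [1.5, 1.7, 3.6]})

col = df["popu"].copy()  # independent copy of the column
col.iloc[0] = 99.0       # changes the copy only
print(df["popu"].tolist())
print(col.tolist())
```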
Creating a DataFrame from a dict of dicts: the outer keys become the columns and the inner keys become the row index
pop = {'Nevada':{2001:2.4, 2002:2.9},
'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}}
pop
{'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3
| | Nevada | Ohio |
|---|---|---|
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2000 | NaN | 1.5 |
A DataFrame can be transposed with the same .T syntax as a NumPy array
frame3.T
| | 2001 | 2002 | 2000 |
|---|---|---|---|
| Nevada | 2.4 | 2.9 | NaN |
| Ohio | 1.7 | 3.6 | 1.5 |
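When an index is passed explicitly, the inner keys are aligned to it: labels with no data become NaN rows, and keys absent from the index are dropped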
pd.DataFrame(pop, index = [2001, 2002, 2003])
| | Nevada | Ohio |
|---|---|---|
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2003 | NaN | NaN |
A dict of Series can also be used to build a DataFrame
pdata = {'Ohio':frame3['Ohio'][:-1], 'Nevada':frame3['Nevada'][:-1]}
pdata
{'Ohio': 2001 1.7
2002 3.6
Name: Ohio, dtype: float64,
'Nevada': 2001 2.4
2002 2.9
Name: Nevada, dtype: float64}
pd.DataFrame(pdata)
| | Ohio | Nevada |
|---|---|---|
| 2001 | 1.7 | 2.4 |
| 2002 | 3.6 | 2.9 |
If the index and the columns have their name attribute set, the names are displayed as well
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
| state | Nevada | Ohio |
|---|---|---|
| year | | |
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2000 | NaN | 1.5 |
frame3.values
array([[2.4, 1.7],
[2.9, 3.6],
[nan, 1.5]])
Index objects: pd.Index( )
- Index objects are immutable; an item assignment such as frame2.index[1] = ?? raises a TypeError
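The immutability can be confirmed with a short sketch. The labels and the names mutation_error and renamed are made up for illustration. Because item assignment fails, a changed index has to be built as a new object:

```python
import pandas as pd

idx = pd.Index(["one", "two", "three"])

try:
    idx[1] = "TWO"  # in-place assignment is not supported
    mutation_error = None
except TypeError as exc:
    mutation_error = exc

# Build a new Index instead: drop position 1, then insert the new label there.
renamed = idx.delete(1).insert(1, "TWO")
print(type(mutation_error).__name__)
print(renamed.tolist())
```

Immutability is what makes it safe for several Series and DataFrame objects to share one Index.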
labels = pd.Index(np.arange(3))
labels
Int64Index([0, 1, 2], dtype='int64')
obj2 = pd.Series([1.5, -2.5, 0], index = labels)
obj2
0 1.5
1 -2.5
2 0.0
dtype: float64
'php' in frame3.columns
False
2001 in frame3.index
True
A pandas Index can contain duplicate values
- Selecting with a duplicated label returns all entries under that label at once
labels1 = pd.Index(['dawn', 'dawn', 'lby'])
labels1
Index(['dawn', 'dawn', 'lby'], dtype='object')
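The effect of duplicate labels on selection can be sketched with a Series built on the same labels. The values 10, 20, 30 and the name obj are made up for illustration:

```python
import pandas as pd

labels1 = pd.Index(["dawn", "dawn", "lby"])
obj = pd.Series([10, 20, 30], index=labels1)

dup = obj["dawn"]    # duplicated label: a Series holding both entries
single = obj["lby"]  # unique label: a scalar
print(dup.tolist())
print(single)
print(labels1.is_unique)
```

Since the return type depends on whether the label is duplicated, code that indexes by label may want to check is_unique first.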
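Some commonly used Index methods and properties
| Method | Description |
|---|---|
| append | Concatenate additional Index objects onto the original, producing a new Index |
| difference | Compute the set difference of two indexes |
| intersection | Compute the set intersection of two indexes |
| union | Compute the set union of two indexes |
| isin | Compute a boolean array indicating whether each value is contained in the passed collection |
| delete | Delete the element at position i, producing a new Index |
| drop | Delete the passed values, producing a new Index |
| insert | Insert an element at position i, producing a new Index |
| is_monotonic_increasing | True if each element is greater than or equal to the previous one (the older alias is_monotonic is deprecated and absent from recent pandas versions) |
| is_unique | True if the Index has no duplicate values |
| unique | Compute the unique values of the Index, dropping duplicates |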
# append
labels = labels.append(labels1)
labels
Index([0, 1, 2, 'dawn', 'dawn', 'lby'], dtype='object')
# isin
labels1.isin(labels1)
array([ True, True, True])
# delete
labels = labels.delete(3)
labels
Index([0, 1, 2, 'dawn', 'lby'], dtype='object')
labels = labels.append(pd.Index(['22', '33']))
labels
Index([0, 1, 2, 'dawn', 'lby', '22', '33'], dtype='object')
# drop
labels = labels.drop('22')
labels
Index([0, 1, 2, 'dawn', 'lby', '33'], dtype='object')
labels.is_monotonic_increasing
False
labels.is_monotonic_decreasing
False
labels.is_unique
True
labels.unique()
Index([0, 1, 2, 'dawn', 'lby', '33'], dtype='object')
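The set-style methods and insert from the table are not exercised in the cells above. A minimal sketch on two made-up integer indexes, a and b:

```python
import pandas as pd

a = pd.Index([1, 2, 3, 4])
b = pd.Index([3, 4, 5])

diff = a.difference(b)     # values in a that are not in b
inter = a.intersection(b)  # values present in both
uni = a.union(b)           # values present in either
ins = a.insert(1, 10)      # new Index with 10 placed at position 1
print(diff.tolist(), inter.tolist(), uni.tolist(), ins.tolist())
```

Each call returns a new Index and leaves a and b unchanged.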