Pandas: A Data Analysis Tool

Creating a Series
Creating a DataFrame
Viewing Data

Creating a Series

import numpy as np
import pandas as pd
s = pd.Series(np.arange(10))  # build a Series from the array 0..9
print(s)

Output:

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32

Each element of the Series is given an index label (0, 1, 2, ...), and dtype: int32 shows the data type of the elements.
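The default integer dtype of np.arange depends on the platform and NumPy version (int32 on some Windows installations, int64 on most others), so the dtype line may differ on your machine. A minimal sketch, assuming you want a fixed dtype; the name s_fixed is illustrative and not part of the original example:

```python
import numpy as np
import pandas as pd

# Request the dtype explicitly so it does not depend on the platform default
s_fixed = pd.Series(np.arange(10), dtype="int64")
print(s_fixed.dtype)

# The default index is a RangeIndex that starts at 0, not 1
print(s_fixed.index[0], s_fixed.index[-1])
```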

Creating a Series from a dictionary

dic1 = {'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}  # create a dictionary
s2 = pd.Series(dic1)
print(s2)

Output:

a    10
b    20
c    30
d    40
e    50
dtype: int64

The elements of s2 are labelled a, b, c, d and e; these labels come from the keys of the dictionary dic1.
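Because the dictionary keys become the index, elements can be looked up by label. A minimal sketch; the names value_b and subset are illustrative:

```python
import pandas as pd

dic1 = {'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}
s2 = pd.Series(dic1)

value_b = s2['b']          # look up a single element by its label
subset = s2[['a', 'e']]    # a list of labels returns a smaller Series
print(value_b)
print(subset)
```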

Creating a DataFrame

A DataFrame can be created by passing the data in directly:

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
print(df)

Output:

   A  B
0  1  5
1  2  6
2  3  7
3  4  8

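Viewing Data

In machine learning, the training data often contains thousands of samples or more, so printing all of it is impractical. The DataFrame methods head() and tail() show only the first or last few rows.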

df1 = pd.read_csv('data/2_apple.csv')
print(df1.head())

The contents of 2_apple.csv are:

year,apple,price,income
1990,12.8,50.1,1606
1991,12.3,71.3,1513
1992,13.1,81,1567
1993,12.9,76.2,1547
1994,13.8,80.3,1646
1995,13.2,91,1657
1996,13.3,90.2,1678
1997,13.6,84.2,1726
1998,13.5,83.7,1714
1999,12.9,77.1,1795
2000,12.9,74.5,1839
2001,12.8,82.8,1844
2002,13.3,92.2,1831
2003,13.7,88.8,1881
2004,13.2,87.2,1883
2005,13.7,88.3,1909
2006,13.6,90.1,1969
2007,13.7,88.7,2015
2008,13.5,87.3,2126
2009,13.9,93.9,2239
2010,13.9,102.6,2335
2011,13.6,100,2403
2012,14,102.3,2486
2013,14.2,111.4,2534
2014,14.8,117.6,2610

Output:

   year  apple  price  income
0  1990   12.8   50.1    1606
1  1991   12.3   71.3    1513
2  1992   13.1   81.0    1567
3  1993   12.9   76.2    1547
4  1994   13.8   80.3    1646
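Both head() and tail() accept an optional row count and return 5 rows by default. A minimal sketch, assuming a few rows copied from the CSV above and read from an in-memory string so it runs without the file; csv_text and df_demo are illustrative names:

```python
import io
import pandas as pd

# A few rows copied from 2_apple.csv so the example is self-contained
csv_text = """year,apple,price,income
2010,13.9,102.6,2335
2011,13.6,100,2403
2012,14,102.3,2486
2013,14.2,111.4,2534
2014,14.8,117.6,2610
"""
df_demo = pd.read_csv(io.StringIO(csv_text))

first_three = df_demo.head(3)  # first 3 rows
last_two = df_demo.tail(2)     # last 2 rows
print(first_three)
print(last_two)
```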

Display the index of the dataset

print(df1.index)

Output:

RangeIndex(start=0, stop=25, step=1)

This means the index runs from 0 to 24 with a step of 1; the stop value 25 is not included.
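A RangeIndex follows the same half-open convention as Python's range. A minimal sketch on a small frame; df_idx is an illustrative name:

```python
import pandas as pd

df_idx = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
idx = df_idx.index

print(idx)         # RangeIndex(start=0, stop=4, step=1)
print(list(idx))   # the labels actually present: [0, 1, 2, 3]
print(len(idx))    # number of rows
```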

Display the columns of the dataset

print(df1.columns)

Output:

Index(['year', 'apple', 'price', 'income'], dtype='object')

Compute basic descriptive statistics for the data:

print(df1.describe())

Output:

              year      apple       price       income
count    25.000000  25.000000   25.000000    25.000000
mean   2002.000000  13.448000   87.712000  1934.120000
std       7.359801   0.533948   13.509574   327.831425
min    1990.000000  12.300000   50.100000  1513.000000
25%    1996.000000  13.100000   81.000000  1678.000000
50%    2002.000000  13.500000   88.300000  1844.000000
75%    2008.000000  13.700000   92.200000  2126.000000
max    2014.000000  14.800000  117.600000  2610.000000
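The result of describe() is itself a DataFrame whose rows are labelled by statistic name, so individual values can be read back with loc. A minimal sketch on a small frame; df_stats and summary are illustrative names:

```python
import pandas as pd

df_stats = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
summary = df_stats.describe()

# One row of the summary: the mean of every column
print(summary.loc['mean'])

# The std row is the sample standard deviation, the same value Series.std() returns
print(summary.loc['std', 'A'], df_stats['A'].std())
```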

Selecting data:

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
print(df['A'])

Output:

0    1
1    2
2    3
3    4
Name: A, dtype: int64
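Indexing with a single column label returns a Series, while indexing with a list of labels returns a DataFrame. A minimal sketch; df_cols, col_series and col_frame are illustrative names:

```python
import pandas as pd

df_cols = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

col_series = df_cols['A']     # single label: a Series
col_frame = df_cols[['A']]    # list of labels: a DataFrame with one column
print(type(col_series).__name__, type(col_frame).__name__)
```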

Selecting rows with a slice:

print(df[0:3])

Output:

   A  B
0  1  5
1  2  6
2  3  7

Selecting by row and column position:

print(df.iloc[1:3, 0:2])

Output:

   A  B
1  2  6
2  3  7

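This selects the 2nd and 3rd rows and the 1st and 2nd columns; iloc positions are zero-based and the end of each slice is excluded.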
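For comparison, loc selects by label and includes the end of a slice, whereas iloc selects by position and excludes it. A minimal sketch; df_sel, by_position and by_label are illustrative names:

```python
import pandas as pd

df_sel = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

by_position = df_sel.iloc[1:3, 0:2]      # positions 1 and 2; end excluded
by_label = df_sel.loc[1:2, ['A', 'B']]   # labels 1 through 2; end included
print(by_position)
print(by_label)
```

With the default integer index the two calls pick the same rows, which is why the slice bounds differ by one.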

Handling missing values

df['C'] = pd.Series([1,2])  # add a column; only rows 0 and 1 get values, the rest become NaN
print(df)

Output:

   A  B    C
0  1  5  1.0
1  2  6  2.0
2  3  7  NaN
3  4  8  NaN

Use dropna() to remove the rows or columns that contain NaN:

print(df.dropna(how='any'))  # 'any': drop a row if it contains at least one NaN

Output:

   A  B    C
0  1  5  1.0
1  2  6  2.0
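dropna() can also work on columns or use a stricter rule, and fillna() replaces missing values instead of dropping them. A minimal sketch of these variants, assuming the same small frame as above; df_na and the result names are illustrative:

```python
import pandas as pd

df_na = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
df_na['C'] = pd.Series([1, 2])   # rows 2 and 3 get NaN in column C

dropped_cols = df_na.dropna(axis=1, how='any')  # drop columns that contain any NaN
dropped_all = df_na.dropna(how='all')           # drop a row only if every value is NaN
filled = df_na.fillna(0)                        # keep all rows, replace NaN with 0

print(dropped_cols)
print(dropped_all)
print(filled)
```

None of these calls modifies df_na; each returns a new DataFrame.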
