10 minutes to pandas


This part covers the following sections:
4. Missing data
5. Operations
6. Merge

4. Missing data

pandas primarily uses the value np.nan to represent missing data. By default, missing values are excluded from computations. See the Missing Data section.

Reindexing allows you to change, add, or delete the index on a specified axis. It returns a copy of the data:

import numpy as np
import pandas as pd
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = s1

df

	A	B	C	D	F
2013-01-01	0.184624	-1.042814	0.444349	-0.259771	NaN
2013-01-02	-0.744011	-0.390294	-0.133267	0.952179	1.0
2013-01-03	1.003910	0.718454	-0.082483	2.182944	2.0
2013-01-04	-2.222158	-0.509435	-0.367156	0.852158	3.0
2013-01-05	-0.420209	2.178601	2.552643	0.733452	4.0
2013-01-06	0.450958	1.065650	0.171798	0.701391	5.0

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1

	A	B	C	D	F	E
2013-01-01	0.184624	-1.042814	0.444349	-0.259771	NaN	NaN
2013-01-02	-0.744011	-0.390294	-0.133267	0.952179	1.0	NaN
2013-01-03	1.003910	0.718454	-0.082483	2.182944	2.0	NaN
2013-01-04	-2.222158	-0.509435	-0.367156	0.852158	3.0	NaN
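As a minimal sketch of the two properties described above (new labels are filled with NaN, and the result is a separate object), here is a small deterministic example. The names `base` and `reindexed` and the labels `r0`, `r1`, `r9` are illustrative assumptions, not part of the original tutorial.

```python
import pandas as pd

# Hypothetical frame with three labelled rows and one column.
base = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=["r0", "r1", "r2"])

# Keep two existing rows, request one unknown row label and one new column.
reindexed = base.reindex(index=["r0", "r1", "r9"], columns=["a", "b"])

# Modifying the reindexed result does not touch the original frame.
reindexed.loc["r0", "a"] = 100.0

print(reindexed)
print(base)
```

Labels that did not exist before (row `r9`, column `b`) come back filled with NaN, which is why the column E above contains only NaN.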

To drop any rows that have missing data (here every row is dropped, because the new column E is entirely NaN):

df1.dropna(how="any")

	A	B	C	D	F	E
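The empty result above follows from `how="any"`. The sketch below, on a small hypothetical frame (`demo` and its column names are assumptions for illustration), compares `how="any"`, `how="all"`, and the `subset` parameter of `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "e" is entirely missing, "y" has one gap.
demo = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0], "y": [np.nan, 5.0, 6.0], "e": [np.nan] * 3}
)

dropped_any = demo.dropna(how="any")             # every row has a NaN in "e"
dropped_all = demo.dropna(how="all")             # no row is entirely NaN
dropped_subset = demo.dropna(subset=["x", "y"])  # only "x" and "y" are checked

print(len(dropped_any), len(dropped_all), len(dropped_subset))  # 0 3 2
```

Restricting the check with `subset` is one way to avoid losing every row when a single column is fully missing.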

Filling missing data:

df1.fillna(value=5)

	A	B	C	D	F	E
2013-01-01	0.184624	-1.042814	0.444349	-0.259771	5.0	5.0
2013-01-02	-0.744011	-0.390294	-0.133267	0.952179	1.0	5.0
2013-01-03	1.003910	0.718454	-0.082483	2.182944	2.0	5.0
2013-01-04	-2.222158	-0.509435	-0.367156	0.852158	3.0	5.0

To get the boolean mask where values are NaN:

pd.isna(df1)

	A	B	C	D	F	E
2013-01-01	False	False	False	False	True	True
2013-01-02	False	False	False	False	False	True
2013-01-03	False	False	False	False	False	True
2013-01-04	False	False	False	False	False	True
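A boolean mask like this is usually combined with other operations. The following sketch, on a hypothetical frame named `frame`, counts missing values per column, selects the rows that contain a gap, and fills each column with its own mean; it is one possible usage, not the tutorial's own code.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

# True counts as 1, so summing the mask gives the number of gaps per column.
missing_per_column = frame.isna().sum()

# Rows that contain at least one missing value.
rows_with_gap = frame[frame.isna().any(axis=1)]

# Passing a Series to fillna fills each column with the matching entry,
# here the column mean computed from the non-missing values.
filled = frame.fillna(frame.mean())

print(missing_per_column)
print(filled)
```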

5. Operations

See the Basic section on Binary Ops.

5.1 Stats

Operations in general exclude missing data.
Performing a descriptive statistic:

df.mean()

A   -0.291148
B    0.336694
C    0.430981
D    0.860392
F    3.000000
dtype: float64

Same operation on the other axis:

df.mean(axis=1)

2013-01-01   -0.168403
2013-01-02    0.136921
2013-01-03    1.164565
2013-01-04    0.150682
2013-01-05    1.808897
2013-01-06    1.477959
Freq: D, dtype: float64
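To make the "missing data is excluded" rule concrete, here is a small sketch on a hypothetical frame (`small` is an illustrative name). It contrasts the default behaviour with `skipna=False`, which propagates NaN instead.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, 5.0, 6.0]})

col_mean = small.mean()                      # NaN skipped: a -> 1.5, b -> 5.0
col_mean_strict = small.mean(skipna=False)   # a -> NaN, one value is missing
row_mean = small.mean(axis=1)                # last row averages only 6.0

print(col_mean)
print(col_mean_strict)
print(row_mean)
```

This is also why the mean of column F above is 3.0: it is computed over the five non-missing values only.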

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension:

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
s

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

df.sub(s, axis="index")

	A	B	C	D	F
2013-01-01	NaN	NaN	NaN	NaN	NaN
2013-01-02	NaN	NaN	NaN	NaN	NaN
2013-01-03	0.003910	-0.281546	-1.082483	1.182944	1.0
2013-01-04	-5.222158	-3.509435	-3.367156	-2.147842	0.0
2013-01-05	-5.420209	-2.821399	-2.447357	-4.266548	-1.0
2013-01-06	NaN	NaN	NaN	NaN	NaN
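The alignment and broadcasting behaviour is easier to follow on round numbers. This sketch uses hypothetical names (`frame`, `offsets`, `column_offsets`) and shows both directions: `axis="index"` matches the Series against row labels, while the default matches against column labels.

```python
import pandas as pd

frame = pd.DataFrame(
    {"a": [10.0, 20.0, 30.0], "b": [1.0, 2.0, 3.0]}, index=["x", "y", "z"]
)

# Series indexed by row labels; there is no entry for "z".
offsets = pd.Series([1.0, 2.0], index=["x", "y"])
by_row = frame.sub(offsets, axis="index")   # broadcast across columns

# Series indexed by column labels; the default axis aligns on columns.
column_offsets = pd.Series({"a": 10.0, "b": 1.0})
by_column = frame.sub(column_offsets)       # broadcast down the rows

print(by_row)
print(by_column)
```

Row `z` becomes NaN in `by_row` because the Series has no value for that label, which mirrors the NaN rows in the output above.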

5.2 Apply

Applying functions to the data:

df.apply(np.cumsum)

	A	B	C	D	F
2013-01-01	0.184624	-1.042814	0.444349	-0.259771	NaN
2013-01-02	-0.559387	-1.433107	0.311082	0.692408	1.0
2013-01-03	0.444523	-0.714653	0.228599	2.875352	3.0
2013-01-04	-1.777635	-1.224088	-0.138557	3.727510	6.0
2013-01-05	-2.197844	0.954513	2.414086	4.460962	10.0
2013-01-06	-1.746887	2.020164	2.585884	5.162353	15.0

df.apply(lambda x: x.max() - x.min())

A    3.226068
B    3.221415
C    2.919799
D    2.442716
F    4.000000
dtype: float64

df.apply(lambda x: x.max() - x.min(), axis=1)

2013-01-01    1.487163
2013-01-02    1.744011
2013-01-03    2.265428
2013-01-04    5.222158
2013-01-05    4.420209
2013-01-06    4.828202
Freq: D, dtype: float64
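The same three calls can be checked by hand on a small frame. In this sketch the frame `frame` and the helper `value_range` are hypothetical names introduced for illustration; the helper does the same thing as the lambda above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, 4.0, 2.0], "b": [10.0, 30.0, 20.0]})

def value_range(values: pd.Series) -> float:
    """Spread between the largest and smallest value."""
    return values.max() - values.min()

per_column = frame.apply(value_range)          # called once per column
per_row = frame.apply(value_range, axis=1)     # called once per row
running = frame.apply(np.cumsum)               # cumulative sum down each column

print(per_column)
print(per_row)
print(running)
```

With the default `axis=0` the function receives each column as a Series; with `axis=1` it receives each row.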

5.3 Histogramming

See more at Histogramming and Discretization.

s = pd.Series(np.random.randint(0, 7, size=10))
s

0    5
1    2
2    6
3    6
4    4
5    1
6    2
7    3
8    1
9    2
dtype: int64

s.value_counts()

2    3
6    2
1    2
5    1
4    1
3    1
dtype: int64
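Because the series above is random, the counts change on every run. The sketch below hard-codes the ten values printed above so the result is reproducible, and adds two related calls: relative frequencies via `normalize=True` and discretization into labelled bins via `pd.cut`. The bin edges and labels are arbitrary choices made for illustration.

```python
import pandas as pd

# The ten values shown in the output above, fixed for reproducibility.
rolls = pd.Series([5, 2, 6, 6, 4, 1, 2, 3, 1, 2])

counts = rolls.value_counts()                 # absolute frequencies
shares = rolls.value_counts(normalize=True)   # relative frequencies

# Discretize into three right-closed bins: (0, 2], (2, 4], (4, 6].
bins = pd.cut(rolls, bins=[0, 2, 4, 6], labels=["low", "mid", "high"])
bin_counts = bins.value_counts(sort=False)

print(counts)
print(bin_counts)
```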

5.4 String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

type(s)

pandas.core.series.Series
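The remark about regular expressions matters in practice. As a hedged sketch on a hypothetical series `words`, the same pattern `"."` matches almost everything when treated as a regex and only a literal dot when `regex=False`; `na=False` is used so that missing entries yield False.

```python
import numpy as np
import pandas as pd

words = pd.Series(["A", "Aaba", np.nan, "CABA", "a.b"])

# Default: the pattern is a regular expression, so "." matches any character.
regex_dot = words.str.contains(".", na=False)

# regex=False treats the pattern as a plain substring.
literal_dot = words.str.contains(".", regex=False, na=False)

# Anchored character class: entries starting with "A" or "a".
starts_with_a = words.str.contains(r"^[Aa]", na=False)

print(regex_dot.tolist())
print(literal_dot.tolist())
print(starts_with_a.tolist())
```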

6. Merge

6.1 Concat

pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational algebra functionality for join / merge-type operations.
See the Merging section.
Concatenating pandas objects together with concat():

df = pd.DataFrame(np.random.randn(10, 4))
df

	0	1	2	3
0	0.488970	1.237504	-1.640805	-0.672117
1	0.390873	0.906830	0.260662	0.119989
2	-0.854710	-0.535410	1.641878	0.321487
3	-0.134780	0.555554	1.024371	-0.103164
4	-1.241929	-0.116488	-0.922242	-2.066726
5	-0.432397	2.018692	-0.536801	0.074576
6	1.452204	-0.587196	0.918798	1.192130
7	0.819954	0.224358	-0.022698	-0.745293
8	0.266344	-0.321944	1.251543	0.603333
9	-0.491671	0.278449	0.194751	1.056218

pieces = [df[:3], df[3:7], df[7:]]
pieces

[          0         1         2         3
 0  0.488970  1.237504 -1.640805 -0.672117
 1  0.390873  0.906830  0.260662  0.119989
 2 -0.854710 -0.535410  1.641878  0.321487,
           0         1         2         3
 3 -0.134780  0.555554  1.024371 -0.103164
 4 -1.241929 -0.116488 -0.922242 -2.066726
 5 -0.432397  2.018692 -0.536801  0.074576
 6  1.452204 -0.587196  0.918798  1.192130,
           0         1         2         3
 7  0.819954  0.224358 -0.022698 -0.745293
 8  0.266344 -0.321944  1.251543  0.603333
 9 -0.491671  0.278449  0.194751  1.056218]

pieces[0]

	0	1	2	3
0	0.488970	1.237504	-1.640805	-0.672117
1	0.390873	0.906830	0.260662	0.119989
2	-0.854710	-0.535410	1.641878	0.321487

pd.concat(pieces)

	0	1	2	3
0	0.488970	1.237504	-1.640805	-0.672117
1	0.390873	0.906830	0.260662	0.119989
2	-0.854710	-0.535410	1.641878	0.321487
3	-0.134780	0.555554	1.024371	-0.103164
4	-1.241929	-0.116488	-0.922242	-2.066726
5	-0.432397	2.018692	-0.536801	0.074576
6	1.452204	-0.587196	0.918798	1.192130
7	0.819954	0.224358	-0.022698	-0.745293
8	0.266344	-0.321944	1.251543	0.603333
9	-0.491671	0.278449	0.194751	1.056218
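In the example above the pieces are slices of one frame, so their index labels never collide. When the inputs are independent frames, the index handling becomes visible. This sketch uses hypothetical frames `top` and `bottom` to show the default behaviour alongside the `ignore_index` and `keys` parameters of `pd.concat`.

```python
import pandas as pd

top = pd.DataFrame({"v": [1, 2]})
bottom = pd.DataFrame({"v": [3, 4]})

stacked = pd.concat([top, bottom])                           # labels 0, 1, 0, 1
renumbered = pd.concat([top, bottom], ignore_index=True)     # labels 0..3
labelled = pd.concat([top, bottom], keys=["top", "bottom"])  # outer index level

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(labelled)
```

`keys` adds an outer index level, so each piece can still be recovered afterwards, for example with `labelled.loc["bottom"]`.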

Note:
Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy, and may be expensive. We recommend passing a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it.
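A minimal sketch of the recommended pattern, assuming each record is a plain dict (the field names `id` and `square` are made up for illustration): collect the records in a Python list and construct the DataFrame once at the end.

```python
import pandas as pd

records = []
for i in range(3):
    # Appending to a list is cheap; no DataFrame is copied inside the loop.
    records.append({"id": i, "square": i * i})

# Build the frame in a single step from the pre-built list of records.
frame = pd.DataFrame(records)

print(frame)
```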

6.2 Join

SQL style merges. See the Database style joining section.

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})

left

	key	lval
0	foo	1
1	foo	2

right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
right

	key	rval
0	foo	4
1	foo	5

pd.merge(left, right, on="key")

	key	lval	rval
0	foo	1	4
1	foo	1	5
2	foo	2	4
3	foo	2	5

pd.merge(left, right)

	key	lval	rval
0	foo	1	4
1	foo	1	5
2	foo	2	4
3	foo	2	5
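Both calls return four rows because the key `foo` appears twice on each side, so every left row is paired with every matching right row (2 x 2). When `on` is omitted, `merge` joins on the columns the two frames share, here `key`. If duplicate keys are unexpected, the `validate` parameter can turn them into an error, as in this sketch; the flag name `duplicate_keys_detected` is an illustrative assumption.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

merged = pd.merge(left, right, on="key")   # 2 x 2 = 4 row combinations

# validate="one_to_one" checks that keys are unique on both sides
# and raises MergeError when they are not.
try:
    pd.merge(left, right, on="key", validate="one_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

print(len(merged), duplicate_keys_detected)  # 4 True
```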

Another example, this time with unique keys:

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
pd.merge(left, right, on="key")

	key	lval	rval
0	foo	1	4
1	bar	2	5
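The examples above use the default inner join, where every key exists on both sides. As a hedged sketch of what happens otherwise, the frames below add one unmatched key per side (`baz` and `qux` are made-up values) and compare the default with `how="outer"`; `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar", "baz"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["foo", "bar", "qux"], "rval": [4, 5, 6]})

inner = pd.merge(left, right, on="key")   # default how="inner": foo and bar only
outer = pd.merge(left, right, on="key", how="outer", indicator=True)

print(inner)
print(outer)
```

In the outer result, the unmatched keys are kept and the missing side is filled with NaN.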
