Splitting nested array values from a Pandas DataFrame cell into multiple rows

You can run .apply(pd.Series) on each column, then stack and concatenate the results.

For a series

    import pandas as pd

    s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])
    s
    Out[103]:
    2011       [0, 1]
    2012    [2, 3, 4]
    dtype: object

it works as follows

    s.apply(pd.Series).stack()
    Out[104]:
    2011  0    0.0
          1    1.0
    2012  0    2.0
          1    3.0
          2    4.0
    dtype: float64

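The elements of the series have different lengths (this matters because 2012 was a leap year). The intermediate result, i.e. before stack, contained a NaN value that was later dropped.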
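To make the intermediate step visible, here is a minimal sketch that splits the one-liner into its two stages. The names `wide` and `long` are only illustrative. Whether stack discards the padding NaN by itself may depend on the pandas version, so the sketch calls dropna() explicitly rather than relying on that.

```python
import pandas as pd

s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])

# Stage 1: one column per list position; the shorter list is padded with NaN
wide = s.apply(pd.Series)
n_missing = int(wide.isna().sum().sum())

# Stage 2: move the columns into an inner index level and
# drop the padding explicitly (hypothetical variable names)
long = wide.stack().dropna()

print(wide)
print(long)
```

The padding is also why the result is float64 rather than an integer dtype: NaN forces the intermediate frame to floats.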

Now, let's take a frame:

    a = list(range(14))
    b = list(range(20, 34))

    # note the spelling: pd.DataFrame, not pd.Dataframe
    df = pd.DataFrame({'ID': [11111, 11111, 11112, 11112],
                       'Year': [2011, 2012, 2011, 2012],
                       'A': [a[:3], a[3:7], a[7:10], a[10:14]],
                       'B': [b[:3], b[3:7], b[7:10], b[10:14]]})
    df
    Out[108]:
                      A                 B     ID  Year
    0         [0, 1, 2]      [20, 21, 22]  11111  2011
    1      [3, 4, 5, 6]  [23, 24, 25, 26]  11111  2012
    2         [7, 8, 9]      [27, 28, 29]  11112  2011
    3  [10, 11, 12, 13]  [30, 31, 32, 33]  11112  2012

Then we can run:

    # set an index (each column will inherit it)
    df2 = df.set_index(['ID', 'Year'])

    # the trick
    unnested_lst = []
    for col in df2.columns:
        unnested_lst.append(df2[col].apply(pd.Series).stack())
    result = pd.concat(unnested_lst, axis=1, keys=df2.columns)

and get:

    result
    Out[115]:
                     A     B
    ID    Year
    11111 2011 0   0.0  20.0
               1   1.0  21.0
               2   2.0  22.0
          2012 0   3.0  23.0
               1   4.0  24.0
               2   5.0  25.0
               3   6.0  26.0
    11112 2011 0   7.0  27.0
               1   8.0  28.0
               2   9.0  29.0
          2012 0  10.0  30.0
               1  11.0  31.0
               2  12.0  32.0
               3  13.0  33.0
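As an alternative to the approach above, and not part of it, newer pandas versions offer DataFrame.explode. The sketch below assumes pandas 1.3 or later, where explode accepts a list of columns, and assumes the lists in A and B have the same length within each row. The names `flat` and `pos` are hypothetical.

```python
import pandas as pd

a = list(range(14))
b = list(range(20, 34))
df = pd.DataFrame({'ID': [11111, 11111, 11112, 11112],
                   'Year': [2011, 2012, 2011, 2012],
                   'A': [a[:3], a[3:7], a[7:10], a[10:14]],
                   'B': [b[:3], b[3:7], b[7:10], b[10:14]]})

# Explode both list columns in lockstep; the result has object dtype,
# so cast back to integers afterwards
flat = df.set_index(['ID', 'Year']).explode(['A', 'B']).astype(int)

# explode repeats the (ID, Year) index, so rebuild the position of each
# element within its original list as a third index level
flat['pos'] = flat.groupby(level=['ID', 'Year']).cumcount().to_numpy()
flat = flat.set_index('pos', append=True)

print(flat)
```

Unlike the stack-based version, no NaN padding is created here, so the values can stay integers.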

The rest (the datetime index) is less straightforward. For example:

    # DatetimeIndex
    years = pd.to_datetime(result.index.get_level_values(1).astype(str))

    # TimedeltaIndex
    days = pd.to_timedelta(result.index.get_level_values(2), unit='D')
    # If the above line doesn't work (a bug in pandas), try this:
    # days = result.index.get_level_values(2).astype('timedelta64[D]')

    # the sum is again a DatetimeIndex
    dates = years + days
    dates.name = 'Date'

    new_index = pd.MultiIndex.from_arrays([result.index.get_level_values(0), dates])
    result.index = new_index
    result
    Out[130]:
                         A     B
    ID    Date
    11111 2011-01-01   0.0  20.0
          2011-01-02   1.0  21.0
          2011-01-03   2.0  22.0
          2012-01-01   3.0  23.0
          2012-01-02   4.0  24.0
          2012-01-03   5.0  25.0
          2012-01-04   6.0  26.0
    11112 2011-01-01   7.0  27.0
          2011-01-02   8.0  28.0
          2011-01-03   9.0  29.0
          2012-01-01  10.0  30.0
          2012-01-02  11.0  31.0
          2012-01-03  12.0  32.0
          2012-01-04  13.0  33.0
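The date arithmetic can be tried in isolation. This is a minimal sketch with made-up inputs: a year and a zero-based day offset are combined into a calendar date, and the same offset lands on different dates in 2011 and 2012 because of the leap day. The explicit format='%Y' is an assumption added here to make the year parsing unambiguous.

```python
import pandas as pd

# Each year string is parsed as January 1 of that year
years = pd.to_datetime(pd.Index(['2011', '2012']), format='%Y')

# Zero-based day offsets within the year (illustrative values)
days = pd.to_timedelta([59, 59], unit='D')

# DatetimeIndex + TimedeltaIndex is again a DatetimeIndex
dates = years + days
print(dates)
```

With these inputs, offset 59 gives March 1 in 2011 but February 29 in 2012.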

