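Well, I can think of a few ways:

1. Essentially blow up the dataframe by merging on only the exact field (`company`), then filter on the 30-day windows after the merge. This should be fast, but it could use a lot of memory.
2. Move the merging and filtering on the 30-day window into a `groupby()`. This results in a merge for each group, so it would be slower, but it should use less memory.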
Option #1

Suppose your data looks like the following (I expanded your sample data):

    print(df)

        company       date  measure
    0         0 2010-01-01       10
    1         0 2010-01-15       10
    2         0 2010-02-01       10
    3         0 2010-02-15       10
    4         0 2010-03-01       10
    5         0 2010-03-15       10
    6         0 2010-04-01       10
    7         1 2010-03-01        5
    8         1 2010-03-15        5
    9         1 2010-04-01        5
    10        1 2010-04-15        5
    11        1 2010-05-01        5
    12        1 2010-05-15        5

    print(windows)

       company   end_date
    0        0 2010-02-01
    1        0 2010-03-15
    2        1 2010-04-01
    3        1 2010-05-15
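To follow along, the two frames printed above can be rebuilt by hand. This is only a sketch that reproduces the values shown; how the original data was actually loaded is not stated, and the explicit `pd.to_datetime` conversion is an assumption about the column types.

```python
import pandas as pd

# Measurements: company 0 has seven observations, company 1 has six.
df = pd.DataFrame({
    "company": [0] * 7 + [1] * 6,
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-15", "2010-02-01", "2010-02-15",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-04-15", "2010-05-01", "2010-05-15",
    ]),
    "measure": [10] * 7 + [5] * 6,
})

# Window end dates: two per company.
windows = pd.DataFrame({
    "company": [0, 0, 1, 1],
    "end_date": pd.to_datetime(
        ["2010-02-01", "2010-03-15", "2010-04-01", "2010-05-15"]
    ),
})
```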
Create a beginning date for the 30-day windows:

    import numpy as np

    windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                           np.timedelta64(30, 'D'))
    print(windows)

       company   end_date   beg_date
    0        0 2010-02-01 2010-01-02
    1        0 2010-03-15 2010-02-13
    2        1 2010-04-01 2010-03-02
    3        1 2010-05-15 2010-04-15

Now do a merge and then select rows based on whether `date` falls between `beg_date` and `end_date`:
    df = df.merge(windows, on='company', how='left')
    df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
    print(df)

        company       date  measure   end_date   beg_date
    2         0 2010-01-15       10 2010-02-01 2010-01-02
    4         0 2010-02-01       10 2010-02-01 2010-01-02
    7         0 2010-02-15       10 2010-03-15 2010-02-13
    9         0 2010-03-01       10 2010-03-15 2010-02-13
    11        0 2010-03-15       10 2010-03-15 2010-02-13
    16        1 2010-03-15        5 2010-04-01 2010-03-02
    18        1 2010-04-01        5 2010-04-01 2010-03-02
    21        1 2010-04-15        5 2010-05-15 2010-04-15
    23        1 2010-05-01        5 2010-05-15 2010-04-15
    25        1 2010-05-15        5 2010-05-15 2010-04-15
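The memory cost of this option comes from the intermediate frame: before filtering, the merge pairs every data row with every window of the same company. A rough sketch for estimating that row count up front follows; `merged_row_count` is a hypothetical helper, not part of pandas, and only the key column matters for the count, so the frames here carry nothing else.

```python
import pandas as pd

def merged_row_count(df, windows, key="company"):
    """Estimate rows produced by df.merge(windows, on=key, how='left')."""
    rows = df.groupby(key).size()
    # Number of windows per key; keys with no window count as 0 here.
    wins = windows.groupby(key).size().reindex(rows.index, fill_value=0)
    # In a left merge, a row with no matching window still yields one row.
    return int((rows * wins.clip(lower=1)).sum())

# Same shape as the sample data: 7 and 6 rows, two windows per company.
df = pd.DataFrame({"company": [0] * 7 + [1] * 6})
windows = pd.DataFrame({"company": [0, 0, 1, 1]})

print(merged_row_count(df, windows))  # 7*2 + 6*2 = 26
```

For the sample data this gives 26, which matches the merged index running from 0 to 25 before the date filter is applied.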
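You can compute the 30-day window sums by grouping on `company` and `end_date`:

    # Select 'measure' explicitly so the datetime columns are not summed
    print(df.groupby(['company', 'end_date'])[['measure']].sum())

                        measure
    company end_date
    0       2010-02-01       20
            2010-03-15       30
    1       2010-04-01       10
            2010-05-15       15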
Option #2

Move all of the merging into a groupby. This should be better on memory, but I would think it is much slower:
    windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                           np.timedelta64(30, 'D'))

    def cond_merge(g, windows):
        # Relies on each group `g` still carrying its 'company' column
        g = g.merge(windows, on='company', how='left')
        g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
        return g.groupby('end_date')['measure'].sum()

    print(df.groupby('company').apply(cond_merge, windows))

    company  end_date
    0        2010-02-01    20
             2010-03-15    30
    1        2010-04-01    10
             2010-05-15    15

Another option

Now, if your windows never overlap (as in the sample data), you could do something like the following as an alternative that doesn't blow up the dataframe but is still pretty fast:
    windows['date'] = windows['end_date']

    df = df.merge(windows, on=['company', 'date'], how='outer')
    print(df)

        company       date  measure   end_date
    0         0 2010-01-01       10        NaT
    1         0 2010-01-15       10        NaT
    2         0 2010-02-01       10 2010-02-01
    3         0 2010-02-15       10        NaT
    4         0 2010-03-01       10        NaT
    5         0 2010-03-15       10 2010-03-15
    6         0 2010-04-01       10        NaT
    7         1 2010-03-01        5        NaT
    8         1 2010-03-15        5        NaT
    9         1 2010-04-01        5 2010-04-01
    10        1 2010-04-15        5        NaT
    11        1 2010-05-01        5        NaT
    12        1 2010-05-15        5 2010-05-15
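Since this approach only works when no two windows of the same company overlap, it may be worth checking that before relying on it. A minimal sketch, assuming each window spans `end_date - 30 days` through `end_date` inclusive; `windows_overlap` is a hypothetical helper name:

```python
import pandas as pd

def windows_overlap(windows, days=30):
    """Return True if any two windows of the same company share a date."""
    w = windows.sort_values(["company", "end_date"])
    # Gap between consecutive end dates within each company (NaT for the first).
    gap = w.groupby("company")["end_date"].diff()
    # A gap of `days` or less means the later window starts on or before
    # the earlier window's end date.
    return bool((gap <= pd.Timedelta(days=days)).any())

windows = pd.DataFrame({
    "company": [0, 0, 1, 1],
    "end_date": pd.to_datetime(
        ["2010-02-01", "2010-03-15", "2010-04-01", "2010-05-15"]
    ),
})

print(windows_overlap(windows))  # False: the gaps are 42 and 44 days
```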
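The outer merge essentially inserts your window end dates into the dataframe. Backfilling the end dates (by group) then gives you a structure from which the summation windows are easy to create: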
    # Backfill within each company so every row carries its next window end date
    df['end_date'] = df.groupby('company')['end_date'].bfill()
    print(df)

        company       date  measure   end_date
    0         0 2010-01-01       10 2010-02-01
    1         0 2010-01-15       10 2010-02-01
    2         0 2010-02-01       10 2010-02-01
    3         0 2010-02-15       10 2010-03-15
    4         0 2010-03-01       10 2010-03-15
    5         0 2010-03-15       10 2010-03-15
    6         0 2010-04-01       10        NaT
    7         1 2010-03-01        5 2010-04-01
    8         1 2010-03-15        5 2010-04-01
    9         1 2010-04-01        5 2010-04-01
    10        1 2010-04-15        5 2010-05-15
    11        1 2010-05-01        5 2010-05-15
    12        1 2010-05-15        5 2010-05-15

    # Drop rows that fall after the last window of their company
    df = df[df.end_date.notnull()].copy()
    df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -
                      np.timedelta64(30, 'D'))
    print(df)

        company       date  measure   end_date   beg_date
    0         0 2010-01-01       10 2010-02-01 2010-01-02
    1         0 2010-01-15       10 2010-02-01 2010-01-02
    2         0 2010-02-01       10 2010-02-01 2010-01-02
    3         0 2010-02-15       10 2010-03-15 2010-02-13
    4         0 2010-03-01       10 2010-03-15 2010-02-13
    5         0 2010-03-15       10 2010-03-15 2010-02-13
    7         1 2010-03-01        5 2010-04-01 2010-03-02
    8         1 2010-03-15        5 2010-04-01 2010-03-02
    9         1 2010-04-01        5 2010-04-01 2010-03-02
    10        1 2010-04-15        5 2010-05-15 2010-04-15
    11        1 2010-05-01        5 2010-05-15 2010-04-15
    12        1 2010-05-15        5 2010-05-15 2010-04-15

    df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
    print(df.groupby(['company', 'end_date'])[['measure']].sum())

                        measure
    company end_date
    0       2010-02-01       20
            2010-03-15       30
    1       2010-04-01       10
            2010-05-15       15

Another alternative is to resample your first dataframe to daily data and then compute rolling sums over a 30-day window, selecting at the end only the dates you are interested in. This could be quite memory intensive too.
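The resampling idea is not worked out above, so here is one way it might look. This is a sketch under assumptions rather than a tested recipe: `rolling_window_sums` is a hypothetical helper, days with no observation are treated as contributing 0, and a 31-day offset window is used so that the span matches the inclusive `beg_date` to `end_date` filter from the other options.

```python
import pandas as pd

df = pd.DataFrame({
    "company": [0] * 7 + [1] * 6,
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-15", "2010-02-01", "2010-02-15",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-04-15", "2010-05-01", "2010-05-15",
    ]),
    "measure": [10] * 7 + [5] * 6,
})
windows = pd.DataFrame({
    "company": [0, 0, 1, 1],
    "end_date": pd.to_datetime(
        ["2010-02-01", "2010-03-15", "2010-04-01", "2010-05-15"]
    ),
})

def rolling_window_sums(df, windows, days=30):
    results = []
    for company, g in df.groupby("company"):
        # One value per calendar day; empty days sum to 0.
        daily = g.set_index("date")["measure"].resample("D").sum()
        # An offset window of days + 1 covers [t - days, t] inclusive.
        rolled = daily.rolling(f"{days + 1}D").sum()
        ends = windows.loc[windows["company"] == company, "end_date"]
        # Keep only the requested end dates (NaN if outside the data range).
        sums = rolled.reindex(ends)
        results.append(pd.DataFrame({
            "company": company,
            "end_date": ends.to_numpy(),
            "measure": sums.to_numpy(),
        }))
    return pd.concat(results, ignore_index=True)

result = rolling_window_sums(df, windows)
print(result)
```

On the sample data this should reproduce the sums 20, 30, 10 and 15 (as floats). The daily series holds one value per calendar day per company, which is where the extra memory goes.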



