How to perform a conditional join in Python pandas

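Well, I can think of a few ways:

  1. Essentially blow up the dataframe by merging on just the exact field (company), then filter on the 30-day windows after the merge. This should be fast, but it could use a lot of memory.

  2. Move the merging and the filtering on the 30-day window into a groupby(). This results in a merge for each group, so it is slower, but it should use less memory.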

Option #1

Suppose your data look like the following (I expanded your sample data):

    print(df)
        company       date  measure
    0         0 2010-01-01       10
    1         0 2010-01-15       10
    2         0 2010-02-01       10
    3         0 2010-02-15       10
    4         0 2010-03-01       10
    5         0 2010-03-15       10
    6         0 2010-04-01       10
    7         1 2010-03-01        5
    8         1 2010-03-15        5
    9         1 2010-04-01        5
    10        1 2010-04-15        5
    11        1 2010-05-01        5
    12        1 2010-05-15        5

    print(windows)
       company   end_date
    0        0 2010-02-01
    1        0 2010-03-15
    2        1 2010-04-01
    3        1 2010-05-15
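The answer only shows the printed frames, not how they were built. A minimal sketch that reconstructs them from the printed output follows; the construction itself is an assumption and not the author's code, only the values are taken from the output above.

```python
import pandas as pd

# Reconstructed from the printed output above (the construction is an assumption).
df = pd.DataFrame({
    "company": [0] * 7 + [1] * 6,
    "date": pd.to_datetime(
        ["2010-01-01", "2010-01-15", "2010-02-01", "2010-02-15",
         "2010-03-01", "2010-03-15", "2010-04-01",
         "2010-03-01", "2010-03-15", "2010-04-01",
         "2010-04-15", "2010-05-01", "2010-05-15"]),
    "measure": [10] * 7 + [5] * 6,
})

# One row per (company, window end date).
windows = pd.DataFrame({
    "company": [0, 0, 1, 1],
    "end_date": pd.to_datetime(
        ["2010-02-01", "2010-03-15", "2010-04-01", "2010-05-15"]),
})

print(df)
print(windows)
```

Parsing the dates with pd.to_datetime up front keeps the later date comparisons on real datetime columns instead of strings.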

Create a beginning date for the 30-day windows:

    import numpy as np

    # Window start = window end minus 30 days.
    windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                           np.timedelta64(30, 'D'))
    print(windows)
       company   end_date   beg_date
    0        0 2010-02-01 2010-01-02
    1        0 2010-03-15 2010-02-13
    2        1 2010-04-01 2010-03-02
    3        1 2010-05-15 2010-04-15

Now do a merge, and then select rows based on whether date falls between beg_date and end_date:

    # Every row is paired with every window of the same company, then filtered.
    df = df.merge(windows, on='company', how='left')
    df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
    print(df)
        company       date  measure   end_date   beg_date
    2         0 2010-01-15       10 2010-02-01 2010-01-02
    4         0 2010-02-01       10 2010-02-01 2010-01-02
    7         0 2010-02-15       10 2010-03-15 2010-02-13
    9         0 2010-03-01       10 2010-03-15 2010-02-13
    11        0 2010-03-15       10 2010-03-15 2010-02-13
    16        1 2010-03-15        5 2010-04-01 2010-03-02
    18        1 2010-04-01        5 2010-04-01 2010-03-02
    21        1 2010-04-15        5 2010-05-15 2010-04-15
    23        1 2010-05-01        5 2010-05-15 2010-04-15
    25        1 2010-05-15        5 2010-05-15 2010-04-15

You can compute the 30-day window sums by grouping on company and end_date:

    # Select measure explicitly so the datetime columns are not summed.
    print(df.groupby(['company', 'end_date'])[['measure']].sum())
                        measure
    company end_date
    0       2010-02-01       20
            2010-03-15       30
    1       2010-04-01       10
            2010-05-15       15
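Putting the steps of Option #1 together, here is a minimal end-to-end sketch. The helper name window_sums_by_merge and its days parameter are hypothetical, the sample frames are reconstructed from the printed output, and the window start is computed with pd.Timedelta rather than the NumPy expression used above.

```python
import pandas as pd

# Sample frames reconstructed from the printed output (an assumption).
df = pd.DataFrame({
    "company": [0] * 7 + [1] * 6,
    "date": pd.to_datetime(
        ["2010-01-01", "2010-01-15", "2010-02-01", "2010-02-15",
         "2010-03-01", "2010-03-15", "2010-04-01",
         "2010-03-01", "2010-03-15", "2010-04-01",
         "2010-04-15", "2010-05-01", "2010-05-15"]),
    "measure": [10] * 7 + [5] * 6,
})
windows = pd.DataFrame({
    "company": [0, 0, 1, 1],
    "end_date": pd.to_datetime(
        ["2010-02-01", "2010-03-15", "2010-04-01", "2010-05-15"]),
})


def window_sums_by_merge(df, windows, days=30):
    """Hypothetical helper: merge on company, keep in-window rows, sum."""
    w = windows.copy()
    w["beg_date"] = w["end_date"] - pd.Timedelta(days=days)
    # Pairs every row with every window of the same company (memory heavy).
    merged = df.merge(w, on="company", how="left")
    in_window = (merged["date"] >= merged["beg_date"]) & (
        merged["date"] <= merged["end_date"])
    return merged[in_window].groupby(["company", "end_date"])["measure"].sum()


result = window_sums_by_merge(df, windows)
print(result)
```

The intermediate frame has one row per (data row, window) pair within a company, which is where the memory cost mentioned at the top comes from.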

Option #2

Move all of the merging into a groupby. This should be easier on memory, but I would expect it to be much slower:

    windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                           np.timedelta64(30, 'D'))

    def cond_merge(g, windows):
        # Runs once per company; relies on company still being a column of g.
        g = g.merge(windows, on='company', how='left')
        g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
        return g.groupby('end_date')['measure'].sum()

    # df here is the original frame, not the merged one from Option #1.
    print(df.groupby('company').apply(cond_merge, windows))
    company  end_date
    0        2010-02-01    20
             2010-03-15    30
    1        2010-04-01    10
             2010-05-15    15
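The same per-group idea can also be written as an explicit loop over companies, which does not depend on how apply passes the grouping column to the function. This is a sketch under the same assumptions as before: the helper name window_sums_by_group is hypothetical and the sample frames are reconstructed from the printed output.

```python
import pandas as pd

# Sample frames reconstructed from the printed output (an assumption).
df = pd.DataFrame({
    "company": [0] * 7 + [1] * 6,
    "date": pd.to_datetime(
        ["2010-01-01", "2010-01-15", "2010-02-01", "2010-02-15",
         "2010-03-01", "2010-03-15", "2010-04-01",
         "2010-03-01", "2010-03-15", "2010-04-01",
         "2010-04-15", "2010-05-01", "2010-05-15"]),
    "measure": [10] * 7 + [5] * 6,
})
windows = pd.DataFrame({
    "company": [0, 0, 1, 1],
    "end_date": pd.to_datetime(
        ["2010-02-01", "2010-03-15", "2010-04-01", "2010-05-15"]),
})


def window_sums_by_group(df, windows, days=30):
    """Hypothetical helper: one small merge per company instead of one big one."""
    pieces = []
    for company, g in df.groupby("company"):
        # Only this company's windows take part in the merge.
        w = windows[windows["company"] == company].copy()
        w["beg_date"] = w["end_date"] - pd.Timedelta(days=days)
        m = g.merge(w, on="company", how="inner")
        m = m[(m["date"] >= m["beg_date"]) & (m["date"] <= m["end_date"])]
        pieces.append(m.groupby(["company", "end_date"])["measure"].sum())
    return pd.concat(pieces)


result = window_sums_by_group(df, windows)
print(result)
```

Only one company's merged rows exist at a time, so peak memory is bounded by the largest company rather than by the whole frame.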

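Another option

If your windows never overlap (as in the example data), you can do something like the following, which avoids blowing up the dataframe and is still quite fast: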

    # Starts again from the original df and windows frames.
    windows['date'] = windows['end_date']
    df = df.merge(windows, on=['company', 'date'], how='outer')
    print(df)
        company       date  measure   end_date
    0         0 2010-01-01       10        NaT
    1         0 2010-01-15       10        NaT
    2         0 2010-02-01       10 2010-02-01
    3         0 2010-02-15       10        NaT
    4         0 2010-03-01       10        NaT
    5         0 2010-03-15       10 2010-03-15
    6         0 2010-04-01       10        NaT
    7         1 2010-03-01        5        NaT
    8         1 2010-03-15        5        NaT
    9         1 2010-04-01        5 2010-04-01
    10        1 2010-04-15        5        NaT
    11        1 2010-05-01        5        NaT
    12        1 2010-05-15        5 2010-05-15

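This merge essentially inserts the window end dates into the dataframe. Backfilling the end dates (by group) then gives you a structure from which the summation windows are easy to build: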

    # Backfill each company's window end dates onto the rows that precede them.
    df['end_date'] = df.groupby('company')['end_date'].bfill()
    print(df)
        company       date  measure   end_date
    0         0 2010-01-01       10 2010-02-01
    1         0 2010-01-15       10 2010-02-01
    2         0 2010-02-01       10 2010-02-01
    3         0 2010-02-15       10 2010-03-15
    4         0 2010-03-01       10 2010-03-15
    5         0 2010-03-15       10 2010-03-15
    6         0 2010-04-01       10        NaT
    7         1 2010-03-01        5 2010-04-01
    8         1 2010-03-15        5 2010-04-01
    9         1 2010-04-01        5 2010-04-01
    10        1 2010-04-15        5 2010-05-15
    11        1 2010-05-01        5 2010-05-15
    12        1 2010-05-15        5 2010-05-15

    # Drop rows that fall after the last window, then add the window start.
    df = df[df.end_date.notnull()].copy()
    df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -
                      np.timedelta64(30, 'D'))
    print(df)
        company       date  measure   end_date   beg_date
    0         0 2010-01-01       10 2010-02-01 2010-01-02
    1         0 2010-01-15       10 2010-02-01 2010-01-02
    2         0 2010-02-01       10 2010-02-01 2010-01-02
    3         0 2010-02-15       10 2010-03-15 2010-02-13
    4         0 2010-03-01       10 2010-03-15 2010-02-13
    5         0 2010-03-15       10 2010-03-15 2010-02-13
    7         1 2010-03-01        5 2010-04-01 2010-03-02
    8         1 2010-03-15        5 2010-04-01 2010-03-02
    9         1 2010-04-01        5 2010-04-01 2010-03-02
    10        1 2010-04-15        5 2010-05-15 2010-04-15
    11        1 2010-05-01        5 2010-05-15 2010-04-15
    12        1 2010-05-15        5 2010-05-15 2010-04-15

    df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
    print(df.groupby(['company', 'end_date'])[['measure']].sum())
                        measure
    company end_date
    0       2010-02-01       20
            2010-03-15       30
    1       2010-04-01       10
            2010-05-15       15
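For non-overlapping windows, pandas also offers pd.merge_asof, which can attach the next window end date to each row without the outer merge and backfill. This is a different technique from the one in the answer, sketched here under the assumption that the frames look like the printed output; the 30-day lower bound is applied as an explicit filter afterwards.

```python
import pandas as pd

# Sample frames reconstructed from the printed output (an assumption).
df = pd.DataFrame({
    "company": [0] * 7 + [1] * 6,
    "date": pd.to_datetime(
        ["2010-01-01", "2010-01-15", "2010-02-01", "2010-02-15",
         "2010-03-01", "2010-03-15", "2010-04-01",
         "2010-03-01", "2010-03-15", "2010-04-01",
         "2010-04-15", "2010-05-01", "2010-05-15"]),
    "measure": [10] * 7 + [5] * 6,
})
windows = pd.DataFrame({
    "company": [0, 0, 1, 1],
    "end_date": pd.to_datetime(
        ["2010-02-01", "2010-03-15", "2010-04-01", "2010-05-15"]),
})

# merge_asof needs both frames sorted by the "on" key.
left = df.sort_values("date")
right = windows.assign(date=windows["end_date"]).sort_values("date")

# For each row, find the first window end on or after its date, per company.
matched = pd.merge_asof(left, right, on="date", by="company",
                        direction="forward")

# Keep rows that found a window and lie within 30 days of its end.
keep = matched["end_date"].notna() & (
    matched["date"] >= matched["end_date"] - pd.Timedelta(days=30))
result = matched[keep].groupby(["company", "end_date"])["measure"].sum()
print(result)
```

Like the backfill approach, this assigns each row to at most one window, so it is only suitable when windows within a company do not overlap.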

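Yet another alternative is to resample the first dataframe to daily data, compute rolling sums over a 30-day window (pd.rolling_sum in older pandas, .rolling(...).sum() in current versions), and then select the end dates you are interested in. This could also be quite memory intensive.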
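A rough sketch of that resample-and-roll idea, using the current .rolling API. The helper name window_sums_by_rolling is hypothetical and the sample frames are reconstructed from the printed output. The rolling window is days + 1 rows long so that both the start date and the end date are included, matching the inclusive filters used earlier.

```python
import pandas as pd

# Sample frames reconstructed from the printed output (an assumption).
df = pd.DataFrame({
    "company": [0] * 7 + [1] * 6,
    "date": pd.to_datetime(
        ["2010-01-01", "2010-01-15", "2010-02-01", "2010-02-15",
         "2010-03-01", "2010-03-15", "2010-04-01",
         "2010-03-01", "2010-03-15", "2010-04-01",
         "2010-04-15", "2010-05-01", "2010-05-15"]),
    "measure": [10] * 7 + [5] * 6,
})
windows = pd.DataFrame({
    "company": [0, 0, 1, 1],
    "end_date": pd.to_datetime(
        ["2010-02-01", "2010-03-15", "2010-04-01", "2010-05-15"]),
})


def window_sums_by_rolling(df, windows, days=30):
    """Hypothetical helper: daily resample, rolling sum, pick the end dates."""
    pieces = {}
    for company, g in df.groupby("company"):
        # One row per calendar day; days without data sum to 0.
        daily = g.set_index("date")["measure"].resample("D").sum()
        # days + 1 rows cover [end_date - days, end_date], both ends included.
        rolled = daily.rolling(window=days + 1, min_periods=1).sum()
        ends = windows.loc[windows["company"] == company, "end_date"]
        # End dates outside the company's date range come back as NaN.
        pieces[company] = rolled.reindex(ends)
    return pd.concat(pieces, names=["company", "end_date"])


result = window_sums_by_rolling(df, windows)
print(result)
```

The daily series has one row per calendar day per company, which is why this approach can use a lot of memory on long date ranges.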



When reposting, please credit the source: www.mshxw.com
Article URL: https://www.mshxw.com/it/611803.html