栏目分类:
子分类:
返回
名师互学网用户登录
快速导航关闭
当前搜索
当前分类
子分类
实用工具
热门搜索
名师互学网 > IT > 面试经验 > 面试问答

如何总结不同的groupby组合?

面试问答 更新时间: 发布时间: IT归档 最新发布 模块sitemap 名妆网 法律咨询 聚返吧 英语巴士网 伯小乐 网商动力

如何总结不同的groupby组合?

方法1:

合并

crop

>>> df1['combined_temp'] = df1.apply(lambda x : list([x['Crop1'],...     x['Crop2'],...     x['Crop3']]),axis=1)>>> df1.head()       County   Crop1    Crop2    Crop3  Total_pop   combined_temp0      Harney   grain   melons   apples       2000    [grain, melons, apples]1       Baker  melons    grain   apples       1500    [melons, grain, apples]2     Wheeler  melons    grain   apples       3000    [melons, grain, apples]3  Hood River  apples   melons    grain       1500    [apples, melons, grain]4       Wasco   pears  carrots  raddish       2000  [pears, carrots, raddish]

使它成为一个排序的元组

>>> df1['sorted'] = df1.apply(lambda x : tuple(sorted(x['combined_temp'])),axis=1)>>> df1.head()       County   Crop1    Crop2 ...  Total_pop   combined_temp          sorted0      Harney   grain   melons ...       2000    [grain, melons, apples]    (apples, grain, melons)1       Baker  melons    grain ...       1500    [melons, grain, apples]    (apples, grain, melons)2     Wheeler  melons    grain ...       3000    [melons, grain, apples]    (apples, grain, melons)3  Hood River  apples   melons ...       1500    [apples, melons, grain]    (apples, grain, melons)4       Wasco   pears  carrots ...       2000  [pears, carrots, raddish]  (carrots, pears, raddish)

然后按操作进行常规分组

>>> df1_grouped = df1.groupby(['sorted'])['Total_pop'].sum().reset_index()>>> df1_groupedsorted  Total_pop0    (apples, grain, melons)       80001  (carrots, pears, raddish)       9200

方法2: 基于ws-apprentice的答案的简短

    df = df1.copy()    grouping_cols = ['Crop1', 'Crop2', 'Crop3']    df[grouping_cols] = pd.Dataframe(df.loc[:, grouping_cols]           .apply(set, axis=1)           .apply(sorted).values           .tolist(), columns=grouping_cols)    >>> df.head()County    Crop1  Crop2    Crop3  Total_pop    0      Harney   apples  grain   melons       2000    1       Baker   apples  grain   melons       1500    2     Wheeler   apples  grain   melons       3000    3  Hood River   apples  grain   melons       1500    4       Wasco  carrots  pears  raddish       2000

现在按组分组

    >>> df.groupby(grouping_cols).Total_pop.sum()    Crop1    Crop2  Crop3      apples   grain  melons     8000    carrots  pears  raddish    9200    Name: Total_pop, dtype: int64

但我个人更喜欢使用numpy这个答案

由于您的数据似乎可以保证每个国家/地区拥有3种独特的农作物(“我正在按县级列出前三名的农作物一览表。”),因此可以对值进行排序并重新分配。

import numpy as npcols = ['Crop1', 'Crop2', 'Crop3']df1[cols] = np.sort(df1[cols].values, axis=1)       County    Crop1  Crop2    Crop3  Total_pop0      Harney   apples  grain   melons       20001       Baker   apples  grain   melons       15002     Wheeler   apples  grain   melons       30003  Hood River   apples  grain   melons       15004       Wasco  carrots  pears  raddish       20005      Morrow  carrots  pears  raddish       25006       Union  carrots  pears  raddish       27007        Lake  carrots  pears  raddish       2000

然后总结一下:

df1.groupby(cols).sum()# Total_pop#Crop1   Crop2 Crop3  #apples  grain melons        8000#carrots pears raddish       9200

好处是您避免使用Series.apply或.apply(axis=1)。对于large Dataframes,性能差异非常明显:

df1 = pd.concat([df1]*10000, ignore_index=True)cols = ['Crop1', 'Crop2', 'Crop3']%timeit df1[cols] = np.sort(df1[cols].values, axis=1)#36.1 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)to_sum = ['Crop1', 'Crop2', 'Crop3']%timeit df1[to_sum] = pd.Dataframe(df1.loc[:, to_sum].apply(set, axis=1).apply(list).values.tolist(), columns=to_sum)#1.41 s ± 51.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


转载请注明:文章转载自 www.mshxw.com
本文地址:https://www.mshxw.com/it/596396.html
我们一直用心在做
关于我们 文章归档 网站地图 联系我们

版权所有 (c)2021-2022 MSHXW.COM

ICP备案号:晋ICP备2021003244-6号