Sort a column within each group of a pandas DataFrame and select the top n rows

There are two solutions.

The first uses sort_values, followed by groupby and head:

df1 = df.sort_values('score',ascending = False).groupby('pidx').head(2)
print (df1)
    mainid pidx pidy  score
8        2    x    w     12
4        1    a    e      8
2        1    c    a      7
10       2    y    x      6
1        1    a    c      5
7        2    z    y      5
6        2    y    z      3
3        1    c    b      2
5        2    x    y      1
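The sort-then-head pattern can be tried on a small frame. The following is a minimal sketch, assuming made-up sample data (the original input frame is not shown above) and a hypothetical helper name, `top_n_per_group`. It relies on `groupby(...).head(n)` keeping rows in the order of the already sorted frame.

```python
import pandas as pd

# Hypothetical sample data, not the frame used in the answer above.
df = pd.DataFrame({
    'mainid': [1, 1, 1, 2, 2, 2],
    'pidx':   ['a', 'a', 'a', 'x', 'x', 'x'],
    'pidy':   ['c', 'd', 'e', 'w', 'y', 'z'],
    'score':  [5, 3, 8, 12, 1, 4],
})

def top_n_per_group(frame, group_col, sort_col, n):
    """Sort descending by sort_col, then keep the first n rows of each group."""
    ordered = frame.sort_values(sort_col, ascending=False)
    # head(n) is applied per group but preserves the sorted row order.
    return ordered.groupby(group_col).head(n)

top2 = top_n_per_group(df, 'pidx', 'score', 2)
print(top2)
```

The result keeps all original columns and the original index labels, so rows can still be traced back to the source frame.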

The second uses set_index, followed by groupby and nlargest:

df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index()
print (df)
  pidx  mainid pidy  score
0    a       1    e      8
1    a       1    c      5
2    c       1    a      7
3    c       1    b      2
4    x       2    w     12
5    x       2    y      1
6    y       2    x      6
7    y       2    z      3
8    z       2    y      5
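The nlargest variant can be sketched the same way. This is a minimal example under the assumption of made-up sample data (the original input frame is not shown above). Moving the other columns into the index first is what lets them survive the per-group `nlargest` call on the `score` Series, and `reset_index` turns them back into columns.

```python
import pandas as pd

# Hypothetical sample data, not the frame used in the answer above.
df = pd.DataFrame({
    'mainid': [1, 1, 1, 2, 2, 2],
    'pidx':   ['a', 'a', 'a', 'x', 'x', 'x'],
    'pidy':   ['c', 'd', 'e', 'w', 'y', 'z'],
    'score':  [5, 3, 8, 12, 1, 4],
})

top2 = (
    df.set_index(['mainid', 'pidy'])   # keep these columns by moving them into the index
      .groupby('pidx')['score']        # one Series of scores per pidx group
      .nlargest(2)                     # two highest scores within each group
      .reset_index()                   # group key and index levels become columns again
)
print(top2)
```

Unlike the sort-then-head approach, the result is ordered by group key and gets a fresh integer index, so the original row labels are lost.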

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'mainid':np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'score':np.random.randint(1000, size=N)})
#print (df)

# Loop-based baseline: for each group, sort by score in ascending order,
# take the first two rows, and append them to the result with concat
def epat(df):
    grouped = df.groupby('pidx')
    new_df = pd.DataFrame([], columns = df.columns)
    for key, values in grouped:
        new_df = pd.concat([new_df, grouped.get_group(key).sort_values('score', ascending=True)[:2]], axis=0)
    return (new_df)

print (epat(df))

In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
1 loop, best of 3: 309 ms per loop

In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
1 loop, best of 3: 7.11 s per loop

In [147]: %timeit (epat(df))
1 loop, best of 3: 22 s per loop
