There are two possible solutions:
1. sort_values followed by groupby and head:

    df1 = df.sort_values('score', ascending=False).groupby('pidx').head(2)
    print(df1)

        mainid pidx pidy  score
    8        2    x    w     12
    4        1    a    e      8
    2        1    c    a      7
    10       2    y    x      6
    1        1    a    c      5
    7        2    z    y      5
    6        2    y    z      3
    3        1    c    b      2
    5        2    x    y      1

2. set_index followed by groupby and nlargest:

    df = df.set_index(['mainid', 'pidy']).groupby('pidx')['score'].nlargest(2).reset_index()
    print(df)

      pidx  mainid pidy  score
    0    a       1    e      8
    1    a       1    c      5
    2    c       1    a      7
    3    c       1    b      2
    4    x       2    w     12
    5    x       2    y      1
    6    y       2    x      6
    7    y       2    z      3
    8    z       2    y      5

Timings:

    import numpy as np
    import pandas as pd

    np.random.seed(123)
    N = 1000000

    L1 = list('abcdefghijklmnopqrstu')
    L2 = list('efghijklmnopqrstuvwxyz')
    df = pd.DataFrame({'mainid': np.random.randint(1000, size=N),
                       'pidx': np.random.randint(10000, size=N),
                       'pidy': np.random.choice(L2, N),
                       'score': np.random.randint(1000, size=N)})
    #print (df)

    # Loop-based version, included for comparison.
    # Note: ascending=True keeps the two lowest scores per group;
    # it is left as written because only the run time is compared here.
    def epat(df):
        grouped = df.groupby('pidx')
        new_df = pd.DataFrame([], columns=df.columns)
        for key, values in grouped:
            new_df = pd.concat([new_df, grouped.get_group(key).sort_values('score', ascending=True)[:2]], axis=0)
        return new_df

    print(epat(df))

    In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
    1 loop, best of 3: 309 ms per loop

    In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
    1 loop, best of 3: 7.11 s per loop

    In [147]: %timeit (epat(df))
    1 loop, best of 3: 22 s per loop
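
To try both approaches side by side, here is a minimal sketch. It assumes a small made-up frame (the values are hypothetical, not the data from the question) and scores that are unique within each group, so ties do not come into play. The variable names top_head and top_nlargest are chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data; scores are unique within each pidx group.
df = pd.DataFrame({
    'mainid': [1, 1, 1, 1, 1, 2, 2, 2, 2],
    'pidx':   ['a', 'a', 'a', 'c', 'c', 'x', 'x', 'x', 'y'],
    'pidy':   ['e', 'c', 'd', 'a', 'b', 'w', 'y', 'z', 'x'],
    'score':  [8, 5, 3, 7, 2, 12, 1, 4, 6],
})

# Approach 1: sort once by score, then keep the first two rows of each group.
top_head = df.sort_values('score', ascending=False).groupby('pidx').head(2)

# Approach 2: move the other columns into the index so they survive
# the per-group nlargest on the score column, then restore them.
top_nlargest = (df.set_index(['mainid', 'pidy'])
                  .groupby('pidx')['score']
                  .nlargest(2)
                  .reset_index())

# Bring both results into the same row and column order before comparing.
cols = ['mainid', 'pidx', 'pidy', 'score']
order = dict(by=['pidx', 'score'], ascending=[True, False])
a = top_head.sort_values(**order).reset_index(drop=True)[cols]
b = top_nlargest.sort_values(**order).reset_index(drop=True)[cols]

print(a)
print(a.values.tolist() == b.values.tolist())
```

Both approaches select the same rows on this sample. Going by the timings above, the sort_values plus head variant is the faster of the two on large frames.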


