You can use `value_counts` together with boolean indexing and `isin`:

    import pandas as pd

    df = pd.DataFrame({'LeafID': [1,1,2,1,3,3,1,6,3,5,1],
                       'pidx': [10,10,300,10,30,40,20,10,30,45,20],
                       'pidy': [20,20,400,20,15,20,12,43,54,112,23],
                       'count': [10,20,30,40,80,10,20,50,30,10,70],
                       'score': [10,10,10,22,22,3,4,5,9,0,1]})
    print (df)

        LeafID  count  pidx  pidy  score
    0        1     10    10    20     10
    1        1     20    10    20     10
    2        2     30   300   400     10
    3        1     40    10    20     22
    4        3     80    30    15     22
    5        3     10    40    20      3
    6        1     20    20    12      4
    7        6     50    10    43      5
    8        3     30    30    54      9
    9        5     10    45   112      0
    10       1     70    20    23      1

    # count how often each pidx value occurs
    s = df.pidx.value_counts()
    # keep only the pidx values that occur more than twice
    idx = s[s>2].index
    print (df[df.pidx.isin(idx)])

       LeafID  count  pidx  pidy  score
    0       1     10    10    20     10
    1       1     20    10    20     10
    3       1     40    10    20     22
    7       6     50    10    43      5

Timings:

    import numpy as np
    import pandas as pd

    np.random.seed(123)
    N = 1000000
    L1 = list('abcdefghijklmnopqrstu')
    L2 = list('efghijklmnopqrstuvwxyz')
    df = pd.DataFrame({'LeafId': np.random.randint(1000, size=N),
                       'pidx': np.random.randint(10000, size=N),
                       'pidy': np.random.choice(L2, N),
                       'count': np.random.randint(1000, size=N)})
    print (df)

    # groupby + filter solution, for comparison
    print (df.groupby('pidx').filter(lambda x: len(x) > 120))

    # value_counts + isin solution
    def jez(df):
        s = df.pidx.value_counts()
        return df[df.pidx.isin(s[s>120].index)]

    print (jez(df))

    In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120))
    1 loop, best of 3: 1.17 s per loop

    In [56]: %timeit (jez(df))
    10 loops, best of 3: 141 ms per loop

    In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120])
    10 loops, best of 3: 102 ms per loop

    In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120])
    1 loop, best of 3: 685 ms per loop

    In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120])
    10 loops, best of 3: 104 ms per loop

For `final_score`, you can use:

    df['final_score'] = df['count'].mul(.4).add(df.score.mul(.6))
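
To tie the two steps together, here is a minimal sketch that filters the small sample frame and then computes `final_score` on the result. The names `filtered` and `alt` are only illustrative and not part of the original answer, and the 0.4/0.6 weights are the ones used in the line above.

```python
import pandas as pd

# Same sample data as in the first example of this answer.
df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# Keep rows whose pidx value occurs more than twice.
s = df['pidx'].value_counts()
filtered = df[df['pidx'].isin(s[s > 2].index)].copy()  # copy so we can add a column safely

# Weighted sum: 40% of count plus 60% of score.
filtered['final_score'] = filtered['count'].mul(0.4).add(filtered['score'].mul(0.6))

# The same calculation written with arithmetic operators.
alt = 0.4 * filtered['count'] + 0.6 * filtered['score']

print(filtered[['pidx', 'count', 'score', 'final_score']])
```

Note that the `count` column has to be accessed as `df['count']`, because `df.count` refers to the `DataFrame.count` method rather than the column.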



