Use factorize:

    df['col'] = pd.factorize(df.col)[0]
    print (df)
       col
    0    0
    1    1
    2    0
    3    0
    4    1

See the pandas.factorize docs.
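
For context, `pd.factorize` returns a pair, and the `[0]` above keeps only the first element. The following is a minimal sketch on the same five-row example; the variable names `codes`, `uniques` and `restored` are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})

# factorize returns (codes, uniques): codes are the integer labels,
# uniques holds the distinct values in order of first appearance.
codes, uniques = pd.factorize(df["col"])

# Looking the codes up in uniques gives back the original strings.
restored = uniques.take(codes)

print(codes)
print(list(uniques))
```

Keeping `uniques` around is handy if the integer labels need to be mapped back to the original values later.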

EDIT:

As Jeff mentioned in the comments, it is better to convert the column to categorical, mainly because it uses less memory:

    df['col'] = df['col'].astype("category")

Timings:

Interestingly, on a large df pandas is faster than numpy. I can't believe it.
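
A likely explanation, offered as an assumption rather than something measured here, is that `np.unique` sorts the values while `pd.factorize` labels them in order of first appearance without sorting. That also means the two approaches can assign different integers to the same value. A small sketch with a hypothetical three-element series shows the difference:

```python
import numpy as np
import pandas as pd

# "beans" comes first here, so the two orderings disagree.
s = pd.Series(["beans", "baked", "beans"])

# pandas: labels follow order of first appearance.
codes_pd, uniques_pd = pd.factorize(s)

# numpy: unique values are sorted, labels index into the sorted array.
uniques_np, codes_np = np.unique(s, return_inverse=True)

# sort=True makes factorize agree with the np.unique labelling.
codes_sorted, uniques_sorted = pd.factorize(s, sort=True)

print(codes_pd, list(uniques_pd))
print(codes_np, list(uniques_np))
print(codes_sorted, list(uniques_sorted))
```

In the benchmark data "baked" happens to appear before "beans", so both methods give identical labels there.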
len(df)=500k:

    In [29]: %timeit (a(df1))
    100 loops, best of 3: 9.27 ms per loop

    In [30]: %timeit (a1(df2))
    100 loops, best of 3: 9.32 ms per loop

    In [31]: %timeit (b(df3))
    10 loops, best of 3: 24.6 ms per loop

    In [32]: %timeit (b1(df4))
    10 loops, best of 3: 24.6 ms per loop

len(df)=5k:

    In [38]: %timeit (a(df1))
    1000 loops, best of 3: 274 µs per loop

    In [39]: %timeit (a1(df2))
    The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 273 µs per loop

    In [40]: %timeit (b(df3))
    The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 295 µs per loop

    In [41]: %timeit (b1(df4))
    1000 loops, best of 3: 294 µs per loop

len(df)=5:

    In [46]: %timeit (a(df1))
    1000 loops, best of 3: 206 µs per loop

    In [47]: %timeit (a1(df2))
    1000 loops, best of 3: 204 µs per loop

    In [48]: %timeit (b(df3))
    The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 164 µs per loop

    In [49]: %timeit (b1(df4))
    The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 164 µs per loop

Test code:

    import numpy as np
    import pandas as pd

    d = {'col': ["baked","beans","baked","baked","beans"]}
    df = pd.DataFrame(data=d)
    print (df)
    df = pd.concat([df]*100000).reset_index(drop=True)
    #test for 5k
    #df = pd.concat([df]*1000).reset_index(drop=True)
    df1,df2,df3, df4 = df.copy(),df.copy(),df.copy(),df.copy()

    # a, a1: pandas factorize (indexing the result vs. unpacking it)
    def a(df):
        df['col'] = pd.factorize(df.col)[0]
        return df

    def a1(df):
        idx,_ = pd.factorize(df.col)
        df['col'] = idx
        return df

    # b, b1: numpy unique with return_inverse (indexing vs. unpacking)
    def b(df):
        df['col'] = np.unique(df['col'],return_inverse=True)[1]
        return df

    def b1(df):
        _,idx = np.unique(df['col'],return_inverse=True)
        df['col'] = idx
        return df

    print (a(df1))
    print (a1(df2))
    print (b(df3))
    print (b1(df4))
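
On the categorical suggestion from the edit above: the memory saving can be checked with `memory_usage(deep=True)`, and the integer labels are still available through `.cat.codes`. This is a rough sketch, not part of the original benchmark. The 5000-row frame and the names `as_object`, `as_category`, `cat` and `codes` are assumptions made for illustration, and the exact byte counts depend on the pandas version.

```python
import pandas as pd

# Hypothetical frame: the five example values repeated 1000 times.
df = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"] * 1000})

# Bytes used by the column as strings vs. as a categorical.
as_object = df["col"].memory_usage(deep=True)
cat = df["col"].astype("category")
as_category = cat.memory_usage(deep=True)

# A categorical stores each distinct value once plus small integer codes;
# the codes play the same role as the factorize output.
codes = cat.cat.codes

print(as_object, as_category)
print(list(cat.cat.categories))
print(codes.head().tolist())
```

Note that categories are sorted by default, so these codes match the `np.unique` labelling rather than first-appearance order.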


