Python: speeding up the conversion of a categorical variable to its numeric index

Use factorize:

    df['col'] = pd.factorize(df.col)[0]
    print(df)
       col
    0    0
    1    1
    2    0
    3    0
    4    1
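To make it clearer what the `[0]` is picking out, here is a minimal sketch on the same five-row sample. `pd.factorize` returns a pair: the integer codes and the unique labels. The column name `col_idx` and the variable `restored` are just names chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})

# factorize returns (codes, uniques); [0] in the one-liner selects the codes
codes, uniques = pd.factorize(df["col"])
df["col_idx"] = codes  # hypothetical column name, keeps the original labels around

# The uniques let you map the codes back to the original labels
restored = uniques.take(codes)

print(df)
print(list(uniques))
```

Keeping `uniques` around is worth it if you ever need to decode the integers again.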

See the pandas documentation for factorize.

EDIT:

As Jeff mentioned in the comments, it is better to convert the column to categorical, mainly because it uses less memory:

    df['col'] = df['col'].astype("category")
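A rough way to check the memory claim yourself, and to still get the numeric index out of a categorical column, is sketched below. The 500k-row series mirrors the size used in the timings; the exact byte counts depend on your pandas version and string storage, so none are quoted here.

```python
import pandas as pd

s = pd.Series(["baked", "beans", "baked", "baked", "beans"] * 100_000)
cat = s.astype("category")

# deep=True counts the string payload, not just the pointers
obj_bytes = s.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)
print(f"strings: {obj_bytes} bytes, category: {cat_bytes} bytes")

# The integer index is still available through the .cat accessor
codes = cat.cat.codes
print(codes.head())
print(list(cat.cat.categories))
```

With only two categories the codes are stored as `int8`, which is where most of the saving comes from.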

Timings:

Interestingly, on a large df pandas is faster than numpy. I can't believe it.

len(df)=500k:

    In [29]: %timeit (a(df1))
    100 loops, best of 3: 9.27 ms per loop

    In [30]: %timeit (a1(df2))
    100 loops, best of 3: 9.32 ms per loop

    In [31]: %timeit (b(df3))
    10 loops, best of 3: 24.6 ms per loop

    In [32]: %timeit (b1(df4))
    10 loops, best of 3: 24.6 ms per loop

len(df)=5k:

    In [38]: %timeit (a(df1))
    1000 loops, best of 3: 274 µs per loop

    In [39]: %timeit (a1(df2))
    The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 273 µs per loop

    In [40]: %timeit (b(df3))
    The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 295 µs per loop

    In [41]: %timeit (b1(df4))
    1000 loops, best of 3: 294 µs per loop

len(df)=5:

    In [46]: %timeit (a(df1))
    1000 loops, best of 3: 206 µs per loop

    In [47]: %timeit (a1(df2))
    1000 loops, best of 3: 204 µs per loop

    In [48]: %timeit (b(df3))
    The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 164 µs per loop

    In [49]: %timeit (b1(df4))
    The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 164 µs per loop
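One caveat when comparing the two approaches timed above (`pd.factorize` versus `np.unique` with `return_inverse=True`): they only produce identical codes on the sample data because "baked" both appears first and sorts first. A small sketch on hypothetical data where that is not the case:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: "beans" appears first but sorts after "baked"
s = pd.Series(["beans", "baked", "beans"])

# factorize numbers labels in order of first appearance
fact_codes, fact_uniques = pd.factorize(s)

# np.unique numbers labels in sorted order
uniq_labels, uniq_codes = np.unique(s.to_numpy(), return_inverse=True)

# factorize can be asked to sort as well, which matches np.unique here
sorted_codes, sorted_uniques = pd.factorize(s, sort=True)

print(list(fact_codes), list(uniq_codes), list(sorted_codes))
```

If the exact integer assigned to each label matters downstream, pick one convention and stick to it.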

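Test code:

    import numpy as np
    import pandas as pd

    d = {'col': ["baked","beans","baked","baked","beans"]}
    df = pd.DataFrame(data=d)
    print(df)

    # 5 rows * 100000 = 500k rows
    df = pd.concat([df]*100000).reset_index(drop=True)
    # test for 5k
    #df = pd.concat([df]*1000).reset_index(drop=True)

    # one copy per function, because each function overwrites df['col']
    df1, df2, df3, df4 = df.copy(), df.copy(), df.copy(), df.copy()

    # pandas: take the codes directly from the returned tuple
    def a(df):
        df['col'] = pd.factorize(df.col)[0]
        return df

    # pandas: unpack the tuple first
    def a1(df):
        idx, _ = pd.factorize(df.col)
        df['col'] = idx
        return df

    # numpy: take the inverse indices directly from the returned tuple
    def b(df):
        df['col'] = np.unique(df['col'], return_inverse=True)[1]
        return df

    # numpy: unpack the tuple first
    def b1(df):
        _, idx = np.unique(df['col'], return_inverse=True)
        df['col'] = idx
        return df

    print(a(df1))
    print(a1(df2))
    print(b(df3))
    print(b1(df4))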
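If you want to rerun the comparison outside IPython, a sketch with the standard `timeit` module could look like this. Note that `a`, `a1`, `b` and `b1` above write the codes back into `df['col']`, so any call after the first one works on integers rather than strings; the version below does not write back, so every run sees the original labels. The helper names `make_df` and `best_time` and the default sizes are assumptions for this illustration, not part of the original test.

```python
import timeit

import numpy as np
import pandas as pd


def make_df(n_copies):
    """Build the 5-row sample repeated n_copies times (hypothetical helper)."""
    base = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})
    return pd.concat([base] * n_copies).reset_index(drop=True)


def best_time(func, n_copies=1000, number=20):
    """Best per-call time in seconds over 3 repeats (hypothetical helper)."""
    df = make_df(n_copies)
    runs = timeit.repeat(lambda: func(df["col"]), repeat=3, number=number)
    return min(runs) / number


results = {
    "factorize": best_time(lambda col: pd.factorize(col)[0]),
    "np.unique": best_time(
        lambda col: np.unique(col.to_numpy(), return_inverse=True)[1]
    ),
}

for name, seconds in results.items():
    print(f"{name}: {seconds * 1e6:.1f} µs per call")
```

Raise `n_copies` to 100000 to get the 500k-row case. The numbers will differ by machine and library version.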
