
Python: speeding up the conversion of a categorical variable to its numeric index



Use factorize:

df['col'] = pd.factorize(df.col)[0]
print(df)
   col
0    0
1    1
2    0
3    0
4    1
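As a minimal sketch of what factorize returns besides the codes, the example below keeps both return values. The names `codes`, `uniques` and `restored` are illustrative and not part of the original answer; the sample data is the one used in the test code of this answer.

```python
import pandas as pd

df = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})

# factorize returns a pair: integer codes and the unique labels.
# Codes are assigned in order of first appearance.
codes, uniques = pd.factorize(df["col"])

# The uniques make the encoding reversible: look each code up again.
restored = uniques.take(codes)

print(codes.tolist())
print(list(uniques))
print(list(restored))
```

Keeping `uniques` around is useful when the numeric index later has to be translated back into the original labels.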

Docs: pandas.factorize

EDIT:

As Jeff mentioned in a comment, the best option is to convert the column to categorical, mainly because it uses less memory:

df['col'] = df['col'].astype("category")
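A rough sketch of how the memory claim could be checked, and of how the numeric index is still available after the conversion through `.cat.codes`. The enlarged sample frame and the variable names are assumptions made for illustration; the exact byte counts depend on the pandas version and string storage.

```python
import pandas as pd

# Hypothetical sample: the five example values repeated to 5,000 rows.
df = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"] * 1000})

as_strings = df["col"]
as_category = df["col"].astype("category")

# .cat.codes exposes the integer index behind each category.
codes = as_category.cat.codes

# deep=True also counts the string payloads, not only the references.
string_bytes = as_strings.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(codes.head().tolist())
print(list(as_category.cat.categories))
print(string_bytes, category_bytes)
```

Note that categories are sorted by default, so the codes follow label order rather than order of first appearance.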

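Timings

Interestingly, on a large df pandas is faster than numpy. I could hardly believe it. In the runs below, factorize (a, a1) takes about 9.3 ms on 500k rows against about 24.6 ms for np.unique (b, b1); the two are close at 5k rows, and numpy is ahead only at 5 rows.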

len(df)=500k

In [29]: %timeit (a(df1))
100 loops, best of 3: 9.27 ms per loop

In [30]: %timeit (a1(df2))
100 loops, best of 3: 9.32 ms per loop

In [31]: %timeit (b(df3))
10 loops, best of 3: 24.6 ms per loop

In [32]: %timeit (b1(df4))
10 loops, best of 3: 24.6 ms per loop

len(df)=5k

In [38]: %timeit (a(df1))
1000 loops, best of 3: 274 µs per loop

In [39]: %timeit (a1(df2))
The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 273 µs per loop

In [40]: %timeit (b(df3))
The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 295 µs per loop

In [41]: %timeit (b1(df4))
1000 loops, best of 3: 294 µs per loop

len(df)=5

In [46]: %timeit (a(df1))
1000 loops, best of 3: 206 µs per loop

In [47]: %timeit (a1(df2))
1000 loops, best of 3: 204 µs per loop

In [48]: %timeit (b(df3))
The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

In [49]: %timeit (b1(df4))
The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop
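The measurements above were taken with IPython's %timeit. Outside IPython, a comparable measurement could be sketched with the standard timeit module, as below. The helper `time_both`, the row counts and the repeat settings are assumptions chosen for a quick run; it times only the conversion, not the assignment back into the frame, and absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})


def time_both(n_repeats, number=20):
    """Best per-call time in seconds for each approach on n_repeats * 5 rows."""
    col = pd.concat([base] * n_repeats, ignore_index=True)["col"]
    values = col.to_numpy()
    t_factorize = min(
        timeit.repeat(lambda: pd.factorize(col)[0], number=number, repeat=3)
    ) / number
    t_unique = min(
        timeit.repeat(
            lambda: np.unique(values, return_inverse=True)[1],
            number=number,
            repeat=3,
        )
    ) / number
    return t_factorize, t_unique


# Map row count -> (factorize seconds, np.unique seconds).
results = {}
for n_repeats in (1, 1000, 20000):
    results[n_repeats * 5] = time_both(n_repeats)

for rows, (t_f, t_u) in results.items():
    print(f"{rows:>7} rows  factorize: {t_f * 1e6:9.1f} µs  np.unique: {t_u * 1e6:9.1f} µs")
```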

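Test code:

import numpy as np
import pandas as pd

d = {'col': ["baked", "beans", "baked", "baked", "beans"]}
df = pd.DataFrame(data=d)
print(df)

# 500k rows
df = pd.concat([df] * 100000).reset_index(drop=True)
# test for 5k
# df = pd.concat([df] * 1000).reset_index(drop=True)

# each function modifies its argument in place, so give each its own copy
df1, df2, df3, df4 = df.copy(), df.copy(), df.copy(), df.copy()

# pandas: factorize, codes taken by index
def a(df):
    df['col'] = pd.factorize(df.col)[0]
    return df

# pandas: factorize, codes taken by unpacking
def a1(df):
    idx, _ = pd.factorize(df.col)
    df['col'] = idx
    return df

# numpy: unique with return_inverse, codes taken by index
def b(df):
    df['col'] = np.unique(df['col'], return_inverse=True)[1]
    return df

# numpy: unique with return_inverse, codes taken by unpacking
def b1(df):
    _, idx = np.unique(df['col'], return_inverse=True)
    df['col'] = idx
    return df

print(a(df1))
print(a1(df2))
print(b(df3))
print(b1(df4))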
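One caveat when comparing the two approaches: on the test data they happen to produce identical codes, but that is not guaranteed in general. The sketch below, using a hypothetical three-label series, shows that factorize numbers labels by first appearance while np.unique numbers them in sorted order, and that passing `sort=True` to factorize makes the two agree.

```python
import numpy as np
import pandas as pd

# Hypothetical data where appearance order differs from sorted order.
s = pd.Series(["beans", "baked", "beans", "apple"])

# pandas default: codes follow order of first appearance.
codes_f, uniques_f = pd.factorize(s)

# numpy: unique values come back sorted, and the inverse indexes into them.
uniques_u, codes_u = np.unique(s.to_numpy(), return_inverse=True)

# pandas with sort=True: uniques are sorted, matching numpy's numbering.
codes_s, uniques_s = pd.factorize(s, sort=True)

print(codes_f.tolist(), list(uniques_f))
print(codes_u.tolist(), list(uniques_u))
print(codes_s.tolist(), list(uniques_s))
```

If downstream code depends on a particular numbering, it is worth choosing one convention explicitly rather than relying on the two happening to match.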

