
Performance: Python pandas DataFrame.to_csv append becomes gradually slower


In this situation you should profile your code (to see which function calls are taking the most time), so that you can check empirically that it really is to_csv that is slow and not something else.

Looking over your code: first, there is a lot of copying and a lot of looping here (not enough vectorization)... whenever you see a loop, look for a way to remove it. Second, since you are using things like zfill, I wonder whether you want to_fwf (fixed-width format) rather than to_csv?
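To see the payoff of removing a loop, it can help to compare a per-column Python loop against the equivalent whole-DataFrame operation. A minimal sketch with toy data (column names are illustrative, not from the original code):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 4.0], "b": [10.0, 20.0, 40.0]})

# loop version: divide each column by its max, one column at a time
looped = df.copy()
for col in looped.columns:
    looped[col] = looped[col] / looped[col].max()

# vectorized version: one DataFrame-wide operation
vectorized = df / df.max()

print(vectorized.equals(looped))  # → True
```

Both produce the same result, but the vectorized form dispatches the division to optimized C code inside pandas/NumPy instead of Python-level iteration, which is where most of the speedup comes from.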

Some sanity tests: are some files much bigger than others (which could lead to you hitting swap)? Are you sure the largest files really have only 1200 rows? Have you checked this, e.g. using wc -l?

IMO, I think it is unlikely to be garbage collection (as suggested in the other answer).


Here are some improvements to the code which should improve the runtime.

Since the columns are fixed, I would extract the column calculations and vectorize the real, child and other normalizations. Use apply rather than iterating (for zfill).

columns_to_drop = set(head) & set(exclude)  # maybe also - ['ConcatIndex']
remaining_cols = set(head) - set(exclude)
real_cols = [r for r in remaining_cols if 'Real ' in r]
real_cols_suffix = [r.strip('Real ') for r in real_cols]
remaining_cols = remaining_cols - set(real_cols)
child_cols = [r for r in remaining_cols if 'child' in r]
child_cols_desc = [r.strip('child' + 'desc') for r in child_cols]
remaining_cols = remaining_cols - set(child_cols)

for count, picklefile in enumerate(pickleFiles):
    if count % 100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    # Dataframe Manipulation:
    df = pd.read_pickle(path + picklefile)
    df['ConcatIndex'] = 100000 * df.FileID + df.ID
    # use apply here rather than iterating
    df['Concatenated String Index'] = df['ConcatIndex'].apply(lambda x: str(x).zfill(10))
    df.index = df.ConcatIndex

    # Dataframe Normalization:
    dftemp = df.very_deep_copy()  # don't *think* you need this

    # drop all excludes
    dftemp.drop(list(columns_to_drop), axis=1, inplace=True)

    # normalize real cols
    m = dftemp[real_cols_suffix].max()
    m.index = real_cols
    dftemp[real_cols] = dftemp[real_cols] / m

    # normalize child cols
    m = dftemp[child_cols_desc].max()
    m.index = child_cols
    dftemp[child_cols] = dftemp[child_cols] / m

    # normalize remaining (real and child cols were already subtracted above)
    remaining = list(remaining_cols)
    dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()

    # if this case is important, then discard the rows of m where .max() is 0
    # if max != 0:
    #     dftemp[string] = dftemp[string] / max

    # this is dropped earlier; if you need it, subtract ['ConcatIndex'] from columns_to_drop
    # dftemp.drop('ConcatIndex', axis=1, inplace=True)

    # Saving Dataframe in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)
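For the zfill step specifically, pandas' string accessor can zero-pad an entire column without apply at all. A small sketch, mirroring the ConcatIndex construction with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"FileID": [1, 2], "ID": [34, 567]})
df["ConcatIndex"] = 100000 * df["FileID"] + df["ID"]

# vectorized zero-padding via the .str accessor instead of a row-by-row apply
df["Concatenated String Index"] = df["ConcatIndex"].astype(str).str.zfill(10)
print(df["Concatenated String Index"].tolist())  # → ['0000100034', '0000200567']
```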

Stylistically, I would probably choose to wrap each of these parts into functions; this also means more things could be garbage collected, if that really were the issue...


Another, faster option is to use pytables (HDFStore), if you do not need the resulting output to be csv (but I expect you do)...
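If csv output turns out to be optional, appending each batch to an HDF5 table sidesteps text serialization entirely. A hedged sketch (requires the tables package; the file name and column are illustrative):

```python
import pandas as pd

store = pd.HDFStore("normalized.h5", mode="w")
for i in range(3):
    df = pd.DataFrame({"value": [i, i + 1]})
    # format='table' makes the on-disk node appendable across iterations
    store.append("normalized", df, format="table")
result = store["normalized"]  # read everything back as one DataFrame
store.close()
print(len(result))  # 3 appends x 2 rows = 6 rows
```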

The best thing to do by far is to profile your code, e.g. with %prun in IPython (see http://pynash.org/2013/03/06/timing-and-profiling.html). Then you can see whether it definitely is to_csv, and specifically where (which line of your code and which lines of pandas code).
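Outside IPython, the standard library's cProfile and pstats give the same per-function breakdown as %prun. A minimal sketch with a stand-in workload:

```python
import cProfile
import io
import pstats

def slow_part():
    # stand-in for the real pickle-load / normalize / to_csv loop
    total = 0
    for i in range(100000):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_part()
profiler.disable()

buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
stats.sort_stats("cumulative").print_stats(5)  # top 5 calls by cumulative time
print(buf.getvalue())
```

In the real loop, look for which function (your own, or a pandas internal such as the csv writer) dominates the cumulative-time column.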


Ah ha, I'd missed that you are appending all of these to a single csv file. And your prun shows that most of the time is spent in close, so let's keep the file open:

# outside of the for loop (so the file is opened and closed only once)
f = open(finalnormCSVFile, 'w')
...
for picklefile in ...
    if picklefile == '0000.p':
        dftemp.to_csv(f)
    else:
        dftemp.to_csv(f, mode='a', header=False)
...
f.close()

Each time the file is opened, before it can be appended to it needs to seek to the end before writing. It could be that this is the expensive part (I don't see why it should be that bad, but keeping the file open removes the need to do this).
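The keep-the-handle-open pattern can be checked in isolation: write the header with the first chunk only, then stream the remaining rows through the same handle. A small sketch with toy frames (an io.StringIO stands in for the real file):

```python
import io
import pandas as pd

buf = io.StringIO()  # stands in for open(finalnormCSVFile, 'w')
for i in range(3):
    chunk = pd.DataFrame({"a": [i], "b": [i * 10]})
    if i == 0:
        chunk.to_csv(buf, index=False)                # first chunk: header + rows
    else:
        chunk.to_csv(buf, index=False, header=False)  # later chunks: rows only

print(buf.getvalue().strip().splitlines())  # → ['a,b', '0,0', '1,10', '2,20']
```

Since the handle stays open, every to_csv call simply continues writing at the current position; there is no repeated open/seek/close per file.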


