In this case you should profile the code (to see which function calls are taking the most time), so you can check empirically that it really is read_csv that is slow rather than somewhere else.
Looking through your code: first, there's a lot of copying here and a lot of loops (not enough vectorization)… whenever you see a loop, look for a way to remove it. Second, when you use things like zfill, I wonder if you want to_fwf (fixed width format) rather than to_csv?
Some sanity testing: are some files much bigger than others (which could mean you're hitting swap)? Are you sure the biggest file is only 1200 rows? Have you checked this, e.g. with wc -l?
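If shelling out to wc -l is awkward, the same row-count sanity check can be done from Python. This is just a sketch; the path below is a made-up stand-in for one of the real pickle files:

```python
import pandas as pd

# demo stand-in for one of the real pickle files (path is illustrative)
pd.DataFrame({'ID': range(1200)}).to_pickle('/tmp/0000.p')

# how many rows does the file actually have? compare with what you expect
n_rows = len(pd.read_pickle('/tmp/0000.p'))
print('/tmp/0000.p', n_rows)
```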
IMO, I think it's unlikely to be garbage collection (as suggested in the other answer).
Here are some improvements to your code, which should improve the runtime.

The columns are fixed, so I would extract the column calculations and vectorize the real, child and other normalizations. Use apply rather than iterating (for zfill).
```python
columns_to_drop = set(head) & set(exclude)  # maybe also - ['ConcatIndex']
remaining_cols = set(head) - set(exclude)
real_cols = [r for r in remaining_cols if 'Real ' in r]
real_cols_suffix = [r.strip('Real ') for r in real_cols]
remaining_cols = remaining_cols - set(real_cols)
child_cols = [r for r in remaining_cols if 'child' in r]
child_cols_desc = [r.strip('child' + 'desc') for r in child_cols]
remaining_cols = remaining_cols - set(child_cols)

for count, picklefile in enumerate(pickleFiles):
    if count % 100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    # DataFrame manipulation:
    df = pd.read_pickle(path + picklefile)
    df['ConcatIndex'] = 100000 * df.FileID + df.ID
    # use apply here rather than iterating
    df['Concatenated String Index'] = df['ConcatIndex'].apply(lambda x: str(x).zfill(10))
    df.index = df.ConcatIndex

    # DataFrame normalization:
    dftemp = df.very_deep_copy()  # don't *think* you need this

    # drop all excludes
    dftemp.drop(columns_to_drop, axis=1, inplace=True)

    # normalize real cols
    m = dftemp[real_cols_suffix].max()
    m.index = real_cols
    dftemp[real_cols] = dftemp[real_cols] / m

    # normalize child cols
    m = dftemp[child_cols_desc].max()
    m.index = child_cols
    dftemp[child_cols] = dftemp[child_cols] / m

    # normalize remaining
    remaining = list(remaining_cols)
    dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()

    # if this case is important then discard the rows of m where .max() is 0
    #if max != 0:
    #    dftemp[string] = dftemp[string] / max

    # this is dropped earlier; if you need it, subtract ['ConcatIndex'] from columns_to_drop
    # dftemp.drop('ConcatIndex', axis=1, inplace=True)

    # Saving DataFrame to CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)
```

Style-wise, I would probably choose to wrap each of these parts into functions, which also means more things could be done if this really is the problem…
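As an aside on the zfill step: beyond apply, pandas string methods give a fully vectorized version. A minimal sketch, where the column names mirror the snippet above but the sample values are made up:

```python
import pandas as pd

df = pd.DataFrame({'FileID': [1, 2], 'ID': [7, 42]})
df['ConcatIndex'] = 100000 * df.FileID + df.ID

# vectorized: no per-row Python lambda, just cast to str and pad
df['Concatenated String Index'] = df['ConcatIndex'].astype(str).str.zfill(10)
print(df['Concatenated String Index'].tolist())  # ['0000100007', '0000200042']
```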
Another, much faster option is to use pytables (HDFStore), if you don't need the resulting output to be csv (but I expect you do)…
The best thing to do by far is to profile your code, e.g. with %prun
in
IPython, e.g. see http://pynash.org/2013/03/06/timing-and-profiling.html.
Then you can see whether it definitely is read_csv
and specifically where (which
line of your code and which lines of pandas code).
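Outside IPython, the stdlib cProfile gives the same information as %prun. A minimal sketch; slow_read below is just a stand-in for whatever function wraps your read/write loop:

```python
import cProfile
import io
import pstats

def slow_read():
    # stand-in for the function that wraps your read_csv loop
    return sum(i * i for i in range(100000))

pr = cProfile.Profile()
pr.enable()
slow_read()
pr.disable()

# sort by cumulative time so the expensive call chain floats to the top
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats('cumulative').print_stats(10)
report = buf.getvalue()
print('slow_read' in report)  # the hot function shows up in the report
```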
Ah ha, I'd missed that you are appending all of these to a single csv file. And
your %prun shows that most of the time is spent in
close, so let's keep the
file open:
```python
# outside of the for loop (so the file is opened and closed only once)
f = open(finalnormCSVFile, 'w')
...
for picklefile in ...:
    if picklefile == '0000.p':
        dftemp.to_csv(f)
    else:
        dftemp.to_csv(f, mode='a', header=False)
...
f.close()
```
Each time the file is opened, before it can be appended to, it needs to seek to the
end before writing; it could be that this is the expensive part (I don't see why
it should be that bad, but keeping the file open removes the need to do
this).
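The same open-once idea can be written with a context manager, so the handle is closed even if the loop raises. A small self-contained sketch (the frames list and output path are illustrative):

```python
import pandas as pd

# stand-ins for the per-pickle DataFrames produced in the loop
frames = [pd.DataFrame({'a': [1], 'b': [2]}),
          pd.DataFrame({'a': [3], 'b': [4]})]

# open once; write the header only for the first chunk
with open('/tmp/combined.csv', 'w') as f:
    for i, df in enumerate(frames):
        df.to_csv(f, header=(i == 0), index=False)

print(open('/tmp/combined.csv').read())
```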



