Saving to HDF5 is very slow (Python freezing)

Writing data to HDF5

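If you write to a chunked dataset without specifying a chunk shape, h5py will choose one for you automatically. Since h5py cannot know how you will write to or read from the dataset, this often results in poor performance.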
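To see what h5py picks when no chunk shape is given, a minimal sketch along these lines can help. It assumes h5py and NumPy are installed. The in-memory `core` driver, the file name `chunk_demo.h5`, and the dataset names `auto` and `explicit` are illustrative choices, not part of the original setup.

```python
import numpy as np
import h5py

shape = (90827, 10, 10, 2048)

# In-memory file (core driver without a backing store), so nothing is written to disk.
# Chunked datasets are allocated lazily, so creating them here is cheap.
with h5py.File("chunk_demo.h5", "w", driver="core", backing_store=False) as f:
    # chunks=True lets h5py guess a chunk shape on its own.
    auto = f.create_dataset("auto", shape=shape, dtype=np.float32, chunks=True)
    # Explicit chunk shape, matched to reading and writing along the first dimension.
    explicit = f.create_dataset("explicit", shape=shape, dtype=np.float32,
                                chunks=(10, 10, 10, 2048))
    auto_chunks = auto.chunks
    explicit_chunks = explicit.chunks

print("auto-guessed chunks:", auto_chunks)
print("explicit chunks:    ", explicit_chunks)
```

If the guessed shape cuts across the axis you actually slice along, every batch touches many partial chunks, which is the situation described here.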

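You are also using the default chunk cache size of 1 MB. If you write only part of a chunk and that chunk does not fit in the cache (which is very likely with a 1 MB chunk cache), the whole chunk is read into memory, modified, and written back to disk. If this happens many times, you will see performance far below the sequential IO speed of your HDD/SSD.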
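A quick back-of-the-envelope check makes the cache problem concrete. This sketch only does arithmetic, using the chunk shape, dtype, and 200 MB cache size from the example in this answer.

```python
import math

# Sizes from the example in this answer: float32 data, chunk shape (10, 10, 10, 2048).
chunk_shape = (10, 10, 10, 2048)
itemsize = 4  # bytes per float32 value
chunk_bytes = math.prod(chunk_shape) * itemsize

default_cache_bytes = 1024**2      # default chunk cache: 1 MiB
large_cache_bytes = 200 * 1024**2  # enlarged chunk cache: 200 MiB

print(f"one chunk: {chunk_bytes / 1024**2:.2f} MiB")                # one chunk: 7.81 MiB
print("fits default cache:", chunk_bytes <= default_cache_bytes)   # fits default cache: False
print("chunks that fit in 200 MiB:", large_cache_bytes // chunk_bytes)  # 25
```

A single uncompressed chunk is already almost eight times larger than the default cache, so any partial write to it forces a full read-modify-write cycle. In h5py 2.9 and later the cache size can, as far as I know, also be set directly with the `rdcc_nbytes` argument of `h5py.File`, without the separate `h5py_cache` package.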

In the example below I assume that you only read or write along the first dimension. If that is not the case, you have to adapt it to your needs.

import numpy as np
import tables  # importing PyTables registers the Blosc filter (ID 32001)
import h5py as h5
import h5py_cache as h5c
import time

batch_size = 120
train_shape = (90827, 10, 10, 2048)
hdf5_path = 'Test.h5'

# As we are writing whole chunks here, this isn't really needed.
# But if you forget to set a large enough chunk cache size when not writing or
# reading whole chunks, the performance will be extremely bad
# (chunks can only be read or written as a whole).
f = h5c.File(hdf5_path, 'w', chunk_cache_mem_size=1024**2*200)  # 200 MB cache size
dset_train_bottle = f.create_dataset("train_bottle", shape=train_shape, dtype=np.float32,
                                     chunks=(10, 10, 10, 2048),
                                     compression=32001,
                                     compression_opts=(0, 0, 0, 0, 9, 1, 1),
                                     shuffle=False)

# Create the data once, before the time measurement starts
prediction = np.array(np.arange(120*10*10*2048), np.float32).reshape(120, 10, 10, 2048)

t1 = time.time()
# Testing with about 2 GB of data (20 batches of 120 samples)
for i in range(20):
    # prediction = np.array(np.arange(120*10*10*2048), np.float32).reshape(120, 10, 10, 2048)
    dset_train_bottle[i*batch_size:(i+1)*batch_size, :, :, :] = prediction
f.close()

# Measure the elapsed time once, so both printed values refer to the same interval
elapsed = time.time() - t1
print(elapsed)
print("MB/s: " + str(2000/elapsed))

Edit: Creating the data inside the loop took quite a lot of time, so I now create the data before the time measurement.

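This should give a throughput of at least 900 MB/s (CPU-limited). With real data and a lower compression level, you should easily reach the sequential IO speed of your hard disk.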

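Opening an HDF5 file with a with statement can also lead to poor performance if you mistakenly execute that block many times. Each pass closes and reopens the file, which discards the chunk cache.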
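A small sketch of the pattern that avoids this: open the file once and keep the write loop inside the single `with` block. It assumes h5py and NumPy are installed. The scratch path, the dataset name `data`, and the tiny shapes are hypothetical and only keep the example quick to run.

```python
import os
import tempfile

import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), "open_once.h5")  # hypothetical scratch file
batch = np.ones((4, 8), dtype=np.float32)
n_batches = 5

# Open the file once and keep it open for all writes, so the chunk cache
# survives from one batch to the next.
with h5py.File(path, "w") as f:
    dset = f.create_dataset("data", shape=(n_batches * 4, 8),
                            dtype=np.float32, chunks=(4, 8))
    for i in range(n_batches):
        dset[i * 4:(i + 1) * 4, :] = batch

# Pattern to avoid: placing `with h5py.File(path, "a") as f:` inside the loop,
# which closes and reopens the file (and drops the chunk cache) on every pass.

# Read everything back as a sanity check.
with h5py.File(path, "r") as f:
    total = float(f["data"][...].sum())
print(total)
```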
