您可以传递
parallel=True给任何numba
jitted函数,但这并不意味着它总是利用所有内核。您必须了解numba使用一些启发式方法来使代码并行执行,有时这些启发式方法根本找不到任何要并行化的代码。当前有一个拉取请求,以便在不可能使其“平行”时发出警告。因此,它更像是“请使其尽可能并行执行”参数而不是“强制并行执行”。
但是,如果您确实知道可以并行化代码,则始终可以手动使用线程或进程。只是改编numba
docs中使用多线程的示例:
#!/usr/bin/env pythonfrom __future__ import print_function, division, absolute_importimport mathimport threadingfrom timeit import repeatimport numpy as npfrom numba import jitnthreads = 4size = 10**7 # CHANGED# CHANGEDdef func_np(a, b): """ Control function using Numpy. """ return a + b# CHANGED@jit('void(double[:], double[:], double[:])', nopython=True, nogil=True)def inner_func_nb(result, a, b): """ Function under test. """ for i in range(len(result)): result[i] = a[i] + b[i]def timefunc(correct, s, func, *args, **kwargs): """ Benchmark *func* and print out its runtime. """ print(s.ljust(20), end=" ") # Make sure the function is compiled before we start the benchmark res = func(*args, **kwargs) if correct is not None: assert np.allclose(res, correct), (res, correct) # time it print('{:>5.0f} ms'.format(min(repeat(lambda: func(*args, **kwargs), number=5, repeat=2)) * 1000)) return resdef make_singlethread(inner_func): """ Run the given function inside a single thread. """ def func(*args): length = len(args[0]) result = np.empty(length, dtype=np.float64) inner_func(result, *args) return result return funcdef make_multithread(inner_func, numthreads): """ Run the given function inside *numthreads* threads, splitting its arguments into equal-sized chunks. """ def func_mt(*args): length = len(args[0]) result = np.empty(length, dtype=np.float64) args = (result,) + args chunklen = (length + numthreads - 1) // numthreads # Create argument tuples for each input chunk chunks = [[arg[i * chunklen:(i + 1) * chunklen] for arg in args] for i in range(numthreads)] # Spawn one thread per chunk threads = [threading.Thread(target=inner_func, args=chunk) for chunk in chunks] for thread in threads: thread.start() for thread in threads: thread.join() return result return func_mtfunc_nb = make_singlethread(inner_func_nb)func_nb_mt = make_multithread(inner_func_nb, nthreads)a = np.random.rand(size)b = np.random.rand(size)correct = timefunc(None, "numpy (1 thread)", func_np, a, b)timefunc(correct, "numba (1 thread)", func_nb, a, b)timefunc(correct, "numba (%d threads)" % nthreads, func_nb_mt, a, b)我突出显示了我更改的部分,所有其他内容均从示例中逐字复制。这利用了我机器上的所有内核(4内核机器,因此有4个线程),但是并没有显示出明显的加速:
numpy (1 thread) 539 msnumba (1 thread) 536 msnumba (4 threads) 442 ms
在这种情况下,多线程缺乏(很多)加速的原因是加法是带宽受限的操作。这意味着从数组中加载元素并将结果放置在结果数组中要比实际加法花费更多的时间。
在这些情况下,由于并行执行,您甚至可能会看到速度变慢!
与加载和存储数组元素相比,只有函数更复杂并且实际操作要花费大量时间时,并行执行才会有很大的改进。numba文档中的示例就是这样的:
def func_np(a, b): """ Control function using Numpy. """ return np.exp(2.1 * a + 3.2 * b)@jit('void(double[:], double[:], double[:])', nopython=True, nogil=True)def inner_func_nb(result, a, b): """ Function under test. """ for i in range(len(result)): result[i] = math.exp(2.1 * a[i] + 3.2 * b[i])实际上,这实际上(几乎)随线程数扩展,因为两次乘法,一次加法和一次调用
math.exp比加载和存储结果要慢得多:
func_nb = make_singlethread(inner_func_nb)func_nb_mt2 = make_multithread(inner_func_nb, 2)func_nb_mt3 = make_multithread(inner_func_nb, 3)func_nb_mt4 = make_multithread(inner_func_nb, 4)a = np.random.rand(size)b = np.random.rand(size)correct = timefunc(None, "numpy (1 thread)", func_np, a, b)timefunc(correct, "numba (1 thread)", func_nb, a, b)timefunc(correct, "numba (2 threads)", func_nb_mt2, a, b)timefunc(correct, "numba (3 threads)", func_nb_mt3, a, b)timefunc(correct, "numba (4 threads)", func_nb_mt4, a, b)
结果:
numpy (1 thread) 3422 msnumba (1 thread) 2959 msnumba (2 threads) 1555 msnumba (3 threads) 1080 msnumba (4 threads) 797 ms



