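Here is a vectorized way to get one random index per row, where a is a 2D array whose rows hold the probabilities. It compares each row's cumulative sum against a single uniform draw and takes the first position where the cumulative sum exceeds that draw:

    (a.cumsum(1) > np.random.rand(a.shape[0])[:,None]).argmax(1)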
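To make the one-liner easier to follow, here is a small worked sketch that splits it into steps. The uniform draws in r are fixed by hand purely for illustration (an assumption; the actual method draws them with np.random.rand), so the result is reproducible.

```python
import numpy as np

# Each row is a categorical distribution (rows sum to 1).
a = np.array([[0.1, 0.3, 0.6],
              [0.2, 0.4, 0.4]])

# Hand-picked draws, one per row, standing in for np.random.rand(a.shape[0]).
r = np.array([0.35, 0.15])

cdf = a.cumsum(axis=1)      # row-wise cumulative probabilities
mask = cdf > r[:, None]     # True from the first bin whose cumulative sum exceeds the draw
idx = mask.argmax(axis=1)   # argmax on booleans returns the position of the first True

print(idx)  # [1 0]
```

Row 0 has cumulative sums of roughly 0.1, 0.4, 1.0, so a draw of 0.35 lands in bin 1. Row 1 starts at 0.2, so a draw of 0.15 lands in bin 0.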
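Generalizing it to work along either the rows or the columns of a 2D array:

    import numpy as np

    def random_choice_prob_index(a, axis=1):
        # One uniform draw per distribution, shaped to broadcast along `axis`
        r = np.expand_dims(np.random.rand(a.shape[1-axis]), axis=axis)
        # First position along `axis` where the cumulative sum exceeds the draw
        return (a.cumsum(axis=axis) > r).argmax(axis=axis)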
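As a quick usage sketch of the axis argument: with axis=1 each row is treated as a distribution, and with axis=0 each column is. The one-hot input below is an illustrative assumption, chosen because its result is deterministic (all probability mass sits in one bin), which makes it a handy sanity check.

```python
import numpy as np

def random_choice_prob_index(a, axis=1):
    # One uniform draw per distribution, shaped to broadcast along `axis`
    r = np.expand_dims(np.random.rand(a.shape[1 - axis]), axis=axis)
    return (a.cumsum(axis=axis) > r).argmax(axis=axis)

a = np.array([[0.1, 0.3, 0.6],
              [0.2, 0.4, 0.4]])

rows = random_choice_prob_index(a, axis=1)    # distributions stored as rows
cols = random_choice_prob_index(a.T, axis=0)  # same distributions stored as columns

# One-hot distributions always return the index holding the 1.
onehot = np.eye(3)
by_row = random_choice_prob_index(onehot, axis=1)
by_col = random_choice_prob_index(onehot, axis=0)

print(rows.shape, cols.shape)  # (2,) (2,)
print(by_row, by_col)          # [0 1 2] [0 1 2]
```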
Let's verify against the given sample by running it a million times:

    In [589]: a = np.array([
         ...:     [.1, .3, .6],
         ...:     [.2, .4, .4],
         ...: ])

    In [590]: choices = [random_choice_prob_index(a)[0] for i in range(1000000)]

    # This should be close to first row of given sample
    In [591]: np.bincount(choices)/float(len(choices))
    Out[591]: array([ 0.099781,  0.299436,  0.600783])
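The same check can be sketched without a Python-level loop by repeating the first row and sampling every copy in one call. The trial count and the 0.01 tolerance below are arbitrary assumptions for illustration, not values from the original test.

```python
import numpy as np

def random_choice_prob_index(a, axis=1):
    r = np.expand_dims(np.random.rand(a.shape[1 - axis]), axis=axis)
    return (a.cumsum(axis=axis) > r).argmax(axis=axis)

a = np.array([[0.1, 0.3, 0.6],
              [0.2, 0.4, 0.4]])

n_trials = 200_000
# Stack n_trials copies of the first row so one call yields n_trials samples.
first_row = np.repeat(a[:1], n_trials, axis=0)
choices = random_choice_prob_index(first_row)

# Empirical frequencies; minlength guards against a bin that is never drawn.
freq = np.bincount(choices, minlength=a.shape[1]) / n_trials
print(freq)
print(np.allclose(freq, a[0], atol=0.01))
```

The printed frequencies should land close to the first row of a, with small sampling noise from run to run.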
Runtime test

Original loopy way:

    def loopy_app(categorical_distributions):
        m, n = categorical_distributions.shape
        out = np.empty(m, dtype=int)
        for i,row in enumerate(categorical_distributions):
            out[i] = np.random.choice(n, p=row)
        return out
Timings on a bigger array:

    In [593]: a = np.array([
         ...:     [.1, .3, .6],
         ...:     [.2, .4, .4],
         ...: ])

    In [594]: a_big = np.repeat(a,100000,axis=0)

    In [595]: %timeit loopy_app(a_big)
    1 loop, best of 3: 2.54 s per loop

    In [596]: %timeit random_choice_prob_index(a_big)
    100 loops, best of 3: 6.44 ms per loop
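For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. The array is deliberately much smaller than the one above (4,000 rows, an assumption made so the script finishes quickly), so the absolute numbers will not match the timings quoted above and will vary by machine.

```python
import timeit
import numpy as np

def loopy_app(categorical_distributions):
    m, n = categorical_distributions.shape
    out = np.empty(m, dtype=int)
    for i, row in enumerate(categorical_distributions):
        out[i] = np.random.choice(n, p=row)
    return out

def random_choice_prob_index(a, axis=1):
    r = np.expand_dims(np.random.rand(a.shape[1 - axis]), axis=axis)
    return (a.cumsum(axis=axis) > r).argmax(axis=axis)

a = np.array([[0.1, 0.3, 0.6],
              [0.2, 0.4, 0.4]])
a_big = np.repeat(a, 2000, axis=0)  # 4,000 rows; smaller than the original test

# Best of 3 single runs for each approach.
t_loop = min(timeit.repeat(lambda: loopy_app(a_big), number=1, repeat=3))
t_vec = min(timeit.repeat(lambda: random_choice_prob_index(a_big), number=1, repeat=3))

print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.6f} s")
```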



