AliasSample采样算法

def create_alias_table(df):
    s = [row['weight'] for index, row in df.iterrows()]
    prob_values = np.array(s)  # prob_values存放的是各个类型发生的累计概率*类型种数
    prob_values = prob_values / sum(prob_values)  # 归一化
    skus = [row['skuId'] for index, row in df.iterrows()]
    n = len(prob_values)
    prob_values = np.array(prob_values) * n  # 乘以种类数
    small, large = [], []
    accept, alias = [None] * n, ['-1'] * n
    # large存放长度大于1的线段，small存放长度小于1的线段,两个队列放的是种类id
    for idx, val in enumerate(prob_values):
        if val > 1:
            large.append(idx)
        else:
            small.append(idx)
    # 开始填充
    while small and large:  # 当两队列都有元素时
        small_idx, large_idx = small.pop(), large.pop()  # 长的和短的各取一个
        accept[small_idx] = prob_values[small_idx]  # 每个桶内的分界线。（因为这个算法就是先抽出属于哪个桶，再抽个数和分界线判断）
        alias[small_idx] = str(
            large_idx)  # alias存放第i个类型 对应的large_index。（因为都是由一个概率大于1和一个概率小于的类型组成）,当为默认值时，说明该桶没有large_index
        # 判定large_idx 去留
        # 先放短的线段，然后用长的线段去补1
        prob_values[large_idx] = prob_values[large_idx] - (1 - prob_values[small_idx])
        # 判断长的减剩下的是否还大于1， 大于1还放到large中，小于1了就放到small中
        if prob_values[large_idx] > 1:
            large.append(large_idx)
        else:
            small.append(large_idx)
    while small:  # 填充剩下的。
        small_idx = small.pop()
        accept[small_idx] = 1
    while large:
        large_idx = large.pop()
        accept[large_idx] = '1'
    return skus, accept, alias  #


def alias_sample(accept, alias):
    """
    :param accept:
    :param alias:
    :return: sample index
    """
    N = len(accept)
    i = int(np.random.random() * N)
    r = np.random.random()
    if r < accept[i]:
        return i
    else:
        return alias[i]


df = pd.Dataframe(data=[[0.8, 9274], [0.4, 8233], [0.2, 7163]], index=[0, 1, 2],columns=["weight", "skuId"])  # 计算三个商品的概率
a, b, c = create_alias_table(df)  # 创建alias_table
d = alias_sample(b, c)  # 根据aliasTable，进行采样

Alias Sample 算法就是先抽出属于哪个桶，再抽个数和分界线判断。
accept：用来存放每个桶内的分界线
alias：用来存放每个另一个桶的下标。（当前i个类型的内只有一种类型的时候，alias[i]为默认值）
参考资料：https://www.cnblogs.com/arachis/p/alias_sample.html

举例：三个商品，概率分别为0.8，0.4，0.2。得到如下数据

    # skus [9274, 8233, 7163]
    # accept [1, 0.8571428571428571, 0.42857142857142855]
    # alias ['-1', '0', '0']

解释：就是先从0，1，2选取一个桶。如果抽中第一个桶，由于alias[0]值为默认值-1，所以桶内只有一个类型（只有skus[0]）。如果抽中第二个桶，由于alias[0]值为0.857，然后再从0~1抽一个数，小于0.857则是sku[1]，大于0.857则是sku[alias[1]]，同理第三个桶

AliasSample采样算法

Python相关栏目本月热门文章