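Here is my first version, which seems to work fine. Feel free to copy it or to suggest how it could be made more efficient (I have a lot of programming experience in general, but not much with Python or NumPy).
This function creates a single random balanced subsample.
Edit: the subsample size is now measured against the minority class, i.e. subsample_size is a fraction of the smallest class count. This should probably be changed.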
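    import numpy as np

    def balanced_subsample(x, y, subsample_size=1.0):
        class_xs = []
        min_elems = None

        # Split x by class and track the size of the smallest class.
        for yi in np.unique(y):
            elems = x[(y == yi)]  # boolean indexing returns a copy of the rows
            class_xs.append((yi, elems))
            if min_elems is None or elems.shape[0] < min_elems:
                min_elems = elems.shape[0]

        # subsample_size < 1 keeps only that fraction of the smallest class.
        use_elems = min_elems
        if subsample_size < 1:
            use_elems = int(min_elems * subsample_size)

        xs = []
        ys = []

        for ci, this_xs in class_xs:
            # Shuffle only the classes that actually have to be cut down.
            if len(this_xs) > use_elems:
                np.random.shuffle(this_xs)

            x_ = this_xs[:use_elems]
            y_ = np.full(use_elems, ci)  # labels keep the dtype of the class value

            xs.append(x_)
            ys.append(y_)

        xs = np.concatenate(xs)
        ys = np.concatenate(ys)

        return xs, ys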
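On the efficiency question, one possible variant is to draw row indices per class instead of copying and shuffling each class's rows. This is only a sketch under a few assumptions: the name balanced_subsample_indices and the seed parameter are made up here for illustration, it needs NumPy 1.17 or later for np.random.default_rng, and it keeps the same sizing rule (a fraction of the smallest class).

```python
import numpy as np

def balanced_subsample_indices(y, subsample_size=1.0, seed=None):
    """Return row indices that pick the same number of samples from every class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)

    # Same sizing rule as balanced_subsample: a fraction of the smallest class.
    n_per_class = int(counts.min())
    if subsample_size < 1:
        n_per_class = int(n_per_class * subsample_size)

    # Draw n_per_class distinct row positions from each class.
    picked = [
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=False)
        for c in classes
    ]
    return np.concatenate(picked)

# Toy data: 6 samples of class 0 and 3 samples of class 1.
x = np.arange(18).reshape(9, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

idx = balanced_subsample_indices(y, seed=0)
x_bal, y_bal = x[idx], y[idx]
```

Returning indices means the same selection can be applied to x, y and any other aligned array, and the labels keep their original dtype.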
For anyone trying to make balanced_subsample work with a Pandas DataFrame, you need to make a couple of changes:
- Replace the np.random.shuffle line with
      this_xs = this_xs.reindex(np.random.permutation(this_xs.index))
- Replace the np.concatenate lines with
      xs = pd.concat(xs)
      ys = pd.Series(data=np.concatenate(ys), name='target')
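Putting the two changes together, a DataFrame version might look roughly like the sketch below. The name balanced_subsample_df is hypothetical, and the sketch assumes x is a DataFrame, y is a Series sharing x's index, and that index has no duplicate labels (reindex fails on duplicates). It uses .iloc for the row slice to make the positional intent explicit.

```python
import numpy as np
import pandas as pd

def balanced_subsample_df(x, y, subsample_size=1.0):
    # Assumed inputs: x is a DataFrame, y a Series aligned with x's unique index.
    class_xs = [(yi, x[y == yi]) for yi in np.unique(y)]
    min_elems = min(len(elems) for _, elems in class_xs)

    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs, ys = [], []
    for ci, this_xs in class_xs:
        if len(this_xs) > use_elems:
            # Change 1: shuffle the rows by reindexing on a permuted index.
            this_xs = this_xs.reindex(np.random.permutation(this_xs.index))
        xs.append(this_xs.iloc[:use_elems])
        ys.append(np.full(use_elems, ci))

    # Change 2: pd.concat for the frames, a named Series for the labels.
    xs = pd.concat(xs)
    ys = pd.Series(data=np.concatenate(ys), name='target')
    return xs, ys

# Toy data: 6 rows of class 0 and 3 rows of class 1.
x = pd.DataFrame({'a': range(9), 'b': range(9, 18)})
y = pd.Series([0] * 6 + [1] * 3)
xs, ys = balanced_subsample_df(x, y)
```

Note that xs keeps its original row labels while ys gets a fresh 0..n-1 index, so the two line up by position, not by index.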



