每日数据，每3天重新采样一次，然后连续5天进行有效计算

这里列出了使用基于bin的求和的两个基于NumPy的两个 ~~三三个~~ 解决方案，基本上涵盖了三种情况。

方案1：每个日期有多个条目，但没有缺失的日期

方法1：

# For now hard-pred to use Window size of 5 and stride length of 3def vectorized_app1(df):    # Extract the index names and values    vals = df.A.values    indx = df.index.values    # Extract IDs for bin based summing    mask = np.append(False,indx[1:] > indx[:-1])    date_id = mask.cumsum()    search_id = np.hstack((0,np.arange(2,date_id[-1],3),date_id[-1]+1))    shifts = np.searchsorted(date_id,search_id)    reps = shifts[1:] - shifts[:-1]    id_arr = np.repeat(np.arange(len(reps)),reps)    # Perform bin based summing and subtract the repeated ones    IDsums = np.bincount(id_arr,vals)    allsums = IDsums[:-1] + IDsums[1:]    allsums[1:] -= np.bincount(date_id,vals)[search_id[1:-2]]    # Convert to pandas dataframe if needed    out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group    return pd.Dataframe(allsums,index=out_index,columns=['A'])

方法2：

# For now hard-pred to use Window size of 5 and stride length of 3def vectorized_app2(df):    # Extract the index names and values    indx = df.index.values    # Extract IDs for bin based summing    mask = np.append(False,indx[1:] > indx[:-1])    date_id = mask.cumsum()    # Generate IDs at which shifts are to happen for a (2,3,5,8..) patttern        # Pad with 0 and length of array at either ends as we use diff later on    shiftIDs = (np.arange(2,date_id[-1],3)[:,None] + np.arange(2)).ravel()    search_id = np.hstack((0,shiftIDs,date_id[-1]+1))    # Find the start of those shifting indices        # Generate ID based on shifts and do bin based summing of dataframe    shifts = np.searchsorted(date_id,search_id)    reps = shifts[1:] - shifts[:-1]    id_arr = np.repeat(np.arange(len(reps)),reps)        IDsums = np.bincount(id_arr,df.A.values)    # Sum each group of 3 elems with a stride of 2, make dataframe if needed    allsums = IDsums[:-1:2] + IDsums[1::2] + IDsums[2::2]    # Convert to pandas dataframe if needed    out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group    return pd.Dataframe(allsums,index=out_index,columns=['A'])

方法3：

def vectorized_app3(df, S=3, W=5):    dt = df.index.values    shifts = np.append(False,dt[1:] > dt[:-1])    c = np.bincount(shifts.cumsum(),df.A.values)    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]    out_index = dt[np.nonzero(shifts)[0][W-2::S]]    return pd.Dataframe(out,index=out_index,columns=['A'])

我们可以将卷积部分替换为直接切片求和，以获取其修改版本-

def vectorized_app3_v2(df, S=3, W=5):      dt = df.index.values    shifts = np.append(False,dt[1:] > dt[:-1])    c = np.bincount(shifts.cumsum(),df.A.values)    f = c.size+S-W    out = c[:f:S].copy()    for i in range(1,W):        out += c[i:f+i:S]    out_index = dt[np.nonzero(shifts)[0][W-2::S]]    return pd.Dataframe(out,index=out_index,columns=['A'])

方案2：每个日期和缺少的日期有多个条目

方法4：

def vectorized_app4(df, S=3, W=5):    dt = df.index.values    indx = np.append(0,((dt[1:] - dt[:-1])//86400000000000).astype(int)).cumsum()    WL = ((indx[-1]+1)//S)    c = np.bincount(indx,df.A.values,minlength=S*WL+(W-S))    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]    grp0_lastdate = dt[0] + np.timedelta64(W-1,'D')    freq_str = str(S)+'D'    grp_last_dt = pd.date_range(grp0_lastdate, periods=WL, freq=freq_str).values    out_index = dt[dt.searchsorted(grp_last_dt,'right')-1]    return pd.Dataframe(out,index=out_index,columns=['A'])

方案3：连续的日期，每个日期只有一个条目

方法5：

def vectorized_app5(df, S=3, W=5):    vals = df.A.values    N = (df.shape[0]-W+2*S-1)//S    n = vals.strides[0]    out = np.lib.stride_tricks.as_strided(vals,shape=(N,W),       strides=(S*n,n)).sum(1)    index_idx = (W-1)+S*np.arange(N)    out_index = df.index[index_idx]    return pd.Dataframe(out,index=out_index,columns=['A'])

创建测试数据的建议

场景1：

# Setup input for multiple dates, but no missing datesS = 4 # Stride length (Could be edited)W = 7 # Window length (Could be edited)datasize = 3  # Decides datasizetidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')start_df = pd.Dataframe(dict(A=np.arange(len(tidx))), tidx)reps = np.random.randint(1,4,(len(start_df)))idx0 = np.repeat(start_df.index,reps)df_data = np.random.randint(0,9,(len(idx0)))df = pd.Dataframe(df_data,index=idx0,columns=['A'])

场景2：

要为多个日期和缺少日期的日期创建设置，我们可以只编辑

df_data

创建步骤，如下所示-

df_data = np.random.randint(0,9,(len(idx0)))

方案3：

# Setup input for exactly one entry per dateS = 4 # Could be editedW = 7datasize = 3  # Decides datasizetidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')df = pd.Dataframe(dict(A=np.arange(len(tidx))), tidx)

每日数据，每3天重新采样一次，然后连续5天进行有效计算

面试问答相关栏目本月热门文章