pd.Grouper allows you to specify a "groupby instruction for a target object". In particular, you can use it to group by dates even if df.index is not a DatetimeIndex:

    df.groupby(pd.Grouper(freq='2D', level=-1))
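Here, level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex. Moreover, you can use this in conjunction with other level values from the index:

    level_values = df.index.get_level_values
    result = (df.groupby([level_values(i) for i in [0,1]]
                         +[pd.Grouper(freq='2D', level=-1)]).sum())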
It looks a bit awkward, but using_Grouper turns out to be much faster than my original suggestion, using_reset_index:
    import numpy as np
    import pandas as pd
    import datetime as DT

    def using_Grouper(df):
        level_values = df.index.get_level_values
        return (df.groupby([level_values(i) for i in [0,1]]
                           +[pd.Grouper(freq='2D', level=-1)]).sum())

    def using_reset_index(df):
        df = df.reset_index(level=[0, 1])
        return df.groupby(['State','City']).resample('2D').sum()

    def using_stack(df):
        # http://stackoverflow.com/a/15813787/190597
        return (df.unstack(level=[0,1])
                  .resample('2D').sum()
                  .stack(level=[2,1])
                  .swaplevel(2,0))

    def make_orig():
        values_a = range(16)
        values_b = range(10, 26)
        states = ['Georgia']*8 + ['Alabama']*8
        cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
        dates = pd.DatetimeIndex([DT.date(2012,1,1)+DT.timedelta(days = i)
                                  for i in range(4)]*4)
        df = pd.DataFrame(
            {'value_a': values_a, 'value_b': values_b},
            index = [states, cities, dates])
        df.index.names = ['State', 'City', 'Date']
        return df

    def make_df(N):
        dates = pd.date_range('2000-1-1', periods=N)
        states = np.arange(50)
        cities = np.arange(10)
        index = pd.MultiIndex.from_product([states, cities, dates],
                                           names=['State', 'City', 'Date'])
        df = pd.DataFrame(np.random.randint(10, size=(len(index),2)),
                          index=index, columns=['value_a', 'value_b'])
        return df

    df = make_orig()
    print(using_Grouper(df))

yields
                                   value_a  value_b
    State   City       Date
    Alabama Mobile     2012-01-01       17       37
                       2012-01-03       21       41
            Montgomery 2012-01-01       25       45
                       2012-01-03       29       49
    Georgia Atlanta    2012-01-01        1       21
                       2012-01-03        5       25
            Savanna    2012-01-01        9       29
                       2012-01-03       13       33
Here is a benchmark comparing using_Grouper, using_reset_index, and using_stack on a 5000-row DataFrame:
    In [30]: df = make_df(10)

    In [34]: len(df)
    Out[34]: 5000

    In [32]: %timeit using_Grouper(df)
    100 loops, best of 3: 6.03 ms per loop

    In [33]: %timeit using_stack(df)
    10 loops, best of 3: 22.3 ms per loop

    In [31]: %timeit using_reset_index(df)
    1 loop, best of 3: 659 ms per loop
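If the positional levels feel awkward, the index levels can likely be referred to by name instead. This is only a minimal sketch. It assumes a pandas version that accepts index level names as groupby keys, and the small frame is made up for illustration (it is not the data from the question). The timings above do not cover this variant.

```python
import pandas as pd

# Small frame with a (State, City, Date) MultiIndex; the values are
# invented purely for illustration.
dates = pd.date_range('2012-01-01', periods=4)
index = pd.MultiIndex.from_product(
    [['Georgia'], ['Atlanta', 'Savanna'], dates],
    names=['State', 'City', 'Date'])
df = pd.DataFrame({'value_a': range(8)}, index=index)

# Refer to the index levels by name rather than by position.
by_name = df.groupby(
    ['State', 'City', pd.Grouper(freq='2D', level='Date')]).sum()

# The positional form used earlier in this answer, for comparison.
level_values = df.index.get_level_values
by_position = df.groupby(
    [level_values(i) for i in [0, 1]]
    + [pd.Grouper(freq='2D', level=-1)]).sum()

print(by_name)
```

Both forms should produce the same 2-day sums per (State, City) pair. The named form just avoids hard-coding the level positions.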



