Consider the following mini-version of your problem:

    from io import StringIO
    from pandas import read_csv, to_datetime

    # how close do sessions have to be to be considered equal? (in minutes)
    threshold = 5

    # datetime column (combination of date + start_time)
    dtc = [['date', 'start_time']]

    # index column (above combination)
    ixc = 'date_start_time'

    df1 = read_csv(StringIO(u'''date,start_time,employee_id,session_id
    01/01/2016,02:03:00,7261824,871631182
    01/01/2016,06:03:00,7261824,871631183
    01/01/2016,11:01:00,7261824,871631184
    01/01/2016,14:01:00,7261824,871631185
    '''), parse_dates=dtc)

    df2 = read_csv(StringIO(u'''date,start_time,employee_id,session_id
    01/01/2016,02:03:00,7261824,871631182
    01/01/2016,06:05:00,7261824,871631183
    01/01/2016,11:04:00,7261824,871631184
    01/01/2016,14:10:00,7261824,871631185
    '''), parse_dates=dtc)

This gives:

    >>> df1
          date_start_time  employee_id  session_id
    0 2016-01-01 02:03:00      7261824   871631182
    1 2016-01-01 06:03:00      7261824   871631183
    2 2016-01-01 11:01:00      7261824   871631184
    3 2016-01-01 14:01:00      7261824   871631185
    >>> df2
          date_start_time  employee_id  session_id
    0 2016-01-01 02:03:00      7261824   871631182
    1 2016-01-01 06:05:00      7261824   871631183
    2 2016-01-01 11:04:00      7261824   871631184
    3 2016-01-01 14:10:00      7261824   871631185

When merging, you would like to treat df2[0:3] as duplicates of df1[0:3] (since each pair is less than 5 minutes apart), but treat df1[3] and df2[3] as separate sessions.
Solution 1: Interval Matching
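This is essentially what you are suggesting in your edit. You want to map the timestamps in both tables to fixed intervals, each centered on a multiple of 5 minutes (so each interval is 5 minutes wide).
Each interval is uniquely represented by its midpoint, so you can merge the data frames on the timestamp rounded to the nearest 5 minutes. For example: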

    import numpy as np

    # threshold in nanoseconds
    threshold_ns = threshold * 60 * 1e9

    # compute "interval" to which each session belongs
    df1['interval'] = to_datetime(np.round(df1.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
    df2['interval'] = to_datetime(np.round(df2.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)

    # join
    cols = ['interval', 'employee_id', 'session_id']
    print(df1.merge(df2, on=cols, how='outer')[cols])

Which prints:

                 interval  employee_id  session_id
    0 2016-01-01 02:05:00      7261824   871631182
    1 2016-01-01 06:05:00      7261824   871631183
    2 2016-01-01 11:00:00      7261824   871631184
    3 2016-01-01 14:00:00      7261824   871631185
    4 2016-01-01 11:05:00      7261824   871631184
    5 2016-01-01 14:10:00      7261824   871631185

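Notice that this isn't totally correct. The sessions df1[2] and df2[2] are not treated as duplicates, although they are only 3 minutes apart. This is because they fall on different sides of an interval boundary: 11:01 rounds down to 11:00, while 11:04 rounds up to 11:05.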
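The boundary effect can be reproduced in isolation. The following is a minimal sketch, assuming a pandas version that provides `Series.dt.round`; it is a shorter way to express the rounding than the nanosecond arithmetic, not the code used in this answer. The variable names `ts` and `rounded` are made up for illustration.

```python
import pandas as pd

# The two sessions from row 2 of each table, 3 minutes apart.
ts = pd.Series(pd.to_datetime(["2016-01-01 11:01:00", "2016-01-01 11:04:00"]))

# Round each timestamp to the nearest multiple of 5 minutes.
rounded = ts.dt.round("5min")

# The two timestamps end up in different intervals,
# so a merge on the rounded value would not pair them.
print(rounded.tolist())
print(rounded[0] == rounded[1])
```

Any fixed grid has this problem for some pair of timestamps, no matter where the boundaries are placed.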
Solution 2: One-to-one Matching
Here is another approach, which depends on the condition that each session in df1 has either zero or one duplicates in df2.
We replace each timestamp in df1 with the closest timestamp in df2 that matches on employee_id and session_id and is no more than 5 minutes away.

    from datetime import timedelta

    # get closest match from "df2" to row from "df1" (as long as it's within the threshold)
    def closest(row):
        matches = df2.loc[(df2.employee_id == row.employee_id) &
                          (df2.session_id == row.session_id)]

        # use the absolute difference so that earlier sessions in "df2"
        # are compared by distance rather than by sign
        deltas = (matches.date_start_time - row.date_start_time).abs()
        deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]

        try:
            return matches.loc[deltas.idxmin()]
        except ValueError:  # no items
            return row

    # replace timestamps in "df1" with closest timestamps in "df2"
    df1 = df1.apply(closest, axis=1)

    # join
    cols = ['date_start_time', 'employee_id', 'session_id']
    print(df1.merge(df2, on=cols, how='outer')[cols])

Which prints:

          date_start_time  employee_id  session_id
    0 2016-01-01 02:03:00      7261824   871631182
    1 2016-01-01 06:05:00      7261824   871631183
    2 2016-01-01 11:04:00      7261824   871631184
    3 2016-01-01 14:01:00      7261824   871631185
    4 2016-01-01 14:10:00      7261824   871631185

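This approach is significantly slower, since you have to search through all of df2 for every row in df1. What I've written can probably be optimized further, but it will still take a long time on large datasets.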
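If your pandas version provides `merge_asof` with the `direction` parameter, the same nearest-within-threshold matching can be expressed without a row-by-row search. This is a different technique from the `apply` loop in this answer, sketched under the same assumption of zero or one duplicates per session. The helper `make` and the column `matched_time` are made up for illustration, and the toy tables are rebuilt here so the block runs on its own.

```python
import pandas as pd

def make(times):
    # Hypothetical helper: rebuilds the toy tables from the example.
    return pd.DataFrame({
        "date_start_time": pd.to_datetime(["2016-01-01 " + t for t in times]),
        "employee_id": 7261824,
        "session_id": [871631182, 871631183, 871631184, 871631185],
    })

df1 = make(["02:03:00", "06:03:00", "11:01:00", "14:01:00"])
df2 = make(["02:03:00", "06:05:00", "11:04:00", "14:10:00"])

keys = ["employee_id", "session_id"]

# Keep a copy of the df2 timestamp, since the join key itself
# comes from the left table in the result.
right = df2.assign(matched_time=df2["date_start_time"])

# For each df1 row, find the nearest df2 row with the same keys,
# no more than 5 minutes away. Both inputs must be sorted by time.
matched = pd.merge_asof(
    df1.sort_values("date_start_time"),
    right.sort_values("date_start_time"),
    on="date_start_time",
    by=keys,
    tolerance=pd.Timedelta(minutes=5),
    direction="nearest",
)

# Adopt the df2 timestamp where a match was found, else keep df1's own.
matched["date_start_time"] = matched["matched_time"].fillna(matched["date_start_time"])

cols = ["date_start_time"] + keys
result = matched[cols].merge(df2, on=cols, how="outer")
print(result)
```

On the toy data this should leave five sessions, with only the 14:01 and 14:10 rows kept separate. How much faster it is on a large dataset would need to be measured.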



