I think the best way to explain what I want to achieve is through an example. Suppose I have this dataframe:
                 time
0 2013-01-01 12:56:00
1 2013-01-01 12:00:12
2 2013-01-01 10:34:28
3 2013-01-01 09:34:54
4 2013-01-01 08:34:55
5 2013-01-01 16:35:19
6 2013-01-01 16:35:30
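Given an interval T, I want to count, for each row, the number of registers that were "open" in that interval. For example, with T = 2 hours, the output would be: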
                 time  count
0 2013-01-01 12:56:00      1  # 12:56-2 = 10:56 -> 1 register between [10:56, 12:56)
1 2013-01-01 12:00:12      1
2 2013-01-01 10:34:28      2  # 10:34:28-2 = 8:34:28 -> 2 registers between [8:34:28, 10:34:28)
3 2013-01-01 09:34:54      1
4 2013-01-01 08:34:55      0
5 2013-01-01 16:35:19      0
6 2013-01-01 16:35:30      1
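I would like to know how to get this result using pandas. If I consider only the dt.hour accessor, then for T equal to 1, for example, I can create a count per hour, shift it by 1, and sum the result: count[i] + count[i-1]. But I don't know whether this can be generalized to the desired output.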
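For reference, the hourly idea could be sketched roughly as follows for T = 1. The names per_hour and combined are made up for this illustration, and the sample timestamps are the ones shown above. Note that this buckets rows by clock hour rather than by a window ending at each row's own time, which is why it does not carry over directly to the desired output.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"time": pd.to_datetime([
    "2013-01-01 12:56:00", "2013-01-01 12:00:12", "2013-01-01 10:34:28",
    "2013-01-01 09:34:54", "2013-01-01 08:34:55", "2013-01-01 16:35:19",
    "2013-01-01 16:35:30",
])})

# Number of rows in each clock hour 0..23 (zero where there are none).
per_hour = df["time"].dt.hour.value_counts().reindex(range(24), fill_value=0)

# count[i] + count[i-1]: rows in hour i plus rows in the previous hour.
combined = per_hour + per_hour.shift(1, fill_value=0)
```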
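The idea here is to mark all register open times as +1 and all register close times as -1, where each row is treated as a register that opens at time - T and closes at time. Then sort by time and take a cumulative sum over the +/- 1 values to get the number open at a given time.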
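import pandas as pd

# Assumes df['time'] is already datetime64 (e.g. via pd.to_datetime)
# and that df has the default 0..n-1 index.
# Initialize interval start times (time - 2h) as +1, end times (time) as -1.
start_times = df.assign(time=df['time'] - pd.Timedelta(hours=2), count=1)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
all_times = pd.concat([start_times, df.assign(count=-1)], ignore_index=True)

# Sort by time and perform a cumulative sum to get the count of overlaps at a given time
# (subtract 1 since you don't want to include the current value in the overlap).
all_times = all_times.sort_values(by='time')
all_times['count'] = all_times['count'].cumsum() - 1

# Reassign to the original dataframe. Assignment aligns on the index, so only
# labels 0..n-1 are kept: the start-time rows, one per original row.
df['count'] = all_times['count']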
Output:
                 time  count
0 2013-01-01 12:56:00      1
1 2013-01-01 12:00:12      1
2 2013-01-01 10:34:28      2
3 2013-01-01 09:34:54      1
4 2013-01-01 08:34:55      0
5 2013-01-01 16:35:19      0
6 2013-01-01 16:35:30      1
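To try this end to end, here is a self-contained sketch of the same +1/-1 sweep, wrapped in a function and compared against a direct count over [time - T, time). The function names sweep_counts and brute_force_counts are made up for this example. It assumes the timestamps are distinct and that none falls exactly on a window boundary; ties may be counted differently by the two approaches.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"time": pd.to_datetime([
    "2013-01-01 12:56:00", "2013-01-01 12:00:12", "2013-01-01 10:34:28",
    "2013-01-01 09:34:54", "2013-01-01 08:34:55", "2013-01-01 16:35:19",
    "2013-01-01 16:35:30",
])})


def sweep_counts(frame, window):
    """Count, per row, the other rows inside the window ending at that row."""
    starts = frame.assign(time=frame["time"] - window, count=1)   # opens: +1
    ends = frame.assign(count=-1)                                 # closes: -1
    events = pd.concat([starts, ends], ignore_index=True)
    events = events.sort_values(by="time", kind="mergesort")      # stable sort
    running = events["count"].cumsum() - 1                        # exclude the row itself
    # Labels 0..n-1 are the start markers, in the original row order.
    return running.sort_index().iloc[:len(frame)].to_numpy()


def brute_force_counts(frame, window):
    """Direct O(n^2) count of rows with time in [t - window, t)."""
    t = frame["time"]
    return [int(((t >= ti - window) & (t < ti)).sum()) for ti in t]


T = pd.Timedelta(hours=2)
df["count"] = sweep_counts(df, T)
print(df["count"].tolist())  # [1, 1, 2, 1, 0, 0, 1]
```

Returning a NumPy array makes the assignment positional, so the sketch does not depend on df having the default 0..n-1 index. The brute-force version is only there as a check on small inputs.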