我有两组 Dataframe,都具有唯一标识符和日期时间数据,格式如下
“2020-01-01 00:00:01”-日期时间和“12345”-唯一标识符和类型
第一个问题,DF1:
DatetimeX ID Type 2020-01-01 02:00:01 12345 C 2020-01-01 02:00:03 12345 C 2020-01-01 05:00:03 12345 C 2020-01-01 05:03:05 12345 C 2020-01-01 03:00:09 13333 D 2020-01-01 02:00:09 12345 C 2020-01-01 02:01:35 12345 C 2020-01-01 02:10:35 12345 C 2020-01-01 02:00:01 13333 D 2020-01-01 02:05:35 13333 D 2020-01-01 02:00:50 13333 E 2020-01-01 02:00:01 12211 C 2020-01-01 02:09:50 13333 E 2020-01-01 02:11:50 13333 E
我想根据具有相同“类型”的 ID 的第一个时间戳,并在 10 分钟后删除行,如下所示:
DatetimeX ID Type 2020-01-01 02:00:01 12345 C 2020-01-01 05:00:03 12345 C 2020-01-01 02:10:35 12345 C 2020-01-01 03:00:09 13333 D 2020-01-01 02:00:01 13333 D 2020-01-01 02:00:50 13333 E 2020-01-01 02:00:01 12211 C 2020-01-01 02:11:50 13333 E
我尝试探索时间范围/日期范围,但找不到任何类似的编码概念。希望有人能指出我可以通过哪些方式进行探索,而不是试图获得完整的解决方案。几年没接触过 Python,以前也不熟悉它。谢谢
更新了附加数据行以获得更准确的示例
添加示例输入数据并简化流程:
Timestamp = pd.to_datetime data = [{'DatetimeX': Timestamp('2020-01-01 02:00:01'), 'ID': 12345, 'Type': 'C'}, {'DatetimeX': Timestamp('2020-01-01 02:00:03'), 'ID': 12345, 'Type': 'C'}, {'DatetimeX': Timestamp('2020-01-01 05:00:03'), 'ID': 12345, 'Type': 'C'}, {'DatetimeX': Timestamp('2020-01-01 05:03:05'), 'ID': 12345, 'Type': 'C'}, {'DatetimeX': Timestamp('2020-01-01 03:00:09'), 'ID': 13333, 'Type': 'D'}, {'DatetimeX': Timestamp('2020-01-01 02:00:09'), 'ID': 12345, 'Type': 'C'}, {'DatetimeX': Timestamp('2020-01-01 02:01:35'), 'ID': 12345, 'Type': 'C'}, {'DatetimeX': Timestamp('2020-01-01 02:10:35'), 'ID': 12345, 'Type': 'C'}, {'DatetimeX': Timestamp('2020-01-01 02:00:01'), 'ID': 13333, 'Type': 'D'}, {'DatetimeX': Timestamp('2020-01-01 02:05:35'), 'ID': 13333, 'Type': 'D'}, {'DatetimeX': Timestamp('2020-01-01 02:00:50'), 'ID': 13333, 'Type': 'E'}, {'DatetimeX': Timestamp('2020-01-01 02:00:01'), 'ID': 12211, 'Type': 'C'}, {'DatetimeX': Timestamp('2020-01-01 02:09:50'), 'ID': 13333, 'Type': 'E'}, {'DatetimeX': Timestamp('2020-01-01 02:11:50'), 'ID': 13333, 'Type': 'E'}] df1 = pd.DataFrame(data) col_raw = df1.columns while True: df1.sort_values(['ID', 'Type', 'DatetimeX'], inplace=True) df1['diff1_lt10min'] = df1.groupby(['ID', 'Type'])['DatetimeX'].diff().dt.seconds < 10 * 60 df1['tag_group'] = (~df1['diff1_lt10min']).cumsum() if df1.duplicated('tag_group').sum()==0: break df1 = df1.merge((df1.groupby('tag_group')['DatetimeX'].first() .reset_index() .rename(columns={'DatetimeX':'DatetimeX_1st'})), on='tag_group') df1['diff2_lt10min'] = (df1.DatetimeX - df1.DatetimeX_1st).dt.seconds < 10 * 60 cond = df1['diff1_lt10min'] & df1['diff2_lt10min'] df1 = df1.loc[~cond, col_raw] df1 = df1[col_raw]
细节…
# repeat col_raw = df1.columns df4 = df1.copy() n_round = 1 while True: print('#'*20, f'round {n_round}', '#'*20) # step 1 sort the values & group by ['Type', 'ID'] calculate the DatetimeX's time diff # notice: the time-diff is not the actual wanted df = df4[col_raw].copy() df.sort_values(['ID', 'Type', 'DatetimeX'], inplace=True) df['diff'] = df.groupby(['Type', 'ID'])['DatetimeX'].diff() print('#'*10, 'step1', '#'*10) print(df) # step 2, create a tag column to store the first 10min gap from 'diff' column cond = False cond |= df['diff'].dt.seconds > 10 * 60 cond |= df['diff'].isnull() df['tag'] = np.where(cond, 1, 0) df['tag'] = df['tag'].cumsum().fillna(method = 'ffill') print('#'*10, 'step2', '#'*10) print(df) # step 3, use 'tag' to judge to stop the while loop or not # tag should be unique break_sign = df.tag.duplicated().sum() if break_sign == 0: break print('#'*10, 'step3', '#'*10) print(break_sign) # step 4: # create a 'DatetimeX_1st' with the 'tag' group's first DatetimeX # create a 'diff2' = 'DatetimeX' - 'DatetimeX_1st' df2 = df.reset_index().set_index('tag') df2['DatetimeX_1st'] = df.groupby('tag').first()['DatetimeX'] df2['diff2'] = df2['DatetimeX'] - df2['DatetimeX_1st'] print('#'*10, 'step4', '#'*10) print(df2) # step 5: # drop the True < 10min gaps records # 'diff' and 'diff2' should all < 10min cond = (df2['diff2'].dt.seconds < 10 * 60) & (df2['diff'].dt.seconds < 10 * 60) df3 = df2[~cond].copy() print('#'*10, 'step5', '#'*10) print(df3) # step 6: # reset index cols = 'tag DatetimeX ID Type'.split() df4 = df3.reset_index().set_index('index').sort_index()[cols] print('#'*10, 'step6', '#'*10) print(df4) n_round += 1 print() # get result result = df[['DatetimeX', 'ID', 'Type']].copy() result.index.name = None print() print('#'*10, 'result', '#'*10) print(result)
输出:
#################### round 1 #################### ########## step1 ########## DatetimeX ID Type diff 11 2020-01-01 02:00:01 12211 C NaT 0 2020-01-01 02:00:01 12345 C NaT 1 2020-01-01 02:00:03 12345 C 0 days 00:00:02 5 2020-01-01 02:00:09 12345 C 0 days 00:00:06 6 2020-01-01 02:01:35 12345 C 0 days 00:01:26 7 2020-01-01 02:10:35 12345 C 0 days 00:09:00 2 2020-01-01 05:00:03 12345 C 0 days 02:49:28 3 2020-01-01 05:03:05 12345 C 0 days 00:03:02 8 2020-01-01 02:00:01 13333 D NaT 9 2020-01-01 02:05:35 13333 D 0 days 00:05:34 4 2020-01-01 03:00:09 13333 D 0 days 00:54:34 10 2020-01-01 02:00:50 13333 E NaT 12 2020-01-01 02:09:50 13333 E 0 days 00:09:00 13 2020-01-01 02:11:50 13333 E 0 days 00:02:00 ########## step2 ########## DatetimeX ID Type diff tag 11 2020-01-01 02:00:01 12211 C NaT 1 0 2020-01-01 02:00:01 12345 C NaT 2 1 2020-01-01 02:00:03 12345 C 0 days 00:00:02 2 5 2020-01-01 02:00:09 12345 C 0 days 00:00:06 2 6 2020-01-01 02:01:35 12345 C 0 days 00:01:26 2 7 2020-01-01 02:10:35 12345 C 0 days 00:09:00 2 2 2020-01-01 05:00:03 12345 C 0 days 02:49:28 3 3 2020-01-01 05:03:05 12345 C 0 days 00:03:02 3 8 2020-01-01 02:00:01 13333 D NaT 4 9 2020-01-01 02:05:35 13333 D 0 days 00:05:34 4 4 2020-01-01 03:00:09 13333 D 0 days 00:54:34 5 10 2020-01-01 02:00:50 13333 E NaT 6 12 2020-01-01 02:09:50 13333 E 0 days 00:09:00 6 13 2020-01-01 02:11:50 13333 E 0 days 00:02:00 6 ########## step3 ########## 8 ########## step4 ########## index DatetimeX ID Type diff \ tag 1 11 2020-01-01 02:00:01 12211 C NaT 2 0 2020-01-01 02:00:01 12345 C NaT 2 1 2020-01-01 02:00:03 12345 C 0 days 00:00:02 2 5 2020-01-01 02:00:09 12345 C 0 days 00:00:06 2 6 2020-01-01 02:01:35 12345 C 0 days 00:01:26 2 7 2020-01-01 02:10:35 12345 C 0 days 00:09:00 3 2 2020-01-01 05:00:03 12345 C 0 days 02:49:28 3 3 2020-01-01 05:03:05 12345 C 0 days 00:03:02 4 8 2020-01-01 02:00:01 13333 D NaT 4 9 2020-01-01 02:05:35 13333 D 0 days 00:05:34 5 4 2020-01-01 03:00:09 13333 D 0 days 00:54:34 6 10 2020-01-01 02:00:50 13333 E NaT 6 12 2020-01-01 02:09:50 13333 E 0 days 00:09:00 6 13 2020-01-01 02:11:50 13333 E 0 days 00:02:00 DatetimeX_1st diff2 tag 1 2020-01-01 02:00:01 0 days 00:00:00 2 2020-01-01 02:00:01 0 days 00:00:00 2 2020-01-01 02:00:01 0 days 00:00:02 2 2020-01-01 02:00:01 0 days 00:00:08 2 2020-01-01 02:00:01 0 days 00:01:34 2 2020-01-01 02:00:01 0 days 00:10:34 3 2020-01-01 05:00:03 0 days 00:00:00 3 2020-01-01 05:00:03 0 days 00:03:02 4 2020-01-01 02:00:01 0 days 00:00:00 4 2020-01-01 02:00:01 0 days 00:05:34 5 2020-01-01 03:00:09 0 days 00:00:00 6 2020-01-01 02:00:50 0 days 00:00:00 6 2020-01-01 02:00:50 0 days 00:09:00 6 2020-01-01 02:00:50 0 days 00:11:00 ########## step5 ########## index DatetimeX ID Type diff \ tag 1 11 2020-01-01 02:00:01 12211 C NaT 2 0 2020-01-01 02:00:01 12345 C NaT 2 7 2020-01-01 02:10:35 12345 C 0 days 00:09:00 3 2 2020-01-01 05:00:03 12345 C 0 days 02:49:28 4 8 2020-01-01 02:00:01 13333 D NaT 5 4 2020-01-01 03:00:09 13333 D 0 days 00:54:34 6 10 2020-01-01 02:00:50 13333 E NaT 6 13 2020-01-01 02:11:50 13333 E 0 days 00:02:00 DatetimeX_1st diff2 tag 1 2020-01-01 02:00:01 0 days 00:00:00 2 2020-01-01 02:00:01 0 days 00:00:00 2 2020-01-01 02:00:01 0 days 00:10:34 3 2020-01-01 05:00:03 0 days 00:00:00 4 2020-01-01 02:00:01 0 days 00:00:00 5 2020-01-01 03:00:09 0 days 00:00:00 6 2020-01-01 02:00:50 0 days 00:00:00 6 2020-01-01 02:00:50 0 days 00:11:00 ########## step6 ########## tag DatetimeX ID Type index 0 2 2020-01-01 02:00:01 12345 C 2 3 2020-01-01 05:00:03 12345 C 4 5 2020-01-01 03:00:09 13333 D 7 2 2020-01-01 02:10:35 12345 C 8 4 2020-01-01 02:00:01 13333 D 10 6 2020-01-01 02:00:50 13333 E 11 1 2020-01-01 02:00:01 12211 C 13 6 2020-01-01 02:11:50 13333 E #################### round 2 #################### ########## step1 ########## DatetimeX ID Type diff index 11 2020-01-01 02:00:01 12211 C NaT 0 2020-01-01 02:00:01 12345 C NaT 7 2020-01-01 02:10:35 12345 C 0 days 00:10:34 2 2020-01-01 05:00:03 12345 C 0 days 02:49:28 8 2020-01-01 02:00:01 13333 D NaT 4 2020-01-01 03:00:09 13333 D 0 days 01:00:08 10 2020-01-01 02:00:50 13333 E NaT 13 2020-01-01 02:11:50 13333 E 0 days 00:11:00 ########## step2 ########## DatetimeX ID Type diff tag index 11 2020-01-01 02:00:01 12211 C NaT 1 0 2020-01-01 02:00:01 12345 C NaT 2 7 2020-01-01 02:10:35 12345 C 0 days 00:10:34 3 2 2020-01-01 05:00:03 12345 C 0 days 02:49:28 4 8 2020-01-01 02:00:01 13333 D NaT 5 4 2020-01-01 03:00:09 13333 D 0 days 01:00:08 6 10 2020-01-01 02:00:50 13333 E NaT 7 13 2020-01-01 02:11:50 13333 E 0 days 00:11:00 8 ########## result ########## DatetimeX ID Type 11 2020-01-01 02:00:01 12211 C 0 2020-01-01 02:00:01 12345 C 7 2020-01-01 02:10:35 12345 C 2 2020-01-01 05:00:03 12345 C 8 2020-01-01 02:00:01 13333 D 4 2020-01-01 03:00:09 13333 D 10 2020-01-01 02:00:50 13333 E 13 2020-01-01 02:11:50 13333 E