I would like to combine rows that have the same Id, consecutive dates, and the same feature values.
I have the following dataframe:

      Id      Start        End  Feature1  Feature2
    0  A 2020-01-01 2020-01-15         1         1
    1  A 2020-01-16 2020-01-30         1         1
    2  A 2020-01-31 2020-02-15         0         1
    3  A 2020-07-01 2020-07-15         0         1
    4  B 2020-01-31 2020-02-15         0         0
    5  B 2020-02-16        NaT         0         0

The expected result is:

      Id      Start        End  Feature1  Feature2
    0  A 2020-01-01 2020-01-30         1         1
    1  A 2020-01-31 2020-02-15         0         1
    2  A 2020-07-01 2020-07-15         0         1
    3  B 2020-01-31        NaT         0         0

I have been trying answers from other posts, but they don't fit my use case.
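You can do it this way: take the difference in days between each row's `Start` and the previous row's `End` within the same group (same `Id`, `Feature1` and `Feature2`) using `GroupBy.shift()`. Then build a group number `group_no` that starts a new group whenever there is no previous row or the difference is more than 1 day. Finally, group by `Id` and `group_no` with `.groupby()` and aggregate with `.agg()`.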
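Because there is `NaT` data within the groups, we need to specify `dropna=False` when grouping. Also, to get the last `End` entry within each group, we use `x.iloc[-1]` instead of `last`, since `last` returns the last non-null value and would skip the `NaT`.

    import pandas as pd

    # convert to datetime format if not already in datetime
    df['Start'] = pd.to_datetime(df['Start'])
    df['End'] = pd.to_datetime(df['End'])

    # sort by columns `Id` and `Start` if not already in this sequence
    df = df.sort_values(['Id', 'Start'])

    # days between this row's Start and the previous End with the same Id and features
    day_diff = (df['Start'] - df['End'].groupby([df['Id'], df['Feature1'], df['Feature2']]).shift()).dt.days

    # a new group starts when there is no previous row or the gap exceeds 1 day
    group_no = (day_diff.isna() | day_diff.gt(1)).cumsum()

    df_out = (df.groupby(['Id', group_no], dropna=False, as_index=False)
                .agg({'Id': 'first',
                      'Start': 'first',
                      'End': lambda x: x.iloc[-1],
                      'Feature1': 'first',
                      'Feature2': 'first',
                     }))
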
Result:

    print(df_out)

      Id      Start        End  Feature1  Feature2
    0  A 2020-01-01 2020-01-30         1         1
    1  A 2020-01-31 2020-02-15         0         1
    2  A 2020-07-01 2020-07-15         0         1
    3  B 2020-01-31        NaT         0         0
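
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the sample frame from the question and keeps `group_no` as a temporary column, using named aggregation instead of the dict form. The variable names `keys`, `prev_end` and `out` are my own and not part of the answer above, so treat this as one possible arrangement rather than the only way to write it.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Id": ["A", "A", "A", "A", "B", "B"],
    "Start": pd.to_datetime(["2020-01-01", "2020-01-16", "2020-01-31",
                             "2020-07-01", "2020-01-31", "2020-02-16"]),
    "End": pd.to_datetime(["2020-01-15", "2020-01-30", "2020-02-15",
                           "2020-07-15", "2020-02-15", None]),  # None becomes NaT
    "Feature1": [1, 1, 0, 0, 0, 0],
    "Feature2": [1, 1, 1, 1, 0, 0],
})

# Rows may only merge when these columns all match (assumed from the question).
keys = ["Id", "Feature1", "Feature2"]

df = df.sort_values(["Id", "Start"]).reset_index(drop=True)

# End date of the previous row that shares the same Id and feature values.
prev_end = df.groupby(keys)["End"].shift()

# Gap in days; NaN where there is no previous row in that key group.
day_diff = (df["Start"] - prev_end).dt.days

# A new run starts when there is no previous row or the gap exceeds 1 day.
group_no = (day_diff.isna() | day_diff.gt(1)).cumsum()

out = (
    df.assign(group_no=group_no)
      .groupby("group_no")
      .agg(
          Id=("Id", "first"),
          Start=("Start", "first"),
          End=("End", lambda x: x.iloc[-1]),  # keep a trailing NaT
          Feature1=("Feature1", "first"),
          Feature2=("Feature2", "first"),
      )
      .reset_index(drop=True)
)

print(group_no.tolist())  # [1, 1, 2, 3, 4, 4]
print(out)
```

Printing `group_no` before aggregating is a handy check: rows that share a number are the ones that end up merged.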