I have the following data:
    import pandas as pd, numpy as np

    dates = pd.date_range('01/01/2022', '01/11/2022', freq='D')
    values = [0,0,1,1,0,0,1,1,1,0,1]
    df = pd.DataFrame({'date': dates, 'value': values})
    df

             date  value
    0  2022-01-01      0
    1  2022-01-02      0
    2  2022-01-03      1
    3  2022-01-04      1
    4  2022-01-05      0
    5  2022-01-06      0
    6  2022-01-07      1
    7  2022-01-08      1
    8  2022-01-09      1
    9  2022-01-10      0
    10 2022-01-11      1
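I want to transform this so that I end up with a "start" and an "end" column, where start is the first occurrence of a 1 and end is the last 1 in the same consecutive run. Basically, I should end up with this: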
    start       end
    2022-01-03  2022-01-04
    2022-01-07  2022-01-09
    2022-01-11
What I have done so far is the following:
    conditions = [
        (df.value == 1) & (df.value.shift(1) == 0),
        (df.value == 1) & (df.value.shift(-1) == 0)]
    choices = ['start', 'end']
    df['value'] = np.select(conditions, choices, default=pd.NA)
    df = df.dropna()
    df.pivot(columns='value')

                date
    value        end      start
    2            NaT 2022-01-03
    3     2022-01-04        NaT
    6            NaT 2022-01-07
    8     2022-01-09        NaT
    10           NaT 2022-01-11
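As you can see, it is almost there. I could do some extra fiddling to get what I want, but I feel I may be approaching this the wrong way. Is there a better, more efficient way to solve this?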
I would use groupby.agg here:
    # which rows have value 1?
    m = df['value'].eq(1)

    (df[m]                                  # keep only value==1
       .groupby(m.ne(m.shift()).cumsum())   # group by consecutive values
       ['date'].agg(['first', 'last'])      # get first and last date
       .reset_index(drop=True)
    )
Output:
           first       last
    0 2022-01-03 2022-01-04
    1 2022-01-07 2022-01-09
    2 2022-01-11 2022-01-11
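To make the grouping key less opaque, here is a self-contained sketch of the same idea that rebuilds the sample frame from the question and names the intermediate step. The names `run_id` and `result`, and the final rename to `start`/`end`, are my own additions for illustration and not part of the answer above. The key observation is that `m.ne(m.shift())` is True wherever the mask changes from the previous row, so its cumulative sum gives every consecutive run its own integer label.

```python
import pandas as pd

# Sample data from the question
dates = pd.date_range('01/01/2022', '01/11/2022', freq='D')
values = [0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
df = pd.DataFrame({'date': dates, 'value': values})

# Boolean mask: which rows have value 1?
m = df['value'].eq(1)

# Label each consecutive run (of 0s or of 1s) with its own integer.
# run_id is a hypothetical helper name used only in this sketch.
run_id = m.ne(m.shift()).cumsum()

result = (
    df[m]                                   # keep only the rows where value == 1
    .groupby(run_id[m])['date']             # one group per run of 1s
    .agg(['first', 'last'])                 # first and last date of each run
    .rename(columns={'first': 'start', 'last': 'end'})
    .reset_index(drop=True)
)

print(result)
```

With this approach a run of length one, such as the final 2022-01-11 row, gets the same date as both start and end, which the shift-based `np.select` attempt could not express because each row could only be labelled as one of the two.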