一尘不染

如何在pandas滚动窗口中基于多个列查找重复项?

python

样本数据

{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 90, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 90, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:02:30.000Z"}}
.
.

我有一些这样的代码

    df = pd.DataFrame()
for line in sys.stdin:
    data = json.loads(line)
    # df1 = pd.DataFrame(data["transaction"], index=[len(df.index)])
    df1 = pd.DataFrame(data["transaction"], index=[data['transaction']['time']])
    df1['time'] = pd.to_datetime(df1['time'])
    df = df.append(df1)
    # df['count'] = df.rolling('2min', on='time', min_periods=1)['amount'].count()

print(df)
print(len(df[df.merchant.eq(data['transaction']['merchant']) & df.amount.eq(data['transaction']['amount'])].index))

电流输出

2019-02-13T10:00:00.000Z  merchantA      20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z  merchantB      90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z  merchantC      90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z  merchantD      90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z  merchantE      90 2019-02-13 11:01:30
2019-02-13T11:02:30.000Z  merchantE      90 2019-02-13 11:02:30

2

预期产量

2019-02-13T10:00:00.000Z  merchantA      20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z  merchantB      90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z  merchantC      90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z  merchantD      90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z  merchantE      90 2019-02-13 11:01:30

由于数据正在流式传输。我想检查重复的记录(其商人和金额值相同)是否在两分钟内到达,所以我将其丢弃并对其不进行任何处理。将其打印为副本。

我是否需要对索引压缩或groupby进行处理?但是然后如何等于多列。或在两列上有一些滚动条件,但找不到任何方法。

我在这里想念什么?

谢谢

编辑

#dup = df[df.duplicated(subset=['merchant', 'amount'], keep=False)]
     res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])]
        # res['timediff'] = pd.to_timedelta((data['transaction']['time'] - res['time']), unit='T')
        res['timediff'] = (data['transaction']['time'] - res['time'])
        if len(res.index) >1:
           print(res)

所以我尝试这样的事情,如果结果少于120秒,我可以处理它。但是最终产生的df的形式为

                      merchant  amount                time       concat          timediff
2019-02-13 11:03:00  merchantF      10 2019-02-13 11:03:00  merchantF10 -1 days +23:59:20
2019-02-13 11:02:20  merchantF      10 2019-02-13 11:02:20  merchantF10          00:00:00

2019-02-13 11:01:30  merchantE      10 2019-02-13 11:01:30  merchantE10 00:01:00
2019-02-13 11:02:00  merchantE      10 2019-02-13 11:02:00  merchantE10 00:00:30
2019-02-13 11:02:30  merchantE      10 2019-02-13 11:02:30  merchantE10 00:00:00

-1天+23:59:20 我认为采用绝对值可以消除这种格式吗?

如何将时间转换为可以与120秒进行比较的格式? pd.to_deltatime() 对我不起作用,或者我使用的是错误的。


阅读 258

收藏
2021-01-20

共1个答案

一尘不染

所以我使它工作,但不支持滚动窗口,因为它不支持字符串类型。该功能也在Pandas Repo上进行了报告和请求。

我对这个问题的解决方案摘要:

    if len(df.index) > 0:
        res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])]
        res['timediff'] = (data['transaction']['time'] - res['time']).dt.total_seconds().abs() <= 120
        if res.timediff.any():
            continue
    df = df.append(df1)
print(df)

样本数据:

{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 10, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 10, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:03:00.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:00.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:02:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:05:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:00:30.000Z"}}

输出:

                      merchant  amount                time
2019-02-13 10:00:00  merchantA      20 2019-02-13 10:00:00
2019-02-13 11:00:01  merchantB      90 2019-02-13 11:00:01
2019-02-13 11:00:10  merchantC      10 2019-02-13 11:00:10
2019-02-13 11:00:20  merchantD      10 2019-02-13 11:00:20
2019-02-13 11:01:30  merchantE      10 2019-02-13 11:01:30
2019-02-13 11:03:00  merchantF      10 2019-02-13 11:03:00
2019-02-13 11:05:20  merchantF      10 2019-02-13 11:05:20
2021-01-20