Sample data
    {"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
    {"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
    {"transaction": {"merchant": "merchantC", "amount": 90, "time": "2019-02-13T11:00:10.000Z"}}
    {"transaction": {"merchant": "merchantD", "amount": 90, "time": "2019-02-13T11:00:20.000Z"}}
    {"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:01:30.000Z"}}
    {"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:02:30.000Z"}}
    .
    .
I have some code like this:
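    import json
    import sys

    import pandas as pd

    df = pd.DataFrame()
    for line in sys.stdin:
        data = json.loads(line)
        # df1 = pd.DataFrame(data["transaction"], index=[len(df.index)])
        # One-row frame for the incoming record, indexed by its raw time string.
        df1 = pd.DataFrame(data["transaction"], index=[data['transaction']['time']])
        df1['time'] = pd.to_datetime(df1['time'])
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
        df = pd.concat([df, df1])
        # df['count'] = df.rolling('2min', on='time', min_periods=1)['amount'].count()
        print(df)
        # Number of stored rows with the same merchant and amount as the incoming record.
        print(len(df[df.merchant.eq(data['transaction']['merchant']) & df.amount.eq(data['transaction']['amount'])].index))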
Current output
    2019-02-13T10:00:00.000Z  merchantA  20  2019-02-13 10:00:00
    2019-02-13T11:00:01.000Z  merchantB  90  2019-02-13 11:00:01
    2019-02-13T11:00:10.000Z  merchantC  90  2019-02-13 11:00:10
    2019-02-13T11:00:20.000Z  merchantD  90  2019-02-13 11:00:20
    2019-02-13T11:01:30.000Z  merchantE  90  2019-02-13 11:01:30
    2019-02-13T11:02:30.000Z  merchantE  90  2019-02-13 11:02:30
    2
Expected output
    2019-02-13T10:00:00.000Z  merchantA  20  2019-02-13 10:00:00
    2019-02-13T11:00:01.000Z  merchantB  90  2019-02-13 11:00:01
    2019-02-13T11:00:10.000Z  merchantC  90  2019-02-13 11:00:10
    2019-02-13T11:00:20.000Z  merchantD  90  2019-02-13 11:00:20
    2019-02-13T11:01:30.000Z  merchantE  90  2019-02-13 11:01:30
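The data is streaming in. I want to check whether a duplicate record (one whose merchant and amount values are the same) arrives within two minutes of an earlier one; if it does, I want to discard it, do no further processing on it, and just print it as a duplicate.
Do I need to do something like zipping on the index, or a groupby? But then how do I match on multiple columns? Or perhaps some rolling condition on the two columns, but I could not find a way to do that.
What am I missing here?
Thanks
Edit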
    #dup = df[df.duplicated(subset=['merchant', 'amount'], keep=False)]
    # Stored rows with the same merchant and amount as the incoming record.
    res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])].copy()
    # res['timediff'] = pd.to_timedelta((data['transaction']['time'] - res['time']), unit='T')
    # The incoming time is a string, so parse it before subtracting.
    res['timediff'] = (pd.to_datetime(data['transaction']['time']) - res['time'])
    if len(res.index) >1:
        print(res)
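So I tried something like this, the idea being that if the time difference is under 120 seconds I can treat the record as a duplicate. But the resulting df looks like this: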
                          merchant  amount                 time       concat           timediff
    2019-02-13 11:03:00  merchantF      10  2019-02-13 11:03:00  merchantF10  -1 days +23:59:20
    2019-02-13 11:02:20  merchantF      10  2019-02-13 11:02:20  merchantF10           00:00:00
    2019-02-13 11:01:30  merchantE      10  2019-02-13 11:01:30  merchantE10           00:01:00
    2019-02-13 11:02:00  merchantE      10  2019-02-13 11:02:00  merchantE10           00:00:30
    2019-02-13 11:02:30  merchantE      10  2019-02-13 11:02:30  merchantE10           00:00:00
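About the value -1 days +23:59:20: I assume taking the absolute value would get rid of this format?
How do I convert the time difference into something that can be compared against 120 seconds? pd.to_timedelta() did not work for me, or I am using it wrong.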
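A minimal sketch of one way to make that comparison, assuming the stored times are already parsed timestamps. Subtracting them gives Timedelta values, `.abs()` turns a negative difference such as -1 days +23:59:20 into a plain 40 seconds, and the result can be compared with `pd.Timedelta(seconds=120)` directly or converted to numbers with `.dt.total_seconds()`. The names `incoming` and `seen` are made up for illustration and reuse the merchantF times from the table above.

```python
import pandas as pd

# Hypothetical values mirroring the merchantF rows shown above.
incoming = pd.Timestamp("2019-02-13 11:02:20")
seen = pd.Series(pd.to_datetime(["2019-02-13 11:03:00", "2019-02-13 11:02:20"]))

# Absolute time differences as Timedelta values (never negative).
diff = (incoming - seen).abs()

# Option 1: compare Timedelta against Timedelta.
within_window = diff <= pd.Timedelta(seconds=120)

# Option 2: convert to float seconds and compare against a number.
seconds = diff.dt.total_seconds()

print(seconds.tolist())
print(within_window.tolist())
```

Either form yields a boolean mask, so `.any()` on it answers whether some earlier record falls inside the two-minute window.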
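So I got it working, but without a rolling window, because rolling does not support string types. This feature has also been reported and requested on the pandas repo.
A summary of my solution to this problem: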
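    # Body of the `for line in sys.stdin:` loop, after df1 has been built.
    if len(df.index) > 0:
        # Stored rows with the same merchant and amount as the incoming record.
        res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])].copy()
        # Parse the incoming time string so it can be subtracted from the stored timestamps.
        now = pd.to_datetime(data['transaction']['time'])
        res['timediff'] = (now - res['time']).dt.total_seconds().abs() <= 120
        if res.timediff.any():
            continue  # duplicate within two minutes: skip it
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
    df = pd.concat([df, df1])
    print(df)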
Sample data:
    {"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
    {"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
    {"transaction": {"merchant": "merchantC", "amount": 10, "time": "2019-02-13T11:00:10.000Z"}}
    {"transaction": {"merchant": "merchantD", "amount": 10, "time": "2019-02-13T11:00:20.000Z"}}
    {"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:01:30.000Z"}}
    {"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:03:00.000Z"}}
    {"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:00.000Z"}}
    {"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:02:20.000Z"}}
    {"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:30.000Z"}}
    {"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:05:20.000Z"}}
    {"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:00:30.000Z"}}
Output:
                          merchant  amount                 time
    2019-02-13 10:00:00  merchantA      20  2019-02-13 10:00:00
    2019-02-13 11:00:01  merchantB      90  2019-02-13 11:00:01
    2019-02-13 11:00:10  merchantC      10  2019-02-13 11:00:10
    2019-02-13 11:00:20  merchantD      10  2019-02-13 11:00:20
    2019-02-13 11:01:30  merchantE      10  2019-02-13 11:01:30
    2019-02-13 11:03:00  merchantF      10  2019-02-13 11:03:00
    2019-02-13 11:05:20  merchantF      10  2019-02-13 11:05:20
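For comparison, here is a rough sketch of the same filtering rule without a growing DataFrame. It is not the approach used above, only an alternative under the assumption that the rule is "drop a record if any kept record with the same merchant and amount lies within 120 seconds on either side". The names `filter_duplicates`, `kept_times` and `WINDOW` are hypothetical, and the sample lines are a subset of the data above.

```python
import json
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=120)  # assumed two-minute window


def filter_duplicates(lines, window=WINDOW):
    """Yield transactions that have no kept record with the same
    (merchant, amount) within `window` of their timestamp."""
    # (merchant, amount) -> timestamps of the records kept so far
    kept_times = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tx = json.loads(line)["transaction"]
        # Timestamps in the sample look like 2019-02-13T10:00:00.000Z
        ts = datetime.strptime(tx["time"], "%Y-%m-%dT%H:%M:%S.%fZ")
        key = (tx["merchant"], tx["amount"])
        if any(abs(ts - prev) <= window for prev in kept_times[key]):
            continue  # duplicate inside the window: drop it
        kept_times[key].append(ts)
        yield tx


sample = [
    '{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:01:30.000Z"}}',
    '{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:03:00.000Z"}}',
    '{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:00.000Z"}}',
    '{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:05:20.000Z"}}',
]
kept = list(filter_duplicates(sample))
for tx in kept:
    print(tx["merchant"], tx["amount"], tx["time"])
```

In the streaming setup, `filter_duplicates(sys.stdin)` could be iterated in place of the sample list. Keying the lookup on the `(merchant, amount)` tuple is what lets two columns be matched at once without a groupby.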