我有一个df和字典列表,如下所示。
df:
Date t_factor 2020-02-01 5 2020-02-02 23 2020-02-03 14 2020-02-04 23 2020-02-05 23 2020-02-06 23 2020-02-07 30 2020-02-08 29 2020-02-09 100 2020-02-10 38 2020-02-11 38 2020-02-12 38 2020-02-13 70 2020-02-14 70
param_list:
param_obj_list = [{'type': 'df_first', 'from': '2020-02-01T20:00:00.000Z', 'to': '2020-02-03T20:00:00.000Z', 'days': 0, 'coef': [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]}, {'type': 'quadratic', 'from': '2020-02-03T20:00:00.000Z', 'to': '2020-02-06T20:00:00.000Z', 'days': 3, 'coef': [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]}, {'type': 'linear', 'from': '2020-02-06T20:00:00.000Z', 'to': '2020-02-10T20:00:00.000Z', 'days': 3, 'coef': [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]}, {'type': 'polynomial', 'from': '2020-02-10T20:00:00.000Z', 'to': '2020-02-14T20:00:00.000Z', 'days': 3, 'coef': [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]}]
从以上我想基于字典指定的“类型”和日期列在df中创建一个新列。
我从下面的代码开始
import pandas as pd import numpy as np import datetime as DT def time_objective(df, param_obj_list) for params_obj in param_obj_list: # Do the data processing start_date, end_date, label, coef, n_days = params_obj['from'], params_obj['to'], params_obj['type'], params_obj['coef'], params_obj['days'] print(start_date, end_date, label, coef, n_days) start_date = DT.datetime.strptime(start_date, "%Y-%m-%dT%H:%M:%S.%fZ") print(start_date) if (start_date == 0) | (end_date == 0): return df elif: if len(coef) == 6: # Coefficients Index Initializations a0 = coef[0] a1 = coef[1] a2 = coef[2] a3 = coef[3] a4 = coef[4] a5 = coef[5] if label == 'df_first': df['Date'] = pd.to_datetime(df['Date']) m = df['Date'].between(start_date, end_date, inclusive=True) df.loc[m, 't_factor'] =
说明:
if "type" == df_first: df['new_col'] = df['t_factor'] (duration only from the "from" and "to" date specified in that dictionary) elif "type" == "quadratic": df['new_col'] = a0 + a1*(T) + a2*(T)**2 + previous value of df['new_col'] where T = 1 for one day after the "from" date of that dictionary and T counted in days based Date value elif "type" == "linear": df['new_col'] = a0 + a1*(T) + previous value of df['new_col'] where T = 1 for one day after the "from" date of that dictionary. elif "type" == "polynomial": df['new_col'] = a0 + a1*(T) + a2*(T)**2 + a3*(T)**3 + a4*(T)**4 + a5*(T)**5 + previous value of df['new_col'] where T = 1 for start_date of that dictionary.
预期产量:
Date t_factor new_col 2020-02-01 5 5 2020-02-02 23 23 2020-02-03 14 14 2020-02-04 23 14 + 0.1 + 0.1*(1) + 0.1*(1)**2 2020-02-05 23 14 + 0.1 + 0.1*(2) + 0.1*(2)**2 2020-02-06 23 14 + 0.1 + 0.1*(3) + 0.1*(3)**2 = 15.3 2020-02-07 30 15.3 + 0.1 + 0.1*(1) 2020-02-08 29 15.3 + 0.1 + 0.1*(2) 2020-02-09 100 15.3 + 0.1 + 0.1*(3) 2020-02-10 38 15.3 + 0.1 + 0.1*(4) = 15.8 2020-02-11 38 15.8 +0.1+0.1*(1)+0.1*(1)**2+0.1*(1)**3+0.1*(1)**4+0.1*(1)**5 2020-02-12 38 15.8 +0.1+0.1*(2)+0.1*(2)**2+0.1*(2)**3+0.1*(2)**4+0.1*(2)**5 2020-02-13 70 15.8 +0.1+0.1*(3)+0.1*(3)**2+0.1*(3)**3+0.1*(3)**4+0.1*(3)**5 2020-02-14 70 15.8 +0.1+0.1*(4)+0.1*(4)**2+0.1*(4)**3+0.1*(4)**4+0.1*(4)**5
定义一个函数time_objective,该函数接受dataframe和作为参数,param_obj_list并返回添加了新列的数据框。在这里,我们已经使用Series.between来创建一个boolean mask和并使用boolean indexing此掩码,根据要求填充值:
time_objective
dataframe
param_obj_list
Series.between
boolean mask
boolean indexing
def time_objective(df, param_obj_list): df['new_col'] = np.nan for d in param_obj_list: if 'from' not in d or 'to' not in d \ or d['from'] == 0 or d['to'] == 0: continue if len(d['coef']) != 6: print('Exception: Coefficients index do not match') return df a0, a1, a2, a3, a4, a5 = d['coef'] start = pd.Timestamp(d['from']).strftime('%Y-%m-%d') end = pd.Timestamp(d['to']).strftime('%Y-%m-%d') T = df['Date'].sub(pd.Timestamp(start)).dt.days mask = df['Date'].between(start, end, inclusive=True) if d['type'] == 'df_first': df.loc[mask, 'new_col'] = df['t_factor'] elif d['type'] == 'quadratic': df.loc[mask, 'new_col'] = a0 + a1 * T + a2 * (T)**2 + df['new_col'].ffill() elif d['type'] == 'linear': df.loc[mask, 'new_col'] = a0 + a1 * T + df['new_col'].ffill() elif d['type'] == 'polynomial': df.loc[mask, 'new_col'] = a0 + a1*(T) + a2*(T)**2 + a3 * \ (T)**3 + a4*(T)**4 + a5*(T)**5 + df['new_col'].ffill() return df
结果:
Date t_factor new_col 0 2020-02-01 5 5.0 1 2020-02-02 23 23.0 2 2020-02-03 14 14.1 3 2020-02-04 23 14.3 4 2020-02-05 23 14.7 5 2020-02-06 23 15.4 6 2020-02-07 30 15.5 7 2020-02-08 29 15.6 8 2020-02-09 100 15.7 9 2020-02-10 38 15.9 10 2020-02-11 38 16.4 11 2020-02-12 38 22.1 12 2020-02-13 70 52.2 13 2020-02-14 70 152.3