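I have two dataframes, df1 and df2, whose indexes are of the same type but share almost no matching entries. The indexes may also contain duplicates. Columns A and B each consist of unique values. All indexes and columns are sorted, but not all in the same direction: df1.index is descending and df1['A'] is ascending, while df2.index is ascending and, in the sample data below, df2['B'] is ascending as well.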
df1 (the numbers on the left are the dataframe's unnamed index):

                A
    80 -13.545215
    76 -12.270691
    73 -11.274724
    65  -8.280187
    38  -7.965972
    13  -7.788130
    10  -6.690969
    6   -5.273063
df2:

                B
    8  -13.827641
    10 -12.283885
    14 -11.459951
    62 -11.067622
    64 -10.745988
    87 -10.661594
    95  -9.816053
    97  -7.740810
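I would like to merge the dataframes so that each row of df2 (with its value in df2['B']) is paired with the value of df1['A'] at the closest corresponding index in df1, giving output of the following form: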

                B         A
    8  -13.827641 -6.690969
    10 -12.283885 -6.690969
    14 -11.459951 -7.965972
    62 -11.067622 -8.280187
    64 -10.745988 -8.280187
    87 -10.661594       NaN
    95  -9.816053       NaN
    97  -7.740810       NaN
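If the df1 index that is closest in absolute terms is lower than the df2 index, then the next higher df1 index is the correct match. If a df2 index has no higher counterpart in df1's index, then NaN is the correct match.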
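Put differently, each df2 index is matched to the smallest df1 index that is greater than or equal to it. A minimal sketch of that rule in plain Python, only to make the intended matching concrete (nearest_higher is a hypothetical helper name, not part of pandas, and the index values are copied from the sample above):

```python
from bisect import bisect_left


def nearest_higher(sorted_keys, key):
    """Return the smallest element of sorted_keys that is >= key, or None."""
    pos = bisect_left(sorted_keys, key)
    return sorted_keys[pos] if pos < len(sorted_keys) else None


# Index values from the sample dataframes; df1's index is sorted ascending
# first because bisect needs ascending input.
df1_keys = sorted([80, 76, 73, 65, 38, 13, 10, 6])
df2_keys = [8, 10, 14, 62, 64, 87, 95, 97]

# Maps each df2 index to the df1 index it should take its A value from.
matches = {key: nearest_higher(df1_keys, key) for key in df2_keys}
```

Here a value of None stands for the NaN case, where no higher df1 index exists.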
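So far I have been using pd.merge() and fillna() to do the necessary analysis. However, some people may find it "unnatural" to run an analysis on interpolated/synthetic data. In any case, this is how I have been doing it: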
Partial code example using pd.merge() and dropna():

    # outer merge
    df3 = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
    # df4 = df3.interpolate(method='linear')[1:]
    df4 = df3.interpolate(method='linear').dropna()
    df4
Output:

                A          B
    8   -5.982016 -13.827641
    10  -6.690969 -12.283885
    13  -7.788130 -11.871918
    14  -7.877051 -11.459951
    38  -7.965972 -11.263787
    62  -8.070710 -11.067622
    64  -8.175448 -10.745988
    65  -8.280187 -10.729109
    73 -11.274724 -10.712230
    76 -12.270691 -10.695352
    80 -13.545215 -10.678473
    87 -13.545215 -10.661594
    95 -13.545215  -9.816053
    97 -13.545215  -7.740810
Plot: [figure: line plot of columns A and B from df4]
Complete data and code example:

    # imports
    import numpy as np
    import pandas as pd

    # Some sample data
    np.random.seed(1)
    df1_index = sorted(np.random.randint(1, 101, 8), reverse=True)
    df1info = {'A': sorted((np.random.normal(10, 2, 8)) * -1)}
    df2_index = sorted(np.random.randint(1, 101, 8))
    df2info = {'B': sorted(np.random.normal(10, 2, 8) * -1)}

    # Two dataframes
    df1 = pd.DataFrame(df1info, index=df1_index)
    df2 = pd.DataFrame(df2info, index=df2_index)

    # outer merge
    df3 = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)

    # interpolate missing values
    df4 = df3.interpolate(method='linear').dropna()

    # plot
    df4.plot()
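Thanks for any suggestions!

Edit 1: Duplicate scenario 1:

If df2.index has an exact match in df1.index, and df1.index has duplicates, then the correct match is the lowest df1.index. I hope this makes sense. If for some reason it turns out not to, I am open to other suggestions!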
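An O(n) two-pointer pass over the sorted indexes:

    # Relies on df1, df2, df1_index and df2_index from the question's sample code
    df2_index.sort()
    df1_index.sort()

    a = 0  # position in the sorted df1 index
    b = 0  # position in the sorted df2 index
    mapping = [[], []]  # [matched df2 index values, matched df1['A'] values]

    while b < len(df2_index) and a < len(df1_index):
        if df1_index[a] == df2_index[b]:
            # exact match: take A from the same index value
            mapping[0].append(df2_index[b])
            mapping[1].append(df1.loc[df1_index[a], "A"])
            b += 1
            a += 1
        elif df1_index[a] > df2_index[b]:
            # current df1 index is the next higher one: use it
            mapping[0].append(df2_index[b])
            mapping[1].append(df1.loc[df1_index[a], "A"])
            b += 1
        else:
            # df1 index is too low: advance to the next df1 index
            a += 1

    # df2 rows left without a higher df1 index get NaN from the outer merge
    df = pd.DataFrame({'A': mapping[1]}, index=mapping[0])
    df2.merge(df, left_index=True, right_index=True, how='outer')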
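As an alternative to a manual loop, the same "next higher or equal index" match can be sketched with pd.merge_asof and direction='forward'. This is a minimal sketch, not a tested solution for every case: it assumes both indexes are sorted ascending before the call (merge_asof requires sorted keys) and rebuilds the dataframes from the sample values shown in the question. How it resolves duplicate index values in df1 is not checked here and should be verified against the duplicate scenario in the edit.

```python
import pandas as pd

# Sample values copied from the question's df1 and df2.
df1 = pd.DataFrame(
    {'A': [-13.545215, -12.270691, -11.274724, -8.280187,
           -7.965972, -7.788130, -6.690969, -5.273063]},
    index=[80, 76, 73, 65, 38, 13, 10, 6],
)
df2 = pd.DataFrame(
    {'B': [-13.827641, -12.283885, -11.459951, -11.067622,
           -10.745988, -10.661594, -9.816053, -7.740810]},
    index=[8, 10, 14, 62, 64, 87, 95, 97],
)

# merge_asof needs both keys sorted ascending, so sort the indexes first.
# direction='forward' picks, for each df2 index, the first df1 index
# that is greater than or equal to it; rows without one get NaN in A.
result = pd.merge_asof(
    df2.sort_index(),
    df1.sort_index(),
    left_index=True,
    right_index=True,
    direction='forward',
)
```

On the sample data, result keeps df2's index and column B, and column A holds -6.690969 for indexes 8 and 10, -7.965972 for 14, -8.280187 for 62 and 64, and NaN for 87, 95 and 97, which is the desired output from the question.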