小能豆

在 python 中使用部分匹配比较来自不同数据框的两列来映射值

py

我有 2 个df1包含id_numberdf2的数据框identity_No

希望使用条件将 Dataframe2 中最高匹配行的值映射到 Dataframe1。想要将 df1 中的每一行 ( df1[‘id_number’] ) 与 df2 的整个列 ( df2[‘identity_No’] ) 进行比较。我也尝试过使用部分匹配,但没有得到输出。

df1

score   id_number       company_name      company_code     match_acc     action_reqd
20      IN2231D           AXN pvt Ltd        IN225                          Yes
45      UK654IN        Aviva Intl Ltd        IN115                          No
65      SL1432H   Ship Incorporations        CZ555                          Yes
35      LK0678G  Oppo Mobiles pvt ltd        PQ795                          Yes
59      NG5678J             Nokia Inc        RS885                          No
20      IN2231D           AXN pvt Ltd        IN215                          Yes

df2

OR_score   identity_No       comp_name        comp_code   
51          UK654IN        Aviva Int.L Ltd       IN515  
25          SL6752J       Ship Inc Traders       CZ555  
79          NG5678K             Nokia Inc        RS005 
20          IN22312           AXN pvt Ltd        IN255
38          LK0665G       Oppo Mobiles ltd       PQ895 

例如:需要df1.id_number与进行比较df2.identity_No,基于的第 1 行查找匹配df1['id_number']将与的所有行匹配df2['identity_No'],并且相对于的第 4 行具有最高匹配百分比df2['identity_No'],并且其超过 80%,它将从 df2 的第 4 行复制相应的值到 df1 的第 1 行。对 df1 的每一行都应用相同的方法。

预期输出:

score   id_number       company_name      company_code     match_acc     action_reqd
20      IN22312           AXN pvt Ltd        IN225              90          Yes
51      UK654IN       Aviva Int.L Ltd        IN115              100         No
25      SL1432H   Ship Incorporations        CZ555              30          Yes
38      LK0665G      Oppo Mobiles ltd        PQ795              80          Yes
79      NG5678K             Nokia Inc        RS885              85          No
20      IN22312           AXN pvt Ltd        IN225              90          Yes

我现在已经尝试过了:

for index, row in df1.iterrows():
    for index2, config2 in df2.iterrows():
        if process.extractOne(row["id_number"], df["identity_No"])[1] >=80:
            df1['id_number'][index] = config2['identity_No']
            df1['company_name'][index] = config2['comp_name']
            df1['company_code'][index] = config2['comp_code']
            df1['score'][index] = config2['OR_Score']

尝试2

for index, row in df1.iterrows():
        for index2, config2 in df2.iterrows():
            if fuzz.partial_ratio(row["id_number"], config2["identity_No"]) >=80:

请提出建议


阅读 17

收藏
2025-01-03

共1个答案

小能豆

嵌入的评论:

import pandas as pd
from fuzzywuzzy import process

# Column mapping between the 2 dataframes
cols1 = ["score", "id_number", "company_name", "company_code"]
cols2 = ["OR_score", "identity_No", "comp_name", "comp_code"]

# Find the single best match above a score in a list of choices.
dfm = pd.DataFrame(df1["id_number"].apply(lambda x: process.extractOne(x, df2["identity_No"]))
                                   .tolist(), columns=["match_comp", "match_acc", "match_idx"])

# Get the indexes of (df1, df2) which satisfy the condition (match_acc> 80)
idx1, idx2 = dfm.loc[dfm["match_acc"] > 80, "match_idx"].reset_index().values.T.tolist()

# Update values from df2 to df1
df1.loc[idx1, cols1] = df2.loc[idx2, cols2].values
df1["match_acc"] = dfm["match_acc"]  # don't forget match_acc column
>>> df1
   score id_number          company_name company_code  match_acc action_reqd
0     20   IN22312           AXN pvt Ltd        IN255         86         Yes
1     51   UK654IN       Aviva Int.L Ltd        IN515        100          No
2     65   SL1432H   Ship Incorporations        CZ555         43         Yes
3     35   LK0678G  Oppo Mobiles pvt ltd        PQ795         71         Yes
4     79   NG5678K             Nokia Inc        RS005         86          No
5     20   IN22312           AXN pvt Ltd        IN255         86         Yes

根据您的输入数据进行测试:

df1 = pd.read_csv(io.StringIO("""score,id_number,company_name,company_code,match_acc,action_reqd
20,IN2231D,AXN pvt Ltd,IN225,,Yes
45,UK654IN,Aviva Intl Ltd,IN115,,No
65,SL1432H,Ship Incorporations,CZ555,,Yes
35,LK0678G,Oppo Mobiles pvt ltd,PQ795,,Yes
59,NG5678J,Nokia Inc,RS885,,No
20,IN2231D,AXN pvt Ltd,IN215,,Yes"""))

df2 = pd.read_csv(io.StringIO("""OR_score,identity_No,comp_name,comp_code
51,UK654IN,Aviva Int.L Ltd,IN515
25,SL6752J,Ship Inc Traders,CZ555
79,NG5678K,Nokia Inc,RS005
20,IN22312,AXN pvt Ltd,IN255
38,LK0665G,Oppo Mobiles ltd,PQ895"""))
2025-01-03