我有 2 个df1包含id_number和df2的数据框identity_No。
df1
id_number
df2
identity_No
希望使用条件将 Dataframe2 中最高匹配行的值映射到 Dataframe1。想要将 df1 中的每一行 ( df1[‘id_number’] ) 与 df2 的整个列 ( df2[‘identity_No’] ) 进行比较。我也尝试过使用部分匹配,但没有得到输出。
score id_number company_name company_code match_acc action_reqd 20 IN2231D AXN pvt Ltd IN225 Yes 45 UK654IN Aviva Intl Ltd IN115 No 65 SL1432H Ship Incorporations CZ555 Yes 35 LK0678G Oppo Mobiles pvt ltd PQ795 Yes 59 NG5678J Nokia Inc RS885 No 20 IN2231D AXN pvt Ltd IN215 Yes
OR_score identity_No comp_name comp_code 51 UK654IN Aviva Int.L Ltd IN515 25 SL6752J Ship Inc Traders CZ555 79 NG5678K Nokia Inc RS005 20 IN22312 AXN pvt Ltd IN255 38 LK0665G Oppo Mobiles ltd PQ895
例如:需要df1.id_number与进行比较df2.identity_No,基于的第 1 行查找匹配df1['id_number']将与的所有行匹配df2['identity_No'],并且相对于的第 4 行具有最高匹配百分比df2['identity_No'],并且其超过 80%,它将从 df2 的第 4 行复制相应的值到 df1 的第 1 行。对 df1 的每一行都应用相同的方法。
df1.id_number
df2.identity_No
df1['id_number']
df2['identity_No']
预期输出:
score id_number company_name company_code match_acc action_reqd 20 IN22312 AXN pvt Ltd IN225 90 Yes 51 UK654IN Aviva Int.L Ltd IN115 100 No 25 SL1432H Ship Incorporations CZ555 30 Yes 38 LK0665G Oppo Mobiles ltd PQ795 80 Yes 79 NG5678K Nokia Inc RS885 85 No 20 IN22312 AXN pvt Ltd IN225 90 Yes
我现在已经尝试过了:
for index, row in df1.iterrows(): for index2, config2 in df2.iterrows(): if process.extractOne(row["id_number"], df["identity_No"])[1] >=80: df1['id_number'][index] = config2['identity_No'] df1['company_name'][index] = config2['comp_name'] df1['company_code'][index] = config2['comp_code'] df1['score'][index] = config2['OR_Score']
尝试2
for index, row in df1.iterrows(): for index2, config2 in df2.iterrows(): if fuzz.partial_ratio(row["id_number"], config2["identity_No"]) >=80:
请提出建议
嵌入的评论:
import pandas as pd from fuzzywuzzy import process # Column mapping between the 2 dataframes cols1 = ["score", "id_number", "company_name", "company_code"] cols2 = ["OR_score", "identity_No", "comp_name", "comp_code"] # Find the single best match above a score in a list of choices. dfm = pd.DataFrame(df1["id_number"].apply(lambda x: process.extractOne(x, df2["identity_No"])) .tolist(), columns=["match_comp", "match_acc", "match_idx"]) # Get the indexes of (df1, df2) which satisfy the condition (match_acc> 80) idx1, idx2 = dfm.loc[dfm["match_acc"] > 80, "match_idx"].reset_index().values.T.tolist() # Update values from df2 to df1 df1.loc[idx1, cols1] = df2.loc[idx2, cols2].values df1["match_acc"] = dfm["match_acc"] # don't forget match_acc column >>> df1 score id_number company_name company_code match_acc action_reqd 0 20 IN22312 AXN pvt Ltd IN255 86 Yes 1 51 UK654IN Aviva Int.L Ltd IN515 100 No 2 65 SL1432H Ship Incorporations CZ555 43 Yes 3 35 LK0678G Oppo Mobiles pvt ltd PQ795 71 Yes 4 79 NG5678K Nokia Inc RS005 86 No 5 20 IN22312 AXN pvt Ltd IN255 86 Yes
根据您的输入数据进行测试:
df1 = pd.read_csv(io.StringIO("""score,id_number,company_name,company_code,match_acc,action_reqd 20,IN2231D,AXN pvt Ltd,IN225,,Yes 45,UK654IN,Aviva Intl Ltd,IN115,,No 65,SL1432H,Ship Incorporations,CZ555,,Yes 35,LK0678G,Oppo Mobiles pvt ltd,PQ795,,Yes 59,NG5678J,Nokia Inc,RS885,,No 20,IN2231D,AXN pvt Ltd,IN215,,Yes""")) df2 = pd.read_csv(io.StringIO("""OR_score,identity_No,comp_name,comp_code 51,UK654IN,Aviva Int.L Ltd,IN515 25,SL6752J,Ship Inc Traders,CZ555 79,NG5678K,Nokia Inc,RS005 20,IN22312,AXN pvt Ltd,IN255 38,LK0665G,Oppo Mobiles ltd,PQ895"""))