Data matching algorithm

一尘不染

algorithm

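I am currently working on a project that requires a data matching algorithm. An external system passes in all the data it knows about a customer, and the system I am designing has to return the matching customer. The external system then knows the customer's correct ID and can either fetch additional data or update its own data for that customer.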

The following fields are passed in:

Name
Name2
Street
City
ZipCode
BankAccountNumber
BankName
BankCode
Email
Phone
Fax
Web

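The data can be of high quality with plenty of information available, but often it is poor: only the name and address are provided, and they may contain misspellings.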

I am implementing the project in .NET. What I currently do looks something like this:

public bool IsMatch(Customer customer)
{
    // CanIdentify just checks if the info is provided and has a specific length (e.g. > 1)
    if (CanIdentifyByStreet() && CanIdentifyByBankAccountNumber())
    {
        // some parsing of strings done before (substring, etc.)
        if(Street == customer.Street && AccountNumber == customer.BankAccountNumber) return true;
    }
    if (CanIdentifyByStreet() && CanIdentifyByZipCode() && CanIdentifyByName())
    {
        ...
    }
    // none of the combinations matched
    return false;
}

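I am not very happy with this approach, because I have to write an if statement for every reasonable case (combination of fields) so that I don't miss any chance of matching the entity.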

So I thought I could perhaps compute some kind of matching score: for each criterion that matches, a score is added. Like this:

public bool IsMatch(Customer customer)
{
    int matchingScore = 0;
    if (CanIdentifyByStreet())
    {
        if(....)
            matchingScore += 10;
    }
    if (CanIdentifyByName())
    {
        if(....)
            matchingScore += 10;
    }
    if (CanIdentifyByBankAccountNumber())
    {
        if(....)
            matchingScore += 10;
    }

    if (matchingScore > iDontKnow)
        return true;
    return false;
}

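This would let me take all matching data into account, increasing the matching score according to some weights. If the score is high enough, it's a match.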
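A minimal sketch of the weighted-score idea, to make it concrete. Everything here is an assumption for illustration: the simplified `Customer` class, the `WeightedMatcher` name, the weight values, and the exact-match comparison after trimming and upper-casing. The real weights and threshold would have to be tuned against actual data.

```csharp
using System;

// Simplified stand-in for the real customer type (illustrative only).
public sealed class Customer
{
    public string Name { get; set; } = "";
    public string Street { get; set; } = "";
    public string ZipCode { get; set; } = "";
    public string BankAccountNumber { get; set; } = "";
}

public static class WeightedMatcher
{
    // Hypothetical weights: the more identifying a field is, the higher its weight.
    private static readonly (Func<Customer, string> Field, int Weight)[] Rules =
    {
        (c => c.BankAccountNumber, 50),
        (c => c.Name, 20),
        (c => c.Street, 20),
        (c => c.ZipCode, 10),
    };

    // Adds up the weights of all fields that are usable and equal on both records.
    public static int Score(Customer incoming, Customer candidate)
    {
        int score = 0;
        foreach (var (field, weight) in Rules)
        {
            string a = Normalize(field(incoming));
            string b = Normalize(field(candidate));
            // Same role as the CanIdentifyBy... checks: ignore fields with too little data.
            if (a.Length > 1 && a == b)
                score += weight;
        }
        return score;
    }

    // The threshold is passed in because its value is still an open question.
    public static bool IsMatch(Customer incoming, Customer candidate, int threshold)
        => Score(incoming, candidate) >= threshold;

    private static string Normalize(string value)
        => (value ?? string.Empty).Trim().ToUpperInvariant();
}
```

With these assumed weights, two records that agree on bank account number and name but differ on street score 70. Adding a field means adding one rule rather than another combination of if statements.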

Now my question is: are there any best practices for this kind of thing, such as matching algorithm patterns? Thanks a lot!

2020-07-28

1 answer

一尘不染

For inspiration, look at the Levenshtein distance algorithm. It gives you a reasonable mechanism for weighting your comparisons.
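As a rough illustration, Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below uses the standard two-row dynamic-programming form. The `Similarity` helper, which scales the distance to a value between 0 and 1, is an assumption added here so the result can be multiplied by a field weight; it is one common normalization, not the only one.

```csharp
using System;

public static class Levenshtein
{
    // Edit distance between a and b, keeping only two rows of the DP table.
    public static int Distance(string a, string b)
    {
        a = a ?? string.Empty;
        b = b ?? string.Empty;

        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];

        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1), // insertion, deletion
                    previous[j - 1] + cost);                        // substitution or match
            }
            (previous, current) = (current, previous); // reuse the two rows
        }
        return previous[b.Length];
    }

    // Hypothetical helper: 1.0 for identical strings, 0.0 for completely different ones.
    public static double Similarity(string a, string b)
    {
        a = a ?? string.Empty;
        b = b ?? string.Empty;
        int maxLength = Math.Max(a.Length, b.Length);
        if (maxLength == 0) return 1.0;
        return 1.0 - (double)Distance(a, b) / maxLength;
    }
}
```

For example, `Levenshtein.Distance("John Smith", "Jon Smith")` is 1, since a single deleted letter separates the two names. In a scoring scheme, a field could then contribute its weight multiplied by the similarity instead of all or nothing.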

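I would also add that, in my experience, you can never match two arbitrary pieces of data to the same entity with absolute certainty. You need to present the plausible matches to a user, who can then decide whether John Smith at 1920 E. Pine is the same person as Jon Smith at 192 East Pine Road.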
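One way to act on this is to return a ranked shortlist of plausible candidates rather than a single yes/no answer. The sketch below assumes a scoring function is supplied by the caller; the `MatchShortlist` name, the `reviewThreshold` cutoff, and the `maxResults` limit are all hypothetical and would need tuning.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MatchShortlist
{
    // Returns the candidates worth showing to a user, best score first.
    // Candidates scoring below reviewThreshold are dropped as implausible.
    public static List<(T Candidate, double Score)> Build<T>(
        IEnumerable<T> candidates,
        Func<T, double> score,
        double reviewThreshold,
        int maxResults)
    {
        return candidates
            .Select(c => (Candidate: c, Score: score(c)))
            .Where(x => x.Score >= reviewThreshold)
            .OrderByDescending(x => x.Score)
            .Take(maxResults)
            .ToList();
    }
}
```

The user then makes the final call on the entries in the shortlist, which keeps the uncertain cases out of any automatic decision.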

2020-07-28