一尘不染

Similar string comparison in Java

java

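I want to compare several strings to each other and find the ones that are most similar. Is there a library, method, or best practice that would tell me which strings are more similar to which other strings? For example: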

  • “The quick fox jumped” -> “The fox jumped”
  • “The quick fox jumped” -> “The fox”

The comparison should report that the first pair is more similar than the second.

I think I need a method such as:

double similarityIndex(String s1, String s2)

Does something like this exist somewhere?

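EDIT: Why am I doing this? I am writing a script that compares the output of an MS Project file with the output of a legacy system that handles tasks. Because the legacy system has very limited field widths, descriptions are abbreviated when values are added. I want a semi-automatic way to find which MS Project entries are similar to the entries in the legacy system, so that I can get the generated keys. It has drawbacks, since the matches still have to be checked manually, but it would save a lot of work.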


2020-03-06

2 answers

一尘不染

Yes, there are many well-documented algorithms, such as:

  • Cosine similarity
  • Jaccard similarity
  • Dice’s coefficient
  • Matching similarity
  • Overlap similarity
  • etc etc
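
As a rough illustration of two of the measures listed above, here is a minimal sketch of Jaccard similarity and Dice's coefficient computed over word sets. The class and method names (TokenSimilarity, tokens, jaccard, dice) are hypothetical, and splitting on whitespace after lowercasing is an assumption; character n-grams are another common choice of token.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class TokenSimilarity {

  // Deliberately simple tokenizer (an assumption): lowercase, split on whitespace.
  static Set<String> tokens(String s) {
    String trimmed = s.trim().toLowerCase();
    if (trimmed.isEmpty()) {
      return new HashSet<>();
    }
    return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
  }

  // Number of tokens that occur in both sets.
  static int commonCount(Set<String> a, Set<String> b) {
    Set<String> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    return intersection.size();
  }

  // Jaccard similarity: |A ∩ B| / |A ∪ B|
  static double jaccard(String s1, String s2) {
    Set<String> a = tokens(s1), b = tokens(s2);
    if (a.isEmpty() && b.isEmpty()) {
      return 1.0; // two empty strings are treated as identical
    }
    Set<String> union = new HashSet<>(a);
    union.addAll(b);
    return commonCount(a, b) / (double) union.size();
  }

  // Dice's coefficient: 2 * |A ∩ B| / (|A| + |B|)
  static double dice(String s1, String s2) {
    Set<String> a = tokens(s1), b = tokens(s2);
    if (a.isEmpty() && b.isEmpty()) {
      return 1.0;
    }
    return 2.0 * commonCount(a, b) / (a.size() + b.size());
  }

  public static void main(String[] args) {
    System.out.println(jaccard("The quick fox jumped", "The fox jumped")); // 3 shared / 4 total = 0.75
    System.out.println(jaccard("The quick fox jumped", "The fox"));        // 2 shared / 4 total = 0.5
    System.out.println(dice("The quick fox jumped", "The fox jumped"));    // 2*3 / (4+3) = 6/7
  }
}
```

With this tokenization both measures rank "The fox jumped" above "The fox" for the question's example. Word-set measures ignore word order and repeated words, so they may suit abbreviated descriptions less well than a character-level measure.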
2020-03-06
一尘不染

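A common way to express the similarity of two strings as a value from 0% to 100%, used in many libraries, is to measure what fraction of the longer string would have to change to turn it into the shorter one, and subtract that fraction from 1: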

/**
 * Calculates the similarity (a number within 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
  String longer = s1, shorter = s2;
  if (s1.length() < s2.length()) { // longer should always have greater length
    longer = s2; shorter = s1;
  }
  int longerLength = longer.length();
  if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
  return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below

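Computing editDistance():

The editDistance() function used above is expected to compute the edit distance between the two strings. There are several implementations of this step, each of which may suit a particular scenario better. The most common is the Levenshtein distance algorithm, which is used in the example below (for very large strings, other algorithms are likely to perform better).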
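
To make the Levenshtein recurrence easier to follow, here is a minimal sketch of the textbook full-matrix version. The class name LevenshteinMatrix and the method name distance are hypothetical. Unlike the example implementation later in this answer, this sketch is case-sensitive and keeps the whole table in memory, so it trades space for readability.

```java
class LevenshteinMatrix {

  // d[i][j] holds the edit distance between the first i chars of a
  // and the first j chars of b.
  static int distance(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];

    // Turning a prefix into the empty string takes one deletion per char,
    // and building a prefix from the empty string takes one insertion per char.
    for (int i = 0; i <= a.length(); i++) {
      d[i][0] = i;
    }
    for (int j = 0; j <= b.length(); j++) {
      d[0][j] = j;
    }

    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
        int substitution = d[i - 1][j - 1] + cost; // replace, or keep if equal
        int deletion = d[i - 1][j] + 1;            // drop a char from a
        int insertion = d[i][j - 1] + 1;           // add a char from b
        d[i][j] = Math.min(substitution, Math.min(deletion, insertion));
      }
    }
    return d[a.length()][b.length()];
  }

  public static void main(String[] args) {
    System.out.println(distance("kitten", "sitting")); // 3
  }
}
```

The table needs (m + 1) * (n + 1) integers for strings of length m and n. Keeping only one or two rows, as the compact example implementation does, reduces that to a single row's worth.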

Here are two options for computing the edit distance:

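  • Use the Apache Commons Text implementation of the Levenshtein distance: the apply(CharSequence left, CharSequence right) method of its LevenshteinDistance class
  • Implement it yourself. An example implementation is included in the working example below.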

Working example:

public class StringSimilarity {

  /**
   * Calculates the similarity (a number within 0 and 1) between two strings.
   */
  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
    LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
    return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  // Example implementation of the Levenshtein Edit Distance
  // See http://rosettacode.org/wiki/Levenshtein_distance#Java
  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

  public static void printSimilarity(String s, String t) {
    System.out.println(String.format(
      "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
  }

  public static void main(String[] args) {
    printSimilarity("", "");
    printSimilarity("1234567890", "1");
    printSimilarity("1234567890", "123");
    printSimilarity("1234567890", "1234567");
    printSimilarity("1234567890", "1234567890");
    printSimilarity("1234567890", "1234567980");
    printSimilarity("47/2010", "472010");
    printSimilarity("47/2010", "472011");
    printSimilarity("47/2010", "AB.CDEF");
    printSimilarity("47/2010", "4B.CDEFG");
    printSimilarity("47/2010", "AB.CDEFG");
    printSimilarity("The quick fox jumped", "The fox jumped");
    printSimilarity("The quick fox jumped", "The fox");
    printSimilarity("kitten", "sitting");
  }

}

Output:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"
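
For the use case in the question (finding which entry most resembles a shortened description), a score like this can be used to pick the best candidate from a list. The following is a minimal sketch under that assumption. The names BestMatch and mostSimilar are hypothetical, and the similarity and edit-distance logic is restated in a two-row form only so that the sketch is self-contained.

```java
import java.util.Arrays;
import java.util.List;

class BestMatch {

  // Same normalization as similarity() above: 1 - distance / longer length.
  static double similarity(String s1, String s2) {
    int longerLength = Math.max(s1.length(), s2.length());
    if (longerLength == 0) {
      return 1.0; // both strings are empty
    }
    return (longerLength - editDistance(s1, s2)) / (double) longerLength;
  }

  // Two-row Levenshtein distance, case-insensitive like the example above.
  static int editDistance(String s1, String s2) {
    String a = s1.toLowerCase(), b = s2.toLowerCase();
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
        curr[j] = Math.min(prev[j - 1] + cost,
            Math.min(prev[j], curr[j - 1]) + 1);
      }
      int[] tmp = prev; // reuse the two rows instead of allocating new ones
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  // Returns the candidate with the highest score, or null for an empty list.
  // On ties, the earliest candidate wins.
  static String mostSimilar(String target, List<String> candidates) {
    String best = null;
    double bestScore = -1.0;
    for (String candidate : candidates) {
      double score = similarity(target, candidate);
      if (score > bestScore) {
        bestScore = score;
        best = candidate;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<String> candidates = Arrays.asList("The fox", "The fox jumped", "A lazy dog");
    System.out.println(mostSimilar("The quick fox jumped", candidates)); // The fox jumped
  }
}
```

Since the matches still need a manual check, it may help to also report the score and flag candidates below some cutoff. Any such cutoff would have to be tuned on real data.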
2020-03-06