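I use fuzzy matching in my project, mainly to find misspellings and different spellings of the same name. I need to understand exactly how Elasticsearch's fuzzy matching works and how it uses the two parameters mentioned in the title.
As I understand it, min_similarity is the percentage by which the queried string must match the string in the database. I could not find an exact description of how this value is calculated.
As I understand it, max_expansions is the Levenshtein distance within which the search should be performed. If it really were the Levenshtein distance, it would be the ideal solution for me. However, it does not work that way. For example, I have the word "Samvel":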
queryStr     max_expansions           matches?
samvel       0                        Should not be 0. error (but levenshtein distance can be 0!)
samvel       1                        Yes
samvvel      1                        Yes
samvvell     1                        Yes (but it shouldn't have)
samvelll     1                        Yes (but it shouldn't have)
saamvelll    1                        No (but for some weird reason it matches with Samvelian)
saamvelll    anything bigger than 1   No
The documentation says something I don't really understand:
Add max_expansions to the fuzzy query allowing to control the maximum number of terms to match. Default to unbounded (or bounded by the max clause count in boolean query).
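So could anyone please explain to me exactly how these parameters affect the search results?
min_similarity is a value between zero and one. From the Lucene docs: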
For example, for a minimumSimilarity of 0.5 a term of the same length as the query term is considered similar to the query term if the edit distance between both terms is less than length(term)*0.5
The "edit distance" referred to here is the Levenshtein distance.
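To make the rule in the quote concrete, here is a minimal Python sketch. It is not Lucene's code. The `levenshtein` and `is_similar` helpers are hypothetical names, and the sketch assumes that similarity is defined as 1 - distance / length, using the shorter of the two lengths when they differ. For min_similarity = 0.5 this gives the same threshold as the quoted rule.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def is_similar(query: str, term: str, min_similarity: float) -> bool:
    """Assumed rule: similarity = 1 - distance / length must exceed min_similarity.

    Using the shorter length for terms of unequal length is an assumption;
    the quoted documentation only covers terms of equal length.
    """
    length = min(len(query), len(term))
    if length == 0:
        return False
    # Equivalent to 1 - distance / length > min_similarity.
    return levenshtein(query, term) < length * (1 - min_similarity)


print(levenshtein("samvel", "samvvel"))        # one inserted letter
print(is_similar("samvel", "samvvel", 0.5))    # close enough
print(is_similar("samvel", "saamvelll", 0.5))  # three edits on six letters
```

Under these assumptions, "samvvel" passes at 0.5 because one edit is fewer than 6 * 0.5 = 3, while "saamvelll" needs three edits and fails the strict comparison.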
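The way this query works internally is:
You can imagine how heavy this query could be!
To combat this, you can set the max_expansions parameter to specify the maximum number of matching terms that should be considered.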
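The effect of max_expansions can be illustrated with a small Python sketch. This is only a model of the idea, not Elasticsearch's implementation: `expand_terms` and the sample term list are hypothetical, the similarity formula is the same assumption as in the Lucene quote (1 - distance / shorter length), and keeping the most similar terms first is an assumption as well.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def expand_terms(query, index_terms, min_similarity, max_expansions=None):
    """Return the index terms a fuzzy query would be rewritten into.

    Hypothetical model: score every indexed term, keep those above
    min_similarity, then cap the list at max_expansions terms.
    """
    scored = []
    for term in index_terms:
        length = min(len(query), len(term))
        if length == 0:
            continue
        similarity = 1.0 - levenshtein(query, term) / length
        if similarity > min_similarity:
            scored.append((similarity, term))
    # Most similar first; ties broken alphabetically so the result is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    if max_expansions is not None:
        scored = scored[:max_expansions]
    return [term for _, term in scored]


# Made-up index contents, for illustration only.
terms = ["samvel", "samvvel", "samvelian", "samuel", "daniel"]
print(expand_terms("samvel", terms, 0.5))
print(expand_terms("samvel", terms, 0.5, max_expansions=1))
```

In this model, max_expansions says nothing about edit distance. It only limits how many candidate terms survive, so a value of 1 means a single term is searched no matter how many terms are similar enough.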