嘿,我正在使用Levenshteins算法来获取源字符串和目标字符串之间的距离。
我也有方法返回值从0到1:
/// <summary> /// Gets the similarity between two strings. /// All relation scores are in the [0, 1] range, /// which means that if the score gets a maximum value (equal to 1) /// then the two string are absolutely similar /// </summary> /// <param name="string1">The string1.</param> /// <param name="string2">The string2.</param> /// <returns></returns> public static float CalculateSimilarity(String s1, String s2) { if ((s1 == null) || (s2 == null)) return 0.0f; float dis = LevenshteinDistance.Compute(s1, s2); float maxLen = s1.Length; if (maxLen < s2.Length) maxLen = s2.Length; if (maxLen == 0.0F) return 1.0F; else return 1.0F - dis / maxLen; }
但这对我来说还不够。因为我需要更复杂的方式来匹配两个句子。
例如,我想自动标记一些音乐,我拥有原始歌曲名称,并且我有带有垃圾邮件的歌曲,例如 super,quality, 诸如 2007、2008等 年份。等等。还有一些文件只有http:// trash。 .thash..song_name_mp3.mp3,其他都是正常的。我想创建一种算法,该算法现在比我的算法更完美。.也许有人可以帮助我吗?
这是我目前的算法:
/// <summary> /// if we need to ignore this target. /// </summary> /// <param name="targetString">The target string.</param> /// <returns></returns> private bool doIgnore(String targetString) { if ((targetString != null) && (targetString != String.Empty)) { for (int i = 0; i < ignoreWordsList.Length; ++i) { //* if we found ignore word or target string matching some some special cases like years (Regex). if (targetString == ignoreWordsList[i] || (isMatchInSpecialCases(targetString))) return true; } } return false; } /// <summary> /// Removes the duplicates. /// </summary> /// <param name="list">The list.</param> private void removeDuplicates(List<String> list) { if ((list != null) && (list.Count > 0)) { for (int i = 0; i < list.Count - 1; ++i) { if (list[i] == list[i + 1]) { list.RemoveAt(i); --i; } } } } /// <summary> /// Does the fuzzy match. /// </summary> /// <param name="targetTitle">The target title.</param> /// <returns></returns> private TitleMatchResult doFuzzyMatch(String targetTitle) { TitleMatchResult matchResult = null; if (targetTitle != null && targetTitle != String.Empty) { try { //* change target title (string) to lower case. targetTitle = targetTitle.ToLower(); //* scores, we will select higher score at the end. Dictionary<Title, float> scores = new Dictionary<Title, float>(); //* do split special chars: '-', ' ', '.', ',', '?', '/', ':', ';', '%', '(', ')', '#', '\"', '\'', '!', '|', '^', '*', '[', ']', '{', '}', '=', '!', '+', '_' List<String> targetKeywords = new List<string>(targetTitle.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries)); //* remove all trash from keywords, like super, quality, etc.. targetKeywords.RemoveAll(delegate(String x) { return doIgnore(x); }); //* sort keywords. targetKeywords.Sort(); //* remove some duplicates. removeDuplicates(targetKeywords); //* go through all original titles. foreach (Title sourceTitle in titles) { float tempScore = 0f; //* split orig. title to keywords list. List<String> sourceKeywords = new List<string>(sourceTitle.Name.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries)); sourceKeywords.Sort(); removeDuplicates(sourceKeywords); //* go through all source ttl keywords. foreach (String keyw1 in sourceKeywords) { float max = float.MinValue; foreach (String keyw2 in targetKeywords) { float currentScore = StringMatching.StringMatching.CalculateSimilarity(keyw1.ToLower(), keyw2); if (currentScore > max) { max = currentScore; } } tempScore += max; } //* calculate average score. float averageScore = (tempScore / Math.Max(targetKeywords.Count, sourceKeywords.Count)); //* if average score is bigger than minimal score and target title is not in this source title ignore list. if (averageScore >= minimalScore && !sourceTitle.doIgnore(targetTitle)) { //* add score. scores.Add(sourceTitle, averageScore); } } //* choose biggest score. float maxi = float.MinValue; foreach (KeyValuePair<Title, float> kvp in scores) { if (kvp.Value > maxi) { maxi = kvp.Value; matchResult = new TitleMatchResult(maxi, kvp.Key, MatchTechnique.FuzzyLogic); } } } catch { } } //* return result. return matchResult; }
这正常工作,但只是在某些情况下,许多标题应该匹配,不匹配…我想我需要某种公式来处理权重等,但我想不到。
有想法吗?有什么建议吗?阿尔戈斯?
顺便说一句,我已经知道了这个主题(我的同事已经发布了这个主题,但是我们不能为这个问题提供适当的解决方案。)
您的问题可能是区分干扰词和有用数据:
您可能需要制作一个干扰词字典来忽略。这似乎很笨拙,但是我不确定是否有一种算法可以区分乐队/专辑名称和噪音。