了解使用LCP阵列进行模式匹配的算法

一尘不染

了解使用LCP阵列进行模式匹配的算法

algorithm

前言：我的问题主要是算法问题，因此即使您不熟悉后缀和LCP数组，您也可能会帮助我。

在此论文中描述了如何高效地使用后缀和LCP阵列为字符串模式匹配。

我了解了SA和LCP的工作原理，以及如何将算法的运行时间从O(P*log(N))（P模式N的长度和字符串的长度）提高到O(P+log(N))（感谢ChrisEelmaa 在这里和jogojapans
在这里的]回答）的理解。我试图遍历图4中的算法，该算法解释了LLcp和的用法RLcp。但是我在理解它是如何工作时遇到了问题。

该算法（取自源）：

模式匹配算法

使用的变量名称的说明：

lcp(v,w) : Length of the longest common prefix of v and w
W = w0..wP-1 : pattern of length P
A = a0..aN-1 : the text (length N)
Pos[0..N-1] : suffix array
L_W : index (in A) of first occurrence of the matched pattern
M : middle index of current substring
L : lower bound
R : upper bound
Lcp : array of size N-2 such that Lcp[M] = lcp(A_Pos[L_M], A_pos[M]) where L_M is the lower bound of the unique interval with M in the middle
Rcp : array of size N-2 such that Rcp[M] = lcp(A_Pos[R_M], A_pos[M]) where R_M is the upper bound of the unique interval with M in the middle

现在，我想使用以下示例（部分取自此处）尝试算法：

 SA | LCP | Suffix entry
 -----------------------
 5  | N/A | a  
 3  | 1   | ana  
 1  | 3   | anana  
 0  | 0   | banana  
 4  | 0   | na  
 2  | 2   | nana

A = "banana" ; N = 6
W = "ban" ; P = 3

我想尝试匹配一个字符串，说ban，并希望算法返回0作为L_W。

这是我逐步执行算法的方法：

l = lcp("a", "ban") = 0
r = lcp("nana", "ban") = 0
if 0 = 3 or 'b' =< 'a' then // which is NOT the case for both conditions
    L_W = 0
else if 0 < 3 or 'b' =< 'n' then // which is the case for both conditions 
    L_W = 6 // which means 'not found'
...
...

我觉得我想念什么，但我找不到。我也想知道如何使用预先计算的LCP数组代替调用lcp(v,w)。

阅读 409

2020-07-28

共1个答案

一尘不染

我相信有一个错误。

第一个条件很容易理解。当LCP长度==模式长度时，就完成了。当您的模式小于或等于最小模式时，唯一的选择就是最小模式。

第二个条件是错误的。我们可以通过矛盾来证明这一点。r <P || Wr <= a …表示r> = P && Wr> a …如果r> =
P，那么由于我们已经有了r个长度的公共前缀，我们怎么能使Lw = N（未找到）？

2020-07-28