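I am trying to parse Wikipedia pages and need a regular expression to extract a specific part of each page. In the data below, I only need the data inside the {{Infobox…}} section.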
{{Infobox XC Championships
|Name = Senior men's race at the 2008 IAAF World Cross Country Championships
|Host city = [[Edinburgh]], [[Scotland]], [[United Kingdom]] {{flagicon|United Kingdom}}
|Location = [[Holyrood Park]]
|Nations participating = 45
}}
2008.<ref name=iaaf_00>
{{ Citation
| last =
| publisher = [[IAAF]]
}}
So in the example above, I only need to extract
Infobox XC Championships
|Name = Senior men's race at the 2008 IAAF World Cross Country Championships
|Host city = [[Edinburgh]], [[Scotland]], [[United Kingdom]] {{flagicon|United Kingdom}}
|Location = [[Holyrood Park]]
|Nations participating = 45
Note that the {{Infobox…}} section may contain nested {{ }} elements, and I do not want to lose them.
Here is my regular expression:
\\{\\{Infobox[^{}]*\\}\\}
But it does not seem to work. Please help. Thanks!
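Because of the way the infobox section is formatted, a regular expression can actually handle this. The trick is that you do not need to care about the nested {{...}} elements at all, because every line inside the infobox, including any line that contains a nested element, starts with a |.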
{{(Infobox.*\r\n(?:\|.*\r\n)+)}}
Debuggex Demo
{{              start of the string
(Infobox        start of the capturing group
.*\r\n          any characters until a line break appears
(?:
\|              line has to start with a |
.*\r\n          any characters until a line break appears
)+              the non-capturing group can occur multiple times
)               end of capturing group
}}
So, inside the Infobox section, you simply match lines that start with | until the closing }} shows up.
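To make the idea concrete, here is a minimal Python sketch that applies the pattern to the sample from the question. Python, the names `SAMPLE`, `INFOBOX_RE` and `extract_infobox`, and the assumption that the page uses \r\n line endings with the closing }} on its own line are all mine, not part of the question. The braces are escaped as `\{\{` to be safe; the pattern is otherwise unchanged.

```python
import re

# Sample wikitext from the question, assumed to use \r\n line endings.
SAMPLE = "\r\n".join([
    "{{Infobox XC Championships",
    "|Name = Senior men's race at the 2008 IAAF World Cross Country Championships",
    "|Host city = [[Edinburgh]], [[Scotland]], [[United Kingdom]] {{flagicon|United Kingdom}}",
    "|Location = [[Holyrood Park]]",
    "|Nations participating = 45",
    "}}",
    "2008.<ref name=iaaf_00>",
    "{{ Citation",
    "| last =",
    "| publisher = [[IAAF]]",
    "}}",
])

# The pattern from this answer: a header line starting with "Infobox",
# then one or more lines starting with "|", then the closing "}}".
INFOBOX_RE = re.compile(r"\{\{(Infobox.*\r\n(?:\|.*\r\n)+)\}\}")

# The pattern from the question, with the string-literal escaping removed.
QUESTION_RE = re.compile(r"\{\{Infobox[^{}]*\}\}")


def extract_infobox(wikitext):
    """Return the infobox body (without the outer braces), or None."""
    match = INFOBOX_RE.search(wikitext)
    return match.group(1) if match else None


infobox = extract_infobox(SAMPLE)
question_match = QUESTION_RE.search(SAMPLE)
```

Here `infobox` holds the five infobox lines, including the nested {{flagicon|United Kingdom}}, and stops before the Citation template. `question_match` is `None`: `[^{}]*` stops at the first nested `{`, so the question's pattern never reaches a closing `}}`.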
You may need to experiment with \r\n depending on your platform/language. Debuggex was fine with \r\n, but regex101.com would only match \n.
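One way to avoid guessing the line ending is to make the carriage return optional with `\r?\n`. The Python sketch below is a variation on the pattern above, not something the original answer proposed; `PORTABLE_INFOBOX_RE`, `extract_infobox_any_newline` and the short test strings are hypothetical names and data chosen for illustration.

```python
import re

# Same structure as the pattern above, but \r is optional, so both
# \n and \r\n line endings are accepted.
PORTABLE_INFOBOX_RE = re.compile(r"\{\{(Infobox.*\r?\n(?:\|.*\r?\n)+)\}\}")

# The strict variant that only accepts \r\n, kept for comparison.
STRICT_INFOBOX_RE = re.compile(r"\{\{(Infobox.*\r\n(?:\|.*\r\n)+)\}\}")


def extract_infobox_any_newline(wikitext):
    """Return the infobox body for either newline style, or None."""
    match = PORTABLE_INFOBOX_RE.search(wikitext)
    return match.group(1) if match else None


# Made-up miniature inputs: the same text with Unix and Windows newlines.
unix_text = "{{Infobox Demo\n|Name = A {{flagicon|X}}\n|Nations = 45\n}}\ntrailing"
windows_text = unix_text.replace("\n", "\r\n")

unix_body = extract_infobox_any_newline(unix_text)
windows_body = extract_infobox_any_newline(windows_text)
strict_on_unix = STRICT_INFOBOX_RE.search(unix_text)  # finds nothing
```

Both variants still rely on every infobox line starting with | and on the closing }} starting its own line. A field value that wraps onto a continuation line without a leading | would end the match early.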