Using BeautifulSoup to find an HTML tag that contains certain text

一尘不染

python

I am trying to get the elements in an HTML document that contain the following text pattern: #\S{11}

<h2> this is cool #12345678901 </h2>

The element above can be matched with:

soup('h2',text=re.compile(r' #\S{11}'))

The result would be something like this:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

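This gives me all the matching text (see the line above). What I want instead is the parent element of each matching text, so that I can use it as a starting point for traversing the document tree. In this case, I would like all the h2 elements returned, not the text matches.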

Any ideas?

2020-12-20

1 answer

一尘不染

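import re

# BeautifulSoup 4 import; the original answer used the BeautifulSoup 3 package.
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")

# Match text nodes against the pattern, then step up to the enclosing tag.
# No tag name is given, so a match inside any element is returned.
# (string= is the bs4 name for the older text= argument.)
for elem in soup(string=re.compile(r' #\S{11}')):
    print(elem.parent)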

Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
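Since the question asks for h2 elements only, the search can be narrowed. The following is a minimal sketch, assuming BeautifulSoup 4 (the `bs4` package) and the built-in `html.parser`. The names `via_parent` and `direct` are illustrative and not part of the original answer. It shows two options: filtering the parents by tag name, or passing the tag name together with `string=`, which in bs4 returns the tags whose `.string` matches.

```python
import re

from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

pattern = re.compile(r" #\S{11}")
soup = BeautifulSoup(html_text, "html.parser")

# Option 1: match text nodes, then keep only parents that are <h2> tags.
via_parent = [
    s.parent for s in soup.find_all(string=pattern) if s.parent.name == "h2"
]

# Option 2 (bs4 behavior): a tag name combined with string= returns
# the matching tags themselves rather than their text.
direct = soup.find_all("h2", string=pattern)

for tag in direct:
    print(tag)
```

Both options skip the h1 element. Option 2 relies on each heading having a single text child, because it compares against the tag's `.string`.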

2020-12-20
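If a heading contains nested markup, its `.string` is None and the text may sit inside a child tag, so a function filter on the combined text may be more robust. This is only a sketch under assumptions: the markup below and the helper `h2_with_pattern` are hypothetical, and BeautifulSoup 4 is assumed. It also shows a matched heading being used as a starting point for traversal, as the question intends.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: headings with nested tags, each followed by a paragraph.
html_text = """
<h2>this is <b>cool</b> #12345678901</h2>
<p>first body</p>
<h2>this is nothing</h2>
<p>second body</p>
<h2><a href="#x">linked #124445678901</a></h2>
<p>third body</p>
"""

pattern = re.compile(r" #\S{11}")
soup = BeautifulSoup(html_text, "html.parser")


def h2_with_pattern(tag):
    """Return True for <h2> tags whose combined text matches the pattern."""
    return tag.name == "h2" and pattern.search(tag.get_text()) is not None


headings = soup.find_all(h2_with_pattern)

# Use each matched heading as a starting point: take the next <p> sibling.
bodies = [h.find_next_sibling("p").get_text() for h in headings]
print(bodies)
```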