BeautifulSoup: just get the text inside a tag, no matter how many enclosing tags there are

一尘不染

python

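I'm trying to scrape all the inner HTML from the <p> elements in a web page using BeautifulSoup. The elements contain inner tags, but I don't care about those; I just want the inner text.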

For example, given:

<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>

how would I extract:

Red
Blue
Yellow
Light green

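Neither .string nor .contents[0] does what I need. Nor does .extract(), because I don't want to have to specify the inner tags in advance; I want to handle whatever may occur.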

Is there a "just get the visible HTML" type of method in BeautifulSoup?

------ UPDATE ------

Following the advice, I tried:

soup = BeautifulSoup(open("test.html"))
p_tags = soup.findAll('p',text=True)
for i, p_tag in enumerate(p_tags): 
    print str(i) + p_tag

But that doesn't help. It prints out:

0Red
1

2Blue
3

4Yellow
5

6Light 
7green
8

2020-12-20

1 answer

一尘不染

Short answer: soup.findAll(text=True)

This has already been answered here on Stack Overflow and in the BeautifulSoup documentation.
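
For readers on a current install, here is a rough sketch of what the short answer returns. It assumes the bs4 package (BeautifulSoup 4) rather than the BeautifulSoup 3 module used in this thread, and uses find_all(string=True), the bs4 spelling of findAll(text=True). The html variable is just the sample markup from the question. Note that the whitespace between the <p> elements comes back as text nodes too, which is likely why the attempt in the question printed blank entries.

```python
from bs4 import BeautifulSoup  # assumption: bs4 is installed (pip install beautifulsoup4)

# Sample markup taken from the question.
html = """<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

soup = BeautifulSoup(html, "html.parser")

# Every text node in the document, regardless of which tag encloses it.
# The newlines between the <p> elements are text nodes as well.
text_nodes = [str(node) for node in soup.find_all(string=True)]

for i, node in enumerate(text_nodes):
    print(i, repr(node))
```

Called on the whole document like this, the search does not group text by paragraph; that grouping has to come from iterating over the <p> tags first.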

UPDATE:

To clarify, here is a working piece of code:

>>> txt = """\
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.0.7a'
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for node in soup.findAll('p'):
...     print ''.join(node.findAll(text=True))

Red
Blue
Yellow
Light green
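
The session above uses Python 2 and BeautifulSoup 3.0.7a. As a hedged sketch of the same idea on Python 3, assuming the bs4 package is available, Tag.get_text() joins all the text nodes under a tag in one call. The helper name paragraph_texts is made up for this illustration and is not part of the library.

```python
from bs4 import BeautifulSoup  # assumption: bs4 is installed (pip install beautifulsoup4)


def paragraph_texts(html):
    """Return the text of each <p> element, ignoring any nested tags."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() concatenates every text node beneath the tag, so nested
    # <i>, <b>, and so on do not need to be named in advance.
    return [p.get_text() for p in soup.find_all("p")]


# Sample markup taken from the question.
html = """<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

for line in paragraph_texts(html):
    print(line)
```

This is equivalent in spirit to ''.join(node.findAll(text=True)) from the answer: look up the <p> tags first, then flatten the text inside each one.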

2020-12-20