一尘不染

How do I wrap a new tag around multiple tags with BeautifulSoup?

py

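In the example below, I am trying to wrap a <content> tag around all the <p> tags in a section. Each section is inside an <item>, but the <title> needs to stay outside the <content>. How can I do this?

Source file: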

<item>
<title>Heading for Sec 1</title>
    <p>some text sec 1</p>
    <p>some text sec 1</p>
    <p>some text sec 1</p>
</item>

<item>
<title>Heading for Sec 2</title>
    <p>some text sec 2</p>
    <p>some text sec 2</p>
    <p>some text sec 2</p>
</item>

<item>
<title>Heading for Sec 3</title>
    <p>some text sec 3</p>
    <p>some text sec 3</p>
</item>

I want this output:

<item>
<title>Heading for Sec 1</title>
    <content>
    <p>some text sec 1</p>
    <p>some text sec 1</p>
    <p>some text sec 1</p>
    </content>
</item>

<item>
<title>Heading for Sec 2</title>
    <content>
    <p>some text sec 2</p>
    <p>some text sec 2</p>
    <p>some text sec 2</p>
    </content>
</item>

<item>
<title>Heading for Sec 3</title>
    <content>
    <p>some text sec 3</p>
    <p>some text sec 3</p>
    </content>
</item>

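The code below is what I have tried. However, it wraps a <content> tag around each individual <p> tag, rather than one <content> tag around all the <p> tags in a section. How can I fix this?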

from bs4 import BeautifulSoup
with open('testdoc.txt', 'r') as f:
    soup = BeautifulSoup(f, "html.parser")

content = None
for tag in soup.select("p"):  
    if tag.name == "p":
        content = tag.wrap(soup.new_tag("content"))
        content.append(tag)
        continue

print(soup)

2022-10-01

1 answer

一尘不染

Try:

from bs4 import BeautifulSoup

html_doc = """\
<item>
<title>Heading for Sec 1</title>
    <p>some text sec 1</p>
    <p>some text sec 1</p>
    <p>some text sec 1</p>
</item>

<item>
<title>Heading for Sec 2</title>
    <p>some text sec 2</p>
    <p>some text sec 2</p>
    <p>some text sec 2</p>
</item>

<item>
<title>Heading for Sec 3</title>
    <p>some text sec 3</p>
    <p>some text sec 3</p>
</item>"""


soup = BeautifulSoup(html_doc, "html.parser")

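for item in soup.select("item"):
    # Create the wrapper and place it right after <title>
    content = soup.new_tag("content")
    content.append("\n")
    item.title.insert_after(content)
    item.title.insert_after("\n")

    # Move every <p> of this section into the wrapper
    for p in item.select("p"):
        content.append(p)
        content.append("\n")

    # Merge the leftover whitespace strings, then normalize each to one newline
    item.smooth()
    for text_node in item.find_all(string=True, recursive=False):
        text_node.replace_with("\n")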

print(soup)

Prints:

<item>
<title>Heading for Sec 1</title>
<content>
<p>some text sec 1</p>
<p>some text sec 1</p>
<p>some text sec 1</p>
</content>
</item>
<item>
<title>Heading for Sec 2</title>
<content>
<p>some text sec 2</p>
<p>some text sec 2</p>
<p>some text sec 2</p>
</content>
</item>
<item>
<title>Heading for Sec 3</title>
<content>
<p>some text sec 3</p>
<p>some text sec 3</p>
</content>
</item>
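
If you would rather stay close to the original attempt, a variant built on wrap() may also work: wrap only the first <p> of each section, then move the remaining <p> tags into the wrapper that wrap() returns. This is a minimal sketch, not a definitive solution. The sample document is shortened, and the names paragraphs and wrapper are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened sample, assumed to have the same shape as the source file
html_doc = """\
<item>
<title>Heading for Sec 1</title>
    <p>some text sec 1</p>
    <p>some text sec 1</p>
</item>

<item>
<title>Heading for Sec 2</title>
    <p>some text sec 2</p>
</item>"""

soup = BeautifulSoup(html_doc, "html.parser")

for item in soup.find_all("item"):
    # Only the <p> tags that are direct children of this <item>
    paragraphs = item.find_all("p", recursive=False)
    if not paragraphs:
        continue  # nothing to wrap in this section

    # wrap() replaces the first <p> with <content> and returns the wrapper
    wrapper = paragraphs[0].wrap(soup.new_tag("content"))

    # append() detaches each remaining <p> from <item> and moves it into the wrapper
    for p in paragraphs[1:]:
        wrapper.append(p)

print(soup)
```

This version does not tidy whitespace: the newlines and indentation that used to sit between the <p> tags stay behind in <item>, so the printed layout will differ from the output above even though the tag structure is the same.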
2022-10-01