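In the example below, I am trying to wrap a single <content> tag around all the <p> tags in a section. Each section sits inside an <item>, but the <title> needs to stay outside the <content>. How can I do this?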
Source file:
<item>
<title>Heading for Sec 1</title>
<p>some text sec 1</p>
<p>some text sec 1</p>
<p>some text sec 1</p>
</item>
<item>
<title>Heading for Sec 2</title>
<p>some text sec 2</p>
<p>some text sec 2</p>
<p>some text sec 2</p>
</item>
<item>
<title>Heading for Sec 3</title>
<p>some text sec 3</p>
<p>some text sec 3</p>
</item>
I want this output:
<item>
<title>Heading for Sec 1</title>
<content>
<p>some text sec 1</p>
<p>some text sec 1</p>
<p>some text sec 1</p>
</content>
</item>
<item>
<title>Heading for Sec 2</title>
<content>
<p>some text sec 2</p>
<p>some text sec 2</p>
<p>some text sec 2</p>
</content>
</item>
<item>
<title>Heading for Sec 3</title>
<content>
<p>some text sec 3</p>
<p>some text sec 3</p>
</content>
</item>
The code below is what I am trying. However, it wraps a <content> tag around each <p> tag rather than around all the <p> tags in a section. How can I fix this?
from bs4 import BeautifulSoup
with open('testdoc.txt', 'r') as f:
    soup = BeautifulSoup(f, "html.parser")
content = None
for tag in soup.select("p"):
    if tag.name == "p":
        content = tag.wrap(soup.new_tag("content"))
        content.append(tag)
        continue
print(soup)
Try:
from bs4 import BeautifulSoup
html_doc = """\
<item>
<title>Heading for Sec 1</title>
<p>some text sec 1</p>
<p>some text sec 1</p>
<p>some text sec 1</p>
</item>
<item>
<title>Heading for Sec 2</title>
<p>some text sec 2</p>
<p>some text sec 2</p>
<p>some text sec 2</p>
</item>
<item>
<title>Heading for Sec 3</title>
<p>some text sec 3</p>
<p>some text sec 3</p>
</item>"""
soup = BeautifulSoup(html_doc, "html.parser")
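for item in soup.select("item"):
    # create one <content> wrapper per <item> and place it right after <title>
    t = soup.new_tag("content")
    t.append("\n")
    item.title.insert_after(t)
    item.title.insert_after("\n")
    # move every <p> of this section into the wrapper, one per line
    for p in item.select("p"):
        t.append(p)
        t.append("\n")
    # merge the leftover adjacent whitespace strings, then collapse each to one newline
    item.smooth()
    for s in item.find_all(string=True, recursive=False):
        s.replace_with("\n")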
print(soup)
Prints:
<item>
<title>Heading for Sec 1</title>
<content>
<p>some text sec 1</p>
<p>some text sec 1</p>
<p>some text sec 1</p>
</content>
</item>
<item>
<title>Heading for Sec 2</title>
<content>
<p>some text sec 2</p>
<p>some text sec 2</p>
<p>some text sec 2</p>
</content>
</item>
<item>
<title>Heading for Sec 3</title>
<content>
<p>some text sec 3</p>
<p>some text sec 3</p>
</content>
</item>
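An alternative, closer to the code in the question, is to keep using wrap() but call it only on the first <p> of each <item> and then append the remaining <p> tags to the wrapper it returns. The following is only a sketch: the helper name wrap_paragraphs and the shortened sample document are made up for illustration, and it assumes the <p> tags are direct children of <item>. It does not tidy whitespace, so the printed layout will differ from the output above.

```python
from bs4 import BeautifulSoup

# Shortened sample input (illustrative only)
html_doc = """\
<item>
<title>Heading for Sec 1</title>
<p>some text sec 1</p>
<p>some text sec 1</p>
</item>
<item>
<title>Heading for Sec 2</title>
<p>some text sec 2</p>
</item>"""


def wrap_paragraphs(soup):
    """Wrap the direct <p> children of each <item> in a single <content> tag.

    Hypothetical helper, not part of Beautiful Soup.
    """
    for item in soup.find_all("item"):
        paragraphs = item.find_all("p", recursive=False)
        if not paragraphs:
            continue  # nothing to wrap in this section
        # wrap() puts <content> where the first <p> was and returns the wrapper
        content = paragraphs[0].wrap(soup.new_tag("content"))
        # append() moves each remaining <p> out of <item> and into <content>
        for p in paragraphs[1:]:
            content.append(p)
    return soup


soup = wrap_paragraphs(BeautifulSoup(html_doc, "html.parser"))
print(soup)
```

Because the wrapper is created once per <item> rather than once per <p>, the <title> stays outside <content> and each section ends up with exactly one wrapper.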