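In the example below, I am trying to wrap a <content> tag around all of the <p> tags in a section. Each section is inside an <item>, but the <title> needs to stay outside of <content>. How can I do this?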
Source file:

<item>
<title>Heading for Sec 1</title>
<p>some text sec 1</p>
<p>some text sec 1</p>
<p>some text sec 1</p>
</item>
<item>
<title>Heading for Sec 2</title>
<p>some text sec 2</p>
<p>some text sec 2</p>
<p>some text sec 2</p>
</item>
<item>
<title>Heading for Sec 3</title>
<p>some text sec 3</p>
<p>some text sec 3</p>
</item>
I want this output:

<item>
<title>Heading for Sec 1</title>
<content>
<p>some text sec 1</p>
<p>some text sec 1</p>
</content>
</item>
<item>
<title>Heading for Sec 2</title>
<content>
<p>some text sec 2</p>
<p>some text sec 2</p>
<p>some text sec 2</p>
</content>
</item>
<item>
<title>Heading for Sec 3</title>
<content>
<p>some text sec 3</p>
<p>some text sec 3</p>
</content>
</item>
The code below is what I am trying. However, it wraps a <content> tag around each individual <p> tag, rather than around all of the <p> tags in a section. How can I fix this?

from bs4 import BeautifulSoup

with open('testdoc.txt', 'r') as f:
    soup = BeautifulSoup(f, "html.parser")

content = None
for tag in soup.select("p"):
    if tag.name == "p":
        content = tag.wrap(soup.new_tag("content"))
        content.append(tag)
        continue

print(soup)
Try:

from bs4 import BeautifulSoup

html_doc = """\
<item>
<title>Heading for Sec 1</title>
<p>some text sec 1</p>
<p>some text sec 1</p>
<p>some text sec 1</p>
</item>
<item>
<title>Heading for Sec 2</title>
<p>some text sec 2</p>
<p>some text sec 2</p>
<p>some text sec 2</p>
</item>
<item>
<title>Heading for Sec 3</title>
<p>some text sec 3</p>
<p>some text sec 3</p>
</item>"""

soup = BeautifulSoup(html_doc, "html.parser")

for item in soup.select("item"):
    # Create one <content> per <item> and place it right after the <title>.
    t = soup.new_tag("content")
    t.append("\n")
    item.title.insert_after(t)
    item.title.insert_after("\n")

    # Move every <p> of this item into the new <content> tag.
    for p in item.select("p"):
        t.append(p)
        t.append("\n")

    # Merge the leftover whitespace strings and collapse each to one newline.
    item.smooth()
    for text_node in item.find_all(string=True, recursive=False):
        text_node.replace_with("\n")

print(soup)
Prints:

<item>
<title>Heading for Sec 1</title>
<content>
<p>some text sec 1</p>
<p>some text sec 1</p>
<p>some text sec 1</p>
</content>
</item>
<item>
<title>Heading for Sec 2</title>
<content>
<p>some text sec 2</p>
<p>some text sec 2</p>
<p>some text sec 2</p>
</content>
</item>
<item>
<title>Heading for Sec 3</title>
<content>
<p>some text sec 3</p>
<p>some text sec 3</p>
</content>
</item>
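If the whitespace layout does not matter, a shorter variant may also work. This is only a sketch and not the code from the answer above: it wraps the first <p> of each <item> with `wrap()` and then moves the remaining <p> tags into that same wrapper with `append()`. The helper name `wrap_paragraphs` is made up for illustration, and the sketch assumes the <p> tags are direct children of <item>.

```python
from bs4 import BeautifulSoup


def wrap_paragraphs(markup: str) -> str:
    """Hypothetical helper: put all <p> children of each <item> into one <content>."""
    soup = BeautifulSoup(markup, "html.parser")
    for item in soup.find_all("item"):
        # Only direct children, so nested structures are left alone (assumption).
        paragraphs = item.find_all("p", recursive=False)
        if not paragraphs:
            continue
        # wrap() replaces the first <p> with <content> and returns the wrapper.
        content = paragraphs[0].wrap(soup.new_tag("content"))
        # append() moves each remaining <p> out of <item> and into <content>.
        for p in paragraphs[1:]:
            content.append(p)
    return str(soup)


result = wrap_paragraphs("<item><title>T</title><p>a</p><p>b</p></item>")
print(result)
```

Unlike the loop in the question, only one <content> tag is created per <item>, so the <title> stays outside of it. Newlines between the original tags are left where they were, so the output of a multi-line document will not be as tidy as the one printed above.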