我们可以将xpath与BeautifulSoup一起使用吗？

一尘不染

我们可以将xpath与BeautifulSoup一起使用吗？

python

我正在使用BeautifulSoup抓取网址，并且我有以下代码

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})

现在在上面的代码中，我们可以findAll用来获取标签和与其相关的信息，但是我想使用xpath。是否可以将xpath与BeautifulSoup一起使用？如果可能的话，任何人都可以给我提供示例代码，以使其更有帮助吗？

阅读 171

2020-12-20

共1个答案

一尘不染

不，BeautifulSoup本身不支持XPath表达式。

另一种库，LXML，不支持的XPath
1.0。它具有BeautifulSoup兼容模式，它将以Soup的方式尝试解析损坏的HTML。但是，默认的lxml
HTML解析器可以很好地完成解析损坏的HTML的工作，而且我相信它的速度更快。

将文档解析为lxml树后，就可以使用该.xpath()方法搜索元素。

try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen
from lxml import etree

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)

还有一个带有附加功能的专用lxml.html()模块。

请注意，在上面的示例中，我将response对象直接传递给lxml，因为直接从流中读取解析器比将响应首先读取到大字符串中更有效。要对requests库执行相同的操作，您需要在启用透明传输解压缩后设置stream=True并传递response.raw对象：)

import lxml.html
import requests

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)

您可能会感兴趣的是CSS选择器支持；在CSSSelector类转换CSS语句转换为XPath表达式，使您的搜索td.empformbody更加容易：

from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.

即将来临：BeautifulSoup本身确实
具有非常完整的CSS选择器支持：

for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.

2020-12-20