I use selenium to click through to the web page I need, and then parse that page with Beautiful Soup.
Someone has shown how to get the inner HTML of an element in Selenium WebDriver. Is there a way to get the HTML of the whole page? Thanks.
Sample code in Python (based on the post above; the language does not seem to matter much):
    from selenium import webdriver
    from selenium.webdriver.support.ui import Select
    from bs4 import BeautifulSoup

    url = 'http://www.google.com'
    driver = webdriver.Firefox()
    driver.get(url)

    the_html = driver---somehow----.get_attribute('innerHTML')
    bs = BeautifulSoup(the_html, 'html.parser')
To get the HTML of the whole page:
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("http://stackoverflow.com")

    html = driver.page_source
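To tie this back to the question, here is a minimal sketch of passing `page_source` on to Beautiful Soup. The helper names `fetch_page_html` and `open_and_parse` are made up for illustration, and the sketch assumes Firefox with a working geckodriver plus the `bs4` package. The third-party imports sit inside `open_and_parse` only so that the small helper can be used on its own.

```python
def fetch_page_html(driver, url):
    """Load `url` in an existing WebDriver and return the full page HTML."""
    driver.get(url)
    # page_source is the browser's serialization of the current page
    return driver.page_source


def open_and_parse(url):
    """Hypothetical convenience wrapper: open Firefox, fetch, parse, clean up."""
    # Imported here so fetch_page_html stays usable without these packages
    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.Firefox()
    try:
        html = fetch_page_html(driver, url)
    finally:
        # Always close the browser, even if loading the page fails
        driver.quit()
    return BeautifulSoup(html, "html.parser")
```

Calling `open_and_parse('http://www.google.com')` should give a `BeautifulSoup` object for the page as the browser currently sees it. In the question's workflow you would keep your own driver, do the clicks first, and then call `fetch_page_html` or just read `driver.page_source` directly.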
To get the outer HTML (including the element's own tag):
    # Selenium 4.3+ removed find_element_by_css_selector;
    # use find_element(By.CSS_SELECTOR, ...) instead
    from selenium.webdriver.common.by import By

    # HTML from `<html>`
    html = driver.execute_script("return document.documentElement.outerHTML;")

    # HTML from `<body>`
    html = driver.execute_script("return document.body.outerHTML;")

    # HTML from element with some JavaScript
    element = driver.find_element(By.CSS_SELECTOR, "#hireme")
    html = driver.execute_script("return arguments[0].outerHTML;", element)

    # HTML from element with `get_attribute`
    element = driver.find_element(By.CSS_SELECTOR, "#hireme")
    html = element.get_attribute('outerHTML')
To get the inner HTML (excluding the element's own tag):
    # Selenium 4.3+ removed find_element_by_css_selector;
    # use find_element(By.CSS_SELECTOR, ...) instead
    from selenium.webdriver.common.by import By

    # HTML from `<html>`
    html = driver.execute_script("return document.documentElement.innerHTML;")

    # HTML from `<body>`
    html = driver.execute_script("return document.body.innerHTML;")

    # HTML from element with some JavaScript
    element = driver.find_element(By.CSS_SELECTOR, "#hireme")
    html = driver.execute_script("return arguments[0].innerHTML;", element)

    # HTML from element with `get_attribute`
    element = driver.find_element(By.CSS_SELECTOR, "#hireme")
    html = element.get_attribute('innerHTML')
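If you switch between the two forms often, the `get_attribute` variant can be wrapped in one small helper. This is only a sketch: `element_html` and its `include_tag` parameter are names invented here, and it assumes `element` is a WebElement you have already located.

```python
def element_html(element, include_tag=True):
    """Return an element's HTML, with or without its own enclosing tag."""
    # outerHTML includes the element's own tag, innerHTML only its contents
    attribute = "outerHTML" if include_tag else "innerHTML"
    return element.get_attribute(attribute)
```

For an element such as `#hireme`, `element_html(element)` returns the markup including the element's own tag, while `element_html(element, include_tag=False)` returns only what is inside it.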