一尘不染

Need to scrape a table loaded via AJAX using Python (Selenium)

selenium

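I have a page containing a table (table id="ctl00_ContentPlaceHolder_ctl00_ctl00_GV", class="GridListings") that I need to scrape. I normally use BeautifulSoup and urllib for this, but here the problem is that the table takes some time to load, so it is not captured when I try to fetch it with BS. Because of some installation issues I cannot use PyQt4, dryscrape, or windmill, so the only possible way is Selenium/PhantomJS. I tried the following, but still without success: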

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get(url)
wait = WebDriverWait(driver, 10)
# presence_of_element_located expects a single (By, selector) locator tuple
table = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table#ctl00_ContentPlaceHolder_ctl00_ctl00_GV')))

The code above does not give me the table contents I need. How can I achieve this?


2020-06-26

1 answer

一尘不染

You can get the data with requests and bs4. Almost all (if not all) ASP sites have a few POST parameters that always need to be provided, such as __EVENTTARGET, __EVENTVALIDATION, etc.:

from bs4 import BeautifulSoup
import requests

data = {"__EVENTTARGET": "ctl00$ContentPlaceHolder$ctl00$ctl00$RadAjaxPanel_GV",
    "__EVENTARGUMENT": "LISTINGS;0",
    "ctl00$ContentPlaceHolder$ctl00$ctl00$ctl00$hdnProductID": "139",
    "ctl00$ContentPlaceHolder$ctl00$ctl00$hdnProductID": "139",
    "ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortField": "Listing Number",
    "ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortDirection": "A-Z, Low-High",
    "__ASYNCPOST": "true"}

For the actual POST, we need to add a few more values to the form data:

post = "https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx"
with requests.Session() as s:
    s.headers.update({"User-Agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"})
    # Load the page once to read the hidden ASP.NET state fields.
    soup = BeautifulSoup(s.get(post).content, "html.parser")

    data["__VIEWSTATEGENERATOR"] = soup.select_one("#__VIEWSTATEGENERATOR")["value"]
    data["__EVENTVALIDATION"] = soup.select_one("#__EVENTVALIDATION")["value"]
    data["__VIEWSTATE"] = soup.select_one("#__VIEWSTATE")["value"]

    # Replay the AJAX postback in the same session and parse the response.
    r = s.post(post, data=data)
    soup2 = BeautifulSoup(r.content, "html.parser")
    table = soup2.select_one("div.GridListings")
    print(table)

When you run the code, you will see the table printed.
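If a page turns out to need more hidden state fields than the three copied above, one option is to collect every hidden input in a single pass. The following is only a minimal sketch using the standard library's html.parser instead of bs4; the names HiddenFieldParser, hidden_fields, and the sample markup are hypothetical and not taken from the site in question.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in an HTML string."""
    parser = HiddenFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical markup standing in for the real page source.
sample = """
<form>
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz" />
  <input type="text" name="q" value="ignored" />
</form>
"""
print(hidden_fields(sample))
```

Under the same assumptions, the result could be merged into the form data with something like data.update(hidden_fields(s.get(post).text)) before posting. Values scraped from the page would then override any identically named keys set by hand, so check the merged dict first.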

2020-06-26