我正在尝试使用scrapy从网页中抓取产品信息。我的待抓取网页如下所示:
我的蜘蛛非常标准,如下所示:
class ProductSpider(CrawlSpider): name = "product_spider" allowed_domains = ['example.com'] start_urls = ['http://example.com/shanghai'] rules = [ Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'), callback='parse_product'), ] def parse_product(self, response): self.log("parsing product %s" %response.url, level=INFO) hxs = HtmlXPathSelector(response) # actual data follows
任何想法表示赞赏。谢谢!
这实际上取决于你需要如何刮取网站以及你希望如何以及要获取什么数据。
这是一个示例,你可以使用Scrapy+ 跟踪eBay上的分页Selenium:
Scrapy
Selenium
import scrapy from selenium import webdriver class ProductSpider(scrapy.Spider): name = "product_spider" allowed_domains = ['ebay.com'] start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40'] def __init__(self): self.driver = webdriver.Firefox() def parse(self, response): self.driver.get(response.url) while True: next = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a') try: next.click() # get the data and write it to scrapy items except: break self.driver.close()
除了必须与结合使用之外Selenium,还有另一种选择Scrapy。