I need to set the referrer before crawling a website. The site uses referrer-based authentication, so if the referrer is not valid it will not let me log in.
Could someone tell me how to do this in Scrapy?
If you want to change the referrer on your spider's requests, you can change DEFAULT_REQUEST_HEADERS in the settings.py file:
DEFAULT_REQUEST_HEADERS = { 'Referer': 'http://www.google.com' }
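This applies the header to every request the spider makes. If you prefer not to edit the project-wide settings.py, the same setting can, as far as I know, also be overridden per spider through the custom_settings class attribute. A minimal sketch, where the spider name and URL are placeholders:

import scrapy

class ReferrerSpider(scrapy.Spider):
    # Hypothetical spider for illustration; name and URL are placeholders.
    name = "referrer_spider"
    start_urls = ['http://example.com/foo']

    # Per-spider override of the project-wide setting. Note that this
    # replaces the default headers dict, so add anything else you still need.
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Referer': 'http://www.google.com',
        },
    }

    def parse(self, response):
        self.logger.info("Crawled %s", response.url)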
If you want to set the referrer only on specific requests, you can pass it through the headers argument of each Request, for example by overriding start_requests:

from scrapy.spiders import CrawlSpider
from scrapy import Request

class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://example.com/foo',
        'http://example.com/bar',
        'http://example.com/baz',
    ]
    rules = [(...)]

    def start_requests(self):
        requests = []
        for item in self.start_urls:
            # Attach the Referer header to each start request
            requests.append(Request(url=item, headers={'Referer': 'http://www.example.com/'}))
        return requests

    def parse_me(self, response):
        (...)
This will produce the following log lines in your terminal:
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/foo> (referer: http://www.example.com/)
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/bar> (referer: http://www.example.com/)
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/baz> (referer: http://www.example.com/)
(...)
The same approach works with BaseSpider. After all, start_requests is a BaseSpider method, which CrawlSpider inherits.
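For reference, here is roughly how the same pattern looks with a plain scrapy.Spider (the modern replacement for BaseSpider). This is only a sketch, with placeholder name and URLs:

import scrapy
from scrapy import Request

class MyBaseSpider(scrapy.Spider):
    # Placeholder name and URLs, for illustration only.
    name = "mybasespider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://example.com/foo',
        'http://example.com/bar',
    ]

    def start_requests(self):
        # Same idea as above: attach the Referer header to every start request.
        for url in self.start_urls:
            yield Request(url=url,
                          headers={'Referer': 'http://www.example.com/'},
                          callback=self.parse)

    def parse(self, response):
        self.logger.debug("Crawled %s (referer: %s)", response.url,
                          response.request.headers.get('Referer'))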