How do I use proxy support with Scrapy, the Python web scraping framework?
Can Scrapy be used with HTTP proxies? Yes. Support for HTTP proxies has been provided (since Scrapy 0.8) through the HTTP proxy downloader middleware; see HttpProxyMiddleware.
The easiest way to use a proxy is to set the http_proxy environment variable. How that is done depends on your shell:
C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port
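As a quick sanity check (a minimal sketch, not part of Scrapy itself), you can confirm the variable is visible to Python: HttpProxyMiddleware collects proxies from the environment via the same standard-library mechanism urllib uses.

import urllib.request

# HttpProxyMiddleware reads proxy settings from the environment through
# urllib's getproxies(), so this shows what Scrapy will pick up.
print(urllib.request.getproxies())
# e.g. {'http': 'http://proxy:port'}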
If you want to go through an HTTPS proxy and access HTTPS sites, set the https_proxy environment variable instead:
C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port
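If you would rather not depend on the shell, the same effect can be had from Python, provided the variable is set before the crawler starts (a minimal sketch; the proxy URL is a placeholder):

import os

# Placeholder endpoint; must be set before the crawler is started,
# because HttpProxyMiddleware reads the environment at startup.
os.environ['https_proxy'] = 'https://proxy:port'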
Single proxy
Enable HttpProxyMiddleware in your settings.py (in current Scrapy releases it is already enabled by default):
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
}
Then pass the proxy to each request via request.meta:
request = Request(url="http://example.com")
request.meta['proxy'] = "host:port"
yield request
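Put together, that looks like the sketch below. The spider name and start URL are illustrative placeholders, and in practice the proxy value should include a scheme, e.g. "http://host:port".

import scrapy

class SingleProxySpider(scrapy.Spider):
    # Name and start URL are placeholders for illustration.
    name = "single_proxy"

    def start_requests(self):
        request = scrapy.Request(url="http://example.com", callback=self.parse)
        # Include the scheme in the proxy URL, e.g. "http://host:port".
        request.meta['proxy'] = "http://host:port"
        yield request

    def parse(self, response):
        self.logger.info("fetched %s via proxy", response.url)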
Multiple proxies

If you have a pool of proxy addresses, you can also pick one at random for each request, like this:
import random

from scrapy import Request, Spider  # BaseSpider is the pre-0.22 name for Spider


class MySpider(Spider):
    name = "my_spider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Placeholder addresses; fill in your own pool.
        self.proxy_pool = ['proxy_address1', 'proxy_address2', ..., 'proxy_addressN']

    def parse(self, response):
        # ...parse code...
        if something:  # placeholder condition from the original example
            yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url)
        if self.proxy_pool:
            # Attach a randomly chosen proxy to this request.
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req
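A common alternative is to move the random choice into a small downloader middleware, so every request gets a proxy without touching spider code. The sketch below assumes a project module named myproject.middlewares; both the module path and the pool contents are placeholders. Setting request.meta['proxy'] explicitly takes precedence over the environment variables, and the order value just needs to be below 750 so it runs before the built-in HttpProxyMiddleware.

import random

class RandomProxyMiddleware:
    # Placeholder pool; use full URLs including the scheme.
    PROXY_POOL = ['http://proxy1:port', 'http://proxy2:port']

    def process_request(self, request, spider):
        # Leave the request alone if a proxy was already set explicitly.
        request.meta.setdefault('proxy', random.choice(self.PROXY_POOL))

# settings.py -- 543 is arbitrary, but it must be below 750 so this
# middleware runs before the built-in HttpProxyMiddleware:
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.RandomProxyMiddleware': 543,
# }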