I need help setting up Tor on Ubuntu and using it with the Scrapy framework.
I did some research and found this guide:
import telnetlib
import time

from scrapy import log
from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware


class RetryChangeProxyMiddleware(RetryMiddleware):
    def _retry(self, request, reason, spider):
        log.msg('Changing proxy')
        # Talk to Tor's control port and request a new circuit (NEWNYM).
        tn = telnetlib.Telnet('127.0.0.1', 9051)
        tn.read_until("Escape character is '^]'.", 2)
        tn.write('AUTHENTICATE "267765"\r\n')
        tn.read_until("250 OK", 2)
        tn.write("signal NEWNYM\r\n")
        tn.read_until("250 OK", 2)
        tn.write("quit\r\n")
        tn.close()
        # Give Tor a moment to build the new circuit before retrying.
        time.sleep(3)
        log.msg('Proxy changed')
        return RetryMiddleware._retry(self, request, reason, spider)
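For what it's worth, the same NEWNYM signal can also be sent with the stem library instead of raw telnet. This is only a minimal sketch, assuming Tor's control port is 9051 and the password is the one used in the snippet above:

from stem import Signal
from stem.control import Controller

# Ask Tor for a new identity; equivalent to the AUTHENTICATE/NEWNYM telnet exchange above.
with Controller.from_port(port=9051) as controller:
    controller.authenticate(password='267765')
    controller.signal(Signal.NEWNYM)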
Then enable it in settings.py (note that the setting is DOWNLOADER_MIDDLEWARES, plural):

DOWNLOADER_MIDDLEWARES = {
    'spider.middlewares.RetryChangeProxyMiddleware': 600,
}
Then, if you just want to send the requests through the local Tor proxy (Polipo), that can be done with:
tsocks scrapy crawl spider
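As an alternative to tsocks (just a sketch, not part of the guide I found): you can point Scrapy at Polipo directly with a small proxy middleware. The address 127.0.0.1:8123 is Polipo's default and is an assumption here.

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route every request through the local Polipo HTTP proxy,
        # which in turn forwards the traffic to Tor's SOCKS port.
        # The address is assumed; adjust it to your Polipo configuration.
        request.meta['proxy'] = 'http://127.0.0.1:8123'

Enable it in DOWNLOADER_MIDDLEWARES next to the retry middleware above; Scrapy's built-in HttpProxyMiddleware picks up request.meta['proxy'].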
Can anyone confirm that this approach works and that you actually end up with a different IP?
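One way to check this yourself (a minimal sketch; the spider name and the httpbin.org/ip URL are just examples, not from the guide) is a throwaway spider that echoes the exit IP:

from scrapy.spider import BaseSpider

class CheckIPSpider(BaseSpider):
    name = 'checkip'
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        # httpbin returns the IP the request came from; run the crawl twice,
        # forcing a NEWNYM in between, and compare the two responses.
        self.log('Exit IP response: %s' % response.body)

Run it with tsocks scrapy crawl checkip before and after signalling NEWNYM and compare the output.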
You can use this middleware to have a random user agent for every request the spider makes.

# You can define a USER_AGENT_LIST in your settings and the spider will choose
# a random user agent from that list every time.
#
# You will have to disable the default user agent middleware and add this to
# your settings file:
#
# DOWNLOADER_MIDDLEWARES = {
#     'scraper.random_user_agent.RandomUserAgentMiddleware': 400,
#     'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
# }

import random

from scraper.settings import USER_AGENT_LIST
from scrapy import log


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random user agent from the configured list for each request.
        ua = random.choice(USER_AGENT_LIST)
        if ua:
            request.headers.setdefault('User-Agent', ua)
        #log.msg('>>>> UA %s' % request.headers)

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: dushyant
# date  : Sep 16, 2011
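For completeness, here is a hypothetical USER_AGENT_LIST in settings.py that the middleware above would draw from; the strings are only examples, use whatever agents you need:

# Example user agents for RandomUserAgentMiddleware to choose from (illustrative only).
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
]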