Trying to work my way through Scrapy, but I've hit a few dead ends.
I have two tables on a page and would like to extract the data from each one, then move on to the next page.
The tables look like this (the first is called Y1, the second Y2) and have identical structures.
<div id="Y1" style="margin-bottom: 0px; margin-top: 15px;">
  <h2>First information</h2><hr style="margin-top: 5px; margin-bottom: 10px;">
  <table class="table table-striped table-hover table-curved">
    <thead>
      <tr>
        <th class="tCol1" style="padding: 10px;">First Col Head</th>
        <th class="tCol2" style="padding: 10px;">Second Col Head</th>
        <th class="tCol3" style="padding: 10px;">Third Col Head</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Info 1</td>
        <td>Monday 5 September, 2016</td>
        <td>Friday 21 October, 2016</td>
      </tr>
      <tr class="vevent">
        <td class="summary"><b>Info 2</b></td>
        <td class="dtstart" timestamp="1477094400"><b></b></td>
        <td class="dtend" timestamp="1477785600"><b>Sunday 30 October, 2016</b></td>
      </tr>
      <tr>
        <td>Info 3</td>
        <td>Monday 31 October, 2016</td>
        <td>Tuesday 20 December, 2016</td>
      </tr>
      <tr class="vevent">
        <td class="summary"><b>Info 4</b></td>
        <td class="dtstart" timestamp="1482278400"><b>Wednesday 21 December, 2016</b></td>
        <td class="dtend" timestamp="1483315200"><b>Monday 2 January, 2017</b></td>
      </tr>
    </tbody>
  </table>
As you can see, the structure is a little inconsistent, but as long as I can get every td and output it to CSV, I'll be happy.
I tried XPath, but that only confused me more.
My latest attempt:
import scrapy

class myScraperSpider(scrapy.Spider):
    name = "myScraper"
    allowed_domains = ["mysite.co.uk"]
    start_urls = (
        'https://mysite.co.uk/page1/',
    )

    def parse_products(self, response):
        products = response.xpath('//*[@id="Y1"]/table')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[1]').extract()[0]
            item['first'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[2]').extract()[0]
            item['last'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[3]').extract()[0]
            yield item
There are no errors here, but it just fires back lots of information about the crawl and no actual results.
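One way to narrow this down is to test the selectors interactively in scrapy shell before running the spider (a sketch; the URL and the Y1 id are the placeholders used above):

# In a terminal:
#   scrapy shell 'https://mysite.co.uk/page1/'
# Then, at the prompt:
rows = response.xpath('//*[@id="Y1"]/table//tr')
len(rows)                                   # number of <tr> matched, header row included
rows[1].xpath('td[1]//text()').extract()    # text nodes of the first data cell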
Update:
import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = (
        'https://termdates.co.uk/school-holidays-16-19-abingdon/',
    )

    def parse_products(self, response):
        products = sel.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[1]/text()').extract_first()
            item['last'] = p.xpath('td[1]/text()').extract_first()
            yield item
This gives me: IndentationError: unexpected indent
If I run the revised script below (thanks to @Granitosaurus) to output to CSV (-o schoolDates.csv), I get an empty file:
import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse_products(self, response):
        products = sel.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[1]/text()').extract_first()
            item['last'] = p.xpath('td[1]/text()').extract_first()
            yield item
This is the log:
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Spider opened
2017-03-23 12:04:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-23 12:04:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on ...
2017-03-23 12:04:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://termdates.co.uk/robots.txt> (referer: None)
2017-03-23 12:04:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://termdates.co.uk/school-holidays-16-19-abingdon/> (referer: None)
2017-03-23 12:04:08 [scrapy.core.scraper] ERROR: Spider error processing <GET https://termdates.co.uk/school-holidays-16-19-abingdon/> (referer: None)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\python27\lib\site-packages\scrapy-1.3.3-py2.7.egg\scrapy\spiders\__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-23 12:04:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 467,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 11311,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 3, 23, 12, 4, 8, 845000),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/NotImplementedError': 1,
 'start_time': datetime.datetime(2017, 3, 23, 12, 4, 8, 356000)}
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Spider closed (finished)
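Two things in that spider explain the traceback and the empty CSV: Scrapy delivers the responses for start_urls to a callback named parse() by default, and the spider only defines parse_products, so the base Spider.parse raises NotImplementedError; inside the method, sel is also undefined where response.xpath(...) is meant. A minimal sketch of the corrected callback (essentially the same change the Update 2 code below makes):

    def parse(self, response):                     # default callback name Scrapy looks for
        products = response.xpath('//*[@id="Year1"]/table//tr')   # response, not the undefined sel
        for p in products[1:]:                     # skip the header row
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[2]/text()').extract_first()
            item['last'] = p.xpath('td[3]/text()').extract_first()
            yield item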
Update 2: (skipping rows) This pushes results to a CSV file, but skips every other row.
The shell shows {'hol': None, 'last': u'\r\n\t\t\t\t\t\t\t\t', 'first': None}
import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[2]/text()').extract_first()
            item['last'] = p.xpath('td[3]/text()').extract_first()
            yield item
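The skipped rows correspond to the <tr class="vevent"> rows in the sample HTML: their cell text sits inside nested <b> tags (plus whitespace), so td[n]/text() only sees the whitespace text node. Descending with //text() and joining the pieces picks those cells up as well; a sketch of that adjustment inside parse(), replacing the loop above (the final solution below uses the same idea):

        for p in products[1:]:
            item = dict()
            # //text() descends into the <b> wrappers; join + strip removes the \r\n\t padding
            item['hol'] = ''.join(p.xpath('td[1]//text()').extract()).strip()
            item['first'] = ''.join(p.xpath('td[2]//text()').extract()).strip()
            item['last'] = ''.join(p.xpath('td[3]//text()').extract()).strip()
            yield item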
Solution: This crawls all the pages in start_urls and handles the inconsistent table layouts
# -*- coding: utf-8 -*-
import scrapy
from SchoolDates_1.items import Schooldates1Item

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',
                  'https://termdates.co.uk/school-holidays-3-dimensions',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('td[1]//text()').extract_first()
            item['first'] = product.xpath('td[2]//text()').extract_first()
            item['last'] = ''.join(product.xpath('td[3]//text()').extract()).strip()
            item['url'] = response.url
            yield item
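The solution imports Schooldates1Item from SchoolDates_1.items, which is not shown in the question; a minimal sketch of that item class with the four fields the spider fills in (the module path follows the import above, the rest is an assumption):

# SchoolDates_1/items.py (hypothetical, matching the import in the spider)
import scrapy

class Schooldates1Item(scrapy.Item):
    hol = scrapy.Field()
    first = scrapy.Field()
    last = scrapy.Field()
    url = scrapy.Field()

Run with scrapy crawl school -o schoolDates.csv to write the yielded items to the CSV file mentioned earlier.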
Your code needs a slight correction. Since you already select all the elements inside the table, there is no need to point to the table again, so you can shorten your xpath to something like this:
td[1]//text()
def parse_products(self, response):
    products = response.xpath('//*[@id="Year1"]/table//tr')
    # ignore the table header row
    for product in products[1:]:
        item = Schooldates1Item()
        item['hol'] = product.xpath('td[1]//text()').extract_first()
        item['first'] = product.xpath('td[2]//text()').extract_first()
        item['last'] = product.xpath('td[3]//text()').extract_first()
        yield item
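To make the difference concrete (an illustrative comparison, not from the original answer): an XPath that starts with // inside the loop searches the whole document again, so it ignores the row being iterated, while a relative path is evaluated against that row.

for product in products[1:]:
    # absolute path: matched against the whole document, so it always returns row 1's cell
    product.xpath('//*[@id="Year1"]/table/tbody/tr[1]/td[1]//text()').extract_first()
    # relative path: evaluated against this <tr>, so each iteration yields its own cells
    product.xpath('td[1]//text()').extract_first()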