I have been following this tutorial (http://blog.florian-hopf.de/2014/07/scrapy-and-elasticsearch.html) and using the scrapy-elasticsearch pipeline (https://github.com/knockrentals/scrapy-elasticsearch). I am able to extract data from Scrapy into a JSON file, and I have an Elasticsearch server up and running on localhost.
However, when I try to send the scraped data to Elasticsearch through the pipeline, I get the following error:
```
2015-08-05 21:21:53 [scrapy] ERROR: Error processing {'link': [u'http://www.meetup.com/Search-Meetup-Karlsruhe/events/221907250/'],
 'title': [u'Alles rund um Elasticsearch']}
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 70, in process_item
    self.index_item(item)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 52, in index_item
    local_id = hashlib.sha1(item[uniq_key]).hexdigest()
TypeError: must be string or buffer, not list
```
My items.py Scrapy file looks like this:
```python
from scrapy.item import Item, Field

class MeetupItem(Item):
    title = Field()
    link = Field()
    description = Field()
```
and (I think only the relevant parts of it) my settings.py file looks like this:
```python
from scrapy import log

ITEM_PIPELINES = [
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline',
]

ELASTICSEARCH_SERVER = 'localhost'  # If not 'localhost' prepend 'http://'
ELASTICSEARCH_PORT = 9200           # If port 80 leave blank
ELASTICSEARCH_USERNAME = ''
ELASTICSEARCH_PASSWORD = ''
ELASTICSEARCH_INDEX = 'meetups'
ELASTICSEARCH_TYPE = 'meetup'
ELASTICSEARCH_UNIQ_KEY = 'link'
ELASTICSEARCH_LOG_LEVEL = log.DEBUG
```
Any help would be greatly appreciated!
As you can see in the error message, Error processing {'link': [u'http://www.meetup.com/Search-Meetup-Karlsruhe/events/221907250/'], 'title': [u'Alles rund um Elasticsearch']}, your item's link and title fields are lists (the square brackets around the values indicate this).
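To see why this breaks the pipeline: on Python 2, hashlib.sha1() accepts a string (or buffer) but not a list, which is exactly what the last line of the traceback complains about. A minimal reproduction:

```python
import hashlib

# What the pipeline effectively does with your item's 'link' field (a list):
hashlib.sha1([u'http://www.meetup.com/Search-Meetup-Karlsruhe/events/221907250/']).hexdigest()
# TypeError: must be string or buffer, not list

# With a plain string it works fine:
hashlib.sha1('http://www.meetup.com/Search-Meetup-Karlsruhe/events/221907250/').hexdigest()
# returns a 40-character hex digest
```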
The lists are there because of how you do the extraction in Scrapy. You did not post your spider in the question, but you should use response.xpath().extract()[0] to get the first result of the list. Of course, in that case you should be prepared for empty result sets to avoid index errors.
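Since your spider is not in the question, here is only a rough sketch of the difference inside the parse() callback; the XPath expression '//a/@href' is a placeholder, not taken from your project:

```python
# Rough sketch inside your spider's parse() callback; '//a/@href' is a
# placeholder XPath, not taken from your project.
item = MeetupItem()

# .extract() always returns a list of unicode strings:
item['link'] = response.xpath('//a/@href').extract()     # [u'http://...']  -> breaks the pipeline

# Indexing the list stores a single string, which hashlib.sha1() accepts:
item['link'] = response.xpath('//a/@href').extract()[0]  # u'http://...'
```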
Update
For the case where nothing is extracted, you can prepare with something like the following:
```python
linkSelection = response.xpath().extract()
item['link'] = linkSelection[0] if linkSelection else ""
```
or something similar, depending on your data and fields. None might also work if the list is empty.
The basic idea is to split up the XPath extraction and the selection of the list item. You then select an item from the list only if it contains the required elements.
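Putting the two steps together, a spider following this idea could look roughly like the sketch below. The class name, start URL and XPath expressions are placeholders, and the import path of MeetupItem depends on your project layout:

```python
import scrapy
from myproject.items import MeetupItem  # adjust to your project's items module

class MeetupSpider(scrapy.Spider):
    name = 'meetup'
    start_urls = ['http://www.meetup.com/Search-Meetup-Karlsruhe/events/221907250/']

    def parse(self, response):
        item = MeetupItem()

        # Step 1: extract (always a list); step 2: pick the first element if present.
        linkSelection = response.xpath('//a/@href').extract()          # placeholder XPath
        item['link'] = linkSelection[0] if linkSelection else ""

        titleSelection = response.xpath('//h1/text()').extract()       # placeholder XPath
        item['title'] = titleSelection[0] if titleSelection else ""

        descriptionSelection = response.xpath('//p/text()').extract()  # placeholder XPath
        item['description'] = descriptionSelection[0] if descriptionSelection else ""

        # 'link' is configured as ELASTICSEARCH_UNIQ_KEY, so it should not end up empty;
        # otherwise the pipeline would hash the same empty string for every such item.
        yield item
```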