I wrote this short spider to extract the titles from the Hacker News front page.
    import scrapy


    class HackerItem(scrapy.Item):  # declaring the item
        hackertitle = scrapy.Field()


    class HackerSpider(scrapy.Spider):
        name = 'hackernewscrawler'
        allowed_domains = ['news.ycombinator.com']  # website we chose
        start_urls = ['http://news.ycombinator.com/']

        def parse(self, response):
            sel = scrapy.Selector(response)  # selector to help us extract the titles
            item = HackerItem()  # the item declared above
            # xpath of the titles
            item['hackertitle'] = sel.xpath("//tr[@class='athing']/td[3]/a[@href]/text()").extract()
            # printing titles using print statement
            print(item['hackertitle'])
But when I run the spider with:

    scrapy crawl hackernewscrawler -o hntitles.json -t json
I get an empty .json file with nothing in it.
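You should change the print statement to yield. Scrapy only exports the items that the parse callback yields or returns, so a spider that merely prints its items produces an empty output file: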
    import scrapy


    class HackerItem(scrapy.Item):  # declaring the item
        hackertitle = scrapy.Field()


    class HackerSpider(scrapy.Spider):
        name = 'hackernewscrawler'
        allowed_domains = ['news.ycombinator.com']  # website we chose
        start_urls = ['http://news.ycombinator.com/']

        def parse(self, response):
            sel = scrapy.Selector(response)  # selector to help us extract the titles
            item = HackerItem()  # the item declared above
            # xpath of the titles
            item['hackertitle'] = sel.xpath("//tr[@class='athing']/td[3]/a[@href]/text()").extract()
            # return items
            yield item
Then run:

    scrapy crawl hackernewscrawler -o hntitles.json -t json
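
To see why print leaves the output file empty, here is a small Scrapy-free sketch of the idea. It is not Scrapy's actual implementation: `export_items`, `parse_with_print`, `parse_with_yield`, and `fake_response` are hypothetical names invented for this illustration, and the "response" is just a dictionary. The point is only that an exporter can serialize what a callback hands back, never what it prints.

```python
import json


def parse_with_print(response):
    # Mirrors the original spider: the titles are printed and nothing is handed back.
    print(response["titles"])


def parse_with_yield(response):
    # Mirrors the fixed spider: the item is yielded to whoever called the callback.
    yield {"hackertitle": response["titles"]}


def export_items(callback, response):
    """Hypothetical stand-in for a feed exporter (an assumption, not Scrapy code).

    It serializes whatever the callback produces; a callback that returns
    None contributes no items at all.
    """
    result = callback(response)
    items = list(result) if result is not None else []
    return json.dumps(items)


# Made-up data standing in for the scraped page.
fake_response = {"titles": ["Title A", "Title B"]}

printed = export_items(parse_with_print, fake_response)
yielded = export_items(parse_with_yield, fake_response)
```

In this sketch `printed` ends up as an empty JSON list, while `yielded` contains the item, which matches the behaviour described above.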