一尘不染

Scraping data that comes from JavaScript

scrapy

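I'm using scrapy to scrape data from a website. However, the data I want is not in the HTML itself; it comes from JavaScript. So my question is:

How do I get the values (text values) in this case?

This is the site I'm trying to scrape: https://www.mcdonalds.com.sg/locate-us/

The attributes I'm trying to get: address, contact, operating hours.

If you right-click in Chrome and choose "View page source", you will see that these values are not present in the HTML itself.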

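Edit

Paul, I did what you said, found admin-ajax.php and saw the response body, but now I'm really stuck.

How do I retrieve the values from the json object and store them in my own fields? It would be great if you could show how to do it for just one attribute, both for the public and for people who are just starting out with scrapy.

Here is my code so far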

Items.py

from scrapy.item import Item, Field

class McDonaldsItem(Item):
    name = Field()
    address = Field()
    postal = Field()
    hours = Field()

McDonalds.py

import json
import pprint
import re

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from fastfood.items import McDonaldsItem

class McDonaldSpider(BaseSpider):
    name = "mcdonalds"
    allowed_domains = ["mcdonalds.com.sg"]
    start_urls = ["https://www.mcdonalds.com.sg/locate-us/"]

    def parse_json(self, response):
        # Decode the JSON body of the response and dump it for inspection.
        js = json.loads(response.body)
        pprint.pprint(js)

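Sorry for the long edit. In short, how do I store the json values into my attributes? For example

item['address'] = * how to retrieve

PS: not sure if this helps, but I run these scripts from the command line with

scrapy crawl mcdonalds -o McDonalds.json -t json (to save all my data to a json file)

I cannot stress enough how grateful I am. I know it is somewhat unreasonable to ask this of you, and it is totally fine if you don't have time for it.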


2020-04-09

1 answer

一尘不染

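All the data you need is already available in JSON format.

Scrapy provides a shell command that is very handy for tinkering with a website before you write the spider: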

$ scrapy shell https://www.mcdonalds.com.sg/locate-us/
2013-09-27 00:44:14-0400 [scrapy] INFO: Scrapy 0.16.5 started (bot: scrapybot)
...

In [1]: from scrapy.http import FormRequest

In [2]: url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php'

In [3]: payload = {'action': 'ws_search_store_location', 'store_name':'0', 'store_area':'0', 'store_type':'0'}

In [4]: req = FormRequest(url, formdata=payload)

In [5]: fetch(req)
2013-09-27 00:45:13-0400 [default] DEBUG: Crawled (200) <POST https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php> (referer: None)
...

In [6]: import json

In [7]: data = json.loads(response.body)

In [8]: len(data['stores']['listing'])
Out[8]: 127

In [9]: data['stores']['listing'][0]
Out[9]: 
{u'address': u'678A Woodlands Avenue 6<br/>#01-05<br/>Singapore 731678',
 u'city': u'Singapore',
 u'id': 78,
 u'lat': u'1.440409',
 u'lon': u'103.801489',
 u'name': u"McDonald's Admiralty",
 u'op_hours': u'24 hours<br>\r\nDessert Kiosk: 0900-0100',
 u'phone': u'68940513',
 u'region': u'north',
 u'type': [u'24hrs', u'dessert_kiosk'],
 u'zip': u'731678'}
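
To make the request built in In [4] a little less opaque: a FormRequest created with formdata is sent as a POST whose body is the dictionary in URL-encoded form. The sketch below reproduces only that encoding step with the standard library, without Scrapy and without contacting the site. The variable names are illustrative, and the exact bytes Scrapy sends are assumed to match this encoding.

```python
from urllib.parse import urlencode, parse_qs

# Same form fields as the payload used in the shell session above.
payload = {
    'action': 'ws_search_store_location',
    'store_name': '0',
    'store_area': '0',
    'store_type': '0',
}

# Encode the fields the way an HTML form submission would
# (application/x-www-form-urlencoded).
body = urlencode(payload)
print(body)

# Decoding the body gives the fields back, each value wrapped in a list.
decoded = parse_qs(body)
print(decoded['action'])
```

Comparing this string with the form data shown for admin-ajax.php in the browser's network tab is a quick way to check that the payload matches what the page itself sends.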

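In short: in your spider you have to return the FormRequest(...) shown above, then load the JSON object from response.body in the callback, and finally create an item with the values you want for each store in the list data['stores']['listing'].

Something like this: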

import json

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider

from fastfood.items import McDonaldsItem

class McDonaldSpider(BaseSpider):
    name = "mcdonalds"
    allowed_domains = ["mcdonalds.com.sg"]
    start_urls = ["https://www.mcdonalds.com.sg/locate-us/"]

    def parse(self, response):
        # This receives the response from the start url. But we don't do anything with it.
        url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php'
        payload = {'action': 'ws_search_store_location', 'store_name':'0', 'store_area':'0', 'store_type':'0'}
        return FormRequest(url, formdata=payload, callback=self.parse_stores)

    def parse_stores(self, response):
        # The AJAX endpoint answers with JSON; each entry in the listing is one store.
        data = json.loads(response.body)
        for store in data['stores']['listing']:
            yield McDonaldsItem(name=store['name'], address=store['address'])
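
The spider fills only name and address. As a rough sketch of how the remaining fields from the question (postal, hours) might be filled, the helper below maps one store record to a plain dictionary and turns the `<br>` tags in the address and opening hours into separators. It runs without Scrapy. `clean_html_breaks` and `store_to_fields` are hypothetical names, the sample record is a trimmed copy of the one printed in the shell session, and mapping `zip` to `postal` and `op_hours` to `hours` is an assumption about what the asker intended.

```python
import json
import re

# Matches <br>, <br/> and <br /> plus any surrounding whitespace or newlines.
_BR = re.compile(r'\s*<br\s*/?>\s*', re.IGNORECASE)


def clean_html_breaks(text, sep=', '):
    """Replace HTML line breaks with a separator and trim the result."""
    return _BR.sub(sep, text).strip()


def store_to_fields(store):
    """Map one record from data['stores']['listing'] to the item fields."""
    return {
        'name': store.get('name', ''),
        'address': clean_html_breaks(store.get('address', '')),
        'postal': store.get('zip', ''),  # assumed mapping: zip -> postal
        'hours': clean_html_breaks(store.get('op_hours', ''), sep='; '),
    }


# A trimmed copy of the response shape seen in the shell session.
sample_body = json.dumps({
    'stores': {
        'listing': [{
            'address': '678A Woodlands Avenue 6<br/>#01-05<br/>Singapore 731678',
            'name': "McDonald's Admiralty",
            'op_hours': '24 hours<br>\r\nDessert Kiosk: 0900-0100',
            'phone': '68940513',
            'zip': '731678',
        }]
    }
})

data = json.loads(sample_body)
fields = [store_to_fields(store) for store in data['stores']['listing']]
print(fields[0])
```

Inside parse_stores, the same dictionary could then be passed on as McDonaldsItem(**store_to_fields(store)), since its keys match the fields declared in Items.py.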
2020-04-09