Python: follow redirects and then download the page?

小能豆

python

I have the following Python script, and it works fine.

import urllib2

url = 'http://abc.com' # write the url here

usock = urllib2.urlopen(url)
data = usock.read()
usock.close()

print data

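However, some of the URLs I give it may redirect two or more times. How can I make Python wait for the redirects to finish before loading the data? For example, when I use the code above with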

http://www.google.com/search?hl=en&q=KEYWORD&btnI=1

which is the equivalent of clicking the "I'm Feeling Lucky" button on a Google search, I get:

>>> url = 'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'
>>> usick = urllib2.urlopen(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>>

I have tried (url, data, timeout), but I am not sure what to put there.

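Edit: I actually found that if I do not follow the redirect and only read the headers of the first link, I can get the location of the next redirect and use that as my final link.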

2024-05-20

1 answer

小能豆

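The HTTP 403 error is most likely not caused by the redirects themselves: urllib2.urlopen follows redirects automatically. The more likely cause is that the server rejects the default User-Agent that urllib2 sends. Since you are on Python 2 with urllib2, you can set a suitable request header (for example User-Agent) so that the request looks like it comes from a browser, which avoids the 403. If needed, you can also handle redirects manually.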
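To see why the header matters, here is a small Python 3 sketch (urllib2 became urllib.request in Python 3). It shows the User-Agent the library sends when none is set and builds a request that overrides it. The helper names default_user_agent and browser_like_request are made up for this illustration, and the idea that the server rejects the default value is an assumption drawn from the 403 in the question.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent that urllib.request sends when none is set."""
    opener = urllib.request.build_opener()
    return dict(opener.addheaders)['User-agent']


def browser_like_request(url, user_agent='Mozilla/5.0'):
    """Build a Request whose User-Agent replaces the library default."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


# Prints a string that starts with 'Python-urllib/'.
print(default_user_agent())

# No network traffic happens here; the Request object is only constructed.
req = browser_like_request('http://www.google.com/search?hl=en&q=KEYWORD&btnI=1')
# Request normalizes header names, so the stored key is 'User-agent'.
print(req.get_header('User-agent'))
```

Passing the resulting request to urllib.request.urlopen would send the browser-like header in place of the default one.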

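Using `urllib2` and handling redirects manually

First change the request headers to avoid the 403 error, then handle redirects manually. Here is a complete example: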

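import urllib2
import urlparse

def get_final_url(url):
    # Despite its name, this returns the body of the final page, not its URL.
    # A browser-like User-Agent avoids the 403 returned for the default one.
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(url, headers=headers)
    # urlopen() already follows redirects through its default HTTPRedirectHandler.
    response = urllib2.urlopen(req)

    # Fallback only: once urlopen() has followed the redirects, getcode()
    # is normally 200, so this loop is usually skipped.
    while response.getcode() in (301, 302):
        redirect_url = response.info().get('Location')
        if not redirect_url.startswith('http'):
            # Resolve a relative Location against the current URL.
            redirect_url = urlparse.urljoin(url, redirect_url)
        url = redirect_url
        req = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(req)

    return response.read()

url = 'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'
data = get_final_url(url)
print(data)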

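Code explanation

Import the modules:

import urllib2
import urlparse

urllib2 handles the HTTP requests; urlparse parses and joins URLs.

Define the get_final_url function:

def get_final_url(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)

This builds a request that carries a User-Agent header, so it looks like it comes from a browser.

Handle redirects:

    while response.getcode() in (301, 302):
        redirect_url = response.info().get('Location')
        if not redirect_url.startswith('http'):
            redirect_url = urlparse.urljoin(url, redirect_url)
        url = redirect_url
        req = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(req)

This checks whether the response status code is 301 or 302 (a redirect). If it is, the code reads the redirect URL from the Location header, resolves it when it is relative, and requests the new URL. Note that urllib2.urlopen already follows redirects by default, so the response it returns is normally the final page and this loop acts only as a fallback.

Return the final data:

    return response.read()

This reads and returns the content of the page reached after the last redirect.

Usage example:

url = 'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'
data = get_final_url(url)
print(data)

This calls get_final_url and prints the result.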
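For readers on Python 3, where urllib2 has been merged into urllib.request, the manual redirect handling described above might be sketched as follows. This is only an illustration under stated assumptions: the names NoRedirect, next_url, and fetch_following_redirects are made up for this example, and the max_hops limit of 10 is an arbitrary choice. Automatic redirect handling is switched off so that the Location header of each hop can be read and resolved by hand.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Turn off automatic redirects so every hop can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None declines the redirect; urllib then raises HTTPError
        # for the 3xx response instead of following it.
        return None


def next_url(current_url, status, headers):
    """Return the absolute redirect target, or None if this is not a redirect."""
    location = headers.get('Location')
    if status not in REDIRECT_CODES or location is None:
        return None
    # urljoin resolves relative targets and leaves absolute ones unchanged.
    return urljoin(current_url, location)


def fetch_following_redirects(url, max_hops=10):
    """Follow redirects by hand; return (body, list of URLs visited)."""
    opener = urllib.request.build_opener(NoRedirect)
    headers = {'User-Agent': 'Mozilla/5.0'}
    visited = [url]
    for _ in range(max_hops + 1):
        req = urllib.request.Request(url, headers=headers)
        try:
            with opener.open(req) as response:
                return response.read(), visited
        except urllib.error.HTTPError as err:
            target = next_url(url, err.code, err.headers)
            if target is None:
                raise  # a real error such as 403 or 404
            url = target
            visited.append(url)
    raise RuntimeError('more than %d redirects' % max_hops)
```

Calling fetch_following_redirects(url) returns the final page body together with every URL visited on the way, which also covers the case from the question where only the redirect location is wanted.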

Using the `requests` library (recommended)

If possible, use the requests library instead. It is more modern and makes redirect handling simpler:

import requests

url = 'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
print(response.text)

Install requests:

pip install requests

requests.get follows redirects automatically, and its headers parameter lets you set the User-Agent header to avoid the 403 error.
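If you also want to see which redirects were followed, a sketch along these lines may help. It relies on the history and url attributes of the response object; the helper names redirect_chain and fetch, and the 10 second timeout, are assumptions made for this example.

```python
import requests


def redirect_chain(response):
    """Return [(status_code, url), ...] for each hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def fetch(url, timeout=10):
    """GET a URL with a browser-like User-Agent, following redirects."""
    # requests follows redirects for GET by default (allow_redirects=True).
    response = requests.get(
        url,
        headers={'User-Agent': 'Mozilla/5.0'},
        timeout=timeout,
    )
    # Raise requests.HTTPError for 4xx/5xx answers such as 403.
    response.raise_for_status()
    return response
```

For example, response = fetch(url) followed by a loop over redirect_chain(response) prints every intermediate status code and URL, and response.text holds the final page.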

With either approach you can follow URL redirects and avoid the HTTP 403 error.

2024-05-20