I'm trying to extract the addresses of different properties using some links I already have in a text file. I created the script with the asyncio library. The script works fine until it runs into the type of page that the site throws at it. I also looked into using proxies, but with no luck. It's definitely not a captcha page, yet when I use asyncio I end up on that page after a few requests. For what it's worth, when I go with the requests module I don't run into that page at all.
How can I get rid of that error page?
Here are a few of the URLs I'm using in the text file.
What I've tried:
import asyncio
import aiohttp
import random
import requests
from bs4 import BeautifulSoup

async def get_text(session, url):
    async with session.get(url, ssl=False) as resp:
        assert resp.status == 200
        print("----------", str(resp.url))
        if "Error" in str(resp.url): raise
        return await resp.read()

async def get_info(sem, session, link):
    async with sem:
        r = await get_text(session, link)
        soup = BeautifulSoup(r, "html.parser")
        try:
            address = soup.select_one("h1#mainaddresstitle").get_text(strip=True)
        except AttributeError:
            address = ""
        print(address)

async def main():
    sem = asyncio.Semaphore(5)
    with open("link_list.txt", "r") as f:
        link_list = [url.strip() for url in f.readlines()]
    async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10)) as session:
        await asyncio.gather(
            *(get_info(sem, session, item) for item in link_list)
        )

if __name__ == '__main__':
    asyncio.run(main())
P.S. When the script exceeds the rate limit, it should run into a page like /Property/UsageValidation, not /Property/Error/?id=14e53e71-11b1-4f5e-a88c-f8a4721de99e.
The problem you're running into is most likely caused by rate limiting on the target site, or by the site detecting scripted traffic. Here are a few approaches that can help you get past the error page:
In some cases a request may fail only temporarily. Implement a retry mechanism so the request is attempted again:
async def get_text(session, url, retries=3):
    for attempt in range(retries):
        try:
            async with session.get(url, ssl=False) as resp:
                if resp.status == 200 and "Error" not in str(resp.url):
                    return await resp.read()
                print(f"Retry {attempt + 1} for URL: {url}")
        except Exception as e:
            print(f"Exception for URL {url}: {e}")
        await asyncio.sleep(random.uniform(1, 3))  # wait a random amount of time before trying again
    return None
The target site may be sensitive to a high number of concurrent requests. Lower the concurrency to reduce the load on the server:
sem = asyncio.Semaphore(2)  # reduce concurrent tasks to 2
Add a random delay between requests to mimic human behaviour:
async def get_info(sem, session, link):
    async with sem:
        await asyncio.sleep(random.uniform(2, 5))  # random wait
        r = await get_text(session, link)
        if not r:
            print(f"Skipping URL due to repeated errors: {link}")
            return
        soup = BeautifulSoup(r, "html.parser")
        try:
            address = soup.select_one("h1#mainaddresstitle").get_text(strip=True)
        except AttributeError:
            address = ""
        print(address)
When using aiohttp.ClientSession, set request headers that look closer to a real browser (such as User-Agent and Referer) to reduce the chance of being identified as scripted traffic:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Referer": "https://www.bcassessment.ca/",
}

async def main():
    sem = asyncio.Semaphore(5)
    with open("link_list.txt", "r") as f:
        link_list = [url.strip() for url in f.readlines()]
    async with aiohttp.ClientSession(
        headers=headers, timeout=aiohttp.ClientTimeout(total=20)
    ) as session:
        await asyncio.gather(
            *(get_info(sem, session, item) for item in link_list)
        )
Proxies can help spread the traffic out and avoid sending frequent requests from the same IP:
proxy_list = [
    "http://proxy1:port",
    "http://proxy2:port",
    # add more proxies
]

async def get_text(session, url, retries=3, proxies=None):
    for attempt in range(retries):
        proxy = random.choice(proxies) if proxies else None
        try:
            async with session.get(url, proxy=proxy, ssl=False) as resp:
                if resp.status == 200 and "Error" not in str(resp.url):
                    return await resp.read()
                print(f"Retry {attempt + 1} for URL: {url} with proxy {proxy}")
        except Exception as e:
            print(f"Exception for URL {url} with proxy {proxy}: {e}")
        await asyncio.sleep(random.uniform(1, 3))  # wait a random amount of time before trying again
    return None
If the requests library behaves better, some aspect of aiohttp's traffic may be what the target site is flagging. Try replacing asyncio with requests and using a thread pool for concurrency:
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def fetch_info(link):
    try:
        response = requests.get(link, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        address = soup.select_one("h1#mainaddresstitle").get_text(strip=True)
        print(address)
    except Exception as e:
        print(f"Error fetching {link}: {e}")

def main():
    with open("link_list.txt", "r") as f:
        link_list = [url.strip() for url in f.readlines()]
    with ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(fetch_info, link_list)

if __name__ == "__main__":
    main()
Add detection for redirects to the UsageValidation page in your script. For those pages you can apply a simple back-off strategy (for example, a longer delay followed by a retry); a sketch of that idea follows below.
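This is a minimal sketch of that idea, built on the retry-enabled get_text above. The backoff parameter and the 30-second base delay are arbitrary values chosen for illustration, not something the site documents, so tune them against what you actually observe:

async def get_text(session, url, retries=3, backoff=30):
    # Treat both the UsageValidation page (rate limit) and the generic Error page
    # as soft failures, but back off longer when the rate limit is the cause.
    for attempt in range(retries):
        async with session.get(url, ssl=False) as resp:
            final_url = str(resp.url)
            if resp.status == 200 and "UsageValidation" not in final_url and "Error" not in final_url:
                return await resp.read()
            if "UsageValidation" in final_url:
                wait = backoff * (attempt + 1)   # rate limited: wait progressively longer
            else:
                wait = random.uniform(2, 5)      # other error page: short random pause
            print(f"Got {final_url}, waiting {wait:.0f}s before retry {attempt + 1}")
        await asyncio.sleep(wait)
    return None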
With the optimizations above, the script should cope better with the target site's limits. In particular, I recommend combining the following (a sketch of how the pieces fit together follows below):
1. Lower the concurrency and add random delays.
2. Use proxies to spread the requests.
3. Mimic real user behaviour (for example, by adding request headers and a reasonable timeout).
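For completeness, here is one way those pieces could be wired together. It assumes the proxy-aware get_text from the proxy example and adds a proxies parameter to get_info so the list can be forwarded; that parameter is my own addition for illustration:

async def get_info(sem, session, link, proxies=None):
    async with sem:
        await asyncio.sleep(random.uniform(2, 5))          # random delay between requests
        r = await get_text(session, link, proxies=proxies)  # proxy-aware, retrying fetch
        if not r:
            print(f"Skipping URL due to repeated errors: {link}")
            return
        soup = BeautifulSoup(r, "html.parser")
        try:
            address = soup.select_one("h1#mainaddresstitle").get_text(strip=True)
        except AttributeError:
            address = ""
        print(address)

async def main():
    sem = asyncio.Semaphore(2)                              # low concurrency
    with open("link_list.txt", "r") as f:
        link_list = [url.strip() for url in f.readlines()]
    async with aiohttp.ClientSession(
        headers=headers, timeout=aiohttp.ClientTimeout(total=20)  # browser-like headers
    ) as session:
        await asyncio.gather(
            *(get_info(sem, session, item, proxies=proxy_list) for item in link_list)
        )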
If the problem persists, consider capturing and analysing the target site's traffic to work out its rules and adjust further.