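I'm trying to extract the addresses of different properties using some links I already have in a text file. I wrote this script with the asyncio library. The script runs fine until it hits this type of page thrown by the website. I also tried implementing proxies, with no luck. It's definitely not a captcha page, but when using asyncio I end up on that page after a few requests. FYI, I don't encounter that page when I use the requests module.
Here are a few of the URLs I use in the text file.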
    import asyncio
    import aiohttp
    import random
    import requests
    from bs4 import BeautifulSoup

    async def get_text(session,url):
        async with session.get(url,ssl=False) as resp:
            assert resp.status == 200
            print("----------",str(resp.url))
            if "Error" in str(resp.url):raise
            return await resp.read()

    async def get_info(sem,session,link):
        async with sem:
            r = await get_text(session,link)
            soup = BeautifulSoup(r,"html.parser")
            try:
                address = soup.select_one("h1#mainaddresstitle").get_text(strip=True)
            except AttributeError: address = ""
            print(address)

    async def main():
        sem = asyncio.Semaphore(5)

        with open("link_list.txt","r") as f:
            link_list = [url.strip() for url in f.readlines()]

        async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10)) as session:
            await asyncio.gather(
                *(get_info(sem,session,item) for item in link_list)
            )

    if __name__ == '__main__':
        asyncio.run(main())
PS: When the script exceeds the rate limit, it should land on pages like /Property/UsageValidation, not /Property/Error/?id=14e53e71-11b1-4f5e-a88c-f8a4721de99e
    async def get_text(session, url, retries=3):
        for attempt in range(retries):
            try:
                async with session.get(url, ssl=False) as resp:
                    if resp.status == 200 and "Error" not in str(resp.url):
                        return await resp.read()
                    print(f"Retry {attempt + 1} for URL: {url}")
            except Exception as e:
                print(f"Exception for URL {url}: {e}")
            await asyncio.sleep(random.uniform(1, 3))  # wait a random interval before retrying
        return None
    sem = asyncio.Semaphore(2)  # reduce the number of concurrent tasks to 2
    async def get_info(sem, session, link):
        async with sem:
            await asyncio.sleep(random.uniform(2, 5))  # random pause before each request
            r = await get_text(session, link)
            if not r:
                print(f"Skipping URL due to repeated errors: {link}")
                return
            soup = BeautifulSoup(r, "html.parser")
            try:
                address = soup.select_one("h1#mainaddresstitle").get_text(strip=True)
            except AttributeError:
                address = ""
            print(address)
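When using aiohttp.ClientSession, set request headers that look more like a real user's (such as User-Agent and Referer) to reduce the chance of being identified as scripted traffic: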
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Referer": "https://www.bcassessment.ca/",
    }

    async def main():
        sem = asyncio.Semaphore(5)

        with open("link_list.txt", "r") as f:
            link_list = [url.strip() for url in f.readlines()]

        async with aiohttp.ClientSession(
            headers=headers, timeout=aiohttp.ClientTimeout(total=20)
        ) as session:
            await asyncio.gather(
                *(get_info(sem, session, item) for item in link_list)
            )
Proxies can help spread the traffic out so that requests are not sent too frequently from the same IP:
    proxy_list = [
        "http://proxy1:port",
        "http://proxy2:port",
        # add more proxies
    ]

    async def get_text(session, url, retries=3, proxies=None):
        for attempt in range(retries):
            proxy = random.choice(proxies) if proxies else None
            try:
                async with session.get(url, proxy=proxy, ssl=False) as resp:
                    if resp.status == 200 and "Error" not in str(resp.url):
                        return await resp.read()
                    print(f"Retry {attempt + 1} for URL: {url} with proxy {proxy}")
            except Exception as e:
                print(f"Exception for URL {url} with proxy {proxy}: {e}")
            await asyncio.sleep(random.uniform(1, 3))  # wait a random interval before retrying
        return None
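If the requests library behaves better, some aspect of aiohttp's behavior may be getting flagged by the target site. Try replacing the asyncio/aiohttp code with requests, and use a thread pool for concurrency: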
    import requests
    from bs4 import BeautifulSoup
    from concurrent.futures import ThreadPoolExecutor

    # `headers` is the dict defined in the earlier snippet.

    def fetch_info(link):
        try:
            response = requests.get(link, headers=headers, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "html.parser")
            address = soup.select_one("h1#mainaddresstitle").get_text(strip=True)
            print(address)
        except Exception as e:
            print(f"Error fetching {link}: {e}")

    def main():
        with open("link_list.txt", "r") as f:
            link_list = [url.strip() for url in f.readlines()]

        with ThreadPoolExecutor(max_workers=5) as executor:
            executor.map(fetch_info, link_list)

    if __name__ == "__main__":
        main()
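Add detection for redirects to the UsageValidation page in the script. For those pages you can implement a simple fallback strategy, such as increasing the delay or retrying.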
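The redirect check and fallback described above could look roughly like the sketch below. It assumes the final URL path reliably tells the two cases apart (the /Property/UsageValidation and /Property/Error paths come from the question) and that `session` behaves like an aiohttp.ClientSession. The names `classify_final_url`, `backoff_delay` and `fetch_with_backoff` are made up for illustration, and the backoff constants are guesses, not values the site documents.

```python
import asyncio
import random
from urllib.parse import urlsplit


def classify_final_url(final_url):
    """Label the post-redirect URL as 'throttled', 'error', or 'ok' by its path."""
    path = urlsplit(str(final_url)).path.lower()
    if "/property/usagevalidation" in path:
        return "throttled"
    if "/property/error" in path:
        return "error"
    return "ok"


def backoff_delay(attempt, base=2.0, cap=60.0):
    """Exponential backoff with full jitter: between 0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


async def fetch_with_backoff(session, url, retries=4, base=2.0):
    """Fetch url, waiting longer after each throttle/error redirect or non-200 status.

    `session` is assumed to be an aiohttp.ClientSession (or anything with the
    same `get` interface). Network exceptions are not caught here.
    """
    for attempt in range(retries):
        async with session.get(url) as resp:
            kind = classify_final_url(resp.url)
            status = resp.status
            if status == 200 and kind == "ok":
                return await resp.read()
        print(f"Attempt {attempt + 1} for {url}: status={status}, page={kind}")
        await asyncio.sleep(backoff_delay(attempt, base=base))
    return None  # caller decides whether to skip or re-queue the URL
```

In get_info you would call `await fetch_with_backoff(session, link)` in place of get_text and skip the link when it returns None. Growing the delay on each attempt, rather than sleeping a fixed 1 to 3 seconds, gives the rate limiter more time to reset if the throttling persists.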
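With the changes above, the script should cope better with the target site's restrictions. In particular, I'd suggest combining the following:
1. Lower the concurrency and add random delays.
2. Use proxies to spread out the requests.
3. Simulate real user behavior (for example, add request headers and set reasonable timeouts).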