I'm trying to scrape email addresses from a website, but when I run the code I get an error I don't understand:
Traceback (most recent call last):
  File "Email_Scrapper.py", line 37, in <module>
    parts = urllib.parse.urlsplit(url)
  File "C:\Users\HP\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 423, in urlsplit
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "C:\Users\HP\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 124, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "C:\Users\HP\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 108, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "C:\Users\HP\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 108, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
AttributeError: 'builtin_function_or_method' object has no attribute 'decode'
Here is my code:
scraped_url = set()
emails = set()
# here we count until the collected emails reach 20, then break out of the loop
count = 0
try:
    while len(urls):
        count += 1
        if count == 20:
            break
        url = urls.popleft
        scraped_url.add(url)
        parts = urllib.parse.urlsplit(url)
        base_url = '{0.scheme}://{0.netloc}'.format(parts)
        path = url[:url.rfind('/')+1] if '/' in parts.path else url
        print('[%d] Processing %s' % count(count, url))
        try:
            response = requests.get()
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            continue
        new_emails = set(re.findall(r'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+', response.text, re.I))
        soup = BeautifulSoup(response.text, features="lxml")
        for anchor in soup.find_all("a"):
            link = anchor.attrs['href'] if 'href' in anchor.attrs else ''
            if link.startswith('/'):
                link = base_url + link
            elif not link.startswith('http'):
                link = path + link
            if not link in urls and not link in scraped_url:
                urls.append(link)
except KeyboardInterrupt:
    print('[-]closing')

for mails in emails:
    print(mails)
The error you're getting is:
AttributeError: 'builtin_function_or_method' object has no attribute 'decode'
It comes from this line:
parts = urllib.parse.urlsplit(url)
When urlsplit receives an argument that is not a str, it tries to decode() it as if it were bytes. But the url you passed in is a method object, not a string, and a method object has no decode attribute. Specifically, the cause is this line:
url = urls.popleft
You forgot to call the popleft() method on this line. Without the parentheses, urls.popleft is the method object itself, not its return value, so urlsplit receives a method instead of a URL string.
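To see what is happening, here is a minimal reproduction (assuming urls is a collections.deque, which is what popleft suggests):

from collections import deque
import urllib.parse

urls = deque(['http://example.com'])

print(urls.popleft)   # <built-in method popleft of collections.deque object at 0x...>

# passing the method object instead of a string reproduces your exact error:
urllib.parse.urlsplit(urls.popleft)
# AttributeError: 'builtin_function_or_method' object has no attribute 'decode'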
Add parentheses after urls.popleft so that the actual URL string is returned and passed along:
url = urls.popleft()  # add the parentheses to actually call popleft
Here is the corrected code:
# assumes the imports (urllib.parse, re, requests, BeautifulSoup)
# and the urls deque defined earlier in your script
scraped_url = set()
emails = set()
count = 0
try:
    while len(urls):
        count += 1
        if count == 20:
            break
        url = urls.popleft()  # call popleft() to get the actual URL
        scraped_url.add(url)
        parts = urllib.parse.urlsplit(url)
        base_url = '{0.scheme}://{0.netloc}'.format(parts)
        path = url[:url.rfind('/')+1] if '/' in parts.path else url
        print('[%d] Processing %s' % (count, url))  # fixed the format arguments
        try:
            response = requests.get(url)  # pass the url to requests.get()
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            continue
        new_emails = set(re.findall(r'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+', response.text, re.I))
        emails.update(new_emails)  # add the new addresses to the emails set
        soup = BeautifulSoup(response.text, features="lxml")
        for anchor in soup.find_all("a"):
            link = anchor.attrs['href'] if 'href' in anchor.attrs else ''
            if link.startswith('/'):
                link = base_url + link
            elif not link.startswith('http'):
                link = path + link
            if link not in urls and link not in scraped_url:
                urls.append(link)
except KeyboardInterrupt:
    print('[-] Closing')

# print all scraped email addresses
for mail in emails:
    print(mail)
Fix the formatting in the print statement: your original code used count(count, url), which calls count as if it were a function and raises a TypeError. Keep the format string [%d] Processing %s and pass count and url as a tuple: (count, url).
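A quick check of the corrected call:

count, url = 3, 'http://example.com'
print('[%d] Processing %s' % (count, url))   # prints: [3] Processing http://example.com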
Pass the URL to requests.get(): your call requests.get() has no URL argument at all; it needs the url variable.
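As a side note (my suggestion, not something the fix requires): passing a timeout keeps one slow host from stalling the whole loop, and the resulting Timeout exception can be caught alongside the others:

try:
    response = requests.get(url, timeout=10)  # timeout in seconds; adjust as needed
except (requests.exceptions.MissingSchema,
        requests.exceptions.ConnectionError,
        requests.exceptions.Timeout):
    continue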
Update the email set: you create an emails set, but the addresses collected in new_emails are never added to it, so the final loop prints nothing. emails.update(new_emails) merges each page's matches into emails.
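A quick illustration of how set.update behaves:

emails = {'a@example.com'}
new_emails = {'a@example.com', 'b@example.com'}
emails.update(new_emails)  # duplicates are ignored, new addresses are added
print(emails)              # {'a@example.com', 'b@example.com'} (set order may vary)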