小能豆

AttributeError: 'builtin_function_or_method' object has no attribute 'decode'

py

I'm trying to scrape email addresses from a website, but when I run the code I get an error that I don't really understand:

Traceback (most recent call last):
  File "Email_Scrapper.py", line 37, in <module>
    parts = urllib.parse.urlsplit(url)
  File "C:\Users\HP\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 423, in urlsplit
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "C:\Users\HP\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 124, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "C:\Users\HP\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 108, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "C:\Users\HP\AppData\Local\Programs\Python\Python38-32\lib\urllib\parse.py", line 108, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
AttributeError: 'builtin_function_or_method' object has no attribute 'decode'

Here is my code:

scraped_url = set()
emails = set() 

# here we keep counting until the collected
# emails reach 20 and then we exit the loop
count = 0
try:
    while len(urls):
        count += 1
        if count == 20:
            break
        url = urls.popleft 
        scraped_url.add(url)

        parts = urllib.parse.urlsplit(url)
        base_url = '{0.scheme}://{0.netloc}'.format(parts)

        path = url[:url.rfind('/')+1] if '/' in parts.path else url

        print('[%d] Processing %s' % count(count, url))
        try:
           response = requests.get()
        except(requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            continue

        new_emails = set(re.findall(r'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+',response.text, re.I))
        soup = BeautifulSoup(response.text, features="lxml")

        for anchor in soup.find_all("a"):
            link = anchor.attrs['href'] if 'href' in anchor.attrs else ''
            if link.startswith('/'):
                link = base_url + link
            elif not link.startswith('http'):
                link = path + link 
            if not link in urls and not link in scraped_url:
                urls.append(link)
except KeyboardInterrupt:
    print('[-]closing')

for mails in emails:
    print(mails)

2024-12-04

1 Answer

小能豆

The error you're getting is:

AttributeError: 'builtin_function_or_method' object has no attribute 'decode'

It is raised by this line of code:

parts = urllib.parse.urlsplit(url)

When its argument is not a str, urlsplit assumes it is bytes and tries to call decode() on it. But the url you passed in is a method, not a string. Specifically, the cause of the error is this line:

url = urls.popleft

You forgot to actually call the popleft() method here. Without the parentheses, urls.popleft is the method object itself rather than its return value, so urlsplit receives a method instead of a URL string.
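If you want to see the mechanism in isolation, here is a minimal standalone sketch (the URL is a made-up placeholder) that reproduces the same AttributeError:

from collections import deque
import urllib.parse

urls = deque(["https://example.com/page"])  # made-up placeholder URL

url = urls.popleft   # the bound method object itself, not a string
print(type(url))     # <class 'builtin_function_or_method'>

try:
    urllib.parse.urlsplit(url)
except AttributeError as exc:
    print(exc)       # 'builtin_function_or_method' object has no attribute 'decode'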

Solution:

Add parentheses after urls.popleft so that the method is actually called and returns the URL string:

url = urls.popleft()  # add the parentheses to actually call popleft

The corrected code looks like this:

scraped_url = set()
emails = set() 

count = 0
try:
    while len(urls):
        count += 1
        if count == 20:
            break
        url = urls.popleft()  # call popleft() to get the actual URL

        scraped_url.add(url)

        parts = urllib.parse.urlsplit(url)
        base_url = '{0.scheme}://{0.netloc}'.format(parts)

        path = url[:url.rfind('/')+1] if '/' in parts.path else url

        print('[%d] Processing %s' % (count, url))  # fixed string formatting
        try:
            response = requests.get(url)  # pass the url to requests.get()
        except(requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            continue

        new_emails = set(re.findall(r'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+', response.text, re.I))
        emails.update(new_emails)

        soup = BeautifulSoup(response.text, features="lxml")

        for anchor in soup.find_all("a"):
            link = anchor.attrs['href'] if 'href' in anchor.attrs else ''
            if link.startswith('/'):
                link = base_url + link
            elif not link.startswith('http'):
                link = path + link 
            if link not in urls and link not in scraped_url:
                urls.append(link)
except KeyboardInterrupt:
    print('[-] Closing')

# print all the scraped email addresses
for mail in emails:
    print(mail)
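For completeness: the snippet above assumes the usual imports and that urls is a collections.deque seeded with one or more start URLs (that part isn't shown in your post). A minimal setup, with a made-up placeholder URL, might look like this:

from collections import deque
import re
import urllib.parse

import requests
from bs4 import BeautifulSoup

# urls is assumed to be a deque of start URLs; replace the placeholder
# with the site you actually want to scrape
urls = deque(["https://example.com"])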

Other fixes:

  1. Fixed the formatting in the print statement
    You originally wrote count(count, url), which tries to call count as a function and raises an error. The correct form is the format string '[%d] Processing %s' applied to the tuple (count, url).

  2. Passed the URL to requests.get()
    You didn't pass any URL argument to requests.get(); it needs the url variable.

  3. Updated the emails set
    Your code builds a new_emails set but never adds it to emails. Calling emails.update(new_emails) adds the newly found addresses to the emails set, as shown in the short sketch after this list.
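To make point 3 concrete, here is a tiny self-contained sketch (the sample text is made up) showing how the re.findall results are folded into the emails set with update():

import re

emails = set()
sample_text = "Contact us at info@example.com or sales@example.org."  # made-up sample text

new_emails = set(re.findall(r'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+', sample_text, re.I))
emails.update(new_emails)  # without this call, emails would stay empty

print(emails)  # prints both addresses (set order may vary)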

Summary:

  1. The root cause of the error is that the popleft() method was never actually called.
  2. With these fixes, the program pops URLs off the queue, fetches and parses each page, and prints the collected email addresses.
2024-12-04