Beautiful Soup does not extract all of the web page's HTML

小能豆

When I "inspect" the page, the part containing the img src looks like this:

    <div class="dataBild">
        <img src="https://tmssl.akamaized.net//images/portrait/header/195652-1456301478.jpg?lm=1456301501" title="Jordon Ibe" alt="Jordon Ibe" class="">
        <div class="bildquelle"><span title="imago">imago</span></div>
    </div>

So I figured I could use BeautifulSoup to find the div with class = "dataBild", since it is unique.

# Import the Libraries that I need
import urllib3
import certifi
from bs4 import BeautifulSoup

# Specify the URL
url = 'https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652'
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
response = http.request('GET', url)


# Parse the HTML using Beautiful Soup and store it in the variable 'soup'
soup = BeautifulSoup(response.data, "html.parser")

player_img = soup.find_all('div', {'class':'dataBild'})
print(player_img)

It runs but doesn't output anything. So I checked by running print(soup):

# Import the Libraries that I need
import urllib3
import certifi
from bs4 import BeautifulSoup

# Specify the URL
url = 'https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652'
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
response = http.request('GET', url)


# Parse the HTML using Beautiful Soup and store it in the variable 'soup'
soup = BeautifulSoup(response.data, "html.parser")

print(soup)

This outputs:

<html>
<head><title>404 Not Found</title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr/><center>nginx</center>
</body>
</html>

So it's clearly not extracting all the HTML from the web page? Why is this happening? And is my logic of searching for the div with class = dataBild sound?

2024-12-23

1 answer

小能豆

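What you're running into is a typical case of the HTTP request returning a 404 error, meaning the server did not find (or did not serve) the target URL. BeautifulSoup did parse all the HTML it was given; that HTML just happened to be the 404 error page rather than the page you wanted.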

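Cause analysis:

404 Not Found error: Judging by the returned HTML, the server answered your request with a 404 page, meaning it reported the requested URL as missing or declined to serve it. What you parsed is the HTML of that error page, not the page you expected.
URL problem: The URL you gave, https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652, looks valid, but the page may have been moved, redirected, or deleted, or the URL may be temporarily unavailable.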
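One further possibility worth ruling out: when a URL loads in a browser but a script gets an error page, a common reason is that the server treats requests without a browser-like User-Agent differently. Whether that applies to this site is an assumption, not something the output above proves. The sketch below refuses to parse error responses and sends a User-Agent header; `soup_from_response`, `fetch_soup`, and the header value are hypothetical names and values chosen for illustration.

```python
import urllib3
import certifi
from bs4 import BeautifulSoup

# Assumption: a generic browser-like User-Agent; the exact string is illustrative.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def soup_from_response(response, url):
    """Parse the body only if the server answered 200; otherwise raise."""
    if response.status != 200:
        raise RuntimeError(f"GET {url} returned HTTP {response.status}")
    return BeautifulSoup(response.data, "html.parser")


def fetch_soup(url, headers=BROWSER_HEADERS):
    """Hypothetical helper: GET the URL with the given headers and parse it."""
    http = urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs=certifi.where())
    response = http.request("GET", url, headers=headers)
    return soup_from_response(response, url)
```

Calling `fetch_soup(url)` then either returns a soup for a real page or fails loudly with the status code, instead of silently handing you the parsed error page.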

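Solutions:

Check the URL: Open the URL manually in a browser and see whether it loads. If the browser also gets a 404, the page really is unavailable. If the browser loads it fine (for example after a redirect), compare what your script receives with what the browser receives.
Handle redirects: The target site may redirect the request. urllib3 follows redirects by default, and you can pass redirect=True explicitly to make sure you end up with the final page content. For example: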

```python
import urllib3
import certifi
from bs4 import BeautifulSoup

# Specify the URL
url = 'https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652'
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())

# Allow redirects (redirect=True is also urllib3's default)
response = http.request('GET', url, redirect=True)

# Parse the HTML using BeautifulSoup
soup = BeautifulSoup(response.data, "html.parser")

print(soup)
```
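Check the HTML structure: If the URL is valid and the correct page comes back, you can use soup.find_all() to look for the div element. The div you mention has the class dataBild, but note that it may be generated dynamically, since some pages load content through JavaScript. BeautifulSoup only parses the HTML it is given and does not run JavaScript, so content loaded that way will be missing. In that case requests + BeautifulSoup (or urllib3 + BeautifulSoup, as here) cannot get the full content, and a tool such as Selenium, which drives a real browser, is a better fit.
- Find the div element:
  Make sure you pass the correct class name to find_all; class matching is case-sensitive:

```python
player_img = soup.find_all('div', {'class': 'dataBild'})
print(player_img)
```

Debug and verify the HTML: To confirm whether the div exists, print part of soup or look for the specific tag:

```python
player_img = soup.find('div', class_='dataBild')
if player_img:
    print(player_img.prettify())
else:
    print("No div with class 'dataBild' found.")
```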
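To make the points about JavaScript and class names concrete, here is a small offline sketch. The two HTML strings are made up for illustration (the second borrows the markup from the question), and `find_player_div` is a hypothetical helper. It shows that BeautifulSoup does not see an element that only a script would create, and that a class name with the wrong capitalisation does not match.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML as a server might send it before any script runs.
STATIC_HTML = """
<html><body>
  <div id="root"></div>
  <script>
    var d = document.createElement("div");
    d.className = "dataBild";
    document.getElementById("root").appendChild(d);
  </script>
</body></html>
"""

# Markup modelled on the question: here the div is part of the HTML itself.
RENDERED_HTML = '<div class="dataBild"><img title="Jordon Ibe" alt="Jordon Ibe"></div>'


def find_player_div(html, class_name="dataBild"):
    """Return the first div with the given class, or None if there is none."""
    return BeautifulSoup(html, "html.parser").find("div", class_=class_name)


# The div exists only after JavaScript runs, so the parser cannot find it.
print(find_player_div(STATIC_HTML))
# The div is in the markup, so it is found.
print(find_player_div(RENDERED_HTML))
# Same markup, wrong capitalisation of the class name.
print(find_player_div(RENDERED_HTML, "DataBild"))
```

If the first case matches what the real page does, a browser-driving tool is needed; if the div is in the raw HTML, fixing the request is enough.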

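Summary:

Check the HTTP response status to confirm whether you got a 404.
If the page redirects, make sure redirects are followed automatically.
If the page is loaded dynamically (for example through JavaScript), use Selenium to simulate a browser loading the page.
Make sure the class name you search for is correct.

Start by verifying in a browser that the page exists and is reachable, then debug any redirect problems from there.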

2024-12-23
