How do I get href links from HTML using Python?

一尘不染

html

import urllib2

website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
# Read from the handle returned by urlopen() above.
html = openwebsite.read()

print html

So far, so good.

But I only want the href links from the raw HTML text. How can I do that?

Views: 620

2020-05-10

1 answer

一尘不染

Try Beautiful Soup. In Python 2 with BeautifulSoup 3:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')

If you only want links that start with http://, use:

soup.findAll('a', attrs={'href': re.compile("^http://")})

In Python 3 with BS4, it should be:

from bs4 import BeautifulSoup
import urllib.request

html_page = urllib.request.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page, "html.parser")
# find_all is the BS4 spelling; findAll is kept only as a legacy alias.
for link in soup.find_all('a'):
    print(link.get('href'))
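
If installing Beautiful Soup is not an option, roughly the same result can be sketched with the standard library's html.parser module in Python 3. This is an alternative technique, not the one used above. The names HrefCollector, extract_hrefs, and sample_html, as well as the example.com URLs, are made up for illustration, and the HTML is an inline string so the example runs without a network connection.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html_text, prefix=None):
    """Return all href values, optionally only those starting with prefix."""
    collector = HrefCollector()
    collector.feed(html_text)
    collector.close()
    if prefix is None:
        return collector.hrefs
    return [href for href in collector.hrefs if href.startswith(prefix)]


# Made-up sample document; in practice this would be the decoded page body.
sample_html = """
<p><a href="http://example.com/a">A</a>
<a href="/relative/b">B</a>
<a name="no-href">C</a>
<a href="https://example.com/d">D</a></p>
"""

print(extract_hrefs(sample_html))
print(extract_hrefs(sample_html, prefix="http://"))
```

The prefix argument plays the same role as the re.compile("^http://") filter shown earlier. Anchors without an href attribute are skipped rather than reported as None.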

2020-05-10