import requests
from bs4 import BeautifulSoup
import csv
from urlparse import urljoin
import urllib2

base_url = 'http://www.baseball-reference.com'
data = requests.get("http://www.baseball-reference.com/teams/BAL/2014-schedule-scores.shtml")
soup = BeautifulSoup(data.content)
outfile = open("./Balpbp.csv", "wb")
writer = csv.writer(outfile)

url = []
for link in soup.find_all('a'):
    if not link.has_attr('href'):
        continue
    if link.get_text() != 'boxscore':
        continue
    url.append(base_url + link['href'])

for list in url:
    response = requests.get(list)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('table', attrs={'id': 'play_by_play'})
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace(' ', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    writer.writerows(list_of_rows)
u'G.\xa0Holland', u'N.\xa0Cruz', …
The error message is as follows:
Traceback (most recent call last):
  File "try.py", line 40, in <module>
    writer.writerows(list_of_rows)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 57: ordinal not in range(128)
When I write the data to the CSV, the values I end up with contain these \x… fragments, and that stops the data from being written to the CSV. How can I change the data to remove them, or otherwise work around this problem?
The problem you are hitting is an encoding error caused by \xa0 (a non-breaking space) and possibly other non-ASCII characters: when Python tries to write these characters to the file, it falls back to the default ASCII codec, which cannot represent them.
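You can reproduce the failure in isolation. Here is a minimal sketch, using one of the values from your data, that triggers the same codec error (roughly what the Python 2 csv module does under the hood when it writes a unicode row):

cell = u'N.\xa0Cruz'          # a cell value as returned by cell.text
try:
    cell.encode('ascii')      # the implicit ASCII encode that fails
except UnicodeEncodeError as exc:
    print(exc)                # 'ascii' codec can't encode character u'\xa0' ...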
Here are a few ways to fix it.

Replace \xa0 and call encode(): explicitly encode all of the cell text as UTF-8:
for cell in row.findAll('td'):
    text = cell.text.replace(u'\xa0', '').encode('utf-8')  # strip the non-breaking space and encode as UTF-8
    list_of_cells.append(text)
This strips the non-breaking space and encodes the text as UTF-8, which avoids the encoding error.
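As a self-contained check of the same idea (assuming Python 2 and a binary-mode file, as in your original script; demo.csv is just a throwaway name), writing the encoded cells no longer raises:

import csv

row = [u'G.\xa0Holland', u'N.\xa0Cruz']
encoded = [cell.replace(u'\xa0', '').encode('utf-8') for cell in row]  # unicode -> UTF-8 bytes

with open('demo.csv', 'wb') as f:
    csv.writer(f).writerow(encoded)  # byte strings write cleanly, no ASCII fallback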
Open the output file with encoding='utf-8': on Python 3, you can specify the encoding when opening the file:
import csv

outfile = open("./Balpbp.csv", "w", encoding='utf-8', newline='')  # set the encoding to UTF-8
writer = csv.writer(outfile)
That way the CSV writer can handle Unicode characters directly.
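For example (Python 3, writing a throwaway demo.csv):

import csv

# The non-breaking space goes straight into the UTF-8 encoded file,
# no manual encode() step needed.
with open('demo.csv', 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerow(['G.\xa0Holland', 'N.\xa0Cruz'])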
If you only want to delete or replace the non-ASCII characters, you can clean the text with a regular expression:
import re

for cell in row.findAll('td'):
    text = cell.text
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # delete all non-ASCII characters
    list_of_cells.append(text)
This approach is appropriate when you only need to keep ASCII characters in the data.
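A quick check of the pattern on one of the problem values:

import re

# Everything outside the ASCII range, including \xa0, is removed,
# so u'G.\xa0Holland' comes out as 'G.Holland'.
print(re.sub(r'[^\x00-\x7F]+', '', u'G.\xa0Holland'))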
When constructing the BeautifulSoup object, explicitly specify UTF-8 and choose the html.parser or lxml parser:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')  # make sure UTF-8 is used
This reduces encoding problems caused by the parser's default behaviour.
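A self-contained sketch of the idea, feeding raw UTF-8 bytes plus an explicit from_encoding so the parser does not have to guess:

from bs4 import BeautifulSoup

raw = u'<td>N.\xa0Cruz</td>'.encode('utf-8')             # bytes, like response.content
soup = BeautifulSoup(raw, 'html.parser', from_encoding='utf-8')
print(repr(soup.td.text))                                # shows the \xa0 survived decoding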
Putting the approaches above together gives a more robust version of the script:
import requests
from bs4 import BeautifulSoup
import csv

base_url = 'http://www.baseball-reference.com'
data = requests.get("http://www.baseball-reference.com/teams/BAL/2014-schedule-scores.shtml")
soup = BeautifulSoup(data.content, 'html.parser')

outfile = open("./Balpbp.csv", "w", encoding='utf-8', newline='')  # write the CSV as UTF-8
writer = csv.writer(outfile)

url = []
for link in soup.find_all('a'):
    if not link.has_attr('href') or link.get_text() != 'boxscore':
        continue
    url.append(base_url + link['href'])

for list_url in url:
    response = requests.get(list_url)
    html = response.content
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', attrs={'id': 'play_by_play'})
    if not table:
        continue  # skip games that have no play-by-play table
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace(u'\xa0', ' ').strip()  # replace the non-breaking space and tidy up
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    writer.writerows(list_of_rows)

outfile.close()
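As a quick sanity check after running it (Python 3, reading back the file the script just wrote):

import csv

# Read the first few rows back with the same encoding to confirm the
# play-by-play text round-trips cleanly.
with open('./Balpbp.csv', encoding='utf-8', newline='') as f:
    for i, row in enumerate(csv.reader(f)):
        print(row)
        if i >= 4:
            break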
To sum up:

Handle the encoding: call text.encode('utf-8'), or pass encoding='utf-8' to open.
Clean up special characters: use .replace() to drop \xa0, or a regular expression to strip non-ASCII characters.
Make the script more robust: add if not table: continue so games without a play-by-play table are skipped.
This way the program runs and writes the file successfully whether or not the data contains special characters.