import requests
from bs4 import BeautifulSoup
import csv
from urlparse import urljoin
import urllib2

base_url = 'http://www.baseball-reference.com'
data = requests.get("http://www.baseball-reference.com/teams/BAL/2014-schedule-scores.shtml")
soup = BeautifulSoup(data.content)
outfile = open("./Balpbp.csv", "wb")
writer = csv.writer(outfile)

url = []
for link in soup.find_all('a'):
    if not link.has_attr('href'):
        continue
    if link.get_text() != 'boxscore':
        continue
    url.append(base_url + link['href'])

for list in url:
    response = requests.get(list)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('table', attrs={'id': 'play_by_play'})
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace(' ', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    writer.writerows(list_of_rows)
u'G.\xa0Holland', u'N.\xa0Cruz', …
The error message is as follows:
Traceback (most recent call last):
  File "try.py", line 40, in <module>
    writer.writerows(list_of_rows)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 57: ordinal not in range(128)
When I write the data to the CSV, the values I end up with contain these \x… fragments, and that stops the data from being written to the CSV. How can I change the data to remove them, or otherwise work around this problem?
The problem you are hitting is an encoding error caused by \xa0 (a non-breaking space) and possibly other non-ASCII characters: when Python tries to write these characters to the file, it falls back to the default ASCII codec, which cannot represent them.
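You can reproduce the failure in isolation. Here is a minimal sketch, using one of the values from your data, that triggers the same codec error (roughly what the Python 2 csv module does under the hood when it writes a unicode row):

cell = u'N.\xa0Cruz'          # a cell value as returned by cell.text
try:
    cell.encode('ascii')      # the implicit ASCII encode that fails
except UnicodeEncodeError as exc:
    print(exc)                # 'ascii' codec can't encode character u'\xa0' ...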
Here are a few ways to fix it.

Replace \xa0 and call encode(): explicitly encode all of the cell text as UTF-8:
for cell in row.findAll('td'):
    text = cell.text.replace(u'\xa0', '').encode('utf-8')  # strip the non-breaking space and encode as UTF-8
    list_of_cells.append(text)
This strips the non-breaking space and encodes the text as UTF-8, which avoids the encoding error.
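As a self-contained check of the same idea (assuming Python 2 and a binary-mode file, as in your original script; demo.csv is just a throwaway name), writing the encoded cells no longer raises:

import csv

row = [u'G.\xa0Holland', u'N.\xa0Cruz']
encoded = [cell.replace(u'\xa0', '').encode('utf-8') for cell in row]  # unicode -> UTF-8 bytes

with open('demo.csv', 'wb') as f:
    csv.writer(f).writerow(encoded)  # byte strings write cleanly, no ASCII fallback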
Open the output file with encoding='utf-8': on Python 3, you can specify the encoding when opening the file:
import csv

outfile = open("./Balpbp.csv", "w", encoding='utf-8', newline='')  # set the encoding to UTF-8
writer = csv.writer(outfile)
That way the CSV writer can handle Unicode characters directly.
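For example (Python 3, writing a throwaway demo.csv):

import csv

# The non-breaking space goes straight into the UTF-8 encoded file,
# no manual encode() step needed.
with open('demo.csv', 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerow(['G.\xa0Holland', 'N.\xa0Cruz'])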
If you only want to delete or replace the non-ASCII characters, you can clean the text with a regular expression:
import re

for cell in row.findAll('td'):
    text = cell.text
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # delete all non-ASCII characters
    list_of_cells.append(text)
This approach is appropriate when you only need to keep ASCII characters in the data.
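A quick check of the pattern on one of the problem values:

import re

# Everything outside the ASCII range, including \xa0, is removed,
# so u'G.\xa0Holland' comes out as 'G.Holland'.
print(re.sub(r'[^\x00-\x7F]+', '', u'G.\xa0Holland'))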
When constructing the BeautifulSoup object, explicitly specify UTF-8 and choose the html.parser or lxml parser:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')  # make sure UTF-8 is used
This reduces encoding problems caused by the parser's default behaviour.
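A self-contained sketch of the idea, feeding raw UTF-8 bytes plus an explicit from_encoding so the parser does not have to guess:

from bs4 import BeautifulSoup

raw = u'<td>N.\xa0Cruz</td>'.encode('utf-8')             # bytes, like response.content
soup = BeautifulSoup(raw, 'html.parser', from_encoding='utf-8')
print(repr(soup.td.text))                                # shows the \xa0 survived decoding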
Putting the approaches above together gives a more robust version of the script:
import requests
from bs4 import BeautifulSoup
import csv

base_url = 'http://www.baseball-reference.com'
data = requests.get("http://www.baseball-reference.com/teams/BAL/2014-schedule-scores.shtml")
soup = BeautifulSoup(data.content, 'html.parser')

outfile = open("./Balpbp.csv", "w", encoding='utf-8', newline='')  # write the CSV as UTF-8
writer = csv.writer(outfile)

url = []
for link in soup.find_all('a'):
    if not link.has_attr('href') or link.get_text() != 'boxscore':
        continue
    url.append(base_url + link['href'])

for list_url in url:
    response = requests.get(list_url)
    html = response.content
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', attrs={'id': 'play_by_play'})
    if not table:
        continue  # skip games that have no play-by-play table
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace(u'\xa0', ' ').strip()  # replace the non-breaking space and tidy up
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    writer.writerows(list_of_rows)

outfile.close()
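As a quick sanity check after running it (Python 3, reading back the file the script just wrote):

import csv

# Read the first few rows back with the same encoding to confirm the
# play-by-play text round-trips cleanly.
with open('./Balpbp.csv', encoding='utf-8', newline='') as f:
    for i, row in enumerate(csv.reader(f)):
        print(row)
        if i >= 4:
            break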
To sum up:

Handle the encoding: call text.encode('utf-8'), or pass encoding='utf-8' to open.
Clean up special characters: use .replace() to drop \xa0, or a regular expression to strip non-ASCII characters.
Make the script more robust: add if not table: continue so games without a play-by-play table are skipped.
This way the program runs and writes the file successfully whether or not the data contains special characters.