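I have a pandas DataFrame with a column of Wikipedia URLs that I would like to load. However, some of the strings fail to load because they contain Unicode. For example, "Kruskal%E2%80%93Wallis_one-way_analysis_of_variance" raises an error like the following: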
PageError: Page id "Cauchy%E2%80%93Schwarz_inequality" does not match any pages. Try another id!
Is there a way to convert all of the Unicode to ASCII? In this case, I need a function that creates a new column:
old column                           new column
Cauchy%E2%80%93Schwarz_inequality    Cauchy–Schwarz_inequality
Markov%27s_inequality                Markov's_inequality
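urllib.parse.unquote should do the trick. The strings are percent-encoded (URL-encoded), and unquote decodes escapes such as %E2%80%93 back into the characters they represent (here an en dash, which is Unicode rather than ASCII). Hope this helps.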
In [1]: import urllib.parse  # import the submodule explicitly; a bare "import urllib" may not expose urllib.parse
   ...:
   ...: import pandas as pd
   ...:
   ...: df = pd.DataFrame({'url': ['Markov%27s_inequality', 'Cauchy%E2%80%93Schwarz_inequality']})
   ...: # Decode the percent-escapes in every row into a new column
   ...: df['clean_url'] = df['url'].apply(urllib.parse.unquote)

In [2]: df
Out[2]:
                                 url                  clean_url
0              Markov%27s_inequality        Markov's_inequality
1  Cauchy%E2%80%93Schwarz_inequality  Cauchy–Schwarz_inequality
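If the column might also contain missing values, the same idea can be wrapped in a small helper. This is only a sketch under assumptions: the function name decode_titles and the None entry in the sample data are made up for illustration and do not come from the question. It relies on Series.map with na_action="ignore", so unquote is only ever called on actual strings.

```python
from urllib.parse import unquote

import pandas as pd


def decode_titles(titles: pd.Series) -> pd.Series:
    """Percent-decode each string in `titles`, leaving missing values untouched.

    Hypothetical helper for illustration; not part of pandas or urllib.
    """
    # na_action="ignore" propagates NaN/None without passing them to unquote.
    return titles.map(unquote, na_action="ignore")


# Sample data: the two titles from the question plus an assumed missing value.
df = pd.DataFrame({
    "url": [
        "Markov%27s_inequality",
        "Cauchy%E2%80%93Schwarz_inequality",
        None,
    ],
})
df["clean_url"] = decode_titles(df["url"])
print(df)
```

unquote assumes UTF-8 by default, which matches how Wikipedia encodes its page titles in URLs, so no extra arguments should be needed for this data.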