We have this code (thanks to Cody and Alex Tereshenkov):
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 50)

url = "https://www.aliexpress.com/store/feedback-score/1665279.html"
s = requests.Session()
r = s.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# the feedback history table is rendered inside an iframe; fetch the iframe document
iframe_src = soup.select_one("#detail-displayer").attrs["src"]
r = s.get(f"https:{iframe_src}")
soup = BeautifulSoup(r.content, "html.parser")

rows = []
for row in soup.select(".history-tb tr"):
    rows.append([e.text for e in row.select("th, td")])

df = pd.DataFrame.from_records(
    rows,
    columns=['Feedback', '1 Month', '3 Months', '6 Months'],
)
# remove first row with column names
df = df.iloc[1:]
df['Shop'] = url.split('/')[-1].split('.')[0]

# pivot into a single row per shop, then flatten the resulting column MultiIndex
pivot = df.pivot(index='Shop', columns='Feedback')
pivot.columns = [' '.join(col).strip() for col in pivot.columns.values]

column_mapping = dict(
    zip(pivot.columns.tolist(),
        [col[:12] for col in pivot.columns.tolist()]))
# column_mapping
# {'1 Month Negative (1-2 Stars)': '1 Month Nega',
#  '1 Month Neutral (3 Stars)': '1 Month Neut',
#  '1 Month Positive (4-5 Stars)': '1 Month Posi',
#  '1 Month Positive feedback rate': '1 Month Posi',
#  '3 Months Negative (1-2 Stars)': '3 Months Neg',
#  '3 Months Neutral (3 Stars)': '3 Months Neu',
#  '3 Months Positive (4-5 Stars)': '3 Months Pos',
#  '3 Months Positive feedback rate': '3 Months Pos',
#  '6 Months Negative (1-2 Stars)': '6 Months Neg',
#  '6 Months Neutral (3 Stars)': '6 Months Neu',
#  '6 Months Positive (4-5 Stars)': '6 Months Pos',
#  '6 Months Positive feedback rate': '6 Months Pos'}
pivot.columns = [column_mapping[col] for col in pivot.columns]
pivot.to_excel('Report.xlsx')
```
The code extracts the "Feedback history" table for the given URL (the table sits inside an iframe) and converts all of the table data into a single row.
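One caveat visible in the `column_mapping` above: truncating names to 12 characters makes 'Positive (4-5 Stars)' and 'Positive feedback rate' collapse to the same header (e.g. '1 Month Posi'), so the exported sheet ends up with duplicate column names. A minimal sketch of a collision-free shortening, replacing the join-and-truncate step (the abbreviations below are our own, not from the original code):

```python
# hypothetical short names for the four feedback categories
short_names = {
    'Negative (1-2 Stars)': 'Neg',
    'Neutral (3 Stars)': 'Neu',
    'Positive (4-5 Stars)': 'Pos',
    'Positive feedback rate': 'Pos Rate',
}
# before flattening, pivot.columns is a MultiIndex of (period, feedback) tuples
pivot.columns = [f"{period} {short_names[feedback]}"
                 for period, feedback in pivot.columns.values]
```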
Separately, we have a file in the same project folder ("urls.txt") containing 50 URLs, like this:
https://www.aliexpress.com/store/feedback-score/4385007.html
https://www.aliexpress.com/store/feedback-score/1473089.html
https://www.aliexpress.com/store/feedback-score/3085095.html
https://www.aliexpress.com/store/feedback-score/2793002.html
https://www.aliexpress.com/store/feedback-score/4656043.html
https://www.aliexpress.com/store/feedback-score/4564021.html
We just need to extract the same data for every URL in the file.
How can we do that?
Since there are only about 50 URLs, you can read them into a list and then send a request to each one. I just tested these 6 URLs and the solution works for them. But you may want to add some try-except blocks to handle any exceptions that might occur (see the sketch after the code below).
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

with open('urls.txt', 'r') as f:
    urls = [line.strip() for line in f]  # strip trailing newlines

master_list = []
for idx, url in enumerate(urls):
    s = requests.Session()
    r = s.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    # follow the iframe that holds the feedback history table
    iframe_src = soup.select_one("#detail-displayer").attrs["src"]
    r = s.get(f"https:{iframe_src}")
    soup = BeautifulSoup(r.content, "html.parser")

    rows = []
    for row in soup.select(".history-tb tr"):
        rows.append([e.text for e in row.select("th, td")])

    df = pd.DataFrame.from_records(
        rows,
        columns=['Feedback', '1 Month', '3 Months', '6 Months'],
    )
    df = df.iloc[1:]  # remove first row with column names
    shop = url.split('/')[-1].split('.')[0]
    df['Shop'] = shop

    pivot = df.pivot(index='Shop', columns='Feedback')
    master_list.append([shop] + pivot.values.tolist()[0])

    if idx == len(urls) - 1:  # last item in the list
        final_output = pd.DataFrame(master_list)
        pivot.columns = [' '.join(col).strip() for col in pivot.columns.values]
        column_mapping = dict(zip(pivot.columns.tolist(),
                                  [col[:12] for col in pivot.columns.tolist()]))
        final_output.columns = ['Shop'] + [column_mapping[col] for col in pivot.columns]
        final_output.set_index('Shop', inplace=True)
        final_output.to_excel('Report.xlsx')
```
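The try-except mentioned above could be wrapped around the per-URL work so that one bad page does not abort the whole run. A minimal sketch (the `timeout`, the `raise_for_status()` call, and the failure list are our additions, not part of the tested answer):

```python
import requests

failed = []
for idx, url in enumerate(urls):
    try:
        s = requests.Session()
        r = s.get(url, timeout=30)
        r.raise_for_status()  # turn HTTP errors (404, 5xx) into exceptions
        # ... parsing, pivoting and master_list.append() exactly as above ...
    except Exception as exc:
        failed.append((url, exc))  # record the failure and move on
        continue
```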
Output:
A better solution you might consider is to avoid pandas entirely. Once you have the data, you can manipulate it into plain lists and use XlsxWriter.
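For instance, assuming the per-shop rows are collected the same way as `master_list` in the code above, a minimal sketch with XlsxWriter might look like this (the sheet name and placeholder header are our own):

```python
import xlsxwriter

workbook = xlsxwriter.Workbook('Report.xlsx')
worksheet = workbook.add_worksheet('Feedback')

# hypothetical header row; replace with the 12 real column names
worksheet.write_row(0, 0, ['Shop'] + [f'col_{i}' for i in range(1, 13)])

# one shop per spreadsheet row, shop id in the first column
for row_idx, shop_row in enumerate(master_list, start=1):
    worksheet.write_row(row_idx, 0, shop_row)

workbook.close()
```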