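Problem overview
I need to merge several .xlsx files into a single workbook, one worksheet per file, where each worksheet name must be the name of the source file.
Current problem
The code below slows down and consumes a large amount of memory after it has processed a few files.
Attempted solutions
Closing the Excel file, deleting the DataFrame, and running gc manually did not help.
Code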
import pandas as pd
import openpyxl
import os
import gc as gc

print("Copying sheets from multiple files to one file")

dir_input = 'D:/MeusProjetosJava/Importacao/'
dir_output = "Integrados/combined.xlsx"
cwd = os.path.abspath(dir_input)
files = os.listdir(cwd)

df_total = pd.DataFrame()
df_total.to_excel(dir_output)  # create a new file
workbook = openpyxl.load_workbook(dir_output)
ss_sheet = workbook['Sheet1']
ss_sheet.title = 'TempExcelSheetForDeleting'
workbook.save(dir_output)

for file in files:  # loop through Excel files
    if file.endswith('.xls') or file.endswith('.xlsx'):
        excel_file = pd.ExcelFile(cwd+"/"+file)
        sheets = excel_file.sheet_names
        for sheet in sheets:
            sheet_name = str(file.title())
            sheet_name = sheet_name.replace(".xlsx","").lower()
            sheet_name = sheet_name.removesuffix(".xlsx")
            print(file, sheet_name)
            df = excel_file.parse(sheet_name = sheet)
            with pd.ExcelWriter(dir_output,mode='a') as writer:
                df.to_excel(writer, sheet_name=f"{sheet_name}", index=False)
            del df
        excel_file.close()
        del excel_file
        sheets = None
        gc.collect()

workbook = openpyxl.load_workbook(dir_output)
std = workbook["TempExcelSheetForDeleting"]
workbook.remove(std)
workbook.save(dir_output)
print("all done")

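Your problem is that memory usage keeps growing and processing slows down as you merge multiple .xlsx files into one workbook. The main cause is that the loop opens a new pandas ExcelWriter in append mode for every sheet. Each time that happens, openpyxl loads the entire output workbook into memory and writes all of it back to disk, so every iteration costs more than the previous one as the workbook grows.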
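To make that growth concrete, here is a rough counting sketch. It assumes one sheet per input file and that every append-mode save rewrites all sheets already in the output workbook. The function names are hypothetical, and it ignores the temporary placeholder sheet. It touches no real files; it only counts how many sheets get serialized in total.

```python
def sheets_saved_append_mode(n_files: int) -> int:
    """Total sheets serialized when the output workbook is reopened for every file."""
    total = 0
    for sheets_already_present in range(n_files):
        # Each save rewrites the sheets already there plus the newly added one.
        total += sheets_already_present + 1
    return total


def sheets_saved_single_writer(n_files: int) -> int:
    """Total sheets serialized when one writer stays open for the whole run."""
    return n_files


if __name__ == "__main__":
    for n in (10, 100):
        print(n, sheets_saved_append_mode(n), sheets_saved_single_writer(n))
```

Under these assumptions the append-mode total is n(n+1)/2, so 100 files cause 5050 sheet writes instead of 100, which matches the "gets slower with every file" symptom.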
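You mentioned that manually freeing memory with gc.collect() and closing the files did not help. Here are some suggested improvements, along with refactored code that reduces memory usage and improves performance: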
import pandas as pd
import openpyxl
import os

print("Copying sheets from multiple files to one file")

dir_input = 'D:/MeusProjetosJava/Importacao/'
dir_output = "Integrados/combined.xlsx"
cwd = os.path.abspath(dir_input)
files = os.listdir(cwd)

# Open the ExcelWriter once and write every sheet through it
with pd.ExcelWriter(dir_output, engine='openpyxl') as writer:
    for file in files:  # loop through Excel files
        if file.endswith('.xls') or file.endswith('.xlsx'):
            # The context manager closes the source file when the block exits
            with pd.ExcelFile(cwd + "/" + file) as excel_file:
                for sheet in excel_file.sheet_names:
                    # Sheet name is the lowercased file name without its extension
                    sheet_name = file.replace(".xlsx", "").replace(".xls", "").lower()
                    print(f"Processing {file}, Sheet: {sheet_name}")
                    # Parse the sheet into a DataFrame
                    df = excel_file.parse(sheet_name=sheet)
                    # Write the DataFrame to the output workbook under that name
                    df.to_excel(writer, sheet_name=sheet_name, index=False)

# After all sheets are written, remove the default "Sheet1" if it exists
workbook = openpyxl.load_workbook(dir_output)
if 'Sheet1' in workbook.sheetnames:
    del workbook['Sheet1']
    workbook.save(dir_output)

print("All done")

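Since every sheet name is derived from a file name, it may be worth guarding against names Excel will not accept. The helper below is a hypothetical sketch, not part of pandas or openpyxl. It assumes only a subset of Excel's naming rules matters here: at most 31 characters, none of the characters `[ ] : * ? / \`, and no duplicate names within a workbook.

```python
import os
import re

MAX_SHEET_NAME_LEN = 31  # Excel's limit on worksheet name length
INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")  # characters Excel rejects in sheet names


def sheet_name_from_file(file_name: str, used: set[str]) -> str:
    """Derive a unique, Excel-safe sheet name from a file name (hypothetical helper)."""
    # Drop the directory and the extension, whatever its case, then lowercase.
    stem = os.path.splitext(os.path.basename(file_name))[0].lower()
    # Replace forbidden characters and enforce the length limit.
    base = INVALID_SHEET_CHARS.sub("_", stem)[:MAX_SHEET_NAME_LEN] or "sheet"
    name, counter = base, 1
    while name in used:
        # Append a numeric suffix, trimming the base so the result still fits.
        suffix = f"_{counter}"
        name = base[: MAX_SHEET_NAME_LEN - len(suffix)] + suffix
        counter += 1
    used.add(name)
    return name
```

In the loop above, one would create `used_names = set()` before opening the writer and call `sheet_name_from_file(file, used_names)` in place of the chained `replace` calls. A file with several sheets then gets `name`, `name_1`, `name_2`, and so on, rather than reusing one sheet name for all of them.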
Hopefully this version of the code improves performance and reduces memory usage!