I have more than 1000 CSV files (let’s call them SECONDARY files), with the same structure (compressed size varies from 400 KB to 3531 KB)
The SECONDARY files’ names correspond to the key that joins them to the MAIN CSV document that has another type of info.
I need to perform some functions on several columns of the SECONDARY files (some values are repeated throughout the files and need to be grouped by).
Later, I will need to get some information from the SECONDARY files taking into account some info from the MAIN file.
I tried:
import pandas as pd import sqlite3 import os import re # Set the directory where your CSV files are located csv_directory = r"path" # Connect to SQLite database (or create a new one if it doesn't exist) conn = sqlite3.connect("db.db") cursor = conn.cursor() # Get a list of all CSV files in the directory csv_files = [file for file in os.listdir(csv_directory) if file.endswith(".csv")] # Define the master table name master_bus = "master_table" # Replace or remove invalid characters from the master table name master_bus = re.sub(r'\W+', '', master_bus) # loop through each CSV file and create a table in the database for csv_file in csv_files: table_name = os.path.splitext(csv_file)[0] csv_path = os.path.join(csv_directory, csv_file) # Read CSV file into a pandas DataFrame df = pd.read_csv(csv_path) # Add a new column with the filename df['filename'] = os.path.basename(csv_file) # Write the DataFrame to SQLite df.to_sql(table_name, conn, index=False) # Concatenate all tables into the master table tables = cursor.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall() tables = [table[0] for table in tables] # Create the master table cursor.execute(f"CREATE TABLE '{master_bus}' AS SELECT * FROM {tables[0]} WHERE 0;") # Concatenate data from all tables into the master table for table in tables: cursor.execute(f"INSERT INTO {master_bus} SELECT * FROM {table};") # Commit changes and close connection conn.commit() conn.close()
It kind of worked. I could create a DB with a table for each csv file with another code. But when I tried to add the name of the file to a new column, and create a master table this happens:
OperationalError Traceback (most recent call last) c:\Users\acuna\Desktop\Computer Science\Project_Bus.ipynb Cell 17 line 4 37 tables = [table[0] for table in tables] 39 # Create the master table ---> 40 cursor.execute(f"CREATE TABLE '{master_bus}' AS SELECT * FROM {tables[0]} WHERE 0;") 42 # Concatenate data from all tables into the master table 43 for table in tables: OperationalError: near "-": syntax error
The error you’re encountering is due to the presence of a hyphen in the table name (master_bus). SQLite table names should not contain special characters like hyphens. To resolve this issue, you can replace or remove invalid characters from the table name.
master_bus
Here’s an updated version of your code with a function (clean_table_name) to replace invalid characters and ensure a valid table name:
clean_table_name
import pandas as pd import sqlite3 import os import re def clean_table_name(table_name): # Replace invalid characters with underscores return re.sub(r'\W+', '_', table_name) # Set the directory where your CSV files are located csv_directory = r"path" # Connect to SQLite database (or create a new one if it doesn't exist) conn = sqlite3.connect("db.db") cursor = conn.cursor() # Get a list of all CSV files in the directory csv_files = [file for file in os.listdir(csv_directory) if file.endswith(".csv")] # Define the master table name master_bus = "master_table" # Clean the master table name master_bus = clean_table_name(master_bus) # loop through each CSV file and create a table in the database for csv_file in csv_files: table_name = os.path.splitext(csv_file)[0] table_name = clean_table_name(table_name) # Clean the table name csv_path = os.path.join(csv_directory, csv_file) # Read CSV file into a pandas DataFrame df = pd.read_csv(csv_path) # Add a new column with the filename df['filename'] = os.path.basename(csv_file) # Write the DataFrame to SQLite df.to_sql(table_name, conn, index=False) # Concatenate all tables into the master table tables = cursor.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall() tables = [table[0] for table in tables] # Create the master table cursor.execute(f"CREATE TABLE '{master_bus}' AS SELECT * FROM {tables[0]} WHERE 0;") # Concatenate data from all tables into the master table for table in tables: cursor.execute(f"INSERT INTO {master_bus} SELECT * FROM {table};") # Commit changes and close connection conn.commit() conn.close()
This code includes the clean_table_name function to replace invalid characters with underscores. Adjustments have been made to use this function for both the master table name and individual table names.