小能豆

huge amount of csv, using sqlite in python

py

I have more than 1000 CSV files (let’s call them SECONDARY files), with the same structure (compressed size varies from 400 KB to 3531 KB)

The SECONDARY files’ names correspond to the key that joins them to the MAIN CSV document that has another type of info.

I need to perform some functions on several columns of the SECONDARY files (some values are repeated throughout the files and need to be grouped by).

Later, I will need to get some information from the SECONDARY files taking into account some info from the MAIN file.

I tried:

import pandas as pd
import sqlite3
import os
import re

# Set the directory where your CSV files are located
csv_directory = r"path"

# Connect to SQLite database (or create a new one if it doesn't exist)
conn = sqlite3.connect("db.db")
cursor = conn.cursor()

# Get a list of all CSV files in the directory
csv_files = [file for file in os.listdir(csv_directory) if file.endswith(".csv")]

# Define the master table name
master_bus = "master_table"
# Replace or remove invalid characters from the master table name
master_bus = re.sub(r'\W+', '', master_bus)

# loop through each CSV file and create a table in the database
for csv_file in csv_files:
    table_name = os.path.splitext(csv_file)[0] 
    csv_path = os.path.join(csv_directory, csv_file)

# Read CSV file into a pandas DataFrame
    df = pd.read_csv(csv_path)

# Add a new column with the filename
    df['filename'] = os.path.basename(csv_file)

# Write the DataFrame to SQLite
    df.to_sql(table_name, conn, index=False)

# Concatenate all tables into the master table
tables = cursor.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
tables = [table[0] for table in tables]

# Create the master table
cursor.execute(f"CREATE TABLE '{master_bus}' AS SELECT * FROM {tables[0]} WHERE 0;")

# Concatenate data from all tables into the master table
for table in tables:
    cursor.execute(f"INSERT INTO {master_bus} SELECT * FROM {table};")

# Commit changes and close connection
conn.commit()
conn.close()

It kind of worked. I could create a DB with a table for each csv file with another code. But when I tried to add the name of the file to a new column, and create a master table this happens:

OperationalError Traceback (most recent call last)

c:\Users\acuna\Desktop\Computer Science\Project_Bus.ipynb Cell 17 line 4

37 tables = [table[0] for table in tables]

39 # Create the master table

---> 40 cursor.execute(f"CREATE TABLE '{master_bus}' AS SELECT * FROM {tables[0]} WHERE 0;")

42 # Concatenate data from all tables into the master table

43 for table in tables:

OperationalError: near "-": syntax error

阅读 86

收藏
2023-11-24

共1个答案

小能豆

The error you’re encountering is due to the presence of a hyphen in the table name (master_bus). SQLite table names should not contain special characters like hyphens. To resolve this issue, you can replace or remove invalid characters from the table name.

Here’s an updated version of your code with a function (clean_table_name) to replace invalid characters and ensure a valid table name:

import pandas as pd
import sqlite3
import os
import re

def clean_table_name(table_name):
    # Replace invalid characters with underscores
    return re.sub(r'\W+', '_', table_name)

# Set the directory where your CSV files are located
csv_directory = r"path"

# Connect to SQLite database (or create a new one if it doesn't exist)
conn = sqlite3.connect("db.db")
cursor = conn.cursor()

# Get a list of all CSV files in the directory
csv_files = [file for file in os.listdir(csv_directory) if file.endswith(".csv")]

# Define the master table name
master_bus = "master_table"
# Clean the master table name
master_bus = clean_table_name(master_bus)

# loop through each CSV file and create a table in the database
for csv_file in csv_files:
    table_name = os.path.splitext(csv_file)[0] 
    table_name = clean_table_name(table_name)  # Clean the table name
    csv_path = os.path.join(csv_directory, csv_file)

    # Read CSV file into a pandas DataFrame
    df = pd.read_csv(csv_path)

    # Add a new column with the filename
    df['filename'] = os.path.basename(csv_file)

    # Write the DataFrame to SQLite
    df.to_sql(table_name, conn, index=False)

# Concatenate all tables into the master table
tables = cursor.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
tables = [table[0] for table in tables]

# Create the master table
cursor.execute(f"CREATE TABLE '{master_bus}' AS SELECT * FROM {tables[0]} WHERE 0;")

# Concatenate data from all tables into the master table
for table in tables:
    cursor.execute(f"INSERT INTO {master_bus} SELECT * FROM {table};")

# Commit changes and close connection
conn.commit()
conn.close()

This code includes the clean_table_name function to replace invalid characters with underscores. Adjustments have been made to use this function for both the master table name and individual table names.

2023-11-24