一尘不染

如何在CSV文件中查找为主键候选集设置的列?

algorithm

我有一个CSV文件(未规范化,例如,实际文件最多100列):

   ID, CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
    1,     CUST1,     CLIENT1,          10, 2018-04-01, 2018-04-02
    2,     CUST1,     CLIENT1,          10, 2018-04-01, 2018-05-30
    3,     CUST1,     CLIENT1,         101, 2018-04-02, 2018-04-03
    4,     CUST2,     CLIENT1,         102, 2018-04-02, 2018-04-03

我如何找到所有可能用作主键的列集。

所需的输出:

  1) ID
  2) PAYMENT_NUM,START_DATE,END_DATE
  3) CUST_NAME, CLIENT_NAME, PAYMENT_NUM,START_DATE,END_DATE

我可以用Java做到这一点,但可能是Python / Pandas已经提供了快速解决方案


阅读 288

收藏
2020-07-28

共1个答案

一尘不染

熊猫和itertools将为您提供所需的东西。

import pandas
from itertools import chain, combinations

def key_options(items):
    return chain.from_iterable(combinations(items, r) for r in range(1, len(items)+1) )

df = pandas.read_csv('test.csv');

# iterate over all combos of headings, excluding ID for brevity
for candidate in key_options(list(df)[1:]):
    deduped = df.drop_duplicates(candidate)

    if len(deduped.index) == len(df.index):
        print ','.join(candidate)

这将为您提供输出:

PAYMENT_NUM, END_DATE
CUST_NAME, CLIENT_NAME, END_DATE
CUST_NAME, PAYMENT_NUM, END_DATE
CLIENT_NAME, PAYMENT_NUM, END_DATE
PAYMENT_NUM, START_DATE, END_DATE
CUST_NAME, CLIENT_NAME, PAYMENT_NUM, END_DATE
CUST_NAME, CLIENT_NAME, START_DATE, END_DATE
CUST_NAME, PAYMENT_NUM, START_DATE, END_DATE
CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
2020-07-28