我有一组数据是通过将相似的子项目匹配在一起,然后按“类别”将这些相似的项目分组而创建的。
现在,必须以使每个“ group_id”内的相关类别分组在一起的方式匹配结果类别。在下面的示例中,一个匹配项是A-> B-> C-> D-> E-> F-> G,这是通过逐行重复获得的。
我已经发布了当前答案,该答案适用于此简单数据集,但是由于实际数据集最多包含1M行,并且每个“ group_id”最多可能有60个类别,因此该查询会导致“后台处理空间不足”在实际数据上的错误。
请注意:
正确答案将
您需要递归方法,但是WITH RECURSIVE会产生巨大的中间结果,因此 不再需要假脱机 。
WITH RECURSIVE
对于类似的过程,我使用了以下方法(最初在存储过程中使用WHILE循环):
CREATE MULTISET VOLATILE TABLE vt_tmp, NO Log AS ( SELECT group_id, category_1, category_2, -- assign a unique number to Dense_Rank() Over (ORDER BY group_id, category_1) AS rnk -- remove when you source data is unique GROUP BY 1,2,3 -- same result as a DISTINCT, but processed before DENSE_RANK FROM match_detail ) WITH DATA PRIMARY INDEX (category_2) ON COMMIT PRESERVE ROWS;
现在重复以下更新,直到0 rows processed:
0 rows processed
-- find matching categories and assign them a common number UPDATE vt_tmp FROM ( SELECT e2.group_id, e2.category_1, Min(e1.rnk) AS minrnk FROM vt_tmp e1 JOIN vt_tmp e2 ON e1.category_2 = e2.category_2 AND e1.rnk < e2.rnk GROUP BY e2.group_id, e2.category_1 ) x SET rnk = minrnk WHERE vt_tmp.group_id = x.group_id AND vt_tmp.category_1 = x.category_1 ;
要获得相关类别,您最终需要:
SELECT group_id, category_1 AS category, rnk AS related_categories FROM vt_tmp UNION SELECT group_id, category_2, rnk FROM vt_tmp
为了与您的预期结果完全匹配,您需要添加一个DENSE_RANK:
DENSE_RANK
SELECT group_id, category, Dense_Rank() Over (PARTITION BY group_id ORDER BY related_categories) FROM ( SELECT group_id, category_1 AS category, rnk AS related_categories FROM vt_tmp UNION SELECT group_id, category_2, rnk FROM vt_tmp ) AS dt