一尘不染

如何在SQL中选择相似的集合

algorithm

我有以下表格:

Order
----
ID (pk)

OrderItem
----
OrderID (fk -> Order.ID)
ItemID (fk -> Item.ID)
Quantity

Item
----
ID (pk)

我该如何编写一个查询,以选择Orders与特定对象至少85%相似的所有对象Order

我考虑过使用Jaccard
Index
统计量来计算两个相似度Orders。(通过将每个集合的交集OrderItems除以每个集合的并集OrderItems

但是,我想不出一种方法来存储两个的每个可能的Jaccard Index Orders还有另一种方法吗?

另外,是否有一种方法可以将Quantity每个匹配项的差异包括OrderItem在内?

附加信息:


总计Orders:〜79K
OrderItems:〜1.76米
魅力。OrderItems每位Order:21.5
总计Items:〜13k

注意


85%的相似度数字只是对客户实际需求的最佳猜测,将来可能会改变。适用于任何相似性的解决方案将是更可取的。


阅读 343

收藏
2020-07-28

共1个答案

一尘不染

您指定

如何编写查询以选择与特定订单至少相似85%的所有订单?

与“至少相似度为85%的所有成对订单”相比,这是一个重要的简化。

我们将使用一些TDQD(测试驱动的查询设计)和一些分析来帮助我们。

初赛

要远程相似,两个订单必须至少有一个共同点。此查询可用于确定哪些订单与指定订单至少具有一个共同点:

SELECT DISTINCT I1.OrderID AS ID
  FROM OrderItem AS I1
  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
 WHERE I1.OrderID != <specified order ID>

这会大大减少要检查的其他订单的列表,尽管如果指定的订单包含您最受欢迎的商品之一,则很可能很多其他订单也这样做。

除了DISTINCT,您可以使用:

SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common
  FROM OrderItem AS I1
  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
 WHERE I1.OrderID != <specified order ID>
 GROUP BY I1.OrderID

这将为您提供与指定订单共有的订单中的商品数量。我们还需要每个订单中的商品数量:

SELECT OrderID AS ID, COUNT(*) AS Num_Total
  FROM OrderItem
 GROUP BY OrderID;

相同顺序

对于100%的相似性,这两个订单的共同点与每个物品的点数一样多。但是,这可能找不到很多成对的订单。我们可以很容易地找到与指定订单完全相同的商品的订单:

SELECT L1.ID
  FROM (SELECT OrderID AS ID, COUNT(*) AS Num_Total
          FROM OrderItem
         GROUP BY OrderID
       ) AS L1
  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common
          FROM OrderItem AS I1
          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
         WHERE I1.OrderID != <specified order ID>
         GROUP BY I1.OrderID
       ) AS L2 ON L1.ID = L2.ID AND L1.Num_Total = L2.Num_Common;

编辑: 结果证明不够严格;为了使订单相同,指定订单中的商品数也必须与共同的商品数相同:

SELECT L1.ID, L1.Num_Total, L2.ID, L2.Num_Common, L3.ID, L3.Num_Total
  FROM (SELECT OrderID AS ID, COUNT(*) AS Num_Total
          FROM OrderItem
         GROUP BY OrderID
       ) AS L1
  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common
          FROM OrderItem AS I1
          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
         WHERE I1.OrderID != <specified order ID>
         GROUP BY I1.OrderID
       ) AS L2 ON L1.ID = L2.ID AND L1.Num_Total = L2.Num_Common
  JOIN (SELECT OrderID AS ID, COUNT(*) AS Num_Total
          FROM OrderItem
         WHERE OrderID = <specified order ID>
         GROUP BY OrderID
       ) AS L3 ON L2.Num_Common = L3.Num_Total;

相似的订单—分析公式

将 Wikipedia上定义的Jaccard相似度应用于具有|
A |的两个顺序A和B。是顺序A中项数的计数,Jaccard相似度 J(A,B)= |A∩B| ÷|A∪B| ,其中|A∩B|
是两个订单共有的项目数,| A commonB | 是订购的不同商品的总数。

为了满足“
85%提卡相似度”标准,如果任一订单中的商品数量少于某个阈值,则订单必须相同。例如,如果订单A和B都有5个商品,但是两者之间有一个商品不同,那么它将为您提供4个共同商品(|A∩B|)和总共6个商品(|A∪B|)
,因此Jaccard相似度J(A,B)仅为66⅔%。

对于85%的相似度,当两个订单中的每一个中有N个项目,而一个项目不同时, (N-1)÷(N + 1)≥0.85 ,这意味着N>
12(准确地说是12⅓)。对于分数F = J(A,B),可以用一项不同的均值 (N-1)÷(N + 1)≥F 来求解N, 得出N≥(1 +
F)÷(1- F)
。随着相似性要求的提高,N的值越大,阶数必须相同。

进一步概括一下,让我们假设我们有N个和M个项目的不同大小顺序(不失一般性,N <M)。|A∩B|的最大值 现在是N,且|A∪B|的最小值
是M(表示所有较小顺序的项目都按较大顺序出现)。让我们定义M = N +
∆,并且存在∂个项,这些项以较小的顺序显示,而没有以较大的顺序显示。因此,存在以较大顺序显示的∆ +∂项目,而不是以较小顺序显示的项目。

根据定义,|A∩B| =N-∂,并且|A∪B| =(N-∂)+∂+(N +
∆-(N-∂)),其中三个相加项表示(1)两个订单之间共有的项目数,(2)仅在(3)仅以较大顺序排列的商品数量。简化为:|A∪B| = N + ∆ +∂。


关键方程式

对于相似分数F,我们对J(A,B)≥F的订单对感兴趣,因此:

(N-∂)÷(N + ∆ +∂)≥F

F≤(N-∂)÷(N + ∆ +∂)


我们可以使用电子表格来绘制它们之间的关系。对于给定数量的较小顺序(x轴)的项目,以及给定的相似度,我们可以绘制∂的最大值,以得出与我们相似的F。公式为:

∂=(N(1-F)-F∆)÷(1 + F)

∂的曲线=(N(1-F)-F∆)÷(1 + F)...

对于常数F,这是一个在N和∆中的线性方程。对于不同的F值,它是非线性的。显然,∂必须是一个非负整数。

给定F = 0.85,对于相同大小的订单(∆ = 0),对于1≤N <13,∂= 0;对于13≤N <25,∂≤1; 对于25≤N
<37,∂≤2,对于37≤N <50,∂≤3。

对于相差1(∆ = 1)的阶,对于1≤N <18,∂= 0;对于18≤N <31,∂≤1; 对于31≤N <43,∂≤2; 等等。如果∆ = 6,则您需要N
= 47才能使订单仍与∂= 1相似的85%。这意味着小订单有47个项目,其中46个与大订单有53个项目相同。

相似的订单-应用分析

到目前为止,一切都很好。我们如何应用该理论来选择类似于指定订单的订单?

首先,我们观察到指定的订单可能与相似的订单大小相同,或者更大或更小。这使事情变得有些复杂。

上式的参数为:

  • N –数量较少的商品
  • ∆ —较大项数与N之间的差
  • F-固定
  • ∂—较小顺序中的项目数不匹配较大顺序中的项

在顶部开发的查询中使用细微变化可以使用的值:

  • NC-共有项目数
  • NA —指定顺序的项目数
  • NB —比较顺序中的项目数

对应查询:

SELECT OrderID AS ID, COUNT(*) AS NA
  FROM OrderItem
 WHERE OrderID = <specified order ID>
 GROUP BY OrderID;

SELECT OrderID AS ID, COUNT(*) AS NB
  FROM OrderItem
 WHERE OrderID != <specified order ID>
 GROUP BY OrderID;

SELECT I1.OrderID AS ID, COUNT(*) AS NC
  FROM OrderItem AS I1
  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
 WHERE I1.OrderID != <specified order ID>
 GROUP BY I1.OrderID

为了方便起见,我们希望获得N和N + ∆(因此还有∆)值,因此我们可以使用UNION适当地安排事物,其中:

  • NS = N-较小顺序的项目数
  • NL = N + ∆-较大顺序的项目数

在UNION查询的第二个版本中,具有:

  • NC = N-∂-共有项目数

这两个查询都保留两个订单ID号,以便您以后可以追溯到其余的订单信息。

SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL
  FROM (SELECT OrderID AS ID, COUNT(*) AS NA
          FROM OrderItem
         WHERE OrderID = <specified order ID>
         GROUP BY OrderID
       ) AS v1
  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
          FROM OrderItem
         WHERE OrderID != <specified order ID>
         GROUP BY OrderID
       ) AS v2
    ON v1.NA <= v2.NB
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL
  FROM (SELECT OrderID AS ID, COUNT(*) AS NA
          FROM OrderItem
         WHERE OrderID = <specified order ID>
         GROUP BY OrderID
       ) AS v1
  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
          FROM OrderItem
         WHERE OrderID != <specified order ID>
         GROUP BY OrderID
       ) AS v2
    ON v1.NA > v2.NB

这为我们提供了一个表格表达式,其中包含列OrderID_1,NS,OrderID_2,NL,其中NS是“较小顺序”中的项目数,而NL是较大顺序中的项目数。由于v1和v2表表达式生成的订单号没有重叠,因此无需担心OrderID值相同的“反身”条目。也可以在UNION查询中最容易地将NC添加到此:

SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
  FROM (SELECT OrderID AS ID, COUNT(*) AS NA
          FROM OrderItem
         WHERE OrderID = <specified order ID>
         GROUP BY OrderID
       ) AS v1
  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
          FROM OrderItem
         WHERE OrderID != <specified order ID>
         GROUP BY OrderID
       ) AS v2
    ON v1.NA <= v2.NB
  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
          FROM OrderItem AS I1
          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
         WHERE I1.OrderID != <specified order ID>
         GROUP BY I1.OrderID
       ) AS v3
    ON v3.ID = v2.ID
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
  FROM (SELECT OrderID AS ID, COUNT(*) AS NA
          FROM OrderItem
         WHERE OrderID = <specified order ID>
         GROUP BY OrderID
       ) AS v1
  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
          FROM OrderItem
         WHERE OrderID != <specified order ID>
         GROUP BY OrderID
       ) AS v2
    ON v1.NA > v2.NB
  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
          FROM OrderItem AS I1
          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
         WHERE I1.OrderID != <specified order ID>
         GROUP BY I1.OrderID
       ) AS v3
    ON v3.ID = v1.ID

这为我们提供了一个表格表达式,其中包含列OrderID_1,NS,OrderID_2,NL,NC,其中NS是“较小顺序”中的项目数,NL是较大顺序中的项目数,而NC是“较小顺序”中的项目数共同。

给定NS,NL,NC,我们正在寻找满足以下条件的订单:

(N-∂)÷(N + ∆ +∂)≥F.

  • N –数量较少的商品
  • ∆ —较大项数与N之间的差
  • F-固定
  • ∂—较小顺序中的项目数不匹配较大顺序中的项

  • NS = N-较小顺序的项目数

  • NL = N + ∆-较大顺序的项目数

  • NC = N-∂-共有项目数

因此,条件必须是:

NC / (NL + (NS - NC)) ≥ F

LHS上的术语必须以浮点数而不是整数表达式的形式求值。将其应用于上面的UNION查询,将产生:

SELECT OrderID_1, NS, OrderID_2, NL, NC,
        CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity
  FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
          FROM (SELECT OrderID AS ID, COUNT(*) AS NA
                  FROM OrderItem
                 WHERE OrderID = <specified order ID>
                 GROUP BY OrderID
               ) AS v1
          JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
                  FROM OrderItem
                 WHERE OrderID != <specified order ID>
                 GROUP BY OrderID
               ) AS v2
            ON v1.NA <= v2.NB
          JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
                  FROM OrderItem AS I1
                  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
                 WHERE I1.OrderID != <specified order ID>
                 GROUP BY I1.OrderID
               ) AS v3
            ON v3.ID = v2.ID
        UNION
        SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
          FROM (SELECT OrderID AS ID, COUNT(*) AS NA
                  FROM OrderItem
                 WHERE OrderID = <specified order ID>
                 GROUP BY OrderID
               ) AS v1
          JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
                  FROM OrderItem
                 WHERE OrderID != <specified order ID>
                 GROUP BY OrderID
               ) AS v2
            ON v1.NA > v2.NB
          JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
                  FROM OrderItem AS I1
                  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
                 WHERE I1.OrderID != <specified order ID>
                 GROUP BY I1.OrderID
               ) AS v3
            ON v3.ID = v1.ID
       ) AS u
 WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F

您可能会注意到,该查询仅使用OrderItem表。不需要订单和项目表。


警告:部分测试过的SQL(提示符)。现在,上面的SQL似乎对微小的数据集产生了合理的答案。我调整了相似度要求(先是0.25,然后是0.55),然后得出了合理的值和适当的选择性。但是,我的测试数据按最大顺序只有8项,并且肯定没有涵盖所描述数据的全部范围。由于我最常使用的DBMS不支持CTE,因此下面的SQL未经测试。但是,我有一定的信心,除非犯了大错,否则版本1中的CTE代码(具有大量重复的指定订单ID)应该是干净的。我认为版本2也可以,但是…未经测试。

可能有更紧凑的方式来表达查询,可能使用OLAP函数。

如果要对此进行测试,我将创建一个表,其中包含一些具有代表性的订单项集,检查返回的相似性度量是否合理。如图所示,我或多或少地会处理查询,逐渐建立复杂的查询。如果其中一个表达式显示有缺陷,那么我将对该查询进行适当的调整,直到修复缺陷为止。

显然,性能将是一个问题。最内层的查询并不是很复杂,但并不是微不足道的。但是,测量将显示这是一个引人注目的问题还是令人讨厌的问题。研究查询计划可能会有所帮助。似乎很可能在OrderItem.OrderID上有一个索引;如果没有这样的索引,查询就不太可能执行良好。因为它是外键列,所以这不太可能成为问题。

使用“ WITH子句”(通用表表达式)可能会带来一些好处。他们将明确表示在UNION子查询的两半中隐含的重复。


使用公用表表达式

当表达式相同时,使用公用表表达式可以使优化程序清晰明了,并且可以帮助其更好地执行。他们还帮助人们阅读您的查询。上面的查询确实请求使用CTE。

版本1:重复指定的订单号

WITH SO AS (SELECT OrderID AS ID, COUNT(*) AS NA       -- Specified Order (SO)
              FROM OrderItem
             WHERE OrderID = <specified order ID>
             GROUP BY OrderID
           ),
     OO AS (SELECT OrderID AS ID, COUNT(*) AS NB       -- Other orders (OO)
              FROM OrderItem
             WHERE OrderID != <specified order ID>
             GROUP BY OrderID
           ),
     CI AS (SELECT I1.OrderID AS ID, COUNT(*) AS NC    -- Common Items (CI)
              FROM OrderItem AS I1
              JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
             WHERE I1.OrderID != <specified order ID>
             GROUP BY I1.OrderID
           )
SELECT OrderID_1, NS, OrderID_2, NL, NC,
        CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity
  FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
          FROM SO AS v1
          JOIN OO AS v2 ON v1.NA <= v2.NB
          JOIN CI AS v3 ON v3.ID  = v2.ID
        UNION
        SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
          FROM SO AS v1
          JOIN OO AS v2 ON v1.NA  > v2.NB
          JOIN CI AS v3 ON v3.ID  = v1.ID
       ) AS u
 WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F

版本2:避免重复指定的订单号

WITH SO AS (SELECT OrderID AS ID, COUNT(*) AS NA       -- Specified Order (SO)
              FROM OrderItem
             WHERE OrderID = <specified order ID>
             GROUP BY OrderID
           ),
     OO AS (SELECT OI.OrderID AS ID, COUNT(*) AS NB    -- Other orders (OO)
              FROM OrderItem AS OI
              JOIN SO ON OI.OrderID != SO.ID
             GROUP BY OI.OrderID
           ),
     CI AS (SELECT I1.OrderID AS ID, COUNT(*) AS NC    -- Common Items (CI)
              FROM OrderItem AS I1
              JOIN SO AS S1 ON I1.OrderID != S1.ID
              JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID
              JOIN SO AS S2 ON I2.OrderID  = S2.ID
             GROUP BY I1.OrderID
           )
SELECT OrderID_1, NS, OrderID_2, NL, NC,
        CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity
  FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
          FROM SO AS v1
          JOIN OO AS v2 ON v1.NA <= v2.NB
          JOIN CI AS v3 ON v3.ID  = v2.ID
        UNION
        SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
          FROM SO AS v1
          JOIN OO AS v2 ON v1.NA  > v2.NB
          JOIN CI AS v3 ON v3.ID  = v1.ID
       ) AS u
 WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F

这些都不容易读;两者都比大写CTE的大型SELECT容易。


最少的测试数据

这不足以进行良好的测试。它给出了一点点信心(并且确实显示了“相同顺序”查询的问题。

CREATE TABLE Order (ID SERIAL NOT NULL PRIMARY KEY);
CREATE TABLE Item  (ID SERIAL NOT NULL PRIMARY KEY);
CREATE TABLE OrderItem
(
    OrderID INTEGER NOT NULL REFERENCES Order,
    ItemID INTEGER NOT NULL REFERENCES Item,
    Quantity DECIMAL(8,2) NOT NULL
);

INSERT INTO Order VALUES(1);
INSERT INTO Order VALUES(2);
INSERT INTO Order VALUES(3);
INSERT INTO Order VALUES(4);
INSERT INTO Order VALUES(5);
INSERT INTO Order VALUES(6);
INSERT INTO Order VALUES(7);

INSERT INTO Item VALUES(111);
INSERT INTO Item VALUES(222);
INSERT INTO Item VALUES(333);
INSERT INTO Item VALUES(444);
INSERT INTO Item VALUES(555);
INSERT INTO Item VALUES(666);
INSERT INTO Item VALUES(777);
INSERT INTO Item VALUES(888);
INSERT INTO Item VALUES(999);

INSERT INTO OrderItem VALUES(1, 111, 1);
INSERT INTO OrderItem VALUES(1, 222, 1);
INSERT INTO OrderItem VALUES(1, 333, 1);
INSERT INTO OrderItem VALUES(1, 555, 1);

INSERT INTO OrderItem VALUES(2, 111, 1);
INSERT INTO OrderItem VALUES(2, 222, 1);
INSERT INTO OrderItem VALUES(2, 333, 1);
INSERT INTO OrderItem VALUES(2, 555, 1);

INSERT INTO OrderItem VALUES(3, 111, 1);
INSERT INTO OrderItem VALUES(3, 222, 1);
INSERT INTO OrderItem VALUES(3, 333, 1);
INSERT INTO OrderItem VALUES(3, 444, 1);
INSERT INTO OrderItem VALUES(3, 555, 1);
INSERT INTO OrderItem VALUES(3, 666, 1);

INSERT INTO OrderItem VALUES(4, 111, 1);
INSERT INTO OrderItem VALUES(4, 222, 1);
INSERT INTO OrderItem VALUES(4, 333, 1);
INSERT INTO OrderItem VALUES(4, 444, 1);
INSERT INTO OrderItem VALUES(4, 555, 1);
INSERT INTO OrderItem VALUES(4, 777, 1);

INSERT INTO OrderItem VALUES(5, 111, 1);
INSERT INTO OrderItem VALUES(5, 222, 1);
INSERT INTO OrderItem VALUES(5, 333, 1);
INSERT INTO OrderItem VALUES(5, 444, 1);
INSERT INTO OrderItem VALUES(5, 555, 1);
INSERT INTO OrderItem VALUES(5, 777, 1);
INSERT INTO OrderItem VALUES(5, 999, 1);

INSERT INTO OrderItem VALUES(6, 111, 1);
INSERT INTO OrderItem VALUES(6, 222, 1);
INSERT INTO OrderItem VALUES(6, 333, 1);
INSERT INTO OrderItem VALUES(6, 444, 1);
INSERT INTO OrderItem VALUES(6, 555, 1);
INSERT INTO OrderItem VALUES(6, 777, 1);
INSERT INTO OrderItem VALUES(6, 888, 1);
INSERT INTO OrderItem VALUES(6, 999, 1);

INSERT INTO OrderItem VALUES(7, 111, 1);
INSERT INTO OrderItem VALUES(7, 222, 1);
INSERT INTO OrderItem VALUES(7, 333, 1);
INSERT INTO OrderItem VALUES(7, 444, 1);
INSERT INTO OrderItem VALUES(7, 555, 1);
INSERT INTO OrderItem VALUES(7, 777, 1);
INSERT INTO OrderItem VALUES(7, 888, 1);
INSERT INTO OrderItem VALUES(7, 999, 1);
INSERT INTO OrderItem VALUES(7, 666, 1);
2020-07-28