我需要计算一个日期范围内的滚动总和。为了说明,使用AdventureWorks 示例数据库,以下假设语法将完全满足我的需要:
SELECT TH.ProductID, TH.TransactionDate, TH.ActualCost, RollingSum45 = SUM(TH.ActualCost) OVER ( PARTITION BY TH.ProductID ORDER BY TH.TransactionDate RANGE BETWEEN INTERVAL 45 DAY PRECEDING AND CURRENT ROW) FROM Production.TransactionHistory AS TH ORDER BY TH.ProductID, TH.TransactionDate, TH.ReferenceOrderID;
遗憾的是,RANGE窗框范围当前不允许 SQL Server 中的间隔。
RANGE
我知道我可以使用子查询和常规(非窗口)聚合来编写解决方案:
SELECT TH.ProductID, TH.TransactionDate, TH.ActualCost, RollingSum45 = ( SELECT SUM(TH2.ActualCost) FROM Production.TransactionHistory AS TH2 WHERE TH2.ProductID = TH.ProductID AND TH2.TransactionDate <= TH.TransactionDate AND TH2.TransactionDate >= DATEADD(DAY, -45, TH.TransactionDate) ) FROM Production.TransactionHistory AS TH ORDER BY TH.ProductID, TH.TransactionDate, TH.ReferenceOrderID;
给定以下索引:
CREATE UNIQUE INDEX i ON Production.TransactionHistory (ProductID, TransactionDate, ReferenceOrderID) INCLUDE (ActualCost);
执行计划是:
虽然不是非常低效,但似乎应该可以仅使用 SQL Server 2012、2014 或 2016(到目前为止)支持的窗口聚合和分析函数来表达此查询。
为清楚起见,我正在寻找一种对数据执行单次传递的解决方案。
在 T-SQL 中,这很可能意味着子句将OVER完成工作,而执行计划将包含 Window Spools 和 Window Aggregates。所有使用该OVER子句的语言元素都是公平的游戏。SQLCLR 解决方案是可以接受的,只要它保证产生正确的结果。
OVER
对于 T-SQL 解决方案,执行计划中的哈希、排序和窗口假脱机/聚合越少越好。随意添加索引,但不允许使用单独的结构(例如,没有预先计算的表与触发器保持同步)。允许参考表(数字表、日期表等)
理想情况下,解决方案将以与上述子查询版本相同的顺序产生完全相同的结果,但任何可以说是正确的也是可以接受的。性能始终是一个考虑因素,因此解决方案至少应该是相当有效的。
T-SQL 方法可以概括为以下步骤:
使用SET STATISTICS IO ON,这种方法会报告Table 'TransactionHistory'. Scan count 1, logical reads 484,它确认了桌子上的“单次通过”。供参考,原始循环搜索查询报告Table 'TransactionHistory'. Scan count 113444, logical reads 438366。
SET STATISTICS IO ON
Table 'TransactionHistory'. Scan count 1, logical reads 484
Table 'TransactionHistory'. Scan count 113444, logical reads 438366
据报道SET STATISTICS TIME ON,CPU时间是514ms。2231ms这与原始查询相比是有利的。
SET STATISTICS TIME ON
514ms
2231ms
CLR 快速总结
CLR 总结可以概括为以下步骤:
使用SET STATISTICS IO ON,此方法报告未发生逻辑 I/O!哇,完美的解决方案!(实际上,似乎SET STATISTICS IO并没有报告 CLR 内部发生的 I/O。但是从代码中,很容易看出恰好对表进行了一次扫描,并按照 Paul 建议的索引顺序检索数据。
SET STATISTICS IO
据报道SET STATISTICS TIME ON,CPU时间是现在187ms。所以这是对 T-SQL 方法的相当大的改进。不幸的是,这两种方法的总经过时间非常相似,大约为半秒。但是,基于 CLR 的方法确实必须向控制台输出 113K 行(而按产品/日期分组的 T-SQL 方法只有 52K),所以这就是我转而关注 CPU 时间的原因。
187ms
这种方法的另一大优点是它产生与原始循环/搜索方法完全相同的结果,即使在同一天多次销售产品的情况下,每笔交易也包括一行。(在 AdventureWorks 上,我专门比较了逐行的结果,并确认它们与 Paul 的原始查询相符。)
这种方法的一个缺点,至少在目前的形式中,是它读取内存中的所有数据。然而,已经设计的算法在任何给定时间只严格需要内存中的当前窗口帧,并且可以更新以适用于超出内存的数据集。Paul 在他的回答中通过生成该算法的实现来说明这一点,该算法仅将滑动窗口存储在内存中。这是以授予 CLR 程序集更高权限为代价的,但绝对值得将此解决方案扩展到任意大的数据集。
最初设定
USE AdventureWorks2012 GO -- Create Paul's index CREATE UNIQUE INDEX i ON Production.TransactionHistory (ProductID, TransactionDate, ReferenceOrderID) INCLUDE (ActualCost); GO -- Build calendar table for 2000 ~ 2020 CREATE TABLE dbo.calendar (d DATETIME NOT NULL CONSTRAINT PK_calendar PRIMARY KEY) GO DECLARE @d DATETIME = '1/1/2000' WHILE (@d < '1/1/2021') BEGIN INSERT INTO dbo.calendar (d) VALUES (@d) SELECT @d = DATEADD(DAY, 1, @d) END GO
查询
DECLARE @minAnalysisDate DATE = '2007-09-01', -- Customizable start date depending on business needs @maxAnalysisDate DATE = '2008-09-03' -- Customizable end date depending on business needs SELECT ProductID, TransactionDate, ActualCost, RollingSum45, NumOrders FROM ( SELECT ProductID, TransactionDate, NumOrders, ActualCost, SUM(ActualCost) OVER ( PARTITION BY ProductId ORDER BY TransactionDate ROWS BETWEEN 45 PRECEDING AND CURRENT ROW ) AS RollingSum45 FROM ( -- The full cross-product of products and dates, combined with actual cost information for that product/date SELECT p.ProductID, c.d AS TransactionDate, COUNT(TH.ProductId) AS NumOrders, SUM(TH.ActualCost) AS ActualCost FROM Production.Product p JOIN dbo.calendar c ON c.d BETWEEN @minAnalysisDate AND @maxAnalysisDate LEFT OUTER JOIN Production.TransactionHistory TH ON TH.ProductId = p.productId AND TH.TransactionDate = c.d GROUP BY P.ProductID, c.d ) aggsByDay ) rollingSums WHERE NumOrders > 0 ORDER BY ProductID, TransactionDate -- MAXDOP 1 to avoid parallel scan inflating the scan count OPTION (MAXDOP 1)
执行计划
从执行计划中,我们看到 Paul 提出的原始索引足以让我们执行单次有序扫描Production.TransactionHistory,使用合并连接将交易历史与每个可能的产品/日期组合结合起来。
Production.TransactionHistory
假设
这种方法中有一些重要的假设。我想这将由保罗来决定它们是否可以接受:)
Production.Product
AdventureWorks2012
NumOrders
319
2007-09-05 00:00:00.000
主要功能体
这里没什么可看的;函数主体声明输入(必须匹配相应的 SQL 函数),建立 SQL 连接,并打开 SQLReader。
// SQL CLR function for rolling SUMs on AdventureWorks2012.Production.TransactionHistory [SqlFunction(DataAccess = DataAccessKind.Read, FillRowMethodName = "RollingSum_Fill", TableDefinition = "ProductId INT, TransactionDate DATETIME, ReferenceOrderID INT," + "ActualCost FLOAT, PrevCumulativeSum FLOAT, RollingSum FLOAT")] public static IEnumerable RollingSumTvf(SqlInt32 rollingPeriodDays) { using (var connection = new SqlConnection("context connection=true;")) { connection.Open(); List<TrxnRollingSum> trxns; using (var cmd = connection.CreateCommand()) { //Read the transaction history (note: the order is important!) cmd.CommandText = @"SELECT ProductId, TransactionDate, ReferenceOrderID, CAST(ActualCost AS FLOAT) AS ActualCost FROM Production.TransactionHistory ORDER BY ProductId, TransactionDate"; using (var reader = cmd.ExecuteReader()) { trxns = ComputeRollingSums(reader, rollingPeriodDays.Value); } } return trxns; } }
核心逻辑
我已经分离出主要逻辑,以便更容易关注:
// Given a SqlReader with transaction history data, computes / returns the rolling sums private static List<TrxnRollingSum> ComputeRollingSums(SqlDataReader reader, int rollingPeriodDays) { var startIndexOfRollingPeriod = 0; var rollingSumIndex = 0; var trxns = new List<TrxnRollingSum>(); // Prior to the loop, initialize "next" to be the first transaction var nextTrxn = GetNextTrxn(reader, null); while (nextTrxn != null) { var currTrxn = nextTrxn; nextTrxn = GetNextTrxn(reader, currTrxn); trxns.Add(currTrxn); // If the next transaction is not the same product/date as the current // transaction, we can finalize the rolling sum for the current transaction // and all previous transactions for the same product/date var finalizeRollingSum = nextTrxn == null || (nextTrxn != null && (currTrxn.ProductId != nextTrxn.ProductId || currTrxn.TransactionDate != nextTrxn.TransactionDate)); if (finalizeRollingSum) { // Advance the pointer to the first transaction (for the same product) // that occurs within the rolling period while (startIndexOfRollingPeriod < trxns.Count && trxns[startIndexOfRollingPeriod].TransactionDate < currTrxn.TransactionDate.AddDays(-1 * rollingPeriodDays)) { startIndexOfRollingPeriod++; } // Compute the rolling sum as the cumulative sum (for this product), // minus the cumulative sum for prior to the beginning of the rolling window var sumPriorToWindow = trxns[startIndexOfRollingPeriod].PrevSum; var rollingSum = currTrxn.ActualCost + currTrxn.PrevSum - sumPriorToWindow; // Fill in the rolling sum for all transactions sharing this product/date while (rollingSumIndex < trxns.Count) { trxns[rollingSumIndex++].RollingSum = rollingSum; } } // If this is the last transaction for this product, reset the rolling period if (nextTrxn != null && currTrxn.ProductId != nextTrxn.ProductId) { startIndexOfRollingPeriod = trxns.Count; } } return trxns; }
帮手
以下逻辑可以内联编写,但将它们拆分为自己的方法时更容易阅读。
private static TrxnRollingSum GetNextTrxn(SqlDataReader r, TrxnRollingSum currTrxn) { TrxnRollingSum nextTrxn = null; if (r.Read()) { nextTrxn = new TrxnRollingSum { ProductId = r.GetInt32(0), TransactionDate = r.GetDateTime(1), ReferenceOrderId = r.GetInt32(2), ActualCost = r.GetDouble(3), PrevSum = 0 }; if (currTrxn != null) { nextTrxn.PrevSum = (nextTrxn.ProductId == currTrxn.ProductId) ? currTrxn.PrevSum + currTrxn.ActualCost : 0; } } return nextTrxn; } // Represents the output to be returned // Note that the ReferenceOrderId/PrevSum fields are for debugging only private class TrxnRollingSum { public int ProductId { get; set; } public DateTime TransactionDate { get; set; } public int ReferenceOrderId { get; set; } public double ActualCost { get; set; } public double PrevSum { get; set; } public double RollingSum { get; set; } } // The function that generates the result data for each row // (Such a function is mandatory for SQL CLR table-valued functions) public static void RollingSum_Fill(object trxnWithRollingSumObj, out int productId, out DateTime transactionDate, out int referenceOrderId, out double actualCost, out double prevCumulativeSum, out double rollingSum) { var trxn = (TrxnRollingSum)trxnWithRollingSumObj; productId = trxn.ProductId; transactionDate = trxn.TransactionDate; referenceOrderId = trxn.ReferenceOrderId; actualCost = trxn.ActualCost; prevCumulativeSum = trxn.PrevSum; rollingSum = trxn.RollingSum; }
在 SQL 中将所有内容捆绑在一起
到目前为止,一切都在 C# 中,所以让我们看看实际涉及的 SQL。(或者,您可以使用此部署脚本直接从我的程序集的位创建程序集,而不是自己编译。)
USE AdventureWorks2012; /* GPATTERSON2\SQL2014DEVELOPER */ GO -- Enable CLR EXEC sp_configure 'clr enabled', 1; GO RECONFIGURE; GO -- Create the assembly based on the dll generated by compiling the CLR project -- I've also included the "assembly bits" version that can be run without compiling CREATE ASSEMBLY ClrPlayground -- See http://pastebin.com/dfbv1w3z for a "from assembly bits" version FROM 'C:\FullPathGoesHere\ClrPlayground\bin\Debug\ClrPlayground.dll' WITH PERMISSION_SET = safe; GO --Create a function from the assembly CREATE FUNCTION dbo.RollingSumTvf (@rollingPeriodDays INT) RETURNS TABLE ( ProductId INT, TransactionDate DATETIME, ReferenceOrderID INT, ActualCost FLOAT, PrevCumulativeSum FLOAT, RollingSum FLOAT) -- The function yields rows in order, so let SQL Server know to avoid an extra sort ORDER (ProductID, TransactionDate, ReferenceOrderID) AS EXTERNAL NAME ClrPlayground.UserDefinedFunctions.RollingSumTvf; GO -- Now we can actually use the TVF! SELECT * FROM dbo.RollingSumTvf(45) ORDER BY ProductId, TransactionDate, ReferenceOrderId GO
注意事项
CLR 方法为优化算法提供了更大的灵活性,并且可能由 C# 专家进一步调整。然而,CLR 策略也有缺点。要记住几件事:
TRUSTWORTHY
EXTERNAL_ACCESS
在尝试创造性地思考这个问题一段时间后,我想我也会发布一个相当简单实用的方法,如果它出现在我的日常工作中,我可能会选择解决这个问题。它确实利用了 SQL 2012+ 窗口功能,但不是以问题所希望的开创性方式:
-- Compute all running costs into a #temp table; Note that this query could simply read -- from Production.TransactionHistory, but a CROSS APPLY by product allows the window -- function to be computed independently per product, supporting a parallel query plan SELECT t.* INTO #runningCosts FROM Production.Product p CROSS APPLY ( SELECT t.ProductId, t.TransactionDate, t.ReferenceOrderId, t.ActualCost, -- Running sum of the cost for this product, including all ties on TransactionDate SUM(t.ActualCost) OVER ( ORDER BY t.TransactionDate RANGE UNBOUNDED PRECEDING) AS RunningCost FROM Production.TransactionHistory t WHERE t.ProductId = p.ProductId ) t GO -- Key the table in our output order ALTER TABLE #runningCosts ADD PRIMARY KEY (ProductId, TransactionDate, ReferenceOrderId) GO SELECT r.ProductId, r.TransactionDate, r.ReferenceOrderId, r.ActualCost, -- Cumulative running cost - running cost prior to the sliding window r.RunningCost - ISNULL(w.RunningCost,0) AS RollingSum45 FROM #runningCosts r OUTER APPLY ( -- For each transaction, find the running cost just before the sliding window begins SELECT TOP 1 b.RunningCost FROM #runningCosts b WHERE b.ProductId = r.ProductId AND b.TransactionDate < DATEADD(DAY, -45, r.TransactionDate) ORDER BY b.TransactionDate DESC ) w ORDER BY r.ProductId, r.TransactionDate, r.ReferenceOrderId GO
这实际上产生了一个相当简单的整体查询计划,即使同时查看两个相关的查询计划:
我喜欢这种方法的几个原因:
900ms
2700ms
几个潜在的警告:
这是一个很长的答案,所以我决定在这里添加一个摘要。
ProductIDs
我将使用AdventureWorks2014数据库和 SQL Server Express 2014。
对原始数据库的更改:
[Production].[TransactionHistory].[TransactionDate]
datetime
date
[dbo].[Calendar]
[Production].[TransactionHistory]
.
CREATE TABLE [dbo].[Calendar] ( [dt] [date] NOT NULL, CONSTRAINT [PK_Calendar] PRIMARY KEY CLUSTERED ( [dt] ASC )) CREATE UNIQUE NONCLUSTERED INDEX [i] ON [Production].[TransactionHistory] ( [ProductID] ASC, [TransactionDate] ASC, [ReferenceOrderID] ASC ) INCLUDE ([ActualCost]) -- Init calendar table INSERT INTO dbo.Calendar (dt) SELECT TOP (50000) DATEADD(day, ROW_NUMBER() OVER (ORDER BY s1.[object_id])-1, '2000-01-01') AS dt FROM sys.all_objects AS s1 CROSS JOIN sys.all_objects AS s2 OPTION (MAXDOP 1);
MSDN 关于OVER条款的文章有一个链接,指向Itzik Ben-Gan的一篇关于窗口函数的优秀博客文章。在那篇文章中,他解释了如何工作,和选项OVER之间的区别,并提到了计算一个日期范围内的滚动总和的问题。他提到当前版本的 SQL Server 没有完全实现,也没有实现时间间隔数据类型。他解释了两者之间的区别并给了我一个想法。ROWS``RANGE``RANGE``ROWS``RANGE
ROWS``RANGE``RANGE``ROWS``RANGE
没有间隔和重复的日期
如果TransactionHistory表包含的日期没有间隔且没有重复,则以下查询将产生正确的结果:
TransactionHistory
SELECT TH.ProductID, TH.TransactionDate, TH.ActualCost, RollingSum45 = SUM(TH.ActualCost) OVER ( PARTITION BY TH.ProductID ORDER BY TH.TransactionDate ROWS BETWEEN 45 PRECEDING AND CURRENT ROW) FROM Production.TransactionHistory AS TH ORDER BY TH.ProductID, TH.TransactionDate, TH.ReferenceOrderID;
事实上,一个 45 行的窗口正好可以覆盖 45 天。
有间隔没有重复的日期
不幸的是,我们的数据在日期上有差距。为了解决这个问题,我们可以使用一个Calendar表来生成一组没有间隙的日期,然后将LEFT JOIN原始数据添加到这个集合中并使用相同的查询ROWS BETWEEN 45 PRECEDING AND CURRENT ROW。仅当日期不重复时(在同一日期内ProductID),这才会产生正确的结果。
Calendar
LEFT JOIN
ROWS BETWEEN 45 PRECEDING AND CURRENT ROW
ProductID
有重复的空白日期
不幸的是,我们的数据在日期和日期之间都存在差距,并且日期可以在同一个ProductID. 为了解决这个问题,我们可以通过GROUP原始数据ProductID, TransactionDate生成一组没有重复的日期。然后使用Calendartable 生成一组没有间隙的日期。然后我们可以使用查询 withROWS BETWEEN 45 PRECEDING AND CURRENT ROW来计算 rolling SUM。这将产生正确的结果。请参阅下面查询中的注释。
GROUP
ProductID, TransactionDate
SUM
WITH -- calculate Start/End dates for each product CTE_Products AS ( SELECT TH.ProductID ,MIN(TH.TransactionDate) AS MinDate ,MAX(TH.TransactionDate) AS MaxDate FROM [Production].[TransactionHistory] AS TH GROUP BY TH.ProductID ) -- generate set of dates without gaps for each product ,CTE_ProductsWithDates AS ( SELECT CTE_Products.ProductID, C.dt FROM CTE_Products INNER JOIN dbo.Calendar AS C ON C.dt >= CTE_Products.MinDate AND C.dt <= CTE_Products.MaxDate ) -- generate set of dates without duplicates for each product -- calculate daily cost as well ,CTE_DailyCosts AS ( SELECT TH.ProductID, TH.TransactionDate, SUM(ActualCost) AS DailyActualCost FROM [Production].[TransactionHistory] AS TH GROUP BY TH.ProductID, TH.TransactionDate ) -- calculate rolling sum over 45 days ,CTE_Sum AS ( SELECT CTE_ProductsWithDates.ProductID ,CTE_ProductsWithDates.dt ,CTE_DailyCosts.DailyActualCost ,SUM(CTE_DailyCosts.DailyActualCost) OVER ( PARTITION BY CTE_ProductsWithDates.ProductID ORDER BY CTE_ProductsWithDates.dt ROWS BETWEEN 45 PRECEDING AND CURRENT ROW) AS RollingSum45 FROM CTE_ProductsWithDates LEFT JOIN CTE_DailyCosts ON CTE_DailyCosts.ProductID = CTE_ProductsWithDates.ProductID AND CTE_DailyCosts.TransactionDate = CTE_ProductsWithDates.dt ) -- remove rows that were added by Calendar, which fill the gaps in dates -- add back duplicate dates that were removed by GROUP BY SELECT TH.ProductID ,TH.TransactionDate ,TH.ActualCost ,CTE_Sum.RollingSum45 FROM [Production].[TransactionHistory] AS TH INNER JOIN CTE_Sum ON CTE_Sum.ProductID = TH.ProductID AND CTE_Sum.dt = TH.TransactionDate ORDER BY TH.ProductID ,TH.TransactionDate ,TH.ReferenceOrderID ;
我确认此查询产生的结果与使用子查询的问题的方法相同。
第一个查询使用子查询,第二个 - 这种方法。您可以看到这种方法的持续时间和读取次数要少得多。这种方法中的大部分估计成本是最终的ORDER BY,见下文。
ORDER BY
子查询方法具有嵌套循环和O(n*n)复杂性的简单计划。
O(n*n)
计划这种方法扫描TransactionHistory几次,但没有循环。如您所见,超过 70% 的估计成本Sort用于最终的ORDER BY.
Sort
顶部结果 - subquery,底部 - OVER。
subquery
上面计划中最后的Index Scan, Merge Join and Sort,是由于finallyINNER JOIN与原表的关系,使得最终的结果与子查询的slow approach完全相同。返回的行数与TransactionHistory表中相同。TransactionHistory同一产品在同一天发生多笔交易时存在行。如果可以在结果中只显示每日摘要,则JOIN可以删除这个final,查询变得更简单、更快。上一个计划中的最后一个索引扫描、合并连接和排序被替换为过滤器,它删除了添加的行Calendar。
INNER JOIN
JOIN
WITH -- two scans -- calculate Start/End dates for each product CTE_Products AS ( SELECT TH.ProductID ,MIN(TH.TransactionDate) AS MinDate ,MAX(TH.TransactionDate) AS MaxDate FROM [Production].[TransactionHistory] AS TH GROUP BY TH.ProductID ) -- generate set of dates without gaps for each product ,CTE_ProductsWithDates AS ( SELECT CTE_Products.ProductID, C.dt FROM CTE_Products INNER JOIN dbo.Calendar AS C ON C.dt >= CTE_Products.MinDate AND C.dt <= CTE_Products.MaxDate ) -- generate set of dates without duplicates for each product -- calculate daily cost as well ,CTE_DailyCosts AS ( SELECT TH.ProductID, TH.TransactionDate, SUM(ActualCost) AS DailyActualCost FROM [Production].[TransactionHistory] AS TH GROUP BY TH.ProductID, TH.TransactionDate ) -- calculate rolling sum over 45 days ,CTE_Sum AS ( SELECT CTE_ProductsWithDates.ProductID ,CTE_ProductsWithDates.dt ,CTE_DailyCosts.DailyActualCost ,SUM(CTE_DailyCosts.DailyActualCost) OVER ( PARTITION BY CTE_ProductsWithDates.ProductID ORDER BY CTE_ProductsWithDates.dt ROWS BETWEEN 45 PRECEDING AND CURRENT ROW) AS RollingSum45 FROM CTE_ProductsWithDates LEFT JOIN CTE_DailyCosts ON CTE_DailyCosts.ProductID = CTE_ProductsWithDates.ProductID AND CTE_DailyCosts.TransactionDate = CTE_ProductsWithDates.dt ) -- remove rows that were added by Calendar, which fill the gaps in dates SELECT CTE_Sum.ProductID ,CTE_Sum.dt AS TransactionDate ,CTE_Sum.DailyActualCost ,CTE_Sum.RollingSum45 FROM CTE_Sum WHERE CTE_Sum.DailyActualCost IS NOT NULL ORDER BY CTE_Sum.ProductID ,CTE_Sum.dt ;
仍然,TransactionHistory被扫描两次。需要进行一次额外扫描才能获取每种产品的日期范围。我很想看看它如何与另一种方法进行比较,在这种方法中,我们使用关于 中的全局日期范围的外部知识,以及可以避免额外扫描的TransactionHistory额外表。我从此查询中删除了每天交易数量的计算,以使比较有效。它可以在两个查询中添加,但我想保持简单以便比较。我还不得不使用其他日期,因为我使用的是 2014 版本的数据库。Product``ProductIDs
Product``ProductIDs
DECLARE @minAnalysisDate DATE = '2013-07-31', -- Customizable start date depending on business needs @maxAnalysisDate DATE = '2014-08-03' -- Customizable end date depending on business needs SELECT -- one scan ProductID, TransactionDate, ActualCost, RollingSum45 --, NumOrders FROM ( SELECT ProductID, TransactionDate, --NumOrders, ActualCost, SUM(ActualCost) OVER ( PARTITION BY ProductId ORDER BY TransactionDate ROWS BETWEEN 45 PRECEDING AND CURRENT ROW ) AS RollingSum45 FROM ( -- The full cross-product of products and dates, -- combined with actual cost information for that product/date SELECT p.ProductID, c.dt AS TransactionDate, --COUNT(TH.ProductId) AS NumOrders, SUM(TH.ActualCost) AS ActualCost FROM Production.Product p JOIN dbo.calendar c ON c.dt BETWEEN @minAnalysisDate AND @maxAnalysisDate LEFT OUTER JOIN Production.TransactionHistory TH ON TH.ProductId = p.productId AND TH.TransactionDate = c.dt GROUP BY P.ProductID, c.dt ) aggsByDay ) rollingSums --WHERE NumOrders > 0 WHERE ActualCost IS NOT NULL ORDER BY ProductID, TransactionDate -- MAXDOP 1 to avoid parallel scan inflating the scan count OPTION (MAXDOP 1);
两个查询以相同的顺序返回相同的结果。
比较
这是时间和 IO 统计信息。
两次扫描变体更快一些并且读取次数更少,因为一次扫描变体必须大量使用 Worktable。此外,如您在计划中看到的那样,单次扫描变体生成的行数超出了所需的数量。即使 a没有任何事务,它也会ProductID为表中的每个生成日期。表中有 504 行,但只有 441 个产品在. 此外,它为每个产品生成相同的日期范围,这超出了需要。如果整体历史较长,而每个产品的历史相对较短,那么多余的不需要的行数会更高。Product``ProductID``Product``TransactionHistory``TransactionHistory
Product``ProductID``Product``TransactionHistory``TransactionHistory
另一方面,可以通过在 just 上创建另一个更窄的索引来进一步优化两次扫描变体(ProductID, TransactionDate)。CTE_Products该索引将用于计算每个产品的开始/结束日期(
(ProductID, TransactionDate)
CTE_Products
所以,我们可以选择,要么有一个额外的显式简单扫描,要么有一个隐式的 Worktable。
顺便说一句,如果只有每日摘要的结果是可以的,那么最好创建一个不包含ReferenceOrderID. 它将使用更少的页面 => 更少的 IO。
ReferenceOrderID
CREATE NONCLUSTERED INDEX [i2] ON [Production].[TransactionHistory] ( [ProductID] ASC, [TransactionDate] ASC ) INCLUDE ([ActualCost])
它变成了一个很长的答案,但这里还有一个变体,它再次只返回每日摘要,但它只对数据进行一次扫描,并且不需要有关日期范围或 ProductID 列表的外部知识。它也不做中间排序。整体性能与以前的变体相似,但似乎更差一些。
主要思想是使用一个数字表来生成可以填补日期空白的行。对于每个现有日期LEAD,用于计算间隔的天数,然后用于CROSS APPLY将所需的行数添加到结果集中。起初,我尝试使用永久的数字表。该计划在该表中显示了大量读取,尽管实际持续时间几乎相同,就像我使用CTE.
LEAD
CROSS APPLY
CTE
WITH e1(n) AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 ) -- 10 ,e2(n) AS (SELECT 1 FROM e1 CROSS JOIN e1 AS b) -- 10*10 ,e3(n) AS (SELECT 1 FROM e1 CROSS JOIN e2) -- 10*100 ,CTE_Numbers AS ( SELECT ROW_NUMBER() OVER (ORDER BY n) AS Number FROM e3 ) ,CTE_DailyCosts AS ( SELECT TH.ProductID ,TH.TransactionDate ,SUM(ActualCost) AS DailyActualCost ,ISNULL(DATEDIFF(day, TH.TransactionDate, LEAD(TH.TransactionDate) OVER(PARTITION BY TH.ProductID ORDER BY TH.TransactionDate)), 1) AS DiffDays FROM [Production].[TransactionHistory] AS TH GROUP BY TH.ProductID, TH.TransactionDate ) ,CTE_NoGaps AS ( SELECT CTE_DailyCosts.ProductID ,CTE_DailyCosts.TransactionDate ,CASE WHEN CA.Number = 1 THEN CTE_DailyCosts.DailyActualCost ELSE NULL END AS DailyCost FROM CTE_DailyCosts CROSS APPLY ( SELECT TOP(CTE_DailyCosts.DiffDays) CTE_Numbers.Number FROM CTE_Numbers ORDER BY CTE_Numbers.Number ) AS CA ) ,CTE_Sum AS ( SELECT ProductID ,TransactionDate ,DailyCost ,SUM(DailyCost) OVER ( PARTITION BY ProductID ORDER BY TransactionDate ROWS BETWEEN 45 PRECEDING AND CURRENT ROW) AS RollingSum45 FROM CTE_NoGaps ) SELECT ProductID ,TransactionDate ,DailyCost ,RollingSum45 FROM CTE_Sum WHERE DailyCost IS NOT NULL ORDER BY ProductID ,TransactionDate ;
这个计划“更长”,因为查询使用两个窗口函数(LEAD和SUM)。