一尘不染

检测和删除时间序列中的间隔

sql

我在数据集中有针对不同公司的每日时间序列,并使用PostgreSQL。我的目标是排除时间序列太不完整的公司。因此,我想排除所有连续缺失值等于或大于3的公司。此外,我想排除在数据集中的第一个日期和最后一个日期之间缺失值超过50%的所有公司。

我们可以使用以下示例数据:

date             company    value
2012-01-01       A          5
2012-01-01       B          2
2012-01-02       A          NULL
2012-01-02       B          2
2012-01-02       C          4
2012-01-03       A          NULL
2012-01-03       B          NULL
2012-01-03       C          NULL
2012-01-04       A          NULL
2012-01-04       B          NULL
2012-01-04       C          NULL
2012-01-05       A          8
2012-01-05       B          9
2012-01-05       C          3
2012-01-06       A          8
2012-01-06       B          9
2012-01-06       C          NULL

因此,必须排除A,因为它的三个连续缺失值之间有一个缺口,而C因为在它的第一个日期与最后一个日期之间有50%以上的缺失值,所以必须将其排除。

结合这个论坛中的其他答案,我组成了以下代码:

  1. 添加自动增量主键以标识每一行

    CREATE TABLE test AS SELECT * FROM mytable ORDER BY company, date;
    

    CREATE SEQUENCE id_seq; ALTER TABLE test ADD id INT UNIQUE;
    ALTER TABLE test ALTER COLUMN id SET DEFAULT NEXTVAL(‘id_seq’);
    UPDATE test SET id = NEXTVAL(‘id_seq’);

    ALTER TABLE test ADD PRIMARY KEY (id);

  2. 检测时间序列中的间隔

    CREATE TABLE to_del AS WITH count3 AS
    

    ( SELECT *,
    COUNT(CASE WHEN value IS NULL THEN 1 END)
    OVER (PARTITION BY company ORDER BY id
    ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
    AS cnt FROM test)
    SELECT company, id FROM count3 WHERE cnt >= 3;

  3. 删除mytable中的差距

    DELETE FROM mytable WHERE company in (SELECT DISTINCT company FROM to_del);
    

从时间序列中检测并删除3个或更多连续缺失值的间隙似乎可以实现。但是这种方法非常麻烦。而且我不知道如何将所有遗漏价值超过50%的公司都排除在外。

您能想到一个比我更有效的解决方案(我只是学习使用PostgreSQL),该解决方案还设法排除了价值损失超过50%的公司吗?


阅读 120

收藏
2021-05-16

共1个答案

一尘不染

我只会创建一个查询:

DELETE FROM mytable 
WHERE company in (
  SELECT Company 
  FROM (
    SELECT Company, 
      COUNT(CASE WHEN value IS NULL THEN 1 END) 
         OVER (PARTITION BY company ORDER BY id 
               ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
      COUNT(CASE WHEN value IS NULL THEN 1 END) 
         OVER (PARTITION BY company)
      / 
      COUNT(*) 
         OVER (PARTITION BY company) As p50
  ) alias
  WHERE cnt >= 3 OR p50 > 0.5
)

(公司+值)列上的复合索引可以帮助获得此查询的最大速度。


编辑


上面的查询不起作用, 我已经对其进行了一些更正,这是一个演示:http

//sqlfiddle.com/#!15/c9bfe/7更改了两件事:-按公司 ORDER BY日期 而不是 ORDER 日期
划分了PARTITION BY id- 显式转换为数字(因为整数已被截断为0): OVER(按公司划分) :: numeric

SELECT company, cnt, p50
FROM (
SELECT company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY date
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY company)::numeric
/
COUNT(*)
OVER (PARTITION BY company) As p50
FROM mytable
) alias
– WHERE cnt >= 3 OR p50 > 0.5

现在删除查询应该可以工作了:

DELETE FROM mytable 
WHERE company in (
      SELECT company
      FROM (
        SELECT company, 
          COUNT(CASE WHEN value IS NULL THEN 1 END) 
             OVER (PARTITION BY company ORDER BY date 
                   ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
          SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END) 
             OVER (PARTITION BY company)::numeric
          / 
          COUNT(*) 
             OVER (PARTITION BY company) As p50
        FROM mytable
      ) alias
    WHERE cnt >= 3 OR p50 > 0.5
)
2021-05-16