我在数据集中有针对不同公司的每日时间序列,并使用PostgreSQL。我的目标是排除时间序列太不完整的公司。因此,我想排除所有连续缺失值等于或大于3的公司。此外,我想排除在数据集中的第一个日期和最后一个日期之间缺失值超过50%的所有公司。
我们可以使用以下示例数据:
date company value 2012-01-01 A 5 2012-01-01 B 2 2012-01-02 A NULL 2012-01-02 B 2 2012-01-02 C 4 2012-01-03 A NULL 2012-01-03 B NULL 2012-01-03 C NULL 2012-01-04 A NULL 2012-01-04 B NULL 2012-01-04 C NULL 2012-01-05 A 8 2012-01-05 B 9 2012-01-05 C 3 2012-01-06 A 8 2012-01-06 B 9 2012-01-06 C NULL
因此,必须排除A,因为它的三个连续缺失值之间有一个缺口,而C因为在它的第一个日期与最后一个日期之间有50%以上的缺失值,所以必须将其排除。
结合这个论坛中的其他答案,我组成了以下代码:
添加自动增量主键以标识每一行
CREATE TABLE test AS SELECT * FROM mytable ORDER BY company, date;
CREATE SEQUENCE id_seq; ALTER TABLE test ADD id INT UNIQUE; ALTER TABLE test ALTER COLUMN id SET DEFAULT NEXTVAL(‘id_seq’); UPDATE test SET id = NEXTVAL(‘id_seq’);
ALTER TABLE test ADD PRIMARY KEY (id);
检测时间序列中的间隔
CREATE TABLE to_del AS WITH count3 AS
( SELECT *, COUNT(CASE WHEN value IS NULL THEN 1 END) OVER (PARTITION BY company ORDER BY id ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS cnt FROM test) SELECT company, id FROM count3 WHERE cnt >= 3;
删除mytable中的差距
DELETE FROM mytable WHERE company in (SELECT DISTINCT company FROM to_del);
从时间序列中检测并删除3个或更多连续缺失值的间隙似乎可以实现。但是这种方法非常麻烦。而且我不知道如何将所有遗漏价值超过50%的公司都排除在外。
您能想到一个比我更有效的解决方案(我只是学习使用PostgreSQL),该解决方案还设法排除了价值损失超过50%的公司吗?
我只会创建一个查询:
DELETE FROM mytable WHERE company in ( SELECT Company FROM ( SELECT Company, COUNT(CASE WHEN value IS NULL THEN 1 END) OVER (PARTITION BY company ORDER BY id ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt, COUNT(CASE WHEN value IS NULL THEN 1 END) OVER (PARTITION BY company) / COUNT(*) OVER (PARTITION BY company) As p50 ) alias WHERE cnt >= 3 OR p50 > 0.5 )
(公司+值)列上的复合索引可以帮助获得此查询的最大速度。
编辑
//sqlfiddle.com/#!15/c9bfe/7更改了两件事:-按公司 ORDER BY日期 而不是 ORDER 日期 划分了PARTITION BY id- 显式转换为数字(因为整数已被截断为0): OVER(按公司划分) :: numeric
SELECT company, cnt, p50 FROM ( SELECT company, COUNT(CASE WHEN value IS NULL THEN 1 END) OVER (PARTITION BY company ORDER BY date ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt, SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY company)::numeric / COUNT(*) OVER (PARTITION BY company) As p50 FROM mytable ) alias – WHERE cnt >= 3 OR p50 > 0.5
现在删除查询应该可以工作了:
DELETE FROM mytable WHERE company in ( SELECT company FROM ( SELECT company, COUNT(CASE WHEN value IS NULL THEN 1 END) OVER (PARTITION BY company ORDER BY date ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt, SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY company)::numeric / COUNT(*) OVER (PARTITION BY company) As p50 FROM mytable ) alias WHERE cnt >= 3 OR p50 > 0.5 )