The Global Historical Climatology Network has flagged invalid or erroneous data in its collection of weather measurements. After removing these elements, there are swaths of data that no longer have contiguous dated sections. The data resembles:
"2007-12-01";14 -- Start of December
"2007-12-29";8
"2007-12-30";11
"2007-12-31";7
"2008-01-01";8 -- Start of January
"2008-01-02";12
"2008-01-29";0
"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
"2008-03-01";14 -- Start of March
"2008-03-02";17
"2008-03-05";17
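Although it is possible to extrapolate the missing data (for example, by averaging from other years) to provide contiguous ranges, to simplify the system I want to flag the non-contiguous segments, based on whether a month is filled by a contiguous range of dates: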
D;"2007-12-01";14 -- Start of December
D;"2007-12-29";8
D;"2007-12-30";11
D;"2007-12-31";7
D;"2008-01-01";8 -- Start of January
D;"2008-01-02";12
D;"2008-01-29";0
D;"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
D;"2008-03-01";14 -- Start of March
D;"2008-03-02";17
D;"2008-03-05";17
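Some measurements were taken as early as 1843.
For all weather stations, how would you flag all the days in any month that is missing one or more days?
The code to select the data resembles: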
select m.id, m.taken, m.station_id, m.amount
from climate.measurement m
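One idea: generate a table of contiguous dates and compare them with the dates of the measured data.
The problem can be recreated using the SQL in this section.
The table is created as follows: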
CREATE TABLE climate.calendar (
  id serial NOT NULL,
  n character varying(2) NOT NULL,
  d date NOT NULL,
  "valid" boolean NOT NULL DEFAULT true,
  CONSTRAINT calendar_pk PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);
The following SQL inserts data into the table (columns: id [int], n [varchar, the station name], d [date], valid [boolean]):
insert into climate.calendar (n, d)
  select 'A', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
  from generate_series(0, date('2011-04-9') - date('1982-01-1')) n;
insert into climate.calendar (n, d)
  select 'B', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
  from generate_series(0, date('2011-04-9') - date('1982-01-1')) n;
insert into climate.calendar (n, d)
  select 'C', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
  from generate_series(0, date('2011-04-9') - date('1982-01-1')) n;
insert into climate.calendar (n, d)
  select 'D', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
  from generate_series(0, date('2011-04-9') - date('1982-01-1')) n;
insert into climate.calendar (n, d)
  select 'E', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
  from generate_series(0, date('2011-04-9') - date('1982-01-1')) n;
insert into climate.calendar (n, d)
  select 'F', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
  from generate_series(0, date('2011-04-9') - date('1982-01-1')) n;
The values 'A' through 'F' represent the names of the weather stations that took a measurement on a particular day.
Remove some rows at random, as follows:
delete from climate.calendar where id in (select id from climate.calendar order by random() limit 5000);
The following does not set the valid flag to false for all days in months that are missing one or more days:
UPDATE climate.calendar
SET valid = false
WHERE date_trunc('month', d) IN (
  SELECT DISTINCT date_trunc('month', d)
  FROM climate.calendar A
  WHERE NOT EXISTS (
    SELECT 1
    FROM climate.calendar B
    WHERE A.d - 1 = B.d
  )
);
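One likely reason the UPDATE misses gaps is that its subquery looks for a previous day across all stations rather than within the same station. The Python sketch below mimics that check outside the database on a small, made-up data set; the function names and the two-station sample are assumptions for illustration, not part of the original schema.

```python
from datetime import date, timedelta

def months_flagged_without_station(rows):
    """Mimic the UPDATE's subquery: find dates with no row dated one day
    earlier, looking across ALL stations, and return their (year, month)."""
    all_dates = {d for _station, d in rows}
    return {
        (d.year, d.month)
        for d in all_dates
        if d - timedelta(days=1) not in all_dates
    }

def days(first, last):
    """Every date from first to last, inclusive."""
    return [first + timedelta(days=n) for n in range((last - first).days + 1)]

# Hypothetical data: station B is complete, station A is missing 2008-02-15.
span = days(date(2008, 1, 1), date(2008, 2, 29))
rows = [("B", d) for d in span]
rows += [("A", d) for d in span if d != date(2008, 2, 15)]

print(months_flagged_without_station(rows))  # {(2008, 1)}
```

January is flagged only because the series starts there, while station A's gap in February goes unnoticed: station B supplies a row for 2008-02-15, so the "previous day exists" test passes.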
The following SQL produces an empty result set:
with gen_calendar as (
  select (date('1982-01-1') + (n || ' days')::interval)::date cal_date
  from generate_series(0, date('2011-04-9') - date('1982-01-1')) n
)
select gc.cal_date
from gen_calendar gc
left join climate.calendar c on c.d = gc.cal_date
where c.d is null;
The following SQL generates all possible combinations of station names and dates:
select distinct( cc.n ), t.d
from climate.calendar cc, (
  select (date('1982-01-1') + (n || ' days')::interval)::date d
  from generate_series(0, date('2011-04-9') - date('1982-01-1')) n
) t
order by cc.n
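However, the real data has hundreds of stations and dates going back to the mid-1800s, so the Cartesian product of all dates for all stations is too large. Given enough time, such an approach might work, but there must be a faster way.
PostgreSQL has windowing functions.
Related question: How to select specific changes using windowing functions in Postgres
Thank you!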
PostgreSQL's generate_series() function can be used to build a contiguous list of dates:
with calendar as (
  select ((select min(date) from test)::date + (n || ' days')::interval)::date cal_date
  from generate_series(0, (select max(date) - min(date) from test)) n
)
select cal_date
from calendar c
left join test t on t.date = c.cal_date
where t.date is null;
The expression select max(date) - min(date) from test might be off by one.
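One way to check the off-by-one concern is to recall that generate_series(0, k) includes both 0 and k, so a difference of k days yields k + 1 dates. The Python sketch below mirrors that arithmetic outside the database, using the first and last dates from the sample data in the question; the helper name is made up for illustration.

```python
from datetime import date, timedelta
from typing import List

def inclusive_dates(first: date, last: date) -> List[date]:
    """Mirror generate_series(0, last - first): offsets 0..k inclusive."""
    k = (last - first).days
    return [first + timedelta(days=n) for n in range(k + 1)]

dates = inclusive_dates(date(2007, 12, 1), date(2008, 3, 5))
print(len(dates))           # 96: the 95-day difference plus one
print(dates[0], dates[-1])  # 2007-12-01 2008-03-05
```

If the series is generated this way, both the earliest and the latest date are covered, which suggests the expression needs no adjustment; it is still worth confirming against the real table.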
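One way to identify the invalid months is to create two views. The first counts the number of daily readings each station should produce in each month. (Note that climate.calendar is renamed to climate_calendar in the queries below.) The second returns the actual number of daily readings each station produced per month.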
This view returns the actual number of days in each month, per station. (For example, February will always have either 28 or 29 days.)
create view count_max_station_calendar_days as
with calendar as (
  select ((select min(d) from climate_calendar)::date + (n || ' days')::interval)::date cal_date
  from generate_series(0, (select max(d) - min(d) from climate_calendar)) n
)
select n, extract(year from cal_date) yr, extract(month from cal_date) mo, count(*) num_days
from stations cross join calendar
group by n, yr, mo
order by n, yr, mo;
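This view returns the number of days actually recorded, which can be lower than the tallies from the first view. (For example, January will always have 31 days or fewer.)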
create view count_actual_station_calendar_days as
select n, extract(year from d) yr, extract(month from d) mo, count(*) num_days
from climate_calendar
group by n, yr, mo
order by n, yr, mo;
Remove the ORDER BY clauses in production (they are helpful during development).
Join the two views in a new view to identify the stations and months that need to be flagged:
create view invalid_station_months as
select m.n, m.yr, m.mo, m.num_days - a.num_days num_days_missing
from count_max_station_calendar_days m
inner join count_actual_station_calendar_days a
  on (m.n = a.n and m.yr = a.yr and m.mo = a.mo and m.num_days <> a.num_days);

n  yr    mo  num_days_missing
--
A  1982  1   1
E  2007  3   1
The num_days_missing column is not necessary, but it is useful.
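To sanity-check the expected-versus-actual comparison outside the database, the logic can be sketched in Python. This is an approximation under stated assumptions rather than a translation of the views: it takes the expected count from the full calendar month via calendar.monthrange, whereas the first view only counts days between the earliest and latest dates in the table. The function name, the station label "A", and the filled-in February dates are made up for illustration.

```python
import calendar
from collections import Counter
from datetime import date

def invalid_station_months(readings):
    """readings: iterable of (station, date) pairs.
    Returns {(station, year, month): days_missing} for incomplete months.
    Like the inner join, months with no readings at all are not reported."""
    actual = Counter((n, d.year, d.month) for n, d in set(readings))
    missing = {}
    for (n, yr, mo), have in actual.items():
        expected = calendar.monthrange(yr, mo)[1]  # days in that month
        if have != expected:
            missing[(n, yr, mo)] = expected - have
    return missing

# Dates shaped like the sample in the question, assigned to a hypothetical
# station "A"; February 2008 is generated as a complete month.
sample = [("A", date(2007, 12, d)) for d in (1, 29, 30, 31)]
sample += [("A", date(2008, 1, d)) for d in (1, 2, 29, 31)]
sample += [("A", date(2008, 2, d)) for d in range(1, 30)]
sample += [("A", date(2008, 3, d)) for d in (1, 2, 5)]

print(sorted(invalid_station_months(sample).items()))
# [(('A', 2007, 12), 27), (('A', 2008, 1), 27), (('A', 2008, 3), 28)]
```

December, January, and March are reported as incomplete while February is not, which matches the D flags in the question's example.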
These are the rows that need to be updated:
select cc.*
from climate_calendar cc
inner join invalid_station_months im
  on (cc.n = im.n
      and extract(year from cc.d) = im.yr
      and extract(month from cc.d) = im.mo)
where valid = true
To update them, the id key is handy.
update climate_calendar
set valid = false
where id in (
  select id
  from climate_calendar cc
  inner join invalid_station_months im
    on (cc.n = im.n
        and extract(year from cc.d) = im.yr
        and extract(month from cc.d) = im.mo)
  where valid = true
);