by Stanislas Nanchen

Detecting periods with window functions in PostgreSQL

One of our customers was recently confronted with a SQL migration problem of their PostgreSQL database¹. Their application used a table containing the history of positions of objects with one record per day. The table below is a simplified sample of this history. Its meaning is that, for every row, the given object was at the given position on the given day. The primary key of the table is the tuple (Object Id, Day). For example, in the following sample, the object 1 was at position a on the 1st of October 2019.

Object Id* Day* Position
1 2019-10-01 a
1 2019-10-02 a
1 2019-10-03 a
1 2019-10-04 b
1 2019-10-05 a
2 2019-10-01 c
2 2019-10-02 c
2 2019-10-03 c
2 2019-10-04 d
2 2019-10-05 d

The following PostgreSQL code reconsitutes the sample table.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
create table object_position
(
    object_id bigint not null,
    day       date   not null,
    position  text   not null,
    primary key (object_id, day)
);

insert into object_position
values (1, '2019-10-01', 'a'),
       (1, '2019-10-02', 'a'),
       (1, '2019-10-03', 'a'),
       (1, '2019-10-04', 'b'),
       (1, '2019-10-05', 'a'),
       (2, '2019-10-01', 'c'),
       (2, '2019-10-02', 'c'),
       (2, '2019-10-03', 'c'),
       (2, '2019-10-04', 'd'),
       (2, '2019-10-05', 'd');

commit;

In the application of our customer, objects rarely changed place and, with accumulated history, the table had become too large. In approximately one year, the table had accumulated 8.5 millions of rows for 20'000 objects. Therefore, the decision was made to change the representation of this data by storing periods (as closed intervals) for which the positions object did not change. Under this other representation, the data in the previous table would be the following.

Object Id* Valid From* Valid To Position
1 2019-10-01 2019-10-03 a
1 2019-10-04 2019-10-04 b
1 2019-10-05 2019-10-05 a
2 2019-10-01 2019-10-03 c
2 2019-10-04 2019-10-05 d

As you can see, the first object changed positions 3 times and the second object 2 times. We have in total 5 periods for these 2 objects. Note that there is no data loss in the second table; the 2 representations contain the same information. The problem is the following: how to migrate the data from the first representation to the second. In this article, we explore a solution consisting of a single SQL query.

Let's assume first that we would have the period information on source table. By reusing the same color for the periods, it would look like the following data.

Object Id Day Position Period
1 2019-10-01 a p1
1 2019-10-02 a p1
1 2019-10-03 a p1
1 2019-10-04 b p2
1 2019-10-05 a p3
2 2019-10-01 c p4
2 2019-10-02 c p4
2 2019-10-03 c p4
2 2019-10-04 d p5
2 2019-10-05 d p5

If we had a period field on the original table, we could write a simple aggregation query to migrate the data:

1
2
3
4
5
6
select object_id     as object_id,
       min(day)      as valid_from,
       max(day)      as valid_to,
       min(position) as position
from object_position
group by object_id, period

Note: By construction, the field period has the following property: for every pair of rows, if the periods are the same for the rows, then the positions are the same as well. Because of that property, it would be possible to add the position field in the group by clause instead of using the min aggregate function. However, in the rest of this article, we will compute the period field and we can use the condition min(position) = max(position) to verify the correctness of the computation.

A way to uniquely identify every period is to use, for each row, the pair consisting of the id of the object and of the date of the next position change (relatively from that row) or 9999-12-31 if it is the last period for the current object. For instance, the period p1 of the previous table can be represented by the pair (1, 2019-10-04) and the period p3 by the pair (1, 9999-12-31). The following table shows the periods for all the rows of the source table.

Object Id Day Position Period
1 2019-10-01 a (1, 2019-10-04)
1 2019-10-02 a (1, 2019-10-04)
1 2019-10-03 a (1, 2019-10-04)
1 2019-10-04 b (1, 2019-10-05)
1 2019-10-05 a (1, 9999-12-31)
2 2019-10-01 c (2, 2019-10-04)
2 2019-10-02 c (2, 2019-10-04)
2 2019-10-03 c (2, 2019-10-04)
2 2019-10-04 d (2, 9999-12-31)
2 2019-10-05 d (2, 9999-12-31)

To solve our migration problem, we must now compute this next date with position change. It is done in 2 steps. First, we use the lag window function to compute for each row whether an object has changed position or not: a position for an object has changed if the previous position is different from the current position. We create the view object_position_with_change_detection to augment the source table with the change information.

1
2
3
4
5
6
7
8
create view object_position_with_change_detection as
select object_id as object_id,
       day       as day,
       position  as position,
       coalesce(lag(position) over (partition by object_id order by day), '')
           != position
                 as has_changed
from object_position;

The view object_position_with_change_detection contains the following data (the previous position is included to illustrate the change detection).

Object Id Day Position Previous Has changed?
1 2019-10-01 a true
1 2019-10-02 a a false
1 2019-10-03 a a false
1 2019-10-04 b a true
1 2019-10-05 a b true
2 2019-10-01 c true
2 2019-10-02 c c false
2 2019-10-03 c c false
2 2019-10-04 d c true
2 2019-10-05 d d false

Based on the view object_position_with_change_detection, we can now compute the next date with position change via the following SQL expression.

1
2
3
4
5
min(day)
	filter (where has_changed)
	over
		(partition by object_id order by day
		rows between current row and unbounded following exclude current row)

It is a complicated window function expression and we analyse it bit by bit. First, on the line 4, we state that we compute for each object separately and that we consider the rows ordered by day. Secondly, on line 5, we state that, for each row, the frame (i.e. rows that are considered for the computation) are only the rows that comes after it. Thirdly, via the filter clause of line 2, we filter the frame to include only rows that have a position change. Of the rows that are left, we need the smallest date; we select it by using the min function. For instance, for the 3 first rows of the previous table, the set of rows in the frame consists of the 4th and 5th rows and the smallest date is the 2019-10-04. Similarly, for the 4th row, the set of rows in the frame consists only of the 5th row and the smallest date is the 2019-10-05. Note that the SQL expression yields NULL if there are no rows in the frame. We need to use the function coalesce to complete the expression.

The view object_position_with_change_date use the previous expression to compute the next date with position change.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
create view object_position_with_change_date as
select object_id as object_id,
       day       as day,
       position  as position,
       coalesce(
           min(day)
               filter (where has_changed)
               over
               (partition by object_id order by day
               rows between current row and unbounded following exclude current row),
           '9999-12-31'
       )     	 as change_date
from object_position_with_change_detection;

Finally, we can write the query for the migration of the data. It is simply the aggregation over the object ids and change dates. We can check that all periods have been correctly computed with the equality min(position) = max(position).

1
2
3
4
5
6
7
8
select object_id                     as object_id,
       min(day)                      as valid_from,
       max(day)                      as valid_to,
       min(position)                 as position,
       min(position) = max(position) as is_valid
from object_position_with_change_date
group by object_id, change_date
order by object_id, valid_from;

Using this query, our customer was able to migrate the 8.5 millions of rows of the table with the original representation in approximately 15 minutes. The table with the new representation contains around 400'000 rows which is 20 times smaller.


¹The examples in this article work with PostgreSQL 11 and later.