by Stanislas Nanchen
Detecting periods with window functions in PostgreSQL
One of our customers was recently confronted with a SQL migration problem of their PostgreSQL database¹. Their application used a table containing the history of positions of objects with one record per day. The table below is a simplified sample of this history. Its meaning is that, for every row, the given object was at the given position on the given day. The primary key of the table is the tuple (Object Id, Day). For example, in the following sample, the object 1 was at position a on the 1st of October 2019.
The following PostgreSQL code reconsitutes the sample table.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
create table object_position ( object_id bigint not null, day date not null, position text not null, primary key (object_id, day) ); insert into object_position values (1, '2019-10-01', 'a'), (1, '2019-10-02', 'a'), (1, '2019-10-03', 'a'), (1, '2019-10-04', 'b'), (1, '2019-10-05', 'a'), (2, '2019-10-01', 'c'), (2, '2019-10-02', 'c'), (2, '2019-10-03', 'c'), (2, '2019-10-04', 'd'), (2, '2019-10-05', 'd'); commit;
In the application of our customer, objects rarely changed place and, with accumulated history, the table had become too large. In approximately one year, the table had accumulated 8.5 millions of rows for 20'000 objects. Therefore, the decision was made to change the representation of this data by storing periods (as closed intervals) for which the positions object did not change. Under this other representation, the data in the previous table would be the following.
|Object Id*||Valid From*||Valid To||Position|
As you can see, the first object changed positions 3 times and the second object 2 times. We have in total 5 periods for these 2 objects. Note that there is no data loss in the second table; the 2 representations contain the same information. The problem is the following: how to migrate the data from the first representation to the second. In this article, we explore a solution consisting of a single SQL query.
Let's assume first that we would have the period information on source table. By reusing the same color for the periods, it would look like the following data.
If we had a
period field on the original table, we could
write a simple aggregation query to migrate the data:
1 2 3 4 5 6
select object_id as object_id, min(day) as valid_from, max(day) as valid_to, min(position) as position from object_position group by object_id, period
Note: By construction, the field
periodhas the following property: for every pair of rows, if the periods are the same for the rows, then the positions are the same as well. Because of that property, it would be possible to add the
positionfield in the
group byclause instead of using the
minaggregate function. However, in the rest of this article, we will compute the period field and we can use the condition
min(position) = max(position)to verify the correctness of the computation.
A way to uniquely identify every period is to use, for each row, the pair consisting of the id of the object and of the date of the next position change (relatively from that row) or 9999-12-31 if it is the last period for the current object. For instance, the period p1 of the previous table can be represented by the pair (1, 2019-10-04) and the period p3 by the pair (1, 9999-12-31). The following table shows the periods for all the rows of the source table.
To solve our migration problem, we must now compute this next date with position change. It is done in
2 steps. First, we use the
lag window function to compute for each row whether an object has changed
position or not: a position for an object has changed if the previous position is different from the current
position. We create the view
object_position_with_change_detection to augment the source table with the
1 2 3 4 5 6 7 8
create view object_position_with_change_detection as select object_id as object_id, day as day, position as position, coalesce(lag(position) over (partition by object_id order by day), '') != position as has_changed from object_position;
object_position_with_change_detection contains the following data (the previous position is included
to illustrate the change detection).
|Object Id*||Day*||Position||Previous||Has changed?|
Based on the view
object_position_with_change_detection, we can now compute the next date with position change
via the following SQL expression.
1 2 3 4 5
min(day) filter (where has_changed) over (partition by object_id order by day rows between current row and unbounded following exclude current row)
It is a complicated window function expression and we analyse it bit by bit. First, on the line 4, we state
that we compute for each object separately and that we consider the rows ordered by day. Secondly, on line 5,
we state that, for each row, the frame (i.e. rows that are considered for the computation) are only the rows that comes after it.
Thirdly, via the filter clause
of line 2, we filter the frame to include only rows that have a position change.
Of the rows that are left, we need the smallest date;
we select it by using the
min function. For instance, for the 3 first rows of the previous table, the
set of rows in the frame consists of the 4th and 5th rows and the smallest date is
2019-10-04. Similarly, for the 4th row, the set of rows in the frame consists only of
the 5th row and the smallest date is the
2019-10-05. Note that the SQL expression yields
NULL if there
are no rows in the frame. We need to use the function
coalesce to complete the expression.
object_position_with_change_date use the previous expression to compute the
next date with position change.
1 2 3 4 5 6 7 8 9 10 11 12 13
create view object_position_with_change_date as select object_id as object_id, day as day, position as position, coalesce( min(day) filter (where has_changed) over (partition by object_id order by day rows between current row and unbounded following exclude current row), '9999-12-31' ) as change_date from object_position_with_change_detection;
Finally, we can write the query for the migration of the data. It is simply the aggregation
over the object ids and change dates. We can check that all periods have been correctly computed
with the equality
min(position) = max(position).
1 2 3 4 5 6 7 8
select object_id as object_id, min(day) as valid_from, max(day) as valid_to, min(position) as position, min(position) = max(position) as is_valid from object_position_with_change_date group by object_id, change_date order by object_id, valid_from;
Using this query, our customer was able to migrate the 8.5 millions of rows of the table with the original representation in approximately 15 minutes. The table with the new representation contains around 400'000 rows which is 20 times smaller.