by Stanislas Nanchen
Splitting Strings in Oracle with the Model Clause
In database applications, it can happen that some information get stored as an encoded comma separated string. After some time, it is recognized as an error in the modeling and this information must be split into individual values and inserted into a child table. The problem is to find an efficient way to insert these values into the child table.
In Oracle, it is possible to solve this problem with a hierarchical query or with PL/SQL. In this post, we look at an elegant solution with a query using the model clause¹, a feature introduced in Oracle 10g. With the model clause, it is possible to add rows and columns to the result set of a query and to fill theses generated rows and columns via declarative rules. These rules describe the content of cells from the content of other cells. It is really like having a spreadsheet in SQL.
Lets begin and create a simple table entity
with 2 columns: a primary key, a field for
comma separated values and some dummy data.
1 2 3 4 5 6 7 8 9 10
create table entity ( id number(19) primary key, vals varchar2(2000) ); insert into entity values (1, 'abc,def,ghi'); insert into entity values (2, 'hello'); insert into entity values (3, 'foo,bar'); commit;
Now we want to normalize that information into a entity_val
table.
1 2 3 4 5 6
create table entity_val ( entity_id number(19) not null references entity (id), val varchar2(100) not null, primary key (entity_id, val) );
The solution using the model clause is the following insert as select query.
1 2 3 4 5 6 7 8 9 10 11 12 13 14
insert into entity_val select id, val from entity model partition by (id) dimension by (1 i) measures ( vals vals, regexp_count(vals, ',') + 1 cnt, cast(null as varchar2(100)) val ) rules ( val[for i from 1 to cnt[1] increment 1] = regexp_substr(vals[1], '[^,]+', 1, CV(i)) );
As you can see, the model clause is added at the end of a select sql statement. Here we select everything
from the table entity
and will build model(s) from the content of it.
The model clause allows actually the creation of many independent models by partitioning
the result set returned by the primary sql statement. It is similar to the frames of a window function.
By partitioning by id
, we indicate
an independent model for each row of the table entity
. The measures
are additional computed columns
and they are given dimensions to express their values relatively to other measures thanks to rules that
are declared in the rule
clause. We only need a simple numeric dimension i
that has the start
value of 1
representing the first extracted value. Later, we will extend the dimension i
.
The first measure vals
is the column vals
itself to allow its use in the
rules
².
The second measure cnt
is the number of values that has to be extracted from the column vals
;
as vals
is encoded as comma separated string, it is the number of commas plus one; we extract this information with the function regexp_count
.
The third and last measure val
represents one single
extracted value. Because we will compute the real values for val
with a rule, we just use null
casted
as varchar2
(as we extract individual strings).
The last part of the model clause are the rules that specify how the measures are computed from other measures.
We have a single rule that will fill the measure val
. At the beginning of the computation of the model,
because we have partitioned by id
and have a dimension i
with start value 1
, we have a simple model
for each line of the table entity
with 3 cells: vals[1]
which contains the content of the
column vals
; cnt[1]
containing the number of individual values, and val[1]
containing null
for the
moment.
The model then applies the rule. On the left side of the equality, we have an expression of the type
val[x]
signaling that we want to compute the value of val
at point x
and on the right we have an
expression whose value will be assigned to val[x]
. In our case, x
is the multi-cells expression
for i from 1 to cnt[1] increment 1
, which can be translated as
for all integers between 1
and cnt[1]
. The rule represents the computation of many cells.
As for the moment only val[1]
exists in the model, Oracle will create cnt[1]-1
new cells to
accommodate for val[2]
to val[cnt[1]]
.
The right side of our expression is a call to the regexp_substr
function to extract
the i-th non-comma word from vals[1]
. The model clause function CV(i)
return the current value of i
.
Now we can return to the select
clause of our sql statement: we select the columns id
and val
.
Oracle will return one row for each combination of dimensions in the models. This means that we will have
one row for each extracted value. You can run the following query to see the values
for the dimension i
and for all measures. The measures vals
and cnt
are only computed for i=1
.
1 2 3 4 5 6 7 8 9 10 11 12 13
select id, val from entity model partition by (id) dimension by (1 i) measures ( vals vals, regexp_count(vals, ',') + 1 cnt, cast(null as varchar2(100)) val ) rules ( val[for i from 1 to cnt[1] increment 1] = regexp_substr(vals[1], '[^,]+', 1, CV(i)) );