Jaromir D.B. Nemec
Recently a question about
(re) sampling of time series data was posted on the oracle-l [1] mailing list.
The question was how to get from a time series data sampled with two seconds
frequency a sample based on half minute. In other words - how to get each 15th
row from the sample? The obvious solution is to assign to each row a sequence
number and to filter based on the MOD function.
SELECT /*+ gather_plan_statistics */ x.col,
x.t_stamp
FROM ( SELECT
col, t_stamp,
row_number() over(order by t_stamp) rn
FROM t
WHERE
t_stamp >= TRUNC( SYSDATE - 30 )) x
WHERE MOD(
x.rn, 15 ) = 1;
A drawback of this solution
is that the whole data must be sorted (here in the row_number function).
Therefore the question was formulated if there is a cheapest solution. One
approach to avoid the costly sorting at all is presented below.
The idea behind is to get the
sequence number (used to filter in MOD function) without actually sorting the data.
We divide the time space in a grid of two seconds intervals and to assign each
timestamp the sequential number of the interval. This approach is illustrated
of the figure below.
00:01
00:03 00:05 00:07
00:09 00:11 00:13
1 2
3 4 5
6 7
The interval number is
calculated from the difference between the timestamp and the starting point
multiplied with the number of bi-seconds per day - 24*60*30.
Here a Select to get the
assignment:
select
col, t_stamp,
((t_stamp -
trunc(sysdate-30)) * (24*60*30) ) biseconds_from_start,
mod(1 +
round( (t_stamp - trunc(sysdate-30)) * (24*60*30)
),15) mod2
from t order by col
;
There are two caveats
associated with this approach. First possibility is that a timestamp is
assigned to a wrong bucket due to calculation inaccuracy; remember that one
second in Oracle is represented with the value
0,00001157407407407407407407407407407407407407.
This is especially dangerous
if the timestamp lies near the end of the intervals – see figure below.
Random flip of the 4th timestamp caused by rounding error
A possibility to workaround
it is to use an offset to shift the timestamp in the middle of the interval:
mod(1 + trunc((t_stamp - (trunc(sysdate-30) -
1/(24*60*60))) * (24*60*30) ),15) mod2
Note, that we use trunc
instead of round, which is more accurate.
The second potential problem
is a slow shift of the timestamp caused by the clock inaccuracy – similar to a
DBMS_JOB started each full hour with interval 60 minutes that after some time
it starts minutes after the full hour. This could be corrected – if
systematically occurring – with adjusting the coefficient of the number of
intervals per day; e.g. (24*60*30 + 1) for
watch running too fast. See example below.
Timestamp gets out of sync due to clock inaccuracy
Anyway the first thing to do
is to check the concept with actual data – something like the following select:
select * from (
select
col, t_stamp,
mod(row_number() over(order by t_stamp),15) mod1,
/* offset 1 second - adjust if required in <0,2) */
mod(1 +
trunc((t_stamp - (trunc(sysdate-30) - 1/(24*60*60))) * (24*60*30) ),15) mod2
from t order by col
)
where mod1 != mod2;
This select compares the two
approaches sort vs. assign to time grid.
If the select return no or
acceptable small number of rows you may use this solution.
SELECT col, t_stamp
FROM t
WHERE t_stamp >= TRUNC( SYSDATE - 30 ) and
/* filter every 15th row using 2 seconds buckets with
1 sec. offset */
mod(1 + trunc((t_stamp - (trunc(sysdate-30) -
1/(24*60*60))) * (24*60*30) ),15) = 1
;
The difference in performance
depends of the volume of the data to be sorted.
Last revision 23.1.2011
Jaromir D.B. Nemec is a freelancer specializing in data warehouse and integration solutions. He can be reached at http://www.db-nemec.com