Sampling Time Series Data

Jaromir D.B. Nemec

Recently a question about (re) sampling of time series data was posted on the oracle-l [1] mailing list. The question was how to get from a time series data sampled with two seconds frequency a sample based on half minute. In other words - how to get each 15^th row from the sample? The obvious solution is to assign to each row a sequence number and to filter based on the MOD function.

SELECT /*+ gather_plan_statistics */ x.col,

x.t_stamp

FROM ( SELECT col, t_stamp,

row_number() over(order by t_stamp) rn

FROM t

WHERE t_stamp >= TRUNC( SYSDATE - 30 )) x

WHERE MOD( x.rn, 15 ) = 1;

A drawback of this solution is that the whole data must be sorted (here in the row_number function). Therefore the question was formulated if there is a cheapest solution. One approach to avoid the costly sorting at all is presented below.

The idea behind is to get the sequence number (used to filter in MOD function) without actually sorting the data. We divide the time space in a grid of two seconds intervals and to assign each timestamp the sequential number of the interval. This approach is illustrated of the figure below.

00:01 00:03 00:05 00:07 00:09 00:11 00:13

1 2 3 4 5 6 7

The interval number is calculated from the difference between the timestamp and the starting point multiplied with the number of bi-seconds per day - 24*60*30.

Here a Select to get the assignment:

select

col, t_stamp,

((t_stamp - trunc(sysdate-30)) * (24*60*30) ) biseconds_from_start,

mod(1 + round( (t_stamp - trunc(sysdate-30)) * (24*60*30) ),15) mod2

from t order by col

;

There are two caveats associated with this approach. First possibility is that a timestamp is assigned to a wrong bucket due to calculation inaccuracy; remember that one second in Oracle is represented with the value 0,00001157407407407407407407407407407407407407.

This is especially dangerous if the timestamp lies near the end of the intervals – see figure below.

Random flip of the 4^th timestamp caused by rounding error

A possibility to workaround it is to use an offset to shift the timestamp in the middle of the interval:

mod(1 + trunc((t_stamp - (trunc(sysdate-30) - 1/(24*60*60))) * (24*60*30) ),15) mod2

Note, that we use trunc instead of round, which is more accurate.

The second potential problem is a slow shift of the timestamp caused by the clock inaccuracy – similar to a DBMS_JOB started each full hour with interval 60 minutes that after some time it starts minutes after the full hour. This could be corrected – if systematically occurring – with adjusting the coefficient of the number of intervals per day; e.g. (24*60*30 + 1) for watch running too fast. See example below.

Timestamp gets out of sync due to clock inaccuracy

Anyway the first thing to do is to check the concept with actual data – something like the following select:

select * from (

select

col, t_stamp,

mod(row_number() over(order by t_stamp),15) mod1,

/* offset 1 second - adjust if required in <0,2) */

mod(1 + trunc((t_stamp - (trunc(sysdate-30) - 1/(24*60*60))) * (24*60*30) ),15) mod2

from t order by col

)

where mod1 != mod2;

This select compares the two approaches sort vs. assign to time grid.

If the select return no or acceptable small number of rows you may use this solution.

SELECT col, t_stamp

FROM t

WHERE t_stamp >= TRUNC( SYSDATE - 30 ) and

/* filter every 15th row using 2 seconds buckets with 1 sec. offset */

mod(1 + trunc((t_stamp - (trunc(sysdate-30) - 1/(24*60*60))) * (24*60*30) ),15) = 1

;

The difference in performance depends of the volume of the data to be sorted.

[1] http://www.freelists.org/post/oracle-l/cheapest-way-to-access-every-15th-row-in-table-ordered-by-timestamp-column

Last revision 23.1.2011

Jaromir D.B. Nemec is a freelancer specializing in data warehouse and integration solutions. He can be reached at http://www.db-nemec.com