Comparing Data of Two Oracle Tables Using MD5 Hash

Jaromir D.B. Nemec

 

Compare Use Case

 

In relational database business there is offen a need to compare data of two tables with the same structure so see if they are identical or not. This could be a validation, whether two tables from different instancies are in sync or a decision of a JUnit test whether the expected results were produced.

In most cases the check is only a binary one, so the answer is yes if the two tables are identical or no if there is a difference.

 

This use case is well known and often practiced, either with the set operators MINUS and UNION using the "magic" formula (A-B) + (B-A) or

grouping the UNION of both tables on the primary key and testing if there is a group HAVING different counts from two sources [1].

There is a hidden caveat while using the set operation if the table doesn't have a primary key as illustrated in the Appendix,

but this is not the thema of this post. Our scope is to use an alternative approach for the test of the identity using the hashing concept. The main goal it to omit the expesive sorting operation which is the working horse or the set operations.

 

Hashing a File

 

To summarize a state of a file often a hash function is used. Let's illustrate this using the MD5 hash algorithmus, that is our choice for this post.

 

$ md5sum *.txt

3dbec9c1b92200eb56349835275e00b9 *foo.txt

a94ef40313461e1de3938994cce447e6 *HelloWorld.txt

a94ef40313461e1de3938994cce447e6 *OtherFile.txt

 

Event without the inspection of the content of the three file we see that the foo.txt andHelloWorld.txt are different. On the other side

the files HelloWorld.txt and OtherFile.txt produces identical hash, which indicates they are identical as the hash value is the same. Yes we all know that each hash algorithm

can produce collisions (i.e. producing identical hash code for different sources, but we neglect this for our purpose - see [2] how to address this problem if relevant).

See [3] for more information about hash functions in generall.

 

$ cat foo.txt

foo

bar

 

$ cat HelloWorld.txt

Hello

world

 

$ cat OtherFile.txt

Hello

world

 

Hashing a Database Table

 

Quick research provides that Oracle can calculate MD5 hash out of the box. Dependent on the version, in 11g with the package dbms_obfuscation_toolkit.md5,

in 12c using the standard_hash as illustrated below.

 

select lower(standard_hash('foo'||chr(13)||chr(10)||'bar'||chr(13)||chr(10), 'MD5')) from dual;

 

3dbec9c1b92200eb56349835275e00b9

So let's try to calculate the hash in the database and see if get the same result as from the file.

Note that to deal with the same data we had to append the line delimiter characters at the end of each string.

We'll set up a table containing the file lines in a column TXT.

Note that the line delimiters is operating system dependent, we use the Windows CR,LF characters.

 

create table foo as

select 1 id, 'foo'||chr(13)||chr(10) txt from dual union all

select 2 id, 'bar'||chr(13)||chr(10) txtfrom dual;

††††††††

We'll use the LISTAGG function to concatenate the string and calculate the hash.

 

with md5 as (

select LISTAGG(txt) WITHIN GROUP (ORDER BY id)†† txt

from foo)

select lower(standard_hash(txt, 'MD5')) md5from md5;

 

MD5†††††††††††††††††††††††††††

--------------------------------

3dbec9c1b92200eb56349835275e00b9

 

So we got the expected result and therefore we found a way to compare the contents of two table columns and decide whether they are identical or not.

Are we ready?

 

Not really there are two problems still to be approached. One obvious and one subtle.

 

The obvious problem with this approach is the length of the concatenated string. You will quickly get ORA-01489 if the result of the concatenation exceeds 4000 bytes.

The subtle problem lies in the order. Note that we had to sort the rows based on ID to receive the same order as the lines in the file. Changing the order

produces an different hash code as show below.

 

-- sorting on ID DESC

with md5 as (

select LISTAGG(txt) WITHIN GROUP (ORDER BY id DESC)txt

from foo)

select lower(standard_hash(txt, 'MD5')) md5from md5

 

MD5†††††††††††††††††††††††††††

--------------------------------

d2bfea7dbd26ef2c4b3e378e3a10daa0

 

This is in context of files perfectly meaningful - a foo, bar file is very different from a file containing bar and foo lines. But in the context of relational

databases the data are considered as sets, so the order should play no role.

We will approach both problems, letís start with the length limitations first.

 

Incremental Hash Calculation

 

Letís return back to the calculation of the hash code for a file. The same problem with data length must exist there as well. How it is possible to calculate hash code for a file that is larger than the main memory?

 

The answer is the incremental calculation. The hash is not calculated for the whole file, but for each line of the file and the

result is combined with the previous result. After processing the last line the hash code is returned.

 

Let's illustrate the calculation of Java MD5 Hash using a Groovy script:

 

MessageDigest digest = MessageDigest.getInstance("MD5")

byte[] md5hash

 

groovyCon.eachRow ('select txt from foo order by id')

†††† {

††††††† digest.update(it.txt.getBytes(StandardCharsets.UTF_8))

†††† }††

// Terminate

md5hash = digest.digest();

println md5hash.encodeHex().toString()

In the first line the instance of the MessageDigest for MD5 hash is created.

Groovy enables with eachRow elegant looping through the result of a SQL query.

In the loop for each column value (it.txt) the binary value is taken and combined with the previous result.

At the end of the loop with the call of the method digest() the final value of the hash is calculated and we can print it out.

 

So it would be possible to make a similar processing to work around the ORA-01489 limitation, but before we do so letís approach the second problem the order of the data.

 

Combine Unordered

 

Ordering is not a cheap operation it cost us at least O(n * log(n)) 1). Furthermore we actually do not want to order the data, the order plays definitively no role for the compare.

 

How to approach this problem? Let us once again inspire with the Java hashing algorithm. Java calculates hash function for all objects inclusive sets. So how is it performed?Quick research provides the answer, the hash code is calculated for each

element and combined with the previous result in an order independent way usingeither XOR or SUM. (Note that A XOR B is equal B XOR A, so the order of the lines is irrelevant).

 

Letís demonstrate this approach in the query below. We use XOR to combine the line hashes (BIT_XOR)

 

select lower(UTL_RAW.BIT_XOR(

(select standard_hash(txt, 'MD5') md5 from foo where id = 1),

(selectstandard_hash(txt, 'MD5') md5 from foo where id = 2 ))) MD5_XOR from dual;

 

MD5_XOR††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††† ††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††††

-------------------------------

173f1f8f85f1a154044b7629a23e949c

 

If you do not trust in the order independence re-run the query while swapping the ID and confirm the identical result.

 

 

Aggregate Function

 

So now we are very close to the final solution, the XOR combination is a win win solution getting rid of the length limitation and providing order independence (and therefore a performant O(N) algorithm).

 

The missing link is to build an aggregate function (letís call it MD5_XOR) that will calculate the combined MD5 has for a table column.

 

select MD5_XOR(txt) from foo

 

This is an easy exercise - see sources in [5]. We even do not need Java for the implementation. PL/SQL aggregate function can be used.

The only thing to work around is the absence of standard_hash in PL/SQL, the select from DUAL trick is the solution.

 

Hashing the Whole Table

How to handle a table with more than one column with different data types? Simple convert all columns to character format and concatenate them.

 

select MD5_XOR(to_char(id)||COL_TXT|| to_char(COL_DATE,'dd.mm.yyyy hh24:mi:ss')) md5from†† hash_test_a;

 

For extreme fat tables with lot of long columns you will need to apply the XOR trick once again. Calculate the hash for each column and merge the results using XOR. Here is the idea:

 

select UTL_RAW.BIT_XOR(UTL_RAW.BIT_XOR(MD5_XOR(to_char(id)), MD5_XOR(COL_TXT)), MD5_XOR(to_char(COL_DATE,'dd.mm.yyyy hh24:mi:ss'))) md5from†† hash_test_a;

 

Similar trick will be required to handle CLOB columns.

 

Some Test

 

Our MD5_XOR aggregate function works fine and theoretically it should by much faster than the approach using GROUP BY or MINUS.

Unfortunately the test for tables in non-trivial size show that this theoretical advantage is more than eliminated with the context switch caused by the user implementation of the aggregate function (see results below). So this concept will work efficiently only with a native Oracle implementation.

 

SQL> create table hash_test_a as

2select

3rownum id,

4'txt value to be compared '||rownum as col_txt,

5DATE '2000-01-01' + rownum/1000 as col_date

6from dual

7connect by level <= 100000;

 

SQL> create table hash_test_b as

2select * from hash_test_a;

 

SQL> select id,

2†††††††††† count(src1) CNT1,

3††††††††††† count(src2) CNT2

4†††††† from

5†††† ( select a.*,

6††††††††††† 1 src1,

7††††††††††† to_number(null) src2

8†††††††† from hash_test_a a

9††††††† union all

10††††† select b.*,

11†††††††††† to_number(null) src1,

12†††††††††† 2 src2

13††††††† from hash_test_b b

14†††† )

15††† group by id

16††† having count(src1) <> count(src2);

 

no rows selected

 

Elapsed: 00:00:00.12

 

SQL>

SQL> select 'HASH_TEST_A' as src, MD5_XOR(col_txt) md5 from hash_test_a

2union all

3select 'HASH_TEST_B' as src, MD5_XOR(col_txt) md5 from hash_test_b

4;

 

SRC

-----------

MD5

--------------------------------------------------------------------------------

 

HASH_TEST_A

C24D465E9FE0A6F6673260A5ED8F9D03

 

HASH_TEST_B

C24D465E9FE0A6F6673260A5ED8F9D03

 

 

Elapsed: 00:00:04.69

 

 

 

 

 

 

 

[1] http://www.oracle.com/technetwork/issue-archive/2005/05-jan/o15asktom-084959.html ††

 

[2] https://www.quora.com/How-can-we-solve-the-problem-of-collision-in-the-hash ††

 

[3] https://en.wikipedia.org/wiki/Hash_function

 

[4] https://stackoverflow.com/questions/47253864/order-independent-hash-in-java

 

[5] MD5_XOR_agg_fun.sql

 

1) This may be lifted in the future, in Oracle Release NNq (q stands for quantum computing;).