Hash Code Based Identity in the Database

 

Jaromír D.B. Němec,
DB-Nemec.com

 

Motivation

 

I once worked on a project with a rather simple task: 1) extract the data from a table, 2) write it to a file, and finally 3) transfer the file to a remote destination. This is not a very challenging problem, but it was extremely important that the data be processed in a reliable way. So I wanted to guard against possible data corruption caused by spooling and/or transferring the file (neither task is backed by a transaction).

How to verify the identity? A commonly used procedure to compare two files is to calculate their hash codes and compare them. I can calculate the hash code of the final file, but how do I calculate the hash code of an Oracle table? This paper describes a solution that enables such hash code based comparisons between a file and a database table, and extends the idea further to an aggregate function enabling the comparison of the contents of two tables.

 

Summary

 

In recent versions, Oracle provides an out-of-the-box function for hash code calculation. The problem is that this function is calculated on a single value and cannot be used for a whole table (e.g. for all values of a column in a table). This paper describes a user-defined aggregate function that extends the hash code calculation to that level.

Possible use cases for a hash code based comparison of two tables are discussed and the performance considerations are outlined.

 

What is a Hash Code?

 

One very well-known use case of a hash code is validating the integrity of a file after a download. Here is an example for a MySQL installation file. By calculating the hash code (using md5sum) we can verify that the file is an exact copy of the original file, i.e. that the file was not corrupted during the download.

 

 

shell> md5sum mysql-standard-5.0.96-linux-i686.tar.gz
aaab65abbec64d5e907dcd41b8699945  mysql-standard-5.0.96-linux-i686.tar.gz
 

Fig 1 Verifying Download – see [MySQL Verifying MD5 Checksum]

 

Here is how it works. Let's create a simple text file containing two lines, foo and bar, and calculate its hash code with md5sum:

 

# cat foo.txt

foo

bar

 

# md5sum foo.txt

f47c75614087a8dd938ba4acff252494  foo.txt

 

The central point of a hash code is that it represents the content of the file. To see whether the file was modified, we do not need to inspect its content; we simply check whether the hash code has changed (or stayed the same).

Let’s modify our file by adding a line and recalculate the hash code.

 

 

# cat foo.txt

foo

bar

foo

# md5sum foo.txt

657263afa14a81f87438ae18bf8be5c8  foo.txt

 

The resulting hash code is different, which indicates a change in the file. So with the hash code we can very efficiently recognise that a file has changed (without examining the file content), but of course we cannot say anything about what was changed.
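The same check can be reproduced with a few lines of Python (a sketch using the standard hashlib module; the first hash value is the one shown above):

```python
import hashlib

# MD5 of the original two-line file content
print(hashlib.md5(b"foo\nbar\n").hexdigest())
# f47c75614087a8dd938ba4acff252494

# MD5 after the third line was appended - a completely different code
print(hashlib.md5(b"foo\nbar\nfoo\n").hexdigest())
```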

 

Is the Hash Code Unique?

 

An important question is whether the hash code identifies a file uniquely. Of course it does not. The hash code is simply a sequence of bits of a specific length (128 bits for MD5). The number of different MD5 hash codes is therefore large but finite, while the number of possible files is unlimited, so there must exist two different files that produce an identical hash code. From a security point of view this is a problem, but for our purposes we can neglect it, as the probability that a collision happens by accident is extremely low. To further reduce the collision probability, we may switch to the SHA family of hash codes with larger lengths (see Oracle Hash Support for examples).
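To get a feeling for how low, a rough estimate can be computed with the well-known birthday bound (a sketch; the one-billion row count is only an assumed example):

```python
# Probability that at least two of n random 128-bit hash codes collide,
# approximated by the birthday bound: n*(n-1)/2 divided by 2^128
n = 10**9  # assume one billion hashed rows
p = n * (n - 1) / 2 / 2**128
print(p)  # on the order of 1e-21
```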

 

Database Identity

 

On some level of abstraction, the common way of comparing two database tables to verify their identity can be described with this formula:

 

(A minus B) union (B minus A) is null

 

The set operations MINUS and UNION are used to check that there is no record in A that does not occur in B, and vice versa.

 

MINUS and UNION are set operations, so they have the same complexity as sorting the data. Therefore, in practice a count check is performed as a first step. If the record counts of the two tables differ, we are done - the tables cannot be identical. Note that COUNT only scans the data and is much cheaper than a sort. Technically, the complexity of COUNT is O(n), while the complexity of a sort is O(n * log(n)).

 

Here is a small example:

 

select count(*) from A;

3

select count(*) from B;

3

 

If the counts are identical, we check the difference in both directions:

 

select col from A

MINUS

select col from B

 

no records selected

 

select col from B

MINUS

select col from A

 

no records selected

 

So, we are done - the tables are identical. Note: if you think the tables are identical under all circumstances in this case, see the discussion in Appendix A.

 

Oracle Hash Support

 

Let's recapitulate the Oracle implementation of hash code calculating functions. We will not consider the DBMS_OBFUSCATION_TOOLKIT package from earlier releases and focus only on the standard_hash function. The standard_hash function receives the hash algorithm name (e.g. MD5) as a parameter and calculates the hash code of a string.

 

Here are some examples:

 

select lower(standard_hash('foo'||chr(10)||'bar'||chr(10), 'MD5')) from dual;

 

f47c75614087a8dd938ba4acff252494

 

Oracle also supports the SHA hash code family:

 

select lower(standard_hash('foo'||chr(10)||'bar'||chr(10), 'SHA1')) from dual;

4e48e2c9a3d2ca8a708cb0cc545700544efb5021

 

select lower(standard_hash('foo'||chr(10)||'bar'||chr(10), 'SHA256')) from dual;

d78931fcf2660108eec0d6674ecb4e02401b5256a6b5ee82527766ef6d198c67

 

select lower(standard_hash('foo'||chr(10)||'bar'||chr(10), 'SHA384')) from dual;

9ba70653308f258bffed0ca6db288e22cb45d45482f61e1470c6a02a6669790f58cc315931f638beebb963c50bcb410d

 

select lower(standard_hash('foo'||chr(10)||'bar'||chr(10), 'SHA512')) from dual;

6315cb20ebb5bbc580429ae5ebf1be3027bc6c92735ceb7a8e39686ccf6e73530e30197f4dac787fed63b714db239c7648370cf88e1023869727e8b7d16daab6

 

Note the increasing hash code length, as this is what addresses the security aspect and the collision probability. The hash code length grows from 128 bits for MD5 up to 512 bits for SHA512.

 

Note that the MD5 hash we calculated is identical to the hash code we produced in our file example. Note also that we passed the whole content of the file (including the line feed characters) to the hash function - otherwise the result would differ. (Caution: line endings are OS dependent.)

 

For small files this concatenation approach works, but we would soon hit the VARCHAR2 length limit. A switch to CLOB is possible, but not practical. In practice, a hash code is never calculated over the whole content at once, but in a per-row (or per-chunk) manner.
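This per-row update is exactly how streaming digest APIs work. A small Python sketch (using hashlib) shows that feeding the content piece by piece yields the same hash code as hashing it in one go:

```python
import hashlib

# one-shot calculation over the whole content
whole = hashlib.md5(b"foo\nbar\n").hexdigest()

# the same digest built incrementally, row by row
digest = hashlib.md5()
for row in (b"foo\n", b"bar\n"):
    digest.update(row)

print(digest.hexdigest() == whole)  # True
```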

 

Calculating the Hash Code while Reading a Table

 

Here we are back at the initial task 1) - calculating the hash code while selecting the data from a table. We will use the Java implementation java.security.MessageDigest - this is how MD5 is calculated in the Java world.

 

Here is the pseudocode. (Actually, this is not pseudocode but a snippet of a Groovy script. In general, Groovy is very useful for investigating Java-based database features such as JDBC, and is an excellent gateway for DBAs and database developers to encounter such 'monsters' as Hibernate or MyBatis.)

 

     import java.security.MessageDigest
     import java.nio.charset.StandardCharsets

     // conn is an open groovy.sql.Sql database connection
     MessageDigest digest = MessageDigest.getInstance("MD5")

     // update the digest with each row returned by the ordered cursor
     conn.eachRow('select txt from MY_TABLE order by id') {
        digest.update(it.txt.getBytes(StandardCharsets.UTF_8))
     }

     // produce the final hash code
     byte[] md5hash = digest.digest()
     println md5hash.encodeHex().toString()

 

The code consists of three parts. The first part is a simple initialization.

The crucial part is the middle step, performed for each column value returned by the cursor. The update method receives the new value (a new line) and updates the state of the digest.

The final step extracts the result value of the digest and produces the hash code.

 

Two Problems

 

The above solution works fine, but it has two main problems. First, it uses Java libraries, so it cannot be written in PL/SQL (which would be much more practical).

Second, the cursor we read from needs to be ordered, otherwise we would get a different hash code. This makes perfect sense for files, which have an implicit row order: two files with reordered lines are considered different. In a database table, by contrast, the order plays no role - reordering the rows does not change the table content.

 

Let's address the second problem first. Again, we get our inspiration from a Java implementation. A Set is a Java collection which is not ordered, so how is a hash code calculated for such a class? After some research (see Stack Overflow Order Independent Hash) I found that an order-independent hash code for a set can be calculated as the XOR of the hash codes of its elements. This makes perfect sense, as the XOR operation is commutative, so the calculation is order independent.

 

Let’s simulate it with a simple query:

 

select lower(UTL_RAW.BIT_XOR(standard_hash('foo', 'MD5'), standard_hash('bar', 'MD5'))) hash_code from dual;

9b0805c206b7ebb8b6b9931d83e9f52a

 

select lower(UTL_RAW.BIT_XOR(standard_hash('bar', 'MD5'), standard_hash('foo', 'MD5'))) hash_code from dual;

9b0805c206b7ebb8b6b9931d83e9f52a

 

Here we first calculate the hash code of each of the two strings and then combine them with the BIT_XOR function. The queries use a different order of the values, but both return the identical result, as expected.

Note that the result of this calculation is not comparable with the md5sum result - we use a completely different combination method (XOR instead of the sequential update).
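The same XOR combination can be sketched outside the database, e.g. in Python (hashlib assumed for the MD5 part); both orders yield the value returned by the queries above:

```python
import hashlib

def md5_xor(values):
    # XOR-combine the per-value MD5 digests; XOR is commutative,
    # so the result does not depend on the order of the values
    acc = 0
    for v in values:
        acc ^= int.from_bytes(hashlib.md5(v.encode()).digest(), "big")
    return format(acc, "032x")

print(md5_xor(["foo", "bar"]))  # 9b0805c206b7ebb8b6b9931d83e9f52a
print(md5_xor(["bar", "foo"]))  # identical result
```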

 

 

PL/SQL Implementation

 

If we use XOR to combine the per-row hash codes, we no longer need the original Java implementation and can switch back to PL/SQL. So the first problem is solved as well.

 

The implementation of a user-defined aggregate function that calculates an MD5 hash code using the XOR combination is given in Appendix B. The function is very straightforward. The only processing logic is in the Iterate function: the hash code of the new value is calculated and then combined with the current value using the BIT_XOR function.

 

Let’s return to our tables from Appendix A and prove that they are different using the hash code.

 

select   lower(MD5_XOR(col)) md5

from A;

acbd18db4cc2f85cedef654fccc4a4d8

 

select   lower(MD5_XOR(col)) md5

from B;

37b51d194a7513e45b56f6524f2d51f2

 

No surprise here, the hash codes are clearly different.
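The aggregate's results can be cross-checked with a small Python sketch of the same XOR combination (hashlib assumed), using the rows of tables A and B from Appendix A:

```python
import hashlib

def md5_xor(values):
    # XOR-combine the per-row MD5 digests, as the MD5_XOR aggregate does
    acc = 0
    for v in values:
        acc ^= int.from_bytes(hashlib.md5(v.encode()).digest(), "big")
    return format(acc, "032x")

print(md5_xor(["foo", "bar", "bar"]))  # acbd18db4cc2f85cedef654fccc4a4d8
print(md5_xor(["foo", "foo", "bar"]))  # 37b51d194a7513e45b56f6524f2d51f2
```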

 

But of course there is always the question: what is it good for? The following case studies illustrate some possible use cases of the hash code aggregate function.

 

Case Study 1 - Observing a Table

 

Imagine there is a table belonging to a system that has long been deprovisioned, so we expect no changes in the table. But we are not really sure - maybe some very infrequent processes are still active and may change the table. How can we efficiently verify whether the table has changed? Of course, we can always copy the table and perform the MINUS based comparison.

An alternative way is to calculate the hash code of the table and store it. The identity check then recalculates the hash code and compares it with the stored value.

 

To calculate the hash code of a table with several columns, we concatenate them prior to the hash code calculation, as shown in the query below:

 

select MD5_XOR(to_char(id)||COL_TXT||to_char(COL_DATE,'dd.mm.yyyy hh24:mi:ss')) md5 from hash_test_a;

 

A possible scenario is that hash code observation narrows a large number of tables down to the small number that must be examined at the detailed data level.

 

Case Study 2 - Comparing Remote Tables

 

Let's imagine we have a replica of some table in a different database connected via a database link. The replication is done by application logic, and there is no transactional guarantee that the two copies are identical.

Again, it is possible to copy the whole table into one database and perform the comparison. But can the hash code provide a more efficient way?

 

Let's assume there is some column that classifies the rows into different groups and may be used as a GROUP BY key. We calculate the hash code for each group in both databases. We may ignore all grouping keys that produce an identical hash code (hopefully the majority) and copy only the data for keys that produced a different hash code. This greatly reduces the amount of data that must be copied over the database link and simplifies the comparison.

 

with c1 as (

select grp_id, MD5_XOR(col1||col2) hash_md5

from cmp1

group by grp_id),

c2 as (

select grp_id, MD5_XOR(col1||col2) hash_md5

from cmp2@remote

group by grp_id),

comp as (

select grp_id, c1.hash_md5 c1_hash_md5, c2.hash_md5 c2_hash_md5

from c1 full outer join c2 using (grp_id))

select GRP_ID, C1_HASH_MD5, C2_HASH_MD5,

case when nvl(C1_HASH_MD5,'ab') !=  nvl(C2_HASH_MD5,'ab')

then 'N' else 'Y' end is_identical

from comp

order by 1;

 

    GRP_ID C1_HASH_MD5                          C2_HASH_MD5                          IS_IDENTICAL

---------- ------------------------------------ ------------------------------------ ------------

         0 49E24489B738E2EDD8B5870C9DD6F69A     49E24489B738E2EDD8B5870C9DD6F69A     Y           

         1 0D08057BAD83FECDF1139C44F7FD3049     BBDF62A9556EA3EC5558921C71953CF0     N           

         2 97C4D33DC611FD68FEE1157FA14AA87F     97C4D33DC611FD68FEE1157FA14AA87F     Y           

         3 7658E1C495C26B43047A6D6DFD3B5487     7658E1C495C26B43047A6D6DFD3B5487     Y           

         4 8D619C8BE12D7E9CA0450E7781F5C6E3     8D619C8BE12D7E9CA0450E7781F5C6E3     Y           

         5 92003DF5DE99F7CA8F6E247C3FEF9EF6     92003DF5DE99F7CA8F6E247C3FEF9EF6     Y           

         6 1DAF3E58FE1FC1BC0AA8613BF549A424     1DAF3E58FE1FC1BC0AA8613BF549A424     Y           

         7 45ACA5382828BE50436B3E254A0E5DD4     45ACA5382828BE50436B3E254A0E5DD4     Y           

         8 C57651D4B5D05BA8C63F20BD24D5B6BD     C57651D4B5D05BA8C63F20BD24D5B6BD     Y           

         9 14A6BF7CDA3B5A83E75AE4DB5CCB5ABD     14A6BF7CDA3B5A83E75AE4DB5CCB5ABD     Y 
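The workflow behind this query can be sketched in Python (a toy model with hypothetical grouped data; hashlib assumed):

```python
import hashlib

def md5_xor(rows):
    # order-independent XOR combination of per-row MD5 hashes
    acc = 0
    for r in rows:
        acc ^= int.from_bytes(hashlib.md5(r.encode()).digest(), "big")
    return format(acc, "032x")

# hypothetical grouped data: local table vs. remote replica
local  = {0: ["a|1", "b|2"], 1: ["c|3", "d|4"]}
remote = {0: ["a|1", "b|2"], 1: ["c|3", "d|X"]}

# only groups whose hash codes differ need to be copied and compared in detail
changed = [g for g in local if md5_xor(local[g]) != md5_xor(remote.get(g, []))]
print(changed)  # [1]
```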

 

 

 

Performance Considerations

 

We already discussed that the algorithmic complexity of the hash code calculation is superior to that of the set operations. This would suggest that our aggregate function outperforms the MINUS / UNION based implementation. Unfortunately, the user-defined implementation is not very efficient (probably due to a large number of context switches), and the expected effect is not observed.

To obtain a truly efficient hash code aggregate function, we would need an Oracle-integrated implementation. Such an implementation would cover all supported hash code types in the same manner.

Usability would be further extended by support for multi-column calculation without the need for explicit concatenation.

 

 

Appendix A

 

Well, I confess this is a prepared example.

 

select * from A;

COL

-----

foo  

bar  

bar

 

select * from B;

COL

-----

foo  

foo  

bar

 

But if the table has no primary key, this "collision" could happen. Admittedly, in a practical case the probability would be about as low as that of a hash code collision.

 

 

Appendix B

 

MD5_XOR_agg_fun.sql

-- MD5 hash  XOR combined aggregate function in PL/SQL

--

CREATE OR REPLACE

TYPE md5_xor_type AS OBJECT

(

   md5Hash RAW(16),

   STATIC FUNCTION

        ODCIAggregateInitialize(sctx IN OUT md5_xor_type )

        RETURN NUMBER,

 

   MEMBER FUNCTION

        ODCIAggregateIterate(self IN OUT md5_xor_type ,

                             VALUE IN VARCHAR2 )

        RETURN NUMBER,

   MEMBER FUNCTION

        ODCIAggregateTerminate(self IN md5_xor_type,

                               returnValue OUT  RAW,

                               flags IN NUMBER)

        RETURN NUMBER,

   MEMBER FUNCTION

        ODCIAggregateMerge(self IN OUT md5_xor_type,

                           ctx2 IN md5_xor_type)

        RETURN NUMBER

);

/

 

 

CREATE OR REPLACE

TYPE BODY md5_xor_type

 IS

 STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT md5_xor_type)

  RETURN NUMBER

  IS

 BEGIN

  sctx := md5_xor_type( NULL );

  RETURN ODCIConst.Success;

 END;

 

 MEMBER FUNCTION ODCIAggregateIterate(self IN OUT md5_xor_type,

                                      VALUE IN VARCHAR2 )

  RETURN NUMBER

  IS

 my_hash RAW(16);

 BEGIN

  BEGIN

     select  standard_hash(VALUE, 'MD5') into my_hash from dual;

  END; -- add exception handling

  IF self.md5Hash IS NULL THEN

     self.md5Hash := my_hash;

  ELSE

     self.md5Hash := UTL_RAW.BIT_XOR(self.md5Hash,my_hash);

  END IF;

  RETURN ODCIConst.Success;

 END;

 MEMBER FUNCTION ODCIAggregateTerminate(self IN md5_xor_type,

                                        returnValue OUT RAW,

                                        flags IN NUMBER)

  RETURN NUMBER

  IS

 BEGIN

  returnValue :=   self.md5Hash;

  RETURN ODCIConst.Success;

 END;

 

 MEMBER FUNCTION ODCIAggregateMerge(self IN OUT md5_xor_type,

                                    ctx2 IN md5_xor_type)

  RETURN NUMBER

  IS 

 BEGIN

  -- BIT_XOR returns NULL if either argument is NULL, so handle empty partial aggregates
  IF self.md5Hash IS NULL THEN
     self.md5Hash := ctx2.md5Hash;
  ELSIF ctx2.md5Hash IS NOT NULL THEN
     self.md5Hash := UTL_RAW.BIT_XOR(self.md5Hash,ctx2.md5Hash);
  END IF;

  RETURN ODCIConst.Success;

 END;

END;

/

 

 

CREATE OR REPLACE

FUNCTION MD5_XOR(input VARCHAR2 )

RETURN RAW

AGGREGATE USING MD5_XOR_TYPE ;

/

-- test

select   MD5_XOR(txt)

from foo;

 


 

References

 

 

Oracle Database Ideas

https://community.oracle.com/ideas/20275

 

DB Nemec Comparing Data of Two Oracle Tables Using MD5 Hash
http://www.db-nemec.com/MD5/CompareTablesUsingMD5Hash.html

 

MySQL Verifying MD5 Checksum

https://docs.oracle.com/cd/E17952_01/mysql-5.0-en/verifying-md5-checksum.html

 

Wikipedia Merkle Trees

https://en.wikipedia.org/wiki/Merkle_tree

 

Stack Overflow Order Independent Hash

https://stackoverflow.com/questions/47253864/order-independent-hash-in-java