Hash Code Based Identity in the Database
Jaromír D.B. Němec, DB-Nemec.com
Motivation
I once worked on a project with a rather simple task: 1) extract the data from a table, 2) write it to a file, and finally 3) transfer the file to a remote destination. This is not a very challenging problem, but it was extremely important that the data be processed in a reliable way, so I wanted to guard against possible data corruption caused by spooling and/or transferring the file (neither of which is a transactionally backed task).
How to Verify the Identity?
A commonly used procedure to compare two files is to calculate their hash codes and compare them. I can calculate the hash code of the final file, but how do I calculate the hash code of an Oracle table? This paper describes a solution that enables such hash code based comparisons between a file and a database table, and adapts the idea further to the creation of an aggregate function enabling the comparison of the content of two tables.
Summary
In recent versions, Oracle provides an out-of-the-box function for hash code calculation. The problem is that this function operates on a single value and cannot be used for a whole table (e.g. for all values of a column in a table). This paper describes a user-defined aggregate function that extends the hash code calculation to that level.
Possible use cases for a hash code based comparison of two tables are discussed and performance considerations are outlined.
What is a Hash Code?
One very well-known use case of hash codes is validating the integrity of a file after a download. Here is an example for a MySQL installation file. By calculating the hash code (using md5sum) we can verify that the file is an exact copy of the original file, i.e. that the file was not corrupted during the download.
shell>md5sum mysql-standard-5.0.96-linux-i686.tar.gz
aaab65abbec64d5e907dcd41b8699945 mysql-standard-5.0.96-linux-i686.tar.gz
Fig 1 Verifying Download – see [MySQL Verifying MD5 Checksum]
Here is how it works. Let's create a simple text file containing two lines, foo and bar. With md5sum we calculate the hash code:
# cat foo.txt
foo
bar
# md5sum foo.txt
f47c75614087a8dd938ba4acff252494 foo.txt
The central point of the hash code is that it represents the content of the file. To see if the file was modified, we do not need to check the content of the file; we simply observe the change (or identity) of the hash code.
Let's modify our file by adding a line and recalculate the hash code.
# cat foo.txt
foo
bar
foo
# md5sum foo.txt
657263afa14a81f87438ae18bf8be5c8 foo.txt
The resulting hash code is different, which indicates a change in the file. So with the hash code we can very efficiently recognise a change in a file (without inspecting the file content), but of course we cannot say anything about what was changed.
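The same check can be reproduced outside the shell. Here is a small Python sketch (using the standard hashlib module) that mirrors the two md5sum calls above:

```python
import hashlib

# content of foo.txt as in the md5sum example (note the trailing newlines)
original = b"foo\nbar\n"
modified = b"foo\nbar\nfoo\n"  # the same file with one line appended

h1 = hashlib.md5(original).hexdigest()
h2 = hashlib.md5(modified).hexdigest()

print(h1)  # f47c75614087a8dd938ba4acff252494, as reported by md5sum
print(h2)  # a different value - the change is detected
```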
Is a Hash Code Unique?
An important question is whether the hash code identifies a file uniquely. Of course not. The hash code is simply a sequence of bits of a specific length (128 bits for MD5). So the number of distinct MD5 hash codes is large, but finite, and there certainly exist more distinct files. Hence there must exist two different files that produce an identical hash code. From a security point of view this is a problem, but for our purposes we can neglect it, as the probability that such a collision happens by accident is extremely low. To further reduce the collision probability, we may switch to the SHA family of hash codes with greater length (see Oracle Hash Support for examples).
Database Identity
On some
level of abstraction we may describe the common way of comparing two database
tables to verify identity with this formula:
(A minus B)
union (B minus A) is null
The set
operation MINUS and UNION are used to check that there is no record in A that
does not occur in B and vice versa.
MINUS and
UNION are set operations, so they have the same complexity as sorting the data.
Therefore, a count check is performed as a first step in practice. If we see
that the count of records of the two tables is different, we are ready - the
tables cannot be identical. Note - COUNT only scans the data and is much
cheaper than a sort. Technically the complexity of the COUNT is O(n), while the complexity of the sort is O(n * log(n)).
Here is a small example:
select count(*) from A;
3
select count(*) from B;
3
In case of identical counts, we check the difference in both directions:
select col from A
MINUS
select col from B

no records selected

select col from B
MINUS
select col from A

no records selected
So we are done - the tables are identical. Note: if you think the tables are in this case identical under all circumstances, see the discussion in Appendix A.
Oracle Hash Support
Let's recapitulate the Oracle implementation of the hash code calculating functions. We will not consider the DBMS_OBFUSCATION_TOOLKIT package from former releases and will focus only on the standard_hash function. The standard_hash function receives the hash algorithm name (e.g. MD5) as a parameter and calculates the hash code of some string.
Here are some examples:
select lower(standard_hash('foo'||chr(10)||'bar'||chr(10), 'MD5')) from dual;
f47c75614087a8dd938ba4acff252494
Oracle also supports the SHA hash code family:
select lower(standard_hash('foo'||chr(10)||'bar'||chr(10), 'SHA1')) from dual;
4e48e2c9a3d2ca8a708cb0cc545700544efb5021
select lower(standard_hash('foo'||chr(10)||'bar'||chr(10), 'SHA256')) from dual;
d78931fcf2660108eec0d6674ecb4e02401b5256a6b5ee82527766ef6d198c67
select lower(standard_hash('foo'||chr(10)||'bar'||chr(10), 'SHA384')) from dual;
9ba70653308f258bffed0ca6db288e22cb45d45482f61e1470c6a02a6669790f58cc315931f638beebb963c50bcb410d
select lower(standard_hash('foo'||chr(10)||'bar'||chr(10), 'SHA512')) from dual;
6315cb20ebb5bbc580429ae5ebf1be3027bc6c92735ceb7a8e39686ccf6e73530e30197f4dac787fed63b714db239c7648370cf88e1023869727e8b7d16daab6
Note the increasing hash code length, as this is what addresses the security aspect and the probability of collisions. The hash code length increases from 128 bits for MD5 up to 512 bits for SHA512.
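The same digest family is available in most languages. As a quick cross-check, a Python sketch with hashlib confirms both the MD5 value above and the growing digest lengths (each hex character encodes 4 bits):

```python
import hashlib

data = b"foo\nbar\n"  # same input as 'foo'||chr(10)||'bar'||chr(10)

for name in ("md5", "sha1", "sha256", "sha384", "sha512"):
    hexdigest = hashlib.new(name, data).hexdigest()
    print(name, len(hexdigest) * 4, "bits:", hexdigest)
```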
Note that the MD5 hash we calculated is identical to the hash code we produced in our file example. Note also that we passed the whole content of the file (including the line-feed characters) to the hash function - otherwise the result would differ. (Caution: line endings are OS dependent.)
For small files this concatenation approach works, but we will soon hit the VARCHAR2 length limit. A switch to CLOB is possible, but not meaningful. The hash code is never calculated for the whole content at once, but in a per-row manner.
Calculating the Hash Code while Reading a Table
Here we are back at the initial task 1) - calculating the hash code while selecting the data from a table. We will use the Java implementation java.security.MessageDigest - this is how MD5 is calculated in the Java world.
Here is the pseudocode. (Actually, this is not pseudocode, but a snippet of a Groovy script. In general, Groovy is very useful when investigating Java based database features such as JDBC and is an excellent gateway for DBAs and database developers to encounter such 'monsters' as Hibernate or MyBatis.)
MessageDigest digest = MessageDigest.getInstance("MD5")
byte[] md5hash
conn.eachRow('select txt from MY_TABLE order by id') {
    digest.update(it.txt.getBytes(StandardCharsets.UTF_8))
}
md5hash = digest.digest()
println md5hash.encodeHex().toString()
The code consists of three parts; the first part is a simple initialization.
The crucial part is the middle step, performed for each column value returned by the cursor. The update method receives the new value (a new line) and updates the value of the digest.
The final step obtains the result value of the digest and produces the hash code.
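The same streaming pattern works in any language with an incremental digest API. A minimal Python sketch (the row values are hard-coded here in place of a cursor) shows that per-row update calls produce the same digest as hashing the concatenated content in one step:

```python
import hashlib

# rows as they would be returned by an ordered cursor
rows = ["foo\n", "bar\n"]

digest = hashlib.md5()            # initialization
for row in rows:                  # one update per fetched row
    digest.update(row.encode("utf-8"))
md5hash = digest.hexdigest()      # final step

# equivalent to hashing the whole concatenated content at once
assert md5hash == hashlib.md5(b"foo\nbar\n").hexdigest()
print(md5hash)  # f47c75614087a8dd938ba4acff252494
```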
Two Problems
The above solution works fine, but has two main problems. First, it uses Java libraries, so it is not possible to use PL/SQL as the language (which would be much more practicable).
Second, the cursor we read from needs to be ordered, otherwise we would get a different hash code. This makes perfect sense for files, which have an implicit row order: two files with reordered lines are considered different. In contrast, for a database table the order does not play a role; reordering the rows of a table does not change the table content.
Let's address the second problem first. Again, we get our inspiration from a Java implementation. A Set is a Java class that is not ordered, so how is the hash code of such a class calculated? After some research I found that the hash code of an unordered collection can be calculated by combining the hash codes of its elements with a commutative operation such as XOR (Java's own Set uses the sum of the element hash codes). This makes perfect sense: as the XOR operation is commutative, the calculation is order independent.
Let's simulate it with a simple query:
select lower(UTL_RAW.BIT_XOR(standard_hash('foo', 'MD5'), standard_hash('bar', 'MD5'))) hash_code from dual;
9b0805c206b7ebb8b6b9931d83e9f52a
select lower(UTL_RAW.BIT_XOR(standard_hash('bar', 'MD5'), standard_hash('foo', 'MD5'))) hash_code from dual;
9b0805c206b7ebb8b6b9931d83e9f52a
Here we first calculate the hash codes for the two strings and then combine them with the BIT_XOR function. The queries use a different order of the values, but both return identical results, as expected.
Note that this approach of calculation is not comparable with the md5sum method - we use a completely different combination method (XOR instead of the sequential update).
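The BIT_XOR simulation above can be checked with a few lines of Python; the XOR combination of the per-value MD5 hashes is the same for both orders:

```python
import hashlib

def md5_raw(s):
    # raw 16-byte MD5 digest of a string
    return hashlib.md5(s.encode()).digest()

def xor(a, b):
    # byte-wise XOR of two equally long byte strings
    return bytes(x ^ y for x, y in zip(a, b))

h1 = xor(md5_raw("foo"), md5_raw("bar"))
h2 = xor(md5_raw("bar"), md5_raw("foo"))

assert h1 == h2  # XOR is commutative, so the order does not matter
print(h1.hex())  # 9b0805c206b7ebb8b6b9931d83e9f52a
```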
PL/SQL Implementation
If we use XOR for the update of our digest, we no longer need the original Java implementation; we can switch back to PL/SQL. So the first problem is solved as well.
The implementation of a user-defined aggregate function that calculates an MD5 hash code using the XOR combination is in Appendix B. The function is very straightforward. The only processing logic is in the Iterate function: the hash code for the new value is calculated and then combined with the current value using the BIT_XOR function.
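The control flow of the aggregate interface can be sketched in a few lines of Python (the class and method names below are hypothetical, chosen only to mirror the ODCI functions of Appendix B): Iterate folds each value into the running hash, and Merge combines two partial contexts, which is what makes an order-independent, parallelizable aggregate possible.

```python
import hashlib

class Md5Xor:
    """Order-independent MD5 aggregate, mirroring md5_xor_type from Appendix B."""

    def __init__(self):                  # ODCIAggregateInitialize
        self.md5_hash = None

    def iterate(self, value):            # ODCIAggregateIterate
        my_hash = hashlib.md5(value.encode()).digest()
        if self.md5_hash is None:
            self.md5_hash = my_hash
        else:
            self.md5_hash = bytes(a ^ b for a, b in zip(self.md5_hash, my_hash))

    def merge(self, ctx2):               # ODCIAggregateMerge
        if self.md5_hash is None:
            self.md5_hash = ctx2.md5_hash
        elif ctx2.md5_hash is not None:
            self.md5_hash = bytes(a ^ b for a, b in zip(self.md5_hash, ctx2.md5_hash))

    def terminate(self):                 # ODCIAggregateTerminate
        return self.md5_hash.hex() if self.md5_hash else None

# serial evaluation
serial = Md5Xor()
for v in ["foo", "bar", "baz"]:
    serial.iterate(v)

# parallel evaluation: two partial contexts, merged at the end
p1, p2 = Md5Xor(), Md5Xor()
p1.iterate("baz")
p2.iterate("bar")
p2.iterate("foo")
p1.merge(p2)

assert serial.terminate() == p1.terminate()
print(serial.terminate())
```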
Let's return to our tables from Appendix A and prove that they are different using the hash code.
select lower(MD5_XOR(col)) md5 from A;
acbd18db4cc2f85cedef654fccc4a4d8
select lower(MD5_XOR(col)) md5 from B;
37b51d194a7513e45b56f6524f2d51f2
No surprise here; the hash codes are clearly different.
But of course there is always the question: what is it good for? The following case studies illustrate some possible use cases of the hash code aggregate function.
Case Study 1 - Observing a Table
Imagine there is a table belonging to a system that has long been deprovisioned, and we expect no changes in the table. But we are not really sure; maybe there are some processes that are very infrequent but still active, and they may change the table. So how can we efficiently verify the change state of the table? Of course, we can always copy the table and perform the MINUS based comparison.
An alternative is to calculate the hash code of the table and store it. The identity check re-calculates the hash code and compares it with the stored value.
To calculate the hash code of a table with several columns, we concatenate them prior to the hash code calculation, as shown in the query below:
select MD5_XOR(to_char(id)||COL_TXT||to_char(COL_DATE,'dd.mm.yyyy hh24:mi:ss')) md5 from hash_test_a;
A possible scenario is that with the hash code observation we reduce a large number of tables to a small number of tables that must be observed at the detailed data level.
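In other terms, the observation scheme boils down to recomputing an XOR-combined hash over the concatenated columns and comparing it against a stored baseline. Here is a Python sketch of that idea (the table rows and the date format are illustrative, following the hash_test_a query above):

```python
import hashlib
from functools import reduce

def row_hash(row):
    # concatenate the column values, as in to_char(id)||COL_TXT||to_char(COL_DATE,...)
    return hashlib.md5("".join(str(c) for c in row).encode()).digest()

def table_hash(rows):
    # XOR-combine the per-row hashes; the result is order independent
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                  (row_hash(r) for r in rows)).hex()

baseline = table_hash([(1, "foo", "01.01.2020 00:00:00"),
                       (2, "bar", "02.01.2020 00:00:00")])

# later: recompute and compare with the stored baseline
current = table_hash([(2, "bar", "02.01.2020 00:00:00"),
                      (1, "foo", "01.01.2020 00:00:00")])
assert current == baseline  # unchanged table, even with reordered rows
```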
Use Case 2 - Comparing Remote Tables
Let's imagine we have a replica of some table in a different database connected via a database link. The replication is done by application logic and there is no transactional guarantee that the two copies are identical.
Again, it is possible to copy the whole table into one database and perform the comparison. But can the hash code provide a more efficient way?
Let's assume there is some column that classifies the rows into different groups and that may be used as a GROUP BY key. We calculate the hash code for each group in both databases. We may ignore all grouping keys that provide an identical hash code (which will hopefully be the majority) and copy only the data with the keys that produced different hash codes. This greatly limits the data that must be copied over the database link and simplifies the comparison.
with c1 as (
  select grp_id, MD5_XOR(col1||col2) hash_md5
  from cmp1
  group by grp_id),
c2 as (
  select grp_id, MD5_XOR(col1||col2) hash_md5
  from cmp2@remote
  group by grp_id),
comp as (
  select grp_id, c1.hash_md5 c1_hash_md5, c2.hash_md5 c2_hash_md5
  from c1 full outer join c2 using (grp_id))
select GRP_ID, C1_HASH_MD5, C2_HASH_MD5,
  case when nvl(C1_HASH_MD5,'ab') != nvl(C2_HASH_MD5,'ab')
       then 'N' else 'Y' end is_identical
from comp
order by 1;
GRP_ID C1_HASH_MD5                      C2_HASH_MD5                      IS_IDENTICAL
------ -------------------------------- -------------------------------- ------------
     0 49E24489B738E2EDD8B5870C9DD6F69A 49E24489B738E2EDD8B5870C9DD6F69A Y
     1 0D08057BAD83FECDF1139C44F7FD3049 BBDF62A9556EA3EC5558921C71953CF0 N
     2 97C4D33DC611FD68FEE1157FA14AA87F 97C4D33DC611FD68FEE1157FA14AA87F Y
     3 7658E1C495C26B43047A6D6DFD3B5487 7658E1C495C26B43047A6D6DFD3B5487 Y
     4 8D619C8BE12D7E9CA0450E7781F5C6E3 8D619C8BE12D7E9CA0450E7781F5C6E3 Y
     5 92003DF5DE99F7CA8F6E247C3FEF9EF6 92003DF5DE99F7CA8F6E247C3FEF9EF6 Y
     6 1DAF3E58FE1FC1BC0AA8613BF549A424 1DAF3E58FE1FC1BC0AA8613BF549A424 Y
     7 45ACA5382828BE50436B3E254A0E5DD4 45ACA5382828BE50436B3E254A0E5DD4 Y
     8 C57651D4B5D05BA8C63F20BD24D5B6BD C57651D4B5D05BA8C63F20BD24D5B6BD Y
     9 14A6BF7CDA3B5A83E75AE4DB5CCB5ABD 14A6BF7CDA3B5A83E75AE4DB5CCB5ABD Y
Performance Considerations
We already discussed that the algorithmic complexity of the hash code calculation is superior to that of the set operations. This would suggest that our aggregate function outperforms the MINUS / UNION based implementation. Unfortunately, the user-defined implementation (probably due to a large number of context switches) is not very efficient, and the expected effect is not observed.
To obtain a truly efficient hash code aggregate function, we would need an Oracle-integrated implementation. Such an implementation could cover all supported hash code types in the same manner.
The usability could further be extended with support for multi-column calculation without the need for explicit concatenation.
Appendix A
Well, I confess this is a prepared example.
select * from A;
COL
-----
foo
bar
bar
select * from B;
COL
-----
foo
foo
bar
But in case the table has no primary key, this "collision" could happen. In a practical case, I agree, the probability would be similarly low as that of a hash code collision.
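The effect can be reproduced with Python's set vs. multiset semantics: MINUS is a set operation, so the duplicate rows of A and B cancel out, while a multiset (Counter) comparison still tells the tables apart.

```python
from collections import Counter

A = ["foo", "bar", "bar"]
B = ["foo", "foo", "bar"]

# MINUS-like, set-based comparison: both differences are empty
assert set(A) - set(B) == set()
assert set(B) - set(A) == set()

# a multiset comparison still detects the difference
assert Counter(A) != Counter(B)
print("set-based compare says identical, multiset compare says different")
```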
Appendix B
-- MD5 hash XOR combined aggregate function in PL/SQL
--
CREATE OR REPLACE TYPE md5_xor_type AS OBJECT
(
  md5Hash RAW(16),
  STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT md5_xor_type)
    RETURN NUMBER,
  MEMBER FUNCTION ODCIAggregateIterate(self IN OUT md5_xor_type,
                                       VALUE IN VARCHAR2)
    RETURN NUMBER,
  MEMBER FUNCTION ODCIAggregateTerminate(self IN md5_xor_type,
                                         returnValue OUT RAW,
                                         flags IN NUMBER)
    RETURN NUMBER,
  MEMBER FUNCTION ODCIAggregateMerge(self IN OUT md5_xor_type,
                                     ctx2 IN md5_xor_type)
    RETURN NUMBER
);
/
CREATE OR REPLACE TYPE BODY md5_xor_type
IS
  STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT md5_xor_type)
    RETURN NUMBER
  IS
  BEGIN
    sctx := md5_xor_type(NULL);
    RETURN ODCIConst.Success;
  END;

  MEMBER FUNCTION ODCIAggregateIterate(self IN OUT md5_xor_type,
                                       VALUE IN VARCHAR2)
    RETURN NUMBER
  IS
    my_hash RAW(16);
  BEGIN
    BEGIN
      select standard_hash(VALUE, 'MD5') into my_hash from dual;
    END; -- add exception handling
    IF self.md5Hash IS NULL THEN
      self.md5Hash := my_hash;
    ELSE
      self.md5Hash := UTL_RAW.BIT_XOR(self.md5Hash, my_hash);
    END IF;
    RETURN ODCIConst.Success;
  END;

  MEMBER FUNCTION ODCIAggregateTerminate(self IN md5_xor_type,
                                         returnValue OUT RAW,
                                         flags IN NUMBER)
    RETURN NUMBER
  IS
  BEGIN
    returnValue := self.md5Hash;
    RETURN ODCIConst.Success;
  END;

  MEMBER FUNCTION ODCIAggregateMerge(self IN OUT md5_xor_type,
                                     ctx2 IN md5_xor_type)
    RETURN NUMBER
  IS
  BEGIN
    -- guard against NULL contexts (e.g. a parallel slave that processed no rows)
    IF self.md5Hash IS NULL THEN
      self.md5Hash := ctx2.md5Hash;
    ELSIF ctx2.md5Hash IS NOT NULL THEN
      self.md5Hash := UTL_RAW.BIT_XOR(self.md5Hash, ctx2.md5Hash);
    END IF;
    RETURN ODCIConst.Success;
  END;
END;
/
CREATE OR REPLACE FUNCTION MD5_XOR(input VARCHAR2)
  RETURN RAW
  AGGREGATE USING MD5_XOR_TYPE;
/
-- test
select MD5_XOR(txt) from foo;
References
[Oracle Database Ideas] https://community.oracle.com/ideas/20275
[MySQL Verifying MD5 Checksum] https://docs.oracle.com/cd/E17952_01/mysql-5.0-en/verifying-md5-checksum.html
[Wikipedia Merkle Trees] https://en.wikipedia.org/wiki/Merkle_tree
[Stack Overflow Order Independent Hash] https://stackoverflow.com/questions/47253864/order-independent-hash-in-java