duplicate file check
I just wrote this procedure below that does duplicate file (dataset, i should say) check.
Let’s say, you have a system where your clients/users have rights to write straight to your “staging” or “landing” tables. You have a system where multiple users are feeding multiple files everyday. You have a requirement to check whether users are sending duplicate files into your system intentionally or unintentionally. You could probably find a million ways of doing a duplicate file check. I just wrote a simple stored procedure that does that for me. Let’s have a look at it. \
create or replace PROCEDURE clob_to_hash IS
i number(5);
mychar VARCHAR2(4000);
myclob CLOB;
myraw raw(2000);
rec_cnt NUMBER(2);
CURSOR mycur1 IS
SELECT column_name AS
field_name
FROM sys.all_tab_cols
WHERE owner = 'SCOTT'
AND TABLE_NAME = 'EMP'
ORDER BY column_id;
mycur mycur1%ROWTYPE;
col_names VARCHAR2(4000);
mysqlstr varchar2(4000);
BEGIN
i:=0;
OPEN mycur1;
LOOP
FETCH mycur1 INTO mycur;
EXIT WHEN mycur1%NOTFOUND;
IF i = 0 THEN
col_names := mycur.field_name;
ELSE
col_names := col_names || ',' || mycur.field_name;
end if;
i:=i+1;
END LOOP;
close mycur1;
mysqlstr := 'SELECT xmlagg(xmlelement("emp",' || col_names || ')).getclobval() FROM scott.emp';
execute immediate mysqlstr INTO myclob;
myraw := dbms_crypto.hash(src => myclob, typ => dbms_crypto.hash_md5);
mychar := rawtohex(myraw);
SELECT COUNT(*) INTO rec_cnt FROM hashtest WHERE mchar = mychar;
IF rec_cnt = 0 THEN
INSERT INTO hashtest VALUES(mychar);
ELSE
INSERT INTO hashtest VALUES('duplicate data set');
END IF;
COMMIT;
END;
This procedure creates a list of column names of ‘emp’ table from sys.all_tab_cols in order they were created. It builds a clob of xml data out of the emp table and create an MD5 key for that clob. It converts the ‘raw’ value returned by DBMS_CRYPTO.HASH function to a ‘hex’ value. After that it tries to find a row in the “hashtest” table to see if the same ‘hash’ key exists. If it finds duplicate records, it write ‘duplicate data set’ in the hashtest table. otherwise it writes the ‘hex’ value generated from the ‘hash’ value which was returned in ‘raw’ format by DBMS_CRYPTO.HASH function.
SQL> truncate table hashtest;
Table truncated.
SQL> exec clob_to_hash;
PL/SQL procedure successfully completed.
SQL> desc hashtest;
Name Null? Type
----------------------------------------- -------- ----------------------------
MCHAR VARCHAR2(255)
SQL> select * from hashtest;
MCHAR
--------------------------------------------------------------------------------
4FCD4351B718B46FC8DCE7BA01B3F910
SQL> exec clob_to_hash;
PL/SQL procedure successfully completed.
SQL> select * from hashtest;
MCHAR
--------------------------------------------------------------------------------
17A6E53EA979BFB0E67CCE2E5BDDB769
duplicate data set
SQL> update scott.emp set sal=sal+100 where empno=7844;
1 rows updated.
SQL> commit;
Commit complete
SQL> exec clob_to_hash;
PL/SQL procedure successfully completed.
SQL> select * from hashtest;
MCHAR
--------------------------------------------------------------------------------
17A6E53EA979BFB0E67CCE2E5BDDB769
duplicate data set
4FCD4351B718B46FC8DCE7BA01B3F910
That’s it!!! With a little modification, you could pretty much use this to compare any two datasets.
This entry was posted on Tuesday, April 3rd, 2007 at 11:24 pm and is filed under Oracle. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.
May 15th, 2007 at 6:50 pm
Hi kirtan,
I read your post in informaticadevelopment yahoogroups on CRC55 function to find the change in a row with looking up into many columns at a time and your approach thru oracle pl/sql. I am in a learning stage and i am trying to execute the procedure you wrote above but i am getting an error while executing this. Can you please tell me how and where i should declare this function.
The error is:
PLS-00201: identifier ‘DBMS_CRYPTO.HASH’ must be declared
May 15th, 2007 at 7:05 pm
Your user ID does not have EXECUTE permission on that function.
GRANT EXECUTE on sys.DBMS_CRYPTO to your user id
that should take care of it. Let me know.
May 15th, 2007 at 11:02 pm
Hi Kirtan,
I logged on as ’system’ user name and executed the above statement for ‘Scott’ user. But i am unsuccessful i am getting the following error.
ORA:04042 Procedure,Function,Package or Package body does not exist.
May 16th, 2007 at 5:06 pm
I should have asked this before but anyways what version of Oracle are you using?
First of all, you should not be using sys or system accounts for regular use. Create your own user and assign DBA priv. to it.
For DBMS_CRYPTO, log in as sys (as sysdba) and execute the same command. It works fine for me as sys (as sysdba).
[oracle@raclinux1 ~]$ sqlplus sys as sysdba
SQL*Plus: Release 10.2.0.1.0 - Production on Wed May 16 13:14:49 2007
Copyright (c) 1982, 2005, Oracle. All rights reserved.
Enter password:
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, Real Application Clusters, Oracle Label Security, OLAP
and Data Mining Scoring Engine options
SQL> grant execute on dbms_crypto to scott;
Grant succeeded.
SQL>
Let me know.
May 16th, 2007 at 6:59 pm
I am working on Oracle 9i, oh it seems this function is not available in this version.
Thanks & Regards,
Sam.
May 16th, 2007 at 7:02 pm
Oops, i think this function is not available in Oracle 9i version. I am i correct.
Thanks & Regards,
Sam.
May 16th, 2007 at 7:04 pm
Check out DBMS_OBFUSCATION_TOOLKIT for 8i or 9i.
May 16th, 2007 at 7:49 pm
Now i can able to grant privilege to SCOTT user. I replaced the DBMS_CRYPTO with DBMS_OBFUSCATION_TOOLKIT,but now i am getting error at the following line in the code
myraw := DBMS_OBFUSCATION_TOOLKIT.hash(src => myclob, typ => DBMS_OBFUSCATION_TOOLKIT.hash_md5);
ERROR:
PLS-00302: component ‘HASH’ must be declared
May 16th, 2007 at 8:43 pm
You need to read the following.
http://download-east.oracle.com/docs/cd/B10501_01/appdev.920/a96612/d_obtool.htm#6518
Since 10g, with DBMS_CRYPTO, the task became much more easier. DBMS_OBFUSCATION_TOOLKIT hash its own challenges that you are coming across right now. I would suggest you read the documentation for 9i if that’s the version you want to use. Personally, i think time for switching to 10g was almost two years ago.
Read the 9i docs, write the code using DBMS_OBFUSCATION_TOOLKIT and let me know if you have any problems.
Anyways, have fun.
Kirtan