Kde…

duplicate file check

I just wrote this procedure below that does duplicate file (dataset, i should say) check.

Let’s say, you have a system where your clients/users have rights to write straight to your “staging” or “landing” tables. You have a system where multiple users are feeding multiple files everyday. You have a requirement to check whether users are sending duplicate files into your system intentionally or unintentionally. You could probably find a million ways of doing a duplicate file check. I just wrote a simple stored procedure that does that for me. Let’s have a look at it. \


create or replace PROCEDURE clob_to_hash IS
 i number(5);
 mychar VARCHAR2(4000);
 myclob CLOB;
 myraw raw(2000);
 rec_cnt NUMBER(2);
 CURSOR mycur1 IS
  SELECT column_name AS
  field_name
  FROM sys.all_tab_cols
  WHERE owner = 'SCOTT'
  AND TABLE_NAME = 'EMP'
  ORDER BY column_id;
 mycur mycur1%ROWTYPE;
 col_names VARCHAR2(4000);
 mysqlstr varchar2(4000);
BEGIN
 i:=0;
 OPEN mycur1;
 LOOP
  FETCH mycur1 INTO mycur;
  EXIT WHEN mycur1%NOTFOUND;
  IF i = 0 THEN
   col_names := mycur.field_name;
  ELSE
   col_names := col_names || ',' || mycur.field_name;
  end if;
 i:=i+1;
 END LOOP;
 close mycur1;
 mysqlstr := 'SELECT xmlagg(xmlelement("emp",' || col_names || ')).getclobval() FROM scott.emp';
 execute immediate mysqlstr INTO myclob;
 myraw := dbms_crypto.hash(src => myclob, typ => dbms_crypto.hash_md5);
 mychar := rawtohex(myraw);
 SELECT COUNT(*) INTO rec_cnt FROM hashtest WHERE mchar = mychar;
 IF rec_cnt = 0 THEN
  INSERT INTO hashtest VALUES(mychar);
 ELSE
  INSERT INTO hashtest VALUES('duplicate data set');
 END IF;
 COMMIT;
END;

This procedure creates a list of column names of ‘emp’ table from sys.all_tab_cols in order they were created. It builds a clob of xml data out of the emp table and create an MD5 key for that clob. It converts the ‘raw’ value returned by DBMS_CRYPTO.HASH function to a ‘hex’ value. After that it tries to find a row in the “hashtest” table to see if the same ‘hash’ key exists. If it finds duplicate records, it write ‘duplicate data set’ in the hashtest table. otherwise it writes the ‘hex’ value generated from the ‘hash’ value which was returned in ‘raw’ format by DBMS_CRYPTO.HASH function.


SQL> truncate table hashtest;

Table truncated.

SQL> exec clob_to_hash;

PL/SQL procedure successfully completed.

SQL> desc hashtest;

Name                                      Null?    Type
----------------------------------------- -------- ----------------------------
MCHAR                                              VARCHAR2(255)

SQL> select * from hashtest;

MCHAR
--------------------------------------------------------------------------------
4FCD4351B718B46FC8DCE7BA01B3F910

SQL> exec clob_to_hash;

PL/SQL procedure successfully completed.

SQL> select * from hashtest;

MCHAR
--------------------------------------------------------------------------------
17A6E53EA979BFB0E67CCE2E5BDDB769
duplicate data set

SQL> update scott.emp set sal=sal+100 where empno=7844;

1 rows updated.

SQL> commit;

Commit complete

SQL> exec clob_to_hash;

PL/SQL procedure successfully completed.

SQL> select * from hashtest;

MCHAR
--------------------------------------------------------------------------------
17A6E53EA979BFB0E67CCE2E5BDDB769
duplicate data set
4FCD4351B718B46FC8DCE7BA01B3F910

That’s it!!! With a little modification, you could pretty much use this to compare any two datasets.

9 Responses to “duplicate file check”

  1. May 15th, 2007 at 6:50 pm

    sampathmadala says:

    Hi kirtan,

    I read your post in informaticadevelopment yahoogroups on CRC55 function to find the change in a row with looking up into many columns at a time and your approach thru oracle pl/sql. I am in a learning stage and i am trying to execute the procedure you wrote above but i am getting an error while executing this. Can you please tell me how and where i should declare this function.

    The error is:

    PLS-00201: identifier ‘DBMS_CRYPTO.HASH’ must be declared

  2. May 15th, 2007 at 7:05 pm

    Kirtan Desai says:

    Your user ID does not have EXECUTE permission on that function.

    GRANT EXECUTE on sys.DBMS_CRYPTO to your user id

    that should take care of it. Let me know.

  3. May 15th, 2007 at 11:02 pm

    sampathmadala says:

    Hi Kirtan,

    I logged on as ’system’ user name and executed the above statement for ‘Scott’ user. But i am unsuccessful i am getting the following error.

    ORA:04042 Procedure,Function,Package or Package body does not exist.

  4. May 16th, 2007 at 5:06 pm

    Kirtan Desai says:

    I should have asked this before but anyways what version of Oracle are you using?
    First of all, you should not be using sys or system accounts for regular use. Create your own user and assign DBA priv. to it.

    For DBMS_CRYPTO, log in as sys (as sysdba) and execute the same command. It works fine for me as sys (as sysdba).

    [oracle@raclinux1 ~]$ sqlplus sys as sysdba

    SQL*Plus: Release 10.2.0.1.0 - Production on Wed May 16 13:14:49 2007

    Copyright (c) 1982, 2005, Oracle. All rights reserved.

    Enter password:

    Connected to:
    Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
    With the Partitioning, Real Application Clusters, Oracle Label Security, OLAP
    and Data Mining Scoring Engine options

    SQL> grant execute on dbms_crypto to scott;

    Grant succeeded.

    SQL>

    Let me know.

  5. May 16th, 2007 at 6:59 pm

    sampathmadala says:

    I am working on Oracle 9i, oh it seems this function is not available in this version.

    Thanks & Regards,
    Sam.

  6. May 16th, 2007 at 7:02 pm

    sampathmadala says:

    Oops, i think this function is not available in Oracle 9i version. I am i correct.

    Thanks & Regards,
    Sam.

  7. May 16th, 2007 at 7:04 pm

    Kirtan Desai says:

    Check out DBMS_OBFUSCATION_TOOLKIT for 8i or 9i.

  8. May 16th, 2007 at 7:49 pm

    sampathmadala says:

    Now i can able to grant privilege to SCOTT user. I replaced the DBMS_CRYPTO with DBMS_OBFUSCATION_TOOLKIT,but now i am getting error at the following line in the code

    myraw := DBMS_OBFUSCATION_TOOLKIT.hash(src => myclob, typ => DBMS_OBFUSCATION_TOOLKIT.hash_md5);

    ERROR:
    PLS-00302: component ‘HASH’ must be declared

  9. May 16th, 2007 at 8:43 pm

    Kirtan Desai says:

    You need to read the following.

    http://download-east.oracle.com/docs/cd/B10501_01/appdev.920/a96612/d_obtool.htm#6518

    Since 10g, with DBMS_CRYPTO, the task became much more easier. DBMS_OBFUSCATION_TOOLKIT hash its own challenges that you are coming across right now. I would suggest you read the documentation for 9i if that’s the version you want to use. Personally, i think time for switching to 10g was almost two years ago.
    Read the 9i docs, write the code using DBMS_OBFUSCATION_TOOLKIT and let me know if you have any problems.

    Anyways, have fun.
    Kirtan

Leave a Reply