
Signature Based Duplicate Detection in Digital Libraries

L. Padmasree, Vamshi Ambati, J. Anand Chandulal, M. Sreenivasa Rao. School of Information Technology, JNT University, Hyderabad, 500 072, India. srmeda@gmail.com


Presentation Transcript


  1. Signature Based Duplicate Detection in Digital Libraries • L. Padmasree, Vamshi Ambati, J. Anand Chandulal, M. Sreenivasa Rao • School of Information Technology, JNT University, Hyderabad, 500 072, India • srmeda@gmail.com

  2. Motivation • Books scanned in Digital Libraries are procured from varied sources. • Scanning centers are distributed across the country. • Duplicates could arise between scanning points. • Pre-scanning duplicate detection is required

  3. Challenges • Duplicate detection relies on metadata (title, author, publishing year, edition, etc.) • Metadata is entered by varied operators, so it can be incorrect or incomplete • Errors include typographical mistakes, word disorder, inconsistent abbreviations, and even missing words, all of which make duplicate detection more difficult • Duplicate detection must be accurate and have a quick turnaround time

  4. RELATED WORK • Most traditional methods are based on string similarity and fall into two groups: character-based techniques and vector space based techniques • Character-based techniques rely on character edit operations, such as deletions, insertions, and substitutions, and on subsequence comparison • Vector space based techniques transform strings into vector representations on which similarity computations are conducted • The present work uses an efficient and fast duplicate detection technique based on similarity search

  5. Our Approach • Uses the signature file method • Uses similarity search techniques to find duplicates by close-proximity match • Language independent • Fast and accurate • Uses an online tool to customize

  6. The Process • Metadata is created at the scanning centers • A signature is computed for the metadata using the superimposed coding technique and the hashing method • The signature is stored in a central repository • Before a book is scanned, its metadata is submitted as a query • The query signature is computed with the same technique • Similarity search returns the closest-matching record as the candidate duplicate

  7. Duplicate Detection in Digital Library System • [Diagram; labels: Scanning Centre-I, Scanning Centre-II, Central Database, Query, Metadata <Title>, Signature (10001011), Duplicate Detection Technique, Y/N]

  8. Example of the process • Books data [not preserved in the transcript] • Example query: The Arts of Japan - Edward Dillon • Result [not preserved in the transcript]

  9. Superimposed Coding Technique • In the superimposed coding technique, each record is mapped into an individual binary signature. • A record is the title of the book, the author name, or a combination of the two. • The signatures of the records in the training data and testing data are encoded binary representations. • The signature of the title or author name of the book is obtained by superimposing the signatures of its words with a bitwise OR operation.
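
The superimposing step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the 8-bit word signatures are hand-picked for the example (the slides do not give real values), and `WORD_SIGNATURES` and `superimpose` are hypothetical names.

```python
from functools import reduce
from operator import or_

# Hypothetical 8-bit word signatures, hand-picked for illustration only.
WORD_SIGNATURES = {
    "arts":  0b10000010,
    "of":    0b00001001,
    "japan": 0b10001000,
}

def superimpose(words, word_signatures):
    """OR together the signatures of all words in a record."""
    return reduce(or_, (word_signatures[w] for w in words), 0)

record_signature = superimpose(["arts", "of", "japan"], WORD_SIGNATURES)
print(format(record_signature, "08b"))  # 10001011
```

Because OR is commutative, the same words in a different order give the same record signature, which is one way to read the claim that the method tolerates word disorder.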

  10. The Hashing method • The signature of each word is obtained by a hashing method. • The hashing function H(w) maps the word w into one of the patterns, selected by computing a hash value of the word. • The hash function uses a shift-and-add strategy: the ASCII values of the characters in the word are shifted and added to compute the hash value. • The final hash value is obtained by a mod operation with nCr.
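
One possible reading of the hashing step, sketched in Python. Several details are assumptions, since the slides do not state them: that a pattern is an n-bit string with exactly r bits set (so there are nCr candidates), the shift width of 5, the 32-bit truncation, lowercasing of words, and the way a hash value is turned into a pattern. The function names are hypothetical.

```python
from math import comb

def shift_add_hash(word, shift=5):
    """Shift-and-add hash over the character codes of a word.
    The shift width and 32-bit truncation are assumptions."""
    h = 0
    for ch in word:
        h = ((h << shift) + ord(ch)) & 0xFFFFFFFF
    return h

def pattern_from_index(index, n, r):
    """Return the index-th n-bit pattern with exactly r bits set
    (combinatorial unranking; one possible way to enumerate patterns)."""
    pattern = 0
    for bit in range(n - 1, -1, -1):
        if r == 0:
            break
        without_bit = comb(bit, r)  # patterns that leave this bit clear
        if index >= without_bit:
            pattern |= 1 << bit
            index -= without_bit
            r -= 1
    return pattern

def word_signature(word, n=16, r=3):
    """Map a word to one of the C(n, r) candidate patterns."""
    index = shift_add_hash(word.lower()) % comb(n, r)  # lowercasing is an assumption
    return pattern_from_index(index, n, r)

print(format(word_signature("Japan"), "016b"))
```

Taking the hash modulo `comb(n, r)` guarantees the index always names a valid pattern, so every word signature has exactly r bits set.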

  11. Duplicate Detection in Digital Library System • The Similarity Match Algorithm for Library Database
      Input: library database L consisting of documents D1, D2, …, Dm; query Q
      Output: book B corresponding to query Q
      Procedure Library (D1, D2, …, Dm, Q : in; B : out)
        for i = 1 to m do
          Si = superimposed-coding (Di)
        end do
        X = superimposed-coding (Q)
        O = Jaccard (S1, S2, …, Sm, X)
        Look up in library database L the book B (document) whose signature has the minimum Jaccard distance to X.
      End
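
The procedure on this slide can be sketched end to end in Python. This is an illustration under stated assumptions, not the authors' code: the parameters `N_BITS` and `R_BITS`, the shift width, the pattern table built with `itertools.combinations`, and all function names are hypothetical, and the library records other than the slide's own example query are sample data chosen for the demonstration. The distance uses the expression given on the Jaccard Distance slide.

```python
from itertools import combinations
from math import comb

N_BITS, R_BITS = 64, 3  # assumed signature width and bits set per word
# All N_BITS-bit patterns with exactly R_BITS bits set (assumed pattern table).
PATTERNS = [sum(1 << b for b in bits)
            for bits in combinations(range(N_BITS), R_BITS)]

def word_signature(word):
    """Shift-and-add hash of the word, reduced mod C(n, r) to pick a pattern."""
    h = 0
    for ch in word.lower():
        h = ((h << 5) + ord(ch)) & 0xFFFFFFFF
    return PATTERNS[h % comb(N_BITS, R_BITS)]

def superimposed_coding(record):
    """OR the word signatures of a metadata record (title and author)."""
    signature = 0
    for word in record.split():
        signature |= word_signature(word)
    return signature

def distance(target, query):
    """(r + s) / (q + r + s + t): differing bits over all N_BITS positions."""
    return bin(target ^ query).count("1") / N_BITS

def find_closest(library, query):
    """Return the record with minimum distance to the query, and that distance."""
    signatures = {doc: superimposed_coding(doc) for doc in library}
    x = superimposed_coding(query)
    best = min(library, key=lambda doc: distance(signatures[doc], x))
    return best, distance(signatures[best], x)

# Sample records for illustration only.
library = [
    "The Discovery of India Jawaharlal Nehru",
    "The Arts of Japan Edward Dillon",
    "Gitanjali Rabindranath Tagore",
]

# Word disorder: same words, different order.
print(find_closest(library, "Edward Dillon The Arts of Japan"))
# Missing words: "The" and "Edward" omitted.
print(find_closest(library, "Arts of Japan Dillon"))
```

In a deployed system the library signatures would be computed once and stored in the central repository rather than recomputed per query; they are rebuilt inside `find_closest` here only to keep the sketch short.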

  12. Jaccard Distance • The Jaccard distance between the query signature and the target signature can be obtained by using the expression d = (r + s) / (q + r + s + t), where:
      • q is the number of bits that equal 1 in both the target and query signatures.
      • r is the number of bits that equal 1 in the target signature but 0 in the query signature.
      • s is the number of bits that equal 0 in the target signature but 1 in the query signature.
      • t is the number of bits that equal 0 in both the target and query signatures.
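
A short Python sketch of these counts, assuming signatures are stored as integers of `n` bits (the function names and the example values are hypothetical). `slide_distance` follows the expression as written on the slide; because q + r + s + t always equals n, it is the fraction of bit positions that differ. The conventional Jaccard distance leaves t out of the denominator, so `jaccard_distance` is included for comparison.

```python
def bit_counts(target, query, n):
    """Count q, r, s, t over the n bit positions of two signatures."""
    mask = (1 << n) - 1
    q = bin(target & query & mask).count("1")   # 1 in both
    r = bin(target & ~query & mask).count("1")  # 1 in target only
    s = bin(~target & query & mask).count("1")  # 1 in query only
    t = n - q - r - s                           # 0 in both
    return q, r, s, t

def slide_distance(target, query, n):
    """d = (r + s) / (q + r + s + t), as given on the slide."""
    q, r, s, t = bit_counts(target, query, n)
    return (r + s) / (q + r + s + t)

def jaccard_distance(target, query, n):
    """Conventional Jaccard distance: (r + s) / (q + r + s)."""
    q, r, s, _ = bit_counts(target, query, n)
    return (r + s) / (q + r + s) if q + r + s else 0.0

target, query = 0b10001011, 0b10001001
print(bit_counts(target, query, 8))        # (3, 1, 0, 4)
print(slide_distance(target, query, 8))    # 0.125
print(jaccard_distance(target, query, 8))  # 0.25
```

Both functions rank identical signatures at distance 0, but they can order near matches differently when signatures have many zero bits in common.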

  13. False drops • Minimized by an appropriate choice of the two parameters n and r. • Online Tool

  14. EXPERIMENTAL RESULTS • [Results not preserved in the transcript] • DR: Detection Rate

  15. Scalability and accuracy of duplicate detection system

  16. CONCLUSION • An effective and efficient duplicate detection technique is proposed. • Duplicates are detected by similarity search using the signature file method, which finds duplicates despite typographical mistakes, word disorder, inconsistent abbreviations, and even missing words. • The technique is language independent and achieves high performance with 95% accuracy.

  17. Questions?
