Signature Based Duplicate Detection in Digital Libraries

L. Padmasree, Vamshi Ambati, J. Anand Chandulal, M. Sreenivasa Rao
School of Information Technology, JNT University, Hyderabad, 500 072, India. srmeda@gmail.com

Motivation
• Books scanned in Digital Libraries are procured from varied sources.
• Scanning centers are distributed across the country.
• Duplicates could arise between scanning points.
• Pre-scanning duplicate detection is required.

Challenges
• Duplicate detection relies on metadata (title, author, publishing year, edition, etc.).
• Metadata is entered by different operators, so there is scope for:
  • Incorrectness
  • Incompleteness
• Errors could be:
  • Typographical mistakes
  • Word disorder
  • Inconsistent abbreviations
  • Missing words
• These errors make duplicate detection more difficult.
• Duplicate detection must be accurate and have a quick turnaround time.

RELATED WORK
• Most traditional methods are based on string similarity:
  • Character-based techniques rely on character edit operations, such as deletions, insertions and substitutions, and on subsequence comparison.
  • Vector space based techniques transform strings into vector representations on which similarity computations are conducted.
• The present work uses an efficient and fast duplicate detection technique based on similarity search.

Our Approach
• Uses the signature file method
• Uses similarity search techniques to find duplicates by close proximity match
• Language independent
• Fast and accurate
• Uses an online tool to customize

The Process
• Metadata is created at the scanning centers.
• A signature is computed for the metadata.
• The superimposed coding technique and the hashing method are used.
• The signature is stored in the central repository.
• Before a book is scanned, its metadata is submitted as a query.
• The same technique is used to compute the query signature.
• Similarity search returns the closest-matching duplicate.

Duplicate Detection in Digital Library system
[Figure: Scanning Centre-I and Scanning Centre-II; Central Database (metadata, signature); query metadata <Title> with signature 10001011; Duplicate Detection Technique; Y/N result]

Example of the process
• Example query: The Arts of Japan - Edward Dillon
[Table: books data and query result]

Superimposed Coding Technique
• Each record is mapped to an individual binary signature.
• A record is the title of the book, the author name, or a combination of the two.
• The records in the training data and the testing data are both encoded as binary signatures.
• The signature of a title or author name is obtained by superimposing the signatures of its words with the OR operation.

The Hashing method
• The signature of each word is obtained by a hashing method.
• The hashing function H(w) maps the word w to one of the bit patterns, selected by computing a hash value of the word.
• The hash function uses a shift-and-add strategy: the ASCII values of the characters in the word are shifted and added to compute the hash value.
• The final hash value is obtained by a mod operation with nCr (the number of combinations of r items out of n).
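The two slides above can be sketched together in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that n is the signature length in bits and r the number of bits set per word, and the shift amount, the 32-bit mask, the lowercasing, the whitespace tokenization and the way a hash value is turned into a bit pattern (`pattern_from_index`) are all hypothetical choices that the slides do not specify.

```python
from math import comb


def shift_add_hash(word: str, shift: int = 5) -> int:
    """Shift-and-add hash over character codes (the shift amount is an assumption)."""
    h = 0
    for ch in word:
        # Shift the running value, add the character code, keep it within 32 bits.
        h = ((h << shift) + ord(ch)) & 0xFFFFFFFF
    return h


def pattern_from_index(index: int, n: int, r: int) -> int:
    """Return the index-th n-bit pattern with exactly r bits set (0 <= index < nCr)."""
    bits = 0
    for pos in range(n - 1, -1, -1):
        if r == 0:
            break
        below = comb(pos, r)  # patterns that place all remaining bits below pos
        if index >= below:
            bits |= 1 << pos
            index -= below
            r -= 1
    return bits


def word_signature(word: str, n: int, r: int) -> int:
    """Hash the word, reduce it mod nCr, and pick the matching bit pattern."""
    return pattern_from_index(shift_add_hash(word) % comb(n, r), n, r)


def record_signature(record: str, n: int = 16, r: int = 3) -> int:
    """Superimpose (OR) the word signatures of a title or author name."""
    sig = 0
    for word in record.lower().split():  # lowercasing is an assumption
        sig |= word_signature(word, n, r)
    return sig
```

Because OR is commutative, `record_signature("The Arts of Japan")` and `record_signature("Japan of the Arts")` give the same value, which is one way to see why word disorder does not affect the signature.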

Duplicate Detection in Digital Library System
The Similarity Match Algorithm for Library Database
Input: library database L consisting of documents D1, D2, ..., Dm; query Q.
Output: book B corresponding to query Q.
Procedure Library (D1, D2, ..., Dm, Q : in; B : out)
• for i = 1 to m do
• Si = superimposed-coding (Di)
• end do
• X = superimposed-coding (Q)
• O = Jaccard (S1, S2, ..., Sm, X)
• Look up in library database L the book B (document) whose signature has the minimum Jaccard distance to X.
• End
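The lookup step of the procedure can be sketched as follows. This is an illustrative sketch under stated assumptions: signatures are taken as already computed and stored as n-bit integers, the distance is the expression d = (r + s) / (q + r + s + t) given in these slides, and the names `distance`, `find_closest`, the book titles and the 8-bit signatures are hypothetical.

```python
def distance(target: int, query: int, n: int) -> float:
    """d = (r + s) / (q + r + s + t) for two n-bit signatures."""
    differing = bin(target ^ query).count("1")  # r + s: bits set in exactly one signature
    return differing / n                        # q + r + s + t covers all n positions


def find_closest(library: dict[str, int], query_sig: int, n: int) -> tuple[str, float]:
    """Return the (title, distance) pair with the minimum distance to the query."""
    return min(
        ((title, distance(sig, query_sig, n)) for title, sig in library.items()),
        key=lambda pair: pair[1],
    )


# Toy 8-bit signatures, made up for illustration only.
library = {
    "Book A": 0b10001011,
    "Book B": 0b01110100,
    "Book C": 0b10011110,
}
best = find_closest(library, 0b10001010, 8)
print(best)  # ('Book A', 0.125)
```

A linear scan like this is the simplest reading of the procedure; a deployed system would also need a distance threshold to decide whether the closest match counts as a duplicate, which the slides do not state.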

Jaccard Distance
• The Jaccard distance between the query signature and the target signature is obtained from the expression d = (r + s) / (q + r + s + t), where:
  • q is the number of bits equal to 1 in both the target and the query signature.
  • r is the number of bits equal to 1 in the target signature but 0 in the query signature.
  • s is the number of bits equal to 0 in the target signature but 1 in the query signature.
  • t is the number of bits equal to 0 in both the target and the query signature.
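As a worked example of the expression above, take the 8-bit target signature 10001011 and a hypothetical query signature 10011010 (the query value is an assumption chosen for illustration):

```latex
\begin{align*}
\text{target} &= 1\,0\,0\,0\,1\,0\,1\,1 \\
\text{query}  &= 1\,0\,0\,1\,1\,0\,1\,0 \\
q &= 3, \quad r = 1, \quad s = 1, \quad t = 3 \\
d &= \frac{r + s}{q + r + s + t} = \frac{1 + 1}{3 + 1 + 1 + 3} = \frac{2}{8} = 0.25
\end{align*}
```

Since q + r + s + t always equals the signature length, this expression is the fraction of bit positions in which the two signatures differ. The form of the Jaccard distance usually given in textbooks leaves t out of the denominator, which would give 2/5 = 0.4 for the same pair.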

False drops
• Minimized by an appropriate choice of the two parameters n and r.
• Online Tool

EXPERIMENTAL RESULTS
[Table: experimental results; DR = Detection Rate]

Scalability and accuracy of duplicate detection system

CONCLUSION
• An effective and efficient duplicate detection technique is proposed.
• Duplicates are detected by similarity search using the signature file method, which tolerates typographical mistakes, word disorder, inconsistent abbreviations and even missing words.
• The technique is language independent and achieves high performance, with 95% accuracy.

Questions?
