Duplicate Detection

Exercise 1. Use Extended Key to do Entity Identification[1]

Tables R and S are shown below: Table R, Table S.

Suppose the extended key is {name, city, homeaddress} and the following ILFDs hold:
• (E.HomeAddress = "Myskviksvägen 8") -> (E.City = "INGARÖ")
• (E.HomeAddress = "Myrvägen 2") -> (E.City = "INGARÖ")
• (E.HomeAddress = "Pilgatan 9") -> (E.City = "STOCKHOLM")
• (E.HomeAddress = "Nyängsvägen 39A") -> (E.City = "TULLINGE")
Please construct the integrated table.
-----------------------------------------------------
[1] Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson, "Entity Identification in Database Integration," Proceedings of the Ninth International Conference on Data Engineering, pp. 294-301, April 19-23, 1993

Answer to Exercise 1
• Integrated Table
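The way the ILFDs feed the extended key in Exercise 1 can be sketched in Python. This is only an illustrative sketch: the four address-to-city rules come from the exercise, but the record layout (dictionaries with `name`, `city`, `homeaddress`), the helper names `extend`, `extended_key` and `same_entity`, and the sample rows are assumptions made here, since the contents of Tables R and S are not reproduced in this transcript.

```python
# ILFDs from the exercise: a known HomeAddress value determines the City value.
ILFDS = {
    "Myskviksvägen 8": "INGARÖ",
    "Myrvägen 2": "INGARÖ",
    "Pilgatan 9": "STOCKHOLM",
    "Nyängsvägen 39A": "TULLINGE",
}

def extend(record):
    """Return a copy of the record with city derived from homeaddress if missing."""
    out = dict(record)
    addr = out.get("homeaddress")
    if out.get("city") is None and addr in ILFDS:
        out["city"] = ILFDS[addr]
    return out

def extended_key(record):
    """Extended key {name, city, homeaddress} after applying the ILFDs."""
    r = extend(record)
    return (r.get("name"), r.get("city"), r.get("homeaddress"))

def same_entity(r, s):
    """Two records match when their extended keys are complete and equal."""
    key_r, key_s = extended_key(r), extended_key(s)
    return None not in key_r and key_r == key_s

# Hypothetical rows (not taken from Tables R and S): one source lacks the city.
r_row = {"name": "A. Example", "city": "STOCKHOLM", "homeaddress": "Pilgatan 9"}
s_row = {"name": "A. Example", "homeaddress": "Pilgatan 9"}
print(same_entity(r_row, s_row))
```

The design choice in this sketch is to treat a record whose extended key is still incomplete after applying the ILFDs as unmatched, rather than guessing the missing attribute.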

Exercise 2. Use Priority Queue to do Duplicate Detection[2]

Table R, which is already sorted according to an application-specific key:
Similarities between tuples:
• Given the conditions below, please use the Priority Queue algorithm to find the duplicate clusters in Table R.

Method for computing the matching score: given a cluster, the matching score of a tuple is the average of the tuple's similarity with all of the cluster's representatives.
• The condition to declare a new cluster: matching score < 0.5
• The condition to declare a representative: 0.5 < matching score < 0.8
• The size of the Priority Queue: 2
-----------------------------------------------------
[2] A.E. Monge and C.P. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records," Proc. ACM-SIGMOD Workshop Research Issues on Knowledge Discovery and Data Mining, 1997

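Answer to Exercise 2
• Record 1: Queue {1}
• Record 2: 2:1 = 0.6, which is > 0.5 and < 0.8, so 2 joins the cluster as a representative. Queue {1, 2}
• Record 3: 3:1 = 0.1, 3:2 = 0.2; matching score = (0.1 + 0.2) / 2 = 0.15 < 0.5, so 3 starts a new cluster. Queue {3} {1, 2}
• Record 4: 4:1 = 0.3, 4:2 = 0.4; matching score = (0.3 + 0.4) / 2 = 0.35 < 0.5. 4:3 = 0.9, which is > 0.5 and > 0.8, so 4 joins the cluster of 3 but is not a representative. Queue {3, 4} {1, 2}
• Record 5: 5:1 = 0.5, 5:2 = 0.4; matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5. 5:3 = 0.4; matching score = 0.4 < 0.5, so 5 starts a new cluster. Queue {5} {3, 4} {1, 2}
• Record 6: 6:3 = 0.6; matching score = 0.6, which is > 0.5 and < 0.8. 6:5 = 0.4 < 0.5, so 6 joins the cluster of 3 as a representative. Queue {3, 4, 6} {5} {1, 2}
• Record 7: 7:3 = 0.5, 7:6 = 0.4; matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5. 7:5 = 0.8 > 0.5, so 7 joins the cluster of 5. Queue {5, 7} {3, 4, 6} {1, 2}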
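The rules of Exercise 2 can be traced with a short Python sketch. It is a simplified illustration of the exercise's rules, not the full algorithm from [2]. The similarity values are only those quoted in the worked answer; the function names, the dictionary layout, and the choice to treat a score of exactly 0.5 as "join without becoming a representative" are assumptions made here.

```python
# Pairwise similarities quoted in the worked answer, keyed as (later, earlier).
SIM = {
    (2, 1): 0.6,
    (3, 1): 0.1, (3, 2): 0.2,
    (4, 1): 0.3, (4, 2): 0.4, (4, 3): 0.9,
    (5, 1): 0.5, (5, 2): 0.4, (5, 3): 0.4,
    (6, 3): 0.6, (6, 5): 0.4,
    (7, 3): 0.5, (7, 5): 0.8, (7, 6): 0.4,
}

def similarity(a, b):
    """Look up the similarity of two record ids (symmetric)."""
    return SIM[(max(a, b), min(a, b))]

def detect_duplicates(records, queue_size=2, new_below=0.5, rep_below=0.8):
    """Cluster records in one pass using a bounded queue of recent clusters."""
    queue = []    # active clusters, highest priority first
    evicted = []  # clusters pushed out of the queue; no longer compared
    for rec in records:
        for idx, cluster in enumerate(queue):
            reps = cluster["reps"]
            # Matching score: average similarity to the cluster's representatives.
            score = sum(similarity(rec, r) for r in reps) / len(reps)
            if score >= new_below:
                cluster["members"].append(rec)
                if new_below < score < rep_below:
                    reps.append(rec)  # dissimilar enough to add information
                queue.insert(0, queue.pop(idx))  # matched cluster moves to front
                break
        else:
            # No cluster in the queue matched: start a new one at the front.
            queue.insert(0, {"members": [rec], "reps": [rec]})
            if len(queue) > queue_size:
                evicted.insert(0, queue.pop())
    return [c["members"] for c in queue + evicted]

clusters = detect_duplicates([1, 2, 3, 4, 5, 6, 7])
print(clusters)
```

Because the sketch scans clusters from the front of the queue and stops at the first match, it may evaluate fewer comparisons than the worked answer lists (for Record 4 it checks the cluster of 3 first), but it arrives at the same three clusters.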
