
Duplicate Detection


krista

Presentation Transcript


  1. Duplicate Detection

  2. Exercise 1. Use Extended Key to do Entity Identification[1]

  3. Table R and S as shown below: Table R Table S

  4. Suppose the extended key is {name, city, homeaddress} and the following ILFDs hold:
     • (E.HomeAddress = "Myskviksvägen 8") -> (E.City = "INGARÖ")
     • (E.HomeAddress = "Myrvägen 2") -> (E.City = "INGARÖ")
     • (E.HomeAddress = "Pilgatan 9") -> (E.City = "STOCKHOLM")
     • (E.HomeAddress = "Nyängsvägen 39A") -> (E.City = "TULLINGE")
     • Please construct the integrated table.
     -----------------------------------------------------
     [1] Ee-Peng Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson, "Entity Identification in Database Integration," Proceedings of the Ninth International Conference on Data Engineering, pp. 294-301, April 19-23, 1993
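A minimal Python sketch of how the ILFDs above could be applied before matching on the extended key. The contents of Table R and Table S are not available here, so the rows in the usage note are hypothetical, as are the helper names `apply_ilfds`, `extended_key`, and `same_entity`. Treating a missing key attribute as "cannot match" is an assumption of this sketch, not something the slides state.

```python
# ILFDs from the exercise: a known HomeAddress determines the City.
ILFDS = {
    "Myskviksvägen 8": "INGARÖ",
    "Myrvägen 2": "INGARÖ",
    "Pilgatan 9": "STOCKHOLM",
    "Nyängsvägen 39A": "TULLINGE",
}

EXTENDED_KEY = ("name", "city", "homeaddress")


def apply_ilfds(row):
    """Return a copy of row with a missing city derived from homeaddress, if an ILFD applies."""
    completed = dict(row)
    if completed.get("city") is None:
        completed["city"] = ILFDS.get(completed.get("homeaddress"))
    return completed


def extended_key(row):
    """Return the extended-key tuple, or None if any key attribute is still unknown."""
    completed = apply_ilfds(row)
    values = tuple(completed.get(attr) for attr in EXTENDED_KEY)
    return None if None in values else values


def same_entity(r_row, s_row):
    """Two rows identify the same entity when their complete extended keys are equal."""
    key_r, key_s = extended_key(r_row), extended_key(s_row)
    return key_r is not None and key_r == key_s
```

For example, with the hypothetical rows `{"name": "Anna", "homeaddress": "Pilgatan 9"}` and `{"name": "Anna", "city": "STOCKHOLM", "homeaddress": "Pilgatan 9"}`, the third ILFD fills in the missing city of the first row, so `same_entity` treats both rows as the same entity.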

  5. Answer Exercise • Integrated Table

  6. Exercise 2. Use Priority Queue to do Duplicate Detection[2]

  7. Table R, which is already sorted by an application-specific key: Similarities between tuples • Given the conditions below, use the Priority Queue algorithm to find the duplicate clusters in Table R.

  8. Method for computing the matching score: given a cluster, the matching score of a tuple is the average of the tuple's similarities with all of the cluster's representatives.
     • Condition for declaring a new cluster: matching score < 0.5
     • Condition for declaring a representative: 0.5 < matching score < 0.8
     • Size of the Priority Queue: 2
     -----------------------------------------------------
     [2] A.E. Monge and C.P. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records," Proc. ACM-SIGMOD Workshop Research Issues on Knowledge Discovery and Data Mining, 1997
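The scoring rule and the two thresholds can be sketched in Python as follows. The function names `matching_score` and `decide` are illustrative only. The slide does not say what happens when the score is exactly 0.5, so this sketch assumes such a tuple joins the cluster as a representative.

```python
def matching_score(similarities):
    """Average of a tuple's similarities with all representatives of one cluster."""
    return sum(similarities) / len(similarities)


def decide(score, new_cluster_below=0.5, representative_below=0.8):
    """Classify a tuple against one cluster using the thresholds of the exercise."""
    if score < new_cluster_below:
        return "no match"  # if no cluster in the queue matches, start a new cluster
    if score < representative_below:
        return "join as representative"
    return "join as member only"  # very similar: adds nothing new as a representative
```

For instance, a tuple whose similarities with a cluster's two representatives are 0.1 and 0.2 scores about 0.15 and does not match that cluster, while a single similarity of 0.6 makes the tuple a new representative.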

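  9. Answer
     • Record 1: Queue {1}
     • Record 2: 2:1 = 0.6, which is > 0.5 and < 0.8, so record 2 joins the cluster as a representative. Queue {1, 2}
     • Record 3: 3:1 = 0.1, 3:2 = 0.2, matching score = (0.1 + 0.2) / 2 = 0.15 < 0.5, so record 3 starts a new cluster. Queue {3} {1, 2}
     • Record 4: 4:1 = 0.3, 4:2 = 0.4, matching score = (0.3 + 0.4) / 2 = 0.35 < 0.5. 4:3 = 0.9, which is > 0.5 and > 0.8, so record 4 joins the cluster of record 3 but is not a representative. Queue {3, 4} {1, 2}
     • Record 5: 5:1 = 0.5, 5:2 = 0.4, matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5. 5:3 = 0.4, matching score = 0.4 < 0.5, so record 5 starts a new cluster. Queue {5} {3, 4} {1, 2}
     • Record 6: 6:3 = 0.6, matching score = 0.6, which is > 0.5 and < 0.8, so record 6 joins the cluster of record 3 as a representative. 6:5 = 0.4 < 0.5. Queue {3, 4, 6} {5} {1, 2}
     • Record 7: 7:3 = 0.5, 7:6 = 0.4, matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5. 7:5 = 0.8 > 0.5, so record 7 joins the cluster of record 5. Queue {5, 7} {3, 4, 6} {1, 2}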
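The trace above can be replayed with a short Python sketch. This is an illustration under stated assumptions, not the reference algorithm of [2]: the `SIM` dictionary holds only the similarities quoted in the answer, clusters are scanned in queue order and the first one scoring at least 0.5 is taken, a matched cluster moves to the front of the queue, and a cluster evicted from the size-2 queue is still kept in the result (which appears to be how the answer keeps listing {1, 2}). The names `SIM` and `find_clusters` are hypothetical.

```python
# Similarities quoted in the answer, keyed as (later record, earlier record).
SIM = {
    (2, 1): 0.6,
    (3, 1): 0.1, (3, 2): 0.2,
    (4, 1): 0.3, (4, 2): 0.4, (4, 3): 0.9,
    (5, 1): 0.5, (5, 2): 0.4, (5, 3): 0.4,
    (6, 3): 0.6, (6, 5): 0.4,
    (7, 3): 0.5, (7, 5): 0.8, (7, 6): 0.4,
}


def find_clusters(records, sim, queue_size=2, join_at=0.5, rep_below=0.8):
    """Cluster sorted records with a bounded, most-recently-used-first queue of clusters."""
    all_clusters = []  # every cluster ever created, in creation order
    queue = []         # clusters still eligible for comparison, front = highest priority
    for rec in records:
        for i, cluster in enumerate(queue):
            reps = cluster["reps"]
            # Matching score: average similarity with the cluster's representatives.
            score = sum(sim[(rec, r)] for r in reps) / len(reps)
            if score >= join_at:
                cluster["members"].append(rec)
                if score < rep_below:
                    cluster["reps"].append(rec)
                queue.insert(0, queue.pop(i))  # matched cluster moves to the front
                break
        else:
            # No cluster in the queue matched: start a new cluster at the front.
            cluster = {"members": [rec], "reps": [rec]}
            all_clusters.append(cluster)
            queue.insert(0, cluster)
            del queue[queue_size:]  # drop the lowest-priority cluster if the queue is full
    return [c["members"] for c in all_clusters]


print(find_clusters(range(1, 8), SIM))
```

Under these assumptions the sketch ends with the same three duplicate clusters as the answer: {1, 2}, {3, 4, 6}, and {5, 7}.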
