
Record Linkage in a Distributed Environment


zona

Presentation Transcript


  1. Record Linkage in a Distributed Environment Literature Review

  2. Contents • Record linkage • Runtime reduction techniques • Blocking • Canopies • Sorted Neighborhood • Shift to parallel computing • Research directions

  3. Record Linkage Problem • Determining if pairs of records refer to the same entity • E.g. Distinguishing between data belonging to… Yipeng, the NUS student and Yipeng, the son of PM Lee

  4. Record Linkage Applications • Dedup Two Lists: O(M*N) • Dedup Single List: O(N^2)
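The two costs on this slide can be checked by enumerating the candidate pairs directly. The sketch below is only an illustration; the function names are hypothetical and not taken from the presentation.

```python
from itertools import combinations, product

def pairs_two_lists(a, b):
    """All cross-list candidate pairs: len(a) * len(b) comparisons."""
    return list(product(a, b))

def pairs_single_list(records):
    """All unordered pairs within one list: n * (n - 1) / 2 comparisons."""
    return list(combinations(records, 2))

# Three records against two records, and four records against themselves.
cross = pairs_two_lists(["r1", "r2", "r3"], ["s1", "s2"])
within = pairs_single_list(["r1", "r2", "r3", "r4"])
print(len(cross), len(within))
```

Both counts grow quadratically when the lists grow together, which is what motivates the runtime reduction techniques on the following slides.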

  5. Dealing with Large Data • Pairwise comparison is increasingly expensive • Blocking techniques • Reduce the search space • (Figure: example records Amanda, Amanda, David, Daniel)
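A minimal sketch of blocking, using the four names from the slide. The blocking key (first letter of the name) is an assumption made for illustration; the slide does not say which key is used.

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(records, key):
    """Group records into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    return blocks

def candidate_pairs(blocks):
    """Only records inside the same block are compared with each other."""
    for members in blocks.values():
        yield from combinations(members, 2)

names = ["Amanda", "Amanda", "David", "Daniel"]
# Assumed blocking key: the first letter of the name.
blocks = block_by_key(names, key=lambda name: name[0])
pairs = list(candidate_pairs(blocks))
print(pairs)
```

With this key the search space shrinks from six pairs (all pairs of four records) to two, at the risk of missing matches whose keys differ.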

  6. Canopies
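The slide gives only the title, so the following is a sketch of the commonly described canopy method rather than the presenter's own formulation: a cheap distance and two thresholds (loose and tight) produce overlapping groups, and expensive comparisons are then restricted to records sharing a canopy. The bigram Jaccard distance, the threshold values, and the choice to always take the first remaining record as the center are all assumptions.

```python
def bigrams(text):
    """Set of character bigrams of a lowercased string."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}

def cheap_distance(a, b):
    """Assumed cheap distance: 1 - Jaccard similarity of bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 1.0 - len(x & y) / len(x | y)

def canopies(records, loose, tight, distance=cheap_distance):
    """Return canopies as lists of record indices; canopies may overlap."""
    assert 0 <= tight <= loose
    remaining = list(range(len(records)))
    result = []
    while remaining:
        center = remaining[0]  # deterministic pick; a random pick is also common
        # This sketch measures the center against every record, not only the remaining ones.
        dists = {i: distance(records[center], records[i])
                 for i in range(len(records))}
        # Everything within the loose threshold joins this canopy.
        result.append([i for i, d in dists.items() if d <= loose])
        # Everything within the tight threshold can no longer become a center.
        remaining = [i for i in remaining if dists[i] > tight]
    return result

names = ["Amanda", "Amanda", "David", "Daniel"]
groups = canopies(names, loose=0.8, tight=0.3)
print(groups)
```

Unlike blocking, a record may fall into several canopies, so a poorly chosen key does not necessarily hide a true match.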

  7. Sorted Neighborhood • Comparison Window: 2w−1
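A minimal sketch of the sorted neighborhood method, assuming a sliding window of w records over the sorted list. Under that assumption each record is compared with up to w−1 records before it and w−1 after it, which is one possible reading of the 2w−1 figure on the slide. The sort key (lowercased name) is also an assumption.

```python
def sorted_neighborhood_pairs(records, key, w):
    """Sort by key, then pair each record with the next w - 1 records."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, record in enumerate(ordered):
        # Neighbors to the right; earlier iterations covered the left side.
        for other in ordered[i + 1:i + w]:
            pairs.append((record, other))
    return pairs

names = ["David", "Amanda", "Daniel", "Amanda"]
pairs = sorted_neighborhood_pairs(names, key=str.lower, w=2)
print(pairs)
```

The number of comparisons is roughly N * (w − 1) after an O(N log N) sort, so the window size trades recall against runtime.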

  8. Dealing with Large Data • Pairwise comparison is increasingly expensive • Blocking techniques • Reduce the search space • Limitations • Single node computation • Localized data source • Conflicting in function • (Figure: example records Amanda, Amanda, David, Daniel)

  9. Shift to Parallel Computing • Multi-node computation • Data source flexibility • Complementary to blocking methods • Frontrunners: • P-Febrl (P Christen 2003) • P-Swoosh (H Kawai 2006) • Parallel Linkage (H Kim 2007)

  10. Parallel Record Linkage Contributions • Peter Christen • Parallelized Febrl with MPI • Linear speedup, but did not scale up well • Hideki Kawai • Designed P-Swoosh in a simulated environment • Match-based parallelism • 2x speedup with use of domain knowledge

  11. Parallel Record Linkage Contributions • Hung-sik Kim, Dongwon Lee • Explored parallel record linkage for different input cases in MATLAB • Consistent Speedup • Not validated with very large datasets

  12. MapReduce and Hadoop • Handles system-level concerns… • E.g. data distribution, fault tolerance, dynamic load balancing, portability and scalability • Convenient model for scaling record linkage • Better scaleup on pairwise comparisons (T Elsayed 2008) • Runtime increased linearly with dataset size (R Vernica 2010)
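To show why the model is convenient for record linkage, the sketch below simulates the map, shuffle and reduce steps in plain Python. It does not use the Hadoop API; the function names and the first-letter blocking key are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(record):
    """Emit (blocking_key, record); the first letter is an assumed key."""
    yield record[0].lower(), record

def shuffle(mapped):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(key, records):
    """Emit every candidate pair within one block."""
    for a, b in combinations(records, 2):
        yield key, (a, b)

def run(records):
    mapped = (kv for record in records for kv in map_phase(record))
    output = []
    for key, group in shuffle(mapped).items():
        output.extend(reduce_phase(key, group))
    return output

result = run(["Amanda", "Amanda", "David", "Daniel"])
print(result)
```

Each block is reduced independently of the others, so in a real cluster the reduce calls could run on different nodes while the framework handles distribution and fault tolerance.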

  13. Research Directions • Tailoring Hadoop for record linkage problems • E.g. Bin packing blocks of different sizes • Experimenting with different problem types • E.g. Bipartite data centers • Adapting existing parallel clustering algorithms to the MapReduce model
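One possible reading of "bin packing blocks of different sizes" is balancing blocks across workers by their comparison cost. The sketch below uses a greedy largest-first heuristic, which is a swapped-in technique chosen for brevity and not something the slide specifies; the block sizes and worker count are made-up inputs.

```python
import heapq

def comparisons(n):
    """Pairwise comparisons needed inside a block of n records."""
    return n * (n - 1) // 2

def assign_blocks(block_sizes, workers):
    """Greedy heuristic: give the costliest block to the least loaded worker."""
    heap = [(0, w) for w in range(workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(workers)}
    by_cost = sorted(block_sizes.items(),
                     key=lambda item: comparisons(item[1]),
                     reverse=True)
    for name, size in by_cost:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(name)
        heapq.heappush(heap, (load + comparisons(size), worker))
    return assignment

# Hypothetical block sizes (number of records per block).
plan = assign_blocks({"a": 100, "b": 60, "c": 50, "d": 40}, workers=2)
print(plan)
```

Because cost grows quadratically with block size, one large block can dominate a worker's load, which is why block size skew matters more here than in typical MapReduce jobs.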

  14. Conclusions • Parallelism is a step in the right direction • Complementary to existing approaches • Consistent with the object orientation • But… • Parallel design and implementation is difficult • MapReduce is a viable solution
