Discover duplicates within or between files, addressing lack of unique identifiers and variations in spelling. Explore deterministic and approximate matching methods to balance recall and effectiveness. Evaluate accuracy with anonymous record linkage assessments.
(De-Identified) Record Linkage. Dongqiuye Pu, Ashraf Farrag, Javed Mostafa
Background • Identify duplicates within a file or across files • AKA: object identification, data cleaning, entity resolution, etc.
Motivation • Lack of unique identifiers • Spelling variations, misspellings, typos…
For Instance… (A) (B)
Methods in a Nutshell • Deterministic matching: straightforward, no human review needed, but suffers from low recall • Approximate matching: harder to implement and needs human review, but achieves higher recall
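The recall gap between the two methods can be illustrated with a small sketch. Everything here is an assumption made for illustration: the name pairs are made up, `normalize`, `exact`, `fuzzy`, and `recall` are hypothetical helpers, and the similarity score comes from Python's standard `difflib.SequenceMatcher` with an arbitrary 0.85 threshold, not from any method named in the slides.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(name.lower().split())


def exact(a: str, b: str) -> bool:
    """Deterministic matching: the normalized strings must be identical."""
    return normalize(a) == normalize(b)


def fuzzy(a: str, b: str, threshold: float = 0.85) -> bool:
    """Approximate matching: similarity ratio must reach the (assumed) threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def recall(matcher, true_pairs) -> float:
    """Fraction of known true matches that the matcher recovers."""
    found = sum(1 for a, b in true_pairs if matcher(a, b))
    return found / len(true_pairs)


# Made-up pairs that are all assumed to refer to the same person.
TRUE_PAIRS = [
    ("John Smith", "john smith"),
    ("John Smith", "Jon Smith"),
    ("Katherine Jones", "Catherine Jones"),
    ("Alice Wong", "Alice  Wong"),
]

exact_recall = recall(exact, TRUE_PAIRS)
fuzzy_recall = recall(fuzzy, TRUE_PAIRS)
```

On these toy pairs the deterministic matcher misses the two spelling variants, while the approximate matcher recovers them. A looser threshold also admits false matches, which is why approximate matching is usually paired with human review.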
Research Plan • Exact matching first • Fuzzy matching for the records that remain unmatched
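A two-pass plan of this kind could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: records are reduced to a single name string, `link_records` and `normalize` are hypothetical names, links are one-to-one, and the fuzzy pass uses `difflib.SequenceMatcher` with an assumed 0.85 threshold.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(name.lower().split())


def link_records(file_a, file_b, threshold=0.85):
    """Return (index_in_a, index_in_b, method) links between two lists of names."""
    links = []
    unmatched_b = {j: normalize(name) for j, name in enumerate(file_b)}
    leftovers_a = []

    # Pass 1: exact matching on the normalized name.
    for i, name in enumerate(file_a):
        key = normalize(name)
        hit = next((j for j, other in unmatched_b.items() if other == key), None)
        if hit is None:
            leftovers_a.append((i, key))
        else:
            links.append((i, hit, "exact"))
            del unmatched_b[hit]

    # Pass 2: fuzzy matching, only for records that pass 1 left unmatched.
    for i, key in leftovers_a:
        best_j, best_score = None, 0.0
        for j, other in unmatched_b.items():
            score = SequenceMatcher(None, key, other).ratio()
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= threshold:
            links.append((i, best_j, "fuzzy"))
            del unmatched_b[best_j]

    return links


# Made-up example files (illustration only).
file_a = ["John Smith", "Katherine Jones", "Alice Wong"]
file_b = ["alice  wong", "Catherine Jones", "john smith", "Bob Lee"]
links = link_records(file_a, file_b)
```

Running the exact pass first keeps the more expensive pairwise comparison limited to the leftovers. Links tagged `"fuzzy"` are the natural candidates for human review.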
Evaluating Accuracy of Anonymous Record Linkage • Evaluate the collision rate of the hashing algorithm (most likely zero)
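One way to measure a collision rate is to hash every distinct identifier and compare the number of distinct inputs with the number of distinct digests. The sketch below assumes SHA-256 from Python's `hashlib` and an optional salt; the slides do not name the hashing algorithm, and `hash_identifier` and `collision_rate` are hypothetical helper names.

```python
import hashlib


def hash_identifier(value: str, salt: str = "") -> str:
    """De-identify a value with SHA-256 (an assumed choice of algorithm)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def collision_rate(values, hash_fn=hash_identifier) -> float:
    """Share of distinct inputs lost because they hash to an already-seen digest."""
    distinct_inputs = set(values)
    if not distinct_inputs:
        return 0.0
    distinct_digests = {hash_fn(v) for v in distinct_inputs}
    return (len(distinct_inputs) - len(distinct_digests)) / len(distinct_inputs)


# Made-up identifiers (illustration only); repeated inputs are not collisions.
sample = ["john smith|1980-01-02", "jon smith|1980-01-02", "john smith|1980-01-02"]
rate = collision_rate(sample)

# A deliberately weak hash (first character only) shows a non-zero rate.
weak_rate = collision_rate(["ann", "amy", "bob"], hash_fn=lambda v: v[0])
```

With a cryptographic hash such as SHA-256, the measured rate on any realistic dataset is expected to be zero, consistent with the slide's expectation. The weak hash is included only to show that the metric does detect collisions when they occur.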