1 / 44

Deterministic Record Linking

Deterministic Record Linking. University of North Carolina, Chapel Hill Hye-Chung Kum. Example. Exact Match. Approximate Matching I : SSN. Approximate Matching II : DOB. Approximate Matching III : Name. Deterministic Record Linking. Allow for approximate matching

neith
Download Presentation

Deterministic Record Linking

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Deterministic Record Linking University of North Carolina, Chapel Hill Hye-Chung Kum

  2. Example

  3. Exact Match

  4. Approximate Matching I : SSN

  5. Approximate Matching II : DOB

  6. Approximate Matching III : Name

  7. Deterministic Record Linking • Allow for approximate matching • Use explicit approximate rules • Pros : can control the linkage process • Con: difficult to implement • Alternative : Probabilistic record linking • Also approximate matching • However, uses general rules specified by users • Based on total probability • Con: can not control exactly what to consider a match or not • Pros: can use specialized software

  8. Approximate Matching : DOB • element to element match : date, month, year • Allow for one element difference • Allow for month and day transposed

  9. Approximate Matching : Name • First name soundex match • First name is approx • one letter different • insert or replace • and/or substr • lsound equal • or lname approx • MI=FI • FI equal • Fsound & Lsound swapped

  10. Approximate Matching : Name

  11. Match on ssn (ssn equal) • 1 : dob, fsound equal • dob approx • 2 : dob approx, fsound equal • 3 : dob approx, fname approx • 4 : dob approx, lsound equal, & fsound diff, but MI=FI • 5 : dob approx, lsound equal, & fsound diff, but FI equal • 6 : dob approx, lsound and fsound swapped • 7 : dob approx, lname approx & fsound diff • but MI=FI (4 with lname approx rather than equal) • or FI equal (5 with lname approx rather than equal) • dob mismatch • 8 : fname approx, lsound equal, and dob diff • 9 : fname approx, lsound approx, and dob diff

  12. Match on ssn (ssn equal) • 1 : dob, fsound equal • dob approx • 2 : dob approx, fsound equal • 3 : dob approx, fname approx • 4 : dob approx, lsound equal, & fsound diff, but MI=FI • 5 : dob approx, lsound equal, & fsound diff, but FI equal • 6 : dob approx, lsound and fsound swapped • 7 : dob approx, lname approx & fsound diff • but MI=FI (4 with lname approx rather than equal) • or FI equal (5 with lname approx rather than equal) • dob diff • 8 : fname approx, lsound equal, and dob diff • 9 : fname approx, lsound approx, and dob diff

  13. Approximate Matching : SSN • Digit to digit match • Allow for one digit difference • Allow for two digit difference if transposed

  14. ssn missing 1: lname equal 2: lname approx ssn approx 3: lname equal 4: lname approx 5: lname diff but fname equal ssn different 11 : lname equal 12 : lname approx lname different 51: ssn approx 52: ssn missing Match on ndob (dob+fsound)

  15. ssn missing 1: lname equal 2: lname approx ssn approx 3: lname equal 4: lname approx 5: lname diff but fname equal ssn different 11 : lname equal 12 : lname approx lname different 51: ssn approx 52: ssn missing Match on ndob (dob+fsound)

  16. Match on name (fname+lname) • ssn missing & dob approx • 1: MI equal • 7: MI missing • 8: MI not equal • ssn approx • 3: dob equal • dob approx • 4: one element • 5: transpose

  17. Match on name (fname+lname) • ssn missing & dob approx • 1: MI equal • 7: MI missing • 8: MI not equal • ssn approx • 3: dob equal • dob approx • 4: one element • 5: transpose

  18. Match on name (fname+lname)

  19. Match on name (fname+lname)

  20. link • Put together all links found • Identify indirect duplicates (type2>10000) • i.e. both EISID1 & EISID2 link to identical SISID1 • Consider indirect duplicates on both EIS & SIS • Create unique link and indirect duplicate files • Keep only the first id in data file link • Create indirect duplicates files • dupeis2 & dupsis2 • TODO : explore indirect duplicates

  21. Create unique list of EIS & SIS • Generate unique full list of each set of ids • use linkage info • Link in the duplicates (dupeis & dupsis) • TODO : link in the indirect duplicates • eis & sis

  22. 1,888,747 4,308,863 eisid.sas7bdat sisid.sas7bdat 31,461 dupeis.sas7bdat duplicates unduplicated unique records dupsis.sas7bdat 250,635 1,638,112 87% 4,277,402 99% 27% 1,173,404 72% link.sas7bdat 1270 dupeis2.sas7bdat eis.sas7bdat sis.sas7bdat 493 dupsis2.sas7bdat 4,308,863 28% 1,888,747 74% ueis.sas7bdat usis.sas7bdat Link eis to sis Data flow

  23. Type of links

  24. Type of duplicates and links

  25. Number of Duplicates

  26. Implementation details • Ndob & name must be looped • multiple matches • Too many match on name • use half of ssn • Overlap for transpose

  27. Basic Process • Unduplicate EIS (dupeis) • Unduplicate SIS (dupsis) • Link unduplicated EIS & SIS (link) • Generate unique full list of each set of ids (list) • use linkage info • Link in the duplicates • eis & sis

  28. Unduplication • Same as matching between different system • Except, match the database to itself • i.e. EIS to EIS, SIS to SIS • Randomly select one as Primary • TODO: for those not linked using primary ID, try with duplicate ID • TODO: explore indirect duplicate links

  29. Conclusion • Future work : • indirect duplicates • Link using duplicates • SSN have been changed from real data

  30. Thank You !

  31. Type of id • first letter: • P : primary id with duplicates • D : duplicates (primary info given with prefix ‘l’) • X : no duplicates • second letter: link status • L: linked • X: no linked id • third letter: duplicates status of the linked id • D: duplicates exist for the linked id • X: no duplicates for the linked id

  32. EIS & SIS Table • Unique full is of EIS (or SIS) ids • Type : type of id (XXX) – see next slide • All eis info have no prefix • All sis info have prefix ‘k’ • Prefix ‘l’ is the link id info • freqeis & freqsis : # of duplicate ids • Pindid (eis) & pkindid (sis) is the primary id • indid1-indid3 & kindid1-kindid8

  33. Link type • sdiff : # digits different in ssn • -1 : one or both ssn is missing • 2 : two digits are transposed • 10 : two digits are different but not transposed • ddiff : diff in dob • -1 : one or both dob is missing • 2 : date and month is transposed • 3 : date, month and year are different • 4 : date and month are different • Fdiff (ldiff) : difference in first (last) name • -1 : one or both are missing • 1 : one letter difference (INDEL or REPL) • 100 : one is a substring of the other • 101 : one letter diff & substring

  34. Duplicate type • If duplicate id • Primary id info is given with prefix “l” • Duplicate type • Lsdiff, lddiff, lfdiff, & lldiff • If primary id • # of duplicates : freqeis & freqsis • Duplicate ids • Indid1-indid3 (eis) & kindid1-kindid8 (sis)

  35. Other tables • Link • Linkage between the primary eis & sis ids • dupeis & dupsis • List of duplicates with primary id

  36. Data flow • eisid: 4,308,863 • ueis (4,277,402)+dupeis (31,461) : 99% • sisid: 1,888,747 • usis (1,638,112)+dupsis (250,635) : 87% • Link : 1,173,404 (eis: 27%, sis: 72%) • dupeis2 (1,270) + dupsis2(493) • EIS: 4,308,863 (28%) • SIS: 1,888,747 (74%)

More Related