Deterministic Record Linking

Hye-Chung Kum, University of North Carolina, Chapel Hill

Example

Exact Match

Approximate Matching I : SSN

Approximate Matching II : DOB

Approximate Matching III : Name

Deterministic Record Linking
• Allow for approximate matching
• Use explicit approximate rules
• Pro: can control the linkage process
• Con: difficult to implement
• Alternative: probabilistic record linking
  • Also allows approximate matching
  • However, uses general rules specified by users
  • Based on total probability
  • Con: cannot control exactly what is considered a match
  • Pro: can use specialized software

Approximate Matching : DOB
• Element-to-element match: date, month, year
• Allow for one element difference
• Allow for month and day transposed
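
The DOB rule above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function name `dob_approx` is hypothetical, and it assumes that an exact match also counts as "approx" and that a transposition requires the year to agree.

```python
from datetime import date


def dob_approx(a: date, b: date) -> bool:
    """Return True when two dates of birth agree under the slide's DOB rule."""
    # Compare element by element: year, month, day.
    mismatches = sum(
        x != y
        for x, y in zip((a.year, a.month, a.day), (b.year, b.month, b.day))
    )
    if mismatches <= 1:
        # Zero mismatches is an exact match (assumed to count as approx);
        # one mismatch is the allowed single-element difference.
        return True
    # Month and day transposed, e.g. 1980-03-12 recorded as 1980-12-03.
    # Assumption: the year must agree for this to count.
    return a.year == b.year and a.month == b.day and a.day == b.month
```

For example, `dob_approx(date(1980, 3, 12), date(1980, 12, 3))` is accepted as a transposition, while two dates that differ in both year and month are rejected.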

Approximate Matching : Name
• First name soundex match
• First name is approx
  • one letter different
    • insert or replace
  • and/or substr
• lsound equal
  • or lname approx
• MI=FI
• FI equal
• Fsound & Lsound swapped
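
The Soundex-based parts of the name rule (fsound, lsound, and the swapped check) can be sketched as below. This assumes standard American Soundex truncated to four characters; the SAS SOUNDEX function may produce different codes, and the helper names `soundex` and `sounds_swapped` are hypothetical.

```python
# Standard American Soundex letter groups (assumed; SAS SOUNDEX may differ).
_CODES = {
    ch: digit
    for letters, digit in (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    )
    for ch in letters
}


def soundex(name: str) -> str:
    """Four-character Soundex code, or '' when the name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # vowels reset the previous code to ''
    return ("".join(out) + "000")[:4]


def sounds_swapped(fname1: str, lname1: str, fname2: str, lname2: str) -> bool:
    """True when the first-name sound of one record equals the last-name
    sound of the other, and vice versa (the 'Fsound & Lsound swapped' rule)."""
    f1, l1, f2, l2 = map(soundex, (fname1, lname1, fname2, lname2))
    return bool(f1 and l1) and f1 == l2 and l1 == f2
```

With these helpers, "First name soundex match" is simply `soundex(fname1) == soundex(fname2)`.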

Match on ssn (ssn equal)
• 1 : dob, fsound equal
• dob approx
  • 2 : dob approx, fsound equal
  • 3 : dob approx, fname approx
  • 4 : dob approx, lsound equal, & fsound diff, but MI=FI
  • 5 : dob approx, lsound equal, & fsound diff, but FI equal
  • 6 : dob approx, lsound and fsound swapped
  • 7 : dob approx, lname approx & fsound diff
    • but MI=FI (4 with lname approx rather than equal)
    • or FI equal (5 with lname approx rather than equal)
• dob mismatch
  • 8 : fname approx, lsound equal, and dob diff
  • 9 : fname approx, lsound approx, and dob diff
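
The ordered rules above can be read as a cascade: the first rule whose conditions hold decides the match type. The sketch below illustrates that reading. The `Comparison` record and `ssn_match_type` function are hypothetical names, the individual comparison flags are assumed to be computed elsewhere, an exact DOB match is assumed to also satisfy "dob approx", and "lsound approx" in rule 9 is treated as "lname approx".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comparison:
    """Pre-computed comparison flags for one candidate pair with equal SSN."""
    dob_equal: bool = False
    dob_approx: bool = False
    fsound_equal: bool = False
    fname_approx: bool = False
    lsound_equal: bool = False
    lname_approx: bool = False
    mi_eq_fi: bool = False       # middle initial of one equals first initial of the other
    fi_equal: bool = False       # first initials are equal
    sounds_swapped: bool = False


def ssn_match_type(c: Comparison) -> Optional[int]:
    """Return the match type (1-9) from the slide, or None if no rule applies."""
    if c.dob_equal and c.fsound_equal:
        return 1
    if c.dob_equal or c.dob_approx:  # assumption: equal also counts as approx
        if c.fsound_equal:
            return 2
        if c.fname_approx:
            return 3
        if c.lsound_equal and c.mi_eq_fi:
            return 4
        if c.lsound_equal and c.fi_equal:
            return 5
        if c.sounds_swapped:
            return 6
        if c.lname_approx and (c.mi_eq_fi or c.fi_equal):
            return 7
        return None
    # DOB differs: only the name-based rules remain.
    if c.fname_approx and c.lsound_equal:
        return 8
    if c.fname_approx and c.lname_approx:
        return 9
    return None
```

Keeping the rules in one ordered function mirrors the main advantage the deck claims for deterministic linking: each accepted pair can be traced to one explicit rule.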

Approximate Matching : SSN
• Digit-to-digit match
• Allow for one digit difference
• Allow for two digit difference if transposed
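
A minimal Python sketch of the SSN rule above, under stated assumptions: SSNs are nine-digit strings, an exact match also counts as "approx", and "transposed" means the two differing positions hold each other's digits (the slide does not say whether they must be adjacent). The function name `ssn_approx` is hypothetical and the SSNs in the usage note are made up.

```python
def ssn_approx(a: str, b: str) -> bool:
    """Return True when two 9-digit SSNs agree under the slide's SSN rule."""
    if len(a) != 9 or len(b) != 9 or not (a.isdigit() and b.isdigit()):
        return False  # missing or malformed SSNs are not handled here
    # Positions where the digits disagree.
    diff = [i for i in range(9) if a[i] != b[i]]
    if len(diff) <= 1:
        return True  # exact match or a single-digit difference
    if len(diff) == 2:
        i, j = diff
        # Two differences are allowed only if the digits are swapped.
        return a[i] == b[j] and a[j] == b[i]
    return False
```

For instance, `ssn_approx("123456789", "123465789")` is accepted as a transposition, while `ssn_approx("123456789", "123400789")` is rejected.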

Match on ndob (dob+fsound)
• ssn missing
  • 1 : lname equal
  • 2 : lname approx
• ssn approx
  • 3 : lname equal
  • 4 : lname approx
  • 5 : lname diff but fname equal
• ssn different
  • 11 : lname equal
  • 12 : lname approx
• lname different
  • 51 : ssn approx
  • 52 : ssn missing

Match on name (fname+lname)
• ssn missing & dob approx
  • 1 : MI equal
  • 7 : MI missing
  • 8 : MI not equal
• ssn approx
  • 3 : dob equal
  • dob approx
    • 4 : one element
    • 5 : transpose

link
• Put together all links found
• Identify indirect duplicates (type2>10000)
  • i.e. both EISID1 & EISID2 link to identical SISID1
  • Consider indirect duplicates on both EIS & SIS
• Create unique link and indirect duplicate files
  • Keep only the first id in data file link
  • Create indirect duplicates files
    • dupeis2 & dupsis2
• TODO : explore indirect duplicates
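
The link step above can be illustrated with a small Python sketch. It assumes links arrive as `(eis_id, sis_id)` pairs in priority order and that "keep only the first id" means keeping the first link seen for each id. The function name `split_links` and the shape of its outputs are hypothetical, loosely mirroring the link, dupeis2, and dupsis2 files.

```python
from collections import defaultdict


def split_links(links):
    """Split raw links into unique links and indirect duplicates.

    Returns (unique, dup_eis, dup_sis) where
      unique  : first link kept for each EIS id and each SIS id
      dup_eis : SIS id -> EIS ids that all link to it (indirect EIS duplicates)
      dup_sis : EIS id -> SIS ids that all link to it (indirect SIS duplicates)
    """
    links = list(links)
    by_sis, by_eis = defaultdict(list), defaultdict(list)
    for eis_id, sis_id in links:
        if eis_id not in by_sis[sis_id]:
            by_sis[sis_id].append(eis_id)
        if sis_id not in by_eis[eis_id]:
            by_eis[eis_id].append(sis_id)

    # More than one id on the other side signals an indirect duplicate.
    dup_eis = {sis: ids for sis, ids in by_sis.items() if len(ids) > 1}
    dup_sis = {eis: ids for eis, ids in by_eis.items() if len(ids) > 1}

    unique, seen_eis, seen_sis = [], set(), set()
    for eis_id, sis_id in links:
        if eis_id in seen_eis or sis_id in seen_sis:
            continue  # a link for this id was already kept
        unique.append((eis_id, sis_id))
        seen_eis.add(eis_id)
        seen_sis.add(sis_id)
    return unique, dup_eis, dup_sis
```

Given `[("E1", "S1"), ("E2", "S1"), ("E3", "S2"), ("E3", "S3")]`, E1 and E2 are reported as indirect EIS duplicates because both link to S1, and S2 and S3 as indirect SIS duplicates because both link to E3.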

Create unique list of EIS & SIS
• Generate unique full list of each set of ids
  • use linkage info
  • Link in the duplicates (dupeis & dupsis)
  • TODO : link in the indirect duplicates
• eis & sis

Link eis to sis : Data flow
[Diagram: eisid.sas7bdat and sisid.sas7bdat are unduplicated into ueis.sas7bdat and usis.sas7bdat (duplicates in dupeis.sas7bdat and dupsis.sas7bdat), linked in link.sas7bdat (indirect duplicates in dupeis2.sas7bdat and dupsis2.sas7bdat), and expanded into eis.sas7bdat and sis.sas7bdat. The record counts and percentages are listed on the Data flow slide.]

Type of links

Type of duplicates and links

Number of Duplicates

Implementation details
• Ndob & name must be looped
  • multiple matches
• Too many matches on name
  • use half of ssn
  • Overlap for transpose

Basic Process
• Unduplicate EIS (dupeis)
• Unduplicate SIS (dupsis)
• Link unduplicated EIS & SIS (link)
• Generate unique full list of each set of ids (list)
  • use linkage info
  • Link in the duplicates
• eis & sis

Unduplication
• Same as matching between different systems
• Except, match the database to itself
  • i.e. EIS to EIS, SIS to SIS
• Randomly select one as Primary
• TODO: for those not linked using primary ID, try with duplicate ID
• TODO: explore indirect duplicate links
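
As an illustration of the unduplication step, the sketch below groups ids that a self-match paired together and picks one member of each group at random as the primary. It is only one possible reading: the function name `pick_primaries` is hypothetical, the fixed seed exists only to make the example repeatable, and the transitive grouping via union-find is an assumption (the deck lists indirect duplicate links as future work).

```python
import random


def pick_primaries(pairs, seed=0):
    """Map every id found in self-match pairs to a randomly chosen primary id."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Assumption: duplicates are grouped transitively (A~B and B~C => A~C).
    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), []).append(x)

    rng = random.Random(seed)
    primary_of = {}
    for members in clusters.values():
        primary = rng.choice(sorted(members))  # "randomly select one as Primary"
        for m in members:
            primary_of[m] = primary
    return primary_of
```

Ids that never appear in a pair are not in the result; they correspond to records with no duplicates.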

Conclusion
• Future work :
  • indirect duplicates
  • Link using duplicates
• SSNs have been changed from the real data

Thank You !

Type of id
• first letter:
  • P : primary id with duplicates
  • D : duplicates (primary info given with prefix ‘l’)
  • X : no duplicates
• second letter: link status
  • L : linked
  • X : no linked id
• third letter: duplicates status of the linked id
  • D : duplicates exist for the linked id
  • X : no duplicates for the linked id
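
The three-letter type code can be decoded with a simple lookup, sketched below. The letter meanings come from the slide; the function name `describe_id_type` and the tuple it returns are hypothetical.

```python
FIRST = {
    "P": "primary id with duplicates",
    "D": "duplicate id",
    "X": "no duplicates",
}
SECOND = {"L": "linked", "X": "no linked id"}
THIRD = {
    "D": "duplicates exist for the linked id",
    "X": "no duplicates for the linked id",
}


def describe_id_type(code: str):
    """Decode a three-letter id type such as 'PLX' into three descriptions."""
    if len(code) != 3:
        raise ValueError(f"expected a three-letter code, got {code!r}")
    try:
        return FIRST[code[0]], SECOND[code[1]], THIRD[code[2]]
    except KeyError as exc:
        raise ValueError(f"unknown letter in id type {code!r}") from exc
```

For example, `describe_id_type("PLX")` describes a primary id with duplicates that is linked to an id with no duplicates.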

EIS & SIS Table
• Unique full list of EIS (or SIS) ids
• Type : type of id (XXX), see the Type of id slide
• All eis info have no prefix
• All sis info have prefix ‘k’
• Prefix ‘l’ is the link id info
• freqeis & freqsis : # of duplicate ids
• Pindid (eis) & pkindid (sis) is the primary id
• indid1-indid3 & kindid1-kindid8

Link type
• sdiff : # digits different in ssn
  • -1 : one or both ssn is missing
  • 2 : two digits are transposed
  • 10 : two digits are different but not transposed
• ddiff : diff in dob
  • -1 : one or both dob is missing
  • 2 : date and month are transposed
  • 3 : date, month and year are different
  • 4 : date and month are different
• Fdiff (ldiff) : difference in first (last) name
  • -1 : one or both are missing
  • 1 : one letter difference (INDEL or REPL)
  • 100 : one is a substring of the other
  • 101 : one letter diff & substring
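
The fdiff/ldiff codes above combine additively (1 for a one-letter difference, 100 for a substring, 101 for both), which the following Python sketch illustrates. The function names are hypothetical. Two values are assumptions not given on the slide: 0 for identical names, and `None` when neither approximate rule applies.

```python
from typing import Optional


def _one_letter_apart(a: str, b: str) -> bool:
    """True when a and b differ by exactly one insertion/deletion or replacement."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one replacement (REPL)
    return a[i:] == b[i + 1:]          # one insertion/deletion (INDEL)


def name_diff(a: Optional[str], b: Optional[str]) -> Optional[int]:
    """Return the fdiff/ldiff code for two names."""
    a = (a or "").strip().upper()
    b = (b or "").strip().upper()
    if not a or not b:
        return -1  # one or both are missing
    if a == b:
        return 0   # assumption: the slide gives no code for identical names
    code = 0
    if _one_letter_apart(a, b):
        code += 1
    if a in b or b in a:
        code += 100
    # Assumption: None when neither approximate rule applies.
    return code or None
```

For example, "Ann" and "Anne" yield 101 because "Ann" is both one letter away from and a substring of "Anne".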

Duplicate type
• If duplicate id
  • Primary id info is given with prefix “l”
  • Duplicate type
    • Lsdiff, lddiff, lfdiff, & lldiff
• If primary id
  • # of duplicates : freqeis & freqsis
  • Duplicate ids
    • Indid1-indid3 (eis) & kindid1-kindid8 (sis)

Other tables
• Link
  • Linkage between the primary eis & sis ids
• dupeis & dupsis
  • List of duplicates with primary id

Data flow
• eisid: 4,308,863
  • ueis (4,277,402) + dupeis (31,461) : 99%
• sisid: 1,888,747
  • usis (1,638,112) + dupsis (250,635) : 87%
• Link : 1,173,404 (eis: 27%, sis: 72%)
  • dupeis2 (1,270) + dupsis2 (493)
• EIS: 4,308,863 (28%)
• SIS: 1,888,747 (74%)
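
The counts on this slide can be checked with a few lines of arithmetic. The sketch below uses only the figures given above; the variable names and the `pct` helper are illustrative. It confirms that the unduplicated and duplicate counts add up to the input totals and reproduces the 99%, 87%, 27%, and 72% shares. The final 28% and 74% figures are not recomputed because their numerators are not given.

```python
eisid, sisid = 4_308_863, 1_888_747
ueis, dupeis = 4_277_402, 31_461
usis, dupsis = 1_638_112, 250_635
link = 1_173_404


def pct(part: int, whole: int) -> int:
    """Share of `whole` covered by `part`, rounded to a whole percent."""
    return round(100 * part / whole)


# Unduplicated records plus duplicates should equal the input totals.
checks = {
    "ueis + dupeis == eisid": ueis + dupeis == eisid,
    "usis + dupsis == sisid": usis + dupsis == sisid,
}

shares = {
    "unduplicated share of eisid": pct(ueis, eisid),
    "unduplicated share of sisid": pct(usis, sisid),
    "linked share of ueis": pct(link, ueis),
    "linked share of usis": pct(link, usis),
}

for label, ok in checks.items():
    print(f"{label}: {ok}")
for label, value in shares.items():
    print(f"{label}: {value}%")
```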
