Efficient Probabilistic Record Linkage Software Overview

Probabilistic Record Linkage Software Link Plus • Application Overview & User Training • Oklahoma City, Oklahoma • September 2007

CDC–NPCR Link Plus Contacts Kathleen K. Thoburn, CDC/NPCR Contractor E-mail: kthoburn@cdc.gov David Gu, CDC/NPCR Contractor E-mail: dgu@cdc.gov Tom Rawson, CDC Computer Programmer

Acknowledgements Rich Pinder Supervisor of Follow-UpLos Angeles Cancer Surveillance Program Melissa Jim, MPH Epidemiologist Centers for Disease Control Division of Cancer Prevention and Control

Training Outline • Brief Overview of Record Linkage • Central Cancer Registry Record Linkage • Deterministic Matching • Probabilistic Matching • Link Plus Software Overview • Link Plus Linkage Overview • Linkage Exercises • Open Discussion

Record Linkage Concepts

Overview of Record Linkage • Combine or merge together information describing the same individual from a variety of data sources • Merge information from individual’s record in 1st data source (file 1) with information from individual’s record in 2nd second data source (file 2) • Cancer information from cancer registry file, death information from vital statistics file • “Merge” aka “Record Linkage”

Overview of Record Linkage • Can be accomplished manually, by visually comparing records from two separate sources • Approach becomes time consuming, tedious, inefficient, and unpractical as the number of records in file 1 and file 2 increases • Technological advances in computer systems and programming techniques • Economically feasible to perform computerized record linkage between large files • Efficient and relatively accurate

Central Cancer Registry Record Linkage • Case Finding • Linking New Reports Consolidation • Duplicate Detection • Follow Up • Special Studies

Case Finding • Matching reports from • Pathology labs • Medical Records Disease Index • Treatment centers • No Match: tumor has not yet been reported • Request report of cancer from facility of diagnosis • Positive Match: tumor is already reported • New diagnostic/treatment information can be added to existing tumor record

Linking New Reports • Multiple notifications of the same cancer due to multiple reporting sources • Efficient record linkage procedures on same individual very important • Consolidation…Is this a • new person? • new tumor for an existing person? • new report for an existing person/tumor? • Failure in record linkage process results in missed cases and/or duplicate registrations • Leads to inaccurate counts and rates

Duplicate Detection • Fundamental requirement for accuracy and validity of counts in any disease registry • National Program of Cancer Registries/ North American Association of Central Cancer Registries standard • Maintain <= 0.1% (<=1 per 1,000) duplicates

Follow Up • Death Clearance – State vital statistic file • Hospital discharge data – Statewide file • Department of Motor Vehicles – Drivers' licenses and renewals • Social Security Death Master – SSA maintained file of death benefit claims • Medicare/Medicaid – Files of state enrollees • Voter Registration/Voter History - Statewide file of last 6 elections • National Change of Address (U.S. Postal Service) - File of individuals reporting change of address in last 3 years

Special Studies • Research questions often require linking external data against the registry • Allows hypothesis testing not available using other methods • Efficiency is a key feature • Faster, more efficient linkage process allows more linkages for less $$ and staff time • More research • Increased utilization of registry data

Deterministic Matching • Computerized comparison where EVERYTHING needs to match EXACTLY:

Deterministic Matching • Often slight variations exist in the data between the two files for the same variables: • Or variables are missing from one of the files: • These variations would prevent a match from being identified

Deterministic Matching • Describes an algorithm in which the correctnext step is PRE-defined (match/no match) • Good for production environments • Easily incorporated into existing data systems HOWEVER, • Will miss significant numbers of true matches • Will require enormous amount of manual review of results for missed matches

Deterministic MatchingManual Review • When we manually review, we use intuition to help us identify positive matches for records containing slight variations in, or missing information for, data between the two files for the same variables • Typo in SSN, transposition of digits in the day component of DOB, but would still deem a match

Probabilistic Matching • What do Humans know? • How can we translate intuition into formal decision rules to be used by a computer? • Use the concept of PROBABILITY and perform PROBABILISTIC matching • Recommended over traditional deterministic (exact matching) methods when: • coding errors, reporting variations, missing data or duplicate records • Estimate probability/likelihood that two records are for the same person versus not

Probabilistic Matching Definition of Probability: • Measure of how likely it is that some event will occur • “What is the probability of rain tonight?" • The likelihood that a given event will occur • “There is little probability of rain tonight.”

Probabilistic Matching • Find the records in File 2 that seem to match records in File 1 • Calculate a linkage score that indicates, for any pair of records, how likely it is that they both refer to the same person • Sort the likely and possible matched pairs in order of their scores • Define a threshold (Cut Off value) for automatically accepting and rejecting a potential link • Discard unlikely matched pairs (scores below Cut Off) • Gray area: range of scores considered as uncertain matches • Manually review uncertain matches

Probabilistic Matching • The total score for a linkage between any two records is the sum of the scores generated from matching individual fields • The score assigned to a matching of individual fields is: • Based on the probability that a matching variable agrees given that a comparison pair is a match • M Probability - similar to "sensitivity“ • Reduced by the probability that a matching variable agrees given that a comparison pair is not a match • U Probability - similar to "specificity"

Probabilistic Matching • Agreement argues for linkage (higher score) • Disagreement argues against linkage (lower score) • Full agreement argues more strongly for linkage than partial agreement • Some types of partial agreements are stronger than others; probabilistic scores are • Field-specific – Birth date versus Sex • Value-specific - “Jane” versus “Janiqua”

Concept of Blocking • With so many comparisons, large files can make impossible resource demands • Blocking is an initial probabilistic linkage step that reduces the number of record comparisons between files • Sort and match the two files by one or more identifying (“blocking”) variables • Comparisons subsequently made only within blocks • Discard very unlikely record-pairings from the start

Blocking Variables Sock Pattern: 7 of 13 socks fall outside pattern block 6 of 13 socks withinpattern block

Matching Within Blocks Blocking: PatternMatching: Color & Size High Score Gray Area Low Score

Overview of Link Plus Application

Link Plus Software • Stand-alone probabilistic record linkage program • Combines ease of use and statistical sophistication • Detects duplicates within a data file, or links two data files together • Supports fixed width files, delimited files, and North American Association of Central Cancer Registries files

Link Plus Software • Computes probabilistic record linkage scores based on the theoretical frame work developed by Fellegi and Sunter • Fellegi, I. P., and A. B. Sunter (1969), "A Theory for Record Linkage," Journal of the American Statistical Association, 64, pp. 1183-1210 • Can handle missing values of matching variables • automatically treats null or empty values as missing data and allows user to indicate additional values to be treated as missing data

Link Plus Software • Facilitates a simple and efficient blocking ("OR blocking") mechanism by indexing the variables for blocking and comparing the pairs with the identical values on at least one of those variables • Provides powerful support for manual review of uncertain matches

Link Plus Is Free $0.00

Link Plus Is Easy To Use Link Plus gets you from HERE: Cancer Registry data for John Smith: Vital Statistics data for John Smith:

Link Plus Is Easy To Use To HERE: Linked data for John Smith:

Link Plus Is Easy To Use Without having to go HERE:

Link Plus Is Easy To Use • Designed especially for cancer registry work • HOWEVER, can be used with any data • Mathematics largely hidden from user • Practical default values supplied for many tasks • Familiar Windows interface • Includes Help and test examples

Link Plus Is Robust • Program written by a mathematical statistician • Specifications based on research into the published literature • Tested by researchers experienced in record-linkage • Results are clear and accessible to novice users

Overview of Link Plus Linkage Process

Link Plus Linkage OverviewPrior to Linkage • Review and clean data files • Make sure you know your data! • Data cleaning tips in online help • Set up two data files • Make sure files use same coding convention • Link Plus provides view of first 20 records of each input file • Verify that data is being read in properly

Data Cleaning Tips • Last Name • Link Plus automatically cleans punctuation and strips off suffices, numbers III • First Name • May find Dr. Bill or Rev Bill or Sister Mary • Remove prefix in First Name field • Middle Name • Link Plus automatically cleans numbers, weird symbols • NMI-no middle initial or NMN-no middle name • DOB • Review day, mo, yyyy component • Replace errant values with missing • Sex • Make sure files use same coding convention; M, F, or Blank OR 1, 2 9

Link Plus Linkage Overview Two main types of linkage: • External Linkage • Probabilistically link one file to another file • Deduplication • Special case of record linkage • Records in the same file are blocked, compared, and scored against each other • Result is a ranked list of record pairs • High-scoring pairs may be duplicates

Link Plus Linkage Overview

Link Plus Linkage Overview External Linkage Steps: • Select Data Type for File 1 • Locate/Identify File 1 • Data Import for File 1 • Select Data Type for File 2 • Locate/Identify File 2 • Data Import for File 2 • Select Blocking Variables & Phonetic System • Select Matching Variables & Matching Methods • Select ID Variables • Define Missing Values • Select Direct/EM Method • Enter Cut-off Value • Specify Linkage File Name and Location • Perform Manual Review of Uncertain Matches • Export Merged File

Link Plus Linkage Overview Deduplication Linkage Steps: • Select Data Type for File • Locate/Identify File • Data Import for File • Select Blocking Variables & Phonetic System • Select Matching Variables & Matching Methods • Select ID Variables • Define Missing Values • Select Direct/EM Method • Enter Cut-off Value • Specify Linkage File Name and Location • Perform Manual Review of Uncertain Matches • Export Merged File

File ImportFile 1 verses File 2 • Designation of File 1 and File 2 is important • File 1 is generally the larger of the two files • µ probabilities are based on FILE 1 • Non_MatchReport.txt contains records from File 2 file not matched to records in File 1

Blocking Variables • Exact matches • Blocks of data to compare variables within • Up to 10 fields may be selected for blocking • Common blocking variables are: • Last Name • Social Security Number • Date of Birth

Phonetic Systems • Phonetic coding involves coding a string based on how it is pronounced • Link Plus offers a choice of 2 Phonetic Coding Systems: Soundex (120 + years old) • Code for a name consisting of a letter followed by three numbers: the letter is the first letter of the name, and the numbers encode the remaining consonants • Zeroes are added at the end if necessary to produce a four-character code. Additional letters are disregarded. • Washington is coded W-252 (W, 2 for the S, 5 for the N, 2 for the G, remaining letters disregarded • Reduces matching problems due to different spellings • Simple and fast

Phonetic Systems New York State Identification and Intelligence System (NYSIIS; 1970 +) • Maps similar phonemes to the same letter; maintains relative vowel positioning • String can be pronounced by the reader without decoding • Deborah Walker = DABARA WALCAR • Improvement to the Soundex algorithm • More distinctive; people are more likely to have the same Soundex than the same NYSIIS • Reported accuracy increase of 2.7% over Soundex • Studies suggest NYSIIS performs better than Soundex when Spanish names are used • Soundex may bring more pairs for comparison when used for blocking

Matching Variables • Up to 10 fields may be selected for matching • Recommended variables (Matching Methods): • Name--Last (LastName) • Name--First (FirstName) • Name--Middle (MiddleName) • Sex (Exact) • Race (Exact) • Birth Date (Date) • Social Security Number (SSN)

Matching Methods • Exact • Case insensitive character-for-character string comparison method • Results are either yes or no • Generic String • Uses edit distance function (Levenshtein distance) to compute the similarity of two long strings • Minimum number of operations (insertion, deletion, or substitution of a single character) needed to transform one string into the other • Last Name/First Name • Partial matching and value-specific matching accounts for minor typographical errors, misspellings, and hyphenation • Use of nicknames in First Name Matching Method

Matching Methods • SSN (Social Security Number) • Partial matching accounts for typographical errors and transposition of digits • Accepts 4 digit SSNs • Zip Code • Can match 5 digit zip code to 9 digit zip code • Date • Incorporates partial matching to account for missing month values and/or day values • Middle Name • Accounts for occurrence of the middle initial only versus the full middle name

Matching Methods Value-Specific (Frequency-Based) • Sets weights for matching values based on frequencies of values in the two files being compared • Match on a frequent value associated with a low weight; match on a rare value associated with a high weight • In a file with a high proportion of records with a white race (value of 01), a match on value 01 would be weighted lower than a match on the value 03 (American Indian)

Efficient Probabilistic Record Linkage Software Overview

Efficient Probabilistic Record Linkage Software Overview

Presentation Transcript

Link Plus Version 3 Update

Video card plus link

LINK

Link Link Link

link

Link Plus

LINK

link

Link Plus Version 3 Update

Link

plus

Overview of Link Plus Probabilistic Record Linkage Software

Link

Notepad Plus Plus

Modbus Link/Plus

Link

Link

Link Plus

link