
Kathleen K. Thoburn CDC/NPCR Contractor Joe Rogers

Make the Most of Your Data Using CDC’s Link Plus Free, Fast, and Efficient Probabilistic Record Linkage Program. Kathleen K. Thoburn CDC/NPCR Contractor Joe Rogers Team Lead, Data Analysis and Support Team, NPCR, CDC. National Center for Chronic Disease Prevention and Health Promotion.

Presentation Transcript


  1. Make the Most of Your Data Using CDC’s Link Plus: Free, Fast, and Efficient Probabilistic Record Linkage Program • Kathleen K. Thoburn, CDC/NPCR Contractor • Joe Rogers, Team Lead, Data Analysis and Support Team, NPCR, CDC • National Center for Chronic Disease Prevention and Health Promotion • NCRA 2011 Annual Conference, NPCR QC Track, Orlando, Florida, June 24, 2010 • Division of Cancer Prevention and Control

  2. Overview of Record Linkage • Can be accomplished manually, by visually comparing records from two separate sources or reviewing a single data source for duplicate records • The manual approach becomes time-consuming, tedious, inefficient, and impractical as the number of records in the files increases • Technological advances in computer systems and programming techniques have made it economically feasible to perform computerized record linkage on large files • Computerized linkage is efficient and relatively accurate

  3. Central Cancer Registry Record Linkage • Case Finding • Linking New Reports Consolidation • Follow Up • Special Studies • Duplicate Detection

  4. Duplicate Detection • Fundamental requirement for accuracy and validity of counts in any disease registry • National Program of Cancer Registries and North American Association of Central Cancer Registries standard • Maintain <= 0.1% (<=1 per 1,000) duplicates
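The duplicate standard above is simple arithmetic. A minimal sketch of the check follows; the function name and the example counts are hypothetical, chosen only for illustration.

```python
def meets_duplicate_standard(duplicate_records: int, total_records: int) -> bool:
    """Return True when duplicates are at or below 1 per 1,000 records (0.1%)."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    # Integer comparison avoids floating-point rounding at the boundary:
    # duplicates / total <= 1 / 1000  is the same as  duplicates * 1000 <= total
    return duplicate_records * 1000 <= total_records


# Hypothetical registry of 250,000 records
print(meets_duplicate_standard(250, 250_000))  # exactly 1 per 1,000
print(meets_duplicate_standard(251, 250_000))  # just over the limit
```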

  5. Deterministic Matching • Computerized comparison where EVERYTHING needs to match EXACTLY:

  6. Deterministic Matching • Often slight variations exist in the data between the two files for the same variables: • Or variables are missing from one of the files:
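A minimal sketch of the problem with deterministic matching: every field must agree exactly, so a single transposed digit or a missing value rejects a pair that a human reviewer would accept. The field names and record values below are hypothetical.

```python
FIELDS = ("last_name", "first_name", "ssn", "dob")  # hypothetical field names


def deterministic_match(rec_a: dict, rec_b: dict, fields=FIELDS) -> bool:
    """Exact matching: every field must be present and identical in both records."""
    return all(
        rec_a.get(f) is not None and rec_a.get(f) == rec_b.get(f)
        for f in fields
    )


registry = {"last_name": "SMITH", "first_name": "JOHN",
            "ssn": "123456789", "dob": "19500312"}
exact_copy = dict(registry)
# Same person, but the day digits of the birth date are transposed (12 -> 21)
transposed = dict(registry, dob="19500321")
# Same person, but SSN is missing from the second file
missing_ssn = dict(registry, ssn=None)

print(deterministic_match(registry, exact_copy))   # True
print(deterministic_match(registry, transposed))   # False
print(deterministic_match(registry, missing_ssn))  # False
```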

  7. Deterministic Matching vs. Manual Review • When we manually review, we use intuition to help us identify positive matches for records containing slight variations in, or missing information for, data between the two files for the same variables • Example: a typo in SSN or a transposition of digits in the day component of DOB; a reviewer would still deem the pair a match

  8. Probabilistic Matching • What do humans know? • How can we translate intuition into formal decision rules to be used by a computer? • Use the concept of PROBABILITY and perform PROBABILISTIC matching • Recommended over traditional deterministic (exact matching) methods when the data contain: • coding errors, reporting variations, missing data, or duplicate records • Estimate the probability/likelihood that two records are for the same person versus not

  9. Probabilistic Matching • Find the records in File 2 that seem to match records in File 1 • Calculate a score that indicates, for any pair of records, how likely it is that they both refer to the same person • Sort the likely and possible matched pairs in order of their scores • Define a threshold (Cut Off value) for automatically accepting and rejecting a potential link • Discard unlikely matched pairs (scores below Cut Off) • Gray area: range of scores considered as uncertain matches • Manually review uncertain matches
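The sorting and cut-off steps above can be sketched as follows. This is an illustration only, not Link Plus code: the function name, the record IDs, the scores, and the use of two cut-offs (a lower one for discarding and an upper one for automatic acceptance) are assumptions made for the example.

```python
def classify_pairs(scored_pairs, lower_cutoff, upper_cutoff):
    """Sort scored pairs and split them into accepted matches and a gray area.

    scored_pairs: iterable of (id_file1, id_file2, score) tuples.
    Pairs scoring below lower_cutoff are discarded as unlikely matches.
    """
    if lower_cutoff > upper_cutoff:
        raise ValueError("lower_cutoff must not exceed upper_cutoff")
    ranked = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)
    result = {"match": [], "review": []}
    for pair in ranked:
        score = pair[2]
        if score >= upper_cutoff:
            result["match"].append(pair)    # accepted automatically
        elif score >= lower_cutoff:
            result["review"].append(pair)   # gray area: manual review
        # anything lower is dropped
    return result


# Hypothetical scores for three candidate pairs
pairs = [("A1", "B7", 3.2), ("A2", "B9", 18.5), ("A3", "B4", 9.1)]
outcome = classify_pairs(pairs, lower_cutoff=7.0, upper_cutoff=15.0)
print(outcome["match"])   # [('A2', 'B9', 18.5)]
print(outcome["review"])  # [('A3', 'B4', 9.1)]
```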

  10. Probabilistic Matching • Agreement argues for linkage (higher score) • Disagreement argues against linkage (lower score) • Full agreement argues more strongly for linkage than partial agreement • Some types of partial agreements are stronger than others; probabilistic scores are • Field-specific – Birth date versus Sex • Value-specific - “Jane” versus “Janiqua”
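One common way to turn these ideas into numbers is the Fellegi-Sunter log-ratio weight. The slides do not say that Link Plus uses exactly this formula, so treat the sketch below as an assumption. Here `m` is the probability that a field agrees when two records belong to the same person, and `u` is the probability that it agrees by chance; all values shown are illustrative, not estimates from real data.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """Positive weight added to the score when a field agrees."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Negative weight added to the score when a field disagrees."""
    return math.log2((1 - m) / (1 - u))


M = 0.95  # illustrative: the field agrees for 95% of true matches

# Field-specific: sex agrees by chance about half the time, while a full
# birth date rarely does, so birth date agreement is stronger evidence.
sex_agree = agreement_weight(M, u=0.5)
dob_agree = agreement_weight(M, u=0.001)

# Value-specific: u is the relative frequency of the particular value,
# so agreement on a rare first name counts for more than on a common one.
jane_agree = agreement_weight(M, u=0.01)       # common name (illustrative)
janiqua_agree = agreement_weight(M, u=0.0001)  # rare name (illustrative)

print(sex_agree < dob_agree)           # True
print(jane_agree < janiqua_agree)      # True
print(disagreement_weight(M, u=0.5) < 0)  # True
```

A pair's total score is then the sum of the weights over all matching variables.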

  11. Phonetic Systems • Phonetic coding involves coding a string based on how it is pronounced • Soundex (120+ years old) • Code for a name consisting of a letter followed by three numbers: the letter is the first letter of the name, and the numbers encode the remaining consonants • Zeroes are added at the end if necessary to produce a four-character code; additional letters are disregarded • Washington is coded W-252 (W, 2 for the S, 5 for the N, 2 for the G, remaining letters disregarded) • Reduces matching problems due to different spellings • Simple and fast
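The Soundex rules above can be sketched in a few lines. This follows the widely used American Soundex convention (vowels separate repeated codes, H and W do not); whether Link Plus implements exactly this variant is an assumption.

```python
# Consonant groups that share a Soundex digit
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code for a name ('' if no letters)."""
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are skipped and do not separate equal codes
        code = _CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (first + "".join(digits) + "000")[:4]  # pad with zeroes, keep 4 chars


print(soundex("Washington"))  # W252
print(soundex("Robert") == soundex("Rupert"))  # True: spelling variants collide
```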

  12. Phonetic Systems • New York State Identification and Intelligence System (NYSIIS; 1970+) • Maps similar phonemes to the same letter; maintains relative vowel positioning • String can be pronounced by the reader without decoding • Deborah Walker = DABARA WALCAR • Improvement to the Soundex algorithm • More distinctive; people are more likely to have the same Soundex than the same NYSIIS • Reported accuracy increase of 2.7% over Soundex • Studies suggest NYSIIS performs better than Soundex when Spanish names are used • Soundex may bring more pairs for comparison when used for blocking

  13. Concept of Blocking • With so many comparisons, large files can make impossible resource demands • Blocking is an initial probabilistic linkage step that reduces the number of record comparisons between files • Sort and match the two files by one or more identifying (“blocking”) variables • Comparisons subsequently made only within blocks • Discard very unlikely record-pairings from the start
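A minimal sketch of blocking, assuming records are dictionaries and the blocking variable is last name; the names and values are hypothetical. Only pairs that share a blocking key are generated, instead of every record in File 1 against every record in File 2.

```python
from collections import defaultdict


def candidate_pairs(file1, file2, key):
    """Yield only the record pairs that fall in the same block."""
    blocks = defaultdict(list)
    for rec in file2:
        blocks[key(rec)].append(rec)       # index File 2 by blocking key
    for rec1 in file1:
        for rec2 in blocks.get(key(rec1), []):
            yield rec1, rec2               # compared later by the matching step


def by_last_name(rec):
    return rec["last_name"].strip().upper()


file1 = [{"id": 1, "last_name": "Smith"}, {"id": 2, "last_name": "Jones"},
         {"id": 3, "last_name": "Garcia"}]
file2 = [{"id": "a", "last_name": "SMITH"}, {"id": "b", "last_name": "Smith "},
         {"id": "c", "last_name": "Lee"}, {"id": "d", "last_name": "Garcia"}]

blocked = list(candidate_pairs(file1, file2, by_last_name))
print(len(file1) * len(file2))  # 12 comparisons without blocking
print(len(blocked))             # 3 comparisons with blocking
```

In practice a phonetic code of the last name is often used as the key, so that spelling variants land in the same block.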

  14. Blocking Variables • Exact matches • Blocks of data to compare variables within • Common blocking variables are: • Last Name • Social Security Number • Date of Birth

  15. Matching Variables • Probabilistic matching algorithms • Comparing variables within blocks • Common matching variables: • Name--Last • Name--First • Name--Middle • Sex • Race • Birth Date • Social Security Number

  16. Blocking • Sock Pattern: 7 of 13 socks fall outside the pattern block • 6 of 13 socks fall within the pattern block

  17. Matching Within Blocks • Blocking: Sock Pattern • Matching: Sock Color & Size • High Score • Gray Area • Low Score

  18. Link Plus Software • Stand-alone probabilistic record linkage program • Combines ease of use and statistical sophistication • Detects duplicates within a cancer registry, or links cancer registry files to external files • Supports North American Association of Central Cancer Registries files, fixed width files, delimited files, and CRS Plus database • Provides powerful support for manual review of uncertain matches

  19. CDC–NPCR Link Plus Contacts Kathleen K. Thoburn, CDC/NPCR Contractor E-mail: kthoburn@cdc.gov David Gu, CDC/NPCR Contractor E-mail: dgu@cdc.gov Tom Rawson, CDC Computer Programmer

  20. Link Plus Is Free $0.00

  21. Link Plus Is Easy To Use Link Plus gets you from HERE: Cancer Registry data for John Smith: Vital Statistics data for John Smith: To HERE: Linked data for John Smith:

  22. Link Plus Is Easy To Use Without having to go HERE:

  23. Link Plus Is Easy To Use • Designed especially for cancer registry work • HOWEVER, can be used with any data • Mathematics largely hidden from user • Practical default values supplied for many tasks • Familiar Windows interface • Includes Help and test examples

  24. Link Plus Deduplication Linkage Overview

  25. Link Plus Linkage Overview Deduplication Linkage Steps: • Select Data Type for File • Locate/Identify File • Data Import for File • Select Blocking Variables & Phonetic System • Select Matching Variables & Matching Methods • Select ID Variables • Define Missing Values • Enter Cut-off Value • Select Direct/EM Method • Specify Linkage File Name and Location • Perform Manual Review of Uncertain Matches • Export Merged File

  26. Blocking Variables • Exact matches • Blocks of data to compare variables within • Up to 10 fields may be selected for blocking • Common blocking variables are: • Last Name • Social Security Number • Date of Birth

  27. Matching Variables • Up to 10 fields may be selected for matching • Recommended variables (Matching Methods): • Name--Last (LastName) • Name--First (FirstName) • Name--Middle (MiddleName) • Sex (Exact) • Race (Value-Specific) • Birth Date (Date) • Social Security Number (SSN)

  28. Matching Methods • Exact • Generic String • Last Name/First Name/Middle • SSN (Social Security Number) • Zip Code • Date • Generic ID • Confirmation • Value-Specific (Frequency-Based)

  29. Missing Values • Specify date format on the missing value grid

  30. Cut Off Value • The score value above which comparison pairs are accepted as potential links and presented for review  • Value should always be positive • Initial value of around 7-10 recommended when using the recommended Matching Variables • Run linkage, and quickly review potential matches to identify lower and upper cut off scores • At what score do perfect matches end and uncertain matches (gray area) begin? • At what score do false matches begin?

  31. Running Linkage & Linkage Process Progress Window • Linkage Process Progress window appears and provides the user with feedback about the linkage process as it is run • The progress window provides feedback regarding the preparation of the configuration, the reading in of the data files, the blocking of the files, and the calculation of the linkage scores

  32. Manual Review of Uncertain Matches

  33. Link Plus Deduplication Linkage: Delimited File Export

  34. Enhancements New to Version 3.0 Data Link: • Removes the limitation on the number of records included in File 2; the program works for any number of records in File 2 as long as the computer has sufficient memory to read in data from File 1 • Users can choose whether to write all potential matches (many-to-many linkages) or only the matches with the highest score to the linkage report (1-to-many linkages) • Provides confirmation-like matching method for variables such as address that contributes positive weight for the linkage score with agreement but 0 weight with disagreement • Provides SSN-like matching method for a generic ID • Provides a new name matching method that is more robust against the frequency of names or outlier of names
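The confirmation-like method described above can be contrasted with an ordinary matching method in a short sketch. The function names and weight values are hypothetical and only illustrate the stated behavior: positive weight on agreement, zero weight on disagreement.

```python
def standard_weight(agrees: bool, agree_w: float, disagree_w: float) -> float:
    """Ordinary method: agreement raises the score, disagreement lowers it."""
    return agree_w if agrees else disagree_w


def confirmation_weight(agrees: bool, agree_w: float) -> float:
    """Confirmation-like method: agreement raises the score, disagreement adds 0."""
    return agree_w if agrees else 0.0


# Hypothetical weights for an address field. A person who has moved has a
# different address in each file, so disagreement should not count against them.
print(standard_weight(False, agree_w=4.0, disagree_w=-3.0))  # -3.0
print(confirmation_weight(False, agree_w=4.0))               # 0.0
print(confirmation_weight(True, agree_w=4.0))                # 4.0
```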

  35. Enhancements New to Version 3.0 Manual Review: • Users can use “Assign Set ID” (de-duplication linkages only) to group matches into mutually exclusive match sets • Removes the limitation of the maximum size of 30,000 pairs on the manual review window; the new maximum is 300,000 pairs • Provides option to allow users to assign match status by linkage score without overwriting any existing assigned match status Export: • Users can export the results of manual review to a NAACCR formatted file or any other fixed width file format

  36. Thank You! Be sure to stop by the Registry Plus booth with questions or for further demonstrations! Kathleen K. Thoburn, kthoburn@cdc.gov Joe Rogers, jrogers@cdc.gov The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention. National Center for Chronic Disease Prevention and Health Promotion Division of Cancer Prevention and Control
