
Blindfolded Record Linkage

Presented by Gautam Sanka. Susan C. Weber, Henry Lowe, Amar Das, Todd Ferris.



Presentation Transcript


  1. Blindfolded Record Linkage Presented by Gautam Sanka Susan C. Weber, Henry Lowe, Amar Das, Todd Ferris

  2. Introduction and Objectives • Challenges • Patient privacy vs. building cross-site records • Solutions • Mandate that identifiers be disclosed • Privacy officers find this unacceptable • Keep only de-identified information in the registry, but share an algorithm with third parties for generating an anonymous identifier

  3. De-identification Explained • This anonymous identifier will be created in such a way that: • The probability of the same identifier being generated at two different sites is high for the same person • And low for different people

  4. What can be used? • Using SSN – Bad Idea • Using names and DOB may seem best but: • Nicknames at one site and full name at another • Misspellings • Different Titles (Mr. Ms. Mrs.)

  5. Goal of Project • Breast Cancer Patients at PAMF (Palo Alto Medical Foundation) and Stanford University Medical Center • Merge the Data with de-identification under HIPAA and IRB approval

  6. Interesting Approaches • Bigrams • For the names Ann and Anne: [AN, NN] and [AN, NN, NE] • The Dice coefficient is 2 × 2 / (2 + 3) = 4/5 • Bloom filter • Neither was implemented because of the complexity involved
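The Ann/Anne arithmetic on this slide can be reproduced with a minimal sketch. The function names `bigrams` and `dice_coefficient` are illustrative, and treating the bigrams as a set (so repeated bigrams count once) is an assumption, not something the slides specify.

```python
def bigrams(name: str) -> set[str]:
    """Return the set of adjacent two-letter substrings of an upper-cased name."""
    s = name.strip().upper()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_coefficient(a: str, b: str) -> float:
    """Dice coefficient: 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Assumption: two names too short to have bigrams are scored 0.0.
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


# "ANN" -> {AN, NN}; "ANNE" -> {AN, NN, NE}; two shared bigrams out of 2 + 3.
print(dice_coefficient("Ann", "Anne"))  # 0.8
```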

  7. A single SHA-1 string was constructed based on • Gender • DOB • Zip • Three-letter prefix of last name • In their case, only the first two letters of patients’ first and last names were used

  8. Composite Identifier • The authors felt that a combination of DOB and the first two letters of the first and last names would uniquely identify a patient • Most applicable when: • Compliance restrictions preclude the exchange of actual identifiers • Total number of comparisons is less than 10^8 • Names and DOB are readily available • DOB has a low error rate
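A composite identifier of this kind might be built roughly as follows. This is a sketch under stated assumptions: the slides name SHA-1, DOB and the two-letter name prefixes, but the normalization (trim, upper-case), the field order, the `|` separator and the function name `composite_identifier` are all hypothetical choices made here for illustration.

```python
import hashlib
from datetime import date


def composite_identifier(first_name: str, last_name: str, dob: date) -> str:
    """Hash the two-letter name prefixes plus DOB into one SHA-1 hex string."""
    # Assumed normalization: trim whitespace, upper-case, keep two letters.
    first = first_name.strip().upper()[:2]
    last = last_name.strip().upper()[:2]
    # Assumed layout: fixed field order with "|" as separator, ISO date format.
    key = f"{first}|{last}|{dob.isoformat()}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


# A nickname-style variant at a second site still yields the same identifier,
# because only the first two letters of each name are used.
site_a = composite_identifier("Ann", "Smith", date(1960, 5, 17))
site_b = composite_identifier("Anne", "SMITH ", date(1960, 5, 17))
print(site_a == site_b)  # True
```

Both sites would have to apply exactly the same normalization and layout for the hashes to agree, which is why those choices are called out as assumptions.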

  9. Methods • Measured the rate of false positives in the data • Dropped name prefixes • Dropped records with a DOB of 1/1/1900 or 1/1/1901 • Performed a self-join on three sets of 1.5M rows, 0.5M rows and 10,000 rows
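One way to picture the self-join is to count pairs of distinct records that share a composite identifier. The sketch below assumes every row is a different patient, so each such pair is a false positive and every other pair is a true negative; the record layout `(patient_id, composite_id)` and the function names are hypothetical.

```python
from collections import Counter


def count_colliding_pairs(records: list[tuple[str, str]]) -> int:
    """Count unordered pairs of records that share the same composite identifier."""
    group_sizes = Counter(composite_id for _, composite_id in records)
    # A group of n records with one identifier contributes n * (n - 1) / 2 pairs.
    return sum(n * (n - 1) // 2 for n in group_sizes.values())


def self_join_specificity(records: list[tuple[str, str]]) -> float:
    """Specificity = true negatives / (true negatives + false positives)."""
    n = len(records)
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        raise ValueError("need at least two records to form a pair")
    false_positives = count_colliding_pairs(records)
    return (total_pairs - false_positives) / total_pairs


rows = [("p1", "h_a"), ("p2", "h_a"), ("p3", "h_b"), ("p4", "h_c")]
print(count_colliding_pairs(rows))   # 1
print(self_join_specificity(rows))   # 5 of 6 pairs are true negatives
```

Grouping by identifier avoids materializing the full pairwise join, which matters at the 1.5M-row scale mentioned on the slide.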

  10. Specificity based on Data Set Size

  11. Measuring False Negatives • Both sites exchanged cryptographic hashes based on SSNs • The number of matches found by SSN but not by the composite identifier became the lower bound for false negatives • All false positives were removed based on real identifiers
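The lower-bound idea can be sketched with set operations. The representation of a match as a `(site_a_id, site_b_id)` pair and the function names are assumptions made here; the slides do not describe how matches were stored.

```python
Match = tuple[str, str]  # hypothetical (site_a_record_id, site_b_record_id) pair


def false_negative_lower_bound(ssn_matches: set[Match],
                               composite_matches: set[Match]) -> int:
    """Matches found via SSN hashes that the composite identifier missed."""
    return len(ssn_matches - composite_matches)


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Sensitivity = TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)


ssn = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
composite = {("a1", "b1"), ("a3", "b3")}
missed = false_negative_lower_bound(ssn, composite)
print(missed)                               # 1
print(sensitivity(len(composite), missed))  # 2 of 3 true matches found
```

It is only a lower bound because a true match with a missing or mistyped SSN at either site is invisible to the SSN comparison as well.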

  12. Sensitivity: • Specificity:

  13. 2087 Common Patients

  14. “This was a very interesting result in that it provided us with a measure of how much better our approach is compared to using full names rather than two-letter prefixes.”

  15. Reasons for False Negatives in Composite Identification Found by SSN and later confirmed manually

  16. Simply Using SSN • SSNs found only 1806 out of 2028 • Rate of false negatives is 10% higher than with the composite identifier • Reasons • 172 of the 222 false negatives had a missing SSN

  17. What about the other 50? In conclusion: 57 false positives for SSN matches vs. 3 false positives for the composite identifier, roughly 20 times worse

  18. Which identifiers are best?

  19. When should we use this tool? • Most useful where privacy policies preclude the full exchange of the identifiers required by more sophisticated and sensitive linkage algorithms • For data sets of high quality, this approach (in comparison to complex algorithms) is: • Easy to explain • Adherent to the minimum rules set by HIPAA • Faster and less cumbersome

  20. Suggestions
