QUAKE: Quadruple Key and Encryption • Centers for Disease Control and Prevention, Third Annual National Early Hearing Detection and Intervention Conference, Washington, DC, February 2004.
Background • University of Maine research team involved in research in informatics and developmental epidemiology • Contact Information • Craig A. Mason: email@example.com • Shihfen Tu: firstname.lastname@example.org
To Link or Not to Link… • Data linkage provides huge opportunity for public health research • Integrate large, complex, longitudinal datasets • Address questions impossible to answer any other way • This was impractical 10 or 15 years ago • Leads to fears of “Big Brother” • Abuse of information • Has identifiable information been released by researchers? • Individual rights versus public good • At what point does the public right to health trump my right to privacy? (assuming either of these exists)
Strategies for Addressing Concerns • Legislative • Procedural • Educational • Our focus: Technological • Review linkage strategies • Review encryption issues
Deterministic Linkage • A series of common identifying fields are selected across two databases • Records are matched across databases based on these fields • Two records must have identical values across all of these fields in order to be linked • “John”, “Bartholomew”, “Szapoznick” • “Jon”, “Bartholomew”, “Szapoznick”
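A minimal sketch of deterministic linkage, using the two example records above. The record layout and the field names `first`, `middle`, and `last` are assumptions made for illustration only. It shows that a single spelling difference ("John" versus "Jon") is enough to prevent a link.

```python
# Hypothetical identifying fields shared by both databases.
KEY_FIELDS = ("first", "middle", "last")

def deterministic_link(db_a, db_b, fields=KEY_FIELDS):
    """Return pairs of records whose values are identical on every field."""
    index = {}
    for rec in db_b:
        index.setdefault(tuple(rec[f] for f in fields), []).append(rec)
    return [(a, b)
            for a in db_a
            for b in index.get(tuple(a[f] for f in fields), [])]

service = [{"first": "John", "middle": "Bartholomew", "last": "Szapoznick"}]
school = [{"first": "Jon", "middle": "Bartholomew", "last": "Szapoznick"}]

links = deterministic_link(service, school)
print(links)  # empty: "John" and "Jon" differ, so the records are not linked
```

Linking a file to itself with the same function returns every record paired with itself, which is a quick sanity check that exact matches are found.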
Probabilistic Linkage • Two records do not have to match across all fields in order to be linked • For a possible pairing, a value is calculated that reflects the likelihood that the two records are (or are not) the same person • Based upon the frequencies of values and the quality of the data
Factors Influencing Probabilistic Linkage • Reliability of data fields • Greater reliability results in increased odds of a correct match • If a field is pure noise, correct matches will be random • Frequency of field values • The more common the value in a field, the greater the odds that the records will be erroneously matched • E.g., a match based on the name Szapocznik is more likely to reflect a correct match than is a match on the name Smith • Number of matches • The greater the number of individuals in one database who also appear in the other database, the greater the probability of linkage across databases • If two databases have no individuals in common, the probability of a linkage across the databases must be zero
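One common way to turn these factors into a score is the Fellegi-Sunter agreement weight. The slides do not name a specific formula, so this is a sketch under that assumption. Here `m` is the probability that a field agrees when two records are the same person (field reliability), and `u` is the probability that it agrees by chance (roughly the frequency of the value). All numeric values below are hypothetical.

```python
import math

def field_weight(agrees, m, u):
    """Log-odds contribution of one field comparison.

    m: P(field agrees | records are the same person)
    u: P(field agrees | records are different people)
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def match_score(comparisons):
    """Sum the weights over (agrees, m, u) triples, one per field."""
    return sum(field_weight(a, m, u) for a, m, u in comparisons)

# Hypothetical frequencies: a rare surname versus a common one.
rare_surname = field_weight(True, m=0.95, u=0.0001)
common_surname = field_weight(True, m=0.95, u=0.01)
# A pure-noise field agrees equally often for matches and non-matches.
noise_field = field_weight(True, m=0.3, u=0.3)

print(rare_surname > common_surname)  # agreement on a rare value counts for more
print(noise_field)                    # a noise field contributes nothing
```

Pairs whose total score exceeds a chosen threshold are treated as links. The threshold itself is a design choice that trades false matches against missed matches.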
Statisticians Anonymous “I’m David, and I’m a bean-counter”
Encryption • Ecretsay odecay • Information is coded so that true values are not obvious • Ancient field • Modern era focus on electronic transmission of sensitive data • Notice the little yellow padlock in the bottom corner of your browser when shopping on eBay?
Encryption Techniques • Asymmetric or public key • Different key for encryption and decryption • Encryption key is public • Decryption key is private • Decryption key cannot be derived from encryption key • Provide security of data transmission • Anyone can use the public key to code a message • Only I can decrypt it • Typically based on product of large primes
Challenge of Factorization • Factors hard to find • But once you know one, the other is easy to find • Public Key: 114,381,625,757,888,867,669,235,779,976,146,612,010,218,296,721,242,362,562,561,842,935,706,935,245,733,897,830,597,123,563,958,705,058,989,075,147,599,290,026,879,543,541 • Private Key Based on Factors: 3,490,529,510,847,650,949,147,849,619,903,898,133,417,764,638,493,387,843,990,820,577 and 32,769,132,993,266,709,549,961,988,190,834,461,413,177,642,967,992,942,539,798,288,533
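The asymmetry on this slide can be checked directly with the numbers shown. Multiplying the two factors, or dividing the public number by one known factor, is a single arithmetic step. Recovering both factors from the public number alone is the hard part. The variable names `n`, `p`, and `q` are chosen here for illustration.

```python
# The public number and its two factors, copied from the slide.
n = 114_381_625_757_888_867_669_235_779_976_146_612_010_218_296_721_242_362_562_561_842_935_706_935_245_733_897_830_597_123_563_958_705_058_989_075_147_599_290_026_879_543_541
p = 3_490_529_510_847_650_949_147_849_619_903_898_133_417_764_638_493_387_843_990_820_577
q = 32_769_132_993_266_709_549_961_988_190_834_461_413_177_642_967_992_942_539_798_288_533

# Easy direction: multiply the factors to get the public number.
product_matches = (p * q == n)

# Also easy: given one factor, a single division yields the other.
other_factor = n // p
division_is_exact = (n % p == 0)

print(product_matches, division_is_exact, other_factor == q)
```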
Encryption Techniques • Symmetric key • Same key for encryption and decryption • Key is not made public • Secret key - One Key to Rule Them All • More secure than asymmetric key of the same length • Nothing suggesting a possible key is published • Asymmetric key must be 6 to 30 times longer than symmetric key for equivalent security • Useful if you know in advance exactly who will want to encrypt a message to you
Encryption Techniques • Security often described in terms of bits • 128 bit encryption indicates 2^128 possible keys • 340,282,366,920,938,463,463,374,607,431,768,211,456 • A lot of possibilities… • Widespread use of 1024 and 2048 bit encryption on the horizon • 128 bit symmetric = 2304 bit asymmetric (Cryptography, p.166)
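The size of a key space follows directly from the number of bits. A short sketch, in which the guessing rate passed to `years_to_exhaust` is a purely hypothetical figure used to give a sense of scale:

```python
def key_space(bits):
    """Number of distinct keys for a key of the given length in bits."""
    return 2 ** bits

def years_to_exhaust(bits, keys_per_second):
    """Years needed to try every key at an assumed, hypothetical rate."""
    seconds_per_year = 60 * 60 * 24 * 365
    return key_space(bits) / (keys_per_second * seconds_per_year)

print(f"{key_space(128):,}")
# Hypothetical attacker testing one trillion keys per second.
print(f"{years_to_exhaust(128, 10**12):.3e} years")
```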
A Dirty Little Secret.. • These big numbers hide the fact that the security is only as good as the algorithm • Think reliability of DNA testing • Plaintext attack (and its variations) • If the only unique name in the data set is Szapocznik • And the only unique variation in the encrypted data set is “X*GFfF825d=”… • The key can be resolved
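The frequency argument above can be sketched in a few lines. The example assumes a 1:1 encryption, so that identical names always produce identical encrypted values. Apart from “X*GFfF825d=”, the encrypted strings and the name counts are made up for illustration, and `known_pairs` is a hypothetical helper name.

```python
from collections import Counter

def singletons(values):
    """Values that occur exactly once in the list."""
    counts = Counter(values)
    return [v for v, c in counts.items() if c == 1]

def known_pairs(plain_names, encrypted_names):
    """Pair up plaintext and ciphertext when each side has one unique value."""
    plain_unique = singletons(plain_names)
    cipher_unique = singletons(encrypted_names)
    if len(plain_unique) == 1 and len(cipher_unique) == 1:
        return [(plain_unique[0], cipher_unique[0])]
    return []

# Hypothetical data: what an attacker knows about the population of names,
# and the encrypted column they can observe.
plain = ["Smith", "Smith", "Jones", "Jones", "Szapocznik"]
cipher = ["q8Lm", "q8Lm", "Zt4!", "Zt4!", "X*GFfF825d="]

print(known_pairs(plain, cipher))
```

A recovered plaintext/ciphertext pair is the starting point for working out the key, which is why the strength of the algorithm matters more than the raw number of possible keys.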
A Dirty Little Secret.. • Even without the key, you can determine my grade • Some computational or physical wall between decrypted and encrypted data
One-to-One Encryption • Identifiers are encrypted into a unique value • Example: “Craig” with Encryption Key 93812….2431 becomes H3~f9(-d
One-to-Many Encryption • Identifiers are encrypted into one of multiple values • Lack of uniqueness increases challenge of decryption • Example: “Craig” with Encryption Key 93812….2431 becomes 9Dj1D[d or H3~f9(-d or dfR1”d/G
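The difference between 1:1 and 1:many encryption can be illustrated with a toy symmetric cipher. This is a sketch only and is not the algorithm used in QUAKE, which the slides do not specify. It assumes a keystream derived with HMAC-SHA256 and a random value (a nonce) mixed in for the 1:many case. The fixed-nonce 1:1 mode reuses its keystream and is not secure for real data.

```python
import hashlib
import hmac
import os

def _keystream(key, nonce, length):
    """Derive `length` pseudo-random bytes from the key and nonce (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]

def encrypt(key, text, one_to_many=False):
    """1:1 mode uses a fixed nonce; 1:many mode uses a fresh random nonce."""
    nonce = os.urandom(8) if one_to_many else b"\x00" * 8
    data = text.encode("utf-8")
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))

def decrypt(key, blob):
    """Reverse either mode: the nonce travels with the encrypted value."""
    nonce, body = blob[:8], blob[8:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream)).decode("utf-8")

key = b"hypothetical-secret-key"
same_1 = encrypt(key, "Craig")
same_2 = encrypt(key, "Craig")
many_1 = encrypt(key, "Craig", one_to_many=True)
many_2 = encrypt(key, "Craig", one_to_many=True)

print(same_1 == same_2)      # 1:1, the same identifier always gives the same value
print(many_1 == many_2)      # 1:many, repeated encryptions differ
print(decrypt(key, many_1))  # both modes still decrypt to the original
```

In 1:many mode the same name no longer produces a repeated encrypted value, which removes the frequency pattern that a plaintext attack relies on. The cost is that encrypted values can no longer be compared directly for equality.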
That’s nice, but how can this help with data linkage? • All right. But apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, the fresh water system, and public health… What have the Romans ever done for us? --- Reg, spokesman for the People’s Front of Judea Monty Python Life of Brian (and Martin White, UC Berkeley)
The Politics of Linkage • Two data systems contain information on same individuals • Would like to link data for public health research Service Data: Craig A. Mason…. School Data: Craig A. Mason….
The Politics of Linkage • I may not want schools to know about health services I have received Service Data: Craig A. Mason…. School Data: Craig A. Mason….
The Politics of Linkage • What solution may allow data to be linked, yet prevent sources from seeing each other’s identifying data Service Data: Craig A. Mason…. School Data: Craig A. Mason….
Quake • QUAdruple Key and Encryption Service Data: Craig A. Mason…. School Data: Craig A. Mason….
Quake • Requires algorithms to be reversible • You can “undo” a process to come back to original value
Quake • Requires algorithms to be commutative • You get the same answer regardless of the order in which the steps are applied
Quake • Each provider selects their own unique encryption key that is used to encrypt identifiers prior to linkage 052385043…9471 757260024…2512 Service Data: Craig A. Mason…. School Data: Craig A. Mason….
Quake • Community members representing individuals in each dataset also select their own unique encryption keys 420504763….8372 850258434…3435 052385043…9471 757260024…2512 Service Data: Craig A. Mason…. School Data: Craig A. Mason….
Quake • The encryption keys for the community representatives and the providers are entered separately, and the combined keys are hidden from the users 420504763….8372 850258434…3435 Hidden Key: 342002330…2852 Hidden Key: 147742268…0042 052385043…9471 757260024…2512 Service Data: Craig A. Mason…. School Data: Craig A. Mason….
Quake • These combined encryption keys are used to encrypt identifiers in each file prior to linkage 420504763….8372 850258434…3435 Hidden Key: 342002330…2852 Hidden Key: 147742268…0042 052385043…9471 757260024…2512 Service Data: *Bj&!33t…. School Data: yy#K66….
Quake • Symmetric key with 1:many encryption 420504763….8372 850258434…3435 Hidden Key: 342002330…2852 Hidden Key: 147742268…0042 052385043…9471 757260024…2512 Service Data: *Bj&!33t…. School Data: yy#K66….
Quake • The combined encryption keys are not stored so neither party can decrypt on their own 420504763….8372 850258434…3435 Hidden Key: 342002330…2852 Hidden Key: 147742268…0042 052385043…9471 757260024…2512 Service Data: *Bj&!33t…. School Data: yy#K66….
Illustration of Security • To see why, consider the following simple keys • Service provider key: 7 • Community representative key: 3 • Combined key: 3 x 7 = 21 • Simple message to encrypt, “A” • Simple encryption algorithm • Each letter has a value 1-26, repeating • “A”=1, “Z”=26, “A”=27… • Multiply that value by the encryption key in order to obtain the new value Rep Key: 3 Hidden Combined Key: 21 Provider Key: 7
Illustration of Security • Once encrypted, “A” becomes “U” Rep Key: 3 Original Message: A Hidden Combined Key: 21 Provider Key: 7 Encrypted Message: U
Illustration of Security • If the community representative applied their key to the encrypted message, they would see “G” • 21 ÷ 3 = 7 • “G” is the letter with value 7 Rep Key: 3 Encrypted Message: U Hidden Combined Key: 21 Provider Key: 7 De-Encrypted Message: G
Illustration of Security • If the service provider applied their key to the encrypted message, they would see “C” • 21 ÷ 7 = 3 • “C” is the letter with value 3 Rep Key: 3 Encrypted Message: U Hidden Combined Key: 21 Service Provider Key: 7 De-Encrypted Message: C
Illustration of Security Encrypted Message: U • Only by working together can the message be decrypted Rep Key: 3 Partially Decrypted Message: G Hidden Combined Key: 21 Service Provider Key: 7 Fully Decrypted Message: A
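The toy example from these slides can be run as written. One assumption is added: because letter values repeat every 26, "dividing" by a key is implemented as multiplying by its modular inverse mod 26. For the values used here this gives the same results as the plain division shown on the slides. Keys must share no factor with 26 for this to work, which holds for 3 and 7. The function names are chosen for illustration.

```python
MODULUS = 26  # letters have values 1-26, repeating

def to_value(letter):
    """'A' -> 1, ..., 'Z' -> 26."""
    return ord(letter.upper()) - ord("A") + 1

def to_letter(value):
    """Map any integer back to a letter, wrapping every 26 values."""
    return chr(ord("A") + (value - 1) % MODULUS)

def encrypt(letter, key):
    """Multiply the letter's value by the key."""
    return to_letter(to_value(letter) * key)

def decrypt(letter, key):
    """Undo a multiplication by the key (requires Python 3.8+ for pow with -1)."""
    return to_letter(to_value(letter) * pow(key, -1, MODULUS))

rep_key, provider_key = 3, 7
combined_key = rep_key * provider_key          # 21, never shown to either party

encrypted = encrypt("A", combined_key)         # "U"
rep_only = decrypt(encrypted, rep_key)         # "G": the representative alone fails
provider_only = decrypt(encrypted, provider_key)  # "C": the provider alone fails
together = decrypt(rep_only, provider_key)     # "A": both keys recover the message

print(encrypted, rep_only, provider_only, together)
```

Because multiplication is commutative, applying the provider key first and the representative key second recovers the same letter.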
Quake • Once each dataset is encrypted, several methods for linking are possible 420504763….8372 850258434…3435 Hidden Key: 342002330…2852 Hidden Key: 147742268…0042 052385043…9471 757260024…2512 Service Data: *Bj&!33t…. School Data: yy#K66….
Linking Encrypted Files • Simple approach • Bring both encrypted files together on independent, non-networked machine • Each of the four parties enters their own key • Respective files internally decrypted and linked • New, de-identified linked file containing fields of interest created • Record of identifiers and keys electronically or physically erased • DoD 5220.22-M protocol
Linking Encrypted Files • Benefits • Flexible linkage strategies (partial names, etc.) • Easiest to perform • Once completed no identifiers to enable plaintext attack • Issues • Process of encryption/decryption can be computationally demanding • Potential record of encrypted data and all keys • Can be destroyed, but time consuming
Variation of Quake • Each provider selects own unique encryption key used to encrypt identifiers prior to linkage Key: 052385043…9471 Key: 757260024…2512 Service Data: Craig A. Mason School Data: Craig A. Mason
Variation • Identifiers in their file encrypted with a 1:1 symmetric key Key: 052385043…9471 Key: 757260024…2512 Service Data: *Bj&!33t…. School Data: yy#K66….
Variation • Parties then switch encrypted files • If identifying fields in both files are all equal.. • May be prone to variations of a plaintext attack • Inclusion of additional records whose identifiers contain random noise can nearly eliminate this risk Key: 052385043…9471 Key: 757260024…2512 School Data: yy#K66…. Service Data: *Bj&!33t….
Variation • Each party then applies their own key to the other party’s already-encrypted file • Identifiers in each file will have the same value • Cannot determine key used by other source Key: 052385043…9471 Key: 757260024…2512 School Data: Jf*72Coo…. Service Data: Jf*72Coo….
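The slides do not name the 1:1 commutative algorithm, so the following is only one possible sketch, assuming modular exponentiation over a prime. Each identifier is hashed to a number, and a party "applies its key" by raising that number to its secret exponent. The prime, the hashing step, and all function names are assumptions for illustration. A real deployment would need a vetted group and parameters.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime, chosen here only to keep the example small

def new_key():
    """Pick a random secret exponent that keeps the encryption 1:1 and reversible."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

def to_number(identifier):
    """Hash an identifier to a number in the range 1..P-1."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def apply_key(value, key):
    """Encrypt (or re-encrypt) a value with one party's key."""
    return pow(value, key, P)

service_key, school_key = new_key(), new_key()

# Each party encrypts its own file, then the parties switch files.
service_once = apply_key(to_number("Craig A. Mason"), service_key)
school_once = apply_key(to_number("Craig A. Mason"), school_key)

# Each party applies its own key to the other's already-encrypted file.
service_twice = apply_key(service_once, school_key)
school_twice = apply_key(school_once, service_key)

print(service_twice == school_twice)  # same person, same doubly encrypted value
```

The order of the two keys does not matter because raising to one exponent and then the other is the same as raising to their product. Neither party ever handles the other's key, and different identifiers produce different doubly encrypted values.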
Variation • If files brought together by one of the parties • They may be able to conduct a plaintext attack • May then be able to determine key used by other party • Both files linked by trusted third party Key: 052385043…9471 Key: 757260024…2512 School Data: Jf*72Coo…. Service Data: Jf*72Coo….
Variation • Again, may bring in community representatives Key: 052385043…9471 Key: 757260024…2512 School Data: Jf*72Coo…. Service Data: Jf*72Coo…. Linked Data: Jf*72Coo, Services, Grades Final Linked Data: Services, Grades
Variation • Link based upon the encrypted identifier fields • No need to decrypt files when linking • Apply deterministic and probabilistic algorithms to encrypted data • No machine ever sees all keys • Final file contains no identifiers and only a limited number of fields of interest
Variation of Quake • Issues • Requires 1:1 encryption algorithm • Can be addressed, but adds level of complexity • Can not examine partial strings • Specific partial strings can be generated prior to encryption • Month of birth, day of birth • First letter of first name
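Partial strings can be derived as separate fields before anything is encrypted, so each can later be compared in encrypted form. A short sketch, in which the field names and the sample birth date are hypothetical:

```python
from datetime import date

def partial_fields(first_name, birth_date):
    """Build the partial identifiers listed above, prior to encryption."""
    return {
        "first_initial": first_name[:1].upper(),
        "birth_month": f"{birth_date.month:02d}",
        "birth_day": f"{birth_date.day:02d}",
    }

# Hypothetical record; each resulting field would then be encrypted on its own.
fields = partial_fields("Craig", date(1990, 3, 7))
print(fields)
```

Each derived field is encrypted independently, so two records can agree on the encrypted month of birth even when their encrypted full names differ.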
Advanced Linkage Protocols for Addressing Confidentiality Concerns • Encrypted Linkage Protocols • Unique encryption keys administered by each database administrator and community liaisons • No one at any time sees the other person’s identifiers • Person conducting the linkage never sees any identifiers • Resulting linked set includes no decrypted identifiers • Resulting file can not be decoded, expanded, or relinked without agreement and cooperation of all parties • The community participates in the process • Technology that creates confidentiality concerns may provide means for reducing those concerns