Probabilistic Linkage: Issues and Strategies

Craig A. Mason, Ph.D. • University of Maine • craig.mason@umit.maine.edu

Faculty Disclosure Information • In the past 12 months, I have not had a significant financial interest or other relationship with the manufacturer(s) of the product(s) or provider(s) of the service(s) that will be discussed in my presentation. • This presentation will (not) include discussion of pharmaceuticals or devices that have not been approved by the FDA, or of unapproved or "off-label" uses of pharmaceuticals or devices.

Acknowledgements • Shihfen Tu, Quansheng Song • Keith Scott, Marygrace Yale, Tony Gonzalez • Derek Chapman

Overview of Linkage Process [Diagram: Birth Certificates; EHDI Diagnostic Data] • Two databases contain information on some of the same individuals • Many births are not in the Diagnostic Data • Some entries in the EHDI Diagnostic Data do not appear in the Electronic Birth Certificates • The final linkage is a subset of each

Linkage Algorithms • Deterministic • Exactly match on specified common fields • Easiest, quickest linkage strategy • Misconception that this is the “gold standard”

Linkage Algorithms • Deterministic • May result in significant bias • Non-traditional spellings, e.g., in African American names, are less likely to match exactly • Errors take the form of non-links (true matches that are missed) • Many non-links can result in greater bias than a few erroneous pairings
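To make the non-link problem concrete, here is a minimal Python sketch of deterministic linkage. The records, field names, and the helper deterministic_link are all hypothetical and chosen only for illustration; they are not taken from the presentation's data.

```python
# Hypothetical records; names, fields, and values are illustrative only.
births = [
    {"id": "B1", "first": "Kaitlyn", "last": "Jones", "dob": "2006-03-14"},
    {"id": "B2", "first": "John", "last": "Smith", "dob": "2006-05-02"},
]
diagnostics = [
    {"id": "D1", "first": "Katelyn", "last": "Jones", "dob": "2006-03-14"},
    {"id": "D2", "first": "John", "last": "Smith", "dob": "2006-05-02"},
]

KEY_FIELDS = ("first", "last", "dob")


def deterministic_link(left, right, fields=KEY_FIELDS):
    """Return (left_id, right_id) pairs that agree exactly on every field."""
    index = {}
    for rec in right:
        index.setdefault(tuple(rec[f] for f in fields), []).append(rec["id"])
    return [
        (rec["id"], right_id)
        for rec in left
        for right_id in index.get(tuple(rec[f] for f in fields), [])
    ]


links = deterministic_link(births, diagnostics)
# B1 and D1 plausibly describe the same child, but the spelling difference
# in the first name prevents an exact match, so the pair becomes a non-link.
print(links)
```

Only the B2/D2 pair is linked; the B1/D1 pair is silently dropped, which is the kind of systematic loss the slide warns about.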

Linkage Algorithms • Probabilistic • Statistically estimate likelihood or odds that two records are for the same individual, even if they disagree on some fields

Linkage Algorithms • Factors Impacting Probabilistic Linkage • Likelihood that a field would agree if the pair is a correct link • Good-quality data counts more than poor-quality data • Likelihood that a field would agree if the pair is not a correct link • Rare values count more than common values • Number of expected matches • A much more complicated and expensive strategy than deterministic linkage

Implementing an Effective Data Linkage [Cartoon: Start → "Then a miracle occurs" → out. Caption: "Good work, but I think we might need just a little more detail right here."] Modified from Kim Church, Maine Genetics Program

Probabilistic Matching • Two records are not required to match in all fields • Two records are compared on each of the specified fields • A weight, wi, is calculated for each field in a potential match, reflecting the strength of the agreement or disagreement

Factors Influencing Likelihood of Match • Reliability of data fields • Greater reliability results in increased odds of correct match • A match on a high-quality, reliably entered field is good • Not matching on a poor-quality field with lots of known data entry errors may not be a fatal error • If a field is pure noise, correct matches will be random across the databases

Factors Influencing Likelihood of Match • Frequency of field values • The more common the value in a field, the greater the odds that the records will be erroneously matched • A match based on the name Zbignew is a relatively good indicator of a match, even if there may be disagreement in other fields • A match based on the name John may be of much less value, requiring matches on more fields in order to conclude two records are the same individual • Number of expected matches one would obtain randomly

Calculating Match Weights • Weight Calculation • M-probability • Probability that a field agrees if the pairing reflects a correct match • U-probability • Probability that a field agrees if the pairing reflects an incorrect match • Chance that a given field will agree randomly • Approximately = # records with a specific value/total # of records
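The frequency approximation for the u-probability can be sketched in a few lines of Python. The function name estimate_u and the toy list of first names are assumptions made for illustration; a real file would have far more records and much smaller frequencies.

```python
from collections import Counter


def estimate_u(values):
    """Approximate u for each value as (# records with that value) / (total # records)."""
    counts = Counter(values)
    total = len(values)
    return {value: n / total for value, n in counts.items()}


# Toy data (hypothetical): a common name and a rare name.
first_names = ["John"] * 6 + ["Mary"] * 3 + ["Zbignew"]
u = estimate_u(first_names)

# The rare value gets a small u, so agreement on it is more informative.
print(u["John"], u["Zbignew"])
```

Because u is estimated per value rather than per field, agreement on a rare value such as Zbignew carries more evidence than agreement on a common value such as John.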

Probabilistic Matching • If the field agrees, wi is equal to …

Probabilistic Matching • mi for first name = .98; that is, 98% of the time, if it’s a correct match, the first names will agree • ui for Zbignew = .00001, the probability of randomly getting two first names that are Zbignew
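The agreement formula from the slide did not survive extraction. As a hedged reconstruction, the standard Fellegi-Sunter agreement weight in base 2 (an assumption, though consistent with the base-2 logs mentioned later in the deck) applied to the Zbignew figures gives:

```latex
w_i = \log_2\!\left(\frac{m_i}{u_i}\right)
    = \log_2\!\left(\frac{0.98}{0.00001}\right)
    = \log_2(98000) \approx 16.58
```

A large positive weight like this reflects how unlikely it is for two unrelated records to share such a rare first name.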

Probabilistic Matching • In cases where two records disagree on a specified field, wi is equal to …

Probabilistic Matching • mi for last name = .96; that is, 96% of the time, if it’s a correct match, the last names will agree • ui for Brezinsky = .00003, the probability of randomly getting two last names that are Brezinsky
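The disagreement formula is likewise missing from the extracted slide. Assuming the standard Fellegi-Sunter disagreement weight in base 2, the Brezinsky figures would work out as:

```latex
w_i = \log_2\!\left(\frac{1 - m_i}{1 - u_i}\right)
    = \log_2\!\left(\frac{1 - 0.96}{1 - 0.00003}\right)
    = \log_2\!\left(\frac{0.04}{0.99997}\right) \approx -4.64
```

The weight is negative because disagreement on a reliably recorded field counts as evidence against a match.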

Calculating Match Weights • A composite weight, wt, is calculated for each pair of records • It is the sum of the weights across all fields used in the linkage • A larger wt suggests a correct match • A smaller or negative wt suggests an incorrect match
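A minimal Python sketch of the composite weight follows. It assumes the standard base-2 Fellegi-Sunter field weights (log2(m/u) on agreement, log2((1 - m)/(1 - u)) on disagreement), which the extracted slides do not show explicitly; the function names and the tuple layout are hypothetical.

```python
import math


def field_weight(agrees, m, u):
    """Base-2 field weight (assumed standard Fellegi-Sunter form)."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def composite_weight(comparisons):
    """Sum the field weights over (agrees, m, u) tuples for one record pair."""
    return sum(field_weight(agrees, m, u) for agrees, m, u in comparisons)


# One candidate pair: first name agrees (m = .98, u = .00001),
# last name disagrees (m = .96, u = .00003).
pair = [(True, 0.98, 0.00001), (False, 0.96, 0.00003)]
w_t = composite_weight(pair)
print(round(w_t, 2))
```

Here the strong agreement on a rare first name outweighs the disagreement on last name, so the composite weight remains clearly positive.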

Blocking • Match Determination • Could compare every record in one dataset with every record in the second dataset • Results in N1 × N2 comparisons • Blocking • Records are first “blocked” on a subset of fields for which a deterministic match is required • Within each block, all records from one dataset are compared to all records from the other dataset • wt is calculated for each of these possible pairings • The distribution of wt values across all blocks is examined to determine the critical cut-off score necessary to classify two records as a match
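The blocking step can be sketched in Python as follows. The blocking key (date of birth plus sex), the records, and the helper names are assumptions for illustration only; the presentation does not say which fields were used for blocking.

```python
from collections import defaultdict
from itertools import product


def block(records, key):
    """Group records by the value of the blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks


def candidate_pairs(left, right, key):
    """Yield only the pairs whose records fall in the same block."""
    left_blocks = block(left, key)
    right_blocks = block(right, key)
    for k in left_blocks.keys() & right_blocks.keys():
        yield from product(left_blocks[k], right_blocks[k])


# Hypothetical records.
births = [
    {"id": "B1", "dob": "2006-03-14", "sex": "F"},
    {"id": "B2", "dob": "2006-03-14", "sex": "M"},
    {"id": "B3", "dob": "2006-05-02", "sex": "M"},
]
diagnostics = [
    {"id": "D1", "dob": "2006-03-14", "sex": "F"},
    {"id": "D2", "dob": "2006-05-02", "sex": "M"},
]

pairs = list(candidate_pairs(births, diagnostics, key=lambda r: (r["dob"], r["sex"])))
print(len(pairs), "blocked comparisons instead of", len(births) * len(diagnostics))
```

Each surviving pair would then receive a composite weight wt. The trade-off is that a true match with an error in a blocking field is never compared, so blocking fields should be among the most reliable ones.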

Estimating Probabilities • The total weight required for two records to have a probability, p, of being a match is equal to… • Where p is the desired probability of a match, • E is the expected number of potential matches, • N1 and N2 are the number of records in each database, • and one term of the formula is the base 2 log of the odds of a random match
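The threshold formula itself is missing from the extracted slide. A plausible reconstruction, assuming the odds of a random match are E / (N1 N2 - E), that is, expected matches over expected non-matches among all possible pairings, would be:

```latex
w_t = \log_2\!\left(\frac{p}{1 - p}\right) - \log_2\!\left(\frac{E}{N_1 N_2 - E}\right)
```

Under this assumption, the first term is the base 2 log of the desired odds of a match and the second is the base 2 log of the odds of a random match; the rarer true matches are among all pairings, the larger the weight needed to reach a given p.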

Estimating Probabilities • From this formula, it is possible to derive an equation for estimating the probability that any two records are a match [Equation annotations: odds of a random match; factor applied if two fields agree; factor applied if two fields do not agree]
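The derived equation is not preserved in the extracted slide. One form consistent with its annotations, assuming the odds of a random match are E / (N1 N2 - E) and the standard m/u ratios for agreeing and disagreeing fields, is:

```latex
\frac{p}{1 - p}
  = \frac{E}{N_1 N_2 - E}
    \prod_{i \in \text{agree}} \frac{m_i}{u_i}
    \prod_{i \in \text{disagree}} \frac{1 - m_i}{1 - u_i},
\qquad
p = \frac{p/(1 - p)}{1 + p/(1 - p)}
```

Read this way, the odds of a match start at the odds of a random match and are multiplied by one factor per field, depending on whether that field agrees or disagrees.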

Notes • Note that the probability equation is equivalent to a base-2 version of the logistic probability formula • The computational formula avoids the need to repeatedly calculate powers of 2 and log2 • This is due to the weights in the exponent themselves being a log-value • The same probability is obtained using e and the natural log in place of 2 and log2 throughout • Base 2 results in improved computational speed
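These notes can be checked numerically with a short Python sketch. It assumes the standard m/u field weights and random-match odds of E / (N1 * N2 - E); the function names and the values of E, N1, and N2 are hypothetical.

```python
import math


def random_odds(E, N1, N2):
    """Assumed odds of a random match: expected matches over non-matches."""
    return E / (N1 * N2 - E)


def prob_base2(comparisons, E, N1, N2):
    """Base-2 logistic form: p = 2**x / (1 + 2**x)."""
    x = math.log2(random_odds(E, N1, N2))
    for agrees, m, u in comparisons:
        x += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return 2 ** x / (1 + 2 ** x)


def prob_natural(comparisons, E, N1, N2):
    """Same quantity using e and the natural log throughout."""
    x = math.log(random_odds(E, N1, N2))
    for agrees, m, u in comparisons:
        x += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
    return math.exp(x) / (1 + math.exp(x))


def prob_computational(comparisons, E, N1, N2):
    """Computational form: multiply odds directly, with no logs or powers."""
    odds = random_odds(E, N1, N2)
    for agrees, m, u in comparisons:
        odds *= (m / u) if agrees else ((1 - m) / (1 - u))
    return odds / (1 + odds)


# First name agrees, last name disagrees; file sizes are made up.
pair = [(True, 0.98, 0.00001), (False, 0.96, 0.00003)]
args = (pair, 1000, 10000, 2000)
p2, pe, pc = prob_base2(*args), prob_natural(*args), prob_computational(*args)
print(p2, pe, pc)
```

All three functions return the same probability up to floating-point rounding, which illustrates why the choice of log base does not affect the result and why the logs can be skipped entirely in the computational form.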

That’s nice, but ….. • All right. But apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, the fresh water system, and public health… What have the Romans ever done for us? --- Reg, spokesman for the People’s Front of Judea Monty Python Life of Brian (and Martin White, UC Berkeley)
