1 / 19

# The Conditional Independence Assumption in Probabilistic Record Linkage Methods - PowerPoint PPT Presentation

The Conditional Independence Assumption in Probabilistic Record Linkage Methods. Stephen Sharp National Records of Scotland Ladywell Road Edinburgh EH12 7TF [email protected] The record linkage problem.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about ' The Conditional Independence Assumption in Probabilistic Record Linkage Methods' - simeon

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### The Conditional Independence Assumption in Probabilistic Record Linkage Methods

Stephen Sharp

National Records of Scotland

Edinburgh EH12 7TF

• Given two files A and B, the aim is to find record pairs which refer to the same person.

• This is done on the basis of linking fields common to the two files such as first name, last name, date of birth and postcode

• The data matrix therefore looks like

What is the assumption of conditional independence? Record Linkage Methods

• The likelihood that the two records refer to the same person is measured by a log likelihood ratio

What is the assumption of conditional independence? Record Linkage Methods

• This is much easier to work out if the observations are independent conditional on match status because now

Why is the assumption of conditional independence important? Record Linkage Methods

• It keeps the numbers of parameters manageable – linear rather than exponential relation to the number of linking fields

• Enables the use of frequency based agreement weights

• Speeds up computing time

• Improves stability of parameter estimation

• But is almost always wrong e.g. gender is almost wholly predictable from first name

• But does it matter?

• Rec Link (US Census Bureau) – yes

• Link Plus (US Centers for Disease Control and Prevention) – yes

• GRLS/Fundy (Statistics Canada) – yes

• ORLS – yes (probably)

• RELAIS (Italian Statistical Institute) - no

• To what extent is the assumption violated in real data sets?

• How much effect does it have on the output of linkage software?

What does the assumption look like in practice? Record Linkage MethodsA = Agree D = DisagreeM = Match N = Non-match

• Run 1 – Rec Link - a 10% sample of the 2001 Scottish Census and the 2001 census coverage survey – one blocking field and seven linkage fields

• Run 2 – Link Plus – a sample of the Scottish NHSCR data base and HESA records of Scottish students studying in England or Wales

Run 1 - tetrachoric correlations for Record Linkage Methodsmatches in the Census/CCS data – medium linkage scores only

Run 1 - tetrachoric correlations for Record Linkage Methodsnon-matches in the Census/CCS data – medium linkage scores only

Run 2 - tetrachoric correlations for Record Linkage Methodsmatches in the NHSCR/HESA data – medium linkage scores only

Run 2 - tetrachoric correlations for Record Linkage Methodsnon-matches in the NHSCR/HESA data – medium linkage scores only

So the assumption of independence is significantly violated. Does it matter?

• Runs 3, 4 and 5. All using the census/CCS data and with Link Plus but different treatments of the date of birth

• Run 3 – specific to date format treating the date as one field (so not assuming independence) but with “intelligence”

• Run 4 – day, month and year treated as three separate fields (and therefore as independent)

• Run 5 – day, month and year concatenated and treated as one field (so not assuming independence) but with no “intelligence”

Conclusions Does it matter?

• Work in progress and limited amounts of data currently available

• No evidence that the assumption of conditional independence has negative effects on output quality

• Future intentions include bringing in more packages such as RELAIS v2.2 and wider variety of data sets where training data is available

• For the moment, any views on the methods used and/or findings so far?

### The Conditional Independence Assumption in Probabilistic Record Linkage Methods

Stephen Sharp

National Records of Scotland