This presentation is the property of its rightful owner.
1 / 19

The Conditional Independence Assumption in Probabilistic Record Linkage Methods PowerPoint PPT Presentation

The Conditional Independence Assumption in Probabilistic Record Linkage Methods. Stephen Sharp National Records of Scotland Ladywell Road Edinburgh EH12 7TF stephen.sharp@gro-scotland.gsi.gov.uk. The record linkage problem.

The Conditional Independence Assumption in Probabilistic Record Linkage Methods

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

The Conditional Independence Assumption in Probabilistic Record Linkage Methods

Stephen Sharp

National Records of Scotland

Edinburgh EH12 7TF

stephen.sharp@gro-scotland.gsi.gov.uk

• Given two files A and B, the aim is to find record pairs which refer to the same person.

• This is done on the basis of linking fields common to the two files such as first name, last name, date of birth and postcode

• The data matrix therefore looks like

What is the assumption of conditional independence?

• The likelihood that the two records refer to the same person is measured by a log likelihood ratio

What is the assumption of conditional independence?

• This is much easier to work out if the observations are independent conditional on match status because now

Why is the assumption of conditional independence important?

• It keeps the numbers of parameters manageable – linear rather than exponential relation to the number of linking fields

• Enables the use of frequency based agreement weights

• Speeds up computing time

• Improves stability of parameter estimation

• But is almost always wrong e.g. gender is almost wholly predictable from first name

• But does it matter?

Who adopts the conditional independence assumption?

• Rec Link (US Census Bureau) – yes

• Link Plus (US Centers for Disease Control and Prevention) – yes

• GRLS/Fundy (Statistics Canada) – yes

• ORLS – yes (probably)

• RELAIS (Italian Statistical Institute) - no

Two questions

• To what extent is the assumption violated in real data sets?

• How much effect does it have on the output of linkage software?

Calculating the correlations between linkage fields

• Run 1 – Rec Link - a 10% sample of the 2001 Scottish Census and the 2001 census coverage survey – one blocking field and seven linkage fields

• Run 2 – Link Plus – a sample of the Scottish NHSCR data base and HESA records of Scottish students studying in England or Wales

So the assumption of independence is significantly violated. Does it matter?

• Runs 3, 4 and 5. All using the census/CCS data and with Link Plus but different treatments of the date of birth

• Run 3 – specific to date format treating the date as one field (so not assuming independence) but with “intelligence”

• Run 4 – day, month and year treated as three separate fields (and therefore as independent)

• Run 5 – day, month and year concatenated and treated as one field (so not assuming independence) but with no “intelligence”

Conclusions

• Work in progress and limited amounts of data currently available

• No evidence that the assumption of conditional independence has negative effects on output quality

• Future intentions include bringing in more packages such as RELAIS v2.2 and wider variety of data sets where training data is available

• For the moment, any views on the methods used and/or findings so far?

The Conditional Independence Assumption in Probabilistic Record Linkage Methods

Stephen Sharp

National Records of Scotland

Edinburgh EH12 7TF

stephen.sharp@gro-scotland.gsi.gov.uk