The conditional independence assumption in probabilistic record linkage methods

Stephen Sharp

National Records of Scotland

Ladywell Road

Edinburgh EH12 7TF

stephen.sharp@gro-scotland.gsi.gov.uk


The record linkage problem

  • Given two files A and B, the aim is to find record pairs which refer to the same person.

  • This is done on the basis of linking fields common to the two files, such as first name, last name, date of birth and postcode.

  • The data matrix therefore looks like


With four linking fields
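A data matrix of this kind can be sketched as follows. This is only an illustration: the two toy files, the field names and the exact-equality comparison are all assumptions made for the example, not data or rules from the talk. Each row is one record pair and each column records whether one linking field agrees (1) or disagrees (0).

```python
from itertools import product

# Hypothetical field names and toy records, invented for illustration only.
FIELDS = ["first_name", "last_name", "date_of_birth", "postcode"]

file_a = [
    {"first_name": "ANNE", "last_name": "SMITH",
     "date_of_birth": "1970-01-31", "postcode": "AB1 2CD"},
    {"first_name": "JOHN", "last_name": "BROWN",
     "date_of_birth": "1985-06-02", "postcode": "XY9 8ZW"},
]
file_b = [
    {"first_name": "ANNE", "last_name": "SMYTH",
     "date_of_birth": "1970-01-31", "postcode": "AB1 2CD"},
]


def comparison_vector(rec_a, rec_b):
    """Return one 0/1 agreement indicator per linking field (exact match)."""
    return tuple(int(rec_a[f] == rec_b[f]) for f in FIELDS)


# One row per candidate record pair (every record in A against every record in B).
data_matrix = [comparison_vector(a, b) for a, b in product(file_a, file_b)]

for row in data_matrix:
    print(row)
```

Real linkage software usually restricts the pairs compared by blocking and may use approximate string comparison rather than exact equality.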


What is the assumption of conditional independence?

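  • The likelihood that the two records refer to the same person is measured by a log likelihood ratio, which compares the probability of the observed agreement pattern among matches with its probability among non-matches.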
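In the usual notation for probabilistic linkage the ratio can be written as below. The symbols are assumptions chosen for this sketch: \gamma = (\gamma_1, \dots, \gamma_k) is the agreement pattern over the k linking fields, and M and N denote match and non-match, following the abbreviations used later in the deck.

```latex
W(\gamma) \;=\; \log \frac{P(\gamma \mid M)}{P(\gamma \mid N)},
\qquad \gamma = (\gamma_1, \dots, \gamma_k)
```

Large positive values of W favour a match and large negative values favour a non-match.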


What is the assumption of conditional independence?

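  • This is much easier to work out if the observations are independent conditional on match status, because the probability of the whole agreement pattern is then the product of the per-field probabilities, and the log likelihood ratio becomes a sum of per-field weights.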
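A minimal sketch of the calculation under conditional independence is given below. The per-field probabilities are hypothetical values chosen for illustration, not estimates from the talk, and the natural logarithm is an arbitrary choice of base.

```python
import math

# Hypothetical probabilities, invented for illustration:
#   m = P(field agrees | match), u = P(field agrees | non-match)
M_PROBS = {"first_name": 0.95, "last_name": 0.90,
           "date_of_birth": 0.97, "postcode": 0.85}
U_PROBS = {"first_name": 0.02, "last_name": 0.01,
           "date_of_birth": 0.003, "postcode": 0.001}


def field_weight(field, agrees):
    """Log likelihood ratio contributed by a single linking field."""
    m, u = M_PROBS[field], U_PROBS[field]
    if agrees:
        return math.log(m / u)
    return math.log((1 - m) / (1 - u))


def total_weight(pattern):
    """Under conditional independence the per-field weights simply add."""
    return sum(field_weight(field, agrees) for field, agrees in pattern.items())


example_pattern = {"first_name": True, "last_name": True,
                   "date_of_birth": False, "postcode": True}
print(round(total_weight(example_pattern), 3))
```

Without the independence assumption, the probability of every complete agreement pattern would have to be estimated directly for matches and for non-matches.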


Why is the assumption of conditional independence important?

  • It keeps the number of parameters manageable – it grows linearly rather than exponentially with the number of linking fields

  • It enables the use of frequency-based agreement weights

  • It reduces computing time

  • It improves the stability of parameter estimation

  • But it is almost always wrong, e.g. gender is almost wholly predictable from first name

  • But does it matter?
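The point about parameter counts can be illustrated with a short calculation. It assumes each linking field is scored simply as agree or disagree, and it counts only the agreement-pattern probabilities for the two classes (matches and non-matches), leaving out the overall match proportion.

```python
def params_independent(k):
    """One agreement probability per field, for each of the two classes."""
    return 2 * k


def params_saturated(k):
    """One probability per agreement pattern (2**k of them) in each class,
    minus one per class because the probabilities must sum to 1."""
    return 2 * (2 ** k - 1)


for k in (4, 7, 10):
    print(k, params_independent(k), params_saturated(k))
```

With four fields the counts are 8 against 30; with seven fields they are 14 against 254.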


Who adopts the conditional independence assumption?

  • Rec Link (US Census Bureau) – yes

  • Link Plus (US Centers for Disease Control and Prevention) – yes

  • GRLS/Fundy (Statistics Canada) – yes

  • ORLS – yes (probably)

  • RELAIS (Italian Statistical Institute) – no


Two questions

  • To what extent is the assumption violated in real data sets?

  • How much effect does it have on the output of linkage software?


What does the assumption look like in practice?

A = Agree, D = Disagree, M = Match, N = Non-match


Calculating the correlations between linkage fields

  • Run 1 – Rec Link – a 10% sample of the 2001 Scottish Census and the 2001 Census Coverage Survey (CCS), with one blocking field and seven linkage fields

  • Run 2 – Link Plus – a sample of the Scottish NHSCR database and HESA records of Scottish students studying in England or Wales
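As a rough illustration of how the dependence between two agreement indicators can be measured, the sketch below uses the classical "cos-pi" approximation to the tetrachoric correlation of a 2×2 table. This is a swapped-in shortcut for illustration: the talk does not say how its tetrachoric correlations were computed, and a full estimate would fit a bivariate normal distribution to the table. The cell counts in the example are invented.

```python
import math


def tetrachoric_cos_pi(a, b, c, d):
    """Cos-pi approximation to the tetrachoric correlation.

    a = both fields agree, b = only the first agrees,
    c = only the second agrees, d = both disagree.
    """
    if a * d == 0 and b * c == 0:
        raise ValueError("degenerate table: correlation is undefined")
    if b * c == 0:
        return 1.0   # limiting case: infinite odds ratio
    if a * d == 0:
        return -1.0  # limiting case: zero odds ratio
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1 + math.sqrt(odds_ratio)))


# Hypothetical counts of record pairs, for illustration only.
print(round(tetrachoric_cos_pi(40, 10, 10, 40), 3))
```

A value near zero is what conditional independence predicts for a pair of fields within the matches or within the non-matches.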


Run 1 - tetrachoric correlations for matches in the Census/CCS data – medium linkage scores only


Run 1 - tetrachoric correlations for non-matches in the Census/CCS data – medium linkage scores only


Run 2 - tetrachoric correlations for matches in the NHSCR/HESA data – medium linkage scores only


Run 2 - tetrachoric correlations for non-matches in the NHSCR/HESA data – medium linkage scores only


So the assumption of independence is significantly violated. Does it matter?

  • Runs 3, 4 and 5 all use the Census/CCS data and Link Plus, but with different treatments of the date of birth

  • Run 3 – a comparison specific to the date format, treating the date as one field (so not assuming independence) but with “intelligence”

  • Run 4 – day, month and year treated as three separate fields (and therefore as independent)

  • Run 5 – day, month and year concatenated and treated as one field (so not assuming independence) but with no “intelligence”
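The three treatments of date of birth can be sketched roughly as follows. The first two functions mirror the descriptions of runs 4 and 5. The third is only a hypothetical stand-in for the date "intelligence": the talk does not describe what Link Plus actually does, so the day/month transposition rule here is an assumption made purely for illustration.

```python
from datetime import date


def compare_separate(a, b):
    """Run 4 style: day, month and year compared as three separate fields."""
    return (int(a.day == b.day), int(a.month == b.month), int(a.year == b.year))


def compare_concatenated(a, b):
    """Run 5 style: one all-or-nothing comparison of the concatenated date."""
    return int(a.strftime("%d%m%Y") == b.strftime("%d%m%Y"))


def compare_with_swap(a, b):
    """Hypothetical 'intelligent' comparison (an assumption, not Link Plus's rule):
    a day/month transposition counts as partial agreement."""
    if a == b:
        return "agree"
    if a.year == b.year and a.day == b.month and a.month == b.day:
        return "partial"
    return "disagree"


dob_a = date(1970, 3, 4)   # 4 March 1970
dob_b = date(1970, 4, 3)   # 3 April 1970, day and month transposed
print(compare_separate(dob_a, dob_b))
print(compare_concatenated(dob_a, dob_b))
print(compare_with_swap(dob_a, dob_b))
```

In the separate-field treatment the transposed date still earns credit for the year, whereas the concatenated treatment scores it as a plain disagreement.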


Is run 4 worse than runs 3 and 5?


Run 6 – the Clackmannanshire data


Conclusions

  • This is work in progress, and only limited amounts of data are currently available

  • No evidence that the assumption of conditional independence has negative effects on output quality

  • Future intentions include bringing in more packages, such as RELAIS v2.2, and a wider variety of data sets for which training data are available

  • For the moment, any views on the methods used and/or findings so far?

