The conditional independence assumption in probabilistic record linkage methods
This presentation is the property of its rightful owner.
Sponsored Links
1 / 19

The Conditional Independence Assumption in Probabilistic Record Linkage Methods PowerPoint PPT Presentation


  • 40 Views
  • Uploaded on
  • Presentation posted in: General

The Conditional Independence Assumption in Probabilistic Record Linkage Methods. Stephen Sharp National Records of Scotland Ladywell Road Edinburgh EH12 7TF [email protected] The record linkage problem.

Download Presentation

The Conditional Independence Assumption in Probabilistic Record Linkage Methods

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


The conditional independence assumption in probabilistic record linkage methods

The Conditional Independence Assumption in Probabilistic Record Linkage Methods

Stephen Sharp

National Records of Scotland

Ladywell Road

Edinburgh EH12 7TF

[email protected]


The record linkage problem

The record linkage problem

  • Given two files A and B, the aim is to find record pairs which refer to the same person.

  • This is done on the basis of linking fields common to the two files such as first name, last name, date of birth and postcode

  • The data matrix therefore looks like


With four linking fields

With four linking fields


What is the assumption of conditional independence

What is the assumption of conditional independence?

  • The likelihood that the two records refer to the same person is measured by a log likelihood ratio


What is the assumption of conditional independence1

What is the assumption of conditional independence?

  • This is much easier to work out if the observations are independent conditional on match status because now


Why is the assumption of conditional independence important

Why is the assumption of conditional independence important?

  • It keeps the numbers of parameters manageable – linear rather than exponential relation to the number of linking fields

  • Enables the use of frequency based agreement weights

  • Speeds up computing time

  • Improves stability of parameter estimation

  • But is almost always wrong e.g. gender is almost wholly predictable from first name

  • But does it matter?


Who adopts the conditional independence assumption

Who adopts the conditional independence assumption?

  • Rec Link (US Census Bureau) – yes

  • Link Plus (US Centers for Disease Control and Prevention) – yes

  • GRLS/Fundy (Statistics Canada) – yes

  • ORLS – yes (probably)

  • RELAIS (Italian Statistical Institute) - no


Two questions

Two questions

  • To what extent is the assumption violated in real data sets?

  • How much effect does it have on the output of linkage software?


What does the assumption look like in practice a agree d disagree m match n non match

What does the assumption look like in practice?A = Agree D = DisagreeM = Match N = Non-match


Calculating the correlations between linkage fields

Calculating the correlations between linkage fields

  • Run 1 – Rec Link - a 10% sample of the 2001 Scottish Census and the 2001 census coverage survey – one blocking field and seven linkage fields

  • Run 2 – Link Plus – a sample of the Scottish NHSCR data base and HESA records of Scottish students studying in England or Wales


Run 1 tetrachoric correlations for matches in the census ccs data medium linkage scores only

Run 1 - tetrachoric correlations for matches in the Census/CCS data – medium linkage scores only


Run 1 tetrachoric correlations for non matches in the census ccs data medium linkage scores only

Run 1 - tetrachoric correlations for non-matches in the Census/CCS data – medium linkage scores only


Run 2 tetrachoric correlations for matches in the nhscr hesa data medium linkage scores only

Run 2 - tetrachoric correlations for matches in the NHSCR/HESA data – medium linkage scores only


Run 2 tetrachoric correlations for non matches in the nhscr hesa data medium linkage scores only

Run 2 - tetrachoric correlations for non-matches in the NHSCR/HESA data – medium linkage scores only


So the assumption of independence is significantly violated does it matter

So the assumption of independence is significantly violated. Does it matter?

  • Runs 3, 4 and 5. All using the census/CCS data and with Link Plus but different treatments of the date of birth

  • Run 3 – specific to date format treating the date as one field (so not assuming independence) but with “intelligence”

  • Run 4 – day, month and year treated as three separate fields (and therefore as independent)

  • Run 5 – day, month and year concatenated and treated as one field (so not assuming independence) but with no “intelligence”


Is run 4 worse than runs 3 and 5

Is run 4 worse than runs 3 and 5?


Run 6 the clackmannanshire data

Run 6 – the Clackmannanshire data


Conclusions

Conclusions

  • Work in progress and limited amounts of data currently available

  • No evidence that the assumption of conditional independence has negative effects on output quality

  • Future intentions include bringing in more packages such as RELAIS v2.2 and wider variety of data sets where training data is available

  • For the moment, any views on the methods used and/or findings so far?


The conditional independence assumption in probabilistic record linkage methods1

The Conditional Independence Assumption in Probabilistic Record Linkage Methods

Stephen Sharp

National Records of Scotland

Ladywell Road

Edinburgh EH12 7TF

[email protected]


  • Login