The conditional independence assumption in probabilistic record linkage methods



Stephen Sharp

National Records of Scotland

Ladywell Road

Edinburgh EH12 7TF

[email protected]


The record linkage problem

  • Given two files A and B, the aim is to find record pairs which refer to the same person.

  • This is done on the basis of linking fields common to the two files such as first name, last name, date of birth and postcode

  • The data matrix therefore looks like


With four linking fields
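The matrix shown on this slide is not preserved in the transcript. As a rough sketch of the idea, assuming simple exact-match comparison on the four example fields named above, each record pair can be reduced to a row of agree/disagree indicators. The field names, the helper functions `compare` and `data_matrix`, and the records are all made up for illustration.

```python
# Hypothetical field names, taken from the examples on the previous slide.
FIELDS = ("first_name", "last_name", "date_of_birth", "postcode")

def compare(rec_a, rec_b):
    """Return the agreement pattern: 1 where a field agrees exactly, else 0."""
    return tuple(int(rec_a[f] == rec_b[f]) for f in FIELDS)

def data_matrix(file_a, file_b):
    """One row per record pair (i, j), one column per linking field."""
    return {
        (i, j): compare(a, b)
        for i, a in enumerate(file_a)
        for j, b in enumerate(file_b)
    }

# Made-up records, for illustration only.
file_a = [
    {"first_name": "ann", "last_name": "smith",
     "date_of_birth": "1980-01-02", "postcode": "EH1 1AA"},
]
file_b = [
    {"first_name": "ann", "last_name": "smyth",
     "date_of_birth": "1980-01-02", "postcode": "EH1 1AA"},
    {"first_name": "bob", "last_name": "jones",
     "date_of_birth": "1975-06-30", "postcode": "G1 2BB"},
]

matrix = data_matrix(file_a, file_b)
for pair, pattern in matrix.items():
    print(pair, pattern)
```

Real linkage packages usually restrict the pairs with a blocking field and use fuzzier comparisons than exact equality; neither is modelled here.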


What is the assumption of conditional independence?

  • The likelihood that the two records refer to the same person is measured by a log likelihood ratio
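The formula on this slide is not preserved in the transcript. In the standard Fellegi-Sunter formulation, which is assumed here, the ratio can be written as follows. The symbols are assumptions: γ is the vector of agree/disagree outcomes for a record pair over k linking fields, M denotes a match, N a non-match, and W is the resulting weight.

```latex
W(\gamma) = \log \frac{P(\gamma \mid M)}{P(\gamma \mid N)},
\qquad \gamma = (\gamma_1, \ldots, \gamma_k)
```

Large positive values of W favour a match, and large negative values favour a non-match.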


What is the assumption of conditional independence?

  • This is much easier to work out if the observations are independent conditional on match status, because the ratio then factorises into one term per linking field
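The slide's own formula is missing from the transcript. A sketch of the usual factorisation, assuming the standard Fellegi-Sunter notation (γ_i is the outcome on linking field i of k, M a match, N a non-match, W the log likelihood ratio for the whole pattern γ):

```latex
P(\gamma \mid M) = \prod_{i=1}^{k} P(\gamma_i \mid M),
\qquad
P(\gamma \mid N) = \prod_{i=1}^{k} P(\gamma_i \mid N)

\Longrightarrow\quad
W(\gamma) = \log \frac{P(\gamma \mid M)}{P(\gamma \mid N)}
          = \sum_{i=1}^{k} \log \frac{P(\gamma_i \mid M)}{P(\gamma_i \mid N)}
```

Each field therefore contributes its own additive weight, and only the per-field probabilities need to be estimated.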


Why is the assumption of conditional independence important?

  • It keeps the number of parameters manageable – linear rather than exponential in the number of linking fields

  • It enables the use of frequency-based agreement weights

  • Speeds up computing time

  • Improves stability of parameter estimation

  • But it is almost always wrong, e.g. gender is almost wholly predictable from first name

  • But does it matter?
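The points about parameter counts and additive weights can be made concrete with a small sketch. It assumes binary agree/disagree outcomes on each field and ignores the overall match proportion. The names `m` and `u` (the probability of agreement given a match and given a non-match) follow common Fellegi-Sunter usage rather than anything on the slides, and the numbers in the usage lines are invented.

```python
import math

def n_params_independent(k):
    """Free parameters under conditional independence: one agreement
    probability per field in each of the two classes (match, non-match)."""
    return 2 * k

def n_params_saturated(k):
    """Free parameters with no independence assumption: a full distribution
    over the 2**k agreement patterns (minus 1, as it sums to one) per class."""
    return 2 * (2 ** k - 1)

def match_weight(pattern, m, u):
    """Log2 likelihood ratio of a binary agreement pattern, summed field by
    field, which is only valid under conditional independence."""
    total = 0.0
    for agrees, m_i, u_i in zip(pattern, m, u):
        if agrees:
            total += math.log2(m_i / u_i)
        else:
            total += math.log2((1 - m_i) / (1 - u_i))
    return total

for k in (4, 7, 10):
    print(k, n_params_independent(k), n_params_saturated(k))

# Invented probabilities: two fields, both agreeing.
print(match_weight((1, 1), m=[0.5, 0.5], u=[0.25, 0.25]))
```

With seven linking fields the independence model needs 14 agreement probabilities, against 254 for the unrestricted model.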


Who adopts the conditional independence assumption?

  • Rec Link (US Census Bureau) – yes

  • Link Plus (US Centers for Disease Control and Prevention) – yes

  • GRLS/Fundy (Statistics Canada) – yes

  • ORLS – yes (probably)

  • RELAIS (Italian Statistical Institute) – no


Two questions

  • To what extent is the assumption violated in real data sets?

  • How much effect does it have on the output of linkage software?


What does the assumption look like in practice?

A = Agree, D = Disagree, M = Match, N = Non-match
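The table from this slide is not preserved. One way to picture the assumption, sketched here under the assumption of two binary fields cross-tabulated within a single class (matches only, or non-matches only), is to compare the observed proportion of pairs agreeing on both fields with the product of the two marginal agreement proportions. The function name and the counts are invented for illustration.

```python
def independence_gap(n_aa, n_ad, n_da, n_dd):
    """Observed P(both fields agree) minus the product of the marginal
    agreement proportions, within one class. Zero means the two fields
    behave as if independent in that class.

    n_aa: both agree, n_ad: only the first agrees,
    n_da: only the second agrees, n_dd: both disagree.
    """
    total = n_aa + n_ad + n_da + n_dd
    p_both = n_aa / total
    p_first = (n_aa + n_ad) / total
    p_second = (n_aa + n_da) / total
    return p_both - p_first * p_second

# Invented counts: the first table is exactly independent, the second is not.
print(independence_gap(25, 25, 25, 25))
print(independence_gap(40, 10, 10, 40))
```

A positive gap means the two fields agree together more often than independence would predict.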


Calculating the correlations between linkage fields

  • Run 1 – Rec Link – a 10% sample of the 2001 Scottish Census and the 2001 Census Coverage Survey (CCS) – one blocking field and seven linkage fields

  • Run 2 – Link Plus – a sample of the Scottish NHSCR database and HESA records of Scottish students studying in England or Wales
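The slides that follow report tetrachoric correlations between pairs of agree/disagree fields, but the transcript does not say how they were computed. The sketch below uses the well-known cosine-pi approximation rather than a full maximum-likelihood fit, so it illustrates the quantity and should not be read as the method behind the reported figures. The function name and counts are invented.

```python
import math

def tetrachoric_approx(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation of a 2x2 table.

    a: both fields agree, b: only the first agrees,
    c: only the second agrees, d: both disagree.
    Tables with an empty cell are degenerate and are mapped to +1 or -1.
    """
    if b == 0 or c == 0:
        return 1.0
    if a == 0 or d == 0:
        return -1.0
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1 + math.sqrt(odds_ratio)))

# Invented counts for one pair of fields within one class.
print(tetrachoric_approx(25, 25, 25, 25))  # independent table, close to 0
print(tetrachoric_approx(40, 10, 10, 40))  # positively associated fields
```

Under conditional independence the correlation within each class should be close to zero for every pair of fields.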


Run 1 – tetrachoric correlations for matches in the Census/CCS data – medium linkage scores only


Run 1 – tetrachoric correlations for non-matches in the Census/CCS data – medium linkage scores only


Run 2 – tetrachoric correlations for matches in the NHSCR/HESA data – medium linkage scores only


Run 2 – tetrachoric correlations for non-matches in the NHSCR/HESA data – medium linkage scores only


So the assumption of independence is significantly violated. Does it matter?

  • Runs 3, 4 and 5 – all using the Census/CCS data and Link Plus, but with different treatments of the date of birth

  • Run 3 – a comparison specific to the date format, treating the date as one field (so not assuming independence) but with “intelligence”

  • Run 4 – day, month and year treated as three separate fields (and therefore as independent)

  • Run 5 – day, month and year concatenated and treated as one field (so not assuming independence) but with no “intelligence”
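The difference between the Run 4 and Run 5 treatments can be sketched as follows. This is only an illustration with exact-match comparison and invented dates; it does not reproduce how Link Plus compares fields, and the "intelligent" date comparison of Run 3 is not modelled because the transcript does not describe it. Both function names are hypothetical.

```python
def compare_separate(dob_a, dob_b):
    """Run 4 style: day, month and year compared as three separate fields.
    Each date is a (day, month, year) tuple of integers."""
    return tuple(int(x == y) for x, y in zip(dob_a, dob_b))

def compare_concatenated(dob_a, dob_b):
    """Run 5 style: day, month and year concatenated into one string and
    compared as a single field."""
    text_a = "%02d%02d%04d" % dob_a
    text_b = "%02d%02d%04d" % dob_b
    return (int(text_a == text_b),)

# Invented dates differing only in the day.
a = (2, 1, 1980)
b = (3, 1, 1980)
print(compare_separate(a, b))      # month and year still agree
print(compare_concatenated(a, b))  # the single field simply disagrees
```

With separate fields, a one-component error still leaves two agreements, each adding its own weight as if independent. With one concatenated field, the same error yields a single disagreement.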




Conclusions

  • Work in progress and limited amounts of data currently available

  • No evidence that the assumption of conditional independence has negative effects on output quality

  • Future intentions include bringing in more packages such as RELAIS v2.2 and a wider variety of data sets where training data are available

  • For the moment, any views on the methods used and/or findings so far?

