Constructing an adolescence friendship network within the ALSPAC birth cohort using probabilistic record linkage techniques. Research Team. Simon Burgess (CMPO, Bristol) Eleanor Sanderson (CMPO, Bristol) Marcela Umaña (CMPO, Bristol) Andy Boyd (ALSPAC, Bristol). Study Rationale.

Constructing an adolescence friendship network within the ALSPAC birth cohort using probabilistic record linkage techniques.
Research Team
  • Simon Burgess (CMPO, Bristol)
  • Eleanor Sanderson (CMPO, Bristol)
  • Marcela Umaña (CMPO, Bristol)
  • Andy Boyd (ALSPAC, Bristol)
Study Rationale

Social Networks are ubiquitous and powerful

“The people with whom we interact… influence our beliefs, decisions and behaviours” Jackson 2010

The manner in which networks carry this influence depends in detail on the structure and characteristics of the network.

  • Examples of researching Networks
    • Add Health – Longitudinal survey of school children in the US. The questionnaire included a list of pupils in the school; respondents were asked to nominate their five best male and five best female friends
    • Others based around communication networks or other defined communities
  • Advantages of studying social networks in a cohort study:
    • Extensive phenotype and genotype data, and extensive linkage data

  • Advantages of studying social networks in ALSPAC:
    • Regional catchment area, narrow age range of participants (18 month age range, 3 school years)
  • Disadvantages:
    • Only the study participant is asked to nominate their friends
Data Collection Methodology
  • School based (register based) method not considered feasible
    • Cost
    • School Engagement
  • Questionnaire based alternative
    • Sent to participants still in compulsory education (age ~15-16)
    • Where the participant still lived in England
Data Collection Methodology
  • Asked the participant to nominate their 5 best friends, in no particular order
Linkage Objectives
  • To identify all unique individuals from the pool of nominated friends (de-duplication)
  • To identify which of the nominated friends are also eligible to participate in ALSPAC
Before we get to linkage…… there’s ethics
  • Seeking personal identifiers of participants’ friends was seen as contentious
  • Lawyers advised us that this is legal and within the bounds of the Data Protection Act (1998)
  • Personal identifiers to be used for statistical use only and pseudonymised prior to research use
Before we get to linkage…… there’s ethics
  • Once the nominated friends have been coded the personal identifiers cannot be used again.
  • No longitudinal follow up possible on the full data set, but it is possible on those linked to ALSPAC.
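
The slides do not say how the nominated friends were coded. As a minimal sketch of one common approach, assuming a keyed hash (HMAC-SHA256) over the cleaned identifiers, the step could look like the following. The function name `pseudonymise`, the key, and the example record are all hypothetical and made up for illustration.

```python
import hashlib
import hmac


def pseudonymise(identifiers: tuple, secret_key: bytes) -> str:
    """Return a stable pseudonym for one de-duplicated individual.

    Assumption: identifiers have already been cleaned and standardised,
    so the same person always yields the same tuple of strings.
    """
    # Join fields with a control character that cannot occur in a cleaned field.
    message = "\x1f".join(identifiers).encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return digest[:16]  # shortened code used in the research data set


# Hypothetical key and invented record, for illustration only.
key = b"hypothetical-secret-held-by-the-data-team"
friend = ("ABIGAIL", "ODRISCOLL", "1992-03-14", "F")
code = pseudonymise(friend, key)
```

Once the key and the raw identifiers are withheld from researchers, the code alone cannot be traced back to a name, which is in the spirit of the restriction described above.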
The Data
  • 3,123 participants returned a questionnaire
  • 14,500 nominated friends
  • Personal Identifiers include:
    • Name, Date of Birth, School, School year, gender
  • Phenotypic data includes:
    • How they met, duration of friendship, shared interests
Data Quality
  • Completeness of highly distinguishing personal identifiers
    • 14,414 nominated friends >=2 identifiers
    • 12,612 nominated friends >=3 identifiers
    • 6,215 nominated friends included all four identifiers
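
Completeness counts of this kind can be tabulated with a few lines of code. The sketch below is illustrative only: the slides do not say which four identifiers count as "highly distinguishing", so the field list is an assumption, and the records are invented.

```python
from collections import Counter

# Assumption: these are the four highly distinguishing identifiers.
FIELDS = ("name", "date_of_birth", "school", "school_year")


def n_identifiers(record: dict) -> int:
    """Count how many of the assumed identifier fields are filled in."""
    return sum(1 for field in FIELDS if record.get(field) not in (None, ""))


# Invented records, for illustration only.
records = [
    {"name": "ABI JONES", "date_of_birth": "", "school": "A", "school_year": 11},
    {"name": "ANDY SMITH", "date_of_birth": None, "school": None, "school_year": 11},
    {"name": "DREW BROWN", "date_of_birth": "1992-01-02", "school": "B", "school_year": 10},
]

counts = Counter(n_identifiers(r) for r in records)
# Cumulative view: how many records have at least k identifiers.
at_least = {k: sum(n for c, n in counts.items() if c >= k) for k in range(1, 5)}
```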
Data Quality
  • All data reported by a participant (age ~16) about their friends
    • Some of this will be unknown or prone to greater error, particularly date of birth and non-local schools
    • Names include many spelling errors
    • Names and school details include many abbreviations and familiar names
  • School names coded to National Pupil Database ‘Unique Record Number’ (using
  • Names converted to upper case
  • All spaces and symbols contained within a name removed:
    • O’Driscoll to ODRISCOLL
    • St.Claire to STCLAIRE
  • Names matched to a name Lexicon, compiled from:
    • NHS name lexicon
    • National Pupil Database
    • ALSPAC ‘known as’ names
    • “A Dictionary of First Names” Oxford University Press 2006
  • Non-matching names evaluated using Jaro string comparator metrics (assesses spelling differences, typos, keying errors, string lengths)
    • See Herzog, Scheuren and Winkler 2007
  • Name Lexicon examples:
    • Andrew, Andy, Andi, Drew all categorised to the same male group
    • Abigail, Abbie, Abi, Ab1 all categorised to the same group
  • When are two linked names not the same?
    • E.g. Should Abraham and Ibrahim be categorised together?
  • Names can be included in multiple groups (impacts on linkage evaluation)
  • Impact of Lexicon, unique values condensed into categories:
    • Forenames 2,108 into 1,339
    • Surnames 5,743 into 4,895
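
The cleaning steps above can be sketched in code. This is an illustration under stated assumptions, not the study's implementation: the tiny `LEXICON`, its group labels, and the 0.9 cut-off are hypothetical, and the slides do not say how the Jaro score was used once computed. The `jaro` function follows the standard definition of the Jaro similarity.

```python
def standardise(name: str) -> str:
    """Upper-case a name and drop spaces and symbols."""
    return "".join(ch for ch in name.upper() if ch.isalnum())


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings (1.0 means identical)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        stop = min(len(s2), i + window + 1)
        for j in range(start, stop):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len(s1) + matches / len(s2) + (matches - transpositions) / matches
    ) / 3


# Hypothetical miniature lexicon: a name may belong to several groups.
LEXICON = {
    "ANDREW": {"ANDREW_M"}, "ANDY": {"ANDREW_M"},
    "ANDI": {"ANDREW_M"}, "DREW": {"ANDREW_M"},
    "ABIGAIL": {"ABIGAIL"}, "ABBIE": {"ABIGAIL"},
    "ABI": {"ABIGAIL"}, "AB1": {"ABIGAIL"},
}


def lexicon_groups(raw_name: str, lexicon: dict, threshold: float = 0.9) -> set:
    """Look a name up in the lexicon, falling back to the closest Jaro match."""
    name = standardise(raw_name)
    if name in lexicon:
        return lexicon[name]
    best = max(lexicon, key=lambda known: jaro(name, known), default=None)
    if best is not None and jaro(name, best) >= threshold:
        return lexicon[best]
    return set()
```

For example, `lexicon_groups("Abigial", LEXICON)` falls through the exact lookup and is assigned to the same group as ABIGAIL because the misspelling scores above the assumed cut-off.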
Linkage Methodology
  • Used approach developed by Fellegi & Sunter (1969)

“Aim to simulate human reasoning by comparing each of several elements from the two records… from fundamental concepts of probability” Clark 2004

Estimating Match Weights
  • For a given field with match probability M and unmatch probability U
    • For an agreement:
      • Log(M/U)
    • For a disagreement:
      • Log((1-M)/(1-U))
    • Sum the weights across all matching comparisons (all the fields)
  • M-Probability: Probability that the identifier agrees given a true match
    • Based on assessment of the quality of the data (i.e. data entry errors, missing data but accounting for improvements due to cleaning and standardisation)
  • U-Probability: Probability that identifier agrees given that the records do not constitute a true match
    • Based on ‘Gold Standard’ of the existing ALSPAC – National Pupil Database linkage
    • Supported by data: 95% of nominated friends were described as being in education in the ALSPAC time period
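
The weight calculation can be illustrated with a short sketch. The M and U values below are invented for illustration and are not the study's estimates. The slides do not state the base of the logarithm, so base 2 is an assumption here, as is the choice to give a missing field a weight of zero.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """Weight added when a field agrees: log(M/U)."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Weight added when a field disagrees: log((1-M)/(1-U))."""
    return math.log2((1 - m) / (1 - u))


def match_score(comparison: dict, probabilities: dict) -> float:
    """Sum the field weights for one pair of records.

    comparison maps field -> True (agree), False (disagree) or None (missing).
    Assumption: a missing field contributes nothing to the score.
    """
    score = 0.0
    for field, agrees in comparison.items():
        if agrees is None:
            continue
        m, u = probabilities[field]
        score += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return score


# Invented (M, U) pairs, for illustration only.
PROBABILITIES = {
    "forename": (0.90, 0.02),
    "surname": (0.90, 0.005),
    "date_of_birth": (0.80, 0.003),
    "school": (0.85, 0.05),
}
pair = {"forename": True, "surname": True, "date_of_birth": None, "school": False}
score = match_score(pair, PROBABILITIES)
```

A rare value agreeing (small U) earns a large positive weight, while a disagreement on a usually reliable field (large M) earns a large negative one.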
Stratification or ‘blocking’
  • Large number (14,500 x 14,500) of possibilities to evaluate
    • So we ‘blocked’ on identifiers with low discriminatory potential (gender, school year) and high potential (name, school)
    • Multiple iterations so as not to exclude cases which contained errors in the blocking identifiers
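
Blocking over several passes can be sketched as follows. The slides name the blocking identifiers but not how they were combined in each pass, so the two passes shown here are an assumption, and the records are invented. Only pairs that share a blocking key in at least one pass are kept for comparison.

```python
from collections import defaultdict
from itertools import combinations


def blocked_pairs(records: list, key_fields: tuple):
    """Yield index pairs of records that share the same blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        key = tuple(rec.get(field) for field in key_fields)
        if None in key:
            continue  # a record with a missing key field cannot be blocked in this pass
        blocks[key].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


def candidate_pairs(records: list, passes: list) -> set:
    """Union of the pairs found in every blocking pass."""
    pairs = set()
    for key_fields in passes:
        pairs.update(blocked_pairs(records, key_fields))
    return pairs


# Invented records: the second has a school-year error, the fourth no school.
records = [
    {"surname": "SMITH", "gender": "M", "school_year": 11, "school": "A"},
    {"surname": "SMITH", "gender": "M", "school_year": 10, "school": "A"},
    {"surname": "JONES", "gender": "F", "school_year": 11, "school": "B"},
    {"surname": "JONES", "gender": "F", "school_year": 11, "school": None},
]
# Hypothetical passes: one on low-discrimination fields, one on high.
passes = [("gender", "school_year"), ("surname", "school")]
pairs = candidate_pairs(records, passes)
```

The first two records disagree on school year and so are missed by the first pass, but the second pass recovers them, which is the reason for running more than one iteration.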
Manual Review
  • Evaluated a random selection of cases to determine thresholds for accepting a match as:
    • Definitely ‘true’ (including some false positives)
    • Definitely ‘false’ (excluding some true positives)
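
A two-threshold decision rule of this kind might look like the sketch below. The threshold values are invented; in the study they were set from the manually evaluated random sample.

```python
def classify(score: float, lower: float, upper: float) -> str:
    """Assign a scored pair to one of three outcomes."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return "match"          # accepted as 'true' without review
    if score <= lower:
        return "non-match"      # rejected as 'false' without review
    return "manual review"      # falls between the two thresholds


# Hypothetical thresholds and scores, for illustration only.
LOWER, UPPER = 2.0, 12.0
decisions = [classify(s, LOWER, UPPER) for s in (15.3, 7.1, -4.0)]
```

Moving the thresholds further apart reduces both kinds of error at the cost of a larger manual review workload.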
Manual Review
  • Cases with results between the two thresholds all manually reviewed
  • Data
    • 3,123 respondents
    • nominated 4.64 friends on average
    • 14,503 nominated friends
  • First Phase of Linkage
    • 11,327 individuals identified
  • Linkage to ALSPAC
    • 6,961 nominated friends linked
    • 4,572 individuals linked
Results: Network Structure
  • Total Network
    • 13,056 individuals in total (1,394 respondents are also nominated as a friend)

  • 50% of nominations are to someone in ALSPAC
    • 12% of nominations are to someone who is also a respondent to the friendship questionnaire
Results: Network Structure
  • Largest component contains 2/3 of the individuals in the network
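
The share of individuals in the largest component can be computed from the list of nominations. The sketch below treats each nomination as an undirected tie, which is an assumption, and uses an invented toy network rather than the study data.

```python
from collections import defaultdict, deque


def components(edges: list) -> list:
    """Return the connected components of an undirected graph as sets of nodes."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    result = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        result.append(component)
    return result


# Invented nominations (respondent, nominated friend), for illustration only.
nominations = [("r1", "f1"), ("r1", "f2"), ("r2", "f2"), ("r3", "f3")]
parts = components(nominations)
largest_share = max(len(c) for c in parts) / sum(len(c) for c in parts)
```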
Future Research
  • Structure of the network
  • Homophily
    • The tendency to establish relationships among people who share similar characteristics or attributes
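
One simple descriptive check for homophily is the share of ties that join two people with the same value of some attribute. The sketch below is only an illustration of that idea with invented data; it is not a method described in the slides, and the attribute used is hypothetical.

```python
def same_attribute_share(edges: list, attribute: dict):
    """Share of ties whose two ends have the same attribute value.

    Ties where either person's attribute is unknown are ignored.
    Returns None when no tie has both values known.
    """
    known = [(a, b) for a, b in edges if a in attribute and b in attribute]
    if not known:
        return None
    same = sum(attribute[a] == attribute[b] for a, b in known)
    return same / len(known)


# Invented ties and a hypothetical attribute, for illustration only.
ties = [("p1", "p2"), ("p1", "p3"), ("p2", "p4"), ("p3", "p5")]
gender = {"p1": "F", "p2": "F", "p3": "M", "p4": "F"}
share = same_attribute_share(ties, gender)
```

A share well above what random mixing would give suggests homophily on that attribute, although a formal analysis would need to account for the composition of the population.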
Future Research
  • Risk taking behaviour
  • Antisocial behaviour
  • Transition into Higher Education, Employment or unemployment
  • And many more…
Reflections on Linkage Process

Quality of the data determines the quality of the linkage

  • To reflect this the majority of time/resource was spent on data cleaning, standardisation and extensive manual verification
Reflections on Linkage Process

Establishing the weightings

  • Method not without problems, as it excludes privately educated pupils, who have different name frequencies
  • Weighting established on the national population, but ALSPAC is regionally clustered
  • Potential to use statistical approaches instead
Reflections on Linkage Process


  • While resource intensive, the methodology did allow the identification of a friendship network within ALSPAC
  • Little evidence to suggest that this was as ethically contentious from the cohort’s perspective as expected (based only on response rates and small numbers of complaints – further research into this would have been of interest)
Continuing Role of Linkage
  • Linkage to administrative records adds to the ALSPAC resource, providing new data that can be used in social network analysis
Thank You


Andy Boyd


  • Clark DE (2004) Practical introduction to record linkage for injury research. Injury Prevention 10, 186-191
  • Fellegi IP & Sunter AB (1969) A theory for record linkage. Journal of the American Statistical Association 64, 1183-1210
  • Herzog TN, Scheuren FJ and Winkler WE (2007) Data Quality and Record Linkage Techniques. New York: Springer.
  • Jackson M (2010) An overview of social networks and economic applications. In Handbook of social economics, edited by Benhabib J, Bisin A & Jackson M