Saad Sheikh, Department of Computer Science, University of Illinois at Chicago

Reconstructing Sibling Relationships from Genotyping Data





Saad Sheikh

Department of Computer Science

University of Illinois at Chicago

?

Brothers!

?

Reconstructing Sibling Relationships from Genotyping Data



Biological Motivation

  • Used in: conservation biology, animal management, molecular ecology, genetic epidemiology

  • Necessary for: estimating heritability of quantitative characters, characterizing mating systems and fitness.

Lemon sharks, Negaprion brevirostris

  • But: hard to sample parent/offspring pairs. Sampling cohorts of juveniles is easier

2 Brown-headed cowbird (Molothrus ater) eggs in a Blue-winged Warbler's nest



Basic Genetics

  • Gene

    • Unit of inheritance

  • Allele

    • Actual genetic sequence

  • Locus

    • Location of allele in entire genetic sequence

  • Diploid

    • 2 alleles at each locus


Diploid Siblings

Siblings: two children with the same parents

Question: given a set of children, find sibling groups

At each locus the child carries two alleles, one from the father and one from the mother:

father: (.../...), (a/b), (.../...), (.../...)

mother: (.../...), (c/d), (.../...), (.../...)

child: (.../...), (e/f), (.../...), (.../...)

[Diagram labels: allele, locus, recombination]


Microsatellites (STR)

Alleles (5’ end first):

  • #1: CACACACA

  • #2: CACACACACACA

  • #3: CACACACACACACA

Genotypes: 1/1, 2/2, 1/2, 1/3, 2/3, 3/3

  • Advantages:

    • Codominant (easy inference of genotypes and allele frequencies)

    • Many heterozygous alleles per locus

    • Possible to estimate other population parameters

    • Cheaper than SNPs

  • But:

    • Few loci

  • And:

    • Large families

    • Self-mating


Sibling Reconstruction Problem

Genotypes are written allele1/allele2.

Animal   Locus1   Locus2
1        1/2      11/22
2        1/3      33/44
3        1/4      33/55
4        1/3      77/66
5        1/3      33/44
6        1/3      33/77
7        1/5      88/22
8        1/6      22/22

Sibling Groups: {2, 4, 5, 6}, {1, 3}, {7, 8}

S={P1={2,4,5,6},P2={1,3},P3={7,8}}



Existing Methods


KINSHIP

David C. Queller and Keith F. Goodnight. Computer software for performing likelihood tests of pedigree relationship using genetic markers. Molecular Ecology, 8:1231–1234, 1999.

  • First software and likelihood measure for sibling/kinship reconstruction

  • Estimates a ratio of two likelihoods:

    • Primary vs. Null Hypothesis

  • Assumes Population Frequencies are known


Probability of sharing allele

  • R – Probability of alleles being identical by descent

    • Rp = Probability (Xp = Yp)

    • Rm = Probability (Xm = Ym)


Haploid Likelihood

  • Two individuals X =<X> and Y=<Y>

  • If X=Y

    • Likelihood = Pr(Drawing X) x Pr(X = Y)

    • = Px (R + (1-R) Px)

  • Otherwise

    • Likelihood = Pr(Drawing X) x Pr(X ≠ Y)

    • = Px (1-R) Py
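
The two haploid cases can be written as one small function. This is a minimal sketch, not the KINSHIP code: the name `haploid_likelihood` and the `freq` dictionary of population allele frequencies are illustrative assumptions, and the matching case is taken as Pr(drawing X) times R + (1-R)Px so that both cases carry the Px factor.

```python
def haploid_likelihood(x, y, r, freq):
    """Likelihood of observing alleles x and y for two haploid individuals.

    r is the probability that the two alleles are identical by descent;
    freq maps each allele to its (assumed known) population frequency.
    """
    px, py = freq[x], freq[y]
    if x == y:
        # Shared by descent (r), or drawn independently and equal by chance.
        return px * (r + (1 - r) * px)
    # Different alleles: not identical by descent, y drawn independently.
    return px * (1 - r) * py


freq = {"a": 0.25, "b": 0.75}            # hypothetical allele frequencies
print(haploid_likelihood("a", "a", 0.5, freq))
print(haploid_likelihood("a", "b", 0.5, freq))
```

With this form the likelihoods of all (X, Y) combinations sum to 1, which is a quick sanity check on the two cases.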


Diploid Individuals

  • Diploid Individuals X=<Xp/Xm>, Y =<Yp/Ym>

  • Assumptions

    • We know which alleles are mother's and father's

    • No Inbreeding

      • Likelihood = Likelihoodp x Likelihoodm

  • Loci are independent

    • Total Likelihood is a product of likelihoods across loci
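
Under these assumptions the pairwise likelihood is a product over the paternal side, the maternal side, and the loci, so a log-likelihood ratio is a sum of per-side terms. The sketch below assumes phased genotypes given as (paternal, maternal) tuples, one frequency dictionary per locus, and relatedness values below 1 (so no logarithm of zero); the function names and the full-sib setting Rp = Rm = 0.5 are illustrative choices, not taken from KINSHIP.

```python
import math


def side_likelihood(x, y, r, freq):
    """Haploid likelihood for one parental side at one locus."""
    px, py = freq[x], freq[y]
    return px * (r + (1 - r) * px) if x == y else px * (1 - r) * py


def pair_log_likelihood(gx, gy, rp, rm, freqs):
    """Sum of log-likelihoods over sides and loci (loci assumed independent)."""
    total = 0.0
    for (xp, xm), (yp, ym), freq in zip(gx, gy, freqs):
        total += math.log(side_likelihood(xp, yp, rp, freq))
        total += math.log(side_likelihood(xm, ym, rm, freq))
    return total


def log_ratio(gx, gy, freqs, primary=(0.5, 0.5), null=(0.0, 0.0)):
    """Log of the primary-vs-null likelihood ratio, e.g. full sibs vs unrelated."""
    return (pair_log_likelihood(gx, gy, *primary, freqs)
            - pair_log_likelihood(gx, gy, *null, freqs))


freqs = [{"1": 0.5, "2": 0.5}]                      # one locus, two alleles
print(math.exp(log_ratio([("1", "2")], [("1", "2")], freqs)))
```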



Calculating Likelihood

  • Population Frequencies: Pxm,Pxp,Pym,Pyp

  • Likelihoods:



Likelihood Ratios

  • Independent Likelihood is not very reliable or meaningful

  • Different Ratios => Different Loci

  • Ratio != Statistical Significance

  • Simulations used to determine P-values



Statistical Significance

  • Randomly generate an individual X using allele frequencies

  • Draw Y using Rm and Rp

    • First Allele: Copy X's allele with Probability Rm or vice versa

    • Second Allele: Copy X's allele with Probability Rp or vice versa

  • Draw a large number of such <X,Y> pairs

  • The value of the ratio that excludes 95% of such pairs is at P=0.05 significance
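
The simulation procedure can be sketched for a single locus as follows. Several details are assumptions made for illustration: Y copies X's allele with probability R and otherwise draws a fresh allele from the population frequencies, pairs are generated under the null relationship, and the threshold is the log-ratio at the 95th percentile of the simulated pairs. The function names are hypothetical.

```python
import math
import random


def side_likelihood(x, y, r, freq):
    px, py = freq[x], freq[y]
    return px * (r + (1 - r) * px) if x == y else px * (1 - r) * py


def log_ratio(x, y, primary, null, freq):
    """Log likelihood ratio (primary vs null) for one locus; x, y are allele pairs."""
    total = 0.0
    for xa, ya, r_primary, r_null in zip(x, y, primary, null):
        total += math.log(side_likelihood(xa, ya, r_primary, freq))
        total -= math.log(side_likelihood(xa, ya, r_null, freq))
    return total


def significance_threshold(freq, primary, null, n_pairs=2000, alpha=0.05, seed=0):
    """Log-ratio value exceeded by roughly a fraction alpha of simulated null pairs."""
    rng = random.Random(seed)
    alleles, weights = list(freq), list(freq.values())

    def draw():
        return rng.choices(alleles, weights=weights)[0]

    ratios = []
    for _ in range(n_pairs):
        x = (draw(), draw())
        # Assumption: copy X's allele with probability r, else draw from the population.
        y = tuple(xa if rng.random() < r else draw() for xa, r in zip(x, null))
        ratios.append(log_ratio(x, y, primary, null, freq))
    ratios.sort()
    return ratios[min(n_pairs - 1, int((1 - alpha) * n_pairs))]


freq = {"1": 0.5, "2": 0.5}
print(significance_threshold(freq, primary=(0.5, 0.5), null=(0.0, 0.0)))
```

With so few alleles the ratio takes only a handful of distinct values, which is one reason more loci and more alleles per locus give sharper tests.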


Family Finder

Jen Beyer and B. May. A graph-theoretic approach to the partition of individuals into full-sib families. Molecular Ecology, 12:2243–2250, 2003.



Graph-Theory?

  • Build a graph of all individuals

  • Connect individuals with edges representing relationships

  • Assign Likelihood Ratio Full Sib/Unrelated as distance measure

    • Filter using likelihood ratio at 0.05 significance level

  • Find a cut



Algorithm

  • Calculate LFS/LUR likelihood ratios for all pairs

  • Build a graph representing the full-sib relationships

  • Find the connected components in the graph and store them in a queue.

  • While the queue is not empty do

    • Remove a component from the queue and calculate its score.

    • Build a GH cut tree for the component.

    • For each cut with less than 1/3 the total number of edges in the component do

      • Score the components that would result if the cut's edges were removed.

      • If the scores are the best found so far, then store them.

    • If the best scores found are higher than the score for the original component, then separate the families and put them in the queue for further analysis.

    • Otherwise save the original component as a result family.
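
The first three steps of the algorithm (pairwise ratios, full-sib graph, connected components in a queue) can be sketched in plain Python. The cut-tree refinement and the scoring step are deliberately left out, and `pair_ratio` and `threshold` are hypothetical stand-ins for the LFS/LUR ratio and its significance cutoff.

```python
from collections import deque


def full_sib_components(n, pair_ratio, threshold):
    """Queue of connected components of the thresholded full-sib graph.

    n: number of individuals, labelled 0..n-1
    pair_ratio(i, j): likelihood ratio for the pair (assumed to be supplied)
    threshold: pairs with a ratio at or above this value get an edge
    """
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if pair_ratio(i, j) >= threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)

    seen, queue = set(), deque()
    for start in range(n):
        if start in seen:
            continue
        component, frontier = [], deque([start])
        seen.add(start)
        while frontier:                      # breadth-first search
            u = frontier.popleft()
            component.append(u)
            for v in sorted(adjacency[u]):
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
        queue.append(sorted(component))
    return queue


ratios = {(0, 1): 3.0, (1, 2): 2.5, (3, 4): 4.0}    # toy values
components = full_sib_components(5, lambda i, j: ratios.get((i, j), 0.1), 2.0)
print(list(components))
```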



Example

Score the components and Keep the best cuts



Conclusion – Family Finder

  • Some theoretical basis

  • Efficiently computable

  • Produces reasonably good results for many loci

  • A lot of assumptions because of Goodnight & Queller measure

  • Requires a significant number of loci - 8+

  • Works well only when families are almost equal size


Parsimony

  • Parsimony=Occam’s Razor

    • “entities must not be multiplied beyond necessity”

    • “plurality should not be posited without necessity”

  • “Parsimony is a 'less is better' concept of frugality, economy or caution in arriving at a hypothesis or course of action. The word derives from Middle English parcimony, from Latin parsimonia, from parsus, past participle of parcere: to spare. It is a general principle that has applications from science to philosophy and all related fields. Parsimony is essentially the implementation of Occam's razor.”

    • Wikipedia

  • Min Sib groups = Most Parsimonious explanation


    Mendelian Constraints

    4-allele rule: siblings have at most 4 different alleles in a locus

    Yes: 3/3, 1/3, 1/5, 1/6

    No: 3/3, 1/3, 1/5, 1/6, 3/2

    2-allele rule: in a locus in a sibling group, a + R ≤ 4, where

    a = number of distinct alleles

    R = number of alleles that appear with 3 others or are homozygous

    Yes: 3/3, 1/3, 1/5

    No: 3/3, 1/3, 1/5, 1/6
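
Both rules can be checked mechanically for one locus. The sketch below follows the counting rule as stated on the slide; reading "appear with 3 others" as "paired with at least three distinct other alleles" is an interpretation, and the function names are illustrative.

```python
def four_allele_ok(genotypes):
    """4-allele rule: at most 4 distinct alleles at this locus."""
    return len({allele for g in genotypes for allele in g}) <= 4


def two_allele_ok(genotypes):
    """2-allele rule at one locus: a + R <= 4."""
    alleles = {allele for g in genotypes for allele in g}
    partners = {allele: set() for allele in alleles}
    homozygous = set()
    for x, y in genotypes:
        if x == y:
            homozygous.add(x)
        else:
            partners[x].add(y)
            partners[y].add(x)
    a = len(alleles)
    # R counts alleles that are homozygous or paired with 3 other alleles.
    r = sum(1 for al in alleles if al in homozygous or len(partners[al]) >= 3)
    return a + r <= 4


print(four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6)]))          # slide: Yes
print(four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6), (3, 2)]))  # slide: No
print(two_allele_ok([(3, 3), (1, 3), (1, 5)]))                   # slide: Yes
print(two_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6)]))           # slide: No
```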



    Min Sibgroups Reconstruction

    • Find the minimum number of Sibling Groups necessary to explain the given cohort

    • Minimum Set Cover:

      • Cohort as universe U

      • Individuals as elements of U

      • Covering Groups C include all genetically feasible sibling groups

    • NP-complete even when sibling sets are known to have size at most 3

    • Hard to approximate (Ashley et al. 09)

    • ILP formulation (Chaovalitwongse et al. 08)


    Minimum Set Cover

    Given: universe U = {1, 2, …, n} and a collection of sets S = {S1, S2, …, Sm}, where each Si is a subset of U

    Find: the smallest number of sets in S whose union is the universe U

    Minimum Set Cover is NP-hard

    (1+ln n)-approximable (sharp)
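
The logarithmic approximation guarantee is achieved by the classical greedy heuristic, which repeatedly picks the set covering the most still-uncovered elements. A minimal sketch, with the example sets chosen to mirror the eight-animal cohort shown earlier (the two extra sets are made up):

```python
def greedy_set_cover(universe, sets):
    """Return indices of chosen sets; ties go to the lowest index."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(uncovered & sets[i]))
        if not uncovered & sets[best]:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


candidate_groups = [{2, 4, 5, 6}, {1, 3}, {7, 8}, {1, 2}, {3, 4}]
print(greedy_set_cover(range(1, 9), candidate_groups))
```

Greedy is fast but not optimal in general, which is why an exact formulation is still of interest for small cohorts.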


    2-Allele Min Set Cover

    • Generate all maximal feasible sibling groups (sets) that satisfy 2-allele property using “2-Allele Algorithm” [ISMB 2007; Bioinformatics 23(13)]

    • Use Min Set Cover to find the minimum sibling groups optimally using ILP (CPLEX)
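
For reference, the textbook integer linear program for Minimum Set Cover is shown below, with one binary variable x_j per candidate sibling group S_j. This is the standard formulation and is given only as an illustration; the exact model used in the cited work may differ.

```latex
\begin{aligned}
\min \quad & \sum_{j=1}^{m} x_j \\
\text{s.t.} \quad & \sum_{j \,:\, i \in S_j} x_j \ge 1 && \forall i \in U \\
& x_j \in \{0, 1\} && j = 1, \dots, m
\end{aligned}
```

Each constraint says that individual i must belong to at least one selected group, and the objective counts the selected groups.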


    2-Allele Algorithm Overview

    • Generate candidate sets by all pairs of individuals

    • Compare every set to every individual x

      • If x can be added to the set without affecting “accommodability” or violating 2-allele:

        • add it

      • If the “accommodability” is affected, but the 2-allele property is still satisfied:

        • create a new copy of the set, and add x to the copy

      • Otherwise ignore the individual, compare the next
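
A much simplified version of this enumeration is sketched below. It keeps only the 2-allele feasibility check (a + R ≤ 4 at every locus) and greedily extends each feasible pair; the "accommodability" test and the set-copying branch are not modelled, so it is not the published 2-Allele Algorithm and may miss maximal groups. The data layout and function names are assumptions; the genotypes are the eight animals from the problem example.

```python
from itertools import combinations


def two_allele_ok(genotypes):
    """a + R <= 4 at one locus (R: homozygous or paired with 3 other alleles)."""
    alleles = {al for g in genotypes for al in g}
    partners = {al: set() for al in alleles}
    homozygous = set()
    for x, y in genotypes:
        if x == y:
            homozygous.add(x)
        else:
            partners[x].add(y)
            partners[y].add(x)
    r = sum(1 for al in alleles if al in homozygous or len(partners[al]) >= 3)
    return len(alleles) + r <= 4


def feasible(group, data):
    """True if the group satisfies the 2-allele rule at every locus."""
    n_loci = len(next(iter(data.values())))
    return all(two_allele_ok([data[i][locus] for i in group])
               for locus in range(n_loci))


def grow_candidate_groups(data):
    """Start from every feasible pair and add individuals while feasible."""
    groups = set()
    for i, j in combinations(sorted(data), 2):
        if not feasible([i, j], data):
            continue
        group = [i, j]
        for x in sorted(data):
            if x not in group and feasible(group + [x], data):
                group.append(x)
        groups.add(frozenset(group))
    return groups


cohort = {
    1: [(1, 2), (11, 22)], 2: [(1, 3), (33, 44)], 3: [(1, 4), (33, 55)],
    4: [(1, 3), (77, 66)], 5: [(1, 3), (33, 44)], 6: [(1, 3), (33, 77)],
    7: [(1, 5), (88, 22)], 8: [(1, 6), (22, 22)],
}
for g in sorted(grow_candidate_groups(cohort), key=sorted):
    print(sorted(g))
```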


    Canonical families

    [Figure: canonical families, drawn as groups of genotypes over the alleles 1 to 4 (1/1, 1/2, 1/3, 1/4, 2/2, 2/3, 2/4, 3/3, 3/4, 4/4); the grouping itself is not recoverable from the transcript]


    Examples

    • Add

    • New Group Add (won’t accommodate (2/2))

    • Can’t add (a+R =4)

    [Figure: the individual 1/4 is shown once per case; genotypes shown in the sets: 3/4, 1/2, 3/2, 1/2, 3/2, 3/2, 1/1, 1/2, 1/5]


    Testing and Validation: Protocol

    • Get a dataset with known sibgroups (real or simulated)

    • Find sibgroups using our algorithm

    • Compare the solutions

      • Partition distance, Gusfield’03

    • Compare results to other sibship methods


    Real Data

    Salmon (Salmo salar) - Herbinger et al., 1999: 351 individuals, 6 families, 4 loci. No missing alleles

    Shrimp (Penaeus monodon) - Jerry et al., 2006: 59 individuals, 13 families, 7 loci. Some missing alleles

    Ants (Leptothorax acervorum) - Hammond et al., 1999: the ants dataset [16] is from a haplodiploid species. The data consists of 377 diploid worker ants


    Random Data Generation

    Generate F females and M males (F=M=5, 10, 15)

    Each with l loci (l=2, 4, 6)

    Each locus with a alleles: a[uniform]=5, 10, 15; a[nonuniform]=4 12-4-1-1

    Generate f families: f[uniform]=2, 5, 10; f[nonuniform]=5

    For each family select female+male uniformly at random

    For each parent pair generate o offspring: o[uniform]=2, 5, 10; o[nonuniform]=25-10-10-4-1

    For each offspring for each locus choose allele outcome uniformly at random
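
The uniform variant of this generation protocol can be sketched as below. The function name, parameter names, and the convention of storing each genotype as a (from father, from mother) tuple are illustrative assumptions; the nonuniform allele and family-size settings are not modelled.

```python
import random


def simulate_cohort(n_females=5, n_males=5, n_loci=2, n_alleles=5,
                    n_families=2, n_offspring=5, seed=0):
    """Return (cohort, truth): offspring genotypes and their true family labels."""
    rng = random.Random(seed)

    def random_parent():
        # One diploid genotype per locus, alleles drawn uniformly.
        return [(rng.randrange(n_alleles), rng.randrange(n_alleles))
                for _ in range(n_loci)]

    females = [random_parent() for _ in range(n_females)]
    males = [random_parent() for _ in range(n_males)]

    cohort, truth = [], []
    for family in range(n_families):
        # Parents are picked uniformly at random, so two families may
        # happen to share a parent (or even both parents).
        mother, father = rng.choice(females), rng.choice(males)
        for _ in range(n_offspring):
            child = [(rng.choice(f_locus), rng.choice(m_locus))
                     for f_locus, m_locus in zip(father, mother)]
            cohort.append(child)
            truth.append(family)
    return cohort, truth


cohort, truth = simulate_cohort(seed=1)
print(len(cohort), truth)
```

Because every child takes one allele from each parent, each simulated family satisfies the 4-allele rule by construction, which makes the output a convenient test bed for reconstruction code.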



    Results


    Summary (Min Sib Groups)

    • 2-Allele Min Set Cover

      • First combinatorial approach

      • Makes no assumptions other than parsimony

      • Works consistently and comparatively

    • Sibling Reconstruction

      • Growing number of methods

      • Biologists need (one) reliable reconstruction

      • Genotyping errors

    • Answer: Consensus


    Consensus Methods

    • Combine multiple solutions to a problem to generate one unified solution

      • C: S*→ S

      • Based on Social Choice Theory

      • Commonly used where the real solution is not known e.g. Phylogenetic Trees

    [Diagram labels: S1, S2, ..., Sk; Consensus; S]


    Strict Consensus

    • Only Pareto Optimality and Anti-Pareto Optimality are enforced

      • All solutions must agree on equivalence

    • All disputed individuals go to singletons

    x ≡S y ⇔ ∀i: x ≡Si y

    S1={{1,2,3},{4,5},{6,7}}

    S2={{1,2,3,4},{5,6,7}}

    S3={{1,2},{3,4,5},{6,7}}

    Strict Consensus: S={{1,2},{3},{4},{5},{6,7}}

    5 Sibling Groups? When 3 can do?
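
Strict consensus keeps two individuals together only if every input solution does, which amounts to grouping individuals by their tuple of group labels across solutions. A minimal sketch using the three solutions from the slide (the function name is illustrative):

```python
def strict_consensus(solutions):
    """Common refinement of several partitions (each a list of sets)."""
    individuals = sorted(set().union(*[g for s in solutions for g in s]))
    # For each solution, map every individual to the index of its group.
    labels = [{x: idx for idx, group in enumerate(s) for x in group}
              for s in solutions]
    groups = {}
    for x in individuals:
        signature = tuple(lab[x] for lab in labels)
        groups.setdefault(signature, []).append(x)
    return sorted(groups.values())


S1 = [{1, 2, 3}, {4, 5}, {6, 7}]
S2 = [{1, 2, 3, 4}, {5, 6, 7}]
S3 = [{1, 2}, {3, 4, 5}, {6, 7}]
print(strict_consensus([S1, S2, S3]))
```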


    Majority Consensus

    • Majority of solutions determine the final solution

      • Two individuals are together if a majority of solutions vote in their favour

      • Violates Transitivity: A≡B∧B≡C⇒A≡C

    S1={{1,2,3},{4,5},{6,7}}

    S2={{1,2,3,4},{5,6,7}}

    S3={{1,2},{3,4,5},{6,7}}

    1 ≡ 3 AND 3 ≡ 4 BUT 1 ≢ 4
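
The transitivity violation is easy to reproduce by computing the majority relation pair by pair. In this sketch (function name illustrative) a pair is kept when more than half of the solutions place the two individuals in the same group:

```python
from itertools import combinations


def majority_pairs(solutions):
    """Pairs of individuals that a strict majority of solutions keep together."""
    individuals = sorted(set().union(*[g for s in solutions for g in s]))
    pairs = set()
    for x, y in combinations(individuals, 2):
        votes = sum(any(x in g and y in g for g in s) for s in solutions)
        if 2 * votes > len(solutions):
            pairs.add((x, y))
    return pairs


S1 = [{1, 2, 3}, {4, 5}, {6, 7}]
S2 = [{1, 2, 3, 4}, {5, 6, 7}]
S3 = [{1, 2}, {3, 4, 5}, {6, 7}]
pairs = majority_pairs([S1, S2, S3])
print((1, 3) in pairs, (3, 4) in pairs, (1, 4) in pairs)
```

The relation pairs 1 with 3 and 3 with 4 but not 1 with 4, so it is not an equivalence relation and does not define sibling groups on its own.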


    Majority Consensus

    • Voting Consensus

      • Majority under closure

      • Results in large monolithic groups

    S1={{1,2,3},{4,5},{6,7}}

    S2={{1,2,3,4},{5,6,7}}

    S3={{1,2},{3,4,5},{6,7}}

    Voting Consensus: S={{1,2,3,4,5},{6,7}}

    1 ≡5?
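
Taking the transitive closure of the majority relation can be done with a small union-find structure. The sketch below (names illustrative) reproduces the monolithic group on the slide: individuals 1 and 5 end up together although only a chain of majority pairs links them.

```python
from itertools import combinations


def voting_consensus(solutions):
    """Majority relation closed under transitivity, returned as sorted groups."""
    individuals = sorted(set().union(*[g for s in solutions for g in s]))
    parent = {x: x for x in individuals}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for x, y in combinations(individuals, 2):
        votes = sum(any(x in g and y in g for g in s) for s in solutions)
        if 2 * votes > len(solutions):
            parent[find(x)] = find(y)        # merge the two groups

    groups = {}
    for x in individuals:
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())


S1 = [{1, 2, 3}, {4, 5}, {6, 7}]
S2 = [{1, 2, 3, 4}, {5, 6, 7}]
S3 = [{1, 2}, {3, 4, 5}, {6, 7}]
print(voting_consensus([S1, S2, S3]))
```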



    Consensus Methods

    • Commonly used consensus methods don’t work [AAAI-MPREF08]

      • Strict Consensus produces too many singletons

      • Majority violates transitivity AND doesn’t work for error-tolerance


    Distance-based Consensus

    • Algorithm

      • Compute a consensus solution S={g1,...,gk}

      • Search for a good solution near S

    [Diagram labels: S1, S2, ..., Sk; Consensus; S; Search; fd; fq; Ss]


    Distance-based Consensus

    • Needs

      • A Distance Function fd: S x S →R

      • A Quality Function fq: S → R

    • What is the Catch? [Sheikh et al. CSB 2008]

      • Optimization of fd, fq or an arbitrary linear combination is NP-Complete

      • Reduction from the 2-Allele Min Set Cover Problem



    A Greedy Approach

    • Algorithm

      • Compute a strict consensus

      • While distance is not too large

        • Merge two nearest sibgroups

    • Quality: fq=n-|C|

    • Distance Function

      • fd(C,C’)=cost of merging groups in C to obtain C’


    A Greedy Approach

    • S1 ={ {1,2,3}, {4,5}, {6,7} }

    • S2={ {1,2,3}, {4}, {5,6,7} }

    • S3={ {1,2}, {3,4,5}, {6,7} }

    Strict Consensus: S={ {1,2}, {3},{4},{5},{6,7} }

    After merging {3} and {6,7}: S={ {1,2}, {3,6,7},{4},{5} }


    Greedy Consensus

    • Distance Function (sibgroup, sibgroup)

      • Cost of assigning all individuals

        • fd(C,C’) = min( Σ_{X∈Pi} fassign(Pj,X), Σ_{X∈Pj} fassign(Pi,X) )

    • Distance Function (sibgroup, individual)

      • Benefit: Alleles and allele pairs shared

      • Cost: Minimum Edit Distance

        • fassign(Pi,X) = benefit, if X can be a member of Pi

        • fassign(Pi,X) = cost, if X cannot be a member of Pi



    Greedy Consensus

    • Algorithm

      • Compute a strict consensus

      • While distance is not too large

        • Merge two sibgroups which will minimize the TOTAL merging cost

        • Store the new merging cost in the merged set


    Error-Tolerant Approach

    [Diagram labels: Locus 1, Locus 2, Locus 3, ..., Locus k; Sibling Reconstruction Algorithm; S1, S2, ..., Sk; Consensus; S]


    Results

    • >90% accuracy for all real data



    Impossibility Result

    • A consensus method CANNOT be all of these [Arrow 1963,Mirkin 1975]

      • Fair

      • Independent

      • Pareto Optimal

    • Biologically [AAAI-MPREF 2008]

      • The subset of individuals chosen will impact the consensus considerably



    Problems

    • Parametric

    • Does NOT outperform other algorithms on:

      • Biological data

      • Smaller families

      • High Allele Frequencies



    Auto Greedy Consensus

    • Change costs to average per locus costs

    • Compare max group error on per locus basis

    • Treat cost and benefit independently

    • In order to qualify a merge

      • Cost <= maxcost

      • Benefit >= minbenefit

      • Benefit = max benefit among possible merges


    Results



    Summary (Consensus)

    • First consensus method for Sibship Reconstruction

      • Majority won’t work

    • First combinatorial approach for Error-Tolerant Sibship Reconstruction

      • Fewer Assumptions

      • More Efficient

    • Distance-based Consensus is NP-Hard

    • New non-parametric consensus



    Parsimony: Alternate Objectives

    • Min number of sibgroups is just ONE way to interpret parsimony

    • Alternate Objectives

      • Sibship that minimizes number of parents

        • Very Hard! Connection to Raz’s Parallel Repetition Theorem

      • Sibship that minimizes number of matings

      • Sibship that maximizes family size

      • Sibship that tries to satisfy uniform allele distributions



    Parsimony: Minimize Parents

    • Problem Statement:

      • Given a population U of individuals, partition the individuals into groups G such that the parents (mothers+fathers) necessary for G are minimized

    • Observations and Challenges:

      • MinParents: intractable, inapproximable

        • Reduction from Min-Rep Problem (Raz’s Parallel Repetition Theorem)

      • There may be O(2^|loci|) potential parents for a sibgroup

      • Self-mating (plants) may or may not be allowed



    Is MinParents = MinSibgroups?

    • Not Necessarily…


    Min Parents Meta Approach

    • Generate M a set of covering groups

    • Cover a subset S of covering groups

    • For each group x in S

      • Generate Parent Pairs for x

      • Insert parent vertices into graph G (if needed)

      • Connect the parents in each parent pair

    • Cover the minimum vertices necessary to (doubly) cover all the individuals

    Example:

    M={{1,2},{3,6,7},{3,5},{2,4},{1,6},{2,5},{6,7}}

    S={{1,2,4},{3,5},{6,7}}

    X={3,5}

    Parent pairs for X: {F=5/10, M=2/20}, {F=5/20, M=2/10}

    [Graph: vertices 5/10, 2/20, 5/20, 2/10; the edges 5/10 to 2/20 and 5/20 to 2/10 are each labelled X={3,5}]


    Covering Groups

    • Different approaches to selecting a subset of maximal feasible groups

      • Greedy Min Set Cover

      • K-Greedy Min Set Covers

      • All Sets! (Nearing optimality)

    • Forget maximal feasible sibling groups

      • Generate K random minimal feasible sibling reconstructions



    Generating Parents

    • The number of generated parents is just too many!

    • Mine Association Rules across loci

      {A,B}locus1 => {C,D}locus2

    • Use Association Rules to filter parents

      {A,B}locus1 => {C,D}locus2 OR{C’,D’}locus2

    • Polygamy=>High Confidence Association Rules

    • No Polygamy=>Min Parents=Min Groups

    • If self-mating is not allowed, odd-cycles must be disallowed



    Covering Vertices

    • Heuristic

      • While all vertices are not covered

        • Select the vertex that will cover the most uncovered individuals

    • MIP Formulation



    Results

    Legend:

    M1: k-greedy cover with optimal graph cover

    M2: greedy set cover with optimal graph cover

    M3: Randomized cover with optimal graph cover

    M4: k-greedy with graph heuristics

    M5: greedy set cover with graph heuristic




    Complexity Results

    The problem stays hard even if we know all the parents and just need to find the minimum set of parents to choose: the reduction is from a version of the parallel repetition theorem.

    But, what is the parallel repetition theorem?


    [Diagram: the label cover problem for bipartite graphs (linked by "restriction" to the unique games conjecture) corresponds to a 2-prover 1-round proof system and gives small inapproximability. Boosting (Raz’s parallel repetition theorem) leads to the label cover problem for some kind of “graph product” for bipartite graphs, which corresponds to parallel repetition of the 2-prover 1-round proof system and gives larger inapproximability.]



    We need some version of Raz’s parallel repetition theorem that is suitable for us

    Fortunately, the following two papers helped:

    U. Feige, A threshold of ln n for approximating set-cover,Journal of the ACM, 1998

    G. Kortsarz, R. Krauthgamer and J. R. Lee, Hardness of Approximating Vertex-Connectivity Network Design Problems, SIAM J. of Computing, 2004


    Inapproximability for MINREP (Raz’s parallel repetition theorem)

    Let L ∈ NP and x be an input instance of L. There is a reduction from L to MINREP running in O(n^polylog(n)) time such that:

    x ∈ L ⇒ OPT ≤ α+β

    x ∉ L ⇒ OPT ≥ (α+β) · 2^(log^(1−ε)(|A|+|B|))

    where 0 < ε < 1 is any constant


    MINREP (minimum representative) problem

    Input graph G on node sets A and B:

    • A is split into α partitions (A1, A2, ...), all of equal size; these are the A “super”-nodes

    • B is split into β partitions (B1, B2, B3, ...), all of equal size; these are the B “super”-nodes

    Associated “super”-graph H:

    (A1,B2) ∈ H if ∃ u ∈ A1 and v ∈ B2 such that (u,v) ∈ G

    In this case, edge (u,v) ∈ G is a witness of the super-edge (A1,B2) ∈ H


    MINREP goal

    Valid solution:

    A’ ⊆ A and B’ ⊆ B such that

    A’ ∪ B’ contains a witness for every super-edge

    Objective: minimize the size of the solution |A’ ∪ B’|
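
To make the definition concrete, a tiny MINREP instance can be solved by exhaustive search. This is only an illustration of the problem statement: the search is exponential, and the function name, node labels, and example instance are made up for the sketch.

```python
from itertools import combinations


def minrep_bruteforce(a_parts, b_parts, edges):
    """Smallest node set containing a witness edge for every super-edge.

    a_parts, b_parts: lists of sets of node names (the super-nodes)
    edges: list of (u, v) pairs with u in some A part and v in some B part
    """
    part_of = {}
    for i, part in enumerate(a_parts):
        for u in part:
            part_of[u] = ("A", i)
    for j, part in enumerate(b_parts):
        for v in part:
            part_of[v] = ("B", j)

    # A super-edge exists wherever at least one edge joins two super-nodes.
    super_edges = {(part_of[u], part_of[v]) for u, v in edges}

    nodes = sorted(part_of)
    for size in range(len(nodes) + 1):           # try small solutions first
        for chosen in combinations(nodes, size):
            picked = set(chosen)
            witnessed = {(part_of[u], part_of[v]) for u, v in edges
                         if u in picked and v in picked}
            if witnessed == super_edges:
                return picked
    return None  # not reached: picking every node is always valid


a_parts = [{"a1", "a2"}, {"a3", "a4"}]
b_parts = [{"b1", "b2"}, {"b3", "b4"}]
edges = [("a1", "b1"), ("a2", "b3"), ("a3", "b1"), ("a1", "b3")]
print(sorted(minrep_bruteforce(a_parts, b_parts, edges)))
```

In this instance one representative per super-node suffices, so the optimum equals α+β = 4, matching the "yes" case of the inapproximability statement.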


    Informally,

    • given a set of children

    • given a candidate set of parents

    • assuming we believe in Mendelian inheritance law

    • assuming that the parents tried to be as monogamous as possible

    can we partition the children into sets of full siblings (a full sibling group has the same pair of parents)?

    MINREP can be reduced to this problem to show that it is hard


    Conclusions

    • Parsimony-based combinatorial optimization works best with the least amount of information

    • Parsimony-based combinatorial optimization is NP-hard and inapproximable

    • First combinatorial approach for Error-Tolerant Sibship Reconstruction

      • Fewer Assumptions

      • More Efficient

    • Other parsimony-based optimization objectives are possible

      • Min Parents is interesting and hard!



    Future Work

    • Better heuristics for Min Parents?

    • Other parsimony objectives

    • Further analysis of when objectives give same results


    Sibship Reconstruction Project

    Ashfaq Khokhar, UIC

    Bhaskar DasGupta, UIC

    Tanya Berger-Wolf, UIC

    Isabel Caballero, UIC

    W. Art Chaovalitwongse, Rutgers

    Mary Ashley, UIC

    Chun-An (Joe) Chou, Rutgers

    Priya Govindan, UIC

    Thank You!! Questions?

