Reconstructing Sibling Relationships from Genotyping Data

Saad Sheikh

Department of Computer Science

University of Illinois at Chicago

?

Brothers!

?

- Used in: conservation biology, animal management, molecular ecology, genetic epidemiology
- Necessary for: estimating heritability of quantitative characters, characterizing mating systems and fitness.

Lemon sharks, Negaprion brevirostris

- But: hard to sample parent/offspring pairs. Sampling cohorts of juveniles is easier

2 Brown-headed cowbird (Molothrus ater) eggs in a Blue-winged Warbler's nest

- Gene
- Unit of inheritance

- Allele
- Actual genetic sequence

- Locus
- Location of allele in entire genetic sequence

- Diploid
- 2 alleles at each locus

Siblings: two children with the same parents

Question: given a set of children, find sibling groups

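father: (.../...),(a/b),(.../...),(.../...)

mother: (.../...),(c/d),(.../...),(.../...)

child: (.../...),(e/f),(.../...),(.../...)

Each pair (x/y) is a locus and each entry in it is an allele. At every locus the child gets one allele from the father and one from the mother (recombination).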

Alleles

- #1: CACACACA
- #2: CACACACACACA
- #3: CACACACACACACA

Genotypes

1/1, 2/2, 1/2, 1/3, 2/3, 3/3

- Advantages:
- Codominant (easy inference of genotypes and allele frequencies)
- Many heterozygous alleles per locus
- Possible to estimate other population parameters
- Cheaper than SNPs

- But:
- Few loci

- And:
- Large families
- Self-mating
- …

Animal   Locus1   Locus2   (each genotype is allele1/allele2)
1        1/2      11/22
2        1/3      33/44
3        1/4      33/55
4        1/3      77/66
5        1/3      33/44
6        1/3      33/77
7        1/5      88/22
8        1/6      22/22

Sibling Groups: {2, 4, 5, 6}, {1, 3}, {7, 8}

S={P1={2,4,5,6},P2={1,3},P3={7,8}}

David C. Queller and Keith F. Goodnight.

Computer software for performing likelihood tests of pedigree relationship using genetic markers.

Molecular Ecology, 8:1231–1234, 1999.

- First software and likelihood measure for sibling/kinship reconstruction
- Estimates a ratio of two likelihoods:
- Primary vs. Null Hypothesis

- Assumes Population Frequencies are known

- R – Probability of alleles being identical by descent
- Rp = Probability (Xp = Yp)
- Rm = Probability (Xm = Ym)

- Two individuals X = <X> and Y = <Y>
- If X = Y
- Likelihood = Pr(Drawing X) x Pr(Y = X)
- = Px(R + (1-R)Px)

- Otherwise
- Likelihood = Pr(Drawing X) x Pr(Y ≠ X)
- = Px(1-R)Py

- Diploid Individuals X=<Xp/Xm>, Y =<Yp/Ym>
- Assumptions
- We know which alleles are mother's and father's
- No Inbreeding
- Likelihood = Likelihoodp x Likelihoodm

- Loci are independent
- Total Likelihood is a product of likelihoods across loci
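A minimal Python sketch of the likelihood ratio described above, under the slide's assumptions (known paternal/maternal alleles, no inbreeding, independent loci). The function names, the `freqs` layout (one `{allele: frequency}` dict per locus) and the genotype layout (one `(paternal, maternal)` tuple per locus) are illustrative assumptions, not the interface of the original software. The full-sib hypothesis is taken as Rp = Rm = 0.5 and the unrelated hypothesis as R = 0.

```python
from math import prod

def allele_likelihood(x, y, r, freq):
    """Likelihood of alleles x and y when they are identical by descent with probability r."""
    if x == y:
        return freq[x] * (r + (1 - r) * freq[x])
    return freq[x] * (1 - r) * freq[y]

def pair_likelihood(X, Y, rp, rm, freqs):
    """Product over loci of the paternal and maternal allele likelihoods."""
    return prod(
        allele_likelihood(xp, yp, rp, f) * allele_likelihood(xm, ym, rm, f)
        for (xp, xm), (yp, ym), f in zip(X, Y, freqs)
    )

def full_sib_ratio(X, Y, freqs):
    """Likelihood ratio: full sibs (Rp = Rm = 0.5) vs. unrelated (R = 0)."""
    return pair_likelihood(X, Y, 0.5, 0.5, freqs) / pair_likelihood(X, Y, 0.0, 0.0, freqs)
```

A ratio above 1 favors the full-sib hypothesis for that pair; as the slides go on to say, the raw ratio is not a significance level by itself.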

- Population Frequencies: Pxm,Pxp,Pym,Pyp
- Likelihoods:

- Independent Likelihood is not very reliable or meaningful
- Different Ratios => Different Loci
- Ratio != Statistical Significance
- Simulations used to determine P-values

- Randomly generate an individual X using allele frequencies
- Draw Y using Rm and Rp
- First Allele: Copy X's allele with Probability Rm or vice versa
- Second Allele: Copy X's allele with Probability Rp or vice versa

- Draw a large number of such <X,Y> pairs
- The value of the ratio that excludes 95% of such pairs is at P=0.05 significance
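The simulation steps above can be sketched as follows. This is an illustration under stated assumptions: "or vice versa" is read here as "otherwise draw the allele from the population frequencies", and `ratio_fn` is a hypothetical placeholder for whatever likelihood-ratio function is being calibrated.

```python
import random

def draw_allele(freq, rng):
    """Draw one allele according to population frequencies {allele: p}."""
    alleles = list(freq)
    return rng.choices(alleles, weights=[freq[a] for a in alleles])[0]

def simulate_pair(freqs, rp, rm, rng):
    """Generate X from allele frequencies, then Y by copying X's alleles with probability rp / rm."""
    X, Y = [], []
    for freq in freqs:  # one frequency table per locus
        xp, xm = draw_allele(freq, rng), draw_allele(freq, rng)
        yp = xp if rng.random() < rp else draw_allele(freq, rng)
        ym = xm if rng.random() < rm else draw_allele(freq, rng)
        X.append((xp, xm))
        Y.append((yp, ym))
    return X, Y

def ratio_threshold(ratio_fn, freqs, rp, rm, n_pairs=1000, level=0.05, seed=0):
    """Ratio value that leaves a fraction (1 - level) of the simulated pairs below it."""
    rng = random.Random(seed)
    ratios = sorted(ratio_fn(*simulate_pair(freqs, rp, rm, rng)) for _ in range(n_pairs))
    cut = min(n_pairs - 1, round((1 - level) * n_pairs))
    return ratios[cut]
```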

Jen Beyer and B. May.

A graph-theoretic approach to the partition of individuals into full-sib families.

Molecular Ecology, 12:2243–2250, 2003.

- Build a graph of all individuals
- Connect individuals with edges representing relationships
- Assign Likelihood Ratio Full Sib/Unrelated as distance measure
- Filter using likelihood ratio at 0.05 significance level

- Find a cut

- Calculate LFS/LUR likelihood ratios for all pairs
- Build a graph representing the full-sib relationships
- Find the connected components in the graph and store them in a queue.
- While the queue is not empty do
- Remove a component from the queue and calculate its score.
- Build a Gomory-Hu (GH) cut tree for the component.
- For each cut with fewer than 1/3 of the total number of edges in the component do
- Score the components that would result if the cut's edges were removed.
- If the scores are the best found so far, then store them.

- If the best scores found are higher than the score for the original component
- then separate the families and put them in the queue for further analysis.

- Otherwise save the original component as a result family.
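A small sketch of the first three steps only (pairwise ratios, thresholded graph, connected components placed in a queue). The cut-tree construction and the family score are not specified on the slides and are left out. `ratio` and `threshold` are hypothetical inputs standing in for the LFS/LUR ratio and its 0.05-significance cutoff.

```python
from collections import deque
from itertools import combinations

def full_sib_components(individuals, ratio, threshold):
    """Connect pairs whose ratio passes the threshold; return the components as a queue."""
    adjacency = {i: set() for i in individuals}
    for a, b in combinations(individuals, 2):
        if ratio(a, b) >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    queue, seen = deque(), set()
    for start in individuals:
        if start in seen:
            continue
        seen.add(start)
        component, stack = [], [start]
        while stack:  # depth-first traversal of one component
            v = stack.pop()
            component.append(v)
            for w in adjacency[v] - seen:
                seen.add(w)
                stack.append(w)
        queue.append(sorted(component))
    return queue
```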

Score the components and Keep the best cuts

- Some theoretical basis
- Efficiently computable
- Produces reasonably good results for many loci
- A lot of assumptions because of the Queller & Goodnight measure
- Requires a significant number of loci - 8+
- Works well only when families are almost equal size

- Parsimony=Occam’s Razor
- "entities must not be multiplied beyond necessity”
- "plurality should not be posited without necessity”

- “Parsimony is a 'less is better' concept of frugality, economy or caution in arriving at a hypothesis or course of action. The word derives from Middle English parcimony, from Latin parsimonia, from parsus, past participle of parcere: to spare. It is a general principle that has applications from science to philosophy and all related fields. Parsimony is essentially the implementation of Occam's razor.”
- Wikipedia

4-allele rule: siblings have at most 4 different alleles in a locus

Yes: 3/3, 1/3, 1/5, 1/6

No: 3/3, 1/3, 1/5, 1/6, 3/2

2-allele rule: in a locus in a sibling group, a + R ≤ 4, where

- a = number of distinct alleles
- R = number of alleles that appear with 3 others or are homozygous

Yes: 3/3, 1/3, 1/5

No: 3/3, 1/3, 1/5, 1/6
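Both rules can be checked per locus with a few lines of Python. This sketch tests only the counting conditions exactly as stated on the slide. It is not the full "2-Allele Algorithm", and the function names are illustrative.

```python
def four_allele_ok(genotypes):
    """4-allele rule: at most 4 distinct alleles at this locus."""
    return len({a for g in genotypes for a in g}) <= 4

def two_allele_ok(genotypes):
    """2-allele rule as stated on the slide: a + R <= 4."""
    alleles = {a for g in genotypes for a in g}
    partners = {a: set() for a in alleles}  # other alleles each allele is paired with
    homozygous = set()
    for x, y in genotypes:
        if x == y:
            homozygous.add(x)
        else:
            partners[x].add(y)
            partners[y].add(x)
    # R: alleles that appear with 3 others or are homozygous
    R = {a for a in alleles if a in homozygous or len(partners[a]) >= 3}
    return len(alleles) + len(R) <= 4
```

On the slide's examples, 3/3, 1/3, 1/5 gives a = 3 and R = 1 (allele 3), so the group passes; adding 1/6 gives a = 4 and R = 2 (alleles 3 and 1), so it fails.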

- Find the minimum number of Sibling Groups necessary to explain the given cohort
- Minimum Set Cover:
- Cohort as universe U
- Individuals as elements of U
- Covering Groups C include all genetically feasible sibling groups

- NP-complete even when sibling sets are known to have size at most 3
- Hard to approximate (Ashley et al. 09)
- ILP formulation (Chaovalitwongse et al. 08)

Given: universe U = {1, 2, …, n} and a collection of sets S = {S1, S2, …, Sm}, where each Si is a subset of U

Find: the smallest number of sets in S whose union is the universe U

Minimum Set Cover is NP-hard

(1 + ln n)-approximable (sharp)
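The (1 + ln n) bound is achieved by the classic greedy heuristic: repeatedly pick the set that covers the most still-uncovered elements. A minimal sketch follows; it illustrates the approximation only, not the optimal ILP route used in the talk.

```python
def greedy_set_cover(universe, sets):
    """Greedy Min Set Cover: pick the set covering the most uncovered elements until done."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(sets, key=lambda s: len(uncovered & set(s)))
        if not uncovered & set(best):
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= set(best)
    return chosen
```

For example, with U = {1, ..., 5} and sets {1,2,3}, {2,4}, {3,4}, {4,5}, the greedy choice is {1,2,3} followed by {4,5}.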

- Generate all maximal feasible sibling groups (sets) that satisfy 2-allele property using “2-Allele Algorithm” [ISMB 2007; Bioinformatics 23(13)]
- Use Min Set Cover to find the minimum number of sibling groups, solved optimally using ILP (CPLEX)

- Generate candidate sets from all pairs of individuals
- Compare every set to every individual x
- If x can be added to the set without affecting “accommodability” or violating the 2-allele property:
- add it

- If the “accommodability” is affected, but the 2-allele property is still satisfied:
- create a new copy of the set, and add x to it

- Otherwise ignore the individual and compare the next

[Figure: worked example of the 2-Allele Algorithm on single-locus genotypes; the genotype tables are not recoverable. The three outcomes shown:]

- Add
- New Group Add (won’t accommodate (2/2))
- Can’t add (a+R = 4)

- Get a dataset with known sibgroups(real or simulated)
- Find sibgroups using our algorithm
- Compare the solutions
- Partition distance, Gusfield’03

- Compare results to other sibship methods
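As an illustration of the comparison step, the partition distance can be read as the minimum number of individuals that must be removed so that the two partitions become identical, which equals n minus the best one-to-one matching of groups by overlap. The sketch below finds that matching by brute force over permutations, so it is only suitable for a handful of groups; a real evaluation would use an assignment-problem solver.

```python
from itertools import permutations

def partition_distance(p, q):
    """Minimum number of elements to remove so that partitions p and q coincide."""
    p, q = [set(g) for g in p], [set(g) for g in q]
    n = sum(len(g) for g in p)
    k = max(len(p), len(q))
    p += [set()] * (k - len(p))  # pad with empty groups so both sides have k groups
    q += [set()] * (k - len(q))
    best_overlap = max(
        sum(len(a & b) for a, b in zip(p, perm)) for perm in permutations(q)
    )
    return n - best_overlap
```

For instance, the true grouping {2,4,5,6}, {1,3}, {7,8} and a reconstruction {2,4,5}, {1,3,6}, {7,8} are at distance 1 (individual 6 is misplaced).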

Salmon (Salmo salar) - Herbinger et al., 1999: 351 individuals, 6 families, 4 loci. No missing alleles

Shrimp (Penaeus monodon) - Jerry et al., 2006: 59 individuals, 13 families, 7 loci. Some missing alleles

Ants (Leptothorax acervorum) - Hammond et al., 1999: the ants dataset [16] comes from a haplodiploid species. The data consists of 377 diploid worker ants

Generate F females and M males (F=M=5, 10, 15)

Each with l loci (l=2, 4, 6)

Each locus with a alleles: a[uniform]=5, 10, 15; a[nonuniform]=4 (12-4-1-1)

Generate f families: f[uniform]=2, 5, 10; f[nonuniform]=5

For each family select female+male uniformly at random

For each parent pair generate o offspring: o[uniform]=2, 5, 10; o[nonuniform]=25-10-10-4-1

For each offspring for each locus choose allele outcome uniformly at random
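A sketch of the uniform variant of this protocol in Python. The function name, the parameter names and the returned tuple layout are illustrative assumptions, and the non-uniform allele and family-size distributions are not covered. Parents are drawn with replacement, so two families may occasionally share a parent.

```python
import random

def simulate_cohort(n_parents=5, n_loci=2, n_alleles=5, n_families=2, n_offspring=5, seed=0):
    """Return a list of (family, child, father, mother); genotypes are lists of allele pairs."""
    rng = random.Random(seed)

    def random_parent():
        return [(rng.randrange(n_alleles), rng.randrange(n_alleles)) for _ in range(n_loci)]

    females = [random_parent() for _ in range(n_parents)]
    males = [random_parent() for _ in range(n_parents)]
    cohort = []
    for family in range(n_families):
        mother, father = rng.choice(females), rng.choice(males)
        for _ in range(n_offspring):
            # per locus: one allele from the father, one from the mother, chosen uniformly
            child = [(rng.choice(f), rng.choice(m)) for f, m in zip(father, mother)]
            cohort.append((family, child, father, mother))
    return cohort
```

Keeping the family index next to each child gives the known sibgroups needed to score a reconstruction.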

- 2-Allele Min Set Cover
- First combinatorial approach
- Makes no assumptions other than parsimony
- Works consistently and comparatively

- Sibling Reconstruction
- Growing number of methods
- Biologists need (one) reliable reconstruction
- Genotyping errors

- Answer: Consensus

- Combine multiple solutions to a problem to generate one unified solution
- C: S*→ S
- Based on Social Choice Theory
- Commonly used where the real solution is not known e.g. Phylogenetic Trees

S1, S2, ..., Sk → Consensus → S

- Only Pareto Optimality and Anti-Pareto Optimality are enforced
- All solutions must agree on equivalence

- All disputed individuals go to singletons

∀i: x ≡Si y ⇔ x ≡S y

S1={{1,2,3},{4,5},{6,7}}

S2={{1,2,3,4},{5,6,7}}

S3={{1,2},{3,4,5},{6,7}}

Strict Consensus: S={{1,2},{3},{4},{5},{6,7}}

5 sibling groups, when 3 can do?
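Strict consensus keeps two individuals together only if every solution does. One simple way to compute it, sketched here with illustrative names, is to label each individual by the index of its group in each solution and group individuals with identical labels.

```python
def strict_consensus(solutions):
    """Common refinement: x and y stay together iff they are together in every solution."""
    elements = sorted(x for group in solutions[0] for x in group)
    groups = {}
    for x in elements:
        # signature of x: index of the group containing x in each solution
        signature = tuple(
            next(i for i, group in enumerate(solution) if x in group)
            for solution in solutions
        )
        groups.setdefault(signature, []).append(x)
    return sorted(groups.values())
```

On S1, S2, S3 above this returns {1,2}, {3}, {4}, {5}, {6,7}: every disputed individual ends up as a singleton.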

- Majority of solutions determine the final solution
- Two individuals are together if a majority of solutions vote in their favour
- Violates Transitivity: A≡B∧B≡C⇒A≡C

S1={{1,2,3},{4,5},{6,7}}

S2={{1,2,3,4},{5,6,7}}

S3={{1,2},{3,4,5},{6,7}}

1 ≡ 3 AND 3 ≡ 4 BUT 1 ≢ 4

- Voting Consensus
- Majority under closure
- Results in large monolithic groups

S1={{1,2,3},{4,5},{6,7}}

S2={{1,2,3,4},{5,6,7}}

S3={{1,2},{3,4,5},{6,7}}

Voting Consensus: S={{1,2,3,4,5},{6,7}}

1 ≡ 5?
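Majority voting followed by transitive closure can be sketched with a union-find structure: every pair that a strict majority of solutions puts together is merged, and the closure follows automatically. Names are illustrative.

```python
from itertools import combinations

def voting_consensus(solutions):
    """Majority vote on pairs, closed under transitivity via union-find."""
    elements = sorted(x for group in solutions[0] for x in group)
    parent = {x: x for x in elements}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(elements, 2):
        votes = sum(any(a in g and b in g for g in s) for s in solutions)
        if 2 * votes > len(solutions):
            parent[find(a)] = find(b)

    groups = {}
    for x in elements:
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())
```

On S1, S2, S3 the majority pairs 1-3, 3-4 and 4-5 chain together, giving {1,2,3,4,5}, {6,7} even though only one solution places 1 with 4 and none places 1 with 5.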

- Commonly used consensus methods don’t work [AAAI-MPREF08]
- Strict Consensus produces too many singletons
- Majority violates transitivity AND doesn’t work for error-tolerance

- Algorithm
- Compute a consensus solution S={g1,...,gk}
- Search for a good solution near S

[Figure: S1, S2, ..., Sk → Consensus → Search (guided by fd and fq) → S]

- Needs
- A Distance Function fd: S x S →R
- A Quality Function fq: S → R

- What is the Catch? [Sheikh et al. CSB 2008]
- Optimization of fd, fq or an arbitrary linear combination is NP-Complete
- Reduction from the 2-Allele Min Set Cover Problem

- Algorithm
- Compute a strict consensus
- While distance is not too large
- Merge two nearest sibgroups

- Quality: fq=n-|C|
- Distance Function
- fd(C,C’)=cost of merging groups in C to obtain C’
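The greedy loop itself is short. In the sketch below, `distance` and `max_distance` are hypothetical placeholders for fd and for the "not too large" cutoff, neither of which is fully specified here; the toy distance in the usage note is purely illustrative and has no genetic meaning.

```python
from itertools import combinations

def greedy_merge(groups, distance, max_distance):
    """Starting from a consensus, repeatedly merge the two nearest groups."""
    groups = [frozenset(g) for g in groups]
    while len(groups) > 1:
        a, b = min(combinations(groups, 2), key=lambda pair: distance(*pair))
        if distance(a, b) > max_distance:
            break  # the nearest pair is already too far apart
        groups = [g for g in groups if g not in (a, b)] + [a | b]
    return sorted(sorted(g) for g in groups)
```

With the toy distance `lambda a, b: min(abs(x - y) for x in a for y in b)` and a cutoff of 3, the groups {1}, {2}, {10}, {11}, {30} collapse to {1,2}, {10,11}, {30}. Each merge reduces |C| by one and so increases the quality fq = n - |C| by one.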

- S1 ={ {1,2,3}, {4,5}, {6,7} }
- S2={ {1,2,3}, {4}, {5,6,7} }
- S3={ {1,2}, {3,4,5}, {6,7} }

Strict Consensus: S={ {1,2}, {3},{4},{5},{6,7} }

After merging {3} and {6,7}: S={ {1,2}, {3,6,7},{4},{5} }

- Distance Function (sibgroup, sibgroup)
- Cost of assigning all individuals
- fd(Pi,Pj) = min( Σ_{X∈Pi} fassign(Pj,X), Σ_{X∈Pj} fassign(Pi,X) )

- Distance Function (sibgroup, individual)
- Benefit: Alleles and allele pairs shared
- Cost: Minimum Edit Distance
- fassign(Pi,X) = benefit, if X can be a member of Pi; cost, if X cannot be a member of Pi

- Algorithm
- Compute a strict consensus
- While distance is not too large
- Merge two sibgroups which will minimize the TOTAL merging cost
- Store the new merging cost in the merged set

[Figure: Locus 1, Locus 2, Locus 3, ..., Locus k → Sibling Reconstruction Algorithm → S1, S2, ..., Sk → Consensus → S]

- >90% accuracy for all real data

- A consensus method CANNOT be all of these [Arrow 1963,Mirkin 1975]
- Fair
- Independent
- Pareto Optimal

- Biologically [AAAI-MPREF 2008]
- The subset of individuals chosen will impact the consensus considerably

- Parametric
- Does NOT outperform other algorithms on:
- Biological data
- Smaller families
- High Allele Frequencies

- Change costs to average per locus costs
- Compare max group error on per locus basis
- Treat cost and benefit independently
- In order to qualify a merge
- Cost <= maxcost
- Benefit >= minbenefit
- Benefit = max benefit among possible merges

- First consensus method for Sibship Reconstruction
- Majority won’t work

- First combinatorial approach for Error-Tolerant Sibship Reconstruction
- Fewer Assumptions
- More Efficient

- Distance-based Consensus is NP-Hard
- New non-parametric consensus

- Min number of sibgroups is just ONE way to interpret parsimony
- Alternate Objectives
- Sibship that minimizes number of parents
- Very Hard! Connection to Raz’s Parallel Repetition Theorem

- Sibship that minimizes number of matings
- Sibship that maximizes family size
- Sibship that tries to satisfy uniform allele distributions


- Problem Statement:
- Given a population U of individuals, partition the individuals into groups G such that the parents (mothers+fathers) necessary for G are minimized

- Observations and Challenges:
- MinParents: intractable, inapproximable
- Reduction from Min-Rep Problem (Raz’s Parallel Repetition Theorem)

- There may be O(2^|loci|) potential parents for a sibgroup
- Self-mating (plants) may or may not be allowed


- Not Necessarily…

M={{1,2},{3,6,7},{3,5},{2,4},{1,6},{2,5},{6,7}}

- Generate M a set of covering groups
- Cover a subset S of covering groups
- For each group x in S
- Generate Parent Pairs for x
- Insert parent vertices into graph G (if needed)
- Connect the parents in each parent pair

- Cover the minimum vertices necessary to (doubly) cover all the individuals

S={{1,2,4},{3,5},{6,7}}

X={3,5}

Candidate parent pairs for X: {F=5/10, M=2/20}, {F=5/20, M=2/10}

[Figure: parent graph with vertices 5/10, 2/20, 5/20 and 2/10 and edges labeled X={3,5}]

- Different approaches to selecting a subset of maximal feasible groups
- Greedy Min Set Cover
- K-Greedy Min Set Covers
- All Sets! (Nearing optimality)

- Forget maximal feasible sibling groups
- Generate K random minimal feasible sibling reconstructions

- The number of generated parents is just too many!
- Mine Association Rules across loci
{A,B}locus1 => {C,D}locus2

- Use Association Rules to filter parents
{A,B}locus1 => {C,D}locus2 OR{C’,D’}locus2

- Polygamy=>High Confidence Association Rules
- No Polygamy=>Min Parents=Min Groups
- If self-mating is not allowed, odd-cycles must be disallowed

- Heuristic
- While all vertices are not covered
- Select the vertex that will cover the most uncovered individuals

- MIP Formulation

Legend:

M1: k-greedy cover with optimal graph cover

M2: greedy set cover with optimal graph cover

M3: Randomized cover with optimal graph cover

M4: k-greedy with graph heuristics

M5: greedy set cover with graph heuristic

The reduction is from a version of the Parallel Repetition Theorem. The problem stays hard even if we know all the parents and just need to choose the minimum number of them!

But, what is the parallel repetition theorem?

[Figure: chain of hardness results]

- 2-prover 1-round proof system / label cover problem for bipartite graphs: small inapproximability
- Boosting (Raz’s parallel repetition theorem)
- Parallel repetition of 2-prover 1-round proof system / label cover problem for some kind of “graph product” for bipartite graphs: larger inapproximability
- Unique games conjecture (restriction)

We need some version of Raz’s parallel repetition theorem that is suitable for us

Fortunately, the following two papers helped:

U. Feige, A threshold of ln n for approximating set-cover, Journal of the ACM, 1998

G. Kortsarz, R. Krauthgamer and J. R. Lee, Hardness of Approximating Vertex-Connectivity Network Design Problems, SIAM J. of Computing, 2004

Inapproximability for MINREP

(Raz’s parallel repetition theorem)

Let L ∈ NP and x be an input instance of L

There is an O(n^polylog(n)) time reduction from L to MINREP such that:

- x ∈ L ⇒ OPT ≤ α+β
- x ∉ L ⇒ OPT ≥ (α+β) · 2^(log^(1−ε)(|A|+|B|))

0 < ε < 1 is any constant

MINREP (minimum representative) problem

- Input graph G: a bipartite graph on node sets A and B
- A is split into α partitions A1, A2, …, Aα, all of equal size (the A “super”-nodes)
- B is split into β partitions B1, B2, …, Bβ, all of equal size (the B “super”-nodes)
- Associated “super”-graph H: (A1,B2) ∈ H if ∃ u ∈ A1 and v ∈ B2 such that (u,v) ∈ G
- In this case, edge (u,v) ∈ G is a witness of the super-edge (A1,B2) ∈ H

MINREP goal

Valid solution: A’ ⊆ A and B’ ⊆ B such that A’ ∪ B’ contains a witness for every super-edge

Objective: minimize the size of the solution |A’ ∪ B’|
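To make the definition concrete, here is a small Python sketch that derives the super-edges, checks whether a chosen node set is a valid solution, and finds an optimum by exhaustive search. The exhaustive search is exponential and only meant for toy instances. The data layout (`part_a` and `part_b` map each node to its super-node, and node names in A and B are disjoint) is an assumption made for illustration.

```python
from itertools import combinations

def super_edges(edges, part_a, part_b):
    """Super-edges of H induced by the edges (u, v) of G."""
    return {(part_a[u], part_b[v]) for u, v in edges}

def is_valid(edges, part_a, part_b, chosen):
    """True if the chosen nodes contain a witness for every super-edge."""
    witnessed = {(part_a[u], part_b[v]) for u, v in edges if u in chosen and v in chosen}
    return witnessed == super_edges(edges, part_a, part_b)

def brute_force_minrep(edges, part_a, part_b):
    """Smallest valid node set, trying subsets in order of increasing size."""
    nodes = sorted(set(part_a) | set(part_b))
    for size in range(len(nodes) + 1):
        for subset in combinations(nodes, size):
            if is_valid(edges, part_a, part_b, set(subset)):
                return set(subset)
```

For example, with A1 = {a1, a2}, A2 = {a3, a4}, B1 = {b1, b2}, B2 = {b3, b4} and edges (a1,b1), (a1,b3), (a2,b2), (a3,b3), (a4,b1), all four super-edges exist and the optimum is {a1, a3, a4, b1, b3}.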

Informally,

- given a set of children
- given a candidate set of parents
- assuming we believe in Mendelian inheritance law
- assuming that the parents tried to be as monogamous as possible

can we partition the children into sets of full siblings?

(a full sibling group has the same pair of parents)

Can reduce MINREP to show that this problem is hard

- Parsimony-based combinatorial optimization works best with the least amount of information
- Parsimony-based combinatorial optimization is NP-hard and inapproximable
- First combinatorial approach for Error-Tolerant Sibship Reconstruction
- Fewer Assumptions
- More Efficient

- Other parsimony-based optimization objectives are possible
- Min Parents is interesting and hard!

- Better heuristics for Min Parents?
- Other parsimony objectives
- Further analysis of when objectives give same results

Sibship Reconstruction Project

- Ashfaq Khokhar, UIC
- Bhaskar DasGupta, UIC
- Tanya Berger-Wolf, UIC
- Isabel Caballero, UIC
- W. Art Chaovalitwongse, Rutgers
- Mary Ashley, UIC
- Chun-An (Joe) Chou, Rutgers
- Priya Govindan, UIC

Thank You!! Questions?