Reconstructing Sibling Relationships from Genotyping Data

Saad Sheikh

Department of Computer Science

University of Illinois at Chicago

?

Brothers!

?

Biological Motivation

- Used in: conservation biology, animal management, molecular ecology, genetic epidemiology
- Necessary for: estimating heritability of quantitative characters, characterizing mating systems and fitness.

Lemon sharks, Negaprion brevirostris

- But: hard to sample parent/offspring pairs. Sampling cohorts of juveniles is easier

2 Brown-headed cowbird (Molothrus ater) eggs in a Blue-winged Warbler's nest

Basic Genetics

- Gene
- Unit of inheritance
- Allele
- Actual genetic sequence
- Locus
- Location of allele in entire genetic sequence
- Diploid
- 2 alleles at each locus

Diploid Siblings

Siblings: two children with the same parents

Question: given a set of children, find sibling groups

- father: (.../...), (a/b), (.../...), (.../...)
- mother: (.../...), (c/d), (.../...), (.../...)
- child: (.../...), (e/f), (.../...), (.../...)

Each (x/y) pair is a locus holding two alleles. Through recombination, the child receives one allele from the father and one from the mother at each locus.

Alleles (sequences shown from the 5’ end)

- #1: CACACACA
- #2: CACACACACACA
- #3: CACACACACACACA

Genotypes: 1/1, 2/2, 1/2, 1/3, 2/3, 3/3

Microsatellites (STR)

- Advantages:
  - Codominant (easy inference of genotypes and allele frequencies)
  - Many heterozygous alleles per locus
  - Possible to estimate other population parameters
  - Cheaper than SNPs
- But:
  - Few loci
- And:
  - Large families
  - Self-mating
  - …

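Sibling Reconstruction Problem

| Animal | Locus1 (allele1/allele2) | Locus2 (allele1/allele2) |
|--------|--------------------------|--------------------------|
| 1 | 1/2 | 11/22 |
| 2 | 1/3 | 33/44 |
| 3 | 1/4 | 33/55 |
| 4 | 1/3 | 77/66 |
| 5 | 1/3 | 33/44 |
| 6 | 1/3 | 33/77 |
| 7 | 1/5 | 88/22 |
| 8 | 1/6 | 22/22 |

Sibling groups: S={P1={2,4,5,6},P2={1,3},P3={7,8}}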

KINSHIP

David C. Queller and Keith F. Goodnight. Computer software for performing likelihood tests of pedigree relationship using genetic markers. Molecular Ecology, 8:1231–1234, 1999.

- First software and likelihood measure for sibling/kinship reconstruction
- Estimates a ratio of two likelihoods:
- Primary vs. Null Hypothesis
- Assumes Population Frequencies are known

Probability of sharing allele

- R – Probability of alleles being identical by descent
- Rp = Probability (Xp = Yp)
- Rm = Probability (Xm = Ym)

Haploid Likelihood

- Two individuals X = <X> and Y = <Y>
- If X = Y
  - Likelihood = Pr(Drawing X) × Pr(X = Y)
  - $= P_x\,(R + (1-R)\,P_x)$
- Otherwise
  - Likelihood = Pr(Drawing X) × Pr(X ≠ Y)
  - $= P_x\,(1-R)\,P_y$
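The two cases can be sketched as one small function. This is a minimal illustration, assuming the likelihood is the joint probability Pr(drawing X) × Pr(Y given X); the function name `haploid_likelihood` and the allele frequencies are hypothetical, not taken from KINSHIP.

```python
def haploid_likelihood(x, y, r, freq):
    """Joint likelihood of observing haploid alleles x and y.

    r    : probability that the two alleles are identical by descent (R)
    freq : dict mapping allele -> population frequency (Px, Py)
    """
    px, py = freq[x], freq[y]
    if x == y:
        # Draw x, then y matches by descent (r) or by chance ((1 - r) * px).
        return px * (r + (1.0 - r) * px)
    # Draw x, then y is not shared by descent and is drawn from the population.
    return px * (1.0 - r) * py


# Hypothetical allele frequencies, for illustration only.
freq = {"a": 0.5, "b": 0.3, "c": 0.2}

# Summed over all (x, y) outcomes, the likelihoods form a probability distribution.
total = sum(haploid_likelihood(x, y, 0.5, freq) for x in freq for y in freq)
```

The sum over all outcomes equals 1 for any R between 0 and 1, which is a quick sanity check on the two formulas.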

Diploid Individuals

- Diploid Individuals X = <Xp/Xm>, Y = <Yp/Ym>
- Assumptions
  - We know which alleles are the mother's and the father's
  - No inbreeding, so Likelihood = Likelihoodp × Likelihoodm
  - Loci are independent, so the total likelihood is a product of likelihoods across loci
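A sketch of how these assumptions turn into a computation: the per-locus likelihood is the product of a paternal and a maternal haploid term, and the total is the product over loci. The helper names, the genotypes, and the allele frequencies are hypothetical. The hypotheses compared are also an assumption: full siblings with Rp = Rm = 0.5 against unrelated individuals with Rp = Rm = 0.

```python
from math import prod


def haploid(x, y, r, freq):
    # Joint likelihood of haploid alleles x and y with identity-by-descent probability r.
    px, py = freq[x], freq[y]
    return px * (r + (1 - r) * px) if x == y else px * (1 - r) * py


def diploid_locus(x, y, rp, rm, freq):
    # x and y are (paternal, maternal) allele pairs at one locus.
    return haploid(x[0], y[0], rp, freq) * haploid(x[1], y[1], rm, freq)


def total_likelihood(xs, ys, rp, rm, freqs):
    # Independent loci: multiply the per-locus likelihoods.
    return prod(diploid_locus(x, y, rp, rm, f) for x, y, f in zip(xs, ys, freqs))


# Hypothetical two-locus example.
freqs = [{1: 0.5, 2: 0.5}, {1: 0.25, 2: 0.75}]
x = [(1, 2), (1, 1)]
y = [(1, 2), (2, 1)]

full_sib = total_likelihood(x, y, 0.5, 0.5, freqs)
unrelated = total_likelihood(x, y, 0.0, 0.0, freqs)
ratio = full_sib / unrelated
```

A ratio above 1 favours the full-sibling hypothesis for this pair, but by itself it says nothing about statistical significance.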

Calculating Likelihood

- Population Frequencies: Pxm,Pxp,Pym,Pyp
- Likelihoods:

Likelihood Ratios

- Independent Likelihood is not very reliable or meaningful
- Different Ratios => Different Loci
- Ratio != Statistical Significance
- Simulations used to determine P-values

Statistical Significance

- Randomly generate an individual X using allele frequencies
- Draw Y using Rm and Rp
  - First Allele: Copy X's allele with Probability Rm or vice versa
  - Second Allele: Copy X's allele with Probability Rp or vice versa
- Draw a large number of such <X,Y> pairs
- The value of the ratio that excludes 95% of such pairs is at P=0.05 significance
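The simulation can be sketched as follows. This is only an illustration under stated assumptions: pairs are simulated under the null hypothesis, an allele that is not copied from X is drawn from the population frequencies (one reading of "or vice versa"), and all names, frequencies, and the number of pairs are hypothetical.

```python
import random


def haploid(x, y, r, freq):
    px, py = freq[x], freq[y]
    return px * (r + (1 - r) * px) if x == y else px * (1 - r) * py


def likelihood(x, y, rp, rm, freq):
    return haploid(x[0], y[0], rp, freq) * haploid(x[1], y[1], rm, freq)


def draw(freq, rng):
    # Draw one allele according to the population frequencies.
    alleles = list(freq)
    return rng.choices(alleles, weights=[freq[a] for a in alleles])[0]


def simulate_pair(rp, rm, freq, rng):
    x = (draw(freq, rng), draw(freq, rng))
    # Copy X's allele with probability r, otherwise draw a fresh one (assumption).
    y = tuple(a if rng.random() < r else draw(freq, rng)
              for a, r in zip(x, (rp, rm)))
    return x, y


def ratio_threshold(primary, null, freq, n_pairs=2000, seed=0):
    """Ratio value that excludes about 95% of pairs simulated under the null."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_pairs):
        x, y = simulate_pair(null[0], null[1], freq, rng)
        ratios.append(likelihood(x, y, primary[0], primary[1], freq)
                      / likelihood(x, y, null[0], null[1], freq))
    ratios.sort()
    return ratios[int(0.95 * n_pairs)]


freq = {"a": 0.5, "b": 0.3, "c": 0.2}  # hypothetical frequencies
threshold = ratio_threshold((0.5, 0.5), (0.0, 0.0), freq)
```

An observed pair whose ratio exceeds `threshold` would then be reported as significant at roughly the P=0.05 level for this single locus.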

Family Finder

A graph-theoretic approach to the partition of individuals into full-sib families. Molecular Ecology, 12:2243–2250, 2003.

Graph-Theory?

- Build a graph of all individuals
- Connect individuals with edges representing relationships
- Assign Likelihood Ratio Full Sib/Unrelated as distance measure
- Filter using likelihood ratio at 0.05 significance level
- Find a cut

Algorithm

- Calculate LFS/LUR likelihood ratios for all pairs
- Build a graph representing the full-sib relationships
- Find the connected components in the graph and store them in a queue.
- While the queue is not empty do
  - Remove a component from the queue and calculate its score.
  - Build a GH (Gomory-Hu) cut tree for the component.
  - For each cut with less than 1/3 the total number of edges in the component do
    - Score the components that would result if the cut's edges were removed.
    - If the scores are the best found so far, then store them.
  - If the best scores found are higher than the score for the original component
    - then separate the families and put them in the queue for further analysis.
    - Otherwise save the original component as a result family.
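The first three steps (ratios, graph, connected components) can be sketched as below. The cut-tree refinement is not shown. The function name, the pairwise ratios, and the threshold are hypothetical values chosen for illustration, not output of Family Finder.

```python
from collections import defaultdict, deque


def full_sib_components(individuals, ratios, threshold):
    """Connected components of the graph whose edges are pairs with ratio >= threshold."""
    adj = defaultdict(set)
    for (u, v), ratio in ratios.items():
        if ratio >= threshold:
            adj[u].add(v)
            adj[v].add(u)

    seen, components = set(), []
    for start in individuals:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            component.append(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical LFS/LUR ratios for five individuals.
ratios = {(1, 2): 5.0, (2, 3): 4.0, (3, 4): 0.5, (4, 5): 6.0}
components = full_sib_components([1, 2, 3, 4, 5], ratios, threshold=2.0)
```

Each component is a candidate family that the queue-based loop would then try to split further.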

Example

Score the components and Keep the best cuts

Conclusion – Family Finder

- Some theoretical basis
- Efficiently computable
- Produces reasonably good results for many loci
- A lot of assumptions because of the Queller & Goodnight measure
- Requires a significant number of loci - 8+
- Works well only when families are almost equal size

Parsimony

- Parsimony = Occam’s Razor
  - “entities must not be multiplied beyond necessity”
  - “plurality should not be posited without necessity”
- “Parsimony is a 'less is better' concept of frugality, economy or caution in arriving at a hypothesis or course of action. The word derives from Middle English parcimony, from Latin parsimonia, from parsus, past participle of parcere: to spare. It is a general principle that has applications from science to philosophy and all related fields. Parsimony is essentially the implementation of Occam's razor.” (Wikipedia)
- Min Sib groups = Most Parsimonious explanation

Mendelian Constraints

4-allele rule: siblings have at most 4 different alleles in a locus

- Yes: 3/3, 1/3, 1/5, 1/6
- No: 3/3, 1/3, 1/5, 1/6, 3/2

2-allele rule: in a locus in a sibling group, a + R ≤ 4, where

- a = number of distinct alleles
- R = number of alleles that appear with 3 others or are homozygous

- Yes: 3/3, 1/3, 1/5
- No: 3/3, 1/3, 1/5, 1/6
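Both rules can be checked mechanically for one locus. The sketch below follows the definitions as stated on the slide; the function names are hypothetical, and "appears with 3 others" is read as "is paired with at least 3 different other alleles", which is an assumption.

```python
def four_allele_ok(genotypes):
    """4-allele rule: at most 4 distinct alleles at the locus."""
    return len({a for g in genotypes for a in g}) <= 4


def two_allele_ok(genotypes):
    """2-allele rule: a + R <= 4 at the locus."""
    alleles = {a for g in genotypes for a in g}
    homozygous = {a for a, b in genotypes if a == b}
    partners = {a: set() for a in alleles}
    for a, b in genotypes:
        if a != b:
            partners[a].add(b)
            partners[b].add(a)
    # R counts alleles that are homozygous or paired with 3 other alleles.
    r = sum(1 for a in alleles if a in homozygous or len(partners[a]) >= 3)
    return len(alleles) + r <= 4


# The four examples from the slide, as (allele1, allele2) tuples.
four_yes = [(3, 3), (1, 3), (1, 5), (1, 6)]
four_no = four_yes + [(3, 2)]
two_yes = [(3, 3), (1, 3), (1, 5)]
two_no = [(3, 3), (1, 3), (1, 5), (1, 6)]
```

Note that the group 3/3, 1/3, 1/5, 1/6 passes the 4-allele rule but fails the 2-allele rule, so the second rule is the stricter one.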

Min Sibgroups Reconstruction

- Find the minimum number of Sibling Groups necessary to explain the given cohort
- Minimum Set Cover:
  - Cohort as universe U
  - Individuals as elements of U
  - Covering Groups C include all genetically feasible sibling groups
- NP-complete even when sibling sets are known to have size at most 3
- Hard to approximate (Ashley et al. 09)
- ILP formulation (Chaovalitwongse et al. 08)

Minimum Set Cover

Given: universe U = {1, 2, …, n} and a collection of sets S = {S1, S2, …, Sm}, where each Si is a subset of U

Find: the smallest number of sets in S whose union is the universe U

- Minimum Set Cover is NP-hard
- (1 + ln n)-approximable (sharp)
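The (1 + ln n) guarantee is achieved by the standard greedy heuristic, sketched below. It is not the ILP approach used in the talk, and the candidate groups in the example are hypothetical.

```python
def greedy_set_cover(universe, sets):
    """Repeatedly pick the set that covers the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(uncovered & sets[i]))
        if not uncovered & sets[best]:
            raise ValueError("the sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


# Hypothetical candidate sibling groups over a cohort of 8 individuals.
candidate_groups = [{2, 4, 5, 6}, {1, 3}, {7, 8}, {1, 2}, {5, 6, 7}]
chosen = greedy_set_cover(range(1, 9), candidate_groups)
```

Here `chosen` holds the indices of the selected groups; on this toy input the greedy choice happens to be optimal, which is not guaranteed in general.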

2-Allele Min Set Cover

- Generate all maximal feasible sibling groups (sets) that satisfy 2-allele property using “2-Allele Algorithm” [ISMB 2007; Bioinformatics 23(13)]
- Use Min Set Cover to find the minimum sibling groups
  - Optimally using ILP (CPLEX)

2-Allele Algorithm Overview

- Generate candidate sets by all pairs of individuals
- Compare every set to every individual x
  - If x can be added to the set without affecting “accommodability” or violating 2-allele:
    - add it
  - If the “accommodability” is affected, but the 2-allele property is still satisfied:
    - create a new copy of the set, and add x to it
  - Otherwise ignore the individual, compare the next

Canonical families

[Figure: canonical families, shown as groups of single-locus genotypes over alleles 1 to 4]

Testing and Validation: Protocol

- Get a dataset with known sibgroups (real or simulated)
- Find sibgroups using our algorithm
- Compare the solutions
  - Partition distance, Gusfield ’03
- Compare results to other sibship methods

Real Data

- Salmon (Salmo salar), Herbinger et al., 1999: 351 individuals, 6 families, 4 loci. No missing alleles
- Shrimp (Penaeus monodon), Jerry et al., 2006: 59 individuals, 13 families, 7 loci. Some missing alleles
- Ants (Leptothorax acervorum), Hammond et al., 1999: a haplodiploid species. The data consists of 377 diploid worker ants

Random Data Generation

- Generate F females and M males (F=M=5, 10, 15)
- Each with l loci (l=2, 4, 6)
- Each locus with a alleles: a[uniform]=5, 10, 15; a[nonuniform]=4 (12-4-1-1)
- Generate f families: f[uniform]=2, 5, 10; f[nonuniform]=5
- For each family select female+male uniformly at random
- For each parent pair generate o offspring: o[uniform]=2, 5, 10; o[nonuniform]=25-10-10-4-1
- For each offspring for each locus choose allele outcome uniformly at random
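The generation procedure, restricted to the uniform settings, can be sketched as follows. The function name and parameter names are hypothetical, alleles are represented as integers, and the nonuniform allele and family-size distributions are left out.

```python
import random


def generate_cohort(n_parents, n_loci, n_alleles, n_families, n_offspring, seed=0):
    """Return a list of (family_id, genotype) with genotype = [(a1, a2), ...] per locus."""
    rng = random.Random(seed)

    def random_parent():
        # Each locus gets two alleles drawn uniformly from n_alleles possibilities.
        return [(rng.randrange(n_alleles), rng.randrange(n_alleles))
                for _ in range(n_loci)]

    females = [random_parent() for _ in range(n_parents)]
    males = [random_parent() for _ in range(n_parents)]

    cohort = []
    for family in range(n_families):
        mother, father = rng.choice(females), rng.choice(males)
        for _ in range(n_offspring):
            # Per locus: one allele from the father, one from the mother.
            child = [(rng.choice(f), rng.choice(m)) for f, m in zip(father, mother)]
            cohort.append((family, child))
    return cohort


cohort = generate_cohort(n_parents=5, n_loci=4, n_alleles=10,
                         n_families=5, n_offspring=5, seed=1)
```

Because every family comes from a single parent pair, each generated sibling group satisfies the 4-allele rule at every locus, which makes the known family labels usable as ground truth.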

Summary (Min Sib Groups)

- 2-Allele Min Set Cover
- First combinatorial
- Makes no assumptions other than parsimony
- Works consistently and comparatively
- Sibling Reconstruction
- Growing number of methods
- Biologists need (one) reliable reconstruction
- Genotyping errors
- Answer: Consensus

Consensus Methods

- Combine multiple solutions to a problem to generate one unified solution
- C: S* → S
- Based on Social Choice Theory
- Commonly used where the real solution is not known, e.g. Phylogenetic Trees

[Figure: solutions S1, ..., Sk → Consensus → S]

Strict Consensus

- Only Pareto Optimality and Anti-Pareto Optimality are enforced
- All solutions must agree on equivalence
- All disputed individuals go to singletons

$\forall i:\ x \equiv_{S_i} y \iff x \equiv_{S} y$

- S1 = {{1,2,3},{4,5},{6,7}}
- S2 = {{1,2,3,4},{5,6,7}}
- S3 = {{1,2},{3,4,5},{6,7}}

Strict Consensus: S = {{1,2},{3},{4},{5},{6,7}}

5 Sibling Groups? When 3 can do?
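The strict consensus of the three example solutions can be reproduced with a few lines. This is a minimal sketch; the function names are hypothetical, and each solution is represented as a list of sets.

```python
def together(solution, x, y):
    """True if x and y are in the same group of this solution."""
    return any(x in group and y in group for group in solution)


def strict_consensus(solutions):
    """Group x with y only if every solution puts them together."""
    individuals = sorted(set().union(*solutions[0]))
    groups = []
    for x in individuals:
        for group in groups:
            # "Together in all solutions" is an equivalence relation,
            # so comparing with one representative is enough.
            if all(together(s, x, group[0]) for s in solutions):
                group.append(x)
                break
        else:
            groups.append([x])
    return groups


S1 = [{1, 2, 3}, {4, 5}, {6, 7}]
S2 = [{1, 2, 3, 4}, {5, 6, 7}]
S3 = [{1, 2}, {3, 4, 5}, {6, 7}]
consensus = strict_consensus([S1, S2, S3])
```

Every individual on which the solutions disagree ends up alone, which is why the result has five groups even though each input has at most three.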

Majority Consensus

- Majority of solutions determine the final solution
- Two individuals are together if a majority of solutions vote in their favour
- Violates Transitivity: A≡B∧B≡C⇒A≡C

- S1 = {{1,2,3},{4,5},{6,7}}
- S2 = {{1,2,3,4},{5,6,7}}
- S3 = {{1,2},{3,4,5},{6,7}}

1 ≡ 3 AND 3 ≡ 4 BUT 1 ≢ 4

Majority Consensus

- Voting Consensus
- Majority under closure
- Results in large monolithic groups

- S1 = {{1,2,3},{4,5},{6,7}}
- S2 = {{1,2,3,4},{5,6,7}}
- S3 = {{1,2},{3,4,5},{6,7}}

Voting Consensus: S = {{1,2,3,4,5},{6,7}}

1 ≡ 5?
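Majority voting and its transitive closure can be sketched with a small union-find structure. The function names are hypothetical, and each solution is a list of sets, as in the example above.

```python
from itertools import combinations


def majority_pairs(solutions, individuals):
    """Pairs that are placed together by more than half of the solutions."""
    pairs = set()
    for x, y in combinations(individuals, 2):
        votes = sum(any(x in g and y in g for g in s) for s in solutions)
        if 2 * votes > len(solutions):
            pairs.add((x, y))
    return pairs


def voting_consensus(solutions, individuals):
    """Majority under closure: merge every majority pair transitively."""
    parent = {x: x for x in individuals}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for x, y in majority_pairs(solutions, individuals):
        parent[find(x)] = find(y)

    groups = {}
    for x in individuals:
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())


S1 = [{1, 2, 3}, {4, 5}, {6, 7}]
S2 = [{1, 2, 3, 4}, {5, 6, 7}]
S3 = [{1, 2}, {3, 4, 5}, {6, 7}]
people = list(range(1, 8))
pairs = majority_pairs([S1, S2, S3], people)
closed = voting_consensus([S1, S2, S3], people)
```

The pair set contains (1, 3) and (3, 4) but not (1, 4), and the closure still merges 1 through 5 into one group even though no solution puts 1 and 5 together.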

Consensus Methods

- Commonly used consensus methods don’t work [AAAI-MPREF08]
- Strict Consensus produces too many singletons
- Majority violates transitivity AND doesn’t work for error-tolerance

Distance-based Consensus

- Algorithm
  - Compute a consensus solution S={g1,...,gk}
  - Search for a good solution near S

[Figure: S1, S2, ..., Sk → Consensus (fd) → Ss → Search (fq, fd) → S]

Distance-based Consensus

- Needs
- A Distance Function fd: S x S →R
- A Quality Function fq: S → R
- What is the Catch? [Sheikh et al. CSB 2008]
- Optimization of fd, fq or an arbitrary linear combination is NP-Complete
- Reduction from the 2-Allele Min Set Cover Problem

A Greedy Approach

- Algorithm
- Compute a strict consensus
- While distance is not too large
- Merge two nearest sibgroups
- Quality: fq=n-|C|
- Distance Function
- fd(C,C’)=cost of merging groups in C to obtain C’

A Greedy Approach

- S1 ={ {1,2,3}, {4,5}, {6,7} }
- S2={ {1,2,3}, {4}, {5,6,7} }
- S3={ {1,2}, {3,4,5}, {6,7} }

Strict Consensus: S={ {1,2}, {3},{4},{5},{6,7} }

Greedy merge of {3} and {6,7}: S={ {1,2}, {3,6,7},{4},{5} }

Greedy Consensus

- Distance Function (sibgroup, sibgroup)
  - Cost of assigning all individuals
  - $f_d(P_i, P_j) = \min\left(\sum_{X \in P_i} f_{assign}(P_j, X),\ \sum_{X \in P_j} f_{assign}(P_i, X)\right)$
- Distance Function (sibgroup, individual)
  - Benefit: Alleles and allele pairs shared
  - Cost: Minimum Edit Distance
  - $f_{assign}(P_i, X) = \begin{cases} \text{benefit} & \text{if } X \text{ can be a member of } P_i \\ \text{cost} & \text{if } X \text{ cannot be a member of } P_i \end{cases}$

Greedy Consensus

- Algorithm
- Compute a strict consensus
- While distance is not too large
- Merge two sibgroups which will minimize the TOTAL merging cost
- Store the new merging cost in the merged set

Error-Tolerant Approach

[Figure: Locus 1, Locus 2, Locus 3, ..., Locus k → Sibling Reconstruction Algorithm → S1, ..., Sk → Consensus → S]

Results

- >90% accuracy for all real data

Impossibility Result

- A consensus method CANNOT be all of these [Arrow 1963,Mirkin 1975]
- Fair
- Independent
- Pareto Optimal
- Biologically [AAAI-MPREF 2008]
- The subset of individuals chosen will impact the consensus considerably

Problems

- Parametric
- Does NOT outperform other algorithms on:
- Biological data
- Smaller families
- High Allele Frequencies

Auto Greedy Consensus

- Change costs to average per locus costs
- Compare max group error on per locus basis
- Treat cost and benefit independently
- In order to qualify a merge
- Cost <= maxcost
- Benefit >= minbenefit
- Benefit = max benefit among possible merges

Summary (Consensus)

- First consensus method for Sibship Reconstruction
- Majority won’t work
- First combinatorial approach for Error-Tolerant Sibship Reconstruction
- Fewer Assumptions
- More Efficient
- Distance-based Consensus is NP-Hard
- New non-parametric consensus

Parsimony: Alternate Objectives

- Min number of sibgroups is just ONE way to interpret parsimony
- Alternate Objectives
- Sibship that minimizes number of parents
- Very Hard! Connection to Raz’s Parallel Repetition Theorem
- Sibship that minimizes number of matings
- Sibship that maximizes family size
- Sibship that tries to satisfy uniform allele distributions

Parsimony: Minimize Parents

- Problem Statement:
- Given a population U of individuals, partition the individuals into groups G such that the parents (mothers+fathers) necessary for G are minimized
- Observations and Challenges:
- MinParents: intractable, inapproximable
- Reduction from Min-Rep Problem (Raz’s Parallel Repetition Theorem)
- There may be $O(2^{|loci|})$ potential parents for a sibgroup
- Self-mating (plants) may or may not be allowed

Is MinParents = MinSibgroups?

- Not Necessarily…

Min Parents Meta Approach

- Generate M a set of covering groups
- Cover a subset S of covering groups
- For each group x in S
  - Generate Parent Pairs for x
  - Insert parent vertices into graph G (if needed)
  - Connect the parents in each parent pair
- Cover the minimum vertices necessary to (doubly) cover all the individuals

Example:

- M={{1,2},{3,6,7},{3,5},{2,4},{1,6},{2,5},{6,7}}
- S={{1,2,4},{3,5},{6,7}}
- X={3,5}: parent pairs {F=5/10, M=2/20}, {F=5/20, M=2/10}

[Figure: graph with parent vertices 5/10, 2/20, 5/20, 2/10; edges between the parents of each pair are labelled X={3,5}]

Covering Groups

- Different approaches to selecting a subset of maximal feasible groups
- Greedy Min Set Cover
- K-Greedy Min Set Covers
- All Sets! (Nearing optimality)
- Forget maximal feasible sibling groups
- Generate K random minimal feasible sibling reconstructions

Generating Parents

- The number of generated parents is just too many!
- Mine Association Rules across loci

{A,B}locus1 => {C,D}locus2

- Use Association Rules to filter parents

{A,B}locus1 => {C,D}locus2 OR {C’,D’}locus2

- Polygamy=>High Confidence Association Rules
- No Polygamy=>Min Parents=Min Groups
- If self-mating is not allowed, odd-cycles must be disallowed

Covering Vertices

- Heuristic
- While all vertices are not covered
- Select the vertex that will cover the most uncovered individuals
- MIP Formulation

Results

Legend:

M1: k-greedy cover with optimal graph cover

M2: greedy set cover with optimal graph cover

M3: Randomized cover with optimal graph cover

M4: k-greedy with graph heuristics

M5: greedy set cover with graph heuristic

Complexity Results

Reduction is from a version of Parallel Repetition theorem even if we know all the parents and just need to find the minimum parents to choose!

But, what is the parallel repetition theorem?

[Figure: label cover problem for bipartite graphs (2-prover 1-round proof system), small inapproximability → boosting (Raz’s parallel repetition theorem) → label cover problem for some kind of “graph product” for bipartite graphs (parallel repetition of 2-prover 1-round proof system), larger inapproximability. Other labels: conjecture, restriction, restriction.]

We need some version of Raz’s parallel repetition theorem that is suitable for us

Fortunately, the following two papers helped:

U. Feige, A threshold of ln n for approximating set-cover,Journal of the ACM, 1998

G. Kortsarz, R. Krauthgamer and J. R. Lee, Hardness of Approximating Vertex-Connectivity Network Design Problems, SIAM J. of Computing, 2004

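Inapproximability for MINREP (Raz’s parallel repetition theorem)

Let L ∈ NP and let x be an input instance of L. In $O(n^{\mathrm{polylog}(n)})$ time, x is mapped to a MINREP instance such that:

- x ∈ L ⇒ OPT ≤ α + β
- x ∉ L ⇒ OPT ≥ $(\alpha+\beta)\cdot 2^{\log^{1-\varepsilon}(|A|+|B|)}$

where 0 < ε < 1 is any constant.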

MINREP (minimum representative) problem

- Input graph G with node sets A and B
- A is split into α partitions A1, A2, …, Aα, all of equal size (the A “super”-nodes)
- B is split into β partitions B1, B2, B3, …, Bβ, all of equal size (the B “super”-nodes)
- Associated “super”-graph H: (A1,B2) ∈ H if there exist u ∈ A1 and v ∈ B2 such that (u,v) ∈ G
- In this case, edge (u,v) ∈ G is a witness of the super-edge (A1,B2) ∈ H

MINREP goal

- Valid solution: A’ ⊆ A and B’ ⊆ B such that A’ ∪ B’ contains a witness for every super-edge
- Objective: minimize the size of the solution |A’ ∪ B’|
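To make the definition concrete, the sketch below checks whether a node set is a valid MINREP solution and finds an optimum by brute force on a toy instance. The instance, the node names, and the function names are hypothetical, and exhaustive search is only feasible for very small graphs.

```python
from itertools import combinations


def is_valid(chosen, edges, part_of):
    """True if every super-edge has a witness edge with both endpoints chosen."""
    super_edges = {(part_of[u], part_of[v]) for u, v in edges}
    witnessed = {(part_of[u], part_of[v]) for u, v in edges
                 if u in chosen and v in chosen}
    return witnessed == super_edges


def brute_force_minrep(nodes, edges, part_of):
    """Smallest valid node set, found by trying subsets in order of size."""
    for size in range(len(nodes) + 1):
        for subset in combinations(nodes, size):
            if is_valid(set(subset), edges, part_of):
                return set(subset)


# Toy instance: two A super-nodes and two B super-nodes, each of size 2.
part_of = {"a1": "A1", "a2": "A1", "a3": "A2", "a4": "A2",
           "b1": "B1", "b2": "B1", "b3": "B2", "b4": "B2"}
edges = {("a1", "b1"), ("a2", "b2"), ("a1", "b3"),
         ("a3", "b1"), ("a4", "b2"), ("a3", "b3")}
best = brute_force_minrep(sorted(part_of), edges, part_of)
```

On this instance the optimum picks one representative per super-node that is actually used, so its size equals α + β = 4.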

Informally,

- given a set of children
- given a candidate set of parents
- assuming we believe in Mendelian inheritance law
- assuming that the parents tried to be as monogamous as possible

can we partition the children into sets of full siblings? (A full sibling group has the same pair of parents.)

MINREP can be reduced to this problem, which shows that it is hard.

Conclusions

- Parsimony-based combinatorial optimization works best with the least amount of information
- Parsimony-based combinatorial optimization is NP-hard and inapproximable
- First combinatorial approach for Error-Tolerant Sibship Reconstruction
- Fewer Assumptions
- More Efficient
- Other parsimony-based optimization objectives are possible
- Min Parents is interesting and hard!

Future Work

- Better heuristics for Min Parents?
- Other parsimony objectives
- Further analysis of when objectives give same results

Sibship Reconstruction Project

- Bhaskar DasGupta, UIC
- Tanya Berger-Wolf, UIC
- Isabel Caballero, UIC
- W. Art Chaovalitwongse, Rutgers
- Mary Ashley, UIC
- Chun-An (Joe) Chou, Rutgers
- Priya Govindan, UIC

Thank You!! Questions?
