
Reconstructing Sibling Relationships from Genotyping Data

Saad Sheikh, Department of Computer Science, University of Illinois at Chicago


Presentation Transcript


  1. Reconstructing Sibling Relationships from Genotyping Data • Saad Sheikh, Department of Computer Science, University of Illinois at Chicago • Image callouts: "?", "Brothers!", "?"

  2. Biological Motivation • Used in: conservation biology, animal management, molecular ecology, genetic epidemiology • Necessary for: estimating heritability of quantitative characters, characterizing mating systems and fitness • But: parent/offspring pairs are hard to sample; sampling cohorts of juveniles is easier • Image captions: Lemon sharks (Negaprion brevirostris); Brown-headed cowbird (Molothrus ater) eggs in a Blue-winged Warbler's nest

  3. Basic Genetics • Gene: unit of inheritance • Allele: actual genetic sequence • Locus: location of the allele in the entire genetic sequence • Diploid: 2 alleles at each locus

  4. Diploid Siblings • Siblings: two children with the same parents • Question: given a set of children, find sibling groups • Diagram (labels: allele, locus, recombination): father (.../...),(a/b),(.../...),(.../...) • mother (.../...),(c/d),(.../...),(.../...) • child (.../...),(e/f),(.../...),(.../...), with one allele from the father and one from the mother

  5. Microsatellites (STR) • Example repeat: 5' CACACACA • Alleles: #1 CACACACA, #2 CACACACACACA, #3 CACACACACACACA • Genotypes: 1/1, 2/2, 1/2, 1/3, 2/3, 3/3 • Advantages: codominant (easy inference of genotypes and allele frequencies); many heterozygous alleles per locus; possible to estimate other population parameters; cheaper than SNPs • But: few loci • And: large families, self-mating, …

  6. Sibling Reconstruction Problem • Each genotype is written allele1/allele2
      Animal  Locus1  Locus2
      1       1/2     11/22
      2       1/3     33/44
      3       1/4     33/55
      4       1/3     77/66
      5       1/3     33/44
      6       1/3     33/77
      7       1/5     88/22
      8       1/6     22/22
      Sibling groups: {2, 4, 5, 6}, {1, 3}, {7, 8} • S = {P1 = {2,4,5,6}, P2 = {1,3}, P3 = {7,8}}

  7. Existing Methods

  8. KINSHIP • Keith F. Goodnight and David C. Queller. Computer software for performing likelihood tests of pedigree relationship using genetic markers. Molecular Ecology, 8:1231–1234, 1999.

  9. KINSHIP • First software and likelihood measure for sibling/kinship reconstruction • Estimates a ratio of two likelihoods: primary vs. null hypothesis • Assumes population frequencies are known

  10. Probability of sharing allele • R: probability of alleles being identical by descent • Rp = Pr(Xp = Yp) • Rm = Pr(Xm = Ym)

  11. Haploid Likelihood • Two individuals X = <X> and Y = <Y> • If X = Y: Likelihood = Pr(Drawing X) × Pr(X = Y) = Px (R + (1 − R) Px) • Otherwise: Likelihood = Pr(Drawing X) × Pr(X ≠ Y) = Px (1 − R) Py

  12. Diploid Individuals • Diploid individuals X = <Xp/Xm>, Y = <Yp/Ym> • Assumptions: we know which alleles are the mother's and the father's; no inbreeding • Likelihood = Likelihood_p × Likelihood_m • Loci are independent: total likelihood is a product of likelihoods across loci
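The haploid and diploid likelihoods on the last two slides can be sketched in a few lines. This is a minimal illustration, not the KINSHIP implementation: the function names are hypothetical, the haploid likelihood is taken as Pr(drawing X) times Pr(Y given X), and the relatedness values (R = 0.5 on each side for full siblings, R = 0 for unrelated individuals) are assumptions chosen for the example.

```python
from math import prod

def haploid_likelihood(x, y, r, freq):
    """Pr(drawing x) * Pr(y given x), where r is the probability of identity by descent."""
    if x == y:
        return freq[x] * (r + (1 - r) * freq[x])
    return freq[x] * (1 - r) * freq[y]

def diploid_likelihood(x, y, r_p, r_m, freq):
    """x and y are (paternal, maternal) allele pairs at one locus; phase is assumed known."""
    return (haploid_likelihood(x[0], y[0], r_p, freq)
            * haploid_likelihood(x[1], y[1], r_m, freq))

def total_likelihood(xs, ys, r_p, r_m, freqs):
    """Product over independent loci; xs, ys and freqs are lists indexed by locus."""
    return prod(diploid_likelihood(x, y, r_p, r_m, f)
                for x, y, f in zip(xs, ys, freqs))

# Toy locus with two equally frequent alleles (made-up frequencies).
freq = {1: 0.5, 2: 0.5}
same = haploid_likelihood(1, 1, 0.5, freq)
diff = haploid_likelihood(1, 2, 0.5, freq)
full_sib = diploid_likelihood((1, 2), (1, 2), 0.5, 0.5, freq)   # assumed R for full sibs
unrelated = diploid_likelihood((1, 2), (1, 2), 0.0, 0.0, freq)  # assumed R for unrelated
ratio = full_sib / unrelated
```

A quick sanity check on the haploid formula: summing the likelihood over all possible Y for a fixed X gives back Px, as it should for a joint probability.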

  13. Calculating Likelihood • Population Frequencies: Pxm,Pxp,Pym,Pyp • Likelihoods:

  14. Likelihood Ratios • Independent Likelihood is not very reliable or meaningful • Different Ratios => Different Loci • Ratio != Statistical Significance • Simulations used to determine P-values

  15. Statistical Significance • Randomly generate an individual X using allele frequencies • Draw Y using Rm and Rp • First Allele: Copy X's allele with Probability Rm or vice versa • Second Allele: Copy X's allele with Probability Rp or vice versa • Draw a large number of such <X,Y> pairs • The value of the ratio that excludes 95% of such pairs is at P=0.05 significance
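The simulation procedure above might be sketched as follows. The names `simulate_pair` and `significance_threshold` are hypothetical, the ratio function is passed in as a parameter, and the toy `shared_alleles` function only stands in for a real likelihood ratio so that the example is self-contained.

```python
import random

def draw_allele(freq, rng):
    """Draw one allele according to the population frequencies."""
    alleles = list(freq)
    return rng.choices(alleles, weights=[freq[a] for a in alleles])[0]

def simulate_pair(freq, r_p, r_m, rng):
    """Draw X from the frequencies, then Y by copying each of X's alleles
    with probability R, otherwise drawing a fresh allele."""
    x = (draw_allele(freq, rng), draw_allele(freq, rng))
    y_p = x[0] if rng.random() < r_p else draw_allele(freq, rng)
    y_m = x[1] if rng.random() < r_m else draw_allele(freq, rng)
    return x, (y_p, y_m)

def significance_threshold(ratio_fn, freq, r_p, r_m, n_pairs=10000, level=0.05, seed=0):
    """Ratio value at the (1 - level) quantile of the simulated pairs."""
    rng = random.Random(seed)
    ratios = sorted(ratio_fn(*simulate_pair(freq, r_p, r_m, rng))
                    for _ in range(n_pairs))
    index = min(n_pairs - 1, int((1 - level) * n_pairs))
    return ratios[index]

# Toy stand-in for a likelihood ratio: number of matching allele positions.
def shared_alleles(x, y):
    return (x[0] == y[0]) + (x[1] == y[1])

freq = {1: 0.5, 2: 0.5}
threshold = significance_threshold(shared_alleles, freq, 0.0, 0.0, n_pairs=2000)
```

In practice the ratio function would be the likelihood ratio of the primary to the null hypothesis, and the relatedness values used for the simulation would be those of the hypothesis being tested against.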

  16. Family Finder • Jen Beyer and B. May. A graph-theoretic approach to the partition of individuals into full-sib families. Molecular Ecology, 12:2243–2250, 2003.

  17. Graph-Theory? • Build a graph of all individuals • Connect individuals with edges representing relationships • Assign Likelihood Ratio Full Sib/Unrelated as distance measure • Filter using likelihood ratio at 0.05 significance level • Find a cut

  18. Algorithm
      • Calculate LFS/LUR likelihood ratios for all pairs
      • Build a graph representing the full-sib relationships
      • Find the connected components in the graph and store them in a queue
      • While the queue is not empty do
          • Remove a component from the queue and calculate its score
          • Build a GH (Gomory-Hu) cut tree for the component
          • For each cut with less than 1/3 the total number of edges in the component do
              • Score the components that would result if the cut's edges were removed
              • If the scores are the best found so far, then store them
          • If the best scores found are higher than the score for the original component, then separate the families and put them in the queue for further analysis
          • Otherwise save the original component as a result family
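The first three steps of this algorithm (ratios, graph, connected components in a queue) can be sketched as below. This is only an illustration under assumptions: the function names and the toy ratio table are hypothetical, and the cut-tree construction and component scoring are not shown.

```python
from collections import deque

def full_sib_graph(individuals, ratio, threshold):
    """Keep an edge between two individuals when their LFS/LUR ratio passes the threshold."""
    graph = {i: set() for i in individuals}
    for a in individuals:
        for b in individuals:
            if a < b and ratio(a, b) >= threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph

def connected_components(graph):
    """Return the connected components as a queue of sets, ready for further splitting."""
    seen, queue = set(), deque()
    for start in graph:
        if start in seen:
            continue
        component, frontier = set(), [start]
        seen.add(start)
        while frontier:
            node = frontier.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    frontier.append(neighbour)
        queue.append(component)
    return queue

# Made-up pairwise ratios; pairs not listed are treated as clearly unrelated.
toy_ratios = {(1, 2): 5.0, (2, 3): 4.0, (4, 5): 6.0}

def toy_ratio(a, b):
    return toy_ratios.get((a, b), 0.1)

graph = full_sib_graph([1, 2, 3, 4, 5], toy_ratio, threshold=1.0)
components = list(connected_components(graph))
```

Each component taken from the queue would then be scored and examined for cuts, as the remaining steps describe.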

  19. Example • Score the components and keep the best cuts

  20. Conclusion: Family Finder • Some theoretical basis • Efficiently computable • Produces reasonably good results for many loci • A lot of assumptions because of the Goodnight & Queller measure • Requires a significant number of loci (8+) • Works well only when families are of almost equal size

  21. Parsimony • Parsimony = Occam's Razor • "entities must not be multiplied beyond necessity" • "plurality should not be posited without necessity" • "Parsimony is a 'less is better' concept of frugality, economy or caution in arriving at a hypothesis or course of action. The word derives from Middle English parcimony, from Latin parsimonia, from parsus, past participle of parcere: to spare. It is a general principle that has applications from science to philosophy and all related fields. Parsimony is essentially the implementation of Occam's razor." (Wikipedia) • Min sib groups = most parsimonious explanation

  22. Mendelian Constraints • 4-allele rule: siblings have at most 4 different alleles in a locus • Yes: 3/3, 1/3, 1/5, 1/6 • No: 3/3, 1/3, 1/5, 1/6, 3/2 • 2-allele rule: in a locus in a sibling group, a + R ≤ 4, where a = number of distinct alleles and R = number of alleles that appear with 3 others or are homozygous • Yes: 3/3, 1/3, 1/5 • No: 3/3, 1/3, 1/5, 1/6
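Both rules are easy to check mechanically for one locus. The sketch below uses hypothetical function names and reads "appear with 3 others" as "paired with at least three distinct other alleles", which is an assumption about the intended definition; the test genotypes are the Yes/No examples from this slide.

```python
def four_allele_ok(genotypes):
    """4-allele rule: at most 4 distinct alleles at the locus."""
    return len({a for g in genotypes for a in g}) <= 4

def two_allele_ok(genotypes):
    """2-allele rule: a + R <= 4 at the locus."""
    alleles = {a for g in genotypes for a in g}
    partners = {a: set() for a in alleles}
    homozygous = set()
    for x, y in genotypes:
        if x == y:
            homozygous.add(x)
        else:
            partners[x].add(y)
            partners[y].add(x)
    # R counts alleles that are homozygous somewhere or paired with >= 3 other alleles.
    r = sum(1 for a in alleles if a in homozygous or len(partners[a]) >= 3)
    return len(alleles) + r <= 4

four_yes = four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6)])
four_no = four_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6), (3, 2)])
two_yes = two_allele_ok([(3, 3), (1, 3), (1, 5)])
two_no = two_allele_ok([(3, 3), (1, 3), (1, 5), (1, 6)])
```

Note that the group 3/3, 1/3, 1/5, 1/6 passes the 4-allele rule but fails the 2-allele rule: it has a = 4 distinct alleles and R = 2 (allele 3 is homozygous, allele 1 appears with 3, 5 and 6).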

  23. Min Sibgroups Reconstruction • Find the minimum number of sibling groups necessary to explain the given cohort • Minimum Set Cover: cohort as universe U; individuals as elements of U; covering groups C include all genetically feasible sibling groups • NP-complete even when sibsets have size at most 3 • Hard to approximate (Ashley et al. 09) • ILP formulation (Chaovalitwongse et al. 08)

  24. Minimum Set Cover • Given: universe U = {1, 2, …, n} and a collection of sets S = {S1, S2, …, Sm} where each Si is a subset of U • Find: the smallest number of sets in S whose union is the universe U • Minimum Set Cover is NP-hard • (1 + ln n)-approximable (sharp)
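The (1 + ln n) approximation bound is achieved by the classic greedy heuristic, sketched here for illustration. This is not the method used in the talk, which solves the cover optimally with an ILP; the function name and the candidate groups are made up for the example.

```python
def greedy_set_cover(universe, sets):
    """Repeatedly pick the set that covers the most still-uncovered elements.
    Returns the indices of the chosen sets."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

# Hypothetical candidate sibling groups over a cohort of 8 individuals.
candidates = [{2, 4, 5, 6}, {1, 3}, {7, 8}, {1, 2}, {3, 4}, {5, 7}]
cover = greedy_set_cover(range(1, 9), candidates)
```

Greedy can return more sets than the optimum on adversarial inputs, which is why an exact solver is preferable when the instance is small enough.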

  25. 2-Allele Min Set Cover • Generate all maximal feasible sibling groups (sets) that satisfy the 2-allele property using the “2-Allele Algorithm” [ISMB 2007; Bioinformatics 23(13)] • Use Min Set Cover to find the minimum sibling groups • Solved optimally using ILP (CPLEX)

  26. 2-Allele Algorithm Overview • Generate candidate sets from all pairs of individuals • Compare every set to every individual x • If x can be added to the set without affecting “accommodability” or violating the 2-allele property: add it • If the “accommodability” is affected, but the 2-allele property is still satisfied: create a new copy of the set, and add x to it • Otherwise ignore the individual and compare the next

  27. Canonical families • [Figure: canonical parent genotype pairs and the offspring genotypes they can produce; the layout is not recoverable from the transcript]

  28. Examples (adding individual 1/4 to a group) • Group 3/4, 1/2, 3/2: add • Group 1/2, 3/2, 3/2: new group, add (the group won’t accommodate 2/2 afterwards) • Group 1/1, 1/2, 1/5: can’t add (a + R = 4)

  29. Testing and Validation: Protocol • Get a dataset with known sibgroups (real or simulated) • Find sibgroups using our algorithm • Compare the solutions: partition distance, Gusfield ’03 • Compare results to other sibship methods
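The partition distance used for comparison can be read as the minimum number of individuals that must be removed so that the two partitions become identical. A brute-force sketch under that reading follows; the function name is hypothetical, both partitions are assumed to cover the same individuals, and trying every matching of groups is only practical for a handful of groups (a real implementation would use an assignment solver).

```python
from itertools import permutations

def partition_distance(p, q):
    """n minus the best total overlap over one-to-one matchings of groups."""
    p = [set(block) for block in p]
    q = [set(block) for block in q]
    n = sum(len(block) for block in p)
    k = max(len(p), len(q))
    # Pad with empty groups so both partitions have k groups.
    p += [set() for _ in range(k - len(p))]
    q += [set() for _ in range(k - len(q))]
    best = max(sum(len(p[i] & q[j]) for i, j in enumerate(perm))
               for perm in permutations(range(k)))
    return n - best

true_groups = [{1, 2, 3}, {4, 5}, {6, 7}]
found_groups = [{1, 2, 3, 4}, {5, 6, 7}]
d = partition_distance(true_groups, found_groups)
```

In this example, removing individuals 4 and 5 leaves both partitions as {1, 2, 3}, {6, 7}, so the distance is 2.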

  30. Real Data • Salmon (Salmo salar), Herbinger et al., 1999: 351 individuals, 6 families, 4 loci; no missing alleles • Shrimp (Penaeus monodon), Jerry et al., 2006: 59 individuals, 13 families, 7 loci; some missing alleles • Ants (Leptothorax acervorum), Hammond et al., 1999: the ants dataset [16] is from a haplodiploid species; the data consist of 377 diploid worker ants

  31. Random Data Generation • Generate F females and M males (F = M = 5, 10, 15) • Each with l loci (l = 2, 4, 6) • Each locus with a alleles: a[uniform] = 5, 10, 15; a[nonuniform] = 4 (12-4-1-1) • Generate f families: f[uniform] = 2, 5, 10; f[nonuniform] = 5 • For each family select female + male uniformly at random • For each parent pair generate o offspring: o[uniform] = 2, 5, 10; o[nonuniform] = 25-10-10-4-1 • For each offspring, for each locus, choose the allele outcome uniformly at random
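The uniform case of this generation scheme might look like the sketch below. The function name, parameter names and defaults are hypothetical; parental alleles are drawn uniformly from the a available alleles, which is one reading of "a[uniform]", and the nonuniform variants are not covered.

```python
import random

def generate_cohort(n_parents=5, n_loci=2, n_alleles=5,
                    n_families=2, n_offspring=5, seed=0):
    rng = random.Random(seed)

    def random_parent():
        # One (allele, allele) genotype per locus, alleles drawn uniformly.
        return [(rng.randrange(n_alleles), rng.randrange(n_alleles))
                for _ in range(n_loci)]

    females = [random_parent() for _ in range(n_parents)]
    males = [random_parent() for _ in range(n_parents)]
    cohort, labels = [], []
    for _ in range(n_families):
        # Select the mother and the father uniformly at random.
        f, m = rng.randrange(n_parents), rng.randrange(n_parents)
        for _ in range(n_offspring):
            # At each locus the child gets one paternal and one maternal allele.
            child = [(rng.choice(males[m][locus]), rng.choice(females[f][locus]))
                     for locus in range(n_loci)]
            cohort.append(child)
            labels.append((f, m))  # true family = (mother index, father index)
    return cohort, labels, females, males

cohort, labels, females, males = generate_cohort(seed=1)
```

Labelling each offspring by its parent pair, rather than by a family counter, keeps the ground truth correct even if the same pair happens to be selected for two families.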

  32. Results

  33. Summary (Min Sib Groups) • 2-Allele Min Set Cover: first combinatorial approach; makes no assumptions other than parsimony; works consistently and comparatively • Sibling Reconstruction: growing number of methods; biologists need (one) reliable reconstruction; genotyping errors • Answer: Consensus

  34. Consensus Methods • Combine multiple solutions to a problem to generate one unified solution • C: S* → S • Based on Social Choice Theory • Commonly used where the real solution is not known, e.g. phylogenetic trees • [Diagram labels: S1, S2, ..., Sk; Consensus; S]

  35. Strict Consensus • Only Pareto Optimality and Anti-Pareto Optimality are enforced • All solutions must agree on equivalence: $x \equiv_S y \iff \forall i:\ x \equiv_{S_i} y$ • All disputed individuals go to singletons • S1 = {{1,2,3},{4,5},{6,7}} • S2 = {{1,2,3,4},{5,6,7}} • S3 = {{1,2},{3,4,5},{6,7}} • Strict Consensus: S = {{1,2},{3},{4},{5},{6,7}} • 5 sibling groups, when 3 can do?
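The strict consensus of the three solutions on this slide can be computed by grouping individuals that share a group in every solution. The sketch below uses a hypothetical function name and represents each solution as a list of sets.

```python
def strict_consensus(solutions):
    """Two individuals stay together only if every solution puts them together."""
    groups = {}
    for x in sorted(set().union(*solutions[0])):
        # The signature records which group x belongs to in each solution.
        signature = tuple(next(i for i, g in enumerate(s) if x in g)
                          for s in solutions)
        groups.setdefault(signature, set()).add(x)
    return list(groups.values())

s1 = [{1, 2, 3}, {4, 5}, {6, 7}]
s2 = [{1, 2, 3, 4}, {5, 6, 7}]
s3 = [{1, 2}, {3, 4, 5}, {6, 7}]
consensus = strict_consensus([s1, s2, s3])
```

Individuals with identical signatures agree in all solutions, so this reproduces the five groups shown on the slide.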

  36. Majority Consensus • Majority of solutions determine the final solution • Two individuals are together if a majority of solutions vote in their favour • Violates transitivity: A≡B ∧ B≡C ⇒ A≡C • S1 = {{1,2,3},{4,5},{6,7}} • S2 = {{1,2,3,4},{5,6,7}} • S3 = {{1,2},{3,4,5},{6,7}} • 1 ≡ 3 AND 3 ≡ 4 BUT 1 ≢ 4

  37. Majority Consensus • Voting Consensus: majority under closure • Results in large monolithic groups • S1 = {{1,2,3},{4,5},{6,7}} • S2 = {{1,2,3,4},{5,6,7}} • S3 = {{1,2},{3,4,5},{6,7}} • Voting Consensus: S = {{1,2,3,4,5},{6,7}} • 1 ≡ 5?
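Majority voting on pairs, followed by transitive closure, can be sketched with a small union-find. The function names are hypothetical; the input is the same three solutions used on the slides.

```python
from itertools import combinations

def majority_pairs(solutions):
    """Pairs of individuals that share a group in more than half of the solutions."""
    individuals = sorted(set().union(*solutions[0]))

    def together(solution, x, y):
        return any(x in g and y in g for g in solution)

    pairs = [(x, y) for x, y in combinations(individuals, 2)
             if 2 * sum(together(s, x, y) for s in solutions) > len(solutions)]
    return pairs, individuals

def voting_consensus(solutions):
    """Majority relation closed under transitivity (union-find over majority pairs)."""
    pairs, individuals = majority_pairs(solutions)
    parent = {x: x for x in individuals}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for x, y in pairs:
        parent[find(x)] = find(y)
    groups = {}
    for x in individuals:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

s1 = [{1, 2, 3}, {4, 5}, {6, 7}]
s2 = [{1, 2, 3, 4}, {5, 6, 7}]
s3 = [{1, 2}, {3, 4, 5}, {6, 7}]
pairs, _ = majority_pairs([s1, s2, s3])
voted = voting_consensus([s1, s2, s3])
```

The majority pairs contain (1, 3) and (3, 4) but not (1, 4), which is the transitivity violation; closing the relation merges individuals 1 through 5 into one group even though no solution ever places 1 and 5 together.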

  38. Consensus Methods • Commonly used consensus methods don’t work [AAAI-MPREF08] • Strict Consensus produces too many singletons • Majority violates transitivity AND doesn’t work for error-tolerance

  39. Distance-based Consensus • Algorithm: compute a consensus solution S = {g1, ..., gk}; search for a good solution near S • [Diagram labels: S1, S2, ..., Sk; Consensus; S; Search; Ss; fq; fd]

  40. Distance-based Consensus • Needs: a distance function fd: S x S → R; a quality function fq: S → R • What is the catch? [Sheikh et al. CSB 2008] • Optimization of fd, fq or an arbitrary linear combination is NP-Complete • Reduction from the 2-Allele Min Set Cover Problem

  41. A Greedy Approach • Algorithm: compute a strict consensus; while distance is not too large, merge two nearest sibgroups • Quality: fq = n − |C| • Distance function: fd(C, C’) = cost of merging groups in C to obtain C’

  42. A Greedy Approach • S1 ={ {1,2,3}, {4,5}, {6,7} } • S2={ {1,2,3}, {4}, {5,6,7} } • S3={ {1,2}, {3,4,5}, {6,7} } Strict Consensus S={ {1,2}, {3},{4},{5},{6,7} } S={ {1,2}, {3,6,7},{4},{5} }

  43. Greedy Consensus • Distance function (sibgroup, sibgroup): cost of assigning all individuals, $f_d(P_i, P_j) = \min\left( \sum_{X \in P_i} f_{assign}(P_j, X),\ \sum_{X \in P_j} f_{assign}(P_i, X) \right)$ • Distance function (sibgroup, individual): benefit = alleles and allele pairs shared; cost = minimum edit distance • $f_{assign}(P_i, X)$ = benefit if X can be a member of $P_i$; cost if X cannot be a member of $P_i$

  44. Greedy Consensus • Algorithm • Compute a strict consensus • While distance is not too large • Merge two sibgroups which will minimize the TOTAL merging cost • Store the new merging cost in the merged set

  45. Error-Tolerant Approach • [Diagram labels: Locus 1, Locus 2, Locus 3, ..., Locus k; Sibling Reconstruction Algorithm; S1, S2, ..., Sk; Consensus; S]

  46. Results

  47. Results • >90% accuracy for all real data

  48. Results

  49. Results

  50. Impossibility Result • A consensus method CANNOT be all of these [Arrow 1963, Mirkin 1975]: fair; independent; Pareto optimal • Biologically [AAAI-MPREF 2008]: the subset of individuals chosen will impact the consensus considerably
