1 / 20

Detecting copy number variations using paired-end sequence data

Detecting copy number variations using paired-end sequence data. CS224 May 29, 2009. Nick Furlotte. SNPs are old news. SNPs have been the main way of quantifying genetic variation Attention is now switching to other harder to detect variations

zuzela
Download Presentation

Detecting copy number variations using paired-end sequence data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Detecting copy number variations using paired-end sequence data CS224 May 29, 2009 Nick Furlotte

  2. SNPs are old news • SNPs have been the main way of quantifying genetic variation • Attention is now switching to other harder to detect variations • Structural variations possibly account for a large portion of genetic variance • Structural variations include • Insertions • Deletions • Inversions • Translocations • Copy number variations (CNVs)

  3. What does a CNV look like? Copy Number Variation: Reference Donor

  4. How do we find CNVs? • De novo sequence assembly is hard • Resequencing is now an option with low-cost next gen sequencing • This project aims to find CNVs using next gen sequencing reads.

  5. Paired-end sequencing Illumina website

  6. Paired-end sequencing Output: • Read lengths are a known size • Insert length has a distribution

  7. The Computational Problem • Given: A set of paired-end reads The mapping positions in the reference Read and Mapping Quality • Output: A set of CNVs An estimation of the boundaries of each CNV For each CNV an associated probability for the number of copies

  8. Proposed Method Use discordant read pairs as CNV signal Cluster discordant read pairs that explain the same CNV Estimate CNV boundaries based on clustered reads For each cluster calculate the probability of the number of copies = 1,2,3…

  9. Discordant read pair Donor Reference Concordant

  10. Discordant read pair Donor Reference Discordant

  11. Clustering Discordant Read Pairs • Use a greedy approach • Pick any discordant read • Compare with all other discordant reads and group any that are within a given distance • Do this until no reads can be clustered together • This sounds problematic but with the right assumptions it works • Assume to know the maximum insert length • Assume that the reverse read maps into the second copy and the forward read maps into the first copy • Assume that CNVs are far apart

  12. Read Pair Cluster Distance R+MI xmin N N = Xmin + R + MI

  13. if otherwise Read Pair Cluster Distance R N xmax N = Xmin+ R+ MI N = Xmax + R Xmax - Xmin = MI

  14. Estimating Boundaries • Have a set of clusters now • Simple Boundary estimation: Right bounary = Left bounary =

  15. Estimating the number of copies • Utilize coverage - • Position Coverage ~ Poisson(coverage) • For each cluster: • Perform a goodness of fit test for each coverage level • (ie. Number of copies = 2 => coverage’ = 2*coverage) • The coverage level that gives the best fit is the most • likely Sadly this did not work :( So I resorted to estimating by looking at the ratio of the mean Coverage to the expected.

  16. Simulation • Wrote simulator tool in C • Simulates FASTQ paired-end reads • Reads and writes MAQ files • Computes Coverage • Generated 10MB random genome using mouse chr 19 • Inserted 5 CNVs spread across the genome • Generated reads at 40x Coverage • Mapped to fake reference using MAQ • Applied the previously mentioned method

  17. Results • Found all of the CNVs • No False Positives • Worked exactly as predicted, because reads were perfect.

  18. Results • Applied method to real mouse sequence data • 40x coverage on chromosome 17 for CAST mouse strain • Found 1456 possible CNVs • Most of them were crazy looking • Zeroed in on 5 that look interesting Obviously the method needs some work

  19. Future Work • Table from Lee, et al. shows that the number of perfectly mapped reads that are discordant is very small • Their method considers all possible mapping positions for each pair • My method needs to consider this and toss MAQ

  20. Future Work • Replace the ad-hociness with a more formal probabilistic framework • Consider all high-quality mapping positions (take into account low quality mappings) • Consider the problem of repeated sequences

More Related