EFFICIENT ALGORITHM FOR COPY NUMBER VARIATION RECONSTRUCTION

Presentation Transcript


  1. EFFICIENT ALGORITHM FOR COPY NUMBER VARIATION RECONSTRUCTION • CS224 Project • Dan He

  2. CNV (COPY NUMBER VARIATION) • One type of SV (structural variation). [Figure; labels: Reference, Donor]

  3. RE-SEQUENCING USING SHORT READS • [Figure; label: Donor]

  4. COVERAGE RATIO • The coverage ratio indicates both the region covered by the copies and the number of copies. [Figure; labels: Coverage Ratio, Reference, Donor]
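
The slides do not define the coverage ratio precisely. The sketch below assumes it is the observed read depth in a window divided by the depth expected for a single-copy region; the function name `coverage_ratio` and all of its parameters are hypothetical and chosen only for illustration.

```python
def coverage_ratio(ref_len, read_starts, read_len, window, expected_depth):
    """Per-window ratio of observed depth to expected single-copy depth.

    Assumption: reads are given as 0-based start positions on the reference
    and all have the same length.
    """
    # Count how many reads cover each reference position.
    depth = [0] * ref_len
    for start in read_starts:
        for pos in range(start, min(start + read_len, ref_len)):
            depth[pos] += 1

    # Average the depth inside each window and normalize it.
    ratios = []
    for w in range(0, ref_len, window):
        chunk = depth[w:w + window]
        ratios.append(sum(chunk) / len(chunk) / expected_depth)
    return ratios
```

Under this assumption, a window with a ratio near 2 suggests a region present twice in the donor, and a run of such windows marks the extent of the copied region.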

  5. HISTOGRAM OF COVERAGE RATIO FOR SIMULATED SEQUENCES

  6. CNV RE-CONSTRUCTION • Input: a reference sequence, a set of paired-end reads randomly sampled from the donor sequence, and the coverage ratio. • Output: a donor sequence such that (1) the number of CNVs is consistent with the coverage ratio and (2) the number of mismatches between the set of paired-end reads and the donor sequence is minimized.

  7. STEP 1 • Select the unmapped reads. [Figure; labels: Reference, Donor]

  8. STEP 2 • Find the junctions between copies using the unmapped reads: (1) split each unmapped read into 2 parts at each internal position of the read; (2) match both parts against the reference sequence. If both parts match the reference sequence, record the matched positions. [Figure; labels: start, end, Reference, Donor, junction]
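
The split-and-match step can be sketched as follows. This is a minimal illustration using exact matching and plain string search, not the project's actual implementation; the names `find_all` and `find_junctions`, the `min_part_len` parameter, and the choice to report the end of the left match together with the start of the right match are all assumptions.

```python
def find_all(text, pattern):
    """Return every 0-based start position of pattern in text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def find_junctions(reference, read, min_part_len):
    """Split an unmapped read at each internal position and match both parts.

    Returns (split, left_end, right_start) tuples: the left part matches the
    reference ending at left_end, and the right part matches starting at
    right_start.
    """
    junctions = []
    # Only consider splits where both parts are at least min_part_len long.
    for split in range(min_part_len, len(read) - min_part_len + 1):
        left, right = read[:split], read[split:]
        for left_pos in find_all(reference, left):
            for right_pos in find_all(reference, right):
                junctions.append((split, left_pos + len(left) - 1, right_pos))
    return junctions
```

For the reference `ACTGGTCACTGTCGATC` and the read `GTCGCTGT` (which spans two adjacent copies in the donor of the simple example), `find_junctions(reference, read, 4)` returns `[(4, 13, 8)]`: the left half ends at position 13 and the right half restarts at position 8, which brackets the repeated unit `CTGTCG`.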

  9. STEP 3 • Order the junctions so that the donor sequence is valid. This is needed when there are more than 2 copies and the copies are of different lengths.

  10. A SIMPLE EXAMPLE • Set a length threshold for the two parts of each split read so that the matches are "significant", that is, unlikely to occur by random chance. • Given a reference sequence of length L, the threshold can be computed by selecting an n such that L/4^n (the expected number of occurrences of a length-n string in the reference) is very small, say, less than 0.001. • Reference: ACTGGTCACTGTCGATC • Donor: ACTGGTCACTGTCGCTGTCGCTGTCGATC • Unmapped reads: TCGCTCGCTG
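
The threshold rule above amounts to finding the smallest n with L/4^n below the chosen cutoff. A small sketch, assuming the four bases are uniform and independent (which is what the 4^n term implies); the function name `min_significant_length` is hypothetical.

```python
def min_significant_length(ref_len, max_expected=0.001):
    """Smallest n such that ref_len / 4**n is below max_expected."""
    n = 1
    while ref_len / 4 ** n >= max_expected:
        n += 1
    return n
```

For a reference of length 1000 this gives n = 10, since 1000/4^10 is about 0.00095 while 1000/4^9 is about 0.0038.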

  11. A MORE COMPLICATED EXAMPLE • 3 copies with various lengths • Reference: ACTGGTCACTGTCGATC • Donor: ACTGGTCACTGTCGTGTCGCTGTCATC • Unmapped reads: TCGTGCGCTGTCATC

  12. EFFICIENT IMPLEMENTATIONS • Build a suffix tree on the reference sequence so that searching for match positions is much faster. • Build a search tree over the unmapped reads so that redundant comparisons can be avoided.
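
To illustrate the idea of indexing the reference once and then answering many match queries quickly, the sketch below swaps in a suffix array with binary search, which is simpler to write than the suffix tree named on the slide. The naive construction shown here is only suitable for short sequences, and the function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction by sorting the suffixes; fine for short references.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """All start positions of pattern in text, via two binary searches."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])
```

Each query then costs a logarithmic number of string comparisons instead of a scan of the whole reference, which matters because every unmapped read is split and searched at many positions.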

  13. DIFFICULTIES IN REAL APPLICATIONS • Reads contain errors. • Copies can contain heterozygous SNPs. • Repeat regions must be distinguished from CNV regions. • CNVs may occur together with other SVs.

  14. ERRORS ALLOWED IN THE READS • The steps stay the same, but mismatches are allowed when mapping the split parts of the reads to the reference sequence. • The number of candidate matched positions increases as the number of allowed mismatches increases. • Apply clustering to the candidate matched positions so that positions adjacent to each other are grouped together. The largest groups are selected.
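
The clustering step can be sketched as a single pass over the sorted candidate positions. The slides do not say how "adjacent" is defined, so the `max_gap` parameter below is an assumption, as are the function names.

```python
def cluster_positions(positions, max_gap=1):
    """Group sorted positions; a gap larger than max_gap starts a new group."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def largest_clusters(positions, max_gap=1):
    """Return the group or groups containing the most candidate positions."""
    clusters = cluster_positions(positions, max_gap)
    if not clusters:
        return []
    best = max(len(c) for c in clusters)
    return [c for c in clusters if len(c) == best]
```

The intuition is that a true junction is supported by many reads whose split positions land close together, while spurious matches introduced by allowing mismatches tend to be scattered and form small groups.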

  15. EXPERIMENTAL RESULTS • Reference sequence length: 1000 • CNV length: around 100 • Length threshold for a significant match: 10 • Read length: 36 • Cases compared: 3 copies and 4 copies, each with 0, 1, and 2 allowed errors • The reference sequence, CNVs, and reads were simulated using Nick's simulator

  16. RUNNING TIME VS. NUMBER OF COPIES

  17. NUMBER OF CANDIDATE JUNCTIONS VS. NUMBER OF ERRORS

  18. QUESTIONS?
