1 / 27

AMOS tools for assembly validation

AMOS tools for assembly validation. Automatically scan an assembly to locate misassembly signatures for further analysis and correction Load Assembly Data into Bank Evaluate Mate Pairs & Libraries Evaluate Read Alignments Evaluate Read Breakpoints Analyze Depth of Coverage

kanoa
Download Presentation

AMOS tools for assembly validation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. AMOS tools for assembly validation • Automatically scan an assembly to locate misassembly signatures for further analysis and correction • Load Assembly Data into Bank • Evaluate Mate Pairs & Libraries • Evaluate Read Alignments • Evaluate Read Breakpoints • Analyze Depth of Coverage • Identify “Surrogates” • Load Misassembly Signatures into Bank AMOS Bank http://amos.sourceforge.net

  2. Assembly QC: mate happiness • Evaluate mate “happiness” across assembly • Happy = Correct orientation and distance • Finds regions with multiple: • Compressed Mates (too close together) • Expanded Mates (too far apart) • Invalid same orientation () • Invalid “outie” orientation () • Missing Mates • Linking mates (mate in a different scaffold) • Singleton mates (mate is not in any contig) • Regions with high C/E statistic

  3. Mate happiness • Excision:Skip reads between flanking repeats • Truth • Misassembly: Compressed Mates, Missing Mates

  4. Mate happiness • Insertion:Additional reads between flanking repeats • Truth • Misassembly: Expanded Mates, Missing Mates

  5. A B B A Mate happiness • Rearrangement: Reordering of reads • Truth • Misassembly: Misoriented Mates Note: if A,B too far apart, mates may all be “happy”

  6. Compression/Expansion (C/E) Statistic • The presence of individual compressed or expanded mates is rare but expected • Do the inserts spanning a given position differ from the rest of the library? • Flag large differences as potential misassemblies • Even if each individual mate is “happy” • Compute the statistic at all positions • (Local Mean – Global Mean) / Scaling Factor • Introduced by Jim Yorke’s group at UMD

  7. Library size variation 0kb 2kb 4kb 6kb 8 inserts: 3kb-6kb Local Mean: 4048 C/E Stat: (4048-4000) = +0.33 (400 / √8) Near 0 indicates overall happiness

  8. C/E statistic: Compression 0kb 2kb 4kb 6kb 8 inserts: 3.2 kb-4.8kb Local Mean: 3488 C/E Stat: (3488-4000) = -3.62 (400 / √8) C/E Stat ≤ -3.0 indicates Compression

  9. Read Alignment • Multiple reads with same conflicting base are unlikely • 1x QV 30: 1/1000 base calling error • 2x QV 30: 1/1,000,000 base calling error • 3x QV 30: 1/1,000,000,000 base calling error • Correlated SNPs are likely to be assembly errors, usually collapsed repeats • AMOS Tools: analyzeSNPs & clusterSNPs • Locate regions with high rate of correlated SNPs • Parameterized thresholds: • Multiple positions within 100bp sliding window • 2+ conflicting reads • Cumulative QV >= 40 (1/10000 base calling error) AG C AG C AG C AG C AG C AG C C TA C TAC TAC TAC TA

  10. Read breakpoints: compression error ribosomal RNA repeats, B. anthracis • QC METHOD: • Align singleton reads to consensus assembly • Find any breakpoints shared by multiple reads mates “chimeric” reads

  11. “Uncompress” by creating new repeat copy Reference: B. anthracis Ames ‘ancestor’ strain B. anthracis Ames Porton Down strain Tandem duplication

  12. Read Coverage • Find regions of contigs where the depth of coverage is unusually high • AMOS Tool: analyzeReadDepth • 2.5x mean coverage B A R1 + R2 A R1 R2 B

  13. Hawkeye: assembly viewer and debugger

  14. Launch Pad

  15. Histograms & Statistics Insert Size Read Length GC Content Overall Statistics • Bird’s eye view of data and assembly quality

  16. Scaffold View • Statistical Plots • Scaffold • Features • Clone inserts • Overview • Control Panel • Details

  17. Standard Feature Types [B] Breakpoint Alignment ends at this position [C] Coverage Location of unusual mate coverage (asmQC) [S] SNPs Location of Correlated SNPs [U] Unitig Used to report location of surrogate unitigs in CA assemblies [X] Other All other Features

  18. Insert (mate) Happiness Happy • Oriented Correctly && • |Insert Size – Library.mean| <= Happy-Distance * Library.sd Stretched • Oriented Correctly && • Insert Size > Library.mean + Happy-Distance * Library.sd Compressed • Oriented Correctly && • Insert Size < Library.mean - Happy-Distance * Library.sd Misoriented • Same or Outies Linking • Read’s mate is in some other scaffold Singleton • Read’s mate is a singleton Unmated • No mate was provided for read Both mates present Only 1 read present

  19. Discrepancy Navigation Contig Quick Select Discrepancy Regular Expression Consensus Search Consensus & Position Scrollable Read Tiling Summary Read Orientation Discrepancy Highlight Contig View: detailed alignment of reads to contigs

  20. Zoom Out SNP View SNP Sorted Reads Polymorphism View

  21. SNP Barcode SNP Sorted Reads Colored Rectangle indicate the positions and composition of the SNPs

  22. Scaffold View Coverage CE Statistic SNP Feature Happy Stretched Compressed Misoriented Linking

  23. Collapsed Repeat Read Coverage Spike -5.5 CE Dip Compressed Mates Cluster 68 Correlated SNPs

  24. Example 1: Compression in Prevotella intermedia 17assembly, found by the CE statistic • Green inserts are <=2 standard deviations from the mean, and the orange inserts are compressed by > 2 standard deviations. • Vertical yellow line shows the most likely place of a compression misassembly. • Only one insert in this case is compressed by > 3 standard deviations

  25. Example 2: Compression in Prevotella intermedia 17assembly, found by the CE statistic

  26. Fixing collapsed repeats with AMOS Original Contig Compression Point Before Patch Contig Resolved “Stitched” Contig After

  27. Assemblies can be preserved at NCBI’s Assembly Archivehttp://www.ncbi.nlm.nih.gov/Traces/assembly/assmbrowser.cgi

More Related