Whole Genome Assembly


Presentation Transcript


  1. Whole Genome Assembly

  2. WGA: 1. Screener, 2. Overlapper, 3. Unitigger, 4. Scaffolder, 5. Repeat Resolver.

  3. Overlapper ...looks for end-to-end overlaps of at least 40 bp with no more than 6% differences in the match. What’s the significance? ...a one in 10^17 event. Sequencing fidelity: 99.96%.

  4. However ...the Screener doesn’t include all of the “low frequency” level repeats, ...so, a majority of the Overlapper outputs are bogus.
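The overlap criterion on slide 3 (at least 40 bp, no more than 6% differences) can be illustrated with a minimal sketch. It assumes substitutions only (no insertions or deletions) and a brute-force search; the function name `end_overlap` and its parameters are hypothetical and are not taken from the presentation or from any real assembler.

```python
import random


def end_overlap(a: str, b: str, min_len: int = 40, max_diff: float = 0.06) -> int:
    """Length of the longest suffix of `a` that matches a prefix of `b`
    with at most `max_diff` fraction of mismatches, or 0 if none qualifies.

    Simplification: only substitutions are counted; indels are ignored.
    """
    longest = min(len(a), len(b))
    # Try the longest candidate overlap first, down to the minimum length.
    for k in range(longest, min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-k:], b[:k]))
        if mismatches <= max_diff * k:
            return k
    return 0


# Toy data: two 100 bp reads drawn from a random 150 bp sequence,
# sharing the 50 bases in the middle.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(150))
read_a, read_b = genome[:100], genome[50:]
print(end_overlap(read_a, read_b))
```

As slide 4 points out, passing this test does not make an overlap genuine: two reads from different copies of an unscreened repeat would satisfy the same criterion.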

  5. Unitigger ...differentiates between a true overlap, and an overlap that includes more than one locus.

  6. ...in a world where real data matches expected data, each locus would have 8X coverage, ...if there were repeats, then contigs would be “over-represented”, on average 8X more per repeat, ...over-collapsed.

  7. What Now? ...uniquely assembled contigs (unitigs) are readily identifiable: • all of the assembled sequences match over all of the known sequence, and • ...are consistent with 8X coverage.

  8. Unitigs ...contig cluster is consistent with expected size, ...no dissimilar sequences between any members. ...all other contigs are sent to the Discriminator.
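The coverage argument on slides 6 to 8 can be sketched as a simple depth check: a contig whose read depth is near the expected 8X is a unitig candidate, while one with roughly 16X or more looks like two or more collapsed repeat copies. This is a rough illustration only; the function names, the fixed `tolerance` cutoff, and the example numbers are assumptions, and the slides do not say what test the Unitigger actually applies.

```python
def estimated_copies(n_reads: int, read_len: int, contig_len: int,
                     expected_cov: float = 8.0) -> float:
    """Observed read depth divided by the expected depth (8X on the slides).

    A value near 1 suggests a single locus; a value near 2 suggests that
    two copies of a repeat were collapsed into one contig.
    """
    observed_cov = n_reads * read_len / contig_len
    return observed_cov / expected_cov


def classify(n_reads: int, read_len: int, contig_len: int,
             expected_cov: float = 8.0, tolerance: float = 0.5) -> str:
    """Label a contig using a hypothetical fixed cutoff of 1 + tolerance copies."""
    copies = estimated_copies(n_reads, read_len, contig_len, expected_cov)
    if copies < 1 + tolerance:
        return "unitig candidate"
    return "send to Discriminator"


# Hypothetical 10 kb contigs built from 500 bp reads.
print(classify(160, 500, 10_000))  # depth 8X
print(classify(320, 500, 10_000))  # depth 16X
```

A depth check alone covers only the "consistent with 8X coverage" condition; the slides also require that no member reads carry dissimilar sequence.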

  9. Discriminator ...parses the “over-collapsed” contig by using sequence outside of the overlap region.

  10. Discriminator ...may yield unitigs.

  11. Unitigger Output ...correctly assembled contigs covering 73.6% of the genome.

  12. Repeat Resolver ...most of the remaining gaps were due to repeats. 1. Allow “low Discriminator Value” contigs to fill gaps, 2. Find BAC sequences that unambiguously match outside the nearest unitig, • 1 in 10^7 chance of being wrong, 3. Ensure the mate end sequence of candidate BACs matches.

  13. If That Doesn’t Work ...find a mate-pair that spans the gap, and sequence it. Chromosome Walking ...make sequencing primer from BES (BAC end sequence)...

  14. Scaffolder ...contigs the contigs, • uses mate-pair information.
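How mate-pair information can "contig the contigs" is easiest to see in a toy example. The sketch below assumes each mate pair has already been reduced to a (left contig, right contig) link, ignores orientation and insert-size distances, and requires a hypothetical minimum of two supporting pairs per join; none of these details come from the presentation.

```python
from collections import Counter


def _reaches(nxt: dict, start: str, target: str) -> bool:
    """True if following successor links from `start` arrives at `target`."""
    while start in nxt:
        start = nxt[start]
        if start == target:
            return True
    return False


def scaffold(links, min_support: int = 2):
    """Chain contigs into scaffolds from mate-pair links.

    `links` holds one (left_contig, right_contig) tuple per mate pair whose
    two reads fall in different contigs. Joins are accepted greedily, best
    supported first; each contig gets at most one successor and predecessor.
    """
    support = Counter(links)
    nxt, prev = {}, {}
    for (a, b), n in sorted(support.items(), key=lambda kv: (-kv[1], kv[0])):
        if n < min_support or a == b or a in nxt or b in prev:
            continue
        if _reaches(nxt, b, a):  # joining would close a cycle
            continue
        nxt[a], prev[b] = b, a

    contigs = sorted({c for pair in support for c in pair})
    scaffolds = []
    for c in contigs:
        if c in prev:  # not the start of a chain
            continue
        chain = [c]
        while chain[-1] in nxt:
            chain.append(nxt[chain[-1]])
        scaffolds.append(chain)
    return scaffolds


# Hypothetical links: c3 -> c4 is supported by a single mate pair only.
links = [("c1", "c2")] * 3 + [("c2", "c3")] * 2 + [("c3", "c4")]
print(scaffold(links))
```

With these toy links, c1, c2 and c3 form one scaffold and c4 stays on its own because a single mate pair is below the support threshold.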

  15. WGA Result ...91% sequence, 9% gaps.

  16. Mapping Compartmentalized Shotgun Assembly

  17. Scaffolds

  18. Sequence Tagged Sites (STS) ...PCR primers are designed for unique regions of the genome or chromosome, ...the chromosome is cut, ...assay two PCR products, frequency of co-amplification indicates ...

  19. Sequence Tagged Sites STS

  20. Compartmentalized Shotgun Assembly ...ideally 24, ...really 3845.

  21. CSA: 92.2% sequence, 7.8% gaps. WGA: 91% sequence, 9% gaps.

  22. Chromosome 21 (PFP, CSA). Blue: gaps. Violations: red: misoriented; yellow: distance. Green: same order, orientation; yellow: same orientation; red: out of order, orientation.

  23. PFP CSA Chromosome 8

  24. PFP CSA

  25. Major Public Sequence Databases

  26. 281 Curated Data Bases, • “... facilitating Biological Discovery”.

  27. What Do We Know? (based on functional group analysis) Science 291 (5507), 1304-1351

  28. Functional Groups: 1st, the GenBank NR protein database was partitioned into clusters using BLASTP.

  29. Describing Aligned Sequences: 2nd, statistical descriptions of each cluster are developed and tested, • Hidden Markov Models: statistical descriptions of aligned sequences.

  30. Functional Group Annotation: 3rd, categorization was done by manual review of the family and subfamily names, ...by examining SwissProt and GenBank records, ...and by review of the literature as well as resources on the World Wide Web. http://www.expasy.ch/cgi-bin/niceprot.pl?P29965

  31. Outcomes? • A relatively small number of structural and functional domains are used in a large number of different proteins, • Pfam: 527 families, • average length is 275 residues, • 456 had “annotated functions”. Nucleic Acids Research 26, 320-322

  32. New Genes: 4th, newly sequenced genes are virtually translated, and the predicted proteins are assayed against raw and HMM databases, ...significance cut-off levels are determined for each functional group family.
