1 / 24

How to Build a Horse: Final Report

How to Build a Horse: Final Report. Megan Smedinghoff. Project Goals. Goal 1. Produce an assembly of the horse genome using the Celera Assembler. Goal 2. Reconcile my assembly with the Arachne assembly done by Broad Institute in February 2007. Goal 3.

bwasson
Download Presentation

How to Build a Horse: Final Report

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. How to Build a Horse:Final Report Megan Smedinghoff

  2. Project Goals Goal 1 Produce an assembly of the horse genome using the Celera Assembler Goal 2 Reconcile my assembly with the Arachne assembly done by Broad Institute in February 2007 Goal 3 Produce a parallelized version of the UMD Overlapper to help facilitate easier assembly of large organisms in the future

  3. DNA • Cells in all organisms have a DNA molecule in their nucleus • DNA is an abbreviation forDeoxyriboNucleic Acid • DNA is translated into proteins, which determine the structure, function, and behavior of the cell

  4. What does DNA look like? • Linear, double-stranded, helical molecule. • Consists of four types of nucleotides (bases): • adenine (A) • thymine (T) • guanine (G) • cytosine (C)

  5. What is a genome? • A genome is the linear sequence of bases of the DNA molecule: …..GATGACATGTAT….. • Bacteria: ~2-5 million bases • Insects (fly): ~200 million bases • Mammals: ~3 billion bases

  6. The Horse Genome • In February 2007, Broad Institute released a draft assembly of the horse genome • The sequencing cost $15 million • The assembly contains approximately 2.7 billion bases and was done using 6.8-fold coverage The horse that was sequenced

  7. Why Sequence the Horse? • Allows scientists to study diseases that primarily affect horses such as Glanders • Over 80 known genetic conditions exist in both horses and humans; comparative genomics can lead to better treatments for both species • Work done on the horse will lead to improvements in the assembly pipeline and allow easier assembly of large mammals in the future

  8. DNA target sample SHEAR SIZE SELECT e.g., 10Kbp ± 8% std.dev. End Reads (Mates) 750bp LIGATE & CLONE Primer SEQUENCE Vector Review of Genome Sequencing Slide courtesy of Art Delcher

  9. Review of Assembly Process Calculating Overlaps • We compare each read to every other read and record pairs whose ends overlap by a certain amount (usually ~40 nucleotides) AGTGATTAGATGATAGTAGA | | | | | | | | | | | | GATGATAGTAGAGGATAGATTTA • We use the University of Maryland overlapper to calculate the overlaps • UMD Overlapper is advantageous because it corrects for sequencing errors and does additional trimming to help produce the most accurate list of overlaps possible

  10. Unitig Contig Scaffold Review of Assembly Process Unitigs We use overlap information to combine reads into unitigs Contigs We use mate pair information to combine unitigs into contigs Scaffolds We further use mate pair information to determine how far apart contigs should be (the result is called a scaffold)

  11. Goal 1: Creating an Assembly Running the Assembler • It took nine days for the Celera Assembler to produce an assembly • Celera generated over a terabyte of data • The final assembly used 24.9 million reads (7.9x coverage) Results

  12. Contig A Contig B Contig C Contig D Celera Scaffold Contig A’ Contig B’ Contig C’ Contig D’ Arachne Scaffold Mapping of Celera Scaffold to Arachne Scaffold Goal 2: Produce a Reconciled Assembly Step 1: Match the Celera assembly to the Arachne assembly Step 2: Compare scaffolds from the two assemblies Step 3: Run reconciliation software

  13. Contig A Contig B Contig C Contig D Celera Scaffold Contig A’ Contig B’ Contig C’ Contig D’ Arachne Scaffold Illustration of an orientation problem between the two assemblies Orientation Problems Definition: An orientation problem occurs when a set of Celera contigs matches a set of Arachne contigs, but one of the contigs is placed in the opposite orientation

  14. Contig A Contig B Contig C Contig D Celera Scaffold Contig A’ Contig C’ Contig B’ Contig D’ Arachne Scaffold Illustration of an ordering problem between the two assemblies Ordering Problems Definition: An ordering problem occurs when a set of Celera contigs map to a set of Arachne contigs, but the matches occur in a different order in each scaffold

  15. Contig A Contig B Contig C Contig D Gap Celera Scaffold Contig A’ Contig D’ Contig B’ Contig C’ Arachne Scaffold Gap Illustration of an gap size problem between the two assemblies Gap Size Problems Definition: A gap problem occurs when the size of the Arachne gap is more than four standard deviations away from the estimated Celera gap size

  16. Results of the Comparison Note: In most Celera scaffolds, there was at most one contig that matched a contig in the Arachne assembly. In these scaffolds, there was no way to look for discrepancies between the two assemblies.

  17. Matching sequences Assembly A Assembly B Matching sequences Assembly A Assembly B Compression error in Assembly A or expansion error in Assembly B Reconciliation Gaps Gap in Assembly B Compressions

  18. Reconciliation Results

  19. ROM MOR Overlaps Additional Post-Processing Goal 3: Parallelizing the Overlapper Get associated hex value for each 20-mer in read Record “minimizer” (20-mer with minimum hex value) for sliding 20-base window Create Read, Offset, Minimizer file Filter by certain minimizers and then sort to get candidate pairs for overlap Run “checker” on candidate pairs and produce edit-transcripts Output Overlap List

  20. Parallelization of Overlapper • The parallelization occurs during the “checker” process when the reads with common minimizers are compared • Since the reads are already grouped by minimizers, there is no overhead for doing the comparisons in parallel • I used the Sun Grid Engine routines installed on the genome cluster to submit the jobs to available processors

  21. Results of Parallelization Bacteria Fly Horse

  22. Summary Goal 1 I created an assembly of the horse using the Celera assembler. The assembly size was 2.5 billion bases. Goal 2 I created a reconciled assembly that fixed 551 out of 687 compressions in the Arachne assembly and increased the assembly size by 4.3 million bases Goal 3 I created a parallelized version of the overlapper that significantly reduces the running time on larger genomes

  23. Acknowledgements Thanks to Aleksey Zimin, Guillaume Marçais, Mike Roberts, James Yorke, and Poorani Subramanian for instruction and advice on this project Thanks also to Matthew Pragel for moral support

  24. ROM MOR Overlaps 10-8 More Technical Overlapper Slide Get associated hex value for each 20-mer in read Record “minimizer” (20-mer with minimum hex value) for sliding 20-base window Create Read, Offset, Minimizer file Filter by certain minimizers and then sort to get Candidate pairs for overlap Run “checker” on candidate pairs And produce edit-transcripts Output Overlap List Do Multi-compare and split overlaps when they have a certain number of differences Do Poisson test and eliminate any overlaps that have less than a 10-8 probability of occurring

More Related