1 / 32

Genome Sequencing and Assembly High throughput Sequencing

Genome Sequencing and Assembly High throughput Sequencing. Xiaole Shirley Liu Jun Liu STAT115, STAT215. Outline. Genome sequencing strategy Clone-by-clone sequencing Whole-genome shotgun sequencing Hybrid method Next generation sequencing technologies. Competing Sequencing Strategies.

maitland
Download Presentation

Genome Sequencing and Assembly High throughput Sequencing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Genome Sequencing and AssemblyHigh throughput Sequencing Xiaole Shirley Liu Jun Liu STAT115, STAT215

  2. Outline • Genome sequencing strategy • Clone-by-clone sequencing • Whole-genome shotgun sequencing • Hybrid method • Next generation sequencing technologies

  3. Competing Sequencing Strategies • Clone-by-clone and whole-genome shotgun

  4. Clone-by-Clone Shotgun Sequencing • E.g. Human genome project • Map construction • Clone selection • Subclone library construction • Random shotgun phase • Directed finishing phase and sequence authentication

  5. Map Construction • Clone genomic DNA in YACs (~1MB) or BACs (~200KB) • Map the relative location of clones • Sequenced-tagged sites (STS, e.g. EST) mapping • PCR or probe hybridization to screen STS • Restriction site fingerprint • Most time consuming • 1990-98 to generate physical maps for human http://www.ncbi.nlm.nih.gov/genemap99/

  6. Resolve Clone Relative Location • Find a column permutation in the binary hybridization matrix, all ones each row are located in a block STS: 1 2 3 4 5 DNA clone 1 clone 2 clone 3 clone 4 STS Clone

  7. Clone Selection • Based on clone map, select authentic clones to generate a minimum tiling path • Most important criteria: authentic

  8. Subclone Library Construction • DNA fragmented by sonication or RE cut • Fragment size ~ 2-5 KB

  9. Random Shotgun Phase • Dideoxy termination reaction • Informatics programs • Coverage and contigs

  10. Sequencing primers Dideoxy Termination • Method invented by Fred Sanger • Automated sequencing developed by Leroy Hood (Caltech) and Michael Hunkapiller (ABI)

  11. Bioinformatics Programs • Developed at Univ. Wash • Phred • Base calling • Phil Green • Phrap • Assembly • Brent Ewing • Consed • Viewing and editing • David Gordon

  12. Coverage and contigs • Coverage: sequenced bp / fragment size • E.g. 200KB BAC, sequenced 1000 x 500bp subclones, coverage = 1000 x 500bp / 200KB = 2.5X • Lander-Waterman curve

  13. Directed Finishing Phase • David Gordon: auto-finish • Deign primers at gap 2 ends, PCR amplify, and sequence the two ends until they meet • Sequence authentication: verify STS and RE sites • Finished: < 1 error (or ambiguity) in 10,000bp, in the right order and orientation along a chromosome, almost no gaps.

  14. Genome-Shotgun Sequencing • Celera human and drosophila genomes • No physical map • Jigsaw puzzle assembly • Coverage ~7-10X

  15. Shotgun Assembly • Screener • Identify low quality reads, contamination, and repeats • Overlapper • >= 40bp overlap with <= 6% mismatches • Unitigger • Combine the easy (unique assembly) subset first • Scaffolder & repeat resolution • Generate different sized-clone libraries, and just sequence the clone ends (read pairs) • Use physical map information if available • Consensus ..ACGATTACAATAGGTT..

  16. Hybrid Method

  17. Hybrid Method • Optimal mixture of clone-by-clone vs whole-genome shotgun not established • Still need 8-10X overall coverage • Bacteria genomes can be sequenced WGS alone • Higher eukaryotes need more clone-by-clone • Comparative genomics can reduce the physical mapping (clone-by-clone) need • Sequencing cost decreasing quickly • Goal: $1000 / genome

  18. PRODUCTION Rooms of equipment Sample preparations 35 people 3-4 weeks SEQUENCING 74x Capillary Sequencers 10 people 15-40 runs per day 1-2Mb per instrument per day 120Mb total capacity per day First Generation Sequencing

  19. PRODUCTION 1x Cluster Station 1 person 1 day SEQUENCING 1x Genome Analyzer Same person as above 1 run per 3-5 days 0.5Gb per day per instrument Second Generation Sequencing

  20. 2nd Gen Sequencing Tech • Traditional sequencing machine: • 384 reads ~1kb / 3 hours • 454 (Roche): • 1M reads ~400bp / 5 hours • Solexa (Illumina): • 100M-1B reads of 30-100bp / 3-8 days, 8-16 samples • SOLiD (Applied Biosystems) • 1.4B reads of 35-50bp / 5-8 days, 16 samples • Helicos (single molecule sequencing): • 500M reads of ~30 bp / week, 50 samples • Moving targets

  21. Illumina (Solexa) Workflow

  22. Illumina HiSeq2000 • Throughput: • ~ $1100 / lane • 35-100 bp / read • 16 lanes (2 flow cells) / run • ~ 60-80 million reads / lane • Sequencing a human genome: $10000, 1 week • Bioinfo challenges • Very large files • CPU and RAM hungry • Sequence quality filtering • Mapping and downstream analysis

  23. Seq Files @HWI-EAS305:1:1:1:991#0/1 GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT +HWI-EAS305:1:1:1:991#0/1 MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB @HWI-EAS305:1:1:1:201#0/1 AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT +HWI-EAS305:1:1:1:201#0/1 PXX[[[[XTXYXTTWYYY[XXWWW[TMTVXWBBB HWUSI-EAS366_0112:6:1:1298:18828#0/1    16      chr9    98116600        255     38M     *       0       0       TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG  Y\]bc^dab\[_UU`^`LbTUT\ccLbbYaY`cWLYW^  XA:i:1  MD:Z:3C30T3     NM:i:2 HWUSI-EAS366_0112:6:1:1257:18819#0/1    4       *       0       0       *       *       0       0       AGACCACATGAAGCTCAAGAAGAAGGAAGACAAAAGTG  ece^dddT\cT^c`a`ccdK\c^^__]Yb\_cKS^_W\  XM:i:1 HWUSI-EAS366_0112:6:1:1315:19529#0/1    16      chr9    102610263       255     38M     *       0       0       GCACTCAAGGGTACAGGAAAAGGGTCAGAAGTGTGGCC  ^c_Yc\Lcb`bbYdTa\dd\`dda`cdd\Y\ddd^cT`  XA:i:0  MD:Z:38 NM:i:0 chr1 123450 123500 + chr5 28374615 28374615 - • Raw FASTQ • Sequence ID, sequence • Quality ID, quality score • Mapped SAM • Map: 0 OK, 4 unmapped, 16 mapped reverse strand • MD: mismatch info • NM: number of mismatch • Mapped BED • Chr, start, end, strand

  24. (Potential) Applications • Metagenomics and infectious disease • Ancient DNA, recreate extinct species • Comparative genomics (between species) and personal genomes (within species) • Genetic tests and forensics • Circulating nucleic acids • Risk, diagnosis, and prognosis prediction • Transcriptome and transcriptional regulation • More later in the semester…

  25. Third Generation Sequencing • Single molecule sequencing (no amplification needed) • Some can read very long sequences • In 2-3 years, the cost of sequencing a human genome will drop below $1000 • Personal genome sequencing will be a key component of public health in every developed country • The cost of sequencing will be lower than the cost of storing the sequences • Bioinformatics will be key to convert data into knowledge

  26. HW Q for Graduate Students • Write the first page (Specific Aims) of an NIH research proposal with ~ $1 million budget that uses high throughput sequencing and bioinformatics analysis to solve some interesting biomedical problems

  27. How to Write a Specific Aims Page • Grant title and your name • Introductory (1-2) paragraphs (1/4-1/3 page) • A is very important in bio/medicine/disease • Recent development in A has made some really significant findings or improvements • However, something is still lacking or not known about A • The long term goal of our research is: • The focus of this proposal is: • The central hypothesis of this proposal is … • Therefore, we plan to do investigate / develop

  28. How to Write a Specific Aims Page • Specific Aims (1/3 to ½ page) • Specifically, we plan to: • Aim 1: profile the genome-wide xxx of xxx • 1.1: establish xxx • 1.2 • Aim 2: develop a computer algorithm or knowledge base • 2.1: model xxx • … • Aim 3: identify the mechanism of xxx • …

  29. Specific Aims • Sound like you can definitely do it • Do not use words that sound like a fishing expedition such as try, explore (find) or words that sound too trivial such as download (collect) • Try not to let your aims’ successes depend on each other, but it is also important to be intellectually coherent among the aims (not a collection of unrelated topics) • Propose as many aims as years for the grant

  30. How to Write a Specific Aims Page • Last paragraph (< ¼ page) • What’s novel about our approach • Deliverables (what will the scientific community see at the end) • A software, database, map, mechanism, resource • What’s the (potential) significance about our proposal • Best possible outcome

  31. Summary • Genome sequencing and assembly • Clone-by-clone: HGP • Map big clones, find path, shotgun sequence subclones, assemble and finish • Sequencing: dideoxy termination • Whole genome shotgun: Celera • Massively Parallel Sequencing • 454, Solexa, SOLiD, Helicos… • Many opportunities and many challenges • Project proposal

  32. Acknowledgement • Fritz Roth • Dannie Durand • Larry Hunter • Richard Davis • Wei Li • Jarek Meller • Stefan Bekiranov • Stuart M. Brown • Rob Mitra

More Related