370 likes | 845 Views
Genome Sequencing and Assembly High throughput Sequencing. Xiaole Shirley Liu Jun Liu STAT115, STAT215. Outline. Genome sequencing strategy Clone-by-clone sequencing Whole-genome shotgun sequencing Hybrid method Next generation sequencing technologies. Competing Sequencing Strategies.
E N D
Genome Sequencing and AssemblyHigh throughput Sequencing Xiaole Shirley Liu Jun Liu STAT115, STAT215
Outline • Genome sequencing strategy • Clone-by-clone sequencing • Whole-genome shotgun sequencing • Hybrid method • Next generation sequencing technologies
Competing Sequencing Strategies • Clone-by-clone and whole-genome shotgun
Clone-by-Clone Shotgun Sequencing • E.g. Human genome project • Map construction • Clone selection • Subclone library construction • Random shotgun phase • Directed finishing phase and sequence authentication
Map Construction • Clone genomic DNA in YACs (~1MB) or BACs (~200KB) • Map the relative location of clones • Sequenced-tagged sites (STS, e.g. EST) mapping • PCR or probe hybridization to screen STS • Restriction site fingerprint • Most time consuming • 1990-98 to generate physical maps for human http://www.ncbi.nlm.nih.gov/genemap99/
Resolve Clone Relative Location • Find a column permutation in the binary hybridization matrix, all ones each row are located in a block STS: 1 2 3 4 5 DNA clone 1 clone 2 clone 3 clone 4 STS Clone
Clone Selection • Based on clone map, select authentic clones to generate a minimum tiling path • Most important criteria: authentic
Subclone Library Construction • DNA fragmented by sonication or RE cut • Fragment size ~ 2-5 KB
Random Shotgun Phase • Dideoxy termination reaction • Informatics programs • Coverage and contigs
Sequencing primers Dideoxy Termination • Method invented by Fred Sanger • Automated sequencing developed by Leroy Hood (Caltech) and Michael Hunkapiller (ABI)
Bioinformatics Programs • Developed at Univ. Wash • Phred • Base calling • Phil Green • Phrap • Assembly • Brent Ewing • Consed • Viewing and editing • David Gordon
Coverage and contigs • Coverage: sequenced bp / fragment size • E.g. 200KB BAC, sequenced 1000 x 500bp subclones, coverage = 1000 x 500bp / 200KB = 2.5X • Lander-Waterman curve
Directed Finishing Phase • David Gordon: auto-finish • Deign primers at gap 2 ends, PCR amplify, and sequence the two ends until they meet • Sequence authentication: verify STS and RE sites • Finished: < 1 error (or ambiguity) in 10,000bp, in the right order and orientation along a chromosome, almost no gaps.
Genome-Shotgun Sequencing • Celera human and drosophila genomes • No physical map • Jigsaw puzzle assembly • Coverage ~7-10X
Shotgun Assembly • Screener • Identify low quality reads, contamination, and repeats • Overlapper • >= 40bp overlap with <= 6% mismatches • Unitigger • Combine the easy (unique assembly) subset first • Scaffolder & repeat resolution • Generate different sized-clone libraries, and just sequence the clone ends (read pairs) • Use physical map information if available • Consensus ..ACGATTACAATAGGTT..
Hybrid Method • Optimal mixture of clone-by-clone vs whole-genome shotgun not established • Still need 8-10X overall coverage • Bacteria genomes can be sequenced WGS alone • Higher eukaryotes need more clone-by-clone • Comparative genomics can reduce the physical mapping (clone-by-clone) need • Sequencing cost decreasing quickly • Goal: $1000 / genome
PRODUCTION Rooms of equipment Sample preparations 35 people 3-4 weeks SEQUENCING 74x Capillary Sequencers 10 people 15-40 runs per day 1-2Mb per instrument per day 120Mb total capacity per day First Generation Sequencing
PRODUCTION 1x Cluster Station 1 person 1 day SEQUENCING 1x Genome Analyzer Same person as above 1 run per 3-5 days 0.5Gb per day per instrument Second Generation Sequencing
2nd Gen Sequencing Tech • Traditional sequencing machine: • 384 reads ~1kb / 3 hours • 454 (Roche): • 1M reads ~400bp / 5 hours • Solexa (Illumina): • 100M-1B reads of 30-100bp / 3-8 days, 8-16 samples • SOLiD (Applied Biosystems) • 1.4B reads of 35-50bp / 5-8 days, 16 samples • Helicos (single molecule sequencing): • 500M reads of ~30 bp / week, 50 samples • Moving targets
Illumina HiSeq2000 • Throughput: • ~ $1100 / lane • 35-100 bp / read • 16 lanes (2 flow cells) / run • ~ 60-80 million reads / lane • Sequencing a human genome: $10000, 1 week • Bioinfo challenges • Very large files • CPU and RAM hungry • Sequence quality filtering • Mapping and downstream analysis
Seq Files @HWI-EAS305:1:1:1:991#0/1 GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT +HWI-EAS305:1:1:1:991#0/1 MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB @HWI-EAS305:1:1:1:201#0/1 AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT +HWI-EAS305:1:1:1:201#0/1 PXX[[[[XTXYXTTWYYY[XXWWW[TMTVXWBBB HWUSI-EAS366_0112:6:1:1298:18828#0/1 16 chr9 98116600 255 38M * 0 0 TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG Y\]bc^dab\[_UU`^`LbTUT\ccLbbYaY`cWLYW^ XA:i:1 MD:Z:3C30T3 NM:i:2 HWUSI-EAS366_0112:6:1:1257:18819#0/1 4 * 0 0 * * 0 0 AGACCACATGAAGCTCAAGAAGAAGGAAGACAAAAGTG ece^dddT\cT^c`a`ccdK\c^^__]Yb\_cKS^_W\ XM:i:1 HWUSI-EAS366_0112:6:1:1315:19529#0/1 16 chr9 102610263 255 38M * 0 0 GCACTCAAGGGTACAGGAAAAGGGTCAGAAGTGTGGCC ^c_Yc\Lcb`bbYdTa\dd\`dda`cdd\Y\ddd^cT` XA:i:0 MD:Z:38 NM:i:0 chr1 123450 123500 + chr5 28374615 28374615 - • Raw FASTQ • Sequence ID, sequence • Quality ID, quality score • Mapped SAM • Map: 0 OK, 4 unmapped, 16 mapped reverse strand • MD: mismatch info • NM: number of mismatch • Mapped BED • Chr, start, end, strand
(Potential) Applications • Metagenomics and infectious disease • Ancient DNA, recreate extinct species • Comparative genomics (between species) and personal genomes (within species) • Genetic tests and forensics • Circulating nucleic acids • Risk, diagnosis, and prognosis prediction • Transcriptome and transcriptional regulation • More later in the semester…
Third Generation Sequencing • Single molecule sequencing (no amplification needed) • Some can read very long sequences • In 2-3 years, the cost of sequencing a human genome will drop below $1000 • Personal genome sequencing will be a key component of public health in every developed country • The cost of sequencing will be lower than the cost of storing the sequences • Bioinformatics will be key to convert data into knowledge
HW Q for Graduate Students • Write the first page (Specific Aims) of an NIH research proposal with ~ $1 million budget that uses high throughput sequencing and bioinformatics analysis to solve some interesting biomedical problems
How to Write a Specific Aims Page • Grant title and your name • Introductory (1-2) paragraphs (1/4-1/3 page) • A is very important in bio/medicine/disease • Recent development in A has made some really significant findings or improvements • However, something is still lacking or not known about A • The long term goal of our research is: • The focus of this proposal is: • The central hypothesis of this proposal is … • Therefore, we plan to do investigate / develop
How to Write a Specific Aims Page • Specific Aims (1/3 to ½ page) • Specifically, we plan to: • Aim 1: profile the genome-wide xxx of xxx • 1.1: establish xxx • 1.2 • Aim 2: develop a computer algorithm or knowledge base • 2.1: model xxx • … • Aim 3: identify the mechanism of xxx • …
Specific Aims • Sound like you can definitely do it • Do not use words that sound like a fishing expedition such as try, explore (find) or words that sound too trivial such as download (collect) • Try not to let your aims’ successes depend on each other, but it is also important to be intellectually coherent among the aims (not a collection of unrelated topics) • Propose as many aims as years for the grant
How to Write a Specific Aims Page • Last paragraph (< ¼ page) • What’s novel about our approach • Deliverables (what will the scientific community see at the end) • A software, database, map, mechanism, resource • What’s the (potential) significance about our proposal • Best possible outcome
Summary • Genome sequencing and assembly • Clone-by-clone: HGP • Map big clones, find path, shotgun sequence subclones, assemble and finish • Sequencing: dideoxy termination • Whole genome shotgun: Celera • Massively Parallel Sequencing • 454, Solexa, SOLiD, Helicos… • Many opportunities and many challenges • Project proposal
Acknowledgement • Fritz Roth • Dannie Durand • Larry Hunter • Richard Davis • Wei Li • Jarek Meller • Stefan Bekiranov • Stuart M. Brown • Rob Mitra