Next Generation Sequencing

Presentation Transcript

  1. Next Generation Sequencing

  2. Sequencing techniques • ChIP-seq • MBD-seq (MIRA-seq) • BS-seq • RNA-seq • miRNA-seq

  3. ChIP-seq • ChIP-Seq is a new frontier technology for analyzing in vivo protein-DNA interactions. • ChIP-Seq • Combines chromatin immunoprecipitation (ChIP) with ultra-high-throughput massively parallel sequencing • Allows mapping of protein–DNA interactions in vivo on a genome scale

  4. Workflow of ChIP-Seq (Mardis, E.R. Nat. Methods 4, 613-614 (2007))

  5. The advantages of ChIP-seq • Current microarray and ChIP-chip designs require knowing the sequence of interest in advance, such as a promoter, enhancer, or RNA-coding domain. • Lower cost • Higher resolution • Higher accuracy • Alterations in transcription-factor binding in response to environmental stimuli can be evaluated for the entire genome in a single experiment.

  6. Sequencers • Solexa (Illumina) • 1 GB of sequences in a single run • 35 bases in length • 454 Life Sciences (Roche Diagnostics) • 25-50 MB of sequences in a single run • Up to 500 bases in length • SOLiD (Applied Biosystems) • 6 GB of sequences in a single run • 35 bases in length

  7. Illumina Genome Analysis System • 8 lanes • 100 tiles per lane

  8. Sequencing

  9. Sequencer Output • Sequence Files • Quality Scores

  10. Sequence Files • 10-40 million reads per lane • ~500 MB files

  11. Quality Score Files • Quality scores describe the confidence of bases in each read • Solexa pipeline assigns a quality score to the four possible nucleotides for each sequenced base • 9 million sequences (500 MB file) → ~6.5 GB quality score file
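
To make "confidence of bases" concrete, here is a minimal sketch of how a quality score relates to an error probability. It assumes the standard Phred definition, Q = -10 log10(p), and the odds-based variant used by early Solexa pipelines, Q = -10 log10(p / (1 - p)). The function names are illustrative and are not taken from any pipeline.

```python
def phred_to_error_prob(q):
    """Error probability for a Phred-scaled score: Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def solexa_to_error_prob(q):
    """Error probability for an odds-scaled (early Solexa) score:
    Q = -10 * log10(p / (1 - p)), so p = odds / (1 + odds)."""
    odds = 10 ** (-q / 10)
    return odds / (1 + odds)


if __name__ == "__main__":
    for q in (0, 10, 20, 30):
        print(q, phred_to_error_prob(q), solexa_to_error_prob(q))
```

The two scales nearly agree for high scores and diverge for low ones: an odds-scaled score of 0 corresponds to a 50% chance of error, while a Phred score of 0 corresponds to certainty of error.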

  12. Bioinformatics Challenges • Rapid mapping of these short sequence reads to the reference genome • Visualize mapping results • Thousands of enriched regions • Peak analysis • Peak detection • Finding exact binding sites • Compare results of different experiments • Normalization • Statistical tests

  13. Mapping of Short Oligonucleotides to the Reference Genome • Mapping Methods • Need to allow mismatches and gaps • SNP locations • Sequencing errors • Reading errors • Indexing and hashing • genome • oligonucleotide reads • Use of quality scores • Use of SNP knowledge • Performance • Partitioning the genome or sequence reads

  14. Mapping Methods: Indexing the Genome • Fast sequence similarity search algorithms (like BLAST) • Not specifically designed for mapping millions of query sequences • Take a very long time • e.g. 2 days to map half a million sequences to a 70 MB reference genome (using BLAST) • Indexing the genome is memory expensive

  15. SOAP (Li et al, 2008) • Both reads and reference genome are converted to a numeric data type using 2-bits-per-base coding • Load reference genome into memory • For human genome, 14 GB RAM required for storing reference sequences and index tables • 300 (gapped) to 1200 (ungapped) times faster than BLAST • Allows 2 mismatches or a 1-3 bp continuous gap • Errors accumulate during the sequencing process • Much higher number of sequencing errors at the 3’-end (sometimes making the reads unalignable to the reference genome) • Iteratively trim several base pairs at the 3’-end and redo the alignment • Improves sensitivity
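
The 2-bits-per-base coding mentioned above can be sketched as follows. This is an illustration of the general idea, not SOAP's actual implementation: the specific code assignment (A=0, C=1, G=2, T=3) and all function names are assumptions.

```python
# Assumed code assignment; any fixed mapping of the four bases to 0..3 works.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode_2bit(seq):
    """Pack a DNA string into one integer, 2 bits per base (first base highest)."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def decode_2bit(value, length):
    """Unpack an integer produced by encode_2bit back into a DNA string."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))


def count_mismatches(a, b, length):
    """Count differing bases between two packed sequences of equal length."""
    diff = a ^ b                           # non-zero 2-bit pair = mismatching base
    low_bits = (4 ** length - 1) // 3      # mask 0b0101...01, one bit per base
    collapsed = (diff | (diff >> 1)) & low_bits
    return bin(collapsed).count("1")


if __name__ == "__main__":
    x = encode_2bit("ACGT")
    y = encode_2bit("ACGA")
    print(x, decode_2bit(x, 4), count_mismatches(x, y, 4))
```

Packing bases this way lets a comparison of many positions be done with a single XOR and a bit count, which is the kind of bit operation that makes mismatch counting cheap.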

  16. Mapping Methods: Indexing the Oligonucleotide Reads • ELAND (Cox, unpublished) • “Efficient Large-Scale Alignment of Nucleotide Databases” (Solexa Ltd.) • SeqMap (Jiang, 2008) • “Mapping massive amount of oligonucleotides to the genome” • RMAP (Smith, 2008) • “Using quality scores and longer reads improves accuracy of Solexa read mapping” • MAQ (Li, 2008) • “Mapping short DNA sequencing reads and calling variants using mapping quality scores”

  17. Mapping Algorithm (2 mismatches) • Partition reads into 4 seeds {A,B,C,D} • At least 2 seeds must map with no mismatches • Scan genome to identify locations where the seeds match exactly • 6 possible combinations of the seeds to search • {AB, CD, AC, BD, AD, BC} • 6 scans to find all candidates • Do approximate matching around the exactly-matching seeds. • Determine all targets for the reads • Ins/del can be incorporated • The reads are indexed and hashed before scanning genome • Bit operations are used to accelerate mapping • Each nt encoded into 2 bits
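
The seed strategy above rests on the pigeonhole principle: 2 mismatches can touch at most 2 of the 4 seeds, so at least one of the 6 seed pairs matches exactly. The sketch below illustrates this under simplifying assumptions: plain strings instead of 2-bit codes, substitutions only (no ins/del), forward strand only, and a single genome pass with six hash lookups per position instead of six separate scans. All names are hypothetical and this is not the code of ELAND, SeqMap, or any other tool.

```python
from collections import defaultdict
from itertools import combinations


def seed_bounds(read_len):
    """Boundaries that split a read of this length into 4 contiguous seeds."""
    return [i * read_len // 4 for i in range(5)]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(genome, reads, max_mismatches=2):
    """Return {read index: sorted genome positions} with <= max_mismatches.

    Assumes a non-empty list of reads that all have the same length.
    """
    read_len = len(reads[0])
    assert all(len(r) == read_len for r in reads)
    bounds = seed_bounds(read_len)

    # Index the reads: one hash entry per read per seed pair (AB, AC, ..., CD).
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        seeds = [read[bounds[k]:bounds[k + 1]] for k in range(4)]
        for i, j in combinations(range(4), 2):
            index[(i, j, seeds[i], seeds[j])].append(rid)

    # Scan the genome; an exact seed-pair hit yields a candidate to verify.
    hits = defaultdict(set)
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        parts = [window[bounds[k]:bounds[k + 1]] for k in range(4)]
        for i, j in combinations(range(4), 2):
            for rid in index.get((i, j, parts[i], parts[j]), ()):
                if hamming(window, reads[rid]) <= max_mismatches:
                    hits[rid].add(pos)
    return {rid: sorted(positions) for rid, positions in hits.items()}


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
    print(map_reads(genome, [genome[5:21]]))
```

Indexing the reads rather than the genome keeps memory proportional to the number of reads, which matches the slide's note that reads are hashed before the genome is scanned.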

  18. ELAND (Cox, unpublished) • Commercial sequence mapping program that comes with the Solexa machine • Allows at most 2 mismatches • Maps sequences up to 32 nt in length • All sequences have to be the same length

  19. RMAP (Smith et al, 2008) • Improve mapping accuracy • Possible sequencing errors at 3’-ends of longer reads • Base-call quality scores • Use of base-call quality scores • Quality cutoff • High quality positions are checked for mismatches • Low quality positions always induce a match • Quality control step eliminates reads with too many low quality positions • Allow any number of mismatches
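
The quality-cutoff rule above can be sketched as follows. This is a simplified illustration of the idea as stated on the slide, not RMAP's code; the cutoff of 10, the limit of 4 low-quality positions, and the function names are all assumed values chosen for the example.

```python
def quality_aware_mismatches(read, quals, target, q_cutoff=10):
    """Count mismatches at high-quality positions only.

    Positions with quality below q_cutoff are treated as matches,
    whatever base was called there.
    """
    return sum(
        1
        for base, q, ref in zip(read, quals, target)
        if q >= q_cutoff and base != ref
    )


def passes_quality_control(quals, q_cutoff=10, max_low_quality=4):
    """Reject reads that have too many low-quality positions."""
    low = sum(1 for q in quals if q < q_cutoff)
    return low <= max_low_quality


if __name__ == "__main__":
    read, target = "ACGT", "ACTT"
    print(quality_aware_mismatches(read, [30, 30, 5, 30], target))   # low-quality G ignored
    print(quality_aware_mismatches(read, [30, 30, 30, 30], target))  # G vs T counted
```

Because low-quality positions act as wildcards, the quality control step matters: without it, a read made mostly of low-quality bases would match almost anywhere.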

  20. [Flow chart: 12 M reads → quality filter → map to reference genome. Outcome categories: mapped to a unique location, mapped to multiple locations, no mapping, low quality. Counts shown: 7.2 M, 1.8 M, 2.5 M, 0.5 M, 3 M]

  21. Visualization • BED files are built to summarize mapping results • BED files can be easily visualized in the UCSC Genome Browser http://genome.ucsc.edu
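
As a minimal sketch of how mapping results might be written as BED records: BED is tab-separated, with a 0-based start and an exclusive end, and the first six columns are chrom, start, end, name, score, and strand. The function name, the read names, and the example coordinates below are hypothetical.

```python
def to_bed_line(chrom, start, read_len, name, strand, score=0):
    """Format one mapped read as a 6-column BED line.

    start is 0-based; the end coordinate is exclusive, so end - start
    equals the read length.
    """
    end = start + read_len
    return "\t".join([chrom, str(start), str(end), name, str(score), strand])


if __name__ == "__main__":
    # Hypothetical mapped reads: (chromosome, 0-based start, length, name, strand)
    mapped = [
        ("chr1", 100, 35, "read1", "+"),
        ("chr1", 250, 35, "read2", "-"),
    ]
    for record in mapped:
        print(to_bed_line(*record))
```

A file of such lines can be uploaded to the Genome Browser as a custom track.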

  22. Visualization: Genome Browser Robertson, G. et al. Nat. Methods 4, 651-657 (2007)

  23. Visualization: Custom 300 kb region from mouse ES cells Mikkelsen, T.S. et al. Nature 448, 553-562 (2007)

  24. Screenshot of ZNF263 peaks (Frietze et al., JBC 2010)

  25. ChIP-seq peak analysis programs • SISSRs (Site Identification from Short Sequence Reads): Jothi et al. NAR, 2008. • MACS (Model-based Analysis of ChIP-Seq): Zhang et al, Genome Biology, 2008. • QuEST (Genome-wide analysis of transcription factor binding sites based on ChIP–seq data): Valouev, A. et al. Nature Methods, 2008. • PeakSeq (PeakSeq enables systematic scoring of ChIP–seq experiments relative to controls): Rozowsky, J. et al. Nature Biotech. 2009. • FindPeaks (FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology): Fejes, A.P. et al. Bioinformatics, 2008. • Hpeak (An HMM-based algorithm for defining read-enriched regions from massive parallel sequencing data): Xu et al, Bioinformatics, 2008.

  26. MBD-seq (MIRA-seq) • The methyl-CpG binding domain-based capture (MBDCap) technology is used to capture methylated sites. Double-stranded methylated DNA fragments can be detected. It is sensitive to different methylation densities • Genome-wide sequencing technology is used to get the sequence of each short fragment. • The sequenced reads are mapped to the human genome to find their locations.

  27. Application to MBD-seq data (MCF7) (Lan et al., unpublished)

  28. BS-seq • BS-seq: genomic DNA is treated with sodium bisulphite (BS) to convert cytosine, but not methylcytosine, to uracil, followed by high-throughput sequencing. • Truly single-base resolution
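
The effect of bisulphite treatment on a sequenced read can be simulated in a few lines. This sketch assumes a single strand and relies on the fact that uracil is read as T after amplification and sequencing; the function name and the example positions are hypothetical.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulphite conversion of one strand as seen by the sequencer.

    Unmethylated C is converted to U and read as T; methylated C
    (positions listed in methylated_positions, 0-based) stays C.
    """
    converted = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


if __name__ == "__main__":
    # Hypothetical example: only the C at index 1 is methylated.
    print(bisulfite_convert("ACGCT", {1}))
```

Comparing the converted read with the reference shows where the single-base resolution comes from: a C that survives marks a methylated cytosine, while a C that reads as T marks an unmethylated one.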

  29. RNA-seq • RNA-Seq is a new approach to transcriptome profiling that uses deep-sequencing technologies. • Studies using this method have already altered our view of the extent and complexity of eukaryotic transcriptomes. RNA-Seq also provides a far more precise measurement of levels of transcripts and their isoforms than other methods.

  30. RNA-seq protocol

  31. The advantages of RNA-seq • Single base resolution • High throughput • Low background noise • Ability to distinguish different isoforms and allelic expression • Relatively low cost
