
Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis

Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis. Han Liang, Ph.D. Department of Bioinformatics and Computational Biology 3/25/2014 @ Rice University. Outline. History NGS Platforms Applications Bioinformatics Analysis Challenges. Central Dogma.


Presentation Transcript


  1. Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis Han Liang, Ph.D. Department of Bioinformatics and Computational Biology 3/25/2014 @ Rice University

  2. Outline • History • NGS Platforms • Applications • Bioinformatics Analysis • Challenges

  3. Central Dogma

  4. Sanger sequencing • DNA is fragmented • Cloned to a plasmid vector • Cyclic sequencing reaction • Separation by electrophoresis • Readout with fluorescent tags

  5. Sanger vs NGS • Sanger sequencing has been the only DNA sequencing method for 30 years, but demand has grown for higher throughput and more economical sequencing technology • NGS can process millions of sequence reads in parallel rather than 96 at a time (at 1/6 of the cost) • Objections: fidelity, read length, infrastructure cost, and the need to handle large volumes of data

  6. Platforms • Roche/454 FLX: 2004 • Illumina Solexa Genome Analyzer: 2006 • Applied Biosystems SOLiD™ System: 2007 • Helicos HeliScope™: recently available • Pacific Biosciences SMRT: launching 2010

  7. Rapidly Falling Cost

  8. Three Leading Sequencing Platforms • Roche 454 • Illumina Solexa • Applied Biosystems SOLiD

  9. The general experimental procedure (Wang et al., Nature Reviews Genetics 2009)

  10. 454: bead microreactor (Mardis, Annu. Rev. Genomics Hum. Genet. 2008)

  11. Illumina (Solexa): bridge amplification (Mardis, Annu. Rev. Genomics Hum. Genet. 2008)

  12. SOLiD: color coding (Mardis, Annu. Rev. Genomics Hum. Genet. 2008)

  13. Comparison of existing methods

  14. Real Data – nucleotide space • Solexa
      @SRR002051.1 :8:1:325:773 length=33
      AAAGAACATTAAAGCTATATTATAAGCAAAGAT
      +SRR002051.1 :8:1:325:773 length=33
      IIIIIIIIIIIIIIIIIIIIIIIII'II@I$)-
      @SRR002051.2 :8:1:409:432 length=33
      AAGTTATGAAATTGTAATTCCAATATCGTAAGC
      +SRR002051.2 :8:1:409:432 length=33
      IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII07
      @SRR002051.3 :8:1:488:490 length=33
      AATTTCTTACCATATTAGACAAGGCACTATCTT
      +SRR002051.3 :8:1:488:490 length=33
      IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII&I
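The Solexa reads on this slide are FASTQ records: a header line starting with "@", the base calls, a "+" separator line, and one quality character per base. The sketch below shows one way to read such records and turn the quality characters into Phred scores. The names parse_fastq and phred_scores are illustrative, and the ASCII offset of 33 is an assumption (the SRA-style encoding); older Solexa pipelines used a different offset.

```python
from io import StringIO


def parse_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert quality characters to integer Phred scores (offset 33 is assumed)."""
    return [ord(ch) - offset for ch in qual]


# First record from the slide, used as sample input.
record = """@SRR002051.1 :8:1:325:773 length=33
AAAGAACATTAAAGCTATATTATAAGCAAAGAT
+SRR002051.1 :8:1:325:773 length=33
IIIIIIIIIIIIIIIIIIIIIIIII'II@I$)-
"""

for read_id, seq, qual in parse_fastq(StringIO(record)):
    scores = phred_scores(qual)
    print(read_id, len(seq), min(scores), max(scores))
```

Under the offset-33 assumption, "I" corresponds to a score of 40 and the low-quality tail of the read ("$", ")", "-") to scores of 3, 8, and 12, which is why read ends are often trimmed before mapping.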

  15. Real Data – color space • SOLiD Data
      >1_24_47_F3
      T1.1.23..0120230.320033300030030010022.00.0201.0201
      >1_24_52_F3
      T2.3.21..2122321.213110332101132321002.11.0111.1222
      >1_24_836_F3
      T0.2.22..2222222.010203032021102220200.01.2211.2211
      >1_24_1404_F3
      T2.3.30..2013222.222103131323012313233.22.2220.0213
      >1_25_202_F3
      T0.3213.111202312203021101111330201000313.121122211
      >1_25_296_F3
      T0.1130.100123202213120023121112113212121.013301210
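Each SOLiD read above starts with a known primer base (T) followed by color calls 0 to 3, where every color encodes the transition between two adjacent bases and "." marks a missing call. A minimal sketch of decoding into nucleotide space follows, relying on the property that with A=0, C=1, G=2, T=3 the color equals the XOR of the two base codes. The function name decode_colorspace is illustrative, and filling everything after a missing call with N is a simplifying assumption; real pipelines usually align in color space rather than decode first, because a single color error corrupts all downstream bases.

```python
BASES = "ACGT"  # index of each base is its 2-bit code: A=0, C=1, G=2, T=3


def decode_colorspace(read):
    """Decode a color-space read (primer base + color digits) into bases.

    The primer base itself is not part of the output. After a missing
    call ('.') the previous base is unknown, so the rest becomes 'N'.
    """
    prev = read[0]
    decoded = []
    for color in read[1:]:
        if prev == "N" or color == ".":
            prev = "N"
        else:
            # color = code(prev) XOR code(next), so next = prev XOR color
            prev = BASES[BASES.index(prev) ^ int(color)]
        decoded.append(prev)
    return "".join(decoded)


print(decode_colorspace("T0123"))   # TGAT
print(decode_colorspace("T01.23"))  # TGNNN
```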

  16. Data output difference among the three platforms • Nucleotide space vs. color space • Length of short reads: 454 (400~500 bp) > SOLiD (70 bp) ~ Solexa (36~120 bp)

  17. Applications with “Digital output” • De novo genome assembly • Genome re-sequencing • RNA-Seq (gene expression, exon-intron structure, small RNA profiling, and mutation) • ChIP-Seq (protein-DNA interaction) • Epigenetic profiling

  18. Ancient Genomes Resurrected • Degraded state of the sample → mtDNA sequencing • Nuclear genomes of ancient remains: cave bear, mammoth, Neanderthal (10^6 bp) • Problems: contamination with modern human DNA and co-isolation of bacterial DNA

  19. Elucidating DNA-protein interactions through chromatin immunoprecipitation sequencing • DNA-protein interactions play a key part in regulating gene expression • ChIP: technique to study DNA-protein interactions • Recently, genome-wide ChIP-based studies of DNA-protein interactions • Readout of ChIP-derived DNA sequences on NGS platforms • Insights into transcription factor/histone binding sites in the human genome • Enhances our understanding of gene expression in the context of specific environmental stimuli

  20. Discovering noncoding RNAs • The presence of ncRNAs in a genome is difficult to predict computationally with high certainty because of their evolutionary diversity • Detecting expression level changes that correlate with changes in environmental factors, with disease onset and progression, or with complex disease onset or severity • Enhances the annotation of sequenced genomes (making the impact of mutations more interpretable)

  21. Metagenomics • Characterizing the biodiversity found on Earth • The growing number of sequenced genomes enables us to interpret partial sequences obtained by direct sampling of specific environmental niches • Examples: ocean, acid mine site, soil, coral reefs, and the human microbiome, which may vary according to the health status of the individual

  22. Defining variability in many human genomes • Common variants have not yet completely explained complex disease genetics → rare alleles also contribute • Also structural variants, large and small insertions and deletions • Accelerating biomedical research

  23. Epigenomic variation • Enables the study of genome-wide patterns of methylation and how these patterns change through the course of an organism’s development • Enhanced potential to combine the results of different experiments, for example correlative analyses of genome-wide methylation, histone binding patterns, and gene expression

  24. Integrating Omics • Mutation discovery • Protein-DNA interaction • Copy number variation • mRNA expression • microRNA expression • Alternative splicing (Kahvejian et al. 2008)

  25. Data Analysis Flow • SOLiD machine: raw data → Central server: basic processing (decoding, filtering, and mapping) → Local machine: downstream analysis

  26. Short Read Mapping • DNA-Resequencing BLAST-like approach • RNA-Seq

  27. Read length and pairing TCGTACCGATATGCTG ACTTAAGGCTGACTAGC • Short reads are problematic, because short sequences do not map uniquely to the genome. • Solution #1: Get longer reads. • Solution #2: Get paired reads.
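The ambiguity described on this slide can be illustrated with a toy example: a short read that matches the reference in two places becomes unique once it is longer, or once its mate constrains where it can sit. This is only a sketch under simplifying assumptions. The reference string, the reads, and the helpers find_all and paired_hits are made up for illustration, matching is exact, and both mates are taken from the same strand, whereas real paired-end reads come from opposite strands and real aligners tolerate mismatches.

```python
def find_all(reference, read):
    """Return every 0-based position where read matches reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)  # allow overlapping hits
    return positions


def paired_hits(reference, read1, read2, distance, tolerance=1):
    """Keep only placements where read2 starts about `distance` bases after read1."""
    return [
        (p1, p2)
        for p1 in find_all(reference, read1)
        for p2 in find_all(reference, read2)
        if abs((p2 - p1) - distance) <= tolerance
    ]


reference = "TTACGTGGCCACGTAA"  # hypothetical toy genome

print(find_all(reference, "ACGT"))                         # [2, 10]  ambiguous
print(find_all(reference, "ACGTGG"))                       # [2]      longer read
print(paired_hits(reference, "ACGT", "GGCC", distance=4))  # [(2, 6)] paired read
```

Both solutions on the slide add information that the short read alone lacks: extra bases in the first case, and an expected distance between mates in the second.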

  28. Post-alignment Analysis • DNA-Seq: SNP calling • RNA-Seq: quantifying gene expression level

  29. Concepts • The reference genome: hg19 (GRCh37) • Main assembly: Chr1-22, X, and Y; 3,095,677,412 bp • Target region: exome • Ensembl: 85.3 million bp (2.94%) • RefSeq: 67.7 million bp (2.34%) • CCDS: 31,266,049 bp (1.08%), consisting of 185,446 non-redundant (nr) exons

  30. Target Coverage

  31. SOLiD: color coding (Mardis, Annu. Rev. Genomics Hum. Genet. 2008)

  32. SNP calling

  33. Array-based High-throughput Dataset

  34. Limitations of hybridization-based approach • Reliance on existing knowledge about genome sequence • Background noise and a limited dynamic range of detection • Cross-experiment comparison is difficult • Requires complicated normalization methods (Wang et al., Nature Reviews Genetics 2009)

  35. Quantifying gene expression using RNA-Seq data • RPKM: Reads Per Kilobase of exon length per Million mapped reads
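RPKM normalizes a gene's read count by both the exon length of the gene (in kilobases) and the sequencing depth (in millions of mapped reads), so that values are comparable across genes and samples. A minimal sketch of the calculation follows; the function name rpkm and the example numbers are illustrative and not taken from the slides.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon length per Million mapped reads.

    RPKM = read_count / (exon_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = read_count * 1e9 / (exon_length_bp * total_mapped_reads)
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on 2 kb of exon in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```

Doubling either the exon length or the library size while keeping the read count fixed halves the RPKM, which is the intended behavior of the normalization.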

  36. Large Dynamic Range (Mortazavi et al., Nature Methods 2008)

  37. High Reproducibility (Mortazavi et al., Nature Methods 2008)

  38. High Accuracy (Wang et al., Nature 2008)

  39. Advantages of RNA-Seq • Not limited to the existing genomic sequence • Very low (if any) background signal • Large dynamic range of detection • High reproducibility • High accuracy • Requires less sample • Low cost per base (Wang et al., Nature Reviews Genetics 2009)

  40. Huge amount of data! • For a typical RNA-Seq SOLiD run: ~2 TB of image files, ~120 GB of text files for downstream analysis, ~75 M short reads per sample → Efficient methods for data storage and management

  41. Considerable sequencing error → High-quality image analysis for base calling

  42. Genome alignment and assembly: time consuming and memory demanding • Genome mapping for SOLiD data: a 32-Opteron HP DL785 with 128 GB of RAM takes 12~14 hours per sample → High-performance parallel computing

  43. Bioinformatics Challenges • Efficient methods to store, retrieve, and process huge amounts of data • Reducing errors in image analysis and base calling • Fast and accurate methods for genome alignment and assembly • New algorithms for downstream analyses

  44. Experimental Challenges • Library fragmentation • Strand specificity (Wang et al., Nature Reviews Genetics 2009)
