
Introduction To Next Generation Sequencing (NGS) Data Analysis

Introduction To Next Generation Sequencing (NGS) Data Analysis. Jenny Wu UCI Genomics High Throughput Facility. Outline. Goals : Practical guide to NGS data processing Bioinformatics in NGS data analysis Basics: terminology, data file formats, general workflow Data Analysis Pipeline


Presentation Transcript


  1. Introduction To Next Generation Sequencing (NGS) Data Analysis Jenny Wu UCI Genomics High Throughput Facility

  2. Outline • Goals : Practical guide to NGS data processing • Bioinformatics in NGS data analysis • Basics: terminology, data file formats, general workflow • Data Analysis Pipeline • Sequence QC and preprocessing • Obtaining and preparing reference • Sequence mapping • Downstream analysis workflow and software • Example: RNA-Seq analysis with Tuxedo protocol • Summary and future plan

  3. Why Next Generation Sequencing • One can sequence hundreds of millions of short reads (35bp-120bp) in a single run, in a short period of time and at a low per-base cost. • Illumina/Solexa GA II / HiSeq 2000, 2500 • Life Technologies/Applied Biosystems SOLiD • Roche/454 FLX, Titanium Reviews: Metzker (2010) Nature Reviews Genetics 11:31; Quail et al. (2012) BMC Genomics 13:341.

  4. Why Bioinformatics Informatics (wall.hms.harvard.edu)

  5. Bioinformatics Challenges in NGS Data Analysis • VERY large text files (tens of millions of lines long) • Can’t do ‘business as usual’ with familiar tools • Impractical memory usage and execution time • Manage, analyze, store, transfer and archive huge files • Need for powerful computers and expertise • Informatics groups must manage compute clusters • New algorithms and software are required; they are often open source and Unix/Linux based. • Collaboration of IT, bioinformaticians and biologists

  6. Basic NGS Workflow

  7. NGS Data Analysis Overview Olson et al.

  8. Outline • Goals • Bioinformatics Challenges in NGS data analysis • Basics: terminology, data file formats, general workflow • Analysis Pipeline • Sequence QC and preprocessing • Obtaining and preparing reference • Sequence mapping • Downstream analysis workflow and software • RNA-Seq analysis with Tuxedo protocol • Summary and future plan

  9. Terminology • Coverage (depth): The number of nucleotides from reads that are mapped to a given position. • Quality Score: Each called base comes with a quality score that encodes the probability of a base call error. • Mapping: Aligning reads to a reference to identify their origin. • Assembly: Merging fragments of DNA in order to reconstruct the original sequence. • Duplicate reads: Reads that are identical. • Multi-reads: Reads that can be mapped to multiple locations equally well.
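
The quality score and coverage definitions above can be made concrete with two standard relations, given here as a sketch. The symbols Q, P, C, N, L and G are notation introduced for illustration and do not appear on the slides; the coverage formula is an average that assumes every read maps to the reference.

```latex
% Phred quality score Q for a base call with error probability P
Q = -10 \log_{10} P
\qquad\Longleftrightarrow\qquad
P = 10^{-Q/10}
% Example: Q = 30 corresponds to P = 10^{-3}, about one wrong call per 1000 bases.

% Expected average coverage C for N reads of length L on a genome of G bases
C = \frac{N \cdot L}{G}
```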

  10. What does the data look like? Common NGS Data Formats

  11. FASTA Format (Reference Seq)
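
Since the slide image is not part of this transcript, here is a minimal sketch of what a FASTA file looks like and how to summarise one. A record is a `>` header line followed by one or more sequence lines. The file contents, the sequence names and the helper `fasta_lengths` are hypothetical, made up for this illustration.

```shell
# Hypothetical two-record FASTA file, written only for this demonstration.
demo_fa="$(mktemp)"
cat > "$demo_fa" <<'EOF'
>chr_demo1 made-up example sequence
ACGTACGTAC
GTTA
>chr_demo2
GGCC
EOF

# Print each sequence name (first word of the '>' line) and its total length.
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len
            name = substr($1, 2); len = 0
            next
        }
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$1"
}

fasta_lengths "$demo_fa"
```

Sequence lines may wrap, so the length is summed over every line that follows a header until the next header appears.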

  12. FASTQ Format (reads)
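
A FASTQ record occupies four lines: an `@` header, the read bases, a `+` separator, and one quality character per base. The sketch below checks that layout with awk. The function name `fastq_validate` and the demo records are hypothetical, and this is only a structural sanity check, not a replacement for a QC tool.

```shell
# Hypothetical demo file: the second record has 5 bases but only 4 quality characters.
demo_fq="$(mktemp)"
cat > "$demo_fq" <<'EOF'
@read1
ACGT
+
IIII
@read2
ACGTN
+
IIII
EOF

# Count records and structural problems (bad header, bad separator,
# sequence/quality length mismatch, or a truncated final record).
fastq_validate() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad++ }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad++ }
        NR % 4 == 0 && length($0) != seqlen { bad++ }
        END {
            if (NR % 4 != 0) bad++
            printf "records=%d problems=%d\n", int(NR / 4), bad
        }
    ' "$1"
}

fastq_validate "$demo_fq"
```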

  13. FASTQ Format (Illumina Example)
      Each record has four lines: record header, read bases, separator (with optional repeated header), and read quality scores. Header fields labelled on the slide: flow cell ID, lane, tile, tile coordinates, read, barcode.
      @DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
      CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
      +
      BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
      @DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
      AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
      +
      @@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
      @DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
      CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
      +
      CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
      @DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
      GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
      +
      CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ
      NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads. (Passarelli, 2012)
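
The header fields and quality characters in the Illumina example can be decoded with a few lines of shell. This is a sketch under two assumptions: the header follows the colon-separated layout instrument:run:flowcell:lane:tile:x:y followed by read:filtered:control:barcode, and the qualities use the Phred+33 ASCII offset. The helper names `fastq_header_fields` and `phred33` are hypothetical.

```shell
# Split one Illumina-style header line into the fields labelled on the slide.
fastq_header_fields() {
    printf '%s\n' "$1" | awk '{
        split(substr($1, 2), loc, ":")   # instrument:run:flowcell:lane:tile:x:y
        split($2, tag, ":")              # read:filtered:control:barcode
        printf "flowcell=%s lane=%s tile=%s x=%s y=%s read=%s barcode=%s\n",
               loc[3], loc[4], loc[5], loc[6], loc[7], tag[1], tag[4]
    }'
}

# Convert a single quality character to its Phred score (assumes Phred+33).
phred33() {
    code=$(printf '%d' "'$1")   # ASCII code of the character
    echo $((code - 33))
}

fastq_header_fields '@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA'
phred33 B   # first quality character of the first record
phred33 J   # most common quality character in that record
```

Under the Phred+33 assumption, `B` decodes to 33 and `J` to 41, so most bases in the first record have an estimated error probability below one in a thousand.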

  14. Outline • Goals • Bioinformatics Challenges in NGS data analysis • Basics: terminology, data file formats, general workflow, • Analysis Pipeline • Sequence QC and preprocessing • Obtaining and preparing reference • Sequence mapping • Downstream analysis workflow and software • RNA-Seq analysis with Tuxedo protocol • Summary and future plan

  15. Data Analysis Pipeline (slide columns: Data, Task, File Format, Software)
      • Raw reads (FASTQ)
      • Read QC and preprocessing: FASTQC, FASTX-toolkit, PRINSEQ
      • Analysis-ready reads (FASTQ)
      • Collecting reference sequences and annotation (FASTA, GTF/GFF)
      • Read mapping: Bowtie, BWA, MAQ
      • Mapped reads (SAM/BAM)
      • Local realignment, base quality recalibration
      • Visualization (IGV, UCSC GB)
      • Whole Genome Sequencing: variant calling, annotation
      • RNA-Seq: transcript assembly, quantification
      • ChIP-Seq: peak calling
      • Methyl-Seq: methylation calling ……

  16. Why QC? Sequencing runs cost money • Consequences of not assessing the Data • Sequencing a poor library on multiple runs – throwing money away! Data analysis costs money and time • Cost of analyzing data, CPU time $$ • Cost of storing raw sequence data $$$ • Hours of analysis could be wasted $$$$ • Downstream analysis can be incorrect.

  17. How to QC? $ fastqc s_1_1.fastq • FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (available on HPC) • Tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y

  18. Outline • Goals • Bioinformatics Challenges in NGS data analysis • Basics: terminology, data file formats, general workflow, • Analysis Pipeline • Sequence QC and preprocessing • Obtaining and preparing reference • Sequence mapping • Downstream analysis workflow and software • RNA-Seq analysis with Tuxedo protocol • Summary and future plan

  19. The UCSC Genome Browser Homepage General information Get genome annotation here! Get reference sequences here! Specific information— new features, current status, etc.

  20. Getting reference sequences

  21. Getting Reference Annotation

  22. Outline • Goals • Bioinformatics Challenges in NGS data analysis • Basics: terminology, data file formats, general workflow, • Analysis Pipeline • Sequence QC and preprocessing • Obtaining and preparing reference • Sequence mapping • Downstream analysis workflow and software • RNA-Seq analysis with Tuxedo protocol • Summary and future plan

  23. Sequence Mapping Challenges • Alignment (mapping) is the first step once read sequences are obtained. • The task: to align sequencing reads against a known reference. • Difficulties: high volume of data, size of the reference genome, computation time, read length constraints, and ambiguity caused by repeats and sequencing errors.

  24. Short Read Alignment Olson et al.

  25. Short Read Alignment Software

  26. Short Reads Mapping Software

  27. How to choose an aligner? • There are many aligners and they vary a lot in performance (accuracy, memory usage, speed, etc). • Factors to consider : application, platform, read length, downstream analysis, etc. • Constant trade off between speed and sensitivity (e.g. MAQ vs. Bowtie) • Guaranteed high accuracy will take longer.

  28. Outline • Goals • Bioinformatics Challenges in NGS data analysis • Basics: terminology, data file formats, general workflow, • Analysis Pipeline • Sequence QC and preprocessing • Obtaining and preparing reference • Sequence mapping • Downstream analysis workflow and software • RNA-Seq analysis with Tuxedo protocol • Summary and future plan

  29. NGS Applications and Analysis Strategy (Hunicke-Smith et al, 2010)

  30. Application Specific Software (starting from mapped reads)
      • Whole Genome Sequencing, Exome Sequencing: Variant Calling (SNPs, InDels). Software: ssahaSNP, Samtools, PyroBayes
      • RNA-Seq: Transcriptome analysis (1. Transcriptome assembly; 2. Abundance quantification; 3. Differential expression and regulation). Software: Tophat, STAR, Cufflinks, edgeR
      • ChIP-Seq: Protein-DNA binding sites, Peak Identification. Software: MACS, AREM, PeakSeq
      • Methyl-Seq: Methylation pattern analysis, Methylation calling. Software: Bismark, BS Seeker
      ……

  31. Outline • Goals • Bioinformatics Challenges in NGS data analysis • Basics: terminology, data file formats, general workflow, • Analysis Pipeline • Sequence QC and preprocessing • Obtaining and preparing reference • Sequence mapping • Downstream analysis workflow and software • RNA-Seq analysis with Tuxedo protocol • Summary and future plan

  32. RNA-seq (Tuxedo Protocol) • 1. Read mapping (SAM/BAM) • 2. Transcript assembly and quantification (GTF/GFF) • 3. Merge assembled transcripts from multiple samples • 4. Differential expression analysis http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

  33. 1. Spliced Alignment: Tophat
      Tophat: a spliced short read aligner for RNA-seq.
      $ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
      $ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
      $ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
      $ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq

  34. 2. Transcript assembly and abundance quantification: Cufflinks
      • Cufflinks: a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.
      $ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
      $ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
      $ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
      $ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam

  35. 3. Final Transcriptome assembly: Cuffmerge
      $ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
      $ more assemblies.txt
      ./C1_R1_clout/transcripts.gtf
      ./C1_R2_clout/transcripts.gtf
      ./C2_R1_clout/transcripts.gtf
      ./C2_R2_clout/transcripts.gtf

  36. 4. Differential Expression: Cuffdiff
      • Cuffdiff: a program that compares transcript abundance between samples.
      $ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
          ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam \
          ./C2_R1_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
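
The four Tuxedo steps can be chained in one driver script. The sketch below reuses the commands and file names from slides 33 to 36. The `run` wrapper, the `DRY_RUN` switch and the function name `tuxedo_pipeline` are hypothetical additions for illustration. With `DRY_RUN=1` (the default here) the script only prints the commands, so it can be inspected without the tools installed.

```shell
# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

tuxedo_pipeline() {
    samples="C1_R1 C1_R2 C2_R1 C2_R2"

    # Steps 1 and 2: spliced alignment, then per-sample transcript assembly.
    for s in $samples; do
        run tophat -p 8 -G genes.gtf -o "${s}_thout" genome "${s}_1.fq" "${s}_2.fq"
        run cufflinks -p 8 -o "${s}_clout" "${s}_thout/accepted_hits.bam"
    done

    # Step 3: assemblies.txt lists one transcripts.gtf per sample for cuffmerge.
    # The file is only written when the commands are actually executed.
    if [ "${DRY_RUN:-1}" != "1" ]; then
        for s in $samples; do
            echo "./${s}_clout/transcripts.gtf"
        done > assemblies.txt
    fi
    run cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt

    # Step 4: differential expression between conditions C1 and C2.
    run cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
        ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam \
        ./C2_R1_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
}

tuxedo_pipeline
```

The slides run all four tophat commands before any cufflinks command. The loop interleaves them per sample, which gives the same inputs to cuffmerge and cuffdiff. Setting `DRY_RUN=0` executes the commands and assumes tophat, cufflinks, cuffmerge and cuffdiff are on the `PATH`.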

  37. Integrative Genomics Viewer (IGV) http://www.broadinstitute.org/igv

  38. Visualizing RNA-seq mapping with IGV • Specify a range or term in the search box • Click on the ruler • Click and drag • Use the scroll bar • Use the keyboard: Arrow keys, Page Up, Page Down, Home, End http://www.broadinstitute.org/igv/UserGuide Nielsen, C.B., et al. Visualizing genomes: techniques and challenges. Nature Methods 7:S5-S15 (2010)

  39. Summary • NGS technologies are transforming molecular biology. • Bioinformatics analysis is a crucial part of NGS applications. • Data formats, terminology, general workflow • Analysis pipeline • Software for various NGS applications • RNA-seq with Tuxedo suite Thank you!
