
last time


aulani

Presentation Transcript


  1. last time
      • pbx1 assignment: find the location of the probes in another one of the probesets for zebrafish
      • Read the limma documentation
      • Run limma on your data set
      • Be sure you have your Galaxy account set up

  2. pbx1 UCSC Genome Browser on Zebrafish Jul. 2010 (Zv9/danRer7) Assembly chr2:19,708,833-19,758,832

  3. limma

  4. From gene list to interpretation
      • limma will generate a list of probeset ids for differentially expressed genes
      • What next?
          • Convert the probeset ids to gene symbols
          • Look for enrichment of functional terms associated with the genes in your list
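The enrichment step can be illustrated with a plain one-sided hypergeometric test. This is only a sketch of the general idea: the function name and the toy numbers are hypothetical, and DAVID's own statistic may differ from this calculation.

```python
from math import comb


def enrichment_p_value(list_hits, list_size, background_hits, background_size):
    """Probability of seeing at least `list_hits` annotated genes in a list of
    `list_size` genes drawn at random from a background of `background_size`
    genes, of which `background_hits` carry the functional term."""
    max_hits = min(list_size, background_hits)
    total = comb(background_size, list_size)
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 8 of 50 genes in the list carry a term that
# annotates 200 of 15,000 background genes.
p = enrichment_p_value(8, 50, 200, 15000)
print(f"enrichment p-value: {p:.3g}")
```

A small p-value suggests the term appears in the gene list more often than expected by chance; with many terms tested, a multiple-testing correction would also be needed.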

  5. http://david.abcc.ncifcrf.gov/

  6. RNA Seq
      • Use of next-generation sequencing technology (NGS) to measure RNA levels
      • RNA Seq advantages:
          • Wider dynamic range compared to microarray technology
          • Not dependent on known genome annotations
          • Higher throughput compared to microarray technology
      • RNA Seq challenges:
          • Specificity versus completeness of alignments, especially for short sequence reads
          • Manipulation and analysis of large files
          • Data storage costs

  7. RNA Seq Library Prep http://www.geospiza.com/finchtalk/uploaded_images/rna-seq-steps-786705.png

  8. Sequencing Technologies http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png

  9. Sequence “Space”
      • Roche 454 – Flow space
          • Measures the pyrophosphate released when a nucleotide is added to a growing DNA chain
          • Flow space describes sequence in terms of these base incorporations
          • http://www.youtube.com/watch?v=bFNjxKHP8Jc
      • AB SOLiD – Color space
          • Sequencing by DNA ligation via synthetic DNA molecules that contain two nested known bases with a fluorescent dye
          • Each base is sequenced twice
          • http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
      • Illumina/Solexa – Base space
          • Single-base extensions of fluorescent-labeled nucleotides with protected 3' OH groups
          • Sequencing via cycles of base addition/detection followed by deprotection of the 3' OH
          • http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related
      • GenomeTV – Next Generation Sequencing (lecture)
          • http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related
      http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html

  10. Further Reading
      • Metzker, ML. (2010) Sequencing technologies – the next generation. Nature Reviews Genetics 11:31-46.

  11. Short Read Archive
      http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?
      Short Read Archive Handbook
      http://www.ncbi.nlm.nih.gov/books/NBK47528/

  12. Aspera Connect
      http://www.asperasoft.com/en/products/client_software_2/aspera_connect_8
      • High-performance file transfer for getting data from the Short Read Archive

  13. SRA Toolkit http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

  14. RNA Seq Workflow
      • RNA Seq
          • FASTQ file format
      • Alignment
          • SAM file format
      • Annotation
          • GTF, BED file format
      • Alignment Counts
          • RPKM
      • Statistical analysis
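RPKM (reads per kilobase of transcript per million mapped reads) normalizes a raw alignment count by feature length and sequencing depth. A minimal sketch, with a hypothetical function name and toy numbers:

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    # Dividing by (length / 1e3) and (total / 1e6) is the same as
    # multiplying the count by 1e9 before dividing by both.
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Toy numbers (hypothetical): 500 reads on a 2,000 bp transcript
# in a library with 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

Because both length and depth are divided out, values can be compared between genes of different lengths and between libraries of different sizes, within the limits of this simple normalization.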

  15. FASTQ: Data Format
      • FASTQ
          • Text based
          • Encodes sequence calls and quality scores with ASCII characters
          • Stores minimal information about the sequence read
      • 4 lines per sequence
          • Line 1: begins with “@”, followed by a sequence identifier and an optional description
          • Line 2: the sequence
          • Line 3: begins with “+”, optionally followed by the sequence identifier and description
          • Line 4: encoding of quality scores for the sequence in line 2
      • References/Documentation
          • http://maq.sourceforge.net/fastq.shtml
          • Cock et al. (2009). Nuc Acids Res 38:1767-1771.
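The four-line layout described above can be read with a few lines of Python. This is a sketch that assumes unwrapped records (exactly four lines each); the record type, function name, and sample reads are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading "@"
    sequence: str    # line 2
    quality: str     # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record: {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' on line 3 of record: {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# Two made-up reads, held in memory instead of a file.
sample = "@read1 sample\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+read2\n!!!!\n"
records = list(read_fastq(io.StringIO(sample)))
for record in records:
    print(record.identifier, len(record.sequence))
```

Reading by position rather than by searching for “@” matters because “@” is also a valid quality character and can appear at the start of line 4.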

  16. FASTQ Example
      For analysis, it may be necessary to convert to the Sanger form of FASTQ. For example, Illumina quality scores range from 0 to 62, whereas Sanger quality scores range from 0 to 93. Solexa quality scores must be converted to PHRED quality scores.
      • FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.
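A sketch of the two conversions mentioned here, assuming the conventions described by Cock et al.: Sanger FASTQ stores PHRED scores with an ASCII offset of 33, Illumina 1.3+ stores PHRED scores with an offset of 64, and Solexa scores relate to PHRED scores by Q_phred = 10·log10(10^(Q_solexa/10) + 1). The function names are hypothetical.

```python
import math

SANGER_OFFSET = 33    # assumed ASCII offset for Sanger FASTQ
ILLUMINA_OFFSET = 64  # assumed ASCII offset for Illumina 1.3+ FASTQ


def illumina13_to_sanger(quality: str) -> str:
    """Re-encode an Illumina 1.3+ quality string as a Sanger quality string."""
    converted = []
    for char in quality:
        score = ord(char) - ILLUMINA_OFFSET
        if score < 0:
            raise ValueError(f"character {char!r} is below the Illumina 1.3+ range")
        converted.append(chr(score + SANGER_OFFSET))
    return "".join(converted)


def solexa_to_phred(solexa_score: int) -> int:
    """Convert a Solexa quality score to the nearest PHRED quality score."""
    return round(10 * math.log10(10 ** (solexa_score / 10) + 1))


print(illumina13_to_sanger("hhh@"))
print([solexa_to_phred(q) for q in (-5, 0, 10, 40)])
```

The two scales agree closely at high quality and diverge only for low scores, which is why the Solexa conversion matters mainly for poor-quality bases.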

  17. Example Data Data deposited in GEO with accession id GSE20846

  18. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20846

  19. http://www.ncbi.nlm.nih.gov/sra?term=SRP002119

  20. SRP002119 (study/project)
      SRX017794 (experiment)
      SRS025246 (source)
      SRR037945 (run)
      SRR037946 (run)

  21. SRA to FASTQ
      • NCBI’s SRA Toolkit contains utilities to convert SRA format to FASTQ
          • fastq-dump
      • If the utilities and the SRA-formatted file are in the same directory, the command line is:
          fastq-dump <name of sra formatted file>
      NOTE: Downloading and working with next-generation sequence data will very quickly exceed the capacity of a typical desktop or laptop computer. You will need appropriate infrastructure in place to work with these files, or consider scalable Cloud storage and compute services.

  22. TopHat
      http://tophat.cbcb.umd.edu/
      TopHat is better suited to aligning RNA Seq data than general-purpose aligners such as Maq or BWA because it takes splicing into account during the alignment process.
      Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515. Trapnell et al. (2009). Bioinformatics 25:1105-1111.

  23. TopHat is built on the Bowtie alignment algorithm. Trapnell C et al. Bioinformatics 2009;25:1105-1111

  24. SAM (Sequence Alignment/Map)
      • It may not be necessary to align reads from scratch; you can instead use existing alignments in SAM format
      • SAM is the output of aligners that map reads to a reference genome
      • Tab-delimited, with a header section and an alignment section
      • Header lines begin with @ (the header section is optional)
      • The alignment section has 11 mandatory fields
      • BAM is the binary format of SAM
      http://samtools.sourceforge.net/

  25. Mandatory Alignment Fields http://samtools.sourceforge.net/SAM1.pdf
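As a sketch of how the 11 mandatory fields can be read, the snippet below splits one tab-delimited alignment line into named fields. The field names follow the current SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL); the class name, function name, and the example line are illustrative only.

```python
from typing import NamedTuple, Optional


class SamAlignment(NamedTuple):
    qname: str  # read (query) name
    flag: int   # bitwise flag
    rname: str  # reference sequence name
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str  # alignment description
    rnext: str  # reference name of the mate/next read
    pnext: int  # position of the mate/next read
    tlen: int   # observed template length
    seq: str    # read sequence
    qual: str   # base qualities ("*" if not stored)


def parse_sam_line(line: str) -> Optional[SamAlignment]:
    """Parse one SAM line; return None for header lines (starting with '@')."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("an alignment line needs at least 11 tab-separated fields")
    return SamAlignment(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        rnext=fields[6], pnext=int(fields[7]), tlen=int(fields[8]),
        seq=fields[9], qual=fields[10],
    )


example = "r001\t163\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
record = parse_sam_line(example)
is_unmapped = bool(record.flag & 0x4)   # flag bit 0x4: read is unmapped
is_reverse = bool(record.flag & 0x10)   # flag bit 0x10: read on reverse strand
print(record.rname, record.pos, record.cigar, is_unmapped, is_reverse)
```

Any optional tags after the eleventh column are ignored here; a real pipeline would normally rely on samtools or a dedicated library rather than hand parsing.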

  26. Alignment Examples Alignments in SAM format http://samtools.sourceforge.net/SAM1.pdf

  27. Cufflinks
      http://cufflinks.cbcb.umd.edu/
      • Assembles transcripts,
      • Estimates their abundances, and
      • Tests for differential expression and regulation in RNA-Seq samples
      Trapnell et al. (2010). Nature Biotechnology 28:511-515.

  28. Cufflinks Output
      • Gene expression
      • Transcript expression
      • Assembled transcripts

  29. Annotations • Mapping reads to specific transcripts/genes

  30. Data Visualization
      • UCSC Browser (accessible from Galaxy)
      • Trackster (native to Galaxy)
      External visualization tools:
      • Genome Workbench
          • http://www.ncbi.nlm.nih.gov/projects/gbench/
      • Integrative Genomics Viewer (IGV)
          • http://www.broadinstitute.org/igv/

  31. Statistical Analysis
      • Once the mapping and genome summarization are done, the data can be analyzed just like any other count data
      • Bullard, et al. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11:94.

  32. Typical RNA Seq Project Work Flow
      Tissue Sample → Total RNA → mRNA → cDNA → Sequencing → FASTQ file → QC → TopHat → Cufflinks → Gene/Transcript/Exon Expression → Visualization → Statistical Analysis
      JAX Computational Sciences Service

  33. Galaxy (see Tutorial 1)
      http://main.g2.bx.psu.edu/
      • Build and share data and analysis workflows
      • No programming experience required
      • Strong and growing development and user community

  34. RNA Seq Workflow
      • Convert data to FASTQ
      • Upload files to Galaxy
      • Quality Control
          • Throw out low quality sequence reads, etc.
      • Map reads to a reference genome
          • Many algorithms available
          • Trade-off between speed and sensitivity
      • Data summarization
          • Associating alignments with genome annotations
          • Counts
      • Data Visualization
      • Statistical Analysis
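One simple form of the quality control step is to drop reads whose mean quality falls below a cutoff. The sketch below assumes Sanger-encoded quality strings (ASCII offset 33) and an arbitrary cutoff of 20; both are assumptions for illustration, and the function names and reads are made up.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean PHRED score of a quality string (Sanger offset 33 assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def passes_quality(quality: str, min_mean: float = 20.0, offset: int = 33) -> bool:
    """True if the read is non-empty and its mean quality meets the cutoff."""
    return bool(quality) and mean_phred(quality, offset) >= min_mean


# Made-up (sequence, quality) pairs: "I" encodes PHRED 40 and "!" encodes PHRED 0.
reads = [("GATTACA", "IIIIIII"), ("ACGT", "!!!!"), ("ACGT", "II!!")]
kept = [(seq, qual) for seq, qual in reads if passes_quality(qual)]
print(f"kept {len(kept)} of {len(reads)} reads")
```

Quality control tools in Galaxy offer richer options, such as trimming low-quality ends instead of discarding whole reads.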

  35. [Figure: Galaxy interface with labeled panels: Dialog/Parameter Selection, History, Tools]

  36. Uploading Data to Galaxy
      Because of the size of most sequence files, it is necessary to use FTP to transfer them to Galaxy. Select the appropriate reference genome at the time of data upload.

  37. You can upload compressed files and they will be uncompressed upon loading into Galaxy.

  38. Tutorial Web Site http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/index.shtml This site will be accessible after the meeting. Check back for updates and new tutorials.

  39. next time
      • Analyze project data with DAVID
          • Convert probeset ids to genes
          • Look for enrichment of functional terms
      • Try the first part of Tutorial 5 in Galaxy
