1 / 27

Getting the computer setup

Getting the computer setup. Follow directions on handout to login to server. Type “ qsub - I ” to get a compute node. The data you will be using is stored in ../shared/. Mapping RNA-seq data. Matthew Young Alicia Oshlack Bernie Pope. Illumina Sequencing Technology. 3’. 5’. G. C. T.

nay
Download Presentation

Getting the computer setup

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Getting the computer setup • Follow directions on handout to login to server. • Type “qsub -I” to get a compute node. • The data you will be using is stored in ../shared/

  2. Mapping RNA-seq data Matthew Young Alicia Oshlack Bernie Pope

  3. Illumina Sequencing Technology 3’ 5’ G C T A C G T A C T A C C 5’ C C G A T A A A C G T T T A T G G G C 1 2 3 4 5 6 7 8 9 Base calling TGCTACGAT … DNA(0.1-1.0 ug) Sample preparation Single molecule array Cluster growth Sequencing Image acquisition Slide courtesy of G Schroth, Illumina

  4. Raw data • Short sequence reads • Quality scores = -10log10(p) or similar… @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 Handout Reference: Page 2

  5. Quality checks • Base composition • Quality score • PCR Artifacts

  6. Sequencing transcripts not the genome Reads Gene CDS CDS CDS CDS CDS CDS transcript CDS CDS Difficulty here is that reads spanning a exon-exon junction may not get mapped when mapping to the genome. One strategy: Supplement the reference genome with sequences that span all known or possible junctions.

  7. CDS CDS CDS CDS Exons Coding Sequence Introns Splice Junctions • Aggregate reads based on: • exons • Exons + junctions • All reads start to end of transcript • De novo methods

  8. Mapping tools

  9. Getting the computer setup • Follow directions on handout to login to server. • Type “qsub -I” to get a compute node. • The data you will be using is stored in ../shared/

  10. Familiarizing yourself with bowtie • The minimum information bowtie needs is your reads (i.e., the fastq file from the machine) and a reference. • The reference is the genome or transcriptome we are trying to map to, transformed using a Burrows Wheeler Transform to allow fast searching. • Many optional parameters to tweak alignment. Handout Reference: Pages 2-7

  11. How do you map 10^9, 76bp sequences, to a 10^9 bp reference • Ideally we’d test every position on the genome for its suitability as a match and assign it a score based on # mismatches, indels etc. • However, this is computationally impossible, so we have to come up with something else. Handout Reference: Pages 2-7

  12. Lots of aligners, one general strategy • Quick “heuristic” is performed to cut down the number of candidate alignment regions for each read. • More precise algorithm is employed to decide which of these candidates is a valid alignment. Handout Reference: Pages 2-7

  13. A fragment as seen by an aligner • Heuristic acts only on the seed. Putative mapping locations identified. • A more precise algorithm extends each seed to the full read and ranks them. • The fragment is not sequenced directly, it’s sequence is inferred based on mapping of the read. ; Seed Burrows Wheeler acts on this Smith-Waterman acts on this Read Fragment Handout Reference: Pages 2-7

  14. Our test data set • 76bp single end reads. • Sequenced using the Illumina Genome Analyzer • RNA taken from Mouse myoblast cell line. • We only look at a subset of one lane, full data available here http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2084

  15. Getting the computer setup • Follow directions on handout to login to server. • Type “qsub -I” to get a compute node. • The data you will be using is stored in ../shared/

  16. A strict mapping bowtie -t -p 1 -n 0 -e 1 --sam --best --un Strict_Unmapped.fastq ../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastqStrict_Mapped.sam • -n sets the number of mismatches in the seed • The sum of quality scores at ALL mismatches (not just the seed) must be less than -e. • --un saves the unmapped reads to the file specified • -t prints timing information • -p sets the number of simultaneous threads • --sam makes bowtie output in SAM format • --best ensures the best map is returned Handout Reference: Page 8

  17. Using the defaults bowtie -t -p 1 --best --sam ../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastqtest.sam • -t prints timing information • -p sets the number of simultaneous threads • --sam makes bowtie output in SAM format • --best ensures the best map is returned • --un saves the unmapped reads to the file specified Handout Reference: Page 8

  18. Allowing some mismatches Bowtie -t -p 1 -n 3 -e 200 --sam –best --un Loose_Unmapped.fastq ../shared/BowtieIndexes/mm9Strict_Unmapped.fastqLoose_Mapped.sam • Set the number of seed mismatches to the maximum. • Set -e to a value more appropriate for our read length. • How many reads do you map? Handout Reference: Pages 8-9

  19. A closer look at our data set: fastQC Handout Reference: Pages 9-10

  20. Handout Reference: Pages 9-10

  21. Trimming reads • Bowtie allows you to trim reads before it attempts to map them. • You can trim from the left (5’) end of the read with the --trim5 option. • You can trim from the right (3’) end of the read with the --trim3 option. Handout Reference: Page 10

  22. Sequencing transcripts not the genome Reads Gene CDS CDS CDS CDS CDS CDS transcript CDS CDS A library containing these bits of sequence (which do not appear in the genome) can help map junction reads. This is called an exon-junction library. A reference built from this library is in BowtieIndexes, called mm9.UCSC.knownGene.junctions (named for the annotation it was built from). Handout Reference: Pages 10-12

  23. Options to map more reads… • Trim some bases from the end of the reads using --trim5 and/or --trim3. • Map to the junction library instead of the mouse genome using the mm9.UCSC.knownGene.junctions index. Handout Reference: Pages 9-12

  24. Trim [myoung@bionode11 Starting]$ bowtie -t -p 1 -n 2 -e 70 --trim5 15 --trim3 25 --sam --best --un Trimmed_Unmapped.fastq ../shared/BowtieIndexes/mm9 Loose_Unmapped.fastqTrimmed_Mapped.sam Time loading forward index: 00:00:02 Time loading mirror index: 00:00:02 Seeded quality full-index search: 00:00:47 # reads processed: 531414 # reads with at least one reported alignment: 164256 (30.91%) # reads that failed to align: 367158 (69.09%) Reported 164256 alignments to 1 output stream(s) Time searching: 00:00:53 Overall time: 00:00:54 Handout Reference: Page 10

  25. Then map to junctions [myoung@bionode11 Starting]$ bowtie -t -p 1 -n 3 -e 200 --sam --best --un Junctions_Unampped.fastq ../shared/BowtieIndexes/mm9.UCSC.knownGene.junctions Trimmed_Unmapped.fastqJunctions_Mapped.sam Time loading forward index: 00:00:35 Time loading mirror index: 00:00:34 Seeded quality full-index search: 00:00:24 # reads processed: 367158 # reads with at least one reported alignment: 128756 (35.07%) # reads that failed to align: 238402 (64.93%) Reported 128756 alignments to 1 output stream(s) Time searching: 00:01:43 Overall time: 00:01:45 Handout Reference: Page 11

  26. Number of mapped reads Handout Reference: Page 12

  27. Further options Handout Reference: Pages 12-13

More Related