MCB3895-004 Lecture #10 Sept 25/14

MCB3895-004 Lecture #10Sept 25/14 SRA, Illumina data QC

Underwstanding the BBC cluster • What is the cluster? Many individual computers controlled by a "head node" • The head node is what you log onto by default using SSH • It is bad etiquette to run things off the head node • Can slow down the entire system

Using the cluster - method 1 • When you SSH in, use the "qlogin" command to take you to a subordinate note • Running programs here will not disrupt the head node • You need to stay connected to the network until all of your programs are completed

Checking a qlogin job • Use the terminal command "top" • Shows all processes running on your node • kill a process by pressing "k" and then entering its PID when prompted

Using the cluster - method 2 • Use the command "qsub" combined with a shell script • e.g., qsub script.sh • shell is a programming language commonly used for controlling actual processes • The BBC has example scripts for you to modify: http://bioinformatics.uconn.edu/understanding-the-bbc-cluster-and-sge/ • This method allows you to walk away once your script is running

qsub bash script #!/bin/bash ############################################################# ##### TEMPLATE SGE SCRIPT - BLAST EXAMPLE ################### ##### /common/sge_templates/template_single.sh ############## ############################################################# # Specify the name of the data file to be used INPUTFILENAME="test.fasta" # Name the directory (assumed to be a direct subdir of $HOME) from which the file is listed PROJECT_SUBDIR="test" # Specify name to be used to identify this run #$ -N blastp_job # Email address (change to yours) #$ -M bioinformatics@uconn.edu # Specify mailing options: b=beginning, e=end, s=suspended, n=never, a=abortion #$ -m besa

qsub bash script # Specify that bash shell should be used to process this script #$ -S /bin/bash # Specify working directory (created on compute node used to do the work) WORKING_DIR="/scratch/$USER/$PROJECT_SUBDIR-$JOB_ID" # If working directory does not exist, create it # The -p means "create parent directories as needed" if [ ! -d "$WORKING_DIR" ]; then mkdir -p $WORKING_DIR fi # Specify destination directory (this will be subdirectory of your home directory) DESTINATION_DIR="$HOME/$PROJECT_SUBDIR/$JOB_ID-$INPUTFILENAME" # If destination directory does not exist, create it # The -p in mkdir means "create parent directories as needed" if [ ! -d "$DESTINATION_DIR" ]; then mkdir -p $DESTINATION_DIR fi

qsub bash script # navigate to the working directory cd $WORKING_DIR # Get script and input data from home directory and copy to the working directory cp $HOME/$PROJECT_SUBDIR/$INPUTFILENAME ./test.fasta cp $HOME/template_single.sh . # Specify the output file #$ -o $JOB_ID.out # Specify the error file #$ -e $JOB_ID.err # Run the program blastp -query $INPUTFILENAME -db /usr/local/blast/data/refseq_protein -num_alignments 5 -num_descriptions 5 -out my-results # copy output files back to your home directory cp * $DESTINATION_DIR # clear scratch directory rm -rf $WORKING_DIR

Checking a qsub job • Use "qstat" to understand the status of your job • Shows jobs waiting to be executed • Monitor a running job's status using qstat -j <job_number> • Retrieve information about a completed job using qaact -j <job_number>

Cluster etiquette • Never run something on the head node! • Always check that your processes will run correctly before starting a large task! • Best strategy: run commands individually using a reduced input dataset • If using a loop to execute multiple commands, only go through a single iteration (e.g., use die)

The first part of Assignment #4 • Write a perl script that subsamples the first ~10000 reads of your input fasta file(s) • Allows you to do quick troubleshooting • Can be modified later to examine the effect of sampling depth

SRA • "Sequence Read Archive" • http://www.ncbi.nlm.nih.gov/sra • The part of NCBI that holds raw sequencing data • Generally, this is where you need to put your raw data when you publish genomic research

SRA

A SRA record

SRA run browser

For kicks… • Go to http://www.ncbi.nlm.nih.gov/sra • Search "Escherichia coli MG1655" • Note different results • Different sequencing platforms • Note mutant strains! • Try "Escherichia coli MG1655 Pacbio"

Downloading SRA data • Possible to do from web browser, but transferring large files is cumbersome • Better: use NCBI's SRA Toolkit on the BBC server to perform file conversion while downloading /opt/bioinformatics/sratoolkit.2.3.5-2-centos_linux64/bin/fastq-dump --split-files SRR826450 • Decompresses files, splits paired ends into separate files

The fastq file format • 4 lines per sequence: • Line 1: begins with "@", followed by sequence ID • Line 2: raw sequence data • Line 3: begins with "+", may have sequence ID • Line 4: Phred quality score for each position, in ASCII @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 ASCII from low (left) to high (right): !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ http://en.wikipedia.org/wiki/FASTQ_format

Phred scores • Developed by old program "Phred" during human genome project, adopted as standard throughout field • Phred score = -10log(P(base call error)) • e.g., Phred score of 10 = 90% base call accuracy Phred score of 20 = 99% base call accuracy Phred score of 30 = 99.9% base call accuracy Phred score of 40 = 99.99% base call accuracy etc.

FastQC - QC for raw reads • FastQC is the most common software to understand the quality of raw sequencing reads • http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ • Runs using a java applet • Using the server, we have to run via command line

FastQC screenshot Summary stats What it thinks of your run quality - NOT HARD AND FAST RULES!! Starts into specific figures

FastQC - Per base quality • Blue line: mean • Red line: median • Boxes: 25-75% range • Whiskers: • 10-90% range Phred score

FastQC - Per read quality • Highlights systematic problems • e.g., a region of flowcell is problematic

FastQC - Per base sequence content • Unbiased sequences should have the same content across all bases • Will show biases if some sequence is hugely overrepresented • e.g., adapter contamination • e.g., biased fragmentation

FastQC - Per Sequence GC content • Unbiased sequencing should have a normally distributed %GC content • Deviations may indicated contamination • e.g., adapter • e.g., two species with different %GC contents

FastQC - Per base N count • Ns indicate that the base caller could not determine a base at that position • Global N abundance generally correlates with sequence quality

FastQC - Sequence length • Some methods yield non-uniform read lengths • e.g., Pacbio (shown) • Illumina will only show one uniform value here

FastQC - Duplicate sequences • An unbiased library should have few duplicates • A few duplicates may indicate saturated template sequencing • High duplication may indicate adapter contamination or enrichment bias

FastQC - Kmer content • Tests for kmers enriched as a certain read position • Graphs 6 worst, tabulates the rest • Can indicate sequencing/library bias • Can indicate contamination by one sequence, e.g., primers, adapters

FastQC - Overrepresented sequences • May indicate how read diversity is limited, e.g., adapter/primer contamination • May be biological, e.g., repeats

FastQC - Adapter content • Specifically looks for known adapter/primer contamination • Indicates reads are longer than insert size

FastQC - Per tile sequence quality • Shows flowcell tiles that are particularly error-prone • Illumina data only, and only if positional metadata is included with reads

Running FastQC on the server • Very simple: $ fastqc <input_file> • Produces a .html file as output • Transfer the html to your computer and open it using your favorite web browser

Getting rid of adapters using Trimmomatic • Trimmomatic is a standard method to remove adapter contamination • http://www.usadellab.org/cms/?page=trimmomatic • Bolger et al. 2014 Bioinformatics btu170

Running Trimmomatic • Admittedly, a complex syntax java -jar <path to trimmomatic.jar> PE [-threads <threads] [-phred33 | -phred64] [-trimlog <logFile>] <input 1> <input 2> <paired output 1> <unpaired output 1> <paired output 2> <unpaired output 2> <step 1> ... java -jar /opt/bioinformatics/Trimmomatic-0.32/trimmomatic-0.32.jar PE SRR826450_1.fastq SRR826450_2.fastq output_forward_paired.fqoutput_forward_unpaired.fqoutput_reverse_paired.fqoutput_reverse_unpaired.fq ILLUMINACLIP:/opt/bioinformatics/Trimmomatic-0.32/adapters/TruSeq3-PE.fa:2:30:10

Assignment #4 • Download these two E. coli K-12 MG1655 genome sequencing reads from NCBI SRA: SRR826450, SRR956947 • What are the differences? • Write script to subsample fastq files • Analyze your input data using fastqc • If appropriate, adapter trim using Trimmomatic

MCB3895-004 Lecture #10 Sept 25/14

MCB3895-004 Lecture #10 Sept 25/14

Presentation Transcript

PHYS 1444 – Section 004 Lecture #22

MCB3895-004 Lecture #3 Sept 2/14

MCB3895-004 Lecture #4 Sept 4/14

MCB3895-004 Lecture #2 Aug 28/14

MCB3895-004 Lecture # 12 Oct 2/14

MCB3895-004 Lecture #10 Sept 25/14

MCB3895-004 Lecture # 11 Sept 30/14

MCB3895-004 Lecture # 14 Oct 21/14