Next Generation Sequence Alignment & Variant Discovery on the BRC-MH Linux Cluster

Next Generation Sequence Alignment & Variant Discovery on the BRC-MH Linux Cluster Steve Newhouse 28 Jan 2011

Overview • Practical guide to processing next generation sequencing data • No details on the inner workings of the software/code & theory • Based on the 1KG pipeline from the Broad Institute using their Genome Analysis Tool Kit (GATK). • Focus on Illumina paired-end sequence data • Alignment with BWA or Novoalign • SNP & Indel calling with GATK • NB: This is one way processing the data that works well

Main Tools/Resources • BRC Cluster Software : http://compbio.brc.iop.kcl.ac.uk/cluster/software.php • Maq: http://maq.sourceforge.net/ • Fastqc : http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ • Fastx: http://hannonlab.cshl.edu/fastx_toolkit/ • cmpfastq.pl : http://compbio.brc.iop.kcl.ac.uk/software/cmpfastq.php • BWA: http://bio-bwa.sourceforge.net/bwa.shtml • Novoalign: http://www.novocraft.com • Genome Analysis Toolkit: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit • PICARD TOOLS: http://picard.sourceforge.net/ • SAMTOOLS: http://samtools.sourceforge.net/ • VCFTOOLS: http://vcftools.sourceforge.net/ • FASTQ Files : http://en.wikipedia.org/wiki/FASTQ_format, • SAM/BAM Format : http://samtools.sourceforge.net/SAM1.pdf • PHRED Scores: http://en.wikipedia.org/wiki/Phred_quality_score • Next Generation Sequencing Library: http://ngslib.genome.tugraz.at • http://seqanswers.com • http://www.broadinstitute.org/gsa/wiki/index.php/File:Ngs_tutorial_depristo_1210.pdf

Analysis Pipeline • Convert IlluminaFastq to sangerFastq • QC raw data • Mapping (BWA, QC-BWA, Novoalign) • Convert Sequence Alignment/Map (SAM) to BAM • Local realignment around Indels • Remove duplicates • Base Quality Score Recalibration • Variant Discovery

Work flow Illumina Raw fastq Convert IlluminaFastq to sangerFastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs

What does the raw data look like? • Fastq Format :*_sequence.txt; • s_1_1_sequence.txt = lane 1, read 1 • s_1_2_sequence.txt = lane 1, read 2 • Text file storing both nucleotide sequence and quality scores. • Both the sequence letter and quality score are encoded with a single ASCII character for brevity. • Standard for storing the output of high throughput sequencing instruments such as the Illumina Genome Analyzer • http://en.wikipedia.org/wiki/FASTQ_format

FASTQ Format • Raw Data :- @315ARAAXX090414:8:1:567:552#0 TGTTTCTTTAAAAAGGTAAGAATGTTGTTGCTGGGCTTAGAAATATGAATAACCATATGCCAGATAGATAGATGGA + ;<<=<===========::==>====<<<;;;:::::99999988887766655554443333222211111000// • @315ARAAXX090414: the unique instrument name • 8: flowcell lane • 1: tile number within the flowcell lane • 567: 'x'-coordinate of the cluster within the tile • 552: 'y'-coordinate of the cluster within the tile • # :index number for a multiplexed sample (0 for no indexing) • 0 :the member of a pair, /1 or /2 (paired-end or mate-pair reads only) • http://en.wikipedia.org/wiki/FASTQ_format

Illumina Raw fastq Convert IlluminaFastq to sangerFastq Convert IlluminaFastq to sangerFastq QC raw data QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs

Quality Control & Pre-Alignment Processing.1 • Convert IlluminaFastq to sangerFastq $: maq ill2sanger s_1_1_sequence.txt foo.1.sanger.fastq $: maq ill2sanger s_1_2_sequence.txt foo.2.sanger.fastq

Quality Control & Pre-Alignment Processing.2 • FastQC: Provides a simple way to do some quality control checks on raw sequence data. • Quick impression of whether the data has problems. • Import of data from BAM, SAM or FastQ • Summary graphs and tables to quickly assess your data • Export of results to an HTML based permanent report • Offline operation to allow automated generation of reports without running the interactive application $: fastqc foo.1.sanger.fastq; $: fastqc foo.2.sanger.fastq; http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

FastQC

Illumina Raw fastq Convert IlluminaFastq to sangerFastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs

Reference genomes • Available genomes • Homo_sapiens_assembly18.fasta • human_b36_both.fasta • human_g1k_v37.fasta (1000 genomes) • Indexed for use with BWA or Novoalign • Location: /scratch/data/reference_genomes/human • Human reference sequences and dbSNP reference metadata are available in a tarball: • ftp://ftp.broadinstitute.org/pub/gsa/gatk_resources.tgz

BWA: Burrows-Wheeler Aligner ## Align with BWA $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: bwaaln -t 8 $REF foo.1.sanger.fastq > foo.1.sai; $: bwaaln -t 8 $REF foo.2.sanger.fastq > foo.2.sai; ## Generate alignment in the SAM format $: bwasampe $REF foo.1.sai foo.2.sai foo.1.sanger.fastq foo.2.sanger.fastq > foo.bwa.raw.sam; ## Sort bwa SAM file using PICARD TOOLS SortSam.jar - this will also produce the BAM file $: java -jar SortSam.jar SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT \ INPUT= foo.bwa.raw.sam OUTPUT= foo.bwa.raw.bam; ## samtools index $: samtools index foo.novo.raw.bam; • Use option -q15 if the quality is poor at the 3' end of reads • http://bio-bwa.sourceforge.net/bwa.shtml

BWA : with pre-alignment QC.1 • Fastx: http://hannonlab.cshl.edu/fastx_toolkit • QC filter raw sequence data • always use -Q 33 for sangerphred scaled data (-Q 64) $: cat foo.1.sanger.fastq | \ fastx_clipper -Q 33 -l 20 -v -a ACACTCTTTCCCTACACGACGCTCTTCCGATCT | \ fastx_clipper -Q 33 -l 20 -v -a CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT | \ fastx_clipper -Q 33 -l 20 -v -a ATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC | \ fastx_clipper -Q 33 -l 20 -v –a CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC | \ fastq_quality_trimmer -Q 33 -t 20 -l 20 -v | \ fastx_artifacts_filter -Q 33 -v | \ fastq_quality_filter -Q 33 -q 20 -p 50 -v -o foo.1.sanger.qc.fastq; $: cat foo.2.sanger.fastq | \ fastx_clipper -Q 33 -l 20 -v -a ACACTCTTTCCCTACACGACGCTCTTCCGATCT | \ fastx_clipper -Q 33 -l 20 -v -a CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT | \ fastx_clipper -Q 33 -l 20 –v -a ATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC | \ fastx_clipper -Q 33 -l 20 -v –a CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC | \ fastq_quality_trimmer -Q 33 -t 20 -l 20 -v | \ fastx_artifacts_filter -Q 33 -v | \ fastq_quality_filter -Q 33 -q 20 -p 50 -v -o foo.2.sanger.qc.fastq;#

BWA : with pre-alignment QC.2 • Compare QCdfastq files • One end of each read could be filtered out in QC • BWA cant deal with mixed PE & SE data • Need to id reads that are still paired after QC • Need to id reads that are no longer paired after QC $: perl cmpfastq.pl foo.1.sanger.qc.fastq foo.2.sanger.qc.fastq • Reads matched on presence/absence of id's in each file : • foo.1.sanger.qc.fastq : @315ARAAXX090414:8:1:567:552#1 • foo.2.sanger.qc.fastq : @315ARAAXX090414:8:1:567:552#2 • Output: 2 files for each *QC.fastq file • foo.1.sanger.qc.fastq-common-out (reads in foo.1.* == reads in foo.2.*) • foo.1.sanger.qc.fastq-unique-out (reads in foo.1.* not in foo.2.*) • foo.2.sanger.qc.fastq-commont-out • foo.2.sanger.qc.fastq-unique-out http://compbio.brc.iop.kcl.ac.uk/software/cmpfastq.php

BWA : with pre-alignment QC.3 • Align with BWA $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: bwaaln -t 8 $REF foo.1.sanger.qc.fastq-common-out > foo.1.common.sai; $: bwaaln -t 8 $REF foo.2.sanger.qc.fastq-common-out > foo.2.common.sai; $: bwaaln -t 8 $REF foo.1.sanger.qc.fastq-unique-out > foo.1.unique.sai; $: bwaaln -t 8 $REF foo.1.sanger.qc.fastq-unique-ou > foo.2.unique.sai; • Multi threading option : -t N • http://bio-bwa.sourceforge.net/bwa.shtml

BWA : with pre-alignment QC.4 • Generate alignments in the SAM/BAM format $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta ## bwasampe for *common* $: bwasampe $REF foo.1.common.sai foo.2.common.sai foo.1.sanger.qc.fastq-common-out foo.2.sanger.qc.fastq-common-out > foo.common.sam; ## bwasamse for *unique* $: bwasamse $REF foo.1.unique.sai foo.1.sanger.qc.fastq-unique-out > foo.1.unique.sam; $: bwasamse $REF foo.2.unique.sai foo.2.sanger.qc.fastq-unique-out > foo.2.unique.sam; ## merge SAM files using PICARD TOOLS MergeSamFiles.jar - this will also sort the BAM file $: java -jar MergeSamFiles.jar INPUT=foo.common.sam INPUT=foo.1.unique.sam INPUT=foo.2.unique.sam ASSUME_SORTED=false VALIDATION_STRINGENCY=SILENT OUTPUT=foo.bwa.raw.bam; ## samtools index samtools index foo.bwa.raw.bam; • Details SAM/BAM Format : http://samtools.sourceforge.net/SAM1.pdf

Novoalign : quality aware gapped aligner.1 • Has options for adaptor stripping and quality filters – and much more • More accurate than BWA but slower unless running MPI version • $1,990/year for full set of tools – worth it! $: REF=/scratch/data/reference_genomes/human/human_g1k_v37 $: novoalign -d $REF -F STDFQ -f foo.1.sanger.fastq foo.2.sanger.fastq \ -a GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG ACACTCTTTCCCTACACGACGCTCTTCCGATCT \ -r Random -i PE 200,50 -c 8 --3Prime -p 7,10 0.3,10 -k -K foo.novo.test \ -o SAM $'@RG\tID:foo\tPL:Illumina\tPU:Illumina\tLB:tumour\tSM:foo' \ > foo.novo.stats > foo.novo.raw.sam; • http://www.novocraft.com

Novoalign : quality aware gapped aligner.2 • Novoalign produces a name sorted SAM file which needs to be co-ordinate sorted for downstream processing ## Sort novo SAM file using PICARD TOOLS SortSam.jar - this will also produce the BAM file $: java -jar SortSam.jar SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT \ INPUT= foo.novo.raw.sam OUTPUT= foo.novo.raw.bam; ## samtools index $: samtools index foo.novo.raw.bam;

Post Aligment processing of BAM files • Local realignment around Indels • Remove duplicate reads • Base Quality Score Recalibration • GATK: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit • PICARD TOOLS: http://picard.sourceforge.net • SAMTOOLS: http://samtools.sourceforge.net • Many other quality stats/options for processing files using these tools : see web documentation

Illumina Raw fastq Convert IlluminaFastq to sangerFastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs

Local realignment around Indels.1 • Sequence aligners are unable to perfectly map reads containing insertions or deletions • Alignment artefacts • False positives SNPs • Steps to the realignment process: • Step 1: Determining (small) suspicious intervals which are likely in need of realignment • Step 2: Running the realigner over those intervals • Step 3: Fixing the mate pairs of realigned reads http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels

Local realignment around Indels.2 Original BAM file RealignerTargetCreator (GATK) forRealigner.intervals (interval file) IndelRealigner (GATK) Realigned BAM file SortSam (PICARD) Co-ordinate sorted Realigned BAM file FixMateInformation (PICARD) Co-ordinate sorted Realigned fixed BAM file

Local realignment around Indels.3 $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod $: TMPDIR=~/scratch/tmp $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar ## 1. Creating Intervals : RealignerTargetCreator $: java –Xmx5g -jar $GATK -T RealignerTargetCreator -R $REF -D $ROD \ -I foo.novo.bam -o foo.novo.bam.forRealigner.intervals; ## 2. Realigning : IndelRealigner $: java -Djava.io.tmpdir=$TMPDIR –Xmx5g -jar $GATK -T IndelRealigner \ -R $REF -D $ROD -I foo.novo.bam -o foo.novo.realn.bam \ -targetIntervalsfoo.novo.bam.forRealigner.intervals; ## samtools index $: samtools index foo.novo.realn.bam;

Local realignment around Indels.3 ## 3. Sort realigned BAM file using PICARD TOOLS SortSam.jar ## GATK IndelRealigner produces a name sorted BAM $: java –Xmx5g -jar SortSam.jar \ INPUT= foo.novo.realn.bam OUTPUT=foo.novo.realn.sorted.bam \ SORT_ORDER=coordinate TMP_DIR=$TMPDIR VALIDATION_STRINGENCY=SILENT; ## samtools index $: samtools index foo.novo.realn.soretd.bam; ## 4. Fixing the mate pairs of realigned reads using Picard tools FixMateInformation.jar $: java -Djava.io.tmpdir=$TMPDIR -jar -Xmx6g FixMateInformation.jar \ INPUT= foo.novo.realn.sorted.bam OUTPUT= foo.novo.realn.sorted.fixed.bam \ SO=coordinate VALIDATION_STRINGENCY=SILENT TMP_DIR=$TMPDIR; ## samtools index samtools index foo.novo.realn.sorted.fixed.bam ;

Illumina Raw fastq Convert IlluminaFastq to sangerFastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs

Remove duplicate reads • Examine aligned records in the supplied SAM or BAM file to locate duplicate molecules and remove them $: TMPDIR=~/scratch/tmp ## Remove duplicate reads with Picard tools MarkDuplicates.jar $: java -Xmx6g –jar MarkDuplicates.jar \ INPUT= foo.novo.realn.sorted.fixed.bam \ OUTPUT= foo.novo.realn.duperemoved.bam \ METRICS_FILE=foo.novo.realn.Duplicate.metrics.file \ REMOVE_DUPLICATES=true \ ASSUME_SORTED=false TMP_DIR=$TMPDIR \ VALIDATION_STRINGENCY=SILENT; ## samtools index samtools index foo.novo.realn.duperemoved.bam;

Illumina Raw fastq Convert IlluminaFastq to sangerFastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Base Quality Score Recalibration Analysis-ready reads Analysis-ready reads Indels & SNPs

Base Quality Score Recalibration.1 • Correct for variation in quality with machine cycle and sequence context • Recalibrated quality scores are more accurate • Closer to the actual probability of mismatching the reference genome • Done by analysing the covariation among several features of a base • Reported quality score • The position within the read • The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine • Probability of mismatching the reference genome • Known SNPs taken into account (dbSNP 131) • Covariates are then used to recalibrate the quality scores of all reads in a BAM file http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

Base Quality Score Recalibration.2 Co-ordinate sorted Realigned fixed BAM file CountCovariates AnalyzeCovariates Pre-recalibration analysis plots Covariates table (.csv) TableRecalibraion Final Recalibrated BAM file CountCovariates AnalyzeCovariates Post-recalibration analysis plots Recalibrated covariates table (.csv)

Base Quality Score Recalibration.3 ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod ## 1. GATK CountCovariates java -Xmx8g -jar $GATK -T CountCovariates -R $REF --DBSNP $ROD \ -I foo.novo.realn.duperemoved.bam\ -recalFilefoo.novo.realn.duperemoved.bam.recal_data.csv \ --default_platformIllumina \ -covReadGroupCovariate \ -covQualityScoreCovariate \ -covCycleCovariate \ -covDinucCovariate \ -covTileCovariate \ -covHomopolymerCovariate \ -nback 5; http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

Base Quality Score Recalibration.4 ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: GATKacov=/share/apps/gatk_20100930/Sting/dist/AnalyzeCovariates.jar $: GATKR=/share/apps/gatk_20100930/Sting/R $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod $: Rbin=/share/apps/R_current/bin/Rscript ## 2. GATK AnalyzeCovariates java -Xmx5g –jar $GATKacov \ -recalFilefoo.novo.realn.duperemoved.bam.recal_data.csv \ -outputDirfoo.novo.realn.duperemoved.bam.recal.plots \ -resources $GATKR \ -Rscript $Rbin; http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

Base Quality Score Recalibration.5 • Generate the final analysis ready BAM file for Variant Discovery and Genotyping ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod ## 3. GATK TableRecalibration $: java –Xmx6g -jar $GATK -T TableRecalibration -R $REF \ -I foo.novo.realn.duperemoved.bam \ --out foo.novo.final.bam \ -recalFilefoo.novo.realn.duperemoved.bam.recal_data.csv \ --default_platformIllumina; ##samtools index $: samtools index foo.novo.final.bam; http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

Illumina Raw fastq Convert IlluminaFastq to sangerFastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads SNP & Indel calling with GATK Indels & SNPs

SNP & Indel calling with GATK Final Recalibrated BAM file IndelGenotyperV2 UnifiedGenotyper filterSingleSampleCalls.pl gatk.raw.indels.verbose.output.bed gatk.raw.indels.detailed.output.vcf gatk.indels.filtered.bed gatk.raw.indels.bed gatk.raw.snps.vcf gatk.filtered.indels.simple.bed ... chr1 556817 556817 +G:3/7 chr1 3535035 3535054 -TTCTGGGAGCTCCTCCCCC:9/21 chr1 3778838 3778838 +A:15/48 ... makeIndelMask.py indels.mask.bed VariantFiltration gatk.filtered.snps.vcf

Short Indels (GATK IndelGenotyperV2) ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: GATKPERL=/share/apps/gatk_20100930/Sting/perl $: GATKPYTHON=/share/apps/gatk_20100930/Sting/python $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod ## Generate raw indel calls $: java -Xmx6g -jar $GATK -T IndelGenotyperV2 -R $REF --DBSNP $ROD \ -I foo.novo.final.bam \ -bed foo.gatk.raw.indels.bed \ -o foo.gatk.raw.indels.detailed.output.vcf \ --metrics_filefoo.gatk.raw.indels.metrics.file \ -verbose foo.gatk.raw.indels.verbose.output.bed \ -minCoverage 8 -S SILENT –mnr 1000000; ## Filter raw indels $: perl $GATKPERL/filterSingleSampleCalls.pl --calls foo.gatk.raw.indels.verbose.output.bed \ --max_cons_av_mm 3.0 --max_cons_nqs_av_mm 0.5 --mode ANNOTATE > foo.gatk.filtered.indels.bed http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0

Creating an indel mask file • The output of the IndelGenotyper is used to mask out SNPs near indels. • “makeIndelMask.py” creates a bed file representing the masking intervals based on the output of IndelGenotyper. $: GATKPYTHON=/share/apps/gatk_20100930/Sting/python ## Create indel mask file $: python $GATKPYTHON/makeIndelMask.py foo.gatk.raw.indels.bed 5 indels.mask.bed • The number in this command stands for the number of bases that will be included on either side of the indel. http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0

SNP Calling (GATK UnifiedGenotyper) ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod ## SNP Calling $: java -Xmx5g -jar $GATK -T UnifiedGenotyper -R $REF -D $ROD \ -baq CALCULATE_AS_NECESSARY -baqGOP 30 -nt 8 \ -A DepthOfCoverage -A AlleleBalance -A HaplotypeScore -A HomopolymerRun -A MappingQualityZero -A QualByDepth -A RMSMappingQuality -A SpanningDeletions \ -I foo.novo.final.bam -o foo.gatk.raw.snps.vcf \ -verbose foo.gatk.raw.snps.vcf.verbose -metrics foo.gatk.raw.snps.vcf.metrics; • This results in a VCF (variant call format) file containing raw SNPs. • VCF is a text file format. It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome (SNP/INDEL calls). http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper

SNP Filtering & annotation (GATK VariantFiltration) • VariantFiltration is used annotate suspicious calls from VCF files based on their failing given filters. • It annotates the FILTER field of VCF files for records that fail any one of several filters: • Variants that lie in clusters, using the specified values to define a cluster (i.e. the number of variants and the window size). • Any variant which overlaps entries from a masking rod. • Any variant whose INFO field annotations match a specified expression (i.e. the expression is used to describe records which should be filtered out). ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod ## VariantFiltration & annotation $: java –Xmx5g -jar $GATK -T VariantFiltration -R $REF -D $ROD \ -o foo.gatk.VariantFiltration.snps.vcf \ -B variant,VCF,foo.gatk.raw.snps.vcf\ -B mask,Bed,indels.mask.bed--maskNameInDel \ --clusterSize 3 --clusterWindowSize 10 \ --filterExpression "DP <= 8" --filterName "DP8" \ --filterExpression "SB > -0.10" --filterName "StrandBias" \ --filterExpression "HRun > 8" --filterName "HRun8" \ --filterExpression "QD < 5.0" --filterName "QD5" \ --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" --filterName "hard2validate"; • More information on the parameters used can be found in: http://www.broadinstitute.org/gsa/wiki/index.php/VariantFiltrationWalkerhttp://www.broadinstitute.org/gsa/wiki/index.php/Using_JEXL_expressions

SNP Filtering: VCFTOOLS • VCFTOOLS: methods for working with VCF files: filtering,validating, merging, comparing and calculate some basic population genetic statistics. • Example of some basic hard filtering: ## filter poor quality & suspicious SNP calls vcftools --vcffoo.gatk.VariantFiltration.snps.vcf\ --remove-filtered DP8 --remove-filtered StrandBias--remove-filtered LowQual \ --remove-filtered hard2validate --remove-filtered SnpCluster \ --keep-INFO AC --keep-INFO AF --keep-INFO AN --keep-INFO DB \ --keep-INFO DP --keep-INFO DS --keep-INFO Dels --keep-INFO HRun --keep-INFO HaplotypeScore --keep-INFO MQ --keep-INFO MQ0 --keep-INFO QD --keep-INFO SB --out foo.gatk.good.snps ; • this produces the file " foo.gatk.good.snps.recode.vcf"

VCFTOOLS for SNP QC and statistics • VCFTOOLS can be used to generate useful QC measures, PLINK ped/map, Impute input and more.... ## QC & info $: for MYQC in missing freq2 counts2 depth site-depth site-mean-depth geno-depth het hardy singletons;do vcftools --vcffoo.gatk.good.snps.recode.vcf --$MYQC \ --out foo.gatk.good.snps.QC; done ## write genotypes, genotype qualities and genotype depth to separate files $: for MYFORMAT in GT GQ DP;do vcftools--vcffoo.gatk.good.snps.recode.vcf \ --extract-FORMAT-info $MYFORMAT --out foo.gatk.good.snps; done ## make PLINK ped and map files vcftools --vcffoo.gatk.good.snps.recode.vcf --plink --out foo.gatk.good.snps http://vcftools.sourceforge.net/

Picard Tools : Post Alignment Summary reports • See : http://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_fennell.pdf

Questions? • Email : stephen.newhouse@kcl.ac.uk • More useful links: • http://www.broadinstitute.org/gsa/wiki/index.php/Prerequisites • http://www.broadinstitute.org/gsa/wiki/index.php/Building_the_GATK • http://www.broadinstitute.org/gsa/wiki/index.php/Downloading_the_GATK • http://www.broadinstitute.org/gsa/wiki/index.php/Input_files_for_the_GATK • http://www.broadinstitute.org/gsa/wiki/index.php/Preparing_the_essential_GATK_input_files:_the_reference_genome • http://www.broadinstitute.org/gsa/wiki/index.php/The_DBSNP_rod

Next Generation Sequence Alignment & Variant Discovery on the BRC-MH Linux Cluster