1 / 44

Next Generation Sequence Alignment & Variant Discovery on the BRC-MH Linux Cluster

Next Generation Sequence Alignment & Variant Discovery on the BRC-MH Linux Cluster. Steve Newhouse 28 Jan 2011. Overview. Practical guide to processing next generation sequencing data No details on the inner workings of the software/code & theory

badu
Download Presentation

Next Generation Sequence Alignment & Variant Discovery on the BRC-MH Linux Cluster

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Next Generation Sequence Alignment & Variant Discovery on the BRC-MH Linux Cluster Steve Newhouse 28 Jan 2011

  2. Overview • Practical guide to processing next generation sequencing data • No details on the inner workings of the software/code & theory • Based on the 1KG pipeline from the Broad Institute using their Genome Analysis Tool Kit (GATK). • Focus on Illumina paired-end sequence data • Alignment with BWA or Novoalign • SNP & Indel calling with GATK • NB: This is one way processing the data that works well

  3. Main Tools/Resources • BRC Cluster Software : http://compbio.brc.iop.kcl.ac.uk/cluster/software.php • Maq: http://maq.sourceforge.net/ • Fastqc : http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ • Fastx: http://hannonlab.cshl.edu/fastx_toolkit/ • cmpfastq.pl : http://compbio.brc.iop.kcl.ac.uk/software/cmpfastq.php • BWA: http://bio-bwa.sourceforge.net/bwa.shtml • Novoalign: http://www.novocraft.com • Genome Analysis Toolkit: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit • PICARD TOOLS: http://picard.sourceforge.net/ • SAMTOOLS: http://samtools.sourceforge.net/ • VCFTOOLS: http://vcftools.sourceforge.net/ • FASTQ Files : http://en.wikipedia.org/wiki/FASTQ_format, • SAM/BAM Format : http://samtools.sourceforge.net/SAM1.pdf • PHRED Scores: http://en.wikipedia.org/wiki/Phred_quality_score • Next Generation Sequencing Library: http://ngslib.genome.tugraz.at • http://seqanswers.com • http://www.broadinstitute.org/gsa/wiki/index.php/File:Ngs_tutorial_depristo_1210.pdf

  4. Analysis Pipeline • Convert IlluminaFastq to sangerFastq • QC raw data • Mapping (BWA, QC-BWA, Novoalign) • Convert Sequence Alignment/Map (SAM) to BAM • Local realignment around Indels • Remove duplicates • Base Quality Score Recalibration • Variant Discovery

  5. Work flow Illumina Raw fastq Convert IlluminaFastq to sangerFastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs

  6. What does the raw data look like? • Fastq Format :*_sequence.txt; • s_1_1_sequence.txt = lane 1, read 1 • s_1_2_sequence.txt = lane 1, read 2 • Text file storing both nucleotide sequence and quality scores. • Both the sequence letter and quality score are encoded with a single ASCII character for brevity. • Standard for storing the output of high throughput sequencing instruments such as the Illumina Genome Analyzer • http://en.wikipedia.org/wiki/FASTQ_format

  7. FASTQ Format • Raw Data :- @315ARAAXX090414:8:1:567:552#0 TGTTTCTTTAAAAAGGTAAGAATGTTGTTGCTGGGCTTAGAAATATGAATAACCATATGCCAGATAGATAGATGGA + ;<<=<===========::==>====<<<;;;:::::99999988887766655554443333222211111000// • @315ARAAXX090414: the unique instrument name • 8: flowcell lane • 1: tile number within the flowcell lane • 567: 'x'-coordinate of the cluster within the tile • 552: 'y'-coordinate of the cluster within the tile • # :index number for a multiplexed sample (0 for no indexing) • 0 :the member of a pair, /1 or /2 (paired-end or mate-pair reads only) • http://en.wikipedia.org/wiki/FASTQ_format

  8. Illumina Raw fastq Convert IlluminaFastq to sangerFastq Convert IlluminaFastq to sangerFastq QC raw data QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs

  9. Quality Control & Pre-Alignment Processing.1 • Convert IlluminaFastq to sangerFastq $: maq ill2sanger s_1_1_sequence.txt foo.1.sanger.fastq $: maq ill2sanger s_1_2_sequence.txt foo.2.sanger.fastq

  10. Quality Control & Pre-Alignment Processing.2 • FastQC: Provides a simple way to do some quality control checks on raw sequence data. • Quick impression of whether the data has problems. • Import of data from BAM, SAM or FastQ • Summary graphs and tables to quickly assess your data • Export of results to an HTML based permanent report • Offline operation to allow automated generation of reports without running the interactive application $: fastqc foo.1.sanger.fastq; $: fastqc foo.2.sanger.fastq; http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

  11. FastQC

  12. Illumina Raw fastq Convert IlluminaFastq to sangerFastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs

  13. Reference genomes • Available genomes • Homo_sapiens_assembly18.fasta • human_b36_both.fasta • human_g1k_v37.fasta (1000 genomes) • Indexed for use with BWA or Novoalign • Location: /scratch/data/reference_genomes/human • Human reference sequences and dbSNP reference metadata are available in a tarball: • ftp://ftp.broadinstitute.org/pub/gsa/gatk_resources.tgz

  14. BWA: Burrows-Wheeler Aligner ## Align with BWA $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: bwaaln -t 8 $REF foo.1.sanger.fastq > foo.1.sai; $: bwaaln -t 8 $REF foo.2.sanger.fastq > foo.2.sai; ## Generate alignment in the SAM format $: bwasampe $REF foo.1.sai foo.2.sai foo.1.sanger.fastq foo.2.sanger.fastq > foo.bwa.raw.sam; ## Sort bwa SAM file using PICARD TOOLS SortSam.jar - this will also produce the BAM file $: java -jar SortSam.jar SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT \ INPUT= foo.bwa.raw.sam OUTPUT= foo.bwa.raw.bam; ## samtools index $: samtools index foo.novo.raw.bam; • Use option -q15 if the quality is poor at the 3' end of reads • http://bio-bwa.sourceforge.net/bwa.shtml

  15. BWA : with pre-alignment QC.1 • Fastx: http://hannonlab.cshl.edu/fastx_toolkit • QC filter raw sequence data • always use -Q 33 for sangerphred scaled data (-Q 64) $: cat foo.1.sanger.fastq | \ fastx_clipper -Q 33 -l 20 -v -a ACACTCTTTCCCTACACGACGCTCTTCCGATCT | \ fastx_clipper -Q 33 -l 20 -v -a CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT | \ fastx_clipper -Q 33 -l 20 -v -a ATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC | \ fastx_clipper -Q 33 -l 20 -v –a CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC | \ fastq_quality_trimmer -Q 33 -t 20 -l 20 -v | \ fastx_artifacts_filter -Q 33 -v | \ fastq_quality_filter -Q 33 -q 20 -p 50 -v -o foo.1.sanger.qc.fastq; $: cat foo.2.sanger.fastq | \ fastx_clipper -Q 33 -l 20 -v -a ACACTCTTTCCCTACACGACGCTCTTCCGATCT | \ fastx_clipper -Q 33 -l 20 -v -a CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT | \ fastx_clipper -Q 33 -l 20 –v -a ATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC | \ fastx_clipper -Q 33 -l 20 -v –a CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC | \ fastq_quality_trimmer -Q 33 -t 20 -l 20 -v | \ fastx_artifacts_filter -Q 33 -v | \ fastq_quality_filter -Q 33 -q 20 -p 50 -v -o foo.2.sanger.qc.fastq;#

  16. BWA : with pre-alignment QC.2 • Compare QCdfastq files • One end of each read could be filtered out in QC • BWA cant deal with mixed PE & SE data • Need to id reads that are still paired after QC • Need to id reads that are no longer paired after QC $: perl cmpfastq.pl foo.1.sanger.qc.fastq foo.2.sanger.qc.fastq • Reads matched on presence/absence of id's in each file : • foo.1.sanger.qc.fastq : @315ARAAXX090414:8:1:567:552#1 • foo.2.sanger.qc.fastq : @315ARAAXX090414:8:1:567:552#2 • Output: 2 files for each *QC.fastq file • foo.1.sanger.qc.fastq-common-out (reads in foo.1.* == reads in foo.2.*) • foo.1.sanger.qc.fastq-unique-out (reads in foo.1.* not in foo.2.*) • foo.2.sanger.qc.fastq-commont-out • foo.2.sanger.qc.fastq-unique-out http://compbio.brc.iop.kcl.ac.uk/software/cmpfastq.php

  17. BWA : with pre-alignment QC.3 • Align with BWA $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: bwaaln -t 8 $REF foo.1.sanger.qc.fastq-common-out > foo.1.common.sai; $: bwaaln -t 8 $REF foo.2.sanger.qc.fastq-common-out > foo.2.common.sai; $: bwaaln -t 8 $REF foo.1.sanger.qc.fastq-unique-out > foo.1.unique.sai; $: bwaaln -t 8 $REF foo.1.sanger.qc.fastq-unique-ou > foo.2.unique.sai; • Multi threading option : -t N • http://bio-bwa.sourceforge.net/bwa.shtml

  18. BWA : with pre-alignment QC.4 • Generate alignments in the SAM/BAM format $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta ## bwasampe for *common* $: bwasampe $REF foo.1.common.sai foo.2.common.sai foo.1.sanger.qc.fastq-common-out foo.2.sanger.qc.fastq-common-out > foo.common.sam; ## bwasamse for *unique* $: bwasamse $REF foo.1.unique.sai foo.1.sanger.qc.fastq-unique-out > foo.1.unique.sam; $: bwasamse $REF foo.2.unique.sai foo.2.sanger.qc.fastq-unique-out > foo.2.unique.sam; ## merge SAM files using PICARD TOOLS MergeSamFiles.jar - this will also sort the BAM file $: java -jar MergeSamFiles.jar INPUT=foo.common.sam INPUT=foo.1.unique.sam INPUT=foo.2.unique.sam ASSUME_SORTED=false VALIDATION_STRINGENCY=SILENT OUTPUT=foo.bwa.raw.bam; ## samtools index samtools index foo.bwa.raw.bam; • Details SAM/BAM Format : http://samtools.sourceforge.net/SAM1.pdf

  19. Novoalign : quality aware gapped aligner.1 • Has options for adaptor stripping and quality filters – and much more • More accurate than BWA but slower unless running MPI version • $1,990/year for full set of tools – worth it! $: REF=/scratch/data/reference_genomes/human/human_g1k_v37 $: novoalign -d $REF -F STDFQ -f foo.1.sanger.fastq foo.2.sanger.fastq \ -a GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG ACACTCTTTCCCTACACGACGCTCTTCCGATCT \ -r Random -i PE 200,50 -c 8 --3Prime -p 7,10 0.3,10 -k -K foo.novo.test \ -o SAM $'@RG\tID:foo\tPL:Illumina\tPU:Illumina\tLB:tumour\tSM:foo' \ > foo.novo.stats > foo.novo.raw.sam; • http://www.novocraft.com

  20. Novoalign : quality aware gapped aligner.2 • Novoalign produces a name sorted SAM file which needs to be co-ordinate sorted for downstream processing ## Sort novo SAM file using PICARD TOOLS SortSam.jar - this will also produce the BAM file $: java -jar SortSam.jar SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT \ INPUT= foo.novo.raw.sam OUTPUT= foo.novo.raw.bam; ## samtools index $: samtools index foo.novo.raw.bam;

  21. Post Aligment processing of BAM files • Local realignment around Indels • Remove duplicate reads • Base Quality Score Recalibration • GATK: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit • PICARD TOOLS: http://picard.sourceforge.net • SAMTOOLS: http://samtools.sourceforge.net • Many other quality stats/options for processing files using these tools : see web documentation

  22. Illumina Raw fastq Convert IlluminaFastq to sangerFastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs

  23. Local realignment around Indels.1 • Sequence aligners are unable to perfectly map reads containing insertions or deletions • Alignment artefacts • False positives SNPs • Steps to the realignment process: • Step 1: Determining (small) suspicious intervals which are likely in need of realignment • Step 2: Running the realigner over those intervals • Step 3: Fixing the mate pairs of realigned reads http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels

  24. Local realignment around Indels.2 Original BAM file RealignerTargetCreator (GATK) forRealigner.intervals (interval file) IndelRealigner (GATK) Realigned BAM file SortSam (PICARD) Co-ordinate sorted Realigned BAM file FixMateInformation (PICARD) Co-ordinate sorted Realigned fixed BAM file

  25. Local realignment around Indels.3 $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod $: TMPDIR=~/scratch/tmp $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar ## 1. Creating Intervals : RealignerTargetCreator $: java –Xmx5g -jar $GATK -T RealignerTargetCreator -R $REF -D $ROD \ -I foo.novo.bam -o foo.novo.bam.forRealigner.intervals; ## 2. Realigning : IndelRealigner $: java -Djava.io.tmpdir=$TMPDIR –Xmx5g -jar $GATK -T IndelRealigner \ -R $REF -D $ROD -I foo.novo.bam -o foo.novo.realn.bam \ -targetIntervalsfoo.novo.bam.forRealigner.intervals; ## samtools index $: samtools index foo.novo.realn.bam;

  26. Local realignment around Indels.3 ## 3. Sort realigned BAM file using PICARD TOOLS SortSam.jar ## GATK IndelRealigner produces a name sorted BAM $: java –Xmx5g -jar SortSam.jar \ INPUT= foo.novo.realn.bam OUTPUT=foo.novo.realn.sorted.bam \ SORT_ORDER=coordinate TMP_DIR=$TMPDIR VALIDATION_STRINGENCY=SILENT; ## samtools index $: samtools index foo.novo.realn.soretd.bam; ## 4. Fixing the mate pairs of realigned reads using Picard tools FixMateInformation.jar $: java -Djava.io.tmpdir=$TMPDIR -jar -Xmx6g FixMateInformation.jar \ INPUT= foo.novo.realn.sorted.bam OUTPUT= foo.novo.realn.sorted.fixed.bam \ SO=coordinate VALIDATION_STRINGENCY=SILENT TMP_DIR=$TMPDIR; ## samtools index samtools index foo.novo.realn.sorted.fixed.bam ;

  27. Illumina Raw fastq Convert IlluminaFastq to sangerFastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Remove duplicates Base Quality Score Recalibration Analysis-ready reads Indels & SNPs

  28. Remove duplicate reads • Examine aligned records in the supplied SAM or BAM file to locate duplicate molecules and remove them $: TMPDIR=~/scratch/tmp ## Remove duplicate reads with Picard tools MarkDuplicates.jar $: java -Xmx6g –jar MarkDuplicates.jar \ INPUT= foo.novo.realn.sorted.fixed.bam \ OUTPUT= foo.novo.realn.duperemoved.bam \ METRICS_FILE=foo.novo.realn.Duplicate.metrics.file \ REMOVE_DUPLICATES=true \ ASSUME_SORTED=false TMP_DIR=$TMPDIR \ VALIDATION_STRINGENCY=SILENT; ## samtools index samtools index foo.novo.realn.duperemoved.bam;

  29. Illumina Raw fastq Convert IlluminaFastq to sangerFastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Base Quality Score Recalibration Analysis-ready reads Analysis-ready reads Indels & SNPs

  30. Base Quality Score Recalibration.1 • Correct for variation in quality with machine cycle and sequence context • Recalibrated quality scores are more accurate • Closer to the actual probability of mismatching the reference genome • Done by analysing the covariation among several features of a base • Reported quality score • The position within the read • The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine • Probability of mismatching the reference genome • Known SNPs taken into account (dbSNP 131) • Covariates are then used to recalibrate the quality scores of all reads in a BAM file http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

  31. Base Quality Score Recalibration.2 Co-ordinate sorted Realigned fixed BAM file CountCovariates AnalyzeCovariates Pre-recalibration analysis plots Covariates table (.csv) TableRecalibraion Final Recalibrated BAM file CountCovariates AnalyzeCovariates Post-recalibration analysis plots Recalibrated covariates table (.csv)

  32. Base Quality Score Recalibration.3 ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod ## 1. GATK CountCovariates java -Xmx8g -jar $GATK -T CountCovariates -R $REF --DBSNP $ROD \ -I foo.novo.realn.duperemoved.bam\ -recalFilefoo.novo.realn.duperemoved.bam.recal_data.csv \ --default_platformIllumina \ -covReadGroupCovariate \ -covQualityScoreCovariate \ -covCycleCovariate \ -covDinucCovariate \ -covTileCovariate \ -covHomopolymerCovariate \ -nback 5; http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

  33. Base Quality Score Recalibration.4 ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: GATKacov=/share/apps/gatk_20100930/Sting/dist/AnalyzeCovariates.jar $: GATKR=/share/apps/gatk_20100930/Sting/R $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod $: Rbin=/share/apps/R_current/bin/Rscript ## 2. GATK AnalyzeCovariates java -Xmx5g –jar $GATKacov \ -recalFilefoo.novo.realn.duperemoved.bam.recal_data.csv \ -outputDirfoo.novo.realn.duperemoved.bam.recal.plots \ -resources $GATKR \ -Rscript $Rbin; http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

  34. Base Quality Score Recalibration.5 • Generate the final analysis ready BAM file for Variant Discovery and Genotyping ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod ## 3. GATK TableRecalibration $: java –Xmx6g -jar $GATK -T TableRecalibration -R $REF \ -I foo.novo.realn.duperemoved.bam \ --out foo.novo.final.bam \ -recalFilefoo.novo.realn.duperemoved.bam.recal_data.csv \ --default_platformIllumina; ##samtools index $: samtools index foo.novo.final.bam; http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration

  35. Illumina Raw fastq Convert IlluminaFastq to sangerFastq QC raw data Mapping (BWA, QC-BWA, Novoalign) Convert SAM to BAM Local realignment around Indels Remove duplicates Base Quality Score Recalibration Analysis-ready reads SNP & Indel calling with GATK Indels & SNPs

  36. SNP & Indel calling with GATK Final Recalibrated BAM file IndelGenotyperV2 UnifiedGenotyper filterSingleSampleCalls.pl gatk.raw.indels.verbose.output.bed gatk.raw.indels.detailed.output.vcf gatk.indels.filtered.bed gatk.raw.indels.bed gatk.raw.snps.vcf gatk.filtered.indels.simple.bed ... chr1 556817 556817 +G:3/7 chr1 3535035 3535054 -TTCTGGGAGCTCCTCCCCC:9/21 chr1 3778838 3778838 +A:15/48 ... makeIndelMask.py indels.mask.bed VariantFiltration gatk.filtered.snps.vcf

  37. Short Indels (GATK IndelGenotyperV2) ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: GATKPERL=/share/apps/gatk_20100930/Sting/perl $: GATKPYTHON=/share/apps/gatk_20100930/Sting/python $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod ## Generate raw indel calls $: java -Xmx6g -jar $GATK -T IndelGenotyperV2 -R $REF --DBSNP $ROD \ -I foo.novo.final.bam \ -bed foo.gatk.raw.indels.bed \ -o foo.gatk.raw.indels.detailed.output.vcf \ --metrics_filefoo.gatk.raw.indels.metrics.file \ -verbose foo.gatk.raw.indels.verbose.output.bed \ -minCoverage 8 -S SILENT –mnr 1000000; ## Filter raw indels $: perl $GATKPERL/filterSingleSampleCalls.pl --calls foo.gatk.raw.indels.verbose.output.bed \ --max_cons_av_mm 3.0 --max_cons_nqs_av_mm 0.5 --mode ANNOTATE > foo.gatk.filtered.indels.bed http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0

  38. Creating an indel mask file • The output of the IndelGenotyper is used to mask out SNPs near indels. • “makeIndelMask.py” creates a bed file representing the masking intervals based on the output of IndelGenotyper. $: GATKPYTHON=/share/apps/gatk_20100930/Sting/python ## Create indel mask file $: python $GATKPYTHON/makeIndelMask.py foo.gatk.raw.indels.bed 5 indels.mask.bed • The number in this command stands for the number of bases that will be included on either side of the indel. http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0

  39. SNP Calling (GATK UnifiedGenotyper) ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod ## SNP Calling $: java -Xmx5g -jar $GATK -T UnifiedGenotyper -R $REF -D $ROD \ -baq CALCULATE_AS_NECESSARY -baqGOP 30 -nt 8 \ -A DepthOfCoverage -A AlleleBalance -A HaplotypeScore -A HomopolymerRun -A MappingQualityZero -A QualByDepth -A RMSMappingQuality -A SpanningDeletions \ -I foo.novo.final.bam -o foo.gatk.raw.snps.vcf \ -verbose foo.gatk.raw.snps.vcf.verbose -metrics foo.gatk.raw.snps.vcf.metrics; • This results in a VCF (variant call format) file containing raw SNPs. • VCF is a text file format. It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome (SNP/INDEL calls). http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40 http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper

  40. SNP Filtering & annotation (GATK VariantFiltration) • VariantFiltration is used annotate suspicious calls from VCF files based on their failing given filters. • It annotates the FILTER field of VCF files for records that fail any one of several filters: • Variants that lie in clusters, using the specified values to define a cluster (i.e. the number of variants and the window size). • Any variant which overlaps entries from a masking rod. • Any variant whose INFO field annotations match a specified expression (i.e. the expression is used to describe records which should be filtered out). ## set env variables $: GATK=/share/apps/gatk_20100930/Sting/dist/GenomeAnalysisTK.jar $: REF=/scratch/data/reference_genomes/human/human_g1k_v37.fasta $: ROD=/scratch/data/reference_genomes/human/dbsnp_131_b37.final.rod ## VariantFiltration & annotation $: java –Xmx5g -jar $GATK -T VariantFiltration -R $REF -D $ROD \ -o foo.gatk.VariantFiltration.snps.vcf \ -B variant,VCF,foo.gatk.raw.snps.vcf\ -B mask,Bed,indels.mask.bed--maskNameInDel \ --clusterSize 3 --clusterWindowSize 10 \ --filterExpression "DP <= 8" --filterName "DP8" \ --filterExpression "SB > -0.10" --filterName "StrandBias" \ --filterExpression "HRun > 8" --filterName "HRun8" \ --filterExpression "QD < 5.0" --filterName "QD5" \ --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" --filterName "hard2validate"; • More information on the parameters used can be found in: http://www.broadinstitute.org/gsa/wiki/index.php/VariantFiltrationWalkerhttp://www.broadinstitute.org/gsa/wiki/index.php/Using_JEXL_expressions

  41. SNP Filtering: VCFTOOLS • VCFTOOLS: methods for working with VCF files: filtering,validating, merging, comparing and calculate some basic population genetic statistics. • Example of some basic hard filtering: ## filter poor quality & suspicious SNP calls vcftools --vcffoo.gatk.VariantFiltration.snps.vcf\ --remove-filtered DP8 --remove-filtered StrandBias--remove-filtered LowQual \ --remove-filtered hard2validate --remove-filtered SnpCluster \ --keep-INFO AC --keep-INFO AF --keep-INFO AN --keep-INFO DB \ --keep-INFO DP --keep-INFO DS --keep-INFO Dels --keep-INFO HRun --keep-INFO HaplotypeScore --keep-INFO MQ --keep-INFO MQ0 --keep-INFO QD --keep-INFO SB --out foo.gatk.good.snps ; • this produces the file " foo.gatk.good.snps.recode.vcf"

  42. VCFTOOLS for SNP QC and statistics • VCFTOOLS can be used to generate useful QC measures, PLINK ped/map, Impute input and more.... ## QC & info $: for MYQC in missing freq2 counts2 depth site-depth site-mean-depth geno-depth het hardy singletons;do vcftools --vcffoo.gatk.good.snps.recode.vcf --$MYQC \ --out foo.gatk.good.snps.QC; done ## write genotypes, genotype qualities and genotype depth to separate files $: for MYFORMAT in GT GQ DP;do vcftools--vcffoo.gatk.good.snps.recode.vcf \ --extract-FORMAT-info $MYFORMAT --out foo.gatk.good.snps; done ## make PLINK ped and map files vcftools --vcffoo.gatk.good.snps.recode.vcf --plink --out foo.gatk.good.snps http://vcftools.sourceforge.net/

  43. Picard Tools : Post Alignment Summary reports • See : http://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_fennell.pdf

  44. Questions? • Email : stephen.newhouse@kcl.ac.uk • More useful links: • http://www.broadinstitute.org/gsa/wiki/index.php/Prerequisites • http://www.broadinstitute.org/gsa/wiki/index.php/Building_the_GATK • http://www.broadinstitute.org/gsa/wiki/index.php/Downloading_the_GATK • http://www.broadinstitute.org/gsa/wiki/index.php/Input_files_for_the_GATK • http://www.broadinstitute.org/gsa/wiki/index.php/Preparing_the_essential_GATK_input_files:_the_reference_genome • http://www.broadinstitute.org/gsa/wiki/index.php/The_DBSNP_rod

More Related