1 / 55

BF528 - Sequence Analysis 1 - WGS/WES

BF528 - Sequence Analysis 1 - WGS/WES. Lecture 6 02/07/2018. Whole Genome Sequencing. W hole G enome S equencing (WGS) WGS covers 95% to 98% of the genome. Depth at least 30X to be able to detect heterozygous SNPs. Analysis is harder, more data and more complex regions.

elijahm
Download Presentation

BF528 - Sequence Analysis 1 - WGS/WES

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. BF528 - Sequence Analysis 1 - WGS/WES Lecture 6 02/07/2018

  2. Whole Genome Sequencing Whole Genome Sequencing (WGS) • WGS covers 95% to 98% of the genome. • Depth at least 30X to be able to detect heterozygous SNPs. • Analysis is harder, more data and more complex regions

  3. Whole Exome Sequencing Whole Exome Sequencing (WES) • Exons consist of about 1-2% of the genome that code proteins. • Much higher coverage (~100X) • Sequencing known exons • Variants have high impact

  4. Alignment • NGS produces short reads (30-250bp) • We need to identify where in the genome they came from. • We do this based on similarity, observing that the probability of 2 regions of the genome of a given length being exactly the same is very low. • This is a guided analysis: we need a reference to align to. Used to find variants and haplotypes.

  5. The reference • The reference is a haploid representation of the consensus of multiple individuals. • Human genome: • GRCh37 hg19 • GRCh38 hg38 • Mouse: • GRCm38 mm10 • NCBI37 mm9 • It is in fasta format (.fa or .fasta)

  6. Human genome Genome Reference Consortium human build 38 patch 12 (latest) Assembly GRCh38.p12 from NCBI

  7. Human genome Genome Reference Consortium human build 37 patch 13 Assembly GRCh38.p12 from NCBI

  8. The reference To load samtools use: For the first time you need to index the reference (makes a fai file): samtools faidx module load samtools samtools faidx ref.fa samtools faidx ref.fa ch1:10000-10100

  9. Make a small reference Go to: ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/ Download chromosome 22. wget ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz gunzip chr22.fa.gz samtools faidx chr22.fa samtools faidx chr22.fa chr22:29268316-29300343 | less

  10. Let’s get some reads On SRA look for the reads you want, for example WES of NA12878 with access ID SRR1919605 (took an hour) : Note: always use qsub. Note: you might get errors regarding space. You might need to change it’s cache from your home root. echo '/repository/user/main/public/root = "/projectnb/bf528/..."' > $HOME/.ncbi/user-settings.mkfg module load sratoolkit fastq-dump --split-files --gzip SRR1919605

  11. Don’tdownload now, just make a simlink Location: /projectnb/bf528/06/sample_reads/SRR1919605_first400K_1.fastq /projectnb/bf528/06/sample_reads/SRR1919605_first400K_2.fastq

  12. Quality check with FastQC First thing to do is to check the quality of the reads using FastQC. Do not run this. It will take half 12min. The results are at: /projectnb/bf528/06/fastqc module load fastqc fastqc --help fastqc SRR1919605_1.fastq.gz SRR1919605_2.fastq.gz

  13. Quality check with FastQC Started analysis of SRR1919605_1.fastq.gz Approx 5% complete for SRR1919605_1.fastq.gz Approx 10% complete for SRR1919605_1.fastq.gz Approx 15% complete for SRR1919605_1.fastq.gz Approx 20% complete for SRR1919605_1.fastq.gz Approx 25% complete for SRR1919605_1.fastq.gz Approx 30% complete for SRR1919605_1.fastq.gz Approx 35% complete for SRR1919605_1.fastq.gz Approx 40% complete for SRR1919605_1.fastq.gz Approx 45% complete for SRR1919605_1.fastq.gz Approx 50% complete for SRR1919605_1.fastq.gz Approx 55% complete for SRR1919605_1.fastq.gz Approx 60% complete for SRR1919605_1.fastq.gz Approx 65% complete for SRR1919605_1.fastq.gz Approx 70% complete for SRR1919605_1.fastq.gz Approx 75% complete for SRR1919605_1.fastq.gz Approx 80% complete for SRR1919605_1.fastq.gz Approx 85% complete for SRR1919605_1.fastq.gz Approx 90% complete for SRR1919605_1.fastq.gz Approx 95% complete for SRR1919605_1.fastq.gz real 5m47.401s user 5m42.057s sys 0m4.673s

  14. Quality check with FastQC Can take fastq or bam as input. Output is in html format. Better to run on BAM file.

  15. Quality check with FastQC

  16. Quality check with FastQC

  17. Quality check with FastQC

  18. Alignment of short reads There are various tools to align reads: • First hit aligners: • BWA MEM→ the most common • Bowtie2 • SNAP → forces reads to align close to each other • All possible hits aligners: • MrFAST • MrsFAST

  19. Aligner comparison 50 bp reads with SNP (substitutes) Correctly mapped reads Incorrect mapped reads Mark Zeiman, DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads, November 04, 2014

  20. Aligner comparison 50 bp reads with inserts Correctly mapped reads incorrect mapped reads Mark Zeiman, DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads, November 04, 2014

  21. Aligner comparison 50 bp reads with deletion Correctly mapped reads incorrect mapped reads Mark Zeiman, DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads, November 04, 2014

  22. Aligner comparison 100 bp reads with SNP (substitutes) Correctly mapped reads incorrect mapped reads Mark Zeiman, DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads, November 04, 2014

  23. Aligner comparison 100 bp reads with inserts Correctly mapped reads incorrect mapped reads Mark Zeiman, DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads, November 04, 2014

  24. Aligner comparison 100 bp reads with deletion Correctly mapped reads incorrect mapped reads Mark Zeiman, DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads, November 04, 2014

  25. Alignment of short reads • Each aligner requires the reference to be prepared (indexed) in a certain way. • We will do an example with BWA-MEM • http://bio-bwa.sourceforge.net/bwa.shtml Let’s prepare reference indexes for BWA. module load bwa bwa index chr22.fa

  26. BWA-MEM • Align the reads with the gapless bwa-mem algorithm: Output is a SAM file. SAM format is used to save alignments. bwa mem -R "@RG\tID:test\tSM:NA12878" chr22.fa SRR1919605_1.fq.gz SRR1919605_2.fq.gz > SRR1919605.sam

  27. BWA-MEM • Align the reads with the gapless bwa-mem algorithm: Output is a SAM file. SAM format is used to save alignments. bwa mem-R "@RG\tID:test\tSM:NA12878" chr22.fa SRR1919605_1.fq.gz SRR1919605_2.fq.gz > SRR1919605.sam Read group

  28. SAM format https://samtools.github.io/hts-specs/SAMv1.pdf

  29. SAM format • @ header line • Each line has at least 11 fields • 1 line per each read

  30. SAM format The second field is a flag: You can extract flags using -f and -F in samtools.

  31. SAM format header lines start with @ @HD VN:1.5 GO:none SO:coordinate @SQ SN:1 LN:248956422 @SQ SN:10 LN:133797422 @SQ SN:11 LN:135086622 @SQ SN:12 LN:133275309 @SQ SN:13 LN:114364328 @SQ SN:14 LN:107043718 . . . @RG ID:50 LB:SRR1514950 SM:CHM1 @PG ID:MarkDuplicates VN:2.8.0-SNAPSHOT CL:picard.sam.markduplicates.MarkDuplicates INPUT=[SRR1514950.sorted.bam] OUTPUT=SRR1514950.rmdup.bam METRICS_FILE=marked_dup_metrics50.txt REMOVE_DUPLICATES=true TMP_DIR=[/projectnb/vntrseek/CHM1/illumina/temp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json PN:MarkDuplicates @PG ID:bwa PN:bwa VN:0.7.12-r1039 CL:/share/pkg/bwa/0.7.12/install/bin/bwa.real mem -t 16 /projectnb/vntrseek/share/GRCh38/GRCh38.fa fastq_SRR1514950_1.tar.gz fastq_SRR1514950_2.tar.gz @PG ID:GATK IndelRealigner VN:3.7-0-gcfedb67 CL:knownAlleles=[] targetIntervals=50.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null @PG ID:MarkDuplicates-24B543D9 VN:2.8.0-SNAPSHOT CL:picard.sam.markduplicates.MarkDuplicates INPUT=[SRR1514952.sorted.bam] OUTPUT=SRR1514952.rmdup.bam METRICS_FILE=marked_dup_metrics52.txt REMOVE_DUPLICATES=true TMP_DIR=[/projectnb/vntrseek/CHM1/illumina/temp] MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.jsonPN:MarkDuplicates @PG ID:bwa-136CB686 PN:bwa VN:0.7.12-r1039 CL:/share/pkg/bwa/0.7.12/install/bin/bwa.real mem -t 16 /projectnb/vntrseek/share/GRCh38/GRCh38.fa fastq_SRR1514952_1.tar.gz fastq_SRR1514952_2.tar.gz @PG ID:GATK IndelRealigner-CC92922 VN:3.7-0-gcfedb67 CL:knownAlleles=[] targetIntervals=52.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null version The reference headers and the line each starts at read groups Programs you ran

  32. SAM format • Each line has at least 11 fields • 1 line per each alignment, some aligners produce multi-alignments SRR1514952.11241320147 11000027 41S60M =10034 -26GAACCCTAGCCCTACCCCAACCCCGAACCCTACCCCGAACCATAACCCTAACCCTAACCCTAACCCTATCCCTAACCCTAGCCCTAACCCTAACCCTAACC #############################################################################A2A+;H;F;FB?A+2+:??DDB88 XA:Z:X,+156030660,76M25S,5;Y,+57217180,76M25S,5;22,+50808410,60M41S,2;6,-147845,18S23M1D38M1D22M,7; MC:Z:75M1I25M MD:Z:27A11A20 NM:i:2 MQ:i:27 AS:i:50 XS:i:51 RG:Z:52 PG:Z:MarkDuplicates-24B543D9

  33. samtools • BAM is a compressed sam. • CRAM is a better compressed SAM. • You can read all with samtools module load samtools samtools view SRR1919605.sam # unmapped ones samtools view -f 4 SRR1919605.sam # pair unmapped samtools view -f 8 SRR1919605.sam # only mapped ones samtools view -F 4 SRR1919605.sam # second reads only samtools view -f 128 SRR1919605.sam # quality at least 30 samtools view -q 30 SRR1919605.sam

  34. Sorting and indexing Always sort and compress your SAM file. Sorting makes your file smaller and gives you instant access to alignments. Note: You could use the -@ parameter to run the sorting in parallel, if so, don’t forget to allocate multiple processes in your qsub. samtools sort -o SRR1919605_sorted.bam -O bam SRR1919605.sam # don’t forget to index your sorted bam samtool indexSRR1919605.bam

  35. Sorting and indexing You can extract reads by location on a sorted indexed SAM/BAM/CRAM file with samtools. samtools sort -n sorts reads by name, no tools have been implemented to extract alignments by read name. Few applications require sorted sam in read name order such as: bedtools bamtofastq. samtools viewSRR1919605_sorted.bam chr22:29268316-29300343

  36. Do it all in one PROC=8 bwa mem -R "@RG\tID:test\tSM:NA12878" chr22.fa SRR1919605_1.fastq SRR1919605_2.fastq | samtools sort -o SRR1919605_sorted.bam -O bam # don’t forget to index it samtools indexalignment.bam Check the header of the new bam file, see your command? samtools view -H SRR1919605_rmdup.bam

  37. Duplicated reads • PCR duplicates • Sequencing duplicates (mirror) "A duplicate could be PCR effect or reading same fragment twice, there is no way to tell." HAVE TO DO FOR WGS AND WES Debate for RNA-seq (Parekh, Swati, et al. "The impact of amplification on differential expression analyses by RNA-seq." Scientific reports 6 (2016): 25533.)

  38. Quality check with FastQC

  39. Removing duplicates - samtools samtools rmdupSRR1919605_sorted.bam SRR1919605_rmdup.bam # don’t forget to index it samtools index alignment.bam

  40. Removing duplicates - picard You could use GATK and picard-tools too (from the broad), is better, takes longer. module load java/1.8.0_92 picard picard MarkDuplicates \ I=SRR1919605_sorted.bam \ O=SRR1919605_markdup.bam \ REMOVE_DUPLICATES=TRUE \ M=marked_dup_metrics.txt # don’t forget to index it samtools index alignment.bam

  41. Removing duplicates - comparison Count your reads: samtools can only remove duplicates across one chromosome, picard can do over different chromosomes. picard is more stringent.

  42. Realigning around indels

  43. Realigning around indels Preprocessing: BWA and Bowtie2 are gapless aligners. You will need to make a dictionary file for your reference. You can use samtools or picard-tools: module load java/1.8.0_92 gatk/3.7 samtools dict chr22.fa > chr22.dict picard CreateSequenceDictionary REFERENCE=chr22.fa OUTPUT=chr22.dict

  44. Realigning around indels First find all the bad mapping intervals and indels: gatk -T RealignerTargetCreator \ -R chr22.fa \ -I SRR1919605_markdup.bam -o SRR1919605.intervals \# if you have your own targets # -known indels.vcf \

  45. Realigning around indels Using the indels you found, try to realign around them allowing gaps: gatk -T IndelRealigner \ -R reference.fasta \ -I SRR1919605.markdup.bam \# if you have your own targets # -known indels.vcf \ -targetIntervals SRR1919605.intervals -o SRR1919605_markdup_realigned.bam

  46. Realigning around indels GATK version 4 does not use the same algorithm, has a LeftAlignIndels. I have not tested it, and didn’t find blogs on it. For more information on realigning around indels see: https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr12-3-IndelRealignment.pdf

  47. Realigning around indels GATK version 4 does not use the same algorithm, has a LeftAlignIndels. I have not tested it, and didn’t find blogs on it. For more information on realigning around indels see: https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr12-3-IndelRealignment.pdf Always read blogs, review papers seem to favor their own tool or collaborators’

  48. Realigning around indels

  49. Check your bam file Check the header of the new bam file, see your commands? Now we are ready to do analysis on this bam file. samtools view -H SRR1919605_rmdup.bam

  50. The overall pipeline input: FastQ Output: indexed sorted bam or cram • FastQC, trimming • Align, don’t forget read groups • Sort and index the sam file • Marking/remove duplicates • Realignment around indels (or STRs…) ------------------------------------------------------------------------------------- Insert: bam Output: vcf 6. Call variants 7. Filter the variants

More Related