bioinformatic techniques tools for snp analysis n.
Skip this Video
Loading SlideShow in 5 Seconds..
Bioinformatic Techniques & Tools for SNP Analysis PowerPoint Presentation
Download Presentation
Bioinformatic Techniques & Tools for SNP Analysis

Loading in 2 Seconds...

play fullscreen
1 / 53

Bioinformatic Techniques & Tools for SNP Analysis - PowerPoint PPT Presentation

  • Uploaded on

Bioinformatic Techniques & Tools for SNP Analysis. Jill L Wegrzyn Department of Plant Sciences University of California at Davis, Davis, CA 95616 USA E-mail: SNP Analysis Agenda. Sequence-Based SNP Identification Common Bioinformatic Solutions

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Bioinformatic Techniques & Tools for SNP Analysis' - gwydion

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
bioinformatic techniques tools for snp analysis

Bioinformatic Techniques & Tools for SNP Analysis

Jill L Wegrzyn

Department of Plant Sciences University of California at Davis, Davis, CA 95616 USAE-mail:

snp analysis agenda

SNP Analysis Agenda

Sequence-Based SNP Identification

Common Bioinformatic Solutions

Phred, Phrap, Consed, Polyphred, and Polybayes

Re-Sequencing Project (ADEPT2)

High-Throughput SNP Identification Solution (PineSAP)











Finding SNPs: Sequence-based SNP Mining










Align to


Sequence Overlap - SNP Discovery




Comprehensive SNP Discovery: Resequencing

  • Overlapping PCR Amplicons across entire gene
  • Make no assumptions about sequence function
  • Sequence diversity and genetic structure for each gene is different
  • Proper association studies can only be designed in this context
  • Complete resequencing facilitates population genetics methods

Sequence-based SNP Identification




Amplify DNA


Contig assembly

Sequence each end

Quality determination


quality determination



of the fragment.


Polymorphism detection






Sequence viewing

Polymorphism tagging


Polymorphism reporting



Individual genotyping

Phylogenetic analysis

what is phred phrap consed
What is Phred/Phrap/Consed ?

Phred/Phrap/Consed is a DNA sequence analysis package from University of Washington:

  • Trace file (chromatograms) reading
  • Quality (confidence) assignment to each individual base
  • Vector and repeat sequences identification and masking

Sequence assembly

  • Assembly visualization and editing
  • Automatic finishing
phred phrap consed polyphred polybayes
Phred, Phrap, Consed, Polyphred, Polybayes
  • phred: Base calling and quality assignments
  • phrap: Contig formation and new quality assignments
  • consed: Visual X-Windows graphic interface, to view and edit alignments and contigs, and to view the original traces
  • polyphred: find polymorphisms in phrap contigs, quality calls, add data to phrap files to permit consed finding and visualization of polymorphisms.
  • polybayes: Fully probabilistic SNP detection algorithm that calculates the probability (SNP score) that discrepancies at a given location of a multiple alignment represent true sequence variations as opposed to sequencing errors.
files and program execution
Files and Program Execution
  • Input files: ABI chromat or SCF files
  • Order of Program execution:
    • phred, then phrap, then consed
    • or: phred, then phrap, then polyphred, polybayes, then consed
    • or: phred, then phrap, then consed, then polyphred, polybayes, then consed
what is phred

What is Phred?

Phred observes the base trace, makes base calls, and assigns qualityvalues (qv) of bases in the sequence.

Writes base calls and qv to output files for Phrap assembly.

Quality values (phred scores) range from 0 to 60. 20 and above is considered a confident base call (1 in 100 chance that is has been called wrong ~99% accuracy)

Useful for consensus sequence construction

ATGCATTC string1


ATGC-TTCATGC superstring

Here we have a mismatch ‘A’ and ‘G’, the qv will determine the dash in the superstring. The base with higher qv will replace the dash.

phred value formula
Phred value formula

q = - 10 x log10 (p)


q - quality value

p - estimated probability error for a base call


q = 20 means p = 10-2 (1 error in 100 bases)

q = 40 means p = 10-4 (1 error in 10,000 bases)

why phred
Why Phred?
  • Output sequence might contain errors.
    • Vector contamination
    • Dye-terminator reaction might not occur.
    • Weak or variable signal strength of peak corresponding to a base.
the structure of a phd file
The structure of a phd file

t 6 11908

a 6 11921

g 6 11927

t 6 11947

c 6 11953

a 6 11964

g 6 11981

c 4 11994

n 4 12015

c 4 12037

n 4 12044

n 4 12058

n 4 12071

n 4 12085

n 4 12098

n 4 12111

n 4 12124

c 4 12144

n 4 12151



t 16 8191

g 19 8200

t 13 8211

c 13 8229

g 4 8241

n 4 8253

c 4 8263

t 10 8276

t 9 8286

c 12 8301

t 16 8313

c 12 8329

c 12 8336

c 15 8343

t 19 8356

c 9 8371

g 13 8386

g 14 8397

a 7 8417

g 9 8427

g 4 8445





PHRED_VERSION: 0.990722.g



TIME: Thu May 24 00:18:58 2001




CHEM: term

DYE: big



t 8 5

c 13 17

a 19 26

c 19 32

t 24 2221

a 24 2232

a 22 2245

a 27 2261

g 25 2272

c 19 2286

c 12 2302

t 19 2314

g 12 2324

g 15 2331

g 19 2346

g 23 2363

t 33 2378

g 36 2390

c 44 2404

c 44 2419

t 39 2433

a 39 2446

a 34 2460

t 35 2470

g 34 2482

  • Phrap constructs the contig sequence as a mosaic of the highest quality parts of the reads rather than a statistically computed “consensus”.
  • Avoids the complex algorithm issues associated with multiple alignment methods
    • Speed and accuracy
    • Phrap is an assembler NOT an aligner
  • The sequence produced by Phrap is quite accurate
      • Less than 1 error per 10 kb in typical datasets.
  • Phrap considers sequence quality at a given position (determined by Phred)
running phredphrap
Running phredPhrap

Construct directory structure

    • directory for the run, containing subdirectories chromat_dir, edit_dir, phd_dir, poly_dir

Move tracefiles to directory chromat_dir

Run phredPhrap from directory edit_dir

  • Determine phredPhrap parameters

The greater the forcelevel, the more liberal the assembly

(generally produces fewer contigs).

however results in more manual editing reviewing polymorphisms later


Adjusting Parameters

phrap output files
Phrap output files
  • *.contigs– fasta file containing the contigs
    • Contigs with more than one read
    • Singletons (single reads with a match to some other contig but could not be merged consistently)
  • *.singlets–fasta file of the singlet reads
    • Reads with no match to other read
  • *.ace– for viewing the assembly w/ Consed
  • *.view – for viewing the assembly w/ Phrapview

Consed is a program for viewing and editing assemblies produced by Phrap

a. Assembly viewer - allows for visualization of contigs, assembly (aligned reads), quality values of reads and final sequence.

b. Trace file viewer – single and multiple trace files can be visualized allowing for comparison of a given sequence in several reads.

c. Navigation –identify and list regions which are below a given quality threshold, contain high quality discrepancies, single-strand coverage, etc.

phred phrap consed pipeline
Phred/Phrap/Consed Pipeline






Using PolyPhred to Visualize SNPs

  • Compares sequences across traces obtained from different individuals to identify sites for SNPs.
  • Will occasionally miscall genotypes - frequency of such mistakes depends on the sequencing chemistry used to generate the trace.
  • To reduce the number of miscalled sites, ignores regions of poor quality & ends
  • polyphred -ace gene_file.ace.1 -tag p -snp hom -indel -f 50 > gene_file.polyphred.out
    • where gene_file.ace.1 is the .ace file present in the edit_dir directory.
    • Note: this command will give polymorphisms of quality 'ranks' 1 through 3: 1 is the highest quality, 6 is the lowest quality
    • The qualifier -f 50 is used to list 50bp of flanking sequence on each side of the detected polymorphism
    • The qualifier -indel is used to identify insertions or deletions
    • The qualifier -snp hom is used to identify homozygous SNPs only
    • The qualifier -tag p is used to list the tagged polymorphisms in the polyphred output file *.polyphred.out.
    • It first reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly. It then reads the PHD and POLY files associated with each trace.
  • Reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly.
  • Reads the PHD and POLY files associated with each trace.
  • During the SNP search phase, PolyPhred combines information from all of the sequence traces to derive a genotype and a score for each sequence
  • The score indicates how well the trace at the site matches the expected pattern for a SNP.
  • Updates the ACE and PHD files by adding tags that mark the positions of the sites. The tagged sites can then be examined using Consed.
  • Bayesian statistical model takes into account:
  • - depth of coverge
  • - base quality values of the sequences
  • a priori expected rate of polymorphic sites in region
  • Polybayes calculations are aided with information on major/minor allele frequencies as well as polymorphism rates within the species under investigation
  • **Can also integrate into the poly files for viewing with Consed
allele discovery of economic pine traits
Allele Discovery of Economic Pine Traits
  • Goals
    • Re-sequence ~8,000 amplicons in Loblolly Pine
    • SNP Discovery
    • Genotyping of 3,000 trees using Illumina technology
    • Phenotyping of same 3,000 trees
    • Association studies
resequencing project pipeline



Resequencing Project Pipeline

EST Databases


Diversity Panel

DNA Extraction


Agencourt Sequencing

18 individuals & 5 other







Illumina Genotyping

SNP Identication


Drought, Dieases, & Wood

Population DNA Extraction




primer selection and validation

Report to TreeGenes - several primers per amplicon will be stored in TreeGenes

Successes - select 1 primer set per amplicon

Primer synthesis at Illumina

Report to TreeGenes

Sequencing on validation DNA


BLASTing of validation DNA sequence

back to EST database

Report to TreeGenes


Sequencing of diversity panel - 1 primer set per amplicon which has been successful at each step

Report to TreeGenes

Primer Selection and Validation

Primer Selection

Using OSP primer picking

alignment and snp calling pipeline challenges in high throughput snp identification
Alignment and SNP Calling PipelineChallenges in High-Throughput SNP Identification
  • Alignment

Critical in the automation of base calls

    • Commonly used Phrap (from PhredPhrap) is an assembler and is NOT ideal for alignments
    • Many commonly used aligners work best with protein sequences or with a reference sequence
    • Preservation of quality scores for input into SNP identification programs
    • Speed for high-throughput programs
  • Automated SNP Calls
    • Reference Sequence Required
    • Traditional approaches without reference sequence include “eSNPs” (human, maize, and pine)

-Very little redundancy outside of abundant genes

-Overall high number of false positives (single pass reads)

    • Not specific to frequencies observed in different organisms
    • High number of false positives in currently accepted methods (Polybayes & Polyphred)

Re-Sequencing data

from Agencourt:

Initial Processing

Base Calling

Sequence Alignment

SNP Identification

Machine Learning

Data Storage &



Base Calling and Sequence Alignment

Modified PhredPhrap

allows for trimming of bases from

start and end of sequence based

on trace quality


Converts native PhredPhrap output (ace file) into an unaligned FASTA file


Optimal DNA sequence alignment program


Provides single multifasta file with all reads aligned to the contig from PhredPhrap AND the contigs alignment to the other contigs from probconsRNA


Converts resulting FASTA file back into ace file for SNP Identification


SNP Identification and Machine Learning

Algorithms applied Independently for SNP Identification:

Polyphred and Polybayes both utilize

alignment and base quality information to identify potential polymorphisms.

Both programs yield a high # of false positives

Machine Learning

Utilizes sequence based information and compares actual base calls to proposed calls to develop SNP identification filters

specific to pine

Parameters under Consideration: Local and global quality scores, frequencies of major and minor alleles, local and global SNP frequences, relative positions in sequence to existing variation, and polyphred and polybayes scores

snp identification overview
SNP Identification Overview
  • Examine features to improve the accuracy of SNP location prediction
  • Utilize machine learning to apply the features
  • Refine the accuracy of the learning algorithm through adjustments to feature representation (iterative process)
  • Utilize the final classifier against the large re-sequenced set to improve accuracy of SNP calls originating from Polybayes and Polyphred
feature representation
Feature Representation
  • Sequence depth
    • Count of number of sequences in the alignment at the position of variation.
    • All sequences in the alignment may not overlap at the position of variation;
      • number is different from the total number of the sequences in the alignment
  • Variation type
    • Variation type can be SNP or INDEL
  • PolyBayes score
    • PolyBayes program assigns a Bayesian posterior probability value for each called SNP using the frequency priors given for observing a variation at that position
  • Polyphred score
    • Polyphred assigns a score calculated primarily from sequence depth and quality score.
feature representation1
Feature Representation
  • Base frequencies
    • The number of occurrences of different bases at the position of variation is important in determining a polymorphic position.
      • Frequencies of the first (major allele) and the second (minor allele) are represented as ratio to sequence depth.
  • Relative distance
    • Sequence quality at the ends of the alignment tends to be poor due to inherent limitations of current sequencing technology.
      • SNP position was represented as the ratio of the distance in the consensus sequence from the closest end, or the relative distance
feature representation2
Feature Representation
  • Sequence quality
    • Variation is observed because of a poor quality base.
    • Based on the base frequencies calculated:
      • maximum qualities of the major and minor alleles
      • average qualities of major and minor alleles
  • Alignment quality
    • Misalignment of bases caused by sequence alignment programs sometimes result in an erroneous SNP call.
    • In the neighborhood of the SNP (+/- 10 bases) all the mismatches with the consensus sequence are given a penalty and the penalty is more if the mismatch is continuous

Goal: Prediction

Software: Decision tree C4.5 program is open-source C code (WEKA)

  • At each point in the construction of the decision tree, C4.5 selects the feature to test based on maximum information gain.
  • The goal is to generate a minimum size tree that correctly classifies all the elements in the training set.
  • The size of the tree is the number of nodes (decision nodes) and the number of leaves are the classes (categories) that they are distributed to.
alignment and snp identification
Alignment and SNP Identification

SNP Identification Datasets

  • Training set for loblolly pine was composed of a total of 300 validated sequences.
    • Divided to represent the relative percentages of sequence source
    • Testing set is composed of 120 validated sequence sets
  • Training set for poplar was composed of 42 validated sequences selected at random
    • Testing set is composed of a total of 30 validated sequence sets.
  • Validation = manually observed FP, FN, TP, and TN SNP calls through observation of tracefiles in Consed.
alignment and snp identification evaluation criteria
Alignment and SNP IdentificationEvaluation Criteria
  • Accuracy = (TP + TN)/total
  • Sensitivity = TP/(TP + FN)
  • Specificity = TN/(FP + TN)
resulting identifications
Resulting Identifications
  • SNPs have been called using PineSAP on 7424 amplicons representing 6924 Unigenes.
  • Average amplicon length is 445 bp
  • Average of 5.5 SNPs/amplicons
  • Pine is highly polymorphic!
  • Total SNPS ~41,500
  • Distribution of SNPs/amplicon
    • 1404 - 0 SNPs
    • 1133 - 1 SNP
    • 4887 - 2 or more SNPs
alignment and snp identification snp formatting
Alignment and SNP IdentificationSNP Formatting

Custom lists can be exported into Illumina style input

**option for adding IUPAC codes for SNPs in flanking sequence


Data Storage and Release

SNP Identifcations



SNP Discovery: dbSNP database


NCBI SNP database


Unique mapping

to a genome location

(reference SNP = rs#)

SNPs submitted

by research communtiy

(submitted SNPs = ss#)

(by 2hit-2allele)

SNP data submitted to dbSNP

dbSNP processing of SNPs

  • PineSAP improves
    • Inaccuracies introduced by using Phrap to align sequences
    • Time which would be required by using a aligner such as ProbconsRNA or ClustalW on its own
  • PineSAP has a 98% success rate when used to align loblolly resequencing data.
  • PineSAP identified a success list of features to enhance polymorphism predictions and generated a prediction accuracy of 93%
  • PineSAP provided a full alignment and polymorphism detection system that can be adapted to specific genomes
more information
More Information…