Sequence Similarity & Alignment with BLAST

June 12, 2014


Outline

  • How sequences are aligned

  • How alignments are scored

  • The different BLAST algorithms

  • Using NCBI BLAST programs


Sequence alignment

  • Determine if & how two sequences are related

  • Sequence assembly

  • Sequence annotation

  • Identify shared protein domains or motifs

  • Analysis of genomes

  • Phylogeny and evolution


Definitions

  • Homologous – share a common ancestor

    • Cannot be measured

    • Measure similarity; infer homology

  • Orthologs: separated by speciation

  • Paralogs: separated by duplication


Defining orthologs

  • Alternative splicing is estimated to occur in ~90% of human genes

  • Contributes an order of magnitude to the transcriptional complexity

    • Caspase 9 gene

      • Longest transcript is pro-apoptotic

      • Shorter transcript lacks 4 exons and a functional protease domain and is anti-apoptotic

  • Mouse – human transcript comparison

    • ~25% of the human transcripts (13% of RefSeq genes) have no splicing ortholog in the mouse genome


Pairwise alignments

  • How are two sequences related to each other?

  • Are there gaps in one versus the other?

  • What is the percent similarity?

  • How do I determine the significance?


Pairwise alignment

First string = a b c d e

Second string = a c d d e f

Two alignments:

(1)  a b c d - e -
     a - c d d e f

(2)  a b c - d e -
     a - c d d e f

How do we define the criteria so that an algorithm will choose the best alignment?
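One way to make this question concrete is to give every column of an alignment a score and add the columns up. The sketch below scores the two alignments shown above. The values (+1 match, -1 mismatch, -1 per gap position) and the function name `score_alignment` are illustrative assumptions, not values taken from the presentation.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-1):
    """Sum column scores for two equal-length aligned strings ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # one of the two strings has a gap here
        elif a == b:
            total += match        # identical symbols
        else:
            total += mismatch     # aligned but different symbols
    return total

# The two candidate alignments from the slide, written without spaces.
alignment_1 = ("abcd-e-", "a-cddef")
alignment_2 = ("abc-de-", "a-cddef")
print(score_alignment(*alignment_1))
print(score_alignment(*alignment_2))
```

Under these assumed values both alignments contain four matches and three gap positions, so they receive the same score.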


Defining the algorithm

  • Are sequences DNA or protein

  • What is the expected level of similarity

  • Scoring method that reflects degree of similarity

  • Allow for gaps (insertions and deletions)

  • Statistical measure of the probability that alignment occurred by chance


% identity and similarity

  • % identity: percentage of aligned residues that are identical

  • % similarity: percentage of aligned residues that have similar chemical/physical properties

    • Amino acid alignments only
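As a minimal sketch of the % identity definition, the function below counts identical columns in an already aligned pair of strings. It assumes that the denominator is the full alignment length, gap columns included; the function name `percent_identity` is hypothetical. Percent similarity would additionally need a table of which amino acid pairs count as similar, which is not shown here.

```python
def percent_identity(top, bottom):
    """Percentage of alignment columns in which both residues are identical."""
    if len(top) != len(bottom) or not top:
        raise ValueError("aligned strings must be non-empty and of equal length")
    # A column counts as identical only if it holds the same residue, not two gaps.
    identical = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    return 100.0 * identical / len(top)

print(round(percent_identity("abcd-e-", "a-cddef"), 1))
```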


Scoring schemes

  • Method of scoring matches, mismatches & gaps that is biologically relevant

  • Nucleotide alignments:

    • Identity only, with positive score for matches & negative score for mismatches

    • Score transitions (A↔G, T↔C) & transversions (purine↔pyrimidine) differently

      • Transitions more common and more likely to be silent
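A nucleotide scoring scheme of this kind could be sketched as follows. The numeric scores (+1 match, -1 transition, -2 transversion) and the function name `nucleotide_score` are assumptions chosen for illustration; the presentation does not give specific values.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def nucleotide_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one aligned nucleotide pair, penalising transversions more than transitions."""
    a, b = a.upper(), b.upper()
    if a == b:
        return match
    # A transition stays within one chemical class (purine<->purine or pyrimidine<->pyrimidine).
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion

print(nucleotide_score("A", "G"))  # transition
print(nucleotide_score("A", "C"))  # transversion
```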


Amino acid substitution matrices

  • Method to score matches and mismatches

  • Based on observed frequencies of amino acid distributions and substitutions

  • Must model conservative nature of substitutions

  • Implicitly represent evolutionary patterns

  • Scores are based in Information Theory


Scoring amino acid substitutions

  • Amino acids are NOT distributed evenly

  • Amino acids share similarity based on chemical and physical properties

  • Not all substitutions are equally likely due to physical/chemical constraints

    • e.g. L -> I is much more conservative than L -> Y


Entropy

The information H associated with some probability p is the base 2 logarithm of the inverse of p:

$$H = \log_2 \frac{1}{p} = -\log_2 p$$

Values converted to base 2 logarithms are given the unit bits.

Information is described as a message of symbols. If there are n symbols and all n have an equal probability, then the probability of any symbol appearing is 1/n.


Information Theory

If all symbols are NOT equally probable, then the entropy (H) is the negative sum over all n symbols of the probability of a symbol ($p_i$) multiplied by the base 2 logarithm of that probability ($\log_2 p_i$):

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$

The entropy of a normal coin is therefore:

-( (0.5)(-1) + (0.5)(-1) ) = 1 bit

The entropy of a trick coin where heads comes up ¾ of the time is:

-( (0.75)(-0.415) + (0.25)(-2) ) = 0.81 bits

The entropy of random DNA is:

-( (0.25)(-2) + (0.25)(-2) + (0.25)(-2) + (0.25)(-2) ) = 2 bits
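The three worked values above can be checked with a few lines of code. This is a minimal sketch; the function name `entropy` is an assumption, and symbols with zero probability are skipped because they contribute nothing to the sum.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over symbols with p > 0."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))                # normal coin
print(round(entropy([0.75, 0.25]), 2))    # trick coin
print(entropy([0.25, 0.25, 0.25, 0.25]))  # random DNA
```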


Scoring matrices

$$S_{ij} = \log \frac{q_{ij}}{p_i \, p_j}$$

$S_{ij}$ = score for the pairing of amino acids i and j in the alignment

$q_{ij}$ is the observed pairing frequency of amino acids i and j.

$p_i$ and $p_j$ are the expected frequencies for amino acids i and j.

Commonly observed substitutions: S > 0

Rarely observed substitutions: S < 0

Observed and random frequency same: S = 0
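The three cases (S > 0, S < 0, S = 0) follow directly if the score is taken to be a log-odds ratio, log2( q_ij / (p_i × p_j) ). The sketch below assumes that form with a base 2 logarithm; the frequencies are made-up numbers for illustration, and `log_odds_score` is a hypothetical name. Published matrices such as BLOSUM62 additionally scale and round these values to integers.

```python
import math

def log_odds_score(q_ij, p_i, p_j):
    """Log-odds score in bits: observed pair frequency versus chance expectation."""
    if min(q_ij, p_i, p_j) <= 0:
        raise ValueError("frequencies must be positive")
    return math.log2(q_ij / (p_i * p_j))

# Hypothetical background frequencies: by chance the pair would occur 0.1 * 0.1 = 0.01 of the time.
p_i = p_j = 0.1
print(log_odds_score(0.02, p_i, p_j))   # seen twice as often as chance: positive
print(log_odds_score(0.005, p_i, p_j))  # seen half as often as chance: negative
print(log_odds_score(0.01, p_i, p_j))   # seen as often as chance: about zero
```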


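BLOSUM62 Matrix

  • BLOcks SUbstitution Matrices (BLOSUM) are based on alignments of conserved protein blocks

  • The number is the percent identity threshold used to build the matrix: for BLOSUM62, sequences at least 62% identical are clustered before substitutions are counted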


Amino acid chemical relationships


BLOSUM62 Matrix

  • Large positive scores: rare amino acids

  • Large negative scores: unlikely substitutions

  • Scores near zero: no penalty for the substitution


BLOSUM90

Scores are more extreme than in BLOSUM62: positive scores are more positive and negative scores are more negative

Based on blocks of aligned protein sequences that are at least 90% identical to another sequence in the block


Choosing a matrix


Gaps

  • Insertions can lead to gaps of varying lengths

  • Use 2 gap penalties:

    • higher penalty for opening a gap

    • lower penalty for extension of a gap
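The two-penalty (affine) scheme can be sketched as a single function. The convention assumed here charges the opening penalty once and the extension penalty for every gapped position; some tools charge extension only from the second position onward. The values 11 and 1 and the name `gap_penalty` are illustrative assumptions.

```python
def gap_penalty(length, gap_open=11, gap_extend=1):
    """Affine penalty for a single gap spanning `length` positions."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * length

# One gap of three positions is cheaper than two separate gaps covering three positions.
print(gap_penalty(3))
print(gap_penalty(1) + gap_penalty(2))
```

Because opening costs more than extending, one long gap is favoured over several short ones, which matches the idea that a single insertion or deletion event can span many residues.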


BLAST

  • Build a list of words from query sequence (word length 3 for proteins, 11 for DNA)

  • Score each candidate matching word against the query word using the scoring matrix and discard all that score below the threshold

    • Generally about 50 matching words are kept per query word

    • T value is threshold; determines sensitivity and speed of search

Build word list from query sequence

Find hits in database sequence

Extend the hits to form HSPs

Calculate statistical significance of matches


Query sequence: PSATPVLICWAAG

Word list: PSA, ATP, VLI, CWA

Threshold score (T): 11

Matches to PSA    Score
PSA               15
PST               9
PDA               11
WSA               4
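The word-list and threshold steps can be sketched in a few lines. The sketch lists every overlapping 3-letter word in the query (the four words shown above are among them) and scores three of the candidate matches to PSA. The substitution scores are BLOSUM62 entries for just the residue pairs needed here, included as assumptions that agree with the scores listed above; keeping words that score at least T follows the "discard all below threshold" rule. The names `words`, `word_score` and `pair_score` are hypothetical.

```python
# Only the substitution scores needed for this example; a real search uses the full matrix.
SCORES = {
    ("P", "P"): 7, ("S", "S"): 4, ("A", "A"): 4,
    ("D", "S"): 0, ("P", "W"): -4,
}

def pair_score(a, b):
    """Look up a symmetric substitution score for residues a and b."""
    key = (a, b) if (a, b) in SCORES else (b, a)
    return SCORES[key]

def words(sequence, size=3):
    """All overlapping words of the given size, in order of position."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]

def word_score(query_word, candidate):
    """Sum of per-position substitution scores between two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))

query = "PSATPVLICWAAG"
threshold = 11
print(words(query))
for candidate in ("PSA", "PDA", "WSA"):
    score = word_score("PSA", candidate)
    print(candidate, score, "keep" if score >= threshold else "discard")
```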


BLAST

  • Find match for each word in database

    • Database is indexed so all possible words in all sequences are known

    • This search is very fast (500K words/sec)

  • Matches scoring at least the threshold (T) are used as seeds for alignments


BLAST

  • Extend alignment from each word in both directions so long as score increases

  • These alignments are the high-scoring segment pairs (HSPs)

  • Keep HSPs if score is above a given threshold


Extending the hit

(1) Score of new alignment = Score of previous alignment (A) + Score of new aligned pair

(2) Previous alignment P S A / P S A (15) + new aligned pair C / C (9) = new alignment P S A C / P S A C (24)

(3) Score of alignment (C) = Score of previous alignment (B) + Score of new aligned pair

    Previous alignment P S A C / P S A C (24) + new aligned pair Y / W (2) = new alignment P S A C Y / P S A C W (26)

Repeat adding aligned pairs until score goes down or reach end of sequence.
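The extension rule can be sketched as a loop that keeps adding aligned pairs while the running score does not go down. This is a simplified, one-direction illustration of the rule as stated, not BLAST's actual implementation. The pair scores reproduce the values in the worked example; the trailing L/D pair (scored -4) is a hypothetical addition so that the loop has a reason to stop, and `extend_right` is an assumed name.

```python
# Substitution scores for the residue pairs used in this example only.
SCORES = {
    ("P", "P"): 7, ("S", "S"): 4, ("A", "A"): 4,
    ("C", "C"): 9, ("W", "Y"): 2, ("D", "L"): -4,
}

def pair_score(a, b):
    """Look up a symmetric substitution score for residues a and b."""
    key = (a, b) if (a, b) in SCORES else (b, a)
    return SCORES[key]

def extend_right(query, subject, seed_length):
    """Extend an ungapped alignment rightwards from a seed that starts at position 0.

    Returns (alignment_length, score). Stops at the first pair that lowers the
    score, or at the end of the shorter sequence.
    """
    score = sum(pair_score(a, b)
                for a, b in zip(query[:seed_length], subject[:seed_length]))
    end = seed_length
    while end < min(len(query), len(subject)):
        new_score = score + pair_score(query[end], subject[end])
        if new_score < score:      # score went down: stop extending
            break
        score, end = new_score, end + 1
    return end, score

print(extend_right("PSACYL", "PSACWD", seed_length=3))
```

Starting from the seed PSA (15), the loop adds C/C (24) and Y/W (26), then stops because the hypothetical L/D pair would lower the score.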


BLAST

  • Highest scoring HSPs extended in both directions as long as score > threshold

  • Do NOT usually get an alignment over the ENTIRE length of the sequence

  • Combine HSPs into a gapped alignment

Example of the statistics reported for an alignment:

Score = 272 bits

Expect = 2e-73

Identities = 135/310 (43%)

Positives = 200/310 (64%)


Significance of alignment

P = probability that the observed match could have happened by chance

E = Expect value: number of matches as good as the observed one that would be expected to appear by chance in a database of the size probed


Significance of alignments

  • P: values between 0 and 1

  • E = P x size of the database

  • E values range from 0 to the size of the database


What is a significant alignment?

  • Identify a true ortholog between species

    • In a protein-protein alignment, E-values < 1e-25

      • Are all the domains present in both?

      • Does the number of exons match?

      • Are the splice boundaries the same?

  • Similar function

    • Used 1e-6 between C. neoformans and S. cerevisiae

  • Annotation (transfer annotation between species)

    • E-values < 1e-25 fairly standard


Caveats

  • Repetitive sequence

  • Regions of low complexity

  • Repeated motifs

  • Unusually high number of low-abundance amino acids (e.g. cysteines)


NCBI Blast homepage


Nucleotide BLAST

  • Megablast

    • long alignments between very similar sequences

    • FASTEST

    • Can set percent identity for cut-off of the alignment

  • discontinuous megablast

    • finds sequences that are similar, but not identical, to the query

    • more sensitive than megablast

  • blastn

    • most sensitive; uses a shorter word size

    • SLOWEST

    • E value cutoff automatically adjusted for short query sequences


Protein BLAST

  • BLASTP

    • protein query; protein DB

    • Finds local regions of similarity

    • Can define the scoring matrix

    • Automatically adjusts parameters for peptide searches

  • PSI-BLAST

    • very sensitive protein-protein searches

    • Uses PSSM and can find distant homologs

  • PHI-BLAST

    • restricted protein pattern search

    • Search with a query + pattern

    • Returns a match IF the pattern is matched


Types of questions

  • What is the taxonomic distribution of my protein?

    • BLASTP against NR restricted to different taxonomic groups

    • Looking for it in a specific group? Restrict DB to group

  • What is the potential function of my protein?

    • BLASTP against UniProt (best annotation)

  • Does the match contain a specific motif?

    • PHI-BLAST (BLASTP with a search for a motif pattern)


Other BLAST algorithms

  • BLASTX

    • Query: Nucleotide, translated in all 6 reading frames

    • DB: protein

    • You have an EST and want to find similarity to protein

  • TBLASTN

    • Query: protein

    • DB: Nucleotide, translated in all 6 reading frames

    • Looking for protein homologs in an unannotated EST database

  • TBLASTX

    • Nucleotide query, nucleotide DB; both translated in all 6 reading frames

    • Looking for novel sequences in error prone nucleotide query sequences

    • Very computer intensive
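The translated searches compare protein sequences derived from nucleotide input. The sketch below shows what a six-frame translation looks like: three frames on the given strand and three on its reverse complement. It is a minimal illustration assuming the standard genetic code and unambiguous A/C/G/T input; the function names are hypothetical and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Map (strand, frame) to the protein translation for all six reading frames."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset + 1)] = translate(strand_seq[offset:])
    return frames

for frame, protein in six_frames("ATGGCCTAA").items():
    print(frame, protein)
```

Searching in all six frames is what makes these programs useful for sequences such as ESTs, where the coding strand and reading frame are not known in advance.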


Choosing the database

nr/nt: GenBank + RefSeq Nucleotides + EMBL + DDBJ + PDB

Excludes: HTGS, EST, GSS, STS, PAT and WGS

  • locate source of a sequence

  • find taxonomic distribution of a sequence


Report options

  • Default is human readable with links to NCBI records

  • Can download a hit table in CSV format and import into Excel

  • Taxonomy report

    • Shows distribution of all hits by lineage and taxonomic group

  • How many results will it return?

    • Default is 100. May need to increase that or restrict the database to confirm a negative result


Specialized BLAST

  • Blast2seq

    • Click on “Align 2 or more sequences” and the interface changes to allow you to put in a second sequence

    • Can align with all the major blast programs

    • Get both the alignment and a dotplot of the alignment

    • Useful for quick comparison of a few sequences


Primer BLAST

  • Find primers specific to your PCR template

  • Can check for specificity with BLAST


This week's exercise

  • Using BLAST to identify the source of unknown DNA sequences

  • Using BLAST to identify taxonomic distribution of an unknown sequence

  • Using BLAST to identify homologs of specific proteins in other species

  • Using Primer-BLAST to find locations of primers
