Sequence analysis

Millions of entries of protein and nucleotide data now in databases - How to convert this to useful information?

Sequence analysis - exploring a newly determined sequence: identification of genes or regulatory sequences, identification of homologous sequences, comparison with databases, determination of function, determination of evolutionary history


Sequence change


Even homologous sequences differ - mutation/selection through time (evolution)

Genes can also differ b/c of duplications (paralogs vs orthologs) and pseudogenes

Changes can “mask” underlying sequence similarity


Sequence alignment

Sequence alignment seeks to line-up (align) homologous bases - bases that are descendants of a common ancestral residue

Similarity/Identity (match) does not equal homology

Compare ACGCTGA and ACTGT

ACGCTGA    ACGCTGA
A--CTGT    ACTGT--



Sequence alignment1

Sequence similarity can result from random chance, convergent evolution, or a shared evolutionary origin (homology)

Homologous sequences are likely to have similar functions



Sequence alignment2


Alignments can be given scores, e.g. -1 for each substitution, -5 for an indel, +3 for a match

These are then scored 9,5,4,4

Overall score can then be used to determine “best” alignment


Sequence alignment score

Align AGCGTAT and ACGGTAT

AGCGTAT    AGC-GTAT    AGCG-TAT
|••||||    |-|-||||    |-||-|||
ACGGTAT    A-CGGTAT    A-CGGTAT

Which alignment is “best” depends on the gap penalty

Gap penalty -5: 2nd and 3rd alignments both score (6x3)-(2x5) = 8

Gap penalty -1: 2nd and 3rd alignments both score (6x3)-(2x1) = 16

Either gap penalty: 1st alignment scores (5x3)-(2x1) = 13
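
The arithmetic above can be checked with a short sketch. It assumes the scoring scheme from the earlier slide (+3 match, -1 substitution, gap penalty of -5 or -1); the function name `score_alignment` is made up for this illustration.

```python
def score_alignment(top, bottom, match=3, mismatch=-1, gap=-5):
    """Sum the column scores of two aligned strings ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # indel column
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


# The three candidate alignments of AGCGTAT and ACGGTAT from the slide.
alignments = [
    ("AGCGTAT", "ACGGTAT"),
    ("AGC-GTAT", "A-CGGTAT"),
    ("AGCG-TAT", "A-CGGTAT"),
]

for gap in (-5, -1):
    scores = [score_alignment(t, b, gap=gap) for t, b in alignments]
    print(f"gap penalty {gap}: {scores}")
```

With the gap penalty at -5 the ungapped alignment wins (13 vs 8); at -1 the gapped alignments win (16 vs 13).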



Sequence alignment gaps

Align THISSEQUENCE and THATSEQUENCE

THISSEQUENCE
||••||||||||
THATSEQUENCE

More divergent sequences are more difficult to compare

THATSEQUENCE and THISISASEQUENCE

THATSEQUENCE

THISISASEQUENCE

THISISA-SEQUENCE

TH----ATSEQUENCE

An alignment is a hypothesis about which residues evolved from the same ancestral residue = homology



Sequence alignment hypothesis

THISISA-SEQUENCE

TH----ATSEQUENCE

An alignment is a hypothesis about which residues evolved from the same ancestral residue

Comparisons need to take into consideration various factors: types of mutations (transitions/transversions), difference in physicochemical properties of amino acids and role in protein structure and function = evolutionary processes

Alignment scoring schemes can take none, some or all of this into consideration in scoring alignments



Amino acid vs nucleotide


Alignment scoring schemes can take none, some or all of this into consideration in scoring alignments

Amino acids are usually easier to align than nucleotides

The 4-letter nucleotide code carries less information than the 20-letter amino acid code - greater probability of a match by chance in DNA than in amino acids

Alignments can also take similarity (lysine/arginine vs lysine/glutamate) into consideration

Genetic code is redundant - the nucleotide sequence can change through time without altering the amino acid sequence, and it is the amino acid sequence that determines the structure and function of the protein

In some cases, only nucleotide sequences can be compared (gene identification, regulatory DNA, etc.)



Sequence alignment score1

Align AGCGTAT and ACGGTAT

AGCGTAT    AGC-GTAT    AGCG-TAT
|••||||    |-|-||||    |-||-|||
ACGGTAT    A-CGGTAT    A-CGGTAT

Which alignment is “best” depends on the gap penalty

Gap penalty -5: 2nd and 3rd alignments both score 8

Gap penalty -1: 2nd and 3rd alignments both score 16

Either gap penalty: 1st alignment scores 13

Best score is the optimal alignment, others are suboptimal
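
One standard way to find the optimal score without listing every alignment by hand is dynamic programming (Needleman-Wunsch style global alignment). The slides do not name an algorithm, so this is a minimal sketch of that technique under the same +3/-1/gap scheme, with a linear gap penalty and end gaps penalized; `global_alignment_score` is a made-up name.

```python
def global_alignment_score(x, y, match=3, mismatch=-1, gap=-5):
    """Best global alignment score of x and y with a linear gap penalty."""
    rows, cols = len(x) + 1, len(y) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align x[i-1] with y[j-1]
                score[i - 1][j] + gap,       # gap in y
                score[i][j - 1] + gap,       # gap in x
            )
    return score[-1][-1]


print(global_alignment_score("AGCGTAT", "ACGGTAT", gap=-5))
print(global_alignment_score("AGCGTAT", "ACGGTAT", gap=-1))
```

The two calls return the optimal scores worked out on the slide for each gap penalty: 13 with a penalty of -5 and 16 with a penalty of -1.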

The assumption is that an alignment of related sequences will give a better score than an alignment of random sequences

No algorithm yet incorporates complete evolutionary theory, but many yield reasonable results



Quantifying similarity

The simplest way of quantifying similarity between 2 sequences is percentage (percent) identity - actual identity, an objective measure

THISISA-SEQUENCE

TH----ATSEQUENCE

11/16 = 68.75%
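
A minimal sketch of the calculation, assuming (as the slide does) that the denominator is the full alignment length including gap columns; other conventions divide by the shorter sequence or by aligned columns only. The name `percent_identity` is illustrative.

```python
def percent_identity(top, bottom):
    """Percent of alignment columns holding the same residue in both rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    return 100.0 * identical / len(top)


print(percent_identity("THISISA-SEQUENCE", "TH----ATSEQUENCE"))  # 68.75
```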

Even unrelated sequences will have some amount of sequence identity (less in aa than nuc.), but this will decrease w/ the amount of sequence compared



Similarity

Percent identity is relatively crude - genuine matches do not have to be completely identical - homologous amino acids are often similar, not identical



Percent similarity

THISISA-SEQUENCE

TH----ATSEQUENCE

THISISASEQUENCE
||••---||||||||
THAT---SEQUENCE

Isoleucine and alanine (hydrophobic) are similar amino acids, as are serine and threonine (polar)

Not all similar amino acids are equally likely to occur

Other factors, like cysteine residues and disulfide bridges and tryptophan in hydrophobic structure can also be factored in - summing all values gives an overall alignment score

Scores are not necessarily simple to interpret and will change with alignment length



Minimum percent identity

A comparison of >1 million protein sequences w/ structural information suggests that 90% of sequence pairs w/ identity of 30% or greater over their entire length were structurally similar proteins

Below 25% identity, only 10% of pairs were structurally similar. 25-30% identity is the twilight zone. Even lower sequence identity (<20%) is the midnight zone

There are many different ways to score alignments, some more common w/ some applications than others.

In all cases, they must score both the degree of relatedness between residues (from a presumed common ancestor) and the validity of gaps



Dot plot


THISISA-SEQUENCE

TH----ATSEQUENCE

Dot-plots give a visual assessment of similarity based on identity for either aa or nuc.

One sequence, X, is written out horizontally and the second, Y, vertically. Each residue of X is then compared with each residue of Y, row by column.

Dots are placed if residues are identical
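
A text-only sketch of the procedure, with X across the top and Y down the side. The function name `dot_plot` and the use of `*` and `.` for dot and no dot are choices made for this illustration.

```python
def dot_plot(x, y):
    """One text row per residue of y; '*' where it is identical to a residue of x."""
    return ["".join("*" if a == b else "." for a in x) for b in y]


x, y = "THISISASEQUENCE", "THATSEQUENCE"
print("  " + x)
for residue, row in zip(y, dot_plot(x, y)):
    print(residue, row)
```

Runs of `*` along a diagonal mark stretches where the two sequences match; a jump from one diagonal to another marks a point where a gap would be needed.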


Dot plot1


Dots are placed if residues are identical

Here red dots indicate identical residues and breaks represent points where gaps are needed

Pink dots indicate residues that are also present elsewhere in sequence


Dot plot2


Dot-plots can suffer from noise caused by regions of similarity arising by chance

“Filters” are often used to remove this - overlapping, fixed-length windows (e.g. 10 amino acids) that must reach some minimum identity score before a dot is assigned

On the left is a window of length 1 aa, on the right length 10, minimum identity score 3
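
The window filter can be sketched as follows. This assumes the identity score is simply the count of identical residues along the diagonal of the window, and it records the dot at the window's start position (some tools place it at the window centre instead). `filtered_dot_plot` is a made-up name.

```python
def filtered_dot_plot(x, y, window=10, min_identity=3):
    """Return (i, j) positions where the window starting at x[i], y[j]
    contains at least min_identity identical residues on its diagonal."""
    dots = set()
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            matches = sum(1 for k in range(window) if x[i + k] == y[j + k])
            if matches >= min_identity:
                dots.add((i, j))
    return dots


x, y = "THISISASEQUENCE", "THATSEQUENCE"
unfiltered = filtered_dot_plot(x, y, window=1, min_identity=1)
filtered = filtered_dot_plot(x, y, window=4, min_identity=4)
print(len(unfiltered), sorted(filtered))
```

With a window of 1 every identical pair gets a dot; with a window of 4 requiring 4 matches, only the shared SEQUENCE diagonal survives.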


Dot plot3

Internal repeats w/in BRCA2 protein


Windows can be set for different applications - exon size in DNA, repeat motifs in amino acid sequences or length of secondary structure

Scoring can also be more subtle than 0/1 identity scores depending on the types of residues compared

Here the BRCA2 sequence is compared to itself: unfiltered on the left; on the right, filtered w/ a window of 30 and minimum score of 5


Scoring alignments

Alignments can be given scores (e.g. to compare two possible alignments) by different means

Substitution matrices can be used to assign individual scores to aligned sequence positions

Many different matrices exist; each assigns its own value to every possible pair of residues

Matrices can be based on theoretical considerations, but the most successful are based on empirical data gathered by comparison of known homologous sequences



Scoring matrices

BLOSUM-62

No one matrix is best for all applications, use depends on the time (evolutionary distance) between sequences and the type of protein

Most scoring schemes construct a 20x20 substitution matrix. Each cell represents the likelihood that that particular pair of amino acids will occupy the same position through time

Here color reflects similar physicochemical properties


PAM-120


Scoring matrices1

PAM-120


SEQ1 :  T  H  I  S  S  E  Q  U  E  N  C  E
SEQ2 :  T  H  A  T  S  E  Q  U  E  N  C  E
SCORE:  5  8 -1  1  4  5  5  0  5  6  9  5

“U” represents an unknown residue

The overall score, S, for the alignment equals 52
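
The sum can be reproduced with a lookup table. The dictionary here holds only the pair scores shown on this slide, not the full PAM-120 matrix, and the names `SLIDE_SCORES` and `matrix_score` are made up for the sketch.

```python
# Pair scores copied from the slide; a real matrix covers all 20x20 pairs.
SLIDE_SCORES = {
    ("T", "T"): 5, ("H", "H"): 8, ("I", "A"): -1, ("S", "T"): 1,
    ("S", "S"): 4, ("E", "E"): 5, ("Q", "Q"): 5, ("U", "U"): 0,
    ("N", "N"): 6, ("C", "C"): 9,
}


def matrix_score(seq1, seq2, scores):
    """Sum substitution scores over an ungapped alignment."""
    total = 0
    for a, b in zip(seq1, seq2):
        # Substitution matrices are symmetric, so accept either ordering.
        total += scores[(a, b)] if (a, b) in scores else scores[(b, a)]
    return total


print(matrix_score("THISSEQUENCE", "THATSEQUENCE", SLIDE_SCORES))  # 52
```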


Scoring matrices2

PAM-120


Different matrices are based on different sets of observed amino acid substitution frequencies

First set constructed by Margaret Dayhoff and co-workers in 1960s/1970s

Original comparisons used very similar sequences so that alignments would be unambiguous


PAM matrices

PAM-120

PAM units - Point Accepted Mutations, accepted point mutations per 100 residues

The matrix is a PAM matrix

PAM250 = 250 mutations have been fixed on average per 100 residues = many residues w/ more than one mutation - distant relationships


BLOSUM matrices

BLOSUM-62

BLOSUM, BLOcks SUbstitution Matrix: matrices developed in the 1990s from local multiple alignments rather than global alignments, using a large set of aligned, highly conserved, short regions (blocks) from analysis of the protein-sequence database SWISS-PROT


BLOSUM matrices1

BLOSUM-62

Matrix was calculated for changes between clustered groups of closely related proteins w/out use of phylogenetic trees

Different matrices vary the percentage identity cut-off for clustering, BLOSUM-62 derived using threshold of 62%


Choice of matrices

Which matrix to use depends on the question being asked

Within PAM matrices, the number represents evolutionary distance - larger, greater distance

Within BLOSUM, the number represents the percentage identity - larger, greater similarity

When aligning distantly related sequences, PAM250 or BLOSUM-50 may be preferable; PAM120 and BLOSUM-80 suit more closely related sequences



Choice of matrices1

Some matrices also incorporate additional information - STR matrix includes information about protein structure and can be used with very distantly related sequences

Other matrices are specific for different types of proteins - SLIM (ScoreMatrix Leading to Intra-Membrane) and PHAT (Predicted Hydrophobic and Transmembrane matrix) are designed from/for membrane proteins (not soluble proteins)

As of 2006, 94 matrices in GenomeNet



Inserting gaps

Homologous sequences are often different lengths - indels - and alignment requires gaps

Adding gaps will decrease an alignment score by a “gap penalty”

Indels rarely happen in structures of functional importance, are more likely at the ends, and generally span more than a single residue - so the gap extension penalty is less than the gap opening penalty

The best alignment is generally the one that returns the maximum score for the smallest number of introduced gaps
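
The opening/extension distinction is usually called an affine gap penalty. A minimal sketch, assuming illustrative values of -10 to open a gap and -1 for each further residue (the slides give no numbers), and the convention that the opening penalty covers the first gapped position; some programs instead charge open + length x extend.

```python
from itertools import groupby


def affine_gap_cost(gap_length, gap_open=-10, gap_extend=-1):
    """Cost of one contiguous gap: open once, then extend."""
    if gap_length <= 0:
        return 0
    return gap_open + (gap_length - 1) * gap_extend


def gap_runs(aligned):
    """Lengths of each run of '-' in one row of an alignment."""
    return [len(list(run)) for ch, run in groupby(aligned) if ch == "-"]


top, bottom = "THISISA-SEQUENCE", "TH----ATSEQUENCE"
total = sum(affine_gap_cost(n) for row in (top, bottom) for n in gap_runs(row))
print(gap_runs(top), gap_runs(bottom), total)
```

Under these assumed values a single 4-residue gap costs -13, whereas four separate 1-residue gaps would cost -40, so the scheme favours fewer, longer indels.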



Inserting gaps1

Alignment programs generally allow the user to vary the gap penalty

If the penalty is set high, few gaps will be introduced - good for closely related sequences

If the penalty is set low, more gaps will be introduced - good for distantly related sequences

The most appropriate gap penalty may also vary depending on the substitution matrix being used

Gap score can also vary w/ the type of residue; some aa very rarely change (e.g. tryptophan)



Gap penalties


Alignments of two distantly related proteins, phosphatidylinositol-3-OH kinase (PI3-kinase) and a protein kinase


Gap penalties1


1st alignment, gap penalty set high, low in 2nd. In both, end gaps are not penalized

With this small amount of identity, expert knowledge of protein structure and function can be helpful


Local and global alignments

Global alignment - alignment of entire sequences, generally possible with closely related sequences

Local alignment - alignment of parts, or domains, of a sequence, possible w/ more distantly related sequences, or sequences in which different regions have different evolutionary histories (multi-domain proteins)

In some cases, local alignments can be used as a 1st step toward a global alignment
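
Local alignment is commonly computed with a Smith-Waterman style dynamic program, in which a cell score is never allowed to fall below zero so that an alignment can start and end anywhere. The slides do not name the algorithm; this is a sketch of that standard technique using the +3/-1/-5 scheme from the earlier slides, and `local_alignment_score` is a made-up name.

```python
def local_alignment_score(x, y, match=3, mismatch=-1, gap=-5):
    """Best local alignment score of x and y with a linear gap penalty."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        curr = [0]
        for j in range(1, len(y) + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            cell = max(
                0,                   # start a fresh local alignment here
                prev[j - 1] + pair,  # extend along the diagonal
                prev[j] + gap,       # gap in y
                curr[j - 1] + gap,   # gap in x
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(local_alignment_score("THISISASEQUENCE", "THATSEQUENCE"))  # 24
```

The best local region here is the shared SEQUENCE (8 matches x 3 = 24); the divergent starts of the two strings are simply left out rather than forced into the alignment.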



Local and global alignments1

Local vs global alignment of bovine PI3-kinase p100 and the cAMP-dependent kinase

The two share structural homology in the catalytic domain but very little sequence similarity

Note that the global alignment fails to identify the homologous region



Pairwise and multiple alignments


Alignments can be made between 2 sequences, pairwise, or many sequences, multiple alignments.

Multiple alignments can resolve ambiguities in alignments and illustrate sequence conservation over evolutionary time

They also generally require more computing power and more sophisticated algorithms


Databases

Alignments can be used to locate and identify a gene in a new genome, identify the possible function of a new sequence or novel gene, or find a given gene in a specific taxon

Searches have to be sensitive enough to detect distant similarities (avoiding false negatives) and specific enough to reject unrelated sequences (avoiding false positives)

Verification of homology of identified matches is generally required



BLAST search

Database searching is essentially the same as a pair-wise alignment

BLAST, Basic Local Alignment Search Tool, software for searching databases of molecular sequences for regions of similarity to a query sequence

BLAST searches for regions of local alignment = isolated regions in seq. pairs that have high levels of similarity

BLAST report ranks “hits” in order of statistical significance using an E-value

E-values are not the same as p-values, but do approximate them when small
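
The relationship can be made concrete under the usual assumption that chance hits follow a Poisson distribution, so the probability of at least one hit scoring this well is P = 1 - e^(-E). A short sketch; `p_value_from_e_value` is a made-up name.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, given the expected number of hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (3, 1, 0.01, 1e-5):
    print(e, p_value_from_e_value(e))
```

For E = 3 the P-value is about 0.95, while for E = 0.01 or smaller the two numbers are nearly identical.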



BLAST

BLAST, Basic Local Alignment Search Tool, is widely used to find core similarity using a window of preset size (a “word”) and a certain minimum density of matches (DNA) or amino-acid similarity score

blastp - compare amino acid query with protein-sequence database

blastn - compare nucleotide query with nucleic acid sequence database

blastx - compare all 6 frame translations of a nucleotide query w/ protein-sequence database

tblastn - compare protein query with translated nucleic acid sequence database

tblastx - compare all 6 frame translations of a nucleotide query w/ all 6 frame translations of a nucleotide database
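
The "6 frame translations" are the three reading frames on each strand. A minimal sketch, assuming the standard genetic code and plain A/C/G/T input (codons with any other character become X); the function names are made up and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' is a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Frames +1..+3 on the given strand and -1..-3 on its reverse complement."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```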



Scoring a search

Amino acid searches are easier than nucleotide - but data is often nucleotide

Quality of results depends on appropriate algorithms - and well-maintained databases

Alignments are given scores - looking for alignments w/ higher score than would be expected from a random match

Can estimate how often two random sequences would align with a score ≥ S - the expectation, or E-value

The E-value is the number of alignments w/ a score of at least S that would be expected by chance alone in searching a complete dataset of n sequences; it ranges from 0 to n

An E-value of 3 means you would expect 3 such matches by chance; 10^-29 means essentially none
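
For BLAST-style normalized bit scores, the expectation is commonly written as E = m x n x 2^(-S'), where m is the query length and n the total database length in residues. The sketch below uses that relation with made-up lengths; real programs substitute edge-corrected "effective" lengths, which are not modelled here, and `expected_hits` is a hypothetical name.

```python
def expected_hits(bit_score, query_length, database_length):
    """Expected number of chance alignments scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 250-residue query against 100 million residues.
for bits in (30, 50, 100):
    print(bits, expected_hits(bits, 250, 100_000_000))
```

Each extra bit of score halves the E-value, and doubling the database size doubles it, which is why the same alignment can be significant in a small database and not in a large one.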



Low complexity

Low-complexity regions in a protein, e.g. simple repeats, biased amino acid composition, will lead to false matches of proteins or domains that are unlikely to be homologous

These regions are generally removed from a search
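
One simple way to picture such filtering is to mask windows whose residue composition has low Shannon entropy. This is a simplified stand-in for illustration, not the filter any particular search program uses; the window size, threshold, and function names are all assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Entropy in bits of the residue composition of a segment."""
    n = len(segment)
    counts = Counter(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace residues in every low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)


print(mask_low_complexity("ACDESSSSSSFGHI", window=4, threshold=0.5))
```

In this toy example the run of serines is replaced by X, so it can no longer produce spurious matches to other serine-rich sequences.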



BLAST search1

BLASTn of zebrafish TpiB


BLAST search2

BLASTp of zebrafish TPIB
