Sequence alignment techniques

PowerPoint Slideshow about ' Sequence Alignment Techniques' - chiquita-warren


Presentation Transcript

In this presentation

Part 1 – Searching for Sequence Similarity

Part 2 – Multiple Sequence Alignment


Part 1


Searching for Sequence Similarity


Sequence similarity searches

  • Sequence similarity searches of databases enable us to extract sequences that are similar to a query sequence

  • Information about these extracted sequences can be used to predict the structure or function of the query sequence

  • Prediction using similarity is a powerful and ubiquitous idea in bioinformatics. The underlying reason for this is molecular evolution


Sequence alignment

  • Any pair of DNA sequences will show some degree of similarity

  • Sequence alignment is the first step in quantifying this similarity in order to distinguish between chance similarity and real biological relationships

  • Alignments show the differences between sequences as changes (mutations) and insertions or deletions (indels or gaps), and can be interpreted in evolutionary terms


Alignment algorithms

  • Dynamic programming algorithms can calculate the best alignment of two sequences

  • Well-known variants are

    • the Smith-Waterman algorithm (local alignments)

    • the Needleman-Wunsch algorithm (global alignments)

  • Local alignments are useful when sequences are not related over their full lengths, e.g., proteins sharing only certain domains or DNA sequences related only in exons
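
The two dynamic programming variants named above differ mainly in how the score table is initialised and whether scores are allowed to drop below zero. The following is a minimal sketch that returns only the best score (no traceback). The function name `align_score` and the match, mismatch and gap values are illustrative assumptions, not values from the presentation.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of sequences a and b by dynamic programming.

    local=False: Needleman-Wunsch style global alignment.
    local=True:  Smith-Waterman style local alignment.
    Scoring values are illustrative assumptions; gaps use a simple
    proportional (per-position) penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                # A local alignment may restart anywhere, so never go below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


print(align_score("GATTACA", "GATTACA"))                       # 7
print(align_score("TTTTACGTTTTT", "GGGACGTGGG", local=True))   # 4
```

In the second call only the shared "ACGT" segment contributes to the score, which is the situation described above for sequences that are related over only part of their length.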


Alignment scores and gap penalties

  • A simple alignment score measures the number or proportion of identically matching residues

  • Gap penalties are subtracted from such scores to ensure that alignment algorithms produce biologically sensible alignments without many gaps

  • Gap penalties may be constant (independent of the length of the gap), proportional (proportional to the length of the gap) or affine (containing gap opening and gap extension contributions)

  • Gap penalties can be varied according to the desired application
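
The three gap penalty schemes listed above can be written as one small function. This is a sketch under stated assumptions: the name `gap_penalty`, the default opening and extension values, and the affine convention used here (opening cost for the first gap position, extension cost for each further position) are illustrative choices; other conventions exist.

```python
def gap_penalty(length, scheme="affine", opening=10.0, extension=1.0):
    """Penalty for a single gap of the given length.

    constant:     the same cost regardless of gap length
    proportional: cost grows linearly with gap length
    affine:       opening cost plus an extension cost per additional position
                  (one common convention; assumed here)
    """
    if length <= 0:
        return 0.0
    if scheme == "constant":
        return opening
    if scheme == "proportional":
        return extension * length
    if scheme == "affine":
        return opening + extension * (length - 1)
    raise ValueError(f"unknown gap penalty scheme: {scheme}")


for scheme in ("constant", "proportional", "affine"):
    print(scheme, gap_penalty(5, scheme))
```

With these default values, an affine penalty makes one gap of length 5 cheaper than five separate gaps of length 1, which favours alignments with a few longer gaps over many scattered ones.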


Similarity and homology

  • Similarity may exist between any sequences

  • Sequences are homologous only if they have evolved from a common ancestor

  • Homologous sequences often have similar biological functions (orthologs), but the mechanism of gene duplication allows homologous sequences to evolve different functions (paralogs)


Similarity search in databases

  • Sequences similar to a query can be found in a database by aligning it to each database sequence in turn and returning the highest scoring (most similar) sequences

  • This can be achieved by dynamic programming algorithms but in practice faster approximate methods are often used
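
The exhaustive strategy described above (align the query to every database sequence and return the best scorers) can be sketched as follows. The names `local_score` and `search`, the toy database and the scoring values are all hypothetical; this is the slow exact approach, not one of the faster approximate methods.

```python
def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman style local alignment score (score only, two rows)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            cell = max(
                0,                                               # restart
                prev[j - 1] + (match if ca == cb else mismatch),  # (mis)match
                prev[j] + gap,                                   # gap in b
                curr[j - 1] + gap,                               # gap in a
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def search(query, database, top=3):
    """Score the query against every database entry; return the best hits."""
    scored = [(local_score(query, seq), name) for name, seq in database.items()]
    # Highest score first; ties broken by name for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top]


# Toy database with made-up sequences, for illustration only.
database = {
    "exact": "MKVLAAGIV",
    "partial": "GGGMKVLGGG",
    "unrelated": "PPPPPPPP",
}
print(search("MKVLAAGIV", database))
```

Every database sequence costs one full dynamic programming table, which is why this approach scales poorly to large databases.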


Statistical scores

  • The p value of a similarity score is the probability of obtaining a score at least as high in a chance similarity between two unrelated sequences of similar composition

  • Low p values indicate significant matches that are likely to reflect real biological relationships

  • The related E value is the expected frequency of chance occurrences scoring at least as high as the identified similarity

  • A low p value for a similarity between two sequences can translate into a high E value for a search of a large database
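
The last point can be made concrete with a little arithmetic. As a simplifying assumption, suppose every comparison against an unrelated database sequence has the same chance probability p of scoring at least as high; the expected number of chance hits in a database of N sequences is then N × p. The function names below are hypothetical, and real search programs compute E values from their own statistical models.

```python
import math


def expected_chance_hits(p, database_size):
    """E value under the simplifying assumption that each of the
    database_size comparisons independently has chance probability p."""
    return p * database_size


def p_at_least_one(e_value):
    """Probability of at least one chance hit, treating the number of
    chance hits as Poisson-distributed with mean e_value."""
    return 1.0 - math.exp(-e_value)


p = 1e-6                      # looks very significant for one pairwise comparison
e = expected_chance_hits(p, 10_000_000)
print(e)                      # about 10 chance hits expected in the search
print(p_at_least_one(e))      # close to 1: a chance hit is almost certain
```

For small E values the two quantities are nearly equal, which is why thresholds on E and p behave similarly at the stringent end.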


Sensitivity and specificity

  • These measures quantify the success of a database search strategy

  • Sensitivity measures the proportion of real biological sequence relationships in the database that were detected as hits in the search

  • Specificity is the proportion of the hits corresponding to real biological relationships

  • Changing E and p value thresholds results in a trade-off between these complementary measures of success
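
Using the definitions given above, both measures can be computed from the set of hits returned by a search and the set of sequences known to be truly related. The function name and the example identifiers are hypothetical. Note that specificity as defined here (the fraction of hits that are real) is the quantity called precision in some other fields.

```python
def search_quality(hits, true_relatives):
    """Return (sensitivity, specificity) as defined in this presentation.

    sensitivity: fraction of true relatives that were found as hits
    specificity: fraction of hits that are true relatives
    """
    hits, true_relatives = set(hits), set(true_relatives)
    true_positives = hits & true_relatives
    sensitivity = len(true_positives) / len(true_relatives) if true_relatives else 0.0
    specificity = len(true_positives) / len(hits) if hits else 0.0
    return sensitivity, specificity


# Made-up identifiers: 4 hits, 3 of them real; 6 real relatives in total.
hits = {"seqA", "seqB", "seqC", "seqX"}
relatives = {"seqA", "seqB", "seqC", "seqD", "seqE", "seqF"}
print(search_quality(hits, relatives))  # (0.5, 0.75)
```

Relaxing the E value threshold would typically add hits, which can raise sensitivity while lowering specificity.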


Maximizing amino acid identities

  • Protein sequences can be aligned to maximize amino acid identities, but this will not reveal distant evolutionary relationships


Evolution

  • Protein-coding sequences evolve slowly compared with most other parts of the genome, because of the need to maintain protein structure and function

  • An exception to this is the fast evolution that might occur in the redundant copy of a recently duplicated gene


Allowed changes

  • Changes in protein sequences during evolution tend to involve substitutions between amino acids with similar properties because these tend to maintain the structural stability of the protein


Substitution score matrices

  • These matrices give scores for all possible amino acid substitutions during evolution

  • Higher scores indicate more likely substitutions

  • Example matrices are BLOSUM62 and PAM250

  • PAM stands for Point Accepted Mutation; for PAM250, the evolutionary distance of the matrix is 250 accepted amino acid changes per 100 residues

  • Dynamic programming algorithms for sequence alignment can operate using scores from these matrices
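
To illustrate how a substitution matrix changes the score of an alignment, the sketch below scores a pre-aligned pair of sequences column by column. Only a handful of PAM250 entries are included, taken from the matrix reproduced in this presentation; the function names and the gap penalty of -8 are illustrative assumptions.

```python
# A few PAM250 entries only (subset; taken from the matrix in this presentation).
PAM250_SUBSET = {
    ("K", "K"): 5, ("K", "R"): 3,
    ("D", "D"): 4, ("D", "E"): 3,
    ("F", "F"): 9, ("F", "Y"): 7,
}


def pair_score(x, y, matrix):
    """Look up a substitution score; the matrix is symmetric."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]


def alignment_score(row1, row2, matrix, gap=-8):
    """Sum of per-column scores for two aligned rows ('-' marks a gap).

    The gap penalty is an assumed value, charged per gap position.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += gap
        else:
            total += pair_score(x, y, matrix)
    return total


print(alignment_score("KDF", "REY", PAM250_SUBSET))  # 13
```

The aligned pair KDF/REY has no identical residues, so an identity-based score would be zero, yet the conservative substitutions K/R, D/E and F/Y give it a clearly positive score here.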


Significance of score matrices

  • Substitution score matrices allow detection of distant evolutionary relationships between protein sequences

  • It is possible to detect much more distant relationships by comparing protein sequences than by comparing nucleic acid sequences


Part of the sequence of human Huntington’s disease protein (Huntingtin) showing low complexity regions (underlined) associated with compositional bias towards glutamine (Q) and proline (P)

MATLEKLMKA FESLKSFQQQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQPLLPQPQPPPPPPPPPPGP AVAEEPLHRP KKELSATKKD RVNHCLTICE NIVAQSVRNS PEFQKLLGIA MELFLLCSDD AESDVRMVAD ECLNKVIKAL MSDNLPRLQL ELYKEIKKNG APRSLRAALW RFAELAHLVR PQKCRPYLVN LLPCLTRTSK RPEESVQETL AAAVPKIMAS


A dot plot of human pleckstrin sequence against itself produced with Erik Sonnhammer’s ‘dotter’ program. The sequence is plotted from N- to C-terminus along horizontal and vertical axes between residues 1 and approximately 350.

PLEK_HUMAN (horizontal) vs. PLEK_HUMAN (vertical)


The PAM250 matrix and an alignment of two sequences. Total alignment scores obtained with two different matrices should not be compared, but note that the PAM matrix detects a much better alignment in the second halves of these sequences than the identity matrix does. With the introduction of a single gap, sensible alignments of hydrophobic amino acids, and alignment of K with R (both basic), D with E (both acidic) and F with Y (both aromatic), can be seen.

C  12
S   0   2
T  -2   1   3
P  -1   1   0   6
A  -2   1   1   1   2
G  -3   1   0  -1   1   5
N  -4   1   0  -1   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1   0   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   4   3   6
R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   5
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -3  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V  -2  -3   0  -1   0  -1  -2  -2  -4  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -5  -6   0   0  17
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W

Sequence 1:      MIIVKP -VVLKGDFG
Sequence 2:      MILLKP AIIIRAEY-
Position score:  656256 044231370


Figure 3. Display of the DNA unit. DNA can be described at several levels of detail. At the most detailed level, DNA can be characterized by the 5' and 3' termini at both external and internal positions; at the most abstract level, the substrate DNA can be one of 16 common structures. The goal is to provide methods for specifying the properties of DNA in as many ways as is natural for a scientist.


Figure 7. An initial experimental environment. The temperature is 37 degrees Celsius and the pH value is 7.4. No DNA polymerase I activity is possible


Part 2


Multiple Sequence Alignment


Non specific sequence similarity

  • Certain types of sequence similarity are less likely to be indicative of an evolutionary relationship than others are

  • Examples of this are similarity between regions of low compositional complexity, short period repeats and protein sequences coding for generic structures like coiled coils


Similarity search filters

  • Regions of these non-specific sequence types can degrade the results of similarity searches and are often filtered out of query sequences prior to searching

  • The programs SEG and DUST can be used to detect and filter low complexity sequences, XNU can filter short period repeats and COILS can detect the presence of potential coiled coil structures
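
The idea of filtering low complexity regions can be illustrated with a simple sliding-window entropy filter. This is not the algorithm used by SEG or DUST; it is a swapped-in simplification, and the function names, window size and threshold are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace residues in low-entropy windows with mask_char.

    A simplified stand-in for dedicated filters; window and threshold
    are illustrative assumptions, not tuned values.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


# Start of the Huntingtin fragment shown earlier, including its glutamine run.
fragment = "MATLEKLMKAFESLKSF" + "Q" * 23
print(mask_low_complexity(fragment))
```

A window made up of a single repeated residue has entropy 0, so the glutamine run is masked, while the compositionally varied start of the sequence is left intact.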


Database types for searches

  • Database and query sequences can be protein or nucleic acid sequences and different query strategies are required for different types and combinations

  • In general, searches are more sensitive using strategies where protein-coding nucleic acid database and/or query sequences are first translated to protein sequences
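
Translating a nucleic acid sequence before searching means considering all six reading frames (three on each strand), since the correct frame is usually not known in advance. The sketch below uses the standard genetic code; the function names are hypothetical and codons containing ambiguous bases are simply translated as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna ('*' marks a stop)."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Translations of frames 1-3 on the given strand, then on its reverse complement."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (dna, reverse) for offset in range(3)]


print(six_frame_translations("ATGGCCTAA"))
```

Each of the six translations can then be compared with protein sequences using an amino acid substitution matrix.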


Iterative database searches

  • PSI-BLAST is an iterative search method that improves on the detection rate of BLAST and FASTA

  • Each iteration discovers intermediate sequences that are used in a sequence profile to discover more distant relatives of the query sequence in subsequent iterations

  • Potential problems with PSI-BLAST are associated with the potential for unrelated sequences to pollute the iterative search, and difficulties associated with the domain structure of proteins

  • PSI-BLAST often detects up to twice as many evolutionary relationships as BLAST


Multiple sequence alignment

  • Multiple alignment illustrates relationships between two or more sequences

  • When the sequences involved are diverse, the conserved residues are often key residues associated with maintenance of structural stability or biological function

  • Multiple alignments can reveal many clues about protein structure and function


Multiple alignment

Part of an (artificial) multiple alignment of a family consisting of 7 sequences, which subdivide into 3 subfamilies. The bars on the left indicate subfamilies; the dotted boxes highlight conservation patterns.


Progressive sequence alignment

  • Most commonly used software uses the method of progressive alignment

  • This is a fast method, but frozen-in errors mean that it does not always work perfectly

  • Biological knowledge can provide information about likely alignments, and where automatically produced alignments turn out to be imperfect, software for manual alignment editing is required


Protein families

  • Assigning sequences to protein families is a very valuable way of predicting protein function

  • Many ways have been developed to represent protein family information (consensus sequences, conserved residues, residue patterns, sequence profiles, etc.) and these have been stored in secondary protein family databases


Consensus sequences

  • These condense the information from a multiple alignment into a single sequence

  • Their main shortcoming is the inability to represent any probabilistic information apart from the most common residue at a particular position

  • Derivation of a consensus sequence illustrates that any protein family representation is subject to bias if the set of sequences from which it was derived is biased
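
A minimal sketch of deriving a consensus sequence from a multiple alignment, alongside the per-column residue counts that the consensus throws away. The function names and the toy alignment are made up for illustration.

```python
from collections import Counter


def column_counts(alignment):
    """Residue counts for each column of an alignment (equal-length rows)."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")
    return [Counter(column) for column in zip(*alignment)]


def consensus(alignment):
    """Most common residue in each column, joined into one sequence."""
    return "".join(counts.most_common(1)[0][0] for counts in column_counts(alignment))


# Toy alignment (hypothetical sequences).
alignment = ["MKVLA", "MKILA", "MRVLG", "MKVLA"]
print(consensus(alignment))         # MKVLA
print(column_counts(alignment)[1])  # K seen 3 times, R once
```

The consensus records only K at the second position; the fact that R occurs there in a quarter of the sequences is lost. Adding extra copies of one sequence to the input would also shift the counts, which is the bias problem noted above.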


PRINTS and BLOCKS

  • These represent protein families as sets of multiply aligned ungapped segments (motifs) derived from the most highly conserved regions of sequences

  • By representing more of the sequence, they have the potential to be more sensitive than short PROSITE patterns

  • The ability to match only a subset of the motifs associated with a particular family means that they can detect splice variants and sequence fragments and can represent subfamilies

  • WWW-based search engines for the databases are available


Protein domain families

  • Many proteins are built up from domains in a modular architecture

  • The study of protein families is best pursued as a study of protein domain families

  • ProDom is a database of protein domain sequences created by automatic means from the protein sequence databases


Resources for domain families

  • Pfam and SMART can be used for protein domain family analysis

  • The integrated resource InterPro unites PROSITE, PRINTS, Pfam, ProDom and SMART


Visualization of similarities

  • Dot plots are a very good way to visualize sequence similarity and find repeats
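
A dot plot can be sketched in a few lines: place a dot wherever a short window of one sequence matches a window of the other. This is a simplified illustration, not how the dotter program works internally; the function names, window size and match threshold are assumptions.

```python
def dot_plot(seq_x, seq_y, window=3, min_matches=3):
    """Set of (i, j) positions where the windows starting at seq_x[i]
    and seq_y[j] share at least min_matches identical residues."""
    points = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            matches = sum(1 for k in range(window) if seq_x[i + k] == seq_y[j + k])
            if matches >= min_matches:
                points.add((i, j))
    return points


def render(points, width, height):
    """Text rendering: seq_x runs horizontally, seq_y vertically."""
    return "\n".join(
        "".join("*" if (i, j) in points else "." for i in range(width))
        for j in range(height)
    )


# A toy sequence containing a direct repeat, plotted against itself.
seq = "ACDEACDE"
points = dot_plot(seq, seq)
print(render(points, len(seq) - 2, len(seq) - 2))
```

In a self-comparison the main diagonal is always filled; the additional off-diagonal lines mark the repeated segment, which is how repeats show up in the pleckstrin self-plot described earlier.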

