Bioinformatics

Lec-6 Bioinformatics Ayesha M. Khan Spring 2013

Some statistics of local sequence comparison (BLAST)
• Once BLAST has found a sequence in the database that is similar to the query, it helps to know whether the alignment is “good” and reflects a possible biological relationship, or whether the observed similarity is attributable to chance alone.
• BLAST uses statistical theory to produce a bit score and an expect value (E-value) for each alignment pair (query to hit).

BLAST Results: Scores and Values
• Max score = highest alignment score (bit score) between the query sequence and a database sequence segment.
• Total score = sum of the alignment scores of all segments from the same database sequence that match the query sequence (calculated over all segments). This score differs from the max score if several parts of the database sequence match different parts of the query sequence.
• Query coverage = percentage of the query length that is included in the aligned segments. This coverage is calculated over all segments.
• E-value = number of alignments expected by chance with a particular score or better.
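As a rough illustration of how the summary values above relate to the individual aligned segments, the sketch below derives max score, total score and query coverage from a list of segments. The `Hsp` record, its field names and the example numbers are hypothetical and are not BLAST's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """One aligned segment between the query and a database sequence (hypothetical record)."""
    bit_score: float
    query_start: int  # 1-based, inclusive
    query_end: int    # 1-based, inclusive


def summarize_hit(hsps, query_length):
    """Return (max score, total score, query coverage in percent) for one database sequence."""
    max_score = max(h.bit_score for h in hsps)
    total_score = sum(h.bit_score for h in hsps)
    # Count each query position once, even if segments overlap.
    covered = set()
    for h in hsps:
        covered.update(range(h.query_start, h.query_end + 1))
    query_coverage = 100.0 * len(covered) / query_length
    return max_score, total_score, query_coverage


# Two made-up segments that overlap on query positions 41-50.
hsps = [Hsp(80.0, 1, 50), Hsp(40.0, 41, 100)]
max_score, total_score, coverage = summarize_hit(hsps, query_length=200)
print(max_score, total_score, coverage)
```

With a single segment the max score and total score coincide; they differ only when several parts of the database sequence match the query.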

Some details: Bit score
• The bit score gives an indication of how good the alignment is; the higher the score, the better the alignment.
• In general terms, this score is calculated from a formula that takes into account the alignment of similar or identical residues, as well as any gaps introduced to align the sequences.
• Key element: the substitution matrix

Bit score (contd.)
• The BLOSUM62 matrix is the default for most BLAST programs, the exceptions being blastn, megaBLAST and discontiguous megaBLAST (programs that perform nucleotide–nucleotide comparisons and hence do not use protein-specific matrices).
• Bit scores are normalized, which means that the bit scores from different alignments can be compared, even if different scoring matrices have been used.

Some details: E-value
• The E-value gives an indication of the statistical significance of a given pairwise alignment and reflects the size of the database and the scoring system used. The lower the E-value, the more significant the hit.
• An alignment with an E-value of 0.05 means that 0.05 alignments with this score or better are expected by chance alone in a search of this size; for small values this is roughly a 5 in 100 (1 in 20) chance of seeing such a hit by chance.
• $E = K m n e^{-\lambda S}$
- m, n give the size of the search space (n is the length of the query sequence, m is the length of the database)
- K is a scale parameter for the size of the search space
- λ is a scale parameter for the scoring method
- S is the raw alignment score; the bit score is S rescaled using λ and K
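The relationship between score and E-value can be checked numerically. In the Karlin-Altschul formulation, E = Kmn e^(-λS) is written with the raw score S, and the bit score S' = (λS - ln K) / ln 2 gives the equivalent form E = mn 2^(-S'). The sketch below computes both. The values of λ and K, the lengths and the raw score are illustrative assumptions, not values taken from the lecture.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S into a bit score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_raw(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lam * S), using the raw score S."""
    return k * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S'), using the bit score S'."""
    return m * n * 2.0 ** (-bits)


# Illustrative parameters only (assumed for this example).
LAM, K = 0.267, 0.041
query_len, db_len = 250, 50_000_000
raw = 100

bits = bit_score(raw, LAM, K)
e_raw = e_value_from_raw(raw, db_len, query_len, LAM, K)
e_bits = e_value_from_bits(bits, db_len, query_len)
print(f"bit score = {bits:.1f}, E = {e_raw:.3g}")
```

Both routes give the same E-value, and doubling the database size doubles E, which is why the same alignment looks less significant in a larger database.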

Difference between BLOSUM and PAM matrices
• BLOSUM is derived from alignments of short sequence blocks that match each other at some defined level of similarity. The BLOSUM method thereby incorporates much more data into its matrices and is therefore presumably more accurate.
• PAM is derived from global alignments of closely related proteins; matrices for larger evolutionary distances are extrapolated from these.
• BLOSUM matrices tend to be more sensitive to distant relationships than PAM.
• Compared with PAM, BLOSUM tends to give higher scores to substitutions involving hydrophilic amino acids and lower scores to substitutions involving hydrophobic amino acids.
• Substitutions of rare amino acids are more tolerated by BLOSUM.
• General rules:
-Use higher PAM or lower BLOSUM matrices for more divergent sequences
-Use lower PAM or higher BLOSUM matrices for more closely related sequences

Concept of Gaps in Alignment
• Sequences may have diverged from a common ancestor through various types of mutations:
- Substitutions
- Insertions
- Deletions
• The latter two result in gaps in alignments.

Gap Penalty
• Gap penalties are used during sequence alignment to penalize gaps.
• The gap extension penalty is usually much smaller than the gap opening penalty. For instance, 10 insertions of one nucleotide each should be penalized more heavily than one insertion of 10 nucleotides.
• That is, opening a new gap is less probable than extending an existing gap over more than one nucleotide. Hence a single mutation event (causing insertion or deletion of more than one nucleotide) is more probable than multiple mutation events.

Gap Penalty
• Linear gap penalties
- Simplest type of gap penalty
- The overall penalty for one large gap is the same as for many small gaps
- $w_k = cL$, where L is the gap length and c is the penalty per gap position
• Affine gap penalties
- Have a gap opening penalty, c, and a gap extension penalty, e
- $w_k = c + (L-1)e$
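The two penalty schemes can be compared directly. The sketch below implements both formulas as written on the slide; the penalty values c = 10 and e = 1 are arbitrary choices for illustration.

```python
def linear_gap_penalty(length, c):
    """Linear scheme: w = c * L, every gap position costs the same."""
    return c * length


def affine_gap_penalty(length, c, e):
    """Affine scheme: w = c + (L - 1) * e, opening costs c, each extra position costs e."""
    if length <= 0:
        return 0
    return c + (length - 1) * e


C, E = 10, 1  # illustrative opening and extension penalties

# One gap of 10 positions versus ten separate gaps of 1 position each.
linear_one_long = linear_gap_penalty(10, C)
linear_ten_short = 10 * linear_gap_penalty(1, C)
affine_one_long = affine_gap_penalty(10, C, E)
affine_ten_short = 10 * affine_gap_penalty(1, C, E)

print(linear_one_long, linear_ten_short)  # same penalty under the linear scheme
print(affine_one_long, affine_ten_short)  # the single long gap is much cheaper
```

Under the affine scheme the single long gap costs far less than ten separate gaps, which matches the idea that one mutation event is more probable than many.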

BLAST & FASTA: heuristic methods
• BLAST and FASTA use heuristic methods that attempt to approximate the optimal local similarity shared by two sequences.
• Both use word (k-tuple) methods.
• They align two sequences very quickly by first searching for identical short stretches of sequence (called words, or k-tuples) and then joining these words into an alignment by the dynamic programming method.

BLAST…
• The BLAST programs are used to find high-scoring local alignments between a query sequence and a target database.
• The BLAST algorithm is based on the fact that true match alignments are very likely to contain short stretches of identities, or very high-scoring matches, somewhere within them.
• So BLAST initially looks for such short stretches and uses them as ‘seeds’ from which it extends outward in search of a good longer alignment.

Main stages of BLAST
• Remove (filter) low-complexity regions from the query Q
• Harvest k-tuples (triples) from Q
• Expand each triple into ~50 high-scoring words
• Seed a set of possible alignments
• Generate high-scoring pairs (HSPs) from the seeds
• Test the significance of matches from the HSPs
• Report the alignments found from the HSPs
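The seeding and extension stages can be sketched in a few lines. This is a simplified illustration, not BLAST itself: it uses exact word matches only (no expansion into high-scoring neighbouring words), skips filtering and significance testing, and the function names, scores and `drop_off` threshold are assumptions made for the example.

```python
from collections import defaultdict


def index_words(query, k):
    """Map every k-letter word in the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k):
    """Return (query_pos, subject_pos) pairs where an identical word occurs in both."""
    index = index_words(query, k)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, k, match=1, mismatch=-1, drop_off=2):
    """Extend a seed without gaps in both directions.

    Extension in one direction stops when the running score falls
    drop_off below the best score seen so far in that direction.
    Returns (score, query_start, query_end, subject_start), end exclusive.
    """
    def walk(qi, sj, step):
        score = best = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= drop_off:
                break
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + k, j + k, 1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    score = k * match + left_score + right_score
    return score, i - left_len, i + k + right_len, j - left_len


query = "ACGTTGCA"
subject = "TTACGTTGCATT"
seeds = find_seeds(query, subject, k=4)
hsp = extend_seed(query, subject, *seeds[0], k=4)
print(seeds)
print(hsp)
```

Here every word of the query is found in the subject, and extending the first seed recovers the full eight-letter match as a single high-scoring pair.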

Multiple Sequence Alignment
Why do we need to carry out multiple sequence alignments?
• To make connections between more than two family members
• To reveal conserved family characteristics
An MSA is a 2D table: rows represent individual sequences and columns represent the residue positions.
• Absolute position: property of the sequence
• Relative position: property of the alignment
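The distinction between absolute and relative position can be made concrete with a small helper that maps an alignment column (relative position) back to the residue index in the ungapped sequence (absolute position). The function name and the sample row are hypothetical, chosen only for illustration.

```python
def column_to_residue_index(aligned_row, column):
    """Map a 0-based alignment column to the 0-based index in the ungapped sequence.

    Returns None when the row has a gap in that column.
    """
    if aligned_row[column] == "-":
        return None
    # Count the residues (non-gap characters) that precede this column.
    return sum(1 for ch in aligned_row[:column] if ch != "-")


row = "T-GC-G"  # the sequence TGCG with two gaps inserted by the alignment
for col in range(len(row)):
    print(col, row[col], column_to_residue_index(row, col))
```

The same residue keeps its absolute position whatever the alignment looks like, while its column changes whenever gaps are added to its left.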

MSA: computational complexity
• Aligning two sequences by dynamic programming takes time $O(m_1 m_2)$, where O denotes the order of the time taken by the algorithm and $m_1$ and $m_2$ are the sequence lengths.
• When more sequences are considered, the time complexity becomes $O(m_1 m_2 m_3 \cdots m_l)$, where $m_l$ is the length of the last sequence in the comparison set.
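To get a feel for how quickly this product grows, the sketch below counts the cells of the dynamic programming table for a given set of sequence lengths, together with the number of neighbouring cells each cell has to consult (2^l - 1 for l sequences). The function names and the example lengths are assumptions made for illustration.

```python
import math


def dp_cells(lengths):
    """Number of cells in the DP table: one axis of size (m + 1) per sequence."""
    return math.prod(m + 1 for m in lengths)


def predecessors_per_cell(num_sequences):
    """Each cell looks back along every non-empty subset of the axes."""
    return 2 ** num_sequences - 1


for count in (2, 3, 4, 5):
    lengths = [100] * count
    print(count, dp_cells(lengths), predecessors_per_cell(count))
```

With sequences of length 100, every added sequence multiplies the table size by about 100, which is why exact simultaneous alignment is practical only for small sets of short sequences.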

Simultaneous methods vs progressive methods
• Simultaneous methods: align all the sequences in a given set at once
- Extension of a 2D matrix to three or more dimensions
- The number of dimensions reflects the number of sequences to be aligned
- Work best on small sets of short sequences
• Progressive methods: align pairs of sequences or build sequence clusters
- Use heuristics to reach an alignment in a timely and cost-efficient manner

MSA models
• There are several models for assessing the score of a given multiple sequence alignment. The most popular ones are sum-of-pairs (SP), tree alignment, and consensus alignment.
Note: which of the above models are progressive alignments, and which are based on dynamic programming? (You should be able to answer after a few slides.)

Sum-of-pairs (SP)
• Recall that the standard computational formulation of the pairwise problem is to identify the alignment that maximizes protein sequence similarity, which is typically defined as the sum of substitution matrix scores for each aligned pair of residues, minus some penalties for gaps.
• The mathematically (though not necessarily biologically) exact solution can be found in a fraction of a second for a pair of proteins. This approach is generalized to the multiple sequence case by seeking an alignment that maximizes the sum of similarities for all pairs of sequences, i.e. the sum-of-pairs, or SP, score.

Sum-of-pairs (SP)...
• The SP score for the complete alignment M is the sum of the scores for each column ($m_i$) in the alignment: $SP(M) = \sum_i SP(m_i)$, where the score of a column is the sum of the substitution scores over all pairs of rows in that column.
• Example: We wish to align the following three DNA sequences:
S1 = TGCG
S2 = AGCTG
S3 = AGCG
• We wish to use the SP method to score the following alignments of these three sequences:
Alignment #1    Alignment #2
T-GC-G          TGC-G
-AGCTG          AGCTG
-AGC-G          AGC-G

Sum-of-pairs (SP)...
We will use the following simplified DNA substitution matrix:
• s(x,y) = 1: when x = y [match]
• s(x,y) = -1: when x ≠ y [mismatch]
• s(x,-) = -2: [gap]
• s(-,y) = -2: [gap]
• s(-,-) = 0: to prevent double counting of gaps
A matrix M of pairwise scores per column is then constructed for each alignment.

Sum-of-pairs (SP)...
The SP score for each alignment is calculated by summing the individual scores for each column in the matrix. Using the simplified substitution matrix, the Sum of Pairs method ranks the second alignment as the higher scoring alignment.
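The SP calculation for the two example alignments can be reproduced with a short script. It uses the simplified scores given above (match 1, mismatch -1, residue against gap -2, gap against gap 0); the function names are chosen for this sketch.

```python
from itertools import combinations


def pair_score(x, y):
    """Simplified DNA substitution scores from the example."""
    if x == "-" and y == "-":
        return 0          # gap against gap: not counted twice
    if x == "-" or y == "-":
        return -2         # residue against gap
    return 1 if x == y else -1


def column_score(column):
    """Sum the scores of all pairs of rows within one alignment column."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))


def sp_score(alignment):
    """SP score of the alignment: the sum of its column scores."""
    return sum(column_score(col) for col in zip(*alignment))


alignment_1 = ["T-GC-G", "-AGCTG", "-AGC-G"]
alignment_2 = ["TGC-G", "AGCTG", "AGC-G"]

print([column_score(col) for col in zip(*alignment_1)], sp_score(alignment_1))
print([column_score(col) for col in zip(*alignment_2)], sp_score(alignment_2))
```

Under these scores the first alignment totals -2 and the second totals 4, so the second alignment ranks higher.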

Consensus alignment

Tree alignment

Progressive alignment
It is a heuristic method!
• Up until about 1987, multiple alignments would typically be constructed manually, although a few computer methods did exist. Around that time, algorithms based on the idea of progressive alignment appeared. In this approach, a pairwise alignment algorithm is used iteratively, first to align the most closely related pair of sequences, then the next most similar one to that pair, and so on.
• The rule “once a gap, always a gap” was implemented, on the grounds that the positions and lengths of gaps introduced between more similar pairs of sequences should not be affected by more distantly related ones.

Progressive alignment: CLUSTALW
The three basic steps in the CLUSTAL W approach are shared by all progressive alignment algorithms:
A. Calculate a matrix of pairwise distances based on pairwise alignments between the sequences
B. Use the result of A to build a guide tree, which is an inferred phylogeny for the sequences
C. Use the tree from B to guide the progressive alignment of the sequences
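Steps A and B can be sketched as follows. This is a toy illustration under stated assumptions, not the CLUSTAL W implementation: distances come from a simple global alignment score with a linear gap penalty, the score-to-distance conversion is an arbitrary choice, and the guide tree is built with average-linkage clustering as a stand-in for the tree-building method used by the real program. Step C, the progressive alignment itself, is not implemented. The four sequences are made up.

```python
from itertools import combinations


def global_align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming (score only, no traceback)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Step A: convert pairwise scores to distances (assumed conversion)."""
    dist = {}
    for x, y in combinations(seqs, 2):
        score = global_align_score(seqs[x], seqs[y])
        shorter = min(len(seqs[x]), len(seqs[y]))
        dist[frozenset((x, y))] = 1.0 - score / shorter
    return dist


def average_distance(members_1, members_2, dist):
    """Mean distance between the sequences of two clusters."""
    pairs = [(x, y) for x in members_1 for y in members_2]
    return sum(dist[frozenset(p)] for p in pairs) / len(pairs)


def guide_tree_order(names, dist):
    """Step B: join the two closest clusters until one remains; return the join order."""
    clusters = {name: (name,) for name in names}
    order = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(sorted(clusters), 2),
            key=lambda p: average_distance(clusters[p[0]], clusters[p[1]], dist),
        )
        order.append((c1, c2))
        clusters[f"({c1},{c2})"] = clusters.pop(c1) + clusters.pop(c2)
    return order


seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTACCC", "D": "TTGTAGCA"}
dist = distance_matrix(seqs)
order = guide_tree_order(list(seqs), dist)
print(order)
```

The join order is the order in which a progressive aligner would combine sequences and groups: the closest pair first, then the next closest, and finally the two groups.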

Progressive alignment: CLUSTALW
http://www.ebi.ac.uk/clustalw/
• The basic idea is to use a series of pairwise alignments to align larger and larger groups of sequences, following the branching order of the guide tree. We proceed from the tips of the rooted tree towards the root.
• At each stage a full dynamic programming algorithm is used, with a residue scoring matrix (e.g., a PAM or a BLOSUM matrix) and gap opening and extension penalties. Each step consists of aligning two existing alignments.
• Scores at a position are averages of all pairwise scores for residues in the two sets of sequences, using matrices with only positive values.

Pairwise progressive dynamic programming: liabilities
(1) Dependence on the initial pairwise sequence alignments and the order of alignment. Ordering them from most similar to least similar usually makes biological sense and works very well.
(2) Dependence on substitution matrices and gap penalties.

Common usage of MSA
• Detecting similarities between sequences (closely/distantly related)
• Detecting conserved regions/motifs in the sequences
• Detecting structural homologies, patterns of hydrophobicity/hydrophilicity, gaps etc.
• Thus assisting the improved prediction of secondary and tertiary structures, loops and variable regions
• Predicting features of aligned sequences, such as conserved positions that may have structural or functional importance
• Making patterns or profiles that can be further used to predict new sequences falling in a given family
• Computing a consensus sequence
• Inferring evolutionary trees or linkage (phylogenetic analysis etc.)
• Deriving profiles of hidden Markov models (HMMs) that can be used to remove distant sequences (outliers) from protein families

Applicability of MSA
• Very useful in the development of PCR primers and hybridization probes;
• Great for producing annotated, publication-quality graphics and illustrations;
• Invaluable in structure/function studies through homology inference;
• Recognizable structural conservation between true homologues extends way beyond statistically significant sequence similarity.

Applicability of MSA (contd.)
• Essential for building “profiles” for remote homology similarity searching; and
• Required for molecular evolutionary phylogenetic inference programs.

• For a given group of sequences, there is no single "correct" alignment, only an alignment that is "optimal" according to some set of calculations.
• Determining which alignment is best for a given set of sequences is really up to the judgement of the investigator.
“To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science.” Albert Einstein
