Multiple Sequence Alignment


PowerPoint Slideshow about 'Multiple Sequence Alignment' - Leo


Presentation Transcript
Highly conserved region in MSA (multiple sequence alignment) may imply important functional information.
Families
  • gene family: a set of homologous genes
  • protein family: a set of homologous proteins
  • examples:
    • globin gene family
    • HOX gene family
    • serine/threonine kinase family
Protein families
  • “The large majority of proteins come from no more than one thousand families” (Chothia 1994)
Protein structure
  • amino acid sequence (primary)
  • three-dimensional structure
    • small scale (secondary)
      • alpha-helix, beta sheet, fold
    • large scale (tertiary)
      • domain
    • fully functional protein (quaternary)
Domains
  • A protein may be composed of several domains
  • Each domain carries a specific function
  • Structure is more likely to be conserved than sequence
  • one exon might represent one domain
Related Motivation
  • Gain insight into evolutionary history
    • By looking at the number of mutations necessary to go from one sequence to another, one can assess the time of divergence
Alternatives to SP score
  • What we have now (loglikelihood ratio)
  • A natural extension for aligning 3 sequences (can be unrealistically over-parameterized)
Example
  • VSNS
  • SNA
  • AS
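As a rough illustration of the sum-of-pairs (SP) score that the slides refer to, the sketch below scores one hand-made alignment of the three example sequences. The match, mismatch and gap values, the convention that a gap paired with a gap scores 0, and the alignment itself are assumptions chosen for illustration; none of them come from the slides.

```python
from itertools import combinations

def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols (assumed scoring; gap-gap scores 0)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def sp_score(rows):
    """Sum-of-pairs score: add up pair_score over every pair of rows in every column."""
    assert len({len(r) for r in rows}) == 1, "all rows must have the same length"
    return sum(
        pair_score(x, y)
        for column in zip(*rows)
        for x, y in combinations(column, 2)
    )

# A hypothetical alignment of VSNS, SNA and AS (not taken from the slides).
alignment = ["VSNS", "-SNA", "--AS"]
print(sp_score(alignment))
```

Changing the scoring values or the placement of gaps changes the score, which is why an optimal alignment has to be searched for rather than written down by hand.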
Carrillo & Lipman Algorithm

-- an attempt to reduce the volume of the dynamic programming matrix

3 or more sequences

The optimal alignment path is contained in a "polyhedron" close to the main diagonal. Here, a polyhedron is a solid formed by plane faces, or more complicated 2-dimensional surfaces. For better visualization, the polyhedron's shadows are displayed. While visiting a node and looking for the minimum along all the incoming edges, we can ignore those edges that are "coming from outside the polyhedron", as in the top part of the inset. On its top-left side, the cube is "covered" by the polyhedron. The edges 1, 2, 3, 6 and 7 are coming from the inside, and edges 4 and 5 can be ignored.
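To make the "cube" concrete, here is a minimal sketch of the full three-sequence dynamic program that the polyhedron is meant to prune. It visits every node and considers all seven incoming edges; it does not implement the Carrillo & Lipman bound itself. It maximizes an assumed SP score (match 1, mismatch -1, gap -2, gap-gap 0) rather than minimizing a cost as the figure does, and it is only practical for short sequences because it recurses over the whole cube.

```python
from functools import lru_cache
from itertools import combinations, product

def column_score(column, match=1, mismatch=-1, gap=-2):
    """SP score of a single alignment column (assumed scoring values)."""
    total = 0
    for x, y in combinations(column, 2):
        if x == "-" and y == "-":
            continue  # a gap aligned with a gap contributes nothing
        total += gap if "-" in (x, y) else (match if x == y else mismatch)
    return total

def align3_score(a, b, c):
    """Best SP score over all alignments of three sequences (unpruned cube)."""
    # The seven incoming edges of a node: each sequence either advances or gaps.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]

    @lru_cache(maxsize=None)
    def best(i, j, k):
        if i == j == k == 0:
            return 0
        candidates = []
        for di, dj, dk in moves:
            if di > i or dj > j or dk > k:
                continue  # this edge would start outside the cube
            column = (
                a[i - 1] if di else "-",
                b[j - 1] if dj else "-",
                c[k - 1] if dk else "-",
            )
            candidates.append(best(i - di, j - dj, k - dk) + column_score(column))
        return max(candidates)

    return best(len(a), len(b), len(c))

print(align3_score("VSNS", "SNA", "AS"))
```

A pruned version would skip any node that can be shown to lie outside the polyhedron before recursing into it, which is where the savings come from.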


Progressive Alignment Methods

Most commonly used approach to multiple alignment

Progressive Methods
  • Start with the most closely related sequences, then progressively add less related sequences to the initial alignment
Guide Tree for Progressive Methods

Do NOT confuse with phylogenetic tree


Ad hoc Guide Tree Building

First construct a distance matrix of all pairwise distances
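A minimal sketch of this first step is shown below. The slides do not say which pairwise distance is used, so plain edit distance (Levenshtein) stands in here as an assumption; `closest_pair` is a hypothetical helper that picks the two sequences a progressive method would align first.

```python
def edit_distance(s, t):
    """Levenshtein distance, used here as a stand-in pairwise distance."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, start=1):
        cur = [i]
        for j, y in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = cur
    return prev[-1]

def distance_matrix(seqs):
    """Symmetric matrix of all pairwise distances."""
    n = len(seqs)
    d = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = edit_distance(seqs[i], seqs[j])
    return d

def closest_pair(d):
    """Indices (i, j), i < j, of the smallest off-diagonal entry."""
    n = len(d)
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    return min(pairs, key=lambda p: d[p[0]][p[1]])

seqs = ["VSNS", "SNA", "AS"]
d = distance_matrix(seqs)
print(d, closest_pair(d))
```

A guide tree is then built by repeatedly joining the closest pair and recomputing distances to the merged cluster.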

Problems of Progressive Alignment
  • No guarantee of the global optimal multiple alignment
  • Initial choice of sequences affects the final alignment
  • When sequences are highly divergent, the progressive approach becomes less reliable
The CLUSTALW program
  • Fine tuned version of the above algorithm
    • Sequences are weighted to account for biased representation in large sub-families.
    • Substitution matrix is chosen flexibly
    • Manipulation of gap penalties
Motif Representations

CGGCGCACTCTCGCCCG

CGGGGCAGACTATTCCG

CGGCGGCTTCTAATCCG

...

CGGGGCAGACTATTCCG

  • Consensus: CGGNGCACANTCNTCCG
  • Frequency Matrix
  • Logo
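The first two representations can be sketched in a few lines. The example uses only the four sequences printed on the slide; the elided ones ("...") are not available, so the result is not expected to reproduce the slide's consensus exactly. The rule of writing N when no base exceeds a 50% share is an assumption made for illustration.

```python
from collections import Counter

def count_matrix(sites, alphabet="ACGT"):
    """Per-position base counts for equal-length aligned sites."""
    assert len({len(s) for s in sites}) == 1, "sites must have equal length"
    return [
        {b: Counter(col)[b] for b in alphabet}
        for col in zip(*sites)
    ]

def frequency_matrix(sites, alphabet="ACGT"):
    """Per-position base frequencies (counts divided by number of sites)."""
    n = len(sites)
    return [
        {b: counts[b] / n for b in alphabet}
        for counts in count_matrix(sites, alphabet)
    ]

def consensus(sites, threshold=0.5, alphabet="ACGT"):
    """Most frequent base per position, or N if its share is not above threshold."""
    out = []
    for freqs in frequency_matrix(sites, alphabet):
        base, share = max(freqs.items(), key=lambda kv: kv[1])
        out.append(base if share > threshold else "N")
    return "".join(out)

sites = [
    "CGGCGCACTCTCGCCCG",
    "CGGGGCAGACTATTCCG",
    "CGGCGGCTTCTAATCCG",
    "CGGGGCAGACTATTCCG",
]
print(consensus(sites))
print(frequency_matrix(sites)[3])  # the fourth position, where C and G are tied
```

The frequency matrix keeps the information that the consensus string throws away, which is what a logo then visualizes.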

Logo explanation
  • The characters representing the sequence are stacked on top of each other for each position in the aligned sequences.
  • The height of each letter is made proportional to its frequency, and the most common letter is placed on top.
  • The height of the entire stack is then adjusted to signify the information content of the sequences at that position.
Information Content
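  • Uncertainty: $H(l) = -\sum_{b \in \{A,C,G,T\}} f(b,l) \log_2 f(b,l)$
  • Information: $R(l) = \log_2 4 - H(l) = 2 - H(l)$ bits, where $f(b,l)$ is the frequency of base b at position l (Schneider and Stephens additionally subtract a small-sample correction $e(n)$)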

Thomas D. Schneider and R. Michael Stephens, Nucleic Acids Research, 18: 6097-6100 (1990)
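A small sketch of the calculation for one alignment column follows, assuming a four-letter DNA alphabet and leaving out the small-sample correction of Schneider and Stephens. The function names are hypothetical; `letter_heights` splits the stack height among the letters in proportion to their frequency, as described in the logo explanation.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=4):
    """Bits of information at one position: log2(alphabet size) minus uncertainty."""
    n = len(column)
    uncertainty = -sum(
        (count / n) * math.log2(count / n)
        for count in Counter(column).values()
    )
    return math.log2(alphabet_size) - uncertainty

def letter_heights(column, alphabet_size=4):
    """Height of each letter in the logo stack: frequency times stack height."""
    n = len(column)
    info = information_content(column, alphabet_size)
    return {base: (count / n) * info for base, count in Counter(column).items()}

print(information_content("AAAA"))  # fully conserved column
print(information_content("ACGT"))  # all four bases equally frequent
print(letter_heights("AACC"))
```

A fully conserved column reaches the maximum of 2 bits, and a column where all four bases are equally frequent carries no information.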

Other MSA methods
  • Phylogenetic tree building
  • Alignment using the Sum-of-Pairs scoring scheme can be accomplished in a more probabilistic framework: using profile HMM
  • EM algorithm
Motif Sampler (EM)
  • Lawrence et al. 1993, Liu et al. 1995
  • Model the distribution of residues with multinomial distributions
    • One multinomial distribution per position within the motif
    • One background distribution for positions outside the motif
  • The motif location is missing!
Problem Description
  • Given a set of N sequences $S_1,\dots,S_N$ of lengths $n_k$ ($k = 1,\dots,N$)
  • Identify a single pattern of fixed width W within each of the N input sequences
  • $A = \{a_k\}$ ($k = 1,\dots,N$): the set of starting positions of the common pattern within each sequence, with $a_k \in \{1,\dots,n_k - W + 1\}$
  • Objective: to find the “best,” defined as the most probable, common pattern
Algorithm- Initialization (1)

Choose random starting positions {ak} within the various sequences

$A = \{a_k\}$ ($k = 1,\dots,N$): the set of starting positions of the common pattern within each sequence, with $a_k \in \{1,\dots,n_k - W + 1\}$
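The initialization step can be sketched as follows. Note that the code uses 0-based start positions (0 to $n_k - W$) instead of the 1-based positions on the slide; the function name and the `rng` parameter are hypothetical and only there to make the example reproducible.

```python
import random

def random_starts(seqs, width, rng=random):
    """Choose one random 0-based motif start per sequence.

    Valid starts for a sequence of length n are 0 .. n - width, which
    corresponds to 1 .. n - W + 1 in the slide's 1-based notation.
    """
    for s in seqs:
        if len(s) < width:
            raise ValueError("every sequence must be at least as long as the motif")
    return [rng.randrange(len(s) - width + 1) for s in seqs]

seqs = ["ACGTACGTAC", "TTACGGA", "CCGTAAGT"]
print(random_starts(seqs, width=4, rng=random.Random(0)))
```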


N = 6, W = 10

$q_{1,A} = 3/5$, $q_{2,G} = 2/5$, …

$q_{1,G} = 0$

Algorithm- Predictive Update (2)
  • One of the N sequences, Z, is chosen either at random or in specified order.
  • The pattern description $q_{i,j}$ and background frequencies $q_{0,j}$ are then calculated from the current positions in all sequences excluding Z.
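A minimal sketch of the predictive update step is given below. It counts residues at the current motif positions of every sequence except Z and turns the counts into frequencies. The pseudocount added to every cell (to avoid zero probabilities) is an assumption of this sketch, as are the function and parameter names; start positions are 0-based.

```python
def predictive_update(seqs, starts, z, width, alphabet="ACGT", pseudo=1.0):
    """Estimate motif frequencies q[i][j] and background q0[j], leaving out seqs[z]."""
    motif_counts = [{b: 0 for b in alphabet} for _ in range(width)]
    bg_counts = {b: 0 for b in alphabet}
    for k, (s, a) in enumerate(zip(seqs, starts)):
        if k == z:
            continue  # the held-out sequence Z does not contribute
        for pos, residue in enumerate(s):
            if a <= pos < a + width:
                motif_counts[pos - a][residue] += 1
            else:
                bg_counts[residue] += 1
    n_sites = len(seqs) - 1
    total_pseudo = pseudo * len(alphabet)
    q = [
        {b: (c[b] + pseudo) / (n_sites + total_pseudo) for b in alphabet}
        for c in motif_counts
    ]
    bg_total = sum(bg_counts.values())
    q0 = {b: (bg_counts[b] + pseudo) / (bg_total + total_pseudo) for b in alphabet}
    return q, q0

q, q0 = predictive_update(["AACGT", "CGTAA", "TTTTT"], [2, 0, 0], z=2, width=3)
print(q[0], q0)
```

Without pseudocounts, a residue that happens to be absent from a motif column (such as $q_{1,G} = 0$ in the slide example) could never be selected at that position again.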
Calculate the new multinomial frequencies if the motif starts at a given location in Z
  • Background frequencies $q_{0,j}$ are calculated analogously, with counts taken over all non-motif positions
  • Find the most “reasonable” location in Z
  • Iterate!
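One way to find a "reasonable" location in Z is to weight every possible start by the ratio of its probability under the motif model to its probability under the background model, and then either take the largest weight or draw a start at random in proportion to the weights. The sketch below shows both; which of the two the slides intend is not stated, so treat the choice as an assumption. `q` is a list of per-position frequency dicts and `q0` a background frequency dict, with 0-based positions.

```python
import random

def position_weights(z_seq, q, q0):
    """Motif-to-background likelihood ratio for every possible start in z_seq."""
    width = len(q)
    weights = []
    for x in range(len(z_seq) - width + 1):
        ratio = 1.0
        for i in range(width):
            residue = z_seq[x + i]
            ratio *= q[i][residue] / q0[residue]
        weights.append(ratio)
    return weights

def best_start(z_seq, q, q0):
    """Deterministic choice: the start with the largest ratio."""
    weights = position_weights(z_seq, q, q0)
    return weights.index(max(weights))

def sample_start(z_seq, q, q0, rng=random):
    """Stochastic choice: draw a start with probability proportional to its ratio."""
    weights = position_weights(z_seq, q, q0)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Toy model (made up for illustration): the motif prefers "AC".
q = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
q0 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(position_weights("GACT", q, q0), best_start("GACT", q, q0))
```

Iterating means cycling through the sequences: hold one out, re-estimate the frequencies from the others, pick a new start in the held-out sequence, and repeat.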

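$A_x = \frac{Q_x}{B_x} = \prod_{i=1}^{W} \frac{q_{i,\,r_{x+i-1}}}{q_{0,\,r_{x+i-1}}}$

Here $Q_x$ is the probability of generating the segment of Z that starts at x under the motif model, $B_x$ is its probability under the background model, and $r_{x+i-1}$ is the residue at position $x+i-1$ of Z.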

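Select the set of $a_k$'s that maximizes the product of these ratios, or equivalently its logarithm F:

$F = \sum_{i=1}^{W} \sum_{j \in \{A,T,G,C\}} c_{i,j} \log \frac{q_{i,j}}{q_{0,j}}$

where $c_{i,j}$ is the number of times residue j occurs at motif position i.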
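The score F can be computed directly from the counts, as in the sketch below. It assumes 0-based start positions, a list `q` of per-position frequency dicts and a background dict `q0`; these names and the toy values are illustrative only. Residues with a zero count contribute nothing, so only observed residues are summed.

```python
import math
from collections import Counter

def motif_score(seqs, starts, q, q0):
    """F = sum over positions i and residues j of c[i][j] * log(q[i][j] / q0[j])."""
    width = len(q)
    # c[i][j]: how often residue j appears at motif position i across all sequences
    counts = [
        Counter(s[a + i] for s, a in zip(seqs, starts))
        for i in range(width)
    ]
    return sum(
        c * math.log(q[i][j] / q0[j])
        for i in range(width)
        for j, c in counts[i].items()
    )

# Toy model (made up for illustration): the motif prefers "AC".
q = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
q0 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(motif_score(["GACT", "ACGG"], [1, 0], q, q0))
```

Tracking F across iterations gives a simple way to compare runs started from different random positions and keep the best one.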