Multiple Sequence Alignment PowerPoint PPT Presentation






Slide1 l.jpg

An Introduction to Bioinformatics

Multiple Sequence Alignment


Slide2 l.jpg

AIMS

To introduce the different approaches to multiple sequence alignment

To identify criteria for selecting a multiple sequence alignment program

OBJECTIVES

To select an appropriate multiple sequence alignment program

To carry out a multiple sequence alignment using CLUSTALX


Slide3 l.jpg

The result of searching databases is the establishment of a list of sequences, either protein or nucleotide, which exhibit significant similarity and are inferred to be homologous

These sequences can then be subjected to multiple sequence alignment

Multiple sequence alignment is the process of attempting to place in each column residues that derive from a common ancestral residue by substitution

The most successful alignment is the one that most closely represents the evolutionary history of the sequences


Slide4 l.jpg

Why create multiple sequence alignments?

to attempt a phylogenetic analysis of the sequences so as to construct evolutionary trees

the identification of functional sites

the identification of modules in multimodular proteins

the identification of motifs

the detection of weak similarities in databases using profiles

the design of PCR primers for the identification of related genes


Slide5 l.jpg

Global versus local alignments

Things would be much simpler if we only considered sequences that are homologous over their entire length and could be globally aligned

Homology is often restricted to certain regions of sequence

Many proteins are multi-modular and the shuffling of modules is part of the evolutionary process

An attempt to align, over their entire length, sequences that share some, but not all, of their modules would be bound to lead to errors

In such a case a series of multiple local sequence alignments, one for each of the modules, would be appropriate


Slide6 l.jpg

Substitutions and Gaps

In trying to establish the evolutionary trajectories of a group of related sequences, the same problem is encountered as in pairwise alignment

How do you deal with substitutions and gaps?

The solution is the same

Use gap penalties, gap extension penalties and substitution matrices such as PAM and BLOSUM
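
As a rough illustration of how these penalties combine, the sketch below scores two sequences that have already been aligned. It is a minimal example under stated assumptions: the function name `score_alignment` is hypothetical, a flat match/mismatch score stands in for a real PAM or BLOSUM matrix, and the convention chosen here is that the first position of a gap costs `gap_open` while each further position costs `gap_extend`. The default values are arbitrary.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-10, gap_extend=-1):
    """Score two already-aligned sequences of equal length ('-' marks a gap).

    match/mismatch are a toy stand-in for a substitution matrix such as
    PAM or BLOSUM; gap_open and gap_extend are illustrative values only.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    open_gap = None  # which sequence ('a' or 'b') the current gap run is in
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # a column containing only gaps carries no information
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # first gap position pays gap_open, later ones pay gap_extend
            score += gap_extend if side == open_gap else gap_open
            open_gap = side
        else:
            score += match if x == y else mismatch
            open_gap = None
    return score


if __name__ == "__main__":
    print(score_alignment("AC--GT", "ACTTGT"))
```

Separating the opening and extension costs makes one long gap cheaper than several short ones, which is the usual motivation for having two gap parameters.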


Slide7 l.jpg

There are essentially four major approaches to multiple sequence alignment:

Optimal global sequence alignment

Progressive global alignment

Block-based global alignment

Motif-based local alignment


Slide8 l.jpg

Optimal global sequence alignment

Attempts to align sequences along their entire length.

‘Optimal’ means that it will give the best alignment amongst all the possible solutions for a given scoring scheme

Whether the optimal alignment corresponds with the biologically correct alignment will depend on a variety of factors, e.g. the substitution matrix, the gap penalty and the scoring scheme

Optimal global sequence alignment programs are computationally very intensive and the complexity of the task increases exponentially with the number of sequences

There are few programs which employ this approach - there is one available on the Web
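
To give a feel for the exponential growth mentioned above, the sketch below counts the cells in a full dynamic-programming lattice. This assumes the optimal alignment is found by extending pairwise dynamic programming to one axis per sequence, which the slides do not state explicitly; the function name `lattice_cells` and the equal-length simplification are also assumptions made for illustration.

```python
def lattice_cells(seq_len, n_seqs):
    """Number of cells in a full dynamic-programming lattice.

    Assumes n_seqs sequences that all have length seq_len; each axis needs
    seq_len + 1 positions (the extra one is the empty prefix).
    """
    if seq_len < 0 or n_seqs < 1:
        raise ValueError("seq_len must be >= 0 and n_seqs >= 1")
    return (seq_len + 1) ** n_seqs


if __name__ == "__main__":
    for n in range(2, 7):
        print(n, "sequences of length 100:", lattice_cells(100, n), "cells")
```

Each extra sequence multiplies the lattice size by roughly the sequence length, which is why this approach is only practical for a handful of sequences.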


Slide10 l.jpg

Progressive global alignment

Employs multiple pairwise alignments in a series of three steps:

1. Estimate alignment scores between all possible pairwise combinations of sequences in the set

2. Build a ‘guide tree’ determined by the alignment scores

3. Align the sequences on the basis of the guide tree

Each step can be carried out in a number of ways designed to increase speed or accuracy

Progressive global alignment is the most commonly used method, and the best known programs employing this approach are the CLUSTAL family
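
Steps 1 and 2 above can be sketched as follows. This is a simplified illustration, not the method of any particular program: a k-mer set distance stands in for true pairwise alignment scores, the tree is built by plain average-linkage clustering, and the names `kmer_distance` and `guide_tree` are hypothetical. Step 3 (aligning sequences and profiles in tree order) is not shown.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """All overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """1 - Jaccard similarity of k-mer sets: a crude stand-in for an alignment score."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def guide_tree(seqs, k=2):
    """seqs: dict of name -> sequence. Returns nested tuples giving the merge order."""
    names = list(seqs)
    if not names:
        raise ValueError("at least one sequence is required")
    # Step 1: a distance for every pair of sequences
    dist = {frozenset(p): kmer_distance(seqs[p[0]], seqs[p[1]], k)
            for p in combinations(names, 2)}
    # Each cluster is (subtree, list of leaf names)
    clusters = [(name, [name]) for name in names]

    def linkage(c1, c2):
        # average distance over all leaf pairs between the two clusters
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    # Step 2: repeatedly merge the two closest clusters
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


if __name__ == "__main__":
    example = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG"}
    print(guide_tree(example))
```

The nested tuples give the order in which a progressive aligner would combine sequences: the most similar pair is aligned first and more distant sequences are added later.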


Slide16 l.jpg

Block-based global alignment

Divides the sequences into blocks which, depending on the program, are exact (identical regions of sequence) or not exact, and uniform (found in every sequence) or not uniform

Once the blocks have been defined, other approaches are employed to align regions between the blocks

Once blocks have been identified, other programs (e.g. CLUSTAL X) can be used to multiply align individual modules

Examples of block-based global alignment programs available on the Web are DCA and DIALIGN2


Slide18 l.jpg

Motif-based local alignment

Most recent local alignment programs employ computationally efficient heuristics to solve optimization calculations for local alignments

The Gibbs iterative sampling approach is used to find blocks in programs such as the excellent MACAW

MACAW, although available as freeware, is not available as a Web-based application

MEME is Web-based


Slide23 l.jpg

Which method to use

Optimal global alignment programs are rarely employed:

they are computationally intensive

they can only handle a very small number of sequences

When the sequences to be aligned are homologous over their entire length, a progressive global alignment program should be used.

Where the sequences share conserved modules in a consistent order, block-based global alignment or motif-based local alignment is appropriate

Where the sequences share conserved modules, but the order of modules is not consistent, a motif-based local alignment is the approach of choice


Slide24 l.jpg

Multiple sequence alignment file types

The various multiple sequence alignment programs will require different input file types and there is also a variety of output file types

The sequences to be aligned are usually placed in a single file, commonly in the Fasta format

The common output file formats are: NBRF/PIR, EMBL/SWISS-PROT, Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup), GCG9/RSF and GDE flat file

Multiple sequence files can be interconverted using


Slide25 l.jpg

Sequence formats that allow one or more sequences:

  • IG/Stanford, used by Intelligenetics and others

  • * GenBank/GB, genbank flatfile format

  • * NBRF format

  • * EMBL, EMBL flatfile format

  • * DNAStrider, for common Mac program

  • * Fitch format, limited use

  • * Pearson/Fasta, a common format used by Fasta programs and others

  • * Zuker format, limited use. Input only.

  • * Olsen, format printed by Olsen VMS sequence editor. Input only.

  • * Phylip3.2, sequential format for Phylip programs

  • * Phylip, interleaved format for Phylip programs (v3.3, v3.4)

  • + MSF multi sequence format used by GCG software

  • + PAUP's multiple sequence (NEXUS) format

  • + PIR/CODATA format used by PIR

  • + ASN.1 format used by NCBI


Slide26 l.jpg

Phylip

The first line of the input file contains the number of species and the number of characters, separated by blanks. The information for each species follows, starting with a ten-character species name (which can include punctuation marks and blanks), and continuing with the characters for that species. Phylip format files can be interleaved, as in the example below, or sequential.

7 123
seq1      ---------- ---------- ---KSKERYK DENGGNYFQL REDWWDANRE
seq2      ---------- -----YEGLT TANGXKEYYQ DKNGGNFFKL REDWWTANRE
seq3      ---------- ---------- ----SQRHYK D-DGGNYFQL REDWWTANRH
seq4      ---------- ---------- NVAALKTRYE K-DGQNFYQL REDWWTANRA
seq5      ----KRIYKK IFKEIHSGLS TKNGVKDRYQ N-DGDNYFQL REDWWTANRS
seq6      ------FSKN IX--QIEELQ DEWLLEARYK D--TDNYYEL REHWWTENRH
seq7      ---------- ---------- ---------- ---------- ---------K

          TVWKAITCNA --GGGKYFRN TCDG--GQNP TETQNNCRCI G---------
          TVWKAITCGA P-GDASYFHA TCDSGDGRGG AQAPHKCRCD G---------
          TVWEAITCSA DKGNA-YFRR TCNSADGKSQ SQARNQCRC- --KDENGKN-
          TIWEAITCSA DKGNA-YFRA TCNSADGKSQ SQARNQCRC- --KDENGXN-
          TVWKALTCSD KLSNASYFRA TC--SDGQSG AQANNYCRCN GDKPDDDKP-
          TVWEALTCEA P-GNAQYFRN ACS----EGK TATKGKCRCI SGDP------
          ELWEALTCSR P-KGANYFVY KLD-----RP KFSSDRCGHN YNGDP-----
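
The layout described above can be read with a few lines of code. The sketch below is a minimal, hypothetical parser (`parse_phylip_interleaved` is not part of the Phylip package) that assumes strict ten-character names in the first block, no names in later blocks, and blank lines only between blocks.

```python
def parse_phylip_interleaved(text):
    """Parse interleaved Phylip text into (n_species, n_chars, {name: sequence}).

    Assumes the first block carries ten-character names and later blocks
    carry sequence characters only, in the same species order.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")
    n_species, n_chars = (int(x) for x in lines[0].split()[:2])
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        if i < n_species:
            # first block: columns 1-10 hold the name, the rest is sequence
            names.append(line[:10].strip())
            seqs.append(line[10:].replace(" ", ""))
        else:
            # later blocks: continue each species in the original order
            seqs[i % n_species] += line.replace(" ", "")
    return n_species, n_chars, dict(zip(names, seqs))


if __name__ == "__main__":
    demo = "2 12\n" + "seqA".ljust(10) + "ACGTAC\n" + "seqB".ljust(10) + "AC-TAC\n\nGGTTAA\nGG--AA\n"
    print(parse_phylip_interleaved(demo))
```

Slicing the first ten columns, rather than splitting on whitespace, matters because species names may themselves contain blanks.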


Slide27 l.jpg

clustal

Clustal format files contain the word CLUSTAL at the beginning. Sequences can be interleaved (as in the example below) or sequential.

(Note: the multiple sequence alignment programs ClustalW and ClustalX produce Clustal format files by default, but you can specify in "output format options" if you want your results in a different format.)

CLUSTAL W (1.74) multiple sequence alignment

seq1      -----------------------KSKERYKDENGGNYFQLREDWWDANRETVWKAITCNA
seq2      ---------------YEGLTTANGXKEYYQDKNGGNFFKLREDWWTANRETVWKAITCGA
seq3      ----KRIYKKIFKEIHSGLSTKNGVKDRYQN-DGDNYFQLREDWWTANRSTVWKALTCSD
seq4      ------------------------SQRHYKD-DGGNYFQLREDWWTANRHTVWEAITCSA
seq5      --------------------NVAALKTRYEK-DGQNFYQLREDWWTANRATIWEAITCSA
seq6      ------FSKNIX--QIEELQDEWLLEARYKD--TDNYYELREHWWTENRHTVWEALTCEA
seq7      -------------------------------------------------KELWEALTCSR

seq1      --GGGKYFRNTCDG--GQNPTETQNNCRCIG----------ATVPTYFDYVPQYLRWSDE
seq2      P-GDASYFHATCDSGDGRGGAQAPHKCRCDG---------ANVVPTYFDYVPQFLRWPEE
seq3      KLSNASYFRATC--SDGQSGAQANNYCRCNGDKPDDDKP-NTDPPTYFDYVPQYLRWSEE
seq4      DKGNA-YFRRTCNSADGKSQSQARNQCRC---KDENGKN-ADQVPTYFDYVPQYLRWSEE
seq5      DKGNA-YFRATCNSADGKSQSQARNQCRC---KDENGXN-ADQVPTYFDYVPQYLRWSEE
seq6      P-GNAQYFRNACS----EGKTATKGKCRCISGDP----------PTYFDYVPQYLRWSEE
seq7      P-KGANYFVYKLD-----RPKFSSDRCGHNYNGDP---------LTNLDYVPQYLRWSDE
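
An interleaved Clustal file like the one above can be stitched back into one string per sequence, and from there written out in another format such as Fasta. The sketch below is a minimal example rather than a full converter: `parse_clustal` and `to_fasta` are hypothetical names, and the parser assumes that names contain no blanks and that conservation lines begin with whitespace.

```python
def parse_clustal(text):
    """Parse interleaved Clustal text into {name: aligned sequence}."""
    lines = text.splitlines()
    if not lines or not lines[0].upper().startswith("CLUSTAL"):
        raise ValueError("not a Clustal file: header line missing")
    seqs = {}
    for line in lines[1:]:
        # skip blank separators and conservation lines (they start with whitespace)
        if not line.strip() or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, chunk = parts[0], parts[1]
        seqs[name] = seqs.get(name, "") + chunk  # append this block's residues
    return seqs


def to_fasta(seqs, width=60):
    """Write {name: sequence} as Fasta text, wrapping lines at `width` characters."""
    out = []
    for name, seq in seqs.items():
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    demo = ("CLUSTAL W (1.74) multiple sequence alignment\n\n"
            "seq1 AC-GT\nseq2 ACGGT\n     ** **\n\n"
            "seq1 TTAA\nseq2 TT-A\n")
    print(to_fasta(parse_clustal(demo)))
```

Gap characters are kept in the output so that the Fasta file still represents the alignment rather than the raw sequences.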

