computational molecular biology biochem 218 biomedical informatics 231 http biochem218 stanford edu l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Computational Molecular Biology Biochem 218 – BioMedical Informatics 231 http://biochem218.stanford.edu/ PowerPoint Presentation
Download Presentation
Computational Molecular Biology Biochem 218 – BioMedical Informatics 231 http://biochem218.stanford.edu/

Loading in 2 Seconds...

play fullscreen
1 / 62

Computational Molecular Biology Biochem 218 – BioMedical Informatics 231 http://biochem218.stanford.edu/ - PowerPoint PPT Presentation


  • 244 Views
  • Uploaded on

Computational Molecular Biology Biochem 218 – BioMedical Informatics 231 http://biochem218.stanford.edu/. Multiple Sequence Alignment. Doug Brutlag Professor Emeritus Biochemistry & Medicine (by courtesy). Homework 4: Sequence Alignment & Search.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Computational Molecular Biology Biochem 218 – BioMedical Informatics 231 http://biochem218.stanford.edu/' - clayland


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
computational molecular biology biochem 218 biomedical informatics 231 http biochem218 stanford edu
Computational Molecular BiologyBiochem 218 – BioMedical Informatics 231http://biochem218.stanford.edu/

Multiple Sequence Alignment

Doug Brutlag

Professor Emeritus

Biochemistry & Medicine (by courtesy)

homework 4 sequence alignment search
Homework 4: Sequence Alignment & Search
  • Part 1) Examining effect of gap penalties on sequence alignment
  • Please choose two proteins of interest which are only 30 to 50% identical and which have gaps in their alignment. The easiest way to find two such proteins is to choose one protein of interest to you and use it as a query in a BLAST search (either with UniProt or with NCBI BLAST) and then choose the second protein with a score 30 to 50% of the maximal score. Ensure that the alignment of these two proteins contains a few gaps.
  • Align your two protein sequences using SeqWeb's Bestfit or the SIM Alignment tool. Repeat the alignment with the same two sequences using the gap penalties 4, 8, 16, 32, and 64 . Keep the ratio of the gap penalty to the gap extension penalty the same in all cases (a 4 to 1 ratio is fine).
  • Examine the alignments and describe the effect of raising the gap penalty on the number and arrangements of gaps seen in these alignments. Mention which of the alignments has the highest overall bit score (or quality) and comment if the gaps disrupt a biologically important site (i.e. do they interrupt known functional motifs or structural features revealed by InterPro or MyHits?). Which gap penalties give the most biological alignment in your opinion.
  • Please show the alignments to support your conclusions.
homework 4 sequence alignment search3
Homework 4: Sequence Alignment & Search
  • Part 2) Comparing Smith-Waterman, UNGAPPED and GAPPED BLAST searches.
  • Find a protein-family containing only 50-100 members by examining different motifs in the PROSITE database.. You should print out and keep your "gold standard family list from from UniProt.
  • Choose one sequence of the family as a query and use the Decypher supercomputer to perform UnGAPPED Tera-BLAST, Banded SW Tera-BLAST and standard Smith-Waterman algorithm searches of the UniProt/SwissProt database. Make sure that in each case you collect at least 100 sequences in your result set (or twice the expected number of family members). Also be sure to turn query filtering OFF (the default is ON).
  • Now, compare the three searches using the Receiver-Operator-Characteristic (ROC) curve. For each search, draw a line across each output list after every 10th sequence and count the number of true positives above that line and the number of false positives above that line. Remember that the gold standard determines whether a sequence is a true positive. Continue until you have collected at least 50 false positives (ROC50 curve). Finally, plot the number of True Positive sequences versus the number of False Positive sequences on a two dimensional graph for each search.
  • Do the three searches have identical shaped curves? Is one curve higher than the other? If so, which search is the best for your protein family? Please send the actual ROC curves (either a graphics file or an EXCEL spreadsheet with the graphs in place) to support your conclusions.
  • Send the results to homework218@cmgm.Stanford.EDU.
evaluation of search algorithms

Negatives

Sensitivity=

TP/(TP+FN)

Specificity=

TN/(TN+FP)

Positives

TN

TP

FN

FP

Evaluation of Search Algorithms
evaluation of search algorithms with receiver operator characteristic curve

100

Area Under Curve (AUC)

Number True Positives

0

0

Number False Positives

200

Evaluation of Search Algorithms withReceiver-Operator Characteristic Curve
general dna similarity search principles
General DNA Similarity Search Principles
  • Search both Strands
  • Translate ORFs and cDNAs
  • Use most sensitive search possible
    • UnGapped BLAST for infinite gap penalty (PCR & CHIP oligos)
    • Gapped BLAST for most searches
    • Smith Waterman or megaBLAST or discontinuous MegaBLAST for cDNA/genome comparisons
    • cDNA =>Zero gap-length penalty
    • Consider using transition matrices
    • Ensure that expected value of score is negative
  • Examine results with exp. between 0.05 and 10
  • Reevaluate results of borderline significance using limited query
general protein similarity search principles
General Protein Similarity Search Principles
  • Chose between local or global search algorithm
  • Use most sensitive search algorithm available
    • Original BLAST for no gaps
    • Smith-Waterman for most flexibility
    • Gapped BLAST for well delimited regions
    • PSI-BLAST for families
    • Initially BLOSUM62 and default gap penalties
    • If no significant results, use BLOSUM30 and lower gap penalties
  • Examine results between exp. 0.05 and 10 for biological significance
  • Beware of long hits or those with unusual amino acid composition
  • Reevaluate results of borderline significance using limited query
goals of multiple sequence alignment
Goals of Multiple Sequence Alignment
  • Determine Consensus Sequences
    • Prosite Patterns
  • Building Gene Families
    • InterPro, Prints, ProDom, pFAM, DOMO, COGs, KOGs
  • Develop Relationships & Phylogenies
    • Clusters, COGs, KOGs, ClusTR
    • Relationships
    • Evolutionary Models
    • UPGMA, Neighbor Joining, Phylip, GrowTree, PAUP
  • Model Protein Structures for Threading and Fold Prediction
    • Profiles, Templates, HSSP, FSSP, SwissModel
    • Hidden Markov Models, pFAM, SAM, SuperFamily
    • Network Models, Neural Nets, Bayesian Networks
    • Statistical Models, Generalized Linear Models
slide14
Block Maker Makes a Multiple Sequence Alignmenthttp://blocks.fhcrc.org/blocks/blockmkr/make_blocks.html

PLLLWLNGGPGCSSIGYGASEEIG

PLVLWFNGGPGCSSVGFGAFEELG

PLMIWLTGGPGCSGLSSFVYEIGP

PLMIWLTGGPGCSGLSTFLYEFGP

PLLLWLSGGPGCSSLTGLLFENGP

PLVLWLNGGPGCSSVAYGAAEEIG

PVVIWLTGGPGCSSELALFYENGP

PLVIWFNGGPGCSSLGGAFKELGP

PLVIWFNGGPACSSLGGAFLELGP

PLVLWLNGGPGCSSLYGAFQELGP

PLVLWLNGGPGCSSIAYGASEEVG

PLTLWLNGGPGCSSVGGGAFTELG

NLQGYMLGNP

NFMGYMVGNG

NLKGFLVGNA

NLKGILIGNA

NLKGFAIGNG

NFKGYLVGNG

NLKGFIVGNP

NIKGYIQGNA

NLKGFMIGNA

NLQGYILGNP

NFKGFMVGNA

NLQGYVLGNP

TVKQWSGYMDYKDS

GVNQYSGYLSVGSN

SFAHYAGYVTVSED

DFAQYAGYVTVDAA

DLGHHAGYYKLPKS

SVESYSGFMTVDAK

GVKSYTGYLLANAT

NFKQYSGYYNVGTK

NFKSYSGYVDANAN

NFKHYSGFFQVSDN

DFFHYSGYLRAWTD

TVKQYTGYLDVEDD

10-45

25-55

40

sequence profiles http seqweb stanford edu 81 gcg bin analysis cgi program hmmerpfam
Sequence Profileshttp://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=hmmerpfam
seqweb sequence profile search http seqweb stanford edu 81 gcg bin analysis cgi program hmmerpfam
SeqWeb Sequence Profile Searchhttp://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=hmmerpfam
hidden markov models http www cse ucsc edu research compbio sam html

D 2

D 3

D 4

D 5

I 1

I 2

I 3

I 4

I 5

AA1

AA2

AA3

AA4

AA5

AA6

Hidden Markov Modelshttp://www.cse.ucsc.edu/research/compbio/sam.html
evolutionary trees

3

1

X

Y

4

2

R2

Z

3

1

5

Y

Z

R1

4

2

X

5

1

2

X

R1

R2

Z

3

Y

5

4

Evolutionary Trees
challenges aligning multiple sequences
Challenges Aligning Multiple Sequences
  • Computational complexity O(nk) for k sequences n long
  • Space requirements O(nk) for k sequences n long
  • Sequence clusters require weighting function
  • Weighted alignments tend to overweight erroneous sequences
  • Approximations must be used for real world data
    • Linked lists used to find exact words shared between k sequences
    • BLAST can find inexact shared words between k sequences
    • FASTA can be used to do progressive pair-wise alignments
    • HMM Pair models find best overall alignment probabilistically
  • Pairwise comparisons followed by Progressive Alignments
  • Final alignment is often dependent on order data presented
  • Gaps make alignment unnaturally long
carrillo lipman limits for msa http searchlauncher bcm tmc edu multi align multi align html
Carrillo-Lipman Limits for MSAhttp://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html
gaps are propagated to make alignment
Gaps Are Propagated To Make Alignment

SEQ2: MQQL

SEQ1: MGQF

SEQ2: MQQL

SEQ3: MKQL

DNPYIVRMIGICEAE-SWM

DHPNIIRLEGVVTKSRPVM

DNPYIVRMIGICEAESWM

QHPRLVRLYAVVTQEPIY

DNPYIVRMIGICEAE-SWM

DHPNIIRLEGVVTKSRPVM

DNPYIVRMIGICEAESWM

QHPRLVRLYAVVTQEPIY

SWM

PIY

SEQ2: MQQL-DNPYIVRMIGICEAE-SWM

SEQ4: MKMIGKHKNIINLLGACTQDGPLY

clustalw step 1 blosum distance matrix
ClustalW Step 1: BLOSUM Distance Matrix

ClustalW Step 2: Dendrogram

probcons do et al genome research 15 330 34044
ProbConsDo, et al. Genome Research 15, 330-340
  • Step 1: Compute posterior probability matrices of each pair of aligned sequences from the pair-HMM model
  • Step 2: Compute expected accuracies of pairwise alignments.
  • Step 3: Probabilistic Consistency Transformation
  • Step 4: Calculate Guide Tree using UPGMA from measure of similarities of sequence pairs
  • Step 5: Progressive alignment
  • Step 6: refinment by dividing sequences into two groups and re-align. Repeat multiple times.
seqweb clustalw http seqweb stanford edu 81 gcg bin analysis cgi program clustalw prot
SeqWeb ClustalWhttp://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=clustalw-prot
slide49
SeqWeb ClustalW MSA Parametershttp://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=clustalw-prot
seqweb clustalw alignment http seqweb stanford edu 81 gcg bin analysis cgi program clustalw prot
SeqWeb ClustalW Alignmenthttp://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=clustalw-prot
seqweb clustalw text output http seqweb stanford edu 81 gcg bin analysis cgi program clustalw prot
SeqWeb ClustalW Text Outputhttp://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=clustalw-prot
seqweb pileup input http seqweb stanford edu 81 gcg bin analysis cgi program pileup prot
SeqWeb Pileup Inputhttp://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=pileup-prot
seqweb pileup input http seqweb stanford edu 81 gcg bin analysis cgi program pretty prot
SeqWeb Pileup Inputhttp://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=pretty-prot
seqweb pileup dendrogram http seqweb stanford edu 81 gcg bin analysis cgi program pileup prot
SeqWeb Pileup Dendrogramhttp://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=pileup-prot
seqweb pretty input http seqweb stanford edu 81 gcg bin analysis cgi program pretty prot
SeqWeb Pretty Inputhttp://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=pretty-prot
slide56
SeqWeb Pretty Alignment and Consensushttp://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=pileup-prot