Design and creation of multiple sequence alignments Unit 15

Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD

IPA 6.0 license • Need a list of e-mails to create accounts • Will have a 6-week license (instead of 2 weeks) • Problem Set 3 is Pathway Analysis; the Lab of March 19 will also use IPA

Problem Set 2 Review • Sensitivity and Specificity • Parameters for Multiple Alignment (Databases, Search Terms, Scores) • Transfac • Dotplots

Gene prediction flowchart Fig 5.15 Baxevanis & Ouellette 2005

Evaluation of Splice Site Prediction: What do measures really mean? Fig 5.11 Baxevanis & Ouellette 2005 (note typo in B&O)

ROC curves (plots of Sn vs (1-Sp)) • A receiver operating characteristic (ROC) curve is a graphical plot of the sensitivity vs. (1 - specificity) for a binary classifier system as its discrimination threshold is varied. • The sensitivity and specificity of a diagnostic test depend on more than just the "quality" of the test; they also depend on the definition of what constitutes an abnormal test.
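The threshold sweep described above can be sketched in a few lines of Python. This is a minimal illustration, not a method from the lecture: the function name `roc_points` and the toy scores and labels are made up, and specificity is assumed to mean TN/(TN+FP).

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per threshold.

    scores: classifier scores (higher = more likely a true site).
    labels: 1 for an actual positive, 0 for an actual negative.
    Assumes at least one positive and one negative label.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    # Use every observed score as a threshold, strictest first.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        sensitivity = tp / positives
        false_positive_rate = fp / negatives  # equals 1 - specificity
        points.append((false_positive_rate, sensitivity))
    return points


# Hypothetical scores and labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
for fpr, sn in points:
    print(f"1-Sp = {fpr:.2f}  Sn = {sn:.2f}")
```

Lowering the threshold moves along the curve from the lower left (few predictions, few false positives) to the upper right (everything predicted positive).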

Evaluation of Splice Site Prediction
                  Actual True   Actual False
Predicted True    TP            FP             PP=TP+FP
Predicted False   FN            TN             PN=FN+TN
                  AP=TP+FN      AN=FP+TN
• Sensitivity (= Coverage): • Specificity: • Misclassification rates: • Normalized specificity:

                  Actual True   Actual False
Predicted True    TP            FP             PP=TP+FP
Predicted False   FN            TN             PN=FN+TN
                  AP=TP+FN      AN=FP+TN
Careful: different definitions for "Specificity"
• Sensitivity: • Specificity: (Brendel definitions)
cf. Guigó definitions:
Sn: Sensitivity = TP/(TP+FN)
Sp: Specificity = TN/(TN+FP) = Sp-
AC: Approximate Correlation = 0.5 x ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) - 1
Other measures? Predictive Values, Correlation Coefficient
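As a worked example, the Sn, Sp and AC formulas listed above can be computed directly from the four cells of the table. The sketch below is only an illustration: the function name `splice_site_measures` and the counts are invented, and all row and column totals are assumed to be non-zero.

```python
def splice_site_measures(tp, fp, fn, tn):
    """Compute Sn, Sp and AC from confusion-matrix counts.

    Uses the formulas given on the slide:
      Sn = TP/(TP+FN), Sp = TN/(TN+FP),
      AC = 0.5 * (TP/(TP+FN) + TP/(TP+FP) + TN/(TN+FP) + TN/(TN+FN)) - 1
    Assumes every denominator is non-zero.
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    ppv = tp / (tp + fp)   # fraction of predicted positives that are real
    npv = tn / (tn + fn)   # fraction of predicted negatives that are real
    ac = 0.5 * (sn + ppv + sp + npv) - 1
    return {"Sn": sn, "Sp": sp, "PPV": ppv, "NPV": npv, "AC": ac}


# Hypothetical counts, for illustration only.
m = splice_site_measures(tp=80, fp=20, fn=20, tn=880)
for name, value in m.items():
    print(f"{name} = {value:.3f}")
```

With many more true negatives than true positives, Sp stays high even when a fifth of the predictions are wrong, which is one reason the slide warns about the competing definitions of "specificity".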

Best measures for comparing different methods? • ROC curves (Receiver Operating Characteristic?!!) • http://www.anaesthetist.com/mnm/stats/roc/ • "The Magnificent ROC" - has fun applets & quotes: • "There is no statistical test, however intuitive and simple, which will not be abused by medical researchers" • Correlation Coefficient • Matthews correlation coefficient (MCC) • MCC = 1 for a perfect prediction • 0 for a completely random assignment • -1 for a "perfectly incorrect" prediction Just FYI
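The three reference values of the MCC can be checked numerically. The sketch below uses the standard MCC formula, (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); the function name and the counts are illustrative, and returning 0 when the denominator is zero is a convention assumed here.

```python
import math


def matthews_cc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column total is zero
    return (tp * tn - fp * fn) / denom


# Hypothetical counts chosen to hit the three reference values.
perfect = matthews_cc(tp=50, fp=0, fn=0, tn=50)     # every call correct
chance = matthews_cc(tp=25, fp=25, fn=25, tn=25)    # no better than random
inverted = matthews_cc(tp=0, fp=50, fn=50, tn=0)    # every call wrong
print(perfect, chance, inverted)
```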

Promoters: What signals are there? Simple ones in prokaryotes

Prokaryotic promoters • RNA polymerase complex recognizes promoter sequences located very close to & on 5’ side (“upstream”) of initiation site • RNA polymerase complex binds directly to these, with no requirement for “transcription factors” • Prokaryotic promoter sequences are highly conserved • -10 region • -35 region

Simpler view of complex promoters in eukaryotes: Fig 5.12 Baxevanis & Ouellette 2005

Eukaryotic genes are transcribed by 3 different RNA polymerases, which recognize different types of promoters & enhancers:

Eukaryotic promoters & enhancers • Promoters located “relatively” close to initiation site (but can be located within gene, rather than upstream!) • Enhancers also required for regulated transcription (these control expression in specific cell types, developmental stages, in response to environment) • RNA polymerase complexes do not specifically recognize promoter sequences directly • Transcription factors bind first and serve as “landmarks” for recognition by RNA polymerase complexes

Eukaryotic transcription factors • Transcription factors (TFs) are DNA binding proteins that also interact with RNA polymerase complex to activate or repress transcription • TFs contain characteristic “DNA binding motifs” http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039 • TFs recognize specific short DNA sequence motifs (“transcription factor binding sites”) • Several databases for these, e.g., TRANSFAC http://www.generegulation.com/cgibin/pub/databases/transfac

Zinc finger-containing transcription factors • Common in eukaryotic proteins • Estimated 1% of mammalian genes encode zinc-finger proteins • In C. elegans, there are 500! • Can be used as highly specific DNA binding modules • Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy

Promoter prediction: Eukaryotes vs prokaryotes • Promoter prediction is easier in microbial genomes. Why? • Highly conserved • Simpler gene structures • More sequenced genomes! (for comparative approaches) • Methods? • Previously: mostly HMM-based • Now: similarity-based, comparative methods, because so many genomes are available

Predicting promoters: Steps & Strategies • Closely related to gene prediction! • Obtain genomic sequence • Use sequence-similarity based comparison (BLAST, MSA) to find related genes • But: "regulatory" regions are much less well-conserved than coding regions • Locate ORFs • Identify TSS (if possible!) • Use promoter prediction programs • Analyze motifs, etc. in sequence (TRANSFAC)

Predicting promoters: Steps & Strategies Identify TSS, if possible? • One of the biggest problems is determining the exact TSS! Not very many full-length cDNAs! • Good starting point? (human & vertebrate genes) Use FirstEF, found within the UCSC Genome Browser, or submit to the FirstEF web server Fig 5.10 Baxevanis & Ouellette 2005

Automated promoter prediction strategies • Pattern-driven algorithms • Sequence-driven algorithms • Combined "evidence-based" • BEST RESULTS? Combined, sequential

Promoter Prediction: Pattern-driven algorithms • Success depends on availability of collections of annotated binding sites (TRANSFAC & PROMO) • Tend to produce huge numbers of FPs • Why? • Binding sites (BS) for specific TFs often variable • Binding sites are short (typically 5-15 bp) • Interactions between TFs (& other proteins) influence affinity & specificity of TF binding • One binding site often recognized by multiple TFs • Biology is complex: promoters often specific to organism/cell/stage/environmental condition
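A back-of-the-envelope calculation shows why short binding sites generate so many false positives. The sketch below estimates how many times an exact site of a given length is expected to occur purely by chance. It rests on simplifying assumptions that are not from the lecture: a random sequence with equal base frequencies, exact matching, and independent positions. The function name is hypothetical.

```python
def expected_chance_hits(seq_length, site_length, both_strands=True):
    """Expected number of exact matches to one fixed site in random DNA.

    Assumes each base occurs with probability 0.25 and positions are
    independent. Counting both strands doubles the estimate, which
    slightly overcounts palindromic sites.
    """
    positions = max(seq_length - site_length + 1, 0)
    per_position = 0.25 ** site_length
    strands = 2 if both_strands else 1
    return positions * per_position * strands


# 1 Mb of sequence, sites at the short and long ends of the 5-15 bp range.
hits_6bp = expected_chance_hits(1_000_000, 6)
hits_15bp = expected_chance_hits(1_000_000, 15)
print(f"6 bp site:  about {hits_6bp:.0f} chance matches per Mb")
print(f"15 bp site: about {hits_15bp:.4f} chance matches per Mb")
```

Allowing mismatches, as variable binding sites require, raises the per-position probability and makes the chance-hit count larger still.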

Promoter Prediction: Pattern-driven algorithms Solutions to problem of too many FP predictions? • Take sequence context/biology into account • Eukaryotes: clusters of TFBSs are common • Prokaryotes: knowledge of σ factors helps • Probability of "real" binding site increases if annotated transcription start site (TSS) nearby • But: What about enhancers? (no TSS nearby!) & Only a small fraction of TSSs have been experimentally mapped • Do the wet lab experiments! • But: Promoter-bashing is tedious

Promoter Prediction: Sequence-driven algorithms • Assumption: common functionality can be deduced from sequence conservation • Alignments of co-regulated genes should highlight elements involved in regulation • Careful: How to determine co-regulation? • Orthologous genes from different species • Genes experimentally determined to be co-regulated (using microarrays??) • Comparative promoter prediction: "Phylogenetic footprinting" - more later…

Promoter Prediction: Sequence-driven algorithms Problems: • Need sets of co-regulated genes • For comparative (phylogenetic) methods • Must choose appropriate species • Different genomes evolve at different rates • Classical alignment methods have trouble with translocations and inversions in the order of functional elements • If background conservation of the entire region is high, comparison is useless • Not enough data (Prokaryotes >>> Eukaryotes) • Biology is complex: many (most?) regulatory elements are not conserved across species!

Examples of promoter prediction/characterization software • Lab: used MATCH, MatInspector, TRANSFAC, MEME & MAST, BLAST, etc. • Others? FirstEF, Dragon Promoter Finder (also see Dragon Genome Explorer, which has specialized promoter software for GC-rich DNA, finding CpG islands, etc.), JASPAR

TRANSFAC matrix entry: for TATA box • Fields: • Accession & ID • Brief description • TFs associated with this entry • Weight matrix • Number of sites used to build (How many here?) • Other info Fig 5.13 Baxevanis & Ouellette 2005
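To make the "weight matrix" and "number of sites used to build" fields concrete, here is a minimal sketch of how a matrix can be derived from a set of aligned binding sites. The sites are invented for illustration and are not taken from the TRANSFAC entry in the figure. Converting counts to log-odds weights with a pseudocount and a uniform background is one common choice that is assumed here, not necessarily what TRANSFAC stores.

```python
import math


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build per-position counts and log2-odds weights from aligned DNA sites.

    sites: equal-length strings over A, C, G, T.
    Returns (counts, weights), each a list with one dict per position.
    """
    bases = "ACGT"
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(bases)
    counts, weights = [], []
    for i in range(length):
        column = [site[i] for site in sites]
        col_counts = {b: column.count(b) for b in bases}
        # Log-odds of the smoothed frequency against a uniform background.
        col_weights = {
            b: math.log2(((col_counts[b] + pseudocount) / total) / background)
            for b in bases
        }
        counts.append(col_counts)
        weights.append(col_weights)
    return counts, weights


# Hypothetical TATA-like sites, made up for this example.
sites = ["TATAAA", "TATATA", "TATAAG", "TTTAAA"]
counts, weights = build_weight_matrix(sites)
print("sites used:", len(sites))
for position, col in enumerate(counts, start=1):
    print(position, col)
```

The number of sites matters because a matrix built from only a handful of sequences gives unreliable frequencies; the pseudocount keeps unseen bases from receiving a weight of minus infinity.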

Global alignment of human & mouse obese gene promoters (200 bp upstream from TSS) Fig 5.14 Baxevanis & Ouellette 2005

GenBank IDs and Accessions http://www.ncbi.nlm.nih.gov/RefSeq/key.html#accessions (Accession Formats: RefSeq) http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html (GenBank Sample Record)

Why do we do multiple alignments? • They help predict the secondary and tertiary structures of new sequences. • They are a preliminary step in molecular evolution analysis using phylogenetic methods for constructing phylogenetic trees.

An example of Multiple Alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
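One simple thing to read off an alignment like this is which columns are identical in every sequence. The sketch below scans the eight aligned sequences from the slide column by column; the function name `conserved_columns` is a made-up helper, and "conserved" is taken here in the strictest sense (same residue in all rows, no gaps).

```python
alignment = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def conserved_columns(rows):
    """Return (1-based column, residue) for columns identical in all rows."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("aligned sequences must all have the same length")
    result = []
    for index, column in enumerate(zip(*rows), start=1):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            result.append((index, column[0]))
    return result


print(conserved_columns(alignment))
```

Only two columns pass this strict test, a cysteine and a tryptophan, which shows how an alignment can highlight a few invariant positions among otherwise divergent sequences.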

Visualization example

Other multiple alignment programs: ClustalW / ClustalX, pileup, multalign, multal, saga, hmmt, DIALIGN, SBpima, MLpima, T-Coffee, ...

ClustalW - for multiple alignment ClustalW can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees. Alignment can be done by 2 methods: - slow/accurate - fast/approximate

Running ClustalW
[~]% clustalw
**************************************************************
******** CLUSTAL W (1.7) Multiple Sequence Alignments ********
**************************************************************
1. Sequence Input From Disc
2. Multiple Alignments
3. Profile / Structure Alignments
4. Phylogenetic trees
S. Execute a system command
H. HELP
X. EXIT (leave program)
Your choice:

Running ClustalW The input file for ClustalW is a single file containing all sequences in one of the following formats: NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF, RSF.

Using ClustalW
****** MULTIPLE ALIGNMENT MENU ******
1. Do complete multiple alignment now (Slow/Accurate)
2. Produce guide tree file only
3. Do alignment using old guide tree file
4. Toggle Slow/Fast pairwise alignments = SLOW
5. Pairwise alignment parameters
6. Multiple alignment parameters
7. Reset gaps between alignments? = OFF
8. Toggle screen display = ON
9. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
Your choice:

Output of ClustalW
CLUSTAL W (1.7) multiple sequence alignment
HSTNFR      GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
SYNTNFTRP   GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
CFTNFA      -------------------------------------------TGTCCAG------ACAG
CATTNFAA    GGGAAGAG---CTCCCACATGGCCTGCAACTAATCAACCCTCTGCCCCAG------ACAC
RABTNFM     AGGAGGAAGAGTCCCCAAACAACCTCCATCTAGTCAACCCTGTGGCCCAGATGGTCACCC
RNTNFAA     AGGAGGAGAAGTTCCCAAATGGGCTCCCTCTCATCAGTTCCATGGCCCAGACCCTCACAC
OATNFA1     GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
OATNFAR     GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
BSPTNFA     GGGAAGAGCAGTCCCCAGGTGGCCCCTCCATCAACAGCCCTCTGGTTCAA------ACAC
CEU14683    GGGAAGAGCAATCCCCAACTGGCCTCTCCATCAACAGCCCTCTGGTTCAG------ACCC
** *

ClustalW options
Your choice: 5
********* PAIRWISE ALIGNMENT PARAMETERS *********
Slow/Accurate alignments:
1. Gap Open Penalty :15.00
2. Gap Extension Penalty :6.66
3. Protein weight matrix :BLOSUM30
4. DNA weight matrix :IUB
Fast/Approximate alignments:
5. Gap penalty :5
6. K-tuple (word) size :2
7. No. of top diagonals :4
8. Window size :4
9. Toggle Slow/Fast pairwise alignments = SLOW
H. HELP
Enter number (or [RETURN] to exit):

ClustalW options
Your choice: 6
********* MULTIPLE ALIGNMENT PARAMETERS *********
1. Gap Opening Penalty :15.00
2. Gap Extension Penalty :6.66
3. Delay divergent sequences :40 %
4. DNA Transitions Weight :0.50
5. Protein weight matrix :BLOSUM series
6. DNA weight matrix :IUB
7. Use negative matrix :OFF
8. Protein Gap Parameters
H. HELP
Enter number (or [RETURN] to exit):

Blocks database and tools • Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. • The Blocks web server tools are: Block Searcher, Get Blocks and Block Maker. These are aids to detection and verification of protein sequence homology. • They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively.

The BLOCKS web server At URL: http://blocks.fhcrc.org/ The BLOCKS WWW server can be used to create blocks of a group of sequences, or to compare a protein sequence to a database of blocks. The Blocks Searcher tool should be used for multiple alignment of distantly related protein sequences.

The Blocks Searcher tool • For searching a database of blocks, the first position of the sequence is aligned with the first position of the first block, and a score for that amino acid is obtained from the profile column corresponding to that position. Scores are summed over the width of the alignment, and then the block is aligned with the next position. • This procedure is carried out exhaustively for all positions of the sequence for all blocks in the database, and the best alignments between a sequence and entries in the BLOCKS database are noted. If a particular block scores highly, it is possible that the sequence is related to the group of sequences the block represents.
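The sliding procedure described above can be sketched as follows. This is a toy illustration rather than the Blocks Searcher implementation: the profile values, the default score for residues missing from a column, the query sequence and the function names are all made up.

```python
def scan_with_block(sequence, profile, default_score=-1.0):
    """Slide a block profile along a sequence and score every offset.

    profile: one dict per block column, mapping residue -> score.
    Residues absent from a column get default_score (an assumption).
    Returns a list of (start, score) with 0-based start positions.
    """
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the column scores over the width of the block.
        score = sum(col.get(res, default_score) for col, res in zip(profile, window))
        hits.append((start, score))
    return hits


def best_hit(hits):
    """Return the (start, score) pair with the highest score."""
    return max(hits, key=lambda hit: hit[1])


# Hypothetical 4-column block profile and query sequence.
toy_profile = [
    {"C": 5.0},
    {"L": 2.0, "V": 2.0},
    {"V": 3.0, "I": 2.0},
    {"K": 4.0},
]
sequence = "AVSLTCLVKGFYPSD"
hits = scan_with_block(sequence, toy_profile)
best = best_hit(hits)
print("best alignment starts at", best[0], "with score", best[1])
```

A real search repeats this for every block in the database and then asks whether the best score is higher than expected by chance, which is what the calibration step is for.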

The Blocks Searcher tool • Typically, a group of proteins has more than one region in common, and their relationship is represented as a series of blocks separated by unaligned regions. If a second block for a group also scores highly in the search, the evidence that the sequence is related to the group is strengthened, and it is further strengthened if a third block also scores highly, and so on.

The BLOCKS Database The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches. It is these calibrated blocks that make up the BLOCKS database.

The Block Maker Tool Block Maker finds conserved blocks in a group of two or more unaligned protein sequences, which are assumed to be related, using two different algorithms. Input file must contain at least 2 sequences. Input sequences must be in FastA format. Results are returned by e-mail.

Progressive Approaches • CLUSTALW • Perform pairwise alignments • Construct a tree, joining most similar sequences first (guide tree) • Align sequences sequentially, using the phylogenetic tree • PILEUP • Similar to CLUSTALW • Uses UPGMA to produce tree (chapter 6)

Clustal method • Higgins and Sharp 1988 • ref: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244. [Medline] • Progressive alignment method • An approximation strategy (heuristic algorithm) yields a possible alignment, but not necessarily the best one

First step: • Compute the pairwise alignments for all against all (for four sequences A, B, C, D: 6 pairwise alignments) • The similarities are stored in a table
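The all-against-all step can be sketched with a small global alignment scorer. This is an illustration under stated assumptions, not ClustalW's own scoring: it uses a plain Needleman-Wunsch recurrence with made-up match, mismatch and linear gap values, and the four sequences are invented.

```python
from itertools import combinations


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score of the best global alignment of a and b.

    Scoring values are illustrative; a linear gap penalty is assumed.
    Only the previous row of the dynamic-programming table is kept.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_table(seqs):
    """Score every unordered pair of named sequences."""
    return {
        (x, y): global_alignment_score(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }


# Hypothetical sequences named as on the slide.
seqs = {"A": "GATTACA", "B": "GATTACA", "C": "GACTATA", "D": "CCGGTT"}
table = pairwise_table(seqs)
for pair, score in table.items():
    print(pair, score)
```

For n sequences there are n(n-1)/2 pairs, so four sequences give the six pairwise alignments mentioned above; the cost of this step grows quadratically with the number of sequences.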

Second step: • Cluster the sequences (A, B, C, D) to create a tree (guide tree): • Represents the order in which pairs of sequences are to be aligned • Highly similar sequences are neighbors in the tree • Highly distant sequences are distant from each other in the tree
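The clustering step can be illustrated with UPGMA, the method the slides attribute to PILEUP. It is used here only as a simple stand-in for guide-tree construction and should not be read as ClustalW's own tree-building procedure. The distance values and the function name are hypothetical.

```python
def upgma_guide_tree(names, distances):
    """Build a guide tree by UPGMA and return it as nested tuples.

    distances: dict mapping (a, b) leaf-name pairs to a distance;
    every unordered pair must appear exactly once.
    """
    index = {name: i for i, name in enumerate(names)}
    # Each cluster id maps to (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {}
    for (a, b), value in distances.items():
        i, j = sorted((index[a], index[b]))
        d[(i, j)] = value
    next_id = len(names)
    while len(clusters) > 1:
        # Join the two closest clusters (ties broken by lowest ids).
        i, j = min(d, key=lambda pair: (d[pair], pair))
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        del d[(i, j)]
        # New distances are size-weighted averages of the old ones.
        for k in clusters:
            d_ik = d.pop((min(i, k), max(i, k)))
            d_jk = d.pop((min(j, k), max(j, k)))
            d[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


# Hypothetical distances: A and B are most similar, D is the outlier.
distances = {
    ("A", "B"): 0.1, ("A", "C"): 0.4, ("A", "D"): 0.8,
    ("B", "C"): 0.4, ("B", "D"): 0.8, ("C", "D"): 0.8,
}
tree = upgma_guide_tree(["A", "B", "C", "D"], distances)
print(tree)
```

Reading the nested tuples from the inside out gives the alignment order: A with B first, then C added to that pair, and the most distant sequence D last.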
