Computational searches of biological sequences

Basic concepts • Homology and other evolutionary relationships (paralogs, orthologs, xenologs) • Codon usage bias, CAI and expressivity • Microarrays and statistical approaches for their analysis

Description of existing programs • BLAST (pairwise sequence comparison) • MEME/MAST (identification of over-represented motifs)

Problems to solve • Minimal gene set for life • Prediction of bacterial operons • Expressivity in transcriptional units • Conservation of expressivity between organisms • Identification of horizontally transferred genes in H. pylori • Regulation by glucose in E. coli

Amino acid substitution matrix: PAM 250 AGGIDG GHGFMG 117137

Unitary matrix for DNA sequences

    A  C  G  T
A   1  0  0  0
C   0  1  0  0
G   0  0  1  0
T   0  0  0  1
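As a minimal sketch of how such a matrix is used, the snippet below scores two ungapped DNA sequences of equal length by summing the matrix value at each aligned position. The names `UNITARY` and `unitary_score` are illustrative and do not come from any particular program.

```python
# Unitary (identity) matrix: 1 for a match, 0 for a mismatch.
UNITARY = {a: {b: int(a == b) for b in "ACGT"} for a in "ACGT"}


def unitary_score(seq1, seq2):
    """Sum the matrix values over aligned positions (no gaps assumed)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(UNITARY[a][b] for a, b in zip(seq1, seq2))


# Five of the six aligned positions match.
print(unitary_score("ACGTAC", "ACGTTC"))
```

With this matrix the score is simply the number of identical positions, so every position contributes in the same way regardless of where it sits in the alignment.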

In any case, the values obtained in the comparison are the same along the entire alignment. It is well known, however, that some residues in a protein or nucleotide sequence play important roles and are therefore constrained in how much they can vary.

These conserved regions constitute motifs, which can sometimes be recognized in a set of aligned sequences.

SCK1_CENEL/13-36   CLKPCKDLYGPHAGAKCMNGKCKC
SCKM_CENMA/13-36   CLPPCKAQFGQSAGAKCMNGKCKC
SCT2_ANDAU/35-57   CASVCRRVIGVAAG-KCINGRCVC
SCK3_ANDMA/13-35   CASVCRKVIGVAAG-KCINGRCVC
SCK4_MESMA/35-57   CASVCRREIGVAAG-KCINGKCVC
SCKK_TITSE/35-57   CYSACKKLVGKATG-KCTNGRCDC
SCK2_TITDI/14-36   CVKICIDRYNTRGA-KCINGRCTC
SCKP3_TITSE/7-28   CNRKCCPG-GCRSG-KCINGKCQC
SCBX_MESMA/8-29    CRVKCVAM-GFSSG-KCINSKCKC
SCKL_LEIQH/8-28    CQLSCRSL-GL-LG-KCIGDKCEC
SCK5_ANDMA/8-28    CQLSCRSL-GL-LG-KCIGVKCEC
SCK1_CENNO/36-57   CDKDCKRR-GYRSG-KCINNACKC
SCK2_CENNO/8-29    CDKDCTSR-KYRSG-KCINNACKC
** .**.*

What is a motif in a biological sequence? • Represents a conserved region of a sequence. • This conservation might be due to a functional constraint. • There are conserved structural domains in a family of proteins. Amino acid sequences can almost always represent such motifs. • Motif identification is useful to classify and understand protein or nucleotide function.

Example of a protein motif. Motifs can be represented by Weight Matrices:

Example of an RNA motif. Motifs can be represented by Weight Matrices:

Example of a DNA motif. Motifs can be represented by Weight Matrices:

How can we obtain a Weight Matrix for a specific motif? By evaluating the relative frequency of its elements in a set of aligned sequences.
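The column-by-column frequency calculation can be sketched as follows. This is an illustration on a toy DNA alignment, not the procedure of any specific program; the function name `frequency_matrix` and the example sequences are assumptions made for the example.

```python
from collections import Counter


def frequency_matrix(aligned, alphabet):
    """Return one {residue: relative frequency} dict per alignment column.

    Characters outside `alphabet` (for example gaps) are not reported,
    but they still count toward the column total.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    matrix = []
    for column in zip(*aligned):
        counts = Counter(column)
        matrix.append({r: counts[r] / len(aligned) for r in alphabet})
    return matrix


# Toy alignment of four sequences (hypothetical data).
toy_alignment = ["ACGT", "ACGA", "ACTT", "TCGT"]
fm = frequency_matrix(toy_alignment, "ACGT")
for position, column in enumerate(fm, start=1):
    print(position, column)
```

In this toy alignment the second column is fully conserved (C has frequency 1.0), while the first column is mostly A with one T.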

This frequency matrix contains relevant biological information about your protein and can be used to obtain a Position-Specific Score Matrix (PSSM).

Position-Specific Score Matrix (PSSM): While PAM and BLOSUM matrices are used to compare two amino acids of a pair of sequences regardless of their position in the aligned sequences, a PSSM analysis uses a different matrix in which the score varies depending on the conservation of each position of the aligned sequences.

Serine Protease

Position-Specific Score Matrix (PSSM): In practice, the frequencies are not used as such to score putative sites. The score assigned to a piece of sequence, S, is calculated as the log-ratio of two probabilities: P(S|M), the probability of observing sequence S given the motif model M (the matrix), and P(S|B), the probability of observing sequence S given the background model B (the genomic context). The score of a sequence segment is W_S = log[P(S|M) / P(S|B)]
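A minimal sketch of this log-ratio score is given below. It assumes that positions are independent, so that P(S|M) and P(S|B) are products over positions, that the natural logarithm is used (the slide does not state a base), and that the motif matrix has no zero entries (for example because pseudocounts were added). The motif values and the name `log_odds_score` are hypothetical.

```python
import math


def log_odds_score(segment, motif, background):
    """W_S = log[P(S|M) / P(S|B)], summed position by position."""
    if len(segment) != len(motif):
        raise ValueError("segment and motif must have the same length")
    return sum(
        math.log(motif[i][r] / background[r]) for i, r in enumerate(segment)
    )


# Hypothetical two-column motif that prefers "AG", and a uniform background.
motif = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(log_odds_score("AG", motif, background))  # positive: motif-like
print(log_odds_score("CT", motif, background))  # negative: background-like
```

A positive score means the segment is better explained by the motif model than by the background, and a negative score means the opposite.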

Different programs have been developed to find motifs. What if an alignment is not an option?

 1 AKSJDFHLASUHERLAKSNBKAJNCLKJASHDKFJAHSEJ
 2 DLKTJNKHBHEASHRGHBDFASJGHBCLKUSHKLCSDHGK
 3 GNLKXDHKIASGCSDKJCSKHDGKJELHBHEAJFNLOIJS
 4 JHSLRCKJGHXBDKSLCFALSIZDNGJDFGNLCKJSDNSD
 5 LKSAJDHBFCKGLSHBHEAUABSXDJKFASODFHBHKAHS
 6 JSHGHAEKHKSDFJHKSJDFHKAJSEHRKAJHBHEAPERI
 7 QWHBHEACVLXMNCVKUIEHRMBDKFJAHLIDHRTRKKQP
 8 LICVUWJENOMNVIDFGKJERJSGFAHGSIUOPIAKHVIU
 9 OIEURTKSHOIUCVBSDFGUYWERKJHDFLIUHBHEAERT
10 OIUWERMXCVKJHBHEAWIERUOIUVMBNAWIUEYRHASS

How do they work? A) By counting all the “words” of a certain length and identifying those that are most frequent and statistically significant. B) In a random fashion, by taking randomly chosen fragments and evaluating whether these fragments generate a conserved, representative motif (Gibbs sampler algorithm).
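Approach A can be sketched with a simple word count over the ten unaligned example sequences from the previous slide. For each word of length k the sketch counts how many sequences contain it at least once. The word length k = 5 is an assumption chosen for illustration, and no statistical significance is computed here, which a real program would need to do.

```python
from collections import Counter

SEQUENCES = [
    "AKSJDFHLASUHERLAKSNBKAJNCLKJASHDKFJAHSEJ",
    "DLKTJNKHBHEASHRGHBDFASJGHBCLKUSHKLCSDHGK",
    "GNLKXDHKIASGCSDKJCSKHDGKJELHBHEAJFNLOIJS",
    "JHSLRCKJGHXBDKSLCFALSIZDNGJDFGNLCKJSDNSD",
    "LKSAJDHBFCKGLSHBHEAUABSXDJKFASODFHBHKAHS",
    "JSHGHAEKHKSDFJHKSJDFHKAJSEHRKAJHBHEAPERI",
    "QWHBHEACVLXMNCVKUIEHRMBDKFJAHLIDHRTRKKQP",
    "LICVUWJENOMNVIDFGKJERJSGFAHGSIUOPIAKHVIU",
    "OIEURTKSHOIUCVBSDFGUYWERKJHDFLIUHBHEAERT",
    "OIUWERMXCVKJHBHEAWIERUOIUVMBNAWIUEYRHASS",
]


def count_words(sequences, k):
    """Count, for every word of length k, the sequences that contain it."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each word is counted once per sequence.
        words = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(words)
    return counts


counts = count_words(SEQUENCES, 5)
top_word, n_sequences = counts.most_common(1)[0]
print(top_word, n_sequences)
```

On these sequences the most widespread 5-letter word is HBHEA, which is present in seven of the ten sequences.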

Gibbs sampler algorithm Multiple Local Alignment (MLA)

Positional-Probabilistic Model (PPM) and background: We divide (mark) a sequence into the motif site (occurrence), which is described by a probability-positional matrix q(i,r), and the background, which is described by background symbol probabilities f(r). r is a nucleotide (a residue), r ∈ {A,T,G,C}; i is a position in the site, i = 1..s; s is the motif length.

What is a motif? Two probabilistic models, foreground (the motif) and background, are formulated. We classify (mark) every part of the input sequences as generated by one of these two models.

A Gibbs sampling step: The motif and background base counters are computed from all the sequence fragments except the current sequence. Statistical models for the background and for the motif are formed using these counters. The probability distribution of the new site position, or of its absence, in the current sequence is derived from the statistical models and the current sequence content. A new site location is then sampled from this distribution.

Gibbs sampler algorithm: what if an alignment is not an option?

Gibbs sampler algorithm: Two probabilistic models are formulated: the foreground model (the motif) and the background model. The motif site (occurrence) is described by a probability-positional matrix q(i,r); the background is described by background symbol probabilities f(r).

Gibbs sampler algorithm A probability distribution (where the foreground and background models are different) can be evaluated

A complete statistical description of the method is not in the scope of this talk

The main idea of the method: A probability distribution profile is evaluated from the current set of sites. One of the sequences, chosen randomly, is removed from the alignment.

The main idea of the method (continued): The removed sequence is replaced by a new segment, found by searching that sequence with the previous motif profile. A new probability distribution profile is then evaluated.

The main idea of the method (continued): After several cycles, the method tends to identify a significant motif.
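The cycle described above can be sketched as a simple site sampler. This is a simplified illustration, not the full statistical method: it assumes exactly one site of fixed width per sequence (motif absence is not modelled), estimates the background from all letters of the input, and uses a pseudocount to avoid zero probabilities. The function name, parameter values and toy sequences are assumptions.

```python
import random
from collections import Counter


def gibbs_sampler(seqs, width, iterations=2000, pseudo=0.5, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    letters = "".join(seqs)
    alphabet = sorted(set(letters))
    totals = Counter(letters)
    background = {r: totals[r] / len(letters) for r in alphabet}

    # Start from randomly chosen fragments.
    pos = [rng.randrange(len(s) - width + 1) for s in seqs]

    for _ in range(iterations):
        k = rng.randrange(len(seqs))  # sequence removed in this cycle

        # Motif profile built from the sites of all other sequences.
        others = [s[p:p + width]
                  for j, (s, p) in enumerate(zip(seqs, pos)) if j != k]
        denom = len(others) + pseudo * len(alphabet)
        profile = []
        for column in zip(*others):
            c = Counter(column)
            profile.append({r: (c[r] + pseudo) / denom for r in alphabet})

        # Likelihood ratio motif/background for each candidate start.
        weights = []
        for start in range(len(seqs[k]) - width + 1):
            w = 1.0
            for i, r in enumerate(seqs[k][start:start + width]):
                w *= profile[i][r] / background[r]
            weights.append(w)

        # Sample the new site in proportion to its weight.
        pos[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos


# Toy DNA sequences (hypothetical data) with a shared "TATAAT"-like word.
TOY_SEQS = [
    "GCGCTATAATGCGC",
    "CCTATAATGGCGCG",
    "GGCGCGCTATAATC",
    "CTATAATGCGGCGC",
]
sites = gibbs_sampler(TOY_SEQS, width=6)
for seq, start in zip(TOY_SEQS, sites):
    print(start, seq[start:start + 6])
```

Because sites are sampled rather than chosen greedily, the procedure can escape poor starting positions. Results depend on the random seed, so several runs from different starting points are advisable.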

http://bioinformatics.org.au/glam2/doc/
GLAM2 is a software package for finding motifs in sequences, typically amino-acid or nucleotide sequences. The main innovation of GLAM2 is that it allows insertions and deletions in motifs. The package includes these programs:
* glam2 - for discovering motifs shared by a set of sequences.
* glam2scan - for finding matches, in a sequence database, to a motif discovered by glam2.
* glam2format - for converting glam2 motifs to standard alignment formats.
* glam2mask - for masking glam2 motifs out of sequences, so that weaker motifs can be found.
* purge - for removing highly similar members of a set of sequences.

http://meme.sdsc.edu/meme4/cgi-bin/glam2.cgi

Basic usage
• Running glam2 without any arguments gives a usage message:
    Usage: glam2 [options] alphabet my_seqs.fa
    Main alphabets: p = proteins, n = nucleotides
    Main options (default settings):
    -h: show all options and their default settings
    -o: output file (stdout)
    -r: number of alignment runs (10)
    -n: end each run after this many iterations without improvement (10000)
    -2: examine both strands, forward and reverse complement
    -z: minimum number of sequences in the alignment (2)
    -a: minimum number of aligned columns (2)
    -b: maximum number of aligned columns (50)
    -w: initial number of aligned columns (20)
• The main input to glam2 is a file of sequences in FASTA format:
    >MyFirstSequence
    GHYWVVCTGGGACH
    >My2ndSequence
    LLIGGPWVWWADDDF
    (etc.)
• You need to tell glam2 which alphabet to use:
    glam2 p my_prots.fa
    glam2 n my_nucs.fa
• Use -o to write the output to a file rather than to the screen:
    glam2 -o my_prots.glam2 p my_prots.fa

How it works: glam2 starts from a random alignment and makes many small, random changes to it, which are designed to find high-scoring alignments in the long run. The longer you let it run, the more likely it is to find a maximal-scoring alignment. To check that a reproducible, high-scoring motif has been found, the whole procedure is run several (e.g. 10) times from different starting alignments. If all runs produce identical alignments, we have maximum confidence that this is the optimal motif. If a few of the runs produce different, lower-scoring motifs, we still have high confidence. If all the runs produce completely different alignments, we have low confidence, and the run length needs to be increased.
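The reasoning about repeated runs can be sketched as a small helper that compares the results of several independent runs. This is not part of glam2; the representation of a run as a (score, alignment) pair and the 50% cut-off used to separate "high" from "low" confidence are assumptions made for illustration.

```python
def restart_confidence(runs):
    """Classify confidence from a list of (score, alignment) run results.

    `alignment` can be any hashable summary of a run, for example a
    tuple of motif start positions.
    """
    best_score, best_alignment = max(runs, key=lambda run: run[0])
    agreeing = sum(1 for _, alignment in runs if alignment == best_alignment)
    fraction = agreeing / len(runs)
    if fraction == 1.0:
        return "maximum"
    if fraction >= 0.5:  # assumed cut-off, not a glam2 rule
        return "high"
    return "low"


# Hypothetical results: eight runs agree, two found lower-scoring motifs.
example_runs = [(42.0, (3, 7, 1))] * 8 + [(30.5, (0, 2, 9)), (28.1, (5, 5, 4))]
print(restart_confidence(example_runs))
```

When the result is "low", the remedy suggested by the text is to increase the run length, which in glam2 corresponds to the -n option listed under basic usage.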

MEME: Multiple Expectation maximization for Motif Elicitation • MAIN DIFFERENCES • Try many different initial segments in order to get one that converges to an optimum. • Try different window analysis sizes. • In order to generate a motif with gaps, more than one motif can be generated.

MEME: Multiple Expectation maximization for Motif Elicitation A very useful program to discover Patterns • Motif discovery from unaligned sequences • Optimal for Genomic or Protein sequences • Especially if you do not know the size of the motif • Identifies profile motifs • Simultaneously analyze Multiple motifs for any input • Flexible model of motif presence • Motif can be absent in some sequences • Motif can appear several times in one sequence

The input to MEME contains the following fields.
• Sequences: protein or DNA sequences (do not merge) in FASTA format. Note that sequence names must not be repeated. Valid examples are:
    >seq1
    GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAK
    >seq2
    GDMFCPGYCPDVKPVGDFDLSAFAGAWHELAK
    >GdhA Glutamate dehydrogenase from Escherichia coli
    RDMFCPGGCPDVKPVGDHDLSAFKGAWHELAL

MEME input, continued
• Motif distribution: one per sequence (oops), zero or one per sequence (zoops), or any number of repetitions (tcm)
• Number of motifs [optional]: the program will stop the analysis after this number of motifs is found.
• Number of sites (minimum or maximum sites, <= 300), for example minimum sites = 5, maximum sites = 8
• Motif width: MEME will find the optimum width of each motif within the minimum and maximum limits you specify.

MEME input, continued
• Text output format: by default, MEME output is in hypertext (HTML) format.
• Shuffle letters in input sequences: useful for further statistical analysis.
• Look for palindromes only: average the letter frequencies in corresponding motif columns together.

http://meme.sdsc.edu/meme/meme.html

MEME Output (annotated screenshot) • Expectation value • Motif length • Position-Specific Probability Matrix • # of motifs found • Consensus • Information content

MEME Output (annotated screenshot) • Position in sequence • Statistical significance • Sequence names • Strand (reverse or complement) • Motif within sequence

MEME Output (annotated screenshot) • Motif in complement strand • Sequence length • Overall statistical significance

MAST • Searches for motifs (one or more) in sequence databases: • Like BLAST but motifs are used as input • Similar to the matrices obtained by iteration in PSI-BLAST • Profile defines statistical significance of a match • Multiple motif matches per sequence • Combined E value for all motifs • MEME uses MAST to summarize results: • Each MEME result is accompanied by the MAST result for searching the discovered motifs on the given sequences.
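The idea of searching a database with a motif can be sketched by sliding a position-specific matrix along each sequence and keeping the best-scoring window. This only illustrates the principle: MAST itself reports p-values and combined E-values, which are not computed here, and the motif, database and function names below are hypothetical.

```python
import math


def best_match(sequence, motif, background):
    """Return (score, start) of the best log-odds window, or None."""
    width = len(motif)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(math.log(motif[i][r] / background[r])
                    for i, r in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


def scan_database(database, motif, background):
    """Rank sequences by their best motif match, highest score first."""
    hits = []
    for name, sequence in database.items():
        hit = best_match(sequence, motif, background)
        if hit is not None:
            hits.append((name, hit[1], hit[0]))
    return sorted(hits, key=lambda h: h[2], reverse=True)


# Hypothetical two-column motif preferring "AG" and a tiny "database".
scan_motif = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_database = {"s1": "TTAGTT", "s2": "CCCCCC"}

for name, start, score in scan_database(toy_database, scan_motif, uniform):
    print(name, start, round(score, 3))
```

Sequences shorter than the motif produce no hit and are skipped. Turning raw scores into significance values requires a model of the score distribution, which is the part that MAST adds.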

MAST input (annotated screenshot) • Email address • Motif file (e.g. MEME output) • Database (like BLAST) • Consider matched sequence length • E value threshold

MAST output (annotated screenshot) • Link to GenBank • Matched accession • Match E value • Length of sequence

MAST output (annotated screenshot) • Motif diagram

MAST output (annotated screenshot) • Position of each instance • Matched parts of sequence • Motif and orientation • P value of instance • Motif ‘consensus’
