## Computational searches of biological sequences


**Basic concepts**
• Homology and other evolutionary relationships (paralogs, orthologs, xenologs)
• Codon usage bias, CAI, and expression level
• Microarrays and statistical approaches for their analysis

**Description of existing programs**
• BLAST (pairwise sequence comparison)
• MEME/MAST (identification of over-represented motifs)

**Problems to solve**
• Minimal gene set for life
• Prediction of bacterial operons
• Expression level in transcriptional units
• Conservation of expression level between organisms
• Identification of horizontally transferred genes (H. pylori)
• Regulation by glucose in E. coli

**Substitution matrix for amino acids**
PAM 250 (example on the slide: AGGIDG, GHGFMG, 117137)

**Unitary matrix for DNA sequences**

| | A | C | G | T |
|---|---|---|---|---|
| A | 1 | 0 | 0 | 0 |
| C | 0 | 1 | 0 | 0 |
| G | 0 | 0 | 1 | 0 |
| T | 0 | 0 | 0 | 1 |

**In any case, the values obtained in the comparison are the same along the entire alignment**
It is well known that some residues in a protein or nucleotide sequence play important roles and are therefore constrained in how much they can vary.

**These conserved regions constitute motifs, which are sometimes recognized in a set of aligned sequences**
SCK1_CENEL/13-36 CLKPCKDLYGPHAGAKCMNGKCKC
SCKM_CENMA/13-36 CLPPCKAQFGQSAGAKCMNGKCKC
SCT2_ANDAU/35-57 CASVCRRVIGVAAG-KCINGRCVC
SCK3_ANDMA/13-35 CASVCRKVIGVAAG-KCINGRCVC
SCK4_MESMA/35-57 CASVCRREIGVAAG-KCINGKCVC
SCKK_TITSE/35-57 CYSACKKLVGKATG-KCTNGRCDC
SCK2_TITDI/14-36 CVKICIDRYNTRGA-KCINGRCTC
SCKP3_TITSE/7-28 CNRKCCPG-GCRSG-KCINGKCQC
SCBX_MESMA/8-29 CRVKCVAM-GFSSG-KCINSKCKC
SCKL_LEIQH/8-28 CQLSCRSL-GL-LG-KCIGDKCEC
SCK5_ANDMA/8-28 CQLSCRSL-GL-LG-KCIGVKCEC
SCK1_CENNO/36-57 CDKDCKRR-GYRSG-KCINNACKC
SCK2_CENNO/8-29 CDKDCTSR-KYRSG-KCINNACKC

**What is a motif in a biological sequence?**
• A motif represents a conserved region of a sequence.
• This conservation might be due to a functional constraint.
• There are conserved structural domains in a family of proteins. Amino acid sequences can almost always represent such motifs.
• Motif identification is useful for classifying and understanding protein or nucleotide function.

**Example of a protein motif**
Motifs can be represented by weight matrices.

**Example of an RNA motif**
Motifs can be represented by weight matrices.

**Example of a DNA motif**
Motifs can be represented by weight matrices.

**How can we obtain a weight matrix for a specific motif?**
By evaluating the relative frequency of its elements in a set of aligned sequences.

**This frequency matrix contains relevant biological information about your protein and can be used to obtain a Position Specific Score Matrix (PSSM)**

**Position Specific Score Matrix (PSSM)**
PAM and BLOSUM matrices are used to compare two amino acids of a pair of sequences regardless of their position in the aligned sequences. A PSSM analysis uses a different matrix, in which the score depends on the conservation of each position of the aligned sequences.

**Position Specific Score Matrix (PSSM)**
In practice, the frequencies are not used directly to score putative sites. The score assigned to a piece of sequence, S, is calculated as the log-ratio of two probabilities:
• P(S|M), the probability of observing sequence S given the motif model M (the matrix).
• P(S|B), the probability of observing sequence S given the background model B (the genomic context).
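The frequency matrix and the log-ratio of P(S|M) to P(S|B) described here can be sketched in a few lines of Python. This is a minimal illustration, not taken from the slides: the toy alignment, the function names `build_pssm` and `score`, the pseudocount of 1, the uniform background, and the choice of base-2 logarithms are all assumptions, and the alignment is assumed to be ungapped.

```python
import math
from collections import Counter

def build_pssm(aligned, alphabet, background=None, pseudo=1.0):
    """Per-column log2 ratio of motif frequency to background frequency.

    Assumes an ungapped alignment whose symbols all belong to `alphabet`.
    """
    if background is None:
        # Assumption: uniform background when none is supplied.
        background = {r: 1.0 / len(alphabet) for r in alphabet}
    total = len(aligned) + pseudo * len(alphabet)
    pssm = []
    for i in range(len(aligned[0])):
        counts = Counter(seq[i] for seq in aligned)
        # Relative frequency with pseudocounts, then log-ratio to background.
        pssm.append({
            r: math.log2(((counts[r] + pseudo) / total) / background[r])
            for r in alphabet
        })
    return pssm

def score(pssm, segment):
    """Sum of per-position log-ratios, i.e. log2[P(S|M) / P(S|B)]
    when positions are treated as independent."""
    return sum(column[r] for column, r in zip(pssm, segment))

# Hypothetical toy alignment (made up for illustration).
aligned = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pssm = build_pssm(aligned, "ACGT")

print(score(pssm, "TATAAT"))  # positive: resembles the motif
print(score(pssm, "GGGCCC"))  # negative: resembles the background
```

The pseudocount keeps unseen residues from producing log(0); a larger value pulls the scores toward the background.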
The score of a sequence segment is $W_S = \log\left[\frac{P(S \mid M)}{P(S \mid B)}\right]$

**Different programs have been developed to find motifs**
1 AKSJDFHLASUHERLAKSNBKAJNCLKJASHDKFJAHSEJ
2 DLKTJNKHBHEASHRGHBDFASJGHBCLKUSHKLCSDHGK
3 GNLKXDHKIASGCSDKJCSKHDGKJELHBHEAJFNLOIJS
4 JHSLRCKJGHXBDKSLCFALSIZDNGJDFGNLCKJSDNSD
5 LKSAJDHBFCKGLSHBHEAUABSXDJKFASODFHBHKAHS
6 JSHGHAEKHKSDFJHKSJDFHKAJSEHRKAJHBHEAPERI
7 QWHBHEACVLXMNCVKUIEHRMBDKFJAHLIDHRTRKKQP
8 LICVUWJENOMNVIDFGKJERJSGFAHGSIUOPIAKHVIU
9 OIEURTKSHOIUCVBSDFGUYWERKJHDFLIUHBHEAERT
10 OIUWERMXCVKJHBHEAWIERUOIUVMBNAWIUEYRHASS

What if the alignment is not an option?

**How do they work?**
A) By counting all the "words" of a certain length and evaluating which are the most frequent and statistically significant.
B) In a random fashion: by taking randomly chosen fragments and evaluating whether these fragments generate a conserved, representative motif (Gibbs sampler algorithm).

**Gibbs sampler algorithm**
Multiple Local Alignment (MLA)

**Positional-Probabilistic Model (PPM) and background**
We mark a sequence into the motif site (occurrence), which is described by a positional probability matrix q(i,r), and the background, which is described by background symbol probabilities f(i).
• r is a nucleotide (a residue); r ∈ {A,T,G,C}
• i is a position in the site, i = 1..s
• s is the motif length

**What is a motif**
Two probabilistic models, foreground (the motif) and background, are formulated.
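A Gibbs sampler built on a motif matrix q(i,r) and background symbol probabilities can be sketched as below. This is a simplified illustration under stated assumptions, not the algorithm of any particular program: it forces exactly one site of fixed width in every sequence (no "site absent" option), uses a pseudocount of 1, and the function name `gibbs_site_sampler`, its parameters, and the toy DNA sequences are all hypothetical.

```python
import math
import random

def gibbs_site_sampler(seqs, width, alphabet, iterations=200, pseudo=1.0, seed=0):
    """Toy Gibbs site sampler: one site of fixed width per sequence."""
    rng = random.Random(seed)
    # Start from a random site in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]

    for _ in range(iterations):
        held = rng.randrange(len(seqs))  # sequence set aside in this step

        # Motif and background counters from all sequences except the held one.
        motif = [{r: pseudo for r in alphabet} for _ in range(width)]
        background = {r: pseudo for r in alphabet}
        for k, (s, start) in enumerate(zip(seqs, starts)):
            if k == held:
                continue
            for j, r in enumerate(s):
                if start <= j < start + width:
                    motif[j - start][r] += 1
                else:
                    background[r] += 1

        # Turn the counters into probabilities: q[i][r] and f[r].
        q = []
        for column in motif:
            column_total = sum(column.values())
            q.append({r: c / column_total for r, c in column.items()})
        background_total = sum(background.values())
        f = {r: c / background_total for r, c in background.items()}

        # Weight of each candidate start: product over i of q[i][r_i] / f[r_i].
        s = seqs[held]
        weights = []
        for start in range(len(s) - width + 1):
            log_weight = sum(
                math.log(q[i][s[start + i]] / f[s[start + i]])
                for i in range(width)
            )
            weights.append(math.exp(log_weight))

        # Sample the new site position in proportion to its weight.
        starts[held] = rng.choices(range(len(weights)), weights=weights)[0]

    return starts

# Hypothetical toy sequences sharing the planted word TTGACA.
seqs = ["ACGTTGACAGT", "TTGACACCGTA", "GGCATTGACAC", "CATTGACAGGT"]
starts = gibbs_site_sampler(seqs, width=6, alphabet="ACGT")
print([s[p:p + 6] for s, p in zip(seqs, starts)])
```

Because the sampler is stochastic, different seeds can end on different sites; running it several times and comparing the results is the usual way to judge whether a motif is reproducible.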
We classify (mark) all the input sequences into the parts generated by these two models.

**A Gibbs sampling step**
A new site location is sampled from the distribution. The probability distribution of the new site position, or of its absence in the current sequence, is derived from the statistical models and the content of the current sequence. Statistical models for the background and for the motif are formed using the counters. The motif and background base counters are computed from all the sequence fragments except the current sequence.

**Gibbs sampler algorithm**
What if the alignment is not an option?

**Gibbs sampler algorithm**
Two probabilistic models are formulated: the foreground model (the motif) and the background model.
• The motif site (occurrence) is described by a positional probability matrix q(i,r).
• The background is described by background symbol probabilities f(i).

**Gibbs sampler algorithm**
A probability distribution (where the foreground and background models are different) can be evaluated.

**A complete statistical description of the method is not in the scope of this talk**

**The main idea of the method**
• One of the sequences, chosen randomly, is removed from the alignment, and a probability distribution profile is evaluated.
• The removed sequence is replaced by a new sequence found by searching with the previous motif profile, and a new probability distribution profile is evaluated.
• After several cycles, the method tends to identify a significant motif.

**http://bioinformatics.org.au/glam2/doc/**
GLAM2 is a software package for finding motifs in sequences, typically amino-acid or nucleotide sequences. The main innovation of GLAM2 is that it allows insertions and deletions in motifs. The package includes these programs:
* glam2 - for discovering motifs shared by a set of sequences.
* glam2scan - for finding matches, in a sequence database, to a motif discovered by glam2.
* glam2format - for converting glam2 motifs to standard alignment formats.
* glam2mask - for masking glam2 motifs out of sequences, so that weaker motifs can be found.
* purge - for removing highly similar members of a set of sequences.

**Basic usage**
• Running glam2 without any arguments gives a usage message:
Usage: glam2 [options] alphabet my_seqs.fa
Main alphabets: p = proteins, n = nucleotides
Main options (default settings):
-h: show all options and their default settings
-o: output file (stdout)
-r: number of alignment runs (10)
-n: end each run after this many iterations without improvement (10000)
-2: examine both strands (forward and reverse complement)
-z: minimum number of sequences in the alignment (2)
-a: minimum number of aligned columns (2)
-b: maximum number of aligned columns (50)
-w: initial number of aligned columns (20)
• The main input to glam2 is a file of sequences in FASTA format:
>MyFirstSequence
GHYWVVCTGGGACH
>My2ndSequence
LLIGGPWVWWADDDF
(etc.)
• You need to tell glam2 which alphabet to use:
glam2 p my_prots.fa
glam2 n my_nucs.fa
• Use -o to write the output to a file rather than to the screen:
glam2 -o my_prots.glam2 p my_prots.fa

**How it works**
glam2 starts from a random alignment and makes many small, random changes to it, which are designed to find high-scoring alignments in the long run. The longer you let it run, the more likely it is to find a maximal-scoring alignment. To check that a reproducible, high-scoring motif has been found, the whole procedure is run several (e.g. 10) times from different starting alignments. If all runs produce identical alignments, we have maximum confidence that this is the optimal motif. If a few of the runs produce different, lower-scoring motifs, we still have high confidence.
If all the runs produce completely different alignments, we have low confidence, and the run length needs to be increased.

**MEME: Multiple Expectation maximization for Motif Elicitation**
MAIN DIFFERENCES
• Tries many different initial segments in order to get one that converges to an optimum.
• Tries different analysis window sizes.
• To represent a motif with gaps, more than one motif can be generated.

**MEME: Multiple Expectation maximization for Motif Elicitation**
A very useful program to discover patterns
• Motif discovery from unaligned sequences
• Optimal for genomic or protein sequences
• Especially useful if you do not know the size of the motif
• Identifies profile motifs
• Simultaneously analyzes multiple motifs for any input
• Flexible model of motif presence
• A motif can be absent from some sequences
• A motif can appear several times in one sequence

**The input to MEME contains the following fields**
• Sequences: protein or DNA sequences (do not mix the two) in FASTA format. Note that sequence names must not be repeated. Valid examples are:
>seq1
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAK
>seq2
GDMFCPGYCPDVKPVGDFDLSAFAGAWHELAK
>GdhA Glutamate dehydrogenase from Escherichia coli
RDMFCPGGCPDVKPVGDHDLSAFKGAWHELAL

**MEME input, continued**
• Motif distribution: one per sequence (oops), zero or one per sequence (zoops), or any number of repetitions (tcm).
• Number of motifs [optional]: the program stops the analysis after this number of motifs is found.
• Number of sites: minimum and maximum number of sites (<= 300). Minimum sites = 5, maximum sites = 8.
• Motif width: MEME will find the optimum width of each motif within the minimum and maximum limits you specify.

**MEME input, continued**
• Text output format: by default, MEME output is in hypertext (HTML) format.
• Shuffle letters in input sequences: useful for further statistical analysis.
• Look for palindromes only: averages the letter frequencies in corresponding motif columns together.

**MEME Output**
[Annotated screenshot. Labels: expectation value; motif length; position-specific probability matrix; number of motifs found; consensus; information content]

**MEME Output**
[Annotated screenshot. Labels: position in sequence; statistical significance; sequence names; strand (reverse or complement); motif within sequence]

**MEME Output**
[Annotated screenshot. Labels: motif in complement strand; sequence length; overall statistical significance]

**MAST**
• Searches for motifs (one or more) in sequence databases:
• Like BLAST, but motifs are used as input
• Similar to the matrices obtained by iteration in PSI-BLAST
• The profile defines the statistical significance of a match
• Multiple motif matches per sequence
• Combined E value for all motifs
• MEME uses MAST to summarize results:
• Each MEME result is accompanied by the MAST result for searching the discovered motifs on the given sequences.

**MAST input**
• Email address
• Motif file (e.g. MEME output)
• Database (like BLAST)
• Consider matched sequence length
• E value threshold

**MAST output**
[Annotated screenshot. Labels: link to GenBank; matched accession; match E value; length of sequence]

**MAST output**
[Annotated screenshot. Label: motif diagram]

**MAST output**
[Annotated screenshot. Labels: position of each instance; matched parts of sequence; motif and orientation; P value of instance; motif 'consensus']
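The core of a motif search like the one MAST reports (position, strand, and matched segment of each instance) can be illustrated by sliding a score matrix along a sequence. The sketch below is not MAST's implementation: it computes no P values or E values, the `scan` and `reverse_complement` helpers are hypothetical names, and the score matrix is a made-up match/mismatch matrix for the word TATAAT rather than a real MEME motif.

```python
def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def scan(pssm, sequence, threshold):
    """Score every window on both strands; return hits at or above threshold.

    Each hit is (strand, start, segment, score). For the "-" strand, `start`
    is counted along the reverse-complemented sequence.
    """
    width = len(pssm)
    hits = []
    for strand, text in (("+", sequence), ("-", reverse_complement(sequence))):
        for start in range(len(text) - width + 1):
            segment = text[start:start + width]
            total = sum(column[r] for column, r in zip(pssm, segment))
            if total >= threshold:
                hits.append((strand, start, segment, total))
    # Best-scoring instances first.
    return sorted(hits, key=lambda hit: -hit[3])

# Hypothetical score matrix: +1 for the consensus letter, -1 otherwise.
pssm = [{r: (1.0 if r == c else -1.0) for r in "ACGT"} for c in "TATAAT"]

hits = scan(pssm, "GGTATAATCC", threshold=4.0)
print(hits)  # [('+', 2, 'TATAAT', 6.0)]
```

In a real search the threshold would come from a significance estimate rather than a fixed score, which is the part this sketch leaves out.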