Module 3 Sequence and Protein Analysis (Using web-based tools) Working with Pathogen Genomes - Uruguay 2008
Artemis & ACT PSU Projects Organism Database entry Finished genome Annotated genome
• Primary DNA sequence
• Preannotation: Gene finders, BlastN, tRNA scan, BlastX, Dotter
• Features: Repeats, rRNA, tRNA, Pseudo-genes, CDSs
• Manual curation
• Analysis of CDSs: Fasta, BlastP, Pfam, Prosite, Psort, SignalP, TMHMM
• Manual curation
• Annotated sequence
Annotation of protein-coding genes (from gene model to protein function)
• Search programs: local (BLAST) and global (FASTA) alignments, EST hits
• Protein domains and motifs: InterPro (Pfam, Prosite, SMART etc.)
• Transmembrane / signal peptide prediction (TMHMM, SignalP, Phobius)
• Base annotation on characterised proteins where possible (manually curated Swiss-Prot entries)
• Read the literature (PubMed)
Use several lines of evidence!
Annotation of non-protein-coding genes (tRNAs, rRNAs, snRNAs, other ncRNAs)
Structural conservation of ncRNAs!
• Initial searches: BlastN, GC-plots, tRNA scan, sno scan, others
• Search in specialised databases: Rfam scan, microRNAdb etc.
• Comparative ncRNA prediction tools: RNAZ, Evofold, QRNA etc.
• Structure prediction of ncRNAs: MFOLD, others
Use several lines of evidence!
Statistical significance of database hits: E-values (expectation values)
• E-value = the number of alignments with an equivalent or higher score that you would expect to find by random chance in a database of that size.
• An E-value of 5 means that you would expect 5 alignments with an equivalent or higher score to have occurred by random chance. The lower the E-value, the more significant the hit.
• More reliable than the % ID
• Caution: repeat regions / non-curated protein sequences
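To make the definition concrete, here is a minimal sketch of how an E-value relates to a bit score, using the standard Karlin-Altschul relation E = m × n × 2^(−S'), where S' is the bit score, m the query length and n the total database length. The function names are hypothetical and the lengths are illustrative; real BLAST also applies length corrections that are ignored here.

```python
import math


def evalue_from_bit_score(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least `bit_score`.

    Assumption: plain Karlin-Altschul relation E = m * n * 2**(-S'),
    without the effective-length corrections that BLAST applies.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def prob_at_least_one(evalue: float) -> float:
    """Probability of seeing one or more chance hits, P = 1 - exp(-E)."""
    return -math.expm1(-evalue)


# Illustrative numbers only: a 1,024-residue query against a
# database of 1,048,576 residues, with a bit score of 30.
e = evalue_from_bit_score(30, 1024, 1048576)
print(e)                      # expected number of chance hits
print(prob_at_least_one(5))   # chance of at least one random hit when E = 5
```

Each extra bit of score halves the E-value, and searching a larger database raises the E-value for the same score, which is why an E-value is only comparable between searches against databases of similar size.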
Sequence similarity searching
BLAST (Basic Local Alignment Search Tool) analysis:
Nucleotide query sequences:
• blastn: nucleotide query vs nucleotide database
• blastx: nucleotide query translated in all 6 frames vs protein database
• tblastx: nucleotide query translated in all 6 frames vs nucleotide database translated in all 6 frames
Protein query sequences:
• blastp: protein query vs protein database
• tblastn: protein query vs nucleotide database translated in all 6 frames
FastA: provides sequence similarity and homology searching against nucleotide and protein databases using the Fasta programs. Fasta can be very specific when identifying long regions of low similarity, especially for highly diverged sequences.
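The "all 6 frames" translation used by blastx, tblastn and tblastx can be sketched as follows. This is only an illustration of the idea, assuming the standard genetic code and hypothetical function names; it is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict:
    """Translate frames +1..+3 (forward) and -1..-3 (reverse complement)."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six translations is then compared at the protein level, which is why translated searches can find coding regions even when the reading frame and strand are unknown.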
FASTA (Global) BLAST (Local)
Orthologues and paralogues
• Orthologues (e.g. human hemoglobin and mouse hemoglobin): originate from speciation; similar functions
• Paralogues (e.g. human hemoglobin and human myoglobin): originate from gene duplication; diverged functions
Best tool to look for orthologues? Blast or FastA? FastA!
Functional assignment: alignments of modular proteins
[Figure: modular proteins composed of domains A, B and C]
HMMs WHAAAAT??? A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications. An HMM can be considered as the simplest dynamic Bayesian network.
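A toy example may help here. The sketch below decodes the most likely hidden state path for an observed DNA string with the Viterbi algorithm, one standard way of recovering the hidden part of an HMM from what is observed. The two states ("AT-rich", "GC-rich") and all probabilities are invented for illustration and do not come from any real model.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state path (log-space Viterbi)."""
    if not observations:
        return []
    # Best log-probability of any path ending in each state at position 0.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back = [{}]
    for symbol in observations[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            current[s] = (prev[best_prev] + math.log(trans_p[best_prev][s])
                          + math.log(emit_p[s][symbol]))
            pointers[s] = best_prev
        scores.append(current)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical two-state model: all numbers are made up for the example.
STATES = ("AT-rich", "GC-rich")
START = {"AT-rich": 0.5, "GC-rich": 0.5}
TRANS = {"AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},
         "GC-rich": {"AT-rich": 0.1, "GC-rich": 0.9}}
EMIT = {"AT-rich": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "GC-rich": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}

decoded = viterbi("AAAAGGGG", STATES, START, TRANS, EMIT)
print(decoded)
```

The observed letters are visible, but the state that produced each letter is not; the algorithm picks the state path that best explains the observations under the model.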
Aligned sequences:
..HMPLKHRLHP..
..RMPLKHRPHP..
..GMRLKHRHHP..
..PMGLKHAGHP..
Consensus:
..-MPLKHR-HP..
Profile HMM for the aligned motif that can be used to search databases for proteins containing this motif
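The step from aligned sequences to a consensus can be sketched with simple per-column residue frequencies. This is a simplification: a real profile HMM also models insertions and deletions. The function names and the 50% threshold are assumptions, chosen here because they reproduce the consensus shown for these four sequences.

```python
from collections import Counter

# The four aligned motif instances from the slide.
ALIGNED = ["HMPLKHRLHP", "RMPLKHRPHP", "GMRLKHRHHP", "PMGLKHAGHP"]


def column_frequencies(alignment):
    """Return one {residue: fraction} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*alignment)
    ]


def consensus(alignment, min_fraction=0.5):
    """Most common residue per column, or '-' if below `min_fraction`.

    Assumption: the 0.5 cut-off is illustrative, not a published rule.
    """
    out = []
    for freqs in column_frequencies(alignment):
        residue, fraction = max(freqs.items(), key=lambda kv: kv[1])
        out.append(residue if fraction >= min_fraction else "-")
    return "".join(out)


print(consensus(ALIGNED))
```

Columns where no residue reaches the threshold are the variable positions, and the per-column frequencies (rather than the consensus string itself) are what a profile keeps for database searching.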
Remote homology detection
Create HMM from ..-MPLKHR-HP.., then search database with HMM:
..RMPLKHRFHP..
..PMPLKHRIHP..
..HMPLKHDVHP..
..YMDLKHELHP..
..-MPLKHR-HP..
Methods:
• FastA
• Blast
• Psi-blast
• HMM searches
• HMM-HMM comparison: HHPred server http://toolkit.tuebingen.mpg.de/hhpred
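As a rough illustration of "search database with HMM", the sketch below scores every window of each database sequence against a position-specific log-odds profile built from the four sequences above. It is a stand-in for a real profile HMM search, not an implementation of one: there are no insert or delete states, the background is assumed uniform, the pseudocount is arbitrary, and the two "database" sequences are invented.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
TRAINING = ["RMPLKHRFHP", "PMPLKHRIHP", "HMPLKHDVHP", "YMDLKHELHP"]


def build_log_odds(alignment, pseudocount=1.0):
    """Per-column log-odds scores versus a uniform background (assumption)."""
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def best_window(profile, sequence):
    """Return (score, start) of the best ungapped match in `sequence`."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside ALPHABET (e.g. X) score 0, i.e. neutral.
        score = sum(col.get(aa, 0.0) for col, aa in zip(profile, window))
        best = max(best, (score, start))
    return best


# Hypothetical database: both entries are made up for the example.
DATABASE = {
    "hypothetical_hit": "GSAKMPLKHRTHPLLE",
    "hypothetical_decoy": "GSAKLVNEDQTSAWYLLE",
}

profile = build_log_odds(TRAINING)
ranked = sorted(DATABASE, key=lambda name: best_window(profile, DATABASE[name]),
                reverse=True)
for name in ranked:
    print(name, best_window(profile, DATABASE[name]))
```

Because the profile rewards the conserved positions and tolerates the variable ones, a sequence can score well without closely resembling any single training sequence, which is the basis of the extra sensitivity of profile and HMM searches over pairwise methods.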
Input protein sequence
• Psi-blast
• Secondary structure prediction
• Alignment
• HMM building
• Secondary structure comparison
• HMM-HMM comparison
Extremely sensitive remote homology detection
3D structure modelling
Module 3 Exercises:
Section A:
• Sequence retrieval of a P. falciparum protein (cyclophilin) using SRS
• BLAST and Fasta searches by cutting & pasting the sequence
Section B:
Exercise 1 Part I:
• Search PROSITE server by cutting & pasting the cyclophilin sequence
Exercise 1 Part II:
• Pfam server
Exercise 1 Part III:
• SMART server
Exercise 1 Part IV:
• InterPro server
Exercise 2:
• Sequence retrieval of P. falciparum PFC0125w protein using SRS
• TMHMMv2.0 server
• SignalPv3.0 server
Section C:
• Other web resources