Module 3 Sequence and Protein Analysis (Using web-based tools)

Working with Pathogen Genomes - Uruguay 2008

Artemis & ACT PSU Projects Organism Database entry Finished genome Annotated genome

Annotation using Artemis: mapping domains in proteins

Primary DNA sequence
• Preannotation: gene finders, BlastN, BlastX, tRNA scan, Dotter
• Features found: repeats, rRNA, tRNA, pseudo-genes, CDSs
• Analysis of CDSs: Fasta, BlastP, Pfam, Prosite, Psort, SignalP, TMHMM
• Manual curation
Annotated sequence

Gene model annotation Protein function

Annotation of protein-coding genes (from gene model to protein function):
• Search programs: local (BLAST) and global (FASTA) alignments, EST hits
• Protein domains and motifs: InterPro (Pfam, Prosite, SMART etc.)
• Transmembrane / signal peptide prediction (TMHMM, SignalP, Phobius)
• Base annotation on characterised proteins where possible (manually curated Swiss-Prot entry)
• Read the literature (PubMed)
Use several lines of evidence!

Annotation of non-protein-coding genes (tRNAs, rRNAs, snRNAs, other ncRNAs):
Structural conservation of ncRNAs!
• Initial searches: BlastN, GC-plots, tRNA scan, sno scan, others
• Search in specialised databases: Rfam scan, microRNAdb etc.
• Comparative ncRNA prediction tools: RNAZ, Evofold, QRNA etc.
• Structure prediction of ncRNAs: MFOLD, others
Use several lines of evidence!

Statistical significance of database hits: E-values (expectation values)
E-value = number of alignments with an equivalent or higher score that you would expect to find by random chance. An E-value of 5 means that you would expect 5 alignments with an equivalent or higher score to occur by random chance in a search of that database. The lower the E-value, the more significant the hit.
• The E-value is more reliable than the % ID
• Caution: repeat regions / non-curated protein sequences
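To make the definition concrete, here is a minimal sketch of how an E-value follows from a bit score, using the standard relation E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the total database length. The function name and the example lengths are hypothetical, and real BLAST uses edge-corrected "effective" lengths, so its reported values will differ somewhat.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value: number of chance alignments scoring >= bit_score.

    Assumption: raw lengths are used for the search space; BLAST itself
    applies an effective-length correction that this sketch ignores.
    """
    search_space = query_len * db_len
    return search_space * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: a 300-residue query against 10 million residues.
    for score in (30, 40, 50, 60):
        print(score, expected_hits(score, 300, 10_000_000))
```

Each extra bit of score halves the E-value, and the same score becomes less significant as the database grows, which is why an E-value is only meaningful for the database it was computed against.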

Sequence similarity searching
BLAST (Basic Local Alignment Search Tool) analysis:
Nucleotide queries:
• blastn: nucleotide query vs nucleotide database
• blastx: nucleotide query translated in all 6 frames vs protein database
• tblastx: nucleotide query translated in all 6 frames vs nucleotide database translated in all 6 frames
Protein queries:
• blastp: protein query vs protein database
• tblastn: protein query vs nucleotide database translated in all 6 frames
FastA: provides sequence similarity and homology searching against nucleotide and protein databases using the Fasta programs. Fasta can be very specific when identifying long regions of low similarity, especially for highly diverged sequences.
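The "all 6 frames" translation used by blastx, tblastn and tblastx can be sketched as follows. This is an illustrative example only, assuming the standard genetic code and an unambiguous A/C/G/T sequence; the function names are hypothetical and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return the three forward and three reverse-strand translations."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
        print(name, protein)
```

A translated search compares each of these six protein strings against the database, which is why it can find coding regions without knowing the strand or reading frame in advance.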

FASTA (Global) BLAST (Local)
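The difference between a global and a local alignment can be illustrated with a small score-only dynamic programming sketch (Needleman-Wunsch for global, Smith-Waterman for local). The scoring values and function name are assumptions chosen for illustration; BLAST and FASTA use substitution matrices, affine gap penalties and heuristics that are not shown here.

```python
def align_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b under a simple linear gap model.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of subsequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


if __name__ == "__main__":
    # Two sequences sharing one motif with unrelated flanks.
    a, b = "GGGGMPLKHR", "MPLKHRTTTT"
    print("global:", align_score(a, b, local=False))
    print("local: ", align_score(a, b, local=True))
```

With a shared motif surrounded by unrelated sequence, the local score reflects the motif alone, while the global score is pulled down by the flanking regions.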

Orthologues and paralogues
• Orthologues (e.g. human hemoglobin and mouse hemoglobin): originate from speciation; similar functions
• Paralogues (e.g. human hemoglobin and human myoglobin): originate from gene duplication; diverged functions
Best tool to look for orthologues? Blast or FastA? FastA!

Functional assignment: alignments of modular proteins
[Figure: proteins made up of different combinations of domains A, B and C]

HMMs WHAAAAT??? A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications. An HMM can be considered as the simplest dynamic Bayesian network.

Aligned sequences:
..HMPLKHRLHP..
..RMPLKHRPHP..
..GMRLKHRHHP..
..PMGLKHAGHP..
Consensus:
..-MPLKHR-HP..
Profile HMM for the aligned motif that can be used to search databases for proteins containing this motif
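As a rough illustration of how aligned sequences become a searchable profile, the sketch below builds per-column residue frequencies from the four sequences on the slide and scores candidates against them. This is a deliberate simplification: it models only the match columns, as a position-specific scoring matrix, and has none of the insert and delete states or transition probabilities of a real profile HMM such as those built by HMMER. The 50% consensus threshold, the pseudocount and the uniform background are assumptions.

```python
import math
from collections import Counter

ALIGNED = ["HMPLKHRLHP", "RMPLKHRPHP", "GMRLKHRHHP", "PMGLKHAGHP"]
N_RESIDUES = 20  # assumed uniform background over the 20 amino acids


def consensus(seqs, threshold=0.5):
    """Most common residue per column, or '-' if it is below the threshold."""
    out = []
    for column in zip(*seqs):
        residue, count = Counter(column).most_common(1)[0]
        out.append(residue if count / len(seqs) >= threshold else "-")
    return "".join(out)


def build_profile(seqs, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background."""
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(seqs) + N_RESIDUES * pseudocount

        def log_odds(residue, counts=counts, total=total):
            freq = (counts.get(residue, 0) + pseudocount) / total
            return math.log(freq * N_RESIDUES)

        profile.append(log_odds)
    return profile


def score(profile, candidate):
    """Sum of column scores for an ungapped candidate of the same length."""
    return sum(col(res) for col, res in zip(profile, candidate))


if __name__ == "__main__":
    profile = build_profile(ALIGNED)
    print(consensus(ALIGNED))
    for candidate in ("RMPLKHRFHP", "YMDLKHELHP", "AAAAAAAAAA"):
        print(candidate, round(score(profile, candidate), 2))
```

Candidates that keep the conserved columns score well even where the variable columns differ, which is the property that lets profile methods find more distant relatives than a single-sequence search.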

Remote homology detection
..-MPLKHR-HP.. : create HMM, then search database with HMM
Hits:
..RMPLKHRFHP..
..PMPLKHRIHP..
..HMPLKHDVHP..
..YMDLKHELHP..
Methods:
• FastA
• Blast
• Psi-blast
• HMM searches
• HMM-HMM comparison: HHPred server http://toolkit.tuebingen.mpg.de/hhpred

HHPred: extremely sensitive remote homology detection
• Input protein sequence
• Psi-blast
• Secondary structure prediction
• Alignment
• HMM building
• Secondary structure comparison
• HMM-HMM comparison
• 3D structure modelling

Module 3 Exercises:
Section A:
• Sequence retrieval of a P. falciparum protein (cyclophilin) using SRS
• BLAST and Fasta searches by cutting & pasting the sequence
Section B:
Exercise 1 Part I:
• Search PROSITE server by cutting & pasting the cyclophilin sequence
Exercise 1 Part II:
• Pfam server
Exercise 1 Part III:
• SMART server
Exercise 1 Part IV:
• InterPro server
Exercise 2:
• Sequence retrieval of P. falciparum PFC0125w protein using SRS
• TMHMM v2.0 server
• SignalP v3.0 server
Section C:
• Other web resources
