This guide explores various methods for protein function prediction, focusing on the significance of Gene Ontology (GO) in gene annotation. It details the steps for annotating new proteins, whether through manual processes or automated tools. Key approaches discussed include sequence homology, phylogenetic trees, and protein domains, highlighting their roles in predicting protein functions. Tools such as BLAST2GO, InterProScan, and PANNZER are examined for their capabilities and benefits. The necessity for reliable and efficient prediction methods in genomics research is emphasized.
Outline • Goal • How function is defined • Why Gene Ontology • Methods for protein function prediction • Endpoints
Goal • A) You find a new protein • B) You sequence the whole genome of your favorite organism • The obtained gene(s) should be annotated • A can be solved manually; B needs automatic tools
How function is defined • Functional description as text • Linking the gene to keywords (UniProt) • Linking the gene to Gene Ontology • Linking the gene to signalling pathways or biochemical pathways (KEGG)
Why Gene Ontology (GO) • GO is currently a popular standard in gene annotation • GO consists of categories that represent gene function • Groups together genes in the same process • Easy summary for genes with similar function
Why Gene Ontology (GO) • 3 sub-parts: Biological Process, Molecular Function, Cellular Component (cellular localization) • Molecular Function => chemical activity • Biological Process => biology, cellular process • Cellular Component => location of the gene product • Hierarchical structure • Categories with very precise function • Categories with less precise function • Categories with very broad function
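The hierarchical structure means that an annotation to a precise category also implies every broader category above it. A minimal sketch of that idea follows; the four-term hierarchy is a simplified toy, the term names are used for illustration only, and `ancestors` and `propagate` are hypothetical helper names.

```python
# Toy GO-like hierarchy: child term -> list of parent terms.
# Simplified for illustration; the real ontology has more levels and parents.
PARENTS = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}

def ancestors(term, parents=PARENTS):
    """Return all broader categories reachable from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def propagate(annotations, parents=PARENTS):
    """Extend precise annotations with every broader category above them."""
    full = set(annotations)
    for term in annotations:
        full |= ancestors(term, parents)
    return full

broad = propagate({"protein kinase activity"})
```

Here `broad` contains the precise term plus the three broader ones, which is what lets genes with different precise functions be summarized under a shared broad category.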
How GO helps • End user: summary categories for genes with various functions • Computer programs: classifier algorithms can be taught to predict the categories for genes
Understanding GO • AmiGO server (http://amigo.geneontology.org/cgi-bin/amigo/go.cgi)
Function prediction: what can we use to predict function • Sequence homology (BLAST result list) • Phylogenetic tree of sequences • Protein domains (PFAM domains) • Short sequence patterns – motifs • Sequence features (secondary structure, low-complexity regions)
Sequence homology methods • Do a BLAST search with a query sequence • Collect the GO classes for the genes in the BLAST result list • Give a weight to each BLAST hit • often -log(E-value), so that stronger hits weigh more • Combine the scores from the genes that belong to the same GO class • Report the top-scoring / significant GO classes
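The steps above can be sketched in a few lines. This is a minimal illustration, not the scoring used by any of the named programs: the function name `score_go_classes`, the clamp value `min_evalue`, and the three example hits are all assumptions made up for the example.

```python
import math
from collections import defaultdict

def score_go_classes(hits, min_evalue=1e-180):
    """Sum -log10(E-value) weights over the hits that share a GO class.

    `hits` is a list of (evalue, go_classes) pairs, one per BLAST hit.
    """
    scores = defaultdict(float)
    for evalue, go_classes in hits:
        # Clamp so that a reported E-value of 0 does not break the logarithm.
        weight = -math.log10(max(evalue, min_evalue))
        if weight <= 0:
            continue  # hits with E-value >= 1 add no evidence
        for go in set(go_classes):
            scores[go] += weight
    # Highest combined score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented example hits: (E-value, GO classes of the hit gene).
hits = [
    (1e-50, ["GO:0004672", "GO:0005524"]),
    (1e-20, ["GO:0004672"]),
    (1e-5, ["GO:0003677"]),
]
ranked = score_go_classes(hits)
```

In this example the class supported by two hits ranks first. Summing weights is only one way to combine hits; a real tool would also need a rule for deciding which scores count as significant.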
Sequence homology methods • Simple methods • Programs: • BLAST2GO (http://www.blast2go.com/b2ghome) • GOTCHA (http://www.compbio.dundee.ac.uk/gotcha/gotcha.php) • ARGOT (http://www.medcomp.medicina.unipd.it/Argot2/form.php) • PFP (http://kiharalab.org/web/pfp.php)
Phylogenetic tree methods • Compute the pairwise distances for the set of genes • Do a hierarchical clustering of the genes • Map the known GO functions to the cluster tree • Look for unknown genes in a cluster with many genes from the same GO class • Report the top-scoring / significant GO classes • More => http://genome.cshlp.org/content/8/3/163.full
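A rough sketch of the clustering idea follows. It is a deliberate simplification: instead of building a full tree it groups genes by single linkage at a fixed distance cutoff (equivalent to cutting a single-linkage tree at that height) and then lets the annotated cluster members vote. The function names, the `cutoff` and `min_fraction` parameters, and all distances and annotations are invented for illustration; real phylogenetic methods use a proper tree and an evolutionary model.

```python
from collections import Counter

def single_linkage_clusters(names, dist, cutoff):
    """Group genes connected by chains of pairwise distances below `cutoff`."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster representative.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Pairs missing from `dist` are treated as far apart.
            if dist.get(frozenset((a, b)), float("inf")) < cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())

def predict_from_clusters(clusters, known_go, min_fraction=0.6):
    """Give each unannotated gene the GO classes common in its cluster."""
    predictions = {}
    for members in clusters:
        annotated = [m for m in members if m in known_go]
        if not annotated:
            continue
        counts = Counter(go for m in annotated for go in set(known_go[m]))
        common = sorted(go for go, c in counts.items()
                        if c / len(annotated) >= min_fraction)
        for m in members:
            if m not in known_go:
                predictions[m] = common
    return predictions

names = ["geneA", "geneB", "geneC", "geneD", "query"]
dist = {
    frozenset(("geneA", "geneB")): 0.10,
    frozenset(("geneA", "query")): 0.20,
    frozenset(("geneB", "query")): 0.15,
    frozenset(("geneC", "geneD")): 0.10,
}
known_go = {
    "geneA": ["GO:0004672"],
    "geneB": ["GO:0004672", "GO:0005524"],
    "geneC": ["GO:0003677"],
    "geneD": ["GO:0003677"],
}
clusters = single_linkage_clusters(names, dist, cutoff=0.3)
predictions = predict_from_clusters(clusters, known_go)
```

The query lands in the cluster with geneA and geneB and inherits only the GO class that both of them carry.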
Phylogenetic tree methods • These should outperform sequence homology methods (CAFA 2011?) • Require a set of related genes • Often much heavier calculations • Programs: • Sifter (http://genome.cshlp.org/content/early/2011/07/22/gr.104687.109)
Prediction with protein domains • Look at which protein domains the query protein contains (PFAM) • Map the functions that are linked to those domains to your query sequence • PFAM2GO • Programs: InterProScan + PFAM2GO • Drawbacks: • The mapping is the same in plants, mammals, and bacteria • Many domains map to a specific function
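The domain-based mapping amounts to a table lookup, sketched below. The table is a small illustrative excerpt that should be checked against a current domain-to-GO mapping file before any real use, the accession `PF99999` is made up to show an unmapped domain, and `annotate_by_domains` is a hypothetical helper name.

```python
# Illustrative excerpt of a domain -> GO table (verify against a current
# mapping file); in practice this would be loaded from that file.
DOMAIN_TO_GO = {
    "PF00069": ["GO:0004672", "GO:0005524", "GO:0006468"],
    "PF00046": ["GO:0003677"],
}

def annotate_by_domains(domains, domain_to_go=DOMAIN_TO_GO):
    """Return the GO classes for a protein with the domains supporting each."""
    support = {}
    for domain in domains:
        for go in domain_to_go.get(domain, []):
            support.setdefault(go, []).append(domain)
    return support

# Domains predicted for the query; "PF99999" is invented and has no mapping.
query_domains = ["PF00069", "PF99999"]
annotation = annotate_by_domains(query_domains)
```

Because the lookup ignores the organism, the same domain always yields the same GO classes, which is the first drawback listed on the slide.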
Prediction with protein domains • Benefits: • Can create an annotation from separate domains • Similar sequences do not have to be in the database • Programs (?): InterProScan (http://www.ebi.ac.uk/InterProScan/)
Prediction with patterns and motifs • Same principle as before, but we look at sequence patterns and motifs • Map the functions that are linked to the patterns to your query sequence • Programs: • InterProScan • IBM BioDictionary (http://cbcsrv.watson.ibm.com/Tpa.html) • Drawbacks and benefits approximately the same as before
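As a small illustration of pattern matching, the sketch below scans a sequence for one PROSITE-style pattern, N-{P}-[ST]-{P} (the N-glycosylation site pattern), written as a regular expression. The function label attached to the motif is a placeholder rather than a real GO mapping, the test sequence is invented, and `find_motifs` is a hypothetical helper name.

```python
import re

# Motif name -> (regular expression, linked function label).
# The label is a placeholder; a real table would link motifs to GO classes.
MOTIFS = {
    "N-glycosylation site": (r"N[^P][ST][^P]", "example function label"),
}

def find_motifs(sequence, motifs=MOTIFS):
    """Return (motif name, start position, matched text, label) per match."""
    found = []
    for name, (pattern, label) in motifs.items():
        # The lookahead lets overlapping occurrences be reported as well.
        for m in re.finditer(f"(?=({pattern}))", sequence):
            found.append((name, m.start(), m.group(1), label))
    return found

matches = find_motifs("MKNGSAANPSL")
```

Only the first asparagine matches here; the second is followed by a proline, which the pattern excludes. Short patterns like this one match often by chance, so a hit is weak evidence on its own.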
Prediction with sequence features • Again the same principle as before • We look at sequence features (see picture) • These are given as input to a classifier algorithm (Support Vector Machine)
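To make "sequence features" concrete, the sketch below computes three simple ones from the sequence alone. The choice of features, the hydrophobic residue set, the window size, and the name `sequence_features` are assumptions for illustration and are far cruder than what a dedicated tool computes. The resulting numbers are the kind of input a classifier such as a Support Vector Machine would be trained on; the classifier itself is not shown.

```python
import math
from collections import Counter

# One common choice of hydrophobic residues; other definitions exist.
HYDROPHOBIC = set("AVILMFWC")

def shannon_entropy(fragment):
    """Shannon entropy (bits) of the residue composition of a fragment."""
    counts = Counter(fragment)
    total = len(fragment)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def sequence_features(seq, window=12):
    """Compute a few per-protein features; `seq` must be non-empty."""
    seq = seq.upper()
    length = len(seq)
    counts = Counter(seq)
    hydrophobic_fraction = sum(counts[a] for a in HYDROPHOBIC) / length
    # Lowest entropy over all windows: a crude low-complexity signal.
    starts = range(max(1, length - window + 1))
    min_entropy = min(shannon_entropy(seq[i:i + window]) for i in starts)
    return {
        "length": length,
        "hydrophobic_fraction": hydrophobic_fraction,
        "min_window_entropy": min_entropy,
    }

features = sequence_features("AAAAAAAAAAAAKDE")
```

The example sequence starts with a run of twelve alanines, so its lowest window entropy is zero, flagging a low-complexity region.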
Prediction with sequence features • Benefits: • No actual sequence similarity needed • Information collected from vague similarities • Use of a classifier => feature weighting • Program: FFPred (http://bioinf.cs.ucl.ac.uk/ffpred/) • Drawbacks: • Calculations probably quite heavy • No use of nearby sequence similarities (domains etc.)
Our contribution: PANNZER • Uses the BLAST result list • Adds taxonomic information • Scores GO classes using a score that takes the frequency of the GO class in the sequence database into account • The method is used to predict: • GO classes • Description line
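The slide does not give the scoring formula, so the following is not PANNZER's score. It is only a generic sketch of why database frequency matters: a GO class that is common in the whole database is expected among the hits by chance, so the observed count is compared with that expectation. The log-ratio form, the pseudocount, the function name, and all numbers are assumptions; taxonomic information is left out entirely.

```python
import math
from collections import Counter

def background_corrected_scores(hit_go_lists, db_frequency, pseudocount=1.0):
    """Score GO classes by over-representation among the BLAST hits.

    `hit_go_lists` holds the GO classes of each hit; `db_frequency` maps a
    GO class to the fraction of database sequences annotated with it.
    """
    n_hits = len(hit_go_lists)
    counts = Counter(go for gos in hit_go_lists for go in set(gos))
    scores = {}
    for go, observed in counts.items():
        expected = n_hits * db_frequency[go]
        # log2 of observed over expected count, smoothed by a pseudocount.
        scores[go] = math.log2((observed + pseudocount) / (expected + pseudocount))
    return scores

# Invented example: both classes occur in 3 of 4 hits, but one of them is
# far more common in the database as a whole.
hit_go_lists = [
    ["GO:0004672", "GO:0005524"],
    ["GO:0004672", "GO:0005524"],
    ["GO:0004672"],
    ["GO:0005524"],
]
db_frequency = {"GO:0004672": 0.01, "GO:0005524": 0.25}
scores = background_corrected_scores(hit_go_lists, db_frequency)
```

Although both classes appear equally often among the hits, the class that is rare in the database receives the higher score.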
Our contribution: PANNZER • Benefits: • Takes the species taxonomy into account • Improved use of statistics • Not public yet
Our contribution: No Name Yet • Take PFAM domain predictions, BLAST similarities and taxonomic information • Feed these to feature selection and to a classifier algorithm • …Wait… • The method is used to predict GO classes • Not public + testing is ongoing
Conclusion • These methods are increasingly needed • Some methods exist • Unfortunately no clear evaluation (my opinion) • Remember: these are predictions. No certain information until they are tested in the wet lab…