Automated Function Prediction

Outline • Goal • How function is defined • Why Gene Ontology • Methods for protein function prediction • Endpoints

Goal • A) You find a new protein • B) You sequence the whole genome of your favorite organism • The obtained gene(s) should be annotated • A can be solved manually; B needs automatic tools

How function is defined • Functional description as text • Linking a gene to keywords (UniProt) • Linking a gene to Gene Ontology • Linking a gene to signalling pathways or biochemical pathways (KEGG)

Why Gene Ontology (GO) • GO is currently a popular standard in gene annotation • GO consists of categories that represent gene function • Groups together the genes in the same process • Gives an easy summary for genes with similar function

Why Gene Ontology (GO) • 3 sub-parts: Biological Process, Molecular Function, Cellular Component • Molecular Function => chemical activity • Biological Process => biology, cellular process • Cellular Component => location of the gene product • Hierarchical structure • Categories with very precise function • Categories with less precise function • Categories with very broad function
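The hierarchical structure means that a gene annotated with a precise category also belongs to every broader category above it. A minimal sketch of that idea, assuming a small hand-written fragment of the Molecular Function hierarchy (the `PARENTS` table is simplified for illustration and is not loaded from the real ontology):

```python
# Simplified, hand-written fragment of the Molecular Function hierarchy.
# Each term maps to its more general parent terms (child -> parents).
PARENTS = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["transferase activity"],
    "transferase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}


def ancestors(term, parents=PARENTS):
    """Return all more general terms reachable from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(annotations, parents=PARENTS):
    """Add every ancestor of each annotated term to the annotation set."""
    full = set(annotations)
    for term in annotations:
        full |= ancestors(term, parents)
    return full


# A gene annotated only with the most precise term ...
gene_terms = propagate({"protein kinase activity"})
# ... now also carries the less precise and the very broad categories.
```

Summaries for an end user can then be made at whichever level of precision is useful.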

How GO helps • End user: summary categories for genes with various functions • Computer programs: classifier algorithms can be taught to predict the categories for genes

Understanding GO • AmiGO server (http://amigo.geneontology.org/cgi-bin/amigo/go.cgi)

Function Prediction: What can we use to predict function • Sequence homology (BLAST result list) • Phylogenetic tree of sequences • Protein domains (PFAM domains) • Short sequence patterns (motifs) • Sequence features (sec. struct., low compl. regions)

Sequence Homology Methods • Do a BLAST search with a query sequence • Collect the GO classes for the genes in the BLAST result list • Give a weight to each BLAST hit • often -log(E-value) • Combine the scores from the genes that belong to the same GO class • Report the top / significant GO classes
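The steps on this slide can be sketched in a few lines. This is a minimal illustration, not the scoring of any particular program: the hit list, the gene names and the `gene_to_go` table are made up, and the weight is assumed to be -log10(E-value), capped so that an E-value of 0 does not break the logarithm.

```python
import math
from collections import defaultdict


def score_go_classes(hits, gene_to_go, min_evalue=1e-180):
    """Sum the -log10(E-value) weights of BLAST hits per GO class.

    hits       -- list of (gene, evalue) pairs from a BLAST result list
    gene_to_go -- dict mapping each gene to its known GO classes
    """
    scores = defaultdict(float)
    for gene, evalue in hits:
        weight = -math.log10(max(evalue, min_evalue))
        if weight <= 0:
            continue  # E-value >= 1 carries no evidence
        for go in gene_to_go.get(gene, ()):
            scores[go] += weight
    # Best-scoring GO classes first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical BLAST result list and annotations
hits = [("geneA", 1e-50), ("geneB", 1e-20), ("geneC", 1e-5)]
gene_to_go = {
    "geneA": {"GO:0005524"},
    "geneB": {"GO:0005524", "GO:0016301"},
    "geneC": {"GO:0003677"},
}
ranked = score_go_classes(hits, gene_to_go)
```

Here `GO:0005524` ranks first because two strong hits support it. Deciding which of the ranked classes are significant needs a further cutoff, which this sketch leaves out.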

Sequence Homology Methods • Simple methods • Programs • BLAST2GO (http://www.blast2go.com/b2ghome) • GOTCHA (http://www.compbio.dundee.ac.uk/gotcha/gotcha.php) • ARGOT (http://www.medcomp.medicina.unipd.it/Argot2/form.php) • PFP (http://kiharalab.org/web/pfp.php)

Phylogenetic tree methods • Compute the pairwise distances for the set of genes • Do a hierarchical clustering of the genes • Map the known GO functions to the cluster tree • Look for unknown genes in a cluster with many genes from the same GO class • Report the top / significant GO classes • More => http://genome.cshlp.org/content/8/3/163.full
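A rough sketch of the clustering steps above, using plain average-linkage clustering with a distance cutoff. This is a simplification of real phylogenetic methods; the gene names, distances, cutoff and `min_support` threshold are all invented for illustration.

```python
from collections import Counter
from itertools import combinations


def average_linkage_clusters(names, dist, cutoff):
    """Merge the two closest clusters until the closest pair exceeds cutoff."""
    clusters = [[n] for n in names]

    def cluster_distance(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), cluster_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so j stays valid
        del clusters[j]
    return clusters


def predict_from_clusters(clusters, known_go, min_support=2):
    """Give unannotated genes the most common GO class of their cluster."""
    predictions = {}
    for cluster in clusters:
        counts = Counter(go for g in cluster for go in known_go.get(g, ()))
        if not counts:
            continue
        go, support = counts.most_common(1)[0]
        if support < min_support:
            continue
        for gene in cluster:
            if gene not in known_go:
                predictions[gene] = go
    return predictions


# Toy data: "query" is the gene of unknown function
names = ["g1", "g2", "query", "g3", "g4"]
close = {("g1", "g2"): 0.1, ("g1", "query"): 0.2,
         ("g2", "query"): 0.2, ("g3", "g4"): 0.1}
dist = {frozenset(p): 0.9 for p in combinations(names, 2)}
dist.update({frozenset(p): d for p, d in close.items()})
known_go = {"g1": {"GO:0016301"}, "g2": {"GO:0016301"},
            "g3": {"GO:0003677"}, "g4": {"GO:0003677"}}

clusters = average_linkage_clusters(names, dist, cutoff=0.5)
predictions = predict_from_clusters(clusters, known_go)
```

The query ends up in the cluster with `g1` and `g2` and inherits their GO class. Recomputing all cluster distances in every round is slow, which hints at why tree-based methods are the heavier option.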

Phylogenetic tree methods • These should outperform sequence homology methods (CAFA 2011?) • Require a set of related genes • Often much heavier calculations • Programs: • SIFTER (http://genome.cshlp.org/content/early/2011/07/22/gr.104687.109)

Prediction with Protein domains • Look at which protein domains the query protein contains (PFAM) • Map the functions that are linked to the domains to your query sequence • PFAM2GO • Programs: InterProScan + PFAM2GO • Drawbacks: • This mapping is the same in plant, mammal, bacteria • Many domains to a specific function

Prediction with Protein domains • Benefits: • Can create an annotation from separate domains • Similar sequences do not have to be in the database • Programs (?): InterProScan (http://www.ebi.ac.uk/InterProScan/) • Drawbacks: • The mapping is the same in plant, mammal, bacteria • Many domains to a specific function
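Domain-based prediction comes down to a table lookup once the domains of the query are known. A minimal sketch, assuming the domain hits have already been produced by a domain search; the `DOMAIN_TO_GO` table is a small illustrative excerpt in the spirit of PFAM2GO, not the real mapping file:

```python
# Illustrative domain -> GO table (assumed excerpt, not the real PFAM2GO file)
DOMAIN_TO_GO = {
    "PF00069": {"GO:0004672", "GO:0005524"},  # protein kinase domain
    "PF00046": {"GO:0003677"},                # homeobox domain
}


def annotate_by_domains(domains, domain_to_go=DOMAIN_TO_GO):
    """Return the union of GO classes linked to the domains of a protein."""
    go_classes = set()
    for domain in domains:
        go_classes |= domain_to_go.get(domain, set())
    return go_classes


# "PF99999" is a made-up ID standing in for a domain with no GO mapping
query_domains = ["PF00069", "PF99999"]
predicted = annotate_by_domains(query_domains)
```

The function takes no organism argument, which mirrors the drawback on the slide: the same mapping is applied to plant, mammal and bacteria.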

Prediction with patterns and motifs • Same principle as before, but we look at sequence patterns and motifs • Map the functions that are linked to the patterns to your query sequence • Programs: • InterProScan • IBM BioDictionary (http://cbcsrv.watson.ibm.com/Tpa.html) • Drawbacks and benefits approx. the same as before
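Motif-based prediction can be sketched as pattern matching followed by the same kind of lookup. The example below translates a simple PROSITE-style pattern into a regular expression and scans a toy sequence. The translator handles only the basic syntax (`x`, `[..]`, `{..}`, repeats), and the `MOTIF_TO_GO` link between the P-loop pattern and a GO class is an assumption made for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(core + repeat)
    return "".join(parts)


# Assumed motif -> GO link, for illustration only
MOTIF_TO_GO = {"[AG]-x(4)-G-K-[ST]": "GO:0000166"}  # P-loop, nucleotide binding


def scan_motifs(sequence, motif_to_go=MOTIF_TO_GO):
    """Return (pattern, 1-based start, matched text, GO class) per hit."""
    hits = []
    for pattern, go in motif_to_go.items():
        for m in re.finditer(prosite_to_regex(pattern), sequence):
            hits.append((pattern, m.start() + 1, m.group(), go))
    return hits


toy_sequence = "MTEYKLVVVGAGGVGKSALTIQ"
hits = scan_motifs(toy_sequence)
```

Short patterns like this match by chance in many proteins, so a hit is weak evidence on its own.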

Prediction with sequence features • Again the same principle as before • We look at seq. features (see pict.) • These are given as input to a classifier algorithm (Support Vector Machine)

Prediction with sequence features • [Picture: sequence features used as classifier input]

Prediction with sequence features • Benefits: • No actual seq. similarity needed • Info collected from vague similarities • Use of classifier => feature weighting • Program: FFPred (http://bioinf.cs.ucl.ac.uk/ffpred/) • Drawbacks: • Calculations probably quite heavy • No use of nearby sequence similarities (domains etc.)
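The feature-based approach can be illustrated with a toy pipeline: turn each sequence into a few numbers and hand them to a classifier. Everything here is a simplifying assumption. The three features are crude stand-ins for real predicted features such as secondary structure, a nearest-centroid classifier replaces the Support Vector Machine to keep the sketch dependency-free, and the training sequences and labels are invented.

```python
import math
from collections import Counter

HYDROPHOBIC = set("AVILMFWC")


def sequence_features(seq):
    """Return (hydrophobic fraction, normalised entropy, log10 length)."""
    counts = Counter(seq)
    n = len(seq)
    hydrophobic = sum(counts[a] for a in HYDROPHOBIC) / n
    # Low entropy is a rough indicator of low-complexity sequence
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return (hydrophobic, entropy / math.log2(20), math.log10(n))


def train_centroids(examples):
    """examples: list of (sequence, label) -> label: mean feature vector."""
    grouped = {}
    for seq, label in examples:
        grouped.setdefault(label, []).append(sequence_features(seq))
    return {
        label: tuple(sum(col) / len(col) for col in zip(*vectors))
        for label, vectors in grouped.items()
    }


def predict(seq, centroids):
    """Return the label whose centroid is closest in feature space."""
    feats = sequence_features(seq)
    return min(centroids, key=lambda label: math.dist(feats, centroids[label]))


# Invented training set with hypothetical class labels
training = [
    ("LLIVAVLLAFGIVLLAVILL", "membrane-like"),
    ("AVLLIVFFALLIVAGLLVIA", "membrane-like"),
    ("KDEEKRSDEKKTDSEERKDE", "soluble-like"),
    ("SEEDKKRTDEQKENSDKRED", "soluble-like"),
]
centroids = train_centroids(training)
label = predict("IVLLAAFLIVGLLAVVILAL", centroids)
```

The query is classified without being aligned to any training sequence, which is the main benefit listed on the slide.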

Our contribution: PANNZER • Use the BLAST result list • Add taxonomic information • Score GO classes using a score that takes the frequency of the GO class in the seq. DB into account • The method is used to predict: • GO classes • Description line

Our contribution: PANNZER • Benefits: • Takes the species taxonomy into account • Improved use of statistics • Not public yet
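The slides do not give the PANNZER score itself. As a generic illustration of why the frequency of a GO class in the sequence database matters, the sketch below uses a hypergeometric tail probability, a common enrichment statistic that is assumed here and is not PANNZER's actual formula. All counts are invented.

```python
from math import comb


def enrichment_pvalue(k, n, K, N):
    """P(X >= k) under the hypergeometric distribution.

    k -- BLAST hits carrying the GO class
    n -- total number of BLAST hits
    K -- sequences in the database carrying the GO class
    N -- total number of sequences in the database
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Invented counts: a database of 1000 sequences and 20 BLAST hits
p_common = enrichment_pvalue(k=12, n=20, K=500, N=1000)  # very common class
p_rare = enrichment_pvalue(k=5, n=20, K=10, N=1000)      # rare class
```

Twelve hits from a class that covers half the database are unsurprising, while five hits from a class with only ten members are strong evidence, even though the raw count is lower.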

Our contribution: No Name Yet • Take PFAM domain predictions, BLAST similarities and taxonomic information • Feed this to feature selection and to a classifier algorithm • …Wait… • The method is used to predict GO classes • Not public + testing is ongoing

Conclusion • These methods are increasingly needed • Some methods exist • Unfortunately no clear evaluation (my opinion) • Remember: these are predictions. No certain info until they are tested in the wet lab…
