Dr Tan Tin Wee Director Bioinformatics Centre

Basic Overview of Bioinformatics Tools and Biocomputing Applications III

More BioComputational Tools • Phylogenetics Analysis • Multiple Sequence Alignment • Profile Searching • Sensitivity and Specificity and Probabilities in the Prediction of Functions

Phylogenetic Analysis • Assumption: evolutionary descent • Divergence • Phylogenetic tree • Rooted and unrooted trees [Figure: example tree with tips labelled Species X, Y, A, B]

Rooted and Unrooted Trees • Rooted: the ancestral state of the evolved organism or gene is known. • Branches split at bifurcation points until they reach terminal branches (tips or leaves). • Unrooted trees represent the branching order but do not indicate the root, i.e. the position of the last common ancestor.

Phylogenetic inference for genes • In its infancy, an inexact science • Computational tools based on general mathematical and statistical principles • Phylogenetic reconstructions may conflict with common sense • Incorrect sequence alignments, inadequate models • Not all sites within sequences evolve at the same rate • Unequal rate effects

Some algorithms • Maximum parsimony • Maximum likelihood • Distance methods • UPGMA • Paralinear (LogDet) distances • Software packages: PAUP (Phylogenetic Analysis Using Parsimony), PHYLIP (Phylogeny Inference Package), MacClade, GAMBIT, MEGA/METREE
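
To make the distance-method bullet concrete, here is a minimal sketch of UPGMA clustering. It is not the implementation used by PHYLIP, MEGA or any other package named above; the function name `upgma`, the nested-tuple tree representation, and the four-taxon distance table are assumptions made up for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA; return (nested-tuple tree, height of the root).

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> distance between taxa a and b
    """
    sizes = {name: 1 for name in names}  # cluster -> number of tips it holds
    d = dict(dist)                       # working copy, extended as clusters merge
    height = 0.0
    while len(sizes) > 1:
        # Join the two clusters that are currently closest.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2.0
        merged = (a, b)
        # Distance from the new cluster to each remaining cluster is the
        # size-weighted mean of the two old distances.
        for c in sizes:
            if c == a or c == b:
                continue
            d[frozenset((merged, c))] = (
                sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    (tree,) = sizes
    return tree, height


# Hypothetical distances between four taxa (illustration only).
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
tree, root_height = upgma(names, dist)
print(tree, root_height)
```

UPGMA always returns a rooted tree because it assumes a constant rate of change along all lineages; when rates are unequal, that assumption is one source of the misleading trees mentioned on the previous slide.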

Limitations • Inspection of sequence alignments • Removal of deviant sequences from the phylogenetic inference • Different genes analysed produce different trees • "Bootstrapping" for estimating statistical significance may still have errors in interpretation
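
The bootstrapping bullet can be illustrated with a short sketch: alignment columns are resampled with replacement, the analysis is repeated on each replicate, and the fraction of replicates that reproduce a grouping is reported as its support. The toy alignment, the function names, and the use of "closest pair of sequences" as a stand-in for full tree building are all assumptions for illustration, not the procedure of any specific package.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[i] for i in picks) for row in alignment]


def closest_pair(alignment):
    """Toy stand-in for tree building: indices of the two most similar rows."""
    pairs = [(i, j)
             for i in range(len(alignment))
             for j in range(i + 1, len(alignment))]

    def mismatches(pair):
        i, j = pair
        return sum(a != b for a, b in zip(alignment[i], alignment[j]))

    return min(pairs, key=mismatches)


def bootstrap_support(alignment, grouping, n_replicates=200, seed=0):
    """Fraction of replicates in which `grouping` gives the same answer
    as it does on the original alignment."""
    rng = random.Random(seed)
    observed = grouping(alignment)
    hits = sum(
        grouping(bootstrap_replicate(alignment, rng)) == observed
        for _ in range(n_replicates)
    )
    return hits / n_replicates


# Hypothetical aligned sequences of equal length (illustration only).
toy_alignment = ["ACGTACGTAC", "ACGTACGTAA", "TCGAACGTCC", "TGGAACTTCC"]
support = bootstrap_support(toy_alignment, closest_pair)
print(f"pair {closest_pair(toy_alignment)} recovered in {support:.0%} of replicates")
```

A high support value only says the grouping is stable under resampling of the same alignment; it does not correct for a wrong alignment or an inadequate model, which is why interpretation errors remain possible.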

Uses • Molecular taxonomy • 16S and 23S rRNA analysis for bacterial classification • 18S rRNA analysis of nematodes, Drosophila • Epidemiological analysis of strain variation, e.g. in infectious pathogens

Multiple Sequence Analysis • Gather a set of sequences of putative similarity or homology • Pairwise comparison of every pair of sequences in the set • Build a "tree" of similarity • Realignment of all sequences based on the "ancestral" sequence, padding with gaps etc. • Used for generating "profiles"
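
The pairwise-comparison step can be sketched as a global alignment score computed for every pair in the set. This is a minimal illustration using a Needleman-Wunsch style recurrence with a flat match/mismatch/gap scheme; the scoring values, the function names, and the three example sequences are assumptions, and real alignment programs use substitution matrices and more elaborate gap penalties.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (score only, no traceback)."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # against the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_scores(seqs):
    """Score every unordered pair of named sequences."""
    return {
        (x, y): global_score(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }


# Hypothetical sequences (illustration only).
example_seqs = {"s1": "ACGTTGCA", "s2": "ACGTGCA", "s3": "TCGTTGGA"}
for pair, score in pairwise_scores(example_seqs).items():
    print(pair, score)
```

The resulting table of pairwise scores is what a similarity "tree" would be built from before the sequences are realigned in the order the tree suggests.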

Use • Detection of conserved and variable regions • Infer gene functions • Variable segments - infer dispensable to function or antigenic variants • Motifs can be used to analyse unknown sequence and infer possible function or relatedness • Motifs as basis for annotation of genome project sequences

Software • CLUSTALW • Profile software based on Hidden Markov Models (HMM) statistical models, eg HMMer, HMMPro, META-MEME, PROBE, BLOCKS

Example • C. elegans genome project • several large gene families of sequence homology - function unknown. • Now classified as putative G-protein coupled receptors (GPCRs). • Have to detect significant similarity between putative Worm GPCRs and experimentally known GPCRs in other species

Process • Select a typical unknown sequence • BLAST search against the nr database • Inspect hits and E-values • Top-scoring hit: mitochondrial L11 ribosomal protein, E = 0.002 (not low enough to be trusted for annotation) • The rest of the top scorers are all nematode-specific unknown sequences • Compare with PSI-BLAST iterative searching at NCBI • Similarity with mammalian GPCRs or with the high-scoring mt rL11 protein?

Further analysis • Gather all nematode-specific sequences • WormPep database of non-redundant sequences • Discard sequences that are abnormally long or short • Multiple sequence alignment using CLUSTALW • Generate a profile of the multiple alignment using HMMer • Use the profile to search the database again

Results • Similarity at a significant level detected with mammalian GPCRs • Find that the L11 protein has a very significant high score, E = 5x10^-49 • Pitfall of PSI-BLAST: significance of a match to the training set during iteration • Finally, the L11 protein may be wrongly annotated, with an annotation not based on experimental results

A. Sensitivity and Specificity of a Fairly Good Test • Total real +ve = 73; total real -ve = 27 • Specificity = 25/(2+25) = 0.93: picked up 25 of the 27 negatives, very specific, low false positives • Sensitivity = 70/(70+3) = 0.96: able to pick up 70 of the total 73 that are known positive, quite sensitive, low false negatives • Gold standards
Experimental test result vs. known gold standard (N = 100):
              Gold +ve   Gold -ve
  Test +ve       70          2
  Test -ve        3         25

B. Increase Sensitivity but Lower Specificity of a Test • Total real +ve = 73; total real -ve = 27 • Specificity = 14/(13+14) = 0.52: picked up 14 of the 27 negatives, not very specific, high false positives • Sensitivity = 72/(72+1) = 0.99: able to pick up 72 of the total 73 that are known positive, super sensitive, low false negatives
Experimental test result vs. known gold standard (N = 100):
              Gold +ve   Gold -ve
  Test +ve       72         13
  Test -ve        1         14

C. Increase Specificity of a Test but Sensitivity May Drop • Total real +ve = 73; total real -ve = 27 • Specificity = 27/(0+27) = 1.0: picked up 27 of the 27 negatives, completely specific; raising the threshold until there are zero false positives means true positives will drop • Sensitivity = 50/(50+23) = 0.68: able to pick up only 50 of the total 73 that are known positive, not very sensitive, high false negatives
Experimental test result vs. known gold standard (N = 100):
              Gold +ve   Gold -ve
  Test +ve       50          0
  Test -ve       23         27
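
The arithmetic in cases A, B and C can be checked with a few lines of code. The counts are taken from the three slides (true positives, false positives, false negatives, true negatives); the function and variable names are assumptions for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of real positives that the test reports as positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of real negatives that the test reports as negative."""
    return tn / (tn + fp)


# Counts from the three cases; each case has 73 real positives and 27 real negatives.
cases = {
    "A": {"tp": 70, "fp": 2, "fn": 3, "tn": 25},
    "B": {"tp": 72, "fp": 13, "fn": 1, "tn": 14},
    "C": {"tp": 50, "fp": 0, "fn": 23, "tn": 27},
}

for label, c in cases.items():
    sens = sensitivity(c["tp"], c["fn"])
    spec = specificity(c["tn"], c["fp"])
    print(f"case {label}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Rounded to two decimals, the values agree with the figures quoted on the slides.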

Trade-off involved • If the threshold of the test is set high, so that all the noise disappears, you may also miss some true positives, get a lot of false negatives, and thus lose sensitivity (Case C) • If the threshold of the test is set low, so that you capture as many of the positives as you can, i.e. high sensitivity, non-specific false positive hits start appearing (Case B)
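
The trade-off can be seen by sweeping a cutoff over scored examples. The scores below are invented for illustration (they are not data from the lecture), and the convention that a score at or above the threshold counts as a positive call is an assumption of this sketch.

```python
def rates_at(threshold, pos_scores, neg_scores):
    """Return (sensitivity, specificity) when scores >= threshold are called positive."""
    true_pos = sum(s >= threshold for s in pos_scores)
    true_neg = sum(s < threshold for s in neg_scores)
    return true_pos / len(pos_scores), true_neg / len(neg_scores)


# Hypothetical test scores for known positives and known negatives.
pos_scores = [0.90, 0.80, 0.75, 0.60, 0.55, 0.40]
neg_scores = [0.50, 0.45, 0.30, 0.20, 0.10]

thresholds = [0.2, 0.5, 0.8]
sweep = [rates_at(t, pos_scores, neg_scores) for t in thresholds]
for t, (sens, spec) in zip(thresholds, sweep):
    print(f"threshold {t:.1f}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Raising the threshold can only keep or lower sensitivity and keep or raise specificity, which is the movement from Case B towards Case C.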

Computational Predictions of Gene Function • Sensitivity and specificity have similar trade-offs. • Cutoff threshold values have to be empirically determined or arbitrarily chosen, depending on the situation
