
Dr Tan Tin Wee Director Bioinformatics Centre

Basic Overview of Bioinformatics Tools and Biocomputing Applications III. More BioComputational Tools: Phylogenetic Analysis, Multiple Sequence Alignment, Profile Searching.


Presentation Transcript


  1. Basic Overview of Bioinformatics Tools and Biocomputing Applications III • Dr Tan Tin Wee, Director, Bioinformatics Centre

  2. More BioComputational Tools • Phylogenetic Analysis • Multiple Sequence Alignment • Profile Searching • Sensitivity, Specificity and Probabilities in the Prediction of Functions

  3. Phylogenetic Analysis • Assumption: evolutionary descent • Divergence • Phylogenetic tree • Rooted and unrooted trees • [Figure: tree diagram with labels Species, X, Y, A, B]

  4. Rooted and Unrooted Trees • Rooted: the ancestral state of the evolved organism or gene is known. • The tree branches at bifurcation points until it reaches the terminal branches (tips or leaves). • Unrooted trees represent the branching order, but do not indicate the root, i.e. the position of the last common ancestor.

  5. Phylogenetic inference for genes • Still in its infancy; an inexact science • Computational tools are based on general mathematical and statistical principles • Phylogenetic reconstructions may conflict with common sense • Incorrect sequence alignments, inadequate models • Sites within sequences evolve at different rates • Unequal rate effects

  6. Some algorithms • Maximum parsimony • Maximum likelihood • Distance methods • UPGMA • Paralinear (LogDet) distances • Software packages: PAUP (Phylogenetic Analysis Using Parsimony), PHYLIP (Phylogeny Inference Package), MacClade, GAMBIT, MEGA/METREE
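To make the distance-based approach concrete, here is a minimal Python sketch of UPGMA clustering. It is an illustration only, not the implementation used by PAUP, PHYLIP or any other package named above. The function name `upgma`, the taxon labels and the toy distance values are all assumptions made for this example.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA; return a nested-tuple tree and the merge heights.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) -> distance between labels a and b.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = dict(dist)                       # working copy of the distances
    heights = []
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        # The new node sits at half the distance between the two clusters.
        heights.append(d[frozenset((a, b))] / 2)
        # Distance from the new cluster to each other cluster is the
        # average weighted by cluster size.
        for c in sizes:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))]
                    + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes[a] + sizes[b]
        del sizes[a], sizes[b]
    (tree,) = sizes
    return tree, heights


# Hypothetical toy distances between four taxa (not real data).
toy = {
    frozenset(("A", "B")): 2,
    frozenset(("A", "C")): 6,
    frozenset(("B", "C")): 6,
    frozenset(("A", "D")): 10,
    frozenset(("B", "D")): 10,
    frozenset(("C", "D")): 10,
}
tree, heights = upgma(["A", "B", "C", "D"], toy)
print(tree, heights)
```

UPGMA assumes a constant rate of change along all lineages, which is one reason the unequal rate effects mentioned on the previous slide can mislead it.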

  7. Limitations • Inspection of sequence alignments • Removal of deviant sequences from the phylogenetic inference • Different genes analysed produce different trees • "Bootstrapping" to estimate statistical significance may still lead to errors in interpretation
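The resampling step behind "bootstrapping" can be sketched in a few lines of Python. This is a simplified illustration: the function name `bootstrap_alignment` and the toy alignment are assumptions, and the tree-building step that would follow each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    length = len(alignment[0])
    # Draw as many column indices as the alignment has columns.
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in cols) for seq in alignment]


# Hypothetical toy alignment (equal-length sequences, not real data).
alignment = ["ACGTAC", "ACGTTC", "ACATAC"]
rng = random.Random(0)  # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

In a full analysis a tree is inferred from every replicate, and the fraction of replicates that recover a given grouping is reported as its bootstrap support.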

  8. Uses • Molecular taxonomy • 16S and 23S rRNA analysis for bacterial classification • 18S rRNA analysis of nematodes, Drosophila • Epidemiological analysis of strain variation, e.g. in infectious pathogens

  9. Multiple Sequence Alignment • Gather a set of sequences of putative similarity or homology • Pairwise comparison of the sequences in the set • Build a "tree" of similarity • Realign all sequences based on the "ancestral" sequence, padding with gaps etc. • Used for generating "profiles"
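The pairwise comparison step can be illustrated with a small Needleman-Wunsch global alignment score in Python. This is a sketch only: the scoring values (match +1, mismatch -1, gap -2), the function name `nw_score` and the toy sequences are assumptions, and CLUSTALW uses its own, more elaborate scoring scheme.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by Needleman-Wunsch dynamic programming."""
    # prev holds the previous row of the DP table; row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Hypothetical toy sequences (not real data).
seqs = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "TTGAC"}
scores = {
    (x, y): nw_score(seqs[x], seqs[y]) for x, y in combinations(seqs, 2)
}
closest = max(scores, key=scores.get)
print(scores, closest)
```

A table of pairwise scores like `scores` is the kind of input from which the "tree" of similarity is built: the most similar pair is joined first.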

  10. Use • Detection of conserved and variable regions • Infer gene functions • Variable segments: inferred to be dispensable to function, or to be antigenic variants • Motifs can be used to analyse an unknown sequence and infer possible function or relatedness • Motifs as a basis for annotation of genome project sequences
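As a rough illustration of how conserved and variable regions can be read off an alignment, the Python sketch below scores each column by the fraction of sequences that share its most common residue. The function name `column_conservation`, the toy alignment and the choice of this particular score are assumptions; real tools use more refined measures.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column."""
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # ignore gap characters
        if not residues:
            scores.append(0.0)
            continue
        top = Counter(residues).most_common(1)[0][1]
        scores.append(top / len(column))
    return scores


# Hypothetical toy protein alignment (not real data).
alignment = ["MKTAYW", "MKSAYW", "MKT-FW", "MKQAYW"]
scores = column_conservation(alignment)
conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(scores, conserved)
```

Columns with a score of 1.0 are fully conserved; runs of such columns are candidate motifs, while low-scoring columns mark variable segments.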

  11. Software • CLUSTALW • Profile software based on hidden Markov models (HMMs), e.g. HMMer, HMMPro, META-MEME, PROBE, BLOCKS

  12. Example • C. elegans genome project • Several large gene families with sequence homology; function unknown. • Now classified as putative G-protein coupled receptors (GPCRs). • Have to detect significant similarity between the putative worm GPCRs and experimentally known GPCRs in other species

  13. Process • Select a typical unknown sequence • BLAST search against the nr database • Inspect hits and E-values • Top-scoring hit: mitochondrial L11 ribosomal protein, E = 0.002 (not low enough to be trusted for annotation) • The rest of the top scorers are all nematode-specific unknown sequences • Compare with PSI-BLAST iterative searching at NCBI • Is the similarity with mammalian GPCRs or with the high-scoring mitochondrial L11 ribosomal protein?

  14. Further analysis • Gather all nematode-specific sequences • WormPep database of non-redundant sequences • Discard abnormally long or short sequences • Multiple sequence alignment using CLUSTALW • Generate a profile of the multiple alignment using HMMer • Use the profile to search the database again
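The step of discarding abnormally long or short sequences can be sketched as a simple length filter in Python. The slides do not say what counts as abnormal, so the rule used here (keep sequences within 50% of the median length), the function name `filter_by_length` and the toy entries are all assumptions.

```python
from statistics import median


def filter_by_length(seqs, tolerance=0.5):
    """Keep sequences whose length is within +/- tolerance of the median."""
    med = median(len(s) for s in seqs.values())
    low, high = med * (1 - tolerance), med * (1 + tolerance)
    return {name: s for name, s in seqs.items() if low <= len(s) <= high}


# Hypothetical placeholder sequences; only their lengths matter here.
seqs = {
    "w1": "M" * 300,
    "w2": "M" * 320,
    "w3": "M" * 310,
    "w4": "M" * 40,   # abnormally short
    "w5": "M" * 900,  # abnormally long
}
kept = filter_by_length(seqs)
print(sorted(kept))
```

Filtering before alignment matters because fragments and fused sequences force long gaps into the multiple alignment and weaken the profile built from it.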

  15. Results • Similarity with mammalian GPCRs detected at a significant level • The L11 protein is found to have a very significant score, E = 5 x 10^-49 • Pitfalls of PSI-BLAST: significance of a match to the training set during iteration. • Finally, the L11 protein may be wrongly annotated, with the annotation not based on experimental results

  16. A. Sensitivity and Specificity of a Fairly Good Test • Total real +ve = 73; total real -ve = 27 • Specificity = 25/(2+25) = 0.93: picked up 25 of the 27 negatives; very specific, low false positives • Sensitivity = 70/(70+3) = 0.96: able to pick up 70 of the 73 known positives; quite sensitive, low false negatives • Experimental test result vs. known gold standard (N = 100): test +ve and gold +ve = 70; test +ve and gold -ve = 2; test -ve and gold +ve = 3; test -ve and gold -ve = 25

  17. B. Increase Sensitivity but Lower Specificity of a Test • Total real +ve = 73; total real -ve = 27 • Specificity = 14/(13+14) = 0.52: picked up 14 of the 27 negatives; not very specific, high false positives • Sensitivity = 72/(72+1) = 0.99: able to pick up 72 of the 73 known positives; super sensitive, low false negatives • Experimental test result vs. known gold standard (N = 100): test +ve and gold +ve = 72; test +ve and gold -ve = 13; test -ve and gold +ve = 1; test -ve and gold -ve = 14

  18. C. Increase Specificity of a Test but Sensitivity may drop • Total real +ve = 73; total real -ve = 27 • Specificity = 27/(0+27) = 1.0: picked up 27 of the 27 negatives; completely specific. Raising the threshold until there are zero false positives makes the true positives drop • Sensitivity = 50/(50+23) = 0.68: able to pick up only 50 of the 73 known positives; not very sensitive, high false negatives • Experimental test result vs. known gold standard (N = 100): test +ve and gold +ve = 50; test +ve and gold -ve = 0; test -ve and gold +ve = 23; test -ve and gold -ve = 27
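The arithmetic for cases A, B and C can be checked with a short Python sketch. The counts are taken from the three slides above; the function and variable names are chosen for this example.

```python
def sensitivity(tp, fn):
    """Fraction of real positives that the test calls positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of real negatives that the test calls negative."""
    return tn / (tn + fp)


# True/false positives and negatives for the three cases (N = 100 each).
cases = {
    "A": {"tp": 70, "fn": 3, "fp": 2, "tn": 25},
    "B": {"tp": 72, "fn": 1, "fp": 13, "tn": 14},
    "C": {"tp": 50, "fn": 23, "fp": 0, "tn": 27},
}

results = {
    name: (
        round(sensitivity(c["tp"], c["fn"]), 2),
        round(specificity(c["tn"], c["fp"]), 2),
    )
    for name, c in cases.items()
}
for name, (sens, spec) in results.items():
    print(f"Case {name}: sensitivity={sens}, specificity={spec}")
```

Rounded to two decimals this gives 0.96 and 0.93 for case A, 0.99 and 0.52 for case B, and 0.68 and 1.0 for case C, matching the figures on the slides.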

  19. Trade-off involved • If the test threshold is set high, so that all the noise disappears, you may also miss some true positives and get a lot of false negatives, so the test is not very sensitive (case C) • If the test threshold is set low, so that you capture as many of the positives as you can, i.e. high sensitivity, non-specific false positive hits start appearing (case B)
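The trade-off can be seen by sweeping a cutoff over a set of scored hits. The Python sketch below uses made-up scores and labels purely for illustration; they are not data from the slides, and the function name `rates_at` is likewise an assumption.

```python
def rates_at(threshold, scored):
    """Return (sensitivity, specificity) when score >= threshold means +ve."""
    tp = sum(1 for s, pos in scored if pos and s >= threshold)
    fn = sum(1 for s, pos in scored if pos and s < threshold)
    tn = sum(1 for s, pos in scored if not pos and s < threshold)
    fp = sum(1 for s, pos in scored if not pos and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical (score, is_real_positive) pairs, invented for this example.
scored = [
    (95, True), (80, True), (70, True), (55, True), (40, True),
    (60, False), (45, False), (30, False), (20, False),
]

for threshold in (25, 50, 65):
    sens, spec = rates_at(threshold, scored)
    print(f"threshold={threshold}: sensitivity={sens}, specificity={spec}")
```

With these toy values the low cutoff catches every positive but lets most negatives through, and the high cutoff removes every false positive at the cost of missed positives, mirroring cases B and C.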

  20. Computational Predictions of Gene Function • Sensitivity and specificity have similar trade-offs. • Cutoff threshold values have to be empirically determined or arbitrarily chosen, depending on the situation.
