Robots and Automatic Genome Annotation

Robots and Automatic Genome Annotation • Ross D. King • Department of Computer Science • University of Wales, Aberystwyth

Talk Plan • Data Mining based gene function prediction • The Robot Scientist • Automating annotation and experimentation

Data Mining Prediction • We have developed a method for predicting the functional class of gene products based on data mining. • The idea is to learn a reliable predictive function on the examples of genes with products of known function. • Then apply this function to genes where the functional class is unknown. • Applied to: E. coli, M. tuberculosis, S. cerevisiae, A. thaliana. • We call this approach: Data Mining Prediction (DMP).

Classification schemes (MIPS/GO) Hierarchy of classes
1,0,0,0 "METABOLISM"
1,1,0,0 "amino acid metabolism"
1,1,1,0 "amino acid biosynthesis"
1,1,4,0 "regulation of amino acid metabolism"
1,1,7,0 "amino acid transport"
1,1,10,0 "amino acid degradation (catabolism)"
1,1,99,0 "other amino acid metabolism activities"
1,2,0,0 "nitrogen and sulfur metabolism"
1,3,0,0 "nucleotide metabolism"
1,4,0,0 "phosphate metabolism"
1,5,0,0 "C-compound and carbohydrate metabolism"
1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
1,20,0,0 "secondary metabolism"
... and ORFs may have multiple functions too!
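The hierarchical class codes on this slide can be handled mechanically. The following is a minimal Python sketch, assuming each code is four comma-separated integers in which trailing zeros only pad unused levels; the function names are illustrative and do not come from the talk.

```python
def parse_class(code):
    """Turn '1,1,7,0' into the tuple of its non-zero levels: (1, 1, 7)."""
    levels = [int(part) for part in code.split(",")]
    # Trailing zeros only pad the code to four levels, so drop them.
    while levels and levels[-1] == 0:
        levels.pop()
    return tuple(levels)


def ancestors(code):
    """Return the class and all of its parents, most general first."""
    levels = parse_class(code)
    return [levels[:depth] for depth in range(1, len(levels) + 1)]


def labels_at_level(orf_classes, level):
    """Project an ORF's (possibly multiple) classes onto one hierarchy level."""
    return {parse_class(c)[:level] for c in orf_classes
            if len(parse_class(c)) >= level}


# An ORF may carry several classes at once (multi-label).
orf_classes = ["1,1,7,0", "1,5,0,0"]
print(ancestors("1,1,7,0"))                    # [(1,), (1, 1), (1, 1, 7)]
print(sorted(labels_at_level(orf_classes, 2)))  # [(1, 1), (1, 5)]
```

Projecting onto one level at a time is one simple way to train a separate predictor per level of the hierarchy while still allowing several labels per ORF.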

Sequence Data
field               description                                     type
aa_rat_X            % of amino acid X in the protein                real
seq_len             length of the protein sequence                  int
aa_rat_pair_X_Y     % of the amino acids X and Y consecutively      real
mol_wt              molecular weight of the protein                 int
theo_pI             theoretical pI (isoelectric point)              real
atomic_comp_X       atomic composition of X (C,H,N,O,S)             real
aliphatic_index     aliphatic index                                 real
hydro               grand average of hydropathy                     real
strand              the DNA strand                                  'w' or 'c'
position            the number of exons (no. of start positions)    int
cai                 codon adaptation index                          real
motifs              number of PROSITE motifs                        int
tmSpans             number of transmembrane spans                   int
chromosome          chromosome number                               1..16,mit
478 attributes in total
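A few of these attributes can be computed directly from the protein sequence. The sketch below is a hedged illustration only: it assumes that aa_rat_X is the percentage of residues equal to X and that aa_rat_pair_X_Y is the percentage of adjacent residue pairs equal to XY. The talk does not give the exact definitions, and the function name is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_attributes(seq):
    """Compute seq_len, aa_rat_X and aa_rat_pair_X_Y for one protein sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    attrs = {"seq_len": n}

    # Percentage of each single amino acid (assumed definition of aa_rat_X).
    counts = Counter(seq)
    for aa in AMINO_ACIDS:
        attrs[f"aa_rat_{aa}"] = 100.0 * counts[aa] / n

    # Percentage of each adjacent pair (assumed definition of aa_rat_pair_X_Y).
    # Only pairs that actually occur are stored; absent pairs are implicitly 0.
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    for pair, count in pair_counts.items():
        attrs[f"aa_rat_pair_{pair[0]}_{pair[1]}"] = 100.0 * count / (n - 1)
    return attrs


attrs = sequence_attributes("mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdk")
print(attrs["seq_len"], round(attrs["aa_rat_K"], 1))
```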

Homology data
YAL001C: mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdk....
Sequence database: NRDB. Search: PSI-BLAST.
gene      organism           score
tfc       baker's yeast      0.0
sfc3      fission yeast      1.0e-18
wsv442    white spot virus   2.1
cg9463    fruit fly          2.9
f1l3      Arabidopsis        3.0
sfc3: keyword(membrane) length(358) dbref(prosite) dbref(embl)
We look up the associated information from SwissProt

Predicted Secondary Structure Data
mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdkkvk...
cbbbbccaaaaaaaaaaaacccccbbbbaaaaaacccbbccccccb...
We record length and relative positions of the secondary structure elements. This is relational data.
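Recording the length and relative position of each element amounts to run-length encoding the predicted structure string. The following is a minimal sketch, assuming the letters a, b and c stand for helix, strand and coil (the slide does not say so) and using a hypothetical tuple layout for the resulting relational facts.

```python
from itertools import groupby


def structure_elements(structure):
    """Run-length encode a structure string into (kind, start, length) tuples."""
    elements = []
    start = 0
    for kind, run in groupby(structure):
        length = sum(1 for _ in run)
        elements.append((kind, start, length))
        start += length
    return elements


def relational_facts(orf, structure):
    """One (orf, index, kind, length, relative_start) fact per element."""
    total = len(structure)
    return [(orf, index, kind, length, start / total)
            for index, (kind, start, length)
            in enumerate(structure_elements(structure), start=1)]


for fact in relational_facts("YAL001C", "cbbbbccaaaaaaaaaaaaccccc"):
    print(fact)
```

Each ORF yields a variable number of such facts, which is why this data is relational rather than a fixed-width attribute vector.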

Expression Data • Microarray experiments to measure expression changes in yeast under a variety of conditions, including cell cycle, heat shock, diauxic shift. • Short time series data, numerical-valued
ORF        a0      a7      a14     a21
YBR166C    0.33   -0.17    0.04   -0.07
YOR357C   -0.64   -0.38   -0.32   -0.29
YLR292C   -0.23    0.19   -0.36    0.14
YGL112C   -0.69   -0.89   -0.74   -0.56
...
Spellman et al (1998), Roth et al (1998), DeRisi et al (1997), Eisen et al (1998), Gasch et al (2000, 2001), Chu et al (1998)

Phenotype Data • Data from knockout gene growth experiments • Many missing data • Data taken from 3 sources (TRIPLES, MIPS, EUROFAN)
Columns: deleted ORF; rows: growth medium
growth medium       YAL001C  YAL019W  YAL021C  YAL029C
calcofluor white    w        n        n        n
sorbitol            n        s        n        w
benomyl             n        w        n        w
H2O2                w        w        n        r
...
s = sensitive (less growth) w = wild-type (no observable effect) r = resistant (more growth) n = no data

What are the Machine Learning Issues? • Large volume of data • Missing data • Accurate results required • Intelligible results required • Class hierarchy • Multiple labels • Relational data

Data Mining Prediction (DMP) [Flow diagram. Labels: Entire database; Test data (1/3); Data for rule creation (2/3); Validation data (1/3); Training data (2/3); PolyFARM; C4.5; Rule generation; All rules; Select best rules; Best rules; Measure rule accuracy; Results]
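The data flow named in this diagram can be sketched as follows. This is a simplified illustration, not the DMP implementation: a trivial "attribute == value -> class" rule generator stands in for PolyFARM and C4.5, and the thresholds, function names and toy database are all assumptions.

```python
import random


def split(items, numerator, denominator, rng):
    """Shuffle a copy of items and cut it at numerator/denominator of its length."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = len(shuffled) * numerator // denominator
    return shuffled[:cut], shuffled[cut:]


def generate_rules(training):
    """Stand-in rule learner: every observed (attribute, value, class) triple."""
    return {(name, value, label)
            for attributes, label in training
            for name, value in attributes.items()}


def rule_accuracy(rule, data):
    """Return (accuracy, coverage); accuracy is None when the rule covers nothing."""
    name, value, label = rule
    covered = [lab for attributes, lab in data if attributes.get(name) == value]
    if not covered:
        return None, 0
    return sum(lab == label for lab in covered) / len(covered), len(covered)


def dmp(database, min_accuracy=0.7, min_coverage=2, seed=0):
    rng = random.Random(seed)
    # 2/3 of the database is used for rule creation, 1/3 is held out as test data.
    rule_data, test = split(database, 2, 3, rng)
    # The rule-creation data is split again: 2/3 training, 1/3 validation.
    training, validation = split(rule_data, 2, 3, rng)
    all_rules = generate_rules(training)
    # Keep only the rules that also do well on the validation data.
    best_rules = []
    for rule in sorted(all_rules):
        accuracy, coverage = rule_accuracy(rule, validation)
        if accuracy is not None and coverage >= min_coverage and accuracy >= min_accuracy:
            best_rules.append(rule)
    # Report accuracy of the selected rules on the untouched test data.
    return {rule: rule_accuracy(rule, test) for rule in best_rules}


# Toy database: (attributes, functional class) pairs.
database = [({"tm_spans": "many" if i % 2 else "none"},
             "transport" if i % 2 else "metabolism") for i in range(60)]
for rule, (accuracy, coverage) in dmp(database).items():
    print(rule, accuracy, coverage)
```

Selecting rules on validation data and measuring them only once on the test data keeps the reported accuracy from being inflated by the selection step.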

Application to Bacterial Genomes • Successful for both M. tuberculosis and E. coli. • Of the ORFs with no assigned function >40% were predicted to have a function at one or more levels of the class hierarchy. • It was found that many of the predictive rules were more general than possible using sequence homology. References King et al. (2000) KDD 2000 King et al. (2000) Yeast (Comparative and Functional Genomics) King et al. (2001) Bioinformatics

Summary Results (Bacteria) • Using voting (2 or more rules agree on a prediction) • Level 2: 128 ORFs predicted - 87.5% accuracy • Level 3: 23 ORFs predicted - 91.3% accuracy • All predictions • Level 2: 335 ORFs predicted - 64.5% accuracy • Level 3: 204 ORFs predicted - 44.6% accuracy
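The voting scheme mentioned on this slide can be illustrated with a few lines of Python. This is a sketch under the assumption that each rule that fires for an ORF contributes one predicted class and that a class is accepted once at least two rules agree; the function name is hypothetical.

```python
from collections import Counter


def vote(predictions, min_votes=2):
    """Return the classes predicted by at least min_votes rules for one ORF."""
    counts = Counter(predictions)
    return sorted(cls for cls, votes in counts.items() if votes >= min_votes)


# Three rules fired for this ORF; two of them agree.
print(vote(["1,1,0,0", "1,1,0,0", "1,5,0,0"]))  # ['1,1,0,0']
```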

Example Rule (level 2 E. coli) If the ORF is not predicted to have a β-strand of length ≥ 3 and a homologous protein from class Chytridiomycetes was found Then its functional class is “Cell processes, Transport/binding proteins” 12/13 (86%) correct on Test Set - probability of this result occurring by chance is estimated at 4×10^-7. 24 ORFs of unknown function are predicted by the rule. 16 ORFs now with putative or confirmed function - 93.8% accurate predictions

Experimental Confirmation • The original bacterial ORF predictions were made over three years ago. • In the intervening time many more ORFs have been sequenced, making traditional homology-based prediction methods more accurate and sensitive, and the functions of some ORFs have been determined by wet biology. • The E. coli genome has recently been re-annotated by Monica Riley’s group.

“Wet” Biology Confirmation • A number of predictions have been confirmed or falsified by new “wet” experimental data. • This new data is biased towards hard classes. Despite this the results are still good: • Level 2: 23 predictions - 47.8% accuracy • Level 3: 23 predictions - 43.4% accuracy This is very much better than random as there are many classes.

Confirmation of “Wet” Predictions

Results (Yeast) • Many rules from each data type • Rules at each level of hierarchy • Some classes are much easier to predict than others (for example "protein synthesis" at 71-93%, "energy" at 20-47%) • Good levels of accuracy on held-out test data • Many predictions for ORFs of unknown function (some function at some level is predicted for 96% of the ORFs of unknown function) • Some rules explainable by biology -> scientific knowledge discovery Clare & King (2003) Bioinformatics Suppl. 2, 42-49

Accuracy Table

Extension to Arabidopsis Genome • Collaborative project with the Institute of Grassland and Environmental Research and the University of Nottingham. • Large increase in data: 6,000 -> 25,000 ORFs. Large amount of micro-array data from the Nottingham Arabidopsis Stock Centre. • 250 million Prolog facts, 200,000 attributes, file sizes almost 2 GB • 7,964 gene function predictions with an expected accuracy >70%, 2,974 with an expected accuracy >90%. • We are currently growing 14 knockout varieties of Arabidopsis to test a sample of these predictions

Availability All predictions available at http://www.genepredictions.org All rules and data available at http://www.aber.ac.uk/compsci/Research/bio/dss/

The Robot Scientist

The Robot Scientist Concept The robot scientist project aims to develop a computer system that is capable of originating its own experiments, physically doing them, interpreting the results, and then repeating the cycle. [Cycle diagram. Labels: Background Knowledge; Analysis; Machine Learning; Consistent Hypothesis; Experiment(s) selection; Experiment(s); Robot; Results; Final Theory]

Motivation: Technological • In many areas of science our ability to generate data is outstripping our ability to analyse the data. • One scientific area where this is true is functional genomics, where data is now being generated on an industrial scale. • The analysis of scientific data needs to become as industrialised as its generation.

The Application Domain • Functional genomics • In yeast (S. cerevisiae) ~30% of the 6,000 genes still have no known function. • EUROFAN 2 has knocked out each of the 6,000 genes in mutant strains. • Task: to determine the “function” of the gene by auxotrophic growth experiments comparing mutants and wild type.

Logical Cell Model • We have built a logical model of the known metabolic pathways (coded in Prolog) - taken from KEGG and other bioinformatic sources. This is essentially a directed graph: with metabolites as nodes and enzymes as arcs. • If a path can be found from cell inputs (metabolites in the growth medium) to all the cell outputs (essential compounds), then the cell can grow.
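The growth criterion on this slide is a reachability test on the metabolic graph. The talk's model is written in Prolog; the sketch below uses Python purely for illustration, and it rests on several assumptions: the pathway is a heavily simplified toy version of aromatic amino-acid synthesis, the gene labels g1 to g6 are placeholders, and glycerate-2-phosphate is treated as available to the cell.

```python
def producible(medium, reactions, knocked_out=()):
    """Return every metabolite reachable from the growth medium.

    reactions is an iterable of (gene, substrates, products) arcs.
    Arcs whose gene is knocked out are ignored.
    """
    known = set(medium)
    changed = True
    while changed:  # forward-chain until no reaction adds anything new
        changed = False
        for gene, substrates, products in reactions:
            if gene in knocked_out:
                continue
            if set(substrates) <= known and not set(products) <= known:
                known |= set(products)
                changed = True
    return known


def can_grow(medium, reactions, essentials, knocked_out=()):
    """The cell grows if every essential compound can be produced."""
    return set(essentials) <= producible(medium, reactions, knocked_out)


# Toy pathway; intermediates are collapsed and gene labels are placeholders.
REACTIONS = [
    ("g1", ["glycerate-2-phosphate"], ["chorismate"]),
    ("g2", ["chorismate"], ["prephenate"]),
    ("g3", ["prephenate"], ["phenylalanine"]),
    ("g4", ["prephenate"], ["tyrosine"]),
    ("g5", ["chorismate"], ["anthranilate"]),
    ("g6", ["anthranilate"], ["tryptophan"]),
]
ESSENTIALS = ["phenylalanine", "tyrosine", "tryptophan"]
BASE_MEDIUM = ["glycerate-2-phosphate"]

print(can_grow(BASE_MEDIUM, REACTIONS, ESSENTIALS))                           # True
print(can_grow(BASE_MEDIUM, REACTIONS, ESSENTIALS, knocked_out={"g2"}))       # False
print(can_grow(BASE_MEDIUM + ["prephenate"], REACTIONS, ESSENTIALS, {"g2"}))  # True
```

The last call shows the auxotrophy pattern the experiments rely on: the g2 knockout fails on the base medium but recovers when the product of the missing reaction is supplied.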

AAA Model System • We started using the aromatic amino-acid (AAA) pathway in yeast as a model system to prove the principle of the Robot Scientist. • 9 metabolites can be used off the shelf • 15 knockout mutants from EUROFAN • The mutant can grow iff all three aromatic amino-acids can be synthesised (tyrosine, phenylalanine, tryptophan). Based on a pathway from glycerate-2-phosphate.

Phenylalanine, Tyrosine, and Tryptophan Pathways for S. cerevisiae [Pathway diagram: metabolites, labelled with KEGG compound IDs, are the nodes and yeast ORFs label the arcs. Metabolites shown: glycerate-2-phosphate, phosphoenolpyruvate, D-erythrose-4-phosphate, 3-deoxy-D-arabino-heptulosonate-7-phosphate, 3-dehydroquinate, 3-dehydroshikimate, shikimate, shikimate-3-phosphate, 5-O-(1-carboxyvinyl)-3-phosphoshikimate, chorismate, prephenate, phenylpyruvate, p-hydroxyphenylpyruvate, anthranilate, N-(5'-phospho-D-ribosyl)anthranilate, 1-(2-carboxyphenylamino)-1'-deoxy-D-ribulose-5'-phosphate, (3-indolyl)-glycerol phosphate, indole, and the end products TYROSINE, PHENYLALANINE and TRYPTOPHAN. ORF labels: YGR254W, YHR174W, YMR323W, YDR354W, YBR249C, YDR035W, YDR127W, YER090W, YKL211C, YGL148W, YDR007W, YPR060C, YBR166C, YNL316C, YGL026C, YHR137W, YGL202W. Metabolites are imported from the growth medium.]

Experimental Methodology • Experiments consist of making particular growth media and testing if the mutants can grow (add metabolites to a basic defined medium). • A mutant is auxotrophic if it cannot grow on a defined medium that the wild type can grow on. • By observing the pattern of chemicals that recover growth, the function of the knocked-out gene can be inferred.

Inferring Hypotheses • In the philosophy of science it has often been argued that only humans can make the “leaps of imagination” necessary to form hypotheses. • We use Abductive Logic Programming to infer missing arcs/labels in our metabolic graph. With these inferred arcs/labels we can explain (deductively) all the experimental results. Reiser et al. (2001) ETAI 5, 233-244.

The Form of the Hypotheses • The form of the hypotheses we can infer is currently quite simple. Each hypothesis binds a particular gene to an enzyme that catalyses the reaction. • A correct hypothesis would be that: YPR060C codes for the enzyme for the reaction chorismate → prephenate. • An incorrect hypothesis would be that: it codes for the enzyme for the reaction chorismate → anthranilate. • We have also demonstrated how more complex abductive hypotheses could be formed.

A Discriminating Experiment • Hypothesis 1: YPR060C codes for the enzyme for the reaction: chorismate → prephenate. • Hypothesis 2: YPR060C codes for the enzyme for the reaction: chorismate → anthranilate. • These can be distinguished by growing the YPR060C knockout on prephenate or anthranilate. • Note that these two experiments will have differing monetary cost.
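Why growth on prephenate or anthranilate separates the two hypotheses can be checked with a small simulation. This is a sketch under stated assumptions: the pathway is a toy five-reaction model starting at chorismate, each hypothesis is represented simply by the name of the reaction the knocked-out gene is supposed to encode, and all names are illustrative.

```python
REACTIONS = {
    "chorismate->prephenate": ("chorismate", "prephenate"),
    "chorismate->anthranilate": ("chorismate", "anthranilate"),
    "prephenate->phenylalanine": ("prephenate", "phenylalanine"),
    "prephenate->tyrosine": ("prephenate", "tyrosine"),
    "anthranilate->tryptophan": ("anthranilate", "tryptophan"),
}
ESSENTIALS = {"phenylalanine", "tyrosine", "tryptophan"}
BASE_MEDIUM = {"chorismate"}


def knockout_grows(hypothesis, supplement):
    """Predict knockout growth if the gene encodes the reaction named by hypothesis."""
    known = BASE_MEDIUM | {supplement}
    # Under this hypothesis the knockout has lost exactly one reaction.
    active = [arc for name, arc in REACTIONS.items() if name != hypothesis]
    changed = True
    while changed:
        changed = False
        for substrate, product in active:
            if substrate in known and product not in known:
                known.add(product)
                changed = True
    return ESSENTIALS <= known


def discriminating(hypotheses, supplements):
    """Return the supplements for which the hypotheses predict different outcomes."""
    useful = []
    for supplement in supplements:
        predictions = {h: knockout_grows(h, supplement) for h in hypotheses}
        if len(set(predictions.values())) > 1:
            useful.append((supplement, predictions))
    return useful


HYPOTHESES = ["chorismate->prephenate", "chorismate->anthranilate"]
for supplement, predictions in discriminating(
        HYPOTHESES, ["prephenate", "anthranilate", "chorismate"]):
    print(supplement, predictions)
```

In this toy model both prephenate and anthranilate give opposite growth predictions under the two hypotheses, whereas adding more chorismate predicts no growth under either and so carries no information.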

Phenylalanine, Tyrosine, and Tryptophan Pathways for S. cerevisiae [Pathway diagram, repeated from the earlier slide of the same title.]

Inferring Experiments Given a set of hypotheses, we wish to infer an experiment that will efficiently discriminate between them. Assume: • Every experiment has an associated cost. • Each hypothesis has a probability of being correct. The task: • To choose a series of experiments which minimises the expected cost of eliminating all but one hypothesis.
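The task stated on this slide can be written as a recursion over the set of hypotheses still alive. The sketch below solves it exactly by exhaustive search, which is only feasible for small toy problems and is not claimed to be the method used in the talk. It assumes binary experiment outcomes, noise-free results, and experiments that may be reused at the same cost; the priors, costs and names are invented for illustration.

```python
from functools import lru_cache


def best_first_experiment(priors, experiments):
    """Return (expected_cost, experiment) for the cheapest-in-expectation plan.

    priors:      dict hypothesis -> prior probability
    experiments: dict name -> (cost, frozenset of hypotheses predicting growth)
    """
    @lru_cache(maxsize=None)
    def expected_cost(live):
        if len(live) <= 1:
            return 0.0, None  # nothing left to discriminate
        total = sum(priors[h] for h in live)
        best = (float("inf"), None)
        for name, (cost, positive) in experiments.items():
            pos, neg = live & positive, live - positive
            if not pos or not neg:
                continue  # this experiment cannot split the live hypotheses
            p_pos = sum(priors[h] for h in pos) / total
            cost_here = (cost
                         + p_pos * expected_cost(pos)[0]
                         + (1.0 - p_pos) * expected_cost(neg)[0])
            if cost_here < best[0]:
                best = (cost_here, name)
        return best

    return expected_cost(frozenset(priors))


PRIORS = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
EXPERIMENTS = {
    "cheap": (1.0, frozenset({"h3"})),
    "mid": (1.5, frozenset({"h1"})),
    "dear": (2.0, frozenset({"h2"})),
}
cost, first = best_first_experiment(PRIORS, EXPERIMENTS)
print(first, round(cost, 2))  # mid 2.0
```

In this toy case the cheapest experiment is not the best first move: starting with "cheap" has an expected total cost of about 2.2, because it usually leaves the two most probable hypotheses undistinguished, while starting with "mid" costs about 2.0.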

Comparison of different experimental strategies • ASE - Expected cost minimization. • Naïve - Choose cheapest experiment. • Random - Randomly choose experiments. The cost of a series of experiments is a function of the time taken and money spent. “Time is Money”.

The Robot Biomek 2000

Closing the Loop • We have physically implemented all aspects of the Robot Scientist system. • To the best of our knowledge this is the first active learning system that both explicitly forms hypotheses and experiments, and physically does real experiments.

Accuracy v Time At the end of the 5th iteration: ASE 80.1%, Naïve 74.0%, Random 72.2%. ASE was significantly more accurate than either Naïve (p < 0.05) or Random (p < 0.07) using a paired t-test.

Accuracy v Money Given a spend of ≤£102.26, ASE 79.5%, Naïve 73.9%, Random 57.4%. ASE was significantly more accurate than either Naïve (p < 0.05) or Random (p < 0.001).

Time and Money • “Cost” is a positive function of time & money. ASE dominates for both, therefore ASE dominates for any reasonable cost function. • For example, to achieve an accuracy of ~70%, ASE requires fewer trial iterations than Random at a hundredth of the price, and almost half the number of iterations of Naïve at a third of the price. King et al. (2004) Nature 427, 247-252.

Human Comparisons • We were interested to compare the performance of the Robot Scientist with that of humans. • We adapted the simulator to allow humans to choose experiments and interpret the results of cycles of experimentation. • Compared nine graduate computer scientists and biologists. • No significant difference between the best humans and the Robot.

Robotic Annotation

New Biological Knowledge • So far with the Robot Scientist we have only shown that we can automatically rediscover known biological knowledge. • We wish to extend this result to the discovery of new biological knowledge. • To do this we need to combine the robot scientist with conventional genome annotation bioinformatics, and DMP.

Robotic Annotation • One way of thinking about genome annotation is as a hypothesis formation process. • Hypothesis formation is perhaps the hardest part of automating science. • Our idea is to combine bioinformatic annotation methods with the Robot Scientist. • The bioinformatic methods will generate the hypotheses which the robot scientist will experimentally test.

Genome Scale Model of Yeast Metabolism • We have extended our model of aromatic amino acid metabolism to cover most of what is known about yeast metabolism. • Includes 1,166 ORFs (940 known, 226 inferred) • Growth if path from growth medium to defined end-points. • 83% accuracy (based on 914 strain/medium predictions)

The Model is Incomplete • It is not possible to find a path from the inputs (growth medium) to all the end-point metabolites using only reactions encoded by known genes. • This suggests automated strategies for determining the identity of the missing genes - new biological knowledge. • One strategy is to take the EC enzyme class of a missing reaction, identify genes that code for this EC class in other organisms, then find homologous genes in yeast. • The predictions can be tested automatically by robot.

Confirmation of DMP Yeast Predictions • The yeast gene YBR147W is currently of “unknown” function. • It is predicted to have a function in “metabolism” by 2 DMP rules with expected accuracies of >80%. • It is predicted to have a function in “amino-acid metabolism” by two rules with expected accuracies of 50% and 60% respectively. • Using our robot scientist auxotrophic methodology we have recovered growth of the knockout with: aspartic acid, tyrosine, leucine, valine, phenylalanine, cystine, arginine.

Conclusions • Machine learning can be used to accurately predict gene function. • Simple forms of scientific reasoning and experimentation can be fully automated. • To develop robotic systems capable of generating new biological knowledge will require a synthesis of traditional genome annotation techniques, machine learning, and a Robot Scientist-like methodology.

The Three Objects of the Intellect • The True • The Beautiful • The Beneficial
