
Two Applications of Information Extraction to Biological Science Journal Articles: Enzyme Interactions and Protein Structures Kevin Humphreys, George Demetriou, & Robert Gaizauskas Department of Computer Science, University of Sheffield (Pacific Symposium on Biocomputing, Vol 5, Pages 502-513, 2000)

Abstract • The application of information extraction (IE) technology to scientific journal papers in the area of molecular biology. • Two bioinformatics applications: EMPathIE, concerned with enzyme and metabolic pathways; and PASTA, concerned with protein structure.

1. Introduction • The prototypical IE tasks are those defined by the U.S. DARPA Message Understanding Conferences (MUCs), requiring the filling of a complex template from newswire texts on subjects such as joint venture announcements, management succession events, or rocket launchings. • This paper describes the use of the technology developed through the MUC evaluations in two bioinformatics applications.

2. IE Technology • MUC-7 specified five separate component tasks: • Named Entity recognition: the identification of organizations, persons, locations, dates and monetary amounts. • Coreference resolution: the identification of expressions that refer to the same object, set or activity. • Template Element filling: the filling of small-scale templates for specified classes of entity in the texts. • Template Relation filling: the filling of a two-slot template representing a binary relation, with pointers to the template elements standing in the relation. • Scenario Template filling: the detection and construction of relations between template elements as participants in a particular type of event, or scenario.

3. Two Bioinformatics Applications of IE (1/2) • EMPathIE • Enzyme and Metabolic Pathways Information Extraction. • Aimed to extract details of enzyme reactions from articles in the journals Biochimica et Biophysica Acta and FEMS Microbiology Letters. • Typically, journal articles in this domain describe details of a single enzyme reaction, often with little indication of related reactions and which pathways the reaction may be part of. => Combine details from several articles for pathway identification.

3. Two Bioinformatics Applications of IE (2/2) • PASTA • Protein Active Site Template Acquisition. • Aimed to extract information concerning the roles of amino acids in protein molecules, and to create a database of protein active sites from both scientific journal abstracts and full articles. • New protein structures are being reported at very high rates, and the number of co-ordinate sets (currently about 9000) in the Protein Data Bank (PDB) can be expected to increase ten-fold in the next five years. • Computational methods would be very useful to biologists in comparison/classification work and to those engaged in modeling studies.

3.1 EMPathIE (1/2) • The EMP database contains over 20,000 records of enzyme reactions, collected from journal articles published since 1964. => provides training data. • Template definitions: • Three Template Elements: enzyme, organism and compound. • A single Template Relation: source, relating enzyme and organism elements. • A Scenario Template for the specific metabolic pathway task.
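
The template definitions above can be pictured as a few small record types. The following is a minimal sketch, assuming Python dataclasses; every class and field name is a hypothetical illustration and not the slot inventory of the actual EMPathIE templates, and the organism value is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List

# All class and field names below are illustrative assumptions,
# not the slot names used in the real EMPathIE template definitions.

@dataclass(frozen=True)
class Enzyme:
    """Template Element for an enzyme."""
    name: str

@dataclass(frozen=True)
class Organism:
    """Template Element for an organism."""
    name: str

@dataclass(frozen=True)
class Compound:
    """Template Element for a compound."""
    name: str

@dataclass(frozen=True)
class Source:
    """Template Relation: links an enzyme to the organism it comes from."""
    enzyme: Enzyme
    organism: Organism

@dataclass
class PathwayScenario:
    """Scenario Template: one enzyme reaction and the compounds involved."""
    source: Source
    compounds: List[Compound] = field(default_factory=list)

enzyme = Enzyme("isocitrate lyase")
organism = Organism("organism X")  # placeholder, not taken from the article
scenario = PathwayScenario(Source(enzyme, organism), [Compound("isocitrate")])
```

The relation and scenario records hold references to element records rather than copies of their strings, which mirrors the "pointers" idea in the MUC task definitions.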

3.1 EMPathIE (2/2) • A manually produced sample Scenario Template, taken from an article on ‘isocitrate lyase activity’ in FEMS Microbiology Letters. • Note: glyoxylate cycle.

3.2 PASTA (1/3) • The entities to be extracted: • proteins • amino acid residues • species • types of structural characteristics • secondary structure, quaternary structure • active sites • other (probably less important) regions • chains • Interactions • hydrogen bonds, disulphide bonds etc.

3.2 PASTA (2/3)

3.2 PASTA (3/3)

4. EMPathIE and PASTA (1/2) • The IE systems are both derived from the LaSIE system, a general purpose IE system, under development at Sheffield since 1994. • The processing modules:

4. EMPathIE and PASTA (2/2) • Both systems have a pipeline architecture consisting of four principal stages. • Text preprocessing • SGML/structure analysis, tokenisation • Lexical and terminological processing • Terminology lexicons, morphological analysis, terminology grammars • Parsing and semantic interpretation • Sentence boundary detection, part-of-speech tagging, phrase grammars, semantic interpretation • Discourse interpretation • Coreference resolution, domain modeling
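
The four-stage pipeline can be sketched as a list of functions applied in order to a shared document record. This is only an illustration of the architecture: the stage bodies are trivial stand-ins, and the function names, the dictionary keys and the one-entry lexicon are assumptions rather than parts of LaSIE.

```python
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Stage = Callable[[Document], Document]

def text_preprocessing(doc: Document) -> Document:
    # Stand-in for SGML/structure analysis and tokenisation.
    return {**doc, "tokens": doc["text"].split()}

def lexical_processing(doc: Document) -> Document:
    # Stand-in for lexicon lookup; the one-entry lexicon is made up.
    lexicon = {"lyase": "enzyme_head"}
    terms = [(t, lexicon[t.lower()]) for t in doc["tokens"] if t.lower() in lexicon]
    return {**doc, "terms": terms}

def parsing_and_semantics(doc: Document) -> Document:
    # Stand-in for tagging, parsing and semantic interpretation.
    return {**doc, "predicates": [f"{cls}({tok})" for tok, cls in doc["terms"]]}

def discourse_interpretation(doc: Document) -> Document:
    # Stand-in for coreference resolution: collapse duplicate predicates.
    return {**doc, "instances": sorted(set(doc["predicates"]))}

PIPELINE: List[Stage] = [
    text_preprocessing,
    lexical_processing,
    parsing_and_semantics,
    discourse_interpretation,
]

def run_pipeline(text: str) -> Document:
    """Pass the document through each stage in order."""
    doc: Document = {"text": text}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc

result = run_pipeline("isocitrate lyase and malate lyase")
```

Each stage only reads what earlier stages wrote, so a stage can be replaced or refined without touching the others.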

4.1 Text Preprocessing • Both the SGML and sectioniser modules may specify that certain text regions are to be excluded from any subsequent processing, avoiding detailed processing of apparently irrelevant text. • The tokenisation of the input needs to identify tokens within compound names.
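
To make the tokenisation point concrete, here is a minimal sketch of a tokeniser that splits compound names into their component tokens. The slides do not give the actual tokenisation rules, so the regular expression and the function name `tokenise` are assumptions.

```python
import re
from typing import List

# Assumed token classes: runs of letters, runs of digits,
# and any other single non-space character (e.g. a hyphen).
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")

def tokenise(text: str) -> List[str]:
    """Split text so that the parts of a compound name become separate tokens."""
    return TOKEN_RE.findall(text)

tokens = tokenise("mannitol-1-phosphate 5-dehydrogenase")
```

Splitting at hyphens and digits exposes components such as mannitol and phosphate to later lexicon lookup, at the cost of needing grammar rules to put the full name back together.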

4.2 Lexical and Terminological Preprocessing (1/3) • The main information sources used for terminology identification: • Case-insensitive terminology lexicons • Listing component terms of various categories • Morphological cues: standard biochemical suffixes • Hand-constructed grammar rules for each terminology class

4.2 Lexical and Terminological Preprocessing (2/3) • The enzyme name mannitol-1-phosphate 5-dehydrogenase would be recognized firstly by the classification of mannitol as a potential compound modifier, and phosphate as a compound, both by being matched in the terminology lexicon. • Morphological analysis would suggest dehydrogenase as a potential enzyme head, due to its suffix -ase. • Grammar rules would apply to combine the enzyme head with a known compound and modifier which can play the role of enzyme modifier.
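
The worked example above combines three information sources: lexicon lookup, a morphological cue and a grammar rule. The sketch below imitates that combination on a toy scale. The two-entry lexicon, the single suffix test and the single rule (a run of modifiers, compounds, digits and hyphens ending in an -ase head and containing at least one compound) are simplifying assumptions, not the system's real resources.

```python
import re
from typing import List, Optional

# Toy case-insensitive terminology lexicon (assumed entries).
LEXICON = {"mannitol": "compound_modifier", "phosphate": "compound"}

def classify(token: str) -> Optional[str]:
    """Label a token from the lexicon, a suffix cue, or as a linker."""
    lowered = token.lower()
    if lowered in LEXICON:
        return LEXICON[lowered]
    if len(lowered) > 3 and lowered.endswith("ase"):
        return "enzyme_head"  # morphological cue; over-generates on words like "case"
    if token.isdigit() or token == "-":
        return "linker"
    return None

def find_enzymes(text: str) -> List[str]:
    """Apply one hand-written rule: labelled run with a compound + enzyme head."""
    found = []
    start = None          # character offset where the current candidate begins
    has_compound = False  # whether the candidate contains a known compound
    for match in re.finditer(r"[A-Za-z]+|\d+|-", text):
        label = classify(match.group())
        if label is None:
            start, has_compound = None, False  # an unlabelled token ends the run
            continue
        if start is None:
            start = match.start()
        if label == "compound":
            has_compound = True
        elif label == "enzyme_head":
            if has_compound:
                found.append(text[start:match.end()])
            start, has_compound = None, False
    return found

names = find_enzymes("The enzyme mannitol-1-phosphate 5-dehydrogenase was purified.")
```

A bare head such as "a dehydrogenase" is not returned by this rule because no compound precedes it; a real grammar would need further rules for such cases.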

4.2 Lexical and Terminological Preprocessing (3/3) • The biochemical terminology lexicons, assembled from various publicly available resources (e.g. SWISS-PROT), have been structured to distinguish various term components, which are then assembled by grammar rules. • The lexicons currently contain approximately 25,000 component terms in 52 categories.

4.3 Parsing and Semantic Interpretation • The syntactic processing modules treat any terms recognized in the previous stage as non-decomposable units, with the syntactic role of a proper noun. • The POS tagger only attempts to assign tags to tokens which are not part of proposed terms. • The phrasal grammar includes compositional semantic rules, which are used to construct a semantic representation of the ‘best’, possibly partial, parse of each sentence.

4.4 Discourse Interpretation (1/2) • The discourse interpreter adds the semantic representation of each sentence to a predefined domain model, made up of an ontology, or concept hierarchy, plus inheritable properties and inference rules associated with concepts. • The domain model is gradually populated with instances of concepts from the text to become a discourse model. • A coreference mechanism attempts to merge each newly introduced instance with an existing one, subject to various syntactic and semantic constraints.
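
The merging step can be illustrated with a small sketch. It assumes a toy concept hierarchy and only two semantic constraints (one class must subsume the other, and shared attributes must agree); the real system's constraints are not detailed in these slides, so the names and rules here are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical concept hierarchy, stored as child -> parent.
ONTOLOGY = {"enzyme": "protein", "protein": "entity", "organism": "entity"}

Instance = Dict[str, Any]  # {"class": str, "attrs": dict}

def subsumes(general: str, specific: str) -> bool:
    """True if `general` equals `specific` or is one of its ancestors."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = ONTOLOGY.get(node)
    return False

def compatible(a: Instance, b: Instance) -> bool:
    """Assumed constraints: related classes and no conflicting attribute values."""
    related = subsumes(a["class"], b["class"]) or subsumes(b["class"], a["class"])
    shared = set(a["attrs"]) & set(b["attrs"])
    return related and all(a["attrs"][k] == b["attrs"][k] for k in shared)

def add_instance(model: List[Instance], new: Instance) -> Instance:
    """Merge `new` into the first compatible instance, else add it to the model."""
    for old in model:
        if compatible(old, new):
            if subsumes(old["class"], new["class"]):
                old["class"] = new["class"]  # keep the more specific class
            old["attrs"].update(new["attrs"])
            return old
    model.append(new)
    return new

model: List[Instance] = []
add_instance(model, {"class": "enzyme", "attrs": {"name": "isocitrate lyase"}})
add_instance(model, {"class": "protein", "attrs": {}})  # e.g. "the protein"
add_instance(model, {"class": "organism", "attrs": {"name": "organism X"}})
```

Here the mention typed as a protein is merged into the earlier enzyme instance, while the organism stays separate because neither class subsumes the other.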

4.4 Discourse Interpretation (2/2) • The template writer module reads off the required information from the final discourse model and formats it as in the template specification. • An initial domain model for the EMPathIE metabolic pathway task has been manually constructed, directly from the template definition. • Subsequent refinement will involve extending the concept subhierarchies and adding coreference constraints on the hypothesised instances, based on available training data.

5. Results & Evaluation (1/2) • A complete prototype EMPathIE system exists which can produce filled templates. • The terminology recognition portion has been informally reviewed by molecular biologists. => remarkably good. • The PASTA system has been implemented as far as the terminology recognition stage. Preliminary template design has been carried out, and work on building a domain model is starting. • A corpus of 52 abstracts of journal articles has been manually annotated with terminology classes. => allows an automatic evaluation of the PASTA terminology system using the MUC scoring software.
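
An automatic evaluation of this kind compares system output against the manual annotations and reports recall and precision. The sketch below is a simplified stand-in for the MUC scoring software: it counts only exact matches of span and class, whereas the MUC scorer also distinguishes partial matches. The annotation tuples are made-up illustration data, not results from the paper.

```python
from typing import Set, Tuple

Annotation = Tuple[int, int, str]  # (start offset, end offset, term class)

def score(key: Set[Annotation], response: Set[Annotation]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) using exact span-and-class matches."""
    correct = len(key & response)
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Made-up annotations for illustration only.
key = {(0, 8, "protein"), (10, 16, "residue"), (20, 25, "species"), (30, 35, "protein")}
response = {(0, 8, "protein"), (10, 16, "residue"), (40, 45, "protein")}

precision, recall, f_measure = score(key, response)
```

With these toy sets, two of the three system answers are correct and two of the four key annotations are found, giving a precision of 2/3 and a recall of 1/2.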

5. Results & Evaluation (2/2) • Initial Named Entity results for the PASTA system.

6. Conclusion • In these two projects, much of the low-level work of moving IE systems into the molecular biology domain has been carried out: • The software has been generalized to handle longer, multi-sectioned articles with embedded SGML. • Tokenisation routines have been generalized to cope with scientific nomenclature. • Terminology recognition procedures have been generalized to deal with a broad range of molecular biological terminology. • Good progress has been made in designing template elements, template relations, and scenario templates.
