Text Mining for Biomedicine: Techniques & tools

Text Mining for Biomedicine: Techniques & tools Sophia Ananiadou, Chikashi Nobata, Yutaka Sasaki, Yoshimasa Tsuruoka School of Computer Science National Centre for Text Mining www.nactem.ac.uk Sophia.Ananiadou@manchester.ac.uk

Outline • Challenges / objectives of TM in biomedicine • Terminology processing • Term extraction, term variation, named entity recognition • Resources for TM in biomedicine • Document classification • Information Extraction approaches • Levels of Text Mining Processing • Biomedical text mining services and systems @ NaCTeM • TerMine, AcroMine, Smart dictionary look up, Phenetica • Medie, InfoPubMed, KLEIO

Material • Further background on TM for Biology Ananiadou, S. & McNaught, J. (eds) (2006) Text Mining for Biology and Biomedicine. Boston, MA: Artech House • Numerous papers on line from bibliography • See BLIMP http://blimp.cs.queensu.ca/ • Biomedical Literature (and text) mining publications

Text Mining in biomedicine • Why biomedicine? • Consider just MEDLINE: 16,000,000 references, 40,000 added per month • Dynamic nature of the domain: new terms (genes, proteins, chemical compounds, drugs) constantly created • Impossible to manage such an information overload

From Text to Knowledge: tackling the data deluge through text mining • Unstructured text (implicit knowledge) • Information Retrieval • Information Extraction • Knowledge Discovery • Semantic metadata • Structured content (explicit knowledge) • Advanced Information Retrieval

Information deluge • Bio-databases, controlled vocabularies and bio-ontologies encode only a small fraction of the information • Linking text to databases and ontologies • Curators struggle to process the scientific literature • Discovery of facts and events is crucial for gaining insights in the biosciences: hence the need for text mining

The solution: The UK National Centre for Text Mining www.nactem.ac.uk • Location: Manchester Interdisciplinary Biocentre (MIB) www.mib.ac.uk • First publicly funded text mining centre in the world • Focus: biology, medicine, social sciences…

We don’t just press a button… • TM involves • Many components (converters, analysers, miners, visualisers, ...) • Many resources (grammars, ontologies, lexicons, terminologies, thesauri, CVs) • Many combinations of components and resources for different applications • Many different user requirements and scenarios, training needs • The best solutions are customised

People behind NaCTeM • Text Mining Team: 14 members • Close collaboration with University of Tokyo, Tsujii Lab http://www-tsujii.is.s.u-tokyo.ac.jp/

What NaCTeM is building: • Resources: ontologies, lexicons, terminologies, thesauri, grammars, annotated corpora • BOOTStrep project http://www.nactem.ac.uk/bootstrep.php • Tools: tokenisers, taggers, chunkers, parsers, NE recognisers, semantic analysers • NaCTeM is also providing services • Our related bio-text mining projects • REFINE http://dbkgroup.org/refine/ • Representing Evidence For Interacting Network Elements • ONDEX (data integration, workflows, text mining)

Individual tools for user data • Splitters, taggers, chunkers, parsers, NER, term extractors • Modes of use • Demonstrators: for small-scale online use • Batch mode: upload data, get email with link to download site when job done • Web Services • Integration into Workflows (Taverna) • Some services are compositions of tools

Aims • Text mining: discover & extract unstructured knowledge hidden in text (Hearst, 1999) • Text mining helps construct hypotheses from associations derived from text • protein-protein interactions • gene-phenotype associations • functional relationships among genes

Impact of text mining • Extraction of named entities (genes, proteins, metabolites, etc) • Discovery of concepts allows semantic annotation of documents • Improves information access by going beyond index terms, enabling semantic querying • Construction of concept networks from text • Allows clustering, classification of documents • Visualisation of concept maps

Impact of TM • Extraction of relationships (events and facts) for knowledge discovery • Information extraction, more sophisticated annotation of texts (event annotation) • Beyond named entities: facts, events • Enables even more advanced semantic querying

Hypothesis generation from literature • Swanson experiments (1986) influenced conceptual biology • rapid ‘mining’ of candidate hypotheses from the literature • migraine and magnesium deficiency (Swanson, 1988) • indomethacin and Alzheimer’s disease (Swanson and Smalheiser, 1994) • Curcuma longa and retinal diseases, Crohn's disease and disorders related to the spinal cord (Srinivasan and Libbus, 2004) • thalidomide for treating a series of diseases such as acute pancreatitis and chronic hepatitis C (Weeber et al., 2003)

Text mining steps • Information Retrieval yields all relevant texts • Gathers, selects, filters documents that may prove useful • Finds what is known • Information Extraction extracts facts & events of interest to user • Finds relevant concepts, facts about concepts • Finds only what we are looking for • Data Mining discovers unsuspected associations • Combines & links facts and events • Discovers new knowledge, finds new associations

From Text to Knowledge: NLP and Knowledge Extraction • Text Annotation Tools • Knowledge Extraction Tools • Structured Knowledge • Lexicons and ontologies

Challenge: the resource bottleneck • Lack of large-scale, richly annotated corpora • Support training of ML algorithms • Development of computational grammars • Evaluation of text mining components • Lack of knowledge resources: lexica, terminologies, ontologies.

Annotation & Information Extraction • [Figure labels: Biomedical Literature, Annotation, IE system, Biomedical Knowledge] • Semantic annotation simulates the ideal performance of an IE system. • IE systems can be developed by reference to an annotated corpus. • The performance of IE systems can be evaluated by comparison with the annotated corpus. (Kim & Tsujii, Text Mining Workshop, Manchester, 2006)

Text Annotation • Task-oriented annotation: defined by specific tasks • Specific curation tasks in specific environments • Mapping of protein names to database IDs in specific text types • Specific event types such as protein-protein interaction • Disease-gene association for specific diseases • Application / user system development from annotated text • Task-neutral annotation: defined by theories • GENIA Corpus [U-Tokyo, NaCTeM] • Development of generic tools • Interoperable tools • Linguistics: tokens, POS, phrase structure, dependency structure, deep syntax (PAS) • Biology: named entities of various semantic types, events • Linguistics + Biology: co-references

Annotation of GENIA corpus: Term & POS • Part-of-speech annotation: 2,000 abstracts • Term (entity) annotation: 2,000 + 400 abstracts

Text semantic annotation • annotation of events and the named entities involved • Example: “Regulation of Transcription events” • BOOTStrep project http://www.nactem.ac.uk/bootstrep.php • two types of annotation level • linguistic annotation levels • biological annotation level, which marks the biological knowledge contained in the text • Linking text with biological knowledge

Events and variables • Biological events can be centred on: • verbs, e.g. activate, • nouns with verb-like meanings (nominalised verbs), e.g. transcription • Different parts of the sentence correspond to different types of variables in the event, e.g. • What caused the event • The narL gene product activates the nitrate reductase operon • What was affected by the event • Analysis of mutants… • Where the event took place • These fusions were formed on plasmid cloning vectors

Verb frame: activate • Example: “The narL gene product activates the nitrate reductase operon” • Agent characteristics: protein • Theme characteristics: operon

Example 1 • The narL gene product: the agent (protein) • activates • the nitrate reductase operon: the theme, what is acted upon (operon)
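The verb frame above can be pictured as a small data structure. The following is only a sketch: the class names `Argument` and `Event` and their fields are assumptions made for illustration, not the annotation schema used in GENIA or BOOTStrep.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Argument:
    role: str           # semantic role, e.g. "Agent" or "Theme"
    text: str           # text span that fills the role
    semantic_type: str  # expected type of the filler, e.g. "protein"


@dataclass(frozen=True)
class Event:
    trigger: str                     # verb or nominalised verb
    arguments: Tuple[Argument, ...]  # role fillers found in the sentence

    def argument(self, role: str) -> Optional[Argument]:
        """Return the first argument with the given role, or None."""
        for arg in self.arguments:
            if arg.role == role:
                return arg
        return None


# "The narL gene product activates the nitrate reductase operon"
example = Event(
    trigger="activates",
    arguments=(
        Argument("Agent", "The narL gene product", "protein"),
        Argument("Theme", "the nitrate reductase operon", "operon"),
    ),
)

agent = example.argument("Agent")
theme = example.argument("Theme")
print(agent.text, "->", example.trigger, "->", theme.text)
```

Keeping the roles in a tuple rather than as fixed fields leaves room for optional variables such as the location of the event.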

Linguistically Annotated Corpora • GENIA • Domain • MeSH terms: Human, Blood Cells, and Transcription Factors • Annotation: POS, named entity, parse tree • Penn BioIE • Domain • the molecular genetics of oncology • the inhibition of enzymes of the CYP450 class • Annotation: POS, named entity, parse tree • Yapex • GENETAG: a corpus of 20K MEDLINE® sentences for gene/protein NER

The GENIA annotation • Linguistic annotation • Reveals linguistic structures behind the text • Part-of-speech annotation • annotates the syntactic category of each word • Syntactic tree annotation • annotates the syntactic structure of sentences • Semantic annotation • Reveals the pieces of knowledge conveyed by the text • Term annotation • annotates domain-specific terms • Event annotation • annotates events on biological entities • Ontology-driven annotation

Annotation Tool • WordFreak http://wordfreak.sourceforge.net/ • Java-based linguistic annotation tool developed at University of Pennsylvania • Extensible to new tasks and domains • Customised visualisation and annotation specification • Allows annotation process to be made as simple as possible

Resources

What about existing resources? • Ontologies important for knowledge discovery • They form the link between terms in texts and biological databases • Can be used to add meaning, semantic annotation of texts

Link between text and ontologies • Text • Ontological resources (UMLS, GO, GENIA, KEGG) • Adding new knowledge • Supporting semantics

Bridging the Gap: Integrating data, text and knowledge • Databases • Semantic interpretation of data • Text • Adding new knowledge • Supporting semantics • Ontological resources (UMLS, GO, GENIA, KEGG) • Mathematical models • Semantic interpretation of models in Systems Biology

Resources for Bio-Text Mining • Lexical / terminological resources • SPECIALIST lexicon, Metathesaurus (UMLS) • Lists of terms / lexical entries (hierarchical relations) • Ontological resources • Metathesaurus, Semantic Network, GO, SNOMED CT, etc. • Encode relations among entities • See Bodenreider, O. “Lexical, Terminological, and Ontological Resources for Biological Text Mining”, Chapter 3, Text Mining for Biology and Biomedicine, pp. 43-66

SPECIALIST lexicon • UMLS SPECIALIST lexicon http://SPECIALIST.nlm.nih.gov • Each lexical entry contains morphological (e.g. cauterize, cauterizes, cauterized, cauterizing), syntactic (e.g. complementation patterns for verbs, nouns, adjectives) and orthographic (e.g. esophagus – oesophagus) information • General language lexicon with many biomedical terms (over 180,000 records) • Lexical programs handle variation (spelling), base form, inflection, acronyms

Lexicon record • {base=Kaposi's sarcoma spelling_variant=Kaposi sarcoma entry=E0003576 cat=noun variants=uncount variants=reg variants=glreg} • Forms covered: Kaposi’s sarcoma, Kaposi’s sarcomas, Kaposi’s sarcomata, Kaposi sarcoma, Kaposi sarcomas, Kaposi sarcomata • Source: The SPECIALIST Lexicon and Lexical Tools, Allen C. Browne, Guy Divita, and Chris Lu PhD, 2002 NLM Associates Presentation, 12/03/2002, Bethesda, MD
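A record of this kind is a set of slot=value pairs in which a slot such as variants may repeat. The sketch below reads the record into a dictionary of lists. It assumes one slot per line, which is how the record is laid out here for illustration; `parse_record` is a hypothetical helper, not part of the NLM lexical tools.

```python
from collections import defaultdict

# The Kaposi's sarcoma record from the slide, written one slot per line
# (this line layout is an assumption made for the example).
RECORD = """\
{base=Kaposi's sarcoma
spelling_variant=Kaposi sarcoma
entry=E0003576
cat=noun
variants=uncount
variants=reg
variants=glreg
}"""


def parse_record(text):
    """Collect slot=value pairs; repeated slots accumulate in a list."""
    slots = defaultdict(list)
    for line in text.splitlines():
        line = line.strip().strip("{}").strip()  # drop record delimiters
        if not line:
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore anything that is not slot=value
            slots[key.strip()].append(value.strip())
    return dict(slots)


record = parse_record(RECORD)
print(record["base"], record["cat"], record["variants"])
```

Storing every slot as a list keeps repeated slots (the three variants codes) without special cases.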

Normalisation (lexical tools) • Variants: Hodgkin Disease; HODGKIN DISEASE; Hodgkin’s Disease; Hodgkin’s disease; Disease, Hodgkin; ... • normalise → disease hodgkin

Steps of Norm • Input: Hodgkin’s Diseases • Remove genitive → Hodgkin Diseases • Replace punctuation with spaces → Hodgkin Diseases • Remove stop words → Hodgkin Diseases • Lowercase → hodgkin diseases • Uninflect each word → hodgkin disease • Word order sort → disease hodgkin • Lexical tools of the UMLS http://lexsrv3.nlm.nih.gov/SPECIALIST/index.html
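The Norm steps can be approximated in a few lines of Python. This is a rough sketch, not the UMLS implementation: the stop-word list is illustrative, and the `uninflect` suffix rule is a crude stand-in for the lexicon-based uninflection that the real lexical tools perform.

```python
import re

# Illustrative stop-word list (an assumption, not the official Norm list).
STOP_WORDS = {"of", "and", "with", "for", "nos", "to", "in", "by", "on", "the"}


def uninflect(word):
    """Crude plural stripping; wrong for words such as 'diagnosis'."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def norm(term):
    term = re.sub(r"['’]s\b", "", term)    # remove genitive
    term = re.sub(r"[^\w\s]", " ", term)   # replace punctuation with spaces
    words = [w for w in term.split() if w.lower() not in STOP_WORDS]
    words = [w.lower() for w in words]     # lowercase
    words = [uninflect(w) for w in words]  # uninflect each word
    return " ".join(sorted(words))         # word order sort


variants = [
    "Hodgkin Disease",
    "HODGKIN DISEASE",
    "Hodgkin's Disease",
    "Hodgkin’s Diseases",
    "Disease, Hodgkin",
]
for variant in variants:
    print(variant, "->", norm(variant))
```

All five variants map to the same key, which is what allows them to be indexed under a single entry.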

The Gene Ontology (GO) • Controlled vocabulary for the annotation of gene products http://www.geneontology.org/ • 19,468 terms, 95.3% with definitions • 10,391 biological_process • 1,681 cellular_component • 7,396 molecular_function

Gene Ontology • GOA database (http://www.ebi.ac.uk/GOA/) assigns gene products to the Gene Ontology • GO terms follow naming conventions and have typed synonyms, for example: • ornithine cycle is an exact synonym of urea cycle • cell division is a broad synonym of cytokinesis • cytochrome bc1 complex is a related synonym of ubiquinol-cytochrome-c reductase activity

GO terms, definitions and ontologies in OBO • id: GO:0000002 • name: mitochondrial genome maintenance • namespace: biological_process • def: "The maintenance of the structure and integrity of the mitochondrial genome." [GOC:ai] • is_a: GO:0007005 ! mitochondrion organization and biogenesis
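An OBO term is a list of tag: value lines, so it can be read with very little code. The sketch below parses the stanza shown on this slide; `parse_obo_stanza` is a hypothetical helper written for illustration, and it ignores parts of the format (escapes, trailing modifiers, a literal " ! " inside quoted text) that a full OBO parser would handle.

```python
STANZA = """\
[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome." [GOC:ai]
is_a: GO:0007005 ! mitochondrion organization and biogenesis
"""


def parse_obo_stanza(text):
    """Return a dict mapping each tag to the list of its values."""
    term = {}
    for line in text.splitlines():
        line = line.split(" ! ", 1)[0].strip()  # drop trailing "! comment"
        if not line or line.startswith("["):    # skip blanks and [Term]
            continue
        tag, sep, value = line.partition(": ")
        if sep:
            term.setdefault(tag, []).append(value)
    return term


term = parse_obo_stanza(STANZA)
print(term["id"][0], term["name"][0], "is_a", term["is_a"][0])
```

Values are kept in lists because tags such as is_a and synonym can occur several times in one term.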

Metathesaurus • organised by concept • 5M names, 1M concepts, 16M relations • built from 134 electronic versions of many different thesauri, classifications, code sets, and lists of controlled terms • "source vocabularies" • common representation

Are the existing knowledge resources sufficient for TM? No! Why? • Limited lexical & terminological coverage of biological sub-domains • Resources focused on human specialists • GO, UMLS, UniProt • Ontology concept names frequently confused with terms

Naming conventions • Update and curation of resources • FlyBase gene name coverage ranges from 31% (abstracts) to 84% (full texts) • Naming conventions and representation in heterogeneous resources • Term formation guidelines from formal bodies, e.g. HUGO, IPI, not uniformly used • Problems with integration of resources • dystrophin used for 18 gene products • HUGO: “Dystrophin (muscular dystrophy, Duchenne and Becker types), included DXS143, DXS164, DXS206, …”

Term variation • Terminological variation and complexity of names • High correlation between degree of term variation and dynamic nature of biomedicine • Variation occurs in controlled vocabularies and texts but discrepancy between the two • Exact match methods fail to associate term occurrences in texts with databases
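The failure of exact matching is easy to reproduce. The sketch below contrasts exact dictionary look-up with approximate string matching from Python's standard `difflib` module. The three-entry dictionary and the 0.8 cutoff are assumptions chosen for illustration, and approximate matching is only one of several possible ways to handle variation.

```python
import difflib

# Toy dictionary; the entries reuse examples from earlier slides.
DICTIONARY = ["Kaposi's sarcoma", "Hodgkin disease", "esophagus"]


def exact_lookup(mention):
    """Return the dictionary entry only if the strings are identical."""
    return mention if mention in DICTIONARY else None


def approximate_lookup(mention, cutoff=0.8):
    """Return the most similar entry above the cutoff, ignoring case."""
    lowered = {entry.lower(): entry for entry in DICTIONARY}
    hits = difflib.get_close_matches(
        mention.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


for mention in ["Kaposi sarcoma", "Hodgkin's Disease", "oesophagus", "p53"]:
    print(mention, "|", exact_lookup(mention), "|", approximate_lookup(mention))
```

A lower cutoff recovers more variants but also admits more false matches, so the threshold would need tuning against annotated data.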

What’s in a name? Terms, named entities in biology

What’s in a name? • Breast cancer 1 (BRCA1) • p53 • Ribosomal protein S27 • Heat shock protein 110 • Mitogen activated protein kinase 15 • Mitogen activated protein kinase kinase kinase 5 • (From K. Cohen, NAACL 2007)

Worst gene names • sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A • (K. Cohen, NAACL 2007)
