
Text Mining for Biomedicine: Techniques & tools

Sophia Ananiadou, Chikashi Nobata, Yutaka Sasaki, Yoshimasa Tsuruoka

School of Computer Science

National Centre for Text Mining

www.nactem.ac.uk

Sophia.Ananiadou@manchester.ac.uk

Outline
  • Challenges / objectives of TM in biomedicine
  • Terminology processing
    • Term extraction, term variation, named entity recognition
  • Resources for TM in biomedicine
  • Document classification
  • Information Extraction approaches
  • Levels of Text Mining Processing
  • Biomedical text mining services and systems @ NaCTeM
    • TerMine, AcroMine, Smart dictionary look up, Phenetica
    • Medie, InfoPubMed, KLEIO
Material
  • Further background on TM for Biology

Ananiadou, S. & McNaught, J. (eds) (2006) Text Mining for Biology and Biomedicine. Boston, MA: Artech House

  • Numerous papers on line from bibliography
  • See BLIMP http://blimp.cs.queensu.ca/
    • Biomedical Literature (and text) mining publications
Text Mining in biomedicine
  • Why biomedicine?
    • Consider just MEDLINE: 16,000,000 references, 40,000 added per month
    • Dynamic nature of the domain: new terms (genes, proteins, chemical compounds, drugs) constantly created
    • Impossible to manage such an information overload
From Text to Knowledge: tackling the data deluge through text mining

[Diagram: Unstructured Text (implicit knowledge); Information Retrieval; Information Extraction; Knowledge Discovery; Semantic metadata; Structured content (explicit knowledge); Advanced Information Retrieval]

Information deluge
  • Bio-databases, controlled vocabularies and bio-ontologies encode only a small fraction of information
  • Linking text to databases and ontologies
    • Curators struggling to process scientific literature
    • Discovery of facts and events crucial for gaining insights in biosciences: need for text mining
The solution: The UK National Centre for Text Mining www.nactem.ac.uk
  • Location: Manchester Interdisciplinary Biocentre (MIB) www.mib.ac.uk
  • First publicly funded text mining centre in the world.
  • Focus: biology, medicine, social sciences…
We don’t just press a button…
  • TM involves
    • Many components (converters, analysers, miners, visualisers, ...)
    • Many resources (grammars, ontologies, lexicons, terminologies, thesauri, CVs)
    • Many combinations of components and resources for different applications
    • Many different user requirements and scenarios, training needs
  • The best solutions are customised
People behind NaCTeM
  • Text Mining Team: 14 members
  • Close collaboration with University of Tokyo, Tsujii Lab http://www-tsujii.is.s.u-tokyo.ac.jp/
What NaCTeM is building:
  • Resources: ontologies, lexicons, terminologies, thesauri, grammars, annotated corpora
    • BOOTStrep project http://www.nactem.ac.uk/bootstrep.php
  • Tools: tokenisers, taggers, chunkers, parsers, NE recognisers, semantic analysers
  • NaCTeM is also providing services
  • Our related bio-text mining projects
    • REFINE http://dbkgroup.org/refine/
    • Representing Evidence For Interacting Network Elements
    • ONDEX (data integration, workflows, text mining)
Individual tools for user data
  • Splitters, taggers, chunkers, parsers, NER, term extractors
  • Modes of use
    • Demonstrators: for small-scale online use
    • Batch mode: upload data, get email with link to download site when job done
    • Web Services
    • Integration into Workflows (Taverna)
  • Some services are compositions of tools
Aims
  • Text mining: discover & extract unstructured knowledge hidden in text
    • Hearst (1999)
  • Text mining helps to construct hypotheses from associations derived from text
        • protein-protein interactions
        • associations between genes and phenotypes
        • functional relationships among genes
Impact of text mining
  • Extraction of named entities (genes, proteins, metabolites, etc)
  • Discovery of concepts allows semantic annotation of documents
    • Improves information access by going beyond index terms, enabling semantic querying
  • Construction of concept networks from text
    • Allows clustering, classification of documents
    • Visualisation of concept maps
Impact of TM
  • Extraction of relationships (events and facts) for knowledge discovery
    • Information extraction, more sophisticated annotation of texts (event annotation)
    • Beyond named entities: facts, events
    • Enables even more advanced semantic querying
Hypothesis generation from literature
  • Swanson experiments (1986) influenced conceptual biology
    • rapid ‘mining’ of candidate hypotheses from the literature
    • migraine and magnesium deficiency (Swanson, 1988)
    • indomethacin and Alzheimer’s disease (Swanson and Smalheiser 1994)
    • Curcuma longa and retinal diseases, Crohn's disease and disorders related to the spinal cord (Srinivasan and Libbus 2004)
    • thalidomide for treating a series of diseases such as acute pancreatitis and chronic hepatitis C (Weeber et al. 2003)
Text mining steps
  • Information Retrieval yields all relevant texts
    • Gathers, selects, filters documents that may prove useful
    • Finds what is known
  • Information Extraction extracts facts & events of interest to user
    • Finds relevant concepts, facts about concepts
    • Finds only what we are looking for
  • Data Mining discovers unsuspected associations
    • Combines & links facts and events
    • Discovers new knowledge, finds new associations
From Text to Knowledge: NLP and Knowledge Extraction

[Diagram: Text; Annotation Tools; Structured Knowledge; Knowledge Extraction Tools; Lexicons and ontologies]

Challenge: the resource bottleneck
  • Lack of large-scale, richly annotated corpora
    • Support training of ML algorithms
    • Development of computational grammars
    • Evaluation of text mining components
  • Lack of knowledge resources: lexica, terminologies, ontologies.
Annotation & Information Extraction

[Diagram: Biomedical Literature; Annotation; IE system; Biomedical Knowledge]

  • Semantic annotation simulates the ideal performance of an IE system.
    • IE systems can be developed by referencing an annotated corpus.
    • The performance of IE systems can be evaluated by comparing their output to the annotated corpus.

(Kim & Tsujii, Text Mining Workshop, Manchester, 2006)

Text Annotation
  • Task-oriented Annotation
    • Application annotated text
    • User system development
    • Defined by specific tasks
      • Specific curation tasks in specific environments
      • Mapping of Protein names to database IDs in specific text types
      • Specific event types such as Protein-Protein Interaction
      • Disease-Gene Association of specific diseases
  • Task-neutral Annotation
    • GENIA Corpus [U-Tokyo, NaCTeM]
    • Development of generic tools
    • Defined by theories
      • Linguistics: Tokens, POS, Phrase Structure, Dependency Structure, Deep Syntax (PAS)
      • Biology: Named Entities of various semantic types, Events
      • Linguistics + Biology: Co-references
  • Interoperable Tools
Text semantic annotation
  • annotation of events and involved named entities
    • Example: “Regulation of Transcription events”
    • BOOTStrep project http://www.nactem.ac.uk/bootstrep.php
  • two different types of annotation levels
      • linguistic annotation levels
      • biological annotation level, in charge of marking the biological knowledge contained in the text
      • Linking text with biological knowledge
Events and variables
  • Biological events can be centred on:
    • verbs, e.g. activate,
    • nouns with verb-like meanings (nominalised verbs), e.g. transcription
  • Different parts of sentence correspond to different types of variables in the event e.g.
    • What caused event
      • The narL gene product activates the nitrate reductase operon
    • What was affected by event
      • Analysis of mutants…
    • Where event took place
      • These fusions were formed on plasmid cloning vectors
Verb Frame Example

“The narL gene product activates the nitrate reductase operon”
  • Verb: activate
  • Agent Characteristics: protein
  • Theme Characteristics: operon

Example 1
  • the agent: The narL gene product (protein)
  • verb: activates
  • the theme (what is acted upon): the nitrate reductase operon (operon)
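The verb frame above can be held in a small data structure. The following is only a sketch: the class and field names (`Argument`, `Event`, `trigger`, `semantic_type`) are assumptions made for illustration and are not taken from any NaCTeM tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    text: str           # surface string taken from the sentence
    semantic_type: str  # characteristic required by the frame, e.g. "protein"


@dataclass(frozen=True)
class Event:
    trigger: str     # verb or nominalised verb the event is centred on
    agent: Argument  # what caused the event
    theme: Argument  # what is acted upon


# The example sentence from the slide, encoded as one event.
event = Event(
    trigger="activates",
    agent=Argument("The narL gene product", "protein"),
    theme=Argument("the nitrate reductase operon", "operon"),
)
```

Keeping the semantic type next to each argument makes it possible to check an extracted event against the frame (agent must be a protein, theme must be an operon).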

Linguistically Annotated Corpora
  • GENIA
    • Domain
      • MeSH terms: Human, Blood Cells, and Transcription Factors.
    • Annotation: POS, named entity, parse tree
  • Penn BioIE
    • Domain
      • the molecular genetics of oncology
      • the inhibition of enzymes of the CYP450 class.
    • Annotation: POS, named entity, parse tree
  • Yapex
  • GENETAG: a corpus of 20K MEDLINE® sentences for gene/protein NER
The GENIA annotation
  • Linguistic annotation
    • Reveals linguistic structures behind the text
      • Part-of-speech annotation
        • annotates for the syntactic category of each word.
      • Syntactic Tree annotation
        • annotates for the syntactic structure of sentences.
  • Semantic annotation
    • Reveals knowledge pieces delivered by the text.
      • Term annotation
        • annotates domain-specific terms
      • Event annotation
        • annotates events on biological entities.

Ontology-driven annotation

Annotation Tool
  • WordFreak http://wordfreak.sourceforge.net/
  • Java-based linguistic annotation tool developed at University of Pennsylvania
  • Extensible to new tasks and domains
  • Customised visualisation and annotation specification
    • Allows annotation process to be made as simple as possible
What about existing resources?
  • Ontologies important for knowledge discovery
    • They form the link between terms in texts and biological databases
    • Can be used to add meaning, semantic annotation of texts
Link between text and ontologies

[Diagram: text; Ontological resources (KEGG, UMLS, GO, GENIA); Adding new knowledge; Supporting semantics]

Bridging the Gap: Integrating data, text and knowledge

[Diagram: Databases; text; Ontological resources (UMLS, GO, GENIA, KEGG); Mathematical Models; Semantic Interpretation of data; Semantic Interpretation of models in Systems Biology; Adding new knowledge; Supporting semantics]

Resources for Bio-Text Mining
  • Lexical / terminological resources
    • SPECIALIST lexicon, Metathesaurus (UMLS)
    • Lists of terms / lexical entries (hierarchical relations)
  • Ontological resources
    • Metathesaurus, Semantic Network, GO, SNOMED CT, etc
    • Encode relations among entities

Bodenreider, O. “Lexical, Terminological, and Ontological Resources for Biological Text Mining”, Chapter 3, Text Mining for Biology and Biomedicine, pp.43-66

SPECIALIST lexicon
  • UMLS specialist lexicon http://SPECIALIST.nlm.nih.gov
    • Each lexical entry contains morphological (e.g. cauterize, cauterizes, cauterized, cauterizing), syntactic (e.g. complementation patterns for verbs, nouns, adjectives), orthographic information (e.g. esophagus – oesophagus)
    • General language lexicon with many biomedical terms (over 180,000 records)
    • Lexical programs include variation (spelling), base form, inflection, acronyms
Lexicon record

{base=Kaposi's sarcoma
spelling_variant=Kaposi sarcoma
entry=E0003576
cat=noun
variants=uncount
variants=reg
variants=glreg}

Variants:
  • Kaposi’s sarcoma
  • Kaposi’s sarcomas
  • Kaposi’s sarcomata
  • Kaposi sarcoma
  • Kaposi sarcomas
  • Kaposi sarcomata

The SPECIALIST Lexicon and Lexical Tools, Allen C. Browne, Guy Divita, and Chris Lu PhD, 2002 NLM Associates Presentation, 12/03/2002, Bethesda, MD

Normalisation (lexical tools)
  • Input variants
    • Hodgkin Disease
    • HODGKIN DISEASE
    • Hodgkin’s Disease
    • Hodgkin’s disease
    • Disease, Hodgkin ...
  • normalise
    • disease hodgkin

Steps of Norm

Remove genitive

Hodgkin’s Diseases

Replace punctuation with spaces

Hodgkin Diseases

Remove stop words

Hodgkin Diseases

Lowercase

hodgkin diseases

Uninflect each word

hodgkin disease

Word order sort

disease hodgkin

Lexical tools of the UMLS http://lexsrv3.nlm.nih.gov/SPECIALIST/index.html
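
The six steps above can be sketched in a few lines of Python. This is a minimal approximation, not the UMLS lexical tools themselves: the stop-word list is a small illustrative set, and `uninflect` is a naive plural stripper standing in for the lexicon-based uninflection that the real Norm program performs.

```python
import re

# Illustrative stop-word list (assumption); the real tool ships its own.
STOP_WORDS = {"of", "and", "with", "for", "nos", "to", "in", "by", "on", "the"}


def uninflect(word: str) -> str:
    """Naive stand-in for lexicon-based uninflection: strip a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def norm(term: str) -> str:
    term = re.sub(r"[’']s\b", "", term)      # 1. remove genitive
    term = re.sub(r"[^\w\s]", " ", term)     # 2. replace punctuation with spaces
    words = [w for w in term.split()
             if w.lower() not in STOP_WORDS]  # 3. remove stop words
    words = [w.lower() for w in words]        # 4. lowercase
    words = [uninflect(w) for w in words]     # 5. uninflect each word
    return " ".join(sorted(words))            # 6. word order sort
```

With these assumptions, `norm("Hodgkin’s Diseases")`, `norm("HODGKIN DISEASE")` and `norm("Disease, Hodgkin")` all map to the same key, so the variants can be matched against a single dictionary entry.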

The Gene Ontology (GO)
  • Controlled vocabulary for the annotation of gene products

http://www.geneontology.org/

  • 19,468 terms, 95.3% with definitions
    • 10,391 biological_process
    • 1,681 cellular_component
    • 7,396 molecular_function

Gene Ontology
  • GOA database (http://www.ebi.ac.uk/GOA/) assigns gene products to the Gene Ontology
  • GO terms follow certain conventions of creation, have synonyms such as:
    • ornithine cycle is an exact synonym of urea cycle
    • cell division is a broad synonym of cytokinesis
    • cytochrome bc1 complex is a related synonym of ubiquinol-cytochrome-c reductase activity
GO terms, definitions and ontologies in OBO

id: GO:0000002

name: mitochondrial genome maintenance

namespace: biological_process

def: "The maintenance of the structure and integrity of the mitochondrial genome." [GOC:ai]

is_a: GO:0007005 ! mitochondrion organization and biogenesis
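
A stanza like the one above is a list of `tag: value` lines, so it can be read into a dictionary with very little code. The sketch below is a simplified reader written for this example only (the function name `parse_obo_stanza` is an assumption); it ignores escaping rules and most of the OBO format, and a maintained OBO parser is preferable for real work.

```python
def parse_obo_stanza(text: str) -> dict:
    """Parse the tag: value lines of one OBO term stanza into a dict.

    The repeatable is_a tag is collected into a list.
    """
    term = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and the "[Term]" header
        tag, value = line.split(":", 1)
        value = value.split(" ! ")[0].strip()  # drop the trailing "! comment"
        if tag == "is_a":
            term.setdefault("is_a", []).append(value)
        else:
            term[tag] = value
    return term


stanza = """\
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
is_a: GO:0007005 ! mitochondrion organization and biogenesis
"""
term = parse_obo_stanza(stanza)
```

The `is_a` list gives the parent identifiers, which is what a text mining component needs in order to walk up the hierarchy from a matched term.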

Metathesaurus
  • organised by concept
    • 5M names, 1M concepts, 16M relations
  • built from 134 electronic versions of many different thesauri, classifications, code sets, and lists of controlled terms
  • "source vocabularies"
  • common representation
Are the existing knowledge resources sufficient for TM?

No!

Why?

  • Limited lexical & terminological coverage of biological sub-domains
  • Resources focused on human specialists

GO, UMLS, UniProt ontology concept names frequently confused with terms

Naming conventions
  • Update and curation of resources
    • FlyBase gene name coverage 31% (abstracts) to 84% (full texts)
  • Naming conventions and representation in heterogeneous resources
    • Term formation guidelines from formal bodies e.g. HUGO, IPI not uniformly used
    • Problems with integration of resources

dystrophin used for 18 gene products

“Dystrophin (muscular dystrophy, Duchenne and Becker types), included DXS143, DXS164, DXS206, …” HUGO

Term variation
  • Terminological variation and complexity of names
    • High correlation between degree of term variation and dynamic nature of biomedicine
    • Variation occurs in controlled vocabularies and texts but discrepancy between the two
    • Exact match methods fail to associate term occurrences in texts with databases
What’s in a name?

Terms, named entities in biology

What’s in a name?
  • Breast cancer 1 (BRCA1)
  • p53
  • Ribosomal protein S27
  • Heat shock protein 110
  • Mitogen activated protein kinase 15
  • Mitogen activated protein kinase kinase kinase 5

From K. Cohen, NAACL 2007

Worst gene names
  • sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A
  • SEMA5A
  • Tyrosine kinase with immunoglobulin and epidermal growth factor homology domains
  • tie

K. Cohen NAACL 2007

Term ambiguity

NF2
  • Neurofibromatosis 2 [disease]
  • Neurofibromin 2 [protein]
  • Neurofibromatosis 2 gene [gene]

O. Bodenreider, MIE 2005 tutorial

http://www.nactem.ac.uk/

Term ambiguity
  • Gene terms may be also common English words
    • BAD: human gene encoding BCL-2 family of proteins (compare: bad news, bad prediction)
  • Gene names are often used to denote gene products (proteins)
    • suppressor of sable is used ambiguously to refer to either genes or proteins
  • Existing resources lack information that can support term disambiguation
  • Difficult to establish equivalences between term forms and concepts
Homologues
  • Cyclin-dependent kinase inhibitor was first introduced to represent a protein family p27
    • But it is used interchangeably with p27 or p27kip1, as the name of the individual protein and not as the name of the protein family (Morgan 2003).
  • NFKB2 denotes the name of a family of 2 individual proteins with separate IDs in Swiss-Prot.
    • These proteins are homologues belonging to different species, Homo sapiens and chicken.
Terms
  • Term: linguistic realisation of specialised concepts, e.g. genes, proteins, diseases
  • Terminology: collection of terms structured (hierarchy) denoting relationships among concepts, part-whole, is-a, specific, generic, etc.
  • Terms link text and ontologies
  • Mapping is not trivial (main challenge)
Term variation and ambiguity

[Diagram: TEXT level with Term1, Term2, Term3; ONTOLOGY level with Concept1, Concept2, Concept3; links labelled "Term variation" and "Term ambiguity"]

Term mining steps
  • Term recognition: Tp53
  • Term classification: Gene
  • Term mapping: Genome Database, IARC TP53 Mutation Database

Term recognition techniques
  • ATR extracts terms (variants) from a collection of documents
  • Distinguishes terms vs non-terms
  • In NER the steps of recognition and classification are merged, a classified terminological instance is a named entity
  • The tasks of ATR and NER share techniques but their ultimate goals are different
    • ATR for resource building, lexica & ontologies
    • NER first step of IE, text mining
Overview papers
  • S. Ananiadou & G. Nenadic (2006) Automatic Terminology Management in Biomedicine, Text Mining for Biology and Biomedicine, pp. 67- 97.
  • M. Krauthammer & G. Nenadic (2004) Term identification in the biomedical literature, JBI 37 (2004) 512-526
  • J.C. Park & J. Kim (2006) Named Entity Recognition, Text Mining for Biology and Biomedicine, pp. 121-142

Detailed bibliography in Bio-Text Mining

  • BLIMP http://blimp.cs.queensu.ca/
  • http://www.ccs.neu.edu/home/futrelle/bionlp/

Book on BioText Mining

  • S. Ananiadou & J. McNaught (eds) (2006) Text Mining for Biology and Biomedicine, Artech House.

Other Bio-Text Mining tutorials

Kevin Cohen (NAACL 2007 tutorial) U. Colorado

Dictionary NER (1)
  • Use terminological resources to locate term occurrences in text
    • NCBI http://www.ncbi.nlm.nih.gov/
    • EBI http://www.ebi.ac.uk/
    • neologisms, variations, ambiguity problematic for simple dictionary look-up
    • Ambiguous words e.g. an, for, can …
    • spelling variants, punctuation, word order variations
      • estrogen / oestrogen
      • NF kappa B / NF kB
Dictionary NER (2)
  • Hirschman (2002) used FlyBase for gene name recognition, results disappointing due to homonymy, spelling variations
    • Precision: 7% (abstracts), 2% (full papers)
    • Recall: 31% (abstracts), 84% (full papers)
  • Tuason (2004) reports term variation as main problem of mismatch
    • bmp-4 / bmp4
    • syt4 / syt iv
    • integrin alpha 4 / alpha4 integrin
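
A minimal sketch of dictionary look-up that tolerates the simplest of these mismatches (case, hyphens, spacing) is shown below. The function names and the identifiers `GENE:1` and `GENE:2` are hypothetical; the normalisation is deliberately crude and does not handle word order variants (alpha4 integrin) or numeral variants (syt4 vs syt iv).

```python
import re


def normalise(term: str) -> str:
    """Collapse case, hyphens and spacing so that bmp-4 and bmp4 share a key."""
    return re.sub(r"[\s\-]+", "", term.lower())


def build_index(dictionary: dict) -> dict:
    """Map normalised surface forms to dictionary identifiers."""
    return {normalise(name): ident for name, ident in dictionary.items()}


def lookup(tokens: list, index: dict, max_len: int = 4) -> list:
    """Greedy longest-match look-up of token n-grams against the index."""
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + n])
            if normalise(surface) in index:
                hits.append((surface, index[normalise(surface)]))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return hits


# Hypothetical identifiers, for illustration only.
index = build_index({"bmp-4": "GENE:1", "integrin alpha 4": "GENE:2"})
hits = lookup("expression of bmp4 and Integrin alpha-4".split(), index)
```

Exact string matching would find neither mention in this sentence; with the normalised keys both are mapped to their dictionary entries.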
Dictionary NER (3)
  • Tsuruoka & Tsujii (2003) suggest a probabilistic generator of spelling variants, edit distance operations (delete, substitute, insert)
    • Terms with ED ≤ 1 considered spelling variants
    • Used a dictionary of protein terms
  • Support query expansion
  • Augment dictionaries with variation
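
The edit-distance criterion can be illustrated with plain Levenshtein distance. Note the assumption: this sketch uses unit costs for every operation, whereas the approach cited above is a probabilistic generator of variants; the code only shows the "ED ≤ 1" test itself.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def is_spelling_variant(a: str, b: str, max_ed: int = 1) -> bool:
    """Treat two terms as spelling variants if their distance is small."""
    return edit_distance(a.lower(), b.lower()) <= max_ed
```

Under this test bmp-4 / bmp4 and estrogen / oestrogen count as variants, which is the kind of pair that can be added to a dictionary or used for query expansion.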
Rule based (1)
  • Use orthographic, morpho-syntactic features of terms
    • Rules that make use of internal term formation patterns (tagging, morphological analysers) e.g. affixes, combining forms
    • Do not take into account contextual features
    • Dictionaries of constituents e.g. affixes, neoclassical forms included
  • Portability to different domains?
Rule based (2)
  • Ananiadou, S. (1994) recognised single-word terms based on morphological analysis of term formation patterns (internal term make up)
  • based on analysis of neoclassical and hybrid elements

‘alphafetoprotein’ ‘immunoosmoelectrophoresis’

‘radioimmunoassay’

  • some elements are used for creating terms

term → word + term_suffix

term → term + word_suffix

  • neoclassical combining forms (electro- adeno-),
  • prefixes (auto-, hypo-)
  • suffixes ( -osis, -itis)
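
As a rough illustration of a rule that looks only at internal term make-up, the sketch below flags single words carrying one of a few affixes. The affix lists are tiny and partly assumed (only electro-, adeno-, auto-, hypo-, -osis and -itis come from the slide), and the rule is far simpler than the morphological analysis described above.

```python
# Illustrative affix lists; "immuno" and "radio" are assumed additions.
COMBINING_FORMS = ("electro", "adeno", "immuno", "radio")
PREFIXES = ("auto", "hypo")
SUFFIXES = ("osis", "itis")


def looks_like_term(word: str) -> bool:
    """Flag a single word as a candidate term from its affixes alone."""
    w = word.lower()
    has_front = w.startswith(COMBINING_FORMS + PREFIXES)
    has_suffix = w.endswith(SUFFIXES)
    # Require some stem beyond the affix itself.
    return (has_front or has_suffix) and len(w) > 6
```

Because no context is used, a rule like this also accepts everyday words such as "hypothesis", which is one reason such rules are hard to port across domains.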
Rule-based (3)
  • Fukuda (1998) used lexical, orthographic features for protein name recognition e.g. upper case character, numerals etc.
  • PROPER: core and feature elements
    • Core: meaning bearing elements
    • Feature: function elements

Example: SAP kinase (core: SAP; feature: kinase)

Core elements extended to feature based on concatenation rules (based on POS tags)

Rule-based (4)
  • Gaizauskas (2000) CFG for protein name recognition (PASTA, EMPATHIE)
  • Based on morphological and lexical characteristics of terms
      • biochemical suffixes (-ase enzyme name)
      • dictionary look-up (protein names, chemical compounds, etc)
      • deduction of term grammar rules from Protein Data Bank

Protein -> protein_modifier, protein_head, numeral

Rule-based (5)
  • Inspired by PROPER, Yapex uses Swiss-Prot to add core term elements

http://www.sics.se/humle/projects/prothalt/yapex.cgi

  • Hou (2003) used Yapex with context information (collocations) appearing with protein names
  • Rule based approaches construct rules and patterns manually or automatically
  • Difficult to tune to different domains
Machine learning systems
  • Learn features from training data for term recognition and classification
  • Most ML systems combine recognition and classification

Challenges

    • Feature selection and optimisation
    • Availability of training data
    • detection of term boundaries
Overview of ML-based NER
  • Training phase: Manually tagged texts → Detecting features → Learning model → Learned Model
  • Testing phase: Raw texts → Tag annotator with model → Tagged texts

ML (1)
  • Nobata et al. (1999) used a decision tree for NER
  • Decision tree: one of the methods to classify a case using training data
    • Node: specifies some condition with a subtree
    • Leaf: indicates a class
  • Features:
    • Part-of-speech information
    • Orthographic information
    • Term lists
Example of a decision tree
  • Each node has one condition:
    • Is the current word in the Protein term list? (Yes / No)
    • Does the previous word have figures? (Yes / No)
    • What is the next word’s POS? (Noun / Verb)
  • Each leaf has one class: Unknown, PROTEIN, DNA, RNA, ……

ML (2)
  • Collier (2000) used HMM, orthographic features for term recognition
    • HMM looks for most likely sequence of classes corresponding to a word sequence e.g. interleukin-2 protein/DNA
    • To find similarities between known words (training set) and unknown words, use character features

Feature: Examples
  • DigitNumber: [2] protein, [3] DNA
  • GreekLetter: [alpha] protein
  • TwoCaps: [RelB] protein, [TAR] RNA
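
A character-feature function for the three features listed above might look like the following sketch. The exact definitions are assumptions: "TwoCaps" is read here as "two or more capital letters" (so that both RelB and TAR qualify), the Greek letter list is partial, and the fallback label "Other" is invented for the example.

```python
# Partial list of spelled-out Greek letters (assumption).
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}


def char_feature(token: str) -> str:
    """Return one orthographic feature for a token."""
    if token.isdigit():
        return "DigitNumber"
    if token.lower() in GREEK:
        return "GreekLetter"
    if sum(ch.isupper() for ch in token) >= 2:
        return "TwoCaps"
    return "Other"


features = [char_feature(t) for t in ["2", "alpha", "RelB", "TAR", "protein"]]
```

Features of this kind let a model generalise from words seen in training to unseen words with the same shape.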

ML (2)
  • Use of GENIA resources as training data
    • Results depend on training data
  • Morgan (2004) used FlyBase to automatically construct a training corpus
    • Pattern matching for gene name recognition, noisy corpus annotated
    • HMM was trained on that corpus for gene name recognition
Support Vector Machines (1)
  • Kazama trained multi-class SVMs on Genia corpus
  • Corpus annotated with B-I-O tags
    • B tags denote words at beginning of term
    • I tags inside term
    • O tags outside term
    • B-protein-tag : word in the beginning of a protein name
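
The B-I-O scheme can be made concrete with a small conversion from annotated spans to per-token tags. The span format used here, `(start, end, label)` over token positions with an exclusive end, is an assumption for the example and not the GENIA file format.

```python
def to_bio(tokens: list, spans: list) -> list:
    """Convert (start, end, label) token spans into B-I-O tags.

    `end` is exclusive; tokens outside every span are tagged "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first word of the term
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # words inside the term
    return tags


tokens = "The narL gene product activates the nitrate reductase operon".split()
tags = to_bio(tokens, [(1, 4, "protein")])
```

Each token then becomes one classification instance, which is how a multi-class SVM can be applied to a sequence labelling task.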
SVMs for NER (2)
  • Yamamoto used a combination of features for protein name recognition:
    • Morphological, lexical, boundary, syntactic (head noun), domain specific (if term exists in biomedical database).
  • Lee used different features for recognition and classification.
      • orthographic, prefix, suffix
      • Contextual information
Hybrid (1)
  • ABGene: protein and gene name tagger
    • Combines ML, transformation rules, dictionaries with statistics
    • Protein tagger trained on MEDLINE abstracts by adapting Brill’s tagger
    • Transformation rules for recognition of gene, protein names
    • Used GO, LocusLink list of genes, proteins for false negative tags
Hybrid (2)
  • ARBITER (Access and Retrieve Binding Terms) uses
    • UMLS Metathesaurus and GenBank to map NPs (binding terms)
    • morphological features
    • lexical information (head noun)
  • EDGAR recognises gene, cell, drug names using co-occurrences of cell, clone, expression
Hybrid (3)
  • C/NC value (Frantzi & Ananiadou, 1999)
  • C-value
      • Linguistic filters
      • total frequency of occurrence of string in corpus
      • frequency of string as part of longer candidate terms (nested terms)
      • number of these longer candidate terms
      • length of string
    • Output: automatically ranked terms (TerMine)
C-value
  • The C-value measure extracts multi-word, nested terms

[adenoid [cystic [basal [cell carcinoma]]]]

cystic basal cell carcinoma

ulcerated basal cell carcinoma

recurrent basal cell carcinoma

basal cell carcinoma
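
The four ingredients of C-value listed earlier (frequency of the string, its frequency inside longer candidates, the number of those longer candidates, and its length) combine as C-value(a) = log2|a| · f(a) for a non-nested candidate, and log2|a| · (f(a) − mean frequency of the longer candidates containing a) for a nested one. The sketch below computes this over already-filtered candidates; the frequencies are invented for illustration and the linguistic filter is assumed to have been applied beforehand.

```python
import math


def contains(longer: tuple, shorter: tuple) -> bool:
    """True if the word sequence `shorter` occurs inside the longer one."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(freq: dict) -> dict:
    """freq maps candidate terms (tuples of words) to corpus frequency."""
    scores = {}
    for a, f_a in freq.items():
        longer = [f_b for b, f_b in freq.items() if contains(b, a)]
        if longer:  # nested: subtract mean frequency of containing candidates
            f_a = f_a - sum(longer) / len(longer)
        scores[a] = math.log2(len(a)) * f_a
    return scores


# Invented counts, for illustration only.
freq = {
    ("basal", "cell", "carcinoma"): 10,
    ("cystic", "basal", "cell", "carcinoma"): 2,
    ("ulcerated", "basal", "cell", "carcinoma"): 2,
    ("cell", "carcinoma"): 10,
}
scores = c_value(freq)
```

With these counts "basal cell carcinoma" outranks "cell carcinoma": the shorter string occurs just as often, but mostly as part of longer candidates, so its score is discounted.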

Term variation
  • variation recognition as part of ATR (Nenadic, Ananiadou)
  • recognise term forms and link them into equivalence classes
  • important if ATR is based on statistics (e.g. frequency of occurrence)
    • corpus-based measures are distributed across different variants
    • conflation of various surface representations of a given term should improve ATR
Simple variation
  • orthographic
    • hyphens, slashes (amino acid and amino-acid)
    • lower/upper cases (NF-KB and NF-kb)
    • spelling variations (tumour and tumor)
    • transliterations (oestrogen and estrogen)
  • morphological
    • inflectional phenomena (plural, possessives)
  • lexical
    • genuine synonyms (carcinoma and cancer)
Complex variation
  • Structural
    • Possessive usage of nouns using prepositions (clones of human and human clones)
    • Prepositional variants (cell in blood, cell from blood)
    • Term coordinations (adrenal glands and gonads)
Coordinated term variants
  • Structure is ambiguous
    • Head coordination or term conjunction?
  • Head or argument coordination?

(N|A)+ CC (N|A)* N+

      • cell differentiation and proliferation
      • chicken and mouse receptors
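
The pattern (N|A)+ CC (N|A)* N+ can be matched over a POS-tagged sentence by encoding each tag as one character and running a regular expression over the resulting string. The tag set (N, A, CC) and the function name are assumptions for this sketch; a real tagger's labels would need mapping onto them.

```python
import re

# One character per tag so that the POS pattern becomes a plain regex.
TAG_CODES = {"N": "N", "A": "A", "CC": "C"}
PATTERN = re.compile(r"[NA]+C[NA]*N+")


def find_coordinations(tagged: list) -> list:
    """Return word sequences whose tags match (N|A)+ CC (N|A)* N+."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    return [" ".join(word for word, _ in tagged[m.start():m.end()])
            for m in PATTERN.finditer(codes)]


sentence = [("the", "DT"), ("chicken", "N"), ("and", "CC"),
            ("mouse", "N"), ("receptors", "N"), ("bind", "V")]
found = find_coordinations(sentence)
```

The pattern only locates candidate coordinations; it matches both examples above and cannot tell head coordination from argument coordination, which is the ambiguity the slide points out.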
Marrying IR and terminology
  • IR engine plus TerMine
  • Discover associated terms ranked according to relevance
  • Allow user to link term with IR for document discovery
  • NB compound terms
  • NB technical terms, not classic index terms
  • NB terms familiar to user, found in documents
Biomedical IE/IR Systems
  • iHOP
    • http://www.ihop-net.org/UniPub/iHOP/
  • EBIMed
    • http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
  • GoPubMed
    • http://www.gopubmed.org/
  • PubFinder
    • http://www.glycosciences.de/tools/PubFinder
  • Textpresso
    • http://www.textpresso.org/
Acronyms
  • Very productive type of term variation
  • Acronym variation (synonymy)
    • NF kappa B/ NF kB / nuclear factor kappa B
  • Acronym ambiguity (polysemy) even in controlled vocabularies

GR
  • glucocorticoid receptor
  • glutathione reductase

Acronym recognition
  • Schwartz, A. & Hearst, M. (2003) A simple algorithm for identifying abbreviation definitions in biomedical text, PSB 2003, 8, 451-462
  • Adar, E. (2004) SaRAD: a simple and robust abbreviation dictionary, Bioinformatics, 20(4) 527-533
  • Chang, J.T. & Schutze, H. (2006) Abbreviations in biomedical text, Text Mining for Biology and Biomedicine, pp.99-119, Artech
  • Tsuruoka, Y., Ananiadou, S. & Tsujii, J. (2005) A Machine learning approach to automatic acronym generation, ISMB, BioLink SIG, 25-31
  • Okazaki, N. & S.Ananiadou (2006) Acronym recognition based on term identification, Bioinformatics
The importance of acronym recognition
  • Acronyms are among the most productive type of term variation
    • 64,242 new acronyms were introduced in 2004 [Chang and Schütze 06]
  • Acronyms are used more frequently than full terms
    • 5,477 documents could be retrieved by using the acronym JNK while only 3,773 documents could be retrieved by using its full term, c-jun N-terminal kinase [Wren et al. 05]
  • No rules or exact patterns for the creation of acronyms from their full form
Recognition
  • Extracting pairs of short and long forms

<acronym, long form>

    • Distinguishing acronyms from parenthetical expressions
    • Search for parentheses in text; single or more words; e.g. Ab (antibody)
    • Limit context around ( ); limit number of words according to number of letters in acronym
Recognition (heuristics)
  • Heuristics: match letters of acronym with letters of long form using rules, patterns
    • letters from beginning of words
    • combining forms

carboxifluorescein diacetate (CFDA)

    • Acronym normalisation to allow orthographic, structural and lexical variations
    • morphological information, positional info
    • Penalise words in long form that do not match acronym
    • Accidental matching

argininosuccinate synthetase (AS) [highlighted letters: A, S]

Letter matching
  • Alignment: find all matches between letters of acronyms and their long forms and calculate likelihood (Chang & Schütze)
    • Solves problem of acronyms containing letters not occurring in LF
    • Choose best alignment based on features, e.g. position of letter etc.
    • Finding the optimal weight for each feature is a challenge

http://abbreviation.stanford.edu/

acronym recognition1
Acronym Recognition

Okazaki, N., Ananiadou, S. (2006) Building an abbreviation dictionary using a term recognition approach. Bioinformatics.

a simple algorithm schwartz and hearst 2003
A simple algorithm – Schwartz and Hearst (2003)
  • Uses parenthetical expressions as a marker of a short form

… long-form ‘(’ short-form ‘)’ …

  • All letters and digits in a short form must appear in the corresponding long form in the same order
    • We used hidden Markov model (HMM) to …
    • Early repolarization (ER) is an enigma.
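
The in-order letter constraint can be sketched as a right-to-left scan, in the spirit of Schwartz and Hearst (2003). This is a simplified illustration rather than the authors' reference implementation; the function and variable names are assumptions.

```python
def find_long_form(short_form, candidate):
    """Return the shortest suffix of `candidate` whose characters cover
    `short_form` in order, or None if there is no such suffix."""
    l_idx = len(candidate) - 1
    for s_idx in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_idx].lower()
        if not ch.isalnum():
            continue  # punctuation in the short form is ignored
        # Scan left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while l_idx >= 0 and (
            candidate[l_idx].lower() != ch
            or (s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum())
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
    # Extend the match back to the beginning of the word it starts in.
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]
```

For the two example sentences, the text before the parenthesis is passed as `candidate`, and the scan keeps only the words needed to cover every letter of the short form.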
problems of letter matching approach
Problems of letter-matching approach
  • Highly dependent on the expressions in the target text
    • o acquired immuno deficiency syndrome (AIDS)
    • x acquired syndrome (AIDS)
    • x a patient with human immunodeficiency syndrome (AIDS)
    • ? magnetic resonance imaging unit (MRI)
    • ! beta 2 adrenergic receptor (ADRB2)
    • ! gamma interferon (IFN-GAMMA)

(These examples are obtained from actual MEDLINE abstracts)

  • Naive with respect to term variations
acromine s approach
AcroMine’s approach
  • Extract a word or word sequence:
    • Co-occurring frequently with an acronym (e.g., TTF-1)
      • 1, factor 1, transcription factor 1, thyroid transcription factor 1
    • Does not co-occur with other surrounding words
      • thyroid transcription factor 1
  • Not necessarily based on letter-matching
    • Note that this is a difficult case for the letter-matching algorithm
  • Prune unlikely candidates
    • Nested candidates: transcription factor 1
    • Expansions: expression of thyroid transcription factor 1
    • Insertions: thyroid specific transcription factor 1
short form mining
Short-form mining
  • Enumerate all short forms in a target text
    • Using parentheses as a clue: … ‘(’ short-form ‘)’ …
    • Validation rules for identifying acronyms [Schwartz and Hearst 03]
      • It consists of at most two words
      • Its length is between two and ten characters
      • It contains at least one alphabetic letter
      • The first character is alphanumeric

The contextual sentence of HMM and ASR.

The present system consists of a hidden Markov model (HMM) based automatic speech recognizer (ASR), with a keyword spotting system to capture the machine sensitive words (registered in a dictionary) from the running utterances.
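
The four validation rules can be sketched as a small filter over parenthetical expressions. The function names and the regular expression are illustrative assumptions.

```python
import re

def is_valid_short_form(text):
    """Apply the four validation rules to a parenthetical expression."""
    words = text.split()
    return (
        1 <= len(words) <= 2                  # at most two words
        and 2 <= len(text) <= 10              # two to ten characters
        and any(ch.isalpha() for ch in text)  # at least one alphabetic letter
        and text[0].isalnum()                 # first character is alphanumeric
    )

def mine_short_forms(sentence):
    """Return the parenthetical expressions that pass the validation rules."""
    found = (m.group(1).strip() for m in re.finditer(r"\(([^()]+)\)", sentence))
    return [text for text in found if is_valid_short_form(text)]
```

On the contextual sentence above, this keeps HMM and ASR and rejects the four-word parenthetical "registered in a dictionary".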

enumerating long form candidates for an acronym
Enumerating long-form candidates for an acronym
  • Tokenize a contextual sentence by non-alphanumeric characters (e.g., space, hyphen, etc.)
  • Apply Porter’s stemming algorithm [Porter 80]
  • Extract terms that match the following pattern

[:WORD:].*$

where .* matches an empty string or words of any length

Example sentence: We studied the expression of thyroid transcription factor-1 (TTF-1).

Word sequences shown for this sentence (after stemming):
    • 1
    • factor 1
    • transcript factor 1
    • thyroid transcript factor 1
    • of thyroid transcript factor 1
    • expression of thyroid transcript factor 1
    • studi the expression of thyroid transcript factor 1
    • thyroid transcript
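
A minimal sketch of the enumeration step: tokenize the text before the parenthesis on non-alphanumeric characters and list every word sequence that ends immediately before the acronym. Stemming is left as a pluggable `stem` argument (identity by default, so the output below is unstemmed); a Porter stemmer could be passed in. All names are illustrative assumptions.

```python
import re

def long_form_candidates(sentence, short_form, stem=lambda word: word):
    """List word sequences ending right before '(short_form)', shortest first."""
    end = sentence.find("(" + short_form + ")")
    if end < 0:
        return []
    # Split on runs of non-alphanumeric characters (space, hyphen, ...).
    tokens = [stem(t) for t in re.split(r"[^0-9A-Za-z]+", sentence[:end]) if t]
    return [" ".join(tokens[i:]) for i in range(len(tokens) - 1, -1, -1)]
```

With the example sentence and short form TTF-1, the first candidates are "1", "factor 1", "transcription factor 1" and "thyroid transcription factor 1".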

long form extraction
Long-form extraction
  • Long-form candidates are sorted by score in descending order
  • A long-form candidate is considered valid if:
    • It has a score greater than 2.0
    • The words in the long form can be rearranged so that all alphanumeric letters appear in the same order as the short form
    • It is not nested in, or an expansion of, a previously chosen long form
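
A minimal sketch of the selection loop, assuming the candidate scores have already been computed (the scoring formula is not given here, and the scores in the usage note are made up). It applies the threshold and the nested/expansion test only; the letter-order test is omitted for brevity.

```python
def select_long_forms(scored_candidates, threshold=2.0):
    """Pick long forms from (text, score) pairs, best score first."""
    chosen = []
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    for text, score in ranked:
        if score <= threshold:
            break  # every remaining candidate scores too low
        padded = f" {text} "
        # Nested: the candidate lies inside a chosen long form.
        # Expansion: the candidate contains a chosen long form.
        related = any(
            padded in f" {prev} " or f" {prev} " in padded for prev in chosen
        )
        if not related:
            chosen.append(text)
    return chosen
```

For example, with hypothetical scores 40.0 for "thyroid transcription factor 1", 25.0 for "transcription factor 1" and 8.0 for "expression of thyroid transcription factor 1", only the first is kept: the second is nested in it and the third is an expansion of it.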
acronym disambiguation
Acronym disambiguation
  • Local acronyms
    • Accompany their expanded forms in documents
  • Global acronyms
    • Appear in documents without the expanded forms stated
    • Their correct expanded forms need to be identified
      • Immunomodulatory effects of CT were investigated in a rat model, and the effects of CT on rat renal allograft (from Lewis rat to WKAH rat) were also examined.
      • Immunomodulatory effects of cholera toxin (CT) were investigated in a rat model, and the effects of cholera toxin (CT) on rat renal allograft (from Lewis rat to Wistar-King-Aptekman-Hokudai (WKAH) rat) were also examined.
acronym disambiguation1
Acronym disambiguation

Sample text: Considerations in the identification of functional RNA structural elements in genomic alignments (Tomas Babak et al)

http://www.biomedcentral.com/1471-2105/8/33

term structuring
Term structuring
  • term clustering (linking semantically similar terms) and term classification (assigning terms to classes from a pre-defined classification scheme)
  • Hypothesis: similar terms tend to appear in similar contexts (patterns)
  • combining various sources of similarity:
    • lexical
    • syntactic
    • contextual
    • ontological (using external resources)
term structuring1
Term structuring
  • Based on term similarities
    • choice of features:
    • domain specific → ontology
    • linguistic → text
  • ontology-based similarity
  • textual similarity
    • internal features
    • contextual features
using ontologies
Using ontologies
  • two terms should match if they are:
    • identified as variants
    • siblings in the is-a hierarchy
    • in the is-a or part-whole relation
  • the distance between the corresponding nodes in the ontology should be transformed into the matching score
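
One possible way to turn node distance into a matching score, sketched under stated assumptions: the ontology is a tree given as a child-to-parent mapping, distance is the number of is-a edges on the path through the nearest common ancestor, and the score is 1 / (1 + distance). The transform, the function names and the toy hierarchy in the usage note are all illustrative, not taken from a specific system.

```python
def path_to_root(term, parent):
    """Return the chain term, parent, grandparent, ... up to the root."""
    path = [term]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def ontology_similarity(term_a, term_b, parent):
    """Score in (0, 1]: 1.0 for identical nodes, lower for distant nodes."""
    depth_b = {node: d for d, node in enumerate(path_to_root(term_b, parent))}
    for depth_a, node in enumerate(path_to_root(term_a, parent)):
        if node in depth_b:
            # Edges from each term up to the nearest common ancestor.
            return 1.0 / (1.0 + depth_a + depth_b[node])
    return 0.0  # no common ancestor
```

With a toy hierarchy in which "oestrogen receptor" and "progesterone receptor" are both children of "nuclear receptor", the two siblings are two edges apart and score 1/3, while each scores 1/2 against its parent.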

► I. Spasic presentation MIE Tutorial http://www.nactem.ac.uk/

using text
Using text
  • Large number of neologisms: many terms are not in the ontologies
  • Use of text based techniques to calculate similarities
  • edit distance (ED) – the minimal number (or cost) of changes needed to transform one string into the other
  • edit operations:

insertion    deletion     replacement  transposition
...a-c...    ...abc...    ...abc...    ...abc...
...abc...    ...a-c...    ...adc...    ...acb...

  • use of dynamic programming
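
A minimal dynamic-programming sketch of edit distance with the four operations listed above. Unit cost for every operation is an assumption, and transposition is restricted to adjacent characters (the "optimal string alignment" variant).

```python
def edit_distance(source, target):
    """Minimal number of insertions, deletions, replacements and
    adjacent transpositions needed to turn `source` into `target`."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # replacement (or match)
            )
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)  # transposition
    return dist[-1][-1]
```

Each of the four example pairs above (ac/abc, abc/ac, abc/adc, abc/acb) is at distance 1 under this definition.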
term similarities
Term similarities
    • lexical similarity: based on sharing term head and/or modifier(s) (hyponymy)

nuclear receptor

orphan nuclear receptor

    • Sharing heads

progesterone receptor, oestrogen receptor

  • Specific types of associations
    • mainly general is_a and part_of
    • some domain-specific, e.g. binding: CREB binding protein
contextual similarities
Contextual similarities
  • Features from context
    • syntactic category
    • terminological status
    • position relative to the term
    • syntactic relation between a context element and the term
    • semantic properties
    • semantic relation between a context element and the term, etc.
lexical syntactic patterns
Lexical & syntactic patterns
  • a lexico-syntactic pattern:

. . . Term (, Term)* [,] and other Term . . .

  • the leading Terms are hyponyms of the head Term

... antiandrogens, hydroxyflutamide, bicalutamide, cyproterone acetate, RU58841, and other compounds ...

  • candidate instances of the hyponymy relation:

hyponym( antiandrogens, compound )

hyponym( hydroxyflutamide, compound )

hyponym( bicalutamide, compound )

hyponym( cyproterone acetate, compound )

hyponym( RU58841, compound )
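
The pattern can be sketched as a regular expression over plain text. This is a simplification with labeled assumptions: the fragment is taken to start at the first Term (no term recogniser is used), the head Term is a single word, and the head is singularised by naively stripping a final "s". All names are illustrative.

```python
import re

# "Term (, Term)* [,] and other Term" over raw text.
PATTERN = re.compile(r"(?P<hyponyms>[^.;:]+?),?\s+and other\s+(?P<head>\w+)")

def extract_hyponyms(fragment):
    """Return (hyponym, head) pairs found by the 'and other' pattern."""
    match = PATTERN.search(fragment)
    if match is None:
        return []
    head = match.group("head")
    if head.endswith("s"):
        head = head[:-1]  # naive singularisation (assumption)
    terms = [t.strip() for t in match.group("hyponyms").split(",") if t.strip()]
    return [(term, head) for term in terms]
```

Applied to the antiandrogen fragment, this yields the five candidate pairs listed above.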

contextual information
Contextual information
  • automatic pattern mining for most important context patterns
    • find most important contexts in which a term appears

… receptor is bound to these DNA sequences …

… proteins bound to the DNA …

… estrogen receptor bound to DNA …

… steroid receptor coactivator-1 when bound to DNA …

… progesterone receptor complexes bound to DNA …

… RXRs bound to respective DNA elements in vitro …

… glucocorticoid receptor to bind DNA …

pattern: <TERM> V:bind <TERM:DNA>
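
A rough sketch of checking contexts against the bind/DNA pattern once it has been found. It only tests a given pattern and does not mine patterns automatically. The verb is matched by listing its surface forms instead of lemmatising, the left-hand TERM is not checked, and the limit of three intervening words is an arbitrary assumption.

```python
import re

# A form of "bind", then at most three words, then "DNA".
BIND_DNA = re.compile(r"\b(?:bind|binds|binding|bound)\b(?:\s+\w+){0,3}?\s+DNA\b")

def matches_bind_dna(context):
    """True if the context contains a form of 'bind' followed closely by 'DNA'."""
    return BIND_DNA.search(context) is not None
```

All seven contexts above satisfy this check, whereas a context such as "the receptor binds its ligand" does not.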

stumbling blocks
Stumbling blocks
  • Lexical similarities affected by many neologisms and ad hoc names
    • only 5% of the most frequent terms in GENIA that belong to the same biomedical class have some lexical links
  • how much context to use? (sentence, phrase, abstract, …)
  • Attempts at using co-occurrence: many report that up to 40% of co-occurrence-based relationships are biologically meaningless
term similarities1
Term similarities
  • SOLD = Syntactic, Ontology-driven & Lexical Distance (Spasic, I. & Ananiadou, S. 2005, Bioinformatics)
  • hybrid approach to comparing term contexts, which relies on:
    • linguistic information (acquired through tagging and parsing)
    • domain-specific knowledge (obtained from the ontology)
  • based on approximate pattern matching
  • combines ontology-based similarity with corpus-based similarity using both internal and contextual features
challenges of biomedical terminology
Challenges of biomedical terminology
  • Linking term forms in text with existing resources
  • Term clustering, classification and linking to databases, ontologies
  • Selection of most representative terms (concepts) in documents (important for improved IR, database curation, annotation tasks)
  • Efficient term management important for updating terminological and ontological resources, text mining applications e.g. IE, Q/A, summarisation, linking heterogeneous resources, IR etc
information extraction in biology
Information Extraction in Biology
  • Results appear depressed compared to general language
    • Dependent on earlier stages of processing (tokenisers, taggers, results from NER, etc.)
    • MUC data: 80% F-score for template relations, 60% for events
    • Challenge for bio-text mining is to achieve similar results
      • For evaluation, see Hirschman, L. (Text mining book) and BioCreAtIvE 2004
slide126
Information Extraction

ie in biology
IE in Biology
  • Pattern-matching
  • Context-free grammar approaches
  • Full parsing approaches
  • Sublanguage driven IE
  • Ontology-driven IE

McNaught, J. & Black, W. (2006) Information Extraction, Text Mining for Biology & Biomedicine, Artech house, pp.143-177

pattern matching ie
Pattern-matching IE
  • Usual limitations with non inclusion of semantic processing
  • Large amount of surface grammatical structures = too many patterns (Zipf’s law)
  • Cannot explore syntactic generalisations (active, passive voice)
  • Systems extract phrases or entire sentences with matched patterns; restricted usefulness for subsequent mining
pattern matching systems 1
Pattern-matching systems (1)
  • BioIE uses patterns to extract sentences, protein families, structures, functions, etc.
    • Presents user with relevant information, an improvement on classic IR
  • BioRAT uses “deeper” analysis: tagging, regular expressions (RE) applied over POS tags, stemming, gazetteer categories, etc.
    • Templates are applied to extract matching phrases, with primitive filters (verbs are not proteins, etc.)
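
To illustrate what applying a regular expression over POS tags can look like, here is a generic sketch. It is not BioRAT's actual template language: the template (noun phrase, verb, noun phrase over Penn Treebank-style tags), the function name and the hand-tagged input in the usage note are all assumptions.

```python
import re

# Hypothetical template: noun(s), a verb, optional determiner, noun(s).
TEMPLATE = re.compile(
    r"(?<!\S)NN\S*(?: NN\S*)* VB\S* (?:DT )?NN\S*(?: NN\S*)*(?!\S)"
)

def match_template(tagged_tokens, template=TEMPLATE):
    """Return the token sequences whose POS tags match the template.

    `tagged_tokens` is a list of (token, tag) pairs; tags contain no spaces.
    """
    tags = " ".join(tag for _, tag in tagged_tokens)
    phrases = []
    for m in template.finditer(tags):
        # Each space in the tag string separates two tokens, so counting
        # spaces maps character offsets back to token positions.
        first = tags[:m.start()].count(" ")
        last = tags[:m.end()].count(" ")
        phrases.append([token for token, _ in tagged_tokens[first:last + 1]])
    return phrases
```

Given the hand-tagged input The/DT kinase/NN phosphorylates/VBZ the/DT substrate/NN rapidly/RB, the template picks out "kinase phosphorylates the substrate". Because the match is over tags alone, it generalises across words but ignores sentence structure.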
pattern matching systems 2
Pattern matching systems (2)
  • RLIMS-P (Hu) extracts protein phosphorylation information by looking for enzymes, substrates and sites, assigned to the agent, theme and site roles of phosphorylation relations
  • POS tagger (trained on newswire), chunking, semantic typing of chunks, identification of relations using pattern-matching rules
  • Semantic typing of NPs: using combination of clue words, suffixes, acronyms etc
  • Semantically typed sentences matched with rules
  • Patterns target sentences containing phosphorylate
full parsing approaches
Full parsing approaches
  • Link Grammar applied for protein-protein interactions; general English grammar adapted to bio-text
  • Link Grammar finds all possible linkages according to its grammar
  • Number of analyses reduced by random sampling, heuristics, processing constraints relaxed
    • 10,000 results permitted per sentence
    • 60% of protein interactions extracted
    • Problems: missing possessive markers & determiners, coordination of compound noun modifiers
full parsing ie 2
Full parsing IE (2)
  • Not all parsing strategies suitable for bio-text mining
  • Text type, abstracts, “ungrammaticality” related with sublanguage characteristics?
  • Ambiguity and full parsing; fragmentary phrases (titles, headings, text in table cells, etc)
  • CADERIGE project used Link Grammar, but in shallow parsing mode
  • Kim & Park (BioIE) use combinatory categorial grammar, annotated with GO concepts, to extract general biological interactions
  • 1,300 patterns applied to find instances of patterns with keywords
full parsing 3
Full parsing (3)
  • Keywords indicate basic biological interactions
  • Patterns find potential arguments of the interaction keywords (verbs or nominalisations)
    • Validated arguments mapped into GO concepts
    • Difficult to generalise interaction keyword patterns
  • BioIE’s syntactic parsing performance improved after adding subcategorisation frames on verbal interaction keywords
full parsing 4
Full parsing (4)
  • Daraselia et al. (2004) use full parsing and a domain-specific filter to extract protein interactions
  • All syntactic analyses discovered using CFG and a variant of LFG
  • Each alternative parse mapped to its corresponding semantic representation
  • Output = set of semantic trees, lexemes linked by relations indicating thematic or attributive roles
  • Apply custom-built, frame-based ontology to filter representations of each sentence
  • Preference mechanism controls construction of frame tree; high precision, low recall (21%)
sublanguage driven ie 1
Sublanguage-driven IE (1)
  • Language of a special community (e.g. biology)
  • Particular set of constraints with respect to general language (GL)
  • Constraints operate at all linguistic levels
    • Special vocabulary (terms)
    • Specialised term formation rules
    • Sublanguage syntactic patterns
    • Sublanguage semantics
  • These constraints give rise to the informational structure of the domain (Z. Harris)
  • See JBI 35(4) Special Issue on Sublanguage
genies system
GENIES system
  • Employs SL approach to extract biomolecular interactions
  • Uses hybrid syntactic-semantic rules
    • Syntactic and semantic constraints referred to in one rule
  • Able to cope with complex sentences
  • Frame-based representation
    • Embedded frames
  • Domain specific ontology covers both entities and events
genies system1
GENIES system
  • Default strategy: full parsing
    • Robust due to sublanguage constraints
    • Much ambiguity excluded
  • If full parse fails, partial parsing invoked
    • Maintains good level of recall
  • Precision: 96%, Recall: 63%
ontology driven ie
Ontology-driven IE
  • Until recently most rule-based IE systems have used neither linguistic lexica nor ontologies
    • Reliance on gazetteers
    • Small number of semantic categories
  • Gazetteer approach not well suited in bioIE
  • Ontology based vs ontology driven
    • Passive use of ontologies, map discovered entity to concept
    • Active use, ontology guides and constrains analysis, fewer rules
  • Examples: PASTA, GenIE (not SL)
  • GENIES: SL and ontology driven
summary simple pattern matching
Summary: simple pattern matching
  • Over text strings
    • Many patterns required, no generalisation possible
  • Over POS
    • Some generalisation but ignore sentence structure
  • POS tagging, chunking, semantic p-m, typing
    • Limited generalisation, some account taken of structure, limited consideration of SL patterns
summary full parsing
Summary: full parsing
  • Full parsing on its own, or parsing done in combination with other techniques (chunking, partial parsing, heuristics) to reduce ambiguity and filter out implausible readings
    • GL theories not appropriate
    • Difficult to specialise for biotext
    • Many analyses per sentence
    • Missing information due to sublanguage meaning
summary sublanguage approach
Summary: sublanguage approach
  • Exploits a rich SL lexicon
  • Describes SL verbs in detail
  • Syntactic-semantic grammar
  • Current systems would benefit from adopting ontology-driven approach
ontology driven
Ontology-driven
  • Uses event concept frames to guide processing
  • Integration of extracted information
  • Current systems would also benefit from adopting the SL approach
how do we apply tm to systems biology
How do we apply TM to Systems Biology?

REFINE project

  • Adapting TM tools to evaluate the basis in the literature for the structure of biochemical and signalling models in systems biology
  • Integrating TM with visualisation for better understanding of the evidence for biochemical and signalling pathways
  • Enriching models encoded in SBML with information derived from TM

Kell, Ananiadou, Tsujii

applications
Applications
  • Semantic annotation not only based on concepts but also on facts, events extracted by IE
  • Enables semantic querying
  • Facilitates curation
  • Hypothesis generation for scientific discovery
applications1
Applications
  • Other text mining applications
    • Summarisation
    • Question answering
  • Integration of IR with TM
    • Terms / concepts as index terms
    • Topic detection
    • Document clustering and classification