Natural Language Processing in Bioinformatics: Uncovering Semantic Relations

Barbara Rosario

SIMS

UC Berkeley

Outline of Talk
  • Goal: Extract semantics from text
  • Information and relation extraction
  • Protein-protein interactions
Text Mining
  • Text Mining is the discovery by computers of new, previously unknown information, via automatic extraction of information from text
Text Mining
  • Text:
    • Stress is associated with migraines
    • Stress can lead to loss of magnesium
    • Calcium channel blockers prevent some migraines
    • Magnesium is a natural calcium channel blocker

1: Extract semantic entities from text

Entities extracted:
  • Stress
  • Migraine
  • Magnesium
  • Calcium channel blockers

Text Mining (cont.)

2: Classify relations between entities
  • Stress → Migraine: Associated with
  • Stress → Magnesium: Lead to loss
  • Calcium channel blockers → Migraine: Prevent
  • Magnesium → Calcium channel blockers: Subtype-of (is a)

Text Mining (cont.)

3: Do reasoning: find new correlations

Text Mining (cont.)

4: Do reasoning: infer causality
  • Added to the diagram: “No prevention”
  • Deficiency of magnesium → migraine
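
As a rough illustration of the reasoning steps, the classified relations can be stored as (subject, relation, object) triples and searched for indirect links. This is a minimal sketch: the triple representation and the `indirect_links` helper are assumptions made for illustration and are not taken from the talk.

```python
# Relations from the example text, stored as (subject, relation, object) triples.
TRIPLES = [
    ("stress", "associated with", "migraine"),
    ("stress", "lead to loss", "magnesium"),
    ("calcium channel blockers", "prevent", "migraine"),
    ("magnesium", "subtype-of", "calcium channel blockers"),
]


def indirect_links(triples, start, goal):
    """Return all two-step paths start -> middle -> goal (hypothetical helper)."""
    paths = []
    for s1, r1, o1 in triples:
        if s1 != start:
            continue
        for s2, r2, o2 in triples:
            # Chain the second triple onto the object of the first one.
            if s2 == o1 and o2 == goal:
                paths.append([(s1, r1, o1), (s2, r2, o2)])
    return paths


# Magnesium is a calcium channel blocker, and those prevent some migraines.
paths = indirect_links(TRIPLES, "magnesium", "migraine")
for path in paths:
    print(" ; ".join(f"{s} [{r}] {o}" for s, r, o in path))
```

Chaining only two triples keeps the sketch short; turning such a path into a causal claim (for example, that a deficiency of magnesium contributes to migraine) still needs the relation types to be interpreted.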

My research

Information Extraction
  • Entities: Stress, Migraine, Magnesium, Calcium channel blockers
My research

Relation extraction
  • Relations: Associated with, Lead to loss, Prevent, Subtype-of (is a)

Information and relation extraction
  • Problems:
    • Given biomedical text:
      • Find all the treatments and all the diseases
      • Find the relations that hold between them
  • Diagram: Treatment and Disease nodes, linked by candidate relations “Cure?”, “Prevent?”, “Side Effect?”
Hepatitis Examples
  • Cure
    • These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135.
  • Prevent
    • A two-dose combined hepatitis A and B vaccine would facilitate immunization programs
  • Vague
    • Effect of interferon on hepatitis B
Two tasks
  • Relationship extraction:
    • Identify the several semantic relations that can occur between the entities disease and treatment in bioscience text
  • Information extraction (IE):
    • Related problem: identify such entities
Outline of IE
  • Data and semantic relations
  • Quick intro to graphical models
  • Models and results
  • Features
  • Conclusions
Data and Relations
  • MEDLINE, abstracts and titles
  • 3662 sentences labeled
    • Relevant: 1724
    • Irrelevant: 1771
      • e.g., “Patients were followed up for 6 months”
  • 2 types of Entities
    • treatment and disease
  • 7 Relationships between these entities

The labeled data are available at http://biotext.berkeley.edu

Semantic Relationships
  • 810: Cure
    • Intravenous immune globulin for recurrent spontaneous abortion
  • 616: Only Disease
    • Social ties and susceptibility to the common cold
  • 166: Only Treatment
    • Fluticasone propionate is safe in recommended doses
  • 63: Prevent
    • Statins for prevention of stroke
Semantic Relationships
  • 36: Vague
    • Phenylbutazone and leukemia
  • 29: Side Effect
    • Malignant mesodermal mixed tumor of the uterus following irradiation
  • 4: Does NOT cure
    • Evidence for double resistance to permethrin and malathion in head lice
Outline of IE
  • Data and semantic relations
  • Quick intro to graphical models
  • Models and results
  • Features
  • Conclusions
Graphical Models
  • Unifying framework for developing Machine Learning algorithms
  • Graph theory plus probability theory
  • Widely used
    • Error correcting codes
    • Systems diagnosis
    • Computer vision
    • Filtering (Kalman filters)
    • Bioinformatics
(Quick intro to) Graphical Models
  • Nodes are random variables
  • Edges are annotated with conditional probabilities
  • Absence of an edge between nodes implies conditional independence
  • “Probabilistic database”
  • Diagram: directed graph over nodes A, B, C, D

Graphical Models
  • Define a joint probability distribution:
    • P(X1, …, XN) = ∏i P(Xi | Par(Xi))
    • P(A,B,C,D) = P(A)P(D)P(B|A)P(C|A,D)
  • Learning
    • Given data, estimate P(A), P(B|A), P(D), P(C | A, D)

Graphical Models
  • Inference: compute conditional probabilities, e.g., P(A|B, D)
  • Inference = probabilistic queries. General inference algorithms (Junction Tree)
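
The factorization P(A,B,C,D) = P(A)P(D)P(B|A)P(C|A,D) and the query P(A|B, D) can be sketched in a few lines of Python. The probability tables below are made-up numbers for binary variables, and the query is answered by brute-force enumeration rather than by the junction tree algorithm, which is only practical because the example is tiny.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables.
P_A = {True: 0.3, False: 0.7}
P_D = {True: 0.6, False: 0.4}
P_B_GIVEN_A = {True: 0.9, False: 0.2}  # P(B=True | A)
P_C_GIVEN_AD = {  # P(C=True | A, D)
    (True, True): 0.95, (True, False): 0.7,
    (False, True): 0.4, (False, False): 0.05,
}


def bern(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d):
    """P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D)."""
    return (P_A[a] * P_D[d]
            * bern(P_B_GIVEN_A[a], b)
            * bern(P_C_GIVEN_AD[(a, d)], c))


def posterior_a(b, d):
    """P(A=True | B=b, D=d), summing the joint over the unobserved C."""
    num = sum(joint(True, b, c, d) for c in (True, False))
    den = sum(joint(a, b, c, d) for a, c in product((True, False), repeat=2))
    return num / den


total = sum(joint(*values) for values in product((True, False), repeat=4))
print(total, posterior_a(True, True))
```

Enumeration costs time exponential in the number of variables, which is why general-purpose algorithms such as the junction tree are used on larger models.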
Naïve Bayes models
  • Simple graphical model
  • Xi depend on Y
  • Naïve Bayes assumption: all Xi are independent given Y
  • Currently used for text classification and spam detection

Diagram: class node Y with arrows to the feature nodes x1, x2, x3
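
A small text classifier makes the Naïve Bayes assumption concrete: the score of a label is its prior times a product of independent per-word probabilities. The sketch below is illustrative only. The training sentences are toy data (only "statins for prevention of stroke" echoes an example from the talk), add-one smoothing is an assumption chosen for brevity, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (tokens, label). Collect the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label maximizing log P(Y) + sum_i log P(x_i | Y)."""
    label_counts, word_counts, vocab = model
    n_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n / n_docs)
        for tok in tokens:
            # Add-one smoothing so unseen words do not zero out the product.
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration.
TOY = [
    ("statins for prevention of stroke".split(), "prevent"),
    ("vaccine for prevention of hepatitis".split(), "prevent"),
    ("antibiotics for treatment of infection".split(), "cure"),
    ("surgery for treatment of hernia".split(), "cure"),
]
model = train_nb(TOY)
label = predict_nb(model, "aspirin for prevention of stroke".split())
print(label)
```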

Dynamic Graphical Models
  • Graphical model composed of repeated segments
  • HMMs (Hidden Markov Models)
    • POS tagging, speech recognition, IE

Diagram: tag nodes t1 … tN linked in a chain, each tag ti generating word wi

HMMs
  • Joint probability distribution
    • P(t1, …, tN, w1, …, wN) = P(t1) P(w1|t1) ∏i=2..N P(ti|ti-1) P(wi|ti)
  • Estimate P(t1), P(ti|ti-1), P(wi|ti) from labeled data
  • Inference: P(ti | w1, w2, …, wN)
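
The inference query P(ti | w1, …, wN) is commonly computed with the forward-backward algorithm. The following is a minimal sketch of that algorithm for a discrete HMM. The two tags, the two-word vocabulary, and all probabilities are made-up values, and no rescaling is applied, so it is only suitable for short sequences.

```python
def forward_backward(words, tags, p_init, p_trans, p_emit):
    """Return a list whose i-th entry maps each tag t to P(t_i = t | w_1..w_N)."""
    n = len(words)
    # Forward pass: alpha[i][t] = P(w_1..w_i, t_i = t).
    alpha = [{t: p_init[t] * p_emit[t][words[0]] for t in tags}]
    for i in range(1, n):
        alpha.append({
            t: p_emit[t][words[i]]
            * sum(alpha[i - 1][s] * p_trans[s][t] for s in tags)
            for t in tags
        })
    # Backward pass: beta[i][t] = P(w_{i+1}..w_N | t_i = t).
    beta = [dict.fromkeys(tags, 1.0) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = {
            t: sum(p_trans[t][s] * p_emit[s][words[i + 1]] * beta[i + 1][s]
                   for s in tags)
            for t in tags
        }
    evidence = sum(alpha[n - 1][t] for t in tags)  # P(w_1..w_N)
    return [{t: alpha[i][t] * beta[i][t] / evidence for t in tags}
            for i in range(n)]


# Hypothetical parameters, for illustration only.
TAGS = ["treatment", "disease"]
P_INIT = {"treatment": 0.6, "disease": 0.4}
P_TRANS = {
    "treatment": {"treatment": 0.3, "disease": 0.7},
    "disease": {"treatment": 0.4, "disease": 0.6},
}
P_EMIT = {
    "treatment": {"statins": 0.8, "stroke": 0.2},
    "disease": {"statins": 0.1, "stroke": 0.9},
}
posteriors = forward_backward(["statins", "stroke"], TAGS, P_INIT, P_TRANS, P_EMIT)
print(posteriors)
```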

Graphical Models for IE
  • Different dependencies between the features and the relation nodes
  • Models shown: static (S1, S2) and dynamic (D1, D2, D3)

Graphical Model
  • Relation node:
    • Semantic relation (cure, prevent, none, ...) expressed in the sentence
    • The relation generates the state sequence and the observations
Graphical Model
  • Markov sequence of states (roles)
  • Role nodes:
    • Role_t ∈ {treatment, disease, none}
  • Diagram: chain Role_t-1 → Role_t → Role_t+1
Graphical Model
  • Roles generate multiple observations
  • Feature nodes (observed):
    • word, POS, MeSH…
Graphical Model
  • Inference: Find Relation and Roles given the features observed

Features
  • Word
  • Part of speech
  • Phrase constituent
  • Orthographic features
    • ‘is number’, ‘all letters are capitalized’, ‘first letter is capitalized’ …
  • Semantic features (MeSH)
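
The orthographic features listed above are simple boolean tests on the token string. A possible sketch follows; the feature names and the exact definitions (for instance, what counts as a number) are assumptions, not the definitions used in the talk.

```python
def orthographic_features(token):
    """Map a token to boolean orthographic features (names are hypothetical)."""
    return {
        # Digits with at most one decimal point, e.g. "135" or "0.5".
        "is_number": token.replace(".", "", 1).isdigit(),
        # Only letters, and all of them upper case, e.g. "HIV".
        "all_caps": token.isalpha() and token.isupper(),
        # First character is an upper-case letter, e.g. "Statins".
        "init_cap": token[:1].isupper(),
    }


feats = {tok: orthographic_features(tok)
         for tok in ["HIV", "Statins", "135", "stroke"]}
for tok, values in feats.items():
    print(tok, values)
```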
MeSH
  • MeSH Tree Structures

1. Anatomy [A]

2. Organisms [B]

3. Diseases [C]

4. Chemicals and Drugs [D]

5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]

6. Psychiatry and Psychology [F]

7. Biological Sciences [G]

8. Physical Sciences [H]

9. Anthropology, Education, Sociology and Social Phenomena [I]

10. Technology and Food and Beverages [J]

11. Humanities [K]

12. Information Science [L]

13. Persons [M]

14. Health Care [N]

15. Geographic Locations [Z]

MeSH (cont.)

1. Anatomy [A]
  • Body Regions [A01] +
  • Musculoskeletal System [A02]
  • Digestive System [A03] +
  • Respiratory System [A04] +
  • Urogenital System [A05] +
  • Endocrine System [A06] +
  • Cardiovascular System [A07] +
  • Nervous System [A08] +
  • Sense Organs [A09] +
  • Tissues [A10] +
  • Cells [A11] +
  • Fluids and Secretions [A12] +
  • Animal Structures [A13] +
  • Stomatognathic System [A14]
  • (…..)

Body Regions [A01]
  • Abdomen [A01.047]
    • Groin [A01.047.365]
    • Inguinal Canal [A01.047.412]
    • Peritoneum [A01.047.596] +
    • Umbilicus [A01.047.849]
  • Axilla [A01.133]
  • Back [A01.176] +
  • Breast [A01.236] +
  • Buttocks [A01.258]
  • Extremities [A01.378] +
  • Head [A01.456] +
  • Neck [A01.598]
  • (….)
Use of Lexical Hierarchies in NLP
  • Big problem in NLP: few words occur a lot, most of them occur very rarely (Zipf’s law)
  • Difficult to do statistics
  • One solution: use lexical hierarchies
  • Another example: WordNet
  • Statistics on classes of words instead of words
Mapping Words to MeSH Concepts
  • headache pain
    • Full MeSH codes: C23.888.592.612.441 G11.561.796.444
    • Truncated to two levels: C23.888 G11.561
      • [Neurologic Manifestations] [Nervous System Physiology]
    • Truncated to one level: C23 G11
      • [Pathological Conditions, Signs and Symptoms] [Musculoskeletal, Neural, and Ocular Physiology]
  • headache recurrence
    • C23.888.592.612.441 C23.550.291.937
  • breast cancer cells
    • A01.236 C04 A11
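
Generalizing a word to a higher level of the hierarchy amounts to truncating its MeSH tree number. A minimal sketch, assuming a small hand-written dictionary in place of a real MeSH lookup (the tree numbers are copied from the slide; the `MESH` table and both function names are hypothetical):

```python
# Tree numbers copied from the slide; the dictionary is a toy stand-in
# for a real MeSH lookup.
MESH = {
    "headache": "C23.888.592.612.441",
    "pain": "G11.561.796.444",
    "recurrence": "C23.550.291.937",
}


def truncate(tree_number, levels):
    """Keep only the first `levels` dot-separated components of a tree number."""
    return ".".join(tree_number.split(".")[:levels])


def to_mesh(words, levels):
    """Replace known words by truncated MeSH codes; leave other words unchanged."""
    return [truncate(MESH[w], levels) if w in MESH else w for w in words]


two_levels = to_mesh(["headache", "pain"], 2)
one_level = to_mesh(["headache", "recurrence"], 1)
print(two_levels, one_level)
```

At one level both "headache" and "recurrence" fall into class C23, so counts collected for one contribute to statistics for the other, which is the point of using classes of words instead of words.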
Graphical Model
  • Joint probability distribution over relation, role and feature nodes
  • Parameters estimated with maximum likelihood and absolute discounting smoothing
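
Absolute discounting subtracts a fixed amount from every non-zero count and gives the freed probability mass to unseen events. The talk does not say which variant was used, so the sketch below assumes one common form: a constant discount `d` and a uniform share of the freed mass for each unseen event. The counts are made up.

```python
from collections import Counter


def absolute_discounting(counts, vocab, d=0.5):
    """Smoothed distribution over `vocab` from raw counts (assumes every
    counted event is in `vocab` and at least one event was observed)."""
    total = sum(counts.values())
    seen = [x for x in vocab if counts.get(x, 0) > 0]
    unseen = [x for x in vocab if counts.get(x, 0) == 0]
    if not unseen:
        # Nothing to redistribute to: fall back to maximum likelihood.
        return {x: counts[x] / total for x in seen}
    freed = d * len(seen) / total  # mass removed from the seen events
    probs = {x: (counts[x] - d) / total for x in seen}
    for x in unseen:
        probs[x] = freed / len(unseen)
    return probs


# Hypothetical role counts: "none" was never observed in this context.
counts = Counter({"treatment": 3, "disease": 1})
probs = absolute_discounting(counts, ["treatment", "disease", "none"])
print(probs)
```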
Graphical Model
  • Inference: Find Relation and Roles given the features observed

Relation extraction
  • Results in terms of classification accuracy (with and without irrelevant sentences)
  • 2 cases:
    • Roles given
    • Roles hidden (only features)
Relation classification: Results
  • Good results for a difficult task
    • One of the few systems to tackle several DIFFERENT relations between the same types of entities; thus differs from the problem statement of other work on relations
Role Extraction: Results

Junction tree algorithm

F-measure = (2*Prec*Recall)/(Prec + Recall)

(Related work extracting “diseases” and “genes” reports F-measure of 0.50)
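
The F-measure formula above can be computed directly from sets of predicted and gold role assignments. This is a minimal sketch with made-up data; representing each assignment as a (token index, role) pair is an assumption made for illustration.

```python
def f_measure(precision, recall):
    """F = (2 * Prec * Recall) / (Prec + Recall); 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(predicted, gold):
    """predicted, gold: sets of (token_index, role) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold and predicted role assignments for one sentence.
gold = {(0, "treatment"), (3, "disease"), (4, "disease")}
predicted = {(0, "treatment"), (3, "disease"), (5, "disease")}
p, r = precision_recall(predicted, gold)
f = f_measure(p, r)
print(p, r, f)
```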

Features impact: Role extraction
  • Most important features: 1) Word 2) MeSH

                  Rel. + irrel.     Only rel.
  All features    0.71              0.73
  No word         0.61 (-14.1%)     0.66 (-9.6%)
  No MeSH         0.65 (-8.4%)      0.69 (-5.5%)
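
The percentages given with each ablation appear to be relative changes with respect to the all-features score. The sketch below shows that arithmetic under this reading; the function name is illustrative. Because the scores on the slide are already rounded, recomputed figures can differ from the slide in the last digit.

```python
def relative_change(ablated, baseline):
    """Percentage change of `ablated` relative to `baseline`, to one decimal."""
    return round(100.0 * (ablated - baseline) / baseline, 1)


no_word_all = relative_change(0.61, 0.71)  # "No word", rel. + irrel.
no_word_rel = relative_change(0.66, 0.73)  # "No word", only rel.
no_mesh_rel = relative_change(0.69, 0.73)  # "No MeSH", only rel.
print(no_word_all, no_word_rel, no_mesh_rel)
```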

Features impact: Relation classification
  • Most important features: Roles

                              Accuracy (rel. + irrel.)
  All feat. + roles           82.0
  All feat. – roles           74.9 (-8.7%)
  All feat. + roles – Word    79.8 (-2.8%)
  All feat. + roles – MeSH    84.6 (+3.1%)

Features impact: Relation classification
  • Most realistic case: Roles not known
  • Most important features: 1) Word 2) MeSH

                              Accuracy (rel. + irrel.)
  All feat. – roles           74.9
  All feat. – roles – Word    66.1 (-11.8%)
  All feat. – roles – MeSH    72.5 (-3.2%)

Conclusions
  • Classification of subtle semantic relations in bioscience text
  • Graphical models for the simultaneous extraction of entities and relationships
  • Importance of MeSH, lexical hierarchy
Outline of Talk
  • Goal: Extract semantics from text
  • Information and relation extraction
  • Protein-protein interactions; using an existing database to gather labeled data
Protein-Protein interactions
  • One of the most important challenges in modern genomics, with many applications throughout biology
  • There are several protein-protein interaction databases (BIND, MINT, ...), all manually curated
Protein-Protein interactions
  • Supervised systems require manually labeled data, while purely unsupervised systems have yet to be proven effective for these tasks.
  • Some other approaches: semi-supervised, active learning, co-training.
  • We propose the use of resources developed in the biomedical domain to address the problem of gathering labeled data for the task of classifying interactions between proteins
HIV-1, Protein Interaction Database
  • Documents interactions between HIV-1 proteins and
    • host cell proteins
    • other HIV-1 proteins
    • disease associated with HIV/AIDS
  • 2224 pairs of interacting proteins, 65 types

http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions

Protein-Protein interactions
  • Idea: use this to “label data”
    • Extract from the paper all the sentences with Protein 1 and Protein 2
    • Label them with the interaction given in the database (in the diagram: “activates”)

Protein-Protein interactions
  • Use citations
  • Find all the papers that cite the papers in the database
  • Diagram: papers ID 9918876 and ID 9971769

Protein-Protein interactions
  • From the papers, extract the citation sentences; from these extract the sentences with Protein 1 and Protein 2
  • Label them
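
The labeling idea can be sketched as a filter over sentences: keep those that mention both proteins of a database entry and attach the interaction type from that entry. In the sketch below the database record and the helper names are hypothetical, whole-word matching is an assumption, and the two sentences are taken from examples elsewhere in the talk.

```python
import re


def mentions(sentence, name):
    """True if `name` occurs in `sentence` as a whole word (case-insensitive)."""
    return re.search(r"\b" + re.escape(name) + r"\b", sentence, re.IGNORECASE) is not None


def label_sentences(sentences, protein_1, protein_2, interaction):
    """Keep sentences mentioning both proteins and tag them with `interaction`."""
    return [(s, interaction) for s in sentences
            if mentions(s, protein_1) and mentions(s, protein_2)]


# Hypothetical database record, for illustration only.
record = {"protein_1": "Tat", "protein_2": "CDK7", "interaction": "activates"}
sentences = [
    "Tat might regulate the phosphorylation of the RNA polymerase II "
    "carboxyl-terminal domain in pre-initiation complexes by activating CDK7",
    "Selective CXCR4 antagonism by Tat",
]
labeled = label_sentences(sentences, record["protein_1"],
                          record["protein_2"], record["interaction"])
print(labeled)
```

Labels obtained this way are noisy, since a sentence can mention both proteins without expressing the interaction recorded in the database.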
Examples of sentences
  • Papers:
    • The interpretation of these results was slightly complicated by the fact that AIP-1/ALIX depletion by using siRNA likely had deleterious effects on cell viability , because a Western blot analysis showed slightly reduced Gag expression at later time points (fig. 5C ).
  • Citations:
    • They also demonstrate that the GAG protein from membrane - containing viruses , such as HIV , binds to Alix / AIP1 , thereby recruiting the ESCRT machinery to allow budding of the virus from the cell surface (TARGET_CITATION; CITATION ) .
Protein-Protein interactions
  • Tasks:
  • Given sentences from Paper ID, and/or citation sentences to ID
    • Predict the interaction type given in the HIV database for Paper ID
    • Extract the proteins involved
  • 10-way classification problem
Protein-Protein interactions
  • Models
    • Dynamic graphical model
    • Naïve Bayes
Evaluation
  • Evaluation at document level
  • All (sentences from papers + citations)
  • Papers (only sentences from papers)
  • Citations (only citation sentences)
  • “Trigger word” approach
    • List of keywords (e.g., for inhibits: “inhibitor”, “inhibition”, “inhibit”, etc.)
    • If a keyword is present: assign the corresponding interaction
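
The “trigger word” baseline can be sketched as a keyword lookup. Only the keywords for “inhibits” come from the slide; the entry for “binds”, the substring matching, and the rule of returning the first matching interaction are assumptions made for illustration.

```python
TRIGGERS = {
    "inhibits": ["inhibitor", "inhibition", "inhibit"],  # from the slide
    "binds": ["binds", "binding"],  # hypothetical entry
}


def trigger_label(sentence, triggers=TRIGGERS):
    """Return the first interaction with a keyword in the sentence, else None."""
    lowered = sentence.lower()
    for interaction, keywords in triggers.items():
        # Crude substring test; a real system would likely match whole tokens.
        if any(keyword in lowered for keyword in keywords):
            return interaction
    return None


label_binds = trigger_label("the GAG protein binds to Alix / AIP1")
label_none = trigger_label("Patients were followed up for 6 months")
print(label_binds, label_none)
```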
Results
  • Accuracies on interaction classification

(Roles hidden)

Results: confusion matrix

For All. Overall accuracy: 60.5%

Hiding the protein names
  • Replaced protein names with tokens PROT_NAME
    • Selective CXCR4 antagonism by Tat
    • Selective PROT_NAME antagonism by PROT_NAME
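
Hiding the protein names is a token substitution. A minimal sketch, assuming the list of protein names is known in advance and that names are matched as whole words (the function name is hypothetical):

```python
import re


def hide_protein_names(sentence, protein_names, token="PROT_NAME"):
    """Replace each whole-word occurrence of a known protein name with `token`."""
    if not protein_names:
        return sentence
    # Longest names first, so a name that contains another is matched whole.
    names = sorted(protein_names, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(token, sentence)


masked = hide_protein_names("Selective CXCR4 antagonism by Tat", ["CXCR4", "Tat"])
print(masked)
```

With the names hidden, a classifier has to rely on the surrounding context rather than on memorized protein pairs.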
Protein extraction
  • (Protein name tagging, role extraction)
  • The identification of all the proteins present in the sentence that are involved in the interaction
    • These results suggest that Tat - induced phosphorylation of serine 5 by CDK9 might be important after transcription has reached the +36 position, at which time CDK7 has been released from the complex.
    • Tat might regulate the phosphorylation of the RNA polymerase II carboxyl - terminal domain in pre - initiation complexes by activating CDK7
Protein extraction: results

No dictionary used

Conclusions of protein-protein interaction project
  • Encouraging results for the automatic classification of protein-protein interactions
  • Use of an existing database for gathering labeled data
  • Use of citations
Conclusion
  • Machine Learning methods for NLP tasks
  • Three lines of research in this area, state-of-the-art results
    • Information and relation extraction for “treatments” and “diseases”
    • Protein-protein interactions
    • (Noun compounds)

Thank you!

Barbara Rosario

SIMS, UC Berkeley

rosario@sims.berkeley.edu