The BioText Project Myers Seminar Sept 22, 2003 Marti Hearst Associate Professor SIMS, UC Berkeley Project sponsored by NSF DBI-0317510, ARDA AQUAINT, and a gift from Genentech
BioText Project Goals • Provide fast, flexible, intelligent access to information for use in biosciences applications. • Focus on • Textual Information • Tightly integrated with other resources • Ontologies • Record-based databases
People • Project Leaders: • PI: Marti Hearst Co-PI: Adam Arkin • Computational Linguistics • Barbara Rosario • Preslav Nakov • Database Research • Ariel Schwartz • Gaurav Bhalotia (graduated) • User Interface / Information Retrieval • Kevin Li • Emilia Stoica • Bioscience • Dr. TingTing Zhang
Outline • Main Goals • System Architecture • Apoptosis problem statement • Recent results in • Abbreviation definition recognition • Semantic relation recognition (from text) • Search User Interfaces • Hierarchical grouping of journals
BioText: Main Goals • Sophisticated Text Analysis • Annotations in Database • Improved Search Interface
Recent Result (Schwartz & Hearst 03) • Fast, simple algorithm for recognizing abbreviation definitions. • Simpler and faster than the rest • Higher precision and recall • Idea: Work backwards from the end • Examples: • In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF). • Gcn5-related N-acetyltransferase (GNAT) • Idea: use redundancy across abstracts to figure out abbreviation meaning even when definition is not present.
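The "work backwards from the end" idea can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the function names, the regular expression for parenthesized short forms, and the word-window heuristic used to pick the candidate text are all assumptions made for this example. Each character of the short form is matched right to left against the text preceding the parentheses, and the first character of the short form must begin a word.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short_form against candidate right to left; return the long form or None."""
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():  # skip punctuation inside the short form
            s -= 1
            continue
        # Move left until the character matches. For the first character of
        # the short form, also require that the match starts a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None  # ran out of text before matching every character
        l -= 1
        s -= 1
    # Start the long form at the beginning of the word holding the last match.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def extract_definitions(sentence):
    """Return (short form, long form) pairs for patterns like 'long form (SF)'."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = m.group(1).strip()
        if not short_form:
            continue
        words = sentence[: m.start()].split()
        # Assumed heuristic: only look at a limited window of preceding words.
        window = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-window:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs
```

Applied to the two examples on this slide, `extract_definitions` pairs HSF with "Heat Shock Transcription Factor" and GNAT with "Gcn5-related N-acetyltransferase". The sketch does not cover the second idea on the slide (using redundancy across abstracts when no definition is present).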
BioText: A Two-Sided Approach • Empirical Computational Linguistics Algorithms • Sophisticated Database Design & Algorithms • Resources shown in the diagram: BLAST, Medline, MeSH, SwissProt, WordNet, GO, Journal Full Text
Apoptosis Network (diagram; slide courtesy TingTing Zhang) • Node labels: Death Receptors Signaling, Ca++ Signaling, Survival Factors Signaling, Genotoxic Stress, Loss of Attachment, Cell Cycle stress, etc., ER Stress, Initiator Caspases (8, 10), Effector Caspases (3, 6, 7), p53 pathway, BH3-only, Bcl-2-like, NFkB, Bax, Bak, Mitochondria, Cytochrome c, Smac, Caspase 12, IAPs, Apaf-1, AIF, Caspase 9, Apoptosis
The issues (courtesy TingTing Zhang): • The network nodes are deduced from reading and processing of experimental knowledge by experts. Every month >1000 apoptosis papers are published. • The supporting experimental data are gathered in different organs, tissues, cells using various techniques. • There are various levels of uncertainty associated with different techniques used to answer certain questions. • Depending on the expression patterns for the players in the network, the observation may or may not be extended to other contexts. • We need to keep track of ALL the information in order to understand the system better.
Simple cases: • Mouse Bim proteins (isoforms EL, L, S) bind to human Bcl-2 (bacteriophage screening using cDNA expression library from T-Lymphoma cell line KO52DA20). • Human BimEL protein is 89% identical to mouse BimEL; human BimL is 85% identical to mouse BimL (hybridization of mouse bim cDNA to human fetal spleen and peripheral blood cDNA library). • Bim mRNA is detected in B and T lymphoid cells (Northern blot analysis of mouse KO52DA20, WEHI 703, WEHI 707, WEHI7.1, CH1, WEHI231, WEHI415, B6.23.16BW2 cell extracts). • BimL protein interacts with Bcl-2 OR Bcl-XL OR Bcl-w proteins (immunoprecipitation (anti-Bcl-2 OR Bcl-XL OR Bcl-w) followed by Western blot (anti-EE tag) using extracts of human 293T cells co-transfected with EE-tagged BimL AND (bcl-2 OR bcl-XL OR bcl-w) plasmids). • BimL deleted of the BH3 domain does not bind to Bcl-2 OR Bcl-XL OR Bcl-w proteins (under the experimental conditions mentioned above).
Computational Language Goals • Recognizing and annotating entities within textual documents • Identifying semantic relations among entities • To (eventually) be used in tandem with semi-automated reasoning systems.
Main Ideas for NLP Approach • Assign Semantics using • Statistics • Hierarchical Lexical Ontologies to generalize • Redundancy in the data • Build up Layers of Representation • Syntactic and Semantic • Use these in a feedback loop
Computational Linguistics Goals • Mark up text with semantic relations
Recent Result: Descent of Hierarchy • Idea: • Use the top levels of a lexical hierarchy to identify semantic relations • Hypothesis: • A particular semantic relation holds between all 2-word Noun Compounds that can be categorized by a MeSH pair.
Definition • NC: Any sequence of nouns that itself functions as a noun • asthma hospitalizations • health care personnel hand wash • Technical text is rich with NCs: • Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment.
NCs: Three tasks • Identification • Syntactic analysis (attachments) • [Baseline [headache frequency]] • [[Tension headache] patient] • Our Goal: Semantic analysis • Headache treatment → treatment for headache • Corticosteroid treatment → treatment that uses corticosteroid
Main Idea: • Top-level MeSH categories can be used to indicate which relation holds between the nouns of a compound • headache recurrence • C23.888.592.612.441 C23.550.291.937 • headache pain • C23.888.592.612.441 G11.561.796.444 • breast cancer cells • A01.236 C04 A11
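As a rough illustration of how a category pair (CP) might be derived from MeSH tree numbers, the sketch below truncates each noun's tree number to a chosen depth and groups compounds by the resulting pair. The function names and the `depths` parameter are assumptions made for this example; the tree numbers are the ones shown on the slide. Only two-word compounds are handled, and no relation labels are assigned, since those would come from the annotated data.

```python
from collections import defaultdict


def truncate(tree_number, depth):
    """Keep the first `depth` levels of a MeSH tree number (C23.888.592 -> C23 at depth 1)."""
    return ".".join(tree_number.split(".")[:depth])


def category_pair(modifier_tn, head_tn, depths=(1, 1)):
    """Build the category pair for a two-word noun compound.

    `depths` says how far to descend for the modifier and the head noun.
    """
    return (truncate(modifier_tn, depths[0]), truncate(head_tn, depths[1]))


# Two-word compounds from the slide: (modifier tree number, head tree number).
COMPOUNDS = {
    "headache recurrence": ("C23.888.592.612.441", "C23.550.291.937"),
    "headache pain": ("C23.888.592.612.441", "G11.561.796.444"),
}


def group_by_category_pair(compounds, depths=(1, 1)):
    """Group compounds that share the same category pair at the given depths."""
    groups = defaultdict(list)
    for nc, (mod_tn, head_tn) in compounds.items():
        groups[category_pair(mod_tn, head_tn, depths)].append(nc)
    return dict(groups)
```

At depth `(1, 1)`, "headache recurrence" falls under the pair `("C23", "C23")` and "headache pain" under `("C23", "G11")`. Passing a larger depth, for example `depths=(2, 1)`, descends one more level on the modifier side when a first-level pair turns out to mix several relations.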
Linguistic Motivation Can cast NC into head-modifier relation, and assume head noun has an argument and qualia structure. • (used-in): kitchen knife • (made-of): steel knife • (instrument-for): carving knife • (used-on): putty knife • (used-by): butcher’s knife
How Far to Descend? • Anatomy: 250 CPs • 187 (75%) remain first level • 56 (22%) descend one level • 7 (3%) descend two levels • Natural Science (H01): 21 CPs • 1 (4%) remain first level • 8 (39%) descend one level • 12 (57%) descend two levels • Neoplasm (C04) 3 CPs: • 3 (100%) descend one level
Evaluation • Apply the rules to a test set • Accuracy: • Anatomy: 91% accurate • Natural Science: 79% • Diseases: 100% • Total: • 89.6% via intra-category averaging • 90.8% via extra-category averaging
Summary of NC Work • Lexical hierarchy useful for inferring semantic relations • Works because semantics are constrained and word sense ambiguity is not too much of a problem • Can it be extended to other types of relations? • Preliminary results on one set of relations are promising.
Database Research Issues • Efficiently and effectively combining • Relational databases & Text • Hierarchical Ontologies • Layers of Annotations
Interface Issues • Create intuitive, appealing interfaces that are better than what’s currently out there. • Start with existing assigned metadata • As text analysis improves, incorporate the results into the interface.
Some Recent Work • Organizing BioScience Journal Names • Currently there are > 3500 • Idea: • Group them into faceted hierarchies semi-automatically • Using clustering of title terms, synonym similarity via WordNet, and other techniques
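A much simpler stand-in for the title-term clustering step might look like the sketch below: journals are grouped under every content word their titles share, which gives candidate facets for a person to review. The stopword list, function names, and sample journal titles are illustrative assumptions, not taken from the project; WordNet-based synonym similarity is not included.

```python
from collections import defaultdict

# Assumed list of title words too generic to act as a facet.
STOPWORDS = {"journal", "of", "the", "and", "in", "annals", "archives"}


def title_terms(name):
    """Lowercased content words of a journal title."""
    return {w for w in name.lower().split() if w not in STOPWORDS}


def group_by_term(journals, min_size=2):
    """Map each title term to the journals containing it, keeping shared terms only."""
    groups = defaultdict(list)
    for name in journals:
        for term in sorted(title_terms(name)):
            groups[term].append(name)
    return {t: names for t, names in groups.items() if len(names) >= min_size}


# Illustrative journal titles (not drawn from the slides).
JOURNALS = [
    "Journal of Cell Biology",
    "Molecular Biology of the Cell",
    "Journal of Virology",
    "Cell",
]
```

With these four titles, `group_by_term(JOURNALS)` yields a "cell" group with three journals and a "biology" group with two, while "Journal of Virology" stays ungrouped. A journal can appear under several terms, which is compatible with a faceted rather than strictly tree-shaped organization.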
Summary • BioText aims to improve access to bioscience information via • Sophisticated language analysis • Integration of results into • Annotated database • Flexible user interface • Eventual goal • Semi-automated mining and discovery
There’s lots to do! For more information: biotext.berkeley.edu