Supporting Annotation Layers for Natural Language Processing

Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst. Computer Science Division and SIMS, University of California, Berkeley. http://biotext.berkeley.edu. Supported by NSF DBI-0317510 and a gift from Genentech.


Presentation Transcript


  1. Supporting Annotation Layers for Natural Language Processing. Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst. Computer Science Division and SIMS, University of California, Berkeley. http://biotext.berkeley.edu. Supported by NSF DBI-0317510 and a gift from Genentech.

  2. Project overview
     • A system for flexible querying of text that has been annotated with the results of NLP processing.
     • Supports
       • self-overlapping and parallel layers,
       • integration of syntactic and ontological hierarchies,
       • tight integration with SQL.
     • Designed to scale to very large corpora.
     • Demo of LQL (Layered Query Language) on examples taken from the NLP literature.

  3. Key Contributions
     • Multiple overlapping layers (cannot be expressed in a single XML file)
     • Self-overlapping, parallel layers, allowing multiple syntactic parses of the same text
     • Integration of multiple intersecting hierarchies (e.g. MeSH, UMLS, WordNet)
     • Specialized query language
     • Flexible results format
     • Focused on scaling annotation-based queries to very large corpora (millions of documents) with many layers of annotations
       • 1.4 million MEDLINE abstracts
       • 10 million sentences annotated
       • 320 million multi-layered annotations
       • 70 GB database size

  4. Layers of Annotations
     • Each annotation represents an interval spanning a sequence of characters
       • absolute start and end positions
     • Each layer corresponds to a conceptually different kind of annotation
     • Layers can be
       • Sequential
       • Overlapping (e.g., two multiple-word concepts sharing a word)
       • Hierarchical
         • spanning, when the intervals are nested as in a parse tree, or
         • ontological, when the token itself is derived from a hierarchical ontology
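
The interval model on this slide can be illustrated with a minimal Python sketch. The class name `Annotation`, its field names, the layer label `concept`, and the half-open `[start, end)` convention are assumptions made for illustration; the slides only say that each annotation has absolute start and end positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotation: a character interval on a named layer (hypothetical model)."""
    layer: str
    start: int   # absolute character offset, inclusive (assumed)
    end: int     # absolute character offset, exclusive (assumed)
    tag: str = ""


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True when the two intervals share at least one character."""
    return a.start < b.end and b.start < a.end


def contains(outer: Annotation, inner: Annotation) -> bool:
    """True when inner is nested inside outer, as in a parse tree."""
    return outer.start <= inner.start and inner.end <= outer.end


text = "breast cancer cells"
np_span = Annotation("shallow_parse", 0, 19, "NP")   # the whole noun phrase
c1 = Annotation("concept", 0, 13)                    # "breast cancer"
c2 = Annotation("concept", 7, 19)                    # "cancer cells"

print(text[c1.start:c1.end], "|", text[c2.start:c2.end])
print("overlapping:", overlaps(c1, c2), "nested:", contains(c1, c2))
```

Here the two concept annotations share the word "cancer", so they overlap without either being nested in the other, while both are nested inside the NP span.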

  5. Annotation Layers Example

  6. System Architecture (Main table)

  7. System Architecture (Indexes)
     • (Forward) +doc_id+section+layer_id+sentence+first_word_pos+last_word_pos+tag_type
     • (Inverted) +layer_id+tag_type+doc_id+section+sentence+first_word_pos+last_word_pos
     • (Inverted) +word_id+layer_id+tag_type+doc_id+section+sentence+first_word_pos
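
As a rough illustration of the three composite indexes, the sketch below builds them in an in-memory SQLite database using Python's standard `sqlite3` module. The table name `annotation`, the index names, the column types, and the sample rows are all assumptions; only the column names and their order within each index come from the slide, and the original system's database engine is not stated here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical main table: column names come from the index definitions,
-- the types are guesses.
CREATE TABLE annotation (
    doc_id INTEGER, section TEXT, layer_id INTEGER, sentence INTEGER,
    first_word_pos INTEGER, last_word_pos INTEGER,
    tag_type TEXT, word_id INTEGER
);
-- Forward index: start from a document and walk its annotations.
CREATE INDEX idx_forward ON annotation
    (doc_id, section, layer_id, sentence, first_word_pos, last_word_pos, tag_type);
-- Inverted index: start from a layer and tag type and find documents.
CREATE INDEX idx_inv_layer ON annotation
    (layer_id, tag_type, doc_id, section, sentence, first_word_pos, last_word_pos);
-- Inverted index: start from a word and find where it occurs.
CREATE INDEX idx_inv_word ON annotation
    (word_id, layer_id, tag_type, doc_id, section, sentence, first_word_pos);
""")

# Invented sample rows; layer 3 stands for shallow parse, layer 2 for POS.
rows = [
    (1, "abstract", 3, 0, 0, 1, "NP", None),
    (1, "abstract", 3, 0, 3, 4, "NP", None),
    (1, "abstract", 2, 0, 2, 2, "verb", 42),
    (2, "abstract", 3, 0, 0, 2, "NP", None),
]
conn.executemany("INSERT INTO annotation VALUES (?,?,?,?,?,?,?,?)", rows)

# A lookup whose leading columns match the first inverted index.
np_hits = conn.execute(
    "SELECT doc_id, sentence, first_word_pos, last_word_pos FROM annotation "
    "WHERE layer_id = ? AND tag_type = ? "
    "ORDER BY doc_id, sentence, first_word_pos",
    (3, "NP"),
).fetchall()
print(np_hits)
```

The leading columns of each index reflect the access path it serves: by document for the forward index, and by layer or by word for the two inverted ones.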

  8. Example query I
     • Protein-Protein Interactions
     • Goal: Find all sentences that consist of a noun phrase containing a gene followed by a morphological variant of the verb “activate”, “inhibit”, or “bind”, followed by another NP containing a gene.

  9. Example query I - LQL

     SELECT p1_text, verb_content, p2_text, COUNT(*) AS cnt
     FROM (
       BEGIN_LQL
         [layer='sentence' { ALLOW GAPS }
           [layer='shallow_parse' && tag_name='NP'
             [layer='gene'] $
           ] AS p1
           [layer='pos' && tag_name="verb" &&
             (content ~ "activate%" || content ~ "inhibit%" || content ~ "bind%")
           ] AS verb
           [layer='shallow_parse' && tag_name='NP'
             [layer='gene'] $
           ] AS p2
         ]
         SELECT p1.text AS p1_text, verb.content AS verb_content, p2.text AS p2_text
       END_LQL
     ) lql
     GROUP BY p1_text, verb_content, p2_text
     ORDER BY count(*) DESC
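
To make the matching logic of this query concrete, here is a small Python sketch that emulates it over in-memory annotations. It is not the LQL engine. The `Ann` type, the word-position representation, and the toy sentence with placeholder names `geneA` and `geneB` are invented for illustration. Two readings are assumptions: the trailing `$` is treated as requiring the gene to end the NP (by analogy with regular expressions), and `content ~ "activate%"` is treated as a prefix match.

```python
from collections import Counter
from typing import NamedTuple


class Ann(NamedTuple):
    layer: str
    first: int   # first word position, inclusive
    last: int    # last word position, inclusive
    tag: str = ""


VERB_PREFIXES = ("activate", "inhibit", "bind")


def gene_final_nps(anns):
    """NPs that contain a gene annotation ending at the NP's last word (assumed `$`)."""
    genes = [a for a in anns if a.layer == "gene"]
    return [np for np in anns
            if np.layer == "shallow_parse" and np.tag == "NP"
            and any(np.first <= g.first and g.last == np.last for g in genes)]


def match_sentence(words, anns):
    """Return (p1_text, verb_content, p2_text) triples: NP, then verb, then NP, gaps allowed."""
    nps = gene_final_nps(anns)
    verbs = [a for a in anns if a.layer == "pos" and a.tag == "verb"
             and words[a.first].lower().startswith(VERB_PREFIXES)]
    triples = []
    for p1 in nps:
        for v in verbs:
            for p2 in nps:
                if p1.last < v.first and v.last < p2.first:
                    triples.append((" ".join(words[p1.first:p1.last + 1]),
                                    words[v.first],
                                    " ".join(words[p2.first:p2.last + 1])))
    return triples


# Invented toy sentence; geneA and geneB are placeholders, not real gene names.
words = ["the", "protein", "geneA", "directly", "activates", "geneB"]
anns = [
    Ann("shallow_parse", 0, 2, "NP"), Ann("gene", 2, 2),
    Ann("pos", 4, 4, "verb"),
    Ann("shallow_parse", 5, 5, "NP"), Ann("gene", 5, 5),
]

# Counter plays the role of GROUP BY ... ORDER BY count(*) DESC.
counts = Counter(match_sentence(words, anns))
for triple, cnt in counts.most_common():
    print(triple, cnt)
```

Note that a prefix match on "bind" covers forms such as "binds" and "binding" but not the irregular form "bound".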

  10. Example query I – Sample output

  11. Example query II
      • Chemical–Disease Interactions
      • “Adherence to statin prevents one coronary heart disease event for every 429 patients.”
      • Goal: extract the relation that statin (potentially) prevents coronary heart disease.
      • MeSH C subtree contains diseases
      • MeSH supplementary concepts represent chemicals

  12. Example query II - LQL

      [layer='sentence' { NO ORDER, ALLOW GAPS }
        [layer='shallow_parse' && tag_name='NP'
          [layer='chemicals'] AS chemical $
        ]
        [layer='shallow_parse' && tag_name='NP'
          [layer='mesh' && tree_number ~ 'C%'] AS disease $
        ]
      ] AS sent
      SELECT sent.pmid, chemical.text, disease.text, sent.text
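
The hierarchy test `tree_number ~ 'C%'` and the `NO ORDER` option can be sketched in Python as follows. This is an illustration under assumptions, not the system's implementation: the helper names are hypothetical, the pattern is read as a prefix match, the tree numbers are placeholders that were not looked up in MeSH, and the NP and `$` constraints are omitted for brevity.

```python
def like_prefix(value: str, pattern: str) -> bool:
    """Emulate `value ~ 'C%'` for patterns whose only wildcard is a trailing %."""
    if not pattern.endswith("%") or "%" in pattern[:-1]:
        raise ValueError("only trailing-% patterns are handled in this sketch")
    return value.startswith(pattern[:-1])


def chemical_disease_pairs(chemicals, mesh_terms):
    """Pair every chemical with every MeSH term in the C (disease) subtree.

    Word positions are ignored on purpose: NO ORDER means the chemical
    may appear before or after the disease.
    """
    diseases = [text for text, tree, _pos in mesh_terms if like_prefix(tree, "C%")]
    return [(chem, dis) for chem, _pos in chemicals for dis in diseases]


# Annotations for the slide's sentence:
# "Adherence to statin prevents one coronary heart disease event for every 429 patients."
chemicals = [("statin", 2)]                       # (text, first word position)
mesh_terms = [                                    # (text, tree number, first word position)
    ("coronary heart disease", "C14.280.647", 5), # placeholder tree number
    ("patients", "M01.643", 12),                  # placeholder, outside the C subtree
]

pairs = chemical_disease_pairs(chemicals, mesh_terms)
print(pairs)
```

Because every descendant's tree number starts with its ancestor's, a single prefix test selects a whole subtree of the ontology.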
