
Multi-layered XML-based Annotation for Integrated NL Processing

Presentation Transcript


  1. Multi-layered XML-based Annotation for Integrated NL Processing Anette Frank Language Technology Lab DFKI GmbH Saarbrücken, Germany Japanese-German Workshop on NLP, Sapporo, Japan July 4-5, 2003

  2. Background Whiteboard – Multilevel Annotation for Dynamic Free Text Processing H. Uszkoreit, B. Crysmann, A. Frank, B. Kiefer, G. Neumann, J. Piskorski, U. Schäfer, F. Xu (M. Becker and H.-U. Krieger) Major project goals • Integration of shallow and deep linguistic processing • Processing of unrestricted free text • Variable-depth text analysis • XML-based system architecture • Uniform way of representing and combining results of various NLP components • Flexible software infrastructure for NLP-based applications • Applications • Grammar & controlled language checking • Intelligent information extraction

  3. Motivation — Annotation-based Integration of Shallow and Deep NLP — • Deep NLP (DNLP): fine-grained analysis, high precision if correctly disambiguated; but high ambiguity rates, insufficient robustness (coverage, ill-formed input), insufficient efficiency • Shallow NLP (SNLP): partial analysis, insufficient precision; but tamed ambiguity, high robustness (coverage, ill-formed input), high efficiency • Goal of integrated ‘hybrid’ processing • Robustness and efficiency of shallow analysis • Precision and fine-grainedness of deep syntactic analysis

  4. Integration of Shallow and Deep Analysis in WHAT: an XML-based Annotation Architecture • Whiteboard Annotation Machine & Transformer (Schäfer 2003) • Managing shallow and deep analyses in a multi-layer XML architecture • XSLT queries to XML standoff annotations for flexible, efficient integration • Lexical integration (Crysmann et al. 2002) • SPPC-HPSG interface: building HPSG lexicon entries “on the fly” • Named entities, open-class categories (nouns, adjectives, adverbs, ...) • HPSG-GermaNet integration • association with HPSG lexical sorts • improves coverage and robustness • Phrasal integration for ‘hybrid’ syntactic processing (Frank et al. 2003) • Integration of shallow topological field parsing and deep HPSG parsing • improves efficiency and robustness

  5. Integration of Shallow and Deep NLP — XML/XSLT-based system architecture — • Multi-layer XML standoff annotation for integration of NLP components • Standoff annotation allows for combination of overlapping hierarchies • Access to results of alternative NLP components, for flexible use in applications • XSLT-based system architecture WHAT: Whiteboard Annotation Transformer (Schäfer 2003) [Architecture diagram: shallow and deep NLP components write to the WHAM multilayer chart / XML standoff annotation, which NLP-based applications access through a programming interface]

  6. Integration of Shallow and Deep NLP — XML/XSLT-based system architecture — • Multi-layer XML standoff annotation for integration of NLP components • Standoff annotation allows for combination of overlapping hierarchies • Access to results of alternative NLP components, for flexible use in applications • XSLT-based system architecture WHAT: Whiteboard Annotation Transformer (Schäfer 2003) WHAT query • XSLT queries to XML standoff markup • Template library for 3 types of queries: V(alue), N(ode sets), D(ocument) • Flexible, efficient access for online / offline integration of NLP components • ACT: Accessing, Computing, Transforming • Portability [Diagram: a WHAT query and the component-specific XSLT template library yield a constructed XSLT query, which the XSLT processor applies to the XML standoff markup to produce the result]

  7. Integration of Shallow and Deep NLP — XSLT-based queries for annotation-based integration — • Through V(alue) and N(ode) queries: • Morphology and stemming of unknown words (unknown in HPSG lexicon) • PoS tagging • Compounds • Named entities (spans and semantic types) • Through D(ocument), V(alue) and N(ode) queries: • Chunks • Topological structure (spans, types) • Example: return Named Entity type from SPPC_XML: getValue.NE.type(I4)

<query name="getValue.NE.type">
  <!-- returns the type of named entity -->
  <xsl:param name="index"/>
  <xsl:template match="/WHITEBOARD/SPPC_XML//NE[@id=$index]">
    <xsl:value-of select="@type"/>
  </xsl:template>
</query>
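
A minimal sketch (not the WHAT implementation) of how such a V(alue) query can be executed with lxml in Python: the template above is recast as a complete XSLT stylesheet (the id test is moved from the match pattern into the select expression), and the standoff document below is a made-up two-layer example.

from lxml import etree

# made-up standoff annotation with an SPPC_XML layer containing one named entity
STANDOFF = b"""
<WHITEBOARD>
  <SPPC_XML>
    <NE id="I4" type="location"><W>Sapporo</W></NE>
  </SPPC_XML>
</WHITEBOARD>"""

# self-contained XSLT version of the getValue.NE.type query
QUERY = b"""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="index"/>
  <xsl:template match="/">
    <!-- returns the type of the named entity whose id equals $index -->
    <xsl:value-of select="/WHITEBOARD/SPPC_XML//NE[@id=$index]/@type"/>
  </xsl:template>
</xsl:stylesheet>"""

transform = etree.XSLT(etree.fromstring(QUERY))
result = transform(etree.fromstring(STANDOFF), index=etree.XSLT.strparam("I4"))
print(str(result))   # -> location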

  8. Integration of Shallow and Deep NLP — Lexical integration — • Shallow processing (SPPC) • Morphological and compound analysis • PoS tagging • Named Entity recognition • Deep syntactic processing (HPSG) • Subcategorisation • Argument structure • Lexical semantic sorts • Building HPSG lexicon entries “on the fly” • XML encoding of typed feature structures • Mapping lexical information from SPPC to HPSG typed feature structures • Lexical syntactic and semantic information • Mapping GermaNet semantic classes to HPSG sorts (Siegel et al., 2001) • Subcategorisation acquisition from parsed corpora • Increase of coverage and robustness at lexical level • Increase of fully lexically covered sentences: 43% (on NEGRA corpus) • Increase of parsed sentences due to lexical coverage: 8.9%
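
An illustrative sketch (not the actual grammar code) of what building an entry “on the fly” amounts to: shallow PoS / named-entity / GermaNet information for a word unknown to the deep lexicon is mapped to a skeletal, feature-structure-like entry. All type names, features and the GermaNet-to-sort table below are invented for the example.

# hypothetical mappings from shallow tags to HPSG lexical types and semantic sorts
POS_TO_TYPE = {"NN": "common-noun-lex", "ADJA": "adjective-lex", "ADV": "adverb-lex"}
NE_TO_TYPE = {"person": "ne-person-lex", "location": "ne-location-lex"}
GERMANET_TO_SORT = {"Mensch": "human", "Ort": "location", "Artefakt": "artifact"}

def build_entry(word, stem, pos=None, ne_type=None, germanet_class=None):
    """Return a minimal feature-structure-like lexicon entry as a nested dict."""
    lex_type = NE_TO_TYPE.get(ne_type) or POS_TO_TYPE.get(pos)
    if lex_type is None:
        return None                       # no safe default entry for this word
    return {
        "TYPE": lex_type,
        "STEM": stem,
        "ORTH": word,
        "SYNSEM": {
            "KEYREL": {"PRED": "_" + stem + "_rel"},   # lexical relation name
            "SORT": GERMANET_TO_SORT.get(germanet_class, "entity"),
        },
    }

entry = build_entry("Datenbanken", "Datenbank", pos="NN", germanet_class="Artefakt")
print(entry["TYPE"], entry["SYNSEM"]["SORT"])   # common-noun-lex artifact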

  9. Integration of Shallow and Deep NLP — Syntactic integration — • Using robust, efficient shallow parsing • to pre-partition the deep parser's search space → efficiency • to select partial analyses from the deep parser's chart → robustness • Constraining the search space of a chart-based parser • External knowledge sources deliver • subtrees to be checked for compatibility with deep parsing • additional information (categorial, featural constraints) for constituents • Prioritisation scheme: constituents (chart edges) of the deep parser are • rewarded if compatible • penalised if incompatible with external constraints • Best-first filter on ambiguous output • Challenge: shallow analysis needs to provide reliable, compatible structures

  10. The Shallow-Deep Mapping Problem — Problems and Solutions — The shallow-deep mapping problem • Chunk parsing not isomorphic to deep syntactic structure („attachments“) • Deep syntactic structure (NP, CL, CL) vs. NP & CL chunks (NP, CL, CL): [Die Programme [die [sie] benutzen, [um [ihre Ergebnisse] zu verbreiten] / [The programs [that [they] use, [in order to [their results] distribute]

  11. The Shallow-Deep Mapping Problem — Problems and Solutions — The shallow-deep mapping problem • Chunk parsing not isomorphic to deep syntactic structure („attachments“) • Deep syntactic structure (NP, CL, CL) vs. NP & CL chunks (NP, NP, CL): [Die Programme [die [sie] benutzen, [um [ihre Ergebnisse] zu verbreiten] / [The programs [that [they] use, [in order to [their results] distribute]

  12. The Shallow-Deep Mapping Problem — Problems and Solutions — The shallow-deep mapping problem • Chunk parsing not isomorphic to deep syntactic structure („attachments“) • Deep syntactic structure (NP, CL, CL) vs. NP & CL chunks (NP, NP, NP): [Die Programme [die [sie] benutzen, [um [ihre Ergebnisse] zu verbreiten] / [The programs [that [they] use, [in order to [their results] distribute]

  13. The Shallow-Deep Mapping Problem — Problems and Solutions — The shallow-deep mapping problem • Chunk parsing not isomorphic to deep syntactic structure („attachments“) • „Bottom-up“ chunk parsing not constrained by sentence macro-structure (e.g. Peter eats pizza and Mary drinks wine) • Stochastic Topological Field Parsing (Becker and Frank 2002) • High degree of compatibility with deep syntactic structure • Flat, partial macro-structure: robustness, coverage, efficiency, precision

  14. Stochastic Topological Field Parsing — Topological field model of German syntax — Theory-neutral macro-structure of complex sentences • Fields per sentence type: Vorfeld (VF), left sentence bracket (LK), Mittelfeld (MF), right sentence bracket (RK), Nachfeld (NF) • V2: Fritz kennt die Freunde seines Sohns, die zur Party kommen. / Fritz hat die Freunde seines Sohns kennengelernt, die zur Party kommen. • V1: Hat Fritz die Freunde s. Sohns kennengelernt, die zur Party kamen? / Kennt Fritz die Freunde s. Sohns, die zur Party kommen? • Vletzt: weil Fritz die Freunde s. Sohns kennt / wer die Freunde seines Sohns kennt, die zur Party kommen • [Diagram: mapping the topological structure of a clause (CL: VF LK MF RK NF) to deep syntactic structure]

  15. Stochastic Topological Field Parsing — A corpus-based approach (Becker & Frank 2002) — Non-lexicalised PCFG trained from (converted) NEGRA corpus • Flat phrasal fields VF, MF, NF: sequences of POS tags (and CL-nodes) • Parameterised categories: CL–V2/–V1/–SUBCL/–REL/–WH, ..., RB–INF/–FIN • Explicit clausal embedding structure • [Example tree: Daher (thus) wies (ordered) Souza die (the) Polizei (police) an (verb particle), den (the) Häuptling (chieftain) zu (to) fassen (capture), der (who) sich (himself) versteckt (hidden) hält (keeps); a CL-V2 clause with fields VF-TOPIC, LB-VFIN, MF, RB-PTK, NF, embedding a CL-INF and a CL-REL clause]
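
A toy illustration (not the trained NEGRA model): a tiny hand-written PCFG over POS-tag sequences with topological field categories, parsed with NLTK's Viterbi parser. The rule set and probabilities are invented; the real grammar is induced from the converted treebank.

import nltk

grammar = nltk.PCFG.fromstring("""
  CL-V2    -> VF-TOPIC LB-VFIN MF RB-PTK [1.0]
  VF-TOPIC -> 'ADV' [0.5] | 'NE' [0.5]
  LB-VFIN  -> 'VVFIN' [1.0]
  MF       -> 'NE' MF [0.3] | 'ART' MF [0.3] | 'NN' MF [0.2] | 'NN' [0.2]
  RB-PTK   -> 'PTKVZ' [1.0]
""")

parser = nltk.ViterbiParser(grammar)
# POS tags of "Daher wies Souza die Polizei an" (non-lexicalised input)
tags = "ADV VVFIN NE ART NN PTKVZ".split()
for tree in parser.parse(tags):
    tree.pretty_print()
    print("probability of best parse:", tree.prob())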

  16. Stochastic Topological Field Parsing — Performance — Best model [para+, bin+, pnct+, prun+] • High accuracy (93% / 88%) at high coverage (up to 100%) • High rate of perfect matches (fully correct): 80% / 72% • Efficiency: 0.12 secs/sentence (LoPar parser, Schmid 2000) Evaluation: ignoring parameters and punctuation (length ≤ 40 words)

  17. Integrated Shallow and Deep Parsing — TopP meets HPSG — [Example: Der Zehnkampf hätte eine andere Dimension gehabt, wenn er dabei gewesen wäre. Shown side by side: the topological parse (CL-V2 with VF-TOPIC, LB-VFIN, MF, RB-VPART and an NF containing the embedded CL-SUBCL) and the corresponding HPSG constituent structure]

  18. Integrated Shallow and Deep Parsing — Bridging structural non-isomorphisms — XSLT-based extraction of map constraints to guide deep parsing, e.g. <MAP_CONSTR id="T10" constr="extrapos_rk+nf" left="W7" right="W13"/> for the extraposed RK+NF span (words 7–13) [Diagram: the topological and HPSG trees of slide 17, with the extracted constraint bridging the non-isomorphic attachment of the extraposed clause]

  19. Flattening phrasal fields Integrated Shallow and Deep Parsing— XML/XSLT-based integration: TopP meets HPSG —

  20. chunk insertion Integrated Shallow and Deep Parsing— XML/XSLT-based integration: TopP meets HPSG —

  21. Integrated Shallow and Deep Parsing — XML/XSLT-based integration: TopP meets HPSG — bracket extraction:

<TOPO2HPSG type="root" id="5608">
  <MAP_CONSTR id="T1" constr="v2_cp" left="W1" right="W13"/>
  <MAP_CONSTR id="T2" constr="v2_vf" left="W1" right="W2"/>
  <MAP_CONSTR id="T3" constr="vfronted_vfin+rk" left="W3" right="W3"/>
  <MAP_CONSTR id="T4" constr="vfronted_vfin+vp+rk" left="W3" right="W13"/>
  <MAP_CONSTR id="T5" constr="vfronted_vp+rk" left="W4" right="W13"/>
  <MAP_CONSTR id="T6" constr="vfronted_rk-complex" left="W7" right="W7"/>
  <MAP_CONSTR id="T7" constr="vl_cpfin_compl" left="W9" right="W13"/>
  <MAP_CONSTR id="T8" constr="vl_compl_vp" left="W10" right="W13"/>
  <MAP_CONSTR id="T9" constr="vl_rk_fin+complex+f" left="W12" right="W13"/>
  <MAP_CONSTR id="T10" constr="extrapos_rk+nf" left="W7" right="W13"/>
</TOPO2HPSG>

[Pipeline: the extracted brackets, together with shallow lexical processing (SPPC), feed HPSG parsing (prioritisation)]
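
A small sketch of reading these constraints on the deep-parsing side: the TOPO2HPSG document is turned into a list of labelled bracket spans that a prioritiser can consult. Only the element and attribute names visible above are assumed; the surrounding machinery is of course richer.

import xml.etree.ElementTree as ET
from collections import namedtuple

Bracket = namedtuple("Bracket", "id constr left right")

def read_map_constraints(xml_string):
    root = ET.fromstring(xml_string)                 # the <TOPO2HPSG> element
    brackets = []
    for mc in root.findall("MAP_CONSTR"):
        left = int(mc.get("left").lstrip("W"))       # word position "W7" -> 7
        right = int(mc.get("right").lstrip("W"))
        brackets.append(Bracket(mc.get("id"), mc.get("constr"), left, right))
    return brackets

topo2hpsg = """<TOPO2HPSG type="root" id="5608">
  <MAP_CONSTR id="T1" constr="v2_cp" left="W1" right="W13"/>
  <MAP_CONSTR id="T7" constr="vl_cpfin_compl" left="W9" right="W13"/>
  <MAP_CONSTR id="T10" constr="extrapos_rk+nf" left="W7" right="W13"/>
</TOPO2HPSG>"""

for b in read_map_constraints(topo2hpsg):
    print(b.id, b.constr, b.left, b.right)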

  22. Shaping the Deep Parser's Search Space — Bracket conditions from shallow topological parsing — • Interface to shallow components: labelled brackets • Provide information about constituent start and end positions • Bracket names (types) associated with additional constraints • HPSG parser PET: agenda-based chart parser • Flexible priority heuristics for the parsing tasks (i.e. possible combinations of edges) • Matching start, connecting and end positions of new tasks against brackets • Bracket information is used to modify task priorities • Reward tasks consistent with bracket information • Penalise tasks building incompatible chart edges • No pruning, but shaping the search space!

  23. Shaping the Deep Parser's Search Space — Matching brackets and chart edges — Crossing Event (bracket x)

  24. Shaping the Deep Parser's Search Space — Matching brackets and chart edges — Match Event (bracket x)

  25. Shaping the Deep Parser's Search Space — Matching brackets and chart edges — Right (Left)-match Inside Event (bracket x)

  26. Shaping the Deep Parser's Search Space — Matching brackets and chart edges — Right (Left)-match Outside Event (bracket x)
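
A rough sketch of the span comparison behind the four event types above: a bracket and a chart edge are both (start, end) word spans, and a new edge is classified as matching, lying inside, lying outside, or crossing the bracket. The function is an illustrative reconstruction, not PET's actual prioritisation code.

def classify(edge, bracket):
    """Classify an edge span relative to a bracket span (both (start, end))."""
    (es, ee), (bs, be) = edge, bracket
    if (es, ee) == (bs, be):
        return "match"                    # spans coincide
    if es >= bs and ee <= be:
        return "inside"                   # contained; may share the left/right boundary
    if ee <= bs or es >= be:
        return "outside"                  # disjoint spans
    if (es < bs <= ee < be) or (bs < es <= be < ee):
        return "crossing"                 # straddles one bracket boundary
    return "outside"                      # edge strictly covers the bracket

print(classify((4, 13), (4, 13)))   # match
print(classify((7, 9), (4, 13)))    # inside
print(classify((2, 6), (4, 13)))    # crossing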

  27. Shaping the Deep Parser's Search Space — Conditions and Effects — Priority update: p̃(t) = p(t) · ( 1 ± conf_ent(br_x) · conf_pr(x) · θ(x) ) • Additional constraints on bracket types for prioritisation • Constituent matching conditions • „Match“ and „Cross“: brackets compatible with HPSG constituents • „Right Inside“ and „Right Outside“: partially specified constituents • HPSG grammar constraints • Allowed/disallowed HPSG grammar rules • Necessary/forbidden HPSG feature structure configurations • Positive vs. negative priority effects: rewarding vs. penalising • Priorities are changed only if both match conditions and grammar constraints are fulfilled • Confidence values can be used to modulate the strength of the effect
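
A minimal sketch of this update, assuming the final factor is a per-bracket-type heuristic weight (the weight symbol and its argument are a reconstruction; the weight values ½ and 1 used in the experiments appear on slide 32): a task's priority is scaled up when rewarded and down when penalised, by the product of the two confidence terms and the weight. Names are illustrative, not PET's internal API.

def reprioritise(p_t, conf_ent_brx, conf_pr_x, weight_x, reward):
    """Return the adjusted task priority p~(t)."""
    delta = conf_ent_brx * conf_pr_x * weight_x
    return p_t * (1 + delta) if reward else p_t * (1 - delta)

# a task compatible with a high-confidence bracket is rewarded ...
print(reprioritise(100.0, 0.9, 0.88, 1.0, reward=True))    # 179.2
# ... an incompatible (crossing) task is penalised
print(reprioritise(100.0, 0.9, 0.88, 1.0, reward=False))   # ~20.8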

  28. Confidence Measures — Accuracy of map-constraints — • Static confidence measure: precision of bracket type x: conf_pr(x) • Precision/recall of brackets extracted from the best topological parse, measured against brackets extracted from the evaluation corpus (Becker & Frank 2002): precision 88.3%, recall 87.8% • Threshold on conf_pr (= 0.7) excludes 22.8% of bracket mass, 32.35% of bracket types • includes chunk brackets (with 71.1% precision)
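
A sketch of how such a static per-type confidence could be computed: the precision of extracted brackets of type x against gold bracket spans. The data below is made up; the slide's 88.3% / 87.8% figures come from the real evaluation corpus.

def bracket_precision_by_type(extracted, gold):
    """extracted, gold: sets of (type, left, right) labelled spans."""
    correct, total = {}, {}
    for b in extracted:
        total[b[0]] = total.get(b[0], 0) + 1
        if b in gold:
            correct[b[0]] = correct.get(b[0], 0) + 1
    return {x: correct.get(x, 0) / total[x] for x in total}

extracted = {("v2_cp", 1, 13), ("vl_compl_vp", 10, 13), ("chunk_np", 4, 6)}
gold = {("v2_cp", 1, 13), ("vl_compl_vp", 10, 12), ("chunk_np", 4, 6)}
for x, p in sorted(bracket_precision_by_type(extracted, gold).items()):
    print(x, p)   # chunk_np 1.0 / v2_cp 1.0 / vl_compl_vp 0.0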

  29. Confidence Measures — Tree Entropy — • Entropy of a parse distribution delivers a measure of how certain the parser is about its best analysis for a given sentence (e.g. Hwa 2000) • Uniform distribution, high entropy → very uncertain • Spike distribution, low entropy → very certain • Conf_ent: tree entropy as a confidence measure for the quality of the best topological parse and the extracted bracket constraints • Experiment I: effect of varying entropy thresholds on precision/recall of topological parsing • precision: proportion of selected parses that are perfect matches • recall: proportion of perfect matches that are selected • coverage: perfect matches above/below the entropy threshold: in/out of coverage • Experiment II: determining the optimal entropy threshold, trading coverage for precision
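
A minimal sketch of tree entropy as a confidence score: the parser's n-best parse probabilities are normalised to a distribution and its entropy is computed, so a spiked distribution (one clearly preferred parse) yields low entropy and a uniform one yields high entropy. The example numbers are made up.

import math

def tree_entropy(parse_probs):
    """Entropy (in bits) of the normalised distribution over parse probabilities."""
    z = sum(parse_probs)
    dist = [p / z for p in parse_probs if p > 0]
    return -sum(p * math.log2(p) for p in dist)

print(tree_entropy([0.97, 0.02, 0.01]))        # spiked distribution  -> ~0.22 (certain)
print(tree_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform distribution -> 2.0  (uncertain)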

  30. Confidence Measures — Tree Entropy — Experiments carried out on the (split) evaluation corpus of (Becker and Frank, 2002) • Varying entropy thresholds in [1, 0] • threshold = 1: no filtering • lowering the threshold increases precision, decreases recall and coverage • Optimal entropy threshold: 0.236, maximising the f-measure (weight 0.5) on the training set • Effect of threshold 0.236 on the test set
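
A sketch of the threshold search: a sentence's best topological parse is "selected" if its tree entropy is at or below the candidate threshold, precision and recall follow the definitions on slide 29, and the threshold maximising the f-measure on held-out data is kept. Plain F1 is used here rather than the weighted f-measure of the slide, and the data points are made up.

def f_measure(items, threshold):
    """items: list of (entropy, is_perfect_match) pairs."""
    selected = [match for ent, match in items if ent <= threshold]
    n_perfect = sum(match for _, match in items)
    if not selected or not n_perfect:
        return 0.0
    precision = sum(selected) / len(selected)      # selected parses that are perfect
    recall = sum(selected) / n_perfect             # perfect matches that are selected
    return 2 * precision * recall / (precision + recall)

held_out = [(0.05, True), (0.10, True), (0.30, False), (0.45, True), (0.90, False)]
best = max((t / 100 for t in range(101)), key=lambda t: f_measure(held_out, t))
print(best, round(f_measure(held_out, best), 3))   # 0.45 0.857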

  31. Experiments — Data and setup — Data • 5060 NEGRA sentences (24.57% of the NEGRA corpus, as covered by HPSG) • avg. length: 8.94 words (w/o punctuation); avg. lexical ambiguity: 3.05 entries/word Setup • Performance measuring (absolute run-time, no. of tasks) • Baseline: HPSG parsing w/ PoS guidance, but w/o topological information • Testing various integration parameters • topological brackets • confidence weights for topological information • bracket precision (P) (± thresholded) • tree entropy (E) (± thresholded) • chunk brackets

  32. Results Baseline: HPSG parsing w/ PoS guidance • Heuristic weights on task priorities • ½: increase / decrease by half • 1: increase to double / decrease to zero

  33. Results Baseline: HPSG parsing w/ PoS guidance Heuristic weights: with θ set high, wrong topological information can mislead the parser Confidence weights [0,1] • P(T): (thresholded) bracket precision • E(T): (thresholded) tree entropy

  34. Results Baseline: HPSG parsing w/ PoS guidance Heuristic weights: with θ set high, wrong topological information can mislead the parser • Confidence weights • PT and E work best • ET: the threshold cuts out the entire tree, while some brackets can be correct • PT with chunk constraints, w/ and w/o topological brackets

  35. Results Baseline: HPSG parsing w/ PoS guidance Heuristic weights: with θ set high, wrong topological information can mislead the parser • Confidence weights • PT and E work best • ET: the threshold cuts out the entire tree, while some brackets can be correct • No improvement by adding chunks • Chunks w/o topological brackets: almost no improvement over the baseline

  36. Observations — Monitoring efficiency gains by sentence length — Efficiency gains/losses by sentence length: baseline vs. PT–E ½ [Plot: distribution of # sentences per sentence length] • Outliers: 963 sentences (length ≥ 3, avg. length 11.09) • Observations: conflicting topological / HPSG parses; cross-validation effects

  37. Observations — Guidance from PoS, chunks, and topological brackets — Impact of guidance by PoS, chunks, or topological parsing • Baseline includes PoS prioritisation • Chunk-based constraints: rather poor • Topological constraints (span and grammar constraints): highest impact • Related work: PoS- and chunk-based prioritisation in dependency parsing (Daum et al. 2003)

  38. Conclusion and Outlook • Data-driven integration of shallow and deep parsing, mediated by XML multi-layer annotation architecture • XSLT-based integration: efficient, fine-grained dovetailing of shallow and deep constraints • Shallow macro-structural constraints yield substantial performance gains • Focus on annotation-based system architecture and efficiency • Further integration scenarios target • Robustness • Topological information for fragment recovery from deep parser’s chart • Pruning failed input sentences for reparsing (snipping adjunct clauses, ...) • Precision • Confidence-based filtering: tree entropy, decision tree learning • Fine-grainedness of analysis • Projecting robust semantic structures from shallow trees
