WP2: RTV NER Architecture Progress Report

WP2: RTV NER Architecture Progress Report, CROSSMARC Fourth Meeting, Athens, 28 February-1 March 2002

IT NERC: what has been done so far
• Processing steps:
  • Character normalization: “graphological” inconsistencies
  • Tokenization: general purpose (numbers, words, punctuation, …)
  • Flat term normalization: simple constituents and terminological expressions
  • Lexical analysis: lexical rules for recognizing complex constituents
  • NumEx recognition: rules for numerical expressions, unit normalization
  • Ontology lookup: simple inferences (based on local information)
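The first steps of this chain (character normalization, tokenization, NumEx recognition with unit normalization) can be illustrated with a minimal Python sketch. It is not the RTV implementation, which runs as XSLT stylesheets: the `UNIT_GAZETTEER` entries, the canonical units and all function names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical unit gazetteer: surface form -> (canonical unit, multiplier).
UNIT_GAZETTEER = {
    "mhz": ("MHz", 1),
    "ghz": ("MHz", 1000),
    "mb": ("MB", 1),
    "gb": ("MB", 1024),
}

# A number (optional decimal part), a run of letters, or one punctuation mark.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)?|[^\W\d_]+|[^\w\s]")


def normalize_chars(text):
    """Map typographic quotes to ASCII and collapse runs of whitespace."""
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """General-purpose tokenization into numbers, words and punctuation."""
    return TOKEN_RE.findall(text)


def recognize_numex(tokens):
    """Return (value, canonical unit) for each number followed by a known unit."""
    found = []
    for number, unit in zip(tokens, tokens[1:]):
        entry = UNIT_GAZETTEER.get(unit.lower())
        if entry and number[0].isdigit():
            canonical, factor = entry
            value = float(number.replace(",", ".")) * factor
            found.append((value, canonical))
    return found
```

For example, `recognize_numex(tokenize(normalize_chars("Processore  800 Mhz, RAM 1,5 GB")))` yields one pair per numerical expression, each expressed in the assumed canonical unit.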

IT NERC: lexical resources
• Current resources:
  • a set of normalization rules (13 rules)
  • a terminological database (33 entries)
  • the Italian lexicon (119 entries)
  • a gazetteer for measurement units (24 units)
• Small lexical resources, to provide a baseline for the evaluation
• XML format (enhanced readability and management)

Name Matching and Normalization
• Cross-language name matching is (partially) provided through the monolingual lexicons and the ontology
  • Marked-up entities hold an ID reference to the ontology
• Normalization
  • implicit if the entity can be assigned to a concept
  • numex can be expressed in the ontology unit
• In other cases we plan to explore:
  • Strict tokenization
  • Acronym expansion
  • Case-insensitive matching
E.g. Vaio PCG FX-101 vs. VAIO PCG 101FX
<T>Vaio</T><T>PCG</T><T>FX</T><T>-</T><T>101</T>
<T>VAIO</T><T>PCG</T><T>101</T><T>FX</T>
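As a rough illustration of how strict tokenization and case-insensitive matching could reconcile the two Vaio variants, here is a small Python sketch. It makes two assumptions the slide does not state: punctuation tokens such as "-" are dropped, and token order is ignored. The function names are hypothetical.

```python
import re


def strict_tokens(name):
    """Split at every letter/digit boundary, drop punctuation, lowercase."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+|\d+", name)]


def same_name(a, b):
    """Assumed policy: two names match if they share the same bag of tokens."""
    return sorted(strict_tokens(a)) == sorted(strict_tokens(b))
```

With these definitions `same_name("Vaio PCG FX-101", "VAIO PCG 101FX")` holds, while a different model number does not match. Ignoring token order is a deliberately loose choice: it can over-match names that differ only in the order of their parts, so a real matcher would probably need stricter constraints.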

IT NERC: Test set

IT NERC: Evaluation

WP2: RTV NER Architecture, Short DEMO, CROSSMARC Fourth Meeting, Athens, 28 February-1 March 2002

RTV NER Component: overview
• Main goals
  • To provide a pool of evidence for the FE
  • To recognize document fragments describing candidate
    - features
    - attributes
    - values
• Overall strategy: local bottom-up analysis
• Key features
  • Token-based
  • Lexical semantics: recognize and assign references to ontology nodes

RTV NER Component: design choices
• ISSUE1: How much of the physical (XHTML) structure has to be retained for the analysis?
  • SOL1: Keep text only
  • SOL2: Keep text and structural tags only (tables, list items, headings), drop format tags (bold, underline, italic)
  • SOL3: Keep all the XHTML structure
The best choice depends on the domain (how well the physical structure follows the logical structure)

RTV NER Component: design choices
• ISSUE1 in the LAPTOP domain
  • Very little linguistic evidence
  • Weak syntax (free constituent order) within and across components
E.g. Intel™ PIII800 Mobile vs. Mobile Intel PIII™800 Mhz vs. Intel Mobile Pentium™III800
SOL1 (text only): poor performance (lack of constraints)
SOL2 and SOL3 often lead to better results
SOL3 has been chosen, as uninteresting tags can be filtered out later
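Filtering out uninteresting tags at a later stage could look roughly like the Python sketch below, which removes format tags from well-formed XHTML and leaves the structural tags in place. The `FORMAT_TAGS` list is an assumption, and a regular expression is used only to keep the example short; the actual component works on the XML tree through XSLT.

```python
import re
import xml.etree.ElementTree as ET

# Assumed list of purely presentational tags; the real filter may differ.
FORMAT_TAGS = ("b", "i", "u", "em", "strong", "font")

# Matches opening, closing and self-closing forms of the tags above only
# (so <br/>, <body>, <ul> or <img> are left untouched).
FORMAT_TAG_RE = re.compile(r"</?(?:%s)(?:\s[^>]*)?/?>" % "|".join(FORMAT_TAGS))


def drop_format_tags(xhtml):
    """Remove format tags but keep their text and all structural tags."""
    return FORMAT_TAG_RE.sub("", xhtml)


def cell_texts(xhtml):
    """Parse the filtered markup and return the text of each table cell."""
    root = ET.fromstring(drop_format_tags(xhtml))
    return [td.text for td in root.iter("td")]
```

Because only the tag markers are removed, the table structure that constrains the analysis survives, and each cell collapses to plain text.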

RTV NER Component: design choices
• ISSUE2: What kind of information has to be recognized?
  • SOL1: values only (ontology leaves)
  • SOL2: surface realizations of any ontology node
Identifying the attributes and features that appear in the page can later help to resolve ambiguities (for instance in NUMEX expressions)
We chose SOL2, so that “service” information is tagged as well

RTV NER Component: design choices
• ISSUE3: When is a piece of information considered recognized?
  • SOL1: when its entire span has been detected (exact match)
  • SOL2: when its core has been identified (fuzzy match)
SOL2 works very well for closed sets (e.g. Op Systems, Procs, IOPorts, Resolutions, …)
Advantages: compact, robust rules
SOL1 is needed for open sets (e.g. Models, Manufacturers, …)
Detect boundaries vs. understand meaning
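The contrast between the two solutions can be sketched in Python as follows. The concept IDs, the core patterns and the one-entry model lexicon are all hypothetical; they only illustrate that a closed set can be covered by a few core rules, while an open set needs the full span.

```python
import re

# Hypothetical closed-set rules: finding the core is enough (fuzzy match).
CORE_RULES = {
    "OS-WINDOWS": re.compile(r"\bwindows\b", re.I),
    "PROC-PENTIUM3": re.compile(r"\b(?:pentium\s*iii|piii)", re.I),
}

# Hypothetical open-set lexicon: the whole span must be listed (exact match).
MODEL_LEXICON = {"vaio pcg fx-101": "MODEL-0001"}


def fuzzy_match(text):
    """Return the IDs of all closed-set concepts whose core occurs in text."""
    return sorted(cid for cid, rx in CORE_RULES.items() if rx.search(text))


def exact_match(span):
    """Return the concept ID only if the entire span is a lexicon entry."""
    return MODEL_LEXICON.get(" ".join(span.lower().split()))
```

A single core rule covers all three processor variants quoted for the LAPTOP domain, which is where the compactness and robustness come from. The exact matcher, by contrast, returns nothing for a partial span such as "Vaio PCG".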

RTV NER Component: strategy
• Principles
  • Go through the document structure
  • Do a local analysis on each section
  • Exploit local context
  • Structure (i.e. normalize) the information as soon as possible
  • Try to fit the ontology
Locality with respect to document structure may differ from textual order

RTV NER Component: implementation
• XSLT stylesheets provide a well-founded (and flexible) control strategy
  • to navigate the XML structure
  • to call linguistic processors
  • to aggregate the extracted information
  • to show the produced results
• Linguistic processors are
  • embedded in the XSL processor as Java extensions, or
  • other XSLT stylesheets
• Lexical rules and KBs are XML
• RTV NER is a Java 2 application that applies the transformation sequence via the TrAX API (JAXP 1.1)
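The control strategy, a fixed sequence of transformations applied in turn to one XML document, can be mimicked in a few lines of Python. This is only an analogy: plain functions over an ElementTree stand in for the XSLT stylesheets and Java extensions, and the stage names, the `<p>` input element, the `ref` attribute and the lexicon contents are assumptions. Only the `<T>` token element is taken from the slides.

```python
import xml.etree.ElementTree as ET


def tokenize_stage(root):
    """Wrap each whitespace-separated token of every <p> in a <T> element."""
    for p in list(root.iter("p")):
        words = (p.text or "").split()
        p.text = None
        for word in words:
            ET.SubElement(p, "T").text = word
    return root


def make_lookup_stage(lexicon):
    """Build a stage that adds a hypothetical 'ref' attribute to known tokens."""
    def lookup_stage(root):
        for tok in root.iter("T"):
            ref = lexicon.get((tok.text or "").lower())
            if ref:
                tok.set("ref", ref)
        return root
    return lookup_stage


def run_pipeline(xml_text, stages):
    """Parse once, apply each stage in order, serialize the result."""
    root = ET.fromstring(xml_text)
    for stage in stages:
        root = stage(root)
    return ET.tostring(root, encoding="unicode")
```

Calling `run_pipeline("<doc><p>Windows 2000</p></doc>", [tokenize_stage, make_lookup_stage({"windows": "OS-WINDOWS"})])` returns the document with the tokens marked up and the known one carrying its reference. Each stage sees the output of the previous one, as in a chain of stylesheets.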

NER scope of action
• From: textual descr. (strings); physical struct. (HTML tree)
• To: domain entities ontological descr. (node references); mixed struct. (HTML tree + embedded XML entities)
• KNOWLEDGE: lexical realizations of domain concepts (FLAT ONTOLOGICAL VIEW)
• SEM. COMPOSITIONALITY: lexical
FE scope of action
• To: structural descr. (product template); logical struct. (XML product tree)
• KNOWLEDGE: rels among domain concepts (HIERARCHICAL ONTOLOGICAL VIEW)
• SEM. COMPOSITIONALITY: structural

Resolved Old Issues – (2nd Meeting)
• The input
  • How to bridge the gap between highly structured (HTML) pages and the logical structure (partially resolved)
• The ontology
  • is now separated from the lexicons
  • lexical entries are foreseen for all concepts
  • more declarative semantics (explicit indication of data types)
• A semi-automatic procedure to update the XML ontology and lexicons has been created

Current Issues
• The input
  • Multiple product descriptions in one page
  • Single description for a product family in one page (usually without prices)
  • Tidy failures (RTV NER processes well-formed XML only)
• The ontology
  • Adding new concepts
  • Adding a “default” concept for unforeseen values

RTV Named Entity Recognizer (architecture diagram)
• Input and output: XML source doc, XML output doc
• Stylesheets: Tokenizer XSL, Normalizer XSL, Numex Rec XSL, Numex Conv XSL, Lex Lookup XSL, … XSL
• Resources: Convs, Dict, Lexicons
• RTV Ling Modules (Parser Extensions): IT Tokenizer, Inference Engine, …
• XML Parser (currently Saxon)
