
Towards an NLP `module’




  1. Towards an NLP `module’: the role of an utterance-level interface

  2. Modular architecture
     • Language independent application
     • Meaning representation
     • Language module
     • Utterance-level interface
     • text or speech

  3. Desiderata for NLP module
     • Application- and domain-independent
     • Bidirectional processing
     • No grammar-specific information should be needed in the application
     • Architecture should support multiple languages
     • Practical
     • Coverage: all well-formed input should be accepted, robust to speaker errors

  4. Why?
     • developers could build `intelligent’ responsive applications without being NLP experts themselves
     • less time-consuming and expensive than doing the NLP for each application domain
     • multilingual applications
     • support further research

  5. LinGO/DELPH-IN
     • Software and `lingware’ for application- and domain-independent NLP
     • Linguistically-motivated (HPSG), deep processing
     • Multiple languages
     • Analysis and generation
     • Informal collaboration since c. 1995
     • NLP research and development, theoretical research and teaching
     • www.delph-in.net

  6. What’s different?
     • Open Source, integrated systems
     • Data-driven techniques combined with linguistic expertise
     • Testing
         • empirical basis, evaluation
         • linguistic motivation
     • No toy systems!
         • large scale grammars
         • maintainable software
         • development and runtime tools

  7. Progress
     • Application- and domain-independent: reasonable (lexicons, text structure)
     • Bidirectional processing: yes
     • No grammar-specifics in applications: yes
     • Multiple languages: English, Japanese, German, Norwegian, Korean, Greek, Italian, French, plus grammar sharing via the Matrix
     • Practical: efficiency OK for some applications and improving, interfaces?
     • Coverage and robustness: 80%+ coverage on English, good parse selection, not robust

  8. Integrating deep and shallow processing
     • Shallow processing: speed and robustness, but lacks precision
     • Pairwise integration of systems is time-consuming, brittle
     • Common semantic representation language: shallow processing underspecified
     • Demonstrated effectiveness on IE (Deep Thought)
     • Requires that systems share tokenization (undesirable and impractical) or that output can be precisely aligned with original document
     • Markup complicates this

  9. Utterance-level interface
     • text or speech
     • complex cases
     • text structure (e.g., headings, lists)
     • non-text (e.g., formulae, dates, graphics)
     • segmentation (esp. Japanese, Chinese)
     • speech lattices
     • integration of multiple analyzers

  10. Utterance interface
      • Standard interface language
          • allow for ambiguity at all levels
          • XML
          • collaborating with ISO working group (MAF)
          • processors deliver standoff annotations to original text
      • Plan to develop finite-state preprocessors for some text types, allow for others
      • Plan to experiment with speech lattices
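As a rough illustration of standoff annotation, the sketch below leaves the original text untouched and records each token as a pair of character offsets in a small XML document. The element and attribute names (`annotation`, `token`, `from`, `to`, `type`) are hypothetical and are not taken from MAF or from any DELPH-IN format; overlapping spans are simply listed side by side so that ambiguity is preserved.

```python
import xml.etree.ElementTree as ET

def standoff(spans):
    """Build a standoff annotation: each token refers to the text by character offsets."""
    root = ET.Element("annotation")  # hypothetical element name
    for start, end, kind in spans:
        ET.SubElement(root, "token", {"from": str(start), "to": str(end), "type": kind})
    return root

text = "I want a laptop-with a case."

# Two competing analyses of "laptop-with" are recorded together:
# one token covering the whole string, or three tokens split at the dash.
spans = [(9, 20, "word"), (9, 15, "word"), (15, 16, "dash"), (16, 20, "word")]
doc = standoff(spans)

# The annotation carries no text of its own; substrings are recovered from offsets.
covered = [text[int(t.get("from")):int(t.get("to"))] for t in doc.iter("token")]
print(covered)
print(ET.tostring(doc, encoding="unicode"))
```

Because the annotation only points into the text, several processors can annotate the same document without agreeing on how it should be rewritten.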

  11. Assumptions about tokenization
      • tokenization: input data is transformed into a form suitable for morphological processing or lexical lookup:
          • What’s in those 234 dogs’ bowls, Fred?
          • what ’s in those <num 234> dogs ’s bowls , Fred ?
      • tokenization is therefore pre-lexical and cannot depend on lexical lookup
      • normalization (case, numbers, dates, formulae) as well as segmentation
      • it used to be common to strip punctuation, but large-coverage systems utilize it
      • in generation: go from tokens to final output
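A minimal sketch of a pre-lexical tokenizer in this spirit: it uses only regular expressions, with no lexicon, and applies three normalizations inferred from the example on the slide (lowercasing the utterance-initial word, wrapping digit sequences as `<num N>`, and rewriting a word-final apostrophe or `’s` as a separate `’s` token). The rules and the function name `tokenize` are assumptions made for illustration, not the DELPH-IN implementation, and the apostrophe rule is deliberately naive.

```python
import re

# \u2019 is the typographic apostrophe used in the slide's example.
TOKEN_RE = re.compile(
    r"(?P<num>\d+)"
    r"|(?P<clitic>(?<=[A-Za-z])['\u2019]s?(?![A-Za-z]))"
    r"|(?P<word>[A-Za-z]+)"
    r"|(?P<punct>[^\sA-Za-z0-9])"
)

def tokenize(text):
    """Segment and normalize `text` without consulting a lexicon."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        kind, form = m.lastgroup, m.group()
        if kind == "num":
            form = f"<num {form}>"        # number normalization
        elif kind == "clitic":
            form = "\u2019s"              # both X's and Xs' become a separate 's token
        elif kind == "word" and not tokens:
            form = form.lower()           # case-normalize the utterance-initial word only
        tokens.append(form)
    return tokens

print(" ".join(tokenize("What\u2019s in those 234 dogs\u2019 bowls, Fred?")))
```

On the slide's input this reproduces the token sequence shown there. It also treats every apostrophe that follows a letter as a possessive or clitic, which is exactly the kind of commitment that fails on a closing quotation mark.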

  12. Tokenization ambiguity
      • Unusual to find cases where humans have any difficulty: problem arises because we need a pipelined system
      • Some examples:
          • `I washed the dogs’ bowls’, I said. (first ’ could be end of quote)
          • The ’keeper’s reputations are on the line. (first ’ actually indicating abbreviation for goalkeeper but could be start of quote in text where ’ is not distinct from `)
          • I want a laptop-with a case. (common in email not to have spaces round dash)
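One way to avoid committing too early in a pipeline is to let the tokenizer emit every candidate reading and leave the choice to later stages. The sketch below does this for the dash example only: a chunk of the form `word-word` yields both a single-token reading and a three-token reading. The function names and the list-of-alternatives representation are illustrative assumptions, not an existing interface.

```python
import re

def readings(chunk):
    """Return every candidate token sequence for one whitespace-delimited chunk."""
    alternatives = [[chunk]]
    m = re.fullmatch(r"(\w+)-(\w+)", chunk)
    if m:
        # The dash may be a hyphen inside one token or a dash between two.
        alternatives.append([m.group(1), "-", m.group(2)])
    return alternatives

def lattice(text):
    """A very small lattice: one list of alternative readings per chunk."""
    return [readings(chunk) for chunk in text.split()]

for alternatives in lattice("I want a laptop-with a case."):
    print(alternatives)
```

Only chunks consisting of exactly two word parts around one dash receive a second reading here; a fuller treatment would also cover the quotation-mark cases above.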

  13. Modularity problems
      • lexicon developers may assume particular tokenization: e.g., hyphen removal
      • different systems tokenize differently: big problem for system integration
      • DELPH-IN `characterization’: record original string character positions in token and all subsequent units
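The benefit of recording character positions can be sketched as follows: if every token keeps its start and end offsets in the original string, the outputs of two systems that tokenize differently can be aligned by span containment alone. The `Token` class, the two regular expressions, and `align` are hypothetical stand-ins for real processors.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    form: str
    start: int  # character offset in the original string (inclusive)
    end: int    # character offset in the original string (exclusive)

def tokenize(text, pattern):
    """Tokenize with `pattern`, keeping each token's original character span."""
    return [Token(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]

def align(coarse, fine):
    """Map each coarse token to the fine tokens whose spans fall inside it."""
    return {c: [f for f in fine if c.start <= f.start and f.end <= c.end] for c in coarse}

text = "I want a laptop-with a case."
keeps_hyphen = tokenize(text, r"[\w-]+|[^\w\s]")   # system A: hyphenated strings stay whole
splits_hyphen = tokenize(text, r"\w+|[^\w\s]")     # system B: the dash is its own token
alignment = align(keeps_hyphen, splits_hyphen)

for coarse, fine in alignment.items():
    print(coarse.form, "->", [f.form for f in fine])
```

Neither system needs to know how the other tokenizes; the shared reference point is the original string.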

  14. Speech output
      • Speech output from a transcribing recognizer is treated as a lattice of tokens
      • may actually require retokenization

  15. Non-whitespace languages
      • Segmentation in Japanese (e.g., ChaSen) is (in effect) accompanied by lexical lookup / morphological analysis
      • definitely do not want to assume this for English: for some forms of processing we may not have a lexicon
