
NLP Lexicon Requirements

NLP Lexicon Requirements. ... & LMF. Nicoletta Calzolari Istituto di Linguistica Computazionale - CNR - Pisa glottolo@ilc.cnr.it. Looking into the past. After the “Grosseto Workshop” (1985): a turning point. Xxx-Lex. GeneLex. AcquiLex. MultiLex. Xxx-Lex.


Presentation Transcript


  1. NLP Lexicon Requirements ... & LMF
     Nicoletta Calzolari, Istituto di Linguistica Computazionale - CNR - Pisa, glottolo@ilc.cnr.it
     Nijmegen, August 2010

  2. Looking into the past
     After the “Grosseto Workshop” (1985): a turning point
     Xxx-Lex, GeneLex, AcquiLex, MultiLex, Xxx-Lex
     A. Zampolli: Let’s be coherent: • EAGLES • ISLE (Standards, Best Practices, ...)
     It all started with the situation we had in the late ’80s and early ’90s, with all the Xxx-LEX projects

  3. Key issues: do the conditions exist for a standardisation effort?
     • Reusability as the key concept → true also today
     • To avoid duplication of efforts, costs, etc.
     • To allow synergies, integration, exchange of data, ...
     • To provide a model for new data creation & acquisition
     • Decide on “feasible” areas & state priorities → this changes over time
     • The feasibility of formulating consensual standards is a strong sign of maturity in the field → we cannot propose standards if there are not enough results on which to base them
     • EAGLES was launched in ’93

  4. Main Results in Lexicon & Corpus WGs, First Phase (www.ilc.pi.cnr.it/EAGLES96/home.html)
     • Standard for morphosyntactic encoding of lexical entries, in a multi-layered structure, with applications for all the EU languages
     • Standard for subcategorisation in the lexicon: a set of standardised basic notions using a frame-based structure
     • Proposal for a basic set of notions in lexical semantics: focus on requirements of Information Systems and MT
     • Corpus Encoding Standard (CES) from TEI
     • Standard for morphosyntactic annotation of corpora, to ensure compatibility/interchangeability of concrete annotation schemata
     • Preliminary recommendations for syntactic annotation of corpora
     • Dialogue annotation, for integration of written and spoken annotation

  5. Content vs. Format/Representation
     Work on lexical description deals with two aspects:
     • Linguistic description of lexical items (content)
     • Formal representation of lexical descriptions (format)
     EAGLES concentrated on linguistic content, without disregarding the formal representation of the proposal
     TEI: more on format/representation issues
     • In LMF: on the abstract meta-model

  6. Flexibility in the Recommendations, e.g. Morphosyntax
     Level   Information Type                           Recommendation
     L-0     Part-of-Speech                             Obligatory
     L-1     Morphosyntactic agreement features         Recommended
     L-2     Language-specific (or refined) features    Optional
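
The three levels on this slide can be read as a simple validation policy: part-of-speech must be present, agreement features are expected, anything finer is optional. The sketch below only illustrates that reading. The `Entry` class, the `check` function and the feature names in `RECOMMENDED` are hypothetical, since the slide names the levels but does not enumerate the features.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumed inventory of L-1 (recommended) agreement features; illustrative only.
RECOMMENDED = {"gender", "number", "person", "case"}


@dataclass
class Entry:
    lemma: str
    pos: str | None = None                       # L-0: obligatory
    features: dict[str, str] = field(default_factory=dict)


def check(entry: Entry) -> dict[str, list[str]]:
    """Sort an entry's features by recommendation level and flag a missing POS."""
    report: dict[str, list[str]] = {"errors": [], "L-1": [], "L-2": []}
    if not entry.pos:
        report["errors"].append("missing part-of-speech (L-0 is obligatory)")
    for name in sorted(entry.features):
        # Anything outside the assumed L-1 set is treated as language-specific.
        level = "L-1" if name in RECOMMENDED else "L-2"
        report[level].append(name)
    return report


report = check(Entry("casa", "noun", {"gender": "f", "number": "sg", "diminutive": "no"}))
print(report)
```

An entry without a part-of-speech is reported as an error, while missing L-1 or L-2 features are not, which mirrors the obligatory/recommended/optional distinction.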

  7. MERITS → Strengths (from EAGLES-ISLE)
     • Standardisation as a necessary component of any strategic programme to create a coherent market
     • Leading industrials & academics participated (> 150 EU groups) → bottom-up, community-created standards
     • To avoid wasting time reinventing basic/consolidated knowledge
     • May be true also for many “humanities” users, not interested in debates on specific lexical approaches
     • Work otherwise duplicated among many projects is done just once, in a collaborative manner (overall cost-effectiveness)
     • Allows the field to be more competitive: → concentrate efforts on innovative areas → engage in new/advanced technology

  8. Why Standards for Language Resources? (from EAGLES-ISLE)
     • To ensure:
     • interoperability of systems (& data), through compatible interfaces
     • reusability and integrability of components
     • training based on consensual technical specifications and models (“gold standards”)
     • evaluation & validation based on agreed criteria
     • transition from prototypes to HLT products
     → important for workflows
     → essential for an LR Infrastructure for evaluation campaigns

  9. Applications: requirements for systems & enabling technologies
     Machine Translation, Information Extraction, Information Retrieval, Summarisation, Natural Language Generation, Word Clustering, Multiword Recognition + Extraction, Word Sense Disambiguation, Proper Noun Recognition, Parsing, Coreference, …
     For HLT, knowledge of applications’ requirements is essential

  10. The Multilingual ISLE Lexical Entry (MILE)
      General methodological principles (from EAGLES)
      Basic requirements for the design of the MILE:
      • Discover and list the (maximal) set of basic notions needed to describe the MILE (up to which level is standardisation feasible?)
      • Granularity
      • The leading principle: the edited union of existing lexicons/models (redundancy is not a problem)
      • Modular & layered
      • Allow for under-specification (& hierarchical structure)

  11. MILE – Modularity: the building-block model
      Diagram labels: Lexical entry 1, Lexical entry 2, Lexical entry 3; Lexical Objects: Sem feature, syntactic frame, slot, Syn feature, phrase
      • Allows different dimensions of lexical entries to be expressed
      • Enables modular specification of lexical entries
      • Creates ready-to-use packages to be combined in different ways

  12. The MILE Data Categories: user-adaptability and extensibility
      Diagram labels: MLC:SemanticFeature; instance_of; Core; HUMAN, ARTIFACT, EVENT, ANIMAL, GROUP, AGE, MAMMAL; UserDefined

  13. MILE Lexical Data Category Registry: a library of pre-instantiated objects
      Define (an ontology of) lexical objects:
      • represent lexical notions such as semantic unit, syntactic feature, syntactic frame, semantic predicate, semantic relation, synset, etc.
      • specify the relevant attributes
      • define the relations with other classes
      • hierarchically structured
      Can be used “off the shelf” or as a departure point for the definition of new or modified categories
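
A registry that can be used “off the shelf” or extended with new categories might be sketched as follows. This is an illustration under stated assumptions, not the actual MILE registry: the `DataCategory` and `Registry` classes are hypothetical, and although the feature names come from the MILE Data Categories slide, the choice of which ones are core and which are user-defined is assumed here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataCategory:
    name: str
    kind: str                      # lexical notion, e.g. "SemanticFeature"
    parent: str | None = None      # supports a hierarchical structure
    user_defined: bool = False


class Registry:
    """A tiny in-memory registry; hypothetical, for illustration only."""

    def __init__(self) -> None:
        self._items: dict[str, DataCategory] = {}

    def add(self, category: DataCategory) -> DataCategory:
        if category.name in self._items:
            raise ValueError(f"duplicate category: {category.name}")
        if category.parent is not None and category.parent not in self._items:
            raise KeyError(f"unknown parent: {category.parent}")
        self._items[category.name] = category
        return category

    def ancestors(self, name: str) -> list[str]:
        """Walk up the hierarchy from a category to its root."""
        chain = []
        parent = self._items[name].parent
        while parent is not None:
            chain.append(parent)
            parent = self._items[parent].parent
        return chain

    def user_defined(self) -> list[str]:
        return sorted(c.name for c in self._items.values() if c.user_defined)


# Pre-instantiated ("core") objects; the core/user split below is an assumption.
registry = Registry()
for core_name in ("HUMAN", "ARTIFACT", "EVENT", "ANIMAL"):
    registry.add(DataCategory(core_name, "SemanticFeature"))

# A user extends the registry by refining an existing category.
registry.add(DataCategory("MAMMAL", "SemanticFeature", parent="ANIMAL", user_defined=True))
print(registry.ancestors("MAMMAL"), registry.user_defined())
```

Requiring the parent to exist before a category is added keeps user extensions attached to the shared hierarchy instead of forming disconnected islands.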

  14. ISO LMF - Lexical Markup Framework
      Designed to accommodate as many models of lexical representation as possible
      Its pros:
      • Meta-model: abstract high-level specification (ISO 24613)
      • Based on constants defined in the Data Category Registry: low-level specifications (ISO 12620)
      • Not a monolithic model, rather a modular framework
      • The LMF library provides the hierarchy of lexical objects (with structural relations among them)
      • The Data Category Registry provides a library of descriptors to encode linguistic information associated with lexical objects (N.B. Data Categories can also be user-defined)
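
The split between structure (lexical objects) and descriptors (data categories) can be made concrete with a small sketch. It builds one entry with Python's standard `xml.etree.ElementTree`, using the element names `LexicalEntry`, `Lemma` and `Sense` and the `feat att/val` pattern that appear elsewhere in this presentation. The helper names and the attribute names `partOfSpeech` and `writtenForm` are assumptions made for illustration, not quotations from the standard.

```python
import xml.etree.ElementTree as ET


def feat(parent: ET.Element, att: str, val: str) -> ET.Element:
    """Attach one descriptor (data category + value) to a lexical object."""
    return ET.SubElement(parent, "feat", att=att, val=val)


def make_entry(entry_id: str, written_form: str, pos: str, sense_ids: list) -> ET.Element:
    # Structure: which objects exist and how they nest.
    entry = ET.Element("LexicalEntry", id=entry_id)
    lemma = ET.SubElement(entry, "Lemma")
    # Descriptors: attribute names here are assumed, for illustration.
    feat(entry, "partOfSpeech", pos)
    feat(lemma, "writtenForm", written_form)
    for sense_id in sense_ids:
        ET.SubElement(entry, "Sense", id=sense_id)
    return entry


entry = make_entry("en_le_city", "city", "noun", ["en_s_city_1"])
print(ET.tostring(entry, encoding="unicode"))
```

Because every descriptor is a generic `feat` pair, a user-defined data category needs no change to the structural skeleton.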

  15. ISO LMF
      Builds on EAGLES/ISLE
      The field is mature
      Structural skeleton, with the basic hierarchy of information in a lexical entry, plus various extensions
      • Modular framework
      • LMF specs comply with UML modelling principles
      • An XML DTD allows implementation
      New initiatives …: LIRICS, ICT KYOTO, NEDO Asian Lang., LexInfo, NICT Language-Grid Service Ontology

  16. Principles of LMF: from very simple lexicons … (note: insert a PAROLE entry in LMF-compliant XML)

  17. … to very rich ones … DCR

  18. Mapping experiment (from Monica Monachini)
      Entries from major existing lexicons mapped to LMF
      Major best practices:
      • OLIF
      • PAROLE/SIMPLE
      • LC-Star (Speech Lexicon)
      • WordNet - EuroWordNet
      • FrameNet
      • BDef: formal database of lexicographic definitions derived from the Explanatory Dictionary of Contemporary French
      • To prove that the model is able to represent many best practices
      • To test the expressive potential and the adequacy of the architectural model & linguistic objects

  19. BioLexicon: SIMPLE model & ISO-LMF standard (from Monica Monachini)
      A unique large-scale computational lexicon (BL) in the biomedical domain, in terms of coverage & typology of information
      • Populated with information from available biomedical resources
      • Including both domain-specific & general-language words
      • Semi-automatically populated from corpora: population toolkit available

  20. The BioLexicon: why
      • LMF proved able to provide Text Mining systems in the biomedical domain with a substantial lexicon covering:
        • Biomedical term variants (orthographic, semantic, geographical, …)
          • better information retrieval
        • Terminological verbs and their combinatorial properties (subcategorization frames and predicate-argument structure)
          • better information extraction and question answering
        • Word derivations
          • to reach similar meanings expressed in different ways (e.g. activation vs. activate)

  21. KYOTO: the lexical resource perspective
      • KYOTO objectives
        • “… facilitating the exchange of information across languages, domains and cultures”
        • “… allow definition of word meaning in a shared Wiki platform”
      • From the point of view of linguistic resources …
        • the need to share lexical & knowledge bases, both general & domain-related, in the form of lexical repositories and ontologies

  22. KYOTO SYSTEM (architecture diagram, from Piek Vossen)
      Diagram labels: Source Documents; Linear MAF/SYNAF; Term extraction, Tybot; Semantic annotation; Generic TMF; Linear SEMAF; Fact extraction, Kybot; Domain editing, Wikyoto; Fact User; Concept User; LMF API; OWL API; Linear Generic FACTAF; Domain Wordnet; Domain ontology; Wordnet; Ontology

  23. A common representation format for WordNets
      Diagram labels: WnJP, WnIT, WnNL, WnEN, WnES, WnCH, WnEU
      Seven WordNets
      • similar but not identical → hampered interoperability
      • to be accessed both intra- and inter-linguistically → to support easier integration
      • endow WordNet with a representation format allowing easy access, integration & interoperability among resources

  24. A common representation format: WordNet-LMF (class diagram, from Monica Monachini)
      Classes shown: LexicalResource, GlobalInformation, Lexicon, SenseAxes, SenseAxis, LexicalEntry, Lemma, Sense, Synset, Definition, Statement, SynsetRelations, SynsetRelation, MonolingualExternalRefs, MonolingualExternalRef, InterlingualExternalRefs, InterlingualExternalRef, Meta; Data Categories
      Cardinalities shown on the associations: 0..1, 1..1, 0..*, 1..*

  25. Centralized WordNet DC Registry
      A list of 85 semantic relations, the result of a mapping of the KYOTO WordNet grid (Inter-WN, Intra-WN)

  26. WordNet-LMF multilingual level: cross-lingual relations (from Monica Monachini)
      <!ELEMENT SenseAxes (SenseAxis+)>
      <!ELEMENT SenseAxis (Meta?, Target+, InterlingualExternalRefs?)>
      <!ATTLIST SenseAxis
          id      ID    #REQUIRED
          relType CDATA #REQUIRED>
      <!ELEMENT Target EMPTY>
      <!ATTLIST Target
          ID CDATA #REQUIRED>
      <!ELEMENT InterlingualExternalRefs (InterlingualExternalRef+)>
      <!ELEMENT InterlingualExternalRef (Meta?)>
      <!ATTLIST InterlingualExternalRef
          externalSystem    CDATA #REQUIRED
          externalReference CDATA #REQUIRED
          relType (at|plus|equal) #IMPLIED>
      Example synsets:
        IWN    <fuoco_1, fiamma_1>          00001251-n
        SWN    <fuego_3, llama_1>           09686541-n
        WN3.0  <fire_1 flame_1 flaming_1>   13480848-n
      Callouts: groups monolingual synsets corresponding to each other and sharing the same relations to English; specifies the type of correspondence; link to ontology/(ies)
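
To show how an instance of this DTD fragment might look, the sketch below builds one `SenseAxis` for the three synsets on the slide. It uses only Python's standard library, which cannot validate against a DTD, so the content-model checks are done by hand. The function name, the axis id `sa_fire`, the `relType` value `eq_syn` (borrowed from the Named Entity Lexicon example) and the external reference `ExampleConcept` are assumptions; using bare synset offsets as `Target` IDs is also an assumption.

```python
import xml.etree.ElementTree as ET

# Allowed values of relType on InterlingualExternalRef, as enumerated in the DTD.
REL_TYPES = {"at", "plus", "equal"}


def sense_axis(axis_id, rel_type, targets, external_refs=()):
    """Build a SenseAxis element following the content model on the slide."""
    if not targets:
        raise ValueError("SenseAxis requires at least one Target (Target+)")
    axis = ET.Element("SenseAxis", id=axis_id, relType=rel_type)
    for synset_id in targets:
        ET.SubElement(axis, "Target", ID=synset_id)
    if external_refs:  # InterlingualExternalRefs is optional (?)
        refs = ET.SubElement(axis, "InterlingualExternalRefs")
        for system, reference, rel in external_refs:
            if rel not in REL_TYPES:
                raise ValueError(f"relType must be one of {sorted(REL_TYPES)}")
            ET.SubElement(
                refs,
                "InterlingualExternalRef",
                externalSystem=system,
                externalReference=reference,
                relType=rel,
            )
    return axis


axis = sense_axis(
    "sa_fire",                                   # hypothetical id
    "eq_syn",                                    # assumed relation type
    ["00001251-n", "09686541-n", "13480848-n"],  # IWN, SWN, WN3.0 synsets
    [("SUMO", "ExampleConcept", "at")],          # placeholder ontology link
)
print(ET.tostring(axis, encoding="unicode"))
```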

  27. Kyoto Knowledge Base (diagram)
      Diagram labels: WnJP, WnIT, WnNL, WnES, WnEN, WnEU, WnCH; Domain (repeated); Ontology; Domain Ontology

  28. LMF and Named Entity Lexicon (from Monica Monachini)
      • LRs enriched with NEs can be useful within QA to:
        • Find answers
        • Validate answers
      • Construction of a multilingual NE lexicon, automatically acquired
        • Source: Wikipedia → dynamic source, huge amount of NEs, some degree of structure
        • NEs extracted from Wikipedia and linked to entries of LRs and ontologies

  29. Named Entity Lexicon (from Monica Monachini)
      Callouts: Wikip, LR, Onto
      <Sense id="en_s_Florence">
        <SenseRelation targets="en_s_city_1">
          <feat att="semanticrelation" val="instance_of"/>
        </SenseRelation>
        <MonolingualExternalRef>
          <feat att="external_system" val="EnWikipedia"/>
          <feat att="external_reference" val="11525"/>
        </MonolingualExternalRef>
      </Sense>
      <SenseAxis id="sa_001" senses="en_s_Florence it_s_Firenze">
        <feat att="type" val="eq_syn"/>
        <InterlingualExternalRef>
          <feat att="external_system" val="SUMO"/>
          <feat att="external_reference" val="City"/>
          <feat att="external_reltype" val="at"/>
        </InterlingualExternalRef>
        <InterlingualExternalRef>
          <feat att="external_system" val="SIMPLE"/>
          <feat att="external_reference" val="Geopolitical_location"/>
          <feat att="external_reltype" val="at"/>
        </InterlingualExternalRef>
      </SenseAxis>
      <Sense id="en_s_city_1">
        <MonolingualExternalRef>
          <feat att="external_system" val="EnWordNet"/>
          <feat att="external_reference" val="noun.loc:city0"/>
        </MonolingualExternalRef>
      </Sense>
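
A consumer of such a lexicon would typically want to collect every external link attached to a sense, whether it sits on the sense itself or on a sense axis that includes it. The sketch below does this with Python's standard `xml.etree.ElementTree` on an abridged copy of the slide's markup. The `Fragment` wrapper element and the function names are assumptions added only so that the snippet parses as a single document.

```python
import xml.etree.ElementTree as ET

# Abridged from the slide; <Fragment> is a hypothetical wrapper element.
FRAGMENT = """
<Fragment>
  <Sense id="en_s_Florence">
    <MonolingualExternalRef>
      <feat att="external_system" val="EnWikipedia"/>
      <feat att="external_reference" val="11525"/>
    </MonolingualExternalRef>
  </Sense>
  <SenseAxis id="sa_001" senses="en_s_Florence it_s_Firenze">
    <InterlingualExternalRef>
      <feat att="external_system" val="SUMO"/>
      <feat att="external_reference" val="City"/>
    </InterlingualExternalRef>
    <InterlingualExternalRef>
      <feat att="external_system" val="SIMPLE"/>
      <feat att="external_reference" val="Geopolitical_location"/>
    </InterlingualExternalRef>
  </SenseAxis>
</Fragment>
"""


def feats(element):
    """Collect the direct <feat att=... val=...> children as a dict."""
    return {f.get("att"): f.get("val") for f in element.findall("feat")}


def external_links(root, sense_id):
    """Return (system, reference) pairs reachable from one sense."""
    links = []
    for sense in root.iter("Sense"):
        if sense.get("id") == sense_id:
            for ref in sense.findall("MonolingualExternalRef"):
                f = feats(ref)
                links.append((f["external_system"], f["external_reference"]))
    for axis in root.iter("SenseAxis"):
        # The senses attribute is a space-separated list of sense ids.
        if sense_id in axis.get("senses", "").split():
            for ref in axis.findall("InterlingualExternalRef"):
                f = feats(ref)
                links.append((f["external_system"], f["external_reference"]))
    return links


root = ET.fromstring(FRAGMENT)
print(external_links(root, "en_s_Florence"))
```

The Italian sense `it_s_Firenze` has no `Sense` element in this fragment, so it picks up only the two ontology links shared through the sense axis.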

  30. LexInfo & Previous Models (from Paul Buitelaar)
      • LingInfo: modeling morphosyntactic decomposition of (complex) terms [Buitelaar et al. 2006]
      • LexOnto: capturing syntactic behaviour and syntax-semantics links [Cimiano et al. 2007]
      • Lexical Markup Framework (LMF): ISO-standardised model for representing machine-readable lexica (agnostic about the connection with ontologies) [Francopoulo et al. 2007]
      • LexInfo: building on LMF as a core, develops a model which “subsumes” LingInfo and LexOnto for flexibly associating linguistic information with ontologies [Buitelaar, Cimiano, Haase, Sintek 2009]

  31. LMF: ILC infrastructure

  32. Desiderata for Semantic Roles (Martha Palmer)
      • First step:
      • What are semantic roles?
      • Why do we need standards?
      • Start with LIRICS
      • Consistently recognizable
      • Clarify sense distinctions
      • Generalizability
      • Learnable
      • Potential for inferencing

  33. Some steps for a “new generation” of LRs
      • From huge efforts in building static, large-scale, general-purpose LRs
        to dynamic LRs rapidly built on demand, tailored to specific user needs
      • From closed, locally developed and centralized resources
        to LRs residing in distributed places, accessible on the web, choreographed by agents acting over them
      • From Language Resources
        to Language Services
      • BUT: tools are needed to make this vision operational & concrete

  34. Lexical WEB & Content Interoperability
      As a critical step for semantic mark-up in the SemWeb
      Diagram labels: Global WordNet GRID; WordNets; NomLex; ComLex; SIMPLE-WEB; SIMPLE; LMF; Lex_x; Lex_y; BioLexicon; FrameNet; with intelligent agents
      Standards for Interoperability: enough??

  35. A new paradigm of R&D in LRs & LT: Distributed Language Services
      Open & distributed infrastructures for LRs & LT
      • Adopting the paradigm of accumulation of knowledge, so successful in more mature disciplines, based on sharing LRs & LTs
      • Ability to build on previous achievements, allowing effective cooperation of many groups on common tasks
        • Exchange and integrate information across repositories
        • Create new resources on the basis of existing ones
        • Compose new services on demand
      A new scenario implying:
      • content interoperability standards
      • development of architectures enabling accessibility
      • supra-national cooperation

  36. A few issues for discussion: “content”, guidelines, tools, priorities, ...
      • For Semantic Web & “content” interoperability: is the field ‘mature’ enough to converge also at the semantic/conceptual level (e.g. to automatically establish links among different languages)?
      • For the standards to have impact, ensure their usability & gain industry support by focusing on the requirements of industrial applications
      • Have Guidelines which are a “usable product” (to assist in the creation or adaptation of lexicons, …)
      • Facilitate acceptance of the standards by providing an open-source reference implementation platform & tools, related web services and test suites
      • Relation with the spoken language community
      • Define further steps necessary to converge on common priorities

  37. Limits observed & need for further work
      For usability & operability of LMF:
      • Data Categories (DC) & others:
        • From Japanese NEDO: DCs not defined in LMF & LMF not operational
        • Asian, African DCs
        • Need for DCs that are organised (easy to use): IsoCat & DC Selections/Profiles
        • Need for an ontology of DCs with structure/dependencies and constraints
        • Otherwise the model remains too abstract, and says nothing about how to implement the different layers concretely
      • Link with ontologies: lexicon-ontology relations
      • Need for easy, user-friendly guidelines
      • Need for tools to make it operational, also for creating standard-compliant resources: more important than the model!
      • More dissemination, also with industry
        • Linguists may (rightly, for certain purposes) not be interested
        • Younger colleagues are not aware of the past work on standards
      • Need for operational definitions of interoperability
      • Need for stimuli, also from the EC, to produce standard-compliant resources (unless differently motivated)

  38. Strengths
      • Good set of methodological principles: granularity of basic notions, …
      • Many languages already compliant with EAGLES morpho-syntax, etc.
      • Many projects today use LMF
      • Unified Lexicon experiment between Speechdat & Parole, at ELRA (possible because EAGLES-compliant)
      • Web services to access LRs based on standards
      • Web-based platforms for LR integration
      • An open infrastructure of LRT needs standards
      • New topics being constantly added: Time, Space, …

  39. Future requirements & planning
      To make LMF usable and operational:
      • LMF User Guidelines with examples
      • Mapping of commonly used lexicons into LMF → converters
      • Data categories for LMF lexicons
      • Tools related to LMF, with particular reference to the Lexus tool
      Need to address another layer:
      • The ontological layer in a lexicon
      • How lexicons and ontologies are linked and information is mapped from one to the other
      • An open space in a wiki environment
        • to store/link to guidelines, examples
        • to allow broad discussion on these topics
        • to ease dissemination of LMF

  40. FLaReNet mission: structure the area of LR & LT of the future
      • Worldwide Forum for LRs & LTs
      • Consolidate methods, approaches, common practices, architectures
      • Integrate so far partial solutions into broader infrastructures
      • A “roadmap”: a plan of coherent actions as input to policy development
        • For the EU, national organisations & industry
        • As a model for the LRs/LTs of the next years
      • Strengthening the language product market, e.g. for new products & innovative services
      • Identifying areas where consensus is achieved/emerging vs. areas where more discussion & testing is required
      • Indicating priorities
      • 333 Individual Subscribers
      • 88 Institutional Members from 31 countries

  41. Standards & Interoperability: topics for cooperation
      Some results from the FLaReNet Vienna Forum: International Cooperation
      • A metadata catalogue should involve every party
      • Common repositories for LRT, universally & easily accessible
      • Try to connect ongoing work done by many groups
      • A shared repository of data formats and annotations, where the most frequently used and preferred schemes can be found: a major help in achieving standardisation
      → For a new world-wide language infrastructure
      • Create the means to plug together different LRs & LTs, in a web-based resource and technology grid
      • Access to LRT is critical: it involves, and has an impact on, the whole community
      • With the possibility to easily create new workflows
      • Create conditions to easily share and re-use technologies, to have more open(-source) tools available for use also by under-funded groups

  42. Special Highlight: Contribute to building the LREC2010 Map! (FLaReNet & the LRE MAP … at LREC & COLING)
      The time is ripe to launch an important initiative, the LREC2010 Map of Language Resources, Technologies and Evaluation. The Map will be a collective enterprise of the LREC community, as a first step towards the creation of a very broad, community-built, Open Resource Infrastructure. First in a series, it will become an essential instrument to monitor the field and to identify shifts in the production, use and evaluation of LRs and LTs over the years.
      When submitting a paper (< 900!), from the START page fill in a very simple template to provide essential information about resources (in a broad sense: also technologies, standards, evaluation kits) either used for the work described or a new result of your research.
      Go to http://www.resourcebook.eu/LreMap!
