
Nicoletta Calzolari Istituto di Linguistica Computazionale - CNR - Pisa glottolo@ilc.cnr.it

Language Resources (lexicons, corpora, ontologies, …), Standards and Language Technologies (cont.) … and Projects



  1. Language Resources (lexicons, corpora, ontologies, …), Standards and Language Technologies (cont.) … and Projects. Nicoletta Calzolari, Istituto di Linguistica Computazionale - CNR - Pisa, glottolo@ilc.cnr.it. With many others at ILC. Dottorato (PhD course), Pisa, May 2009

  2. SIMPLE Model for a BioLexicon • Design a representational model for a BioLexicon, a comprehensive lexical resource able to integrate terminological, lexical and ontological info, compatible with HLT international standards (i.e. ISO) and able to meet the domain-specific requirements • Implement a BioLexicon database, a container with lexical objects to be filled with data provided by “populators” (EBI, UoM & CNR-ILC) • able to be automatically incremented with new terms and linguistic info extracted from texts. From Valeria Quochi

  3. BioLexicon building cycle: Term Repository (gather terms), EBI; Bio-Lexicon Population (variants; syntactic info of terms), UoM; Bio-Lexicon conceptual model and physical DB, ILC; Bio-events (extraction of bio-events), ILC; Terminology to Ontology, Jena/Rennes/EBI. From Valeria Quochi

  4. The BioLexicon: where from Incremental population process Existing repositories chemical compounds, species names, disease, enzymes genes/proteins Subclustering of term variants BioLexicon new genes/proteins names MEDLINE Named Entity Recognition Term Mapping by Normalisation Verbs, nouns, adjs, advs (variants, inflected forms, derivative relations, ...) Manual curation Subcat extraction Linguistic pre-processing Syn-sem mapping Manual annotation of a bio-event corpus Bio-event extraction from Simonetta Montemagni

  5. BioLexicon Model: High-level lexical objects, Data Categories e.g. <feat att="POS" val="VVZ"> <feat att="ConfScore" val="0.9"> <feat att="source" val="UNIPROT" …… Syntax Semantics from Valeria Quochi DC selection

  6. GeneRegOnto – BioLexConcepts to Predicates from Valeria Quochi

  7. regulate regulation Regulation PredRegulate Arg0Regulate Arg1Regulate PositiveProtein Regulation NegativeProtein Regulation regulator regulatee TranscriptionFactor Protein regulates NF-AT IL2 regulates isregulatedby bio semantic entry predicative argument structure bio event concept bio semantic roles bio entity concept Bio-specific qualia relations bio relations NF-AT positively regulates IL2 from Valeria Quochi

  8. Activity SynBehaviourLesion1 SenseLesion1 PredicateLESION SubcatFramepp-of BioLexicon Protein SynArg Arg0pp-of SemArg Arg0Pat. The pattern “lesion of PROTEIN” is not stored in the lexicon, but it can be computed by accessing information scattered over various lexical objects (i.e. the syntactic unit lesion heads a pp-of corresponding to the patient argument, which is restricted by the ontological node PROTEIN). All lexical items labelled as PROTEIN are candidates to fill this argument slot: lesion of OmpC, lesion of OmpR, etc. are all admitted instances of this “predicate”/pattern.
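The derivation described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the classes `SynArg`, `SemArg` and `SynSemLink`, the helper functions and the toy ontology are hypothetical simplifications, not the actual BioLexicon schema or API.

```python
from dataclasses import dataclass


# Hypothetical, simplified stand-ins for the lexical objects named on the slide.
@dataclass(frozen=True)
class SynArg:
    name: str         # e.g. "Arg0pp-of"
    realization: str  # syntactic realization, e.g. "pp-of" (a PP headed by "of")


@dataclass(frozen=True)
class SemArg:
    name: str         # e.g. "Arg0Pat"
    role: str         # semantic role, here "patient"
    restriction: str  # ontological node restricting the filler


@dataclass(frozen=True)
class SynSemLink:
    head: str         # lemma of the syntactic unit
    syn_arg: SynArg
    sem_arg: SemArg


def preposition(link):
    """Extract the preposition from a realization label such as "pp-of"."""
    return link.syn_arg.realization.split("-", 1)[1]


def derive_pattern(link):
    """Compute the abstract pattern, which is not stored anywhere as such."""
    return f"{link.head} {preposition(link)} {link.sem_arg.restriction}"


def instantiate(link, ontology):
    """List the admitted instances: every term labelled with the restricting node."""
    return [
        f"{link.head} {preposition(link)} {term}"
        for term, node in sorted(ontology.items())
        if node == link.sem_arg.restriction
    ]


# Toy data (assumed for illustration; only OmpC and OmpR come from the slide).
LESION = SynSemLink(
    head="lesion",
    syn_arg=SynArg("Arg0pp-of", "pp-of"),
    sem_arg=SemArg("Arg0Pat", "patient", "PROTEIN"),
)
ONTOLOGY = {"OmpC": "PROTEIN", "OmpR": "PROTEIN", "mouse": "SPECIES"}

print(derive_pattern(LESION))
print(instantiate(LESION, ONTOLOGY))
```

The point of the design is that the pattern lives nowhere as a single record: it is assembled on demand from the syntactic argument, the semantic argument and the ontology.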

  9. Good mapping of Relations. OBO Relations → Relations from Extended Qualia Structure (qualia roles: Agentive, Formal, Telic, Constitutive): isA → is_a; partOf → is_a_part_of; hasPart → has_as_part; GrainOf → …; hasGrain → …; componentOf → …; hasComponent → …; properPartOf → …; hasProperPart → …; locatedIn → …; locationOf → …; containedIn → …; contains → contains; adjacentTo → ?; derivesFrom → derived_from; precededBy → ?; participatesIn → ?; hasParticipant → ?; agentOf → …; hasAgent → ?; functionOf → is_the_activity_of; hasFunction → …; instanceOf → …

  10. Enhancing Semantic Relations BelongsToSpecies phosphoglycolate mouse from Valeria Quochi

  11. How to link Bio-Ontology and Bio-Lexicon: Place(s) of Semantics in BootStrep • Bio-Ontology holds domain-specific as well as general semantics (in terms of classes and relations between classes) • Lexicon model comes with a semantic layer based on a linguistic ontology (SIMPLE-CLIPS Ontology) Questions: • What is the relation between bio-ontology and linguistic ontology? • Do they overlap? What is the overlap/intersection? The difference? • Is a mapping possible? What could a mapping look like? Aim: • Bringing lexical semantics and ontological semantics together?

  12. The BioLexicon Model & Standards. The Bio-Lexicon is based on the MILE metamodel and the more recent ISO proposal of a Lexical Markup Framework (LMF). Data Categories are drawn as far as possible from already existing repositories and standards (e.g. morphosyntactic data categories). There is the need, however, to define a set of Data Categories specific to the biology domain (e.g. semantic roles and relations).

  13. ISO Meta-model & Data Categories An ISO standard for NLP lexica • Definition of the Lexical Markup Framework, a general & abstract meta-model & a set of structural nodes relevant for linguistic description Objectives • Design of the abstract lexical meta-model • Definition of the common set of related Data Categories The field is mature from Monica Monachini

  14. ISO - LMF • Specifically designed to accommodate as many models of lexical representation as possible • Its pros: • Meta-model: a high-level specification (ISO 24613) • Data Category Registry: low-level specifications (ISO 12620) • Not a monolithic model, rather a modular framework • LMF library provides the hierarchy of lexical objects (with structural relations among them) • Data Category Registry provides a library of descriptors to encode linguistic information associated with lexical objects (N.B. Data Categories can also be user-defined)

  15. ISO LMF – Lexical Markup Framework Builds also on EAGLES/ISLE Structural skeleton, with the basic hierarchy of information in a lexical entry + various extensions; LMF specs comply with UML modelling principles; an XML DTD allows implementation ICT KYOTO LIRICS NEDO Asian Lang. NICT Language-Grid Service Ontology

  16. LMF: NLP Extension for Semantics

  17. SyntacticBehaviour SB_protein Lexical Entry LE_protein Lemma L_protein Representation Frame RF_protein DC: writtenForm= protein Lexical Entry <LexicalEntry rdf:ID="LEprotein"> <hasSyntacticBehaviour rdf:resource="../../#SB_protein"/> <hasLemma> <Lemma rdf:ID="L_protein"/> <hasRepresentationFrame> <RepresentationFrame rdf:ID="RF_protein" /> </hasRepresentationFrame> </hasLemma> </LexicalEntry>
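As a minimal sketch of how such an entry could be generated programmatically, the following Python uses only the standard library. Several choices here are assumptions made for illustration and are not confirmed by the slide: the helper name `build_lexical_entry`, the `LE_`/`SB_`/`L_`/`RF_` naming scheme, the nesting of the representation frame under the lemma (following the diagram labels), and the encoding of the `writtenForm` data category as a child element.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ET.register_namespace("rdf", RDF_NS)


def rdf(attr):
    """Return the namespace-qualified name of an rdf:* attribute."""
    return f"{{{RDF_NS}}}{attr}"


def build_lexical_entry(lemma):
    """Build an LMF-style lexical entry for one lemma (hypothetical helper)."""
    entry = ET.Element("LexicalEntry", {rdf("ID"): f"LE_{lemma}"})
    ET.SubElement(entry, "hasSyntacticBehaviour", {rdf("resource"): f"#SB_{lemma}"})
    has_lemma = ET.SubElement(entry, "hasLemma")
    lemma_el = ET.SubElement(has_lemma, "Lemma", {rdf("ID"): f"L_{lemma}"})
    has_frame = ET.SubElement(lemma_el, "hasRepresentationFrame")
    frame = ET.SubElement(has_frame, "RepresentationFrame", {rdf("ID"): f"RF_{lemma}"})
    # Assumption: the data category writtenForm is attached as a child element.
    written = ET.SubElement(frame, "writtenForm")
    written.text = lemma
    return entry


entry = build_lexical_entry("protein")
print(ET.tostring(entry, encoding="unicode"))
```

The structural elements (entry, lemma, frame) play the role of the LMF skeleton, while `writtenForm` stands for a data category attached to one of those objects.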

  18. Event Representation through SemanticPredicate SemanticPredicate SP_regulate SemanticArgument SP_TF_protein DC: role=agent SemanticArgument SP_Target Gene DC: role=patient

  19. Sense Representation Synset activate <Sense rdf:ID="activate_2"> <belongsToSynset rdf:resource="#activate"/> <hasSemanticRelation rdf:resource="#is_a_1"/> <hasSemanticRelation rdf:resource="#has_as_part_1"/> <hasSemanticRelation rdf:resource="#object_of_the_activity_1"/> <hasSemanticFeature rdf:resource="#SF_chemistry"/> <hasSemanticFeature rdf:resource="#SF_process"/> </Sense> PredicativeRepresentation Sense activate_2 SemanticFeature SF_chemistry SF_process Collocation SemanticRelation is_a: [SenseID] Typical_of: [SenseID] S_protein

  20. Example of Semantic Relation <SemanticRelation rdf:ID="is_in"> <hasSourceSense> <Sense rdf:ID="S_cox15"> <id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">S_cox15</id> </Sense> </hasSourceSense> <hasTargetSense> <Sense rdf:ID="S_chromosome19"> <id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">S_chromosome19</id> </Sense> </hasTargetSense> <relationName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">is_in</relationName> </SemanticRelation> Sense S_cox15 SemanticRelation Is_in Sense S_chromosome19
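To show how a consumer might read such a relation back, here is a small Python sketch that parses the fragment into a (source, relation, target) triple. The enclosing `rdf:RDF` element and the function name `read_relations` are assumptions added so that the fragment parses as a standalone document; the slide shows only the `SemanticRelation` element.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

# The slide's fragment, wrapped in an rdf:RDF root (assumed) to declare the prefix.
SNIPPET = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}">
  <SemanticRelation rdf:ID="is_in">
    <hasSourceSense>
      <Sense rdf:ID="S_cox15">
        <id rdf:datatype="{XSD_STRING}">S_cox15</id>
      </Sense>
    </hasSourceSense>
    <hasTargetSense>
      <Sense rdf:ID="S_chromosome19">
        <id rdf:datatype="{XSD_STRING}">S_chromosome19</id>
      </Sense>
    </hasTargetSense>
    <relationName rdf:datatype="{XSD_STRING}">is_in</relationName>
  </SemanticRelation>
</rdf:RDF>
"""


def read_relations(xml_text):
    """Return every semantic relation as a (source, relation, target) triple."""
    root = ET.fromstring(xml_text.strip())
    rdf_id = f"{{{RDF_NS}}}ID"
    triples = []
    for rel in root.iter("SemanticRelation"):
        source = rel.find("hasSourceSense/Sense").get(rdf_id)
        target = rel.find("hasTargetSense/Sense").get(rdf_id)
        triples.append((source, rel.findtext("relationName"), target))
    return triples


TRIPLES = read_relations(SNIPPET)
print(TRIPLES)
```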

  21. Example: How to encode WordNet-type info in LMF

  22. XML-based Abstract Lexicon Interchange Format: Mapping exercise Major best practices: • OLIF • PAROLE/SIMPLE • LC-Star • WordNet - EuroWordNet • FrameNet • BDef formal database of lexicographic definitions derived from Explanatory Dictionary of Contemporary French • … • …others on the way… Entries from existing lexicons have been mapped to LMF to prove that the model is able to represent many best practices and achieve unification from Monica Monachini

  23. Lexical WEB & Content Interoperability ‘Standards’ • As a critical step for semantic mark-up in the SemWeb NomLex WordNets WordNets ComLex WordNets with intelligent agents SIMPLE LMF Lex_x FrameNet Lex_y Standards for Interoperability Enough??

  24. Need for tools to make this vision operational & concrete New prototype “LeXFlow”: • web-based collaborative environment for semi-automatic management/integration of lexical resources • enabling interoperability of distributed lexical resources • accessed by different types of agents • addressing semi-automatic integration of computational lexicons, with focus on linking and cross-lingual enrichment of distributed LRs • Case-study: cross-fertilization between Italian and Chinese WordNets • From Language Resources • To Language Services

  26. Our WN case study • ItalWordNet (Roventini et al., 2003) • Academia Sinica Bilingual Ontological WordNet (Sinica BOW, Huang et al., 2004) • Both connected to Princeton WordNet (although to different versions) • Same set of semantic relations (EWN ones)

  27. Architecture for cooperative integration of lexicons Agent Role3 Agent Role1 Agent Role4 Agent Role2 Coordination Web service Interface Simple-Wordnet Relation Calculator Application MultiWordnet Relation Calculator Web service Interface Italian Simple Italian Wordnet Chinese Wordnet ILI Mapper Relation Mapper Data

  28. Basic assumptions behind MWN … • Interlingual level: • Interlingua provides an indirect linkage between different WordNets: the Interlingual Index (ILI), an unstructured version of WordNet used in EuroWordNet • Each synset in a WN_A is linked to at least one record of the ILI by means of a set of relations (eq_synonym, eq_near_synonym, …) • Synset correspondence: • If there is an S_A and an S_B that point to the same ILI record, they are correspondent • Relation correspondence: • If there are two synsets in WN_A and a relation between them, the same relation holds between the corresponding synsets in WN_B
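The two correspondence rules can be sketched as follows. This is a toy illustration, not the LeXFlow or MultiWordNet implementation: the function names, the identifier scheme (`it:`, `zh:`, `ili:` prefixes) and the sample data are all hypothetical.

```python
from collections import defaultdict


def corresponding_synsets(links_a, links_b):
    """Synset correspondence: pair synsets of WN_A and WN_B sharing an ILI record.

    Each argument maps a synset id to the set of ILI records it is linked to.
    """
    by_ili = defaultdict(set)
    for syn_b, ilis in links_b.items():
        for ili in ilis:
            by_ili[ili].add(syn_b)

    correspondences = {}
    for syn_a, ilis in links_a.items():
        matches = set()
        for ili in ilis:
            matches |= by_ili.get(ili, set())
        if matches:
            correspondences[syn_a] = matches
    return correspondences


def project_relations(relations_a, correspondences):
    """Relation correspondence: propose each WN_A relation between the
    corresponding WN_B synsets. Proposals would still need validation."""
    proposed = set()
    for source, relation, target in relations_a:
        for b_source in correspondences.get(source, ()):
            for b_target in correspondences.get(target, ()):
                proposed.add((b_source, relation, b_target))
    return proposed


# Toy data with invented identifiers.
LINKS_IT = {"it:strada": {"ili:road"}, "it:carreggiata": {"ili:roadway"}}
LINKS_ZH = {"zh:dao_lu": {"ili:road"}, "zh:che_dao": {"ili:roadway"}}
RELATIONS_IT = [("it:strada", "has_meronym", "it:carreggiata")]

CORR = corresponding_synsets(LINKS_IT, LINKS_ZH)
print(project_relations(RELATIONS_IT, CORR))
```

Because the two wordnets never reference each other directly, all cross-lingual information flows through the ILI; a relation missing in one wordnet can therefore be proposed from the other.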

  29. parte, tratto N#12348 iperonimia/HYP A new proposed mero relation passaggio, strada,via N#1290 meronimy/MPT curvatura, svolta,curva N#20944 iponimia/HPO carreggiata N#21225 Synonym Derived ILI1.5-3001757-n road,route ILI1.6-3243979-n ILI1.5-5691718-n stretch ILI1.6-??? ILI1.5-2857000-n passage ILI1.6-3092396-n ILI1.5-3002522-n roadway ILI1.6-3245327-n ILI1.5-8488101-n bend,crook,turn ILI1.6-9992072-n Synonym Reinforcement & validity tong_dao (通道) N#03092396 上位(泛稱)詞_為/HYP che_dao (車道) N#3245327 dao_lu,dao,lu (道路,道,路) N#03243979 下位(特指)詞_為/HPO wan (彎) N#9992072 部件_部份詞_為/MPT

  30. 00403772-v HYP 00001533-v 吸 00407124-v HPO eq_syn HYP CAU eq_syn eq_syn 00462055-a Respective, several, various 00364361-a 00403772-v acquire_knowledge 00335115-v 00406975-v Absorb, assimilate Ingest, take_in 00338206-v 00407124-v imbibe 00338343-v 01513366-v receive, have 01260836-v eq_near_syn eq_near_syn eq_syn has_hyponym V#32925 studiare_3, imparare_1, apprendere_2 V#39802 prendere_3 eq_syn has_hyperonym has_hyperonym V#32080 assimilare_5, assorbire_3, accettare_2, recepire_1 AG#42011 relativo_4 causes Derived

  31. For a Global WordNet Grid • This architecture for making distributed wordnets interoperable lends itself to different applications in LR processing: • Enrichment of existing lexical resources • Creation of new resources • Validation of existing resources • Can provide a platform for cooperative & collective creation & management of LRs, by providing a web-based environment for the collaboration & interaction of distributed agents and resources Can be seen as the • Prototype of a web application supporting the Global WordNet Grid initiative, i.e. a shared multi-lingual knowledge base for cross-lingual processing based on distributed resources over the Grid New project: KYOTO

  32. Distributed, diverse & dynamic data 1 Citizens 4 Governments maintain terms & concepts Companies Wikyoto Capture text: "Sudden increase of CO2 emissions in 2008 in Europe" Ontology 2 Top Abstract Physical Tybot: term yielding robot Wordnets Process Substance 3 CO2 emission Middle H2O CO2 H2O Pollution CO2 Emission Greenhouse Gas Domain Kybot: knowledge yielding robot Index facts: Process: Emission Involves: CO2 Property: increase, sudden When: 2008 Where: Europe 5 6 Text & Fact Index Semantic Search Environmental organizations from Piek Vossen
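The indexed fact shown on this slide can be pictured as a small structured record that a semantic search then filters on. The sketch below is an assumption-laden illustration: the `Fact` class, its field names and the `search` function are hypothetical and do not reproduce the KYOTO fact format or any Kybot API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One indexed fact, with the slots listed on the slide (hypothetical layout)."""
    process: str
    involves: str
    properties: tuple
    when: str
    where: str


# The fact extracted from "Sudden increase of CO2 emissions in 2008 in Europe".
FACT_INDEX = [
    Fact(
        process="Emission",
        involves="CO2",
        properties=("increase", "sudden"),
        when="2008",
        where="Europe",
    )
]


def search(index, **criteria):
    """Return the facts whose slots equal every given criterion."""
    return [
        fact
        for fact in index
        if all(getattr(fact, slot) == value for slot, value in criteria.items())
    ]


print(search(FACT_INDEX, involves="CO2", where="Europe"))
```

Searching on slots rather than on surface strings is what lets a query about CO2 in Europe match a sentence that never uses those exact query words together.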

  33. TEXT ontology Wordnet Linear DAF Domain Wordnet domain ontology Discourse Annotation LMF API OWL API Linear MAF Morphological Annotation Language Specific Domain Terms Linear SYNAF Syntactic Annotation Generic TMF Linear SEMAF Semantic Annotation Term Extraction (Tybot) Language Neutral Linear Generic FACTAF Language Neutral & Specific Fact Extraction (Kybot) from Piek Vossen

  34. System components • Wikyoto = wiki environment for a social group: • to model the terms and concepts of a domain and agree on their meaning, within group, across languages and cultures • to define the types of knowledge and facts of interest • Tybots = Term extraction robots, extract term data from text corpus • Kybots = Knowledge yielding robots, extract facts from a text corpus • Linguistic processors: • tokenizers, segmentizers, taggers, grammars • named entity recognition • word sense disambiguation • generate a layered text annotation in Kyoto Annotation Format (KAF) from Piek Vossen

  35. KYOTO SYSTEM Linear SYNAF/SEMAF Term extraction (Tybot) Semantic annotation Generic TMF Linear SEMAF Fact extraction (Kybot) Domain editing (Wikyoto) Fact User Concept User LMF API OWL API Linear Generic FACTAF Domain Wordnet Domain ontology Wordnet Ontology from Piek Vossen

  36. Source Documents Morpho-syntactic analysis [[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP] NP Fact mining by Kybots Linguistic Processors Ontology Logical Expressions Wordnets & Linguistic Expressions Generic Abstract Physical Fact analysis Patient [[the emission]NP ] Process: e1 [of greenhouse gases]PP Patient: s2 [in agricultural areas]PP] Location: a3 Substance Process Chemical Reaction H2O CO2 Domain Patient CO2 emission water pollution from Piek Vossen

  37. environment facts Wordnet environment terms Ontology environment concepts Contribution of KYOTO • hundreds of thousands of sources in the environment domain • in many different languages • spread all over the world • changing every day • KYOTO delivers a Web 2.0 environment for community-based control • Connects people across languages and cultures • Establishes consensus and knowledge transition • KYOTO learns terms and concepts from text documents, • Stored as structures that people and computers understand • KYOTO enables semantic search and fact extraction • Software can partially understand language and exploit web 1 data • Understanding is helped by the terms and concepts defined for each language html pdf xls KYBOT WIKYOTO TYBOT from Piek Vossen

  38. A common representation format: WordNet-LMF Data Categories LexicalResource 1..* 0..1 1..1 GlobalInformation Lexicon SenseAxes 1..* 0..* 1..* 0..1 Meta Synset SenseAxis LexicalEntry 0..1 0..1 0..* 0..1 0..1 1..1 MonolingualExternalRefs InterlingualExternalRefs Lemma Sense Definition SynsetRelations 0..1 0..* 1..* 1..* 1..* MonolingualExternalRefs MonolingualExternalRef InterlingualExternalRef Statement SynsetRelation 0..1 0..1 0..1 1..* MonolingualExternalRef Meta Meta Meta 0..1 Meta from Monica Monachini

  39. Centralized WordNet DC Registry A list of 85 semantic relations as a result of a mapping of the KYOTO WordNet grid Intra-WN Inter-WN from Monica Monachini

  40. WordNet-LMF multilingual level - Cross-lingual synset relations <!ELEMENT SenseAxes (SenseAxis+)> <!ELEMENT SenseAxis (Meta?, Target+, InterlingualExternalRefs?)> <!ATTLIST SenseAxis id ID #REQUIRED relType CDATA #REQUIRED> <!ELEMENT Target EMPTY> <!ATTLIST Target ID CDATA #REQUIRED> <!ELEMENT InterlingualExternalRefs (InterlingualExternalRef+)> <!ELEMENT InterlingualExternalRef (Meta?)> <!ATTLIST InterlingualExternalRef externalSystem CDATA #REQUIRED externalReference CDATA #REQUIRED relType (at|plus|equal) #IMPLIED> IWN <fuoco_1, fiamma_1> 00001251-n SWN <fuego_3, llama_1> 09686541-n groups monolingual synsets corresponding to each other and sharing the same relations to English WN3.0 <fire_1 flame_1 flaming_1> 13480848-n specifies the type of correspondence link to ontology/(ies) from Monica Monachini
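A sense axis obeying the DTD fragment above could be assembled as in the following sketch. It is an illustration under assumptions: the helper `build_sense_axis`, the axis id `sa_fire`, the `relType` value `eq_synonym` and the prefixed target ids (`iwn-…`, `swn-…`, `eng-30-…`) are hypothetical; only the element names, attribute names and the three synset offsets come from the slide. The standard-library `xml.etree.ElementTree` does not validate against a DTD, so the content-model constraints are checked by hand.

```python
import xml.etree.ElementTree as ET

ALLOWED_REF_TYPES = ("at", "plus", "equal")


def build_sense_axis(axis_id, rel_type, target_ids, external_refs=()):
    """Build a <SenseAxis> grouping corresponding synsets (hypothetical helper).

    external_refs is a sequence of (externalSystem, externalReference, relType)
    tuples, where relType may be None because the DTD marks it #IMPLIED.
    """
    if not target_ids:
        raise ValueError("the DTD requires at least one Target (Target+)")
    axis = ET.Element("SenseAxis", {"id": axis_id, "relType": rel_type})
    for target_id in target_ids:
        ET.SubElement(axis, "Target", {"ID": target_id})
    if external_refs:
        refs = ET.SubElement(axis, "InterlingualExternalRefs")
        for system, reference, ref_type in external_refs:
            attrs = {"externalSystem": system, "externalReference": reference}
            if ref_type is not None:
                if ref_type not in ALLOWED_REF_TYPES:
                    raise ValueError(f"relType must be one of {ALLOWED_REF_TYPES}")
                attrs["relType"] = ref_type
            ET.SubElement(refs, "InterlingualExternalRef", attrs)
    return axis


# The fire/flame synsets from the slide, with invented id prefixes.
AXIS = build_sense_axis(
    "sa_fire",
    "eq_synonym",
    ["iwn-00001251-n", "swn-09686541-n", "eng-30-13480848-n"],
)
print(ET.tostring(AXIS, encoding="unicode"))
```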

  41. Ultimate goal • Global standardization and anchoring of meaning such that: • Machines can start to approach text understanding -> semantic web connects to the current web • Communities can dynamically maintain knowledge, concepts and their terms in an easy to use system • Cross-linguistic and cross-cultural sharing and communication of knowledge is enabled • Comparable to a formalization of Wikipedia for humans AND machines across languages from Piek Vossen

  42. Some steps for a “new generation” of LRs • From huge efforts in building static, large-scale, general-purpose LRs • To non-static LRs rapidly built on demand, tailored to specific user needs • From closed, locally developed and centralized resources • To LRs residing over distributed places, accessible on the web, choreographed by agents acting over them • From Language Resources • To Language Services

  43. Distributed Language Services A long-term scenario implying • content interoperability standards, • supra-national cooperation and • development of architectures enabling accessibility • Create new resources on the basis of existing • Exchange and integrate information across repositories • Compose new services on demand • Collaborative & collective/social development and validation, cross-resource integration and exchange of information Language Grid Wiki

  44. In the “Semantic Web” vision ... … need to tackle the twofold challenge of • content availability & • multilinguality • Natural convergence with HLT: • multilingual semantic processing • ontologies • semantic-syntactic computational lexicons

  45. Language Tech … & … Knowledge, Content Ready??? Knowledge Markup LT & LRs Semantic Web How to cooperate?? Content Interoperable LRs & LT

  46. LR and the future of LT or Content Tech. The need for ever-growing and richer LRs for effective multilingual content processing requires a change of paradigm and the design of a new generation of LRs, based on open content interoperability standards. The Semantic Web notion may be used to shape the LRs of the future, in the vision of an open space of sharable knowledge available on the Web for processing. The effort of making available millions of “richly annotated words” for dozens of languages is not affordable by any single group. This objective can only be achieved by creating integrated Open and Distributed Linguistic Infrastructures. Not only linguistic experts can participate in these: they may also include designers, developers, users of content encoding practices, etc., in wiki mode. Is the LR/LT field mature enough to broaden and open itself to the concept of a cooperative effort of different sets of communities? Could a sort of large “Language Genome” initiative be effective? Storing lots of (annotated) facts

  47. Today, many signs of vitality & success … for LRs • In Spoken, Written, Multimodal areas … … in new emerging areas • Statistical approaches… • Different dimensions & layers: Content (Ontologies), Emotion, Time, … • For Evaluation • For Training • … • LREC (> 900 submissions); many LRs at COLING and even at ACL!! • ELRA (self-sustaining) & LDC • LRE (new Journal: N. Ide & NC) • ISO-TC37-SC4/WG4 (International Standards for LRs) • AFNLP… • FLaReNet • ESFRI - CLARIN (also political & strategic role) • New calls or initiatives in EU, US, ASIA, on LRs, interoperability, cooperation, …

  48. BUT … an important point In the ’90s • There was a global vision of the field & its main components: • Standards • Creation of LRs • Distribution Then: • Automatic acquisition … towards the Infrastructure of LRs & LT ELRA LDC While today: • There is an ever increasing set of initiatives for new LRs, basic robust technologies, models??, algorithms, • We have a LR community culture • BUT sort of scattered, opportunistic, not much coherence

  49. Today … The wealth of data & of basic technologies is such that: • We should reflect again on the field as a whole & ask if • Standards • Creation of LRs • Automatic acquisition • Distribution are still “the” important components, or how they have changed/must change • Content interoperability • Collaborative creation & management • Dynamic LRs • Sharing … Which new challenges towards a new & more mature infrastructure of LRs & LTs??

  50. These dimensions could be at the basis of a new Paradigm for LRs & LT & of a new Infrastructure ?? • Content interoperability • Collaborative creation & management Need more • Dynamic LRs Technology exists • Sharing + • Distributed architectures/infrastructures
