
Christian Chiarcos chiarcos@uni-potsdam.de



Presentation Transcript


  1. Ontologies of Linguistic Annotation: Towards the application of Semantic Web technologies in corpus linguistics and NLP. Christian Chiarcos, chiarcos@uni-potsdam.de

  2. Ontologies of Linguistic Annotation • Background • How to deal with the heterogeneity of linguistic annotations? • Ontologies of Linguistic Annotation (OLiA) • Linking annotations and terminology repositories • Applications • Corpus querying • NLP

  3. Background: The task • Differences ... among different language resources and individual system objectives ... lead to variations in data category definitions and data category names. • The use of uniform data category names and definitions ... contributes to system coherence and enhances the re-usability of data. (Ide & Romary 2004)

  4. Background: The solution I • General Ontology of Linguistic Description (GOLD) • ... large amounts of linguistic data on the Web ... from different languages can be automatically searched and compared ... • ... the data and the various encoding schemes in which they are represented need an explicit semantics. • ... a data model ... which is consistent with ... the Semantic Web ... (Farrar & Langendoen 2003)

  5. Background: The solution II • ISO TC37/SC4 Data Category Registry (DCR) • ... a family of data category standards designed to meet the needs of terminologists and other language experts developing a variety of electronic linguistic resources. ... • ... to ensure interoperability among these domains ... • ... with an eye to facilitating ... wide-scale information handling environments such as the Semantic Web ... (Wright 2004)

  6. Background: The solution III-VIII • Documentation standards in typology • EUROTYP (Bakker et al. 1993) • AUTOTYP (Bickel & Nichols 2002) • Typological Database System ontology (Dimitriadis et al. 2009) • Standardization initiatives and multi-language tagsets • EAGLES (Leech & Wilson 1996) • MULTEXT/East (Erjavec 2010) • Common POS tagset for Indian languages (Baskaran et al. 2008)

  7. Background: Another problem • Imagine you plan to develop a tool that makes use of a terminology repository. Which one would you choose? • Similar goals, but different definitions • Integration efforts have only just begun ... (RELISH*) * RELISH workshop, Aug 2010, http://www.mpi.nl/research/research-projects/language-archiving-technology/events/relish-workshop

  8. Background: Another problem • Imagine you plan to develop a tool that makes use of a terminology repository. Which one would you choose? • Maybe it's not even our choice ... • ... our clients may have their own preferences • ... and different clients may have different preferences

  9. Ontologies of Linguistic Annotation • Background • Ontologies of Linguistic Annotation (OLiA) • Linking annotations and terminology repositories • Applications • Corpus querying • NLP

  10. OLiA: Architecture • OLiA: Ontologies of Linguistic Annotations • conceptual integration • represent tagsets and their semantics in a formal and systematic way • „Reference Model“ • interface between annotations and (multiple) terminology repositories [Figure: architecture diagram with Annotation Models, the OLiA Reference Model, and Terminology Repositories]

  11. OLiA: Research Background • developed at the Collaborative Research Center (SFB) 632 „Information Structure“ (Potsdam & Berlin) • 2006-2008 in cooperation with Collaborative Research Center (SFB) 441 „Linguistic Data Structures“ (Tübingen) • since 2007 within the SFB 632 project „Linguistic Database“

  12. OLiA: Research Background • part of an infrastructure to integrate and access heterogeneous linguistic corpora • PAULA format • integrate different formats • ANNIS database • access data created by different tools • OLiA ontologies • represent tagsets and their semantics in a formal and systematic way

  13. OLiA: Ontology (Information Technology) • Ontology • Conceptualization of a knowledge domain • e.g., taxonomy of linguistic categories • hierarchical and relational structure • OWL (Web Ontology Language)* • formal description language • XML • Semantic Web * Web Ontology Language, http://www.w3.org/2004/OWL/ (10.10.08)

  14. OLiA: Ontologies of Linguistic Annotation • modular OWL/DL ontologies • Annotation Models • annotation scheme • OLiA Reference Model • common terminology • External Reference Models • existing terminology repositories • OLiA Reference Model • interface between annotations and (multiple) terminology repositories [Figure: architecture diagram with Annotation Models, the OLiA Reference Model, and Terminology Repositories]

  15. OLiA: Reference Model • harmonization of repositories of annotation terminology • morphosyntax & morphology • 31 schemes • 51 languages* • syntax, discourse structure, anaphora, information structure [Figure: architecture diagram with Annotation Models, the OLiA Reference Model, and Terminology Repositories] * including multilingual annotation schemes: Tapanainen & Järvinen (1997), Dipper et al. (2007), and Erjavec (2010)

  16. OLiA Reference Model [Figure: excerpt of the Reference Model] • concepts • PronounOrDeterminer is-a Morphosyntactic Category • Determiner is-a PronounOrDeterminer • Demonstrative Determiner is-a Determiner • Case is-a Morphological Feature • Accusative Case is-a Case • properties • hasCase(x, y) with x : MorphosyntacticCategory, y : Case

  17. OLiA Annotation Models • OWL/DL formalizations of annotation schemes • structure similar to the Reference Model • individuals represent annotation values • hasTag property • string value of annotation [Figure: architecture diagram with Annotation Models, the OLiA Reference Model, and Terminology Repositories]

  18. OLiA: The TIGER/STTS Annotation Models [Figure: excerpt of the Annotation Models] • concepts • Demonstrative Pronoun is-a Pronoun • Attributive Demonstrative Pronoun is-a Demonstrative Pronoun • Substitutive Demonstrative Pronoun is-a Demonstrative Pronoun • Case is-a Feature • individuals (instance_of the concepts above) • PDAT, hasTag „PDAT“ • PDS, hasTag „PDS“ • Acc, hasTag „...Acc...“ • STTS: German parts of speech (Schiller et al. 1996) • TIGER: German morphology (Brants et al. 2001)
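
The way individuals carry the string value of an annotation via hasTag can be sketched roughly as follows. This is only an illustration: the tuple layout, the function name `individuals_for`, and the "exact" versus "contains" matching modes are assumptions made here (the „...Acc...“ notation on the slide merely suggests that morphological tags are matched as part of a longer string), and Acc is placed directly under Case for simplicity.

```python
# Hypothetical excerpt of the STTS/TIGER Annotation Models shown on the slide.
# Each entry: (individual, concept it instantiates, hasTag string, match mode).
# The match modes are an assumption of this sketch, not part of the slides.
INDIVIDUALS = [
    ("PDAT", "AttributiveDemonstrativePronoun", "PDAT", "exact"),
    ("PDS", "SubstitutiveDemonstrativePronoun", "PDS", "exact"),
    ("Acc", "Case", "Acc", "contains"),
]


def individuals_for(annotation):
    """Return (individual, concept) pairs whose hasTag value matches the annotation string."""
    matches = []
    for name, concept, tag, mode in INDIVIDUALS:
        exact_hit = mode == "exact" and annotation == tag
        partial_hit = mode == "contains" and tag in annotation
        if exact_hit or partial_hit:
            matches.append((name, concept))
    return matches


print(individuals_for("PDAT"))        # part-of-speech tag of "Diese"
print(individuals_for("Acc.Sg.Fem"))  # morphological annotation of "Diese"
```

Once a string has been resolved to an individual, everything downstream can work with the concept rather than with the tag itself.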

  19. OLiA: The TIGER/STTS Annotation Models • annotation • Diese/PDAT/Acc.Sg.Fem nicht/ADV neue/ADJA/Acc.Sg.Fem Erkenntnis/NN/Acc.Sg.Fem ("this not new insight") [Figure: excerpt of the Annotation Models] • concepts • Demonstrative Pronoun is-a Pronoun • Attributive Demonstrative Pronoun is-a Demonstrative Pronoun • Substitutive Demonstrative Pronoun is-a Demonstrative Pronoun • Case is-a Feature • individuals • PDAT • PDS • Acc • hasTag „PDAT“ • hasTag „PDS“ • hasTag „...Nom...“ • STTS: German parts of speech (Schiller et al. 1996) • TIGER: German morphology (Brants et al. 2001)

  20. OLiA: The TIGER/STTS Annotation Models • annotation • Diese/PDAT/Acc.Sg.Fem nicht/ADV neue/ADJA/Acc.Sg.Fem Erkenntnis/NN/Acc.Sg.Fem ("this not new insight") [Figure: excerpt of the Annotation Models] • concepts • Demonstrative Pronoun is-a Pronoun • Attributive Demonstrative Pronoun is-a Demonstrative Pronoun • Substitutive Demonstrative Pronoun is-a Demonstrative Pronoun • Case is-a Feature • individuals • PDAT • PDS • Acc • hasTag „...Nom...“ • hasTag „PDAT“ • hasTag „PDS“ • STTS: German parts of speech (Schiller et al. 1996) • TIGER: German morphology (Brants et al. 2001)

  21. OLiA: Linking • Annotation Model concepts are defined as subclasses of Reference Model concepts • properties as sub-properties • individuals as instances • The linking is physically separated from the models • one possible interpretation of Annotation Model concepts in terms of the Reference Model [Figure: architecture diagram with Annotation Models, the OLiA Reference Model, and Terminology Repositories]

  22. OLiA: Linking [Figure: the STTS Annotation Model linked to the OLiA Reference Model] • OLiA Reference Model • PronounOrDeterminer is-a Morphosyntactic Category • Determiner is-a PronounOrDeterminer • Pronoun is-a PronounOrDeterminer • Demonstrative Determiner is-a Determiner • Demonstrative Pronoun is-a Pronoun • STTS Annotation Model • Attributive Demonstrative Pronoun is-a Demonstrative Determiner (linking) • PDAT instance_of Attributive Demonstrative Pronoun

  23. OLiA: Linking (morphosyntax) • English (7 annotation models) • German (4 annotation models) • Russian (3 annotation models) • Multext-East schemes (15 languages) • Connexor (6 languages) • Tibetan (4 languages) • Old High German, Old Norse • Tagset for typological studies • more than 30 languages • many, but not exclusively African languages

  24. OLiA: Linking • OLiA Reference Model further linked to terminological repositories • if they are modelled in OWL/DL • GOLD (Chiarcos 2008) • DCR (Chiarcos 2010) • OntoTag (Buyko et al. 2008) • TDS [Figure: architecture diagram with Annotation Models, the OLiA Reference Model, and Terminology Repositories]

  25. OLiA: Achievements • Annotations are mapped onto concepts in an Annotation Model • The Annotation Model is linked with the OLiA Reference Model and further terminology repositories • Annotations can be described in terms of these models independently from their original string representation • novel applications

  26. Application • Background • Ontologies of Linguistic Annotation (OLiA) • Application • Annotation formalization & documentation • Ontology-based corpus querying • corpus browsing with ontology-based machine learning • concept-based corpus querying • NLP • interface specifications in NLP pipelines • preprocessing for Semantic Web applications • ensemble combination

  27. Application • Background • Ontologies of Linguistic Annotation (OLiA) • Application • Annotation formalization & documentation* • Ontology-based corpus querying • corpus browsing** • concept-based corpus querying • NLP • interface specifications in NLP pipelines*** • preprocessing for Semantic Web applications**** • ontology-based ensemble combination * Ch. Chiarcos (2008) An Ontology of Linguistic Annotations. LDV Forum (GLDV-Journal for Computational Linguistics and Language Technology) 23 (2008):1-16. ** S. Hellmann et al. (accepted). The TIGER Corpus Navigator. accepted at the 9th Int. Workshop on Treebanks and Linguistic Theories (TLT9), Dec 3-4, 2010. Tartu, Estonia. *** E. Buyko, Ch. Chiarcos, and A. Pareja Lora (2008) Ontology-Based Interface Specifications for an NLP pipeline architecture. In: Proc. LREC. Marrakech, Morocco, May 2008. **** S. Hellmann (2010), The Semantic Gap of Formalized Meaning. In: The Semantic Web: Research and Applications (LNCS 6089/2010), 462-466

  28. Application: Corpus querying (Chiarcos & Götze 2007) • ontological description • generated corpus query • ANNIS1 + OntoClient [Figure: screenshot of ontology-based query generation] Chiarcos, Christian and Michael Götze (2007) A Linguistic Database with Ontology-sensitive Corpus Querying. GLDV-Frühjahrstagung, Tübingen, Germany.

  29. Application: Corpus querying • OntoClient • experimental Java package, based on Pellet • preprocessor for corpus queries • for every concept in the Reference Model: • retrieve associated individuals • generate a set of possible tags • set operators • intersection (&, and) • join (|, or) • intersection with complement (\, without) • generates corpus query • can be adapted to different query languages

  30. Application: Corpus querying • original query: ... pos in { Determiner \ Article } & cat = ... • 1. consult the ontology: retrieve tags for every expression that refers to a concept in the ontology • 2. apply operators • return modified corpus query: ... pos = PDAT | PWAT | ... & cat = ... [Figure: Reference Model hierarchy (Morphosyntactic Category, PronounOrDeterminer, Determiner, Pronoun, Demonstrative Determiner, Demonstrative Pronoun) linked to the STTS Annotation Model (Attributive Demonstrative Pronoun, individual PDAT)]
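
The two steps on this slide (look up the tags for each concept, then apply the set operators) can be sketched in a few lines. This is not the OntoClient implementation, which is a Java package based on Pellet; the function names and the toy concept-to-tag table below are assumptions made for illustration, and a real table would be derived from the linked Annotation Model.

```python
# Toy concept -> tag table; an assumption of this sketch. In the architecture
# described on the slides it would be retrieved from the ontology.
TAGS_BY_CONCEPT = {
    "Determiner": {"ART", "PDAT", "PWAT"},
    "Article": {"ART"},
}

# The three set operators named on the slides.
OPERATORS = {
    "&": set.intersection,   # and
    "|": set.union,          # or
    "\\": set.difference,    # without
}


def expand(left, operator, right):
    """Step 1: look up the tags of both concepts. Step 2: apply the operator."""
    return OPERATORS[operator](TAGS_BY_CONCEPT[left], TAGS_BY_CONCEPT[right])


def to_query(attribute, tags):
    """Render the resulting tag set as a disjunction over plain tag strings."""
    return f"{attribute} = " + " | ".join(sorted(tags))


print(to_query("pos", expand("Determiner", "\\", "Article")))
# prints: pos = PDAT | PWAT
```

Because only `to_query` knows the target syntax, swapping it out would be one way to adapt the preprocessor to a different query language.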

  31. Application: Ensemble combination • Brill & Wu (1998) • Classifier Combination for Improved Lexical Disambiguation • errors made by three POS taggers are strongly complementary • combination => increase of accuracy • 6.9% error reduction by simple voting

  32. Application: Ensemble combination with ontologies • limitations • classifiers have to make use of the same annotation scheme • combining different annotation schemes may not only increase accuracy but also the level of detail • ensemble combination with ontologies • abstract from string-based annotations • operate on conceptual representations

  33. Ensemble combination with OLiA: Generating ontological descriptions [Figure: the STTS Annotation Model linked to the OLiA Reference Model] • OLiA Reference Model • PronounOrDeterminer is-a Morphosyntactic Category • Determiner is-a PronounOrDeterminer • Pronoun is-a PronounOrDeterminer • Demonstrative Determiner is-a Determiner • Demonstrative Pronoun is-a Pronoun • STTS Annotation Model • Attributive Demonstrative Pronoun is-a Demonstrative Determiner (linking) • PDAT instance_of Attributive Demonstrative Pronoun

  34. Ensemble combination with OLiA: Generating ontological descriptions • annotation • Diese/PDAT/Acc.Sg.Fem nicht/ADV neue/ADJA/Acc.Sg.Fem Erkenntnis/NN/Acc.Sg.Fem ("this not new insight") • tag-set independent description of PDAT • rdf:type(olia:DemonstrativeDeterminer) • rdf:type(olia:Determiner) • rdf:type(olia:PronounOrDeterminer) [Figure: PDAT instance_of Attributive Demonstrative Pronoun (STTS Annotation Model), linked to Demonstrative Determiner, Determiner, PronounOrDeterminer and Morphosyntactic Category in the OLiA Reference Model]
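
How the tag-set independent description falls out of the linking can be sketched as a walk up the is-a chain. The dictionaries below are a hand-written excerpt of the hierarchy on the slides, not the actual OWL files; the `stts:` prefix and the function name `reference_types` are assumptions of this sketch, and a real system would presumably query an OWL reasoner instead.

```python
# Hand-written excerpt of the linked hierarchy (child -> parent).
# The "stts:" prefix is a hypothetical label for Annotation Model concepts.
IS_A = {
    "stts:AttributiveDemonstrativePronoun": "olia:DemonstrativeDeterminer",
    "olia:DemonstrativeDeterminer": "olia:Determiner",
    "olia:Determiner": "olia:PronounOrDeterminer",
    "olia:PronounOrDeterminer": "olia:MorphosyntacticCategory",
}
INSTANCE_OF = {"PDAT": "stts:AttributiveDemonstrativePronoun"}


def reference_types(tag):
    """Collect all Reference Model concepts reachable from the tag's individual."""
    concept = INSTANCE_OF[tag]
    types = []
    while concept in IS_A:
        concept = IS_A[concept]
        if concept.startswith("olia:"):
            types.append(concept)
    return types


for concept in reference_types("PDAT"):
    print(f"rdf:type({concept})")
```

For PDAT this yields the three types listed on the slide, followed by the top-level olia:MorphosyntacticCategory.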

  35. Comparing and combining heterogeneous linguistic analyses

  36. Comparing and combining heterogeneous linguistic analyses • challenges • determiner, not pronoun • although preceding an adverb • accusative, not nominative case • although sentence-initial and morphologically ambiguous

  37. Comparing and combining heterogeneous linguistic analyses • Connexor (Tapanainen & Järvinen 1997): PRON Dem FEM SG NOM • RFTagger (Schmid & Laws 2008): PRO.Dem.Attr.-3.Acc.Sg.Fem

  38. Comparing and combining heterogeneous linguistic analyses • OLiA Reference Model descriptions • Connexor (PRON Dem FEM SG NOM): rdf:type(olia:PronounOrDeterminer), rdf:type(olia:Pronoun), rdf:type(olia:DemonstrativePronoun), olia:hasNumber(olia:Singular), olia:hasGender(olia:Feminine), olia:hasCase(olia:Nominative) • RFTagger (PRO.Dem.Attr.-3.Acc.Sg.Fem): rdf:type(olia:PronounOrDeterminer), rdf:type(olia:Determiner), rdf:type(olia:DemonstrativeDeterminer), olia:hasNumber(olia:Singular), olia:hasGender(olia:Feminine), olia:hasCase(olia:Accusative)

  39. Comparing and combining heterogeneous linguistic analyses • OLiA Reference Model descriptions • Connexor (PRON Dem FEM SG NOM): rdf:type(olia:PronounOrDeterminer), rdf:type(olia:Pronoun), rdf:type(olia:DemonstrativePronoun), olia:hasNumber(olia:Singular), olia:hasGender(olia:Feminine), olia:hasCase(olia:Nominative) • RFTagger (PRO.Dem.Attr.-3.Acc.Sg.Fem): rdf:type(olia:PronounOrDeterminer), rdf:type(olia:Determiner), rdf:type(olia:DemonstrativeDeterminer), olia:hasNumber(olia:Singular), olia:hasGender(olia:Feminine), olia:hasCase(olia:Accusative)

  40. Comparing and combining heterogeneous linguistic analyses • OLiA Reference Model descriptions • Connexor: rdf:type(olia:PronounOrDeterminer), rdf:type(olia:Pronoun), rdf:type(olia:DemonstrativePronoun), olia:hasNumber(olia:Singular), olia:hasGender(olia:Feminine), olia:hasCase(olia:Nominative) • RFTagger: rdf:type(olia:PronounOrDeterminer), rdf:type(olia:Determiner), rdf:type(olia:DemonstrativeDeterminer), olia:hasNumber(olia:Singular), olia:hasGender(olia:Feminine), olia:hasCase(olia:Accusative) • confidence ranking (simple voting) • predicted by both tools: rdf:type(olia:PronounOrDeterminer), olia:hasNumber(olia:Singular), olia:hasGender(olia:Feminine) • predicted by one tool: rdf:type(olia:Pronoun), rdf:type(olia:Determiner), rdf:type(olia:DemonstrativePronoun), rdf:type(olia:DemonstrativeDeterminer), olia:hasCase(olia:Accusative), olia:hasCase(olia:Nominative)

  41. Comparing and combining heterogeneous linguistic analyses • disambiguation: create the maximal consistent set S of descriptions • S is initially empty • process descriptions with decreasing confidence • if the current description is consistent with all descriptions in S, then add it to S • if not, skip it • iterate until all descriptions are processed • confidence ranking (simple voting) • predicted by both tools: rdf:type(olia:PronounOrDeterminer), olia:hasNumber(olia:Singular), olia:hasGender(olia:Feminine) • predicted by one tool: rdf:type(olia:Pronoun), rdf:type(olia:Determiner), rdf:type(olia:DemonstrativePronoun), rdf:type(olia:DemonstrativeDeterminer), olia:hasCase(olia:Accusative), olia:hasCase(olia:Nominative)

  42. Comparing and combining heterogeneous linguistic analyses • disambiguation: create the maximal consistent set S of descriptions • S is initially empty • process descriptions with decreasing confidence • if the current description is consistent with all descriptions in S, then add it to S • if not, skip it • iterate until all descriptions are processed • identify inconsistent descriptions: check consistency conditions in the ontology • ranked descriptions: rdf:type(olia:PronounOrDeterminer), olia:hasNumber(olia:Singular), olia:hasGender(olia:Feminine), rdf:type(olia:Pronoun), rdf:type(olia:Determiner), rdf:type(olia:DemonstrativePronoun), rdf:type(olia:DemonstrativeDeterminer), olia:hasCase(olia:Accusative), olia:hasCase(olia:Nominative)

  43. Comparing and combining heterogeneous linguistic analyses • structure-based consistency heuristic:* • concept A is consistent with concept B if A ⊑ B or B ⊑ A • otherwise A and B are inconsistent • siblings are inconsistent, e.g. olia:hasCase(olia:Accusative) and olia:hasCase(olia:Nominative) [Figure: olia:Nominative, olia:Genitive, olia:Dative and olia:Accusative is-a olia:Case, which is-a olia_top:MorphosyntacticFeature] * no formal consistency constraints specified in OLiA, GOLD or the DCR
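
The structure-based heuristic can be sketched as a subsumption test on the class hierarchy. The parent table below is copied from the case hierarchy on the slide; the function names, and the assumption that every concept has at most one parent, are simplifications made here.

```python
# Case hierarchy from the slide (child -> parent). Single inheritance is an
# assumption of this sketch; an OWL hierarchy may have several parents per class.
PARENT = {
    "olia:Nominative": "olia:Case",
    "olia:Genitive": "olia:Case",
    "olia:Dative": "olia:Case",
    "olia:Accusative": "olia:Case",
    "olia:Case": "olia_top:MorphosyntacticFeature",
}


def ancestors_or_self(concept):
    """Return the concept followed by all of its superclasses, nearest first."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def consistent(a, b):
    """A and B are consistent iff one of them subsumes the other."""
    return a in ancestors_or_self(b) or b in ancestors_or_self(a)


print(consistent("olia:Accusative", "olia:Case"))        # True: Case subsumes Accusative
print(consistent("olia:Accusative", "olia:Nominative"))  # False: siblings
```

Since the heuristic relies on hierarchy structure alone, it treats any two concepts on different branches as incompatible, whether or not they could actually co-occur.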

  44. Comparing and combining heterogeneous linguistic analyses • disambiguation: create the maximal consistent set S of descriptions • S is initially empty • process descriptions with decreasing confidence • if the current description is consistent with all descriptions in S, then add it to S • if not, skip it • iterate until all descriptions are processed • consistency • from every equally-ranked pair of inconsistent descriptions: first come, first served (simple voting with random tie resolution) • ranked descriptions: rdf:type(olia:PronounOrDeterminer), olia:hasNumber(olia:Singular), olia:hasGender(olia:Feminine), rdf:type(olia:Pronoun), rdf:type(olia:Determiner), rdf:type(olia:DemonstrativePronoun), rdf:type(olia:DemonstrativeDeterminer), olia:hasCase(olia:Accusative), olia:hasCase(olia:Nominative)
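
The greedy construction of the consistent set S can be sketched as follows. Several things here are assumptions of the sketch rather than statements from the slides: descriptions are modelled as (predicate, concept) pairs, two descriptions with different predicates are treated as always consistent, and ties are resolved by the order of the input list. The slides resolve ties randomly, so the order below was picked so that the ties happen to fall in favour of Determiner and Accusative.

```python
# Excerpt of the Reference Model hierarchy from the slides (child -> parent).
PARENT = {
    "olia:Pronoun": "olia:PronounOrDeterminer",
    "olia:Determiner": "olia:PronounOrDeterminer",
    "olia:DemonstrativePronoun": "olia:Pronoun",
    "olia:DemonstrativeDeterminer": "olia:Determiner",
    "olia:Nominative": "olia:Case",
    "olia:Accusative": "olia:Case",
}


def subsumed_by(a, b):
    """True if a equals b or b is a superclass of a."""
    while a != b and a in PARENT:
        a = PARENT[a]
    return a == b


def consistent(d1, d2):
    """Assumption: only descriptions sharing a predicate can conflict."""
    (p1, c1), (p2, c2) = d1, d2
    if p1 != p2:
        return True
    return subsumed_by(c1, c2) or subsumed_by(c2, c1)


def maximal_consistent_set(ranked):
    """Process descriptions by decreasing confidence; keep those consistent with S."""
    selected = []
    for description in ranked:
        if all(consistent(description, kept) for kept in selected):
            selected.append(description)
    return selected


# Ranked input: three descriptions with two votes, then six with one vote.
RANKED = [
    ("rdf:type", "olia:PronounOrDeterminer"),
    ("olia:hasNumber", "olia:Singular"),
    ("olia:hasGender", "olia:Feminine"),
    ("rdf:type", "olia:Determiner"),
    ("rdf:type", "olia:Pronoun"),
    ("rdf:type", "olia:DemonstrativeDeterminer"),
    ("rdf:type", "olia:DemonstrativePronoun"),
    ("olia:hasCase", "olia:Accusative"),
    ("olia:hasCase", "olia:Nominative"),
]

for predicate, concept in maximal_consistent_set(RANKED):
    print(f"{predicate}({concept})")
```

With Pronoun listed before Determiner, the same loop would keep the pronoun reading instead, which is why the tie-breaking policy matters for descriptions predicted by only one tool.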

  45. Comparing and combining heterogeneous linguistic analyses Diese nicht neue Erkenntnis • PronounOrDeterminer & Determiner & DemonstrativeDeterminer

  46. Ensemble combination with ontologies: Experiments (Chiarcos 2010) • tested for POS & morphology • three German newspaper corpora • TIGER/NEGRA-style gold annotation • 7 NLP tools, 4 annotation schemes (POS) • simple voting • increase of recall • with growing number of tools • 5-6 tools outperform best-performing tool • decrease of precision • more detailed analysis than gold annotation Ch. Chiarcos (2010), Towards Robust Multi-Tool Tagging. An OWL/DL-Based Approach. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden, July 2010, 659-670.

  47. Summary • OLiA architecture • „Reference Model“ mediates between annotations and multiple terminology repositories • OWL/RDF as common representation format • applications ... • abstraction from string-based annotations • corpus querying • NLP tasks • ... not tied to the OLiA Reference Model • linking makes it possible to operate with concepts of another terminology repository on a concept-based level

  48. Thank you

  49. for those who may have wondered about the „mascot“: a metaphor for linguistic ontologies plant with white fruits „a tree growing out of text“* t‘ziib „script, written text“ t‘zi ba * inspired by the Madrid Codex, Yucatán, ~ 1450

  50. OLiA: Annotation documentation • OLiA-specific HTML export • concepts linked via hyperlinks • used for documentation purposes in SPLICR (Rehm et al. 2008) • corpus metadata includes annotation model URL [Figure: HTML export showing STTS Annotation Model concepts, Reference Model concepts, and comments (excerpt from the original documentation)] G. Rehm et al. (2008). SPLICR: A Sustainability Platform for Linguistic Corpora and Resources. In: Proceedings of the 9th Conference on Natural Language Processing (KONVENS 2008). Ergänzungsband Textressourcen und lexikalisches Wissen. Berlin, Sep 2008, 86–95.