Unified Framework for Terminology and Disambiguation Management

Merging Terminology and Disambiguation TadejŠtajner, MārcisPinnis MLW-LT WG Meeting, 24.1.2013, Prague

Summary • 1) We can combine Terminology and Disambiguation under one umbrella, but at the same time do not loose backwards compatibility. • 2) There are compromises, however (in what can and what cannot be annotated)!

Current state • term, termInfo*, • disambigIdent* • disambigClass* • disambigGranularity (entity|lexicalConcept|ontologyConcept) • disambigConfidence • Can’t support simultaneous annotation on multiple granularity levels • Can there be more granularity levels?

Common ground • Both data categories model things in the form of <fragment, relationship, URI> • The current relationships that we model in both DCs are term, entity, lexical concept, ontology concept, class type • With some exceptions: • its:term=“yes” • its:disambigSource, its:disambigIdent

Scenario A • Add ‘term’ as another disambigGranularity level, keep its:term as a flag to allow for the simple use case • Fits in same framework • Precludes use of multiple levels on the same node, so it fails on this constraint

Suggestions • Divide Disambiguation in smaller parts dividing the "granularities" (or levels) in separate parts to accomodate multiple simultaneous levels • each of the levels serves a different purpose for different target audiences and scenarios • Treat Terminology as equally important level • This is where we end up under one umbrella • Implement this in the spec as a set of attributes per level that have a similar pattern for every level • This will allow independency of the different levels • Keep the cardinality to 1 value for every attribute: • users should use one tool for one text analysis level - if they use multiple, they have to create a wrapper that is able to combine the outputs. Otherwise you will end up having conflicting annotations anyway – we shouldn‘t encourage that. • Keep the inheritance of annotations as-is - no hierarchical entities

Scenario B • Keep Terminology as-is, drop disambigGranularity, and encode the levels in the attributes themselves: • entityIdent*, lexicalConcept*, ontologyConcept* - many new attributes, following the same pattern • Allows annotating multiple levels on same node • Every “level” gets its own set of attributes: • *ref • *refPointer • * + *source • *confidence Same pattern appears everywhere – this would make adoption easier compared to declaring granularity.

Hierarchical named entities? • Difficult to interpret on the individual node level: • <b disambigClassRef=“nerd:Person”>Mayor of <b disambigClassRef=“nerd:City”>London</b></b> • “London” would have classes Person and City. Can we really say that London is a Person? (Answer: not without looking at the surrounding nodes) Painful to implement!

Other possible levels • Proposal: add a loosely-defined “keyword” level for annotations which don’t fit in others

Unified Framework for Terminology and Disambiguation Management

Unified Framework for Terminology and Disambiguation Management

Presentation Transcript

Disambiguation

Semantic Roles and Disambiguation

Branching and Merging Practices

Merging

Merging

Word Sense Disambiguation

Merging

Merging Facility and Process

Sorting and Merging

Merging Forces

Word Sense Disambiguation

Merging

Entity Disambiguation

Word Sense Disambiguation

MERGING

SORTING AND MERGING

Sorting and Merging

Sorting and Merging

Merging