CODA – CATCHPlus Open Document Annotation

Hennie Brugman
OAC II Project Review meeting, Chicago, July 26-27, 2012

Annotation context
• Audiovisual
• ASR, language, gesture, oral history
• Text – Semantic annotation
• Music – lyrics, music notation
• Linguistic Annotation – named entities
• Image annotation
• Programs: CATCH, CATCHPlus, CLARIN

CODA main use cases
• Queen’s Cabinet (Henny van Schie/National Archive, Lambert Schomaker/Univ Groningen)
  • Line strip and word zone annotations
  • ML: search in manuscript images
  • Add Named Entity annotations
• Sailing Letters (Nicoline van de Sijs/Meertens + consortium, Lambert Schomaker)
  • Support manual annotation
  • Line strip detection service

Line annotation tools (catchplus)

<txt>godefroit</txt>
<id>navis-SAL7316_0195-line-026-y1=2094-y2=2317-zone-HUMAN-x=1145-y=105-w=315-h=116-unshear=0.0-version=ortho</id>
<user>mceunen</user>
<time>Wed Jan 26 16:37:01 2011</time>
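The identifier in the record above packs the line name and the zone geometry into one string. A minimal sketch of unpacking it follows, assuming that fields are separated by hyphens, that `key=value` pairs never contain a hyphen, and that whitespace inside the identifier is only line wrapping. The function name `parse_line_id` and the reading of `x`, `y`, `w`, `h` as a word-zone box are assumptions for illustration, not part of the CATCHPlus tools.

```python
import re


def _convert(value):
    """Turn a field value into int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_line_id(raw_id):
    """Split a line-strip identifier into (name, zone, fields).

    Assumed layout: <name>-y1=..-y2=..-zone-<LABEL>-x=..-y=..-w=..-h=..-...
    """
    compact = "".join(raw_id.split())  # drop whitespace left by line wrapping
    name = compact.split("-y1=")[0]    # everything before the first geometry field
    zone_match = re.search(r"-zone-([A-Za-z]+)", compact)
    zone = zone_match.group(1) if zone_match else None
    fields = {key: _convert(value)
              for key, value in re.findall(r"-([a-z0-9]+)=([^-]+)", compact)}
    return name, zone, fields


name, zone, fields = parse_line_id(
    "navis-SAL7316_0195-line-026 -y1=2094-y2=2317-zone-HUMAN "
    "-x=1145-y=105-w=315-h=116 -unshear=0.0-version=ortho"
)
print(name, zone, fields)
```

Stripping whitespace first means the same function accepts the identifier both wrapped and unwrapped.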

OAC representation
[Diagram: ImageAnnotation and TextAnnotations. Node labels: ia:1, ia:2, page:0, ib:0, imageScan.jpg, linestrip.jpg, Canvas1, line:1, zone:2, ct:1, ct:2, cb:1, cb:2, Named Entity. Edge labels: hasBody, hasTarget, constrains. Inline text (cnt:chars): “Dit is een beschrijving van Den Haag. En dit is een tweede zin.”]
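To make the shape of such a graph concrete, here is a small sketch that emits the triples for one annotation whose target is a constrained part of a resource. It uses only the property and class names that appear on these slides (`hasBody`, `hasTarget`, `constrains`, `constrainedBy`, `ConstrainedTarget`) plus `Annotation` in the same namespace. The `urn:example:` identifiers and the function names are hypothetical, and the wiring is an illustration rather than the exact graph shown in the diagram.

```python
OAC = "http://www.openannotation.org/ns/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def annotation_triples(annotation, body, source, constraint):
    """Return (subject, predicate, object) triples for one constrained annotation.

    The annotation points at a ConstrainedTarget, which in turn names the
    full resource (`source`) and the constraint that selects part of it.
    """
    constrained = annotation + ":target"  # hypothetical naming scheme
    return [
        (annotation, RDF_TYPE, OAC + "Annotation"),
        (annotation, OAC + "hasBody", body),
        (annotation, OAC + "hasTarget", constrained),
        (constrained, RDF_TYPE, OAC + "ConstrainedTarget"),
        (constrained, OAC + "constrains", source),
        (constrained, OAC + "constrainedBy", constraint),
    ]


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = annotation_triples(
    "urn:example:ia1",
    "urn:example:linestrip.jpg",
    "urn:example:imageScan.jpg",
    "urn:example:constraint1",
)
print(to_ntriples(triples))
```

Even this minimal case needs six triples for a single annotation, before any body text, constraint details or provenance are added.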

OAC representation – Named Entities
[Diagram: ImageAnnotation, TextAnnotations and EntityAnnotation. Node labels: ia:1, ta:0, ta:1, ta:2, ea:1, ib:0, ib:1, ct:1, ct:2, ct:3, ct:4, cb:1, cb:2, Canvas1, imageScan.jpg. Edge labels: hasBody, hasTarget, constrains. Inline text (cnt:chars): “location” and “Dit is een beschrijving van Den Haag. En dit is een tweede zin.”]

InlineTextConstraint:
<rdf:Description rdf:about="urn:uuid:533624bb-d565-40ba-a14a-2e95c19c20df">
  <rdf:type rdf:resource="http://www.openannotation.org/ns/ConstrainedTarget"/>
  <constrains xmlns="http://www.openannotation.org/ns/"
      rdf:resource="http://oas.dev.seecr.nl:8000/resolve/urn%3Auuid%3Ad8741024-18bf-40a8-a648-2cd5ebb9acfd"/>
  <constrainedBy xmlns="http://www.openannotation.org/ns/"
      rdf:resource="urn:uuid:4f6b7d34-2329-4ab6-be89-a0feec9e7208"/>
</rdf:Description>
<rdf:Description rdf:about="urn:uuid:4f6b7d34-2329-4ab6-be89-a0feec9e7208">
  <rdf:type rdf:resource="http://www.openannotation.org/ns/Constraint"/>
  <rdf:type rdf:resource="http://www.catchplus.nl/annotation/InlineTextConstraint"/>
  <rdf:type rdf:resource="http://www.w3.org/2008/content#ContentAsText"/>
  <chars xmlns="http://www.w3.org/2008/content#">"<textsegment offset="279" range="2"/>"</chars>
  <characterEncoding xmlns="http://www.w3.org/2008/content#">UTF-8</characterEncoding>
</rdf:Description>

! Annotation of segments of inline text?
! Annotation of annotations?
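An InlineTextConstraint selects a segment of inline text by `offset` and `range`. The sketch below resolves such a constraint against a string, assuming that `offset` is a zero-based character position and that `range` is a length in characters; the slides do not spell out either convention. The function name `select_segment` and the values 28 and 8 are made up for this example (the slide's own constraint uses offset 279 and range 2 on a text that is not shown).

```python
import xml.etree.ElementTree as ET


def select_segment(text, constraint_xml):
    """Return the part of `text` selected by a <textsegment/> element.

    Assumes offset is zero-based and range counts characters.
    """
    segment = ET.fromstring(constraint_xml)
    offset = int(segment.get("offset"))
    length = int(segment.get("range"))
    if offset < 0 or length < 0 or offset + length > len(text):
        raise ValueError("text segment lies outside the text")
    return text[offset:offset + length]


text = "Dit is een beschrijving van Den Haag. En dit is een tweede zin."
entity = select_segment(text, '<textsegment offset="28" range="8"/>')
print(entity)  # prints "Den Haag"
```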

KdK-2-OAC conversion
• Implicit line and page text
• Word and line order
• Text offsets and ranges
• Spatial information
• Identifiers and ‘annotatability’
• Redundant text for searchability
! Need for explicit representation of Sequence?
! Search on text of ConstrainedTarget/Body?

KdK2OAC conclusions
• Bidirectional mapping is possible
• Compatible with SharedCanvas model
• OAC + Canvas links everything together
• Implicit information made explicit
• Supports alternative text segmentations
• OAC representation is extremely verbose
! For many annotation tasks OA may be overkill

Open Annotation Service (OAS)
• Upload annotation RDF using SRU/Update
• Inlines external text and XML Bodies and authors
• Indexes OA and DC properties
• Assigns resolvable http URIs and resolves those
• Implementation: RDF store in combination with Solr, production quality software components (Meresco)
• Built-in OAI-PMH data provider and harvester for ‘annotation sets’
• Query: SRU/CQL, SPARQL, OAI-PMH
• Simple management dashboard (authentication and authorization, collection management, harvesting)
• Easy installation and Open Source
! Model does not support Annotation “sets”
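As an illustration of the SRU/CQL query route, the sketch below builds a standard SRU `searchRetrieve` request URL. The parameter names (`version`, `operation`, `query`, `startRecord`, `maximumRecords`) come from the SRU specification. The base URL, the SRU version and the index name `dc.creator` are assumptions made for this example, since the slides do not list the endpoints or indexes that OAS exposes.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, start_record=1, maximum_records=10):
    """Build an SRU searchRetrieve URL for a CQL query."""
    params = {
        "version": "1.1",              # assumed SRU version
        "operation": "searchRetrieve",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and index name, for illustration only.
url = sru_search_url("http://localhost:8000/sru", 'dc.creator = "mceunen"')
print(url)
```

The URL can then be fetched with any HTTP client; the response is an XML document of matching records.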

OAS: issues
• Annotation publication
• Searchability: ‘harvest and index’
• Text search on external bodies
• Annotation boundaries
• ‘Bypassing’ oac:constrains
! In RDF, what are the boundaries of an annotation?

Entity Recognition service
[Architecture diagram. Labelled components and data: URL or text, resolve service, OAS, source_text, frog, URL or ID, FoLiA_document, converter, entity annotations.]

‘frog’ and FoLiA
• ‘Frog’ tool generates FoLiA XML document with
  • Segmentation of text in paragraphs, sentences and words (tokens) – XML hierarchy
  • Part of speech, lemma, morphology, chunking, dependency structure and named entities
• Mix of inline and standoff annotation
• ‘Frog’ does not keep track of character offsets
• Explicit ordering: numbering system in ids
• Trained for Dutch
• Widely used for Dutch corpora
• Made available by: ILK @ Tilburg University

FoLiA-2-OAC conversion
• Reconstruct character offsets after tokenization
• Operates on inline text as published by OAS
• Construct and add entity text from tokens + sequence (the+hague != hague+the)
• Two approaches
  • Minimal: extract entity annotations and tokens, and convert to OAC
  • Maximal: full conversion to OAC
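Reconstructing character offsets after tokenization can be sketched as a left-to-right search through the original text. This is one plausible approach, not necessarily the one used in the converter, and it assumes the tokenizer leaves the characters of each token unchanged. The function names `align_tokens` and `entity_span` are hypothetical.

```python
def align_tokens(text, tokens):
    """Find (token, start, end) spans by scanning the text in token order."""
    spans = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)  # search only after the previous token
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        spans.append((token, start, end))
        cursor = end
    return spans


def entity_span(spans, first, last):
    """Character span covering tokens first..last (inclusive), in sequence order."""
    return spans[first][1], spans[last][2]


text = "Dit is een beschrijving van Den Haag."
tokens = ["Dit", "is", "een", "beschrijving", "van", "Den", "Haag", "."]
spans = align_tokens(text, tokens)
start, end = entity_span(spans, 5, 6)
print(start, end, text[start:end])  # prints: 28 36 Den Haag
```

Taking the entity text from the original string by span, rather than joining tokens, keeps the word order and the whitespace between "Den" and "Haag" intact.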

Linguistic Annotation
! Mix-in domain semantics as subtypes/subproperties?
! Maximal OA mapping or embed linguistic standards?
! Layers, hierarchies (syntax) and Documents
! Sequence (e.g. entities, morpheme breakup)

Synchronized viewing client demo
• Demo/screenshot

Summary of OA issues
! Annotation of annotations?
! Annotation of segments of inline text?
! Need for explicit representation of Sequence?
! Search on ConstrainedTarget/Body?
! For many annotation tasks OA may be overkill
! Model does not support Annotation sets
! In RDF, what are the boundaries of an annotation?

Future work
• Finalize and integrate software (with web services)
• Upgrade to new OA spec (incl. OAS)
• Line strip detection web service
• Possible applications
  • AV annotation in CATCHPlus
  • Nederlab

Questions?
