Tags in the cloud: Crowdsourcing semantic annotation with CATMA

Tags in the cloud: Crowdsourcing semantic annotation with CATMA. Jan Christoph Meister, University of Hamburg. www.catma.de. CATMA - an integrated textual markup and analysis tool. Text vs. sentence, or: What's so different about processing texts?

Presentation Transcript


  1. Tags in the cloud: Crowdsourcing semantic annotation with CATMA. Jan Christoph Meister, University of Hamburg, www.catma.de

  2. CATMA - an integrated textual markup and analysis tool CLARIN's Turn Towards The Literary Text

  3. Text vs. sentence, or: What's so different about processing texts?
     • structural complexity: min TEXT > 2 (SENTENCE)
     • structural activity: TEXT processing actualizes paradigmatic cross-reference across sentences
     • structural dynamic: TEXT processing represents & simulates cognitive and empirical processes
     TEXT yields more INTERPRETATIONS than SENTENCE
     + CONTINGENCY: The more complex & dynamic structure, when activated during processing, results in a higher degree of contingency in functional “outcome”

  4. The what and why of MarkUp: procedural, descriptive & discursive function
     (diagram labels: discursive function, performative function)
     • discursive markup: enables human readers to interpret a text and to explore its hermeneutic potential in collaboration → “What might this text mean to us?”
     • declarative markup: informs a human reader how to process a text as a communicative device → “How is this text put together and how does it function in its communicative universe?”
     • procedural markup: instructs a (natural or artificial) text processor how to handle a text as a structured character string → “What is the correct operation to perform on this input?”

  5. Hermeneutic “must haves” of discursive markup
     • facilitate collaboration & non-deterministic annotation
     • allow for multiple markup
     • allow for overlap
     • allow for concurrent tagging
     • conceptualize markup as dynamic & recursive
     • allow for extensibility
     • allow for multiple (and even contradictory) markup
     • seamlessly integrate markup and analysis & support the hermeneutic loop

  6. MarkUp types & data models
     Example sentence: There is no such thing as “no-mark up”. (Coombs, Renear, DeRose 1987)
     • stand off, discursive (network): <1,5, word class = “Preposition”> <1,5, segment = “SentenceStart”> <1,8, POS = “noun phrase”> <1,5, word class = “Adverb”> <1,38, speech act = “declaration”> <1,11, POS = “verb phrase”>
     • stand off, descriptive (relational): <1,5, word class = “Adverb”> <1,5, segment = “SentenceStart”> <1,5, POS = “verb phrase element”>
     • nested inline, deterministic (sequential): <SentenceStart><Adverb>There</Adverb></SentenceStart> is no such thing as “no-mark up”.
     • inline, deterministic (linear): <SentenceStart>There</SentenceStart> is no such thing as “no-mark up”.
     • implicit (opaque): There is no such thing as “no-mark up”.

  7. Implementation in CATMA www.catma.de

  8. The CATMA/CLÉA approach to markup
     • text range based model
       • a tag references a text range with a start and an end offset
     • external standoff markup
       • markup is stored in external files or databases to facilitate tagging and exchange of markup by multiple users
       • markup is stored in a standoff manner to allow overlapping
       • markup tolerates non-deterministic tagging & supports analytical operations that exploit semantic ambiguity
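
A minimal sketch of the text range based model, not CATMA's actual implementation: each tag reference is kept outside the text and points at it through a start and an end offset, so two references may overlap without either containing the other. The class name `TagReference`, the helper functions and the tag name "negation" are assumptions made for illustration; offsets here are 0-based and end-exclusive (Python slicing), whereas the triples on slide 6 appear to count from 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagReference:
    """Hypothetical standoff record: a tag name plus a character range."""
    tag: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def overlaps(a: TagReference, b: TagReference) -> bool:
    """True if the two ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def tags_at(refs, offset):
    """Names of all tags whose range covers the given character offset."""
    return sorted(r.tag for r in refs if r.start <= offset < r.end)


# The source text itself is never modified; markup lives in `refs`.
text = 'There is no such thing as "no-mark up".'
refs = [
    TagReference("SentenceStart", 0, 5),  # "There"
    TagReference("noun phrase", 0, 8),    # "There is"
    TagReference("negation", 6, 11),      # "is no" (made-up tag)
]

for r in refs:
    print(r.tag, repr(text[r.start:r.end]))
print(tags_at(refs, 6))
```

Because "noun phrase" and "negation" share the word "is" without nesting, they could not be written as well-formed inline XML elements; a standoff list has no such restriction.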

  9. Example for overlapping markup in CATMA (NB: In CATMA tag sets can be imported/exported; tags can be created / manipulated ad hoc during mark up)

  10. TEI feature structure tag declaration & overlapping markup
      <fs xml:id="CATMA_d7251f99-14e9-4c36-8ff7-24058ae81ce5" n="1_7985fdf0-77a5-4060-9a3d-2d977e0ab954" type="catma_tag">
        <f xml:id="CATMA_aa9b3727-187e-4fb8-9990-e7880912a409" name="catma_tagname">
          <string>Keynote_speaker&amp;affiliation</string>
        </f>
        <f xml:id="CATMA_564825ba-28b2-4dab-b136-b87c8a3d9e28" name="catma_displaycolor">
          <numeric value="-13421569"/>
        </f>
      </fs>
      <ptr target="Abstracts.doc#range( /.21736, /.21888)" type="inclusion"/>
      <seg ana="#CATMA_0a252cc2-96d2-4ed4-8fb8-52380550ec0b #CATMA_d7251f99-14e9-4c36-8ff7-24058ae81ce5 #CATMA_8513fe2d-2e35-4d0a-a3a2-07528bcfa012">
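
As a rough illustration of how such a declaration could be read back, the sketch below parses a shortened copy of the `<fs>` element from the slide with Python's standard `xml.etree.ElementTree`. The function name `read_tag_declaration` is an assumption, the `n` attribute is left out for brevity, and a real CATMA export would presumably sit inside a namespaced TEI document, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened copy of the feature structure shown on the slide.
snippet = """<fs xml:id="CATMA_d7251f99-14e9-4c36-8ff7-24058ae81ce5" type="catma_tag">
  <f xml:id="CATMA_aa9b3727-187e-4fb8-9990-e7880912a409" name="catma_tagname">
    <string>Keynote_speaker&amp;affiliation</string>
  </f>
  <f xml:id="CATMA_564825ba-28b2-4dab-b136-b87c8a3d9e28" name="catma_displaycolor">
    <numeric value="-13421569"/>
  </f>
</fs>"""


def read_tag_declaration(xml_text):
    """Return (xml:id, {feature name: value}) for one <fs> element."""
    fs = ET.fromstring(xml_text)
    features = {}
    for f in fs.findall("f"):
        name = f.get("name")
        string = f.find("string")
        numeric = f.find("numeric")
        if string is not None:
            features[name] = string.text          # &amp; is unescaped to &
        elif numeric is not None:
            features[name] = int(numeric.get("value"))
    return fs.get(XML_ID), features


tag_id, features = read_tag_declaration(snippet)
print(tag_id)
print(features)
```

The `<seg ana="...">` element lists several such ids at once, which is how one text segment can carry more than one tag.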

  11. Question 1: How can we model a collaborative mark up practice?

  12. Answer 1: CATMA's “n meta-data sets to 1 object-data instance” model
      (Diagram labels: meta-data: procedural, declarative, hermeneutic; user markup 1..n; Tagsets; object-data: TEXT; 0, A.)

  13. Question 2: But how, on top of that, can we also model the recursive routines that characterize the humanistic workflow?

  14. Example for recursion: a simple query across the object data/meta data divide
      Step 1: object data query ...
      Step 2: refinement by adding an additional meta-data constraint ...

  15. ... which is why (reg="\b\S*\Qez\E(?=\W)") where (tag="Keynote_speaker&affiliation") generates this:
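
The two-step query above can be imitated in a few lines of Python. This is only a sketch of the idea, not CATMA's query engine: step 1 runs a regular expression over the object data, step 2 keeps only the hits that lie inside a range carrying a given tag. The function name `tagged_matches`, the sample sentence, its names and the tag range are all made up. Python's `re` module has no `\Q...\E` literal quoting (that is Java regex syntax), so `re.escape` stands in for it.

```python
import re


def tagged_matches(text, pattern, tag_ranges):
    """Regex hits (start, end, match) that fall entirely inside a tagged range."""
    hits = []
    for m in re.finditer(pattern, text):
        if any(start <= m.start() and m.end() <= end for start, end in tag_ranges):
            hits.append((m.start(), m.end(), m.group()))
    return hits


# Invented sample text; the names are fictional.
text = "Keynote: Ana Gomez (Madrid). Chair: Luis Perez."

# Equivalent of \b\S*\Qez\E(?=\W): a token ending in the literal "ez".
pattern = r"\b\S*" + re.escape("ez") + r"(?=\W)"

# Hypothetical ranges tagged "Keynote_speaker&affiliation" (0-based, end-exclusive).
keynote_ranges = [(9, 27)]  # "Ana Gomez (Madrid)"

# Step 1: object data query only (the whole text counts as one range).
print(tagged_matches(text, pattern, [(0, len(text))]))
# Step 2: refined by the meta-data constraint.
print(tagged_matches(text, pattern, keynote_ranges))
```

Step 1 finds both surnames; step 2 narrows the result to the one inside the tagged range, and that result could in turn be tagged and queried again.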

  16. Answer 2: CATMA's dynamic data model, e.g. (n meta-data sets to 1 object instance) > n+1
      (Diagram labels: TEXT; markup 1..n; meta-data: procedural, declarative, hermeneutic; Tagsets; object-data; 0, A.)

  17. Question 3: How can we implement this practice in a system?

  18. Answer 3: Call the big sister – CLÉA! (Figure: CLÉA Data Base Model)

  19. CATMA/CLÉA: User and resource administration

  20. Manage corpora & source documents, markup collections and tag libraries

  21. Annotate texts or corpora using pre-defined or ready-made tags

  22. Build and execute queries on source text & tags, or any combination thereof

  23. Visualize results

  24. What's in it for CLARIN?
      • Import any text or corpus into CATMA/CLÉA
      • Run standard analytical procedures automatically or interactively on upload (indexing, POS tagging etc.)
      • Annotate and analyse texts or corpora collaboratively
      • Share and export markup from the CATMA/CLÉA database in multiple formats
      CLÉA = Collaborative Literature Éxploration and Annotation

  25. Mille grazie to my CATMA/CLÉA development team
      • Evelyn Gius
      • Malte Meister
      • Marco Petris
      • Lena Schüch
      and to our funders
      • University of Hamburg (2009)
      • Google DH Awards (2010-2013)
      • BMBF (2013-2016)

  26. Tag definition
      • each Tag has a type
      • each Tag has a color
      • each Tag can have additional user defined properties

  27. Tag instance
      • each Tag instance is of a type
      • a Tag instance can have individual values for the user defined properties
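
The split between tag definition and tag instance can be pictured with two small record types. This is a sketch under assumptions, not CATMA's data model: the class names, the `value_of` helper, the offsets of the second instance and the user defined property "certainty" are invented for illustration; only the tag name, the color value and the first offset pair are taken from slide 10.

```python
from dataclasses import dataclass, field


@dataclass
class TagDefinition:
    """A tag type: a name, a display color, and user defined properties with defaults."""
    type_name: str
    color: int
    property_defaults: dict = field(default_factory=dict)


@dataclass
class TagInstance:
    """One application of a tag type to a text range, with optional individual values."""
    definition: TagDefinition
    start: int
    end: int
    values: dict = field(default_factory=dict)

    def value_of(self, name):
        """Individual value if set on this instance, else the definition's default."""
        if name in self.values:
            return self.values[name]
        return self.definition.property_defaults[name]


speaker = TagDefinition(
    type_name="Keynote_speaker&affiliation",
    color=-13421569,
    property_defaults={"certainty": "high"},  # hypothetical property
)
first = TagInstance(speaker, 21736, 21888)
second = TagInstance(speaker, 100, 120, {"certainty": "low"})

print(first.value_of("certainty"), second.value_of("certainty"))
```

Both instances share one definition, so renaming or recoloring the tag type touches a single record, while each instance keeps its own range and property values.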

  28. Tag referencing
      • The content of a range is referenced by a pointer to an external entity.
      • The URI is based on RFC 5147 (URI fragment identifiers for the text/plain media type), which is used for pointing into plain text.
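
To make the pointer idea concrete, here is a small sketch that resolves a character range fragment in the style of RFC 5147's `char=` scheme, where positions count the gaps between characters starting at 0. It is a simplification and an assumption about usage: it accepts only the full `char=start,end` form, skips any integrity checks after a semicolon, and does not reproduce the `#range(...)` pointer syntax visible on slide 10. The function name `resolve_char_range` and the file name are made up.

```python
import re
from urllib.parse import urldefrag

_CHAR_RANGE = re.compile(r"char=(\d+),(\d+)")


def resolve_char_range(uri, text):
    """Return the slice of `text` addressed by a '#char=start,end' fragment."""
    fragment = urldefrag(uri).fragment
    scheme = fragment.split(";", 1)[0]  # drop integrity checks such as ;length=
    m = _CHAR_RANGE.fullmatch(scheme)
    if m is None:
        raise ValueError(f"unsupported fragment: {fragment!r}")
    start, end = int(m.group(1)), int(m.group(2))
    if not 0 <= start <= end <= len(text):
        raise ValueError("range lies outside the text")
    return text[start:end]


text = 'There is no such thing as "no-mark up".'
print(resolve_char_range("example.txt#char=0,5", text))
print(resolve_char_range("example.txt#char=12,22;length=39", text))
```

Because the pointer only names offsets, the tagged content stays in the external source document and is looked up on demand.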

  29. Potential problems and possible solutions
      • references to ranges based on character offsets are vulnerable to modifications of the content
        • possible solution: automated adjustments with checksums and context information, and
        • track versioning and revision history in the source document header
      • the encoding of the tags is machine readable but not interoperable out of the box
        • possible solution: defining the feature structure encoding of tags in terms of the open annotation framework
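
One way the "checksums and context information" idea might work is sketched below; the slides do not describe the actual mechanism, so every name here (`make_anchor`, `reanchor`, the anchor fields) is an assumption. Each anchor stores the offsets, a checksum of the tagged span and a few characters of preceding context. If the checksum no longer matches at the stored offsets, the context is searched in the modified text and the span that follows it is verified against the checksum.

```python
import hashlib


def _digest(s):
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


def make_anchor(text, start, end, context=12):
    """Record offsets, a checksum of the span, and the text just before it."""
    return {
        "start": start,
        "end": end,
        "checksum": _digest(text[start:end]),
        "prefix": text[max(0, start - context):start],
    }


def reanchor(anchor, new_text):
    """Return adjusted (start, end) in new_text, or None if the span is not found."""
    length = anchor["end"] - anchor["start"]
    start = anchor["start"]
    # Fast path: the span is still where it was.
    if _digest(new_text[start:start + length]) == anchor["checksum"]:
        return start, start + length
    # Otherwise look for the stored context and verify what follows it.
    prefix = anchor["prefix"]
    pos = new_text.find(prefix)
    while pos != -1:
        candidate = pos + len(prefix)
        if _digest(new_text[candidate:candidate + length]) == anchor["checksum"]:
            return candidate, candidate + length
        pos = new_text.find(prefix, pos + 1)
    return None


old = 'There is no such thing as "no-mark up".'
anchor = make_anchor(old, 12, 22)  # "such thing"
new = "Indeed: " + old              # an insertion shifts every offset by 8
print(reanchor(anchor, new))
```

This sketch only recovers spans whose own text and preceding context are unchanged; if the tagged passage itself was edited it returns None, which is where the revision history mentioned above would have to take over.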
