Interview
This presentation is the property of its rightful owner.
Sponsored Links
1 / 9

Transcripts are stored in a relational database PowerPoint PPT Presentation


  • 34 Views
  • Uploaded on
  • Presentation posted in: General

interview. speaker. sentence. word. interview_id. speaker_id. sentence_id. word_id. interview_id. speaker_id. sentence_id. start time end time. locality. DynaSAND: technology. Transcripts are stored in a relational database

Download Presentation

Transcripts are stored in a relational database

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Transcripts are stored in a relational database

interview

speaker

sentence

word

interview_id

speaker_id

sentence_id

word_id

interview_id

speaker_id

sentence_id

start time

end time

locality

DynaSAND: technology

  • Transcripts are stored in a relational database

  • Transcripts are divided up to their smallest constituent (words), while the context is preserved, in a structure basically like this:


Transcripts are stored in a relational database

category

category_id

word_id

word

attributes

word_id

attribute_id

value_id

word_id

DynaSAND: technology

  • This means that individual words can be addressed, e.g. for POS tagging

  • The POS tags are themselves stored as separate categories, attributes and values, not as opaque strings:


Transcripts are stored in a relational database

Generating other formats

  • The fact that the data is stored in its smallest constituent parts makes it relatively easy to generate other formats

  • Example: we realize that a binary format like a relational database is not appropriate for long-term archival, so we made the SAND transcriptions available as TEI XML by creating a template and filling that with data from the database with a script

  • Another example: the IMDI metadata for another corpus (The Goeman-Taeldeman-Van Reenen Project, or GTRP corpus) were created in the same way


Transcripts are stored in a relational database

Generating metadata for CLARIN

  • Previous experience with SAND and GTRP indicates that generating XML metadata for CLARIN from our databases should be doable

  • The TEI and IMDI for SAND and GTRP were created once and are static; we plan to make the process more dynamic for CLARIN metadata by creating the XML on the fly (and implementing a caching mechanism for performance reasons) so that the metadata is always up to date


Transcripts are stored in a relational database

Edisyn (European Dialect Syntax)

  • One of the goals of Edisyn is the development of a search engine which uses one tag set to search different corpora, including the SAND, concurrently

  • Central tag set is being developed by Franca Wesseling; we plan to make it compatible with ISOcat

  • Search engine translates these tags to the native tag sets of the corpora

  • Ideal case: corpora are hosted by their own organizations and accessible via a web service

  • In practice: the Meertens has local copies of the corpora

  • Participating corpora: SAND, CORDIAL-SIN (Portuguese), ASIS (Italian), EMK (Estonian); more to come


Transcripts are stored in a relational database

Other Meertens language resources

  • PLAND (Plant Names in Dutch Dialects)

  • NVD (Dutch Database of First Names)

  • NFD (Dutch Database of Family Names)

  • Corpus of free dialect speech (sound recordings)

  • Dutch Database of Toponyms (in development)

  • Dutch Song Database

  • Dutch Folktale Database


Transcripts are stored in a relational database

Other Meertens language resources

  • Apart from part of the sound recordings, all these are web-based and based on the same database technology

  • We plan to make CLARIN metadata available for these resources in a stepwise manner: first metadata on the corpus level, later also metadata on the record level

  • The technologies involved (OAI-PMH) are new to us, so we want to do this in close cooperation with a “harvesting” institution to make sure that our stuff is correct


Transcripts are stored in a relational database

Further in the future

  • The Meertens Institute wants to be part of CLARIN and in the future we also hope to contribute to the development of tools to work with language resources


Transcripts are stored in a relational database

Thank you for your attention!


  • Login