
Transcripts are stored in a relational database


DynaSAND: technology

  • Transcripts are stored in a relational database

  • Transcripts are divided into their smallest constituents (words), while the context is preserved, in a structure basically like this:

[Figure: database schema with the tables interview, speaker, sentence and word. Fields shown: interview_id, speaker_id, sentence_id, word_id, locality, start time, end time.]
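
As an illustration only, the structure could be sketched like this with Python's built-in sqlite3 module. The table names and id fields come from the slide; which table holds locality and the start and end times, and the extra `position` and `form` columns, are assumptions made for the example and not the actual DynaSAND schema.

```python
import sqlite3

# Hypothetical schema: table names and id fields follow the slide; the placement
# of locality/start_time/end_time and the position/form columns are assumptions.
SCHEMA = """
CREATE TABLE interview (interview_id INTEGER PRIMARY KEY, locality TEXT);
CREATE TABLE speaker   (speaker_id INTEGER PRIMARY KEY,
                        interview_id INTEGER REFERENCES interview(interview_id));
CREATE TABLE sentence  (sentence_id INTEGER PRIMARY KEY,
                        speaker_id INTEGER REFERENCES speaker(speaker_id),
                        start_time REAL, end_time REAL);
CREATE TABLE word      (word_id INTEGER PRIMARY KEY,
                        sentence_id INTEGER REFERENCES sentence(sentence_id),
                        position INTEGER, form TEXT);
"""


def build_demo_db():
    """Create an in-memory database with one invented sentence."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO interview VALUES (1, 'Demo locality')")
    conn.execute("INSERT INTO speaker VALUES (1, 1)")
    conn.execute("INSERT INTO sentence VALUES (1, 1, 0.0, 1.5)")
    conn.executemany(
        "INSERT INTO word VALUES (?, 1, ?, ?)",
        [(1, 1, "this"), (2, 2, "is"), (3, 3, "an"), (4, 4, "example")],
    )
    return conn


def word_context(conn, word_id):
    """Return (form, locality, start_time, end_time) for a single word."""
    return conn.execute(
        """
        SELECT w.form, i.locality, s.start_time, s.end_time
        FROM word w
        JOIN sentence s   ON s.sentence_id = w.sentence_id
        JOIN speaker sp   ON sp.speaker_id = s.speaker_id
        JOIN interview i  ON i.interview_id = sp.interview_id
        WHERE w.word_id = ?
        """,
        (word_id,),
    ).fetchone()
```

Calling `word_context(build_demo_db(), 4)` walks from one word back up to its sentence, speaker and interview, which is the sense in which the context is preserved.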










DynaSAND: technology

  • This means that individual words can be addressed, e.g. for part-of-speech (POS) tagging

  • The POS tags themselves are stored as separate categories, attributes and values, not as opaque strings
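
A minimal sketch of what storing tags as categories, attributes and values could look like, again using sqlite3. The `pos_tag` table, its columns and the example words and tags are all invented for illustration; the real tag tables may be organized differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (word_id INTEGER PRIMARY KEY, form TEXT);
-- Hypothetical layout: one row per (word, attribute), so a tag is data
-- that can be queried, not a string that has to be parsed.
CREATE TABLE pos_tag (word_id INTEGER REFERENCES word(word_id),
                      category TEXT, attribute TEXT, value TEXT);
""")
conn.executemany(
    "INSERT INTO word VALUES (?, ?)",
    [(1, "houses"), (2, "walks"), (3, "house")],
)
conn.executemany(
    "INSERT INTO pos_tag VALUES (?, ?, ?, ?)",
    [
        (1, "noun", "number", "plural"),
        (2, "verb", "person", "3"),
        (2, "verb", "number", "singular"),
        (3, "noun", "number", "singular"),
    ],
)


def find_words(conn, category, attribute, value):
    """Return the forms of all words carrying the given category/attribute/value."""
    rows = conn.execute(
        "SELECT w.form FROM word w JOIN pos_tag t ON t.word_id = w.word_id "
        "WHERE t.category = ? AND t.attribute = ? AND t.value = ? "
        "ORDER BY w.word_id",
        (category, attribute, value),
    )
    return [form for (form,) in rows]
```

With this layout a search such as `find_words(conn, "noun", "number", "plural")` is a plain query on columns, with no string matching on a composite tag.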

Generating other formats

  • The fact that the data is stored in its smallest constituent parts makes it relatively easy to generate other formats

  • Example: a binary format such as a relational database is not suitable for long-term archiving, so we made the SAND transcriptions available as TEI XML; a script fills a template with data from the database

  • Another example: the IMDI metadata for another corpus (The Goeman-Taeldeman-Van Reenen Project, or GTRP corpus) were created in the same way
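
The template approach can be sketched roughly as follows. This is not the script used for SAND: the templates are reduced to a bare skeleton (no TEI header or namespace), the `#spk` identifier scheme is invented, and the input is a plain list standing in for rows read from the database. The `u` and `w` elements are the TEI elements for utterances and words.

```python
from string import Template
from xml.sax.saxutils import escape, quoteattr

# Skeleton templates only; a real TEI document also needs a teiHeader and namespace.
UTTERANCE = Template("<u who=$who>$words</u>")
DOCUMENT = Template("<TEI><text><body>$utterances</body></text></TEI>")


def render_tei(sentences):
    """Fill the templates from a list of (speaker_id, [word forms]) tuples."""
    parts = []
    for speaker_id, words in sentences:
        # Escape every word form so characters like & or < stay well-formed XML.
        word_xml = "".join(f"<w>{escape(form)}</w>" for form in words)
        parts.append(
            UTTERANCE.substitute(who=quoteattr(f"#spk{speaker_id}"), words=word_xml)
        )
    return DOCUMENT.substitute(utterances="".join(parts))
```

Because every word is a separate record, the script only has to loop over sentences and words; no parsing of running text is needed.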

Generating metadata for CLARIN

  • Previous experience with SAND and GTRP indicates that generating XML metadata for CLARIN from our databases should be feasible

  • The TEI and IMDI files for SAND and GTRP were created once and are static; for CLARIN metadata we plan a more dynamic process in which the XML is created on the fly, with a caching mechanism for performance reasons, so that the metadata is always up to date
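
One possible shape for such a caching mechanism, as a sketch under assumptions: the `MetadataCache` class, its expiry time and the `generate` callback are hypothetical names, and a real implementation might invalidate entries on database updates instead of, or in addition to, using an expiry time.

```python
import time


class MetadataCache:
    """Serve cached XML until the entry expires or is explicitly invalidated."""

    def __init__(self, generate, max_age_seconds=3600.0, clock=time.monotonic):
        self._generate = generate        # function: record_id -> XML string
        self._max_age = max_age_seconds
        self._clock = clock              # injectable so expiry can be tested
        self._entries = {}               # record_id -> (created_at, xml)

    def get(self, record_id):
        now = self._clock()
        entry = self._entries.get(record_id)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]              # fresh enough: skip regeneration
        xml = self._generate(record_id)  # build the XML on the fly
        self._entries[record_id] = (now, xml)
        return xml

    def invalidate(self, record_id):
        """Drop a cached entry, e.g. after the underlying record was edited."""
        self._entries.pop(record_id, None)
```

The expiry time bounds how stale the metadata can get, while `invalidate` lets an edit in the database take effect immediately.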

Edisyn (European Dialect Syntax)

  • One of the goals of Edisyn is the development of a search engine which uses one tag set to search different corpora, including the SAND, concurrently

  • Central tag set is being developed by Franca Wesseling; we plan to make it compatible with ISOcat

  • Search engine translates these tags to the native tag sets of the corpora

  • Ideal case: corpora are hosted by their own organizations and accessible via a web service

  • In practice: the Meertens Institute has local copies of the corpora

  • Participating corpora: SAND, CORDIAL-SIN (Portuguese), ASIS (Italian), EMK (Estonian); more to come
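
The translation step could look roughly like this. The corpus names and every tag below are placeholders, not the real Edisyn tag set or the native tag sets of any participating corpus; the sketch only shows the idea of one central query being rewritten per corpus.

```python
# Hypothetical mappings from central tags to each corpus's native tags.
NATIVE_TAGS = {
    "corpus_a": {"noun": "N", "verb": "V", "plural": "pl"},
    "corpus_b": {"noun": "SUBST", "verb": "VERB"},
}


def translate(central_tags, corpus):
    """Map central tags to one corpus's native tags; None if a tag has no equivalent."""
    mapping = NATIVE_TAGS[corpus]
    if any(tag not in mapping for tag in central_tags):
        return None
    return [mapping[tag] for tag in central_tags]


def plan_search(central_tags):
    """Return {corpus: native tags} for every corpus that can express the query."""
    plans = {}
    for corpus in NATIVE_TAGS:
        native = translate(central_tags, corpus)
        if native is not None:
            plans[corpus] = native
    return plans
```

Skipping a corpus that lacks an equivalent tag is one design choice among several; a search engine could also fall back to a broader tag or report the gap to the user.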

Other Meertens language resources

  • PLAND (Plant Names in Dutch Dialects)

  • NVD (Dutch Database of First Names)

  • NFD (Dutch Database of Family Names)

  • Corpus of free dialect speech (sound recordings)

  • Dutch Database of Toponyms (in development)

  • Dutch Song Database

  • Dutch Folktale Database

Other Meertens language resources

  • Apart from some of the sound recordings, all of these are web-based and built on the same database technology

  • We plan to make CLARIN metadata available for these resources in a stepwise manner: first metadata on the corpus level, later also metadata on the record level

  • The technologies involved (OAI-PMH) are new to us, so we want to do this in close cooperation with a “harvesting” institution to make sure that our output is correct
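
For orientation, this is roughly what a harvester's request looks like on the wire. In OAI-PMH a request is an HTTP call to a repository base URL with a `verb` parameter; `ListRecords`, `metadataPrefix`, `set` and the mandatory `oai_dc` format are part of the protocol. The base URL and the set name `sand` below are placeholders, and using one set per corpus is only an assumption about how the resources might be exposed.

```python
from urllib.parse import urlencode

BASE_URL = "https://example.org/oai"  # placeholder, not a real Meertens endpoint


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build the URL a harvester would request to list records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional: restrict harvesting to one set
    return f"{base_url}?{urlencode(params)}"
```

For example, `list_records_url(BASE_URL, set_spec="sand")` yields `https://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=sand`.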

Further in the future

  • The Meertens Institute wants to be part of CLARIN and in the future we also hope to contribute to the development of tools to work with language resources