
EXMARaLDA

EXMARaLDA. Thomas Schmidt SFB 538 „Mehrsprachigkeit“ University of Hamburg. Data Formats and Tools at the SFB. 2200 transcriptions of spoken language (30 min recording each)



Presentation Transcript


  1. EXMARaLDA. Thomas Schmidt, SFB 538 „Mehrsprachigkeit“ (Multilingualism), University of Hamburg. IRCS Workshop on Linguistic Databases, 11-13 December 2001

  2. Data Formats and Tools at the SFB
     • 2200 transcriptions of spoken language (30 min of recording each)
     • Language acquisition data, interviews, expert discourse, classroom discourse, presentation discourse, interpreted discourse, ...
     • 15 languages (German, English, Swedish, Norwegian, Danish, French, Spanish, Portuguese, Turkish, Italian, Basque, Japanese, Chinese, Russian, Luganda)
     • 9 different data formats (dBase, syncWriter, HIAT-DOS, Verbmobil, ...)
     • 3 different operating systems (Mac OS 9.x, Windows, Linux) + Mac OS X
     • Research interests: phonetics, syntax, discourse, ...

  3. Data Formats and Tools at the SFB
     • syncWriter:
       • editor for interlinear text
       • Mac OS 9.x and earlier
       • outputs binary data

  4. Data Formats and Tools at the SFB
     • HIAT-DOS:
       • editor for HIAT transcription
       • MS-DOS/Windows
       • outputs text files

  5. Data Formats and Tools at the SFB
     • dBase/Access/4th Dimension:
       • utterance databases

  6. Data Formats and Tools at the SFB
     • Verbmobil:
       • 7-bit ASCII files

  7. Database „Multilingualism“
     Goals:
     1. To have one common tool for accessing (querying) the data
        • Data must come in one format (AG)
        • Multilingual issues must be taken care of (Unicode)
        • Data format should be software-independent (XML)
        • Software should work across different operating systems (Java)
     2. To have different tools reflecting the habits and needs of the different projects
        → different input methods (score, column, vertical notation)
        → different output methods (ditto)

  8. [Diagram: SyncWriter, HIAT-DOS, Verbmobil, Access / dBase → ? → Database „Multilingualism“ (SQL database)]

  9. [Diagram: SyncWriter, HIAT-DOS, Verbmobil, Access / dBase → EXMARaLDA (Basic Transcription, Segmented Transcription, List Transcription; Input / Editing Tools; Output / Visualization Tools) → Database „Multilingualism“ (SQL database)]

  10. „Traditional“ layout principles
      1. Score notation („Partitur“)

      MAX [v]   You keep interrupting me, Tom.
      MAX [nv]  ------ pointing at Tom -------------
      TOM [v]                         Oh, I'm  sorry for that.
      TOM [nv]                        ----- smiling ---------------

  11. „Traditional“ layout principles
      1. Score notation („Partitur“)
      Same score as on slide 10. Label added to the figure: Tiers

  12. „Traditional“ layout principles
      1. Score notation („Partitur“)
      Same score as on slide 10. Labels on the figure: Categories, Speakers, Tiers

  13. „Traditional“ layout principles
      1. Score notation („Partitur“)

                0                     1        2               3
      MAX [v]   You keep interrupting me, Tom.
      MAX [nv]  ------ pointing at Tom -------------
      TOM [v]                         Oh, I'm  sorry for that.
      TOM [nv]                        ----- smiling ---------------

      Labels on the figure: Categories, Speakers, Timeline, Tiers

  14. „Traditional“ layout principles
      1. Score notation („Partitur“)
      Same score and timeline (0 1 2 3) as on slide 13. Labels on the figure: Events, Categories, Speakers, Timeline, Tiers
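The elements labelled on the score (speakers, categories, tiers, timeline, events) can be sketched as a small data structure and laid out again as a score. The following is only an illustration, not EXMARaLDA's own code: the names `TIERS` and `render_score` are hypothetical, the span of TOM's non-verbal event is an assumption read off the score, and the dashes around non-verbal events are left out.

```python
# Hypothetical in-memory model: one entry per tier (speaker, category, events).
# Each event is (start, end, text); start and end are indices into the timeline.
TIERS = [
    ("MAX", "v",  [(0, 1, "You keep interrupting"), (1, 2, "me, Tom.")]),
    ("MAX", "nv", [(0, 2, "pointing at Tom")]),
    ("TOM", "v",  [(1, 2, "Oh, I'm"), (2, 3, "sorry for that.")]),
    ("TOM", "nv", [(1, 3, "smiling")]),  # assumed to span points 1 to 3
]


def render_score(tiers, n_points):
    """Lay the tiers out as rows so that events sharing a start point line up."""
    # One column per timeline interval, wide enough for the text it holds.
    widths = [1] * (n_points - 1)
    all_events = [e for _, _, events in tiers for e in events]
    # Short events first, so longer ones only widen a column when they must.
    for start, end, text in sorted(all_events, key=lambda e: e[1] - e[0]):
        missing = len(text) + 1 - sum(widths[start:end])
        if missing > 0:
            widths[end - 1] += missing

    # Horizontal offset of every timeline point.
    offsets = [0]
    for w in widths:
        offsets.append(offsets[-1] + w)

    label_width = max(len(f"{s} [{c}] ") for s, c, _ in tiers)
    timeline = "".join(str(i).ljust(w) for i, w in enumerate(widths))
    lines = [" " * label_width + timeline + str(n_points - 1)]
    for speaker, category, events in tiers:
        row = ""
        for start, end, text in sorted(events):
            row = row.ljust(offsets[start]) + text.ljust(offsets[end] - offsets[start])
        lines.append(f"{speaker} [{category}] ".ljust(label_width) + row.rstrip())
    return "\n".join(lines)


print(render_score(TIERS, 4))
```

The point of the sketch is that the layout is derived entirely from the timeline: two events are printed in the same column because they share a start point, not because anyone aligned them by hand.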

  15. „Traditional“ layout principles
      1. Score notation („Partitur“) → Basic Transcription

      <transcription>
        <speakertable>
          <speaker id="SPK1" name="MAX"/>
          <speaker id="SPK2" name="TOM"/>
        </speakertable>
        <timeline>
          <timepoint id="T0"/>
          <timepoint id="T1"/>
          <timepoint id="T2"/>
          <timepoint id="T3"/>
        </timeline>
        <tier speaker="SPK1" category="v">
          <event start="T0" end="T1">You keep interrupting </event>
          <event start="T1" end="T2">me, Tom. </event>
        </tier>
        <tier speaker="SPK1" category="nv">
          <event start="T0" end="T2">pointing at Tom</event>
        </tier>
      </transcription>

      Labels on the figure: Categories, Speakers, Events, Timeline, Tiers
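As a rough illustration of how such a Basic Transcription could be read by software, the sketch below parses the XML from this slide with Python's standard `xml.etree.ElementTree` module. The function name `load_tiers` and the returned structure are assumptions made for this example, and the attribute values are written with straight quotes, which an XML parser requires.

```python
import xml.etree.ElementTree as ET

# The XML from the slide (only MAX's two tiers are shown there).
BASIC_TRANSCRIPTION = """
<transcription>
  <speakertable>
    <speaker id="SPK1" name="MAX"/>
    <speaker id="SPK2" name="TOM"/>
  </speakertable>
  <timeline>
    <timepoint id="T0"/>
    <timepoint id="T1"/>
    <timepoint id="T2"/>
    <timepoint id="T3"/>
  </timeline>
  <tier speaker="SPK1" category="v">
    <event start="T0" end="T1">You keep interrupting </event>
    <event start="T1" end="T2">me, Tom. </event>
  </tier>
  <tier speaker="SPK1" category="nv">
    <event start="T0" end="T2">pointing at Tom</event>
  </tier>
</transcription>
"""


def load_tiers(xml_text):
    """Return {(speaker name, category): [(start, end, text), ...]}."""
    root = ET.fromstring(xml_text)
    # Resolve speaker ids (SPK1) to display names (MAX).
    names = {s.get("id"): s.get("name") for s in root.iter("speaker")}
    tiers = {}
    for tier in root.iter("tier"):
        key = (names[tier.get("speaker")], tier.get("category"))
        tiers[key] = [
            (e.get("start"), e.get("end"), (e.text or "").strip())
            for e in tier.iter("event")
        ]
    return tiers


tiers = load_tiers(BASIC_TRANSCRIPTION)
for (speaker, category), events in tiers.items():
    for start, end, text in events:
        print(f"{speaker} [{category}] {start}-{end}: {text}")
```

Events refer to the timeline only through the `start` and `end` ids, so the same file can feed a score, column, or vertical layout.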

  16. „Traditional“ layout principles
      2. Column notation

      MAX [v]                MAX [nv]         TOM [v]          TOM [nv]
      You keep interrupting  pointing at Tom
      me, Tom.                                Oh, I'm          smiling
                                              sorry for that.

  17. „Traditional“ layout principles
      2. Column notation → Basic Transcription

         MAX [v]                MAX [nv]         TOM [v]          TOM [nv]
      0  You keep interrupting  pointing at Tom
      1  me, Tom.                                Oh, I'm          smiling
      2                                          sorry for that.
      3

      Labels on the figure: Categories, Speakers, Events, Timeline, Tiers
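Column notation uses the same tiers and timeline as the score, turned by ninety degrees: tiers become columns and timeline points become rows. A minimal sketch under that reading follows. `TIERS` and `column_notation` are hypothetical names, and an event is simply placed in the row of the point where it starts.

```python
# Hypothetical model: tier label -> events as (start, end, text),
# with start and end as indices into the timeline 0..3.
TIERS = {
    "MAX [v]":  [(0, 1, "You keep interrupting"), (1, 2, "me, Tom.")],
    "MAX [nv]": [(0, 2, "pointing at Tom")],
    "TOM [v]":  [(1, 2, "Oh, I'm"), (2, 3, "sorry for that.")],
    "TOM [nv]": [(1, 3, "smiling")],  # assumed span
}


def column_notation(tiers, n_points):
    """Return (header, rows): one column per tier, one row per interval."""
    header = list(tiers)
    rows = []
    for point in range(n_points - 1):
        row = []
        for label in header:
            # Print an event in the row of the timeline point where it starts.
            text = next((t for start, _, t in tiers[label] if start == point), "")
            row.append(text)
        rows.append(row)
    return header, rows


header, rows = column_notation(TIERS, 4)
print(" | ".join(header))
for point, row in enumerate(rows):
    print(point, " | ".join(row))
```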

  18. „Traditional“ layout principles
      3. Vertical notation

      MAX (pointing at Tom) You keep interrupting [me, Tom.]
      TOM (smiling) [Oh, I'm] sorry for that.

  19. „Traditional“ layout principles
      3. Vertical notation
      Same transcript as on slide 18. Labels on the figure: Categories, Speakers, Events, Timeline, Tiers

  20. „Traditional“ layout principles
      3. Vertical notation
      Same transcript as on slide 18. Labels on the figure: Speaker-Turns, Categories, Speakers, Events, Timeline, Tiers
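Vertical notation can likewise be derived from timeline-anchored events: non-verbal events go in parentheses, and verbal events that overlap another speaker's speech go in square brackets. The sketch below illustrates that idea only. `EVENTS`, `overlaps`, and `vertical_notation` are hypothetical names, the span of "smiling" is assumed, and each speaker is treated as having a single turn, which is a simplification.

```python
# Hypothetical flat event list: (speaker, category, start, end, text),
# with start and end as indices into the timeline 0..3.
EVENTS = [
    ("MAX", "nv", 0, 2, "pointing at Tom"),
    ("MAX", "v", 0, 1, "You keep interrupting"),
    ("MAX", "v", 1, 2, "me, Tom."),
    ("TOM", "nv", 1, 3, "smiling"),  # assumed span
    ("TOM", "v", 1, 2, "Oh, I'm"),
    ("TOM", "v", 2, 3, "sorry for that."),
]


def overlaps(a, b):
    """True if the two events share at least one timeline interval."""
    return a[2] < b[3] and b[2] < a[3]


def vertical_notation(events):
    """Return one line per speaker, in order of first appearance."""
    verbal = [e for e in events if e[1] == "v"]
    turns = {}
    # Order by start point; at the same point, non-verbal events come first.
    for ev in sorted(events, key=lambda e: (e[2], e[1] == "v")):
        speaker, category, _, _, text = ev
        if category == "nv":
            piece = f"({text})"
        elif any(o[0] != speaker and overlaps(ev, o) for o in verbal):
            piece = f"[{text}]"  # simultaneous speech
        else:
            piece = text
        turns.setdefault(speaker, []).append(piece)
    return [f"{speaker} {' '.join(pieces)}" for speaker, pieces in turns.items()]


for line in vertical_notation(EVENTS):
    print(line)
# MAX (pointing at Tom) You keep interrupting [me, Tom.]
# TOM (smiling) [Oh, I'm] sorry for that.
```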

  21. Structure Of Annotated Data

      You keep interrupting me, Tom.
      Oh, I'm sorry for that

      Level shown: Events (temporal structure)

  22. Structure Of Annotated Data

      You keep interrupting me, Tom.
        German: Immer unterbrichst Du mich, Tom
      Oh, I'm sorry for that
        German: Oh, das tut mir Leid.

      Levels shown: Events (temporal structure), Utterances (linguistic structure)

  23. Structure Of Annotated Data

      You keep interrupting me, Tom.
        German: Immer unterbrichst Du mich, Tom
        Parts of speech: Pro V Vpart Pro PN
      Oh, I'm sorry for that
        German: Oh, das tut mir Leid.
        Parts of speech: Int Pro V Adj Prep Pro

      Levels shown: Events (temporal structure), Utterances (linguistic structure), Words (linguistic structure), ...

  24. Structure Of Annotated Data
      [Figure: two annotation graphs]

      Graph 1, nodes 0 a b 1 c 2:
        W:   You | keep | interrupting | me | Tom
        POS: pro | v | vpart | pro | pn
        U:   You keep interrupting me, Tom.
        GER: Immer unterbrichst Du mich, Tom.

      Graph 2, nodes 1 d e 2 3:
        W:   Oh | I | 'm
        POS: int | pn | v
        U:   Oh, I'm sorry for that.
        GER: Oh, das tut mir Leid.
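One way to picture the first graph is as a list of labelled arcs between ordered nodes, where word-level nodes (a, b, c) are inserted between the timeline points (0, 1, 2). The sketch below is an interpretation, not the format used by EXMARaLDA. The arc endpoints are an assumption inferred from the node order and the word order on the slide, and `NODE_ORDER`, `ARCS`, and `arcs_within` are hypothetical names.

```python
# Nodes of the first graph in left-to-right order (from the slide).
NODE_ORDER = ["0", "a", "b", "1", "c", "2"]

# Arcs as (start node, end node, type, label). Endpoints are assumed:
# each word is taken to span two neighbouring nodes.
ARCS = [
    ("0", "a", "W", "You"),
    ("a", "b", "W", "keep"),
    ("b", "1", "W", "interrupting"),
    ("1", "c", "W", "me"),
    ("c", "2", "W", "Tom"),
    ("0", "a", "POS", "pro"),
    ("a", "b", "POS", "v"),
    ("b", "1", "POS", "vpart"),
    ("1", "c", "POS", "pro"),
    ("c", "2", "POS", "pn"),
    ("0", "2", "U", "You keep interrupting me, Tom."),
    ("0", "2", "GER", "Immer unterbrichst Du mich, Tom."),
]


def arcs_within(arcs, start, end, arc_type):
    """Labels of all arcs of one type lying between two nodes, in arc order."""
    lo, hi = NODE_ORDER.index(start), NODE_ORDER.index(end)
    return [
        label
        for s, e, t, label in arcs
        if t == arc_type and lo <= NODE_ORDER.index(s) and NODE_ORDER.index(e) <= hi
    ]


# Words covered by the utterance, then the tags within the first event (0 to 1).
print(arcs_within(ARCS, "0", "2", "W"))
print(arcs_within(ARCS, "0", "1", "POS"))
```

Because words, tags, utterances, and translations are all arcs over the same node sequence, a query such as "all words inside this utterance" reduces to a comparison of node positions.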
