
Language data and XML: archiving and interoperability

Presentation Transcript


  1. Language data and XML: archiving and interoperability • Simon Musgrave, Linguistics Program, Monash University (Simon.Musgrave@arts.monash.edu.au)

  2. Language documentation • Language documentation produces large quantities of text: • Transcribed language events • Associated annotations • Lexica / dictionaries • Analyses • Ethnographic notes • … • There is no standard software tool used by linguists • Use of proprietary software results in file formats with limited portability DRH 2003 - Cheltenham 2/9/03

  3. Advantages of XML: Archiving • Unicode compatibility assured • Besides support for different scripts, access to the full International Phonetic Alphabet character set is important for linguists • Explicit coding of data model • Generic file format assures better portability and lifespan

  4. Building an archive • Addition of data to an XML archive should be automated • This implies the existence of transformation scripts to move data between formats • Creating these scripts is work which has to be done • It can have a second benefit

  5. Advantages of XML: Interoperability • Members of a research team may use different software running on different platforms • Problems can arise in sharing data • An important use of XML is as an interchange format • Transformation scripts created for archiving can also be used for sharing data

  6. Data structures - 1 • Researchers may not agree on common data structures • They are used to working with one tool in one particular way • Their interests are different • Even if they agree on a data structure for current work, legacy data may have to be imported to the archive

  7. Data structures - 2 • Archive files must be able to hold all the information coded in all the possible input formats - there should be no loss of data • We can think of this in terms of the logic of attribute-value matrices: all inputs must be able to unify with the general data structure • Where possible, correspondences will be made between the information in different input files
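
The unification idea on this slide can be illustrated with a minimal Python sketch, assuming attribute-value matrices are represented as nested dictionaries. The field names and values are placeholders, not taken from the project's data, and the talk does not say how (or whether) unification was implemented.

```python
def unify(a, b):
    """Unify two attribute-value matrices held as nested dicts.

    Returns the merged matrix, or None when two atomic values conflict.
    """
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            # Information present in only one input is simply carried over.
            result[key] = b_val
            continue
        a_val = result[key]
        if isinstance(a_val, dict) and isinstance(b_val, dict):
            merged = unify(a_val, b_val)
            if merged is None:
                return None
            result[key] = merged
        elif a_val != b_val:
            # Conflicting atomic values: unification fails.
            return None
    return result


# Hypothetical records from two sources (placeholder field names and values).
source1 = {"headword": "word1", "gloss": {"english": "gloss-en"}}
source2 = {"headword": "word1", "gloss": {"indonesian": "gloss-id"}, "language": "lang1"}
merged = unify(source1, source2)
```

Here `merged` holds every attribute from both inputs, which is the "no loss of data" requirement in miniature; a conflict such as two different headwords would make `unify` return `None`.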

  8. Example: Dictionary files • The prototype implementation of the process uses a simple type of information: dictionary files • Source 1 is a FilemakerPro database of lexical material from the language Nusalaut • Source 2 is a table in an Access database containing data from several languages

  9. Source 1

  10. Source 2

  11. Process overview

  12. Stage 1 – txt to xml • Data exported from database as delimited text file • A document type definition (DTD) is created for each source file • This replicates the existing data structure, possibly with additions • A Perl script reads data from the txt file and adds tags based on the DTD
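
The talk's own Stage 1 script was written in Perl and is not shown in the transcript. As a rough illustration of the same step, here is a Python sketch that assumes a tab-delimited export whose first row gives the field names, and that those names are valid XML element names. The tag names `dictionary` and `entry` and the sample data are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def delimited_to_xml(text, root_tag, record_tag, delimiter="\t"):
    """Wrap each row of a delimited export in elements named after the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    root = ET.Element(root_tag)
    for row in reader:
        record = ET.SubElement(root, record_tag)
        for field, value in row.items():
            if value:  # skip empty or missing fields
                ET.SubElement(record, field).text = value
    return root


# Placeholder export: header row, then two records (the second has no "pos").
sample = "headword\tgloss\tpos\nword1\tgloss1\tn\nword2\tgloss2\t\n"
root = delimited_to_xml(sample, "dictionary", "entry")
xml_text = ET.tostring(root, encoding="unicode")
```

Like the stage described on the slide, this uses elements only and mirrors the source structure directly. It does not validate against a DTD, and rows with more fields than the header would need extra handling.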

  13. Sample: specific XML

  14. Stage 1 – Why? • Newer versions of commercial software offer an export to XML facility • Importing data from a normalized database often means having access to data from more than one table • XSLT takes a single input file • Perl (or an equivalent) does not have this limitation • Type conversion can be done using Perl

  15. Stage 2 – XML1 to XML2 • DTD for archive file has a place for all information in all input files • More structure imposed at this level • Stage 1 used only elements • Stage 2 uses attributes, mainly for metadata • “Pseudo-normalization”: recurring data substructures treated as optionally recurring elements – the archive data structure is actually more general than ANY of the inputs • Date stamping done at this stage
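
A minimal Python sketch of the three ideas on this slide (metadata as attributes, "pseudo-normalization" and date stamping) might look like the following. The element and attribute names (`form`, `gloss`, `source`, `added`) and the numbered `gloss1`, `gloss2` input fields are assumptions made for illustration. The actual archive DTD appears only on the sample slides, which are not reproduced in this transcript.

```python
import datetime
import xml.etree.ElementTree as ET


def to_archive_entry(specific_entry, source_name, stamp=None):
    """Map one source-specific <entry> onto a hypothetical general archive entry."""
    stamp = stamp or datetime.date.today().isoformat()
    # Metadata (source and date stamp) is carried in attributes.
    entry = ET.Element("entry", {"source": source_name, "added": stamp})
    ET.SubElement(entry, "form").text = specific_entry.findtext("headword", default="")
    # Pseudo-normalization: gloss1, gloss2, ... become repeated <gloss> elements.
    for child in specific_entry:
        if child.tag.startswith("gloss") and child.text:
            ET.SubElement(entry, "gloss").text = child.text
    return entry


specific = ET.fromstring(
    "<entry><headword>word1</headword><gloss1>g1</gloss1><gloss2>g2</gloss2></entry>"
)
# The stamp is passed explicitly here so the result is reproducible.
archive = to_archive_entry(specific, "source1", stamp="2003-01-01")
archive_xml = ET.tostring(archive, encoding="unicode")
```

Because `<gloss>` may repeat any number of times, the archive entry can hold an input with one gloss column or with several, which is the sense in which the archive structure is more general than any single input.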

  16. Sample: General XML 1

  17. Sample: General XML 2

  18. Exporting Data • XSLT with <xsl:output method="text"/> • The only complication is undoing “pseudo-normalization”
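
The talk performs this export with XSLT. Purely to illustrate what undoing "pseudo-normalization" involves, here is a Python sketch that flattens repeated elements back into a fixed number of columns. The element names `form` and `gloss` and the `max_glosses` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def export_rows(archive_root, max_glosses=2, delimiter="\t"):
    """Flatten archive entries into delimited lines with a fixed number of gloss columns."""
    lines = []
    for entry in archive_root.findall("entry"):
        glosses = [g.text or "" for g in entry.findall("gloss")][:max_glosses]
        glosses += [""] * (max_glosses - len(glosses))  # pad missing columns
        lines.append(delimiter.join([entry.findtext("form", default="")] + glosses))
    return "\n".join(lines)


archive = ET.fromstring(
    "<dictionary>"
    "<entry><form>word1</form><gloss>g1</gloss><gloss>g2</gloss></entry>"
    "<entry><form>word2</form><gloss>g3</gloss></entry>"
    "</dictionary>"
)
text = export_rows(archive)
```

Note that an entry with more than `max_glosses` glosses is truncated in this sketch, so a target format with fewer columns than the archive has repeats cannot receive everything.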

  19. A more complex problem: aligned interlinear text • Important way of presenting data for linguists • Several lines of annotation; different levels have different alignment patterns

  20. The Bird, Bow & Hughes Model • Bird, Steven, Cathy Bow and Baden Hughes (2003). A generalised model of interlinear text. Proceedings of the EMELD Workshop. • A general data model for representing this type of information • Four levels: • Text • Phrase • Word • Morpheme
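
The four levels named on this slide can be pictured as nested containers. The Python sketch below takes only the level names from the slide. The fields attached to each level (`gloss`, `translation`, `title`) are assumptions for illustration and may not match the published model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morpheme:
    form: str
    gloss: str = ""  # assumed annotation at morpheme level


@dataclass
class Word:
    form: str
    morphemes: List[Morpheme] = field(default_factory=list)


@dataclass
class Phrase:
    words: List[Word] = field(default_factory=list)
    translation: str = ""  # assumed free translation at phrase level


@dataclass
class Text:
    title: str
    phrases: List[Phrase] = field(default_factory=list)


def morpheme_count(text):
    """Count morphemes across every word of every phrase in a text."""
    return sum(len(w.morphemes) for p in text.phrases for w in p.words)


# Placeholder content only; no real language data.
example = Text(
    title="text1",
    phrases=[
        Phrase(
            words=[
                Word("w1", [Morpheme("m1", "g1"), Morpheme("m2", "g2")]),
                Word("w2", [Morpheme("m3", "g3")]),
            ],
            translation="free translation",
        )
    ],
)
```

Each annotation attaches to the level it describes, so a free translation aligns with a whole phrase while a gloss aligns with a single morpheme.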

  21. XML model for aligned text

  22. Aligned text: Problems • Various types of input: • Text strings with space and/or tabs (Shoebox) • Formatted text (e.g. Word tables) • Structured data (e.g. Spinoza database) • Type of processing varies • Text strings need a lot of parsing • Structured data needs access to multiple tables • Ideally, time alignment to AV source should also be included
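
To give a sense of the parsing that space-aligned text strings require, here is a minimal Python sketch that pairs each word with the annotation starting in the same column. It assumes alignment by space padding only (no tabs) and two lines of equal status. Real Shoebox files carry field markers and other conventions that this sketch ignores, and the sample strings are placeholders.

```python
import re


def align_by_column(word_line, gloss_line):
    """Pair each word with the gloss text beneath it, using character offsets."""
    starts = [m.start() for m in re.finditer(r"\S+", word_line)]
    pairs = []
    for i, start in enumerate(starts):
        if i + 1 < len(starts):
            end = starts[i + 1]  # a column runs up to the next word's start
        else:
            end = max(len(word_line), len(gloss_line))
        word = word_line[start:end].strip()
        gloss = gloss_line[start:end].strip()
        pairs.append((word, gloss))
    return pairs


words = "aa   bbbb cc"
glosses = "one  two  three"
pairs = align_by_column(words, glosses)
```

Column boundaries come from the word line alone, so a gloss that is longer than its column and runs under the next word would be split incorrectly. That is one reason this kind of input needs considerably more parsing than structured data.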

  23. What is gained • Interoperability within the project • Data can be imported to the archive file from one format and exported to another format • Interoperability outside the project • People who wish to share data with a group will define transformations from their data formats • A bottom-up approach to developing standards • Improved data modeling • Encourages members of the project to revise their data formats • Gives us help in developing high-level models for linguistic data

  24. Future work • Processing aligned text formats • Using schemas rather than DTDs: data validation • Improved version control, especially checking for duplicate or conflicting records

  25. Some details • This work is part of the project Endangered Maluku Languages: Eastern Indonesia and the Dutch Diaspora • Funding: • Hans Rausing Endangered Languages Project • Australian Research Council • Faculty of Arts, Monash University • Contacts: • maluku@arts.monash.edu.au • http://www.arts.monash.edu.au/ling/maluku