SloFon 21 April 2006

Presentation Transcript


  1. Phonetic characters in digital editions • Tomaž Erjavec¹ & Matija Ogrin² • tomaz.erjavec@ijs.si, matija.ogrin@zrc-sazu.si • ¹ Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana • ² Institute of Slovenian Literature and Literary Sciences, Scientific Research Centre of the Slovenian Academy of Sciences and Arts, Ljubljana • SloFon, 21 April 2006

  2. Overview of the talk • IPA • PUA • TEI

  3. The problem • provide standardised encoding (XML) and Web viewing (HTML) of complex digital editions • in particular, the Freising manuscripts (e-BS) • work in progress in the project “Scholarly Digital Editions of Slovenian Literature” http://nl.ijs.si/e-zrc/

  4. Focus of the talk • e-BS, a very complex document: facsimile, commentary, diplomatic and critical transcriptions, translations, dictionary, bibliography, name index, … • but also: • phonetic transcription in IPA • (recording)

  5. HTML representation of the e-BS phonetic transcription

  6. IPA • International Phonetic Alphabet (International Phonetic Association) • contains characters that are not well supported, e.g. ɐ, ɕ, ɚ, ɷ • heavy use of diacritics: • unusual diacritical marks: ˀ ˒ ˤ • more than one diacritic: ǡ • diacritics spanning digraphs:

  7. Computer representation of IPA • SAMPA (for HLT) • transliteration to ASCII • SAMPA for contemporary Slovenian: http://www.phon.ucl.ac.uk/home/sampa/sloven-uni.htm • ZEMLJAK, Melita, KAČIČ, Zdravko, DOBRIŠEK, Simon, ŽGANEC GROS, Jerneja, WEISS, Peter. Računalniški simbolni fonetični zapis slovenskega govora. Slav. rev., apr.-jun. 2002, 50/2, 159-169. • Unicode (for humans) • universal character set, increasingly well supported • contains the blocks “IPA Extensions” and “Combining Diacritical Marks” • various good Unicode IPA fonts available, e.g. Doulos SIL • for non-standardised characters: Private Use Area (PUA) • not to be used lightly!
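The Unicode blocks named on this slide can be told apart by code point alone. The following is a minimal Python sketch, assuming the standard block ranges (IPA Extensions U+0250–U+02AF, Spacing Modifier Letters U+02B0–U+02FF, Combining Diacritical Marks U+0300–U+036F, and the BMP Private Use Area U+E000–U+F8FF). The names `classify` and `pua_characters` are illustrative and not part of any tool mentioned in the talk.

```python
import unicodedata

# Block ranges taken from the Unicode code charts (inclusive bounds).
BLOCKS = [
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x02B0, 0x02FF, "Spacing Modifier Letters"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0xE000, 0xF8FF, "Private Use Area"),
]


def classify(ch):
    """Return the name of the block a single character falls in, or 'other'."""
    cp = ord(ch)
    for low, high, name in BLOCKS:
        if low <= cp <= high:
            return name
    return "other"


def pua_characters(text):
    """List the private-use code points in a text (general category 'Co')."""
    return [f"U+{ord(c):04X}" for c in text if unicodedata.category(c) == "Co"]
```

A check such as `pua_characters(line)` could be run over a transcription to see which characters would be unreadable without the font that defines them.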

  8. Unicode definitions

  9. Unicode definitions

  10. ZRCola • developed at ZRC SAZU (Peter Weiss) • Unicode input system for linguistic use in the WinWord program: • decomposed and composed characters: • keyboard input • font which covers historical characters as well as IPA & (now) some specifics of e-BS • hence ideal for use in e-BS

  11. ZRCola and PUA

  12. Why PUA? • The ZRCola font uses the PUA mostly for: • defining new Slovene(-related) historical characters • composed characters with diacritics (+ digraphs), for better diacritic placement • Unicode offers Combining Diacritical Marks, but complex stacks can cause problems for font rendering

  13. Some comparisons • PUA EB25 (ZRCola), mapping to r+0300+0329: Times NR r̩̀, MS Tahoma r̩̀, Doulos SIL r̩̀ • PUA EEC8 (ZRCola), ~mapping to t+j+032E: Times NR tj̮, MS Tahoma tj̮, Doulos SIL tj̮ • PUA E31B (ZRCola), mapping to 0105+0307: Times NR ą̇, MS Tahoma ą̇, Doulos SIL ą̇ • PUA E35E (ZRCola), mapping to 00E6+0303+0300: Times NR æ̃̀, MS Tahoma æ̃̀, Doulos SIL æ̃̀
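The four mappings listed on this slide can be written down as a small lookup table. The sketch below is illustrative only: the dictionary holds just these four entries, the names `SLIDE_MAPPINGS`, `to_standard` and `describe` are hypothetical, and the entry for EEC8 is the approximate (~) mapping given on the slide. It also shows that Unicode normalisation may reorder stacked combining marks, so comparisons are safer on normalised strings.

```python
import unicodedata

# PUA code point -> standard Unicode sequence, as listed on the slide.
SLIDE_MAPPINGS = {
    0xEB25: "r\u0300\u0329",       # r + grave accent + vertical line below
    0xEEC8: "tj\u032E",            # t + j + breve below (approximate mapping)
    0xE31B: "\u0105\u0307",        # a with ogonek + dot above
    0xE35E: "\u00E6\u0303\u0300",  # ae + tilde + grave accent
}


def to_standard(text, table=SLIDE_MAPPINGS):
    """Replace every PUA character found in the table by its standard sequence."""
    return "".join(table.get(ord(c), c) for c in text)


def describe(sequence):
    """Return the Unicode names of the characters in a sequence."""
    return [unicodedata.name(c) for c in sequence]


# NFC applies canonical ordering: marks below the base (combining class 220)
# are sorted before marks above it (combining class 230).
normalised = unicodedata.normalize("NFC", SLIDE_MAPPINGS[0xEB25])
```

None of the four sequences has a precomposed equivalent in standard Unicode, which is why they stay multi-character strings after normalisation.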

  14. Problem • PUA = Private Use Area, but • e-ZRC = standardised & interchangeable • How to retain the benefits of ZRCola, yet make e-BS interchangeable? • How to enable reading e-BS on platforms without the ZRCola font?

  15. Text Encoding Initiative • e-ZRC editions encoded in XML • using the Text Encoding Initiative Guidelines, TEI P4 • TEI P5 makes provisions for encoding PUA characters and glyphs • in TEI P4 user extensions are necessary to achieve the same effect

  16. PUA in TEI P5 • TEI P5 chapter 25, “Representation of non-standard characters and glyphs” • markup in text to identify PUA characters or glyphs • link these elements to their TEI header definition • TEI header can give, for each new character: • a name (text description à la Unicode), e.g. LATIN SMALL LETTER A • mapping to standard Unicode • character properties • rendering software (e.g. XSLT stylesheet for conversion to HTML) can then use the PUA version, or the standard version

  17. Markup in the document • text: b:ʒɛ g:spɔdi miłɔstíwi :tɛ b:ʒɛ tɛbǽ ispɔwǽdæ • in XML: <line n="2" id="bsPT.1.002"> b<g corresp="zrcolaE656"/>:ʒɛ g<g corresp="zrcolaE656"/>:spɔdi miłɔstíwi <g corresp="zrcolaE656"/>:t<g corresp="zrcolaEECC"/>ɛ b<g corresp="zrcolaE656"/>:ʒɛ tɛbǽ ispɔwǽdæ </line>

  18. Markup in the header • PUA characters are defined in teiHeader/encodingDesc: <charDesc> <desc>PUA characters as defined by <xref url="http://zrcola.zrc-sazu.si/">ZRCola</xref>. Character descriptions taken from and based on The Unicode Standard 4.1, U41M050317.lst </desc> <char id="zrcolaE31B"> <charName>LATIN SMALL LETTER A WITH OGONEK AND DOT ABOVE</charName> <charProp><localName>font</localName><value>ZRCola</value></charProp> <charProp><localName>mapping</localName><value>exact</value></charProp> <mapping type="PUA">&#xE31B;</mapping> <mapping type="standard">&#x0105;<!--LATIN SMALL LETTER A WITH OGONEK-->&#x0307;<!--COMBINING DOT ABOVE--></mapping> </char> <!-- more chars --> </charDesc>
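The link between a `<g corresp="…"/>` element in the text and its `<char>` definition in the header can be followed with any XML library. The sketch below uses Python's standard `xml.etree.ElementTree`; it is an illustration rather than the project's own tooling. The header is cut down to the single character defined on this slide, and the sample `<line>` is made up for the example (it is not taken from e-BS). The function names `load_mappings` and `resolve` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Reduced header: only the character shown on the slide, without comments.
HEADER = """<charDesc>
  <char id="zrcolaE31B">
    <charName>LATIN SMALL LETTER A WITH OGONEK AND DOT ABOVE</charName>
    <mapping type="PUA">&#xE31B;</mapping>
    <mapping type="standard">&#x0105;&#x0307;</mapping>
  </char>
</charDesc>"""

# Made-up text line that references the character definition above.
LINE = '<line n="1">s<g corresp="zrcolaE31B"/>t</line>'


def load_mappings(header_xml):
    """Return {char id: {mapping type: character string}} from a charDesc."""
    table = {}
    for char in ET.fromstring(header_xml).iter("char"):
        table[char.get("id")] = {
            m.get("type"): m.text or "" for m in char.findall("mapping")
        }
    return table


def resolve(line_xml, table, encoding="standard"):
    """Flatten a line to plain text, expanding <g> in the chosen encoding."""
    line = ET.fromstring(line_xml)
    parts = [line.text or ""]
    for child in line:
        if child.tag == "g":
            parts.append(table[child.get("corresp")][encoding])
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)


TABLE = load_mappings(HEADER)
```

Calling `resolve(LINE, TABLE, "PUA")` yields the version for readers who have the ZRCola font, while `resolve(LINE, TABLE, "standard")` yields the version built from standard Unicode characters.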

  19. Standardisation of ZRCola PUA • ZRCola is very well documented “visually”, i.e. for humans • but lacks machine-processable metadata: • Unicode-compliant name • mapping to standard Unicode (identity, similarity) • we implemented only the 50+ characters that actually appear in e-BS • substantial work to describe all PUA characters in the ZRCola distribution • maybe better to abandon the precomposed PUA characters that can be expressed in standard Unicode?
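Part of the missing metadata could be drafted automatically once the mapping to standard Unicode is known. The following sketch derives a candidate name for a composed character from the Unicode names of its parts. It is a heuristic under stated assumptions: the function `candidate_name` is hypothetical, it expects one base character followed only by combining marks, and its output would still need human review, since official Unicode names do not always follow this pattern.

```python
import unicodedata


def candidate_name(sequence):
    """Draft a Unicode-style name for a base character plus combining marks."""
    base, marks = sequence[0], sequence[1:]
    name = unicodedata.name(base)
    for mark in marks:
        mark_name = unicodedata.name(mark)
        # Drop the "COMBINING " prefix used in the names of combining marks.
        if mark_name.startswith("COMBINING "):
            mark_name = mark_name[len("COMBINING "):]
        # The first modifier is introduced by WITH, later ones by AND.
        joiner = " AND " if " WITH " in name else " WITH "
        name += joiner + mark_name
    return name
```

For the sequence U+0105 U+0307 this reproduces the name used in the header example, LATIN SMALL LETTER A WITH OGONEK AND DOT ABOVE.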

  20. PUA display with ZRCola

  21. PUA display without ZRCola

  22. Documentation

  23. Mapping to Unicode, Doulos SIL font

  24. TEI to HTML <xsl:template match="g"> <xsl:variable name="glyph" select="id(@corresp)/mapping[@type=$ENCODING]"/> <SPAN> <xsl:if test="$ENCODING = 'standard'"> <xsl:attribute name="class"> <xsl:value-of select="id(@corresp)/charProp[localName='mapping']/value"/> </xsl:attribute> </xsl:if> <xsl:attribute name="title"> <xsl:value-of select="id(@corresp)/charProp[localName='font']/value"/> <xsl:text>: </xsl:text> <xsl:value-of select="id(@corresp)/charName"/> </xsl:attribute> <xsl:value-of select="$glyph"/> </SPAN> </xsl:template>

  25. Conclusions • introduced IPA, PUA & TEI • showed how PUA characters can, via TEI, be made: • interchangeable • documented • flexibly presented • this does require an investment of time by the designers of PUA characters
