Text Encoding for Interchange: Myths and Realities

Lou Burnard

Oxford University Computing Services

Yesterday's Information Tomorrow?


We live in interesting times

  • Traditional academic goals

    • sharing and exchange of information

    • creation of re-usable resources

    • dual focus on teaching and research

  • Digital technologies can contribute to these traditional goals, not subvert them


Digital technologies offer opportunities…

  • integration of disparate sources

    • texts, commentaries, sources, variations…

    • multimedia, manuscripts, transcriptions, metadata…

  • a new way of preservation

    • media disappear, data remain

    • "multiplication beyond the reach of accident"

  • a huge expansion of accessibility

    • quantitative

    • qualitative


… and challenges

  • integration of disparate sources

    • Different user communities have different -- and sometimes contradictory -- agendas and priorities

  • a new way of preservation

    • The business model is unclear

    • The technical problems may be insuperable

  • a huge expansion of accessibility

    • Depends on huge expansion of metadata provision

    • Both quantitative and qualitative expansion


Academia offers the technical world:

  • a range of interesting technical problems

  • a new raison d’être: conservation of cultural heritage … and also of contemporary culture

  • some tried and tested techniques

    • hermeneutics/semiotics

    • linguistic insights

    • robust and modular encoding schemes


[Diagram. Labels: resources; analysis; abstract model; encoding; digital resources]


Making digital resources

  • Texts are more than simply sequences of glyphs

    • They have structure and context

    • They also have multiple readings

  • Encoding or markup provides a means of making such readings explicit

    • only that which is explicit can be digitally processed

  • Not all resources are textual – but they all require reading.


Quick recap: what’s markup for?

  • Markup is a way of making explicit the distinctions we want a computer to make when it processes a string of bytes (aka a text)

  • It’s a way of naming and identifying the parts of a document in a controlled way

  • It’s (usually) more useful to mark up what things are than what they look like (or should look like)


What’s the point of markup?

  • To make explicit (for a machine) what is implicit (to a person)

  • To add value by multiple annotations

  • To facilitate re-use of digital resources

    • In different contexts

    • In different formats

    • For different audiences


XML: what it is and why you should care

  • XML is a generic markup language

  • It simplifies the representation of structured data as linear character strings

  • XML looks like HTML, except that:

    • XML is extensible

    • XML must be well-formed

    • XML can be validated

    • XML is application-, platform-, and vendor- independent

  • XML empowers the content provider and facilitates data integration
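The well-formedness requirement is easy to observe in practice. The following is a minimal sketch using the parser in Python's standard library; the helper name `is_well_formed` and the two sample strings are illustrative assumptions, not part of the presentation.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# HTML-style markup: the first <p> is never closed. Browsers tolerate this,
# but an XML parser must reject it.
html_style = "<body><p>first paragraph<p>second paragraph</body>"
# The same content with every element explicitly closed.
xml_style = "<body><p>first paragraph</p><p>second paragraph</p></body>"

print(is_well_formed(html_style))
print(is_well_formed(xml_style))
```

Well-formedness only says that the tags nest properly; checking that the right elements appear in the right places is the separate job of validation against a schema.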


XML concepts: a review

  • an XML object is composed of identifiable objects or elements

  • elements have a type (name, or GI)

  • a textual grammar (a schema) may be defined which specifies

    • what elements exist

    • how they may be combined

  • elements also bear descriptive named attributes

  • an XML object contains a single hierarchy of elements

  • But elements may reference other elements in arbitrary ways


For example:

  • a newspaper story consists of metadata fields, followed by a headline, and a series of paragraphs, which may contain proper names or just text

  • it also has an identifier and a language

  • the metadata fields include a date, a source, and one or more keywords


Like this

story

  • metadata fields: The Guardian, July 1, 1997, Empire, Hong Kong

  • headline: A last hurrah and an empire closes down

  • paragraph: With a clenched-jaw nod from the Prince of Wales, a last rendition of God Save the Queen, and a wind machine to keep the Union flag flying for a final 16 minutes of indoor pomp...

… like this

<story><metadata><source>The Guardian</source><date>July 1, 1997</date><keywords><term>Empire</term><term>Hong Kong</term></keywords></metadata>

<body><div><head>A last hurrah and an empire closes down</head>

<p>With a clenched-jaw nod from the <name>Prince of Wales</name>, a last rendition of <title>God Save the Queen</title>, and a wind machine to keep the Union flag flying for a final 16 minutes of indoor pomp</p>...</div></body></story>
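Once the distinctions are explicit, a program can act on them. The sketch below is one possible illustration, not the presenter's own code: it parses a copy of the story markup above (whitespace normalized, every element closed) with Python's standard library and pulls out the metadata, the headline, and the tagged names. The variable names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

STORY = """<story>
  <metadata>
    <source>The Guardian</source>
    <date>July 1, 1997</date>
    <keywords><term>Empire</term><term>Hong Kong</term></keywords>
  </metadata>
  <body><div>
    <head>A last hurrah and an empire closes down</head>
    <p>With a clenched-jaw nod from the <name>Prince of Wales</name>, a last
    rendition of <title>God Save the Queen</title>, and a wind machine to keep
    the Union flag flying for a final 16 minutes of indoor pomp</p>
  </div></body>
</story>"""

root = ET.fromstring(STORY)

# Paths are relative to the root <story> element.
source = root.findtext("metadata/source")
date = root.findtext("metadata/date")
keywords = [term.text for term in root.iter("term")]
headline = root.findtext("body/div/head")

# Proper names and titles are retrievable only because the encoder tagged them.
names = [name.text for name in root.iter("name")]
titles = [title.text for title in root.iter("title")]

print(source, date, keywords)
print(headline)
print(names, titles)
```

Nothing here relies on how the story looks on the page: the query asks for what things are (a source, a name), which is the point of descriptive markup.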


… or like this

<documentLikeObject>

<metadata> …</metadata>

<sound URI="…"/>

<image URI="…"/>

<transcription URI="…"/>

</documentLikeObject>


Encoding implies making decisions

We may wish to allow for many views of what a resource “is”

but avoid “markup voodoo”

Necessarily, there must be compromise

what is needed now

what might be needed some time


The Beowulf Manuscript

MS Cotton Vitellius A xv


Printed version (Wrenn, 1953)

Hwæt we Gar-Dena in gear-dagum

þeod-cyninga þrym gefrunon,

hu ða æþelingas ellen fremedon.

Oft Scyld Scefing sceaþena þreatum,

monegum mægþum meodo-setla ofteah;

egsode Eorle, syððan ærest wearð

feasceaft funden. He þæs frofre gebad…


One encoding…

<lg><l>Hwæt we Gar-Dena in gear-dagum</l>

<l>þeod-cyninga þrym gefrunon,</l>

<l>hu ða æþelingas ellen fremedon.</l></lg>

<lg><l>Oft Scyld Scefing sceaþena þreatum,</l>

<l>monegum mægþum meodo-setla ofteah; </l>

<l>egsode Eorle, syððan ærest wearþ</l>

<l>feasceaft funden. He þæs frofre gebad </l>

...


… another encoding

<hi rend="caps">&H;&Wyn;ÆT &Wyn;E GARDE</hi><lb/>na in gear-dagum þeod cyninga<lb/> þrym gefrunon hu ða æþelinga&s; ellen<lb/> fremedon. oft Scyld Scefing sceaþe<add>na</add><lb/>þreatum, moneg<expan>um</expan> mægþum meodo-setla <lb/>

of<damage desc="blot"/>teah egsode <sic corr="Eorle">eorl</sic> syððan ærest wearþ<lb/> feasceaft funden...


…yet another encoding

<figure>

<!-- detailed description of digital image -->

</figure>

<sourceDesc>

<!-- detailed description of original source-->

</sourceDesc>

<publicationStmt>

<!-- access control metadata -->

</publicationStmt>

<classCode>

<!-- descriptive metadata -->

</classCode>

<!-- etc -->


Where is XML used?

in well-defined application areas

b2b

news stories

chemical modelling

by well-defined user communities

EAD

electronic editors


XML: the very next thing

XML defines a simple syntax for encoding linearized hierarchic structures which is

extensible and verifiable

XML is being taken up enthusiastically as a way of

adding semantics to the web (RDF, Topic Maps)

standardizing application interfaces (SMIL, SOAP)

… even though XML is semantics-free


Reality check: what (exactly) is markup?

markup makes explicit a theory about some aspect of a document

some theories are more useful or generalizable than others

… so no markup language can reasonably claim to be exhaustive

… so are we doomed to a further confusion of tongues?


The risks of fragmentation

If we have…

historical records using a “historical markup language”

linguistic data using a “linguistic markup language”

illustrations using a “visual markup language”

How will we integrate these resources?

Why did we get into this business?


Once upon a time long ago in a far away galaxy ….


1987: Vassar College Conference

The Text Encoding Initiative


We’ve been here before…

Loomings

“CALL me Ishmael. Some years ago --- never mind how long precisely--- having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world”

Good news: there is software capable of translating amongst 400 different encoding formats

Bad news: there ARE 400 different encoding formats…

|chap1

<C 1> Loomings

\chapter

\chapter[1]{Loomings}

:h1.1. Loomings

MOBY001001LOOMINGS

|C1

.chapter Loomings

.cp;.sp 6 a;.ce .bd 1. Loomings

~x



Information Interchange (1)

[Diagram: five formats A, B, C, D, E, each translated directly to and from every other]

20 translations required (n² − n, for n = 5)


Information Interchange (2)

[Diagram: five formats A, B, C, D, E, each linked only to a Common Interchange Standard]

10 translations required (2n, for n = 5)
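The arithmetic behind the two interchange diagrams can be checked with a short sketch. The function names are illustrative assumptions; the formulas are the ones given on the slides.

```python
def pairwise_translators(n: int) -> int:
    """Directed converters needed when each of n formats maps to every other."""
    return n * n - n


def hub_translators(n: int) -> int:
    """Converters needed with one interchange format: one in and one out per format."""
    return 2 * n


for n in (5, 10, 400):
    print(n, pairwise_translators(n), hub_translators(n))
```

For the five formats in the diagrams this gives 20 against 10. For the 400 formats mentioned earlier it gives 159,600 against 800, which is why a common interchange standard pays off as the number of formats grows.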


The T E what?

Originally, a research project within the humanities

Sponsored by ALLC, ACH, ACL

Funded 1990-1994 by US NEH, EU LE Programme et al

Major influences

digital libraries and text collections

language corpora

scholarly datasets

Now an international membership consortium incorporated Jan 2001

http://www.tei-c.org


Goals of the TEI

interchange and integration of scholarly data

support for all texts, in all languages, from all periods

guidance for the perplexed: what to encode

hence, a user-driven codification of existing best practice

assistance for the specialist: how to encode

hence, a loose framework into which unpredictable extensions can be fitted


Legacy of the TEI

The TEI Guidelines: a comprehensive way of looking at what texts are and how to organize them

Expressed as a very large set of c. 600 element definitions, tied into a rather loose DTD

A mechanism for customization and specialization of the above

Tutorials, Guides, codification of shared practice etc.


Who uses TEI?

digital libraries and text collections

HTI, UVA, OTA, BiMiCesa, CRILet ...

linguistic corpora

EAGLES, BNC, MULTEXT, Silfide …

research projects

Women Writers Project, Model Editions Partnership, Lorelei Projekt, …

publishers – both web and otherwise

NLR, OUCS, …

http://www.tei-c.org/Applications/


Current TEI activity (1)

Annual Members Meetings (since Nov 2001)

Annually elected TEI Technical Council (since January 2002)

XML revision (P4X) published in print, June 2002

Project on SGML-XML conversion (completed 2003)

Next major revision (TEI P5) due mid 2004

Special Interest Groups set up end 2003

http://www.tei-c.org/Services/order/


TEI P5

New work groups on

character set issues: convergence with Unicode

manuscript description

hyperlinking/W3C standards

Work in progress

SGML/XML conversion

Software usability and tools

Training

Funding problems and opportunities


The scope of “intelligent” markup

orthographic transcription

links to digital recordings, images…

proper nouns, dates, times, etc.

part-of-speech and morphological tagging

syntactic analysis

discourse analysis

cross references to other material on the topic

meta-textual status (correction etc)

editorial commentary and annotation

etc., etc., etc.

How can all these things co-exist?


Frequently Answered Questions

re-use of common text for multiple purposes

scholarly edition, school edition, speaking edition

alignment of transcription with

sound

image

multiple annotations of a common text

additive

alternative

authoring!


Fortunately, the TEI was designed for scholarly use

all texts are alike -- but every text is different

multiple perspectives are the norm

not “one size fits all” but “who would you like to be today?”

one construct, many views

each view a selection from the whole
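"One construct, many views" can be made concrete with the `<sic corr="…">` convention used in the Beowulf transcription earlier. The sketch below is an assumption-laden illustration: the `render` helper is hypothetical, and it handles only the single case of choosing between the source reading and the editor's correction.

```python
import xml.etree.ElementTree as ET

# One encoded line carrying both the manuscript reading and the correction.
LINE = '<l>egsode <sic corr="Eorle">eorl</sic> syððan ærest wearþ</l>'


def render(elem: ET.Element, corrected: bool) -> str:
    """Flatten an element to plain text, choosing one reading for each <sic>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "sic" and corrected and "corr" in child.attrib:
            # Edited view: substitute the editor's correction.
            parts.append(child.attrib["corr"])
        else:
            # Diplomatic view (or any other element): keep the encoded content.
            parts.append(render(child, corrected))
        parts.append(child.tail or "")
    return "".join(parts)


line = ET.fromstring(LINE)
diplomatic = render(line, corrected=False)
edited = render(line, corrected=True)
print(diplomatic)
print(edited)
```

Both views are selections from the same encoded text, so neither the scholarly nor the reading edition needs its own copy of the transcription.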


The TEI solution: modularization

a (very) large number of element and attribute definitions

organized as tagsets aka modules (core, base, additional, or auxiliary)

grouped into classes

combined according to a defined procedure (the pizza model)

which permits controlled extension and modification

http://www.tei-c.org/pizza.html


What use is a DTD?

A DTD is very useful at data preparation time (e.g. to enforce consistency), but redundant at other times

If a document is well-formed, its DTD can be (almost) entirely recreated from it.

DTDs don't allow you to specify much in the way of content validation

Unlike other parts of the XML family, DTDs are not expressed in XML

The XML Schema Language addresses these issues, and may eventually replace the DTD entirely... maybe.
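The claim that a DTD can be (almost) recreated from a well-formed document can be illustrated with a rough sketch. It records, for each element type, which child elements and attributes were actually observed; the function name, the sample document, and its `id` attribute are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def infer_declarations(root: ET.Element):
    """Collect, per element type, the child element types and attribute names observed."""
    children = defaultdict(set)
    attributes = defaultdict(set)
    for elem in root.iter():
        children[elem.tag].update(child.tag for child in elem)
        attributes[elem.tag].update(elem.attrib)
    return children, attributes


doc = ET.fromstring(
    '<story id="s1"><metadata><source>The Guardian</source></metadata>'
    "<body><p>Text with a <name>proper name</name>.</p><p>More text.</p></body></story>"
)
children, attributes = infer_declarations(doc)
for tag in sorted(children):
    print(tag, sorted(children[tag]), sorted(attributes[tag]))
```

The "almost" matters: a sketch like this sees only what one document happens to contain, so it cannot tell optional from required content, and it ignores ordering and repetition, which a real DTD would constrain.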


DTD : what does it really mean?

To get the best out of XML, you need two kinds of DTD:

document type declaration: elements, attributes, entities, notations (syntactic constraints)

document type definition: usage and meaning constraints on the foregoing

Published specifications (if you can find them) for XML DTDs usually combine the two, hence they lack modularity

The TEI model is to provide definitions which can be fitted to multiple declarations


TEI as an interlingua

TEI defines generic classes of textual object

<div>, <ab>, <seg> rather than chapter, paragraph, metaphor

Modification allows these to be more tightly constrained without loss of generality

<metaphor TEIform="seg">fresh ideas</metaphor>

Cf architectural forms
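One way to picture the interlingua idea is as a mapping from project-specific element names back to the generic forms they specialize. The following is a hedged sketch, not the TEI's own mechanism: `to_generic` is a hypothetical helper, and recording the specialized name in a `type` attribute is an assumption made for the example.

```python
import xml.etree.ElementTree as ET


def to_generic(elem: ET.Element) -> ET.Element:
    """Rename, in place, every element carrying a TEIform attribute to that generic name."""
    for node in elem.iter():
        generic = node.attrib.pop("TEIform", None)
        if generic is not None:
            # Keep the specialized name so no information is lost
            # (using "type" for this is an assumption of the sketch).
            node.set("type", node.tag)
            node.tag = generic
    return elem


p = ET.fromstring('<p>full of <metaphor TEIform="seg">fresh ideas</metaphor></p>')
print(ET.tostring(to_generic(p), encoding="unicode"))
```

Software that knows only the generic classes can then process a customized document, while the project keeps its more tightly constrained vocabulary.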


SGML, XML, and …

The TEI originally used SGML

for pragmatic reasons

existing standard, widely used

for theoretical reasons

declarative, verifiable

expressive power adequate to needs of research

It is now re-expressed in XML…


… after XML?

In fact, the TEI expresses an abstract model, which can be represented in SGML or XML

A TEI DTD can be constructed in either.

Work on generating Relax or W3C Schemas from the same source is ongoing

This will enable us to implement better TEI validation


Why bother?

The TEI is a well-known reference point

Using the TEI enables

sharing of data and resources

shared modular software development

lower learning curve and reduced training costs

The TEI is stable, rigorous, and well-documented

The TEI is also flexible, customizable, and extensible in documented ways

Its architectural approach offers a good practical compromise between generality and implementability


Transmitting the hermeneutic

scholarship depends on continuity

it is not enough to preserve the bytes of an encoding

there must also be a continuity of comprehension: the encoding must be self-descriptive


The wider picture

TEI is not just about exchanging data between machines

It's also about communication between humans

TEI/XML is not just about the web

It's about information in general

TEI is not just about technology

It's about the relationship between content creators and software developers

It’s also about scholarship

