Encoding Information for Interchange An introduction to the TEI

Encoding Information for Interchange: An introduction to the TEI. Lou Burnard, Humanities Computing Unit, Oxford University

The problem • SGML/XML markup is powerful, flexible, and can be customised to meet most (all?) needs • But to use it, you need a formal specification (aka document type definition, or DTD) • Where do you get one from? • How do you choose?

Some answers • Roll your own • from scratch • within an existing framework • Take what’s on offer • Use the TEI architecture

The Text Encoding Initiative Origins and Goals Modular Architecture Customization

Where did the TEI come from? • From the humanities research community • librarians and cybernauts • linguists, historians, lexicographers... • Sponsors • ACH Association for Computers and the Humanities • ACL Association for Computational Linguistics • ALLC Association for Literary and Linguistic Computing • Funders • U.S. National Endowment for the Humanities • Mellon Foundation • Commission of the European Communities DG XIII • Social Sciences and Humanities Research Council of Canada

… and where is it going? • Continued work in new application areas • manuscript description • physical description • non-SGML data • XML conformance • Continued take-up • Need for new infrastructure • Corrected reprint of P3 due summer 1998

Goals of the TEI • better interchange and integration of data • support for all texts, in all languages, from all periods • guidance for the perplexed: what to encode • assistance for the specialist: how to encode any information of interest
a user-driven codification of existing best practice

TEI Deliverables • coherent set of recommendations for text encoding • comprising several distinct SGML tagsets • based on existing practice • documented in a reference manual • tutorials for general and specialised audiences
... but no software

The TEI modus operandi... • identify significant particularities independent of notation or realisation • avoid controversy, over-delicacy, inadequacy • seek generalizable solutions, acceptable to a consensus

... and some consequences • focus on content, not presentation • descriptive, not prescriptive • Occam's razor • modular, extensible dtd • highly general in application, needs customization for particular areas

Who uses TEI? • see http://www-tei.uic/orgs/tei/app/ • digital librarians and archivists • LC, HTI, UVA, CETH, OTA... • Language Engineering projects • EAGLES, BNC, MULTEXT, Parole, Silfide • academic researchers • Women Writers Project, Project Orlando, Model Editions Partnership, Canterbury Tales Project, Bodleian Library, and many more...

Designing your DTD • How can a single mark-up scheme handle a large variety of requirements? • all texts are alike • every text is different • Learn from the database designers • one construct, many views • each view a selection from the whole

How many DTDs might you need? • one (the Corporate or WKWBFY approach) • none (the Anarchic or NWEUMP approach) • as many as it takes (the Mixed Economy or WNSA approach)
Or is there a better way?

The TEI solution: modularization • a (very) large number of element and attribute definitions • organised as tagsets (core, base, additional, or auxiliary) • grouped into classes
a single main DTD with many faces (a British DTD)

Combining Tag Sets • And how does one combine tagsets? The how-many-dtds problem is back. • all tag sets, all the time (the table d'hôte model) • a few pre-selected combinations (the combination plate model) • in completely unconstrained abandon (the smorgasbord model) • one from column A, two from column B (the Chinese menu model)

To build a view of the TEI DTD, take... • the core tagsets • the base of your choice • the toppings of your choice
    <!DOCTYPE TEI.2 SYSTEM 'tei2.dtd' [
      <!ENTITY % TEI.prose 'INCLUDE'>
      <!ENTITY % TEI.analysis 'INCLUDE'>
    ]>
    <TEI.2>.....</TEI.2>

… trim to fit ... • user extension files • rename elements • undefine elements to be redefined* or removed
    <!ENTITY % TEI.extensions.ent SYSTEM 'myMods.ent'>
    <!ENTITY % n.p 'para'>
    <!ENTITY % seg 'IGNORE'>
* see later

… and cook thoroughly • ‘compile’ the dtd to remove all parameterization • easier to use for some software • better project management • see http://firth.natcorp.ox.ac.uk/~tei/pizza.html • don’t forget the documentation!

TEI base tagsets • one only must be selected • defines basic structural components • currently defined: • prose, verse, drama • transcribed speech • dictionaries • terminological databases • mixtures of bases require special treatment

TEI additional tagsets • sets of elements for specialised application areas • can be mixed and matched ad lib • currently provided: • linking and alignment; analysis; feature structures; certainty; physical transcription; textual criticism, names and dates; graphs and trees; figures and tables; language corpora....

How does this work? • Main DTD consists of marked sections, each (potentially) containing one tagset • By default, all tagsets are IGNOREd
    <![ %TEI.tagset; [  ]]>
    <!ENTITY % TEI.tagset "INCLUDE">
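
The slide leaves one step implicit. In SGML, when a parameter entity is declared more than once, the first declaration read is the one that takes effect, and the DOCTYPE subset of a document is read before the external DTD file. That is why an INCLUDE written in the document can override an IGNORE default in the main DTD. A minimal sketch of the mechanism follows; the tagset name `TEI.demo`, the file names, and the element declarations are hypothetical and are not taken from the real TEI files.

```sgml
<!-- File: doc.sgm (hypothetical). The DOCTYPE subset is read first. -->
<!DOCTYPE demo SYSTEM "demo.dtd" [
  <!ENTITY % TEI.demo "INCLUDE">   <!-- switch the tagset on -->
]>

<!-- File: demo.dtd (hypothetical main DTD), read second. -->
<!ELEMENT demo - - (#PCDATA)>
<!ENTITY % TEI.demo "IGNORE">      <!-- default; has no effect if already declared -->
<![ %TEI.demo; [
  <!-- declarations for this tagset are seen only when the keyword is INCLUDE -->
  <!ELEMENT note - - (#PCDATA)>
]]>
```

With the first declaration removed from the document, the default applies, the marked section is skipped, and `note` is never declared.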

How does this work? (contd) • Tagsets contain element and attlist declarations, each also enclosed by a marked section • By default all elements are INCLUDEd
    <![ %element; [
      <!ELEMENT %n.element; - - (#PCDATA)>
      <!ATTLIST %n.element; %a.global;>
    ]]>
    <!ENTITY % element "IGNORE">

How does this work? (contd) • Element names (GIs) are always referred to indirectly, so that they may be renamed
    <!ELEMENT %n.elem1; - - (%n.elem2;+)>
    <!ENTITY % n.elem1 "elem1">
    <!ENTITY % n.elem2 "foo">
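
Element-level switching and renaming can be sketched together. The fragment below reuses the `n.p` and `seg` overrides from the "trim to fit" slide. The default values and the simplified `(#PCDATA)` content models are assumptions made for illustration, not the real TEI declarations.

```sgml
<!-- User extensions, read first, so these declarations win -->
<!ENTITY % n.p "para">     <!-- rename the paragraph element -->
<!ENTITY % seg "IGNORE">   <!-- remove the seg element -->

<!-- Defaults in the tagset (assumed values), read afterwards -->
<!ENTITY % n.p   "p">
<!ENTITY % n.seg "seg">
<!ENTITY % p     "INCLUDE">
<!ENTITY % seg   "INCLUDE">

<![ %p; [
  <!ELEMENT %n.p; - - (#PCDATA)>
]]>
<![ %seg; [
  <!ELEMENT %n.seg; - - (#PCDATA)>
]]>

<!-- Effective result: <!ELEMENT para - - (#PCDATA)> is declared,
     and no declaration for seg is seen at all. -->
```

Because every content model refers to `%n.p;` rather than to a literal `p`, the new name propagates everywhere the element is mentioned.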

Element Classes • Model classes • elements which share syntactic properties (i.e. occur in same position) • Attribute classes • elements which share attributes • Class membership can be inherited • Another way of doing architectural forms

Some TEI model classes • divn: structural elements like divisions <div>, <div1>, <div2>, <lg>, <lg1>... • divtop: elements which can appear at the start of a divn element <head>, <epigraph>, <byLine>... • chunk: paragraph-like elements <sp>, <p>, <lg>, <l>… • phrase: elements which appear within chunks <hi>, <foreign>, <date>, <q> ...

Some TEI semantic classes • data: phrases likely to be normalised or processed non textually <date>, <time>, <name>... • biblpart: specialised components of bibliographic descriptions <author>, <title>, <editor>... • demographic: descriptive features of participants in a language interaction <birth>, <socEcstat>, <occupation>...

Some TEI attribute classes • global: attributes which are available to every element n, lang, id, TEIform • linking: attributes for elements which have linking semantics targType, targOrder, evaluate

The class system in action • Simplifying documentation and understanding of the DTD • Parameterizing content models • different for different bases • Simplifies customization • class membership is unaffected • adding new elements to an existing class

Parameterized content models • “Components”, for example: • a dictionary is composed of entries • a play is composed of speeches • a novel is composed of paragraphs • in each case, the basic “text soup” (and the structural divisions) remain the same, but they are organized differently

How does this work? (contd) • the component class has different members in different bases
    <![ %TEI.prose; [ <!ENTITY % m.component "p|list|note"> ]]>
    <![ %TEI.dictionaries; [ <!ENTITY % m.component "entry"> ]]>
    <!ENTITY % component.seq "(%m.component;)+">
    <!ELEMENT div - - (head?, (%component.seq;), div*)>
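
Worked through by hand, the declarations on this slide expand as follows. This is only a sketch of the expansion, using the slide's simplified member lists rather than the full TEI classes.

```sgml
<!-- Prose base selected: TEI.prose is INCLUDE, TEI.dictionaries is IGNORE -->
<!ENTITY % m.component "p|list|note">
<!ENTITY % component.seq "(%m.component;)+">   <!-- value becomes "(p|list|note)+" -->
<!ELEMENT div - - (head?, (%component.seq;), div*)>
<!-- effective content model: (head?, ((p|list|note)+), div*) -->

<!-- Dictionary base selected instead: m.component is "entry", so the very
     same ELEMENT declaration yields: (head?, ((entry)+), div*) -->
```

Note that `m.component` has to be declared before `component.seq`, because a parameter entity reference inside an entity value is resolved when that value is declared.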

Customization... • Removing an element involves • undeclaring it • (NB: ISO 8879 permits references to undefined elements -- though not all vendors know this) • Adding a new element involves • determining its class • defining it • adding it to that class

Customization (contd) • Modification of an element implies removal followed by addition • Class membership should be unaffected
    <!ENTITY % p "IGNORE">
    <!ELEMENT %n.p; - - (#PCDATA)>

How does this work? (contd) • Each model class is defined as a parameter entity • Reference to class members is always indirect • Membership extensible (by a kludge)
    <!ENTITY % x.class "">
    <!ENTITY % m.class "%x.class; name1 | name2 | name3 ...">
    <!ELEMENT %n.element; - - (%m.class;)+>
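
The kludge is presumably the trailing "|" that a user-supplied extension value has to carry, as the Lampeter extensions file later in these slides shows. A sketch of the expansion follows, assuming the class entity is built from the `x.` entity followed by the built-in members. The member list is shortened for illustration and is not the full TEI `phrase` class.

```sgml
<!-- In the extensions file, read first (value from the Lampeter example;
     note the trailing "|") -->
<!ENTITY % x.phrase "it|ro|sc|su|bo|go|">

<!-- In the main DTD (member list simplified) -->
<!ENTITY % x.phrase "">   <!-- default: no extra members -->
<!ENTITY % m.phrase "%x.phrase; hi | foreign | date | q">

<!-- With the extension:     it|ro|sc|su|bo|go| hi | foreign | date | q -->
<!-- Without the extension:   hi | foreign | date | q                   -->
```

Adding the names to the class only makes them legal wherever the class is referenced. The new elements themselves still need their own ELEMENT declarations, which belong in the file named by `TEI.extensions.dtd`.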

An example: the Lampeter corpus • Requirements • light presentational tagging • structural markup for access • demographic information about text production • small number of tags to ease data capture and validation • Implementation • tagsets: prose base, and tags from four additional sets • some extensions, many exclusions

The Lampeter corpus DTD subset
    <!DOCTYPE TEICORPUS.2 SYSTEM "tei2.dtd" [
      <!ENTITY % TEI.prose "INCLUDE">
      <!ENTITY % TEI.corpus "INCLUDE">
      <!ENTITY % TEI.figures "INCLUDE">
      <!ENTITY % TEI.transcr "INCLUDE">
      <!ENTITY % TEI.extensions.ent SYSTEM "lampext.ent">
      <!ENTITY % TEI.extensions.dtd SYSTEM "lampext.dtd">
    ]>

The Lampeter corpus extensions.ent
    <!ENTITY % analytic 'IGNORE'>
    <!ENTITY % biblStruct 'IGNORE'>
    <!ENTITY % supplied 'IGNORE'>
    <!ENTITY % x.phrase "it|ro|sc|su|bo|go|">
    <!ENTITY % x.biblPart "printer|pubFormat|bookSeller|">
    <!ENTITY % x.demographic "socecstatusPat|biogNote|">
    <!ENTITY % x.globincl "gap|">

Summary • Designing a successful DTD involves careful, conscious, controlled theft • Modularize the task • A class system helps identify • what is true of all documents • what is true of some documents • Modifiability can be compatible with standardization
