Encoding Information for Interchange: An introduction to the TEI • Lou Burnard, Humanities Computing Unit, Oxford University
The problem • SGML/XML markup is powerful, flexible, and can be customised to meet most (all?) needs • But to use it, you need a formal specification (aka document type definition or DTD) • Where do you get one from? • How do you choose?
Some answers • Roll your own • from scratch • within an existing framework • Take what’s on offer • Use the TEI architecture
The Text Encoding Initiative • Origins and Goals • Modular Architecture • Customization
Where did the TEI come from? • From the humanities research community • librarians and cybernauts • linguists, historians, lexicographers... • Sponsors • ACH Association for Computers and the Humanities • ACL Association for Computational Linguistics • ALLC Association for Literary and Linguistic Computing • Funders • U.S. National Endowment for the Humanities • Mellon Foundation • Commission of the European Communities DG XIII • Social Sciences and Humanities Research Council of Canada
… and where is it going? • Continued work in new application areas • manuscript description • physical description • non-SGML data • XML conformance • Continued take-up • Need for new infrastructure • Corrected reprint of P3 due summer 1998
Goals of the TEI • better interchange and integration of data • support for all texts, in all languages, from all periods • guidance for the perplexed: what to encode • assistance for the specialist: how to encode any information of interest • a user-driven codification of existing best practice
TEI Deliverables • coherent set of recommendations for text encoding • comprising several distinct SGML tagsets • based on existing practice • documented in a reference manual • tutorials for general and specialised audiences ... but no software
The TEI modus operandi... • identify significant particularities independent of notation or realisation • avoid controversy, over-delicacy, inadequacy • seek generalizable solutions, acceptable to a consensus
... and some consequences • focus on content, not presentation • descriptive, not prescriptive • Occam's razor • modular, extensible dtd • highly general in application, needs customization for particular areas
Who uses TEI? • see http://www-tei.uic/orgs/tei/app/ • digital librarians and archivists • LC, HTI, UVA, CETH, OTA... • Language Engineering projects • EAGLES, BNC, MULTEXT, Parole, Silfide • academic researchers • Women Writers Project, Project Orlando, Model Editions Partnership, Canterbury Tales Project, Bodleian Library, and many more...
Designing your DTD • How can a single mark-up scheme handle a large variety of requirements? • all texts are alike • every text is different • Learn from the database designers • one construct, many views • each view a selection from the whole
How many dtds might you need? • one (the Corporate or WKWBFY approach) • none (the Anarchic or NWEUMP approach) • as many as it takes (the Mixed Economy or WNSA approach) or is there a better way?
The TEI solution: modularization • a (very) large number of element and attribute definitions • organised as tagsets (core, base, additional, or auxiliary) • grouped into classes • a single main DTD with many faces (a British DTD)
Combining Tag Sets • And how does one combine tagsets? The how-many-dtds problem is back. • all tag sets, all the time (the table d'hôte model) • a few pre-selected combinations (the combination plate model) • in completely unconstrained abandon (the smorgasbord model) • one from column A, two from column B (the Chinese menu model)
The Chicago Pizza Model
    <!ENTITY % base    "(deepDish | thinCrust | stuffed)">
    <!ENTITY % topping "(pepperoni | mushrooms | sausage | pepper | anchovies | ...)">
    <!ELEMENT pizza - - (%base;, (tomatoSauce & cheese), (%topping;)*)>
To build a view of the TEI dtd, take... • the core tagsets • the base of your choice • the toppings of your choice
    <!DOCTYPE TEI.2 SYSTEM 'tei2.dtd' [
      <!ENTITY % TEI.prose 'INCLUDE'>
      <!ENTITY % TEI.analysis 'INCLUDE'>
    ]>
    <TEI.2>.....</TEI.2>
… trim to fit ... • user extension files • rename elements • undefine elements to be redefined* or removed
    <!ENTITY % TEI.extensions.ent SYSTEM 'myMods.ent'>
    <!ENTITY % n.p 'para'>
    <!ENTITY % seg 'IGNORE'>
* see later
… and cook thoroughly • ‘compile’ the dtd to remove all parameterization • easier to use for some software • better project management • see http://firth.natcorp.ox.ac.uk/~tei/pizza.html • don’t forget the documentation!
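As a rough illustration of what 'compiling' means here, the fragment below sets a parameterized declaration beside the flat declaration a compiled dtd might contain in its place. The entity values are simplified and hypothetical, chosen for the sketch rather than copied from the TEI dtd.

```sgml
<!-- Before compilation: names and content reached through parameter
     entities (simplified, hypothetical values) -->
<!ENTITY % n.div          "div">
<!ENTITY % n.head         "head">
<!ENTITY % m.component    "p | list | note">
<!ENTITY % component.seq  "(%m.component;)+">
<!ELEMENT %n.div; - - ((%n.head;)?, (%component.seq;), (%n.div;)*)>

<!-- After compilation: every reference expanded, no entities left -->
<!ELEMENT div - - ((head)?, ((p | list | note)+), (div)*)>
```

The compiled form is easier for software (and people) to read, but it can no longer be customized by redeclaring entities, which is one reason to keep the parameterized source and its documentation alongside it.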
TEI base tagsets • one only must be selected • defines basic structural components • currently defined: • prose, verse, drama • transcribed speech • dictionaries • terminological databases • mixtures of bases require special treatment
TEI additional tagsets • sets of elements for specialised application areas • can be mixed and matched ad lib • currently provided: • linking and alignment; analysis; feature structures; certainty; physical transcription; textual criticism, names and dates; graphs and trees; figures and tables; language corpora....
How does this work? • Main dtd consists of marked sections, each (potentially) containing one tagset • By default, all tagsets are IGNOREd; declaring a tagset's entity as INCLUDE in the document's DTD subset switches it on
    <![ %TEI.tagset; [
      <!-- declarations for tagset here -->
    ]]>
    <!ENTITY % TEI.tagset "INCLUDE">
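The switch relies on two SGML rules: when an entity is declared more than once, the first declaration is the one that counts, and the DTD subset inside the document is read before the external dtd file. A minimal sketch, assuming a hypothetical tagset entity TEI.example and a hypothetical element exampleElem (neither is a real TEI name):

```sgml
<!-- 1. The document's DTD subset, read first -->
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!ENTITY % TEI.example "INCLUDE">   <!-- first declaration: this one wins -->
]>

<!-- 2. Hypothetical excerpt from the main dtd file, read afterwards -->
<!ENTITY % TEI.example "IGNORE">      <!-- default; has no effect if already declared -->
<![ %TEI.example; [
  <!ELEMENT exampleElem - - (#PCDATA)>
]]>
```

If the document's subset says nothing about TEI.example, the default applies and the marked section is skipped.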
How does this work? (contd) • Tagsets contain element and attlist declarations, each also enclosed by a marked section • By default all elements are INCLUDEd; declaring an element's entity as IGNORE removes it
    <![ %element; [
      <!ELEMENT %n.element; - - (#PCDATA)>
      <!ATTLIST %n.element; %a.global;>
    ]]>
    <!ENTITY % element "IGNORE">
How does this work? (contd) • Element names (GIs) are always referred to indirectly, so that they may be renamed
    <!ELEMENT %n.elem1; - - (%n.elem2;)+>
    <!ENTITY % n.elem1 "elem1">
    <!ENTITY % n.elem2 "foo">
Element Classes • Model classes • elements which share syntactic properties (i.e. occur in same position) • Attribute classes • elements which share attributes • Class membership can be inherited • Another way of doing architectural forms
Some TEI model classes • divn: structural elements like divisions <div>, <div1>, <div2>, <lg>, <lg1>... • divtop: elements which can appear at the start of a divn element <head>, <epigraph>, <byLine>... • chunk: paragraph-like elements <sp>, <p>, <lg>, <l>… • phrase: elements which appear within chunks <hi>, <foreign>, <date>, <q> ...
Some TEI semantic classes • data: phrases likely to be normalised or processed non textually <date>, <time>, <name>... • biblpart: specialised components of bibliographic descriptions <author>, <title>, <editor>... • demographic: descriptive features of participants in a language interaction <birth>, <socEcstat>, <occupation>...
Some TEI attribute classes • global: attributes which are available to every element n, lang, id, TEIform • linking: attributes for elements which have linking semantics targType, targOrder, evaluate
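An attribute class can be pictured as a parameter entity holding a block of attribute definitions, referenced from the attlist of every member element. The sketch below is a simplified assumption: it covers three of the global attributes named above, and the declared values, the defaults, and the extra place attribute are illustrative rather than the TEI definitions.

```sgml
<!-- Hypothetical, simplified attribute class: declared values and
     defaults are assumptions for this sketch -->
<!ENTITY % a.global '
      id        ID       #IMPLIED
      n         CDATA    #IMPLIED
      lang      CDATA    #IMPLIED'>

<!-- Each member element picks up the same attributes by reference -->
<!ATTLIST p     %a.global;>

<!-- A member may add attributes of its own after the class reference -->
<!ATTLIST note  %a.global;
      place     CDATA    #IMPLIED>
```

Changing the entity in one place changes the attributes of every element that references it.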
The class system in action • Simplifies documentation and understanding of the DTD • Parameterizes content models • different for different bases • Simplifies customization • class membership is unaffected • adding new elements to an existing class
Parameterized content models • “Components”, for example: • a dictionary is composed of entries • a play is composed of speeches • a novel is composed of paragraphs • in each case, the basic “text soup” (and the structural divisions) remain the same, but they are organized differently
How does this work? (contd) • the component class has different members in different bases
    <![ %TEI.prose; [
      <!ENTITY % m.component "p | list | note">
    ]]>
    <![ %TEI.dictionaries; [
      <!ENTITY % m.component "entry">
    ]]>
    <!ENTITY % component.seq "(%m.component;)+">
    <!ELEMENT div - - (head?, (%component.seq;), div*)>
Customization... • Removing an element involves • undeclaring it • (NB: ISO 8879 permits references to undefined elements -- though not all vendors know this) • Adding a new element involves • determining its class • defining it • adding it to that class
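The three steps for adding an element might look like the sketch below. The element name myTerm, its type attribute, and the choice of the phrase class are all hypothetical, picked only to illustrate the pattern; the x.phrase entity follows the extension convention used elsewhere in this deck.

```sgml
<!-- Step 1: determine the class. Here we assume the new element
     behaves like other phrase-level elements. -->

<!-- Step 2 (in the extensions dtd file): define the element -->
<!ELEMENT myTerm - - (#PCDATA)>
<!ATTLIST myTerm
      type      CDATA    #IMPLIED>

<!-- Step 3 (in the extensions entity file): add it to the class.
     The trailing "|" lets the value be prefixed to the existing
     member list. -->
<!ENTITY % x.phrase "myTerm |">
```

Because every content model that allows phrase-level elements refers to the class rather than to individual names, the new element becomes available wherever the class is used.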
Customization (contd) • Modification of an element implies removal followed by addition • Class membership should be unaffected
    <!-- in TEI.extensions.ent -->
    <!ENTITY % p "IGNORE">
    <!-- in TEI.extensions.dtd -->
    <!ELEMENT %n.p; - - (#PCDATA)>
How does this work? (contd) • Each model class is defined as a parameter entity • Reference to class members is always indirect • Membership extensible (by a kludge)
    <!ENTITY % x.class "">
    <!ENTITY % m.class "%x.class; name1 | name2 | name3 ...">
    <!ELEMENT %n.element; - - (%m.class;)+>
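To see the kludge at work, it may help to trace how a class entity expands with and without an extension. In this sketch the member list of m.phrase is shortened to three names and the added elements are hypothetical; the declarations are shown in the order they would be read.

```sgml
<!-- Extensions file, read first: hypothetical additions to the class -->
<!ENTITY % x.phrase "it | ro |">

<!-- Main dtd, read afterwards -->
<!ENTITY % x.phrase "">     <!-- default; ignored here, already declared -->
<!ENTITY % m.phrase "%x.phrase; hi | foreign | date">

<!-- m.phrase now stands for:   it | ro | hi | foreign | date
     Without the extensions file it would stand for:   hi | foreign | date -->
```

The trailing "|" in the extension value is what makes this a kludge: the x. entity must either be empty or end with a connector, or the resulting model will not parse.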
An example: the Lampeter corpus • Requirements • light presentational tagging • structural markup for access • demographic information about text production • small number of tags to ease data capture and validation • Implementation • tagsets: prose base, and tags from four additional sets • some extensions, many exclusions
The Lampeter corpus DTD subset
    <!DOCTYPE TEICORPUS.2 SYSTEM "tei2.dtd" [
      <!ENTITY % TEI.prose "INCLUDE">
      <!ENTITY % TEI.corpus "INCLUDE">
      <!ENTITY % TEI.figures "INCLUDE">
      <!ENTITY % TEI.transcr "INCLUDE">
      <!ENTITY % TEI.extensions.ent SYSTEM "lampext.ent">
      <!ENTITY % TEI.extensions.dtd SYSTEM "lampext.dtd">
      <!-- more declarations here -->
    ]>
The Lampeter corpus extensions.ent
    <!ENTITY % analytic 'IGNORE'>
    <!ENTITY % biblStruct 'IGNORE'>
    <!-- hic desunt multa -->
    <!ENTITY % supplied 'IGNORE'>
    <!ENTITY % x.phrase "it|ro|sc|su|bo|go|">
    <!ENTITY % x.biblPart "printer|pubFormat|bookSeller|">
    <!ENTITY % x.demographic "socecstatusPat|biogNote|">
    <!ENTITY % x.globincl "gap|">
The Lampeter corpus extensions.dtd
    <!ELEMENT (it|ro|sc|su|bo|go) - - (%phrase.seq;)>
    <!ELEMENT (persName|printer|pubFormat|bookSeller|biogNote|socecstatusPat) - - (%phrase.seq;)>
NB: This is a provisional version only! (no attlists, no documentation…)
Summary • Designing a successful DTD involves careful, conscious, controlled theft • Modularize the task • A class system helps identify • what is true of all documents • what is true of some documents • Modifiability can be compatible with standardization