The MLCD (Markup Languages and Complex Documents) project aims to develop document markup for complex structures such as overlapping, fragmented, and disordered elements. At its core, MLCD seeks to create a markup language that combines a simple notation with a powerful data structure and a grammar, improving on existing systems such as SGML and XML. The project addresses the current limitations of markup languages, develops prototype software, and works towards a grammar formalism for complex documents.
Markup Languages and Complex Documents (MLCD)
Universität zu Köln, 10.12.2004
Claus Huitfeldt (University of Bergen) and C.M. Sperberg-McQueen (World Wide Web Consortium)
http://teksttek.aksis.uib.no/projects/mlcd
What MLCD is about
Creating a markup
• notation
• data structure
• grammar
• semantics
for "complex" documents, i.e. documents with overlapping, fragmented or disordered elements, multiple co-existing alternative structures, etc.
The structure of this talk
• About document markup
• The success of SGML/XML
• The problems of SGML/XML
• MLCD’s aims and organization etc.
• Data structure
• Notation
• Prototype software
• Grammars
• Related work
What Markup is (or, what we mean by "markup")
• Markup is information added to the character stream of a text document, normally meta-information about the contents, intended interpretation, or processing of part of the character stream.
• Markup is
  - embedded
  - separable
Why Markup is Important
• Form of representation affects:
  - what computers can do with texts
  - the way we think about texts
  - designers of computer text systems
  - authors and readers
• Formal representation may serve as a model for theories of text, or as a tool in theorising about texts.
• Therefore: Shortcomings and problems of current markup systems are also important.
The basic elements of markup (the "Tripod"):
• Notation (linearisation)
• Data structure (graph representation)
• Constraint language (grammar)
Why SGML/XML (or is it HTML/PDF?) is such a success
Tight integration of:
• A simple notation (the angle brackets)
• A straightforward data structure with a natural interpretation (document tree and attribution of properties to elements)
• A powerful constraint language (the DTD, a context-free grammar)
Why SGML/XML is still not perfect
Problems representing:
• overlapping elements
• discontiguous elements
• disordered elements
• structural variation (alternate ordering)
• micro-level variation
• macro-level alternate ordering
• fragmentation, transposition, disorder
• coextensive elements?
• context-sensitive constraints
• attribute co-occurrence constraints
i.e. complex structures.
Are complex structures important?
Yes. An example: Overlap
• Overlap exists in real texts:
  - pages and paragraphs
  - physical and formal structure
  - details of inscriptions and other structures
  - verse lines and speeches in verse drama
  - direct discourse and verse lines or sentences
• Overlap also exists in electronic texts:
  - Overlap is explicitly allowed, and used for the encoding of overlapping features, in certain non-SGML systems such as MECS, FFF (Folio Flat File) and TACT/COCOA.
  - Such systems may also create "spurious" or "dumb" overlap.
  - And some systems contain overlap although they are not supposed to: HTML!
Doesn’t SGML/XML have solutions?
• Yes, but no: SGML/XML "solutions", such as
  - milestones
  - fragmentation
  - virtual elements
  - stand-off markup
  - CONCUR
  are artificial and cumbersome, and are supported neither by SGML/XML as such nor by existing software.
• (From now on: SGML/XML -> XML.)
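To make the "milestones" workaround concrete, here is a minimal Python sketch. The element names `text`, `p` and `pb` and the helper `text_by_page` are illustrative assumptions, not taken from the talk. A page break is recorded as an empty `pb` element inside a paragraph, so the page itself is never an element, and the application has to reconstruct page contents on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: paragraphs are real elements, pages are only
# marked by empty <pb/> milestones.
doc = ET.fromstring(
    '<text><pb n="1"/><p>First paragraph starts on page one '
    '<pb n="2"/>and ends on page two.</p><p>Second paragraph.</p></text>'
)

def text_by_page(root):
    """Collect character data per page by walking the tree in document order."""
    pages = {}
    current = None  # number of the page we are currently on

    def visit(el):
        nonlocal current
        if el.tag == "pb":
            current = el.get("n")
            pages.setdefault(current, "")
        elif el.text and current is not None:
            pages[current] += el.text
        for child in el:
            visit(child)
            # Text following a child element belongs to the current page too.
            if child.tail and current is not None:
                pages[current] += child.tail

    visit(root)
    return pages

pages = text_by_page(doc)
print(pages["1"])  # First paragraph starts on page one
print(pages["2"])  # and ends on page two.Second paragraph.
```

The XML parser knows nothing about pages here; the page structure exists only in the application code, which is one way to read "artificial and cumbersome".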
Why doesn’t XML support complex structures?
• Non-hierarchical links are possible, but
• XML element structure is supported by a context-free grammar, which requires hierarchical nesting of elements
• What we have called "complex structures" are not supported by context-free grammars
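The nesting requirement is easy to observe with any XML parser. The following Python sketch uses the standard library's `xml.etree.ElementTree`; the helper name `is_well_formed_xml` and the sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed_xml(text: str) -> bool:
    """Return True if `text` parses as XML, False on a well-formedness error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

nested = "<s><a>John <b>loves</b></a> Mary</s>"       # b lies inside a
overlapping = "<s><a>John <b>loves</a> Mary</b></s>"  # a and b overlap

print(is_well_formed_xml(nested))       # True
print(is_well_formed_xml(overlapping))  # False
```

The second string is rejected before any DTD is consulted: overlap is already excluded at the level of well-formedness.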
Departure point for MLCD: XML and MECS
• MECS:
  - Simple notation for overlapping structures
  - Generic, like XML
  - Used in markup of 20,000 pages
  - Host of software
• But:
  - No data structure (just left-to-right scan)
  - No grammar (though well-formedness and GI vocabulary constraints defined)
The idea behind MLCD
• Combine the best of both worlds (XML and MECS), i.e.
• create a markup language which
  - can handle complex structures (overlap and beyond),
  - is based on a markup tripod tightly integrated like that of XML (notation, data structure and grammar).
Today
[Diagram not reproduced; labels: XML, MECS, SGML]
Tomorrow
[Diagram not reproduced; labels: MLCD, XML, MECS, SGML]
MLCD – project organization
• Project period: 2001-07
• Project partners
  - Host: Aksis (programmer, researcher, administration)
  - UoB (Philosophy, Linguistics, Humanities informatics…)
  - GSLIS (semantics): Renear, Dubin
  - Sperberg-McQueen (W3C)
  - Others…?
• Achievements
  - Notation (TexMECS)
  - Data structure (GODDAG)
  - Experimental software (wff checker, loader/linearizer, visualizer, MOTS15, BECHAMEL)
• Plans
  - Grammar
  - Prototype software
  - Further partners…?
MLCD notation: TexMECS ("Trivially extended MECS")
• Design goals:
  - Isomorphic to XML and MECS for relevant documents
  - Every TexMECS document corresponds to a GODDAG structure, and vice versa
  - Correct GODDAG construction without application-specific knowledge
  - Simplicity of parsing, minimal number of magic characters
TexMECS elements
Only two reserved characters: < and |
  empty:         <e att="val">
  with ID:       <e@foo att="val">
  contiguous:    <e|...|e>
  interrupted:   <e|...|-e> ... <+e|...|e>
  unordered:     <|e||...||e|>
  virtual:       <^e^foo att="val">
  self-overlap:  <e~1|...<e~2|...|e~1>...|e~2>
Other TexMECS mechanisms
  Internal entities:             <é>
  Structured internal entities:  <&dot.fullstop> vs. <&dot.decimal>
  External entities:             <<url>>, e.g. <<vw117-a>> or <<http://www.w3.org/XML>>
  Comments:                      <* ... *>. Note that comments can nest.
  CDATA sections:                <#CDATA< ... >>.
TexMECS examples

<s|<a| John <b| loves |a> Mary |b>|s>

<sp who="HUGHIE"|
  <p|How did that translation go?|p>
  <lg type="haiku"|
    <l|da de dum de dum,|l>
    <l@frog|gets a new frog,|l>
    <l|...|l>|lg>
|sp>
<sp who="LOUIS"|
  <p|Er ...|p>
  <lg|
    <l@new|it's a new pond.|l>|lg>
|sp>
<sp who="DEWEY"|
  <p|Ah ...|p>
  <lg|
    <l@pond|When the old pond|l>|lg>
  <p|Right. That's it.|p>
|sp>
<lg|<^l^pond><^l^frog><^l^new>|lg>
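As a rough illustration of how little machinery the contiguous-element notation needs, the Python sketch below scans a small subset of TexMECS: only start tags of the form `<e|` and end tags of the form `|e>`, with no attributes, IDs, virtual or interrupted elements. The function names and the regular expression are assumptions made for this sketch and are not part of the MLCD software.

```python
import re

# Subset only: <name| opens an element, |name> closes it.
TAG = re.compile(r"<(\w+)\||\|(\w+)>")

def element_spans(source):
    """Return (text, spans); spans are (name, start, end) offsets into text."""
    parts, text_len, pos = [], 0, 0
    open_starts = {}  # element name -> stack of start offsets
    spans = []
    for m in TAG.finditer(source):
        chunk = source[pos:m.start()]
        parts.append(chunk)
        text_len += len(chunk)
        pos = m.end()
        if m.group(1):  # start tag
            open_starts.setdefault(m.group(1), []).append(text_len)
        else:           # end tag: must match an open element of that name
            name = m.group(2)
            if not open_starts.get(name):
                raise ValueError(f"end tag |{name}> has no matching start tag")
            spans.append((name, open_starts[name].pop(), text_len))
    parts.append(source[pos:])
    unclosed = [name for name, stack in open_starts.items() if stack]
    if unclosed:
        raise ValueError(f"unclosed elements: {unclosed}")
    return "".join(parts), spans

def overlapping_pairs(spans):
    """Pairs of elements that overlap without one containing the other."""
    pairs = []
    for i, (n1, s1, e1) in enumerate(spans):
        for n2, s2, e2 in spans[i + 1:]:
            if s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1:
                pairs.append((n1, n2))
    return pairs

text, spans = element_spans("<s|<a| John <b| loves |a> Mary |b>|s>")
print(spans)                     # [('a', 0, 13), ('b', 6, 19), ('s', 0, 19)]
print(overlapping_pairs(spans))  # [('a', 'b')]
```

Unlike an XML parser, the scanner does not require the most recently opened element to close first; an end tag only has to match some open element of the same name.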
MLCD data structure: GODDAG ("generalized ordered-descendant directed acyclic graph")
[Two diagrams, labelled "Not:" and "But:", not reproduced]
Overlap is simply multiple parentage.
GODDAG – general description
A GODDAG is a directed acyclic graph (DAG):
• Every node is either a leaf or a non-terminal.
• Each leaf is labeled with a string.
• Each non-terminal is labeled with an identifier.
• Directed arcs identify parent/child relation; paths identify ancestor/descendant relation.
• Node n is a leaf node iff n is not a parent.
• Node n is a non-terminal node iff n is not a leaf node.
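The description above maps onto a very small data structure. The Python sketch below is one possible reading, not the MLCD implementation; the class `Node` and the helpers `frontier` and `parents` are names invented for this example. It builds the graph for the "John loves Mary" example, in which the leaf " loves " has two parents.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # nodes are compared by identity, not by content
class Node:
    label: str  # a string for a leaf, an identifier for a non-terminal
    children: list = field(default_factory=list)  # ordered outgoing arcs

    @property
    def is_leaf(self):
        return not self.children  # a leaf iff not a parent

def frontier(node, seen=None):
    """Leaves dominated by `node`, each listed once, in first-visit order."""
    if seen is None:
        seen = set()
    if id(node) in seen:
        return []
    seen.add(id(node))
    if node.is_leaf:
        return [node]
    leaves = []
    for child in node.children:
        leaves.extend(frontier(child, seen))
    return leaves

def parents(root, target):
    """All nodes reachable from `root` that have `target` as a child."""
    found, stack, visited = [], [root], set()
    while stack:
        n = stack.pop()
        if id(n) in visited:
            continue
        visited.add(id(n))
        if any(c is target for c in n.children):
            found.append(n)
        stack.extend(n.children)
    return found

john, loves, mary = Node(" John "), Node(" loves "), Node(" Mary ")
a = Node("a", [john, loves])
b = Node("b", [loves, mary])
s = Node("s", [a, b])

print([leaf.label for leaf in frontier(s)])         # [' John ', ' loves ', ' Mary ']
print(sorted(p.label for p in parents(s, loves)))   # ['a', 'b']
```

Because the shared leaf is a single object referenced from both `a` and `b`, overlap needs no special representation beyond multiple parentage.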
Restricted GODDAGs
• Leaf nodes are ordered.
• Each non-terminal dominates a contiguous subsequence of leaves.
• No two nodes dominate the same subsequence of the frontier.
Unrestricted GODDAGs
• For each node n, arcs (n → x) are ordered.
• Leaves need not have any ordering; no contiguity rule for non-terminals.
• Two non-terminals may dominate same set of leaves.
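The two conditions on restricted GODDAGs can be checked mechanically. The following Python sketch assumes a simplified input, a dictionary from non-terminal name to the set of leaf positions it dominates, and applies the uniqueness rule to non-terminals only; both the representation and the function names are assumptions made for illustration.

```python
def is_contiguous(positions):
    """True if the non-empty set of integers has no gaps."""
    return bool(positions) and max(positions) - min(positions) + 1 == len(positions)

def check_restricted(dominates):
    """Return a list of violations of the restricted-GODDAG conditions.

    `dominates` maps each non-terminal name to the set of leaf positions
    (integers in frontier order) that it dominates.
    """
    problems = []
    first_owner = {}  # frozenset of positions -> first non-terminal seen with it
    for name, leaves in dominates.items():
        if not is_contiguous(leaves):
            problems.append(f"{name}: leaves not contiguous")
        key = frozenset(leaves)
        if key in first_owner:
            problems.append(f"{name} and {first_owner[key]} dominate the same leaves")
        else:
            first_owner[key] = name
    return problems

# "John loves Mary": leaves 0, 1, 2; a and b overlap on leaf 1.
print(check_restricted({"s": {0, 1, 2}, "a": {0, 1}, "b": {1, 2}}))  # []
print(check_restricted({"s": {0, 1, 2}, "q": {0, 2}}))  # ['q: leaves not contiguous']
```

Overlap as such passes the check; only discontiguous or coextensive elements push a document into the unrestricted case.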
Features of GODDAGs
Just like a tree:
• simple inheritance
• overriding
• additive meaning
• positional meaning
GODDAG and spurious overlap
  <a|…<b|…|a>…|b>
  <a|<b|…|a>…|b>
  <a|…<b||a>…|b>
  <a|…<b|…|a>|b>
However, GODDAGs can be "cleaned" by removing spurious overlap.
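One way to read the four patterns above is that an overlap of a and b is genuine only when all three regions (a alone, a and b together, b alone) contain some content, and spurious when one of them is empty. Under that assumption, which is an interpretation of the slide and not a definition taken from it, the distinction can be sketched in Python on character offsets; the function name `classify_overlap` is hypothetical.

```python
def classify_overlap(a_start, b_start, a_end, b_end):
    """Classify tags occurring in the order <a| <b| |a> |b> by their offsets.

    Offsets count characters of content before each tag, so equal offsets
    mean that nothing stands between the two tags.
    """
    if not (a_start <= b_start <= a_end <= b_end):
        raise ValueError("offsets must follow the tag order <a| <b| |a> |b>")
    if a_start < b_start < a_end < b_end:
        return "genuine"
    return "spurious"

print(classify_overlap(0, 3, 6, 9))  # genuine   <a|…<b|…|a>…|b>
print(classify_overlap(0, 0, 6, 9))  # spurious  <a|<b|…|a>…|b>
print(classify_overlap(0, 3, 3, 9))  # spurious  <a|…<b||a>…|b>
print(classify_overlap(0, 3, 6, 6))  # spurious  <a|…<b|…|a>|b>
```

In each spurious case the same content could be expressed with nested or adjacent elements, which is presumably what "cleaning" a GODDAG amounts to.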
Software
• Well-formedness checker
• Loader/linearizer for XML, MECS, TexMECS
• Visualizer
• Retrieval / concordance program
Demo:
• http://teksttek.aksis.uib.no/projects/mlcd
• BECHAMEL
MLCD’s open slot: A Grammar
• Without a formal grammar, no constraint language.
• Without a constraint language, no proper markup system.
• Validation requirements
  - allow for overlap
  - allow validation of virtual elements
  - partial validation?
  - modular specification, operations on grammar fragments (union, intersection, difference)
The Chomsky hierarchy
• regular grammars (regular expressions)
• context-free grammars (BNF, ...)
• context-sensitive (monotonic) grammars
• unrestricted phrase-structure grammars
What we want is either a little more powerful than context-free grammars, or else a little weaker.
Formalisms to consider
• attribute/affix grammars
  - attribute grammars (Knuth 1968)
  - affix grammars (descended from Van Wijngaarden two-level grammars used in Algol 68 Report)
  - extend context-free grammars in limited (tractable) ways by passing parameters on non-terminals
• tree automata
• parallel parsing (intersection of multiple grammars), cf. CONCUR
• exotica (GPSG slash formalism? graph grammars? constraint grammars?)
• standard context-free grammars plus ad hoc rules?
Related work
• Text Encoding Initiative SIG on overlapping markup
• The ARCHway Project (Kentucky)
• OSIS (Steve DeRose)
• JITTs (Patrick Durusau and Brook O’Donnell)
• LMNL project (Wendell Piez and Jeni Tennison)
• Bielefeld Text Technology group?
• Others ???
Thank you
http://teksttek.aksis.uib.no/projects/mlcd