Native XML Databases for Information Systems

Chris Wallace
XQuery workshop, April 2006

Exploring the design space
• Native XML database (NXD)
  • Storing, querying and updating XML documents without mapping into relations
  • Schema-free
  • Trees are to NXD what tables are to RDBMS
  • Tables are trees
• Information Systems
  • Focus on semi-structured data (mixture of simple data items, text and complex nested structures)
  • Searching, derived data, visualisation
  • Process support
  • Large problem space variously supported by spreadsheets, Word documents, ad-hoc databases and, increasingly, web-integrated data
• “design as a conversation with the materials in the situation” (Schön)
Chris Wallace, UWE, Bristol

Solution: eXist Native XML Database
• eXist
  • Open source, written in Java
  • European team of developers led by Wolfgang Meier
  • Under development for several years, mature except for documentation
• Supports
  • XQuery
  • XUpdate
  • XSLT
  • Free-text searching
  • XQuery extensions to allow complete applications to be developed
• Documents (files) are organised in collections (folders) in a file store
  • XML documents are stored in an efficient B+ tree structure with indexes
  • Non-XML resources (XQuery, CSS, JPEG, etc.) can be stored as binary
• Deployable in different ways
  • Embedded in a Java application
  • Part of a Cocoon pipeline
  • As a web application in Apache/Tomcat
  • With an embedded Jetty HTTP server
• Multiple interfaces
  • REST (to a Java servlet)
  • SOAP
  • XML-RPC

Sample Implementations
• Family photos and history
  • Integration of meta-data on family photos with family history (births, deaths and marriages) and Google Earth
• FOLD
  • Modules, programmes, scheme operations, staff, organisational structures, events
• Other demos on the eXist demo site

FOLD: Faculty OnLine Data
• Operations at student level (2000 in CEMS) supported by central systems (student records, finance)
• FOLD scope: teaching and assessment management and organisational knowledge
  • Modules [450] and their specification
  • Programmes (Courses) [100] and their structures
  • Operations: runs, coursework, exams
  • Staff (300+)
  • Organisational structure (100)
  • Events
• Information currently distributed over Word documents, spreadsheets, Access databases, SQL database, flat text files, LDAP
• Aims
  • To support distributed data ownership
  • To provide a web of data within and between systems
  • To support organisational processes
  • To improve data veracity

FOLD Entity Types

FOLD current stats
• Code
  • XQuery: 3000
  • XSLT: 3000
  • XSD: 300 (one schema)
  • CSS: 200
  • PHP: 10 (vcal)
• Pages
  • About 25 user
  • Only 1 admin as yet
• Information System development
  • CW (4 months)
  • Placement student (8 months)
• Phase allocation
  • Project (20%)
  • Code (20%)
  • Data: gathering, conversion, cleaning (60%)

The FOLD

Areas for attention
• Conceptual modelling
  • Identifiers
  • Relationships and links
  • Versioning
• Logical modelling (in XML)
  • Element/attribute
  • Views
  • Validation
• Physical layer (in NXD)
  • Structuring documents and collections
  • Mapping to editors
  • Responsibilities
• Programming
  • Functional allocation between tiers
  • Views and constructed elements
  • Integrity
  • XQuery programming
• User interface
  • Editing
  • Long transactions
• Development process
  • CASE tool requirements
• Scope of application of NXD

Conceptual Modelling
• Conventional normalised data model
  • EAR ++
    • Entity (not XML entities like &amp;)
    • Attribute (multi-valued)
    • Relationships
      • Association
      • Composition
• Object orientation?
  • Methods are mainly getters (of derived values)
  • Inheritance only useful in the schema domain
  • Instance inheritance more useful in IS
• Expressivity problems
  • Identifiers
  • Order of parts
  • Verbosity
• Conceptual scope?
  • Edit trails, versioning, activity tracking
• Generality problem
  • Roles as attributes
    • <ModuleLeader>Stewart Green</ModuleLeader>
  • Roles as entities
    • <role><title>Module Leader</title><person>Stewart Green</person></role>

Identifiers
• Principle adopted: use naturally occurring identifiers wherever possible
  • Persons: “Chris Wallace”
  • Rooms: “3P14”
• Yes
  • Reduces the gap between the real-world (RW) domain and the system
  • Names in minutes of meetings and on spreadsheets are readable
• No
  • Duplicates
    • Duplicates are not tolerable in the RW either; they are resolved through RW negotiation within a RW namespace, e.g. the Faculty
    • Mergers generate duplicates
  • Aliases
  • Not all entities have unique domain identifiers
    • This gives rise to confusion in the problem domain and should be resolved there
• Po
  • All names need a namespace: “Chris Wallace” at CEMS at UWE
  • Need to replace multiple naming conventions with a single naming scheme (e.g. initials)
  • URNs and the semantic web
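Since natural identifiers can collide, one way to surface duplicates on demand is a short query over the stored names. The following is a minimal sketch only: the `Staff`, `Person`, `Name` and `Room` element names and all sample values other than “Chris Wallace” and “3P14” are assumptions made for illustration, not the FOLD schema.

```xquery
xquery version "1.0";

(: Hypothetical sample data; element names and the second and third rooms are invented. :)
let $staff :=
  <Staff>
    <Person><Name>Chris Wallace</Name><Room>3P14</Room></Person>
    <Person><Name>Stewart Green</Name><Room>3P20</Room></Person>
    <Person><Name>Chris Wallace</Name><Room>2Q11</Room></Person>
  </Staff>
(: Report every name that identifies more than one Person. :)
for $name in distinct-values($staff/Person/Name)
let $matches := $staff/Person[Name = $name]
where count($matches) > 1
return
  <Duplicate name="{$name}" count="{count($matches)}"/>
```

With this sample data the query reports a single `Duplicate` element, for “Chris Wallace”; resolving it remains a real-world negotiation, as the slide argues.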

Conceptual to Logical
• Attributes v elements
• Relationships
• Integrity
• Views

Attributes v elements
• E.g.
  • <Module code="UFIEKG-20-3" level="3">…
  • <Module><ModuleCode>UFIEKG-20-3</ModuleCode>
• What criteria to use?
  • Attributes as ‘meta’ is vague
• FOLD uses only elements
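One practical difference between the two encodings shows up when building views. This is offered as an observation, not as the criterion FOLD applied: a child element can be copied into a constructed result as it stands, whereas an attribute value has to be wrapped in a new element first. A small sketch using the two forms from the slide:

```xquery
xquery version "1.0";

let $asAttributes := <Module code="UFIEKG-20-3" level="3"/>
let $asElements := <Module><ModuleCode>UFIEKG-20-3</ModuleCode></Module>
return
  <View>
    <FromElements>{ $asElements/ModuleCode }</FromElements>
    <FromAttributes>
      <ModuleCode>{ string($asAttributes/@code) }</ModuleCode>
    </FromAttributes>
  </View>
```

Both branches end up holding the same `ModuleCode` element; the wrapper names `View`, `FromElements` and `FromAttributes` are hypothetical.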

Relationships
• Implementing relationships
  • One-Many
    • RDBMS: primary key on the One side becomes foreign key on the Many side
    • NXD: choose which side on the basis of complexity and responsibility
      • Sequence (modules in a stage)
      • Complex (pre-requisite expression)
  • Many-Many
    • RDBMS: intersection table
    • NXD: as for one-many
    • or either side as appropriate, e.g. Groups and subgroups
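To illustrate holding a relationship on one side only, the sketch below stores an ordered list of module codes inside a stage and navigates it in both directions. The document shapes (`Stages`, `Stage`, `StageNo`, `Modules`, `Title`) and the second module code are assumptions for illustration, not the FOLD schema.

```xquery
xquery version "1.0";

(: Hypothetical data: the relationship lives on the Stage side as an ordered list. :)
let $modules :=
  <Modules>
    <Module><ModuleCode>UFIEKG-20-3</ModuleCode><Title>Module A</Title></Module>
    <Module><ModuleCode>HYPO01-20-3</ModuleCode><Title>Module B</Title></Module>
  </Modules>
let $stages :=
  <Stages>
    <Stage>
      <StageNo>3</StageNo>
      <ModuleCode>HYPO01-20-3</ModuleCode>
      <ModuleCode>UFIEKG-20-3</ModuleCode>
    </Stage>
  </Stages>
return
  <Result>
    <ModulesInStage>{
      (: Forward: follow the stored codes, keeping their stored sequence. :)
      for $code in $stages/Stage[StageNo = 3]/ModuleCode
      return $modules/Module[ModuleCode = $code]/Title
    }</ModulesInStage>
    <StagesForModule>{
      (: Reverse: a query stands in for the foreign key an RDBMS would store. :)
      $stages/Stage[ModuleCode = "UFIEKG-20-3"]/StageNo
    }</StagesForModule>
  </Result>
```

Keeping the list on the stage side preserves the sequence of modules, which is the kind of consideration the slide raises under "complexity and responsibility".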

Integrity
• Structural integrity
  • Schema validation is too weak and too restrictive
  • NXD stores any well-formed XML
• Referential integrity
  • RDBMS: ‘eager’
    • Data not allowed in unless valid; updates maintain integrity
    • Integrity failures are transient; repair happens outside the database
  • NXD: ‘lazy’
    • Store the data and provide on-demand or on-trigger validation
    • Integrity failures can be persisted (XLinkit) and repair happens inside the database
• Identifier uniqueness
  • XML ids are only checked within a document
  • NXD stores all XML nodes with internal identifiers
• For Information Systems, veracity of the model is what’s important
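The 'lazy' approach can be sketched as a query that is run on demand and returns the references it could not resolve, rather than rejecting data on entry. The element names below (`Staff`, `Person`, `Name`, `ModuleRuns`, `ModuleRun`, `IntegrityReport`, `UnresolvedReference`) and the second module code are hypothetical, and the report format is an illustration, not the XLinkit format.

```xquery
xquery version "1.0";

(: Hypothetical sample data held in memory so the sketch is self-contained. :)
let $staff :=
  <Staff>
    <Person><Name>Stewart Green</Name></Person>
    <Person><Name>Chris Wallace</Name></Person>
  </Staff>
let $runs :=
  <ModuleRuns>
    <ModuleRun><ModuleCode>UFIEKG-20-3</ModuleCode><ModuleLeader>Stewart Green</ModuleLeader></ModuleRun>
    <ModuleRun><ModuleCode>HYPO01-20-3</ModuleCode><ModuleLeader>S. Green</ModuleLeader></ModuleRun>
  </ModuleRuns>
return
  <IntegrityReport>{
    (: A run fails the check when its leader matches no known staff name. :)
    for $run in $runs/ModuleRun
    where not($run/ModuleLeader = $staff/Person/Name)
    return
      <UnresolvedReference>
        { $run/ModuleCode }
        { $run/ModuleLeader }
      </UnresolvedReference>
  }</IntegrityReport>
```

Here only the run led by “S. Green” is reported. Because the result is itself XML, it could be stored alongside the data and worked through later, which is the sense in which repair stays inside the database.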

Logical to Physical layers
• What criteria to use in allocating logical units to the physical layer?
  • Documents: a physical aggregation of entity instances
  • Collections: a physical aggregation of documents
• Examples
  • Module Specification [moduleCode]
    • Module Spec is an Entity
    • Each Module Spec is a Document
  • Module Run [moduleCode/year/runNo]
    • Module Run is an Entity
    • Set of Module Runs for a Field is a Document
• Issues
  • Schemas needed per entity, not per document
  • Principle: no concepts modelled in the physical layer
  • Use the physical layer for responsibility and access rights?
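One way to keep concepts out of the physical layer is to reach entities only through accessor functions keyed on the logical identifier, so callers never depend on which document holds an instance. The sketch below assumes a hypothetical `local:moduleRun` accessor and hypothetical `ModuleRuns`, `Year` and `RunNo` names; two in-memory elements stand in for per-field documents that, in eXist, would be read from a collection.

```xquery
xquery version "1.0";

(: Hypothetical accessor: callers give the logical key [moduleCode/year/runNo]. :)
declare function local:moduleRun($docs as element(ModuleRuns)*,
                                 $moduleCode as xs:string,
                                 $year as xs:string,
                                 $runNo as xs:string) as element(ModuleRun)* {
  $docs/ModuleRun[ModuleCode = $moduleCode][Year = $year][RunNo = $runNo]
};

(: In-memory stand-ins for two per-field documents; all values are sample data. :)
let $fieldA :=
  <ModuleRuns>
    <ModuleRun><ModuleCode>UFIEKG-20-3</ModuleCode><Year>2005</Year><RunNo>1</RunNo></ModuleRun>
  </ModuleRuns>
let $fieldB :=
  <ModuleRuns>
    <ModuleRun><ModuleCode>HYPO01-20-3</ModuleCode><Year>2005</Year><RunNo>1</RunNo></ModuleRun>
  </ModuleRuns>
return local:moduleRun(($fieldA, $fieldB), "UFIEKG-20-3", "2005", "1")
```

If module runs were later regrouped into different documents, only the sequence passed to the accessor would change, not the calling code.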

Programming issues
• Tier design
• Views and constructed elements
• XQuery programming

Tier design
• Allocation of functionality to tiers
  • Initially nearly all XQuery, generating HTML
  • As work matured, code moved into function libraries and XSLT
• XQuery handles request input, sessions, selection of nodes and computation of views
• XSLT generates the interface from those views
• CSS styles the interface

Views
• Views arise from the need for de-normalisation for presentation
• Coursework Element
  • As a simple element
    • Key: moduleCode/Year/runNo/elementNo
    • Data: due date
  • As an extended de-normalised element
    • SuggestedHours (computed from Hours table)
    • Late date (computed from UWE calendar)
    • Weightings (extracted from relevant specification)
    • Module Leader (extracted from Module Run)
• Views as intermediate structures
  • From low-level functions
  • For output to XSL
• Constructed elements in XQuery are copies (the reference to the stored node is lost, so you can't update through a constructed element)
• View caching for efficiency
  • Triggers can invoke cache renewal
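The point about constructed elements being copies can be checked directly with the node identity operator `is`. This is a minimal sketch using an in-memory `ModuleRun`; the `Check`, `SameValue`, `SameNode` and `ParentOfCopy` wrapper names are hypothetical.

```xquery
xquery version "1.0";

let $run :=
  <ModuleRun>
    <ModuleCode>UFIEKG-20-3</ModuleCode>
    <ModuleLeader>Stewart Green</ModuleLeader>
  </ModuleRun>
(: Placing a node inside a constructor copies it into the new element. :)
let $view := <CourseworkElement>{ $run/ModuleLeader }</CourseworkElement>
return
  <Check>
    <SameValue>{ deep-equal($view/ModuleLeader, $run/ModuleLeader) }</SameValue>
    <SameNode>{ $view/ModuleLeader is $run/ModuleLeader }</SameNode>
    <ParentOfCopy>{ name($view/ModuleLeader/..) }</ParentOfCopy>
  </Check>
```

`SameValue` is true while `SameNode` is false, and the copy's parent is `CourseworkElement`. An update aimed at the view would therefore never reach the stored `ModuleRun`, so updates need to go back through the key to the original node.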

declare function fold:courseworkElement($moduleCode, $year, $runNo, $elementNo) {
  (: Gather the specification, the run and the coursework element (component B). :)
  let $mod := fold:moduleSpecification($moduleCode, $year),
      $run := fold:moduleRun($moduleCode, $year, $runNo),
      $elementRun := fold:elementRun($moduleCode, $year, $runNo, 'B', $elementNo),
      $elementSpec := $mod/Assessment/FirstAttempt/Components/ComponentB/Element[position() = $elementNo],
      (: Dates: the return date is 20 working days after the due date. :)
      $dueDate := $elementRun/DueDate,
      $returnDate := fold:workingDays($dueDate, 20),
      (: Weightings: element within component, then scaled to the whole module. :)
      $componentWeight := $mod/Assessment/Weighting/ComponentWeightB,
      $weightInComponent := data($elementSpec/Weight),
      $weightInModule := round($weightInComponent * $componentWeight div 100),
      (: Suggested hours: module rating over level credits, times element weight, times level hours. :)
      $load := fold:load($mod/Level),
      $hrs := round(data($mod/UWERating) div data($load/Credits) * $weightInModule div 100 * data($load/Hours))
  return
    <CourseworkElement>
      <ModuleCode>{$moduleCode}</ModuleCode>
      {$mod/Title}
      <RunNo>{$runNo}</RunNo>
      {$run/ModuleLeader}
      {$run/InternalModerator}
      {$run/ExternalExaminer}
      <Component>CW</Component>
      <ElementNo>{$elementNo}</ElementNo>
      {$elementSpec/Description}
      <SuggestedHours>{$hrs}</SuggestedHours>
      <WeightInComponent>{$weightInComponent}</WeightInComponent>
      <WeightInModule>{$weightInModule}</WeightInModule>
      <DueDate>{data($dueDate)}</DueDate>
      <ReturnDate>{data($returnDate)}</ReturnDate>
    </CourseworkElement>
};
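The helper `fold:workingDays` is called above but not shown in the slides. The following is a hypothetical stand-in, not the FOLD implementation: it assumes the argument is an `xs:date`, counts Monday to Friday only, and ignores public holidays and the UWE calendar.

```xquery
xquery version "1.0";

(: 0 = Monday ... 6 = Sunday, counting whole days from 1900-01-01 (a Monday). :)
(: Assumes dates on or after 1900-01-01. :)
declare function local:dayOfWeek($d as xs:date) as xs:integer {
  xs:integer(($d - xs:date("1900-01-01")) div xs:dayTimeDuration("P1D")) mod 7
};

(: Step forward one calendar day at a time, counting only weekdays. :)
declare function local:workingDays($start as xs:date, $n as xs:integer) as xs:date {
  if ($n le 0) then $start
  else
    let $next := $start + xs:dayTimeDuration("P1D")
    return
      if (local:dayOfWeek($next) ge 5)
      then local:workingDays($next, $n)      (: weekend: not counted :)
      else local:workingDays($next, $n - 1)  (: weekday: counted :)
};

local:workingDays(xs:date("2006-04-03"), 20)
```

Starting from Monday 3 April 2006, twenty working days lands four weeks later, on 2006-05-01. A production version would also need to consult a table of closure days.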

Integrity
• Unlike RDBMS, integrity checks are not inherent in the database
  • Structural (schema validation)
  • Referential integrity
  • Business rules
• Policies
  • Restrictive: allow in only data which has satisfied integrity constraints
    • Unitary view of data: the model must be consistent at all times
  • Permissive: allow in un-validated data, with on-demand validation and reconciliation
    • Pluralist view: the model will probably never be consistent, but we have to work with this
• On-demand validation
  • Structure via eXist validation
  • Referential (via explicit coding)
  • Extensive business rules

XQuery programming
• Functional style yields good clean code
  • But it's not OO!
  • Need to rethink some algorithms
• Strict data typing needs explicit conversion
• Schema not missed
• XPath 2.0 in XQuery versus XPath 1.0 in XSLT (Xalan) causes confusion
• Fast and responsive
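Two of these points can be seen in a few lines. Without a schema, element content is untyped, so `order by` compares it as text unless it is converted explicitly; and an algorithm that would normally use a mutable accumulator, such as running totals, has to be recast as recursion. This is a sketch with invented sample data and hypothetical names (`Hours`, `H`, `local:runningTotals`).

```xquery
xquery version "1.0";

(: Running totals without a mutable accumulator: recursion carries the sum forward. :)
declare function local:runningTotals($values as xs:decimal*,
                                     $sofar as xs:decimal) as xs:decimal* {
  if (empty($values)) then ()
  else
    let $total := $sofar + $values[1]
    return ($total, local:runningTotals(subsequence($values, 2), $total))
};

let $hours := <Hours><H>9</H><H>100</H><H>10</H></Hours>
return
  <Result>
    <SortedAsText>{ for $h in $hours/H order by $h return string($h) }</SortedAsText>
    <SortedAsNumber>{ for $h in $hours/H order by xs:decimal($h) return string($h) }</SortedAsNumber>
    <RunningTotals>{ local:runningTotals((for $h in $hours/H return xs:decimal($h)), 0) }</RunningTotals>
  </Result>
```

The text sort gives 10, 100, 9, the numeric sort gives 9, 10, 100, and the running totals are 9, 109, 119.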

User Interface
• Table-structured document editing
  • Allows maintenance using familiar spreadsheet tools (Excel 2003 + Add-in)
  • Schema is induced by Excel
  • Accommodations
    • Multi-valued fields as concatenated values
      • XPath string-join and tokenize functions
      • Embedded separator problem (a name with ‘,’ as a legitimate character)
      • Defeats conventional indexing, but eXist supports full-text indexing
    • Optional elements increase table width
    • Formatting choices not maintained (e.g. column widths, freeze-window location)
  • WebDAV to provide Web Folder access (still not functioning)
• Structured document editing
  • Allows maintenance with Word without a schema
    • With difficulty: no schema awareness
  • Use InfoPath to create a desktop form based on the schema
    • Need to redo if the schema changes
  • Document editors (Arbortext, XMetaL, ...) are expensive
• In-situ updates
  • With XQuery-generated forms and update
  • With XForms using Orbeon (open-source XForms server)
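The multi-valued field accommodation can be sketched with the standard `tokenize` and `string-join` functions: a concatenated spreadsheet cell is split into separate elements for querying and joined back for editing. The `Tutors` and `Tutor` element names are assumptions for illustration.

```xquery
xquery version "1.0";

(: Hypothetical row as it might come back from a spreadsheet-style edit. :)
let $row :=
  <Module>
    <ModuleCode>UFIEKG-20-3</ModuleCode>
    <Tutors>Stewart Green, Chris Wallace</Tutors>
  </Module>
(: Split on the separator and trim the surrounding whitespace. :)
let $names := for $t in tokenize($row/Tutors, ",") return normalize-space($t)
return
  <Module>
    { $row/ModuleCode }
    { for $n in $names return <Tutor>{ $n }</Tutor> }
    <Tutors>{ string-join($names, ", ") }</Tutors>
  </Module>
```

The embedded separator problem is visible here: a value written as “Green, Stewart” would be split into two tutors, so the separator has to be a character that cannot occur in the data.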

Development Tools
• eXist Java Client provides basic tools
  • Syntax-aware editor
  • Query execution
  • User and database management
• XMLSpy
• Any text editor
• Model-driven development
  • Conceptual model -> logical model -> physical model
  • Rose, QSEE?

Development Process
• Co-development of information system structure (code and schemas) and content (documents)
  • Support schema migration and refactoring (using XQuery/XSLT)
• Slide from prototype to production
  • Pluses and minuses of user enthusiasm
  • Go for ‘low-hanging fruit’
• Pay attention to the learning process
  • XQuery and XSLT are non-trivial languages because they are deeply unlike Java/PHP
• Project management via steering group and discussion boards, but needs a forceful lead developer
• Reflection forced by presentations and workshops
• Is Agile IS development different to Agile software development?
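Schema migration in XQuery is commonly written as a recursive copy that rewrites only the elements being changed. The sketch below is an illustration under assumptions, not FOLD code: it migrates the "role as attribute" form of `ModuleLeader` to the "role as entity" form shown in the conceptual modelling slide, and the `year` attribute is invented sample data.

```xquery
xquery version "1.0";

(: Copy a tree unchanged except for ModuleLeader, which is restructured. :)
declare function local:migrate($node as node()) as node() {
  typeswitch ($node)
    case element(ModuleLeader) return
      <role><title>Module Leader</title><person>{ string($node) }</person></role>
    case element() return
      element { node-name($node) } {
        $node/@*,
        for $child in $node/node() return local:migrate($child)
      }
    default return $node
};

local:migrate(
  <ModuleRun year="2005">
    <ModuleCode>UFIEKG-20-3</ModuleCode>
    <ModuleLeader>Stewart Green</ModuleLeader>
  </ModuleRun>
)
```

The result keeps `ModuleRun`, its attribute and `ModuleCode` as they were and replaces `ModuleLeader` with a `role` element. Running such a function over stored documents and writing the results back is one way content could follow a schema change.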

Characteristics of good fit?
• FOLD
  • Low update rate / medium access rate
  • High document complexity
  • Document-centric ownership
  • Navigational interface
  • Integration with central systems (via XML interfaces?)
