Native XML Databases for Information Systems

  1. Native XML Databases for Information Systems
     Chris Wallace
     XQuery workshop, April 2006

  2. Exploring the design space
     • Native XML database (NXD)
       • Storing, querying and updating XML documents without mapping into relations
       • Schema-free
       • Trees are to NXD what tables are to RDBMS
       • Tables are trees
     • Information Systems
       • Focus on semi-structured data (mixture of simple data items, text and complex nested structures)
       • Searching, derived data, visualisation
       • Process support
       • Large problem space variously supported by spreadsheets, Word documents, ad hoc databases, increasingly web-integrated data
     • “design as a conversation with the materials in the situation” (Schön)
     Chris Wallace, UWE, Bristol

  3. Solution: eXist Native XML Database
     • eXist
       • Open source Java
       • European team of developers led by Wolfgang Meier
       • Under development for several years, mature except for documentation
     • Supports
       • XQuery
       • XUpdate
       • XSLT
       • Free-text searching
       • XQuery extensions to allow complete applications to be developed
     • Documents (files) are organised in collections (folders) in a file store
     • XML documents stored in an efficient B+ tree structure with indexes
     • Non-XML resources (XQuery, CSS, JPEG, etc.) can be stored as binary
     • Deployable in different ways
       • Embedded in a Java application
       • Part of a Cocoon pipeline
       • As a web application in Apache/Tomcat
       • With an embedded Jetty HTTP server
     • Multiple interfaces
       • REST – to Java servlet
       • SOAP
       • XML-RPC
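
As an illustration of the REST interface listed above, the following Python sketch builds a GET URL that passes an XQuery to eXist's REST servlet through the `_query` parameter and limits the result size with `_howmany`. The server address, the collection path and the query itself are hypothetical; the code only constructs the URL and sends nothing.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def exist_rest_url(base, collection, xquery, how_many=10):
    """Build a GET URL asking the REST servlet to evaluate an XQuery."""
    params = urlencode({"_query": xquery, "_howmany": how_many})
    return f"{base}/{collection.strip('/')}?{params}"


url = exist_rest_url(
    "http://localhost:8080/exist/rest",  # hypothetical server address
    "/db/fold/modules",                  # hypothetical collection
    "//Module[Level = 3]/Title",         # illustrative query
)

# Decode the query string again to confirm the XQuery survives URL encoding.
decoded = parse_qs(urlsplit(url).query)["_query"][0]
print(url)
```

Keeping the query in a URL parameter is what lets a browser, a spreadsheet or another service call the database without a client library.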

  4. Sample Implementations
     • Family photos and history
       • Integration of metadata on family photos with family history (births, deaths and marriages) and Google Earth
     • FOLD
       • modules, programmes, scheme operations, staff, organisational structures, events
     • Other demos on the eXist demo site

  5. FOLD – Faculty OnLine Data
     • Operations at student level (2000 in CEMS) supported by central systems (student records, finance)
     • FOLD scope – teaching and assessment management and organisational knowledge
       • Modules [450] and their specification
       • Programmes (Courses) [100] and their structures
       • Operations – runs, coursework, exams
       • Staff (300+)
       • Organisational structure (100)
       • Events
     • Information currently distributed over Word documents, spreadsheets, Access databases, SQL database, flat text files, LDAP
     • Aims
       • To support distributed data ownership
       • To provide a web of data within and between systems
       • To support organisational processes
       • To improve data veracity

  6. FOLD Entity Types

  7. FOLD current stats
     • Code
       • XQuery - 3000
       • XSLT - 3000
       • XSD - 300 (one schema)
       • CSS - 200
       • PHP - 10 (vcal)
     • Pages
       • about 25 user
       • Only 1 admin as yet
     • Information System development
       • CW (4 months)
       • Placement student (8 months)
     • Phase allocation
       • Project (20%)
       • Code (20%)
       • Data – gathering, conversion, cleaning (60%)

  8. The FOLD

  9. Areas for attention
     • Conceptual Modelling
       • Identifiers
       • Relationships and links
       • Versioning
     • Logical Modelling (in XML)
       • Element/attribute
       • Views
       • Validation
     • Physical layer (in NXD)
       • Structuring documents and collections
       • Mapping to editors
       • Responsibilities
     • Programming
       • Functional allocation between tiers
       • Views and constructed elements
       • Integrity
       • XQuery programming
     • User interface
       • Editing
       • Long transactions
     • Development Process
       • CASE tool requirements
       • Scope of application of NXD

  10. Conceptual Modelling
     • Conventional normalised data model
     • EAR ++
       • Entity (not XML entities like &amp;)
       • Attribute (multi-valued)
       • Relationships
         • Association
         • Composition
     • Object Orientation?
       • Methods are mainly getters (of derived values)
       • Inheritance only useful in the schema domain
       • Instance inheritance more useful in IS
     • Expressivity problems
       • Identifiers
       • Order of parts
       • Verbosity
     • Conceptual scope?
       • Edit trails, versioning, activity tracking
     • Generality problem
       • Roles as Attributes
         • <ModuleLeader>Stewart Green</ModuleLeader>
       • Roles as Entities
         • <role><title>Module Leader</title><person>Stewart Green</person></role>
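
The generality problem can be made concrete with a small Python sketch using the two fragments from the slide. The wrapping `<Module>` element and the function names are assumptions made for illustration. With roles as entities, one generic function lists every role; with roles as attributes, the code has to be told which element names count as roles.

```python
import xml.etree.ElementTree as ET

# The two encodings from the slide, wrapped in a hypothetical <Module> element.
as_attribute = ET.fromstring(
    "<Module><ModuleLeader>Stewart Green</ModuleLeader></Module>")
as_entity = ET.fromstring(
    "<Module><role><title>Module Leader</title>"
    "<person>Stewart Green</person></role></Module>")


def roles_from_entities(module):
    """Generic: a new role title needs no change to this code."""
    return {r.findtext("title"): r.findtext("person")
            for r in module.findall("role")}


def roles_from_attributes(module, known_role_elements):
    """Specific: the caller must list every role element name up front."""
    return {name: module.findtext(name)
            for name in known_role_elements
            if module.find(name) is not None}


generic = roles_from_entities(as_entity)
specific = roles_from_attributes(as_attribute, ["ModuleLeader", "InternalModerator"])
```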

  11. Identifiers
     • Principle adopted – use naturally occurring identifiers wherever possible
       • Persons: “Chris Wallace”
       • Rooms: “3P14”
     • Yes
       • Reduces gap between Real World (RW) domain and system
       • Names in minutes of meetings and on spreadsheets are readable
     • No
       • Duplicates
         • Duplicates not tolerable in the RW either, resolved through RW negotiation within a RW namespace, e.g. the Faculty
         • Mergers generate duplicates
       • Aliases
       • Not all entities have unique domain identifiers
       • Gives rise to confusion in the problem domain and should be resolved there
     • Po
       • All names need a namespace – “Chris Wallace” at CEMS at UWE
       • Need to replace multiple naming conventions with a single naming scheme (e.g. initials)
       • URNs and semantic web

  12. Conceptual to Logical
     • Attributes v elements
     • Relationships
     • Integrity
     • Views

  13. Attributes v elements
     • E.g.
       • <Module code="UFIEKG-20-3" level="3">…
       • <Module><ModuleCode>UFIEKG-20-3</ModuleCode>
     • What criteria to use?
       • Attributes as ‘meta’ is vague
       • FOLD uses only elements
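
A short Python sketch of what the choice means for calling code. The two sample documents complete the slide's fragments in the simplest way (the `<Level>` child in the element form is an assumption), and `module_code` is a hypothetical helper. Code that has to cope with both encodings needs a branch; settling on elements only, as FOLD does, removes it.

```python
import xml.etree.ElementTree as ET

# Attribute encoding and element encoding of the same module.
attr_style = ET.fromstring('<Module code="UFIEKG-20-3" level="3"/>')
elem_style = ET.fromstring(
    "<Module><ModuleCode>UFIEKG-20-3</ModuleCode><Level>3</Level></Module>")


def module_code(module):
    """Read the module code from either encoding."""
    code = module.get("code")                # attribute form
    if code is None:
        code = module.findtext("ModuleCode")  # element form
    return code


codes = [module_code(attr_style), module_code(elem_style)]
```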

  14. Relationships
     • Implementing relationships
     • One–Many
       • RDBMS – primary key on the One side becomes foreign key on the Many side
       • NXD – choose which side on the basis of complexity and responsibility
         • Sequence (modules in a stage)
         • Complex (pre-requisite expression)
     • Many–Many
       • RDBMS – intersection table
       • NXD – as for one-many
       • or either side as appropriate – e.g. Groups and subgroups
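
A minimal Python sketch of the NXD approach, assuming a hypothetical `<Stages>` document in which each stage holds its module codes in sequence. No intersection table is needed: the relationship is stored on one side, document order carries the sequence, and the reverse direction is answered by a query.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: element names and module codes are illustrative only.
stages = ET.fromstring("""
<Stages>
  <Stage><Name>Stage 1</Name>
    <ModuleCode>M101</ModuleCode><ModuleCode>M102</ModuleCode>
  </Stage>
  <Stage><Name>Stage 2</Name>
    <ModuleCode>M201</ModuleCode><ModuleCode>M102</ModuleCode>
  </Stage>
</Stages>""")


def modules_in_stage(stages, name):
    """Forward navigation: module codes of a stage, in stored order."""
    for stage in stages.findall("Stage"):
        if stage.findtext("Name") == name:
            return [m.text for m in stage.findall("ModuleCode")]
    return []


def stages_for_module(stages, code):
    """Reverse navigation: search the side that owns the relationship."""
    return [s.findtext("Name") for s in stages.findall("Stage")
            if code in [m.text for m in s.findall("ModuleCode")]]


forward = modules_in_stage(stages, "Stage 2")
reverse = stages_for_module(stages, "M102")
```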

  15. Integrity
     • Structural integrity
       • Schema validation too weak and too restrictive
       • NXD stores any well-formed XML
     • Referential integrity
       • RDBMS – ‘eager’
         • Data not allowed in unless valid, updates maintain integrity
         • Integrity failures transient, repair outside database
       • NXD – ‘lazy’
         • Store the data and provide on-demand or on-trigger validation
         • Integrity failures can be persisted (XLinkit) and repair is inside database
     • Identifier uniqueness
       • XML ids only checked within a document
       • NXD stores all XML nodes with internal identifiers
     • For Information Systems, veracity of the model is what’s important
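
The ‘lazy’ policy can be sketched in Python as follows. The two documents and their element names are hypothetical. Both are stored as they are, and a separate on-demand check reports the module runs whose code has no matching specification, so the failures become data that can be kept and repaired later.

```python
import xml.etree.ElementTree as ET

# Hypothetical stored documents; the second run refers to an unknown module.
specs = ET.fromstring(
    "<Modules><Module><ModuleCode>M101</ModuleCode></Module></Modules>")
runs = ET.fromstring(
    "<Runs>"
    "<ModuleRun><ModuleCode>M101</ModuleCode><Year>2006</Year></ModuleRun>"
    "<ModuleRun><ModuleCode>M999</ModuleCode><Year>2006</Year></ModuleRun>"
    "</Runs>")


def dangling_runs(runs, specs):
    """On-demand referential check: run codes with no matching specification."""
    known = {m.findtext("ModuleCode") for m in specs.findall("Module")}
    return [r.findtext("ModuleCode") for r in runs.findall("ModuleRun")
            if r.findtext("ModuleCode") not in known]


report = dangling_runs(runs, specs)
```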

  16. Logical to Physical layers
     • What criteria to use in allocation of logical units to the physical layer:
       • Documents – a physical aggregation of entity instances
       • Collections – a physical aggregation of documents
     • Examples
       • Module Specification [moduleCode]
         • Module Spec is an Entity
         • Each Module Spec is a Document
       • Module Run [moduleCode/year/runNo]
         • Module Run is an Entity
         • Set of Module Runs for a Field is a Document
     • Issues
       • Schemas needed per entity, not per document
       • Principle: no concepts modelled in the physical layer
       • Use physical layer for responsibility, access rights?

  17. Programming issues
     • Tier design
     • Views and constructed elements
     • XQuery programming

  18. Tier design
     • Allocation of functionality to tiers
       • Initially nearly all XQuery generating HTML
       • As work matured, code moved into function libraries and XSLT
     • XQuery for request input, sessions, selection of nodes, computation of views for
       • XSLT to generate interface for
         • CSS to style

  19. Views
     • Views arise from the need for de-normalisation for presentation
     • Coursework Element
       • As a simple element
         • Key: moduleCode/Year/runNo/elementNo
         • Data: due date
       • As an extended de-normalised element
         • SuggestedHours (computed from Hours table)
         • Late date (computed from UWE calendar)
         • Weightings (extracted from relevant specification)
         • Module Leader (extracted from Module Run)
     • Views as intermediate structures
       • From low level functions
       • For output to XSL
     • Constructed elements in XQuery use copy (losing the reference, so can't update through a constructed element)
     • View caching for efficiency
       • Triggers can invoke cache renewal
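
View caching with trigger-driven renewal might look like the following Python sketch. `ViewCache`, its method names and the sample key are all hypothetical; the point is only that a computed view is built once, reused, and discarded when an update trigger reports that its source has changed.

```python
class ViewCache:
    """Cache computed views by key; an update trigger drops stale entries."""

    def __init__(self, build):
        self._build = build   # function that computes a view from its key
        self._cache = {}
        self.builds = 0       # counts how often a view was actually computed

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._build(key)
            self.builds += 1
        return self._cache[key]

    def on_update(self, key):
        """Called by an update trigger: force renewal on the next request."""
        self._cache.pop(key, None)


# Hypothetical source data keyed by moduleCode/year/runNo/elementNo.
due_dates = {"M101/2006/1/1": "2006-05-01"}
cache = ViewCache(lambda key: {"Key": key, "DueDate": due_dates[key]})

first = cache.get("M101/2006/1/1")
again = cache.get("M101/2006/1/1")       # served from the cache
due_dates["M101/2006/1/1"] = "2006-05-08"
cache.on_update("M101/2006/1/1")         # trigger fires after the edit
refreshed = cache.get("M101/2006/1/1")   # rebuilt from the new data
```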

  20. declare function fold:courseworkElement($moduleCode, $year, $runNo, $elementNo) {
        (: look up the module specification, the module run and the element's run record :)
        let $mod := fold:moduleSpecification($moduleCode, $year),
            $run := fold:moduleRun($moduleCode, $year, $runNo),
            $elementRun := fold:elementRun($moduleCode, $year, $runNo, 'B', $elementNo),
            $elementSpec := $mod/Assessment/FirstAttempt/Components/ComponentB/Element[position() = $elementNo],
            (: dates: the return date is derived from the due date :)
            $dueDate := $elementRun/DueDate,
            $returnDate := fold:workingDays($dueDate, 20),
            (: weight of this element within the whole module, as a percentage :)
            $componentWeight := $mod/Assessment/Weighting/ComponentWeightB,
            $weightInComponent := data($elementSpec/Weight),
            $weightInModule := round($weightInComponent * $componentWeight div 100),
            (: suggested hours, scaled from the load for the module's level :)
            $load := fold:load($mod/Level),
            $hrs := round(data($mod/UWERating) div data($load/Credits) * $weightInModule div 100 * data($load/Hours))
        return
          <CourseworkElement>
            <ModuleCode>{$moduleCode}</ModuleCode>
            {$mod/Title}
            <RunNo>{$runNo}</RunNo>
            {$run/ModuleLeader}
            {$run/InternalModerator}
            {$run/ExternalExaminer}
            <Component>CW</Component>
            <ElementNo>{$elementNo}</ElementNo>
            {$elementSpec/Description}
            <SuggestedHours>{$hrs}</SuggestedHours>
            <WeightInComponent>{$weightInComponent}</WeightInComponent>
            <WeightInModule>{$weightInModule}</WeightInModule>
            <DueDate>{data($dueDate)}</DueDate>
            <ReturnDate>{data($returnDate)}</ReturnDate>
          </CourseworkElement>
      };
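
The two derived values in this function can be checked by hand. The Python sketch below repeats the arithmetic for `$weightInModule` and `$hrs` with made-up inputs; none of the numbers come from FOLD. `xq_round` mimics XQuery's `round`, which sends ties towards positive infinity, whereas Python's built-in `round` sends them to the nearest even number.

```python
import math


def xq_round(x):
    """Round to the nearest integer, ties towards positive infinity."""
    return math.floor(x + 0.5)


def suggested_hours(weight_in_component, component_weight,
                    uwe_rating, load_credits, load_hours):
    """Mirror the $weightInModule and $hrs expressions of the slide."""
    weight_in_module = xq_round(weight_in_component * component_weight / 100)
    hrs = xq_round(uwe_rating / load_credits * weight_in_module / 100 * load_hours)
    return weight_in_module, hrs


# Hypothetical inputs: an element worth 50% of a component worth 40% of a
# 20-credit module, at a level whose load is 120 credits over 1200 hours.
weight, hours = suggested_hours(50, 40, 20, 120, 1200)
```

With these inputs the element carries 20% of the module, and 20% of the module's one-sixth share of 1200 hours gives 40 suggested hours.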

  21. Integrity
     • Unlike RDBMS, integrity checks not inherent in database
       • Structural (schema validation)
       • Referential integrity
       • Business rules
     • Policies
       • Restrictive – allow in only data which has satisfied integrity constraints
         • Unitary view of data – model must be consistent at all times
       • Permissive – allow in un-validated data with on-demand validation and reconciliation
         • Pluralist view – model will probably never be consistent but have to work with this
     • On-demand validation
       • Structure via eXist validation
       • Referential (via explicit coding)
       • Extensive business rules

  22. XQuery programming
     • Functional style yields good clean code
     • But it's not OO!
     • Need to rethink some algorithms
     • Strict data typing needs explicit conversion
     • Schema not missed
     • XPath 2.0 in XQuery, XPath 1.0 in XSLT (Xalan) causes confusion
     • Fast and responsive

  23. User Interface
     • Table-structured document editing
       • Allows maintenance using familiar spreadsheet tools (Excel 2003 + Add-in)
       • Schema is induced by Excel
       • Accommodations
         • Multi-valued fields as concatenated values
           • XPath string-join and tokenize functions
           • Embedded separator problem (a name with ‘,’ as a legitimate character)
           • Defeats conventional indexing but eXist supports full text indexing
         • Optional elements increase table width
         • Formatting choices not maintained (e.g. column widths, freeze-window location)
       • WebDAV to provide Web Folder access (still not functioning)
     • Structured document editing
       • Allows maintenance with Word without a schema
         • With difficulty – no schema awareness
       • Use InfoPath to create desktop form based on schema
         • Need to redo if schema changes
       • Document editors (Arbortext, XMetaL ..) – expensive
     • In-situ updates
       • With XQuery-generated forms and update
       • With XForms using Orbeon (open-source XForms server)
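
The embedded separator problem is easy to reproduce. In the Python sketch below, `join_values` and `tokenise` stand in for the XPath string-join and tokenize functions, and the names are sample data. Joining on a comma breaks a value that legitimately contains one; switching to a separator that cannot occur in the data (a semicolon here, which is an assumption about the data) restores the round trip.

```python
def join_values(values, sep=","):
    """Concatenate a multi-valued field into one spreadsheet cell."""
    return sep.join(values)


def tokenise(cell, sep=","):
    """Split a cell back into its values, trimming stray spaces."""
    return [value.strip() for value in cell.split(sep)]


names = ["Green, Stewart", "Chris Wallace"]

# Comma as separator: the first name is split in two.
naive = tokenise(join_values(names))

# Semicolon as separator: the values come back unchanged.
safe = tokenise(join_values(names, sep=";"), sep=";")
```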

  24. Development Tools
     • eXist Java Client provides basic tools
       • Syntax-aware editor
       • Query execution
       • User and database management
     • XMLSpy
     • Any text editor
     • Model-driven development
       • Conceptual model -> logical model -> physical model
       • Rose, QSEE?

  25. Development Process
     • Co-development of Information System structure (code and schemas) and content (documents)
     • Support schema migration and refactoring (using XQuery/XSLT)
     • Slide from prototype to production
     • Pluses and minuses of user enthusiasm
     • Go for ‘low-hanging fruit’
     • Pay attention to the learning process
     • XQuery and XSLT are non-trivial languages because they are deeply unlike Java/PHP
     • Project management via steering group and discussion boards, but needs a forceful lead developer
     • Reflection forced by presentations and workshops
     • Is Agile IS development different to Agile Software development?

  26. Characteristics of good fit?
     • FOLD
       • Low update rate / medium access rate
       • High document complexity
       • Document-centric ownership
       • Navigational interface
       • Integration with central systems – (via XML interfaces?)
