XML (eXtensible Markup Language) serves as a framework for creating self-describing data structures fundamental to the Semantic Web. Initially designed for automatic web searching, XML allows users to define custom elements and attributes, enabling data from various domains to be structured and exchanged efficiently. Although it presents limitations for direct database applications, XML has inspired technologies such as XPath for querying and DTDs/XML Schema for data validation. This document explores XML's syntax, philosophy, and its role within distributed databases and OLAP systems.
XML, distributed databases, and OLAP/warehousing: the semantic web and a lot more
What is XML? • A framework for declarative languages • A syntax and two major constructs: elements & attributes • Elements: • Have begin and end tags • Can be embedded • Can be put in lists (homogeneous or heterogeneous) • Attributes: • Are assigned to elements • Are strings • Are put in quotes
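To make the two constructs concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The document and its names (`library`, `book`, `journal`, `isbn`, `issn`) are invented for illustration. It shows nested elements, a heterogeneous list of children, a homogeneous list of `author` elements, and attributes that come back as plain strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: all element and attribute names are made up.
doc = """
<library>
  <book isbn="0-000-00000-0">
    <title>Example Title</title>
    <author>A. Writer</author>
    <author>B. Writer</author>
  </book>
  <journal issn="0000-0000">
    <title>Example Journal</title>
  </journal>
</library>
"""

root = ET.fromstring(doc)

# Elements are embedded in one another; the root holds a heterogeneous list.
child_tags = [child.tag for child in root]

# Attributes are assigned to elements and are always strings.
book = root.find("book")
isbn = book.get("isbn")

# A homogeneous list: repeated <author> elements inside one <book>.
authors = [a.text for a in book.findall("author")]

print(child_tags, isbn, authors)
```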
What is XML for? • Initially, as a cornerstone of the semantic web • Automatic searching of the web (versus interactive) • Self-describing data • Has been adapted to a wide variety of application domains • As a means for specifying the structure of data • As a catch-all for nontraditional data
XML documents • An instance of XML is a language • An instance of an XML language is a document • Documents are hierarchical & list-oriented • XML documents can be parsed in a single, linear pass • There is no notion of a fixed schema • Does not leverage meta data for set-oriented queries • Order matters in a set of documents • Order matters in a series of elements in a document
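The single-pass and ordering points can be illustrated with a streaming parser. This is a sketch using `xml.etree.ElementTree.iterparse` from the Python standard library; the `playlist` and `song` names are hypothetical. Elements are handled as their end tags are reached, so they arrive in document order without the whole tree having to stay in memory.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are invented for illustration.
data = b"<playlist><song>first</song><song>second</song><song>third</song></playlist>"

seen = []
# iterparse walks the input once, front to back, emitting an event
# each time an element's end tag is reached.
for event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
    if elem.tag == "song":
        seen.append(elem.text)
        elem.clear()  # drop the element's content once it has been handled

print(seen)
```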
Is it a generalized HTML? • Sort of, but perhaps more of a meta alternative to HTML • The real point is to allow HTML pages to be located and searched automatically • This is done by allowing language developers to create their own names for documents, elements, & attributes
What else is part of the XML philosophy? • Namespaces • Associated with URLs • Can be referenced in a nested fashion in an XML document • Widely distributed sharing of data, XML languages, and namespaces
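A small sketch of namespaces in practice, again with the standard `xml.etree.ElementTree` module. The URLs under `example.com` and every element name are placeholders, not real vocabularies. The inner `item` element declares an additional namespace, which is one way namespaces can be referenced in a nested fashion.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces and element names, for illustration only.
doc = """
<order xmlns="http://example.com/ns/orders"
       xmlns:c="http://example.com/ns/customers">
  <c:customer>
    <c:name>Example Customer</c:name>
  </c:customer>
  <item xmlns:p="http://example.com/ns/products">
    <p:sku>ABC-1</p:sku>
  </item>
</order>
"""

root = ET.fromstring(doc)

# The prefixes used for lookup are local to this code; only the URLs must match.
ns = {
    "o": "http://example.com/ns/orders",
    "c": "http://example.com/ns/customers",
    "p": "http://example.com/ns/products",
}

# ElementTree stores each tag as "{namespace-url}local-name".
root_tag = root.tag
customer_name = root.find("c:customer/c:name", ns).text
sku = root.find("o:item/p:sku", ns).text

print(root_tag, customer_name, sku)
```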
What’s missing, from a database user’s and a programmer’s perspective? • No innate notion of a query language • No Objects • Very limited data structuring capabilities • Yet another impedance mismatch problem • No way to store XML documents in a relational database, at least not natively • No way to make a database out of a set of documents
So, in response to the database community’s desires… • A hierarchical query language – XPath • A specification format for schemas – DTDs • But DTDs use a different, non-XML syntax • And they do not accommodate namespaces
So, in response to the database community’s desires, phase 2… • XML Schema • More atomic or “basic” types • Like DTDs, but with an XML syntax • Supports namespaces • Adds primary keys and foreign keys • Adds more constructs for structuring data • Simple types: primitive types, list and union, & restriction • Attributes can be of simple types • Complex types: compositors • all (unordered), sequence (ordered), and choice • Extension and restriction • Integrity constraints
Query language 1: XPath • Follows hierarchy of XML documents • Uses syntax borrowed from Unix file system • / for root • . for current node • @ for value of an attribute • [1], [2], etc., for position among siblings • // for self or descendant of • .//x for all descendants to find an element of a specific type x • Augmented with URLs to create XPointer • Relational database systems generally have an XML data type now
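The path notation can be tried out with the standard `xml.etree.ElementTree` module, which implements only a subset of XPath: relative paths, `.`, `//`, positional predicates such as `[1]`, and attribute tests such as `[@name='web']`. It does not evaluate absolute paths on an element or return attribute values directly, so the sketch below reads attributes with `.get()`. The `catalog` document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element and attribute names are made up.
doc = """
<catalog>
  <section name="databases">
    <book id="b1"><title>First</title></book>
    <book id="b2"><title>Second</title></book>
  </section>
  <section name="web">
    <book id="b3"><title>Third</title></book>
  </section>
</catalog>
"""

root = ET.fromstring(doc)

# .//title : every <title> descendant of the current node, in document order.
all_titles = [t.text for t in root.findall(".//title")]

# [1], [2] : position among siblings with the same tag (1-based).
second_db_book = root.find("./section[1]/book[2]").get("id")

# [@name='web'] : filter elements by the value of an attribute.
web_books = [b.get("id") for b in root.findall("section[@name='web']/book")]

# book[1] is evaluated per parent, so this yields the first book of each section.
first_per_section = [b.get("id") for b in root.findall("section/book[1]")]

print(all_titles, second_db_book, web_books, first_per_section)
```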
Distributed Databases & Distributed Transactions – homogeneous and heterogeneous • See page 689: multiple DBs vs. a distributed DB • Homogeneous distributed DBs • Single unified schema • Designed top down • Distribution by row, column, table, by table selection • Issues of distribution • Redundancy: availability vs. keeping copies up to date • Hidden joins with column distribution • Hidden unions with table selection distribution
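The "hidden join" and "hidden union" can be sketched in plain Python. The `employees` table, its columns, and the `east`/`west` sites are invented for illustration, and reading "table selection" as a row-wise split by a selection predicate is an assumption. Rebuilding the logical table from row fragments takes a union; rebuilding it from column fragments takes a join on the key that each fragment must carry.

```python
# Hypothetical logical table; names and values are made up.
employees = [
    {"id": 1, "name": "Ann", "site": "east", "salary": 100},
    {"id": 2, "name": "Bob", "site": "west", "salary": 120},
    {"id": 3, "name": "Cy", "site": "east", "salary": 90},
]

# Distribution by row: each node stores the rows that match its selection.
east = [r for r in employees if r["site"] == "east"]
west = [r for r in employees if r["site"] == "west"]

# A query over the whole table hides a union of the fragments.
rebuilt_from_rows = sorted(east + west, key=lambda r: r["id"])

# Distribution by column: each fragment keeps the key plus some columns.
public = [{"id": r["id"], "name": r["name"], "site": r["site"]} for r in employees]
payroll = [{"id": r["id"], "salary": r["salary"]} for r in employees]

# A query touching columns from both fragments hides a join on the key.
payroll_by_id = {r["id"]: r for r in payroll}
rebuilt_from_columns = [{**p, **payroll_by_id[p["id"]]} for p in public]

print(rebuilt_from_rows == employees, rebuilt_from_columns == employees)
```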
Executing distributed transactions • Each node has a master and a client module • Masters are all identical and contain distributed data info • Clients are like single site databases with a prepare to commit • 3 basic strategies for query fragment execution • Bring data to procedure • Send procedure to data • Meet in a 3rd place • Estimating costs • Data shipping • Result shipping • Wait times on nodes • Integrity constraint enforcement
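The "prepare to commit" step suggests a two-phase-commit style protocol; that reading is an assumption, as the slide does not name the protocol. The sketch below is a toy, in-memory version with hypothetical `Participant` and `run_transaction` names. It leaves out logging, timeouts, and failure recovery, which a real system needs.

```python
class Participant:
    """Stands in for a single-site database that can vote on a commit."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: promise to commit if asked, or vote no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def run_transaction(participants):
    """Coordinator: commit only if every participant votes yes."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants:
            p.commit()  # Phase 2: everyone commits
        return "committed"
    for p in participants:
        p.abort()  # Phase 2: a single "no" aborts everywhere
    return "aborted"


good = [Participant("a"), Participant("b")]
mixed = [Participant("a"), Participant("b", can_commit=False)]
outcome_good = run_transaction(good)
outcome_mixed = run_transaction(mixed)
print(outcome_good, outcome_mixed)
```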
Heterogeneous distributed databases • Forms of heterogeneity • Model • Schema • Database product • Namespace • Table structure (implications for object identities) • Keys and Foreign keys • Units • SQL dialect • Semantic issues relating to varying interpretations of data
Integrating heterogeneous databases • After the fact • Stability is never achieved • Mappings are complex • Data may have conflicts, redundancy, and gaps • Closed world vs. open world
Engineering for nonstop change • Mediators around databases • Gateways connecting old apps and new databases • Gateways connecting new apps and old databases • A stability of instability
OLAP • Standard model • N dimension tables • 1 fact table (PK is union of keys of dimension tables) • Hypercube visualization • Multidimensional table result visualizations • Star and constellation schemas • Terminology • Drilling down – stepping down nested attributes • Rolling up – moving up nested attributes • Pivot – group by
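A minimal star schema can be sketched with Python's built-in `sqlite3` module. The table names, columns, and data below are invented for illustration. The fact table's primary key is made up of the dimension keys, rolling up aggregates at the coarser `region` level, and drilling down adds the nested `city` attribute to the grouping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical star schema: two dimension tables and one fact table.
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT, name TEXT);
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT, city TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id INTEGER REFERENCES dim_store(store_id),
    amount INTEGER,
    PRIMARY KEY (product_id, store_id)  -- combination of the dimension keys
);
INSERT INTO dim_product VALUES (1, 'fruit', 'apple'), (2, 'fruit', 'pear'), (3, 'dairy', 'milk');
INSERT INTO dim_store VALUES (10, 'north', 'Oslo'), (20, 'north', 'Bergen'), (30, 'south', 'Rome');
INSERT INTO fact_sales VALUES (1, 10, 5), (2, 10, 3), (3, 20, 4), (1, 30, 7), (3, 30, 2);
""")

# Rolled up: totals at the region level.
by_region = conn.execute("""
    SELECT s.region, SUM(f.amount)
    FROM fact_sales f JOIN dim_store s USING (store_id)
    GROUP BY s.region ORDER BY s.region
""").fetchall()

# Drilled down: step from region to the nested city attribute.
by_city = conn.execute("""
    SELECT s.region, s.city, SUM(f.amount)
    FROM fact_sales f JOIN dim_store s USING (store_id)
    GROUP BY s.region, s.city ORDER BY s.region, s.city
""").fetchall()

print(by_region)
print(by_city)
```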
Specialized operators • Cube operator and 4 equivalent queries • Viewing results • See page 722 • Equivalent – see 723
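One way to read the "4 equivalent queries": with two dimensions, a cube is the combined result of the four GROUP BY queries over every subset of the dimensions. That reading is an assumption, as are the `cube` helper and the data below. The sketch computes all four groupings in plain Python and marks a rolled-up dimension with the placeholder `"ALL"`.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows: (category, region, amount).
rows = [
    ("fruit", "north", 8),
    ("dairy", "north", 4),
    ("fruit", "south", 7),
    ("dairy", "south", 2),
]


def cube(rows, n_dims):
    """Aggregate the measure for every subset of the first n_dims columns."""
    result = {}
    for k in range(n_dims + 1):
        for kept in combinations(range(n_dims), k):
            totals = defaultdict(int)
            for row in rows:
                # Dimensions not kept in this grouping collapse to "ALL".
                key = tuple(row[i] if i in kept else "ALL" for i in range(n_dims))
                totals[key] += row[n_dims]
            result[kept] = dict(totals)
    return result


result = cube(rows, 2)
for kept, totals in result.items():
    print(kept, totals)
```

With two dimensions this yields four groupings: the grand total, totals by category, totals by region, and totals by category and region together.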
Populating the warehouse • Transformation • Integration • Cleaning
Data mining • Effectively an open world application • Association, classification, clustering – page 730 • Association – confidence and support – page 731
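Support and confidence for an association rule can be computed directly from their usual definitions. Support of an itemset is the fraction of transactions that contain it. Confidence of a rule X → Y is the support of X and Y together divided by the support of X. The transactions and the `support`/`confidence` helper names below are invented for illustration.

```python
# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Here the rule bread → milk holds in 2 of 4 transactions (support 0.5) and in 2 of the 3 transactions that contain bread (confidence about 0.67).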