XML, distributed databases, and OLAP/warehousing

The semantic web and a lot more

What is XML? • A framework for declarative languages • A syntax and two major constructs: elements & attributes • Elements: • Have begin and end tags • Can be embedded • Can be put in lists (homogeneous or heterogeneous) • Attributes: • Are assigned to elements • Are strings • Are put in quotes
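A minimal sketch of the two constructs above, using Python's standard `xml.etree.ElementTree` module. The document, its tag names, and its attribute names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tag and attribute names are made up for this sketch.
DOC = """
<library>
  <book isbn="111" year="1999">
    <title>First Title</title>
    <author>A. Writer</author>
    <author>B. Writer</author>
  </book>
  <book isbn="222" year="2004">
    <title>Second Title</title>
    <author>C. Writer</author>
  </book>
</library>
"""

root = ET.fromstring(DOC)

# Elements nest (book inside library) and repeat as lists (several authors).
books = list(root)
first_authors = [a.text for a in books[0].findall("author")]

# Attributes are assigned to an element and are strings, even if they look numeric.
year = books[0].get("year")

print(root.tag, len(books))
print(first_authors)
print(repr(year))
```

Note that `year` comes back as the string `'1999'`; any numeric interpretation is up to the application.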

What is XML for? • Initially, as a cornerstone of the semantic web • Automatic searching of the web (versus interactive) • Self-describing data • Has been adapted to a wide variety of application domains • As a means for specifying the structure of data • As a catch-all for nontraditional data

XML documents • An instance of XML is a language • An instance of an XML language is a document • Documents are hierarchical & list-oriented • XML documents can be parsed in a single, linear pass • There is no notion of a fixed schema • Does not leverage metadata for set-oriented queries • Order matters in a set of documents • Order matters in a series of elements in a document
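The single-pass and order-matters points can be sketched with the standard library's incremental parser. The `log`/`entry` document is a made-up example, not taken from the slides.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document: a list of entries whose order carries meaning.
DOC = b"<log><entry>start</entry><entry>work</entry><entry>stop</entry></log>"

seen = []
# iterparse reads the input once, front to back, and reports each element
# as soon as its end tag has been read.
for event, elem in ET.iterparse(io.BytesIO(DOC), events=("end",)):
    if elem.tag == "entry":
        seen.append(elem.text)
        elem.clear()  # drop content that has already been handled

print(seen)
```

The entries arrive in document order, which is why a list of elements cannot be treated as an unordered set.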

Is it a generalized HTML? • Sort of, but perhaps more of a meta alternative to HTML • The real point is to allow HTML pages to be located and searched automatically • This is done by allowing language developers to create their own names for documents, elements, & attributes

What else is part of the XML philosophy? • Namespaces • Associated with URLs • Can be referenced in a nested fashion in an XML document • Widely distributed sharing of data, XML languages, and namespaces
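A small sketch of namespaces tied to URLs, including a declaration made on a nested element. The `example.org` URLs and all tag names are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the default namespace is declared on the root,
# and a second namespace is declared further down, on a nested element.
DOC = """
<order xmlns="http://example.org/orders">
  <p:item xmlns:p="http://example.org/products">
    <p:name>widget</p:name>
    <note>rush</note>
  </p:item>
</order>
"""

root = ET.fromstring(DOC)

# The prefixes used for lookup are local to this script; only the URLs must match.
ns = {"o": "http://example.org/orders", "p": "http://example.org/products"}

name = root.find("p:item/p:name", ns).text
note = root.find("p:item/o:note", ns).text  # unprefixed tags inherit the default namespace

print(root.tag)
print(name, note)
```

ElementTree expands each tag to `{URL}localname`, so two languages can both use a tag called `name` without colliding.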

What’s missing, from a database user’s and a programmer’s perspective? • No innate notion of a query language • No objects • Very limited data structuring capabilities • Yet another impedance mismatch problem • No way to store XML documents in a relational database, at least not natively • No way to make a database out of a set of documents

So, in response to the database community’s desires… • A hierarchical query language – XPath • A specification format for schemas – DTDs • But DTDs use their own, non-XML syntax • And do not accommodate namespaces

So, in response to the database community’s desires, phase 2… • XML Schema • More atomic or “basic” types • Like DTDs, but with an XML syntax • Supports namespaces • Adds primary keys and foreign keys • Adds more constructs for structuring data • Simple types: primitive types, list and union, & restriction • Attributes can be of simple types • Complex types: compositors • all (unordered), sequence (ordered), and choice • Extension and restriction • Integrity constraints

Query language 1: XPath • Follows the hierarchy of XML documents • Uses syntax borrowed from the Unix file system • / for root • . for current node • @ for value of an attribute • [1], [2], etc., to select among siblings by position • // for descendant-or-self • .//x to find all descendants of the current node that are elements of type x • Augmented with URLs to create XPointer • Relational database systems generally have an XML data type now
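A hedged sketch of the path syntax, using the XPath subset that Python's `xml.etree.ElementTree` supports. That subset is limited: paths are evaluated relative to an element, and attribute values are read with `.get()` rather than a trailing `@attr` step. The document and its names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only to exercise the path syntax.
DOC = """
<library>
  <shelf id="s1">
    <book isbn="111"><title>First Title</title></book>
    <book isbn="222"><title>Second Title</title></book>
  </shelf>
  <shelf id="s2">
    <book isbn="333"><title>Third Title</title></book>
  </shelf>
</library>
"""

root = ET.fromstring(DOC)

# .//x : every descendant of the current node with tag x, in document order.
all_titles = [t.text for t in root.findall(".//title")]

# [2] : the second book among its siblings (find returns the first match).
second_book = root.find("./shelf/book[2]")

# [@isbn='333'] : filter on an attribute value, then step down to title.
by_attr = root.find(".//book[@isbn='333']/title").text

# Attribute values themselves are read from the matched elements.
shelf_ids = [s.get("id") for s in root.findall("./shelf")]

print(all_titles)
print(second_book.get("isbn"), by_attr, shelf_ids)
```

A full XPath engine would also accept absolute paths starting at `/` and expressions that return attribute values directly.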

Distributed Databases & Distributed Transactions – homogeneous and heterogeneous • See page 689: multiple DBs vs. a distributed DB • Homogeneous distributed DBs • Single unified schema • Designed top down • Distribution by row, column, table, by table selection • Issues of distribution • Redundancy: availability vs. keeping copies up to date • Hidden joins with column distribution • Hidden unions with table selection distribution
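The hidden joins and hidden unions can be sketched in plain Python. The `EMPLOYEES` table, its columns, and all function names are hypothetical; real systems do this inside the query processor.

```python
# Hypothetical table, held as a list of rows.
EMPLOYEES = [
    {"id": 1, "name": "Ana", "site": "east", "salary": 50},
    {"id": 2, "name": "Bo", "site": "west", "salary": 60},
    {"id": 3, "name": "Cy", "site": "east", "salary": 70},
]

def split_by_row(rows, column):
    """Distribution by selection: each fragment holds the rows for one value."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments

def split_by_column(rows, key, column_groups):
    """Distribution by column: each fragment keeps the key plus some columns."""
    return [
        [{c: row[c] for c in (key, *group)} for row in rows]
        for group in column_groups
    ]

def hidden_union(fragments):
    """Rebuild a row-distributed table by taking the union of its fragments."""
    merged = [row for frag in fragments.values() for row in frag]
    return sorted(merged, key=lambda r: r["id"])

def hidden_join(fragments, key):
    """Rebuild a column-distributed table by joining fragments on the key."""
    joined = {}
    for frag in fragments:
        for row in frag:
            joined.setdefault(row[key], {}).update(row)
    return sorted(joined.values(), key=lambda r: r[key])

by_site = split_by_row(EMPLOYEES, "site")
by_cols = split_by_column(EMPLOYEES, "id", [("name", "site"), ("salary",)])

print(hidden_union(by_site) == EMPLOYEES)
print(hidden_join(by_cols, "id") == EMPLOYEES)
```

A query that looks like a single-table scan to the user may therefore cost a union or a join across sites.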

Executing distributed transactions • Each node has a master and a client module • Masters are all identical and contain distributed data info • Clients are like single site databases with a prepare to commit • 3 basic strategies for query fragment execution • Bring data to procedure • Send procedure to data • Meet in a 3rd place • Estimating costs • Data shipping • Result shipping • Wait times on nodes • Integrity constraint enforcement
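The "prepare to commit" step is sketched below as a two-phase commit. Naming that protocol is an assumption, since the slide does not name it. The `Participant` class and `run_transaction` function are hypothetical, and the sketch ignores failures, timeouts, and logging.

```python
class Participant:
    """A hypothetical single-site database that supports prepare-to-commit."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def run_transaction(participants):
    """Coordinator logic: commit everywhere only if every site votes yes."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants:
            p.commit()  # Phase 2: unanimous yes, so commit everywhere
        return "committed"
    for p in participants:
        p.abort()  # Phase 2: at least one no, so abort everywhere
    return "aborted"


ok_sites = [Participant("a"), Participant("b")]
bad_sites = [Participant("a"), Participant("b", can_commit=False)]

print(run_transaction(ok_sites), [p.state for p in ok_sites])
print(run_transaction(bad_sites), [p.state for p in bad_sites])
```

The point of the prepare step is that no site commits until every site has promised it can.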

Heterogeneous distributed databases • Forms of heterogeneity • Model • Schema • Database product • Namespace • Table structure (implications for object identities) • Keys and Foreign keys • Units • SQL dialect • Semantic issues relating to varying interpretations of data

Integrating heterogeneous databases • After the fact • Stability is never achieved • Mappings are complex • Data may have conflicts, redundancy, and gaps • Closed world vs. open world

Engineering for nonstop change • Mediators around databases • Gateways connecting old apps and new databases • Gateways connecting new apps and old databases • A stability of instability
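One way to picture a mediator is as a thin layer that maps each source's names and units onto a shared view. Everything here is invented for illustration: the two sources, their column names, and the target schema.

```python
# Two hypothetical sources describing the same kind of item differently:
# different key names and different units of weight.
SOURCE_A = [{"part_no": "P1", "weight_kg": 2.0}]
SOURCE_B = [{"PartID": "P2", "weight_lb": 4.0}]

LB_TO_KG = 0.45359237  # kilograms per pound

def from_a(row):
    """Source A already uses kilograms; only the key name changes."""
    return {"part": row["part_no"], "weight_kg": row["weight_kg"]}

def from_b(row):
    """Source B uses pounds and a different key name."""
    return {"part": row["PartID"], "weight_kg": round(row["weight_lb"] * LB_TO_KG, 3)}

def mediated_view():
    """Present both sources as one list of rows in the shared schema."""
    return [from_a(r) for r in SOURCE_A] + [from_b(r) for r in SOURCE_B]

for row in mediated_view():
    print(row)
```

When a source changes, only its mapping function needs to change, which is what lets the applications above the mediator stay stable.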

OLAP • Standard model • N dimension tables • 1 fact table (PK is union of keys of dimension tables) • Hypercube visualization • Multidimensional table result visualizations • Star and constellation schemas • Terminology • Drilling down – stepping down nested attributes • Rolling up – moving up nested attributes • Pivot – group by
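A minimal star schema sketch using Python's built-in `sqlite3` module: two dimension tables and one fact table whose primary key combines the dimension keys. Table names, column names, and data are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date  (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT, city TEXT);
CREATE TABLE fact_sales (
    date_id  INTEGER REFERENCES dim_date(date_id),
    store_id INTEGER REFERENCES dim_store(store_id),
    amount   INTEGER,
    PRIMARY KEY (date_id, store_id)  -- combined keys of the dimension tables
);
INSERT INTO dim_date  VALUES (1, 2023, 1), (2, 2023, 2), (3, 2024, 1);
INSERT INTO dim_store VALUES (10, 'north', 'Oslo'), (20, 'south', 'Rome');
INSERT INTO fact_sales VALUES (1, 10, 5), (2, 10, 7), (3, 10, 2), (1, 20, 4), (3, 20, 9);
""")

# Rolled up: totals per year.
by_year = conn.execute("""
    SELECT d.year, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY d.year ORDER BY d.year
""").fetchall()

# Drilled down: the same totals split one level further, per year and month.
by_month = conn.execute("""
    SELECT d.year, d.month, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY d.year, d.month ORDER BY d.year, d.month
""").fetchall()

print(by_year)
print(by_month)
```

Drilling down and rolling up differ only in how many levels of the nested date attributes appear in the `GROUP BY`.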

Specialized operators • Cube operator and 4 equivalent queries • Viewing results • See page 722 • Equivalent – see 723
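With two grouping attributes, a cube amounts to one group-by per subset of those attributes, four in all. That is presumably what the "4 equivalent queries" refers to. The sketch below computes those group-bys in plain Python; the data and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows with two dimensions and one measure.
SALES = [
    {"region": "north", "year": 2023, "amount": 12},
    {"region": "north", "year": 2024, "amount": 2},
    {"region": "south", "year": 2023, "amount": 4},
    {"region": "south", "year": 2024, "amount": 9},
]

def group_by(rows, dims, measure):
    """Sum the measure for each combination of values of the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)

def cube(rows, dims, measure):
    """Run one group-by per subset of dims, from all dimensions down to none."""
    result = {}
    for size in range(len(dims), -1, -1):
        for subset in combinations(dims, size):
            result[subset] = group_by(rows, subset, measure)
    return result

result = cube(SALES, ("region", "year"), "amount")

for dims, totals in result.items():
    print(dims, totals)
```

The empty subset gives the grand total, and each single-attribute subset gives one margin of the cross-tabulation.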

Populating the warehouse • Transformation • Integration • Cleaning

Data mining • Effectively an open world application • Association, classification, clustering – page 730 • Association – confidence and support – page 731
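Support and confidence for an association rule can be computed directly from their usual definitions. Support is the fraction of transactions containing an itemset. Confidence of X → Y is support(X ∪ Y) divided by support(X). The baskets below are invented for illustration.

```python
# Hypothetical market-basket data: each transaction is a set of items.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Of the baskets containing lhs, the fraction that also contain rhs."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

print(support({"bread", "milk"}, BASKETS))
print(confidence({"bread"}, {"milk"}, BASKETS))
```

Here bread and milk appear together in two of four baskets, and two of the three bread baskets also contain milk.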
