CS511 Design of Database Management Systems

CS511 Design of Database Management Systems Lecture 17: New Model -- Lore/Semistructured Data Kevin C. Chang

Announcements • Midterm (almost) graded • Will release on Wed. • On-campus: at Bethany’s office 2120 SC. • Off-campus: will scan and email • Regrading possible; follow the procedure. • Midterm feedback starting Wed for one week • Anonymous, online at course Web site. • Where we are now: • Focus: Projects, tutorials • One more HW, incrementally released, to keep on track with lectures.

Background & History: Vision Matters • Pre-Web: about 1992, TSIMMIS at Stanford • info. integration over autonomous, heterogeneous sources • OEM: self-describing data model • 1993: Web, HTML, and Mosaic took off • more and more data sources networked and presented • Pre-XML: 1995, Lore started as a semistructured DBMS • hot topic in DB research circles • 1999: Lore became an XML DBMS • hot topic everywhere • XML DBMSs: development and research still ongoing • on top of RDBMS vs. native construction (Lore approach)

Well-Structured Data • Today's DBMS operates on well-structured data • fixed schema (structure) defined in advance • all data conforms to schema • ?? why is schema essential in DBMS? • what do DBMSs need schema for? • what do users need schema for?

Well-Structured Data • DBMS needs schema to: • validate, store, and index data • process queries and updates • Users need schema to: • formulate queries • program applications

Semistructured Data • Much of today's information is semistructured • lacks regular structure, or structure can evolve • data may be incomplete (null values are not a good solution) • DBMS and users don't know the complete structure • Sources of semistructured data • data integration and exchange in heterogeneous environments • personal, pervasive data • XML (eXtensible Markup Language) • ?? Is XML semistructured?

Lore Contributions • Data model for semistructured data • Querying semistructured data effectively • query language, formulating queries, browsing results • Indexing semistructured data • Query processing and optimization • DataGuide - dynamic structural summary • External data manager • First serious implementation of S/X DB

OEM: Lore’s Original Data Model • Object Exchange Model: Simple nested objects • Objects are self-describing via labels • why need labels? where are “labels” in RDBMS? • No fixed, prescribed schema • Conceptually: directed labeled graph • nodes are objects • labeled edges denote object-subobject relationship • atomic values at leaves: integer, real, string, ...
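The directed labeled graph described above can be sketched in a few lines. This is a minimal illustration, not Lore's actual storage format: the class name `OEMObject`, the object ids such as `&1`, and the sample members are all hypothetical, chosen only to show self-describing labels and irregular structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

Atomic = Union[int, float, str]


@dataclass
class OEMObject:
    """One node of the graph: atomic if value is set, complex otherwise."""
    oid: str
    value: Optional[Atomic] = None
    # Outgoing labeled edges: (label, subobject). The label lives on the edge,
    # so the data describes itself without a separate schema.
    edges: List[Tuple[str, "OEMObject"]] = field(default_factory=list)

    def add(self, label: str, child: "OEMObject") -> "OEMObject":
        self.edges.append((label, child))
        return child

    def children(self, label: str) -> List["OEMObject"]:
        return [c for (l, c) in self.edges if l == label]


# Hypothetical sample data: two members with different sets of subobjects.
db = OEMObject("&0")
m1 = db.add("Member", OEMObject("&1"))
m1.add("Name", OEMObject("&2", "Smith"))
m1.add("Age", OEMObject("&3", 46))
m2 = db.add("Member", OEMObject("&4"))
m2.add("Name", OEMObject("&5", "Jones"))
m2.add("Office", OEMObject("&6", "Gates 252"))  # no Age, but has an Office
```

Asking `m2.children("Age")` simply returns an empty list; a missing subobject is not an error and needs no null value.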

OEM: Example DB

?? OEM vs. XML • Different?

OEM vs. XML • OEM is schema-less; XML has DTD • OEM subobjects • mix XML subelements, attributes, and IDREFs • attribute = (sub-)object with atomic value, e.g., <student dept = cs> = <student> …… <dept> cs </dept> …… …… • OEM subobjects are unordered

Lore is Just a Network DBMS? "What is the difference between Lore and the old navigational systems (aside from the addition of a few bells and whistles -- the HTML GUI interface, it looks the same to me)?" • What do you think?

Lore vs. Network/Hierarchical DB • Similar: Data are hierarchical • thus certainly can borrow ideas from network DB • Differences: • how hierarchy is structured • Lore: nested object; values at leaves • Network: connected records; values in records • requirement of schema • ? is Lore also navigational?

Lorel: Lore Language • Basic principles • no errors (hints and warnings OK) • gracefully handles irregular and incomplete data • user need not know full object structure • Extension of OQL • no classes, no strict type-checking • extensive automatic type coercion • atomic values: room (str) > 351 (int) • sets and singletons: everything is treated as a set; comparisons are existential • heterogeneous sets • general path expressions: regular expressions
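The coercion and existential-comparison principles can be illustrated with a small sketch. The function names `to_real`, `coerce_pair`, and `exists_greater` are hypothetical, and the coercion rules here are a simplification assumed for illustration, not Lorel's full rule table.

```python
def to_real(v):
    """Return v as a float, or None if it cannot be coerced."""
    try:
        return float(v)
    except (TypeError, ValueError):
        return None


def coerce_pair(a, b):
    """Bring two atomic values to a common type, or return None."""
    if isinstance(a, str) and isinstance(b, str):
        return a, b  # string vs. string: compare as strings
    ra, rb = to_real(a), to_real(b)
    if ra is None or rb is None:
        return None  # not comparable: no error, the pair just fails to match
    return ra, rb


def exists_greater(values, threshold):
    """Existential comparison: true if SOME value in the set exceeds threshold."""
    for v in values:
        pair = coerce_pair(v, threshold)
        if pair is not None and pair[0] > pair[1]:
            return True
    return False
```

With this sketch, `exists_greater(["352"], 351)` holds because the string is coerced to a real, while `exists_greater(["Gates"], 351)` is quietly false instead of raising a type error.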

Lorel: Query Examples Q1: select DBGroup.Member where DBGroup.Member.Age > 30 Q2: select DBGroup.Member.Project where DBGroup.Member.#.(Office%|Room%) like "%252"
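One way to read the path expression in Q2 is as a regular expression over label paths. The sketch below assumes that `#` stands for any sequence of zero or more labels and that `%` stands for zero or more characters inside a single label; it handles only the small subset of syntax used in Q2, and the helper names are hypothetical.

```python
import re


def component_to_regex(comp):
    """Translate one dot-separated component into a regex fragment."""
    if comp == "#":
        return r"(?:[^.]+\.)*"  # zero or more whole labels
    alternatives = comp.strip("()").split("|")
    parts = []
    for alt in alternatives:
        # '%' becomes "any characters that stay inside this label"
        parts.append("[^.]*".join(re.escape(piece) for piece in alt.split("%")))
    return "(?:" + "|".join(parts) + r")\."


def compile_path(expr):
    return re.compile("".join(component_to_regex(c) for c in expr.split(".")) + r"\Z")


def matches(expr, labels):
    """Check a concrete label path, e.g. ['DBGroup', 'Member', 'Office']."""
    text = "".join(label + "." for label in labels)  # each label ends with '.'
    return compile_path(expr).match(text) is not None
```

For example, `matches("DBGroup.Member.#.(Office%|Room%)", ["DBGroup", "Member", "Address", "RoomNumber"])` succeeds, so the user does not need to know how deeply the office information is nested.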

System Architecture ?? Looks familiar? What are new? What are missing?

Indexing in Traditional DB Systems • Relational systems provide attribute indexes • quickly find all students with GPA > 3.0 • enable finding tuples by values on attributes • Object-oriented systems add path indexes • quickly find all students with Dept.Name = “CS” • enable finding objects by values on paths • These are value indexes

Value Indexes: Vindex • Value indexes: • all atomic objects matching some predicate • restricted by incoming label (the attribute equivalent) • e.g.: select … where …(score|grade) > 95 • Multi-index scheme to accommodate coercion • string, real, string-to-real indexes for each label • comparisons of (string, real) and (real, string) are coerced to (real, real) • use string-to-real index
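The multi-index scheme can be sketched with three sorted lists per label. This is an illustrative simplification (sorted lists with `bisect` rather than a disk-based structure), and the class and method names are hypothetical.

```python
import bisect
from collections import defaultdict

MAX_OID = chr(0x10FFFF)  # sentinel assumed to sort after every real oid


class Vindex:
    def __init__(self):
        self.real = defaultdict(list)         # label -> sorted [(real, oid)]
        self.string = defaultdict(list)       # label -> sorted [(string, oid)]
        self.str_to_real = defaultdict(list)  # label -> sorted [(coerced real, oid)]

    def insert(self, label, oid, value):
        if isinstance(value, str):
            bisect.insort(self.string[label], (value, oid))
            try:
                # Strings that look like numbers are also indexed as reals.
                bisect.insort(self.str_to_real[label], (float(value), oid))
            except ValueError:
                pass  # non-numeric string: string index only
        else:
            bisect.insort(self.real[label], (float(value), oid))

    def greater_than(self, label, threshold):
        """Oids of atomic objects with incoming `label` and value > a real threshold."""
        key = float(threshold)
        hits = []
        # A real-valued comparison consults the real and string-to-real indexes.
        for index in (self.real[label], self.str_to_real[label]):
            start = bisect.bisect_right(index, (key, MAX_OID))
            hits.extend(oid for _, oid in index[start:])
        return sorted(hits)
```

A predicate such as `(score|grade) > 95` would then be answered by one lookup per label, with the results unioned.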

Structural Indexes: Lindex and… • Link indexes: • all L-labeled parents of a given object • serves as back pointers • ? why not forward-label index? • Path indexes: (multi-links forward) • all objects via given path • e.g.: all DBGroup.Project.Publication.Author • provided by DataGuide
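A link index is essentially a map from (child, label) to parents. The sketch below uses an in-memory dictionary and hypothetical names; it is meant only to show the back-pointer role, not Lore's implementation.

```python
from collections import defaultdict


class Lindex:
    def __init__(self):
        # (child oid, label) -> set of parent oids reaching the child via that label
        self._parents = defaultdict(set)

    def add_edge(self, parent, label, child):
        self._parents[(child, label)].add(parent)

    def parents(self, child, label):
        """All L-labeled parents of a given object."""
        return set(self._parents.get((child, label), set()))


# Hypothetical edges: one Age object shared by two members.
lx = Lindex()
lx.add_edge("m1", "Age", "a1")
lx.add_edge("m2", "Age", "a1")
lx.add_edge("g", "Member", "m1")
```

Because an object can have several parents in a graph, the lookup returns a set rather than a single oid.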

Query Processing • Basic goal: • match path exprs in query to paths in data • Query: select DBGroup.Member where DBGroup.Member.Age > 30 • Essential: how to traverse the graph? • Naïve approach: exhaustive top-down traversal

Query Plan: Top-Down Scan-Based • select DBGroup.Member.Office where DBGroup.Member.Age > 30 • Join implemented as “nested-loop” join: for each object returned from the left, find all matches from the right
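The nested-loop reading of the top-down plan can be written out directly. The graph below is hypothetical sample data (the oids, ages, and offices are invented), and `scan` is an assumed helper standing in for a forward scan of an object's subobjects.

```python
# Hypothetical graph: oid -> list of (label, child oid); atomic values kept separately.
edges = {
    "root": [("DBGroup", "g")],
    "g": [("Member", "m1"), ("Member", "m2"), ("Project", "p1")],
    "m1": [("Age", "a1"), ("Office", "o1")],
    "m2": [("Age", "a2"), ("Office", "o2")],
}
values = {"a1": 46, "a2": 28, "o1": "Gates 252", "o2": "Gates 310"}


def scan(oid, label):
    """Forward scan: all subobjects of `oid` reached by an edge labeled `label`."""
    for edge_label, child in edges.get(oid, []):
        if edge_label == label:
            yield child


def top_down():
    """select DBGroup.Member.Office where DBGroup.Member.Age > 30"""
    results = []
    for g in scan("root", "DBGroup"):          # outer loop
        for m in scan(g, "Member"):            # for each object from the left...
            if any(values[a] > 30 for a in scan(m, "Age")):  # existential test
                results.extend(values[o] for o in scan(m, "Office"))
    return results
```

Every member is visited, whether or not its age qualifies, which is the cost the later slides try to avoid.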

?? When Can It Go Wrong? • Give example query that top-down may break

When Can It Go Wrong? • Query: find anyone (staff, faculty, students) in office 123 • select M from DBGroup.# M where M.office = "123" • Similar problem in RDBMS? • ? How does RDBMS avoid similar problem? • How to remedy?

Query Plan: • Key: where to start the traversal, and in which direction? • root is a well-known starting point • but you’d hope to stay focused from there • Bottom-up traversal: • start with values at bottom of graph • traverse backward to match paths • ? why will this be more efficient?

Query Plan: Bottom-Up Index-Based • select DBGroup.Member.Office where DBGroup.Member.Age > 30 • OA Dependencies: • 30, Age --> OA2; OA2, Age --> OA1; OA1, Member --> OA0 (DBGroup)
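The bottom-up plan can be sketched as one value-index lookup followed by link-index steps back toward the root. The sample graph, the oids, and the dictionary-based indexes are all hypothetical; a stray `Age` under a project is included to show why the backward path must still be checked.

```python
import bisect

# Hypothetical graph: oid -> list of (label, child oid); atomic values kept separately.
edges = {
    "root": [("DBGroup", "g")],
    "g": [("Member", "m1"), ("Member", "m2"), ("Project", "p1")],
    "m1": [("Age", "a1"), ("Office", "o1")],
    "m2": [("Age", "a2"), ("Office", "o2")],
    "p1": [("Age", "a3")],  # an Age that is NOT on the path DBGroup.Member.Age
}
values = {"a1": 46, "a2": 28, "a3": 40, "o1": "Gates 252", "o2": "Gates 310"}

# Link index: (child, label) -> list of parents.
lindex = {}
for parent, out in edges.items():
    for label, child in out:
        lindex.setdefault((child, label), []).append(parent)

# Value index for label Age: sorted (value, oid) pairs.
vindex_age = sorted((values[c], c) for (c, label) in lindex if label == "Age")


def bottom_up():
    """select DBGroup.Member.Office where DBGroup.Member.Age > 30"""
    results = []
    # Step 1: value index gives the Age objects with value > 30.
    start = bisect.bisect_right(vindex_age, (30, chr(0x10FFFF)))
    for _, a in vindex_age[start:]:
        # Step 2: link index walks backward: Age -> Member -> DBGroup -> root.
        for m in lindex.get((a, "Age"), []):
            for g in lindex.get((m, "Member"), []):
                if "root" in lindex.get((g, "DBGroup"), []):
                    results.extend(values[o] for l, o in edges[m] if l == "Office")
    return results
```

Only objects that already satisfy the predicate are touched, so the work is proportional to the number of qualifying values, not to the number of members.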

Other Traversal Approaches? • Path indexes • quickly find all objects reachable by a path • e.g.: DBGroup.Member.Age • Hybrid of top-down, bottom-up, path-directed

Query Optimization: Selecting Plans • find all objects via A.B.C = 5 • ?? Which preferred: hybrid, top-down, bottom-up?

Data Guide: Example

DataGuides • Data and schema: chicken and egg • traditional schema: prescribed • DataGuide: derived • DataGuide: • dynamic structural summary of current database • no extraneous paths in DataGuide • maintained incrementally as database evolves • serve role of schema
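A derived structural summary can be computed by walking the data and recording each distinct label path once, together with the objects it reaches. The sketch below assumes acyclic data (with cycles, this path-keyed version would not terminate) and uses a hypothetical sample graph and function name.

```python
def build_dataguide(edges, root):
    """Return {label path (tuple): frozenset of oids reachable by that path}."""
    guide = {(): frozenset([root])}
    stack = [()]
    while stack:
        path = stack.pop()
        by_label = {}
        # Group the children of ALL objects on this path by outgoing label,
        # so each label path appears exactly once in the summary.
        for oid in guide[path]:
            for label, child in edges.get(oid, []):
                by_label.setdefault(label, set()).add(child)
        for label, targets in by_label.items():
            new_path = path + (label,)
            guide[new_path] = frozenset(targets)
            stack.append(new_path)
    return guide


# Hypothetical graph: oid -> list of (label, child oid).
edges = {
    "root": [("DBGroup", "g")],
    "g": [("Member", "m1"), ("Member", "m2"), ("Project", "p1")],
    "m1": [("Age", "a1"), ("Office", "o1")],
    "m2": [("Age", "a2")],
}
guide = build_dataguide(edges, "root")
```

Since the summary is built from the data, it contains no path that the database lacks, and the target set stored for `("DBGroup", "Member", "Age")` is exactly what a path index would return.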

DataGuide Uses • Store statistics for query optimizer • relational statistics are per attribute • Lore: per-path statistics for each path in DataGuide • Path index • all objects reachable by DBGroup.Member.Age • Central to user interface • to browse structure • to see sample atomic values and statistics • to formulate queries by example

Quote and Thought "There has always been a need for saving unstructured data. … But there seems to be only a little scope for the Lore system in other databases except the web, where data is unstructured and constantly changing. It would be unwise to think of a technology such as Lore as a corporate database. Computers are used to organize information, and in fact, most of the information that is stored is structured, due to the need for data integrity, computation and processing, query handling, etc. Analyzing data in an irregular format is a challenge." Good observation! What do you think? • traditionally: correctness, functionalities, performance • trends and challenges: flexibility, ease of use, “pervasive” • And, challenges are opportunities!

More Comments • Delete by removing links • garbage collection to remove unreachable objects • Redundancy, consistency in the model? • Concurrency control: how to lock • what is the extent of an object? • How to handle different labels for the same concept? • OEM: irregular “structure” but not semantics • Tolerance can be dangerous: • typeless, structureless: no validation • unintended irregularity = error?

What’s Next? • NM2: XML overview and query languages

End Of Talk
