Introduction - PowerPoint PPT Presentation


Introduction. Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems September 4, 2008. Welcome to CIS 650, Database and Information Systems!. Instructor: Zachary Ives, zives@cis 576 Levine Hall North Office hours: Wednesdays, 2PM


Presentation Transcript



Introduction

Zachary G. Ives

University of Pennsylvania

CIS 650 – Implementing Data Management Systems

September 4, 2008



Welcome to CIS 650, Database and Information Systems!

Instructor: Zachary Ives, zives@cis

  • 576 Levine Hall North

  • Office hours: Wednesdays, 2PM

Home page: www.cis.upenn.edu/~zives/cis650/

Discussion group: cis650-fall08@googlegroups.com

Texts and readings:

  • Hellerstein & Stonebraker: Readings in Database Systems, 4th ed.

    • Most papers will be linked via the Web, but it’s often nice to have the book

  • Supplementary papers (will be linked via schedule)



Course Format and Prerequisites

  • Read “classic” and important research papers

    • Lectures will be very discussion-oriented; about one topic area per week or two

  • Gain experience building some sort of data management “engine” and experimentally validating it – this is a systems course

  • At end: you should be equipped to do research in this field, or apply ideas from data management to your field

  • Prerequisites:

    • Strong undergraduate DB course, or CIS 550

      • SQL, data modeling, basics of query optimization and execution, ACID, …

    • Strong Java coding abilities



Grading

Summaries/commentary on papers (20%)

“Midterm report” (25%)

  • Take one of the topics we’ve discussed and write a summary and synthesis paper

  • Graded for organization, clarity, grammar, etc. as well as content

Project (50%), team or individual:

  • One focus: a SIGMOD demo to build a “smart research environment” – instrumented machines, labs, building (more shortly)

  • Implementation

  • Experimentation / validation

  • Project report (should be in the style of a research paper)

  • Brief (~15-minute) presentation for each group / project

Participation, discussion, intangibles (5%)



Potential Projects

  • “Smart CIS”: integrate data from sensors, machines, power monitoring, calendars, etc. to build a queryable building, labs, and machines

    • Goal: demo at SIGMOD, possibly some research papers!

  • Cloud computing: adapt a query processor to run on Hadoop

  • Sensors: build a real app with Crossbow motes

  • Data visualizer: help understand and manipulate data

  • Transformation reverse engineering: create data instances to determine what a Perl script or other tool is doing when converting from one format to another



So What Is This Course About?

Not how to build an Oracle-driven Web site…

… nor even how to build Oracle…



What Is Unique about Data Management?

  • It’s been said that databases and data management focus on scalability to huge volumes of data

  • What is it that makes this possible – and what makes the work interesting if NOT at huge scale?



The Key Principle: Data Independence

  • Most methods of programming don’t separate the logical and physical representations of data

    • The data structures, access methods, etc. are all given via interfaces!

  • The relational data model was the first model for data that is independent of its data structures and implementation



What Is Data Independence?

  • Codd points out that previous methods had:

    • Order dependence

    • Index dependence

    • Access path dependence

  • Still true in today’s Java/C#: what is the drawback?

  • What might you be able to do in removing those?
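One way to make the drawback concrete is the small sketch below. It is only an illustration: the data and the names `by_dept`, `employees`, and `names_in_dept` are invented here and do not come from the slides or from Codd's paper. The first function depends on nesting and list order; the second states a predicate over a set of tuples, so an index can be added later without changing any caller.

```python
# Hypothetical data: the same facts stored two ways.

# (a) Access-path and order dependent: callers must know the nesting
#     (dept -> list of names) and rely on the position within the list.
by_dept = {"cis": ["ives", "codd"], "math": ["noether"]}

def first_in_dept_path(dept):
    # Breaks if the structure is re-nested (e.g., by name) or the list is re-ordered.
    return by_dept[dept][0]

# (b) A relation: a set of (name, dept) tuples. Callers state a predicate, not a path.
employees = {("ives", "cis"), ("codd", "cis"), ("noether", "math")}

def names_in_dept(dept, relation=employees):
    # Depends only on the logical schema (name, dept), not on order or indexes.
    return {name for (name, d) in relation if d == dept}

# An index can be added afterwards purely as an optimization.
dept_index = {}
for name, d in employees:
    dept_index.setdefault(d, set()).add(name)

def names_in_dept_indexed(dept):
    # Same answer as names_in_dept, computed through the auxiliary structure.
    return dept_index.get(dept, set())
```

Because callers of `names_in_dept` never mention the storage layout, the system is free to swap in the indexed version, which is the kind of freedom the last bullet asks about.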



The Relational Data Model

More than just tables!

  • True relations: sets of tuples

  • The only data representation a user/programmer “sees”

  • Explicit encoding of everything in values

General and universal means of encoding everything!

  • Connections are explicitly represented as values

  • All semantics are pushed to queries

Additional integrity constraints

  • Key constraints, functional dependencies, …

A secondary concept: views

  • Define derived relations that are always “live”

  • A way of encapsulating, abstracting, protecting, integrating data
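The idea of a view as a derived relation that is always “live” can be sketched with Python's built-in `sqlite3` module. The schema, the sample rows, and the helper `enrollment()` are assumptions made up for this illustration; they are not part of the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE takes (sid INTEGER REFERENCES student(sid), cid TEXT);
    -- A derived relation: evaluated against the base tables each time it is queried.
    CREATE VIEW enrollment AS
        SELECT s.name, t.cid FROM student s JOIN takes t ON s.sid = t.sid;
    INSERT INTO student VALUES (1, 'Ada'), (2, 'Edgar');
    INSERT INTO takes VALUES (1, 'CIS650');
""")

def enrollment():
    # Query the view exactly as if it were a stored table.
    return sorted(conn.execute("SELECT name, cid FROM enrollment").fetchall())

before = enrollment()
conn.execute("INSERT INTO takes VALUES (2, 'CIS650')")  # update a base table only
after = enrollment()                                     # the view reflects the change
```

Here `before` holds one row and `after` holds two, although the view itself was never updated. Connections between students and courses are carried purely by the `sid` values, as the slide describes.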



Constraints and Normalization

  • Fundamental idea: we don’t want to build semantics into the data model, but we want to be able to encode certain constraints

    • Functional dependencies, key constraints, foreign-key constraints, multivalued dependencies, join dependencies, etc.

    • Allows limited data validation, plus opportunities for optimization

  • The theory of normalization (see CSE 330, CIS 550) makes use of known constraints

    • Idea: eliminate redundancy, in order to maintain consistency in the presence of updates

    • (Note that there’s no reason for normalization of data in views!)

      • Ergo, XML???
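A minimal sketch of why redundancy threatens consistency under updates: the function below checks whether a functional dependency holds in a small table. The helper `holds_fd` and the sample rows are hypothetical, chosen only to illustrate the point.

```python
def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    rows is a list of dicts; lhs and rhs are lists of attribute names.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # The first value seen for a key is remembered; any later mismatch violates the FD.
        if seen.setdefault(key, val) != val:
            return False
    return True

# Unnormalized table: the course title is repeated for every enrolled student.
takes = [
    {"sid": 1, "cid": "CIS650", "title": "Implementing Data Management Systems"},
    {"sid": 2, "cid": "CIS650", "title": "Implementing Data Management Systems"},
]

ok_before = holds_fd(takes, ["cid"], ["title"])
takes[0]["title"] = "Data Management"            # update only one redundant copy
ok_after = holds_fd(takes, ["cid"], ["title"])   # cid -> title is now violated
```

Storing `title` once per `cid` in a separate relation would make this update anomaly impossible, which is the motivation for normalizing with respect to known constraints.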



Relational Completeness (Plus Extensions): Declarativity

What is special about relational query languages that makes them amenable to scalability?

  • Limited expressiveness – particularly when we consider conjunctive queries (even with recursion)

    • Guaranteed polytime execution in size of data

    • Can reason about containment, invert them, etc.

  • Equivalence between relational calculus and algebra

    • Calculus  fully declarative, basis of query languages

    • Algebra  imperative but polytime, basis of runtime systems

  • Predictability of operations (in bulk)  cost models

  • Ability to supplement data with auxiliary structures for performance

  • Interfaces to other “external” languages
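The equivalence between calculus and algebra can be illustrated on a toy example. The relations and the operator helpers `select`, `project`, and `join` below are invented for this sketch and are deliberately naive; a real engine would choose among many physical implementations of each operator.

```python
student = {(1, "Ada"), (2, "Edgar")}      # (sid, name)
takes = {(1, "CIS650"), (2, "CIS550")}    # (sid, cid)

# Calculus style: describe which tuples qualify, with no evaluation order implied.
calculus = {
    name
    for (sid, name) in student
    for (sid2, cid) in takes
    if sid == sid2 and cid == "CIS650"
}

# Algebra style: compose operators into an explicit plan.
def select(rel, pred):
    return {t for t in rel if pred(t)}

def project(rel, cols):
    return {tuple(t[c] for c in cols) for t in rel}

def join(r, s, r_col, s_col):
    # Nested-loop equijoin; concatenates matching tuples.
    return {a + b for a in r for b in s if a[r_col] == b[s_col]}

plan = project(
    join(student, select(takes, lambda t: t[1] == "CIS650"), 0, 0),
    [1],
)
algebra = {t[0] for t in plan}
```

Both formulations return the same set of names. The algebraic form makes choices visible (here, selecting before joining), and those choices are what an optimizer can rearrange and cost.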



Concurrency and Consistency

  • Traditionally, DB efforts provide “ACID” properties:

    • Atomicity, Consistency, Isolation, Durability

    • Transaction: an atomic sequence of database actions (read/write) on data items (e.g., a calendar entry)

    • Recoverability via a log: keeping track of all actions carried out by the database

    • But there’s a cost to all of this!

  • How do distributed systems, Web services, service-oriented architectures, and the like affect these properties?

  • We’ll consider one “relaxation” of these properties – the MapReduce / BigTable style of computing
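Atomicity can be sketched with Python's built-in `sqlite3` module. The `account` table, the `transfer` helper, and the balances are assumptions made for illustration only; the point is that a failed transaction leaves no partial effects behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

def transfer(src, dst, amount):
    """Move amount from src to dst as one atomic unit; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed; the earlier credit has been rolled back too.
        return False

def balances():
    return dict(conn.execute("SELECT id, balance FROM account"))

overdraft_ok = transfer("a", "b", 150)   # would drive "a" negative
after_failure = balances()
normal_ok = transfer("a", "b", 40)
after_success = balances()
```

The failed transfer credits `b` before the debit is rejected, yet `after_failure` shows the original balances. That rollback is the cost-bearing machinery (logging, recovery) that the slide refers to.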



Other Data Models

  • Concepts from the relational data model have been adapted to form object-oriented data models (with classes and subclasses), XML models, etc.

    • Doesn’t this result in some loss of logical-physical independence?



What Is a Data Management System?

  • Of course, there are traditional databases

    • The focus of most work in the past 25 years

    • “Tight loops” due to locally controlled data

    • Indexing, transactions, concurrency, recovery, optimization

  • Also, today there are DB-like components in:

    • Your email client and server (transactional storage)

    • Enterprise Java Beans (distributed transactions)

    • Google Base, BigTable, … (distributed indexing, storage)

  • But…



80% of the World’s Data is Not in Databases!

Examples:

  • Scientific data (large images, complex programs that analyze the data)

  • Personal data

  • WWW and email

  • Network traffic logs

  • Sensor data, network router data, stream data, …

  • Are there benefits to declarative techniques and data independence in tackling these issues?

    • Need to deal with data we don’t control and can’t guarantee consistency over

    • In recent years: increasing connection between databases, data integration, information retrieval, information extraction, sensors…



    Some Questions We’ll Consider

    • What are the “right” architectures for data sharing? How do they change as consistency needs (or other requirements) change?

    • How much can we abstract away heterogeneity, physical properties, etc.?

    • How do we get good performance from declarative queries?



    Some Classes of Systems We’ll Consider

    • Databases

      • How do we optimize and execute queries or ensure ACID?

    • Data integration

      • How do we handle heterogeneity in data and meaning?

    • Data streams and sensor data

      • How do we process infinite amounts of data?

    • Cloud computing, Web search

      • How do we partition computation across 1000s of machines and achieve reliable execution?

    • Peer-to-peer architectures

      • What’s the best way of finding data?
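For the data-stream question above, one common answer is to compute over a bounded window rather than the whole input. The sketch below is an illustration under assumed names: `sensor_readings` is a made-up unbounded source and `sliding_average` is a hypothetical operator, not something taken from the course.

```python
from collections import deque
from itertools import count, islice

def sensor_readings():
    # Hypothetical unbounded source: never terminates on its own.
    for i in count():
        yield 20.0 + (i % 5)

def sliding_average(stream, window):
    """Yield the mean of the most recent `window` readings, one result per input."""
    buf = deque()
    total = 0.0
    for x in stream:
        buf.append(x)
        total += x
        if len(buf) > window:
            total -= buf.popleft()   # evict the oldest reading
        yield total / len(buf)

# Only a bounded prefix of the infinite result is ever materialized.
first_five = list(islice(sliding_average(sensor_readings(), 3), 5))
```

State is bounded by the window size regardless of how long the stream runs, which is what makes a query over an infinite input answerable incrementally.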



    Our Agenda this Semester

    • Reading the canonical papers in the data management literature, starting with databases and later going to other data management systems

      • Some are very systems-y

      • Some are very experimental

      • Some are highly algorithmic, complexity-oriented

    • Gaining an understanding of the principles of building systems to handle declarative queries over large volumes of data



    Recap: Query Answering in a Data Management System

    • Based on declarative query languages

      • Based on restricted first-order logic expressions over relations

      • Not procedural – defines constraints on the output

    • The query optimizer converts the query into a plan that exploits properties; the query execution engine runs the plan over the data

      • Data may be local or remote

      • Data may be heterogeneous or homogeneous

      • Data sources may have different interfaces, access methods, etc.

    • Most common query languages:

      • SQL (based on tuple relational calculus)

      • Datalog (based on domain relational calculus, plus fixpoint)

      • XQuery (functional language; has an XML calculus core)
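The “plus fixpoint” in Datalog can be sketched with naive bottom-up evaluation of a recursive rule. The function name `transitive_closure` and the sample edges are assumptions for illustration; real Datalog engines typically use semi-naive evaluation rather than this simple loop.

```python
def transitive_closure(edges):
    """Naive evaluation of the Datalog program:

        path(x, y) :- edge(x, y).
        path(x, z) :- path(x, y), edge(y, z).
    """
    path = set(edges)
    while True:
        # Apply the recursive rule once to everything derived so far.
        derived = {(x, z) for (x, y) in path for (y2, z) in edges if y == y2}
        if derived <= path:      # nothing new: the fixpoint has been reached
            return path
        path |= derived

edges = {("a", "b"), ("b", "c"), ("c", "d")}
paths = transitive_closure(edges)
```

Each iteration can only add tuples drawn from a finite domain, so the loop terminates. This is one reason recursion can be added while keeping polynomial-time evaluation in the size of the data.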



    Recap: Layers of a Typical Data Management System

    [Figure: layered architecture of a typical data management system (a simplification). Components shown: API/GUI, Optimizer (with Stats), Exec. Engine, Logging/recovery, Catalog (Schemas), Access Methods, Buffer Mgr, Physical retrieval, Source. Labels between layers: Query, Physical plan, Data/etc., Requests, Pages, Data. Color key: red = logical, blue = physical.]


    Processing the Query

    [Figure: components shown are the Web Server / UI / etc., the Optimizer, the Execution Engine, and the Storage Subsystem. The query plan shown uses Hash and Merge join operators over STUDENT, Takes, and COURSE, with two inputs labeled “by cid”.]

    SELECT * FROM STUDENT, Takes, COURSE

    WHERE STUDENT.sid = Takes.sid

    AND Takes.cid = COURSE.cid
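The Hash and Merge operators named on this slide can be sketched as follows. This is a simplified in-memory illustration: the sample rows, the column positions, and the particular join order (merge join on cid, then hash join on sid) are assumptions, not a reconstruction of the plan in the original figure.

```python
from collections import defaultdict

student = [(1, "Ada"), (2, "Edgar")]                                   # (sid, name)
takes = [(1, "CIS650"), (2, "CIS550"), (1, "CIS550")]                  # (sid, cid)
course = [("CIS550", "Databases"), ("CIS650", "Data Mgmt Systems")]    # (cid, title)

def hash_join(build, probe, build_key, probe_key):
    # Build a hash table on one input, then probe it with the other.
    table = defaultdict(list)
    for b in build:
        table[b[build_key]].append(b)
    return [b + p for p in probe for b in table.get(p[probe_key], [])]

def merge_join(left, right, left_key, right_key):
    # Sort both inputs on the join key, then advance two cursors in step.
    left = sorted(left, key=lambda t: t[left_key])
    right = sorted(right, key=lambda t: t[right_key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right tuples sharing this key, pair it with each left match.
            j_end = j
            while j_end < len(right) and right[j_end][right_key] == lk:
                j_end += 1
            while i < len(left) and left[i][left_key] == lk:
                out.extend(left[i] + r for r in right[j:j_end])
                i += 1
            j = j_end
    return out

takes_course = merge_join(takes, course, 1, 0)     # (sid, cid, cid, title)
result = hash_join(student, takes_course, 0, 0)    # (sid, name, sid, cid, cid, title)
```

Either operator computes the same logical join; which one is cheaper depends on input sizes, existing sort orders, and available memory, which is the decision the optimizer makes before handing a plan to the execution engine.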



    DBMSs in the Real World

    • Big, mature relational databases

      • IBM, Oracle, Microsoft

    • “Middleware” above these

      • SAP, PeopleSoft, dozens of special-purpose apps

    • “Application servers”

    • Integration and warehousing systems

      • DB2 Integrator, Oracle Fusion, …

    • Current trends:

      • Web services; XML everywhere

      • Smarter, self-tuning systems (AutoAdmin, …)

      • Stream systems (Vertica, Microsoft, IBM)



    For Next Time

    • Skim Codd if you haven’t already

    • Read the overview papers of the first two database systems:

      • Astrahan et al., pp. 117-

      • Wong et al. (skip Section 2; focus on pp. 200-)

    • Write a summary of your assigned paper and post to the Google Group: cis650-fall08@googlegroups.com

      • Key question: how well did this system mesh with Codd’s relational model? (You may need to skim through other aspects of your assigned paper to help answer that question)

