New Bases for New Data Omar Benjelloun Stanford University


January 27th, 2006

Omar Benjelloun - New Bases for New Data

Relational databases are great
  • A simple, understandable model for data
  • High-level, declarative language for queries and updates: SQL
  • Efficient optimization techniques
  • Relational databases are the cornerstone of the management of homogeneous, regular, exact, centralized information

… but data has changed
    • Data is distributed, behind applications, dynamically changing
    • Data is heterogeneous
    • Data may be uncertain
  • Today
    • Data is stored in relational databases (or XML)
    • Techniques for data integration, data exchange
    • … Lots of code
  • Traditional Database Management Systems (DBMS’s) are too rigid
  • New characteristics should be represented in the data
  • New bases are needed
    • foundations (models and languages)
    • Processing and optimization techniques

Applications
  • Information integration
    • Data is distributed on multiple heterogeneous, independent sources
    • Conflicting information from the sources: inconsistency, uncertainty
    • Varying and evolving reliability of sources
    • Where data came from can be critical information
  • Scientific data management
  • Receptor (e.g., sensor) data management
  • Data cleaning (entity resolution)
  • And many others…

Agenda
  • Distributed and dynamic data: Active XML
    • A “glue” language to connect data and programs
    • XML documents with embedded calls to Web services
    • Distributed interactions through the exchange of AXML data
    • Techniques to query and control the exchange of AXML data
  • Uncertain data: ULDB’s
    • An extension of the relational model with uncertainty and lineage
    • Efficient query evaluation
    • Computing probabilities
  • Conclusion

Distributed data management: information is everywhere

[Figure: Web services, XML data, data warehouses, databases, Web sites, and devices (PC, PDA, cell phones, home appliances, cars…) connected through the Internet.]

The golden triangle of distributed data management
  • XML: a standard for data representation & exchange
    • Extensible Markup Language
    • Labeled ordered trees
    • Rich types: XML Schema
  • Query languages
    • XPath, XQuery
  • Web services
    • Standards for distributed computing

[Figure: a triangle whose corners are XML, query languages (XQuery, XPath), and Web services (SOAP, WSDL).]

What is Active XML (AXML)?
  • AXML is a declarative language for distributed information management
  • and an infrastructure to support this language, in a peer-to-peer framework.

Active XML documents
  • XML documents with embedded calls to Web services
  • Intensional
    • Some of the data is given explicitly
    • Some is given intensionally (i.e. the means to acquire data when needed are given)
  • Dynamic
    • If the external sources change, the same document will provide different information
    • Reaction to world changes

Not a new idea in databases, nor on the Web
  • Mixing calls to data is an old idea
    • Procedural attributes in relational systems
    • Basis of Object-oriented Databases
  • In Web programming
    • Sun’s JSP, PHP+MySQL
  • Calls to Web services inside documents
    • Macromedia FLEX, Apache Jelly, Microsoft XAML
  • What is new is the exploitation of the idea…

Web services in brief
  • A number of standards
    • XML
    • SOAP: Exchange of messages between applications
    • WSDL: Description of service interfaces (e.g. input/output types)
    • UDDI: Advertisement and discovery of services
    • … other proposed standards (choreography, security, etc.)
  • For us: means to provide, invoke and describe remote functions with XML input/output.
  • They make AXML documents universally understandable.

A sample AXML document

[Figure: the document as a tree. The newspaper root has children title (“Le Monde”), date (“06/10/2003”), a GetTemp call with parameter city (“Paris”), and a GetEvents call with parameter “Exhibits”.]

<?xml version="1.0" ?>
<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <call svc="Yahoo.GetTemp">
    <city>Paris</city>
  </call>
  <call svc="TimeOut.GetEvents">
    exhibits
  </call>
</newspaper>

  • AXML documents may contain calls:
    • to any existing Web services (e-bay.net, google.com…)
    • to any AXML Web services (to be defined)

Materialization

[Figure: the GetTemp call of the sample document is sent to Yahoo (Y!) as a SOAP call; its result, <temp>16°C</temp>, takes the place of the call in the document tree.]

<?xml version="1.0" ?>
<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <call svc="Yahoo.GetTemp">
    <city>Paris</city>
  </call>
  <call svc="TimeOut.GetEvents">
    exhibits
  </call>
</newspaper>

  • Replacing the call by its result is not the only option
  • Calls are not necessarily RPC-style synchronous invocations
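
As an illustration of materialization, here is a minimal Python sketch that replaces calls by their results in the sample document. It is not the AXML peer implementation: the `materialize` function, the `STUBS` registry, and the `get_temp` stub (which stands in for the remote Yahoo.GetTemp service and always answers 16°C) are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

AXML = """<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <call svc="Yahoo.GetTemp"><city>Paris</city></call>
  <call svc="TimeOut.GetEvents">exhibits</call>
</newspaper>"""


def get_temp(call):
    # Hypothetical local stub standing in for the remote Yahoo.GetTemp service.
    temp = ET.Element("temp")
    temp.text = "16°C"
    return [temp]


# Only the services we are willing (or able) to invoke are registered here.
STUBS = {"Yahoo.GetTemp": get_temp}


def materialize(root, services):
    """Replace, in place, each <call> whose service is registered by its result."""
    for parent in list(root.iter()):  # snapshot: results are not visited again
        new_children = []
        for child in list(parent):
            svc = child.get("svc")
            if child.tag == "call" and svc in services:
                new_children.extend(services[svc](child))
            else:
                new_children.append(child)  # data, or a call left intensional
        parent[:] = new_children
    return root


root = materialize(ET.fromstring(AXML), STUBS)
print([child.tag for child in root])  # ['title', 'date', 'temp', 'call']
```

The GetEvents call stays in the document because no stub is registered for it, which mirrors the point that materializing every call is only one of the options.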

AXML Web services
  • Parameters: AXML data
  • Result: AXML data
  • Distribute computations: by sending as parameters data containing service calls, one can delegate some work to other peers.
  • Partial computations: by returning data containing service calls, one can give the receiver control of these calls.

Great flexibility

Distributed interactions


To call or not to call?

[Figure: two copies of the newspaper document tree, each showing the GetTemp call and its result temp (“16°C”, obtained from Y!), next to the GetEvents call.]

  • Materialization can be performed
    • by the sender, before sending a document…
    • or by the receiver, after receiving it.

Why control the materialization of calls?
  • For added functionality, e.g.
    • Intensional data makes it possible to get up-to-date information.
  • For security reasons or capabilities, e.g.
    • I don’t trust this Web service/domain,
    • I don’t have the right credentials to invoke it,
    • It costs money,
    • Maybe the receiver doesn’t know Active XML!
  • For performance reasons, e.g.
    • A proxy can invoke all the services on behalf of a PDA.
  • … and many more reasons you can think of!

How to control it? Using types

[Figure: a sender and a receiver, each with its own capabilities, ACL, cost, ...; a data exchange schema sits between them, over document trees that contain calls f, g, q, r, ...]

  • We extend XML Schema with intensional types: XMLSchema_int
  • Static analysis algorithms use signatures of services: WSDL_int

The extended schema language

[Figure: the sample newspaper document tree, with title (“Le Monde”), date (“06/10/2003”), the GetTemp call on city (“Paris”), and the GetEvents call on “Exhibits”.]

To simplify, we use here a DTD-like syntax

  • Data:
    • newspaper = title.date.(GetTemp|temp).(GetEvents|exhibit*)
    • title = data
    • date = data
    • temp = data
    • city = data
    • exhibit = title.(GetDate|date)
  • Functions:
    • GetTemp(city) -> temp
    • GetEvents(data) -> (exhibit|performance)*
    • GetDate(title) -> date
  • Rewriting: replace call(s) by an arbitrary output of the service.
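
To make the DTD-like syntax concrete, the following Python sketch checks whether a sequence of child labels conforms to the newspaper content model. It is only an illustration under simplifying assumptions: the helper names `type_to_regex` and `matches` are invented here, and the content model is compiled to an ordinary regular expression rather than to an XML Schema validator.

```python
import re


def type_to_regex(expr):
    """Compile a content model such as 'a.(b|c*)' into a regex over 'label ' tokens."""
    # Each label becomes a group that consumes the label and a trailing space;
    # '.' is plain concatenation, so it is dropped afterwards.
    body = re.sub(r"[A-Za-z_]\w*", lambda m: f"(?:{m.group(0)} )", expr)
    return re.compile(body.replace(".", ""))


NEWSPAPER = type_to_regex("title.date.(GetTemp|temp).(GetEvents|exhibit*)")


def matches(children, content_model=NEWSPAPER):
    """True if the list of child labels is a word of the content model."""
    word = "".join(label + " " for label in children)
    return content_model.fullmatch(word) is not None


# The intensional document and one of its materialized forms both conform.
print(matches(["title", "date", "GetTemp", "GetEvents"]))         # True
print(matches(["title", "date", "temp", "exhibit", "exhibit"]))   # True
# GetEvents may also return performances, which this type does not allow.
print(matches(["title", "date", "temp", "performance"]))          # False
```

The last case is the reason rewriting needs care: calling GetEvents can produce a document that no longer matches the schema.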

Rewritings
  • The goal: given
    • an AXML document d,
    • a schema s,
  • can we rewrite d so that it matches s?
  • Safe rewriting: one that is sure to lead to s (we know without making any call)
  • Possible rewriting: one that may lead to s (depending on the answers of services)

Difficulties
  • Infinite search space
    • Vertical
    • Horizontal
  • Main problem
    • The result of a Web service call is unknown
    • We just know a signature (input/output types)
  • We want a very efficient solution
  • Foundations of the problem
    • String & tree automata,
    • with existential and universal transitions.

Results
  • The general problem is undecidable [MSS03]
  • Restrictions on the considered rewritings
    • Left-to-right: no “going back and forth”
    • K-depth: bound on the nesting of function calls
    • (Search space still infinite but finitely representable)
  • Under these restrictions
    • We have algorithms to find safe/possible rewritings.
    • They are PTIME (for deterministic schemas).
    • We can also do it between schemas.
  • Implementation
    • demo at VLDB 2003 (customizable news syndication)

Safe rewriting algorithm (flavor)

  • Build an FSA that accepts all k-depth rewritings of the initial word (title date GetTemp GetEvents).
  • Build an FSA that recognizes the complement of the target type.

[Figure: the two automata, with states q0 to q7 and p0 to p6 and transitions labeled title, date, temp, GetEvents, exhibit, performance, and *.]

Safe rewriting algorithm

[Figure: the product automaton, whose states pair a q-state with a p-state ((q0,p0), (q1,p1), (q2,p2), (q3,p3), (q4,p4), (q5,p2), (q6,p3), (q7,p6), ...) and whose transitions are labeled title, date, GetTemp, temp, GetEvents, exhibit, performance.]

  • Compute the intersection of these languages.
  • A smart marking determines whether a safe rewriting exists.
  • Then run the word on the marked automaton to find an actual rewriting.
  • Optimizations: lazy construction of the automata, parallel evaluation of calls
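
The difference between safe and possible rewritings can be illustrated with a much simplified Python sketch. It is not the complement/product/marking construction of the talk: it handles only depth-1, left-to-right rewritings (results of calls are never rewritten again) and works directly on a hand-written DFA for the target type. The targets `TARGET` and `STRICT`, the `dfa` helper, and the function names are all assumptions made for this example; the output types come from the signatures of the extended schema slide.

```python
from functools import lru_cache

DEAD = None  # sink state: the word can no longer match the target type


def dfa(start, accept, edges):
    return {"start": start, "accept": set(accept), "delta": dict(edges)}


# Hypothetical target types.
TARGET = dfa(0, {3, 4, 5}, {  # title.date.temp.(GetEvents|exhibit*)
    (0, "title"): 1, (1, "date"): 2, (2, "temp"): 3,
    (3, "GetEvents"): 4, (3, "exhibit"): 5, (5, "exhibit"): 5})
STRICT = dfa(0, {3}, {        # title.date.temp.exhibit*
    (0, "title"): 1, (1, "date"): 2, (2, "temp"): 3, (3, "exhibit"): 3})

# Output types: GetTemp -> temp, GetEvents -> (exhibit|performance)*
OUTPUTS = {
    "GetTemp": dfa(0, {1}, {(0, "temp"): 1}),
    "GetEvents": dfa(0, {0}, {(0, "exhibit"): 0, (0, "performance"): 0}),
}


def step(target, state, symbol):
    return DEAD if state is DEAD else target["delta"].get((state, symbol), DEAD)


def states_after_call(target, state, out):
    """Target states reachable from `state` by reading some word of the output type."""
    seen, todo, result = set(), [(state, out["start"])], set()
    while todo:
        pair = todo.pop()
        if pair in seen:
            continue
        seen.add(pair)
        t, o = pair
        if o in out["accept"]:
            result.add(t)
        for (src, symbol), dst in out["delta"].items():
            if src == o:
                todo.append((step(target, t, symbol), dst))
    return result


def rewritable(word, target, outputs, safe=True):
    """safe=True: every answer of a called service must work; safe=False: some answer."""
    quantifier = all if safe else any

    @lru_cache(maxsize=None)
    def win(i, state):
        if state is DEAD:
            return False
        if i == len(word):
            return state in target["accept"]
        symbol = word[i]
        if win(i + 1, step(target, state, symbol)):
            return True  # keep the symbol (data or call) as it is
        if symbol in outputs:  # otherwise try to invoke the call
            after = states_after_call(target, state, outputs[symbol])
            return quantifier(win(i + 1, s) for s in after)
        return False

    return win(0, target["start"])


WORD = ["title", "date", "GetTemp", "GetEvents"]
print(rewritable(WORD, TARGET, OUTPUTS, safe=True))   # True: call GetTemp, keep GetEvents
print(rewritable(WORD, STRICT, OUTPUTS, safe=True))   # False: GetEvents may return a performance
print(rewritable(WORD, STRICT, OUTPUTS, safe=False))  # True: it may also return only exhibits
```

Memoizing on (position, state) keeps the sketch polynomial in the size of the word and of the automata, in the spirit of the PTIME result for deterministic schemas.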

Querying AXML Data

[Figure: the newspaper document tree, with title (“Le Monde”), city (“Paris”), temp (“19°C”), exhibits, and the calls GetTemp, GetEvents (“Exhibits”), GetExhibits (“Paris”), and getDate.]

  • Given a (tree pattern) query:
    • /newspaper[temp > 18°C]/exhibits//exhibit[location=“Le Louvre”]
  • Materialize the document?
  • Call only the services that may contribute data to the query answer.
  • The problem: lazy evaluation of service calls
  • To call or not to call, this time when evaluating a query

Lazy evaluation
  • Difficulties:
    • Calls can be found everywhere in the document
    • May appear dynamically (as a result of previous calls)
    • May become (ir)relevant due to previous invocations
    • Need to take signatures of calls into consideration
  • A possible approach: modify the query processor
    • Top-down evaluation
    • Trigger the calls found on the way
    • Not so great:
      • Computation is blocked
      • Optimization opportunities are lost

NFQ’s

  • Given a query to evaluate:
    • Derive a set of “node-focused” queries (NFQ) that find the relevant calls when evaluated on the document.
    • Need to be reevaluated, as the document evolves!

[Figure: node-focused queries derived from the sample query, drawn as tree patterns over newspaper, temp (> 18°C), exhibits, exhibit, and location (“Le Louvre”), with * nodes; etc.]

Optimizations
  • Service calls sequencing
    • Analysis of the relationship between calls (through the NFQ’s)
    • Layering, and parallelization inside each layer.
  • Filtering by type analysis
    • Match output types of services to the data expected by queries
  • “Pushing” queries to capable services
  • Acceleration:
    • Via relaxation:
      • NFQ approximation
      • Superset of the relevant calls
    • Via a special access structure, similar to a DataGuide:
      • Restricted to paths that lead to service calls
      • Indexes the calls
  • Experimental assessment
    • 10x speed-up when combining optimizations
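
Filtering by type analysis can be sketched in a few lines of Python. This is a deliberately coarse illustration, not the technique evaluated in the talk: it ignores where a call sits in the document (which is what the NFQ's capture) and only compares the labels a service can return with the labels the query mentions. The `SIGNATURES` table restates the signatures of the extended schema slide; the function names and the rule for unknown services are assumptions.

```python
# Labels each service may return, read off its signature.
SIGNATURES = {
    "GetTemp": {"temp"},
    "GetEvents": {"exhibit", "performance"},
    "GetDate": {"date"},
}

# Labels used by /newspaper[temp > 18°C]/exhibits//exhibit[location="Le Louvre"]
QUERY_LABELS = {"newspaper", "temp", "exhibits", "exhibit", "location"}


def may_contribute(service, wanted_labels, signatures=SIGNATURES):
    """True unless the signature proves the service cannot return any wanted label."""
    outputs = signatures.get(service)
    if outputs is None:
        return True  # unknown signature: stay conservative and keep the call
    return bool(outputs & wanted_labels)


def filter_calls(calls, wanted_labels):
    return [call for call in calls if may_contribute(call, wanted_labels)]


print(filter_calls(["GetTemp", "GetEvents", "GetDate", "GetExhibits"], QUERY_LABELS))
# ['GetTemp', 'GetEvents', 'GetExhibits']
```

GetDate is dropped because it can only return a date, which the query never asks for; GetExhibits is kept only because no signature is known for it in this sketch.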

There is more…
  • The AXML peer system
    • Manages persistent AXML documents
    • Provides AXML services
    • Open source
  • Language extensions to control the activation of calls
  • Continuous services
  • Theoretical foundations
  • …check out http://www.activexml.net

Basic Premise
  • Traditional relational DB
    • Every data item’s value must be exact
    • Every data item is in the database or not
    • Where data came from and how it evolves is not important
  • ULDB’s relax these constraints by making data, uncertainty, and lineage all first-class, interrelated concepts

Previous work
  • Models for uncertainty
    • Labeled nulls, c-tables, probabilistic models,...
  • Trade-off between
    • expressiveness
    • Simplicity of representation, complexity of operations
    • We investigated this space in [DBHM06]
  • Models for lineage
    • In relational databases, data warehouses
    • Definition of lineage can be tricky for complex queries
  • First to consider lineage together with uncertainty

Uncertainty

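  • x-tuples: each x-tuple lists one or more alternates; ‘?’ marks a maybe x-tuple [Figure: example table]
  • Possible worlds: [Figure: the possible worlds of the example]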

  • Simple formalism
    • not complete
    • not closed under joins
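
The possible-worlds semantics of x-tuples can be spelled out with a short Python sketch: each world picks exactly one alternate per x-tuple, or nothing at all for a maybe (‘?’) x-tuple. The relation contents and the `possible_worlds` function are hypothetical and serve only as an illustration.

```python
from itertools import product

# Hypothetical Saw(witness, car) relation: a list of (alternates, maybe) x-tuples.
SAW = [
    ([("Amy", "Honda"), ("Amy", "Toyota")], False),  # one of the two, for sure
    ([("Betty", "Acura")], True),                    # '?': may be absent
]


def possible_worlds(xtuples):
    """Yield every ordinary relation represented by the list of x-tuples."""
    choices = []
    for alternates, maybe in xtuples:
        options = list(alternates)
        if maybe:
            options.append(None)  # None stands for "this x-tuple is absent"
        choices.append(options)
    for pick in product(*choices):
        yield [t for t in pick if t is not None]


for world in possible_worlds(SAW):
    print(world)
# [('Amy', 'Honda'), ('Betty', 'Acura')]
# [('Amy', 'Honda')]
# [('Amy', 'Toyota'), ('Betty', 'Acura')]
# [('Amy', 'Toyota')]
```

Because x-tuples choose independently of each other, some sets of worlds cannot be written this way, which is one way to read "not complete".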

Lineage

[Figure: example relations with lineage, including a relation with attributes (witness, suspect).]

ULDB’s

[Figure: example ULDB relations, several of whose x-tuples are marked ‘?’.]

Properties
  • ULDB’s are simple
    • x-tuples: set of alternate tuples, with or without ‘?’
    • lineage: associates with each alternate a set of alternates / external symbols
  • ULDB’s are expressive
    • Complete: can represent any finite set of possible worlds (with lineage)
    • Simple implementation of monotonic queries, with correct lineages
    • Natural probabilistic extension
  • ULDB’s are efficient
    • Query processing can use existing query optimizers
    • Tuple certainty/membership can be tested in polynomial time

Querying ULDB’s

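[Figure: query semantics as a diagram. A ULDB D represents the possible worlds D1, D2, …, Dn, which are relational databases (with lineage). The algorithm computes Q(D) directly on the ULDB, and Q(D) represents Q(D1), Q(D2), …, Q(Dn), where Q(Di) adds the query result as a new relation, with its lineage, to Di.]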

Algorithm

[Figure: worked example of the algorithm on relations containing the values Granny, BMW, Kid, Ford, and Mike, with a result over (witness, suspect); several x-tuples are marked ‘?’.]

Properties
  • Efficient algorithm
    • Query processing phase can use standard query optimizer
    • Lineages are easy to propagate
    • “Grouping” phase requires a single pass on the result
  • Initial prototype
    • represents a ULDB as a relational DB
    • uses simple query rewriting techniques
  • Algorithm works for any monotonic query (including SPJU queries)
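
The two phases named above (ordinary query processing with lineage propagation, then grouping) can be sketched in Python for a single join. This is an assumption-laden illustration rather than the prototype's code: the relations, the attribute names, and in particular the rule of grouping result alternates by the pair of x-tuples they come from are choices made for this example, and the placement of ‘?’ on result x-tuples is left out.

```python
# Hypothetical x-relations: {x-tuple id: (alternates, maybe)}.
SAW = {"s1": ([("Cathy", "Honda"), ("Cathy", "Mazda")], False)}
DRIVES = {
    "d1": ([("Jimmy", "Mazda")], True),
    "d2": ([("Billy", "Honda")], True),
}


def join_on_car(saw, drives):
    """Accuses(witness, suspect): Saw(witness, car) joined with Drives(suspect, car)."""
    # Phase 1: ordinary join over alternates, recording the lineage of each result.
    flat = []
    for sx, (s_alts, _) in saw.items():
        for si, (witness, car) in enumerate(s_alts):
            for dx, (d_alts, _) in drives.items():
                for di, (suspect, d_car) in enumerate(d_alts):
                    if car == d_car:
                        lineage = {(sx, si), (dx, di)}
                        flat.append(((sx, dx), (witness, suspect), lineage))
    # Phase 2 ("grouping"): a single pass that gathers the alternates of each result x-tuple.
    result = {}
    for key, alternate, lineage in flat:
        result.setdefault(key, []).append((alternate, lineage))
    return result


for key, alternates in join_on_car(SAW, DRIVES).items():
    print(key, alternates)
```

Each result alternate points back to the alternates of Saw and Drives it was built from, identified here as (x-tuple id, alternate index) pairs.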

Probabilistic ULDB’s

[Figure: example relations whose alternates carry probabilities (0.3, 0.2, 0.5, 0.3, 0.7).]

  • Semantics: As before, with a probability for each possible world
  • Without lineages
    • Alternates of the same x-tuple correspond to disjoint events
    • Alternates of different x-tuples correspond to independent events
  • Lineages
    • Capture correlations
    • Help propagate probabilities for query results

Probabilistic query answering
  • Compute queries as before
  • Compute probabilities on demand
    • Traverse lineages transitively to the leaves
    • Combine probabilities of reached alternates
  • Optimizations: memoize probabilities, efficiently detect ‘closest independent ancestors’
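
The on-demand computation can be sketched in Python under a strong assumption: lineage is conjunctive, meaning a derived alternate exists exactly when all alternates in its lineage exist (as for join results). The identifiers, the probabilities, and the function names below are hypothetical, and the memoization and "closest independent ancestors" optimizations are not shown.

```python
from math import prod

# Base alternates are (x-tuple id, alternate index) pairs with a probability.
BASE_PROB = {("s1", 0): 0.6, ("s1", 1): 0.4, ("d1", 0): 0.5, ("d2", 0): 0.3}

# Lineage of derived alternates (hypothetical query results).
LINEAGE = {
    ("a1", 0): {("s1", 0), ("d2", 0)},
    ("a2", 0): {("s1", 1), ("d1", 0)},
    ("b1", 0): {("a1", 0), ("a2", 0)},  # built from both results above
}


def base_leaves(alternate, lineage):
    """Follow lineage transitively down to the base alternates."""
    parents = lineage.get(alternate)
    if not parents:
        return {alternate}
    leaves = set()
    for parent in parents:
        leaves |= base_leaves(parent, lineage)
    return leaves


def probability(alternate, lineage, base_prob):
    leaves = base_leaves(alternate, lineage)
    xtuples = [xtuple for xtuple, _ in leaves]
    if len(xtuples) != len(set(xtuples)):
        return 0.0  # two alternates of the same x-tuple are disjoint events
    # Alternates of different x-tuples are independent: multiply.
    return prod(base_prob[leaf] for leaf in leaves)


print(probability(("a1", 0), LINEAGE, BASE_PROB))
print(probability(("b1", 0), LINEAGE, BASE_PROB))  # 0.0
```

The second call shows lineage capturing a correlation: multiplying the probabilities of a1 and a2 would be wrong, because they depend on two different alternates of the same x-tuple s1 and can never hold together.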

[Figure: example lineage graph; alternates carry probabilities (0.2, 0.3, 0.4, 0.1, 0.3, 0.5, 1) and several x-tuples are marked ‘?’.]

Future work
  • Richer queries
    • Duplicate elimination, difference, aggregation
    • Supported through new kinds of lineages (e.g., disjunctive, negative)
    • Querying the uncertainty and the lineage
  • More operations
    • Updates (and their lineage), close to versioning
    • “Uncertain operations”, e.g., entity resolution, inconsistency repairs
  • More optimization techniques
  • More theory

New “Bases” for new data
  • The database way
    • Simple models
    • Declarative languages
    • Optimization techniques
  • … for new features of data
    • Distribution and decentralization: Active XML
    • Uncertainty and lineage: ULDB’s
  • There are more challenges
    • Real-world side effects, semantic reasoning
  • and strong requirements
    • security, privacy, personalization
  • Big challenge: Doing it all in a coherent way
    • One “big” model?
    • Integration of models?
