Representing and Querying XML with Incomplete Information

Serge Abiteboul (INRIA)

Luc Segoufin (INRIA)

Victor Vianu (UCSD)


Organization

  • Motivations

  • Simplifying assumptions

  • Model of incompleteness

  • Answering queries

  • Results

  • Discussion

  • Conclusion

Abiteboul-Segoufin-Vianu



The Web is a world of incompleteness

  • Information you get from the web is seldom complete:

    • Queries return you some - not all - data

    • Limited storage capability

    • Documents change on the Web: expiration

    • Sites are unavailable…

  • Context: A warehouse of XML documents from the Web, Xyleme

This work

  • This work: simple, practically appealing approach to managing incomplete information

  • Sequence of queries to the web

    • (q1,A1)+(q2,A2)+…

    • Answers are cached

  • Process a new query without access to the web

    • Give an incomplete answer

    • Explain incompleteness to user

    • Seek additional information, i.e., find minimal set of queries to fully answer

Related works

  • Semantic caching

  • Answering queries using views

    • keep (Qi,Ai)

    • try to rewrite query Q into Q’(A1,...,An)

    • reject if you cannot

  • Incomplete database

    • (Qi,Ai) is some incomplete knowledge of DB

    • Related to querying incomplete information – e.g. Lipski-Imielinski

Challenge: balance expressiveness and tractability

  • Choice of data model

  • Choice of the query language

  • Choice of a representation of incompleteness

  • Results

    • Simple, practical solution

    • Extra features lead to serious problems

Data is XML: trees

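<dealer>
  <UsedCars>
    <ad>
      <model>Honda</model>
      <year>96</year>
    </ad>
  </UsedCars>
  <NewCars>
    <ad>
      <model>Acura</model>
    </ad>
  </NewCars>
</dealer>

[Figure: the same document drawn as a tree. dealer has children UsedCars and NewCars; the ad under UsedCars has model (Honda) and year (96); the ad under NewCars has model (Acura).]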

Simplified XML

  • Unordered trees

  • Labelling function

  • Value function

[Figure: a catalog tree with two product nodes. One product has name = nik, price = 234, category = electronic, subcategory = camera. The other has name = can, price = 444, cat = electronique, subcategory = camera, picture = c.jpg.]

Simple XML types

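  • 1 : 1 child (default)

  • * : 0 or more

  • + : 1 or more

  • ? : 0 or 1

[Figure: tree type for the catalog. catalog has product children marked *; product has children name, price, cat and picture, with a second * shown at this level; subcategory appears one level below.]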
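The multiplicity legend above can be read as a small conformance check. The following is a minimal sketch, not the authors' formalism: the dictionary encoding, the function name `conforms`, and the choice to put `*` on picture and to hang subcategory under cat are assumptions made for illustration, and data values are ignored.

```python
from collections import Counter

# Hypothetical encoding: element name -> {child name: multiplicity}.
# Which child carries "*" and where subcategory hangs are assumptions.
TREE_TYPE = {
    "catalog": {"product": "*"},
    "product": {"name": "1", "price": "1", "cat": "1", "picture": "*"},
    "cat": {"subcategory": "1"},
}

# (minimum, maximum) number of children; None means unbounded.
BOUNDS = {"1": (1, 1), "*": (0, None), "+": (1, None), "?": (0, 1)}


def conforms(node, tree_type=TREE_TYPE):
    """node is (label, [children]); leaves have an empty child list."""
    label, children = node
    allowed = tree_type.get(label, {})
    counts = Counter(child[0] for child in children)
    # A child label that the type does not mention is rejected.
    if any(name not in allowed for name in counts):
        return False
    for name, mult in allowed.items():
        low, high = BOUNDS[mult]
        n = counts[name]
        if n < low or (high is not None and n > high):
            return False
    return all(conforms(child, tree_type) for child in children)


good = ("catalog", [("product", [("name", []), ("price", []),
                                 ("cat", [("subcategory", [])])])])
bad = ("catalog", [("product", [("name", []),
                                ("cat", [("subcategory", [])])])])
```

Here `good` conforms because picture may occur zero times, while `bad` is rejected because its product has no price child.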

Prefix Selection Queries (ps-queries)

[Figure: two ps-queries drawn as tree patterns rooted at catalog. Query1: product with name, price (<200), cat (=elec) and subcategory. Query2: product with name and picture.]
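One simplified reading of a ps-query is a tree pattern matched from the root, whose answer is the prefix of the data tree that the pattern selects. The sketch below assumes that reading. The tuple encoding `(label, value, children)`, the helper names, and the extra "tv" product are made up for illustration and are not taken from the talk.

```python
def leaf(label, value):
    return (label, value, [])


def match(node, pattern):
    """Return the prefix of `node` selected by `pattern`, or None."""
    label, value, children = node
    p_label, p_cond, p_children = pattern
    if label != p_label or (p_cond is not None and not p_cond(value)):
        return None
    kept = []
    for p_child in p_children:
        hits = [m for m in (match(c, p_child) for c in children)
                if m is not None]
        if not hits:
            return None  # every pattern child needs at least one match
        kept.extend(hits)
    return (label, value, kept)


def product(name, price, sub, pictures=()):
    kids = [leaf("name", name), leaf("price", price),
            ("cat", "elec", [leaf("subcategory", sub)])]
    kids += [leaf("picture", p) for p in pictures]
    return ("product", None, kids)


catalog = ("catalog", None, [
    product("canon", 120, "camera", ["c.jpg"]),
    product("nikon", 199, "camera"),
    product("sony", 175, "cdplayer"),
    product("tv", 500, "television"),  # made-up product, only to show filtering
])

# Query1 as drawn: name, price < 200, cat = elec with its subcategory.
query1 = ("catalog", None, [
    ("product", None, [
        ("name", None, []),
        ("price", lambda v: v < 200, []),
        ("cat", lambda v: v == "elec", [("subcategory", None, [])]),
    ]),
])

answer1 = match(catalog, query1)
names = [child[1] for prod in answer1[2]
         for child in prod[2] if child[0] == "name"]
```

The answer keeps only the nodes the pattern mentions, so the picture of the canon product is not part of the answer to Query1.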

Simplifications

  • Data

    • No order

    • No distinction attribute/element

    • No recursion

    • No links

  • Query

    • No complex path expressions

    • No join

    • No repeated child

[Figure: a pattern on product with children name, cat=elec and cat=toy, marked NO because the child cat is repeated.]


Crucial assumption: XID

  • URLs

  • ID/IDrefs

[Figure: two fragments of the same node, prod &245, are combined (+, =) into a single prod &245. Between them the fragments carry canon, 120, elec, camera and c.jpg; the combined node carries all of these.]
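With persistent node identifiers, answers obtained at different times can be combined by taking the union of their nodes. The sketch below is a minimal illustration under assumed conventions: fragments are dictionaries keyed by node ID, and every ID other than &245 is invented for the example.

```python
def merge(frag_a, frag_b):
    """Union two fragments; nodes with the same ID are the same node."""
    merged = {}
    for frag in (frag_a, frag_b):
        for node_id, node in frag.items():
            if node_id not in merged:
                merged[node_id] = {"label": node["label"],
                                   "value": node.get("value"),
                                   "children": set(node["children"])}
                continue
            target = merged[node_id]
            if target["label"] != node["label"]:
                raise ValueError(f"conflicting labels for {node_id}")
            target["children"] |= set(node["children"])
            if target["value"] is None:
                target["value"] = node.get("value")
    return merged


# Child IDs &1..&4 are hypothetical; only &245 appears on the slide.
first = {
    "&245": {"label": "prod", "children": {"&1", "&2", "&3"}},
    "&1": {"label": "name", "value": "canon", "children": set()},
    "&2": {"label": "price", "value": 120, "children": set()},
    "&3": {"label": "cat", "value": "elec", "children": set()},
}
second = {
    "&245": {"label": "prod", "children": {"&4"}},
    "&4": {"label": "picture", "value": "c.jpg", "children": set()},
}
combined = merge(first, second)
```

Without the shared ID &245 there would be no way to tell that the picture and the price belong to the same product.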

Document Type Definitions (DTDs) are used to represent incompleteness

  • Set of rules: e → r

    • e: element name

    • r: regular expression

  • Set of trees satisfying a DTD d: tree(d)

  • Shortcoming of DTDs

    • An element has a single definition, independent of the context

    • Type of ad depends on the context

[Figure: dealer with children usedcar and newcar; the ad under usedcar has model and year; the ad under newcar has only model.]
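The rules e → r and the shortcoming noted above can be illustrated with a small checker. This is a sketch under assumptions: the DTD below is a guess at one that fits the dealer example, children are checked in document order (a plain DTD reading, whereas the talk's data model is unordered), and `valid` is a hypothetical helper name.

```python
import re

# Assumed DTD for the dealer example: one rule per element name.
DTD = {
    "dealer": "usedcar newcar",
    "usedcar": "ad*",
    "newcar": "ad*",
    "ad": "model year?",  # a single rule has to cover both kinds of ad
}


def valid(node, dtd=DTD):
    """node is (label, [children]); elements without a rule must be leaves."""
    label, children = node
    rule = dtd.get(label, "")
    # Turn each element name into a token "name;" and let re.VERBOSE
    # ignore the spaces that separate names in the rule.
    pattern = re.sub(r"\w+", lambda m: f"(?:{m.group(0)};)", rule)
    word = "".join(child[0] + ";" for child in children)
    if re.fullmatch(pattern, word, flags=re.VERBOSE) is None:
        return False
    return all(valid(child, dtd) for child in children)


used_ad = ("ad", [("model", []), ("year", [])])
new_ad = ("ad", [("model", [])])
doc = ("dealer", [("usedcar", [used_ad]), ("newcar", [new_ad])])
odd = ("dealer", [("usedcar", [used_ad]), ("newcar", [used_ad])])
```

Both `doc` and `odd` are accepted: because ad has one context-independent rule, this DTD cannot express that an ad under newcar has no year.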

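Solution: specialization (decoupled tags)

  • adused and adnew

  • h(adused) = h(adnew) = ad

[Figure: two dealer trees related by the mapping h. In the specialized tree, usedcar has child adused (model, year) and newcar has child adnew (model); h maps both adused and adnew to ad, giving the tree with usedcar/ad (model, year) and newcar/ad (model).]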
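Specialization can be sketched as typing each node with one of several specialized names that h maps back to the visible tag. The following is a minimal sketch, assuming children are checked in document order and trying every combination of child types; `RULES`, `H` and `types_of` are illustrative names, and the rules are a guess at what the figure encodes.

```python
import re
from itertools import product

# h maps specialized types to element names; other types map to themselves.
H = {"adused": "ad", "adnew": "ad"}

# Assumed rules over specialized types.
RULES = {
    "dealer": "usedcar newcar",
    "usedcar": "adused*",
    "newcar": "adnew*",
    "adused": "model year",
    "adnew": "model",
    "model": "",
    "year": "",
}


def types_of(node):
    """Return the set of specialized types that node (label, children) can take."""
    label, children = node
    child_types = [types_of(c) for c in children]
    result = set()
    for t, rule in RULES.items():
        if H.get(t, t) != label:
            continue
        pattern = re.sub(r"\w+", lambda m: f"(?:{m.group(0)};)", rule)
        # Try every way of assigning one possible type to each child.
        for choice in product(*child_types):
            word = "".join(c + ";" for c in choice)
            if re.fullmatch(pattern, word, flags=re.VERBOSE):
                result.add(t)
                break
    return result


used_ad = ("ad", [("model", []), ("year", [])])
new_ad = ("ad", [("model", [])])
ok = ("dealer", [("usedcar", [used_ad]), ("newcar", [new_ad])])
wrong = ("dealer", [("usedcar", [used_ad]), ("newcar", [used_ad])])
```

Unlike a plain DTD, this rejects `wrong`: an ad with a year can only be typed adused, which newcar does not allow.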

DTDs + Specialization

The sets of trees that can be specified: the regular unranked tree languages [Brüggemann-Klein, Murata, Wood]

  • Same closure properties: intersection, union, complement

  • Same complexity

Example

Q1: name, subcat, price of electronic products with price less than $200

Q2: name, pictures of cameras pictured at least once

----------------------------

Q3: name, price, pictures of cameras costing less than $150 and pictured at least once

can be completely answered using A1, A2

Q4: list all cameras

can be partially answered using A1, A2


Q1: name, subcat, price of electronic products with price less than 200

[Figure: the answer to Q1 as a catalog tree with three known product nodes: canon, 120, elec (camera); nikon, 199, elec (camera); sony, 175, elec (cdplayer). The missing products are marked product1 * and product2 *.]

Missing data after Q1

[Figure: types of the data still missing after Q1. product1 *: name, price (>200), cat (=elec), picture, subcategory. product2 *: name, price, cat (!=elec), picture, subcategory.]
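The two missing types follow from negating the selection of Q1: a product that Q1 did not return is either electronic but not cheap enough, or not electronic at all. The sketch below is a hedged illustration with hypothetical function names. The slide labels the price condition >200; the strict complement of <200 is >= 200, which the sketch uses so that the three cases cover every product.

```python
def q1_selects(p):
    """Selection of Q1: electronic products with price less than 200."""
    return p["cat"] == "elec" and p["price"] < 200


def missing_type(p):
    """Which missing type could describe product p, if Q1 did not return it."""
    if q1_selects(p):
        return None        # returned by Q1, so it is known data
    if p["cat"] == "elec":
        return "product1"  # cat = elec, so the price must be >= 200
    return "product2"      # cat != elec, any price
```

For example, `missing_type({"cat": "elec", "price": 250})` gives `"product1"`, while any product with a non-electronic category falls under `"product2"`.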


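Q2: name, pictures of cameras pictured at least once

[Figure: the catalog after Q2. Known products: canon, 120, elec (camera); nikon, 199, elec (camera); sony, 175, elec (cdplayer); and akai, elec (camera) with picture a.jpg; the picture c.jpg is also shown. The missing products are described by the types product1 *, product2 *, product2a, product2b * and product2c.]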

Incomplete information

  • Known information

    • Prefix of the real data tree

  • Missing information

    • Extended tree type

    • Conditions on data values

    • Specializations, disjunctions


[Figure: the incomplete tree after Q1 and Q2, split into known data and missing data. The missing products (product +) are described by the specialized types product1, product2a, product2b, product2c and product3, whose children (name, price, cat, picture, subcategory) carry conditions such as =elec, !=elec, >200, !=camera and "no picture".]



Complete answer to Q3

  • Q3: name, price, pictures of cameras costing less than $150 and having at least one picture

  • Can be fully answered using available information

  • Need to check whether answer is complete

[Figure: the answer tree: catalog, prod, with canon, 120 and c.jpg.]

Incomplete answer to Q4

  • Provide known cameras

  • Explain incompleteness

[Figure: the answer shows the names akai, canon, nikon and sony, and notes that there may be more products, with price>200 and no picture.]

Completing answer to Q4

  • It suffices to ask:

[Figure: a query on product with children name, price (>200), cat (=elec, sub=camera) and picture (marked 0).]

Revisit the types

  • DTD

  • Conditions

  • Specialization: same element name may have several types

  • Not sufficient

  • Need to extend the types again: disjunctions

[Figure: type product2b *: name, price (>200), cat (=elec), picture, with subcategory (!=camera).]

Disjunction

[Figure: two queries, Query1' and Query2', on vehicle nodes with children drawn from engine, sail, data and description. The figure shows the vehicle node &322, "?" marks, an answer marked "Empty!", and the values data="…." and description="….".]

Disjunction continued

  • Type of &322: vehicle1 + vehicle2

[Figure: the types vehicle1 and vehicle2, with children drawn from engine, sail, data and description.]

  • The type of &322 cannot be described independently of that of the data below

Representation System: Lipski's + Imielinski's

[Figure: commutative diagram. A representation of information T denotes a set of possible worlds rep(T). Applying q to T gives a representation of the result, q(T); applying q to the possible worlds gives the set of possible answers, q(rep(T)). The requirement is q(rep(T)) = rep(q(T)).]
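The condition q(rep(T)) = rep(q(T)) can be checked on a toy model. The sketch below does not use the incomplete trees of the talk: it assumes a much simpler representation, a set of known items plus a set of items whose presence is undetermined, and a selection query. All names and the sample prices are illustrative.

```python
from itertools import combinations


def rep(known, maybe):
    """Possible worlds: the known items plus any subset of the undetermined ones."""
    pool = sorted(maybe)
    return {
        frozenset(known) | frozenset(extra)
        for r in range(len(pool) + 1)
        for extra in combinations(pool, r)
    }


def q_world(world, pred):
    """Apply the selection to one complete world."""
    return frozenset(x for x in world if pred(x))


def q_table(known, maybe, pred):
    """Apply the same selection directly to the representation."""
    return {x for x in known if pred(x)}, {x for x in maybe if pred(x)}


known, maybe = {120, 175}, {199, 250, 300}
cheap = lambda price: price < 200

# Left side of the diagram: query every possible world.
answers_per_world = {q_world(w, cheap) for w in rep(known, maybe)}
# Right side: query the representation, then expand it.
answers_from_table = rep(*q_table(known, maybe, cheap))
```

In this toy model the two sides coincide, which is what makes the pair (representation, query language) a representation system: the query can be evaluated on the compact representation without enumerating worlds.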

Representation System for PS-queries

  • Incomplete tree T to represent

    $q_1^{-1}(A_1) \cap \dots \cap q_k^{-1}(A_k)$

  • PS-query q

  • q(T) can be computed in ptime

    (representation of the answer can be computed in ptime)

Querying Incomplete Trees

  • Given T and a query q, one can

    • Give in ptime the sure answers up to our current knowledge

    • Check in ptime whether query q can be fully answered

    • Generate in ptime queries to complete answer
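The second and third tasks can be illustrated on the running example. The sketch below is not the authors' algorithm: it flattens products to records, assumes cameras are electronic products with subcategory camera, writes the missing types after Q1 and Q2 by hand (using >= 200 as the complement of < 200), and tests compatibility by brute force over a small, hypothetical grid of sample values.

```python
from itertools import product

# Hypothetical sample values on both sides of every threshold used below.
GRID = [
    {"cat": c, "sub": s, "price": p, "pictured": pic}
    for c, s, p, pic in product(["elec", "toy"], ["camera", "cdplayer"],
                                [100, 150, 199, 200, 300], [True, False])
]

# Products that neither Q1 nor Q2 returned, split into three missing types.
MISSING = {
    "not elec": lambda x: x["cat"] != "elec",
    "elec, price >= 200, not camera":
        lambda x: (x["cat"] == "elec" and x["price"] >= 200
                   and x["sub"] != "camera"),
    "elec camera, price >= 200, no picture":
        lambda x: (x["cat"] == "elec" and x["price"] >= 200
                   and x["sub"] == "camera" and not x["pictured"]),
}


def open_cases(query):
    """Missing types that could still hide a product the query selects."""
    return [name for name, cond in MISSING.items()
            if any(cond(x) and query(x) for x in GRID)]


# Q3: cameras costing less than 150 with at least one picture.
q3 = lambda x: (x["cat"] == "elec" and x["sub"] == "camera"
                and x["price"] < 150 and x["pictured"])
# Q4: all cameras.
q4 = lambda x: x["cat"] == "elec" and x["sub"] == "camera"
```

Under these assumptions `open_cases(q3)` is empty, so Q3 can be answered fully from the known data, while `open_cases(q4)` names the one missing type (electronic cameras with price >= 200 and no picture) that an extra query would have to fetch.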

Comparison with IL

  • IL (Imielinski-Lipski)

    • Relational model

    • Relational calculus/algebra

    • Conditional table

    • Closed or open world

    • Representation system

  • This work

    • XML tree model

    • Weaker language (no join)

    • Weaker system (no variable)

    • + Closed and open world

    • Representation system

Drawback: exponential blowup

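  • Incomplete information may become exponential in size w.r.t. the sequence of query/answer pairs q1/A1; q2/A2; …

[Figure: queries qi on a database node with children a and b, using the conditions a=i and b=i, shown next to the resulting type (multiplicities 1). The answers are empty.]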

Dealing with exponential blowup

  • Make the representation more complex using disjunctions of types

    • Size of representation stays polynomial

    • Manipulations much more complex

  • Restrict tree types and PS-queries

    • Already very (too?) simple

  • Accept losing some information

  • Ask extra queries to simplify representation

Discussion: extend language

  • Some results in paper

  • Extensions often lead to intractability

  • E.g., k-pebble transducers [Milo, Suciu, Vianu], which somehow subsume XML-QL and XSL

    • No (known) representation system

    • Testing whether rep(T) is empty is non-elementary

Discussion: node IDs

Without node IDs

  • much less information to integrate results

  • more complex

  • tedious case analysis

Discussion: ordering

  • Ordering in XML, DTD, queries

  • Problem is totally different and very complex

  • Example:

    • Q1/A1: list of males; Q2/A2: list of females; Q3: list all

  • Depending on the type of input

    • (Male)*(Female)* A3= A1 || A2

    • (Male Female)* A3= shuffle(A1,A2)

    • (Male + Female)* we cannot answer A3

  • Regular expression processing
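The three cases above can be sketched as a function of the input type. This is a minimal illustration, assuming that A1 and A2 each preserve document order and reading shuffle as strict alternation for (Male Female)*; the function name and the string keys are hypothetical.

```python
def answer_all(males, females, input_type):
    """Combine A1 (males) and A2 (females) into A3, if the type allows it."""
    if input_type == "(Male)*(Female)*":
        # All males precede all females: plain concatenation.
        return males + females
    if input_type == "(Male Female)*":
        # Strict alternation: interleave the two answers pairwise.
        if len(males) != len(females):
            raise ValueError("answers do not fit the type (Male Female)*")
        return [x for pair in zip(males, females) for x in pair]
    if input_type == "(Male + Female)*":
        # The relative order of males and females is not determined.
        return None
    raise ValueError(f"unknown input type: {input_type}")
```

For instance, `answer_all(["m1", "m2"], ["f1", "f2"], "(Male Female)*")` yields `["m1", "f1", "m2", "f2"]`, whereas under `(Male + Female)*` the same two answers leave the order of A3 undetermined.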

Conclusion

  • Framework for acquiring, maintaining, querying incomplete XML data

  • Limitations:

    • simple queries

    • no order, and the ID assumption

    • small extensions lead to problems

  • Possible to represent the incompleteness

  • Possible to answer with incompleteness

  • Possible to obtain queries to provide full answer
