Beyond sql structured data retrieval by ranking
Download
1 / 52

Beyond SQL: Structured Data Retrieval by Ranking - PowerPoint PPT Presentation


Beyond SQL: Structured Data Retrieval by Ranking. Chengkai Li October 25, 2006 CS511 guest lecture. About Myself. Advisor: Kevin Chang (UIUC) Research Interests Structured Data Retrieval and Ranking Query Processing Web Information Integration XML Query Processing Misc.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha

Download Presentation

Beyond SQL: Structured Data Retrieval by Ranking

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Beyond SQL: Structured Data Retrieval by Ranking

Chengkai Li

October 25, 2006

CS511 guest lecture


About Myself

  • Advisor: Kevin Chang (UIUC)

  • Research Interests

    • Structured Data Retrieval and Ranking Query Processing

    • Web Information Integration

    • XML Query Processing

  • Misc.

    • http://www.ews.uiuc.edu/~cli

    • cli [at] uiuc [dot] edu


Boolean Semantics of SQL: Success and Barrier

  • Example:

    SELECT *

    FROM Homes H

    WHERE 400K<price<600K AND #bedroom = 4

  • Problems:

    • Query Semantics:

      • True or False, nothing in the middle

      • What do we want?: similarity, relevance, preference, etc.

    • Query Results Presentation:

      • Too many answers (“information overload” problem): hard to digest and analyze; or too few answers: no enough options to consider

      • What do we want?: a ranked list


“Google” on Databases:From Boolean Query to Fuzzy Retrieval


Fuzzy Retrieval by Ranking

  • Ranking (Top-k) Queries

    • Top k results by a ranking function with multiple criteria

    • SELECT*

      FROMHomes H

      Order By0.3*big(H.size)+0.7*cheap(H.price)

      Limit5


Beyond Boolean Semantics: Ranking

Useful and important in many real-world database applications:

  • E-Commerce

    Find the best hotel deals by price, distance, etc.

  • Multimedia Databases

    Find the most similar images by color, shape, texture, etc.

  • Text Retrieval and Document Management

    Find the most relevant records/documents/pages.

  • OLAP, Decision Support

    Find the top profitable customers.


Application 1: E-commerce

  • The home search example

  • More details later


Application 2: Decision Support

  • What are the top 5 areas to advertise a new insurance product?

    SELECT zipcode,

    AVG(income*w1+age*w2+credit*w3) as score

    FROM customer

    WHERE occupation=‘student’

    GROUP BY zipcode

    ORDER BY score

    LIMIT 5


Application 3: Multimedia Databases

  • Find the most similar images by color, shape, texture, etc.

  • Online demo: http://amazon.ece.utexas.edu/~qasim/research.htm


Application 4: Keyword Queries

  • Text data coexist with structured data

  • IR-style database queries

[slides courtesy of Vagelis Hristidis]


Example – Complaints Database Schema


Example - Complaints Database Data

Complaints

Customers

Products


Example – Keyword Query [Maxtor Netvista]

Complaints

Customers

Products


Keyword Query Semantics (definition of “document” in databases)

  • Keywords are:

  • in same tuple

  • in same relation

  • in tuples connected through primary-foreign key relationships

  • Score of result:

  • distance of keywords within a tuple

  • distance between keywords in terms of primary-foreign key connections

  • IR-style score of result tree


Example – Keyword Query [Maxtor Netvista]

Complaints

Customers

Products

Results: (1) c3, (2) p2c3, (3) p1c1


Great Interest in Database Community

  • Top-k Queries

    • Top-k to range query transformation: [ChaudhuriG99]

    • Indexing: [ChangBC+00], [TsaparasPK+03]

    • View: [HristidisKP01], [YiYY+03], [DasGK+06]

    • Query operators and algorithms:

      [CareyK97], [NatsevCS+01], [IlyasAE03], [IlyasSA+04], [ZhangHC+06]

    • Ranking algebra: [ChaudhuriRW05] [LiCI+05]

    • Ranking aggregate queries: [LiCI06]

  • Other Types of Queries

    • Keyword queries: [BhalotiaHN+02], [AgrawalCD02], [HristidisP02]

    • IR-style TFIDF ranking: [AgrawalCD+03]

    • Ranking of attributes: [DasHK+06]

    • Preference queries: [AgrawalW00], [Kießling02]

    • Skyline queries: [BorzsonyiSK01]

  • DBRank07


Theme Life Cycles: Bursting Topics(Current Bursting)

[slide courtesy of Chengixang Zhai]


RankSQL: a RDBMS with Efficient Support of Ranking Queries

  • Algebraic Foundation and Optimization Framework

    SPJ queries (SELECT … FROM … WHERE … ORDER BY …)

  • Ad-Hoc Ranking Aggregate Queries

    top k groups instead of tuples

    (SELECT … FROM … WHERE … GROUP BY… ORDER BY …)


Example: Trip Planning

Suggest a hotel to stay and a museum to visit:

Select *

From

Hotel h, Museum m

Where

h.star=3 AND

h.area=m.area

Order By

cheap(h.price) +

close(h.addr, “BWI airport”) +

related(m.collection,“dinosaur”)

Limit 5

membership dimension:

Boolean predicates,

Boolean function

B

order dimension:

ranking predicates,

scoring function

R


5 results

R

sort cheap+close+related

B

h.area=m.area

σh.star=3

scan(m)

scan(h)

Processing Ranking Queries in Traditional RDBMS

Select *

From

Hotel h, Museum m

Where

B

Order By

R

Limit 5


5 results

R

B

Problems of Traditional Approach

  • NaïveMaterialize-then-Sort scheme

  • Overkill

    total order of all results;

    only 5 top results are requested.

  • Very inefficient

    • Scan large base tables

    • Join large intermediate results

    • Evaluate every ranking on every tuple

    • Full sorting


Therefore the problem is:

Unlike Boolean constructs, ranking is second class.

Ranking is processed as a Monolithic component (R), always after the Boolean component (B).


σh.star=3AND h.area=m.area

How did we make Boolean “first class”?

Select *

From

Hotel h, Museum m

Where

h.star=3 AND

h.area=m.area

HotelXMuseum

(1) materialize-then-filter


h.area=m.area

σh.star=3

scan(m)

scan(h)

Splitting and Interleaving

σh.star=3AND h.area=m.area

Select *

From

Hotel h, Museum m

Where

h.star=3 AND

h.area=m.area

HotelXMuseum

(1) materialize-then-filter

(2) B is split into joins and selections, which interleave with each other.


Relational Algebra Enables Such Splitting and Interleaving

  • Algebra: Arabic word, “bring together broken parts”.

    • Operators:

      to break apart the Boolean condition(splitting)

      • selection, projection, join, etc.

    • Algebraic Laws:

      to bring them together (interleaving)

      • pushdown selection: σc(R ΧS) = σc(R) Χ S

      • split selection: Σc1Λc2 (R) = σc1(σc2 (R))

  • Relational algebra is the foundation of query optimization


Rank-Relational Algebra

By:

  • Splitting ranking function

  • Interleaving ranking with Boolean operations

    We achieve:

  • Support ranking as a first-class query type

  • Integrate rankingwith traditional Boolean constructs

  • Enable efficient query plans that were unavailable


5 results

5 results

R

μclose

sort cheap+close+related

rank-joinh.area=m.area

B

μrelated

σh.star=3

h.area=m.area

σh.star=3

rank-scan(h,cheap)

scan(m)

scan(m)

scan(h)

Ranking Query Plan

split and interleave:

reduction of intermediate results, thus processing cost

materialize-then-sort:

naïve, overkill


Possibly orders of magnitude improvement

Implementation in PostgreSQL

plan1: traditional materialize-then-sort plan

plan2-4: new ranking query plans

Observations:

an extended plan space with plans of various costs.


μp3

μp2

Scanp1(H)

Example

  • Ranking function split, predicates evaluated incrementally.

  • Iterator structure.

  • Output order: how “promising” according to the evaluated predicates.

Hp1, p2, p3

Hp1, p2

Select *

From Hotel H

Order By p1+p2+p3

Limit 1

GetNext()

Hp1


hotel

upper-bound

μp3

hotel

upper-bound

μp2

hotel

upper-bound

Scanp1(H)

Example

Hp1, p2, p3

Hp1, p2

Select *

From Hotel H

Order By p1+p2+p3

Limit 1

Hp1


Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

h2 2.9

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


In contrast: materialize-then-sort

Sortp1+p2+p3

Select *

From Hotel H

Order By p1+p2+p3

Limit 1

Scan(H)


Data Model: of Rank-Relation

  • Two Logical Properties

  • Membershipof the tuples: evaluated Boolean predicates

  • Orderamong the tuples: evaluated ranking predciates

rank-relation

Membership(B): h.area=m.area, h.star=3

Order(R): close(h.addr, “BWI airport”)

related(m.collection,“dinosaur”)

cheap(h.price)

B

Membership(B): h.star=3

Order(R): cheap(h.price)

A

rank-relation


Ranking Principle:upper-bound determines the order

F=cheap + close + related

  • Processing in the “promising” order, avoiding unnecessary processing.

B

tuple h1

tuple h2

  • Without further processing h1, we cannot output any result;

A


Ranking Operators

To achieve splitting and interleaving:

  • New operator:

    • μ: evaluate ranking predicates piece by piece.

  • Extended operators:

    • rank-selection

    • rank-join

    • rank-scan

    • rank-union, rank-intersection.


parser,

rewriter

operatorsandalgebraic laws

Logical Query Plan

optimizer

Plan enumerator

cost model

implementation of operators

Physical Query Plan

Results

executor

Impact of Rank-Relational Algebra

Ranking Query


Skyline Queries

  • A real estate example

dist

P4

P3

Skyline on

price & dist

P1

P5

P2

price

  • skyline contains points that are not dominated

    by any other point.

[slide courtesy of Yidong Yuan]


Preference Queries

  • Semantically more flexible than top-k queries

    • Instead of enforcing total order, it allows partial order.

  • Efficient processing is more challenging


Motivation - Preferences

SELECT * FROM used_cars

PREFERRING

(make = ‘Honda’

ELSE

make = ‘Ford’)

AND price AROUND 40000

SELECT * FROM used_cars

WHERE

(make = ‘Honda’

OR

make = ‘Ford’)

AND price = 40000


Preferences Semantics: Strict Partial Order

  • Preference: strict partial order

    A > B : “I like A better than B”

  • Example

    Car Make: {Honda, Ford, Toyota}

    Honda > Ford > {Toyota, Pontiac}


Accumulating Preference

  • Example

    P1=POS(make, {Honda, Toyota})

    P2=NEG(color, {yellow, green})

    P=P1P2

    Car1(Honda, black)

    Car2(Toyota, green)

    Car3(Pontiac, red)

    Car4(Ford, yellow)

    Car1 > {Car2, Car3} > Car4


Summary: Take-Home Messages

  • Strict Boolean semantics of SQL is awkward when facing the challenges in supporting new applications.

  • There is a crying need for fuzzy retrieval in databases.

  • RankSQL system (http://aim.cs.uiuc.edu)


ad
  • Login