Beyond sql structured data retrieval by ranking
Download
1 / 52

Beyond SQL: Structured Data Retrieval by Ranking - PowerPoint PPT Presentation


  • 283 Views
  • Uploaded on

Beyond SQL: Structured Data Retrieval by Ranking. Chengkai Li October 25, 2006 CS511 guest lecture. About Myself. Advisor: Kevin Chang (UIUC) Research Interests Structured Data Retrieval and Ranking Query Processing Web Information Integration XML Query Processing Misc.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Beyond SQL: Structured Data Retrieval by Ranking' - adamdaniel


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Beyond sql structured data retrieval by ranking l.jpg

Beyond SQL: Structured Data Retrieval by Ranking

Chengkai Li

October 25, 2006

CS511 guest lecture


About myself l.jpg
About Myself

  • Advisor: Kevin Chang (UIUC)

  • Research Interests

    • Structured Data Retrieval and Ranking Query Processing

    • Web Information Integration

    • XML Query Processing

  • Misc.

    • http://www.ews.uiuc.edu/~cli

    • cli [at] uiuc [dot] edu


Boolean semantics of sql success and barrier l.jpg
Boolean Semantics of SQL: Success and Barrier

  • Example:

    SELECT *

    FROM Homes H

    WHERE 400K<price<600K AND #bedroom = 4

  • Problems:

    • Query Semantics:

      • True or False, nothing in the middle

      • What do we want?: similarity, relevance, preference, etc.

    • Query Results Presentation:

      • Too many answers (“information overload” problem): hard to digest and analyze; or too few answers: no enough options to consider

      • What do we want?: a ranked list


Google on databases from boolean query to fuzzy retrieval l.jpg
“Google” on Databases:From Boolean Query to Fuzzy Retrieval


Fuzzy retrieval by ranking l.jpg
Fuzzy Retrieval by Ranking

  • Ranking (Top-k) Queries

    • Top k results by a ranking function with multiple criteria

    • SELECT *

      FROM Homes H

      Order By 0.3*big(H.size)+0.7*cheap(H.price)

      Limit 5


Beyond boolean semantics ranking l.jpg
Beyond Boolean Semantics: Ranking

Useful and important in many real-world database applications:

  • E-Commerce

    Find the best hotel deals by price, distance, etc.

  • Multimedia Databases

    Find the most similar images by color, shape, texture, etc.

  • Text Retrieval and Document Management

    Find the most relevant records/documents/pages.

  • OLAP, Decision Support

    Find the top profitable customers.


Application 1 e commerce l.jpg
Application 1: E-commerce

  • The home search example

  • More details later


Application 2 decision support l.jpg
Application 2: Decision Support

  • What are the top 5 areas to advertise a new insurance product?

    SELECT zipcode,

    AVG(income*w1+age*w2+credit*w3) as score

    FROM customer

    WHERE occupation=‘student’

    GROUP BY zipcode

    ORDER BY score

    LIMIT 5


Application 3 multimedia databases l.jpg
Application 3: Multimedia Databases

  • Find the most similar images by color, shape, texture, etc.

  • Online demo: http://amazon.ece.utexas.edu/~qasim/research.htm


Application 4 keyword queries l.jpg
Application 4: Keyword Queries

  • Text data coexist with structured data

  • IR-style database queries

[slides courtesy of Vagelis Hristidis]


Example complaints database schema l.jpg
Example – Complaints Database Schema


Example complaints database data l.jpg
Example - Complaints Database Data

Complaints

Customers

Products


Example keyword query maxtor netvista l.jpg
Example – Keyword Query [Maxtor Netvista]

Complaints

Customers

Products


Keyword query semantics definition of document in databases l.jpg
Keyword Query Semantics (definition of “document” in databases)

  • Keywords are:

  • in same tuple

  • in same relation

  • in tuples connected through primary-foreign key relationships

  • Score of result:

  • distance of keywords within a tuple

  • distance between keywords in terms of primary-foreign key connections

  • IR-style score of result tree


Example keyword query maxtor netvista15 l.jpg
Example – Keyword Query [Maxtor Netvista]

Complaints

Customers

Products

Results: (1) c3, (2) p2c3, (3) p1c1


Great interest in database community l.jpg
Great Interest in Database Community

  • Top-k Queries

    • Top-k to range query transformation: [ChaudhuriG99]

    • Indexing: [ChangBC+00], [TsaparasPK+03]

    • View: [HristidisKP01], [YiYY+03], [DasGK+06]

    • Query operators and algorithms:

      [CareyK97], [NatsevCS+01], [IlyasAE03], [IlyasSA+04], [ZhangHC+06]

    • Ranking algebra: [ChaudhuriRW05] [LiCI+05]

    • Ranking aggregate queries: [LiCI06]

  • Other Types of Queries

    • Keyword queries: [BhalotiaHN+02], [AgrawalCD02], [HristidisP02]

    • IR-style TFIDF ranking: [AgrawalCD+03]

    • Ranking of attributes: [DasHK+06]

    • Preference queries: [AgrawalW00], [Kießling02]

    • Skyline queries: [BorzsonyiSK01]

  • DBRank07


Theme life cycles bursting topics current bursting l.jpg
Theme Life Cycles: Bursting Topics(Current Bursting)

[slide courtesy of Chengixang Zhai]


Ranksql a rdbms with efficient support of ranking queries l.jpg
RankSQL: a RDBMS with Efficient Support of Ranking Queries

  • Algebraic Foundation and Optimization Framework

    SPJ queries (SELECT … FROM … WHERE … ORDER BY …)

  • Ad-Hoc Ranking Aggregate Queries

    top k groups instead of tuples

    (SELECT … FROM … WHERE … GROUP BY… ORDER BY …)


Example trip planning l.jpg
Example: Trip Planning

Suggest a hotel to stay and a museum to visit:

Select *

From

Hotel h, Museum m

Where

h.star=3 AND

h.area=m.area

Order By

cheap(h.price) +

close(h.addr, “BWI airport”) +

related(m.collection,“dinosaur”)

Limit 5

membership dimension:

Boolean predicates,

Boolean function

B

order dimension:

ranking predicates,

scoring function

R


Processing ranking queries in traditional rdbms l.jpg

5 results

R

sort cheap+close+related

B

h.area=m.area

σh.star=3

scan(m)

scan(h)

Processing Ranking Queries in Traditional RDBMS

Select *

From

Hotel h, Museum m

Where

B

Order By

R

Limit 5


Problems of traditional approach l.jpg

5 results

R

B

Problems of Traditional Approach

  • NaïveMaterialize-then-Sort scheme

  • Overkill

    total order of all results;

    only 5 top results are requested.

  • Very inefficient

    • Scan large base tables

    • Join large intermediate results

    • Evaluate every ranking on every tuple

    • Full sorting


Therefore the problem is l.jpg
Therefore the problem is:

Unlike Boolean constructs, ranking is second class.

Ranking is processed as a Monolithic component (R), always after the Boolean component (B).


How did we make boolean first class l.jpg

σh.star=3AND h.area=m.area

How did we make Boolean “first class”?

Select *

From

Hotel h, Museum m

Where

h.star=3 AND

h.area=m.area

HotelXMuseum

(1) materialize-then-filter


Splitting and interleaving l.jpg

h.area=m.area

σh.star=3

scan(m)

scan(h)

Splitting and Interleaving

σh.star=3AND h.area=m.area

Select *

From

Hotel h, Museum m

Where

h.star=3 AND

h.area=m.area

HotelXMuseum

(1) materialize-then-filter

(2) B is split into joins and selections, which interleave with each other.


Relational algebra enables such splitting and interleaving l.jpg
Relational Algebra Enables Such Splitting and Interleaving

  • Algebra: Arabic word, “bring together broken parts”.

    • Operators:

      to break apart the Boolean condition (splitting)

      • selection, projection, join, etc.

    • Algebraic Laws:

      to bring them together (interleaving)

      • pushdown selection: σc(R ΧS) = σc(R) Χ S

      • split selection: Σc1Λc2 (R) = σc1(σc2 (R))

  • Relational algebra is the foundation of query optimization


Rank relational algebra l.jpg
Rank-Relational Algebra

By:

  • Splitting ranking function

  • Interleaving ranking with Boolean operations

    We achieve:

  • Support ranking as a first-class query type

  • Integrate rankingwith traditional Boolean constructs

  • Enable efficient query plans that were unavailable


Ranking query plan l.jpg

5 results

5 results

R

μclose

sort cheap+close+related

rank-joinh.area=m.area

B

μrelated

σh.star=3

h.area=m.area

σh.star=3

rank-scan(h,cheap)

scan(m)

scan(m)

scan(h)

Ranking Query Plan

split and interleave:

reduction of intermediate results, thus processing cost

materialize-then-sort:

naïve, overkill


Possibly orders of magnitude improvement l.jpg
Possibly orders of magnitude improvement

Implementation in PostgreSQL

plan1: traditional materialize-then-sort plan

plan2-4: new ranking query plans

Observations:

an extended plan space with plans of various costs.


Example l.jpg

μp3

μp2

Scanp1(H)

Example

  • Ranking function split, predicates evaluated incrementally.

  • Iterator structure.

  • Output order: how “promising” according to the evaluated predicates.

Hp1, p2, p3

Hp1, p2

Select *

From Hotel H

Order By p1+p2+p3

Limit 1

GetNext()

Hp1


Example30 l.jpg

hotel

upper-bound

μp3

hotel

upper-bound

μp2

hotel

upper-bound

Scanp1(H)

Example

Hp1, p2, p3

Hp1, p2

Select *

From Hotel H

Order By p1+p2+p3

Limit 1

Hp1


Example31 l.jpg

Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Example32 l.jpg

Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Example33 l.jpg

Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

h2 2.9

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Example34 l.jpg

Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Example35 l.jpg

Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Example36 l.jpg

Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Example37 l.jpg

Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Example38 l.jpg

Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Example39 l.jpg

Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Example40 l.jpg

Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


Example41 l.jpg

Hp1, p2, p3

μp3

Hp1, p2

μp2

Hp1

Scanp1(H)

Example

Select *

From Hotel H

Order By p1+p2+p3

Limit 1


In contrast materialize then sort l.jpg
In contrast: materialize-then-sort

Sortp1+p2+p3

Select *

From Hotel H

Order By p1+p2+p3

Limit 1

Scan(H)


Data model of rank relation l.jpg
Data Model: of Rank-Relation

  • Two Logical Properties

  • Membershipof the tuples: evaluated Boolean predicates

  • Orderamong the tuples: evaluated ranking predciates

rank-relation

Membership(B): h.area=m.area, h.star=3

Order(R): close(h.addr, “BWI airport”)

related(m.collection,“dinosaur”)

cheap(h.price)

B

Membership(B): h.star=3

Order(R): cheap(h.price)

A

rank-relation


Ranking principle upper bound determines the order l.jpg
Ranking Principle:upper-bound determines the order

F=cheap + close + related

  • Processing in the “promising” order, avoiding unnecessary processing.

B

tuple h1

tuple h2

  • Without further processing h1, we cannot output any result;

A


Ranking operators l.jpg
Ranking Operators

To achieve splitting and interleaving:

  • New operator:

    • μ: evaluate ranking predicates piece by piece.

  • Extended operators:

    • rank-selection

    • rank-join

    • rank-scan

    • rank-union, rank-intersection.


Impact of rank relational algebra l.jpg

parser,

rewriter

operatorsandalgebraic laws

Logical Query Plan

optimizer

Plan enumerator

cost model

implementation of operators

Physical Query Plan

Results

executor

Impact of Rank-Relational Algebra

Ranking Query


Skyline queries l.jpg
Skyline Queries

  • A real estate example

dist

P4

P3

Skyline on

price & dist

P1

P5

P2

price

  • skyline contains points that are not dominated

    by any other point.

[slide courtesy of Yidong Yuan]


Preference queries l.jpg
Preference Queries

  • Semantically more flexible than top-k queries

    • Instead of enforcing total order, it allows partial order.

  • Efficient processing is more challenging


Motivation preferences l.jpg
Motivation - Preferences

SELECT * FROM used_cars

PREFERRING

(make = ‘Honda’

ELSE

make = ‘Ford’)

AND price AROUND 40000

SELECT * FROM used_cars

WHERE

(make = ‘Honda’

OR

make = ‘Ford’)

AND price = 40000


Preferences semantics strict partial order l.jpg
Preferences Semantics: Strict Partial Order

  • Preference: strict partial order

    A > B : “I like A better than B”

  • Example

    Car Make: {Honda, Ford, Toyota}

    Honda > Ford > {Toyota, Pontiac}


Accumulating preference l.jpg
Accumulating Preference

  • Example

    P1=POS(make, {Honda, Toyota})

    P2=NEG(color, {yellow, green})

    P=P1P2

    Car1(Honda, black)

    Car2(Toyota, green)

    Car3(Pontiac, red)

    Car4(Ford, yellow)

    Car1 > {Car2, Car3} > Car4


Summary take home messages l.jpg
Summary: Take-Home Messages

  • Strict Boolean semantics of SQL is awkward when facing the challenges in supporting new applications.

  • There is a crying need for fuzzy retrieval in databases.

  • RankSQL system (http://aim.cs.uiuc.edu)


ad