Querying web sources within a data federation
This presentation is the property of its rightful owner.
Sponsored Links
1 / 26

Querying Web-Sources within a Data Federation PowerPoint PPT Presentation


  • 60 Views
  • Uploaded on
  • Presentation posted in: General

Querying Web-Sources within a Data Federation. Lynn Wu 1 , Aykut Firat 2 , Tarik Alatovic 3 , Stuart Madnick 1 1 MIT Sloan School of Management 2 Northeastern University 3 INSEAD International Conference on Information Systems (ICIS) December 11, 2006. Motivating Scenario. You want:

Download Presentation

Querying Web-Sources within a Data Federation

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Querying web sources within a data federation

Querying Web-Sources within a Data Federation

Lynn Wu1, Aykut Firat2,

Tarik Alatovic3, Stuart Madnick1

1MIT Sloan School of Management

2Northeastern University

3INSEAD

International Conference on Information Systems (ICIS)

December 11, 2006


Motivating scenario

Motivating Scenario

You want:

  • The current stock quotes of all companies listed on the Stock Exchange

    • that are in the biotechnology industry.

  • And you want to see each of the stock quotes in all the major currencies.


Good news

Good News

All of the necessary information is available (and for free) on the Web …

Stock price for any company

Listing of companies in an industry

Conversion between any two currencies

So what’s the problem?


Process part 1

Process – Part 1

Web sites are not like Relational (SQL) databases.

Must go step-by-step: first find all the biotech companies.


Querying web sources within a data federation

Process – Part 2

Then must find the stock price of each, one-by-one.

237 Biotech firms

ADH

ACOR

ACHN

ABPI

ADLS

ANX

ACAD


Process part 3

Process – Part 3

Then must convert stock price of each, one-by-one.


General scenario

General Scenario

  • Users often have to browse through many websites and collect and process a lot of information manually.

  • Wouldn’t it be great if you could get all the stock quotes in the biotech industry using one query?

select ticker, price from yahooF where ticker IN (select companyticker from companytable where industry='Biotechnology')


Why is this so difficult

Why is this so difficult?

  • Websites have various capability restrictions.

    • Web sites do not accept general queries (e,g., SQL).

    • Assuming they somehow accepted general queries, there are still problems. For example:

      • select price from yahooF

        This is not answerable as Yahoo! Finance requires at least one ticker at a time to get the stock quote.

      • select exchanged, expressed, rate, date from olsen where expressed='USD' and date= '12/10/06'

        Must specify both currencies.


Existing solutions

Existing Solutions

  • Commercial databases can incorporate heterogeneous data sources through the use of wrappers:

    • However, there is no general-purpose wrapper that can query the entire Web.

    • Need to construct one wrapper per website.

    • This is our focus – how can these be improved ?

  • Other options:

    • Using highly expressive context-free grammars to express the capability restrictions

      • Has not been used widely in commercial systems due to their complexity.


How does a federated database system handle the problem

Query: Select ..from s1,s2,s3

Federation Engine

IBM DB2

Wrapper: Request

-

Reply Protocol

Wrapper

Wrapper for S1

Wrapper for S2

Wrapper for

S3

Capability

Capability

Capability

Handler

Handler

Handler

Data

Data

Data

Extraction

Extraction

Extraction

Web

Sources

S1-website

S2-website

S3-website

How does a Federated database system handle the problem?

Example: IBM DB2

For web sites (S1, S2, S3), each wrapper must be custom crafted.


Research contribution

Research Contribution

  • Offer a complete, practical, and scalable solution to easily incorporate websites into a data federation.

  • Abstract wrapper components into separate reasoning engines.

    • Capability reasoning engine for query planning and execution

    • Data extraction engine


Our solution

Query: Select ..from s1,s2,s3

Federation Engine

IBM DB2

Wrapper: Request

-

Reply Protocol

Wrapper

Wrapper for S1

Wrapper for S2

Wrapper for S3

Capability

Capability

Capability

Handler

Handler

Handler

Data

Data

Data

Extraction

Extraction

Extraction

Web

Sources

S1-website

S2-website

S3-website

Our Solution

Three-Layered Architecture— with capability declaration

Two-Layered Architecture—current IBM solution

Query: Select ..from s1,s2,s3

Query: Select ..from s1,s2,s3

Federation Engine

Federation Engine

IBM DB2

IBM DB2

Wrapper: Request

Wrapper: Request

-

-

Reply Protocol

Reply Protocol

C

C

apability

apability

Wrapper,

Wrapper,

Query

Query

R

R

ecord

ecord

Capability

Capability

planning

planning

CR

CR

Declaration

Declaration

for S1

for S1

Engine

Engine

with

with

CR

CR

for S2

for S2

capability

capability

CR

CR

declaration

declaration

for S3

for S3

D

D

ata

ata

Data

Data

E

E

xtraction

xtraction

Data

Data

Extraction

Extraction

DE

DE

Spec Files

Spec Files

for S1

for S1

Extraction

Extraction

Engine

Engine

Engine

Engine

DE

DE

for S2

for S2

DE

DE

for S3

for S3

Web

Web

Sources

Sources

S1-website

S2-website

S3-website


Adding a web source is simple

Adding a web source is simple.

  • Define the data extraction rules.

  • Define the capability record.

     No procedural coding involved at all !


Querying web sources within a data federation

Data Extraction: Cameleon Engine

  • Extract data from web pages using declarative specifications that extract specific fields within a website.

  • Can answer rudimentary queries involving only a single website.

Input param

Regular expression identifying the region and extracts the price

Example data extraction rules for Yahoo! Finance


Querying web sources within a data federation

Cameleon Studio tool enables quick creation and testing of the data extraction rules


Capability record

Capability Record

  • For Yahoo Finance!, we have two attributes of interest.

    • Cameleon extracts data and form a table format

  • Capability Record

Must provide one (and only one) Ticker at a time

(some sites allow up to 50 Tickers at a time).

relation(‘YahooF’,

[[‘Ticker’, string, bound(1)],

[‘Price’, number, free]],

['='])

Price is value returned.

Can only use equality (=) operator.

relation(olsen,

[['Exchanged',string, bound(1)],

['Expressed',string, bound(1)],

['Rate',number, free],

['Date',string, bound(1)]],

['=']).

relation(‘companytable’,

[[‘Industry’, string, bound(1)],

[‘CompanyTicker’, string, free]],

['='])


Ibm db2

DB2 XML Wrapper (Adapted from IBM).

IBM DB2

  • Uses wrapper to access non-relational data sources.

  • DB2 first decomposes the original query into query fragments and then sends them to wrappers.

  • Wrapper sends the result back to DB2 which then assembles the final results.


Request reply compensate protocol

select price * 1.3

from YahooF

where ticker in (‘GE’, ‘IBM’, ‘MSFT’);

Query Fragment

HXP: Price

Table: YahooF

Predicates: ticker in (‘GE’, ‘IBM’, ‘MSFT’)

Request

HXP: Price

Table: YahooF

Predicate: ticker = ‘IBM’

HXP: Price

Table: YahooF

Predicate: ticker = ‘MSFT’

HXP: Price

Table: YahooF

Predicate: ticker = ‘GE’

Wrapper plan 1

Wrapper plan 2

Wrapper plan 3

Request-Reply-Compensate protocol example

Request-Reply-Compensate Protocol


Query planning

Query Planning

  • Now we have a capability record defined.

  • Add a secondary mini query planner that is designed specifically to work with capability records.

    • Can answer queries involving multiple web sources.

    • Specify a query execution order of query fragments.

    • Independent query fragments are executed first.

    • Followed by dependent query fragments that can uses the prior results.


Our solution1

Independent query fragment

SELECT COMPANYTICKER, INDUSTRY FROM COMPANYTABLE WHERE INDUSTRY = BIOTECHNOLOGY’ AND COMPANYTICKER < 'AD')

Depends on the previous query fragment

SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER = [<unbound kind>]

Our Solution

  • Example 1

    • Find all the stock quotes of biotech companies.

SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER IN (SELECT COMPANYTICKER FROM COMPANYTABLE WHERE INDUSTRY='BIOTECHNOLOGY' AND COMPANYTICKER <'AD'))


Example query

Independent query fragment

SELECT COMPANYTICKER, INDUSTRY FROM COMPANYTABLE WHERE INDUSTRY = BIOTECHNOLOGY AND COMPANYTICKER < AD

Depends on the previous query fragment

SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER = [<unbound kind>]

Example Query

COMPANYTICKER INDUSTRY

---------------------------------------------

ACAD Biotechnology

ACAM Biotechnology

ACOR Biotechnology

ACEL Biotechnology

SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = ACAD

SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = ACAM

SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = ACOR

SELECT PRICE, TICKER FROM YAHOOF WHERE TICKER = ACEL

TICKER PRICE

-------------------------------------------

ACAD 14.90

ACAM 6.51

ACOR 5.10

ACEL 3.18


Example 2

Example 2

  • Now you want the stock price in all major currencies.

select yahooF.ticker, yahooF.price * exchange.rate, exchange.curency from

  • (select ticker, price from yahooF

  • where ticker IN (select companyticker from companytable

  • where industry=‘biotechnology’)

  • (select currency, olsen.rate from

  • (select currency from currency_map

  • where currency <> ‘USD') currency_map,

  • (select exchanged, 'USD', rate, ‘12/10/06'

  • from olsen where expressed= 'USD' and date=‘12/10/06') olsen

  • where currency_map.currency = olsen.exchanged and

  • currency_map.currency <> 'USD ') as exchange


  • Example 21

    Example 2

    Get all the exchange rates against the USD on Dec 10 2006

    select olsen.rate,

    from

    (select currency,

    from currency_map

    where currency <> ‘USD')currency_map,

    (select exchanged, ‘USD', rate, ‘12/10/06'

    from olsen

    where expressed=‘USD'

    and date=‘12/10/06') olsen,

    where currency_map.currency = olsen.exchanged

    and currency_map.currency <> ‘USD'

    Query fragment 1

    Query fragment 2


    Querying web sources within a data federation

    Query fragment 2

    Capability record

    (select exchanged, ‘USD', rate, ’12/10/06'

    from olsen

    where expressed=‘USD'

    and date=’12/10/06’) olsen

    relation(olsen,

    [['Exchanged',string, bound(1)],

    ['Expressed',string, bound(1)],

    ['Rate',number, free],

    ['Date',string, bound(1)]],

    ['=']).

    select olsen.rate

    from

    (select currency

    from currency_map

    where currency <> 'USD') currency_map,

    (select exchanged, 'USD', rate, '12/10/06'

    from olsen

    where expressed= 'USD'

    and date='12/10/06') olsen,

    where currency_map.currency = olsen.exchanged

    and currency_map.currency <> 'USD'

    (select exchanged, 'USD', rate, ’12/10/2006'

    from olsen

    where expressed= 'USD'

    and date='12/10/06'

    and exchanged in (select currency from currency_map where currency<>’USD’))

    Modified Query fragment 2


    Querying web sources within a data federation

    TICKER PRICE

    -------------------------------------------

    ACAD 14.90

    ACAM 6.51

    ACOR 5.10

    Currency rate

    ----------------------------------------

    AUD 1.46

    CAD 1.32

    HKD 7.72

    YPY 113.00

    select ticker, price * exchange.rate, exchanged.currency

    TICKER PRICE($)PRICExRATECURRENCY

    -------------------------------------------------------------------------------------------------------------

    ACAD 14.9021.754AUD

    ACAD 14.9019.668CAD

    ACAD 14.90115.028HKD

    ACAD 14.901683.7YPY

    ACAM 6.519.505AUD

    ACAM 6.518.593CAD

    ACAM 6.5150.257HKD

    ACAM6.51735.63YPY

    ACOR 5.107.446AUD

    ACOR 5.106.732CAD

    ACOR 5.1039.372HKD

    ACOR 5.10576.3YPY


    Conclusion

    Conclusion

    • Three-layered architecture for querying web sources.

    • Instead of burying capability handling in each wrapper, we created a generic capability handler.

    • Using this capability handler, adding a web source to a federated database is as simple as declaring the extraction rules and capability record for the source.

    • This was implemented and successfully tested.

    • This makes millions of semi-structured web sites into useful “databases.”


  • Login