Extracting Schema from Semistructured Data

Nestorov, Abiteboul, and Motwani at Stanford


Perspective

  • This paper is new work.

  • More than the details, look at the issues:

    • What are their goals?

    • What does this contribute?

    • Do they attain their goals?

    • Why do we need this?


Sample Database

[Figure: sample database drawn as a labeled graph. Objects are numbered 1 to 11; the edge labels are Name, Entree, Manager, Hours, Company, and Phone; the atomic values shown are “The Keg”, “Steak”, “Jim”, 24, “Burger King”, “Fries”, “AA+ Management”, and 543-7798.]

Schema = Types


Where does semistructured data come from?

  • Document collections

  • Biological data

  • HTML

  • Bibtex, etc.


Who needs structure?

  • For the user

    • To know what queries are possible

    • Browsing the database

    • Type checking

  • Storage

    • Data layout to facilitate querying

      • E.g. place similar objects on same page

    • Indexes


Who Needs Structure? (2)

  • Query optimization

    • All the relational query optimization tricks

      • Maintaining statistics per data type

        • Cardinality, # of pages, Index cardinality, etc.

      • Estimating the cost/size of result of query plans

    • Efficient processing of path expressions

  • Other?


Their Goals

Approximate typing (schema extraction) of semistructured data.

Example (little lie) Typing Program:

Restaurant(X) :- Link(X,A,B,C) & Name-atom(A) &
                 Entrée-atom(B) & Manager-atom(C)


Outline of the Algorithm

Given a database:

1. Find the perfect typing program.

  • This typing might be too large, so we:

    2. Coalesce similar types into k types.

    3. Assign a type to objects in the database.

    4. Deduce meaningful names for the types.


Typing

The two base relations:

- link(FromObj, ToObj, Label)

- atomic(Obj, Value)

These are the only two EDB’s of the typing program.

[Figure: object 7 with Manager, Name, and Entree edges to objects 8, 9, and 10; the atomic values shown are “The Keg”, “Steak”, and “Jim”.]

Restaurant(X) :- link(X,A,Name) & atomic(A, Ap) &
                 link(X,B,Entrée) & atomic(B, Bp) &
                 link(X,C,Manager) & atomic(C,Cp)
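As a rough illustration of how such a rule could be evaluated, the sketch below stores the two base relations as Python sets of tuples and checks the body of the Restaurant rule directly. Only link(7, 8, Name) and atomic(8, “The Keg”) are stated in the slides; the facts for objects 9 and 10 are assumptions read off the figure, and the helper names `has_atomic_link` and `restaurant` are hypothetical.

```python
# EDB facts: link(FromObj, ToObj, Label) and atomic(Obj, Value).
# The pairing of objects 9 and 10 with Entree and Manager is an assumption.
link = {
    (7, 8, "Name"),
    (7, 9, "Entree"),
    (7, 10, "Manager"),
}
atomic = {
    (8, "The Keg"),
    (9, "Steak"),
    (10, "Jim"),
}

atomic_objs = {obj for obj, _ in atomic}


def has_atomic_link(x, label):
    """True if some link(x, y, label) exists where y is an atomic object."""
    return any(src == x and lab == label and dst in atomic_objs
               for src, dst, lab in link)


def restaurant(x):
    """Body of the Restaurant rule: Name, Entree and Manager links to atoms."""
    return all(has_atomic_link(x, lab) for lab in ("Name", "Entree", "Manager"))


objects = {src for src, _, _ in link} | {dst for _, dst, _ in link}
extension = {x for x in objects if restaurant(x)}
print(extension)  # {7}
```

With these facts only object 7 satisfies the rule body, so it is the whole extension of Restaurant.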


Typing 2

Restaurant(X) :- link(X,A,Name) & atomic(A, Ap) &
                 link(X,B,Entrée) & atomic(B, Bp) &
                 link(X,C,Manager) & atomic(C,Cp)

EDB:

link(7, 8, Name)

atomic(8, “The Keg”)

IDB (intensional relations): defined by the typing program

Extension of an IDB:

Restaurant(1)


Restriction on Types

Arbitrary type programs are not allowed.

Rules type_i(X) can only be built from the following:

1. link(Y, X, c) & type_j(Y)

2. link(X, Y, c) & type_j(Y)

3. link(X, Y, c) & atomic(Y, Z)

Types can only express local characteristics.

The collection of typed links is a set.

(2 entrées = 1 entrée)

[Figure: an object X with typed links labeled cj, cj, and c0.]
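To make the "set" remark concrete, here is a minimal sketch that collects the typed links of one object. It covers only form 3 (links to atomic objects) and writes each typed link as the label followed by 0, borrowing the name0/entree0 notation used later in the slides. The function name `typed_links` and the second entrée object 12 are hypothetical.

```python
def typed_links(obj, link, atomic_objs):
    """Set of typed links 'label0' for links from obj to atomic objects."""
    return frozenset(f"{label}0" for src, dst, label in link
                     if src == obj and dst in atomic_objs)


# Object 7 with one Name, one Manager and two Entree links (12 is made up).
link = {
    (7, 8, "name"),
    (7, 9, "entree"),
    (7, 12, "entree"),
    (7, 10, "manager"),
}
atomic_objs = {8, 9, 10, 12}

one_entree = typed_links(7, link - {(7, 12, "entree")}, atomic_objs)
two_entrees = typed_links(7, link, atomic_objs)
print(one_entree == two_entrees)  # True
```

Because the result is a set, a second entrée adds nothing new, which is the sense in which 2 entrées = 1 entrée.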


Semantics of Type Program

The greatest fixpoint of a datalog program on a database defines the semantics of the typing.

Fixpoint = Extensions of IDB’s + EDB’s

  • Least fixpoint

    • start with model of only EDB’s

    • at each step union into the model anything new.


Greatest Fixpoint

1. Start with a model of EDB’s and all possible extensions.

2. At each step, remove any extensions not derived by applying the rules.

Least fixpoint doesn’t work when types are mutually recursive:

person(X) :- link(X, Y, is-manager-of) & firm(Y) &
             link(X, Yp, name) & atomic(Yp, Z)

firm(X) :- link(X, Y, is-managed-by) & person(Y) &
           link(X, Yp, name) & atomic(Yp, Z)

Starting from empty extensions, person needs a firm and firm needs a person, so neither rule ever fires.
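The difference between the two fixpoints can be checked on a tiny example. The sketch below hard-codes the person/firm rules and iterates them from the empty extensions (least fixpoint) and from "every object has every type" (greatest fixpoint). The four facts and object numbers are hypothetical; only the two rules come from the slide.

```python
# Hypothetical database: object 1 manages object 2, and both have a name.
link = {
    (1, 2, "is-manager-of"), (1, 3, "name"),
    (2, 1, "is-managed-by"), (2, 4, "name"),
}
atomic = {(3, "Jim"), (4, "AA+ Management")}
objects = {1, 2, 3, 4}
atomic_objs = {obj for obj, _ in atomic}


def has_name(x):
    """link(x, Yp, name) & atomic(Yp, Z)."""
    return any(src == x and lab == "name" and dst in atomic_objs
               for src, dst, lab in link)


def linked_into(x, label, targets):
    """link(x, Y, label) with Y in the given extension."""
    return any(src == x and lab == label and dst in targets
               for src, dst, lab in link)


def step(ext):
    """Apply both rules once to the current extensions."""
    return {
        "person": {x for x in objects
                   if has_name(x) and linked_into(x, "is-manager-of", ext["firm"])},
        "firm": {x for x in objects
                 if has_name(x) and linked_into(x, "is-managed-by", ext["person"])},
    }


def fixpoint(start):
    """Iterate the rules until the extensions stop changing."""
    ext = start
    while True:
        nxt = step(ext)
        if nxt == ext:
            return ext
        ext = nxt


least = fixpoint({"person": set(), "firm": set()})
greatest = fixpoint({"person": set(objects), "firm": set(objects)})
print(least)     # {'person': set(), 'firm': set()}
print(greatest)  # {'person': {1}, 'firm': {2}}
```

Iterating from the full extensions only ever removes objects, so it stops at the greatest fixpoint, where object 1 is a person and object 2 is a firm.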


Imperfect Types

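Defect: a measure of how well an object fits a given type.

Defect = Excess + deficit

type1 = manager0 + name0 + entree0

Defect is 2 for assigning 11 to type1: its # seats link is not used by the type (excess 1), and a Manager link would have to be added (deficit 1).

[Figure: object 7 has Manager, Name, and Entree edges to objects 4, 5, and 6; the atomic values shown are “The Keg”, “Steak”, and “Jim”. Object 11 has # seats, Name, and Entree edges to objects 8, 9, and 10; the atomic values shown are “McD”, “biscuit”, and 53.]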


Imperfect Types (2)

  • Excess: # of EDB’s not used to validate any object’s type.

  • Deficit: Minimum # of ground facts that need to be added to make all type derivations possible.

[Figure: object 7 with Manager, Name, and Entree links; object 11 with # seats, Name, and Entree links.]
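For a single object whose links all lead to atomic values, the two definitions above reduce to set differences. The sketch below is a simplified per-object reading, not the database-wide definition on the slide: excess counts the object's typed links that the type does not use, and deficit counts the typed links that would have to be added. The name `defect` and the spelling `#seats0` are assumptions.

```python
# type1 as a set of typed links (label followed by 0 means "to an atom").
type1 = {"manager0", "name0", "entree0"}

# Local pictures of the two objects in the figure.
picture_7 = {"manager0", "name0", "entree0"}
picture_11 = {"#seats0", "name0", "entree0"}


def defect(picture, type_body):
    """Excess + deficit of assigning one object to one type."""
    excess = len(picture - type_body)    # links the type leaves unused
    deficit = len(type_body - picture)   # links that would have to be added
    return excess + deficit


print(defect(picture_7, type1))   # 0
print(defect(picture_11, type1))  # 2
```

Object 11 contributes one unused link (# seats) and lacks one required link (Manager), giving a defect of 2.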


Perfect Typing Program (Stage 1)

Gore.


Multiple Roles

[Figure: three objects O1, O2, and O3 with Name, Country, Team, and Movie links. Values shown: France, Rocky Horror, Scholes, Man Utd, Bleu, Star Trek, Binoche, Cantona, and England.]

How hard is it to choose the types for the cover?

How do you quantify atomization?


Clustering (Stage 2)

Define a distance function between two types:

A first approximation is the difference between the bodies of their rule definitions.

t1 :- a0, b2

t2 :- a0, b1

t3 :- b2, b1, b3

d(t1, t2) = 2
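One reading of "difference between the bodies" that reproduces d(t1, t2) = 2 is the size of the symmetric difference of the two bodies, each taken as a set of typed links. The sketch below assumes that reading; the slides do not spell the function out.

```python
# Rule bodies from the slide, as sets of typed links.
bodies = {
    "t1": {"a0", "b2"},
    "t2": {"a0", "b1"},
    "t3": {"b2", "b1", "b3"},
}


def d(s, t):
    """Number of typed links that appear in exactly one of the two bodies."""
    return len(bodies[s] ^ bodies[t])


print(d("t1", "t2"))  # 2
print(d("t1", "t3"))  # 3
print(d("t2", "t3"))  # 3
```

Under this reading t1 and t2 are the closest pair, since they differ only in b2 versus b1.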


A Better Function

Include some measure of the weight of a type (# of objects of that type):

t2 ~> t1

Some desirable properties:

  • increasing in d: coalesce similar types

  • decreasing in w1: compensate for ‘expected noise’

  • increasing in w2: maintain types with large extents

Choosing what to coalesce is hard!
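The slides do not give the weighted function itself, so the sketch below uses a purely illustrative one, cost = d * w2 / w1, chosen only because it has the three listed properties. It assumes t2 ~> t1 means "move t2 into t1", with w1 the weight of t1 and w2 the weight of t2. The weights, the name `cost`, and the helper `cheapest_move` are all hypothetical.

```python
from itertools import permutations

# Rule bodies from the clustering slide; the weights are made up.
bodies = {
    "t1": {"a0", "b2"},
    "t2": {"a0", "b1"},
    "t3": {"b2", "b1", "b3"},
}
weights = {"t1": 100, "t2": 3, "t3": 40}


def d(s, t):
    """Size of the symmetric difference of the two rule bodies."""
    return len(bodies[s] ^ bodies[t])


def cost(dist, w1, w2):
    """Illustrative cost of t2 ~> t1: grows with dist and w2, shrinks with w1."""
    return dist * w2 / w1


def cheapest_move():
    """Return (cost, source, target) for the cheapest move source ~> target."""
    return min((cost(d(src, dst), weights[dst], weights[src]), src, dst)
               for src, dst in permutations(bodies, 2))


_, src, dst = cheapest_move()
print(f"{src} ~> {dst}")  # t2 ~> t1
```

With these numbers the small type t2 is folded into the large, similar type t1, which matches the intuition of treating rare near-duplicates as noise.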


Recasting (Stage 3)

Assign each object to types within the k types formed from stage 2.

(Optional) Choose a better value of k and rerun step 2.
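A minimal sketch of recasting, assuming each object is given the closest of the k surviving types, with closeness measured by the symmetric difference between the object's typed links and the type's body. The slides do not fix this criterion; the two types, the object numbered 20, and the name `recast` are hypothetical.

```python
# k = 2 surviving types (hypothetical bodies).
kept_types = {
    "t1": {"manager0", "name0", "entree0"},
    "t2": {"name0", "phone0"},
}

# Local pictures of three objects; object 20 is made up.
pictures = {
    7: {"manager0", "name0", "entree0"},
    11: {"#seats0", "name0", "entree0"},
    20: {"name0", "phone0"},
}


def recast(picture):
    """Names of the kept types at minimum distance from the picture."""
    dist = {name: len(picture ^ body) for name, body in kept_types.items()}
    best = min(dist.values())
    return sorted(name for name, value in dist.items() if value == best)


assignment = {obj: recast(picture) for obj, picture in pictures.items()}
print(assignment)  # {7: ['t1'], 11: ['t1'], 20: ['t2']}
```

Returning a list leaves room for ties, so an object that fits two types equally well keeps both.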


Results

  • Heavy use of synthetic data.

    • Create a type definition and generate instances that are perturbed randomly in some way.

  • What do the graphs show?

    • Are the data sets realistic?


Conclusions

  • Paper problems:

    • The algorithm isn’t completely explained.

    • Many comments are not elaborated.

  • But it’s an important problem and a good first approach.

