
Extracting Schema from Semistructured Data


Nestorov, Abiteboul, and Motwani at Stanford

- This paper is new work.
- More than the details, look at the issues:
- What are their goals?
- What does this contribute?
- Do they attain their goals?
- Why do we need this?

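[Figure: example semistructured database drawn as a labeled graph. Objects are numbered 1 to 11; edge labels are Name, Entree, Manager, Hours, Company, and Phone; atomic values are “The Keg”, “Steak”, “Jim”, 24, “Burger King”, “Fries”, “AA+ Management”, and 543-7798.]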

Schema = Types

- Document collections
- Biological data
- HTML
- Bibtex, etc.

- For the user
- To know what queries are possible
- Browsing the database
- Type checking

- Storage
- Data layout to facilitate querying
- E.g. place similar objects on the same page
- Indexes

- Query optimization
- All the relational query optimization tricks
- Maintaining statistics per data type
- Cardinality, # of pages, index cardinality, etc.
- Estimating the cost/size of the result of query plans
- Efficient processing of path expressions

- Other?

Approximate typing (schema extraction) of semistructured data.

Example (little lie) Typing Program:

Restaurant(X) :- Link(X,A,B,C) & Name-atom(A) & Entrée-atom(B) & Manager-atom(C)

Given a database:

1. Find the perfect typing program. This typing might be too large, so we:

2. Coalesce similar types into k types.

3. Assign a type to objects in the database.

4. Deduce meaningful names for the types.

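The two base relations:

- link(FromObj, ToObj, Label)

- atomic(Obj, Value)

[Figure: object 7 with outgoing edges labeled Name, Entree, and Manager to objects 8, 9, and 10, which hold the atomic values “The Keg”, “Steak”, and “Jim”.]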

These are the only two EDB’s of the typing program.

Restaurant(X) :- link(X,A,Name) & atomic(A, Ap) &
                 link(X,B,Entrée) & atomic(B, Bp) &
                 link(X,C,Manager) & atomic(C, Cp)


EDB:

link(7, 8, Name)
atomic(8, “The Keg”)

IDB (intensional relations): defined by the typing program

Extension of an IDB: Restaurant(1)
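To make the EDB/IDB split concrete, here is a minimal Python sketch that stores the two base relations as sets of tuples and computes the extension of the Restaurant rule. The facts for object 7 follow the slide figure; object 20 and its facts are hypothetical, added only to show an object that fails the rule. The helper names are illustrative, not from the paper.

```python
# EDB facts in the slide's shapes: link(FromObj, ToObj, Label), atomic(Obj, Value).
link = {
    (7, 8, "Name"),
    (7, 9, "Entree"),
    (7, 10, "Manager"),
    (20, 21, "Name"),      # hypothetical object that only has a name
}
atomic = {
    (8, "The Keg"),
    (9, "Steak"),
    (10, "Jim"),
    (21, "McD"),           # hypothetical value
}

def has_atomic_link(obj, label):
    """True if obj has an edge with this label to some atomic object."""
    atoms = {o for (o, _value) in atomic}
    return any(src == obj and lab == label and dst in atoms
               for (src, dst, lab) in link)

def restaurant_extension():
    """Objects that satisfy every conjunct in the body of the Restaurant rule."""
    sources = {src for (src, _dst, _lab) in link}
    required = ("Name", "Entree", "Manager")
    return {x for x in sources
            if all(has_atomic_link(x, lab) for lab in required)}

print(restaurant_extension())  # {7}
```

Object 20 is excluded because it has no Entree or Manager link, so the rule body cannot be satisfied for it.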

Arbitrary type programs are not allowed.

Rules type_i(X) can only be built from the following:

1. link(Y, X, c) & type_j(Y)

2. link(X, Y, c) & type_j(Y)

3. link(X, Y, c) & atomic(Y, Z)

Types can only express local characteristics.

The collection of typed links is a set.

(2 entrées = 1 entrée)
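The set semantics can be illustrated with a small Python sketch that collects the typed links around an object. This is a simplification under stated assumptions: since no types have been computed yet, non-atomic neighbours are tagged only as "complex" rather than with a type_j, and the function name `local_picture` and the sample facts are hypothetical.

```python
def local_picture(obj, link, atomic_objs):
    """Set of (direction, label, kind) entries describing the links around obj.

    Because the result is a set, repeated links with the same label and
    kind collapse into a single entry.
    """
    picture = set()
    for src, dst, label in link:
        if src == obj:
            kind = "atomic" if dst in atomic_objs else "complex"
            picture.add(("out", label, kind))
        if dst == obj:
            picture.add(("in", label, "complex"))
    return picture

# Hypothetical data: object 1 has one entree, object 4 has two.
link = [
    (1, 2, "Name"), (1, 3, "Entree"),
    (4, 5, "Name"), (4, 6, "Entree"), (4, 7, "Entree"),
]
atomic_objs = {2, 3, 5, 6, 7}

one_entree = local_picture(1, link, atomic_objs)
two_entrees = local_picture(4, link, atomic_objs)
print(one_entree == two_entrees)  # True
```

Both restaurants end up with the same local description, which is the sense in which two entrées count the same as one.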

[Figure: an object X and its typed links, labeled c_j, c_j, and c_0.]

The greatest fixpoint of a datalog program on a database defines the semantics of the typing.

Fixpoint = Extensions of IDB’s + EDB’s

- Least fixpoint
- Start with a model of only the EDBs.
- At each step, union into the model anything new.

- Greatest fixpoint
- Start with a model of the EDBs and all possible extensions.
- At each step, remove any extensions not derived by applying the rules.

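Least fixpoint doesn’t work for mutually recursive types. Starting from empty extensions, neither rule below can ever fire, because each one needs the other type to be non-empty first:

person(X) :- link(X, Y, is-manager-of) & firm(Y) &
             link(X, Yp, name) & atomic(Yp, Z)

firm(X) :- link(X, Y, is-managed-by) & person(Y) &
           link(X, Yp, name) & atomic(Yp, Z)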
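The difference between the two fixpoints can be checked on a tiny database. The sketch below hard-codes the person/firm rules and iterates from both starting points; the object names ("p", "f", "pn", "fn"), the facts, and the helper functions are all hypothetical, chosen only to give one manager and one firm that refer to each other.

```python
# Hypothetical EDB: person "p" manages firm "f"; both have atomic names.
link = {
    ("p", "f", "is-manager-of"), ("p", "pn", "name"),
    ("f", "p", "is-managed-by"), ("f", "fn", "name"),
}
atomic = {("pn", "Jim"), ("fn", "AA+ Management")}
objects = {"p", "f", "pn", "fn"}
atoms = {o for (o, _value) in atomic}

def has_name(x):
    """True if x has a name edge to an atomic object."""
    return any(s == x and l == "name" and d in atoms for (s, d, l) in link)

def step(person, firm):
    """Apply both rules once to the given extensions; return what they derive."""
    new_person = {x for x in objects if has_name(x) and any(
        s == x and l == "is-manager-of" and d in firm for (s, d, l) in link)}
    new_firm = {x for x in objects if has_name(x) and any(
        s == x and l == "is-managed-by" and d in person for (s, d, l) in link)}
    return new_person, new_firm

def fixpoint(person, firm, grow):
    """Iterate to a fixpoint, adding derived facts (grow) or pruning underived ones."""
    while True:
        derived_person, derived_firm = step(person, firm)
        if grow:
            nxt = (person | derived_person, firm | derived_firm)
        else:
            nxt = (person & derived_person, firm & derived_firm)
        if nxt == (person, firm):
            return nxt
        person, firm = nxt

least = fixpoint(set(), set(), grow=True)
greatest = fixpoint(set(objects), set(objects), grow=False)
print(least)     # (set(), set())
print(greatest)  # ({'p'}, {'f'})
```

Growing from nothing never classifies anything, while pruning from "every object has every type" leaves exactly the intended person and firm.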

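Defect: a measure of how well an object fits a given type.

Defect = excess + deficit

type1 = manager0 + name0 + entree0

Defect is 2 for assigning 11 to type1: its # seats link is excess, and a Manager link is missing.

[Figure: object 7 has edges Manager, Name, and Entree to objects 4, 5, and 6, with values “The Keg”, “Steak”, and “Jim”. Object 11 has edges # seats, Name, and Entree to objects 8, 9, and 10, with values “McD”, “biscuit”, and 53.]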

- Excess: # of EDB’s not used to validate any object’s type.
- Deficit: Minimum # of ground facts that need to be added to make all type derivations possible.
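As a rough illustration, the defect of assigning a single object to a single type can be computed with set differences. This sketch is a local simplification of the definitions above, which count facts across the whole database: it assumes every typed link goes to an atomic object, represents a type just by its set of required labels, and assumes type1 requires Manager, Name, and Entree links, as the figure labels suggest.

```python
def defect(object_labels, type_labels):
    """Return (excess, deficit, defect) for assigning one object to one type.

    excess:  links the object has that the type does not use
    deficit: links the type requires that the object lacks
    """
    excess = len(object_labels - type_labels)
    deficit = len(type_labels - object_labels)
    return excess, deficit, excess + deficit

type1 = {"Manager", "Name", "Entree"}       # assumed definition of type1
obj7 = {"Manager", "Name", "Entree"}        # object 7 in the figure
obj11 = {"# seats", "Name", "Entree"}       # object 11 in the figure

print(defect(obj7, type1))   # (0, 0, 0)
print(defect(obj11, type1))  # (1, 1, 2)
```

Object 11 has one unused fact (# seats) and needs one added fact (a Manager link), which gives the defect of 2 stated on the slide.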


Gore.

[Figure: objects O1, O2, and O3 with edge labels Name, Country, Team, and Movie. Atomic values include France, England, Scholes, Cantona, Binoche, Man Utd, Rocky Horror, Bleu, and Star Trek.]

How hard is it to choose the types for the cover?

How do you quantify atomization?

Define a distance function between two types:

First approximation is the difference between the bodies of their rule definitions.

t1 :- a0, b2
t2 :- a0, b1
t3 :- b2, b1, b3

d(t1, t2) = 2
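One way to read "difference between the bodies" is as the size of the symmetric difference of the two rule bodies, which reproduces d(t1, t2) = 2. The sketch below makes that assumption explicit; the slides do not spell out the exact formula, so treat `distance` as an illustrative guess rather than the paper's definition.

```python
def distance(body1, body2):
    """Number of typed links that appear in exactly one of the two rule bodies."""
    return len(body1 ^ body2)  # ^ is symmetric difference on sets

t1 = {"a0", "b2"}
t2 = {"a0", "b1"}
t3 = {"b2", "b1", "b3"}

print(distance(t1, t2))  # 2
print(distance(t1, t3))  # 3
print(distance(t2, t3))  # 3
```

Under this reading, t1 and t2 are the closest pair, so they would be the first candidates for coalescing.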

Include some measure of the weight of a type (# of objects of that type):

t2 ~> t1

Some desirable properties:

- increasing in d: coalesce similar types
- decreasing in w1: compensate for ‘expected noise’
- increasing in w2: maintain types with large extents

Choosing what to coalesce is hard!

Assign each object to types within the k types formed in stage 2.

(Optional) Choose a better value of k and rerun step 2.

- Heavy use of synthetic data.
- Create a type definition and generate instances that are perturbed randomly in some way.

- What do the graphs show?
- Are the data sets realistic?

- Paper problems:
- The algorithm isn’t completely explained.
- Many comments are not elaborated.

- But, it’s an important problem and good first approach.