Optimal Probabilistic Generators for XML Collections


Optimal Probabilistic Generators for XML Collections. Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart [ICDT 2012]. Adding probabilities to an XML Schema. XML schemas are useful for describing the structures of XML documents, e.g., DTD or XSD.

Presentation Transcript

Optimal Probabilistic Generators for XML Collections

Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart

[ICDT 2012]

Adding probabilities to an XML Schema


XML schemas are useful for describing the structures of XML documents.

  • E.g., DTD or XSD

Schemas may be very general (e.g., XHTML, RSS).

We want to add probabilities that reflect the likelihood of different parts of the schema

  • We will use the probabilities to turn the schema into a probabilistic generative model for XML documents
  • In particular, we want them to maximize the likelihood of a given XML document or document collection


Motivation

One Application: XML Auto-Completion [SIGMOD 2012]

Based on previous document versions or a corpus of example documents, suggest nodes, sub-trees, or node values to the user.

For example:

[Figure: an example document with the entries "XML for Beginners" (M. Jones, H. Q. David, L. Martin, S. Smith) and "Advanced XML" (M. Jones, J. E. Peterson, G. L. Williams).]

Challenges:

  • Allow editing every part of the document
  • What kind of completion to suggest?
  • Finding the top-k best completions

Many Other Usages for a Probabilistic Schema
  • Testing – e.g., generating many XML messages to simulate network load and test system performance.
  • Explaining – e.g., a probabilistic schema for DBLP may show which types of publications are rarely used, which kinds of attributes are not filled for BibTeX, etc.
  • Schema Evaluation – how well a given schema describes a given corpus.


...

Our solution - An Outline

  • Preliminaries – Tree Automata
  • Generators for Schemas without Constraints
  • Adding Constraints
    • Restart Generators
    • Continuation-Test Generators
  • Leaf Values

Schema as a Deterministic Tree Automaton

An XML document is modeled as an ordered tree.

[Figure: Document d0, a tree with root r whose children are labeled a, b, c, followed by $; the leaf values shown are abcd, 532, abcd.]

Schema validation: the children of an a-labeled node are accepted by DFA Aa.

[Figure: Automaton Ar with states q0, q1, q2 and transitions labeled a, b, c, $; L(Ar) = a*bc*$.]

Validation is performed for the children of every inner node.
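The validation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transition table `AR` is one DFA consistent with L(Ar) = a*bc*$ (the slide gives the language and the state names, not the table), and `Node`, `accepts`, `validate`, and `SCHEMA` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field

# Assumed encoding of a DFA for L(Ar) = a*bc*$.
# Keys are (state, child label); '$' marks the end of the child sequence.
AR = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "c"): "q1", ("q1", "$"): "q2"}

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def accepts(dfa, start, final, labels):
    """Run the DFA over the child labels followed by the end marker '$'."""
    state = start
    for symbol in list(labels) + ["$"]:
        if (state, symbol) not in dfa:
            return False
        state = dfa[(state, symbol)]
    return state in final

def validate(node, schema):
    """Check the children of every inner node against the DFA of its label."""
    if not node.children:            # simplification: leaves are not checked
        return True
    if node.label not in schema:
        return False
    dfa, start, final = schema[node.label]
    child_labels = [child.label for child in node.children]
    return accepts(dfa, start, final, child_labels) and all(
        validate(child, schema) for child in node.children)

# Hypothetical schema: only the root label r has an automaton here.
SCHEMA = {"r": (AR, "q0", {"q2"})}
d0 = Node("r", [Node("a"), Node("b"), Node("c")])
```

With these definitions, `validate(d0, SCHEMA)` accepts the document d0, while a root whose children are c, a is rejected.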

Preliminaries

Using the Schema as a Generator


Recall that we want to turn the schema from an acceptor into a probabilistic generative model.

Straightforward nondeterministic generator: repeatedly choose an accepting run for a node's automaton, and generate children accordingly.

Adding probabilities: we consider two problem settings

  • Generating documents that are accepted by the schema, while maximizing the likelihood of a corpus.
  • Additionally, imposing integrity constraints on the documents (e.g., key constraints)

Probabilistic Generator

[Figure: Automaton Ar (states q0, q1, q2) with probabilities pa, pb, pc, p$ on the transitions labeled a, b, c, $, next to a document with root r and children a, a, b, $.]

Without Constraints


Each transition is assigned a probability

We assume independent choices (a Markovian process), so the probability of a document is the product of the probabilities of the transitions used to generate it.

In this case, Pr(d) = pa∙pa∙pb∙p$

The schema and generator ignore leaf values (for now!)
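A minimal Python sketch of such a generator for the children of a single r-labeled node follows. The transition table `NEXT` is an assumed DFA for L(Ar) = a*bc*$, the numbers in `PROBS` are illustrative values rather than figures from the talk, and the function names are hypothetical.

```python
import random

# Assumed DFA for L(Ar) = a*bc*$ and illustrative transition probabilities.
NEXT = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "c"): "q1", ("q1", "$"): "q2"}
PROBS = {
    "q0": {"a": 0.5, "b": 0.5},   # pa, pb
    "q1": {"c": 0.5, "$": 0.5},   # pc, p$
}

def generate_children(rng, start="q0", final="q2"):
    """Sample a child sequence by making one independent choice per step."""
    state, word = start, []
    while state != final:
        symbols = list(PROBS[state])
        weights = [PROBS[state][s] for s in symbols]
        symbol = rng.choices(symbols, weights=weights)[0]
        word.append(symbol)
        state = NEXT[(state, symbol)]
    return word

def probability(word, start="q0"):
    """Product of the probabilities of the transitions used along the word."""
    state, p = start, 1.0
    for symbol in word:
        p *= PROBS.get(state, {}).get(symbol, 0.0)
        if p == 0.0:
            return 0.0
        state = NEXT[(state, symbol)]
    return p
```

For the document in the figure, `probability(["a", "a", "b", "$"])` multiplies pa, pa, pb, and p$, matching the formula above.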

Formal Problem Definition


Given a corpus D of documents and a deterministic schema S that accepts every document in D, we want to find an optimal generator based on S:

  • Find probabilities for the transitions of S that maximize the probability of generating D,
  • i.e., the maximum likelihood estimator (MLE).
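One way to write this objective is shown below. The notation is an assumption made for illustration: p(q, σ) denotes the probability assigned to the transition leaving state q on symbol σ, and Pr_p(d) is the probability of generating document d under the assignment p.

```latex
\hat{p} \;=\; \arg\max_{p} \prod_{d \in D} \Pr\nolimits_{p}(d)
\quad \text{subject to} \quad
\sum_{\sigma} p(q, \sigma) = 1 \ \text{ for every state } q \text{ of } S,
\qquad p(q, \sigma) \ge 0 .
```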

A Learning Algorithm

The frequency of using each transition during the corpus verification process is recorded.

[Figure: a document with root r and children a, b, c, $, next to automaton Ar (states q0, q1, q2); each of the transitions a, b, c, $ has a count of 1.]
An Algorithm for Probabilities Learning (Cont.)

This is repeated for every node in every corpus document.

We set the probability of each transition to be its relative frequency.

[Figure: the count of each of the four transitions in the automaton is divided by 2.]

  • Theorem: This efficient algorithm learns the MLE probabilities, i.e., it finds an optimal probabilistic generator.
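The counting and normalization steps can be sketched as follows. The table `NEXT` is an assumed DFA for L(Ar) = a*bc*$, and the corpus is represented, for illustration only, as a list of child-label sequences (one per validated inner node), each ending with '$'; `learn` is a hypothetical name.

```python
from collections import Counter, defaultdict

# Assumed DFA for L(Ar) = a*bc*$.
NEXT = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "c"): "q1", ("q1", "$"): "q2"}

def learn(corpus, start="q0"):
    """Count how often each transition is used, then normalize per state."""
    counts = Counter()
    for word in corpus:
        state = start
        for symbol in word:
            counts[(state, symbol)] += 1
            state = NEXT[(state, symbol)]
    totals = defaultdict(int)
    for (state, _symbol), n in counts.items():
        totals[state] += n
    # Relative frequency among the transitions leaving the same state.
    return {key: n / totals[key[0]] for key, n in counts.items()}

probs = learn([["a", "b", "c", "$"]])
```

On the single-document corpus a, b, c, $, every transition is used once and each state has two outgoing uses in total, so every learned probability is 1/2.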

Termination


Theorem: generation terminates with probability 1.

  • Guaranteed only because of the choice of probabilities according to the corpus.

Integrity Constraints


We want to support integrity constraints, which are used in XML schema languages.

Key Constraint: the values of a-labeled leaves are unique (unary key)

Inclusion Constraint: the values of a-labeled leaves are contained in those of b-labeled leaves

Domain Constraint: the values of a-labeled leaves belong to some (finite or infinite) domain
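To make the three constraint types concrete, here is a small Python sketch that checks them against leaf values that are already assigned. The representation is an assumption for illustration: `values` maps each leaf label to the list of values of its leaves, and a domain is given as a membership predicate so that it may be infinite. The function name `check_constraints` is hypothetical.

```python
def check_constraints(values, keys=(), inclusions=(), domains=None):
    """values maps a leaf label to the list of values of its leaves."""
    domains = domains or {}
    for label in keys:                        # key: all values are distinct
        vals = values.get(label, [])
        if len(vals) != len(set(vals)):
            return False
    for inner, outer in inclusions:           # inclusion: inner values appear in outer
        if not set(values.get(inner, [])) <= set(values.get(outer, [])):
            return False
    for label, allowed in domains.items():    # domain: membership predicate
        if not all(allowed(v) for v in values.get(label, [])):
            return False
    return True
```

For example, with a key on a and the inclusion of a in b, the values a = [x, y], b = [x, y, z] pass, while a = [x, x] violates the key.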


Adding Constraints

New Problem

[Figure: two example trees rooted at r, with children labeled a, b, c.]

  • We want to find optimal generators for XML schemas with constraints.
  • Valid generator output: an XML document which
    • is accepted by the schema, and
    • admits a valid leaf value assignment, i.e., one that does not violate the constraints.
  • Example: the values of a-, b-, and c-labeled leaves are unique, and each of the three value sets is included in the others.

Restart Generators


A simple idea:

  • Use a probabilistic generator to generate a document
  • Check if it has a value assignment valid w.r.t. the constraints
  • If not, 'restart' and try again until a valid document is generated

Proposition: Given a document with no values, checking for the existence of a valid value assignment is in PTIME

  • Proof: By translating the constraints to bounds on the number of unique values for each leaf label
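The restart idea and the counting argument can be sketched together in Python. This is a simplified illustration, not the paper's procedure: it handles only unary keys and inclusion constraints over an unbounded domain, for the children of a single node, and it assumes a DFA for L(Ar) = a*bc*$ with illustrative probabilities. All names are hypothetical.

```python
import random
from collections import Counter

# Assumed DFA for L(Ar) = a*bc*$ and illustrative transition probabilities.
NEXT = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "c"): "q1", ("q1", "$"): "q2"}
PROBS = {"q0": {"a": 0.5, "b": 0.5}, "q1": {"c": 0.5, "$": 0.5}}

def generate(rng, start="q0", final="q2"):
    state, word = start, []
    while state != final:
        symbols = list(PROBS[state])
        symbol = rng.choices(symbols, weights=[PROBS[state][s] for s in symbols])[0]
        word.append(symbol)
        state = NEXT[(state, symbol)]
    return word

def has_valid_assignment(word, keys, inclusions):
    """Bound the number of distinct values each label needs and can supply."""
    counts = Counter(s for s in word if s != "$")
    labels = set(counts) | set(keys) | {l for pair in inclusions for l in pair}
    # A key label needs one distinct value per leaf; any other label that
    # occurs needs at least one value.
    need = {l: counts[l] if l in keys else min(counts[l], 1) for l in labels}
    changed = True
    while changed:                 # propagate lower bounds along inclusions
        changed = False
        for inner, outer in inclusions:
            if need[inner] > need[outer]:
                need[outer] = need[inner]
                changed = True
    # A label cannot hold more distinct values than it has leaves.
    return all(need[l] <= counts[l] for l in labels)

def generate_with_restarts(rng, keys, inclusions, max_restarts=1000):
    for restarts in range(max_restarts + 1):
        word = generate(rng)
        if has_valid_assignment(word, keys, inclusions):
            return word, restarts
    raise RuntimeError("no valid document within the restart budget")
```

With c unique and c included in a, the check accepts a, b, c, $ and rejects b, c, $, since a single c value would need an a-leaf to contain it.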

Bad news: the number of restarts can be unboundedly large in an optimal generator.

Continuation-test Generators

Never make choices that lead to a 'dead end', and thus always generate a valid document.

We use a binary test to check if a choice has a continuation.

Example: add to the schema of d0 the constraints:

  • c is included in a
  • c is unique

Together, these imply |c| ≤ |a|.

The generation process: perform a continuation test before taking each transition.

[Figure: Automaton Ar (states q0, q1, q2) with probabilities pa, pb, pc, p$ on the transitions labeled a, b, c, $, generating a document with root r and children a, b, c, $.]

Pr(d) = pa∙pb∙pc∙1 (the final $ has probability 1, because a second c would violate |c| ≤ |a| and so $ is the only available choice).
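A Python sketch of this generation process for the running example follows. The continuation test here is hand-coded for the two example constraints (it simply forbids a c once there are as many c-s as a-s); it stands in for the general test and is not the paper's algorithm. The DFA table, the probabilities, and all names are assumptions for illustration, and the sketch covers binary choices only.

```python
import random
from collections import Counter

# Assumed DFA for L(Ar) = a*bc*$ and illustrative transition probabilities.
NEXT = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "c"): "q1", ("q1", "$"): "q2"}
PROBS = {"q0": {"a": 0.5, "b": 0.5}, "q1": {"c": 0.5, "$": 0.5}}

def has_continuation(symbol, counts):
    """Hand-coded test for: c is unique and c is included in a."""
    return not (symbol == "c" and counts["c"] >= counts["a"])

def options(state, counts):
    return [s for s in PROBS.get(state, {}) if has_continuation(s, counts)]

def generate_ct(rng, start="q0", final="q2"):
    state, counts, word = start, Counter(), []
    while state != final:
        avail = options(state, counts)
        if len(avail) == 1:          # forced choice, taken with probability 1
            symbol = avail[0]
        else:
            symbol = rng.choices(avail, weights=[PROBS[state][s] for s in avail])[0]
        word.append(symbol)
        counts[symbol] += 1
        state = NEXT[(state, symbol)]
    return word

def probability_ct(word, start="q0"):
    state, counts, p = start, Counter(), 1.0
    for symbol in word:
        avail = options(state, counts)
        if symbol not in avail:
            return 0.0
        if len(avail) > 1:           # only free choices contribute a factor
            p *= PROBS[state][symbol]
        counts[symbol] += 1
        state = NEXT[(state, symbol)]
    return p
```

For d0, `probability_ct(["a", "b", "c", "$"])` multiplies pa, pb, and pc and then a factor of 1 for the forced $.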

Learning Algorithm for Continuation-test Generators

The probabilities are again relative frequencies, but computed only over cases where there was an alternative choice.

[Figure: the automaton with its transition counts normalized by /2, /2, /1, /1.]

  • (q1, $) was chosen only when (q1, c) was not available.

The learned generator will generate as many c-s as a-s.
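The modified counting rule can be sketched in Python as follows, again with a hand-coded continuation test for the example constraints (c unique, c included in a) and an assumed DFA for L(Ar) = a*bc*$. The names `SYMBOLS`, `has_continuation`, and `learn_ct` are hypothetical, and the sketch covers binary choices only.

```python
from collections import Counter

# Assumed DFA for L(Ar) = a*bc*$ and the symbols leaving each state.
NEXT = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "c"): "q1", ("q1", "$"): "q2"}
SYMBOLS = {"q0": ["a", "b"], "q1": ["c", "$"]}

def has_continuation(symbol, seen):
    """Hand-coded test for: c is unique and c is included in a."""
    return not (symbol == "c" and seen["c"] >= seen["a"])

def learn_ct(corpus, start="q0"):
    counts, totals = Counter(), Counter()
    for word in corpus:
        state, seen = start, Counter()
        for symbol in word:
            avail = [s for s in SYMBOLS[state] if has_continuation(s, seen)]
            if len(avail) > 1:       # count only choices that had an alternative
                counts[(state, symbol)] += 1
                totals[state] += 1
            seen[symbol] += 1
            state = NEXT[(state, symbol)]
    return {(q, s): counts[(q, s)] / totals[q]
            for q in SYMBOLS for s in SYMBOLS[q] if totals[q]}

learned = learn_ct([["a", "b", "c", "$"]])
```

On the corpus a, b, c, $, the final $ is a forced choice and is not counted, so the learned probability of c at q1 is 1 and that of $ is 0; the generator then emits a c whenever one is available.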

Results for Continuation-test Generators


Theorem: The algorithm learns an optimal continuation-test generator, for automata with binary choices.

  • Extensions to non-binary are discussed in the paper

Theorem: Continuation-test is NP-Complete

  • But only in the size of the schema; it is polynomial in the document size
  • Both generation and finding the optimal generator are polynomial when using a continuation-test oracle.
  • Based on schema satisfiability test [David et al. 2011]

Theorem: probability of termination for a continuation-test generator may be arbitrarily small!

  • Proof – by construction of a simple, non-recursive schema
  • Can be handled by adding a constraint on the document size.
  • Sub-classes of schemas that guarantee termination?

Adding Values to the Structure


So far our generators were used only for the document structure

Leaf values may also have a distribution according to which they can be generated

  • The distribution may be learned from the same document collection

We will focus on the interesting case – generating leaf values for a schema with constraints


Leaf Values

Suggested Algorithm

[Figure: a document with root r and children a, b, c, $; the leaf values shown are abcd, efg, abcd.]

We start with a valid document skeleton.

Order the labels by the inclusion constraints (e.g., c, b, a).

Choose a leaf from the 'smallest' (most included) label, together with leaves of the labels that include it.

Draw a value (from the domain) according to a given distribution.

Use the PTIME test to verify validity; if the test fails, revert the step.

Improvements are presented in the paper.

Related Work


Schema Satisfiability tests [Fan & Libkin 2001; David, Libkin & Tan 2011]

Probabilistic XML and Probabilistic Schemas [e.g., Benedikt, Kharlamov, Olteanu & Senellart 2010]

Probabilistic XML generation [e.g., Antonopoulos, Geerts, Martens & Neven 2011]

Schema Inference [e.g., Bex, Gelade, Neven & Vansummeren 2008]

AXML [Abiteboul, Benjelloun & Milo 2008]

PCFGs [e.g., Chi & Geman 1998]


Summary

Conclusion


A model for probabilistic XML generators

Unconstrained case

  • Generation and learning optimal generators can be done efficiently
  • Termination is guaranteed

Constrained case

  • Restart generator
    • # of restarts is unbounded
  • Continuation-test generators
    • Generation and learning optimal generators are expensive
    • Termination is not guaranteed

Leaf Value generation

In the talk labels and states are coupled (as in a DTD), but all the results hold when they are uncoupled.

Future work

  • More efficient combinations of restart and continuation-test generators

Thank You!


Q&A
