
### Optimal Probabilistic Generators for XML Collections

Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart

[ICDT 2012]

Adding probabilities to an XML Schema

XML schemas are useful for describing the structures of XML documents.

- E.g., DTD or XSD

Schemas may be very general (e.g., xhtml, RSS)

We want to add probabilities that reflect the likelihood of different parts of the schema

- We will use the probabilities to turn the schema into a probabilistic generative model for XML documents
- In particular, we want them to maximize the likelihood of a given XML document or document collection


One Application: XML Auto-Completion [SIGMOD 2012]

Based on previous document versions / corpus of example documents

Suggest nodes / sub-trees / node values to the user

For example: [figure omitted; it showed the strings "XML for Beginners" and "Advanced XML"]

Challenges:

- Allow editing every part of the document
- What kind of completion to suggest?
- Finding the top-k best completions


Many Other Usages for a Probabilistic Schema

- Testing – e.g., generating many XML messages to simulate network load and test system performance.
- Explaining – e.g., a probabilistic schema for DBLP may show which types of publications are rarely used, which kinds of attributes are not filled for BibTex, etc.
- Schema Evaluation – how well a given schema describes a given corpus.


Our solution - An Outline

Preliminaries – Tree Automata

Generators for Schemas without Constraints

Adding Constraints

Restart Generators

Continuation-Test Generators

Leaf Values


Schema as a Deterministic Tree Automaton

An XML document is modeled as an ordered tree.

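Document d0: [Figure: a tree with root r whose children are labeled a, b, c, followed by the end marker $; the leaf values shown are abcd, 532, abcd]

Schema validation: the children of an a-labeled node are accepted by DFA Aa

Automaton Ar: (L(Ar) = a*bc*$) [Figure: states q0, q1, q2 with transitions labeled a, b, c, $]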

Validation is performed for the children of every inner node.


Using the Schema as a Generator

Recall that we want to turn the schema from an acceptor into a probabilistic generative model.

Straightforward nondeterministic generator: repeatedly choose an accepting run for a node's automaton, and generate children accordingly.

Adding probabilities: we consider two problem settings

- Generating documents that are accepted by the schema, while maximizing the likelihood of a corpus.
- Additionally, imposing integrity constraints on the documents (e.g., key constraints)


Probabilistic Generator

[Figure: automaton Ar (states q0, q1, q2) with transition probabilities pa, pb, pc, p$, next to a generated document d with root r and children a, a, b, $]

Each transition is assigned a probability

We assume independent choices (a Markovian process); thus the document probability is the product of the probabilities of the transitions taken.

In this case, Pr(d)=pa∙pa∙pb∙p$

The schema and generator ignore leaf values (for now!)
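As a minimal sketch of this product rule, the snippet below walks a child sequence through the automaton Ar and multiplies the probabilities of the transitions it takes. The dictionary encoding, the function name, and the numeric probabilities are assumptions made for illustration; they do not come from the talk.

```python
from math import prod

# Hypothetical encoding of the automaton A_r with L(A_r) = a*bc*$:
# TRANSITIONS[state][label] = (next_state, probability).
# The numeric probabilities are made up for illustration.
TRANSITIONS = {
    "q0": {"a": ("q0", 0.5), "b": ("q1", 0.5)},
    "q1": {"c": ("q1", 0.25), "$": ("q2", 0.75)},
}

def children_probability(children, start="q0"):
    """Probability of generating this child sequence (ending in '$')."""
    state, factors = start, []
    for label in children:
        state, p = TRANSITIONS[state][label]
        factors.append(p)
    return prod(factors)

# Children a, a, b, $ as on the slide: Pr(d) = pa * pa * pb * p$
pr_d = children_probability(["a", "a", "b", "$"])
```

With these made-up values, `pr_d` is 0.5 · 0.5 · 0.5 · 0.75.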

Formal Problem Definition

Given a corpus D of documents and a deterministic schema S that accepts every document in D, we want to find an optimal generator based on S:

- Find probabilities for the transitions of S that maximize the probability of generating D,
- i.e., the maximum likelihood estimator (MLE).
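One way to write this objective, as a sketch under the independence assumption stated earlier: let $p_{q,\sigma}$ be the probability assigned to the transition from state $q$ on label $\sigma$, and let $n_{q,\sigma}$ be the number of times that transition is used when validating D. Both symbols are notation introduced here, not taken from the talk.

$$
\max_{p}\ \Pr(D) \;=\; \prod_{(q,\sigma)} p_{q,\sigma}^{\,n_{q,\sigma}}
\qquad \text{subject to} \qquad \sum_{\sigma} p_{q,\sigma} = 1 \ \text{ for every state } q .
$$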


A Learning Algorithm

The frequency of using each transition during the corpus verification process is recorded.

[Figure: a document with root r and children a, b, c, $, validated by automaton Ar (states q0, q1, q2); each of the transitions a, b, c, $ is annotated with the count 1]

An Algorithm for Learning Probabilities (Cont.)

This is repeated for every node in every corpus document.

We set the probability of each transition to be its relative frequency.

[Figure: the four transition counts, each divided by 2]

- Theorem: This efficient algorithm learns the MLE probabilities, i.e., it finds an optimal probabilistic generator
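The counting and normalization steps can be sketched as follows. This is an illustrative simplification that uses a single automaton (Ar) and a flat list of child sequences, whereas the talk applies the procedure to every node of every document; the names `NEXT` and `learn_probabilities` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical encoding of A_r (L(A_r) = a*bc*$): NEXT[state][label] = next state.
NEXT = {"q0": {"a": "q0", "b": "q1"}, "q1": {"c": "q1", "$": "q2"}}

def learn_probabilities(child_sequences, start="q0"):
    """Count transition uses over all child sequences, then normalize per state."""
    counts = defaultdict(Counter)
    for children in child_sequences:
        state = start
        for label in children:
            counts[state][label] += 1
            state = NEXT[state][label]
    return {
        state: {label: n / sum(by_label.values()) for label, n in by_label.items()}
        for state, by_label in counts.items()
    }

# Corpus with the single child sequence a, b, c, $:
# every transition is used once, and each state has two outgoing uses.
probs = learn_probabilities([["a", "b", "c", "$"]])
```

For this one-document corpus every transition ends up with probability 1/2, matching the counts of 1 divided by 2 on the slide.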


Termination

Theorem: generation terminates with probability 1.

- Guaranteed only because of the choice of probabilities according to the corpus.


Integrity Constraints

We want to support integrity constraints, which are used in XML schema languages.

Key Constraint: a-labeled leaves have unique values (unary key)

Inclusion Constraint: the values of a-labeled leaves are contained in those of b-labeled leaves

Domain Constraint: the values of a-labeled leaves belong to some (finite or infinite) domain
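To make the three constraint types concrete, here is a small sketch that checks each of them on a document whose leaf values are already assigned. The representation (a dictionary from leaf label to the list of its values) and the function names are assumptions chosen for illustration.

```python
def check_key(values_by_label, a):
    """Key constraint: a-labeled leaves have pairwise distinct values."""
    vals = values_by_label.get(a, [])
    return len(vals) == len(set(vals))

def check_inclusion(values_by_label, a, b):
    """Inclusion constraint: every value of an a-leaf also occurs on a b-leaf."""
    return set(values_by_label.get(a, [])) <= set(values_by_label.get(b, []))

def check_domain(values_by_label, a, in_domain):
    """Domain constraint: every a-leaf value satisfies the membership predicate."""
    return all(in_domain(v) for v in values_by_label.get(a, []))

# Made-up leaf values, grouped by leaf label.
leaves = {"a": ["x1", "x2"], "b": ["x1", "x2", "x3"], "c": ["7", "7"]}
key_a = check_key(leaves, "a")             # a-values are distinct
incl_ab = check_inclusion(leaves, "a", "b")  # a-values all occur among b-values
key_c = check_key(leaves, "c")             # c-values repeat, so the key fails
dom_c = check_domain(leaves, "c", str.isdigit)
```

Passing the domain as a predicate is a design choice that lets the same check cover both finite and infinite domains.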


New Problem

[Figure: two example document trees rooted at r, with leaves labeled a, b, and c]

- We want to find optimal generators for XML schemas with constraints.
- Valid generator output: an XML document that
  - is accepted by the schema, and
  - has a valid leaf value assignment, i.e., one that does not violate the constraints
- Example: a, b, c are unique and contain each other

Restart Generators

A simple idea:

- Use a probabilistic generator to generate a document
- Check if it has a value assignment valid w.r.t. the constraints
- If not, 'restart' and try again until a valid document is generated
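The three steps above can be sketched as a loop. The sketch assumes, purely for illustration, the automaton Ar with made-up probabilities and the two constraints "c is unique" and "c is included in a"; for that pair a valid value assignment exists exactly when the document has no more c-leaves than a-leaves (given enough distinct values in the domain), so a simple count stands in for the general validity check. All names are hypothetical.

```python
import random

# Hypothetical automaton A_r with made-up probabilities:
# STEP[state] = list of (label, next_state, probability).
STEP = {
    "q0": [("a", "q0", 0.5), ("b", "q1", 0.5)],
    "q1": [("c", "q1", 0.5), ("$", "q2", 0.5)],
}

def generate_children(rng):
    """One unconstrained run of the probabilistic generator."""
    state, out = "q0", []
    while state != "q2":
        labels, nexts, weights = zip(*STEP[state])
        i = rng.choices(range(len(labels)), weights=weights)[0]
        out.append(labels[i])
        state = nexts[i]
    return out

def has_valid_assignment(children):
    """Stand-in validity check for 'c is unique' and 'c is included in a'."""
    return children.count("c") <= children.count("a")

def restart_generate(rng, max_restarts=10_000):
    """Generate, test, and restart until the document is valid."""
    for restarts in range(max_restarts):
        children = generate_children(rng)
        if has_valid_assignment(children):
            return children, restarts
    raise RuntimeError("no valid document within the restart budget")

doc, restarts = restart_generate(random.Random(0))
```

The `max_restarts` budget is an addition of this sketch; it keeps the loop from running indefinitely when valid documents are rare.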

Proposition: Given a document with no values, checking for the existence of a valid value assignment is in PTIME

- Proof: By translating the constraints to bounds on the number of unique values for each leaf label

Bad news: number of restarts can be unboundedly large in an optimal generator


Continuation-test Generators

Never make choices that lead to a 'dead end', thus always generate a valid document.

We use a binary test to check if a choice has a continuation.

Example: add to the schema of d0 the constraints:

- c is included in a
- c is unique

Together, these imply |c| ≤ |a|.

The generation process: perform a continuation-test before taking each transition.

[Figure: automaton Ar (states q0, q1, q2) with transition probabilities pa, pb, pc, p$, next to a generated document with root r and children a, b, c, $]

Pr(d) = pa∙pb∙pc∙1 (the final $ is forced, since another c would violate |c| ≤ |a|, so it contributes a factor of 1)
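A minimal sketch of continuation-test generation for this example follows. It assumes the automaton Ar with made-up probabilities and the constraints "c is unique" and "c is included in a". In this particular automaton no a can follow a c, so checking that the number of c-s does not exceed the number of a-s after each step is enough to decide whether a continuation exists; a general continuation test is harder. All names are hypothetical.

```python
import random

# STEP[state] = list of (label, next_state, probability); values are made up.
STEP = {
    "q0": [("a", "q0", 0.5), ("b", "q1", 0.5)],
    "q1": [("c", "q1", 0.5), ("$", "q2", 0.5)],
}

def has_continuation(prefix, label):
    """Stand-in continuation test for 'c unique' and 'c included in a':
    after the choice, the number of c-s must not exceed the number of a-s."""
    new = prefix + [label]
    return new.count("c") <= new.count("a")

def continuation_test_generate(rng):
    """Generate a child sequence, never taking a transition without a continuation."""
    state, out, prob = "q0", [], 1.0
    while state != "q2":
        allowed = [t for t in STEP[state] if has_continuation(out, t[0])]
        if len(allowed) == 1:
            label, nxt, _ = allowed[0]   # forced choice, contributes a factor of 1
        else:
            label, nxt, p = rng.choices(allowed, weights=[t[2] for t in allowed])[0]
            prob *= p
        out.append(label)
        state = nxt
    return out, prob

children, pr = continuation_test_generate(random.Random(1))
```

Every sequence produced this way satisfies |c| ≤ |a|, so no restart is needed.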

Learning Algorithm for Continuation-test Generators

The probabilities are again relative frequencies, but only in cases where there was an alternative choice.

- (q1, $) was chosen only when (q1, c) was not available.

[Figure: the transition counts, with denominators /2, /2, /1, /1]

The learned generator will generate as many c-s as a-s
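The modified counting rule can be sketched as below, again assuming the automaton Ar, the constraints "c is unique" and "c is included in a", and a count-based stand-in for the continuation test. Names are hypothetical, and the sketch covers only the binary-choice case.

```python
from collections import Counter, defaultdict

# Hypothetical encoding of A_r: NEXT[state][label] = next state.
NEXT = {"q0": {"a": "q0", "b": "q1"}, "q1": {"c": "q1", "$": "q2"}}

def has_continuation(prefix, label):
    """Stand-in continuation test: never more c-s than a-s."""
    new = prefix + [label]
    return new.count("c") <= new.count("a")

def learn_continuation_test(child_sequences):
    """Relative frequencies, counting only steps where both choices were available."""
    counts = defaultdict(Counter)
    for children in child_sequences:
        state, prefix = "q0", []
        for label in children:
            allowed = [l for l in NEXT[state] if has_continuation(prefix, l)]
            if len(allowed) > 1:          # skip forced choices
                counts[state][label] += 1
            prefix.append(label)
            state = NEXT[state][label]
    return {
        s: {l: counts[s][l] / sum(counts[s].values()) for l in NEXT[s]}
        for s in counts
    }

probs = learn_continuation_test([["a", "b", "c", "$"]])
```

On the single sequence a, b, c, $ the final $ is a forced step and is not counted, so the transition (q1, c) gets probability 1 and (q1, $) gets probability 0.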

Results for Continuation-test Generators

Theorem: The algorithm learns an optimal continuation-test generator, for automata with binary choices.

- Extensions to non-binary are discussed in the paper

Theorem: Continuation-test is NP-Complete

- But only in the size of the schema; it is polynomial in the document size
- Both generation and finding the optimal generator are polynomial when using a continuation-test oracle.
- Based on schema satisfiability test [David et al. 2011]

Theorem: probability of termination for a continuation-test generator may be arbitrarily small!

- Proof – by construction of a simple, non-recursive schema
- Can be handled by adding a constraint on the document size.
- Sub-classes of schemas that guarantee termination?


Adding Values to the Structure

So far our generators were used only for the document structure

Leaf values may also have a distribution according to which they can be generated

- The distribution may be learned from the same document collection

We will focus on the interesting case – generating leaf values for a schema with constraints


Suggested Algorithm

[Figure: a document skeleton with root r and children a, b, c, $; the leaf values shown are abcd, efg, abcd]

We start with a valid document skeleton

Order labels by inclusion constraints (e.g., c, b, a)

Choose a leaf with the 'smallest' (most included) label, along with leaves of the labels that include it

Draw a value (from the domain) according to a given distribution.

Use the PTIME test to verify validity; if the document is no longer valid, revert the step

Improvements presented in the paper
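A heavily simplified sketch of this draw-and-revert loop is shown below. It assumes only the two constraints "c is unique" and "c is included in a", a uniform distribution over a finite domain, and a direct check of those two constraints in place of the general PTIME test. The function and variable names are hypothetical, and this is not the algorithm from the paper.

```python
import random

def assign_leaf_values(skeleton, domain, rng, max_tries=1000):
    """skeleton: list of leaf labels; returns a list of (label, value) pairs.
    Assumed constraints: 'c is unique' and 'c is included in a'."""
    values = [None] * len(skeleton)
    free_a = [i for i, l in enumerate(skeleton) if l == "a"]
    used_c = set()
    # Most-included label first: each c-leaf shares its value with one a-leaf.
    for i in (i for i, l in enumerate(skeleton) if l == "c"):
        for _ in range(max_tries):
            v = rng.choice(domain)           # draw from the (uniform) distribution
            if v not in used_c and free_a:   # direct check replaces the PTIME test
                break                        # keep the step
            # otherwise the step is reverted and a new value is drawn
        else:
            raise RuntimeError("could not find a valid value")
        used_c.add(v)
        values[i] = v
        values[free_a.pop()] = v
    # Remaining leaves are unconstrained in this example.
    for i in range(len(skeleton)):
        if values[i] is None:
            values[i] = rng.choice(domain)
    return list(zip(skeleton, values))

assigned = assign_leaf_values(["a", "a", "b", "c", "c"], list(range(100)), random.Random(0))
```

The `max_tries` bound is an addition of this sketch, so that an invalid skeleton or an exhausted domain raises an error instead of looping.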


Related Work

Schema Satisfiability tests [Fan & Libkin 2001; David, Libkin & Tan 2011]

Probabilistic XML and Probabilistic Schemas [e.g., Benedikt, Kharlamov, Olteanu & Senellart 2010]

Probabilistic XML generation [e.g., Antonopoulos, Geerts, Martens & Neven 2011]

Schema Inference [e.g., Bex, Gelade, Neven & Vansummeren 2008]

AXML [Abiteboul, Benjelloun & Milo 2008]

PCFGs [e.g., Chi & Geman 1998]


Conclusion

A model for probabilistic XML generators

Unconstrained case

- Generation and learning optimal generators can be done efficiently
- Termination is guaranteed

Constrained case

- Restart generator
  - # of restarts is unbounded
- Continuation-test generators
  - Generation and learning optimal generators are expensive
  - Termination is not guaranteed

Leaf Value generation

In the talk labels and states are coupled (as in a DTD), but all the results hold when they are uncoupled.

Future work

- More efficient combinations of restart and continuation-test generators
