1 / 19

Finding Optimal Probabilistic Generators for XML Collections

Finding Optimal Probabilistic Generators for XML Collections. Serge Abiteboul , Yael Amsterdamer, Daniel Deutch , Tova Milo, Pierre Senellart BDA 2011. Adding probabilities to an XML Schema. The Web is full of semi-structured documents

carney
Download Presentation

Finding Optimal Probabilistic Generators for XML Collections

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Finding Optimal Probabilistic Generators for XML Collections Serge Abiteboul, Yael Amsterdamer, Daniel Deutch, Tova Milo, Pierre Senellart BDA 2011

  2. Adding probabilities to an XML Schema Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer The Web is full of semi-structured documents Given a collection of XML documents, we would like to have a schema the documents conform to. • E.g., DTD or XSD • Restricts the structure, mostly parent-child node relations (using regular expressions) The schema may be very general (e.g., xhtml, RSS) We want to add probabilities to 'guide' the schema • Optimal probabilities – maximize the likelihood of a corpus - 2 - Motivation

  3. Implementation Idea: XML Editor Auto-Completion <MyPapers> <Paper> <title>Provenance Minimization</title> <author>Yael A.</author> <author>Daniel D.</author> <author>Tova M.</author> <author>Val T.</author> </Paper> <Paper> <title>Provenance for Aggregates</title> <author>Yael A.</author> <author>Daniel D./author> <author>Val T.</author> </Paper> <Paper> <title> </title> <author> </author> <author> </author> <author> </author> </Paper> </MyPapers> Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer Based on previous document versions / corpus of user documents / corpus of example documents – Suggest nodes / sub-trees / node values to the user For example: Challenges: • Allow editing in every part of the document • What kind of completion to suggest? • Finding the top-k best completions - 3 - Motivation

  4. Many Other Usages for a Probabilistic Schema Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer Testing – e.g., generating many XML messages to simulate network load and test system performance. Explaining – e.g., the probabilistic schema for DBLP shows which types of publications are rarely used, which kinds of attributes are not filled for BibTex, etc. Querying – e.g., finding the probability that a paper has more than 3 authors.– e.g., finding the top-k best completions to a partial document. Schema Evaluation – how well a given schema describes a given corpus. - 4 - Motivation

  5. Our solution - An Outline Preliminaries – Tree Automata Generators for Schemas without Constraints Adding Constraints Restart Generators Continuation-Test Generators Leaf Values Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer - 5 -

  6. Using a Tree Automaton for Schema Verification An XML document is modeled as an ordered tree. r Document d0: a b c $ The children of a-labeled node are accepted by DFA Aa abcd 532 abcd a c b $ q0 q1 q2 Automaton Ar: (L(Ar) = a*bc*$) This is done for every inner node in a fixed order (BF-LTR) Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer - 6 - Preliminaries

  7. Using the Schema as a (Probabilistic) Generator r a pa pc c b $ q0 q1 q2 a a b $ pb p$ Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer Without Constraints Each transition is assigned a probability We assume independent choices, thus the document probability is the product of all t-probs. Thus, PR(d)=pa∙pa∙pb∙p$ The schema and generator ignore leaf values (for now!) - 7 -

  8. An Algorithm for Probabilities Learning r The frequency each transition is chosen during the corpus verification process is recorded. a b c $ 1 a c 1 b $ q0 q1 q2 1 1 Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer Without Constraints - 8 -

  9. An Algorithm for Probabilities Learning (Cont.) /2 /2 • These probabilities maximize the likelihood of generating the corpus – optimal generator(similar result in PCFGs) /2 /2 Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer This is repeated for every node in every corpus document. We set the probability of each transition to be its relative frequency. - 9 - Without Constraints

  10. Our Results Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer Relative frequencies make optimal probabilities. Optimal probability learning and document generation can be done efficiently. Additional non-trivial result:Generation terminates with probability 1. • Guaranteed because of the choice of probabilities according to the corpus. - 10 - Without Constraints

  11. Integrity Constraints Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer We want to allow the use of the following constraints in the schema: • Key Constraint: the leaves of a-labeled leaves have unique values (unary key) • Inclusion Constraint: the values of a-labeled leaves are contained in those of b-labeled leaves • Domain Constraint: the values of a-labeled leaves belong to some (finite or infinite) domain - 11 - Adding Constraints

  12. Restart Generators Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer A naïve idea: • Use a probabilistic generator to generate a document • Check if it has a value assignment valid w.r.t. the constraints • If not, 'restart' and try again until a valid document is generated Good news: Checking the existence of a valid assignment is in PTIME Bad news: number of restarts can be unboundedly large in an optimal generator • A different quality measure for restart generators? - 12 - Adding Constraints

  13. Continuation-test Generators Perform a continuation-test before taking the transition Implies |c|≤|a| Pr(d) = pa∙pb∙pc∙1 r a pa pc c b $ q0 q1 q2 a b c $ pb p$ Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer Never make choices that lead to a 'dead end', thus always generate a valid document. We use a binary test to check if a choice has a continuation. Add to the schema of d0the constraints: • c is included in a • c is unique The generation process: - 13 - Adding Constraints

  14. Algorithm for Learning Probabilities with Constraints /2 /2 /1 • (q1, $) was chosen only when (q1, c) was not available. /1 Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer The probabilities are again relative frequencies, but –only in cases where there was an alternative choice. The learned generator will generate as many c-s as a-s Adding Constraints - 14 -

  15. Our Results Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer The algorithm learns an optimal continuation-test generator, for automata with binary choices. • Extensions to non-binary are discussed in the paper Bad News: Continuation-test is NP-Complete • But only in the size of the schema; it is polynomial in the document size (not so bad?) • Based on schema satisfiability test [David et al. 2011] More Bad News: probability of termination may be arbitrarily small! • Even for simple, non-recursive schemas • Can be handled by adding a constraint on the document size. • Sub-classes of schemas that guarantee termination? - 15 - Adding Constraints

  16. Suggested Algorithm r a b c $ abcd efg abcd Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer We start with a valid document skeleton Order labels by inclusion constraints (e.g., c, b, a) Choose a leaf from the 'smallest' (most included) label, and including leaves Draw a value (from the domain) according to a given distribution. Use PTIME test to verify validity, if not revert the step - 16 - Leaf Values

  17. Possible improvement to the basic algorithm r a a b $ new old new Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer Annotate the leaves with 'old' or 'new' For 'old' a-labeled leaves choose values already chosen for some a-labeled leaf For 'new' choose a value unused by a-labeled leaves yet Annotations can be learned from the corpus, and generated: • Offline – after the document generation, using a PTIME validity test • Online – during document generation, using a continuation test. • Both methods are incomparable in terms of quality - 17 - Leaf Values

  18. Ideas for Experimental Study on Probabilistic generators Finding Optimal Probabilistic Generators for XML Collections – Yael Amsterdamer How many schemas in practice may require continuation tests? How many have termination probability < 1? Continuation test is expensive, but how expensive is it in practice? 'Competition' between restart and continuation-test generators More? - 18 - Implementation

  19. Thank You! Thank You! Questions, Ideas?...

More Related