Estimating the Selectivity of XML Path Expressions for Internet Scale Applications

Ashraf Aboulnaga, Alaa R. Alameldeen, Jeffrey F. Naughton. Computer Sciences Department, University of Wisconsin - Madison.


Presentation Transcript


  1. Estimating the Selectivity of XML Path Expressions for Internet Scale Applications
     Ashraf Aboulnaga, Alaa R. Alameldeen, Jeffrey F. Naughton
     Computer Sciences Department, University of Wisconsin - Madison

  2. Motivation
     • XML enables Internet scale applications that query data from many sources
     • Niagara, Xyleme, …
     • Queries over XML data use path expressions
     • Optimizing these queries requires estimating the selectivity of the path expressions
     • Focus of this talk: Building statistics for XML data and using them for estimating the selectivity of simple path expressions

  3. What is XML?
     <readings>
       <play>
         <title>Pygmalion</title>
         <author>Bernard Shaw</author>
       </play>
       <novel>
         <title>David Copperfield</title>
         <author>Charles Dickens</author>
       </novel>
     </readings>

  4. Querying XML
     FOR $n_auth IN document("*")//novel/author,
         $p_auth IN document("*")//play/author
     WHERE $n_auth/text() = $p_auth/text()
     RETURN $n_auth
     • Optimizing this query requires estimating the selectivity of the path expressions
     • This requires information about the structure of the XML data

  5. Goal of this Work
     • Build database statistics that capture the structure of XML data
     • Ensure that the statistics fit in a small amount of memory
     • For efficient query optimization
     • Important for Internet scale applications
     • Use the statistics to estimate the selectivity of simple XML path expressions: //t1/t2/…/tn
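
To make "selectivity of a simple path expression" concrete, here is a minimal Python sketch that counts the exact number of elements matched by //t1/t2/…/tn on the small document from slide 3. The helper name `count_path` is an assumption made for illustration and is not part of the talk; the statistics discussed in the talk aim to approximate this count without scanning the data.

```python
import xml.etree.ElementTree as ET

# The example document from slide 3.
DOC = """<readings>
  <play><title>Pygmalion</title><author>Bernard Shaw</author></play>
  <novel><title>David Copperfield</title><author>Charles Dickens</author></novel>
</readings>"""

def count_path(root, tags):
    """Count elements matched by //t1/t2/.../tn (hypothetical helper)."""
    # Start from every element, root included, whose tag is t1.
    frontier = [e for e in root.iter() if e.tag == tags[0]]
    # Follow each remaining tag as a parent-to-child step.
    for tag in tags[1:]:
        frontier = [c for e in frontier for c in e if c.tag == tag]
    return len(frontier)

root = ET.fromstring(DOC)
play_authors = count_path(root, ["play", "author"])   # //play/author
titles = count_path(root, ["title"])                  # //title
print(play_authors, titles)
```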

  6. Outline of Presentation
     • Introduction
     • Path Trees
     • Markov Tables
     • Performance Evaluation
     • Conclusions

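  7. Path Trees
     <A>
       <B> </B>
       <B> <D> </D> </B>
       <C> <D> </D> <E> </E> <E> </E> <E> </E> </C>
     </A>
     [Figure: path tree for this document, as tag(frequency): A(1) with children B(2) and C(1); B has child D(1); C has children D(1) and E(3)]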
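
A path tree like the one on slide 7 can be sketched in Python as a map from each distinct root-to-node tag path to the number of elements reached by that path. This is only an illustrative representation; the function name `path_frequencies` and the use of a `Counter` are assumptions, not the authors' data structure.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The example document from slide 7.
DOC = "<A><B></B><B><D></D></B><C><D></D><E></E><E></E><E></E></C></A>"

def path_frequencies(root):
    """Map each distinct root-to-node tag path to its element count."""
    counts = Counter()

    def walk(elem, prefix):
        path = prefix + (elem.tag,)
        counts[path] += 1          # one more element reached by this path
        for child in elem:
            walk(child, path)

    walk(root, ())
    return counts

tree = path_frequencies(ET.fromstring(DOC))
for path, freq in sorted(tree.items()):
    print("/".join(path), freq)
```

Each key corresponds to one path tree node, so the two D nodes (under B and under C) stay separate, as in the figure.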

  8. Summarizing Path Trees
     • Path trees contain all the information needed for selectivity estimation
     • Problem: May not fit in available memory
     • Small available memory
     • Internet scale
     • Remove low frequency nodes
     • Removed nodes replaced with *-nodes
     • Tag name: * meaning "any tag"
     • Frequency: Average frequency of replaced nodes
     • Sibling-*, Level-*, Global-*, No-*

  9. Sibling-* Summarization [Figure: original path tree; nodes by level as tag(frequency): A(1) | B(13), C(9) | D(7), E(5), F(15), G(10), H(6) | I(2), J(4), K(11), K(12)]

  10. Sibling-* Summarization [Figure: animation frame; same nodes as slide 9]

  11. Sibling-* Summarization [Figure: animation frame; same nodes as slide 9]

  12. Sibling-* Summarization [Figure: animation frame; same nodes as slide 9]

  13. Sibling-* Summarization [Figure: I(2) and J(4) replaced by a *-node with f=6, n=2]
      • *-nodes represent deleted sibling nodes
      • Memory saved by coalescing nodes

  14. Sibling-* Summarization [Figure: animation frame; same nodes as slide 13]

  15. Sibling-* Summarization [Figure: animation frame; same nodes as slide 13]

  16. Sibling-* Summarization [Figure: animation frame; same nodes as slide 13]

  17. Sibling-* Summarization [Figure: D(7) and E(5) replaced by a *-node with f=12, n=2]

  18. Sibling-* Summarization [Figure: animation frame; same nodes as slide 17]

  19. Sibling-* Summarization [Figure: animation frame; same nodes as slide 17]

  20. Sibling-* Summarization [Figure: G(10) and H(6) replaced by a *-node with f=16, n=2]

  21. Sibling-* Summarization [Figure: the two K nodes (11 and 12) merged into one K node with f=23, n=2]

  22. Sibling-* Summarization [Figure: final summarized tree; nodes shown: A(1), B(13), C(9), F(15), K(f=23, n=2), and three *-nodes with frequencies 6, 8, and 3]

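  23. Original Path Tree [Figure: nodes by level as tag(frequency): A(1) | B(13), C(9) | D(7), E(5), F(15), G(10), H(6) | I(2), J(4), K(11), K(12)]

  24. Sibling-* Summarization [Figure: final summarized tree; nodes shown: A(1), B(13), C(9), F(15), K(f=23, n=2), and three *-nodes with frequencies 6, 8, and 3]
      • Try to retain as much information as possible about the deleted nodes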
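
The bookkeeping visible in the sibling-* frames (a *-node carrying a total frequency f and a count n, later shown as the average f/n) can be sketched as follows. This is a minimal illustration of the coalescing arithmetic only; how nodes are chosen for deletion and what happens to their children is not modeled, and the names `coalesce` and `average_frequency` are assumptions.

```python
def coalesce(siblings):
    """Merge deleted siblings [(tag, freq), ...] into one *-node record."""
    total = sum(freq for _, freq in siblings)
    return {"tag": "*", "f": total, "n": len(siblings)}

def average_frequency(star):
    """Frequency reported for a *-node: average of the nodes it replaced."""
    return star["f"] / star["n"]

# Sibling groups coalesced in the slides.
star_ij = coalesce([("I", 2), ("J", 4)])
star_de = coalesce([("D", 7), ("E", 5)])
star_gh = coalesce([("G", 10), ("H", 6)])

for star in (star_ij, star_de, star_gh):
    print(star, average_frequency(star))
```

Keeping f and n separately until the end lets further siblings be folded into the same *-node without losing the exact average.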

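  25. Level-* Summarization [Figure: original path tree; nodes by level as tag(frequency): A(1) | B(13), C(9) | D(7), E(5), F(15), G(10), H(6) | I(2), J(4), K(11), K(12)]

  26. Level-* Summarization [Figure: animation frame; same nodes as slide 25]

  27. Level-* Summarization [Figure: at the third level, D(7), E(5), and H(6) replaced by one *-node with frequency 6; at the fourth level, I(2) and J(4) replaced by one *-node with frequency 3; A(1), B(13), C(9), F(15), G(10), K(11), K(12) remain]
      • Less information about deleted nodes than sibling-*
      • Deletes fewer nodes than sibling-*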
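
The level-* result on slide 27 can be reproduced with a small sketch that keeps one *-node per tree level, holding the average frequency of the nodes deleted at that level. The set of deleted nodes is passed in explicitly because the slides do not state the selection rule in detail; parent-child edges are ignored here, and the function name `level_star` is an assumption.

```python
def level_star(levels, deleted):
    """Replace the deleted nodes of each level with a single *-node."""
    summary = []
    for nodes in levels:
        kept = [(tag, freq) for tag, freq in nodes if tag not in deleted]
        gone = [freq for tag, freq in nodes if tag in deleted]
        if gone:
            # One *-node per level, with the average deleted frequency.
            kept.append(("*", sum(gone) / len(gone)))
        summary.append(kept)
    return summary

# Path tree from the slides, listed level by level as (tag, frequency).
levels = [
    [("A", 1)],
    [("B", 13), ("C", 9)],
    [("D", 7), ("E", 5), ("F", 15), ("G", 10), ("H", 6)],
    [("I", 2), ("J", 4), ("K", 11), ("K", 12)],
]
summary = level_star(levels, deleted={"D", "E", "H", "I", "J"})
for level in summary:
    print(level)
```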

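  28. Global-* Summarization [Figure: original path tree; nodes by level as tag(frequency): A(1) | B(13), C(9) | D(7), E(5), F(15), G(10), H(6) | I(2), J(4), K(11), K(12)]

  29. Global-* Summarization [Figure: animation frame; same nodes as slide 28]

  30. Global-* Summarization [Figure: nodes shown: B(13), C(9), F(15), G(10), K(11), K(12), D(7), H(6), and a single *-node with frequency 3]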

  31. No-* Summarization [Figure: original path tree; nodes by level as tag(frequency): A(1) | B(13), C(9) | D(7), E(5), F(15), G(10), H(6) | I(2), J(4), K(11), K(12)]

  32. No-* Summarization [Figure: animation frame; same nodes as slide 31]

  33. No-* Summarization [Figure: nodes shown: B(13), C(9), F(15), G(10), K(11), K(12), D(7), E(5), H(6); no *-node]
      • Memory savings similar to global-*
      • Conservative assumption about deleted nodes

  34. Outline
     • Introduction
     • Path Trees
     • Markov Tables
     • Performance Evaluation
     • Conclusions

  35. Markov Tables
      • A table of all distinct paths of length up to m and their frequencies
      • For paths of length greater than m, combine paths from the Markov table
      • Example: $f(A/B/C/D) = f(A/B/C) \cdot \frac{f(B/C/D)}{f(B/C)}$
      • Uses "short memory" or "Markov" property
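
The slide's example generalizes naturally to longer paths by chaining the same ratio: start with the frequency of the first m tags, then multiply by f(next m tags) / f(their first m-1 tags) for each further step. The sketch below assumes this chained form, which reduces to the slide's example for m = 3; the function name `estimate` and the table values are illustrative assumptions, not figures from the talk.

```python
def estimate(path, table, m):
    """Estimate f(t1/.../tn) from a Markov table of paths up to length m."""
    if m < 2:
        raise ValueError("this sketch needs m >= 2")
    path = tuple(path)
    if len(path) <= m:
        return float(table.get(path, 0))       # stored exactly
    est = float(table.get(path[:m], 0))         # f(t1/.../tm)
    for i in range(1, len(path) - m + 1):
        num = table.get(path[i:i + m], 0)       # f(t_{1+i}/.../t_{m+i})
        den = table.get(path[i:i + m - 1], 0)   # f(t_{1+i}/.../t_{m+i-1})
        if den == 0:
            return 0.0
        est *= num / den
    return est

# Illustrative (made-up) table with m = 2.
table = {
    ("B",): 11, ("C",): 15,
    ("A", "B"): 11, ("B", "C"): 9, ("C", "D"): 8,
}
est_abcd = estimate(["A", "B", "C", "D"], table, m=2)
print(est_abcd)   # f(A/B) * f(B/C)/f(B) * f(C/D)/f(C)
```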

  36. Markov Tables [Figure: path tree; nodes by level as tag(frequency): A(1) | B(11), C(6), D(4) | C(9), D(7) | D(8)]
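
A Markov table as described on slide 35 ("all distinct paths of length up to m and their frequencies") can be built in one pass over a document. The sketch below uses the small document from slide 7, since its structure is fully known; the function name `markov_table` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The example document from slide 7.
DOC = "<A><B></B><B><D></D></B><C><D></D><E></E><E></E><E></E></C></A>"

def markov_table(root, m):
    """Count every tag path of length 1..m ending at each element."""
    table = Counter()

    def walk(elem, ancestors):
        path = ancestors + (elem.tag,)
        # Record the last k tags of the root path, for k = 1..m.
        for k in range(1, min(m, len(path)) + 1):
            table[path[-k:]] += 1
        for child in elem:
            walk(child, path)

    walk(root, ())
    return table

table = markov_table(ET.fromstring(DOC), m=2)
for path, freq in sorted(table.items()):
    print("/".join(path), freq)
```

Unlike the path tree, the table merges occurrences of the same tag sequence in different contexts, so the single-tag entry for D counts both D elements.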

  37. Summarizing Markov Tables
      • Exact selectivities for paths of length up to m
      • Approximate selectivities for paths longer than m
      • Problem: May not fit in available memory
      • Remove low frequency paths
      • Discard removed paths of length > 2
      • Replace removed paths of length 1 or 2 with *-paths
      • Suffix-*, Global-*, No-*

  38. Suffix-* Summarization

  39. Suffix-* Summarization

  40. Suffix-* Summarization

  41. Suffix-* Summarization

  42. Suffix-* Summarization: set of deleted paths of length 2, SD = { }

  43. Suffix-* Summarization: SD = { (AD,4) }

  44. Suffix-* Summarization: SD = { (AD,4) }

  45. Suffix-* Summarization: SD = { (AD,4) }

  46. Suffix-* Summarization: SD = { }

  47. Suffix-* Summarization: SD = { }

  48. Suffix-* Summarization: SD = { (BD,7) }

  49. Suffix-* Summarization: SD = { (BD,7) }

  50. Suffix-* Summarization: SD = { (BD,7), (CD,8) }
