
HY-561 PRESENTATION

HY-561 PRESENTATION. Φιλιππάκη Χρυσή, Student ID (ΑΜ): 584. 1st paper: Distributed Query Evaluation with Performance Guarantees. Problem: Partial evaluation is an effective technique for evaluating Boolean XPath queries over a fragmented tree that is distributed over a number of sites.


Presentation Transcript


  1. HY-561 PRESENTATION Φιλιππάκη Χρυσή, Student ID (ΑΜ): 584

  2. 1st paper: Distributed Query Evaluation with Performance Guarantees

  3. Problem • Partial evaluation is an effective technique for evaluating Boolean XPath queries over a fragmented tree that is distributed over a number of sites. • Is the technique applicable to generic data-selecting XPath queries? • Yes! • evaluation algorithms (PaX3, PaX2) • optimizations

  4. Partial Evaluation • Function f(x1, x2) • We are given part of its input, e.g. x1 • Partial evaluation specializes function f with respect to the known argument x1, without waiting for the other argument x2. • It performs the part of f’s computation that depends only on x1, and generates a partial answer, i.e. a residual function f′ that depends on the as yet unavailable argument x2.
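
The idea on this slide can be sketched in a few lines of Python. This is only an illustration: the function `f`, its body, and the helper name `specialize_f` are hypothetical and do not come from the paper. The sketch performs the part of `f` that depends only on `x1` once and returns the residual function f′ that still waits for `x2`.

```python
from typing import Callable


def f(x1: int, x2: int) -> int:
    """Hypothetical two-argument function: a costly part that uses only x1, plus x2."""
    return sum(range(x1)) * 2 + x2


def specialize_f(x1: int) -> Callable[[int], int]:
    """Partially evaluate f on the known argument x1 and return the residual f'."""
    x1_part = sum(range(x1)) * 2  # computed once, before x2 is available

    def residual(x2: int) -> int:
        # Only the work that depends on x2 remains.
        return x1_part + x2

    return residual


f_prime = specialize_f(10)   # x1 is known now
result = f_prime(5)          # x2 arrives later
```

Calling `f_prime(5)` gives the same value as `f(10, 5)`; the difference is only in when the `x1`-dependent work is done.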

  5. XML Tree Fragmentation (1/2)

  6. XML Tree Fragmentation (2/2)

  7. ParBoX • Algorithm based on partial evaluation, which evaluates Boolean XML queries over a fragmented tree that is distributed over a number of different sites • Partially evaluates the whole query Q, in parallel, over each fragment of the tree. • Partial answers are all collected at a single coordinator site and are composed, resulting in the final answer to Q.

  8. Parallel XPath (PaX3) • Evaluation algorithm, based on partial evaluation, for generic data-selecting XPath queries. Guarantees: • Max 3 visits per site • Parallel query processing • Total computation comparable to the best-known centralized algorithm • Total network traffic determined by the size of: • the query • the query answer • not the XML tree

  9. Three stages of PaX3 • Each stage is a single visit to a site holding tree fragments • Stage 1: Partially evaluate the qualifiers of query Q. At the end, for each node we know: • the actual value of each qualifier or • a Boolean formula whose value is yet to be determined • Stage 2: Partially evaluate the selection part of query Q. At the end, for each node we know: • whether or not the node is part of the answer of query Q or • that the node is a candidate to be part of the answer • Stage 3: Determine which candidate answer nodes are true answer nodes; all nodes belonging to the answer of Q are transmitted to site S

  10. PaX3 Algorithm (1/2) • SVect(Q): vector to store the prefixes of the selection path η1/ . . . /ηn, such that SVect(Q)[i] indicates the query η1/ . . . /ηi • QVect(Q): Boolean vector to store the list of all sub-queries of the qualifiers of Q

  11. PaX3 Algorithm (2/2) • Simple query Q over T • At each node v we compute the values of all the sub-queries in QVect(Q) and store them in a vector QVv. • Consult the (already computed) values of the QVect(Q) sub-queries at the children (QCVv) and descendants (QDVv) of v • Each fragment is processed in parallel, so the values of QVect(Q) are unknown for the virtual nodes • Partial evaluation: introduce Boolean variables, one for each missing value of QVect(Q) at each virtual node
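
The role of the Boolean variables at virtual nodes can be sketched as follows. This is a deliberately reduced illustration, not the PaX3 procedure: it tracks a single hypothetical sub-query ("the subtree contains a node labelled `target`") instead of the full QVect(Q), and the names `Node`, `partial_eval`, and `compose` are assumptions made for the sketch. Each fragment is evaluated on its own; a virtual node contributes a variable named after the missing fragment, and the coordinator substitutes the values once all partial answers are collected.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    # Set only on virtual nodes: the id of the fragment stored elsewhere.
    virtual_fragment: Optional[str] = None


# A partial answer (known, variables) stands for the formula
# "known OR x_F for every fragment id F in variables".
Partial = Tuple[bool, Set[str]]


def partial_eval(node: Node, target: str) -> Partial:
    """Evaluate 'subtree contains a node labelled target' within one fragment."""
    if node.virtual_fragment is not None:
        # Value unknown here: introduce a Boolean variable for the fragment.
        return (False, {node.virtual_fragment})
    known = node.label == target
    variables: Set[str] = set()
    for child in node.children:
        child_known, child_vars = partial_eval(child, target)
        known = known or child_known
        variables |= child_vars
    return (known, variables)


def compose(partials: Dict[str, Partial], fragment: str) -> bool:
    """Coordinator step: resolve the variables along the fragment tree."""
    known, variables = partials[fragment]
    return known or any(compose(partials, v) for v in variables)


# Hypothetical two-fragment tree: F0 holds a virtual node standing for F1.
f0 = Node("clients", [Node("client", [Node("country")]),
                      Node("", virtual_fragment="F1")])
f1 = Node("broker", [Node("market", [Node("name")])])

partials = {"F0": partial_eval(f0, "market"), "F1": partial_eval(f1, "market")}
answer = compose(partials, "F0")
```

In this example the partial answer for F0 is the formula consisting of the single variable for F1, which only becomes a concrete truth value at the coordinator.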

  12. Example (1/2) • Q = client[country/text() = “us”]/broker[market/name/text() = “nasdaq”]/name • normalize(Q) = client/ε[country/ε[text() = “us”]]/broker/ε[market/name/ε[text() = “nasdaq”]]/name • SVect(Q) = [q1, q2, q3], where • q1 = client, q2 = q1/broker, q3 = q2/name • QVect(Q) = [q1, q2, q3, q4, q5, q6, q7, q8, q9], where • q1 = country, q2 = [text() = “us”], q3 = q1/ε[q2], q4 = */ε[q3], q5 = name, q6 = [text() = “nasdaq”], q7 = q5/ε[q6], q8 = market/q7, q9 = */ε[q8]

  13. Example (2/2) • QVname = <0, 0, 0, 0, 1, 0, 0, 0, 0> • QVcountry = <1, 0, 1, 0, 0, 0, 0, 0, 0> • QVF1 = <x1, x2, x3, x4, x5, x6, x7, x8, x9> • CQVF1 = <cx1, cx2, cx3, cx4, cx5, cx6, cx7, cx8, cx9> • QVclient = <0, 0, 0, 1, 0, 0, 0, 0, x8>

  14. Analysis • Communication cost: O((|Q| |FT|) + |ans|) (optimal) • cost of transmitting our query to the various sites + • cost of retrieving the actual answers to our query • Total computation cost: O(|Q| |T|) • at each node v of Fj at most O(|Q|) operations are performed • total computation for each fragment is O(|Q| |Fj|) • Parallel computation cost: O(|Q| max_Si |F_Si|) • |F_Si|: total size of the fragments at site Si • Correctness: returns the correct answer Q(T) on any XML tree T, no matter how T is fragmented and distributed

  15. PaX2 • Two stages and max two visits of each site • Combines the first two stages of PaX3 into a single stage • evaluation of qualifiers + evaluation of selection paths • Querying site SQ makes a remote procedure call to all the sites holding fragments • At each such site, the procedure combines the partial evaluation of selection paths with that of qualifiers, over a fragment Fj. • The procedure performs a top-down traversal of fragment Fj. • At each node v of Fj, two types of computation are performed: a pre-order computation and a post-order computation.
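
The combination of a pre-order and a post-order computation in one traversal can be sketched as below. This is a simplified illustration and not the PaX2 procedure itself: it works on a single unfragmented tree, and it assumes a hypothetical query form in which every selection step is a child-axis label with an optional qualifier meaning "some descendant has this label". The names `Node`, `Step`, and `visit` are invented for the sketch. On the way down (pre-order) the selection path is matched; on the way back up (post-order) the qualifier is decided from what was found below, and candidates whose ancestor fails its qualifier are dropped.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


# One selection step: (label to match, label required somewhere below, or None).
Step = Tuple[str, Optional[str]]


def visit(node: Node, steps: List[Step],
          depth: Optional[int]) -> Tuple[Set[str], List[Node]]:
    """Return (labels in node's subtree, candidate answers still alive)."""
    # Pre-order computation: match the selection path with top-down information.
    on_path = depth is not None and node.label == steps[depth][0]
    is_last_step = on_path and depth == len(steps) - 1
    child_depth = depth + 1 if on_path and not is_last_step else None

    labels_below: Set[str] = set()
    candidates: List[Node] = []
    for child in node.children:
        child_labels, child_candidates = visit(child, steps, child_depth)
        labels_below |= child_labels
        candidates.extend(child_candidates)

    # Post-order computation: the qualifier is decidable only now.
    if is_last_step:
        candidates = [node]
    if on_path:
        qualifier = steps[depth][1]
        if qualifier is not None and qualifier not in labels_below:
            candidates = []  # qualifier failed: discard candidates found below
    return labels_below | {node.label}, candidates


# Hypothetical data loosely following the shape of the client/broker/name example.
tree = Node("client", [
    Node("country"),
    Node("broker", [Node("market", [Node("name")]), Node("name")]),
])
steps: List[Step] = [("client", "country"), ("broker", "market"), ("name", None)]
_, answers = visit(tree, steps, 0)
```

Here `answers` holds only the `name` node that is a direct child of `broker`; the `name` under `market` is not on the selection path.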

  16. Optimizing Query Evaluation (1/2) • Identify fragments that do not contain any nodes that are in the query answer • Require that each edge (Fj, Fk) of the fragment tree FT of T is annotated with a simple XPath expression describing the path in T connecting the root of fragment Fj with the root of fragment Fk

  17. Optimizing Query Evaluation (2/2) • XPath-annotations are used before the beginning of Stage 2 in PaX3 and before Stage 1 in PaX2 to identify fragments that are relevant to a query • If the input query Q has no qualifiers, then we can use XPath-annotations to skip the last step of both algorithms, PaX3 and PaX2

  18. Experimental Study • Q1: without qualifiers (with and without annotations) • Q4: with qualifiers

  19. Experimental Study Conclusions • Distributing tree fragments over various sites proves an effective strategy, with significant reductions in evaluation times • In the presence of a ‘//’ in the selection path of a query, XPath-annotations might not help much • Using PaX2 along with XPath-annotations gives the best results

  20. Conclusions • Developed algorithms and optimizations for evaluating generic XPath queries on fragmented and distributed XML trees. • Shown analytically and experimentally that these techniques are scalable and efficient. • Partial evaluation can also be combined with recent techniques developed for P2P systems and be applied to P2P query processing

  21. 2nd paper: XML Processing in DHT Networks

  22. Problem • Study the management of XML data in P2P networks based on distributed hash tables (DHTs). • Identify performance limitations and propose an array of techniques to lift them: • DHT improvements • DPP algorithm to speed up query processing • Bloom filters to reduce data transfers entailed by query processing

  23. KadoP System • Peers publish XML documents and share the tasks of indexing the data and processing queries • Indexes the XML data in the form of postings, where each posting encodes information on an element or a keyword • “Responsibility” mechanism, by which (typically) a single peer stores all the postings for a given term • Given a query, the system combines the postings stored in the index to locate the peers that can contribute to the query, and forwards the query to these peers, where the final results are computed • Potential problem: posting lists for very popular terms grow very large and limit the system’s scalability

  24. KadoP Data and Query Model • Each document in the system is identified by a pair (p, d) • p: the numerical identifier of the peer that checked it in • d: the document identifier within this peer • Document (p, d): a labeled unranked tree (element and text nodes) • Element: uniquely identified by a structural identifier sid = (start, end, lev) • start: the number assigned to the opening tag of the element • end: the number assigned to the closing tag of the element • lev: denotes the element’s level in the tree • Element e1 is an ancestor of element e2 if e1.start < e2.start < e1.end
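
The ancestor test on structural identifiers translates directly into code. The sketch below is illustrative: the `Sid` type, the sample numbering, and the choice of level 1 for the root are assumptions, and `is_parent` is an extra helper (ancestor test plus a level check) that the slide does not state.

```python
from typing import NamedTuple


class Sid(NamedTuple):
    start: int  # number assigned to the element's opening tag
    end: int    # number assigned to the element's closing tag
    lev: int    # level of the element in the tree (root = 1 is assumed here)


def is_ancestor(e1: Sid, e2: Sid) -> bool:
    """e1 is an ancestor of e2 if e1.start < e2.start < e1.end."""
    return e1.start < e2.start < e1.end


def is_parent(e1: Sid, e2: Sid) -> bool:
    """Assumed refinement: an ancestor exactly one level above."""
    return is_ancestor(e1, e2) and e2.lev == e1.lev + 1


# Hypothetical document <a><b><c/></b><d/></a>, tags numbered in document order.
a = Sid(start=1, end=8, lev=1)
b = Sid(start=2, end=5, lev=2)
c = Sid(start=3, end=4, lev=3)
d = Sid(start=6, end=7, lev=2)
```

With these identifiers, ancestor and parent checks need no access to the document itself, only to the three numbers stored in each posting.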

  25. Indexing scheme of KadoP (1/3) • XML documents are stored at their publishing peer • The Term relation is distributed among the peers of the system using a distributed hash table (DHT) • Interface: • locate(k) returns the id of the peer in charge of key k • put(k, α) enters a new value for k • get(k) returns the value for k • delete(k) deletes the key k
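
A toy, single-process stand-in for the four-operation interface is sketched below. It is not KadoP's DHT: the class name `ToyDHT`, the hash-modulo placement of keys, and the choice to let `put` accumulate values in a list are all assumptions made for illustration, and peer joins, departures, and networking are left out.

```python
import hashlib
from typing import Dict, List


class ToyDHT:
    """In-memory sketch of the locate/put/get/delete interface."""

    def __init__(self, peer_ids: List[int]) -> None:
        self.peer_ids = sorted(peer_ids)
        # One local key/value store per simulated peer.
        self.stores: Dict[int, Dict[str, List[object]]] = {
            peer: {} for peer in self.peer_ids
        }

    def locate(self, k: str) -> int:
        """Return the id of the peer in charge of key k (hash modulo peers)."""
        digest = hashlib.sha1(k.encode("utf-8")).digest()
        return self.peer_ids[int.from_bytes(digest, "big") % len(self.peer_ids)]

    def put(self, k: str, value: object) -> None:
        """Enter a new value for k at the responsible peer."""
        self.stores[self.locate(k)].setdefault(k, []).append(value)

    def get(self, k: str) -> List[object]:
        """Return the values stored for k (empty list if the key is absent)."""
        return list(self.stores[self.locate(k)].get(k, []))

    def delete(self, k: str) -> None:
        """Delete key k and everything stored under it."""
        self.stores[self.locate(k)].pop(k, None)


dht = ToyDHT(peer_ids=[1, 2, 3])
dht.put("author", (1, 7, (2, 5, 2)))   # posting (p, d, sid) for term "author"
dht.put("author", (3, 1, (4, 9, 3)))
postings = dht.get("author")
```

All postings of one term end up at the single peer returned by `locate`, which is the "responsibility" mechanism described earlier and also the source of the long-posting-list problem.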

  26. Indexing scheme of KadoP (2/3) • Term(p, d, sid, l): l is the label of element (p, d, sid) • Term(p, d, sid, w): w is a word under element (p, d, sid) • Posting: a tuple in Term • Posting list for a term a (La): the set of its postings

  27. Indexing scheme of KadoP (3/3) • The DHT assigns the keys automatically among the peers (hash function), and handles the redistribution of keys when peers join or leave the network • The keys of the relation are the terms, and the values are the corresponding posting lists • An important property of the KadoP index is that it identifies precisely the documents that contain results for a query q, which in turn can limit considerably the set of peers to which q is forwarded

  28. Challenge • Evaluation of index queries that involve long posting lists, since they represent the true challenge for a DHT-based approach

  29. Improving indexing time • Postings of the same term are buffered and sent in batches • Extending the DHT API • insert: n successive entries associated to key k lead to a total I/O complexity of n^2 • new operation append(key, entry) to obtain an indexing of linear cost • Replacing its data store • DHT’s communication buffers to cope with many small messages generated by small posting lists • Tuning of the index storage
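
The gap between the quadratic and the linear indexing cost can be made concrete with a small counting sketch. It rests on an assumption that the slide does not spell out: each plain insert is modelled as rewriting the whole posting list stored under the key, while append writes only the new entry. The function names are hypothetical, and the "cost" is simply the number of entries written.

```python
def cost_with_insert(n: int) -> int:
    """Entries written when every insert rewrites the full posting list."""
    stored: list = []
    io = 0
    for entry in range(n):
        stored = stored + [entry]  # the whole value is written back
        io += len(stored)
    return io


def cost_with_append(n: int) -> int:
    """Entries written when each entry is appended in place."""
    stored: list = []
    io = 0
    for entry in range(n):
        stored.append(entry)       # only the new entry is written
        io += 1
    return io


n = 1000
quadratic = cost_with_insert(n)    # n * (n + 1) / 2 entries
linear = cost_with_append(n)       # n entries
```

Under this model the insert-based cost grows as n(n + 1)/2, which is the n^2 behaviour the slide refers to, while append stays at n.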

  30. Improving query response time • Peer p in charge of a query q performs a holistic twig join on the posting lists received from other peers • get: it returns only when the content of the posting list has been fully retrieved • Adding a pipelined get method, which transfers posting lists asynchronously.

  31. DPP Algorithm • Distributed posting partitioning (DPP) is a distributed hierarchical data structure for managing posting lists • A DPP is used to split a posting list for a given term over several peers

  32. Implementation of DPP in KadoP • Originally, the entries of one posting list are all in one data block • The system sets a bound on the number of entries in a data block • When inserting entries, a block may overflow and be split
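
The bounded-block behaviour can be sketched as follows. This is a local, single-process illustration only: the class name `PostingBlocks`, the split-in-half policy, and the use of plain integers as postings are assumptions, and the placement of the resulting blocks on different peers (the distributed part of DPP) is not modelled.

```python
import bisect
from typing import List


class PostingBlocks:
    """Sorted posting list kept in blocks of bounded size."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self.blocks: List[List[int]] = [[]]  # originally a single data block

    def insert(self, posting: int) -> None:
        # Pick the last block whose first entry is <= posting (else block 0).
        index = 0
        for i, block in enumerate(self.blocks):
            if block and block[0] <= posting:
                index = i
        block = self.blocks[index]
        bisect.insort(block, posting)
        # Overflow: split the block into two halves.
        if len(block) > self.max_entries:
            middle = len(block) // 2
            self.blocks[index:index + 1] = [block[:middle], block[middle:]]


index_for_term = PostingBlocks(max_entries=4)
for posting in [5, 1, 9, 3, 7, 0, 8, 2, 6, 4]:
    index_for_term.insert(posting)

flattened = [p for block in index_for_term.blocks for p in block]
```

After the ten inserts every block respects the bound, and reading the blocks in order still yields the postings in sorted order.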

  33. Implementation of DPP in KadoP

  34. Experiment (Indexing time) • DPP block splitting has a moderate cost • Many publishers drastically cut indexing time, as they work in parallel

  35. Experiment (Query response time) • Benefits of the DPP: • query processing time is cut by a factor of three • its growth is really slow as the data volumes grow

  36. Structural Bloom Filters • Mechanism for reducing the volume of transferred data during the evaluation of index queries • A Bloom Filter provides a concise representation of a set S in a form that is suitable for membership queries
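
For readers unfamiliar with the underlying structure, a plain Bloom filter can be sketched as below. This is the generic textbook structure, not the structural (ancestor/descendant) variant used in the paper; the class name, the sizes, and the use of salted SHA-256 digests as hash functions are choices made for the sketch.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one Python integer

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        # True means "possibly in the set"; False means "definitely not".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


members = ["doc1:3", "doc1:17", "doc2:5"]
bloom = BloomFilter(num_bits=1024, num_hashes=3)
for m in members:
    bloom.add(m)
```

Shipping the filter instead of the set is what saves bandwidth: the receiver can discard every item the filter rejects, at the price of keeping a few items that pass by accident.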

  37. Ancestor Bloom Filter (AB Filter) • Tags a and b and the respective posting lists La and Lb • ABF(a): AB Filter for La, enables the filtering of Lb and the computation of a sub-list F(b, ABF(a)) • F(b, ABF(a)): contains a superset of b[\\a], the set of Lb postings that have an ancestor in La

  38. Descendant Bloom Filter (DB Filter) • DBF(b): DB Filter for Lb, enables the filtering of La and the computation of a sub-list F(a, DBF(b)) • F(a, DBF(b)): contains a superset of a[//b], that is, the postings in La that have at least one descendant in Lb

  39. Query Evaluation with Bloom Filters • Three query processing strategies based on Structural Bloom Filters: • Ancestor Bloom Reducer (AB Filter), • Descendant Bloom Reducer (DB Filter), • Bloom Reducer (a hybrid of the previous two) • Phases: • the peers exchange structural Bloom Filters and reduce their posting lists • the reduced lists are sent to the query peer for the final join

  40. Performance of Bloom-based strategies • DB Reducer: very effective in filtering postings that are irrelevant to the query, leading to a reduction of more than 90% in transfer load • Bloom Reducer and AB Reducer: less effective, as they transfer a large AB filter on article without getting any significant benefit from filtering the small list of Ullman

  41. Performance of Bloom-based strategies • AB Reducer and Bloom Reducer become more efficient, since the overhead of the AB filter on article is now offset by the savings of reducing author, the dominant list in this query

  42. Performance of Bloom-based strategies • The proposed strategies do not enable any savings for this particular query • This is due to the existence of the title branch, which has a detrimental effect on the performance of each strategy

  43. Conclusions • Investigating several improvements on the KadoP system • Exploring techniques to improve index construction time, as it remains significant when a large collection of documents is published

  44. End Thank you for your attention! Any questions?
