Node Indexes

Node Indexes: Interval Labeling Schemes, Prefix Labeling Schemes. Konsolaki Konstantina (624), konsolak@csd.uoc.gr; Fafalios Pavlos (623), fafalios@csd.uoc.gr. University of Crete, Department of Computer Science, May 2010

Outline: Introduction; Interval Labeling Schemes; Prefix Labeling Schemes; Comparison

Node Indexing Schemes: They hold values that reflect the nodes’ position within the structure of an XML tree. They can solve both simple path and twig path queries. They use two types of labeling schemes: interval labeling schemes and prefix labeling schemes.

Labeling Schemes: The purpose of a labeling scheme is to provide a unique label for each node in the XML tree. A good labeling scheme should have the following characteristics: (1) the relationship between two nodes should be uniquely and quickly determined simply by examining their labels; (2) updating XML files should not require re-labeling nodes in the XML tree; (3) the size of the label should be minimal so that labels fit in main memory; (4) the scheme should support all kinds of XPath functions; (5) the labels should follow the order of the XML document.

Node Indexes vs. Graph Indexes: During query evaluation, graph indexes treat each path as a whole, whereas node indexes deal with each node in the path separately. In graph indexes, the number of joins during query processing is reduced and query performance therefore improves. In node indexes, each step of query processing performs a structural join between two nodes, starting from one end of the path and finishing at the other end.

Node Indexes vs. Sequence Indexes: Sequence indexes transform XML documents and queries into encoded sequences, whereas node indexes label each node of the XML document. In sequence indexes, answering a query requires sequence matching between the encoded sequences of the data and the query; simple path and twig queries are evaluated efficiently without any extra join operations. In node indexes, answering a query requires structural joins among the labeled nodes; query evaluation is less efficient because of the multiple structural joins.

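XML Document for our examples. XML document: <Bib> <book> <author>Tim</author> </book> <paper></paper> <paper> <author>Sarah</author> </paper> </Bib>. XML tree: Bib has the children book, paper, and paper; book has the child author with value Tim; the first paper is empty; the second paper has the child author with value Sarah.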

Outline: Introduction; Interval Labeling Schemes; Prefix Labeling Schemes; Comparison

Interval Labeling Schemes

Outline: Interval Labeling Scheme; Beg-End Labeling Scheme; Order-Size Labeling Scheme; Prime Number Labeling Scheme; Nested Tree Structure; Label Size; Experimental results; Conclusion

Interval Labeling Scheme: Interval-based labeling schemes (also known as containment-based or region-encoded labeling schemes) exploit the properties of tree traversal to maintain document order and to determine various structural relationships between nodes. Tree traversal is the process of visiting each node in a tree data structure. Such traversals are characterized by the order in which the nodes are visited.

Beg-End Labeling Scheme: A pair of numbers is assigned to each node in an XML document according to its sequential traversal order. Starting from the root element, each node is given a “Beg” number. When the end of an attribute, an attribute value, or the ending tag of an element is reached, the “End” number is assigned. The “End” number is equal to the next sequential number. If the node is a leaf value, its “Beg” number equals its “End” number.

Example: Bib (1,14); book (2,6); author (3,5); Tim (4,4); paper (7,8); paper (9,13); author (10,12); Sarah (11,11).
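The numbering in the example above can be reproduced with a short traversal. The sketch below is only an illustration: the tuple-based tree encoding and the function name `label_beg_end` are assumptions made for this example, not part of the original scheme description.

```python
from itertools import count

# Hypothetical encoding of the example document:
# an element is (name, children); a text value is a plain string.
TREE = ("Bib", [
    ("book", [("author", ["Tim"])]),
    ("paper", []),
    ("paper", [("author", ["Sarah"])]),
])

def label_beg_end(node, counter=None, out=None):
    """Return a list of (name, beg, end) tuples in document order."""
    if counter is None:
        counter = count(1)
    if out is None:
        out = []
    if isinstance(node, str):
        # Leaf value: Beg and End share the same number.
        n = next(counter)
        out.append((node, n, n))
        return out
    name, children = node
    beg = next(counter)          # number taken at the opening tag
    slot = len(out)
    out.append(None)             # placeholder keeps document order
    for child in children:
        label_beg_end(child, counter, out)
    end = next(counter)          # next sequential number at the closing tag
    out[slot] = (name, beg, end)
    return out

labels = label_beg_end(TREE)
for row in labels:
    print(row)
```

Running it prints the same eight labels as the slide, starting with `('Bib', 1, 14)`.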

Properties [1]: A “Level” is added to the (Beg,End) label to form a node-triplet identification label (Beg,End,Level) for each node in the tree, where “Level” represents the depth of an element in the tree. Ancestor-descendant relationship: In a given data tree, node “x” is an ancestor of node “y” iff x.Beg < y.Beg < x.End (preorder property). Example labels: Bib (1,14); book (2,6); author (3,5); Tim (4,4); paper (7,8); paper (9,13); author (10,12); Sarah (11,11).

Properties [2]: Parent-child relationship: In a given data tree, node “x” is a parent of node “y” iff x.Beg < y.Beg < x.End and y.Level = x.Level + 1. There is no way to locate the siblings of a given node using only the knowledge of its index numbers. Example labels: Bib (1,14); book (2,6); author (3,5); Tim (4,4); paper (7,8); paper (9,13); author (10,12); Sarah (11,11).
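The two tests above reduce to a few integer comparisons. A minimal sketch follows; the `Label` type and the function names are hypothetical, and the root is assumed to be at level 1.

```python
from typing import NamedTuple

class Label(NamedTuple):
    beg: int
    end: int
    level: int  # assumption: the root is at level 1

def beg_end_is_ancestor(x: Label, y: Label) -> bool:
    # x is an ancestor of y iff x.Beg < y.Beg < x.End
    return x.beg < y.beg < x.end

def beg_end_is_parent(x: Label, y: Label) -> bool:
    # a parent is an ancestor exactly one level above
    return beg_end_is_ancestor(x, y) and y.level == x.level + 1

# Labels from the example tree.
bib = Label(1, 14, 1)
book = Label(2, 6, 2)
tim = Label(4, 4, 4)
last_paper = Label(9, 13, 2)

print(beg_end_is_ancestor(bib, tim))          # Bib contains Tim
print(beg_end_is_parent(bib, tim))            # but is not its parent
print(beg_end_is_ancestor(book, last_paper))  # siblings are unrelated
```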

Are updates possible? Updating the Beg-End labeling (numbering) scheme is costly. When a new node is inserted into the tree, all the nodes in the tree, except those in the left sibling subtrees of the inserted node, have to be updated. On the other hand, when a node is deleted, no re-labeling is needed.

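Update example: A new paper node (9,10) is inserted after paper (7,8). Bib changes from (1,14) to (1,16), the last paper from (9,13) to (11,15), its author from (10,12) to (12,14), and Sarah from (11,11) to (13,13). The labels book (2,6), author (3,5), Tim (4,4), and paper (7,8) are unchanged.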

<Order-Size> Labeling Scheme This labeling scheme uses an extended preorder. Each node is associated with a pair of numbers <order-size> as follows: For a tree node y and its parent x: order(x)< order(y), order(y) + size(y) <= order(x) +size(x). For two sibling nodes x and y, if x is the predecessor of y in preorder traversal: order(x)+size(x) < order(y).

Example: Bib (1,100); book (10,30); author (11,20); Tim (17,10); paper (41,10); paper (60,30); author (62,20); Sarah (65,10).

Properties: Ancestor-descendant relationship: For two given nodes x and y of a tree T, x is an ancestor of y if and only if order(x) < order(y) <= order(x) + size(x). There is no way to locate the siblings of a given node using only the knowledge of its index numbers. Example labels: Bib (1,100); book (10,30); author (11,20); Tim (17,10); paper (41,10); paper (60,30); author (62,20); Sarah (65,10).
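As a quick check of the condition above, the sketch below applies it to the example labels. Labels are modelled as plain `(order, size)` tuples and the function name is an assumption for illustration.

```python
def order_size_is_ancestor(x, y):
    """x, y are (order, size) pairs; x is an ancestor of y
    iff order(x) < order(y) <= order(x) + size(x)."""
    x_order, x_size = x
    y_order, _ = y
    return x_order < y_order <= x_order + x_size

# Labels from the example tree.
bib = (1, 100)
book = (10, 30)
tim = (17, 10)
sarah = (65, 10)

print(order_size_is_ancestor(book, tim))    # 10 < 17 <= 40
print(order_size_is_ancestor(book, sarah))  # 65 lies outside 10..40
print(order_size_is_ancestor(bib, sarah))   # 1 < 65 <= 101
```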

Are updates possible? For a tree node x, size(x) >= Σy size(y), where y ranges over the direct children of x. Size(x) can be an arbitrary integer larger than the total number of the current descendants of x. Thus the <Order,Size> labeling scheme is more flexible and handles dynamic updates of XML data more efficiently than the Beg-End scheme, because additional space is reserved for future insertions. Disadvantage: It is hard to predict the actual space requirements. After several insertions, the inserted data may exceed the reserved space, and in the worst case the whole data tree must be relabeled.

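Insertion without Re-labeling: A new paper node (53,5) is inserted under Bib (1,100), between paper (41,10) and paper (60,30). No re-labeling is needed since order(x) + size(x) < order(y) holds for each pair of consecutive siblings x, y (41 + 10 < 53 and 53 + 5 < 60) and size(x) >= Σy size(y) still holds for the parent. The other labels are unchanged: book (10,30); author (11,20); Tim (17,10); author (62,20); Sarah (65,10).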
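Whether a new label fits without re-labeling can be checked directly from the sibling and parent conditions. The following is a rough sketch under simplifying assumptions: it only checks the two neighbouring siblings and the parent's range, not the sum of all child sizes, and the function name `can_insert` is hypothetical.

```python
def can_insert(parent, left, right, new):
    """All arguments are (order, size) pairs; left/right may be None
    when the new node becomes the first/last child."""
    p_order, p_size = parent
    n_order, n_size = new
    # order(left) + size(left) < order(new)
    after_left = left is None or left[0] + left[1] < n_order
    # order(new) + size(new) < order(right)
    before_right = right is None or n_order + n_size < right[0]
    # the new node must stay inside the parent's range
    inside_parent = p_order < n_order and n_order + n_size <= p_order + p_size
    return after_left and before_right and inside_parent

bib, paper_a, paper_b = (1, 100), (41, 10), (60, 30)
print(can_insert(bib, paper_a, paper_b, (53, 5)))   # fits in the gap
print(can_insert(bib, paper_a, paper_b, (53, 10)))  # would overlap paper_b
```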

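Insertions with Re-labeling: In this example the tree is labeled Bib (1,100); book (10,35); author (11,20); Tim (17,10); paper (46,10); paper (54,35); author (62,20); Sarah (65,10), and a new paper node (58,30) is inserted. Re-labeling is needed since the conditions order(x) + size(x) < order(y) and size(x) >= Σy size(y) can no longer be satisfied with the existing labels: Bib becomes (1,200), the last paper moves from (54,35) to (90,35), its author from (62,20) to (95,20), and Sarah from (65,10) to (100,10).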

Prime Number Labeling Scheme: Divisibility property: If an integer X has a prime factor Z which is not a prime factor of another integer Y, then Y is not divisible by X. (A prime factor of an integer is a prime number that divides that integer exactly.) Example: X = 6, Z = 3 (prime number), Y = 10. In XML trees, if a node A has a descendant C which is not a descendant of another node B, then A cannot be a descendant of node B. Therefore, if the leaf nodes in XML are labeled by prime numbers and each non-leaf node is labeled by the product of the labels of its child nodes, then we can easily determine the ancestor-descendant relationship by using the divisibility property of prime numbers.

Bottom-Up: Starting from the leaf nodes, a prime number is assigned to each leaf node. For each subsequent level, a parent's label is assigned as the product of its children’s labels. Figure: Bib (1155 = 15*77); the inner nodes book and paper are labeled (15 = 3*5) and (77 = 7*11); four author leaves; leaf labels shown: (7), (11), (3), (5), (5).

Properties of Bottom-Up: Ancestor-descendant relationship: For any nodes x and y in an XML tree, x is an ancestor of y if and only if label(x) mod label(y) = 0. There is no way to locate the siblings of a given node using only the knowledge of its index numbers. Figure: Bib (1155); inner nodes book and paper labeled (15) and (77); four author leaves; leaf labels shown: (7), (11), (3), (5), (5).

Disadvantages of Bottom-Up: The scheme can quickly result in relatively large numbers being assigned to nodes at the top of the tree. Special handling is required for those nodes that have only one child.
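The bottom-up labeling and its divisibility test can be sketched in a few lines. The leaf primes and the product 1155 come from the slide; the exact tree shape (which authors belong to book and which to paper) and the node names are assumptions made for this illustration.

```python
from math import prod

# Assumed tree shape; leaf primes as shown on the slide.
CHILDREN = {
    "Bib": ["book", "paper"],
    "book": ["author1", "author2"],
    "paper": ["author3", "author4"],
}
LEAF_PRIME = {"author1": 3, "author2": 5, "author3": 7, "author4": 11}

def bottom_up_label(node):
    """Leaf: its prime. Inner node: product of its children's labels."""
    if node in LEAF_PRIME:
        return LEAF_PRIME[node]
    return prod(bottom_up_label(child) for child in CHILDREN[node])

def bottom_up_is_ancestor(x, y):
    # x is an ancestor of y iff label(x) mod label(y) == 0.
    # Caveat from the slide: a node with a single child would get the
    # same label as that child, so such nodes need special handling.
    return x != y and bottom_up_label(x) % bottom_up_label(y) == 0

print(bottom_up_label("Bib"))                    # 3 * 5 * 7 * 11
print(bottom_up_is_ancestor("book", "author1"))  # 15 is divisible by 3
print(bottom_up_is_ancestor("book", "author3"))  # 15 is not divisible by 7
```

The root label already needs four digits for four leaves, which illustrates how quickly labels near the top of the tree grow.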

Top-Down: Each node is given a unique prime number (the root is given 1), and the label of each node is the product of its parent node's label and its own number. Thus each label is a product of two factors: the first factor, inherited from the label of its parent, is called the “parent-label”; the second factor, the value assigned to the node by the labeling scheme, is called the “self-label”. Figure (label = parent-label * self-label): Bib 1 (1*1); book 2 (1*2); paper 3 (1*3); paper 5 (1*5); author 14 (2*7); author 55 (5*11); Tim 182 (14*13); Sarah 935 (55*17).

Properties of Top-Down: Ancestor-descendant relationship: For any nodes x and y in an XML tree, x is an ancestor of y if and only if label(y) mod label(x) = 0. There is no way to locate the siblings of a given node using only the knowledge of its index numbers. Figure: Bib 1 (1*1); book 2 (1*2); paper 3 (1*3); paper 5 (1*5); author 14 (2*7); author 55 (5*11); Tim 182 (14*13); Sarah 935 (55*17).
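The top-down labels in the figure can be recomputed from the self-labels, and the ancestor test is then a single modulo operation. In the sketch below the self-labels are taken from the figure; the node names (`paper1`, `author2`, and so on) and the dictionary encoding are assumptions for illustration.

```python
# Self-labels as shown in the figure; the root is given 1.
SELF_LABEL = {
    "Bib": 1, "book": 2, "paper1": 3, "paper2": 5,
    "author1": 7, "author2": 11, "Tim": 13, "Sarah": 17,
}
PARENT = {
    "book": "Bib", "paper1": "Bib", "paper2": "Bib",
    "author1": "book", "author2": "paper2",
    "Tim": "author1", "Sarah": "author2",
}

def top_down_label(node):
    """label = parent-label * self-label (the root has no parent)."""
    parent = PARENT.get(node)
    parent_label = top_down_label(parent) if parent else 1
    return parent_label * SELF_LABEL[node]

def top_down_is_ancestor(x, y):
    # x is an ancestor of y iff label(y) mod label(x) == 0
    return x != y and top_down_label(y) % top_down_label(x) == 0

print(top_down_label("Tim"), top_down_label("Sarah"))
print(top_down_is_ancestor("paper2", "Sarah"))  # 935 is divisible by 5
print(top_down_is_ancestor("book", "Sarah"))    # 935 is not divisible by 2
```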

Are updates possible? The top-down prime number labeling scheme is good for dynamic updates. When a new node is inserted, it is easy to simply assign a prime number that has not been assigned before as the self-label for the newly inserted node. No re-labeling is required. Figure: a new paper node with label 19 (1*19) is inserted under Bib; all other labels are unchanged: Bib 1 (1*1); book 2 (1*2); paper 3 (1*3); paper 5 (1*5); author 14 (2*7); author 55 (5*11); Tim 182 (14*13); Sarah 935 (55*17).
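One simple way to pick the self-label for an inserted node is to take the smallest prime larger than every prime used so far. This is only a sketch of that idea with a hypothetical helper `next_unused_prime`; the slides do not prescribe how the unused prime is found.

```python
def next_unused_prime(used):
    """Smallest prime greater than every number in `used`
    (trial division, adequate for small illustrations)."""
    n = max(used, default=1) + 1
    while any(n % d == 0 for d in range(2, int(n ** 0.5) + 1)):
        n += 1
    return n

# Self-labels already assigned in the example tree.
used_self_labels = {2, 3, 5, 7, 11, 13, 17}

bib_label = 1                                 # parent-label of the new node
new_self = next_unused_prime(used_self_labels)
new_label = bib_label * new_self              # parent-label * self-label
print(new_self, new_label)
```

With the example tree this yields 19 for the new paper node, as in the figure, and no existing label changes.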

Top-Down Disadvantage: In the prime number labeling scheme each prime number can only be used once. Hence, the self-label of a node that is subsequently inserted is always larger than the self-labels of existing nodes. This implies that the size of the labels will increase when the smaller prime numbers are used up. Thus after a few insertions the space needed for a node label will be huge.

Nested Tree Structure: Definition: A Nested Tree is a subtree which has an interval-based number as a node of the containing tree and its own interval-based numbering as a tree. Figure: Bib (1,50); book (7,20); author (11,15); Tim (13,13); paper (23,27); paper (30,35); Nested Tree with label 29: paper (29;1,29;12); author (29;5,29;9); Sarah (29;7,29;7).

K-Nested Tree: A 1-Nested Tree is a Nested Tree of the XML data tree which is not included in any other Nested Tree. A K-Nested Tree Tk is a Nested Tree that is included in a (k-1)-Nested Tree Tk-1 such that there is no other Nested Tree that includes Tk and is included in Tk-1. Figure: the 1-Nested Tree contains Bib (1,50); book (7,20); author (11,15); Tim (13,13); paper (23,27); paper (30,35); the 2-Nested Tree contains paper (29;1,29;12); author (29;5,29;9); Sarah (29;7,29;7).

StartList-EndList of a Node: The startList of any tree node N is the list s1, . . . , sn; sn+1, where si is the label of the i-Nested Tree of the node N (i = 1, 2, . . . , n) and sn+1 is the start position of N in the n-Nested Tree T. The endList of node N is defined in the same way as the startList of N, except that the start position is replaced by the end position of N. Example: for the node paper (29;1,29;12) in the 2-Nested Tree, StartList = [(1,50),29;1] and EndList = [(1,50),29;12]. Figure: 1-Nested Tree: Bib (1,50); book (7,20); author (11,15); Tim (13,13); paper (23,27); paper (30,35); 2-Nested Tree: paper (29;1,29;12); author (29;5,29;9); Sarah (29;7,29;7).

Nested Tree’s Label: The label of each node can be represented as the 4-tuple (DocID, sList, eList, Level), where DocID is the identifier of the document, sList and eList are the startList and endList of the node, respectively, and Level is the depth of the node in the data tree. For example, assuming that DocId = 1, the label of the red node, paper (29;1,29;12), is (1, [(1,50),29;1], [(1,50),29;12], 2). Figure: 1-Nested Tree: Bib (1,50); book (7,20); author (11,15); Tim (13,13); paper (23,27); paper (30,35); 2-Nested Tree: paper (29;1,29;12); author (29;5,29;9); Sarah (29;7,29;7).

Ancestor-Descendant Relationship: Node X is an ancestor of node Y if Beg(X) < NestedTreeLabel(Y) < End(X). Example, assuming that DocId = 1: the red node's label is (1, 1, 50, 1) and the blue node's label is (1, ((1,50);(29;5)), ((1,50);(29;9)), 3). The red node is the ancestor of the blue node because they have the same DocId and 1 < 29 < 50. Figure: 1-Nested Tree: Bib (1,50); book (7,20); author (11,15); Tim (13,13); paper (23,27); paper (30,35); 2-Nested Tree: paper (29;1,29;12); author (29;5,29;9); Sarah (29;7,29;7).

Parent-Child Relationship: Node X is the parent of node Y if Beg(X) < NestedTreeLabel(Y) < End(X) and Level(Y) = Level(X) + 1. Example, assuming that DocId = 1: the red node's label is (1, 1, 50, 1) and the blue node's label is (1, ((1,50);(29;1)), ((1,50);(29;12)), 2). The red node is the parent of the blue node because they have the same DocId, 1 < 29 < 50, and Level(blue) = Level(red) + 1. Figure: 1-Nested Tree: Bib (1,50); book (7,20); author (11,15); Tim (13,13); paper (23,27); paper (30,35); 2-Nested Tree: paper (29;1,29;12); author (29;5,29;9); Sarah (29;7,29;7).

Insertion of a Node: The space is the range of integers that can be used as new labels for the inserted data, and the size of the space is the number of integers in that range. The size of the space is called SpaceSize and the size of the inserted data InsertSize. For example, the SpaceSize between the red and the blue node, paper (23,27) and paper (30,35), is 2. Figure: Bib (1,50); book (7,20); author (11,15); Tim (13,13); paper (23,27); paper (30,35).

Insertion of a Node: The insertion of a node can be divided into three cases. 1st case, SpaceSize > InsertSize: use the integers in the range of the space as labels for the inserted subtree. 2nd case, 0 < SpaceSize <= InsertSize: treat the inserted subtree as a new Nested Tree and label the Nested Tree with an integer in the range of the space. 3rd case, SpaceSize = 0: combine the inserted subtree with the subtree rooted at the parent of the inserted subtree, treat the combined subtree as one Nested Tree, and label the Nested Tree with an integer in the space.
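The case distinction above depends only on two integers, so it can be written down directly. The sketch below is an illustration under assumptions: `space_size` counts the unused integers strictly between the End of the left neighbour and the Beg of the right neighbour, and both function names are hypothetical.

```python
def space_size(left_end, right_beg):
    """Number of unused integers strictly between two neighbouring labels."""
    return max(0, right_beg - left_end - 1)

def insertion_case(space, insert_size):
    """Return 1, 2 or 3 according to the three insertion cases."""
    if space < 0 or insert_size <= 0:
        raise ValueError("space must be >= 0 and insert_size must be > 0")
    if space > insert_size:
        return 1   # label the inserted subtree with integers from the space
    if space > 0:
        return 2   # wrap the inserted subtree in a new Nested Tree
    return 3       # combine with the parent's subtree into one Nested Tree

# Values used in the deck's examples (inserted subtree of size 5).
print(space_size(27, 30))    # gap between paper (23,27) and paper (30,35)
print(insertion_case(7, 5))  # SpaceSize=7
print(insertion_case(2, 5))  # SpaceSize=2
print(insertion_case(0, 5))  # SpaceSize=0
```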

Insertion of a Node, 1st Case: The first case does not need a new method to process the data insertion, because the SpaceSize is large enough to label the nodes of the newly inserted tree. Figure (SpaceSize = 7, InsertedData = 5): Bib (1,50); book (7,20); author (11,15); Tim (13,13); paper (23,27); paper (35,40); Inserted Tree: paper (28,32); author (29,31); Sarah (30,30).

Insertion of a Node, 2nd Case: In the second case the size of the inserted subtree is larger than the size of the space. But if the newly inserted subtree is treated as one Nested Tree, only one integer is needed for the label of the new Nested Tree. Accordingly, if the size of the space is one or more, relabeling the nodes in the original data tree is not necessary for the new data insertion. Figure (SpaceSize = 2, InsertedData = 5): Bib (1,50); book (7,20); author (11,15); Tim (13,13); paper (23,27); paper (30,35); Inserted Tree: paper (29;1,29;5); author (29;2,29;4); Sarah (29;3,29;3).

Insertion of a Node, 3rd Case: In the third case, the scope of the new Nested Tree is extended such that the Nested Tree includes the subtree rooted at the parent of the inserted subtree. In this case, it is required to relabel some nodes in the original data tree. Figure (SpaceSize = 0, InsertedData = 5): a tree containing a node named infoBooks, in which the nodes of the combined Nested Tree carry labels with the prefix 5: (5;1,5;16), (5;2,5;6), (5;3,5;5), (5;4,5;4), (5;7,5;8), (5;9,5;13), (5;10,5;12), (5;11,5;11), (5;14,5;15). Interval labels also shown: (5,50), (7,20), (11,15), (13,13), (23,27), (28,35).

Deletion of a Node: In the interval labeling scheme, no processing is required in case of deletion. However, the more subtree insertions occur, the more Nested Trees are created, and the more Nested Trees are created, the longer the startList and endList of the nodes become. Deletion is therefore classified into two cases: (1) release the last Nested Tree in which the deleted subtree is included; (2) release the following-sibling or preceding-sibling Nested Trees of the deleted subtree.

Deletion of a Node, 1st Case: PositionSize is the size of the space in which the Nested Tree is included. RemainSize is the size of the Nested Tree after delete processing. Figure (PositionSize = 3, RemainSize = 2): Bib (1,50); book (7,20); author (11,15); Tim (13,13); paper (23,27); paper (31,35); Nested Tree: paper (29;1,29;12); author (29;5,29;9); Sarah (29;7,29;7); label shown after release: (28,29).

Deletion of a Node, 2nd Case: Figure (PositionSize = 26, RemainSize = 5): Bib (1,50); book (7,20); author (11,15); Tim (13,13); paper (23,27); paper (30,35); Nested Tree: paper (29;1,29;12); author (29;5,29;9); Sarah (29;7,29;7); labels shown after release: (14,18), (15,17), (16,16).
