An Improved Prefix Labeling Scheme: A Binary String Approach for Dynamic Ordered XML

Changqing Li, Tok Wang Ling. Department of Computer Science, School of Computing, National University of Singapore

Overview • Related work and motivation • Labeling with our ImprovedBinary scheme • Order-sensitive Update and Query • Experimental results • Conclusion

Related work – Containment scheme [Figure: example tree labeled with (start,end) intervals: root 1,20; children 2,3; 4,5; 6,11; 12,13; 14,19; grandchildren 7,8; 9,10 (under 6,11) and 15,16; 17,18 (under 14,19)] • Containment scheme [1] • Node u is an ancestor of v iff label(u.start) < label(v.start) < label(v.end) < label(u.end) • In other words, the interval of v is contained in the interval of u [1] Rakesh Agrawal, Alexander Borgida, H. V. Jagadish: Efficient Management of Transitive Relationships in Large Data and Knowledge Bases. SIGMOD Conference 1989: 253-262
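The containment test can be sketched in a few lines. This is only an illustration: representing a label as a `(start, end)` tuple and the function name `is_ancestor` are assumptions, and the intervals are the ones from the example tree.

```python
def is_ancestor(u, v):
    """Return True if the node labeled u is an ancestor of the node labeled v.

    Labels are (start, end) tuples; this representation is an assumption.
    """
    u_start, u_end = u
    v_start, v_end = v
    # v's interval must lie strictly inside u's interval.
    return u_start < v_start and v_end < u_end


root = (1, 20)
inner = (6, 11)
leaf = (7, 8)
print(is_ancestor(root, leaf))    # the root interval contains the leaf interval
print(is_ancestor(inner, leaf))   # 6,11 contains 7,8
print(is_ancestor((4, 5), leaf))  # 4,5 does not contain 7,8
```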

Related work – Containment scheme [Figure: original labels (root 1,20) and the labels after inserting a node as the new first child, labeled 2,3: the root becomes 1,22, the children become 4,5; 6,7; 8,13; 14,15; 16,21 and the grandchildren become 9,10; 11,12; 17,18; 19,20] • Updates are costly • Need to re-label all the ancestor nodes and all the nodes after the inserted node in document order

Related work – Containment scheme • How to decrease the update cost • Increase the interval size • A small interval size easily leads to re-labeling • A large interval size wastes many values, which increases the storage • Use real (floating-point) values [2] • A floating-point value is represented with a fixed number of bits, much like an integer • As a result, there is only a finite number of values between any two floating-point values [2] Toshiyuki Amagasa, Masatoshi Yoshikawa, Shunsuke Uemura: QRS: A Robust Numbering Scheme for XML Documents. ICDE 2003: 705-707
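The limit of floating-point intervals can be observed directly. The sketch below is an illustration rather than part of the QRS scheme: it assumes that a new value is always inserted halfway between a fixed lower bound and the most recently inserted value, and the function name `midpoint_insertions` is hypothetical.

```python
def midpoint_insertions(lo, hi):
    """Count how many values fit when always inserting between lo and the last inserted value."""
    count = 0
    while True:
        mid = (lo + hi) / 2
        # Stop as soon as the midpoint is no longer strictly between the bounds.
        if mid <= lo or mid >= hi:
            return count
        hi = mid
        count += 1


# With 64-bit floats the interval between 1.0 and 2.0 is exhausted after a few dozen insertions.
print(midpoint_insertions(1.0, 2.0))
```

Once the loop stops, no further value can be placed at that position without re-labeling neighbours.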

Related work – Prefix scheme [Figure: the example tree labeled with Binary (children 0, 10, 110, 1110, 11110; grandchildren 110.0, 110.10, 11110.0, 11110.10) and with DeweyID (children 1, 2, 3, 4, 5; grandchildren 3.1, 3.2, 5.1, 5.2)] • Prefix scheme • Node u is an ancestor of v iff label(u) is a prefix of label(v) • DeweyID [6] • Integer based • Binary [4] • Binary string based [4] Edith Cohen, Haim Kaplan, Tova Milo: Labeling Dynamic XML Trees. PODS 2002: 271-281 [6] Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J. Shekita, Chun Zhang: Storing and querying ordered XML using a relational database system. SIGMOD Conference 2002: 204-215
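A minimal sketch of the prefix test follows. It assumes that labels are stored as strings whose components are separated by "." and that the prefix check is done component by component; the function name `is_ancestor` is illustrative.

```python
def is_ancestor(u, v):
    """Return True if label u is a proper component-wise prefix of label v."""
    u_parts = u.split(".")
    v_parts = v.split(".")
    return len(u_parts) < len(v_parts) and v_parts[:len(u_parts)] == u_parts


print(is_ancestor("3", "3.1"))       # DeweyID: 3 is the parent of 3.1
print(is_ancestor("110", "110.10"))  # Binary: 110 is the parent of 110.10
print(is_ancestor("2", "3.1"))       # different subtree
```

Comparing whole components rather than raw characters keeps a label such as 1 from being mistaken for an ancestor of 110.0.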

Related work – Prefix scheme [Figure: DeweyID labels before and after inserting a node as the new first sibling; the new node is labeled 1, siblings 1 to 5 become 2 to 6, and 3.1, 3.2, 5.1, 5.2 become 4.1, 4.2, 6.1, 6.2] • Updates are cheaper than with the containment scheme • DeweyID • Need to re-label all the sibling nodes after the inserted node and all the descendants of these siblings
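The re-labeling cost of DeweyID can be counted with a small sketch. It assumes the labels of the example tree (root omitted), handles only insertions among the children of the root, and uses the hypothetical helper name `dewey_insert_before`.

```python
def dewey_insert_before(labels, position):
    """Return {old: new} for every label that changes when a new
    top-level sibling is inserted at the given 1-based position."""
    changed = {}
    for label in labels:
        parts = label.split(".")
        first = int(parts[0])
        if first >= position:
            # The sibling and all of its descendants shift by one.
            parts[0] = str(first + 1)
            changed[label] = ".".join(parts)
    return changed


labels = ["1", "2", "3", "3.1", "3.2", "4", "5", "5.1", "5.2"]
changed = dewey_insert_before(labels, 1)
print(len(changed), "labels change, e.g. 3.1 becomes", changed["3.1"])
```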

Related work – Prefix scheme [Figure: Binary labels before and after inserting a node as the new first sibling; the new node is labeled 0, siblings 0, 10, 110, 1110, 11110 become 10, 110, 1110, 11110, 111110, and 110.0, 110.10, 11110.0, 11110.10 become 1110.0, 1110.10, 111110.0, 111110.10] • Updates are cheaper than with the containment scheme • Binary • Same number of nodes to re-label as DeweyID

Related work – Prime scheme [Figure: example tree labeled with the Prime scheme; root: document order 0, label 1; children: orders 1, 2, 3, 6, 7 with labels 2 (1*2), 3 (1*3), 5 (1*5), 7 (1*7), 11 (1*11); grandchildren: orders 4, 5, 8, 9 with labels 65 (5*13), 85 (5*17), 209 (11*19), 253 (11*23)] • Prime scheme [7] • Node u is an ancestor of v iff label(v) mod label(u) = 0 • The number above each node is the document order • The number to the right of each node is the label • The numbers below each label are its parent_label and self_label (a prime number) [7] Xiaodong Wu, Mong-Li Lee, Wynne Hsu: A Prime Number Labeling Scheme for Dynamic Ordered XML Trees. ICDE 2004: 66-78
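The Prime labels and the divisibility test can be illustrated as follows. This is a sketch using the values from the example tree; the function names are assumptions.

```python
def child_label(parent_label, self_label):
    """A node's label is its parent's label times its own prime self_label."""
    return parent_label * self_label


def is_ancestor(u_label, v_label):
    """u is an ancestor of v iff label(v) mod label(u) = 0 (and the nodes differ)."""
    return u_label != v_label and v_label % u_label == 0


label = child_label(5, 13)   # the node with self_label 13 under the node labeled 5
print(label)                 # 65
print(is_ancestor(5, 65))    # 65 mod 5 = 0
print(is_ancestor(3, 65))    # 65 mod 3 != 0
```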

Related work – Prime scheme [Figure: original labels and the labels after inserting a node with self_label 29 (label 1*29) at document order 1; the document orders of the existing nodes shift from 1..9 to 2..10 while their labels stay unchanged] • Update • Based on the Chinese Remainder Theorem (CRT) [3] • And the Simultaneous Congruence (SC) value • The SC value is 56135603 such that • 56135603 mod 2 = 1 (2 is the self_label, 1 is the document order), 56135603 mod 3 = 2, 56135603 mod 5 = 3, 56135603 mod 13 = 4, ···, and 56135603 mod 23 = 9 • Need not re-label the existing nodes, but need to re-calculate the SC value • After insertion, the new SC value is 4071807264, such that 4071807264 mod 29 = 1, 4071807264 ≡ 2 (mod 2), 4071807264 ≡ 3 (mod 3), 4071807264 mod 5 = 4, 4071807264 mod 13 = 5, ···, 4071807264 mod 23 = 10 (for self_labels 2 and 3 the remainder is 0, which is congruent to the document orders 2 and 3) • Use more SC values to prevent a single SC value from becoming a very large number. [3] James A. Anderson and James M. Bell, Number Theory with Applications, Prentice-Hall, New Jersey, 1997.
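The SC value for the original labels can be reproduced with a short Chinese Remainder Theorem sketch. The (self_label, document order) pairs below are read off the example tree, the function name `sc_value` is hypothetical, and the incremental construction is one standard way to solve the congruences, not necessarily the one used in [7]. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
def sc_value(pairs):
    """Return the smallest non-negative x with x mod prime == order for every pair.

    pairs: iterable of (prime self_label, document order); primes must be distinct.
    """
    x, modulus = 0, 1
    for prime, order in pairs:
        # Choose step so that x + modulus * step is congruent to order modulo prime.
        inverse = pow(modulus, -1, prime)
        step = ((order - x) * inverse) % prime
        x += modulus * step
        modulus *= prime
    return x


# (self_label, document order) pairs of the original example tree.
pairs = [(2, 1), (3, 2), (5, 3), (13, 4), (17, 5), (7, 6), (11, 7), (19, 8), (23, 9)]
sc = sc_value(pairs)
print(sc)
print([sc % prime for prime, _ in pairs])  # recovers the document orders
```

Every insertion changes the document orders, so the congruences have to be solved again, which is the re-calculation cost discussed on the slide.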

Motivation • The containment scheme is very costly to update • Prefix and prime schemes are cheaper to update and are called dynamic labeling schemes; we only compare the performance of dynamic labeling schemes in this paper. • The prefix schemes DeweyID and Binary both need to re-label the existing nodes • Prime need not re-label existing nodes, but needs to re-calculate the SC values • SC values are very large numbers • Re-calculation is costly • Our objective • Need not re-label existing nodes in updating • Need not re-calculate any values in updating

Our ImprovedBinary scheme • How to label the XML tree based on our ImprovedBinary scheme • The label of the root is empty • The self_label of the first sibling is “01” • The self_label of the last sibling is “011” • Case (a): the left self_label size is less than or equal to the right self_label size • The middle self_label is the right self_label with its last character changed to “0” and one more “1” concatenated. • Case (b): the left self_label size is larger than the right self_label size • The middle self_label is the left self_label with one more “1” concatenated. [Figure: example tree labeled with ImprovedBinary; children 01, 01001, 0101, 01011, 011; grandchildren 0101.01, 0101.011, 011.01, 011.011]

Properties of our ImprovedBinary labels [Figure: example tree labeled with ImprovedBinary; children 01, 01001, 0101, 01011, 011; grandchildren 0101.01, 0101.011, 011.01, 011.011] • Self_labels of siblings are lexically ordered • 01 < 01001 < 0101 < 01011 < 011 • Labels are lexically ordered when comparing the labels component by component • 01 < 01001 < 0101 < 0101.01 < 0101.011 < 01011 < 011 < 011.01 < 011.011 • A component is the part of a label between two consecutive delimiters.
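The component-by-component comparison can be tried out with a small sketch. It assumes labels are strings with "." as the delimiter; the helper name `components` is illustrative. Python compares lists of strings element by element, which matches the comparison described on the slide.

```python
def components(label):
    """Split a label into its components (the parts between delimiters)."""
    return label.split(".")


labels = ["011.011", "0101", "01", "0101.011", "011", "01001", "011.01", "01011", "0101.01"]

# Sorting by component lists yields the document order of the example tree.
document_order = sorted(labels, key=components)
print(document_order)
```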

Algorithm 1: AssignMiddleSelfLabel
Input: left self label self_label_L, and right self label self_label_R
Output: middle self label self_label_M, such that self_label_L < self_label_M < self_label_R lexically
begin
1: calculate the size of self_label_L and the size of self_label_R
2: if size(self_label_L) <= size(self_label_R) then
     self_label_M = self_label_R with its last character changed to “0”, concatenated with one more “1”
3: else if size(self_label_L) > size(self_label_R) then
     self_label_M = self_label_L concatenated with “1”
end
AssignMiddleSelfLabel algorithm. The formal labeling algorithm for our ImprovedBinary • If the size of the left self_label is smaller than or equal to the size of the right self_label, the self_label of the middle node is the right self_label with its last character changed to “0” and one more “1” concatenated. • Otherwise, the self_label of the middle node is the left self_label concatenated with “1”.
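Algorithm 1 translates almost directly into code. The following is a sketch in Python with self_labels held as plain strings; the function name `assign_middle_self_label` mirrors the algorithm name but is otherwise an assumption.

```python
def assign_middle_self_label(left, right):
    """Return a self_label that sorts lexically between left and right."""
    if len(left) <= len(right):
        # Case (a): change the last character of the right self_label to "0", then append "1".
        return right[:-1] + "0" + "1"
    # Case (b): append "1" to the left self_label.
    return left + "1"


print(assign_middle_self_label("01", "011"))    # case (a) gives 0101
print(assign_middle_self_label("0101", "011"))  # case (b) gives 01011
```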

Algorithm 2: Labeling
Input: XML document
Output: Label of each node
begin
1: for all the sibling child nodes of each node of the XML document
2:   for the first sibling node, self_label[1] = “01” //self_label is an array
3:   if the Number of Sibling nodes SN > 1 then
       self_label[SN] = “011”
       self_label = SubLabeling(self_label, 1, SN)
4:   label = prefix_label concatenated with the delimiter and each element of the self_label array
end
SubLabeling
Input: self_label array, left element index of self_label array L, and right element index of self_label array R
Output: self_label array
begin
1: M = floor((L+R)/2) //M refers to the Mth element of self_label array
2: if L+1 < R then
     self_label[M] = AssignMiddleSelfLabel(self_label[L], self_label[R])
     SubLabeling(self_label, L, M)
     SubLabeling(self_label, M, R)
end
Labeling algorithm. The formal labeling algorithm for our ImprovedBinary • SubLabeling is a recursive function • Self_label is an array
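The sibling part of Algorithm 2 can be sketched as below. The code labels one group of siblings only and leaves out the traversal of the XML document and the prefix concatenation; the function names and the 1-based list with an unused slot 0 are assumptions made to stay close to the pseudocode.

```python
def assign_middle_self_label(left, right):
    """Algorithm 1: a self_label lexically between left and right."""
    if len(left) <= len(right):
        return right[:-1] + "01"
    return left + "1"


def sub_labeling(self_label, left, right):
    """Recursively fill the self_labels strictly between indexes left and right."""
    middle = (left + right) // 2
    if left + 1 < right:
        self_label[middle] = assign_middle_self_label(self_label[left], self_label[right])
        sub_labeling(self_label, left, middle)
        sub_labeling(self_label, middle, right)


def label_siblings(sibling_count):
    """Return the self_labels of sibling_count siblings in document order."""
    self_label = [None] * (sibling_count + 1)  # index 0 is unused (1-based as in the pseudocode)
    self_label[1] = "01"
    if sibling_count > 1:
        self_label[sibling_count] = "011"
        sub_labeling(self_label, 1, sibling_count)
    return self_label[1:]


print(label_siblings(5))  # the five children of the root in the example tree
```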

Order-sensitive update for our ImprovedBinary [Figure: the ImprovedBinary example tree (children 01, 01001, 0101, 01011, 011; grandchildren 0101.01, 0101.011, 011.01, 011.011) with inserted nodes a, b, c and d] • Case (1): Insert a node before the first sibling node. The self_label of the inserted node is the first self_label with its last character changed to “0” and one more “1” concatenated. • The label of node a will be 001 • After insertion, the order is still kept. • label(a) < 01 lexically

Order-sensitive update for our ImprovedBinary [Figure: the ImprovedBinary example tree (children 01, 01001, 0101, 01011, 011; grandchildren 0101.01, 0101.011, 011.01, 011.011) with inserted nodes a, b, c and d] • Case (2): Insert a node at any position between the first and last sibling node. We use the AssignMiddleSelfLabel algorithm to assign the self_label of the newly inserted node. • The label of node b will be 010011 • The label of node c will be 0101.0101 • After insertion, the order is still kept. • 01001 < label(b) < 0101 lexically • 0101.01 < label(c) < 0101.011 lexically

Order-sensitive update for our ImprovedBinary [Figure: the ImprovedBinary example tree (children 01, 01001, 0101, 01011, 011; grandchildren 0101.01, 0101.011, 011.01, 011.011) with inserted nodes a, b, c and d] • Case (3): Insert a node after the last sibling node. The self_label of the inserted node is the last self_label with one more “1” concatenated. • The label of node d will be 0111 • After insertion, the order is still kept. • 011 < label(d) lexically

Order-sensitive update for our ImprovedBinary • For all the three cases • We need not re-label the existing nodes • We need not re-calculate any values to keep the document order
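The three insertion cases can be sketched together. The function names are illustrative, self_labels are plain strings, and the example values are the ones used for nodes a, b and d on the preceding slides.

```python
def insert_before_first(first):
    """Case (1): change the last character of the first self_label to "0", then append "1"."""
    return first[:-1] + "01"


def insert_between(left, right):
    """Case (2): AssignMiddleSelfLabel."""
    if len(left) <= len(right):
        return right[:-1] + "01"
    return left + "1"


def insert_after_last(last):
    """Case (3): append "1" to the last self_label."""
    return last + "1"


a = insert_before_first("01")       # 001
b = insert_between("01001", "0101") # 010011
d = insert_after_last("011")        # 0111
print(a, b, d)
```

In each case only the new self_label is computed; the self_labels passed in are never modified.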

Order-sensitive query 1) position = i: Selects the ith node within a context node set. 2) preceding-sibling or following-sibling: Selects all the preceding (following) sibling nodes of the context node. 3) preceding or following: Selects all the nodes before (after) (in document order) the context node excluding any ancestors (descendants). • When inserting a node • DeweyID and Binary need to re-label the existing nodes • Prime needs to re-calculate the SC values before they can process the order-sensitive queries.
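Because ImprovedBinary labels are lexically ordered, sibling and position queries can be answered from the labels alone. The sketch below is an assumption about how such queries might be evaluated over an in-memory list of labels, not the evaluation method used in the experiments; all function names are hypothetical.

```python
def components(label):
    return label.split(".")


def parent(label):
    """The components of the parent's label (empty list for children of the root)."""
    return components(label)[:-1]


def following_siblings(context, labels):
    """Siblings of the context node that come after it in document order."""
    same_parent = [l for l in labels if parent(l) == parent(context)]
    return sorted((l for l in same_parent if components(l) > components(context)), key=components)


def preceding_siblings(context, labels):
    """Siblings of the context node that come before it in document order."""
    same_parent = [l for l in labels if parent(l) == parent(context)]
    return sorted((l for l in same_parent if components(l) < components(context)), key=components)


def position(i, labels):
    """The ith node (1-based) of a context node set in document order."""
    return sorted(labels, key=components)[i - 1]


tree = ["01", "01001", "0101", "0101.01", "0101.011", "01011", "011", "011.01", "011.011"]
print(following_siblings("0101", tree))
print(preceding_siblings("0101", tree))
print(position(2, ["011", "01", "0101"]))
```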

Experimental results -- datasets • Datasets available at [8] • Real XML documents are usually not very deep [5] [5] Laurent Mignet, Denilson Barbosa, Pierangelo Veltri: The XML web: a first study. WWW 2003: 500-510 [8] The Niagara Project Experimental Data. Available at: http://www.cs.wisc.edu/niagara/data.html

Experimental results -- storage • Storage requirement • ImprovedBinary has the smallest label size in each dataset

Experimental results -- query • Query performance • ImprovedBinary is the fastest for most of the queries, both ordered and un-ordered

Experimental results -- update • Update performance • Here we study the update performance of the Hamlet XML file in D8. The update performance of other XML files in D8 is similar. • Hamlet has 5 acts, and we test the following six cases: • inserting an act before act[1] • inserting an act between act[1] and act[2] • inserting an act between act[2] and act[3] • inserting an act between act[3] and act[4] • inserting an act between act[4] and act[5] • and inserting an act after act[5] • Acts are the child nodes of the root play in the Hamlet XML file

Experimental results -- update • Update performance • Number of nodes to re-label • For Prime, the number of SC values to re-calculate is counted; one SC value is used for every three nodes. • One SC value for 4 or more nodes cannot be calculated using 64 bits, the size of the long type in Java • Our ImprovedBinary need not re-label any existing nodes and need not re-calculate any values

Experimental results -- update • Update performance • Time to re-label or re-calculate • The re-calculation time for Prime is more than 337 times larger than the re-labeling time for DeweyID and Binary • Our ImprovedBinary needs 0 milliseconds for the update • The query time and re-labeling (re-calculation) time in this paper refer to the processing time only, without including the I/O time.

Conclusion • Our contribution • We proposed a node labeling scheme called ImprovedBinary to decrease the label update cost • This scheme need not re-label any existing nodes • Need not re-calculate any values

References [1] Rakesh Agrawal, Alexander Borgida, H. V. Jagadish: Efficient Management of Transitive Relationships in Large Data and Knowledge Bases. SIGMOD Conference 1989: 253-262 [2] Toshiyuki Amagasa, Masatoshi Yoshikawa, Shunsuke Uemura: QRS: A Robust Numbering Scheme for XML Documents. ICDE 2003: 705-707 [3] James A. Anderson and James M. Bell, Number Theory with Applications, Prentice-Hall, New Jersey, 1997. [4] Edith Cohen, Haim Kaplan, Tova Milo: Labeling Dynamic XML Trees. PODS 2002: 271-281 [5] Laurent Mignet, Denilson Barbosa, Pierangelo Veltri: The XML web: a first study. WWW 2003: 500-510 [6] Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J. Shekita, Chun Zhang: Storing and querying ordered XML using a relational database system. SIGMOD Conference 2002: 204-215 [7] Xiaodong Wu, Mong-Li Lee, Wynne Hsu: A Prime Number Labeling Scheme for Dynamic Ordered XML Trees. ICDE 2004: 66-78 [8] The Niagara Project Experimental Data. Available at: http://www.cs.wisc.edu/niagara/data.html
