QED: A Novel Quaternary Encoding to Completely Avoid Re-labeling in XML Updates

Changqing Li, Tok Wang Ling

- Background and related work
- Our QED encoding
- Completely avoid re-labeling in XML updates based on our QED
- Experiments
- Conclusion

- Three main categories of labeling schemes to process XML queries
- Containment labeling scheme [Zhang et al SIGMOD01 etc.]
- Prefix labeling scheme [Tatarinov et al SIGMOD02 etc.]
- Prime number labeling scheme [Wu et al ICDE04]

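[Figure: an XML tree labeled with the containment scheme as “start,end,level”: root “1,16,1”; children “2,3,2”, “4,9,2”, “10,13,2”, “14,15,2”; grandchildren “5,6,3”, “7,8,3”, “11,12,3”.]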

- Each label consists of “start”, “end”, and “level” values
- Ancestor-descendant and parent-child relationships are determined based on the containment property

“5,6,3” is a descendant of “1,16,1” because interval [5,6] is contained in interval [1,16]

“5,6,3” is a child of “4,9,2” because interval [5,6] is contained in interval [4,9], and levels 3-2=1
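The two checks above can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation; the names `Label`, `is_descendant`, and `is_child` are assumptions introduced here.

```python
from typing import NamedTuple


class Label(NamedTuple):
    """A containment label: (start, end, level). Hypothetical helper type."""
    start: int
    end: int
    level: int


def is_descendant(node: Label, other: Label) -> bool:
    # node is a descendant of other if node's interval lies inside other's
    return other.start < node.start and node.end < other.end


def is_child(node: Label, other: Label) -> bool:
    # a child is a descendant whose level is exactly one deeper
    return is_descendant(node, other) and node.level - other.level == 1


root = Label(1, 16, 1)
parent = Label(4, 9, 2)
leaf = Label(5, 6, 3)

print(is_descendant(leaf, root))  # True: [5,6] is contained in [1,16]
print(is_child(leaf, parent))     # True: contained, and 3 - 2 = 1
print(is_child(leaf, root))       # False: the levels differ by 2
```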

- Need to re-label all the ancestor nodes and all the nodes after the inserted node in document order

- All the numbers shown in red must be changed, which is very expensive

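[Figure: the tree after inserting a new node “10,11,2” after “4,9,2”. The root changes from “1,16,1” to “1,18,1”, “10,13,2” to “12,15,2”, “11,12,3” to “13,14,3”, and “14,15,2” to “16,17,2”; the labels “2,3,2”, “4,9,2”, “5,6,3”, and “7,8,3” are unchanged.]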

- Increase the interval size and leave some values unused [Li et al VLDB01]
- When the unused values are used up, the nodes have to be re-labeled

- Use floating-point values [Amagasa et al ICDE03]
- A floating-point value is represented in a computer with a fixed number of bits
- Because floating-point precision is limited, the nodes eventually have to be re-labeled
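The precision limit can be seen with a small experiment. The sketch below assumes a simple strategy of always inserting at the midpoint next to the same left neighbor; this strategy and the function name `midpoint_insertions` are illustrative assumptions, not the procedure of [Amagasa et al ICDE03].

```python
def midpoint_insertions(left: float, right: float) -> int:
    """Count how many new values fit strictly between left and right when
    each new value is the midpoint of left and the previously inserted value."""
    count = 0
    while True:
        mid = (left + right) / 2
        if not (left < mid < right):
            # no representable value is left in between: re-labeling needed
            return count
        right = mid
        count += 1


# The count is finite because a float has a fixed number of mantissa bits.
print(midpoint_insertions(1.0, 2.0))
```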

- Neither approach can completely avoid re-labeling

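[Figure: an XML tree labeled with the prefix scheme: children of the root “1”, “2”, “3”, “4”; grandchildren “2.1”, “2.2”, “3.1”.]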

- Determine ancestor-descendant and parent-child relationships based on the prefix property

“2.1” is a descendant of the root, because the label of the root is empty which is a prefix of “2.1”

“2.1” is a child of “2” because “2” is an immediate prefix of “2.1”, i.e. after removing the prefix “2” from the left side of “2.1”, exactly one component (“1”) remains.
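A minimal sketch of the prefix checks, assuming labels are dotted strings and the root label is the empty string. The helper names are hypothetical. Labels are compared component by component so that, for example, “21” is not mistaken for a descendant of “2”.

```python
from typing import List


def components(label: str) -> List[str]:
    # the root has the empty label, i.e. no components
    return label.split(".") if label else []


def is_descendant(node: str, other: str) -> bool:
    n, o = components(node), components(other)
    # other's label must be a proper prefix of node's label
    return len(o) < len(n) and n[:len(o)] == o


def is_child(node: str, other: str) -> bool:
    n, o = components(node), components(other)
    # immediate prefix: exactly one component remains after removing it
    return len(n) == len(o) + 1 and n[:len(o)] == o


print(is_descendant("2.1", ""))  # True: the empty root label is a prefix
print(is_child("2.1", "2"))      # True: "2" is an immediate prefix
print(is_child("2.1", ""))       # False: two components remain
```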

- To maintain the document order when updates are performed (order-sensitive updates)
- Need to re-label all the sibling nodes after the inserted node and all the descendants of these siblings

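[Figure: the prefix-labeled tree after inserting a new node between “2” and “3”. The new node takes the label “3”; the former “3” becomes “4”, its child “3.1” becomes “4.1”, and the former “4” becomes “5”. The labels “1”, “2”, “2.1”, and “2.2” are unchanged.]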

- All the numbers shown in red must be changed, which is very expensive

- OrdPath [O'Neil et al SIGMOD04]
- At the beginning, use odd numbers only

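[Figure: an XML tree labeled with OrdPath using odd numbers only: children of the root “1”, “3”, “5”, “7”; grandchildren “3.1”, “3.3”, “5.1”.]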

- OrdPath [O'Neil et al SIGMOD04]
- On insertion, use even numbers together with odd numbers

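- Label of node a: “-1”
- Label of node b: “6.1”
- Label of node c: “6.3”
- Label of node d: “6.2.1”

[Figure: the OrdPath tree after inserting nodes a, b, c, and d among the root's children “1”, “3”, “5”, and “7”; the grandchildren are “3.1”, “3.3”, and “5.1”.]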

- All the inserted nodes are at the same level, which is a problem

- Nodes a, b, and c are at the same level, but their labels “-1”, “6.1”, and “6.3” have different numbers of components; extra time is needed to determine that they are siblings, which decreases query performance
- Half of the numbers (the even numbers) are wasted, which increases the label size
- An even number between two odd numbers must be calculated on insertion, so the update cost is not cheap
- A fixed-length field stores the size of each label; this field eventually overflows when many nodes are inserted, so OrdPath cannot completely avoid re-labeling

- Based on a top-down approach, each node is given a unique prime number (self_label), and the label of each node is the product of its parent node’s label (parent_label) and its own self_label.
- Query
- The modulo and division operations are used to determine the ancestor-descendant and ordering relationships; these operations are very expensive

- Update
- When nodes are inserted into the XML tree, the SC (simultaneous congruence) values must be re-calculated, which is much more expensive than re-labeling

- Details can be found in [Wu et al ICDE04]
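The ancestor-descendant test of the prime scheme can be sketched as a divisibility check. The code below is an illustration under assumptions: the root label `1` and the specific primes are chosen here for the example, and the ordering relationship via SC values is not covered.

```python
def make_label(parent_label: int, self_label: int) -> int:
    # label = parent's label * the node's own unique prime (self_label)
    return parent_label * self_label


def is_ancestor(anc: int, desc: int) -> bool:
    # every ancestor's label divides the descendant's label
    return desc != anc and desc % anc == 0


root = 1                 # assumption: the root carries label 1
a = make_label(root, 2)  # first child of the root
b = make_label(root, 3)  # second child of the root
c = make_label(a, 5)     # child of a, label 2 * 5

print(is_ancestor(a, c))  # True: the label of c is divisible by the label of a
print(is_ancestor(b, c))  # False: the label of c is not divisible by 3
```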

- Dynamic Quaternary Encoding (QED)
- Four quaternary numbers “0”, “1”, “2” and “3” are used in the code and each number is stored with two bits, i.e. “00”, “01”, “10” and “11”.
- The quaternary number “0” is used as the separator, and only “1”, “2”, and “3” are used in the QED encoding.
- Compare QED codes based on the lexicographical order

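[Figure: the containment-labeled tree used as the running example: root “1,16,1”; children “2,3,2”, “4,9,2”, “10,13,2”, “14,15,2”; grandchildren “5,6,3”, “7,8,3”, “11,12,3”. The “start” and “end” values run from 1 to 16.]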

- We show how to encode 16 numbers; we choose 16 because the containment scheme in this example uses 16 “start” and “end” values in total; this is only an example
- Any other number of values can be encoded by our QED in the same way
- At each step, encode the numbers at the 1/3 and 2/3 positions between two numbers
- “0” is the separator, so only the three symbols “1”, “2”, and “3” appear in the QED codes; this is why each range is split at its 1/3 and 2/3 positions
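The recursive splitting described above can be sketched as follows. This is a reconstruction under assumptions, not the authors' code: the 1/3 and 2/3 positions are rounded to the nearest integer, the boundary positions 0 and N+1 carry empty codes, and new codes are derived with the same rules as the GetInsertedCode algorithm. The name `qed_encode` is hypothetical.

```python
from typing import List


def qed_encode(n: int) -> List[str]:
    """Return QED codes for the positions 1..n, in lexicographic order."""
    codes = [""] * (n + 2)  # positions 0 and n + 1 are empty boundary codes

    def fill(lo: int, hi: int) -> None:
        gap = hi - lo
        if gap < 2:
            return  # no position left between lo and hi
        m1 = lo + (gap + 1) // 3      # gap / 3 rounded to the nearest integer
        m2 = lo + (2 * gap + 1) // 3  # 2 * gap / 3 rounded to the nearest integer
        left, right = codes[lo], codes[hi]
        # shorter left code: start from the right code with its last symbol set to "1"
        base = right[:-1] + "1" if len(left) < len(right) else left
        codes[m1] = base + "2"
        if m2 != m1:
            codes[m2] = base + "3"
        fill(lo, m1)
        fill(m1, m2)
        fill(m2, hi)

    fill(0, n + 1)
    return codes[1:n + 1]


print(qed_encode(16))
# ['112', '12', '122', '13', '132', '2', '212', '22',
#  '23', '232', '3', '312', '32', '322', '33', '332']
```

Under these assumptions the sketch reproduces the 16 codes used in the containment example, and `qed_encode(4)` gives “12”, “2”, “3”, “32”.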

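[Table: the encoding steps for the 16 numbers between the boundary positions 0 and 17; the table lists the FixedLength codes, the VarLength codes, and the QED codes such as “112”, “12”, and “122”.]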

- On the previous slide, the FixedLength codes are stored with length 5, i.e. each code is 5 bits long
- When many codes are inserted, 5 bits are no longer enough, and all the FixedLength codes must be changed.
- For the VarLength codes, we also need to store the length of each code, e.g., the length of “10000” is 5. This 5 is stored in a fixed-length size field (“101”; 3 bits), and the sizes of the other codes are stored in 3 bits as well.
- When many codes are inserted, the 3-bit size field is no longer large enough, and all the codes must be changed
- This is called the overflow problem.

- The QED codes in the table, e.g. “112”, “12”, and “122”, are separated with “0”
- They are stored as “11201201220”; the separator “0” lets us tell the different codes apart
- The separator “0” has a fixed size and needs no size field, so it never encounters the overflow problem
- Our QED encoding can therefore help to completely avoid re-labeling
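The separator-based storage can be sketched as below. The helper names `pack`, `unpack`, and `to_bits` are assumptions for illustration; `to_bits` uses the two-bit representation of each quaternary symbol described earlier (“00”, “01”, “10”, “11”).

```python
from typing import List


def pack(codes: List[str]) -> str:
    # each QED code is followed by the separator "0"
    return "".join(code + "0" for code in codes)


def unpack(stream: str) -> List[str]:
    # QED codes never contain "0", so splitting on it recovers the codes;
    # the final element is empty because the stream ends with a separator
    return stream.split("0")[:-1]


def to_bits(stream: str) -> str:
    # two bits per quaternary symbol
    return "".join(format(int(symbol), "02b") for symbol in stream)


stream = pack(["112", "12", "122"])
print(stream)                # 11201201220
print(unpack(stream))        # ['112', '12', '122']
print(len(to_bits(stream)))  # 22 bits for 11 symbols
```

No length field appears anywhere in this layout, which is why inserting longer codes later does not force existing codes to change.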

- Our QED compares codes based on the lexicographical order
- The QED codes in the table are lexicographically ordered from top to bottom.
- E.g., “132” < “2” lexicographically because the comparison is from left to right, and the 1st symbol of “132” is “1”, while the 1st symbol of “2” is “2”.
- Another example, “23” < “232” lexicographically because “23” is a prefix of “232”.
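Both comparison examples can be checked directly in Python, whose built-in string comparison is lexicographic; for strings of the symbols “1”, “2”, and “3” this coincides with the order described above. The list below is the set of 16 QED codes used in this example.

```python
codes = ["112", "12", "122", "13", "132", "2", "212", "22",
         "23", "232", "3", "312", "32", "322", "33", "332"]

print("132" < "2")             # True: decided by the first symbol
print("23" < "232")            # True: a proper prefix sorts first
print(codes == sorted(codes))  # True: the codes are already in order
```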

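[Figure: the containment tree with the “start” and “end” values replaced by QED codes: root “112,332”; children “12,122”, “13,23”, “232,32”, “322,33”; grandchildren “132,2”, “212,22”, “3,312”.]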

- Replace the “start” and “end” values “1” to “16” with our QED codes
- A QED encoding based on containment scheme is formed
- Compare labels based on lexicographical order

- Note that we drop the level values from the right graph just for a clear presentation

- The root has 4 children. Encoding 4 numbers with our QED gives the codes “12”, “2”, “3” and “32”.
- Similarly, if there are 2 siblings, their self_labels are “2” and “3” (the self_label is the last component of a label, e.g., “3” in “2.3”).
- If a node is an only child, its self_label is “2”.

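[Figure: the prefix tree labeled with QED codes: children of the root “12”, “2”, “3”, “32”; grandchildren “2.2”, “2.3”, “3.2”.]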

- For the prefix scheme, the delimiter “.” cannot be stored together with the numbers in the implementation to separate different components.
- For our QED encoding, we process the delimiters as follows.
- We use one “0” as the delimiter to separate different components of a prefix label
- e.g. to separate “12” and “3” in “12.3”; the delimiter “0” is equivalent to the “.”, so “12.3” is stored as “1203” in the implementation

- We use two consecutive separators “00” as the separator to separate different labels
- e.g. “1202001203” represents 2 labels, i.e. “1202” and “1203”.
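A minimal sketch of this storage convention, assuming labels are given as dotted strings whose components are QED codes. The function names `store` and `load` are hypothetical.

```python
from typing import List


def store(labels: List[str]) -> str:
    # "." becomes a single "0"; labels are separated by "00"
    return "00".join(label.replace(".", "0") for label in labels)


def load(stream: str) -> List[str]:
    # components never contain "0" and are never empty,
    # so "00" can only be a separator between labels
    return [stored.replace("0", ".") for stored in stream.split("00")]


stream = store(["12.2", "12.3"])
print(stream)        # 1202001203
print(load(stream))  # ['12.2', '12.3']
```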

Algorithm: GetInsertedCode

Input: Left_Code, Right_Code

Output: Inserted_Code, such that Left_Code < Inserted_Code < Right_Code lexicographically.

get the sizes of Left_Code and Right_Code
if size(Left_Code) < size(Right_Code)   // Case (1)
    then Inserted_Code = (the Right_Code with the last symbol changed to “1”) concatenate “2”
else if size(Left_Code) > size(Right_Code)
    if the last symbol of Left_Code is “2”   // Case (2)
        then Inserted_Code = the Left_Code with the last symbol changed from “2” to “3”
    else if the last symbol of Left_Code is “3”   // Case (3)
        then Inserted_Code = Left_Code concatenate “2”
else if size(Left_Code) = size(Right_Code)   // Case (4)
    then Inserted_Code = Left_Code concatenate “2”
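The algorithm translates almost line by line into Python. The sketch below assumes that both inputs are non-empty QED codes ending in “2” or “3” and that Left_Code < Right_Code already holds; the function name `get_inserted_code` is chosen here for illustration.

```python
def get_inserted_code(left: str, right: str) -> str:
    """Return a code strictly between left and right in lexicographic order."""
    if len(left) < len(right):   # Case (1)
        return right[:-1] + "1" + "2"
    if len(left) > len(right):
        if left[-1] == "2":      # Case (2)
            return left[:-1] + "3"
        return left + "2"        # Case (3): the last symbol is assumed to be "3"
    return left + "2"            # Case (4): equal sizes


start = get_inserted_code("23", "232")  # Case (1)
end = get_inserted_code(start, "232")   # Case (2)
print(start, end)                       # 2312 2313
print(get_inserted_code("2", "3"))      # 22, Case (4)
print("23" < start < end < "232")       # True: no existing code changes
```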

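[Figure: the QED-labeled containment tree (root “112,332”; children “12,122”, “13,23”, “232,32”, “322,33”; grandchildren “132,2”, “212,22”, “3,312”), with a new node inserted between the codes “23” and “232”.]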

- When we insert a node as shown in the figure
- We should insert two QED codes between “23” and “232”
- First create the “start” value
- i.e. a code between “23” and “232”; the new code is “2312”
- see Case (1) of the GetInsertedCode algorithm

- Then create the “end” value
- i.e. a code between “2312” and “232”; the new code is “2313”
- see Case (2) of the GetInsertedCode algorithm

- Since “23” < “2312” < “2313” < “232” lexicographically, we need not re-label any existing nodes.

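[Figure: the QED-PREFIX tree after inserting a new node “22” between the siblings “2” and “3”. The stored labels are “12”, “2”, “22”, “3”, “32” for the children of the root and “202”, “203”, “302” for the grandchildren.]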

- When we insert a node as shown in the figure
- We should insert one QED code between “2” and “3”
- The new QED code between “2” and “3” is “22”
- see Case (4) of the GetInsertedCode algorithm

- Since “2” < “22” < “3” lexicographically, the document order is kept without re-labeling any existing nodes.

- We mainly report the results for updates
- We select the Hamlet file in the Shakespeare plays dataset
- Intermittent updates
- The Hamlet file has 5 act elements, which gives 6 insertion cases: before act[1], between act[1] and act[2], …, between act[4] and act[5], and after act[5].

- Uniformly frequent updates
- Insertions happen randomly at different places in the Hamlet file

- Skewed frequent updates
- Insertions always happen at a fixed place in the Hamlet file

- Prime needs to re-calculate fewer SC values, but its re-calculation time is very large
- Theorem. Our QED never needs to re-label any existing nodes
- The update time of our QED is much smaller
- The update performance differences among OrdPath, Float-point, and our QED are shown on the next slide
- Note that QED denotes both the QED encoding and the QED-containment scheme, while QED-PREFIX denotes the scheme obtained by applying the QED encoding to the prefix scheme.

[Figure: (a) Number of nodes to re-label; (b) Time to re-label.]

- When uniformly frequent updates are performed,
- The update time of OrdPath and Float-point is much larger (more than 386 times) than the time required by our QED approaches

- Our QED encoding only needs to modify the last 2 bits of the neighbor label, which is very cheap
- Neither OrdPath nor Float-point can completely avoid re-labeling

[Figure: update time under uniformly frequent updates: (a) OrdPath1&2 vs QED-PREFIX; (b) Float-point vs QED.]

- When skewed frequent updates are performed,
- The update time of OrdPath and Float-point is much larger (more than 8126 times) than the time required by our QED approaches

- The very large update time makes OrdPath and Float-point unsuitable for answering queries in an environment with frequent insertions.
- Our QED still works best for answering queries in an environment where frequent insertions are executed

[Figure: update time under skewed frequent updates: (a) OrdPath1&2 vs QED-PREFIX; (b) Float-point vs QED.]

- We propose the QED encoding
- QED can be applied broadly to different labeling schemes
- QED can completely avoid re-labeling in XML updates