Tree Inclusion, Signatures, and Evaluation of Path-Oriented Queries


Tree Inclusion, Signatures, and Evaluation of Path-Oriented Queries Dr. Yangjun Chen Dept. Applied Computer Science, University of Winnipeg, Canada. Motivation Path-Oriented Queries and Tree Inclusion Problem Evaluation of Path-Oriented Queries - Top-down Algorithm for Tree Inclusion

Presentation Transcript
slide1

Tree Inclusion, Signatures, and Evaluation of Path-Oriented Queries

Dr. Yangjun Chen
Dept. Applied Computer Science, University of Winnipeg, Canada

  • Motivation
  • Path-Oriented Queries and Tree Inclusion Problem
  • Evaluation of Path-Oriented Queries
  • - Top-down Algorithm for Tree Inclusion
  • - Integration of Signatures into Top-down Tree Inclusion
  • Experiment Results
  • Summary and Future Work
slide2

Motivation

    • Local Information Resource Management – document databases
    • Internet – Distributed Document Databases
    • Document Databases
  • - Storage of documents in relational databases
  • non-structured data, semi-structured data
  • - Evaluation of path-oriented queries in document databases
  • path-oriented languages: XQL, XPath, and XML-QL
  • Query evaluation methods:
      • inverse-file based
      • signature based
      • string-matching based: suffix trees, Pat-trees
      • tree-inclusion based
    • Integrating signatures into top-down tree inclusion algorithm
slide3

Path-Oriented Queries and Tree Inclusion Problem

  • XML Documents and Path-Oriented Queries
slide4

Path-Oriented Queries and Tree Inclusion Problem

  • Tree Inclusion Problem
  • Definition (tree embedding) Let T and P be two labeled trees. A mapping M from the nodes of P to the nodes of T is an embedding of P into T if it preserves labels, ancestorship, and left-to-right order. That is, for all nodes u and v of P, we require that
  • a) M(u) = M(v) if and only if u = v,
  • b) label(u) = label(M(u)),
  • c) u is an ancestor of v in P if and only if M(u) is an ancestor of M(v) in T, and
  • d) v is to the left of u if and only if M(v) is to the left of M(u).
  • An embedding is root preserving if M(root(P)) = root(T). It can be shown that restricting attention to root-preserving embeddings does not lose generality.
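
Conditions a) to d) can be checked mechanically for a given mapping. The following is a minimal sketch, not part of the original slides: the tree encoding `(node_id, label, children)` and the helper names are assumptions made for illustration. Ancestorship and left-to-right order are decided from preorder and postorder numbers.

```python
def number_nodes(tree):
    """Map node_id -> (label, preorder, postorder) for a tree (node_id, label, children)."""
    info = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        node_id, label, children = node
        pre = counters["pre"]
        counters["pre"] += 1
        for child in children:
            visit(child)
        info[node_id] = (label, pre, counters["post"])
        counters["post"] += 1

    visit(tree)
    return info


def is_ancestor(info, u, v):
    # u is a proper ancestor of v: visited earlier in preorder, later in postorder.
    return info[u][1] < info[v][1] and info[u][2] > info[v][2]


def is_left_of(info, u, v):
    # u is to the left of v: earlier in both preorder and postorder.
    return info[u][1] < info[v][1] and info[u][2] < info[v][2]


def is_embedding(M, P, T):
    """Check conditions a)-d) for a mapping M (dict: node of P -> node of T)."""
    p, t = number_nodes(P), number_nodes(T)
    if set(M) != set(p):
        return False                      # M must be defined on every node of P
    for u in p:
        if p[u][0] != t[M[u]][0]:
            return False                  # b) labels are preserved
        for v in p:
            if (M[u] == M[v]) != (u == v):
                return False              # a) M is one-to-one
            if is_ancestor(p, u, v) != is_ancestor(t, M[u], M[v]):
                return False              # c) ancestorship is preserved
            if is_left_of(p, v, u) != is_left_of(t, M[v], M[u]):
                return False              # d) left-to-right order is preserved
    return True


# Hypothetical example: T = a(b(c, d), e), P = a(c, e)
T = (0, "a", [(1, "b", [(2, "c", []), (3, "d", [])]), (4, "e", [])])
P = ("p0", "a", [("p1", "c", []), ("p2", "e", [])])
print(is_embedding({"p0": 0, "p1": 2, "p2": 4}, P, T))
```

The check is quadratic in the size of P, which is adequate for validating the output of an inclusion algorithm on small test cases.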
slide5

Path-Oriented Queries and Tree Inclusion Problem

  • Example:

[Figure: a target tree T and a pattern tree P, both rooted at Hotel-room-reservation. Node labels appearing in the figure: name, location, reservation, type, address, number, street, post-code, city-or-district, state, country, price, from, to, rooms, and the query variable ?x. Leaf values: Travel-lodge, 515, Portage Ave., R3B 2E9, Winnipeg, Manitoba, Canada, one-bed-room, $119.00, April 20, 2005, April 28, 2005.]

slide6

Path-Oriented Queries and Tree Inclusion Problem

  • - Algorithms for Tree Inclusion Problem
  • Bottom-up algorithms:
      • Kilpeläinen-Mannila’s Algorithm (Pekka Kilpeläinen and Heikki Mannila, Ordered and unordered tree inclusion, SIAM Journal on Computing, 24:340-356, 1995.)
      • O(|T| |P|) time
      • O(|T| |P|) space
      • Chen’s Algorithm (W. Chen, More efficient algorithm for ordered tree inclusion, Journal of Algorithms, 26:370-385, 1998.)
      • O(|T| |leaves(P)|) time
      • O(|leaves(P)| min{height(P), |leaves(T)|}) space
slide7

Path-Oriented Queries and Tree Inclusion Problem

  • - Algorithms for Tree Inclusion Problem
  • Top-down algorithms:
      • (Y. Chen and Y.B. Chen, An Efficient Top-down Algorithm for Tree Inclusion, in Proc. of 18th Intl. Conf. Symposium on High Performance Computing System and Application, Winnipeg, Canada: IEEE, May 2004, pp. 183-187.)
      • O(|T| |leaves(P)|) time, needs no extra space
      • (Y. Chen and Y.B. Chen, On the Top-down Tree Inclusion Algorithm, submitted to Information Processing Letters.)
      • O(|T| height(P)) time, needs no extra space
      • Advantages of top-down over bottom-up:
      • - better computational complexity
      • - checking trees page-wise (suitable for large data volumes)
      • - integrating signatures into tree inclusion to cut off useless subtree checks as early as possible
slide8

Evaluation of Path-Oriented Queries

  • - Top-down Algorithm
  • Target tree: T = <t; T1, ..., Tk>, where t = root(T) and each Ti (i = 1, …, k) is a subtree of t;
  • Pattern forest: G = <P1, ..., Pq>, where each Pj (j = 1, …, q) is a subtree.
        • Main idea:
      • The algorithm attempts to find the number j (j ≥ 0) of subtrees within an ordered forest G = <P1, ..., Pq> (q ≥ 1) that are embedded in a target tree T. If j = q, we say that G is embedded in T. If j < q, then only the trees P1, ..., Pj are embedded in T. Let p1, ..., pq and t be the roots of P1, ..., Pq and T, respectively. Since a forest does not have a root, we use a virtual node pv to serve as a substitute for root(G). Thus, root(G) will return pv if G = <P1, ..., Pq> with q > 1, and will return p1 if q = 1.
slide9

Evaluation of Path-Oriented Queries

- Top-down Algorithm

Case 1: root(G) ≠ pv (i.e., G = <P> is a tree and root(G) = p), and label(p) ≠ label(t). If G is embedded in T, then there must exist a subtree Ti of t that contains the whole of G. The algorithm should return 1 if an embedding can be found and 0 if it cannot.

[Figure: label(root(T)) ≠ label(root(G)); tree G is included in a subtree Ti of T.]


slide10

Evaluation of Path-Oriented Queries

- Top-down Algorithm

Case 2: root(G) ≠ pv (i.e., G = <P> and root(G) = p), and label(p) = label(t). Let <P1, ..., Pl> (l ≥ 0) be the forest of subtrees of p and <T1, ..., Tk> the forest of subtrees of t. If G is embedded in T, there must exist two sequences of integers k1, ..., kg and l1, ..., lg (g ≤ l) such that T_{k_i} includes <P_{l_{i-1}+1}, ..., P_{l_i}> (i = 1, ..., g, l0 = 0, lg = l), where <P_{l_{i-1}+1}, ..., P_{l_i}> represents a forest containing the subtrees P_{l_{i-1}+1}, ..., and P_{l_i}. Thus, if lg = l, the algorithm should return 1 since we have a root-preserving inclusion of G in T. Otherwise, it should return 0.

[Figure: label(root(T)) = label(root(G)); the subtrees P1, ..., Pl of p are included, in order, in subtrees among T1, ..., Tk of t.]


slide11

Evaluation of Path-Oriented Queries

- Top-down Algorithm

Case 3: root(G) = pv and there exists an integer j (0 ≤ j ≤ q) such that <P1, ..., Pj> is included in T. If j = q, then the whole of G is embedded in T. There are two possibilities to be considered when looking for j. The first possibility is similar to Case 2, where there are two sequences of integers k1, ..., kg and l1, ..., lg (g ≤ q) that represent the order in which the subtrees of root(G) are embedded in the subtrees of root(T). In this case, j = lg.

If j = 0, we check the second possibility to see whether there exists a root-preserving inclusion of P1 in T, i.e., label(p1) = label(t) and the subtrees of p1 are included in the subtrees of t. In this case, j = 1.


slide12

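Evaluation of Path-Oriented Queries

- Top-down Algorithm

[Figure, possibility 1: the root of G is a virtual node; the subtrees of G are included, in order, in subtrees among T1, ..., Tk of t.]

[Figure, possibility 2: label(root(T)) = label(root(P1)); the root of G is a virtual node and P1 is included in T with the root of P1 mapped to t.]
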

slide13

Evaluation of Path-Oriented Queries

- Top-down Algorithm

function top-down-process(T, G)
input: T = <t; T1, ..., Tk>, G = <p; P1, ..., Pq>
(*p may or may not be a virtual node.*)
output: if root(G) is virtual, returns j ≥ 0;
else returns 1 if T includes G; otherwise returns 0.
begin
if root(G) is virtual then
  if (|T| < |P1| + |P2| or p has only one child)
  then G := P1;
  else {j := bottom-up-process(T, G);
    if (j = 0 and label(t) = label(P1’s root))
    (*second possibility in Case 3*)
    then {change P1’s root to a virtual node;
      x := bottom-up-process(T, P1);
      if (x = the number of the children of P1’s root)
      then j := 1 else j := 0;}
    return j;}
if |T| < |G| then return 0;
else {if (label(t) = label(p)) (*handling Case 2*)
  then {p := virtual node;
    j := bottom-up-process(T, G);
    (*q is the number of children of p*)
    if (j = q) then return 1 else return 0;}
  else {if t is a leaf then return 0;
    (*handling Case 1*)
    i := 1;
    while (i ≤ k) do
    {if top-down-process(Ti, G) > 0 then return 1;
     i := i + 1;}
    return 0;} }
end

function bottom-up-process(T, G)
input: T = <t; T1, ..., Tk>, G = <p; P1, ..., Pq>
output: j - an integer
begin
j := 0; i := 1;
while (j < q and i ≤ k) do
{ x := top-down-process(Ti, G);
  j := j + x; G := <p; Pj+1, ..., Pq>; i := i + 1; }
return j;
end
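
When implementing the pseudocode above, it helps to have an independent oracle to test against. The sketch below is not the top-down algorithm from the slides; it is a plain memoized recursion over forests (either delete the root of the leftmost target tree, or match it with the root of the leftmost pattern tree), and it makes no claim to the stated complexity bounds. The tuple encoding and the names `tree`, `forest_included`, and `tree_included` are assumptions made for illustration.

```python
from functools import lru_cache


def tree(label, *children):
    """Build an immutable ordered tree as (label, (child, child, ...))."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_included(pattern, target):
    """True if the ordered forest `pattern` can be obtained from the ordered
    forest `target` by deleting nodes (a deleted node's children take its place)."""
    if not pattern:
        return True          # nothing left to embed
    if not target:
        return False         # pattern nodes remain but no target nodes do
    (p_label, p_children), p_rest = pattern[0], pattern[1:]
    (t_label, t_children), t_rest = target[0], target[1:]
    # Option 1: delete the root of the leftmost target tree.
    if forest_included(pattern, t_children + t_rest):
        return True
    # Option 2: map the leftmost pattern root to the leftmost target root.
    return (p_label == t_label
            and forest_included(p_children, t_children)
            and forest_included(p_rest, t_rest))


def tree_included(P, T):
    """True if tree P is included in tree T (not necessarily root preserving)."""
    return forest_included((P,), (T,))


# Hypothetical example: T = a(b(c, d), e)
T = tree("a", tree("b", tree("c"), tree("d")), tree("e"))
print(tree_included(tree("a", tree("c"), tree("e")), T))   # labels and order agree
print(tree_included(tree("a", tree("e"), tree("c")), T))   # sibling order is violated
```

Comparing the result of `top-down-process` with `tree_included` on small random trees is one way to catch transcription errors in an implementation.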

slide14

Integration of Signatures into Top-down Inclusion

  • Definition A signature for a key word or an attribute value is a hash-coded bit string.
  • - Example: (constructing a signature for a word with m = 4 and F = 12)
  • “database”
  • letter triplets: dat, ata, tab, aba, bas, ase
  • H(dat) = 5, H(ata) = 1, H(tab) = 8, H(aba) = 1, H(bas) = 10, H(ase) = 8.
  • signature: 100 010 010 100
      • D. Dervos, Y. Manolopoulos and P. Linardis, “Comparison of signature file models with superimposed coding,” Information Processing Letters 65 (1998) 101-106.
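
The example can be reproduced in a few lines. This is a minimal sketch: the hash function H is not specified on the slide, so the six hash values are hard-coded from the example, and bit positions are assumed to be counted from 1 starting at the left (which is the reading that reproduces the slide's bit string).

```python
# Hash values taken from the slide; the real hash function H is not given.
SLIDE_HASH = {"dat": 5, "ata": 1, "tab": 8, "aba": 1, "bas": 10, "ase": 8}


def triplets(word):
    """All consecutive letter triplets of a word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def word_signature(word, hash_fn, F):
    """Set one bit per triplet in a bit string of length F (positions are 1-based)."""
    bits = ["0"] * F
    for t in triplets(word):
        bits[hash_fn(t) - 1] = "1"
    return "".join(bits)


def grouped(sig, size=3):
    """Insert spaces every `size` bits for readability."""
    return " ".join(sig[i:i + size] for i in range(0, len(sig), size))


sig = word_signature("database", SLIDE_HASH.__getitem__, F=12)
print(grouped(sig))        # 100 010 010 100
print(sig.count("1"))      # 4, i.e. m = 4 distinct bits are set
```

Two pairs of triplets collide (ata/aba and tab/ase), which is why six triplets produce only m = 4 set bits.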
slide15

Integration of Signatures into Top-down Inclusion

Definition A signature for a key word or an attribute value is a hash-coded bit string.

- Important parameters:

m: number of 1s in the bit string

F: length of the bit string

D: size of a block (or average number of the key words of an element)

Optimal choice of the parameters:

F · ln 2 = m · D    (1)

S. Christodoulakis and C. Faloutsos, “Design consideration for a message file server,” IEEE Trans. Software Engineering, 10(2) (1984) 201-210.
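
Formula (1) can be solved for whichever parameter is unknown. A minimal sketch follows; the function names are hypothetical, and rounding F up to a whole number of bits is an assumption.

```python
import math


def signature_length(m, D):
    """Smallest integer F with F * ln 2 >= m * D, following formula (1)."""
    return math.ceil(m * D / math.log(2))


def words_per_block(F, m):
    """Block size D implied by formula (1) for a given F and m."""
    return F * math.log(2) / m


print(signature_length(m=4, D=10))          # 58
print(round(words_per_block(F=12, m=4), 3))  # 2.079
```

For instance, with m = 4 bits per key word and D = 10 key words per element, formula (1) gives F = 40 / ln 2 ≈ 57.7, so 58 bits.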

slide16

Integration of Signatures into Top-down Inclusion

- Assigning signatures to tree nodes

Let v be a node in a tree T. If v is a leaf node, its signature sv is equal to the signature assigned to its label. Otherwise, sv = s ∨ s1 ∨ ... ∨ sn, where s represents the signature for the label associated with v, and s1, ..., sn are the signatures of v’s children v1, ..., vn, respectively.

[Figure: label signatures a: 0101 0000, b: 0011 1000, c: 0001 0101, d: 0010 1000, e: 1010 1000, f: 1100 0000, and a tree T with nodes t0, t1, t2, t11, t12, t21, t22 (labels a, b, c, d, e, f) annotated with node signatures; the root a carries 1111 1101.]

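
The bottom-up assignment of node signatures is a bitwise OR over each subtree. The sketch below uses the label signatures as they appear to be listed on this slide; the tree shape a(b(c, d), e(f)) is a hypothetical example, since the exact shape of T in the figure did not survive extraction.

```python
# Label signatures as read from the slide (8 bits each).
LABEL_SIG = {
    "a": 0b01010000, "b": 0b00111000, "c": 0b00010101,
    "d": 0b00101000, "e": 0b10101000, "f": 0b11000000,
}


def signature(tree):
    """Signature of a node: its label signature OR-ed with its children's signatures."""
    label, children = tree
    sig = LABEL_SIG[label]
    for child in children:
        sig |= signature(child)
    return sig


def bits(sig):
    """Format an 8-bit signature as two groups of four bits."""
    s = format(sig, "08b")
    return s[:4] + " " + s[4:]


# Hypothetical tree shape: a(b(c, d), e(f))
b_subtree = ("b", [("c", []), ("d", [])])
T = ("a", [b_subtree, ("e", [("f", [])])])
print(bits(signature(b_subtree)))   # 0011 1101
print(bits(signature(T)))           # 1111 1101
```

Because OR is associative and commutative, the root signature depends only on which labels occur in the tree, not on its shape.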

slide17

Integration of Signatures into Top-down Inclusion

      • - Cutting off useless subtree checks by examining signatures

We assign each node v in T a bit string sv (called a signature), and each node u in P a bit string su, in such a way that if su matches sv then the subtree Tv rooted at v may include the subtree Pu rooted at u. Otherwise, Tv definitely does not contain Pu. By “matching”, we mean that for each bit set to 1 in su, the corresponding bit in sv is also set to 1, while for a bit set to 0 in su, the corresponding bit in sv can be 0 or 1. In the following, we discuss this technique in detail.

[Figure: a target tree T (nodes t0, t1, t2, t11, t12, t21, t22) and a pattern P with a virtual root node (nodes p0, p1, p2), both annotated with signatures such as 1111 1101, 1111 1000, and 0011 1101; one subtree of T whose signature does not match is marked “This subtree will not be explored.”]

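
The matching test is a single bitwise operation. The following sketch uses signature values that appear on the slide; pairing the pattern signature 0011 1101 with the two target signatures is an assumption made for illustration, and the function name is hypothetical.

```python
def matches(s_u, s_v):
    """True if every bit set in the pattern signature s_u is also set in s_v."""
    return (s_u & s_v) == s_u


pattern_sig = 0b00111101            # signature of a pattern subtree Pu
target_sigs = {
    "subtree with 1111 1101": 0b11111101,
    "subtree with 1111 1000": 0b11111000,
}

for name, s_v in target_sigs.items():
    verdict = "explore" if matches(pattern_sig, s_v) else "skip"
    print(name, "->", verdict)
```

A match only means the subtree may contain the pattern, so the inclusion check still has to run there; a mismatch is definitive and lets the whole subtree be skipped.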

slide18

Integration of Signatures into Top-down Inclusion

      • - Determine the length of signatures
  • Consider s = s1 ∨ s2, where s1 and s2 are of length F and have m1 and m2 bits set to 1, respectively.
  • How to determine the length of s?
  • n - the number of 1s in s
  • d = n - m’, where m’ = max(m1, m2).
        • length(s) = F + cd, where c is a constant that should be tuned for different applications.
        • The value of d can be estimated as follows.
        • l - random variable representing the number of positions in which both s1 and s2 have 1s, so that n = m1 + m2 - l.
slide19

Integration of Signatures into Top-down Inclusion

      • - Determine the length of signatures
  • El = 1 p(l = 1) + 2  p(l = 2) + ... + m’’  p(l = m’’) (2)
      • m’’ = min(m1, m2) and p(l = i) represents the probability that l is equal to i.
  • p(l = i) = (3)
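  • d = (number of 1s in s) - m’ = m1 + m2 - l - max(m1, m2), where l is the number of positions in which both s1 and s2 have 1s and m’ = max(m1, m2).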
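
The estimate of d can be computed once a distribution for l is fixed. The sketch below is an illustration under a stated assumption, not the slide's formula (3), which did not survive extraction: it assumes the 1-bits of s1 and s2 are placed uniformly at random and independently, in which case the overlap l follows a hypergeometric distribution. All function names and the example value of c are hypothetical.

```python
from math import comb


def overlap_probability(F, m1, m2, i):
    """P(l = i) under the uniform-placement assumption (hypergeometric)."""
    return comb(m1, i) * comb(F - m1, m2 - i) / comb(F, m2)


def expected_overlap(F, m1, m2):
    """E[l] as in formula (2): sum of i * p(l = i) for i = 1 .. min(m1, m2)."""
    return sum(i * overlap_probability(F, m1, m2, i)
               for i in range(1, min(m1, m2) + 1))


def expected_extra_ones(F, m1, m2):
    """d = m1 + m2 - E[l] - max(m1, m2)."""
    return m1 + m2 - expected_overlap(F, m1, m2) - max(m1, m2)


def adjusted_length(F, m1, m2, c):
    """length(s) = F + c * d, with c an application-specific tuning constant."""
    return F + c * expected_extra_ones(F, m1, m2)


print(round(expected_overlap(12, 4, 4), 4))
print(round(expected_extra_ones(12, 4, 4), 4))
print(round(adjusted_length(12, 4, 4, c=1.5), 4))
```

Under this assumption E[l] has the closed form m1 · m2 / F, which gives a quick sanity check on the sum in formula (2).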
slide20

Evaluation of Path-Oriented Queries

  • - Procedure for calculating signature length
  • 1) Identify the key words in a document, which can be done by using Connexor-analyzer (http://www.connexor.com/demos/index.html).
  • 2) Determine the length of the signatures for the nodes of a document tree, which can be done in two steps:
        • - First, use formula (1) to determine the initial length of the signatures according to the number of the chosen key words and their distribution.
        • - Second, use formulas (2) and (3) to determine the length of the signatures for each document according to the initial length set for signatures.
slide21

Evaluation of Path-Oriented Queries

  • - Procedure for calculating signature length

In the figure, F stands for the initial length of the signatures and m for

the initial number of bits set to 1.

slide22

Experiment Results

  • - Test Platform
  • Computer - DELL desktop PC equipped with a Pentium III 864 MHz processor, 512MB RAM and 20GB hard disk.
      • Database system - Oracle-9i Enterprise Edition. The default buffer cache of Oracle-9i is of size 4MB.
      • Language - Oracle PL/SQL.
  • Data - all 37 of Shakespeare’s plays, stored in a database
slide23

Experiment Results

  • - Storage of XML documents in databases
  • All the documents are stored in three tables.
      • The relation Element has the following structure:
        • {DocID: <integer>, ID: <integer>, Ename: <string>,
        • firstChildID: <integer>, siblingID: <integer>, attributeID: <integer>}
slide24

Experiment Results

  • - Storage of XML documents in databases
  • The relation Text is of a simpler structure:
  • {DocID: <integer>, textID: <integer>, value: <string>},
      • where “textID” holds the identifiers of the texts that are the values of the corresponding elements in the original document. Note that a text always has an element as its parent node. See the following table for illustration.
slide25

Experiment Results

  • - Storage of XML documents in databases
  • The relation Attribute has five data fields:
      • {DocID: <integer>, att-ID: <integer>, parentID: <integer>, att-name: <string>,
      • att-value: <string>}.
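
The three relations can be tried out with any relational engine. The sketch below uses SQLite through Python's `sqlite3` module instead of the Oracle-9i setup described in the slides; the column types, the sample rows, and the reading of `firstChildID`/`siblingID` as "first child" and "next sibling" links (with NULL meaning "none") are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Element (DocID INTEGER, ID INTEGER, Ename TEXT,
                          firstChildID INTEGER, siblingID INTEGER,
                          attributeID INTEGER);
    CREATE TABLE "Text" (DocID INTEGER, textID INTEGER, value TEXT);
    CREATE TABLE Attribute (DocID INTEGER, "att-ID" INTEGER, parentID INTEGER,
                            "att-name" TEXT, "att-value" TEXT);
""")

# Hypothetical rows for a small fragment of one document (DocID = 1).
conn.executemany(
    "INSERT INTO Element VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "hotel-room-reservation", 2, None, None),
        (1, 2, "name", None, 3, None),
        (1, 3, "location", 5, 4, None),
        (1, 4, "reservation", None, None, None),
        (1, 5, "address", None, None, None),
    ],
)


def children(conn, doc_id, element_id):
    """Names of an element's children: follow firstChildID, then the siblingID chain."""
    row = conn.execute(
        "SELECT firstChildID FROM Element WHERE DocID = ? AND ID = ?",
        (doc_id, element_id),
    ).fetchone()
    child_id = row[0] if row else None
    names = []
    while child_id is not None:
        name, child_id = conn.execute(
            "SELECT Ename, siblingID FROM Element WHERE DocID = ? AND ID = ?",
            (doc_id, child_id),
        ).fetchone()
        names.append(name)
    return names


print(children(conn, 1, 1))   # ['name', 'location', 'reservation']
```

With this layout a top-down traversal of a document tree needs one lookup per visited node, which fits the page-wise checking mentioned earlier in the slides.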
slide26

Experiment Results

- Tested queries

Group I - for testing path length impact

  • Group II - for testing node degree impact
slide27

Experiment Results

- Tested queries

Group III - for testing impact of matching at higher level

  • Group IV - for testing impact of matching at middle level
slide28

Experiment Results

- Tested queries

Group V- for testing impact of matching at lower level

slide29

Experiment Results

  • - Tested methods
        • Inversion on Elements and Words (IEW)
        • (C. Zhang, J. Naughton, D. DeWitt, Q. Luo and G. Lohman, “On Supporting Containment Queries in Relational Database Management Systems,” in Proc. of ACM SIGMOD Intl. Conf. on Management of Data, California, USA, 2001.)
        • Inversion on Paths and Words (IPW)
        • (C. Seo, S. Lee, and H. Kim, “An Efficient Index Technique for XML Documents Using RDBMS,” Information and Software Technology 45 (2003) 11-22, Elsevier Science B.V.)
        • Tree Inclusion Algorithm (TIA)
        • Tree Inclusion with Signatures (TIS)
slide30

Experiment Results

  • - Tested methods
        • Inversion on Elements and Words (IEW)
        • - (Dno, Wposition, level) for a text word
        • - (Dno, Eposition, level) for an element

Example:

E-index:
hotel-room-reservation: (1, <1, 45>, 0) ...
name: (1, <2, 4>, 1) ...
location: (1, <5, 28>, 2) ...
...

T-index:
Travel-lodge: (1, 3, 2) ...
Winnipeg: (1, 7, 3) ...
Manitoba: (1, 10, 3) ...
...


slide31

Experiment Results

  • - Tested methods
        • To evaluate the query: /hotel-room-reservation/location/address [street = Portage Ave.],
        • four joins are performed:
        • self-joins on E-index relation to connect ‘hotel-room-reservation’ and ‘location’,
        • ‘location’ and ‘address’,
        • ‘address’ and ‘street’,
        • the join between E-index and T-index relations to connect ‘street’ and ‘Portage Ave.’
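
Each of these joins is a containment test on positions and levels. The sketch below illustrates the idea on a subset of the example postings, for the simpler path hotel-room-reservation/name with text Travel-lodge; reading <2, 4> as (begin, end) positions, requiring the child's level to be the parent's level plus one, and the function names are assumptions made for illustration.

```python
# Postings taken from the example: (Dno, begin, end, level) for elements,
# (Dno, position, level) for text words.
E_INDEX = {
    "hotel-room-reservation": [(1, 1, 45, 0)],
    "name": [(1, 2, 4, 1)],
}
T_INDEX = {
    "Travel-lodge": [(1, 3, 2)],
}


def child_elements(parents, candidates):
    """Self-join on the E-index: candidates that are direct children of some parent."""
    return [c for p in parents for c in candidates
            if c[0] == p[0]                      # same document
            and p[1] < c[1] and c[2] < p[2]      # interval containment
            and c[3] == p[3] + 1]                # one level deeper


def elements_with_word(elements, words):
    """Join between E-index and T-index: elements that directly contain the word."""
    return [e for e in elements for w in words
            if w[0] == e[0]
            and e[1] < w[1] < e[2]
            and w[2] == e[3] + 1]


names = child_elements(E_INDEX["hotel-room-reservation"], E_INDEX["name"])
print(elements_with_word(names, T_INDEX["Travel-lodge"]))   # [(1, 2, 4, 1)]
```

The number of joins grows with the path length, one per step of the path plus one for the text condition, which is what the Group I queries on path length are designed to measure.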
slide32

Experiment Results

  • - Tested methods
        • Inversion on Paths and Words (IPW)
        • - Path(path, pathID),
        • - PathIndex(pathID, docno, begin, end)
        • - Word(word, wordID)
        • - WordIndex(wordID, docno, pathID, position)
slide33

Experiment Results

  • - Tested methods
      • In order to process the same query:
      • /hotel-room-reservation/location/address [street = Portage Ave.],
      • two joins are needed.
        • First join between Path and WordIndex relations with the following join condition:
      • Path.path = ‘hotel-room-reservation/location/address/street’ and
      • Path.pathID = WordIndex.pathID.
        • The second join between the result R of the first join and the Word relation with the
        • join condition:
        • R.wordID = Word.wordID and Word.word = ‘Portage Ave.’.
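
The two joins can be written as a single SQL statement. The sketch below runs it on SQLite through Python's `sqlite3` module rather than on Oracle; the sample rows, the column types, and the helper names are hypothetical, and the PathIndex relation is omitted because this query does not use it.

```python
import sqlite3


def build_demo_db():
    """Create the Path, Word and WordIndex relations with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Path (path TEXT, pathID INTEGER);
        CREATE TABLE Word (word TEXT, wordID INTEGER);
        CREATE TABLE WordIndex (wordID INTEGER, docno INTEGER,
                                pathID INTEGER, position INTEGER);
    """)
    conn.executemany("INSERT INTO Path VALUES (?, ?)", [
        ("hotel-room-reservation/location/address/street", 1),
        ("hotel-room-reservation/name", 2),
    ])
    conn.executemany("INSERT INTO Word VALUES (?, ?)", [
        ("Portage Ave.", 1),
        ("Travel-lodge", 2),
    ])
    conn.executemany("INSERT INTO WordIndex VALUES (?, ?, ?, ?)", [
        (1, 1, 1, 20),   # 'Portage Ave.' under .../address/street in document 1
        (2, 1, 2, 3),    # 'Travel-lodge' under .../name in document 1
    ])
    return conn


def find_occurrences(conn, path, word):
    """First join: Path with WordIndex on pathID; second join: the result with Word."""
    return conn.execute(
        """
        SELECT wi.docno, wi.position
        FROM Path p
        JOIN WordIndex wi ON p.pathID = wi.pathID
        JOIN Word w ON wi.wordID = w.wordID
        WHERE p.path = ? AND w.word = ?
        ORDER BY wi.docno, wi.position
        """,
        (path, word),
    ).fetchall()


conn = build_demo_db()
print(find_occurrences(
    conn, "hotel-room-reservation/location/address/street", "Portage Ave."))
```

Because the whole path is stored as one string, the number of joins stays at two regardless of path length, in contrast to the four joins IEW needs for the same query.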
slide34

Experiment Results

- Tested results

[Figures: execution time (sec.) per query for the tested methods (legend: IPW, IEW, TIA, TIS). Panels: Results of Group I, Results of Group II, Results of Group III, Results of Group IV.]


slide35

Experiment Results

- Tested results

[Figure: execution time (sec.) for queries Q1 to Q5 (legend: IPW, TIS, TIA). Panel: Results of Group V.]


slide36

Summary and Future Work

        • Path-oriented queries in document databases
        • Evaluation of path-oriented queries
        • - Top-down algorithm for tree inclusion problem
        • - Integration of signatures into top-down tree inclusion
        • Future work:
      • document recognition using
  • tree inclusion
  • probabilistic analysis
  • Benford’s law
  • Zipf’s law