
Tree Inclusion, Signatures, and Evaluation of Path-Oriented Queries

Dr. Yangjun Chen, Dept. of Applied Computer Science, University of Winnipeg, Canada

  • Motivation

  • Path-Oriented Queries and Tree Inclusion Problem

  • Evaluation of Path-Oriented Queries

    • Top-down Algorithm for Tree Inclusion

    • Integration of Signatures into Top-down Tree Inclusion

  • Experiment Results

  • Summary and Future Work


Motivation

  • Local information resource management: document databases

  • Internet: distributed document databases

  • Document databases

    • Storage of documents in relational databases (non-structured and semi-structured data)

    • Evaluation of path-oriented queries in document databases

      • Path-oriented languages: XQL, XPath, and XML-QL

      • Query evaluation methods:

        • inverted-file based

        • signature based

        • string-matching based: suffix trees, Pat-trees

        • tree-inclusion based

  • Integrating signatures into the top-down tree inclusion algorithm



    Path-Oriented Queries and Tree Inclusion Problem

    • XML Documents and Path-Oriented Queries



    • Tree Inclusion Problem

    • Definition (tree embedding). Let T and P be two labeled trees. A mapping M from the nodes of P to the nodes of T is an embedding of P into T if it preserves labels and ancestorship. That is, for all nodes u and v of P, we require that

    • a) M(u) = M(v) if and only if u = v,

    • b) label(u) = label(M(u)),

    • c) u is an ancestor of v in P if and only if M(u) is an ancestor of M(v) in T, and

    • d) v is to the left of u if and only if M(v) is to the left of M(u).

    • An embedding is root preserving if M(root(P)) = root(T). It can be shown that restricting attention to root-preserving embeddings does not lose generality.
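    The four conditions of the definition can be checked mechanically for a given candidate mapping. The following is a minimal sketch, not taken from the slides: the `Node` class, the preorder/postorder numbering, and the function names are all assumptions introduced for illustration. It verifies a mapping that is already given; it does not search for one.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass(eq=False)
class Node:
    """Hypothetical ordered labeled tree node (not from the slides)."""
    label: str
    children: list = field(default_factory=list)


def index_tree(root):
    """Map id(node) -> (preorder number, postorder number)."""
    numbers = {}
    counter = {"pre": 0, "post": 0}

    def visit(node):
        pre = counter["pre"]
        counter["pre"] += 1
        for child in node.children:
            visit(child)
        numbers[id(node)] = (pre, counter["post"])
        counter["post"] += 1

    visit(root)
    return numbers


def is_ancestor(num, u, v):
    # u is a proper ancestor of v: earlier in preorder, later in postorder.
    return num[id(u)][0] < num[id(v)][0] and num[id(u)][1] > num[id(v)][1]


def is_left_of(num, u, v):
    # u is to the left of v: earlier in both preorder and postorder.
    return num[id(u)][0] < num[id(v)][0] and num[id(u)][1] < num[id(v)][1]


def is_embedding(mapping, p_root, t_root):
    """mapping: list of (pattern node, target node) pairs covering all of P."""
    pn, tn = index_tree(p_root), index_tree(t_root)
    if len(mapping) != len(pn):
        return False                      # every node of P must be mapped
    if len({id(t) for _, t in mapping}) != len(mapping):
        return False                      # condition a): M is injective
    if any(u.label != m.label for u, m in mapping):
        return False                      # condition b): labels preserved
    for (u, mu), (v, mv) in combinations(mapping, 2):
        for a, ma, b, mb in ((u, mu, v, mv), (v, mv, u, mu)):
            if is_ancestor(pn, a, b) != is_ancestor(tn, ma, mb):
                return False              # condition c): ancestorship
            if is_left_of(pn, a, b) != is_left_of(tn, ma, mb):
                return False              # condition d): left-to-right order
    return True
```

    For instance, with T = a(b(c), d) and P = a(c, d), mapping each pattern node to the target node with the same label satisfies all four conditions, while the pattern a(d, c) fails condition d).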


    • Example:

    [Figure: a target tree T and a pattern tree P, both rooted at Hotel-room-reservation. Element labels appearing in the figure: name, location, reservation, type, address, City-or-district, price, from, to, rooms, country, state, Post-code, number, street. Values appearing in the figure: Travel-lodge, one-bed-room, April 20, 2005, April 28, 2005, Winnipeg, $119.00, Manitoba, Canada, R3B 2E9, 515, Portage Ave. The pattern contains the variable ?x. The tree layout is not preserved in this transcript.]


    • Algorithms for the Tree Inclusion Problem

    • Bottom-up algorithms:

      • Kilpeläinen-Mannila's algorithm (P. Kilpeläinen and H. Mannila, "Ordered and unordered tree inclusion," SIAM Journal on Computing, 24:340-356, 1995.)

        • O(|T| |P|) time

        • O(|T| |P|) space

      • Chen's algorithm (W. Chen, "More efficient algorithm for ordered tree inclusion," Journal of Algorithms, 26:370-385, 1998.)

        • O(|T| |leaves(P)|) time

        • O(|leaves(P)| min{height(P), |leaves(T)|}) space


    • Top-down algorithms:

      • Y. Chen and Y.B. Chen, "An Efficient Top-down Algorithm for Tree Inclusion," in Proc. of 18th Intl. Symposium on High Performance Computing Systems and Applications, Winnipeg, Canada: IEEE, May 2004, pp. 183-187.

        • O(|T| |leaves(P)|) time, no extra space needed

      • Y. Chen and Y.B. Chen, "On the Top-down Tree Inclusion Algorithm," submitted to Information Processing Letters.

        • O(|T| · height(P)) time, no extra space needed

      • Advantages of top-down over bottom-up:

        • better computational complexity

        • trees can be checked page-wise (suitable for large data volumes)

        • signatures can be integrated into tree inclusion to cut off useless subtree checks as early as possible



    Evaluation of Path-Oriented Queries

    • Top-down Algorithm

    • Target tree: T = <t; T1, ..., Tk>, where t = root(T) and each Ti (i = 1, ..., k) is a subtree of t.

    • Pattern forest: G = <P1, ..., Pq>, where each Pj (j = 1, ..., q) is a subtree.

    • Main idea: The algorithm attempts to find the number j (j ≥ 0) of subtrees within an ordered forest G = <P1, ..., Pq> (q ≥ 1) that are embedded in a target tree T. If j = q, we say that G is embedded in T. If j < q, then only the trees P1, ..., Pj are embedded in T. Let p1, ..., pq and t be the roots of P1, ..., Pq and T, respectively. Since a forest does not have a root, we use a virtual node pv to serve as a substitute for root(G). Thus, root(G) returns pv if G = <P1, ..., Pq> with q > 1, and returns p1 if q = 1.


    Case 1: root(G) ≠ pv (i.e., G = <P> is a tree and root(G) = p), and label(p) ≠ label(t). If G is embedded in T, then there must exist a subtree Ti of t that contains the whole of G. The algorithm should return 1 if an embedding can be found and 0 if it cannot.

    [Figure: label(root(T)) ≠ label(root(G)); tree G is included in a subtree Ti of T.]


    Case 2: root(G) ≠ pv (i.e., G = <P> and root(G) = p), and label(p) = label(t). Let <P1, ..., Pl> (l ≥ 0) be the forest of subtrees of p and <T1, ..., Tk> the forest of subtrees of t. If G is embedded in T, there must exist two sequences of integers k_1, ..., k_g and l_1, ..., l_g (g ≤ l) such that T_{k_i} includes <P_{l_{i-1}+1}, ..., P_{l_i}> (i = 1, ..., g; l_0 = 0, l_g = l), where <P_{l_{i-1}+1}, ..., P_{l_i}> represents a forest containing the subtrees P_{l_{i-1}+1}, ..., and P_{l_i}. Thus, if l_g = l, the algorithm should return 1, since we have a root-preserving inclusion of G in T. Otherwise, it should return 0.

    [Figure: label(root(T)) = label(root(G)); the subtrees T1, ..., Tk of t include, in order, the subtrees P1, ..., Pl of p.]


    Case 3: root(G) = pv and there exists an integer j (0 ≤ j ≤ q) such that <P1, ..., Pj> is included in T. If j = q, then the whole of G is embedded in T. There are two possibilities to be considered when looking for j. The first possibility is similar to Case 2: there are two sequences of integers k_1, ..., k_g and l_1, ..., l_g (g ≤ q) that represent the order in which the subtrees of root(G) are embedded in the subtrees of root(T). In this case, j = l_g. If j = 0, we check the second possibility to see whether there exists a root-preserving inclusion of P1 in T, i.e., label(p1) = label(t) and the subtrees of p1 are included in the subtrees of t. In this case, j = 1.


    [Figure, possibility 1: the root of G is a virtual node; the subtrees of the virtual node are included, in order, in the subtrees T1, ..., Tk of t.]

    [Figure, possibility 2: label(root(T)) = label(root(P1)); the root of G is a virtual node and P1 is included in T by a root-preserving inclusion.]


    Top-down Algorithm (pseudocode)

    function top-down-process(T, G)
    input: T = <t; T1, ..., Tk>, G = <p; P1, ..., Pq>
           (*p may or may not be a virtual node.*)
    output: if root(G) is virtual, returns j ≥ 0;
            else returns 1 if T includes G; otherwise returns 0.
    begin
      if root(G) is virtual then
        if (|T| < |P1| + |P2| or p has only one child)
        then G := P1;
        else { j := bottom-up-process(T, G);
               if (j = 0 and label(t) = label(P1's root))
               (*second possibility in Case 3*)
               then { change P1's root to a virtual node;
                      x := bottom-up-process(T, P1);
                      if (x = the number of the children of P1's root)
                      then j := 1 else j := 0; }
               return j; }
      if |T| < |G| then return 0;
      else { if (label(t) = label(p)) (*handling Case 2*)
             then { p := virtual node;
                    j := bottom-up-process(T, G);
                    (*l is the number of subtrees of p, as in Case 2*)
                    if (j = l) then return 1 else return 0; }
             else { if t is a leaf then return 0;
                    (*handling Case 1*)
                    i := 1;
                    while (i ≤ k) do
                    { if top-down-process(Ti, G) > 0 then return 1;
                      i := i + 1; }
                    return 0; } }
    end

    function bottom-up-process(T, G)
    input: T = <t; T1, ..., Tk>, G = <p; P1, ..., Pq>
    output: j - an integer
    begin
      j := 0; i := 1;
      while (j < q and i ≤ k) do
      { x := top-down-process(Ti, G);
        j := j + x; G := <p; Pj+1, ..., Pq>; i := i + 1; }
      return j;
    end
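    The recursion behind the three cases can be sketched compactly in Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the `Tree` class and all function names are hypothetical, the size-based shortcuts of the pseudocode (such as the |T| < |G| test) are omitted, and no attempt is made to match the stated time bounds. A single pattern tree is treated as a forest of one tree, which folds Cases 1 and 2 into the forest case.

```python
from dataclasses import dataclass, field


@dataclass
class Tree:
    """Hypothetical ordered labeled tree (not from the slides)."""
    label: str
    children: list = field(default_factory=list)


def embedded_in_forest(targets, patterns):
    """Number j of leading pattern trees embedded, left to right,
    in the ordered forest of target trees."""
    j = 0
    for t in targets:
        if j == len(patterns):
            break
        j += embedded_prefix(t, patterns[j:])
    return j


def embedded_prefix(t, patterns):
    """Largest j such that <P1, ..., Pj> is embedded in the tree rooted at t."""
    if not patterns:
        return 0
    # Possibility 1: the pattern trees are embedded in the subtrees of t.
    j = embedded_in_forest(t.children, patterns)
    # Possibility 2: a root-preserving inclusion of P1 in the tree rooted at t.
    if j == 0:
        p1 = patterns[0]
        if (p1.label == t.label and
                embedded_in_forest(t.children, p1.children) == len(p1.children)):
            j = 1
    return j


def includes(t, p):
    """True if the tree rooted at t includes the pattern tree p."""
    return embedded_prefix(t, [p]) == 1
```

    With T = a(b(c, d), e), the pattern a(c, e) is included, whereas a(e, c) is not, because the left-to-right order of c and e would be reversed.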


    Integration of Signatures into Top-down Inclusion

    • Definition. A signature for a key word or an attribute value is a hash-coded bit string.

    • Example (constructing a signature for a word with m = 4 and F = 12):

      • word: "database"

      • letter triplets: dat, ata, tab, aba, bas, ase

      • H(dat) = 5, H(ata) = 1, H(tab) = 8, H(aba) = 1, H(bas) = 10, H(ase) = 8

      • signature: 100 010 010 100 (bits 1, 5, 8, and 10 are set)

      • D. Dervos, Y. Manolopoulos and P. Linardis, "Comparison of signature file models with superimposed coding," Information Processing Letters 65 (1998) 101-106.
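    The construction in the example can be reproduced in a few lines. This is a minimal sketch: the hash function H is not specified on the slide, so the table below simply hard-codes the six hash values given there, and the function name `build_signature` is an assumption. Bit positions are counted from 1, as in the example.

```python
def build_signature(word, hash_positions, length):
    """Set one bit per letter triplet of `word`; positions are 1-based."""
    bits = ["0"] * length
    triplets = [word[i:i + 3] for i in range(len(word) - 2)]
    for triplet in triplets:
        bits[hash_positions[triplet] - 1] = "1"
    return "".join(bits)


# Hash values copied from the slide; the underlying hash function is unknown.
H = {"dat": 5, "ata": 1, "tab": 8, "aba": 1, "bas": 10, "ase": 8}

signature = build_signature("database", H, 12)
print(signature)  # 100010010100
```

    Two pairs of triplets collide (ata/aba and tab/ase), so the six triplets set exactly m = 4 distinct bits.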


    • Important parameters:

      • m: number of 1s in the bit string

      • F: length of the bit string

      • D: size of a block (or the average number of key words of an element)

    • Optimal choice of the parameters:

      F × ln 2 = m × D    (1)

      • S. Christodoulakis and C. Faloutsos, "Design considerations for a message file server," IEEE Trans. Software Engineering, 10(2) (1984) 201-210.
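    Formula (1) can be solved for F once m and D are chosen. The helper below is a small sketch of that arithmetic; the function name and the decision to round up to a whole number of bits are assumptions.

```python
import math


def optimal_signature_length(m, D):
    """Solve F * ln 2 = m * D for F and round up to whole bits."""
    return math.ceil(m * D / math.log(2))


# With m = 4 bits per key word and D = 2 key words per block,
# F = 8 / ln 2 ≈ 11.54, which rounds up to 12 bits.
print(optimal_signature_length(4, 2))
```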


    [Figure: example label signatures a: 0101 0000, b: 0011 1000, c: 0001 0101, d: 0010 1000, e: 1010 1000, f: 1100 0000, and a target tree T whose nodes are named t0, t1, t2, t11, t12, t21, t22. The tree layout is not preserved in this transcript.]

    • Assigning signatures to tree nodes

    Let v be a node in a tree T. If v is a leaf node, its signature s_v is equal to the signature assigned to its label. Otherwise, s_v = s ∨ s_{v1} ∨ ... ∨ s_{vn}, where s represents the signature for the label associated with v, and s_{v1}, ..., s_{vn} are the signatures of v's children v1, ..., vn, respectively.

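    [Figure: tree T annotated with node signatures. Root a: 1111 1101; inner-node signatures 1111 1101 and 1111 1000; leaves f: 1100 0000, c: 0001 0101, d: 0010 1000, e: 1010 1000. The tree layout is not preserved in this transcript.]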
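    Superimposing child signatures onto a node's label signature is a bitwise OR. The sketch below uses the 8-bit label signatures that appear in the slide figures; the tuple representation of trees and the particular tree shape a(a(e, f), b(c, d)) are assumptions, chosen because they reproduce the node signatures 1111 1000, 0011 1101, and 1111 1101 shown on the slides.

```python
# Label signatures as they appear in the slide figure (8 bits each).
LABEL_SIGNATURE = {
    "a": 0b01010000,
    "b": 0b00111000,
    "c": 0b00010101,
    "d": 0b00101000,
    "e": 0b10101000,
    "f": 0b11000000,
}


def node_signature(tree):
    """tree is a (label, children) pair; OR the label signature
    with the signatures of all children."""
    label, children = tree
    signature = LABEL_SIGNATURE[label]
    for child in children:
        signature |= node_signature(child)
    return signature


# Assumed tree shape, consistent with the signatures on the slides.
left = ("a", [("e", []), ("f", [])])
right = ("b", [("c", []), ("d", [])])
root = ("a", [left, right])

for name, tree in (("left", left), ("right", right), ("root", root)):
    print(name, format(node_signature(tree), "08b"))
```

    Because a signature only accumulates bits on the way up, the signature of a node always covers the signatures of every node in its subtree.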


    [Figure: target-tree nodes t0, t1, t2, t11, t12, t21, t22 and pattern nodes p0, p1, p2 with their signatures; the layout is not preserved in this transcript.]

    • Cutting off useless subtree checks by examining signatures

    We assign each node v in T a bit string s_v (called a signature), and each node u in P a bit string s_u, in such a way that if s_u matches s_v then the subtree T_v rooted at v may include the subtree P_u rooted at u. Otherwise, T_v definitely does not contain P_u. By "matching" we mean that for each bit set to 1 in s_u, the corresponding bit in s_v is also set to 1, while for a bit set to 0 in s_u, the corresponding bit in s_v can be 0 or 1.

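    [Figure: target tree T with root a (1111 1101), inner nodes a (1111 1000) and b (0011 1101), and leaves e (1010 1000), f (1100 0000), c (0001 0101), d (0010 1000); pattern P has a virtual node as its root. A subtree of T whose signature does not match is marked "This subtree will not be explored."]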
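    The matching test is a single bitwise operation. The sketch below shows it with the signatures from the slide figure; the function name is an assumption, and the pattern signature 0011 1101 is the OR of the label signatures of c and d, used here as a hypothetical pattern forest <c, d>.

```python
def may_include(s_v, s_u):
    """True if every bit set in the pattern signature s_u is also set in
    the target signature s_v. False means T_v cannot contain P_u."""
    return (s_u & s_v) == s_u


pattern = 0b00010101 | 0b00101000   # signatures of c and d, superimposed

print(may_include(0b11111101, pattern))  # root a: must be explored
print(may_include(0b11111000, pattern))  # inner node a: can be skipped
print(may_include(0b00111101, pattern))  # inner node b: must be explored
```

    A positive result is only a hint, since different labels can hash to overlapping bits; the tree inclusion check itself still has to confirm the match.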


    • Determining the length of signatures

    • Consider s = s1 ∨ s2, where s1 and s2 are of length F and have m1 and m2 bits set to 1, respectively.

    • How do we determine the length of s?

      • l: the number of 1s in s

      • d = l - m', where m' = max(m1, m2)

      • length(s) = F + cd, where c is a constant that should be tuned for different applications.

    • The value of d can be estimated as follows. Let λ be a random variable representing the number of positions in which both s1 and s2 have 1s. Then

      E(λ) = 1 × p(λ = 1) + 2 × p(λ = 2) + ... + m'' × p(λ = m'')    (2)

      where m'' = min(m1, m2) and p(λ = i) represents the probability that λ is equal to i:

      p(λ = i) = [formula not preserved in this transcript]    (3)

    • Since l = m1 + m2 - λ, we have d = l - m' = m1 + m2 - λ - max(m1, m2).
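    The estimate can be turned into a short calculation. Formula (3) is not preserved in this transcript, so the sketch below substitutes an assumption: if the 1s of s1 and s2 are placed uniformly at random among F positions, the overlap follows a hypergeometric distribution. This may or may not be the formula on the original slide, and all function names are hypothetical.

```python
from math import comb


def overlap_probability(F, m1, m2, i):
    """ASSUMED form of p(overlap = i): hypergeometric, i.e. the 1s of both
    signatures are uniformly random among F positions."""
    return comb(m1, i) * comb(F - m1, m2 - i) / comb(F, m2)


def expected_overlap(F, m1, m2):
    """Formula (2): sum of i * p(overlap = i) for i = 1 .. min(m1, m2)."""
    return sum(i * overlap_probability(F, m1, m2, i)
               for i in range(1, min(m1, m2) + 1))


def signature_length(F, m1, m2, c):
    """length(s) = F + c * d, with d = m1 + m2 - overlap - max(m1, m2),
    using the expected overlap as the estimate."""
    d = m1 + m2 - expected_overlap(F, m1, m2) - max(m1, m2)
    return F + c * d


print(expected_overlap(12, 4, 4))
print(signature_length(12, 4, 4, c=1))
```

    Under this assumption the expected overlap equals m1 × m2 / F, so for F = 12 and m1 = m2 = 4 about 1.33 positions coincide and d is about 2.67.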


    • Procedure for calculating signature length

      • 1) Identify the key words in a document, which can be done by using the Connexor analyzer (http://www.connexor.com/demos/index.html).

      • 2) Determine the length of the signatures for the nodes of a document tree, which can be done in two steps:

        • First, use formula (1) to determine the initial length of the signatures according to the number of the chosen key words and their distribution.

        • Second, use formulas (2) and (3) to determine the length of the signatures for each document according to the initial length set for signatures.


    [Figure: procedure for calculating signature length; not preserved in this transcript. In the figure, F stands for the initial length of the signatures and m for the initial number of bits set to 1.]


    Experiment Results

    • Test platform

      • Computer: DELL desktop PC equipped with a Pentium III 864 MHz processor, 512 MB RAM, and a 20 GB hard disk.

      • Database system: Oracle-9i Enterprise Edition. The default buffer cache of Oracle-9i is 4 MB.

      • Language: Oracle PL/SQL.

      • Data: all 37 of Shakespeare's plays, stored in a database.


    • Storage of XML documents in databases

    • All the documents are stored in three tables.

      • The relation Element has the following structure:

        {DocID: <integer>, ID: <integer>, Ename: <string>, firstChildID: <integer>, siblingID: <integer>, attributeID: <integer>}

      • The relation Text has a simpler structure:

        {DocID: <integer>, textID: <integer>, value: <string>},

        where "textID" holds the identifiers of texts that are the values of the corresponding elements in the original document. Note that a text always has an element as its parent node. [Illustrating table not preserved in this transcript.]

      • The relation Attribute has five data fields:

        {DocID: <integer>, att-ID: <integer>, parentID: <integer>, att-name: <string>, att-value: <string>}.
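    The three relations map directly onto table definitions. The sketch below uses Python's built-in sqlite3 module rather than the Oracle-9i setup of the experiments, renames the hyphenated fields (att-ID, att-name, att-value) with underscores, and assumes that siblingID points to the next sibling and is NULL for the last child; the sample rows and the `children` helper are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE "Element" (DocID INTEGER, ID INTEGER, Ename TEXT,
                        firstChildID INTEGER, siblingID INTEGER,
                        attributeID INTEGER);
CREATE TABLE "Text" (DocID INTEGER, textID INTEGER, value TEXT);
CREATE TABLE "Attribute" (DocID INTEGER, att_ID INTEGER, parentID INTEGER,
                          att_name TEXT, att_value TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical sample rows: one root element with two children.
conn.executemany(
    'INSERT INTO "Element" VALUES (?, ?, ?, ?, ?, ?)',
    [(1, 1, "hotel-room-reservation", 2, None, None),
     (1, 2, "name", None, 3, None),
     (1, 3, "location", None, None, None)],
)


def children(conn, doc_id, element_id):
    """Follow firstChildID, then the siblingID chain (assumed semantics)."""
    row = conn.execute(
        'SELECT firstChildID FROM "Element" WHERE DocID = ? AND ID = ?',
        (doc_id, element_id)).fetchone()
    child = row[0] if row else None
    names = []
    while child is not None:
        name, sibling = conn.execute(
            'SELECT Ename, siblingID FROM "Element" WHERE DocID = ? AND ID = ?',
            (doc_id, child)).fetchone()
        names.append(name)
        child = sibling
    return names


print(children(conn, 1, 1))
```

    The first-child/next-sibling layout lets a top-down traversal visit the children of a node in document order, which is what an ordered tree inclusion check needs.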


    • Tested queries [query tables not preserved in this transcript]

      • Group I: for testing path length impact

      • Group II: for testing node degree impact

      • Group III: for testing impact of matching at higher level

      • Group IV: for testing impact of matching at middle level

      • Group V: for testing impact of matching at lower level


    • Tested methods

      • Inversion on Elements and Words (IEW) (C. Zhang, J. Naughton, D. DeWitt, Q. Luo and G. Lohman, "On Supporting Containment Queries in Relational Database Management Systems," in Proc. of ACM SIGMOD Intl. Conf. on Management of Data, California, USA, 2001.)

      • Inversion on Paths and Words (IPW) (C. Seo, S. Lee, and H. Kim, "An Efficient Index Technique for XML Documents Using RDBMS," Information and Software Technology 45 (2003) 11-22, Elsevier Science B.V.)

      • Tree Inclusion Algorithm (TIA)

      • Tree Inclusion with Signatures (TIS)


    • Tested methods: Inversion on Elements and Words (IEW)

      • (Dno, Wposition, level) for a text word

      • (Dno, Eposition, level) for an element

    Example:

    E-index:
      hotel-room-reservation    (1, <1, 45>, 0) ...
      name                      (1, <2, 4>, 1) ...
      location                  (1, <5, 28>, 2) ...
      ...                       ... ...

    T-index:
      Travel-lodge              (1, 3, 2) ...
      Winnipeg                  (1, 7, 3) ...
      Manitoba                  (1, 10, 3) ...
      ...                       ... ...


      • To evaluate the query /hotel-room-reservation/location/address[street = 'Portage Ave.'] with IEW, four joins are performed:

        • three self-joins on the E-index relation to connect 'hotel-room-reservation' and 'location', 'location' and 'address', and 'address' and 'street';

        • one join between the E-index and T-index relations to connect 'street' and 'Portage Ave.'
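    Each of these joins tests containment between two index entries. The sketch below shows one plausible form of that test, assuming that an element's Eposition <begin, end> brackets the positions of everything inside it and that a direct child sits exactly one level deeper; the named tuples and function names are hypothetical, and the sample entries are taken from the E-index and T-index example.

```python
from collections import namedtuple

Element = namedtuple("Element", "doc begin end level")
Word = namedtuple("Word", "doc position level")


def is_child_element(parent, child):
    """Assumed join condition between two E-index entries."""
    return (parent.doc == child.doc
            and parent.begin < child.begin
            and child.end < parent.end
            and child.level == parent.level + 1)


def is_child_word(parent, word):
    """Assumed join condition between an E-index and a T-index entry."""
    return (parent.doc == word.doc
            and parent.begin < word.position < parent.end
            and word.level == parent.level + 1)


hotel = Element(1, 1, 45, 0)        # hotel-room-reservation
name = Element(1, 2, 4, 1)          # name
travel_lodge = Word(1, 3, 2)        # Travel-lodge
winnipeg = Word(1, 7, 3)            # Winnipeg

print(is_child_element(hotel, name))      # name lies inside the root
print(is_child_word(name, travel_lodge))  # Travel-lodge is the text of name
print(is_child_word(name, winnipeg))      # Winnipeg lies outside name
```

    A path with n steps needs one such join per step, which is why the four-step example query requires four joins.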


    • Tested methods: Inversion on Paths and Words (IPW)

      • Path(path, pathID)

      • PathIndex(pathID, docno, begin, end)

      • Word(word, wordID)

      • WordIndex(wordID, docno, pathID, position)


      • To process the same query, /hotel-room-reservation/location/address[street = 'Portage Ave.'], with IPW, two joins are needed:

        • The first join is between the Path and WordIndex relations, with the join condition Path.path = 'hotel-room-reservation/location/address/street' and Path.pathID = WordIndex.pathID.

        • The second join is between the result R of the first join and the Word relation, with the join condition R.wordID = Word.wordID and Word.word = 'Portage Ave.'.
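    The two joins can be written as one SQL statement. The sketch below runs it with Python's built-in sqlite3 module on a few hypothetical rows; the table and column names follow the four IPW relations, while the sample data, the identifier values, and the choice to return docno and position are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Path (path TEXT, pathID INTEGER);
CREATE TABLE PathIndex (pathID INTEGER, docno INTEGER,
                        "begin" INTEGER, "end" INTEGER);
CREATE TABLE Word (word TEXT, wordID INTEGER);
CREATE TABLE WordIndex (wordID INTEGER, docno INTEGER,
                        pathID INTEGER, position INTEGER);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO Path VALUES (?, ?)", [
    ("hotel-room-reservation/location/address/street", 1),
    ("hotel-room-reservation/name", 2),
])
conn.executemany("INSERT INTO Word VALUES (?, ?)", [
    ("Portage Ave.", 10),
    ("Travel-lodge", 11),
])
conn.executemany("INSERT INTO WordIndex VALUES (?, ?, ?, ?)", [
    (10, 1, 1, 20),
    (11, 1, 2, 3),
])

QUERY = """
SELECT WordIndex.docno, WordIndex.position
FROM Path
JOIN WordIndex ON Path.pathID = WordIndex.pathID   -- first join
JOIN Word ON WordIndex.wordID = Word.wordID        -- second join
WHERE Path.path = ? AND Word.word = ?
"""

rows = conn.execute(
    QUERY,
    ("hotel-room-reservation/location/address/street", "Portage Ave."),
).fetchall()
print(rows)
```

    Because the whole path is stored as one string, the number of joins stays at two regardless of path length, in contrast to the one-join-per-step cost of IEW.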


    • Tested results

    [Figures: five plots titled "Results of Group I" through "Results of Group V", each showing execution time (sec.) per query (Q1, Q2, ...) for the tested methods; the legends name IPW, TIS, and TIA, and one plot also names IEW. The plotted data are not preserved in this transcript.]


    Summary and Future Work

    • Path-oriented queries in document databases

    • Evaluation of path-oriented queries

      • Top-down algorithm for the tree inclusion problem

      • Integration of signatures into top-down tree inclusion

    • Future work:

      • document recognition using tree inclusion

      • probabilistic analysis

      • Benford's law

      • Zipf's law

