
Querying Tree-Structured Data Using Dimension Graphs

Querying Tree-Structured Data Using Dimension Graphs. Dimitri Theodoratos (New Jersey Institute of Technology, USA), Theodore Dalamagas (National Techn. University of Athens, Greece). Tree-structured Data Management. Tree structures: a means to organize the information on the Web.


Presentation Transcript


  1. Querying Tree-Structured Data Using Dimension Graphs Dimitri Theodoratos (New Jersey Institute of Technology, USA) Theodore Dalamagas (National Techn. University of Athens, Greece)

  2. Tree-structured Data Management • Tree structures: a means to organize the information on the Web. • Examples: taxonomies, thematic categories, concept hierarchies, product catalogs, etc. • Organizing data in tree structures (tree-structured data) has become widely established due to the popularity of the XML language. • XML language (W3C): the standard data exchange format on the Web. • Data is stored natively in tree structures, or • Data is publicly available in tree structures to enable its automatic processing by programs, scripts, and agents.

  3. Tree-structured Data Management • Querying tree-structured data is based on path expression queries. • Popular query languages for tree-structured data: XPath and XQuery (W3C), e.g.: FOR $i IN /brand/type[price<900] RETURN {$i/id, $i/condition, $i/price} (find products cheaper than 900, and display their id, condition, and price). • Querying tree-structured data faces two major obstacles: • the semistructured nature of the data, • the lack of semantics. • This is the price one has to pay for the flexibility offered by XML technologies. • Example data: ... <brand> Sony <type> laptop <id> 1 </id> <condition> used </condition> <price> 800 </price> </type> </brand> ...
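To make the path expression on this slide concrete, here is a minimal Python sketch that mimics /brand/type[price<900] on the slide's XML fragment. It uses the standard library's xml.etree.ElementTree instead of an XQuery engine; the function name cheap_products and the constant FRAGMENT are illustrative assumptions, not part of the presentation.

```python
import xml.etree.ElementTree as ET

# The XML fragment from the slide, written as one <brand> element.
FRAGMENT = (
    "<brand>Sony<type>laptop<id>1</id><condition>used</condition>"
    "<price>800</price></type></brand>"
)


def cheap_products(brand_element, limit):
    """Mimic /brand/type[price<limit]; return (id, condition, price) rows.

    Hypothetical helper: assumes every <type> has id, condition and price.
    """
    rows = []
    for product in brand_element.findall("type"):
        price = float(product.findtext("price"))
        if price < limit:
            rows.append(
                (product.findtext("id"), product.findtext("condition"), price)
            )
    return rows


brand = ET.fromstring(FRAGMENT)
print(cheap_products(brand, 900))
```

Note that the helper hard-codes the sequence brand, then type: it only works when the data is organized exactly that way, which is the dependence on structure that the following slides discuss.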

  4. Semistructured Nature of Tree-structured Data • Due to the first obstacle (i.e. the semistructured nature): • Querying tree-structured data requires resolving structural differences and inconsistencies. • The reason: there are different possible ways of organizing the same information in tree structures. • Examples: • Structural differences: certain ‘nodes’ (i.e. categories, elements, etc.) exist in one tree-structured data source but not in another. • Structural inconsistencies: variations in ‘node’ sequences (even within a single tree-structured data source).

  5. Structural difference • Product catalog A has a finer categorization of notebooks than catalog B, e.g. Custom/Ultralight, and 10''/8'' for the ultralight ones. [Figure: tree diagrams of Product Catalog A and Product Catalog B, both rooted at r, with category nodes (PDAs, Notebooks, Desktops, Servers, Multimedia; in catalog A also Custom, Ultralight, 8'', 10''), brand nodes (Mac, Sony, HP, IBM, Dell) and condition nodes (New, Used).]

  6. Structural inconsistency • Product catalog A classifies notebooks by brand and then by condition, while catalog B does it the other way around (Sony/Used vs. Used/Sony). [Figure: tree diagrams of Product Catalog A and Product Catalog B, both rooted at r; in A the brand nodes are parents of the condition nodes under Notebooks, in B the condition nodes are parents of the brand nodes.]

  7. Semistructured Nature of Tree-structured Data • Structural inconsistency (cont.) • One XML document includes the element sequence brand, type, condition, while another one (for the same data) includes type, condition, brand. • Such inconsistencies are observed even within the tree-structured data of a single data source. • First document: ... <brand> Sony <type> laptop <condition> used </condition> <price> 800 </price> </type> </brand> ... • Second document: ... <type> laptop <condition> used <brand> Sony </brand> <price> 800 </price> </condition> </type> ... [Figure: the two element sequences, brand → type → condition and type → condition → brand.]

  8. Semistructured Nature of Tree-structured Data • How do structural differences and inconsistencies affect the querying of tree-structured data? • The user has to specify them explicitly as part of the query. • This is extremely cumbersome. • E.g.: explicitly specify disjunctions of possible alternative node sequences: /brand/type[price<900] OR /type/condition[price<900] OR /condition/type[price<900] .... • Example data: ... <brand> Sony <type> laptop <condition> used </condition> <price> 800 </price> ... and ... <type> laptop <condition> used <brand> Sony </brand> <price> 800 </price> ...

  9. Semistructured Nature of Tree-structured Data • However, sometimes specifying alternate node sequences is not due to the need to resolve structural differences and inconsistencies. • Users should be able to pose queries even if they do not know (or do not care about) the exact structure of tree-structured data sources. • E.g. find products cheaper than 900, and display their id, condition, and price • ...but I do not know (or I do not care!) whether condition comes before brand and type! • Currently, query formulation on tree-structured data is strictly dependent on the structure of the data. • Only the ancestor/descendant relationship allows relaxed path expressions (e.g. brand//type).

  10. Lack of Semantics in Tree-structured Data • Reminder: querying tree-structured data faces two major obstacles: • the semistructured nature of the data (just explained) + the lack of semantics. • Tree-structured data provides mainly syntactic and not semantic information. • However, there are inherent semantics in tree-structured data. • Sets of nodes in a catalog are usually related under a semantic interpretation, e.g. Mac, HP, Sony refer to a brand name. • Such information can be exploited to become part of query formulation and to support query optimization. • Currently, query formulation on tree-structured data ignores this information.

  11. Our Approach • We introduce the notion of dimension graphs to capture semantic information in tree-structured data. • We design a query language for tree-structured data. • Queries are not cast on the structure of tree-structured data. • Queries can handle structural differences and inconsistencies effectively. • We discuss query evaluation issues. • We show how dimension graphs can be used to query multiple tree-structured data sources.

  12. Data Model • We use value trees to represent tree-structured data. • Values (i.e. nodes) in value trees are grouped to form dimensions. • A dimension... • ...is a set of semantically related nodes (i.e. values) in the value tree. • The semantic interpretation is given by the user. • Two nodes in the same path cannot belong to the same dimension.

  13. Data Model • E.g. dimensions pc_type = {Notebooks, Desktops, PDAs}, pc_category = {Servers, Multimedia}, brand = {Mac, Sony, HP, IBM, Dell}, etc. [Figure: value tree rooted at r, with its nodes grouped and labeled by dimension (pc_type, pc_category, condition, brand).]

  14. Data Model • We use dimension graphs to capture relationships between dimensions. • The nodes of a dimension graph represent dimensions. • There is an edge from dimension D1 to D2 if a value of D1 is the parent of some value in D2.
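The edge rule on this slide translates almost directly into code. The following Python sketch assumes a value tree encoded as nested dictionaries, a user-supplied mapping from dimension names to value sets, and a dimension R that holds only the root value r; the function name extract_dimension_graph and the small sample tree are hypothetical.

```python
def extract_dimension_graph(tree, dimensions):
    """Return the set of (parent_dimension, child_dimension) edges.

    An edge D1 -> D2 is added whenever a value of D1 is the parent of a
    value of D2. Assumes every value in the tree belongs to a dimension.
    """
    value_to_dim = {v: name for name, values in dimensions.items() for v in values}
    edges = set()

    def walk(children, parent_dim):
        for value, grandchildren in children.items():
            dim = value_to_dim[value]
            if parent_dim is not None:
                edges.add((parent_dim, dim))
            walk(grandchildren, dim)

    walk(tree, None)
    return edges


# Small illustrative value tree with a structural inconsistency:
# brand below condition under Notebooks, condition below brand under Desktops.
sample_tree = {
    "r": {
        "Notebooks": {"Used": {"Sony": {}}},
        "Desktops": {"Sony": {"Used": {}}},
    }
}
sample_dimensions = {
    "R": {"r"},
    "pc_type": {"Notebooks", "Desktops", "PDAs"},
    "condition": {"New", "Used"},
    "brand": {"Mac", "Sony", "HP", "IBM", "Dell"},
}

for edge in sorted(extract_dimension_graph(sample_tree, sample_dimensions)):
    print(edge)
```

Because the sample tree orders condition and brand both ways, the resulting graph contains both the edge condition → brand and the edge brand → condition.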

  15. Data Model [Figure: value tree T (root r, nodes labeled by the dimensions pc_type, pc_category, condition, brand) next to the dimension graph of T, which has root R and the dimension nodes pc_type, pc_category, condition, brand.]

  16. Data Model [Figure: value tree T and the dimension graph of T, same as on slide 15.]

  17. Data Model • A dimension graph... • can be automatically extracted from a value tree, given the dimensions, • provides an abstraction of the structural information of value trees, • provides semantic query guidance to pose queries on tree-structured data, in the presence of structural differences and inconsistencies, • supports query evaluation and optimization. • ...will be explained soon.

  18. Querying Tree-structured Data • Queries are defined on dimension graphs and not directly on value trees. • The user annotates some dimensions. • She can also leave the parent-child and ancestor-descendant relationships between the annotated dimensions unspecified, or specify them only partially. • Our system identifies the possible ‘valid’ orderings of dimensions by exploiting the dimension graph. • These orderings are used as patterns for constructing a set of path expressions to be sent directly to the value trees.

  19. Querying [Figure: value tree T and a query on the dimension graph of T (root R), with the annotated dimensions pc_type = ?, condition = {used}, brand = {Sony, IBM}.] • Legend: an annotated dimension carries either ‘= ?’ (the dimension can have any value) or ‘= { ... }’ (the dimension should have specific values).

  20. Querying • ‘Find all Sony, IBM used products’, i.e. find paths in T from r to a leaf node that contain - any of the values of dimension pc_type, - the value ‘used’ of dimension condition, - either value ‘Sony’ or ‘IBM’ of dimension brand. [Figure: value tree T and the query on the dimension graph of T, with pc_type = ?, condition = {used}, brand = {Sony, IBM}.]

  21. Querying [Figure: value tree T and the query ‘Find all Sony, IBM used products’ on the dimension graph of T, same as on slide 20.]

  22. Querying • Notice how the query handles the structural inconsistencies! [Figure: value tree T and the query on the dimension graph of T, with pc_type = ?, condition = {used}, brand = {Sony, IBM}.]

  23. Querying • ‘Find all Sony, IBM used products. However, the nodes referring to brand name should be after the node ‘used’.’, i.e. find paths in T from r to a leaf node that contain - any of the values of dimension pc_type, - the value ‘used’ of dimension condition, - either value ‘Sony’ or ‘IBM’ of dimension brand. However: values of condition should be parents of values of brand. [Figure: part of value tree T and the query on the dimension graph of T, with pc_type = ?, condition = {used}, brand = {Sony, IBM}.]

  24. Querying • Find paths in T from r to a leaf node that contain - any of the values of dimension pc_type, - the value ‘used’ of dimension condition, - either value ‘Sony’ or ‘IBM’ of dimension brand. However: values of condition should be parents of values of brand. [Figure: value tree T and the query on the dimension graph of T, with pc_type = ?, condition = {used}, brand = {Sony, IBM}.]
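The extra requirement in this query (values of condition should be parents of values of brand) can be read as a filter over candidate orderings of dimensions. The Python sketch below is an assumption about how such a filter might look, not the authors' algorithm; the candidate orderings are written out by hand and the name satisfies_parent_child is hypothetical.

```python
def satisfies_parent_child(path, constraints):
    """True if, for each (parent, child) pair, child directly follows parent."""
    for parent, child in constraints:
        if parent not in path or child not in path:
            return False
        if path.index(child) != path.index(parent) + 1:
            return False
    return True


# Hand-written candidate orderings of dimensions (illustrative only).
candidate_orderings = [
    ("R", "pc_type", "condition", "brand"),
    ("R", "pc_type", "brand", "condition"),
    ("R", "pc_type", "pc_category", "brand", "condition"),
]

# The constraint from the query: condition is the parent of brand.
constraints = [("condition", "brand")]

kept = [p for p in candidate_orderings if satisfies_parent_child(p, constraints)]
print(kept)
```

Only the ordering in which brand directly follows condition survives; if no ordering survived, the query would have an empty answer on the value tree.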

  25. Query Evaluation • Query evaluation exploits dimension graphs to detect answer paths. • An answer path is a path in a dimension graph that starts from R, includes all annotated dimensions, and ends on an annotated dimension. • Examples of answer paths: /R/pc_type/condition/brand, /R/pc_type/pc_category/brand/condition, .... [Figure: query on the dimension graph of T with nodes R, pc_type, mobile_type, condition, pc_category, brand, annotated with pc_type = ?, condition = {used}, brand = {Sony, IBM}.]
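The definition of an answer path suggests a simple depth-first enumeration over the dimension graph. The Python sketch below is one possible reading, not the evaluation algorithm of the presentation: it only follows paths that visit each dimension once (which matches the rule that two nodes on the same path cannot share a dimension), and the edge set GRAPH_EDGES is an assumed example chosen to be consistent with the answer paths listed on the slide.

```python
def answer_paths(edges, annotated, root="R"):
    """Enumerate paths that start at root, cover every annotated dimension,
    and end on an annotated dimension. Each dimension is visited at most once.
    """
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    results = []

    def extend(path):
        last = path[-1]
        if last in annotated and annotated <= set(path):
            results.append(tuple(path))
        for nxt in sorted(adjacency.get(last, [])):
            if nxt not in path:
                extend(path + [nxt])

    extend([root])
    return results


# Assumed edges of a dimension graph (illustrative, not taken from the figure).
GRAPH_EDGES = {
    ("R", "pc_type"),
    ("pc_type", "condition"),
    ("pc_type", "pc_category"),
    ("pc_type", "brand"),
    ("condition", "brand"),
    ("brand", "condition"),
    ("pc_category", "brand"),
}
ANNOTATED = {"pc_type", "condition", "brand"}

for path in answer_paths(GRAPH_EDGES, ANNOTATED):
    print("/" + "/".join(path))
```

A query for which this enumeration returns nothing has no answer paths, so it can be rejected without touching the value tree.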

  26. Query Evaluation • Answer paths are used to generate path expressions to be exploited by e.g. an XQuery engine to retrieve the answers from a value tree. • E.g. /R/pc_type/condition/brand gives /r/(Notebooks|Desktops)/Used/(Sony|IBM). [Figure: value tree T and the query on the dimension graph of T, with pc_type = ?, condition = {used}, brand = {Sony, IBM}.]
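Turning an answer path into a path expression amounts to replacing each dimension by the values allowed for it. The Python sketch below follows the slide's (A|B) notation for alternatives; the function name path_expression, the use of None for a ‘= ?’ annotation, and the sorted order of alternatives are assumptions made for illustration. Unlike the slide's example, the sketch does not prune values that never occur at that position in the value tree.

```python
def path_expression(answer_path, annotations, dimensions, root_value="r"):
    """Build a path expression from an answer path.

    annotations maps a dimension to a set of required values, or to None
    for '= ?' (any value). Dimensions without an annotation also allow
    any of their values.
    """
    steps = [root_value]
    for dim in answer_path[1:]:  # skip the root dimension R
        wanted = annotations.get(dim)
        values = sorted(dimensions[dim] if wanted is None else wanted)
        if len(values) == 1:
            steps.append(values[0])
        else:
            steps.append("(" + "|".join(values) + ")")
    return "/" + "/".join(steps)


DIMENSIONS = {
    "pc_type": {"Notebooks", "Desktops", "PDAs"},
    "condition": {"New", "Used"},
    "brand": {"Mac", "Sony", "HP", "IBM", "Dell"},
}
QUERY = {"pc_type": None, "condition": {"Used"}, "brand": {"Sony", "IBM"}}

print(path_expression(("R", "pc_type", "condition", "brand"), QUERY, DIMENSIONS))
```

The alternation syntax is the slide's shorthand; an actual XPath or XQuery engine would need it rewritten, for instance as a union of paths.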

  27. Query Evaluation • The answer paths help to detect the orderings of values that can possibly exist in a value tree. • Only these value orderings will be used to compute the answer of a query on the value tree. • This is performed before query evaluation reaches the value tree. • Detecting answer paths in a dimension graph is not a costly task, since dimension graphs are much smaller than value trees.

  28. Query Evaluation • Query evaluation exploits dimension graphs to detect unsatisfiable queries (i.e. queries with empty answers in the value tree). • Examples of unsatisfiable queries: [Figure: three example queries on dimension graphs with root R and the dimensions pc_type, brand, pc_category, condition, mobile_type, carrying ‘= ?’ annotations, with the explanations ‘Two children have the same parent!’, ‘No answer paths!’ and ‘No path from condition to mobile_type!’]

  29. Query Evaluation • Dimension graphs can be used to query multiple value trees. • Consider value trees T1, T2, ..., Tn over a dimension set D. • Let G1, G2, ..., Gn be their dimension graphs. • Construct a global dimension graph G by merging G1, G2, ..., Gn. • Queries are formed on G. • The annotations are transferred to G1, G2, ..., Gn. • Query evaluation is performed as described before.
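The slide does not define the merge operation, so the following Python sketch makes an explicit assumption: the global graph G is the union of the edges of G1, ..., Gn. The second helper, which picks the sources that mention every annotated dimension, is likewise a hypothetical simplification of "transferring the annotations"; both function names are invented for illustration.

```python
def merge_dimension_graphs(graphs):
    """Assumed merge: the global graph is the union of all edge sets."""
    merged = set()
    for edges in graphs:
        merged |= set(edges)
    return merged


def sources_for_query(annotated, graphs):
    """Indices of source graphs that mention every annotated dimension."""
    hits = []
    for index, edges in enumerate(graphs):
        nodes = {dim for edge in edges for dim in edge}
        if annotated <= nodes:
            hits.append(index)
    return hits


# Two illustrative source graphs, given as sets of (parent, child) edges.
G1 = {("R", "pc_type"), ("pc_type", "condition"), ("condition", "brand")}
G2 = {("R", "pc_type"), ("pc_type", "brand")}

G = merge_dimension_graphs([G1, G2])
print(sorted(G))
print(sources_for_query({"pc_type", "condition", "brand"}, [G1, G2]))
```

In this example only the first source can answer a query that annotates condition, because the second graph has no condition dimension.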

  30. Conclusions • Querying tree-structured data using dimension graphs: • Dimension graphs capture semantic information in tree-structured data. • They are used for query formulation and evaluation. • Queries are not cast on the structure of tree-structured data but on dimension graphs. • Queries can handle structural differences and inconsistencies in value trees. • Query evaluation exploits dimension graphs to generate appropriate path expressions to be evaluated on the value trees. • Dimension graphs can also be used to query multiple value trees.
