
TOSS: An Extension of TAX with Ontologies and Similarity Queries

TOSS: An Extension of TAX with Ontologies and Similarity Queries. Edward Hung, Yu Deng, V.S. Subrahmanian Department of Computer Science University of Maryland, College Park. Outline. Introduction Ontologies and Integration Similarity Enhanced Ontology (SEO) TOSS Algebra


Presentation Transcript


  1. TOSS: An Extension of TAX with Ontologies and Similarity Queries Edward Hung, Yu Deng, V.S. Subrahmanian Department of Computer Science University of Maryland, College Park

  2. Outline • Introduction • Ontologies and Integration • Similarity Enhanced Ontology (SEO) • TOSS Algebra • Implementation and Experiments • Related Work

  3. XML: EXtensible Markup Language • markup language much like HTML, derived from SGML • designed to describe data for easy data transmission and data manipulation over the web • XML tags: not predefined → flexible to define your own tags

  4. XML Example in DBLP (each element has a tag name, e.g. author, and a value, e.g. Paolo Ciancarini)
     <?xml version="1.0"?>
     <inproceedings>
       <author>Paolo Ciancarini</author>
       <author>Andrea Giovannini</author>
       <author>Davide Rossi</author>
       <title>Mobility and Coordination for distributed Java Applications</title>
       <pages>402-426</pages>
       <year>1999</year>
       <booktitle>Advances in Distributed Systems</booktitle>
     </inproceedings>
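The tag name / value distinction in the record above can be made concrete with a minimal sketch. It assumes Python's standard `xml.etree.ElementTree` parser and writes the page range with a plain hyphen; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The DBLP record from the slide (page range written with a plain hyphen).
RECORD = """<?xml version="1.0"?>
<inproceedings>
  <author>Paolo Ciancarini</author>
  <author>Andrea Giovannini</author>
  <author>Davide Rossi</author>
  <title>Mobility and Coordination for distributed Java Applications</title>
  <pages>402-426</pages>
  <year>1999</year>
  <booktitle>Advances in Distributed Systems</booktitle>
</inproceedings>"""

root = ET.fromstring(RECORD)

# Each child element contributes one (tag name, value) pair.
pairs = [(child.tag, child.text) for child in root]

# The same tag name may occur several times, here once per author.
authors = [value for tag, value in pairs if tag == "author"]

for tag, value in pairs:
    print(f"{tag}: {value}")
```

Because the tags are not predefined, nothing in the parser knows what `author` or `booktitle` means; that gap is what the later slides address with ontologies.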

  5. DBLP

  6. SIGMOD

  7. Motivating Examples and TAX • DBLP and SIGMOD bibliographies in XML • [Jagadish et al., TAX: A Tree Algebra for XML, in DBPL, 2001] • one of the best algebras developed for XML databases • selection, projection, product, etc. • uses pattern trees to find embeddings (matchings)

  8. TAX: Pattern tree • A pattern tree is a pair P = (T, F), where T = (V, E) is an object-labeled and edge-labeled tree such that: • Each object in V has a distinct integer as its label • Each edge is labeled either pc (for parent-child) or ad (for ancestor-descendant) • F is a selection condition applicable to objects [Figure legend: tag name, value; pc: parent-child; ad: ancestor-descendant]

  9. TAX: Embedding • Suppose SDB is a semistructured database and P = (T, F) a pattern tree. An embedding of P into SDB is a total mapping h: P → ∪_{(V, E) ∈ SDB} V from the nodes of T to those in SDB s.t.: • h preserves the structure of T, i.e., whenever (u, v) is a pc (resp. ad) edge in T, h(v) is a child (resp. descendant) of h(u) in SDB • The image under the mapping h satisfies the selection condition F
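The embedding definition can be sketched as a brute-force search. This is only an illustration, not the TAX implementation: the `Node` class, the `edges` dictionary encoding of the pattern tree, and the sample data (including the `article` record) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """A data node with a tag name, an optional value, and ordered children."""
    tag: str
    content: str = ""
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every proper descendant of `node` in preorder."""
    for child in node.children:
        yield child
        yield from descendants(child)


def embeddings(root_label, edges, condition, data_root):
    """Yield each total mapping h (pattern label -> data node) that
    preserves pc/ad edges and satisfies the selection condition F.

    edges: {parent_label: [(child_label, "pc" or "ad"), ...]}
    condition: function taking h and returning True or False.
    """
    def place(label, node):
        # All ways to map the pattern subtree rooted at `label`, given h(label) = node.
        partials = [{label: node}]
        for child_label, kind in edges.get(label, []):
            candidates = node.children if kind == "pc" else list(descendants(node))
            partials = [
                {**h, **sub}
                for h in partials
                for cand in candidates
                for sub in place(child_label, cand)
            ]
        return partials

    for start in [data_root, *descendants(data_root)]:
        for h in place(root_label, start):
            if condition(h):
                yield h


# Hypothetical data instance.
paper = Node("inproceedings", children=[
    Node("author", "Paolo Ciancarini"),
    Node("author", "Davide Rossi"),
    Node("title", "Mobility and Coordination for distributed Java Applications"),
])
article = Node("article", children=[Node("author", "J. Ullman")])
dblp = Node("dblp", children=[paper, article])

# Pattern: #1 --pc--> #2 with F = (#1.tag = inproceedings & #2.tag = author)
pc_matches = list(embeddings(
    1, {1: [(2, "pc")]},
    lambda h: h[1].tag == "inproceedings" and h[2].tag == "author",
    dblp,
))

# Pattern: #1 --ad--> #2 with F = (#1.tag = dblp & #2.tag = author)
ad_matches = list(embeddings(
    1, {1: [(2, "ad")]},
    lambda h: h[1].tag == "dblp" and h[2].tag == "author",
    dblp,
))
```

The search enumerates structure-preserving mappings first and applies F afterwards, which mirrors the two conditions in the definition but is far from efficient on large databases.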

  10. TAX: Witness tree • Each embedding h of a pattern tree P into SDB induces a witness tree to the embedding, denoted h^SDB(P), defined as: • A node n of SDB is in the witness tree if n = h(u) for some node u in the pattern tree P • For any pair of nodes n, m in the witness tree, whenever m is the closest ancestor of n in SDB among the nodes of the witness tree, the witness tree contains the edge (m, n) • The witness tree preserves the order between nodes in SDB, i.e., for any two nodes m, n in h^SDB(P), whenever m precedes n in the preorder enumeration of SDB, it does so in that of h^SDB(P) as well

  11. [Figure: a pattern tree, a data instance, and the witness trees of the embeddings. Legend: tag name, value; pc: parent-child; ad: ancestor-descendant]

  12. TAX: Selection • Suppose SDB is a semistructured database, P = (T, F) a pattern tree, and SL is any set of nodes. A selection query σ_{P,SL}(SDB) returns all witness trees w.r.t. pattern tree P and SDB. In addition, if a node n in SL appears in one of these witness trees, then all descendants of n are also added to that witness tree.

  13. [Figure: a pattern tree, a data instance, and the selection witness trees. Legend: tag name, value; pc: parent-child; ad: ancestor-descendant]

  14. [Figure: a pattern tree, a data instance, and the selection witness trees]

  15. [Figure: a pattern tree and a data instance; selection witness trees → selection result]

  16. TAX: Projection • Suppose SDB is a semistructured database, P = (T, F) a pattern tree, and PL is a projection list (a list of node labels appearing in P). A projection query π_{P,PL}(SDB) returns the tree(s) consisting of all nodes n selected from SDB s.t. for every node n in the result, there exist some witness tree h^SDB(P) and some n' ∈ PL where h(n') = n.

  17. [Figure: projection example with a pattern tree and a data instance]

  18. Product • The product of two instances (two sets of trees) contains, for each pair of trees (one from each instance), a tree whose root is a new node (called tax_prod_root) and whose children are the roots of the two trees in the pair. [Figure: two instances combined by × under tax_prod_root]
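A minimal sketch of the product, assuming each tree is represented as a hypothetical `(tag, children)` tuple and each instance as a list of such trees:

```python
def tax_product(instance_a, instance_b):
    """Pair every tree of instance_a with every tree of instance_b
    under a fresh root node named tax_prod_root."""
    return [("tax_prod_root", [a, b]) for a in instance_a for b in instance_b]


# Hypothetical instances: two trees on one side, one tree on the other.
dblp_trees = [("inproceedings", [("author", [])]), ("article", [("author", [])])]
sigmod_trees = [("article", [("title", [])])]

result = tax_product(dblp_trees, sigmod_trees)
```

The result has one tree per pair, so its size is the product of the two instance sizes.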

  19. Problems! [Figure: SIGMOD and DBLP data]

  20. Problems • Lack of lexical semantics in answering queries • Find papers written by “J. Ullman”: • J.D. Ullman? Jeffrey Ullman? • similar values/tags • Find papers with at least one author from “U.S. government”: • U.S. Census Bureau? U.S. Army? • values/tags with relationships described by ontologies

  21. Problems • TAX returns correct results → high precision • but often misses some correct results → poor recall • Quality = (recall × precision)^(1/2) → low quality • Goal of our TOSS system: • extend and enhance the semantics of TAX to return high quality answers using ontology and similarity measures
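A short worked example of the quality measure, read here as the geometric mean of recall and precision. The counts are made up for illustration and are not results from the TOSS experiments.

```python
from math import sqrt


def quality(correct_returned, returned, relevant):
    """Return (precision, recall, quality) with quality = sqrt(recall * precision)."""
    if returned == 0 or relevant == 0:
        raise ValueError("precision and recall need non-empty answer and relevant sets")
    precision = correct_returned / returned
    recall = correct_returned / relevant
    return precision, recall, sqrt(recall * precision)


# Hypothetical query: 20 answers returned, all correct, but 80 answers are relevant.
precision, recall, q = quality(correct_returned=20, returned=20, relevant=80)
print(precision, recall, q)
```

With perfect precision but only a quarter of the relevant answers found, quality drops to one half, which is the situation the slide attributes to exact-match TAX queries.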

  22. Our approach • capture inter-term lexical relationships by ontology and integrate ontologies of different DBs • use existing similarity measures to enhance the integrated ontology • TOSS: extend TAX algebra to query with ontology and similarity

  23. Architecture STORY, PARQ

  24. Architecture STORY, PARQ

  25. Ontology • a set S • S = {article, author, title} • a partially ordered set (S, ≤S) • part_of relation ≤S = {(author, article), (title, article), (title, title), (author, author), (article, article)} • a hierarchy (H, ≤H) is the Hasse diagram for (S, ≤S) • a DAG with a minimal set of edges s.t. there is a path from u to v iff u ≤S v • H = {article, author, title} • ≤H = {(author, article), (title, article)}
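The step from the partial order to its Hasse diagram can be sketched as follows: keep a strict pair (u, v) only if no third element lies between u and v. The function name is illustrative, and the second example (adding a `name` element) is an assumption introduced to show an edge being dropped.

```python
def hasse_edges(elements, leq):
    """Return the covering pairs of a partial order.

    leq is a set of pairs (u, v) meaning u <= v; it is assumed to be
    reflexive, antisymmetric, and transitive.
    """
    strict = {(u, v) for (u, v) in leq if u != v}
    return {
        (u, v)
        for (u, v) in strict
        if not any((u, w) in strict and (w, v) in strict for w in elements)
    }


# The part_of order from the slide.
S = {"article", "author", "title"}
leq_S = {
    ("author", "article"), ("title", "article"),
    ("title", "title"), ("author", "author"), ("article", "article"),
}
slide_hierarchy = hasse_edges(S, leq_S)

# Hypothetical extension: name part_of author, hence also name part_of article.
S2 = S | {"name"}
leq_S2 = leq_S | {("name", "name"), ("name", "author"), ("name", "article")}
extended_hierarchy = hasse_edges(S2, leq_S2)
```

In the extended order the pair (name, article) is implied by the path through author, so it does not appear as an edge of the hierarchy.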

  26. Ontology • Suppose Σ is some finite set of strings and S is some set. An ontology w.r.t. Σ is a partial mapping Θ from Σ to hierarchies for S • Σ = {part_of} • Θ(part_of) = (H, ≤H) [Figure: author part_of article, title part_of article]

  27. Ontology Integration SIGMOD DBLP

  28. Ontology Integration SIGMOD DBLP IC (interoperation constraints)

  29. Ontology Integration

  30. Ontology Integration

  31. Ontology Integration

  32. Ontology Integration Hierarchy graph associated with SIGMOD and DBLP

  33. Ontology Integration Fusion of ontologies of SIGMOD and DBLP

  34. Architecture STORY, PARQ

  35. Similarity Enhanced Ontology • A string similarity measure dS is any function which takes two strings X, Y and returns a non-negative real number such that • ∀X, dS(X, X) = 0 • ∀X, Y, dS(X, Y) = dS(Y, X)

  36. Similarity Enhanced Ontology • Any string similarity measure can be used, such as Levenshtein distance, Monge-Elkan distance, Jaro metric, Jaccard similarity, etc. (Cohen et al., "A comparison of string metrics for matching names and records", 1st Workshop on Data Cleaning, Record Linkage and Object Consolidation, 2003) • For example: Levenshtein distance assigns a unit cost to every edit operation. • dS(“relation”, “relational”) = 2
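The Levenshtein example can be checked with the standard dynamic-programming formulation, in which insertions, deletions, and substitutions each cost one unit. This is a generic textbook sketch and not the code used in the TOSS system.

```python
def levenshtein(x, y):
    """Unit-cost edit distance between strings x and y."""
    # previous[j] holds the distance between the processed prefix of x and y[:j].
    previous = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        current = [i]
        for j, cy in enumerate(y, start=1):
            current.append(min(
                previous[j] + 1,               # delete cx
                current[j - 1] + 1,            # insert cy
                previous[j - 1] + (cx != cy),  # substitute, free if the characters match
            ))
        previous = current
    return previous[-1]


print(levenshtein("relation", "relational"))
print(levenshtein("J. Ullman", "J.D. Ullman"))
```

Both calls give 2: each pair differs by exactly two inserted characters, which also illustrates why the author names from the earlier "Problems" slide are close under this measure.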

  37. Similarity Enhanced Ontology • A similarity measure is any function d which takes nodes A, B as input and returns a non-negative real number such that • d(A, B) = min_{X ∈ S, Y ∈ T} dS(X, Y), where dS is a string similarity measure and S, T are the sets of strings contained in nodes A, B • In an integrated ontology, nodes may contain one or more strings, and we want to decide whether two nodes are sufficiently similar. Since the strings in one node (in the original ontology) are equivalent, we take the minimum of the distances over all string pairs from the two nodes.
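The node-level measure can be sketched as a minimum over all string pairs. Any string measure can be plugged in; the bigram-based Jaccard distance below is only a stand-in chosen for brevity, and both function names are hypothetical.

```python
from itertools import product


def bigram_jaccard_distance(x, y):
    """1 - |A ∩ B| / |A ∪ B| over character bigrams (the whole string if too short)."""
    a = {x[i:i + 2] for i in range(len(x) - 1)} or {x}
    b = {y[i:i + 2] for i in range(len(y) - 1)} or {y}
    return 1.0 - len(a & b) / len(a | b)


def node_distance(strings_a, strings_b, string_distance):
    """d(A, B): the smallest string distance over all pairs drawn from the two nodes.

    Both nodes are assumed to hold at least one string.
    """
    return min(string_distance(x, y) for x, y in product(strings_a, strings_b))


# Hypothetical integrated-ontology nodes, each holding equivalent strings.
node_a = {"J. Ullman", "Jeffrey Ullman"}
node_b = {"Jeffrey Ullman", "J.D. Ullman"}
node_c = {"U.S. Army"}

d_ab = node_distance(node_a, node_b, bigram_jaccard_distance)
d_ac = node_distance(node_a, node_c, bigram_jaccard_distance)
```

Because `node_a` and `node_b` share a string, their distance is 0, while `node_c` stays at a positive distance; taking the minimum means one close pair is enough to bring two nodes together.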

  38. Similarity Enhanced Ontology • A string similarity measure dS is strong iff for all strings X, Y, Z, • dS(X, Y) + dS(Y, Z) ≥ dS(X, Z)

  39. Similarity Enhanced Ontology

  40. Similarity Enhanced Ontology • Suppose H is an integrated hierarchy, d is a similarity measure and ε ≥ 0. (H', μ) is a similarity enhancement of H w.r.t. d, ε iff H' is a hierarchy and μ is a function from H to 2^H' such that: • the original partial orderings in H are preserved, and no unwarranted orderings are included • all nodes mapped into the same node are similar to each other (within the threshold ε) • two strings are similar iff they are jointly present in some node in (H', μ) • there is no redundant node whose string set is a subset of that of some other node

  41. Similarity Enhanced Ontology An example ontology Its similarity enhancement

  42. Similarity Enhanced Ontology • (H, d, ε) is similarity consistent iff there exists a similarity enhancement of H w.r.t. d, ε • Theorem • If (H, d, ε) is similarity consistent, then all similarity enhancements of H are equivalent.

  43. Similarity Enhanced Ontology

  44. Architecture STORY, PARQ

  45. SEO Semistructured Instance • A semistructured instance is defined as I = (V, E, t) where t associates a type in T with each attribute (tag and/or content) of each object o in V.

  46. TOSS Algebra • A simple selection condition has the form X op Y • op ∈ {=, ≠, <, ≤, >, ≥, ~, instance_of, isa, part_of, subtype_of, above, below}, and X, Y are terms, i.e., attributes (tag, content), types, or typed values v:τ with v ∈ dom(τ) • A selection condition is a simple selection condition, a conjunction or disjunction of two selection conditions, or a negation of a selection condition

  47. TOSS Algebra • The pattern tree to find the titles of all papers in DBLP related to Microsoft (independently of the field in which Microsoft appears): #1.tag = inproceedings & #2.tag = title & #3.tag part_of inproceedings & #3.content ~ “Microsoft”
