Efficient Data Indexing for Semantic Search: Exploring XSet and Region Algebras in Data Management

Web Data Management Indexes

In this lecture
• Indexes
• XSet
• Region algebras
• Indexes for Arbitrary Semistructured Data
• Dataguides
• T-indexes
• Index Fabric
Resources
• Index Structures for Path Expressions by Milo and Suciu, in ICDT'99
• XSet description: http://www.openhealth.org/XSet/
• Data on the Web, Abiteboul, Buneman, Suciu: section 8.2

The problem • Input: large, irregular data graph • Output: index structure for evaluating regular path expressions

The Data Semistructured data instance = a large graph

The queries
Regular path expressions (using Lorel-like syntax):
Select X From (Bib.*.author).(lastname|firstname).Abiteboul X
Select x from part._*.supplier.name x
• Requires traversing the data from the root and returning all nodes x reachable by a path that matches the given path expression.
Select X From part._*.supplier: {name: X, address: "Philadelphia"}
• Needs an index on values to narrow the search to the parts of the database that contain the string "Philadelphia".

Analyzing the problem • what kind of data • tree data (XML): easier to index • graph data: used in more complex applications • what kind of queries • restricted regular expressions (e.g. XPath): may be more efficient • arbitrary regular expressions: rarely encountered in practice

XSet: a simple index for XML • Part of the Ninja project at Berkeley • Example XML data:

XSet: a simple index for XML Each node = a hashtable Each entry = list of pointers to data nodes (not shown)

XSet: Efficient query evaluation
(R1) SELECT X FROM part.name X - yes
(R2) SELECT X FROM part.supplier.name X - yes
(R3) SELECT X FROM *.supplier.name X - maybe
(R4) SELECT X FROM part.*.subpart.name X - maybe
• To evaluate R1, look up part in the root hash table h1, follow the link to table h2, then look up name.
• R4: following part leads to h2; traverse all index nodes below h2 (corresponding to *), then continue with the path subpart.name. Thus the entire subtree dominated by h2 is explored.
• This is efficient if the index is small and fits in memory.
• R3: the leading wildcard forces every node in the index tree to be considered, which is less efficient than R4.
• Can index the index itself: retrieve all hash tables that contain a supplier entry, then continue a normal search from there.
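The lookups described above can be sketched in a few lines of Python. This is only an illustration of the idea (a tree of hash tables with a list of data nodes per entry), not the XSet implementation; the class `IndexNode`, the functions `build_index`, `evaluate` and `query`, and the sample data node names are all assumptions made for the example.

```python
class IndexNode:
    """One hash table of the index tree (hypothetical structure)."""
    def __init__(self):
        self.children = {}   # label -> IndexNode
        self.extent = []     # data nodes reached by this label path

def build_index(paths):
    """paths: iterable of (label_tuple, data_node_id) pairs."""
    root = IndexNode()
    for labels, node_id in paths:
        node = root
        for label in labels:
            node = node.children.setdefault(label, IndexNode())
        node.extent.append(node_id)
    return root

def evaluate(node, steps):
    """Evaluate a list of labels; '*' matches any sequence of labels."""
    if not steps:
        return set(node.extent)
    head, rest = steps[0], steps[1:]
    if head == "*":
        # Either '*' matches nothing, or it consumes one label and stays active.
        result = evaluate(node, rest)
        for child in node.children.values():
            result |= evaluate(child, steps)
        return result
    child = node.children.get(head)
    return evaluate(child, rest) if child is not None else set()

def query(root, path):
    return evaluate(root, path.split("."))

# Made-up sample data: label path -> data node.
index = build_index([
    (("part",), "p1"),
    (("part", "name"), "n1"),
    (("part", "supplier"), "s1"),
    (("part", "supplier", "name"), "n2"),
    (("part", "subpart"), "sp1"),
    (("part", "subpart", "name"), "n3"),
    (("part", "subpart", "subpart", "name"), "n4"),
])

r1 = query(index, "part.name")            # two hash lookups
r3 = query(index, "*.supplier.name")      # visits the whole index tree
r4 = query(index, "part.*.subpart.name")  # visits only the subtree below part
```

R1 touches two hash tables, while R3 and R4 recurse through every index node under the point where the wildcard starts, which matches the "yes" versus "maybe" ratings on the slide.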

Region Algebras
• structured text = text with tags (like XML)
• powerful indexing techniques [Baeza-Yates, Gonnet, Navarro, Salminen, Tompa, etc.]
• New Oxford English Dictionary
• critical limitation: ordered data only (like text)
• Assume: data given as an XML text file, with the ordering implicit in the file.
• less critical limitation: restricted regular expressions

Region Algebras: Definitions
• data = sequence of characters [c1 c2 c3 …]
• region = segment of the text in a file
• representation: (x, y) = [c_x, c_{x+1}, …, c_y], where x is the start position and y the end position of the region
• example: <section> … </section>
• region set = a set of regions s.t. any two regions are either disjoint or one is included in the other
• example: all <section> regions (may be nested)
• Tree data: each node defines a region, and each set of nodes defines a region set.
• example: region p2 consists of the text under p2; the set {p2, s2, s1} is a region set with three regions
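The region-set condition (any two regions are disjoint or nested) can be checked directly. The sketch below assumes a region is a `(start, end)` tuple of character positions with inclusive ends; the function name `is_region_set` and the sample positions are made up for illustration.

```python
def is_region_set(regions):
    """True if every pair of regions is either disjoint or nested."""
    rs = sorted(regions)
    for i, (a1, b1) in enumerate(rs):
        for a2, b2 in rs[i + 1:]:
            disjoint = b1 < a2 or b2 < a1
            nested = (a1 <= a2 and b2 <= b1) or (a2 <= a1 and b1 <= b2)
            if not (disjoint or nested):
                return False
    return True

# Hypothetical <section> regions: one outer section with nested ones.
sections = {(0, 99), (10, 40), (50, 90), (55, 60)}
# Two partially overlapping segments cannot come from well-nested tags.
overlapping = {(0, 10), (5, 15)}
```

Regions produced by well-formed tags always pass this check, because tags cannot partially overlap.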

Representation of a region set
• Example: the <subpart> region set:
• region algebra = operators on region sets; s1 op s2 defines a new region set

Region algebra: some operators
• s1 intersect s2 = {r | r ∈ s1, r ∈ s2}
• s1 included s2 = {r | r ∈ s1, ∃r′ ∈ s2, r ⊆ r′}
• s1 including s2 = {r | r ∈ s1, ∃r′ ∈ s2, r ⊇ r′}
• s1 parent s2 = {r | r ∈ s1, ∃r′ ∈ s2, r is a parent of r′}
• s1 child s2 = {r | r ∈ s1, ∃r′ ∈ s2, r is a child of r′}
Examples:
<subpart> included <part> = {s1, s2, s3, s5}
<part> including <subpart> = {p2, p3}
<name> child <part> = {n1, n3, n12}
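A direct, unoptimized reading of these operators in Python may help make the definitions concrete. It assumes regions are `(start, end)` tuples and that parent/child is decided relative to a `universe` holding all regions of the document; that extra argument, the function names, and the sample positions are assumptions for this sketch, not part of the slides.

```python
def contains(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def intersect(s1, s2):
    return s1 & s2

def included(s1, s2):
    return {r for r in s1 if any(contains(q, r) for q in s2)}

def including(s1, s2):
    return {r for r in s1 if any(contains(r, q) for q in s2)}

def is_parent(r, q, universe):
    """r is a parent of q: r contains q and no other region lies in between."""
    if r == q or not contains(r, q):
        return False
    return not any(
        t not in (r, q) and contains(r, t) and contains(t, q) for t in universe
    )

def parent(s1, s2, universe):
    return {r for r in s1 if any(is_parent(r, q, universe) for q in s2)}

def child(s1, s2, universe):
    return {r for r in s1 if any(is_parent(q, r, universe) for q in s2)}

# Made-up positions: two parts, one subpart inside the first part,
# and three names (one of them inside the subpart).
parts = {(0, 100), (110, 150)}
subparts = {(10, 40)}
names = {(2, 8), (12, 20), (112, 120)}
universe = parts | subparts | names

names_of_parts = child(names, parts, universe)
```

Here `(12, 20)` is excluded from `names_of_parts` because the subpart `(10, 40)` sits between it and the enclosing part.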

Efficient computation of Region Algebra Operators
Example: s1 included s2
s1 = {(x1,x1'), (x2,x2'), …}
s2 = {(y1,y1'), (y2,y2'), …}
(i.e. assume each set consists of disjoint regions, listed in increasing order of position)
Algorithm: start with i = j = 1 and repeat until either set is exhausted:
• if xi < yj then i := i + 1
• else if xi' > yj' then j := j + 1
• otherwise: print (xi,xi'), do i := i + 1
Can be done in sub-linear time when one of the region sets is very small
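The merge-style scan above translates almost line for line into Python. The sketch assumes both inputs are lists of `(start, end)` tuples that are pairwise disjoint and sorted by start position; the function name `included_merge` and the sample regions are illustrative.

```python
def included_merge(s1, s2):
    """Regions of s1 that lie inside some region of s2 (single linear pass)."""
    result = []
    i = j = 0
    while i < len(s1) and j < len(s2):
        x, x_end = s1[i]
        y, y_end = s2[j]
        if x < y:
            # s1[i] starts before s2[j]; it cannot fit in it or in any later one.
            i += 1
        elif x_end > y_end:
            # s1[i] ends after s2[j]; no later s1 region can fit in s2[j] either.
            j += 1
        else:
            result.append(s1[i])
            i += 1
    return result

s1 = [(1, 3), (5, 8), (12, 14), (20, 30)]
s2 = [(4, 10), (11, 15), (18, 25)]
inside = included_merge(s1, s2)
```

Each step advances `i` or `j`, so the scan takes at most len(s1) + len(s2) iterations.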

From path expressions to region expressions
• Use region algebra operators to answer regular path expressions.
• Only restricted forms of regular path expressions can be translated into region algebra operators:
• expressions of the form R1.R2…Rn, where each Ri is either a label constant or the Kleene closure *.
Path expression → region expression:
part.name → name child (part child root)
part.supplier.name → name child (supplier child (part child root))
*.supplier.name → name child supplier
part.*.subpart.name → name child (subpart included (part child root))
• Region expressions correspond to simple XPath expressions
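The four translations above follow a simple left-to-right pattern: a label after a label becomes `child`, a label after `*` becomes `included`, and a leading `*` drops the reference to root. The function below is a sketch of that pattern only; the name `to_region_expr` is made up, and paths ending in `*` are rejected because the slide does not show how to translate them.

```python
def to_region_expr(path):
    """Translate R1.R2...Rn (labels or '*') into a region algebra expression."""
    steps = path.split(".")
    if steps[-1] == "*":
        raise ValueError("a trailing '*' is not covered by this sketch")
    expr = "root"
    after_star = False
    for step in steps:
        if step == "*":
            after_star = True
            continue
        if after_star and expr == "root":
            # Every region is included in the root, so the constraint vanishes.
            expr = step
        else:
            op = "included" if after_star else "child"
            operand = f"({expr})" if " " in expr else expr
            expr = f"{step} {op} {operand}"
        after_star = False
    return expr

translated = {
    p: to_region_expr(p)
    for p in ["part.name", "part.supplier.name",
              "*.supplier.name", "part.*.subpart.name"]
}
```

The generated strings are meant to be read against the table above, not fed to a real region algebra engine.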

From path expressions to region expressions
• Answering more complex queries:
Select X From *.subpart: {name: X, *.supplier.address: "Philadelphia"}
• Translates into the following region algebra expression:
name child (subpart including (supplier parent (address intersect "Philadelphia")))
• "Philadelphia" denotes a region set consisting of all regions corresponding to the word "Philadelphia" in the text.
• Such a region set can be computed dynamically using a full-text index.
• Region expressions correspond to simple XPath expressions

Indexes for Arbitrary Semistructured Data • A semistructured data instance that is a DAG

Indexes for Arbitrary Semistructured Data
• The data represents employees and projects in a company.
• Two kinds of employees: programmers and statisticians
• Three kinds of links to projects: leads, workson, consultants
• Index graph: a reduced graph that summarizes all paths from the root in the data graph
• Example: node p1 is reached from the root by paths labeled with the following five sequences:
project
employee.leads
employee.workson
programmer.employee.leads
programmer.employee.workson
• Node p2: the paths from the root to p2 are labeled by the same five sequences
• p1 and p2 are language-equivalent

Indexes for Arbitrary Semistructured Data
• For each node x in the data graph G, Lx = {w | ∃ a path from the root to x labeled w}
• ∀x, y: x ≡ y ⇔ Lx = Ly
• [x] = {y | x ≡ y}
• The index I:
Nodes(I) = {[x] | x ∈ Nodes(G)}
Edges(I) = {[x] -a-> [y] | x′ ∈ [x], y′ ∈ [y], (x′ -a-> y′) ∈ Edges(G)}

Indexes for Arbitrary Semistructured Data
• We have the following equivalences:
e1 ≡ e2
e3 ≡ e4 ≡ e5
p1 ≡ p2
p3 ≡ p4
p5 ≡ p6 ≡ p7

Indexes for Arbitrary Semistructured Data
• Computing path expression queries
• Compute the query on I and obtain a set of index nodes
• Compute the union of all their extents
• Example: Select X From statistician.employee.(leads|consults): X
• Returns index nodes h8, h9.
• Their extents are [p5, p6, p7] and [p8], respectively;
• result set = [p5, p6, p7, p8]
• Always: size(I) ≤ size(G)
• Efficient when I can be stored in main memory
• Checking x ≡ y is expensive.
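The two-step evaluation (run the query on I, then union the extents) can be sketched as follows. Only the extents of h8 and h9 come from the slide; the rest of the index fragment (node names h1, h3, h5 and their edges) is a guess made for illustration, as are the function and variable names.

```python
def evaluate_on_index(index_edges, extents, root, steps):
    """steps: list of label sets, one per path step (alternatives per step)."""
    frontier = {root}
    for alternatives in steps:
        frontier = {
            dst
            for src in frontier
            for label, dst in index_edges.get(src, [])
            if label in alternatives
        }
    result = set()
    for node in frontier:
        result |= set(extents.get(node, []))
    return frontier, result

# Hypothetical fragment of the index graph I.
index_edges = {
    "h1": [("statistician", "h3")],
    "h3": [("employee", "h5")],
    "h5": [("leads", "h8"), ("consults", "h9")],
}
extents = {"h8": ["p5", "p6", "p7"], "h9": ["p8"]}

# statistician.employee.(leads|consults)
steps = [{"statistician"}, {"employee"}, {"leads", "consults"}]
index_nodes, answer = evaluate_on_index(index_edges, extents, "h1", steps)
```

The traversal only touches index nodes; data nodes appear in the final union of extents, which is why the approach pays off when I fits in main memory.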

Indexes for Arbitrary Semistructured Data
• Use bisimulation ≈b instead of ≡
• Fact: ∀x, y: x ≈b y ⇒ x ≡ y
• Use the same construction, but [u] now refers to ≈b instead of ≡.
Bisimulation: Let DB be a data graph. A relation ≈ is a bisimulation on the reversed graph (i.e. all edges have their direction reversed) if the following conditions hold:
1. If x ≈ y and x is a root, then so is y.
2. Conversely, if x ≈ y and y is a root, then so is x.
3. If x ≈ y, then for any edge x′ -a-> x there exists an edge y′ -a-> y, s.t. x′ ≈ y′.
4. Conversely, if x ≈ y, then for any edge y′ -a-> y there exists an edge x′ -a-> x, s.t. x′ ≈ y′.
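One simple way to compute the coarsest bisimulation on the reversed graph is naive partition refinement: start with all nodes in one block and keep splitting blocks by (is root, set of incoming label and source-block pairs) until nothing changes. The sketch below assumes a single root and edges given as `(source, label, target)` triples; it is not the efficient algorithm from the literature, and the function name and the small graph are made up for the example.

```python
def backward_bisimulation(nodes, edges, root):
    """Partition nodes into classes of the coarsest backward bisimulation."""
    incoming = {x: [] for x in nodes}
    for src, label, dst in edges:
        incoming[dst].append((label, src))

    block = {x: 0 for x in nodes}   # start with a single block
    num_blocks = 1
    while True:
        ids = {}
        new_block = {}
        for x in nodes:
            # Signature: old block, root flag, and labeled predecessor blocks.
            sig = (
                block[x],
                x == root,
                frozenset((a, block[p]) for a, p in incoming[x]),
            )
            new_block[x] = ids.setdefault(sig, len(ids))
        block = new_block
        if len(ids) == num_blocks:   # no block was split: stable
            break
        num_blocks = len(ids)

    classes = {}
    for x, b in block.items():
        classes.setdefault(b, set()).add(x)
    return {frozenset(c) for c in classes.values()}

# Made-up data graph.
nodes = ["root", "e1", "e2", "p1", "p2", "p3"]
edges = [
    ("root", "employee", "e1"),
    ("root", "employee", "e2"),
    ("e1", "leads", "p1"),
    ("e2", "leads", "p2"),
    ("e2", "workson", "p3"),
]
partition = backward_bisimulation(nodes, edges, "root")
```

The resulting classes play the role of [u] in the index construction; in this example e1 and e2 collapse into one index node, as do p1 and p2.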

DataGuides • Goldman & Widom [VLDB 97] • graph data • arbitrary regular expressions

DataGuides Definition given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.: - every path in DB also occurs in G - every path in G occurs in DB - every path in G is unique

Dataguides Example:

DataGuides • Multiple DataGuides for the same data:

DataGuides
Definition: Let w, w′ be two words (i.e. word queries) and G a graph. Then w ≡G w′ if w(G) = w′(G).
Definition: G is a strong dataguide for a database DB if ≡G is the same as ≡DB.

DataGuides
Example:
• G1 is a strong dataguide
• G2 is not strong: ≡DB and ≡G2 disagree on the pair of words person.project and dept.project

DataGuides
• Constructing the strong DataGuide G:
Nodes(G) = {{root}}
Edges(G) = ∅
while changes do
  choose s in Nodes(G), a in Labels
  let s′ = {y | x in s, (x -a-> y) in Edges(DB)}; skip if s′ is empty
  add s′ to Nodes(G)
  add (s -a-> s′) to Edges(G)
• Use a hash table for Nodes(G)
• This is precisely the powerset automaton construction.
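The construction can be sketched with a worklist instead of "while changes", using frozensets of data nodes as DataGuide nodes so that a Python set serves as the hash table for Nodes(G). The function name, the edge-triple input format, and the sample database are assumptions for this example.

```python
from collections import deque

def strong_dataguide(edges, root):
    """edges: (source, label, target) triples of DB. Returns (nodes, edges) of G."""
    out = {}
    for src, label, dst in edges:
        out.setdefault(src, {}).setdefault(label, set()).add(dst)

    start = frozenset([root])
    guide_nodes = {start}            # hash table of target sets seen so far
    guide_edges = set()
    queue = deque([start])
    while queue:
        s = queue.popleft()
        # Group the successors of all data nodes in s by label.
        targets = {}
        for x in s:
            for label, ys in out.get(x, {}).items():
                targets.setdefault(label, set()).update(ys)
        for label, ys in targets.items():
            t = frozenset(ys)
            guide_edges.add((s, label, t))
            if t not in guide_nodes:
                guide_nodes.add(t)
                queue.append(t)
    return guide_nodes, guide_edges

# Made-up database: two persons and one department, each with a name.
db_edges = [
    ("root", "person", "a"),
    ("root", "person", "b"),
    ("root", "dept", "d"),
    ("a", "name", "n1"),
    ("b", "name", "n2"),
    ("d", "name", "n3"),
]
g_nodes, g_edges = strong_dataguide(db_edges, "root")
```

Each DataGuide node is the target set of every label path leading to it, so the two person nodes collapse into the single node {a, b}.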

DataGuides
• How large are the dataguides?
• if DB is a tree, then size(G) <= size(DB)
• why? answer: every node is in exactly one extent of G
• here: dataguide = XSet
• How many nodes does the strong dataguide have for this DB? 20 nodes (least common multiple of 4 and 5)
• Dataguides usually fail on data with cyclic schemas, like:
