
XML Indexing Structure

by Soumia Elghani

& Hanaa Talei

CSC5370


Table of Contents

  • Introduction

  • Motivation

  • Full Text Indexing

  • Graphs

  • Natix

  • Sphinx

  • Lore System

  • Index Fabric



Motivation

Web

Billions of documents

Finding a document becomes nearly impossible

Need for efficient indexing techniques


Full Text Indexing

A full-text index provides standard retrieval of all text objects. Two common structures:

  • B+ tree.

  • Inverted list.


B+ Tree

  • It is the most widely used of several index structures that maintain their efficiency despite insertions and deletions.

  • B+ Tree is a dynamic structure

    • Insertions and deletions leave the tree height-balanced

    • Almost always better than maintaining a sorted file

    • Rebalancing relies on splitting and merging nodes, and on redistributing (rotating) keys between sibling nodes

    • Most widely used index in database management systems


How it works

[Figures: the original B+ tree, followed by the tree after inserting 28, then 70, then 95.]
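The insertion steps can be illustrated with a minimal sketch, assuming a tree whose nodes hold at most `order` keys. The class names, the `order` parameter, and the starting keys are hypothetical and do not reproduce the tree drawn on the slides; only the inserted keys 28, 70, and 95 come from the slides. A leaf that overflows is split and its first right-hand key is copied up; an inner node that overflows pushes its middle key up.

```python
import bisect

class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []
        self.children = []   # child nodes (inner nodes only)
        self.next = None     # next leaf in key order (leaves only)

class BPlusTree:
    def __init__(self, order=4):
        self.order = order   # assumed maximum number of keys per node
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split:            # the root itself overflowed: grow one level
            sep, right = split
            new_root = Node(leaf=False)
            new_root.keys = [sep]
            new_root.children = [self.root, right]
            self.root = new_root

    def _insert(self, node, key):
        if node.leaf:
            bisect.insort(node.keys, key)
        else:
            i = bisect.bisect_right(node.keys, key)
            split = self._insert(node.children[i], key)
            if split:        # child was split: add separator and new child
                sep, right = split
                node.keys.insert(i, sep)
                node.children.insert(i + 1, right)
        return self._split(node) if len(node.keys) > self.order else None

    def _split(self, node):
        mid = len(node.keys) // 2
        right = Node(leaf=node.leaf)
        if node.leaf:
            right.keys, node.keys = node.keys[mid:], node.keys[:mid]
            right.next, node.next = node.next, right
            return right.keys[0], right          # separator is copied up
        sep = node.keys[mid]
        right.keys = node.keys[mid + 1:]
        right.children = node.children[mid + 1:]
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return sep, right                        # separator is pushed up

    def height(self):
        h, node = 1, self.root
        while not node.leaf:
            node, h = node.children[0], h + 1
        return h

    def leaves(self):
        node = self.root
        while not node.leaf:
            node = node.children[0]
        keys = []
        while node:
            keys.extend(node.keys)
            node = node.next
        return keys

tree = BPlusTree(order=4)
for k in [10, 20, 30, 40, 50, 60, 28, 70, 95]:
    tree.insert(k)
print(tree.leaves())
```

Because every split happens on the way back up from a leaf, all leaves stay at the same depth, which is the height-balance property listed above.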


Inverted List

An inverted list stores values from the database (such as words) as keys, each mapped to the objects that contain it, so data content can be searched quickly.
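A minimal sketch of the idea, assuming documents are plain strings identified by integer ids (the function name and the sample documents are hypothetical):

```python
from collections import defaultdict

def build_inverted_list(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {1: "XML index structures", 2: "XML graph model", 3: "graph index"}
index = build_inverted_list(docs)
print(sorted(index["xml"]))    # ids of documents containing "xml"
print(sorted(index["index"]))  # ids of documents containing "index"
```

A lookup is then a single dictionary access instead of a scan over every document.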


Graphs

  • Just as data can be represented as a tree, it can also be represented as a graph.


More details

[Figure: example data graph with the labels Employees, Programmers, Statisticians, Leads, Workson, Consults, and Projects.]


Problem, solution

  • P: The data graph contains many links, which need to be reduced

  • S: An index graph: a reduced graph that summarizes all the paths from the root.

! Important

Language equivalence: two nodes are equivalent when exactly the same label paths lead to them from the root.

Project

Employee.leads

Employee.workson

Programmer.employee.leads

Programmer.employee.workson

The same applies to p2


Implementing an index

  • Each node is a hash table containing one entry for each label at that node. Each index node has an extent: a list of pointers to all data nodes in the corresponding class.

    e.g. the extent of the node h4 is the list [e1, e2]

    To answer a query, we evaluate it on the index to obtain a set of index nodes, and then compute the union of their extents.



Example

  • Select x from statistician.employee.(leads|consults):x

  • This query returns the index nodes h8 and h9; their extents are [p5, p6, p7] and [p8], so the result of the query is their union: [p5, p6, p7, p8]

Results:

Simplified form of the DAG

Efficient when the index can be stored in main memory
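The evaluation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `IndexNode` class and the intermediate node names `root`, `h_stat`, and `h_emp` are hypothetical, while h8, h9, and their extents are taken from the example. A path is given as a list of steps, each step being the set of alternative labels allowed at that position.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class IndexNode:
    name: str
    extent: list = field(default_factory=list)   # pointers to data nodes
    edges: dict = field(default_factory=dict)    # label -> list of IndexNode

def evaluate(root, path):
    """Follow the path on the index graph, then union the extents."""
    frontier = [root]
    for alternatives in path:
        reached = []
        for node in frontier:
            for label in alternatives:
                for child in node.edges.get(label, []):
                    if child not in reached:
                        reached.append(child)
        frontier = reached
    result = []
    for node in frontier:
        for obj in node.extent:
            if obj not in result:
                result.append(obj)
    return [n.name for n in frontier], result

# Hypothetical fragment of an index graph around the nodes h8 and h9.
h8 = IndexNode("h8", extent=["p5", "p6", "p7"])
h9 = IndexNode("h9", extent=["p8"])
h_emp = IndexNode("h_emp", edges={"leads": [h8], "consults": [h9]})
h_stat = IndexNode("h_stat", edges={"employee": [h_emp]})
root = IndexNode("root", edges={"statistician": [h_stat]})

query = [{"statistician"}, {"employee"}, {"leads", "consults"}]
nodes, objects = evaluate(root, query)
print(sorted(nodes), objects)
```

Only the small index graph is traversed; the data nodes are touched once, when the extents are read.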


Natix

  • An efficient, native repository for storing, retrieving and managing large tree-structured objects, preferably XML documents

  • It is based on a tree-splitting algorithm

  • It dynamically maintains physical records, each smaller than a page, that contain sets of connected tree nodes.

  • It is similar to the hybrid system, but with some extensions


Natix Architecture

  • Record manager (RM): provides storage space divided into segments (collections of equal-size pages); each page holds one or more records.

  • Tree storage manager: operates on top of the RM; it maps the tree used to model the document (topic) onto records


Natix Architecture (cont.)

  • Index management

  • Query engine

  • Schema manager: takes care of the DTD

  • Document manager: validates documents against the schema and makes the necessary index updates

But they are not implemented yet


Physical Model

In order to store the logical tree, there are two important aspects of the physical nodes to consider:

  • Object content

  • Large trees


1. Object content

  • The classification is based on the content of the node:

  • Aggregate: inner nodes of the tree; they contain their respective child nodes.

  • Literal: leaf nodes containing a stream of bytes

  • Proxy: nodes which point to different records (they are used in the representation of large trees)


2. Large Trees

Large trees are split into subtrees, and each subtree is stored in a single record.

[Figure labels: scaffolding, objects]
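The three node kinds and the idea of moving subtrees into their own records can be sketched as below. This is a deliberately naive illustration, not the Natix split algorithm: the size measure, the `capacity` parameter, and the "move the largest child first" rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Literal:
    data: bytes                      # leaf: a stream of bytes

@dataclass
class Proxy:
    record_id: int                   # points to the record holding a subtree

@dataclass
class Aggregate:
    tag: str
    children: list = field(default_factory=list)

def size(node):
    """Assumed cost model: one unit per node plus the literal bytes."""
    if isinstance(node, Literal):
        return len(node.data)
    if isinstance(node, Proxy):
        return 1
    return 1 + sum(size(child) for child in node.children)

def split(node, records, capacity):
    """Shrink a subtree bottom-up by moving children into their own records."""
    if not isinstance(node, Aggregate):
        return node                  # note: an oversized literal is left as is
    node.children = [split(c, records, capacity) for c in node.children]
    while size(node) > capacity:
        movable = [i for i, c in enumerate(node.children)
                   if not isinstance(c, Proxy)]
        if not movable:
            break                    # only proxies left, cannot shrink further
        i = max(movable, key=lambda j: size(node.children[j]))
        records.append(node.children[i])
        node.children[i] = Proxy(len(records) - 1)
    return node

def store_tree(root, capacity):
    records = []
    records.append(split(root, records, capacity))
    return records                   # the last record holds the root

doc = Aggregate("doc", [Aggregate("chapter", [Literal(b"x" * 40)])
                        for _ in range(3)])
records = store_tree(doc, capacity=64)
print(len(records), [size(r) for r in records])
```

Each record ends up no larger than the assumed capacity, and the root record reaches the moved subtrees through proxies.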




Sphinx

  • Schema-conscious Path-Hierarchy Indexing of XML.

  • Uses the DTD to speed up the search process.

  • XML document → Document Graph.

  • DTD → Schema Graph.




Lore System

  • DBMS designed for semistructured data

  • Uses the OEM (Object Exchange Model) graph, a labeled directed graph.

  • Vertices are objects

  • Each object has a unique object identifier (e.g. &19)



Indexes in Lore

  • To identify objects with specific values:

    • Value Index

    • Text Index

  • To traverse DB graph:

    • Link Index

    • Path Index


Value Index (Vindex)

  • Implemented as B+trees

  • Takes a label ‘l’, a comparator ‘c’, and a value ‘v’

  • Returns all atomic objects having:

    • an incoming edge with the given label

    • a value satisfying the given operator and value

  • e.g. l=Price c=‘>’ v= 15.00

    result= {&11, &15}.
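A minimal sketch of the Vindex interface, assuming a sorted list with binary search as a stand-in for the B+ tree. The `ValueIndex` class and the sample prices are hypothetical; only the query (Price > 15.00) and its result come from the example above.

```python
import bisect

class ValueIndex:
    def __init__(self):
        self._by_label = {}          # label -> sorted list of (value, oid)

    def add(self, label, value, oid):
        bisect.insort(self._by_label.setdefault(label, []), (value, oid))

    def lookup(self, label, comparator, value):
        entries = self._by_label.get(label, [])
        values = [v for v, _ in entries]
        lo = bisect.bisect_left(values, value)    # first entry >= value
        hi = bisect.bisect_right(values, value)   # first entry > value
        ranges = {
            "<": (0, lo), "<=": (0, hi), "=": (lo, hi),
            ">=": (lo, len(entries)), ">": (hi, len(entries)),
        }
        start, end = ranges[comparator]
        return {oid for _, oid in entries[start:end]}

vindex = ValueIndex()
vindex.add("Price", 20.00, "&11")    # hypothetical atomic objects
vindex.add("Price", 18.50, "&15")
vindex.add("Price", 9.99, "&7")
print(sorted(vindex.lookup("Price", ">", 15.00)))
```

Keeping the entries sorted by value is what lets a range comparison be answered as one contiguous slice, which is the same reason a B+ tree suits this index.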


Text Index (Tindex)

  • Implemented using inverted lists.

  • Maps a given word ‘w’ and label ‘l’ to a list of atomic values with incoming edge ‘l’ that contain word ‘w’.

  • Label can be omitted for a full search.

  • Returns a list of postings (o,n) indicating that ‘w’ appears in object ‘o’ as the nth word in the value.

  • e.g. w=“Ford” l= Name

    result = {(&17,2),(&21,2)}
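A minimal sketch of the Tindex interface built on an inverted list. The `TextIndex` class and the sample values are hypothetical; they are chosen so that the query from the example (w = "Ford", l = Name) returns the postings shown above, with word positions counted from 1.

```python
from collections import defaultdict

class TextIndex:
    def __init__(self):
        self._postings = defaultdict(list)   # word -> [(label, oid, position)]

    def add(self, label, oid, text):
        for n, word in enumerate(text.split(), start=1):
            self._postings[word].append((label, oid, n))

    def lookup(self, word, label=None):
        """Return postings (oid, n); omit the label for a full search."""
        return {(oid, n)
                for entry_label, oid, n in self._postings.get(word, [])
                if label is None or entry_label == label}

tindex = TextIndex()
tindex.add("Name", "&17", "Harrison Ford")   # hypothetical values
tindex.add("Name", "&21", "Glenn Ford")
tindex.add("Title", "&30", "Ford Fairlane")
print(sorted(tindex.lookup("Ford", "Name")))
print(sorted(tindex.lookup("Ford")))
```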


Link Index (Lindex)

  • Implemented using linear hashing

  • Used to retrieve the parents of an object

  • Takes a child object ‘c’ and a label ‘l’

  • Returns all parents ‘p’ such that there is an l-labeled edge from p to c.

  • If the label is omitted, lindex returns all parents and their labels

  • Useful because there are no inverse pointers in OEM graphs.
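A minimal sketch of the Lindex interface, assuming a Python dictionary as a stand-in for the linear hash table. The `LinkIndex` class, the object identifiers, and the labels are hypothetical.

```python
from collections import defaultdict

class LinkIndex:
    def __init__(self):
        self._parents = defaultdict(set)     # child oid -> {(label, parent oid)}

    def add_edge(self, parent, label, child):
        self._parents[child].add((label, parent))

    def lookup(self, child, label=None):
        """Parents reaching `child` over `label`; all (parent, label) pairs if omitted."""
        entries = self._parents.get(child, set())
        if label is None:
            return {(parent, entry_label) for entry_label, parent in entries}
        return {parent for entry_label, parent in entries if entry_label == label}

lindex = LinkIndex()
lindex.add_edge("&2", "Actor", "&17")        # hypothetical OEM edges
lindex.add_edge("&6", "Actor", "&17")
lindex.add_edge("&6", "Director", "&17")
print(sorted(lindex.lookup("&17", "Actor")))
print(sorted(lindex.lookup("&17")))
```

The index is populated while edges are stored in the forward direction, which supplies the inverse pointers that the OEM graph itself lacks.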


Path Index (Pindex)

  • Takes a given object ‘o’ (e.g. root) and a path ‘p’

  • Returns the set of objects reachable from ‘o’ following path ‘p’.

  • e.g. “select DB.Movie.Title”

    result = {&5,&9,&14}
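What a Pindex lookup returns can be sketched as a plain traversal of the OEM graph. Note the assumption: a real path index precomputes these results, whereas this sketch walks the graph on each call. The function name, the graph layout, and every identifier other than &5, &9, and &14 are hypothetical; the root object &1 plays the role of DB.

```python
def path_lookup(graph, start, path):
    """Return the objects reachable from `start` by following the labels in `path`."""
    current = {start}
    for label in path:
        current = {child
                   for oid in current
                   for edge_label, child in graph.get(oid, [])
                   if edge_label == label}
    return current

# Hypothetical OEM graph: oid -> list of (label, child oid).
graph = {
    "&1": [("Movie", "&2"), ("Movie", "&6"), ("Movie", "&10")],
    "&2": [("Title", "&5"), ("Year", "&3")],
    "&6": [("Title", "&9")],
    "&10": [("Title", "&14")],
}
print(sorted(path_lookup(graph, "&1", ["Movie", "Title"])))
```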


Index Fabric

  • Optimizes searches over semi-structured databases

  • Based on Patricia tries

  • Assigns a designator to each tag in the XML document.

  • A designator dictionary is used to interpret the designators
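The designator idea can be sketched as follows, assuming one capital letter per tag (so at most 26 distinct tags) and keys formed by concatenating the designators along a root-to-leaf path with the leaf text. The function name, the letter assignment, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def encode_paths(xml_text):
    """Return (keys, dictionary): one string key per text leaf, plus tag -> designator."""
    dictionary = {}
    keys = []

    def designator(tag):
        if tag not in dictionary:
            dictionary[tag] = chr(ord("A") + len(dictionary))
        return dictionary[tag]

    def walk(element, prefix):
        prefix += designator(element.tag)
        text = (element.text or "").strip()
        if text:
            keys.append(prefix + text)
        for child in element:
            walk(child, prefix)

    walk(ET.fromstring(xml_text), "")
    return keys, dictionary

document = """
<invoice>
  <buyer><name>Zenith Corp</name></buyer>
  <seller><name>Acme Inc</name></seller>
</invoice>
"""
keys, dictionary = encode_paths(document)
print(keys)
print(dictionary)
```

The resulting keys are ordinary strings, so they can be inserted into a string index such as a Patricia trie, and the dictionary translates a designator prefix back into a tag path.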


Patricia Tries

PATRICIA: Practical Algorithm To Retrieve Information Coded In Alphanumeric

Nodes are labelled with their depth
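A minimal sketch of a Patricia-style trie that branches on characters, where each inner node is labelled with its depth, that is, the character position it tests. This is an illustration under assumptions, not the Index Fabric implementation: the class names are hypothetical, keys are assumed not to contain the terminator character, and deletion is omitted.

```python
END = "\0"   # terminator, so that no stored key is a prefix of another

class Leaf:
    def __init__(self, key):
        self.key = key

class Inner:
    def __init__(self, depth):
        self.depth = depth       # character position tested at this node
        self.children = {}       # character at `depth` -> subtree

def _char(key, depth):
    return key[depth] if depth < len(key) else ""

class PatriciaTrie:
    def __init__(self):
        self.root = None

    def _closest_leaf(self, key):
        node = self.root
        while isinstance(node, Inner):
            child = node.children.get(_char(key, node.depth))
            # No matching branch: any leaf below still shares the tested prefix.
            node = child if child is not None else next(iter(node.children.values()))
        return node

    def contains(self, key):
        if self.root is None:
            return False
        return self._closest_leaf(key + END).key == key + END

    def insert(self, key):
        key += END
        if self.root is None:
            self.root = Leaf(key)
            return
        other = self._closest_leaf(key).key
        if other == key:
            return               # already stored
        # First position at which the new key differs from the closest one.
        p = next(i for i, (a, b) in enumerate(zip(key, other)) if a != b)
        parent, node = None, self.root
        while isinstance(node, Inner) and node.depth < p:
            parent, node = node, node.children[key[node.depth]]
        if isinstance(node, Inner) and node.depth == p:
            node.children[key[p]] = Leaf(key)      # new branch at an existing node
            return
        branch = Inner(p)                          # new node labelled with depth p
        branch.children[other[p]] = node
        branch.children[key[p]] = Leaf(key)
        if parent is None:
            self.root = branch
        else:
            parent.children[key[parent.depth]] = branch

trie = PatriciaTrie()
for word in ["romane", "romanus", "romulus", "rubens"]:
    trie.insert(word)
print(trie.root.depth, trie.contains("romulus"), trie.contains("rom"))
```

Only positions where stored keys actually differ get a node, so a lookup inspects a few characters and then confirms the match with one full comparison at the leaf.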





Conclusion

  • A number of indexing techniques

  • Different approaches

  • Under construction (e.g. Natix)

  • Still developing and improving


References

  • Graphs: S. Abiteboul, P. Buneman, D. Suciu. “Data on the Web: From Relations to Semistructured Data and XML”. Morgan Kaufmann, 2000.

  • Natix: C.-C. Kanne, G. Moerkotte. “Efficient storage of XML data”. Proc. of ICDE, California, USA, page 198, 2000. http://citeseer.nj.nec.com/kanne99efficient.html

  • Sphinx: L. K. Poola and J. R. Haritsa. “SphinX: Schema-conscious XML Indexing”. Indian Institute of Science, 2001. http://citeseer.nj.nec.com/poola01sphinx.html


  • Lore: J. McHugh, J. Widom, S. Abiteboul, Q. Luo, and A. Rajaraman. “Indexing semistructured data”. Technical report, Stanford University, Computer Science Department, 1998. http://citeseer.nj.nec.com/mchugh98indexing.html

  • Index Fabric: B. Cooper, N. Sample, M. J. Franklin, G. R. Hjaltason, and M. Shadmon. “A fast index for semistructured data”. In Proceedings of VLDB, 2001. http://citeseer.nj.nec.com/cooper01fast.html


