Xyleme. A Dynamic Warehouse for XML Data of the Web. Motivation. Efficient storage for huge quantities of XML data. Query processing. Data acquisition strategies to build the repository. Change control with services such as query subscription. Semantic data integration. Architecture.

PowerPoint Slideshow about 'Xyleme' - naiya


Presentation Transcript
Xyleme
  • A Dynamic Warehouse for XML Data of the Web
Motivation
  • Efficient storage for huge quantities of XML data.
  • Query processing.
  • Data acquisition strategies to build the repository.
  • Change control with services such as query subscription.
  • Semantic data integration.
Architecture
  • Xyleme is functionally organized in four levels:
    • Physical level (the Natix repository).
    • Logical level (data acquisition and query processing).
    • Application level (change management and semantic data integration).
    • Interface level (interface with the web and interface with the Xyleme clients).
The Natix Repository
  • Xyleme requires efficient, updatable storage for XML data.
  • The existing approaches can be divided into two categories:
    • Flat streams
    • Metamodeling
  • Natix uses a hybrid approach.
Natix Repository

[Figure: logical tree with nodes f1 to f7]
  • Instead of storing each tree node in a separate record, whole documents (or subtrees of documents) are stored together in one record.
  • Typical data trees may not fit on a single page, so they are distributed over several pages.

[Figure: physical tree holding the nodes f1 to f7 in records r1, r2 and r3, with proxy objects (p1, p2) and helper aggregate objects (h2)]
Natix Repository
  • A certain number of insertions, removals and updates of objects stored in this way would lead to an unfavorable distribution of the data.
    • To avoid this, large objects are split semantically, based on the underlying tree structure.
  • The data tree is partitioned into subtrees, and each subtree is stored in a single record smaller than a page.
  • Connected subtrees residing in other records are represented by proxy objects.
    • A proxy object consists of the RID (record identifier) of the record that contains the subtree it represents.
  • Substituting all proxies by their respective subtrees reconstructs the original data tree.
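The proxy substitution described above can be sketched in a few lines. This is a minimal illustration, not the Natix implementation: the `Node` and `Proxy` classes, the integer RIDs and the record layout are all hypothetical, and helper aggregate objects are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Proxy:
    """Placeholder for a subtree stored in another record, identified by its RID."""
    rid: int


@dataclass
class Node:
    """A node of the data tree as stored inside one record."""
    label: str
    children: List[Union["Node", Proxy]] = field(default_factory=list)


def reconstruct(item: Union[Node, Proxy], records: Dict[int, Node]) -> Node:
    """Replace every proxy by the subtree held in the record it points to."""
    if isinstance(item, Proxy):
        return reconstruct(records[item.rid], records)
    return Node(item.label, [reconstruct(c, records) for c in item.children])


def labels(node: Node) -> List[str]:
    """Pre-order list of labels, used here only to inspect the result."""
    out = [node.label]
    for child in node.children:
        out.extend(labels(child))
    return out


# Hypothetical layout: record 1 holds the root, records 2 and 3 hold subtrees.
records = {
    1: Node("f1", [Proxy(2), Proxy(3)]),
    2: Node("f2", [Node("f3"), Node("f4")]),
    3: Node("f6", [Node("f7")]),
}
tree = reconstruct(records[1], records)
print(labels(tree))
```

Each record stays small because it only holds a RID where a subtree was moved out; the full logical tree is rebuilt on demand by following those RIDs.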
Natix Repository
  • Inserting nodes
    • To insert a node into the logical data tree as a child node of f1, it must be decided where in the physical tree the insert should take place.
    • In Natix this choice may be determined by a configuration parameter.
    • After an insertion location has been decided, the designated record’s disk page may turn out to be full.
    • In that case the record has to be split.
Natix Repository
  • Splitting a record

[Figure: a record’s subtree before a split]

Natix Repository

[Figure: record assembly for the subtree]

Natix Repository
  • Split Matrix
    • Element (i, j) of the matrix expresses the desired clustering behavior of a node x with label j as a child of a node y with label i.
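A split matrix can be pictured as a lookup table keyed by (parent label, child label). The sketch below is only an illustration: the three entry values and their meanings are assumptions made for this example, as are the element names, since the slides do not say how the entries are encoded.

```python
import math
from typing import Dict, Tuple

# Hypothetical encoding of the matrix entries (an assumption, not from the slides).
KEEP_WITH_PARENT = math.inf   # never separate the child from its parent
OWN_RECORD = 0.0              # always store the child in a record of its own
NO_PREFERENCE = 1.0           # leave the decision to the split algorithm

SplitMatrix = Dict[Tuple[str, str], float]


def preference(matrix: SplitMatrix, parent_label: str, child_label: str) -> float:
    """Look up element (i, j): parent label i, child label j."""
    return matrix.get((parent_label, child_label), NO_PREFERENCE)


def may_separate(matrix: SplitMatrix, parent_label: str, child_label: str) -> bool:
    """True if a split is allowed to move the child out of its parent's record."""
    return preference(matrix, parent_label, child_label) != KEEP_WITH_PARENT


# Hypothetical labels: keep titles with their article, give each article a record.
matrix: SplitMatrix = {
    ("article", "title"): KEEP_WITH_PARENT,
    ("journal", "article"): OWN_RECORD,
}
print(may_separate(matrix, "article", "title"))
print(may_separate(matrix, "journal", "article"))
print(preference(matrix, "article", "section"))
```

A sparse dictionary with a default keeps the table small when most label pairs carry no particular preference.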
Query Processing
  • Query processing in Xyleme is similar to OQL, with two differences:
    • Xyleme operates on XML documents, which can be viewed as trees, whereas OQL is defined on graphs of objects.
    • Xyleme extracts information by pattern matching on trees, a facility that OQL does not provide. This is done with a complex algebraic operator named pattern scan.
Query Processing
  • The pattern scan operator is implemented using an index mechanism named XyIndex, an extension of full-text index (FTI) technology.
  • A standard FTI returns the documents in which a word occurs.
  • XyIndex adds annotations that position each occurrence of a word within a document relative to the other words.
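To make the difference between the two kinds of index concrete, here is a small sketch of a positional full-text index. It is not XyIndex itself: the slides do not say what the annotations look like, so this example assumes pre-order and post-order ranks of the enclosing element, and the class and method names are hypothetical. Text that follows a closing tag (tail text) is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Posting = Tuple[str, int, int]  # (document id, pre-order rank, post-order rank)


class PositionalIndex:
    def __init__(self) -> None:
        self.postings: Dict[str, List[Posting]] = defaultdict(list)

    def add_document(self, doc_id: str, xml_text: str) -> None:
        counter = {"pre": 0, "post": 0}

        def visit(elem: ET.Element) -> None:
            pre = counter["pre"]
            counter["pre"] += 1
            for child in elem:
                visit(child)
            post = counter["post"]
            counter["post"] += 1
            # Index the tag and the words of the element's own text.
            for word in [elem.tag] + (elem.text or "").split():
                self.postings[word.lower()].append((doc_id, pre, post))

        visit(ET.fromstring(xml_text))

    def documents(self, word: str) -> Set[str]:
        """What a standard FTI answers: documents in which the word occurs."""
        return {doc for doc, _, _ in self.postings.get(word.lower(), [])}

    def contains(self, outer: str, inner: str) -> Set[str]:
        """Documents where `inner` occurs in or below an element matching `outer`."""
        hits = set()
        for doc_o, pre_o, post_o in self.postings.get(outer.lower(), []):
            for doc_i, pre_i, post_i in self.postings.get(inner.lower(), []):
                if doc_o == doc_i and pre_o <= pre_i and post_o >= post_i:
                    hits.add(doc_o)
        return hits


index = PositionalIndex()
index.add_document("d1", "<book><author>Smith</author><title>Trees</title></book>")
index.add_document("d2", "<book><author>Jones</author><title>Smith</title></book>")
print(sorted(index.documents("smith")))
print(sorted(index.contains("author", "smith")))
print(sorted(index.contains("title", "smith")))
```

The plain lookup cannot tell the two documents apart, while the annotated postings distinguish "Smith" as an author from "Smith" as a title without opening either document.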
Data Acquisition
  • Crawl the web in search of XML data.
  • Refresh pages to keep the repository up to date.
  • Several crawlers can be used simultaneously and only XML pages are stored. HTML pages are used to discover new links.
  • The critical issue is deciding which document to read or refresh next.
  • The decision to read/refresh each page is based on the minimization of a global cost function under some constraint.
    • The constraint is the average number of pages that Xyleme is willing to read per time period.
    • The cost function is the dissatisfaction of users being presented with stale data.
Data Acquisition
  • More precisely, the decision is based on criteria such as:
    • Subscription and publication
    • Temporal information such as last-time-read or change rate
    • Page importance
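The refresh decision can be illustrated with a simple greedy scheduler. This is a sketch under stated assumptions, not Xyleme's actual optimization: the cost formula (importance times the probability that the page has changed, with changes modeled as a Poisson process), the doubling of the weight for subscribed pages, and all field names are hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    url: str
    importance: float         # higher means more important
    change_rate: float        # estimated changes per day
    last_read: float          # time of the last fetch, in days
    subscribed: bool = False  # True if a subscription depends on this page


def staleness_cost(page: Page, now: float) -> float:
    """Hypothetical cost of leaving the page unrefreshed at time `now`."""
    elapsed = max(0.0, now - page.last_read)
    p_changed = 1.0 - math.exp(-page.change_rate * elapsed)
    weight = page.importance * (2.0 if page.subscribed else 1.0)
    return weight * p_changed


def choose_pages(pages: List[Page], now: float, budget: int) -> List[Page]:
    """Greedy stand-in for the constrained minimization: refresh the `budget`
    pages whose staleness currently costs the most."""
    ranked = sorted(pages, key=lambda p: staleness_cost(p, now), reverse=True)
    return ranked[:budget]


pages = [
    Page("a", importance=1.0, change_rate=1.0, last_read=0.0),
    Page("b", importance=5.0, change_rate=0.01, last_read=9.0),
    Page("c", importance=0.8, change_rate=1.0, last_read=0.0, subscribed=True),
]
chosen = choose_pages(pages, now=10.0, budget=2)
print([p.url for p in chosen])
```

Here the budget plays the role of the constraint (pages read per time period), and the score combines the three criteria listed above: subscription, temporal information and page importance.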
Change Control
  • Change control is useful because the users may not only be interested in the current values but also in their evolution.
  • The BULD diff algorithm is used for change control.
  • The algorithm is illustrated with the following example.
    • Let D1 and D2 be two XML documents, D2 being the more recent one.
    • The starting point of the algorithm is to match the largest identical parts of the two documents.
    • This is done by registering in a map a unique signature for each subtree of D1.
    • Then every subtree of D2, starting from the largest, is considered in order to find an identical registered subtree of D1.
    • Then the parents are matched, if they have the same label.
    • The fact that parents are matched helps detect matches between descendants.
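The matching steps listed above can be sketched as follows. This is a simplified illustration that covers only those steps, not the full BULD algorithm: the SHA-1 signature, the function names and the sample documents are assumptions made for this example.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Dict, List


def signature(elem: ET.Element, cache: Dict[ET.Element, str]) -> str:
    """Hash of label, attributes, text and child signatures of a subtree."""
    parts = [elem.tag, repr(sorted(elem.attrib.items())), (elem.text or "").strip()]
    parts += [signature(child, cache) for child in elem]
    cache[elem] = hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()
    return cache[elem]


def size(elem: ET.Element) -> int:
    return 1 + sum(size(child) for child in elem)


def match(d1: ET.Element, d2: ET.Element) -> Dict[ET.Element, ET.Element]:
    """Return a mapping from matched nodes of D2 to nodes of D1."""
    sig1: Dict[ET.Element, str] = {}
    sig2: Dict[ET.Element, str] = {}
    signature(d1, sig1)
    signature(d2, sig2)
    # Register a signature for each subtree of D1.
    registry: Dict[str, List[ET.Element]] = {}
    for elem in d1.iter():
        registry.setdefault(sig1[elem], []).append(elem)
    parent1 = {child: parent for parent in d1.iter() for child in parent}
    parent2 = {child: parent for parent in d2.iter() for child in parent}
    matched: Dict[ET.Element, ET.Element] = {}
    used = set()
    # Consider the subtrees of D2 from the largest to the smallest.
    for e2 in sorted(d2.iter(), key=size, reverse=True):
        if e2 in matched:
            continue
        candidates = [e for e in registry.get(sig2[e2], []) if e not in used]
        if not candidates:
            continue
        e1 = candidates[0]
        # Identical subtrees: match their nodes pairwise in document order.
        for n2, n1 in zip(e2.iter(), e1.iter()):
            matched[n2] = n1
            used.add(n1)
        # Propagate the match to the parents while they carry the same label.
        up2, up1 = parent2.get(e2), parent1.get(e1)
        while (up2 is not None and up1 is not None and up2 not in matched
               and up1 not in used and up1.tag == up2.tag):
            matched[up2] = up1
            used.add(up1)
            up2, up1 = parent2.get(up2), parent1.get(up1)
    return matched


old = ET.fromstring(
    "<catalog><product><name>Pen</name><price>1</price></product>"
    "<product><name>Ink</name><price>5</price></product></catalog>")
new = ET.fromstring(
    "<catalog><product><name>Pen</name><price>1</price></product>"
    "<product><name>Ink</name><price>6</price></product></catalog>")
result = match(old, new)
unmatched = [e.tag for e in new.iter() if e not in result]
print(len(result), unmatched)
```

In this example the second product is no longer identical to its old version, yet it is still matched through its unchanged name child and the parent propagation, which leaves only the changed price unmatched.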
Semantic Data Integration
  • Queries in Xyleme are formulated using the structure of the documents. In some areas, people are defining standard DTDs, but most companies publishing in XML have their own.
  • Users cannot be expected to know all of the hundreds of DTDs.
  • Xyleme provides a view mechanism that enables users to query a single structure.
  • Defining views manually is a tedious process. RDF could be used by the designer of a DTD to provide some extra knowledge, but this field is too young.
  • Thus natural language and machine learning techniques have been used in Xyleme.
Semantic Data Integration
  • The first task is to classify DTDs into domains, based on a statistical analysis of the similarities between the words found in the different DTDs. Similarity is based on ontologies.
  • Once an abstract DTD has been defined to structure a particular domain, the next task is to generate the semantic connections between the elements of the abstract DTD and those of the concrete DTDs.
  • The problem now is to map paths to paths.
    • All tags along the path may not be words.
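The result of this mapping step can be pictured as a table from abstract paths to concrete paths. The sketch below only shows how a query on the abstract structure might be rewritten once such a table exists; it does not show how Xyleme derives the table, and every DTD name and path in it is hypothetical.

```python
from typing import Dict, List, Tuple

# abstract path -> {concrete DTD -> concrete path}; all names are hypothetical.
View = Dict[str, Dict[str, str]]

culture_view: View = {
    "painting/title": {
        "museum.dtd": "artwork/name",
        "gallery.dtd": "item/heading/title",
    },
    "painting/artist": {
        "museum.dtd": "artwork/creator",
    },
}


def rewrite(view: View, abstract_path: str) -> List[Tuple[str, str]]:
    """Translate one abstract path into (DTD, concrete path) pairs to evaluate."""
    return sorted(view.get(abstract_path, {}).items())


print(rewrite(culture_view, "painting/title"))
print(rewrite(culture_view, "painting/artist"))
print(rewrite(culture_view, "painting/year"))
```

The user writes a query against the single abstract structure, and each concrete DTD that has a mapping for the path contributes its own concrete path to the evaluation.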
Conclusions
  • The main feature distinguishing Xyleme from other systems is that it is based on warehousing.
    • Warehousing makes queries that require joins over pages distributed over the web feasible.
    • Warehousing also makes it possible to send precise alerts about changes in pages of interest.
  • Problems with data integration.