Xyleme: Dynamic XML Data Warehouse for Efficient Data Management and Semantic Integration

Xyleme • A Dynamic Warehouse for XML Data of the Web

Motivation • Efficient storage for huge quantities of XML data. • Query processing. • Data acquisition strategies to build the repository. • Change control with services such as query subscription. • Semantic data integration.

Architecture • Xyleme is functionally organized in four levels: • Physical level (the Natix repository). • Logical level (data acquisition and query processing). • Application level (change management and semantic data integration). • Interface level (interface with the web and interface with the Xyleme clients).

Architecture

The Natix Repository • Xyleme requires efficient, updatable storage of XML data. • Existing approaches fall into two categories: • Flat streams • Metamodeling • Natix uses a hybrid approach.

Natix Repository • Instead of storing each tree node in a separate record, whole documents (or subtrees of documents) are stored together in one record. • Typical data trees may not fit on a single page, so they are distributed over several pages. • [Figure: a logical tree with nodes f1 to f7, and the corresponding physical tree spread over records r1 to r3, with proxy objects (p1, p2) and helper aggregate objects.]

Natix Repository • After a number of insertions, removals and updates, objects stored in this way would end up unfavorably distributed. • To avoid this, large objects are split semantically, based on the underlying tree structure. • The data tree is partitioned into subtrees, and each subtree is stored in a single record smaller than a page. • Connected subtrees residing in other records are represented by proxy objects. • A proxy object contains the RID (record identifier) of the record that holds the subtree it represents. • Substituting all proxies by their respective subtrees reconstructs the original data tree.
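The proxy substitution described above can be sketched in a few lines of Python. This is only an illustration, not Natix code: the `Node` and `Proxy` classes, the `records` dictionary standing in for the record store, and the shape of the example tree are all assumptions; only the labels f1 to f7 are borrowed from the slide's figure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Proxy:
    rid: int  # record identifier of the record holding the subtree


@dataclass
class Node:
    label: str
    children: List[Union["Node", Proxy]] = field(default_factory=list)


def reconstruct(node: Union[Node, Proxy], records: Dict[int, Node]) -> Node:
    """Return a copy of the tree with every proxy replaced by its subtree."""
    if isinstance(node, Proxy):
        return reconstruct(records[node.rid], records)
    return Node(node.label, [reconstruct(c, records) for c in node.children])


def labels(node: Node) -> List[str]:
    """Preorder list of labels, used here only to inspect the result."""
    out = [node.label]
    for child in node.children:
        out.extend(labels(child))
    return out


# Hypothetical physical layout: three records, each holding one subtree.
records = {
    1: Node("f1", [Proxy(2), Proxy(3)]),
    2: Node("f2", [Node("f3"), Node("f4")]),
    3: Node("f5", [Node("f6"), Node("f7")]),
}

tree = reconstruct(records[1], records)
print(labels(tree))
```

Starting from the root record and following each RID is enough to rebuild the logical tree, which is why a proxy needs to carry nothing but the RID.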

Natix Repository • Inserting nodes • To insert a node into the logical data tree as a child node of f1, it must be decided where in the physical tree the insert should take place. • In Natix this choice may be determined by a configuration parameter. • After an insertion location has been decided, it is possible that the designated record’s disk page is full. • So the record has to be split.

Natix Repository • Splitting a record • [Figure: a record's subtree before a split.]

Natix Repository • [Figure: record assembly for the subtree.]

Natix Repository • Split Matrix • Each element (i, j) of the matrix expresses the desired clustering behavior of a node x with label j when it is a child of a node y with label i.

Query Processing • Query processing in Xyleme is similar to OQL, except: • Xyleme operates on XML documents that can be viewed as trees, whereas OQL is defined on graphs of objects. • Xyleme extracts information by pattern matching on trees, a facility OQL does not provide. This is done with a complex algebraic operator named Pattern scan.

Query Processing • The pattern scan operator is implemented using an index mechanism named XyIndex, an extension of full-text index (FTI) technology. • A standard FTI returns the documents in which a word occurs. • XyIndex adds annotations that position each occurrence of a word within a document relative to the other words.
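To make the difference between a plain full-text index and a position-annotated one concrete, here is a minimal Python sketch. It is not the XyIndex format: the interval numbering, the function names `index_document` and `docs_with_word_under`, and the sample documents are all assumptions chosen for illustration.

```python
import itertools
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_document(index, doc_id, xml_text):
    """Add postings (doc_id, start, end) for every tag and word of a document.

    Positions come from one counter per document, so a word lies inside an
    element exactly when element.start < word.position < element.end.
    """
    ticks = itertools.count()

    def add_words(text):
        for word in (text or "").lower().split():
            pos = next(ticks)
            index[word].append((doc_id, pos, pos))

    def visit(elem):
        start = next(ticks)
        add_words(elem.text)
        for child in elem:
            visit(child)
            add_words(child.tail)
        index["<" + elem.tag + ">"].append((doc_id, start, next(ticks)))

    visit(ET.fromstring(xml_text))


def docs_with_word_under(index, tag, word):
    """Documents in which `word` occurs inside an element named `tag`."""
    hits = set()
    for doc_a, start, end in index.get("<" + tag + ">", []):
        for doc_b, pos, _ in index.get(word, []):
            if doc_a == doc_b and start < pos < end:
                hits.add(doc_a)
    return hits


docs = {
    "d1": "<book><title>XML warehouse</title><author>Smith</author></book>",
    "d2": "<book><title>Databases</title><author>XML</author></book>",
}
index = defaultdict(list)
for doc_id, text in docs.items():
    index_document(index, doc_id, text)

plain_fti = {doc for doc, _, _ in index["xml"]}          # both documents
structural = docs_with_word_under(index, "title", "xml")  # only d1
print(sorted(plain_fti), sorted(structural))
```

The word lookup alone cannot distinguish the two documents; the positional annotations let the index answer the structural question without opening either document.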

Data Acquisition • Crawl the web in search of XML data. • Refresh pages to keep the repository up to date. • Several crawlers can be used simultaneously, and only XML pages are stored. HTML pages are used to discover new links. • The critical issue is deciding which document to read or refresh next. • The decision to read or refresh each page is based on minimizing a global cost function under a constraint. • The constraint is the average number of pages that Xyleme is willing to read per time period. • The cost function is the dissatisfaction of users being presented with stale data.

Data Acquisition • More precisely, the decision is based on criteria such as: • Subscription and publication • Temporal information such as last-time-read or change rate • Page importance
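As a rough illustration of how such criteria might be combined, the Python sketch below greedily picks the pages with the highest estimated staleness cost under a fixed read budget. The scoring formula (importance times the probability that the page changed, under an assumed Poisson change model), the `Page` fields, and the example URLs are assumptions for this sketch; the slides do not give Xyleme's actual cost function, and subscriptions are left out.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    importance: float   # assumed to lie in [0, 1]
    change_rate: float  # assumed average number of changes per day
    last_read: float    # time of last read, in days


def staleness_cost(page: Page, now: float) -> float:
    """Estimated user dissatisfaction if the page is not refreshed now."""
    age = now - page.last_read
    p_changed = 1.0 - math.exp(-page.change_rate * age)  # Poisson assumption
    return page.importance * p_changed


def pick_pages(pages, now: float, budget: int):
    """Greedy choice: the `budget` pages with the highest estimated cost."""
    return heapq.nlargest(budget, pages, key=lambda p: staleness_cost(p, now))


pages = [
    Page("http://example.org/a.xml", importance=0.9, change_rate=1.0, last_read=9.0),
    Page("http://example.org/b.xml", importance=0.2, change_rate=5.0, last_read=0.0),
    Page("http://example.org/c.xml", importance=0.8, change_rate=0.01, last_read=5.0),
]
chosen = pick_pages(pages, now=10.0, budget=2)
print([p.url for p in chosen])
```

In this example the important but rarely changing page is skipped in favor of pages that are more likely to be stale, which is the trade-off the budget constraint forces.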

Change Control • Change control is useful because users may be interested not only in the current values but also in their evolution. • The BULD diff algorithm is used for change control. • The algorithm is illustrated with the following example. • Let D1 and D2 be two XML documents, D2 being the more recent one. • The starting point of the algorithm is to match the largest identical parts of the two documents. • This is done by registering in a map a unique signature for each subtree of D1. • Then every subtree of D2, starting from the largest, is considered in order to find an identical registered subtree of D1. • Then the parents are matched if they have the same label. • The fact that parents are matched helps detect matches between descendants.
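The matching steps listed above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the full BULD algorithm: signatures are SHA-1 hashes of tag, text and child signatures (attributes are ignored), subtree size stands in for weight, and parent propagation continues for as long as labels agree. The function names and the sample documents are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def annotate(root):
    """Compute (signature, size) for every subtree, plus a child -> parent map."""
    info, parent = {}, {}

    def visit(node):
        child_sigs, size = [], 1
        for child in node:
            parent[child] = node
            sig, n = visit(child)
            child_sigs.append(sig)
            size += n
        raw = "|".join([node.tag, (node.text or "").strip()] + child_sigs)
        info[node] = (hashlib.sha1(raw.encode()).hexdigest(), size)
        return info[node]

    visit(root)
    return info, parent


def match(d1, d2):
    """Return a dict mapping nodes of D2 to their matched nodes of D1."""
    info1, parent1 = annotate(d1)
    info2, parent2 = annotate(d2)
    by_sig = {}  # signature -> subtrees of D1 carrying it
    for node, (sig, _) in info1.items():
        by_sig.setdefault(sig, []).append(node)
    matched, used = {}, set()

    def match_identical(n1, n2):
        # Identical subtrees: descendants pair up position by position.
        matched[n2] = n1
        used.add(n1)
        for c1, c2 in zip(n1, n2):
            match_identical(c1, c2)

    # Consider the subtrees of D2 from the largest to the smallest.
    for n2 in sorted(info2, key=lambda n: -info2[n][1]):
        if n2 in matched:
            continue
        candidates = by_sig.get(info2[n2][0], [])
        n1 = next((c for c in candidates if c not in used), None)
        if n1 is None:
            continue
        match_identical(n1, n2)
        # Propagate upward: match parents while they carry the same label.
        p1, p2 = parent1.get(n1), parent2.get(n2)
        while (p1 is not None and p2 is not None and p1.tag == p2.tag
               and p1 not in used and p2 not in matched):
            matched[p2] = p1
            used.add(p1)
            p1, p2 = parent1.get(p1), parent2.get(p2)
    return matched


d1 = ET.fromstring("<catalog><product><name>tv</name><price>100</price></product>"
                   "<product><name>radio</name><price>20</price></product></catalog>")
d2 = ET.fromstring("<catalog><product><name>tv</name><price>120</price></product>"
                   "<product><name>radio</name><price>20</price></product></catalog>")
m = match(d1, d2)
unmatched = [(n.tag, n.text) for n in d2.iter() if n not in m]
print(unmatched)
```

Here the unchanged product is matched as a whole, its parent is matched by label, and the only node of D2 left unmatched is the changed price, which a diff would then report as a change.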

Change Control

Semantic Data Integration • Queries in Xyleme are formulated using the structure of the documents. In some areas, people are defining standard DTDs, but most companies publishing in XML have their own. • Users cannot be expected to know all of the hundreds of DTDs. • Xyleme provides a view mechanism that enables users to query a single structure. • Defining views manually is a tedious process. RDF could be used by the designer of a DTD to provide some extra knowledge, but this field is too young. • Thus natural language and machine learning techniques have been used in Xyleme.

Semantic Data Integration • The first task is to classify DTDs into domains based on a statistical analysis of the similarities between words found in the different DTDs. Similarity is based on ontologies. • Once an abstract DTD has been defined to structure a particular domain, the next task is to generate the semantic connections between elements of the abstract DTD and the concrete ones. • The problem now is to map paths to paths. • Not all tags along a path are necessarily words.
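A view of this kind can be pictured as a table that maps each path of the abstract DTD to the corresponding path in each concrete DTD. The Python sketch below is a toy illustration only: the `VIEW` table, the tag names, and the use of the document's root tag as a stand-in for its DTD are all assumptions, and in Xyleme the mappings are meant to be generated rather than written by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical view: abstract path -> {concrete root tag: concrete path}.
VIEW = {
    "painting/title": {"artwork": "name", "item": "info/title"},
}


def query(abstract_path, documents):
    """Evaluate one abstract path against documents with different structures."""
    results = []
    for xml_text in documents:
        root = ET.fromstring(xml_text)
        concrete_path = VIEW.get(abstract_path, {}).get(root.tag)
        if concrete_path is None:
            continue  # the view has no mapping for this kind of document
        results.extend(elem.text for elem in root.findall(concrete_path))
    return results


documents = [
    "<artwork><name>Guernica</name></artwork>",
    "<item><info><title>Water Lilies</title></info></item>",
    "<song><title>Imagine</title></song>",
]
titles = query("painting/title", documents)
print(titles)
```

The user writes a single path against the abstract structure, and the view decides which concrete path to follow in each document.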

Conclusions • The main feature distinguishing Xyleme from other systems is that it is based on warehousing. • Warehousing makes queries that require joins over pages distributed across the web feasible. • Warehousing also allows precise alerts about changes in pages of interest. • Problems remain with data integration.
