Efficient Incremental Validation of XML Documents

Denilson Barbosa

Alberto O. Mendelzon

Leonid Libkin

Laurent Mignet

Marcelo Arenas

Presented by Daria Barger

- Introduction
- Types of constraints
- Update operations
- Incremental validation
- Experiments
- Conclusions
- Future work

- The problems of storing and querying XML documents have attracted a great deal of interest.
- Other aspects of XML data management, however, have not yet been satisfactorily explored.
- Among them is the problem of checking that documents are valid with respect to their specifications, and that they remain valid after updates.

- One popular form of XML document specification is the Document Type Definition (DTD).
- A DTD D is a grammar that defines a set of documents L(D).
- Each document in L(D) is said to be valid with respect to D.

The validation problem is:

Given a DTD D and an XML document X, is it the case that X ∈ L(D)?

The incremental validation problem is:

Let U be some update operation.

Given X ∈ L(D), is it the case that U(X) ∈ L(D)?

Content Model:

An element with content model E is valid iff the string formed by concatenating the names of its child elements belongs to L(E), the language denoted by E.

Elements are declared in a DTD by rules of the form:

<!ELEMENT e c>

<?xml version="1.0"?>

<!ELEMENT db (person*)>

<!ELEMENT person (name, dep, email, tel*)>

<!ELEMENT name (#PCDATA)>

<!ELEMENT dep (#PCDATA)>

<!ELEMENT email (#PCDATA)>

<!ELEMENT tel (#PCDATA)>

Content Model:

#PCDATA – validation can be done trivially

Attribute validation is trivial, except for the ID and IDREF attribute types.

A valid XML document must satisfy:

- The values of all ID attributes are unique.
- The value of each IDREF attribute equals the value of some ID attribute.

Marking:

The XML DTD specification restricts the regular expressions used to define element content to be 1-unambiguous (deterministic).

Position – a subscripted symbol in the marked expression E'.

For a given position x, Χ(x) denotes the corresponding (unmarked) symbol in Σ.

For example: pos(E') = {a, b1, b2, c}

Χ(b1) = b

A regular expression E is 1-unambiguous if and only if, for all words u, v, w over the subscripted alphabet pos(E') and all x, y in pos(E'), the conditions

uxv, uyw ∈ L(E') and x ≠ y

imply Χ(x) ≠ Χ(y).

Which regular expression is deterministic?

- (ab)|(ac)
- a(b|c)
- a(a+b)*ac
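The question above can be checked mechanically. The following is a minimal sketch, assuming the standard characterization that an expression is 1-unambiguous exactly when no two distinct positions carrying the same symbol compete in the first set or in any follow set of the marked expression. The tuple encoding of expressions and all function names are illustrative choices, and the "+" in the third expression is read here as alternation.

```python
from itertools import count

# Expressions are nested tuples (an encoding chosen for this sketch):
# ("sym", a), ("cat", e1, e2), ("alt", e1, e2), ("star", e).

def mark(e, counter):
    """Return the marked expression E': each symbol gets a unique subscript."""
    if e[0] == "sym":
        return ("pos", (e[1], next(counter)))
    return (e[0],) + tuple(mark(child, counter) for child in e[1:])

def analyse(e, follow):
    """Return (nullable, first, last) of a marked expression and fill `follow`."""
    kind = e[0]
    if kind == "pos":
        follow[e[1]] = set()
        return False, {e[1]}, {e[1]}
    if kind == "star":
        _, first, last = analyse(e[1], follow)
        for x in last:
            follow[x] |= first          # a last position can loop back to a first one
        return True, first, last
    n1, f1, l1 = analyse(e[1], follow)
    n2, f2, l2 = analyse(e[2], follow)
    if kind == "alt":
        return n1 or n2, f1 | f2, l1 | l2
    for x in l1:                        # kind == "cat"
        follow[x] |= f2
    first = f1 | f2 if n1 else f1
    last = l1 | l2 if n2 else l2
    return n1 and n2, first, last

def is_one_unambiguous(e):
    """True iff no first/follow set holds two positions with the same symbol."""
    follow = {}
    _, first, _ = analyse(mark(e, count(1)), follow)

    def clash(positions):
        symbols = [symbol for symbol, _ in positions]
        return len(symbols) != len(set(symbols))

    return not clash(first) and not any(clash(s) for s in follow.values())

a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")
E1 = ("alt", ("cat", a, b), ("cat", a, c))                        # (ab)|(ac)
E2 = ("cat", a, ("alt", b, c))                                    # a(b|c)
E3 = ("cat", ("cat", ("cat", a, ("star", ("alt", a, b))), a), c)  # a(a+b)*ac
```

Under these assumptions only `E2` passes the check: in `E1` two positions labelled a compete at the start, and in `E3` the a inside the starred group competes with the a that follows it.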

First(E') – the set of positions that appear as the first symbol of some word in L(E')

Follow(E', x) – the set of positions that appear immediately after position x in some word in L(E')

Last(E') – the set of positions that appear as the last symbol of some word in L(E')

- Append(p,y) - insert element y as the last child of element p.

Append

The incremental validation concerns only the content of the element where the update takes place. For example, after an Append(p,y) operation only the content of p needs to be revalidated.

- Together with the i-th child of p we store the state reached, after reading the first i children, by the automaton that validates the content model of p.
- This requires auxiliary storage of size O(n log d), where n is the size of the XML document and d is the size of the DTD.
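A minimal sketch of how a stored state can make Append(p,y) a local check, using the person content model from the earlier DTD example. The dictionary layout, the `Element` class and the function names are assumptions made for illustration, not the authors' data structures. Because no symbol repeats in this content model, each automaton state is identified by its element name.

```python
# Automaton for (name, dep, email, tel*), written out by hand.
PERSON = {
    "first": {"name"},
    "follow": {"name": {"dep"}, "dep": {"email"},
               "email": {"tel"}, "tel": {"tel"}},
    "last": {"email", "tel"},
}

def step(automaton, state, label):
    """State reached by reading `label` from `state` (None = initial), or None."""
    candidates = automaton["first"] if state is None else automaton["follow"][state]
    return label if label in candidates else None

class Element:
    def __init__(self, automaton):
        self.automaton = automaton
        self.children = []  # pairs (label, state stored with that child)

def load(automaton, labels):
    """Full validation of existing content, storing a state with each child."""
    element = Element(automaton)
    state = None
    for label in labels:
        state = step(automaton, state, label)
        if state is None:
            raise ValueError("invalid content")
        element.children.append((label, state))
    if state not in automaton["last"]:
        raise ValueError("incomplete content")
    return element

def append(parent, label):
    """Append(p, y): consult only the state stored with the last child of p."""
    previous = parent.children[-1][1] if parent.children else None
    state = step(parent.automaton, previous, label)
    if state is None or state not in parent.automaton["last"]:
        return False  # update rejected, content left unchanged
    parent.children.append((label, state))
    return True
```

For example, after `p = load(PERSON, ["name", "dep", "email"])`, the call `append(p, "tel")` succeeds while `append(p, "name")` is rejected, and neither call rereads the earlier children.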

Append(p,y) operation

Problem: Complexity

Possible solution:

Consider E = a(b1* | c b2*).

w = acb…b: all b's match state b2.

Delete c from w to obtain w' = ab…b.

Now all b's match state b1.

The entire string must be revalidated.
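The effect of this deletion can be reproduced with a few lines of Python. This is a sketch in which the first, follow and last sets of a(b1* | c b2*) are written out by hand; the names `FIRST`, `FOLLOW`, `LAST`, `chi` and `run` are illustrative.

```python
# Positions of the marked expression a(b1* | c b2*).
FIRST = {"a"}
FOLLOW = {"a": {"b1", "c"}, "b1": {"b1"}, "c": {"b2"}, "b2": {"b2"}}
LAST = {"a", "b1", "c", "b2"}

def chi(position):
    """Unmarked symbol of a position: strip the subscript."""
    return position[0]

def run(word):
    """Return the state stored with each symbol of `word`, or None if invalid."""
    states, candidates = [], FIRST
    for symbol in word:
        matches = [p for p in candidates if chi(p) == symbol]
        if not matches:
            return None
        states.append(matches[0])  # unique, since E is 1-unambiguous
        candidates = FOLLOW[matches[0]]
    return states if states and states[-1] in LAST else None

before = run("acbbb")  # w
after = run("abbb")    # w' = w with c deleted
```

Comparing `before` and `after` shows that every b changes from state b2 to state b1, so the states stored with all remaining children have to be recomputed after the deletion.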

This condition does not always hold, e.g.

Daria Barger – DB Seminar

Let E be a regular expression over the alphabet Σ.

Follow(E,x) – the set of positions in E that can follow x in some path through E.

Define

such that

E is a 1,2-conflict-free regular expression if:

- 1,2-conflict-free DTD:
- There is no “flipping” between automaton states after an update.
- The per-update complexity for a 1,2-conflict-free DTD is O(log n + log d) time and O(n log d) auxiliary space.

- Conflict-free DTD:
- No repeated symbols.
- The per-update complexity: O(log n + log d) time and constant auxiliary space.
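The "no repeated symbols" condition is easy to test on a content model written in DTD syntax. The sketch below checks only that condition as stated on this slide; the function names and the simplified pattern for element names are assumptions, and any further requirements in the formal definition are not covered.

```python
import re

def element_names(content_model):
    """Names occurring in a DTD content model such as '(name, dep, email, tel*)'."""
    return re.findall(r"[A-Za-z_][\w.\-]*", content_model)

def has_no_repeated_symbols(content_model):
    """True iff every element name occurs at most once in the content model."""
    names = element_names(content_model)
    return len(names) == len(set(names))
```

Applied to the earlier examples, `(name, dep, email, tel*)` has no repeated symbols, while a content model such as `(a, (b* | (c, b*)))` repeats b and therefore fails the test.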

Append(p,y) and InsertBefore(x,y) operations require checking that no two ID attributes have the same value and that every IDREF attribute in y refers to an ID value that exists in the document.

The complexity:

O(|y| log n) time and linear auxiliary space.

|y| = size of the added subtree.
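A minimal sketch of the two checks for an inserted subtree y. It assumes the document's ID values are kept in a Python set, which is hash-based rather than the ordered index that the O(|y| log n) bound suggests, and it assumes that an IDREF in y may also point to an ID introduced by y itself. The function name and the flat representation of y are illustrative.

```python
def check_insert(document_ids, subtree):
    """Check the ID/IDREF constraints for inserting subtree y.

    `subtree` lists one pair (id_value or None, list_of_idrefs) per node of y.
    Returns the set of IDs introduced by y if the insert is valid, else None.
    """
    new_ids = set()
    for id_value, _ in subtree:
        if id_value is None:
            continue
        if id_value in document_ids or id_value in new_ids:
            return None  # duplicate ID value
        new_ids.add(id_value)
    for _, idrefs in subtree:
        for ref in idrefs:
            if ref not in document_ids and ref not in new_ids:
                return None  # dangling IDREF
    return new_ids

ids = {"p1", "p2"}
accepted = check_insert(ids, [("p3", ["p1"]), (None, ["p3"])])
duplicate = check_insert(ids, [("p2", [])])
dangling = check_insert(ids, [(None, ["p9"])])
```

When the insert is accepted, the returned set would be merged into the document's ID set; the work done is proportional to the number of attributes in y, not to the size of the document.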

After a Delete(x) operation we must check that the subtree rooted at x contains no node whose ID attribute is referenced by a node that is not a descendant of x.

Checking the reference counter on a delete requires O(log n) time.

Updating the reference counters when inserting or removing an IDREF attribute takes O(h log n) time.
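The condition that Delete(x) must respect can be stated directly in code. The sketch below is a naive check that scans the whole document, shown only to make the condition concrete; it is not the reference-counter scheme whose costs are given above. The `Node` class and the function names are hypothetical.

```python
class Node:
    def __init__(self, name, id_value=None, idrefs=(), children=()):
        self.name = name
        self.id_value = id_value      # value of the ID attribute, if any
        self.idrefs = list(idrefs)    # values of the IDREF attributes
        self.children = list(children)

def subtree(node):
    """Yield `node` and all of its descendants."""
    yield node
    for child in node.children:
        yield from subtree(child)

def can_delete(root, x):
    """True iff no node outside the subtree of x references an ID inside it."""
    inside = {id(n) for n in subtree(x)}
    ids_inside = {n.id_value for n in subtree(x) if n.id_value is not None}
    for n in subtree(root):
        if id(n) not in inside and ids_inside.intersection(n.idrefs):
            return False
    return True

b = Node("b", id_value="k1")
c = Node("c", idrefs=["k1"])
a = Node("a", children=[b, c])
```

In this small tree, `can_delete(a, b)` is false because c refers to the ID carried by b, whereas `can_delete(a, c)` is true.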

[Figure: three plots of validation time in microseconds (logarithmic axis) against document size (64K, 512K, 4M, 32M, 256M, 2G). Series in each plot: Incr CF, Incr 1.2 CF, Incr Arb, Full Arb, Full CF. Two plots have a time axis marked from 100 to 1e+08; the third has a time axis marked from 10 to 1000.]

- Handled insertion and deletion of subtrees (not leaf nodes only).
- Validated ID and IDREF attributes.
- Characterized a class of DTDs, appearing to capture most real-life DTDs, that admits a logarithmic-time, constant-space incremental validation algorithm.
- Conducted experiments showing that the method is practical for large data documents and behaves much better than full revalidation.

Handling complex updates involving several insertions and deletions as a single transaction.

