Efficient Incremental Validation of XML Documents

Denilson Barbosa

Alberto O. Mendelzon

Leonid Libkin

Laurent Mignet

Marcelo Arenas

Presented by Daria Barger

Daria Barger – DB Seminar

- Introduction
- Types of constraints
- Update operations
- Incremental validation
- Experiments
- Conclusions
- Future work

- The problems of storing and querying XML documents have attracted a great deal of interest.
- Other aspects of XML data management, however, have not yet been satisfactorily explored.
- Among them is the problem of checking that documents are valid with respect to their specifications, and that they remain valid after updates.

- One popular form of XML document specification is the Document Type Definition (DTD).
- A DTD D is a grammar that defines a set of documents L(D).
- Each document in L(D) is said to be valid with respect to D.

The validation problem is:

Given a DTD D and an XML document X, is it the case that X ∈ L(D)?

The incremental validation problem is:

Let U be some update operation.

Given X ∈ L(D), is it the case that U(X) ∈ L(D)?

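Content model:

An element is valid iff the string formed by concatenating the names of its child elements belongs to L(E), the language denoted by the regular expression E that is its content model.

Elements are declared in a DTD by rules of the form:

<!ELEMENT e c>

where e is the element name and c is its content model.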

<?xml version="1.0"?>

<!ELEMENT db (person*)>

<!ELEMENT person (name, dep, email, tel*)>

<!ELEMENT name (#PCDATA)>

<!ELEMENT dep (#PCDATA)>

<!ELEMENT email (#PCDATA)>

<!ELEMENT tel (#PCDATA)>

Content Model:

#PCDATA – validation can be done trivially
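
As a rough illustration of content-model validation for the DTD above, the following sketch encodes the children of each element as a string of names and matches it against a regular expression. The `MODELS` table and the `is_valid` helper are hypothetical names introduced here, and the hand translation of the content models into Python regexes is an assumption of this sketch, not something the slides prescribe.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the DTD above, hand-translated to regexes over
# space-terminated child names (an encoding chosen for this sketch only).
MODELS = {
    "db": r"(person )*",
    "person": r"name dep email (tel )*",
    # #PCDATA elements: no child elements are allowed.
    "name": r"", "dep": r"", "email": r"", "tel": r"",
}

def is_valid(element):
    """Full validation: check every element's children against its content model."""
    word = "".join(child.tag + " " for child in element)
    model = MODELS.get(element.tag)
    if model is None or re.fullmatch(model, word) is None:
        return False
    return all(is_valid(child) for child in element)

good = ET.fromstring(
    "<db><person><name>Ann</name><dep>CS</dep>"
    "<email>ann@example.org</email><tel>1</tel><tel>2</tel></person></db>"
)
bad = ET.fromstring("<db><person><name>Ann</name><tel>1</tel></person></db>")
print(is_valid(good), is_valid(bad))
```

This is the full (non-incremental) check: every element of the document is visited, which is the cost that incremental validation tries to avoid.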

Attribute validation is trivial, except for the ID and IDREF attribute types.

A valid XML document must satisfy:

- Values of all ID attributes are unique
- Value of each IDREF attribute must be equal to the value of some ID attribute
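
The two conditions above can be checked on a whole document with a sketch like the one below. `xml.etree.ElementTree` does not read attribute types from a DTD, so this example simply assumes that an attribute named `id` has type ID and an attribute named `ref` has type IDREF; both names, and the helper `ids_are_consistent`, are hypothetical.

```python
import xml.etree.ElementTree as ET

def ids_are_consistent(root, id_attr="id", idref_attr="ref"):
    """Check that ID values are unique and every IDREF matches some ID."""
    seen = set()
    for element in root.iter():
        value = element.get(id_attr)
        if value is not None:
            if value in seen:          # duplicate ID value
                return False
            seen.add(value)
    refs = (element.get(idref_attr) for element in root.iter())
    return all(ref in seen for ref in refs if ref is not None)

ok = ET.fromstring('<db><person id="p1"/><person id="p2" ref="p1"/></db>')
dup = ET.fromstring('<db><person id="p1"/><person id="p1"/></db>')
dangling = ET.fromstring('<db><person id="p1" ref="p9"/></db>')
print(ids_are_consistent(ok), ids_are_consistent(dup), ids_are_consistent(dangling))
```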

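Marking:

The specification of XML DTDs restricts the regular expressions used for defining element content to be 1-unambiguous (deterministic).

The marking E′ of a regular expression E subscripts the symbols of E so that each symbol occurs at most once in E′.

Position: a subscripted symbol in E′; pos(E′) is the set of all positions.

For a given position x, Χ(x) denotes the corresponding (unmarked) symbol in Σ.

For example: pos(E′) = {a, b1, b2, c} and Χ(b1) = b.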

A regular expression E is 1-unambiguous if and only if for all words u, v, w over the subscripted alphabet pos(E′) and all x, y in pos(E′), the conditions

uxv, uyw ∈ L(E′) and x ≠ y

imply Χ(x) ≠ Χ(y).

Which regular expression is deterministic?

- (ab)|(ac)
- a(b|c)
- a(a+b)*ac

first(E′): set of positions that appear as the first symbol of some word in L(E′)

follow(E′, x): set of positions that appear immediately after position x in some word in L(E′)

last(E′): set of positions that appear as the last symbol of some word in L(E′)
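
One way to make these three sets concrete is to compute them over a small expression tree and then test determinism: an expression is 1-unambiguous exactly when no two distinct positions carrying the same symbol compete in the first set or in any follow set. The sketch below does this for the three expressions from the question above. The tuple encoding (`"sym"`, `"cat"`, `"alt"`, `"star"`) and all function names are hypothetical, and `+` in `a(a+b)*ac` is read as union, which is an assumption.

```python
from itertools import count

def sym(a):
    return ("sym", a)

def mark(expr, counter=None):
    """Marking: turn each symbol leaf ("sym", a) into a unique position (a, k)."""
    counter = counter if counter is not None else count(1)
    if expr[0] == "sym":
        return ("pos", (expr[1], next(counter)))
    return (expr[0],) + tuple(mark(sub, counter) for sub in expr[1:])

def analyse(e, follow):
    """Return (nullable, first, last) of marked expression e; fill follow in place."""
    kind = e[0]
    if kind == "pos":
        follow.setdefault(e[1], set())
        return False, {e[1]}, {e[1]}
    if kind == "star":
        _, first, last = analyse(e[1], follow)
        for p in last:                     # the loop back into the starred body
            follow[p] |= first
        return True, first, last
    if kind not in ("alt", "cat"):
        raise ValueError(kind)
    n1, f1, l1 = analyse(e[1], follow)
    n2, f2, l2 = analyse(e[2], follow)
    if kind == "alt":
        return n1 or n2, f1 | f2, l1 | l2
    for p in l1:                           # concatenation: last of left -> first of right
        follow[p] |= f2
    return n1 and n2, f1 | (f2 if n1 else set()), l2 | (l1 if n2 else set())

def clash(positions):
    """True if two distinct positions in the set carry the same symbol."""
    symbols = [p[0] for p in positions]
    return len(symbols) != len(set(symbols))

def is_one_unambiguous(expr):
    follow = {}
    _, first, _ = analyse(mark(expr), follow)
    return not clash(first) and not any(clash(s) for s in follow.values())

e1 = ("alt", ("cat", sym("a"), sym("b")), ("cat", sym("a"), sym("c")))   # (ab)|(ac)
e2 = ("cat", sym("a"), ("alt", sym("b"), sym("c")))                      # a(b|c)
e3 = ("cat", ("cat", ("cat", sym("a"), ("star", ("alt", sym("a"), sym("b")))),
              sym("a")), sym("c"))                                       # a(a|b)*ac
print(is_one_unambiguous(e1), is_one_unambiguous(e2), is_one_unambiguous(e3))
```

Under this reading only `a(b|c)` is deterministic: in `(ab)|(ac)` two `a` positions compete in the first set, and in `a(a|b)*ac` two `a` positions can follow the initial `a`.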

- Append(p,y): insert element y as the last child of element p.

[Figure: Append(p,y) on a document tree]

- InsertBefore(x,y): insert element y as the immediate left sibling of element x. (This operation is not defined if x is the root of the document.)

[Figure: InsertBefore(x,y) on a document tree]

- Delete(x): delete element x from the document. Note that if x is the root of the document, the operation is trivially valid.

[Figure: Delete(x) on a document tree]

The incremental validation concerns only the content of the element where the update takes place. For example, after an Append(p,y) operation only the content of p needs to be revalidated.

[Figure: element p with children w1, w2, w3, …, wk]

- Together with the i-th child of p we store the state that the automaton validating the content model of p reaches after reading w1 … wi.
- This requires auxiliary storage of size O(n log d), where n is the size of the XML document and d is the size of the DTD.
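
The stored-state idea can be sketched as follows for the Append case: because the state reached after the last child is already stored, a new last child costs a single automaton transition instead of a rescan of all children. The class `ChildStates`, the hand-written DFA for the content model `(name, dep, email, tel*)`, and the in-memory lists are all assumptions of this sketch; it says nothing about how the states are actually laid out in storage.

```python
class ChildStates:
    """Children of one element p, each stored with the DFA state reached after it."""

    def __init__(self, start, delta, accepting, labels):
        self.start, self.delta, self.accepting = start, delta, accepting
        self.labels, self.states = [], []
        state = start
        for label in labels:               # one full pass when the document is loaded
            state = delta.get((state, label))
            if state is None:
                raise ValueError("initial content is not valid")
            self.labels.append(label)
            self.states.append(state)
        if state not in accepting:
            raise ValueError("initial content is not valid")

    def try_append(self, label):
        """Append(p, y): one transition from the last stored state, no rescan."""
        last = self.states[-1] if self.states else self.start
        nxt = self.delta.get((last, label))
        if nxt is None or nxt not in self.accepting:
            return False                   # update rejected, stored data unchanged
        self.labels.append(label)
        self.states.append(nxt)
        return True

# Hand-written DFA for (name, dep, email, tel*); state 3 is the only accepting state.
DELTA = {(0, "name"): 1, (1, "dep"): 2, (2, "email"): 3, (3, "tel"): 3}
person = ChildStates(0, DELTA, {3}, ["name", "dep", "email"])
print(person.try_append("tel"), person.try_append("name"), person.states)
```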

Append(p,y) operation

[Figure: element p with children w1, w2, w3, …, wk and the new child y]

Delete(x) operation

[Figure: element p with children w1, w2, …, wi, …, wk]

Problem: Complexity

Possible solution:

This condition does not always hold. For example, consider E = a(b*|cb*) with marking E′ = a(b1*|cb2*):

- w = acb…b: all b's match position b2.
- Delete c from w, giving w′ = ab…b: now all b's match position b1.
- We have to revalidate the entire string.
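
The effect in this example can be reproduced with a small hand-built automaton whose states are the positions a, b1, c, b2 (plus a start state). The sketch below records the state after each symbol, as it would be stored with each child, and counts how many stored states change when c is deleted. The transition table and the name `stored_states` are written for this illustration only.

```python
# Automaton for a(b1*|cb2*): states are positions, "q0" is the start state.
DELTA = {
    ("q0", "a"): "a",
    ("a", "b"): "b1", ("b1", "b"): "b1",
    ("a", "c"): "c", ("c", "b"): "b2", ("b2", "b"): "b2",
}

def stored_states(word):
    """Return the state reached after each symbol of an accepted word."""
    states, state = [], "q0"
    for symbol in word:
        state = DELTA[(state, symbol)]
        states.append(state)
    return states

before = stored_states("acbbb")   # states stored with the children of w
after = stored_states("abbb")     # states needed once c is deleted

# Compare the state of every b before and after the deletion.
flipped = sum(old != new for old, new in zip(before[2:], after[1:]))
print(before, after, flipped)
```

Every b to the right of the deleted c changes state, so the stored states of the whole suffix have to be rewritten; with k trailing b's that is k updates for a single deletion.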

Let E be regular expression over alphabet Σ

Follow(E,x): the set of positions in E that can follow x in some path through E.

Define

such that

E is a 1,2-conflict-free regular expression if:

- 1,2-conflict-free DTD:
  - There is no “flipping” between automaton states after the update.
  - The per-update complexity is O(log n + log d) time and O(n log d) auxiliary space.
- Conflict-free DTD:
  - No repeated symbols.
  - The per-update complexity is O(log n + log d) time and constant auxiliary space.
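
To see why no stored states are needed when a content model has no repeated symbols, note that the automaton state is then determined by the last symbol read, so an update can be checked by looking only at the neighbours of the affected child. The sketch below illustrates this for `(name, dep, email, tel*)`. It is a simplified illustration under that assumption, not the algorithm from the talk: the `ALLOWED` table is written by hand from the first, follow, and last sets, and `can_insert` and `can_delete` are hypothetical helpers that use plain list indexing to find the neighbours.

```python
START, END = "^", "$"

# Adjacent pairs permitted by (name, dep, email, tel*):
# (START, s) for s in first, (s, t) for t in follow(s), (s, END) for s in last.
ALLOWED = {
    (START, "name"), ("name", "dep"), ("dep", "email"),
    ("email", "tel"), ("tel", "tel"),
    ("email", END), ("tel", END),
}

def can_insert(children, index, label):
    """Would inserting label at position index keep a valid child list valid?"""
    left = children[index - 1] if index > 0 else START
    right = children[index] if index < len(children) else END
    return (left, label) in ALLOWED and (label, right) in ALLOWED

def can_delete(children, index):
    """Would deleting the child at position index keep a valid child list valid?"""
    left = children[index - 1] if index > 0 else START
    right = children[index + 1] if index + 1 < len(children) else END
    return (left, right) in ALLOWED

kids = ["name", "dep", "email", "tel"]
print(can_insert(kids, 4, "tel"), can_insert(kids, 1, "tel"),
      can_delete(kids, 3), can_delete(kids, 1))
```

Each check inspects at most two neighbours and keeps no per-child state, which matches the constant auxiliary space stated for conflict-free DTDs.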

Append(p,y) and InsertBefore(x,y) operations require checking that no ID value in y duplicates another ID value, and that every IDREF attribute in y refers to an existing ID value in the document.

The complexity: O(|y| log n) time and linear auxiliary space, where |y| is the size of the added subtree.

After a Delete(x) operation we have to check that the subtree rooted at x contains no node whose ID attribute is referenced by a node that is not a descendant of x.

Checking the reference counter on delete requires O(log n) time.

Updating the reference counters when inserting or removing an IDREF attribute requires O(h log n) time.
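
The condition for Delete(x) can be stated in terms of counts: x may be removed only if every ID inside the subtree is referenced as often from inside the subtree as it is in the whole document. The sketch below checks this naively by recounting references on each call, so it does not achieve the bounds quoted above, which depend on counters that are maintained across updates. The attribute names `id` and `ref` and both helper functions are assumptions.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def ref_counts(root, idref_attr="ref"):
    """Count how often each ID value is referenced below root (inclusive)."""
    refs = (element.get(idref_attr) for element in root.iter())
    return Counter(ref for ref in refs if ref is not None)

def can_delete(document_root, x, id_attr="id", idref_attr="ref"):
    """True if no ID inside the subtree of x is referenced from outside it."""
    total = ref_counts(document_root, idref_attr)
    inside = ref_counts(x, idref_attr)
    ids_in_x = [e.get(id_attr) for e in x.iter() if e.get(id_attr) is not None]
    return all(total[i] == inside[i] for i in ids_in_x)

doc = ET.fromstring(
    '<db><person id="p1"><tel ref="p1"/></person>'
    '<person id="p2"/><note ref="p2"/></db>'
)
print(can_delete(doc, doc[0]), can_delete(doc, doc[1]))
```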

[Plots: time in microseconds (log scale) versus document size (64K to 2G) for Incr CF, Incr 1,2 CF, Incr Arb, Full Arb, and Full CF. Two plots span 100 to 1e+08 microseconds; a third spans 10 to 1000 microseconds.]

- Handled insertion and deletion of subtrees (not leaf nodes only).
- Validated ID and IDREF attributes.
- Characterized a class of DTDs, which appears to capture most real-life DTDs, that admits a logarithmic-time, constant-space incremental validation algorithm.
- Conducted experiments showing that the method is practical for large documents and behaves much better than full revalidation.

Handling complex updates, involving several insertions and deletions as a single transaction.
