Cis 550 fall 2001
This presentation is the property of its rightful owner.
Sponsored Links
1 / 108

CIS 550 Fall 2001 PowerPoint PPT Presentation


  • 67 Views
  • Uploaded on
  • Presentation posted in: General

CIS 550 Fall 2001. Handout 6 XML. Outline (ambitious). Background: documents (SGML/HTML) and databases (structured and semistructured data) XML Basics and Document Type Descriptors XML query languages: XML-QL and XSL. XML additions: Xlink, Xpointer, RDF, SOX, XML-Data

Download Presentation

CIS 550 Fall 2001

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Cis 550 fall 2001

CIS 550Fall 2001

Handout 6

XML

Fall 2001


Outline ambitious

Outline (ambitious)

  • Background: documents (SGML/HTML) and databases (structured and semistructured data)

  • XML Basics and Document Type Descriptors

  • XML query languages: XML-QL and XSL.

  • XML additions: Xlink, Xpointer, RDF, SOX, XML-Data

  • Document Object Model (XML API's)

Fall 2001


Some useful articles

Some Useful Articles

XML, Java, and the future of the web

http://webreview.com/wr/pub/97/12/19/xml/index.html

XML and the Second-Generation Web

http://www.sciam.com/1999/0599issue/0599bosak.html

Articles/standards for XML, XSL, XML-QL http://www.w3c.org/

http://www.w3.org/TR/REC-xml

Fall 2001


Part i background

Part I: Background

What’s the difference between the world of documents and information retrieval and databases and query interfaces?

Fall 2001


Documents vs databases

Document world

> plenty of small documents

> usually static

> implicit structure

section, paragraph, toc,

> tagging

> human friendly

> content

form/layout, annotation

> Paradigms

“Save as”, wysiwyg

> meta-data

author name, date, subject

Database world

> a few large databases

> usually dynamic

> explicit structure (schema)

> records

> machine friendly

> content

schema, data, methods

> Paradigms

Atomicity, Concurrency, Isolation, Durability

> meta-data

schema description

Documents vs Databases

Fall 2001


What to do with them

Documents

editing

printing

spell-checking

counting words

retrieving (IR)

searching

Database

updating

cleaning

querying

composing/transforming

What to do with them

Fall 2001


Cis 550 fall 2001

HTML

  • Lingua franca for publishing hypertext on the World Wide Web

  • Designed to describe how a Web browser should arrange text, images and push-buttons on a page.

  • Easy to learn, but does not convey structure.

  • Fixed tag set.

Text (PCDATA)

Opening tag

  • <HTML>

  • <HEAD><TITLE>Welcome to the XML course</TITLE></HEAD>

  • <BODY>

  • <H1>Introduction</H1>

  • <IMGSRC=”dragon.jpeg"WIDTH="200"HEIGHT="150” >

  • </BODY>

  • </HTML>

Closing tag

“Bachelor” tag

Attribute name

Attribute value

Fall 2001


Thin red line

Thin red line

  • The line between the document world and the database world is not clear.

  • In some cases, both approaches are legitimate.

  • An interesting middle ground is data formats -- of which XML is an example

  • Examples

    • Personal address book

    • Swissprot

    • ASN.1

Fall 2001


Personal address book over 20 years

Personal address book over 20 years

1977

NAchison, Malcolm

F Dr. M.P. Achison

A Dept. of Computer Science

A University of Edinburgh

A Kings Buildings

A Edinburgh E12 8QQ

A Scotland

T 031-123-8855 ext. 4359 (work)

T 031-345-7570 (home)

N Albani, Paolo

F Prof. Paolo Albani

A Dip. Informatica e Sistemistica

A Universita di Roma La Sapienza

...

1990

N Achison, Malcolm

F Prof. M.P. Achison

A Dept. of Computing Science

A University of Glasgow

A Lilybank Gardens

A Glasgow G12 8QQ

A Scotland

T 041-339-8855 ext. 4359

T 041-357-3787 (private)

T 031-667-7570 (home)

X 041-339-0090

C [email protected]

N Achison, Malcolm

F Prof. M.P. Achison

A 34 Inverness Place

A Edinburgh, EH3 8UV

1980

N Achison, Malcolm

F Dr. M.P. Achison

A Dept. of Computer Science

....

T 031-667-7570 (home)

C [email protected]

1997

N Achison, Malcolm

F Prof. M.P. Achison

A Department of Computing Science

...

T 031-667-7570 (home)

X 041-339-0090

C [email protected]

W http://www.dcs.gla.ac.uk/mpa

2000

?

Fall 2001


Swissprot

Swissprot

ID 11SB_CUCMA STANDARD; PRT; 480 AA.

AC P13744;

DT 01-JAN-1990 (REL. 13, CREATED)

DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)

DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)

DE 11S GLOBULIN BETA SUBUNIT PRECURSOR.

OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).

OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;

OC VIOLALES; CUCURBITACEAE.

RN [1]

RP SEQUENCE FROM N.A.

RC STRAIN=CV. KUROKAWA AMAKURI NANKIN;

RX MEDLINE; 88166744.

RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARANISHIMURA I.;

RL EUR. J. BIOCHEM. 172:627-632(1988).

RN [2]

RP SEQUENCE OF 22-30 AND 297-302.

RA OHMIYA M., HARA I., MASTUBARA H.;

RL PLANT CELL PHYSIOL. 21:157-167(1980).

Fall 2001


Swissprot cont d

Swissprot (cont’d)

CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.

CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A

CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A

CC DISULFIDE BOND.

CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).

DR EMBL; M36407; G167492; -.

DR PIR; S00366; FWPU1B.

DR PROSITE; PS00305; 11S_SEED_STORAGE; 1.

KW SEED STORAGE PROTEIN; SIGNAL.

FT SIGNAL 1 21

FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT.

FT CHAIN 22 296 GAMMA CHAIN (ACIDIC).

FT CHAIN 297 480 DELTA CHAIN (BASIC).

FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID.

FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL).

FT CONFLICT 27 27 S -> E (IN REF. 2).

FT CONFLICT 30 30 E -> S (IN REF. 2).

SQ SEQUENCE 480 AA; 54625 MW; D515DD6E CRC32;

MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR

RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA

IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV

...

Fall 2001


The structure of xml

The Structure of XML

  • XML consists of tags and text

  • Tags come in pairs<date> ...</date>

  • They must be properly nested

    <date> <day> ... </day> ... </date> --- good

    <date><day> ... </date>... </day> --- bad

    (You can’t do <i> ... <b> ... </i> ...</b> in HTML)

Fall 2001


Xml text

XML text

XML has only one “basic” type -- text.

It is bounded by tags e.g.

<title> The Big Sleep </title>

<year> 1935 </ year> --- 1935 is still text

XML text is called PCDATA (for parsed

character data). It uses a 16-bit encoding,

e.g. \&\#x0152 for the Hebrew letter Mem

Later we shall see how new types are specified by XML-data

Fall 2001


Xml structure

XML structure

Nesting tags can be used to express various structures. E.g. A tuple (record) :

<person>

<name>Malcolm Atchison</name>

<tel>(215) 898 4321</tel>

<email>[email protected]</email>

</person>

Fall 2001


Xml structure cont

XML structure (cont.)

  • We can represent a list by using the same

    tag repeatedly:

<addresses>

<person> ... </person>

<person> ... </person>

<person> ... </person>

...

</addresses>

Fall 2001


Terminology

Terminology

The segment of an XML document between an opening and a corresponding closing tag is called an element.

<person>

<name> Malcolm Atchison </name>

<tel> (215) 898 4321 </tel>

<tel> (215) 898 4321 </tel>

<email> [email protected] </email>

</person>

element

element, a sub-element

of

not an element

Fall 2001


Xml is tree like

person

name

tel

tel

email

XML is tree-like

Malcolm Atchison

(215) 898 4321

(215) 898 4321

[email protected]

Semistructured data models typically put the labels on the edges

Fall 2001


Mixed content

Mixed Content

An element may contain a mixture of sub-elements and PCDATA

<airline>

<name> British Airways </name>

<motto>

World’s <dubious> favorite</dubious> airline

</motto>

</airline>

Data of this form is not typically generated from databases. It is needed for consistency with HTML

Fall 2001


A complete xml document

A Complete XML Document

<?xmlversion="1.0"?>

<person>

<name> Malcolm Atchison </name>

<tel> (215) 898 4321 </tel>

<email> [email protected] </email>

</person>

Fall 2001


Two ways of representing a db

Two ways of representing a DB

projects:

title budget managedBy

employees:

name ssn age

Fall 2001


Project and employee relations in xml

Project and Employee relations in XML

<db>

<project>

<title> Pattern recognition </title>

<budget> 10000 </budget>

<managedBy> Joe </managedBy>

</project>

<employee>

<name> Joe </name>

<ssn> 344556 </ssn>

<age> 34 < /age>

</employee>

Projects and employees are intermixed

<employee>

<name> Sandra </name>

<ssn> 2234 </ssn>

<age> 35 </age>

</employee>

<project>

<title> Auto guided vehicle </title>

<budget> 70000 </budget>

<managedBy> Sandra </managedBy>

</project>

:

</db>

Fall 2001


Cis 550 fall 2001

Project and Employee relations in XML (cont’d)

Employees follow projects

<employees>

<employee>

<name> Joe </name>

<ssn> 344556 </ssn>

<age> 34 </age>

</employee>

<employee>

<name> Sandra </name>

<ssn> 2234 </ssn>

<age>35 </age>

</employee>

:

<employees>

</db>

<db>

<projects>

<project>

<title> Pattern recognition </title>

<budget> 10000 </budget>

<managedBy> Joe </managedBy>

</project>

<project>

<title> Auto guided vehicles </title>

<budget> 70000 </budget>

<managedBy> Sandra </managedBy>

</project>

:

</projects>

Fall 2001


Cis 550 fall 2001

Project and Employee relations in XML (cont’d)

Or without “separator” tags …

<db>

<projects>

<title> Pattern recognition </title>

<budget> 10000 </budget>

<managedBy> Joe </managedBy>

<title> Auto guided vehicles </title>

<budget> 70000 </budget>

<managedBy> Sandra </managedBy>

:

</projects>

<employees>

<name> Joe </name>

<ssn> 344556 </ssn>

<age> 34 </age>

<name> Sandra </name>

<ssn> 2234 </ssn>

<age> 35 </age>

:

</employees>

</db>

Fall 2001


Attributes

Attributes

An (opening) tag may contain attributes. These are typically used to describe the content of an element

<entry>

<wordlanguage= “en”> cheese </word>

<wordlanguage= “fr”> fromage </word>

<wordlanguage= “ro”> branza </word>

<meaning> A food made … </meaning>

</entry>

Fall 2001


Attributes cont d

Attributes (cont’d)

Another common use for attributes is to express dimension or type

<picture>

<height dim= “cm”> 2400 </height>

<width dim= “in”> 96 </width>

<data encoding = “gif”compression = “zip”>

[email protected]!G96YE<FEC ...

</data>

</picture>

A document that obeys the “nested tags” rule and does not repeat an attribute within a tag is said to be well-formed .

Fall 2001


When to use attributes

When to use attributes

It’s not always clear when to use attributes

<person ssno= “123 45 6789”>

<name> F. MacNiel </name>

<email>

[email protected]

</email>

...

</person>

<person>

<ssno>123 45 6789</ssno>

<name> F. MacNiel </name>

<email>

[email protected]

</email>

...

</person>

Fall 2001


Using ids

Using IDs

<family>

<person id="jane" mother="mary" father="john">

<name> Jane Doe </name>

</person>

<person id="john" children="jane jack">

<name> John Doe </name> <mother/>

</person>

<person id="mary" children="jane jack">

<name> Mary Doe </name>

</person>

<person id="jack" mother=”mary" father="john">

<name> Jack Doe </name>

</person>

</family>

Fall 2001


Odl schema

ODL schema

classMovie

( extentMovies, key title )

{

attribute string title;

attribute string director;

relationshipset<Actor> casts

inverse Actor::acted_In;

attribute int budget;

} ;

class Actor

( extent Actors, key name )

{

attribute string name;

relationship set<Movie> acted_In

inverse Movie::casts;

attribute int age;

attribute set<string> directed;

} ;

Fall 2001


An example

An example

<db>

<movie id=“m1”>

<title>Waking Ned Divine</title>

<director>Kirk Jones III</director>

<cast idrefs=“a1 a3”></cast>

<budget>100,000</budget>

</movie>

<movie id=“m2”>

<title>Dragonheart</title>

<director>Rob Cohen</director>

<cast idrefs=“a2 a9 a21”></cast>

<budget>110,000</budget>

</movie>

<movie id=“m3”>

<title>Moondance</title>

<director>Dagmar Hirtz</director>

<cast idrefs=“a1 a8”></cast>

<budget>90,000</budget>

</movie>

:

<actor id=“a1”>

<name>David Kelly</name>

<acted_In idrefs=“m1 m3 m78” >

</acted_In>

</actor>

<actor id=“a2”>

<name>Sean Connery</name>

<acted_In idrefs=“m2 m9 m11”>

</acted_In>

<age>68</age>

</actor>

<actor id=“a3”>

<name>Ian Bannen</name>

<acted_In idrefs=“m1 m35”>

</acted_In>

</actor>

:

</db>

Fall 2001


Part ii the document object model dom

Part II The Document Object Model (DOM)

Programming with XML

Fall 2001


Xml parsers

XML Parsers

  • traditional: build data structure (DOM)

  • event based: SAX (Simple API for XML)

    • http://www.megginson.com/SAX

    • write handler for start tag and for end tag

Fall 2001


Programming in xml why we need database technology

Programming in XML -- why we need database technology

  • Let’s examine some elements of the Document Object Model. It provides an interface to parsed data.

Fall 2001


Dom overview

DOM -- overview

  • Interface to parsed XML

  • “… language-neutral...” interface (IDL)

  • Level 1. Functionality for XML document navigation and manipulation.

    • Level 2. Stylesheets and namespaces Level 3. (future) Document loading and saving DTDs and schemas

http://www.w3.org/DOM/

Fall 2001


Dom representation a tree of nodes

Document

getTagName()

= “part”

Element

Element

Element

Attr

Attr

getTagName()

= “name”

getTagName()

= “weight”

CharacterData

CharacterData

getName()= “part”

getValue=“at23”

getName()= “units”

getValue=“mks”

getData()= “widget”

getData()= “0.454”

DOM representation -- a tree of nodes

<partdb>

<part id = “a123” units= “mks”>

<name> widget </name>

<weight> 0.454 </weight>

</part>

</partdb>

Fall 2001


Node interface

Node interface

public interface Node {

...

public String getNodeName();

...

public NodeList getChildNodes();

...

public NamedNodeMap getAttributes();

...

}

Fall 2001


Cis 550 fall 2001

Sub-elements

--an “array”

public interface NodeList {

public Node item(int index);

public int getLength();

}

public interface NamedNodeMap {

public Node getNamedItem(String name);

...

}

Attributes

-- a “dictionary”

Fall 2001


A common form of data extraction

<doc1>

<employee>

<name> John Doe </name>

<contact-info>

<address> … </address>

<tel> 123 7456 </tel>

<email> [email protected]</email>

</contact-info>

<dept> Math </dept>

</employee>

<employee>

</employee>

...

</doc1>

A common form of data extraction

John Doe 123 7456

Jane Dee 234 5678

...

Find the names and telephones of all employees in Math

Fall 2001


Top level traversal in dom

Top-level traversal in DOM

public class Test

{

public static void main(String args[]) throws Exception

{

Parser parser = new Parser( args[0] );

Document doc = parser.readStream( new FileInputStream( args[0] ));

NodeList nodes = doc.getDocumentElement.getChildNodes();

for (int i=0; i<nodes.getLength(); i++)

{

Node n = (Element) nodes.item(i); //coercion

// select Math depts}

}

}

Not specified

by DOM

Fall 2001


Selecting the math departments

Selecting the math departments

NodeList ndl = n.getChildNodes();

for(int idl=0; idl<ndl.getLength(); idl++)

{

if ((n.tagName = "dept") &&

(((CharacterData) (n.getFirstChild)).getData="math")) // coercion

{

//inner code

return;

}

}

Fall 2001


The inner code

Preorder search of subtree.

Convenient, but is it what we want?

The inner code

NodeList nnl= n.getChildNodes();

for(int ii=0; ii < nnl.getLength(); ii++)

{

Node nn = (Element) nodes.item(i); //coercion

if (nn.tagName = "name" )

Nodelist ncl = n.getElementsByTagName("tel");

for(int iii = 0; iii < cl.getLength(); iii++)

{

nc = ncl.item(iii);

System.out.print((CharacterData) nn.firstChild).data //coercion

System.out.println((CharacterData) nc.firstChild).data //coercion

}

}

Fall 2001


Comments on our code

Comments on our code

  • Already quite cumbersome. Compare with an “equivalent” semistructured database query:

  • Code may fail wherever there is a coercion, or give “empty” results.

  • Code is already inefficient (double iterations over the same set of nodes)

  • Need for types !!!

select {name: $N, tel: $T}

where {name: $N, dept: “Math”, contact-info.tel: $T} <- DB

Fall 2001


Constructing data

<doc1>

<employee>

<name> John Doe </name>

<contact-info>

<address> … </address>

<tel> 123 7456 </tel>

<email> [email protected]</email>

</contact-info>

<dept> Math </dept>

</employee>

<employee>

</employee>

...

</doc1>

Constructing data

<doc2>

<employee>

<name> John Doe </name>

<tel> 123 7456 </tel>

</employee>

<employee>

<name> Jane Dee </name>

<tel> 234 5678 </tel>

</employee>

...

</doc2>

Fall 2001


Constructing data using the dom

Constructing Data using the DOM

Document d = new DocumentImplementation …

Element root = d.createElement(“doc2”)

// set root of document

//top level loop

{

Element emp = d.createElement(“employee”)

root.appendChild(emp)

//innermost loop

{...

Element name = d.createElement(“name”)

// set s to appropriate character string

name.appendChild(createCDATASection(s))

emp.appendChild(name)

...

}

}

All node constructors are methods of the document implementation

Could also use an existing node or a “cloned” node

Fall 2001


A join

<doc1>

<employee>

<name> John Doe </name>

<contact-info>...</contact-info>

<dept> Math </dept>

</employee>

...

</doc1>

A join

<doc3>

<row>

<name> John Doe </name>

<building> A123 </building> </row>

<row>

<name> Jane Dee </name>

<building> B456 </building> </row>

...

</doc3>

<doc2>

<department>

<dname> Math </dname>

<building> A123 </building>

</department>

...

</doc2>

The names of employees and their department buildings

Fall 2001


Implementing a join in the dom

Implementing a Join in the DOM??

nl1 = r1.getChildNodes() //r1 is root of doc1

for (int i1 = 1; i1 < n1.getLength; i1++)

{

nl2 = r2.getChildNodes() //r1 is root of doc2

for (int i2 = 1; i2 < n2.getLength; i2++)

}

  • Even if we can get both documents in core, this is not the most efficient method

  • If not?

  • This is a typical database query!

Fall 2001


Conclusions so far

Conclusions so far

  • We need types for our XML documents, and we’d like to be able to use them as a static check of our program correctness.

  • We need database technology for large “documents” and -- perhaps -- for simplicity of code.

Fall 2001


Before we leave apis sax a low level alternative to dom

Before we leave APIs… SAX, a low-level alternative to DOM

  • SAX = Simple API for XML

  • supported by most XML parsers

  • event-driven

  • Instead of reading the entire file in memory and building a tree, SAX reads a stream of tokens and triggers events, e.g.,

    • startDocument

    • startElement

    • endElement

    • endDocument

  • The programmer has to write a document handler that captures these events and do something with the tokens

  • Simpler than DOM: programmer must do more storage management

  • Less likely than DOM to fail on large documents!!

Fall 2001


Why databases can help

Why Databases Can Help

  • Efficiency. Querying and Storage

  • Simplicity (a consequence of efficiency?) Query languages have a simple “algebraic” structure. Query transformation = optimization.

  • (Maybe) suggestions for type systems/constraints.

select name, bldg

from emps, depts

where emps.dname = depts.dname

Efficiently executed in SQL through use of indexing and clever join algorithms)

Fall 2001


Other things that databases might do for xml

Other things that databases might do for XML

  • Concurrency (allow a granularity finer than that of a file)

    • Proposals for updates?

  • Integrity. Keys and referential integrity (XML proposals still weak)

  • Data mining

  • Compression

Fall 2001


The down side

The Down-Side

  • Query languages are typically rather weak. You can’t always express a computation in a QL.

  • We may have to tamper with the semantics of XML (sets vs. lists)

  • We may have to simplify XML or its attendant “type systems” (DTDs, XML-schema) -- No bad thing :)

Fall 2001


Part iii document type descriptors

Part III: Document Type Descriptors

Imposing structure on XML documents

Fall 2001


Document type descriptors

Document Type Descriptors

  • Document Type Descriptors (DTDs) impose structure on an XML document.

  • There is some relationship between a DTD and a schema, but it is not close -- hence the need for additional “typing” systems.

  • The DTD is a syntactic specification.

Fall 2001


Example the address book

Example: The Address Book

<person>

<name> MacNiel, John </name>

<greet> Dr. John MacNiel </greet>

<addr>1234 Huron Street </addr>

<addr> Rome, OH 98765 </addr>

<tel> (321) 786 2543 </tel>

<fax> (321) 786 2543 </fax>

<tel> (321) 786 2543 </tel>

<email> [email protected] </email>

</person>

Exactly one name

At most one greeting

As many address lines as needed (in order)

Mixed telephones and faxes

As many

as needed

Fall 2001


Specifying the structure

Specifying the structure

  • name to specify a nameelement

  • greet? to specify an optional (0 or 1) greet elements

  • name,greet?to specify a name followed by an optional greet

Fall 2001


Specifying the structure cont

Specifying the structure (cont)

  • addr*to specify 0 or more address lines

  • tel | faxa telor a fax element

  • (tel | fax)* 0 or more repeats of tel or fax

  • email*0 or more email elements

Fall 2001


Specifying the structure cont1

Specifying the structure (cont)

So the whole structure of a person entry is specified by

name, greet?, addr*, (tel | fax)*, email*

This is known as a regular expression. Why is it important?

Fall 2001


Regular expressions

Regular Expressions

Each regular expression determines a corresponding finite state automaton. Let’s start with a simpler example:

name, addr*, email

This suggests a simple parsing program

addr

name

email

Fall 2001


Another example

Another example

name,address*,(tel | fax)*,email*

address

email

tel

tel

name

email

fax

fax

email

Adding in the optional greet further

complicates things

Fall 2001


A dtd for the address book

A DTD for the address book

<!DOCTYPE addressbook [

<!ELEMENT addressbook (project*)>

<!ELEMENT project

(name, greet?, address*, (fax | tel)*, email*)>

<!ELEMENT name (#PCDATA)>

<!ELEMENT greet (#PCDATA)>

<!ELEMENT address(#PCDATA)>

<!ELEMENT tel (#PCDATA)>

<!ELEMENT fax (#PCDATA)>

<!ELEMENT email (#PCDATA)>

]>

Fall 2001


Our relational db revisited

Our relational DB revisited

projects:

title budget managedBy

employees:

name ssn age

Fall 2001


Two dtds for the relational db

Two DTDs for the relational DB

<!DOCTYPE db [

<!ELEMENT db (projects,employees)>

<!ELEMENT projects (project*)>

<!ELEMENT employees (employee*)>

<!ELEMENT project (title, budget, managedBy)>

<!ELEMENT employee (name, ssn, age)>

...

]>

<!DOCTYPE db [

<!ELEMENT db (project | employee)*>

<!ELEMENT project (title, budget, managedBy)>

<!ELEMENT employee (name, ssn, age)>

...

]>

Fall 2001


Recursive dtds

Recursive DTDs

<DOCTYPE genealogy [

<!ELEMENT genealogy (person*)>

<!ELEMENT person (

name,

dateOfBirth,

person, -- mother

person )> -- father

...

]>

What is the problem with this?

Fall 2001


Recursive dtds cont d

Recursive DTDs cont’d.

<DOCTYPE genealogy [

<!ELEMENT genealogy (person*)>

<!ELEMENT person (

name,

dateOfBirth,

person?, -- mother

person? )> -- father

...

]>

What is now the problem with this?

Fall 2001


Some things are hard to specify

Some things are hard to specify

Each employee element is to contain name, age and ssn elements in some order.

<!ELEMENT employee

( (name, age, ssn) | (age, ssn, name) |

(ssn, name, age) | ...

)>

Suppose there were many more fields !

Fall 2001


Summary of xml regular expressions

Summary of XML regular expressions

  • AThe tag A occurs

  • e1,e2The expression e1 followed by e2

  • e*0 or more occurrences of e

  • e?Optional -- 0 or 1 occurrences

  • e+1 or more occurrences

  • e1 | e2either e1 or e2

  • (e)grouping

Fall 2001


Take home message 1

Take Home Message 1

Do not write something to confuse yourself

<!ELEMENT PARTNER (NAME?, ONETIME?, PARTNRID?, PARTNRTYPE?, SYNCIND?, ACTIVE?, CURRENCY?, DESCRIPTN?, DUNSNUMBER?, GLENTITYS?, NAME*, PARENTID?, PARTNRIDX?, PARTNRRATG?, PARTNRROLE?, PAYMETHOD?, TAXEXEMPT?, TAXID?, TERMID?, USERAREA?, ADDRESS*, CONTACT*)>

Cited from oagis_segments.dtd (one of the files in Novell Developer Kit http://developer.novell.com/ndk/indexexe.htm)

<PARTNER> <NAME> Ben Franklin </NAME> </PARTNER>

Q. Which NAME is it?

Fall 2001


Specifying attributes in the dtd

Specifying attributes in the DTD

<!ELEMENT height (#PCDATA)>

<!ATTLIST height

dimension CDATA #REQUIRED

accuracy CDATA #IMPLIED >

The dimension attribute is required; the accuracy attribute is optional.

CDATA is the “type” of the attribute -- it means string.

Fall 2001


Specifying id and idref attributes

Specifying ID and IDREF attributes

<!DOCTYPE family [

<!ELEMENT family (person)*>

<!ELEMENT person (name)>

<!ELEMENT name (#PCDATA)>

<!ATTLIST person

id ID #REQUIRED

mother IDREF #IMPLIED

father IDREF #IMPLIED

children IDREFS #IMPLIED>

]>

Fall 2001


Some conforming data

Some conforming data

<family>

<person id="jane" mother="mary" father="john">

<name> Jane Doe </name>

</person>

<person id="john" children="jane jack">

<name> John Doe </name>

</person>

<person id="mary" children="jane jack">

<name> Mary Doe </name>

</person>

<person id="jack" mother=”mary" father="john">

<name> Jack Doe </name>

</person>

</family>

Fall 2001


Consistency of id and idref attribute values

Consistency of ID and IDREF attribute values

  • If an attribute is declared as ID

    • the associated values must all be distinct (no confusion)

  • If an attribute is declared as IDREF

    • the associated value must exist as the value of some ID attribute (no dangling “pointers”)

  • Similarly for all the values of an IDREFS attribute

  • ID and IDREF attributes are not typed

Fall 2001


An alternative specification

An alternative specification

<!DOCTYPE family [

<!ELEMENT family (person)*>

<!ELEMENT person (mother?, father?, children, name)>

<!ATTLIST person id ID #REQUIRED>

<!ELEMENT name (#PCDATA)>

<!ELEMENT mother EMPTY>

<!ATTLIST mother idref IDREF #REQUIRED>

<!ELEMENT father EMPTY>

<!ATTLIST father idref IDREF #REQUIRED>

<!ELEMENT children EMPTY>

<!ATTLIST children idrefs IDREFS #REQUIRED>

]>

Fall 2001


The revised data

The revised data

<family>

<person id = "jane”>

<name> Jane Doe </name>

<mother idref = "mary”></mother>

<father idref = "john"></father>

</person>

<person id = "john”>

<name> John Doe </name>

<children idrefs = "jane jack"></children>

</person>

...

</family>

Fall 2001


A useful abbreviation

A useful abbreviation

When an element has empty content we can use

<tag blahblahbla/> for <tag blahblahbla></tag>

For example:

<family>

<person id = "jane”>

<name> Jane Doe </name>

<mother idref = "mary”/>

<father idref = "john”/>

</person>

...

</family>

Fall 2001


Odl schema1

ODL schema

classMovie

( extentMovies, keytitle )

{

attributestring title;

attributestring director;

relationship set<Actor> cast

inverseActor::acted_In;

attributeint budget;

} ;

classActor

( extentActors, keyname )

{

attributestring name;

relationshipset<Movie> acted_In

inverseMovie::cast;

attributeint age;

attributeset<string> directed;

} ;

Fall 2001


Schema dtd

Schema.dtd

<!DOCTYPE db [

<!ELEMENT db (movie+, actor+)>

<!ELEMENT movie (title,director,cast,budget)>

<!ATTLIST movie id ID #REQUIRED>

<!ELEMENT title (#PCDATA)>

<!ELEMENT director (#PCDATA)>

<!ELEMENT casts EMPTY>

<!ATTLIST casts idrefs IDREFS #REQUIRED>

<!ELEMENT budget (#PCDATA)>

Fall 2001


Schema dtd cont d

Schema.dtd (cont’d)

<!ELEMENT actor (name, acted_In,age?, directed*)>

<!ATTLIST actor id ID #REQUIRED>

<!ELEMENT name (#PCDATA)>

<!ELEMENT acted_In EMPTY>

<!ATTLIST acted_In idrefs IDREFS #REQUIRED>

<!ELEMENT age (#PCDATA)>

<!ELEMENT directed (#PCDATA)>

]>

Fall 2001


More on odl and dtd

More on ODL and DTD

  • The Object Data Management Group (ODMG) suggested OIFML, a XML document type of Object Interchange Format.

  • http://www.odmg.org/library/readingroom/oifml.pdf

Fall 2001


Constraints on id s and idref s

Constraints on IDs and IDREFs

  • ID stands for identifier. No two ID attributes with the same name may have the same value (of type CDATA)

  • IDREF stands for identifier reference. Every value associated with an IDREF attribute must exist as an ID attribute value

  • IDREFS specifies several (0 or more) identifiers

Fall 2001


Connecting the document with its dtd

Connecting the document with its DTD

In line:

<?xmlversion="1.0"?>

<!DOCTYPE db [<!ELEMENT ...> … ]>

<db> ... </db>

Another file:

<!DOCTYPE db SYSTEM "schema.dtd">

A URL:

<!DOCTYPE db SYSTEM

"http://www.schemaauthority.com/schema.dtd">

Fall 2001


Well formed and valid documents

Well-formed and Valid Documents

  • Well-formed applies to any document (with or without a DTD): proper nesting of tags and unique attributes

  • Valid specifies that the document conforms to the DTD: conforms to regular expression grammar, types of attributes correct, and constraints on references satisfied

Fall 2001


Dtds v s schemas or types

DTDs v.s Schemas (or Types)

  • By database (or programming language) standards DTDs are rather weak specifications.

    • Only one base type -- PCDATA

    • No useful “abstractions” e.g., sets

    • IDREFs are untyped. You point to something, but you don’t know what!

    • No constraints e.g., child is inverse of parent

    • No methods

    • Tag definitions are global

  • Some of the XML extensions impose something like a schema or type on an XML document. We’ll see these later

Fall 2001


Lots of possibilities for schemas

Lots of possibilities for schemas

  • XML Schema (under W3C’s spotlight)

  • XDR (Microsoft’s BizTalk)

  • SOX (Schema for Object-Oriented XML)

  • Schematron

  • DSD (AT&T Labs and BRICS)

  • and more.

Fall 2001


Some tools

Some tools

  • XML Authority http://www.extensibility.com/tibco/solutions/xml_authority/index.htm

  • XML Spy http://www.xmlspy.com/download.html

Fall 2001


Summary

Summary

  • XML is a new data format. Its main virtues are widespread acceptance and the (important) ability to handle semistructured data (data without schema)

  • DTDs provide some useful syntactic constraints on documents. As schemas they are weak

  • How to store large XML documents?

  • How to query them?

  • How to map between XML and other representations?

Fall 2001


Part iv query languages

Part IV: Query Languages

Why a query language? Extracting, Restructuring, Integration, Browsing…

XML-QL

http://www.w3.org/TR/NOTE-xml-ql

http://db.cis.upenn.edu/XML-QL/

XPATH (part of a query language)

http:www.w3.org/TR/xpath

XSLT

http://www.w3.org/TR/xslt

http://www.mulberrytech.com/quickref/XSLTquickref.pdf

Fall 2001


Xquery likely to gain acceptance

XQuery -- likely to gain acceptance...

  • … and, like other things in W3C, not necessarily the best. XQuery is a group of projects. General blather and informal specifications:

  • http://www.w3.org/XML/Query

  • Ingredients:

  • A concrete syntax: http://www.w3.org/TR/xquery

  • Based on XPath: http://www.w3.org/TR/xpath.html

  • A formal semantics and “algebra”: http://www.w3.org/TR/query-semantics/

  • Some test cases: http://www.w3.org/TR/xmlquery-use-cases

Fall 2001


Query languages and dtds

Query Languages and DTDs

  • The DOM does not interact with DTDs (later “levels” may do this)

  • For query languages there is almost no interaction (only to find out the names of IDs and IDREFs)

  • XDUCE (developed at Penn) is the only well-developed language that uses DTDs as a type system. www.cis.upenn.edu/~hahosoya/xduce/

Fall 2001


Xml ql xml query language

XML-QL (XML Query Language)

  • W3C proposal, August 1998

  • authors:

    • Mary FernandezAT&T

    • Dana FlorescuINRIA

    • Alon LevyUniv. of Washington

    • Dan SuciuAT&T

    • Alin DeutschUniv. of Pennsylvania

Fall 2001


Address book revisited

Address Book Revisited

<addrBook>

<person SSN=“111-22-3333”>

<name> Caesar </name>

<greet> Caesar Imperator</greet>

<addr> The Capitol </addr>

<addr> Rome, OH 98765 </addr>

<tel> (321) 786 2543 </tel>

<fax> (321) 786 2543 <fax>

<tel> (321) 786 2543 </tel>

<email> [email protected] </email>

</person>

</addrBook>

Fall 2001


Xml ql pattern matching

XML-QL: Pattern Matching

Find Caesar’s e-mail address:

where<addrBook>

<person>

<name>Caesar</name>

<email>$e</email>

</person>

</addrBook> in “http://db.cis.upenn.edu/~peter/address.xml”

construct $e

<XML>[email protected]</XML>

Data Extraction

Fall 2001


Xml ql constructing new xml data

XML-QL: Constructing New XML Data

Whom can we contact electronically?

where<addrBook>

<person>

<greet>$g</greet>

<email>$e</email>

</person>

</addrBook> in “http://...”

construct <e-contact>

<who>$g</who>

<where>$e</where>

</e-contact>

<XML>

<e-contact>

<who>Caesar Imperator</who>

<where>[email protected]

</where>

</e-contact>

<e-contact>

<who>Brutus</who>

<where>[email protected]

</where>

</e-contact>

...

</XML>

Data Restructuring

Fall 2001


Xml ql joins

XML-QL: Joins

Who of our contacts was involved in a movie?

where<addrBook>

<person>

<greet>$g</greet>

<email>$e</email>

</person>

</addrBook> in “http://…address.xml”

<movie><title>$t</>

<character>$g</>

</movie> in “http://www.imdb.com”

construct <cine-contact>

<who>$g</who>

<movie>$t</movie>

<where>$e</where>

</cine-contact>

Fall 2001


Xml ql joins cont d

XML-QL: Joins (cont’d)

<XML>

<cine-contact>

<who>Caesar Imperator</who>

<where>[email protected]</where>

<movie>Asterix and Cleopatra</movie>

</cine-contact>

<cine-contact>

<who>Dr. Strangelove</who>

<where>[email protected]</where>

<movie>Dr. Strangelove or How I Stopped ...</movie>

</cine-contact>

...

</XML>

Data Integration

Fall 2001


Xml ql data model

XML-QL Data Model

  • Directed, labeled graph

  • Tags represented as edge labels

  • Sets of attribute name-value pairs as node labels

  • Two models: ordered and unordered

Fall 2001


Xml ql data model cont d

addrBook

person

SSN=“111-…”

name

greet

tel

fax

tel

email

Caesar

Caesar Imperator

addr

addr

(321) 786 2543

The Capitol

Rome, OH

XML-QL Data Model (cont’d)

<person SSN=“111-22-3333”>

<name> Caesar </name>

<greet> Caesar Imperator </greet>

<addr> TheCapitol </addr>

<addr> Rome, OH 98765 </addr>

<tel> (321) 786 2543 </tel>

<fax> (321) 786 2543 <fax>

<tel> (321) 786 2543 </tel>

<email> [email protected] </email>

</person>

Fall 2001


Xml ql semantics variable bindings

addrBook

person

person

SSN=“111-…”

SSN=“111-…”

name

greet

tel

fax

tel

email

name

greet

tel

fax

tel

email

Stragelove

strangelov@

[email protected]

Caesar

Dr. Strangelove

Caesar Imperator

addr

addr

addr

addr

(321) 786 2543

The Capitol

Washington, DC

The Capitol

Rome, OH

where <addrBook>

<person>

<name>$n</>

<email>$e</>

</>

</>

$n $e

Caesar [email protected]

Strangelove [email protected]

XML-QL Semantics: Variable Bindings

Fall 2001


Xml ql semantics xml output

$n $e

Caesar [email protected]

Strangelove [email protected]

XML-QL Semantics: XML Output

construct <e-contact>

<who>$n</who>

<where>$e</where>

</e-contact>

XML

e-contact

e-contact

who

where

who

where

Caesar

[email protected]

Strangelove

[email protected]

Fall 2001


Advanced xml ql

Advanced XML-QL

Find tags of person subelements:

where<addrBook.person.$tag></>

in “http://db.cis.upenn.edu/~peter/address.xml”

construct <childOfPerson>$tag</>

Find all email addresses and fax numbers :

where<addrBook._*. (email | fax)>$eORf</>

in “http://db.cis.upenn.edu/~peter/address.xml”

construct <emailOrFax>$eORf</>

Schema browsing

Fall 2001


More advanced xml ql

More Advanced XML-QL

Find attributes of person elements:

where<_*.person $attrName=$attrVal></>

in “http://db.cis.upenn.edu/~peter/address.xml”

construct <personAttribute>

<name>$attrName</>

<value>$attrVal</>

</>

Schema browsing

Fall 2001


The xml tree model

person

tel

fax

tel

email

name

greet

@SSN

addr

addr

The XML Tree Model

<person SSN=“111-22-3333”>

<name> Caesar </name>

<greet> Caesar Imperator</greet>

<addr>1234 Huron Street </addr>

<addr> Rome, OH 98765 </addr>

<tel> (321) 786 2543 </tel>

<fax> (321) 786 2543 <fax>

<tel> (321) 786 2543 </tel>

<email> [email protected] </email>

</person>

Fall 2001


Xml ql semantics xml output1

who

where

Strangelove

[email protected]

XML-QL Semantics: XML Output

construct <e-contact>

<who>$n</who>

<where>$e</where>

</e-contact>

$n $e

Caesar [email protected]

Strangelove [email protected]

Strangelove [email protected]

e-contact

e(Strangelove,@hate)

XML

who

where

e-contact

e-contact

Strangelove

[email protected]

e(Caesar,[email protected])

e(Strangelove,@love)

who

where

w(Caesar,[email protected])

Caesar

[email protected]

Fall 2001


Xml ql skolem functions

XML-QL Skolem Functions

construct <e-contact ID=e($n) >

<who ID=w($n) >$n</who>

<where>$e</where>

</e-contact>

XML

e-contact

e-contact

where

e(Caesar)

e(Strangelove)

who

where

who

where

[email protected]

w(Caesar)

w(Strangelove)

Caesar

[email protected]

Strangelove

[email protected]

Fall 2001


Xsl extensible stylesheet language

XSL (Extensible Stylesheet Language)

  • W3C working draft by Adobe Systems

  • Original purpose was to specify rendering of XML documents (mainly by Web browsers: HTML)

  • Consists of two parts:

    • an XML transformation language

    • a formatting vocabulary denoting typographic abstractions (paragraph, page, rule, footer, etc.)

Fall 2001


The xsl processor

XML

XML

XML

HTML

The XSL Processor

original document

stylesheet

transformer

restructured document,

elements are formatting objects

plug in your favorite

formatter

presentation (other formats possible)

Fall 2001


Formatting example

Formatting Example

original document:

<doc>

<title>An example</title>

<p>This is a test.</p>

<p>This is <emph>another</emph>test.</p>

</doc>

transformed document:

<fo:queue>

<fo:block font-weight=“bold”>An example</fo:block>

<fo:block>This is a test.</fo:block>

<fo:block>This is <fo:inline-sequence font-style=“italic”>another

</fo:inline-sequence>

test.

</fo:block>

</fo:queue>

Fall 2001


Template rules

Template Rules

<xsl:template match=“title”>

<fo:block font-weight=“bold”>

<xsl:apply-templates/>

</fo:block>

</xsl:template>

<xsl:template match=“p”>

<fo:block><xsl:apply-templates/>

</fo:block>

</xsl:template>

<xsl:template match=“emph”>

<fo:inline-sequence font-style=“italic”><xsl:apply-templates/>

</fo:inline-sequence>

</xsl:template>

Fall 2001


Template rules cont d

Template Rules (cont’d)

<xsl:template match=“addrBook”>

<xsl:apply-templates/>

</xsl:template>

<xsl:template match=“person”>

<contact>

<name><xsl:value-of select=“name”/></name>

<e-mail><xsl:value-of select=“email”></e-mail>

</contact>

</xsl:template>

where <addrBook.person><name>$n</>

<email>$e</></>

construct <contact><name>$n</>

<e-mail>$e</></>

Fall 2001


Xsl vs xml ql

XSL vs. XML-QL

XML-QL XSL

XML output

no schema required

data extraction

data restructuring

data integration

schema browsing

relational complete

Fall 2001


  • Login