A syntax for Data

A syntax for Data. by Jose Carlos Cabrera Zuniga. Preface.

Presentation Transcript


  1. A syntax for Data by Jose Carlos Cabrera Zuniga

  2. Preface This presentation introduces the relation between semistructured data and XML. To that end, the concept of semistructured data is presented first, followed by the use of XML to represent this kind of data.

  3. Semistructured Data Semistructured data is often described as schemaless or self-describing, terms that indicate that there is no separate description of the type or structure of the data.

  4. {name: “Alan”, tel: 2157786, email: “agg@abc.com” } [Figure callouts: the names before each colon (name, tel, email) are labels; the values after them are data.]

  5. { name: {first: “Alan”, last: “Black”}, tel: 2157786, email: “agg@abc.com” }

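  6. { name: {first: “Alan”, last: “Black”}, tel: 2157786, email: “agg@abc.com” } [Figure: the same expression drawn as a tree. The root has edges name, tel and email; name leads to a node with edges first (“Alan”) and last (“Black”); tel leads to 2157786; email leads to “agg@abc.com”.]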

  7. { person: {name: “Alan”, tel: 2157786, email: “agg@abc.com” }, person: {name: “Sara”, tel: 2136877, email: “sara@math.edu” }, person: {name: “Fred”, tel: 2157786, email: “fred@abc.com” } }

  8. One of the main strengths of semistructured data is its ability to accommodate variations in structure: { person: {name: “Alan”, tel: 2157786, email: “agg@abc.com” }, person: { name: {first: “Sara”, last: “Green”}, tel: 2136877, email: “sara@math.edu” }, person: {name: “Fred”, tel: 2157786, Height: 183 } }

  9. In semistructured data, we make the conscious choice of forgetting any type the data might have had, and we serialize it by annotating each data item explicitly with its description (such as name, tel, etc.). Such data is called self-describing.
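
As a rough illustration of self-describing data, the sketch below holds the three person records from the earlier slide in Python. The representation (a complex value as a list of (label, value) pairs, so that a label such as person can repeat) and the helper names labels_at and serialize are assumptions made for this example, not part of the presentation.

```python
# Assumed representation: a complex value is a list of (label, value) pairs,
# so the same label (for example "person") may occur more than once.
people = [
    ("person", [("name", "Alan"), ("tel", 2157786), ("email", "agg@abc.com")]),
    ("person", [("name", [("first", "Sara"), ("last", "Green")]),
                ("tel", 2136877), ("email", "sara@math.edu")]),
    ("person", [("name", "Fred"), ("tel", 2157786), ("Height", 183)]),
]


def labels_at(value):
    """Return the labels found directly under a complex value."""
    return [label for label, _ in value]


def serialize(value):
    """Write a value back in the brace syntax used on the slides."""
    if isinstance(value, list):
        inner = ", ".join(f"{label}: {serialize(v)}" for label, v in value)
        return "{" + inner + "}"
    if isinstance(value, str):
        return f'"{value}"'
    return str(value)


# Each record carries its own description, so records may differ in structure.
structures = [labels_at(person) for _, person in people]
print(structures)
print(serialize(people[0][1]))
```

No schema is consulted anywhere: the labels travel with the data, which is why the third record can carry Height instead of email.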

  10. Base Types: • Numbers start with a digit. • Strings start with a quotation mark “ • There are many other types, with defined textual encodings, such as date, time, wav, that we would like to include. For each one it would be necessary to develop a notation (in many cases it is not necessary to re-invent a notation).

  11. REPRESENTING RELATIONAL DATABASES A relational database is normally described by a schema such as r1(a,b,c) r2(c,d), where r1 and r2 are the names of the relations, and a, b, c and c, d are the column names of the two relations.

  12. { r1: { row: { a: a1, b: b1, c: c1}, row: { a: a2, b: b2, c: c2} }, r2: { row: { c: c2, d: d2}, row: { c: c3, d: d3}, row: { c: c4, d: d4} } }

      r1(a,b,c)
      a   b   c
      a1  b1  c1
      a2  b2  c2

      r2(c,d)
      c   d
      c2  d2
      c3  d3
      c4  d4

  13. One representation of a relational database: { r1: { row: { a: a1, b: b1, c: c1}, row: { a: a2, b: b2, c: c2} }, r2: { row: { c: c2, d: d2}, row: { c: c3, d: d3}, row: { c: c4, d: d4} } } [Figure: tree in which the root has edges r1 and r2; each of these leads to a node with one row edge per tuple, and each row node has one edge per column leading to the column value.]

  14. Another representation of a relational database: { r1: { row: { a: a1, b: b1, c: c1}, row: { a: a2, b: b2, c: c2} }, r2: { row: { c: c2, d: d2}, row: { c: c3, d: d3}, row: { c: c4, d: d4} } } [Figure: tree in which the root has five row edges; each row node has a single edge labeled r1 or r2, leading to a node with one edge per column and its value.]
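
The two encodings of the relations r1 and r2 can be sketched in Python as follows. This is only an illustration: complex values are assumed to be lists of (label, value) pairs, and the function names relation_to_ssd and by_row are invented for the example.

```python
def relation_to_ssd(columns, rows):
    """Encode one relation as a list of ("row", [(column, value), ...]) entries."""
    return [("row", list(zip(columns, r))) for r in rows]


# First representation: one edge per relation, then one row edge per tuple.
database = [
    ("r1", relation_to_ssd(["a", "b", "c"],
                           [("a1", "b1", "c1"), ("a2", "b2", "c2")])),
    ("r2", relation_to_ssd(["c", "d"],
                           [("c2", "d2"), ("c3", "d3"), ("c4", "d4")])),
]


def by_row(db):
    """Other representation: one top-level row edge per tuple, and below it
    a single edge carrying the name of the relation the tuple belongs to."""
    return [("row", [(name, fields)]) for name, rows in db for _, fields in rows]


print(database[0][1][0])
print(by_row(database)[2])
```

Both structures hold the same five tuples; they differ only in whether the relation name sits above or below the row edge.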

  15. Representing Object Databases Modern database applications handle objects, either through an object-relational or an object database. Such data can be represented as semistructured data, too.

  16. Example. Three persons: Mary, who has two children, John and Jane. { person: &o1 { name: “Mary”, age: 45, child: &o2, child: &o3 }, person: &o2 { name: “John”, age: 17, relatives: { mother: &o1, sister: &o3} }, person: &o3 { name: “Jane”, country: “Canada”, mother: &o1 } }

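  17. [Figure: graph for the example above. Three person edges lead to the nodes &o1, &o2 and &o3; name, age and country edges lead to the atomic values “Mary”, 45, “John”, 17, “Jane” and “Canada”; child, relatives, mother and sister edges connect the three person nodes to one another.]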

  18. The presence of a node label such as &o1 before a structure binds &o1 to the identity of that structure. The names &o1, &o2, &o3 are called object identities, or oids. At this point, the data is no longer a tree but a graph, in which each node has a unique oid. An oid can be used to access logically and physically a collection of data.

  20. In our simple syntax for semistructured data, we allow both nodes with explicit oids and nodes without oids: the system will assign a unique oid automatically when the data is parsed. Thus {a:&o1{b:&o2 5}} and {a:{b:5}} denote isomorphic graphs, as does {a:&o1 {b:5}}. What could happen with: {a: {b:3}, a: {b:3} } ?

  21. SPECIFICATION OF SYNTAX Let us call any semistructured data expression an ssd-expression.
      <ssd-expr> ::= <value> | oid <value> | oid
      <value> ::= atomicvalue | <complexvalue>
      <complexvalue> ::= {label: <ssd-expr>, … , label: <ssd-expr>}
      atomicvalue: any number or string of characters
      oid: an identifier such as &123
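
A small recursive-descent parser for this grammar might look like the sketch below. Several choices are assumptions made for the example: strings use straight double quotes, labels are plain identifiers, a parsed expression is a pair (oid, value) where value is None for a bare oid reference, and the names tokenize, parse and shape are invented here. Error handling is minimal, and input that ends before its closing brace raises an IndexError.

```python
import re

# One token per match: oid, number, quoted string, label or punctuation.
TOKEN = re.compile(
    r'\s*(?:(?P<oid>&\w+)|(?P<num>\d+(?:\.\d+)?)|"(?P<str>[^"]*)"'
    r'|(?P<label>[A-Za-z_][\w-]*)|(?P<punct>[{}:,]))'
)


def tokenize(text):
    text, pos, tokens = text.rstrip(), 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def parse(tokens, i=0):
    """Parse one <ssd-expr> starting at index i; return ((oid, value), next_i)."""
    oid = None
    if tokens[i][0] == "oid":
        oid, i = tokens[i][1], i + 1
        # A bare oid is followed by ',' or '}' or the end of the input.
        if i == len(tokens) or tokens[i] in (("punct", ","), ("punct", "}")):
            return (oid, None), i
    kind, text = tokens[i]
    if kind == "num":
        return (oid, float(text) if "." in text else int(text)), i + 1
    if kind == "str":
        return (oid, text), i + 1
    if (kind, text) != ("punct", "{"):
        raise ValueError(f"expected a value at token {i}")
    i, pairs = i + 1, []
    while tokens[i] != ("punct", "}"):
        label = tokens[i][1]
        if tokens[i + 1] != ("punct", ":"):
            raise ValueError(f"expected ':' after label {label!r}")
        child, i = parse(tokens, i + 2)
        pairs.append((label, child))
        if tokens[i] == ("punct", ","):
            i += 1
    return (oid, pairs), i + 1


def shape(expr):
    """Drop the oids, keeping only labels and atomic values."""
    _, value = expr
    if isinstance(value, list):
        return [(label, shape(child)) for label, child in value]
    return value


with_oids, _ = parse(tokenize("{a:&o1{b:&o2 5}}"))
without_oids, _ = parse(tokenize("{a:{b:5}}"))
print(shape(with_oids) == shape(without_oids))
```

Comparing shapes is only a stand-in for graph isomorphism: it works for tree-like expressions such as the two above, not for graphs with shared or cyclic references.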

  22. Definition. We say that an object identifier o is defined in an ssd-expression s if either s is of the form o v for some value v, or s is of the form {l1:e1, … , ln:en} and o is defined in one of e1, … , en. If o occurs in s in any other way, we say it is used in s. Definition (Consistency). For an ssd-expression s to be consistent, it must satisfy the following properties: • Any object identifier is defined at most once in s. • If an object identifier o is used in s, it must be defined in s. Note. This definition must be extended if it is necessary to consider external resources and external oids.
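
The two consistency conditions can be checked mechanically. The sketch below assumes a hypothetical Expr class for ssd-expressions (an optional oid plus a value, where a value of None marks a bare oid, that is, a use); the class and the function names collect and is_consistent are illustrative only.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Expr:
    """Assumed ssd-expression node: optional oid plus optional value.
    value is None for a bare oid; a list of (label, Expr) pairs for a
    complex value; a number or string for an atomic value."""
    oid: Optional[str] = None
    value: Any = None


def collect(expr, defined, used):
    """Record every oid definition (oid followed by a value) and every use."""
    if expr.value is None:
        used.add(expr.oid)
        return
    if expr.oid is not None:
        defined.append(expr.oid)
    if isinstance(expr.value, list):
        for _, child in expr.value:
            collect(child, defined, used)


def is_consistent(expr):
    defined, used = [], set()
    collect(expr, defined, used)
    defined_once = len(defined) == len(set(defined))   # at most one definition
    all_used_defined = used <= set(defined)            # every use is defined
    return defined_once and all_used_defined


family = Expr(value=[
    ("person", Expr("&o1", [("name", Expr(value="Mary")),
                            ("child", Expr("&o2")), ("child", Expr("&o3"))])),
    ("person", Expr("&o2", [("name", Expr(value="John")),
                            ("mother", Expr("&o1"))])),
    ("person", Expr("&o3", [("name", Expr(value="Jane")),
                            ("mother", Expr("&o1"))])),
])
dangling = Expr(value=[("a", Expr("&o9"))])                           # used, never defined
duplicate = Expr(value=[("a", Expr("&o1", 5)), ("b", Expr("&o1", 6))])  # defined twice

print(is_consistent(family), is_consistent(dangling), is_consistent(duplicate))
```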

  23. THE OBJECT EXCHANGE MODEL (OEM) An OEM object is a quadruple (label, oid, type, value), where label is a character string, oid is the object’s identifier, and type is either complex or some identifier denoting an atomic type (like integer, string, gif-image, etc.). When type is complex, the object is called a complex object, and value is a set (or list) of oids. Otherwise the object is an atomic object, and value is an atomic value of that type.
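
One way to picture the quadruple is as a small record type. The sketch below is an assumption-laden illustration: the class name OEMObject, the store dictionary, the helper children, and the oids &o4 and &o5 are all invented for the example.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class OEMObject:
    """The quadruple (label, oid, type, value) from the slide."""
    label: str
    oid: str
    type: str                              # "complex" or an atomic type name
    value: Union[int, str, Tuple[str, ...]]  # tuple of oids when complex

    @property
    def is_complex(self):
        return self.type == "complex"


# A tiny object store keyed by oid (the oids &o4 and &o5 are made up).
store = {obj.oid: obj for obj in [
    OEMObject("person", "&o1", "complex", ("&o4", "&o5")),
    OEMObject("name", "&o4", "string", "Mary"),
    OEMObject("age", "&o5", "integer", 45),
]}


def children(store, oid):
    """Follow the oids held by a complex object; atomic objects have none."""
    obj = store[oid]
    return [store[child] for child in obj.value] if obj.is_complex else []


print([(c.label, c.value) for c in children(store, "&o1")])
```

Note that the label is stored on the object itself, which matches the remark that OEM attaches labels to nodes rather than to edges.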

  24. Thus OEM data is essentially a graph, like the semistructured data described in this section, but in which labels are attached to nodes rather than edges.

  25. Definition. A graph (N, E) consists of a set N of nodes and a set E of edges. Associated with each edge e in E there is an (ordered) pair of nodes, the source node s(e) and the target node t(e). [Figure: an edge e drawn from s(e) to t(e).]

  26. Definition. A path is a sequence e1, … , ek of edges such that t(ei) = s(ei+1) for 1 <= i <= k - 1. Such a path is called a path from the source s(e1) of e1 to the target t(ek) of ek. The number of edges in this path, k, is its length. [Figure: a chain of edges from s(e1) through t(e1), t(e2), … to t(ek).]

  27. Definition. A node r is a root for a graph (N, E) if there is a path from r to n for every n in N, n <> r. Definition. A cycle in a graph is a path between a node and itself. A graph with no cycles is called acyclic. Definition. A rooted graph is a tree if there is a unique path from r to n for every n in N, n <> r. Definition. A node is a terminal node or a leaf if it is not the source of any edge in E.
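
The definitions of root, leaf and tree translate into short checks over an edge list. In this sketch a graph is assumed to be a list of node names plus a list of (source, target) pairs; the function names are invented for the example. The tree test does not enumerate paths: it relies on the equivalent condition that the root has no incoming edge and every other node is reachable and has exactly one incoming edge.

```python
from collections import defaultdict


def reachable(edges, start):
    """Nodes reachable from start by following edges (start included)."""
    successors = defaultdict(list)
    for source, target in edges:
        successors[source].append(target)
    seen, stack = {start}, [start]
    while stack:
        for target in successors[stack.pop()]:
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def is_root(nodes, edges, r):
    """r is a root if every node can be reached from r."""
    return set(nodes) <= reachable(edges, r)


def is_leaf(edges, n):
    """A leaf (terminal node) is not the source of any edge."""
    return all(source != n for source, _ in edges)


def is_tree(nodes, edges, r):
    """Unique path from r to every other node, tested via in-degrees."""
    indegree = defaultdict(int)
    for _, target in edges:
        indegree[target] += 1
    return (is_root(nodes, edges, r)
            and indegree[r] == 0
            and all(indegree[n] == 1 for n in nodes if n != r))


nodes = ["r", "p", "n", "t"]
edges = [("r", "p"), ("p", "n"), ("p", "t")]
cyclic = edges + [("t", "r")]   # adds a cycle r -> p -> t -> r

print(is_tree(nodes, edges, "r"), is_tree(nodes, cyclic, "r"))
```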

  28. The model of semistructured data followed here is that of an edge-labeled graph.

  29. XML and Semistructured Data { person : { name: {first: “Alan”, last: “Black”}, tel: 2157786, email: “agg@abc.com” } } <person> <name> Alan </name> <tel> 2157786 </tel> <email> agg@abc.com </email> </person>

  30. For trees, let T be a translation function such that T(AtomicValue) = AtomicValue and T({ l1: v1, … , ln: vn }) = <l1> T(v1) </l1> … <ln> T(vn) </ln>. [Figure: the person record (Alan, 2157786, agg@abc.com) drawn as two trees.]
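
The translation function T for trees is short enough to write out. The sketch assumes that a complex value is a list of (label, value) pairs; escaping of special characters with xml.sax.saxutils.escape is an addition for safety that the slide does not mention.

```python
from xml.sax.saxutils import escape


def T(value):
    """Translate a tree-shaped ssd value into XML text.
    T(atomic) = atomic; T({l1: v1, ..., ln: vn}) = <l1>T(v1)</l1>...<ln>T(vn)</ln>."""
    if isinstance(value, list):
        return "".join(f"<{label}>{T(v)}</{label}>" for label, v in value)
    return escape(str(value))


person = [("person", [("name", "Alan"),
                      ("tel", 2157786),
                      ("email", "agg@abc.com")])]
print(T(person))
```

This covers trees only; expressions with oids need the id and idref attributes shown for graphs.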

  31. For graphs: <state id=“s2”> <scode> NE </scode> <sname> Nevada </sname> </state> <city id=“c2”> <ccode> CCN </ccode> <cname> Carson City </cname> <state-of idref=“s2”/> </city> Observe that <state-of> is an empty element; its only purpose is for reference.
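
Following such a reference takes a lookup table from id values to elements. The sketch below uses Python's standard xml.etree.ElementTree module; the wrapping geography element is an assumption added so that the two elements form a single well-formed document.

```python
import xml.etree.ElementTree as ET

# The <geography> wrapper is hypothetical; the slide shows only its contents.
document = """
<geography>
  <state id="s2"><scode>NE</scode><sname>Nevada</sname></state>
  <city id="c2"><ccode>CCN</ccode><cname>Carson City</cname>
    <state-of idref="s2"/>
  </city>
</geography>
"""

root = ET.fromstring(document)

# Index every element that carries an id attribute.
by_id = {elem.get("id"): elem for elem in root.iter() if elem.get("id") is not None}

# Resolve the reference held by the empty <state-of> element.
city = by_id["c2"]
state = by_id[city.find("state-of").get("idref")]
state_name = state.findtext("sname")
print(city.findtext("cname"), "is in", state_name)
```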

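  32. The ssd-expressions for the next graph are: a: { b: some string } and a: { c: some string } [Figure: two edges labeled a; an edge b from the first and an edge c from the second both lead to the same node, some string.]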

  33. If the attribute c is a reference attribute: <a> <b id=“&o123”> some string </b> </a> <a c=“&o123”/> Assuming instead that b is the reference attribute: <a b=“&o123”/> <a> <c id=“&o123”> some string </c> </a> In each case, the a element that carries the reference attribute is an empty element.

  34. ORDER The semistructured data model described is based on unordered collections, while XML is ordered. For example, the following two pieces of semistructured data are equivalent: person:{firstname: “John”, lastname: “Smith”} person:{lastname: “Smith”, firstname: “John”}

  35. The following two XML documents, however, are not equivalent: <person> <firstname> John </firstname> <lastname> Smith </lastname> </person> <person> <lastname> Smith </lastname> <firstname> John </firstname> </person>

  36. To make things worse, attributes are NOT ORDERED in XML. For example, the following two elements are equivalent: <person firstname=“john” lastname=“Smith”/> <person lastname=“Smith” firstname=“john”/> Applications that use XML for data exchange are likely to ignore order…
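
The difference between ordered subelements and unordered attributes can be observed with Python's standard xml.etree.ElementTree module. This is only a sketch of the point made on the slides; the variable names are illustrative, and sorting the child tags is one possible way for an application to ignore order.

```python
import xml.etree.ElementTree as ET

first = ET.fromstring(
    "<person><firstname>John</firstname><lastname>Smith</lastname></person>")
second = ET.fromstring(
    "<person><lastname>Smith</lastname><firstname>John</firstname></person>")

# Subelements are a sequence, so the two documents differ as ordered data.
tags_first = [child.tag for child in first]
tags_second = [child.tag for child in second]
same_when_ordered = tags_first == tags_second
# An application that ignores order could compare them as unordered data.
same_when_unordered = sorted(tags_first) == sorted(tags_second)

# Attributes form a mapping, so their textual order does not matter.
x = ET.fromstring('<person firstname="john" lastname="Smith"/>')
y = ET.fromstring('<person lastname="Smith" firstname="john"/>')
same_attributes = x.attrib == y.attrib

print(same_when_ordered, same_when_unordered, same_attributes)
```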

  37. MIXING ELEMENTS AND TEXT XML allows us to mix PCDATA and subelements within an element: <Person> This is my best friend <Name> Alessandreia </Name> <Age> 25 </Age> I am not too sure of the following email <Email> alessia@cos.ufrj.br </Email> </Person> In order to translate XML back into the syntax of ssd-expressions, it is necessary to add some surrounding standard tag for the PCDATA.
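
A possible way to perform this back-translation is sketched below with Python's standard xml.etree.ElementTree module, which exposes the text before the first subelement as text and the text after each subelement as tail. The label text used to wrap PCDATA and the function name to_ssd are assumptions made for the example; the slide does not name the standard tag.

```python
import xml.etree.ElementTree as ET


def to_ssd(elem, text_label="text"):
    """Translate an element into a list of (label, value) pairs, wrapping
    loose PCDATA in an assumed standard label (default "text")."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()       # element with text only
    pairs = []
    if elem.text and elem.text.strip():        # PCDATA before the first child
        pairs.append((text_label, elem.text.strip()))
    for child in children:
        pairs.append((child.tag, to_ssd(child, text_label)))
        if child.tail and child.tail.strip():  # PCDATA after this child
            pairs.append((text_label, child.tail.strip()))
    return pairs


document = ("<Person> This is my best friend <Name> Alessandreia </Name> "
            "<Age> 25 </Age> I am not too sure of the following email "
            "<Email> alessia@cos.ufrj.br </Email> </Person>")
person = to_ssd(ET.fromstring(document))
for label, value in person:
    print(f"{label}: {value}")
```

The result is a list rather than a set, so the position of each piece of text relative to the subelements is preserved.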

  38. END
