SGML, HTML, XML: Do We Really Need All That? ISMT Multimedia Fall 2002 Dr Vojislav B Mišić
Lecture Overview • What is a markup language? • HTML markup: what’s good, what’s wrong • Extensions to HTML (dHTML and style sheets, XML and XSL, …) • XML • Basic elements • Well-formed vs. valid XML • Writing a DTD • Examples of XML
Markup languages • What is markup? • Text (actual contents of the document) • is interspersed with markings • Markup is related to the text • notes on the content • notes on text presentation • but virtually anything can be marked (remember Fermat’s last theorem?) • Markup language allows separation of concerns: content vs. presentation
Standards for markup • SGML (an ISO standard that grew out of IBM's GML) – a standardized way to write other markup languages (actually, a meta-language) • An SGML-based language is specified using a DTD (Document Type Definition) • SGML is not really a user-friendly language, hence its use was rather limited, even though software support for it does exist
Other markup languages • TeX (Knuth) is another widely used markup language • Performs extremely well for complex texts with • mathematical formulas and symbols • cross-references • different typefaces • foreign languages
A TeX example \begin{equation}\label{coh1} \Psi (S) = \displaystyle \frac{\displaystyle \sum_{x \in R (S)} \left( \# S_w (x) - 1 \right)} {\displaystyle \sum_{x \in R (S)} \left( \# S - 1 \right)} \end{equation}
HTML • HTML (HyperText Markup Language) is the language of the Internet • Allows platform-independent browsing • Text-only at first, media later • Hyperlinks, limited visual formatting • However, it is far from perfect, and is gradually being replaced (current version: 4.01)
HTML markup • First you write the text, then add appropriate markup tags • Tags can describe logical entities • Headings of different levels: H1, H2, … • Lists and list elements (UL, OL, LI) • But tags can also describe visual effects (display rendering) • Bold and italic text (B, I) • Font and typeface changes
If you make an error… • Anything not recognized as correct HTML is essentially ignored • The browser skips the markup it does not recognize and displays the enclosed text as plain text • In this manner, users are still able to see most of the content, albeit without proper formatting • Your opinion: is this good or bad?
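The error tolerance described above can be tried out with a small sketch. It uses Python's standard html.parser module as a stand-in for a browser, which is an assumption made for illustration only; real browsers apply their own recovery rules. The tag <BLINKY> and the helper names TextCollector and tolerant_parse are invented for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Records every start tag and every piece of text the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called for any start tag, whether or not it is a real HTML tag.
        self.tags.append(tag)

    def handle_data(self, data):
        # Keep only non-empty text fragments.
        if data.strip():
            self.text.append(data.strip())


def tolerant_parse(source):
    """Parse possibly broken HTML and return (tag names, text fragments)."""
    collector = TextCollector()
    collector.feed(source)
    collector.close()
    return collector.tags, collector.text


# <BLINKY> is not an HTML tag and <B> is never closed, yet parsing does not fail.
tags, text = tolerant_parse("<H1>Title</H1><BLINKY>Hello <B>world</BLINKY>")
print(tags)
print(text)
```

No exception is raised for the unknown or unclosed tags, and all of the text is still recovered, which is the behaviour the slide asks you to judge.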
HTML editing • HTML source is ASCII and essentially layout independent • Plain text editors can be used • You can put extra white space to your heart’s content, with no effect on what is displayed by the browser • Most browsers allow you to view and save the HTML source of the document displayed – the quickest way to learn HTML • HTML is interpreted – editing changes are displayed (almost) instantly
HTML on the Internet • HTML browsers can display graphics and other media objects • Although HTML by itself provides only the most primitive support for multimedia • Tags can specify target URLs (hyperlinks) • Error tolerance ensures that anyone with a browser (any browser) can access HTML documents • … all of which made HTML the language of choice for hypertext on the Internet
More HTML features • Visual formatting is allowed but not forced • you can specify a typeface, but the browser will substitute another one of its own choice if the one specified is not available • User can easily change the presentation • just resize window and select different fonts/sizes • Browser differences (IE vs. Navigator) – actually, not very important any more
HTML Interactivity • Interactivity at first limited to hyperlinks • Forms introduced later (Navigator 3) • Form support still limited; most often client- or server-side scripting is required • Proliferation of scripting languages • CGI scripts • JavaScript and JScript (more details later) • VBScript, ASP • Perl
Is HTML a Good Markup Language? • Logical and visual formatting capabilities together • Some people argue for cleaner separation of logical from visual formatting • Others want more author control • Many extensions (some proprietary) • Changes generally lean towards greater author control over document rendering – more direct formatting instructions included
Dynamic HTML • Commercial term – there is no such thing as a dHTML standard • Combination of HTML with new technologies • Stylesheets add greater author control • Scripting allows improved interactivity, including user input • Even simple animations are possible • As always, not quite compatible extensions by Microsoft and Netscape
HTML styles • In standard HTML, logical markup tags (such as <H1>) have predefined properties for • Typeface • Font size • Mode • Line spacing • Properties cannot be changed, and we cannot define our own tags • The only way is to use a (possibly way too long) sequence of appropriate primitive tags every time – not a very convenient solution
Stylesheets to the rescue • Cascading Style Sheets (CSS): cleaner separation of presentation from actual content • Style: a named set of properties that define presentation of a chunk of text (character, paragraph, …) • Styles are present in text processing software (WinWord) but in some markup languages as well (TeX) • CSS is used with HTML, but it’s not HTML – although browsers know how to handle them together
CSS Syntax • A CSS-compatible stylesheet contains a set of rules, each with a selector (name), a number of properties and their values • Rules can be • Inline (within an HTML tag, in document body) • Embedded (in the head of an HTML document) • External, in a separate file which is then linked or imported into an HTML document • Position of the rule defines the scope of its effect on the document
CSS Selectors • HTML selectors – text portions of HTML tags • Class selectors – can be applied to any HTML tag • ID selectors – usually applied only once per page to a particular HTML tag • Type of HTML tag defines the scope of CSS properties • Block level (DIV, LI, H1) • Inline (B, FONT, TT) • Replaced tags (IMG)
CSS Properties • Always of the form property:value; • Categories of properties control • Typefaces (fonts, size, mode) • Text (kerning, leading, alignment) • Lists (bullets, indentation) • Colors (borders, text, rules, background) • Margins • Positioning of individual elements
CSS Rule with an HTML selector • Effective redefinition of HTML tags, e.g.: B { font: bold 18pt times,serif; text-decoration: underline; } • Redefines the <B> (boldface) tag throughout the rest of the document • Don’t forget to close the brace!
CSS Rule with a class selector • Independent style, applicable to any HTML tag: .extra { font-size: 28pt; } .huge { font-size: 48pt; } • Class selector must be referred to within the HTML tag: <B class="extra">Extra</B> <B class="huge">HUGE</B>
CSS Rule with a class selector • May be linked to a specific HTML tag: p.mini { font-size: 8pt; } p.big { font-size: 14pt; } • Class selector may be applied to this HTML tag only: <P class="mini">mini</P> <P class="big">BIG</P>
CSS Rule with an ID selector • Another independent style, applicable to any HTML tag: #area1 { position: relative; margin-left: 9em; color: red; } • ID is specified within the HTML tag: <SPAN ID="area1"> ... </SPAN>
More on CSS selectors • Several CSS selectors may share the same definition, and individual selectors may get additional properties separately • CSS rules can refer to tags nested within other tags, e.g., P B { background: pink; } • redefines the <B> tag only when encountered within the <P> tag
Adding CSS to your document • Within a style container in the document head: <HEAD><STYLE TYPE="text/css"><!-- CSS rules go here --></STYLE></HEAD> • HTML comment tags hide the CSS rules from non-CSS browsers
Importing CSS into your document • Create a separate file, stylefile.css, then write <HEAD><LINK REL="stylesheet" TYPE="text/css" HREF="stylefile.css"></HEAD> • Several files may be added in this manner
More on CSS • CSS has no single-line // comments: every comment, short or multiline, goes between a matched pair of /* and */ • A stylesheet file may import another stylesheet file (hence the name CSS) with the statement @import url(stylefile) • But: when rules conflict, all else being equal, the last rule listed wins! • Also: beware of browser differences!
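The "last rule listed wins" behaviour can be sketched in a few lines of Python. This is a deliberately simplified model, not how a browser implements the cascade: it assumes flat rules of the form selector { property: value; }, matches selectors by exact text only, and ignores specificity, !important, and @import. The function names parse_rules and resolve and the sample sheet are invented for the example.

```python
import re

# One rule: anything up to "{", then a brace-free body, then "}".
RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def parse_rules(css):
    """Return a list of (selector, {property: value}) pairs in source order."""
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)  # drop /* ... */ comments
    rules = []
    for selector, body in RULE.findall(css):
        props = {}
        for declaration in body.split(";"):
            if ":" in declaration:
                name, value = declaration.split(":", 1)
                props[name.strip()] = value.strip()
        rules.append((selector.strip(), props))
    return rules


def resolve(css, selector):
    """Merge all rules for one selector; later declarations override earlier ones."""
    merged = {}
    for sel, props in parse_rules(css):
        if sel == selector:
            merged.update(props)
    return merged


SHEET = """
/* first definition */
B { font: bold 18pt times,serif; color: black; }
.extra { font-size: 28pt; }
/* later rule for the same selector */
B { color: red; }
"""

styles = resolve(SHEET, "B")
print(styles)  # {'font': 'bold 18pt times,serif', 'color': 'red'}
```

The font declaration from the first rule survives, while color is overridden by the later rule for the same selector.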
More CSS capabilities • Font selection • Text control • List properties • Background properties • Absolute and relative positioning (but this is very dangerous!) • Visibility (which probably has little use by itself – but it can be quite useful when changed through appropriate scripts) • Stacking (vertical) order
Document Object Model • DOM describes the structure of an HTML document as a hierarchy • This allows a script written in a suitable language to access and manipulate only the selected element (or elements) within that document • document.images.b1.src="button_on.gif" describes a path from root or top (which is the document itself) to a particular element – an image file • Then, a script can manipulate this element (e.g., hide, show, replace, move, …) in response to certain events
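The same idea, walking from the document root to one named image and changing its source, can be sketched outside a browser. The example below assumes Python's standard xml.dom.minidom module and a small well-formed page; the page content, the image name b2, and the helper set_image_source are hypothetical. In a browser the equivalent would be the JavaScript path shown on the slide.

```python
from xml.dom import minidom

# A tiny, well-formed stand-in for an HTML page (minidom needs well-formed input).
PAGE = """<html><body>
<p>Press the button:</p>
<img name="b1" src="button_off.gif"/>
<img name="b2" src="logo.gif"/>
</body></html>"""


def set_image_source(document, image_name, new_src):
    """Find the <img> with the given name and replace its src attribute."""
    for image in document.getElementsByTagName("img"):
        if image.getAttribute("name") == image_name:
            image.setAttribute("src", new_src)
            return True
    return False


document = minidom.parseString(PAGE)
changed = set_image_source(document, "b1", "button_on.gif")
sources = [img.getAttribute("src") for img in document.getElementsByTagName("img")]
print(changed, sources)  # True ['button_on.gif', 'logo.gif']
```

Only the selected element is touched; the rest of the document tree is left as it was.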
XML • eXtensible Markup Language: a simplified (easier, more consistent) version of SGML • XML-compliant languages defined with appropriate DTDs • XML parsers signal syntax errors (unlike HTML) – use of authoring tools implied • current uses (with more to follow) • SMIL for synchronized multimedia • RDF for resource description
What is XML? • A method for putting structured data in a text file • Data stored on disk can be in binary or text format • Binary formats are often more concise • Text format allows human inspection • XML is a set of rules/guidelines/conventions for designing text formats for such data, to produce files that are • Easy to generate and read (by a computer) • Unambiguous and platform-independent • Extensible, easy to localize/internationalize
XML looks like HTML but isn't HTML • XML makes use of • tags (words bracketed by '<' and '>') and • attributes (of the form name="value") • HTML specifies what each tag & attribute means (and often how the text between them will look in a browser) • XML uses the tags only to delimit pieces of data – and leaves the interpretation to the application
XML is text, but isn't meant to be read • XML files are text files, but they are not made for human readers • Text format allows experts (such as programmers) to more easily debug applications • Text format allows the use of a simple text editor to fix a broken XML file • Rules for XML files much stricter than for HTML • Applications are not allowed to try to second-guess the creator of a broken XML file – if the file is broken, just stop and issue an error message
XML is verbose, but that is not a problem • XML is a text format and uses tags to delimit the data • Therefore, XML files are nearly always larger than comparable binary formats • But disk space isn't as expensive anymore as it used to be, and compression/decompression can be fast and reliable • Communication protocols can compress data on the fly, thus saving bandwidth as effectively as a binary format
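A rough way to get a feel for the size trade-off is to encode the same data three ways and compare byte counts. The sketch below is only an illustration under assumed data: the readings list, the element names READINGS, READING and VALUE, and the binary layout (a 4-byte integer plus a 4-byte float per record) are all invented here. How close compressed XML comes to a binary format depends heavily on the data, so no particular ratio is claimed.

```python
import struct
import zlib

# Hypothetical data set: 200 (id, value) pairs.
readings = [(i, 20.0 + (i % 7) * 0.5) for i in range(200)]

# Binary encoding: little-endian unsigned int + float, 8 bytes per record.
binary = b"".join(struct.pack("<If", ident, value) for ident, value in readings)

# The same data as tagged text.
xml_text = "<READINGS>" + "".join(
    f'<READING id="{ident}"><VALUE>{value}</VALUE></READING>'
    for ident, value in readings
) + "</READINGS>"
xml_bytes = xml_text.encode("utf-8")

# General-purpose compression of the XML text.
compressed = zlib.compress(xml_bytes)

print("binary:        ", len(binary), "bytes")
print("XML:           ", len(xml_bytes), "bytes")
print("XML, compressed:", len(compressed), "bytes")
```

The repeated tag names are exactly what a general-purpose compressor handles well, which is why the verbosity of the markup costs less on disk or on the wire than the raw file size suggests.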
XML is … good • XML is license-free • XML is platform-independent • XML is well-supported • Choosing XML is a lot like choosing SQL • you still have to build your own database and your own programs/procedures that manipulate it • but there are many tools available and many people that can help you • XML isn't always the best solution, but it is always worth considering …
XML is a family of technologies • XML: the specification that defines what "tags" and "attributes" are • XLink describes a standard way to add hyperlinks to an XML file • CSS is applicable to XML as it is to HTML • XSL: an advanced language for style sheets (presentation and manipulation) • XSLT: a transformation language • SMIL: Synchronized Multimedia Integration Language • … and others
Well-formed vs. valid XML • Well-formed documents comply with XML well-formedness constraints, which require that • Elements properly nest within each other • Elements use other markup syntax correctly • XML allows you to use elements of your own naming: ESSAY, SECTION, PARAGRAPH, NOTE, IMPORTANT • … unlike HTML, which forces all documents into a fixed document type
Writing XML One, Two • XML Declaration: declares the nature of XML documents to document readers • <?xml version="1.0" standalone="yes"?> • <?xml version="1.0" standalone="no"?> • <?xml version="1.0" encoding="UTF-8" standalone="no"?> (encoding, when present, must come before standalone) • Root element: contains all other elements (i.e., the rest of the document) • Root element is synonymous with your document type • Root element cannot be repeated
An XML example <?xml version="1.0" standalone="yes"?> <TRIVIA><MATH><QUESTION>What is the square root of 25</QUESTION><ANSWER>5</ANSWER></MATH> <GENERAL><QUESTION>What is the season after Summer</QUESTION><ANSWER>Fall</ANSWER><ANSWER>Autumn </ANSWER></GENERAL></TRIVIA>
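Since XML leaves the interpretation of the tags to the application, it may help to see one possible application reading this file. The sketch below assumes Python's standard xml.etree.ElementTree module; the function load_quiz and the shape of its result are choices made for this illustration, not part of the lecture.

```python
import xml.etree.ElementTree as ET

TRIVIA = """<?xml version="1.0" standalone="yes"?>
<TRIVIA>
  <MATH>
    <QUESTION>What is the square root of 25</QUESTION>
    <ANSWER>5</ANSWER>
  </MATH>
  <GENERAL>
    <QUESTION>What is the season after Summer</QUESTION>
    <ANSWER>Fall</ANSWER>
    <ANSWER>Autumn </ANSWER>
  </GENERAL>
</TRIVIA>"""


def load_quiz(xml_text):
    """Map each category element to (question text, list of accepted answers)."""
    root = ET.fromstring(xml_text)  # root is the single TRIVIA element
    quiz = {}
    for category in root:  # MATH, GENERAL, ...
        question = category.findtext("QUESTION")
        answers = [answer.text.strip() for answer in category.findall("ANSWER")]
        quiz[category.tag] = (question, answers)
    return quiz


quiz = load_quiz(TRIVIA)
print(quiz["GENERAL"])  # ('What is the season after Summer', ['Fall', 'Autumn'])
```

Nothing in the XML says that a category may have several accepted answers; that meaning lives entirely in the program that reads it.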
Rules for XML elements • All elements must have opening and closing (start and end) tags <MATH> ... </MATH> • There is one exception: an empty element may be written as a single tag like <QUESTION ... /> • Case matters – XML is case-sensitive • Proper tag nesting must be observed • You can add whitespace between tags to your heart’s content – most applications ignore it in processing
XML Writing • Describe content with elements of your own naming • Invent a new element each time you introduce content that significantly differs from any previous • The more elements you define, the greater control you will have later, when you use the document • Add attributes to elements • Attributes describe the content or behavior of elements
Another Example • <?xml version="1.0" standalone="yes"?><HELP><TITLE>XML Help</TITLE><QUERY area="XML"><QUESTION>Where do I start?</QUESTION><ANSWER>Start with your root element. Break your document down into parts, fill them in, repeat.</ANSWER></QUERY><QUERY area="XML"><QUESTION>Are my element names well chosen?</QUESTION></QUERY></HELP>
XML Writing 4 • Parsing: checking well-formedness • <PRICE>$57.80</PRICE> • <PET><CAT type="Cornish Rex">Cat nests properly within PET.</CAT></PET> • <WEATHER>Foggy – no closing tag • <LEVEL>Intermediate<LEVEL> – improper tag • <PASSWORD>planetB612</PASSWD> – wrong spelling • <DISTANCE TYPE=KM 120</DISTANCE> – missing closing bracket • <CAR><engine>engine does not nest properly within CAR</CAR></engine> – improper nesting
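These checks are easy to automate, because a conforming XML parser must reject anything that is not well-formed. The sketch below assumes Python's standard xml.etree.ElementTree module and feeds it fragments like those on the slide; the helper is_well_formed is invented for the example, and the last sample (a case mismatch) was added here to illustrate case sensitivity.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the parser accepts the fragment, False if it raises an error."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


SAMPLES = [
    "<PRICE>$57.80</PRICE>",                                # fine
    '<PET><CAT type="Cornish Rex">Nested</CAT></PET>',      # fine
    "<WEATHER>Foggy",                                       # no closing tag
    "<LEVEL>Intermediate<LEVEL>",                           # second tag opens again
    "<PASSWORD>planetB612</PASSWD>",                        # tag names differ
    "<DISTANCE TYPE=KM 120</DISTANCE>",                     # start tag never closed
    "<CAR><engine>text</CAR></engine>",                     # improper nesting
    "<Math>2+2</MATH>",                                     # case mismatch
]

results = [is_well_formed(sample) for sample in SAMPLES]
for sample, ok in zip(SAMPLES, results):
    print("OK " if ok else "BAD", sample)
```

Only the first two fragments pass. Unlike an HTML browser, the parser does not try to guess what the author meant.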
Valid XML • Valid XML, unlike merely well-formed XML, requires a Document Type Definition • DTD: a set of rules that a particular document type must follow • The rules state the name and contents of each element, and the contexts in which a particular element can or must appear • DTD enables communication with databases • Valid XML documents may be accompanied by style sheets for proper presentation
What’s in a DTD • Two essential structures: the element and the attribute • Root element: contains all other elements • Contents of other elements defined recursively starting from the root, until you reach text-level elements, e.g., <!ELEMENT NAME CONTENT> • Elements may have attributes, which are defined within the element definition, or separately, e.g., <!ATTLIST ELEMENT-NAME NAME CDATA #IMPLIED>
Writing a DTD <!ELEMENT novel (preface,chapter+,biography?,criticalessay*)> <!ELEMENT preface (paragraph+)> <!ELEMENT chapter (title,paragraph+,section+)> <!ELEMENT section (title,paragraph+)> <!ELEMENT biography (title,paragraph+)> <!ELEMENT criticalessay (title,section+)> <!ELEMENT paragraph (#PCDATA|keyword)*> <!ELEMENT title (#PCDATA|keyword)*> <!ELEMENT keyword (#PCDATA)>
DTD Declarations (1): Element type declaration • Each element type includes a name, content, and possibly a set of attributes • A document can contain many conforming elements of that type • Sequence: ordered list of components (,) • Choice: alternative components (|) • Components may be optional (?) • Components may be required and repeatable (+) • Components may be optional and repeatable (*) • Mixed-content declarations must include #PCDATA, parsed character data (i.e., text), as their first member
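The sequence, choice and repetition operators behave much like a regular expression over the names of an element's children. The sketch below makes that analogy concrete in Python. It is not a validator: it handles only element-content models (no #PCDATA or mixed content), checks one level of children, and the helper names model_to_regex and children_match are invented here.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(model):
    """Translate a DTD element-content model into a regex over child names.

    Each name becomes a group that matches the name followed by one space,
    so "(a,b+)" becomes "((?:a )(?:b )+)". The operators | ? + * and the
    parentheses already mean the same thing in regex syntax; commas are dropped.
    """
    model = re.sub(r"\s+", "", model)
    pattern = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda match: "(?:" + re.escape(match.group(0)) + " )",
        model,
    )
    return pattern.replace(",", "")


def children_match(element, model):
    """Check the names of an element's direct children against a content model."""
    sequence = "".join(child.tag + " " for child in element)
    return re.fullmatch(model_to_regex(model), sequence) is not None


NOVEL_MODEL = "(preface,chapter+,biography?,criticalessay*)"

good = ET.fromstring("<novel><preface/><chapter/><chapter/><biography/></novel>")
bad = ET.fromstring("<novel><chapter/><preface/></novel>")

print(children_match(good, NOVEL_MODEL))  # True
print(children_match(bad, NOVEL_MODEL))   # False
```

The second document fails because the sequence operator fixes the order: preface must come before the first chapter.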
DTD Declarations (2): Attribute List Declarations • Much more variation here • String type attributes (CDATA): virtually unconstrained text strings • Enumeration attributes: require a list of options to pick from • Attribute defaults: • #REQUIRED, required; • #IMPLIED, optional; • #FIXED "value", a fixed value; • "value", a default but overridable value • Usage: <ELEMENT-NAME NAME="value">
An Attribute List Example <!ELEMENT MEMO (TO,FROM,SUBJECT,BODY,SIGN)> <!ATTLIST MEMO importance (HIGH|MEDIUM|LOW) "LOW"> <!ELEMENT TO (#PCDATA)> <!ELEMENT FROM (#PCDATA)> <!ELEMENT SUBJECT (#PCDATA)> <!ELEMENT BODY (P+)> <!ELEMENT P (#PCDATA)> <!ELEMENT SIGN (#PCDATA)> <!ATTLIST SIGN signatureFile CDATA #IMPLIED email CDATA #REQUIRED>
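What these attribute declarations ask of a validating processor can be sketched by hand. The Python below transcribes the two ATTLIST declarations into a dictionary and applies them to an element's attributes. The table layout, the REQUIRED and IMPLIED markers, and the function apply_attlist are all assumptions made for this illustration; a real validating parser reads the DTD itself and also supports #FIXED, which is left out here.

```python
# Markers standing in for the DTD keywords #REQUIRED and #IMPLIED.
REQUIRED = object()
IMPLIED = object()

# Hand transcription of the two ATTLIST declarations in the memo DTD.
# "allowed" is None for CDATA (any string) or a tuple for an enumeration.
ATTLISTS = {
    "MEMO": {
        "importance": {"allowed": ("HIGH", "MEDIUM", "LOW"), "default": "LOW"},
    },
    "SIGN": {
        "signatureFile": {"allowed": None, "default": IMPLIED},
        "email": {"allowed": None, "default": REQUIRED},
    },
}


def apply_attlist(tag, attributes):
    """Return the attributes with defaults filled in, or raise ValueError."""
    declared = ATTLISTS.get(tag, {})
    result = dict(attributes)
    for name in result:
        if name not in declared:
            raise ValueError(f"{tag}: undeclared attribute {name!r}")
    for name, rule in declared.items():
        if name not in result:
            if rule["default"] is REQUIRED:
                raise ValueError(f"{tag}: missing required attribute {name!r}")
            if rule["default"] is not IMPLIED:
                result[name] = rule["default"]  # supply the declared default
        elif rule["allowed"] is not None and result[name] not in rule["allowed"]:
            raise ValueError(f"{tag}: {name}={result[name]!r} is not an allowed value")
    return result


memo_defaults = apply_attlist("MEMO", {})
print(memo_defaults)  # {'importance': 'LOW'}
print(apply_attlist("SIGN", {"email": "author@example.org"}))
```

A MEMO written without an importance attribute is treated as importance="LOW", while a SIGN element without an email attribute is rejected.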