Superset Me—Not: Why the JPTS Is Sufficient if You Use Appropriate Layer Validation

Alexander (“Sasha”) Schwarzman

American Geophysical Union (AGU)

JATS-Con

November 2, 2010

Summary

We built a superset of the NLM Journal Publishing Tag Set (JPTS) to enforce business rules, data types, and house style. Having done that, we realized that a JPTS subset would have been sufficient to meet AGU's needs if it were used in conjunction with appropriate layer validation technology, such as Schematron.

Superset Me—Not JATS-Con Nov 2, 2010

Contents
  • Why we built a JPTS superset
  • DTD vs. Schematron
    • Attribute values
    • Number of element occurrences
    • Element position & sequence
    • References
  • Lessons learned

Why we built a JPTS superset
  • No generic book model
  • Lack of familiarity with Schematron
  • Lack of mature tool support (running SVRL was not a viable option in the production environment)
  • Lack of expertise in integrating Schematron with validation against a relational DB
  • JATS v2.3: no Compound Keywords, not all content models parameterized

DTD vs. Schematron: Attribute values

Requirement: Article type is required and can be one of three types: a regular article (rga), a correction (cor), or an editorial (edt)

Strict DTD

<!ATTLIST article article-type (rga | cor | edt) #REQUIRED >

JPTS

<!ATTLIST article article-type CDATA #IMPLIED >

DTD vs. Schematron: Attribute values (cont’d)

XML instance (contains non-allowed article type)

<article article-type='xxx'/>

Schematron

<rule context="article">

<assert test="@article-type=('rga','cor','edt')">

@article-type '<value-of select="@article-type"/>' not allowed, must be 'rga', 'cor', or 'edt'</assert></rule>

Schematron message

@article-type 'xxx' not allowed, must be 'rga', 'cor', or 'edt'
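
The single assert above covers both halves of the requirement. In XPath 2.0, `=` against a sequence is a general comparison, which is true if any pair of items matches, and a missing attribute yields an empty sequence, so the test fails for an absent article type as well as for a disallowed one. The following Python is only a sketch of those semantics; `general_eq` and `article_type_ok` are hypothetical helper names, not part of the AGU Schematron.

```python
import xml.etree.ElementTree as ET

# Allowed values taken from the requirement above.
ALLOWED_ARTICLE_TYPES = ("rga", "cor", "edt")


def general_eq(left, right):
    """XPath-style general comparison: true if any left item equals any right item."""
    return any(l == r for l in left for r in right)


def article_type_ok(article_xml):
    """Return True if the article element carries an allowed article-type."""
    article = ET.fromstring(article_xml)
    value = article.get("article-type")
    # A missing attribute behaves like the empty sequence, so the comparison is false.
    values = [] if value is None else [value]
    return general_eq(values, ALLOWED_ARTICLE_TYPES)


print(article_type_ok("<article article-type='rga'/>"))  # True
print(article_type_ok("<article article-type='xxx'/>"))  # False
print(article_type_ok("<article/>"))                     # False
```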

DTD vs. Schematron: Number of element occurrences

Requirement: Acknowledgments, if present, must contain exactly one paragraph, except for two journals (journal codes ‘ja’ and ‘rg’), where Acknowledgments must contain exactly two paragraphs

Strict DTD

<!ELEMENT ack (p, p?) >

JPTS

<!ELEMENT ack (p*) >

DTD vs. Schematron: Number of occurrences (cont’d)

XML instance (wrong number of paragraphs)

<article>

...

<journal-id>jb</journal-id>

...

<ack>

<p>Blah</p>

<p>Blah-blah</p>

</ack>

</article>

DTD vs. Schematron: Number of occurrences (cont’d)

Schematron

<rule context="ack[ancestor::*/journal-id=('ja','rg')]">

<assert test="count(p) eq 2">

'<name/>' in '<value-of select="ancestor::*/journal-id"/>'

must contain exactly two paragraphs</assert></rule>

<rule context="ack">

<assert test="count(p) eq 1">

'<name/>' in '<value-of select="ancestor::*/journal-id"/>'

must contain only one paragraph</assert></rule>

DTD vs. Schematron: Number of occurrences (cont’d)

Schematron message

'ack' in 'jb' must contain only one paragraph
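
The order of the two ack rules matters: assuming both sit in the same Schematron pattern, a node is tested only by the first rule whose context matches it, so the journal-specific rule has to come before the generic one. The Python below is a rough sketch of that "first match wins" logic, not the AGU implementation; `check_ack` is a hypothetical name, and taking the first journal-id found anywhere in the article is a simplification of the ancestor lookup in the rules.

```python
import xml.etree.ElementTree as ET

# Journals whose Acknowledgments must hold exactly two paragraphs (from the requirement).
TWO_PARAGRAPH_JOURNALS = {"ja", "rg"}


def check_ack(article_xml):
    """Return one error message per <ack> with the wrong number of <p> children."""
    root = ET.fromstring(article_xml)
    # Simplification: use the first <journal-id> found anywhere in the article.
    journal_id = root.findtext(".//journal-id", default="")
    errors = []
    for ack in root.iter("ack"):
        paragraphs = len(ack.findall("p"))
        # The specific rule is tried first; the generic rule applies only otherwise.
        if journal_id in TWO_PARAGRAPH_JOURNALS:
            if paragraphs != 2:
                errors.append(f"'ack' in '{journal_id}' must contain exactly two paragraphs")
        elif paragraphs != 1:
            errors.append(f"'ack' in '{journal_id}' must contain only one paragraph")
    return errors


sample = """<article>
  <journal-id>jb</journal-id>
  <ack><p>Blah</p><p>Blah-blah</p></ack>
</article>"""

print(check_ack(sample))
```

Swapping the two branches would make the generic check fire for 'ja' and 'rg' as well, which is the kind of rule conflict the ordering avoids.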

DTD vs. Schematron: Element position & sequence

Requirement: If a journal has subject grouping (ToC category, subset) and the article belongs to a special collection (special section, theme), then the subject grouping information must precede the special collection information

Strict DTD

<!ELEMENT article-categories

(subject-group*,

special-collection?) >

JPTS

<!ELEMENT article-categories

(subj-group*) >

DTD vs. Schematron: Element position & sequence (cont’d)

XML instance (wrong sequence of subject groups)

<article-categories>

<subj-group subj-group-type="special-section">

<subject content-type="EARLYWARN1">New Methods and

Applications of Earthquake Early Warning</subject>

</subj-group>

<subj-group subj-group-type="toc-category">

<subject content-type="SDE">Solid Earth</subject>

</subj-group>

</article-categories>

DTD vs. Schematron: Element position & sequence (cont’d)

Schematron

<rule context="article-categories/subj-group[@subj-group-type=('special-section','theme')]">

<assert test="not(following-sibling::subj-group[@subj-group-type=('toc-category','subset')])">

<name/>/@subj-group-type='<value-of select="@subj-group-type"/>' must appear after a ToC Category or a Subset when either is present</assert></rule>

Schematron message

subj-group/@subj-group-type='special-section' must appear after a ToC Category or a Subset when either is present

DTD vs. Schematron: References

Validating references is a challenge:

  • Variety vs. the need to enforce editorial style

Strict DTD:

  • Fixed element order, no mixed content
  • Punctuation, spacing, face markup – on output

JPTS:

  • Lots of elements, any order, mixed content
  • Punctuation, spacing, face markup included

DTD vs. Schematron: References (cont’d)

Strict DTD

<!ELEMENT book-standalone-citation

((person-group | string-name),

year,

source,

edition?,

(person-group | string-name)?,

size?,

elocation-id?,

publisher-name,

publisher-loc) >

<!ATTLIST book-standalone-citation

id ID #REQUIRED >

DTD vs. Schematron: References (cont’d)

JPTS

<!ELEMENT mixed-citation

(#PCDATA | person-group | string-name |

year | source | edition | size |

elocation-id | publisher-name |

publisher-loc | ... | ...)* >

<!ATTLIST mixed-citation

id ID #IMPLIED

publication-type CDATA #IMPLIED >

DTD vs. Schematron: References (cont’d)

Example:

Mood, A. M., and F. A. Graybill (1963), Introduction to the Theory of Statistics, 2nd ed., 295 pp., McGraw-Hill, New York.

DTD vs. Schematron: References (cont’d)

XML instance (strict DTD)

<book-standalone-citation id="mood63">

<person-group person-group-type="author">

<name><surname>Mood</surname>

<given-names>A. M.</given-names></name>

<name><surname>Graybill</surname>

<given-names>F. A.</given-names></name>

</person-group>

<year>1963</year>

<source>Introduction to the Theory of Statistics</source>

<edition>2nd</edition>

<size units="page">295 pp</size>

<publisher-name>McGraw-Hill</publisher-name>

<publisher-loc>New York</publisher-loc>

</book-standalone-citation>

DTD vs. Schematron: References (cont’d)

XML instance (JPTS)

<mixed-citation publication-type="book-standalone">

<string-name>

<surname>Mood</surname>, <given-names>A. M.</given-names>

</string-name>, and <string-name>

<given-names>F. A.</given-names> <surname>Graybill</surname>

</string-name>

(<year>1963</year>),

<source><italic>Introduction to the

Theory of Statistics</italic></source>,

<edition>2</edition>nd ed.,

<size units="page">295</size> pp.,

<publisher-name>McGraw-Hill</publisher-name>,

<publisher-loc>New York</publisher-loc>.

</mixed-citation>

DTD vs. Schematron: References (cont’d)

Schematron can check that all required elements are present and in the correct sequence (note the required elements, and that edition, if present, follows source):

<!ELEMENT book-standalone-citation

((person-group | string-name),

year,

source,

edition?,

(person-group | string-name)?,

size?,

elocation-id?,

publisher-name,

publisher-loc) >

DTD vs. Schematron: References (cont’d)
  • Schematron can check that all required elements are present:

<rule context="mixed-citation[@publication-type='book-standalone']">

<assert test="(person-group | string-name) and year

and source and publisher-name

and publisher-loc">

required element missing</assert></rule>

  • & that the elements are in the correct sequence:

DTD vs. Schematron: References (cont’d)

XML instance (JPTS) (edition is in the wrong place)

<mixed-citation publication-type="book-standalone">

<string-name>

<surname>Mood</surname>, <given-names>A. M.</given-names>

</string-name>, and <string-name>

<given-names>F. A.</given-names><surname>Graybill</surname>

</string-name>

(<year>1963</year>),

<edition>2</edition>nd ed.,

<source><italic>Introduction to the Theory …</italic></source>,

<size units="page">295</size> pp.,

<publisher-name>McGraw-Hill</publisher-name>,

<publisher-loc>New York</publisher-loc>.

</mixed-citation>

DTD vs. Schematron: References (cont’d)

This Schematron rule uses the positional predicate [1] to check that year is immediately followed by source:

<rule context="mixed-citation[@publication-type=

'book-standalone']/year">

<assert test="following-sibling::*[1]/self::source">

'<name/>' must be immediately followed by 'source', not by '<value-of select="name(following-sibling::*[1])"/>'

</assert></rule>

Schematron message

'year' must be immediately followed by 'source', not by 'edition'

DTD vs. Schematron: References (cont’d)

But how can we check the sequence of required elements when optional elements might be interspersed between them?

This Schematron rule checks that the required publisher-name is preceded by the required source, regardless of any optional elements that may occur in between:

<rule context="mixed-citation[@publication-type=

'book-standalone']/publisher-name">

<assert test="preceding-sibling::source">

'<name/>' must be preceded by 'source'</assert></rule>

DTD vs. Schematron: References (cont’d)
  • Rick Jelliffe’s approach combines the flexibility of JPTS with the benefits of a DTD-like fixed element order:
    • The content of each element is rewritten as a string of its child element names
    • The content model is represented as a regular expression
    • Schematron checks the string of names against the regex
    • Schematron generates an error message if the content does not match the model
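
The four steps above can be sketched outside Schematron. The Python below only illustrates the idea and is not Jelliffe's or AGU's code: the regex is hand-written from the book-standalone model, the helper names are hypothetical, and the sample citation is a simplified instance that has not been validated against the JPTS DTD. In Schematron itself the same check would presumably rely on XPath 2.0 functions such as `string-join()` and `matches()`.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written regex for the book-standalone model; every name is followed by one space.
BOOK_STANDALONE = re.compile(
    r"(string-name |person-group )"
    r"year source (edition )?"
    r"(string-name |person-group )?"
    r"(size )?(elocation-id )?"
    r"publisher-name publisher-loc "
)


def child_names(citation_xml):
    """Rewrite the element as a string of its child element names (text is ignored)."""
    citation = ET.fromstring(citation_xml)
    return "".join(child.tag + " " for child in citation)


def check_citation(citation_xml):
    """Return None if the children match the model, else an error message."""
    names = child_names(citation_xml)
    if BOOK_STANDALONE.fullmatch(names):
        return None
    return f"content '{names.strip()}' does not match the book-standalone model"


# Simplified instance: edition wrongly placed before source.
bad = (
    '<mixed-citation publication-type="book-standalone">'
    "<person-group>Mood and Graybill</person-group> (<year>1963</year>), "
    "<edition>2</edition>nd ed., <source>Introduction</source>, "
    "<publisher-name>McGraw-Hill</publisher-name>, "
    "<publisher-loc>New York</publisher-loc>.</mixed-citation>"
)

print(check_citation(bad))
```

Because only element names are compared, punctuation and spacing in the mixed content do not affect the result.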

DTD vs. Schematron: References (cont’d)

An XML file, e.g., citation-models.xml, specifies structured citation models:

...

<model publication-type="book-standalone">

((string-name | person-group),

year,

source,

edition?,

(string-name | person-group)?,

size?,

elocation-id?,

publisher-name,

publisher-loc)

</model>

...
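
A DTD-like model of this kind can be turned into a regular expression mechanically. The Python below is a minimal sketch under stated assumptions: the model text is passed in as a plain string (how citation-models.xml is actually read is not shown in the slides), edition is treated as optional as in the strict DTD shown earlier, and `model_to_regex` and `matches_model` are hypothetical names.

```python
import re

# Tokens of a DTD-like content model: element names and the operators ( ) | ? * + ,
TOKEN = re.compile(r"[A-Za-z][\w.-]*|[()|?*+,]")


def model_to_regex(model):
    """Compile a content model into a regex over space-terminated element names."""
    parts = []
    for token in TOKEN.findall(model):
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue  # a sequence is plain concatenation in a regex
        elif token in (")", "|", "?", "*", "+"):
            parts.append(token)
        else:
            # An element name: match the name followed by one separating space.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


BOOK_MODEL = """((string-name | person-group), year, source, edition?,
    (string-name | person-group)?, size?, elocation-id?,
    publisher-name, publisher-loc)"""


def matches_model(names, model=BOOK_MODEL):
    """Check a list of child element names against the compiled model."""
    name_string = "".join(name + " " for name in names)
    return model_to_regex(model).fullmatch(name_string) is not None


print(matches_model(["person-group", "year", "source", "edition",
                     "size", "publisher-name", "publisher-loc"]))
print(matches_model(["person-group", "year", "edition", "source",
                     "publisher-name", "publisher-loc"]))
```

The first call describes a citation in model order; the second has edition ahead of source and is rejected.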

DTD vs. Schematron: References (cont’d)
  • Advantages:
    • Document is still DTD-valid
    • Mixed content is permitted
    • Type-sensitive handling of references is possible
  • Caveat: requires XSLT 2.0!

Lessons learned
  • AGU Tag Set + Schematron (200+ checks)
    • Ensures data quality
    • Ensures markup integrity
    • Provides control over production processes
  • AGU Tag Set is a superset of JPTS
    • Based on JPTS
    • Uses the same modularization principles
    • Can be easily mapped to JPTS
  • Were we to do this again, we would develop a JPTS subset and a Schematron schema

Lessons learned (cont’d)
  • Appropriate layer validation
    • Even the most “Prussian” DTD can’t enforce all business rules, data types, and house style
    • Rules-based checking needed anyway
    • May as well use “Californian” JPTS (de facto industry standard) adopted by publishers, conversion & composition vendors, archives, etc.
  • Paradigm shift: the crux of validation shifts from XML parser to Schematron engine

Lessons learned (cont’d)
  • This shift is not without costs:
    • Content may be valid to JPTS but make no sense
    • Dependency on Schematron for semantic integrity
    • Constraints on business partners: must be Schematron-capable and have tools
    • Schematron does not “fix” problems—people do. Processes and procedures must be well-defined

Lessons learned (cont’d)
  • Writing a simple Schematron is easy; building a complex and efficient one is not:

    • Elicit, document, convey, and clarify the Requirements
    • Ensure Schematron fits into your workflow
    • Modularize Schematron
    • Ensure that individual Schematron rules aren’t in conflict
    • Optimize Schematron performance
    • Employ XSLT 2.0
    • Test, test, test
    • Cultivate Schematron & XSLT 2.0 expertise in-house

Conclusion
  • What about content that is not like a journal article, e.g., generic (non-NCBI) books and their parts/chapters?
  • When this deficiency is addressed, the NLM Archiving and Interchange Tag Suite could truly say:

“Superset Me—Not!”
