Automating Content Analysis with Trang and Simple XSLT Scripts

Bob DuCharme, XML 2008, December 9, 2008


Presentation Transcript


  1. Automating Content Analysis with Trang and Simple XSLT Scripts. Bob DuCharme, XML 2008, December 9, 2008

  2. What We Do We help companies lower the cost of creating and managing information.

  3. About me • Solutions Architect, Innodata Isogen • weblog: http://www.snee.com/bobdc.blog • other writing: See http://www.snee.com/bob • URLs referenced today: http://www.snee.com/xml/xml2008

  4. Single source publishing and “editorial” XML [Diagram: Input 1 through Input 5, Process A through Process F, an Editorial Master (XML), and Output 1 through Output 3.]

  5. Content analysis: why? • You’ve “inherited” some content • Convert to your current editorial format • Convert it to new output formats • Efficient development of efficient conversion routines

  6. Handy tool 1 before we get to the XML parts: sort
      colors.txt:
        red
        green
        blue
        green
        blue
        blue
        red
      $ sort colors.txt
        blue
        blue
        blue
        green
        green
        red
        red

  7. Handy tool 2 before we get to the XML parts: uniq
      $ sort colors.txt | uniq -c
        3 blue
        2 green
        2 red

  8. Sample data

  9. trang From http://www.thaiopensource.com/relaxng/trang.html: Trang converts between different schema languages for XML. It supports the following languages: • RELAX NG (XML syntax) • RELAX NG compact syntax • XML 1.0 DTDs • W3C XML Schema A schema written in any of the supported schema languages can be converted into any of the other supported schema languages, except that W3C XML Schema is supported for output only, not for input. Trang can also infer a schema from one or more example XML documents.

  10. trang Trang can also infer a schema from one or more example XML documents!!!!!

  11. Analyzing content with trang
      <whatever>
        <?xml version="1.0" encoding="UTF-8" ?>
        <somedoc>Here is one document</somedoc>
        <somedoc>Here is another</somedoc>
        <somedoc>Here is another</somedoc>
        <somedoc>Here is another</somedoc>
      </whatever>
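
One way to build such a combined file from many sample documents is sketched below. An XML declaration is only allowed at the very start of a document, so the sketch drops each file's declaration before wrapping the contents in a single root element. The file names doc1.xml and doc2.xml, and the assumption that each declaration sits alone on the first line, are illustrative and not taken from the talk.

```shell
#!/bin/sh
# Sketch: combine sample XML documents under one wrapper element so that
# a schema can be inferred from all of them at once.
# Assumptions (hypothetical): two small files; any XML declaration is alone
# on line 1; the documents carry no DOCTYPE declaration.
workdir=$(mktemp -d)

printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>' \
  '<somedoc>Here is one document</somedoc>' > "$workdir/doc1.xml"
printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>' \
  '<somedoc>Here is another</somedoc>' > "$workdir/doc2.xml"

combined="$workdir/combined.xml"
{
  echo '<whatever>'
  for f in "$workdir/doc1.xml" "$workdir/doc2.xml"; do
    # Delete an XML declaration found on line 1; pass everything else through.
    sed '1{/^<?xml /d;}' "$f"
  done
  echo '</whatever>'
} > "$combined"

cat "$combined"
```

The combined file can then be given to trang in the same way as issueContents.xml.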

  12. Create RELAX NG versions of … • Elsevier article DTD: trang art510.dtd art510.rng • Combined sample content: trang issueContents.xml issueContents.rng • Compare results: saxon art510.rng compareElsRNG.xsl | sort > compareElsRNG.out

  13. compareElsRNG.xsl (1 of 2)
      <xsl:stylesheet version="1.0"
          xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
          xmlns:r="http://relaxng.org/ns/structure/1.0">
        <xsl:strip-space elements="*"/>
        <xsl:output method="text"/>
        <xsl:variable name="schema" select="document('issueContents.rng')"/>
        <xsl:template match="text()"/>

  14. compareElsRNG.xsl (2 of 2)
        <xsl:template match="r:element">
          <xsl:variable name="name" select="@name"/>
          <xsl:choose>
            <xsl:when test="$schema/r:grammar//r:element/@name[. = $name]">
              Yes: <xsl:value-of select="$name"/>
            </xsl:when>
            <xsl:otherwise>
              No: <xsl:value-of select="$name"/>
            </xsl:otherwise>
          </xsl:choose>
          <xsl:apply-templates/>
        </xsl:template>
      </xsl:stylesheet>

  15. compareElsRNG.xsl: some sample output
      No: tb:colspec
      No: tb:left-border
      No: tb:right-border
      No: tb:top-border
      Yes: aid
      Yes: article
      Yes: body
      Yes: ce:abstract
      Yes: ce:abstract-sec
      Yes: ce:acknowledgment
      Yes: ce:affiliation
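
Because the report is plain text with one element per line, standard tools can summarize it. The sketch below uses a few of the sample lines above as stand-in data. It lists the elements that the DTD declares but the sample content never uses (the `No:` lines) and counts each group. The exact whitespace of the real compareElsRNG.out is an assumption, so leading blanks are tolerated.

```shell
#!/bin/sh
# Sketch: summarize the Yes/No comparison report.
workdir=$(mktemp -d)
report="$workdir/compareElsRNG.out"

# Stand-in data copied from the sample output; a real report is much longer.
cat > "$report" <<'EOF'
No: tb:colspec
No: tb:left-border
No: tb:right-border
No: tb:top-border
Yes: aid
Yes: article
Yes: body
Yes: ce:abstract
EOF

# Elements declared in the DTD that never occur in the sample content.
unused=$(sed -n 's/^[[:space:]]*No:[[:space:]]*//p' "$report")
echo "$unused"

# Number of elements in each group (Yes / No).
summary=$(sed 's/^[[:space:]]*\([A-Za-z]*\):.*/\1/' "$report" | sort | uniq -c)
echo "$summary"
```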

  16. Analyzing the XML itself • Or SGML, after using James Clark’s sx: sx -f err.out -x lower myfile.sgm > myfile.xml

  17. Counting elements: countElements.xsl
      <xsl:stylesheet version="1.0"
          xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:strip-space elements="*"/>
        <xsl:output method="text"/>
        <xsl:template match="text()"/>
        <xsl:template match="*">
          <xsl:value-of select="name()"/>
          <xsl:text>&#10;</xsl:text>
          <xsl:apply-templates/>
        </xsl:template>
      </xsl:stylesheet>

  18. Using countElements.xsl to count elements saxon issueContents.xml countElements.xsl | sort | uniq -c | sort
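
The final `sort` in this pipeline compares lines as text, so it orders the counts correctly only while `uniq -c` pads them to a common width, which is not guaranteed across implementations or for very large counts. A hedged variant is to finish with `sort -n`, which compares the leading counts as numbers. The sketch below uses made-up element names in place of the stylesheet's one-name-per-line output.

```shell
#!/bin/sh
# Sketch: rank names by frequency without relying on how uniq -c pads counts.
# The names are hypothetical stand-ins for the output of countElements.xsl.
workdir=$(mktemp -d)
names="$workdir/names.txt"

# 12 x para, 2 x title, 1 x emph
i=0
while [ "$i" -lt 12 ]; do
  echo para
  i=$((i + 1))
done > "$names"
printf '%s\n' title emph title >> "$names"

# sort groups identical names, uniq -c counts them, sort -n orders by count.
ranked=$(sort "$names" | uniq -c | sort -n)

# Print bare "count name" pairs, least frequent first.
echo "$ranked" | awk '{print $1, $2}'
```

Adding `-r` (`sort -rn`) puts the most frequent elements first instead.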

  19. Result of counting elements
      Start of list:
            1 ce:chem
            1 ce:displayed-quote
            1 ce:inline-figure
            1 ce:nomenclature
            1 ce:textbox
            1 ce:textbox-body
            1 ce:underline
            1 ce:vsp
            1 doc
            1 sb:e-host
            2 small-caps
            3 display
            3 formula
      End of list:
         5726 ce:cross-ref
         6916 entry
         7225 mml:mo
         7760 sb:maintitle
         7760 sb:title
         7929 ce:label
         8458 ce:hsp
         9326 mml:mi
        10331 mml:mrow
        12438 ce:italic
        16453 sb:author
        17082 ce:given-name
        17095 ce:surname

  20. Count element/parent combinations
      <xsl:stylesheet version="1.0"
          xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:strip-space elements="*"/>
        <xsl:output method="text"/>
        <xsl:template match="text()"/>
        <xsl:template match="*">
          <xsl:value-of select="name(..)"/>/<xsl:value-of select="name()"/>
          <xsl:text>&#10;</xsl:text>
          <xsl:apply-templates/>
        </xsl:template>
      </xsl:stylesheet>

  21. Some parent/child counts
          1 ce:displayed-quote/ce:simple-para
         59 ce:biography/ce:simple-para
        107 ce:legend/ce:simple-para
        115 ce:abstract-sec/ce:simple-para
        859 ce:caption/ce:simple-para
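
With one `count parent/child` pair per line, follow-up questions are easy to ask of the report. A minimal sketch, assuming the excerpt above has been saved to a hypothetical file parentChild.out, that totals how often `ce:simple-para` occurs across the parents listed. Since the slide shows only an excerpt, the total covers only these five rows.

```shell
#!/bin/sh
# Sketch: total the occurrences of one child element across its parents.
workdir=$(mktemp -d)
counts="$workdir/parentChild.out"   # hypothetical file name

# Excerpt of the parent/child counts shown on the slide.
cat > "$counts" <<'EOF'
      1 ce:displayed-quote/ce:simple-para
     59 ce:biography/ce:simple-para
    107 ce:legend/ce:simple-para
    115 ce:abstract-sec/ce:simple-para
    859 ce:caption/ce:simple-para
EOF

child='ce:simple-para'

# Field 1 is the count, field 2 is "parent/child"; split field 2 on the slash.
result=$(awk -v child="$child" '
  { split($2, pair, "/") }
  pair[2] == child { total += $1; parents++ }
  END { printf "%s: %d occurrences under %d parent elements", child, total, parents }
' "$counts")

echo "$result"
```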

  22. countAttributes.xsl
      <xsl:stylesheet version="1.0"
          xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:strip-space elements="*"/>
        <xsl:output method="text"/>
        <xsl:template match="text()"/>
        <xsl:template match="@*">
          <xsl:value-of select="name(..)"/>
          <xsl:text>/@</xsl:text>
          <xsl:value-of select="name()"/>
          <xsl:text>&#10;</xsl:text>
        </xsl:template>
        <xsl:template match="*">
          <xsl:apply-templates select="*|@*"/>
        </xsl:template>
      </xsl:stylesheet>

  23. Counting the attributes: an excerpt
           1 ce:textbox/@id
          28 ce:enunciation/@id
          44 ce:table-footnote/@id
          50 ce:biography/@id
          79 ce:footnote/@id
         104 ce:correspondence/@id
         142 ce:table/@id
         175 ce:affiliation/@id
         180 ce:formula/@id
         182 ce:section/@id
         713 ce:figure/@id
        4224 ce:bib-reference/@id

  24. Count formula elements with/without ID values
      <xsl:stylesheet version="1.0"
          xmlns:ce="http://www.elsevier.com/xml/common/dtd"
          xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="text"/>
        <xsl:template match="/">
          Yes: <!-- finds 180 -->
          <xsl:value-of select="count(//ce:formula[@id])"/>
          No: <!-- finds 208 -->
          <xsl:value-of select="count(//ce:formula[not(@id)])"/>
        </xsl:template>
      </xsl:stylesheet>

  25. Find all values of a particular attribute
      <xsl:stylesheet version="1.0"
          xmlns:ce="http://www.elsevier.com/xml/common/dtd"
          xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="text"/>
        <xsl:template match="*">
          <xsl:apply-templates select="*|@*"/>
        </xsl:template>
        <xsl:template match="text()|@*"/>
        <xsl:template match="ce:link/@locator">
          <xsl:value-of select="."/><xsl:text>&#10;</xsl:text>
        </xsl:template>
      </xsl:stylesheet>

  26. Running OneAttValue.xsl
      xsltproc OneAttValue.xsl issueContents.xml | sort | uniq -c | sort
      • Output ending like this:
         10 gr12
         11 gr11
         14 gr10
         17 fx1
         17 fx2
         18 gr9
         24 gr8
         37 gr7
         55 gr6
         67 gr5
         91 gr4
         99 gr3
        103 gr1
        103 gr2

  27. Output just the comments in a document
      <xsl:stylesheet version="1.0"
          xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:template match="text()"/>
        <xsl:template match="comment()">
          <xsl:copy/>
        </xsl:template>
      </xsl:stylesheet>

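  28. Output just the processing instructions in a document
      <xsl:stylesheet version="1.0"
          xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml"/>
        <!-- Suppress text nodes, which the built-in template rules would otherwise copy to the output. -->
        <xsl:template match="text()"/>
        <xsl:template match="processing-instruction()">
          <xsl:copy/>
        </xsl:template>
      </xsl:stylesheet>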

  29. elAttList.xsl goal • Go through rng schema • For each element, output dtdname.dtd\telementName • For each attribute, output dtdname.dtd\telementName\tattributeName

  30. elAttList.xsl part 1 of 2
      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
          xmlns:r="http://relaxng.org/ns/structure/1.0"
          version="1.0">
        <xsl:param name="dtdname">no dtdname parameter supplied</xsl:param>
        <xsl:strip-space elements="*"/>
        <xsl:output method="text"/>
        <xsl:template match="r:files|r:attribute|r:value"/>

  31. elAttList.xsl part 2 of 2
        <xsl:template match="r:element">
          <xsl:variable name="elName" select="@name"/>
          <xsl:value-of select="$dtdname"/>
          <xsl:text>&#9;</xsl:text>
          <xsl:value-of select="@name"/>
          <xsl:text>&#10;</xsl:text>
          <xsl:for-each select="r:attribute | r:optional/r:attribute">
            <xsl:value-of select="$dtdname"/>
            <xsl:text>&#9;</xsl:text>
            <xsl:value-of select="$elName"/>
            <xsl:text>&#9;</xsl:text>
            <xsl:value-of select="@name"/>
            <xsl:text>&#10;</xsl:text>
          </xsl:for-each>
          <xsl:apply-templates/>
        </xsl:template>
      </xsl:stylesheet>
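
The tab-separated list that elAttList.xsl produces (two fields for an element row, three for an attribute row) is easy to post-process. The sketch below counts the attributes listed for each element. The sample rows are illustrative stand-ins rather than real output, and elAttList.out is a hypothetical file name.

```shell
#!/bin/sh
# Sketch: count attributes per element in the tab-separated list.
# In practice the list might come from a command along the lines of
#   xsltproc --stringparam dtdname art510.dtd elAttList.xsl art510.rng
# (an assumed invocation, not one shown in the slides).
workdir=$(mktemp -d)
list="$workdir/elAttList.out"   # hypothetical file name

# Illustrative rows: dtdname<TAB>element[<TAB>attribute]
printf 'art510.dtd\tce:figure\n'         >  "$list"
printf 'art510.dtd\tce:figure\tid\n'     >> "$list"
printf 'art510.dtd\tce:link\n'           >> "$list"
printf 'art510.dtd\tce:link\tlocator\n'  >> "$list"
printf 'art510.dtd\tce:italic\n'         >> "$list"

attr_counts=$(awk -F '\t' '
  NF == 2 { seen[$2] = 1 }            # element row
  NF == 3 { seen[$2] = 1; n[$2]++ }   # attribute row
  END { for (el in seen) printf "%s\t%d\n", el, n[el] }
' "$list" | sort)

echo "$attr_counts"
```

The output is itself tab-separated, so it can be pasted straight into a spreadsheet.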

  32. normalizeRNG.xsl
      <xsl:stylesheet version="1.0"
          xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
          xmlns:r="http://relaxng.org/ns/structure/1.0">
        <xsl:output indent="yes"/>
        <xsl:template match="r:element/r:ref | r:optional/r:ref">
          <xsl:variable name="referent" select="@name"/>
          <xsl:apply-templates select="//r:define[@name = $referent]" mode="copying"/>
        </xsl:template>
        <xsl:template match="@*|node()">
          <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
          </xsl:copy>
        </xsl:template>
        <xsl:template match="r:define" mode="copying">
          <xsl:apply-templates select="node()"/>
        </xsl:template>
      </xsl:stylesheet>

  33. Analyzing an SGML DTD • Why? When migrating away from it • RNG or W3C XSD both XML, but not SGML • Using Earl Hood’s perlSGML DTD analysis tools

  34. XML-based analysis of SGML DTD • Run Earl Hood’s dtd2html utility • Run tagsoup or HTML Tidy on output files • Now you’ve got XML where you can pull out element information with XSLT

  35. XML-based analysis of SGML DTD (revised) • Tweak dtd2html to add <div class="whatever"></div> elements • Run Earl Hood’s dtd2html utility • Run tagsoup or HTML Tidy on output files • Now you’ve got XML where you can pull out element information with XSLT

  36. Summary • This is not an integrated report generator. It’s Legos. • Pipelining data between existing tools, re-usable scripts, and quick hacks. • Document your command lines, e.g. saxon temp1.xml temp3.xsl > temp1a.xml • Clients like reports, especially in spreadsheets.
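
To illustrate the last two points, here is a small sketch of a documented command line that turns `uniq -c` output into a CSV file that a spreadsheet can open. The script name, the counts.csv file name, the column headers, and the sample element names are all hypothetical. The names are assumed to contain no commas or quotes, which holds for XML element names.

```shell
#!/bin/sh
# report.sh (hypothetical name): a record of the commands behind a report.
# Sketch only: builds an element-count CSV from one-name-per-line input.
workdir=$(mktemp -d)
csv="$workdir/counts.csv"

# The printf stands in for a real step such as:
#   saxon issueContents.xml countElements.xsl
printf '%s\n' ce:italic ce:label ce:italic ce:hsp ce:italic ce:label |
  sort | uniq -c | sort -rn |
  awk 'BEGIN { print "element,count" } { print $2 "," $1 }' > "$csv"

cat "$csv"
```

Keeping the pipeline in a script file means the report can be regenerated, and explained to a client, long after the original shell history is gone.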

  37. Thank you! • Referenced resources: http://www.snee.com/xml/xml2008
