
XSL, Swish-e and DjVu

XSL, Swish-e and DjVu. Kevin Reiss, Rutgers-Newark School of Law Library. March 10th, 2004, TAG Meeting. Project Description: New Jersey Digital Legal Library (URL: http://njlegallib.rutgers.edu). Create a searchable and browsable repository of previously unavailable NJ legal information.


Presentation Transcript


  1. XSL, Swish-e and DjVu Kevin Reiss Rutgers-Newark School of Law Library March 10th, 2004 TAG Meeting

  2. Project Description: New Jersey Digital Legal Library
     • URL: http://njlegallib.rutgers.edu
     • Create a searchable and browsable repository of previously unavailable NJ legal information
     • 3 collections:
       • New Jersey Administrative Reports, 1979-1991
       • New Jersey Executive Orders, 1941 – January 1990
       • New Jersey Attorney General Opinions
     • Collection 1 was scanned professionally by Princeton Imaging; Collections 2 and 3 were done in house on a flatbed Minolta PS7000
     • OCR quality is good in Collection 1, poor in Collections 2 and 3
     • Available in PDF and DjVu [with embedded OCR text]
     • DjVu created with LizardTech Document Express 3.1
     • PDF created with c42pdf [http://c42pdf.ffii.org/]

  3. Project Requirements
     • Use only open-source tools [other than for document creation]
     • Provide full-text searching as well as searching within specific metadata fields
     • Documents need to be indexed and retrieved as atomic units, rather than at the page level
     • Solution:
       • Store the metadata and full text of each document in the same unit, and find an indexing program that can index both
     • Ultimate solution:
       • Extract the OCR text from the DjVu files using djvutoxml
       • Use XSL to combine the djvutoxml output and the XML metadata into a single XHTML file per document
       • Use Swish-e to index and search the XHTML files
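The "same unit" idea can be sketched in a few lines. The project does this step with XSL; the Python below is only a stand-in to show the shape of the result, and the function name `build_xhtml`, the field names, and the sample values are all made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_xhtml(metadata, ocr_text):
    """Put metadata (as <meta> tags) and full text into one XHTML document.

    This is a Python stand-in for the XSL step described on the slide,
    not the project's actual stylesheet.
    """
    html = ET.Element("html", {"xmlns": "http://www.w3.org/1999/xhtml"})
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = metadata.get("title", "")
    # One <meta name="..." content="..."/> per metadata field, so an indexer
    # that understands HTML meta tags can search each field separately.
    for name, content in metadata.items():
        ET.SubElement(head, "meta", {"name": name, "content": content})
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "p").text = ocr_text
    return ET.tostring(html, encoding="unicode")


# Placeholder values, not real collection data.
doc = build_xhtml(
    {"title": "Sample Document", "date": "1941"},
    "Sample OCR text",
)
print(doc)
```

Because the whole document ends up in one file, the indexer returns documents rather than individual pages, which is the requirement stated above.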

  4. Swish-e Basics
     • URL: http://www.swish-e.org/
     • Simple Web Indexing System for Humans - Enhanced
     • Full-text indexing program written in C, freely available
     • Special indexing modes for XML and HTML documents; can index any plain-text format
     • Uses standard open-source filtering tools to index ps, pdf, word, and ps.gz documents
     • Can index both file systems and content fetched over HTTP
     • Supports several stemming algorithms
     • Supports Boolean searching
     • Supports wildcard and phrase searching
     • Indexing is controlled by a standard configuration file format
     • Uses libxml to parse XML and HTML documents

  5. Why Choose Swish-e
     • It can index and search HTML meta tags
     • It is fast: it indexes several thousand files in a few seconds
     • Decent index compression: approx. 700 pages with metadata result in a 13.5 MB index
     • SWISH::API, a Perl module for embedding Swish-e in applications, is available
       • This module forms the basis of a fairly functional demo web-based search app that can be used to build your own search interface
     • Easy to select the meta or XML tags you wish to index and return with search results, using the “metatag” and “property” declarations in the Swish-e config file
     • Excellent documentation [http://www.swish-e.org/current/docs/]
     • Under active development; version 2.4.2 was released just yesterday
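As a rough illustration of searching within a metadata field, the sketch below assembles a `swish-e` command line in Python. It assumes the standard switches `-f` (index file), `-w` (query), `-m` (maximum results) and `-p` (properties to return), and the `field=(words)` query form; the index name `njlaw.index` and the field names are hypothetical. The command is only built and printed, not executed.

```python
import shlex


def metaname_query(field, words):
    """Build a field-restricted query such as subject=(water rights)."""
    return "{}=({})".format(field, " ".join(words))


def search_command(index_file, query, properties=(), max_results=10):
    """Return the argument list for a swish-e search (not executed here)."""
    cmd = ["swish-e", "-f", index_file, "-w", query, "-m", str(max_results)]
    if properties:
        # -p asks swish-e to return the named properties with each hit.
        cmd += ["-p"] + list(properties)
    return cmd


# Hypothetical index and field names, for illustration only.
query = metaname_query("subject", ["water", "rights"])
cmd = search_command("njlaw.index", query, properties=["title", "date"])
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The fields used in queries and the properties returned with results have to be declared in the Swish-e config file first, as noted on the slide.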

  6. XSL Basics
     • Extensible Stylesheet Language [http://www.w3.org/Style/XSL/]
     • Really two W3C XML standards:
       • XSLT: a transformation language for XML documents
       • XSL-FO: a language for specifying formatting semantics, much more powerful than CSS, generally used for print publications
     • Written as well-formed XML
     • Some predict it will take on SQL-like functionality for XML documents
     • Based on the paradigm of functional programming
     • XSLT transformations are executed using an XSLT processor
       • Many Java-based XSLT processors exist
       • I use libxml [http://xmlsoft.org/], a C-based library that includes an XML parser and an XSLT processor
     • Takes an XML document as input and transforms it into XML, HTML, or plain-text output
     • The instructions for this transformation are located in XSLT stylesheets
     • Transforms one tree into another

  7. XSL Syntax Basics
     • Stylesheets are constructed from a series of “templates” that match nodes or groups of nodes in an XML document
       • Example: main XSL stylesheet for djvu2xhtml conversion
     • Groups of nodes are selected by writing XPath expressions
       • XPath is another W3C standard [http://www.w3.org/TR/xpath]
       • Purpose: “a language for addressing parts of an XML document”
     • XSLT has a number of familiar procedural constructs: looping, branching, named variables
       • Example of variables: parameter stylesheet for djvu2xhtml
     • Some problems:
       • Can be slow for large documents [the whole document is loaded into memory]
       • Multiple input and output documents are clunky
       • String processing is problematic: there are no regexes, so complicated tasks typically require recursive templates
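To make the string-processing complaint concrete, here is a Python sketch of the recursion pattern that XSLT 1.0 pushes you toward. The helpers imitate the XPath 1.0 functions `substring-before` and `substring-after`; `tokenize` is a hypothetical name for the kind of recursive named template one ends up writing. It illustrates the pattern only and is not taken from the djvu2xhtml stylesheets.

```python
def substring_before(s, sep):
    """Like XPath substring-before(): "" when sep does not occur in s."""
    i = s.find(sep)
    return s[:i] if i >= 0 else ""


def substring_after(s, sep):
    """Like XPath substring-after(): "" when sep does not occur in s."""
    i = s.find(sep)
    return s[i + len(sep):] if i >= 0 else ""


def tokenize(s, sep):
    """Split s on sep by recursion, dropping empty pieces.

    XSLT 1.0 variables cannot be reassigned, so a split cannot be written
    as a loop that updates a variable; the template has to call itself on
    the remainder of the string instead.
    """
    if not sep:
        raise ValueError("separator must be non-empty")
    if sep not in s:
        return [s] if s else []
    head = substring_before(s, sep)
    rest = substring_after(s, sep)
    return ([head] if head else []) + tokenize(rest, sep)


print(tokenize("a,b,,c", ","))
```

In a language with regular expressions this would be a one-line split, which is why XSLT 2.0's string functions are attractive for this kind of work.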

  8. DjVuXML Tools
     • Part of DjVuLibre 3.5.12 or higher
     • URL: http://djvu.sourceforge.net/doc/man/djvuxml.html
     • Perform djvused-like functions (annotations, highlighting) using XML syntax
     • djvutoxml outputs an XML serialization of a DjVu document
       • Example output: results in very large files
       • The output reflects line, page and column information, and can vary quite a bit from document type to document type
     • Unrecognized OCR often results in Unicode errors, so use the provided xml2utf8 or xml2utf16 filters
     • Provides a set of coordinates for regions in a DjVu document, but these differ from what the plug-in understands
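The sketch below shows one way to read the hidden-text layer out of djvutoxml output with Python's standard library. The sample is hand-written and heavily trimmed; it assumes the usual element names (`OBJECT` per page, then `LINE` and `WORD` with a `coords` attribute), and real output carries a DOCTYPE and more attributes. The coordinate values are returned as plain integers without interpreting them, since their meaning is exactly the open problem noted on the slide.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample in the shape of djvutoxml output (an assumption).
SAMPLE_DJVUXML = """<DjVuXML><BODY>
<OBJECT data="sample.djvu" type="image/x.djvu" height="3300" width="2550">
<PARAM name="PAGE" value="p0001.djvu"/>
<HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD coords="100,200,260,150">Sample</WORD> <WORD coords="280,200,400,150">order</WORD></LINE>
<LINE><WORD coords="100,300,220,250">text</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT>
</OBJECT>
</BODY></DjVuXML>"""


def page_lines(djvuxml):
    """Return one list of text lines per page (one OBJECT element per page)."""
    root = ET.fromstring(djvuxml)
    pages = []
    for obj in root.iter("OBJECT"):
        lines = []
        for line in obj.iter("LINE"):
            words = [(w.text or "").strip() for w in line.iter("WORD")]
            lines.append(" ".join(w for w in words if w))
        pages.append(lines)
    return pages


def word_boxes(djvuxml):
    """Return (word, coords) pairs; coords are the raw integers from the file."""
    root = ET.fromstring(djvuxml)
    boxes = []
    for w in root.iter("WORD"):
        coords = w.get("coords")
        if coords:
            boxes.append(((w.text or "").strip(),
                          tuple(int(n) for n in coords.split(","))))
    return boxes


print(page_lines(SAMPLE_DJVUXML))
print(word_boxes(SAMPLE_DJVUXML)[0])
```

Loading the whole file into memory this way has the same cost the slides attribute to XSLT on large documents, so it suits batch processing rather than per-request use.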

  9. Workflow
     • Prepare metadata in XML
       • Available in a format based partly on Dublin Core, partly on in-house tags
       • This was extracted from static HTML pages
     • Prepare customized metadata and display information for the documents to be transformed: example
       • I use emacs nxml-mode for editing XML documents
     • Invoke the DjVuXML commands
     • Transform documents to XHTML: example
     • Prepare the Swish-e index
       • Put meta and property information in the config file
     • Prepare the search interface
       • Put meta and property information in the CGI interface config
       • Put display-related meta and property information in the search template file
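The command-line part of this workflow can be planned as a batch, roughly as sketched below. The sketch assumes the usual invocations `djvutoxml input.djvu output.xml`, `xsltproc -o output stylesheet input` and `swish-e -c config`; every file and directory name is hypothetical, and the commands are only listed, not run. The encoding filter step from the DjVuXML slide is left out because its exact invocation is not given in the talk.

```python
from pathlib import PurePosixPath


def batch_commands(djvu_files, stylesheet, out_dir, swish_config):
    """List the commands for one batch run: extract, transform, then index."""
    out = PurePosixPath(out_dir)
    commands = []
    for name in djvu_files:
        djvu = PurePosixPath(name)
        xml_path = out / (djvu.stem + ".xml")
        xhtml_path = out / (djvu.stem + ".xhtml")
        # 1. Serialize the DjVu document, including its OCR text, to XML.
        commands.append(["djvutoxml", str(djvu), str(xml_path)])
        # 2. Apply the stylesheet to produce one XHTML file per document.
        commands.append(
            ["xsltproc", "-o", str(xhtml_path), str(stylesheet), str(xml_path)]
        )
    # 3. Rebuild the index once, after all documents are transformed.
    commands.append(["swish-e", "-c", str(swish_config)])
    return commands


# Hypothetical paths, for illustration only.
plan = batch_commands(
    ["scans/eo001.djvu", "scans/eo002.djvu"],
    "djvu2xhtml.xsl",
    "build",
    "swish.conf",
)
for command in plan:
    print(" ".join(command))
```

Each list could be handed to `subprocess.run` on a machine where the tools are installed; indexing comes last so the index is rebuilt only once per batch.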

  10. Problems
     • Disk space usage could become an issue
     • XSLT transformations of the djvuxml format are too slow for any real-time processing; they must be done in batch
     • Updating or adding metadata must be done by hand or by program; there is no data entry interface
     • Swish-e has limited support for indexing XML attributes
     • Swish-e can only index specific fields in XML documents that are defined as properties
     • To enable highlighting in DjVu documents, the coordinate problem will need to be solved
     • Complicated modifications to the search interface are time consuming and require you to learn one of the Perl HTML template mechanisms, such as Template::Toolkit or HTML::Template

  11. Future Directions
     • Explore fully XML-aware indexing engines
       • Amberfish
       • eXist: example apps, based on XQuery
       • Xindice
     • Search interface improvements
       • Take users directly to their keyword in the document
       • Dynamically generate the browsing pages for the collection based on information in the metadata files [currently static HTML]
     • DjVuXSL stylesheet improvements
       • Work on string processing capabilities to recognize paragraphs and lists
       • Rework the use of the document() function to improve processing speed
       • Try XSLT 2.0 to see if the new string processing capabilities can help
       • Learn more about the structure of DjVu documents to make the stylesheets more reliable

  12. Useful Links
     • DjVuXSL
       • DjVuXSL Stylesheets Homepage
       • Guide to Dublin Core in HTML
     • Swish-e
       • Current Swish-e Documentation
     • XSL
       • XSL-List
       • Jeni Tennison's XSLT Pages
       • Book: XSLT Programmer's Reference
       • XSLT 1.0 Tutorial
       • XSLT 2.0 Introduction
       • XSLT 2.0 Implementation
