
XTF in Depth

XTF in Depth. Powerful Search and Display for Electronic Text. Martin Haye California Digital Library. January 2009 presentation at University of Sydney. XTF in Depth. Part 1: What is XTF and how does it compare? Who is using it? What needs does it address? New features in 2.1


Presentation Transcript


  1. XTF in Depth Powerful Search and Display for Electronic Text Martin Haye, California Digital Library January 2009 presentation at University of Sydney

  2. XTF in Depth • Part 1: • What is XTF and how does it compare? • Who is using it? • What needs does it address? • New features in 2.1 • Design and data flow • Adapting Lucene and Saxon • Planned improvements • Part 2: • Interactive demos

  3. XTF in 5 minutes • eXtensible Text Framework • Search and display technology from CDL • Open-source Java framework • Powerful and highly configurable • All about rapid prototyping, fast deployment, and incremental improvement • XML + Full text search • Also indexes PDF, HTML, Word • Excel and PowerPoint coming soon

  4. XTF in 5 minutes • Search: Query power/speed of Lucene, plus: • search results shown in context • keyword search, facets, spelling, lots more • View: Processing power of Saxon, plus: • large file optimizations, hit markup • Configure and customize exclusively in XSLT • Flexible, overlapping collections • Mature, tightly integrated, well documented • In use at CDL and many other places

  5. What XTF is not • It is not a content management system • Creation (conversion, scanning, manual) • Ingest / administration • Editing • Preservation • Not built for remote administration • Not a true XML database • but close • Not Google • Google: one interface to vast grab-bag of data • XTF: crafted interfaces to high-quality data sets

  6. How does XTF compare? [Chart: Greenstone*, Solr*, XTF 2.0, and XTF 2.1 placed along two axes, "Turn-key / easy" and "Customizable / powerful"] * Caveat: based on my limited experience with Greenstone and Solr

  7. Online Archive of California

  8. eScholarship Editions

  9. calisphere

  10. Mark Twain Project Online

  11. UC Berkeley

  12. University of Sydney

  13. Encyclopedia of Chicago

  14. Indiana University: Newton

  15. Indiana University: Swinburne

  16. Sweden

  17. Brazil

  18. Italy

  19. Needs • Let’s look at four needs that XTF was created to address: • Diverse data • Open software • Rapid deployment • Community involvement

  20. Needs: 1. Diverse data • Our collections: many and diverse • eScholarship (TEI, PDF) • UC Press monographs (a text may be > 10 megs) • 25,000 scholarly articles in PDF • Mark Twain • Hand-crafted critical edition (TEI + MODS) • OAC: finding aids, images, books, manuscripts • Japanese American Relocation Digital Archives • TEI, EAD, MODS • Book scanning projects (Google, Internet Archive) • Thousands of scanned books (PDF + DC) • Millions of Melvyl catalog records (MARC)

  21. Needs: 2. Open software • Digital Publishing Products • “Black box” (no control over fixes & features) • Often not standards-based • Tech companies have short lifespans • Support often spotty • Data can be held hostage, or even lost • $$$$$

  22. Needs: 3. Rapid deployment • New collections arriving • Users don't want to wait a year for access • Many “what if” and “wouldn't it be cool” requests from our staff • Java programmers are expensive • Look & feel goes stale quickly • Barrage of feature requests

  23. Needs: 4. Community involvement • We want to share the load • For XTF 2.1, we asked the XTF community to vote for features they wanted • At CDL we try to align our development to needs of the community • Result: Everybody benefits

  24. New and improved in 2.1 • Faceted browse • Search flexibility • Bookbag • Spelling correction • Similar items • OAI-PMH

  25. Faceted browse • Previously, implementing faceted browse required lots of XSLT programming • Hierarchical facets: even harder • Fixing this required us to deeply refactor the stylesheets, but now it’s simple to add new facets
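As a rough illustration of what hierarchical faceting involves (this is not XTF's implementation, which is configured in XSLT), the sketch below counts documents at every level of a facet hierarchy. The `::` level separator, the function name `count_hierarchical_facets`, and the sample subject values are all assumptions made for the example.

```python
from collections import Counter

SEPARATOR = "::"  # assumed level separator for this sketch


def count_hierarchical_facets(doc_facets):
    """Count documents under every prefix of each hierarchical facet value.

    doc_facets: one list of facet values per document,
    e.g. ["History::California::Gold Rush"].
    """
    counts = Counter()
    for values in doc_facets:
        seen = set()  # count each document at most once per facet node
        for value in values:
            parts = value.split(SEPARATOR)
            for depth in range(1, len(parts) + 1):
                seen.add(SEPARATOR.join(parts[:depth]))
        counts.update(seen)
    return counts


# Hypothetical sample data: three documents, one facet value each.
docs = [
    ["History::California::Gold Rush"],
    ["History::California"],
    ["Literature::American"],
]
counts = count_hierarchical_facets(docs)
for node in sorted(counts):
    print(node, counts[node])
```

Counting every prefix means a parent node such as `History` reports all documents beneath it, which is what lets a browse page show totals before the user expands a branch.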

  26. Faceted browse

  27. Faceted browse

  28. Hierarchical facets

  29. Hierarchical facets

  30. Search flexibility • Keyword search: single box (now default). Internally, searches multiple fields. • Advanced search: explicitly fill in constraints for various fields • Freeform search (new): text-based field specifiers, AND, OR, parentheses, etc.
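To make "internally, searches multiple fields" concrete, here is a minimal sketch of how a single-box keyword query could be expanded into a fielded boolean query. The field names, the `field:term` notation, and the rule that every term must match in at least one field are assumptions for illustration, not XTF's documented behavior.

```python
DEFAULT_FIELDS = ("text", "title", "creator", "subject")  # hypothetical field names


def expand_keyword_query(query, fields=DEFAULT_FIELDS):
    """Rewrite a single-box query so each term may match in any of the fields."""
    clauses = []
    for term in query.split():
        alternatives = " OR ".join(f"{field}:{term}" for field in fields)
        clauses.append(f"({alternatives})")
    # Assumed semantics: all terms are required, each in at least one field.
    return " AND ".join(clauses)


print(expand_keyword_query("twain river", fields=("text", "title")))
# prints: (text:twain OR title:twain) AND (text:river OR title:river)
```

The expanded string uses the same ingredients the freeform search exposes to users (field specifiers, AND, OR, parentheses), which suggests why one query back-end can serve all three search forms.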

  31. Keyword search

  32. Advanced search

  33. Freeform search

  34. OAI-PMH • This fit nicely into XTF’s architecture • Simple but conforming implementation
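For readers unfamiliar with OAI-PMH, a harvester talks to a repository through plain HTTP requests that carry a `verb` parameter. The sketch below builds two such request URLs. `Identify`, `ListRecords`, `metadataPrefix`, and `oai_dc` come from the OAI-PMH specification; the base URL and the helper name `oai_request_url` are hypothetical, and the actual endpoint path depends on the XTF installation.

```python
from urllib.parse import urlencode

BASE_URL = "http://example.org/xtf/oai"  # hypothetical endpoint, not a real server


def oai_request_url(verb, base_url=BASE_URL, **arguments):
    """Build an OAI-PMH request URL from a verb and its arguments."""
    params = {"verb": verb}
    params.update(arguments)
    return f"{base_url}?{urlencode(params)}"


# Ask the repository to describe itself.
print(oai_request_url("Identify"))
# Harvest records as unqualified Dublin Core, the format every repository must support.
print(oai_request_url("ListRecords", metadataPrefix="oai_dc"))
```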

  35. Bookbag • Refactored the AJAX to use YUI (Yahoo User Interface widgets) • Still session based • Now supports emailing the bookbag

  36. Bookbag

  37. Bookbag

  38. Bookbag

  39. Spelling correction • Unicode bug fixes • On by default and fully integrated

  40. Spelling correction

  41. Spelling correction

  42. Similar items • Allows user to see “more like this” • Improved AJAX integration • On by default - no configuration needed

  43. Similar items

  44. Similar items

  45. Other changes in XTF 2.1 • Built-in NLM “Blue”, TEI P5, MS Word support (still supports TEI P4, EAD, PDF, HTML, text) • Valid XHTML output • RawQuery servlet to provide a query back-end to a separate front-end (e.g. Ruby) or mash-up • Bug fixes and minor changes (many reported/requested by users)

  46. Wiki documentation

  47. Wiki documentation

  48. Design philosophy • Adaptation through programming • XTF is still about building what you want using a set of powerful tools • But now: • Stylesheets are more modular • Build interfaces faster using honed widgets • Prettier UI to start with

  49. XTF is open, standards based • Based on free, open-source tools: • Java SDK 1.5+ • Lucene 2.1 full-text search toolkit • Saxon 8.9 XSLT processor • Unicode support throughout • XTF itself is open-source (BSD license) • No native code – pure Java and XSLT 2.0 • Runs on Windows, Solaris, Linux, MacOS • Drops right into Tomcat or Resin • Lots of user-fixable documentation

  50. Modular • Use the crossQuery servlet to search, dynaXML to display and navigate. Deploy one or both. • Stylesheets govern flow of data – no Java programming required • Easy to add features incrementally • 100% configurable “look and feel” • Skin & slice: one system can have several interfaces and multiple “brands” • Collection subsetting driven by meta-data
