XTF in Depth

XTF in Depth: Powerful Search and Display for Electronic Text • Martin Haye, California Digital Library • January 2009 presentation at University of Sydney

XTF in Depth • Part 1: • What is XTF and how does it compare? • Who is using it? • What needs does it address? • New features in 2.1 • Design and data flow • Adapting Lucene and Saxon • Planned improvements • Part 2: • Interactive demos

XTF in 5 minutes • eXtensible Text Framework • Search and display technology from CDL • Open-source Java framework • Powerful and highly configurable • All about rapid prototyping, fast deployment, and incremental improvement • XML + full-text search • Also indexes PDF, HTML, Word • Excel and PowerPoint coming soon

XTF in 5 minutes • Search: Query power/speed of Lucene, plus: • search results shown in context • keyword search, facets, spelling, lots more • View: Processing power of Saxon, plus: • large file optimizations, hit markup • Configure and customize exclusively in XSLT • Flexible, overlapping collections • Mature, tightly integrated, well documented • In use at CDL and many other places

What XTF is not • It is not a content management system • Creation (conversion, scanning, manual) • Ingest / administration • Editing • Preservation • Not built for remote administration • Not a true XML database • but close • Not Google • Google: one interface to a vast grab-bag of data • XTF: crafted interfaces to high-quality data sets

How does XTF compare? • [Chart: Greenstone*, Solr*, XTF 2.0, and XTF 2.1 placed along two axes, "Turn-key / easy" and "Customizable / powerful"] • * Caveat: based on my limited experience with Greenstone and Solr

Online Archive of California

eScholarship Editions

calisphere

Mark Twain Project Online

UC Berkeley

University of Sydney

Encyclopedia of Chicago

Indiana University: Newton

Indiana University: Swinburne

Sweden

Brazil

Italy

Needs • Let’s look at four needs that XTF was created to address: • Diverse data • Open software • Rapid deployment • Community involvement

Needs: 1. Diverse data • Our collections: many and diverse • eScholarship (TEI, PDF) • UC Press monographs (a text may be > 10 megs) • 25,000 scholarly articles in PDF • Mark Twain • Hand-crafted critical edition (TEI + MODS) • OAC: finding aids, images, books, manuscripts • Japanese American Relocation Digital Archives • TEI, EAD, MODS • Book scanning projects (Google, Internet Archive) • Thousands of scanned books (PDF + DC) • Millions of Melvyl catalog records (MARC)

Needs: 2. Open software • Digital Publishing Products • “Black box” (no control over fixes & features) • Often not standards-based • Tech companies have short lifespans • Support often spotty • Data can be held hostage, or even lost • $$$$$

Needs: 3. Rapid deployment • New collections arriving • Users don't want to wait a year for access • Many “what if” and “wouldn't it be cool” requests from our staff • Java programmers are expensive • Look & feel goes stale quickly • Barrage of feature requests

Needs: 4. Community involvement • We want to share the load • For XTF 2.1, we asked the XTF community to vote for features they wanted • At CDL we try to align our development with the needs of the community • Result: Everybody benefits

New and improved in 2.1 • Faceted browse • Search flexibility • Bookbag • Spelling correction • Similar items • OAI-PMH

Faceted browse • Previously, implementing faceted browse required lots of XSLT programming • Hierarchical facets were even harder • Supporting facets properly required us to deeply refactor the stylesheets, but now it’s simple to add new facets

Faceted browse

Hierarchical facets
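
To make the idea of a hierarchical facet concrete, here is a minimal Python sketch. It is not XTF's implementation (XTF facets are configured in XSLT); it only illustrates the counting involved. It assumes each document carries a facet value written as a path, and the "::" separator and the sample values are hypothetical choices for this sketch.

```python
from collections import Counter

SEP = "::"  # hypothetical level separator, chosen for this sketch


def facet_counts(values):
    """Count documents at every level of a hierarchical facet.

    Each value is a path such as "California::Alameda". A document
    counts once toward its own level and once toward each ancestor.
    """
    counts = Counter()
    for value in values:
        parts = value.split(SEP)
        for depth in range(1, len(parts) + 1):
            counts[SEP.join(parts[:depth])] += 1
    return counts


# Made-up facet values, one per document
docs = [
    "California::Alameda",
    "California::Alameda",
    "California::Marin",
    "Nevada",
]
counts = facet_counts(docs)
```

A browse interface could show the top-level counts first and reveal the child counts when a user expands a group.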

Search flexibility • Keyword search: single box (now default). Internally, searches multiple fields. • Advanced search: explicitly fill in constraints for various fields • Freeform search (new): text-based field specifiers, AND, OR, parentheses, etc.
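
As a rough illustration of how a single keyword box can search multiple fields internally, the Python sketch below expands each term into an OR across fields and joins the terms with AND, using Lucene-style `field:term` notation. The field names and the function are hypothetical; this is not XTF's actual query handling, and it does not escape special characters.

```python
FIELDS = ["title", "creator", "subject", "text"]  # hypothetical field names


def expand_keywords(query, fields=FIELDS):
    """Turn a single-box keyword query into a fielded query string.

    Every term must match (AND), but each term may match in any
    of the given fields (OR).
    """
    clauses = []
    for term in query.split():
        alternatives = " OR ".join(f"{field}:{term}" for field in fields)
        clauses.append(f"({alternatives})")
    return " AND ".join(clauses)


expanded = expand_keywords("twain letters", fields=["title", "text"])
```

The expanded string has the same shape a user could type directly into a freeform search box: field specifiers, AND, OR, and parentheses.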

Keyword search

Advanced search

Freeform search

OAI-PMH • Simple but conforming implementation • This fit nicely into XTF’s architecture
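
For readers unfamiliar with OAI-PMH, a harvester talks to a repository with plain HTTP requests that carry a `verb` parameter. The Python sketch below builds two such request URLs using verbs and the `oai_dc` metadata prefix defined by the OAI-PMH 2.0 protocol. The base URL is a placeholder, not a real XTF endpoint, and the helper function is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://example.org/xtf/oai"  # placeholder endpoint, an assumption


def oai_request(verb, **params):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    return f"{BASE_URL}?{urlencode({'verb': verb, **params})}"


# Ask the repository to describe itself
identify_url = oai_request("Identify")

# Harvest records as unqualified Dublin Core
list_url = oai_request("ListRecords", metadataPrefix="oai_dc")
```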

Bookbag • Refactored the AJAX to use YUI (Yahoo User Interface widgets) • Still session based • Now supports emailing the bookbag

Bookbag

Spelling correction • Unicode bug fixes • On by default and fully integrated

Spelling correction

Similar items • Allows user to see “more like this” • Improved AJAX integration • On by default - no configuration needed

Similar items

Other changes in XTF 2.1 • Built-in NLM “Blue”, TEI P5, MS Word support (TEI P4, EAD, PDF, HTML, and text still supported) • Valid XHTML output • RawQuery servlet to provide a query back-end to a front-end (e.g. in Ruby) or mash-up • Bug fixes and minor changes (many reported/requested by users)

Wiki documentation

Design philosophy • Adaptation through programming • XTF is still about building what you want using a set of powerful tools • But now: • Stylesheets are more modular • Build interfaces faster using honed widgets • Prettier UI to start with

XTF is open, standards-based • Based on free, open-source tools: • Java SDK 1.5+ • Lucene 2.1 full-text search toolkit • Saxon 8.9 XSLT processor • Unicode support throughout • XTF itself is open-source (BSD license) • No native code – pure Java and XSLT 2.0 • Runs on Windows, Solaris, Linux, MacOS • Drops right into Tomcat or Resin • Lots of user-fixable documentation

Modular • Use the crossQuery servlet to search, dynaXML to display and navigate. Deploy one or both. • Stylesheets govern flow of data – no Java programming required • Easy to add features incrementally • 100% configurable “look and feel” • Skin & slice: one system can have several interfaces and multiple “brands” • Collection subsetting driven by metadata
