

XTF in Depth

Powerful Search and Display for Electronic Text

Martin Haye, California Digital Library

January 2009 presentation at University of Sydney

XTF in Depth
  • Part 1:
    • What is XTF and how does it compare?
    • Who is using it?
    • What needs does it address?
    • New features in 2.1
    • Design and data flow
    • Adapting Lucene and Saxon
    • Planned improvements
  • Part 2:
    • Interactive demos
XTF in 5 minutes
  • eXtensible Text Framework
  • Search and display technology from CDL
  • Open-source Java framework
  • Powerful and highly configurable
  • All about rapid prototyping, fast deployment, and incremental improvement
  • XML + Full text search
  • Also indexes PDF, HTML, Word
    • Excel and Powerpoint coming soon
XTF in 5 minutes
  • Search: Query power/speed of Lucene, plus:
    • search results shown in context
    • keyword search, facets, spelling, lots more
  • View: Processing power of Saxon, plus:
    • large file optimizations, hit markup
  • Configure and customize exclusively in XSLT
  • Flexible, overlapping collections
  • Mature, tightly integrated, well documented
  • In use at CDL and many other places
What XTF is not
  • It is not a content management system
    • Creation (conversion, scanning, manual)
    • Ingest / administration
    • Editing
    • Preservation
  • Not built for remote administration
  • Not a true XML database
    • but close
  • Not Google
    • Google: one interface to vast grab-bag of data
    • XTF: crafted interfaces to high-quality data sets
How does XTF compare?

[Chart: Greenstone*, Solr*, XTF 2.0, and XTF 2.1 positioned along two axes: "Turn-key / easy" and "Customizable / Powerful"]

* Caveat: based on my limited experience with Greenstone and Solr

Needs
  • Let’s look at four needs that XTF was created to address:
    • Diverse data
    • Open software
    • Rapid deployment
    • Community involvement
Needs: 1. Diverse data
  • Our collections: many and diverse
    • eScholarship (TEI, PDF)
      • UC Press monographs (a text may be > 10 megs)
      • 25,000 scholarly articles in PDF
    • Mark Twain
      • Hand-crafted critical edition (TEI + MODS)
    • OAC: finding aids, images, books, manuscripts
      • Japanese American Relocation Digital Archives
      • TEI, EAD, MODS
    • Book scanning projects (Google, Internet Archive)
      • Thousands of scanned books (PDF + DC)
      • Millions of Melvyl catalog records (MARC)
Needs: 2. Open software
  • Digital Publishing Products
    • “Black box” (no control over fixes & features)
    • Often not standards-based
    • Tech companies have short lifespans
    • Support often spotty
    • Data can be held hostage, or even lost
    • $$$$$
Needs: 3. Rapid deployment
  • New collections arriving
  • Users don't want to wait a year for access
  • Many “what if” and “wouldn't it be cool” requests from our staff
  • Java programmers are expensive
  • Look & feel goes stale quickly
  • Barrage of feature requests
Needs: 4. Community involvement
  • We want to share the load
  • For XTF 2.1, we asked the XTF community to vote for features they wanted
  • At CDL we try to align our development to needs of the community
  • Result: Everybody benefits
New and improved in 2.1
  • Faceted browse
  • Search flexibility
  • Bookbag
  • Spelling correction
  • Similar items
  • OAI-PMH
Faceted browse
  • Previously, implementing faceted browse required lots of XSLT programming.
  • Hierarchical facets were even harder.
  • We had to deeply refactor the stylesheets, but now it’s simple to add new facets.
Search flexibility
  • Keyword search: single box (now default). Internally, searches multiple fields.
  • Advanced search: explicitly fill in constraints for various fields
  • Freeform search (new): text-based field specifiers, AND, OR, parentheses, etc.
OAI-PMH
  • This fit nicely into XTF’s architecture
  • Simple but conforming implementation
Bookbag
  • Refactored the AJAX to use YUI (Yahoo! User Interface Library)
  • Still session based
  • Now supports emailing the bookbag
Spelling correction
  • Unicode bug fixes
  • On by default and fully integrated
Similar items
  • Allows user to see “more like this”
  • Improved AJAX integration
  • On by default - no configuration needed
Other changes in XTF 2.1
  • Built-in NLM “Blue”, TEI P5, MS Word support (still support TEI P4, EAD, PDF, HTML, text)
  • Valid XHTML output
  • RawQuery servlet to provide a query back-end to a (e.g. Ruby) front-end or mash-up.
  • Bug fixes and minor changes (many reported/requested by users)
Design philosophy
  • Adaptation through programming
  • XTF is still about building what you want using a set of powerful tools

  • But now:
    • Stylesheets are more modular
    • Build interfaces faster using honed widgets
    • Prettier UI to start with
XTF is open, standards based
  • Based on free, open-source tools:
    • Java SDK 1.5+
    • Lucene 2.1 full-text search toolkit
    • Saxon 8.9 XSLT processor
  • Unicode support throughout
  • XTF itself is open-source (BSD license)
  • No native code – pure Java and XSLT 2.0
  • Runs on Windows, Solaris, Linux, MacOS
  • Drops right into Tomcat or Resin
  • Lots of user-fixable documentation
Modular
  • Use the crossQuery servlet to search, dynaXML to display and navigate. Deploy one or both.
  • Stylesheets govern flow of data – no Java programming required
  • Easy to add features incrementally
  • 100% configurable “look and feel”
  • Skin & slice: one system can have several interfaces and multiple “brands”
    • Collection subsetting driven by meta-data
Why XSLT?
  • XSLT is a natural fit for XML
    • Powerful, dynamic language
    • Incredibly high-quality, free processor (Saxon)
  • Why not Java/Struts?
    • Poor for rapid prototyping, steep learning curve
  • Why not Ruby?
    • Not necessarily a good match for XML data
    • Can be too clever by half
    • But a smart mash-up might be cool...
Indexing
  • Input filters adapt to many doc types
    • Any XML doc type
    • PDF, MS Word, plain text, untidy HTML
  • XTF is agnostic regarding:
    • Document identifiers
    • Filesystem organization
      • Uses document selector stylesheet to identify and classify documents in filesystem
    • Meta-data storage
  • Incremental indexing
    • Simply update filesystem then run indexer.
Flexible Search/Display
  • One query, many collections
    • XTF enables “Virtual collections”
  • Output filters for various result views
    • e.g. simple vs. advanced search form, results in brief vs. long format, etc.
  • Query parsers for different search interfaces
    • Interface to other query protocols
    • SRU and OAI-PMH already implemented
    • Should be easy to adapt other queries:
      • Very extensive set of query operators
      • Flexible query composition
  • Faceted browse
Query Power
  • Many operators
    • AND, OR, NEAR, NOT, phrase, range, wildcard
    • Or-Near, multi-field AND, “more like this”
  • Arbitrarily complex queries
    • Combine full-text search with meta-data
    • Unusual queries like: "dynamic duo" near "red phone"
  • Structure-aware searching
    • e.g. search only headings, or only bibliographies
    • But must pre-define which structures to search
More Power
  • Fixed-length snippets
    • Highlight the hit and just the hit
  • Sort by relevance, or any meta-data fields
  • Spelling correction
  • No penalty for huge documents
    • XTF “lazily” pulls in only those parts used by a particular request (e.g. show just Chapter 1)
  • Scalable
    • Proven with 10 million records / 14 GB of data
    • but beyond that, Solr looks better
  • Authentication: IP lists, LDAP, or external
Adapting Lucene and Saxon
  • Adapting Lucene
    • Chunking, flattening, hit marking, stop-words, setting limits, insensitivity, special queries, faceted browsing, spelling correction
  • Adapting Saxon
    • Lazy trees, misc. extensions
Adapting Lucene: Chunking
  • Why
    • Lucene's proximity searches perform best on small documents
    • Small chunks enable efficient generation of 80-character “snippet” surrounding each hit
  • How
    • XTF breaks text blocks into 200-word chunks
    • Chunks overlap to detect a hit starting in one and ending in the next.
    • Each chunk carries structural info, plus pointer to location in XML doc.
    • Only first chunk carries meta-data for doc
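The chunking scheme above can be sketched roughly as follows. This is a Python illustration, not XTF's Java code: the function name `chunk_words`, the 20-word overlap, and the dictionary layout are assumptions made for the example. Only the 200-word chunk size, the overlap idea, the pointer back into the document, and the "meta-data on the first chunk only" rule come from the slide.

```python
def chunk_words(words, meta=None, chunk_size=200, overlap=20):
    """Split a word list into overlapping chunks (illustrative sketch).

    chunk_size=200 follows the slide; overlap=20 is an assumed value.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(words):
        chunks.append({
            # Pointer back to the chunk's position in the source document.
            "start": start,
            "words": words[start:start + chunk_size],
            # Only the first chunk carries the document meta-data.
            "meta": meta if start == 0 else None,
        })
        if start + chunk_size >= len(words):
            break
        start += step
    return chunks


example_words = [f"w{i}" for i in range(450)]
example_chunks = chunk_words(example_words, meta={"title": "Example"})
```

Because consecutive chunks share their boundary words, a phrase that straddles the end of one chunk is still found whole in the next one.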
Adapting Lucene: Flattening XML
  • XSLT prefilter flattens XML structure
    • Series of text blocks
    • Block tagged with structural info for search
    • Prefilter can boost or suppress sections
    • Fine control over proximity matching
  • Prefilter gathers/marks meta-data
    • Can come from within the document, from an XML doc in filesystem, or fetched from a URL.
    • Synthesize meta-data (e.g. sort fields, facets)
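To make the flattening idea concrete, here is a small Python sketch. XTF does this step with an XSLT prefilter; the Python version is only a stand-in, and the function name `flatten`, the slash-separated `section` path, and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Turn an XML document into a flat list of text blocks.

    Each block records the element path it came from, so a search can
    later be restricted to particular structures (e.g. headings only).
    """
    blocks = []

    def walk(elem, path):
        path = path + [elem.tag]
        section = "/".join(path)
        if elem.text and elem.text.strip():
            blocks.append({"section": section, "text": elem.text.strip()})
        for child in elem:
            walk(child, path)
            # Text following a child element still belongs to the parent.
            if child.tail and child.tail.strip():
                blocks.append({"section": section, "text": child.tail.strip()})

    walk(ET.fromstring(xml_text), [])
    return blocks


sample = "<doc><head>Intro</head><p>Some <i>nice</i> text</p></doc>"
sample_blocks = flatten(sample)
```

A real prefilter would also decide which sections to boost or suppress; that part is left out here.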
Adapting Lucene: Hit Marking
  • Marking search hits in context
    • Lucene doesn't pinpoint location of hits, only gives a score per-document
    • Custom enhancements to Lucene's “span” logic score and locate each hit.
    • dynaXML dynamically adds ranked hits to original XML doc, then sends to XSLT formatter.
    • crossQuery forms a snippet around and highlights each hit.
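The snippet step can be illustrated with a short Python sketch. It assumes the hit is already known as a pair of character offsets, which is a simplification; the function name `make_snippet` and the bracket markers are hypothetical, and the 80-character width is the figure quoted on the chunking slide.

```python
def make_snippet(text, hit_start, hit_end, width=80, marks=("[", "]")):
    """Cut a fixed-width window around a hit and mark just the hit."""
    hit_len = hit_end - hit_start
    context = max(width - hit_len, 0)
    # Center the hit, then clamp the window to the bounds of the text.
    left = max(hit_start - context // 2, 0)
    right = min(left + width, len(text))
    left = max(right - width, 0)
    before = text[left:hit_start]
    hit = text[hit_start:hit_end]
    after = text[hit_end:right]
    return before + marks[0] + hit + marks[1] + after


short = make_snippet("a needle here", 2, 8)
```

With the short input above, the whole text fits in the window, so the result is the text with only the hit wrapped in markers.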
Adapting Lucene: Stop-words
  • Robust, efficient stop-word handling
    • “the, a, an, it, on...”
    • People do use them, and expect corresponding results.
    • Lucene normally ignores stop-words, for speed.
    • XTF quietly joins stop-words to adjacent words, forming “n-grams”
    • Example: “man on the moon” -> man-on on-the the-moon
    • Queries are internally rewritten to search for n-grams automatically.
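A minimal Python sketch of the n-gram idea, reproducing the “man on the moon” example. The stop-word set is just the sample from the slide, the function name `stopword_ngrams` is hypothetical, and the handling of ordinary words that have no stop-word neighbor (they are emitted unchanged) is an assumption, not something the slide specifies.

```python
STOP_WORDS = {"the", "a", "an", "it", "on"}


def stopword_ngrams(text, stop_words=STOP_WORDS):
    """Join stop-words to adjacent words instead of dropping them."""
    words = text.lower().split()
    terms = []
    for i, word in enumerate(words):
        is_stop = word in stop_words
        prev_is_stop = i > 0 and words[i - 1] in stop_words
        next_is_stop = i + 1 < len(words) and words[i + 1] in stop_words
        # Assumed rule: a plain word with no stop-word neighbor stands alone.
        if not is_stop and not prev_is_stop and not next_is_stop:
            terms.append(word)
        # Any adjacent pair that involves a stop-word becomes one joined term.
        if i + 1 < len(words) and (is_stop or next_is_stop):
            terms.append(f"{word}-{words[i + 1]}")
    return terms


indexed = stopword_ngrams("man on the moon")
```

Running the same function over the query text is what lets a query be rewritten into n-grams that match the indexed terms.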
Adapting Lucene: Setting Limits
  • Limits on aberrant queries
    • Adjustable limits on number of terms matched by range or wildcard queries
    • N-grams naturally make most queries efficient
    • Configurable limits on amount of “work” performed by a single query.
  • Numeric range query
    • Avoids term expansion
    • Efficiently filters very granular data, e.g. timestamps: 2006-11-14:12:46:03.77
Adapting Lucene: Insensitivity
  • Accent/diacritic marks
    • Many users can't or don't know how to type them
    • XTF indexer uses configurable map to remove accents
    • crossQuery maps query terms
  • Plural
    • Convenient for “cat” to match “cats” also
    • Configurable map of plural to singular used at index and query time
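The two maps can be pictured with a small Python sketch. The map contents and the name `normalize_term` are made up for illustration; the point is only that the same normalization runs on both indexed terms and query terms, so they meet in the middle.

```python
# Hypothetical, tiny stand-ins for the configurable maps.
ACCENT_MAP = str.maketrans({"é": "e", "è": "e", "ñ": "n", "ü": "u", "ç": "c"})
PLURAL_MAP = {"cats": "cat", "mice": "mouse"}


def normalize_term(term):
    """Lower-case, strip mapped accents, then map plural to singular."""
    term = term.lower().translate(ACCENT_MAP)
    return PLURAL_MAP.get(term, term)


# Applied at index time and at query time alike.
indexed_term = normalize_term("Café")
query_term = normalize_term("cafe")
```

Since both sides go through `normalize_term`, a user who types "cafe" still matches a document containing "Café", and "cat" matches "cats".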
Adapting Lucene: Special Queries
  • OR-NEAR
    • Standard OR query doesn't use proximity
    • OR-NEAR: if words nearby, score is boosted
  • Multi-field AND
    • All terms must be present, in any field.
    • Essential for certain keyword searches: against all enemies clarke (matches against title and author)
  • More like this
    • Auto-calculates “interesting” terms in meta-data
    • Creates OR-NEAR query to find similar docs
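The multi-field AND rule is easy to show in miniature. This Python sketch uses a plain dictionary as a stand-in for an indexed document; the function names and the field values are assumptions for the example, and no scoring is attempted.

```python
def multi_field_and(query, doc_fields):
    """True if every query term appears in at least one field of the doc."""
    terms = query.lower().split()
    all_words = set()
    for value in doc_fields.values():
        all_words.update(value.lower().split())
    return all(term in all_words for term in terms)


def single_field_and(query, doc_fields):
    """For contrast: every term must appear within one and the same field."""
    terms = query.lower().split()
    return any(
        all(term in value.lower().split() for term in terms)
        for value in doc_fields.values()
    )


book = {"title": "Against All Enemies", "author": "Clarke"}
found = multi_field_and("against all enemies clarke", book)
```

With a per-field AND the sample query finds nothing, because no single field holds both the title words and the author name. That is why the keyword box needs the multi-field variant.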
Adapting Lucene: Faceted Browsing
  • Draws facet term list from Lucene index
  • Each facet cached in-memory
  • Counts per group created dynamically
  • Special mini-language to sort/select (esp. useful for hierarchical facets)
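Counting per group, including hierarchical groups, might look like the following Python sketch. The `::` level separator, the field name `subject`, and the function name `facet_counts` are assumptions for illustration; caching and the sort/select mini-language are not modeled.

```python
from collections import Counter


def facet_counts(hits, field, separator="::"):
    """Count matching documents per facet value and per parent level."""
    counts = Counter()
    for doc in hits:
        groups = set()
        for value in doc.get(field, []):
            parts = value.split(separator)
            # A hit in "History::Asia" also counts toward "History".
            for depth in range(1, len(parts) + 1):
                groups.add(separator.join(parts[:depth]))
        # Using a set means each document counts once per group.
        counts.update(groups)
    return counts


hits = [
    {"subject": ["History::Asia", "Law"]},
    {"subject": ["History::Europe"]},
    {"subject": ["History::Asia"]},
]
subject_counts = facet_counts(hits, "subject")
```

The counts are computed from the current result set each time, which matches the "created dynamically" point on the slide.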
Adapting Lucene: Spelling Correction
  • Any standard dictionary won't match place and proper names
  • Idea: use the index as source of suggestions
  • XTF searches words within edit distance 2
  • Candidates ranked by weighted score:
    • Edit distance (transpositions discounted)
    • Frequency of use in the index
    • Double-metaphone match
  • Multi-word correction uses pair frequencies
  • On test data, the right suggestion 80% of the time
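Here is a simplified Python sketch of index-driven suggestions. It is not XTF's scoring: candidates are ordered by edit distance and then by frequency rather than by a weighted score, the double-metaphone and word-pair steps are omitted, and "discounting" transpositions is modeled simply by charging a swap of adjacent letters as one edit. The names `edit_distance` and `suggest` and the sample frequencies are assumptions.

```python
def edit_distance(a, b):
    """Optimal string alignment distance: an adjacent swap costs 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def suggest(word, term_freq, max_distance=2):
    """Rank index terms within max_distance: closest first, then most frequent."""
    candidates = []
    for term, freq in term_freq.items():
        if term == word:
            continue
        dist = edit_distance(word, term)
        if dist <= max_distance:
            candidates.append((dist, -freq, term))
    return [term for _, _, term in sorted(candidates)]


index_terms = {"sydney": 40, "sidney": 3, "kidney": 25, "library": 100}
suggestions = suggest("sydeny", index_terms)
```

Using the index itself as the word list is what lets place names and proper names show up as suggestions.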
Adapting Saxon: Lazy Trees
  • The need: display small parts of large (> 10MB) XML documents
  • Solution: create a binary, random-access version of each document
  • XSL keys calculated once and stored
  • Only elements accessed by a given request are loaded from disk
  • Care must be taken in stylesheets
  • Profile mode is useful for optimization
Adapting Saxon: Extensions
  • More complete SQL database connection
  • Ability to call external tools
    • Automatic XML conversion in/out
    • Timeout enforcement
  • File utilities
    • Check file existence
    • Get file length and timestamp
  • Session data
    • Key/value pairs
    • Value can be XML or plain string
The future
  • XTF 2.2:
    • Better out-of-box for large EADs
    • Fixes for incremental indexing; other bug fixes
    • Specify any number of sub-dirs to index
    • Possible TEI P5 refactoring
    • Background auto-warming of new index
    • Support for indexing Powerpoint and Excel files
  • Further out:
    • A page-turner for scanned texts and converted PDFs
    • Pop-up image/PDF page snippets
    • And of course, features suggested by users
Demos
  • I’ll demonstrate the features we talked about on several different XTF sites “out in the wild.”
Fin
  • Project: xtf.sourceforge.net
  • Docs: xtf.wiki.sourceforge.net
  • Discuss: groups.google.com/group/xtf-user
  • This talk: xtf.sourceforge.net/talks/2009-01-23.ppt
  • Me: [email protected]