Accommodating Diverse Search Requirements over a Fedora Repository

Presentation Transcript


  1. Accommodating Diverse Search Requirements over a Fedora Repository Michael Durbin and Jon W. Dunn Fedora User Group – Open Repositories 2008 April 3, 2008

  2. Background • Indiana University Digital Library Program • Started in 1997 • Diversity of formats and collections • Text, image, musical scores, audio, video, … • Diversity of search systems • DLXS, XTF, Lucene, DB2 NSE, Oracle Text • Current project to unify architecture for storage, discovery, and delivery around Fedora

  3. Search System Development • Phase one: create a search architecture and template for an image-based search and discovery application • Phase two: extend the template and architecture to support more advanced search and discovery applications over different object types

  4. PHASE I: CREATING A BASIC IMAGE SEARCH

  5. Phase One: Simple Image Search • Slocum puzzle collection: ideal test case • Small number of objects • Simple content model • Each object represents a single physical puzzle • Basic metadata: METS, MODS, DC • RELS-EXT isMemberOf relationship with a collection object • Pre-scaled derivative images

  6. [No text on this slide]

  7. Requirements: Identifier Resolution • External Identifiers rather than Fedora PIDs • Seamless migration to Fedora • No commitment to any underlying repository architecture • Requirement: Quickly resolve our identifier (PURL) to the Fedora PID

  8. Requirements: PURL Identifier Resolution [Diagram: http://purl.dlib.indiana.edu/iudl/lilly/slocum/thumbnail/LL-SLO-004696 → OCLC PURL Resolver → Hypothetical ID Resolution Service → http://fedora.dlib.indiana.edu:8080/fedora/get/iudl:19794/THUMBNAIL]
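The resolution step on this slide can be sketched in a few lines. This is only an illustration: the lookup table, the function name, and the rule that the PURL's view segment becomes the datastream ID in upper case are assumptions drawn from the single example on the slide, not a description of the service that was built.

```python
from urllib.parse import urlparse

# Hypothetical lookup table. A real service would query an index instead.
ID_TO_PID = {"LL-SLO-004696": "iudl:19794"}

FEDORA_GET = "http://fedora.dlib.indiana.edu:8080/fedora/get"


def resolve_purl(purl, id_to_pid=ID_TO_PID):
    """Map a PURL ending in <view>/<item-id> to a Fedora datastream URL."""
    parts = urlparse(purl).path.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"PURL has no view/item-id segments: {purl}")
    view, item_id = parts[-2], parts[-1]
    pid = id_to_pid[item_id]  # raises KeyError for an unknown identifier
    # Assumption: the datastream ID is the view name in upper case.
    return f"{FEDORA_GET}/{pid}/{view.upper()}"
```

Keeping the table behind a function argument means the same code works whether the mapping comes from a dictionary, a database, or a search index.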

  9. Requirements: Keyword and Fielded Search • Very basic search requirements for any discovery and delivery web application • Keyword search should maximize discovery • MODS fields should be searchable to maximize accuracy of matches • Search results paging • Support for simple Boolean operators • Wildcard searches are a requirement • Full metadata record (MODS) returned

  10. Remaining Requirements • User interface • Extensible, Reusable, Customizable • Service-oriented approach • Centralize core search system • Standards-based access for integration with other services and end-user tools

  11. Requirements: Search System [Diagram labels: UI Layer; Search Layer; Slocum Webapp; PURL Resolution; Fielded Search; Fedora Integration; Generic Search Webapp]

  12. Solutions: Search Protocol • Search and Retrieve via URL (SRU) • One of very few standard search protocols • Extremely powerful and flexible query language (CQL) • Can return records of any type • Most commonly used with DC, MODS, MARCXML • Has mechanisms for extension in case special needs arise
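For readers new to SRU, a searchRetrieve request is an ordinary URL whose `query` parameter carries the CQL string. The sketch below builds one with the standard SRU 1.1 request parameters. The base URL, the `mods` schema short name, and the helper names are placeholders, since both the endpoint and the schema names depend on how a given server is configured.

```python
from urllib.parse import urlencode


def page_start(page, page_size):
    """SRU record positions are 1-based, so page 1 starts at record 1."""
    return (page - 1) * page_size + 1


def build_sru_url(base_url, cql_query, page=1, page_size=10,
                  record_schema="mods"):
    """Build an SRU 1.1 searchRetrieve URL for one page of results."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,  # CQL, e.g. Boolean and wildcard searches
        "startRecord": page_start(page, page_size),
        "maximumRecords": page_size,
        "recordSchema": record_schema,  # assumed short name; server-defined
    }
    return f"{base_url}?{urlencode(params)}"
```

Paging, Boolean operators, and the choice of returned record format from the requirements slide all map onto these parameters, which is part of why one protocol could serve several applications.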

  13. Search System Solutions: SRU [Diagram labels: UI Layer; Search Layer; Slocum Webapp; PURL Resolution; Fielded Search; Fedora Integration; Generic Search Webapp; SRU (shown twice)]

  14. Solutions: Existing Products • Fedora Search • Good for finding items based on basic Fedora metadata, but not for more sophisticated searching • Fedora Resource Index Search • Also limited to searching basic metadata, not the content of datastreams

  15. Solutions: Existing Products • Fedora Generic Search Service (GSearch) • Hooks into Fedora • Works with Lucene • Easy to customize search fields through XSLT transformation of existing metadata • OCLC SRU/W Implementation • Relatively complete implementation in Java, with ongoing development • Others have had success using it with Lucene
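The field customization mentioned on this slide amounts to a mapping from metadata elements to named index fields. GSearch does this with XSLT. The sketch below is a Python stand-in for the same idea, not GSearch's stylesheet: the field names, the two MODS paths, and the catch-all `keyword` field are illustrative choices.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Illustrative mapping from index field name to a path inside a MODS record.
FIELD_PATHS = {
    "title": "mods:titleInfo/mods:title",
    "subject": "mods:subject/mods:topic",
}


def mods_to_fields(mods_xml):
    """Extract index fields from one MODS record given as an XML string."""
    root = ET.fromstring(mods_xml)
    fields = {}
    for name, path in FIELD_PATHS.items():
        values = [el.text.strip() for el in root.findall(path, MODS_NS)
                  if el.text and el.text.strip()]
        if values:
            fields[name] = values
    # A catch-all field of every text node supports broad keyword search.
    fields["keyword"] = " ".join(
        text.strip() for text in root.itertext() if text.strip())
    return fields
```

Fielded search then targets `title` or `subject` for precision, while keyword search targets the catch-all field for recall.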

  16. Search System [Diagram labels: OCLC SRU Implementation; Fedora Generic Search Service; SRU Lucene Database extension; Updates; Reads; index]

  17. Phase 1 Solution: General Applicability • Pieces of this solution have been used for other image collections • SRU is used to expose these collections to OneSearch@IU, our federated search service • The XSLT that assigned metadata to Lucene index fields was a solid base for the indexing needs of other collections.

  18. Phase 1 Solution: Lingering Problems • Our XSLT for the Generic Search Service wasn’t perfect • Some complications prevented full automation • We punted on getting the perfect Lucene analyzer configuration

  19. PHASE II: EXTENDING FOR DIFFERENT COLLECTIONS

  20. EVIA Digital Archive

  21. Requirement: EVIADA Video Annotation Collection [Diagram labels: Field Collection; Video Object (shown three times); Custom Annotation Software; Field Collection Object]

  22. Requirement: EVIADA Video Annotation Collection • Complex Data model • One Fedora object which is addressable and discoverable in parts • New features • Faceted Search and Browse • Extensive custom fields

  23. Requirements: IN Harmony Sheet Music Collection

  24. Requirements: IN Harmony Sheet Music Collection • Complex Content model • Three types of objects below the collection • Sheet music • Individual Score • Page Image [Image label: Chariot Race March]

  25. Requirements: IN Harmony Sheet Music Collection • New Features • Faceted Search and Browse • Exact match searches • Date range searches • Dozens of very specific fields • Sorting by date or title

  26. Options: • Extend our existing implementation • All too appealing because of familiarity and “sunk costs” • Major conflicts between existing model and desired model could result in unmaintainable “hackish” implementations • Switch to a new infrastructure • Would be great, if something existed that met our needs without having to rework everything • Some combination • Best of both worlds?

  27. Options: Faceted Search and Browse • Use Solr • Built-in support for facets • Is a service layer with an XML response • But do we really want to abandon SRU, or maintain two search service protocols?

  28. Options: Faceted Search and Browse • Extend SRU Implementation • Prevents the need for yet another service layer • Has wide reuse potential • Could be backed by Solr without substantially more effort.

  29. Solution: Faceted Search over SRU [Diagram label: SRU Service (now with facet support)]
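Whatever transport carries them, facets are counts of field values across the matching records. The slides do not say how the facet data was embedded in the SRU response, so the sketch below covers only the counting step, with records modelled as plain dictionaries and all names hypothetical.

```python
from collections import Counter


def facet_counts(records, facet_fields, limit=10):
    """Count distinct values per facet field across matching records.

    Each record is a dict mapping a field name to a list of values.
    A value is counted at most once per record.
    """
    counts = {field: Counter() for field in facet_fields}
    for record in records:
        for field in facet_fields:
            for value in set(record.get(field, [])):
                counts[field][value] += 1
    # Keep only the most frequent values for display.
    return {field: counter.most_common(limit)
            for field, counter in counts.items()}
```

A search engine with built-in facet support, such as Solr, computes these counts inside the index rather than by scanning result records, which matters once result sets grow large.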

  30. Solution: Other SRU Improvements • More complete CQL support • Easy Improvements • Operators (and, or, not, any, all) • Application-specific fields

  31. Solutions: Other SRU Improvements • More complete CQL support • Difficult Improvements • “cql.exact” relation • facet implementation • sort support [Diagram labels: dc.subject; dc.subject exact “United Kingdom”; dc.subject.exact; dc.subject.sort; index]
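The labels on this slide suggest that each CQL index was backed by several Lucene fields: the plain field, an `.exact` variant, and a `.sort` variant. That reading is an inference from the labels alone. Under that assumption, the routing could look like the sketch below, where the function names and the normalization used for the sort key are hypothetical.

```python
def search_field(index, relation="="):
    """Pick the index field that should answer a CQL index/relation pair."""
    if relation in ("exact", "cql.exact"):
        return f"{index}.exact"  # untokenized copy, for whole-value matches
    return index  # tokenized field, for ordinary word matches


def sort_field(index):
    """Pick the index field used to order results by this CQL index."""
    return f"{index}.sort"


def field_variants(index, value):
    """Produce all variants to store for one metadata value at index time."""
    return {
        index: value,
        f"{index}.exact": value,
        # Assumed normalization so sorting ignores case and edge whitespace.
        f"{index}.sort": value.strip().casefold(),
    }
```

With this scheme the query `dc.subject exact "United Kingdom"` is answered from `dc.subject.exact`, so a record whose subject is merely "United Kingdom history" would not match.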

  32. Options: Index Generation • Fedora Generic Search Service • Homegrown Solution

  33. Reconsideration: GSearch • Limited by the one-to-one relationship between Lucene documents and Fedora objects • Storing valid XML in CDATA for Lucene is messy and error-prone as the metadata becomes more diverse • We really only use it to generate a Lucene index

  34. Consideration: Solr • Robust wrapper for Lucene • Exposes service to update index • Exposes search features as a service • Abstracts away many of the complexities of Lucene • Migrating existing search indexes would be prohibitively time-consuming, but it might be the best tool to bring up new collections

  35. Solution: Custom index service • A service whose initial functionality is simply to create and maintain Lucene Index directories that are served by SRU. • Can easily be extended/configured to use different search engines or to delegate the process entirely (perhaps to Solr) • Support for existing GSearch style XSLT • Simple Java interface to allow for easy index implementations.
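The shape of such a service can be sketched as a small writer interface plus a registry. The slides describe a Java interface. The Python below is only an analogue for illustration, with in-memory dictionaries standing in for Lucene index directories, and every class and method name is hypothetical.

```python
from abc import ABC, abstractmethod


class IndexWriter(ABC):
    """Turns one repository object into zero or more index documents."""

    @abstractmethod
    def documents_for(self, pid, metadata):
        """Return a list of dicts, one per index document."""


class BasicIdIndexWriter(IndexWriter):
    """One document per object: just enough to resolve an identifier."""

    def documents_for(self, pid, metadata):
        return [{"pid": pid, "identifier": metadata["identifier"]}]


class CompoundModelIndexWriter(IndexWriter):
    """One document per part, so parts of an object are findable alone."""

    def documents_for(self, pid, metadata):
        return [{"pid": pid, "part": part["id"], "title": part["title"]}
                for part in metadata.get("parts", [])]


class IndexService:
    """Keeps one index per registered writer (dicts stand in for Lucene)."""

    def __init__(self):
        self._writers = {}
        self._indexes = {}

    def register(self, name, writer):
        self._writers[name] = writer
        self._indexes[name] = {}

    def update(self, pid, metadata):
        # Replace every document previously written for this object.
        for name, writer in self._writers.items():
            self._indexes[name][pid] = writer.documents_for(pid, metadata)

    def documents(self, name):
        return [doc for docs in self._indexes[name].values() for doc in docs]
```

Because a writer may return several documents for one object, this design is not bound to one index document per repository object, and a writer that forwards documents to another engine would fit the same interface.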

  36. Search Service [Diagram: OCLC SRU Implementation; Custom Index Service with four index writers, each feeding its own index: Basic Index Writer → Lucene Database configured for quick id resolution; GSearch Style XSLT Index Writer → Lucene Database configured for basic search; New Style XSLT Index Writer → Lucene Database configured for advanced search; Compound Model Java Index Writer → Lucene Database configured for compound model searches]

  37. Search Service [Diagram: OCLC SRU Implementation; Custom Index Service with five index writers, each feeding its own index: Basic Index Writer → Lucene Database configured for quick id resolution; GSearch Style XSLT Index Writer → Lucene Database configured for basic search; New Style XSLT Index Writer → Lucene Database configured for advanced search; Compound Model Java Index Writer → Lucene Database configured for compound model searches; Solr Wrapping Index → Solr Database configured to interface with Solr; Solr]

  38. Future Plans • Full Text searching • Search text of entire books or journals • Determine where in the hierarchy the match occurred • Provide snippets with highlighted matches in context for the search results listing • Solutions • XTF, Solr through our custom index service

  39. Conclusion • Most of the work is configuring the index, a requirement that cannot be avoided. • Migration doesn’t have to be difficult or disruptive • Always be willing and able to consider new products and technologies

  40. Thanks! Any Questions? • www.dlib.indiana.edu • wiki.dlib.indiana.edu/confluence/x/AQI • midurbin@indiana.edu • jwd@indiana.edu
