OAIster: A “No Dead Ends” Digital Object Service Kat Hagedorn OAIster Librarian University of Michigan Libraries October 3, 2003
background • One-year Mellon grant project to test the feasibility of making OAI-enabled metadata for digital objects accessible to the public • Digital Library Production Service at University of Michigan Libraries began work in December 2001 • Publicized as OAIster in February 2002 • Launched in June 2002
highlights • Any audience • Any subject matter • Any format • Freely accessible • No dead ends • One-stop shopping …retrieving the “hidden web”
the protocol • OAI = Open Archives Initiative • OAI-PMH = Open Archives Initiative Protocol for Metadata Harvesting • Designed to make it easy to exchange metadata among interested parties • Consists of 6 HTTP requests to identify repositories / metadata and perform “harvesting”
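The six requests correspond to the six OAI-PMH verbs, each sent as an HTTP query parameter. A minimal sketch of how a harvester might build those request URLs; the base URL and the `build_request` helper are hypothetical, chosen only for illustration:

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH.
OAI_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_request(base_url, verb, **params):
    """Return the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    query.update(params)
    return f"{base_url}?{urlencode(query)}"


# Hypothetical repository endpoint, for illustration only.
BASE = "http://repository.example.org/oai"

identify_url = build_request(BASE, "Identify")
harvest_url = build_request(BASE, "ListRecords", metadataPrefix="oai_dc")
```

Identify, ListMetadataFormats, and ListSets describe the repository and its metadata; ListIdentifiers, ListRecords, and GetRecord do the harvesting.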
tool we borrowed • University of Illinois Urbana-Champaign open-source OAI protocol harvester • Java edition for our Unix environment • Worked collaboratively to iron out kinks • resumptionToken / retryAfter • inexplicable kill • bogus records in MySQL table
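The resumptionToken / retryAfter kinks concern paging through a long result list and backing off when a repository answers 503. This is not the UIUC harvester's code, only a rough sketch of that loop: `fetch` is a hypothetical callable supplied by the caller, and Retry-After is assumed to be given in seconds.

```python
import time
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def next_token(response_xml):
    """Return the resumptionToken of a list response, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


def harvest(fetch, sleep=time.sleep, max_retries=3):
    """Collect every page of a ListRecords harvest.

    `fetch(token)` is a hypothetical callable returning
    (status, headers, body); token is None for the first request.
    """
    pages, token, retries = [], None, 0
    while True:
        status, headers, body = fetch(token)
        if status == 503:
            # Repository asked us to back off. Assumes Retry-After is in seconds.
            if retries >= max_retries:
                raise RuntimeError("repository kept answering 503")
            retries += 1
            sleep(int(headers.get("Retry-After", "60")))
            continue
        retries = 0
        pages.append(body)
        token = next_token(body)
        if token is None:
            return pages
```

Passing `sleep` in as a parameter keeps the loop testable without real waiting.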
development environment • Digital Library Extension Service (DLXS) • Develops open-source middleware and licenses the XPAT search engine for building and mounting digital libraries • Middleware consists of document classes, i.e., Text, Image, Bib, FindAid • Originally designed to make SGML-encoded texts available online
tool we developed • Runs in DLXS environment using BibClass • Current BibClass web templates modified • Additional Java-based transformation tool that: • Concatenates DC metadata records • Filters out records with no digital object • Counts records • Converts UTF-8 to ISO-8859-1 • XSLT used to transform DC records into BibClass records
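The actual tool is Java and works on XML; the filter, count, re-encode, and concatenate steps can nonetheless be sketched in a few lines of Python. Everything named here is an assumption for illustration: the dict keys, the rule that an http(s) identifier signals a digital object, and the choice to replace characters that ISO-8859-1 cannot represent.

```python
def looks_like_object_link(identifier):
    # Assumption: a record leads to a digital object when one of its
    # identifiers is an http(s) URL. The slides do not say how the
    # real tool decided this.
    return identifier.lower().startswith(("http://", "https://"))


def prepare_records(records):
    """Filter, count, re-encode, and concatenate Dublin Core records.

    `records` is a list of dicts with hypothetical keys "title" and
    "identifiers".
    """
    kept = [r for r in records
            if any(looks_like_object_link(i) for i in r["identifiers"])]
    # Characters with no ISO-8859-1 equivalent become "?"; this policy
    # is an assumption, not something the source documents.
    encoded = [r["title"].encode("iso-8859-1", errors="replace") for r in kept]
    return b"\n".join(encoded), len(kept)
```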
system design [diagram components] • OAI-enabled DC records • Non-OAI-enabled DC records • UIUC harvester • Record storage • XSLT transformation tool • XSL stylesheets (per source type) • BibClass indexes • Search interface (XPAT)
result • One place to look for digital objects • Big • 1,484,767 metadata records • 195 institutions (as of August 03) • Popular • Averages 3300 search sessions / month • Picked up in March 03: average 3700 now • 43,894 searches total (through July 03)
repositories: e.g., • Online Archive of California: manuscripts, photographs, and works of art held in institutions across California • arXiv Eprint Archive: math and physics pre- and post-prints • Sammelpunkt, Elektronisch Archivierte Theorie: archive of philosophical publications • British Women Romantic Poets Project: collection of poems written by British women between 1789 and 1832
repositories: stats • As of July 03, out of 191 repositories… • U.S. and foreign • U.S.: 49% (94) • Foreign: 51% (97) • By subject • Humanities: 26% (50) • Science: 30% (58) • Mixed: 43% (83) • E-prints and pre-prints • Using eprints.org software: 41% (78) • Not using eprints.org software: 58% (110)
major issues encountered • Metadata variation • Records not leading to digital objects • Access restrictions on digital objects described in records • Duplicate records for a single digital object
issue: metadata variation • With more records, users need more restrictions • Consistent metadata needed to facilitate these restrictions • One option: normalization of data
issue: metadata variation • Type: the obvious quick win • 240 metadata values mapped to four generic values (text, image, audio, video) • e.g., • audio, sound = audio • motion, animation, newsreels, etc. = video • watercolour, watercolor, slides, etc. = image • article, articles, booklet, diss, story, etc. = text
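A lookup table is one simple way to express this kind of mapping. The sketch below covers only the example values listed on the slide, not the full set of 240, and the `normalize_type` helper is hypothetical:

```python
# Only the example values from the slide; the full mapping had 240 entries.
TYPE_MAP = {
    "audio": "audio", "sound": "audio",
    "motion": "video", "animation": "video", "newsreels": "video",
    "watercolour": "image", "watercolor": "image", "slides": "image",
    "article": "text", "articles": "text", "booklet": "text",
    "diss": "text", "story": "text",
}


def normalize_type(raw):
    """Map a raw type value to a generic value, or None if unmapped."""
    return TYPE_MAP.get(raw.strip().lower())
```

Returning None for unmapped values leaves it to the caller to decide whether to keep the raw value or drop it.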
issue: metadata variation • Date: where to begin? • Most records with at least one date • Some records include up to seven dates • No consistent style of date • Subject: out of context, what meaning? • Many records with at least one subject element • But over 100 records with more than 50 subjects • And one record with 1000!
issue: metadata variation • Sample date values <date>2-12-01</date> <date>2002-01-01</date> <date>0000-00-00</date> <date>1822</date> <date>between 1827 and 1833</date> <date>18--?</date> <date>November 13, 1947</date> <date>SEP 1958</date> <date>235 bce</date> <date>Summer, 1948</date>
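To show how hard these values are to normalize, here is a deliberately naive sketch that pulls the first plausible four-digit year out of a date string. It is not OAIster's approach; the 1000 to 2100 window and the `extract_year` helper are assumptions for illustration.

```python
import re

# A run of exactly four digits, not embedded in a longer number.
YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def extract_year(raw, lo=1000, hi=2100):
    """Return the first four-digit year within [lo, hi], or None."""
    for match in YEAR.finditer(raw):
        year = int(match.group(1))
        if lo <= year <= hi:
            return year
    return None
```

Against the sample values above, this recovers a year from `2002-01-01`, `1822`, `between 1827 and 1833` (first year only), `November 13, 1947`, `SEP 1958`, and `Summer, 1948`, and gives up on `2-12-01`, `0000-00-00`, `18--?`, and `235 bce`.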
issue: metadata variation • Sample subject values <subject>30,51,52</subject> <subject>1852, Apr. 22. E[veritt] Judson, letter to Philuta [Judson].</subject> <subject>Slavery--United States--Controversial literature</subject> <subject>view of interior with John Henry sculpture</subject> <subject>Particles (Nuclear physics) -- Research.</subject>
issue: no digital objects • Some records contain links to further description of digital object • But not the digital object itself • Culling difficult • One option: add explanatory text to site
issue: access restrictions • No records where metadata itself is restricted in use (as far as we know!) • Definitely some records where objects are restricted to licensed users • One option: add explanatory text to site
issue: access restrictions • DC Rights element: often not enough info about viewing restrictions • Currently no protocol method for indicating restricted digital objects (i.e., “yes/no” toggle element) • Need to assess whether users feel informed or frustrated when encountering restricted objects
issue: duplicate records • Two records harvested, different identifiers, same object described and pointed to • Acquired in two ways: • Harvesting of original repository and aggregator • Receiving “static” DC records provided by content creator and harvesting aggregator
issue: duplicate records • Aggregators can contain records not currently available through OAI channels • Aggregators do not always contain all the records of a particular original repository • So, need to harvest both aggregator and original repositories
issue: duplicate records • Harvest records from aggregator • Also receive from original content creator, but as snapshot • e.g., MEO and cogprints • Snapshot before aggregator • Creator unsure all records would be aggregated
issue: duplicate records • Were duplicates to be identified, how to deal with the issue? • Suppress? • Group? • Flag? • So far, not addressed in OAIster
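Purely as an illustration of the "group" option, and not something OAIster does: records with different identifiers could be grouped by the object URL they point to. The (identifier, URL) pair format and the light URL normalization are assumptions.

```python
from collections import defaultdict


def normalize_url(url):
    # Assumption: two links point at the same object when they match
    # after trimming whitespace and a trailing "/".
    return url.strip().rstrip("/")


def group_duplicates(records):
    """Group record identifiers by the object URL each record points to.

    `records` is a list of (oai_identifier, object_url) pairs, a
    hypothetical simplification of a harvested record.
    """
    groups = defaultdict(list)
    for oai_id, url in records:
        groups[normalize_url(url)].append(oai_id)
    # Only URLs described by more than one record count as duplicates.
    return {url: ids for url, ids in groups.items() if len(ids) > 1}
```

The same grouping could drive suppressing or flagging instead; matching on URL alone would miss duplicates whose links differ.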
assessment • Large survey (over 400 respondents) • 2 rounds of face-to-face and remote user testing • Conducted before design and after phase one rollout
assessment: survey • Online journals and reference materials wanted over other digital objects • Difficult to search for information; every service different; where to start • Number of respondents (5%) indicated they were generally successful in finding resources online
assessment: user testing • No short and long record formats: one size fits all • Want clearly defined and labeled AND/OR searching options • Results clear and easy to understand • Want to sort by title, date, institution, resource format…you name it! • Use OAIster for academic, trustworthy, authentic materials
service providers: comparison [chart: Usability (low to high) against Content (some to all)] • Plotted: UIUC, Emory, etc.; OAIster; Ad hoc; DP-9 • Focus on high usability • Focus on all content available • Some service providers have increased functionality (e.g., de-duplication, integration of thesauri)
future of OAIster • Make it faster • Advanced searching • Grouping to aid browsing • Saving/emailing/downloading records • Further normalization of data • Handling duplicate records • Collaboration with other services: search, instructional…
current state of protocol • Popular • As Peter Suber says: • “…no other single idea or technology in the [open-source] movement has enjoyed this density of endorsement and adoption in a six month period.” • Data providers over one year: • June 02: 56 repositories / 274,062 records • June 03: 187 repositories / 1,246,953 records • Over three-fold increase for repositories • Over four-fold increase for records
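The fold increases follow directly from the June 02 and June 03 figures; a quick check of the arithmetic:

```python
repos_2002, repos_2003 = 56, 187
records_2002, records_2003 = 274_062, 1_246_953

# Growth factor = later count divided by earlier count.
repo_growth = repos_2003 / repos_2002
record_growth = records_2003 / records_2002

print(f"{repo_growth:.2f}x repositories, {record_growth:.2f}x records")
```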
future of protocol • Branching out • HTTP vs. SOAP • DC required vs. highly recommended • Use of OAI in closed environments • Static repository protocol • Need for add-on applications • OAI evangelism
what can you do? • OAI-enable your data • DLXS customer: easiest • Make sure data is UTF-8 / Unicode compliant • Provide as much metadata as you can • Use standard element tags • Develop “sets” for service providers • Let us know you’re ready to be harvested • Keep us informed about changes to the harvesting URL, new data and deleted data, change in contact info
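For the UTF-8 item, a data provider could run a simple check before exposing records. A minimal sketch, with `is_valid_utf8` as a hypothetical helper name:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True when the raw bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For example, the bytes of "Café" encoded as ISO-8859-1 fail this check, while the same text encoded as UTF-8 passes.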
contact info • Kat Hagedorn • University of Michigan Libraries, Digital Library Production Service • khage@umich.edu • http://www.oaister.org/