Bitter Harvest: Metadata Harvesting Issues, Problems, and Possible Solutions
Presentation Transcript

  1. Bitter Harvest: Metadata Harvesting Issues, Problems, and Possible Solutions. Roy Tennant, California Digital Library

  2. Outline • Brief Harvesting Overview • Harvesting Problems • Steps to a Fruitful Harvest • A Harvesting Service Model • Indexing and Interfaces • What’s Next?

  3. Open Archives Initiative • Open Archives Initiative: “develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content” • Huh? Let’s just say it’s an effort to help people find stuff • Protocol for Metadata Harvesting (OAI-PMH) specifies how repositories can expose their metadata for others to harvest • Well over 500 repositories world-wide support the protocol • OAIster.org has indexed 3.5 million items from those repositories

  4. OAI-PMH • Data providers (DP) — those with the stuff • Service providers (SP) — those who harvest metadata and provide aggregation and search services • OAI-PMH verbs: • Identify • ListIdentifiers • ListMetadataFormats • ListSets • ListRecords • GetRecord • Software for both DPs and SPs readily available
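
A minimal sketch of how a service provider might turn the six verbs into request URLs. Every OAI-PMH request is an HTTP request carrying a `verb` argument plus any verb-specific arguments. The base URL `http://example.org/oai`, the set name `maps`, and the helper name `build_request` are assumptions made for illustration, not part of the talk.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH, as listed on the slide.
VERBS = {"Identify", "ListIdentifiers", "ListMetadataFormats",
         "ListSets", "ListRecords", "GetRecord"}


def build_request(base_url, verb, params=None):
    """Return the GET URL for one OAI-PMH request."""
    if verb not in VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    query = {"verb": verb}
    query.update(params or {})
    return base_url + "?" + urlencode(query)


# Hypothetical repository address, used only for illustration.
BASE = "http://example.org/oai"
identify_url = build_request(BASE, "Identify")
records_url = build_request(
    BASE, "ListRecords", {"metadataPrefix": "oai_dc", "set": "maps"})
```

Passing the arguments as a dictionary keeps the helper usable for the `from` argument, which is a reserved word in Python and could not be a keyword parameter.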

  5. www.oaforum.org/tutorial/

  6. OAI Architecture Source: Open Archives Forum Tutorial

  7. gita.grainger.uiuc.edu/registry/

  8. errol.oclc.org

  9. Harvesting Problems • Sets • Metadata Formats • Metadata Artifacts • Granularity • Metadata Variances

  10. Sets • Records are harvested in clumps, called “sets” created by DPs • No guidelines exist for defining sets • Examples: • Collection • Organizational structure • Format (but is a page image an image? See example)
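
Because no guidelines govern how sets are defined, a harvester usually has to look at what each DP actually offers. The sketch below reads a ListSets response with Python's standard XML parser. The sample response is made up for illustration (set names included); only the element names and the namespace come from the protocol.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Made-up ListSets response, trimmed to the parts that matter here.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>maps</setSpec><setName>Historical maps</setName></set>
    <set><setSpec>images:pages</setSpec><setName>Page images</setName></set>
  </ListSets>
</OAI-PMH>"""


def list_sets(xml_text):
    """Return (setSpec, setName) pairs from a ListSets response."""
    root = ET.fromstring(xml_text)
    return [(s.findtext(OAI + "setSpec"), s.findtext(OAI + "setName"))
            for s in root.iter(OAI + "set")]


sets = list_sets(SAMPLE)
```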

  11. Metadata Formats • Only required format is simple Dublin Core, although any format can be made available in addition • Few DPs surface richer metadata • Simple DC is simply too simple! • Example (artifact vs. surrogate dates)
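
One way to see the "too simple" problem: simple Dublin Core has a single unqualified `date` element, so a record can carry the date of the original artifact and the date of its digital surrogate with nothing to tell them apart. The record below is invented for illustration; the namespaces are the standard ones for `oai_dc`.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

# Invented record: one date for the photograph, one for the scan.
SAMPLE = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Harbor at low tide</dc:title>
  <dc:date>1897</dc:date>
  <dc:date>2003-05-14</dc:date>
</oai_dc:dc>"""


def dc_values(xml_text, element):
    """Return every value of one simple Dublin Core element."""
    root = ET.fromstring(xml_text)
    return [e.text for e in root.iter(DC + element)]


dates = dc_values(SAMPLE, "date")
```

Both values come back as plain strings in the same field, so the SP cannot tell from the record alone which one describes the original.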

  12. Metadata Artifacts • “unintended, unwanted aberrations” • Sample causes: • Idiosyncratic local practices • Anachronisms • HTML code • Examples: • Circa = string of dates for searching purposes • [electronic resource]
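
A rough sketch of cleaning up the artifacts named on this slide. It assumes the "circa" artifact takes the form of consecutive years typed out in one field (for example "1910 1911 1912"); that reading, and both function names, are assumptions rather than anything shown in the talk.

```python
import html
import re


def clean_value(value):
    """Strip HTML markup and the '[electronic resource]' qualifier."""
    text = re.sub(r"<[^>]+>", "", value)   # drop HTML tags
    text = html.unescape(text)             # decode entities such as &amp;
    text = re.sub(r"\[electronic resource\]", "", text, flags=re.IGNORECASE)
    return " ".join(text.split())          # collapse leftover whitespace


def collapse_year_run(value):
    """Turn a run of consecutive years into 'first-last'; else leave as is."""
    tokens = value.split()
    if len(tokens) < 2 or not all(re.fullmatch(r"\d{4}", t) for t in tokens):
        return value
    years = [int(t) for t in tokens]
    if years != list(range(years[0], years[0] + len(years))):
        return value
    return f"{years[0]}-{years[-1]}"
```

For instance, `clean_value("Poems <i>of</i> the sea [electronic resource]")` yields `Poems of the sea`, and `collapse_year_run("1910 1911 1912 1913")` yields `1910-1913`.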

  13. Granularity • Record Granularity: what is an “object”? • A book, or each individual page? • Examples: CDL, Univ. of Michigan • Metadata Granularity: • Multiple values in one field • Example: Univ. of Washington
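
Metadata granularity can be partly repaired after harvest by splitting packed fields into separate values. This sketch assumes the values are separated by semicolons or vertical bars; the real delimiter differs by DP and has to be found by inspecting the data.

```python
import re


def split_values(value, delimiters=";|"):
    """Split one packed field into a list of trimmed, non-empty values."""
    pattern = "[" + re.escape(delimiters) + "]"
    return [part.strip() for part in re.split(pattern, value) if part.strip()]


subjects = split_values("Fishing; Boats ;Puget Sound")
```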

  14. Metadata Variances • Subject terminology differences • Disparities in recording the same metadata • Example: date variances • Mapping oddities or mistakes • Examples: 1) format into description, 2) description into subject

  15. Steps to a Fruitful Harvest • Needs Assessment (it’s the user, stupid) • DP Identification and Communication • Metadata Capture • Metadata Analysis • Metadata Subsetting • Metadata Normalization • Metadata Enrichment • Indexing • Interface (it’s still the user, stupid)

  16. Needs Assessment • What are you trying to accomplish? • What will your users want to be able to do? • What metadata will you need, and what procedures will you need to set up to enable these activities? • Which repositories have what you want? • Is what they have (e.g., sets, metadata) usable as is, or ?

  17. DP Identification & Communication • Identification: • Use UIUC directory of DPs to identify potential sources • Communication: • Not required to tell them you are harvesting, but may help establish a good relationship • May want to request that they surface a richer metadata format and/or provide a different set

  18. Metadata Capture • Sample questions to answer: • Individual sets, or all? • Richer metadata formats available? • How frequently to reharvest? • Start from scratch each time or update? • Many software options
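
The "start from scratch or update" question maps onto two protocol features: the `from` argument, which limits a harvest to records changed since a given date, and the `resumptionToken`, which a DP returns when a list is delivered in pieces. The sketch below is one possible harvest loop, not any particular harvester's implementation. To keep it runnable without a network, the HTTP call is passed in as `fetch`; `fake_fetch`, `_page`, and the identifiers are all made up.

```python
import xml.etree.ElementTree as ET

NS = "http://www.openarchives.org/OAI/2.0/"
OAI = "{" + NS + "}"


def harvest_identifiers(fetch, metadata_prefix="oai_dc", set_spec=None,
                        from_date=None):
    """Yield record identifiers, following resumption tokens to the end."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date      # incremental update instead of full
    while True:
        root = ET.fromstring(fetch(params))
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.findtext(".//" + OAI + "resumptionToken")
        if not token:                   # missing or empty token: list is done
            break
        # A resumption token is sent alone with the verb.
        params = {"verb": "ListRecords", "resumptionToken": token}


def _page(identifiers, token=""):
    """Build a made-up ListRecords response for the demonstration."""
    records = "".join(
        f"<record><header><identifier>{i}</identifier></header></record>"
        for i in identifiers)
    return (f'<OAI-PMH xmlns="{NS}"><ListRecords>{records}'
            f"<resumptionToken>{token}</resumptionToken>"
            "</ListRecords></OAI-PMH>")


requests_seen = []


def fake_fetch(params):
    """Stand-in for an HTTP GET; serves two pages of invented records."""
    requests_seen.append(dict(params))
    if params.get("resumptionToken") == "page2":
        return _page(["oai:example.org:3"])
    return _page(["oai:example.org:1", "oai:example.org:2"], token="page2")


harvested = list(harvest_identifiers(fake_fetch, from_date="2004-01-01"))
```

In a real harvester `fetch` would issue the HTTP request and would also need to handle errors and retry delays, which this sketch leaves out.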

  19. Virginia Tech Perl Harvester

      +-----------------------------------------+
      | Harvester Sample Configurator           |
      +-----------------------------------------+
      | Version 1.1 :: July 2002                |
      | Hussein Suleman <hussein@vt.edu>        |
      | Digital Library Research Laboratory     |
      | www.dlib.vt.edu :: Virginia Tech        |
      ------------------------------------------+

      Defaults/previous values are in brackets - press <enter> to accept those
      enter "&delete" to erase a default value
      enter "&continue" to skip further questions and use all defaults
      press <ctrl>-c to escape at any time (new values will be lost)

      Press <enter> to continue

      [ARCHIVES]
      Add all the archives that should be harvested

      Current list of archives:
      No archives currently defined !

      Select from: [A]dd [D]one
      Enter your choice [D] : a{return}

      [ARCHIVE IDENTIFIER]
      You need a unique name by which to refer to the archive you will harvest metadata from
      Examples: nsdl-380602, VTETD

      Archive identifier [] : nsdl-380602{return}

  20. Metadata Analysis • Finding out what you have (and don’t have) • Encoding practices • Gap analysis (e.g., missing fields, etc.) • Mistakes (e.g., mapping errors) • Software can help • Commercial software like Spotfire • In-house or open source software tools
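
As a small example of the in-house tooling mentioned above, the sketch below counts how many records use each field and reports which required fields each record lacks. It assumes records have already been parsed into dictionaries mapping a field name to a list of values; that shape, the sample records, and the function names are assumptions.

```python
from collections import Counter


def field_usage(records):
    """Count the records in which each field has at least one value."""
    counts = Counter()
    for record in records:
        counts.update(f for f, values in record.items() if values)
    return counts


def missing_fields(records, required):
    """For each record, list the required fields that are absent or empty."""
    return [sorted(set(required) - {f for f, v in r.items() if v})
            for r in records]


# Invented records for illustration.
records = [
    {"title": ["Harbor at low tide"], "date": ["1897"], "subject": []},
    {"title": ["Mill town"], "creator": ["Unknown"]},
]
usage = field_usage(records)
gaps = missing_fields(records, ["title", "date"])
```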

  21. Five elements are used 71% of the time Source: 2002 Master’s Thesis, Jewel Hope Ward, UNC Chapel Hill

  22. Metadata Analysis Model

  23. Metadata Subsetting • DP sets are unlikely to serve all SP uses well • SPs will need the ability to subset harvested metadata • Example: prototype subsetting tool
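
A minimal sketch of subsetting on the SP side, assuming harvested records are held as dictionaries of field name to list of values. It is not the prototype tool mentioned on the slide, only an illustration of carving out a subset by keyword in one field.

```python
def subset(records, field, keyword):
    """Return the records whose given field contains the keyword."""
    needle = keyword.lower()
    return [r for r in records
            if any(needle in value.lower() for value in r.get(field, []))]


# Invented records for illustration.
harvested = [
    {"title": ["Harbor at low tide"], "subject": ["Puget Sound"]},
    {"title": ["Mill town"], "subject": ["Lumbering"]},
    {"title": ["Untitled"]},
]
sound_records = subset(harvested, "subject", "puget")
```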

  24. A Subsetting Model

  25. Metadata Normalization • Normalizing: to reduce to a standard or normal state • Prototype date normalization service screen
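
Dates are a natural first target for normalization. The sketch below is not the prototype service from the slide. It tries a few assumed input formats and otherwise falls back to the first four-digit year it can find; the format list and the fallback rule are assumptions that a real service would tune against the harvested data.

```python
import re
from datetime import datetime

# Input formats tried in order; an assumed, deliberately short list.
_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d %B %Y")


def normalize_date(raw):
    """Return YYYY-MM-DD when a full date parses, else a bare year, else None."""
    text = raw.strip()
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", text)
    return match.group(1) if match else None
```

With these rules, `ca. 1897` and `[1897?]` both reduce to `1897`, while a value with no recognizable year returns `None` so it can be flagged for review.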

  26. Metadata Enrichment • Adding fields or values may be useful or required, for example: • Metadata provider information • Geographic coverage • Subject terms mapped to a different thesaurus • Authority control record
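
Two of the enrichments listed above can be sketched in a few lines: stamping each record with its provider and mapping subject terms to a second vocabulary. The field names `provider` and `subject_mapped`, the sample provider, and the term mapping are all invented for illustration.

```python
def enrich(record, provider, subject_map):
    """Return a copy of the record with provider and mapped subjects added."""
    enriched = {field: list(values) for field, values in record.items()}
    enriched["provider"] = [provider]
    mapped = [subject_map[s] for s in record.get("subject", [])
              if s in subject_map]
    if mapped:
        enriched["subject_mapped"] = mapped
    return enriched


# Invented record and mapping for illustration.
original = {"title": ["Mill town"], "subject": ["Lumbering", "Rivers"]}
result = enrich(original, "Example Historical Society",
                {"Lumbering": "Logging"})
```

The harvested values are copied rather than changed in place, so the record as the DP supplied it stays available alongside the added fields.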

  27. A Harvesting Service Model

  28. Indexing • Pick your favorite database/indexing software: • MySQL • SWISH-E • May need to specifically set up a method to search across the entire record • May need different fields for indexing than for display
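
The last two points can be illustrated with a small index that stores a display field separately from a combined search field. This sketch uses Python's built-in `sqlite3` module in place of the MySQL or SWISH-E options named on the slide, purely so that it runs without a server; the table layout and sample records are assumptions.

```python
import sqlite3


def build_index(records):
    """Load records into an in-memory table with a whole-record search column."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE items "
               "(identifier TEXT PRIMARY KEY, title TEXT, all_text TEXT)")
    for identifier, fields in records.items():
        title = "; ".join(fields.get("title", []))            # for display
        all_text = " ".join(v for values in fields.values()   # for searching
                            for v in values)
        db.execute("INSERT INTO items VALUES (?, ?, ?)",
                   (identifier, title, all_text.lower()))
    db.commit()
    return db


def search(db, keyword):
    """Return (identifier, title) for records matching anywhere in the record."""
    rows = db.execute(
        "SELECT identifier, title FROM items "
        "WHERE all_text LIKE ? ORDER BY identifier",
        ("%" + keyword.lower() + "%",))
    return rows.fetchall()


# Invented records for illustration.
index = build_index({
    "oai:example.org:1": {"title": ["Harbor at low tide"],
                          "subject": ["Puget Sound"]},
    "oai:example.org:2": {"title": ["Mill town"], "subject": ["Lumbering"]},
})
hits = search(index, "puget")
```

The search matches on the subject term even though only the title is returned for display, which is the split between indexing fields and display fields that the slide describes.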

  29. Interface • Software interface (API) for other applications: • SRU/SRW? • Arbitrary Web Services schema? • User interface

  30. What’s Next? • Further protocol development • Services layered on top of OAI-PMH • Shared software tools • Best practices for both DPs and SPs

  31. oai-best.comm.nsdl.org