Bitter Harvest Metadata Harvesting Issues, Problems, and Possible Solutions


Roy Tennant

California Digital Library

Outline
  • Brief Harvesting Overview
  • Harvesting Problems
  • Steps to a Fruitful Harvest
  • A Harvesting Service Model
  • Indexing and Interfaces
  • What’s Next?
Open Archives Initiative
  • Open Archives Initiative: “develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content”
  • Huh? Let’s just say it’s an effort to help people find stuff
  • Protocol for Metadata Harvesting (OAI-PMH) specifies how repositories can expose their metadata for others to harvest
  • Well over 500 repositories world-wide support the protocol
  • One harvesting service has indexed 3.5 million items from those repositories
OAI-PMH
  • Data providers (DP) — those with the stuff
  • Service providers (SP) — those who harvest metadata and provide aggregation and search services
  • OAI-PMH verbs:
    • Identify
    • ListIdentifiers
    • ListMetadataFormats
    • ListSets
    • ListRecords
    • GetRecord
  • Software for both DPs and SPs readily available
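Each verb above is sent to a repository as an ordinary HTTP request with a `verb` parameter. A minimal sketch of building those request URLs follows; the base URL and the set name `maps` are hypothetical placeholders, not taken from the talk.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the data provider's real OAI-PMH base URL.
BASE_URL = "https://repository.example.org/oai"

# The six OAI-PMH verbs listed on the slide.
VERBS = {
    "Identify",
    "ListIdentifiers",
    "ListMetadataFormats",
    "ListSets",
    "ListRecords",
    "GetRecord",
}


def build_request(verb, **arguments):
    """Return the GET URL for one OAI-PMH request."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb, **arguments}
    return f"{BASE_URL}?{urlencode(params)}"


print(build_request("Identify"))
print(build_request("ListRecords", metadataPrefix="oai_dc", set="maps"))
```

The `metadataPrefix` and `set` arguments are the protocol's own parameter names; everything else here is illustrative.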
OAI Architecture

Source: Open Archives Forum Tutorial

Harvesting Problems
  • Sets
  • Metadata Formats
  • Metadata Artifacts
  • Granularity
  • Metadata Variances
Sets
  • Records are harvested in clumps, called “sets” created by DPs
  • No guidelines exist for defining sets
  • Examples:
    • Collection
    • Organizational structure
    • Format (but is a page image an image? See example)
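To see how a DP has chosen to carve up its records, a service provider can issue a ListSets request and read back the set names. The sketch below parses a simplified, made-up ListSets response (real responses also carry `responseDate` and `request` elements); the three sets are invented to mirror the collection, organizational, and format examples above.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample response, trimmed to the parts this sketch needs.
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>maps</setSpec><setName>Map Collection</setName></set>
    <set><setSpec>library:special</setSpec><setName>Special Collections Department</setName></set>
    <set><setSpec>image</setSpec><setName>Images</setName></set>
  </ListSets>
</OAI-PMH>"""


def parse_sets(xml_text):
    """Return (setSpec, setName) pairs from a ListSets response."""
    root = ET.fromstring(xml_text)
    return [
        (
            s.findtext("oai:setSpec", namespaces=OAI_NS),
            s.findtext("oai:setName", namespaces=OAI_NS),
        )
        for s in root.iterfind(".//oai:set", OAI_NS)
    ]


for spec, name in parse_sets(SAMPLE_LIST_SETS):
    print(spec, "->", name)
```

Nothing in the response says which organizing principle a DP used, which is why the set list usually has to be read by a person before harvesting.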
Metadata Formats
  • Only required format is simple Dublin Core, although any format can be made available in addition
  • Few DPs surface richer metadata
  • Simple DC is simply too simple!
  • Example (artifact vs. surrogate dates)
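The date problem can be illustrated with a made-up simple Dublin Core record. In this sketch the record carries two `dc:date` values, one imagined as the date of the original item and one as the date of its digital surrogate; the record content is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace in ElementTree's {uri}tag form.
DC = "{http://purl.org/dc/elements/1.1/}"

# Invented oai_dc record: a photograph (1906) and its scan (2003).
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>View of the harbor</dc:title>
  <dc:date>1906</dc:date>
  <dc:date>2003-05-12</dc:date>
</oai_dc:dc>"""


def dc_values(xml_text, element):
    """Return every value of one simple DC element, in document order."""
    root = ET.fromstring(xml_text)
    return [e.text for e in root.findall(DC + element)]


# Both dates come back as plain dc:date values; simple DC gives no way
# to tell which describes the artifact and which the surrogate.
print(dc_values(SAMPLE_RECORD, "date"))
```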
Metadata Artifacts
  • “unintended, unwanted aberrations”
  • Sample causes:
    • Idiosyncratic local practices
    • Anachronisms
    • HTML code
  • Examples:
    • Circa = string of dates for searching purposes
    • [electronic resource]
Granularity
  • Record Granularity: what is an “object”?
    • A book, or each individual page?
    • Examples: CDL, Univ. of Michigan
  • Metadata Granularity:
    • Multiple values in one field
    • Example: Univ. of Washington
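Two of the problems above lend themselves to simple cleanup code: artifacts such as HTML markup or “[electronic resource]” inside a value, and several values packed into one field. The following is a rough sketch, assuming a semicolon delimiter and invented sample values; real harvested data would need its delimiters and artifacts confirmed by inspection first.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
GMD_RE = re.compile(r"\s*\[electronic resource\]", re.IGNORECASE)


def clean_value(value):
    """Strip literal HTML tags, decode entities, and drop the
    '[electronic resource]' designation from one metadata value."""
    value = TAG_RE.sub("", value)
    value = html.unescape(value)
    value = GMD_RE.sub("", value)
    return " ".join(value.split())  # collapse leftover whitespace


def split_values(value, delimiter=";"):
    """Split a field that packs several values into one string.
    The delimiter is an assumption; check what each DP actually uses."""
    return [part.strip() for part in value.split(delimiter) if part.strip()]


print(clean_value("<b>Annual report</b> [electronic resource]"))
print(split_values("Fishing; Salmon ;  Columbia River;"))
```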
Metadata Variances
  • Subject terminology differences
  • Disparities in recording the same metadata
    • Example: date variances
  • Mapping oddities or mistakes
    • Examples: 1) format into description, 2) description into subject
Steps to a Fruitful Harvest
  • Needs Assessment (it’s the user, stupid)
  • DP Identification and Communication
  • Metadata Capture
  • Metadata Analysis
  • Metadata Subsetting
  • Metadata Normalization
  • Metadata Enrichment
  • Indexing
  • Interface (it’s still the user, stupid)
Needs Assessment
  • What are you trying to accomplish?
  • What will your users want to be able to do?
  • What metadata will you need, and what procedures will you need to set up to enable these activities?
  • Which repositories have what you want?
  • Is what they have (e.g., sets, metadata) usable as is, or not?
DP Identification & Communication
  • Identification:
    • Use UIUC directory of DPs to identify potential sources
  • Communication:
    • Not required to tell them you are harvesting, but may help establish a good relationship
    • May want to request that they surface a richer metadata format and/or provide a different set
Metadata Capture
  • Sample questions to answer:
    • Individual sets, or all?
    • Richer metadata formats available?
    • How frequently to reharvest?
    • Start from scratch each time or update?
  • Many software options
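The questions above map onto ListRecords arguments: `set` limits the harvest to one set, `metadataPrefix` picks the format, and `from` asks only for records changed since a given date, which is how an update differs from starting from scratch. Below is a minimal sketch of a harvesting loop that follows `resumptionToken` paging. It is not any particular harvester's implementation; the function names are made up, and error responses from the repository are not handled.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# OAI-PMH 2.0 namespace in ElementTree's {uri}tag form.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def fetch(url):
    """Default transport: fetch one response body over HTTP."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix="oai_dc", set_spec=None,
            from_date=None, fetch=fetch):
    """Yield <record> elements from ListRecords, following resumption tokens.

    Pass from_date (e.g. "2004-01-01") to update an earlier harvest,
    or leave it out to start from scratch.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    while True:
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(params)}"))
        yield from root.iter(OAI + "record")
        token = root.find(f".//{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one: the list is complete
        # A resumption token replaces all other arguments except the verb.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

The `fetch` parameter exists so the loop can be exercised against canned responses before it is pointed at a live repository.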
Virginia Tech Perl Harvester


| Harvester Sample Configurator |
| Version 1.1 :: July 2002 |
| Hussein Suleman <> |
| Digital Library Research Laboratory |
| :: Virginia Tech |

Defaults/previous values are in brackets - press <enter> to accept those
enter "&delete" to erase a default value
enter "&continue" to skip further questions and use all defaults
press <ctrl>-c to escape at any time (new values will be lost)

Press <enter> to continue

Add all the archives that should be harvested

Current list of archives:
No archives currently defined !

Select from: [A]dd [D]one
Enter your choice [D] : a{return}

You need a unique name by which to refer to the archive you
will harvest metadata from
Examples: nsdl-380602, VTETD

Archive identifier [] : nsdl-380602{return}

Metadata Analysis
  • Finding out what you have (and don’t have)
    • Encoding practices
    • Gap analysis (e.g., missing fields, etc.)
    • Mistakes (e.g., mapping errors)
  • Software can help
    • Commercial software like Spotfire
    • In-house or open source software tools

Five elements are used 71% of the time

Source: 2002 Master’s Thesis, Jewel Hope Ward, UNC Chapel Hill
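An in-house analysis tool can start very small. The sketch below counts how often each element is used across harvested records and how many records lack an expected element. The record layout (a dict mapping element name to a list of values) and the sample data are assumptions made for illustration, not the format of any specific tool.

```python
from collections import Counter

# Invented sample; each record maps a DC element name to its values.
SAMPLE_RECORDS = [
    {"title": ["A"], "creator": ["X"], "date": ["1906"]},
    {"title": ["B"], "subject": ["Fishing", "Salmon"]},
    {"title": ["C"], "date": ["ca. 1910"], "identifier": ["id-3"]},
]


def element_usage(records):
    """Count how many values each element contributes across all records."""
    totals = Counter()
    for record in records:
        for element, values in record.items():
            totals[element] += len(values)
    return totals


def missing_field_counts(records, expected):
    """For each expected element, count the records that lack it (gap analysis)."""
    return {el: sum(1 for r in records if not r.get(el)) for el in expected}


usage = element_usage(SAMPLE_RECORDS)
total = sum(usage.values())
for element, count in usage.most_common():
    print(f"{element:12} {count:3} {count / total:.0%}")
print(missing_field_counts(SAMPLE_RECORDS, ["title", "date", "rights"]))
```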

Metadata Subsetting
  • DP sets are unlikely to serve all SP uses well
  • SPs will need the ability to subset harvested metadata
  • Example: prototype subsetting tool
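The prototype tool itself is not described in the transcript, so the following is only a sketch of the general idea: once records are harvested, an SP can define its own subsets with filters over the metadata instead of relying on the DP's sets. The record layout and helper names are hypothetical.

```python
def has_value(element, needle):
    """Build a filter that matches records whose element contains needle
    (case-insensitive substring match)."""
    needle = needle.lower()

    def predicate(record):
        return any(needle in value.lower() for value in record.get(element, []))

    return predicate


def subset(records, predicate):
    """Return the harvested records that satisfy the filter."""
    return [record for record in records if predicate(record)]


# Invented sample records (element name -> list of values).
records = [
    {"title": ["View of the harbor"], "type": ["image"]},
    {"title": ["Annual report"], "type": ["text"]},
    {"title": ["Harbor survey, page 3"], "type": ["Image"]},
]

images = subset(records, has_value("type", "image"))
print([r["title"][0] for r in images])
```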
Metadata Normalization
  • Normalizing: to reduce to a standard or normal state
  • Prototype date normalization service screen
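The prototype service's rules are not given in the transcript. As a rough illustration of what date normalization involves, the sketch below reduces free-text date values to a (start year, end year) pair. The regular expressions and the choice to widen a lone "circa" year by five years each way are assumptions made for this example.

```python
import re

# Four-digit years from 1000 to 2099, standing alone as a token.
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")
# "ca.", "circa", or "c." directly before a number.
CIRCA_RE = re.compile(r"\b(?:ca\.?|circa|c\.)\s*\d", re.IGNORECASE)

CIRCA_SPAN = 5  # assumed widening, in years, for approximate dates


def normalize_date(value):
    """Return (start_year, end_year) for a date string, or None if no year is found."""
    years = [int(y) for y in YEAR_RE.findall(value)]
    if not years:
        return None
    start, end = min(years), max(years)
    if len(years) == 1 and CIRCA_RE.search(value):
        start, end = start - CIRCA_SPAN, end + CIRCA_SPAN
    return start, end


for raw in ["1906", "ca. 1910", "1906-1910", "2003-05-12", "undated"]:
    print(f"{raw!r:14} -> {normalize_date(raw)}")
```

Storing a year range rather than a single value keeps approximate dates searchable without listing every candidate year in the record.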
Metadata Enrichment
  • Adding fields or values may be useful or required, for example:
    • Metadata provider information
    • Geographic coverage
    • Subject terms mapped to a different thesaurus
    • Authority control record
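A sketch of what enrichment might look like in code follows: it tags each record with its provider, fills in geographic coverage when the record has none, and maps local subject terms to a target vocabulary. The field names `provider` and `subject_mapped`, the mapping table, and the sample values are all hypothetical.

```python
# Hypothetical mapping from local subject terms to a target thesaurus.
SUBJECT_MAP = {
    "fishing": "Fisheries",
    "salmon": "Salmon fisheries",
}


def enrich(record, provider_name, coverage=None, subject_map=SUBJECT_MAP):
    """Return a copy of the record with added provider, coverage,
    and mapped subject fields. The original record is left unchanged."""
    enriched = {name: list(values) for name, values in record.items()}
    enriched["provider"] = [provider_name]
    if coverage and not enriched.get("coverage"):
        enriched["coverage"] = [coverage]
    mapped = [
        subject_map[term.lower()]
        for term in enriched.get("subject", [])
        if term.lower() in subject_map
    ]
    if mapped:
        enriched["subject_mapped"] = mapped
    return enriched


record = {"title": ["Fishing boats"], "subject": ["Fishing", "Boats"]}
print(enrich(record, "Example Historical Society", coverage="Washington (State)"))
```

Keeping the mapped terms in a separate field preserves the DP's original subject values for display.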
Indexing
  • Pick your favorite database/indexing software:
    • MySQL
    • SWISH-E
  • May need to specifically set up a method to search across the entire record
  • May need different fields for indexing than for display
Interface
  • Software interface (API) for other applications:
    • SRU/SRW?
    • Arbitrary Web Services schema?
  • User interface
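The two indexing points above (a way to search across the entire record, and index fields that differ from display fields) can be sketched without committing to any one product. The toy in-memory index below stands in for MySQL or SWISH-E purely for illustration; the `anywhere` field name and the tokenizing rule are assumptions.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(TOKEN_RE.findall(text.lower()))


def index_fields(record):
    """Build index-side fields: one token set per element, plus an
    'anywhere' field covering the entire record."""
    fields = {name: tokenize(" ".join(values)) for name, values in record.items()}
    fields["anywhere"] = set().union(*fields.values())
    return fields


def build_index(records):
    """Map (field, token) to the positions of matching records.
    The records themselves stay untouched for display."""
    index = defaultdict(set)
    for position, record in enumerate(records):
        for field, tokens in index_fields(record).items():
            for token in tokens:
                index[(field, token)].add(position)
    return index


def search(index, term, field="anywhere"):
    """Return positions of records containing term in the given field."""
    return sorted(index.get((field, term.lower()), set()))


records = [
    {"title": ["View of the harbor"], "subject": ["Fishing"]},
    {"title": ["Fishing boats"], "creator": ["Unknown"]},
]
index = build_index(records)
print(search(index, "fishing"))
print(search(index, "fishing", field="title"))
```

A search interface, whether SRU/SRW or a custom web service, would sit in front of a function like `search` and return the stored display records for the matching positions.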
What’s Next?
  • Further protocol development
  • Services layered on top of OAI-PMH
  • Shared software tools
  • Best practices for both DPs and SPs