Metadata for Digital Repositories. Mark Jordan, Repository Redux, University of Prince Edward Island, September 19, 2007. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Canada License.
Metadata for Digital Repositories

Mark Jordan

Repository Redux

University of Prince Edward Island

September 19, 2007

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Canada License

Schedule
  • 9:00 - 10:30
    • Background; types of metadata; major standards; choosing metadata schemes
  • 10:45 - 12:00
    • Metadata life cycle; strategies for creation and management; automated creation; supplementation strategies
  • 1:00 - 2:30
    • SFU theses workflow case study; native vs. derived; crosswalks
  • 2:45 - 4:30
    • Application Profiles; OAI; CARLCore AP case study
What is Metadata?
  • Different meanings in different communities
  • Information about information
  • Can describe information at any level
    • Collection
    • Item
    • Item within item
  • Can be embedded within an object or separate from it
Types of Metadata
  • Descriptive
  • Terms and conditions
  • Administrative data
  • Content ratings
  • Provenance
  • Linking or relationship data
  • Structural data

Carl Lagoze, Clifford A. Lynch, and Ron Daniel, Jr. “The Warwick Framework: A Container Architecture for Aggregating Sets of Metadata”. 1996. http://hdl.handle.net/1813/7248

Metadata and Cataloguing
  • Perception that cataloguing is old and metadata is new
  • Traditional cataloguing focuses on descriptions of analogue materials
  • Metadata focuses on management of networked resources
  • For locally created or managed networked resources (such as repositories), cataloguing is insufficient
Metadata Schemes
  • Defines a collection of elements for supporting a specific function
  • Defines structures for element values
  • Defines formal aspects of the element set, such as name, definition, data type, etc.
  • Some schemes are expressed as XML schemas
Containers vs. Rules of Description
  • Containers dictate structure
  • Rules of description dictate content
  • Common rules of description
    • AACR2
    • RDA
    • RAD
Vs. Glue Standards
  • OpenURL
    • Syntax for encoding bib data in URLs
    • http://resolver.example.edu/cgi?genre=book&isbn=0836218310&title=The+Far+Side+Gallery+3
  • COinS
    • OpenURLs embedded in HTML <span> tags
  • unAPI
    • Identifiers embedded in HTML <abbr> tags for autodiscovery and “copy and paste”
  • Microformats
    • For example, <a href="http://creativecommons.org/licenses/by/2.0/" rel="license">cc by 2.0</a>
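
The OpenURL bullet above can be illustrated with a short sketch. It only shows the general idea of carrying citation data as URL-encoded key/value pairs; it is not a full implementation of the OpenURL standard. The resolver address is the placeholder from the slide, and the function names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder resolver address, copied from the slide's example URL.
RESOLVER_BASE = "http://resolver.example.edu/cgi"


def build_openurl(base, **citation):
    """Return a link made of the resolver base plus URL-encoded citation fields."""
    return base + "?" + urlencode(citation)


def parse_openurl(url):
    """Recover the citation fields from the query string of such a link."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items()}


link = build_openurl(RESOLVER_BASE, genre="book", isbn="0836218310",
                     title="The Far Side Gallery 3")
print(link)
print(parse_openurl(link))
```

A link resolver does the reverse step (`parse_openurl`) before looking the item up in local holdings.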
Selected Major Standards
  • Dublin Core
  • MODS
  • Collection Description
  • RDA
  • EAD
  • PREMIS
  • METS
Dublin Core
  • Standard metadata set for describing resources
  • It is flexible
    • Qualified vs. unqualified
    • Can be expressed in HTML, XML, or RDF
  • Dumbing down is a good thing
Dublin Core Element Set
  • Title
  • Creator
  • Subject
  • Description
  • Publisher
  • Contributor
  • Date
  • Type
  • Format
  • Identifier
  • Source
  • Language
  • Relation
  • Coverage
  • Rights
Dublin Core Qualifiers
  • Types
    • Element refinements
    • Encoding schemes
  • Examples
    • Description
      • Table of contents, abstract
    • Date
      • Created, valid, available, issued, modified
    • Subject
      • LCSH, MeSH, DDC, LCC, UDC
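
Element refinements are what make "dumbing down" possible: a refined element can always be reduced to the simple element it refines. A minimal sketch, assuming a record is a list of (element, value) pairs; the mapping covers only the refinements named on this slide, and the sample values are invented for illustration.

```python
# Refinement -> parent element, limited to the refinements listed on the slide.
REFINES = {
    "tableOfContents": "description",
    "abstract": "description",
    "created": "date",
    "valid": "date",
    "available": "date",
    "issued": "date",
    "modified": "date",
}


def dumb_down(record):
    """Reduce qualified Dublin Core pairs to their simple (unqualified) elements."""
    return [(REFINES.get(element, element), value) for element, value in record]


# Hypothetical qualified record.
qualified = [
    ("title", "Travels in Iceland"),
    ("issued", "2003"),
    ("abstract", "A travel diary."),
]
print(dumb_down(qualified))
```

The values survive, but the distinction between, say, an abstract and a table of contents does not.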
MODS
  • A “bibliographic element set that may be used for a variety of purposes, and particularly for library applications.”
  • Richer than DC, simpler than MARC
  • Does not assume the use of any specific cataloging code
  • Elements: titleInfo, title, name, namePart, originInfo, etc.
<?xml version="1.0" encoding="UTF-8"?>
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:titleInfo>
    <mods:title>A Jewel of Honesty</mods:title>
  </mods:titleInfo>
  <mods:genre>Article</mods:genre>
  <mods:abstract>clashing oppositions</mods:abstract>
  <mods:subject>
    <mods:geographic>N/A</mods:geographic>
  </mods:subject>
  <mods:subject authority="none">
    <mods:topic>General interest</mods:topic>
  </mods:subject>
  <mods:relatedItem type="host">
    <mods:titleInfo>
      <mods:title>Carnegie Newsletter</mods:title>
      <mods:title>Celebration a Spectacle of Hope</mods:title>
    </mods:titleInfo>
    <mods:name>
      <mods:namePart>Pra'N'Ava</mods:namePart>
      <mods:role>
        <mods:roleTerm authority="marcrelator" type="text">author</mods:roleTerm>
      </mods:role>
    </mods:name>
    <mods:name>
      <mods:namePart>N/A</mods:namePart>
      <mods:role>
        <mods:roleTerm authority="chodarr" type="text">recipient</mods:roleTerm>
      </mods:role>
    </mods:name>
    <mods:part>
      <mods:extent unit="pages">
        <mods:start>9</mods:start>
        <mods:list>9,14</mods:list>
      </mods:extent>
    </mods:part>
    <mods:originInfo>
      <mods:dateIssued encoding="iso8601">19870101</mods:dateIssued>
    </mods:originInfo>
  </mods:relatedItem>
</mods:mods>
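
Reading a MODS record programmatically mostly comes down to namespace-aware path lookups. The sketch below uses Python's standard ElementTree on a shortened copy of the record above; the function name and the choice of fields are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# Shortened copy of the sample record, for illustration only.
SAMPLE = """<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:titleInfo><mods:title>A Jewel of Honesty</mods:title></mods:titleInfo>
  <mods:genre>Article</mods:genre>
  <mods:relatedItem type="host">
    <mods:titleInfo><mods:title>Carnegie Newsletter</mods:title></mods:titleInfo>
  </mods:relatedItem>
</mods:mods>"""


def summarize(mods_xml):
    """Pull the article title, genre, and host publication title from a MODS record."""
    root = ET.fromstring(mods_xml)
    return {
        # Paths start at the root's children, so the host's title is not
        # mistaken for the article's own title.
        "title": root.findtext("mods:titleInfo/mods:title", namespaces=NS),
        "genre": root.findtext("mods:genre", namespaces=NS),
        "host": root.findtext(
            "mods:relatedItem[@type='host']/mods:titleInfo/mods:title",
            namespaces=NS,
        ),
    }


print(summarize(SAMPLE))
```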

DCMI Collection Description
  • Formal description of aggregation or collection of items
  • Can apply to collections where item-level metadata is not available or appropriate, or to collections where it is
  • Sample elements:
    • accrualMethod, accrualPeriodicity
  • Developed as NISO Z39.91

Dublin Core Collection Description Application Profile, http://www.ukoln.ac.uk/metadata/dcmi/collection-application-profile/2004-02-01/

RDA
  • Resource Description and Access, the successor to AACR2
  • Diane Hillmann’s critique
    • Reliance on transcription and specified sources of information
    • Reliance on unstructured notes
    • Multiple versions in one record
    • Full review at http://dublincore.org/usage/meetings/2006/04/seattle/rda-review/RDA_for_who.htm
EAD
  • XML schema for encoding archival finding aids
  • Contains elements for all aspects of archival description, from <repository> to <daoloc>
  • <archdesc> is the standard element for describing hierarchies of fonds, series, subseries, etc.
<did>
  <head>Summary Description of the Tom Stoppard Papers</head>
  <repository>
    <corpname>The University of Texas at Austin
      <subarea>Harry Ransom Humanities Research Center</subarea>
    </corpname>
  </repository>
  <origination>
    <persname source="lcnaf" encodinganalog="100">Stoppard, Tom</persname>
  </origination>
  <unittitle encodinganalog="245">Tom Stoppard Papers, </unittitle>
  <unitdate type="inclusive">1944-1995</unitdate>
  <physdesc encodinganalog="300">
    <extent>68 boxes (28 linear feet)</extent>
  </physdesc>
  <unitid type="accession">R4635</unitid>
  <physloc audience="internal">14E:SW:6-8</physloc>
  <abstract>The papers of British playwright Tom Stoppard (b. 1937) encompass
    his entire career and consist of multiple drafts of his plays, from the well-known
    <title render="italic">Rosencrantz and Guildenstern Are Dead</title> to several
    that were never produced, correspondence, photographs, and posters, as
    well as materials from stage, screen, and radio productions from around the
    world.</abstract>
</did>

PREMIS
  • Data model
    • Digital objects
    • Intellectual entities
    • Agents
    • Events
    • Rights
    • Relationships
  • Data Dictionary contains examples and sections on compliance and implementation
  • Can be encoded in METS
METSRights
  • Endorsed by METS Board but useful outside of METS documents
  • XML Elements
    • RightsDeclaration
      • RightsHolder
      • Context
        • Permissions
        • Constraint
METS
  • METS: Metadata Encoding & Transmission Standard
  • Encodes descriptive, administrative, and structural metadata in one XML file
  • Preferred data structure for digital library initiatives
  • Goals
    • Manage different types of metadata
    • Migrate resources between repositories
METS Community
  • Maintenance agency is Library of Congress
  • Website
    • http://www.loc.gov/standards/mets/
  • Implementation registry
    • Lists 33 projects at 24 institutions
METS Components
  • METS header
  • Descriptive metadata section
  • Administrative metadata section
  • File section
  • Structural map section
  • Structural link section
  • Behavior section
fileSec
  • Lists all files making up the resource
  • <FLocat> points to files
  • IDs of <file> elements link to pertinent administrative metadata in <amdSec> using the ADMID attribute
<mets:fileSec>
  <mets:fileGrp USE="archive image">
    <mets:file ID="epi01m" MIMETYPE="image/tiff">
      <mets:FLocat xlink:href="http://www.loc.gov/standards/mets/docgroup/full/01.tif" LOCTYPE="URL"/>
    </mets:file>
    <mets:file> … </mets:file>
  </mets:fileGrp>
  <mets:fileGrp USE="reference image">
    <mets:file ID="epi01r" MIMETYPE="image/jpeg">
      <mets:FLocat xlink:href="http://www.loc.gov/standards/mets/docgroup/jpg/01.jpg" LOCTYPE="URL"/>
    </mets:file>
  </mets:fileGrp>
  <mets:fileGrp USE="thumbnail image">
    <mets:file ID="epi01t" MIMETYPE="image/gif">
      <mets:FLocat xlink:href="http://www.loc.gov/standards/mets/docgroup/gif/01.gif" LOCTYPE="URL"/>
    </mets:file>
  </mets:fileGrp>
</mets:fileSec>
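
A consumer of a METS document typically walks the fileSec to find out which files exist for which purpose. The following sketch, using only the standard library, groups file IDs and locations by the USE attribute of each file group. The wrapping <mets:mets> element and its namespace declarations are added here so the fragment parses on its own; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
# ElementTree exposes namespaced attributes in {namespace}name form.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Reduced copy of the fileSec sample, wrapped in a root element for parsing.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="archive image">
      <mets:file ID="epi01m" MIMETYPE="image/tiff">
        <mets:FLocat xlink:href="http://www.loc.gov/standards/mets/docgroup/full/01.tif" LOCTYPE="URL"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="thumbnail image">
      <mets:file ID="epi01t" MIMETYPE="image/gif">
        <mets:FLocat xlink:href="http://www.loc.gov/standards/mets/docgroup/gif/01.gif" LOCTYPE="URL"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def files_by_use(mets_xml):
    """Return {USE value: [(file ID, location), ...]} for every file group."""
    root = ET.fromstring(mets_xml)
    result = {}
    for group in root.iterfind("mets:fileSec/mets:fileGrp", NS):
        entries = []
        for file_el in group.iterfind("mets:file", NS):
            locat = file_el.find("mets:FLocat", NS)
            entries.append((file_el.get("ID"), locat.get(XLINK_HREF)))
        result[group.get("USE")] = entries
    return result


print(files_by_use(SAMPLE))
```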

structMap
  • The only required section
  • Defines the hierarchical structure of the resource
  • Can be physical or logical
    • Physical structMaps simply list files in order
      • Pages that make up a book
    • Logical structMaps list files in order but in the context of the intellectual structure of the resource
      • Chapters that make up a book
<mets:structMap TYPE="physical">
  <mets:div TYPE="book" LABEL="Martial Epigrams II">
    <mets:div TYPE="page" LABEL="Blank page">
    </mets:div>
    <mets:div TYPE="page" LABEL="Page ii: Blank page">
    </mets:div>
    <mets:div TYPE="page" LABEL="Page iii: Title page">
    </mets:div>
    <mets:div TYPE="page" LABEL="Page iv: Publication info">
    </mets:div>
    <mets:div TYPE="page" LABEL="Page v: Table of contents">
    </mets:div>
    <mets:div TYPE="page" LABEL="Page vi: Blank page">
    </mets:div>
    <mets:div TYPE="page" LABEL="Page 1: Half title page">
    </mets:div>
    <mets:div TYPE="page" LABEL="Page 2 (Latin)">
    </mets:div>
    <mets:div TYPE="page" LABEL="Page 3 (English)">
    </mets:div>
  </mets:div>
</mets:structMap>

dmdSec
  • Contains descriptive metadata
  • Descriptive metadata can be included or linked externally
  • Descriptive metadata can be in any scheme
  • Can accommodate XML (e.g., MODS) or binary (e.g., MARC) representations of descriptive metadata
<mets:dmdSec ID="DMD1">
  <mets:mdWrap MIMETYPE="text/xml" MDTYPE="MODS">
    <mets:xmlData>
      <mods:mods version="3.1">
        <mods:titleInfo>
          <mods:title>Epigrams</mods:title>
        </mods:titleInfo>
        <mods:name type="personal">
          <mods:namePart>Martial</mods:namePart>
        </mods:name>
        <mods:name type="personal">
          <mods:namePart>Ker, Walter C. A. (Walter Charles Alan), 1853-1929</mods:namePart>
        </mods:name>
        <mods:typeOfResource>text</mods:typeOfResource>
      </mods:mods>
    </mets:xmlData>
  </mets:mdWrap>
</mets:dmdSec>

amdSec
  • Contains info on digital resource, files in the resource, or original analogue source
  • Type of info
    • Technical
    • Intellectual property
    • Provenance
<mets:techMD ID="AMD001">
  <mets:mdWrap MIMETYPE="text/xml" MDTYPE="NISOIMG" LABEL="NISO Img.Data">
    <mets:xmlData>
      <niso:MIMEtype>image/tiff</niso:MIMEtype>
      <niso:Compression>LZW</niso:Compression>
      <niso:PhotometricInterpretation>8</niso:PhotometricInterpretation>
      <niso:Orientation>1</niso:Orientation>
      <niso:ScanningAgency>NYU Press</niso:ScanningAgency>
    </mets:xmlData>
  </mets:mdWrap>
</mets:techMD>

METS Header
  • Contains info about the METS document
  • Sample

<mets:metsHdr CREATEDATE="2006-05-09T15:00:00" LASTMODDATE="2006-05-09T21:00:00">
  <mets:agent ROLE="CREATOR" TYPE="INDIVIDUAL">
    <mets:name>Rick Beaubien</mets:name>
  </mets:agent>
  <mets:altRecordID TYPE="LCCN">20022838</mets:altRecordID>
</mets:metsHdr>

structLink
  • Adds hyperlinks between elements in a Structural Map
  • Sample

<mets:structLink>
  <mets:smLink xlink:from="LINK7" xlink:to="page1145" xlink:title="projects">
  </mets:smLink>
  <mets:smLink xlink:from="LINK13" xlink:to="page1145" xlink:title="projects">
  </mets:smLink>
  <mets:smLink xlink:from="LINK36" xlink:to="page113" xlink:title="officers">
  </mets:smLink>
  <mets:smLink xlink:from="LINK37" xlink:to="page120" xlink:title="calendar">
  </mets:smLink>
</mets:structLink>

behaviorSec
  • Associates executable behaviors (i.e., computer code) with parts of a document/object
  • Sample

<mets:behaviorSec>
  <mets:behavior ID="disp1" STRUCTID="top" BTYPE="display" LABEL="Display Behavior">
    <mets:interfaceDef LABEL="EAD Display Definition" LOCTYPE="URL"
      xlink:href="http://texts.cdlib.org/dynaxml/profiles/display/oacDisplayDef.txt"/>
    <mets:mechanism LABEL="EAD Display Mechanism" LOCTYPE="URN"
      xlink:href="http://texts.cdlib.org/dynaxml/profiles/display/oacDisplayMech.xml"/>
  </mets:behavior>
</mets:behaviorSec>

Linking Between Sections
  • Can point to <dmdSec>
    • <file>, <stream>, <div>
  • Can point to <techMD>, <rightsMD>, <sourceMD>, <digiprovMD>
    • <dmdSec>, <file>, <fileGrp>, <stream>
  • Can point to <file>
    • <fptr>, <area>
  • Can point to <div>
    • <behavior>
METS Profiles
  • METS is so flexible, it needs to be documented for each particular application or use
  • Components
    • URI
    • Date
    • Abstract
    • Extension schemas
    • Rules of description
    • Vocabularies
    • Structural rules for resources
    • Technical metadata
What is the point of all this?
  • Management of digital resources requires many types of metadata
  • Managing all this metadata can be difficult
  • METS can do it all, but is complex
Functional Requirements
  • What do you expect your metadata to do?
    • The nature of the resources you are putting in your digital collection
    • The nature of the intended audience(s) for your collection
    • The level of description
    • The size of your collection
    • Importance of interoperability
    • The resources your library has for creation and long-term maintenance of the metadata
Nature of Resources
  • Is there full text?
  • Are they “simple” or “complex”?
  • Do you supply multiple versions of the same resource?
  • Are all resources available to all users?
Nature of Users
  • Is your audience general or specialized?
  • How information/network literate are they?
  • How much information will they need to choose appropriate resources?
  • What other assumptions can you safely make about your users, and how do those assumptions impact your metadata planning activities?
Level of Description
  • How much detail do you want to include in your metadata?
  • Related to resources available for creation of metadata, and balance of quantity vs. quality
  • Expensive (e.g., subject) vs. cheap (e.g., file size) descriptive elements
Size of Collection
  • Small collections rely less on metadata than large collections do
  • Browsing, faceting, and differentiating functions are more important in large collections
  • In general, the bigger the collection, the more granular the values in your metadata need to be
    • E.g., subject vocabularies
Importance of Interoperability
  • Metadata in local schemes is more difficult to share than metadata in standard schemes
  • Always assume your metadata will be used in contexts different from the original
  • Plan metadata with crosswalks in mind
Resources for Managing Metadata
  • How will metadata of various types be created and managed?
  • Does your institution have a DAM strategy?
  • Will preservation metadata (e.g., PREMIS) be managed?
FRBR’s User Tasks
  • Functional requirements can be expressed in terms of the FRBR data model
    • Find entities which correspond to user’s search criteria
    • Identify an entity
    • Select an entity
    • Acquire or obtain access to the desired entity
Analyzing Domains
  • Environmental
  • Object class
  • Object format

Jane Greenberg, “Understanding Metadata and Metadata Schemas.” In Metadata: A Cataloguer’s Primer. Ed. Richard P. Smiraglia. New York: Haworth. 2005.

Metadata Quality
  • Completeness
  • Accuracy
  • Provenance
  • Conformance to expectations
  • Logical consistency and coherence
  • Timeliness
  • Accessibility

Thomas R. Bruce and Diane I. Hillmann, “The Continuum of Metadata Quality: Defining, Expressing, Exploiting.” In Metadata in Practice. Ed. Diane I. Hillmann and Elaine L. Westbrooks. Chicago: American Library Association, 2004.

First Intermission

Before Lunch
  • Metadata life cycle
  • Strategies for creation and management
  • Automated metadata creation
  • Supplementation strategies
Metadata Management Life Cycle
  • Creation
  • Storage
  • Supplementation
  • Sharing
  • Repurposing
Strategies for Creation and Management
  • Depend on complexity and completeness of metadata
  • Common strategies
    • Create and manage simple (single type) metadata in one application
    • Create simple metadata in one app and manage in another
    • Create and manage different types of metadata in multiple apps, and combine for use
PREMIS survey
  • Most common tool was relational databases
  • XML databases or XML files stored with digital objects
  • Flat files or object-relational databases
  • Most respondents were using two or more of these methods
Creation
  • Avoid recreating metadata
  • Metadata can be created
    • At time of resource creation
      • Born digital
      • Digitized
    • After resource creation
  • Primarily a manual task
  • Variety of tools
Raw XML
  • Advantages
    • Provides high level of control
    • Requires simple tools
  • Disadvantages
    • XML makes humans’ heads ache
    • Extremely unforgiving of errors
Greenstone
  • Open source repository platform from University of Waikato
  • Provides support for several types of metadata and can export METS
  • Provides Java client (Greenstone Librarians’ Interface, a.k.a. GLI) for metadata production
  • Also provides “plugins” for extracting metadata
CONTENTdm
  • Commercial repository platform from OCLC
  • Provides support for several types of metadata and can export XML that can be converted into METS
  • Provides Windows client for production (Acquisition Station)
  • Also provides a web interface for creating metadata and ingesting content
CWIS
  • Open source “collection manager”
  • Product of the National Science Digital Library
  • Features rich metadata management tools
AlouetteCanada
  • Metadata Toolkit will provide local management and access
  • Portal will provide centralized access
  • Best practices documents will support creation and management of metadata and content
AlouetteCanada Metadata Toolkit
  • A content management system for library, archives, and museum collections
  • Will allow staff to create metadata and manage content
  • Scheme support
    • MODS
    • EAD
    • METS
  • Will allow basic digital asset management
DAM in the Toolkit
  • Tools for managing master and derivative versions of files
  • Tools for creating checksums and managing technical metadata
  • Tools for managing rights tracking
  • Tools for managing administrative metadata
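
Checksum creation, mentioned above, is easy to sketch with the standard library. This is a generic fixity-check illustration, not the Toolkit's actual implementation; the choice of SHA-256 and the function names are assumptions.

```python
import hashlib


def checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected, algorithm="sha256"):
    """Return True if the file still matches the checksum recorded earlier."""
    return checksum(path, algorithm) == expected
```

The digest would be stored as technical metadata when a master file is ingested, and `verify` run periodically to detect corruption.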
AlouetteCanada Portal
  • Aggregates metadata from participating institutions for centralized searching
  • Points back to Toolkit or whatever else is hosting items
  • Based on the OurOntario Portal
Automated Metadata Creation
  • Technical
    • JHOVE, DROID, digitization hardware
  • Descriptive
    • Born-digital document metadata
  • Subject
    • INFOMINE iVia tools
  • Structural
    • Sequential filename generation
Chinese Times Processing Workflow
  • Line up TIFF images in thumbnail view
  • Create directory with date as name
  • Copy that day’s files into directory
  • Run renamer/metadata creation script
    • Get all files in input dir, create full paths
    • Walk through inputfile list
    • Rename 1st file -01.tif, 2nd file -03.tif, 3rd file -02.tif, etc.
    • Output directory name and metadata file for CONTENTdm
  • Quality control
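
The renamer/metadata step can be sketched as two small functions. This is an illustration under stated assumptions, not the script used in the project: it numbers files in plain sorted order and does not reproduce the 01, 03, 02 page ordering noted above; the date-based filename pattern is hypothetical; the column list is taken from the issue-level metadata file shown on the next slide.

```python
# Column order of the issue-level metadata file for CONTENTdm.
COLUMNS = ["Title", "Date", "Publisher", "Rights", "Description",
           "Type", "Format", "Language", "Filename"]


def plan_issue(filenames, issue_date):
    """Return (old name, new name) pairs for one day's page scans.

    Assumption: pages are numbered in sorted filename order.
    """
    return [(old, f"{issue_date}-{number:02d}.tif")
            for number, old in enumerate(sorted(filenames), start=1)]


def metadata_row(values):
    """Build one tab-delimited row, leaving unknown columns empty."""
    return "\t".join(values.get(column, "") for column in COLUMNS)


print(plan_issue(["scan_b.tif", "scan_a.tif"], "1920-04-01"))
print(metadata_row({"Title": "Chinese Times, April 1st, 1920", "Date": "04/01/1920"}))
```

Separating the rename plan from the actual renaming makes the quality-control step easier: the plan can be reviewed before any file is touched.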
Import directory structure

Issue-level metadata file for import into CONTENTdm
  • Fields: Title, Date, Publisher, Rights, Description, Type, Format, Language, Filename
  • Sample values
    • Title: Chinese Times, April 1st, 1920
    • Date: 04/01/1920
    • Publisher: The Chinese Freemasons Society of Canada
    • Rights: Copyright the Chinese Freemasons Society of Canada

Storage
  • Some file formats enable internal storage of metadata
  • For external storage, relational databases offer most flexibility
    • Complex metadata can be stored in simple structures
    • Can handle hierarchical data
  • Relational databases are agnostic to other phases in the metadata life cycle
  • Not highly scalable for text retrieval
    • External indexers eliminate this problem
  • Can export and import XML, MARC, etc.
Repurposing
  • Different use of metadata than originally intended
  • Often migrated to or imported into an external system
  • Examples
    • Dumping new items lists from ILS for use in external portal
    • Creating MARC records from vendor spreadsheet (demo)
Sharing
  • All metadata should be created to be shared
  • May require exporting, crosswalking, supplementation
  • Basic approaches to sharing: metasearching and harvesting
  • Syntaxes for sharing are easy; semantics for sharing are more difficult
PKP Metadata Harvester
  • Open source
  • PHP/MySQL
  • Product of the Public Knowledge Project
  • Features
    • Can harvest any metadata format via OAI
    • Flexible plugin and customization features
    • Defines crosswalks between different schemas
Supplementation Strategies
  • Manually add or update elements
  • Programmatically supplement
  • Add namespaces
  • Virtual supplementation
Supplementation Examples
  • "on the horse" @ Harvard
  • adding namespaces into DC
  • PKP Metadata Harvester
  • CUFTS
    • cufts2marc
    • Subjects and other fields in MARC records in CUFTS
  • Georgia Tech’s Umlaut link resolver
Example: Programmatic Supplementation
  • get_subjects.pl titles.txt (demo)
  • Possible enhancements
    • Harvest complete record and pick out wanted fields
    • Write local MARC record
    • Add heuristics to dedupe and reduce false hits
Example: Add Namespaces

Creator: Jane Doe
Title: Travels in Iceland
Date: 12/07/2003

Becomes in OAI-PMH

<oai_dc:dc>
  <dc:creator>Jane Doe</dc:creator>
  <dc:title>Travels in Iceland</dc:title>
  <dc:date>12/07/2003</dc:date>
</oai_dc:dc>
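
The same transformation can be sketched in code. The example below wraps a flat label/value record in the oai_dc container using Python's ElementTree; the two namespace URIs are the standard OAI-PMH and Dublin Core ones, while the function name and the lowercase-the-label rule are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# Ask ElementTree to serialize with the conventional prefixes.
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)


def to_oai_dc(record):
    """Wrap a flat {label: value} record in a namespaced oai_dc:dc element."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for label, value in record.items():
        # Assumption: each label is the capitalized name of a simple DC element.
        child = ET.SubElement(root, f"{{{DC}}}{label.lower()}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_oai_dc({"Creator": "Jane Doe",
                      "Title": "Travels in Iceland",
                      "Date": "12/07/2003"})
print(xml_text)
```

Note that the date value is carried over unchanged; adding namespaces makes the record shareable but does not by itself resolve whether 12/07/2003 means December 7 or 12 July.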

Example: Virtual Supplementation
  • Georgia Tech’s Umlaut link resolver
    • SFX ERM data
    • ILS Oracle database for holdings info
    • OCLC's xISBN service for related ISBNs
    • Google and Yahoo APIs for open access material
    • OCLC's Resolver Registry to determine additional link resolver for user’s IP address

Ross Singer, posting to NGC4LIB list thread “Link resolvers as loosely coupled systems for holdings?” September 10, 2007

Before the Afternoon Break
  • SFU thesis workflow case study
  • Native vs. derived metadata
  • Crosswalks
Workflow Case Study: SFU Electronic Theses
  • Prototyped several ETD services
  • Was developing an institutional repository program
  • Contacted vendors for retro conversion and discovered we could do it ourselves
  • Saw increasing need to process print theses more efficiently
Goals
  • Digitize and provide access to over 4500 SFU theses described in our catalogue
  • Develop efficient current ETD service
  • Add content to SFU’s institutional repository
  • Provide access through both the catalogue and the IR
  • No intent to stop supporting print theses
Specifications
  • Digital versions would be for access only; no need seen to create high-quality masters
  • Theses would be available to all users
  • Metadata should be as rich as possible while remaining efficient to create
Issues
  • Rights Management of retro theses
    • “Fair dealing”
    • Use of PDF’s security features
  • Developing efficient workflows for processing current theses
  • Standardization of descriptive metadata
  • Technical issues
    • Dirty OCR and specialized symbols
    • Challenging source documents
Workflows
  • Current (December 2004 - )
    • Digitization
    • Metadata
  • Retrospective (1967 – 1997)
    • Digitization
    • Metadata
Workflow for Current Theses
  • Thesis Assistant provides master list in MS Excel when previous semester’s submissions “closed”
  • Digitization staff scan unbound copies directly into Adobe Acrobat
    • Filenaming scheme: Unique ID assigned manually
  • Systems staff convert metadata
  • Systems staff import into DSpace
  • Systems staff create MARC in batch
  • Tech Services load into library catalogue
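
The "convert metadata" step can be sketched as a mapping from spreadsheet columns to DSpace's dublin_core/dcvalue import layout. This is not the project's actual conversion script: the column names are hypothetical, and only three fields are mapped.

```python
import xml.etree.ElementTree as ET

# Hypothetical spreadsheet column -> (DC element, qualifier).
FIELD_MAP = {
    "author": ("contributor", "author"),
    "title": ("title", "none"),
    "year": ("date", "issued"),
}


def row_to_dublin_core(row):
    """Turn one spreadsheet row (a dict) into a dublin_core XML string."""
    root = ET.Element("dublin_core")
    for column, (element, qualifier) in FIELD_MAP.items():
        value = row.get(column)
        if value:  # skip empty cells rather than emitting empty dcvalue elements
            dcvalue = ET.SubElement(root, "dcvalue",
                                    element=element, qualifier=qualifier)
            dcvalue.text = value
    return ET.tostring(root, encoding="unicode")


dc_xml = row_to_dublin_core({"author": "Henderson, Brian Charles", "year": "2006"})
print(dc_xml)
```

Keeping the mapping in a single table makes it easy to review with cataloguing staff and to extend when the spreadsheet gains columns.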
Metadata Workflow for Current (Dec 2004 - ) Theses
  • Inputs
    • Scanned theses PDFs (filenames correspond to temp. theses IDs)
    • Thesis Assistant’s spreadsheet with temporary thesis ID added
  • theses2dspace.pl
    • Produces DSpace import metadata and packages
  • DSpace import utility
    • Loads packages into DSpace and produces the DSpace map file
  • dspace2marc.pl
    • Produces brief MARC records, including MARC 856: http://ir.lib.sfu.ca/handle/1892/99
  • III
    • Brief MARC records are loaded

DSpace import metadata

<dspace_import>
  <author>….</author>
  <title>…</title>
  <year>…</year>
  <dept>…</dept>
</dspace_import>

DSpace map file

thesisID1 1892/99
thesisID2 1892/100
thesisID3 1892/101

Brief MARC record

LDR 00747nas 2200157za 4500
005 20040903164118.1
006 m d d |
007 cr u||||||||||
008 040903||||||||||||||||||||d|||||||||||||
100 00 _aSmith, Student P.
245 00 _aThe title: _bcontaining some catchy words
856 04 _uhttp://ir.lib.sfu.ca/handle/1892/99
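
The link between the DSpace map file and the MARC 856 field can be sketched as follows. The map file format (an ID, whitespace, a handle) and the handle URL prefix are taken from the slide; the function names are hypothetical, and this is not the dspace2marc.pl script itself.

```python
# URL prefix as shown in the slide's 856 example.
HANDLE_BASE = "http://ir.lib.sfu.ca/handle/"

MAP_FILE = """thesisID1 1892/99
thesisID2 1892/100
thesisID3 1892/101
"""


def parse_map_file(text):
    """Return {thesis ID: handle} from map file lines of the form 'ID handle'."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # ignore blank or malformed lines
            mapping[parts[0]] = parts[1]
    return mapping


def handle_url(handle):
    """Return the URL to place in subfield u of the MARC 856 field."""
    return HANDLE_BASE + handle


handles = parse_map_file(MAP_FILE)
print(handle_url(handles["thesisID1"]))
```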

<dublin_core>
  <dcvalue element="contributor" qualifier="author">Henderson, Brian Charles</dcvalue>
  <dcvalue element="title" qualifier="none">Operational effectiveness in cellulose fibers business of Weyerhaeuser Company: can the cost trends of 2005 be reversed?</dcvalue>
  <dcvalue element="date" qualifier="issued">2006</dcvalue>
  <dcvalue element="language" qualifier="iso">en</dcvalue>
  <dcvalue element="rights" qualifier="none">Copyright remains with the author</dcvalue>
  <dcvalue element="type" qualifier="none">text</dcvalue>
  <dcvalue element="type" qualifier="none">thesis</dcvalue>
  <dcvalue element="description" qualifier="none">Research Project (M.B.A.) - Faculty of Business Administration – Simon Fraser University</dcvalue>
  <dcvalue element="description" qualifier="abstract">The Cellulose Fibers Business of Weyerhaeuser Company [...] </dcvalue>
  <dcvalue element="relation" qualifier="isformatof">http://troy.lib.sfu.ca/search/t?SEARCH=Operational+effectiveness+in+cellulose+fibers+business+of+Weyerhaeuser+Company+can+the+cost+trends+
LDR 00000nam 2200000Ia 4500
006 m||||||||d||||||||
007 cr||n||||||d||
008 070823s2006||||bcc||||||m||||||||||eng||
035 _fgb
040 _aCaBVas
    _beng
100 1 _aBuckham, Catherine Anne
245 10 _aPublic participation in land use planning:
    _bWhat is the role of social capital? /
    _cby Catherine Anne Buckham
300 _a leaves
260 _aBurnaby B.C. :
    _bSimon Fraser University,
    _c2006
500 _aTheses (Urban Studies Program) / Simon Fraser University
502 _aResearch Project (M.U.S.) - Simon Fraser University, 2006
520 3 _aThis study examines […]
810 2 _aSimon Fraser University.
    _tTheses (Urban Studies Program)
856 41 _uhttp://ir.lib.sfu.ca/handle/1892/3730
966 _c2
    _linprc
    _s-
    _i3
967 _c0

Workflow for Retro Theses
  • Master production list derived from MARC records in catalogue
    • Filenaming scheme based on ILS bib record number
  • Digitization staff
    • Scan from microfiche and print copies
    • Remove signatures from approval pages manually
    • Create PDFs from page images
Retrospective Digitization Workflow
  • Pre-scanning preparation
    • Check hard drive space
    • Create working directory
  • Scan printed theses
    • Test scan (poor quality: repeat; good quality: continue)
    • Perform batch scanning (please refer to flatbed scanning instructions)
  • Image processing
  • Quality check (poor quality: repeat; good quality: continue)
  • PDF conversion

Courtesy of Ian Song, Digital Initiatives Coordinator, SFU

slide103

Metadata Workflow: Retrospective (1966 - 1997) Theses

MARC records from III:

LDR 00747nas 2200157za 4500
005 20040903164118.1094254879.1
006 m d d |
007 cr u||||||||||
008 040903||||||||||||||||||||d|||||||||||||
100 00 _aSmith, Student P.
245 00 _aThe title: _bcontaining some catchy words

marc2dspace.pl:

#!/usr/local/bin/perl
### Main program ###
&OpenInputFile;
&OpenOutputFiles;

DSpace import metadata and packages:

<dspace_import>
<author>….</author>
<title>…</title>
<year>…</year>
<dept>…</dept>
</dspace_import>

Scanned theses PDFs (filenames correspond to III .bnumbers)

DSpace import utility

DSpace

DSpace map file:

b18721102 1892/204
b18762105 1892/205
b14731140 1892/1206

updatethesesmarc.pl (shown with the same script header as marc2dspace.pl)

Brief MARC records containing .bnumber and 856 field for overlaying on existing records:

035 .b18721102
856 04 _uhttp://ir.lib.sfu.ca/handle/1892/99

III (MARC 856: http://ir.lib.sfu.ca/handle/1892/99)
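
The last leg of this workflow (DSpace map file to brief MARC overlay records) might look roughly like the sketch below. It is not the actual updatethesesmarc.pl, whose body is not shown in the slides. It assumes the map file holds one ".bnumber handle" pair per line, as in the sample, and it emits the text display form of the fields rather than binary MARC. The function names and the HANDLE_PREFIX constant are hypothetical; the 856 indicators copy the "04" sample on this slide.

```python
HANDLE_PREFIX = "http://ir.lib.sfu.ca/handle/"  # prefix seen in the sample 856 field


def parse_map_file(text):
    """Return (bnumber, handle) pairs from map-file text, one pair per line."""
    pairs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            pairs.append((parts[0], parts[1]))
    return pairs


def brief_marc(bnumber, handle):
    """Render a brief record as text: 035 for the .bnumber, 856 for the URL."""
    return [
        f"035 .{bnumber}",
        f"856 04 _u{HANDLE_PREFIX}{handle}",
    ]


sample_map = """b18721102 1892/204
b18762105 1892/205
b14731140 1892/1206
"""

records = [brief_marc(b, h) for b, h in parse_map_file(sample_map)]
for record in records:
    print("\n".join(record))
    print()
```

Keying the overlay on the .bnumber is what lets the ILS match each brief record to the existing catalogue record.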

interoperability
Interoperability
  • The ability of one system to communicate with another
  • Can exist on various levels
    • Low-level protocols like TCP/IP
    • High-level like metadata
  • Examples relevant to digital repositories
    • Dublin Core within METS document
    • OAI-PMH
  • Syntactic and semantic interoperability
how much interoperability
How Much Interoperability?
  • Will your collection be integrated into / linked to a larger one?
  • How important is internal consistency within your collections?
  • Best practices encourage interoperability
  • (Qualified) Dublin Core is safe choice
crosswalks
Crosswalks
  • Mappings for converting one schema to another
  • DC to MARC, DC to MODS, MARC to MODS, etc
  • Promote reuse, interoperability
  • Sample list
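
A crosswalk can be thought of as a lookup table from fields in one schema to elements in another. The sketch below maps a few MARC tag/subfield pairs to unqualified Dublin Core elements. The mapping table is illustrative only, not an official crosswalk, and the names MARC_TO_DC and marc_to_dc are hypothetical.

```python
# Illustrative (tag, subfield) -> Dublin Core element mapping; not an official crosswalk.
MARC_TO_DC = {
    ("100", "a"): "creator",
    ("245", "a"): "title",
    ("260", "b"): "publisher",
    ("260", "c"): "date",
    ("520", "a"): "description",
    ("856", "u"): "identifier",
}


def marc_to_dc(marc_fields):
    """Convert (tag, subfield, value) triples to a dict of DC element -> values."""
    dc = {}
    for tag, subfield, value in marc_fields:
        element = MARC_TO_DC.get((tag, subfield))
        if element is not None:  # fields with no mapping are dropped
            dc.setdefault(element, []).append(value)
    return dc


record = [
    ("245", "a", "Public participation in land use planning:"),
    ("245", "b", "What is the role of social capital? /"),
    ("260", "b", "Simon Fraser University,"),
    ("260", "c", "2006"),
    ("856", "u", "http://ir.lib.sfu.ca/handle/1892/3730"),
]
dc_record = marc_to_dc(record)
print(dc_record)
```

Note that the 245 _b subtitle has no entry in this table, so it does not survive the conversion.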
lossy and lossless crosswalks
Lossy and Lossless Crosswalks
  • Lossy: crosswalk removes granularity
  • Lossless: no loss of granularity
  • Dumb down vs. smarten up
  • Acid test: round trip a data set
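
The round-trip test can be sketched as follows: convert a record to the simpler schema, convert it back, and compare with the original. The dotted qualifier notation ("date.issued"), the sample values, and the rule used to "smarten up" bare dates are all assumptions made for illustration.

```python
def qualified_to_simple(record):
    """Dumb down: strip qualifiers, e.g. 'date.issued' -> 'date'."""
    simple = {}
    for element, values in record.items():
        base = element.split(".")[0]
        simple.setdefault(base, []).extend(values)
    return simple


def simple_to_qualified(record, default_qualifiers):
    """Smarten up: assign a default qualified name to each unqualified element."""
    qualified = {}
    for element, values in record.items():
        target = default_qualifiers.get(element, element)
        qualified.setdefault(target, []).extend(values)
    return qualified


original = {
    "title": ["Public participation in land use planning"],
    "date.issued": ["2006"],
    "date.available": ["2007"],  # hypothetical value
}
# Hypothetical rule: on the way back, treat every bare date as the issue date.
defaults = {"date": "date.issued"}

round_tripped = simple_to_qualified(qualified_to_simple(original), defaults)
is_lossless = round_tripped == original
print(is_lossless)
```

Here the two dates collapse into one element on the way down and cannot be told apart on the way back, so the comparison fails and the crosswalk is lossy.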
native vs derived metadata
Native vs. Derived Metadata
  • Moving metadata from one container to another
  • Crosswalks document correspondences
  • Deriving metadata is part of sharing and reuse
example alouette metadata toolkit
Example: Alouette Metadata Toolkit
  • Metadata is stored internally in a relational database and as raw XML files
    • element ID, element ID relation, info object ID, culture, element, schema, value
    • Attributes are also rows in the same table
  • It is exported as METS and EAD files
slide112

Second

Intermission

before end of day
Before End of Day
  • Application Profiles
  • OAI-PMH
application profiles
Application Profiles
  • A set of metadata elements, policies, and guidelines defined for a particular community or implementation
  • Obligation, legal qualifiers and values, best practice
  • CEN (European Committee For Standardization) CWA 14855
  • Examples
    • CanCore
    • DCMI Library Application Profile
    • DCMI Education Application Profile
    • OhioLINK Digital Media Center (DMC) Metadata Application Profile
why are profiles necessary
Why are Profiles Necessary?
  • Among 82 OAI data providers, 71% used only 5 elements (creator, identifier, title, date, and type)
  • 54% of providers used only creator and identifier for over half their records

Jewel Ward, “Unqualified Dublin Core Usage in OAI-PMH Data Providers.” OCLC Systems and Services 20.1 (2004), 40-47.

invent or borrow
Invent or Borrow?
  • Avoid inventing; borrow instead
  • Overhead of maintaining your own schema
  • Is your material so special?
  • Borrow properties (fields, elements), put effort into values
  • Document and give back your application profile
slide118
OAI
  • OAI-PMH: Open Archives Initiative Protocol for Metadata Harvesting
  • Harvesting, not resource discovery
  • Uses standard Web protocols
oai pmh model

OAI-PMH Model
  • Data providers expose metadata
  • Service providers harvest metadata and do something useful with it
  • Service providers send requests (verbs); data providers reply with <OAI-PMH>… responses

examples of verbs
Examples of Verbs

http://oai.lib.sfu.ca/oai2.php?verb=Identify

  • verb=ListSets
  • verb=ListRecords&set=cartoons&metadataPrefix=oai_dc
  • verb=ListRecords&from=2002-06-01T02:00:00Z&set=cartoons&metadataPrefix=oai_dc
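
Because OAI-PMH requests are ordinary HTTP query strings, they can be assembled with a standard URL library. The sketch below rebuilds the requests shown on this slide. The helper name oai_request is hypothetical, and the trailing underscore in from_ is only a workaround for "from" being a reserved word in Python.

```python
from urllib.parse import urlencode

BASE_URL = "http://oai.lib.sfu.ca/oai2.php"  # base URL from the slide


def oai_request(verb, **arguments):
    """Build an OAI-PMH request URL from a verb and its arguments."""
    params = {"verb": verb}
    for name, value in arguments.items():
        # 'from' is a Python keyword, so callers pass it as from_
        params[name.rstrip("_")] = value
    return BASE_URL + "?" + urlencode(params)


identify_url = oai_request("Identify")
list_url = oai_request(
    "ListRecords",
    from_="2002-06-01T02:00:00Z",
    set="cartoons",
    metadataPrefix="oai_dc",
)
print(identify_url)
print(list_url)
```

urlencode percent-encodes the colons in the datestamp (%3A), which is equivalent to the unencoded form once the server decodes the query string.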

selective harvesting
Selective harvesting
  • Sets
    • Used for grouping items
    • May be flat or hierarchical
      • province:british+columbia
      • Type:Reports
  • Datestamps
    • Uses Coordinated Universal Time
    • “from” and “until” arguments
      • verb=…&from=2003-01-15
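
OAI-PMH 2.0 datestamps are expressed in UTC at either day granularity (YYYY-MM-DD) or seconds granularity (YYYY-MM-DDThh:mm:ssZ). A small sketch of producing both forms from a local time follows; the function name and the sample local time are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def oai_datestamp(moment, granularity="second"):
    """Format a timezone-aware datetime as an OAI-PMH UTC datestamp."""
    utc = moment.astimezone(timezone.utc)
    if granularity == "day":
        return utc.strftime("%Y-%m-%d")
    return utc.strftime("%Y-%m-%dT%H:%M:%SZ")


# Hypothetical local time: 15 January 2003, 18:30 in a UTC-8 zone.
pacific = timezone(timedelta(hours=-8))
local = datetime(2003, 1, 15, 18, 30, tzinfo=pacific)

seconds_stamp = oai_datestamp(local)
day_stamp = oai_datestamp(local, "day")
print(seconds_stamp)
print(day_stamp)
```

Converting to UTC first matters: an evening timestamp on the west coast already falls on the next calendar day in UTC, which changes what a "from" or "until" argument selects.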
harvest store repurpose

Harvest, Store, Repurpose
  • OAI Rep (four shown)
  • Harvester / Aggregator / Data store
  • Search
  • New this week
  • Some other harvester
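
The "harvest and store" step can be sketched with a standard XML parser: read a ListRecords response, pull out each record's identifier and Dublin Core elements, and keep them in a local store. The sample response below is hand-written for illustration (its identifier and datestamp are hypothetical), and a dict stands in for the data store.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample response, trimmed to the parts this sketch reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1892/3730</identifier>
        <datestamp>2006-11-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Public participation in land use planning</dc:title>
          <dc:creator>Buckham, Catherine Anne</dc:creator>
          <dc:type>thesis</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def harvest(xml_text):
    """Return {identifier: {element: [values]}} for each record in a response."""
    store = {}
    root = ET.fromstring(xml_text)
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        fields = {}
        for element in record.iterfind("oai:metadata/oai_dc:dc/*", NS):
            name = element.tag.split("}")[1]  # strip the '{namespace}' prefix
            fields.setdefault(name, []).append(element.text)
        store[identifier] = fields
    return store


data_store = harvest(SAMPLE)
print(data_store)
```

Once the records sit in a local store, services such as search or a "new this week" list can be built on top of it without contacting the source repositories again.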

metadata sharing case study carl harvester and carlcore ap
Metadata Sharing Case Study: CARL Harvester and CARLCore AP
  • “Canadian Association of Research Libraries / Association des bibliothèques de recherche du Canada's Institutional Repository Metadata Harvester”
  • http://carl-abrc-oai.lib.sfu.ca/
  • Launched June 2004
  • Now contains 35,000+ records
  • Primarily a search engine for the harvested metadata
  • Uses the PKP Metadata Harvester software
repositories
Repositories
  • Archimede Université Laval
  • Collection mémoires et thèses de l'Université Laval
  • DSpace@UCalgary.ca
  • eCommons::Research (University of Winnipeg)
  • Mspace (University of Manitoba)
  • Ozone (Ontario Scholars Portal)
  • Papyrus - Dépôt institutionnel numérique (Université de Montréal)
  • Simon Fraser University Institutional Repository
  • T-Space (University of Toronto)
  • University of Saskatchewan Electronic Theses & Dissertations
  • University of Waterloo Electronic Theses
  • UVicDSpace
the problem
The Problem
  • Increased dissatisfaction with search capabilities

The Solutions

  • Improvements to the software
  • Development of an application profile
goals130
Goals
  • Develop a profile that
    • Improves quality of aggregated metadata
    • Is practical
    • Is voluntary
  • Benefits include
    • Better centralized services
    • Streamlined local practices
    • Guidance for new repositories
working group
Working Group
  • Mark Jordan (SFU), Chair
  • Sam Kalb (Queen’s)
  • Lynne McAvoy (CISTI)
  • Lisa O’Hara (Manitoba)
  • Sharon Rankin (McGill)
  • Kathleen Shearer (CARL)
  • Nancy Stuart (Victoria)
process
Process
  • Analyze the metadata (from June 2005)
  • Develop use cases and functional requirements
  • Survey other application profiles
    • ePrints UK “Using Simple Dublin Core to Describe Eprints”
    • “ARROW Discovery Service Harvesting Guide”
timeline past
Timeline (past)
  • October 2004: Proposal to develop AP
  • April 2005: Formation of mailing list
  • September 2005: Meeting in Ottawa
  • March 2006: Formation of AP working group
  • June 2006: Meeting in Québec
  • October 2006: CARLCore Level 1 available for comment
timeline future
Timeline (future)
  • November 10, 2006: Deadline for comments
  • January 31, 2007: Final release
    • IR platform-specific implementation guidelines
    • French translation
  • Ongoing: CARLCore Level 2
carlcore ap
CARLCore AP
  • Document is a standard application profile
  • Containing…
    • Rationale
    • General principles and recommendations
    • Entries for each uDC element
    • Appendices
      • Implementation guidelines
      • Sample records
      • CARLCore and the CARL Harvester
carlcore level 1
CARLCore Level 1
  • Uses only unqualified Dublin Core
  • Goal is to make use of the DC elements in OAI as consistent as possible
  • From the “Principles”:

CARLCore Level 1 parallels the Dublin Core Metadata Element Set in order to supply the richest and most consistent metadata possible within the minimum requirements of the Open Archives Initiative Protocol for Metadata Harvesting.

sample elements
Sample Elements
  • Identifier
  • Source
  • Type
handling local variations
Handling local variations
  • Top-down approach
    • Dictate shared vocabulary
  • Bottom up approach
    • Provide solution for accommodating both local and centralized needs
type map solution
“Type map” solution
  • Harvester uses a “map file” to convert local type values into shared vocabulary
  • Simple XML format
  • Each repository administrator maintains the map file
  • End result is that metadata is processed while being harvested
slide145

[Diagram: Harvester and Local repository, linked by verb=ListRecords; type values shown: dissertation, picture, thesis, image]

slide146

<mappings>

<mapping from=" " to="Actes de conférence / Conference Proceedings" />

<mapping from=" " to="Article" />

<mapping from=" " to="Audio" />

<mapping from=" " to="Carte, plan / Map, plan" />

<mapping from=" " to="Chapitre de livre / Book chapter" />

<mapping from=" " to="Communication, présentation / Paper, Presentation" />

<mapping from=" " to="Ensemble de données / Dataset" />

<mapping from=" " to="Image" />

<mapping from=" " to="Livre / Book" />

<mapping from=" " to="Logiciel / Software" />

<mapping from=" " to="Mémoire de maîtrise / Master's thesis" />

<mapping from=" " to="Objet d'apprentissage / Learning Object" />

<mapping from=" " to="Partition musicale / Musical Score" />

<mapping from=" " to="Pré-publication / Preprint" />

<mapping from=" " to="Rapport / Report" />

<mapping from=" " to="Thèse de doctorat / Doctoral dissertation" />

<mapping from=" " to="Vidéo / Video" />

<mapping from=" " to="Autre / Other" />

</mappings>
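
A harvester could apply a map file in this format roughly as sketched below. The "from" values are hypothetical local terms (the slide leaves them blank), and falling back to "Autre / Other" for unmapped values is an assumption, not documented behaviour of the PKP Metadata Harvester.

```python
import xml.etree.ElementTree as ET

# Map file in the format shown on the slide; "from" values are hypothetical.
MAP_FILE = """<mappings>
  <mapping from="dissertation" to="Thèse de doctorat / Doctoral dissertation" />
  <mapping from="picture" to="Image" />
</mappings>"""


def load_type_map(xml_text):
    """Read <mapping from="..." to="..."/> entries into a dict."""
    root = ET.fromstring(xml_text)
    return {m.get("from"): m.get("to") for m in root.iter("mapping")}


def normalize_types(record, type_map, fallback="Autre / Other"):
    """Return a copy of the record with its type values mapped to the shared vocabulary."""
    normalized = dict(record)
    normalized["type"] = [type_map.get(v, fallback) for v in record.get("type", [])]
    return normalized


type_map = load_type_map(MAP_FILE)
harvested = {"title": ["A sample record"], "type": ["picture", "postcard"]}
normalized = normalize_types(harvested, type_map)
print(normalized["type"])
```

The local repository keeps its own type values; only the harvested copy is rewritten.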

carlcore level 2
CARLCore Level 2
  • Will add elements to CARLCore Level 1
  • One existing goal is to provide faceted discipline browsing
    • Using OAI sets?
    • Using one or more non-uDC elements?
  • May focus on disciplinary archives
  • Other features leading to “added value” for users
implementation issues
Implementation Issues
  • Legacy metadata
  • Conflicts with local IR metadata practice
  • Inflexible OAI gateways in IR platforms
  • Lack of tools to test compliance
  • Yes, using CARLCore is optional… but there is strength in numbers
carlcore to do list
CARLCore To Do List
  • Take advantage of PKP Harvester’s data normalization features
  • CARLCore Level 2
  • Stay current with (and collaborate with) IR platforms
summary
Summary
  • Metadata requirements for repositories drive decisions
  • Do not reinvent the wheel — instead, adopt or develop an application profile
  • Metadata must be managed
  • Tools should not define your ability to manage your metadata
  • Metadata can be shared
recommended online reading
Recommended Online Reading
  • METS Primer and Reference Manual. http://www.loc.gov/standards/mets/METS%20Documentation%20draft%20070310p.pdf
  • DCMI Proceedings. http://www.dcmipubs.org/ojs/index.php/pubs
  • Understanding Metadata. NISO, 2004. http://www.niso.org/standards/resources/UnderstandingMetadata.pdf
recommended print reading
Recommended Print Reading
  • Library Technology Reports: Metadata and Its Applications. Ed. Brad Eden. 41.6: November-December 2005.
  • Metadata: A Cataloguer’s Primer. Ed. Richard P. Smiraglia. New York: Haworth. 2005.
  • Metadata in Practice. Ed. Diane I. Hillmann and Elaine L. Westbrooks. Chicago: American Library Association, 2004.