Metadata interoperability for everyone – XML tools for catalogers Terry Reese Digital Production Unit Head Oregon State University
Finding our way • Metadata Interoperability • Crosswalk systems • Common problems • Metadata tools • Scripting Solutions • MarcEdit • MarcEdit and MODS • Metadata transformations • MODS editing • Automatic MODS harvesting • Conclusion
Why metadata interoperability? • Today, we have literally hundreds of different metadata schemas. In the library, we have a wide variety as well. • MARC (and all its flavors) • FGDC • Dublin Core • EAD • METS • MODS • Onyx • OAI • TEI • FRBR • GILS • etc…..
If you describe it….. • Metadata schemas are created by communities to meet the special descriptive needs of those communities. • One danger is that competing standards within a group produce multiple incompatible schemas, or that a single schema splinters into local variations within a community.
If you describe it….. <controlaccess> <subject source="lcsh" encodinganalog="650">College students--Iowa--Mount Vernon.</subject> <subject encodinganalog="650" source="lcsh">Student activities--Iowa--Mount Vernon.</subject> </controlaccess> <controlaccess> <subject source="lcsh"> <controlaccess encodinganalog="650a">College students</controlaccess> <controlaccess encodinganalog="650z">Iowa</controlaccess> <controlaccess encodinganalog="650z">Mount Vernon.</controlaccess> </subject> </controlaccess>
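To make the point of the two encodings concrete: both carry the same LCSH heading, but a program has to handle each variant separately before the data can be compared. The following is a minimal sketch, assuming only the two shapes shown on this slide; the function name `subject_parts` and the constants are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Variant 1: the whole heading is one string with "--" between subdivisions.
FLAT = (
    '<controlaccess>'
    '<subject source="lcsh" encodinganalog="650">'
    'College students--Iowa--Mount Vernon.</subject>'
    '</controlaccess>'
)

# Variant 2: each subdivision is tagged separately inside <subject>.
NESTED = (
    '<controlaccess><subject source="lcsh">'
    '<controlaccess encodinganalog="650a">College students</controlaccess>'
    '<controlaccess encodinganalog="650z">Iowa</controlaccess>'
    '<controlaccess encodinganalog="650z">Mount Vernon.</controlaccess>'
    '</subject></controlaccess>'
)


def subject_parts(xml_text):
    """Return each subject heading as a list of its subdivisions."""
    root = ET.fromstring(xml_text)
    headings = []
    for subject in root.iter("subject"):
        children = list(subject)
        if children:
            # Variant 2: read the tagged subdivisions in document order.
            parts = [(child.text or "").strip() for child in children]
        else:
            # Variant 1: split the flat string on the "--" delimiter.
            parts = [p.strip() for p in (subject.text or "").split("--")]
        headings.append(parts)
    return headings
```

Both `subject_parts(FLAT)` and `subject_parts(NESTED)` yield the same list of subdivisions, but only because the code knows about both local practices in advance.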
If you describe it… Some specialized examples: • MARC (MAchine-Readable Cataloging) • http://oregonstate.edu/~reeset/presentations/ala/summer2005/marc.txt • EAD (Encoded Archival Description) • http://oregonstate.edu/~reeset/presentations/ala/summer2005/ead.xml • (MARC representation: http://oasis.orst.edu/record=b2324248) • Dublin Core • http://oregonstate.edu/~reeset/presentations/ala/summer2005/dc.xml • FGDC • http://oregonstate.edu/~reeset/presentations/ala/summer2005/fgdc.xml
If you describe it… Why would communities develop shared metadata schemas? • Shared schemas provide a structured method for sharing data within a community. • Example: MARC…its development paved the way for the current cooperative cataloging model and tools like: • OCLC • RLIN • Z39.50 • But shared best practices?
Why use crosswalks? Crosswalks: • Are developed by examining the similarities and differences between schemas. • Are one of the primary mechanisms that allow different systems to interoperate. • Break down data transfer barriers, allowing different systems to share data.
Why use crosswalks? • To combine metadata catalogs, e.g. union catalogs • To provide cross-searchability between unlike datasets, e.g. federated search tools • To perform data/metadata maintenance, e.g. updating metadata formats and moving away from obsolete standards • To repurpose metadata from one schema to another.
Why use crosswalks? • Cost • Metadata creation costs can be prohibitive • Indiana University reported in 2003 that one third of its total digitization cost was attributable to metadata creation (Digitization Costs & Funding, 2003). This covered only the initial metadata creation and did not include estimates for ongoing metadata maintenance. • However, this isn’t just a digitization issue; it’s also an issue for traditional catalog workflows (books, serials, etc.): • Loose OSU cost approximations (including OCLC charges): • Books (copy cataloging): $3/book • Books (original): $27/book • Thesis (subject/classification): $20/thesis
Crosswalking challenges • Schema granularity • One to many matches and many to one matches • Crosswalking from schemas with different granularity levels • Trying to map anything from unqualified Dublin Core. • Handling object relationships or hierarchies. • EAD=>MARC
Crosswalking challenges • Dealing with spare parts • Since data crosswalking is rarely a one to one mapping, the process nearly always results in unmappable data.
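The granularity and "spare parts" problems can be sketched with a small mapping table. This is an illustration only: the source field names (`subject_place`, `shelf_location`, and so on) and the table itself are hypothetical, not taken from any published crosswalk.

```python
# Hypothetical crosswalk table: source field -> list of target elements.
CROSSWALK = {
    "title": ["dc:title"],
    "author": ["dc:creator"],
    "subject_topic": ["dc:subject"],                  # many-to-one with the next row
    "subject_place": ["dc:subject", "dc:coverage"],   # one-to-many
}


def crosswalk(record):
    """Map a flat source record; return (mapped, leftovers)."""
    mapped, leftovers = {}, {}
    for field, value in record.items():
        targets = CROSSWALK.get(field)
        if not targets:
            # No equivalent in the target schema: this is a "spare part".
            leftovers[field] = value
            continue
        for target in targets:
            mapped.setdefault(target, []).append(value)
    return mapped, leftovers


SAMPLE_RECORD = {
    "title": "Campus life",
    "subject_topic": "College students",
    "subject_place": "Iowa",
    "shelf_location": "Box 12",
}
```

Calling `crosswalk(SAMPLE_RECORD)` maps three of the four fields and returns `shelf_location` as a leftover. A real crosswalk has to decide whether such data is dropped, parked in a generic note, or reported to the user.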
Common Crosswalking System Designs • Type-broker model (Ockerbloom) • Facilitates crosswalking – allows users to query known systems • Provides analysis and facilitates unknown crosswalking systems: • Determines crosswalk path • Negotiates system nodes • Does negotiations without the need for a control data layer – but allows clients to specify a control data layer that must be utilized in the conversion process.
Common Crosswalking System Designs • Dumb-down crosswalking model • Converting data to its lowest common denominator. • Example: OAI’s initial use of Dublin Core as a transfer format.
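A rough sketch of what "dumbing down" costs: several distinct MARC fields collapse into one unqualified Dublin Core element, so the distinction between them cannot be recovered afterwards. The tag table below is illustrative only and is not the official Library of Congress crosswalk; the sample values are invented.

```python
# Illustrative tag table only (not an official MARC-to-Dublin Core crosswalk).
MARC_TO_DC = {
    "100": "creator",   # main entry, personal name
    "700": "creator",   # added entry, personal name
    "245": "title",
    "600": "subject",   # personal name used as subject
    "650": "subject",   # topical subject
}


def dumb_down(fields):
    """Collapse (tag, value) pairs into unqualified Dublin Core elements."""
    dc = {}
    for tag, value in fields:
        element = MARC_TO_DC.get(tag)
        if element is not None:
            dc.setdefault(element, []).append(value)
    return dc


SAMPLE_FIELDS = [
    ("100", "Author, Main"),
    ("700", "Author, Added"),
    ("650", "Metadata"),
]
```

After `dumb_down(SAMPLE_FIELDS)`, both names sit under `creator`, and nothing in the output says which one was the main entry. That loss is the price of a lowest-common-denominator format.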
Metadata Tools • Perl-based: • MARC::Record, MARC::Charset, MARC::XML • http://marcpm.sourceforge.net/ • Non-Perl-based: • MarcEdit – includes XML API and crosswalks for a number of common metadata schemas. • http://oregonstate.edu/~reeset/marcedit/html/ • LC’s MARC tools: http://www.loc.gov/marc/marctools.html
MarcEdit • MarcEdit 5.0 • System Requirements: • Using .NET Framework: • Windows 98, ME, NT, 2000, XP, 2003 • .NET 1.1 Framework • MDAC 2.7 runtimes • Using MONO Framework (hopefully available after August 2005): • Windows 2000+, Linux and Mac OS X • MONO system requirements
MarcEdit: crosswalking design • Utilizes a modified version of Ockerbloom’s type-broker system. • Unlike Ockerbloom’s system, which brokers transformations between known schemas, MarcEdit utilizes MARCXML as a control schema to facilitate translation.
MarcEdit: crosswalking design • Ockerbloom model: the broker system continues translating until the desired format is reached. Example: MODS => Dublin Core => MARCXML => MARC
[Diagram: broker system model, with crosswalks routed through a type broker]
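The "determines crosswalk path" step of a broker can be pictured as a search over the crosswalks the broker knows about. The sketch below uses a plain breadth-first search; it is not Ockerbloom's implementation, and the `KNOWN_CROSSWALKS` table is an assumption chosen to reproduce the example chain from the previous slide.

```python
from collections import deque

# Assumed set of one-way crosswalks the broker knows about.
KNOWN_CROSSWALKS = {
    "MODS": ["Dublin Core"],
    "Dublin Core": ["MARCXML"],
    "MARCXML": ["MARC"],
    "EAD": ["MARCXML"],
}


def find_path(source, target):
    """Breadth-first search for the shortest chain of known crosswalks."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in KNOWN_CROSSWALKS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of known crosswalks reaches the target
```

With this table, `find_path("MODS", "MARC")` passes through Dublin Core and MARCXML, so the record is translated three times. Each extra hop is another chance to lose data, which is the motivation for the control-schema design described next.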
MarcEdit: crosswalking design • MarcEdit model: • So long as a schema has been mapped to MARCXML, any combination of schemas can be crosswalked. This means that no more than two transformations will ever take place. Example: MODS => MARCXML => EAD
MarcEdit: crosswalking design • MarcEdit Crosswalk model • Pro • Crosswalks need not be directly related to each other • Requires the crosswalk author to have specific knowledge of only one schema • Con • Each supported schema must be mapped to MARCXML.
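A small sketch of the control-schema idea, under stated assumptions: the names `HUB`, `MAPPED_TO_HUB`, `plan` and `crosswalk_counts` are invented for illustration, and the sketch simplifies by assuming every listed schema has a mapping both to and from MARCXML. It is not MarcEdit's code.

```python
HUB = "MARCXML"

# Assumption: each schema here has crosswalks both to and from the hub.
MAPPED_TO_HUB = {"MODS", "EAD", "Dublin Core", "FGDC"}


def plan(source, target):
    """Return the transformation steps; never more than two."""
    for schema in (source, target):
        if schema != HUB and schema not in MAPPED_TO_HUB:
            raise ValueError("no MARCXML mapping for " + schema)
    if source == target:
        return []
    steps = []
    if source != HUB:
        steps.append((source, HUB))
    if target != HUB:
        steps.append((HUB, target))
    return steps


def crosswalk_counts(n_schemas):
    """Compare one-way crosswalks needed for n schemas (hub included)."""
    direct = n_schemas * (n_schemas - 1)   # every ordered pair mapped directly
    via_hub = 2 * (n_schemas - 1)          # to and from the hub only
    return direct, via_hub
```

`plan("MODS", "EAD")` returns the two steps MODS to MARCXML and MARCXML to EAD. For the five schemas above (hub included), `crosswalk_counts(5)` gives 20 direct crosswalks against 8 through the hub, which is the practical payoff of the "Pro" column.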
MarcEdit: Crosswalks for everyone • Example Crosswalks: • MODS => MARC • MODS => FGDC • MODS => Dublin Core • EAD => MODS • EAD=>HTML
MarcEdit: Crosswalks for everyone • What’s MarcEdit doing? • Facilitates the crosswalk by: • Performing character translations (MARC-8 to UTF-8) • Facilitating interaction between binary and XML formats.
MarcEdit: Simplify Editing MODS records • New to MarcEdit 5.0 is the ability to edit MODS records in the MarcEditor as if it were a regular MARC file. • Allows catalogers unfamiliar with MODS to work with MODS data in a familiar form. • Will automatically translate new fields into MODS equivalents. • Will only translate MODS equivalent field data.
MarcEdit: Simplify Editing MODS records • How it works: • MODS file is translated to MARCXML • MARCXML is translated to MarcEdit Mnemonic format. • Internally, the MarcEditor tracks format and changes. • On save, the mnemonic file is translated back into MODS, with edited and added fields being translated to their appropriate MODS mappings.
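The second step above (MARCXML to mnemonic) can be approximated in a few lines. This is a sketch, not MarcEdit's own code: it assumes the common mnemonic conventions of `=TAG`, two spaces, indicators with a backslash for a blank, and `$` before each subfield code, and it ignores details such as escaping special characters. The sample record is invented.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/MARC21/slim"}  # MARCXML namespace

SAMPLE_MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">sample0001</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Mediating among diverse data formats /</subfield>
    <subfield code="c">John Ockerbloom.</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">College students</subfield>
    <subfield code="z">Iowa</subfield>
  </datafield>
</record>"""


def marcxml_to_mnemonic(xml_text):
    """Render one MARCXML <record> as mnemonic-style text lines."""
    record = ET.fromstring(xml_text)
    lines = []
    leader = record.find("m:leader", NS)
    if leader is not None:
        lines.append("=LDR  " + (leader.text or ""))
    for field in record.findall("m:controlfield", NS):
        lines.append("=%s  %s" % (field.get("tag"), field.text or ""))
    for field in record.findall("m:datafield", NS):
        # Blank indicators are written as a backslash.
        indicators = "".join(
            (field.get(attr) or " ").replace(" ", "\\")
            for attr in ("ind1", "ind2")
        )
        subfields = "".join(
            "$%s%s" % (sub.get("code"), sub.text or "")
            for sub in field.findall("m:subfield", NS)
        )
        lines.append("=%s  %s%s" % (field.get("tag"), indicators, subfields))
    return "\n".join(lines)
```

Running `marcxml_to_mnemonic(SAMPLE_MARCXML)` produces one line per field, which is the form a cataloger edits. The save step would reverse this and then apply the MARCXML to MODS crosswalk.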
MarcEdit: Making OAI Simple • New to MarcEdit 5.0 is a Metadata Harvester. • From within the MarcEditor, users can harvest DC, oai_marc or MODS records directly into MARC. • http://oregonstate.edu/~reeset/presentations/ala/summer2005/harvest.wmv
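For readers who have not seen an OAI-PMH exchange, a harvest boils down to an HTTP request with a `verb` and a `metadataPrefix`, followed by parsing the returned XML. The sketch below builds a `ListRecords` request URL and reads identifiers and titles from a response. It is not how MarcEdit's harvester is implemented; the base URL, the sample response, and the function names are made up, and resumption tokens are left out.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def titles_from_response(xml_text):
    """Return (identifier, [titles]) for each record in a response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iter(OAI + "record"):
        identifier = record.findtext(OAI + "header/" + OAI + "identifier")
        titles = [t.text for t in record.iter(DC + "title")]
        results.append((identifier, titles))
    return results


# Invented response, trimmed to the elements the sketch reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Student activities</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""
```

Once the Dublin Core elements are extracted, the remaining work is the same crosswalk problem as before: DC to MARCXML, then MARCXML to MARC.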
Bibliography • Ockerbloom, John. Mediating among diverse data formats. School of Computer Science, Carnegie Mellon University. CMU-CS-98-102. January 1998. http://tom.library.upenn.edu/pubs/thesis/ • Digitization Costs & Funding. Digital Library Workshop. Oct. 2003. http://www.dlib.indiana.edu/workshops/alioct03/costs.ppt