
Mining for Digital Resources: Identifying and Characterizing Digital Materials in WorldCat

Mining for Digital Resources: Identifying and Characterizing Digital Materials in WorldCat. Brian Lavoie, Lynn Silipigni Connaway, Ed O’Neill. ACRL 12th National Conference, Minneapolis, MN, April 9, 2005.


Presentation Transcript


  1. Mining for Digital Resources: Identifying and Characterizing Digital Materials in WorldCat • Brian Lavoie, Lynn Silipigni Connaway, Ed O’Neill • ACRL 12th National Conference, Minneapolis, MN, April 9, 2005

  2. More information about the OCLC Research Data Mining activity is available online: http://www.oclc.org/research/projects/mining/

  3. Rising Digital Tide • Equivalent of 5 exabytes of new information created in 2002; 92 percent stored on magnetic or optical media (Lyman and Varian) • Rush to digitize: • Cultural artifacts (images, audio, video, text) • Published content (books, journals, databases) • Communication (listservs, blogs, chat rooms) • Government information (reports, data, forms, records) • Survey of Academic Libraries: • Average expenditure in 2003 on digital resources: $250,000 (8 percent increase) • 40 percent of respondents intend to reduce spending on print resources in order to increase spending on digital resources

  4. Purpose of Study • Focused questions … • Identify digital resources in WorldCat • Bibliographic criteria for algorithmic identification • Characterize digital materials: • Cataloging activity; material types; holdings patterns • … But also broader questions … • Explore ways to use information in bibliographic records to generate new views of the catalog • “Large scale experiments with existing catalog records to see what can be done with legacy data” (Roy Tennant, Library Journal)

  5. Data Sources • WorldCat: world’s largest and most comprehensive bibliographic database • > 50,000 libraries worldwide use and contribute to WorldCat • Copy of WorldCat from July 2004: • ~53 million records • Copy of WorldCat holdings file from July 2004: • ~950 million holdings • Caveats: • No presumption that all (or even most) digital materials are cataloged in WorldCat • Focus on cataloging practice and experimentation with bibliographic data

  6. Identifying Digital Materials • “Standard” MARC21 criteria: • Type of Record: computer file [LDR/6 = m] • Form of Item: electronic [008/23 or 29 = s] • General Materials Designation: electronic resource [245 $h] • Other criteria: • Physical Description: electronic resource [007/0 = c] • Electronic Location and Access [856 2nd ind. = 0, no $3] • Additional Materials/Form of Material: computer file/electronic resource [006/0 = m] • Reproduction Note: electronic reproduction [533 $a]
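The three "standard" criteria on this slide can be sketched as a small check. This is only an illustration, not the algorithm the authors ran over WorldCat: the `Record` class and its field names are hypothetical stand-ins for a parsed MARC21 record, and the sketch assumes a record counts as digital when it meets at least one of the three criteria. It also tests both 008 positions (23 and 29) on every record, whereas in MARC21 the applicable position depends on the material type.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical, simplified stand-in for a MARC21 bibliographic record."""
    leader: str = ""   # the 24-character record leader (LDR)
    f008: str = ""     # the 40-character 008 fixed-length field
    gmd: str = ""      # contents of 245 $h, if present


def char_at(text: str, pos: int) -> str:
    """Return the character at a 0-based position, or '' if the field is too short."""
    return text[pos] if len(text) > pos else ""


def meets_standard_criteria(rec: Record) -> bool:
    """True if the record meets at least one of the slide's standard criteria."""
    # Type of Record: computer file [LDR/6 = m]
    type_is_computer_file = char_at(rec.leader, 6) == "m"
    # Form of Item: electronic [008/23 or 29 = s]
    form_is_electronic = "s" in (char_at(rec.f008, 23), char_at(rec.f008, 29))
    # General Materials Designation: electronic resource [245 $h]
    gmd_is_electronic = "electronic resource" in rec.gmd.lower()
    return type_is_computer_file or form_is_electronic or gmd_is_electronic


# Made-up example records, for illustration only.
print_book = Record(leader="00000nam a2200000 a 4500", f008=" " * 40)
data_file = Record(leader="00000nmm a2200000 a 4500", f008=" " * 40)
e_book = Record(leader="00000nam a2200000 a 4500",
                f008=" " * 23 + "s" + " " * 16,
                gmd="[electronic resource]")

for name, rec in [("print book", print_book), ("data file", data_file), ("e-book", e_book)]:
    print(name, meets_standard_criteria(rec))
```

Treating the criteria as alternatives (a logical OR) favors recall; requiring all three at once would miss records such as e-books whose Type of Record is coded as language material.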

  7. Analysis of “Other Criteria” • Analyzed records that did NOT meet any of the standard criteria, but DID meet at least one of the other criteria:

     Criterion            Recall      Precision
     007/0 = c            Very High   Low
     856 2nd ind. = 0     High        Low
     006/0 = m            Medium      Low
     533 $a               Low         High

     • Cataloging issues: • Accompanying materials • Separate record vs. combined record • Mis-codings • Opted for conservative strategy of using only standard criteria • Wrote algorithm for automatic scanning of WorldCat

  8. The WorldCat Digital Bucket • WorldCat: ~53 million records • Digital: ~750,000 records (~1.5 percent)

  9. Dynamics • Earliest Digital Record (lowest OCLC #): • #1617882: entered on September 11, 1975 • American Antiquarian Society • Data file on tape reel • Latest Digital Record (highest OCLC #): • #55794312: entered on July 1, 2004 • Mississippi State University • Master’s thesis in PDF format • Rate of Growth: January 2004 – July 2004 • Net increase of 1.8 million WorldCat records • Net increase of 61,000 records describing digital materials: ~3 percent of total increase
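The percentages quoted on the slides can be rechecked with a few lines of arithmetic. The inputs below are the rounded figures from the slides, so the results are approximate; the variable names are illustrative only.

```python
# Figures quoted on the slides (all approximate, as of July 2004).
worldcat_records = 53_000_000   # total WorldCat records
digital_records = 750_000       # records identified as digital
net_new_records = 1_800_000     # net increase, January 2004 to July 2004
net_new_digital = 61_000        # net increase in digital records, same period

share_of_database = digital_records / worldcat_records
share_of_growth = net_new_digital / net_new_records

print(f"Digital share of WorldCat: {share_of_database:.1%}")
print(f"Digital share of Jan-Jul 2004 growth: {share_of_growth:.1%}")
```

With these rounded inputs, the digital share of the whole database comes out near 1.4 percent (the slide rounds to ~1.5), while the digital share of the January to July 2004 growth is about 3.4 percent, so digital records made up more than twice as large a share of new cataloging as of the existing database.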

  10. WorldCat Cataloging Activity for Digital Materials: Number of “Digital Records” Entered, by Year (’75-’04) • Contributed: 98% (WorldCat: 88%)

  11. Distribution of Digital Material Types in WorldCat (July 2004)

  12. Digital Material Types in WorldCat: 1985 and 2004 (Percent of Total)

      Material type       1985   2004
      Books                  -     43
      Computer Files        98     26
      Government Docs        1     14
      Serials                -      6
      Theses                 -      3
      Pamphlets              1      3
      Other                  -      5
      Total                100    100

  13. Digital (e-)Books: Additional Characteristics • Median Holdings: 1 (All books in WC: 3) • Uniquely Held: 65 percent (All books in WC: 32 percent) • Total Holdings: ~13 million (All books in WC: ~700 million) • Percent of Total Holdings Set By: • ARLs: 6 (All books in WC: 23) • Non-ARL academics: 71 (All books in WC: 44) • Publics: 13 (All books in WC: 24) • Digital books with at least one print equivalent cataloged in WorldCat: ~88,000 • Percent of digital books available online: 70 percent

  14. Looking Ahead … “Murky Buckets”? • Early view: format most important feature of digital materials • Implies one “digital bucket” • But as number and variety of digital materials expand … • Need for increasingly fine distinctions between buckets • “Online e-book” requires 3 filters to surface in search results: format (digital), means of access (network), material type (book) • “Murky Bucket Syndrome”: “We cannot entirely, unambiguously slice and dice [large bibliographic databases] because of historic data entry and cataloging practices … that were not oriented toward our new needs” (Lorcan Dempsey, quoted by Roy Tennant in Library Journal) • Particularly troublesome for digital materials: • Cataloging practices in flux; new types of digital resources
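The “online e-book” case can be sketched as three filters applied together. The `Facets` class, its field names, and the sample values are hypothetical; the sketch assumes each record has already been reduced to clean facet values, which is exactly the step the “murky bucket” problem makes hard in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Facets:
    """Hypothetical per-record facets; the field names are illustrative only."""
    title: str
    fmt: str            # e.g. "digital" or "print"
    access: str         # e.g. "network" or "physical carrier"
    material_type: str  # e.g. "book", "serial", "computer file"


def online_ebooks(records):
    """Keep only records that pass all three filters named on the slide."""
    return [
        r for r in records
        if r.fmt == "digital"            # filter 1: format
        and r.access == "network"        # filter 2: means of access
        and r.material_type == "book"    # filter 3: material type
    ]


# Made-up sample records, for illustration only.
sample = [
    Facets("Thesis in PDF", "digital", "network", "thesis"),
    Facets("Novel on CD-ROM", "digital", "physical carrier", "book"),
    Facets("Novel, web edition", "digital", "network", "book"),
    Facets("Novel, hardcover", "print", "physical carrier", "book"),
]

for record in online_ebooks(sample):
    print(record.title)
```

A single “digital” flag would return three of the four sample records; only the combination of all three facets isolates the one online e-book.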

  15. Conclusions • Identification and categorization of digital materials: • For now … need more work to identify consistent cataloging patterns in existing bibliographic records • And for the future … need clear, stable practices for cataloging digital materials • Benefits: • End users (resource discovery based on new views of the catalog) • Librarians (digitization priorities, collection analysis …) • “Processable catalogs”: • Make bibliographic data work harder!

  16. More information … • Paper forthcoming • Contacts: • lavoie@oclc.org • connawal@oclc.org • oneill@oclc.org • Presentation to be posted on OCLC Research Web site: • http://www.oclc.org/research/
