
Beyond Data Mining: Delivering the Next Generation of Services from Library Data

Lynn Silipigni Connaway, Ph.D., Senior Research Scientist, OCLC. Timothy J. Dickey, Ph.D., Post-Doctoral Researcher, OCLC.


Presentation Transcript


  1. Beyond Data Mining: Delivering the Next Generation of Services from Library Data • Lynn Silipigni Connaway, Ph.D., Senior Research Scientist, OCLC • Timothy J. Dickey, Ph.D., Post-Doctoral Researcher, OCLC

  2. WorldCat as an “Aggregate Collection” • Data Mining and Analysis of WorldCat: • “…affords high-level perspective on historical patterns, suggests future trends, and supplies useful intelligence with which to inform decision making.” • Lavoie, B.F., Connaway, L. S., & O’Neill, E. T. (2007). Mapping WorldCat’s digital landscape. Library Resources & Technical Services, 51, 106-115 at 107.

  3. WorldCat: July 2008 • Manifestations (records): 108,828,533 • Works: 84,096,107 • Total holdings: 1,292,763,300 • Digital items: 3,182,550 • Institutions: 69,000 • Physical items: ~1.2 billion

  4. Global Origins of WorldCat Materials • US 28% • Germany 10% • UK 8% • France 4% • Canada 3% • Rest of World 27% • Unknown 17%

  5. Global Origins of WorldCat Materials • Materials with non-US origins: 57.9 million (55%) • Top 5: Germany 10.0 million; UK 8.8 million; France 4.2 million; Netherlands 2.9 million; Canada 2.9 million • Content languages: 478 • 49% of WorldCat is non-English • Top 5 non-English: German 12 million; French 6.1 million; Spanish 3.5 million; Dutch 2.6 million; Japanese 2.4 million • Non-English metadata language: 28 million (66 languages) • Top 5: German 11 million; Dutch 5.0 million; Swedish 1.9 million; French 1.8 million; Finnish 0.7 million

  6. WorldCat as a Decision-Making Resource • Collection management • Cooperative collection development • Comparative collection analysis • Collection assessment • Mass digitization • Off-site storage • Preservation

  7. WorldCat as a Decision-Making Resource • Services • Virtual reference • Recommender services • Social networking • Systems • Precision

  8. WorldCat as a Decision-Making Resource • Three Areas of Data Mining Research: • OCLC WorldMap • Audience Level • Publisher Name Server

  9. OCLC WorldMap

  10. OCLC WorldMap™: Objectives • Geographically represent WorldCat data • Titles published in each country • Holdings for titles published in each country • Languages represented for titles published in each country

  11. OCLC WorldMap™: Objectives • Geographically represent data from UNESCO, ARL, and NCES for each country • Number of • Libraries • Library volumes • Certified/degreed librarians • Registered library users • Library expenditures • Cultural heritage institutions (museums and archives) • Publishers

  12. OCLC WorldMap™: Objectives • Research prototype • Support OCLC data mining research • Visually display data for review and analysis • Internal use • Sales and marketing • External use • Library collection assessment and comparison • Data may be processed at a glance • Complement the AAU/ARL Global Resources Network project • Project of the Council on Library and Information Resources (CLIR)

  13. http://pubserv.oclc.org:12223/WorldMap/

  14. OCLC Audience Level

  15. Audience Level: Rationale and Objectives • Holdings represent selection decisions by librarians, which implies more than 1 billion individual selection decisions in the WorldCat holdings file • Selections serve the interests of a library’s target community • Associate a community (audience level) with library profiles, e.g., ARL, non-ARL academic, public, K-12 school … • Thus we can infer a material’s audience level from holdings patterns, which in turn can support: • Collection management • Readers’ advisory services • Reference services • Information retrieval

  16. Example Computation: Build Community

  17. “FRBRizing” Audience Level Results • Calculate Audience Level for each Manifestation • Aggregate weighted holdings for Work
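
One way to picture the two steps on this slide is a holdings-weighted average: each type of holding library carries a score, a manifestation's level is the mean score over its holdings, and the work's level weights each manifestation by its holdings count. The library types follow slide 15; the numeric scores, function names, and sample counts below are assumptions made for illustration and are not OCLC's published weights or method.

```python
# Hypothetical score per library type (low = general/juvenile, high = research).
# These values are illustrative assumptions, not OCLC's actual weights.
TYPE_SCORE = {"k12": 0.1, "public": 0.3, "academic": 0.6, "arl": 0.9}


def manifestation_level(holdings):
    """Holdings-weighted mean score for one manifestation.

    `holdings` maps a library type to the number of libraries of that
    type holding the manifestation.
    """
    total = sum(holdings.values())
    if total == 0:
        return None
    return sum(TYPE_SCORE[t] * n for t, n in holdings.items()) / total


def work_level(manifestations):
    """Aggregate manifestation levels into one work level, weighted by holdings."""
    weighted, total = 0.0, 0
    for holdings in manifestations:
        n = sum(holdings.values())
        level = manifestation_level(holdings)
        if level is not None:
            weighted += level * n
            total += n
    return weighted / total if total else None


# Two made-up manifestations of the same work.
editions = [
    {"public": 80, "k12": 20},    # illustrative trade edition
    {"arl": 30, "academic": 70},  # illustrative scholarly edition
]
print(work_level(editions))
```

Weighting by holdings means a widely held edition influences the work-level score more than a rarely held one, which is one plausible reading of "aggregate weighted holdings."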

  18. Evaluating the OCLC Audience Level • Random sample of 30 Zoology books, all audience levels • Human subjects • Ranked books “in increasing order of difficulty” • Strong statistical correlation between human subjects’ ranking and programmatic ranking
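
The slide does not name the statistic used. As a minimal sketch, assuming a rank correlation such as Spearman's rho, the comparison between a human ranking and a programmatic ranking could look like this; the two sample rankings are invented for illustration.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings that contain no ties."""
    if len(rank_a) != len(rank_b) or len(rank_a) < 2:
        raise ValueError("rankings must have the same length, at least 2")
    n = len(rank_a)
    # Sum of squared differences between paired ranks.
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Invented ranks for six books: 1 = easiest, 6 = most difficult.
human = [1, 2, 3, 4, 5, 6]
program = [2, 1, 3, 4, 6, 5]
rho = spearman_rho(human, program)
print(round(rho, 3))
```

A value near 1 means the two rankings largely agree; a value near -1 means they are reversed.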

  19. Evaluating the OCLC Audience Level

  20. http://audiencelevel.oclc.org/

  21. OCLC Publisher Name Server

  22. Publisher Name Server: Research Objectives • Resolve for data mining and quality of WorldCat • ISBN prefixes to publisher name • Variant publisher names to a preferred form • Complement Collection Analysis Service • Librarians • Publishers • Capture and profile attributes of individual publishers • Location(s) • Language(s) of materials published • Genre(s)/format(s) • Dominant subject domain(s) • Parent company and subsidiaries

  23. Publisher Name Server: Methodology • Programmatically cluster publishers’ records using ISBN prefixes • Data clustering (The Free Dictionary) • "The science of extracting useful information from large data sets or databases" • Classification of similar objects into different groups • Partitioning of a data set into subsets (clusters) • Data in each subset (ideally) share some common trait • Hand parse the entities and resolve ISBN prefixes
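
The programmatic step above can be sketched as a grouping of records by the publisher prefix of their ISBNs. This is a simplified illustration, not the project's code: the prefix table, the record identifiers, and the sample ISBN strings are assumptions, check digits are not validated, and a real implementation would need the full ISBN range tables to find where a prefix ends.

```python
from collections import defaultdict

# Illustrative prefix table: group and registrant digits with hyphens removed.
PREFIX_TO_PUBLISHER = {
    "019": "Oxford University Press",
    "013": "Prentice-Hall, Inc.",
    "0201": "Addison-Wesley Publishing Company",
}


def publisher_prefix(isbn):
    """Return the longest known prefix of an ISBN, or None if none matches."""
    digits = isbn.replace("-", "").replace(" ", "")
    for length in range(len(digits), 0, -1):
        if digits[:length] in PREFIX_TO_PUBLISHER:
            return digits[:length]
    return None


def cluster_by_prefix(records):
    """Group (record_id, isbn) pairs into clusters keyed by publisher prefix."""
    clusters = defaultdict(list)
    for record_id, isbn in records:
        clusters[publisher_prefix(isbn)].append(record_id)
    return dict(clusters)


# Sample records; identifiers and ISBN strings are for illustration only.
records = [
    ("rec1", "0-19-852663-6"),
    ("rec2", "0198526636"),
    ("rec3", "0-201-53082-1"),
    ("rec4", "9-99-999999-9"),  # no known prefix, so it lands under None
]
clusters = cluster_by_prefix(records)
print(clusters)
```

Records that fall under `None` are the ones left for the hand-parsing step.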

  24. Publisher Name Server: Database • 1750 publishing entities • Relational database, preserving hierarchical relationships • Begins with high-occurrence entities: • “Top 10” lists (USA, UK, Canada, Australia, Germany, France, Netherlands, Japan, Italy, China, Russia, Spain, Finland, Australia, Taiwan, New Zealand) • Top 10 university presses • Mergers and acquisitions, last 8 years

  25. Publisher Name Server: Data Captured • Database fields: Publisher Name, Preferred Form; Source of Preferred Form; Former Names; Variant Forms; ISBN Prefixes; HQ City; HQ Country; Other Cities; URL • Languages; Formats; Conspectus Subjects • Data sources: U.S. Library of Congress Name Authority File, 110 (Corporate Name) field; Books In Print Online (R.R. Bowker); The International ISBN Registry (K.G. Saur); Publishers Weekly Online; Hoover’s Handbook Online; Standard and Poor’s Corporate Descriptions; The Directory of Corporate Affiliations (DIALOG); Company websites; Data mining

  26. Publisher Name Server: Database • More than 56,000 separate strings mapped to 1750 entities • 8.5 million OCLC records • 22% of these are Library of Congress records • ~490 million holdings • Hierarchical relationships maintained

  27. Entity-Parsing in a World of Mergers and Acquisitions • Pearson PLC • Penguin Books • Pearson Canada • Pearson Technology Group • Allen Lane • Ladybird Books • Riverhead Books • Copp Clark • Adobe Press • Cisco Press • Puffin Books • Putnam Books • Berkley Publishing Group • Pearson Education, Inc. • Avery • Addison-Wesley Publishing Company • Allyn and Bacon • Prentice-Hall, Inc. • Dominie Press • Benjamin/Cummings Publishing Company • Scott, Foresman and Company • HarperCollins Educational Publishers • Longmans, Green, and Co.

  28. Publisher Profiles • Oxford University Press • 119,237 records with ISBNs mapped to 210,095 records (0.19% of WorldCat) • Pearson PLC • Includes 14 subsidiaries and acquisitions • Aggregate: 291,433 records (0.27% of WorldCat)
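
An aggregate figure like Pearson's depends on rolling each subsidiary's records up to its top-level parent. The sketch below shows one way to do that with child-to-parent links; the specific links and the per-entity counts are illustrative assumptions, since the transcript does not preserve the hierarchy. The percentage check at the end uses only figures from the slides (record counts from this slide, total WorldCat records from slide 3).

```python
# Child -> parent links. These particular links are assumed for illustration.
PARENT = {
    "Penguin Books": "Pearson PLC",
    "Puffin Books": "Penguin Books",
    "Pearson Education, Inc.": "Pearson PLC",
    "Addison-Wesley Publishing Company": "Pearson Education, Inc.",
}


def top_level(entity):
    """Follow parent links until reaching an entity with no parent."""
    seen = set()
    while entity in PARENT:
        if entity in seen:
            raise ValueError("cycle in publisher hierarchy")
        seen.add(entity)
        entity = PARENT[entity]
    return entity


def aggregate(record_counts):
    """Sum per-entity record counts under each top-level parent."""
    totals = {}
    for entity, count in record_counts.items():
        root = top_level(entity)
        totals[root] = totals.get(root, 0) + count
    return totals


def share_of_worldcat(records, worldcat_records=108_828_533):
    """Percentage of WorldCat records (July 2008 total from slide 3)."""
    return 100 * records / worldcat_records


# Made-up per-entity counts, used only to exercise the roll-up.
counts = {
    "Pearson PLC": 10,
    "Penguin Books": 40,
    "Puffin Books": 5,
    "Addison-Wesley Publishing Company": 25,
    "Oxford University Press": 30,
}
totals = aggregate(counts)
print(totals)
print(round(share_of_worldcat(210_095), 2), round(share_of_worldcat(291_433), 2))
```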

  29. Publisher Profiles – Top Languages • Oxford Univ. Press: English 96.74%; Latin 0.51%; German 0.39%; Chinese 0.39%; French 0.37%; Spanish 0.28%; Afrikaans 0.14%; Middle English 0.13%; Malay 0.09%; Swahili 0.09% • Pearson PLC: English 95.27%; Spanish 1.43%; German 1.33%; French 0.60%; Dutch 0.55%; Latin 0.26%; Malay 0.06%; Ancient Greek 0.05%; Portuguese 0.05%; Italian 0.04%

  30. Publisher Profiles – Conspectus Divisions • Oxford Univ. Press: Language/Literature 27.12%; History 11.92%; Music 9.78%; Philosophy/Religion 9.55%; Business/Economics 6.15%; Medicine 4.36%; Law 3.85%; Sociology 3.75%; Political Science 3.58%; Biology 2.60% • Pearson PLC: Language/Literature 18.67%; Business/Economics 13.30%; Computer Science 9.42%; Engineering 8.04%; History 7.59%; Mathematics 6.04%; Education 5.64%; Sociology 4.18%; Philosophy/Religion 3.81%; Physical Sciences 2.75%

  31. Publisher Profiles – Conspectus Categories • Oxford Univ. Press: English literature 10.66%; English language 5.86%; Instrumental music 3.48%; Vocal music 3.09%; Literature on music 2.26%; History – Britain 1.82%; Economic history 1.38%; American lit. 1.35%; History – S. Asia 1.30%; General history 1.29% • Pearson PLC: English language 7.74%; Business admin. 4.62%; English literature 3.63%; Economics 2.94%; Comp. programming 2.39%; Electrical engineering 2.24%; Early childhood ed. 2.05%; Computer software 1.88%; U.S. federal law 1.80%; Computer Science 1.54%

  32. Publisher Profiles – Conspectus Subjects • Oxford Univ. Press: English – modern 5.57%; English lit – prose 2.51%; English lit – 19th c. 2.23%; Juvenile lit. 1.06%; English lit – poetry 1.03%; English lit – collections 0.80%; Biographies 0.76%; English lit – 1900-1960 0.74%; Shakespeare 0.68%; Sacred choruses 0.66% • Pearson PLC: English – modern 7.68%; Management 2.53%; Programming 1.74%; Arithmetic 1.09%; Economic theory 1.06%; Marketing 1.06%; General algebra 1.04%; Accounting 0.97%; Juvenile lit. 0.93%; English lit – 19th c. 0.89%

  33. Projected MARC coding of Authorized Forms • 710 Added Entry – Corporate Name • Add $4 for publisher name • Add $2 NAF where preferred form matches existing authority record (44% of current PNAF) • 752 Added Entry – Hierarchical Place Name • Add $2 FAST where place of publication matches FAST geographical subject headings

  34. Future Research • Further data mining • Profile aspects of publication output • Deeper scaling into WorldCat (beyond ISBN) • Plan for long-term maintenance • ISBN-13 compliance • File expansion for ongoing merger and acquisition activity
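
For the ISBN-13 compliance item, the standard conversion from ISBN-10 is to prepend 978, drop the old check digit, and recompute the check digit with alternating weights of 1 and 3. The function below is a minimal sketch of that conversion; the function name is hypothetical, and nothing here is taken from the Publisher Name Server itself.

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to ISBN-13 using the 978 prefix.

    The ISBN-10 check digit is discarded without being validated.
    """
    chars = isbn10.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10 or not chars[:9].isdigit():
        raise ValueError("expected 10 characters with 9 leading digits")
    body = "978" + chars[:9]
    # Weights alternate 1, 3, 1, 3, ... across the first 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(body))
    check = (10 - total % 10) % 10
    return body + str(check)


print(isbn10_to_isbn13("0-306-40615-2"))
```

Because the 978 prefix sits in front of the existing group and registrant digits, prefix tables keyed on ISBN-10 prefixes would need a matching update.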

  35. Thank You! • Questions and Discussion • Lynn Silipigni Connaway connawal@oclc.org • Timothy J. Dickey dickeyt@oclc.org
