Beyond Data Mining: Delivering the Next Generation of Services from Library Data


Lynn Silipigni Connaway, Ph.D., Senior Research Scientist, OCLC. Timothy J. Dickey, Ph.D., Post-Doctoral Researcher, OCLC.

WorldCat as an “Aggregate Collection”
  • Data Mining and Analysis of WorldCat:
  • “…affords high-level perspective on historical patterns, suggests future trends, and supplies useful intelligence with which to inform decision making.”
  • Lavoie, B.F., Connaway, L. S., & O’Neill, E. T. (2007). Mapping WorldCat’s digital landscape. Library Resources & Technical Services, 51, 106-115 at 107.
WorldCat: July 2008

Manifestations (records): 108,828,533

Works: 84,096,107

Total holdings: 1,292,763,300

Digital Items: 3,182,550

Institutions: 69,000

Physical Items: ~1.2 billion
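
The counts above support two simple derived ratios. The sketch below is illustrative arithmetic only: the ratios are not stated on the slide, and the variable names are chosen here for readability.

```python
# Figures copied from the "WorldCat: July 2008" slide.
manifestations = 108_828_533   # bibliographic records
works = 84_096_107
total_holdings = 1_292_763_300

# Derived ratios (not stated in the presentation).
holdings_per_record = total_holdings / manifestations
records_per_work = manifestations / works

print(f"Average holdings per record: {holdings_per_record:.2f}")
print(f"Average records per work:    {records_per_work:.2f}")
```

On these figures a record is held by roughly a dozen institutions on average, and a work is represented by somewhat more than one record.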

Global Origins of WorldCat Materials

[Figure not preserved in this transcript; the only surviving label is “Rest of World”.]

  • Materials with non-US origins: 57.9 million (55%)
    • Top 5:
      • Germany: 10.0 million
      • UK: 8.8 million
      • France: 4.2 million
      • Netherlands: 2.9 million
      • Canada: 2.9 million
  • Content languages: 478
    • 49% of WorldCat is non-English
    • Top 5 non-English:
      • German: 12 million
      • French: 6.1 million
      • Spanish: 3.5 million
      • Dutch: 2.6 million
      • Japanese: 2.4 million
  • Non-English metadata language: 28 million (66 languages)
    • Top 5:
      • German: 11 million
      • Dutch: 5.0 million
      • Swedish: 1.9 million
      • French: 1.8 million
      • Finnish: 0.7 million

WorldCat as a Decision-Making Resource
  • Collection management
    • Cooperative collection development
    • Comparative collection analysis
    • Collection assessment
    • Mass digitization
    • Off-site storage
    • Preservation
WorldCat as a Decision-Making Resource
  • Services
    • Virtual reference
    • Recommender services
    • Social networking
  • Systems
    • Precision
WorldCat as a Decision-Making Resource
  • Three Areas of Data Mining Research:
  • OCLC WorldMap
  • Audience Level
  • Publisher Name Server
OCLC WorldMap™: Objectives
  • Geographically represent WorldCat data
    • Titles published in each country
    • Holdings for titles published in each country
    • Languages represented for titles published in each country
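
The three per-country measures listed above amount to a group-by over bibliographic records. A minimal sketch follows, assuming each record carries a country of publication, a language, and a holdings count. The sample rows and the function name `summarize_by_country` are hypothetical and are not drawn from WorldCat or from the WorldMap prototype.

```python
from collections import defaultdict

# Hypothetical sample records: (country of publication, language, holdings count).
records = [
    ("Germany", "German", 12),
    ("Germany", "English", 3),
    ("France", "French", 7),
    ("Germany", "German", 5),
]

def summarize_by_country(rows):
    """Return {country: {"titles": int, "holdings": int, "languages": set}}."""
    summary = defaultdict(lambda: {"titles": 0, "holdings": 0, "languages": set()})
    for country, language, holdings in rows:
        entry = summary[country]
        entry["titles"] += 1            # one record counted as one title
        entry["holdings"] += holdings   # total holdings for that country's titles
        entry["languages"].add(language)
    return dict(summary)

by_country = summarize_by_country(records)
for country, stats in sorted(by_country.items()):
    print(country, stats["titles"], stats["holdings"], sorted(stats["languages"]))
```

A map display would then shade or label each country from this summary; that rendering step is outside the sketch.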
OCLC WorldMap™: Objectives
  • Geographically represent data from UNESCO, ARL, and NCES for each country
    • Number of
      • Libraries
      • Library volumes
      • Certified/degreed librarians
      • Registered library users
      • Library expenditures
      • Cultural heritage institutions (museums and archives)
      • Publishers
OCLC WorldMap™: Objectives
  • Research prototype
    • Support OCLC data mining research
      • Visually display data for review and analysis
      • Internal use
        • Sales and marketing
      • External use
        • Library collection assessment and comparison
        • Data may be processed AT A GLANCE
    • Complement the AAU/ARL Global Resources Network project
      • Project of the Council on Library and Information Resources (CLIR)
Audience Level: Rationale and Objectives

Holdings represent selection decisions by librarians … implies there are more than 1 billion individual selection decisions in the WorldCat holdings file

  • Selections serve the interests of a library’s target community …
  • Associate community (audience level) to library profiles - e.g., ARL, non-ARL academic, public, K-12 school …


  • Thus we can infer materials’ audience level from holdings patterns, which in turn can support:
  • Collection management
  • Readers’ advisory services
  • Reference services
  • Information retrieval
“FRBRizing” Audience Level Results
  • Calculate Audience Level for each Manifestation
  • Aggregate weighted holdings for Work
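
The two steps above can be sketched as a weighted average. This is not OCLC's published algorithm: the slides name the library types but give no numbers, so the weights in `TYPE_WEIGHT`, the 0-to-1 scale, and both function names are assumptions made only for illustration.

```python
# Hypothetical weights per library type (higher = more specialized audience).
# These numbers are invented for the example, not taken from the presentation.
TYPE_WEIGHT = {"k12": 0.1, "public": 0.3, "academic": 0.6, "arl": 0.9}

def manifestation_level(holdings_by_type):
    """Holdings-weighted mean of the library-type weights for one manifestation."""
    total = sum(holdings_by_type.values())
    if total == 0:
        return None
    weighted = sum(TYPE_WEIGHT[t] * n for t, n in holdings_by_type.items())
    return weighted / total

def work_level(manifestations):
    """Aggregate manifestation levels into a work level, weighted by holdings."""
    held = [m for m in manifestations if sum(m.values()) > 0]
    total = sum(sum(m.values()) for m in held)
    if total == 0:
        return None
    weighted = sum(manifestation_level(m) * sum(m.values()) for m in held)
    return weighted / total

popular_edition = {"public": 80, "k12": 20}
scholarly_edition = {"arl": 10, "academic": 10}
print(manifestation_level(popular_edition))
print(manifestation_level(scholarly_edition))
print(work_level([popular_edition, scholarly_edition]))
```

Weighting by holdings means that a widely held popular edition pulls the work-level score toward the general-audience end even when a scholarly edition also exists.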
Evaluating the OCLC Audience Level
  • Random sample of 30 Zoology books, all audience levels
  • Human subjects
    • Ranked books “in increasing order of difficulty”
  • Strong statistical correlation between human subjects’ ranking and programmatic ranking
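
The slide does not say which statistic was used. One common choice for comparing two rankings is Spearman's rank correlation, sketched below for rankings without ties. The two sample rankings are invented and are not the study's data.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same items, no ties."""
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must have the same length")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items")
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical ranks for six books: human judgment vs. programmatic audience level.
human = [1, 2, 3, 4, 5, 6]
program = [2, 1, 3, 4, 6, 5]
rho = spearman_rho(human, program)
print(f"rho = {rho:.3f}")
```

A value near 1 indicates that the two orderings largely agree; a value near -1 indicates that they are reversed.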
Publisher Name Server: Research Objectives
  • Resolve, in support of data mining and WorldCat data quality:
    • ISBN prefixes to publisher name
    • Variant publisher names to a preferred form
  • Complement Collection Analysis Service
    • Librarians
    • Publishers
  • Capture and profile attributes of individual publishers
    • Location(s)
    • Language(s) of materials published
    • Genre(s)/format(s)
    • Dominant subject domain(s)
    • Parent company and subsidiaries
Publisher Name Server: Methodology
  • Programmatically cluster publishers’ records using ISBN prefixes
    • Data clustering (The Free Dictionary)
      • "The science of extracting useful information from large data sets or databases"
      • Classification of similar objects into different groups
      • Partitioning of a data set into subsets (clusters)
        • Data in each subset (ideally) share some common trait
  • Hand parse the entities and resolve ISBN prefixes
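
The programmatic clustering step can be sketched as grouping records by the leading digits of their ISBNs. The prefix table, the sample titles, and the ISBN-like strings below are assumptions for illustration. Check digits are not validated, and only 10-digit forms are handled.

```python
from collections import defaultdict

# Illustrative prefix table (group + registrant digits, hyphens removed).
# Entries are assumptions for this example, not data from the presentation.
PREFIX_TO_PUBLISHER = {
    "019": "Oxford University Press",
    "014": "Penguin Books",
    "0201": "Addison-Wesley Publishing Company",
}

def normalize_isbn(isbn):
    """Strip hyphens and spaces so prefixes can be compared as plain digits."""
    return isbn.replace("-", "").replace(" ", "").upper()

def match_prefix(isbn):
    """Return the longest known prefix that the ISBN starts with, or None."""
    digits = normalize_isbn(isbn)
    for prefix in sorted(PREFIX_TO_PUBLISHER, key=len, reverse=True):
        if digits.startswith(prefix):
            return prefix
    return None

def cluster_by_prefix(records):
    """Group (title, isbn) pairs by matched prefix; unmatched go under None."""
    clusters = defaultdict(list)
    for title, isbn in records:
        clusters[match_prefix(isbn)].append(title)
    return dict(clusters)

# Placeholder records; the ISBN strings are not real ISBNs.
sample = [
    ("Title A", "0-19-000000-0"),
    ("Title B", "0-14-000000-0"),
    ("Title C", "0 19 111111 1"),
    ("Title D", "3-16-000000-0"),
    ("Title E", "0-201-00000-0"),
]
clusters = cluster_by_prefix(sample)
for prefix, titles in clusters.items():
    print(prefix, PREFIX_TO_PUBLISHER.get(prefix, "unresolved"), titles)
```

Records left under `None` are the ones that would fall to the hand-parsing step.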
Publisher Name Server: Database
  • 1750 publishing entities
  • Relational database, preserving hierarchical relationships
  • Begins with high-occurrence entities:
    • “Top 10” lists (USA, UK, Canada, Australia, Germany, France, Netherlands, Japan, Italy, China, Russia, Spain, Finland, Australia, Taiwan, New Zealand)
    • Top 10 university presses
    • Mergers and acquisitions, last 8 years
Publisher Name Server: Data Captured
  • Database fields:
    • Publisher Name, Preferred Form
    • Source of Preferred Form
    • Former Names
    • Variant Forms
    • ISBN Prefixes
    • HQ City
    • HQ Country
    • Other Cities
    • Conspectus Subjects
  • Data sources:
    • U.S. Library of Congress, Name Authority File, 110 (Corporate Name) field
    • Books In Print Online (R.R. Bowker)
    • The International ISBN Registry (K.G. Saur)
    • Publishers Weekly Online
    • Hoover’s Handbook Online
    • Standard and Poor’s Corporate Descriptions
    • The Directory of Corporate Affiliations (DIALOG)
    • Company websites
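
The captured fields map naturally onto a record type, and the variant and former names give a lookup from any observed string to the preferred form. The sketch below is one possible shape, not the actual database schema. The class name, the `normalize` rule, and the sample variants are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class PublisherEntity:
    """One publishing entity, with fields named after the slide's list."""
    preferred_name: str
    source_of_preferred_form: str = ""
    former_names: list[str] = field(default_factory=list)
    variant_forms: list[str] = field(default_factory=list)
    isbn_prefixes: list[str] = field(default_factory=list)
    hq_city: str = ""
    hq_country: str = ""
    other_cities: list[str] = field(default_factory=list)
    conspectus_subjects: list[str] = field(default_factory=list)

def normalize(name):
    """Assumed matching rule: lowercase, drop periods/commas, collapse spaces."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())

def build_index(entities):
    """Map every normalized name string to its entity's preferred form."""
    index = {}
    for entity in entities:
        names = [entity.preferred_name, *entity.former_names, *entity.variant_forms]
        for name in names:
            index[normalize(name)] = entity.preferred_name
    return index

# Hypothetical entry; the variant strings are invented for the example.
oup = PublisherEntity(
    preferred_name="Oxford University Press",
    variant_forms=["Oxford Univ. Press", "O.U.P."],
)
index = build_index([oup])
print(index.get(normalize("OXFORD UNIV PRESS,")))
print(index.get(normalize("Unknown House")))
```

Strings that miss the index would be candidates for manual review rather than automatic assignment.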


Publisher Name Server: Database
  • More than 56,000 separate strings mapped to 1750 entities
  • 8.5 million OCLC records
    • 22% of these are Library of Congress records
  • ~490 million holdings
  • Hierarchical relationships maintained
Entity-Parsing in a World of Mergers and Acquisitions

[Organization chart not preserved in this transcript; the entity names are listed in extraction order, without their parent/child links.]

  • Pearson PLC
  • Penguin Books
  • Pearson Canada
  • Pearson Technology Group
  • Allen Lane
  • Ladybird Books
  • Riverhead Books
  • Copp Clark
  • Adobe Press
  • Cisco Press
  • Puffin Books
  • Putnam Books
  • Berkley Publishing Group
  • Pearson Education, Inc.
  • Addison-Wesley Publishing Company
  • Allyn and Bacon
  • Prentice-Hall, Inc.
  • Dominie Press
  • Benjamin/Cummings Publishing Company
  • Scott, Foresman and Company
  • HarperCollins Educational Publishers
  • Longmans, Green, and Co.
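
Keeping parent links between entities lets record counts roll up to the top-level owner. The sketch below assumes a simple child-to-parent table. The specific links and all record counts are illustrative, since the original chart's structure is not preserved here.

```python
# Child -> parent links, assumed for illustration only.
PARENT = {
    "Penguin Books": "Pearson PLC",
    "Puffin Books": "Penguin Books",
    "Pearson Education, Inc.": "Pearson PLC",
    "Addison-Wesley Publishing Company": "Pearson Education, Inc.",
}

# Hypothetical record counts per entity (not figures from the presentation).
RECORDS = {
    "Pearson PLC": 10,
    "Penguin Books": 40,
    "Puffin Books": 15,
    "Pearson Education, Inc.": 20,
    "Addison-Wesley Publishing Company": 25,
}

def top_level(entity):
    """Follow parent links until an entity with no parent is reached."""
    seen = set()
    while entity in PARENT:
        if entity in seen:
            raise ValueError(f"cycle in hierarchy at {entity!r}")
        seen.add(entity)
        entity = PARENT[entity]
    return entity

def aggregate(records):
    """Sum record counts under each top-level entity."""
    totals = {}
    for entity, count in records.items():
        root = top_level(entity)
        totals[root] = totals.get(root, 0) + count
    return totals

print(top_level("Puffin Books"))
print(aggregate(RECORDS))
```

Recording a merger or acquisition then means adding or changing one parent link, after which the aggregate follows automatically.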

Publisher Profiles
  • Oxford University Press
    • 119,237 records with ISBNs mapped to 210,095 records (0.19% of WorldCat)
  • Pearson PLC
    • Includes 14 subsidiaries and acquisitions
    • Aggregate: 291,433 records (0.27% of WorldCat)
Publisher Profiles – Top Languages
  • Oxford Univ. Press:
    • English: 96.74%
    • Latin: 0.51%
    • German: 0.39%
    • Chinese: 0.39%
    • French: 0.37%
    • Spanish: 0.28%
    • Afrikaans: 0.14%
    • Middle English: 0.13%
    • Malay: 0.09%
    • Swahili: 0.09%
  • Pearson PLC:
    • English: 95.27%
    • Spanish: 1.43%
    • German: 1.33%
    • French: 0.60%
    • Dutch: 0.55%
    • Latin: 0.26%
    • Malay: 0.06%
    • Ancient Greek: 0.05%
    • Portuguese: 0.05%
    • Italian: 0.04%
Publisher Profiles – Conspectus Divisions
  • Oxford Univ. Press:
    • Language/Literature: 27.12%
    • History: 11.92%
    • Music: 9.78%
    • Philosophy/Religion: 9.55%
    • Business/Economics: 6.15%
    • Medicine: 4.36%
    • Law: 3.85%
    • Sociology: 3.75%
    • Political Science: 3.58%
    • Biology: 2.60%
  • Pearson PLC:
    • Language/Literature: 18.67%
    • Business/Economics: 13.30%
    • Computer Science: 9.42%
    • Engineering: 8.04%
    • History: 7.59%
    • Mathematics: 6.04%
    • Education: 5.64%
    • Sociology: 4.18%
    • Philosophy/Religion: 3.81%
    • Physical Sciences: 2.75%
Publisher Profiles – Conspectus Categories
  • Oxford Univ. Press:
    • English literature: 10.66%
    • English language: 5.86%
    • Instrumental music: 3.48%
    • Vocal music: 3.09%
    • Literature on music: 2.26%
    • History – Britain: 1.82%
    • Economic history: 1.38%
    • American lit.: 1.35%
    • History – S. Asia: 1.30%
    • General history: 1.29%
  • Pearson PLC:
    • English language: 7.74%
    • Business admin.: 4.62%
    • English literature: 3.63%
    • Economics: 2.94%
    • Comp. programming: 2.39%
    • Electrical engineering: 2.24%
    • Early childhood ed.: 2.05%
    • Computer software: 1.88%
    • U.S. federal law: 1.80%
    • Computer Science: 1.54%
Publisher Profiles – Conspectus Subjects
  • Oxford Univ. Press:
    • English – modern: 5.57%
    • English lit – prose: 2.51%
    • English lit – 19th c.: 2.23%
    • Juvenile lit.: 1.06%
    • English lit – poetry: 1.03%
    • English lit – collections: 0.80%
    • Biographies: 0.76%
    • English lit – 1900-1960: 0.74%
    • Shakespeare: 0.68%
    • Sacred choruses: 0.66%
  • Pearson PLC:
    • English – modern: 7.68%
    • Management: 2.53%
    • Programming: 1.74%
    • Arithmetic: 1.09%
    • Economic theory: 1.06%
    • Marketing: 1.06%
    • General algebra: 1.04%
    • Accounting: 0.97%
    • Juvenile lit.: 0.93%
    • English lit – 19th c.: 0.89%
Projected MARC coding of Authorized Forms
  • 710 Added Entry – Corporate Name
    • Add $4 for publisher name
    • Add $2 NAF where preferred form matches existing authority record (44% of current PNAF)
  • 752 Added Entry – Hierarchical Place Name
    • Add $2 FAST where place of publication matches FAST geographical subject headings
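
The projected coding can be illustrated by rendering the two fields as display strings. This is a sketch under stated assumptions, not a statement of how OCLC would encode them: the relator code `pbl` (publisher) and the source codes `naf` and `fast` are assumed values for $4 and $2, the indicator choices are assumptions, and blank indicators are shown as `#`. The helper names are hypothetical and no MARC library is used.

```python
def format_field(tag, indicators, subfields):
    """Render a field as 'TAG II $x value ...' (display form, not ISO 2709)."""
    body = " ".join(f"${code} {value}" for code, value in subfields)
    return f"{tag} {indicators} {body}"

def publisher_710(preferred_name, in_naf):
    """710 added entry for a publisher; $2 only when an authority record matches."""
    subfields = [("a", preferred_name), ("4", "pbl")]  # 'pbl' assumed for publisher
    if in_naf:
        subfields.append(("2", "naf"))                 # assumed source code
    return format_field("710", "2#", subfields)        # assumed indicators

def place_752(country, city, in_fast):
    """752 hierarchical place name; $2 only when FAST has a matching heading."""
    subfields = [("a", country), ("d", city)]
    if in_fast:
        subfields.append(("2", "fast"))                # assumed source code
    return format_field("752", "##", subfields)

print(publisher_710("Oxford University Press", in_naf=True))
print(place_752("England", "Oxford", in_fast=True))
```

Making the $2 subfield conditional mirrors the slide's wording, where the source is recorded only when the preferred form or place matches an existing authority record.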
Future Research
  • Further data mining
    • Profile aspects of publication output
    • Deeper scaling into WorldCat (beyond ISBN)
  • Plan for long-term maintenance
    • ISBN-13 compliance
    • File expansion to cover ongoing merger and acquisition activity
Thank You!
  • Questions and Discussion
  • Lynn Silipigni Connaway
  • Timothy J. Dickey