
The Interplay of Big Data, WorldCat, and Dewey

Big Data, Linked Data: Classification Research at the Junction. 24th ASIS&T SIG/CR Classification Research Workshop, 2 November 2013. Rebecca Green, OCLC (greenre@oclc.org); Michael Panzer, OCLC (panzerm@oclc.org).



Presentation Transcript

  1. Big Data, Linked Data: Classification Research at the Junction • 24th ASIS&T SIG/CR Classification Research Workshop, 2 November 2013 • Rebecca Green, OCLC, greenre@oclc.org • Michael Panzer, OCLC, panzerm@oclc.org • The Interplay of Big Data, WorldCat, and Dewey

  2. Roadmap • Setting the stage • Big data • WorldCat as big data • Literary warrant and the DDC • “Classification analytics” • Classified works • Access points • Trending topics • Structure of discipline

  3. Setting the stage

  4. 3 V’s of big data • Volume • Terabytes (1000^4 bytes), petabytes (1000^5 bytes), exabytes (1000^6 bytes), . . . • Number of transactions vs. number of bytes • My big data is not your big data

  5. 3 V’s of big data – cont. • Variety • Sources, perspectives, standards • Structured vs. unstructured data • Semantically related datasets • Velocity • Data creation • Data analysis

  6. WorldCat as big data • Variety • Records in MARC Bibliographic Format • Records in MARC Holdings Format • Records in MARC Authority Format (e.g., LCSH, FAST, BISAC, MeSH, VIAF) • Vendor records • WorldCat knowledge base • Institutional registry data • Institution-specific acquisitions, circulation, ILL data

  7. WorldCat as big data • Volume • Bibliographic data: over 300 million records • Holdings data: over 2 billion records • Authority data • LCSH: 26.4 million headings • VIAF: 24.2 million clusters; 21 million links between records

  8. Literary warrant and the DDC • DDC editorial rules call for literary warrant to be taken into account for: • Expansions (i.e., development of new classes) • Reductions (i.e., discontinuing entire classes) • Form of name used in class descriptions • Order in which topics are listed in multitopic caption • Creation of and choice of examples in add instructions • Indexability of topics (print; WebDewey) • Form of name for index entries

  9. “Classification analytics”

  10. Classified works • Periodic profiles of distribution of classified works across the classification to identify: • Expansions: Disciplines/subjects with sufficient literary warrant • Reductions: Classes with insufficient literary warrant

  11. Classified works: Expansion warranted (1) 306.44 Language Including pragmatics Class here anthropological linguistics, ethnolinguistics, sociolinguistics 306.446 Bilingualism and multilingualism 306.449 Language planning and policy 306.449 4–.449 9 Specific continents, countries, localities in modern world Add to base number 306.449 notation 4–9 from Table 2, e.g., language policy of India 306.44954

  12. Classified works: Expansion warranted (2) • Records retrieved in WorldCat searches on dd:306.44* not dd:(306.440* or 306.446* or 306.449*)
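The search on slide 12 can be approximated offline as a prefix filter. The following is a minimal sketch, assuming that `*` in the `dd:` index behaves as a simple prefix wildcard and that the class numbers have already been exported as plain strings. The function name and the sample numbers are hypothetical and are not taken from the presentation.

```python
def classed_in_standing_room(ddc_numbers, base, developed_prefixes):
    """Keep numbers under `base` that are not under any already-developed subdivision."""
    return [
        n for n in ddc_numbers
        if n.startswith(base)
        and not any(n.startswith(p) for p in developed_prefixes)
    ]

# Hypothetical sample of class numbers pulled from bibliographic records.
sample = ["306.44", "306.4403", "306.446", "306.44954", "306.44", "305.8"]

# Mirrors: dd:306.44* not dd:(306.440* or 306.446* or 306.449*)
undeveloped = classed_in_standing_room(
    sample, "306.44", ("306.440", "306.446", "306.449")
)
print(len(undeveloped))  # number of works classed at 306.44 without further subdivision
```

A large count here, relative to the developed subdivisions, is the kind of signal the slides describe as literary warrant for an expansion.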

  13. Classified works: Reduction warranted (1) 006.33 *Knowledge-based systems . . . 006.336 *Programming for knowledge-based systems 006.336 3 *Programming languages for knowledge-based systems 006.337 Programming for knowledge-based systems for specific types of computers, for specific operating systems, for specific user interfaces 006.338 *Programs for knowledge-based systems

  14. Classified works: Reduction warranted (2) • Records retrieved in WorldCat searches for disjunction of DDC class number and standard subdivisions of number • Duplicates not filtered out of search results for 006.33 • Duplicates filtered out of all other search results
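The tally on slide 14 can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it treats "the class number or its standard subdivisions" as "the number itself, or the number followed by a 0", and it filters duplicates for every class by collecting record IDs in a set (the slide notes that duplicates were not filtered for 006.33). The record IDs and class numbers in the sample are invented.

```python
def tally_by_class(records, classes):
    """Count distinct records per class, matching the class number itself
    or the class number extended by a standard subdivision (leading 0)."""
    ids_by_class = {c: set() for c in classes}
    for record_id, number in records:
        for c in classes:
            if number == c or number.startswith(c + "0"):
                ids_by_class[c].add(record_id)  # set membership removes duplicates
    return {c: len(ids) for c, ids in ids_by_class.items()}

# Hypothetical (record ID, DDC number) pairs; r1 and r3 appear twice on purpose.
records = [
    ("r1", "006.33"), ("r1", "006.33"), ("r2", "006.3301"),
    ("r3", "006.336"), ("r4", "006.3363"), ("r5", "006.33605"),
    ("r3", "006.336"),
]
classes = ["006.33", "006.336", "006.3363", "006.337", "006.338"]
counts = tally_by_class(records, classes)
print(counts)
```

Classes whose counts stay at or near zero are the candidates for reduction in the sense used on slide 10.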

  15. Access points • Analysis of subject heading data in DDC categorized content to identify: • Areas where expansions of new classes should be considered • Additional access points / mappings for DDC classes • Additional topics to be added to class description

  16. Access points: Standing room topics and literary warrant • DDC class 004.678 *Internet Including extranets, virtual private networks Class here World Wide Web • LCSH: 010 ## $a sh 97006102; 150 ## $a Extranets (Computer networks); 450 ## $a Virtual private networks (Computer networks) • dd: 004.678* and (hl: extranets w computer w networks) retrieves 69 records
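The subject-heading analysis described on slides 15 and 16 amounts to counting which headings co-occur with a given class number. A minimal sketch follows, assuming each record has been reduced to a DDC number plus a list of heading strings; the function name and the sample records are hypothetical, and the counts do not reproduce the 69 records reported on the slide.

```python
from collections import Counter

def heading_counts(records, class_prefix):
    """Count, per subject heading, the records classed under `class_prefix`."""
    counts = Counter()
    for ddc_number, headings in records:
        if ddc_number.startswith(class_prefix):
            counts.update(set(headings))  # count each heading once per record
    return counts

# Hypothetical records: (DDC number, subject headings).
sample_records = [
    ("004.678", ["Extranets (Computer networks)", "Internet"]),
    ("004.678", ["Extranets (Computer networks)"]),
    ("004.67801", ["World Wide Web"]),
    ("004.6", ["Sensor networks"]),
]
by_heading = heading_counts(sample_records, "004.678")
print(by_heading.most_common(3))
```

Headings that recur often under a class but are only in standing room (an "Including" note) are the ones the slides flag as possible expansions or additional access points.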

  17. Access points: Topics added to class description 004.6 *Interfacing and communications . . . Including sensor networks . . . 006.22 *Embedded computer systems [formerly 004.1] Class here microcontrollers For a specific aspect of embedded computer systems, see the aspect, e.g., systems analysis and design of embedded computer systems 004.21, wireless sensor networks 004.6, software for embedded systems 005.3

  18. Trending topics • My trending topics are not your trending topics • Twitter—sudden high-magnitude spike in activity • DDC—“quick” achievement of literary warrant threshold + plateaus at steady rate • Trending topic detection vs. new topic detection • Newly minted LCSHs • Chapter/paper titles • Conferences
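The DDC notion of a trending topic on slide 18 (a "quick" rise to the literary warrant threshold) can be illustrated by measuring how many years a topic takes to accumulate a threshold number of works. This is only a sketch: the slides give no numeric threshold, so the value below is an assumption, the yearly counts are invented, and the plateau condition is not modeled.

```python
def years_to_warrant(works_per_year, threshold):
    """Return the number of years between a topic's first work and the year
    its cumulative count reaches `threshold`, or None if it never does."""
    total = 0
    first_year = None
    for year in sorted(works_per_year):
        count = works_per_year[year]
        if count > 0 and first_year is None:
            first_year = year
        total += count
        if first_year is not None and total >= threshold:
            return year - first_year
    return None

# Hypothetical counts of newly cataloged works on one topic, by year.
works = {2010: 1, 2011: 4, 2012: 12, 2013: 15}
ASSUMED_THRESHOLD = 20  # illustrative value, not an editorial rule
print(years_to_warrant(works, ASSUMED_THRESHOLD))
```

A small result would mark a topic as trending in this sense, in contrast to the sudden spike that defines a trending topic on Twitter.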

  19. Trending topics: Newly minted LCSHs (1)

  20. Trending topics: Newly minted LCSHs (2)

  21. Trending topics: Conferences • Big data: 29th British National Conference on Databases • 1st Workshop on Architectures and Systems for Big Data • Workshop on big data • Big Data Analytics: First International Conference • The Semantic Web: Semantics and Big Data: 10th International Conference • 2012 workshop on Management of big data systems • 2nd Workshop on Research in the Large: Using App Stores, Wide Distribution Channels and Big Data in UbiComp Research • IEEE International Congress on Big Data • Big Data 2 Knowledge (Workshop)

  22. Trending topics: Chapter/paper titles • Welcome to the big data age • Big Brother and big data around the world • How to make sense of big data? • Business and social implications of big data • Big data and health care • How should big data abuses be addressed? • What is big data? • Does big-data equal big value? • Big-data technologies

  23. Trending topics: Newly minted LCSHs (3)

  24. (Non-)Trending topics: Newly minted LCSHs (4)

  25. Structure of discipline • Analysis of title data in DDC categorized content to identify facet structure of discipline • Retrieve bibliographic records from WorldCat for monographic literature • Isolate title data • Identify noun phrases in the titles • Use conceptual density measure of Agirre & Rigau • Disambiguate noun phrases • Identify appropriate generalizations
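The first steps of the pipeline on slide 25 (isolate title data, identify candidate phrases) can be sketched with a deliberately naive stand-in. The code below does not perform real noun phrase identification and does not implement the Agirre & Rigau conceptual density measure; it only splits titles at a small, assumed stopword list and counts adjacent word pairs. The stopword list and function names are hypothetical; the titles are taken from slide 22.

```python
import re
from collections import Counter

# Assumed, minimal stopword list; a real system would use a parser or tagger.
STOPWORDS = {"a", "an", "the", "to", "of", "and", "is", "what", "how"}

def content_runs(title):
    """Yield runs of consecutive non-stopword tokens from a title."""
    run = []
    for word in re.findall(r"[a-z]+", title.lower()):
        if word in STOPWORDS:
            if run:
                yield run
            run = []
        else:
            run.append(word)
    if run:
        yield run

def bigram_counts(titles):
    """Count adjacent word pairs inside each run as crude phrase candidates."""
    counts = Counter()
    for title in titles:
        for run in content_runs(title):
            counts.update(" ".join(pair) for pair in zip(run, run[1:]))
    return counts

titles = [
    "Welcome to the big data age",
    "Big data and health care",
    "What is big data?",
    "Big-data technologies",
]
phrase_counts = bigram_counts(titles)
print(phrase_counts.most_common(3))
```

The remaining steps on the slide (disambiguating the phrases and identifying appropriate generalizations) depend on a lexical resource and are outside what this sketch attempts.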

  26. The Interplay of Big Data, WorldCat, and Dewey That’s all, folks! -- Thank you = La fin -- Merci beaucoup

  27. (Non-)Trending topics: Newly minted LCSHs (5)
