
Cataloging the Internet for the Sake of the User


Presentation Transcript


  1. Cataloging the Internet for the Sake of the User Nancy Florio Ils506 Summer 2009 Chang Suk Kim, Ph.D

  2. Historical Context • Historically, libraries and librarians have always sought to collect, organize, preserve, and disseminate the collective knowledge of the world. • The Diamond Sutra holds the distinction of being the earliest dated printed book. • It wasn’t until around 1440, when Johannes Gutenberg invented a printing press with metal moveable type, that printed material became accessible, although not affordable, to the masses. • Primarily for economic reasons, book ownership is a fairly recent luxury; previously, people sought out books from their local academic or public libraries.

  3. Keeping Track of the Collection • Initially, libraries kept track of their collections by creating lists in books. This system allowed only one point of access when looking for information. • In 1901 the Library of Congress began to sell printed cards of bibliographic information to libraries. • Placing these bibliographic cards in a card catalog allowed users to find information through multiple access points.

  4. Developing a Sense of Order • The Dewey Decimal and Library of Congress Classification systems, the Anglo-American Cataloguing Rules (AACR2), and the MARC format enabled users to access information locally that adhered to consistent standards. • With the development of MARC records, card catalogs in individual libraries gave way to electronic cataloging systems, further streamlining the information-seeking process. • In 1967, the Ohio College Library Center (OCLC), a consortium of 54 colleges in Ohio, formed a network to share their collections cataloged on MARC records. • OCLC has since changed its name to the Online Computer Library Center, and membership is now open to all libraries through WorldCat.

  5. Bibliographic Standards • Today, anyone using a computer with Internet access is able to locate items based on any one of the 138 million bibliographic records held by WorldCat. • “The ability for the OCLC to operate as a collective requires consistent standards for precise communication”. O’Daniel (1999) • The Library of Congress developed these consistent standards by creating authority headings for subject, name, title, and name/title combinations, and permits users to harvest this data at http://authorities.loc.gov/. • This controlled vocabulary ensures consistency and accuracy in bibliographic records and increases the likelihood that users will find the information they seek; a toy illustration follows.
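The value of a controlled vocabulary can be shown with a toy example. The sketch below is not Library of Congress authority data; the headings are invented for illustration. It simply shows how an authority file collapses variant forms of a name into a single authorized heading so that every record shares the same access point.

    # Toy authority file: variant forms of a name map to one authorized heading.
    # Headings here are illustrative examples, not actual LC authority records.
    AUTHORITY_FILE = {
        "Twain, Mark": "Twain, Mark, 1835-1910",
        "Clemens, Samuel Langhorne": "Twain, Mark, 1835-1910",
        "Clemens, Samuel": "Twain, Mark, 1835-1910",
    }

    def authorized_heading(name: str) -> str:
        """Return the authorized form of a heading, or the input if uncontrolled."""
        return AUTHORITY_FILE.get(name, name)

    # Every variant resolves to the same heading, so records filed by a
    # cataloger and searches entered by a user meet at one access point.
    for variant in ("Twain, Mark", "Clemens, Samuel Langhorne"):
        print(variant, "->", authorized_heading(variant))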

  6. Resources on the World Wide Web • Over the past quarter century there has been a staggering increase in the availability of information in digital format on the World Wide Web. • The faculty and students at the School of Information Management and Systems at the University of California, Berkeley researched how much new information is created each year. • They estimated that in the year 2000 there were 20 to 50 terabytes of information on the Surface Web. In three short years that figure had more than tripled, from roughly 50 terabytes to 167 terabytes. (Lyman et al., 2003)

  7. The Deep Web • BrightPlanet estimates that the Deep Web holds 400 to 450 times the information of the Surface Web, placing the Deep Web at between 66,800 and 91,850 terabytes. • As a point of reference, it would require 10 terabytes to contain all the information in the entire print collections of the U.S. Library of Congress. • In 2003, the Web therefore held the equivalent of between 6,600 and 10,000 times the entire LOC print collections, most of it on the Deep Web.

  8. Why Catalog When There is Google? • One may wonder why there is a need to catalog the information on the Web when people are able to access information through generalized search engines such as Google. • In fact, “Google's mission is to organize the world's information and make it universally accessible and useful”. • Looking at the statistics, it appears Google is doing just that. Currently, Google’s “millions of servers process about 1 petabyte of user-generated data every hour. It conducts hundreds of millions of searches every day”. Vogelstein (2009)

  9. Why Catalog When There is Google? • While Google processes the equivalent of roughly one hundred times the print collections of the Library of Congress every hour, volume and speed cannot be construed as true indicators of the accuracy of information or the relevancy of items retrieved. • Traditional search engines like Google operate by building indices from crawled Web pages (a minimal crawler sketch follows). • Pages need to be static for this form of indexing to work. Content on the Deep Web cannot be indexed this way because the majority of it does not exist in static form. Bergman (2001)
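To make the limitation concrete, the following minimal Python sketch mimics how a traditional crawler builds its index: fetch a static page, store it, follow its links. The seed URL and page limit are placeholders. Anything that exists only as the response to a database query submitted through a form never appears as a linked static page, so a crawler of this kind never reaches it.

    # Minimal crawler sketch: fetches static pages and follows <a href> links.
    # Content generated only in response to form-based database queries is
    # never linked as a static page, so it never enters this index.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=10):
        index, queue, seen = {}, [seed_url], set()
        while queue and len(index) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue
            index[url] = html              # a real engine would tokenize and rank
            parser = LinkParser()
            parser.feed(html)
            queue.extend(urljoin(url, link) for link in parser.links)
        return index

    # crawl("https://example.org/")  # only static, linked pages end up indexed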

  10. Did You Know? • The Deep Web contains nearly 550 billion individual documents, compared to the one billion documents of the Surface Web. • More than 200,000 Deep Web sites presently exist. On average, Deep Web sites receive fifty per cent greater monthly traffic than surface sites and are more highly linked to than surface sites. • The Deep Web is not well known to the Internet-searching public. • The Deep Web is the largest growing category of new information on the Internet.

  11. More Deep Web Facts • Deep Web sites tend to be narrower, with deeper content, than conventional surface sites. • The total quality content of the Deep Web is 1,000 to 2,000 times greater than that of the Surface Web. • More than half of Deep Web content resides in topic-specific databases. • A full ninety-five per cent of the Deep Web is publicly accessible information, not subject to fees or subscriptions.

  12. Electronic Resources • Today’s academic libraries recognize the importance of providing access to high-quality, peer-reviewed journals found on the Deep Web for their students and faculty. • Experimental data collected by ARL libraries over the last decade indicate that the portion of the library materials budget spent on electronic resources is growing rapidly, from an estimated 3.6% in 1992-93 to 10.56% in 1998-99. Kyrillidou (2000) • Colleges and universities now spend over 10% of their materials budgets on subscription databases and journals. • Much of this money is wasted: only 2% of college students begin their information searches on their library websites, and 90% are dissatisfied with what they find when a general search engine directs them to electronic resources in their library.

  13. “Ten Things Google has Found to be True” ~ #3 • “Fast is better than slow. Google believes in instant gratification. You want answers and you want them right now. Who are we to argue?” - Google.com • Who indeed. In fact, the OCLC survey documented students’ perception that “search engines deliver better quality and quantity of information than librarian-assisted searching - and at greater speed”.

  14. Information Seeking Behaviors • Carol C. Kuhlthau at Rutgers has conducted extensive research on information-seeking behavior from the user’s perspective. • “The bibliographic paradigm is based on certainty and order, whereas users’ problems are characterized by uncertainty and confusion”. • This uncertainty frequently causes feelings of anxiety in the student, and the search for broad, generalized information compounds this state.

  15. Information Seeking Behaviors (cont’d) • Kuhlthau’s research demonstrated that “a clear formulation reflecting a personal view of the information encountered is the turning point of the search… confusion decreases, and interest intensifies”. • It is exactly because of this uncertainty and confusion that the generalized search engines students prefer are detrimental to the information-seeking process. Kuhlthau (1991)

  16. Why General Searches Don’t Work • John Lubans’ research at Duke University on freshmen’s Internet use found that only 7% rank their ability to use the Web as “best”, while 23% see their use as “better”, and 29% rank their abilities as “good”. • Clearly these students could benefit from the organization and cataloging of Internet resources. • The search engines they love, which enable users to keyword-pattern-match against billions of web pages, are very good at finding distinctive phrases. • Unfortunately, problems arise when students are in the beginning stages of discovery and are unsure exactly what they are looking for.

  17. Reasons to Catalog the Net • Subject-organized URL lists on websites are cumbersome and labor-intensive to develop and update. Porter and Bayard (1999) • Many sites are of questionable authority. • Complaints from librarians about Web resources invariably center on the difficulties in organizing and archiving them, their inconsistent quality, and disappearing URLs that result in the dreaded “404” message.

  18. More Reasons to Catalog the Net • While subject-organized lists are not the same as cataloged Internet resources, the Michigan Electronic Library (MEL), the Internet Public Library (IPL), and INFOMINE are a few excellent examples that illustrate the ability of individual librarians to organize small portions of the Internet. Oder (1998) • The main problems these individual indexes or catalogs face, though, are their size and the frequent redundancy of items. • A federated catalog like WorldCat would alleviate the redundancy problem.

  19. OCLC’s Internet Cataloging Project • In 1991, OCLC’s Internet Cataloging Project began to address the need for a consortial approach to the problem, with 30 catalogers spearheading the movement to catalog Internet resources. • Findings at the end of the project demonstrated that, overall, MARC/AACR2 cataloging supported the cataloging of Internet resources, that a method to link the record to the resource was beneficial for the user, and that instructional materials should be developed. Jul (1996) • A manual was published by OCLC in response to these findings, library system vendors embraced the 856 MARC field for electronic location and access, and the Web OPAC was introduced. • By 1998 over 18,000 Internet resources had been cataloged by over 5,000 OCLC libraries.

  20. Difficulties in Cataloging the Web • In spite of this initial success, there are many inherent difficulties in cataloging Internet resources: the “lack of universally accepted controlled vocabulary; the lack of stability due to frequency of change of data; and the lack of quality standards”. O’Daniel (1999) • Cataloging electronic serials presents its own problems: difficulty locating prior issues for descriptive information, publishers frequently updating digital information (including titles), subtle differences between HTML and ASCII versions, variations between paper and digital versions, and the frequent absence of a table of contents. Hawkins (1997) • Websites move or even disappear. To address this problem, OCLC’s Office of Research developed persistent URLs, or PURLs. A PURL is an alias for a resource; if the underlying URL changes for any reason, the mapping need only be updated once on the PURL server. A sketch of such a redirect table appears below.
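The mechanics of a PURL can be sketched in a few lines. The resolver below is hypothetical (the path and target URL are invented); it shows the essential idea that catalog records carry the stable PURL while a single server-side table tracks wherever the resource currently lives.

    # Hypothetical PURL-style resolver: one table maps a persistent path to the
    # current location. When a resource moves, only this table changes; the
    # PURL printed in catalog records stays the same.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PURL_TABLE = {
        "/net/example-resource": "https://new.example.edu/moved/resource.html",
    }

    class PurlHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            target = PURL_TABLE.get(self.path)
            if target:
                self.send_response(302)          # redirect to the current URL
                self.send_header("Location", target)
            else:
                self.send_response(404)
            self.end_headers()

    # HTTPServer(("localhost", 8080), PurlHandler).serve_forever()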

  21. Dublin Core to the Rescue • The development of the Dublin Core Metadata Element Set standardized the metadata found on websites and streamlined the complexity of the MARC format. • Dublin Core uses 15 predetermined but flexible elements. • Metatags are created and embedded within the documents. • MARCit software was developed specifically to pull the metadata from the title and URL fields and place it in the 245 and 856 MARC fields; a rough sketch of the idea follows. • Although cataloging is a time-consuming and often cost-prohibitive activity, it is only through such efforts to mesh Internet resources with local systems that Internet cataloging will succeed.
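The sketch below is not MARCit itself, only a rough illustration of the same idea: read the Dublin Core metatags embedded in a page, then copy the title into a MARC 245 field and the location into an 856 field. The sample HTML and the plain-text field formatting are simplifications.

    # Illustrative sketch: extract embedded Dublin Core metatags from HTML and
    # map them to simplified MARC 245 (title) and 856 (electronic location)
    # fields. Output is plain text, not real MARC transmission format.
    from html.parser import HTMLParser

    class DCMetaParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.dc = {}

        def handle_starttag(self, tag, attrs):
            if tag != "meta":
                return
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()
            if name.startswith("dc."):
                self.dc[name[3:]] = attrs.get("content", "")

    def to_marc_fields(url, html):
        parser = DCMetaParser()
        parser.feed(html)
        title = parser.dc.get("title", "[Untitled web resource]")
        return [
            f"245 00 $a {title}",   # title statement
            f"856 40 $u {url}",     # electronic location and access
        ]

    sample = ('<html><head><meta name="DC.Title" content="Cataloging the Internet">'
              '</head><body></body></html>')
    for field in to_marc_fields("http://example.org/cataloging", sample):
        print(field)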

  22. Academic Library Projects • “Subject gateways” may appeal more to academic libraries, whose mission is to support the academic curricula and research needs of their students and faculty. Oder (1998) • INFOMINE, developed in 1994 at the University of California, Riverside, combines 100,000 librarian-created links with 75,000 web-crawler links. • It uses modified LC subject headings and “focused, automatic Internet crawling as well as automatic text extraction and metadata creation functions to assist our experts in content creation and users in searching” (http://infomine.ucr.edu/).

  23. Cross-searching vs. Local Indexing • Searching across multiple databases at one time frequently causes slow search speeds. • Because the databases have not been indexed locally, each query must be executed on the fly against every remote source. This is the same critical flaw behind general search engines’ inability to access information that resides on the Deep Web. • In this instance, library search engines do have access to the subscription databases; the search is simply too cumbersome because nothing has been indexed in advance. • At the 1999 digital libraries conference in Santa Fe, several inherent problems with Z39.50 cross-searching were identified: the tools are too slow, results are limited, and searches frequently time out. Rochkind (2007) • The contrast is sketched below.
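The following sketch contrasts the two approaches with stand-in data rather than real Z39.50 targets: a cross-search fans each query out to every remote database and waits on the slowest response, while a locally built index answers the same query with a single lookup.

    # Conceptual sketch of cross-searching vs. local indexing.
    # The "databases" are stand-in dictionaries, not real remote targets.
    import time

    REMOTE_DATABASES = {
        "provider_a": {"deep web": ["Article on hidden databases"]},
        "provider_b": {"deep web": ["Report on search coverage"]},
    }

    def cross_search(query):
        # Every user query hits every remote source, paying latency each time.
        results = []
        for db in REMOTE_DATABASES.values():
            time.sleep(0.5)             # stand-in for network latency and timeouts
            results.extend(db.get(query, []))
        return results

    LOCAL_INDEX = {}                    # built ahead of time by harvesting metadata
    for db in REMOTE_DATABASES.values():
        for term, records in db.items():
            LOCAL_INDEX.setdefault(term, []).extend(records)

    def local_search(query):
        return LOCAL_INDEX.get(query, [])   # one fast lookup, no remote round trips

    print(cross_search("deep web"))
    print(local_search("deep web"))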

  24. Open Archives Initiative – Protocol for Metadata Harvesting • OAI-PMH is used to harvest metadata from content providers so that it can be indexed locally, a practice commonly referred to as “local indexing”; a minimal harvesting sketch follows. • This local indexing is the approach used by Google Scholar and is what makes partnering with Google so appealing for academic libraries. • A major roadblock for many academic libraries that want to index locally is a lack of cooperation and permissions from their content providers.
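A minimal OAI-PMH harvester can be sketched as follows. The repository base URL is a placeholder, but the request parameters (verb=ListRecords, metadataPrefix=oai_dc) and the resumptionToken flow are part of the protocol itself; the harvested Dublin Core titles would feed a local index.

    # Minimal OAI-PMH harvesting sketch. BASE_URL is a hypothetical repository
    # that exposes Dublin Core ("oai_dc") records; the loop issues ListRecords
    # requests and follows resumptionTokens until no records remain.
    from urllib.parse import urlencode
    from urllib.request import urlopen
    import xml.etree.ElementTree as ET

    BASE_URL = "https://repository.example.edu/oai"   # placeholder provider
    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    def harvest_titles():
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        while True:
            xml = urlopen(BASE_URL + "?" + urlencode(params), timeout=30).read()
            root = ET.fromstring(xml)
            for record in root.iter(OAI + "record"):
                title = record.find(".//" + DC + "title")
                if title is not None:
                    yield title.text            # feed these into the local index
            token = root.find(".//" + OAI + "resumptionToken")
            if token is None or not (token.text or "").strip():
                break
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    # for title in harvest_titles():
    #     print(title)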

  25. Open Archives Initiative – Protocol for Metadata Harvesting (cont’d) • Content providers are beginning to give Google and Google Scholar access to their metadata in the hope of favorable placement in search results. • EBSCOhost and Gale have followed suit and allowed Google and other web crawlers to index their metadata. • The problem with partnering with Google Scholar is that libraries still don’t know what Google has or has not indexed. “If libraries licensed full text or metadata by cooperating with the content provider, they could know exactly what they have in their index and be assured of its completeness”. Rochkind (2007)

  26. Conclusion • In the course of a few short decades most libraries have become, or will become, digital libraries on one scale or another. • Google and Google Scholar, with their locally indexed metadata, have set the stage for academic and public libraries alike to adopt technology that allows their users to access information quickly, efficiently, and with verified authority. • Today’s student wants to access information in a seamless environment and in a timely fashion. • It will be through the cooperative efforts of libraries, librarians, catalogers, and content providers, using OAI-PMH harvesting to transfer licensed content from providers to indexers, that today’s patrons will be able to access information in the time and format they require.

  27. References • Baruth, B. E. Is your catalog big enough to handle the web? American Libraries, 31(7), 56-60. • Bergman, M. K. (2001). The deep web: Surfing hidden value. Retrieved July 12, 2009, from http://brightplanet.com/ • College students’ perceptions of libraries and information resources. (2006). OCLC. Retrieved from http://www.oclc.org/reports/pdfs/studentperceptions_conclusion.pdf • Dowling, T. P. (1997). The world wide web meets the OPAC: OhioLINK central catalog web interface. ALCTS Newsletter, 8(2), A-D. • Glaser, R. Internet sites in the library catalog: Where are we now? Alabama Librarian, 56(2), 10-12. • Hawkins, L. (1997). Serials published on the world wide web: Cataloging problems and decisions. The Serials Librarian, 33(1-2), 123. • Jul, E. (1996). Why catalog internet resources? Computers in Libraries, 16(1), 8. • Kuhlthau, C. C. (1991). Inside the search process: Information seeking from the user's perspective. Journal of the American Society for Information Science, 42(5), 361. • Kyrillidou, M. (2000). Research library spending on electronic scholarly information is on the rise. The Association of Research Libraries. Retrieved July 11, 2009, from http://tinyurl.com/l5qm3c • Lubans, J. (1998, April). How first-year university students use and regard Internet resources. Retrieved July 10, 2009, from http://www.lubans.org/docs/1styear/firstyear.html

  28. References (cont’d) • Nichols releases MARCit for cataloging internet resources. (1998). Information Today, 15(3), 51. • OCLC Internet Cataloging Colloquium, & OCLC. (1996). Proceedings of the OCLC internet cataloging colloquium. • OCLC, Weitz, J., & Greene, R. O. (1998). Cataloging electronic resources: OCLC-MARC coding guidelines. • O'Daniel, H. B. (1999). Cataloguing the internet. Retrieved July 12, 2009, from http://associates.ucr.edu/heather399.htm • Oder, N. (1998). Cataloging the net: Can we do it? Library Journal, 123(16), 47-51. • Porter, G. M., & Bayard, L. (1999). Including web sites in the online catalog: Implications for cataloging, collection development, and access. The Journal of Academic Librarianship, 25(5), 390-394. • Rochkind, J. (2007). (Meta)search like Google. Library Journal, 132(3), 28-30. • Shafer, K. E. (1997). Scorpion helps catalog the web: Research project at OCLC. Bulletin of the American Society for Information Science, 24(1), 28-29. • Taylor, A., & Clemson, P. (1998). Access to networked documents: Catalogs? Search engines? Both? Retrieved July 11, 2009, from http://worldcat.org/arcviewer/1/OCC/2003/07/21/0000003889/viewer/file9.html • Vine, R. (2004). Going beyond Google for faster and smarter web searching. Teacher Librarian, 32(1), 19. • Vogelstein, F. (2009, August). Keyword: Monopoly. Wired, 58-65.
