
Data, Data Everywhere: Making Sense of the Sea of User Data


Presentation Transcript


  1. Data, Data Everywhere: Making Sense of the Sea of User Data

  2. MaxData Project Carol Tenopir and Donald W. King Gayle Baker, UT Libraries Eleanor Read, UT Libraries Maribeth Manoff, UT Libraries David Nicholas, Ciber, University College London http://web.utk.edu/~tenopir/maxdata/index.htm

  3. MaxData “Maximizing Library Investments in Digital Collections Through Better Data Gathering and Analysis” Funded by Institute of Museum and Library Services (IMLS) 2005-2007

  4. Study Objectives • To compare different methods of data collection • To develop a model that compares costs and benefits to the library of collecting and analyzing data from various methods • To help libraries make the best use of data

  5. Study Teams • Surveys (UT and Ohio Libraries) • Library Data Reports (Vendor-provided and library collected) (UT Libraries) • Deep Log Analysis of raw journal usage data (Ciber and OhioLINK)

  6. A bit more about the surveys…

  7. Surveys

  8. Three Types of Questions • Demographic • Recollection • Critical (last) incident of reading

  9. Critical Incident Added to General Survey Questions • Specific (last incident of reading) • Includes all reading: electronic & print, library & personal • Detailed questions about last article read, e.g., purpose, value, time spent, format, how located, source • Last reading = random sample of readings • Allows detailed analysis

  10. What Surveys Answer that Logs Do Not • Non-library readings • Print as well as electronic readings • Purpose and value of readings • Outcomes of readings

  11. Surveys provide much useful data, but… • Surveys rely on memory and truthfulness • Response rates are falling • Surveys cost your users’ time • Surveys can only be done occasionally • Log reports and raw logs show usage

  12. Local Sources of Use Data • Local log data for databases • Vendor-supplied usage reports • Other sources of data

  13. Local Log Data: Database Use • Environment: Mixture of web-based and locally-loaded resources • Problem: Use data from vendors not available or not uniform • Solution: Log requests for databases from library’s database menu (1999- )
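
Concretely, the menu-logging approach might look like the minimal sketch below, written here in Python even though the next slide notes the project itself used Perl CGI scripts with MySQL; the script name, query parameter, file path and database mapping are all illustrative assumptions. Each menu link points at the logger with the target database in the query string; the logger records one line per click and then redirects the user to the real database.

    # Minimal CGI-style redirect logger (sketch only; the project used Perl CGI + MySQL).
    # A menu link such as /cgi-bin/dbredirect?db=WebOfScience hits this script, which
    # appends one tab-separated line per request and then redirects to the vendor URL.
    import os
    import sys
    import datetime
    from urllib.parse import parse_qs

    DATABASES = {  # hypothetical mapping from menu keys to vendor URLs
        "WebOfScience": "https://www.webofscience.com/",
    }

    def main():
        qs = parse_qs(os.environ.get("QUERY_STRING", ""))
        db = qs.get("db", ["unknown"])[0]
        line = "\t".join([
            datetime.datetime.now().isoformat(timespec="seconds"),
            os.environ.get("REMOTE_ADDR", "-"),  # kept for on/off-campus breakdowns later
            db,
        ])
        with open("/var/log/library/db_requests.log", "a") as logfile:  # path is illustrative
            logfile.write(line + "\n")
        target = DATABASES.get(db, "https://library.example.edu/databases")
        sys.stdout.write("Status: 302 Found\r\nLocation: " + target + "\r\n\r\n")

    if __name__ == "__main__":
        main()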

  14. Local Log Data: Process • MySQL and Perl CGI scripts • Log files compiled monthly • Process data with Excel and SAS • Extract, reformat, summarize, graph
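
A sketch of the monthly compile step under the same assumptions as above (a tab-separated request log with timestamp, client IP and database key per line): it rolls one month of requests up to a count per database and writes a CSV that Excel or SAS can then pick up. File names are illustrative.

    # Summarize one month of request logs into per-database counts (sketch).
    import csv
    from collections import Counter

    counts = Counter()
    with open("db_requests_2005-03.log") as logfile:          # hypothetical monthly file
        for line in logfile:
            timestamp, ip, db = line.rstrip("\n").split("\t")
            counts[db] += 1

    with open("db_requests_2005-03_summary.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["database", "requests"])
        for db, n in counts.most_common():
            writer.writerow([db, n])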

  15. Uses of Local Log Data • Subscription management • Number of simultaneous users • Pattern of use of a database over time • Continuation decisions • Cost per request • Services management • Use patterns by day, week or semester • Location of users (campus, off-campus, wireless)
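
Two of the management uses above come down to simple arithmetic once the monthly summaries exist; the sketch below uses made-up figures purely for illustration.

    # Cost per request (continuation decisions) and a day-of-week use pattern
    # (services management), using illustrative numbers only.
    from collections import Counter
    from datetime import datetime

    annual_cost = {"WebOfScience": 50000.00}      # hypothetical subscription cost
    annual_requests = {"WebOfScience": 12500}     # total from the compiled summaries

    for db in annual_cost:
        print(db, "cost per request:", round(annual_cost[db] / annual_requests[db], 2))

    weekday_counts = Counter()
    with open("db_requests_2005-03.log") as logfile:
        for line in logfile:
            timestamp, ip, db = line.rstrip("\n").split("\t")
            weekday_counts[datetime.fromisoformat(timestamp).strftime("%A")] += 1
    print(weekday_counts)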

  16. Local Log Data: Issues • Logs requests for access, not sessions • No detail on activity once in database • Undercounts: • Aggregators and full-text collections • Bookmarked access • Metasearch • Other sources of usage data supplement log data

  17. Vendor-Supplied Usage Reports • Little post-processing of vendor data until 2002 • Made available upon request • Special attention to “big ticket” items • Full-text • Integrate subscription info with vendor data

  18. Vendor-Supplied Usage Reports: Additional Processing • ARL Supplemental Statistics • Use data for electronic resources requested: • Number of logins (sessions) • Number of queries (searches) • Number of items requested • Fiscal year: July ‘04 – June ‘05

  19. Vendor Reports to Review • University of Tennessee • Reports from 28 of 45 vendors listed as compliant with Release 1 of the COUNTER Code of Practice • Reports from 26 other vendors

  20. The Challenge of Vendor-Supplied Use Reports • Request mode • Delivery • Format • Time period • Subscribed / titles used / all titles
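
One workable response to the format problem is to map each vendor's column headings onto the three ARL measures before totalling the fiscal year. The sketch below assumes the monthly figures have already been re-keyed into simple dictionaries, and the heading synonyms shown are examples rather than an exhaustive list.

    # Normalize differently labelled vendor columns to the three ARL measures
    # (sessions, searches, items requested) and total fiscal year July '04 - June '05.
    SYNONYMS = {
        "sessions": {"sessions", "logins", "log-ins"},
        "searches": {"searches", "queries"},
        "items":    {"items requested", "full-text requests", "ft article requests"},
    }

    def normalize(row):
        """Map one vendor row {heading: value} onto the three ARL measures."""
        return {measure: sum(v for k, v in row.items() if k.strip().lower() in names)
                for measure, names in SYNONYMS.items()}

    # Hypothetical monthly rows: (vendor, "YYYY-MM", {heading: count}).
    monthly = [
        ("Vendor A", "2004-07", {"Logins": 120, "Queries": 340, "Full-text requests": 210}),
        ("Vendor B", "2005-06", {"Sessions": 95, "Searches": 400, "Items requested": 180}),
    ]

    fiscal = {"sessions": 0, "searches": 0, "items": 0}
    for vendor, month, row in monthly:
        if "2004-07" <= month <= "2005-06":       # July 2004 through June 2005
            for measure, value in normalize(row).items():
                fiscal[measure] += value
    print(fiscal)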

  21. Other Sources –Link Resolvers (e.g. SFX) • Past the database level to access of individual journals • Use is measured the same way across packages • Where vendor reports are unavailable or incomplete (Open Access, backfiles) • The more places SFX links are used (catalog, e-j list), the more complete the data
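
Journal-level counting from link-resolver data might look like the following sketch; the clickthrough export format (one row per full-text request with ISSN, title and date columns) is an assumption made for illustration, not SFX's actual log layout.

    # Count link-resolver clickthroughs per journal (assumed CSV export format).
    import csv
    from collections import Counter

    clicks = Counter()
    with open("sfx_clickthroughs_2005.csv") as f:  # assumed columns: issn, title, date
        for row in csv.DictReader(f):
            clicks[(row["issn"], row["title"])] += 1

    for (issn, title), n in clicks.most_common(20):
        print(n, issn, title)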

  22. Other Sources –MetaSearch Engines (e.g. MetaLib) • “Number of searches” data that may not be counted in vendor reports (Z39.50) • Most useful and interesting to see how patrons are using federated searching

  23. Other Sources –Proxy Servers (e.g. EZProxy) • Standard web log format captures data for every request to the server – this generates large logs that have to be analyzed • Some libraries send all users (not only remote users) through the proxy server for more complete log data
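
Because EZProxy writes standard web-server log lines, ordinary log parsing applies. Below is a minimal sketch that counts requests per client IP and splits on- versus off-campus traffic by address prefix; the prefix used here is a documentation placeholder, not a real campus range.

    # Parse NCSA-style proxy log lines and split traffic by an assumed campus IP prefix.
    import re
    from collections import Counter

    CLF = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')
    CAMPUS_PREFIX = "192.0.2."                     # placeholder; substitute the real range

    per_ip, location = Counter(), Counter()
    with open("ezproxy_access.log") as logfile:
        for line in logfile:
            match = CLF.match(line)
            if not match:
                continue
            ip, timestamp, request, status, size = match.groups()
            per_ip[ip] += 1
            location["on-campus" if ip.startswith(CAMPUS_PREFIX) else "off-campus"] += 1

    print(location)
    print(per_ip.most_common(10))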

  24. OhioLINK deep log analysis (DLA) showcase • Choice of OhioLINK – oldest big deal, common publisher platform and a source of interesting data • Two purposes: 1) to show what kinds of data DLA could generate; 2) to raise the questions that need to be asked • Raw server logs of off-campus use June to December ’04 (to pick up returnees) and on-campus use for October. Logs uniquely contained search and navigational behaviour, too

  25. Metrics • Four ‘use’ metrics employed – number of items/pages viewed, number of sessions conducted, number of items viewed in a session (site penetration) and amount of time spent online. • An ‘item’ might be: a list of journals (subject or alphabetic), a list of journal issues, a contents page, an abstract or a full-text article. • Search or navigational approach used (search engine, subject list of journals, etc.) • Users: returnees; by subject of journal and sub-net; name and type of institution.
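
A sketch of how the first three ‘use’ metrics and session time can be derived from raw log records, assuming each record has already been reduced to (ip, timestamp, item) and using a 30-minute inactivity gap to break sessions; the gap value and record layout are assumptions, not the Ciber team's actual rules.

    # Derive items viewed, sessions, site penetration and time online from raw records.
    # Records are assumed to be (ip, unix_timestamp, item_url); sessions are broken on a
    # 30-minute inactivity gap, which is an assumption rather than the project's rule.
    from collections import defaultdict

    GAP = 30 * 60

    def sessionize(records):
        """Yield lists of (timestamp, item) per session, grouped by IP."""
        by_ip = defaultdict(list)
        for ip, ts, item in records:
            by_ip[ip].append((ts, item))
        for hits in by_ip.values():
            hits.sort()
            session = [hits[0]]
            for prev, cur in zip(hits, hits[1:]):
                if cur[0] - prev[0] > GAP:
                    yield session
                    session = []
                session.append(cur)
            yield session

    records = [("1.2.3.4", 0, "/toc"), ("1.2.3.4", 300, "/abstract"),
               ("1.2.3.4", 4000, "/fulltext"), ("5.6.7.8", 100, "/journals/a-z")]

    sessions = list(sessionize(records))
    print("items viewed:", sum(len(s) for s in sessions))
    print("sessions:", len(sessions))
    print("items per session:", [len(s) for s in sessions])          # site penetration
    print("session minutes:", [(s[-1][0] - s[0][0]) / 60 for s in sessions])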

  26. Is the resource being used? • Items viewed. 1,215,000 items viewed on campus (1 month) and 1,894,000 items viewed off campus (7 months). • Titles used. • Journals available October 2004 = 5,872 • 5,868 journals used if content lists, abstracts & articles included; 5,193 if only articles included. • 5% of journals accounted for 38% of usage; 10% for 53%; and 50% for 93%.
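
The concentration figures on this slide (5% of journals accounting for 38% of use, and so on) can be recomputed from any per-journal count table; a small sketch with toy numbers:

    # Share of total usage accounted for by the top X% of journals (toy counts).
    def usage_share(counts, top_fraction):
        ranked = sorted(counts, reverse=True)
        top_n = max(1, round(len(ranked) * top_fraction))
        return sum(ranked[:top_n]) / sum(ranked)

    views_per_journal = [900, 450, 300, 120, 80, 40, 30, 20, 10, 5]   # illustrative
    for frac in (0.05, 0.10, 0.50):
        print(f"top {frac:.0%} of journals -> {usage_share(views_per_journal, frac):.0%} of use")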

  27. Is the resource being used? • Number of journals viewed in a session. • Very pertinent: OhioLINK is all about massive choice • A third of sessions saw no views of any item associated with a particular journal • Of the two-thirds of sessions recording a journal item view, half viewed item(s) from 1 journal, 30% from 2 to 3 journals, 14% from 4 to 9 journals and 7% from 10+ • 49% of sessions saw a full-text article viewed, and the average number of articles viewed in a session was just over 2.

  28. Is the resource being used? • Site penetration • 23% viewed 1 item in a session, 40% viewed 2 to 4 items, 21% viewed 5 to 10 items, 9% viewed 11 to 20 and 7% viewed 21+. • These figures are quite impressive compared to other digital libraries: in the case of EmeraldInsight, 42% of users viewed just one item. Due to the greater level of download freedom offered by OhioLINK?

  29. Is the resource being used? • Returnees (off-campus) • 73% accessed OhioLINK journals once during the seven months (they might have also used OhioLINK on campus). 22% came back 2 to 5 times, 3% 6 to 15 times and 2% more than 15 times. • Data compromised by floating IP addresses and multi-user machines

  30. What can we learn about the methods used to find articles? • Search engine popularity. • 41% of sessions used the search engine only, and a further 23% used the engine together with either the alphabetic or subject lists. • Users of engines were more likely to look at a wider range of: • Journals. 66% of those using the search engine viewed 2 or more journals, compared to 43% of those using either the alphabetic or subject lists. People using all three methods were most likely to view 10 or more different journals; nearly 1 in 5 did so.

  31. What can we learn about the methods used to find articles? • Users of engines were more likely to look at a wider range of: • Subjects. Those utilising the engine were more likely to have viewed two or more subjects – 54% had done so, compared to 41% of those whose sessions saw use of an alphabetic or subject list. • Older material. Search engine users viewed older material, while those accessing the service via the alphabetic or subject lists were more likely to view very current material.

  32. Issues • This is only pilot data • Caching means not all transactions are recorded in the logs • We are studying the usage patterns of a given IP address, not a given user, with the consequent problems that arise from multi-user machines, proxy servers and floating IP addresses • There are problems with calculating session time • However: 1) we use a number of metrics; 2) findings will be corroborated by survey techniques; 3) we have three years to perfect our techniques!

  33. References • Nicholas D, Huntington P, Russell B, Watkinson A, Jamali HR, Tenopir C. The big deal: ten years on. Learned Publishing 18(4), October 2005, pp?? • Nicholas D, Huntington P, Jamali HR, Tenopir C. Journal of Documentation 62(2), 2006, pp?? • Nicholas D, Huntington P, Jamali HR, Tenopir C. Finding information in (very large) digital libraries: a deep log approach to determining differences in use according to method of access. Journal of Academic Librarianship, March 2006, pp??
