The AOL Search Data Michael Cole firstname.lastname@example.org 4 December 2006
Agenda • Review the timeline for the AOL Search data event • Look at the data • Can we use the data? • The future of data access for the research community
Introduce the AOL Search data. • Review the social controversy around the release of data intended for the research community. • Look at the data. • Discuss the legality and ethics of using the data in research.
The data • statistics • samples • ML sets
Data • Basic Collection Statistics
Dates: 1 March 2006 - 31 May 2006
Normalized queries:
36,389,567 lines of data
21,011,340 instances of new queries (w/ or w/o click-through)
7,887,022 requests for "next page" of results
19,442,629 user click-through events
16,946,938 queries w/o user click-through
10,154,742 unique (normalized) queries
657,426 unique user IDs
Source: aol.U500K_README.txt (distributed w/ data set)
Data (cont.) • During this period, AOL Search had a market share of about 6.5% according to comScore Media Metrix. The data covers about 1.5% of AOL Search activity, so this data set is (very roughly) about 0.1% of all search activity (6.5% × 1.5% ≈ 0.1%, a fraction of about 0.001). • The raw data file in text format is about 2.2 GB.
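The coverage estimate is a product of two rough shares, so it is easy to check. The sketch below only multiplies the two figures quoted on the slide; the variable names are illustrative. The product is about 0.001 as a fraction, which is about 0.1% when written as a percentage.

```python
# Figures quoted on the slide; both are rough estimates.
aol_market_share = 0.065  # AOL Search share of all search activity (about 6.5%)
sample_coverage = 0.015   # share of AOL Search activity in the data set (about 1.5%)

# Share of all search activity represented by the data set.
fraction_of_all_search = aol_market_share * sample_coverage

print(f"as a fraction:   {fraction_of_all_search:.6f}")
print(f"as a percentage: {fraction_of_all_search * 100:.4f}%")
```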
Data: tag cloud • Top queries created using http://tagcrowd.com/
Queries as questions? • How many queries are formulated as questions to the system? 20487936 what amphibian did norwegian composer edvard grieg keep in his pocket and stroke whenever he needed inspiration 2006-04-13 11:54:4 20487936.playingAlongAtHome
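One way to get a first estimate is a simple heuristic over the query column. The sketch below assumes the tab-separated layout described in the README distributed with the data (AnonID, Query, QueryTime, ItemRank, ClickURL, with a header row). The word list, the function names, and the sample rows are illustrative assumptions, not part of the data set.

```python
import csv
import io

# Assumed list of leading question words; this is a heuristic, not a parser.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}

def is_question(query):
    """Return True if the query ends with '?' or starts with a question word."""
    text = query.strip().lower()
    words = text.split()
    if not words:
        return False
    return text.endswith("?") or words[0] in QUESTION_WORDS

def count_questions(lines):
    """Count rows whose Query column looks like a question.

    `lines` is any iterable of tab-separated lines that starts with a header row.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = questions = 0
    for row in reader:
        total += 1
        if is_question(row["Query"]):
            questions += 1
    return questions, total

# Hypothetical sample rows in the assumed file layout.
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1\thow do i tie a bow tie\t2006-03-01 10:00:00\t\t\n"
    "1\tgoogle.com\t2006-03-01 10:05:00\t1\thttp://www.google.com\n"
    "2\twhat is the capital of ohio?\t2006-03-02 09:30:00\t\t\n"
)
questions, total = count_questions(io.StringIO(SAMPLE))
print(questions, "of", total, "sample queries look like questions")
```

A leading-word test misses questions phrased without a question word and counts some non-questions, so the result is best read as a rough estimate.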
Queries involving urls • How many queries are requests to find a link to a url? How many such requests are duplicates? • Is this evidence people use the search engines as substitutes for bookmarks?
Queries involving urls (cont.) Of which (in the full collection):
google.com 146379
yahoo.com 176541
ask.com 19828
msnsearch.com 6
myspace.com 157599
Together, they account for a fraction of 0.086 (8.6%) of all .com queries. It is surprising that so many urls are entered as search terms. Is this just evidence of an interface error, with users confusing the search box with the browser address bar?
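Counts like these can be produced by matching each query against a url-like pattern and tallying the hosts. The sketch below is one possible approach and not necessarily how the figures above were computed. The pattern, the function names, and the sample queries are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed pattern: optional scheme and "www.", then a dotted host that ends
# in a short alphabetic top-level domain, with nothing else in the query.
URL_LIKE = re.compile(
    r"^(?:https?://)?(?:www\.)?([a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,6})/?$"
)

def url_host(query):
    """Return the host if the whole query looks like a url, else None."""
    match = URL_LIKE.match(query.strip().lower())
    return match.group(1) if match else None

def count_url_queries(queries):
    """Tally url-like queries by host."""
    hosts = Counter()
    for query in queries:
        host = url_host(query)
        if host is not None:
            hosts[host] += 1
    return hosts

# Hypothetical sample queries.
sample = ["google.com", "www.google.com", "myspace.com", "cheap flights", "google.com"]
hosts = count_url_queries(sample)
url_queries = sum(hosts.values())
duplicates = url_queries - len(hosts)  # repeats beyond the first request per host
print(hosts.most_common(), url_queries, duplicates)
```

Counting repeats per user rather than over the whole sample would be closer to the bookmark question, since a bookmark substitute shows up as the same user asking for the same host again and again.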
Privacy Issues • To date, only one user has been publicly identified. A New York Times reporter found user 4417749: Thelma Arnold, a 62-year-old woman living in Georgia. • How easy was it to identify her from the queries? (New York Times, 9 Aug 2006)
Information Behavior • The data contains the AOL search activity of each individual over a three-month period. This real-world, unscripted information seeking provides a window into information behavior across many domains. • Example: Searching for medical information • Which sites are used? Are they popular/authoritative? • Which terms are used? Is there an attempt to use medical terminology?
Information Behavior (cont.) • The comprehensive search log may be able to support reasonable guesses of broad demographic categories for individuals.
It is almost irresistible to connect the queries and build a description of a life. • At least one public web site uses the AOL data as the stuff of voyeurs: http://aolpsycho.com • While the activity at aolpsycho.com is not kind, the collective effort is labelling the AOL users: http://www.aolpsycho.com/tag/list • Of course, associating a person with a query is not always justified: nyt09aol.htm
AOL Search: Prepared research files Ten-fold randomized files have been prepared. They are suitable as training and testing sets for statistical model selection and for machine learning. • randomized by user ID • randomized by time • randomized with no time stratification • stratified by week (Sun - Sat) [the short weeks at the beginning and end of the data set are eliminated] • stratified by month • in addition, time-sorted data is available by week and month
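For readers who want to build comparable splits, the sketch below shows one way to make ten folds randomized by user ID, so that all of a user's queries land in the same fold. It is an assumption about what such a split could look like and not a description of how the prepared files were generated. The function name and the sample rows are hypothetical.

```python
import random
from collections import defaultdict

def folds_by_user(rows, k=10, seed=0):
    """Split rows into k folds, keeping each user's rows together.

    `rows` is a list of tuples whose first element is the user ID.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    user_ids = sorted({row[0] for row in rows})
    rng.shuffle(user_ids)
    # Deal the shuffled users round-robin into folds of nearly equal size.
    fold_of = {uid: i % k for i, uid in enumerate(user_ids)}
    folds = defaultdict(list)
    for row in rows:
        folds[fold_of[row[0]]].append(row)
    return [folds[i] for i in range(k)]

# Hypothetical rows: 25 users with 3 queries each.
rows = [(uid, f"query {j}") for uid in range(25) for j in range(3)]
folds = folds_by_user(rows)
print([len(fold) for fold in folds])
```

Keeping a user's queries in one fold avoids testing a model on users it has already seen in training, which matters when consecutive queries from one person are strongly related.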
Should the data be used? • Academic research public reputation • Can any valid statistics be compiled about real world queries without considering this data? • Data has already been used in research publications by AOL researchers. • So the fact of publishing results based on the data is not the issue • At a minimum, if the data is used to check assumptions, form hypotheses etc. that would seem to be OK. But doesn't this need to be disclosed? • Can one reference use of the data even if the research is not based on the AOL data?
Should the data be used? (cont.) • AOL withdrew access to the data on 6 August 2006. They apologized for releasing it, but have not rescinded authorization to use the data as originally stated. • A number of visible web sites using the data have been set up and are still running. There is no evidence that AOL has delivered takedown notices. • http://www.aolsearchdatabase.com/ • http://dontdelete.com • http://aolpsycho.com • http://www.seosleuth.com/site/ • raw data mirrors: http://www.gregsadetsky.com/aol-data/
Impact? • Chilling effect on cooperation between the academic community and commercial operations? • Special relationships to access and use data may be more critical than ever. Researchers without those relationships will be frozen out. • Developing large data sets for research use would be one response, but where are the resources for such an effort?
Using Real World Data Sets • May use further anonymization techniques: replacing potential identifiers, such as strings of digits that may be social security numbers or bank account numbers, with random numbers or zeros. Place names can also be encoded. This could break the inference links that can lead to identification. • Does this information hiding compromise statistical work? • Probably not • ... work on information behavior? • Maybe not.
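The digit-replacement idea can be sketched with a regular expression. The example below overwrites long digit runs with zeros while keeping their separators, so query length and shape are preserved for statistical work. The threshold of seven digits, the pattern, and the function name are assumptions chosen for illustration. Place-name encoding is not shown.

```python
import re

# Assumption: a run of 7 or more digits, optionally separated by single
# spaces or hyphens, is treated as a potential identifier.
DIGIT_RUN = re.compile(r"\d(?:[ -]?\d){6,}")

def mask_numbers(query):
    """Replace every digit in a long digit run with '0', keeping separators."""
    return DIGIT_RUN.sub(lambda m: re.sub(r"\d", "0", m.group(0)), query)

# Hypothetical queries.
print(mask_numbers("ssn 123-45-6789 lookup"))
print(mask_numbers("top 10 movies 2006"))
```

Short numbers such as years and counts are left alone under this threshold, which keeps most queries usable for information behavior work. The trade-off is that short identifiers, such as postal codes, pass through unmasked.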