Observational Approaches to Information Retrieval SIGIR 2014 Tutorial: Choices and Constraints (Part II) Diane Kelly, Filip Radlinski, Jaime Teevan Slides available at: http://aka.ms/sigirtutorial
Diane Kelly, University of North Carolina, USA Filip Radlinski, Microsoft, UK Jaime Teevan, Microsoft Research, USA
Tutorial Goals • To help participants develop a broader perspective of research goals and approaches in IR. • Descriptive, predictive and explanatory • To improve participants’ understanding of research choices and constraints. • Every research project requires the researcher to make a series of choices about a range of factors, and usually there are constraints that influence these choices. • By using some of our own research papers, we aim to expose you to the experiential aspects of the research process by giving you a behind-the-scenes view of how we made choices in our own research.
Overview • Observational log analysis • What we can learn • Collecting log data • Cleaning log data (Filip) • Analyzing log data • Field observations (Diane) Dumais, Jeffries, Russell, Tang & Teevan. “Understanding User Behavior through Log Data and Analysis.”
What We Can Learn Observational Approaches to Information Retrieval
Students prefer used textbooks that are annotated. [Marshall 1998] • Famous marginalia: Mark Twain • “Cowards die many times before their deaths.” (annotated by Nelson Mandela) • David Foster Wallace • “I have discovered a truly marvelous proof ... which this margin is too narrow to contain.” (Pierre de Fermat, 1637)
Digital Marginalia • Do we lose marginalia with digital documents? • Internet exposes information experiences • Meta-data, annotations, relationships • Large-scale information usage data • Change in focus • With marginalia, interest is in the individual • Now we can look at experiences in the aggregate
Practical Uses for Behavioral Data • Behavioral data to improve Web search • Offline log analysis • Example: Re-finding common, so add history support • Online log-based experiments • Example: Interleave different rankings to find best algorithm • Log-based functionality • Example: Boost clicked results in a search result list • Behavioral data on the desktop • Goal: Allocate editorial resources to create Help docs • How to do so without knowing what people search for?
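To make the “boost clicked results” example above concrete, here is a minimal sketch; the click_counts aggregate, the weight, and the function name are made up for illustration and do not describe any particular production ranker.

    # Minimal sketch: re-rank a result list using historical click counts from logs.
    # click_counts is a hypothetical offline aggregate: URL -> number of past clicks.
    def boost_clicked(results, click_counts, weight=0.1):
        # results: list of (url, base_score) pairs from the base ranker
        def boosted(item):
            url, base_score = item
            return base_score + weight * click_counts.get(url, 0)
        return sorted(results, key=boosted, reverse=True)

    ranked = boost_clicked([("a.com", 2.0), ("b.com", 1.9)], {"b.com": 5})
    # b.com (1.9 + 0.5) now outranks a.com (2.0)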
Value of Observational Log Analysis • Focus of observational log analysis • Description: What do people currently do? • Prediction: What will people do in similar situations? • Study real behavior in natural settings • Understand how people search • Identify real problems to study • Improve ranking algorithms • Influence system design • Create realistic simulations and evaluations • Build a picture of human interest
Societal Uses of Behavioral Data • Understand people’s information needs • Understand what people talk about • Impact public policy? (E.g., DonorsChoose.org) Baeza-Yates, Dupret, Velasco. A study of mobile search queries in Japan. WWW 2007
Personal Use of Behavioral Data • Individuals now have a lot of behavioral data • Introspection of personal data popular • My Year in Status • Status Statistics • Expect to see more • As compared to others • For a purpose
Defining Behavioral Log Data • Behavioral log data are: • Traces of natural behavior, seen through a sensor • Examples: Links clicked, queries issued, tweets posted • Real-world, large-scale, real-time • Behavioral log data are not: • Non-behavioral sources of large-scale data • Collected data (e.g., poll data, surveys, census data) • Not recalled behavior or subjective impression
Real-World, Large-Scale, Real-Time • Private behavior is exposed • Example: Porn queries, medical queries • Rare behavior is common • Example: Observe 500 million queries a day • Interested in behavior that occurs 0.002% of the time • Still observe the behavior 10 thousand times a day! • New behavior appears immediately • Example: Google Flu Trends
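Working out the arithmetic behind the rare-behavior claim above (numbers from the slide):

    queries_per_day = 500_000_000
    rate = 0.00002                      # 0.002% expressed as a fraction
    print(queries_per_day * rate)       # 10,000 observations of the rare behavior per day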
Drawbacks • Not controlled • Can run controlled log studies • Discussed in morning tutorial (Filip) • Adversarial • Cleaning log data later today (Filip) • Lots of missing information • Not annotated, no demographics, we don’t know why • Observing richer information after break (Diane) • Privacy concerns • Collect and store data thoughtfully • Next section addresses privacy
Language (e.g., 社会科学, “social science”) • System errors • Spam • Porn
Uses of Analysis • Ranking • E.g., precision • System design • E.g., caching • User interface • E.g., history • Test set development • Complementary research • E.g., studies of query typology, query behavior, long-term trends
Surprises About Query Log Data • From early log analysis • Examples: Jansen et al. 2000, Broder 1998 • Scale: Term common if it appeared 100 times! • Queries are not 7 or 8 words long • Advanced operators not used or “misused” • Nobody used relevance feedback • Lots of people search for sex • Navigation behavior common • Prior experience was with library search
Surprises About Microblog Search? • Microblog search (results ordered by time, “8 new tweets”): • Time important • People important • Specialized syntax • Queries common • Repeated a lot • Change very little • Often navigational • Web search (results ordered by relevance): • Time and people less important • No syntax use • Queries longer • Queries develop
Overview • Observational log analysis • What we can learn • Understand and predict user behavior • Collecting log data • Cleaning log data • Analyzing log data • Field observations
Collecting Log Data Observational Approaches to Information Retrieval
How to Get Logs for Analysis • Use existing logged data • Explore sources in your community (e.g., proxy logs) • Work with a company (e.g., FTE, intern, visiting researcher) • Generate your own logs • Focuses on questions of unique interest to you • Examples: UFindIt, Wikispeedia • Construct community resources • Shared software and tools • Client side logger (e.g., VIBE logger) • Shared data sets • Shared platform • Lemur Community Query Log Project
Web Service Logs • Example sources • Search engine • Commercial site • Types of information • Queries, clicks, edits • Results, ads, products • Example analysis • Click entropy • Teevan, Dumais & Liebling. To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent. SIGIR 2008 • (Figure: example of a query with multiple possible intents: recruiting, academic field, government contractor)
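Click entropy, the example analysis named above, measures how spread out clicks are across results for a query. A minimal sketch with made-up click counts (the cited paper defines the measure; the data here is purely illustrative):

    import math

    def click_entropy(click_counts):
        # click_counts: clicked URL -> number of clicks observed for one query
        total = sum(click_counts.values())
        return -sum((c / total) * math.log2(c / total) for c in click_counts.values())

    # A navigational query concentrates clicks on one URL (low entropy);
    # an ambiguous query spreads clicks over many URLs (high entropy).
    print(click_entropy({"target.com": 98, "other.com": 2}))                      # ~0.14
    print(click_entropy({"a.com": 25, "b.com": 25, "c.com": 25, "d.com": 25}))    # 2.0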
Controlled Web Service Logs • Example sources • Mechanical Turk • Games with a purpose • Types of information • Logged behavior • Active feedback • Example analysis • Search success • Ageev, Guo, Lagun & Agichtein. Find It If You Can: A Game for Modeling … Web Search Success Using Interaction Data. SIGIR 2011
Public Web Service Content • Example sources • Social network sites • Wiki change logs • Types of information • Public content • Dependent on service • Example analysis • Twitter topic models • Ramage, Dumais & Liebling. Characterizing microblogging using latent topic models. ICWSM 2010 • http://twahpic.cloudapp.net
Web Browser Logs • Example sources • Proxy • Logging tool • Types of information • URL visits, paths followed • Content shown, settings • Example analysis • DiffIE • Teevan, Dumais and Liebling. A Longitudinal Study of How Highlighting Web Content Change Affects .. Interactions. CHI 2010
Web Browser Logs • Example sources • Proxy • Logging tool • Types of information • URL visits, paths followed • Content shown, settings • Example analysis • Revisitation • Adar, Teevan and Dumais. Large Scale Analysis of Web Revisitation Patterns. CHI 2008
Rich Client-Side Logs • Example sources • Client application • Operating system • Types of information • Web client interactions • Other interactions – rich! • Example analysis • Stuff I’ve Seen • Dumais et al. Stuff I've Seen: A system for personal information retrieval and re-use. SIGIR 2003
A Simple Example • Logging search queries and clicked results • (Figure: users issue queries such as “dumais”, “beijing sigir2014”, and “vancouver chi 2014” to a Web service, which returns a search engine result page, or “SERP”)
A Simple Example • Logging Queries • Basic data: <query, userID, time> • Which time? timeClient.send, timeServer.receive, timeServer.send, timeClient.receive • Additional contextual data: • Where did the query come from? • What results were returned? • What algorithm or presentation was used? • Other metadata about the state of the system
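A minimal sketch of writing the basic <query, userID, time> record plus the contextual fields listed above, as newline-delimited JSON; the field names and layout are illustrative assumptions, not a real logging schema:

    import json, time

    def log_query_event(query, user_id, shown_urls, ranker, log_file):
        # One record per query; server receive time in UTC seconds so logs can be joined later.
        record = {
            "event": "query",
            "query": query,
            "userID": user_id,
            "timeServer.receive": time.time(),   # client send/receive times would be reported by the client
            "results": shown_urls,               # what was actually returned and shown
            "ranker": ranker,                    # which algorithm or presentation was used
        }
        log_file.write(json.dumps(record) + "\n")

    with open("queries.log", "a") as f:
        log_query_event("vancouver chi 2014", "u42", ["http://www.chi2014.org"], "rankerA", f)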
A Simple Example • Logging Clicked Results (on the SERP) • How can a Web service know which SERP links are clicked? • Proxy re-direct (see the server-side sketch after this slide) • E.g., http://www.chi2014.org vs. http://redir.service.com/?q=chi2014&url=http://www.chi2014.org/&pos=3&log=DiFVYj1tRQZtv6e1FF7kltj02Z30eatB2jr8tJUFR • Script (e.g., JavaScript) • DOM and cross-browser challenges, but can instrument more than link clicks • No download required, but adds complexity and latency, and may influence user interaction • E.g., <img border="0" id="imgC" src="image.gif" width="198" height="202" onmouseover="changeImage()" onmouseout="backImage()"> <script type="text/javascript"> function changeImage(){ document.getElementById("imgC").src = "thank_you.gif"; } function backImage(){ document.getElementById("imgC").src = "image.gif"; } </script> • What happened after the result was clicked? • What happens beyond the SERP is difficult to capture • Browser actions (back, open in new tab, etc.) are difficult to capture • To better interpret user behavior, richer client instrumentation is needed
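The proxy re-direct option above can be implemented as a tiny endpoint that records the click and then bounces the browser on to the real URL. This is a sketch using only the Python standard library; the parameter names mirror the example redirect URL on the slide, but everything else is an assumption:

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    class RedirectLogger(BaseHTTPRequestHandler):
        def do_GET(self):
            # Expected form: /?q=chi2014&url=http://www.chi2014.org/&pos=3
            params = parse_qs(urlparse(self.path).query)
            with open("clicks.log", "a") as log:
                log.write("\t".join([
                    params.get("q", [""])[0],     # query
                    params.get("pos", [""])[0],   # rank of the clicked result
                    params.get("url", [""])[0],   # destination URL
                ]) + "\n")
            self.send_response(302)               # redirect the user on to the result
            self.send_header("Location", params.get("url", ["/"])[0])
            self.end_headers()

    # HTTPServer(("", 8080), RedirectLogger).serve_forever()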
A (Not-So-) Simple Example • Logging: Queries, Clicked Results, and Beyond
What to Log • Log as much as possible • Time-keyed events, e.g.: <time, userID, action, value, context> • Ideal log allows user experience to be fully reconstructed • But … make reasonable choices • Richly instrumented client experiments can provide guidance • Consider the amount of data, storage required • Challenges with scale • Storage requirements • 1k bytes/record x 10 records/query x 100 million queries/day = 1,000 GB/day (about 1 TB/day) • Network bandwidth • Client to server; data center to data center
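The back-of-the-envelope storage estimate above, written out:

    bytes_per_record = 1_000
    records_per_query = 10
    queries_per_day = 100_000_000

    bytes_per_day = bytes_per_record * records_per_query * queries_per_day
    print(bytes_per_day / 1e9)   # 1000.0 GB, i.e. roughly 1 TB of raw log data per day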
What to Do with the Data • Keep as much raw data as possible • And allowable • Must consider Terms of Service, IRB • Post-process data to put into a usable form • Integrate across servers to organize the data • By time • By userID • Normalize time, URLs, etc. • Rich data cleaning
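A minimal sketch of the post-processing described above: merge per-server log files, normalize URLs, and order each user's events in time. It assumes newline-delimited JSON records that each carry userID and time fields (names are illustrative); real pipelines would do this at much larger scale:

    import json
    from urllib.parse import urlsplit, urlunsplit

    def normalize_url(url):
        # Lower-case scheme and host, drop the fragment; deeper cleaning is application-specific.
        p = urlsplit(url)
        return urlunsplit((p.scheme.lower(), p.netloc.lower(), p.path, p.query, ""))

    def merge_logs(paths):
        records = []
        for path in paths:                        # one log file per front-end server
            with open(path) as f:
                for line in f:
                    rec = json.loads(line)
                    if "url" in rec:
                        rec["url"] = normalize_url(rec["url"])
                    records.append(rec)
        # Group each user's events and order them in time so sessions can be reconstructed.
        return sorted(records, key=lambda r: (r["userID"], r["time"]))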
Practical Issues: Time • Time • Client time is closer to the user, but can be wrong or reset • Server time includes network latencies, but controllable • In both cases, need to synchronize time across multiple machines • Data integration • Ensure that joins of data are all using the same basis (e.g., UTC vs. local time) • Accurate timing data is critical for understanding the sequence of user activities, daily temporal patterns, etc.
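A small sketch of putting timestamps on a common basis (UTC) before joining, as recommended above; the input format and offset handling are assumptions:

    from datetime import datetime, timezone, timedelta

    def to_utc(local_str, utc_offset_hours):
        # e.g. to_utc("2006-04-04 18:18:18", -8) for a server that logs in UTC-8
        tz = timezone(timedelta(hours=utc_offset_hours))
        local = datetime.strptime(local_str, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
        return local.astimezone(timezone.utc)

    print(to_utc("2006-04-04 18:18:18", -8).isoformat())   # 2006-04-05T02:18:18+00:00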
Practical Issues: Users • HTTP cookies, IP address, temporary ID • Provide broad coverage and are easy to use, but … • Multiple people use the same machine • Same person uses multiple machines (and browsers) • How many cookies did you use today? • Lots of churn in these IDs • Jupiter Research (39% delete cookies monthly); comScore (2.5x inflation) • Login or download client code (e.g., browser plug-in) • Better correspondence to people, but … • Requires sign-in or download • Results in a smaller and biased sample of people or data (who remember to log in, decided to download, etc.) • Either way, loss of data
Using the Data Responsibly • What data is collected, and how can it be used? • User agreements (terms of service) • Emerging industry standards and best practices • Trade-offs • More data: • More intrusive and potential privacy concerns, but also more useful for understanding interaction and improving systems • Less data: • Less intrusive, but less useful • Risk, benefit, and trust
Example: AOL Search Dataset • August 4, 2006: Logs released to academic community • 3 months, 650 thousand users, 20 million queries • Logs contain anonymized User IDs • August 7, 2006: AOL pulled the files, but already mirrored • August 9, 2006: New York Times identified Thelma Arnold • “A Face Is Exposed for AOL Searcher No. 4417749” • Queries for businesses, services in Lilburn, GA (pop. 11k) • Queries for Jarrett Arnold (and others of the Arnold clan) • NYT contacted all 14 people in Lilburn with Arnold surname • When contacted, Thelma Arnold acknowledged her queries • August 21, 2006: 2 AOL employees fired, CTO resigned • September, 2006: Class action lawsuit filed against AOL • Sample records (AnonID, Query, QueryTime, ItemRank, ClickURL):
1234567  uwcse                        2006-04-04 18:18:18  1  http://www.cs.washington.edu/
1234567  uw admissions process        2006-04-04 18:18:18  3  http://admit.washington.edu/admission
1234567  computer science hci         2006-04-24 09:19:32
1234567  computer science hci         2006-04-24 09:20:04  2  http://www.hcii.cmu.edu
1234567  seattle restaurants          2006-04-24 09:25:50  2  http://seattletimes.nwsource.com/rests
1234567  perlman montreal             2006-04-24 10:15:14  4  http://oldwww.acm.org/perlman/guide.html
1234567  uw admissions notification   2006-05-20 13:13:13
…
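The released AOL files were distributed as tab-separated text with the five columns shown above; a minimal parsing sketch (treating an empty ItemRank/ClickURL as a query with no logged click is an assumption based on the column names):

    import csv

    def read_aol_log(path):
        # Columns: AnonID, Query, QueryTime, ItemRank, ClickURL
        with open(path, newline="") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                yield {
                    "user": row["AnonID"],
                    "query": row["Query"],
                    "time": row["QueryTime"],
                    "rank": int(row["ItemRank"]) if row.get("ItemRank") else None,
                    "url": row.get("ClickURL") or None,
                }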
Example: AOL Search Dataset • Other well-known AOL users • User 711391: i love alaska • http://www.minimovies.org/documentaires/view/ilovealaska • User 17556639: how to kill your wife • User 927 • Anonymous IDs do not make logs anonymous • Contain directly identifiable information • Names, phone numbers, credit cards, social security numbers • Contain indirectly identifiable information • Example: Thelma’s queries • Birthdate, gender, and zip code identify 87% of Americans
Example: Netflix Challenge • October 2, 2006: Netflix announces contest • Predict people’s ratings for a $1 million prize • 100 million ratings, 480k users, 17k movies • Very careful with anonymity post-AOL • May 18, 2008: Data de-anonymized • Paper published by Narayanan & Shmatikov • Uses background knowledge from IMDB • Robust to perturbations in data • December 17, 2009: Doe v. Netflix • March 12, 2010: Netflix cancels second competition • “All customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy. . . Even if, for example, you knew all your own ratings and their dates you probably couldn’t identify them reliably in the data because only a small sample was included (less than one tenth of our complete dataset) and that data was subject to perturbation.” • Sample data format:
Ratings file:
1: [Movie 1 of 17770]
12, 3, 2006-04-18 [CustomerID, Rating, Date]
1234, 5, 2003-07-08
2468, 1, 2005-11-12
…
Movie titles file:
10120, 1982, “Bladerunner”
17690, 2007, “The Queen”
…
Using the Data Responsibly • Control access to the data • Internally: Access control; data retention policy • Externally: Risky (e.g., AOL, Netflix, Enron, Facebook public) • Protect user privacy • Directly identifiable information • Social security, credit card, driver’s license numbers • Indirectly identifiable information • Names, locations, phone numbers, vanity searches for your own name (e.g., AOL) • Putting together multiple sources indirectly (e.g., Netflix, hospital records) • Linking public and private data • k-anonymity; differential privacy; etc. (see the sketch after this slide) • Transparency and user control • Publicly available privacy policy • Give users control to delete, opt-out, etc.
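A minimal sketch of the k-anonymity check mentioned above: every combination of quasi-identifiers (e.g., birthdate, gender, zip code) should be shared by at least k records before release. Toy data only; a real anonymization procedure involves much more (generalization, suppression), and as the Netflix case shows it still does not rule out linkage attacks:

    from collections import Counter

    def is_k_anonymous(records, quasi_identifiers, k):
        # records: list of dicts; quasi_identifiers: fields an outsider could link to other data
        groups = Counter(tuple(r[f] for f in quasi_identifiers) for r in records)
        return all(count >= k for count in groups.values())

    people = [
        {"birth_year": 1980, "gender": "F", "zip": "30047"},
        {"birth_year": 1980, "gender": "F", "zip": "30047"},
        {"birth_year": 1975, "gender": "M", "zip": "98105"},
    ]
    print(is_k_anonymous(people, ["birth_year", "gender", "zip"], k=2))   # False: the last record is unique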