Observational Approaches to Information Retrieval SIGIR 2014 Tutorial: Choices and Constraints (Part II) Diane Kelly, Filip Radlinski, Jaime Teevan Slides available at: http://aka.ms/sigirtutorial
Diane Kelly, University of North Carolina, USA Filip Radlinski, Microsoft, UK Jaime Teevan, Microsoft Research, USA
Tutorial Goals • To help participants develop a broader perspective of research goals and approaches in IR. • Descriptive, predictive and explanatory • To improve participants’ understanding of research choices and constraints. • Every research project requires the researcher to make a series of choices about a range of factors, and usually there are constraints that influence these choices. • By using some of our own research papers, we aim to expose you to the experiential aspects of the research process by giving you a behind-the-scenes view of how we made choices in our own research.
Overview • Observational log analysis • What we can learn • Collecting log data • Cleaning log data (Filip) • Analyzing log data • Field observations (Diane) Dumais, Jeffries, Russell, Tang & Teevan. “Understanding User Behavior through Log Data and Analysis.”
What We Can Learn Observational Approaches to Information Retrieval
Students prefer used textbooks that are annotated. [Marshall 1998] • Famous marginalia: Mark Twain • “Cowards die many times before their deaths.” (annotated by Nelson Mandela) • David Foster Wallace • “I have discovered a truly marvelous proof ... which this margin is too narrow to contain.” (Pierre de Fermat, 1637)
Digital Marginalia • Do we lose marginalia with digital documents? • Internet exposes information experiences • Meta-data, annotations, relationships • Large-scale information usage data • Change in focus • With marginalia, interest is in the individual • Now we can look at experiences in the aggregate
Practical Uses for Behavioral Data • Behavioral data to improve Web search • Offline log analysis • Example: Re-finding common, so add history support • Online log-based experiments • Example: Interleave different rankings to find best algorithm • Log-based functionality • Example: Boost clicked results in a search result list • Behavioral data on the desktop • Goal: Allocate editorial resources to create Help docs • How to do so without knowing what people search for?
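To make the “boost clicked results” example above concrete, here is a minimal sketch; the click_counts aggregate, the weight, and the function name are made up for illustration and do not describe any particular production ranker.

    # Minimal sketch: re-rank a result list using historical click counts from logs.
    # click_counts is a hypothetical offline aggregate: URL -> number of past clicks.
    def boost_clicked(results, click_counts, weight=0.1):
        # results: list of (url, base_score) pairs from the base ranker
        def boosted(item):
            url, base_score = item
            return base_score + weight * click_counts.get(url, 0)
        return sorted(results, key=boosted, reverse=True)

    ranked = boost_clicked([("a.com", 2.0), ("b.com", 1.9)], {"b.com": 5})
    # b.com (1.9 + 0.5) now outranks a.com (2.0)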
Value of Observational Log Analysis • Focus of observational log analysis • Description: What do people currently do? • Prediction: What will people do in similar situations? • Study real behavior in natural settings • Understand how people search • Identify real problems to study • Improve ranking algorithms • Influence system design • Create realistic simulations and evaluations • Build a picture of human interest
Societal Uses of Behavioral Data • Understand people’s information needs • Understand what people talk about • Impact public policy? (E.g., DonorsChoose.org) Baeza-Yates, Dupret, Velasco. A study of mobile search queries in Japan. WWW 2007
Personal Use of Behavioral Data • Individuals now have a lot of behavioral data • Introspection of personal data popular • My Year in Status • Status Statistics • Expect to see more • As compared to others • For a purpose
Defining Behavioral Log Data • Behavioral log data are: • Traces of natural behavior, seen through a sensor • Examples: Links clicked, queries issued, tweets posted • Real-world, large-scale, real-time • Behavioral log data are not: • Non-behavioral sources of large-scale data • Collected data (e.g., poll data, surveys, census data) • Not recalled behavior or subjective impression
Real-World, Large-Scale, Real-Time • Private behavior is exposed • Example: Porn queries, medical queries • Rare behavior is common • Example: Observe 500 million queries a day • Interested in behavior that occurs 0.002% of the time • Still observe the behavior 10 thousand times a day! • New behavior appears immediately • Example: Google Flu Trends
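Working out the arithmetic behind the rare-behavior claim above (numbers from the slide):

    queries_per_day = 500_000_000
    rate = 0.00002                      # 0.002% expressed as a fraction
    print(queries_per_day * rate)       # 10,000 observations of the rare behavior per day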
Drawbacks • Not controlled • Can run controlled log studies • Discussed in morning tutorial (Filip) • Adversarial • Cleaning log data later today (Filip) • Lots of missing information • Not annotated, no demographics, we don’t know why • Observing richer information after break (Diane) • Privacy concerns • Collect and store data thoughtfully • Next section addresses privacy
Language (e.g., 社会科学, “social science”) • System errors • Spam • Porn
Uses of Analysis • Ranking • E.g., precision • System design • E.g., caching • User interface • E.g., history • Test set development • Complementary research • E.g., studies of query typology, query behavior, long-term trends
Surprises About Query Log Data • From early log analysis • Examples: Jansen et al. 2000, Broder 1998 • Scale: Term common if it appeared 100 times! • Queries are not 7 or 8 words long • Advanced operators not used or “misused” • Nobody used relevance feedback • Lots of people search for sex • Navigation behavior common • Prior experience was with library search
Surprises About Microblog Search? • Microblog search (results ordered by time, “8 new tweets”): • Time important • People important • Specialized syntax • Queries common • Repeated a lot • Change very little • Often navigational • Web search (results ordered by relevance): • Time and people less important • No syntax use • Queries longer • Queries develop
Overview • Observational log analysis • What we can learn • Understand and predict user behavior • Collecting log data • Cleaning log data • Analyzing log data • Field observations
Collecting Log Data Observational Approaches to Information Retrieval
How to Get Logs for Analysis • Use existing logged data • Explore sources in your community (e.g., proxy logs) • Work with a company (e.g., FTE, intern, visiting researcher) • Generate your own logs • Focuses on questions of unique interest to you • Examples: UFindIt, Wikispeedia • Construct community resources • Shared software and tools • Client side logger (e.g., VIBE logger) • Shared data sets • Shared platform • Lemur Community Query Log Project
Web Service Logs • Example sources • Search engine • Commercial site • Types of information • Queries, clicks, edits • Results, ads, products • Example analysis • Click entropy • Teevan, Dumais & Liebling. To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent. SIGIR 2008 • (Figure: example of a query with multiple possible intents: recruiting, academic field, government contractor)
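Click entropy, the example analysis named above, measures how spread out clicks are across results for a query. A minimal sketch with made-up click counts (the cited paper defines the measure; the data here is purely illustrative):

    import math

    def click_entropy(click_counts):
        # click_counts: clicked URL -> number of clicks observed for one query
        total = sum(click_counts.values())
        return -sum((c / total) * math.log2(c / total) for c in click_counts.values())

    # A navigational query concentrates clicks on one URL (low entropy);
    # an ambiguous query spreads clicks over many URLs (high entropy).
    print(click_entropy({"target.com": 98, "other.com": 2}))                      # ~0.14
    print(click_entropy({"a.com": 25, "b.com": 25, "c.com": 25, "d.com": 25}))    # 2.0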
Controlled Web Service Logs • Example sources • Mechanical Turk • Games with a purpose • Types of information • Logged behavior • Active feedback • Example analysis • Search success • Ageev, Guo, Lagun & Agichtein. Find It If You Can: A Game for Modeling … Web Search Success Using Interaction Data. SIGIR 2011
Public Web Service Content • Example sources • Social network sites • Wiki change logs • Types of information • Public content • Dependent on service • Example analysis • Twitter topic models • Ramage, Dumais & Liebling. Characterizing microblogging using latent topic models. ICWSM 2010 • http://twahpic.cloudapp.net
Web Browser Logs • Example sources • Proxy • Logging tool • Types of information • URL visits, paths followed • Content shown, settings • Example analysis • DiffIE • Teevan, Dumais and Liebling. A Longitudinal Study of How Highlighting Web Content Change Affects .. Interactions. CHI 2010
Web Browser Logs • Example sources • Proxy • Logging tool • Types of information • URL visits, paths followed • Content shown, settings • Example analysis • Revisitation • Adar, Teevan and Dumais. Large Scale Analysis of Web Revisitation Patterns. CHI 2008
Rich Client-Side Logs • Example sources • Client application • Operating system • Types of information • Web client interactions • Other interactions – rich! • Example analysis • Stuff I’ve Seen • Dumais et al. Stuff I've Seen: A system for personal information retrieval and re-use. SIGIR 2003
A Simple Example • Logging search queries and clicked results • (Figure: users issue queries such as “dumais”, “beijing sigir2014”, and “vancouver chi 2014” to a Web service, which returns a search engine result page, or “SERP”)
A Simple Example • Logging Queries • Basic data: <query, userID, time> • Which time? timeClient.send, timeServer.receive, timeServer.send, timeClient.receive • Additional contextual data: • Where did the query come from? • What results were returned? • What algorithm or presentation was used? • Other metadata about the state of the system
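A minimal sketch of writing the basic <query, userID, time> record plus the contextual fields listed above, as newline-delimited JSON; the field names and layout are illustrative assumptions, not a real logging schema:

    import json, time

    def log_query_event(query, user_id, shown_urls, ranker, log_file):
        # One record per query; server receive time in UTC seconds so logs can be joined later.
        record = {
            "event": "query",
            "query": query,
            "userID": user_id,
            "timeServer.receive": time.time(),   # client send/receive times would be reported by the client
            "results": shown_urls,               # what was actually returned and shown
            "ranker": ranker,                    # which algorithm or presentation was used
        }
        log_file.write(json.dumps(record) + "\n")

    with open("queries.log", "a") as f:
        log_query_event("vancouver chi 2014", "u42", ["http://www.chi2014.org"], "rankerA", f)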
A Simple Example • Logging Clicked Results (on the SERP) • How can a Web service know which SERP links are clicked? • Proxy re-direct (see the server-side sketch after this slide) • E.g., http://www.chi2014.org vs. http://redir.service.com/?q=chi2014&url=http://www.chi2014.org/&pos=3&log=DiFVYj1tRQZtv6e1FF7kltj02Z30eatB2jr8tJUFR • Script (e.g., JavaScript) • DOM and cross-browser challenges, but can instrument more than link clicks • No download required, but adds complexity and latency, and may influence user interaction • E.g., <img border="0" id="imgC" src="image.gif" width="198" height="202" onmouseover="changeImage()" onmouseout="backImage()"> <script type="text/javascript"> function changeImage(){ document.getElementById("imgC").src = "thank_you.gif"; } function backImage(){ document.getElementById("imgC").src = "image.gif"; } </script> • What happened after the result was clicked? • What happens beyond the SERP is difficult to capture • Browser actions (back, open in new tab, etc.) are difficult to capture • To better interpret user behavior, richer client instrumentation is needed
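The proxy re-direct option above can be implemented as a tiny endpoint that records the click and then bounces the browser on to the real URL. This is a sketch using only the Python standard library; the parameter names mirror the example redirect URL on the slide, but everything else is an assumption:

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    class RedirectLogger(BaseHTTPRequestHandler):
        def do_GET(self):
            # Expected form: /?q=chi2014&url=http://www.chi2014.org/&pos=3
            params = parse_qs(urlparse(self.path).query)
            with open("clicks.log", "a") as log:
                log.write("\t".join([
                    params.get("q", [""])[0],     # query
                    params.get("pos", [""])[0],   # rank of the clicked result
                    params.get("url", [""])[0],   # destination URL
                ]) + "\n")
            self.send_response(302)               # redirect the user on to the result
            self.send_header("Location", params.get("url", ["/"])[0])
            self.end_headers()

    # HTTPServer(("", 8080), RedirectLogger).serve_forever()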
A (Not-So-) Simple Example • Logging: Queries, Clicked Results, and Beyond
What to Log • Log as much as possible • Time-keyed events, e.g.: <time, userID, action, value, context> • Ideal log allows user experience to be fully reconstructed • But … make reasonable choices • Richly instrumented client experiments can provide guidance • Consider the amount of data, storage required • Challenges with scale • Storage requirements • 1k bytes/record x 10 records/query x 100 million queries/day = 1,000 GB/day (about 1 TB/day) • Network bandwidth • Client to server; data center to data center
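The back-of-the-envelope storage estimate above, written out:

    bytes_per_record = 1_000
    records_per_query = 10
    queries_per_day = 100_000_000

    bytes_per_day = bytes_per_record * records_per_query * queries_per_day
    print(bytes_per_day / 1e9)   # 1000.0 GB, i.e. roughly 1 TB of raw log data per day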
What to Do with the Data • Keep as much raw data as possible • And allowable • Must consider Terms of Service, IRB • Post-process data to put into a usable form • Integrate across servers to organize the data • By time • By userID • Normalize time, URLs, etc. • Rich data cleaning
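A minimal sketch of the post-processing described above: merge per-server log files, normalize URLs, and order each user's events in time. It assumes newline-delimited JSON records that each carry userID and time fields (names are illustrative); real pipelines would do this at much larger scale:

    import json
    from urllib.parse import urlsplit, urlunsplit

    def normalize_url(url):
        # Lower-case scheme and host, drop the fragment; deeper cleaning is application-specific.
        p = urlsplit(url)
        return urlunsplit((p.scheme.lower(), p.netloc.lower(), p.path, p.query, ""))

    def merge_logs(paths):
        records = []
        for path in paths:                        # one log file per front-end server
            with open(path) as f:
                for line in f:
                    rec = json.loads(line)
                    if "url" in rec:
                        rec["url"] = normalize_url(rec["url"])
                    records.append(rec)
        # Group each user's events and order them in time so sessions can be reconstructed.
        return sorted(records, key=lambda r: (r["userID"], r["time"]))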
Practical Issues: Time • Time • Client time is closer to the user, but can be wrong or reset • Server time includes network latencies, but controllable • In both cases, need to synchronize time across multiple machines • Data integration • Ensure that joins of data are all using the same basis (e.g., UTC vs. local time) • Accurate timing data is critical for understanding the sequence of user activities, daily temporal patterns, etc.
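A small sketch of putting timestamps on a common basis (UTC) before joining, as recommended above; the input format and offset handling are assumptions:

    from datetime import datetime, timezone, timedelta

    def to_utc(local_str, utc_offset_hours):
        # e.g. to_utc("2006-04-04 18:18:18", -8) for a server that logs in UTC-8
        tz = timezone(timedelta(hours=utc_offset_hours))
        local = datetime.strptime(local_str, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
        return local.astimezone(timezone.utc)

    print(to_utc("2006-04-04 18:18:18", -8).isoformat())   # 2006-04-05T02:18:18+00:00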
Practical Issues: Users • HTTP cookies, IP address, temporary ID • Provide broad coverage and are easy to use, but … • Multiple people use the same machine • Same person uses multiple machines (and browsers) • How many cookies did you use today? • Lots of churn in these IDs • Jupiter Research (39% delete cookies monthly); comScore (2.5x inflation) • Login or download client code (e.g., browser plug-in) • Better correspondence to people, but … • Requires sign-in or download • Results in a smaller and biased sample of people or data (who remember to log in, decided to download, etc.) • Either way, loss of data
Using the Data Responsibly • What data is collected, and how can it be used? • User agreements (terms of service) • Emerging industry standards and best practices • Trade-offs • More data: • More intrusive and potential privacy concerns, but also more useful for understanding interaction and improving systems • Less data: • Less intrusive, but less useful • Risk, benefit, and trust
Example: AOL Search Dataset • August 4, 2006: Logs released to academic community • 3 months, 650 thousand users, 20 million queries • Logs contain anonymized User IDs • August 7, 2006: AOL pulled the files, but already mirrored • August 9, 2006: New York Times identified Thelma Arnold • “A Face Is Exposed for AOL Searcher No. 4417749” • Queries for businesses, services in Lilburn, GA (pop. 11k) • Queries for Jarrett Arnold (and others of the Arnold clan) • NYT contacted all 14 people in Lilburn with Arnold surname • When contacted, Thelma Arnold acknowledged her queries • August 21, 2006: 2 AOL employees fired, CTO resigned • September, 2006: Class action lawsuit filed against AOL • Sample records (AnonID, Query, QueryTime, ItemRank, ClickURL):
1234567  uwcse                        2006-04-04 18:18:18  1  http://www.cs.washington.edu/
1234567  uw admissions process        2006-04-04 18:18:18  3  http://admit.washington.edu/admission
1234567  computer science hci         2006-04-24 09:19:32
1234567  computer science hci         2006-04-24 09:20:04  2  http://www.hcii.cmu.edu
1234567  seattle restaurants          2006-04-24 09:25:50  2  http://seattletimes.nwsource.com/rests
1234567  perlman montreal             2006-04-24 10:15:14  4  http://oldwww.acm.org/perlman/guide.html
1234567  uw admissions notification   2006-05-20 13:13:13
…
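The released AOL files were distributed as tab-separated text with the five columns shown above; a minimal parsing sketch (treating an empty ItemRank/ClickURL as a query with no logged click is an assumption based on the column names):

    import csv

    def read_aol_log(path):
        # Columns: AnonID, Query, QueryTime, ItemRank, ClickURL
        with open(path, newline="") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                yield {
                    "user": row["AnonID"],
                    "query": row["Query"],
                    "time": row["QueryTime"],
                    "rank": int(row["ItemRank"]) if row.get("ItemRank") else None,
                    "url": row.get("ClickURL") or None,
                }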
Example: AOL Search Dataset • Other well-known AOL users • User 711391: i love alaska • http://www.minimovies.org/documentaires/view/ilovealaska • User 17556639: how to kill your wife • User 927 • Anonymous IDs do not make logs anonymous • Contain directly identifiable information • Names, phone numbers, credit cards, social security numbers • Contain indirectly identifiable information • Example: Thelma’s queries • Birthdate, gender, and zip code identify 87% of Americans
Example: Netflix Challenge • October 2, 2006: Netflix announces contest • Predict people’s ratings for a $1 million prize • 100 million ratings, 480k users, 17k movies • Very careful with anonymity post-AOL • May 18, 2008: Data de-anonymized • Paper published by Narayanan & Shmatikov • Uses background knowledge from IMDB • Robust to perturbations in data • December 17, 2009: Doe v. Netflix • March 12, 2010: Netflix cancels second competition • “All customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy. . . Even if, for example, you knew all your own ratings and their dates you probably couldn’t identify them reliably in the data because only a small sample was included (less than one tenth of our complete dataset) and that data was subject to perturbation.” • Sample data format:
Ratings file:
1: [Movie 1 of 17770]
12, 3, 2006-04-18 [CustomerID, Rating, Date]
1234, 5, 2003-07-08
2468, 1, 2005-11-12
…
Movie titles file:
10120, 1982, “Bladerunner”
17690, 2007, “The Queen”
…
Using the Data Responsibly • Control access to the data • Internally: Access control; data retention policy • Externally: Risky (e.g., AOL, Netflix, Enron, Facebook public) • Protect user privacy • Directly identifiable information • Social security, credit card, driver’s license numbers • Indirectly identifiable information • Names, locations, phone numbers, vanity searches for your own name (e.g., AOL) • Putting together multiple sources indirectly (e.g., Netflix, hospital records) • Linking public and private data • k-anonymity; differential privacy; etc. (see the sketch after this slide) • Transparency and user control • Publicly available privacy policy • Give users control to delete, opt-out, etc.
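A minimal sketch of the k-anonymity check mentioned above: every combination of quasi-identifiers (e.g., birthdate, gender, zip code) should be shared by at least k records before release. Toy data only; a real anonymization procedure involves much more (generalization, suppression), and as the Netflix case shows it still does not rule out linkage attacks:

    from collections import Counter

    def is_k_anonymous(records, quasi_identifiers, k):
        # records: list of dicts; quasi_identifiers: fields an outsider could link to other data
        groups = Counter(tuple(r[f] for f in quasi_identifiers) for r in records)
        return all(count >= k for count in groups.values())

    people = [
        {"birth_year": 1980, "gender": "F", "zip": "30047"},
        {"birth_year": 1980, "gender": "F", "zip": "30047"},
        {"birth_year": 1975, "gender": "M", "zip": "98105"},
    ]
    print(is_k_anonymous(people, ["birth_year", "gender", "zip"], k=2))   # False: the last record is unique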