Project Presentations
• Thursday next week, each student will give a 4-minute presentation on their project in class (with 1 or 2 minutes for questions)
• Email me your PowerPoint or PDF slides, named with your name (e.g., joesmith.ppt), before 10am next Thursday
• Suggested content:
  • Definition of the task/goal
  • Description of data sets
  • Description of algorithms
  • Experimental results and conclusions
• Be visual where possible! (e.g., use figures, graphs, etc.)
• Final project report will be due by 12 noon Tuesday of finals week; more details to come later
Data Mining Lectures Analysis of Web User Data Padhraic Smyth, UC Irvine

ICS 278: Data Mining
Lecture 18: Analysis of Web User Data
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine

Outline
• Basic concepts in Web mining
• Analyzing user navigation or clickstream data
• Predictive modeling of Web navigation behavior
• Markov modeling methods
• Analyzing search engine data
• Ecommerce aspects of Web log mining
• Automated recommender systems

Further Reading
• Modeling the Internet and the Web, P. Baldi, P. Frasconi, P. Smyth, Wiley, 2003.
• ACM Transactions on Internet Technology (ACM TOIT): can be accessed via the ACM Digital Library (available from UCI IP addresses).
• Annual WebKDD workshops at the ACM SIGKDD conferences.
• Papers on Web page prediction
  • Selective Markov models for predicting Web page accesses, M. Deshpande, G. Karypis, ACM Transactions on Internet Technology, May 2004.
  • Model-based clustering and visualization of navigation patterns on a Web site, Cadez et al., Journal of Data Mining and Knowledge Discovery, 2003.

Introduction to Web Mining
• Useful for studying human digital behavior, e.g., search engine data can be used for
  • Exploration, e.g., number of queries per session?
  • Modeling, e.g., any time-of-day dependence?
  • Prediction, e.g., which pages are relevant?
• Applications
  • Understand social implications of Web usage
  • Design of better tools for information access
  • E-commerce applications

Advertising Applications
• Revenue of many Internet companies is driven by advertising
• Key problem:
  • Given user data:
    • Pages browsed
    • Keywords used in search
    • Demographics
  • Determine the most relevant ads (in real time)
  • Currently about 50% of keyword searches cannot be matched effectively to any ads
  • (other aspects include bidding/pricing of ads)
• Another major problem: “click fraud”
  • Need algorithms that can automatically detect when online advertisements are being manipulated (this is a major problem for Internet advertising)
• Understanding the user is key to these types of applications

Data Sources for Web Mining
• Web content
  • Text and HTML content on Web pages, e.g., categorization of content
• Web connectivity
  • Hyperlink/directed-graph structure of the Web
  • e.g., using PageRank to infer importance of Web pages
  • e.g., using links to improve accuracy in classification of Web pages
• Web user data
  • Data on how users interact with the Web
  • Navigation data, aka “clickstream” data
  • Search query data (keywords for users)
  • Online transaction data (e.g., purchases at an ecommerce store)
• Volume of data?
  • Large portals (e.g., Yahoo!, MSN) report hundreds of millions of users per month

Flowchart of a typical Web mining process (from Cooley, ACM TOIT, 2003)

How our Web navigation is recorded…
• Web logs
  • Record activity between client browser and a specific Web server
  • Easily available
  • Can be augmented with cookies (provide notion of “state”)
• Search engine records
  • Text in queries, which pages were viewed, which snippets were clicked on, etc.
• Client-side browsing records
  • Automatically recorded by client-side software
  • Harder to obtain, but much more accurate than server-side logs
• Other sources
  • Web site registration, purchases, email, etc.
  • ISP recording of Web browsing

Web Server Log Files
• Server Transfer Log:
  • Transactions between a browser and server are logged
  • IP address, the time of the request
  • Method of the request (GET, HEAD, POST, …)
  • Status code, a response from the server
  • Size in bytes of the transaction
• Referrer Log:
  • Where the request originated
• Agent Log:
  • Browser software making the request (may also be a spider)
• Error Log:
  • Requests that resulted in errors (e.g., 404)

W3C Extended Log File Format
Field-name prefixes:
• s = server actions
• c = client actions
• cs = client-to-server actions
• sc = server-to-client actions

Example of Web Log entries
Apache web log:
205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] "GET /~sophal/whole5.gif HTTP/1.0" 200 9609 "http://www.csua.berkeley.edu/~sophal/whole.html" "Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"
216.35.116.26 - - [29/Mar/2002:03:59:40 -0800] "GET /~alexlam/resume.html HTTP/1.0" 200 2674 "-" "Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http://www.inktomi.com/slurp.html)"
202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/indextop.html HTTP/1.1" 200 3510 "http://www.csua.berkeley.edu/~tahir/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/animate.js HTTP/1.1" 200 14261 "http://www.csua.berkeley.edu/~tahir/indextop.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
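A minimal sketch of how entries like these might be parsed, assuming the Apache "combined" log layout seen in the examples (host, identity, user, timestamp, request line, status, size, referrer, user agent). The names `LOG_PATTERN` and `parse_log_line` are hypothetical and chosen only for illustration.

```python
import re
from datetime import datetime

# Regular expression for the Apache "combined" layout assumed above.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamp such as 29/Mar/2002:03:58:06 -0800
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A size of "-" means no bytes were returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

example = (
    '205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] '
    '"GET /~sophal/whole5.gif HTTP/1.0" 200 9609 '
    '"http://www.csua.berkeley.edu/~sophal/whole.html" '
    '"Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"'
)
entry = parse_log_line(example)
print(entry["host"], entry["method"], entry["path"], entry["status"], entry["size"])
```

Once lines are parsed into fields like these, the routine statistics (most visited pages, referrers, error counts) reduce to simple counting over the resulting records.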

Routine Server Log Analysis
• Typical statistics/histograms that are computed
  • Most and least visited Web pages
  • Entry and exit pages
  • Referrals from other sites or search engines
  • What are the searched keywords
  • How many clicks/page views a page received
  • Error reports, like broken links
• Many software products produce standard reports of this type of data
  • Very useful for Web site managers
  • But they do not provide “deep” insights
  • e.g., are there clusters/groups of users that use the site in different ways?

Visualization of Web Log Data over Time

Descriptive Summary Statistics
• Histograms, scatter plots, time-series plots
  • Very important!
  • Helps to understand the big picture
  • Provides “marginal” context for any model-building
  • Models aggregate behavior, not individuals
• Challenging for Web log data
• Examples
  • Session lengths (e.g., power laws)
  • Click rates as a function of time, content

L = number of page requests in a single session from visitors to www.ics.uci.edu over 1 week in November 2002 (robots removed)

Best fit of simple power law model
$\log P(L) = -a \log L + b$, or equivalently $P(L) = C L^{-a}$ with $\log C = b$
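One simple way to obtain such a fit is ordinary least squares on the log-log scale. The sketch below assumes that approach (the slides do not say how the fit was computed) and uses synthetic, unnormalized values rather than the www.ics.uci.edu data; `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(lengths, probabilities):
    """Least-squares fit of log P(L) = -a log L + b. Returns (a, b)."""
    xs = [math.log(length) for length in lengths]
    ys = [math.log(p) for p in probabilities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The model is written with -a as the slope, so negate it.
    return -slope, intercept

# Synthetic values that follow 0.5 * L^(-2) exactly (illustration only).
lengths = [1, 2, 4, 8, 16]
probabilities = [0.5 * length ** -2.0 for length in lengths]
a, b = fit_power_law(lengths, probabilities)
print(a, math.exp(b))
```

On real session-length histograms the tail is noisy, so a log-log regression of this kind is best read as a rough descriptive summary.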


Web data measurement issues
• Important to understand how data is collected
• Web data is collected automatically via software logging tools
  • Advantage:
    • No manual supervision required
  • Disadvantage:
    • Data can be skewed (e.g., due to the presence of robot traffic)
• Important to identify robots (also known as crawlers, spiders)

A time-series plot of ICS Website data
Number of page requests per hour as a function of time, from page requests in the www.ics.uci.edu Web server logs during the first week of April 2002.

Example: Web Traffic from Commercial Site (slide from Ronny Kohavi, Amazon)
[Figure annotations: Sept-11 (note significant drop in human traffic, not bot traffic); weekends; internal performance bot; registration at search engine sites]

Robot / human identification
• Removal of robot data is an important preprocessing step before clickstream analysis
• Robot page requests are often identified using a variety of heuristics
  • e.g., some robots identify themselves in the server logs
    • All robots in principle should visit robots.txt on the Web server
    • Also, robots should identify themselves via the User Agent field in page requests
  • But other robots actively try to disguise that they are robots
• Patterns of access
  • Robots often explore the entire website in breadth-first fashion
  • Humans more typically access Web pages in depth-first fashion
  • Timing between page requests can be more regular for robots (e.g., every 5 seconds)
  • Duration of sessions, number of page requests per day: often unusually large for robots (e.g., 1000’s of page requests per day)
• Tan and Kumar (Journal of Data Mining and Knowledge Discovery, 2002) provide a detailed description of using classification techniques to learn how to detect robots
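A rough sketch of how a few of these heuristics could be combined for one visitor's requests. The keyword list and thresholds are assumptions chosen for illustration, and `looks_like_robot` is a hypothetical name; this is a hand-written rule set, not the learned classifier of Tan and Kumar.

```python
from statistics import pstdev

# Illustrative keywords only; real robot lists are much longer.
ROBOT_AGENT_KEYWORDS = ("bot", "crawler", "spider", "slurp")

def looks_like_robot(paths, timestamps, user_agent,
                     max_requests=1000, min_gap_stdev=1.0):
    """Apply simple heuristics to one visitor's requests.

    paths: requested URL paths; timestamps: request times in seconds.
    """
    # Heuristic 1: well-behaved robots fetch robots.txt.
    if any(path.endswith("/robots.txt") for path in paths):
        return True
    # Heuristic 2: self-identification in the User Agent field.
    agent = user_agent.lower()
    if any(keyword in agent for keyword in ROBOT_AGENT_KEYWORDS):
        return True
    # Heuristic 3: unusually many page requests.
    if len(paths) > max_requests:
        return True
    # Heuristic 4: suspiciously regular timing between requests.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if len(gaps) >= 5 and pstdev(gaps) < min_gap_stdev:
        return True
    return False

human = looks_like_robot(
    ["/", "/a.html", "/b.html"], [0, 12, 95],
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")
crawler = looks_like_robot(
    ["/~alexlam/resume.html"], [0],
    "Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http://www.inktomi.com/slurp.html)")
print(human, crawler)
```

Robots that disguise themselves will pass the first two checks, which is why timing and volume features (and learned classifiers) are used as well.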

Fractions of Robot Data (from Tan and Kumar, 2002)

From Tan and Kumar, 2002: overall accuracies of around 90% were obtained using decision tree classifiers, trained on sessions of lengths 1, 2, 3, 4, ...

Page requests, caching, and proxy servers
• In theory, the requesting browser requests a page from a Web server and the request is processed
• In practice, there are
  • Other users
  • Browser caching
  • Dynamic addressing in local network
  • Proxy server caching

Page requests, caching, and proxy servers
A graphical summary of how page requests from an individual user can be masked at various stages between the user’s local computer and the Web server.

Page requests, caching, and proxy servers
• Web server logs are therefore far from ideal as a complete and faithful representation of individual page views
• There are heuristics to try to infer the true actions of the user:
  • Path completion (Cooley et al. 1999)
    • e.g., if a link B -> F is known to exist but C -> F does not, then the session ABCF can be interpreted as ABCBF
  • See Anderson et al. 2001 for more heuristics
• In the general case, it is hard to know exactly what the user viewed
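The path-completion idea can be sketched as follows, assuming the site's link structure is available as a dictionary from each page to the set of pages it links to. The function name `complete_path` and the backtracking rule (return to the most recent earlier page that links to the requested one) are illustrative assumptions, not the exact procedure of Cooley et al.

```python
def complete_path(session, links):
    """Insert inferred back-button page views into a logged session.

    session: sequence of page ids as seen in the server log.
    links: dict mapping a page id to the set of pages it links to.
    """
    completed = []
    history = []  # pages on the current forward path
    for page in session:
        if history and page not in links.get(history[-1], set()):
            # No link from the current page: look back for the most
            # recent earlier page that does link to the requested page.
            target = None
            for i in range(len(history) - 2, -1, -1):
                if page in links.get(history[i], set()):
                    target = i
                    break
            if target is not None:
                # Cached pages revisited via the back button.
                for j in range(len(history) - 2, target - 1, -1):
                    completed.append(history[j])
                del history[target + 1:]
        history.append(page)
        completed.append(page)
    return completed

links = {"A": {"B"}, "B": {"C", "F"}, "C": set()}
print("".join(complete_path("ABCF", links)))
```

If no earlier page links to the requested one (for example, the user typed the URL), the sketch leaves the session unchanged at that point.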

Identifying individual users from Web server logs
• Useful to associate specific page requests with specific individual users
• IP address most frequently used
  • Disadvantages
    • One IP address can belong to several users
    • Dynamic allocation of IP addresses
• Better to use cookies (or login ID if available)
  • Information in the cookie can be accessed by the Web server to identify an individual user over time
  • Actions by the same user during different sessions can be linked together

Identifying individual users from Web server logs
• Commercial websites use cookies extensively
  • 97% of users have cookies enabled permanently on their browsers (source: Amazon.com, 2003)
• However …
  • There are privacy issues: need implicit user cooperation
  • Cookies can be deleted / disabled
• Another option is to enforce user registration
  • High reliability
  • But can discourage potential visitors
  • Large portals (such as Yahoo!) have a high fraction of logged-in users

Sessionizing
• Time oriented (robust)
  • e.g., by gaps between requests
  • not more than 20 minutes between successive requests
  • this is a heuristic, but it is a standard “rule” used in practice
• Navigation oriented (good for short sessions and when timestamps are unreliable)
  • Referrer is previous page in session, or
  • Referrer is undefined but request is within 10 secs, or
  • Link from previous to current page in Web site
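The time-oriented rule can be sketched in a few lines, assuming one user's request times are given in seconds; `sessionize` is a hypothetical helper name, and the 20-minute threshold is the heuristic from the slide.

```python
def sessionize(timestamps, max_gap_seconds=20 * 60):
    """Split one user's request times (in seconds) into sessions.

    A new session starts whenever the gap since the previous request
    exceeds max_gap_seconds (20 minutes by default).
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_gap_seconds:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

requests = [0, 60, 600, 1801, 1831]
print(sessionize(requests))
```

Here the gap between 600 and 1801 is 1201 seconds, just over the threshold, so the last two requests form a second session.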

Client-side data
• Advantages of collecting data at the client side:
  • Direct recording of page requests (eliminates ‘masking’ due to caching)
  • Recording of all browser-related actions by a user (including visits to multiple websites)
  • More reliable identification of individual users (e.g., by login ID for multiple users on a single computer)
• Preferred mode of data collection for studies of navigation behavior on the Web
• Companies like ComScore and Nielsen use client-side software to track home computer users

Client-side data
• Statistics like ‘Time per session’ and ‘Page-view duration’ are more reliable in client-side data
• Some limitations
  • Some statistics, like ‘Page-view duration’, still cannot be totally reliable, e.g., the user might go to fetch coffee
  • Need explicit user cooperation
  • Typically recorded on home computers, so may not reflect a complete picture of Web browsing behavior
• Web surfing data can be collected at intermediate points like ISPs, proxy servers
  • Can be used to create user profiles and target advertising

Modeling Clickrate Data
• Data
  • 200k Alexa users, client-side, over 24 hours
  • Ignore URLs requested
  • Goal is to build a time-series model that characterizes user click rates


Markov-Poisson Model (Scott and Smyth, 2003)
• Doubly stochastic process
  • Locally constant Poisson rate
  • Indexed by M Markov states
• Fit a model with M = 3 states
  • Absence of a Web session
  • Web session with slow click rate: 1 minute rate
  • Web session with rapid click rate: 10 second rate
• Used hierarchical Bayes on individuals
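To make the "doubly stochastic" idea concrete, the sketch below simulates a discrete-time version: a hidden Markov state changes from minute to minute, and the number of clicks in each minute is Poisson with a rate set by the current state. The transition probabilities are invented for illustration, and the rates are one reading of the slide (about one click per minute and one click per 10 seconds); none of these are the fitted parameters from Scott and Smyth (2003).

```python
import math
import random

# Hypothetical values for illustration only.
RATES_PER_MINUTE = [0.0, 1.0, 6.0]  # no session, slow clicking, rapid clicking
TRANSITIONS = [
    [0.98, 0.01, 0.01],
    [0.05, 0.85, 0.10],
    [0.05, 0.15, 0.80],
]

def sample_poisson(rate, rng):
    """Draw a Poisson count using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate(minutes, seed=0):
    """Return (states, clicks): hidden state and click count for each minute."""
    rng = random.Random(seed)
    state = 0
    states, clicks = [], []
    for _ in range(minutes):
        states.append(state)
        clicks.append(sample_poisson(RATES_PER_MINUTE[state], rng))
        # Move to the next hidden state according to the current row.
        state = rng.choices([0, 1, 2], weights=TRANSITIONS[state])[0]
    return states, clicks

states, clicks = simulate(24 * 60)
print(sum(clicks), "clicks in", len(clicks), "minutes")
```

Fitting such a model to data runs the other way: the states are unobserved and must be inferred from the click times.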

Early studies from 1995 to 1997
• The earliest studies on client-side data are Catledge and Pitkow (1995) and Tauscher and Greenberg (1997)
• In both studies, data was collected by logging Web browser commands
• Population consisted of faculty, staff and students
• Both studies found
  • Clicking on hypertext anchors was the most common action
  • Using the ‘back button’ was the second most common action

Early studies from 1995 to 1997
• High probability of page revisitation (~0.58-0.61)
  • Lower bound, because page requests prior to the start of the studies are not accounted for
  • Humans are creatures of habit?
  • Content of the pages changed over time?
• Strong recency effect (a page that is revisited is usually one that was visited in the recent past)
  • Correlates with the ‘back button’ usage
• Similar repetitive actions are found in telephone number dialing, etc.
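A small sketch of how a revisitation rate can be computed from one user's request log, assuming the rate is defined as the fraction of requests that go to a page already seen earlier in the log, i.e. (total requests - distinct pages) / total requests. The helper name `revisitation_rate` is hypothetical.

```python
def revisitation_rate(requests):
    """Fraction of page requests that revisit a page seen earlier in the log."""
    if not requests:
        return 0.0
    total = len(requests)
    distinct = len(set(requests))
    return (total - distinct) / total

log = ["A", "B", "A", "C", "A"]
print(revisitation_rate(log))
```

Pages first visited before logging began are counted as new, which is why a value computed this way is a lower bound.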

The Cockburn and McKenzie study from 2002
• Earlier studies were outdated
  • Web has changed dramatically in the past few years
• Cockburn and McKenzie (2002) provide a more up-to-date analysis
  • Analyzed the daily history.dat files produced by the Netscape browser for 17 users for about 4 months
  • Population studied consisted of faculty, staff and graduate students
• Study found revisitation rates (~0.81) higher than in the past 94 and 95 studies
  • Time window is three times that of past studies

The Cockburn and McKenzie study from 2002
• Revisitation rate less biased than in the previous studies?
• Human behavior changed from an exploratory mode to a utilitarian mode?
  • The more pages a user visits, the more requests there are for new pages
  • The most frequently requested page for each user can account for a relatively large fraction of his/her page requests
• Useful to see the scatter plot of the number of distinct pages requested per user versus the total number of pages requested

The Cockburn and McKenzie study from 2002
The number of distinct pages visited (page vocabulary size) versus the total number of pages requested, for each of the 17 users in the Cockburn and McKenzie (2002) study (log-log plot)

The Cockburn and McKenzie study from 2002
Bar chart of the ratio of the number of page requests for the most frequent page to the total number of page requests, for the 17 users in the Cockburn and McKenzie (2002) study

Outline
• Basic concepts in Web log data analysis
• Predictive modeling of Web navigation behavior
• Markov modeling methods
• Analyzing search engine data
• Ecommerce aspects of Web log mining

Markov models for page prediction
• General approach is to use a finite-state Markov chain
  • Each state can be a specific Web page or a category of Web pages
  • If only interested in the order of visits (and not in time), each new request can be modeled as a transition between states
• Issues
  • Self-transition
  • Time-independence

Markov models for page prediction
• For simplicity, consider an order-dependent, time-independent finite-state Markov chain with M states
• Let s be a sequence of observed states of length L, e.g., s = ABBCAABBCCBBAA with three states A, B and C. $s_t$ is the state at position t ($1 \le t \le L$). In general,
  $P(s) = P(s_1) \prod_{t=2}^{L} P(s_t \mid s_{t-1}, \ldots, s_1)$
• Under a first-order Markov assumption,
  $P(s) = P(s_1) \prod_{t=2}^{L} P(s_t \mid s_{t-1})$
• This provides a simple generative model to produce sequential data

Markov models for page prediction
• If we denote $T_{ij} = P(s_t = j \mid s_{t-1} = i)$, we can define an M x M transition matrix
• Properties
  • Strong first-order assumption
  • Simple way to capture sequential dependence
• If each page is a state and there are W pages, the matrix has $O(W^2)$ entries; W can be of the order $10^5$ to $10^6$ for a CS department of a university
• To alleviate this, we can cluster the W pages into M clusters, each assigned a state in the Markov model
• Clustering can be done manually, based on the directory structure on the Web server, or automatically using clustering techniques
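A minimal sketch of how the transition matrix might be estimated from observed sequences by counting transitions and normalizing each row (the maximum-likelihood estimate without smoothing). The function names `estimate_transitions` and `predict_next` are hypothetical, and the example reuses the sequence s = ABBCAABBCCBBAA from the earlier slide.

```python
from collections import defaultdict

def estimate_transitions(sequences):
    """Estimate T[i][j] = P(next state = j | current state = i) by counting."""
    counts = defaultdict(lambda: defaultdict(int))
    for sequence in sequences:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    matrix = {}
    for current, row in counts.items():
        total = sum(row.values())
        matrix[current] = {state: n / total for state, n in row.items()}
    return matrix

def predict_next(matrix, current):
    """Return the most probable next state, or None for an unseen state."""
    row = matrix.get(current)
    if not row:
        return None
    return max(row, key=row.get)

T = estimate_transitions(["ABBCAABBCCBBAA"])
print(T["B"])
print(predict_next(T, "B"))
```

With many states, most transitions are never observed, so in practice some smoothing of the counts would usually be added.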

Markov models for page prediction
• $T_{ij} = P(s_t = j \mid s_{t-1} = i)$ represents the probability that an individual user’s next request will be from category j, given they were in category i
• We can add E, an end-state, to the model
• E.g., for three categories with end state: [transition matrix not recovered from the slide]
• E denotes the end of a sequence, and the start of a new sequence

Markov models for page prediction
• First-order Markov model assumes that the next state is based only on the current state
  • Limitations
    • Doesn’t consider ‘long-term memory’
• We can try to capture more memory with a kth-order Markov chain
  • Limitations
    • Inordinate amount of training data needed: $O(M^{k+1})$ parameters
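A short worked count shows why the training-data requirement grows so quickly. A kth-order chain conditions on the previous k states, so there are $M^k$ possible histories, each with its own distribution over M next states. The value M = 20 below is an arbitrary illustrative choice, not a number from the lecture.

```latex
\[
\underbrace{M^{k}}_{\text{histories}} \times \underbrace{M}_{\text{next states}} = M^{k+1}
\quad\text{transition probabilities}
\]
\[
M = 20:\qquad k=1 \Rightarrow 20^{2} = 400,\qquad
k=2 \Rightarrow 20^{3} = 8{,}000,\qquad
k=3 \Rightarrow 20^{4} = 160{,}000
\]
```

Each additional order multiplies the number of probabilities to estimate by M, while the amount of logged navigation data stays fixed.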
