Characterizing Visitors to a Website Across Multiple Sessions

NGDM Workshop, Nov 2002. Arindam Banerjee, Joydeep Ghosh.

Presentation Transcript


  1. Characterizing Visitors to a Website Across Multiple Sessions
     NGDM Workshop, Nov 2002
     Arindam Banerjee, Joydeep Ghosh
     Banerjee and Ghosh

  2. Motivation
     Why characterize or predict web user behavior?
     • Site-centric view: personalization, sticky websites
     • User-centric view: personal agents for information acquisition
     • Universalist approaches: PageRank, web metrics, …

  3. Clustering Users from Web Logs
     • Wide variety of web behavior → segment users based on surfing behavior as a first step to further analysis
     • User: a set of sessions
     • Session: a sequence of (page ID, time spent on that page) tuples
     • How to cluster sets of sequences?
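
The data model on this slide can be pictured with a minimal Python sketch. The names `PageVisit`, `Session`, and `User` are illustrative and do not come from the talk. The example sessions are the ones listed for user 216.154.xxx.xxx on slide 13; the slides do not state the unit of the time values.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical type names, chosen for illustration only.
PageVisit = Tuple[str, float]   # (page ID, time spent on that page)
Session = List[PageVisit]       # ordered sequence of page visits


@dataclass
class User:
    """A user is a set of sessions; a list is used here to keep the order of appearance."""
    user_id: str
    sessions: List[Session] = field(default_factory=list)


# Sessions as listed on slide 13 for this user.
user = User(
    "216.154.xxx.xxx",
    [
        [("/coffeehouse", 12), ("/biztech", 25), ("/books", 48)],
        [("/coffeehouse", 13), ("/biztech", 26), ("/books", 19)],
    ],
)

total_time_first_session = sum(t for _, t in user.sessions[0])
```

Clustering users then means clustering objects of this shape: each one a collection of variable-length sequences rather than a fixed-length vector.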

  4. The Approach
     • Cluster Sessions
     • Session Similarity Measure
     • Session Similarity Graph
     • Outlier Detection
     • Graph Partitioning
     • Create a Cluster Space
     • Cluster users in this Space

  5. A Similarity Measure for Sessions
     • Overlap between two sessions is represented by their longest common subsequence (LCS)
     • Obtain session similarity using the LCS and time information:
       session similarity = (time similarity in LCS) × (importance of LCS)
     • The similarity component:
       • Min-max similarity of the times, averaged over the pages in the LCS
     • The importance component:
       • Average of the fraction of overall session time spent in the LCS
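
Written out as formulas, the measure might look as follows. The notation is an assumption introduced here, not taken from the slides: the LCS has L pages, t_{1,l} and t_{2,l} are the times the two sessions spent on the l-th LCS page, and T_1, T_2 are the total times of the two sessions. This reading agrees with the worked example on slides 20 and 21.

```latex
\[
\mathrm{sim}(s_1, s_2) = \frac{1}{L} \sum_{l=1}^{L}
  \frac{\min(t_{1,l},\, t_{2,l})}{\max(t_{1,l},\, t_{2,l})},
\qquad
\mathrm{imp}(s_1, s_2) = \frac{1}{2}
  \left( \frac{\sum_{l=1}^{L} t_{1,l}}{T_1} + \frac{\sum_{l=1}^{L} t_{2,l}}{T_2} \right)
\]
\[
S(s_1, s_2) = \mathrm{sim}(s_1, s_2) \times \mathrm{imp}(s_1, s_2)
\]
```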

  6. Session Clustering
     • Find the pairwise similarity values between all pairs of sessions; record only similarities above the threshold q
     • Incrementally construct the similarity graph G_q
       • the vertices are the sessions, the edge weights are the session similarity values
       • no isolated vertices (discard "outliers")
     • Balanced Graph Partitioning
       • we used Metis [Karypis, Kumar]
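
The graph construction step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `build_similarity_graph` is a hypothetical helper, the similarity function is passed in by the caller, and a toy set-overlap measure stands in for the session similarity. The balanced partitioning step (done with Metis in the talk) is not shown.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple


def build_similarity_graph(
    items: Sequence,
    similarity: Callable[[object, object], float],
    threshold: float,
) -> Tuple[Dict[Tuple[int, int], float], List[int]]:
    """Return (edges, outliers).

    edges maps an index pair (i, j), i < j, to its similarity, keeping only
    values strictly above the threshold. outliers lists the indices of items
    that end up with no edge at all (isolated vertices).
    """
    edges: Dict[Tuple[int, int], float] = {}
    for i, j in combinations(range(len(items)), 2):
        w = similarity(items[i], items[j])
        if w > threshold:
            edges[(i, j)] = w
    connected = {v for pair in edges for v in pair}
    outliers = [i for i in range(len(items)) if i not in connected]
    return edges, outliers


def toy_overlap(a: set, b: set) -> float:
    """Toy stand-in similarity (set overlap), NOT the session measure from the talk."""
    return len(a & b) / len(a | b)


edges, outliers = build_similarity_graph(
    [{1, 2, 3}, {2, 3, 4}, {9}], toy_overlap, threshold=0.3
)
```

The all-pairs loop is quadratic in the number of sessions, which is in line with the remark in the conclusions that forming the similarity graph is slow.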

  7. The Cluster Space
     • Given: each session assigned to one of k clusters (sets)
     • Sessions of a user are distributed among the k sets
       • vector u = [u_1, u_2, …, u_k]^T, where u_i = number of sessions of the user belonging to cluster i
     • Stage II: User Clustering
       • find pairwise similarity values using the extended Jaccard measure
       • partition the similarity graph
     • Gives l user clusters and a set of outlier users
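
A minimal sketch of the cluster-space mapping and the user similarity, in Python. The function names are hypothetical. The slides do not spell out the extended Jaccard formula; the code assumes the usual definition, x·y / (‖x‖² + ‖y‖² − x·y).

```python
from collections import Counter
from typing import List, Sequence


def user_vector(session_labels: Sequence[int], k: int) -> List[int]:
    """Map a user's sessions to cluster space.

    session_labels holds the cluster index (0..k-1) assigned to each of the
    user's sessions; entry i of the result counts the sessions in cluster i.
    """
    counts = Counter(session_labels)
    return [counts.get(c, 0) for c in range(k)]


def extended_jaccard(u: Sequence[float], v: Sequence[float]) -> float:
    """Extended Jaccard similarity, assuming x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    return dot / denom if denom else 0.0


# A user with three sessions: one in cluster 0, two in cluster 2 (k = 3).
u = user_vector([0, 2, 2], k=3)
v = user_vector([0, 0, 2], k=3)
similarity_uv = extended_jaccard(u, v)
```

Once every user is a k-dimensional count vector, the same threshold-graph-and-partition procedure used for sessions can be reused on users.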

  8. The Dataset: Sulekha.com

  9. Dataset details
     • Logs over a one-month period
     • Raw log size: 184 MB
     • 453,953 files accessed
     • 37,753 sessions in all
     • 23,310 sessions after some preprocessing/filtering
     • 2,493 users

  10. Results: Session Clusters

  11. Results: User Clusters
      • user: [(128.194.xxx.xxx)]
        • (/authors,3)(/articles,129)
        • (/authors,8)(/articles,8)
        • (/authors,80)(/articles,2141)
      • user: [(209.30.xxx.xxx)]
        • (/home,77)(/articles,111)(/authors,93)(/articles,629)(/misc,58)(/coffeehouse,75)(/wo-men,967)
        • (/articles,2627)
      • user: [(171.68.xxx.xxx)]
        • (/home,323)(/articles,24)(/authors,45)(/articles,1290)
      A user cluster: people who read the articles

  12. Results: User Clusters
      • user: [(152.170.xxx.xxx)]
        • (/home,21)(/wo-men,1075)(/philosophy,52)
      • user: [(209.244.xxx.xxx)]
        • (/home,5)(/coffeehouse,94)(/wo-men,75)(/movies,75)(/wo-men,31)
        • (/home,52)(/philosophy,67)(/wo-men,955)(/philosophy,26)(/coffeehouse,382)(/biztech,298)(/philosophy,290)
        • (/home,17)(/coffeehouse,12)(/wo-men,15)(/personal,6)(/biztech,94)(/coffeehouse,2)(/philosophy,1093)
      A user cluster: people interested in wo-men, philosophy, coffeehouse

  13. Results: User Clusters
      • user: [(216.154.xxx.xxx)]
        • (/coffeehouse,12)(/biztech,25)(/books,48)
        • (/coffeehouse,13)(/biztech,26)(/books,19)
      • user: [(204.220.xxx.xxx)]
        • (/coffeehouse,162)
        • (/coffeehouse,40)
      • user: [(32.100.xxx.xxx)]
        • (/coffeehouse,12)(/contests,12)
        • (/coffeehouse,43)(/contests,44)
      A user cluster: people interested in coffeehouse (bookmarked it!)

  14. Result Visualization using CLUSION [Strehl & Ghosh 01]
      (Figure panels: Sessions, Users)

  15. Conclusions
      • Segmentation: a basic pre-processing step for Web Mining
      • Similarity measure + Cluster Space concept: applicable to clustering of sets of any data structure
      • For certain websites, time spent on the pages matters
        • not handled by current commercial tools
      • Outlier detection before clustering is important
      • Results QA-ed by human subjects
        • Results for clusters & outliers at both levels were subjectively good
        • No good way to find cluster quality analytically
      • Formation of the similarity graph is a slow process

  16. Future Work
      • Improve the present method by:
        • using cluster seeds for cluster growing
        • using alternative clustering algorithms for each stage
        • studying the effect of thresholds and number of clusters on performance
        • studying the importance of the order of page visits
        • studying the importance of balanced clustering

  17. Backup

  18. Issues: Choice of Parameters
      • Number of session clusters, k, should be chosen appropriately
      • Thresholds for forming session & user similarity graphs:
        • threshold value should be chosen after looking at the distribution of edge weights
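
One simple way to look at the distribution of edge weights before fixing a threshold is a coarse histogram plus the fraction of edges a candidate threshold would keep. The Python sketch below is an illustration only: both helper names are hypothetical, the weights are made-up sample values, and similarities are assumed to lie in [0, 1].

```python
from typing import List, Sequence


def weight_histogram(weights: Sequence[float], bins: int = 10) -> List[int]:
    """Count weights per equal-width bin over [0, 1]; a weight of 1.0 goes in the last bin."""
    counts = [0] * bins
    for w in weights:
        idx = min(int(w * bins), bins - 1)
        counts[idx] += 1
    return counts


def fraction_kept(weights: Sequence[float], threshold: float) -> float:
    """Fraction of candidate edges whose weight is strictly above the threshold."""
    if not weights:
        return 0.0
    return sum(1 for w in weights if w > threshold) / len(weights)


# Made-up sample weights, for illustration only.
sample_weights = [0.05, 0.43, 0.48, 1.0]
histogram = weight_histogram(sample_weights, bins=10)
kept = fraction_kept(sample_weights, threshold=0.4)
```

A threshold placed in a sparse region of the histogram separates weakly related pairs from strongly related ones without cutting through a dense group of similar weights.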

  19. Related Work
      • Research in Web Mining:
        • Extraction of navigational patterns: Spiliopoulou, Faulstich
        • Ordering relationships: Mannila, Meek
        • Surfing prediction: Pitkow, Pirolli
        • Clustering web usage sessions: Fu, Sandhu, Shih

  20. Example
      • Sessions:
        • Session1 = [(a,8) (b,100) (d,8) (c,5) (e,23) (a,5)]
        • Session2 = [(b,5) (d,12) (f,1) (a,7) (c,5)]
      • LCS pages = [(b)(d)(c)]
      • Corresponding index and time sequences (indices are 0-based):
        • Index1 = [(1)(2)(3)], Time1 = [(100) (8) (5)]
        • Index2 = [(0)(1)(4)], Time2 = [(5) (12) (5)]
      • Similarity over each LCS page: min/max of the two times
        • Similarity on page b = 5/100 = 0.05
        • Similarity on page d = 8/12 = 0.67
        • Similarity on page c = 5/5 = 1.00

  21. Example (contd.)
      • The similarity component = (0.05 + 0.67 + 1.00)/3 = 0.57
      • The importance component:
        • Fraction of time spent in the LCS by Session1 = 113/149 = 0.76
        • Fraction of time spent in the LCS by Session2 = 22/30 = 0.73
        • The mean = (0.76 + 0.73)/2 = 0.75
      • The overall similarity = 0.57 × 0.75 = 0.43
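
The worked example can be reproduced with a short Python sketch. This is one possible reading of the measure, not the authors' code: the function names are hypothetical, times are assumed to be positive, and the slides do not say how ties between equally long common subsequences are broken.

```python
from typing import List, Optional, Tuple

Session = List[Tuple[str, float]]  # (page ID, time spent) tuples


def lcs_alignment(s1: Session, s2: Session) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of one longest common subsequence of page IDs."""
    n, m = len(s1), len(s2)
    # dp[i][j] = LCS length of the suffixes s1[i:] and s2[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if s1[i][0] == s2[j][0]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if s1[i][0] == s2[j][0]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] > dp[i][j + 1]:
            i += 1
        else:  # on ties, advance in the second session (an arbitrary choice)
            j += 1
    return pairs


def session_similarity(
    s1: Session, s2: Session, pairs: Optional[List[Tuple[int, int]]] = None
) -> float:
    """(average min/max time ratio over the LCS) x (mean fraction of session time in the LCS)."""
    if pairs is None:
        pairs = lcs_alignment(s1, s2)
    if not pairs:
        return 0.0
    t1 = [s1[i][1] for i, _ in pairs]
    t2 = [s2[j][1] for _, j in pairs]
    similarity = sum(min(a, b) / max(a, b) for a, b in zip(t1, t2)) / len(pairs)
    importance = (
        sum(t1) / sum(t for _, t in s1) + sum(t2) / sum(t for _, t in s2)
    ) / 2
    return similarity * importance


session1 = [("a", 8), ("b", 100), ("d", 8), ("c", 5), ("e", 23), ("a", 5)]
session2 = [("b", 5), ("d", 12), ("f", 1), ("a", 7), ("c", 5)]
alignment = lcs_alignment(session1, session2)
overall = session_similarity(session1, session2, alignment)
```

With the tie-breaking rule chosen here, the alignment is the one shown on slide 20 (pages b, d, c) and the overall similarity rounds to 0.43. The subsequence b, d, a is equally long, and picking it would give a different value (about 0.37), so the tie-breaking rule matters.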

  22. Issues: Session Resolution
      • Generate coarse-resolution paths making use of the concept hierarchy of the website
      • Reduces computation; increases interpretability of results
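
A coarse-resolution path could be produced along the following lines. The Python sketch rests on two assumptions that the slides do not confirm: the URL directory structure is used as the concept hierarchy, and consecutive visits that fall into the same section are merged by summing their times. The function name, its `depth` parameter, and the example paths are hypothetical.

```python
from typing import List, Tuple

Session = List[Tuple[str, float]]  # (page path, time spent) tuples


def coarsen(session: Session, depth: int = 1) -> Session:
    """Truncate each path to its first `depth` components and merge consecutive repeats."""
    coarse: Session = []
    for path, time_spent in session:
        parts = [p for p in path.split("/") if p]
        section = "/" + "/".join(parts[:depth])
        if coarse and coarse[-1][0] == section:
            # Same section as the previous visit: accumulate the time.
            coarse[-1] = (section, coarse[-1][1] + time_spent)
        else:
            coarse.append((section, time_spent))
    return coarse


# Made-up page paths, for illustration only.
raw_session = [
    ("/articles/a.html", 10),
    ("/articles/b.html", 5),
    ("/authors/x.html", 3),
    ("/articles/c.html", 2),
]
coarse_session = coarsen(raw_session)
```

Shorter sequences make the LCS computation cheaper, and section-level paths such as (/authors, 3)(/articles, 129) are easier to read than lists of individual files.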

  23. Comments
      • Results QA-ed by human subjects
        • Results for clusters & outliers at both levels were subjectively good
        • No good way to find cluster quality analytically
      • Clustering algorithms for the two stages
        • Stage I: graph partitioning works well for large sparse graphs, so it is desirable in this stage
        • Stage II: since the space is not high-dimensional, any reasonable clustering algorithm should be adequate
      • Cluster space
        • Gives a general framework for mapping any non-vector clustering problem to an equivalent vector clustering problem
