

Characterizing Visitors to a Website Across Multiple Sessions

NGDM Workshop, Nov 2002

Arindam Banerjee

Joydeep Ghosh

Banerjee and Ghosh

Motivation

Why Characterize or Predict web user behavior?

  • Site-centric view: Personalization, sticky websites
  • User-centric view: personal agents for information acquisition
  • Universalist approaches: PageRank, web metrics, …

Clustering Users from Web Logs
  • Wide variety of web behavior → segment users by surfing behavior as a first step toward further analysis.
      • User: set of sessions
      • Session: sequence of (page ID, time spent on that page) tuples
    • How to cluster sets of sequences?
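
The user/session structure above can be written down directly. This is a minimal sketch; the names `Visit`, `Session` and `User` are illustrative assumptions, not taken from the slides, and the sample values are two sessions from the results shown later in the deck.

```python
from typing import NamedTuple

class Visit(NamedTuple):
    page: str        # page identifier
    time_spent: int  # time spent on that page (the slides do not state the unit)

Session = list[Visit]   # a session is an ordered sequence of visits
User = list[Session]    # a user is a collection of sessions

user: User = [
    [Visit("/authors", 3), Visit("/articles", 129)],
    [Visit("/authors", 8), Visit("/articles", 8)],
]

# Total time per session, used later when weighting the common subsequence.
total_time = [sum(v.time_spent for v in session) for session in user]
```

The clustering problem is then to group objects of type `User`, which are sets of sequences rather than fixed-length vectors.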

The Approach
  • Cluster Sessions
    • Session Similarity Measure
    • Session Similarity Graph
      • Outlier Detection
    • Graph Partitioning
  • Create a Cluster Space
  • Cluster users in this Space

A Similarity Measure for Sessions
  • Overlap between two sessions represented by the longest common subsequence (LCS)
  • Obtain session similarity using LCS and time information:
    • session similarity = (time similarity in LCS) x (importance of LCS)
  • The time similarity component :
    • Average, over the pages in the LCS, of the min/max ratio of the two times spent on that page
  • The importance component :
    • Average of the fraction of overall session time spent in the LCS
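
One way to write the measure compactly is given below. The notation is an assumption introduced here for illustration and does not appear on the slides: $L$ is the set of matched LCS positions, $t_i(p)$ is the time session $i$ spent on matched page $p$, $T_i^{L}$ is the time session $i$ spent on LCS pages, and $T_i$ is its total session time.

```latex
\[
\operatorname{sim}(s_1, s_2)
  = \underbrace{\frac{1}{|L|} \sum_{p \in L}
      \frac{\min\bigl(t_1(p),\, t_2(p)\bigr)}{\max\bigl(t_1(p),\, t_2(p)\bigr)}}_{\text{time similarity in LCS}}
    \;\times\;
    \underbrace{\frac{1}{2}\left(\frac{T_1^{L}}{T_1} + \frac{T_2^{L}}{T_2}\right)}_{\text{importance of LCS}}
\]
```

Both factors lie in $[0, 1]$ when all times are positive, so the product does as well.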

Session Clustering
  • Find the pairwise similarity values between all pairs of sessions; record only similarities > q
  • Incrementally construct similarity graph Gq
    • the vertices are the sessions, the edge weights are the session similarity values
    • no isolated vertices (discard “outliers”)
  • Balanced Graph Partitioning
    • we used Metis [Karypis, Kumar]
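
The graph construction and outlier step can be sketched as follows. This is a batch version under stated assumptions, not the authors' incremental procedure: `build_similarity_graph` is a hypothetical helper, and the toy similarity (plain Jaccard on page sets) merely stands in for the session similarity measure. The partitioning step with Metis is not shown.

```python
from itertools import combinations

def build_similarity_graph(items, similarity, threshold):
    """Return (edges, outliers) for the thresholded similarity graph.

    edges maps an index pair (i, j), i < j, to its similarity weight;
    outliers lists the indices left with no edge (isolated vertices).
    """
    edges = {}
    for i, j in combinations(range(len(items)), 2):
        weight = similarity(items[i], items[j])
        if weight > threshold:  # record only similarities above the threshold
            edges[(i, j)] = weight
    connected = {vertex for edge in edges for vertex in edge}
    outliers = [i for i in range(len(items)) if i not in connected]
    return edges, outliers

# Toy stand-in data: each "session" is reduced to its set of pages.
toy_sessions = [{"a", "b"}, {"a", "b", "c"}, {"x"}]

def jaccard(s, t):
    return len(s & t) / len(s | t)

edges, outliers = build_similarity_graph(toy_sessions, jaccard, threshold=0.5)
```

Here the third toy session shares no page with the others, so it ends up isolated and is reported as an outlier. The loop visits every pair, so the cost grows quadratically with the number of sessions.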

The Cluster Space
  • Given: each session assigned to one of k clusters (sets)
  • Sessions of a user are distributed among the k sets
    • vector u = [u1, u2, …, uk]^T, where ui = number of sessions of the user belonging to cluster i
  • Stage II : User Clustering
    • find pairwise similarity values using the extended Jaccard measure
    • partition similarity graph
  • Gives l user clusters and a set of outlier users
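
A minimal sketch of the cluster space and the user similarity follows. It assumes the extended Jaccard measure is the usual form x·y / (|x|² + |y|² − x·y); the function names and the choice to return 0 for two all-zero vectors are assumptions made here.

```python
from collections import Counter

def user_vector(session_labels, k):
    """Count how many of a user's sessions fall into each of the k session clusters."""
    counts = Counter(session_labels)
    return [counts.get(cluster, 0) for cluster in range(k)]

def extended_jaccard(x, y):
    """Extended Jaccard similarity: x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom else 0.0  # assumption: two zero vectors score 0

# Two hypothetical users, each with three sessions labelled by session cluster.
u = user_vector([0, 0, 2], k=3)
v = user_vector([0, 2, 2], k=3)
similarity = extended_jaccard(u, v)
```

The resulting pairwise values can feed the same thresholded-graph and partitioning steps used for sessions.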

The Dataset : Sulekha.com

Dataset details
  • Logs over a one month period
  • Raw log size 184 MB
  • 453,953 files accessed
  • 37,753 sessions in all
  • 23,310 sessions after some preprocessing/filtering
  • 2,493 users

Results : Session Clusters

Results : User Clusters
  • user : [(128.194.xxx.xxx)]
    • (/authors,3)(/articles,129)
    • (/authors,8)(/articles,8)
    • (/authors,80)(/articles,2141)
  • user : [(209.30.xxx.xxx)]
    • (/home,77)(/articles,111)(/authors,93)(/articles,629)(/misc,58) (/coffeehouse,75)(/wo-men,967)
    • (/articles,2627)
  • user : [(171.68.xxx.xxx)]
    • (/home,323)(/articles,24)(/authors,45)(/articles,1290)

A user cluster :

people who read the articles

Results : User Clusters
  • user : [(152.170.xxx.xxx)]
    • (/home,21)(/wo-men,1075)(/philosophy,52)
  • user : [(209.244.xxx.xxx)]
    • (/home,5)(/coffeehouse,94)(/wo-men,75)(/movies,75)(/wo-men,31)
    • (/home,52)(/philosophy,67)(/wo-men,955)(/philosophy, 26)(/coffeehouse,382)(/biztech,298)(/philosophy,290)
    • (/home,17)(/coffeehouse,12)(/wo-men,15)(/personal,6) (/biztech,94)(/coffeehouse,2)(/philosophy,1093)

A user cluster :

people interested in wo-men, philosophy, coffeehouse

Results : User Clusters
  • user : [(216.154.xxx.xxx)]
    • (/coffeehouse,12)(/biztech,25)(/books,48)
    • (/coffeehouse,13)(/biztech,26)(/books,19)
  • user : [(204.220.xxx.xxx)]
    • (/coffeehouse,162)
    • (/coffeehouse,40)
  • user : [(32.100.xxx.xxx)]
    • (/coffeehouse,12)(/contests,12)
    • (/coffeehouse,43)(/contests,44)

A user cluster :

people interested in coffeehouse – bookmarked it !

Conclusions
  • Segmentation: a basic pre-processing step for Web Mining
  • Similarity measure + Cluster Space Concept: applicable to clustering of sets of any data-structure
  • For certain websites, time spent on the pages matters
    • not handled by current commercial tools
  • Outlier detection before clustering is important
  • Results QA-ed by human subjects
    • Results for clusters & outliers at both levels were subjectively good
  • No good way to find cluster quality analytically
  • Formation of similarity graph is a slow process

Future Work
  • Improve the present method by:
    • using cluster seeds for cluster growing
    • using alternative clustering algorithms for each stage
    • studying the effect of thresholds, number of clusters on performance
    • studying the importance of order of page-visits
    • studying the importance of balanced clustering


Backup

Issues : Choice of Parameters
  • Number of session clusters, k, should be chosen appropriately
  • Thresholds for forming session & user similarity graphs :
    • threshold value should be chosen after looking at the distribution of edge weights
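
One possible way to look at the distribution of edge weights before fixing a threshold is to compute its deciles. The slides do not say how the threshold was actually chosen; the weights below are made-up values, and picking the median as a candidate is arbitrary.

```python
import statistics

def weight_deciles(weights):
    """Return the nine decile cut points of the pairwise similarity values."""
    return statistics.quantiles(weights, n=10)

# Made-up pairwise similarity values, for illustration only.
weights = [0.02, 0.05, 0.05, 0.1, 0.12, 0.3, 0.43, 0.5, 0.75, 0.9]

deciles = weight_deciles(weights)
candidate_threshold = deciles[4]  # fifth cut point (the median), chosen arbitrarily
kept = [w for w in weights if w > candidate_threshold]
```

A higher cut point gives a sparser graph and more isolated vertices, so the choice trades off graph size against the number of items discarded as outliers.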

Related Work
  • Research in Web Mining :
    • Extraction of navigational patterns : Spiliopoulou, Faulstich
    • Ordering relationships : Mannila, Meek
    • Surfing prediction : Pitkow, Pirolli
    • Clustering web usage sessions : Fu, Sandhu, Shih

Example
  • Sessions :
    • Session1 = [(a,8) (b,100) (d,8) (c,5) (e,23) (a,5)]
    • Session2 = [(b,5) (d,12) (f,1) (a,7) (c,5)]
  • LCS pages = [(b)(d)(c)]
  • Corresponding Index, Times Sequences :
    • Index1 = [(1)(2)(3)], Time1 = [(100) (8) (5)]
    • Index2 = [(0)(1)(4)], Time2 = [ (5) (12) (5)]
  • Similarity over each LCS page : min/max of the two times
    • Similarity on page b = 5/100 = 0.05
    • Similarity on page d = 8/12 = 0.67
    • Similarity on page c = 5/5 = 1.00

Example (contd.)
  • The similarity component = (0.05 + 0.67 + 1.00)/3 = 0.57
  • The importance component :
    • Fraction of time spent in the LCS by Session1 = 113/149 = 0.76
    • Fraction of time spent in the LCS by Session2 = 22/30 = 0.73
    • The mean = (0.76+0.73)/2 = 0.75
  • The overall similarity = 0.57 x 0.75 = 0.43
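
The worked example can be reproduced with a short sketch of the measure. This is an illustration under stated assumptions, not the authors' implementation. The LCS of these two sessions is not unique ((b)(d)(a) also has length 3), and the slides do not specify a tie-breaking rule; the traceback below happens to select (b)(d)(c), as in the example. All times are assumed to be positive.

```python
def lcs_pairs(pages1, pages2):
    """Index pairs (i, j) of one longest common subsequence of two page lists."""
    m, n = len(pages1), len(pages2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pages1[i - 1] == pages2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Trace back from the bottom-right corner; on ties, step back in pages1 first.
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if pages1[i - 1] == pages2[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def session_similarity(s1, s2):
    """(time similarity in LCS) x (importance of LCS) for two sessions."""
    pairs = lcs_pairs([page for page, _ in s1], [page for page, _ in s2])
    if not pairs:
        return 0.0
    t1 = [s1[i][1] for i, _ in pairs]
    t2 = [s2[j][1] for _, j in pairs]
    # Average min/max ratio of the times on each matched page.
    time_sim = sum(min(a, b) / max(a, b) for a, b in zip(t1, t2)) / len(pairs)
    # Mean fraction of each session's total time that falls inside the LCS.
    frac1 = sum(t1) / sum(t for _, t in s1)
    frac2 = sum(t2) / sum(t for _, t in s2)
    return time_sim * (frac1 + frac2) / 2

session1 = [("a", 8), ("b", 100), ("d", 8), ("c", 5), ("e", 23), ("a", 5)]
session2 = [("b", 5), ("d", 12), ("f", 1), ("a", 7), ("c", 5)]

pairs = lcs_pairs([p for p, _ in session1], [p for p, _ in session2])
sim = session_similarity(session1, session2)
print(pairs, round(sim, 2))  # [(1, 0), (2, 1), (3, 4)] 0.43
```

The index pairs match Index1 = [1, 2, 3] and Index2 = [0, 1, 4] from the example. A different tie-breaking rule could select (b)(d)(a) instead and give a different similarity value.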

Issues : Session Resolution
  • Generate coarse resolution paths making use of the concept hierarchy of the website
  • Reduces computations; Increases interpretability of results
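
A minimal sketch of one way to coarsen a session follows. It assumes the concept hierarchy is simply the URL directory structure and that consecutive visits to the same category are merged by adding their times; neither detail is stated on the slides. The raw paths are invented for illustration, and `coarsen` is a hypothetical helper.

```python
def coarsen(session, depth=1):
    """Map each page to its first `depth` path segments and merge consecutive repeats."""
    coarse = []
    for path, time_spent in session:
        segments = [s for s in path.split("/") if s]
        category = "/" + "/".join(segments[:depth]) if segments else "/"
        if coarse and coarse[-1][0] == category:
            # Same category as the previous visit: accumulate the time.
            coarse[-1] = (category, coarse[-1][1] + time_spent)
        else:
            coarse.append((category, time_spent))
    return coarse

# Invented fine-grained session; the paths are hypothetical.
raw = [
    ("/authors/index.html", 3),
    ("/articles/a1.html", 100),
    ("/articles/a2.html", 29),
]
coarse = coarsen(raw)
print(coarse)  # [('/authors', 3), ('/articles', 129)]
```

Shorter sequences mean smaller LCS tables per pair of sessions, and category-level paths are easier to read in the cluster summaries.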

Comments
  • Results QA-ed by human subjects
    • Results for clusters & outliers at both levels were subjectively good
    • No good way to find cluster quality analytically
  • Clustering algorithms for the two stages
    • Stage I : Graph partitioning works well for large sparse graphs, so it is desirable in this stage
    • Stage II : Since the space is not high-dimensional, any reasonable clustering algorithm should be adequate
  • Cluster space
    • Gives a general framework for mapping any non-vector clustering problem to an equivalent vector clustering problem
