
Discovery of Significant Usage Patterns from Clusters of Clickstream Data

Lin Lu, Margaret Dunham, and Yu Meng. Department of Computer Science and Engineering, Southern Methodist University, Dallas, Texas 75275-0122. llu(mhd,ymeng)@engr.smu.edu. WebKDD’05.


Presentation Transcript


  1. Discovery of Significant Usage Patterns from Clusters of Clickstream Data. Lin Lu, Margaret Dunham, and Yu Meng, Department of Computer Science and Engineering, Southern Methodist University, Dallas, Texas 75275-0122, llu(mhd,ymeng)@engr.smu.edu. WebKDD’05

  2. Introduction
     • Significant Usage Patterns (SUP)
       - SUPs are extracted from clusters of abstracted user sessions
       - A unique two-phase abstraction technique is used
       - SUPs can have desired beginning and/or ending Web pages
       - SUPs carry a normalized probability

  3. Model
     • Sessionized Web Log + Abstraction Hierarchy -> Sub-abstract URLs -> Sub-Abstracted Sessions
     • Sub-Abstracted Sessions -> Apply Needleman-Wunsch global alignment algorithm -> Similarity Matrix
     • Similarity Matrix -> Apply Nearest neighbor clustering algorithm -> Clusters of User Sessions
     • Clusters of User Sessions + Abstraction Hierarchy -> Concept-based Abstracted URLs -> Concept-based Abstracted Sessions per Cluster
     • Concept-based Abstracted Sessions per Cluster -> Build Markov model for each cluster -> Transition Matrix per Cluster
     • Transition Matrix per Cluster -> Pattern Discovery -> SUPs per Cluster

  4. Alignment of Web sessions
     • Create sub-abstracted Web sessions
       URL -> {<Concept hierarchy keyword> <Unique ID> <|>}
       Example: D0|C875|I D0|C875|I P27593 P27592 P28 -507169015
     Fig 1. Hierarchy of J.C. Penney Web site: JCPenney Homepage; Department level (D1, D2, …, Dn); Category level (C1, …, Cn); Item level (I1, …, In)

  5. Alignment of Web sessions
     • Computing the similarity between any two Web pages
     • The higher a level is in the hierarchy, the more it matters in determining the similarity of two Web pages, so it is given more weight.
     • Scoring scheme
       - step 1: determine the longer of the two Web page representation strings.
       - step 2: assign a weight to each level in the hierarchy: in the longer page representation string, the abstract level of the lowest level is given weight 2, that of the second-lowest level is given weight 4, and so on. The corresponding ID is always given weight 1.

  6. Alignment of Web sessions
     • Computing the similarity between any two Web pages
       - step 1: compare the two Web page representation strings from left to right and stop at the first pair of elements that differ.
       - step 2: compute the ratio of the sum of the weights of the matching parts to the total weight of the longer page representation string.
       Example:
       Page 1: D0|C875|I   Weight = 6+1+4+1+2 = 14
       Page 2: D0|C875     Weight = 6+1+4+1 = 12
       Similarity = 12/14 = 0.857
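The weighting and prefix-comparison steps on slides 5 and 6 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that each level of a page representation is a single keyword letter followed by an optional ID, that "longer" means more levels (ties broken by element count), and the function names are hypothetical.

```python
def tokenize(page):
    """Split 'D0|C875|I' into keyword and ID elements, left to right."""
    tokens = []
    for segment in page.split("|"):
        tokens.append(("kw", segment[:1]))       # concept hierarchy keyword
        if segment[1:]:
            tokens.append(("id", segment[1:]))   # unique ID, if present
    return tokens


def token_weights(page):
    """Weights per element: lowest level keyword 2, next level up 4, ...; IDs 1."""
    levels = page.split("|")
    weights = []
    for depth, segment in enumerate(levels):
        weights.append(2 * (len(levels) - depth))
        if segment[1:]:
            weights.append(1)
    return weights


def page_similarity(a, b):
    """Ratio of matched-prefix weight to the total weight of the longer page."""
    # Assumption: "longer" = more levels, ties broken by number of elements.
    longer = max((a, b), key=lambda p: (p.count("|"), len(tokenize(p))))
    weights = token_weights(longer)
    matched = 0
    for x, y, w in zip(tokenize(a), tokenize(b), weights):
        if x != y:          # stop at the first differing pair
            break
        matched += w
    return matched / sum(weights)


print(round(page_similarity("D0|C875|I", "D0|C875"), 3))  # 0.857, as on slide 6
```

Pages that differ in their first element, such as a P page and a D page, get similarity 0 under this sketch.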

  7. Model (pipeline diagram from slide 3 repeated; current step: Apply Needleman-Wunsch global alignment algorithm)

  8. Alignment of Web sessions
     • Computing the optimal alignment of two sequences using the Needleman-Wunsch algorithm
       A(i, j) = max[A(i-1, j-1) + s(Xi, Yj); A(i-1, j) - d; A(i, j-1) - d]
       where s(Xi, Yj) is the similarity between Xi and Yj, and d is the penalty for aligning Xi (or Yj) with a gap
     [Figure: cell A(i, j) is computed from its neighbors A(i-1, j-1), A(i-1, j), and A(i, j-1)]

  9. Alignment of Web sessions
     • Apply Needleman-Wunsch global alignment algorithm
     • Scoring scheme [3]
       if (matching) score = 20; // a pair of Web pages with similarity 1
       else if (mis-matching) score = –10; // a pair of Web pages with similarity 0
       else if (gap) score = –10; // a Web page aligns with a gap
       else score = –10 ~ 20; // a pair of Web pages with similarity between 0 and 1
     • Example:
       P47104 D0|C0|I D469|C469 D2652|C2652
       D469|C16758|I D0|C0|I D469|C469
       Thus, session similarity = 32.1/4 = 8.025
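The recurrence on slide 8 and the scoring scheme on slide 9 can be combined into a short Python sketch. Two things here are assumptions: the slides only say that a partial match scores between –10 and 20, so the linear mapping in `pair_score` is one plausible choice, and `exact_match` is a toy stand-in for the page similarity measure. The final normalization of the session score (the division by 4 in the slide's example) is left out because the slides do not state the rule.

```python
def pair_score(similarity):
    """Map a page similarity in [0, 1] to a score in [-10, 20].

    Assumption: linear interpolation between the mismatch score (-10)
    and the match score (20); the slides give only the two endpoints.
    """
    return -10.0 + 30.0 * similarity


def needleman_wunsch(xs, ys, similarity, gap_penalty=10.0):
    """Optimal global alignment score of two sessions (lists of pages)."""
    n, m = len(xs), len(ys)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against nothing costs one gap per page.
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - gap_penalty
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(similarity(xs[i - 1], ys[j - 1])),
                score[i - 1][j] - gap_penalty,   # xs[i-1] aligned with a gap
                score[i][j - 1] - gap_penalty,   # ys[j-1] aligned with a gap
            )
    return score[n][m]


def exact_match(p, q):
    """Toy similarity for illustration: 1 for identical pages, else 0."""
    return 1.0 if p == q else 0.0


# One mismatch (-10), two matches (20 + 20), one gap (-10).
print(needleman_wunsch(list("abcd"), list("xbc"), exact_match))  # 20.0
```

In practice the `similarity` argument would be the hierarchy-weighted page similarity from slides 5 and 6 rather than exact matching.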

  10. Model (pipeline diagram from slide 3 repeated; current step: Apply Nearest neighbor clustering algorithm)
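The slides name nearest neighbor clustering without giving details, so the following Python sketch shows only the textbook form of the algorithm applied to a session similarity matrix. The threshold value, the processing order, and the function name are assumptions for illustration; the authors' exact variant and parameters are not stated here.

```python
def nearest_neighbor_clusters(sim, threshold):
    """Assign each session to the cluster of its most similar earlier session.

    sim is a symmetric n x n similarity matrix. A session starts a new
    cluster when its best similarity to any earlier session is below
    the (hypothetical) threshold.
    """
    labels = []
    next_label = 0
    for i in range(len(sim)):
        if i > 0:
            nearest = max(range(i), key=lambda j: sim[i][j])
            if sim[i][nearest] >= threshold:
                labels.append(labels[nearest])
                continue
        labels.append(next_label)
        next_label += 1
    return labels


# Toy matrix: sessions 0 and 1 are similar, as are sessions 2 and 3.
toy = [
    [0, 9, 1, 2],
    [9, 0, 1, 1],
    [1, 1, 0, 8],
    [2, 1, 8, 0],
]
print(nearest_neighbor_clusters(toy, threshold=5))  # [0, 0, 1, 1]
```

The result depends on the order in which sessions are processed, which is a known property of this single-pass scheme.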

  11. Model (pipeline diagram from slide 3 repeated)

  12. Create Concept-based Abstracted Sessions
     • Represent the abstracted page accesses in a session as a sequence like: P1 D1 C1 I1 P2 D2 C2 I2 …
     • Within a session, the same Pi, Di, Ci, or Ii (i = 1, 2, …) always represents the same page. Across sessions, however, the same page may be represented by different elements.
     Example:
     Original session: D7107|C7121 D7107|C7126|I076bdf3 D7107|C7131|I084fc96 D7107|C7131 P55730 P96 P27 P14 P27592 P28 P33711 -505884861
     Abstracted session: C1 I1 I2 C2 P1 P2 P3 P4 P5 P6 P7 -505884861
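The example on slide 12 is consistent with a simple relabeling rule, sketched below in Python: take the keyword of a page's lowest level as its concept and number distinct pages per concept in order of first appearance within the session. The rule is inferred from the example, and the function name is hypothetical. The trailing number in the slide's example is treated as a session identifier and left out of the input.

```python
def abstract_session(pages):
    """Relabel pages as concept + per-session index, e.g. 'D7107|C7121' -> 'C1'."""
    labels = {}     # page -> abstract label, valid within this session only
    counters = {}   # concept letter -> number of distinct pages seen so far
    abstracted = []
    for page in pages:
        if page not in labels:
            concept = page.split("|")[-1][:1]   # keyword of the lowest level
            counters[concept] = counters.get(concept, 0) + 1
            labels[page] = f"{concept}{counters[concept]}"
        abstracted.append(labels[page])
    return abstracted


session = (
    "D7107|C7121 D7107|C7126|I076bdf3 D7107|C7131|I084fc96 D7107|C7131 "
    "P55730 P96 P27 P14 P27592 P28 P33711"
).split()
print(" ".join(abstract_session(session)))
# C1 I1 I2 C2 P1 P2 P3 P4 P5 P6 P7
```

Because the label table is rebuilt for every session, the same page can receive different labels in different sessions, which matches the second bullet of the slide.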

  13. Generating Significant Usage Patterns
     • Use a Markov model to represent the sessions in each cluster
       Example: (1) 1, 2, 3, 5, 4 (2) 2, 4, 3, 5 (3) 3, 2, 4, 5 (4) 1, 3, 4, 3 (5) 4, 2, 3, 4, 5
       [Figure: Markov model with start state S, end state E, and states 1 to 5; edges are labeled with transition probabilities 0.75, 0.5, 0.4, 0.33, 0.25, 0.2, and 0.17]
     • The probability of a path is normalized [formula not preserved in this transcript], where Pti is the transition probability between two adjacent states
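The transition probabilities in the slide's figure can be reproduced from the five example sessions by counting transitions, with a start state S and an end state E added to every session. The Python sketch below does that. The normalization formula did not survive in the transcript, so `normalized_path_probability` uses a geometric mean of the transition probabilities purely as an assumption; both function names are hypothetical.

```python
from collections import defaultdict


def transition_probabilities(sessions):
    """Estimate first-order Markov transition probabilities from sessions."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        path = ["S", *session, "E"]          # add start and end states
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {
        a: {b: c / sum(row.values()) for b, c in row.items()}
        for a, row in counts.items()
    }


def normalized_path_probability(path, probs):
    """ASSUMPTION: geometric mean of the transition probabilities Pti.

    The slide's own formula is missing from the transcript; this choice
    only illustrates length normalization of a path probability.
    """
    product = 1.0
    for a, b in zip(path, path[1:]):
        product *= probs.get(a, {}).get(b, 0.0)
    return product ** (1.0 / (len(path) - 1))


sessions = [
    [1, 2, 3, 5, 4],
    [2, 4, 3, 5],
    [3, 2, 4, 5],
    [1, 3, 4, 3],
    [4, 2, 3, 4, 5],
]
probs = transition_probabilities(sessions)
print(probs["S"][1], probs[5]["E"], round(probs[3][5], 2))  # 0.4 0.75 0.33
```

The printed values match edge labels in the slide's figure: two of five sessions start at state 1, three of the four transitions out of state 5 go to E, and two of the six transitions out of state 3 go to state 5.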

  14. Generating Significant Usage Patterns
     • Significant Usage Patterns
       Example: [not preserved in this transcript]

  15. Experimental Result
     • On average, purchase sessions are longer than sessions without a purchase
       - review the information, compare price, quality, etc.
       - fill out the billing and shipping information to commit the purchase

  16. Experimental Result
     SUPs in non-purchase cluster
     • Interested in gathering information on products in different categories.
       S-C1-C1-C2-C3-C4-C5-C5-I1-E
       S-C1-C1-I1-C1-C2-C3-C4-C5-E
       S-I1-C1-C2-C3-C4-C5-C6-C7-E
     • Interested in reviewing general pages (to gather general information).
     • Not serious visitors (the average session length is 3).

  17. Experimental Result
     • The average length of SUPs is longer in the purchase cluster than in the non-purchase cluster
       - review the information, compare among products, and fill out the payment and shipping information
     • SUPs in the purchase cluster have higher probability than those in the non-purchase cluster
       - have purchase in mind vs. random browsing behavior

  18. Conclusion and Future Work
     • Summary
       - Applying clustering to abstracted user sessions makes it more likely to find groups of users with similar motivations for visiting a specific website.
       - Letting the user specify the beginning and/or ending Web page(s) gives users more control over generating the patterns that interest them.
     • Future
       - Scalability
       - Clustering to identify different user groups
       - Online identification of a user with a predefined cluster

  19. References
     [1] J. Borges and M. Levene, “Data Mining of User Navigation Patterns”, In Proc. of the Workshop on Web Usage Analysis and User Profiling (WEBKDD'99), pp. 31-36, San Diego, August 15, 1999.
     [2] J. Borges and M. Levene, “An average linear time algorithm for web data mining”, International Journal of Information Technology and Decision Making, 3 (2004), pp. 307-320.
     [3] W. Wang and O. R. Zaïane, “Clustering Web Sessions by Sequence Alignment”, Third International Workshop on Management of Information on the Web, in conjunction with the 13th International Conference on Database and Expert Systems Applications (DEXA'2002), pp. 394-398, Aix en Provence, France, September 2-6, 2002.

  20. Thank you. Questions?
