
Murat Ali Bayır, Middle East Technical University, Department of Computer Engineering

A New Reactive Method for Processing Web Usage Data. Murat Ali Bayır, Middle East Technical University, Department of Computer Engineering, Ankara, Turkey. Outline: Web Mining; Previous Session Reconstruction Heuristics; Smart-SRA; Agent Simulator; Experimental Results; Conclusion.


Presentation Transcript


  1. A New Reactive Method for Processing Web Usage Data Murat Ali Bayır Middle East Technical University Department of Computer Engineering Ankara, Turkey

  2. OUTLINE • Web Mining • Previous Session Reconstruction Heuristics • Smart-SRA • Agent Simulator • Experimental Results • Conclusion

  3. Data & Web Mining • Data mining: discovery of useful and interesting patterns from a large dataset. • Web mining: the application of data mining techniques to discover and retrieve useful information and patterns from World Wide Web documents and services. • Dimensions: • Web content mining • Web structure mining • Web usage mining

  4. Web Mining: Web Usage Mining (WUM) • Application of data mining techniques to web log data in order to discover user access patterns. • [Figure: example user web access log] • The web access log captures the information necessary for WUM.

  5. Web Mining: Phases of Web Usage Mining • Data Processing: reconstruction of user sessions using heuristic techniques. This is the most important phase, since it directly affects the quality of the frequent patterns extracted in the final step. • Pattern Discovery: discovery of useful patterns from the reconstructed sessions obtained in the first phase. We have related work on the Pattern Discovery phase [Bayir 06-1].

  6. OUTLINE • Web Mining • Previous Session Reconstruction Heuristics • Smart-SRA • Agent Simulator • Experimental Results • Conclusion

  7. Previous Session Reconstruction Heuristics: Session Reconstruction • Session reconstruction selects and groups the requests belonging to the same user by using heuristic techniques. • Types: • Reactive strategies process requests after they have been handled by the web server; they process web server logs to obtain sessions. The approach proposed in this thesis is reactive. • Proactive strategies process requests during the user's interactive browsing of the web site. Session data is gathered while the user interacts with the site, and the strategy is applied on dynamic server pages.

  8. Previous Reactive Heuristics: Session Reconstruction • Proactive strategies require changes to the internal structure of the web site, for example changes to the source code of each dynamic web page. • Reactive strategies require no such change. They are used for web analytics purposes: customers provide the web logs of their site, and the logs are analyzed with these methods. • Reactive methods are applicable to all web sites that use the same log format.

  9. Previous Reactive Heuristics • Two types of reactive heuristics have been defined before: • Time-oriented heuristics [Spiliopoulou 98, Cooley 99-1] • Navigation-oriented heuristic [Cooley 99-1, Cooley 99-2] • Smart-SRA [Bayir 06-2] is the new approach proposed in this thesis. It combines these heuristics with web topology information in order to increase the accuracy of the reconstructed sessions.

  10. Previous Reactive Heuristics • The topology of a web site can be represented by a directed web graph. • The topology information can be extracted by using the crawling module of search engine APIs. • [Figure: example web topology graph used for applying the heuristics] • [Table: example web page request sequence]

  11. Time-oriented Heuristics - 1 (Previous Session Reconstruction Heuristics) • Two types of time-oriented heuristics are defined. • In the first, the total duration of a discovered session is limited by a threshold δ1. • Example, with time threshold δ1 = 30 mins: • [P1, P20, P13, P49] (t(P49) - t(P1) = 29 < 30) • [P34, P23] (t(P23) - t(P34) = 15 < 30)
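The first time-oriented heuristic can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name is hypothetical, and because the original request table was not preserved, the timestamps (in minutes) are assumed values chosen only to agree with the differences quoted on the slide (29 and 15).

```python
def split_by_total_duration(requests, max_duration=30):
    """Split (page, minute) requests so each session lasts less than max_duration."""
    sessions = []
    current = []  # (page, minute) pairs of the session being built
    for page, minute in requests:
        # Close the session once this request would reach the duration limit.
        if current and minute - current[0][1] >= max_duration:
            sessions.append([p for p, _ in current])
            current = []
        current.append((page, minute))
    if current:
        sessions.append([p for p, _ in current])
    return sessions


# Assumed timestamps, consistent with the slide's stated differences.
requests = [("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47)]
sessions_h1 = split_by_total_duration(requests, max_duration=30)
print(sessions_h1)  # [['P1', 'P20', 'P13', 'P49'], ['P34', 'P23']]
```

The limit is measured from the first request of the current session, which is why P34 (32 minutes after P1) opens a new session.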

  12. Time-oriented Heuristics - 2 (Previous Session Reconstruction Heuristics) • The time spent on any page is limited by a threshold δ2, that is, t(Pn+1) - t(Pn) < δ2. • Example, with time threshold δ2 = 10 mins: • [P1, P20, P13] • [P49, P34] • [P23]
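A minimal sketch of the page-stay heuristic, under stated assumptions: the function name is hypothetical, and the timestamps (in minutes) are assumed values, since the slide's request table was lost. They were picked so that the 10-minute threshold yields the three sessions listed above.

```python
def split_by_page_stay(requests, max_stay=10):
    """Split (page, minute) requests wherever the gap between two
    consecutive requests is not below max_stay."""
    sessions = []
    current = []
    previous_minute = None
    for page, minute in requests:
        # A gap of max_stay or more ends the current session.
        if current and minute - previous_minute >= max_stay:
            sessions.append(current)
            current = []
        current.append(page)
        previous_minute = minute
    if current:
        sessions.append(current)
    return sessions


# Assumed timestamps; only the resulting sessions come from the slide.
requests = [("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47)]
sessions_h2 = split_by_page_stay(requests, max_stay=10)
print(sessions_h2)  # [['P1', 'P20', 'P13'], ['P49', 'P34'], ['P23']]
```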

  13. Navigation-Oriented Heuristic (Previous Session Reconstruction Heuristics) • In the navigation-oriented heuristic, when the user request sequence is processed, there are two cases for adding a new page WPN+1 to a session [WP1, WP2, …, WPN]: • If WPN has a hyperlink to WPN+1, the page is appended: [WP1, WP2, …, WPN, WPN+1] • If WPN does not have a hyperlink to WPN+1, let WPKmax be the nearest preceding page that has a hyperlink to WPN+1, and add backward browser moves up to it: [WP1, WP2, …, WPN, WPN-1, WPN-2, ..., WPKmax, WPN+1]
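The two cases can be sketched as below. This is an illustrative sketch only: the function name, the page names A to D and X, and the link table are hypothetical. The slide does not say what happens when no earlier page links to the new page; starting a new session in that case is an assumption of this sketch.

```python
def navigation_oriented(pages, links):
    """Build sessions from a request sequence, inserting backward browser moves.

    links maps a page to the set of pages it has hyperlinks to.
    """
    sessions = []
    current = []
    for page in pages:
        if not current:
            current = [page]
        elif page in links.get(current[-1], set()):
            # Case 1: the last page links to the new page.
            current.append(page)
        else:
            # Case 2: find the nearest earlier page that links to the new page.
            k = next((i for i in range(len(current) - 2, -1, -1)
                      if page in links.get(current[i], set())), None)
            if k is None:
                # Assumption: no referrer at all, so a new session starts.
                sessions.append(current)
                current = [page]
            else:
                # Backward moves WPN-1, ..., WPKmax, then the new page.
                back_moves = current[k:len(current) - 1][::-1]
                current = current + back_moves + [page]
    if current:
        sessions.append(current)
    return sessions


# Hypothetical topology: A links to B and D, B links to C.
links = {"A": {"B", "D"}, "B": {"C"}}
nav_sessions = navigation_oriented(["A", "B", "C", "D", "X"], links)
print(nav_sessions)  # [['A', 'B', 'C', 'B', 'A', 'D'], ['X']]
```

Here D is reached from A, so the moves back through B and A are inserted before D.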

  14. Navigation-Oriented Heuristic (Previous Session Reconstruction Heuristics) • [Figure: example user request sequence processed with the navigation-oriented heuristic]

  15. OUTLINE • Web Mining • Previous Session Reconstruction Heuristics • Smart-SRA • Agent Simulator • Experimental Results • Conclusion

  16. Smart-SRA • Phase 1: Shorter request sequences (candidate sessions) are constructed by using the overall session duration time and page-stay time criteria. • Phase 2: Candidate sessions are partitioned into maximal sub-sessions [P1, …, Pn] such that between each consecutive page pair in a session there is a hyperlink from the previous page to the next page. • Topology Rule: ∀i: 1 ≤ i < n, there is a hyperlink from Pi to Pi+1 • Time Rules: • ∀i: 1 ≤ i < n, Timestamp(Pi) < Timestamp(Pi+1) • ∀i: 1 ≤ i < n, Timestamp(Pi+1) - Timestamp(Pi) ≤ ρ (page stay time) • Timestamp(Pn) - Timestamp(P1) ≤ δ (session duration time)
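As a rough illustration, the topology and time rules can be checked for a single session with a small predicate. The function name, the link table, and the sample timestamps are hypothetical. The comparison operators in the slide were lost in extraction, so treating both limits as inclusive is an assumption of this sketch.

```python
def satisfies_smart_sra_rules(session, links, max_stay, max_duration):
    """Check one session, given as a list of (page, timestamp) pairs.

    links maps a page to the set of pages it has hyperlinks to.
    """
    if not session:
        return False
    for (page, t), (next_page, next_t) in zip(session, session[1:]):
        # Topology rule: a hyperlink from each page to its successor.
        if next_page not in links.get(page, set()):
            return False
        # Time rule: timestamps strictly increase.
        if not t < next_t:
            return False
        # Time rule: page stay time is bounded.
        if next_t - t > max_stay:
            return False
    # Time rule: total session duration is bounded.
    return session[-1][1] - session[0][1] <= max_duration


# Hypothetical topology and timestamps (minutes).
links = {"A": {"B"}, "B": {"C"}}
ok = satisfies_smart_sra_rules([("A", 0), ("B", 4), ("C", 9)], links, 10, 30)
no_link = satisfies_smart_sra_rules([("A", 0), ("C", 4)], links, 10, 30)
too_slow = satisfies_smart_sra_rules([("A", 0), ("B", 12)], links, 10, 30)
print(ok, no_link, too_slow)  # True False False
```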

  17. Smart-SRA • Phase 2 of Smart-SRA processes a candidate session from left to right by repeating the following steps until the candidate session is empty: • Determine the web pages without any referrer (on their left) and remove them from the candidate session • For each one of these pages • For each previously constructed session • If there is a hyperlink from the last page of the session to the web page and the page stay time constraint is satisfied, then append the web page to the session • Remove non-maximal sessions
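The steps above can be sketched as follows. This is a simplified reading of the slide, not the published algorithm: the function name, page names, link table, and timestamps are hypothetical, pages are assumed to occur once per candidate session, and a page that cannot extend any existing session is assumed to start a new one.

```python
def smart_sra_phase2(candidate, links, max_stay):
    """Partition a candidate session into maximal link-consistent sessions.

    candidate: list of (page, timestamp) pairs in time order.
    links: maps a page to the set of pages it has hyperlinks to.
    """
    remaining = list(candidate)
    sessions = []  # each session is a list of (page, timestamp) pairs
    while remaining:
        # Pages with no referrer among the remaining pages on their left.
        start_pages = [
            (page, t) for i, (page, t) in enumerate(remaining)
            if not any(page in links.get(prev, set()) for prev, _ in remaining[:i])
        ]
        remaining = [r for r in remaining if r not in start_pages]
        extended = set()
        new_sessions = []
        for page, t in start_pages:
            attached = False
            for idx, session in enumerate(sessions):
                last_page, last_t = session[-1]
                if page in links.get(last_page, set()) and 0 < t - last_t <= max_stay:
                    new_sessions.append(session + [(page, t)])
                    extended.add(idx)
                    attached = True
            if not attached:
                # Assumption: an unattached page opens a new session.
                new_sessions.append([(page, t)])
        # Extended sessions are no longer maximal, so they are dropped.
        sessions = [s for i, s in enumerate(sessions) if i not in extended] + new_sessions
    return [[page for page, _ in s] for s in sessions]


# Hypothetical candidate session and topology.
candidate = [("A", 0), ("B", 2), ("C", 4), ("D", 6), ("E", 8)]
links = {"A": {"B", "C"}, "B": {"D"}, "C": {"D", "E"}}
maximal_sessions = smart_sra_phase2(candidate, links, max_stay=10)
print(maximal_sessions)  # [['A', 'B', 'D'], ['A', 'C', 'D'], ['A', 'C', 'E']]
```

Note that no backward moves are inserted: the branching paths from A are returned as separate maximal sessions instead.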

  18. Smart-SRA • [Table: example candidate session] • [Figure: example web topology used for applying Smart-SRA]

  19. Smart-SRA

  20. OUTLINE • Web Mining • Previous Session Reconstruction Heuristics • Smart-SRA • Agent Simulator • Experimental Results • Conclusion

  21. Agent Simulator • Models the behavior of web users and generates web user navigation and the log data kept by the web server • Used to compare the performances of alternative session reconstruction heuristics

  22. Agent Simulator • Provides 4 basic behaviors of a web user: • A web user can start a session with any one of the possible entry pages of a web site. • A web user can select the next page from among the pages linked from the most recently accessed page. • A web user can press the back button one or more times and thus select as the next page a page linked from any one of the previously browsed pages (i.e., pages accessed before the most recently accessed one). • A web user can terminate his/her session.

  23. Agent Simulator: Behavior I • A web user can start a new session with any one of the possible entry pages of the web site. • [Figure: example web topology graph with pages P1, P13, P20, P23, P34, P49 and labels S1, S2]

  24. Agent Simulator: Behavior II • A web user can select a new page having a link from the most recently accessed page. • [Figure: example web topology graph with pages P1, P13, P20, P23, P34, P49]

  25. Agent Simulator: Behavior III • A web user can select as the next page a page having a link from any one of the previously browsed pages. • [Figure: example web topology graph with pages P1, P13, P20, P23, P34, P49]

  26. Agent Simulator: Behavior IV • A web user can terminate the session. In the example, the session is terminated at P23. • [Figure: example web topology graph with pages P1, P13, P20, P23, P34, P49]

  27. Agent Simulator • 3 parameters for simulating the behavior of a web user: • Session Termination Probability (STP) • Link from Previous pages Probability (LPP) • New Initial page Probability (NIP)
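A toy version of such a simulator, driven by the three parameters, might look like the sketch below. Everything here is an assumption made for illustration: the function name, the link table, drawing one random number per step and partitioning it by STP, NIP and LPP (so their sum must not exceed 1), logging only newly requested pages, and keeping the path up to the revisited page when a back move happens. The slides do not specify these details.

```python
import random


def simulate_agent(links, entry_pages, stp, lpp, nip, rng, max_steps=50):
    """Simulate one user; return (real_sessions, log) as lists of pages."""
    real_sessions = []
    current = [rng.choice(entry_pages)]  # Behavior I: start at an entry page
    log = [current[-1]]
    for _ in range(max_steps):
        roll = rng.random()
        if roll < stp:
            break  # Behavior IV: terminate the session
        if roll < stp + nip:
            # Jump to a new initial page, which starts a new real session.
            real_sessions.append(current)
            current = [rng.choice(entry_pages)]
            log.append(current[-1])
            continue
        if roll < stp + nip + lpp and len(current) > 1:
            # Behavior III: go back to an earlier page and follow one of its links.
            back_index = rng.randrange(len(current) - 1)
            options = sorted(links.get(current[back_index], ()))
            if options:
                real_sessions.append(current)
                current = current[:back_index + 1] + [rng.choice(options)]
                log.append(current[-1])
                continue
        # Behavior II: follow a link from the most recently accessed page.
        options = sorted(links.get(current[-1], ()))
        if not options:
            break  # dead end: nothing left to follow
        current.append(rng.choice(options))
        log.append(current[-1])
    real_sessions.append(current)
    return real_sessions, log


# Hypothetical topology and parameter values.
links = {"A": {"B", "C"}, "B": {"C", "D"}, "C": {"D"}, "D": {"A"}}
real_sessions, log = simulate_agent(
    links, entry_pages=["A", "B"], stp=0.05, lpp=0.3, nip=0.1, rng=random.Random(42)
)
print(real_sessions)
print(log)
```

Because the simulator knows the real sessions it generated, the log it emits can be fed to each heuristic and the reconstructed sessions compared against the ground truth.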

  28. OUTLINE • Web Mining • Previous Session Reconstruction Heuristics • Smart-SRA • Agent Simulator • Experimental Results • Conclusion

  29. Experimental Results: Heuristics Tested • Time-oriented heuristic (heur1) (total time ≤ 30 min) • Time-oriented heuristic (heur2) (page stay ≤ 10 min) • Navigation-oriented heuristic (heur3) • Smart-SRA heuristic (heur4)

  30. Experimental Results: How accuracy is determined • A reconstructed session H captures a real session R if R occurs as a contiguous subsequence of H (R ⊏ H), so a string-matching relation is needed. • Example: R = [P1, P3, P5] • H = [P9, P1, P3, P5, P8] => R ⊏ H? Yes • H = [P1, P9, P3, P5, P8] => R ⊏ H? No
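The capture relation amounts to a contiguous sublist test, sketched below with the slide's own example. The function names are hypothetical. The accuracy formula itself did not survive in the transcript, so `capture_ratio` (the share of real sessions captured by at least one reconstructed session) is only one plausible definition, assumed here.

```python
def captures(reconstructed, real):
    """True if `real` occurs as a contiguous run inside `reconstructed`."""
    n = len(real)
    return any(reconstructed[i:i + n] == real
               for i in range(len(reconstructed) - n + 1))


def capture_ratio(real_sessions, reconstructed_sessions):
    """Assumed accuracy measure: fraction of real sessions that are captured."""
    captured = sum(
        any(captures(h, r) for h in reconstructed_sessions)
        for r in real_sessions
    )
    return captured / len(real_sessions)


R = ["P1", "P3", "P5"]
print(captures(["P9", "P1", "P3", "P5", "P8"], R))  # True
print(captures(["P1", "P9", "P3", "P5", "P8"], R))  # False
```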

  31. Experimental Results • [Table: parameters for generating user sessions and web topology]

  32. Experimental Results: Accuracy vs. STP • Increasing STP leads to sessions with fewer pages, which are easier to predict. • In short sessions, the chance that an LPP or NIP event occurs is also small.

  33. Experimental Results: Accuracy vs. LPP • As LPP increases, the real accuracy decreases. • Increasing LPP leads to more complex sessions. • Intelligent path completion is needed for discovering more accurate sessions.

  34. Experimental Results: Accuracy vs. NIP • Increasing NIP causes more complex sessions, so the accuracy decreases for all heuristics. • Path separation is needed for discovering more accurate sessions.

  35. OUTLINE • Web Mining • Previous Session Reconstruction Heuristics • Smart-SRA • Agent Simulator • Experimental Results • Conclusion

  36. Conclusion • New session reconstruction heuristic: Smart-SRA • Does not allow sequences with unrelated consecutive requests (no hyperlink from the previous page to the next one) • Inserts no artificial browser (back) requests in order to prevent unrelated consecutive requests • Discovers only maximal sessions • The agent simulator simulates the behaviors of real WWW users. • It is possible to evaluate the accuracy of heuristics by using the agent simulator. • Experimental results show that Smart-SRA outperforms previous reactive heuristics.

  37. References • [Bayir 06-1] M. A. Bayir, I. H. Toroslu, A. Cosar (2006). A Performance Comparison of Pattern Discovery Methods on Web Log Data. AICCSA-06, the 4th ACS/IEEE International Conference on Computer Systems and Applications. • [Bayir 06-2] M. A. Bayir, I. H. Toroslu, A. Cosar (2006). A New Approach for Reactive Web Usage Data Processing. ICDE Workshops, 44. • [Cooley 99-1] R. Cooley, B. Mobasher, J. Srivastava (1999). Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, Vol. 1, No. 1. • [Cooley 99-2] R. Cooley, P. Tan, J. Srivastava (1999). Discovery of Interesting Usage Patterns from Web Data. Advances in Web Usage Analysis and User Profiling, LNAI 1836, Springer, Berlin, Germany, 163-182. • [Spiliopoulou 98] M. Spiliopoulou, L. C. Faulstich (1998). WUM: A Tool for Web Utilization Analysis. Proceedings EDBT Workshop WebDB'98, LNCS 1590, Springer, Berlin, Germany, 184-203.

  38. Thank you for listening. Any questions?
