Web Usage Mining

dimaia

Presentation Transcript


  1. Web Usage Mining Part-1

  2. Web Usage Mining Its main goal is to discover usage patterns from web data in order to understand and better serve the needs of web-based applications.

  3. Web Usage Mining Web usage mining consists of three phases • Preprocessing • Pattern discovery • Pattern analysis

  4. Web-Usage Mining Generated by users’ interaction with the Web, data sources include: • web-server access logs • proxy-server logs • browser logs • user profiles • registration data • user sessions and transactions • cookies • user queries • bookmark data • mouse clicks and scrolls

  5. Web-Log Processing A server log: • set of files consisting of the details of an activity performed by a server • files are automatically created and maintained by the server • The World Wide Web Consortium (W3C) has specified a standard format for web-server log files • There are other proprietary formats for web-server logs.

  6. Web-Log Processing Most web logs contain: • IP address of the client making the request • date and time of the request • URL of the requested page • number of bytes sent to serve the request • user agent (such as a web browser or web crawler) • referrer (the URL that triggered the request) • Logs can all be stored in one file • A better alternative is to separate: • access log • error log • referrer log

  7. Format of Web Logs Common log format (http://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format)

  8. Examples of Common Log Format
     140.14.6.11 - pawan [06/Sep/2001:10:46:07 -0300] "GET /s.htm HTTP/1.0" 200 2267
     140.14.7.18 - raj [06/Sep/2001:11:23:53 -0300] "POST /s.cgi HTTP/1.0" 200 499
     • The GET request retrieves the file s.htm • The POST request sends data to the program s.cgi • Fields, in order: • client machine’s IP address (140.14.6.11) • RFC 1413 identity of the client (missing here, shown as -) • user ID of the person making the request (pawan) • date and time of the request • request line • status code (200) • number of bytes transferred (2267)
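The field layout above can be sketched in code. The following is a minimal Python example, not part of the original slides, that splits one common-log-format entry into named fields with a regular expression. The names `CLF_PATTERN` and `parse_clf` are hypothetical, and the pattern assumes well-formed entries like the two shown on the slide.

```python
import re
from datetime import datetime

# One common-log-format entry:
#   host ident user [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = CLF_PATTERN.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    # %b (month abbreviation) is locale dependent; this assumes an English/C locale.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the size field means no bytes were sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    # The request line is normally "METHOD URL PROTOCOL".
    parts = entry["request"].split()
    if len(parts) == 3:
        entry["method"], entry["url"], entry["protocol"] = parts
    return entry

if __name__ == "__main__":
    sample = ('140.14.6.11 - pawan [06/Sep/2001:10:46:07 -0300] '
              '"GET /s.htm HTTP/1.0" 200 2267')
    print(parse_clf(sample))
```

Lines that do not match the pattern return `None` rather than raising, so a caller can count and inspect malformed entries separately.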

  9. Examples of Common Log Format An example of a log file in extended format

  10. Format of Web Logs Directives in the extended log file format: • #Version: version of the extended log file format used • #Fields: fields recorded in the log • #Software: software that generated the log • #Start-Date: date and time at which the log was started • #End-Date: date and time at which the log was finished • #Date: date and time at which the entry was added • #Remark: comments that are ignored by analysis tools

  11. Format of Web Logs • The directives #Version and #Fields are mandatory and must appear before all the entries • Each field in the #Fields directive can be specified in one of the following ways: • an identifier; e.g., time • an identifier with a prefix, separated by a hyphen; e.g., cs-method • a prefix followed by a header in parentheses; e.g., sc(Content-type)

  12. Format of Web Logs • No prefixes for: date, time, time-taken, bytes, cached • Prefixes for: ip, dns, status, comment, method, uri, uri-stem, uri-query, host • Prefixes can be: • cs: client to server • sc: server to client • sr: server to remote server (this prefix is used by proxies) • rs: remote server to server (this prefix is used by proxies) • x: application-specific identifier
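The directive and prefix rules on the last three slides can be sketched in code. The Python example below is an illustration rather than the presenter's method: `parse_extended_log` and `split_field` are hypothetical names, and the sample lines are illustrative, not taken from a real server. It assumes whitespace-separated entries and does not handle quoted strings that contain spaces. The prefix set also includes the single-letter prefixes c, s and r from the W3C draft, which the slide does not list.

```python
import re

# Prefixes from the slide (cs, sc, sr, rs, x) plus c, s, r from the W3C draft.
PREFIXES = {"c", "s", "r", "cs", "sc", "sr", "rs", "x"}
HEADER_FIELD = re.compile(r"^(?P<prefix>[a-z]+)\((?P<header>[^)]+)\)$")

def split_field(name):
    """Split a #Fields name into (prefix, identifier, header)."""
    m = HEADER_FIELD.match(name)
    if m:                                   # e.g. sc(Content-type)
        return m.group("prefix"), None, m.group("header")
    prefix, sep, rest = name.partition("-")
    if sep and prefix in PREFIXES:          # e.g. cs-method, cs-uri-stem
        return prefix, rest, None
    return None, name, None                 # e.g. time, time-taken

def parse_extended_log(lines):
    """Return (directives, entries) for an extended-format log."""
    directives, fields, entries = {}, [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            name, _, value = line[1:].partition(":")
            directives[name.strip()] = value.strip()
            if name.strip() == "Fields":
                fields = value.split()
            continue
        if not fields:
            # #Fields is mandatory and must precede all entries.
            raise ValueError("entry found before #Fields directive")
        entries.append(dict(zip(fields, line.split())))
    return directives, entries

if __name__ == "__main__":
    sample = [
        "#Version: 1.0",
        "#Date: 12-Jan-1996 00:00:00",
        "#Fields: time cs-method cs-uri",
        "00:34:23 GET /foo/bar.html",
    ]
    directives, entries = parse_extended_log(sample)
    print(directives["Version"], entries)
    print(split_field("sc(Content-type)"))
```

Checking the prefix against a known set is what keeps an unprefixed identifier such as time-taken from being misread as prefix "time" plus identifier "taken".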

  13. Analyzing Web logs

  14. Analyzing Web Logs General Summary from Analog

  15. Analyzing Web Logs Monthly report from Analog

  16. Analyzing Web Logs Daily summary from Analog

  17. Hourly summary from Analog

  18. Analyzing Web Logs

  19. Organization report from Analog

  20. Search-word report from Analog

  21. Operating-system report from Analog

  22. Status-code report from Analog

  23. File size report from Analog

  24. File type report from Analog

  25. Directory report from Analog

  26. Request report from Analog
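Reports like the status-code, hourly, and directory summaries above are essentially tallies over parsed log entries. The Python sketch below shows the idea with `collections.Counter`. It is not how Analog itself is implemented, and `summarize` is a hypothetical name. The first two sample entries mirror the slide 8 examples, while the third (a 404 under /docs/) is made up for illustration.

```python
from collections import Counter
from datetime import datetime

def summarize(entries):
    """Tally requests by status code, hour of day, and directory."""
    by_status, by_hour, by_directory = Counter(), Counter(), Counter()
    for e in entries:
        by_status[e["status"]] += 1
        by_hour[e["time"].hour] += 1
        # Directory = everything up to and including the last slash of the URL.
        by_directory[e["url"].rsplit("/", 1)[0] + "/"] += 1
    return {"status": by_status, "hour": by_hour, "directory": by_directory}

if __name__ == "__main__":
    entries = [
        {"time": datetime(2001, 9, 6, 10, 46, 7), "url": "/s.htm", "status": 200},
        {"time": datetime(2001, 9, 6, 11, 23, 53), "url": "/s.cgi", "status": 200},
        # Made-up entry, included only to show a second status code and directory.
        {"time": datetime(2001, 9, 6, 11, 40, 0), "url": "/docs/a.htm", "status": 404},
    ]
    for name, counts in summarize(entries).items():
        print(name, dict(counts))
```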

  27. Analysis of Clickstream: Studying Navigation Paths

  28. Analysis of Clickstream: Studying Navigation Paths Clickstream using Pathalizer with seven link specification

  29. Analysis of Clickstream: Studying Navigation Paths Clickstream using Pathalizer with twenty link specification

  30. Visualizing Individual User Sessions A brief on-campus session identified by StatViz that browses the bulletin board

  31. Visualizing Individual User Sessions A brief off-campus session identified by StatViz with three distinct activities

  32. Visualizing Individual User Sessions A long on-campus session identified by StatViz with multiple activities

  33. Caution in Interpreting Web-Access Logs • Requests may not always reach the server as they may be served from a proxy server’s cache • You do not really know: • Identity of readers • Number of visitors • Number of visits • User’s navigation path through the site • Entry point and referral • How users left the site or where they went next • How long people spent reading each page • How long people spent on the site

  34. Turner (2004) I’ve presented a somewhat negative view here, emphasizing what you can’t find out. Web statistics are still informative: it’s just important not to slip from “this page has received 30,000 requests” to “30,000 people have read this page.” In some sense these problems are not really new to the web—they are present just as much in print media too. For example, you only know how many magazines you’ve sold, not how many people have read them. In print media we have learnt to live with these issues, using the data which are available, and it would be better if we did on the Web too, rather than making up spurious numbers.
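The gap between requests and readers can be made concrete with a small tally. The Python sketch below counts, per URL, both the number of requests and the number of distinct client addresses. The function name `requests_and_hosts` and the sample records are hypothetical. Neither number is a count of people: several readers may share one address behind a proxy, and one reader may appear under several addresses.

```python
from collections import defaultdict

def requests_and_hosts(records):
    """records: iterable of (host, url) pairs.

    Returns {url: (request_count, distinct_host_count)}.
    """
    requests = defaultdict(int)
    hosts = defaultdict(set)
    for host, url in records:
        requests[url] += 1
        hosts[url].add(host)
    return {url: (requests[url], len(hosts[url])) for url in requests}

if __name__ == "__main__":
    # Made-up records: one client requests the same page twice.
    records = [
        ("140.14.6.11", "/s.htm"),
        ("140.14.6.11", "/s.htm"),
        ("140.14.7.18", "/s.htm"),
        ("140.14.7.18", "/s.cgi"),
    ]
    for url, (n_requests, n_hosts) in requests_and_hosts(records).items():
        print(url, "requests:", n_requests, "distinct hosts:", n_hosts)
```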
