
Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data

Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. Srivastava J., Cooley R., Deshpande M., Tan P.N. Appeared in SIGKDD Explorations, Vol. 1, Issue 2, 2000.


Presentation Transcript


  1. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. Srivastava J., Cooley R., Deshpande M., Tan P.N. Appeared in SIGKDD Explorations, Vol. 1, Issue 2, 2000

  2. Web Mining • What is it? • Data mining efforts associated with the Web • What kinds are there? • Content Mining • Structure Mining • Usage Mining

  3. Web Data • Content • Ex) texts and graphics • Structure • Ex) HTML tags • Usage • Ex) IP address, page reference, date/time • User profile • Ex) registration data, customer profile

  4. Web Usage Mining • The application of data mining techniques to discover usage patterns from Web data. • Three phases • Preprocessing • Pattern discovery • Pattern analysis

  5. Data Sources: Where can usage data be collected from? • Server Level Collection • The web server log records the browsing behavior of site visitors, but cached page views are not recorded. • Packet sniffing extracts usage data directly from TCP/IP packets.

  6. Data Sources (contd.) <Sample Web Server Log>

          #  IP Address  Userid  Time  Method/URL/Protocol  Status  Size  Referrer  Agent
          1  123.456.78.9 - [25/Apr/1998:03:04:41 -0500] "GET A.html HTTP/1.0" 200 3290 - Mozilla/3.04 (Win95, I)
          2  123.456.78.9 - [25/Apr/1998:03:05:34 -0500] "GET B.html HTTP/1.0" 200 2050 A.html Mozilla/3.04 (Win95, I)
          3  123.456.78.9 - [25/Apr/1998:03:05:39 -0500] "GET L.html HTTP/1.0" 200 4130 - Mozilla/3.04 (Win95, I)
          4  123.456.78.9 - [25/Apr/1998:03:06:02 -0500] "GET F.html HTTP/1.0" 200 5096 B.html Mozilla/3.04 (Win95, I)
          5  123.456.78.9 - [25/Apr/1998:03:06:58 -0500] "GET A.html HTTP/1.0" 200 3290 - Mozilla/3.01 (X11, I, IRIX6.2, IP22)
          6  123.456.78.9 - [25/Apr/1998:03:07:42 -0500] "GET B.html HTTP/1.0" 200 2050 A.html Mozilla/3.01 (X11, I, IRIX6.2, IP22)
          7  123.456.78.9 - [25/Apr/1998:03:07:55 -0500] "GET R.html HTTP/1.0" 200 8140 L.html Mozilla/3.04 (Win95, I)
          8  123.456.78.9 - [25/Apr/1998:03:09:50 -0500] "GET C.html HTTP/1.0" 200 1820 A.html Mozilla/3.01 (X11, I, IRIX6.2, IP22)
          9  123.456.78.9 - [25/Apr/1998:03:10:02 -0500] "GET O.html HTTP/1.0" 200 2270 F.html Mozilla/3.04 (Win95, I)
         10  123.456.78.9 - [25/Apr/1998:03:10:45 -0500] "GET J.html HTTP/1.0" 200 9430 C.html Mozilla/3.01 (X11, I, IRIX6.2, IP22)
         11  123.456.78.9 - [25/Apr/1998:03:12:23 -0500] "GET G.html HTTP/1.0" 200 7220 B.html Mozilla/3.04 (Win95, I)
         12  209.456.78.2 - [25/Apr/1998:05:05:22 -0500] "GET A.html HTTP/1.0" 200 3290 - Mozilla/3.04 (Win95, I)
         13  209.456.78.3 - [25/Apr/1998:05:06:03 -0500] "GET D.html HTTP/1.0" 200 1680 A.html Mozilla/3.04 (Win95, I)
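To make the fields of the sample log concrete, here is a minimal Python sketch that parses one entry (without the leading row number) into named fields. The regular expression is an assumption fitted to the layout shown on this slide, where the referrer and agent are unquoted; real server logs often quote those fields, so the pattern would need adjusting. The names `LOG_PATTERN` and `parse_log_line` are illustrative only.

```python
import re
from datetime import datetime

# Assumed layout, taken from the sample log on the slide:
# ip userid [time] "method url protocol" status size referrer agent
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<userid>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) (?P<referrer>\S+) (?P<agent>.+)'
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the textual fields into typed values.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = int(entry["size"])
    # A "-" referrer means the request carried no referring page.
    if entry["referrer"] == "-":
        entry["referrer"] = None
    return entry


sample = ('123.456.78.9 - [25/Apr/1998:03:05:34 -0500] '
          '"GET B.html HTTP/1.0" 200 2050 A.html Mozilla/3.04 (Win95, I)')
entry = parse_log_line(sample)
print(entry["url"], entry["referrer"], entry["agent"])
```

The IP address, agent, and referrer fields extracted here are the ones the later preprocessing slides rely on to tell users and sessions apart.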

  7. Data Sources (contd.) • Client Level Collection • By using remote agents, ex) Java applets (overhead), JavaScript (not able to capture all user clicks) • By modifying the source code of an existing browser, ex) Mosaic (hard to convince users to use the modified browser)

  8. Data Sources (contd.) • Proxy Level Collections • Intermediate level of caching between web server and client browser. • Characterize the browsing behavior of a group of users sharing a common proxy server.

  9. Data Abstractions • User: a single individual who is accessing files from one or more Web servers through a browser • Page View: every file that contributes to the display on a user’s browser at one time • Click Stream: a sequential series of page view requests • User Session: the click stream of page views for a single user across the entire Web • Server Session: the set of page views in a user session for a particular Web site • Episode: any semantically meaningful subset of a user or server session

  10. Web Usage Mining Process

  11. Preprocessing • Usage Preprocessing: the most difficult task, due to the incompleteness of the available data (IP address, agent, server-side click stream) • Single IP address / Multiple Server Sessions • Multiple IP addresses / Single Server Session • Multiple IP addresses / Single User • Multiple Agents / Single User
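One common heuristic for the ambiguities listed above can be sketched as follows: treat each distinct (IP address, agent) pair as one user, then split that user's requests into server sessions whenever the gap between two requests exceeds a timeout. The 30-minute timeout, the tuple layout, and the name `build_sessions` are assumptions for illustration and do not come from the presentation. The sketch also does not handle the "multiple IP addresses / single server session" case.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed threshold, not from the slides


def build_sessions(entries, timeout=SESSION_TIMEOUT):
    """Group (ip, agent, time, url) records into per-user server sessions."""
    # Step 1: user identification by (IP address, agent).
    by_user = defaultdict(list)
    for ip, agent, time, url in entries:
        by_user[(ip, agent)].append((time, url))

    # Step 2: session identification by inactivity timeout.
    sessions = []
    for user, requests in by_user.items():
        requests.sort()
        current = [requests[0][1]]
        for (prev_time, _), (time, url) in zip(requests, requests[1:]):
            if time - prev_time > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


def at(hms):
    """Helper: timestamp on the sample log's date."""
    return datetime.strptime("25/Apr/1998 " + hms, "%d/%b/%Y %H:%M:%S")


WIN95 = "Mozilla/3.04 (Win95, I)"
IRIX = "Mozilla/3.01 (X11, I, IRIX6.2, IP22)"
entries = [
    ("123.456.78.9", WIN95, at("03:04:41"), "A.html"),
    ("123.456.78.9", WIN95, at("03:05:34"), "B.html"),
    ("123.456.78.9", IRIX, at("03:06:58"), "A.html"),
    ("123.456.78.9", IRIX, at("03:07:42"), "B.html"),
    ("209.456.78.2", WIN95, at("05:05:22"), "A.html"),
]
sessions = build_sessions(entries)
for user, pages in sessions:
    print(user, pages)
```

In this input the single IP address 123.456.78.9 is split into two users because two different agents appear behind it, which is the "single IP address / multiple server sessions" situation.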

  12. Preprocessing (contd.) • Content Preprocessing • Converting text, images, and scripts into useful forms (e.g., vectors of words) • Classification/clustering algorithms can be used to filter discovered patterns based on topic or intended use • Structure Preprocessing • Hyperlinks between page views
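As a rough illustration of the "vectors of words" idea, the sketch below turns a page's HTML into a word-count vector. The crude regex-based tag stripping and the name `page_to_word_vector` are simplifying assumptions; a real pipeline would likely use an HTML parser, stop-word removal, and term weighting.

```python
import re
from collections import Counter


def page_to_word_vector(html):
    """Return a bag-of-words Counter for the visible text of a page."""
    text = re.sub(r"<[^>]+>", " ", html)        # crude tag removal (assumption)
    words = re.findall(r"[a-z]+", text.lower())  # lowercase alphabetic tokens
    return Counter(words)


page = "<html><body><h1>Music</h1><p>Order music online.</p></body></html>"
vector = page_to_word_vector(page)
print(vector)
```

Vectors like this can then be fed to a classification or clustering algorithm to label page views by topic.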

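  13. Pattern Discovery • Statistical Analysis • Page views, viewing time, length of navigational path • Association Rules • Apriori algorithm: finds pages that are often referenced together in a server session, which can reveal correlations between groups of users • Clustering • Usage clustering: groups users with similar browsing patterns, e.g., for inferring user demographics • Page clustering: groups pages having related content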
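The association-rule idea can be sketched in a few lines. The code below is a simplified pair-counting step in the spirit of Apriori (it stops at itemsets of size two), not a full implementation. The sessions, the thresholds, and the name `pair_rules` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=0.5, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) rules over pairs of pages."""
    n = len(sessions)
    page_sets = [set(s) for s in sessions]
    page_counts = Counter(p for s in page_sets for p in s)
    # Apriori pruning: a pair can only be frequent if both pages are frequent.
    frequent = {p for p, c in page_counts.items() if c / n >= min_support}
    pair_counts = Counter(
        pair
        for s in page_sets
        for pair in combinations(sorted(s & frequent), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / page_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Hypothetical server sessions, each a list of requested pages.
sessions = [
    ["A.html", "B.html", "F.html"],
    ["A.html", "B.html", "C.html"],
    ["A.html", "D.html"],
    ["B.html", "F.html"],
]
rules = pair_rules(sessions)
print(rules)
```

With these thresholds the only surviving rule is F.html → B.html: the two pages appear together in half of the sessions, and every session containing F.html also contains B.html.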

  14. Pattern Discovery (contd.) • Classification • Ex) 30% of users who placed an online order in /Product/Music are in the 18-25 age group and live on the West Coast. • Sequential Patterns • Patterns in a time-ordered set of sessions: predicting future visit patterns, e.g., to decide where to place advertisements
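A very reduced form of sequential pattern discovery is to count which page directly follows which, and to use the most frequent successor as a prediction. This first-order transition count is a stand-in chosen for brevity, not the sequential pattern algorithm from the paper. The sessions and the names `transition_counts` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def transition_counts(sessions):
    """Count how often each page is directly followed by each other page."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, page):
    """Return the most frequent successor of `page`, or None if unseen."""
    if page not in counts:
        return None
    return counts[page].most_common(1)[0][0]


# Hypothetical time-ordered sessions reusing the sample log's page names.
sessions = [
    ["A.html", "B.html", "F.html", "O.html", "G.html"],
    ["L.html", "R.html"],
    ["A.html", "B.html", "C.html", "J.html"],
]
counts = transition_counts(sessions)
print(predict_next(counts, "A.html"))
```

A site could use such predictions to decide on which page an advertisement for the likely next page should appear.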

  15. Pattern Analysis • Motivation • Filter out uninteresting rules / patterns from the set found in the pattern discovery phase.

  16. Application Areas

  17. Examples • Personalization • http://aztec.cs.depaul.edu/scripts/ACR2/ • Business • http://www.accrue.com/
