
Web Usage Mining

Web Usage Mining. A case study of the GoMercer.com website. Martin Zhao Mar 16, 2007. Topics. What is data mining? The data mining process Web usage mining: basic concepts The robust fuzzy relational clustering algorithm An application to the GoMercer.com web logs Q & A.



Presentation Transcript


  1. Web Usage Mining: a case study of the GoMercer.com website. Martin Zhao, Mar 16, 2007

  2. Topics • What is data mining? • The data mining process • Web usage mining: basic concepts • The robust fuzzy relational clustering algorithm • An application to the GoMercer.com web logs • Q & A

  3. What is Data Mining? – definition • A concise definition: finding hidden information in large datasets • A slightly longer version: data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules • Differences from accessing info in a database: • The query is not well formed or precisely stated • The data needs to be pre-processed before mining • The output is new knowledge, which may not be a subset of the database

  4. What is Data Mining? – a historical perspective • Data mining is a relatively new field of study • The 1st International Conference on Knowledge Discovery and Data Mining (KDD) was held in 1995 • But its roots can be traced back to five areas: • Artificial Intelligence: neural networks (1940s), genetic algorithms (1970s), decision tree algorithms (1980s) • Statistics: Bayes theorem (1700s), regression (1900s), classification (1960s), K-means clustering (1970s) • Information Retrieval: similarity measures (1960s), clustering (1960s), SMART IR systems (1970s) • Databases: batch reports (1960s), relational data models (1970s), data warehousing & OLAP (1990s) • Algorithms

  5. Why Data Mining? • The growth of data is the most important factor propelling the growth of data mining • In 2003, Wal-Mart captured 20 million transactions per day in a 10-terabyte database (1 TB = 10^6 MB) • In 1950, the largest companies had only several dozen megabytes • The total amount of data produced in 2002 was estimated at 5 exabytes (1 EB = 10^6 TB) • 40% of this was produced in the US • When we have more data, we expect more sophisticated information from it

  6. Business Intelligence – from data to knowledge • Data: • Factual information • May be incomplete • Stored in huge amounts • Information: • Relevant data • Well formatted • For a targeted audience • Knowledge: • Models, patterns, and rules • Can be used in prediction • Intelligence: • Using knowledge in decision making

  7. Basic Data Mining Tasks • Classification (maps data into predefined groups) • Regression (maps a data item to a real-valued prediction variable) • Prediction (similar to classification, but deals with a future state) • Clustering (similar to classification, but the groups are defined by the data) • Association rules (identify associations among data) • Sequence discovery (determines sequential patterns in data)

  8. The Data Mining Process – the steps • Develop an understanding of the purpose • Obtain the dataset to be used • Explore, clean, and preprocess the data • Reduce the data, if necessary • Determine the data mining tasks • Choose the data mining techniques to be used • Use the chosen algorithms to perform the tasks • Interpret the results • Deploy the model

  9. Phases in the DM Process – CRISP-DM

  10. Web Data Mining • Web mining: the use of data mining techniques to automatically discover and extract useful and novel information from web documents and services • Web mining can be categorized as • Content mining: extracts models from web contents, such as text, images, video, and semi-structured (HTML or XML) or structured documents (digital libraries) • Structure mining: aims at finding the underlying topology and organization of web resources • Usage mining: discovers usage patterns from web server log files, user queries, and registration data

  11. User Clustering and Profiling – goals • Major application areas for web usage mining • Personalization • System improvement • Site modification • Business intelligence • Usage characterization

  12. User Clustering and Profiling – process • Data cleaning: • omitting entries about individual objects on a page (such as .gif or .jpg image files) • (User and) session identification: • including identifying distinct pages, IPs, and agents • a session is a sequence of page views accessed through a certain IP using a certain agent within a certain amount of time (set to 45 minutes) • Clustering and profiling: • Define similarity between page views • Categorize user sessions into clusters based on similarity of the pages visited
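The data cleaning step can be sketched as a filter on the requested URI. This is a minimal illustration, not the code used in the study: the function names and the extension list are assumptions (the slides name only .gif and .jpg).

```python
import posixpath
from urllib.parse import urlsplit

# Assumed set of embedded-object extensions; the slides mention only .gif and .jpg.
OBJECT_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png"}


def is_page_view(uri_stem: str) -> bool:
    """Return True when the URI looks like a page rather than an embedded object."""
    path = urlsplit(uri_stem).path
    extension = posixpath.splitext(path)[1].lower()
    return extension not in OBJECT_EXTENSIONS


def clean(uri_stems):
    """Keep only the entries that count as page views."""
    return [uri for uri in uri_stems if is_page_view(uri)]
```

For example, `clean(["/choose-mercer/apply-online.aspx", "/images/logo.gif"])` keeps only the .aspx page.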

  13. Web Log File Entries • Web log files keep track of the following data • Date and time (e.g., 2006-10-01@00:01:01) • Client IP address (e.g., 70.168.242.49) • Server IP address (e.g., 192.168.1.52 or www.GoMercer.com) • URI stem (web page or a specific file requested, e.g., /choose-mercer/apply-online.aspx) • User Agent (browser used by the user, e.g., Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)) • Referrer (the previous page visited) • Cookie • Etc

  14. Data Model [Class diagram relating User Cluster, User Session, IP Address, Web Browser, and Web Page; a User Session groups Web Pages requested from one IP Address with one Web Browser within 45 minutes]

  15. Session Identification • Use original web server log files as input • Parse log entries to omit individual objects (such as images), and • Keep track of unique client IPs, URIs of interest, and user agents • Keep track of date/time and identifiers for IP, URI, and agent for each entry of interest • For each entry of interest: • add the URI to an existing session with the same {IP, agent} identifiers if the entry falls within 45 minutes • otherwise, create a new session with the URI • Persist the session information to a file (or DB)
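The session identification loop above might look roughly like the following sketch. It assumes the entries are already cleaned, parsed into `(timestamp, client_ip, user_agent, uri)` tuples, and sorted by time; it also assumes the 45 minutes are measured from the start of the session, which the slides do not state explicitly. All names are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_WINDOW = timedelta(minutes=45)  # window length taken from the slides


def build_sessions(entries, window=SESSION_WINDOW):
    """Group log entries into sessions keyed by (client IP, user agent).

    entries: iterable of (timestamp, client_ip, user_agent, uri), sorted by timestamp.
    Returns a list of dicts with keys "ip", "agent", "start", and "uris".
    """
    latest = {}    # (ip, agent) -> most recent session for that pair
    sessions = []  # all sessions, in order of creation
    for timestamp, ip, agent, uri in entries:
        key = (ip, agent)
        current = latest.get(key)
        # Assumption: the window is measured from the first request of the session.
        if current is not None and timestamp - current["start"] <= window:
            current["uris"].append(uri)
        else:
            current = {"ip": ip, "agent": agent, "start": timestamp, "uris": [uri]}
            latest[key] = current
            sessions.append(current)
    return sessions
```

Measuring the window from the last request instead (an inactivity timeout) would only require comparing against the timestamp of the previous entry in the session.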

  16. Sample Session Information

  17. Clustering – a one-dimensional example • Clustering: just specify the number of groups; the groups themselves are defined by the data • Classification: map data into pre-defined groups • Let’s try to group this set of test scores into letter grades • Maximize the inter-cluster distance and minimize the intra-cluster distance [Figure: test scores on a number line, annotated with inter-cluster distances (gap used here) and intra-cluster distances; values shown in the figure: 8, 6, 6, 3, 4, 2.13, 3.33]
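One simple way to illustrate the gap idea in one dimension is to sort the scores and cut at the widest gaps. The sketch below is only an illustration under that assumption; the function name is hypothetical, and the scores in the usage note are made up, not the ones from the slide.

```python
def split_by_largest_gaps(values, n_groups):
    """Split 1-D values into n_groups by cutting at the (n_groups - 1) widest gaps."""
    ordered = sorted(values)
    if n_groups < 1 or n_groups > len(ordered):
        raise ValueError("n_groups must be between 1 and the number of values")
    # gaps holds (gap width, index i) for the gap between ordered[i] and ordered[i + 1].
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    # Indices of the widest gaps, restored to left-to-right order.
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[: n_groups - 1])
    groups, start = [], 0
    for i in cuts:
        groups.append(ordered[start : i + 1])
        start = i + 1
    groups.append(ordered[start:])
    return groups
```

With the made-up scores `[95, 92, 90, 81, 79, 78, 70, 68, 55]` and four groups, the cuts fall at the gaps of width 13, 9, and 8, so the groups are defined by the data rather than by fixed grade boundaries.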

  18. Page and Session (Dis-)Similarity • The “syntactic” similarity between (the URLs of) the ith and jth pages is defined as the smaller of 1 and the ratio of the overlap of the two paths to the larger of the two path lengths: $S_u(i, j) = \min\left(1, \frac{|p_i \cap p_j|}{\max(1, \max(|p_i|, |p_j|))}\right)$ • For instance, the similarity score for /mercer-411/contact.aspx and /mercer-411/ask-a-student.aspx is 1/2, whereas the score for /mercer-411/contact.aspx and assets/flash/location.xml is 0 • Dissimilarity is defined as $(1 - S_u(i, j))^2$ • Dissimilarity between two clusters is then calculated by summing up pair-wise dissimilarity scores
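The similarity and dissimilarity scores can be reproduced with a short sketch. It assumes that a path is the list of its "/"-separated segments and that the overlap is the number of leading segments the two paths share; the function names are hypothetical.

```python
def path_segments(url):
    """Split a URL path into its non-empty segments."""
    return [segment for segment in url.split("/") if segment]


def url_similarity(url_i, url_j):
    """Syntactic similarity: shared leading segments over the longer path length."""
    p_i, p_j = path_segments(url_i), path_segments(url_j)
    overlap = 0
    for a, b in zip(p_i, p_j):
        if a != b:
            break
        overlap += 1
    return min(1.0, overlap / max(1, max(len(p_i), len(p_j))))


def url_dissimilarity(url_i, url_j):
    """Dissimilarity as defined on the slide: (1 - similarity) squared."""
    return (1.0 - url_similarity(url_i, url_j)) ** 2
```

Under these assumptions, `url_similarity("/mercer-411/contact.aspx", "/mercer-411/ask-a-student.aspx")` evaluates to 0.5 and the score against `assets/flash/location.xml` is 0, matching the two examples on the slide.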

  19. Page Similarity – an example • For instance, the similarity score for /mercer-411/contact.aspx and /mercer-411/ask-a-student.aspx is 1/2, whereas the score for /mercer-411/contact.aspx and assets/flash/location.xml is 0 [Figure: URL tree rooted at /, with branches /mercer-411 (… /contact.aspx … /ask-a-student.aspx) and /assets (/flash, containing /CLA_1.flv … /location.xml)]

  20. Medoid and Membership • Each cluster is represented by a medoid, which is a centrally located session in the cluster • The affiliation of a session to a cluster is represented as a membership score, or the similarity to the corresponding medoid • A session is not considered to exclusively belong to a single cluster • The affiliation is determined by the highest membership score in a given iteration

  21. Relational Clustering Algorithm 1. Use identified sessions as input 2. Specify the number of clusters, C, and the maximum number of iterations, M, to be used 3. Choose an initial medoid for each cluster i in [1, C] 4. Compute membership uij for each session j in [1, N] with regard to each cluster i (using the similarity measure) 5. Store the old medoids 6. Compute the new medoids to minimize overall intra-cluster distances 7. Repeat steps 4 through 6 until the medoids do not change or the maximum number of iterations M is reached
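The loop above can be sketched as a plain fuzzy c-medoids iteration over a precomputed session dissimilarity matrix. This is a simplified illustration, not the robust variant from the talk: it has no outlier handling, the membership formula (inverse dissimilarity with a fuzzifier of 2) is one common choice and an assumption here, and all names are hypothetical.

```python
def _memberships(dissim, medoids, j, fuzzifier):
    """Membership of session j in each cluster, from its dissimilarity to each medoid."""
    distances = [dissim[v][j] for v in medoids]
    zeros = [i for i, d in enumerate(distances) if d == 0]
    if zeros:  # session coincides with one or more medoids: share full membership
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(distances))]
    weights = [(1.0 / d) ** (1.0 / (fuzzifier - 1.0)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


def fuzzy_c_medoids(dissim, initial_medoids, max_iter=100, fuzzifier=2.0):
    """dissim: N x N dissimilarity matrix; initial_medoids: C session indices (step 3)."""
    n = len(dissim)
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        # Step 4: memberships u[j][i] of every session j in every cluster i.
        u = [_memberships(dissim, medoids, j, fuzzifier) for j in range(n)]
        # Step 5: remember the old medoids.
        old_medoids = medoids[:]
        # Step 6: for each cluster, pick the session minimizing the weighted
        # sum of dissimilarities to all sessions.
        medoids = []
        for i in range(len(old_medoids)):
            costs = [sum(u[j][i] ** fuzzifier * dissim[k][j] for j in range(n))
                     for k in range(n)]
            medoids.append(min(range(n), key=costs.__getitem__))
        # Step 7: stop when the medoids no longer change.
        if medoids == old_medoids:
            break
    # Recompute memberships so they match the final medoids.
    u = [_memberships(dissim, medoids, j, fuzzifier) for j in range(n)]
    return medoids, u


def hard_assignment(u):
    """Assign each session to the cluster with its highest membership score."""
    return [max(range(len(row)), key=row.__getitem__) for row in u]
```

The result depends on the initial medoids, so in practice one would try several starting points. The memberships stay fuzzy throughout; `hard_assignment` only reports the cluster with the highest score for each session.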

  22. Application to GoMercer.com • Meeting w/ Rob Saxon • Obtain & read web log files • Preliminary study using CSC data • Parsing data for sessions • Clustering w/ FCMdd • Data analysis & visualization • Ongoing

  23. Results – summary of log files • 148 files (one per day from 09/29/06 to 02/23/07), totaling about 2.5 GB • File sizes for Oct 2006 and Feb 2007 as shown • Session counts in the same periods present similar patterns

  24. Results – frequencies by URI type • User client programs (or browsers used) • Main page • ASP scripts • Breakdown for /accepted, /choose-mercer, and /mercer-411 • Flash videos • Individual videos • Combined by topic [Chart labels: /choose-mercer, /accepted, /mercer-411]

  25. Results – user clusters and profiles [Figure: user clusters and profiles, with numeric labels per cluster]

  26. Questions and Discussions

  27. References • Shmueli et al., Data mining for business intelligence, Wiley-Interscience, 2007 • Dunham, Data mining, Prentice Hall, 2003 • Scime (ed.), Web mining: applications and techniques, Idea Group, 2005 • Squier, What is data mining? (www.dama-ncr.org/Library/2001.11.14-Laura%20Squier.ppt) • Nasraoui et al., Automatic web user profiling and personalization using robust fuzzy relational clustering, 1999 • Cooley, Web usage mining: discovery and application of interesting patterns from web data, PhD thesis, Univ. of Minnesota, 2000
