Web mining

Web mining • Web mining deals with mining of patterns from web and e-commerce data. • Web data • Web pages • Web structures • Web logs • E-commerce sites • E-mail messages • Customer profiles • Telephone calls • The aim is to create an intelligent enterprise.

E-commerce Business Objectives • E-commerce business objectives [PPS 03] [BL 97] [MMPH 03][BM 98] [LPSHG 01] [SM 00][FCJ 01][MPT 99]

Objectives • Web mining is one of the hottest areas in computer science because of its direct applications in e-commerce, information retrieval and filtering, and web information systems. • Web mining. • Web usage mining. • Web content mining • Web structure mining. • And applications to e-commerce and business intelligence. • Data mining, text mining, information retrieval, and machine learning techniques will be covered for business intelligence, site management, personalization and user profiling.

Web Mining Taxonomy • [Figure: the taxonomy of web mining [CMS 97]. Web mining branches into web structure mining, web usage mining, and web content mining.] • Web structure mining • Web content mining • Web usage mining • Mining the usage data • User’s behaviour • Click the link • Browsing time • Transaction

Topics • Overview of data mining and E-business analytics • Data preprocessing, association rules, classification, clustering. • Web structure mining • Link based search algorithms • Web content/text mining • Information retrieval • Web usage mining: E-metrics and E-commerce data analysis. • Web log mining • Web personalization and recommendation systems

The web miner system
• [Figure: architecture of the WEBMINER system; the labels recoverable from the diagram are listed below.]
• Input: access log, referrer log, agent log, site files, registration or remote agent data
• Preprocessing: data cleaning, path completion, session identification, user identification, site crawler, classification algorithm, site filter, transaction identification
• Intermediate results: site topology, page classification, user session file, transaction file
• Mining: standard statistics package, sequential pattern mining, clustering, association rule mining
• Results: usage statistics, page clusters, sequential patterns, association rules, OLAP/visualization

What is Web log mining? • [Figure: WWW, web server, web docs, access log] • Web servers register a log entry for every single access they get. • A huge number of accesses (hits) are registered and collected in an ever-growing web log. • Web log mining helps to • Enhance server performance • Improve web site navigation • Improve system design of web applications • Target customers for EC • Identify potential prime advertisement locations

Existing Web Log Analysis Tools • Many products are available • Slow and make assumptions to reduce the size of the log file to analyze. • Frequently used, predefined reports • Summary report of hits and bytes transferred. • List of top requested URLs • List of referrers. • List of most common browsers. • Hits per hour/day/week/month reports. • Hits per Internet domain • Error report • Directory tree report etc. • Tools are limited in their performance, comprehensiveness, and depth of analysis.

Web server log file entries • Each entry records the client’s IP address, the date and time the request is received, the time zone where the server is located, the request command, the URL (Uniform Resource Locator) of the requested page, the protocol of the request, the return code of the server, and the size of the page. • Examples:
mac04cville.wam.umd.edu - - [01/Apr/1997:00:00:00 -0600] "GET /~ka/graphics/mel/melting_glass.html HTTP/1.0" 200 11880
mac04cville.wam.umd.edu - - [01/Apr/1997:00:00:20 -0600] "GET /~ka/graphics/mel/glass.html HTTP/1.0" 200 11880
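The fields listed above can be pulled out of a log line with a regular expression. The following is a minimal sketch, assuming entries follow the layout of the two sample lines; the names `LOG_PATTERN` and `parse_log_line` are hypothetical and not part of any tool mentioned in these slides.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [timestamp zone] "method url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>[^\s"]+)(?: (?P<protocol>[^"]+))?" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The timestamp carries the server's time zone offset, e.g. -0600.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = ('mac04cville.wam.umd.edu - - [01/Apr/1997:00:00:00 -0600] '
          '"GET /~ka/graphics/mel/melting_glass.html HTTP/1.0" 200 11880')
parsed = parse_log_line(sample)
```

Lines that do not fit the assumed layout come back as `None`, so a caller can count or inspect them instead of silently dropping them.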

Web Server log file entries • Fields: IP address, user ID, timestamp, method, URL/path, status, size, referrer, agent, cookie • Examples:
dd23-125.compuserve.com - rhuia [01/Apr/1997:00:03:25 -0800] "GET /SFU/cgi-bin/rg/vg-dspmsg.cgi?ci=40154&mi=49 HTTP/1.0" 200 417
129.128.4.241 - [15/Aug/1999:10:45:32 -0800] "GET /source/pages/chapter.html" 200 618 /source/pages/index.html Mozilla/3.04 (win95)

Diversity of web log mining • Weblog provides rich information about web dynamics • Multidimensional web log analysis • Disclose potential customers, users, markets, etc • Plan mining (mining general web accessing regularities) • Web linkage adjustment, performance improvements. • Web accessing association/sequential pattern analysis • Web caching, pre-fetching • Trend analysis • Dynamics of the web: what has been changing ? • Customized to individual users.

More on web log mining • Information NOT contained in the log files. • Use of browser functions, e.g., backtracking. • Within-page navigation, e.g., scrolling up and down. • Requests for pages stored in the browser cache. • Requests for pages stored in a cache (proxy) server. • Etc. • Special problems with dynamic pages • Different user actions call the same cgi script. • The same user action at different times may call different cgi scripts. • One user using more than one browser at a time. • Etc.

Use of log files • Basic summarization • Get frequency of individual actions by user, domain and session. • Group actions into activities, e.g., reading messages in a conference. • Get frequency of different errors. • Questions answerable by such summary • Which components or features are the most/least used ? • Which events are most frequent ? • What is the user distribution over the domain areas ? • Are there, and what are the differences in access from different domain areas or geographic areas ?
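The basic summaries above amount to frequency counts over parsed log entries. A minimal sketch follows, assuming each entry is a dict with hypothetical keys `host`, `url`, and `status`, and taking the last label of the host name as the domain (an assumption that does not hold for hosts logged as raw IP addresses).

```python
from collections import Counter

def summarize(entries):
    """Count hits per domain, hits per URL, and error responses per status code."""
    hits_by_domain = Counter(e["host"].rsplit(".", 1)[-1] for e in entries)
    hits_by_url = Counter(e["url"] for e in entries)
    # Status codes of 400 and above are treated as errors here.
    errors_by_status = Counter(e["status"] for e in entries if e["status"] >= 400)
    return hits_by_domain, hits_by_url, errors_by_status

example_entries = [
    {"host": "mac04cville.wam.umd.edu", "url": "/a.html", "status": 200},
    {"host": "dd23-125.compuserve.com", "url": "/a.html", "status": 200},
    {"host": "mac04cville.wam.umd.edu", "url": "/missing.html", "status": 404},
]
domains, urls, errors = summarize(example_entries)
```

`urls.most_common(1)` then gives the most requested page, which addresses the "most/least used" question directly.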

In-depth analysis of log files • In-depth analysis • Pattern analysis, e.g., between users, over different courses, instructional design and materials. • Trend analysis, e.g., user behaviour change over time, network traffic change over time. • Questions that can be answered by in-depth analysis • In what contexts are the components or features used? • What are the typical event sequences? • What are the differences in usage and access patterns among users? • What are the overall patterns of use of a given environment? • Which user behaviours change over time? • How do usage patterns change with quality of service (slow/fast)? • What is the distribution of network traffic over time?

Main Web mining steps • [Figure: web log files -> data pre-processing -> formatted data in database / data cube -> pattern discovery -> patterns -> pattern analysis -> knowledge] • Data preparation • Data mining • Pattern analysis

Data preprocessing • Problems • Identifying types of pages: content page or navigation page • Identify visitor (user) • Identify session, transaction, sequence, action,… • Inferring cached pages • Identifying visitors • Login/cookies/combination: IP addresses, agent, path followed • Identification of session (division of click stream) • We do not know when a visitor leaves • Use a timeout (usually 30 minutes) • Identification of user actions • Parameters and path analysis.

Preprocessing • Inputs to pre-processing • Server logs, site files, usage statistics • Outputs • User session file, transaction file, site topology, and page classifications • Impediments: browser and proxy caching. • Methods used to collect information about cached references • Cookies • Can be deleted by user • Cache busting: forcing a browser to download a page from the server every time. • Defeats the advantage of caching • Registration • Privacy concerns

Data cleaning • Data cleaning steps • Eliminate irrelevant items • HTTP (as of version 1.0) requires a separate connection for every file requested. • Graphics are downloaded automatically along with the page, so they produce their own log entries. • Only log entries for HTML requests should be kept. • All log entries with file name suffixes such as gif, jpeg, GIF, JPEG, jpg, JPG are removed. • WEBMINER uses a default list of suffixes to remove files. • Main intent of web usage mining • Knowing the intent of the user.
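Suffix-based cleaning can be sketched in a few lines. This is an illustration only: the suffix list below mirrors the examples on the slide rather than WEBMINER's actual default list, and `is_relevant` and `clean_log` are hypothetical names.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed suffix list taken from the slide; comparison is case-insensitive,
# which covers the upper-case variants (GIF, JPEG, JPG).
IRRELEVANT_SUFFIXES = {".gif", ".jpeg", ".jpg"}

def is_relevant(url, suffixes=IRRELEVANT_SUFFIXES):
    """Return False for requests whose path ends in an irrelevant suffix."""
    path = urlsplit(url).path          # drop any query string first
    extension = posixpath.splitext(path)[1].lower()
    return extension not in suffixes

def clean_log(urls):
    """Keep only the requested URLs that pass the suffix filter."""
    return [url for url in urls if is_relevant(url)]

kept = clean_log(["/index.html", "/img/logo.GIF", "/photo.jpg?size=2", "/docs/"])
```

Which suffixes count as irrelevant depends on the site: on a photo gallery, image requests may be exactly the content views of interest.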

User Identification • Unique users must be identified • This task is complicated by the existence of local caches, corporate firewalls, and proxy servers. • Heuristics can be applied • If the IP address is the same but the agent is different, assume a different user • If the referrer is different, assume the users are different
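The first heuristic (same IP address, different agent, therefore different user) can be sketched by grouping entries on the pair of IP address and agent. This is a rough illustration, assuming each entry is a dict with hypothetical keys `ip`, `agent`, and `url`; the referrer heuristic is not covered here.

```python
from collections import defaultdict

def identify_users(entries):
    """Group log entries by (IP address, agent); each group is treated as one user."""
    users = defaultdict(list)
    for entry in entries:
        users[(entry["ip"], entry["agent"])].append(entry["url"])
    return dict(users)

example_entries = [
    {"ip": "129.128.4.241", "agent": "Mozilla/3.04 (win95)", "url": "/A"},
    {"ip": "129.128.4.241", "agent": "Mozilla/3.04 (win95)", "url": "/B"},
    {"ip": "129.128.4.241", "agent": "Mozilla/2.0 (X11)", "url": "/L"},
]
users = identify_users(example_entries)
```

Two people behind the same proxy with identical browsers still collapse into one user under this rule, which is why the slide treats it as a heuristic only.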

Session Identification • Users will visit server more than once. • Goal of session identification • Divide the page accesses of each user into individual sessions. • Simplest method • Use timeout. Many commercial products use 30 minutes.

Session identification steps • Read the next record in the Web log file. • Parse the record to get the user’s address, the URL of the requested page, and the time and date of the request. • If the page is an image file or an unsuccessful request, it is discarded. Image files can be detected by looking at their file name extensions, e.g., .gif, .jpeg. An unsuccessful request has a return code other than 200, i.e., an error code. • Decide which session the page belongs to based on the user’s IP address. • If the elapsed time from the last request is within max_idle_time, the page is appended to the appropriate session. • If the elapsed time from the last request is greater than max_idle_time, the session is closed and a new session is created for the IP address. The closed session is filtered using min_time and min_page and then output.
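The timeout procedure above can be sketched as follows. Assumptions are flagged in the comments: records are already parsed, cleaned, and sorted by time, and `min_time` and `min_page` are read here as a minimum session duration and a minimum page count, which the slide does not spell out. The function name `sessionize` is hypothetical.

```python
from datetime import datetime, timedelta

def sessionize(records, max_idle_time=timedelta(minutes=30),
               min_time=timedelta(0), min_page=1):
    """Split (ip, url, timestamp) records into per-IP sessions using a timeout."""
    open_sessions = {}   # ip -> list of (url, timestamp) for the session in progress
    closed = []          # finished sessions that passed the filters

    def close(ip):
        session = open_sessions.pop(ip)
        duration = session[-1][1] - session[0][1]
        # Assumed meaning of the filters: drop sessions that are too short.
        if len(session) >= min_page and duration >= min_time:
            closed.append((ip, session))

    for ip, url, timestamp in records:
        session = open_sessions.get(ip)
        if session is not None and timestamp - session[-1][1] > max_idle_time:
            close(ip)            # idle for too long: end the old session
            session = None
        if session is None:
            session = open_sessions[ip] = []
        session.append((url, timestamp))

    for ip in list(open_sessions):   # flush sessions still open at end of log
        close(ip)
    return closed

t0 = datetime(1997, 4, 1, 0, 0, 0)
example_records = [
    ("a", "/p1", t0),
    ("b", "/q1", t0 + timedelta(minutes=5)),
    ("a", "/p2", t0 + timedelta(minutes=10)),
    ("a", "/p3", t0 + timedelta(minutes=50)),   # 40 idle minutes: new session
]
```

Calling `sessionize(example_records, min_page=2)` keeps only the two-page session of address "a"; the single-page sessions are filtered out.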

User Access Patterns • Because of caching on the client machine, the server log file does not record every request from the client. • How can the user access pattern between two consecutive pages in a session be found?

User Access Patterns • Assumption: Users always follow the shortest path for navigation. • Principles: • Once a click through happens, no back is allowed. Otherwise, it is not a shortest path. • Whenever there is ambiguity of click through or back, we choose back. This will make future "back" streams shorter

User Access Patterns • CT -> click through. CB -> click back. • 1. CT1 -> …->CTx where x >= 1 • 2. CB1 -> …->CBy ->CT0->…->CTz where y >=1, z >= 0

User Access Patterns - Example • [Figure: link graph over the pages A, B, C, D, E, F] • Order of pages in session: A->C->B->D->E

User Access Patterns - Example
• 1. A • Cache: A • Path: A
• 2. A->C • Step: CT • Cache: AC • Path: AC
• 3. C->B • Step: CB->CT • Cache: ACB • Path: AB
• 4. B->D • Step: CT • Cache: ACBD • Path: ABD

User Access Patterns - Example
• 5. D->E • Possible paths: 1. D->F->E 2. D->C->E 3. D->B->A->C->E • Step: CT->CT • Cache: ACBDE • Path: ABDCE

Path Completion • Path completion: Finding out important accesses that are not recorded in the log. • If a page request is made which is not directly linked to last page, the referrer log has to be checked. • Missing page references are inferred and added to the user session file.

Example • [Figure: site topology graph of lettered pages, with a legend distinguishing content pages, auxiliary pages, and multiple-purpose pages] • Session identification: A-B-F-O-G, A-D, A-B-C-J, L-R • Path completion: A-B-F-O-F-B-G, A-D, A-B-A-C-J, L-R
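The path completion results in this example can be reproduced with a simple backtracking sketch. It assumes the user returned via the "back" button to the most recent page that links to the new request. The link graph `example_links` is hypothetical: the site topology figure did not survive extraction, so only the links needed for the four sessions are listed.

```python
def complete_path(session, links):
    """Insert the backward steps that caching hid from the log.

    session: pages in logged order; links: page -> set of pages it links to.
    """
    if not session:
        return []
    completed = [session[0]]
    forward_path = [session[0]]       # pages on the current forward path
    for page in session[1:]:
        # Walk back until a page on the forward path links to the request.
        i = len(forward_path) - 1
        while i >= 0 and page not in links.get(forward_path[i], ()):
            i -= 1
        if i >= 0:
            # Add the inferred "back" steps, newest first.
            completed.extend(reversed(forward_path[i:-1]))
            del forward_path[i + 1:]
        else:
            forward_path = []         # no linking page found: treat as a new entry point
        forward_path.append(page)
        completed.append(page)
    return completed

# Hypothetical links, chosen to be consistent with the sessions on the slide.
example_links = {
    "A": {"B", "C", "D"}, "B": {"F", "G"}, "F": {"O"}, "C": {"J"}, "L": {"R"},
}
completed = ["-".join(complete_path(list(s), example_links))
             for s in ("ABFOG", "AD", "ABCJ", "LR")]
```

In a real setting the referrer log would be checked first, as the path completion slide notes; the backtracking rule is only the fallback.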

Summary of Sample Log Pre-processing Results
• Task: Result
• Clean log: A-B-L-F-A-B-R-C-O-J-G-A-D
• User identification: A-B-F-O-G-A-D; A-B-C-J; L-R
• Session identification: A-B-F-O-G; A-D; A-B-C-J; L-R
• Path completion: A-B-F-O-F-B-G; A-D; A-B-A-C-J; L-R

Transaction Identification • Goal of transaction identification: create meaningful clusters of references for each user. • Input to transaction identification process • All of the page references for a given user session. Let L be the user session file. • General transaction model • $t = \langle ip_t, uid_t, \{(l^t_1.url, l^t_1.time), \ldots, (l^t_m.url, l^t_m.time)\} \rangle$ • Apply a divide approach to generate transactions. • Three methods to identify transactions. • Reference length • Maximal forward reference • Time window
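Of the three methods, maximal forward reference is the easiest to sketch. The slides only name the method, so the version below follows the usual description of it (a transaction runs from the first page of a forward path up to the page just before the user moves backward) and should be read as an illustration under that assumption. The function name is hypothetical.

```python
def maximal_forward_references(path):
    """Split a completed path into maximal forward reference transactions."""
    transactions = []
    current = []            # pages on the current forward path
    moving_forward = False
    for page in path:
        if page in current:
            # Backward reference: the forward path just ended is maximal.
            if moving_forward:
                transactions.append(list(current))
            current = current[:current.index(page) + 1]
            moving_forward = False
        else:
            current.append(page)
            moving_forward = True
    if moving_forward and current:
        transactions.append(current)
    return transactions

example = maximal_forward_references(list("ABFOFBG"))
```

For the completed path A-B-F-O-F-B-G this yields the two transactions A-B-F-O and A-B-G.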

Transaction Identification by Reference Length • The amount of time a user spends on a page correlates with whether the page should be classified as an auxiliary or a content page for that user. • An approximate cut-off C between auxiliary and content reference lengths can be calculated. • In the reference length approach, the transaction definition adds the reference length of each page to its entry: • $t = \langle ip_t, uid_t, \{(l^t_1.url, l^t_1.time, l^t_1.length), \ldots, (l^t_m.url, l^t_m.time, l^t_m.length)\} \rangle$ • The length of each reference is calculated as the difference between the time of that reference and the time of the next one. • After determining the cut-off C • Auxiliary-content transaction: for $1 \le k \le m-1$: $l^t_k.length \le C$, and for $k = m$: $l^t_k.length > C$ • Content-only transaction: for $1 \le k \le m$: $l^t_k.length > C$
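The auxiliary-content rule can be sketched as a single pass over a session: pages are collected until one with a reference length above the cut-off closes the transaction. Two assumptions are made here: times are given in seconds, and the last page of a session, whose length cannot be computed, is treated as a content page. The function name is hypothetical.

```python
def reference_length_transactions(session, cutoff):
    """Split a session into auxiliary-content transactions.

    session: list of (url, time_in_seconds) in time order.
    cutoff: reference length C separating auxiliary from content pages.
    """
    transactions, current = [], []
    for k, (url, time) in enumerate(session):
        is_last = k == len(session) - 1
        # Reference length = time until the next request; unknown for the last page.
        length = None if is_last else session[k + 1][1] - time
        current.append((url, time, length))
        # A content page (length > C, or the assumed-content last page) ends the transaction.
        if is_last or length > cutoff:
            transactions.append(current)
            current = []
    return transactions

example_session = [("A", 0), ("B", 10), ("F", 15), ("O", 200), ("G", 210)]
example = reference_length_transactions(example_session, cutoff=60)
```

With a cut-off of 60 seconds, F (185 seconds) is the first content page, so the session splits into A-B-F and O-G.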

Use of content and structure in data cleaning • Structure • The structure of a web site is needed to analyze session and transactions • Hyper-tree of links between pages • Content • Contents of web pages visited can give hints for data cleaning and selection • Eg. Grouping web transactions by terminal page content. • Content of web pages gives a clue on type of page: navigation or content.

Data mining: pattern discovery • Kinds of mining activities • Clustering • Classification • Association mining • Sequential pattern analysis • Prediction

Clustering • Grouping together objects that have similar characteristics • Clustering of transactions • Grouping the same behaviours regardless of visitor or content. • Clustering of pages and paths • Grouping pages visited together, based on content and visits • Clustering of visitors • Grouping visitors with similar behaviour.

Classification • Classification of visitors • Categorizing or profiling visitors by selecting features that best describe the properties of their behaviour. • E.g., 25% of visitors who buy fiction books come from Ontario, are aged between 18 and 35, and visit after 5:00 PM. • The behaviour (i.e., class) of a visitor may change over time.

Association mining • Association of frequently visited pages • Pages visited in the same session constitute a transaction. • Relating pages that are often referenced together regardless of the order in which they are accessed (may not be hyper-linked) • Inter-session or intra-session association
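A first step toward association rules is counting which pages occur together in the same session, regardless of order. The sketch below computes the support of every page pair and keeps those above a threshold; it is a simplified stand-in for a full association rule miner, and the function name and threshold are illustrative.

```python
from collections import Counter
from itertools import combinations

def frequent_page_pairs(sessions, min_support):
    """Return {(page_a, page_b): support} for pairs seen in enough sessions."""
    if not sessions:
        return {}
    counts = Counter()
    for session in sessions:
        # Each session is one transaction; order and repeats are ignored.
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    total = len(sessions)
    return {pair: count / total
            for pair, count in counts.items()
            if count / total >= min_support}

example_sessions = [list("ABFOG"), list("AD"), list("ABCJ"), list("LR")]
pairs = frequent_page_pairs(example_sessions, min_support=0.5)
```

On these four sample sessions only the pair (A, B) reaches 50% support.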

Sequential Pattern Analysis • Sequential patterns are inter-session ordered sequences of page visits. Pages in a session are time-ordered sets of episodes by the same visitor. • (<A,B,C>, <A,B,C,E,F>, B, <A,B,C,E,F>) • (<A,B,C>, <E,F>, B, <A,*,F>),…

Pattern Analysis • The set of rules that can be discovered can be very large • Pattern analysis reduces the set of rules by filtering out uninteresting rules and directly pinpointing interesting rules. • SQL analysis • OLAP from data cube • Viewing data at different levels • Visualization

Web usage mining systems • General web usage mining: • WebLogMiner (Zaiane et al. 1998) • WUM (Spiliopoulou et al. 1998) • WebSIFT (Cooley et al. 1999) • Adaptive web sites (Perkowitz et al. 1998) • Personalization and recommendation • WebWatcher (Joachims et al. 1997) • Clustering of users (Mobasher et al. 1999) • Traffic and caching improvement • (Cohen et al. 1998)

Design of web log miner • [Figure: web log -> data cleaning -> database -> data cube -> sliced and diced data -> knowledge] • The web log is filtered to generate a relational database • A data cube is generated from the database • OLAP is used to drill down or roll up in the cube • Months, year, day • OLAM is used for mining interesting knowledge.
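Roll-up on a cube of hit counts can be illustrated with a plain dictionary keyed by dimension values. This is a toy sketch, not the design of any system named in these slides; the dimension names and the `(timestamp, domain)` input format are assumptions.

```python
from collections import Counter
from datetime import datetime

DIMENSIONS = ("year", "month", "day", "domain")

def build_cube(hits):
    """Count hits per (year, month, day, domain) cell from (timestamp, domain) pairs."""
    cube = Counter()
    for when, domain in hits:
        cube[(when.year, when.month, when.day, domain)] += 1
    return cube

def roll_up(cube, keep):
    """Aggregate the cube over every dimension not listed in `keep`."""
    indexes = [DIMENSIONS.index(name) for name in keep]
    rolled = Counter()
    for cell, count in cube.items():
        rolled[tuple(cell[i] for i in indexes)] += count
    return rolled

example_hits = [
    (datetime(1997, 4, 1, 0, 0), "edu"),
    (datetime(1997, 4, 1, 0, 3), "com"),
    (datetime(1997, 4, 2, 9, 0), "edu"),
]
cube = build_cube(example_hits)
by_month = roll_up(cube, ("year", "month"))
by_domain = roll_up(cube, ("domain",))
```

Drill-down is the reverse direction: going from `by_month` back to the day-level cells of `cube`.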

Data cleaning and transformation
• Web log: IP address, user, timestamp, method, file+parameters, status, size
• After cleaning and transformation: machine, Internet domain, user, day, month, year, hour, minute, seconds, method, file, parameters, status, size
• After adding site structure (relational database): machine, Internet domain, user, field site, day, month, year, hour, minute, seconds, resource, module/action, status, size, duration

Typical summaries • Request summary: • Request statistics for all modules/pages/files • Domain summary: • Request statistics from different domains • Event summary: • Statistics of the occurrence of all events/actions • Session summary: • Statistics of sessions • Bandwidth summary: • Statistics of generated network traffic • Error summary: • Statistics of all error messages • Referring organization summary: • Statistics of where the users came from • Agent summary: • Statistics of the use of different browsers, etc.

From OLAP to Mining • OLAP can answer questions such as: • Which components or features are the most/least used? • What is the distribution of network traffic over time (hour of the day, day of the week, month of the year, etc.)? • What is the user distribution over different domain areas? • Are there differences in access for users from different geographic areas, and what are they? • Some questions need further analysis: mining • In what context are the components or features used? • What are the typical event sequences? • Are there any general behaviour patterns across all users, and what are they? • What are the differences in usage and behaviour for different user populations? • Do user behaviours change over time, and how?

Web Log Data Mining • Data characterization • Class comparison • Association • Prediction • Classification • Time-series analysis • Web traffic analysis • Typical event sequence and user behaviour pattern analysis • Transition analysis • Trend analysis

Discussion • Analyzing the web access logs can help understand user behaviour and web structure, thereby improving the design of web collections and web applications, targeting potential e-commerce customers, etc. • Web log entries do not collect enough information. • Data cleaning and transformation are crucial and often require knowledge of the site structure (metadata). • OLAP provides data views from different perspectives and at different conceptual levels. • Web log data mining provides in-depth reports such as time-series analysis, associations, classification, etc.
