Web Mining - I



Slide 1:Web Mining - I

Group Number 3 Course Instructor: Prof. Anita Wasilewska State University of New York at Stony Brook

Slide 2:Introduction to Web Mining

By Rajat Gogri

Slide 3:References

R. Kosala and H. Blockeel. Web Mining Research: A Survey. SIGKDD Explorations, 2(1):1-15, 2000.
R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining world wide web browsing patterns. Journal of Knowledge and Information Systems, 1(1):5-32, 1999.
S. Chakrabarti. Data mining for hypertext: A tutorial survey. ACM SIGKDD Explorations, 1(2):1-11, 2000.
S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data.

Slide 4:Introduction

Why do we need it? What is it? How does it differ from classical data mining? How big is the Web? What are the problems? What is the role of web mining? What are the subtasks of web mining? What is the web mining taxonomy?

Slide 5:Why we need Web Mining?

Explosive growth of the amount of content on the internet. Web search engines return thousands of results, so they are difficult to browse. Online repositories are growing rapidly. Using web mining, web documents can easily be BROWSED, ORGANISED and CATALOGED with minimal human intervention. # With the recent explosive growth of the amount of content on the Internet, it has become increasingly difficult for users to find and utilize information and for content providers to classify and catalog documents. # Traditional web search engines often return hundreds or thousands of results for a search, which is time consuming for users to browse. On-line libraries, search engines, and other large document repositories (e.g. customer support databases, product specification databases, press release archives, news story archives, etc.) are growing so rapidly that it is difficult and costly to categorize every document manually. # In order to deal with these problems, researchers look toward automated methods of working with web documents so that they can be more easily browsed, organized, and cataloged with minimal human intervention.

Web mining - the use of data mining techniques to automatically discover and extract information from web documents and services

Slide 6:What is it?

We can simply say that web mining is an extension of KDD (knowledge discovery in databases) applied to web data.

The web is not a relation Textual information and linkage structure Usage data is huge and growing rapidly Google’s usage logs are bigger than their web crawl Data generated per day is comparable to largest conventional data warehouses Ability to react in real-time to usage patterns No human in the loop

Slide 7:How does it differ from “classical” Data Mining?

# The web is not a relation -- data mining input is usually very well structured, in tables etc.; web data is not well structured. It is in the form of text and links (HTML tags etc.), so we can call it semi-structured. # Usage data is huge and growing rapidly.

How big is the Web?
Number of pages: technically infinite, because of dynamically generated content
Lots of duplication (30-40%)
Best estimate of "unique" static HTML pages comes from search engine claims: Google = 8 billion, Yahoo = 20 billion; lots of marketing hype
76,184,000 web sites (Feb 2006), Netcraft survey: http://news.netcraft.com/archives/web_server_survey.html
Web Mining: Problems
The "abundance" problem
Limited coverage of the Web
Limited query interface based on keyword-oriented search
Limited customization to individual users
Dynamic and semi-structured

Slide 10:# Growing very rapidly # Limited coverage: hidden web sources; the majority of data is in databases # Interface is based on keyword-oriented search # Limited customization to individual users # Dynamic and semi-structured

Role of web mining
Finding relevant information
Creating knowledge from information available
Personalization of the information
Learning about customers / individual users
Web Mining: Subtasks
Resource Finding - task of retrieving intended web documents
Information Selection & Pre-processing - automatic selection and pre-processing of specific information from retrieved web resources
Generalization - automatic discovery of patterns in web sites
Analysis - validation and/or interpretation of mined patterns

Slide 12:Resource finding: by resource finding we mean retrieving data, either online or offline, from the text sources available on the web, such as electronic newsletters, electronic newswires, newsgroups, and the text content of HTML documents obtained by removing HTML tags. We can also include text sources that were originally not accessible from the web but are now, such as online text made available for research purposes. Information Selection and Pre-processing: a kind of transformation process applied to the data retrieved by the IR process. These transformations could be pre-processing steps such as removing stop words, stemming, finding phrases, etc. Generalization. Analysis.

Slide 13:Web Mining Taxonomy

The web involves three kinds of data: content of the web page, web logs to see usage patterns, and web structure. Web Content Mining ================ from search results, categorize the documents using phrases in titles and snippets. Web Structure ============ web structure mining tries to discover the model underlying the link structures of the web. The model is based on the topology of the hyperlinks, with or without descriptions of the links. It can be used to categorize web pages and is useful for generating information such as the similarity and relationships between different web sites. #### Web content mining and structure mining utilize the real or primary data on the web, while usage mining uses the data generated by users' interactions with the web. #### Web Usage Mining =============== tries to make sense of the data generated by web surfers' sessions and behaviors. Analyzes access patterns of a user to improve response.

Slide 14:Web Mining Classification

By Pranav Moolwaney

Slide 15:References

http://en.wikipedia.org/wiki/Web_mining
http://en.wikipedia.org/wiki/Shop_bot
Y. S. Mareek and I. Z. B. Shaul. Automatically organizing bookmarks per contents. Proc. Fifth International World Wide Web Conference, May 6-10, 1996.
Cooley, R., B. Mobasher, et al. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web. Proc. IEEE Intl. Conf. Tools with AI, Newport Beach, CA, pp. 558-567, 1997.

Slide 16:Classification

[Taxonomy diagram: Web Mining → Web Content, Web Usage, Web Structure]

Slide 17:Web Content Mining

[Taxonomy diagram: Web Content Mining → Agent Based Approach (Intelligent Search Agent, Information Filtering & Categorization, Personalized Web Agent), Database Approach (Multilevel Databases, Web Query Systems)]

Slide 18:Intelligent Search Agents

These agents concentrate on searching for relevant information, using the characteristics of a particular domain to interpret and organize the collected information. They can be further classified into two types: Interpretation Based on Pre-Specified Information - examples: Harvest, FAQFinder, Information Manifold, OCCAM. Interpretation Based on Unfamiliar Sources - example: ShopBot. Cooley, R., B. Mobasher, et al. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web

Slide 19:ShopBot

A ShopBot is an autonomous software agent that combs the internet, providing users with low-price products or product recommendations. A ShopBot basically looks for product information from a variety of vendor sites using general information about the product domain. The following example displays a ShopBot at www.allbookstores.com. http://en.wikipedia.org/wiki/Shop_bot

Slide 23:Information Filtering & Categorization

This makes use of various information retrieval techniques and characteristics of hypertext web documents to interpret and categorize data. Examples: HyPursuit, BO (Bookmark Organizer). Cooley, R., B. Mobasher, et al. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web

Slide 24:Bookmark Organizer (BO)

Makes use of hierarchical clustering techniques and involves user interaction to organize a collection of web documents. It operates in two modes: Automatic Manual Frozen Nodes: In a hierarchical structure, if we freeze a node N, then the subtree rooted at N represents a coherent group of documents. Y. S. Mareek and I. Z. B. Shaul. Automatically organizing bookmarks per contents.

Slide 25:Additions & Deletions in BO

For addition, we can use either of the following two mechanisms: Fully Automatic: makes use of an extremely precise search algorithm to find the most relevant frozen cluster to insert into. Semi Automatic: insert the bookmark, climb up the tree to find the closest frozen ancestor, and then re-cluster the subfolder. When we delete, we must re-cluster the containing subfolder. Y. S. Mareek and I. Z. B. Shaul. Automatically organizing bookmarks per contents.

Slide 26:Personalized Web Agents

This category of Web agents learns user preferences and discovers Web information sources based on these preferences and those of other individuals with similar interests. Examples: WebWatcher, PAINT, Syskill & Webert, GroupLens, Firefly. Cooley, R., B. Mobasher, et al. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web

Slide 27:Multilevel Databases

Slide 28:Levels of a MLDB

Layer 0 : Unstructured, massive and global information base. Layer 1: Derived from lower layers. Relatively structured. Obtained by data analysis, transformation & Generalization. Higher Layers (Layer n): Further generalization to form smaller, better structured databases for more efficient retrieval. Cooley, R., B. Mobasher, et al. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web

Slide 29:Web Query System

These systems attempt to make use of: Standard database query language – SQL Structural information about web documents Natural language processing for queries made in www searches. Examples: WebLog: Restructuring extracted information from Web sources. W3QL: Combines structure query (organization of hypertext) and content query (information retrieval techniques). Cooley, R., B. Mobasher, et al. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web

Slide 30:Architecture of a Global MLDB

Slide 31:Web Mining Classification cntd...

By Dhiraj Chawla

Slide 32:References

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30
David Gibson, Jon Kleinberg, and Prabhakar Raghavan. Inferring web communities from link topology. In Conference on Hypertext and Hypermedia. ACM, 1998.
www.iprcom.com/papers/pagerank/
http://maya.cs.depaul.edu/~mobasher/webminer/survey/node23.html

Slide 33:Mining the World Wide Web

Web Structure Mining: using links - use interconnections between web pages to give weight to pages. Algorithms: PageRank (Brin et al., 1998), CLEVER (Chakrabarti et al., 1998). [Taxonomy diagram: Web Mining → Web Content Mining (Agent Based Approach, Database Approach), Web Structure Mining, Web Usage Mining]

Slide 34:Mining the World Wide Web

[Taxonomy diagram: Web Mining → Web Content Mining (Agent Based Approach, Database Approach), Web Structure Mining, Web Usage Mining → Customized Usage Tracking]

Slide 35:Mining the World Wide Web

[Taxonomy diagram: Web Mining → Web Content Mining (Agent Based Approach, Database Approach), Web Structure Mining, Web Usage Mining → General Access Pattern Tracking]

Slide 36:Web Structure Mining

Different Algorithms for Web Structures: Page-Rank Method Sergey Brin and Lawrence Page: The anatomy of a large-scale hypertextual web search engine. In Proc. Of WWW, pages 107–117, Brisbane, Australia, 1998. CLEVER Method http://www.almaden.ibm.com/projects/clever.shtml

Slide 37:Page-Rank Method

Introduced by Brin and Page (1998); used in the Google search engine. Mines the hyperlink structure of the web to produce a 'global' importance ranking of every web page; web search results are returned in rank order. Treats links like academic citations. Assumption: highly linked pages are more 'important' than pages with few links. A page has a high rank if the sum of the ranks of its back-links is high. http://infolab.stanford.edu/pub/papers/google.pdf

[Figure: backlink link structure of the Web] http://infolab.stanford.edu/pub/papers/google.pdf

Slide 39:Page Rank: Computation

Assume: PR(Tn): PageRank of web page Tn. C(Tn): number of outgoing links on page Tn. PR(Tn)/C(Tn): share of the vote that page A will get. d: damping factor (between 0 and 1). Page Rank computation, where T1 … Tn are the pages linking to A: PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) http://infolab.stanford.edu/pub/papers/google.pdf
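The formula above can be sketched as a small Python fixed-point iteration. The three-page link graph is hypothetical, this is not Google's implementation, and dangling pages (no outgoing links) are deliberately not handled.

```python
# Illustrative sketch of the slide's PageRank formula over a tiny made-up graph.
# links maps each page to the pages it links to; every page here has out-links.

def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over the back-links T of A."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}            # start every page with rank 1
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # back-links of `page`: all pages t that link to it
            share = sum(pr[t] / len(links[t]) for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * share
        pr = new_pr
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

With this non-normalized variant the ranks converge so that their sum equals the number of pages, and C, which is linked by both A and B, ends up with the highest rank.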

Slide 40:Page Rank: Results

Google utilizes a number of factors to rank search results: proximity, anchor text, PageRank. The benefits of PageRank are greatest for underspecified queries; for example, a 'Stanford University' query using PageRank lists the university home page first. http://infolab.stanford.edu/pub/papers/google.pdf

Slide 41:Page Rank: Advantages

Global ranking of all web pages - regardless of their content, based solely on their location in the web's link structure. Higher quality search results - central, important, and authoritative web pages are given preference. Helps find representative pages to display for a cluster center. Other applications: traffic estimation, back-link prediction, user navigation, personalized page rank. http://infolab.stanford.edu/pub/papers/google.pdf

Slide 42:CLEVER Method

CLient-side EigenVector-Enhanced Retrieval. Developed by a team of IBM researchers at the IBM Almaden Research Centre. Ranks pages primarily by measuring links between them. A continued refinement of HITS (Hypertext Induced Topic Selection). Basic principles - authorities and hubs: good hubs point to good authorities; good authorities are referenced by good hubs. http://www.almaden.ibm.com/projects/clever.shtml
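The hub/authority principle can be sketched as the basic HITS-style iteration below. The link graph is hypothetical, and real CLEVER layers anchor-text weighting and pagelets on top of this core loop.

```python
# Minimal sketch of the hub/authority iteration underlying HITS (which CLEVER
# refines). links maps each page to the pages it links to; graph is made up.
import math

def hits(links, iterations=20):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # good authorities are referenced by good hubs
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # good hubs point to good authorities
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        # normalize so the scores do not grow without bound
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

links = {"hub1": ["site1", "site2"], "hub2": ["site1"], "site1": [], "site2": []}
hub, auth = hits(links)
```

Here site1, referenced by both hubs, gets the highest authority score, and hub1, which points to both authorities, gets the highest hub score.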

Slide 43:Problems Prior to CLEVER

Because textual content is ignored, some features of the web lead to problems: HITS returns good resources for a more general topic when query topics are narrowly focused. HITS occasionally drifts when hubs discuss multiple topics. Pages from a single web site often take over a topic, and often use the same HTML template, therefore pointing to a single popular site irrelevant to the query topic. http://www.almaden.ibm.com/projects/clever.shtml

Slide 44:CLEVER: Solution

Extension 1: Anchor Text - using the text that surrounds hyperlink definitions (href's) in web pages, often referred to as 'anchor text'; boost the weights of links that occur near instances of the query terms. Extension 2: Mini Hubs/Pagelets - breaking a large hub into smaller units; treat contiguous subsets of links as mini-hubs or 'pagelets'; contiguous sets of links on a hub page are more focused on a single topic than the entire page. http://www.almaden.ibm.com/projects/clever.shtml

Slide 45:CLEVER: The Process

Starts by collecting a set of pages. Gathers all pages linked from the initial set, plus any pages linking to them. Ranks the result by counting links. Because links are noisy and it is not clear which pages are best, the scores are recalculated: pages with the most links are established as most important, and their links transmit more weight. The calculation is repeated a number of times until the scores are refined. http://www.almaden.ibm.com/projects/clever.shtml

Slide 46:CLEVER: Advantages

Used to populate categories of different subjects with minimal human assistance Able to leverage links to fill category with best pages on web Can be used to compile large taxonomies of topics automatically Emerging new directions: Hypertext classification, focused crawling, mining communities http://www.almaden.ibm.com/projects/clever.shtml

Slide 47:Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data

J. Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan Computer Science Department University of Minnesota Proceedings from SIGKDD Exploration, vol. 2, issue 1, 1999

Slide 48:Web Usage Mining

Web usage mining, also known as Web log mining, applies mining techniques to discover interesting usage patterns in the data derived from users' interactions while surfing the web - mining Web log records to discover user access patterns of Web pages. http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf

Slide 49:Web Data

Content, Structure, Usage, User Profile. Web logs provide rich information about Web dynamics. A typical Web log entry includes the URL requested, the IP address from which the request originated, and a timestamp. http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf
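Extracting those three fields can be sketched as parsing a Common Log Format line; the sample entry below is invented, and real log formats vary by server configuration.

```python
# Hedged sketch: pull the URL, IP address, and timestamp the slide mentions
# out of a Common Log Format line. The sample log line is made up.
import re
from datetime import datetime

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+) \S+'
)

def parse_entry(line):
    m = LOG_PATTERN.match(line)
    if not m:
        return None           # line does not follow Common Log Format
    return {
        "ip": m.group("ip"),
        "url": m.group("url"),
        "timestamp": datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z"),
        "status": int(m.group("status")),
    }

entry = parse_entry(
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
)
```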

Slide 51:Web Usage Mining – Three Phases

http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf

Content Preprocessing - converting text, images, scripts and other files into forms that can be used by usage mining. Structure Preprocessing - the structure of a website is formed by the hyperlinks between page views; structure preprocessing can be done by parsing and reformatting this information. Usage Preprocessing - the most difficult task in the usage mining process; data cleaning techniques eliminate the impact of irrelevant items on the analysis result.

Slide 52:Preprocessing

http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf

Slide 53:Pattern Discovery

Statistical Analysis - the most common method to extract knowledge about visitors to a Web site. Association Rules - refer to sets of pages that are accessed together with a support value exceeding some specified threshold. Clustering - a technique to group together users or data items (pages) with similar characteristics. http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf

Slide 54:Classification - helps to establish a profile of users belonging to a particular class or category. Sequential Patterns - attempt to find inter-session patterns. Dependency Modeling - the goal is to develop a model capable of representing dependencies.

Pattern Discovery http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf

Slide 55:Pattern Analysis

Pattern Analysis - the final stage of Web usage mining. It eliminates irrelevant rules or patterns and extracts the interesting rules or patterns from the output of the pattern discovery process. Mechanisms used: SQL, OLAP, visualization, etc. http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf

Slide 56:Techniques for Web usage mining

Construct a multidimensional view on the Weblog database. Perform multidimensional OLAP analysis to find the top N users, top N accessed Web pages, most frequently accessed time periods, etc. Perform data mining on Weblog records: find association patterns, sequential patterns, and trends of Web accessing. May need additional information, e.g., user browsing sequences of the Web pages in the Web server buffer. Conduct studies to analyze system performance and improve system design via Web caching, Web page prefetching, and Web page swapping.
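The simplest of the summaries above (top-N pages, busiest time periods) can be sketched with in-memory counters over hypothetical parsed log records; a real system would roll these dimensions up in an OLAP data cube instead.

```python
# Toy sketch of top-N summaries over parsed web log records. The records are
# hypothetical stand-ins for the output of a log-parsing step.
from collections import Counter

records = [
    {"user": "u1", "url": "/", "hour": 13},
    {"user": "u2", "url": "/", "hour": 13},
    {"user": "u1", "url": "/faq.html", "hour": 14},
]

# top N accessed pages, and most frequently accessed hour of day
top_pages = Counter(r["url"] for r in records).most_common(2)
top_hours = Counter(r["hour"] for r in records).most_common(1)
```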

Slide 57:WEBMINER: introduces a general architecture for Web usage mining, automatically discovering association rules and sequential patterns from server access logs; proposes an SQL-like query mechanism for querying the discovered knowledge in the form of association rules and sequential patterns. WebLogMiner: the Web log is filtered to generate a relational database; data mining is performed on a web log data cube and the web log database.

Software for Web Usage Mining http://maya.cs.depaul.edu/~mobasher/papers/webminer-kais.pdf http://www-sop.inria.fr/axis/Publications/uploads/pdf/arnoux_synasc03.pdf

Slide 58:WEBMINER

SQL-like query. A framework for Web mining: the application of data mining and knowledge discovery techniques (association rules and sequential patterns) to Web data. Association rules, using the Apriori algorithm: 40% of clients who accessed the Web page with URL /company/products/product1.html also accessed /company/products/product2.html. Sequential patterns: 60% of clients who placed an online order in /company/products/product1.html also placed an online order in /company/products/product4.html within 15 days. http://maya.cs.depaul.edu/~mobasher/papers/webminer-kais.pdf
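The frequent-itemset counting behind such association rules can be sketched over hypothetical page-view sessions. Unlike real Apriori, this toy version enumerates candidate itemsets naively instead of pruning via the downward-closure property, and the sessions and support threshold are made up.

```python
# Toy sketch of frequent-itemset discovery over user sessions (sets of pages
# visited), in the spirit of WEBMINER's use of Apriori. Sessions are invented.
from itertools import combinations

def frequent_itemsets(sessions, min_support=0.4, max_size=2):
    """Return {itemset tuple: support} for itemsets above min_support."""
    n = len(sessions)
    items = sorted({page for s in sessions for page in s})
    frequent = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            # support = fraction of sessions containing every page in candidate
            support = sum(1 for s in sessions if set(candidate) <= s) / n
            if support >= min_support:
                frequent[candidate] = support
    return frequent

sessions = [
    {"/product1.html", "/product2.html"},
    {"/product1.html", "/product2.html", "/faq.html"},
    {"/product1.html"},
    {"/faq.html"},
]
freq = frequent_itemsets(sessions)
```

From such itemsets, a rule like "clients who accessed product1 also accessed product2" gets its confidence as support(both) / support(product1).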

Slide 59:WebLogMiner

Database construction from server log file: data cleaning data transformation Multi-dimensional web log data cube construction and manipulation Data mining on web log data cube and web log database http://www-sop.inria.fr/axis/Publications/uploads/pdf/arnoux_synasc03.pdf

Slide 60:Mining the World-Wide Web

Design of a Web Log Miner: the Web log is filtered to generate a relational database; a data cube is generated from the database; OLAP is used to drill-down and roll-up in the cube; OLAM is used for mining interesting knowledge. [Pipeline diagram: Web log → (1) Data Cleaning → Database → (2) Data Cube Creation → Data Cube → (3) OLAP → Sliced and diced cube → (4) Mining → Knowledge] http://www-sop.inria.fr/axis/Publications/uploads/pdf/arnoux_synasc03.pdf

Slide 61:Web Usage Mining

Applications Target potential customers for electronic commerce Enhance the quality and delivery of Internet information services to the end user Improve Web server system performance Identify potential prime advertisement locations Facilitates personalization/adaptive sites Improve site design Fraud/intrusion detection Predict user’s actions (allows prefetching)

Slide 62:WebSIFT : The web site information filter system

Robert Cooley, Pang-Ning Tan, Jaideep Srivastava Computer Science Department University of Minnesota Proceedings of the Web Usage Analysis and User Profiling Workshop, WEBKDD workshop, August 1999. By Abhijit Aparadh

Slide 63:Introduction

Web usage mining is the application of data mining techniques to large web data repositories in order to extract usage patterns. We need to find interesting patterns rather than simply generating them, and to quantify what is considered uninteresting in order to form a basis for comparison. Web usage mining uses three types of domain information: usage, content, and structure. WebSIFT uses content and structure to identify interesting results in usage mining data.

Slide 64:Web usage mining

Helps in designing complex websites. Discovers frequent itemsets, association rules, clusters of similar web pages, path analysis, etc. Uses support and confidence to restrict discovered rules. A high threshold rarely discovers any new knowledge; a low threshold yields unmanageably many rules.

Slide 65:“Interesting” knowledge

Novelty and unexpectedness of a rule: unexpectedness is deviation from a set of beliefs. A rule that doesn't contradict existing beliefs but points out a relationship that hadn't yet been considered is also interesting.

Slide 66:WebSIFT architecture

Input: three types of server logs (access, referrer, and agent), the HTML files that make up the site, and optional data such as registration data. Preprocessing: this phase uses the input data to create a user session file, which gives an idea of user browsing behavior, based on predefined methods and heuristics. Knowledge Discovery: generation of general usage statistics such as hits per page, most frequent page, common start page, average time spent on a page, etc. This information is fed to the pattern analysis tools.

http://www.cs.umn.edu/research/websift/papers/webkdd99.ps
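The user-session-file step above can be illustrated with the common timeout heuristic for splitting one visitor's requests into sessions. The 30-minute threshold and the request data are assumptions for illustration, not WebSIFT's exact method.

```python
# Hedged sketch of timeout-based sessionization used in usage preprocessing:
# a gap longer than `timeout` between a visitor's requests starts a new session.
from datetime import datetime, timedelta

def sessionize(requests, timeout=timedelta(minutes=30)):
    """requests: list of (ip, timestamp, url), sorted by timestamp per ip."""
    sessions = {}           # ip -> list of sessions, each a list of urls
    last_seen = {}          # ip -> timestamp of that visitor's previous request
    for ip, ts, url in requests:
        if ip not in sessions or ts - last_seen[ip] > timeout:
            sessions.setdefault(ip, []).append([])   # start a new session
        sessions[ip][-1].append(url)
        last_seen[ip] = ts
    return sessions

t0 = datetime(2000, 1, 1, 12, 0)
requests = [
    ("10.0.0.1", t0, "/"),
    ("10.0.0.1", t0 + timedelta(minutes=5), "/products.html"),
    ("10.0.0.1", t0 + timedelta(hours=2), "/"),   # long gap -> new session
]
sessions = sessionize(requests)
```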

Slide 68:Information filtering

Content and structure provide evidence about beliefs, and content and structure data can be used as surrogates for domain knowledge. Links between pages show that they are related; the strength of the evidence that a set of pages is related is proportional to the strength of the topological connection between them. The table gives examples of interesting beliefs in the web usage mining domain.

Slide 69:Interesting beliefs in web usage mining domain

http://www.cs.umn.edu/research/websift/papers/webkdd99.ps

Slide 70:Methods to find “interesting” results

BME: Beliefs with Mined Evidence - declare itemsets that contain pages not directly connected to be interesting: a set of pages has no domain or existing evidence of being related, but there is mined evidence. BCE: Beliefs with Contradicting Evidence - the absence of certain frequent itemsets is evidence against a belief that pages are related: if domain evidence suggests the pages are related (linked), then the absence of a frequent itemset can be interesting.
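The two filters can be sketched as a simple boolean comparison of mined evidence (frequent itemsets) against domain evidence (hyperlinks). The page pairs are hypothetical and WebSIFT's actual filter is richer, but the BME/BCE logic is as described above.

```python
# Simplified boolean sketch of WebSIFT-style information filtering over
# hypothetical pairs of pages: compare frequent itemsets with site links.

def interesting(frequent_pairs, linked_pairs):
    bme = []  # Beliefs with Mined Evidence: frequent together, but not linked
    bce = []  # Beliefs with Contradicting Evidence: linked, but never frequent
    for a, b in frequent_pairs:
        if (a, b) not in linked_pairs and (b, a) not in linked_pairs:
            bme.append((a, b))
    for a, b in linked_pairs:
        if (a, b) not in frequent_pairs and (b, a) not in frequent_pairs:
            bce.append((a, b))
    return bme, bce

frequent = [("/a.html", "/c.html")]    # pages users access together
links = [("/a.html", "/b.html")]       # hyperlinks present in the site
bme, bce = interesting(frequent, links)
```

Here the unlinked-but-co-accessed pair lands in BME, and the linked-but-never-co-accessed pair lands in BCE.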

Slide 71:Conclusions and future work

Conclusions: the simplest use of structural information to represent domain knowledge is highly effective in filtering discovered rules. Experimental results: of 700 discovered frequent itemsets, 21 interesting itemsets were identified; of these 21, 2 identified out-of-date information and 1 pointed out poor design. Future work: filtering frequent itemsets, sequential patterns, and clusters discovered from usage data using both structure and content data; to extend the simple boolean logic in BME and BCE, probabilities and fuzzy logic can be used in the information filter.

Slide 72:Thank you!
