
Web Mining Research: A Survey



Web Mining Research: A Survey. Raymond Kosala and Hendrik Blockeel, ACM SIGKDD, July 2000. Presented by Shan Huang, 4/24/2007. Revised and presented by Fan Min, 4/22/2009. Outline: Introduction, Web Mining, Web Content Mining, Web Structure Mining, Web Usage Mining.


Web Mining Research: A Survey

Raymond Kosala and Hendrik Blockeel

ACM SIGKDD, July 2000

Presented by Shan Huang, 4/24/2007

Revised and presented by Fan Min, 4/22/2009


Outline

  • Introduction

  • Web Mining

  • Web Content Mining

  • Web Structure Mining

  • Web Usage Mining

  • Conclusion & Exam Questions


Four Problems

  • Finding relevant information

    • Low precision and unindexed information

  • Creating new knowledge out of available information on the web

  • Personalizing the information

    • Catering to personal preference in content and presentation

  • Learning about the consumers

    • What does the customer want to do?

    • Using web data to effectively market products and/or services


Other Approaches

Web mining is NOT the only approach

  • Database approach (DB)

  • Information retrieval (IR)

  • Natural language processing (NLP)

    • In-depth syntactic and semantic analysis

  • Web document community

    • Standards, manually appended meta-information, maintained directories, etc.


Direct vs. Indirect Web Mining

  • Web mining techniques can be used to solve the information overload problems:

    • Directly

      Attack the problem with web mining techniques

      E.g. newsgroup agent classifies news as relevant

    • Indirectly

      Used as part of a larger application that addresses these problems

      E.g. used to create index terms for a web search service


The Research

  • Converging research from databases, information retrieval, and artificial intelligence (specifically NLP and machine learning)

  • Focusing on research from the machine learning point of view


Outline

  • Introduction

  • Web Mining

  • Web Content Mining

  • Web Structure Mining

  • Web Usage Mining

  • Conclusion & Exam Questions


Web Mining: Definition

  • “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”

    • Can be viewed as four subtasks

    • Not the same as Information Retrieval

    • Not the same as Information Extraction


Web Mining: Subtasks

  • Resource finding

    • Retrieving intended documents

  • Information selection/pre-processing

    • Select and pre-process specific information from selected documents

  • Generalization

    • Discover general patterns within and across web sites

  • Analysis

    • Validation and/or interpretation of mined patterns


Web Mining: Not IR

  • Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible

  • Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)
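
The IR goal above (retrieve the relevant documents, avoid the non-relevant ones) is usually quantified as precision and recall. A minimal sketch, assuming hypothetical document IDs and a hand-labelled relevant set, neither of which comes from the slides:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result and ground truth (illustrative only).
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Low precision means many retrieved documents are irrelevant; low recall means relevant documents were missed, for example because they were never indexed.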


Web Mining: Not IE

  • Information extraction (IE) aims to extract the relevant facts from given documents

    • IE systems for the general Web are not feasible

    • Most focus on specific Web sites or content


Web Mining and Machine Learning

  • Machine learning is concerned with the development of algorithms and techniques that allow computers to "learn".

  • Web mining is NOT the same as learning from the Web.

  • Some applications of machine learning on the web are NOT Web Mining

  • Methods used for Web Mining are NOT limited to machine learning

  • However, there is a close relationship between web mining and machine learning


Web Mining: The Agent Paradigm

  • User Interface Agents

    • information retrieval agents, information filtering agents, & personal assistant agents.

  • Distributed Agents

    • distributed agents for knowledge discovery or data mining.

    • Problem solving by a group of agents

  • Mobile Agents


Web Mining: The Agent Paradigm

  • Content-based approach

    • The system analyzes item content and searches for items that match the user's preferences.

  • Collaborative approach

    • The system tries to find users with similar interests

    • Recommendations given based on what similar users did
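
The collaborative approach can be sketched as a nearest-neighbour recommender. This is only an illustration under stated assumptions: the users, pages, ratings, and the choice of cosine similarity are all hypothetical and are not taken from the survey.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts {item: rating}."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(target, ratings, k=1):
    """Recommend items the target has not seen, taken from its k most similar users."""
    others = [(cosine(ratings[target], r), name)
              for name, r in ratings.items() if name != target]
    neighbours = [name for _, name in sorted(others, reverse=True)[:k]]
    seen = set(ratings[target])
    scores = {}
    for name in neighbours:
        for item, rating in ratings[name].items():
            if item not in seen:
                scores[item] = scores.get(item, 0) + rating
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical page ratings per user.
ratings = {
    "alice": {"page_a": 5, "page_b": 4},
    "bob": {"page_a": 5, "page_b": 5, "page_c": 4},
    "carol": {"page_d": 5, "page_c": 1},
}
print(recommend("alice", ratings))
```

A content-based system would instead compare a description of each item (for example its words) against a profile of the user's own preferences, without looking at other users.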


Outline

  • Introduction

  • Web Mining

  • Web Content Mining

  • Web Structure Mining

  • Web Usage Mining

  • Conclusion & Exam Questions


Web Mining Categories

  • Web Content Mining

    • Discovering useful information from web contents/data/documents.

  • Web Structure Mining

    • Discovering the model underlying link structures (topology) on the Web. E.g. discovering authorities and hubs

  • Web Usage Mining

    • Making sense of data generated by Web surfers

    • Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.


Web Content Data Structure

  • Unstructured – free text

  • Semi-structured – HTML

  • More structured – Table or Database generated HTML pages

  • Multimedia data – receives less attention than text or hypertext


Outline

  • Introduction

  • Web Mining

  • Web Content Mining

  • Web Structure Mining

  • Web Usage Mining

  • Conclusion & Exam Questions


Web Content Mining: IR View

  • Unstructured Documents

    • Bag of words, or phrase-based feature representation

    • Features can be boolean or frequency based

    • Features can be reduced using different feature selection techniques

    • Word stemming, combining morphological variations into one feature
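
The representation choices above can be sketched in a few lines. The stemmer here is a deliberately crude suffix stripper used as a stand-in for a real algorithm such as Porter's, and the sample sentence is made up for illustration:

```python
import re
from collections import Counter

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer, not Porter's algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text, boolean=False):
    """Map a document to {term: weight}, ignoring word order."""
    tokens = [naive_stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(tokens)
    if boolean:
        return {term: 1 for term in counts}   # presence/absence features
    return dict(counts)                        # term-frequency features

doc = "Links linking linked pages"             # hypothetical document
print(bag_of_words(doc))
print(bag_of_words(doc, boolean=True))
```

Stemming collapses "links", "linking", and "linked" into one feature, which is one simple way to shrink the feature space before applying further feature selection.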


Web Content Mining: IR View

  • Semi-Structured Documents

    • Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks)

    • Uses common data mining methods (whereas unstructured might use more text mining methods)
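
As an illustration of structure-based features, the sketch below pulls the title, link targets, and anchor text out of an HTML page with Python's standard `html.parser` module. The class name `StructureFeatures` and the sample page are hypothetical; which structural features a given system actually uses is not specified in the slides.

```python
from html.parser import HTMLParser

class StructureFeatures(HTMLParser):
    """Collect the title text and (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []          # list of (href, anchor text) pairs
        self._in_title = False
        self._href = None        # href of the <a> currently open, if any
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor).strip()))
            self._href = None

# Hypothetical page used only for illustration.
page = ('<html><head><title>Web Mining</title></head>'
        '<body><p>See <a href="/survey">the survey</a>.</p></body></html>')
parser = StructureFeatures()
parser.feed(page)
print(parser.title, parser.links)
```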


Web Content Mining: DB View

  • Tries to infer the structure of a Web site or transform a Web site to become a database

    • Better information management

    • Better querying on the Web

  • Can be achieved by:

    • Finding the schema of Web documents

    • Building a Web warehouse

    • Building a Web knowledge base

    • Building a virtual database


Web Content Mining: DB View

  • Mainly uses the Object Exchange Model (OEM)

    • Represents semi-structured data (some structure, no rigid schema) by a labeled graph

  • Process typically starts with manual selection of Web sites for content mining

  • Main application: building a structural summary of semi-structured data (schema extraction or discovery)
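
To make the idea of a labeled graph with no rigid schema concrete, here is a small sketch. The data, the nested-list encoding, and the choice of "set of distinct label paths" as the structural summary are all assumptions made for illustration; real OEM data is a general graph and may contain cycles, which this tree-only sketch does not handle.

```python
# Hypothetical OEM-style data: an object is either an atomic value or a
# list of (label, child) edges. There is no fixed schema: the second
# "site" has no "email" edge and has a nested "contact" object instead.
data = [
    ("site", [("name", "Example A"), ("email", "a@example.org")]),
    ("site", [("name", "Example B"),
              ("contact", [("phone", "000-0000"), ("city", "Nowhere")])]),
]

def label_paths(obj, prefix=()):
    """Return the set of distinct label paths reachable from `obj`."""
    paths = set()
    if isinstance(obj, list):            # complex object: follow labelled edges
        for label, child in obj:
            path = prefix + (label,)
            paths.add(path)
            paths |= label_paths(child, path)
    return paths                          # atomic values contribute no paths

# A simple structural summary: every label path that occurs in the data.
summary = sorted(".".join(p) for p in label_paths(data))
for line in summary:
    print(line)
```

The summary plays the role of a discovered schema: it tells a query writer which paths exist without requiring every object to follow them.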


Outline

  • Introduction

  • Web Mining

  • Web Content Mining

  • Web Structure Mining

  • Web Usage Mining

  • Conclusion & Exam Questions


Web Structure Mining

  • Interested in the structure between Web documents (not within a document)

  • Inspired by the study of social networks and citation analysis

  • Example: PageRank – Google

  • Application: Discovering micro-communities in the Web

  • Measuring the “completeness” of a Web site
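
The PageRank example can be sketched as a power iteration over the link graph. This follows the commonly published formulation; the damping factor of 0.85, the iteration count, the handling of pages without out-links, and the tiny four-page graph are assumptions for illustration, not details from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no out-links): spread its rank evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link structure between four pages.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Only the links between documents are used; the text inside each page plays no role, which is what separates structure mining from content mining.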


Outline

  • Introduction

  • Web Mining

  • Web Content Mining

  • Web Structure Mining

  • Web Usage Mining

  • Conclusion & Exam Questions


Web Usage Mining

  • Tries to predict user behavior from interaction with the Web

  • Wide range of data (logs)

    • Web client data

    • Proxy server data

    • Web server data

  • Two common approaches

    • Map usage data into relational tables before using adapted data mining techniques

    • Use log data directly by utilizing special pre-processing techniques


Web Usage Mining

  • Typical problems: distinguishing among unique users, server sessions, episodes, etc., in the presence of caching and proxy servers

  • Often Usage Mining uses some background or domain knowledge

    E.g. site topology, Web content, etc.
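
One common pre-processing step is splitting a server log into sessions. The sketch below uses a 30-minute inactivity timeout, a widely used heuristic that is assumed here rather than taken from the slides; the log records and the `user_key` field are hypothetical.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity; an assumed heuristic

def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group (user_key, timestamp, url) records into per-user sessions.

    `user_key` is whatever identifies a visitor (for example IP address
    plus user agent). Behind a proxy several people may share one key,
    and cached pages never reach the log, so the result is approximate.
    """
    by_user = defaultdict(list)
    for user, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:       # long gap: close the session
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions

# Hypothetical log records: (user_key, seconds since start, requested URL).
log = [
    ("10.0.0.1", 0, "/"),
    ("10.0.0.1", 120, "/papers"),
    ("10.0.0.2", 60, "/"),
    ("10.0.0.1", 4000, "/contact"),
]
sessions = sessionize(log)
print(sessions)
```

Background knowledge such as site topology can refine this further, for instance by checking that consecutive pages in a session are actually linked.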


Web Usage Mining

  • Two main categories:

    • Learning a user profile (personalized)

      Web users would be interested in techniques that learn their needs and preferences automatically

    • Learning user navigation patterns (impersonalized)

      Information providers would be interested in techniques that improve the effectiveness of their Web site or steer users towards the goals of the site


Outline

  • Introduction

  • Web Mining

  • Web Content Mining

  • Web Structure Mining

  • Web Usage Mining

  • Conclusion & Exam Questions


Conclusions

  • The paper tried to resolve confusion with regard to the term Web Mining

    • Differentiated from IR and IE

  • Suggested three Web mining categories

    • Content, Structure, and Usage Mining

  • Briefly described approaches for the three categories

  • Explored connection with agent paradigm


Exam Question #1

  • Question: Outline the main characteristics of Web information.

  • Answer: Web information is huge, diverse, and dynamic.


Exam Question #2

  • Question: How can data mining techniques be used in Web information analysis? Give at least two examples.

    • Classification: a decision tree or Naïve Bayes classifier can be applied to server logs to discover the profiles of users belonging to a particular class

    • Clustering: Clustering can be used to group users exhibiting similar browsing patterns.

    • Association Analysis: association analysis can be used to relate pages that are most often referenced together in a single server session.
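
The association analysis example can be sketched by counting which pages appear together in the same session. This is a simplified pair-counting illustration, not a full Apriori implementation; the sessions and the 0.5 support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_page_pairs(sessions, min_support=0.5):
    """Return {page pair: support} for pairs seen in at least `min_support` of sessions."""
    pair_counts = Counter()
    for pages in sessions:
        # Count each unordered pair once per session.
        for pair in combinations(sorted(set(pages)), 2):
            pair_counts[pair] += 1
    n = len(sessions)
    return {pair: count / n for pair, count in pair_counts.items()
            if count / n >= min_support}

# Hypothetical server sessions, each a list of requested pages.
sessions = [
    ["/", "/products", "/cart"],
    ["/", "/products"],
    ["/products", "/cart"],
    ["/", "/about"],
]
pairs = frequent_page_pairs(sessions)
print(pairs)
```

Pairs with high support are candidates for cross-linking or prefetching, since visitors tend to request them together.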


Exam Question #3

  • Question: What are the three main areas of interest for Web mining?

  • Answer: (1) Web Content

    (2) Web Structure

    (3) Web Usage



Thank you!

And Raymond Kosala, Hendrik Blockeel

And Shan Huang!

