WEB MINING AND APPLICATIONS

Pallavi Tripathi 105956127

Vaishali Kshatriya 105951122

Mehru Anand 106113525

Minnie Virk 106113516

REFERENCES
  • Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber
  • Presentation Slides of Prof. Anita Wasilewska
  • http://www.cs.rpi.edu/~youssefi/research/VWM/
  • http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
  • http://www.galeas.de/webimining.html
  • http://www.cs.helsinki.fi/u/gionis/seminar_papers/zaki00spade.ps

CSE:634 Web Mining

CITATIONS
  • Amir H. Youssefi, David J. Duke, Mohammed J. Zaki, Ephraim P. Glinert, Visual Web Mining 13th International World Wide Web Conference (poster proceedings), New York, NY, May 2004.
  • Amir H. Youssefi, David Duke, Ephraim P. Glinert, and Mohammed J. Zaki, Toward Visual Web Mining, 3rd International Workshop on Visual Data Mining (with ICDM'03), Melbourne, FL, November 2003.

With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools in finding the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge.

http://www.galeas.de/webimining.html

WHAT IS WEB MINING?

Web Mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web.

AREAS OF CLASSIFICATION
  • WEB CONTENT MINING is the process of extracting knowledge from the content of documents or their descriptions.
  • WEB STRUCTURE MINING is the process of inferring knowledge from the World Wide Web organization and links between references and referents in the Web.
  • WEB USAGE MINING, also known as WEB LOG MINING, is the process of extracting interesting patterns in web access logs.
  • In addition to these three web mining types, there are other helpful approaches for web knowledge discovery, such as information visualization, which helps us to understand the complex relationships and structures of many search results.

http://www.galeas.de/webimining.html

TOPICS COVERED

This presentation covers the following algorithms related to various aspects of web mining:

  • SPADE algorithm and its applications in Visual Web Mining
  • Sentiment Classification
  • Community Trawling Algorithm

VISUAL WEB MINING

Visual web mining applies information visualization techniques to the results of web mining, in order to amplify the perception of extracted patterns and to visually explore new ones in the web domain.

The application domains are web usage mining and web content mining.

http://www.cs.rpi.edu/~youssefi/research/VWM/

APPROACH USED
  • Make personalized results for targeted web surfers
  • Use data mining algorithms for extracting new insight and measures
  • Employ a database server and relational query language as a means to submit specific queries against data
  • Utilize visualization to obtain an overall picture

http://www.cs.rpi.edu/~youssefi/research/VWM/

SPADE OVERVIEW
  • Proposed by Mohammed J. Zaki
  • Sequential PAttern Discovery using Equivalence classes
  • An algorithm based on Apriori for fast discovery of frequent sequences
  • Needs three database scans in order to extract sequential patterns
  • Given: a database of customer transactions, each with the following attributes: sequence-id or customer-id, transaction-time, and the items involved in the transaction.
  • The aim is to obtain typical behaviors according to the user's viewpoint.

http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf

DEFINITIONS
  • Item : Can be considered as the object bought by a customer, or the page requested by the user of a website, etc.
  • Itemset: An itemset is the set of items that are grouped by timestamp.
  • Data Sequence: Sequence of itemsets associated with a customer.
  • Sequential Mining: Discovering frequent sequences of attribute sets over time in large databases.
  • Frequent Sequential Pattern: Sequence whose statistical significance in the database is above a user-specified threshold.

http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf

SPADE ALGORITHM
  • The first scan finds the frequent items
  • The second scan finds the frequent sequences of length 2
  • The last scan associates each frequent sequence of length 2 with a table (id-list) of the corresponding sequence ids and itemset ids in the database
  • Based on this representation in main memory, the support of a candidate sequence of length k is obtained by joining the tables of the frequent sequences of length (k-1) that generate the candidate

http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
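The join step described above can be illustrated with a small sketch. It assumes each id-list is a list of (sequence_id, event_id) pairs; the function names and the sample id-lists are hypothetical and chosen only for illustration, not taken from the SPADE paper or the slides.

```python
def temporal_join(idlist_a, idlist_b):
    """Join two id-lists of (sequence_id, event_id) pairs.

    Returns the id-list of the sequence "a followed by b": every
    occurrence of b that comes strictly after some occurrence of a
    within the same data sequence.
    """
    first_a = {}
    for sid, eid in idlist_a:
        # Only the earliest occurrence of a matters for "b after a".
        if sid not in first_a or eid < first_a[sid]:
            first_a[sid] = eid
    return sorted(
        (sid, eid) for sid, eid in set(idlist_b)
        if sid in first_a and eid > first_a[sid]
    )


def support(idlist):
    """Support is the number of distinct sequence ids in the id-list."""
    return len({sid for sid, _ in idlist})


# Hypothetical id-lists for two items A and B (illustrative data only).
idlist_A = [(1, 10), (1, 30), (2, 20), (3, 15)]
idlist_B = [(1, 20), (2, 10), (3, 25), (4, 5)]

a_then_b = temporal_join(idlist_A, idlist_B)
print(a_then_b)           # [(1, 20), (3, 25)]
print(support(a_then_b))  # 2
```

Because the support comes straight from the joined table, no further pass over the database is needed for longer candidates, which is the point the slide makes about working in main memory.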

Data Sequence of 4 customers

http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf

AN EXAMPLE
  • With a minimum support of 50%, a sequential pattern is considered frequent if it occurs in the data sequences of at least 2 of the 4 customers (2/4).
  • In this case a maximal sequential pattern mining process will find three patterns:

S1: (“Camera,DVD”)(“DVD-R,DVD-Rec”)

S2: (“DVD-R,DVD-Rec”)(“Videosoft”)

S3: (“Memory Card”)(“USB”)

http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
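The 50% support check can be sketched as follows. The customer table from the slide is not reproduced in this transcript, so the `database` below is a hypothetical one that merely reuses the item names of S1 to S3; `contains` and `is_frequent` are illustrative helper names.

```python
def contains(data_sequence, pattern):
    """True if the itemsets of `pattern` occur, in order, in `data_sequence`.

    Each pattern itemset must be a subset of a distinct, later itemset
    of the data sequence (greedy left-to-right matching).
    """
    pos = 0
    for itemset in data_sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def is_frequent(database, pattern, min_support):
    """Compare the fraction of supporting customers with min_support."""
    count = sum(contains(seq, pattern) for seq in database.values())
    return count / len(database) >= min_support


# Hypothetical data sequences for 4 customers (not the slide's table).
database = {
    "C1": [{"Camera", "DVD"}, {"DVD-R", "DVD-Rec"}],
    "C2": [{"Camera", "DVD"}, {"DVD-R", "DVD-Rec"}, {"Videosoft"}],
    "C3": [{"DVD-R", "DVD-Rec"}, {"Memory Card"}, {"Videosoft"}, {"USB"}],
    "C4": [{"Memory Card"}, {"USB"}],
}

s1 = [{"Camera", "DVD"}, {"DVD-R", "DVD-Rec"}]
s3 = [{"Memory Card"}, {"USB"}]
print(is_frequent(database, s1, 0.5))  # True: supported by C1 and C2
print(is_frequent(database, s3, 0.5))  # True: supported by C3 and C4
```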

Determining Support

[Figure: the original id-list database and a suffix join on id-lists]

http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf

ADVANTAGES
  • Uses simple join operations on id table
  • No complicated hash tree structures used
  • No overhead of generating and searching subsequences
  • Cuts down on I/O operations by limiting itself to three scans

http://www.cs.helsinki.fi/u/gionis/seminar_papers/zaki00spade.ps

The Visual Web Mining framework provides a prototype implementation for applying information visualization techniques to these results.

http://www.cs.rpi.edu/~youssefi/research/VWM/

SYSTEM ARCHITECTURE

http://www.cs.rpi.edu/~youssefi/research/VWM

  • A robot (webbot) is used to retrieve the pages of the website
  • Web server log files are downloaded and processed
  • The Integration Engine is a suite of programs for data preparation, i.e., extracting, cleaning, transforming and integrating data, loading it into a database, and later generating graphs in XGML.

http://www.cs.rpi.edu/~youssefi/research/VWM

  • User sessions are extracted from the web logs; each session yields results roughly related to a specific user
  • The user sessions are converted into a format suitable for sequence mining
  • The outputs are frequent contiguous sequences with a given minimum support.
  • These are imported into a database
  • Different queries are executed against this data.

http://www.cs.rpi.edu/~youssefi/research/VWM
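Session extraction is not detailed on the slide, so the following is only a minimal sketch of one common heuristic: hits from the same client are split into sessions whenever the gap between two consecutive requests exceeds a timeout. The 30-minute timeout, the record layout and the function name are assumptions made for illustration, not the VWM implementation.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed cut-off, not from the slides


def extract_sessions(log_entries, timeout=SESSION_TIMEOUT):
    """Group (client, timestamp, url) records into per-client sessions."""
    by_client = defaultdict(list)
    for client, ts, url in log_entries:
        by_client[client].append((ts, url))

    sessions = []
    for client, hits in by_client.items():
        hits.sort()
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                # Gap too long: close the current session, start a new one.
                sessions.append((client, current))
                current = []
            current.append(url)
        sessions.append((client, current))
    return sessions


# Hypothetical log records: (client address, seconds since start, URL).
log = [
    ("10.0.0.1", 0, "/index.html"),
    ("10.0.0.1", 60, "/courses.html"),
    ("10.0.0.2", 100, "/index.html"),
    ("10.0.0.1", 4000, "/index.html"),
]
sessions = extract_sessions(log)
```

Each resulting page list can then be written out as one input sequence for the sequence miner.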

APPLICATIONS
  • Designing different visualization diagrams and exploring frequent patterns of user access on a website
  • Classification of web pages into two classes, hot and cold, attracting high and low numbers of visitors respectively.
  • A webmaster can make exploratory changes to the website structure and analyze the resulting change in user access patterns in the real world.

http://www.cs.rpi.edu/~youssefi/research/VWM/

Sentiment Classification

Vaishali Kshatriya

105951122

References
  • The Sentimental Factor: Improving Review Classification via Human-Provided Information. - Philip Beineke , Shivakumar Vaithyanathan and Trevor Hastie
  • Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (July 2002)
  • http://wing.comp.nus.edu.sg/chime/050427/SentimentClassification3_files/frame.htm
  • http://www.cse.iitb.ac.in/~cs621/seminar/SentimentDetection.ppt#267,12,Recent Advances
  • Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web" Proceedings of the 14th international World Wide Web conference (WWW-2005), May 10-14, 2005, in Chiba, Japan.

Sentiment Classification
  • Sentiment classification is the task of labeling a review document according to the polarity of its prevailing opinion.

Online Shopping

Topical vs. Sentimental Classification

Topical Classification

  • Classifying documents into various subjects, for example Mathematics, Sports, etc.
  • Compares individual words (unigrams) across subject areas (bag-of-words approach). Example: “score”, “referee”, “football” => Sports

Sentiment Classification

  • Classifying documents according to their overall sentiment, positive vs. negative, e.g. like vs. dislike, recommended vs. not recommended
  • More difficult than traditional topical classification; may need more linguistic processing, e.g. for “you will be disappointed” and “it is not satisfactory”

http://wing.comp.nus.edu.sg/chime/050427/SentimentClassification3_files/frame.htm

Challenges
  • Sentiment depends on context: an “unpredictable” plot vs. an “unpredictable” performance
  • Negations have to be captured
    • The movie was not that bad.
    • The pictures taken by the cell phone are not of the best quality.
  • Subtle Expressions:
    • “How can someone sit through the entire movie?”

http://www.cse.iitb.ac.in/~cs621/seminar/SentimentDetection.ppt#267,12,Recent Advances

Unsupervised review classification (Turney ACL -02)
  • Input: Written review
  • Output: classification (i.e. positive or negative)
  • Step 1: Use a part-of-speech tagger to identify phrases
  • Step 2: Estimate the semantic orientation of each extracted phrase
  • Step 3: Assign the given review to a class (either recommended or not recommended)

Citation : Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (02)

Step 1: Extract the phrases
  • A part-of-speech tagger is applied to the review
  • Two consecutive words are extracted from the review if their tags conform to any of the patterns in the table

where JJ: Adjective and NN: Noun

Citation : Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (02)
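The pattern table itself is not reproduced in this transcript. The sketch below encodes the five two-word tag patterns as recalled from Turney's paper (Penn Treebank tags), so treat the exact pattern set as an assumption to be checked against the original. The input is a hand-tagged, hypothetical sentence; any part-of-speech tagger that emits (word, tag) pairs could supply it.

```python
NOUN = {"NN", "NNS"}
ADVERB = {"RB", "RBR", "RBS"}
VERB = {"VB", "VBD", "VBN", "VBG"}


def extract_phrases(tagged):
    """Return two-word phrases whose tags match one of the patterns.

    `tagged` is a list of (word, tag) pairs. Some patterns also require
    that the third word is not a noun.
    """
    phrases = []
    for k in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[k], tagged[k + 1]
        t3 = tagged[k + 2][1] if k + 2 < len(tagged) else None
        third_not_noun = t3 not in NOUN
        if (
            (t1 == "JJ" and t2 in NOUN)
            or (t1 in ADVERB and t2 == "JJ" and third_not_noun)
            or (t1 == "JJ" and t2 == "JJ" and third_not_noun)
            or (t1 in NOUN and t2 == "JJ" and third_not_noun)
            or (t1 in ADVERB and t2 in VERB)
        ):
            phrases.append(f"{w1} {w2}")
    return phrases


# Hand-tagged hypothetical sentence: "the online service is very handy ."
tagged_review = [
    ("the", "DT"), ("online", "JJ"), ("service", "NN"), ("is", "VBZ"),
    ("very", "RB"), ("handy", "JJ"), (".", "."),
]
print(extract_phrases(tagged_review))  # ['online service', 'very handy']
```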

Step 2: Estimate the semantic orientation
  • Uses PMI-IR (Pointwise Mutual Information and Information Retrieval)
  • PMI between two words, word1 and word2, can be defined as:
    • PMI(word1, word2) = log2[ p(word1 & word2) / (p(word1) p(word2)) ]
  • The Semantic Orientation (SO) of a phrase is calculated as :
    • SO(phrase) = PMI(phrase, “excellent”) – PMI(phrase, “poor”)
  • SO is positive when the phrase is more strongly associated with excellent and negative when it is more strongly associated with poor.

Citation : Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (02)

Step 2 (cont’d)
  • PMI-IR estimates PMI by issuing queries to a search engine (hence the IR in PMI-IR) and noting the number of hits (matching documents).
  • The experiment uses AltaVista

Citation : Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (02)
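When the probabilities are replaced by hit counts, SO(phrase) = PMI(phrase, “excellent”) – PMI(phrase, “poor”) reduces to a ratio of four counts, because the count for the phrase alone cancels out. The sketch below assumes the counts have already been obtained from a search engine; the function name, the argument names and the small smoothing constant (added to avoid division by zero) are illustrative assumptions.

```python
import math


def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """Estimate SO(phrase) from search-engine hit counts.

    hits_near_excellent: hits for the phrase near "excellent"
    hits_near_poor:      hits for the phrase near "poor"
    hits_excellent:      hits for "excellent" alone
    hits_poor:           hits for "poor" alone
    """
    return math.log2(
        ((hits_near_excellent + smoothing) * hits_poor)
        / ((hits_near_poor + smoothing) * hits_excellent)
    )


# Hypothetical hit counts, for illustration only.
so_value = semantic_orientation(80, 20, 1000, 1000)
print(so_value > 0)  # True: the phrase co-occurs more often with "excellent"
```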

Step 3: Assign a Class
  • Calculate the average SO of the phrases in the review, and classify the review as recommended if the average is positive and as not recommended if it is negative.

Reviews of a bank

Citation : Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (02)
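The decision rule is a simple average. In the sketch below the SO values are hypothetical, and treating an average of exactly zero as "not recommended" is an assumption, since the slide does not say how ties are handled.

```python
def classify_review(phrase_orientations):
    """Label a review from the SO values of its extracted phrases."""
    if not phrase_orientations:
        raise ValueError("no phrases were extracted from the review")
    average = sum(phrase_orientations) / len(phrase_orientations)
    # Assumption: an average of exactly 0 falls into "not recommended".
    return "recommended" if average > 0 else "not recommended"


# Hypothetical SO values for the phrases of two reviews.
print(classify_review([2.0, -0.5, 1.0]))  # recommended
print(classify_review([-1.2, 0.3]))       # not recommended
```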

Drawbacks
  • Sentiment classification is useful but it does not find what the reviewer liked or disliked.
  • A negative sentiment on an object does not imply that the user did not like anything about the product
  • Similarly a positive sentiment does not imply that the user liked everything about the product
  • The solution is to go to sentence and feature level

http://www.cs.uic.edu/~liub/EITC-06.ppt#493,20,Identify opinion orientation of features

Feature based Opinion mining and summarization (Hu and Liu ‘04)
  • Interested in what reviewers liked and disliked
  • Since the number of reviews of an object can be large, the goal was to produce a simple summary of the reviews
  • The summary can be easily visualized and compared

http://www.cs.uic.edu/~liub/EITC-06.ppt#493,20,Identify opinion orientation of features

Three main tasks:
  • Step 1: Identify and extract object features that have been commented on in each review
  • Step 2: Determine whether the opinion on each feature is positive, negative or neutral
  • Step 3: Group synonyms of features
  • Finally, produce a feature-based summary

http://www.cs.uic.edu/~liub/EITC-06.ppt#493,20,Identify opinion orientation of features
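Once steps 1 to 3 have produced (feature, polarity) pairs, the summary is just a grouped count. The sketch below assumes those pairs are already available; the synonym map, the sample opinions and the function name are hypothetical, and the feature extraction and polarity detection themselves are not shown.

```python
from collections import defaultdict

# Hypothetical synonym map: surface form -> canonical feature name.
SYNONYMS = {"photo": "picture", "image": "picture"}


def summarize(opinions):
    """Count positive / negative / neutral opinions per canonical feature."""
    summary = defaultdict(lambda: {"positive": 0, "negative": 0, "neutral": 0})
    for feature, polarity in opinions:
        canonical = SYNONYMS.get(feature, feature)
        summary[canonical][polarity] += 1
    return dict(summary)


# Hypothetical output of steps 1 and 2 for a handful of review sentences.
opinions = [
    ("picture", "positive"),
    ("photo", "positive"),
    ("battery", "negative"),
    ("image", "negative"),
]
feature_summary = summarize(opinions)
print(feature_summary["picture"])  # {'positive': 2, 'negative': 1, 'neutral': 0}
```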

Online Shopping

Summary
  • Sentiment classification: classifying reviews as good or bad
  • Unsupervised review classification extracts the phrases from the review, estimates their semantic orientation and assigns a class to the review
  • The solution to the shortcomings of sentiment classification is feature-based opinion extraction

References
  • Inferring Web Communities from Link Topology (1998). David Gibson, Jon Kleinberg, Prabhakar Raghavan. UK Conference on Hypertext.
  • Trawling the Web for Emerging Cyber-Communities (1999). Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins. WWW8 / Computer Networks.
  • Finding Related Pages in the World Wide Web (1999). Jeffrey Dean, Monika R. Henzinger. WWW8 / Computer Networks.
  • A System for Collaborative Web Resource Categorization and Ranking. Maxim Lifantsev.
  • Web Mining: A Bird’s Eye View. Sanjay Kumar Madria, Department of Computer Science, University of Missouri-Rolla, MO.

Web Mining: A Bird Eye View WISE-30 by Sanjay Kumar Madria

Introduction

  • Introduction of the cyber-community
  • Methods to measure the similarity of web pages on the web graph
  • Methods to extract the meaningful communities through the link structure

What is a cyber-community?

  • A community on the web is a group of web pages sharing a common interest
    • E.g. a group of web pages talking about pop music
    • E.g. a group of web pages interested in data mining
  • Main properties:
    • Pages in the same community should be similar to each other in content
    • The pages in one community should differ from the pages in another community
    • Similar to a cluster

Two different types of communities

  • Explicitly-defined communities
    • They are well-known ones, such as the resources listed by Yahoo!
    • E.g. a directory hierarchy: Arts > Music > (Classic, Pop) and Arts > Painting
  • Implicitly-defined communities
    • They are communities unexpected or invisible to most users
    • E.g. the group of web pages interested in a particular singer

Two different types of communities

  • The explicit communities are easy to identify
    • E.g. Yahoo!, InfoSeek, Clever System
  • In order to extract the implicit communities, we need to analyze the web graph objectively
  • In research, people are more interested in the implicit communities

Similarity of web pages

  • Discovering web communities is similar to clustering. For clustering, we must define the similarity of two nodes
  • Method I:
    • For page A and page B, A is related to B if there is a hyperlink from A to B, or from B to A
    • Not so good: consider the home pages of IBM and Microsoft, which are related but do not necessarily link to each other.

Similarity of web pages

  • Method II (from Bibliometrics)
    • Co-citation: the similarity of A and B is measured by the number of pages that cite both A and B
    • Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B.
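Both measures are simple counts over the link graph. The sketch below represents the graph as a dictionary from each page to the set of pages it links to; the graph and the function names are hypothetical.

```python
def cocitation(links, a, b):
    """Number of pages that link to both a and b."""
    return sum(1 for targets in links.values() if a in targets and b in targets)


def bibliographic_coupling(links, a, b):
    """Number of pages that both a and b link to."""
    return len(links.get(a, set()) & links.get(b, set()))


# Hypothetical link graph: page -> set of pages it links to.
links = {
    "P1": {"A", "B"},
    "P2": {"A", "B", "C"},
    "P3": {"A"},
    "A": {"X", "Y"},
    "B": {"Y", "Z"},
}

print(cocitation(links, "A", "B"))              # 2 (P1 and P2 cite both)
print(bibliographic_coupling(links, "A", "B"))  # 1 (both cite Y)
```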


Methods of clustering

  • Clustering methods based on co-citation analysis:
  • Methods derived from HITS (Kleinberg)
    • Using co-citation matrix
  • All of them can discover meaningful communities

But these methods are very expensive to apply to the whole World Wide Web, with its billions of pages.

Trawling the Web for Emerging Cyber-Communities. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins. Proceedings of the Eighth International Conference on World Wide Web, Toronto, Canada, 1999, pages 1481-1493. ISSN: 1389-1286.

A cheaper method

  • The method from Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins
    • IBM Almaden Research Center
  • They call their method communities trawling (CT)
  • They applied it to a graph of 200 million pages, and it worked very well

Basic idea of CT

  • Definition of communities: dense directed bipartite subgraphs
    • Bipartite graph: nodes are partitioned into two sets, F (fans) and C (centers)
    • Every directed edge in the graph is directed from a node u in F to a node v in C
    • Dense if many of the possible edges between F and C are present

Basic idea of CT

  • Bipartite cores
    • A complete bipartite subgraph with at least i nodes from F and at least j nodes from C
    • i and j are tunable parameters
    • Such a subgraph is called an (i, j) bipartite core
  • Every community has such a core for some i and j.

A (i=3, j=3) bipartite core
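The definition can be checked directly: every fan must link to every center. The sketch below assumes the graph is a dictionary from each page to the set of pages it links to; the function name and the sample graph are hypothetical.

```python
def is_bipartite_core(links, fans, centers, i, j):
    """True if `fans` and `centers` form an (i, j) bipartite core.

    Requires at least i fans, at least j centers, and an edge from
    every fan to every center.
    """
    fans, centers = set(fans), set(centers)
    if len(fans) < i or len(centers) < j:
        return False
    return all(centers <= links.get(fan, set()) for fan in fans)


# Hypothetical link graph: page -> set of pages it links to.
links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3", "c4"},
    "f3": {"c1", "c2", "c3"},
    "f4": {"c1"},
}

print(is_bipartite_core(links, {"f1", "f2", "f3"}, {"c1", "c2", "c3"}, 3, 3))  # True
```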

Basic idea of CT

  • A bipartite core serves as the signature of a community
  • Extracting all the communities amounts to enumerating all the bipartite cores on the web.
  • The authors devised an efficient algorithm to enumerate the bipartite cores. Its main idea is iterative pruning (elimination-generation pruning)

  • Complete bipartite graph: there is an edge between each node in F and each node in C
  • (i,j)-core: a complete bipartite graph with at least i nodes in F and j nodes in C
  • An (i,j)-core is a good signature for finding online communities
  • “Trawling”: finding cores
  • Find all (i,j)-cores in the Web graph.
    • In particular: find “fans” (or “hubs”) in the graph
    • “centers” = “authorities”
    • Challenge: the Web is huge. How to find cores efficiently?

Main idea: pruning

  • Step 1: using out-degrees
    • Rule: each fan must point to at least 6 different websites
    • Pruning results: 12% of all pages (= 24M pages) are potential fans
    • Retain only links, and ignore page contents

Step 2: Eliminate mirroring pages

  • Many pages are mirrors (exactly the same page)
  • They can produce many spurious fans
  • Use a “shingling” method to identify and eliminate duplicates
  • Results:
    • 60% of the 24M potential-fan pages are removed
    • The number of potential centers is 30 times the number of potential fans

Step 3: Iterative pruning

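  • To find (i,j)-cores
    • Remove all potential fans whose # of out-links is < j (a fan must point to all j centers)
    • Remove all potential centers whose # of in-links is < i (a center must be pointed to by all i fans)
    • Do it iteratively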
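Iterative pruning can be sketched as a fixed-point loop over the edge set. The sketch follows the earlier definition of an (i,j)-core (i fans, j centers), so a surviving fan needs at least j out-links and a surviving center at least i in-links. The function name and the edge list are hypothetical, and an in-memory set stands in for the disk-based processing a web-scale graph would need.

```python
def iterative_prune(edges, i, j):
    """Repeatedly drop edges whose fan or center cannot be in an (i, j)-core.

    `edges` is a set of (fan, center) pairs. Removing one edge lowers
    other degrees, so the filter is re-applied until nothing changes.
    """
    edges = set(edges)
    while True:
        out_deg, in_deg = {}, {}
        for fan, center in edges:
            out_deg[fan] = out_deg.get(fan, 0) + 1
            in_deg[center] = in_deg.get(center, 0) + 1
        kept = {
            (fan, center) for fan, center in edges
            if out_deg[fan] >= j and in_deg[center] >= i
        }
        if kept == edges:
            return kept
        edges = kept


# Hypothetical edges: a (2, 2)-core plus two stray links.
edges = {
    ("f1", "c1"), ("f1", "c2"),
    ("f2", "c1"), ("f2", "c2"),
    ("f3", "c1"),   # f3 has only one out-link
    ("f2", "c3"),   # c3 has only one in-link
}
core_edges = iterative_prune(edges, i=2, j=2)
print(sorted(core_edges))
```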

  • Step 4: inclusion-exclusion pruning
  • Idea: in each step, we
    • Either “include” a community
    • Or “exclude” a page from further contention

Check a page x with out-degree exactly j. x is a fan of an (i,j)-core if:

  • There are i-1 other fans that point to all the forward neighbors of x
  • This can be checked easily using the index on fans and centers
  • Result: for (3,3)-cores, 5M pages remained
  • Final step:
    • Since the graph is now much smaller, we can afford to “enumerate” the remaining cores
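The check for a fan x with out-degree exactly j amounts to intersecting the in-link sets of its j centers. In the sketch below, plain dictionaries stand in for the index on fans and centers; the function names and the sample graph are hypothetical.

```python
def invert(out_links):
    """Build the in-link index (center -> set of fans) from out-links."""
    in_links = {}
    for page, targets in out_links.items():
        for target in targets:
            in_links.setdefault(target, set()).add(page)
    return in_links


def core_for_fan(out_links, in_links, x, i, j):
    """Return (fans, centers) if x with out-degree j lies in an (i, j)-core.

    Returns None otherwise, in which case x can be excluded from
    further contention.
    """
    centers = out_links.get(x, set())
    if j == 0 or len(centers) != j:
        return None
    # Pages that point to every forward neighbor of x (x is one of them).
    common_fans = set.intersection(*(in_links.get(c, set()) for c in centers))
    if len(common_fans) >= i:
        return common_fans, set(centers)
    return None


# Hypothetical graph: f1, f2, f3 share three centers; f4 does not.
out_links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3"},
    "f4": {"c1", "c8", "c9"},
}
in_links = invert(out_links)

print(core_for_fan(out_links, in_links, "f1", 3, 3))  # a (3,3)-core is included
print(core_for_fan(out_links, in_links, "f4", 3, 3))  # None: f4 is excluded
```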

Step 5: using in-degrees of pages

  • Delete pages that are highly referenced, e.g. Yahoo!, AltaVista
  • Reason: they are referenced for many reasons and are unlikely to form an emerging community
  • Formally: remove all pages with more than k in-links (k = 50, for instance)
  • Results:
    • 60M pages pointing to 20M pages
    • 2M potential fans

Weakness of CT

  • The bipartite graph cannot suit all kinds of communities
  • The density of the community is hard to adjust

Experiment on CT

  • 200 million web pages
  • IBM PC with an Intel 300MHz Pentium II processor and 512MB of memory, running Linux
  • i from 3 to 10 and j from 3 to 20
  • 200k potential communities were discovered
    • 29% of them cannot be found in Yahoo!

Summary

  • Conclusion: The methods to discover communities from the web depend on how we define the communities through the link structure
  • Future work:
    • How to relate page contents to the link structure

Mining Topic-Specific Concepts and Definitions on the Web

Minnie Virk

Proceedings of the 12th International Conference on World Wide Web, ACM Press, May 2003

Bing Liu, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607-7053

Chee Wee Chin, Hwee Tou Ng, National University of Singapore, 3 Science Drive 2, Singapore

References
  • Agrawal, R. and Srikant, R. “Fast Algorithms for Mining Association Rules”, VLDB-94, 1994.
  • Anderson, C. and Horvitz, E. “Web Montage: A Dynamic Personalized Start Page”, WWW-02, 2002.
  • Brin, S. and Page, L. “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, WWW7, 1998.
  • Web Mining: A Bird’s Eye View. Sanjay Kumar Madria, Department of Computer Science, University of Missouri-Rolla, MO.

Introduction
  • When one wants to learn about a topic, one reads a book or a survey paper.
  • One can read the research papers about the topic.
  • None of these is very practical.
  • Learning from the web is convenient, intuitive, and diverse.

Purpose of the Paper
  • This paper’s task is “mining topic-specific knowledge on the Web”.
  • The goal is to help people learn in-depth knowledge of a topic systematically on the Web.

Learning about a New Topic
  • One needs to find definitions and descriptions of the topic.
  • One also needs to know the sub-topics and salient concepts of the topic.
  • Thus, one wants the knowledge as presented in a traditional book.
  • The task of this paper can be summarized as “compiling a book on the Web”.

Proposed Technique
  • First, identify sub-topics or salient concepts of that specific topic.
  • Then, find and organize the informative pages containing definitions and descriptions of the topic and sub-topics.

Why are the current search techniques not sufficient?

For definitions and descriptions of the topic:

Existing search engines rank web pages based on keyword matching and hyperlink structures. This is not very useful for measuring the informative value of a page.

For sub-topics and salient concepts of the topic:

A single web page is unlikely to contain information about all the key concepts or sub-topics of the topic. Thus, sub-topics need to be discovered from multiple web pages. Current search engine systems do not perform this task.

Related Work
  • Web information extraction wrappers
  • Web query languages
  • User preference approach
  • Question answering in information retrieval
  • Question answering is closely related to this paper's task. The objective of a question-answering system is to provide direct answers to questions submitted by the user. In this paper’s task, many of the questions are about definitions of terms.

The Algorithm

WebLearn (T)

1) Submit T to a search engine, which returns a set of relevant pages

2) The system mines the sub-topics or salient concepts of T using a set S of top ranking pages from the search engine

3) The system then discovers the informative pages containing definitions of the topic and sub-topics (salient concepts) from S

4) The user views the concepts and informative pages.

   If s/he still wants to know more about sub-topics then
       for each user-interested sub-topic Ti of T do
           WebLearn (Ti);

Sub-Topic or Salient Concept Discovery
  • Observation:

Sub-topics or salient concepts of a topic are important word phrases, usually emphasized using some HTML tags (e.g., <h1>,...,<h4>,<b>).

  • However, this is not sufficient. Data mining techniques can help find the frequently occurring word phrases.

Sub-Topic Discovery
  • After obtaining a set of relevant top-ranking pages (using Google), sub-topic discovery consists of the following 5 steps.

1) Filter out the “noisy” documents that rarely contain sub-topics or salient-concepts. The resulting set of documents is the source for sub-topic discovery.

Sub-Topic Discovery

2) Identify important phrases in each page (discover phrases emphasized by HTML markup tags).

Rules to determine if a markup tag can safely be ignored

  • Contains a salutation title (Mr, Dr, Professor).
  • Contains a URL or an email address.
  • Contains terms related to a publication (conference, proceedings, journal).
  • Contains an image between the markup tags.
  • Too lengthy (the paper uses 15 words as the upper limit)
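Step 2 and its ignore rules can be sketched with Python's standard `html.parser` module. The tag set follows the tags named earlier (<h1> to <h4>, <b>); the regular expressions are simplified stand-ins for the rules above, and the class name, the function name and the sample page are hypothetical.

```python
import re
from html.parser import HTMLParser

EMPHASIS_TAGS = {"h1", "h2", "h3", "h4", "b"}
SALUTATION = re.compile(r"\b(mr|dr|professor)\b", re.I)
URL_OR_EMAIL = re.compile(r"https?://|www\.|\S+@\S+", re.I)
PUBLICATION = re.compile(r"\b(conference|proceedings|journal)\b", re.I)
MAX_WORDS = 15  # upper limit on phrase length given in the slide


def should_ignore(text):
    """Apply the text-based ignore rules to one emphasized segment."""
    return bool(
        SALUTATION.search(text) or URL_OR_EMAIL.search(text)
        or PUBLICATION.search(text) or len(text.split()) > MAX_WORDS
    )


class EmphasisExtractor(HTMLParser):
    """Collect text emphasized by heading or bold tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth inside emphasis tags
        self.buffer = []
        self.has_image = False
        self.phrases = []

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            if self.depth == 0:
                self.buffer, self.has_image = [], False
            self.depth += 1
        elif tag == "img" and self.depth:
            self.has_image = True  # image between the markup tags

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.depth:
            self.depth -= 1
            if self.depth == 0:
                text = " ".join("".join(self.buffer).split())
                if text and not self.has_image and not should_ignore(text):
                    self.phrases.append(text)

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


# Hypothetical page fragment.
page = (
    "<h1>Data Mining</h1><p>text <b>Association Rules</b></p>"
    "<h2>Dr. Smith</h2><b>See http://example.org</b>"
    '<h3><img src="x.png">Logo</h3>'
)
extractor = EmphasisExtractor()
extractor.feed(page)
print(extractor.phrases)  # ['Data Mining', 'Association Rules']
```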

Sub-Topic Discovery
  • Also, in this step, some preprocessing techniques such as stopwords removal and word stemming are applied in order to extract quality text segments.
  • Stopwords removal: Eliminating the words that occur too frequently and have little informational meaning.
  • Word stemming: Finding the root form of a word by removing its suffix.

Sub-Topic Discovery
  • 3) Mine frequently occurring phrases:
    • Each piece of text extracted in step 2 is stored in a dataset called a transaction set.
    • Then, an association rule miner based on the Apriori algorithm is executed to find the frequent itemsets. In this context, an itemset is a set of words that occur together, and an itemset is frequent if it appears in more than two documents.
    • Only the first step of the Apriori algorithm (frequent itemset generation) is needed, and only frequent itemsets with three words or fewer need to be found (this restriction can be relaxed).
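The frequent-itemset step can be sketched with a small level-wise (Apriori-style) search. The sketch counts an itemset as frequent when it appears in at least three distinct documents ("more than two") and stops at three words. The function name and the transaction set are hypothetical, and a production miner would count candidates more efficiently.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_docs=3, max_size=3):
    """Level-wise search for word sets occurring in >= min_docs documents.

    `transactions` is a list of (doc_id, set_of_words) pairs, one per
    emphasized text segment. Returns {frozenset_of_words: document_count}.
    """
    frequent = {}
    candidates = {frozenset([w]) for _, words in transactions for w in words}
    size = 1
    while candidates and size <= max_size:
        docs = {c: set() for c in candidates}
        for doc_id, words in transactions:
            for c in candidates:
                if c <= words:
                    docs[c].add(doc_id)
        level = {c: len(d) for c, d in docs.items() if len(d) >= min_docs}
        frequent.update(level)
        # Candidate generation: join frequent sets of the current size and
        # keep only those whose every subset of that size is frequent.
        size += 1
        joined = {a | b for a in level for b in level if len(a | b) == size}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


# Hypothetical transaction set: (document id, words of one text segment).
transactions = [
    ("d1", {"association", "rule"}),
    ("d2", {"association", "rule", "mining"}),
    ("d3", {"rule", "association"}),
    ("d4", {"decision", "tree"}),
    ("d1", {"decision", "tree"}),
]
result = frequent_itemsets(transactions)
print(result[frozenset({"association", "rule"})])  # 3
```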

Sub-Topic Discovery
  • 4) Eliminate itemsets that are unlikely to be sub-topics, and determine the sequence of words in a sub-topic. (postprocessing)
  • Heuristic: If an itemset does not appear alone as an important phrase in any page, it is unlikely to be a main sub-topic and it is removed.

Sub-Topic Discovery
  • 5) Rank the remaining itemsets. The remaining itemsets are regarded as the sub-topics or salient concepts of the search topic and are ranked by the number of pages in which they occur.

Definition Finding
  • This step tries to identify those pages that include definitions of the search topic and its sub-topics discovered in the previous step.
  • Preprocessing steps:
    • Text that will not be displayed by browsers (e.g., <script>...</script>, <!-- comments -->) is ignored.
    • Word stemming is applied.
    • Stopwords and punctuation are kept, as they serve as clues to identify definitions.
    • HTML tags within a paragraph are removed.

Definition Finding
  • After that, the following patterns are applied to identify definitions:

[1] Bing Liu, Chee Wee Chin, Hwee Tou Ng. Mining Topic-Specific Concepts and Definitions on the Web
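The pattern table from the paper is not reproduced in this transcript. The sketch below therefore uses a few illustrative cue-phrase patterns (such as "is defined as" and "refers to") as stand-ins; they are assumptions, not the paper's actual pattern set, and stemming is omitted for brevity.

```python
import re


def definition_patterns(concept):
    """Build illustrative definition patterns for one concept."""
    c = re.escape(concept)
    return [
        re.compile(rf"\b{c}\s+(is|are)\s+(defined as|known as|called)\b", re.I),
        re.compile(rf"\b{c}\s+refers? to\b", re.I),
        re.compile(rf"\bwhat\s+(is|are)\s+(an?\s+|the\s+)?{c}\b", re.I),
    ]


def find_definitions(concept, sentences):
    """Return the sentences that match any definition pattern."""
    patterns = definition_patterns(concept)
    return [s for s in sentences if any(p.search(s) for p in patterns)]


# Hypothetical sentences taken from candidate pages.
sentences = [
    "Data mining is defined as the extraction of patterns from data.",
    "We applied data mining to the server logs.",
    "Data mining refers to knowledge discovery in databases.",
    "What is data mining?",
]
matches = find_definitions("data mining", sentences)
print(len(matches))  # 3
```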

Definition Finding
  • Besides using the above patterns, the paper also relies on HTML structuring and hyperlink structures.
  • 1) If a page contains only one header or one big emphasized text segment at the beginning of the document, then the document is assumed to contain a definition of the concept in the header.
  • 2) Definitions at the second level of the hyperlink structure are also discovered. All the patterns and methods described above are applied to these second level documents.

Definition Finding
  • Observation: Sometimes no informative page is found for a particular sub-topic when the pages for the main topic are very general and do not contain detailed information for sub-topics.
  • In such cases, the sub-topic can be submitted to the search engine and sub-subtopics may be found recursively.

Conclusions
  • The proposed techniques aim at helping Web users to learn an unfamiliar topic in-depth and systematically.
  • This is an efficient system to discover and organize knowledge on the web, in a way similar to a traditional book, to assist learning.

Questions?

Thank You!
