WEB MINING AND APPLICATIONS

Pallavi Tripathi 105956127

Vaishali Kshatriya 105951122

Mehru Anand 106113525

Minnie Virk 106113516


REFERENCES

  • Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber

  • Presentation Slides of Prof. Anita Wasilewska

  • http://www.cs.rpi.edu/~youssefi/research/VWM/

  • http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf

  • http://www.galeas.de/webimining.html

  • http://www.cs.helsinki.fi/u/gionis/seminar_papers/zaki00spade.ps

CSE:634 Web Mining


CITATIONS

  • Amir H. Youssefi, David J. Duke, Mohammed J. Zaki, Ephraim P. Glinert, Visual Web Mining 13th International World Wide Web Conference (poster proceedings), New York, NY, May 2004.

  • Amir H. Youssefi, David Duke, Ephraim P. Glinert, and Mohammed J. Zaki, Toward Visual Web Mining, 3rd International Workshop on Visual Data Mining (with ICDM'03), Melbourne, FL, November 2003.

With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools in finding the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge.

http://www.galeas.de/webimining.html

WHAT IS WEB MINING?

Web Mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web.

AREAS OF CLASSIFICATION

  • WEB CONTENT MINING is the process of extracting knowledge from the content of documents or their descriptions.

  • WEB STRUCTURE MINING is the process of inferring knowledge from the World Wide Web organization and links between references and referents in the Web.

  • WEB USAGE MINING, also known as WEB LOG MINING, is the process of extracting interesting patterns in web access logs.

  • In addition to these three web mining types, there are other helpful approaches for web knowledge discovery, such as information visualization, which helps us to understand the complex relationships and structures of many search results.

http://www.galeas.de/webimining.html

TOPICS COVERED

Today’s presentation covers the following algorithms related to various aspects of Web Mining:

  • Spade Algorithm and its applications in Visual Web Mining

  • Sentiment Classification

  • Community Trawling Algorithm

VISUAL WEB MINING

Visual Web Mining is the application of information visualization techniques to the results of Web Mining, in order to further amplify the perception of extracted patterns and to visually explore new ones in the web domain.

The application domain is Web Usage Mining and Web Content Mining.

http://www.cs.rpi.edu/~youssefi/research/VWM/

APPROACH USED

  • Make personalized results for targeted web surfers

  • Use data mining algorithms for extracting new insight and measures

  • Employ a database server and relational query language as a means to submit specific queries against data

  • Utilize visualization to obtain an overall picture

http://www.cs.rpi.edu/~youssefi/research/VWM/

SPADE OVERVIEW

  • Proposed by Mohammed J. Zaki

  • Sequential PAttern Discovery using Equivalence classes

  • An algorithm based on Apriori for fast discovery of frequent sequences

  • Needs three database scans in order to extract sequential patterns

  • Given: A database of customer transactions, each of which has the following characteristics: sequence-id or customer-id, transaction-time and the item involved in the transaction.

  • The aim is to obtain typical behaviors according to the user's viewpoint.

http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf

DEFINITIONS

  • Item: Can be considered as the object bought by a customer, or the page requested by the user of a website, etc.

  • Itemset: An itemset is the set of items that are grouped by timestamp.

  • Data Sequence: Sequence of itemsets associated with a customer.

  • Sequential Mining: Discovering frequent sequences over time of attribute sets in large databases.

  • Frequent Sequential Pattern: Sequence whose statistical significance in the database is above a user-specified threshold.

http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf

SPADE ALGORITHM

  • The first scan finds the frequent items

  • The second scan aims at finding frequent sequences of length 2

  • The last scan associates each frequent sequence of length 2 with a table of the corresponding sequence ids and itemset ids in the database

  • Based on this representation in main memory, the support of a candidate sequence of length k is the result of join operations on the tables related to the frequent sequences of length (k-1) that can generate this candidate
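The join step above can be illustrated with a minimal sketch. It assumes each table (id-list) is a list of (sequence id, itemset id) pairs and that itemset ids increase with time; the function names and the sample id-lists are hypothetical and are not taken from the SPADE paper or the slides.

```python
def temporal_join(idlist_a, idlist_b):
    """Id-list for the sequence "A followed by B".

    Each id-list is a list of (sid, eid) pairs: the sequence id and the
    itemset id at which the pattern's last item occurs.
    """
    # Earliest occurrence of A in every sequence.
    first_a = {}
    for sid, eid in idlist_a:
        first_a[sid] = min(eid, first_a.get(sid, eid))
    # Keep occurrences of B that come strictly after some occurrence of A.
    return sorted({(sid, eid) for sid, eid in idlist_b
                   if sid in first_a and eid > first_a[sid]})


def support(idlist):
    """Support = number of distinct sequences in which the pattern occurs."""
    return len({sid for sid, _ in idlist})


# Hypothetical id-lists for two frequent items A and B.
idlist_a = [(1, 10), (1, 30), (2, 20), (3, 15)]
idlist_b = [(1, 20), (2, 10), (3, 40), (4, 5)]
joined = temporal_join(idlist_a, idlist_b)
print(joined, support(joined))
```

Because the support is read directly off the joined table, no further pass over the original database is needed for this candidate.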

http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf

Data Sequence of 4 customers

http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf

AN EXAMPLE

  • With a minimum support of “50%” a sequential pattern can be considered as frequent if it occurs at least in the data sequences of 2 customers (2/4).

  • In this case a maximal sequential pattern mining process will find three patterns:

    S1: (“Camera,DVD”)(“DVD-R,DVD-Rec”)

    S2: (“DVD-R,DVD-Rec”)(“Videosoft”)

    S3: (“Memory Card”)(“USB”)
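The support test in this example can be sketched as follows. The customer table from the slide is not part of the transcript, so the `database` below is invented purely for illustration; `contains` and `is_frequent` are hypothetical helper names.

```python
def contains(data_sequence, pattern):
    """True if `pattern` is a subsequence of `data_sequence`.

    Both are lists of itemsets (Python sets). Each pattern itemset must be
    a subset of a distinct itemset of the data sequence, in the same order.
    """
    pos = 0
    for wanted in pattern:
        while pos < len(data_sequence) and not wanted <= data_sequence[pos]:
            pos += 1
        if pos == len(data_sequence):
            return False
        pos += 1
    return True


def is_frequent(database, pattern, min_support):
    """Compare the fraction of customers supporting `pattern` to the threshold."""
    count = sum(contains(seq, pattern) for seq in database.values())
    return count / len(database) >= min_support


# Invented data sequences for 4 customers (NOT the table from the slide).
database = {
    "C1": [{"Camera", "DVD"}, {"DVD-R", "DVD-Rec"}, {"Videosoft"}],
    "C2": [{"Camera", "DVD"}, {"DVD-R", "DVD-Rec"}],
    "C3": [{"DVD-R", "DVD-Rec"}, {"Videosoft"}],
    "C4": [{"Memory Card"}, {"USB"}],
}
s1 = [{"Camera", "DVD"}, {"DVD-R", "DVD-Rec"}]
print(is_frequent(database, s1, 0.5))
```

With four customers and a 50% threshold, a pattern must be contained in at least two data sequences, which matches the 2/4 rule stated above.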

http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf

Determining Support

SUFFIX JOIN ON ID LIST

ORIGINAL ID LIST DATABASE

http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf

ADVANTAGES

  • Uses simple join operations on id table

  • No complicated hash tree structures used

  • No overhead of generating and searching subsequences

  • Cuts down on I/O operations by limiting itself to three scans

http://www.cs.helsinki.fi/u/gionis/seminar_papers/zaki00spade.ps

SYSTEM ARCHITECTURE

http://www.cs.rpi.edu/~youssefi/research/VWM


  • A robot (webbot) is used to retrieve the pages of the Website

  • Web Server log files are downloaded and processed

  • The Integration Engine is a suite of programs for data preparation, i.e., extracting, cleaning, transforming and integrating data, loading it into the database, and later generating graphs in XGML.

http://www.cs.rpi.edu/~youssefi/research/VWM

  • User sessions are extracted from web logs; each session yields results roughly related to a specific user

  • The user sessions are converted into a format suitable for Sequence Mining

  • Outputs are frequent contiguous sequences with a given minimum support.

  • These are imported into a database

  • Different queries are executed against this data.

http://www.cs.rpi.edu/~youssefi/research/VWM

APPLICATIONS

  • Designing different visualization diagrams and exploring frequent patterns of user access on a website

  • Classification of web pages into two classes, hot and cold, attracting high and low numbers of visitors respectively.

  • A webmaster can make exploratory changes to the website structure and analyze the change in user access patterns in the real world.

http://www.cs.rpi.edu/~youssefi/research/VWM/

Sentiment Classification

Vaishali Kshatriya

105951122


References

  • The Sentimental Factor: Improving Review Classification via Human-Provided Information. Philip Beineke, Shivakumar Vaithyanathan and Trevor Hastie

  • Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (July 2002)

  • http://wing.comp.nus.edu.sg/chime/050427/SentimentClassification3_files/frame.htm

  • http://www.cse.iitb.ac.in/~cs621/seminar/SentimentDetection.ppt#267,12,Recent Advances

  • Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web" Proceedings of the 14th international World Wide Web conference (WWW-2005), May 10-14, 2005, in Chiba, Japan.

Sentiment Classification

  • Sentiment classification is the task of labeling a review document according to the polarity of its prevailing opinion.

Online Shopping

Topical vs. Sentimental Classification

Topical Classification

  • Classifies documents into various subjects, for example: Mathematics, Sports, etc.

  • Compares individual words (unigrams) in various subject areas (Bag-of-Words approach). Example: “score”, “referee”, “football” => Sports

Sentiment Classification

  • Classifies documents according to the overall sentiment, positive vs. negative. E.g. like vs. dislike; recommended vs. not recommended

  • More difficult than traditional topical classification. May need more linguistic processing. E.g. “you will be disappointed” and “it is not satisfactory”

http://wing.comp.nus.edu.sg/chime/050427/SentimentClassification3_files/frame.htm

Challenges

  • Sentiment depends on context: an “unpredictable” plot vs. an “unpredictable” performance

  • Negations have to be captured

    • The movie was not that bad.

    • The pictures taken by the cell are not of the best quality.

  • Subtle Expressions:

    • “How can someone sit through the entire movie?”

http://www.cse.iitb.ac.in/~cs621/seminar/SentimentDetection.ppt#267,12,Recent Advances

Unsupervised review classification (Turney, ACL-02)

  • Input: Written review

  • Output: classification (i.e. positive or negative)

  • Step 1: Use a part-of-speech tagger to identify phrases

  • Step 2: Estimate the semantic orientation of each extracted phrase

  • Step 3: Assign the given review to a class (either recommended or not recommended)

Citation : Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (02)

Step 1: Extract the phrases

  • A part-of-speech tagger is applied to the review

  • Two consecutive words are extracted from the review if their tags conform to any of the patterns in the table

where JJ: Adjective and NN: Noun
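The pattern table itself did not survive in this transcript. The sketch below uses the five two-word patterns as they are usually quoted from Turney (2002), with Penn Treebank tags; treat the exact pattern set as an assumption to be checked against the paper. The function name and the sample sentence are hypothetical, and the input is assumed to be already tagged.

```python
NOUN = {"NN", "NNS"}
ADJECTIVE = {"JJ"}
ADVERB = {"RB", "RBR", "RBS"}
VERB = {"VB", "VBD", "VBN", "VBG"}


def extract_phrases(tagged):
    """Return two-word phrases whose tags match one of the patterns.

    `tagged` is a list of (word, tag) pairs produced by any POS tagger.
    """
    phrases = []
    for k in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[k], tagged[k + 1]
        # Tag of the following word (not extracted, only inspected).
        t3 = tagged[k + 2][1] if k + 2 < len(tagged) else None
        third_not_noun = t3 not in NOUN
        if (
            (t1 in ADJECTIVE and t2 in NOUN)
            or (t1 in ADVERB and t2 in ADJECTIVE and third_not_noun)
            or (t1 in ADJECTIVE and t2 in ADJECTIVE and third_not_noun)
            or (t1 in NOUN and t2 in ADJECTIVE and third_not_noun)
            or (t1 in ADVERB and t2 in VERB)
        ):
            phrases.append(f"{w1} {w2}")
    return phrases


# Hypothetical, hand-tagged review fragment.
tagged_review = [("very", "RB"), ("nice", "JJ"), ("camera", "NN"),
                 ("with", "IN"), ("poor", "JJ"), ("battery", "NN")]
print(extract_phrases(tagged_review))
```

Note that “very nice” is skipped here because the third word is a noun, so the adjective is captured together with its noun instead.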

Citation : Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (02)

Step 2: Estimate the semantic orientation

  • Uses PMI-IR (Pointwise Mutual Information and Information Retrieval)

  • PMI between two words, word1 and word2, can be defined as: PMI(word1, word2) = log2( p(word1 & word2) / (p(word1) p(word2)) )

  • The Semantic Orientation (SO) of a phrase is calculated as :

    • SO(phrase) = PMI(phrase, “excellent”) – PMI(phrase, “poor”)

  • SO is positive when the phrase is more strongly associated with excellent and negative when it is more strongly associated with poor.

Citation : Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (02)

Step 2 (cont’d)

  • PMI-IR estimates PMI by issuing queries to a search engine (hence the IR in PMI-IR) and noting the number of hits (matching documents).

  • The experiment uses AltaVista
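When the probabilities are replaced by hit counts, the count for the phrase alone cancels out of SO(phrase) = PMI(phrase, “excellent”) – PMI(phrase, “poor”). A minimal sketch of the resulting estimate follows; the function and parameter names are hypothetical, the hit counts are supplied by the caller (no search engine is queried), and the small smoothing constant that avoids division by zero is an assumption of this sketch.

```python
import math


def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """Estimate SO(phrase) from hit counts.

    hits_near_excellent: hits for the phrase occurring near "excellent"
    hits_near_poor:      hits for the phrase occurring near "poor"
    hits_excellent:      hits for "excellent" alone
    hits_poor:           hits for "poor" alone
    """
    return math.log2(
        ((hits_near_excellent + smoothing) * hits_poor)
        / ((hits_near_poor + smoothing) * hits_excellent)
    )


# Made-up hit counts for one phrase.
print(semantic_orientation(200, 50, 1000, 1000))
```

A phrase seen more often near “excellent” than near “poor” gets a positive score, and the reverse gives a negative one.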

Citation : Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (02)

Step 3: Assign a Class

  • Calculate the average SO of the phrases in the review, and classify the review as recommended if the average is positive and not recommended if the average is negative.

Reviews of a bank
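The averaging rule of Step 3 can be written as a short sketch. The function name is hypothetical, and the handling of an average of exactly zero (treated as not recommended) is an assumption, since the slide does not cover that case.

```python
def classify_review(phrase_orientations):
    """Classify a review from the SO values of its extracted phrases."""
    if not phrase_orientations:
        raise ValueError("the review has no extracted phrases")
    average = sum(phrase_orientations) / len(phrase_orientations)
    # Assumption: an average of exactly zero counts as "not recommended".
    return "recommended" if average > 0 else "not recommended"


# Made-up SO values for the phrases of two reviews.
print(classify_review([1.2, -0.3, 0.5]))
print(classify_review([-2.0, 0.5]))
```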

Citation : Thumbs Up or Thumbs Down? Semantic orientation applied to unsupervised classification of reviews: Turney (02)

Drawbacks

  • Sentiment classification is useful but it does not find what the reviewer liked or disliked.

  • A negative sentiment on an object does not imply that the user did not like anything about the product

  • Similarly a positive sentiment does not imply that the user liked everything about the product

  • The solution is to go to sentence and feature level

http://www.cs.uic.edu/~liub/EITC-06.ppt#493,20,Identify opinion orientation of features

Feature based Opinion mining and summarization (Hu and Liu ‘04)

  • Interested in what reviewers liked and disliked

  • Since the number of reviews of an object can be large, the goal was to produce a simple summary of the reviews

  • The summary can be easily visualized and compared

http://www.cs.uic.edu/~liub/EITC-06.ppt#493,20,Identify opinion orientation of features

Three main tasks:

  • Step 1: Identify and extract object features that have been commented on in each review

  • Step 2: Determine whether the opinion in the review is positive, negative or neutral

  • Step 3: Group synonyms of features

  • Produce a feature-based summary!

http://www.cs.uic.edu/~liub/EITC-06.ppt#493,20,Identify opinion orientation of features

Online Shopping

Summary

  • Classification of reviews as good or bad: sentiment classification

  • Unsupervised review classification extracts the phrases from the review, estimates the semantic orientation and assigns a class to the review

  • The solution to the shortcomings of sentiment classification is feature-based opinion extraction

Discovering Web communities on the web

Mehru Anand (106113525)


References

  • Inferring Web Communities from Link Topology (1998) David Gibson, Jon Kleinberg, Prabhakar Raghavan, UK Conference on Hypertext.

  • Trawling the web for emerging cyber-communities (1999) Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, WWW8 / Computer Networks.

  • Finding Related Pages in the World Wide Web (1999) Jeffrey Dean, Monika R. Henzinger, WWW8 / Computer Networks.

  • A System for Collaborative Web Resource Categorization and Ranking, Maxim Lifantsev.

  • Web Mining: A Bird’s Eye View by Sanjay Kumar Madria, Department of Computer Science, University of Missouri-Rolla, MO

Web Mining: A Bird Eye View WISE-30 by Sanjay Kumar Madria


Introduction

  • Introduction of the cyber-community

  • Methods to measure the similarity of web pages on the web graph

  • Methods to extract the meaningful communities through the link structure

What is a cyber-community?

  • A community on the web is a group of web pages sharing a common interest

    • E.g. a group of web pages talking about pop music

    • E.g. a group of web pages interested in data mining

  • Main properties:

    • Pages in the same community should be similar to each other in contents

    • The pages in one community should differ from the pages in another community

    • Similar to a cluster

Two different types of communities

  • Explicitly-defined communities

    • They are well-known ones, such as the resources listed by Yahoo!

    • E.g. directory categories: Arts, Music, Painting, Classic, Pop

  • Implicitly-defined communities

    • They are communities unexpected or invisible to most users

    • E.g. the group of web pages interested in a particular singer

Two different types of communities

  • The explicit communities are easy to identify

    • E.g. Yahoo!, InfoSeek, Clever System

  • In order to extract the implicit communities, we need to analyze the web graph objectively

  • In research, people are more interested in the implicit communities

Similarity of web pages

  • Discovering web communities is similar to clustering. For clustering, we must define the similarity of two nodes

  • Method I:

    • For page A and page B, A is related to B if there is a hyperlink from A to B, or from B to A

    • Not so good. Consider the home pages of IBM and Microsoft: they are clearly related, yet neither is likely to link to the other.

Similarity of web pages

  • Method II (from Bibliometrics)

    • Co-citation: the similarity of A and B is measured by the number of pages that cite both A and B

    • Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B.
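Both measures of Method II reduce to set intersections over the link graph. The sketch below assumes the graph is given as a set of (source, target) hyperlinks; the function names and the toy graph are hypothetical.

```python
def cocitation(links, a, b):
    """Number of pages that link to (cite) both a and b."""
    citing_a = {src for src, dst in links if dst == a}
    citing_b = {src for src, dst in links if dst == b}
    return len(citing_a & citing_b)


def bibliographic_coupling(links, a, b):
    """Number of pages that are linked to (cited) by both a and b."""
    cited_by_a = {dst for src, dst in links if src == a}
    cited_by_b = {dst for src, dst in links if src == b}
    return len(cited_by_a & cited_by_b)


# Toy link graph: P1 and P2 cite both A and B; A and B both cite X.
links = {("P1", "A"), ("P1", "B"), ("P2", "A"), ("P2", "B"),
         ("P3", "A"), ("A", "X"), ("B", "X"), ("B", "Y")}
print(cocitation(links, "A", "B"), bibliographic_coupling(links, "A", "B"))
```

Neither measure requires a direct link between A and B, which is what Method I was missing.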


Methods of clustering

  • Clustering methods based on co-citation analysis

  • Methods derived from HITS (Kleinberg)

    • Using a co-citation matrix

  • All of them can discover meaningful communities

    • But these methods are very expensive to apply to the whole World Wide Web, with its billions of web pages.

Trawling the Web for emerging cyber-communities

Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins

Proceedings of the Eighth International Conference on World Wide Web, Toronto, Canada. Pages: 1481 - 1493. Year of Publication: 1999. ISSN: 1389-1286

A cheaper method

  • The method from Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins

    • IBM Almaden Research Center

  • They call their method communities trawling (CT)

  • They ran it on a graph of 200 million pages, where it worked very well

Basic idea of CT

  • Definition of communities: dense directed bipartite subgraphs

    • Bipartite graph: nodes are partitioned into two sets, F (fans) and C (centers)

    • Every directed edge in the graph is directed from a node u in F to a node v in C

    • Dense if many of the possible edges between F and C are present

Basic idea of CT

  • Bipartite cores

    • A complete bipartite subgraph with at least i nodes from F and at least j nodes from C

    • i and j are tunable parameters

    • An (i, j) bipartite core

  • Every community has such a core with a certain i and j.

A (i=3, j=3) bipartite core
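The definition of a core can be checked directly. This sketch assumes the web graph is a set of directed (fan, center) edges; the function name and the node names are hypothetical.

```python
def is_bipartite_core(edges, fans, centers):
    """True if every fan links to every center (a complete bipartite subgraph)."""
    return all((f, c) in edges for f in fans for c in centers)


fans = {"f1", "f2", "f3"}
centers = {"c1", "c2", "c3"}

# A (3, 3) core: all 9 possible fan -> center edges are present.
core_edges = {(f, c) for f in fans for c in centers}
print(is_bipartite_core(core_edges, fans, centers))

# Dropping a single edge leaves a dense bipartite subgraph that is not a core.
print(is_bipartite_core(core_edges - {("f1", "c1")}, fans, centers))
```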

Basic idea of CT

  • A bipartite core is the identity of a community

  • To extract all the communities is to enumerate all the bipartite cores on the web.

  • The authors devised an efficient algorithm to enumerate the bipartite cores. Its main idea is iterative pruning (elimination-generation pruning)

  • Complete bipartite graph: there is an edge between each node in F and each node in C

  • (i,j)-core: a complete bipartite graph with at least i nodes in F and j nodes in C

  • An (i,j)-core is a good signature for finding online communities

  • “Trawling”: finding cores

  • Find all (i,j)-cores in the Web graph.

    – In particular: find “fans” (or “hubs”) in the graph

    – “centers” = “authorities”

    – Challenge: Web is huge. How to find cores efficiently?

Main idea: pruning

• Step 1: using out-degrees

– Rule: each fan must point to at least 6 different websites

– Pruning results: 12% of all pages (= 24M pages) are potential fans

– Retain only links, and ignore page contents

Step 2: Eliminate mirroring pages

  • Many pages are mirrors (exactly the same page)

  • They can produce many spurious fans

  • Use a “shingling” method to identify and eliminate duplicates

  • Results:

    – 60% of 24M potential-fan pages are removed

    – The number of potential centers is about 30 times the number of potential fans
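Shingling compares pages by the sets of overlapping word windows they contain. The sketch below is a generic word-level version, not the exact procedure of the trawling paper; the window size `w=4`, the function names and the sample texts are assumptions made for illustration.

```python
def shingles(text, w=4):
    """Set of all windows of w consecutive words in the text."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[k:k + w]) for k in range(len(words) - w + 1)}


def resemblance(text_a, text_b, w=4):
    """Jaccard overlap of the two shingle sets (1.0 means identical sets)."""
    sa, sb = shingles(text_a, w), shingles(text_b, w)
    return len(sa & sb) / len(sa | sb)


page = "the quick brown fox jumps over the lazy dog"
mirror = "The quick brown fox jumps over the lazy dog"
other = "a completely different page about web mining algorithms"
print(resemblance(page, mirror), resemblance(page, other))
```

Pages whose resemblance exceeds a chosen threshold would be treated as mirrors and collapsed into one potential fan.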

Step 3: Iterative pruning

  • To find (i,j)-cores

    – Remove all potential fans whose # of out-links is < j

    – Remove all potential centers whose # of in-links is < i

    – Do it iteratively

  • Step 4: inclusion-exclusion pruning

  • Idea: in each step, we

    – Either “include” a community

    – Or “exclude” a page from further contention
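The iterative pruning of Step 3 can be sketched as a fixed-point loop. Following the (i,j)-core definition given earlier (i fans, each pointing to all j centers), a fan needs at least j out-links and a center at least i in-links. The edge-set representation, the function name and the toy graph are assumptions of this sketch.

```python
def iterative_prune(edges, i, j):
    """Repeatedly drop fans with out-degree < j and centers with in-degree < i.

    `edges` is a set of (fan, center) pairs. Removing one page can push
    another page below the threshold, so the loop runs until nothing changes.
    """
    edges = set(edges)
    while True:
        out_degree, in_degree = {}, {}
        for fan, center in edges:
            out_degree[fan] = out_degree.get(fan, 0) + 1
            in_degree[center] = in_degree.get(center, 0) + 1
        kept = {(fan, center) for fan, center in edges
                if out_degree[fan] >= j and in_degree[center] >= i}
        if kept == edges:
            return kept
        edges = kept


# Toy graph: a (2, 2) core plus two dangling edges.
graph = {("f1", "c1"), ("f1", "c2"), ("f2", "c1"), ("f2", "c2"),
         ("f3", "c1"), ("f1", "c3")}
print(sorted(iterative_prune(graph, i=2, j=2)))
```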

  • Check a page x with out-degree j. x is a fan of an (i,j)-core if:

    – There are i-1 other fans that point to all the forward neighbors of x

    – This step can be checked easily using the index on fans and centers

  • Result: for (3,3)-cores, 5M pages remained

  • Final step:

    – Since the graph is much smaller, we can afford to “enumerate” the remaining cores
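The check on a page x can be expressed as an intersection of in-neighbor sets. This is a minimal sketch over a small in-memory edge set; the real system works from sorted on-disk indexes, and the function name and toy graph here are hypothetical.

```python
def fan_of_core(edges, x, i):
    """Test whether x, whose out-degree is exactly j, is a fan of an (i, j)-core.

    x qualifies if at least i-1 other pages point to every forward neighbor
    of x, i.e. if the forward neighbors share at least i in-neighbors
    (x itself is one of them).
    """
    forward = {center for fan, center in edges if fan == x}
    if not forward:
        return False
    common = None
    for center in forward:
        pointing = {fan for fan, c in edges if c == center}
        common = pointing if common is None else common & pointing
    return len(common) >= i


graph = {("f1", "c1"), ("f1", "c2"), ("f2", "c1"), ("f2", "c2"), ("f3", "c1")}
print(fan_of_core(graph, "f1", 2), fan_of_core(graph, "f1", 3))
```

If the test succeeds the core is output (“include”); either way x can then be dropped from further consideration (“exclude”).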

Step 5: using in-degrees of pages

  • Delete pages that are highly referenced, e.g., Yahoo!, AltaVista

  • Reason: they are referenced for many reasons and are not likely to form an emerging community

  • Formally: remove all pages with more than k in-links (k = 50, for instance)

  • Results:

    – 60M pages pointing to 20M pages

    – 2M potential fans

Weakness of CT

  • The bipartite graph cannot suit all kinds of communities

  • The density of the community is hard to adjust

Experiment on CT

  • 200 million web pages

  • IBM PC with an Intel 300MHz Pentium II processor, with 512M of memory, running Linux

  • i from 3 to 10 and j from 3 to 20

  • 200k potential communities were discovered

    • 29% of them cannot be found in Yahoo!

Summary

  • Conclusion: The methods to discover communities from the web depend on how we define the communities through the link structure

  • Future work:

    • How to relate the contents to the link structure

Mining Topic-Specific Concepts and Definitions on the Web

Minnie Virk

May 2003,  Proceedings of the 12th International conference on World Wide Web, ACM Press

Bing Liu, University of Illinois at Chicago, 851 S. Morgan Street Chicago IL 60607-7053

Chee Wee Chin,

Hwee Tou Ng, National University of Singapore

3 Science Drive 2 Singapore


References

  • Agrawal, R. and Srikant, R. “Fast Algorithms for Mining Association Rules”, VLDB-94, 1994.

  • Anderson, C. and Horvitz, E. “Web Montage: A Dynamic Personalized Start Page”, WWW-02, 2002.

  • Brin, S. and Page, L. “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, WWW7, 1998.

  • Web Mining: A Bird’s Eye View by Sanjay Kumar Madria, Department of Computer Science, University of Missouri-Rolla, MO

Introduction

  • When one wants to learn about a topic, one reads a book or a survey paper.

  • One can read the research papers about the topic.

  • None of these is very practical.

  • Learning from the web is convenient, intuitive, and diverse.

Purpose of the Paper

  • This paper’s task is “mining topic-specific knowledge on the Web”.

  • The goal is to help people learn in-depth knowledge of a topic systematically on the Web.

Learning about a New Topic

  • One needs to find definitions and descriptions of the topic.

  • One also needs to know the sub-topics and salient concepts of the topic.

  • Thus, one wants the knowledge as presented in a traditional book.

  • The task of this paper can be summarized as “compiling a book on the Web”.

Proposed Technique

  • First, identify sub-topics or salient concepts of that specific topic.

  • Then, find and organize the informative pages containing definitions and descriptions of the topic and sub-topics.

Why are the current search techniques not sufficient?

For definitions and descriptions of the topic:

Existing search engines rank web pages based on keyword matching and hyperlink structures. This is not very useful for measuring the informative value of a page.

For sub-topics and salient concepts of the topic:

A single web page is unlikely to contain information about all the key concepts or sub-topics of the topic. Thus, sub-topics need to be discovered from multiple web pages. Current search engine systems do not perform this task.

Related Work

  • Web information extraction wrappers

  • Web query languages

  • User preference approach

  • Question answering in information retrieval

  • Question answering is the work most closely related to this paper. The objective of a question-answering system is to provide direct answers to questions submitted by the user. In this paper’s task, many of the questions are about definitions of terms.

The Algorithm

WebLearn(T)

1) Submit T to a search engine, which returns a set of relevant pages.

2) The system mines the sub-topics or salient concepts of T using a set S of top-ranking pages from the search engine.

3) The system then discovers the informative pages containing definitions of the topic and sub-topics (salient concepts) from S.

4) The user views the concepts and informative pages.

   If s/he still wants to know more about sub-topics, then

     for each user-interested sub-topic Ti of T do

       WebLearn(Ti);
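
The recursion above can be sketched in Python. This is only an illustrative outline: the four callables (`search`, `mine_subtopics`, `find_definitions`, `choose`) and the `max_depth` guard are hypothetical names that do not come from the paper; they stand in for the search engine, the two mining steps, and the user's choice of sub-topics.

```python
def web_learn(topic, search, mine_subtopics, find_definitions, choose,
              depth=0, max_depth=2):
    """Recursive outline of WebLearn(T); all helpers are supplied by the caller."""
    pages = search(topic)                                    # step 1
    subtopics = mine_subtopics(topic, pages)                 # step 2
    informative = find_definitions(topic, subtopics, pages)  # step 3
    node = {
        "topic": topic,
        "subtopics": subtopics,
        "informative_pages": informative,
        "children": [],
    }
    # Step 4: recurse only into the sub-topics the user asks about.
    # max_depth is an assumption added here to keep the recursion bounded.
    if depth < max_depth:
        for sub in choose(topic, subtopics):
            node["children"].append(
                web_learn(sub, search, mine_subtopics, find_definitions,
                          choose, depth + 1, max_depth)
            )
    return node
```

The result is a tree whose shape resembles a table of contents, which matches the stated goal of compiling a book on the Web.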

Sub-Topic or Salient Concept Discovery

  • Observation:

    Sub-topics or salient concepts of a topic are important word phrases, usually emphasized using some HTML tags (e.g., <h1>,...,<h4>,<b>).

  • However, emphasis tags alone are not sufficient. Data mining techniques help find the frequently occurring word phrases.

Sub-Topic Discovery

  • After obtaining a set of relevant top-ranking pages (using Google), sub-topic discovery consists of the following 5 steps.

    1) Filter out the “noisy” documents that rarely contain sub-topics or salient-concepts. The resulting set of documents is the source for sub-topic discovery.

Sub-Topic Discovery

2) Identify important phrases in each page (discover phrases emphasized by HTML markup tags).

Rules for deciding that an emphasized text segment can safely be ignored:

  • Contains a salutation title (Mr, Dr, Professor).

  • Contains a URL or an email address.

  • Contains terms related to a publication (conference, proceedings, journal).

  • Contains an image between the markup tags.

  • Too lengthy (the paper uses 15 words as the upper limit)
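
A minimal sketch of step 2 and the ignore rules, using only Python's standard `html.parser`. The tag set, the word lists, and the URL/email regular expression are simplified assumptions for illustration (the slides give only examples, not complete lists), and empty segments are also dropped, which is an extra convenience rule.

```python
import re
from html.parser import HTMLParser

EMPHASIS_TAGS = {"h1", "h2", "h3", "h4", "b"}            # tags named on the slide
SALUTATIONS = {"mr", "dr", "professor"}
PUBLICATION_TERMS = {"conference", "proceedings", "journal"}
MAX_WORDS = 15                                           # upper limit used by the paper

class EmphasisExtractor(HTMLParser):
    """Collects text found inside emphasis tags, noting embedded images."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.buffer = []
        self.has_image = False
        self.segments = []          # list of (text, has_image)

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.depth += 1
        elif tag == "img" and self.depth:
            self.has_image = True

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.depth:
            self.depth -= 1
            if self.depth == 0:
                text = " ".join("".join(self.buffer).split())
                self.segments.append((text, self.has_image))
                self.buffer, self.has_image = [], False

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)

def can_ignore(text, has_image):
    """Apply the ignore rules to one emphasized segment."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return (
        not words
        or has_image
        or bool(SALUTATIONS & words)
        or bool(PUBLICATION_TERMS & words)
        or re.search(r"https?://|www\.|\S+@\S+", text) is not None
        or len(text.split()) > MAX_WORDS
    )

def important_phrases(html):
    parser = EmphasisExtractor()
    parser.feed(html)
    parser.close()
    return [text for text, img in parser.segments if not can_ignore(text, img)]
```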

Sub-Topic Discovery

  • Also, in this step, some preprocessing techniques such as stopword removal and word stemming are applied in order to extract quality text segments.

  • Stopword removal: Eliminating the words that occur too frequently and have little informational meaning.

  • Word stemming: Finding the root form of a word by removing its suffix.
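
The two preprocessing operations can be illustrated with a toy example. The stopword list and the suffix-stripping rule below are deliberately crude stand-ins chosen for this sketch; the slides do not say which stopword list or stemmer the paper uses, and a real system would more likely use an established stemmer such as Porter's.

```python
# Tiny illustrative stopword list (an assumption, not the paper's list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are", "for", "on"}

# Crude suffix stripping; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")

def stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(phrase):
    """Lowercase, drop stopwords, and stem the remaining words."""
    return [stem(w) for w in phrase.lower().split() if w not in STOPWORDS]
```

With these rules, "clustering", "clustered", and "clusters" all reduce to the same token, which is what lets the later frequency counts treat them as one word.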

Sub-Topic Discovery

  • 3) Mine frequently occurring phrases:

    - Each piece of text extracted in step 2 is stored in a dataset called a transaction set.

    - Then, an association rule miner based on the Apriori algorithm is executed to find the frequent itemsets. In this context, an itemset is a set of words that occur together, and an itemset is frequent if it appears in more than two documents.

    - Only the first step of the Apriori algorithm (frequent itemset generation) is needed, and only frequent itemsets with three words or fewer need to be found (this restriction can be relaxed).
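
A minimal sketch of this step, assuming each transaction is stored together with the identifier of the document it came from so that support can be counted in documents. The function name, the `(doc_id, words)` layout, and the parameter names are hypothetical; `min_docs=3` encodes "more than two documents" and `max_size=3` the three-word limit.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_docs=3, max_size=3):
    """Level-wise (Apriori-style) frequent itemset generation.

    transactions: list of (doc_id, iterable of words).
    Returns {frozenset of words: number of distinct documents containing it}.
    """
    txns = [(doc, frozenset(words)) for doc, words in transactions]

    def support(itemset):
        # Count distinct documents with a transaction containing the itemset.
        return len({doc for doc, words in txns if itemset <= words})

    candidates = {frozenset([w]) for _, words in txns for w in words}
    frequent = {}
    for size in range(1, max_size + 1):
        level = {}
        for c in candidates:
            s = support(c)
            if s >= min_docs:
                level[c] = s
        frequent.update(level)
        # Join step: merge frequent itemsets of this size into candidates
        # one word larger, keeping only those whose subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == size + 1 and all(
                frozenset(sub) in level for sub in combinations(union, size)
            ):
                candidates.add(union)
    return frequent
```

No association rules are generated; as the slide notes, the frequent itemsets themselves are the output of interest.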

Sub-Topic Discovery

  • 4) Eliminate itemsets that are unlikely to be sub-topics, and determine the sequence of words in a sub-topic. (postprocessing)

  • Heuristic: If an itemset does not appear alone as an important phrase in any page, it is unlikely to be a main sub-topic and it is removed.
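
The heuristic can be sketched as a filter over the frequent itemsets. The slides do not describe how the word sequence is determined, so taking the order from the first phrase that consists of exactly the itemset's words is an assumption made for this sketch; the function and argument names are likewise hypothetical.

```python
def order_and_filter(itemsets, phrases):
    """Keep itemsets that occur alone as an important phrase.

    itemsets: iterable of frozensets of words.
    phrases:  list of word lists, one per important phrase from step 2.
    Returns {itemset: tuple of words in phrase order}.
    """
    kept = {}
    for itemset in itemsets:
        for phrase in phrases:
            # "Appears alone": the phrase is made of exactly these words.
            if len(phrase) == len(itemset) and set(phrase) == set(itemset):
                kept[itemset] = tuple(phrase)
                break
    return kept
```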

Sub-Topic Discovery

  • 5) Rank the remaining itemsets. The remaining itemsets are regarded as the sub-topics or salient concepts of the search topic and are ranked by the number of pages in which they occur.
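
The ranking can be illustrated in a few lines, assuming each page has already been reduced to a set of preprocessed words. The function name and data layout are hypothetical, and ties keep their input order because Python's sort is stable.

```python
def rank_subtopics(subtopics, pages):
    """Rank sub-topics by the number of pages that contain all their words.

    subtopics: list of word tuples.
    pages:     list of sets of words, one set per page.
    Returns a list of (subtopic, page_count), highest count first.
    """
    counted = [
        (sub, sum(1 for words in pages if set(sub) <= words))
        for sub in subtopics
    ]
    return sorted(counted, key=lambda pair: pair[1], reverse=True)
```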

Definition Finding

  • This step tries to identify those pages that include definitions of the search topic and its sub-topics discovered in the previous step.

  • Preprocessing steps:

  • Text that is not displayed by browsers (e.g., <script>...</script>, <!-- comments -->) is ignored.

  • Word stemming is applied.

  • Stopwords and punctuation are kept as they serve as clues to identify definitions.

  • HTML tags within a paragraph are removed.
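
A rough sketch of these preprocessing steps with the standard `html.parser` module. The set of tags treated as paragraph boundaries is an assumption, and stemming is left out to keep the example short; stopwords and punctuation are kept, as the slide requires.

```python
from html.parser import HTMLParser

# Tags treated as paragraph boundaries (an assumption for this sketch).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "br"}

class VisibleText(HTMLParser):
    """Collects browser-visible text, split into paragraphs."""
    def __init__(self):
        super().__init__()
        self.in_script = 0
        self.paragraphs = [[]]

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script += 1
        elif tag in BLOCK_TAGS:
            self.paragraphs.append([])

    def handle_endtag(self, tag):
        if tag == "script" and self.in_script:
            self.in_script -= 1
        elif tag in BLOCK_TAGS:
            self.paragraphs.append([])

    def handle_data(self, data):
        # Inline tags such as <b> never reach this method, so they vanish
        # from the paragraph text; comments are skipped because
        # handle_comment is not overridden.
        if not self.in_script:
            self.paragraphs[-1].append(data)

def visible_paragraphs(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    joined = (" ".join("".join(p).split()) for p in parser.paragraphs)
    return [p for p in joined if p]
```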

Definition Finding

  • After that, the following patterns are applied to identify definitions:

[1] Bing Liu, Chee Wee Chin, Hwee Tou Ng. Mining Topic-Specific Concepts and Definitions on the Web

Definition Finding

  • Besides using the above patterns, the paper also relies on HTML structuring and hyperlink structures.

  • 1) If a page contains only one header, or only one large emphasized text segment at the beginning of the document, then the document is assumed to contain a definition of the concept in that header.

  • 2) Definitions at the second level of the hyperlink structure are also discovered. All the patterns and methods described above are applied to these second level documents.

Definition Finding

  • Observation: Sometimes no informative page is found for a particular sub-topic when the pages for the main topic are very general and do not contain detailed information for sub-topics.

  • In such cases, the sub-topic can be submitted to the search engine and sub-subtopics may be found recursively.

Conclusions

  • The proposed techniques aim at helping Web users to learn an unfamiliar topic in-depth and systematically.

  • This is an efficient system to discover and organize knowledge on the web, in a way similar to a traditional book, to assist learning.

Questions?

Thank You!
