WEB CLUSTERING ENGINES


Web clustering engines are an emerging trend in the field of information retrieval. They organize search results by topic, thus providing a complementary view to the flat ranked list returned by standard search engines.


Presentation Transcript




Search Engine?

  • Search engines are an invaluable tool for retrieving information from the Web. In response to a user query, they return a list of results ranked in order of relevance to the query.

  • E.g., Google, Yahoo, etc.


Flat Ranked vs. Clustered

  • Google (Flat Ranked Search Engine)


  • Northern Light (Clustered Search Engine)


Why Web Clustering Engines?

  • Conventional engines are not very effective for ambiguous queries.

  • For such a query, results covering different meanings are mixed together in the ranked list, so irrelevant items appear among the relevant ones.


  • These systems group the results returned by a search engine into a hierarchy of labeled clusters (also called categories).

Web clustering engines:

1. Northern Light - predefined set of clusters

2. Credo Reference

3. KartOO

4. Eyeplorer


Main advantages of the cluster hierarchy

  • It provides shortcuts to the items that relate to the same meaning.

  • It allows better topic understanding.


Issues in Implementation of Clusters

  • Short input data descriptions (titles and snippets rather than full documents).

  • Meaningful labels.

  • Selection of similarity measure.

  • Grouping of objects into clusters.

  • Computational efficiency.

  • Unknown number of clusters.


Architecture & Techniques


1. Search Results Acquisition

  • Provides input for the rest of the system.

  • Based on the query, the acquisition component must deliver 50 to 500 results, each of which should contain a title, a contextual snippet, and the URL.

  • The source of search results can be any public search engine, such as Google, Yahoo, etc.

  • Fetching results from other search engines.
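
The record handed to the rest of the system can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `SearchResult` and `acquire` are hypothetical, `fetch` is a placeholder for whatever client talks to a public search engine (no real API is assumed), and dropping duplicate URLs is an added assumption. Only the three fields and the 50 to 500 bound come from the slide.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SearchResult:
    """One search result: title, contextual snippet, and URL."""
    title: str
    snippet: str
    url: str


def acquire(query: str,
            fetch: Callable[[str, int], List[SearchResult]],
            limit: int = 100) -> List[SearchResult]:
    """Ask a search source for up to `limit` results.

    `fetch` is a hypothetical callable standing in for a real
    search-engine client. Duplicate URLs are dropped (an assumption
    made for this sketch, useful when several sources are merged).
    """
    if not 50 <= limit <= 500:
        raise ValueError("limit should stay within the 50 to 500 range")
    seen = set()
    results = []
    for result in fetch(query, limit):
        if result.url not in seen:
            seen.add(result.url)
            results.append(result)
    return results[:limit]
```

Passing the source as a callable keeps the sketch independent of any particular engine; a meta-search setup could wrap several sources in one `fetch`.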


2. Preprocessing of Search Results

  • The primary aim is to convert the search results into ‘features’.

    Steps:

    i. Language identification

    ii. Tokenization

    iii. Stemming

    iv. Feature selection


ii. Tokenization:

The text of each search result is split into a sequence of basic independent units called tokens, each representing a word, number, or symbol.
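
A minimal sketch of such a tokenizer, assuming a regular expression is good enough for short snippets. The pattern and the name `tokenize` are illustrative choices; a real engine would likely use a language-specific tokenizer.

```python
import re
from typing import List

# A token is a number, a word (letters with optional inner apostrophes),
# or a single non-space symbol.
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|[^\W\d_]+(?:'[^\W\d_]+)*|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text into word, number, and symbol tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Polly had 2 dogs, didn't she?"))
# ['Polly', 'had', '2', 'dogs', ',', "didn't", 'she', '?']
```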


iii. Stemming:

Remove the inflectional prefixes and suffixes of each word to reduce its different grammatical forms to a common base form called a ‘stem’.

E.g.:

connected, connecting, interconnection → ‘connect’
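
The example above can be reproduced with a toy affix stripper. This is only a sketch: the prefix and suffix lists are hypothetical and cover just enough to handle the three example words, whereas production systems rely on an established stemming algorithm with far more rules.

```python
# Illustrative affix lists only; not a complete rule set.
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ion", "ed", "s")


def toy_stem(word: str, min_stem: int = 4) -> str:
    """Strip at most one known prefix and one known suffix.

    `min_stem` keeps very short remainders intact, so that a word like
    'interest' is not cut down to 'est'.
    """
    stem = word.lower()
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_stem:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
            stem = stem[:-len(suffix)]
            break
    return stem


print([toy_stem(w) for w in ("connected", "connecting", "interconnection")])
# ['connect', 'connect', 'connect']
```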


iv. Feature selection:

  • Extract features for each search result present in the input.

  • Features are atomic entities by which we can describe an object and represent its most important characteristics to an algorithm.

  • Features vary from single words to tuples of words.
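
One simple way to obtain features ranging from single words to tuples of words is to count word n-grams. The sketch below assumes the input is already tokenized; the function name `extract_features` and the default `max_n = 2` are illustrative choices, not part of the original slides.

```python
from collections import Counter
from typing import List


def extract_features(tokens: List[str], max_n: int = 2) -> Counter:
    """Count every run of 1..max_n consecutive tokens as one feature."""
    features = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features[tuple(tokens[i:i + n])] += 1
    return features


features = extract_features(["web", "clustering", "engines"])
# Single words such as ('web',) and pairs such as ('web', 'clustering')
# both appear as keys, each with its occurrence count.
```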


How can we represent a feature/text?

  • Vector Space Model (VSM)

  • A document d is represented in the VSM as a vector $[w_{t_0}, w_{t_1}, \ldots, w_{t_n}]$,

    where $t_0, t_1, \ldots, t_n$ is a set of words/features

    and $w_{t_i}$ is the weight/importance of feature $t_i$.

    E.g.:

    d → “Polly had a dog and the dog had Polly”

[Figure: VSM representation of d]
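
As a worked illustration, the example sentence can be turned into a vector by using raw term counts as weights. The weighting scheme is an assumption made here (the slide does not say which weights it uses), and `to_vector` is a hypothetical helper name.

```python
from collections import Counter
from typing import List, Optional, Tuple


def to_vector(text: str,
              vocabulary: Optional[List[str]] = None) -> Tuple[List[str], List[int]]:
    """Return (vocabulary, weights) with term frequency as the weight."""
    counts = Counter(text.lower().split())
    if vocabulary is None:
        vocabulary = sorted(counts)  # fixed feature order t0..tn
    return vocabulary, [counts[term] for term in vocabulary]


vocab, vector = to_vector("Polly had a dog and the dog had Polly")
print(vocab)   # ['a', 'and', 'dog', 'had', 'polly', 'the']
print(vector)  # [1, 1, 2, 2, 2, 1]
```

Passing the same `vocabulary` for every document keeps all vectors aligned, which is what a similarity measure needs.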


3. Cluster Construction & Labelling

  • The set of search results, along with their features, is input to the clustering algorithm, which builds the clusters and labels them.

    Three types of algorithms:

    1. Data-centric

    2. Description-aware

    3. Description-centric


Data-Centric Clustering Algorithm

  • The system first clusters a collection of documents into a set of k clusters (scatter).

  • At query time, the user selects the clusters of interest (gather), and the system re-clusters the documents they contain.

  • The process repeats until a small cluster with relevant documents is found.
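
The scatter and gather steps above can be sketched as two small functions. The clustering routine below is a stand-in chosen for brevity (a k-means-style loop over term-frequency vectors with cosine similarity, seeded with the first k documents); it is not claimed to be the routine any particular engine uses, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def vectorize(text: str) -> Counter:
    """Term-frequency vector of a document (lowercased whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def scatter(docs: List[str], k: int, rounds: int = 5) -> List[List[str]]:
    """Partition docs into at most k clusters.

    Seeding with the first k documents is an arbitrary, illustrative choice.
    """
    vectors = [vectorize(d) for d in docs]
    centroids = vectors[:k]
    assignment: List[int] = []
    for _ in range(rounds):
        # Assign each document to its most similar centroid.
        assignment = [max(range(len(centroids)),
                          key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        # Recompute each centroid as the sum of its members
        # (cosine similarity ignores the overall scale).
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = sum(members, Counter())
    clusters = [[d for d, a in zip(docs, assignment) if a == c]
                for c in range(len(centroids))]
    return [c for c in clusters if c]


def gather(clusters: List[List[str]], selected: List[int]) -> List[str]:
    """Merge the clusters the user picked into a new, smaller collection."""
    return [doc for i in selected for doc in clusters[i]]


docs = ["jaguar car speed", "jaguar cat jungle",
        "car engine speed", "jungle cat prey"]
clusters = scatter(docs, k=2)           # scatter
subset = gather(clusters, selected=[1])  # user picks the second cluster
clusters = scatter(subset, k=2)          # re-cluster the smaller collection
```

Each scatter/gather round shrinks the working collection, which is why the loop ends once a small cluster of relevant documents remains.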


Difficulties in Data-Centric Algorithms

  • None of these algorithms is incremental in nature. An incremental algorithm would “clean” each document as it arrives from the Web and add it to the available model.

  • Lack of meaningful labels.


4. Visualization of Clustered Results

  • One prominent approach is based on hierarchical folders.

  • Clusty, CREDO, Lingo3G - hierarchical folder visualization approach

  • Grokker - nesting and zooming approach

  • KartOO - graph-based interfaces


THANK YOU

