WEB CLUSTERING ENGINES



Web clustering engines are an emerging trend in the field of information retrieval. They organize search results by topic, providing a complementary view to the flat ranked list returned by standard search engines.



Search Engine?

  • Search engines are an invaluable tool for retrieving information from the Web. In response to a user query, they return a list of results ranked in order of relevance to the query.

  • E.g., Google, Yahoo, etc.


Flat Ranked vs. Clustered

  • Google (flat ranked search engine)

  • Northern Light (clustered search engine)


Why Web Clustering Engines?

  • Conventional engines handle ambiguous queries poorly.

  • The results a conventional search engine returns for such a query are mixed together in a single list, so irrelevant items appear among the relevant ones.


  • Web clustering engines group the results returned by a search engine into a hierarchy of labeled clusters (also called categories).

Examples of web clustering engines:

1. Northern Light (uses a predefined set of clusters)

2. Credo Reference

3. Kartoo

4. Eyeplorer


Main advantages of the cluster hierarchy

  • It provides shortcuts to the items that relate to the same meaning of the query.

  • It allows better topic understanding.


Issues in Implementing Clustering

  • Short input data description.

  • Meaningful labels.

  • Selection of similarity measure.

  • Grouping of objects into clusters.

  • Computational efficiency.

  • Unknown number of clusters.


Architecture & Techniques


1. Search Results Acquisition

  • Provides input for the rest of the system.

  • Based on the query, the acquisition component must deliver 50 to 500 results, each of which should contain a title, a contextual snippet, and the URL.

  • The source of search results can be any public search engine, such as Google or Yahoo.

  • Results can also be fetched from several search engines and combined.
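The acquisition step above can be sketched as follows. This is a minimal illustration, not a real search engine client: `SearchResult`, `acquire_results`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for whatever API call a real system would make. The sketch only enforces what the slide states: 50 to 500 results, each with a title, a snippet, and a URL.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SearchResult:
    """One search result: the three fields the acquisition step must supply."""
    title: str
    snippet: str
    url: str


def acquire_results(query: str,
                    fetch: Callable[[str, int], List[SearchResult]],
                    limit: int = 200) -> List[SearchResult]:
    """Fetch up to `limit` results, keeping only complete, non-duplicate ones."""
    if not 50 <= limit <= 500:
        raise ValueError("limit should be between 50 and 500")
    seen_urls = set()
    results = []
    for result in fetch(query, limit):
        complete = result.title and result.snippet and result.url
        if complete and result.url not in seen_urls:
            seen_urls.add(result.url)
            results.append(result)
    return results[:limit]


def fake_fetch(query: str, limit: int) -> List[SearchResult]:
    """Hypothetical stand-in for a call to a public search engine."""
    return [
        SearchResult(f"{query} page {i}",
                     f"snippet about {query} #{i}",
                     f"http://example.com/{i % 3}")
        for i in range(limit)
    ]
```

Passing the fetch function as a parameter keeps the rest of the pipeline independent of which search engine supplies the results.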


2. Preprocessing of Search Results

  • The primary aim is to convert the search results into ‘features’.

    Steps:

    i. Language identification

    ii. Tokenization

    iii. Stemming

    iv. Feature selection


ii. Tokenization:

The text of each search result is split into a sequence of basic independent units called tokens, each of which represents a word, a number, or a symbol.
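A minimal tokenizer along these lines might look like the sketch below. The pattern and the name `tokenize` are illustrative assumptions: it handles ASCII letters and digits only, whereas a real system would follow the rules of the language identified in the previous step.

```python
import re

# Assumed token classes: a run of letters (word), a run of digits (number),
# or any single other non-space character (symbol).
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def tokenize(text):
    """Split text into word, number, and symbol tokens; whitespace is dropped."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Jaguar cars: 2 results."))
# ['Jaguar', 'cars', ':', '2', 'results', '.']
```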


iii. Stemming:

Remove the inflectional prefixes and suffixes of each word, reducing its different grammatical forms to a common base form called a ‘stem’.

E.g.:

connected, connecting, interconnection

↓ ↓ ↓

‘connect’
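The example above can be reproduced with a toy affix stripper. This is only a sketch: the `PREFIXES` and `SUFFIXES` lists are assumptions chosen to cover the three example words, and the function is not the Porter algorithm or any other published stemmer. It will over-stem many ordinary words (for instance, it turns "interest" into "est").

```python
# Hypothetical, deliberately tiny affix lists; real stemmers use far richer rules.
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ion", "ed", "s")


def toy_stem(word, min_stem=3):
    """Strip at most one known prefix and one known suffix from a word."""
    word = word.lower()
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:-len(suffix)]
            break
    return word


for w in ("connected", "connecting", "interconnection"):
    print(w, "->", toy_stem(w))
# connected -> connect
# connecting -> connect
# interconnection -> connect
```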


iv. Feature selection:

  • Extract features for each search result present in the input.

  • Features are atomic entities that describe an object and represent its most important characteristics to an algorithm.

  • Features range from single words to tuples of words.
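One simple way to obtain features that range from single words to word tuples is to extract n-grams, as sketched below. The stop-word list, the rule of discarding n-grams that begin or end with a stop word, and the name `extract_features` are all assumptions made for illustration; the slides do not say how features are chosen.

```python
# Assumed, minimal stop-word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "in", "to", "for"})


def extract_features(tokens, max_n=2):
    """Return word n-grams (as tuples) of length 1..max_n from a token list."""
    words = [t.lower() for t in tokens if t.isalpha()]
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            # Skip n-grams that start or end with a stop word.
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            features.append(gram)
    return features


print(extract_features(["Web", "clustering", "of", "search", "results"]))
# [('web',), ('clustering',), ('search',), ('results',),
#  ('web', 'clustering'), ('search', 'results')]
```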


How can a feature/text be represented?

  • Vector Space Model (VSM)

  • A document d is represented in the VSM as a vector $[w_{t_0}, w_{t_1}, \dots, w_{t_n}]$,

    where $t_0, t_1, \dots, t_n$ is a set of words/features

    and $w_{t_i}$ is the weight/importance of feature $t_i$.

    E.g.:

    d → “Polly had a dog and the dog had Polly”

(Figure: VSM representation of d)
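As an illustration, the example document can be turned into a vector by using raw term frequency as the weight. That weighting is an assumption: the original figure is not available here, and other schemes (such as tf-idf) are equally possible. The function name `term_frequency_vector` is hypothetical.

```python
from collections import Counter


def term_frequency_vector(text, vocabulary=None):
    """Return (vocabulary, weights), where each weight is a raw term count."""
    counts = Counter(text.lower().split())
    if vocabulary is None:
        vocabulary = sorted(counts)
    # Counter returns 0 for terms that do not occur in the text.
    return vocabulary, [counts[term] for term in vocabulary]


terms, weights = term_frequency_vector("Polly had a dog and the dog had Polly")
print(terms)    # ['a', 'and', 'dog', 'had', 'polly', 'the']
print(weights)  # [1, 1, 2, 2, 2, 1]
```

Passing a shared `vocabulary` makes the vectors of different documents comparable position by position.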


3. Cluster Construction & Labelling

  • The set of search results, along with their features, is the input to the clustering algorithm,

    which builds the clusters and labels them.

    Three types of algorithms:

    1. Data-centric

    2. Description-aware

    3. Description-centric


Data-Centric Clustering Algorithm

  • An initial clustering partitions a collection of documents into a set of k clusters (scatter).

  • At query time, the user selects the clusters of interest (gather) and the system re-clusters the documents they contain.

  • The process repeats until a small cluster of relevant documents is found.
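The scatter and gather steps can be sketched as below. The slides do not name the clustering algorithm, so this sketch substitutes a plain k-means over term-frequency vectors with cosine similarity; seeding the centroids with the first k documents is also an assumption made to keep the result deterministic. All function names are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Sparse term-frequency vector of a document."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def k_means(docs, k, iterations=10):
    """Scatter: partition docs into k clusters (assumed k-means variant)."""
    vectors = [vectorize(d) for d in docs]
    centroids = vectors[:k]  # assumption: seed with the first k documents
    assignment = []
    for _ in range(iterations):
        new_assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                # A summed vector works as a centroid: cosine ignores scale.
                centroids[c] = sum(members, Counter())
    clusters = [[] for _ in range(k)]
    for doc, a in zip(docs, assignment):
        clusters[a].append(doc)
    return clusters


def gather_and_recluster(clusters, selected, k):
    """Gather: pool the user-selected clusters, then scatter them again."""
    pooled = [doc for i in selected for doc in clusters[i]]
    return k_means(pooled, min(k, len(pooled)))


docs = ["jaguar car speed", "jaguar cat jungle",
        "car engine speed", "cat jungle animal"]
clusters = k_means(docs, k=2)
print(clusters)
# [['jaguar car speed', 'car engine speed'],
#  ['jaguar cat jungle', 'cat jungle animal']]
print(gather_and_recluster(clusters, selected=[0], k=2))
# [['jaguar car speed'], ['car engine speed']]
```

Note that the clusters produced this way carry no labels, which is one of the weaknesses of data-centric approaches.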


Difficulties in Data-Centric Algorithms

  • None of these algorithms is incremental: they cannot simply “clean” each document as it arrives from the Web and add it to the existing model.

  • They lack meaningful labels.


4. Visualization of Clustered Results

  • One prominent approach is based on hierarchical folders.

  • Clusty, CREDO, Lingo3G: hierarchical folder visualization

  • Grokker: nesting and zooming approach

  • KartOO: graph-based interface
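The hierarchical folder approach can be illustrated with a small text renderer. This is a sketch only: the nested-dictionary layout, the `[+]` marker, and the name `render_folders` are assumptions for illustration and do not reflect how any of the engines listed above is implemented.

```python
def render_folders(cluster, depth=0):
    """Render a nested {label: subclusters-or-titles} dict as indented lines."""
    lines = []
    indent = "  " * depth
    for label, content in cluster.items():
        if isinstance(content, dict):
            # Inner node: a folder containing sub-folders.
            lines.append(f"{indent}[+] {label}")
            lines.extend(render_folders(content, depth + 1))
        else:
            # Leaf folder: a list of result titles, shown with its size.
            lines.append(f"{indent}[+] {label} ({len(content)})")
            lines.extend(f"{indent}  - {title}" for title in content)
    return lines


tree = {"Jaguar": {"Cars": ["Jaguar XK review"],
                   "Animals": ["Jaguar habitat", "Big cats"]}}
print("\n".join(render_folders(tree)))
# [+] Jaguar
#   [+] Cars (1)
#     - Jaguar XK review
#   [+] Animals (2)
#     - Jaguar habitat
#     - Big cats
```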


THANK YOU

