WEB CLUSTERING ENGINES


Web clustering engines are an emerging trend in the field of data retrieval. They organize search results by topic, providing a complementary view to the flat ranked list returned by standard search engines.


PowerPoint Slideshow about 'Web clustering Engines' - factscomputersoftware



Search Engine?

  • Search engines are an invaluable tool for retrieving information from the Web. In response to a user query, they return a list of results ranked in order of relevance to the query.

  • E.g., Google, Yahoo, etc.


Flat Ranked VS Clustered

  • Google (Flat Ranked Search Engine)



Why Web Clustering Engines?

  • Conventional engines are not very effective on ‘ambiguous’ queries.

  • For such a query, the results returned by a conventional search engine mix the different meanings together in one list, so irrelevant items appear among the relevant ones.


Web clustering engines:

1. Northern Light - predefined set of clusters

2. Credo Reference

3. Kartoo

4. Eyeplorer


Main advantages of the cluster hierarchy

  • It provides shortcuts to the items that relate to the same meaning.

  • It allows better topic understanding.


Issues in Implementation of Clusters

  • Short input data description.

  • Meaningful labels.

  • Selection of similarity measure.

  • Grouping of objects into clusters.

  • Computational efficiency.

  • Unknown number of clusters.


Architecture & Techniques


1. Search Results Acquisition

  • Provides input for the rest of the system.

  • Based on the query, the acquisition component must deliver 50 to 500 results, each of which should contain a title, a contextual snippet, and the URL.

  • The source of search results can be any public search engine, such as Google, Yahoo, etc.

  • Fetching results from other search engines.
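
The input described above can be modeled as a small record type. The following is a minimal sketch in Python, not an implementation from the slides: the names `SearchResult` and `acquire_results` are hypothetical, and the raw hits are made-up in-memory dictionaries, since the slides do not specify an API for any particular search engine.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SearchResult:
    """One search result as the acquisition step should deliver it."""
    title: str
    snippet: str
    url: str


def acquire_results(raw_hits: List[dict], limit: int = 500) -> List[SearchResult]:
    """Convert raw hits (hypothetical dicts) into SearchResult records.

    Hits missing a title, snippet, or URL are skipped, and at most
    `limit` results are returned.
    """
    results = []
    for hit in raw_hits:
        if len(results) >= limit:
            break
        title, snippet, url = hit.get("title"), hit.get("snippet"), hit.get("url")
        if title and snippet and url:
            results.append(SearchResult(title, snippet, url))
    return results


# Illustrative, made-up hits standing in for a real search engine response.
sample_hits = [
    {"title": "Jaguar (animal)", "snippet": "The jaguar is a large cat...", "url": "http://example.org/a"},
    {"title": "Jaguar Cars", "snippet": "", "url": "http://example.org/b"},  # empty snippet: skipped
]
results = acquire_results(sample_hits)
```

Skipping incomplete hits keeps the later steps simple, because every record that reaches preprocessing is guaranteed to carry all three fields.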


2. Preprocessing of Search Results

  • The primary aim is to convert the search results into ‘features’.

    Steps:

    i. Language identification

    ii. Tokenization

    iii. Stemming

    iv. Feature selection


ii. Tokenization:

The text of each search result is split into a sequence of basic independent units called tokens, each of which represents a word, a number, or a symbol.
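
As a rough illustration of this step, the sketch below splits text into word, number, and symbol tokens with a regular expression. The pattern and the function name `tokenize` are assumptions made for this example; the slides do not prescribe a specific tokenizer.

```python
import re
from typing import List

# A token is either a run of word characters (letters, digits, underscore),
# optionally with internal apostrophes, or a single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word, number, and symbol tokens."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("Polly had 2 dogs!")
# tokens is ['polly', 'had', '2', 'dogs', '!']
```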


iii. Stemming:

Remove the inflectional prefixes and suffixes of each word to reduce the different grammatical forms of the word to a common base form called a ‘stem’.

E.g.:

connected, connecting & interconnection

↓ ↓ ↓

‘connect’
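
The example above can be reproduced with a deliberately naive affix stripper. This is only a toy sketch: the affix lists, the minimum stem length, and the name `toy_stem` are assumptions chosen to cover the three example words, and this is not one of the established stemming algorithms.

```python
# Assumed, deliberately tiny affix lists; real stemmers use far richer rules.
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ion", "ed", "s")
MIN_STEM_LENGTH = 3  # never strip an affix if fewer characters would remain


def toy_stem(word: str) -> str:
    """Strip at most one known prefix and one known suffix from the word."""
    stem = word.lower()
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM_LENGTH:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM_LENGTH:
            stem = stem[:-len(suffix)]
            break
    return stem


stems = [toy_stem(w) for w in ("connected", "connecting", "interconnection")]
# stems is ['connect', 'connect', 'connect']
```

Rules this simple over-stem some words (for example, "interest" would lose its first five letters), which is why practical systems rely on established algorithms such as the Porter stemmer.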


iv. Feature selection:

  • Extract features for each search result present in the input.

  • Features are atomic entities by which we can describe an object and represent its most important characteristics to an algorithm.

  • Features vary from single words to tuples of words.
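
One simple way to obtain features that range from single words to word tuples is to collect word n-grams. The sketch below assumes the input is already tokenized; the function name `extract_features` and the default maximum length of 2 are choices made for this example only.

```python
from typing import List, Tuple


def extract_features(tokens: List[str], max_n: int = 2) -> List[Tuple[str, ...]]:
    """Return all word n-grams of length 1..max_n, shorter n-grams first."""
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(tuple(tokens[start:start + n]))
    return features


features = extract_features(["web", "clustering", "engine"])
# Single words come first, followed by the two-word tuples
# ('web', 'clustering') and ('clustering', 'engine').
```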


How can we represent a feature/text?

  • Vector Space Model(VSM)

  • Document $d$ is represented in the VSM as a vector $[w_{t_0}, w_{t_1}, \ldots, w_{t_n}]$,

    where $t_0, t_1, \ldots, t_n$ is a set of words/features

    and $w_{t_i}$ is the weight/importance of feature $t_i$.

    E.g.:

    d → “Polly had a dog and the dog had Polly”

[Figure: VSM representation]
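
The vector for the example sentence can be computed as follows. The slides do not say which weighting scheme the original figure used, so this sketch assumes the simplest one, raw term frequency, and orders the vocabulary alphabetically; the name `term_frequency_vector` is hypothetical.

```python
from collections import Counter
from typing import List


def term_frequency_vector(text: str, vocabulary: List[str]) -> List[int]:
    """Weight each vocabulary term by how often it occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


document = "Polly had a dog and the dog had Polly"
vocabulary = sorted(set(document.lower().split()))
vector = term_frequency_vector(document, vocabulary)
# vocabulary is ['a', 'and', 'dog', 'had', 'polly', 'the']
# vector     is [ 1,   1,     2,     2,     2,       1   ]
```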


3. Cluster Construction & Labelling

  • The set of search results, along with their features, is the input to the clustering algorithm, which builds the clusters and labels them.

    Three types of algorithms:

    1. Data-centric algorithms

    2. Description-aware algorithms

    3. Description-centric algorithms


Data Centric Clustering Algorithm

  • An initial clustering partitions the collection of documents into a set of k clusters (scatter).

  • At query time, the user selects the clusters of interest (gather), and the system re-clusters the documents they contain.

  • The process repeats until a small cluster of relevant documents is found.
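
The scatter and gather steps above can be sketched as a short loop. The slides do not name the clustering algorithm used for the scatter step, so this example substitutes a simple seed-based partitioning with Jaccard similarity over word sets; that choice, the function names, and the sample documents are all assumptions made for illustration.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Similarity of two word sets: size of overlap over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def scatter(docs: List[str], k: int) -> List[List[str]]:
    """Partition docs into at most k clusters around deterministic seeds."""
    if not docs:
        return []
    bags = [set(d.lower().split()) for d in docs]
    seeds = [0]
    while len(seeds) < min(k, len(docs)):
        # Next seed: the document least similar to the seeds chosen so far.
        candidates = [i for i in range(len(docs)) if i not in seeds]
        seeds.append(min(
            candidates,
            key=lambda i: max(jaccard(bags[i], bags[s]) for s in seeds),
        ))
    clusters: List[List[str]] = [[] for _ in seeds]
    for i, doc in enumerate(docs):
        # Assign each document to its most similar seed.
        best = max(range(len(seeds)), key=lambda c: jaccard(bags[i], bags[seeds[c]]))
        clusters[best].append(doc)
    return clusters


def gather(clusters: List[List[str]], chosen: List[int]) -> List[str]:
    """Merge the clusters the user selected into one smaller collection."""
    return [doc for index in chosen for doc in clusters[index]]


# Made-up documents for an ambiguous query.
docs = [
    "jaguar car dealer",
    "jaguar car price",
    "jaguar cat habitat",
    "jaguar cat diet",
    "jaguar cat habitat rainforest",
]
clusters = scatter(docs, k=2)                    # scatter
selected = gather(clusters, chosen=[1])          # user picks the second cluster
refined = scatter(selected, k=2)                 # re-cluster the selection
```

In an interactive system the `chosen` indices would come from the user at each round, and the loop would stop once a cluster is small enough to read directly.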


Difficulties in Data Centric Algorithms

  • None of these algorithms is incremental in nature. An incremental algorithm would take each document as it arrives from the web, “clean” it, and add it to the available model.

  • Lack of meaningful labels.


4. Visualization of Clustered Results

  • One prominent approach is based on hierarchical folders

  • Clusty, CREDO, Lingo3G - hierarchical folder visualization approach

  • Grokker - nesting and zooming approach

  • KartOO - Graph based interfaces


THANK YOU

