Web clustering Engines


Web clustering engines are an emerging trend in the field of data retrieval. They organize search results by topic, thus providing a complementary view to the flat ranked list returned by standard search engines.


PowerPoint Slideshow about 'Web clustering Engines' - factscomputersoftware

Search Engine?
  • Search engines are an invaluable tool for retrieving information from the Web. In response to a user query, they return a list of results ranked in order of relevance to the query.
  • E.g., Google, Yahoo, etc.
Flat Ranked vs. Clustered
  • Google (Flat Ranked Search Engine)
Why Web Clustering Engines?
  • Conventional engines are not very effective for ambiguous queries.
  • The results a conventional search engine returns for such a query are mixed together in a single list, in which irrelevant items also occur.

These systems group the results returned by a search engine into a hierarchy of labeled clusters (also called categories).

Web clustering engines:

1. Northern Light - predefined set of clusters

2. Credo Reference

3. Kartoo

4. Eyeplorer

Main advantages of the cluster hierarchy
  • It provides shortcuts to the items that relate to the same meaning.
  • It allows better topic understanding.
Issues in Implementation of Clusters
  • Short input data description.
  • Meaningful labels.
  • Selection of similarity measure.
  • Grouping of objects into clusters.
  • Computational efficiency.
  • Unknown number of clusters.
1. Search Results Acquisition
  • Provides input for the rest of the system.
  • Based on the query, the acquisition component must deliver 50 to 500 results, each of which should contain a title, a contextual snippet, and the URL.
  • The source of search results can be any public search engine, such as Google, Yahoo, etc.
  • Fetching results from other search engines.
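
The acquisition step described above can be sketched as follows. This is a minimal illustration, not a real engine client: `SearchResult`, `acquire`, and `fake_fetch` are hypothetical names, and the canned data stands in for whatever a public search engine would return.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SearchResult:
    """One item delivered by the acquisition component."""
    title: str
    snippet: str
    url: str


def acquire(query: str,
            fetch: Callable[[str, int], List[SearchResult]],
            min_results: int = 50,
            max_results: int = 500) -> List[SearchResult]:
    """Ask a source engine for up to max_results items and keep the complete ones."""
    raw = fetch(query, max_results)
    # Keep only results that carry all three fields, capped at max_results.
    complete = [r for r in raw if r.title and r.snippet and r.url][:max_results]
    if len(complete) < min_results:
        # Assumed policy: report the shortfall and continue with what we have.
        print(f"warning: only {len(complete)} results for {query!r}")
    return complete


def fake_fetch(query: str, limit: int) -> List[SearchResult]:
    """Hypothetical stand-in for a search engine API; returns canned data."""
    n = min(limit, 60)
    return [SearchResult(f"{query} result {i}",
                         f"snippet about {query} #{i}",
                         f"https://example.com/{i}") for i in range(n)]


results = acquire("jaguar", fake_fetch)
```

Passing the fetch function as a parameter keeps the rest of the pipeline independent of which search engine supplies the results.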
2. Preprocessing of Search Results
  • Primary aim is to convert the search results into ‘features’


i. Language identification

ii. Tokenization

iii. Stemming

iv. Feature selection



ii. Tokenization:

The text of each search result is split into a sequence of basic independent units called tokens, each representing a word, number, or symbol.
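
A rough sketch of this splitting step, assuming ASCII text and a simple regular expression (the pattern and the name `tokenize` are illustrative choices, not taken from the slides):

```python
import re
from typing import List

# A token is a run of letters (word), a run of digits (number),
# or any single non-space character that is neither (symbol).
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word, number, and symbol tokens."""
    return TOKEN_RE.findall(text.lower())


tokens = tokenize("Polly had 2 dogs & 1 cat.")
# tokens == ['polly', 'had', '2', 'dogs', '&', '1', 'cat', '.']
```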



iii. Stemming:

Remove the inflectional prefixes and suffixes of each word to reduce the different grammatical forms of the word to a common base form called a ‘stem’.


connected, connecting, interconnection

↓ ↓ ↓

connect
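
The idea can be illustrated with a deliberately tiny affix stripper. This is a toy under stated assumptions: the prefix and suffix lists are hand-picked for the three example words, and `toy_stem` is a hypothetical name. Real stemmers, such as the Porter stemmer, use far more careful rules.

```python
# Hand-picked affixes, chosen only to cover the example words.
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ion", "ed", "s")


def toy_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one known prefix and one known suffix from a word."""
    w = word.lower()
    for p in PREFIXES:
        # Only strip if a reasonably long remainder is left.
        if w.startswith(p) and len(w) - len(p) >= min_len:
            w = w[len(p):]
            break
    for s in SUFFIXES:
        if w.endswith(s) and len(w) - len(s) >= min_len:
            w = w[:-len(s)]
            break
    return w


stems = [toy_stem(w) for w in ("connected", "connecting", "interconnection")]
# stems == ['connect', 'connect', 'connect']
```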



iv. Feature selection:

  • Extract features for each search result present in the input.
  • Features are atomic entities by which we can describe an object and represent its most important characteristic to an algorithm.
  • Features range from single words to tuples of words.
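
One way to produce features that range from single words to word tuples is to emit word n-grams. The sketch below assumes a tiny illustrative stop-word list and a hypothetical function name `extract_features`; the slides do not prescribe a particular method.

```python
from typing import List, Tuple

# Tiny illustrative stop-word list (an assumption, not from the slides).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to"}


def extract_features(tokens: List[str], max_n: int = 2) -> List[Tuple[str, ...]]:
    """Return every word n-gram (n = 1..max_n) over the non-stop-word tokens."""
    words = [t for t in tokens if t.isalnum() and t not in STOP_WORDS]
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            # Note: tuples are built after stop-word removal, so a pair may
            # join two words that were separated by a stop word in the text.
            features.append(tuple(words[i:i + n]))
    return features


feats = extract_features(["web", "clustering", "engines", "of", "the", "future"])
# 4 single-word features plus 3 word pairs, e.g. ('web', 'clustering')
```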
How can we represent a feature/text?
  • Vector Space Model (VSM)
  • Document d is represented in the VSM as a vector $[w_{t_0}, w_{t_1}, \ldots, w_{t_n}]$

where $t_0, t_1, \ldots, t_n$ is a set of words/features

and $w_{t_i}$ is the weight/importance of feature $t_i$


d → “Polly had a dog and the dog had Polly”

VSM Representation
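
As a small worked example, the sentence d can be turned into a vector by using raw term counts as the weights. Using term frequency is an assumption made here for illustration; other weighting schemes, such as TF-IDF, fit the same model. The name `tf_vector` is hypothetical.

```python
from collections import Counter
from typing import List


def tf_vector(tokens: List[str], vocabulary: List[str]) -> List[int]:
    """Weight each vocabulary term by how often it occurs in the document."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]


d = "Polly had a dog and the dog had Polly"
tokens = d.lower().split()
vocabulary = sorted(set(tokens))   # ['a', 'and', 'dog', 'had', 'polly', 'the']
vector = tf_vector(tokens, vocabulary)
# vector == [1, 1, 2, 2, 2, 1]
```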

3. Cluster Construction & Labelling
  • The set of search results, along with their features, is the input to the clustering algorithm, which builds the clusters and labels them.

Three types of algorithms:

1. Data-centric

2. Description-aware

3. Description-centric

Data Centric Clustering Algorithm
  • It starts with an initial clustering of a collection of documents into a set of k clusters (scatter).
  • At query time, the user selects the clusters of interest (gather), and the system re-clusters those documents.
  • The process repeats until a small cluster with relevant documents is found.
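
The scatter and gather steps can be sketched as below. The clustering routine is an assumption: the slides do not say which one is used, so this toy substitutes a small k-means-style loop over term-count vectors with cosine similarity, seeded naively with the first k documents. The names `scatter`, `gather`, and `cosine` are illustrative.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def scatter(docs: List[str], k: int, rounds: int = 5) -> List[List[str]]:
    """Split docs into at most k clusters (toy k-means-style loop)."""
    vecs = [Counter(d.lower().split()) for d in docs]
    centroids = [Counter(v) for v in vecs[:k]]  # naive seeding: first k docs
    groups: List[List[int]] = []
    for _ in range(rounds):
        groups = [[] for _ in centroids]
        for i, v in enumerate(vecs):
            best = max(range(len(centroids)),
                       key=lambda c: cosine(v, centroids[c]))
            groups[best].append(i)
        # New centroid = summed term counts of the members;
        # an empty group keeps its previous centroid.
        centroids = [sum((vecs[i] for i in g), Counter()) if g else centroids[c]
                     for c, g in enumerate(groups)]
    return [[docs[i] for i in g] for g in groups if g]


def gather(clusters: List[List[str]], chosen: List[int]) -> List[str]:
    """Merge the documents of the clusters the user selected."""
    return [doc for c in chosen for doc in clusters[c]]


docs = ["jaguar car speed", "jaguar cat habitat",
        "jaguar car engine", "jaguar cat jungle"]
clusters = scatter(docs, k=2)          # scatter: initial clustering
selected = gather(clusters, [0])       # gather: user picks the first cluster
refined = scatter(selected, k=2)       # re-cluster only the selected documents
```

In an interactive system the last two lines would repeat, with the user's choice replacing the hard-coded `[0]`, until the remaining cluster is small enough to read.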
Difficulties in Data-Centric Algorithms
  • None of these algorithms is incremental in nature. An incremental algorithm would take each document as it arrives from the web, “clean” it, and add it to the available model.
  • Lack of meaningful labels.
4. Visualization of Clustered Results
  • One prominent approach is based on hierarchical folders.
  • Clusty, CREDO, Lingo3G - hierarchical folder visualization approach
  • Grokker - nesting and zooming approach
  • KartOO - graph-based interfaces
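
The hierarchical folder view can be approximated in plain text by printing each cluster label, indented by depth, with the number of results beneath it. This is only a sketch: the `Cluster` class, the `render` function, and the sample data are hypothetical and do not reflect how any of the engines above implement their interfaces.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    """A labeled cluster holding result URLs and optional sub-clusters."""
    label: str
    results: List[str] = field(default_factory=list)
    children: List["Cluster"] = field(default_factory=list)

    def size(self) -> int:
        """Count results in this cluster and in all of its descendants."""
        return len(self.results) + sum(c.size() for c in self.children)


def render(cluster: Cluster, depth: int = 0) -> List[str]:
    """Return one indented 'label (count)' line per cluster, depth first."""
    lines = [f"{'  ' * depth}{cluster.label} ({cluster.size()})"]
    for child in cluster.children:
        lines.extend(render(child, depth + 1))
    return lines


tree = Cluster("jaguar", children=[
    Cluster("Cars", results=["https://example.com/xk8",
                             "https://example.com/dealers"]),
    Cluster("Animals", results=["https://example.org/big-cats"],
            children=[Cluster("Habitat",
                              results=["https://example.org/jungle"])]),
])
print("\n".join(render(tree)))
```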