A Personalized Search Engine Based on Web Snippet Hierarchical Clustering

Paolo Ferragina, Antonio Gulli

Presented by Bin Tan

Clustering Web Search Results
  • Challenges:
    • On short snippets instead of whole docs
    • Clustering must be done on the fly
    • Clusters should be labeled with meaningful text (accurate and intelligible)
    • Clusters need to be distinctive
  • Vivisimo
  • SNAKET
Categorization of Works
  • Flat clustering vs. Hierarchical clustering
  • Label representation: Bag of words vs. contiguous phrase vs. non-contiguous phrase (“gapped sentence”)
Preprocessing
  • Fetch snippets from 16 search engines
  • Enrich snippets with anchor texts from a crawled database of 200M web pages
Identification of Candidate Phrases for Labels
  • Enumerate all pairs of words within a certain proximity window (of size 4) in snippets
  • Score them based on:
    • NLP features: PoS, NE
    • ODP occurrences: term frequency (col freq * inv cat freq?), containing category
  • Discard low-score pairs
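
The pair enumeration step above can be sketched as follows. This is only an illustration under stated assumptions: "window of size 4" is read as "both words fall within 4 consecutive words", and the score is a plain occurrence count standing in for the NLP and ODP features, which the slide does not specify. The names candidate_pairs, keep_high_scoring and min_score are hypothetical.

```python
from collections import Counter

WINDOW = 4  # proximity window size taken from the slide


def candidate_pairs(snippets, window=WINDOW):
    """Count ordered word pairs that fall within `window` consecutive words."""
    counts = Counter()
    for snippet in snippets:
        words = snippet.lower().split()
        for i, first in enumerate(words):
            # Words at positions i+1 .. i+window-1 share a window with words[i].
            for second in words[i + 1 : i + window]:
                counts[(first, second)] += 1
    return counts


def keep_high_scoring(counts, min_score=2):
    """Discard low-score pairs. ASSUMPTION: score = raw count, not PoS/NE/ODP."""
    return {pair for pair, score in counts.items() if score >= min_score}


pairs = candidate_pairs(["jaguar car dealer", "jaguar car parts"])
strong_pairs = keep_high_scoring(pairs)
```

Pairs keep their word order, which matters later when they are merged into longer phrases.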
Identification of Candidate Phrases for Labels (cont.)
  • Word pairs are atomic phrases (how about single words?)
  • Incrementally merge word pairs into longer phrases (preserve ordering and limit size)
  • Score each phrase based on its constituents’ scores
  • Discard low-score phrases
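
A minimal sketch of the incremental merge, assuming details the slide leaves open: two phrases are chained when the last word of one is the first word of a pair, a phrase's score is the mean of its constituent pair scores, and max_len and min_score are hypothetical parameters. SNAKET's actual merge and scoring rules may differ.

```python
def merge_pairs(pair_scores, max_len=4, min_score=0.5):
    """Grow phrases by chaining word pairs; pair_scores maps (w1, w2) -> score.

    ASSUMPTIONS: chaining on a shared word, mean-of-pairs scoring.
    """
    phrases = dict(pair_scores)   # atomic phrases are the word pairs themselves
    frontier = dict(pair_scores)  # phrases that may still be extended
    while frontier:
        grown = {}
        for phrase, score in frontier.items():
            if len(phrase) >= max_len:       # limit phrase size
                continue
            for (w1, w2), pair_score in pair_scores.items():
                # Preserve ordering: only append to the right, and avoid cycles.
                if w1 != phrase[-1] or w2 in phrase:
                    continue
                n_pairs = len(phrase) - 1    # pairs already chained in `phrase`
                new_score = (score * n_pairs + pair_score) / (n_pairs + 1)
                if new_score >= min_score:   # discard low-score phrases
                    grown[phrase + (w2,)] = new_score
        phrases.update(grown)
        frontier = grown
    return phrases


merged = merge_pairs(
    {("web", "search"): 0.9, ("search", "engine"): 0.7, ("engine", "noise"): 0.1},
    max_len=3,
)
```

Here "web search" and "search engine" chain into "web search engine", while "search engine noise" falls below the threshold and is dropped.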
Hierarchical Clustering
  • Group all snippets containing a candidate phrase into an atomic cluster – allow overlapping
  • Primary label: the aforementioned candidate phrase
  • Secondary labels: other candidate phrases occurring in at least 80% of the snippets in the cluster
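
The construction of atomic clusters can be sketched as below. It assumes a simplified containment test (every word of the phrase appears somewhere in the snippet, ignoring order and gaps); the function names and the dictionary layout are hypothetical, and only the 80% threshold comes from the slide.

```python
def contains(snippet, phrase):
    """ASSUMPTION: a snippet contains a phrase if it has all of its words."""
    return set(phrase) <= set(snippet.lower().split())


def atomic_clusters(snippets, phrases, threshold=0.8):
    """Build one (possibly overlapping) cluster per candidate phrase."""
    clusters = {}
    for primary in phrases:
        members = [i for i, s in enumerate(snippets) if contains(s, primary)]
        if not members:
            continue
        # Secondary labels: other phrases found in >= threshold of the members.
        secondary = [
            p for p in phrases
            if p != primary
            and sum(contains(snippets[i], p) for i in members)
            >= threshold * len(members)
        ]
        clusters[primary] = {"members": members, "secondary": secondary}
    return clusters


snippets = [
    "jaguar car dealer london",
    "jaguar car parts online",
    "jaguar cat habitat",
]
phrases = [("jaguar", "car"), ("car", "dealer"), ("jaguar", "cat")]
clusters = atomic_clusters(snippets, phrases)
```

Snippet 0 ends up in both the "jaguar car" and "car dealer" clusters, which is the overlap the slide allows.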
Hierarchical Clustering (cont.)
  • Merge atomic clusters into candidate second-level clusters if they share primary/secondary labels
  • Primary label: the shared label
  • Secondary labels: other labels occurring in at least 80% of the snippets in the cluster
  • Prune second-level clusters that have similar coverage or similar labels
  • Recursively produce third-level clusters
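
One way to picture the merge into second-level clusters is the sketch below. It assumes atomic clusters are stored as a dictionary keyed by primary label, with "members" and "secondary" entries. The slide does not define "similar coverage", so the Jaccard test and its 0.9 cutoff are an assumption made here, and pruning on similar labels is omitted.

```python
from collections import defaultdict


def second_level(atomic):
    """Group atomic clusters that share a primary or secondary label."""
    by_label = defaultdict(list)
    for primary, cluster in atomic.items():
        for label in [primary, *cluster["secondary"]]:
            by_label[label].append(primary)
    parents = {}
    for label, children in by_label.items():
        if len(children) < 2:  # a parent needs at least two children
            continue
        members = sorted({i for c in children for i in atomic[c]["members"]})
        parents[label] = {"children": children, "members": members}
    return parents


def jaccard(a, b):
    """Overlap of two member sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prune_similar(parents, max_jaccard=0.9):
    """ASSUMPTION: drop a parent whose coverage nearly equals a larger one."""
    kept = {}
    ordered = sorted(parents.items(), key=lambda kv: -len(kv[1]["members"]))
    for label, parent in ordered:
        cover = set(parent["members"])
        if all(jaccard(cover, set(k["members"])) < max_jaccard
               for k in kept.values()):
            kept[label] = parent
    return kept


atomic = {
    ("jaguar", "car"): {"members": [0, 1], "secondary": []},
    ("car", "dealer"): {"members": [0], "secondary": [("jaguar", "car")]},
    ("car", "parts"): {"members": [1], "secondary": [("jaguar", "car")]},
}
parents = prune_similar(second_level(atomic))
```

Applying the same two functions to the resulting parents would give third-level clusters, matching the recursive step on the slide.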
How SNAKET can be Used
  • Hierarchical browsing for knowledge extraction
  • Hierarchical browsing for result selection
  • Query reformulation
  • Personalized ranking(?)
Clustering technology: PageRank of the future?
  • Pros:
    • Ambiguous query: narrow down result list
    • Less-ambiguous query: get a bird’s-eye view of different aspects
  • Cons:
    • Clustering is slow and often unnecessary
    • Looking through the clusters takes time
    • Cluster and label quality still leave much to be desired