A Personalized Search Engine Based on Web Snippet Hierarchical Clustering

Paolo Ferragina, Antonio Gulli

Presented by Bin Tan

Clustering Web Search Results
  • Challenges:
    • Must work on short snippets instead of whole documents
    • Clustering must be done on the fly
    • Clusters should be labeled with meaningful text (accurate and intelligible)
    • Clusters need to be distinctive
  • Vivisimo
  • SNAKET
Categorization of Works
  • Flat clustering vs. Hierarchical clustering
  • Label representation: Bag of words vs. contiguous phrase vs. non-contiguous phrase (“gapped sentence”)
Preprocessing
  • Fetch snippets from 16 search engines
  • Enrich snippets with anchor texts from a crawled database of 200M web pages
Identification of Candidate Phrases for Labels
  • Enumerate all pairs of words within a certain proximity window (of size 4) in snippets
  • Score them based on:
    • NLP features: part of speech (PoS), named entities (NE)
    • ODP (Open Directory Project) occurrences: term frequency (collection frequency × inverse category frequency?), containing category
  • Discard low-score pairs
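
The enumeration step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `word_pairs` and `candidate_pairs` are hypothetical, the window is assumed to span 4 consecutive tokens, and the score is simply the number of snippets containing a pair, standing in for the NLP and ODP features listed on the slide.

```python
import re
from collections import Counter


def word_pairs(snippet, window=4):
    """Yield ordered pairs (w1, w2) where w2 follows w1 inside a
    window of `window` consecutive tokens (assumed interpretation)."""
    tokens = re.findall(r"[a-z0-9]+", snippet.lower())
    for i, first in enumerate(tokens):
        # tokens[i + 1 : i + window] are the words sharing a window with `first`
        for second in tokens[i + 1 : i + window]:
            yield (first, second)


def candidate_pairs(snippets, window=4, min_score=2):
    """Score each pair and discard low-score pairs.

    Placeholder score (an assumption): the number of snippets in which
    the pair occurs. The slides score pairs with PoS/NE and ODP features.
    """
    scores = Counter()
    for snippet in snippets:
        # set() so a pair counts at most once per snippet
        scores.update(set(word_pairs(snippet, window)))
    return {pair: score for pair, score in scores.items() if score >= min_score}
```

Calling `candidate_pairs` on a list of snippet strings returns a dictionary mapping each surviving pair to its score; raising `min_score` prunes more aggressively.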
Identification of Candidate Phrases for Labels (cont.)
  • Word pairs are atomic phrases (how about single words?)
  • Incrementally merge word pairs into longer phrases (preserve ordering and limit size)
  • Score each phrase based on its constituents’ scores
  • Discard low-score phrases
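
One possible reading of the merging step is sketched below. Everything here is an assumption made for illustration: the name `merge_phrases` is hypothetical, a phrase is extended only by a pair that starts with the phrase's last word (which preserves word order), and a phrase's score is the mean of its constituent pair scores. The slides do not specify the actual merge rule or scoring formula.

```python
def merge_phrases(pair_scores, max_len=4, min_score=1.0):
    """Grow longer phrases by chaining word pairs, then drop weak ones.

    pair_scores: dict mapping (w1, w2) tuples to numeric scores.
    max_len:     maximum number of words per phrase (size limit).
    min_score:   phrases scoring below this are discarded.
    """
    # Each phrase remembers the scores of the pairs it was built from.
    phrases = {pair: [score] for pair, score in pair_scores.items()}
    frontier = dict(phrases)
    while frontier:
        grown = {}
        for phrase, parts in frontier.items():
            if len(phrase) >= max_len:
                continue  # size limit reached
            for (first, second), score in pair_scores.items():
                # Chain only when the pair continues the phrase in order;
                # skipping repeated words also prevents cycles.
                if phrase[-1] == first and second not in phrase:
                    longer = phrase + (second,)
                    if longer not in phrases and longer not in grown:
                        grown[longer] = parts + [score]
        phrases.update(grown)
        frontier = grown
    # Assumed scoring rule: mean of the constituent pair scores.
    scored = {p: sum(parts) / len(parts) for p, parts in phrases.items()}
    return {p: s for p, s in scored.items() if s >= min_score}
```

Averaging keeps a long phrase from outscoring its parts merely by being long; a different aggregation (minimum, sum) would be an equally plausible choice.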
Hierarchical Clustering
  • Group all snippets containing a candidate phrase into an atomic cluster – allow overlapping
  • Primary label: the aforementioned candidate phrase
  • Secondary labels: other candidate phrases occurring in 80% of the snippets in the cluster
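
The construction of atomic clusters can be sketched as below. This is an illustrative sketch under stated assumptions: the names `contains` and `atomic_clusters` are hypothetical, a snippet "contains" a phrase when the phrase's words appear in order (gaps allowed), and "occurring in 80% of the snippets" is read as "in at least 80%".

```python
def contains(tokens, phrase):
    """True if the words of `phrase` occur in `tokens` in order, gaps allowed."""
    it = iter(tokens)
    # `word in it` advances the iterator, so order is enforced.
    return all(word in it for word in phrase)


def atomic_clusters(snippets, phrases, threshold=0.8):
    """Build one (possibly overlapping) cluster per candidate phrase."""
    tokenized = [s.lower().split() for s in snippets]
    # Snippet indices containing each phrase.
    members = {
        p: {i for i, toks in enumerate(tokenized) if contains(toks, p)}
        for p in phrases
    }
    clusters = []
    for primary, ids in members.items():
        if not ids:
            continue  # phrase matches no snippet
        # Secondary labels: other phrases covering >= threshold of this cluster.
        secondary = [
            q for q, other in members.items()
            if q != primary and len(ids & other) >= threshold * len(ids)
        ]
        clusters.append({"primary": primary, "secondary": secondary, "snippets": ids})
    return clusters
```

Because a snippet is added to every cluster whose phrase it contains, the same snippet index can appear in several clusters, which matches the "allow overlapping" note on the slide.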
Hierarchical Clustering (cont.)
  • Merge atomic clusters into candidate second-level clusters if they share primary/secondary labels
  • Primary label: the shared label
  • Secondary labels: other labels occurring in 80% of the snippets in the cluster
  • Prune second-level clusters that have similar coverage or similar labels
  • Recursively produce third-level clusters
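
A rough sketch of one merge level follows. It assumes atomic clusters are dictionaries with `primary`, `secondary`, and `snippets` keys; that layout, the names `jaccard` and `second_level`, and the use of Jaccard overlap with a 0.9 cutoff as the "similar coverage" test are all illustrative assumptions rather than details taken from the slides. Pruning by label similarity and the computation of secondary labels for the parent are omitted for brevity.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def second_level(clusters, max_overlap=0.9):
    """Merge atomic clusters that share a primary or secondary label."""
    by_label = defaultdict(list)
    for cluster in clusters:
        for label in [cluster["primary"], *cluster["secondary"]]:
            by_label[label].append(cluster)

    parents = []
    for label, children in by_label.items():
        if len(children) < 2:
            continue  # a label carried by one cluster merges nothing
        covered = set().union(*(c["snippets"] for c in children))
        # Prune candidates whose coverage nearly duplicates a kept parent.
        if any(jaccard(covered, p["snippets"]) >= max_overlap for p in parents):
            continue
        parents.append({
            "primary": label,  # the shared label names the parent
            "children": [c["primary"] for c in children],
            "snippets": covered,
        })
    return parents
```

Applying the same grouping to the resulting parents, with suitably defined secondary labels, would be one way to produce the third level mentioned above.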
How SNAKET Can Be Used
  • Hierarchical browsing for knowledge extraction
  • Hierarchical browsing for result selection
  • Query reformulation
  • Personalized ranking(?)
Clustering technology: PageRank of the future?
  • Pros:
    • Ambiguous query: narrow down result list
    • Less-ambiguous query: get a bird’s eye view of different aspects
  • Cons:
    • Clustering is slow and often unnecessary
    • Takes time to look at the clusters
    • Cluster and label quality still leave much to be desired