Help People Find What They Don’t Know

Hao Ma, 16-10-2007, CSE, CUHK


  1. Help People Find What They Don’t Know Hao Ma 16-10-2007 CSE, CUHK

  2. Questions?
     • How many times do you search every day?
     • Have you ever clicked through to the 2nd page of the search results, or the 3rd, 4th, …, 10th page?
     • Do you have trouble choosing the right words to express your search queries?
     • ……

  3. Goal of a Search Engine (diagram labels: Query, Analysis, Results)
     • Retrieve the docs that are “relevant” for the user query
       • Doc: Word or PDF file, web page, email, blog, book, ...
       • Query: “bag of words” paradigm
     • Relevant?
       • Subjective and time-varying concept
     • Users are lazy
     • Selective queries are difficult to compose
     • Web pages are heterogeneous, numerous and frequently changing
     • Web search is a difficult, cyclic process

  4. User Needs
     • Informational: want to learn about something (~40%)
     • Navigational: want to go to that page (~25%)
     • Transactional: want to do something (~35%)
       • Access a service
       • Downloads
       • Shop
     • Gray areas
       • Find a good hub
       • Exploratory search: “see what’s there”
     • Example queries shown on the slide: Haifa weather; Mars surface images; Canon DC; SVM; Cuhk; Car rental in Finland

  5. Queries
     • Ill-defined queries
       • Short (2001: 2.54 terms on average; 80% have fewer than 3 terms)
       • Imprecise terms
       • 78% are not modified
     • Wide variance in
       • Needs
       • Expectations
       • Knowledge
       • Patience: 85% look at 1 page

  6. Different Coverage: Google vs Yahoo
     • Share 3.8 results in the top 10 on average
     • Share 23% of results in the top 100 on average
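The overlap figures on this slide can be read as a top-k set intersection averaged over queries. The sketch below is only an illustration of that reading: the function names and the sample URLs are made up, and the slide does not say how the original measurement was carried out.

```python
def top_k_overlap(results_a, results_b, k):
    """Number of results shared by the top-k of two ranked lists."""
    return len(set(results_a[:k]) & set(results_b[:k]))


def average_overlap(pairs, k):
    """Average top-k overlap over (list_a, list_b) pairs, one pair per query."""
    pairs = list(pairs)
    return sum(top_k_overlap(a, b, k) for a, b in pairs) / len(pairs)


# Hypothetical ranked result lists for a single query.
engine_a = ["u1", "u2", "u3", "u4", "u5"]
engine_b = ["u3", "u9", "u1", "u8", "u7"]
shared = top_k_overlap(engine_a, engine_b, k=5)
```

Dividing the shared count by k gives the percentage form used for the top-100 figure.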

  7. In summary
     Current search engines face many difficulties:
     • Link-based ranking may be inadequate: bag-of-words paradigm, ambiguous queries, polarized queries, …
     • The coverage of a single search engine is poor; meta-search engines cover more, but it is “difficult” to fuse multiple sources
     • User needs are subjective and time-varying
     • Users are lazy and look at few results

  8. Web Search Results Clustering & Query Optimization
     • How important? In recent years, at least 8 papers in total have been published at SIGIR and WWW each year!
     • Two “complementary” approaches

  9. Web Search Results Clustering

  10. An interesting approach (figure label: Web-snippet)
      The user no longer has to browse through tedious pages of results; instead, (s)he may navigate a hierarchy of folders whose labels give a glimpse of all the themes present in the returned results.

  11. Web-Snippet Hierarchical Clustering
      • The folder hierarchy must be formed
        • “on-the-fly from the snippets”: because it must adapt to the themes of the results without any costly remote access to the original web pages or documents
        • “and its folders may overlap”: because a snippet may deal with multiple themes
        • Canonical clustering, by contrast, is persistent and generated only once
      • The folder labels must be formed
        • “on-the-fly from the snippets”: because labels must capture the potentially unbounded themes of the results without any costly remote access to the original web pages or documents
        • “and be intelligible sentences”: because they must facilitate the user’s post-navigation
      • It resembles “document organization into topical context”, but snippets are poorly composed, no “structural information” is available for them, and “static classification” into predefined categories would not be appropriate.
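To make the “folders may overlap” requirement concrete, here is a minimal sketch that places a snippet in every folder whose label it mentions. The function name, the sample snippets and the plain substring matching are all assumptions for illustration; the slides do not describe how any particular system performs this assignment.

```python
from collections import defaultdict


def build_folders(snippets, labels):
    """Map each label to the indices of the snippets that mention it.

    A snippet that mentions several labels lands in several folders,
    so the resulting folders may overlap.
    """
    folders = defaultdict(list)
    for idx, snippet in enumerate(snippets):
        text = snippet.lower()
        for label in labels:
            if label.lower() in text:  # naive substring match (assumption)
                folders[label].append(idx)
    return dict(folders)


# Hypothetical snippets returned for an ambiguous query.
snippets = [
    "Jaguar cars for sale and jaguar car reviews",
    "The jaguar is a big cat native to the Americas",
    "Big cat sanctuary sells vintage cars",
]
folders = build_folders(snippets, ["cars", "big cat"])
```

Because the folders are computed from the returned snippets alone, nothing has to be fetched from the original pages.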

  12. The Literature
      • We may identify four main approaches (i.e., a taxonomy)
        • Single words and flat clustering: Scatter/Gather, WebCat, Retriever
        • Sentences and flat clustering: Grouper, Carrot2, Lingo, Microsoft China
        • Single words and hierarchical clustering: FIHC, Credo
        • Sentences and hierarchical clustering: Lexical Affinities clustering, Hierarchical Grouper, SHOC, CIIRarchies, Highlight, IBM India
      • Conversely, we have many commercial proposals:
        • Northern Light (stopped 2002)
        • Copernic, Mooter, Kartoo, Groxis, Clusty, Dogpile, iBoogie, …
        • Vivisimo is surely the best!

  13. To be presented at WWW 2005

  14. SnakeT’s main features
      • 2 knowledge bases for ranking/choosing the labels
        • DMOZ is used as a feature selection and sentence ranker index
        • Text anchors are used for snippet enrichment
      • Labels are gapped sentences of variable length
        • Grouper’s extension, to match sentences which are “almost the same”
        • Lexical Affinities clustering extension to k-long LAs

  15. SnakeT’s main features
      • Hierarchy formation deploys the folder labels and coverage
        • “Primary” and “secondary” labels for finer/coarser clustering
        • Syntactic and covering pruning rules for simplification and compaction
      • 18 engines (Web, news and books) are queried on-the-fly
        • Google, Yahoo, Teoma, A9 Amazon, Google-news, etc.
        • They are used as black boxes

  16. Generation of the Candidate Labels
      • Extract all word pairs occurring in the snippets within some proximity window
      • Rank them by exploiting: KB + frequency within snippets
      • Discard the pairs whose rank is below a threshold
      • Repeatedly merge the remaining pairs, taking into account their original position, their order, and the sentence boundaries within the snippets
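The four steps above can be sketched roughly as follows. This is not SnakeT's implementation: the scoring formula (snippet frequency plus an optional knowledge-base bonus), the `kb_score` dictionary, the default window and threshold, and the single-pass merge are all simplifying assumptions, and stop words are not filtered.

```python
from collections import Counter
import re


def word_pairs(snippet, window):
    """Ordered pairs (w1, w2) where w2 follows w1 within `window` words.

    Pairs never cross a sentence boundary.
    """
    pairs = set()
    for sentence in re.split(r"[.!?]", snippet.lower()):
        words = re.findall(r"[a-z0-9]+", sentence)
        for i, w1 in enumerate(words):
            for w2 in words[i + 1 : i + 1 + window]:
                pairs.add((w1, w2))
    return pairs


def candidate_pairs(snippets, window=2, threshold=2.0, kb_score=None):
    """Score pairs by snippet frequency plus a KB bonus; drop low scores."""
    kb_score = kb_score or {}  # hypothetical knowledge-base weights
    freq = Counter()
    for snippet in snippets:
        freq.update(word_pairs(snippet, window))  # once per snippet
    scored = {p: n + kb_score.get(p, 0.0) for p, n in freq.items()}
    return {p: s for p, s in scored.items() if s >= threshold}


def merge_pairs(pairs):
    """Chain (a, b) and (b, c) into (a, b, c); one pass only (simplified)."""
    merged = set()
    for a, b in pairs:
        for c, d in pairs:
            if b == c and a != d:
                merged.add((a, b, d))
    return merged


# Hypothetical snippets for illustration.
snippets = ["web search engine results", "a web search tool", "search engine ranking"]
kept = candidate_pairs(snippets, window=1, threshold=2.0)
labels = merge_pairs(kept)
```

With these sample snippets only the pairs that occur in two snippets survive the threshold, and they chain into one longer candidate label. A fuller version would repeat the merge and use the word positions recorded in each snippet, as the slide describes.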

  17. Query Optimization

  18. Disadvantages

  19. Can we do BETTER???

  20. Q & A The End