Query by Document

Yin Yang (Hong Kong University of Science and Technology), Nilesh Bansal (University of Toronto), Wisam Dakka (Google), Panagiotis Ipeirotis (New York University), Nick Koudas (University of Toronto), Dimitris Papadias (Hong Kong University of Science and Technology)

Presentation Transcript


  1. Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University) Nick Koudas (University of Toronto) Dimitris Papadias (Hong Kong University of Science and Technology) Query by Document

  2. Motivation • Explosion of Web 2.0 content • blogs, micro-blogs, social networking • Need for “cross reference” on the web • after we read a news article, we wonder if there are any blogs discussing it • and vice versa

  3. Query by Document • A service of the BlogScope system • a real blog search engine serving 20K users/day • Input: a text document • Output: relevant blog posts • Methodology • extract key phrases from the input document • use these phrases to query BlogScope

  4. BlogScope

  5. QBD User Interface

  6. Main Contributions • Novel Query-by-Document (QBD) model • Practical phrase extractor • Phrase set enhancement with Wikipedia knowledge (QBD-W) • Evaluation of all proposed methods using Amazon Mechanical Turk • Human annotators take the tasks seriously because they are paid for them

  7. Related Work – Relevance Feedback • Example of RF • Distinctions between RF and QBD • RF involves interaction, while QBD does not • RF is most effective for improving recall, whereas QBD aims at both high precision and recall • RF starts with a keyword query; QBD directly takes a document as input

  8. Related Work – Phrase Extraction • Two classes of methods • Very slow but accurate, from the machine learning community • Practical, but not as accurate as the above (our method falls in this category) • Phrase extraction in QBD has distinct goals • Document retrieval accuracy is more important than the accuracy of the phrase set itself • A better phrase extractor is not necessarily more suitable for QBD, as shown in our experiments

  9. Other Related Work • Query expansion • Used when the user’s keyword query does not properly express her information need • PageRank, TrustRank, … • QBD-W follows this framework • Wikipedia mining

  10. QBD: Phrase Extraction • Recall that Query-by-Document • Extracts key phrases from the input document • And then queries them against a search engine • Idea: given a query document D • Identify all phrases in D • Score each individual phrase • Obtain the set of phrases with the highest scores, and refine it

  11. Step 1: Extracting All Phrases • Process the document with a Part-of-Speech tagger • Nouns, adjectives, verbs, … • We compiled a list of POS patterns • Indexed by a POS trie forest • Each term sequence matching such a POS pattern is considered a candidate phrase
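
A minimal sketch of this step, assuming NLTK's off-the-shelf tokenizer and POS tagger; the pattern list below is illustrative only, not the list compiled by the authors, and a real implementation would index the patterns in a POS trie forest rather than scanning them per window:

```python
# Sketch: extract candidate phrases whose POS tag sequence matches a pattern.
# Requires `pip install nltk` plus the tokenizer and POS-tagger data packages.
import nltk

# Illustrative patterns only; the authors compiled their own list.
POS_PATTERNS = {
    ("JJ", "NN"),        # adjective + noun, e.g. "federal reserve"
    ("NN", "NN"),        # noun + noun, e.g. "stock market"
    ("NNP", "NNP"),      # proper noun + proper noun, e.g. "Dow Jones"
    ("JJ", "NN", "NN"),  # adjective + noun + noun
}

def extract_candidate_phrases(text):
    """Return every term sequence whose tag sequence matches a POS pattern."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))   # [(term, tag), ...]
    max_len = max(len(p) for p in POS_PATTERNS)
    phrases = []
    for i in range(len(tagged)):
        for n in range(2, max_len + 1):
            window = tagged[i:i + n]
            if len(window) < n:
                break
            if tuple(tag for _, tag in window) in POS_PATTERNS:
                phrases.append(" ".join(term for term, _ in window))
    return phrases
```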

  12. Example POS Patterns

  13. POS Trie Forest

  14. Step 2: Scoring Phrases • Two scoring functions • ft, based on TF/IDF • fl, based on the concept of mutual information

  15. TF/IDF Scoring • Extract the most characteristic phrases from the input document D • But may obtain term sequences that are not really phrases • Example: “moment Dow Jones” in “at this moment Dow Jones”

  16. Mutual Information Scoring • MI: measures how much more likely a pair of terms is to occur together than expected from their individual probabilities • Eliminates non-phrases
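
As a rough illustration of the two scoring functions: the exact formulas are given in the paper; the versions below are plain TF/IDF and pointwise mutual information, which capture the idea but may differ from the authors' definitions in detail:

```python
import math

def tfidf_score(phrase_tf, doc_freq, num_docs):
    """ft (sketch): frequency of the phrase in D times inverse document frequency."""
    return phrase_tf * math.log(num_docs / (1 + doc_freq))

def mutual_information_score(p_xy, p_x, p_y):
    """fl (sketch): pointwise mutual information of adjacent terms x and y.
    Large when the terms co-occur far more often than chance, i.e. when the
    sequence behaves like a real phrase; small for accidental sequences such
    as "moment Dow" above."""
    return math.log(p_xy / (p_x * p_y))
```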

  17. Step 3: Obtaining Key Phrase Set • Take the top-k phrases with the highest scores • Eliminate duplicates • Two different phrases may carry similar meanings • Remove phrases that are • Subsumed by another phrase with a higher score • Different from a better phrase only in the last term • And other rules …
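
A sketch of the refinement step under two of the rules listed above (subsumption and last-term difference); the full rule set is described in the paper:

```python
def refine_phrases(scored_phrases, k):
    """scored_phrases: list of (phrase, score) pairs.
    Keep the top-k, then drop a phrase if it is subsumed by a higher-scoring
    phrase or differs from a higher-scoring phrase only in its last term."""
    top = sorted(scored_phrases, key=lambda p: p[1], reverse=True)[:k]
    kept = []
    for phrase, score in top:
        redundant = False
        for better, _ in kept:  # phrases already kept have higher scores
            if phrase in better:                            # subsumed
                redundant = True
                break
            if phrase.split()[:-1] == better.split()[:-1]:  # same prefix, last term differs
                redundant = True
                break
        if not redundant:
            kept.append((phrase, score))
    return kept
```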

  18. QBD-W • Motivation: • The user may also be interested in web documents related to the given one but not containing the same key phrases • Example: after reading an article on Michelle Obama, the user may also want to learn about her husband and past American presidents • Main idea: • Obtain an initial phrase set with QBD • Use Wikipedia knowledge to identify phrases that are related to the initial phrases • Our method follows the spreading-activation framework

  19. Wikipedia Graph

  20. QBD-W Algorithm • Given an initial phrase set • Locate nodes corresponding to these phrases on the Wiki Graph • Assign weights to these nodes • Iteratively spread node weights to neighbors • Assume the random surfer model • With a certain probability, return to one of the initial nodes

  21. Initial RelevanceRank Values • S is the initial phrase set • Initial weights are normalized • s(cv) is the score of cv, assigned by QBD
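
A plausible reading of these bullets (the exact formula appears on the original slide image): for a node v corresponding to an initial phrase cv ∈ S, the initial weight is s(cv) / Σu∈S s(cu), and 0 for every other node, so that the initial weights sum to one.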

  22. Transition Matrix

  23. RelevanceRank Iteration • With probability αv’ , proceed to a neighbor; • Otherwise, return to one of the initial nodes • αv’ is a function of the node v’

  24. Calculating Probability αv • Unlike in other algorithms (e.g., TrustRank), αv is not a constant • αv gets smaller, and eventually drops to zero, for nodes increasingly farther away from the initial ones • This reduces the CPU overhead of the RelevanceRank computation, since only a subset of the nodes is considered • Important, as RelevanceRank is calculated online
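
Putting slides 20-24 together, a minimal sketch of the RelevanceRank computation, assuming an adjacency-list view of the Wikipedia graph and a hypothetical αv that decays linearly with graph distance from the initial nodes; the actual decay function, damping constant, and convergence test are those defined in the paper:

```python
from collections import deque

def relevance_rank(graph, initial_weights, max_dist=3, iterations=30):
    """graph: {node: [neighbor, ...]}; initial_weights: {node: normalized QBD score}.
    Spreads weight from the initial nodes over nearby Wikipedia nodes."""
    # Distance of each node from the nearest initial node (BFS), capped at
    # max_dist so only a small subset of the graph is ever touched.
    dist = {v: 0 for v in initial_weights}
    queue = deque(initial_weights)
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue
        for w in graph.get(u, []):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)

    def alpha(v):
        # Hypothetical decay: shrinks with distance, zero at max_dist.
        return 0.85 * (1 - dist[v] / max_dist)

    rank = dict(initial_weights)
    for _ in range(iterations):
        new_rank = {v: 0.0 for v in dist}
        for v, r in rank.items():
            a = alpha(v)
            neighbors = [w for w in graph.get(v, []) if w in dist]
            if a > 0 and neighbors:
                share = a * r / len(neighbors)   # follow a random out-link
                for w in neighbors:
                    new_rank[w] += share
                back = (1 - a) * r               # remainder jumps back
            else:
                back = r                         # dangling or zero-alpha node
            for s, w0 in initial_weights.items():
                new_rank[s] += back * w0         # return to an initial node
        rank = new_rank
    return rank
```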

  25. Example RelevanceRank Scores

  26. Experiments • Methodology • Employ human annotators on Amazon Mechanical Turk • Dataset • A random sample of news articles from the New York Times, the Economist, Reuters, and the Financial Times during Aug-Sep 2007 • Competitors for phrase extraction • QBD-TFIDF (tf-idf scoring) • QBD-MI (mutual information scoring) • QBD-YAHOO (Yahoo! phrase extractor)

  27. Experimental Results • Quality of Phrase Retrieval • Quality of Document Retrieval • Efficiency • The total running time of QBD is negligible

  28. Quality of Phrase Retrieval

  29. Quality of Phrase Retrieval (Cont.)

  30. Quality of Retrieved Documents

  31. Quality of Retrieved Documents (QBD-W)

  32. QBD-W Running Time

  33. Conclusion • We propose • the query-by-document model • two effective phrase extraction algorithms • phrase set enhancement using the Wikipedia graph (QBD-W) • Future work • more sophisticated phrase extraction (e.g., with additional background knowledge) • blog matching using key phrases

  34. Thank you! Questions?
