
Comparing and Ranking Documents




  1. Comparing and Ranking Documents
     • Once our search engine has retrieved a set of documents, we may want to:
         • Rank them by relevance: "Which are the best fit to my query?" This involves determining what the query is about and how well the document answers it.
         • Compare them: "Show me more like this." This involves determining what the document is about.

  2. Determining Relevance by Keyword
     • The typical web query consists entirely of keywords.
     • Retrieval can be binary: a keyword is either present or absent.
     • A more sophisticated approach looks at the degree of relatedness: how much does this document reflect what the query is about?
     • Simple strategies:
         • How many times does the word occur in the document?
         • How close is it to the head of the document?
         • If there are multiple keywords, how close together are they?
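The first two strategies above can be sketched in a few lines of Python. This is only an illustration: the tokenizer and the function names `tokenize`, `keyword_count`, and `first_position` are assumptions made for this example, not part of the original slides.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def keyword_count(tokens, keyword):
    """How many times does the keyword occur in the document?"""
    return tokens.count(keyword)

def first_position(tokens, keyword):
    """How close to the head of the document? Returns a 0-based token index, or None if absent."""
    return tokens.index(keyword) if keyword in tokens else None

doc = "Ranking documents by relevance: a ranking function scores each document."
tokens = tokenize(doc)
print(keyword_count(tokens, "ranking"))     # 2
print(first_position(tokens, "relevance"))  # 3
```

A binary retrieval test is then simply `keyword in tokens`; the count and position give a degree of relatedness rather than a yes/no answer.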

  3. Keywords for Relevance Ranking
     • Count: repetition is an indication of emphasis
         • Very fast (the count is usually already in the index)
         • Reasonable heuristic
         • Unduly influenced by document length
         • Can be "stuffed" by web designers
     • Position: lead paragraphs summarize content
         • Requires more computation
         • Also a reasonable heuristic
         • Less influenced by document length
         • Harder to "stuff"; only a few keywords can appear near the beginning
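To see why a raw count is unduly influenced by document length, compare it with a count divided by the number of tokens. The sketch below is a hedged illustration; dividing by length is one simple normalization chosen for this example, not something the slides prescribe, and the function names are hypothetical.

```python
def raw_count(tokens, keyword):
    """Raw number of occurrences of the keyword."""
    return tokens.count(keyword)

def relative_frequency(tokens, keyword):
    """Occurrences divided by document length (an assumed, simple normalization)."""
    return tokens.count(keyword) / len(tokens) if tokens else 0.0

short_doc = "search engines rank documents".split()
long_doc = short_doc * 10  # the same text repeated ten times

print(raw_count(short_doc, "rank"), raw_count(long_doc, "rank"))                    # 1 10
print(relative_frequency(short_doc, "rank"), relative_frequency(long_doc, "rank"))  # 0.25 0.25
```

The longer document scores ten times higher on raw count even though it says nothing new, while the normalized score is unchanged. Note that normalization alone does not prevent stuffing, since a stuffed page raises the keyword's share of the text as well.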

  4. Keywords for Relevance Ranking (cont.)
     • Proximity for multiple keywords
         • Requires even more computation
         • Relevant only if the query has multiple keywords
         • Effectiveness of the heuristic varies with the information need; typically it is either excellent or not very helpful at all
         • Very hard to "stuff"
     • All keyword methods
         • Are computationally simple and adequately fast
         • Are effective heuristics
         • Typically perform as well as in-depth natural language methods for standard search
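One simple way to measure proximity is the smallest gap, in tokens, between any occurrence of one keyword and any occurrence of the other. The sketch below assumes that definition; `min_distance` is a hypothetical helper, and real engines may use other proximity measures.

```python
def min_distance(tokens, kw_a, kw_b):
    """Smallest gap, in tokens, between any occurrence of kw_a and any occurrence of kw_b.
    Returns None if either keyword is absent."""
    pos_a = [i for i, t in enumerate(tokens) if t == kw_a]
    pos_b = [i for i, t in enumerate(tokens) if t == kw_b]
    if not pos_a or not pos_b:
        return None
    # Compare every pair of positions; fine for a sketch, quadratic in the worst case.
    return min(abs(i - j) for i in pos_a for j in pos_b)

tokens = "the vector space model places each document in a term space".split()
print(min_distance(tokens, "vector", "space"))    # 1
print(min_distance(tokens, "document", "space"))  # 4
```

The pairwise comparison shows why proximity costs more than counting: it needs the positions of every occurrence, not just a total.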

  5. Comparing Documents
     • "Find me more like this one" really means that we are using the document as a query.
     • This requires that we have some conception of what a document is about overall.
     • Depends on the context of the query. We need to:
         • Characterize the entire content of this document
         • Discriminate between this document and others in the corpus

  6. Comparing Documents (cont.)
     • Two very general approaches:
         • Statistical
         • Semantic
     • We will discuss semantic approaches more in text mining.
     • The statistical approach still focuses on keywords:
         • To what extent does each term characterize this document?
         • To what extent does each term discriminate this document from other documents?

  7. Characterizing a Document: Term Frequency
     • A document can be treated as a sequence of words.
     • Each word characterizes that document to some extent.
     • Once stop words have been eliminated, the most frequent words tend to be what the document is about.
     • Therefore f_kd (the number of occurrences of word k in document d) will be an important measure.
     • Also called the term frequency
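Term frequency can be computed with a counter once stop words are dropped. In this minimal sketch the stop-word list is a tiny illustrative assumption, not a standard list, and `term_frequencies` is a name chosen for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}

def term_frequencies(text):
    """Return f_kd for every non-stop word k in the document text d."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

tf = term_frequencies("The ranking of documents is the ranking of answers to a query.")
print(tf.most_common(1))  # [('ranking', 2)]
```

Without the stop-word filter, "the" and "of" would tie with "ranking" for the top spot, which is why the filtering step comes first.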

  8. Characterizing a Document: Document Frequency
     • What makes this document distinct from others in the corpus?
     • The terms which discriminate best are not those which occur frequently across the whole corpus!
     • Therefore D_k (the number of documents in which word k occurs) will also be an important measure.
     • Also called the document frequency
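Document frequency counts each term at most once per document, however often it repeats inside that document. A minimal sketch, assuming the corpus is already tokenized (`document_frequencies` is a hypothetical helper name):

```python
from collections import Counter

def document_frequencies(corpus):
    """corpus: a list of token lists. Returns D_k, the number of documents containing term k."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # set() ensures each term counts once per document
    return df

corpus = [
    "search engine ranking".split(),
    "search engine index".split(),
    "vector space ranking ranking".split(),
]
df = document_frequencies(corpus)
print(df["search"], df["ranking"], df["index"])  # 2 2 1
```

Here "ranking" appears twice in the third document but still contributes only one to its document frequency.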

  9. TF*IDF
     • This can all be summarized as: words are the best discriminators when they
         • occur often in this document (high term frequency)
         • don’t occur in a lot of documents (low document frequency)
     • One very common measure of the importance of a word to a document is TF*IDF: term frequency * inverse document frequency.
     • There are multiple formulas for actually computing this; the book gives Robertson and Jones. The underlying concept is the same in all of them.
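As a concrete instance, one widely used variant weights term k in document d as f_kd * log(N / D_k), where N is the number of documents in the corpus. The sketch below assumes that variant; it is not necessarily the formula the book gives, and `tf_idf` is a name chosen for this example.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: a list of token lists. Returns one {term: weight} dict per document,
    using weight = f_kd * log(N / D_k) (one common variant, assumed here)."""
    n_docs = len(corpus)
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    weights = []
    for tokens in corpus:
        tf = Counter(tokens)
        weights.append({t: f * math.log(n_docs / df[t]) for t, f in tf.items()})
    return weights

corpus = [
    "ranking ranking search".split(),
    "search index".split(),
    "search vector".split(),
]
w = tf_idf(corpus)
print(w[0]["search"])   # 0.0, because "search" occurs in every document
print(w[0]["ranking"])  # 2 * log(3), frequent here and rare elsewhere
```

A term that appears in every document gets weight zero under this variant, which matches the intuition that it cannot discriminate between documents.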

  10. Describing an Entire Document
     • So what is a document about?
     • TF*IDF: we can simply list keywords in order of their TF*IDF values.
     • The document is about all of them to some degree: it is at some point in some vector space of meaning.
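Listing keywords in order of their TF*IDF values is a plain sort. In this sketch the weights are made-up illustrative numbers, not computed from any real corpus, and `top_keywords` is a hypothetical helper.

```python
def top_keywords(weights, k=3):
    """Return the k terms with the highest weight, highest first; ties are broken alphabetically."""
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]

# Made-up illustrative TF*IDF weights for a single document.
weights = {"ranking": 2.2, "search": 0.0, "vector": 1.1, "index": 1.1}
print(top_keywords(weights))  # ['ranking', 'index', 'vector']
```

The full dictionary of weights, rather than just the top few terms, is what places the document at a point in the vector space.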

  11. Vector Space
     • Any corpus has a defined set of terms (the index).
     • These terms define a knowledge space.
     • Every document is somewhere in that knowledge space: it is or is not about each of those terms.
     • Consider each term as an axis (dimension). Then:
         • We have an n-dimensional vector space
         • Where n is the number of terms (very large!)
         • Each document is a point in that vector space
     • The document's position in this vector space can be treated as what the document is about.
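A toy version of this construction: collect the vocabulary, fix an order for it, and map each document to a list of per-term values. The sketch uses raw counts as coordinates for simplicity (TF*IDF weights could be used instead); the function names are assumptions for this example.

```python
def build_vocabulary(corpus):
    """The sorted set of all terms in the corpus; each term is one dimension."""
    return sorted({t for tokens in corpus for t in tokens})

def to_vector(tokens, vocabulary):
    """Coordinates of one document: the count of each vocabulary term, in vocabulary order."""
    return [tokens.count(term) for term in vocabulary]

corpus = ["dog bites man".split(), "man walks dog dog".split()]
vocab = build_vocabulary(corpus)
vectors = [to_vector(doc, vocab) for doc in corpus]
print(vocab)    # ['bites', 'dog', 'man', 'walks']
print(vectors)  # [[1, 1, 1, 0], [0, 2, 1, 1]]
```

With a realistic vocabulary most coordinates are zero, so practical systems store these vectors sparsely rather than as full lists.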

  12. Similarity Between Documents
     • How similar are two documents?
     • Measures of association
         • How much do the feature sets overlap?
         • Modified for length: the Dice coefficient, twice the number of shared terms divided by the total number of terms in the two documents
         • Simple matching coefficient: also takes into account exclusions (terms absent from both documents)
     • Cosine similarity
         • Similarity of the angle between the two document vectors
         • Not sensitive to vector length
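The three measures can be sketched as follows, using their standard textbook definitions: Dice and simple matching on term sets, cosine on term vectors. The function names and the sample data are assumptions made for illustration.

```python
import math

def dice(a, b):
    """Dice coefficient: 2 * |A ∩ B| / (|A| + |B|) on the two term sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def simple_matching(a, b, vocabulary):
    """Fraction of vocabulary terms on which the documents agree (both present or both absent)."""
    a, b = set(a), set(b)
    agree = sum((t in a) == (t in b) for t in vocabulary)
    return agree / len(vocabulary)

def cosine(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

doc_a = ["dog", "bites", "man"]
doc_b = ["man", "walks", "dog"]
vocab = ["bites", "cat", "dog", "man", "walks"]

print(dice(doc_a, doc_b))                    # 2 shared terms out of 3 + 3
print(simple_matching(doc_a, doc_b, vocab))  # agree on dog, man, and the absent cat
print(cosine([1, 2, 0], [2, 4, 0]))          # same direction, different length
```

The last call illustrates the length insensitivity: the second vector is the first one doubled, so the angle between them is zero and the cosine is 1 up to floating-point rounding.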

  13. Bag of Words
     • All of these techniques are what is known as bag-of-words approaches.
     • Keywords are treated in isolation.
     • The difference between "man bites dog" and "dog bites man" is non-existent.
     • In text mining we will discuss linguistic approaches which pay attention to semantics.
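The "man bites dog" point is easy to check: once word order is discarded, the two sentences have identical representations. A minimal sketch, with `bag_of_words` as a hypothetical helper:

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as word counts only; word order is thrown away."""
    return Counter(text.lower().split())

print(bag_of_words("man bites dog") == bag_of_words("dog bites man"))  # True
```

Any measure built on these counts, including TF*IDF and cosine similarity, inherits the same blindness to word order.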
