html5-img
1 / 64

Algoritmi per IR

Algoritmi per IR. Ranking. The big fight: find the best ranking. Ranking: Google vs Google.cn. Algoritmi per IR. Text-based Ranking (1° generation). Similarity between binary vectors. Document is binary vector X,Y in {0,1} D Score: overlap measure. What’s wrong ?. Normalization.

Download Presentation

Algoritmi per IR

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Algoritmi per IR Ranking

  2. The big fight: find the best ranking...

  3. Ranking: Google vs Google.cn

  4. Algoritmi per IR Text-based Ranking (1° generation)

  5. Similarity between binary vectors • Document is binary vector X,Y in {0,1}D • Score: overlap measure What’s wrong ?

  6. Normalization • Dice coefficient (wrt avg #terms): • Jaccard coefficient (wrt possible terms): NO, triangular OK, triangular

  7. What’s wrong in doc-similarity ? Overlap matching doesn’t consider: • Term frequency in a document • Talks more of t ? Then t should be weighted more. • Term scarcity in collection • of commoner than baby bed • Length of documents • score should be normalized

  8. æ ö n = ç ÷ idf log where nt = #docs containing term t n = #docs in the indexed collection è ø t n t A famous “weight”: tf-idf = tf Frequency of term t in doc d = #occt / |d| t,d Vector Space model

  9. Sec. 6.3 Why distance is a bad idea

  10. Postulate: Documents that are “close together” in the vector space talk about the same things. Euclidean distance sensible to vector length !! Easy to Spam A graphical example t3 cos(a)= v w / ||v|| * ||w|| d2 d3 Sophisticated algos to find top-k docs for a query Q d1 a t1 d5 t2 d4 The user query is a very short doc

  11. Sec. 6.3 cosine(query,document) Dot product qi is the tf-idf weight of term i in the query di is the tf-idf weight of term i in the document cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.

  12. Cos for length-normalized vectors • For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): for q, d length-normalized.

  13. Sec. 6.3 Cosine similarity amongst 3 documents How similar are the novels SaS: Sense and Sensibility PaP: Pride and Prejudice, and WH: Wuthering Heights? Term frequencies (counts) Note: To simplify this example, we don’t do idf weighting.

  14. Sec. 6.3 3 documents example contd. Log frequency weighting After length normalization cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94 cos(SaS,WH) ≈ 0.79 cos(PaP,WH) ≈ 0.69 Why do we have cos(SaS,PaP) > cos(SAS,WH)?

  15. Vector spaces and other operators • Vector space OK for bag-of-words queries • Clean metaphor for similar-document queries • Not a good combination with operators: Boolean, wild-card, positional, proximity • First generation of search engines • Invented before “spamming” web search

  16. Algoritmi per IR Top-k retrieval

  17. Sec. 7.1.1 Speed-up top-k retrieval • Costly is the computation of the cos • Find a set A of contenders, with K < |A| << N • A does not necessarily contain the top K, but has many docs from among the top K • Return the top K docs in A, according to the score • The same approach is also used for other (non-cosine) scoring functions • Will look at several schemes following this approach

  18. Sec. 7.1.2 Index elimination • Consider docs containing at least one query term. Hence this means… • Take this further: • Only consider high-idf query terms • Only consider docs containing many query terms

  19. Sec. 7.1.2 High-idf query terms only • For a query such as catcher in the rye • Only accumulate scores from catcher and rye • Intuition: in and the contribute little to the scores and so don’t alter rank-ordering much • Benefit: • Postings of low-idf terms have many docs  these (many) docs get eliminated from set A of contenders

  20. Sec. 7.1.2 Docs containing many query terms • For multi-term queries, compute scores for docs containing several of the query terms • Say, at least 3 out of 4 • Imposes a “soft conjunction” on queries seen on web search engines (early Google) • Easy to implement in postings traversal

  21. Sec. 7.1.2 3 2 4 4 8 8 16 16 32 32 64 64 128 128 1 2 3 5 8 13 21 34 3 of 4 query terms Antony Brutus Caesar Calpurnia 13 16 32 Scores only computed for docs 8, 16 and 32.

  22. Champion Lists • Preprocess: Assign to each term, its m best documents • Search: • If |Q| = q terms, merge their preferred lists ( mq answers). • Compute COS between Q and these docs, and choose the top k. Need to pick m>k to work well empirically. Now SE use tf-idf PLUS PageRank (PLUS other weights)

  23. Sec. 7.1.4 Complex scores • Consider a simple total score combining cosine relevance and authority • net-score(q,d) = PR(d) + cosine(q,d) • Can use some other linear combination than an equal weighting • Now we seek the top K docs by net score

  24. Advanced: Fancy-hits heuristic • Preprocess: • Assign docID by decreasing PR weight • Define FH(t) =m docs for t with highest tf-idf weight • Define IL(t) = the rest (i.e. incr docID = decr PR weight) • Idea: a document that scores high should be in FH or in the front of IL • Search for a t-term query: • First FH: Take the common docs of their FH • compute the score of these docs and keep the top-k docs. • Then IL: scan ILs and check the common docs • Compute the score and possibly insert them into the top-k. • Stop when M docs have been checked or the PR score becomes smaller than some threshold.

  25. Algoritmi per IR Speed-up querying by clustering

  26. Sec. 7.1.6 Visualization Query Leader Follower

  27. Sec. 7.1.6 Cluster pruning: preprocessing • Pick N docs at random: call these leaders • For every other doc, pre-compute nearest leader • Docs attached to a leader: its followers; • Likely: each leader has ~ N followers.

  28. Sec. 7.1.6 Cluster pruning: query processing • Process a query as follows: • Given query Q, find its nearest leader L. • Seek K nearest docs from among L’s followers.

  29. Sec. 7.1.6 Why use random sampling • Fast • Leaders reflect data distribution

  30. Sec. 7.1.6 General variants • Have each follower attached to b1=3 (say) nearest leaders. • From query, find b2=4 (say) nearest leaders and their followers. • Can recur on leader/follower construction.

  31. Algoritmi per IR Relevance feedback

  32. Sec. 9.1 Relevance Feedback • Relevance feedback: user feedback on relevance of docs in initial set of results • User issues a (short, simple) query • The user marks some results as relevant or non-relevant. • The system computes a better representation of the information need based on feedback. • Relevance feedback can go through one or more iterations.

  33. Sec. 9.1.1 Rocchio’s Algorithm • The Rocchio algorithm uses the vector space model to pick a relevance feed-back query • Rocchio seeks the query qopt that maximizes

  34. Sec. 9.1.1 Rocchio (SMART) • Used in practice: • Dr =set of known relevant doc vectors • Dnr = set of known irrelevant doc vectors • qm = modified query vector; q0 = original query vector; α,β,γ: weights (hand-chosen or set empirically) • New query moves toward relevant documents and away from irrelevant documents

  35. Relevance Feedback: Problems • Users are often reluctant to provide explicit feedback • It’s often harder to understand why a particular document was retrieved after applying relevance feedback • There is no clear evidence that relevance feedback is the “best use” of the user’s time.

  36. Sec. 9.1.4 Relevance Feedback on the Web • Some search engines offer a similar/related pages feature (this is a trivial form of relevance feedback) • Google (link-based) • Altavista • Stanford WebBase • Some don’t because it’s hard to explain to users: • Alltheweb • bing • Yahoo • Excite initially had true relevance feedback, but abandoned it due to lack of use. α/β/γ ??

  37. Sec. 9.1.6 Pseudo relevance feedback • Pseudo-relevance feedback automates the “manual” part of true relevance feedback. • Retrieve a list of hits for the user’s query • Assume that the top k are relevant. • Do relevance feedback (e.g., Rocchio) • Works very well on average • But can go horribly wrong for some queries. • Several iterations can cause query drift.

  38. Sec. 9.2.2 Query Expansion • In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents • In query expansion, users give additional input (good/bad search term) on words or phrases

  39. Sec. 9.2.2 How augment the user query? • Manual thesaurus (costly to generate) • E.g. MedLine: physician, syn: doc, doctor, MD • Global Analysis(static; all docs in collection) • Automatically derived thesaurus • (co-occurrence statistics) • Refinements based on query-log mining • Common on the web • Local Analysis (dynamic) • Analysis of documents in result set

  40. Query assist Would you expect such a feature to increase the query volume at a search engine?

  41. Algoritmi per IR Zone indexes

  42. Sec. 6.1 Parametric and zone indexes • Thus far, a doc has been a term sequence • But documents have multiple parts: • Author • Title • Date of publication • Language • Format • etc. • These are the metadata about a document

  43. Sec. 6.1 Zone • A zone is a region of the doc that can contain an arbitrary amount of text e.g., • Title • Abstract • References … • Build inverted indexes on fields AND zones to permit querying • E.g., “find docs with merchant in the title zone and matching the query gentle rain”

  44. Sec. 6.1 Example zone indexes Encode zones in dictionary vs. postings.

  45. Sec. 7.2.1 Tiered indexes • Break postings up into a hierarchy of lists • Most important • … • Least important • Inverted index thus broken up into tiers of decreasing importance • At query time use top tier unless it fails to yield K docs • If so drop to lower tiers

  46. Sec. 7.2.1 Example tiered index

  47. Sec. 7.2.2 Query term proximity • Free text queries: just a set of terms typed into the query box – common on the web • Users prefer docs in which query terms occur within close proximity of each other • Would like scoring function to take this into account – how?

  48. Sec. 7.2.3 Query parsers • E.g. query rising interest rates • Run the query as a phrase query • If <K docs contain the phrase rising interest rates, run the two phrase queries rising interest and interest rates • If we still have <K docs, run the vector space query rising interest rates • Rank matching docs by vector space scoring

  49. Algoritmi per IR Quality of a Search Engine

  50. Is it good ? • How fast does it index • Number of documents/hour • (Average document size) • How fast does it search • Latency as a function of index size • Expressiveness of the query language

More Related