
Search: A Basic Overview



Presentation Transcript


  1. Search: A Basic Overview Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata October 20, 2014

  2. Back in those days Once upon a time, there were days without search engines. We had access to a much smaller amount of information, and had to find information manually.

  3. Search engine • User needs some information • A search engine tries to bridge this gap • Assumption: the required information is present somewhere • How: the user “expresses” the information need as a query, and the engine returns a list of documents (or presents the results by some better means)

  4. Search engine • User needs some information • A search engine tries to bridge this gap • Assumption: the required information is present somewhere • Simplest model: the user submits a query – a set of words (terms) – and the search engine returns documents “matching” the query • Assumption: matching the query will satisfy the information need • Modern search has come a long way from this simple model, but the fundamentals are still required

  5. Basic approach
• Documents contain terms; a document is represented by the terms present in it
• Match queries and documents by terms
• For simplicity: ignore positions, consider documents as a “bag of words”
• There may be many matching documents – they need to be ranked
Example document collection (used throughout these slides):
1. This is in Indian Statistical Institute, Kolkata, India
2. Diwali is a huge festival in India
3. Statistically flying is the safest mode of journey
4. This is autumn
5. Thank god it is a holiday
6. India’s population is huge
7. There is no end of learning
Query: india statistics
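
To make the bag-of-words idea concrete, here is a minimal Python sketch (not from the slides) that represents each example document as a set of terms and ranks documents by how many query terms they contain. Note that without stemming, “statistics” does not yet match “Statistical” or “Statistically”:

```python
import re

# The example collection from the slide, keyed by doc id.
docs = {
    1: "This is in Indian Statistical Institute, Kolkata, India",
    2: "Diwali is a huge festival in India",
    3: "Statistically flying is the safest mode of journey",
    4: "This is autumn",
    5: "Thank god it is a holiday",
    6: "India's population is huge",
    7: "There is no end of learning",
}

def bag_of_words(text):
    # Crude tokenization: lowercase word characters only; positions are ignored.
    return set(re.findall(r"\w+", text.lower()))

query = bag_of_words("india statistics")
# Rank by the number of matching query terms (no weighting yet).
scores = {d: len(query & bag_of_words(t)) for d, t in docs.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])
```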

  6. Vector space model Each term represents a dimension Documents are vectors in the term-space Term-document matrix: a very sparse matrix Query is also a vector in the term-space • Similarity of each document d with the query q is measured by the cosine similarity (dot product normalized by norms of the vectors)
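
As a sketch of the cosine similarity computation (here assuming raw term frequencies as the vector entries; any TF.iDF weighting could be substituted):

```python
import math
import re
from collections import Counter

def vectorize(text):
    # Term-frequency vector: one dimension per distinct term.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(v, w):
    # Dot product normalized by the norms of the two vectors.
    dot = sum(v[t] * w[t] for t in v if t in w)
    norms = math.sqrt(sum(x * x for x in v.values())) * \
            math.sqrt(sum(x * x for x in w.values()))
    return dot / norms if norms else 0.0

d = vectorize("Diwali is a huge festival in India")
q = vectorize("india statistics")
print(cosine(d, q))  # cosine similarity of this document with the query
```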

  7. Scoring function: TF.iDF
• How important is a term t in a document d?
• Approach: take two factors into account
• With what significance does t occur in d? [term frequency]
• Does t occur in many other documents as well? [document frequency]
• Called TF.iDF: TF × iDF; has many variants for TF and iDF
• Variants for TF(t, d):
• Raw frequency: the number of times t occurs in d: freq(t, d)
• Logarithmically scaled frequency: TF(t, d) = 1 + log(freq(t, d)) for all t in d; 0 otherwise
• Augmented frequency: TF(t, d) = 0.5 + 0.5 × freq(t, d) / max{freq(t′, d) : t′ in d} – half the score for just being present, the rest a function of frequency; avoids a bias towards longer documents
• Inverse document frequency: iDF(t) = log(N / DF(t)), where N = total number of documents and DF(t) = number of documents in which t occurs
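
A minimal sketch of the logarithmically scaled TF times iDF variant described above (toy tokenization, no stemming):

```python
import math
import re
from collections import Counter

docs = [
    "This is in Indian Statistical Institute, Kolkata, India",
    "Diwali is a huge festival in India",
    "India's population is huge",
]
tokenized = [re.findall(r"\w+", d.lower()) for d in docs]
N = len(tokenized)
# DF(t) = number of documents in which t occurs
df = Counter(t for toks in tokenized for t in set(toks))

def tf_idf(term, toks):
    freq = toks.count(term)
    if freq == 0:
        return 0.0
    tf = 1 + math.log(freq)        # logarithmically scaled TF
    idf = math.log(N / df[term])   # iDF(t) = log(N / DF(t))
    return tf * idf

print(tf_idf("population", tokenized[2]))  # rare term, higher iDF
print(tf_idf("india", tokenized[2]))       # occurs in every doc, so iDF = 0
```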

  8. BM25
• Okapi IR system – Okapi BM25
• If the query q = {q1, …, qn}, where the qi are the words in the query, then
score(d, q) = Σi iDF(qi) × freq(qi, d) × (k1 + 1) / [ freq(qi, d) + k1 × (1 − b + b × |d| / avgdl) ]
where |d| = length of document d, N = total number of documents, avgdl = average length of documents, and iDF(qi) = log[(N − DF(qi) + 0.5) / (DF(qi) + 0.5)]
• k1 and b are optimized parameters, usually b = 0.75 and 1.2 ≤ k1 ≤ 2.0
• BM25 consistently exhibited better performance than TF.iDF in TREC
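
A sketch of the BM25 scoring function as reconstructed above; the “+1” inside the log is one common smoothing choice to keep iDF non-negative, not something stated on the slide:

```python
import math

def bm25_score(query_terms, doc_tokens, df, N, avgdl, k1=1.5, b=0.75):
    # Okapi BM25; usual parameter ranges are b = 0.75 and 1.2 <= k1 <= 2.0.
    score = 0.0
    for q in query_terms:
        n_q = df.get(q, 0)
        freq = doc_tokens.count(q)
        if n_q == 0 or freq == 0:
            continue
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)  # '+1' smoothing keeps idf >= 0
        score += idf * freq * (k1 + 1) / \
                 (freq + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

# Toy example: made-up document frequencies over a 7-document collection.
doc = "india population statistics report india".split()
print(bm25_score(["india", "statistics"], doc,
                 df={"india": 3, "statistics": 1}, N=7, avgdl=6.0))
```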

  9. Relevance
• Simple IR model: query, documents, returned results
• Relevant document: a document that satisfies the information need expressed by the query
• Merely matching query terms does not make a document relevant
• Relevance is human perception, not a mathematical statement
• A user issuing the query “india statistics” may want statistics on the population of India
• The document “Indian Statistical Institute” matches the query terms, but is not relevant
• To evaluate the effectiveness of a system, for each query we need either an assessment of whether each returned result is relevant, or the set of all relevant results, assessed (pre-validated) in advance
• If the second is available, it serves the purpose of the first as well
• Measures: precision, recall, F-measure (the harmonic mean of precision and recall)
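
Given a set of assessed relevant results, the three measures can be computed as follows (a minimal sketch; the doc ids are made up):

```python
def precision_recall_f1(returned, relevant):
    # returned: doc ids returned by the system; relevant: assessed relevant doc ids
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

print(precision_recall_f1(returned=[1, 2, 6, 4], relevant=[1, 2, 6, 7]))
# -> (0.75, 0.75, 0.75)
```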

  10. Inverted index
• Standard representation: document → terms
• Inverted index: term → documents
• For each term t, store the list of the documents in which t occurs
• (Example document collection 1–7 from slide 5.) Scores?
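
A minimal in-memory sketch of the structure; the postings assume the doc ids of the example collection on slide 5 and a stemmed vocabulary, so “Statistical” and “Statistically” both map to “statistic”:

```python
# term -> sorted list of doc ids in which the term occurs
index = {
    "india":     [1, 2, 6],
    "statistic": [1, 3],   # 'Statistical' (doc 1), 'Statistically' (doc 3) after stemming
    "huge":      [2, 6],
}

def docs_containing_all(terms):
    # Conjunctive (AND) query: intersect the postings lists.
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(docs_containing_all(["india", "statistic"]))  # -> [1]
```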

  11. Inverted index
• As before, for each term t store the list of the documents in which t occurs – now together with a score per document
• (Same example collection as before.) Note: the scores on this slide are dummy values, not computed by any formula

  12. Positional index
• Storing just documents and scores follows the bag-of-words model
• With that alone, we cannot perform proximity search or phrase query search
• Positional inverted index: also store the position of each occurrence of term t in each document d where t occurs
• (Same example collection as before.)
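
A sketch of positional postings and a two-word phrase query; the positions are word offsets within doc 1 of the example collection:

```python
# term -> {doc id -> sorted positions of the term in that doc}
pos_index = {
    "statistical": {1: [4]},   # doc 1: "This is in Indian Statistical Institute, ..."
    "institute":   {1: [5]},
}

def phrase_match(first, second):
    # Two-word phrase query: 'second' must occur at position p + 1
    # whenever 'first' occurs at position p in the same document.
    hits = []
    for doc, positions in pos_index.get(first, {}).items():
        following = set(pos_index.get(second, {}).get(doc, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc)
    return sorted(hits)

print(phrase_match("statistical", "institute"))  # -> [1]
```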

  13. Pre-processing
• Removal of stopwords: of, the, and, …
• Modern search does not completely remove stopwords – such words add meaning to sentences as well as to queries
• Stemming: words → the stem (root) of the word
• Statistics, statistically, statistical → statistic (same root)
• Slight loss of information (the form of the word also matters), but it unifies differently expressed queries on the same topic
• Lemmatization: doing this properly, with morphological analysis of the words
• Normalization: unify equivalent words as much as possible
• U.S.A., USA
• Windows, windows
• Stemming, lemmatization, normalization, synonym finding – all are important subfields in their own right!
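
A toy preprocessing pipeline illustrating the steps above; the suffix-stripping “stemmer” is deliberately naive (a real system would use a Porter/Snowball stemmer or lemmatization):

```python
import re

STOPWORDS = {"of", "the", "and", "is", "a", "in", "this"}

def toy_stem(word):
    # Deliberately naive suffix stripping, for illustration only.
    for suffix in ("ically", "ics", "ical", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # normalize (lowercase) -> tokenize -> remove stopwords -> stem
    tokens = re.findall(r"\w+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

# statistics / statistical / statistically all collapse to the same stem
print(preprocess("Statistics and statistical models, statistically speaking"))
# -> ['statist', 'statist', 'model', 'statist', 'speaking']
```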

  14. Creating an inverted index
• For each document, write out pairs (term, docid)
• Sort by term
• Group, compute DF
• (Same example collection as before; a sketch of the three steps follows below.)
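
A sketch of the three steps (write out pairs, sort, group and compute DF) on a two-document toy collection:

```python
import re
from itertools import groupby
from operator import itemgetter

docs = {1: "India's population is huge", 2: "Diwali is a huge festival in India"}

# Step 1: for each document, write out (term, docid) pairs.
pairs = [(term, docid)
         for docid, text in docs.items()
         for term in re.findall(r"\w+", text.lower())]

# Step 2: sort by term (and doc id).
pairs.sort()

# Step 3: group by term; the postings list is the sorted set of doc ids,
# and DF(t) is simply the length of that list.
index, df = {}, {}
for term, group in groupby(pairs, key=itemgetter(0)):
    postings = sorted({docid for _, docid in group})
    index[term] = postings
    df[term] = len(postings)

print(index["huge"], df["huge"])  # -> [1, 2] 2
```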

  15. Traditional architecture
• Indexing (offline) path: different types of documents → basic format conversion and parsing → analysis (stemming, normalization, …) → indexing → index
• Query (online) path: the user’s query → query handler (query parsing) → core query processing (accessing the index, ranking) → results handler (displaying results) → results back to the user

  16.–27. Query processing: merge
• Three postings lists (List 1, List 2, List 3), each sorted by doc id
• Keep one pointer in each list
• At every step, pick the smallest doc id among the pointers, combine its partial scores, and advance the pointer(s) that produced it
• (The original slides animate this, one pointer advance per slide)

  28. Merge
• The merged list is still sorted by doc id
• A (partial) sort by score then yields the top-2
• Complexity? k log n
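
A sketch of the whole merge in Python: heapq.merge performs the “pick the smallest doc id” step across the doc-id-sorted lists, partial scores are summed per document, and heapq.nlargest does the final partial sort for the top-k (the list contents are made up):

```python
import heapq

# Postings as (doc_id, score) pairs, each list sorted by doc id.
list1 = [(1, 0.9), (4, 0.2), (6, 0.4)]
list2 = [(1, 0.6), (2, 0.7), (6, 0.6)]
list3 = [(2, 0.5), (6, 0.3), (7, 0.1)]

def merge_and_topk(lists, k):
    # heapq.merge yields entries from all lists in ascending doc-id order.
    merged = {}
    for doc, score in heapq.merge(*lists):
        merged[doc] = merged.get(doc, 0.0) + score  # sum partial scores
    # Partial sort: top-k by total score, ~O(n log k) with a heap.
    return heapq.nlargest(k, merged.items(), key=lambda kv: kv[1])

print(merge_and_topk([list1, list2, list3], k=2))  # -> [(1, 1.5), (6, 1.3)]
```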

  29. Merge
• Simple and efficient, minimal overhead
• The lists, sorted by doc id, are merged into a single merged list
• But the lists have to be scanned fully!

  30. Top-k algorithms
• If there are millions of documents in the lists, can the ranking be done without accessing the lists fully?
• Exact top-k algorithms (used more in databases): the family of threshold algorithms (Ronald Fagin et al.)
• Threshold algorithm (TA)
• No random access algorithm (NRA) [we will discuss this one as an example]
• Combined algorithm (CA)
• Other follow-up works
• Inexact top-k algorithms: exact top-k is not required, since the scores are only a “crude” approximation of “relevance” (a human perception); several heuristics exist
• Further reading: IR book by Manning, Raghavan and Schuetze, Ch. 7

  31. NRA (No Random Access) algorithm
• Fagin’s NRA algorithm works on lists sorted by score
• In each round, read one doc from every list (sorted accesses only – no random accesses)

  32. Fagin’s NRA algorithm: round 1
• Read one doc from every list; track each candidate’s current score and best-score
• Maximum score for unseen docs: 0.6 + 0.6 + 0.9 = 2.1
• Min top-2 score among the candidates: 0.6
• min-top-2 < best-score of candidates, so keep reading

  33. Fagin’s NRA algorithm: round 2
• Read one doc from every list
• Maximum score for unseen docs: 0.5 + 0.6 + 0.7 = 1.8
• Min top-2 score among the candidates: 0.9
• min-top-2 < best-score of candidates, so keep reading

  34. Fagin’s NRA algorithm: round 3
• Read one doc from every list
• Maximum score for unseen docs: 0.4 + 0.6 + 0.3 = 1.3
• Min top-2 score among the candidates: 1.3
• No more new docs can get into the top-2, but extra candidates are left in the queue, so keep reading

  35. Fagin’s NRA algorithm: round 4
• Read one doc from every list
• Maximum score for unseen docs: 0.3 + 0.6 + 0.2 = 1.1
• Min top-2 score among the candidates: 1.3
• No more new docs can get into the top-2, but extra candidates are still left in the queue

  36. Fagin’s NRA algorithm: round 5
• Read one doc from every list
• Maximum score for unseen docs: 0.2 + 0.5 + 0.1 = 0.8
• Min top-2 score among the candidates: 1.6
• No extra candidate left in the queue – done!
• More approaches: periodically also perform random accesses on documents to reduce the uncertainty (CA); sophisticated scheduling of accesses on the lists; or a crude approximation – NRA may take a lot of time to stop, so just stop after a while with an approximate top-k (who cares whether the results are perfect according to the scores?)
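
A simplified sketch of NRA with sorted accesses only, reading one entry from every list per round and stopping once no other document’s best-score can beat the current top-k’s worst-scores (the partial sum plays the role of the slides’ “current score”; the lists are made-up toy data, not the slides’ exact example):

```python
import heapq

def nra_topk(lists, k):
    # Each input list is a sequence of (score, doc) pairs sorted by descending score.
    m = len(lists)
    seen = {}  # doc -> [partial score (worst-score), set of list ids seen in]
    for depth in range(max(len(l) for l in lists)):
        # Sorted access: read the next entry from every list.
        for i, lst in enumerate(lists):
            if depth < len(lst):
                score, doc = lst[depth]
                entry = seen.setdefault(doc, [0.0, set()])
                entry[0] += score
                entry[1].add(i)
        # Score at the current read frontier of each list (0 if exhausted).
        level = [lst[depth][0] if depth < len(lst) else 0.0 for lst in lists]
        top = heapq.nlargest(k, seen, key=lambda d: seen[d][0])
        min_top = min(seen[d][0] for d in top)
        # best-score: partial sum plus the frontier of lists the doc wasn't seen in yet.
        def best(doc):
            part, where = seen[doc]
            return part + sum(level[i] for i in range(m) if i not in where)
        others_best = max((best(d) for d in seen if d not in top), default=0.0)
        unseen_best = sum(level)  # maximum possible score of any unseen doc
        if min_top >= max(others_best, unseen_best):
            break  # no other doc can still enter the top-k
    return [(d, seen[d][0]) for d in heapq.nlargest(k, seen, key=lambda d: seen[d][0])]

lists = [
    [(0.9, "x"), (0.7, "y"), (0.3, "z"), (0.2, "u"), (0.1, "v")],
    [(0.6, "y"), (0.6, "x"), (0.6, "w"), (0.6, "z"), (0.5, "u")],
    [(0.9, "y"), (0.7, "w"), (0.3, "x"), (0.2, "v"), (0.1, "z")],
]
print(nra_topk(lists, k=2))  # stops after 3 rounds -> [('y', 2.2), ('x', 1.8)]
```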

  37. References • Primarily: IR Book by Manning, Raghavan and Schuetze: http://nlp.stanford.edu/IR-book/
