  1. Information Retrieval and Vector Space Model Presented by Jun Miao York University

  2. Information Retrieval (IR)

  3. What is Information Retrieval? IR: retrieving information that is relevant to your need • Search Engines • Question Answering • Information Extraction • Information Filtering • Information Recommendation

  4. In the old days… • The term "information retrieval" may have been coined by Calvin Mooers • Early IR applications were used in libraries • Set-based retrieval: the system partitions the corpus into two subsets of documents: those it considers relevant to the search query, and those it does not.

  5. Nowadays • Ranked retrieval: the system responds to a search query by ranking all documents in the corpus based on its estimate of their relevance to the query • A free-form query expresses the user's information need • Documents are ranked by decreasing likelihood of relevance • Many studies show it is superior to set-based retrieval

  6. An Information Retrieval Process (borrowed from Prof. Nie's slides) [Diagram: an information need is expressed as a query; the IR system retrieves from the document collection and returns an answer list]

  7. Inside an IR system

  8. Indexing Documents

  9. Lexical Analysis • What counts as a word or token in the indexing scheme? • A big topic

  10. Stop List • Function words do not bear useful information for IR: of, not, to, or, in, about, with, I, be, … • Stop list: contains stop words, which are not to be used as index terms • Prepositions • Articles • Pronouns • Some adverbs and adjectives • Some frequent words (e.g. "document") • The removal of stop words usually improves IR effectiveness • A few "standard" stop lists are commonly used.
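A minimal sketch of stop-word removal; the stop list here is a tiny illustrative sample, not one of the standard lists:

```python
# Tiny illustrative stop list; real systems use a larger standard list.
STOP_WORDS = {"of", "not", "to", "or", "in", "about", "with", "i", "be",
              "a", "an", "the"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("shipment of gold in a truck".split()))
# ['shipment', 'gold', 'truck']
```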

  11. Stemming • Reason: different word forms may bear similar meaning (e.g. search, searching): create a "standard" representation for them • Stemming: removing some endings of words, e.g. dancer, dancers, dance, danced, dancing → dance

  12. Stemming (Cont'd) • Two main methods: • Linguistic/dictionary-based stemming: high stemming accuracy and higher coverage, but high implementation and processing costs • Porter-style stemming: lower stemming accuracy and lower coverage, but lower implementation and processing costs; usually sufficient for IR
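A minimal sketch of Porter-style stemming using NLTK's PorterStemmer (assuming NLTK is installed); note that the resulting stems need not be dictionary words, which is part of the "lower accuracy" trade-off:

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["dancer", "dancers", "dance", "danced", "dancing"]:
    print(word, "->", stemmer.stem(word))
# dancer -> dancer, dancers -> dancer, dance/danced/dancing -> danc
```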

  13. Flat file indexing • Each document is represented by a set of weighted keywords (terms): D → {(t1, w1), (t2, w2), …} e.g. D1 → {(comput, 0.2), (architect, 0.3), …} D2 → {(comput, 0.1), (network, 0.5), …}

  14. Inverted Index
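A minimal sketch of building an inverted index: each term maps to a postings list of (document id, term frequency) pairs:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of (doc_id, term_frequency)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {t: sorted(postings.items()) for t, postings in index.items()}

docs = {1: "shipment of gold", 2: "delivery of silver"}
print(build_inverted_index(docs))
# {'shipment': [(1, 1)], 'of': [(1, 1), (2, 1)], 'gold': [(1, 1)], ...}
```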

  15. Query Analysis • Parse query • Remove stopwords • Stemming • Get terms • Adjacency operations: connect related terms together

  16. Models • Matching score model • Document D = a set of weighted keywords • Query Q = a set of non-weighted keywords • R(D, Q) = Σ w(ti, D) over all query terms ti in Q
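A minimal sketch of this matching score, reusing the flat-file representation from slide 13:

```python
def matching_score(doc_weights, query_terms):
    """R(D, Q) = sum of w(t, D) over the query terms t."""
    return sum(doc_weights.get(t, 0.0) for t in query_terms)

d1 = {"comput": 0.2, "architect": 0.3}
print(matching_score(d1, ["comput", "network"]))  # 0.2
```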

  17. Models(Cont’d) • Boolean Model • Vector Space Model • Probability Model • Language Model • Neural Network Model • Fuzzy Set Model • ……

  18. tf*idf weighting scheme • tf = term frequency • frequency of a term/keyword in a document • The higher the tf, the higher the importance (weight) for the document • df = document frequency • number of documents containing the term • measures the distribution of the term • idf = inverse document frequency • measures the unevenness of the term's distribution in the corpus, i.e. the specificity of the term to a document • idf = log(d / df), where d = total number of documents • The more evenly a term is distributed, the less specific it is to any one document • weight(t, D) = tf(t, D) * idf(t)
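A minimal sketch of this weighting, using log base 10 as in the worked example later in these slides:

```python
import math

def idf(term, docs):
    """idf(t) = log10(d / df), d = total docs, df = docs containing t."""
    d = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log10(d / df) if df else 0.0

def tf_idf(term, doc, docs):
    """weight(t, D) = tf(t, D) * idf(t)."""
    return doc.count(term) * idf(term, docs)

docs = [["gold", "fire"], ["silver", "silver", "truck"], ["gold", "truck"]]
print(round(tf_idf("silver", docs[1], docs), 3))  # 2 * log10(3/1) = 0.954
```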

  19. Evaluation • Given a result list for a query, what is its performance? [Venn diagram: the set of retrieved documents and the set of relevant documents, overlapping in the retrieved-and-relevant documents]

  20. Metrics often used (together): • Precision = retrieved relevant docs / retrieved docs • Recall = retrieved relevant docs / relevant docs
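A minimal sketch of both metrics over sets of document ids (the ids here are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall from sets of document ids."""
    retrieved_relevant = retrieved & relevant
    precision = len(retrieved_relevant) / len(retrieved)
    recall = len(retrieved_relevant) / len(relevant)
    return precision, recall

print(precision_recall({1, 2, 3, 4}, {2, 4, 5, 6}))  # (0.5, 0.5)
```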

  21. Precision-Recall Trade-off • Usually, more precision means less recall, and more recall means less precision • e.g. return all documents: recall = 1, but precision is very low

  22. For Ranked Lists • Consider the result lists of two IR systems S1 and S2 for one query [Figure: two ranked lists with relevant documents marked; in list 1 the relevant documents appear at ranks 1, 3, 6, 9 and 10, in list 2 at ranks 1, 2, 3, 5 and 6] • Which one is better?

  23. Average Precision • AP = (Σ R(xi)/P(xi)) / n, where: • xi ∈ the set of retrieved relevant documents • P(xi) = rank of xi in the retrieved list • R(xi) = rank of xi among the retrieved relevant documents • n = number of retrieved relevant documents • List 1: AP1 = ((1/1)+(2/3)+(3/6)+(4/9)+(5/10))/5 = 0.622

  24. Average Precision (Cont'd) • List 2: AP2 = ((1/1)+(2/2)+(3/3)+(4/5)+(5/6))/5 = 0.927 • S2 is better than S1
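A minimal sketch of AP under the definition above (n = number of retrieved relevant documents); it reproduces both values:

```python
def average_precision(relevant_ranks):
    """AP over the (1-based) ranks at which relevant docs were retrieved."""
    ranks = sorted(relevant_ranks)
    return sum(i / r for i, r in enumerate(ranks, 1)) / len(ranks)

print(round(average_precision([1, 3, 6, 9, 10]), 3))  # 0.622 (list 1)
print(round(average_precision([1, 2, 3, 5, 6]), 3))   # 0.927 (list 2)
```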

  25. Evaluating over Multiple Queries • Mean Average Precision (MAP): the arithmetic mean of the average precisions over all queries • [Table: APs of two IR systems over 5 queries (topics); by MAP, S1 is better than S2]

  26. Other Measures • Precision@N • R-Precision • F-measure • E-measure • ……

  27. Problem • Document collections are sometimes so large that not every document can be judged, which makes recall hard to calculate.

  28. Pooling • Step 1: take the top N documents from the results of several IR systems to make a document pool. • Step 2: experts check the pool and tag each document as relevant or non-relevant for each topic.

  29. Difficulties in text IR • Vocabulary mismatch • Synonymy: e.g. car vs. automobile • Polysemy: e.g. "table" (furniture vs. data table) • Queries are ambiguous; they are partial specifications of the user's need • Content representation may be inadequate and incomplete • The user is the ultimate judge, but we don't know how the judge judges… • The notion of relevance is imprecise, and context- and user-dependent

  30. Difficulties in web IR • No stable document collection (spiders, crawlers) • Invalid documents, duplication, etc. • Huge number of documents (only a partial collection can be indexed) • Multimedia documents • Great variation in document quality • Multilingual problems • …

  31. NLP in IR • Simple methods: stop word, stemming • Higher-level processing: chunking, parsing, word sense disambiguation • Research about using NLP in IR needs more attention

  32. Popular systems • SMART http://ftp.cs.cornell.edu/pub/smart/ • Terrier http://ir.dcs.gla.ac.uk/terrier/ • Okapi http://www.soi.city.ac.uk/~andym/OKAPIPACK/index.html • Lemur http://www-2.cs.cmu.edu/~lemur/ etc…

  33. Conferences and Journals • Conferences • SIGIR • TREC • CLEF • WWW • ECIR … • Journals • ACM Transactions on Information Systems (TOIS) • ACM Transactions on Asian Language Information Processing (TALIP) • Information Processing & Management (IP&M) • Information Retrieval

  34. Vector Space Model

  35. Idea • Convert documents and queries into vectors, and use a Similarity Coefficient (SC) to measure their similarity • Presented by Gerard Salton et al. in 1975, and implemented in the SMART IR system • Assumption: all terms are independent

  36. Constructing Vectors • Each dimension corresponds to a separate term • w(i, j) = weight of term j in document or query i

  37. Doc-Term Matrix • N documents and M terms [Matrix: rows = documents, columns = terms]
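A minimal sketch of such a matrix with boolean weights (any of the weighting schemes on the next slides could be substituted):

```python
def doc_term_matrix(docs):
    """N x M matrix: rows = documents, columns = vocabulary terms."""
    terms = sorted({t for doc in docs for t in doc})
    matrix = [[1 if t in doc else 0 for t in terms] for doc in docs]
    return terms, matrix

docs = [["gold", "fire"], ["silver", "truck"], ["gold", "truck"]]
terms, m = doc_term_matrix(docs)
print(terms)  # ['fire', 'gold', 'silver', 'truck']
print(m)      # [[1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 0, 1]]
```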

  38. Three Key Problems • 1. Term selection • 2. Term weighting • 3. Similarity coefficient calculation

  39. Term Selection • Terms represent the content of documents • Term purification • Stemming • Stop list • e.g. only choose nouns

  40. Term Weight • Boolean weight: 1 = term appears, 0 = term does not appear • Term frequency: • tf • 1 + log(tf) • 1 + log(1 + log(tf)) • Inverse document frequency • tf*idf

  41. Term Weight (Cont'd) • Document length • Two observations: • Longer documents contain more terms • Longer documents have more information • So: punish long documents and compensate short documents • Pivoted normalization: 1 - b + b * doclen / avgdoclen, with b in (0, 1)
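A minimal sketch of the pivoted normalization factor; document weights are divided by it, so longer-than-average documents are penalized and shorter ones are boosted (b = 0.75 is an illustrative choice, not from the slides):

```python
def pivoted_norm(doclen, avgdoclen, b=0.75):
    """1 - b + b * doclen / avgdoclen, with b in (0, 1)."""
    return 1 - b + b * doclen / avgdoclen

print(pivoted_norm(200, 100))  # 1.75  -> long document penalized
print(pivoted_norm(50, 100))   # 0.625 -> short document boosted
```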

  42. Similarity Coefficient Calculation • Dot product • Cosine • Dice • Jaccard [Figure: query Q and document D as vectors in the two-dimensional term space (t1, t2)]
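Minimal sketches of the four coefficients over term-weight vectors, with a toy pair of vectors:

```python
import math

def dot(q, d):
    """Dot product of two term-weight vectors."""
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    return dot(q, d) / (math.sqrt(dot(q, q)) * math.sqrt(dot(d, d)))

def dice(q, d):
    return 2 * dot(q, d) / (dot(q, q) + dot(d, d))

def jaccard(q, d):
    return dot(q, d) / (dot(q, q) + dot(d, d) - dot(q, d))

q, d = [1, 0, 1], [1, 1, 0]
print(dot(q, d), round(cosine(q, d), 3))  # 1 0.5
```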

  43. Example • Q: "gold silver truck" • D1: "Shipment of gold delivered in a fire" • D2: "Delivery of silver arrived in a silver truck" • D3: "Shipment of gold arrived in a truck" • Document frequency of the jth term: dfj • Inverse document frequency: idf = log10(n / dfj) • tf*idf is used as the term weight here

  44. Example (Cont'd) [Table: tf*idf weights of each term in Q, D1, D2 and D3; e.g. idf(gold) = idf(truck) = log10(3/2) = 0.176, idf(silver) = log10(3/1) = 0.477]

  45. Example (Cont'd) • tf*idf is used here • SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) = 0.031 • SC(Q, D2) = 0.486 • SC(Q, D3) = 0.062 • The ranking would be D2, D3, D1. • This SC uses the dot product.
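A minimal sketch that reproduces this example end to end; terms such as "of", "in" and "a" occur in all three documents, so their idf (and hence their weight) is 0:

```python
import math

docs = {
    "D1": "shipment of gold delivered in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

def idf(term):
    """idf = log10(n / df) over the 3-document collection."""
    df = sum(term in d.split() for d in docs.values())
    return math.log10(len(docs) / df) if df else 0.0

def tf_idf_vector(text):
    tokens = text.split()
    return {t: tokens.count(t) * idf(t) for t in set(tokens)}

q_vec = tf_idf_vector(query)
for name, text in docs.items():
    d_vec = tf_idf_vector(text)
    sc = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())  # dot product
    print(name, round(sc, 3))
# D1 0.031   D2 0.486   D3 0.062   -> ranking: D2, D3, D1
```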

  46. Advantages of VSM • Fairly cheap to compute • Yields decent effectiveness • Very popular: SMART is one of the most commonly used academic prototypes

  47. Disadvantages of VSM • No solid theoretical foundation • Weights in the vectors are rather arbitrary • Assumes term independence • The doc-term matrix is very sparse
