
CS144 Discussion Week 4 Information Retrieval


Presentation Transcript


  1. CS144 Discussion Week 4: Information Retrieval Young Cha Oct. 25, 2013

  2. Projects • Project 2 deadline is 11pm today (10/25) • 2 grace days → 11pm 10/27 (Sun) • Please double-check your implementation before submission • Project 3 has 3 parts and 2 submission deadlines • Part A: Building indexes (-11/1) No Grace Period Allowed! • Part B: Implementing Java search functions (-11/8) • Part C: Publishing Java class as Web service (-11/8) • You may resubmit your Project 2 after fixing bugs • We don't grade your Part A submission, but we may check how different it is from your Part B/C submission → if it is largely different, briefly write down what has changed in README.txt

  3. Boolean Model • Bag of words • Order doesn't matter • Boolean query • AND/OR/NOT • 3 documents • doc1: Bruins beat Trojans • doc2: Trojans envy Bruins • doc3: Bruins! Go Bruins! • [Figure: inverted index with lexicon/dictionary and postings lists; a sketch follows below]
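The slide's inverted-index figure is not reproduced in the transcript. Below is a minimal Java sketch (class and variable names are mine, not from the slides) that builds the lexicon and postings lists for the three example documents and answers the Boolean query "bruins AND trojans" by intersecting two postings lists.

import java.util.*;

// Illustrative sketch: build postings lists for the three example documents
// and answer a Boolean AND query by intersecting two postings lists.
public class BooleanModelDemo {
    public static void main(String[] args) {
        String[] docs = {
            "Bruins beat Trojans",   // doc 1
            "Trojans envy Bruins",   // doc 2
            "Bruins! Go Bruins!"     // doc 3
        };

        // lexicon/dictionary: term -> sorted postings list of doc ids
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int docId = 1; docId <= docs.length; docId++) {
            for (String token : docs[docId - 1].toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                index.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        System.out.println("Inverted index: " + index);

        // Boolean AND: intersect the two postings lists
        Set<Integer> result = new TreeSet<>(index.get("bruins"));
        result.retainAll(index.get("trojans"));
        System.out.println("bruins AND trojans -> docs " + result); // [1, 2]
    }
}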

  4. Vector Model • Tf-idf • f x log (N/n) • Cosine similarity • 3 documents • doc1: Bruins beat Trojans • doc2: Trojans envy Bruins • doc3: Bruins! Go Bruins! * Used N/n instead of log(N/n) for simplicity
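To make the vector-model bullets concrete, here is a small Java sketch (my own code, not from the slides) that computes term weights for the three example documents and the cosine similarity between two of them. Following the slide's simplification, the weight is tf x (N/n) rather than tf x log(N/n).

import java.util.*;

// Illustrative sketch: tf-idf weights (using N/n instead of log(N/n), as on
// the slide) and cosine similarity for the three example documents.
public class VectorModelDemo {
    public static void main(String[] args) {
        String[] docs = {
            "Bruins beat Trojans",
            "Trojans envy Bruins",
            "Bruins! Go Bruins!"
        };
        List<List<String>> tokenized = new ArrayList<>();
        Set<String> vocab = new TreeSet<>();
        for (String d : docs) {
            List<String> toks = Arrays.asList(d.toLowerCase().split("\\W+"));
            tokenized.add(toks);
            vocab.addAll(toks);
        }

        // document frequency n(term) = number of docs containing the term
        Map<String, Integer> df = new HashMap<>();
        for (String term : vocab)
            for (List<String> toks : tokenized)
                if (toks.contains(term)) df.merge(term, 1, Integer::sum);

        int N = docs.length;
        List<String> terms = new ArrayList<>(vocab);
        double[][] w = new double[N][terms.size()];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < terms.size(); j++) {
                int tf = Collections.frequency(tokenized.get(i), terms.get(j));
                w[i][j] = tf * ((double) N / df.get(terms.get(j)));  // tf * N/n
            }

        System.out.printf("cos(doc1, doc3) = %.3f%n", cosine(w[0], w[2]));
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}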

  5. Precision & Recall • 1K docs in a corpus • 50 relevant docs • Among 10 docs retrieved by a search engine, • 3 are relevant • 7 are irrelevant • Precision? |R&D|/|D| = 3/10 = 0.3 • Recall? |R&D|/|R| = 3/50 = 0.06 • [Figure: Venn diagram of Relevant (R) and Retrieved (D) sets: 3 relevant docs retrieved, 7 irrelevant docs retrieved, 47 relevant docs not retrieved; precision/recall compared for Search Engine A vs. Search Engine B]
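The same arithmetic, spelled out as a tiny Java sketch using the numbers from the slide (the class name is mine):

// Worked numbers from the slide: 50 relevant docs in the corpus,
// 10 docs retrieved, of which 3 are relevant.
public class PrecisionRecallDemo {
    public static void main(String[] args) {
        double relevantInCorpus = 50;    // |R|
        double retrieved = 10;           // |D|
        double relevantRetrieved = 3;    // |R & D|

        double precision = relevantRetrieved / retrieved;         // |R&D| / |D|
        double recall = relevantRetrieved / relevantInCorpus;     // |R&D| / |R|

        System.out.println("precision = " + precision);  // 0.3
        System.out.println("recall    = " + recall);     // 0.06
    }
}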

  6. Index Size Estimation • Given that • 100 M docs • 5 KB/doc • 400 unique words/doc • 20 bytes/word • 10 bytes/docid • Questions • Document collection size? 100M x 5KB = 500GB • Inverted index size? 400GB + 200KB (postings lists + lexicon) • Size of postings lists? 100M x 400 x 10B = 400GB • Size of lexicon? (C=1, k=0.5 in C·n^k): (100M)^0.5 x 20B = 200KB
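The same estimates as a short Java sketch (my own code). It follows the slide's computation, including taking n in the Heaps'-law formula C·n^k to be the number of documents:

// Worked numbers from the slide, with the lexicon size estimated by
// C * n^k where C = 1, k = 0.5 and n = 100M (as on the slide).
public class IndexSizeDemo {
    public static void main(String[] args) {
        double numDocs = 100e6;        // 100 M docs
        double docSizeBytes = 5e3;     // 5 KB per doc
        double wordsPerDoc = 400;      // unique words per doc
        double bytesPerWord = 20;
        double bytesPerDocId = 10;

        double collectionSize = numDocs * docSizeBytes;               // 500 GB
        double postingsSize = numDocs * wordsPerDoc * bytesPerDocId;  // 400 GB
        double lexiconSize = Math.pow(numDocs, 0.5) * bytesPerWord;   // 200 KB
        double indexSize = postingsSize + lexiconSize;                // ~400 GB

        System.out.printf("collection: %.0f GB%n", collectionSize / 1e9);
        System.out.printf("postings:   %.0f GB%n", postingsSize / 1e9);
        System.out.printf("lexicon:    %.0f KB%n", lexiconSize / 1e3);
        System.out.printf("index:      %.0f GB%n", indexSize / 1e9);
    }
}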

  7. Topic-model based IR • Topic models assume that there are hidden topics behind words • An IR system with topic models can match a doc containing "automobile" to the query "vehicle", since it assumes they come from the same topic • [Figure: a searcher's query "car" is matched to an author's "automobile" because both map to the same hidden topic]

  8. Document Corpus Example • Document corpus (textual dataset) → matrix • Assumed hidden (latent) topics behind docs/words • We can infer topics by analyzing co-occurrence of docs and words • We can generate docs by multiplying assumed doc-topic and topic-word matrices (see the sketch below) • [Figure: Document-Word matrix (observed) ≈ Document-Topic matrix (assumed) x Topic-Word matrix (assumed); example docs: doc1: auto auto ... vehicle vehicle ..., doc2: film theater film theater ..., doc3: film ... theater ... vehicle ... auto ...]
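A tiny Java sketch of the "generation" direction: an assumed document-topic matrix multiplied by an assumed topic-word matrix yields expected document-word weights. The matrices and their values are made up for illustration; only the words come from the slide's example.

// Illustrative sketch: doc-word = doc-topic x topic-word (made-up numbers).
public class TopicGenerationDemo {
    public static void main(String[] args) {
        String[] words = {"auto", "vehicle", "film", "theater"};

        // rows: doc1..doc3; columns: topic "cars", topic "movies"
        double[][] docTopic = {
            {1.0, 0.0},   // doc1: all about cars
            {0.0, 1.0},   // doc2: all about movies
            {0.5, 0.5}    // doc3: a mix of both
        };
        // rows: topics; columns: words
        double[][] topicWord = {
            {0.6, 0.4, 0.0, 0.0},   // "cars" topic
            {0.0, 0.0, 0.5, 0.5}    // "movies" topic
        };

        for (int d = 0; d < docTopic.length; d++) {
            StringBuilder row = new StringBuilder("doc" + (d + 1) + ":");
            for (int w = 0; w < words.length; w++) {
                double v = 0;
                for (int t = 0; t < topicWord.length; t++)
                    v += docTopic[d][t] * topicWord[t][w];
                row.append(String.format(" %s=%.2f", words[w], v));
            }
            System.out.println(row);
        }
    }
}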

  9. Latent Semantic Indexing (LSI) by SVD • SVD of the document-word matrix: C (n x p) = U (n x n) x S (n x p, diagonal) x VT (p x p) • Rank reduction to k: keep the top k singular values, giving the rank-k approximation Ck = Uk x Sk x VkT (Uk: n x k, Sk: k x k, VkT: k x p) • Uk relates documents to topics (D-T); VkT relates topics to words (T-W)
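A minimal Java sketch of the rank-k step, assuming the Apache Commons Math library (commons-math3) is on the classpath; that dependency and the toy count matrix are my own choices, not something the slides require. Rows are documents and columns are words, matching the Document-Word matrix of the previous slide.

import org.apache.commons.math3.linear.*;

// Sketch of rank-k LSI: C = U S V^T, then keep only the first k components.
public class LsiDemo {
    public static void main(String[] args) {
        // toy 3-doc x 4-word count matrix (auto, vehicle, film, theater)
        double[][] counts = {
            {2, 1, 0, 0},   // doc1: mostly about cars
            {0, 0, 2, 2},   // doc2: mostly about movies
            {1, 1, 1, 1}    // doc3: a bit of everything
        };
        RealMatrix C = MatrixUtils.createRealMatrix(counts);

        SingularValueDecomposition svd = new SingularValueDecomposition(C);
        int k = 2;  // number of latent topics to keep

        // first k columns of U, top-left k x k block of S, first k rows of V^T
        RealMatrix Uk  = svd.getU().getSubMatrix(0, C.getRowDimension() - 1, 0, k - 1);
        RealMatrix Sk  = svd.getS().getSubMatrix(0, k - 1, 0, k - 1);
        RealMatrix VkT = svd.getVT().getSubMatrix(0, k - 1, 0, C.getColumnDimension() - 1);

        RealMatrix Ck = Uk.multiply(Sk).multiply(VkT);   // rank-k approximation
        System.out.println("C_k = " + Ck);
    }
}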

  10. Latent Semantic Indexing (LSI) by SVD • Query is viewed as a document → query matching is a process to find a similar document • Represent the query q as a 1 x p word vector; multiplying Ck (rank-k appr., n x p) by qT (p x 1) gives an n x 1 vector • Each value in the vector represents the similarity between q and di
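The query-matching step as a small, self-contained Java sketch; the rank-k matrix values below are made up (in practice they would come from the SVD in the previous sketch).

// Sketch: score each document by multiplying a rank-k document-word
// matrix C_k (n x p) with the query's word vector q (p x 1).
public class LsiQueryDemo {
    public static void main(String[] args) {
        // hypothetical rank-k matrix, 3 docs x 4 words (auto, vehicle, film, theater)
        double[][] Ck = {
            {1.8, 1.2, 0.1, 0.1},
            {0.1, 0.1, 1.9, 1.9},
            {1.0, 1.0, 1.0, 1.0}
        };
        // the query "vehicle" as a word vector
        double[] q = {0, 1, 0, 0};

        // scores = C_k * q^T : one similarity value per document
        for (int d = 0; d < Ck.length; d++) {
            double score = 0;
            for (int w = 0; w < q.length; w++) score += Ck[d][w] * q[w];
            System.out.printf("doc%d: %.2f%n", d + 1, score);
        }
    }
}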

  11. Example Topics - PLSI • We can group words with the Topic-Word matrix • [Figure: example topic-word groupings produced by PLSI]

  12. Lucene Example • Goal: build index for hotels to support keyword search • Each Hotel item has id, name, city, description • E.g. 1, Hotel Rivoli, Paris, If you like historical Paris … • 40 hotels • Requirements • Search over name, city, description or full text • In a search result page, you should show name, city and description • May need to be combined with an RDBMS for complex queries • E.g. modern hotel in New York with price < $100

  13. Lucene Example • We first need to create an IndexWriter
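The code from the original slide is not in the transcript. Below is a minimal sketch of creating an IndexWriter, assuming a reasonably recent Lucene release (constructor signatures differ across Lucene versions) and an index directory named hotel-index; both choices are mine.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch: open an on-disk index and create an IndexWriter for it.
public class HotelIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("hotel-index"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        IndexWriter writer = new IndexWriter(dir, config);

        // ... add one Lucene Document per hotel here (see the next slide) ...

        writer.close();
    }
}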

  14. Lucene Example • Which field to store? to index?
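Again the slide's code is missing, so the sketch below shows one plausible answer to the store/index question for the hotel fields: TextField values are tokenized and searchable, StringField values are indexed as a single exact token, and Field.Store.YES makes a value retrievable for the result page. The catch-all "content" field is my own addition for the full-text requirement, not something given on the slide.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// Sketch: build one Lucene Document per hotel, choosing per field
// whether it is stored, indexed, or both.
public class HotelDocumentFactory {
    static Document makeHotelDoc(String id, String name, String city, String description) {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));             // exact match, stored
        doc.add(new TextField("name", name, Field.Store.YES));           // searched and shown
        doc.add(new TextField("city", city, Field.Store.YES));           // searched and shown
        doc.add(new TextField("description", description, Field.Store.YES));
        // illustrative catch-all field for full-text search; not stored,
        // since the individual fields above are already retrievable
        doc.add(new TextField("content", name + " " + city + " " + description, Field.Store.NO));
        return doc;
    }
}

Each returned Document would be passed to writer.addDocument(...) inside the indexing loop of the previous sketch.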

  15. Lucene Example • Now we can perform search using the index
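A matching search sketch, again assuming a recent Lucene API; the "content" field is the illustrative catch-all from the previous sketch, and the query string is just an example.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

// Sketch: open the index, parse a keyword query, and print the stored
// name, city, and description for the top hits.
public class HotelSearcher {
    public static void main(String[] args) throws Exception {
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("hotel-index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        Query query = parser.parse("historical paris");

        TopDocs hits = searcher.search(query, 10);   // top 10 results
        for (ScoreDoc sd : hits.scoreDocs) {
            Document d = searcher.doc(sd.doc);       // load stored fields
            System.out.println(d.get("name") + ", " + d.get("city")
                    + ": " + d.get("description"));
        }
        reader.close();
    }
}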
