
CS511 Design of Database Management Systems


Presentation Transcript


  1. CS511 Design of Database Management Systems
     Lecture 13: Information Retrieval: Overview
     Kevin C. Chang

  2. Announcements
     • MT format:
       • Wednesday 2:00-3:15pm
       • open notes, papers, books; calculator OK (won't be needed); no PDAs
       • 75 points (for 75 minutes), 4 problems
         • Prob. 1: True/False problems
         • Prob. 2-4: longer problems
     • Preparation:
       • study lecture notes, HW, SGP; use them to review the papers
       • ask why we ask that...
       • discuss with peers
       • think more (beyond what is stated) and try to relate issues

  3. Some History
     • Early days:
       • 1945: V. Bush's article "As We May Think"
       • 1957: H. P. Luhn's idea of word counting and matching
     • Indexing & evaluation methodology (1960s)
       • SMART system (G. Salton's group)
       • Cranfield test collection (C. Cleverdon's group)
       • indexing: automatic can be as good as manual
     • IR models (1970s & 1980s) ...
     • Large-scale evaluation & applications (1990s)
       • TREC (D. Harman & E. Voorhees, NIST)
       • large-scale Web search

  4. ?? Text Search vs. Database Queries
     • Two related areas:
       • information retrieval (IR)
       • databases
       • traditionally separate; brought together by the Web
     • ?? Any differences in
       • data models?
       • query semantics?
       • desirable functionalities?

  5. Text vs. Rel. DB: Art vs. Algebra
     • Data models:
       • unstructured text vs. well-structured data
     • Query semantics:
       • fuzzy vs. well-defined
         • text search: to satisfy an "information need" <-- art
         • DB queries: to perform data computation <-- algebra
       • relevant vs. correct answers
       • ranked vs. Boolean answers
     • Functionalities:
       • read-mostly vs. read-write/transactions/concurrency control ...

  6. Recall: Measuring False Negatives
     • Recall = |x| / |relevant|, where x = relevant ∩ retrieved
     • e.g.: relevant = {D1, D2}, retrieved = {D1, D3, D4}
       • recall R = 1 / 2 = 0.5
       • there is 1 false negative: D2
     • ? How to fool recall?
     [Venn diagram: the relevant and retrieved sets inside the collection; x is their overlap]

  7. Precision: Measuring False Positives
     • Precision = |x| / |retrieved|, where x = relevant ∩ retrieved
     • e.g.: relevant = {D1, D2}, retrieved = {D1, D3, D4}
       • precision P = 1 / 3 = 0.33
       • there are 2 false positives: D3 and D4
     [Venn diagram: the relevant and retrieved sets inside the collection; x is their overlap]
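To make the two measures concrete, here is a minimal Python sketch that recomputes the slide's example; the set names and document IDs come straight from the example above.

```python
# Minimal sketch: recall and precision from the slide's example.
relevant = {"D1", "D2"}
retrieved = {"D1", "D3", "D4"}

hits = relevant & retrieved              # x = relevant ∩ retrieved
recall = len(hits) / len(relevant)       # 1 / 2 = 0.5
precision = len(hits) / len(retrieved)   # 1 / 3 ≈ 0.33

false_negatives = relevant - retrieved   # {"D2"}
false_positives = retrieved - relevant   # {"D3", "D4"}

print(recall, precision, false_negatives, false_positives)
```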

  8. Models
     • Boolean: criteria-based
     • Vector space: similarity-based
     • Probabilistic: probability-based

  9. Boolean Model
     • Queries:
       • Q1: data AND web
       • Q2: (knowledge OR information) AND base
       • Q3: data NOT info
     • Documents:
       • D1: "web data and web queries"
       • D2: "digital data index"
       • D3: "data base for dummies"

  10. Boolean Model
      • View: satisfaction by criteria
      • Query: a Boolean expression
        • Q1: data AND web
      • Document: a Boolean conjunction
        • D1: "web data and web queries" = web AND data AND queries
      • Query results:
        • {D | D implies Q}, i.e., all docs that satisfy Q
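A minimal sketch of this "satisfaction by criteria" view over the example documents from slide 9; the lambda predicates stand in for the Boolean expressions Q1-Q3 and are illustrative only.

```python
# Minimal sketch of Boolean retrieval: each document is viewed as the set of
# words it contains, and a query is a predicate over that set.
docs = {
    "D1": "web data and web queries",
    "D2": "digital data index",
    "D3": "data base for dummies",
}
doc_terms = {d: set(text.lower().split()) for d, text in docs.items()}

q1 = lambda t: "data" in t and "web" in t                          # data AND web
q2 = lambda t: ("knowledge" in t or "information" in t) and "base" in t
q3 = lambda t: "data" in t and "info" not in t                     # data NOT info

for name, q in [("Q1", q1), ("Q2", q2), ("Q3", q3)]:
    print(name, [d for d, terms in doc_terms.items() if q(terms)])
# Q1 -> ['D1']; Q2 -> []; Q3 -> ['D1', 'D2', 'D3']
```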

  11. Boolean Queries: Problems
      • Query matching is "exact" and not flexible
        • exact matching can result in too few or too many matches
      • Hard to formulate the right query
        • what is the query for "documents about color printer"?
      • Results are not ranked/ordered for exploration
        • Boolean is binary: yes or no
      • In short: "relevance" is not captured
        • traditional DB queries are similarly bad at "fuzzy" concepts
        • new research work in top-k queries

  12. Vector Space Model
      • View: similarity of content
      • Intuitions:
        • docs consist of words --> put docs in the word space
        • space: n dimensions for n words
        • similarity becomes geometric comparison
        • document-query similarity = vector-vector similarity
      [Figure: document vector D and query vector Q in the word space]
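A minimal sketch of the vector-space intuition using raw term counts and cosine similarity; term weighting (e.g., tf-idf) is deliberately left out here since it comes later, and the helper names are illustrative.

```python
# Minimal sketch: documents and the query become vectors of raw term counts,
# and similarity is the cosine of the angle between them.
import math
from collections import Counter

def to_vector(text):
    return Counter(text.lower().split())

def cosine(v1, v2):
    dot = sum(v1[t] * v2[t] for t in v1)               # missing terms count as 0
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

d = to_vector("web data and web queries")
q = to_vector("web data")
print(cosine(d, q))   # geometric closeness of D and Q in the word space
```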

  13. Probabilistic Models
      • View: probability of relevance
        • the "probabilistic ranking principle"
      • Estimate and rank by P(R | Q, D)
      • or by the log-odds: log [ P(R | Q, D) / P(not-R | Q, D) ]

  14. Probabilistic Models
      • To rank by the sum, over query terms t_i that appear in D, of
        log [ p_i (1 - q_i) / ( q_i (1 - p_i) ) ]
        • p_i = P(t_i present | relevant), q_i = P(t_i present | not relevant)
      • i.e., (see next page)
      • Assume p_i is the same for all query terms
      • Assume q_i = n_i / N
        • N is the collection size, n_i the number of docs containing t_i;
          i.e., treat "all" docs as irrelevant
      • Similar to using "IDF"
        • intuition: e.g., "apple computer" in a computer DB

  15. Probabilistic Models
      • To rank by: under the assumptions above (p_i constant, q_i = n_i / N), the
        score reduces to the sum, over query terms t_i that appear in D, of
        log [ (N - n_i) / n_i ]
        • an IDF-like weight: rare terms contribute a lot, very common terms little
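A minimal sketch of this reduced scoring formula; the collection size and document frequencies below are made-up numbers, and rsj_weight/score are illustrative names, not from the lecture.

```python
# Sketch of the IDF-like term weight under the slide's assumptions:
# q_i = n_i / N, with n_i the document frequency of term i.
import math

N = 10_000                                       # collection size (made up)
doc_freq = {"apple": 200, "computer": 9_000}     # hypothetical n_i values

def rsj_weight(term):
    n_i = doc_freq[term]
    return math.log((N - n_i) / n_i)   # rare terms score high; terms in most
                                       # docs get weight near zero or negative

def score(query_terms, doc_terms):
    return sum(rsj_weight(t) for t in query_terms if t in doc_terms)

# "apple" contributes far more than "computer" in a computer-heavy collection,
# matching the slide's "apple computer" intuition.
print(rsj_weight("apple"), rsj_weight("computer"))
```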

  16. System Architecture
      [Diagram: docs go through INDEXING to a Doc Rep; the User's query becomes a
      Query Rep; Ranking matches the two representations and returns results to the
      User; the User's judgments feed back to update the query.]

  17. Technique: Term Selection/Weighting
      • Basis for matching query with document
        • query and document should be represented using the same units/terms
      • Controlled vocabulary vs. full-text indexing

  18. What is a good indexing term?
      • Specific (phrases) or general (single word)?
        • Luhn found that words of middle frequency are the most useful
          • not too specific
          • not too general
      • All words or a (controlled) subset?
        • when term weighting is used, it becomes a matter of weighting rather than
          selecting indexing terms
        • more later
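A toy sketch of Luhn's middle-frequency idea over the three example documents from slide 9; the frequency band below is an arbitrary choice for illustration, not a recommendation from the lecture.

```python
# Sketch: keep terms whose collection frequency falls in a middle band;
# very frequent and very rare words are filtered out.
from collections import Counter

docs = [
    "web data and web queries",
    "digital data index",
    "data base for dummies",
]
freq = Counter(w for d in docs for w in d.lower().split())

low, high = 2, 2   # toy thresholds for this tiny collection
index_terms = {w for w, c in freq.items() if low <= c <= high}
print(index_terms)   # {'web'}: 'data' (in every doc) is too general,
                     # the one-off words are too specific
```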

  19. Technique: Stemming
      • Words with similar meanings should be mapped to the same indexing term
      • Stemming: mapping all inflectional forms of a word to the same root form, e.g.
        • computer -> compute
        • computation -> compute
        • computing -> compute
      • Porter's stemmer is popular for English
      • In general: clustering of "synonym" words
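A small usage sketch with NLTK's Porter stemmer (assuming the nltk package is installed); note that the root it actually emits is a truncated form such as "comput" rather than a dictionary word.

```python
# Sketch: stemming inflectional variants to a common root with NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computation", "computing", "computes"]:
    print(word, "->", stemmer.stem(word))   # all map to the same root, "comput"
```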

  20. Technique: Stopwords
      • A "common word" that bears little semantic content
        • prepositions: for, on, ...
        • articles: a, an, the
        • non-informative words (collection specific)
          • e.g., "database" in this class
          • e.g., "PC" in a computer collection
      • You can search the Web for stopword lists
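A minimal sketch of stopword filtering; the stopword set below is a tiny hand-picked illustration, including one collection-specific word as the slide suggests.

```python
# Sketch: drop stopwords before indexing.
stopwords = {"a", "an", "the", "for", "on", "and", "database"}  # illustrative list

def index_terms(text):
    return [w for w in text.lower().split() if w not in stopwords]

print(index_terms("a database for the web data"))   # ['web', 'data']
```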

  21. Technique: Relevance Feedback (or Query Modification)
      • Motivation: it is easier to judge results than to formulate the query right
      [Diagram: the User sends a Query to the Retrieval Engine over the Document
      collection; results come back ranked (d1 3.5, d2 2.4, ..., dk 0.5, ...); the
      User marks judgments (d1 +, d2 -, d3 +, ..., dk -, ...), and Feedback turns
      them into an Updated query.]
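The slide does not commit to a particular update formula; as one concrete example, here is a sketch of the widely used Rocchio modification, with illustrative alpha, beta, gamma weights and toy document vectors.

```python
# Sketch of Rocchio query modification:
# q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant)
from collections import Counter

def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    new_q = Counter()
    for t, w in query_vec.items():
        new_q[t] += alpha * w
    for d in relevant:                       # docs judged "+"
        for t, w in d.items():
            new_q[t] += beta * w / len(relevant)
    for d in nonrelevant:                    # docs judged "-"
        for t, w in d.items():
            new_q[t] -= gamma * w / len(nonrelevant)
    return {t: w for t, w in new_q.items() if w > 0}   # keep positive weights only

q = {"color": 1.0, "printer": 1.0}
rel = [{"color": 2, "printer": 1, "inkjet": 3}]         # judged +
nonrel = [{"color": 1, "monitor": 4}]                   # judged -
print(rocchio(q, rel, nonrel))   # the updated query now also emphasizes "inkjet"
```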

  22. Pseudo Feedback
      • Motivation: the top results are often relevant
      [Diagram: the same loop as relevance feedback, but with no user judging; the
      top 10 results are simply assumed relevant (d1 +, d2 +, d3 +, ..., dk -, ...)
      and fed back to produce the Updated query.]
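Pseudo (blind) feedback can reuse the same update: simply treat the top-k results as positive judgments. The sketch below reuses the hypothetical rocchio() function from the previous example; ranked_docs is an assumed ranked list of document term-weight dicts, best first.

```python
# Sketch: blind feedback assumes the top-k results are relevant.
def pseudo_feedback(query_vec, ranked_docs, k=10):
    assumed_relevant = ranked_docs[:k]      # treat top-k as "+" judgments
    return rocchio(query_vec, assumed_relevant, nonrelevant=[])
```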

  23. Technique: Inverted List
      • t_i --> <d1, ...>, ......, <dn, ...>
      • E.g.:
        • color --> <d1, ...>, <d2, ...>, <d5, ...>
        • printer --> <d2, ...>, <d5, ...>, <d8, ...>
      • How to evaluate Q: color AND printer?
      • How to evaluate Q: "color printer"?
        • what info to maintain in each entry?
      • More later...
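A minimal sketch of building an inverted index and answering the AND query by intersecting postings lists; the document IDs follow the slide's example, and the document texts are made up.

```python
# Sketch: inverted index mapping each term to a list of document IDs,
# with an AND query answered by intersecting two postings lists.
docs = {
    "d1": "color laser",
    "d2": "color printer sale",
    "d5": "color inkjet printer",
    "d8": "printer paper",
}

index = {}
for doc_id, text in docs.items():
    for term in set(text.lower().split()):
        index.setdefault(term, []).append(doc_id)

def and_query(t1, t2):
    return sorted(set(index.get(t1, [])) & set(index.get(t2, [])))

print(and_query("color", "printer"))   # ['d2', 'd5']
# A phrase query like "color printer" would additionally need word positions
# stored in each posting entry, as the slide hints.
```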

  24. DB Meets IR
      • Multimedia databases
        • relational data + text, images, audio, video ...
      • Fuzzy retrieval for relational data
        • similarity, preference-based queries
        • e.g., product search in e-commerce
      • XML represents text-based data
        • IR-type search will be helpful
        • how can we extend it to retrieve XML documents?

  25. ?? Web Search
      • Text IR as the natural starting point:
        • the Web as a collection of HTML documents
        • find pages that satisfy an information need
        • Web search as the killer app of IR!
      • Web search vs. traditional document search
        • ?? how are they related?
        • ?? any differences or new issues?
        • ?? why do search engines give lousy results?

  26. Web Search: New Issues and Challenges
      • Highly topic-heterogeneous documents
        • the notion of a "collection" is lost
        • stopwords and the idf scheme for term selection/weighting are challenged
      • Structured/semi-structured documents
      • Highly linked pages: the collection is no longer flat
        • how to use links cleverly: link analysis (more in TW2)
        • ideas from social networks for "standing" or "importance"
      • Extremely large scale: billions of docs and counting
      • Many documents/data hidden behind databases
      • Multilingual documents
      • Spamming

  27. What's Next
      • Vector space model

  28. End of Talk
