
Information Retrieval


Presentation Transcript


  1. Information Retrieval CSE 8337 (Part I) Spring 2011 Some Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto http://www.sims.berkeley.edu/~hearst/irbook/ Data Mining Introductory and Advanced Topics by Margaret H. Dunham http://www.engr.smu.edu/~mhd/book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze http://informationretrieval.org

  2. CSE 8337 Outline • Introduction • Text Processing • Indexes • Boolean Queries • Web Searching/Crawling • Vector Space Model • Matching • Evaluation • Feedback/Expansion

  3. Information Retrieval • Information Retrieval (IR): retrieving desired information from textual data. • Library Science • Digital Libraries • Web Search Engines • Traditionally keyword based • Sample query: Find all documents about “data mining”.

  4. Motivation • IR: representation, storage, organization of, and access to information items • Focus is on the user information need • User information need (example): • Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament. • Emphasis is on the retrieval of information (not data)

  5. DB vs IR • Records (tuples) vs. documents • Well defined results vs. fuzzy results • DB grew out of files and traditional business systems • IR grew out of library science and need to categorize/group/access books/articles

  6. Unstructured data • Typically refers to free text • Allows • Keyword queries including operators • More sophisticated “concept” queries e.g., • find all web pages dealing with drug abuse • Classic model for searching text documents

  7. Semi-structured data • In fact almost no data is “unstructured” • E.g., this slide has distinctly identified zones such as the Title and Bullets • Facilitates “semi-structured” search such as • Title contains data AND Bullets contain search … to say nothing of linguistic structure

  8. DB vs IR (cont’d) • Data retrieval • which docs contain a set of keywords? • Well defined semantics • a single erroneous object implies failure! • Information retrieval • information about a subject or topic • semantics is frequently loose • small errors are tolerated • IR system: • interpret contents of information items • generate a ranking which reflects relevance • notion of relevance is most important

  9. Motivation • IR software issues: • classification and categorization • systems and languages • user interfaces and visualization • Still, area was seen as of narrow interest • Advent of the Web changed this perception once and for all • universal repository of knowledge • free (low cost) universal access • no central editorial board • many problems though: IR seen as key to finding the solutions!

  10. Basic Concepts • The User Task • Retrieval • information or data • purposeful • Browsing • glancing around • Feedback (figure: the user interacts with the database through retrieval and browsing, with feedback on the response)

  11. Basic Concepts: Logical view of the documents (figure: a document's full text is reduced to index terms through accent and spacing normalization, stopword removal, noun-group detection, stemming, and manual indexing, guided by the document structure)

  12. The Retrieval Process (figure: the user need is expressed as a query via text and query operations; the DB Manager Module's indexing builds an inverted file over the text database / WWW; searching uses the index to retrieve documents, ranking orders them, and user feedback refines the query)

  13. Basic assumptions of Information Retrieval • Collection: Fixed set of documents • Goal: Retrieve documents with information that is relevant to user’s information need and helps him complete a task

  14. Fuzzy Sets and Logic • Fuzzy Set: Set membership function is a real valued function with output in the range [0,1]. • f(x): Probability x is in F. • 1-f(x): Probability x is not in F. • EX: • T = {x | x is a person and x is tall} • Let f(x) be the probability that x is tall • Here f is the membership function
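
A minimal sketch of a fuzzy membership function for the set T of tall people. The height thresholds (72 and 78 inches) are illustrative assumptions, not from the slides:

    def tall_membership(height_inches):
        """Membership degree in [0, 1] for 'x is tall' (illustrative thresholds)."""
        if height_inches <= 72:        # clearly not tall
            return 0.0
        if height_inches >= 78:        # clearly tall
            return 1.0
        # linear ramp between the two thresholds
        return (height_inches - 72) / (78 - 72)

    print(tall_membership(75))   # 0.5, so 1 - f(x) = 0.5 as well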

  15. Fuzzy Sets

  16. IR is Fuzzy (figure: with a simple set, a document is either Relevant or Not Relevant; with a fuzzy set, relevance is a matter of degree)

  17. Information Retrieval Metrics • Similarity: measure of how close a query is to a document. • Documents which are “close enough” are retrieved. • Metrics: • Precision = |Relevant and Retrieved| / |Retrieved| • Recall = |Relevant and Retrieved| / |Relevant|
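
These two formulas translate directly into set operations; a small sketch using Python sets of document IDs (the example sets are made up for illustration):

    def precision_recall(retrieved, relevant):
        """Precision and recall from sets of document IDs."""
        hits = retrieved & relevant                       # Relevant and Retrieved
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # 4 docs retrieved, 3 relevant, 2 in the overlap
    print(precision_recall({1, 2, 3, 4}, {2, 4, 5}))      # (0.5, 0.666...)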

  18. IR Query Result Measures (figure: the retrieved and relevant document sets overlap within the collection; the overlap determines precision and recall)

  19. CSE 8337 Outline • Introduction • Text Processing (Background) • Indexes • Boolean Queries • Web Searching/Crawling • Vector Space Model • Matching • Evaluation • Feedback/Expansion

  20. Text Processing TOC • Simple Text Storage • String Matching • Approximate (Fuzzy) Matching (Spell Checker) • Parsing • Tokenization • Stemming/ngrams • Stop words • Synonyms

  21. Text storage • EBCDIC/ASCII • Array of characters • Linked list of characters • Trees: B-tree, trie • Stuart E. Madnick, “String Processing Techniques,” Communications of the ACM, Vol. 10, No. 7, July 1967, pp. 420-424.

  22. Pattern Matching(Recognition) • Pattern Matching: finds occurrences of a predefined pattern in the data. • Applications include speech recognition, information retrieval, time series analysis.

  23. Similarity Measures • Determine similarity between two objects. • Similarity characteristics: • Alternatively, distance measures measure how unlike or dissimilar objects are.

  24. String Matching Problem • Input: • Pattern – length m • Text string – length n • Find one (next, all) occurrences of the pattern in the string • Ex: • String: 00110011011110010100100111 • Pattern: 011010

  25. String Matching Algorithms • Brute Force • Knuth-Morris-Pratt • Boyer-Moore

  26. Brute Force String Matching (figure: successive alignments of the pattern 011010 against the string 00110011011110010100100111) • Brute Force • Handbook of Algorithms and Data Structures http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/711a.srch.c.html • Space O(m+n) • Time O(mn)
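
The linked C routine is the reference implementation; the following is only a rough Python equivalent of the brute-force idea (try every alignment, compare left to right):

    def brute_force_search(text, pattern):
        """Index of the first occurrence of pattern in text, or -1 (O(mn) time)."""
        n, m = len(text), len(pattern)
        for i in range(n - m + 1):                 # every possible alignment
            j = 0
            while j < m and text[i + j] == pattern[j]:
                j += 1
            if j == m:                             # all m characters matched
                return i
        return -1

    print(brute_force_search("abracadabra", "cad"))   # 4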

  27. FSR

  28. Creating FSR • Create FSM: • Construct the “correct” spine. • Add a default “failure bus” to state 0. • Add a default “initial bus” to state 1. • For each state, decide its attachments to failure bus, initial bus, or other failure links.

  29. Knuth-Morris-Pratt • Apply FSM to string by processing characters one at a time. • Accepting state is reached when pattern is found. • Space O(m+n) • Time O(m+n) • Handbook of Algorithms and Data Structures http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/712.srch.c.html
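
Again, the linked Handbook code is the reference; below is only a rough Python sketch of the same idea, where the precomputed failure table plays the role of the FSM's failure links:

    def kmp_search(text, pattern):
        """Knuth-Morris-Pratt search in O(m + n) time and space."""
        m = len(pattern)
        if m == 0:
            return 0
        # fail[j]: length of the longest proper prefix of pattern[:j+1]
        # that is also a suffix of it (the failure links)
        fail = [0] * m
        k = 0
        for j in range(1, m):
            while k > 0 and pattern[j] != pattern[k]:
                k = fail[k - 1]
            if pattern[j] == pattern[k]:
                k += 1
            fail[j] = k
        # scan the text one character at a time, never moving backwards
        k = 0
        for i, ch in enumerate(text):
            while k > 0 and ch != pattern[k]:
                k = fail[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == m:                             # accepting state reached
                return i - m + 1
        return -1

    print(kmp_search("abracadabra", "cad"))        # 4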

  30. Boyer-Moore • Scan pattern from right to left • Skip many positions when a mismatched (illegal) character is seen • Worst case O(mn) • Expected time better than KMP • Expected behavior is sublinear in practice (many text characters are never examined) • Handbook of Algorithms and Data Structures http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713.preproc.c.html
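
Full Boyer-Moore combines a bad-character and a good-suffix heuristic; the sketch below is the simpler Horspool variant (an assumption on my part, not the slide's exact algorithm), which keeps only the bad-character skip but shows why the expected behavior is good:

    def horspool_search(text, pattern):
        """Simplified Boyer-Moore (Horspool variant): compare right to left,
        skip ahead using a bad-character shift table."""
        n, m = len(text), len(pattern)
        if m == 0:
            return 0
        # rightmost occurrence of each pattern char (except the last) sets its shift
        shift = {pattern[j]: m - 1 - j for j in range(m - 1)}
        i = 0
        while i <= n - m:
            j = m - 1
            while j >= 0 and text[i + j] == pattern[j]:   # scan pattern right to left
                j -= 1
            if j < 0:
                return i                                   # full match at position i
            i += shift.get(text[i + m - 1], m)             # bad-character skip
        return -1

    print(horspool_search("abracadabra", "dab"))           # 6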

  31. Approximate String Matching • Find patterns “close to” the string • Fuzzy matching • Applications: • Spelling checkers • IR • Define similarity (distance) between string and pattern

  32. String-to-String Correction • Levenshtein Distance • http://www.mendeley.com/research/binary-codes-capable-of-correcting-insertions-and-reversals/ • Measure of similarity between strings • Can be used to determine how to convert from one string to another • Cost to convert one to the other • Transformations • Match: Current characters in both strings are the same • Delete: Delete the current character from the input string • Insert: Insert the current character of the target string into the input string
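
A small dynamic-programming sketch of Levenshtein distance with unit costs for the three transformations (a match costs nothing):

    def levenshtein(source, target):
        """Minimum number of insert/delete/substitute operations to turn source into target."""
        m, n = len(source), len(target)
        prev = list(range(n + 1))                  # distance from "" to each target prefix
        for i in range(1, m + 1):
            curr = [i] + [0] * n
            for j in range(1, n + 1):
                same = source[i - 1] == target[j - 1]
                curr[j] = min(prev[j] + 1,                       # delete from source
                              curr[j - 1] + 1,                   # insert into source
                              prev[j - 1] + (0 if same else 1))  # match or substitute
            prev = curr
        return prev[n]

    print(levenshtein("kitten", "sitting"))        # 3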

  33. Distance Between Strings

  34. Spell Checkers • Check or Replace or Expand or Suggest • Phonetic • Use phonetic spelling for word • Truespel www.foreignword.com/cgi-bin//transpel.cgi • Phoneme – smallest unit of sound • Jaro-Winkler • distance measure • http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance • Autocomplete • www.amazon.com

  35. Tokenization • Find individual words (tokens) in text string. • Look for spaces, commas, etc. • http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
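
A minimal tokenizer sketch (real tokenizers, as the linked Stanford chapter discusses, handle hyphens, apostrophes, numbers, and language-specific issues far more carefully):

    import re

    def tokenize(text):
        """Lowercase the text and split it into simple word tokens."""
        return re.findall(r"[a-z0-9']+", text.lower())

    print(tokenize("Friends, Romans, countrymen."))
    # ['friends', 'romans', 'countrymen']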

  36. Stemming/ngrams • Reduce a token/word to a base form shared by words with similar derivations • Remove suffixes (s, ed, ing, …) • Remove prefixes (pre, re, un, …) • ngram – subsequences of length n
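
A naive suffix-stripping stemmer and a character n-gram generator, as a sketch only (a real stemmer such as Porter's applies many ordered rules; the suffix list here is illustrative):

    def naive_stem(token, suffixes=("ing", "ed", "s")):
        """Strip one common suffix if the remaining stem is long enough."""
        for suf in suffixes:
            if token.endswith(suf) and len(token) - len(suf) >= 3:
                return token[: -len(suf)]
        return token

    def char_ngrams(token, n=3):
        """All character subsequences of length n in the token."""
        return [token[i:i + n] for i in range(len(token) - n + 1)]

    print(naive_stem("searching"))     # 'search'
    print(char_ngrams("search"))       # ['sea', 'ear', 'arc', 'rch']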

  37. Stopwords • Common words • “Bad” words • Implementation: • Text file

  38. Synonyms • Exact/similar meaning • Hierarchy • One way • Bidirectional • Expand Query • Replace terms • Implementation: • Synonym File or dictionary
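
A tiny sketch of one-way query expansion from a synonym file loaded into a dictionary (the synonym table below is made up for illustration):

    SYNONYMS = {"car": ["automobile", "auto"], "fast": ["quick", "rapid"]}

    def expand_query(terms):
        """Expand each query term with its (one-way) synonyms."""
        expanded = []
        for term in terms:
            expanded.append(term)
            expanded.extend(SYNONYMS.get(term, []))   # add listed synonyms, if any
        return expanded

    print(expand_query(["fast", "car"]))
    # ['fast', 'quick', 'rapid', 'car', 'automobile', 'auto']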

  39. CSE 8337 Outline • Introduction • Text Processing • Indexes • Boolean Queries • Web Searching/Crawling • Vector Space Model • Matching • Evaluation • Feedback/Expansion

  40. Index • Common access is by keyword • Fast access by keyword • Index organizations? • Hash • B-tree • Linked List • Process document and query to identify keywords

  41. Term-document incidence (figure: matrix of terms × Shakespeare plays; entry is 1 if the play contains the word, 0 otherwise) • Sample query: Brutus AND Caesar but NOT Calpurnia

  42. Incidence vectors • So we have a 0/1 vector for each term. • To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND. • 110100 AND 110111 AND 101111 = 100100. • http://www.rhymezone.com/shakespeare/
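
The bitwise AND on the slide can be reproduced with Python integers used as bit vectors (six plays, leftmost bit = first play); Calpurnia's vector 010000 is inferred here as the complement of the 101111 shown on the slide:

    N_DOCS = 6
    brutus    = 0b110100
    caesar    = 0b110111
    calpurnia = 0b010000

    mask = (1 << N_DOCS) - 1                    # keep only 6 bits after complementing
    answer = brutus & caesar & (~calpurnia & mask)
    print(format(answer, "06b"))                # 100100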

  43. Inverted index • For each term T, we must store a list of all documents that contain T. • Do we use an array or a list for this? (figure: Brutus → 2 4 8 16 32 64 128; Calpurnia → 1 2 3 5 8 13 21 34; Caesar → 13 16) • What happens if the word Caesar is added to document 14?

  44. Inverted index (figure: a dictionary of the terms Brutus, Calpurnia, Caesar, each pointing to a postings list: Brutus → 2 4 8 16 32 64 128; Calpurnia → 1 2 3 5 8 13 21 34; Caesar → 13 16; postings are sorted by docID, more later on why) • Linked lists generally preferred to arrays • Dynamic space allocation • Insertion of terms into documents easy • Space overhead of pointers
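
Keeping postings sorted by docID is what makes an AND query a linear merge of two lists; a sketch using the postings from the figure:

    def intersect(p1, p2):
        """Merge-intersect two postings lists that are sorted by docID."""
        i = j = 0
        answer = []
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1                              # advance the smaller docID
            else:
                j += 1
        return answer

    postings = {
        "Brutus":    [2, 4, 8, 16, 32, 64, 128],
        "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
        "Caesar":    [13, 16],
    }
    print(intersect(postings["Brutus"], postings["Caesar"]))   # [16]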

  45. Inverted index construction (figure: documents to be indexed, e.g. “Friends, Romans, countrymen.” → Tokenizer → token stream: Friends Romans Countrymen → Linguistic modules → modified tokens: friend roman countryman → Indexer → inverted index, e.g. friend → 2 4; roman → 1 2; countryman → 13 16)

  46. Indexer steps • Sequence of (Modified token, Document ID) pairs. • Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. • Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

  47. Sort by terms. Core indexing step.

  48. Multiple term entries in a single document are merged. • Frequency information is added. Why frequency? Will discuss later.

  49. The result is split into a Dictionary file and a Postings file.
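
Putting slides 46-49 together, a rough sketch of the indexer: emit (term, docID) pairs, sort by term, merge duplicate entries within a document, record document frequency in the dictionary, and keep the docID lists as postings (the tokenization and the choice of document frequency are simplifying assumptions here):

    import re
    from collections import defaultdict

    def build_index(docs):
        """docs: {docID: text}.  Returns (dictionary, postings)."""
        pairs = []
        for doc_id, text in docs.items():
            for term in re.findall(r"[a-z']+", text.lower()):
                pairs.append((term, doc_id))       # (modified token, docID) pairs
        pairs.sort()                               # core indexing step: sort by term
        postings = defaultdict(list)
        for term, doc_id in pairs:
            if not postings[term] or postings[term][-1] != doc_id:
                postings[term].append(doc_id)      # merge multiple entries per document
        dictionary = {t: len(p) for t, p in postings.items()}   # term -> doc frequency
        return dictionary, dict(postings)

    docs = {1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
            2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious"}
    dictionary, postings = build_index(docs)
    print(postings["caesar"], dictionary["caesar"])    # [1, 2] 2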

  50. Where do we pay in storage? (figure: storage is paid for the terms in the dictionary and the pointers/docIDs in the postings lists)
