
Review for midterm

This review discusses the process of information retrieval, including the sources of information, the information acquisition process, and the assumptions and capabilities of an automated information retrieval system. It also explores the measures of performance for an IR system and the algorithms used in IR software. Additionally, the review covers the hierarchy of data, information, and knowledge, the growth of information resources, and the use of Big-O notation to evaluate algorithms. Lastly, it addresses the importance of queries and the matching criteria used in retrieving documents.

hayesa

Presentation Transcript


  1. Review for midterm

  2. What is information retrieval • Gathering information from a source(s) based on a need • Major assumption - that information exists. • Broad definition of information • Sources of information • Other people • Archived information (libraries, maps, etc.) • Web • Radio, TV, etc.

  3. Information retrieved • Impermanent information • Conversation • Documents • Text • Video • Files • Etc.

  4. The information acquisition process • Know what you want and go get it • Ask questions to information sources as needed (queries) - SEARCH • Have information sent to you on a regular basis based on some predetermined information need • Push/pull models

  5. What IR assumes • Information is stored (or available) • A user has an information need • An automated system exists from which information can be retrieved • Why an automated system? • The system works!!

  6. What IR is usually not about • Usually just unstructured data • Retrieval from databases is usually not considered • Database querying assumes that the data is in a standardized format • Transforming all information, news articles, web sites into a database format is difficult for large data collections

  7. What an IR system should do • Store/archive information • Provide access to that information • Answer queries with relevant information • Stay current • WISH list • Understand the user’s queries • Understand the user’s need • Acts as an assistant

  8. How good is the IR system Measures of performance based on what the system returns: • Relevance • Coverage • Recency • Functionality (e.g. query syntax) • Speed • Availability • Usability • Time/ability to satisfy user requests

  9. How do IR systems work Algorithms implemented in software • Gathering methods • Storage methods • Indexing • Retrieval • Interaction

  10. A Typical Web Search Engine [Figure: block diagram with the components Web, Crawler, Indexer, Index, Query Engine, Interface, and Users]

  11. Crawlers • Web crawlers (spiders) gather information (files, URLs, etc) from the web. • Primitive IR systems

  12. Information Seeking Behavior • Two parts of the process: • search and retrieval • analysis and synthesis of search results

  13. What is knowledge? • Data - Facts, observations, or perceptions. • Information - Subset of data, only including those data that possess context, relevance, and purpose. • Knowledge - A more simplistic view considers knowledge as being at the highest level in a hierarchy with data (at the lowest level) and information (at the middle level). • Data refers to bare facts void of context. • A telephone number. • Information is data in context. • A phone book. • Knowledge is information that facilitates action. • Recognizing that a phone number belongs to a good client, who needs to be called once per week to get his orders.

  14. From Facts to Wisdom (Haeckel & Nolan, 1993): one example of the hierarchy

  15. Size of information resources • Why important? • Scaling • Time • Space • Which is more important?

  16. Trying to fill a terabyte in a year Moore’s Law and its impact!

  17. Measuring the Growth of Work While it is possible to measure the work done by an algorithm for a given set of input, we need a way to: • Measure the rate of growth of an algorithm based upon the size of the input • Compare algorithms to determine which is better for the situation

  18. Time vs. Space Very often, we can trade space for time. For example: maintain a collection of students with SSN information. • Use an array of a billion elements, indexed by SSN, and have immediate access (better time) • Use an array sized to the number of students and have to search (better space)
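The trade-off on this slide can be sketched in Python. This is only an illustration: the IDs, names, and the tiny key range of 1000 are made up so the example stays small, whereas a real SSN table would need a billion slots.

```python
# Hypothetical data: small integer ids stand in for SSNs.
KEY_RANGE = 1000
students = [(17, "Ada"), (402, "Grace"), (999, "Alan")]

# Option 1: one slot per possible key. Uses more space, lookup is O(1).
direct = [None] * KEY_RANGE
for sid, name in students:
    direct[sid] = name

def lookup_direct(sid):
    """Immediate access: the key is the array index."""
    return direct[sid]

# Option 2: one slot per student. Compact, but lookup scans the list: O(n).
def lookup_scan(sid):
    """Linear search through the compact list of students."""
    for s, name in students:
        if s == sid:
            return name
    return None

print(lookup_direct(402), lookup_scan(402))
```

Both lookups return the same answer; they differ only in how much memory is reserved and how much work each query costs.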

  19. Introducing Big O Notation • Will allow us to evaluate algorithms. • Has precise mathematical definition • Used in a sense to put algorithms into families

  20. Why Use Big-O Notation • Used when we only know the asymptotic upper bound. • What does asymptotic mean? • What does upper bound mean? • If you are not guaranteed certain input, then it is a valid upper bound that even the worst-case input will be below. • Why worst-case? • May often be determined by inspection of an algorithm.

  21. Simplifying O( ) Answers We say the Big O complexity of $3n^2 + 2$ is $O(n^2)$ (drop constants!) because we can show that there is an $n_0$ and a $c$ such that: $0 \le 3n^2 + 2 \le c\,n^2$ for $n \ge n_0$, i.e. $c = 4$ and $n_0 = 2$ yields: $0 \le 3n^2 + 2 \le 4n^2$ for $n \ge 2$. What does this mean?
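One way to read the claim: 3n^2 + 2 is at most 4n^2 exactly when n^2 is at least 2, which holds for every integer n of 2 or more. The Python sketch below spot-checks the bound numerically. The helper name `bound_holds` and the cutoff of 10,000 are assumptions for illustration, and a finite check like this supports the algebra but does not replace it.

```python
def f(n):
    """The function from the slide: 3n^2 + 2."""
    return 3 * n**2 + 2

def bound_holds(c, n0, upto=10_000):
    """Check 0 <= f(n) <= c * n^2 for every integer n in [n0, upto]."""
    return all(0 <= f(n) <= c * n**2 for n in range(n0, upto + 1))

print(bound_holds(4, 2))  # the slide's choice of c and n0
print(bound_holds(4, 1))  # fails at n = 1, since 5 > 4
print(bound_holds(3, 2))  # c = 3 is too small for any n
```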

  22. Comparing Algorithms • Now that we know the formal definition of O( ) notation (and what it means)… • If we can determine the O( ) of algorithms… • This establishes the worst they perform. • Thus now we can compare them and see which has the “better” performance.

  23. Comparing Factors [Figure: work done versus size of input, with curves for 1, log N, N, and N^2]

  24. Why the interest in Queries? • Queries are ways we interact with IR systems • Nonquery methods? • Types of queries?

  25. Issues with Query Structures Matching Criteria • Given a query, what document is retrieved? • In what order?

  26. Types of Query Structures Query Models (languages) – most common • Boolean Queries • Extended-Boolean Queries • Natural Language Queries • Vector queries • Others?

  27. Simple query language: Boolean • Earliest query model • Terms + Connectors (or operators) • terms • words • normalized (stemmed) words • phrases • thesaurus terms • connectors • AND • OR • NOT
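A minimal sketch of how the three connectors can be evaluated, assuming a toy inverted index that maps each term to the set of documents containing it. The three documents are made up, and terms are plain lowercase words with no stemming, phrases, or thesaurus lookup.

```python
# Hypothetical mini-collection.
docs = {
    1: "dinner and a symphony",
    2: "sports bar dinner",
    3: "symphony tickets",
}

# Build the inverted index: term -> set of document ids.
index = {}
for doc_id, text in docs.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(doc_id)

all_ids = set(docs)

def term(t):
    """Documents containing term t (empty set if the term is unknown)."""
    return index.get(t, set())

dinner_and_symphony = term("dinner") & term("symphony")  # AND = intersection
dinner_or_sports = term("dinner") | term("sports")       # OR  = union
not_dinner = all_ids - term("dinner")                    # NOT = complement

print(dinner_and_symphony, dinner_or_sports, not_dinner)
```

Representing each term as a set makes the connectors one-line set operations, which is part of why the Boolean model is relatively easy to implement.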

  28. Simple query language: Boolean • Geek-speak • Variations are still used in search engines!

  29. Truth Tables – Boolean Logic Presence of P, P = 1 Absence of P, P = 0 True = 1 False = 0

  30. Problems with Boolean Queries • Incorrect interpretation of Boolean connectives AND and OR • Example - Seeking Saturday entertainment Queries: • Dinner AND sports AND symphony • Dinner OR sports OR symphony • Dinner AND sports OR symphony

  31. Order of precedence of operators Example of query. Is • A AND B • the same as • B AND A • Why?

  32. Order of Precedence • Define order of precedence • EX: a OR b AND c • Infix notation • Parentheses are evaluated first, with left-to-right precedence of operators • Next NOT’s are applied • Then AND’s • Then OR’s • a OR b AND c becomes • a OR (b AND c)
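Python's own `not`, `and`, and `or` follow the same NOT, then AND, then OR order, so the grouping on this slide can be checked by brute force over all eight truth assignments. This is an illustrative sketch, not how any particular IR system parses queries.

```python
from itertools import product

assignments = list(product([False, True], repeat=3))

# a OR b AND c groups as a OR (b AND c) for every assignment.
same_as_grouped_and = all(
    (a or b and c) == (a or (b and c)) for a, b, c in assignments
)

# It is NOT the same as (a OR b) AND c: collect the assignments that differ.
differs = [
    (a, b, c) for a, b, c in assignments
    if (a or b and c) != ((a or b) and c)
]

print(same_as_grouped_and)
print(differs)
```

The two groupings disagree whenever a is true and c is false, which is why a query such as Dinner AND sports OR symphony needs an agreed order or explicit parentheses.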

  33. Infix Notation • Usually expressed as INFIX operators in IR • ((a AND b) OR (c AND b)) • NOT is UNARY PREFIX operator • ((a AND b) OR (c AND (NOT b))) • AND and OR can be n-ary operators • (a AND b AND c AND d) • Some rules - (De Morgan revisited) • NOT(a) AND NOT(b) = NOT(a OR b) • NOT(a) OR NOT(b)= NOT(a AND b) • NOT(NOT(a)) = a
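The three rules at the end of this slide can be verified exhaustively, since each involves at most two Boolean variables. The helper `holds` is a hypothetical name introduced only for this sketch.

```python
from itertools import product

def holds(law, n):
    """True if law(...) is true for every assignment of n Boolean variables."""
    return all(law(*values) for values in product([False, True], repeat=n))

# NOT(a) AND NOT(b) = NOT(a OR b)
de_morgan_and = holds(lambda a, b: ((not a) and (not b)) == (not (a or b)), 2)

# NOT(a) OR NOT(b) = NOT(a AND b)
de_morgan_or = holds(lambda a, b: ((not a) or (not b)) == (not (a and b)), 2)

# NOT(NOT(a)) = a
double_negation = holds(lambda a: (not (not a)) == a, 1)

print(de_morgan_and, de_morgan_or, double_negation)
```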

  34. Pseudo-Boolean Queries • A new notation, from web search • +cat dog +collar leash • Does not mean the same thing! • Need a way to group combinations. • Phrases: • “stray cat” AND “frayed collar” • +“stray cat” +“frayed collar”

  35. Ordering (ranking) of Retrieved Documents • Pure Boolean has no ordering • Term is there or it’s not • In practice: • order chronologically • order by total number of “hits” on query terms • What if one term has more hits than others? • Is it better to have one of each term or many of one term?
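The last two questions can be made concrete with a small Python sketch that ranks the same documents two ways: by total hits, and by how many distinct query terms appear. The documents, the query, and both scoring functions are assumptions chosen to show the two orderings disagreeing.

```python
# Hypothetical documents and query.
docs = {
    "d1": "cat cat cat cat",   # many hits on one term
    "d2": "cat collar",        # one hit on each term
    "d3": "dog leash",         # no hits
}
query = ["cat", "collar"]

def total_hits(text, terms):
    """Total number of occurrences of any query term."""
    words = text.split()
    return sum(words.count(t) for t in terms)

def distinct_terms(text, terms):
    """Number of different query terms that occur at least once."""
    words = set(text.split())
    return sum(1 for t in terms if t in words)

by_hits = sorted(docs, key=lambda d: total_hits(docs[d], query), reverse=True)
by_coverage = sorted(docs, key=lambda d: distinct_terms(docs[d], query), reverse=True)

print(by_hits)
print(by_coverage)
```

Ranking by raw hits favors d1, while ranking by coverage favors d2, so the choice of ordering rule changes what the user sees first.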

  36. Boolean Query - Summary • Advantages • simple queries are easy to understand • relatively easy to implement • Disadvantages • difficult to specify what is wanted • too much returned, or too little • ordering not well determined • Dominant language in commercial systems until the WWW

  37. Vector Space Model • Documents and queries are represented as vectors in term space • Terms are usually stems • Documents represented by binary vectors of terms • Queries represented the same as documents • Query and Document weights are based on length and direction of their vector • A vector distance measure between the query and documents is used to rank retrieved documents

  38. Document Vectors • Documents are represented as “bags of words” • Represented as vectors when used computationally • A vector is like an array of floating point values • Has direction and magnitude • Each vector holds a place for every term in the collection • Therefore, most vectors are sparse
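A short sketch of why sparsity matters, assuming a made-up six-term vocabulary and raw term counts as the vector entries (the slides do not fix a weighting here). The dense form reserves a slot for every term; the sparse form stores only the nonzero slots.

```python
from collections import Counter

# Hypothetical vocabulary; a real collection may have 100,000 terms or more.
vocabulary = ["cat", "collar", "dog", "house", "leash", "white"]
term_index = {t: i for i, t in enumerate(vocabulary)}

def dense_vector(text):
    """One float per vocabulary term, mostly zeros."""
    counts = Counter(text.lower().split())
    return [float(counts[t]) for t in vocabulary]

def sparse_vector(text):
    """Only the nonzero entries, as {term position: count}."""
    counts = Counter(w for w in text.lower().split() if w in term_index)
    return {term_index[w]: float(c) for w, c in counts.items()}

doc = "white dog white house"
print(dense_vector(doc))
print(sparse_vector(doc))
```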

  39. Queries Vocabulary (dog, house, white) Queries: • dog (1,0,0) • house (0,1,0) • white (0,0,1) • house and dog (1,1,0) • dog and house (1,1,0) • Show 3-D space plot
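The query vectors on this slide can be reproduced with a few lines of Python. The function name `to_vector` is hypothetical, and words outside the vocabulary (such as "and") are simply ignored, which is an assumption of this sketch.

```python
vocabulary = ("dog", "house", "white")

def to_vector(query):
    """Binary vector: 1 if the vocabulary term occurs in the query, else 0."""
    words = set(query.lower().split())
    return tuple(1 if term in words else 0 for term in vocabulary)

for q in ["dog", "house", "white", "house and dog", "dog and house"]:
    print(q, to_vector(q))
```

Word order is lost in this representation, which is why "house and dog" and "dog and house" map to the same vector.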

  40. Documents (queries) in Vector Space [Figure: documents D1 to D11 plotted as points in a 3-D term space with axes t1, t2, and t3]

  41. Vector Query Problems • Significance of queries • Can different values be placed on the different terms, e.g. 2dog 1house • Scaling – size of vectors • Number of words in the dictionary? • 100,000

  42. Proximity Searches • Proximity: terms occur within K positions of one another • pen w/5 paper • A “Near” function can be more vague • near(pen, paper) • Sometimes order can be specified • Also, Phrases and Collocations • “United Nations” “Bill Clinton” • Phrase Variants • “retrieval of information” “information retrieval”
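A minimal sketch of a proximity test, assuming that "pen w/5 paper" means the two words occur within 5 word positions of each other and that tokens are whitespace-separated lowercase words. The function name and the `ordered` flag are hypothetical.

```python
def within_k(text, a, b, k, ordered=False):
    """True if terms a and b occur within k word positions of one another.

    With ordered=True, a must come before b.
    """
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    if ordered:
        return any(0 < j - i <= k for i in pos_a for j in pos_b)
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)

print(within_k("pen and paper", "pen", "paper", 5))
print(within_k("the pen lay on top of the paper", "pen", "paper", 5))
print(within_k("paper and pen", "pen", "paper", 5, ordered=True))
```

Storing word positions in the index, rather than only document ids, is what makes this kind of query possible at scale.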

  43. Representation of documents and queries Why do this? • Want to compare documents • Want to compare documents with queries • Want to retrieve and rank documents with regards to a specific query A document representation permits this in a consistent way (type of conceptualization)

  44. Measures of similarity • Retrieve the most similar documents to a query • Equate similarity to relevance • Most similar are the most relevant • This measure is one of “lexical similarity” • The matching of text or words

  45. Document space • Documents are organized in some manner - exist as points in a document space • Documents treated as text, etc. • Match query with document • Query similar to document space • Query not similar to document space and becomes a characteristic function on the document space • Documents most similar are the ones we retrieve • Reduce this to a computable measure of similarity

  46. Representation of Documents • Consider now only text documents • Words are tokens (primitives) • Why not letters? • Stop words? • How do we represent words? • Even for video, audio, etc documents, we often use words as part of the representation

  47. Documents as Vectors • Documents are represented as “bags of words” • Example? • Represented as vectors when used computationally • A vector is like an array of floating point values • Has direction and magnitude • Each vector holds a place for every term in the collection • Therefore, most vectors are sparse

  48. Vector Space Model • Documents and queries are represented as vectors in term space • Terms are usually stems • Documents represented by binary vectors of terms • Queries represented the same as documents • Query and Document weights are based on length and direction of their vector • A vector distance measure between the query and documents is used to rank retrieved documents

  49. The Vector-Space Model • Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary. • These “orthogonal” terms form a vector space. Dimension = t = |vocabulary| • Each term i in a document or query j is given a real-valued weight, wij. • Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, …, wtj)
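A sketch of ranking with t-dimensional weight vectors. The slides speak only of "a vector distance measure"; cosine similarity is used here as one common choice, which is an assumption of this example, as are the weights and document names.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical weights over a 3-term vocabulary (t = 3).
docs = {
    "d1": (2.0, 0.0, 1.0),
    "d2": (0.0, 3.0, 0.0),
    "d3": (1.0, 1.0, 0.0),
}
query = (1.0, 0.0, 0.0)

ranked = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
print(ranked)
```

Because cosine depends on direction rather than length, a long document is not favored merely for repeating terms more often.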
