Boolean IR and Text Processing - Lecture Overview

This lecture gives an overview of information retrieval (IR): the information-seeking process, the history of IR research, IR system structure, central concepts in IR, Boolean logic and Boolean IR systems, and text processing. It closes with a discussion. Credit for some of the slides goes to Marti Hearst.


Presentation Transcript


  1. SIMS 202: Information Organization and Retrieval • Lecture 4: Boolean IR and Text Processing • Prof. Ray Larson & Prof. Marc Davis • UC Berkeley SIMS • Tuesday and Thursday 10:30 am - 12:00 pm • Fall 2004 • http://www.sims.berkeley.edu/academics/courses/is202/f04/

  2. Advertisement • Not doing anything on Friday afternoon? • Please come to the Friday Afternoon Seminar – Open to ALL • This Week: • Clifford Lynch, director of the Coalition for Networked Information and Adjunct Professor of SIMS on “Research Questions in Digital Stewardship” • See • http://www.sims.berkeley.edu/academics/courses/is296a-1/f04/

  3. Lecture Overview • Review • Introduction to Information Retrieval • The Information Seeking Process • History of IR Research • IR System Structure • Central Concepts in IR • Boolean Logic and Boolean IR Systems • Text Processing • Discussion Credit for some of the slides in this lecture goes to Marti Hearst

  4. Lecture Overview • Review • Introduction to Information Retrieval • The Information Seeking Process • History of IR Research • IR System Structure (revisited) • Central Concepts in IR • Boolean Logic and Boolean IR Systems • Text Processing • Discussion Credit for some of the slides in this lecture goes to Marti Hearst

  5. IR is an Iterative Process Repositories Goals Workspace

  6. Berry-Picking Model • A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89) • Diagram: a path through successive queries Q0 to Q5

  7. Restricted Form of the IR Problem • The system has available only pre-existing, “canned” text passages • Its response is limited to selecting from these passages and presenting them to the user • It must select, say, 10 or 20 passages out of millions or billions!

  8. Information Retrieval • Revised Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries • This set of assumptions underlies the field of Information Retrieval

  9. Paradox • The “Fundamental paradox of Information Retrieval” as stated by Roland Hjerppe • The need to describe that which you do not know in order to find it

  10. Lecture Overview • Review • Introduction to Information Retrieval • The Information Seeking Process • History of IR Research • IR System Structure (revisited) • Central Concepts in IR • Boolean Logic and Boolean IR Systems • Text Processing • Discussion Credit for some of the slides in this lecture goes to Marti Hearst

  11. Structure of an IR System (adapted from Soergel, p. 19) • Search Line: Interest profiles & Queries -> Formulating query in terms of descriptors -> Storage of profiles -> Store1: Profiles/Search requests • Storage Line: Documents & data -> Indexing (Descriptive and Subject) -> Storage of Documents -> Store2: Document representations • Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language) • Comparison/Matching of Store1 and Store2 -> Potentially Relevant Documents

  12. Lecture Overview • Review • Introduction to Information Retrieval • The Information Seeking Process • History of IR Research • IR System Structure (revisited) • Central Concepts in IR • Boolean Logic and Boolean IR Systems • Text Processing • Discussion Credit for some of the slides in this lecture goes to Marti Hearst

  13. Central Concepts in IR • Documents • Queries • Collections • Evaluation • Relevance

  14. Documents • What do we mean by a document? • Full document? • Document surrogates? • Pages? • Buckland (JASIS, Sept. 1997) “What is a Document” • Are IR systems better called Document Retrieval systems? • A document is a representation of some aggregation of information, treated as a unit

  15. Collection • A collection is some physical or logical aggregation of documents • A database • A Library • An index? • Others?

  16. Queries • A query is some expression of a user’s information needs • Can take many forms • Natural language description of need • Formal query in a query language • Queries may not be accurate expressions of the information need • Differences between conversation with a person and formal query expression

  17. Evaluation: Why Evaluate? • Determine if the system is desirable • Make comparative assessments • Others?

  18. What To Evaluate? • How much of the information need was satisfied • How much was learned about a topic • Incidental learning • How much was learned about the collection • How much was learned about other topics • How inviting the system is…

  19. What To Evaluate? What can be measured that reflects users’ ability to use system? (Cleverdon 66) • Coverage of information • Form of presentation • Effort required/ease of use • Time and space efficiency • Effectiveness: • Recall: proportion of relevant material actually retrieved • Precision: proportion of retrieved material actually relevant
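The two effectiveness measures on this slide can be written down directly. The sketch below is only an illustration: the function name `precision_recall` and the document IDs are made up, and it assumes relevance judgments are available as a set of IDs.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents retrieved, 5 relevant in the whole collection.
p, r = precision_recall({"D1", "D2", "D3", "D4"}, {"D2", "D4", "D5", "D6", "D7"})
print(p, r)  # 2 of 4 retrieved are relevant; 2 of 5 relevant were retrieved
```

Returning 0.0 for an empty set is a convention chosen here to avoid division by zero; other conventions are possible.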

  20. Lecture Overview • Review • Introduction to Information Retrieval • The Information Seeking Process • History of IR Research • IR System Structure (revisited) • Central Concepts in IR • Boolean Logic and Boolean IR Systems • Text Processing • Discussion Credit for some of the slides in this lecture goes to Marti Hearst

  21. Query Languages • A way to express the question (information need) • Types: • Boolean • Natural Language • Stylized Natural Language • Form-Based (GUI)

  22. Simple Query Language: Boolean • Terms + Operators • Terms • Words • Normalized (stemmed) words • Phrases • Thesaurus terms • Boolean Operators • AND • OR • NOT

  23. Boolean Queries • Cat • Cat OR Dog • Cat AND Dog • (Cat AND Dog) • (Cat AND Dog) OR Collar • (Cat AND Dog) OR (Collar AND Leash) • (Cat OR Dog) AND (Collar OR Leash)
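One way to see what these queries return is to evaluate them as set operations over an inverted index. This is a minimal sketch, not the way any particular system is implemented: the five toy documents, the helper `t`, and the variable names are all assumptions made for illustration.

```python
# Toy collection; document IDs and texts are made up for illustration.
docs = {
    1: "cat collar",
    2: "dog leash",
    3: "cat dog",
    4: "collar leash",
    5: "cat dog collar leash",
}

# Inverted index: term -> set of IDs of the documents that contain it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)


def t(term):
    """Posting set for a term (empty set if the term never occurs)."""
    return index.get(term.lower(), set())


cat_or_dog = t("Cat") | t("Dog")                      # OR  -> set union
cat_and_dog = t("Cat") & t("Dog")                     # AND -> set intersection
q_mixed = (t("Cat") & t("Dog")) | t("Collar")         # (Cat AND Dog) OR Collar
q_facets = (t("Cat") | t("Dog")) & (t("Collar") | t("Leash"))
not_cat = set(docs) - t("Cat")                        # NOT -> complement within the collection

print(sorted(cat_or_dog), sorted(cat_and_dog), sorted(q_mixed), sorted(q_facets), sorted(not_cat))
```

Note that NOT needs the full set of document IDs to complement against, which is why it is usually the most expensive of the three operators.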

  24. Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) • Each of the following combinations works: Recall the card based systems? They mechanically implement Boolean AND

  25. Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) • None of the following combinations works:

  26. Boolean Logic A B

  27. Boolean Queries • Usually expressed as INFIX operators in IR • ((a AND b) OR (c AND b)) • NOT is UNARY PREFIX operator • ((a AND b) OR (c AND (NOT b))) • AND and OR can be n-ary operators • (a AND b AND c AND d) • Some rules - (De Morgan revisited) • NOT(a) AND NOT(b) = NOT(a OR b) • NOT(a) OR NOT(b) = NOT(a AND b) • NOT(NOT(a)) = a
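The three rules listed on this slide can be checked exhaustively, since two Boolean variables have only four combinations. The sketch below does that and then repeats the check on posting sets; the function name `de_morgan_holds`, the `universe` of document IDs, and the sets `A` and `B` are illustrative assumptions.

```python
from itertools import product


def de_morgan_holds():
    """Check the slide's three identities on every truth assignment of a and b."""
    for a, b in product([False, True], repeat=2):
        if ((not a) and (not b)) != (not (a or b)):
            return False
        if ((not a) or (not b)) != (not (a and b)):
            return False
        if (not (not a)) != a:
            return False
    return True


# The same identities on sets of document IDs (hypothetical collection of six documents).
universe = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3}
B = {3, 4}


def NOT(s):
    """Complement of a posting set within the collection."""
    return universe - s


sets_agree = ((NOT(A) & NOT(B)) == NOT(A | B)) and ((NOT(A) | NOT(B)) == NOT(A & B))
print(de_morgan_holds(), sets_agree)
```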

  28. Boolean Logic • Venn diagram of three term sets t1, t2, t3 over documents D1–D11 • The circles divide the collection into eight regions m1–m8, one per minterm (each combination of t1, t2, t3 present or negated)

  29. Boolean Searching • “Measurement of the width of cracks in prestressed concrete beams” • Formal Query: Cracks AND Beams AND Width_measurement AND Prestressed_concrete • Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
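The relaxed query on this slide is the OR of every three-term AND drawn from the four concepts, in other words "at least three of the four terms". A small sketch of that construction follows; the function name `relaxed_query` and the posting sets are hypothetical.

```python
from itertools import combinations


def relaxed_query(postings, k):
    """OR together the AND of every k-subset of the given posting sets."""
    result = set()
    for combo in combinations(postings, k):
        result |= set.intersection(*combo)
    return result


# Hypothetical posting sets (document IDs) for the four concepts on the slide.
C = {1, 2, 3}      # Cracks
B = {1, 2, 4}      # Beams
W = {1, 3, 4}      # Width_measurement
P = {1, 2, 3, 5}   # Prestressed_concrete

strict = C & B & W & P                        # formal query: all four terms
relaxed = relaxed_query([C, B, W, P], k=3)    # relaxed query: any three of the four
print(sorted(strict), sorted(relaxed))
```

With these sets, document 4 contains only two of the concepts and is still excluded by the relaxed query.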

  30. Pseudo-Boolean Queries • A new notation, from web search • +cat dog +collar leash • Does not mean the same thing! • Need a way to group combinations • Phrases: • “stray cat” AND “frayed collar” • +“stray cat” + “frayed collar”
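One common reading of the `+` notation, assumed here rather than stated on the slide, is that `+term` marks a required term while unmarked terms are optional and only influence ordering. Under that assumption a query such as `+cat dog +collar leash` could be handled roughly as below; `parse_plus_query`, `search`, and the toy documents are hypothetical, and phrase handling is left out.

```python
def parse_plus_query(query):
    """Split a '+term term ...' query into (required, optional) term lists."""
    required, optional = [], []
    for token in query.lower().split():
        if token.startswith("+"):
            required.append(token[1:])
        else:
            optional.append(token)
    return required, optional


def search(docs, query):
    """Keep documents with every required term; order by optional-term matches."""
    required, optional = parse_plus_query(query)
    scored = []
    for doc_id, text in docs.items():
        terms = set(text.lower().split())
        if all(term in terms for term in required):
            bonus = sum(1 for term in optional if term in terms)
            scored.append((-bonus, doc_id))  # more optional matches sort first
    return [doc_id for _, doc_id in sorted(scored)]


# Toy documents, made up for illustration.
docs = {1: "cat collar", 2: "cat dog collar leash", 3: "dog leash", 4: "cat dog collar"}
print(search(docs, "+cat dog +collar leash"))
```

This makes the slide's warning concrete: the result is neither `cat AND dog AND collar AND leash` nor `cat OR dog OR collar OR leash`.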

  31. Another View of IR Information Need Collections Pre-Process Text Input Query Index Parse Rank

  32. Result Sets • Run a query, get a result set • Two choices • Reformulate query, run on entire collection • Reformulate query, run on result set • Example: Dialog query • (Redford AND Newman) • -> S1 1450 documents • (S1 AND Sundance) • -> S2 898 documents

  33. Feedback Queries Information Need Collections Pre-Process Text Input Query Index Parse Reformulated Query Rank Re-Rank

  34. Ordering of Retrieved Documents • Pure Boolean has no ordering • In practice: • Order chronologically • Order by total number of “hits” on query terms • What if one term has more hits than others? • Is it better to have one of each term or many of one term? • Fancier methods have been investigated • p-norm is most famous • Usually impractical to implement • Usually hard for user to understand

  35. Boolean • Advantages • Simple queries are easy to understand • Relatively easy to implement • Disadvantages • Difficult to specify what is wanted • Too much returned, or too little • Ordering not well determined • Dominant language in commercial IR systems until the WWW, and still the language of Database Management Systems

  36. Faceted Boolean Query • Strategy: Break query into facets (polysemous with earlier meaning of facets) • Conjunction of disjunctions • (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4) • Each facet expresses a topic • (“rain forest” OR jungle OR amazon) AND (medicine OR remedy OR cure) AND (Smith OR Zhou) • Also known as Conjunctive Normal Form or CNF

  37. Faceted Boolean Query • Query still fails if one facet missing • Alternative: Coordination level ranking • Order results in terms of how many facets (disjunctions) are satisfied • Also called Quorum ranking, Overlap ranking, and Best Match • Problem: Facets still undifferentiated • Alternative: Assign weights to facets
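Coordination level ranking can be sketched as counting, for each document, how many facets have at least one matching term. The example below reuses the facet terms from the previous slide; the document contents, the lowercase normalization, and the function name `coordination_level` are assumptions, and each document is modeled simply as a set of terms and phrases.

```python
# Each facet is a set of alternative terms (an OR group); the query is the AND of the facets.
facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]


def coordination_level(doc_terms, facets):
    """Number of facets with at least one term present in the document."""
    return sum(1 for facet in facets if facet & doc_terms)


# Hypothetical documents, represented as sets of already-extracted terms and phrases.
docs = {
    "D1": {"jungle", "cure", "zhou"},
    "D2": {"amazon", "remedy"},
    "D3": {"smith"},
    "D4": {"river"},
}

# Strict faceted query: every facet must be satisfied.
strict = [d for d, terms in docs.items() if coordination_level(terms, facets) == len(facets)]

# Coordination level ranking: order by number of satisfied facets, best first.
ranked = sorted(docs, key=lambda d: coordination_level(docs[d], facets), reverse=True)
print(strict, ranked)
```

Assigning weights to facets, the other alternative on the slide, would amount to replacing the `1` in the sum with a per-facet weight.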

  38. Proximity Searches • Proximity: Terms occur within K positions of one another • pen w/5 paper • A “Near” function can be more vague • near(pen, paper) • Sometimes order can be specified • Also, Phrases and Collocations • “United Nations” “Bill Clinton” • Phrase Variants • “retrieval of information” “information retrieval”
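A proximity operator such as `pen w/5 paper` needs word positions, not just document IDs. The sketch below assumes whitespace tokenization and measures distance as the difference between token positions; exact counting conventions vary by system, and the names `positions` and `within` are hypothetical.

```python
def positions(text):
    """Map each lowercase token to the list of positions where it occurs."""
    pos = {}
    for i, token in enumerate(text.lower().split()):
        pos.setdefault(token, []).append(i)
    return pos


def within(text, a, b, k, ordered=False):
    """True if some occurrence of a and some occurrence of b are at most k positions apart.

    With ordered=True, a must also come before b.
    """
    pos = positions(text)
    for i in pos.get(a, []):
        for j in pos.get(b, []):
            if ordered and j <= i:
                continue
            if abs(i - j) <= k:
                return True
    return False


text = "put the pen down on the paper"
print(within(text, "pen", "paper", 5))                 # pen at 2, paper at 6
print(within(text, "paper", "pen", 5, ordered=True))   # paper does not precede pen
print(within(text, "pen", "paper", 3))
```

A phrase match can be seen as the special case `ordered=True` with `k=1`.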

  39. Filters • Filters: Reduce set of candidate docs • Often specified simultaneously with query • Usually restrictions on metadata • Restrict by: • Date range • Internet domain (.edu .com .berkeley.edu) • Author • Size • Limit number of documents returned

  40. Boolean Systems • Most of the commercial database search systems that pre-date the WWW are based on Boolean search • Dialog, Lexis-Nexis, etc. • Most Online Library Catalogs are Boolean systems • E.g., MELVYL • Database systems use Boolean logic for searching • Many of the search engines sold for intranet search of web sites are Boolean

  41. Why Boolean? • Easy to implement • Efficient searching across very large databases • Easy to explain results • “Has to have all of the words…” (AND) • “Has to have at least one of the words…” (OR)

  42. Lecture Overview • Review • Introduction to Information Retrieval • The Information Seeking Process • History of IR Research • IR System Structure (revisited) • Central Concepts in IR • Boolean Logic and Boolean IR Systems • Text Processing • Discussion Credit for some of the slides in this lecture goes to Marti Hearst

  43. Content Analysis • Automated Transformation of raw text into a form that represents some aspect(s) of its meaning • Including, but not limited to: • Automated Thesaurus Generation • Phrase Detection • Categorization • Clustering • Summarization

  44. Techniques for Content Analysis • Statistical • Single Document • Full Collection • Linguistic • Syntactic • Semantic • Pragmatic • Knowledge-Based (Artificial Intelligence) • Hybrid (Combinations)

  45. Text Processing • Standard Steps: • Recognize document structure • Titles, sections, paragraphs, etc. • Break into tokens • Usually space and punctuation delineated • Special issues with Asian languages • Stemming/morphological analysis • Store in inverted index (to be discussed later)
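The tokenizing and indexing steps can be illustrated in a few lines. This sketch assumes English-like text where tokens are runs of letters and digits, skips structure recognition and stemming, and uses made-up documents; as the slide notes, languages written without spaces need a different tokenizer.

```python
import re
from collections import defaultdict

# Tokens are runs of lowercase letters or digits; everything else is a delimiter.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it on whitespace and punctuation."""
    return TOKEN_RE.findall(text.lower())


def build_inverted_index(docs):
    """Map each token to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


# Hypothetical two-document collection.
docs = {
    1: "Cracks in prestressed-concrete beams.",
    2: "Measuring beams: width, length.",
}
index = build_inverted_index(docs)
print(tokenize(docs[1]))
print(sorted(index["beams"]), sorted(index["width"]))
```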

  46. Content Analysis Areas Information Need Collections How is the query constructed? Pre-Process Text Input How is the text processed? Query Index Parse Rank

  47. Document Processing Steps From “Modern IR” Textbook

  48. Stemming and Morphological Analysis • Goal: “normalize” similar words • Morphology (“form” of words) • Inflectional Morphology • E.g., inflect verb endings and noun number • Never change grammatical class • dog, dogs • tengo, tienes, tiene, tenemos, tienen • Derivational Morphology • Derive one word from another, • Often change grammatical class • build, building; health, healthy

  49. Automated Methods • Powerful multilingual tools exist for morphological analysis • PCKimmo, Xerox Lexical technology • Require a grammar and dictionary • Use “two-level” automata • Stemmers: • Very dumb rules work well (for English) • Porter Stemmer: Iteratively remove suffixes • Improvement: Pass results through a lexicon
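To give a feel for what "very dumb rules" look like, here is a toy suffix stripper. It is not the Porter algorithm, which uses several rule phases with conditions on the remaining stem; the two rule groups, the `min_stem` threshold, and the name `naive_stem` are all invented for this illustration.

```python
# Toy rule groups, applied in order; at most one suffix is removed per group.
# These rules are illustrative only and are NOT the Porter stemmer's rule set.
STEPS = [
    ("es", "s"),     # plural endings
    ("ing", "ed"),   # verb endings
]


def naive_stem(word, min_stem=3):
    """Strip one suffix per rule group, keeping at least min_stem characters."""
    word = word.lower()
    for suffixes in STEPS:
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                break  # move on to the next rule group
    return word


for w in ["dogs", "buildings", "glasses", "retrieved", "news"]:
    print(w, "->", naive_stem(w))
```

The last word shows the kind of error such rules produce: "news" is conflated with "new". Passing results through a lexicon, as the slide suggests, is one way to catch cases like this.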

  50. Errors Generated by Porter Stemmer From Krovetz ‘93
