
Query Models


selia

Presentation Transcript


  1. Query Models • Use • Types • What do search engines do

  2. What we have covered • What is IR • Evaluation • Tokenization and properties of text • Web crawling • This time • Query models

  3. A Typical Web Search Engine (diagram of components: Web, Crawler, Indexer, Index, Query Engine, Interface, Users)

  4. A Typical Web Search Engine (same diagram with Queries added: Web, Crawler, Indexer, Index, Query Engine, Interface, Users, Queries)

  5. Why the interest in Queries? • Queries are ways we interact with IR systems • Expression of an information need • Nonquery methods? • Types of queries?

  6. Issues with Query Structures: matching and ranking criteria • Given a query, what documents are retrieved? • In what order (rank)?

  7. Types of Query Structures • Query Models (languages) – most common: • Boolean Queries • Extended-Boolean Queries • Natural Language Queries • Vector queries • Others?

  8. Simple query language: Boolean • Earliest query model • Terms + Connectors (or operators) • Terms: words, normalized (stemmed) words, phrases, thesaurus terms • Connectors: AND, OR, NOT • Ex: Beethoven AND sonata
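
A minimal sketch of how such a query could be evaluated, assuming each term maps to the set of documents that contain it (an inverted index). The three documents and the helper `postings` are made up for illustration; AND, OR, and NOT then become set intersection, union, and complement.

```python
# Toy collection; the document texts and IDs are invented for illustration.
docs = {
    1: "beethoven piano sonata",
    2: "beethoven symphony",
    3: "mozart piano sonata",
}

# Inverted index: term -> set of IDs of documents containing that term.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)

def postings(term):
    """Return the set of documents containing `term` (empty if unseen)."""
    return index.get(term, set())

both = postings("beethoven") & postings("sonata")     # Beethoven AND sonata
either = postings("beethoven") | postings("sonata")   # Beethoven OR sonata
not_beethoven = all_ids - postings("beethoven")       # NOT Beethoven

print(both, either, not_beethoven)
```

Only document 1 contains both terms, so it is the sole result for Beethoven AND sonata in this toy collection.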

  9. Truth Tables – Boolean Logic • Presence of P: P = 1 • Absence of P: P = 0 • True = 1 • False = 0
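
The table itself did not survive in the transcript. As a sketch, the standard truth table for two terms P and Q can be regenerated with the 1/0 encoding used on the slide (Q is an assumed second term, not named in the original):

```python
from itertools import product

# Each row: (P, Q, P AND Q, P OR Q, NOT P), using 1 for true and 0 for false.
truth_table = [
    (p, q, int(p and q), int(p or q), int(not p))
    for p, q in product((0, 1), repeat=2)
]

print("P Q | AND OR NOT-P")
for p, q, p_and_q, p_or_q, not_p in truth_table:
    print(f"{p} {q} |  {p_and_q}   {p_or_q}    {not_p}")
```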

  10. Problems with Boolean Queries • Ranking? • Incorrect interpretation of Boolean connectives AND and OR • Example - Seeking Saturday entertainment Queries: • Dinner AND sports AND symphony • Dinner OR sports OR symphony • Dinner AND sports OR symphony

  11. Order of precedence of operators • Example of query: Is A AND B the same as B AND A? • Why?

  12. Sample Boolean Queries • Cat • Cat OR Dog • Cat AND Dog • (Cat AND Dog) • (Cat AND Dog) OR Collar • (Cat AND Dog) OR (Collar AND Leash) • (Cat OR Dog) AND (Collar OR Leash)

  13. Satisfaction of Boolean Query • (Cat OR Dog) AND (Collar OR Leash) • Each of the following column combinations works: • Cat x x x x • Dog x x x x x • Collar x x x x • Leash x x x x Others?
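
To answer "Others?", one option is to enumerate every presence/absence combination of the four terms and keep those that satisfy the query. This is a sketch; the function name `satisfies` is illustrative.

```python
from itertools import product

TERMS = ("Cat", "Dog", "Collar", "Leash")

def satisfies(cat, dog, collar, leash):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) for one combination."""
    return (cat or dog) and (collar or leash)

# Every combination of present/absent terms that satisfies the query.
matching = [
    {term for term, present in zip(TERMS, combo) if present}
    for combo in product((False, True), repeat=4)
    if satisfies(*combo)
]

for terms in matching:
    print(sorted(terms))
print(len(matching), "of 16 combinations satisfy the query")
```

There are three ways to satisfy each parenthesized OR, so 3 × 3 = 9 of the 16 combinations work.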

  14. Order of Precedence • Define order of precedence • Ex: a OR b AND c • Infix notation • Parentheses are evaluated first, with left-to-right precedence among operators • Next, NOTs are applied • Then ANDs • Then ORs • a OR b AND c becomes a OR (b AND c)
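
Python's own `not`, `and`, and `or` happen to follow the same order (NOT, then AND, then OR), so the rule can be checked by brute force over all truth assignments. This is only a sketch; it also lists the assignments where the other grouping, (a OR b) AND c, gives a different answer.

```python
from itertools import product

values = list(product((False, True), repeat=3))

# The unparenthesized form groups as a OR (b AND c) for every assignment.
same_as_right = all((a or b and c) == (a or (b and c)) for a, b, c in values)

# Assignments (a, b, c) where (a OR b) AND c disagrees with a OR (b AND c).
differs = [(a, b, c) for a, b, c in values
           if (a or (b and c)) != ((a or b) and c)]

print(same_as_right)
print(differs)
```

The two groupings disagree exactly when a is true and c is false, which is why the precedence rule (or explicit parentheses) matters.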

  15. Infix Notation • Usually expressed as INFIX operators in IR • ((a AND b) OR (c AND b)) • NOT is a UNARY PREFIX operator • ((a AND b) OR (c AND (NOT b))) • AND and OR can be n-ary operators • (a AND b AND c AND d) • Some rules (De Morgan revisited) • NOT(a) AND NOT(b) = NOT(a OR b) • NOT(a) OR NOT(b) = NOT(a AND b) • NOT(NOT(a)) = a
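
These rules can be checked on document sets, where NOT is the complement relative to the whole collection. The collection and the two posting sets below are hypothetical.

```python
# Hypothetical collection of six document IDs and two posting sets.
collection = {1, 2, 3, 4, 5, 6}
a = {1, 2, 3}   # documents containing term a
b = {3, 4}      # documents containing term b

def NOT(docs):
    """Complement relative to the whole collection."""
    return collection - docs

law1 = (NOT(a) & NOT(b)) == NOT(a | b)   # NOT(a) AND NOT(b) = NOT(a OR b)
law2 = (NOT(a) | NOT(b)) == NOT(a & b)   # NOT(a) OR NOT(b) = NOT(a AND b)
law3 = NOT(NOT(a)) == a                  # NOT(NOT(a)) = a

print(law1, law2, law3)
```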

  16. DNFs and CNFs • All queries can be rewritten as either Disjunctive Normal Forms (DNFs) or Conjunctive Normal Forms (CNFs) • DNF Constituents: • Terms (words or phrases) • Conjuncts (terms joined by ANDs) • Disjuncts (conjuncts joined by ORs) • Ex: (A AND B) OR (A AND NOT C) • CNF Constituents: • Terms (words or phrases) • Disjuncts (terms joined by ORs) • Conjuncts (disjuncts joined by ANDs) • Ex: (A OR B) AND (A OR NOT C)

  17. Effect of CNFs • All complex Boolean queries can be simplified • Why do reference librarians like CNFs? • AND’s reduce the size of the set returned and are easily expandable • So do minus’s

  18. Boolean Searching • “Measurement of the width of cracks in prestressed concrete beams” • Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete • Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P) • (Diagram of the four concepts: Cracks, Beams, Width measurement, Prestressed concrete)
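
The relaxed query is the OR of every three-concept AND, which amounts to "at least three of the four concepts are present". A sketch that generates and evaluates such a query (the function names are illustrative, and the generated groups may appear in a different order than on the slide):

```python
from itertools import combinations

# C = cracks, B = beams, W = width measurement, P = prestressed concrete
CONCEPTS = ("C", "B", "W", "P")

def relaxed_query(concepts, k):
    """Build the OR of every AND-combination of k concepts, as a string."""
    groups = ["(" + " AND ".join(g) + ")" for g in combinations(concepts, k)]
    return " OR ".join(groups)

def matches_relaxed(doc_concepts, concepts=CONCEPTS, k=3):
    """A document matches when it contains at least k of the concepts."""
    return len(set(doc_concepts) & set(concepts)) >= k

print(relaxed_query(CONCEPTS, 3))
print(matches_relaxed({"C", "B", "P"}), matches_relaxed({"C", "B"}))
```

Setting `k` to 4 gives back the strict formal query; lowering it relaxes the match further.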

  19. Pseudo-Boolean Queries • A new notation, from web search • +cat dog +collar leash • Does not mean the same thing! • Need a way to group combinations. • Phrases: • “stray cat” AND “frayed collar” • +“stray cat” +“frayed collar”
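
A rough sketch of one common reading of this notation, which is an assumption here rather than something the slide states: terms marked with "+" must occur, and unmarked terms are optional. Phrase handling is left out, and optional terms are simply ignored for matching.

```python
def parse_plus_query(query):
    """Split a query such as '+cat dog +collar leash' into (required, optional) terms."""
    required, optional = set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        else:
            optional.add(token)
    return required, optional

def matches(doc_terms, query):
    """Assumed semantics: every '+' term must occur; unmarked terms are optional."""
    required, _optional = parse_plus_query(query)
    return required <= set(doc_terms)

print(parse_plus_query("+cat dog +collar leash"))
print(matches({"cat", "collar"}, "+cat dog +collar leash"))
```

Under this reading a document with only cat and collar matches, which is not what any single grouping of the four terms with AND and OR alone would express as directly.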

  20. Ordering (ranking) of Retrieved Documents • Pure Boolean has no ordering • Term is there or it’s not • In practice: • order chronologically • order by total number of “hits” on query terms • What if one term has more hits than others? • Is it better to have one of each term or many of one term?
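
The two closing questions can be made concrete with a small sketch that ranks the same documents two ways: by total hits, and by how many different query terms they contain. The documents and function names are invented for illustration.

```python
from collections import Counter

def rank_by_total_hits(docs, query_terms):
    """Order document IDs by the total number of query-term occurrences."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        scores[doc_id] = sum(counts[term] for term in query_terms)
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

def rank_by_distinct_terms(docs, query_terms):
    """Order document IDs by how many different query terms they contain."""
    scores = {doc_id: len(set(text.lower().split()) & set(query_terms))
              for doc_id, text in docs.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    "d1": "cat cat cat cat",   # many hits on one term
    "d2": "cat dog",           # one hit on each term
}
print(rank_by_total_hits(docs, ["cat", "dog"]))
print(rank_by_distinct_terms(docs, ["cat", "dog"]))
```

The two orderings disagree on these documents, which is the point of the slide's question.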

  21. Boolean Query - Summary • Advantages • simple queries are easy to understand • relatively easy to implement • Disadvantages • difficult to specify what is wanted • too much returned, or too little • ordering not well determined • Dominant language in commercial systems until the WWW

  22. Vector Space Model • Documents and queries are represented as vectors in term space • Terms are usually stems • Documents represented by binary vectors of terms • Queries represented the same as documents • Query and Document weights are based on length and direction of their vector • A vector distance measure between the query and documents is used to rank retrieved documents

  23. Document Vectors • Documents are represented as “bags of words” • Words are terms with no order • Represented as vectors when used computationally • A vector is like an array of floating point values • Has direction and magnitude • Each vector holds a place for every term in the collection • Therefore, most vectors are sparse

  24. Queries Vocabulary (dog, house, white) Queries: • dog (1,0,0) • house (0,1,0) • white (0,0,1) • house and dog (1,1,0) • dog and house (1,1,0) • Show 3-D space plot
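
A minimal sketch of how these query vectors could be produced from the three-term vocabulary, using binary weights. The function name `to_vector` is illustrative.

```python
VOCABULARY = ("dog", "house", "white")

def to_vector(text, vocabulary=VOCABULARY):
    """Binary vector: 1 where the vocabulary term occurs in the text, else 0."""
    words = set(text.lower().split())
    return tuple(1 if term in words else 0 for term in vocabulary)

for query in ("dog", "house", "white", "house and dog", "dog and house"):
    print(query, to_vector(query))
```

Because word order is discarded, "house and dog" and "dog and house" map to the same vector.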

  25. Documents (queries) in Vector Space (3-D plot: documents D1 to D11 plotted against term axes t1, t2, t3)

  26. Documents in 3D Space Assumption: Documents that are “close together” in space are similar in meaning.
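
One way to measure "close together" is the cosine of the angle between two vectors. The slides do not name a specific measure, so cosine similarity is an assumption here, and the three document vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs):
    """Return document IDs ordered from most to least similar to the query."""
    return sorted(doc_vecs,
                  key=lambda d: cosine_similarity(query_vec, doc_vecs[d]),
                  reverse=True)

# Hypothetical documents over the vocabulary (dog, house, white).
doc_vecs = {"D1": (1, 1, 0), "D2": (0, 0, 1), "D3": (1, 0, 0)}
ranking = rank((1, 0, 0), doc_vecs)   # query: "dog"
print(ranking)
```

Cosine depends only on direction, so scaling a vector does not change its ranking; a measure such as Euclidean distance would also take length into account.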

  27. Vector Query Problems • Significance of queries • Can different values (weights) be placed on the different terms, e.g. 2dog 1house? • Scaling – size of vectors • Number of words in the dictionary? • 100,000

  28. Proximity Searches • Proximity: terms occur within K positions of one another • pen w/5 paper • A “Near” function can be more vague • near(pen, paper) • Sometimes order can be specified • Also, Phrases and Collocations • “United Nations” “Bill Clinton” • Phrase Variants • “retrieval of information” “information retrieval”
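
A sketch of a proximity check such as pen w/5 paper, assuming "within K positions" means the two word positions differ by at most K (counting conventions vary between systems). The `ordered` flag is an illustrative way to require that the first term comes before the second.

```python
def positions(tokens, term):
    """Indexes at which `term` occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within(text, term_a, term_b, k, ordered=False):
    """True if term_a and term_b occur within k positions of one another.

    With ordered=True, term_a must come before term_b.
    """
    tokens = text.lower().split()
    for i in positions(tokens, term_a):
        for j in positions(tokens, term_b):
            gap = j - i
            if ordered and gap <= 0:
                continue
            if 0 < abs(gap) <= k:
                return True
    return False

print(within("put pen to paper", "pen", "paper", 5))
print(within("paper and pen", "pen", "paper", 5, ordered=True))
```

A phrase such as “United Nations” is then the special case of an ordered match with k = 1.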

  29. Filters • Filters: Reduce the set of candidate docs • Often specified simultaneously with the query • Usually restrictions on metadata • restrict by: • date range • internet domain (.edu .com .berkeley.edu) • author • size • limit number of documents returned

  30. Natural Language Queries • The “Holy Grail” of information retrieval • Issues in Natural Language Processing • syntax • semantics • pragmatics • speech understanding • speech generation

  31. What do search engines do? • Tags • Title • Meta • Term frequency and location • Popularity

  32. UC Berkeley Search Engine Guide http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html

  33. UC Berkeley Search Engine Guide http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html

  34. Old: Search Engine Query Differences

  35. Older: Search engine query models

  36. Types of Query Structures • Query Models (languages) – most common: • Boolean Queries: old model • Vector queries: very common • Probabilistic models: mostly research • Natural Language Queries: holy grail of search
