Special Topics in Computer Science The Art of Information Retrieval Chapter 4: Query Languages

Alexander Gelbukh www.Gelbukh.com

Previous Chapter • Main measures: Precision & Recall. • For sets • Rankings are evaluated through initial subsets • There are measures that combine them into one • Involve user-defined preferences. In the F-measure, set to 50-50 • Many (other) characteristics • An algorithm can be good at some and bad at others • Averages are used, but are not always meaningful • Reference collections with known answers exist to evaluate new algorithms
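
As a reminder of how precision and recall are combined into one number, here is a minimal sketch. It assumes the standard weighted F-measure; the parameter name `beta` is not from the slides. With `beta = 1` the two measures are weighted 50-50 (harmonic mean).

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta = 1 weights both equally (the 50-50 setting);
    beta > 1 favors recall, beta < 1 favors precision.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing relevant was found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, `f_measure(0.5, 1.0)` is lower than the arithmetic mean 0.75, because the harmonic mean punishes an imbalance between the two measures.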

Previous chapter: research issues • Different types of interfaces; interactive systems: • What measures to use? • How do people judge relevance? • How can “user satisfaction” be measured? Modeled?

Query languages • Query language = type of possible queries • The types of queries depend on the IR model • Types: • IR (= ranked output) • Data retrieval • User-oriented • Low-level (= protocols) • Assume all pre-processing has been done • Thesaurus, stop-words, ... • (I think this must be a part of the language!) • Returns “documents” (chapter, paragraph, ...)

In this chapter • Keyword-based languages • Pattern matching • Structure taken into account • Protocols

Keyword-based languages: Single word • Intuitive, easy to express, fast ranking. • Words can be highlighted in the output. • What is a word? • Letters, separators • Non-splitting characters: on-line. • Database decides. • TF-IDF is designed for words • Used for the main models (Boolean, Vector, Probabilistic)
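
The question of what a word is can be made concrete with a small tokenizer. This is only a sketch under one possible convention (lowercase Latin letters, with the hyphen treated as a non-splitting character inside a word); a real database decides its own rules.

```python
import re

# A word is a run of letters; a hyphen between letter runs does not split it.
WORD = re.compile(r"[a-z]+(?:-[a-z]+)*")

def tokenize(text: str) -> list[str]:
    """Split text into lowercase words, keeping 'on-line' as one word."""
    return WORD.findall(text.lower())
```

With these rules, `tokenize("On-line retrieval, fast!")` keeps `on-line` together and drops the punctuation.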

Keyword-based languages: Context Queries • Ensure that the words are related • Phrase • “enhance retrieval” • Allows separators and stopwords: “enhance the retrieval” • Proximity • “enhance the quality of information retrieval” • Distance: words, letters. Order: same or not • Not clear how to rank • Research issue
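
A proximity query can be sketched as a check on word positions. The function below is an illustration only: it measures distance in words, splits on whitespace, and scans the text directly instead of using a positional index. The name `proximity_match` and its parameters are made up for this example.

```python
def proximity_match(text: str, w1: str, w2: str,
                    max_distance: int, ordered: bool = True) -> bool:
    """True if w1 and w2 occur within max_distance words of each other.

    ordered=True additionally requires w1 to come before w2.
    A strict phrase query corresponds to max_distance=1, ordered=True.
    """
    tokens = text.lower().split()
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    for p in pos1:
        for q in pos2:
            gap = q - p
            if ordered and 0 < gap <= max_distance:
                return True
            if not ordered and 0 < abs(gap) <= max_distance:
                return True
    return False
```

In “enhance the quality of information retrieval” the two query words are 5 positions apart, so the match succeeds with `max_distance=5` and fails with `max_distance=4`.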

Keyword-based languages: Boolean Queries • Boolean expressions (can combine basic queries; basic = keyword, pattern) • Query syntax tree • translation AND (syntax OR syntactic) • Evaluated as operations on the sets • Result: set • OR, AND, e1 BUT e2 • NOT is not used: it could give (almost) all docs (= unsafe) • Good: Can highlight occurrences, sort • Bad: Difficult for the users • Remedy (?): fuzzy Boolean (see below).
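
The query syntax tree and its evaluation by set operations can be sketched as follows. The representation is an assumption made for illustration: the tree is a nested tuple `(operator, left, right)`, leaves are keywords, and the index is a plain dictionary from word to a set of document ids.

```python
def build_index(docs: dict) -> dict:
    """Inverted index: word -> set of ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def evaluate(node, index: dict) -> set:
    """Evaluate a query tree bottom-up; the result is a set of doc ids."""
    if isinstance(node, str):            # leaf: a basic (keyword) query
        return index.get(node, set())
    op, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if op == "AND":
        return a & b
    if op == "OR":
        return a | b
    if op == "BUT":                      # e1 BUT e2: in e1 and not in e2
        return a - b
    raise ValueError(f"unknown operator: {op}")
```

The example from the slide becomes `("AND", "translation", ("OR", "syntax", "syntactic"))`. BUT is safe where a bare NOT is not, because its result is always a subset of the left operand.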

Keyword-based languages: Fuzzy Boolean, Natural Language • Fuzzy Boolean: OR and AND are both relaxed to “some”. • AND punishes for absence, OR encourages multiple. • Natural ranking: how many times? • Natural Language: OR = AND • BUT can be expressed (= penalty) • How to rank? Different ways • Vector space model • Query is a vector • A doc can be taken as a vector: relevance feedback! • Proximity is ignored • (Why? Research issue.)
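
One of the possible ways to rank a natural language query is the vector space model. The sketch below assumes raw term counts as vector components (no TF-IDF weighting) and cosine similarity as the score; both choices are simplifications for illustration.

```python
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    """Bag-of-words vector: word -> number of occurrences."""
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def rank(query: str, docs: dict) -> list:
    """Ids of documents with a non-zero score, best first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(text)), doc_id)
              for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]
```

Since the query is just text turned into a vector, a whole document can be passed as the `query` argument, which is the idea behind relevance feedback. Word positions are discarded by `vectorize`, so proximity plays no role in this score.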

Pattern matching... • Pattern = sequence of features • A text segment matches the pattern • Types: • Words • Prefixes, suffixes, substrings: • comput-, -ters, -any flow- (many flowers). • Ranges • implies some order, e.g., lexicographical = alphabetic • Allowing errors • Levenshtein (= edit) distance: historical / hysterical • Number of insertions, deletions, replacements. Threshold.
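
Matching with errors can be sketched with the classic dynamic-programming computation of the edit distance. The helper `matches_with_errors` and its `threshold` parameter are names chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements
    needed to turn a into b."""
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # replace (or keep)
        prev = cur
    return prev[-1]

def matches_with_errors(word: str, pattern: str, threshold: int) -> bool:
    """True if word is within threshold edits of pattern."""
    return levenshtein(word, pattern) <= threshold
```

For the pair on the slide, “historical” and “hysterical” differ by two replacements (i/y and o/e), so they match with a threshold of 2 but not with a threshold of 1.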

...Pattern matching ...Types • Regular expressions • union = or: if e1, e2 are expressions, (e1 | e2) too • concatenation: e1 e2 • repetition: e* (0 or more occurrences) • Extended patterns • user-friendly; can be internally converted into simple • case-insensitive, “anything” (wildcard), digit, vowel, ... • conditionals, optional • some parts match exactly and others with errors • etc.
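
The three basic operations and a few extended patterns can be tried out with Python's `re` module. Its syntax is used here only as one concrete notation; the example patterns themselves are made up for illustration.

```python
import re

# Union: (e1 | e2)
union = re.compile(r"syntax|syntactic")
# Concatenation and repetition: "a" followed by zero or more "b"
repetition = re.compile(r"ab*")
# Extended: case-insensitive matching and an optional letter
optional_ci = re.compile(r"colou?r", re.IGNORECASE)
# Extended: "anything" (wildcard .) and digit (\d)
wildcard_digit = re.compile(r"chapter.\d+")

def matches(pattern: re.Pattern, text: str) -> bool:
    """True if the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None
```

The extended forms are conveniences: `colou?r` could be written with the basic operations as `color|colour`, and a digit class as a union of ten single characters.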

Structural queries • Old days: fields. No nesting, no overlap, fixed order. • Email: subject, body, sender, ... • = Relational database with text type, treated as text should be • Versions of SQL with text operators • Hypertext • Not well developed. Too free • WebGlimpse: search the neighborhood • Hierarchical • Intermediate level of freedom • Volumes, chapters, sections, paragraphs, sentences, ...

[Figure: three kinds of structure: fields (too fixed), hypertext (too free), hierarchical (intermediate)]

Hierarchical Models ... • PAT expressions • Hierarchy is defined at query time. • Regions are included in the index, e.g., sections, italics, ... • Different types of regions can overlap, same type can’t • Can query for words in a region, regions in a region, etc. • Complex computation, unclear semantics • Overlapped lists • Evolution of PAT: areas of same type can overlap (not nest) • Uses same inverted file • Can combine regions, specify order, ... • n-words: all (overlapping) areas of n words.
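
Queries over regions can be sketched by treating each region as an interval of word positions. This is only an illustration of the query types named above, with regions as `(start, end)` pairs (inclusive); it does not reproduce the actual PAT or overlapped-lists implementations.

```python
def regions_containing_word(regions, positions):
    """Regions that contain at least one occurrence of a word.

    positions: word positions where the word occurs in the text.
    """
    return [(s, e) for (s, e) in regions
            if any(s <= p <= e for p in positions)]

def regions_inside(inner, outer):
    """Regions of one type that lie completely inside some region
    of another type."""
    return [(s, e) for (s, e) in inner
            if any(o_start <= s and e <= o_end for (o_start, o_end) in outer)]

def n_words(text_length: int, n: int):
    """All (overlapping) areas of n consecutive words."""
    return [(i, i + n - 1) for i in range(text_length - n + 1)]
```

With sections `[(0, 9), (10, 19)]` and italics `[(3, 5), (8, 12)]`, only the first italic region lies inside a section; the second one crosses a section boundary, which is the kind of overlap between different region types that these models allow.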

Overlapping lists

... Hierarchical Models ... • List of references • Answers are references (pointers) to regions • Only one type of regions (e.g., only sections). No nesting. • Known at index time • Ancestry of nodes. Can query paths • Proximal nodes • Compromise between expressiveness and efficiency • Many (overlapping) fixed hierarchies • Interesting queries: “3rd paragraph of each chapter”, ...
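
A query such as “3rd paragraph of each chapter” can be sketched on a small in-memory hierarchy. The `Node` class and the function name are hypothetical and only show what the query means; an actual proximal nodes system evaluates such queries on its indices, not by walking the whole tree.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # e.g. "book", "chapter", "paragraph"
    label: str
    children: list = field(default_factory=list)

def nth_child_of_each(root: Node, parent_kind: str,
                      child_kind: str, n: int) -> list:
    """For every node of parent_kind, return its n-th child (1-based)
    of child_kind, in document order."""
    results = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.kind == parent_kind:
            same = [c for c in node.children if c.kind == child_kind]
            if len(same) >= n:
                results.append(same[n - 1])
        stack.extend(reversed(node.children))  # keep document order
    return results
```

Chapters with fewer than three paragraphs simply contribute nothing to the answer.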

Proximal nodes

... Hierarchical Models • Tree matching • Query is a tree. Match the text tree. • Ordered or unordered trees (are siblings ordered?) • Prolog-like constraints on different parts of the tree • Variables • Answer: root of a match • Very inefficient (usually NP-hard) • Due to variables and unordered matching
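
The difference between ordered and unordered tree matching can be sketched as follows. This is a simplified illustration: trees are `(label, children)` tuples, labels must be equal, query children must match distinct children of the text node, and there are no variables or Prolog-like constraints.

```python
def match_ordered(q, t) -> bool:
    """Query children must match text children in the same order."""
    if q[0] != t[0]:
        return False
    i = 0
    for qc in q[1]:
        # find the next text child that matches this query child
        while i < len(t[1]) and not match_ordered(qc, t[1][i]):
            i += 1
        if i == len(t[1]):
            return False
        i += 1
    return True

def match_unordered(q, t) -> bool:
    """Query children may match distinct text children in any order."""
    if q[0] != t[0]:
        return False
    def assign(k, used):
        if k == len(q[1]):
            return True
        return any(j not in used
                   and match_unordered(q[1][k], tc)
                   and assign(k + 1, used | {j})
                   for j, tc in enumerate(t[1]))
    return assign(0, frozenset())

def find_matches(q, t, matcher) -> list:
    """Answer: the roots of all subtrees of t that match q."""
    found = [t] if matcher(q, t) else []
    for child in t[1]:
        found.extend(find_matches(q, child, matcher))
    return found
```

The backtracking in `match_unordered` tries assignments of query children to text children and can take exponential time in the worst case, which hints at why unordered matching is the expensive variant.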

Research issues in hierarchical models • Static or dynamic? • Define the hierarchy at index time or at query time? • Static: text markup. Dynamic: tags, indexed. • Restrictions on the structure • Restrict the structure or restrict the query language • For efficiency • Integration with text • of secondary importance: structure (in IR) or text (in DB)? • combine • Query language • Standardization, expressiveness taxonomy, categorization

Query protocols • Used internally • Standard: one client can query different libraries • In CD-ROMs, disk interchangeability • Z39.50: bibliographic (used for other types, too) • WAIS (Wide Area Information Service) • Includes Z39.50 • For CD-ROMs: • CCL, Common Command Language • CD-RDx (Compact Disk Read only Data Exchange) • SFQL (Structured Full-text Query Language). Like DB.

Types of queries we have discussed

Trends and research topics • Models: to better understand the user needs • Query languages: flexibility, power, expressiveness, functionality • Visual languages • Example: library shown on the screen. Act: take books, open catalogs, etc. • Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!

Conclusions • Width-wise: • words, phrases, proximity, fuzzy Boolean, natural language • Depth-wise: • Pattern matching • If they return sets, can be combined using Boolean model • Combining with structure • Hierarchical structure • Standardized low level languages: protocols • Reusable

Thank you! Till October 16 October 23: midterm exam
