Special Topics in Computer Science The Art of Information Retrieval Chapter 4: Query Languages

Special Topics in Computer Science The Art of Information Retrieval Chapter 4: Query Languages. Alexander Gelbukh www.Gelbukh.com. Previous Chapter. Main measures: Precision & Recall. For sets Rankings are evaluated through initial subsets There are measures that combine them into one


Presentation Transcript


  1. Special Topics in Computer Science: The Art of Information Retrieval. Chapter 4: Query Languages. Alexander Gelbukh, www.Gelbukh.com

  2. Previous Chapter • Main measures: precision and recall • Defined for sets • Rankings are evaluated through their initial subsets • Some measures combine the two into one • These involve user-defined preferences; in the F-measure the weights are set to 50-50 • Many other characteristics exist • An algorithm can be good at some and bad at others • Averages are used, but are not always meaningful • Reference collections with known answers exist for evaluating new algorithms
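
As a reminder of how the two main measures combine, here is a minimal Python sketch for the set case. The function name `precision_recall_f` and the `beta` parameter are illustrative assumptions, not taken from the slides; `beta=1.0` corresponds to the 50-50 weighting mentioned above.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Precision, recall, and F-measure for a retrieved set vs. a relevant set.

    beta=1.0 weights precision and recall equally (the 50-50 case).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 4, 6})
print(p, r, f)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were retrieved.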

  3. Previous chapter: research issues • Different types of interfaces; interactive systems: • What measures should be used? • How do people judge relevance? • How can “user satisfaction” be measured? Modeled?

  4. Query languages • Query language = the types of queries that can be posed • The types of queries depend on the IR model • Types: • IR (= ranked output) • Data retrieval • User-oriented • Low-level (= protocols) • Assume all pre-processing has been done • Thesaurus, stop-words, ... • (I think this must be a part of the language!) • Returns “documents” (chapter, paragraph, ...)

  5. In this chapter • Keyword-based languages • Pattern matching • Structure taken into account • Protocols

  6. Keyword-based languages: Single word • Intuitive, easy to express, fast ranking • Words can be highlighted in the output • What is a word? • Letters, separators • Non-splitting characters: the hyphen in “on-line” • The database decides • TF-IDF weights are designed for words • Used for the main models (Boolean, Vector, Probabilistic)
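
The question of what counts as a word can be made concrete with a small Python sketch. It assumes one particular choice, namely that a hyphen between alphanumeric runs is a non-splitting character; the names `TOKEN`, `tokenize`, and `tf_idf`, and the plain `tf * log(N / df)` weighting, are illustrative assumptions rather than anything prescribed by the slides.

```python
import math
import re
from collections import Counter

# Assumption: letters and digits form words; a hyphen between them does not split.
TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text):
    """Lower-case the text and return its words, keeping 'on-line' as one word."""
    return TOKEN.findall(text.lower())


def tf_idf(docs):
    """Return one {word: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights


print(tokenize("On-line retrieval, on line."))
```

A different database could just as well decide to split on the hyphen; the point is only that the choice has to be made somewhere before words are weighted.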

  7. Keyword-based languages: Context queries • Ensure that the words are related • Phrase • “enhance retrieval” • Allows separators and stopwords: “enhance the retrieval” • Proximity • “enhance the quality of information retrieval” • Distance: measured in words or letters. Order: same as in the query or not • Not clear how to rank • Research issue
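
The phrase and proximity queries above can be sketched in Python over a token list. This is a minimal illustration under stated assumptions: distance is counted in words, and the helper names `positions`, `phrase_match`, and `near` are hypothetical.

```python
from collections import defaultdict


def positions(tokens):
    """Map each word to the list of positions where it occurs."""
    index = defaultdict(list)
    for i, tok in enumerate(tokens):
        index[tok].append(i)
    return index


def phrase_match(tokens, phrase, stopwords=frozenset()):
    """True if the phrase words occur adjacent and in order, ignoring stopwords."""
    text = [t for t in tokens if t not in stopwords]
    words = [w for w in phrase if w not in stopwords]
    n = len(words)
    return n > 0 and any(text[i:i + n] == words for i in range(len(text) - n + 1))


def near(tokens, w1, w2, max_dist, ordered=False):
    """True if w1 and w2 occur within max_dist words of each other.

    With ordered=True, w1 must come before w2.
    """
    index = positions(tokens)
    for p1 in index.get(w1, []):
        for p2 in index.get(w2, []):
            gap = p2 - p1
            if ordered and gap <= 0:
                continue
            if 0 < abs(gap) <= max_dist:
                return True
    return False


text = "enhance the quality of information retrieval".split()
print(phrase_match(text, ["enhance", "retrieval"]), near(text, "enhance", "retrieval", 5))
```

Both functions return only yes or no, which reflects the open question on the slide: they say whether a document matches, not how it should be ranked.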

  8. Keyword-based languages: Boolean queries • Boolean expressions (can combine basic queries) • Query syntax tree • translation AND (syntax OR syntactic) • Operators map to operations on the sets • Result: set • OR, AND, e1 BUT e2 • NOT is not used: it could give (almost) all docs (= unsafe) • Good: can highlight occurrences, sort • Bad: difficult for the users • Remedy (?): fuzzy Boolean (see below) • Basic queries = keyword, pattern
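
The syntax-tree view of a Boolean query can be sketched as a recursive evaluation over document-id sets. The tuple encoding of the tree, the function name `evaluate`, and the toy index are assumptions made for illustration only.

```python
def evaluate(node, index):
    """Evaluate a query syntax tree against an inverted index.

    A leaf is a keyword (str); an inner node is a tuple (op, left, right)
    with op in {"AND", "OR", "BUT"}. The result is a set of document ids.
    """
    if isinstance(node, str):
        return set(index.get(node, ()))
    op, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if op == "AND":
        return a & b   # set intersection
    if op == "OR":
        return a | b   # set union
    if op == "BUT":
        return a - b   # set difference: in a but not in b
    raise ValueError(f"unknown operator: {op}")


# Toy inverted index: keyword -> ids of the documents containing it.
index = {"translation": {1, 2, 3}, "syntax": {2, 5}, "syntactic": {3, 4}}
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
print(evaluate(query, index))
```

BUT is safe where a bare NOT is not: its result is always a subset of the left operand, so it can never return (almost) the whole collection.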

  9. Keyword-based languages: Fuzzy Boolean, natural language • Fuzzy Boolean: OR and AND are relaxed into “some” • AND punishes for absence, OR encourages multiple matches • Natural ranking: how many times? • Natural language: OR = AND • BUT can be expressed (= penalty) • How to rank? Different ways • Vector space model • Query is a vector • A doc can be taken as a vector, too: relevance feedback! • Proximity is ignored • (Why? Research issue.)
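
Two of the ranking ideas above can be sketched in a few lines of Python: counting how many query words a document contains (one possible reading of “some”), and cosine similarity between raw term-count vectors. The names `some_score` and `cosine` and the use of unweighted counts are assumptions; a real system would typically use TF-IDF weights.

```python
import math
from collections import Counter


def some_score(query_tokens, doc_tokens):
    """Number of distinct query words present in the document."""
    return len(set(query_tokens) & set(doc_tokens))


def cosine(query_tokens, doc_tokens):
    """Cosine of the angle between the term-count vectors of query and document."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[w] * d[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


query = "query language retrieval".split()
docs = ["structured query language".split(), "image retrieval".split(), "cooking".split()]
ranked = sorted(docs, key=lambda doc: cosine(query, doc), reverse=True)
print([" ".join(doc) for doc in ranked])
```

Because `cosine` treats both arguments the same way, a whole document can be passed in as the query, which is the observation behind relevance feedback. Word order and proximity play no role in either score.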

  10. Pattern matching... • Pattern = sequence of features • A text segment matches the pattern • Types: • Words • Prefixes, suffixes, substrings: • comput-, -ters, -any flow- (many flowers) • Ranges • Imply some order, e.g., lexicographical = alphabetic • Allowing errors • Levenshtein (= edit) distance: historical / hysterical • Number of insertions, deletions, replacements. Threshold.
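
Matching with errors can be illustrated with the standard dynamic-programming computation of edit distance. The function names `levenshtein` and `matches_with_errors` and the threshold parameter `k` are illustrative assumptions.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # replace, or keep if equal
        prev = cur
    return prev[-1]


def matches_with_errors(word, pattern, k):
    """True if word is within k edit operations of pattern (the threshold)."""
    return levenshtein(word, pattern) <= k


print(levenshtein("historical", "hysterical"))
```

For the slide's example the two words differ by two replacements, so the pair matches with threshold 2 but not with threshold 1.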

  11. ...Pattern matching: types • Regular expressions • union = or: if e1, e2 are expressions, (e1 | e2) is too • concatenation: e1 e2 • repetition: e* (0 or more occurrences) • Extended patterns • user-friendly; can be internally converted into simple ones • case-insensitive, “anything” (wildcard), digit, vowel, ... • conditionals, optional parts • some parts match exactly and others with errors • etc.
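
The three basic operations and a few of the extended features can be tried directly with Python's standard `re` module. The specific patterns below are made-up examples, not patterns from the slides.

```python
import re

# Union: (e1 | e2)
union = re.compile(r"syntax|syntactic")

# Concatenation and repetition: 'a', then zero or more 'b', then 'c'
concat_repeat = re.compile(r"ab*c")

# Extended features: case-insensitive, optional part (u?), digit class (\d)
extended = re.compile(r"colou?r \d+", re.IGNORECASE)

# The optional part is shorthand that converts to a simple pattern: e? = (e | empty)
simple_equivalent = re.compile(r"colo(?:u|)r \d+", re.IGNORECASE)

for text in ["COLOUR 42", "color 7", "colr 1"]:
    print(text, bool(extended.fullmatch(text)), bool(simple_equivalent.fullmatch(text)))
```

Matching some parts exactly and others with errors is not something `re` offers, so that item on the slide is not covered by this sketch.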

  12. Structural queries • Old days: fields. No nesting, no overlap, fixed order • Email: subject, body, sender, ... • = Relational database with a text type, where text is treated the way text should be • Versions of SQL with text operators • Hypertext • Not well developed. Too free • WebGlimpse: search the neighborhood • Hierarchical • Intermediate level of freedom • Volumes, chapters, sections, paragraphs, sentences, ...

  13. [Figure: fields (too fixed), hypertext (too free), hierarchical structure (intermediate)]

  14. Hierarchical Models ... • PAT expressions • Hierarchy is defined at query time. • Regions are included in the index, e.g., sections, italics, ... • Different types of regions can overlap, same type can’t • Can query for words in a region, regions in a region, etc. • Complex computation, unclear semantics • Overlapped lists • Evolution of PAT: areas of same type can overlap (not nest) • Uses same inverted file • Can combine regions, specify order, ... • n-words: all (overlapping) areas of n words.
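
Region queries of the kind described above can be sketched with regions stored as (start, end) word-position intervals. This is a simplified illustration, not the PAT or overlapped-lists algorithms themselves; the function names and the sample regions are assumptions.

```python
def regions_containing(regions, word_positions):
    """Regions (start, end) that contain at least one occurrence of a word."""
    return [(s, e) for (s, e) in regions
            if any(s <= p <= e for p in word_positions)]


def included_in(inner, outer):
    """Regions of `inner` that lie entirely inside some region of `outer`."""
    return [(s, e) for (s, e) in inner
            if any(o_s <= s and e <= o_e for (o_s, o_e) in outer)]


def n_word_areas(text_length, n):
    """All (overlapping) areas of exactly n consecutive words."""
    return [(i, i + n - 1) for i in range(text_length - n + 1)]


sections = [(0, 9), (10, 19)]   # same type: no overlap
italics = [(3, 5), (8, 12)]     # different type: may overlap a section boundary
print(regions_containing(sections, [4]))
print(included_in(italics, sections))
print(n_word_areas(4, 3))
```

The italic region (8, 12) straddles two sections, so it is not included in either; this kind of case is one reason the semantics of such queries needs care.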

  15. Overlapping lists

  16. ... Hierarchical Models ... • List of references • Answers are references (pointers) to regions • Only one type of regions (e.g., only sections). No nesting. • Known at index time • Ancestry of nodes. Can query paths • Proximal nodes • Compromise between expressiveness and efficiency • Many (overlapping) fixed hierarchies • Interesting queries: “3rd paragraph of each chapter”, ...
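
A query such as “3rd paragraph of each chapter” can be sketched over a fixed hierarchy known at index time. The `Node` class and the function `nth_child_of_each` are hypothetical names for illustration; the actual proximal-nodes model works on indexed regions rather than an in-memory tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # e.g. "book", "chapter", "paragraph"
    text: str = ""
    children: list = field(default_factory=list)


def nth_child_of_each(root, parent_kind, child_kind, n):
    """Return the n-th (1-based) child of kind child_kind under every parent_kind node."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.kind == parent_kind:
            kids = [c for c in node.children if c.kind == child_kind]
            if len(kids) >= n:
                found.append(kids[n - 1])
        # Push children in reverse so they are visited in document order.
        stack.extend(reversed(node.children))
    return found


book = Node("book", children=[
    Node("chapter", children=[Node("paragraph", t) for t in ("p1", "p2", "p3")]),
    Node("chapter", children=[Node("paragraph", t) for t in ("q1", "q2")]),
])
print([n.text for n in nth_child_of_each(book, "chapter", "paragraph", 3)])
```

The answer is a list of references to nodes, and a chapter with fewer than three paragraphs simply contributes nothing.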

  17. Proximal nodes

  18. ... Hierarchical Models • Tree matching • Query is a tree. Match the text tree. • Ordered or unordered trees (are siblings ordered?) • Prolog-like constraints on different parts of the tree • Variables • Answer: root of a match • Very inefficient (usually NP-hard) • Due to variables and unordered matching
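
The difference between ordered and unordered tree matching can be seen in a small Python sketch. It is a simplification under stated assumptions: trees are `(label, children)` tuples, a label of `None` acts as a wildcard rather than a true Prolog-like variable, and the function names `matches` and `find_matches` are hypothetical. The unordered case tries permutations of children, which hints at where the cost comes from.

```python
from itertools import permutations


def matches(query, node, ordered=True):
    """True if the query tree matches at this node of the text tree.

    Every query child must match a distinct child of the node; extra
    children of the node are allowed.
    """
    q_label, q_kids = query
    n_label, n_kids = node
    if q_label is not None and q_label != n_label:
        return False
    if len(q_kids) > len(n_kids):
        return False
    if ordered:
        # Query children must match node children in the same left-to-right order.
        i = 0
        for kid in n_kids:
            if i < len(q_kids) and matches(q_kids[i], kid, ordered):
                i += 1
        return i == len(q_kids)
    # Unordered: try every assignment of query children to distinct node children.
    return any(all(matches(q, k, ordered) for q, k in zip(q_kids, perm))
               for perm in permutations(n_kids, len(q_kids)))


def find_matches(query, node, ordered=True):
    """Return the roots of all matches in the text tree."""
    found = [node] if matches(query, node, ordered) else []
    for kid in node[1]:
        found.extend(find_matches(query, kid, ordered))
    return found


text = ("chapter", [("title", []),
                    ("section", [("figure", []), ("para", [])])])
query = ("section", [("para", []), ("figure", [])])
print(len(find_matches(query, text, ordered=True)),
      len(find_matches(query, text, ordered=False)))
```

The query asks for a paragraph followed by a figure, while the text has them in the opposite order, so only the unordered match succeeds.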

  19. Research issues in hierarchical models • Static or dynamic? • Define the hierarchy at index time or at query time? • Static: text markup. Dynamic: tags, indexed. • Restrictions on the structure • Restrict the structure or restrict the query language • For efficiency • Integration with text • Which is of secondary importance: structure (in IR) or text (in DB)? • Combine • Query language • Standardization, expressiveness taxonomy, categorization

  20. Query protocols • Used internally • Standard: one client can query different libraries • On CD-ROMs: disk interchangeability • Z39.50: bibliographic (used for other types, too) • WAIS (Wide Area Information Service) • Includes Z39.50 • For CD-ROMs: • CCL, Common Command Language • CD-RDx (Compact Disk Read only Data Exchange) • SFQL (Structured Full-text Query Language). Like DB.

  21. Types of queries we have discussed

  22. Trends and research topics • Models: to better understand the user needs • Query languages: flexibility, power, expressiveness, functionality • Visual languages • Example: library shown on the screen. Act: take books, open catalogs, etc. • Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!

  23. Conclusions • Breadth-wise: • words, phrases, proximity, fuzzy Boolean, natural language • Depth-wise: • Pattern matching • If queries return sets, the results can be combined using the Boolean model • Combining with structure • Hierarchical structure • Standardized low-level languages: protocols • Reusable

  24. Thank you! Till October 16. October 23: midterm exam
