Special Topics in Computer Science The Art of Information Retrieval Chapter 2: Modeling

Special Topics in Computer Science The Art of Information Retrieval Chapter 2: Modeling Alexander Gelbukh www.Gelbukh.com

Previous chapter • User Information Need • Vague • Semantic, not formal • Document Relevance • Order, not retrieve • Huge amount of information • Efficiency concerns • Tradeoffs • Art more than science

Modeling • Still science: computation is formal • No good methods to work with (vague) semantics • Thus, simplify to get a (formal) model • Develop (precise) math over this (simple) model Why math if the model is not precise (simplified)? With math, only the first step is approximate: phenomenon ≈ model = step 1 = step 2 = ... = result. Without math, every step adds error: phenomenon ≈ model ≈ step 1 ≈ step 2 ≈ ... ≈ ?!

Modeling in IR: idea • Tag documents with fields • As in a (relational) DB: customer = {name, age, address} • Unlike DB, very many fields: individual words! • E.g., bag of words: {word1, word2, ...}: {3, 5, 0, 0, 2, ...} • Define a similarity measure between query and such a record • Unlike DB, order, not retrieve (yes/no) • Justify your model (optional, but nice) • Develop math and algorithms for fast access • as relational algebra in DB
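The bag-of-words record mentioned on this slide can be sketched in a few lines. This is only an illustration: the vocabulary, the sample sentence, and the function name `bag_of_words` are made up here, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that never occur in the text.
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["cat", "dog", "mouse", "house"]
doc = "The cat chased the mouse and the cat slept"

vector = bag_of_words(doc, vocabulary)
print(vector)  # [2, 0, 1, 0]
```

Each document becomes a record with one "field" per word, which is what lets a similarity measure compare it with a query.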

Taxonomy of IR systems

Aspects of an IR system • IR model • Boolean, Vector, Probabilistic • Logical view of documents • Full text, bag of words, ... • User task • retrieval, browsing Independent, though some are more compatible

Taxonomy of IR models • Boolean (set theoretic) • fuzzy • extended • Vector (algebraic) • generalized vector • latent semantic indexing • neural network • Probabilistic • inference network • belief network

Taxonomy of other aspects Text structure • Non-overlapping lists • Proximal nodes model Browsing • Flat • Structure guided • hypertext

Appropriate models

Retrieval operation mode • Ad-hoc • static documents • interactive • ordered • Filtering (≈ ad-hoc on new docs) • changing document collection • notification • not interactive • machine learning techniques can be used • yes/no

Characterization of an IR model • D = {dj}, collection of formal representations of docs • e.g., keyword vectors • Q = {qi}, possible formal representations of user information need (queries) • F, framework for modeling these two: it justifies the ranking function • R(qi,dj): Q × D → ℝ, ranking function • defines ordering

Specific IR models

IR models • Classical • Boolean • Vector • Probabilistic (clear ideas, but some disadvantages) • Refined • Each one with refinements • Solve many of the problems of the “basic” models • Give good examples of possible developments in the area • Not investigated well • We can work on this

Basic notions • Document: set of index terms • Mainly nouns • Maybe all words, then full text logical view • Term weights • some terms are better than others • terms less frequent in this doc and more frequent in other docs are less useful • Document → index term vector {w1j, w2j, ..., wtj} • weights of terms in the doc • t is the number of distinct index terms in the whole collection • weights of different terms are independent (simplification)

Boolean model • Weights ∈ {0, 1} • Doc: set of words • Query: Boolean expression • R(qi,dj) ∈ {0, 1} • Good: • clear semantics, neat formalism, simple • Bad: • no ranking (≈ data retrieval), retrieves too many or too few • difficult to translate User Information Need into query • No term weighting
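A minimal sketch of Boolean retrieval, assuming documents are plain sets of words and a query is a nested tuple such as `("and", "mouse", ("not", "cat"))`. The tuple encoding, the function name `matches`, and the sample documents are assumptions made for this illustration.

```python
# Hypothetical collection: each document is just a set of words.
docs = {
    "d1": {"cat", "mouse"},
    "d2": {"mouse", "cheese"},
    "d3": {"cat", "dog"},
}

def matches(query, terms):
    """Evaluate a nested-tuple Boolean query against a set of terms."""
    if isinstance(query, str):
        return query in terms          # a bare term: weight 1 if present, else 0
    op, *args = query
    if op == "not":
        return not matches(args[0], terms)
    if op == "and":
        return all(matches(a, terms) for a in args)
    if op == "or":
        return any(matches(a, terms) for a in args)
    raise ValueError(f"unknown operator: {op}")

query = ("and", "mouse", ("not", "cat"))
result = sorted(name for name, terms in docs.items() if matches(query, terms))
print(result)  # ['d2']
```

The answer is a set, not a ranking: every document either satisfies the expression or it does not.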

Vector model • Weights (non-binary) • Ranking, much better results (for User Info Need) • R(qi,dj) = correlation between query vector and doc vector • E.g., cosine measure: (there is a typo in the book)
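The cosine measure is commonly defined as the dot product of the query and document vectors divided by the product of their lengths. The sketch below uses that standard definition; the weights and document names are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                     # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical term weights over a three-term vocabulary.
query = [1.0, 1.0, 0.0]
docs = {
    "d1": [2.0, 2.0, 0.0],
    "d2": [1.0, 0.0, 1.0],
    "d3": [0.0, 0.0, 3.0],
}

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

Note that d1 scores highest although its weights differ from the query's: only the direction of the vector matters, not its length.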

Projection

Weights • How are the weights wij obtained? Many variants. One way: TF-IDF balance • TF: Term frequency • How well is the term related to the doc? • If it appears many times, it is important • Proportional to the number of times it appears • IDF: Inverse document frequency • How important is the term for distinguishing documents? • If it appears in many docs, it is not important • Inversely proportional to the number of docs where it appears • Contradictory. How to balance?

TF-IDF ranking • TF: Term frequency • IDF: Inverse document frequency • Balance: TF × IDF • Other formulas exist. Art.
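One common way to balance the two factors, shown here as an assumption because other formulas exist, is to normalize the raw count by the largest count in the document and multiply by log(N / n_i). Here N is the number of documents and n_i is the number of documents containing term i. The function name `tf_idf` and the toy collection are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Map a list of non-empty token lists to a list of {term: weight} dicts."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values())
        weights.append({
            # TF (normalized count) times IDF (log of inverse document frequency).
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized documents.
docs = [
    ["cat", "cat", "mouse"],
    ["cat", "dog"],
    ["cat", "cheese", "cheese"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy collection "cat" occurs in every document, so its IDF is log(1) = 0 and it gets weight 0 everywhere, however often it appears.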

Advantages of vector model One of the best known strategies • Improves quality (term weighting) • Allows approximate matching (partial matching) • Gives ranking by similarity (cosine formula) • Simple, fast But: • Does not consider term dependencies • considering them in a bad way hurts quality • no known good way • No logical expressions (e.g., negation: “mouse & NOT cat”)

Probabilistic model • Assumptions: • set of “relevant” docs, • probabilities of docs to be relevant • After Bayes calculation: probabilities of terms to be important for defining relevant docs • Initial idea: interact with the user. • Generate an initial set • Ask the user to mark some of them as relevant or not • Estimate the probabilities of keywords. Repeat • Can be done without user • Just re-calculate the probabilities assuming the user’s acceptance is the same as predicted ranking
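The loop without a user can be sketched as follows. This is a rough illustration under stated assumptions: binary term weights, a log-odds score in the style of the classic binary independence formulation, and estimates smoothed with 0.5 so that no probability is exactly 0 or 1. The names `estimate` and `rank`, the toy collection, and the choice of the top 2 documents as "assumed relevant" are all made up here.

```python
import math

def estimate(docs, query, assumed_relevant=()):
    """Estimate P(term | relevant) and P(term | non-relevant) for query terms."""
    n, v = len(docs), len(assumed_relevant)
    p_rel, p_nonrel = {}, {}
    for term in query:
        df = sum(term in terms for terms in docs.values())      # docs with the term
        v_t = sum(term in docs[d] for d in assumed_relevant)    # ... among assumed relevant
        # The 0.5 smoothing is an assumption of this sketch.
        p_rel[term] = (v_t + 0.5) / (v + 1)
        p_nonrel[term] = (df - v_t + 0.5) / (n - v + 1)
    return p_rel, p_nonrel

def rank(docs, query, p_rel, p_nonrel):
    """Order doc ids by the summed log-odds weights of the query terms they contain."""
    def score(terms):
        return sum(
            math.log(p_rel[t] / (1 - p_rel[t]))
            + math.log((1 - p_nonrel[t]) / p_nonrel[t])
            for t in query
            if t in terms
        )
    return sorted(docs, key=lambda d: score(docs[d]), reverse=True)

# Hypothetical collection.
docs = {
    "d1": {"cat", "mouse"}, "d2": {"cat"}, "d3": {"dog"},
    "d4": {"cheese"}, "d5": {"dog", "cheese"}, "d6": {"bird"},
}
query = ["cat", "mouse"]

# Round 1: initial guess (no relevance information yet).
p_rel, p_nonrel = estimate(docs, query)
first = rank(docs, query, p_rel, p_nonrel)

# Round 2: pretend the user accepted the top 2 documents, then re-estimate.
p_rel, p_nonrel = estimate(docs, query, assumed_relevant=first[:2])
second = rank(docs, query, p_rel, p_nonrel)
print(first[:2], second[:2])
```

With an empty set of assumed-relevant documents the estimate reduces to P(term | relevant) = 0.5 for every query term, which plays the role of the initial guess.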

(Dis)advantages of Probabilistic model Advantage: • Theoretical adequacy: ranks by probabilities Disadvantages: • Need to guess the initial ranking • Binary weights, ignores frequencies • Independence assumption (not clear if bad) Does not perform well (?)

Alternative Set Theoretic models: Fuzzy set model • Takes into account term relationships (thesaurus) • Bible is related to Church • Fuzzy belonging of a term to a document • Document containing Bible also contains “a little bit of” Church, but not entirely • Fuzzy set logic applied to such fuzzy belonging • logical expressions with AND, OR, and NOT • Provides ranking, not just yes/no • Not investigated well. • Why not investigate it?
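A sketch of fuzzy belonging, assuming the term-term relationship is estimated from co-occurrence rather than taken from a hand-built thesaurus. The correlation used here is the number of documents containing both terms divided by the number containing either, and the membership of a term in a document is 1 minus the product of (1 - correlation) over the document's terms. This is one known formulation, not necessarily the one the slide has in mind; function names and data are illustrative.

```python
def correlation(docs, a, b):
    """Co-occurrence correlation of two terms over a list of term sets."""
    n_a = sum(a in d for d in docs)
    n_b = sum(b in d for d in docs)
    n_ab = sum(a in d and b in d for d in docs)
    union = n_a + n_b - n_ab
    return n_ab / union if union else 0.0

def membership(docs, term, doc):
    """Degree (0..1) to which `term` belongs to `doc`, via related terms."""
    product = 1.0
    for other in doc:
        product *= 1.0 - correlation(docs, term, other)
    return 1.0 - product

# Hypothetical collection.
docs = [
    {"bible", "church"},
    {"bible", "church", "priest"},
    {"bible", "history"},
    {"football"},
]

# docs[2] does not contain "church", but it contains the related term "bible".
print(membership(docs, "church", docs[2]))
```

A document that contains the term itself gets membership 1, while a document with only unrelated terms gets 0, so the values can be combined with fuzzy AND, OR, and NOT to produce a ranking.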

Alternative Set Theoretic models: Extended Boolean model • Combination of Boolean and Vector • In comparison with Boolean model, adds “distance from query” • some documents satisfy the query better than others • In comparison with Vector model, adds the distinction between AND and OR combinations • There is a parameter (degree of norm) that adjusts the behavior between Boolean-like and Vector-like • This can even be different within one query • Not investigated well. Why not investigate it?
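The effect of the norm parameter can be sketched with the usual p-norm formulas, taken here as an assumption since the slide does not spell them out. For term weights x1..xm in [0, 1], the OR score is the p-norm distance from the all-zeros point and the AND score is 1 minus the p-norm distance from the all-ones point, each normalized by m. The names `sim_or` and `sim_and` and the sample weights are illustrative.

```python
def sim_or(weights, p):
    """OR query: normalized p-norm distance from the point (0, ..., 0)."""
    m = len(weights)
    return (sum(w ** p for w in weights) / m) ** (1 / p)

def sim_and(weights, p):
    """AND query: 1 minus the normalized p-norm distance from (1, ..., 1)."""
    m = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / m) ** (1 / p)

# Hypothetical weights of the two query terms in one document.
weights = [0.8, 0.2]

for p in (1, 2, 100):
    print(p, round(sim_or(weights, p), 3), round(sim_and(weights, p), 3))
```

At p = 1 both scores equal the plain average, so AND and OR are indistinguishable (vector-like). As p grows, OR approaches the largest weight and AND the smallest (Boolean-like).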

Alternative Algebraic models: Generalized Vector Space model • Classical independence assumptions: • All combinations of terms are possible, none are equivalent (= basis in the vector space) • Pair-wise orthogonal: cos ({ki}, {kj}) = 0 • This model relaxes the pair-wise orthogonality: cos ({ki}, {kj}) ≠ 0 • Operates by combinations (co-occurrences) of index terms, not individual terms • More complex, more expensive, not clear if better • Not investigated well. Why not investigate it?

Alternative Algebraic models: Latent Semantic Indexing model • Index by larger units, “concepts” ≈ sets of terms used together • Retrieve a document that shares concepts with a relevant one (even if it does not contain query terms) • Group index terms together (map into lower dimensional space). So some terms are equivalent. • Not exactly, but this is the idea • Eliminates unimportant details • Depends on a parameter (what details are unimportant?) • Not investigated well. Why not investigate it?

Alternative Algebraic models: Neural Network model • NNs are good at matching • Iteratively uses the found documents as auxiliary queries • Spreading activation • Terms → docs → terms → docs → terms → docs → ... • Like a built-in thesaurus • First round gives same result as Vector model • No evidence if it is good • Not investigated well. Why not investigate it?

Alternative Probabilistic models: Bayesian Inference Network model (One of the authors of the book worked on this. In fact not so important) • Probability as belief (not as frequency) • Belief in importance of terms. Query terms have 1.0 • Similar to Neural Net • Documents found increase the importance of their terms • Thus act as new queries • But different propagation formulas • Flexible in combining sources of evidence • Can be applied to different ranking strategies (Boolean or TF-IDF) • Good quality of results (Warning! Authors work on this)

Alternative Probabilistic models: Belief Network model (Introduced by one of the authors of the book.) • Better network topology • Separation of document and term space • More general than Inference model • Bayesian network models in general: • do not include cycles and thus have linear complexity • unlike Neural Nets • Combine distinct evidence sources (also user feedback) • Are a neat formalism. • Better alternative to combinations of Boolean and Vector

Models for structured text • Cat in the 3rd chapter. Cat in same paragraph as Dog • Non-overlapping lists • Chapters, sections, paragraphs – as regions • Technically treated much like terms (ranges of positions) • Sections containing Cat • Proximal nodes model (suggested by the authors) • Chapters, sections, paragraphs – as objects (nodes)

Models for browsing • Flat browsing • Just a list of papers • No context cues provided • Structure guided • Hierarchy • Like directory tree in the computer • Hypertext (Internet!) • No limitations of sequential writing • Modeled by a directed graph: links from unit A to unit B • units: docs, chapters, etc. • A map (with traversed path) can be helpful

The Web • Internet • Not hypertext • Authors use “hypertext” only for a well-organized hypertext • Internet: not a repository but a heap of information

Research issues • How do people judge relevance? • ranking strategies • How to combine different sources of evidence? • What interfaces can help users to understand and formulate their Information Need? • user interfaces: an open issue • Meta-search engines: combine results from different Web search engines • Their results almost do not intersect • How to combine rankings?

Conclusions • Modeling is needed for formal operations • Boolean model is the simplest • Vector model is the best combination of quality and simplicity • TF-IDF term weighting • This (or similar) weighting is used in all further models • Many interesting and not well-investigated variations • possible future work

Thank you! Till October 2
