WEB BAR 2004 Advanced Retrieval and Web Mining


  1. WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 14

  2. Today’s Topics • Latent Semantic Indexing / Dimension reduction • Interactive information retrieval / User interfaces • Evaluation of interactive retrieval

  3. How LSI is used for Text Search • LSI is a technique for dimension reduction • Similar to Principal Component Analysis (PCA) • Addresses (near-)synonymy: car/automobile • Attempts to enable concept-based retrieval • Pre-process docs using a technique from linear algebra called Singular Value Decomposition (SVD) • The choice of reduced dimensionality is a trade-off: • Fewer dimensions: more "collapsing of axes", better recall, worse precision • More dimensions: less collapsing, worse recall, better precision • Queries are handled in this new (reduced) vector space.

  4. Input: Term-Document Matrix (figure: an m × n matrix with terms ti as rows and documents dj as columns) • wi,j = (normalized) weighted count of (ti, dj) • Key idea: Factorize this matrix
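
A minimal sketch of building such a term-document matrix (the toy corpus and the choice of scikit-learn's TfidfVectorizer for the normalized weights are illustrative assumptions, not from the slides):

```python
# Sketch: build an m x n term-document matrix A with normalized weights.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the car was parked in the garage",
    "an automobile engine needs regular maintenance",
    "good recipes for a quick dinner tonight",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)       # n docs x m terms (sparse)
A = X.T.toarray()                        # transpose: m terms x n docs

print(A.shape)                           # (m, n)
print(list(vectorizer.get_feature_names_out())[:5])
```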

  5. Matrix Factorization • A = W × H: the m × n matrix A factors into an m × k basis W and a k × n representation H • hj (column j of H) is the representation of dj in terms of the basis W • If rank(W) ≥ rank(A), then we can always find H so that A = WH • Notice the duality of the problem • More "semantic" dimensions -> LSI (latent semantic indexing)

  6. Minimization Problem • Minimize ||A − WSVᵀ||, i.e., minimize the information loss • Given: • a norm (for SVD, the 2-norm) • constraints on W, S, V (for SVD, W and V are orthonormal, and S is diagonal)

  7. Matrix Factorizations: SVD (figure: A = W × S × Vᵀ, with W the m × k basis, S the k × k matrix of singular values, and Vᵀ the k × n representation) • Restrictions on the representation: W, V orthonormal; S diagonal
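
A quick numerical check of this factorization (my own illustration using NumPy; note that numpy.linalg.svd returns Vᵀ directly):

```python
# Sketch: verify A = W S V^T with W, V orthonormal and S diagonal.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                     # toy m x n term-document matrix

W, s_vals, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s_vals)                        # singular values, descending

print(np.allclose(A, W @ S @ Vt))          # True: A = W S V^T
print(np.allclose(W.T @ W, np.eye(4)))     # True: W has orthonormal columns
print(np.allclose(Vt @ Vt.T, np.eye(4)))   # True: V has orthonormal columns
```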

  8. Dimension Reduction • For some s << Rank, zero out all but the s biggest singular values in S. • Denote by Ss this new version of S. • Typically s is in the hundreds, while r (Rank) could be in the (tens of) thousands. • Before: A = W S Vᵀ • Let As = W Ss Vᵀ = Ws Ss Vsᵀ • As is a good approximation to A • Best rank-s approximation according to the 2-norm

  9. Dimension Reduction (figure: As = W × Ss × Vᵀ, with all but the s largest singular values in Ss zeroed out) • The columns of As represent the docs, but in s << m dimensions • Best rank-s approximation according to the 2-norm
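
A sketch of the truncation step under the same toy setup (s = 2 is an arbitrary choice for illustration):

```python
# Sketch: zero out all but the s largest singular values to get A_s,
# the best rank-s approximation to A in the 2-norm.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                            # toy m x n matrix
W, s_vals, Vt = np.linalg.svd(A, full_matrices=False)

s = 2                                             # assumed: keep s << rank dims
A_s = W[:, :s] @ np.diag(s_vals[:s]) @ Vt[:s, :]  # A_s = W_s S_s V_s^T

# The 2-norm error equals the first discarded singular value.
print(np.linalg.norm(A - A_s, 2), s_vals[s])
```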

  10. More on W and V • Recall the m × n matrix of terms × docs, A. • Define the term-term correlation matrix T = AAᵀ • Aᵀ denotes the matrix transpose of A. • T is a square, symmetric m × m matrix. • The doc-doc correlation matrix is D = AᵀA. • D is a square, symmetric n × n matrix. Why?

  11. Eigenvectors • Denote by W the m × r matrix of eigenvectors of T. • Denote by V the n × r matrix of eigenvectors of D. • Denote by S the diagonal matrix of the square roots of the eigenvalues of T = AAᵀ, in sorted order. • It turns out that A = WSVᵀ is the SVD of A • Semi-precise intuition: The new dimensions are the principal components of term correlation space.
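
A small numerical check of this relationship (my own illustration): the eigenvalues of T = AAᵀ are the squares of the singular values of A.

```python
# Sketch: the eigenvalues of T = A A^T are the squares of A's singular values.
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((5, 3))                                 # toy m x n matrix

_, s_vals, _ = np.linalg.svd(A, full_matrices=False)   # descending order
eigvals = np.linalg.eigvalsh(A @ A.T)[::-1]            # sort descending

print(np.allclose(eigvals[:3], s_vals ** 2))           # True
```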

  12. Query processing • Exercise: How do you map the query into the reduced space?
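
One standard answer from the LSI literature is "folding in": project the query q into the reduced space the same way document columns are represented, q̂ = Ss⁻¹ Wsᵀ q, then rank documents by similarity there. A hedged sketch (toy data, my own illustration):

```python
# Sketch: "fold" a query q from m-dimensional term space into the
# s-dimensional LSI space via q_hat = S_s^{-1} W_s^T q, then rank docs.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                            # toy m x n matrix
W, s_vals, Vt = np.linalg.svd(A, full_matrices=False)
s = 2
W_s, S_s = W[:, :s], np.diag(s_vals[:s])
doc_reps = Vt[:s, :].T                            # each row: a doc in s dims

q = np.zeros(6)                                   # toy query (assumed):
q[0] = 1.0                                        # uses only term 0
q_hat = np.linalg.inv(S_s) @ W_s.T @ q            # query in reduced space

# Cosine similarity between q_hat and each document representation.
sims = doc_reps @ q_hat / (
    np.linalg.norm(doc_reps, axis=1) * np.linalg.norm(q_hat))
print(np.argsort(-sims))                          # most to least similar docs
```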

  13. Take Away • LSI is optimal: the optimal solution for a given dimensionality • Caveat: mathematically optimal is not necessarily "semantically" optimal. • LSI is unique • Up to signs and ties among equal singular values • Key benefits of LSI • Enhances recall, addresses the synonymy problem • But can decrease precision • Maintenance challenges • Changing collections • Recompute at intervals? • Performance challenges • Cheaper alternatives for recall enhancement • E.g., pseudo-relevance feedback • Use of LSI in deployed systems: Why?

  14. Resources: LSI • Random projection theorem: http://citeseer.nj.nec.com/dasgupta99elementary.html • Faster random projection: http://citeseer.nj.nec.com/frieze98fast.html • Latent semantic indexing: http://citeseer.nj.nec.com/deerwester90indexing.html http://cs276a.stanford.edu/handouts/fsnlp-svd.pdf • Books: FSNLP 15.4, MG 4.6, MIR 2.7.2.

  15. Interactive Information Retrieval: User Interfaces

  16. The User in Information Access (diagram: the information access loop) • Information need -> formulate/reformulate query -> find starting point -> send to system -> receive results -> explore results -> done? • If no, reformulate; if yes, stop

  17. Main Focus of Information Retrieval (the same information access loop diagram) • The "send to system -> receive results" step is the focus of most IR!

  18. Information Access in Context (diagram) • High-level goal -> analyze -> information access -> synthesize -> done? • If no, repeat; if yes, stop

  19. The User in Information Access (the information access loop diagram, repeated to introduce the next stage)

  20. Queries on the Web: Most Frequent on 2002/10/26 (table of the most frequent queries; not recoverable from the transcript)

  21. Queries on the Web (2000) (chart of query topic frequencies) • Why only 9% sex?

  22. Intranet Queries (Aug 2000) • 3351 bearfacts • 3349 telebears • 1909 extension • 1874 schedule+of+classes • 1780 bearlink • 1737 bear+facts • 1468 decal • 1443 infobears • 1227 calendar • 989 career+center • 974 campus+map • 920 academic+calendar • 840 map • 773 bookstore • 741 class+pass • 738 housing • 721 tele-bears • 716 directory • 667 schedule • 627 recipes • 602 transcripts • 582 tuition • 577 seti • 563 registrar • 550 info+bears • 543 class+schedule • 470 financial+aid • Source: Ray Larson

  23. Intranet Queries • Summary of sample data from 3 weeks of UCB queries • 13.2% Telebears/BearFacts/InfoBears/BearLink (12297) • 6.7% Schedule of classes or final exams (6222) • 5.4% Summer Session (5041) • 3.2% Extension (2932) • 3.1% Academic Calendar (2846) • 2.4% Directories (2202) • 1.7% Career Center (1588) • 1.7% Housing (1583) • 1.5% Map (1393) Source: Ray Larson

  24. Types of Information Needs • Need an answer to a question (who won the Super Bowl?) • Re-find a particular document • Find a good recipe for tonight's dinner • Exploration of a new area (browse sites about Mexico City) • Authoritative summary of information (HIV review) • In most cases, only one interface! • Cell phone / PDA / camera / MP3 player analogy

  25. The User in Information Access (the information access loop diagram, repeated to introduce the next stage)

  26. Find Starting Point By Browsing (diagram: from an entry point, the user browses across linked nodes to a starting point for search, or the answer?)

  27. Hierarchical browsing (diagram: a tree browsed from Level 0 through Level 1 to Level 2)

  28. Visual Browsing: Hyperbolic Tree

  29. Visual Browsing: Hyperbolic Tree

  30. Visual Browsing: Themescape

  31. Scatter/Gather • Scatter/gather allows the user to find a set of documents of interest through browsing. • It iterates: • Scatter • Take the collection and scatter it into n clusters. • Gather • Pick the clusters of interest and merge them.
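
A minimal sketch of one scatter/gather iteration; the use of k-means over TF-IDF vectors and the hard-coded cluster picks are my own stand-ins, not the original system's clustering:

```python
# Sketch: one scatter/gather iteration -- scatter docs into n clusters,
# then gather (merge) the clusters the user picks, and scatter again.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def scatter(docs, n_clusters):
    """Cluster docs; return a dict cluster_id -> list of docs."""
    X = TfidfVectorizer().fit_transform(docs)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    clusters = {}
    for doc, label in zip(docs, labels):
        clusters.setdefault(label, []).append(doc)
    return clusters

def gather(clusters, picked):
    """Merge the clusters the user picked into one working set."""
    return [doc for cid in picked for doc in clusters[cid]]

# Hypothetical usage (load_collection is an assumed helper, not shown):
# docs = load_collection()
# clusters = scatter(docs, n_clusters=5)
# working_set = gather(clusters, picked=[0, 2])   # user picks clusters 0, 2
# clusters = scatter(working_set, n_clusters=5)   # iterate
```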

  32. Scatter/Gather

  33. Browsing vs. Searching • Browsing and searching are often interleaved. • Information-need dependent • Open-ended (find information about Mexico City) -> browsing • Specific (who won the Super Bowl?) -> searching • User dependent • Some users prefer searching, others browsing (confirmed in many studies: some hate to type) • Advantage of browsing: You don't need to know the vocabulary of the collection • Compare to the physical world • Browsing vs. searching in a grocery store

  34. Browsers vs. Searchers • 1/3 of users do not search at all • 1/3 rarely search • Or type URLs only • Only 1/3 understand the concept of search • (ISP data from 2000) Why?

  35. Starting Points • Methods for finding a starting point • Select collections from a list • Highwire press • Google! • Hierarchical browsing, directories • Visual browsing • Hyperbolic tree • Themescape, Kohonen maps • Browsing vs searching

  36. The User in Information Access (the information access loop diagram, repeated to introduce the next stage)

  37. Form-based Query Specification (Infoseek) Credit: Marti Hearst

  38. Boolean Queries • Boolean logic is difficult for the average user. • Some interfaces for average users support formulation of Boolean queries • The current view is that non-expert users are best served with non-Boolean or simple +/- Boolean queries (pioneered by AltaVista). • But Boolean queries are the standard for certain groups of expert users (e.g., lawyers).
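
A sketch of how such a simple +/- query could be evaluated over an inverted index (the toy index and query syntax are illustrative assumptions, not AltaVista's actual implementation):

```python
# Sketch: evaluate a simple +/- query against a toy inverted index.
# +term: document must contain it; -term: document must not.
index = {                                # assumed toy inverted index
    "standard": {1, 2, 3},
    "user":     {1, 3},
    "dlink":    {1},
    "card":     {2, 3},
}
all_docs = {1, 2, 3}

def run_query(query):
    results = set(all_docs)
    for token in query.split():
        if token.startswith("+"):
            results &= index.get(token[1:], set())   # must contain
        elif token.startswith("-"):
            results -= index.get(token[1:], set())   # must not contain
    return results

print(run_query("+standard +user -card"))   # {1}
```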

  39. Direct Manipulation Spec.: VQUERY (Jones 98) Credit: Marti Hearst

  40. One Problem With Boolean Queries: Feast or Famine • Specifying a well-targeted query is hard; this is a bigger problem for Boolean queries. • Google: 1860 hits for "standard user dlink 650", 0 hits after adding "no card found" • (chart: feast vs. famine as a function of how general the query is)

  41. Boolean Queries • Summary • Complex Boolean queries are difficult for the average user • Feast-or-famine problem • Prior to Google, many IR researchers thought Boolean queries were a bad idea. • Google queries are strict conjunctions. • Why is this working well?

  42. Parametric search example Notice that the output is a (large) table. Various parameters in the table (column headings) may be clicked on to effect a sort.

  43. Parametric search example We can add text search.

  44. Parametric search • Each document has, in addition to text, some “meta-data” e.g., Make, Model, City, Color • A parametric search interface allows the user to combine a full-text query with selections on these parameters
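
A minimal sketch of this combination (the record layout, matching rules, and field names are my own illustration):

```python
# Sketch: parametric search = full-text query AND exact-match metadata filters.
cars = [  # assumed toy collection with text plus meta-data fields
    {"text": "low mileage, one owner, garage kept",
     "make": "Honda", "model": "Civic", "city": "Palo Alto", "color": "red"},
    {"text": "needs new brakes, runs well",
     "make": "Ford", "model": "Focus", "city": "Berkeley", "color": "red"},
]

def parametric_search(docs, text_query, **params):
    """Return docs whose text contains every query term and whose
    meta-data fields equal the selected parameter values."""
    terms = text_query.lower().split()
    return [d for d in docs
            if all(t in d["text"].lower() for t in terms)
            and all(d.get(k) == v for k, v in params.items())]

print(parametric_search(cars, "garage", color="red", city="Palo Alto"))
```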

  45. Interfaces for term browsing

  46. Re/Formulate Query • Single text box (Google, Stanford intranet) • Command-based (Socrates) • Boolean queries • Parametric search • Term browsing • Other methods • Relevance feedback • Query expansion • Spelling correction • Natural language, question answering

  47. The User in Information Access (the information access loop diagram, repeated to introduce the next stage)

  48. Category Labels to Support Exploration • Example: • ODP categories on Google • Advantages: • Interpretable • Capture summary information • Describe multiple facets of content • Domain dependent, and so descriptive • Disadvantages: • Domain dependent, so costly to acquire • May mismatch users' interests Credit: Marti Hearst
