
Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman


Presentation Transcript


  1. Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier for ITCS6050, UNCC, Fall 2008

  2. Overview • Indexing • Ranking • Query Expansion • Query Evaluation • Tupleflow

  3. Topics Not Covered • Binned Probabilities • Score-Sorted Index Optimization • Document-Sorted Index Optimization • Navigational Search with Complex Features

  4. Document Indexing • Inverted List A mapping from a single word to a set of documents that contain the word • Inverted Index A set of inverted lists

  5. Inverted Index • Contains one inverted list for each term in the document collection • Often omits frequently occurring words such as “a,” “and” and “the.”

  6. Sample Documents • Doc 1: Cats, dogs, dogs. • Doc 2: Dogs, cats, sheep. • Doc 3: Whales, sheep, goats. • Doc 4: Fish, whales, whales. Inverted Index Example: cats → {1, 2}, dogs → {1, 2}, sheep → {2, 3}, goats → {3}, whales → {3, 4}, fish → {4}
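A minimal sketch of how such an index could be built (illustrative Python, not code from the dissertation); the documents are the four from the slide above:

  from collections import defaultdict

  docs = {
      1: "cats dogs dogs",
      2: "dogs cats sheep",
      3: "whales sheep goats",
      4: "fish whales whales",
  }

  # Map each term to the set of documents that contain it.
  index = defaultdict(set)
  for doc_id, text in docs.items():
      for term in text.split():
          index[term].add(doc_id)

  print(sorted(index["dogs"]))    # [1, 2]
  print(sorted(index["whales"]))  # [3, 4]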

  7. Expanding Inverted Indexes • Include term frequency The more often a term occurs in a document, the more likely the document is “about” that term

  8. Expanding Inverted Indexes (cont.) • Add word position information, which facilitates phrase searching (see the sketch below)
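The same idea extended with counts and positions, sketched under the assumption that each posting stores a list of word positions (the term count is simply the length of that list); positions make a phrase check such as “cats dogs” straightforward:

  from collections import defaultdict

  docs = {1: "cats dogs dogs", 2: "dogs cats sheep"}

  # Map term -> doc_id -> list of word positions.
  index = defaultdict(lambda: defaultdict(list))
  for doc_id, text in docs.items():
      for pos, term in enumerate(text.split()):
          index[term][doc_id].append(pos)

  def contains_phrase(doc_id, first, second):
      # The phrase occurs if some occurrence of `second` immediately follows `first`.
      return any(p + 1 in index[second][doc_id] for p in index[first][doc_id])

  print(len(index["dogs"][1]))               # term frequency of "dogs" in doc 1: 2
  print(contains_phrase(1, "cats", "dogs"))  # True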

  9. Inverted Index Statistics • Compressed inverted indexes containing only word counts: 5% of the document collection in size; built and queried faster • Compressed inverted indexes containing word counts and positions: 20% of the document collection in size; essential for high effectiveness, even for queries that do not use phrases

  10. Document Ranking • Documents returned in order of relevance • Perfect ranking impossible • Retrieval systems calculate probability a document is relevant

  11. Computing Relevance • Assume a “bag of words” model with term independence • Simple estimation • Problems • A document that does not contain every word of a multi-word query will not be retrieved: a document containing none of the query words scores the same as one containing only some of them • All query words are treated equally: for the query “Maltese falcon,” a document with (maltese: 2, falcon: 1) scores the same as a similar-length document with (maltese: 1, falcon: 2) • Smoothing can help (see the formula below)
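One standard way to smooth (Dirichlet smoothing from language-model retrieval; whether this exact form is the model used in the dissertation is an assumption) blends document counts with collection statistics so that unseen query terms still receive a small probability:

  P(w | D) = ( c(w;D) + μ · P(w | C) ) / ( |D| + μ )

where c(w;D) is the count of w in document D, |D| is the document length, P(w | C) is the term's probability in the whole collection, and μ is a smoothing parameter; the document is then scored by the product of P(w | D) over the query terms w.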

  12. Computing Relevance (cont.) • Add additional features • Position/field in document, ex. title • Proximity of query terms • Combinations

  13. Computing Relevance (cont.) Add query-independent information • # of links from other documents • URL depth Shorter URLs tend to be general, longer URLs more specific • User clicks May reflect matched expectations rather than relevance • Dwell time • Document quality models An unusual term distribution suggests poor grammar, so the document is not a good retrieval candidate

  14. Query Expansion Stemming Groups words that express the same concept based on natural-language rules. ex: run, runs, running, ran • Aggressive Stemmer May group words that are not related. ex: marine, marinate • Conservative Stemmer May fail to group words that are related. ex: run, ran • Statistical Stemmer Uses word co-occurrence data to determine whether words are related; would probably avoid the marine/marinate mistake.
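A toy illustration of the rule-based idea (hypothetical code, far cruder than the stemmers the dissertation discusses): a conservative stemmer that strips only a few suffixes will group run/runs/running but, as noted above, still miss the irregular form “ran”:

  def stem(word):
      # Conservative: strip only a few unambiguous suffixes,
      # and only when a reasonably long stem remains.
      for suffix in ("ning", "ing", "s"):
          if word.endswith(suffix) and len(word) - len(suffix) >= 3:
              return word[: -len(suffix)]
      return word

  print([stem(w) for w in ["run", "runs", "running", "ran"]])
  # ['run', 'run', 'run', 'ran']  -- the irregular form is not grouped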

  15. Query Expansion (cont.) Synonyms Group terms that express the same concept • Problem The right grouping may depend on context US: president = head of state = commander in chief UK: prime minister = head of state Corporation: president = chief executive (maybe) • Solutions • Include synonyms in the query but prefer exact term matches • Use context from the whole query, e.g. “president of canada” should also match “prime minister”

  16. Query Expansion (cont.) Relevance Feedback The user marks relevant documents, which are then used to find similar documents. Pseudo Relevance Feedback The system assumes the first few documents retrieved are relevant and uses them to search for more. No user involvement, so not as precise.
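A rough sketch of pseudo relevance feedback (the ranking function, the number of feedback documents, and the number of expansion terms are all placeholder assumptions):

  from collections import Counter

  def expand_query(query, search, feedback_docs=5, terms_to_add=3):
      # Assume the top-ranked documents are relevant...
      top_docs = search(query)[:feedback_docs]
      counts = Counter()
      for doc in top_docs:
          counts.update(doc.split())
      # ...and add their most frequent new terms to the query.
      new_terms = [t for t, _ in counts.most_common() if t not in query.split()]
      return query + " " + " ".join(new_terms[:terms_to_add])

  # Example with a stub ranker standing in for a real retrieval system:
  ranked = lambda q: ["maltese falcon movie bogart", "maltese falcon novel hammett"]
  print(expand_query("maltese falcon", ranked, feedback_docs=2, terms_to_add=2))
  # maltese falcon movie bogart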

  17. Evaluation • Effectiveness • Efficiency

  18. Effectiveness • Precision: # of relevant results / # of results • Success: whether the first document was relevant • Recall: # of relevant docs found / # of relevant docs that exist • Mean Average Precision (MAP): precision averaged over all relevant documents • Normalized Discounted Cumulative Gain (NDCG): a sum of graded relevance gains discounted by rank

  19. Calculating MAP Assume a retrieval set of 10 documents with those at ranks 1, 5, 7, 8 and 10 relevant. If there were only 5 relevant documents, then (1 + .4 + .43 + .5 + .5) / 5 ≈ .57 If we retrieved only 5 of 6 relevant documents, then (1 + .4 + .43 + .5 + .5) / 6 ≈ .47
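A small helper (illustrative, not from the dissertation) that reproduces the arithmetic above:

  def average_precision(relevant_ranks, total_relevant):
      # Precision at each rank where a relevant document appears,
      # averaged over all relevant documents (retrieved or not).
      precisions = [(i + 1) / rank for i, rank in enumerate(relevant_ranks)]
      return sum(precisions) / total_relevant

  print(round(average_precision([1, 5, 7, 8, 10], 5), 2))  # 0.57
  print(round(average_precision([1, 5, 7, 8, 10], 6), 2))  # 0.47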

  20. NDCG • Uses graded relevance values rather than a binary relevant/not-relevant judgment, with 0 being not relevant and 4 being most relevant. • Calculated as NDCG = Z_N · Σ_i (2^r(i) − 1) / log(1 + i) where i is the rank, r(i) is the relevance value at that rank, and Z_N normalizes the sum so that a perfect ranking scores 1.
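A sketch of the computation, assuming base-2 logarithms and normalization against the ideal ordering (the slide does not fix either choice):

  import math

  def dcg(relevances):
      # relevances[i] is the graded relevance r(i) of the result at rank i + 1.
      return sum((2 ** r - 1) / math.log2(1 + rank)
                 for rank, r in enumerate(relevances, start=1))

  def ndcg(relevances):
      ideal = sorted(relevances, reverse=True)
      return dcg(relevances) / dcg(ideal)

  print(round(ndcg([3, 2, 0, 1]), 2))  # ~0.99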

  21. Efficiency • Throughput # of queries processed per second Must use identical systems. • Latency Time between when the user issues a query and the system delivers a response. < 150ms considered “instantaneous” • Generally, improving one implies worsening the other

  22. Measuring Efficiency • Direct Attempt to create a real-world system and measure statistics. Straightforward, but limited by the hardware the experimenter has access to. • Simulation System operation is simulated in software. Repeatable, but only as good as its model.

  23. Query Evaluation • Document-at-a-time Evaluate each term for a document before moving to the next document. • Term-at-a-time Evaluate each document for a term before moving to the next term.

  24. Document-at-a-Time • Produces complete document scores early, so it can quickly display partial results. • Can incrementally fetch the inverted list data, so it uses less memory.

  25. Document-at-a-Time Algorithm
  procedure DocumentAtATimeRetrieval(Q)
    L ← Array()
    R ← PriorityQueue()
    for all terms wi in Q do
      li ← InvertedList(wi)
      L.add( li )
    end for
    for all documents D in the collection do
      for all inverted lists li in L do
        sD ← sD + f(Q,C,wi)(c(wi;D))    # update the document score
      end for
      sD ← sD · d(Q,C)(|D|)             # multiply by a document-dependent factor
      R.add( sD, D )
    end for
    return the top n results from R
  end procedure

  26. Term-at-a-Time • Does not jump between inverted lists, so it saves branching. • The inner loop iterates over documents, so it runs for a long stretch and is easier to optimize. • Efficient query processing strategies have been developed for term-at-a-time. • Preferred for efficient system implementation.

  27. Term-at-a-Time Algorithm
  procedure TermAtATimeRetrieval(Q)
    A ← HashTable()
    for all terms wi in Q do
      li ← InvertedList(wi)
      for all documents D in li do
        A[D] ← A[D] + f(Q,C,wi)(c(wi;D))    # accumulate the partial score s(wi,D)
      end for
    end for
    R ← PriorityQueue()
    for all accumulators A[D] in A do
      sD ← A[D] · d(Q,C)(|D|)               # normalize the accumulator value
      R.add( sD, D )
    end for
    return the top n results from R
  end procedure

  28. Optimization Types • Unoptimized • Unsafe • Set Safe • Rank Safe • Score Safe

  29. Unoptimized • Compare the query to each document and calculate the score. • Sort the documents. Documents with the same score may appear in any order. • Return results in ranked order. Because ties can be broken arbitrarily, the exact set of “top k documents” could differ between runs.

  30. Optimized • Unsafe Documents returned have no guaranteed properties. • Set Safe Documents are guaranteed to be in the result set but may not be in the same order as the unoptimized results. • Rank Safe Documents are guaranteed to be in the result set and in the correct order, but document scores may not be the same as the unoptimized results. • Score Safe Documents are guaranteed to be in the result set and have the same scores as the unoptimized results.

  31. Tupleflow Distributed computing framework for indexing. • Flexibility Settings are made in parameter files, no code changes required • Scalability Independent tasks are spread across processors • Disk abstraction Streaming data model • Low abstraction penalty Code handles custom hashing, sorting and serialization

  32. Traditional Indexing Approach Create a word occurrence model by counting the unique terms in each document. • Serial processing Parse one document, then move to the next • Large memory requirements The hash of unique words over a large document set must hold words, misspellings, numbers, URLs, etc. • Different code required for each document type Documents, web pages, databases, etc.

  33. Tupleflow Approach Break processing into steps • Count terms (countsMaker) • Sort terms • Combine counts (countsReducer)

  34. Tupleflow Example For the text “The cat in the hat.”, the countsMaker emits one (word, 1) tuple per word occurrence, the sort stage brings identical words together, and the countsReducer combines them into final counts (see the sketch below).
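A plain-Python sketch of the three stages applied to the example text (the function names echo countsMaker and countsReducer, but this is not Tupleflow's actual API):

  from itertools import groupby

  def counts_maker(text):
      # Emit a (word, 1) tuple for every word occurrence (punctuation handling omitted).
      return [(w, 1) for w in text.lower().split()]

  def counts_reducer(tuples):
      # After sorting, identical words are adjacent and can be summed in one pass.
      return [(word, sum(c for _, c in group))
              for word, group in groupby(tuples, key=lambda t: t[0])]

  tuples = counts_maker("the cat in the hat")
  print(counts_reducer(sorted(tuples)))
  # [('cat', 1), ('hat', 1), ('in', 1), ('the', 2)]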

  35. Tupleflow Execution Graph The diagram contrasts single-processor and multi-processor execution. On a single processor the stages form one chain: filenames → read text → parse text → count words → combine counts. With multiple processors, the read text / parse text / count words stages are replicated across machines and a single combine counts step merges their output.

  36. Summary Document indexing and querying are time- and resource-intensive tasks. Optimizing and parallelizing wherever possible is essential to minimize resource use and maximize efficiency. Tupleflow is one example of efficient indexing through parallelization.

  37. Questions?
