信息检索建模

信息检索建模 武港山 Tel : 83594243 Office: 蒙民伟楼608B Email : gswu@nju.edu.cn

检索请求 检索机制检索对象前言数据建模操作 Wu Gangshan: Modern Information Retrieval

信息检索系统的体系结构 用户界面文档用户需求文档处理逻辑视图用户反馈提问处理建索引数据库管理倒排文档搜索提问索引文档数据库排序后的文档检出的文档排序 Wu Gangshan: Modern Information Retrieval

前言 • 信息检索模型是指在特定文档逻辑视图的基础上，为实现信息检索而定义的一组处理机制。 • 如：相似性判断机制等 • 本书中，将用户的信息获取方式定为两种 • 信息检索 • 信息浏览 • 下面就在这两种方式的基础上讨论信息检索模型。 Wu Gangshan: Modern Information Retrieval

内容提要 • 信息检索模型分类 • 信息检索模型的规范化描述 • 传统的信息检索模型 • 布尔、向量空间、概率 • 其他变种模型 • 集合论模型、代数模型、概率模型变种 • 结构化文本信息检索模型 • 浏览处理模型 • 发展和研究的方向 Wu Gangshan: Modern Information Retrieval

1、检索模型分类 • 检索模型 • 传统模型: • 布尔:集合论(模糊逻辑模型:扩展的布尔模型) • 向量 : 代数模型 (通用向量模型:潜在语义分析: 神经网络.) • 概率模型: (推理网络, 置信网络) • 结构化模型 • 非覆盖列表 • 最相近节点 • 浏览模型: • 平坦型 • 结构制导 • 超文本 Wu Gangshan: Modern Information Retrieval

Algebraic Set Theoretic Generalized Vector Lat. Semantic Index Neural Networks Structured Models Fuzzy Extended Boolean Non-Overlapping Lists Proximal Nodes Classic Models Probabilistic boolean vector probabilistic Inference Network Belief Network Browsing Flat Structure Guided Hypertext IR Models U s e r T a s k Retrieval: Adhoc Filtering Browsing Wu Gangshan: Modern Information Retrieval

1、信息检索模型分类 Logical View of Documents User task IR Model, Logical view of documents, and the user task Are orthogonal aspects of a retrieval system. Wu Gangshan: Modern Information Retrieval

检索和过滤对比* • IR • 检索集中的文档相对不变 • 传统的信息需求方式 • 过滤 • 查询请求相对不变，而文档不断的进来或离开 • 股票市场和有线新闻服务 Wu Gangshan: Modern Information Retrieval

检索和过滤对比* • 过滤 • 用户文件: 描述用户的信息偏好 • 用一组关键词创建. • 用户提供 :不太现实 • 训练、学习 : 动态变化 • 排序: • 两种处理：对过滤文档排序/不排序 • 相关性判断时一定有类似与排序的处理过程 Wu Gangshan: Modern Information Retrieval

2、信息检索模型的规范化描述 • 定义: 信息检索模型可以定义成一个四元组 {D, Q, F, R(qi,dj)}，其中 • D 是被检索文档的逻辑视图。 • Q 是用户检索请求的逻辑视图，也称为查询。 • F定义了文档表示、查询以及它们关系的框架。 • R(qi,dj) 是排序函数，它对任意一个检索请求和任一个检索文档建立一种。 Wu Gangshan: Modern Information Retrieval

3、传统的信息检索模型 • 文档被看成是一组索引项。 • 可以用来表达文档的主题 • 用于索引和概括文档内容 • 主要是名词。可独立表达自身的意义. • 每个索引项在表达文档含义时有权重： • 一个词如果在所有的文档中都出现，表示该词对区分文档内容没有帮助。 • 仅出现在几个文档中的索引项，表示该词对于表征文档内容有重要的贡献度。 Wu Gangshan: Modern Information Retrieval

3、传统的信息检索模型 • 文档逻辑视图定义: • Let t be the number of index terms in the system and ki be a generic index term, K={ki} is the set of all index terms,（逻辑视图空间） • a weight wi,jfor a index term which doesn’t appear in the document text, wi,j =0. • With the document di is associated an index term vector {wi,j}, • Further, let gi be a function that returns the weight associated with the index term ki in any t-dimensional vector. Wu Gangshan: Modern Information Retrieval

3、传统的信息检索模型 • Boolean model • Vector model • Probabilistic model Wu Gangshan: Modern Information Retrieval

3.1、布尔逻辑操作 • 定义在集合上的操作。 • True and false: • A set is a collection of items with certain common characteristics. • Any item either belongs to the set (true) or not belong to the set (false) • AND • combine two sets, A and B, to create a smaller (or at least not larger) set C. • any items in C must be in BOTH set A and set B. • OR • Union of two sets, A and B, to create a larger set C. • any item in C must be either in set A or in set B. • Not • to exclude items in a set. Wu Gangshan: Modern Information Retrieval

Example: • Given: A={1, 3, 7, 12, 14, 25,36,} B={1, 2, 3,4,5,7,8,12,13, 14, 15, 25, 26} C={2,4,6,8,10,11,12,13,14} • Derive: • A AND B • A OR B • A AND B AND C • (A AND B) NOT C • (A AND B) OR C • (A OR B) AND C • A AND (B OR C) Wu Gangshan: Modern Information Retrieval

布尔查询请求 • 检索关键词用逻辑符号相连接 • 系统基于布尔逻辑查询，返回一组检索到的文档 • Examples: • (network or networks or structured or system or systems) and (information or retrieval) • 每个关键词表示一组包含该关键词的文档集合 Wu Gangshan: Modern Information Retrieval

布尔模型定义 • 索引项的权重是二进制，只有 {0,1} • 查询Q是一个布尔逻辑表达式. • 相似度只有两种 =0: 不相关 1: 相关 Wu Gangshan: Modern Information Retrieval

Sec. 1.1 布尔模型的实现 • Which plays of Shakespeare contain the words BrutusANDCaesar but NOTCalpurnia? • One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia? • Slow (for large corpora) • NOTCalpurnia is non-trivial • Other operations (e.g., find the word Romans nearcountrymen) not feasible • Ranked retrieval (best documents to return) • Later lectures

Sec. 1.1 Term-document incidence 1 if play contains word, 0 otherwise BrutusANDCaesar but NOTCalpurnia

Sec. 1.1 Incidence vectors • So we have a 0/1 vector for each term. • To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented)  bitwise AND. • 110100 AND 110111 AND 101111 = 100100.

Sec. 1.1 Answers to query • Antony and Cleopatra,Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. • Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Bigger collections • Consider N = 1 million documents, each with about 1000 words. • Avg 6 bytes/word including spaces/punctuation • 6GB of data in the documents. • Say there are M = 500K distinct terms among these.

Can’t build the matrix • 500K x 1M matrix has half-a-trillion 0’s and 1’s. • But it has no more than one billion 1’s. • matrix is extremely sparse. • What’s a better representation? • We only record the 1 positions.

Sec. 1.2 2 4 8 16 32 64 128 1 2 3 5 8 13 21 34 Inverted index • For each term T, we must store a list of all documents that contain T. • Do we use an array or a list for this? Brutus Calpurnia Caesar 13 16 What happens if the word Caesar is added to document 14?

Sec. 1.2 Brutus Calpurnia Caesar Dictionary Postings lists Inverted index • Linked lists generally preferred to arrays • Dynamic space allocation • Insertion of terms into documents easy • Space overhead of pointers Posting 2 4 8 16 32 64 128 1 2 3 5 8 13 21 34 13 16 Sorted by docID (more later on why).

Sec. 1.2 Indexer steps • Sequence of (Modified token, Document ID) pairs. Doc 1 Doc 2 I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Sec. 1.2 • Sort by terms. Core indexing step.

Sec. 1.2 • Multiple term entries in a single document are merged. • Frequency information is added. Why frequency? Will discuss later.

Sec. 1.2 • The result is split into a Dictionary file and a Postings file.

Sec. 1.2 • Where do we pay in storage? Terms Pointers

Sec. 1.3 2 4 8 16 32 64 1 2 3 5 8 13 21 Query processing: AND • Consider processing the query: BrutusANDCaesar • Locate Brutus in the Dictionary; • Retrieve its postings. • Locate Caesar in the Dictionary; • Retrieve its postings. • “Merge” the two postings: 128 Brutus Caesar 34

Sec. 1.3 Brutus Caesar 13 128 2 2 4 4 8 8 16 16 32 32 64 64 8 1 1 2 2 3 3 5 5 8 8 21 21 13 34 The merge • Walk through the two postings simultaneously, in time linear in the total number of postings entries 128 2 34 If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.

Sec. 1.3 Boolean queries: Exact match • The Boolean Retrieval model is being able to ask a query that is a Boolean expression: • Boolean Queries are queries using AND, OR and NOT to join query terms • Views each document as a set of words • Is precise: document matches condition or not. • Primary commercial retrieval tool for 3 decades. • Professional searchers (e.g., lawyers) still like Boolean queries: • You know exactly what you’re getting.

Sec. 1.4 Example: WestLaw http://www.westlaw.com/ • Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992) • Tens of terabytes of data; 700,000 users • Majority of users still use boolean queries • Example query: • What is the statute of limitations in cases involving the federal tort claims act? • LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM • /3 = within 3 words, /S = in same sentence

Sec. 1.4 Example: WestLaw http://www.westlaw.com/ • Another example query: • Requirements for disabled people to be able to access a workplace • disabl! /p access! /s work-site work-place (employment /3 place • Note that SPACE is disjunction, not conjunction! • Long, precise queries; proximity operators; incrementally developed; not like web search • Professional searchers often like Boolean search: • Precision, transparency and control • But that doesn’t mean they actually work better….

Sec. 1.3 2 4 8 16 32 64 128 1 2 3 5 8 16 21 34 Query optimization • What is the best order for query processing? • Consider a query that is an AND of t terms. • For each of the t terms, get its postings, then AND them together. Brutus Calpurnia Caesar 13 16 Query: BrutusANDCalpurniaANDCaesar 37

Sec. 1.3 Brutus 2 4 8 16 32 64 128 Calpurnia 1 2 3 5 8 13 21 34 Caesar 13 16 Query optimization example • Process in order of increasing freq: • start with smallest set, then keepcutting further. This is why we kept freq in dictionary Execute the query as (CaesarANDBrutus)ANDCalpurnia.

Sec. 1.3 More general optimization • e.g., (madding OR crowd) AND (ignoble OR strife) • Get freq’s for all terms. • Estimate the size of each OR by the sum of its freq’s (conservative). • Process in increasing order of OR sizes.

Faster postings merges:Skip pointers/Skip lists

Recall basic merge • Walk through the two postings simultaneously, in time linear in the total number of postings entries 2 4 8 41 48 64 128 Brutus 2 8 1 2 3 8 11 17 21 31 Caesar If the list lengths are m and n, the merge takes O(m+n) operations. Can we do better? Yes (if index isn’t changing too fast).

128 17 2 4 8 41 48 64 1 2 3 8 11 21 Augment postings with skip pointers (at indexing time) 128 41 31 11 31 • Why? • To skip postings that will not figure in the search results. • How? • Where do we place skip pointers?

17 2 4 8 41 48 64 We then have 41 and 11 on the lower. 11 is smaller. 1 2 3 8 11 21 But the skip successor of 11 on the lower list is 31, so we can skip ahead past the intervening postings. Query processing with skip pointers 128 16 128 31 11 31 Suppose we’ve stepped through the lists until we process 8 on each list. We match it and advance.

Where do we place skips? • Tradeoff: • More skips  shorter skip spans  more likely to skip. But lots of comparisons to skip pointers. • Fewer skips  few pointer comparison, but then long skip spans  few successful skips.

Placing skips • Simple heuristic: for postings of length L, use L evenly-spaced skip pointers. • This ignores the distribution of query terms. • Easy if the index is relatively static; harder if L keeps changing because of updates. • This definitely used to help; with modern hardware it may not (Bahle et al. 2002) • The I/O cost of loading a bigger postings list can outweigh the gains from quicker in memory merging!

布尔查询的优点 • 简单、明确 • 高效 • AND reduces the number of hits very quickly • OR expands search scope • 很强的逻辑理论基础 • proved mathematical foundations Wu Gangshan: Modern Information Retrieval

布尔查询的问题 • 布尔查询是一个精确查询 • either retrieving or not retrieving a document. • Requesting “computer” would not find “computing” unless more programming is done • 没有权重机制 • in query, A and B, you can’t specify A is more important than B. Wu Gangshan: Modern Information Retrieval

无法基于布尔逻辑对检索结果排序 • 每一个检索回来的文档都一视同仁. • 可能出现次序混乱 • A AND B OR C Wu Gangshan: Modern Information Retrieval

传统信息检索模型 • Boolean model • Vector model • 向量空间模型（Vector Space Model） • Probabilistic model Wu Gangshan: Modern Information Retrieval

VSM概述－布尔模型的改进 • 一个文档可以用一个词袋表示（bag of words）(带词频的无序关键词集). • 用户定义一组带有权重的关键词请求: • 带权重的形式: Q = < database 0.5; text 0.8; information 0.2 > • 不带权重的请求: Q = < database; text; information > • 查询中不需要逻辑操作 Wu Gangshan: Modern Information Retrieval

信息检索建模

信息检索建模

Presentation Transcript