Search Engine Technology



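Search Engine Technology. Yan Hongfei (闫宏飞), yhf@net.pku.edu.cn. Network Laboratory, Department of Computer Science, Peking University. December 24, 2004 @ CERNET2004. Outline: how search engines work; information retrieval research and institutions. Web search engines. Definition: a system that lets users submit queries, retrieves a list of web pages relevant to the query, and outputs them in ranked order. Index creation methods: manual indexing, automatic indexing. System architecture: centralized, distributed. Two service extremes: browsing services, search engine services.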

Presentation Transcript
slide1

Search Engine Technology

Yan Hongfei (闫宏飞), yhf@net.pku.edu.cn

Network Laboratory, Department of Computer Science, Peking University

December 24, 2004 @ CERNET2004

slide2
Outline
  • How search engines work
  • Information retrieval research and institutions
web search engines
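Search Engines: Web Search Engines
  • Definition: a system that lets users submit queries, retrieves a list of web pages relevant to the query, and outputs them in ranked order.
  • Index creation methods
    • Manual indexing
    • Automatic indexing
  • System architecture
    • Centralized architecture
    • Distributed architecture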
slide6

[Figure: two pairs of extremes, each with an open question ("???") between them. Two service extremes: Browsing Services and Search Engine Services. Two semantics extremes: Web Pages and Bag of Words.]

slide7
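The Three-Stage Search Engine Workflow
  • Collection
    • Batch crawling, incremental crawling; crawl targets, crawl strategies
  • Preprocessing
    • Keyword extraction; duplicate page elimination; link analysis; indexing
  • Service
    • Query modes and matching; result ranking; document summaries

[Figure: Collection, then Organization, then Service]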

slide10

[Figure: Architecture of a distributed Web crawling system, showing multiple crawling processes, multiple coordination processes (nodes), and a scheduling module.]
slide11
Tianwang Storage Format

version: 1.0 // version number
url: http://www.pku.edu.cn/ // URL
origin: http://www.somewhere.cn/ // original URL
date: Tue, 15 Apr 2003 08:13:06 GMT // time of harvest
ip: 162.105.129.12 // IP address
unzip-length: 30233 // If included, the data must be compressed
length: 18133 // data length
// a blank line
XXXXXXXX // the data part follows
XXXXXXXX
….
XXXXXXXX // end of data
// insert a new line
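A record in this layout can be read by splitting at the first blank line, parsing the `key: value` header lines, and then taking `length` bytes of data. The sketch below is only an illustration under stated assumptions: the function name `parse_record` is hypothetical, and the slides do not name the compression codec, so zlib is assumed here purely for demonstration.

```python
import zlib


def parse_record(raw: bytes):
    """Parse one stored page record into (headers, data).

    Assumptions (not stated in the slides): headers are ASCII,
    and compressed data uses zlib.
    """
    head, sep, rest = raw.partition(b"\n\n")  # headers end at the blank line
    if not sep:
        raise ValueError("no blank line between headers and data")

    headers = {}
    for line in head.decode("ascii").splitlines():
        # Split at the first colon only, so URLs and dates stay intact.
        key, _, value = line.partition(":")
        headers[key.strip()] = value.strip()

    length = int(headers["length"])  # number of data bytes stored
    data = rest[:length]
    if len(data) != length:
        raise ValueError("record is truncated")

    # 'unzip-length' present means the stored data is compressed.
    if "unzip-length" in headers:
        data = zlib.decompress(data)  # assumed codec
        if len(data) != int(headers["unzip-length"]):
            raise ValueError("unexpected uncompressed size")
    return headers, data
```

Reading `length` bytes rather than searching for a terminator means the data part may itself contain blank lines without confusing the parser.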

file organizations indexes
File Organizations (Indexes)
  • Choices for accessing data during query evaluation
  • Scan the entire collection
    • Typical in early (batch) retrieval systems
    • Computational and I/O costs are O(characters in collection)
    • Practical for only “small” text collections
    • Large memory systems make scanning feasible
  • Use indexes for direct access
    • Evaluation time O(query term occurrences in collection)
    • Practical for “large” collections
    • Many opportunities for optimization
  • Hybrids: Use small index, then scan a subset of the collection
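The three access choices above can be compared on a toy collection. This is a minimal sketch, not a production design: the six short documents, the tokenizer, and the function names `scan`, `build_index`, and `hybrid` are all assumptions made for illustration.

```python
import re

# Hypothetical toy collection (not taken from the slides).
DOCS = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot",
    3: "Nine days old",
    4: "Some like it hot, some like it cold",
    5: "Some like it in the pot",
    6: "Nine days old",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def contains_phrase(tokens, words):
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def scan(docs, phrase, doc_ids=None):
    """Scan documents for an exact phrase; cost grows with the text scanned."""
    words = tokenize(phrase)
    ids = docs.keys() if doc_ids is None else doc_ids
    return [d for d in sorted(ids) if contains_phrase(tokenize(docs[d]), words)]


def build_index(docs):
    """Small presence-only index: word -> sorted list of document ids."""
    index = {}
    for doc_id in sorted(docs):
        for word in set(tokenize(docs[doc_id])):
            index.setdefault(word, []).append(doc_id)
    return index


def hybrid(docs, index, phrase):
    """Use the index to pick candidate documents, then scan only those."""
    candidates = set(docs)
    for word in tokenize(phrase):
        candidates &= set(index.get(word, []))
    return scan(docs, phrase, candidates)
```

Here `scan(DOCS, "in the pot")` reads every document, while `hybrid(DOCS, build_index(DOCS), "in the pot")` reads only the documents that contain all three words; both return the same result.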
indexes
Indexes
  • What should the index contain?
  • Database systems index primary and secondary keys
    • This is the hybrid approach
    • Index provides fast access to a subset of database records
    • Scan subset to find solution set
  • IR Problem:
  • Cannot predict keys that people will use in queries
    • Every word in a document is a potential search term
  • IR Solution: Index by all keys (words), i.e., full-text indexes
index contents
Index Contents
  • The contents depend upon the retrieval model
  • Feature presence/absence
    • Boolean
    • Statistical (tf, df, ctf, doclen, maxtf)
    • Often about 10% the size of the raw data, compressed
  • Positional
    • Feature location within document
    • Granularities include word, sentence, paragraph, etc.
    • Coarse granularities are less precise, but take less space
    • Word-level granularity about 20-30% the size of the raw data, compressed
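The statistics listed above can be gathered in one pass over the collection. The sketch below assumes the usual readings of the abbreviations (tf: occurrences of a term in a document; df: number of documents containing the term; ctf: occurrences of the term in the whole collection; doclen: document length in tokens; maxtf: the largest tf in a document). The function name `index_contents` and the simple tokenizer are illustrative assumptions.

```python
import re
from collections import Counter, defaultdict


def index_contents(docs):
    """Collect statistical and word-level positional index data."""
    tf, doclen, maxtf = {}, {}, {}
    df, ctf = Counter(), Counter()
    positions = defaultdict(dict)  # term -> {doc_id: [word offsets]}

    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        counts = Counter(tokens)
        tf[doc_id] = counts
        doclen[doc_id] = len(tokens)
        maxtf[doc_id] = max(counts.values(), default=0)
        ctf.update(counts)         # adds every occurrence
        df.update(counts.keys())   # adds one per document
        for offset, term in enumerate(tokens):
            positions[term].setdefault(doc_id, []).append(offset)

    return {"tf": tf, "df": df, "ctf": ctf, "doclen": doclen,
            "maxtf": maxtf, "positions": dict(positions)}
```

Dropping the `positions` table gives the smaller presence/statistics-only index; keeping it is what makes phrase and proximity matching possible, at the cost of the extra space noted above.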
indexes implementation
Indexes: Implementation
  • Common implementations of indexes
    • Bitmaps
    • Signature files
    • Inverted files
  • Common index components
    • Dictionary (lexicon)
    • Postings
      • document ids
      • word positions

[Figure note: no positional data indexed]

inverted search algorithm
Inverted Search Algorithm
  • Find query elements (terms) in the lexicon
  • Retrieve postings for each lexicon entry
  • Manipulate postings according to the retrieval model
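The three steps above can be sketched for a Boolean AND query and an exact-phrase query over a word-level inverted file. The document collection, the function names, and the choice of Python dictionaries for the lexicon and postings are assumptions for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical toy collection (not taken from the slides).
DOCS = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot",
    3: "Nine days old",
    4: "Some like it hot, some like it cold",
    5: "Some like it in the pot",
    6: "Nine days old",
}


def build_inverted_file(docs):
    """Lexicon entry -> postings, where postings map doc id -> word positions."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        words = re.findall(r"[a-z]+", docs[doc_id].lower())
        for pos, word in enumerate(words):
            index[word].setdefault(doc_id, []).append(pos)
    return dict(index)


def boolean_and(index, terms):
    if not terms:
        return []
    postings = []
    for term in terms:
        if term not in index:               # step 1: find the term in the lexicon
            return []
        postings.append(set(index[term]))   # step 2: retrieve its postings
    return sorted(set.intersection(*postings))  # step 3: combine (Boolean AND)


def phrase(index, terms):
    """Documents where the terms occur at consecutive word positions."""
    hits = []
    for doc_id in boolean_and(index, terms):
        starts = index[terms[0]][doc_id]
        if any(all(p + k in index[t][doc_id] for k, t in enumerate(terms))
               for p in starts):
            hits.append(doc_id)
    return hits
```

With this assumed collection, `boolean_and(index, ["porridge", "pot"])` returns `[2]`, while `phrase(index, ["porridge", "pot"])` returns `[]` because the two words are never adjacent. Only step 3 changes between retrieval models; the lexicon lookup and postings retrieval are shared.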
word level inverted file20
Word-Level Inverted File

[Figure: word-level inverted file, showing the lexicon and the postings]

Query:
  • 1. porridge & pot (BOOL)
  • 2. “porridge pot” (BOOL)
  • 3. porridge pot (VSM)

[Link: Answer]

slide21
Outline
  • How search engines work
  • Information retrieval research and institutions
a brief history of modern information retrieval
A Brief History of Modern Information Retrieval
  • In 1945, Vannevar Bush published "As We May Think" in The Atlantic Monthly.
  • In the 1960s, Gerard Salton and his students developed the SMART system.
  • Cyril Cleverdon carried out the Cranfield evaluations.
  • The 1970s and 1980s saw many developments built on the advances of the 1960s.
  • In 1992, the Text REtrieval Conference (TREC) began.
  • The algorithms developed in IR were employed for searching the Web from 1996.
slide37
Information Retrieval Research and Institutions
  • CIIR, University of Massachusetts
  • LTI, Carnegie Mellon University
  • The Stanford University DB Group
  • Microsoft Research Asia
  • TREC
  • Peking University, Network Laboratory, Tianwang Group
lemur
Introduction to Lemur
  • http://www-2.cs.cmu.edu/~lemur/
lemur toolkit
Lemur Toolkit
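  • Goal: a research system to facilitate language modeling (LM) and IR research
    • ad hoc, distributed retrieval, cross-language IR, summarization, filtering, and classification
  • Features:
    • Supports indexing of large-scale document databases
    • Builds simple language models
    • Implements retrieval systems based on language models and several other retrieval models
  • Implementation: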
    • C and C++
    • Unix / Windows
    • Current Version 3.1
mra towards next generation web search
MRA: Towards Next Generation Web Search
  • From Pages to Blocks
    • Analyze the Web at finer granularity
  • From Surface Web to Deep Web
    • Unleash the huge assets of high-value information
  • From Unstructured to Structured
    • Provide well organized results
  • From relevance to intelligence
    • Contribute to knowledge discovery with search
  • From Desktop Search to Mobile Search
    • Bridge physical world search to digital world search
the stanford univ db group
The Stanford Univ. DB Group
  • WebBase
    • Crawling, storage, indexing, and querying of large collections of Web pages.
  • Digital Libraries
    • Infrastructure and services for creating, disseminating, sharing and managing information
trec conference
TREC Conference
  • Established in 1992 to evaluate large-scale IR
    • Retrieving documents from a gigabyte collection
  • Has run continuously since then
    • TREC 2004 (13th) meeting is in November
  • Run by NIST’s Information Access Division
  • Probably the best-known IR evaluation setting
    • Started with 25 participating organizations in the 1992 evaluation
    • In 2003, there were 93 groups from 22 different countries
  • Proceedings available online (http://trec.nist.gov)
    • Overview of TREC 2003 at http://trec.nist.gov/pubs/trec12/papers/OVERVIEW.12.pdf
trec general format
TREC General Format
  • TREC consists of IR research tracks
    • Ad hoc, routing, confusion (scanned documents, speech recognition), video, filtering, multilingual (cross-language, Spanish, Chinese), question answering, novelty, high precision, interactive, Web, database merging, NLP, …
  • Each track works on roughly the same model
    • November: track approved by TREC community
    • Winter: track’s members finalize format for track
    • Spring: researchers train systems based on the specification
    • Summer: researchers carry out the formal evaluation
      • Usually a “blind” evaluation: researchers do not know the answers
    • Fall: NIST carries out evaluation
    • November: Group meeting (TREC) to find out:
      • How well your site did
      • How others tackled the problem
    • Many tracks are run by volunteers outside of NIST (e.g. Web)
  • “Coopetition” model of evaluation
    • Successful approaches generally adopted in next cycle
cwt100g
CWT100g Construction Timeline

One small step for me, one giant leap for mankind!

slide53
Participating Teams That Submitted Results

Note: the pool also includes retrieval results from five search engines: Google, Yisou, Baidu, Sogou, and Zhongsou.

slide54

Evaluation Results

Topic distillation

Navigational search

TIANWANG_RUN is included for reference only

slide55
Summary
  • How search engines work
  • Information retrieval research and institutions
vector space model
Vector Space Model
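  • Document d and query q are represented in the vector space as two m-dimensional vectors. Each dimension is weighted by TF∙IDF, and the similarity of d and q is measured by the cosine of the angle between the two vectors (using the original tf and idf formulas).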
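The slide's own formula is not reproduced in this transcript, so the following is only a sketch of the description above. It assumes the weight of a term is its raw count times log(N / df), where N is the number of documents and df the number of documents containing the term, and it scores each document by the cosine sim(d, q) = (d · q) / (|d| |q|). The collection, the tokenizer, and the function name `rank` are illustrative assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical toy collection (not taken from the slides).
DOCS = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot",
    3: "Nine days old",
    4: "Some like it hot, some like it cold",
    5: "Some like it in the pot",
    6: "Nine days old",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def rank(docs, query):
    """Return doc ids with non-zero cosine similarity, best first."""
    n = len(docs)
    tfs = {d: Counter(tokenize(t)) for d, t in docs.items()}
    df = Counter()
    for counts in tfs.values():
        df.update(counts.keys())

    def weights(counts):
        # Assumed weighting: raw tf times log(N / df); unseen terms are skipped.
        return {t: tf * math.log(n / df[t]) for t, tf in counts.items() if df[t]}

    def norm(vec):
        return math.sqrt(sum(w * w for w in vec.values()))

    q = weights(Counter(tokenize(query)))
    scores = {}
    for d, counts in tfs.items():
        w = weights(counts)
        dot = sum(w.get(t, 0.0) * qw for t, qw in q.items())
        if dot > 0:
            scores[d] = dot / (norm(w) * norm(q))
    return sorted(scores, key=scores.get, reverse=True)
```

On this assumed collection, `rank(DOCS, "porridge pot")` orders document 2 first, then 1, then 5: document 2 matches both query terms, document 1 matches one term twice but is longer, and document 5 matches one term once.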


query answer
Query Answer
  • 1. porridge & pot (BOOL)
    • d2
  • 2. “porridge pot” (BOOL)
    • null
  • 3. porridge pot (VSM)
    • d2 > d1 > d5

ciir center for intelligent information retrieval @umass
CIIR: Center for Intelligent Information Retrieval @UMASS
  • One of the leading research groups in IR
    • Improved probabilistic retrieval models
    • First description of a retrieval system based on statistical language models
    • Introduced and improved a number of techniques for text and query representation
    • Automatic representation of databases and combination of local searches for distributed IR (DIR)
    • First high-capacity probabilistic filtering architecture
    • Defined and evaluated the first versions of event detection and tracking software
    • Earliest research on ranking and representation techniques for Asian languages
    • First approaches to information extraction that emphasized learning
    • Novel techniques for indexing images and video
ciir cont
CIIR cont.
  • Research
    • more than 500 journal and refereed conference papers over the past 12 years (52 submissions in 2003).
  • industrial and government collaboration
    • INQUERY
    • licensed our software to nearly 300 sites
  • Education
    • 20 Ph.D.s, 29 M.S.
    • 123/145, 34/4 graduate/undergraduate
ciir cont61
CIIR cont.
  • Personnel
    • Faculty 4 (W. Bruce Croft)
    • Technical personnel 10
    • Graduate students 34/10
  • Groups
    • IESL: Information Extraction and Synthesis Laboratory
    • IR: Information Retrieval Laboratory
    • MIR: Multimedia Indexing and Retrieval Laboratory
  • The CIIR is currently concentrating on the unsolved long-term research problems that underlie effective information retrieval
    • text representation,
    • query acquisition,
    • retrieval models
lti language technologies institute @cmu
LTI: Language Technologies Institute @CMU
  • Machine Translation, Natural Language Processing, Speech, and Information Retrieval
  • IR Projects (Jamie Callan and Yiming Yang)
    • Adaptive Information Filtering
    • Distributed Information Retrieval / Federated Search
    • Email Classification and Prioritization
    • Minerva: Web Mining for Question Answering
    • MuchMore: Translingual Information Retrieval
    • JAVELIN: Open-Domain Question Answering
