Information Extraction (2)

Information Extraction (IE), Part 2

IE • What is information extraction • IR and IE • Tasks addressed and IE models • IE architecture

Wrapper Induction

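Wrapper • Chinese terms: 分装器, 包装器 • A program that extracts relevant content from a specific information source and represents it in a specific form. • Consists of a set of extraction rules plus the program code that applies them.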

Wrapper-based information extraction

Ways to construct a wrapper • Manual • Semi-automatic • Automatic

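Wrapper Induction • An inductive learning algorithm • Inductive learning analyzes a set of instances, drawn from an unknown collection, that follow some regularity; it induces general rules from them and uses those rules to infer the rest of the collection.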

Wrapper Induction • a technique for automatically constructing wrappers from labeled examples of a resource's content.

Wrapper induction: Delimiter-based extraction • <HTML><TITLE>Some Country Codes</TITLE> Congo 242 Egypt 20 Belize 501 Spain 34 </BODY></HTML> • Use four delimiter strings, l1, r1, l2, r2, for extraction

Learning LR wrappers • Input: labeled pages, each like <HTML><HEAD>Some Country Codes</HEAD>Congo 242 Egypt 20 Belize 501 Spain 34 </BODY></HTML> • Output: a wrapper <l1, r1, …, lK, rK> • Example: find 4 strings <l1, r1, l2, r2>

$l_i$: a suffix of the string between $data_{i-1}$ and $data_i$ • $r_i$: a prefix of the string between $data_i$ and $data_{i+1}$
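As a rough illustration of how an LR wrapper is applied, the sketch below scans a page for each left delimiter and reads up to the matching right delimiter. The function name `exec_lr` is hypothetical, and the `<B>`/`<I>` markup is an assumption: the inner tags of the slide's example page did not survive, so bold and italic tags stand in for them here.

```python
def exec_lr(page, delimiters):
    """Extract tuples from `page` with an LR wrapper [(l1, r1), ..., (lK, rK)]."""
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            end = page.find(right, start + len(left)) if start != -1 else -1
            if end == -1:
                return tuples  # no further complete row on the page
            row.append(page[start + len(left):end])
            pos = end + len(right)
        tuples.append(tuple(row))

# Assumed markup (the original tags were lost in the transcript).
page = ("<HTML><TITLE>Some Country Codes</TITLE><BODY>"
        "<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
        "<B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR>"
        "</BODY></HTML>")

rows = exec_lr(page, [("<B>", "</B>"), ("<I>", "</I>")])
for country, code in rows:
    print(country, code)
```

Each row is read strictly left to right, so the wrapper needs only K pairs of strings and no knowledge of HTML structure.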

LR: Finding r1 • <HTML><TITLE>Some Country Codes</TITLE>Congo 242 Egypt 20 Belize 501 Spain 34 </BODY></HTML> • r1 can be any prefix shared by the strings that follow each country name

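LR: Finding l1, l2 and r2 • <HTML><TITLE>Some Country Codes</TITLE>Congo 242 Egypt 20 Belize 501 Spain 34 </BODY></HTML> • l1 can be any suffix shared by the strings that precede each country name • l2 can be any suffix shared by the strings between each country name and its code • r2 can be any prefix shared by the strings that follow each code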
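The candidate sets for each delimiter can be sketched as follows. This is a minimal sketch, not the original learner: the helper names are hypothetical, the `<B>`/`<I>` context strings are assumed markup (the real tags were lost), and the extra filter that a right delimiter must not occur inside an extracted value is a validity check added for illustration.

```python
import os

def common_prefix(strings):
    """Longest prefix shared by all strings (character-wise)."""
    return os.path.commonprefix(list(strings))

def common_suffix(strings):
    """Longest suffix shared by all strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def candidates_r(following, values):
    """Non-empty shared prefixes of the text after each value,
    excluding any string that also occurs inside a value."""
    longest = common_prefix(following)
    prefixes = [longest[:n] for n in range(1, len(longest) + 1)]
    return [c for c in prefixes if all(c not in v for v in values)]

def candidates_l(preceding):
    """Non-empty shared suffixes of the text before each value."""
    longest = common_suffix(preceding)
    return [longest[-n:] for n in range(1, len(longest) + 1)]

names = ["Congo", "Egypt", "Belize", "Spain"]
# Assumed text before each name and between each name and its code.
preceding_names = ["<HTML><TITLE>Some Country Codes</TITLE><BODY><B>",
                   "</I><BR><B>", "</I><BR><B>", "</I><BR><B>"]
following_names = ["</B> <I>"] * 4

print(candidates_l(preceding_names))          # candidates for l1
print(candidates_r(following_names, names))   # candidates for r1
print(candidates_l(following_names))          # candidates for l2
```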

A problem with LR wrappers • Distracting text in the head and tail • <HTML><TITLE>Some Country Codes</TITLE> <BODY>Some Country Codes Congo 242 Egypt 20 Belize 501 Spain 34 <HR>End</BODY></HTML>

Head-Left-Right-Tail wrappers • One (of many) solutions: HLRT • Ignore the page's head and tail • <HTML><TITLE>Some Country Codes</TITLE><BODY>Some Country Codes Congo 242 Egypt 20 Belize 501 Spain 34 <HR>End</BODY></HTML> • (Figure labels: head, body, tail; one delimiter marks the end of the head, another the start of the tail)

Extraction • An HLRT wrapper is a vector <h, t, l1, r1, l2, r2, …> • Web pages are the examples, output tuples are the labels, and ExecHLRT() is the hypothesis function
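A minimal sketch of what an ExecHLRT-style function might look like is given below. The Python name `exec_hlrt`, the choice of `<P>` as h and `<HR>` as t, and the `<B>`/`<I>` markup are all assumptions made for illustration, since the slide's original tags were lost.

```python
def exec_hlrt(page, h, t, delimiters):
    """Extract tuples with an HLRT wrapper <h, t, l1, r1, ..., lK, rK>."""
    head_end = page.find(h)
    if head_end == -1:
        return []
    pos = head_end + len(h)          # skip the head
    l1 = delimiters[0][0]
    tuples = []
    while True:
        next_l1 = page.find(l1, pos)
        next_t = page.find(t, pos)
        # Stop when no l1 remains, or the tail starts before the next l1.
        if next_l1 == -1 or (next_t != -1 and next_t < next_l1):
            return tuples
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            end = page.find(right, start + len(left)) if start != -1 else -1
            if end == -1:
                return tuples        # malformed row: stop
            row.append(page[start + len(left):end])
            pos = end + len(right)
        tuples.append(tuple(row))

# Assumed markup; the bold title and "End" are the distracting text.
page = ("<HTML><TITLE>Some Country Codes</TITLE><BODY>"
        "<B>Some Country Codes</B><P>"
        "<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
        "<B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR>"
        "<HR><B>End</B></BODY></HTML>")

rows = exec_hlrt(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
print(rows)
```

Without h and t, the same left/right delimiters would also match the bold title in the head and the bold "End" in the tail.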

Induction

Induction as search • Search the hypothesis space

Induction as search • Generate-and-test • Depth-first search, with 2K+2 levels, one per component of the wrapper vector
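The generate-and-test idea can be sketched for the simpler LR case, where the search tree has 2K levels (an HLRT search would add two more for h and t). Everything here is illustrative: the function names, the candidate lists, and the `<B>`/`<I>` markup are assumptions rather than the original algorithm's details.

```python
def exec_lr(page, wrapper):
    """Run an LR wrapper given as a flat sequence (l1, r1, ..., lK, rK)."""
    pairs = list(zip(wrapper[0::2], wrapper[1::2]))
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in pairs:
            start = page.find(left, pos)
            end = page.find(right, start + len(left)) if start != -1 else -1
            if end == -1:
                return tuples
            row.append(page[start + len(left):end])
            pos = end + len(right)
        tuples.append(tuple(row))

def induce(page, labels, candidates, partial=()):
    """Depth-first generate-and-test: one tree level per wrapper component."""
    if len(partial) == len(candidates):
        # Test: accept the wrapper only if it reproduces the labels.
        return list(partial) if exec_lr(page, partial) == labels else None
    for cand in candidates[len(partial)]:   # Generate: try each candidate
        found = induce(page, labels, candidates, partial + (cand,))
        if found is not None:
            return found
    return None

page = ("<HTML><TITLE>Some Country Codes</TITLE><BODY>"
        "<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
        "<B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR>"
        "</BODY></HTML>")
labels = [("Congo", "242"), ("Egypt", "20"), ("Belize", "501"), ("Spain", "34")]
# Hypothetical candidate lists for l1, r1, l2, r2 (non-empty strings only).
candidates = [[">", "<B>"], ["<", "</B>"], [">", "<I>"], ["<", "</I>"]]

wrapper = induce(page, labels, candidates)
print(wrapper)
```

The search returns the first consistent wrapper it reaches, which need not be the most readable one; any vector that reproduces the labels is an acceptable hypothesis.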

Hidden Markov Model (HMM)

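Generating Patterns • A generative model here means a data model distilled from a training sample set, after data preprocessing, by data-modeling algorithms such as neural networks or regression analysis.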

Generating Patterns • Deterministic generative models

Generating Patterns • Non-deterministic generative models

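Markov processes and Markov chains • Markov process: a stochastic process with the memoryless property, i.e. the probability of the state at time t depends only on the state at time t-1 and not on any earlier state. • Markov chain: a Markov process with discrete time and discrete states.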

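Parameters of a Markov chain • Transition probabilities: $A = \{a_{kl}\}$, $a_{kl} = P(x_t = l \mid x_{t-1} = k)$ • Initial probabilities: $\pi_k = P(x_1 = k)$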

Markov chain example • States: Sun (state 1), Cloudy (state 2), Rain (state 3) • State transition matrix, with rows and columns ordered Sun, Cloudy, Rain • Initial distribution: (0.0, 0.0, 1.0)

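Markov chain example • Suppose the first day (t=1) is rainy. Question: under this model, what is the probability that the weather over the following 7 days is "rain, rain, sun, sun, rain, cloudy, rain"? • More abstractly, let the observation sequence for t=1,2,…,8 be O = {Rain, Rain, Rain, Sun, Sun, Rain, Cloudy, Rain}, i.e. states {3, 3, 3, 1, 1, 3, 2, 3}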
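For a Markov chain the answer is the initial probability of the first state times one transition probability per step: $P(O) = \pi_3 \, a_{33} \, a_{33} \, a_{31} \, a_{11} \, a_{13} \, a_{32} \, a_{23}$. The sketch below evaluates this product. The transition matrix values are hypothetical, because the matrix from the slide is not available here; only the initial distribution (0.0, 0.0, 1.0) comes from the source.

```python
SUN, CLOUDY, RAIN = 0, 1, 2   # states 1, 2, 3 on the slide, zero-based here

# Hypothetical transition matrix; rows and columns ordered Sun, Cloudy, Rain.
A = [[0.8, 0.1, 0.1],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
pi = [0.0, 0.0, 1.0]          # initial distribution from the slide

def sequence_probability(states, pi, A):
    """P(x_1, ..., x_T) = pi[x_1] * product of A[x_{t-1}][x_t]."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

O = [RAIN, RAIN, RAIN, SUN, SUN, RAIN, CLOUDY, RAIN]
p = sequence_probability(O, pi, A)
print(p)
```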

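Hidden Markov Models (HMM) • An HMM is a doubly stochastic process with two components: • A Markov chain: describes the transitions between states, characterized by transition probabilities. • A general stochastic process: describes the relationship between states and the observation sequence, characterized by observation probabilities.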

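Components of an HMM • The Markov chain ($\pi$, A) produces the state sequence q1, q2, ..., qT • The stochastic process (B) turns it into the observation sequence o1, o2, ..., oT • (Figure: schematic of the components of an HMM)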

HMM • Graphical Model Representation: Variables by time • Circles indicate states • Arrows indicate probabilistic dependencies between states

HMM • Green circles are hidden states • Dependent only on the previous state: Markov process • “The past is independent of the future given the present.”

HMM • Purple nodes are observed states • Dependent only on their corresponding hidden state

Basic elements of an HMM • {N, M, Π, A, B} • N: {s1…sN} are the values for the hidden states • M: {k1…kM} are the values for the observations

Basic elements of an HMM • {N, M, Π, A, B} • Π = {π_i} are the initial state probabilities • A = {a_ij} are the state transition probabilities • B = {b_ik} are the observation probabilities (state s_i emits observation k)
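To make the five elements concrete, the sketch below writes a small HMM as plain Python lists and samples from it, which shows the two stochastic layers: the hidden chain moves according to A, and each state emits a symbol according to B. All numbers, names, and sizes (N = 2, M = 3) are hypothetical.

```python
import random

states = ["s1", "s2"]            # N = 2 hidden state values
symbols = ["k1", "k2", "k3"]     # M = 3 observation values
pi = [0.6, 0.4]                  # initial state probabilities
A = [[0.7, 0.3],                 # A[i][j] = P(next state j | state i)
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],            # B[i][k] = P(observation k | state i)
     [0.1, 0.3, 0.6]]

def sample(T, rng):
    """Draw a hidden state sequence and its observation sequence of length T."""
    xs, obs = [], []
    x = rng.choices(range(len(states)), weights=pi)[0]
    for _ in range(T):
        xs.append(x)
        obs.append(rng.choices(range(len(symbols)), weights=B[x])[0])
        x = rng.choices(range(len(states)), weights=A[x])[0]
    return xs, obs

xs, obs = sample(10, random.Random(0))
print([states[x] for x in xs])
print([symbols[o] for o in obs])
```

Only the second printed sequence would be visible to an observer; the first is what the decoding problem tries to recover.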

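Applications of HMMs (1) Evaluation: given a known HMM, compute the probability of an observation sequence (2) Decoding: given an observation sequence, find the most likely hidden state sequence (3) Learning: derive an HMM from observation sequences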

HMM application (1) • Given an observation sequence $O = o_1, o_2, \ldots, o_T$ and a model $\lambda = (A, B, \Pi)$, compute $P(O \mid \lambda)$

HMM application (1) • (Figure: chain of hidden states x1 … xT, each state xt emitting observation ot)

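Forward Procedure • Define the forward variable: $\alpha_t(i) = P(o_1 \ldots o_t, x_t = s_i \mid \lambda)$, where $b_i(o_t)$ below is the probability that state $s_i$ emits $o_t$ • Initialization: $\alpha_1(i) = \pi_i \, b_i(o_1)$, $1 \le i \le N$ • Recursion: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] b_j(o_{t+1})$, $1 \le t \le T-1$ • Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$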
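The forward procedure can be sketched in a few lines of plain Python. The toy model values are hypothetical, and observations are encoded as zero-based symbol indices, which is a choice made for this example.

```python
def forward(obs, pi, A, B):
    """Return the forward table alpha and P(O | model)."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    # Recursion: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    # Termination: P(O | model) = sum_i alpha_T(i)
    return alpha, sum(alpha[-1])

# Hypothetical two-state, three-symbol model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

alpha, prob = forward([0, 1, 2], pi, A, B)
print(prob)
```

The table has T rows of N entries, so the cost is on the order of N²T operations, compared with summing over all N^T state sequences directly.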

Forward Procedure • (Figure: chain of hidden states x1 … xT, each state xt emitting observation ot)

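Backward Procedure • Define the backward variable: $\beta_t(i) = P(o_{t+1} \ldots o_T \mid x_t = s_i, \lambda)$ • Initialization: $\beta_T(i) = 1$, $1 \le i \le N$ • Recursion: $\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)$, $t = T-1, \ldots, 1$ • Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i \, b_i(o_1) \, \beta_1(i)$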
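A matching sketch of the backward procedure follows, using a hypothetical toy model and zero-based symbol indices for the observations. It computes the same quantity $P(O \mid \lambda)$ as the forward pass, only from the end of the sequence toward the start.

```python
def backward(obs, pi, A, B):
    """Return the backward table beta and P(O | model)."""
    N, T = len(pi), len(obs)
    # Initialization: beta_T(i) = 1
    beta = [[1.0] * N for _ in range(T)]
    # Recursion: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    # Termination: P(O | model) = sum_i pi_i * b_i(o_1) * beta_1(i)
    prob = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(N))
    return beta, prob

# Hypothetical two-state, three-symbol model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

beta, prob = backward([0, 1, 2], pi, A, B)
print(prob)
```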

Backward Procedure • (Figure: chain of hidden states x1 … xT, each state xt emitting observation ot)

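HMM application (2): Viterbi Algorithm • Goal: given an observation sequence O and a model λ, choose the state sequence S that best explains O • Define $\delta_t(i) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1}, x_t = s_i, o_1 \ldots o_t \mid \lambda)$; what we are looking for is the state sequence represented by the largest $\delta_T(i)$ at time T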
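A compact sketch of Viterbi decoding is shown below. It keeps, for each state, the probability of the best path ending there plus a back-pointer, then traces back from the best final state. The toy model is hypothetical, and observations are zero-based symbol indices.

```python
def viterbi(obs, pi, A, B):
    """Return the most likely state path and its joint probability."""
    N = len(pi)
    # delta[i]: probability of the best path ending in state i at the current t
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    backptr = []
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        backptr.append(step)
        delta = new_delta
    # Pick the state with the largest delta at time T, then follow pointers back.
    last = max(range(N), key=lambda i: delta[i])
    path = [last]
    for step in reversed(backptr):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]

# Hypothetical two-state, three-symbol model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

path, p = viterbi([0, 1, 2], pi, A, B)
print(path, p)
```

The structure mirrors the forward recursion with the sum replaced by a max; for long sequences a practical version would work with log probabilities to avoid underflow.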

Viterbi Algorithm • (Figure: chain of hidden states x1 … xT, each state xt emitting observation ot)

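HMM application (3): Baum-Welch algorithm (model training) • Goal: given an observation sequence O, compute a model λ that maximizes P(O | λ). • Steps: 1. Choose an initial model λ0 (the model to be trained). 2. Train a new model λ from λ0 and the observation sequence O. 3. If log P(O | λ) - log P(O | λ0) < Δ, training has reached the desired level and the algorithm stops. 4. Otherwise, set λ0 = λ and return to step 2.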
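The four steps above can be sketched as a loop around one re-estimation step. This is a minimal, unscaled sketch for a single short sequence; the initial model, the training sequence, the threshold `delta`, and the `max_iter` cap are all hypothetical, and a practical version would rescale or use logs to avoid underflow.

```python
import math

def forward(obs, pi, A, B):
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    return alpha

def backward(obs, A, B):
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta

def reestimate(obs, pi, A, B):
    """One re-estimation step; returns (new model, log P(O | old model))."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    prob = sum(alpha[-1])
    # gamma[t][i] = P(x_t = s_i | O); xi[t][i][j] = P(x_t = s_i, x_{t+1} = s_j | O)
    gamma = [[alpha[t][i] * beta[t][i] / prob for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / prob
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return (new_pi, new_A, new_B), math.log(prob)

def baum_welch(obs, pi, A, B, delta=1e-6, max_iter=50):
    model, history = (pi, A, B), []          # step 1: initial model
    for _ in range(max_iter):
        new_model, old_ll = reestimate(obs, *model)            # step 2
        new_ll = math.log(sum(forward(obs, *new_model)[-1]))
        history.append(old_ll)
        model = new_model                                      # step 4
        if new_ll - old_ll < delta:                            # step 3
            history.append(new_ll)
            break
    return model, history

obs = [0, 0, 1, 0, 0, 1, 2, 2, 1, 2, 2, 2, 0, 0, 1, 0]
model, history = baum_welch(obs, [0.6, 0.4],
                            [[0.7, 0.3], [0.4, 0.6]],
                            [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(history[0], history[-1])
```

Each step cannot decrease the likelihood, so the recorded log-likelihoods rise toward a local maximum that depends on the initial model.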

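Baum-Welch algorithm • Define: $\xi_t(i, j) = P(x_t = s_i, x_{t+1} = s_j \mid O, \lambda) = \dfrac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{P(O \mid \lambda)}$ and $\gamma_t(i) = P(x_t = s_i \mid O, \lambda) = \sum_{j=1}^{N} \xi_t(i, j)$, where $\alpha$ and $\beta$ are the forward and backward variables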
