
Conditional Random Fields model


Presentation Transcript


  1. Conditional Random Fields model QingSong.Guo

  2. Recent work XML keyword query refinement Two ways: • Focus on XML tree structure • Focus on keywords

  3. XML tree • In keyword query, there are many nodes in the XML tree matching the keywords. • Try to find semantically related keywords to avoid returning irrelevant XML nodes to users. • LCA (lowest common ancestor) • SLCA (smallest LCA)

  4. Keyword Ambiguity “Mary Author Title Year”? 1. Find the title and year of publications of which Mary is an author. 2. Find the additional authors of publications of which Mary is an author. 3. Find the year and authors of publications with titles similar to Mary’s publications.

  5. Keywords • Spelling error correction: machin → machine • Word splitting: universtyofruc → university of ruc • Word merging: on line → online • Phrase segmentation: mark each word’s position in the phrase • Word stemming: doing → do • Acronym expansion: RUC → Renmin University of China Question: How can we do that?
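As one illustration of how such operations might be applied, here is a toy sketch; the lookup tables and rules are made-up placeholders for illustration, not the actual refinement method discussed later in these slides.

```python
# Toy versions of three of the refinement operations above; the tables are
# hypothetical placeholders, not real resources.
SPELLING = {"machin": "machine"}                       # spelling error correction
MERGES = {("on", "line"): "online"}                    # word merging
ACRONYMS = {"ruc": "renmin university of china"}       # acronym expansion

def refine(tokens):
    tokens = [SPELLING.get(t, t) for t in tokens]      # fix spelling errors
    tokens = [ACRONYMS.get(t, t) for t in tokens]      # expand acronyms
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MERGES:                             # merge adjacent words
            out.append(MERGES[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(refine(["machin", "learning", "on", "line"]))    # ['machine', 'learning', 'online']
```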

  6. Labeling Sequence Data Example: X = (x1, x2, x3) = (Thinking, is, being), Y = (y1, y2, y3) = (noun, verb, noun) • X are random variables over data sequences • Y are random variables over label sequences • A is the set of possible part-of-speech tags • Problem: how do we get the label sequence y from the data sequence x?

  7. Hidden Markov models (HMMs) • Assign a joint probability to paired observation and label sequences • The parameters are typically trained to maximize the joint likelihood of the training examples

  8. Markov model • The Markov property means that, given the present state, future states are independent of the past states • State space S, with a sequence of random variables taking values in S • Markov property:
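Written out (the standard definition, using the same Xt notation as the HMM slides below), the Markov property is:

```latex
P(X_{t+1} = s \mid X_1 = x_1, \ldots, X_t = x_t) = P(X_{t+1} = s \mid X_t = x_t)
```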

  9. HMM • The state is not directly visible, but variables influenced by the state are visible • The hidden states provide a labeling of the data sequence

  10. Example of HMM • Suppose you have a friend who lives far away and calls you every day to tell you what he did that day. Your friend is interested in only three activities: walking in the park, shopping, and cleaning his room. What he chooses to do depends only on the weather. You know nothing about the weather where he lives, but you know the general trends. Based on what he tells you he did each day, you want to guess the weather there. • You assume the weather behaves like a Markov chain. It has two states, “Rainy” and “Sunny”, but you cannot observe them directly, i.e., they are hidden from you. Each day, your friend has a certain probability of doing one of the activities “walk”, “shop”, or “clean”. Because your friend tells you his activities, these activities are your observations. The whole system is a hidden Markov model (HMM).
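A minimal way to encode this example as an HMM; the probabilities below are illustrative placeholders, since the slide gives no numbers.

```python
# Weather/activity HMM from the example above, with made-up probabilities.
states = ("Rainy", "Sunny")
observations = ("walk", "shop", "clean")

start_probability = {"Rainy": 0.6, "Sunny": 0.4}                  # pi
transition_probability = {                                        # A
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emission_probability = {                                          # B
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}
```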

  11. HMM Three Problems: • Given a model λ = (A, B, π), how do we compute p(Y | λ)? • How do we select the proper state sequence Y? • How do we estimate the parameters to maximize p(Y | λ)?

  12. HMM workflow: get data → create model → parameter estimation (training) → model established → application

  13. HMM Definition: a quintuple (S, K, A, B, π) • S = {S1,...,Sn}: set of states • K = {K1,...,Km}: set of observations • A = {aij}, aij = p(Xt+1 = qj | Xt = qi): state transition probabilities • B = {bik}, bik = p(Ot = vk | Xt = qi): output probabilities • π = {πi}, πi = p(X1 = qi): initial state probabilities
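A minimal sketch of the forward algorithm for the first of the three problems on slide 11 (computing the probability of an observation sequence under λ = (A, B, π)). It assumes the dictionary encoding of the weather example shown earlier, which is an illustration rather than the slides' own notation.

```python
# Forward algorithm: p(O | lambda) for a discrete HMM.
def forward(obs_seq, states, start_p, trans_p, emit_p):
    # alpha[s] = p(o_1 .. o_t, X_t = s | lambda), updated over t
    alpha = {s: start_p[s] * emit_p[s][obs_seq[0]] for s in states}
    for obs in obs_seq[1:]:
        prev = alpha
        alpha = {
            s: sum(prev[r] * trans_p[r][s] for r in states) * emit_p[s][obs]
            for s in states
        }
    return sum(alpha.values())  # total probability of the observation sequence

# Example with the illustrative weather parameters defined above:
# forward(("walk", "shop", "clean"), states, start_probability,
#         transition_probability, emission_probability)
```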

  14. HMM

  15. Generative Models • Difficulties and disadvantages • Need to enumerate all possible observation sequences • Not practical to represent multiple interacting features or long-range dependencies of the observations • Very strict independence assumptions on the observations

  16. Discriminative models • Used in machine learning to model the dependence of an unobserved variable y on an observed variable x • They model the conditional probability distribution P(y | x), which can be used to predict y from x

  17. Maximum Entropy Markov Models (MEMMs) • A conditional model representing the probability of reaching a state given an observation and the previous state • Given a training set X with label sequence Y: • Train the parameters θ that maximize P(Y | X, θ) • For a new data sequence x, the predicted label sequence y maximizes P(y | x, θ)
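For reference, one standard way to write the MEMM's per-state conditional model (the usual textbook form, not transcribed from the slide; fk are feature functions and λk their weights):

```latex
P(y_i \mid y_{i-1}, x) = \frac{1}{Z(x, y_{i-1})}
  \exp\Big(\sum_k \lambda_k\, f_k(y_{i-1}, y_i, x, i)\Big)
```

The normalizer Z(x, y_{i-1}) is computed separately for each previous state; this per-state normalization is what gives rise to the label bias problem discussed on slide 19.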

  18. MEMMs • Have all the advantages of conditional models • Subject to the label bias problem: a bias toward states with fewer outgoing transitions

  19. Label Bias Problem • P(1,2 | ro) = P(2 | 1, ro) P(1 | ro) = P(2 | 1, o) P(1 | r) • P(1,2 | ri) = P(2 | 1, ri) P(1 | ri) = P(2 | 1, i) P(1 | r) • Since P(2 | 1, x) = 1 for all x, P(1,2 | ro) = P(1,2 | ri) • In the training data, label value 2 is the only label value observed after label value 1 • Therefore P(2 | 1) = 1, so P(2 | 1, x) = 1 for all x • However, we expect P(1,2 | ri) to be greater than P(1,2 | ro) • Per-state normalization does not allow the required expectation

  20. Random Field

  21. Conditional Random Fields (CRFs) • Have all the advantages of MEMMs without the label bias problem • An MEMM uses a per-state exponential model for the conditional probabilities of the next state given the current state • A CRF has a single exponential model for the joint probability of the entire label sequence given the observation sequence • Undirected acyclic graph • Allow some transitions to “vote” more strongly than others, depending on the corresponding observations

  22. Definition of CRFs X: random variable over data sequences to be labeled Y: random variable over corresponding label sequences

  23. Example of CRFs Here, we suppose the graph G is a chain

  24. Conditional Distribution If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields is: • v is a vertex from the vertex set V (the set of label random variables) • e is an edge from the edge set E over V • k is the number of features • λk and μk are parameters to be estimated • y|e is the set of components of y defined by edge e • y|v is the set of components of y defined by vertex v
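The distribution being described is the standard one from Lafferty, McCallum and Pereira (2001); written out with the symbols listed above (λk weight the edge features fk, μk the vertex features gk):

```latex
p_\theta(y \mid x) \propto \exp\Big(
    \sum_{e \in E,\, k} \lambda_k\, f_k(e, y|_e, x)
  + \sum_{v \in V,\, k} \mu_k\, g_k(v, y|_v, x)
\Big)
```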

  25. Conditional Distribution • CRFs use the observation-dependent normalization Z(x) for the conditional distributions: Z(x) is a normalization factor over the data sequence x
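In the chain-structured case this becomes, in its usual normalized form (standard textbook notation, not copied from the slide):

```latex
p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\Big(
    \sum_{i}\sum_{k} \lambda_k\, f_k(y_{i-1}, y_i, x, i)
  + \sum_{i}\sum_{k} \mu_k\, g_k(y_i, x, i)
\Big),
\qquad
Z(x) = \sum_{y'} \exp\Big(
    \sum_{i}\sum_{k} \lambda_k\, f_k(y'_{i-1}, y'_i, x, i)
  + \sum_{i}\sum_{k} \mu_k\, g_k(y'_i, x, i)
\Big)
```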

  26. Feature functions • Transition feature function: f(yi-1, yi, x, i) = 1 if yi-1 = IN and yi = NNP, and 0 otherwise • State feature function: g(yi, x, i) = 1 if xi is the word “september”, and 0 otherwise

  27. Maximum Entropy Principle The form of the CRF given above is heavily motivated by the principle of maximum entropy (Shannon, “A Mathematical Theory of Communication”) • The only probability distribution that can justifiably be constructed from finite training data is the one that has maximum entropy, subject to a set of constraints representing the information available

  28. Maximum Entropy Principle • If the information within the training data is represented using the set of feature functions described previously • The maximum entropy distribution is the one that is as uniform as possible while ensuring that the expectation of each feature function with respect to the empirical distribution of the training data equals the expected value of that feature function with respect to the model distribution

  29. Learning for CRFs • Assumption: the features fk and gk are given and fixed • The learning problem: • determine the parameters λ = (λ1, λ2, . . . ; µ1, µ2, . . .) • that maximize the log-likelihood function of the training data D = {(x(k), y(k))} with empirical distribution p~(x, y) • We simplify the notation by writing • This allows the probability of a label sequence y given an observation sequence x to be written as

  30. CRF Parameter Estimation • For a CRF, the log-likelihood is given by • Differentiating the log-likelihood function with respect to the parameters gives
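The quantities referred to above take the following standard form in the usual chain-CRF derivation (sketched here for reference, not transcribed from the slide):

```latex
% Log-likelihood over training data D = {(x^{(j)}, y^{(j)})}
L(\theta) = \sum_{j} \log p_\theta\big(y^{(j)} \mid x^{(j)}\big)

% Derivative w.r.t. each weight: empirical count minus model expectation of the feature
\frac{\partial L}{\partial \lambda_k}
  = \sum_{j}\Big[ \sum_{i} f_k\big(y^{(j)}_{i-1}, y^{(j)}_i, x^{(j)}, i\big)
  - \sum_{y} p_\theta\big(y \mid x^{(j)}\big) \sum_{i} f_k\big(y_{i-1}, y_i, x^{(j)}, i\big) \Big]
```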

  31. CRF Parameter Estimation • There is no analytical solution for the parameters that maximize the log-likelihood • Setting the derivative to zero and solving for the parameters does not always yield a closed-form solution • Iterative techniques are adopted • Iterative scaling • Gradient descent • The core of the above techniques lies in computing the expectation of each feature function with respect to the CRF model distribution

  32. CRF Probability as Matrix Computations • Augment the label sequence with start and end states. We define n+1 matrices of size: • The probability of a label sequence y given an observation sequence x can be written as the product of the appropriate elements of the n+1 matrices for that pair of sequences • The normalization factor can be computed based on graph theory
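The matrix construction being referred to is the one in Lafferty et al. (2001); a sketch in that paper's notation (start and stop are the added boundary states):

```latex
M_i(y', y \mid x) = \exp\Big(
    \sum_k \lambda_k\, f_k\big(e_i, Y|_{e_i} = (y', y), x\big)
  + \sum_k \mu_k\, g_k\big(v_i, Y|_{v_i} = y, x\big)
\Big)

p_\theta(y \mid x) = \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}{Z(x)},
\qquad
Z(x) = \Big[\, M_1(x)\, M_2(x) \cdots M_{n+1}(x) \,\Big]_{\mathrm{start},\, \mathrm{stop}}
```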

  33. Dynamic Programming • The expectation of each feature function with respect to the CRF model distribution, for every observation sequence x(k) in the training data, is given by • Rewriting the right-hand side of the above equation

  34. Dynamic Programming • Defining forward and backward vectors • The probability of Yi and Yi-1 taking on labels y’ and y, given the observation sequence x(k), can be computed as

  35. Making Predictions • Once a CRF model has been trained, there are (at least) two possible ways to do inference for a test sequence • We can predict the entire sequence Y that has the highest probability, using the Viterbi algorithm (MAP) • We can also make predictions for each individual yt using the forward-backward algorithm (MPM)
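A minimal sketch of Viterbi (MAP) decoding for a chain model. The `score(prev_label, label, x, i)` interface is a placeholder standing in for the CRF's local log-potentials (weighted transition plus state features); it is an assumption for illustration, not the implementation used in the experiments below.

```python
# Viterbi decoding: return the label sequence with the highest total score.
def viterbi(x, labels, score):
    # best[y] = (best score of any path ending in label y at position i, that path)
    best = {y: (score(None, y, x, 0), [y]) for y in labels}
    for i in range(1, len(x)):
        new_best = {}
        for y in labels:
            prev_y = max(labels, key=lambda p: best[p][0] + score(p, y, x, i))
            prev_score, prev_path = best[prev_y]
            new_best[y] = (prev_score + score(prev_y, y, x, i), prev_path + [y])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])  # (score, label sequence)
```

For MPM predictions of individual yt, the same local scores would instead be combined with the forward-backward recursions mentioned on slide 34.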

  36. POS Tagging Experiments

  37. POS Tagging Experiments • Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging • Each word in a given input sentence must be labeled with one of 45 syntactic tags • Add a small set of orthographic features: whether a spelling begins with a number or an upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies • oov = out-of-vocabulary (not observed in the training set)

  38. CRF for XML Trees • XML documents are represented by DOM trees • Only element nodes, attribute nodes, and text nodes are considered • Attribute nodes are unordered; element nodes and text nodes are ordered

  39. CRF for XML Trees • With every set of nodes, associate a random field X of observables Xn and a random field Y of output variables Yn, where n is a position • Xn will be the symbols of the input trees, and Yn will be the labels of their labelings • Triangle feature function:

  40. CRF for XML Trees [Figure: a DOM tree of an HTML table (table, tr, td element nodes, an @class attribute node, and text nodes such as account, client, product, id, number, name, address, price), with observation variables Xn paired with output variables Y0, Y1, Y1.1, Y1.2, Y2, Y2.1–Y2.4]

  41. CRF-Query Refinement • Introduce refinement operations and incorporate them into the CRF model • Let o denote a sequence of refinement operations, o = o1, o2, …, on • The conditional model P(y, o | x) is called the CRF-QR model

  42. Operations

  43. Graphical representation

  44. CRF-Query Refinement

  45. Next work • Continue along the two ways described above (focusing on XML tree structure and on keywords) Thanks!
