**Machine Learning Group**
Department of Computer Science, The University of Texas at Austin

Online Max-Margin Weight Learning with Markov Logic Networks
Tuyen N. Huynh and Raymond J. Mooney
Star AI 2010, July 12, 2010

**Outline**
• Motivation
• Background
  • Markov Logic Networks
  • Primal-dual framework
• New online learning algorithm for structured prediction
• Experiments
  • Citation segmentation
  • Search query disambiguation
• Conclusion

**Motivation**
• Most existing weight learning methods for MLNs operate in the batch setting:
  • They need to run inference over all the training examples in each iteration
  • They usually take a few hundred iterations to converge
  • The training examples may not all fit in memory
• Conventional solution: online learning

**Background**

**Markov Logic Networks (MLNs)** [Richardson & Domingos, 2006]
• An MLN is a weighted set of first-order formulas, for example:
  2.5  Center(i,c) => InField(Ftitle,i,c)
  1.2  InField(f,i,c) ^ Next(j,i) ^ ¬HasPunc(c,i) => InField(f,j,c)
• A larger weight indicates a stronger belief that the clause should hold
• Probability of a possible world (a truth assignment to all ground atoms) x:
  $$P(X = x) = \frac{1}{Z} \exp\Big(\sum_i w_i \, n_i(x)\Big)$$
  where $w_i$ is the weight of formula i, $n_i(x)$ is the number of true groundings of formula i in x, and $Z$ is the normalization constant.
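
To make the probability of a possible world concrete, here is a minimal Python sketch that enumerates every world of a tiny ground network and normalizes the exponentiated weighted counts. The function name `world_probabilities`, the atoms `A` and `B`, and the single formula `A => B` with weight 1.5 are illustrative assumptions, not part of the talk. Brute-force enumeration only works for toy cases; real MLNs need approximate inference.

```python
import itertools
import math

def world_probabilities(atoms, weighted_formulas):
    """Return [(world, probability)] for every truth assignment to `atoms`.

    `weighted_formulas` is a list of (weight, count_fn) pairs, where
    count_fn(world) gives the number of true groundings in that world.
    """
    worlds = [dict(zip(atoms, values))
              for values in itertools.product([False, True], repeat=len(atoms))]
    # Unnormalized score of a world: exp(sum_i w_i * n_i(x)).
    scores = [math.exp(sum(w * n(world) for w, n in weighted_formulas))
              for world in worlds]
    z = sum(scores)  # normalization constant Z
    return [(world, score / z) for world, score in zip(worlds, scores)]

# Toy MLN (hypothetical): one ground formula "A => B" with weight 1.5.
atoms = ["A", "B"]
weighted_formulas = [(1.5, lambda world: int((not world["A"]) or world["B"]))]

for world, prob in world_probabilities(atoms, weighted_formulas):
    print(world, round(prob, 3))
```

The only world that violates the formula (A true, B false) receives the lowest probability, but not zero: a finite weight makes the formula a soft constraint.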

**Existing discriminative weight learning methods for MLNs**
• Maximize the Conditional Log Likelihood (CLL) [Singla & Domingos, 2005], [Lowd & Domingos, 2007], [Huynh & Mooney, 2008]
• Maximize the margin, the log ratio between the probability of the correct label and the closest incorrect one [Huynh & Mooney, 2009]
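
As a short worked step, the margin can be written out under the assumption that the conditional distribution has the usual log-linear MLN form $P(y \mid x) = \exp\big(\sum_i w_i\, n_i(x, y)\big) / Z_x$, where $n_i(x, y)$ is the number of true groundings of formula i. The symbols $\hat{y}$ (the closest incorrect label), $\mathbf{n}$, and $Z_x$ are notation introduced here for illustration. Because $Z_x$ is shared by both labels, it cancels in the ratio:

```latex
\[
\log \frac{P(y \mid x)}{P(\hat{y} \mid x)}
= \sum_i w_i\, n_i(x, y) - \sum_i w_i\, n_i(x, \hat{y})
= \mathbf{w}^{\top} \bigl[ \mathbf{n}(x, y) - \mathbf{n}(x, \hat{y}) \bigr]
\]
```

Under that assumption the margin is linear in the weights, which is why max-margin structured prediction techniques apply.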

**Online learning**

**Primal-dual framework [Shalev-Shwartz et al., 2006]**
• A recent, general framework for deriving low-regret online algorithms
• Rewrites the regret bound as an optimization problem (called the primal problem), then considers the dual of that primal problem
• A condition that guarantees an increase in the dual objective at each step yields the class of Incremental-Dual-Ascent (IDA) algorithms, for example subgradient methods

**Primal-dual framework (cont.)**
• Also proposed a new class of IDA algorithms called Coordinate-Dual-Ascent (CDA) algorithms:
  • The CDA update rule only optimizes the dual w.r.t. the last dual variable
  • The CDA update rule has a closed-form solution
• CDA algorithms have the same cost as subgradient methods but increase the dual objective more at each step, so they converge to the optimal value faster

**Primal-dual framework (cont.)**

**CDA algorithms for max-margin structured prediction**

**Max-margin structured prediction**

**Steps for deriving new CDA algorithms**
1. Define the regularization and loss functions
2. Find the conjugate functions
3. Derive a closed-form solution for the CDA update rule

**1. Define the regularization and loss functions** Label loss function

**1. Define the regularization and loss functions (cont.)**

**2. Find the conjugate functions**

**2. Find the conjugate functions (cont.)**
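
For reference, the conjugate in question is the standard Fenchel conjugate. The block below gives its definition and one textbook instance, the squared L2 norm; choosing that function as the example is an assumption made here for illustration and is not taken from the slides.

```latex
\[
f^{*}(\theta) = \sup_{w} \bigl( \langle w, \theta \rangle - f(w) \bigr)
\]
\[
f(w) = \tfrac{1}{2} \lVert w \rVert_2^2
\;\Longrightarrow\;
f^{*}(\theta) = \tfrac{1}{2} \lVert \theta \rVert_2^2
\quad \text{(the supremum is attained at } w = \theta \text{)}
\]
```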

**3. Closed-form solution for the CDA update rule** • Optimization problem: • Solution:
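
The optimization problem and its solution are not reproduced in this transcript. As a related and well-known illustration of a closed-form online max-margin step, the sketch below implements the Passive-Aggressive PA-I update for structured prediction. It is not the authors' CDA rule; the function name `pa1_update`, the parameter `C`, and the use of plain feature vectors (for an MLN these would be vectors of true-grounding counts) are assumptions.

```python
def pa1_update(w, feat_true, feat_pred, label_loss, C=1.0):
    """One PA-I step: move w just enough to separate the true output from
    the predicted one by a margin of `label_loss`, with step size capped at C.
    """
    # Feature difference between the correct and the predicted structure.
    delta = [a - b for a, b in zip(feat_true, feat_pred)]
    margin = sum(wi * di for wi, di in zip(w, delta))
    # Structured hinge loss: how far the margin falls short of the label loss.
    hinge = max(0.0, label_loss - margin)
    norm_sq = sum(d * d for d in delta)
    if hinge == 0.0 or norm_sq == 0.0:
        return list(w)  # margin already satisfied, or nothing to separate
    tau = min(C, hinge / norm_sq)  # closed-form step size
    return [wi + tau * di for wi, di in zip(w, delta)]

# Hypothetical feature vectors for one training example.
new_w = pa1_update([0.0, 0.0], [2.0, 0.0], [0.0, 1.0], label_loss=1.0)
print(new_w)
```

In a full online learner, `feat_pred` would come from loss-augmented MPE inference under the current weights, and the update would run once per training example.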

**CDA algorithms for max-margin structured prediction**

**Experiments**

**Citation segmentation**
• CiteSeer dataset [Lawrence et al., 1999] [Poon and Domingos, 2007]
  • 1,563 citations, divided into 4 research topics
  • Each citation is segmented into 3 fields: Author, Title, Venue
• Used the simplest MLN in [Poon and Domingos, 2007]
  • Similar to a linear-chain CRF: Next(j,i) ^ !HasPunc(c,i) ^ InField(c,+f,i) => InField(c,+f,j)

**Experimental setup**
• Systems compared:
  • MM: the max-margin weight learner for MLNs in the batch setting [Huynh & Mooney, 2009]
  • 1-best MIRA [Crammer et al., 2005]
  • Subgradient [Ratliff et al., 2007]
  • CDA1/PA1
  • CDA2

**Experimental setup (cont.)**
• 4-fold cross-validation
• Metric:
  • CiteSeer: micro-average F1 at the token level
• Used exact MPE inference (Integer Linear Programming) for all online algorithms and approximate MPE inference (LP-relaxation) for the batch one
• Used Hamming loss as the label loss function
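
The two quantities named above can be sketched in a few lines of Python. The function names, the representation of a citation as a list of per-token field labels, and the example labels are assumptions for illustration; the exact evaluation script used in the experiments is not shown in the slides.

```python
def hamming_loss(true_labels, pred_labels):
    """Number of tokens whose predicted field differs from the true field."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    return sum(t != p for t, p in zip(true_labels, pred_labels))

def micro_f1(true_labels, pred_labels, fields):
    """Token-level micro-averaged F1: pool TP/FP/FN over all fields first."""
    tp = fp = fn = 0
    for field in fields:
        for t, p in zip(true_labels, pred_labels):
            if p == field and t == field:
                tp += 1
            elif p == field:
                fp += 1
            elif t == field:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical five-token citation with one mislabeled token.
truth = ["Author", "Author", "Title", "Title", "Venue"]
guess = ["Author", "Title", "Title", "Title", "Venue"]
print(hamming_loss(truth, guess))
print(micro_f1(truth, guess, ["Author", "Title", "Venue"]))
```

Hamming loss decomposes over tokens, which is what allows it to be folded into the objective of loss-augmented inference.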

**Average F1**

**Average training time in minutes**

**Microsoft web search query dataset**
• Used the cleaned-up dataset created by Mihalkova & Mooney [2009]
• Has thousands of search sessions in which an ambiguous query was asked
• Goal: disambiguate a search query based on previous related search sessions
• Used 3 MLNs proposed in [Mihalkova & Mooney, 2009]

**Experimental setup**
• Systems compared:
  • Contrastive Divergence (CD) [Hinton 2002]: used in [Mihalkova & Mooney, 2009]
  • 1-best MIRA
  • Subgradient
  • CDA1/PA1
  • CDA2
• Metric:
  • Mean Average Precision (MAP): how close the relevant results are to the top of the rankings
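
A minimal sketch of MAP in its common form follows: each query contributes the average of the precision values at the ranks of its relevant results, and MAP is the mean over queries. The function names and the boolean-list representation of a ranking are assumptions, and the exact variant used in the experiments may differ.

```python
def average_precision(ranked_relevance):
    """Average precision for one ranking, given relevance flags in rank order."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this relevant result
    return total / hits if hits else 0.0

def mean_average_precision(rankings):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two hypothetical queries: relevant results at ranks 1 and 3, then at rank 2.
rankings = [[True, False, True], [False, True]]
print(mean_average_precision(rankings))
```

A ranking with every relevant result at the top scores 1.0, and the score drops as relevant results move down the list.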

**MAP scores**

**Conclusion**
• Derived CDA algorithms for max-margin structured prediction
  • They have the same computational cost as existing online algorithms but increase the dual objective more at each step
• Experimental results on two real-world problems show that the new algorithms generally achieve better accuracy and more consistent performance

**Thank you!** Questions?