

Efficient Decomposed Learning for Structured Prediction
Rajhans Samdani and Dan Roth, University of Illinois at Urbana-Champaign





Presentation Transcript


Structured Prediction
• Predict y = {y1, y2, …, yn} ∈ Y given input x
• Features: φ(x, y); weight parameters: w
• Inference: argmax over y ∈ Y of f(x, y) = w · φ(x, y)
• Learning: estimate w
• Structural SVMs learn by minimizing a regularized structured hinge loss, which uses global (loss-augmented) inference as an intermediate step
• Global inference is slow ⇒ global learning is time consuming!

Decomposed Learning (DecL)
• Reduce inference during learning to a neighborhood nbr(yj) around the gold labeling yj
• Learn by varying a subset of the output variables while fixing the remaining variables to their gold labels in yj
[Figure: an example decomposition of output variables y1–y6 into overlapping subsets]
• Small neighborhoods ⇒ efficient learning
• We show theoretically and experimentally that Decomposed Learning with small neighborhoods can be identical to Global Learning (GL)

DecL: Learning via Decompositions
• A decomposition is a collection of different (non-inclusive, possibly overlapping) sets of variables over which we perform the argmax: Sj = {s1, …, sl | ∀i, si ⊆ {1, …, n}; ∀i, k, si ⊄ sk}
• Learning with the decomposition in which all subsets of size k are considered is called DecL-k; DecL-1 is also known as Pseudomax
• In practice, decompositions based on domain knowledge group highly coupled variables together

Baseline: Local Learning (LL)
• Approximations to GL that ignore certain structural interactions so that the remaining structure becomes easy to learn
• E.g., ignoring global constraints or pairwise interactions in a Markov network
• Another baseline is LL+C, where we learn the pieces independently and apply full structural inference (e.g., constraints, if available) at test time
• Fast, but oblivious to rich structural information!

Theoretical Results: Decompositions which Yield Exactness
• With exact (global) inference, the separating weights are
 W* = {w | f(xj, yj; w) ≥ f(xj, y; w) + Δ(yj, y), ∀y ∈ Y, for all (xj, yj) in the training data}
• With decomposed inference, the separating weights are
 Wdecl = {w | f(xj, yj; w) ≥ f(xj, y; w) + Δ(yj, y), ∀y ∈ nbr(yj), for all (xj, yj) in the training data}
• Exactness: DecL is exact if it has the same set of separating weights as GL, i.e. Wdecl = W*
• Exactness with finite data is much more useful than asymptotic consistency
• Main Theorem: DecL is exact if for every w ∈ W* there exists ε > 0 such that for all w′ ∈ B(w, ε) and all (xj, yj) ∈ D: if there exists y ∈ Y with f(xj, y; w′) + Δ(yj, y) > f(xj, yj; w′), then there exists y′ ∈ nbr(yj) with f(xj, y′; w′) + Δ(yj, y′) > f(xj, yj; w′)
• In other words, for weights immediately outside W*, global inseparability ⇒ DecL inseparability

Exactness for Special Cases
• Pairwise Markov network (PMN) over a graph with edges E
• Assume domain knowledge on W*: we know that for a separating w, each pairwise potential φi,k(·; w) is either
 submodular: φi,k(0,0) + φi,k(1,1) > φi,k(0,1) + φi,k(1,0), or
 supermodular: φi,k(0,0) + φi,k(1,1) < φi,k(0,1) + φi,k(1,0)
[Figure: the edge set Ej derived from E using the gold labels (0/1) and the sub(φ)/sup(φ) type of each edge]
• Theorem: the Spair decomposition consisting of the connected components of Ej yields exactness
• Linear scoring function structured via constraints on Y
• For simple constraints, it is possible to show exactness for decompositions with set sizes independent of n
• Theorem: if Y is specified by k OR-constraints, then DecL-(k+1) is exact
• As an example consequence, when Y is specified by k horn clauses y1,1 ∧ y1,2 ∧ … ∧ y1,r → y1,r+1, …, yk,1 ∧ yk,2 ∧ … ∧ yk,r → yk,r+1, decompositions with set size (k+1), i.e. independent of the number of variables in the constraints, r, yield exactness
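As an illustration of the ideas above, the following is a minimal Python sketch of how a DecL-k neighborhood nbr(yj) could be enumerated for binary output variables: every labeling that agrees with the gold labeling yj outside a single set of at most k variables. This is an illustrative reconstruction, not the authors' code; the function name decl_neighborhood and the binary label set are assumptions made for the sketch.

from itertools import combinations, product

def decl_neighborhood(y_gold, k, labels=(0, 1)):
    # nbr(y_gold) for DecL-k: every labeling that differs from the gold
    # labeling only inside one subset of at most k output variables.
    n = len(y_gold)
    nbr = {tuple(y_gold)}
    for size in range(1, k + 1):
        for subset in combinations(range(n), size):
            for assignment in product(labels, repeat=size):
                y = list(y_gold)
                for idx, val in zip(subset, assignment):
                    y[idx] = val
                nbr.add(tuple(y))
    return nbr

# Example: 6 binary output variables, DecL-2.
y_gold = (1, 0, 0, 1, 0, 1)
print(len(decl_neighborhood(y_gold, k=2)))   # 22 labelings, vs. |Y| = 2**6 = 64

For fixed k the neighborhood grows only polynomially in the number of output variables (here 22 labelings out of 64), which is what makes inference during learning cheap.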
Experiments
• Synthetic data: random linear scoring function with random constraints
 [Plot: average Hamming loss vs. number of training examples for the Local Learning (LL) baselines, DecL-1 (aka Pseudomax), and Global Learning (GL) and Decomposed Learning (DecL)-2,3]
• Information extraction: given a citation, extract author, book-title, title, etc.; given ads text, extract features, size, neighborhood, etc.
• Constraints like: 'title' tokens are likely to appear together in a single block; a paper should have at most one 'title'
• Domain knowledge: the HMM transition matrix is diagonal heavy (a generalization of submodular pairwise potentials)
 [Plots: accuracy and training time (hours)]
• Multi-label document classification: experiments on Reuters data; documents carry multiple labels (corn, crude, earn, grain, interest, …)
• Modeled as a PMN over a complete graph over the labels, with singleton and pairwise components
 [Plots: F1 scores and training time (hours)]
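To show how such a model could be trained with decomposed inference, here is a hedged Python sketch of a single DecL-2 update for a pairwise Markov network over a complete label graph, in the spirit of the multi-label experiment above. For brevity it uses a structured-perceptron-style update rather than the structural SVM objective from the poster, and all names (pmn_score, decl2_update, the indicator feature layout, the learning rate) are assumptions made for this sketch, not the authors' implementation.

import numpy as np
from itertools import combinations, product

# Assumed shapes: x is a (d,) feature vector, w_node is an (n, d) array,
# w_edge is an (n, n, 2, 2) array of pairwise indicator weights.

def pmn_score(w_node, w_edge, x, y):
    # Singleton terms for labels that are "on", plus a pairwise term
    # for every pair of labels (complete graph over the n labels).
    n = len(y)
    s = sum(y[i] * np.dot(w_node[i], x) for i in range(n))
    s += sum(w_edge[i, k, y[i], y[k]] for i, k in combinations(range(n), 2))
    return s

def hamming(y, y_gold):
    return sum(a != b for a, b in zip(y, y_gold))

def decl2_update(w_node, w_edge, x, y_gold, lr=0.1):
    # Loss-augmented argmax restricted to the DecL-2 neighborhood:
    # labelings differing from y_gold in at most 2 label positions.
    n = len(y_gold)
    best, best_val = list(y_gold), -np.inf
    for size in range(0, 3):
        for subset in combinations(range(n), size):
            for vals in product((0, 1), repeat=size):
                y = list(y_gold)
                for i, v in zip(subset, vals):
                    y[i] = v
                val = pmn_score(w_node, w_edge, x, y) + hamming(y, y_gold)
                if val > best_val:
                    best, best_val = y, val
    if best != list(y_gold):
        # Perceptron-style step toward the gold labeling and away from
        # the most violating labeling found in the neighborhood.
        for i in range(n):
            w_node[i] += lr * (y_gold[i] - best[i]) * x
        for i, k in combinations(range(n), 2):
            w_edge[i, k, y_gold[i], y_gold[k]] += lr
            w_edge[i, k, best[i], best[k]] -= lr

# Toy usage with 4 labels and 3 document features:
rng = np.random.default_rng(0)
w_node, w_edge = rng.normal(size=(4, 3)), np.zeros((4, 4, 2, 2))
decl2_update(w_node, w_edge, x=rng.normal(size=3), y_gold=(1, 0, 0, 1))

The loss-augmented argmax only visits labelings within Hamming distance 2 of the gold labeling, so each update evaluates a number of candidates that is quadratic in the number of labels, rather than exponential as in global learning.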

Supported by the Army Research Laboratory (ARL), the Defense Advanced Research Projects Agency (DARPA), and the Office of Naval Research (ONR).