Semi-supervised Structured Prediction Models



  1. Semi-supervised Structured Prediction Models Ulf Brefeld Joint work with Christoph Büscher, Thomas Gärtner, Peter Haider, Tobias Scheffer, Stefan Wrobel, and Alexander Zien

  2. Binary Classification • Inappropriate for complex real-world problems. [Figure: positive and negative examples separated by a hyperplane with weight vector w]

  3. Label Sequence Learning • Named entity recognition (NER): x = "Tom comes from London." → y = "Person, –, –, Location"; x = "The secretion of PTH and CT..." → y = "–, –, –, Gene, –, Gene, …" • Part-of-speech (POS) tagging: x = "Curiosity kills the cat." → y = "noun, verb, det, noun" • Protein secondary structure prediction: x = "XSITKTELDG ILPLVARGKV…" → y = "SS  TT SS EEEE SS…"

  4. Natural Language Parsing: x = "Curiosity kills the cat", y = [figure: parse tree]. Classification with Taxonomies: x = [figure: input instance], y = [figure: node in a class taxonomy].

  5. Structural Learning • Given: n labeled pairs $(x_1,y_1),\dots,(x_n,y_n) \in X \times Y$, drawn i.i.d. according to $p(x,y)$. • Learn a ranking function $f: X \times Y \to \mathbb{R}$; model: $f(x,y) = \langle w, \phi(x,y)\rangle$. • The decision value $f(x,y)$ measures how well $y$ fits $x$. • Compute the prediction (inference/decoding): $\hat y = \operatorname{argmax}_{y} f(x,y)$. • Find the hypothesis that realizes the smallest regularized empirical risk: $\min_f \sum_{i=1}^{n} \ell\big(y_i, f(x_i,\cdot)\big) + \eta\,\|f\|^2$. • Hinge loss: M³ networks, SVMs; log-loss: kernel CRFs. (A code sketch of this setup follows below.)
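To make the slide's setup concrete, here is a minimal Python sketch of a linear structured model with argmax decoding and regularized empirical risk. The function names and the brute-force enumeration over a candidate set are illustrative assumptions; real decoders use Viterbi or CKY, as later slides show.

```python
import numpy as np

def score(w, phi, x, y):
    """Decision value f(x, y) = <w, phi(x, y)>: how well y fits x."""
    return np.dot(w, phi(x, y))

def predict(w, phi, x, candidates):
    """Inference/decoding: the highest-scoring output among the candidates.
    Brute force for illustration only; sequences/trees use Viterbi/CKY."""
    return max(candidates, key=lambda y: score(w, phi, x, y))

def regularized_risk(w, phi, loss, labeled, candidates, eta=0.1):
    """Regularized empirical risk with a generic loss Delta(y, y_hat)."""
    risk = sum(loss(y, predict(w, phi, x, candidates)) for x, y in labeled)
    return risk + eta * np.dot(w, w)
```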

  6. Semi-supervised Discriminative Learning • Labeled training data is scarce and expensive. • E.g., experiments in computational biology. • Need for expert knowledge. • Tedious and time-consuming. • Unlabeled instances are abundant and cheap. • Extract texts/sentences from the web (POS tagging, NER, NLP). • Assess the primary structure of proteins from DNA/RNA. • … There is a need for semi-supervised techniques in structural learning!

  7. Overview • Semi-supervised learning. • Co-regularized least squares regression. • Semi-supervised structured prediction models. • Co-support vector machines. • Transductive SVMs and efficient optimization. • Case study: email batch detection. • Supervised Clustering. • Conclusion.

  8. Overview • Semi-supervised learning techniques. • Co-regularized least squares regression. • Semi-supervised structured prediction models. • Co-support vector machines. • Transductive SVMs and efficient optimization. • Email batch detection • Supervised Clustering. • Conclusion.

  9. Cluster Assumption • Now: m unlabeled inputs are given in addition to the n labeled pairs, with m >> n. • The decision boundary should not cross high-density regions. • Examples: transductive learning, graph kernels, … • But: the cluster assumption is frequently inappropriate, e.g., for regression! • What else can we do?

  10. Learning from Multiple Views / Co-learning • Split the attributes into two disjoint sets (views) V1, V2. • E.g., web page classification: • View 1: content of the web page. • View 2: anchor text of inbound links. • In each view, learn a hypothesis fv, v = 1,2. • Each fv provides its peer with predictions on unlabeled examples. • Strategy: maximize the consensus between f1 and f2.

  11. Hypothesis Space Intersection [Figure: hypothesis spaces of views V1 and V2, their version-space intersection, and the true labeling function] • Hypothesis spaces H1 and H2. • Minimize the error rate and the disagreement for all hypotheses in H1 ∩ H2. • Unlabeled examples = data-driven regularization! • Consensus maximization principle: • Labeled examples → minimize the error. • Unlabeled examples → minimize the disagreement. ⇒ Minimize an upper bound on the error!

  12. Co-optimization Problem • Given: • n labeled pairs: $(x_1,y_1),\dots,(x_n,y_n) \in X \times Y$ • m unlabeled inputs: $x_{n+1},\dots,x_{n+m} \in X$ • Loss function: $\Delta: Y \times Y \to \mathbb{R}^+$ • V hypotheses: $(f_1,\dots,f_V) \in H_1 \times \dots \times H_V$ • Goal:

$$\min\; Q(f_1,\dots,f_V) = \sum_{v=1}^{V}\Big[\underbrace{\sum_{i=1}^{n}\Delta\big(y_i,\ \operatorname*{argmax}_{y'} f_v(x_i,y')\big)}_{\text{empirical risk of } f_v} + \underbrace{\eta\,\|f_v\|^2}_{\text{regularization}}\Big] + \lambda \underbrace{\sum_{u,v=1}^{V}\ \sum_{j=n+1}^{n+m}\Delta\big(\operatorname*{argmax}_{y'} f_u(x_j,y'),\ \operatorname*{argmax}_{y''} f_v(x_j,y'')\big)}_{\text{pairwise disagreements}}$$

• Representer theorem: the optimal $f_v$ admit kernel expansions over the labeled and unlabeled examples.
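The objective translates almost literally into code. The sketch below evaluates Q for generic per-view decoders; the interface (a `decode` function returning argmax_y f_v(x, y) and a `sq_norm` function returning ||f_v||²) is hypothetical.

```python
def co_objective(views, labeled, unlabeled, loss, eta=0.1, lam=0.1):
    """Evaluate Q(f_1, ..., f_V): per-view empirical risk and regularizer
    on labeled data, pairwise disagreement on unlabeled data."""
    q = 0.0
    for decode, sq_norm in views:
        q += sum(loss(y, decode(x)) for x, y in labeled) + eta * sq_norm()
    for decode_u, _ in views:
        for decode_v, _ in views:  # sum over all view pairs (u, v)
            q += lam * sum(loss(decode_u(xj), decode_v(xj)) for xj in unlabeled)
    return q
```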

  13. Overview • Semi-supervised learning techniques. • Co-regularized least squares regression. • Semi-supervised structured prediction models. • Co-support vector machines. • Transductive SVMs and efficient optimization. • Email batch detection • Supervised Clustering. • Conclusion.

  14. Semi-supervised Regularized Least Squares Regression • Special case: • Output space Y = ℝ. • Real-valued hypotheses $f_v$. • Squared loss: $\Delta(y,\hat y) = (y - \hat y)^2$. • Given: • n labeled examples, • m unlabeled inputs, • V views (V kernel functions $k_1,\dots,k_V$). • Consensus maximization principle: • Minimize the squared error on labeled examples. • Minimize the squared differences on unlabeled examples.

  15. Co-regularized Least Squares Regression • Kernel matrices $K_v$ over the labeled and unlabeled inputs. • Optimization problem: empirical risk plus disagreement regularization. • Closed-form solution: the system matrix is strictly positive definite whenever the $K_v$ are strictly positive definite, so the solution exists and is unique.

  16. Co-regularized Least Squares Regression • Kernel matrices $K_v$ over the labeled and unlabeled inputs. • Optimization problem: empirical risk plus disagreement regularization. • Closed-form solution. • Execution time: as good (or bad) as the state of the art. (A numpy sketch follows below.)
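For concreteness, a numpy sketch of a two-view solver. The objective assumed here (squared loss on labeled points, ν-weighted regularizers, λ-weighted squared disagreement on unlabeled points) is my reading of the slides, and the block linear system below is derived from that assumption rather than copied from the paper.

```python
import numpy as np

def corlsr_exact(K1, K2, y, n, nu=0.1, lam=0.1):
    """Two-view coRLSR (sketch).  K1, K2: (n+m) x (n+m) kernel matrices
    over labeled and unlabeled inputs; y: labels of the first n inputs.
    Solves  min_c sum_v ||y - K_v[:n] c_v||^2 + nu c_v' K_v c_v
                  + lam ||K_1[n:] c_1 - K_2[n:] c_2||^2
    via one linear system of size 2(n+m) -- cubic in n+m (cf. slide 22)."""
    KL1, KU1 = K1[:n], K1[n:]        # labeled rows / unlabeled rows
    KL2, KU2 = K2[:n], K2[n:]
    A = np.block([
        [KL1.T @ KL1 + nu * K1 + lam * KU1.T @ KU1, -lam * KU1.T @ KU2],
        [-lam * KU2.T @ KU1, KL2.T @ KL2 + nu * K2 + lam * KU2.T @ KU2],
    ])
    b = np.concatenate([KL1.T @ y, KL2.T @ y])
    c = np.linalg.solve(A, b)
    return c[: len(K1)], c[len(K1):]  # expansion coefficients c_1, c_2
```

The per-view predictions are $f_v(x) = \sum_i c_{v,i}\, k_v(x_i, x)$; averaging the two views gives the final regressor.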

  17. Semi-parametric Approximation • Restrict the hypothesis space: expand each $f_v$ over the labeled examples only. • Convex objective function.

  18. Semi-parametric Approximation • Restrict the hypothesis space: expand each $f_v$ over the labeled examples only. • Convex objective function. • Closed-form solution. • Execution time: only linear in the amount of unlabeled data. (See the sketch below.)
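Under the same assumed objective, restricting each $f_v$ to an expansion over the n labeled points shrinks the linear system from size 2(n+m) to 2n; the unlabeled inputs enter only through m×n cross-kernel blocks, which is exactly where the linear-in-m cost comes from. Again a sketch, not the paper's formulation:

```python
import numpy as np

def corlsr_semiparametric(K1, K2, y, n, nu=0.1, lam=0.1):
    """Semi-parametric coRLSR (sketch): expand f_v over labeled points only.
    Forming U.T @ U costs O(m n^2) -- linear in the unlabeled sample size."""
    L1, U1 = K1[:n, :n], K1[n:, :n]   # labeled block / unlabeled-vs-labeled
    L2, U2 = K2[:n, :n], K2[n:, :n]
    A = np.block([
        [L1.T @ L1 + nu * L1 + lam * U1.T @ U1, -lam * U1.T @ U2],
        [-lam * U2.T @ U1, L2.T @ L2 + nu * L2 + lam * U2.T @ U2],
    ])
    b = np.concatenate([L1.T @ y, L2.T @ y])
    c = np.linalg.solve(A, b)
    return c[:n], c[n:]               # n coefficients per view
```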

  19. Semi-supervised Methods for Distributed Data • Participants keep labeled data private. • Agree on fixed set of unlabeled data. • Converges to global optimum.

  20. Empirical Results • 32 UCI data sets, 10-fold "inverse" cross-validation. • Dashed lines indicate equal performance. • RMSE: exact coRLSR < semi-parametric coRLSR < RLSR. [Figure: pairwise RMSE scatter plots for RLSR, coRLSR (approx.), and coRLSR (exact)] Results taken from: Brefeld, Gärtner, Scheffer, Wrobel, "Efficient CoRLSR", ICML 2006

  22. Execution Time • Exact solution is cubic in the number of unlabeled examples. • Approximation only linear! Results taken from: Brefeld, Gärtner, Scheffer, Wrobel, “Efficient CoRLSR”, ICML 2006

  23. Overview • Semi-supervised learning techniques. • Co-regularized least squares regression. • Semi-supervised structured prediction models. • Co-support vector machines. • Transductive SVMs and efficient optimization. • Email batch detection • Supervised Clustering. • Conclusion.

  24. Semi-supervised Learning for Structured Output Variables • Given: • n labeled examples, • m unlabeled inputs. • Joint decision function $f_v(x,y) = \langle w_v, \phi_v(x,y)\rangle$, with distinct joint feature mappings $\phi_1, \phi_2$ in views V1 and V2. • Apply the consensus maximization principle: • Minimize the error on labeled examples. • Minimize the disagreement on unlabeled examples. • Compute the argmax by: • Viterbi algorithm (sequential output), • CKY algorithm (recursive grammar). (A Viterbi sketch follows below.)
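Since the argmax is where structured prediction differs from plain classification, here is a compact Viterbi sketch for the sequential case. The first-order decomposition into emission and transition scores is an assumption of the sketch; the CKY case for grammars is analogous.

```python
import numpy as np

def viterbi(emit, trans, start):
    """Argmax over label sequences.  emit: T x S per-position label scores,
    trans: S x S transition scores, start: length-S initial scores."""
    T, S = emit.shape
    delta = start + emit[0]            # best score of a path ending in each state
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + trans  # cand[r, s]: extend best path via r -> s
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):      # follow the back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]                  # label indices for positions 0 .. T-1
```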

  25. CoSVM Optimization Problem • For each view v = 1,2: a structured SVM whose margin constraints on unlabeled examples involve the prediction and the confidence of the peer view. • Dual representation: • Dual parameters are bound to input examples. • Working sets are associated with subspaces. • Sparse models!

  26. Labeled Examples, View v = 1,2 • Example: xi = "John ate the cat", yi = ⟨N,V,D,N⟩. • Start with an empty working set Ωiv = {}, αiv = (). • Viterbi decoding proposes outputs ŷ, here first ⟨N,D,D,N⟩, then ⟨N,V,V,N⟩. • On an error/margin violation: 1. update the working set Ωiv with the difference vector φv(xi,yi) − φv(xi,ŷ) and its dual weight αiv(ŷ); 2. optimize αiv, keeping αj≠i and the working sets Ωj≠i fixed. • When decoding returns ŷ = yi = ⟨N,V,D,N⟩, return αiv, Ωiv.

  27. Unlabeled Examples • Example: xi = "John went home". • View 1: working set Ωi¹ = {}, αi¹ = (); Viterbi decoding yields ŷ¹ = ⟨D,V,N⟩. • View 2: working set Ωi² = {}, αi² = (); Viterbi decoding yields ŷ² = ⟨N,V,V⟩. • Disagreement / margin violation! 1. Update the working sets Ωi¹, Ωi² with the peer's prediction: difference vectors φ1(xi,⟨N,V,V⟩) − φ1(xi,⟨D,V,N⟩) with αi¹(⟨D,V,N⟩), and φ2(xi,⟨D,V,N⟩) − φ2(xi,⟨N,V,V⟩) with αi²(⟨N,V,V⟩). 2. Optimize αi¹, αi², keeping αj≠i and the working sets Ωj≠i fixed. • Re-decoding now gives ŷ¹ = ŷ² = ⟨N,V,N⟩. • Consensus: return αi¹, αi², Ωi¹, Ωi². (A schematic code sketch follows below.)
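The two walk-throughs condense into a short schematic. The view interface below (`decode`, `working_set`, `optimize_alphas`) is entirely hypothetical and only mirrors the steps in the slides.

```python
def cosvm_unlabeled_step(x, view1, view2):
    """Schematic coSVM step for one unlabeled input x (hypothetical API).
    Iterate until both views agree; the slides' example converges after
    a single update."""
    y1, y2 = view1.decode(x), view2.decode(x)  # e.g., Viterbi decoding
    while y1 != y2:                            # disagreement = margin violation
        view1.working_set(x).add(y2)           # peer prediction as constraint
        view2.working_set(x).add(y1)
        view1.optimize_alphas(x)               # all other dual variables and
        view2.optimize_alphas(x)               # working sets stay fixed
        y1, y2 = view1.decode(x), view2.decode(x)
    return y1                                  # consensus prediction
```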

  28. BioCreative Named Entity Recognition • BioCreative (Task 1A, BioCreative Challenge, 2003). • 7,500 sentences from biomedical papers. • Task: recognize gene/protein names. • 500 holdout sentences. • Approximately 350,000 features (letter n-grams, surface clues, …). • Random feature split. • The baseline is trained on all features. Results taken from: Brefeld, Büscher, Scheffer, "Semi-supervised Discriminative Sequential Learning", ECML 2005

  29. Biocreative Gene/Protein Name Recognition • CoSVM more accurate than SVM. • Accuracy positively correlated with number of unlabeled examples. Results taken from: Brefeld, Büscher, Scheffer, “Semi-supervised Discriminative Sequential Learning”, ECML 2005

  30. Natural Language Parsing • Wall Street Journal corpus (Penn treebank). • Sections 2–21. • 8,666 sentences of length ≤ 15 tokens. • Context-free grammar contains > 4,800 production rules. • Negra corpus. • German newspaper archive. • 14,137 sentences of between 5 and 25 tokens. • CFG contains > 26,700 production rules. • Experimental setup: • Local features (rule identity, rule at border, span width, …). • Loss: Δ(ya, yb) = 1 − F1(ya, yb). • 100 holdout examples. • CKY parser by Mark Johnson. Results taken from: Brefeld, Scheffer, "Semi-supervised Learning for Structured Output Variables", ICML 2006

  31. Wall Street Journal / Negra Corpus Natural Language Parsing • CoSVM significantly outperforms the SVM. • Adding unlabeled instances further improves the F1 score. Results taken from: Brefeld, Scheffer, "Semi-supervised Learning for Structured Output Variables", ICML 2006

  32. Execution Time • CoSVM scales quadratically in the number of unlabeled examples. Results taken from: Brefeld, Scheffer, "Semi-supervised Learning for Structured Output Variables", ICML 2006

  33. Overview • Semi-supervised learning techniques. • Co-regularized least squares regression. • Semi-supervised structured prediction models. • Co-support vector machines. • Transductive SVMs and efficient optimization. • Email batch detection • Supervised Clustering. • Conclusion.

  34. Transductive Support Vector Machines for Structured Variables • Binary transductive SVMs: • Cluster assumption. • Discrete variables for unlabeled instances. • Optimization is expensive even for binary tasks! • Structural transductive SVMs: • Decoding = combinatorial optimization over discrete variables. • Intractable! • Efficient optimization: • Transform the problem to remove the discrete variables. • Differentiable, continuous optimization. • Apply gradient-based, unconstrained optimization techniques.

  35. Unconstrained Support Vector Machines • SVM optimization problem. • Unconstrained SVM: solve the constraints for the slack variables and substitute them into the objective. • The hinge loss is not differentiable! BUT: the Huber loss is!

  36. Unconstrained Support Vector Machines • SVM optimization problem. • Unconstrained SVM: solve the constraints for the slack variables. • A differentiable objective without constraints! • But there is still a max in the objective: substitute the differentiable softmax for the max! (A sketch of both smoothing steps follows below.)
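A sketch of the two smoothing devices. The concrete functional forms below (quadratic smoothing of width eps, inverse temperature beta) are illustrative choices, not necessarily the paper's.

```python
import numpy as np

def huber_hinge(t, eps=0.5):
    """Differentiable surrogate for the hinge max(0, t): quadratic around
    the kink at t = 0, linear for t >= eps."""
    return np.where(t <= 0, 0.0,
                    np.where(t >= eps, t - eps / 2, t ** 2 / (2 * eps)))

def softmax_max(scores, beta=10.0):
    """Smooth stand-in for max_y f(x, y): (1/beta) log sum_y exp(beta f),
    computed with the usual log-sum-exp stabilization."""
    m = scores.max()
    return m + np.log(np.exp(beta * (scores - m)).sum()) / beta

# Smoothed margin violation for one example, now differentiable in w:
# violation = huber_hinge(softmax_max(wrong_scores) - true_score)
```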

  37. Unconstrained Transductive Support Vector Machines • Unconstrained SVM objective function. • Include unlabeled instances via an appropriate loss function: margin violations are mitigated by moving w in two symmetric ways (computed with a 2-best decoder), and a trade-off parameter controls the overall influence of the unlabeled instances. • The optimization problem is not convex!

  38. Execution Time • Gradient-based optimization is faster than solving QPs. • Efficient transductive integration of unlabeled instances. [Figure: run times with +250 and +500 unlabeled examples] Results taken from: Zien, Brefeld, Scheffer, "TSVMs for Structured Variables", ICML 2007

  39. Spanish News Wire Named Entity Recognition • Spanish News Wire (Special Session of CoNLL, 2002). • 3,100 sentences of between 10 and 40 tokens. • Entities: person, location, organization, and misc. names (9 labels). • Window of size 3 around each token. • Approximately 120,000 features (the token itself, surface clues, …). • 300 holdout sentences. Results taken from: Zien, Brefeld, Scheffer, "TSVMs for Structured Variables", ICML 2007

  40. Spanish News Named Entity Recognition • The TSVM has significantly lower error rates than the SVM. • The error decreases with the number of unlabeled instances. [Figure: token error (%) vs. number of unlabeled examples] Results taken from: Zien, Brefeld, Scheffer, "TSVMs for Structured Variables", ICML 2007

  41. Artificial Sequential Data [Figure: two panels, RBF vs. Laplacian kernel] • 10-nearest-neighbor Laplacian kernel vs. RBF kernel. • The Laplacian kernel is well suited. • Only little improvement by the TSVM, if any. • Different cluster assumptions: • Laplacian: local (token level). • TSVM: global (sequence level). Results taken from: Zien, Brefeld, Scheffer, "TSVMs for Structured Variables", ICML 2007

  42. Overview • Semi-supervised learning techniques. • Co-regularized least squares regression. • Semi-supervised structured prediction models. • Co-support vector machines. • Transductive SVMs and efficient optimization. • Email batch detection. • Supervised Clustering. • Conclusion.

  43. Supervised Clustering of Data Streams for Email Batch Detection • Spam characteristics: • About 80% of all electronic messages are spam. • Approximately 80–90% of these spam messages are generated by only a few spammers. • Spammers maintain templates and exchange them rapidly. • Many emails are generated from the same template (= batch) within short time frames. • Goal: • Detect batches in the data stream. • Ground truth in the form of exact clusterings exists! • Batch information helps with: • Black/white listing. • Improving spam/non-spam classification.

  44. Template-Generated Spam Messages • Message 1: "Hello, This is Terry Hagan. We are accepting your mo rtgage application. Our company confirms you are legible for a $250.000 loan for a $380.00/month. Approval process will take 1 minute, so please fill out the form on our website. Best Regards, Terry Hagan; Senior Account Director, Trades/Fin ance Department, North Office" • Message 2: "Dear Mr/Mrs, This is Brenda Dunn. We are accepting your mortga ge application. Our office confirms you can get a $228.000 lo an for a $371.00 per month payment. Follow the link to our website and submit your contact information. Best Regards, Brenda Dunn; Accounts Manager, Trades/Fina nce Department, East Office" • Both messages instantiate the same template.

  45. Correlation Clustering • Parameterized similarity measure. • The solution is equivalent to a poly-cut in a fully connected graph. • The edge weight is the similarity of the connected nodes. • Maximize the intra-cluster similarity.

  46. Problem Setting • Parameterized similarity measure: $\mathrm{sim}_w(x_r, x_s) = \langle w, \psi(x_r, x_s)\rangle$. • Pairwise features $\psi$: • edit distance of the subjects, • tf.idf similarity of the bodies, • … • Collection x contains $T_i$ messages $x_1^{(i)},\dots,x_{T_i}^{(i)}$. • Clustering matrix y with $y_{rs} = 1$ if $x_r$ and $x_s$ are in the same cluster and 0 otherwise. • Correlation clustering is NP-complete! • Solve a relaxed variant instead: • substitute continuous $y_{rs} \in [0,1]$ for binary $y_{rs} \in \{0,1\}$.
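A toy sketch of the parameterized similarity. The concrete features below (a difflib ratio for subjects, Jaccard word overlap for bodies) are cheap stand-ins for the slide's edit distance and tf.idf similarity.

```python
import difflib
import numpy as np

def pairwise_features(subj_r, body_r, subj_s, body_s):
    """Hypothetical pairwise feature map psi(x_r, x_s)."""
    subj_sim = difflib.SequenceMatcher(None, subj_r, subj_s).ratio()
    wr, ws = set(body_r.split()), set(body_s.split())
    body_sim = len(wr & ws) / max(1, len(wr | ws))  # Jaccard, not tf.idf
    return np.array([subj_sim, body_sim])

def similarity(w, feats):
    """Parameterized similarity sim_w(x_r, x_s) = <w, psi(x_r, x_s)>,
    i.e. the learned edge weight between two messages."""
    return float(np.dot(w, feats))
```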

  47. Large Margin Approach • Structural SVM with margin rescaling: minimize over w, subject to the clustering constraints. • Combine the minimizations and replace the inner problem with its Lagrangian dual. • The result is a QP with O(T³) constraints!

  48. Exploit the Data Stream! • Only the latest email xt has to be integrated into the existing clustering. • The clustering of x1,…,xt−1 remains fixed. • Execution time is linear in the number of emails. [Figure: sliding window over the email stream]

  49. Sequential Approximation • Exploit the streaming nature of the data: the objective of the full clustering is approximated by the objective of a sequential update, computable in O(T). • Decoding strategy: find the best cluster for the latest message, or create a singleton. (A greedy sketch follows below.)
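A greedy sketch of that decoding strategy. The rule for opening a singleton is an assumption: with similarities that may be negative, a natural choice is a new cluster whenever no existing cluster yields a positive similarity gain.

```python
def assign_message(t, sim, clusters):
    """Sequential decoding: put the newest message t into the existing
    cluster with the highest total similarity, or open a singleton.
    One pass over at most T earlier messages, hence O(T) per message."""
    best_gain, best_cluster = 0.0, None
    for cluster in clusters:
        gain = sum(sim(t, s) for s in cluster)  # intra-cluster similarity gain
        if gain > best_gain:
            best_gain, best_cluster = gain, cluster
    if best_cluster is None:
        clusters.append([t])                    # new batch: singleton cluster
    else:
        best_cluster.append(t)
    return clusters
```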

  50. Results for Batch Detection • No significant difference between the exact decoding and the sequential approximation.
