## Constraint satisfaction inference for discrete sequence processing in NLP


**Constraint satisfaction inference for discrete sequence processing in NLP**

Antal van den Bosch, ILK / CL and AI, Tilburg University
DCU, Dublin, April 19, 2006
(work with Sander Canisius and Walter Daelemans)

**Talk overview**

• How to map sequences to sequences, not output tokens?
• Case studies: syntactic and semantic chunking
• Discrete versus probabilistic classifiers
• Constraint satisfaction inference
• Discussion

**How to map sequences to sequences?**

• Machine learning's pet solution:
• Local-context windowing (NETtalk)
• One-shot prediction of single output tokens
• Concatenation of predicted tokens

**The near-sightedness problem**

• A local window never captures long-distance information.
• No coordination of individual output tokens.
• Long-distance information does exist; holistic coordination is needed.

**Holistic information**

• "Counting" constraints: certain entities occur only once in a clause/sentence.
• "Syntactic validity" constraints: on discontinuity and overlap; chunks have a beginning and an end.
• "Co-occurrence" constraints: some entities must occur with others, or cannot co-exist with others.

**Solution 1: Feedback**

• Recurrent networks in ANNs (Elman, 1991; Sun & Giles, 2001), e.g. for word prediction.
• Memory-based tagger (Daelemans, Zavrel, Berck, and Gillis, 1996).
• Maximum-entropy tagger (Ratnaparkhi, 1996).

**Feedback disadvantage**

• The label bias problem (Lafferty, McCallum, and Pereira, 2001).
• The previous prediction is an important source of information, so the classifier is compelled to take its own prediction as correct.
• Cascading errors result.

**Solution 2: Stacking**

• Wolpert (1992) for ANNs; Veenstra (1998) for NP chunking:
• A near-sighted stage-1 classifier predicts sequences.
• A stage-2 classifier learns to correct stage-1 errors by taking the stage-1 output as windowed input.

**Stacking disadvantages**

• Practical issues: ideally, stage 2 is trained on cross-validated output of stage 1, not on "perfect" output — a costly procedure.
• The total architecture consists of two full classifiers.
• Error correction is local, not global.

**What exactly is the problem with mapping to sequences?**

• Born in Made, The Netherlands → O_O_B-LOC_O_B-LOC_I-LOC
• Multi-class classification with hundreds or thousands of classes?
• Lack of generalization.
• Some ML algorithms cannot cope very well: SVMs, rule learners, decision trees.
• Others can: naive Bayes, maximum entropy, memory-based learning.

**Solution 3: n-gram subsequences**

• Retain the windowing approach, but predict overlapping n-grams of output tokens.

**Resolving overlapping n-grams**

• If probabilities are available: Viterbi.
• Other options: voting.

**N-gram + voting disadvantages**

• The classifier predicts syntactically valid trigrams, but after resolving overlap, error correction is only local.
• The end result is still a concatenation of local, uncoordinated decisions.
• The number of classes increases (problematic for some ML algorithms).

**Learning linguistic sequences**

Talk overview:
• How to map sequences to sequences, not output tokens?
• Case studies: syntactic and semantic chunking
• Discrete versus probabilistic classifiers
• Constraint satisfaction inference
• Discussion

**Four "chunking" tasks**

• English base-phrase chunking (CoNLL-2000, WSJ)
• English named-entity recognition (CoNLL-2003, Reuters)
• Dutch medical concept chunking (IMIX/Rolaquad, medical encyclopedia)
• English protein-related entity chunking (Genia, Medline abstracts)

**Treated the same way**

• IOB-tagging.
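The IOB tagging scheme just mentioned can be sketched as a small encoder. This is a minimal illustration assuming the IOB1 variant visible in the examples later in the talk, where `B-` is only used when a chunk directly follows another chunk of the same type:

```python
def chunks_to_iob(chunks):
    """Encode bracketed chunks as per-token IOB tags: I-X inside a
    chunk of type X, B-X when a chunk directly follows another chunk
    of the same type, O outside any chunk (the IOB1 variant).

    `chunks` is a list of (words, chunk_type) pairs; chunk_type is
    None for tokens outside any chunk."""
    tags = []
    prev_type = None
    for words, ctype in chunks:
        for i, _ in enumerate(words):
            if ctype is None:
                tags.append("O")
            elif i == 0 and ctype == prev_type:
                tags.append("B-" + ctype)
            else:
                tags.append("I-" + ctype)
        prev_type = ctype
    return tags
```

Applied to the flattened tree shown below ([Once]ADVP [he]NP [was held]VP …), this reproduces the classes in the instance table.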
• Windowing: 3-1-3 words; 3-1-3 predicted PoS tags (WSJ / Wotan).
• No seed lists, suffix/prefix features, capitalization, …
• Memory-based learning and maximum-entropy modeling.
• MBL: automatic parameter optimization (paramsearch; Van den Bosch, 2004).

**IOB codes for chunks: step 1, PTB-II WSJ**

((S (ADVP-TMP Once) (NP-SBJ-1 he) (VP was (VP held (NP *-1) (PP-TMP for (NP three months)) (PP without (S-NOM (NP-SBJ *-1) (VP being (VP charged) ))))) .))

**IOB codes for chunks: flatten the tree**

[Once]ADVP [he]NP [was held]VP [for]PP [three months]NP [without]PP [being charged]VP

**Example: Instances**

| feature 1 (word −1) | feature 2 (word 0) | feature 3 (word +1) | class |
|---|---|---|---|
| _ | Once | he | I-ADVP |
| Once | he | was | I-NP |
| he | was | held | I-VP |
| was | held | for | I-VP |
| held | for | three | I-PP |
| for | three | months | I-NP |
| three | months | without | I-NP |
| months | without | being | I-PP |
| without | being | charged | I-VP |
| being | charged | . | I-VP |
| charged | . | _ | O |

**MBL**

• Memory-based learning: a k-NN classifier (Fix and Hodges, 1951; Cover and Hart, 1967; Aha et al., 1991; Daelemans et al.).
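The windowed instances shown in the table above can be generated mechanically. A minimal sketch, using width 1 for readability where the talk's experiments use 3-1-3 windows:

```python
def make_instances(tokens, tags, width=1):
    """Turn a tagged sentence into fixed-width windowed instances:
    each instance is the focus word plus `width` words of left and
    right context, with the focus word's IOB tag as the class.
    '_' pads the sentence edges."""
    pad = ["_"] * width
    padded = pad + list(tokens) + pad
    return [(tuple(padded[i:i + 2 * width + 1]), tag)
            for i, tag in enumerate(tags)]
```

With `width=3` this yields the 3-1-3 word windows used in the experiments; the PoS-tag features would be built the same way from the predicted tag sequence.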
• Discrete, point-wise classifier.
• Implementation used: TiMBL (Tilburg Memory-Based Learner).

**Memory-based learning and classification**

• Learning: store instances in memory.
• Classification: given a new test instance X,
• compare it to all memory instances,
• compute a distance between X and each memory instance Y,
• update the top k of closest instances (nearest neighbors),
• when done, take the majority class of the k nearest neighbors as the class of X.

**Similarity / distance**

• A nearest neighbor has the smallest distance, or the largest similarity, computed with a distance function.
• TiMBL offers two basic distance functions: Overlap and MVDM (Stanfill & Waltz, 1986; Cost & Salzberg, 1989).
• Further options: feature weighting, exemplar weighting, distance-weighted class voting.

**The Overlap distance function**

• "Count the number of mismatching features."

**The MVDM distance function**

• Estimate a numeric "distance" between pairs of values:
• "e" is more like "i" than like "p" in a phonetic task;
• "book" is more like "document" than like "the" in a parsing task.

**Feature weighting**

• Some features are more important than others.
• TiMBL metrics: Information Gain, Gain Ratio, Chi Square, Shared Variance.
• Example, IG:
• compute the database entropy;
• for each feature, partition the database on all values of that feature;
• for each value, compute the sub-database entropy;
• take the weighted average entropy over all partitioned sub-databases;
• the difference between the "partitioned" entropy and the overall entropy is the feature's Information Gain.

**Feature weighting in the distance function**

• Mismatching on a more important feature gives a larger distance.
• Factor in the distance function: Δ(X, Y) = Σᵢ wᵢ · δ(xᵢ, yᵢ), with δ(xᵢ, yᵢ) = 1 on a mismatch and 0 otherwise.

**Distance weighting**

• There is a relation between larger k and smoothing.
• A subtle extension: make more distant neighbors count less in the class vote.
• Options: linear inverse of distance (w.r.t. the maximum), inverse of distance, exponential decay.

**Current practice**

• Default TiMBL settings: k=1, Overlap, GR, no distance weighting; these work well for some morpho-phonological tasks.
• Rules of thumb: combine MVDM with a bigger k; combine distance weighting with a bigger k.
• A very good bet, especially for sentence- and text-level tasks: higher k, MVDM, GR, distance weighting.

**Base phrase chunking**

• 211,727 training examples, 47,377 test examples; 22 classes.
• [He]NP [reckons]VP [the current account deficit]NP [will narrow]VP [to]PP [only $ 1.8 billion]NP [in]PP [September]NP .

**Named entity recognition**

• 203,621 training examples, 46,435 test examples; 8 classes.
• [U.N.]organization official [Ekeus]person heads for [Baghdad]location

**Medical concept chunking**

• 428,502 training examples, 47,430 test examples; 24 classes.
• Bij [infantiel botulisme]disease kunnen in extreme gevallen [ademhalingsproblemen]symptom en [algehele lusteloosheid]symptom optreden. (Dutch: "In infantile botulism, respiratory problems and general listlessness can occur in extreme cases.")

**Protein-related concept chunking**

• 458,593 training examples, 50,916 test examples; 51 classes.
• Most hybrids express both [KBF1]protein and [NF-kappa B]protein in their nuclei , but one hybrid expresses only [KBF1]protein .

**Learning linguistic sequences**

Talk overview:
• How to map sequences to sequences, not output tokens?
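Before moving on: the memory-based classification loop described earlier (weighted Overlap distance, majority vote over the k nearest neighbors) can be sketched as follows. This is a simplified illustration, not TiMBL itself, and the feature weights would in practice come from a metric such as Information Gain:

```python
from collections import Counter

def overlap_distance(x, y, weights):
    """Weighted Overlap distance: sum the weights of the features
    on which the two instances mismatch."""
    return sum(w for a, b, w in zip(x, y, weights) if a != b)

def knn_classify(instance, memory, weights, k=1):
    """Memory-based classification: sort stored (features, class)
    pairs by distance to the test instance and take the majority
    class among the k nearest neighbors."""
    nearest = sorted(
        memory,
        key=lambda m: overlap_distance(instance, m[0], weights))[:k]
    return Counter(cls for _, cls in nearest).most_common(1)[0][0]
```

Distance-weighted voting would replace the plain `Counter` with votes scaled by, e.g., inverse distance or exponential decay.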
• Case studies: syntactic and semantic chunking
• Discrete versus probabilistic classifiers
• Constraint satisfaction inference
• Discussion

**Comparative study**

• Base discrete classifier: a maximum-entropy model (Zhang Le's maxent).
• Extended with feedback, stacking, trigrams, and combinations thereof.
• Compared against conditional Markov models (Ratnaparkhi, 1996), maximum-entropy Markov models (McCallum, Freitag, and Pereira, 2000), and conditional random fields (Lafferty, McCallum, and Pereira, 2001).
• On the medical and protein chunking tasks.

**Maximum entropy**

• Probabilistic model: the conditional distribution p(C|x) (a probability matrix between classes and values) with maximal entropy H(p).
• Given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible.
• Maximize the entropy in the matrix through an iterative process: IIS or GIS (Improved/Generalized Iterative Scaling), or L-BFGS.
• Discretized!

**Conditional Markov Models**

• The probabilistic analogue of feedback.
• Processes from left to right.
• Produces conditional probabilities, conditioned among other things on the previous classification, limited by a beam search.
• With beam = 1, equal to feedback.
• Can be trained with maximum entropy, e.g. MXPOST (Ratnaparkhi, 1996).
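The conditional Markov model's decoding described above can be sketched as a left-to-right beam search. Here `cond_prob` is a stand-in for any trained conditional model (e.g. a maximum-entropy classifier), and with `beam=1` the search reduces to the greedy feedback approach:

```python
import math

def beam_decode(tokens, labels, cond_prob, beam=3):
    """Left-to-right beam search for a conditional Markov model:
    keep only the `beam` most probable partial label sequences.
    cond_prob(token, prev_label, label) returns p(label | token, prev)."""
    hyps = [(0.0, [])]  # (log-probability, label sequence so far)
    for token in tokens:
        expanded = []
        for logp, seq in hyps:
            prev = seq[-1] if seq else "<START>"
            for label in labels:
                p = cond_prob(token, prev, label)
                if p > 0:
                    expanded.append((logp + math.log(p), seq + [label]))
        hyps = sorted(expanded, key=lambda h: h[0], reverse=True)[:beam]
    return hyps[0][1]
```

A real CMM would condition on a richer windowed context than just the current token and previous label; this sketch only shows the search.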