LING 696B: Maximum-Entropy and Random Fields


### LING 696B: Maximum-Entropy and Random Fields

Review: two worlds
• Statistical models and OT seem to ask different questions about learning
• UG: what is possible/impossible?
• Hard-coded generalizations
• Combinatorial optimization (sorting)
• Statistical: among the things that are possible, what is likely/unlikely?
• Soft-coded generalizations
• Numerical optimization
• Marriage of the two?
Review: two worlds
• OT: relate possible/impossible patterns in different languages through constraint reranking
• Stochastic OT: consider a distribution over all possible grammars to generate variation
• Today: model frequency of input/output pairs (among the possible) directly using a powerful model
Maximum entropy and OT
• Imaginary data:
• Stochastic OT: let *[+voice]>>Ident(voice) and Ident(voice)>>*[+voice] 50% of the time each
• Maximum-Entropy (using positive weights):
p([bab]|/bap/) ~ (1/Z) exp{-(2*w1)}
p([pap]|/bap/) ~ (1/Z) exp{-(w2)}
Maximum entropy
• Why have Z?
• Need to be a conditional distribution: p([bab]|/bap/) + p([pap]|/bap/) = 1
• So Z = exp{-(2*w1)} + exp{-(w2)} (same for all candidates) -- called a normalization constant
• Z can quickly become difficult to compute, when number of candidates is large
• A very similar proposal appears in Smolensky (1986)
• How to get w1, w2?
• Learned from data by calculating gradients (a minimal sketch follows below)
• Need: frequency counts, violation vectors (same as stochastic OT)
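Below is a minimal sketch of this gradient calculation for the two-candidate example above. The output counts (30 vs. 70) and the learning rate are invented purely for illustration, and the weights are left unconstrained in sign here even though the slides assume positive weights:

```python
import numpy as np

# Violation vectors from the slides, columns = (*[+voice], Ident(voice)):
# /bap/ -> [bab] violates *[+voice] twice; /bap/ -> [pap] violates Ident(voice) once.
violations = np.array([[2.0, 0.0],   # [bab]
                       [0.0, 1.0]])  # [pap]

# Hypothetical observed output frequencies (assumed, not from the slides).
counts = np.array([30.0, 70.0])

def candidate_probs(w):
    """p(candidate | /bap/) = exp(-violations . w) / Z."""
    scores = np.exp(-violations @ w)
    return scores / scores.sum()      # dividing by the sum is the 1/Z normalization

# Gradient ascent on the conditional log-likelihood of the counts.
w = np.zeros(2)
for _ in range(5000):
    p = candidate_probs(w)
    # dlogL/dw = -(observed total violations - expected total violations under the model)
    grad = -(counts @ violations - counts.sum() * (p @ violations))
    w += 0.01 * grad

print(w)                    # learned weights w1, w2
print(candidate_probs(w))   # fitted probabilities, approximately (0.3, 0.7)
```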
Maximum entropy
• Why do exp{.}?
• It’s like taking a maximum, but “soft” -- easy to differentiate and optimize (small illustration below)
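A tiny illustration of the “soft maximum” idea, with arbitrary made-up scores:

```python
import numpy as np

scores = np.array([3.0, 1.0, 0.0])             # arbitrary candidate scores
soft = np.exp(scores) / np.exp(scores).sum()   # the "soft" maximum
print(soft)  # ~[0.84, 0.11, 0.04]: most mass goes to the best candidate,
             # but the map stays smooth and differentiable, unlike a hard argmax
```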
Maximum entropy and OT
• Inputs are violation vectors: e.g. x=(2,0) and (0,1)
• Outputs are one of K winners -- essentially a classification problem
• Violating a constraint works against the candidate: prob ~ exp{-(x1*w1 + x2*w2)}
• Crucial difference: candidates are ordered by a single weighted score, not by lexicographic order (contrast sketched below)
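A small sketch of this contrast, using the violation vectors from the slides; the comparison functions and the particular weights are illustrative assumptions, not part of the original:

```python
# Candidates for /bap/ with violation vectors (*[+voice], Ident(voice)).
candidates = {"bab": (2, 0), "pap": (0, 1)}

def ot_winner(cands, ranking):
    """OT: compare violation vectors constraint by constraint, in ranked (lexicographic) order."""
    return min(cands, key=lambda c: tuple(cands[c][k] for k in ranking))

def harmony_winner(cands, weights):
    """Max-Ent style: a single weighted sum of violations decides."""
    return min(cands, key=lambda c: sum(w * v for w, v in zip(weights, cands[c])))

print(ot_winner(candidates, ranking=(0, 1)))           # *[+voice] >> Ident(voice): 'pap'
print(ot_winner(candidates, ranking=(1, 0)))           # Ident(voice) >> *[+voice]: 'bab'
print(harmony_winner(candidates, weights=(2.0, 3.0)))  # 'pap': Ident has the larger weight,
# yet two violations of the lower-weighted *[+voice] add up and decide the outcome --
# a cumulative effect that a strict lexicographic ranking never produces.
```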
Maximum entropy
• Ordering discrete outputs from input vectors is a common problem:
• Also called Logistic Regression (recall Nearey)
• Explaining the name:
• Let P = p([bab]|/bap/); then log[P/(1-P)] = w2 - 2*w1 (derivation below)
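Filling in the algebra, using the two candidate probabilities from the earlier slide:

$$
P = \frac{e^{-2 w_1}}{Z}, \qquad 1 - P = \frac{e^{-w_2}}{Z}
\quad\Longrightarrow\quad
\log\frac{P}{1-P} = \log\frac{e^{-2 w_1}}{e^{-w_2}} = w_2 - 2 w_1
$$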

[Figures: the logistic transform; linear regression]

The power of Maximum Entropy
• Max-Ent/logistic regression is widely used in many areas with interacting, correlated inputs
• Recall Nearey: phones, diphones, …
• NLP: tagging, labeling, parsing … (anything with a discrete output)
• Easy to learn: the objective has only a global maximum, and optimization is efficient
• Isn’t this the greatest thing in the world?
• Need to understand the story behind the exp{} (in a few minutes)
Demo: Spanish diminutives
• Data from Arbisi-Kelm
• Constraints: ALIGN(TE,Word,R), MAX-OO(V), DEP-IO and BaseTooLittle
Stochastic OT and Max-Ent
• Is better fit always a good thing?
Stochastic OT and Max-Ent
• Is better fit always a good thing?
• Should model-fitting become a new fashion in phonology?
The crucial difference
• What are the possible distributions of p(.|/bap/) in this case?
The crucial difference
• What are the possible distributions of p(.|/bap/) in this case?
• Max-Ent considers a much wider range of distributions
What is Maximum Entropy anyway?
• Jaynes (1957): the most ignorant state corresponds to the distribution with the most entropy
• Given a die, which distribution has the largest entropy?
What is Maximum Entropy anyway?
• Jaynes (1957): the most ignorant state corresponds to the distribution with the most entropy
• Given a die, which distribution has the largest entropy? (The uniform one.)
• Add constraints to distributions: the average of some feature functions is assumed to be fixed at its observed value: E_p[fk(x)] = observed value of fk in the data

What is Maximum Entropy anyway?
• Examples of features: violations, word counts, N-grams, co-occurrences, …
• The constraints change the shape of the maximum entropy distribution
• Solve constrained optimization problem
• This leads to p(x) ~ exp{Σk wk*fk(x)} (a small die sketch follows below)
• Very general (see later), many choices of fk
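A minimal sketch of the die example: with no constraints the maximum-entropy distribution is uniform, and fixing the average number of pips (the target of 4.5 and the simple update rule are illustrative assumptions) pushes it into the exponential form p(x) ~ exp{w*f(x)}:

```python
import numpy as np

faces = np.arange(1, 7)    # f(x) = number of pips on face x
target_mean = 4.5          # assumed "observed value" of the feature average

# Find the weight w such that E_p[f] = target_mean, where p(x) ~ exp(w * f(x)).
# With no constraint (w = 0) this is the uniform, maximum-entropy distribution.
w = 0.0
for _ in range(5000):
    p = np.exp(w * faces)
    p /= p.sum()
    w += 0.1 * (target_mean - p @ faces)   # nudge w until the model mean matches

print(p)            # skewed toward the high faces, but otherwise as flat as possible
print(p @ faces)    # ~4.5
```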
The basic intuition
• Begin as “ignorant” as possible (with maximum entropy), so long as the chosen distribution matches certain “descriptions” of the empirical data (the statistics of the fk(x))
• Approximation property: any distribution can be approximated by a max-ent distribution with a sufficient number of features (Cramér–Wold)
• Common practice in NLP
• This is better seen as a “descriptive” model
Going towards Markov random fields
• Maximum entropy applied to a conditional/joint distribution: p(y|x) or p(x,y) ~ exp{Σk wk*fk(x,y)}
• There can be many creative ways of extracting features fk(x,y)
• One way is to let a graph structure guide the calculation of features. E.g. neighborhood/clique
• Known as Markov network/random field
Conditional random field
• Impose a chain-structured graph, and assign features to edges
• Still a max-ent, same calculation (a toy sketch follows the diagram below)

[Diagram: a chain-structured graph with edge features m(yi, yi+1) and node features f(xi, yi)]
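A minimal sketch of this “same calculation” on a toy chain; the binary labels, the agreement-style features, and the weights are all invented for illustration:

```python
import numpy as np
from itertools import product

# Toy chain CRF over binary labels:
#   p(y | x) ~ exp{ sum_i w_f * f(x_i, y_i) + sum_i w_m * m(y_i, y_{i+1}) }
def f(x_i, y_i):        # node feature: does the output label copy the input?
    return 1.0 if y_i == x_i else 0.0

def m(y_i, y_next):     # edge feature: do adjacent output labels agree?
    return 1.0 if y_i == y_next else 0.0

def score(x, y, w_f=1.0, w_m=0.5):
    node = sum(f(xi, yi) for xi, yi in zip(x, y))
    edge = sum(m(yi, yj) for yi, yj in zip(y, y[1:]))
    return w_f * node + w_m * edge

x = (0, 1, 1, 0)
# Z sums exp(score) over every label sequence: the same max-ent normalization,
# just over a much larger candidate set (here brute-forced for a length-4 chain).
Z = sum(np.exp(score(x, y)) for y in product((0, 1), repeat=len(x)))
y = (0, 1, 1, 1)
print(np.exp(score(x, y)) / Z)   # p(y | x) for one candidate labelling
```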

Wilson’s idea
• Isn’t this a familiar picture in phonology?

[Diagram: the same chain read phonologically -- edge features m(yi, yi+1) as Markedness over the surface form, node features f(xi, yi) as Faithfulness linking the underlying form to the surface form]

The story of smoothing
• In Max-Ent models, the weights can get very large and “over-fit” the data (see demo)
• Common to penalize (smooth) this with a new objective function (a sketch follows below):
new objective = old objective + parameter * magnitude of weights
• Wilson’s claim: this smoothing parameter has to do with substantive bias in phonological learning
• Constraints that force less similarity --> a higher penalty for their values to change
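A minimal sketch of such a penalized objective, reusing the earlier two-candidate example; the Gaussian-prior (L2) form of the penalty and the sigma2 values are illustrative assumptions:

```python
import numpy as np

# Same toy data as before: violation vectors and hypothetical output counts.
violations = np.array([[2.0, 0.0],   # [bab]
                       [0.0, 1.0]])  # [pap]
counts = np.array([30.0, 70.0])

def fit(sigma2, steps=5000, lr=0.01):
    """Maximize logL - ||w||^2 / (2 * sigma2); smaller sigma2 = heavier smoothing."""
    w = np.zeros(2)
    for _ in range(steps):
        p = np.exp(-violations @ w)
        p /= p.sum()
        grad_ll = -(counts @ violations - counts.sum() * (p @ violations))
        w += lr * (grad_ll - w / sigma2)   # likelihood gradient plus penalty gradient
    return w

print(fit(sigma2=100.0))   # weak smoothing: weights close to the unpenalized fit
print(fit(sigma2=0.1))     # strong smoothing: weights shrink toward zero
```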