LING 696B: Maximum-Entropy and Random Fields
Review: two worlds
  • Statistical model and OT seem to ask different questions about learning
  • UG: what is possible/impossible?
    • Hard-coded generalizations
    • Combinatorial optimization (sorting)
  • Statistical: among the things that are possible, what is likely/unlikely?
    • Soft-coded generalizations
    • Numerical optimization
  • Marriage of the two?
Review: two worlds
  • OT: relate possible/impossible patterns in different languages through constraint reranking
  • Stochastic OT: consider a distribution over all possible grammars to generate variation
  • Today: model frequency of input/output pairs (among the possible) directly using a powerful model
Maximum entropy and OT
  • Imaginary data:
    • Stochastic OT: let *[+voice]>>Ident(voice) and Ident(voice)>>*[+voice] 50% of the time each
    • Maximum-Entropy (using positive weights):
      p([bab]|/bap/) = (1/Z) exp{-(2*w1)}
      p([pap]|/bap/) = (1/Z) exp{-(w2)}
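A minimal sketch (in Python) of how these two candidate probabilities are computed from the violation counts; the weights w1 = 2 and w2 = 1 are made-up illustrative values, not fitted ones:

```python
import numpy as np

# Hypothetical constraint weights: w1 for *[+voice], w2 for Ident(voice)
w = np.array([2.0, 1.0])

# Violation vectors for the two candidates of /bap/:
#   [bab] violates *[+voice] twice; [pap] violates Ident(voice) once
violations = {
    "[bab]": np.array([2, 0]),
    "[pap]": np.array([0, 1]),
}

# Unnormalized scores exp{-(violations . weights)}, then normalize by Z
scores = {cand: np.exp(-(v @ w)) for cand, v in violations.items()}
Z = sum(scores.values())                      # the normalization constant
probs = {cand: s / Z for cand, s in scores.items()}
print(probs)   # {'[bab]': ~0.047, '[pap]': ~0.953}
```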
Maximum entropy
  • Why have Z?
    • The model needs to be a conditional distribution: p([bab]|/bap/) + p([pap]|/bap/) = 1
    • So Z = exp{-(2*w1)} + exp{-(w2)} (the same for all candidates) -- called a normalization constant
    • Z can quickly become difficult to compute when the number of candidates is large
    • A very similar proposal appears in Smolensky (1986)
  • How to get w1, w2?
    • Learned from data (by calculating gradients)
    • Need: frequency counts and violation vectors (same as stochastic OT)
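A sketch of that gradient computation, assuming made-up frequency counts (10 [bab] vs. 90 [pap]) and plain gradient ascent on the conditional log-likelihood; the gradient is the expected violation counts under the model minus the observed average (the weights are left unconstrained here, so one of them may go negative):

```python
import numpy as np

violations = np.array([[2, 0],    # [bab]: violations of *[+voice], Ident(voice)
                       [0, 1]])   # [pap]
counts = np.array([10.0, 90.0])   # hypothetical observed frequencies

w = np.zeros(2)                   # initial weights
lr = 0.1                          # learning rate

for _ in range(2000):
    p = np.exp(-(violations @ w))
    p /= p.sum()                                  # model probabilities
    observed = counts @ violations / counts.sum() # average observed violations
    expected = p @ violations                     # expected violations under the model
    w += lr * (expected - observed)               # gradient ascent on the log-likelihood

p = np.exp(-(violations @ w))
p /= p.sum()
print(w, p)   # fitted probabilities approach the observed 0.1 / 0.9 split
```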
Maximum entropy
  • Why do exp{.}?
    • It’s like taking a maximum, but “soft” -- easy to differentiate and optimize
Maximum entropy and OT
  • Inputs are violation vectors: e.g. x=(2,0) and (0,1)
  • Outputs are one of K winners -- essentially a classification problem
  • Violating a constraint works against the candidate (prob ~ exp{-(x1*w1 + x2*w2)})
  • Crucial difference: candidates are ordered by a single score, not by lexicographic order
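A toy contrast (hypothetical violation profiles and weights) between lexicographic comparison, as in classic OT, and ordering by a single weighted score:

```python
# Candidate violation profiles over two constraints ranked C1 >> C2
cand_a = (1, 0)    # one violation of the top-ranked constraint
cand_b = (0, 5)    # five violations of the lower-ranked constraint

# Classic OT: compare violation vectors lexicographically
ot_winner = min(cand_a, cand_b)                # Python tuple comparison is lexicographic
print("OT winner:", ot_winner)                 # (0, 5): C1 strictly dominates C2

# Single-score ordering (harmony / Max-Ent style), with hypothetical weights
w = (2.0, 1.0)
score = lambda v: sum(wi * vi for wi, vi in zip(w, v))
print("Weighted winner:", min((cand_a, cand_b), key=score))  # (1, 0): violations can gang up
```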
Maximum entropy
  • Ordering discrete outputs from input vectors is a common problem:
    • Also called Logistic Regression (recall Nearey)
  • Explaining the name:
    • Let P = p([bab]|/bap/); then log[P/(1-P)] = w2 - 2*w1
    • The left-hand side is the logistic (logit) transform; the right-hand side is linear in the weights, as in linear regression

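A quick numeric check (with the same hypothetical weights as above) that the two-candidate Max-Ent model is exactly this logistic/linear relation:

```python
import math

w1, w2 = 2.0, 1.0                      # hypothetical constraint weights

# Max-Ent probabilities for the two candidates of /bap/
num_bab = math.exp(-2 * w1)
num_pap = math.exp(-w2)
P = num_bab / (num_bab + num_pap)      # P = p([bab] | /bap/)

# The log-odds (logit) of [bab] equals the linear expression w2 - 2*w1
print(math.log(P / (1 - P)))           # -3.0
print(w2 - 2 * w1)                     # -3.0
```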
The power of Maximum Entropy
  • Max-Ent/logistic regression is widely used in many areas with interacting, correlated inputs
    • Recall Nearey: phones, diphones, …
    • NLP: tagging, labeling, parsing … (anything with a discrete output)
  • Easy to learn: the objective has only a global maximum, so optimization is efficient
  • Isn’t this the greatest thing in the world?
    • Need to understand the story behind the exp{} (in a few minutes)
Demo: Spanish diminutives
  • Data from Arbisi-Kelm
    • Constraints: ALIGN(TE,Word,R), MAX-OO(V), DEP-IO and BaseTooLittle
Stochastic OT and Max-Ent
  • Is better fit always a good thing?
  • Should model-fitting become a new fashion in phonology?
The crucial difference
  • What are the possible distributions of p(.|/bap/) in this case?
    • Max-Ent considers a much wider range of distributions
What is Maximum Entropy anyway?
  • Jaynes (1957): the most ignorant state corresponds to the distribution with the most entropy
  • Given a die, which distribution has the largest entropy?
  • Add constraints to distributions: the average of certain feature functions is assumed to be fixed, i.e. the model's expectation of each fk must equal its observed value (the empirical average of fk)
What is Maximum Entropy anyway?
  • Example of features: violations, word counts, N-grams, co-occurrences, …
  • The constraints change the shape of the maximum entropy distribution
    • Solve constrained optimization problem
  • This leads to p(x) ~ exp{Σk wk*fk(x)}
    • Very general (see later), many choices of fk
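A small sketch of the constrained die example: among all distributions over the six faces whose mean matches an assumed observed value (4.5 here, a made-up number), the maximum-entropy solution takes exactly the exponential form above, with one weight tuned (by bisection) until the constraint is satisfied:

```python
import numpy as np

faces = np.arange(1, 7)
target_mean = 4.5                       # assumed observed average roll

def mean_under(weight):
    """Mean of the max-ent distribution p(x) ~ exp(weight * x)."""
    p = np.exp(weight * faces)
    p /= p.sum()
    return p @ faces

# Bisection on the single weight (Lagrange multiplier) for the mean constraint
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2.0
    lo, hi = (mid, hi) if mean_under(mid) < target_mean else (lo, mid)

p = np.exp(lo * faces)
p /= p.sum()
print(p)            # skewed toward high faces, but otherwise as flat as possible
print(p @ faces)    # ~4.5; with target_mean = 3.5 the answer would be the uniform die
```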
The basic intuition
  • Stay as “ignorant” as possible (maximum entropy), as long as the chosen distribution matches certain “descriptions” of the empirical data (the statistics of the fk(x))
  • Approximation property: any distribution can be approximated by a max-ent distribution with a sufficient number of features (Cramér and Wold)
    • Common practice in NLP
  • This is better seen as a “descriptive” model
Going towards Markov random fields
  • Maximum entropy applied to a conditional/joint distribution: p(y|x) or p(x,y) ~ exp{Σk wk*fk(x,y)}
  • There can be many creative ways of extracting features fk(x,y)
    • One way is to let a graph structure guide the calculation of features. E.g. neighborhood/clique
    • Known as Markov network/random field
Conditional random field
  • Impose a chain-structured graph, and assign features to edges
    • Still a max-ent, same calculation

    • [Chain diagram: transition features m(yi, yi+1) on edges between adjacent labels, and emission features f(xi, yi) linking each input xi to its label yi]

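A minimal linear-chain sketch with hypothetical (random) emission scores f(xi, yi) and transition scores m(yi, yi+1); the forward recursion computes the normalizer Z, which is again just a max-ent normalization, here over all label sequences:

```python
import numpy as np
from itertools import product

K, T = 3, 4                             # number of labels, sequence length
rng = np.random.default_rng(0)
f = rng.normal(size=(T, K))             # emission scores f(x_i, y_i) for one fixed input
m = rng.normal(size=(K, K))             # transition scores m(y_i, y_{i+1})

def log_score(y):
    """Unnormalized log-score of one label sequence y."""
    return sum(f[i, y[i]] for i in range(T)) + sum(m[y[i], y[i + 1]] for i in range(T - 1))

# Forward algorithm: log Z over all K**T label sequences, in O(T * K^2)
alpha = f[0].copy()
for i in range(1, T):
    alpha = f[i] + np.logaddexp.reduce(alpha[:, None] + m, axis=0)
log_Z = np.logaddexp.reduce(alpha)

# Sanity check: p(y|x) = exp(log_score(y) - log_Z) sums to 1 over all sequences
total = sum(np.exp(log_score(y) - log_Z) for y in product(range(K), repeat=T))
print(total)   # ~1.0
```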
Wilson’s idea
  • Isn’t this a familiar picture in phonology?

    • [Diagram: the same chain read phonologically -- m(yi, yi+1) as Markedness constraints between adjacent surface positions, and f(xi, yi) as Faithfulness constraints linking each underlying form xi to its surface form yi]

The story of smoothing
  • In Max-Ent models, the weights can get very large and “over-fit” the data (see demo)
  • Common to penalize (smooth) this with a new objective function: new objective = old (log-likelihood) objective - smoothing parameter * magnitude of the weights
  • Wilson’s claim: this smoothing parameter has to do with substantive bias in phonological learning
    • Constraints that force less similarity --> a higher penalty for their weights to change value
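A sketch of the smoothed objective, reusing the hypothetical counts from the earlier gradient example and adding a squared-weight (L2 / Gaussian-prior) penalty, one common choice; the penalty pulls the weights back toward zero, so the fit to the data is deliberately less exact:

```python
import numpy as np

violations = np.array([[2, 0],    # [bab]
                       [0, 1]])   # [pap]
counts = np.array([10.0, 90.0])   # hypothetical observed frequencies
lam = 0.5                         # smoothing parameter (made up; larger = stronger penalty)

w = np.zeros(2)
lr = 0.1
for _ in range(2000):
    p = np.exp(-(violations @ w))
    p /= p.sum()
    observed = counts @ violations / counts.sum()
    expected = p @ violations
    # Gradient of: log-likelihood - lam * ||w||^2
    w += lr * ((expected - observed) - 2 * lam * w)

p = np.exp(-(violations @ w))
p /= p.sum()
print(w, p)   # smaller weights than the unsmoothed fit; probabilities no longer hit 0.1/0.9 exactly
```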