1. Attacking the Data Sparseness Problem Team: Louise Guthrie, Roberto Basili, Fabio Zanzotto, Hamish Cunningham, Kalina Bontcheva, Jia Cui, Klaus Macherey, David Guthrie, Martin Holub, Marco Cammisa, Cassia Martin, Jerry Liu, Kris Haralambiev, Fred Jelinek
2. Texts for text extraction contain sentences like:
The IRA bombed a family owned shop in Belfast yesterday.
FMLN set off a series of explosions in central Bogota today.
Motivation for the project
3. We’d like to automatically recognize that both are of the form:
The IRA bombed a family owned shop in Belfast yesterday.
FMLN set off a series of explosions in central Bogota today.
Motivation for the project
4. Our Hypotheses A transformation of a corpus to replace words and phrases with coarse semantic categories will help overcome the data sparseness problem encountered in language modeling and text extraction.
Semantic category information might also help improve machine translation
A noun-centric approach initially will allow bootstrapping for other syntactic categories
5. A six-week goal – Labeling noun phrases
Astronauts aboard the space shuttle Endeavor were forced to dodge a derelict Air Force satellite Friday
Humans aboard space_vehicle dodge satellite timeref.
6. Preparing the data – Pre-Workshop: Identify a tag set
Create a Human annotated corpus
Create a double annotated corpus
Process all data for named entity and noun phrase recognition using GATE Tools (26 million words)
Parsed about 26 million words
Develop algorithms for mapping target categories to Wordnet synsets to support the tag set assessment
7. The Semantic Classes and the Corpus A subset of the classes available in the electronic version of the Longman Dictionary of Contemporary English (LDOCE)
Rationale:
The number of semantic classes was small
The classes are somewhat reliable since they were used by a team of lexicographers to code noun senses, adjective preferences, and verb preferences
Many words have subject area information which might be useful
8. The Semantic Classes
9. The Semantic Classes
10. The Semantic Classes
11. The Semantic Classes
12. The human annotated statistics Inter-annotator agreement is 94%, so that is effectively the upper bound for our task.
214,446 total annotated noun phrases (262,683 including “None of the Above”)
29,071 unique vocabulary items (Unlemmatized)
25 semantic categories (162 associated subject areas were identified)
127,569 (59%) were annotated with the semantic category Abstract
13. The experimental setup
14. The main development set (dev)
15. A challenging development set for experiments on unseen words (Hard data set)
16. Our Experiments include: Supervised Approaches (Learning from Human Annotated data)
Unsupervised approaches
Using outside evidence (the dictionary or wordnet)
Syntactic information from parsing or pattern matching
Context words, the use of preferences, the use of topical information
17. Experiments on unseen words - Hard data set
Training corpus has only words with unambiguous annotations
125,000 training instances
73,000 instances held-out
Perplexity – 21
Baseline – Accuracy 45%
Improvement – Accuracy 68.5 %
Context can contribute greatly in unsupervised experiments
18. Results on the dev set
Random split, with some frequent ambiguous words moved into testing
113,000 training instances
85,000 instances held-out
Perplexity – 3.44
Baseline – Accuracy 80%
Improvement – Accuracy 87 %
19. The scheme for annotating the large corpus After experimenting with the development sets, we need a scheme for making use of all of the dev corpus to tag the blind corpus.
We developed an incremental scheme within the maximum entropy framework
Several of the talks concern re-estimation techniques useful to the bootstrapping process.
20. Terminology Seen words – words seen in the human annotated data (new instances of known words)
Unseen words – not in the training material but in the dictionary
Novel words – not in the training material nor in the dictionary/Wordnet
21. Bootstrapping
22. The Unannotated Data – Four types
23. The Unannotated Data – Four types
24. The Unannotated Data – Four types
25. The Unannotated Data – Four types
27. Results on the Blind Data
We set aside one tenth of the annotated corpus
Randomly selected within each of the domains
It contained 13,000 annotated instances
The baseline here was very high - 90% with simple techniques
We were able to achieve 93.5% accuracy
28. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
29. Semantic Categories and MT 10 test words – high, medium, and low frequency
Collected their target translations using EuroWordNet (e.g. Dutch)
Crane:
[lifts and moves heavy objects] – hijskraan, kraan
[large long-necked wading bird] - kraanvogel
30. SemCats and MT (2) Manually mapped synonym sets to semantic categories
An automatic mapping will be presented later
Studied how many synonym sets are ruled out as translations by the semantic category
31. Some Results 3 words – full disambiguation
crane (Mov.Solid/Animal), medicine (Abstract/Liquid), plant (Plant/Solid)
7 words – the categories substantially reduce the possible translations
club - [Abstr/an association of people...], [Mov.Solid/stout stick...], [Mov.Solid/ an implement used by a golfer...], [Mov.Solid/a playing card...], [NonMov.Solid/a building …]
club/NonMov.Solid – [clubgebouw, clubhuis, …]
club/Abstr. – [bevolkingsgroep, broederschap, …]
club/Mov.Solid – [knots, kolf, malie], [kolf, malie], [club]
32. The architecture The “multiple-knowledge sources” WSD architecture (Stevenson 03)
Allow use of multiple taggers and combine their results through a weighted function
Weights can be learned from a corpus
All taggers implemented as GATE components and combined in applications
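A minimal sketch of such a weighted combination is below; the tagger interface (callables returning per-category scores) and the example weights are illustrative assumptions, not the actual GATE component API.

```python
from collections import defaultdict

def combine_taggers(instance, taggers, weights):
    """Combine category preferences from several taggers via a weighted sum.

    `taggers` is a list of callables returning {category: score} dictionaries;
    `weights` holds one weight per tagger (e.g. learned on a held-out corpus).
    """
    scores = defaultdict(float)
    for tagger, weight in zip(taggers, weights):
        for category, score in tagger(instance).items():
            scores[category] += weight * score
    return max(scores, key=scores.get)  # the category with the highest weighted score wins

# e.g. combine_taggers(np, [frequency_baseline, bow_tagger], [0.4, 0.6])
```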
34. The Bag-of-Words Tagger The bag-of-words tagger is an Information Retrieval-inspired tagger with parameters:
Window size (default: 50)
What POS to put in the content vectors (default: nouns and verbs)
Which similarity measure to use
Used in WSD (Leacock et al 92)
Crane/Animal={species, captivity, disease…}
Crane/Mov.Solid={worker, disaster, machinery…}
35. BoW classifier (2) Seen words classified by calculating the inner product between their context vector and the vectors for each possible category
Inner product calculated as:
Binary vectors – number of matching terms
Weighted vectors:
Leacock’s measure – favour concepts that occur frequently in exactly one category
Take into account the polysemy of concepts in the vectors
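To make the inner-product classification above concrete, here is a minimal sketch for the binary-vector case; the function name and the toy category vectors are illustrative assumptions rather than the workshop's actual implementation, and the weighted (Leacock-style) variant would replace the set intersection with a weighted sum.

```python
def bow_classify(context_words, category_vectors):
    """Assign the category whose term set best matches the target noun's context.

    `context_words`   : words within the +/-50-word window around the target noun
    `category_vectors`: {category: set of content words seen with that category},
                        e.g. {"Animal": {"species", "captivity", "disease"},
                              "Mov.Solid": {"worker", "disaster", "machinery"}}
    With binary vectors the inner product is just the number of matching terms.
    """
    context = set(context_words)
    scores = {cat: len(context & terms) for cat, terms in category_vectors.items()}
    return max(scores, key=scores.get)
```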
36. Current performance measures The baseline frequency tagger on its own – 91% on the test (blind) set
Bag-of-words tagger on its own – 92.7%
Combined architecture – 93.2% (window size 50, using only nouns, binary vectors)
37. Future work on the architecture Integrate syntactic information, subject codes, and document topics
Experiment with cosine similarity
Implement [Yarowsky’92] WSD algorithm
Implement the weighted function module
Experiment with integrating the ME tools as one of the taggers supplying preferences for the weighting module
38. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
39. Accuracy Measurements (Kris Haralambiev) How to measure the accuracy
How to distinguish “correct”, “almost correct” and “wrong”
40. Exact Match Measurements W = (w_1, w_2, …, w_n) – vector of the annotated words
X = (x_1, x_2, …, x_n) – categories assigned by the annotators
Y = (y_1, y_2, …, y_n) – categories assigned by a program
Exact match (default) measurement – 1 for match and 0 for mismatch of each (x_i, y_i) pair:
accuracy(X, Y) = |{i : x_i = y_i}|
41. The Hierarchy
42. Ancestor Relation Measurement weight(Cat) = |{i : x_i ∈ subtree rooted at Cat}|
score(x_i, y_i) = min(weight(x_i)/weight(y_i), weight(y_i)/weight(x_i))
accuracy(X, Y) = Σ_i score(x_i, y_i)
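A sketch of this measure, assuming the category hierarchy is represented as a mapping from each category to its list of children; the function and parameter names are illustrative.

```python
def subtree_weight(hierarchy, root, gold):
    """weight(Cat): number of gold annotations falling in the subtree rooted at Cat."""
    nodes, stack = set(), [root]
    while stack:
        node = stack.pop()
        nodes.add(node)
        stack.extend(hierarchy.get(node, []))   # children of node
    return sum(1 for cat in gold if cat in nodes)

def ancestor_score(x, y, hierarchy, gold):
    """score(x, y) = min(weight(x)/weight(y), weight(y)/weight(x))."""
    wx = subtree_weight(hierarchy, x, gold)
    wy = subtree_weight(hierarchy, y, gold)
    return min(wx / wy, wy / wx) if wx and wy else 0.0

def ancestor_accuracy(X, Y, hierarchy):
    # X: gold categories, Y: system categories; hierarchy: {category: [children]}
    return sum(ancestor_score(x, y, hierarchy, X) for x, y in zip(X, Y))
```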
43. Edge Distance Measurement distance(x_i, y_i) = the length of the simple path from x_i to y_i
each edge can be given individual length or all edges have length 1 (we prefer the latter)
44. Edge Distance Measurement (cont'd) distance(X, Y) = Σ_i distance(x_i, y_i)
Accuracy is interpolated linearly between the two extremes: 100% corresponds to distance 0 and 0% to max_possible_distance, so accuracy(X, Y) = 1 − distance(X, Y) / max_possible_distance
max_possible_distance = Σ_i max_cat distance(x_i, cat)
It might be reasonable to use the average instead of the max
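A sketch of the edge-distance accuracy, here assuming the hierarchy is given as a parent map and using unit edge lengths as preferred above; all names are illustrative.

```python
def edge_distance(a, b, parent):
    """Length of the simple path from a to b in the category tree (unit edge lengths)."""
    def chain(c):                       # c, parent(c), ..., root
        path = []
        while c is not None:
            path.append(c)
            c = parent.get(c)
        return path
    pa, pb = chain(a), chain(b)
    lca = next(c for c in pa if c in set(pb))   # deepest common ancestor
    return pa.index(lca) + pb.index(lca)

def edge_distance_accuracy(X, Y, parent, categories):
    """accuracy = 1 - distance(X, Y) / max_possible_distance (linear interpolation).

    `categories` is the set of all category labels; the max could be replaced by
    the average, as noted above.
    """
    dist = sum(edge_distance(x, y, parent) for x, y in zip(X, Y))
    max_dist = sum(max(edge_distance(x, c, parent) for c in categories) for x in X)
    return 1.0 - dist / max_dist
```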
45. Some Baselines Blind data
46. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
55. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
63. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
64. Hard vs Soft Word Clusters
Words as features are sparse, so we need to cluster them
Hard clusters
A feature is assigned to one and only one cluster. (The cluster for which there exists the strongest evidence.)
Soft clusters
A feature is assigned to as many clusters as there is evidence for.
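A small sketch of the distinction above, assuming we have co-occurrence counts between a feature (e.g. an adjective) and the semantic categories of the nouns it appears with; the evidence threshold is an illustrative parameter.

```python
def hard_cluster(counts):
    """Assign the feature to the single category with the strongest evidence."""
    return max(counts, key=counts.get)

def soft_clusters(counts, min_evidence=1):
    """Assign the feature to every category with (enough) evidence,
    with membership weights proportional to the observed counts."""
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items() if n >= min_evidence}

# e.g. counts = {"Abstract": 40, "Human": 8, "Mov.Solid": 2}
# hard_cluster(counts)  -> "Abstract"
# soft_clusters(counts) -> {"Abstract": 0.8, "Human": 0.16, "Mov.Solid": 0.04}
```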
65. Using clustering and contextual features Baseline – prior + most frequent semantic category
All words within the target noun phrase (with a threshold of 10 occurrences)
Adjective hard clusters
Clusters are defined by most frequent semantic category
Noun soft clusters
Clusters are defined by all semantic categories
Combined adjective hard clusters and noun soft clusters
66. Results with clusters and context
68. Clustering Adjectives We take adjectives with low H(T | a) and form clusters from them depending on which semantic category they predict
Then use each cluster of adjectives as a context feature
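A sketch of that selection step, assuming per-adjective counts over semantic categories; the entropy threshold is a hypothetical parameter, not the value used in the experiments.

```python
import math
from collections import defaultdict

def conditional_entropy(category_counts):
    """H(T | a): entropy of the semantic-category distribution for one adjective."""
    total = sum(category_counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in category_counts.values() if n)

def adjective_clusters(adj_category_counts, max_entropy=0.5):
    """Group low-entropy (i.e. highly predictive) adjectives by the category they predict."""
    clusters = defaultdict(set)
    for adj, counts in adj_category_counts.items():
        if conditional_entropy(counts) <= max_entropy:
            clusters[max(counts, key=counts.get)].add(adj)
    return clusters   # each cluster can then be used as a single context feature
```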
69. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
70. Structuring the context using syntax
Syntactic Model: eXtended Dependency Graph
Syntactic Relations considered: V_Obj, V_Sog, V_PP, NP_PP
Results
Observations:
Features are too scarce
We're overfitting! We need more intelligent methods.
71. Semantic Fingerprint: Generalizing nouns using EuroWordNet
72. Noun semantic fingerprints: an example Words in the events are replaced by “basic concepts”
73. Verb semantic fingerprints: an example
74. How to exploit the word context? [Semantic Category]-Subj-to_think
Positive observations
his wife thought he should eat more
the waitress thought that Italians leave tiny tips
Our conceptual hierarchy contains FemaleHuman and MaleHuman...
75. How to exploit the word context?
76. Syntactic slots and slot fillers
77. How to exploit the word context? Using...
a revised hierarchy
Female animal and male animal → Animal
Female human and male human → Human
Female and male → Animate
“one-semantic-class-per-discourse” hypothesis
the “semantic fingerprint”: generalising nouns to the basic concepts of EuroWordNet and verbs to the topmost concepts in WordNet (see the sketch below)
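As an illustration of this kind of generalization, the sketch below uses Princeton WordNet via NLTK as a stand-in for the EuroWordNet base concepts actually used in the workshop; the fixed depth cut-off and the function name are assumptions (and NLTK's WordNet data must be installed).

```python
from nltk.corpus import wordnet as wn

def generalize(word, pos=wn.NOUN, depth=6):
    """Replace a word with a hypernym at a fixed depth on its first sense's
    hypernym path, approximating a 'base concept' generalization."""
    synsets = wn.synsets(word, pos=pos)
    if not synsets:
        return word                            # novel word: leave it unchanged
    path = synsets[0].hypernym_paths()[0]      # root ... synset, for the first sense
    return path[min(depth, len(path) - 1)].name()

# At a suitable depth, nouns such as "waitress" and "wife" should collapse to the
# same person-level hypernym, mirroring the FemaleHuman/MaleHuman -> Human merge above
# (the exact depth depends on the WordNet version).
```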
78. Results
79. Results: a closer look
80. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
81. Unsupervised Semantic Labeling of Nouns using ME
Frederick Jelinek
Semantic Analysis for Sparse Data
82. Motivation Base ME features on lexical and grammatical relationships found in the context of nouns to be labeled
Hand-labeled data too sparse to allow using powerful ME compound features
Wish to utilize large unlabeled British National Corpus (and internet, etc.) for training
Will use dictionary and initialization by statistics from smaller hand-labeled corpus
83. Format of Labeled Training Data w is the noun to be labeled
r_1, r_2, …, r_m are the relationships in the context of w which correlate with the label appropriate to w
C is the label denoting the semantic class
f_1, f_2, …, f_K are the label counts, i.e., f_C = 1 and f_i = 0 for i ≠ C
Then the event file format is
(f_1, f_2, …, f_K, w, r_1, r_2, …, r_m)
84. Format of BNC Training Data The label counts f_i will be fractional, with f_i = 0 if the dictionary does not allow noun w to have the i-th label.
Always f_i ≥ 0 and Σ_i f_i = 1
The problem is the initial selection of the values of f_i
Suggestion: let f_i = Q(C = i | w), where Q denotes the empirical distribution from the hand-labeled data.
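A sketch of how such event records might be built: hand-labeled events get 0/1 counts, while BNC events get fractional counts initialized from the empirical distribution Q(C | w) restricted to the dictionary-allowed labels. The function names, the Q dictionary, and the uniform back-off for unseen words are illustrative assumptions.

```python
def labeled_event(categories, word, relations, gold_label):
    """Hand-labeled event: f_C = 1 for the annotated class, 0 elsewhere."""
    counts = [1.0 if c == gold_label else 0.0 for c in categories]
    return (counts, word, relations)

def bnc_event(categories, word, relations, allowed, Q):
    """Unlabeled (BNC) event: fractional counts over the dictionary-allowed labels,
    initialized from the empirical distribution Q[word][c] of the hand-labeled data
    and renormalized so that sum_i f_i = 1."""
    raw = [Q.get(word, {}).get(c, 0.0) if c in allowed else 0.0 for c in categories]
    if sum(raw) == 0.0:                 # word unseen in labeled data: back off to uniform over allowed labels
        raw = [1.0 if c in allowed else 0.0 for c in categories]
    total = sum(raw)
    return ([f / total for f in raw], word, relations)
```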
86. The empirical distribution used in the ME iterations is obtained from sums of the values of f_i found in both the labeled and BNC data sets.
These counts determine
which of the potential features will be selected as actual features
the values of the λ parameters in the ME model
Inner Loop ME Re-estimation
88. Outer Loop Re-scaling of Data Once the ME model P(C = c | w, r_1, …, r_m) is estimated, the f_i values in the event files of the BNC portion of the data are re-scaled.
The f_i values in the hand-labeled portion remain unchanged
New empirical counts are thus available
to determine the identity of new actual features
the parameters of a new ME probability model
Etc.
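A sketch of that outer loop, treating the trained ME model as a callable returning P(C = c | w, r_1, …, r_m); the event representation follows the sketch above, and the names are illustrative.

```python
def rescale_bnc_events(bnc_events, categories, me_model):
    """Outer loop: replace the fractional counts of each BNC event with the
    current model's posterior, then renormalize over the allowed labels."""
    rescaled = []
    for counts, word, relations in bnc_events:
        posterior = me_model(word, relations)                 # {category: P(C=c | w, r1..rm)}
        new = [posterior.get(c, 0.0) if old > 0.0 else 0.0    # keep the dictionary restrictions
               for c, old in zip(categories, counts)]
        total = sum(new) or 1.0
        rescaled.append(([f / total for f in new], word, relations))
    return rescaled

# The rescaled counts yield new empirical feature counts, from which a new set of
# actual features and new lambda parameters are estimated, and the loop repeats.
```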
89. Preliminary Results by Jia Cui
90. Concluding Thoughts Preliminary results are promising
Method requires theoretical and practical exploration
Changing of features and feature targets is a new phenomenon in ME estimation
Careful selection of relationships, and basing them on clusters where required, will lead to effective features
See proposal by Jia Cui and David Guthrie
91. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
92. Unsupervised Semantic Tagging
93. Summary Motivations
Lexical Information for Semantic Tagging
Unsupervised Natural Language Learning
Empirical Estimation for ME bootstrapping
Weakly Supervised BNC Tagging through Wordnet
A semantic similarity metric over Wordnet
Experiments and Results
Mapping LDOCE to Wordnet
Bootstrapping over an untagged corpus
Re-estimation through WordNet
94. Motivations All experiments indicate that lexical information is crucial for semantic tagging, …
data sparseness seems to limit the effect of the context
The contribution of different resources needs to be exploited (as in WSD)
In applications, hand-tagging should be applied in a cost-effective way
Good results also need to scale up to technological scenarios where poorer (or no) resources are available
95. Motivations (cont’d) Wordnet contribution to semantic tagging
A source of evidence for a larger set of lexical items (unseen words)
A consistent way to generalize single observations
(hierarchical) constraints over word-usage statistics
Similarity of word uses suggests semantic similarity:
Corpus-driven syntactic similarity is one possible choice
Domain or topical similarity is also relevant
Semantic similarity in the Wordnet hierarchy suggests useful levels of generalization
Specific hypernyms, i.e. those able to separate different senses
General hypernyms, i.e. those that help reduce the number of word classes to model
96. Learning Contextual Evidence Each syntactic relation provides a “view” on a word usage, i.e. suggests a set of nouns with common behaviour(s)
Semantic similarity among nouns is a model of local semantic preference
to drink {beer, water, …, cocoa/L, stock/L, …}
The {…, president, director, boy, ace/H, brain/H, …} succeeds
97. Semantic classes vs. language models The role of p( C | v, d)
e.g.
p(n | v, d) ≈ p(n | C) · p(C | v, d)
Implications
p(n | C) gives a lexical semantic model that …
is likely to depend on the corpus and not on the individual context
p(C | v, d) models selectional preferences and …
provides disambiguation cues for contexts (v d X)
98. Semantic classes vs. language models Lexical evidence: p(n | C) (or also p(C|n) )
Contextual evidence: p( C | v, d)
The idea:
Contextual evidence can be collected from the corpus by involving the lexical knowledge base
The modeling of lexical evidence can be seen as a side effect of the context (p(C | n) ↔ p(n | C))
Implied approach
Learn the second as an estimate for the first and then combine for bootstrapping to unseen words
99. Conceptual Density Basic terminology
Target noun set T (e.g. {beer, water, stock} nouns in relation r=VDirobj with a given verb)
(Branching Factor) Average number m of children of a node s, i.e. the average number of children of any node subsumed by s
(Marks) Set of marks M, i.e. the subset of nouns in T that are subsumed within the WN subhierarchy rooted in s. N = |M|
(Area) area(s), total number of nodes of the subhierarchy rooted at s
100. Conceptual Density (cont’d)
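The terminology on the previous slide matches the standard Conceptual Density formulation of Agirre and Rigau (1996); a plausible reconstruction of the formula in those terms is given below (the exact variant used in the workshop may differ, e.g. by a smoothing exponent on the summands).

```latex
% Conceptual Density of the subhierarchy rooted at s,
% with average branching factor m, N = |M| marks, and area(s) nodes:
CD(s) = \frac{\sum_{i=0}^{N-1} m^{\,i}}{\operatorname{area}(s)}
```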
101. Using Conceptual Density Target Noun set T (e.g. subjs of verb to march)
horse (6 senses in WN1.6),
ant (1 sense in WN1.6)
troop (4 senses in WN1.6)
division (12 senses in WN1.6)
elephant (2 senses in WN1.6)
102. Summary Motivations
Lexical Information for Semantic Tagging
Unsupervised Natural Language Learning
Empirical Estimation for ME bootstrapping
Weakly Supervised BNC Tagging through Wordnet
A semantic similarity metric over Wordnet
Experiments and Results
Mapping LDOCE to Wordnet
Bootstrapping over an untagged corpus
Re-estimation through WordNet
103. Results: Mapping LDOCE classes Lexical Entries in LDOCE are defined in terms of a Semantic Class and a topical tag (Subject Codes), e.g. stock ('L','FO')
The semantic similarity metric has been used to derive the WN synset(s) that represent <SemClass, SubjCode> pairs
A WN explanation of lexical entries in an LM class (lexical mapping)
The position(s) in the WN noun hierarchy of each LM class (category mapping)
Semantic preference of synsets given words, LM classes (and Subject Codes) can be mapped into probabilities, e.g.
p(WN_syns | n, LM_class)
and then
p(LM_class | n, WN_syns), p(LM_class | n), p(LM_class | WN_syns)
104. Mapping LDOCE classes (cont’d) Example Cluster: 2---EDZI
'2' = 'Abstract and solid'
'ED'-'ZI' = 'education - institutions, academic name of'
T={nursery_school, polytechnic, school, seminary, senate}
Synset "school": cd = 0.580, coverage: 60%
Synset "educational_institution": cd = 0.527, coverage: 80%
Synset "gathering, assemblage": cd = 0.028, coverage: 40%
105. Case study: the word stock in LDOCE stock T a supply (of something) for use
stock J goods for sale
stock N the thick part of a tree trunk
stock A a group of animals used for breeding
stock A farm animals usu. cattle; LIVESTOCK
stock T a family line, esp. of the stated character
stock T money lent to a government at a fixed rate of interest
stock T the money (CAPITAL) owned by a company, divided into SHAREs
stock P a type of garden flower with a sweet smell
stock L a liquid made from the juices of meat, bones, etc., used in cooking
stock J (in former times) a stiff cloth worn by men round the neck of a shirt
compare TIE
stock N a piece of wood used as a support or handle, as for a gun or tool
stock N the piece which goes across the top of an ANCHOR_1_1 from side to side
stock P a plant from which CUTTINGs are grown
stock P a stem onto which another plant is GRAFTed
106. Case study: stock as Animal (A) stock A a group of animals used for breeding
stock A farm animals usu. cattle; LIVESTOCK
stock N a piece of wood used as a support or handle, as for a gun or tool
stock N the piece which goes across the top of an ANCHOR_1_1 from side to side
stock N the thick part of a tree trunk
stock P a plant from which CUTTINGs are grown
stock P a stem onto which another plant is GRAFTed
stock P a type of garden flower with a sweet smell
108. LM Category Mapping
109. Results: A Simple (Unsupervised) Tagger Estimate, over the parsed corpus + WordNet and by mapping into LDOCE categories, the following quantities:
P(C | hw, r), P(C | r), P(C | hw)
(r ranges over: SubjV, DirObj, N_P_hw, hw_P_N)
Apply a simple Bayesian model to any incoming context
<hw, r_1, …, r_k>
and select argmax_C (p(C | hw) · p(C | r_1) · … · p(C | r_k))
(Note: p(C | r_j) is the back-off of p(C | hw, r_j))
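A minimal sketch of that selection rule; the probability tables and their keying are assumptions about how the precomputed estimates might be stored, not the workshop's actual data structures.

```python
import math

def tag_context(hw, relations, p_c_given_hw, p_c_given_hw_r, p_c_given_r, categories):
    """Pick argmax_C p(C | hw) * prod_j p(C | hw, r_j), backing off to p(C | r_j)
    when the (hw, r_j) pair was not observed in the corpus."""
    def score(c):
        s = math.log(p_c_given_hw.get((c, hw), 1e-9))
        for r in relations:
            p = p_c_given_hw_r.get((c, hw, r)) or p_c_given_r.get((c, r), 1e-9)
            s += math.log(p)
        return s
    return max(categories, key=score)
```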
110. Unsupervised Tagger: Evaluation Coverage: 10% of the corpus words (annotated material)
Coverage: 50% of the test set was seen by Sensum
111. Results: Re-estimate probs for a ME model Use sentences in training data for learning lexical and contextual preferences of nouns and relations
Use lexical preferences to pre-estimate the empirical distributions over unseen data (see the constraints Q(c, w, R) in Fred's talk)
Train the ME model over all available data
Tag held-out and blind data
112. Results Features: All syntactic Features
ME Tra: Training Data
ME Test: Held-out
Result: 78-79%
Features: Only head words
ME Tra: Training Data
ME Test: Held-out
Result: 80.76%
113. Conclusions A robust parameter estimation method for semantic tagging
Less prone to sparse data
Generalize to meaningful noun classes
Develop “lexicalized” contextual cues and a semantic dictionary
A natural and viable way to integrate corpus-driven evidence with a general-purpose lexicon
Results consistent with fully supervised methods
Open perspectives for effective estimation of unseen empirical distributions
114. Open Issue Estimate contextual and lexical probabilities from the 28M portion of the BNC (already parsed here)
Alternative formulations of similarity metrics
Experiment with a bootstrapping method by imposing the proposed estimates (i.e. p(C | w, SubjV)) as constraints on Q(C, w, SubjV)
Manually assess and measure the automatically derived Longman-Wordnet mapping
115. Summary Slide IR-inspired approaches (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy
Incorporating context preferences (Jerry)
Adjective Classes and Subject markings (David)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred)
Unsupervised Re-estimation (Roberto)
116. Our Accomplishments Developed a method for bootstrapping using maximum entropy
More than 300 experiments with features
Integrated dictionary and syntactic information
Integrated dictionary, WordNet, syntactic, and topic information in experiments which gave us a significant improvement
Developed a system for unsupervised tagging
117. Lessons learned Semantic Tagging has an intermediate complexity between the rather successful NE recognition and Word Sense Disambiguation
Semantic tagging over BNC is viable with high accuracy
Accuracy reached by most of the proposed methods: ≈94%
This task stimulates cross-fertilization between statistical and symbolic knowledge grounded on solid linguistic principles and resources
118. The near future at a glance … Availability of semantic information for head nouns is critical to a variety of linguistic tasks
IR and CLIR, Information Extraction and Question Answering
Machine Translation and Language Modeling
Annotated resources can provide a significant stimulus to machine learning of linguistic patterns (e.g. QA answer structures)
Open possibilities for corpus-driven learning of other semantic phenomena (e.g. verb argument structures) and incremental learning methods
119. … and a quick look further Unseen phenomena still represent hard cases for any probabilistic model (rare vs. impossible labels for unseen/novel words)
Integration of external resources is problematic
Projecting observed empirical distributions may lead to overfitting data
Lexical information (e.g. WordNet) does not have a clear probabilistic interpretation
Soft Features (Jia Cui) seem a promising model
Better use of the context:
Design and derivation of class-based contextual features (David Guthrie)
Existing lexical resources provide large scale and effective information for bootstrapping
120. A Final thought Thanks to the Johns Hopkins faculty and staff for their availability and helpfulness during the workshop.
Special thanks to Fred Jelinek for answering endless questions about maximum entropy and helping to model our problem.