1. Attacking the Data Sparseness Problem Team: Louise Guthrie, Roberto Basili, Fabio Zanzotto, Hamish Cunningham, Kalina Bontcheva, Jia Cui, Klaus Macherey, David Guthrie, Martin Holub, Marco Cammisa, Cassia Martin, Jerry Liu, Kris Haralambiev, Fred Jelinek
2. Texts for text extraction contain sentences like:
The IRA bombed a family owned shop in Belfast yesterday.
FMLN set off a series of explosions in central Bogota today.
Motivation for the project
3. We’d like to automatically recognize that both are of the form:
The IRA bombed a family owned shop in Belfast yesterday.
FMLN set off a series of explosions in central Bogota today.
Motivation for the project
4. Our Hypotheses A transformation of a corpus to replace words and phrases with coarse semantic categories will help overcome the data sparseness problem encountered in language modeling and text extraction.
Semantic category information might also help improve machine translation
A noun-centric approach initially will allow bootstrapping for other syntactic categories
5. A six-week goal – Labeling noun phrases
Astronauts aboard the space shuttle Endeavor were forced to dodge a derelict Air Force satellite Friday
Humans aboard space_vehicle dodge satellite timeref.
6. Preparing the data – Pre-Workshop: Identify a tag set
Create a Human annotated corpus
Create a double annotated corpus
Process all data for named entity and noun phrase recognition using GATE Tools (26 million words)
Parsed about 26 million words
Develop algorithms for mapping target categories to Wordnet synsets to support the tag set assessment
7. The Semantic Classes and the Corpus A subset of the classes available in the electronic version of the Longman Dictionary of Contemporary English (LDOCE)
Rationale:
The number of semantic classes was small
The classes are somewhat reliable since they were used by a team of lexicographers to code noun senses, adjective preferences, and verb preferences
Many words have subject area information which might be useful
8. The Semantic Classes
9. The Semantic Classes
10. The Semantic Classes
11. The Semantic Classes
12. The human annotated statistics Inter-annotator agreement is 94%, so that is effectively the upper bound for our task.
214,446 total annotated noun phrases (262,683 including “None of the Above”)
29,071 unique vocabulary items (Unlemmatized)
25 semantic categories (162 associated subject areas were identified)
127,569 (59%) were annotated with the semantic category Abstract
13. The experimental setup
14. The main development set (dev)
15. A challenging development set for experiments on unseen words (Hard data set)
16. Our Experiments include: Supervised Approaches (Learning from Human Annotated data)
Unsupervised approaches
Using outside evidence (the dictionary or wordnet)
Syntactic information from parsing or pattern matching
Context words, the use of preferences, the use of topical information
17. Experiments on unseen words - Hard data set
Training corpus has only words with unambiguous annotations
125,000 training instances
73,000 instances held-out
Perplexity – 21
Baseline – Accuracy 45%
Improvement – Accuracy 68.5 %
Context can contribute greatly in unsupervised experiments
18. Results on the dev set
Random split, with some frequent ambiguous words moved into testing
113,000 training instances
85,000 instances held-out
Perplexity – 3.44
Baseline – Accuracy 80%
Improvement – Accuracy 87 %
19. The scheme for annotating the large corpus After experimenting with the development sets, we need a scheme for making use of all of the dev corpus to tag the blind corpus.
We developed an incremental scheme within the maximum entropy framework
Several of the talks concern re-estimation techniques useful to the bootstrapping process.
20. Terminology Seen words – words seen in the human annotated data (new instances of known words)
Unseen words – not in the training material but in the dictionary
Novel words – not in the training material nor in the dictionary/Wordnet
21. Bootstrapping
22. The Unannotated Data – Four types
23. The Unannotated Data – Four types
24. The Unannotated Data – Four types
25. The Unannotated Data – Four types
27. Results on the Blind Data
We set aside one tenth of the annotated corpus
Randomly selected within each of the domains
It contained 13,000 annotated instances
The baseline here was very high - 90% with simple techniques
We were able to achieve 93.5% accuracy
28. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
29. Semantic Categories and MT 10 test words – high, medium, and low frequency
Collected their target translations using EuroWordNet (e.g. Dutch)
Crane:
[lifts and moves heavy objects] – hijskraan, kraan
[large long-necked wading bird] - kraanvogel
30. SemCats and MT (2) Manually mapped synonym sets to semantic categories
An automatic mapping will be presented later
Studied how many synonym sets are ruled out as translations by the semantic category
31. Some Results 3 words – full disambiguation
crane (Mov.Solid/Animal), medicine (Abstract/Liquid), plant (Plant/Solid)
7 words – the categories substantially reduce the possible translations
club - [Abstr/an association of people...], [Mov.Solid/stout stick...], [Mov.Solid/ an implement used by a golfer...], [Mov.Solid/a playing card...], [NonMov.Solid/a building …]
club/NonMov.Solid – [clubgebouw, clubhuis, …]
club/Abstr. – [bevolkingsgroep, broederschap, …]
club/Mov.Solid – [knots, kolf, malie], [kolf, malie], [club]
32. The architecture The “multiple-knowledge sources” WSD architecture (Stevenson 03)
Allow use of multiple taggers and combine their results through a weighted function
Weights can be learned from a corpus
All taggers implemented as GATE components and combined in applications
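A minimal sketch of such a weighted combination is below; the tagger interface (callables returning per-category scores) and the example weights are illustrative assumptions, not the actual GATE component API.

```python
from collections import defaultdict

def combine_taggers(instance, taggers, weights):
    """Combine category preferences from several taggers via a weighted sum.

    `taggers` is a list of callables returning {category: score} dictionaries;
    `weights` holds one weight per tagger (e.g. learned on a held-out corpus).
    """
    scores = defaultdict(float)
    for tagger, weight in zip(taggers, weights):
        for category, score in tagger(instance).items():
            scores[category] += weight * score
    return max(scores, key=scores.get)  # the category with the highest weighted score wins

# e.g. combine_taggers(np, [frequency_baseline, bow_tagger], [0.4, 0.6])
```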
34. The Bag-of-Words Tagger The bag-of-words tagger is an Information Retrieval-inspired tagger with parameters:
Window size (default: 50)
What POS to put in the content vectors (default: nouns and verbs)
Which similarity measure to use
Used in WSD (Leacock et al 92)
Crane/Animal={species, captivity, disease…}
Crane/Mov.Solid={worker, disaster, machinery…}
35. BoW classifier (2) Seen words classified by calculating the inner product between their context vector and the vectors for each possible category
Inner product calculated as:
Binary vectors – number of matching terms
Weighted vectors:
Leacock’s measure – favour concepts that occur frequently in exactly one category
Take into account the polysemy of concepts in the vectors
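To make the inner-product classification above concrete, here is a minimal sketch for the binary-vector case; the function name and the toy category vectors are illustrative assumptions rather than the workshop's actual implementation, and the weighted (Leacock-style) variant would replace the set intersection with a weighted sum.

```python
def bow_classify(context_words, category_vectors):
    """Assign the category whose term set best matches the target noun's context.

    `context_words`   : words within the +/-50-word window around the target noun
    `category_vectors`: {category: set of content words seen with that category},
                        e.g. {"Animal": {"species", "captivity", "disease"},
                              "Mov.Solid": {"worker", "disaster", "machinery"}}
    With binary vectors the inner product is just the number of matching terms.
    """
    context = set(context_words)
    scores = {cat: len(context & terms) for cat, terms in category_vectors.items()}
    return max(scores, key=scores.get)
```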
36. Current performance measures The baseline frequency tagger on its own – 91% on the test (blind) set
Bag-of-words tagger on its own – 92.7%
Combined architecture – 93.2% (window size 50, using only nouns, binary vectors)
37. Future work on the architecture Integrate syntactic information, subject codes, and document topics
Experiment with cosine similarity
Implement [Yarowsky’92] WSD algorithm
Implement the weighted function module
Experiment with integrating the ME tools as one of the taggers supplying preferences for the weighting module
38. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
39. Accuracy Measurements (Kris Haralambiev) How to measure the accuracy
How to distinguish “correct”, “almost correct” and “wrong”
40. Exact Match Measurements W = (w_1, w_2, …, w_n) – vector of the annotated words
X = (x_1, x_2, …, x_n) – categories assigned by the annotators
Y = (y_1, y_2, …, y_n) – categories assigned by a program
Exact match (default) measurement – 1 for match and 0 for mismatch of each (x_i, y_i) pair:
accuracy(X, Y) = |{i : x_i = y_i}|
41. The Hierarchy
42. Ancestor Relation Measurement weight(Cat) = |{i : x_i ∈ subtree rooted at Cat}|
score(x_i, y_i) = min(weight(x_i)/weight(y_i), weight(y_i)/weight(x_i))
accuracy(X, Y) = Σ_i score(x_i, y_i)
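A sketch of this measure, assuming the category hierarchy is represented as a mapping from each category to its list of children; the function and parameter names are illustrative.

```python
def subtree_weight(hierarchy, root, gold):
    """weight(Cat): number of gold annotations falling in the subtree rooted at Cat."""
    nodes, stack = set(), [root]
    while stack:
        node = stack.pop()
        nodes.add(node)
        stack.extend(hierarchy.get(node, []))   # children of node
    return sum(1 for cat in gold if cat in nodes)

def ancestor_score(x, y, hierarchy, gold):
    """score(x, y) = min(weight(x)/weight(y), weight(y)/weight(x))."""
    wx = subtree_weight(hierarchy, x, gold)
    wy = subtree_weight(hierarchy, y, gold)
    return min(wx / wy, wy / wx) if wx and wy else 0.0

def ancestor_accuracy(X, Y, hierarchy):
    # X: gold categories, Y: system categories; hierarchy: {category: [children]}
    return sum(ancestor_score(x, y, hierarchy, X) for x, y in zip(X, Y))
```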
43. Edge Distance Measurement distance(x_i, y_i) = the length of the simple path from x_i to y_i
each edge can be given individual length or all edges have length 1 (we prefer the latter)
44. Edge Distance Measurement (cont'd) distance(X, Y) = Σ_i distance(x_i, y_i)
Accuracy is interpolated linearly between the two extremes: 100% corresponds to distance 0 and 0% to max_possible_distance, so accuracy(X, Y) = 1 − distance(X, Y) / max_possible_distance
max_possible_distance = Σ_i max_cat distance(x_i, cat)
It might be reasonable to use the average instead of the max
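A sketch of the edge-distance accuracy, here assuming the hierarchy is given as a parent map and using unit edge lengths as preferred above; all names are illustrative.

```python
def edge_distance(a, b, parent):
    """Length of the simple path from a to b in the category tree (unit edge lengths)."""
    def chain(c):                       # c, parent(c), ..., root
        path = []
        while c is not None:
            path.append(c)
            c = parent.get(c)
        return path
    pa, pb = chain(a), chain(b)
    lca = next(c for c in pa if c in set(pb))   # deepest common ancestor
    return pa.index(lca) + pb.index(lca)

def edge_distance_accuracy(X, Y, parent, categories):
    """accuracy = 1 - distance(X, Y) / max_possible_distance (linear interpolation).

    `categories` is the set of all category labels; the max could be replaced by
    the average, as noted above.
    """
    dist = sum(edge_distance(x, y, parent) for x, y in zip(X, Y))
    max_dist = sum(max(edge_distance(x, c, parent) for c in categories) for x in X)
    return 1.0 - dist / max_dist
```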
45. Some Baselines Blind data
46. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
55. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
63. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
64. Hard vs Soft Word Clusters
Words as features are sparse, so we need to cluster them
Hard clusters
A feature is assigned to one and only one cluster. (The cluster for which there exists the strongest evidence.)
Soft clusters
A feature is assigned to as many clusters as there is evidence for.
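A small sketch of the distinction above, assuming we have co-occurrence counts between a feature (e.g. an adjective) and the semantic categories of the nouns it appears with; the evidence threshold is an illustrative parameter.

```python
def hard_cluster(counts):
    """Assign the feature to the single category with the strongest evidence."""
    return max(counts, key=counts.get)

def soft_clusters(counts, min_evidence=1):
    """Assign the feature to every category with (enough) evidence,
    with membership weights proportional to the observed counts."""
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items() if n >= min_evidence}

# e.g. counts = {"Abstract": 40, "Human": 8, "Mov.Solid": 2}
# hard_cluster(counts)  -> "Abstract"
# soft_clusters(counts) -> {"Abstract": 0.8, "Human": 0.16, "Mov.Solid": 0.04}
```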
65. Using clustering and contextual features Baseline – prior + most frequent semantic category
All words within the target noun phrase (with a threshold of 10 occurrences)
Adjective hard clusters
Clusters are defined by most frequent semantic category
Noun soft clusters
Clusters are defined by all semantic categories
Combined adjective hard clusters and noun soft clusters
66. Results with clusters and context
68. Clustering Adjectives We take adjectives with low H(T | a) and form clusters from them depending on which semantic category they predict
Then use each cluster of adjectives as a context feature
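A sketch of that selection step, assuming per-adjective counts over semantic categories; the entropy threshold is a hypothetical parameter, not the value used in the experiments.

```python
import math
from collections import defaultdict

def conditional_entropy(category_counts):
    """H(T | a): entropy of the semantic-category distribution for one adjective."""
    total = sum(category_counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in category_counts.values() if n)

def adjective_clusters(adj_category_counts, max_entropy=0.5):
    """Group low-entropy (i.e. highly predictive) adjectives by the category they predict."""
    clusters = defaultdict(set)
    for adj, counts in adj_category_counts.items():
        if conditional_entropy(counts) <= max_entropy:
            clusters[max(counts, key=counts.get)].add(adj)
    return clusters   # each cluster can then be used as a single context feature
```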
69. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
70. Structuring the context using syntax
Syntactic Model: eXtended Dependency Graph
Syntactic Relations considered: V_Obj, V_Sog, V_PP, NP_PP
Results
Observations:
Features are too scarce
We're overfitting! We need more intelligent methods.
71. Semantic Fingerprint: Generalizing nouns using EuroWordNet
72. Noun semantic fingerprints: an example Words in the events are replaced by “basic concepts”
73. Verb semantic fingerprints: an example
74. How to exploit the word context? [Semantic Category]-Subj-to_think
Positive observations
his wife thought he should eat more
the waitress thought that Italians leave tiny tips
Our conceptual hierarchy contains FemaleHuman and MaleHuman...
75. How to exploit the word context?
76. Syntactic slots and slot fillers
77. How to exploit the word context? Using...
a revised hierarchy
Female animal and male animal → Animal
Female human and male human → Human
Female and male → Animate
“one-semantic-class-per-discourse” hypothesis
the “semantic fingerprint”: generalising nouns to the basic concepts of EuroWordNet and verbs to the topmost concepts in WordNet (see the sketch below)
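As an illustration of this kind of generalization, the sketch below uses Princeton WordNet via NLTK as a stand-in for the EuroWordNet base concepts actually used in the workshop; the fixed depth cut-off and the function name are assumptions (and NLTK's WordNet data must be installed).

```python
from nltk.corpus import wordnet as wn

def generalize(word, pos=wn.NOUN, depth=6):
    """Replace a word with a hypernym at a fixed depth on its first sense's
    hypernym path, approximating a 'base concept' generalization."""
    synsets = wn.synsets(word, pos=pos)
    if not synsets:
        return word                            # novel word: leave it unchanged
    path = synsets[0].hypernym_paths()[0]      # root ... synset, for the first sense
    return path[min(depth, len(path) - 1)].name()

# At a suitable depth, nouns such as "waitress" and "wife" should collapse to the
# same person-level hypernym, mirroring the FemaleHuman/MaleHuman -> Human merge above
# (the exact depth depends on the WordNet version).
```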
78. Results
79. Results: a closer look
80. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
81. Unsupervised Semantic Labeling of Nouns using ME
Frederick Jelinek
Semantic Analysis for Sparse Data
82. Motivation Base ME features on lexical and grammatical relationships found in the context of nouns to be labeled
Hand-labeled data too sparse to allow using powerful ME compound features
Wish to utilize large unlabeled British National Corpus (and internet, etc.) for training
Will use dictionary and initialization by statistics from smaller hand-labeled corpus
83. Format of Labeled Training Data w is the noun to be labeled
r_1, r_2, …, r_m are the relationships in the context of w which correlate with the label appropriate to w
C is the label denoting the semantic class
f_1, f_2, …, f_K are the label counts, i.e., f_C = 1 and f_i = 0 for i ≠ C
Then the event file format is
(f_1, f_2, …, f_K, w, r_1, r_2, …, r_m)
84. Format of BNC Training Data The label counts f_i will be fractional, with f_i = 0 if the dictionary does not allow noun w to have the i-th label.
Always f_i ≥ 0 and Σ_i f_i = 1
The problem is the initial selection of the values of f_i
Suggestion: let f_i = Q(C = i | w), where Q denotes the empirical distribution from the hand-labeled data.
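A sketch of how such event records might be built: hand-labeled events get 0/1 counts, while BNC events get fractional counts initialized from the empirical distribution Q(C | w) restricted to the dictionary-allowed labels. The function names, the Q dictionary, and the uniform back-off for unseen words are illustrative assumptions.

```python
def labeled_event(categories, word, relations, gold_label):
    """Hand-labeled event: f_C = 1 for the annotated class, 0 elsewhere."""
    counts = [1.0 if c == gold_label else 0.0 for c in categories]
    return (counts, word, relations)

def bnc_event(categories, word, relations, allowed, Q):
    """Unlabeled (BNC) event: fractional counts over the dictionary-allowed labels,
    initialized from the empirical distribution Q[word][c] of the hand-labeled data
    and renormalized so that sum_i f_i = 1."""
    raw = [Q.get(word, {}).get(c, 0.0) if c in allowed else 0.0 for c in categories]
    if sum(raw) == 0.0:                 # word unseen in labeled data: back off to uniform over allowed labels
        raw = [1.0 if c in allowed else 0.0 for c in categories]
    total = sum(raw)
    return ([f / total for f in raw], word, relations)
```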
86. The empirical distribution used in the ME iterations is obtained from sums of the values of f_i found in both the labeled and BNC data sets.
These counts determine
which of the potential features will be selected as actual features
the values of the λ parameters in the ME model
Inner Loop ME Re-estimation
88. Outer Loop Re-scaling of Data Once the ME model P(C = c | w, r_1, …, r_m) is estimated, the f_i values in the event files of the BNC portion of the data are re-scaled.
The f_i values in the hand-labeled portion remain unchanged
New empirical counts are thus available
to determine the identity of new actual features
the parameters of a new ME probability model
Etc.
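A sketch of that outer loop, treating the trained ME model as a callable returning P(C = c | w, r_1, …, r_m); the event representation follows the sketch above, and the names are illustrative.

```python
def rescale_bnc_events(bnc_events, categories, me_model):
    """Outer loop: replace the fractional counts of each BNC event with the
    current model's posterior, then renormalize over the allowed labels."""
    rescaled = []
    for counts, word, relations in bnc_events:
        posterior = me_model(word, relations)                 # {category: P(C=c | w, r1..rm)}
        new = [posterior.get(c, 0.0) if old > 0.0 else 0.0    # keep the dictionary restrictions
               for c, old in zip(categories, counts)]
        total = sum(new) or 1.0
        rescaled.append(([f / total for f in new], word, relations))
    return rescaled

# The rescaled counts yield new empirical feature counts, from which a new set of
# actual features and new lambda parameters are estimated, and the loop repeats.
```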
89. Preliminary Results by Jia Cui
90. Concluding Thoughts Preliminary results are promising
Method requires theoretical and practical exploration
Changing of features and feature targets is a new phenomenon in ME estimation
Careful selection of relationships, and basing them on clusters where required, will lead to effective features
See proposal by Jia Cui and David Guthrie
91. Overview Bag of words (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy (Klaus)
Incorporating context preferences (Jerry)
Experiments with Adjective Classes and Subject (David, Jia, Martin)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
Unsupervised Re-estimation (Roberto)
Student Proposals (Jia, Dave, Marco)
Conclusion
92. Unsupervised Semantic Tagging
93. Summary Motivations
Lexical Information for Semantic Tagging
Unsupervised Natural Language Learning
Empirical Estimation for ME bootstrapping
Weakly Supervised BNC Tagging through Wordnet
A semantic similarity metric over Wordnet
Experiments and Results
Mapping LDOCE to Wordnet
Bootstrapping over an untagged corpus
Re-estimation through WordNet
94. Motivations All experiments indicate that lexical information is crucial for semantic tagging, …
data sparseness seems to limit the effect of the context
The contribution of different resources needs to be exploited (as in WSD)
In applications, hand-tagging should be applied in a cost-effective way
Good results also need to scale up to technological scenarios where poorer (or no) resources are available
95. Motivations (cont’d) Wordnet contribution to semantic tagging
A source of evidence for a larger set of lexical items (unseen words)
A consistent way to generalize single observations
(hierarchical) constraints over word-usage statistics
Similarity of word uses suggests semantic similarity:
Corpus-driven syntactic similarity is one possible choice
Domain or topical similarity is also relevant
Semantic similarity in the Wordnet hierarchy suggests useful levels of generalization
Specific hypernyms, i.e. those able to separate different senses
General hypernyms, i.e. those that help reduce the number of word classes to model
96. Learning Contextual Evidence Each syntactic relation provides a “view” on a word usage, i.e. suggests a set of nouns with common behaviour(s)
Semantic similarity among nouns is a model of local semantic preference
to drink {beer, water, …, cocoa/L, stock/L, …}
The {…, president, director, boy, ace/H, brain/H, …} succeeds
97. Semantic classes vs. language models The role of p( C | v, d)
e.g.
p(n | v, d) ≈ p(n | C) · p(C | v, d)
Implications
p(n | C) gives a lexical semantic model that …
is likely to depend on the corpus and not on the individual context
p(C | v, d) models selectional preferences and …
provides disambiguation cues for contexts (v d X)
98. Semantic classes vs. language models Lexical evidence: p(n | C) (or also p(C|n) )
Contextual evidence: p( C | v, d)
The idea:
Contextual evidence can be collected from the corpus by involving the lexical knowledge base
The modeling of lexical evidence can be seen as a side effect of the context (p(C | n) ↔ p(n | C))
Implied approach
Learn the second as an estimate for the first and then combine for bootstrapping to unseen words
99. Conceptual Density Basic terminology
Target noun set T (e.g. {beer, water, stock} nouns in relation r=VDirobj with a given verb)
(Branching Factor) Average number m of children of a node s, i.e. the average number of children of any node subsumed by s
(Marks) Set of marks M, i.e. the subset of nouns in T that are subsumed within the WN subhierarchy rooted in s. N = |M|
(Area) area(s), total number of nodes of the subhierarchy rooted at s
100. Conceptual Density (cont’d)
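The terminology on the previous slide matches the standard Conceptual Density formulation of Agirre and Rigau (1996); a plausible reconstruction of the formula in those terms is given below (the exact variant used in the workshop may differ, e.g. by a smoothing exponent on the summands).

```latex
% Conceptual Density of the subhierarchy rooted at s,
% with average branching factor m, N = |M| marks, and area(s) nodes:
CD(s) = \frac{\sum_{i=0}^{N-1} m^{\,i}}{\operatorname{area}(s)}
```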
101. Using Conceptual Density Target Noun set T (e.g. subjs of verb to march)
horse (6 senses in WN1.6),
ant (1 sense in WN1.6)
troop (4 senses in WN1.6)
division (12 senses in WN1.6)
elephant (2 senses in WN1.6)
102. Summary Motivations
Lexical Information for Semantic Tagging
Unsupervised Natural Language Learning
Empirical Estimation for ME bootstrapping
Weakly Supervised BNC Tagging through Wordnet
A semantic similarity metric over Wordnet
Experiments and Results
Mapping LDOCE to Wordnet
Bootstrapping over an untagged corpus
Re-estimation through WordNet
103. Results: Mapping LDOCE classes Lexical Entries in LDOCE are defined in terms of a Semantic Class and a topical tag (Subject Codes), e.g. stock ('L','FO')
The semantic similarity metric has been used to derive the WN synset(s) that represent <SemClass, SubjCode> pairs
A WN explanation of lexical entries in an LM class (lexical mapping)
The position(s) in the WN noun hierarchy of each LM class (category mapping)
Semantic preference of synsets given words, LM classes (and Subject Codes) can be mapped into probabilities, e.g.
p(WN_syns | n, LM_class)
and then
p(LM_class | n, WN_syns), p(LM_class | n), p(LM_class | WN_syns)
104. Mapping LDOCE classes (cont’d) Example Cluster: 2---EDZI
'2' = 'Abstract and solid'
'ED'-'ZI' = 'education - institutions, academic name of'
T={nursery_school, polytechnic, school, seminary, senate}
Synset "school": cd = 0.580, coverage: 60%
Synset "educational_institution": cd = 0.527, coverage: 80%
Synset "gathering, assemblage": cd = 0.028, coverage: 40%
105. Case study: the word stock in LDOCE stock T a supply (of something) for use
stock J goods for sale
stock N the thick part of a tree trunk
stock A a group of animals used for breeding
stock A farm animals usu. cattle; LIVESTOCK
stock T a family line, esp. of the stated character
stock T money lent to a government at a fixed rate of interest
stock T the money (CAPITAL) owned by a company, divided into SHAREs
stock P a type of garden flower with a sweet smell
stock L a liquid made from the juices of meat, bones, etc., used in cooking
stock J (in former times) a stiff cloth worn by men round the neck of a shirt
compare TIE
stock N a piece of wood used as a support or handle, as for a gun or tool
stock N the piece which goes across the top of an ANCHOR_1_1 from side to side
stock P a plant from which CUTTINGs are grown
stock P a stem onto which another plant is GRAFTed
106. Case study: stock as Animal (A) stock A a group of animals used for breeding
stock A farm animals usu. cattle; LIVESTOCK
stock N a piece of wood used as a support or handle, as for a gun or tool
stock N the piece which goes across the top of an ANCHOR_1_1 from side to side
stock N the thick part of a tree trunk
stock P a plant from which CUTTINGs are grown
stock P a stem onto which another plant is GRAFTed
stock P a type of garden flower with a sweet smell
108. LM Category Mapping
109. Results: A Simple (Unsupervised) Tagger Estimate, over the parsed corpus + WordNet and by mapping into LDOCE categories, the following quantities:
P(C | hw, r), P(C | r), P(C | hw)
(r ranges over: SubjV, DirObj, N_P_hw, hw_P_N)
Apply a simple Bayesian model to any incoming context
<hw, r_1, …, r_k>
and select argmax_C (p(C | hw) · p(C | r_1) · … · p(C | r_k))
(Note: p(C | r_j) is the back-off of p(C | hw, r_j))
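A minimal sketch of that selection rule; the probability tables and their keying are assumptions about how the precomputed estimates might be stored, not the workshop's actual data structures.

```python
import math

def tag_context(hw, relations, p_c_given_hw, p_c_given_hw_r, p_c_given_r, categories):
    """Pick argmax_C p(C | hw) * prod_j p(C | hw, r_j), backing off to p(C | r_j)
    when the (hw, r_j) pair was not observed in the corpus."""
    def score(c):
        s = math.log(p_c_given_hw.get((c, hw), 1e-9))
        for r in relations:
            p = p_c_given_hw_r.get((c, hw, r)) or p_c_given_r.get((c, r), 1e-9)
            s += math.log(p)
        return s
    return max(categories, key=score)
```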
110. Unsupervised Tagger: Evaluation Coverage: 10% of the corpus words (annotated material)
Coverage: 50% of the test set was seen by Sensum
111. Results: Re-estimate probs for a ME model Use sentences in training data for learning lexical and contextual preferences of nouns and relations
Use lexical preferences to pre-estimate the empirical distributions over unseen data (see the constraints Q(c, w, R) in Fred's talk)
Train the ME model over all available data
Tag held-out and blind data
112. Results Features: All syntactic Features
ME Tra: Training Data
ME Test: Held-out
Result: 78-79%
Features: Only head words
ME Tra: Training Data
ME Test: Held-out
Result: 80.76%
113. Conclusions A robust parameter estimation method for semantic tagging
Less prone to sparse data
Generalize to meaningful noun classes
Develop “lexicalized” contextual cues and a semantic dictionary
A natural and viable way to integrate corpus-driven evidence with a general-purpose lexicon
Results consistent with fully supervised methods
Open perspectives for effective estimation of unseen empirical distributions
114. Open Issue Estimate contextual and lexical probabilities from the 28M portion of the BNC (already parsed here)
Alternative formulations of similarity metrics
Experiment with a bootstrapping method by imposing the proposed estimates (i.e. p(C | w, SubjV)) as constraints on Q(C, w, SubjV)
Manually assess and measure the automatically derived Longman-Wordnet mapping
115. Summary Slide IR-inspired approaches (Kalina)
Evaluation (Kris)
Supervised methods using maximum entropy
Incorporating context preferences (Jerry)
Adjective Classes and Subject markings (David)
Structuring the context using syntax and semantics (Cassia, Fabio)
Re-estimation techniques for Maximum Entropy Experiments (Fred)
Unsupervised Re-estimation (Roberto)
116. Our Accomplishments Developed a method for bootstrapping using maximum entropy
More than 300 experiments with features
Integrated dictionary and syntactic information
Integrated dictionary, WordNet, syntactic, and topic information in experiments which gave us a significant improvement
Developed a system for unsupervised tagging
117. Lessons learned Semantic Tagging has an intermediate complexity between the rather successful NE recognition and Word Sense Disambiguation
Semantic tagging over BNC is viable with high accuracy
Accuracy reached by most of the proposed methods: ≈94%
This task stimulates cross-fertilization between statistical and symbolic knowledge grounded on solid linguistic principles and resources
118. The near future at a glance … Availability of semantic information for head nouns is critical to a variety of linguistic tasks
IR and CLIR, Information Extraction and Question Answering
Machine Translation and Language Modeling
Annotated resources can provide a significant stimulus to machine learning of linguistic patterns (e.g. QA answer structures)
Open possibilities for corpus-driven learning of other semantic phenomena (e.g. verb argument structures) and incremental learning methods
119. … and a quick look further Unseen phenomena still represent hard cases for any probabilistic model (rare vs. impossible labels for unseen/novel words)
Integration of external resources is problematic
Projecting observed empirical distributions may lead to overfitting data
Lexical information (e.g. WordNet) does not have a clear probabilistic interpretation
Soft Features (Jia Cui) seem a promising model
Better use of the context:
Design and derivation of class-based contextual features (David Guthrie)
Existing lexical resources provide large scale and effective information for bootstrapping
120. A Final thought Thanks to the Johns Hopkins faculty and staff for their availability and helpfulness during the workshop.
Special thanks to Fred Jelinek for answering endless questions about maximum entropy and helping to model our problem.