
Attacking the Data Sparseness Problem



  1. Attacking the Data Sparseness Problem. Team: Louise Guthrie, Roberto Basili, Fabio Zanzotto, Hamish Cunningham, Kalina Bontcheva, Jia Cui, Klaus Macherey, David Guthrie, Martin Holub, Marco Cammisa, Cassia Martin, Jerry Liu, Kris Haralambiev, Fred Jelinek

  2. Motivation for the project. Texts for text extraction contain sentences like: The IRA bombed a family owned shop in Belfast yesterday. FMLN set off a series of explosions in central Bogota today.

  3. Motivation for the project. We’d like to automatically recognize that both are of the form: The IRA [ORGANIZATION] bombed a family owned shop [ATTACKED] in Belfast [LOCATION] yesterday [DATE]. FMLN [ORGANIZATION] set off a series of explosions [ATTACKED] in central Bogota [LOCATION] today [DATE].

  4. Our Hypotheses • A transformation of a corpus to replace words and phrases with coarse semantic categories will help overcome the data sparseness problem encountered in language modeling and text extraction • Semantic category information might also help improve machine translation • An initially noun-centric approach will allow bootstrapping for other syntactic categories

  5. A six week goal – Labeling noun phrases • Astronauts aboard the space shuttle Endeavor were forced to dodge a derelict Air Force satellite Friday • Humans aboard space_vehicle dodge satellite timeref.
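A minimal sketch of this kind of corpus transformation, assuming a hypothetical lookup table; the labels and phrase boundaries below are illustrative placeholders, not the project's actual tag set or taggers:

```python
# Sketch: replace annotated noun phrases with coarse semantic labels.
# The lookup table is a toy; in the project the labels come from the
# LDOCE-derived tag set and the taggers described in later slides.
CATEGORY = {
    "astronauts": "Human",
    "the space shuttle endeavor": "space_vehicle",
    "a derelict air force satellite": "satellite",
    "friday": "timeref",
}

def transform(sentence):
    out = sentence.lower()
    for phrase, label in CATEGORY.items():
        out = out.replace(phrase, label)
    return out

print(transform("Astronauts aboard the space shuttle Endeavor were forced "
                "to dodge a derelict Air Force satellite Friday"))
# -> "Human aboard space_vehicle were forced to dodge satellite timeref"
```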

  6. Preparing the data – Pre-Workshop • Identify a tag set • Create a human-annotated corpus • Create a double-annotated corpus • Process all data for named entity and noun phrase recognition using GATE tools (26 million words) • Parsed about 26 million words • Develop algorithms for mapping target categories to WordNet synsets to support the tag set assessment

  7. The Semantic Classes and the Corpus • A subset of classes available in the Longman Dictionary of Contemporary English (LDOCE), electronic version • Rationale: • The number of semantic classes was small • The classes are somewhat reliable, since they were used by a team of lexicographers to code noun senses, adjective preferences, and verb preferences • Many words have subject area information, which might be useful

  8.–11. The Semantic Classes (tree diagram, built up over four slides): the top level splits into Concrete and Abstract. Concrete divides into Animate and Inanimate; Animate covers Plant, Animal, and Human, with Animal subdivided into Female Anim. and Male Anim. and Human into Female and Male; Inanimate covers Solid, Liquid, and Gas, with Solid subdivided into Movable and Non-movable. Later builds add the Organic, Physical Qualities, and Collective classes.

  12. The human annotated statistics • Inter-annotator agreement is 94%, so that is the upper limit for our task • 214,446 total annotated noun phrases (262,683 including “None of the Above”) • 29,071 unique vocabulary items (unlemmatized) • 25 semantic categories (162 associated subject areas were identified) • The most frequent category, Abstract, accounts for 127,569 instances (59%)

  13. The experimental setup (diagram): BNC (Science, Politics, Business), 26 million words; within it, a human-annotated portion with semantic tags on noun phrases only – 220,000 instances, 2 million words.

  14. The main development set (dev): training – 113,000 instances; held-out – 85,000 instances; plus a blind portion. Machine learning is used to improve performance on the held-out data.

  15. A challenging development set for experiments on unseen words (the Hard data set): training – all unambiguous words, 125,000 instances; held-out – ambiguous words, 73,000 instances; plus a blind portion. Machine learning is used to improve performance on the held-out data.

  16. Our Experiments include: • Supervised Approaches (Learning from Human Annotated data) • Unsupervised approaches • Using outside evidence (the dictionary or wordnet) • Syntactic information from parsing or pattern matching • Context words, the use of preferences, the use of topical information

  17. Experiments on unseen words - Hard data set • Training corpus has only words with unambiguous annotations • 125,000 training instances • 73,000 instances held-out • Perplexity – 21 • Baseline – Accuracy 45% • Improvement – Accuracy 68.5 % • Context can contribute greatly in unsupervised experiments

  18. Results on the dev set • Random with some frequent ambiguous words moved into testing • 113,000 training instances • 85,000 instances held-out • Perplexity – 3.44 • Baseline – Accuracy 80% • Improvement – Accuracy 87 %

  19. The scheme for annotating the large corpus • After experimenting with the development sets, we need a scheme for making use of all of the dev corpus to tag the blind corpus • We developed an incremental scheme within the maximum entropy framework • Several talks have to do with re-estimation techniques useful to the bootstrapping process

  20. Terminology • Seen words – words seen in the human-annotated data (new instances of known words) • Unseen words – not in the training material, but in the dictionary • Novel words – neither in the training material nor in the dictionary/WordNet
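A minimal sketch of this three-way split; the vocabulary sets below are hypothetical placeholders for the project's training data and dictionary/WordNet resources:

```python
def word_type(word, training_vocab, dictionary_vocab):
    """Classify a word as 'seen', 'unseen', or 'novel' following the
    terminology above (the vocabulary sets are toy placeholders)."""
    if word in training_vocab:
        return "seen"        # new instance of a known word
    if word in dictionary_vocab:
        return "unseen"      # not in training, but in the dictionary/WordNet
    return "novel"           # in neither resource

training_vocab = {"crane", "club"}
dictionary_vocab = {"crane", "club", "medicine"}
print(word_type("medicine", training_vocab, dictionary_vocab))  # unseen
```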

  21.–25. Bootstrapping (diagram, built up over five slides): alongside the human-annotated data and the blind portion, the unannotated data falls into four types – unambiguous words (515,000 instances), words seen in training (550,000 instances), unseen words found in the dictionary (9,000 instances), and novel words (20,000 instances).

  26. Building the training material (diagram): the annotated (201K), unambiguous (515K), and seen (550K) instances are used for training; the unseen-but-in-dictionary (9K) and novel (20K) instances are tagged as test data. Unambiguous and annotated instances are marked as one-hot vectors <0, 0, ..., 0, 1>; seen instances are marked with the appropriate probabilities, e.g. a seen word w is marked <p(C1|w), ..., p(Cn|w)>.
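A minimal sketch of the two kinds of target vectors described above; the category list and counts are hypothetical toy values, not the project's 25-category tag set or its counts:

```python
# Sketch: target vectors for the bootstrapping training material.
# Unambiguous/annotated instances get a one-hot vector for their single
# category; seen words get the empirical distribution p(C|w) estimated
# from the human-annotated data. Categories and counts are toy values.
CATEGORIES = ["Abstract", "Human", "Movable", "Animal"]

def one_hot(category):
    return [1.0 if c == category else 0.0 for c in CATEGORIES]

def seen_word_vector(category_counts):
    """category_counts: dict mapping category -> count for one seen word."""
    total = sum(category_counts.values())
    return [category_counts.get(c, 0) / total for c in CATEGORIES]

print(one_hot("Human"))                               # [0.0, 1.0, 0.0, 0.0]
print(seen_word_vector({"Movable": 3, "Animal": 1}))  # [0.0, 0.0, 0.75, 0.25]
```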

  27. Results on the Blind Data • We set aside one tenth of the annotated corpus • Randomly selected within each of the domains • It contained 13,000 annotated instances • The baseline here was very high - 90% with simple techniques • We were able to achieve 93.5% accuracy

  28. Overview • Bag of words (Kalina) • Evaluation (Kris) • Supervised methods using maximum entropy (Klaus) • Incorporating context preferences (Jerry) • Experiments with Adjective Classes and Subject (David, Jia, Martin) • Structuring the context using syntax and semantics (Cassia, Fabio) • Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia) • Unsupervised Re-estimation (Roberto) • Student Proposals (Jia, Dave, Marco) • Conclusion

  29. Semantic Categories and MT • 10 test words – high, medium, and low frequency • Collected their target translations using EuroWordNet (e.g. Dutch) • Crane: • [lifts and moves heavy objects] – hijskraan, kraan • [large long-necked wading bird] - kraanvogel

  30. SemCats and MT (2) • Manually mapped synonym sets to semantic categories • automatic mapping will be presented later • Studied how many synonym sets are ruled out as translations by the semantic category

  31. Some Results • 3 words – full disambiguation • crane (Mov.Solid/Animal), medicine (Abstract/Liquid), plant (Plant/Solid) • 7 words – the categories substantially reduce the possible translations • club - [Abstr/an association of people...], [Mov.Solid/stout stick...], [Mov.Solid/an implement used by a golfer...], [Mov.Solid/a playing card...], [NonMov.Solid/a building …] • club/NonMov.Solid – [clubgebouw, clubhuis, …] • club/Abstr. – [bevolkingsgroep, broederschap, …] • club/Mov.Solid – [knots, kolf, malie], [kolf, malie], [club]
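A minimal sketch of ruling out translations by semantic category; the lookup below is a toy placeholder built from the crane example on the previous slide, not the automatic EuroWordNet mapping:

```python
# Sketch: filter a word's candidate translation sets by the semantic
# category assigned in context. The mapping is a toy placeholder.
TRANSLATIONS = {
    ("crane", "Mov.Solid"): ["hijskraan", "kraan"],
    ("crane", "Animal"): ["kraanvogel"],
}

def translations_for(word, category):
    return TRANSLATIONS.get((word, category), [])

print(translations_for("crane", "Animal"))  # ['kraanvogel']
```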

  32. The architecture • The “multiple-knowledge sources” WSD architecture (Stevenson 03) • Allows the use of multiple taggers and combines their results through a weighted function • Weights can be learned from a corpus • All taggers are implemented as GATE components and combined in applications
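A minimal sketch of the weighted-combination idea; the taggers, scores, and weights below are illustrative placeholders (the workshop's actual taggers are GATE components and the weights would be learned from a corpus):

```python
# Sketch: combine several taggers' category preferences with weights.
# Each tagger returns a score per category; the combiner takes a
# weighted sum and picks the highest-scoring category.
def combine(tagger_outputs, weights):
    """tagger_outputs: list of dicts category -> score, one per tagger.
    weights: one weight per tagger (e.g. learned from a corpus)."""
    combined = {}
    for scores, w in zip(tagger_outputs, weights):
        for category, s in scores.items():
            combined[category] = combined.get(category, 0.0) + w * s
    return max(combined, key=combined.get)

frequency_tagger = {"Abstract": 0.6, "Movable": 0.4}
bow_tagger = {"Abstract": 0.2, "Movable": 0.8}
print(combine([frequency_tagger, bow_tagger], weights=[0.4, 0.6]))  # Movable
```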

  33. The Bag-of-Words Tagger • The bag-of-words tagger is an Information Retrieval-inspired tagger with parameters: • Window size (default: 50) • Which POS to put in the content vectors (default: nouns and verbs) • Which similarity measure to use • Used in WSD (Leacock et al 92) • Crane/Animal={species, captivity, disease…} • Crane/Mov.Solid={worker, disaster, machinery…}
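A minimal sketch of building such a context bag with the window and POS parameters described above; the tokens and POS tags are hypothetical placeholders, not GATE output:

```python
# Sketch: collect context words for a target occurrence, keeping only the
# chosen POS (nouns and verbs by default) within a +/- window of tokens.
def context_vector(tokens, target_index, window=50, keep_pos=("NN", "VB")):
    """tokens: list of (word, pos) pairs; returns the bag of context words."""
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    bag = set()
    for i in range(start, end):
        if i == target_index:
            continue
        word, pos = tokens[i]
        if any(pos.startswith(p) for p in keep_pos):
            bag.add(word.lower())
    return bag

tokens = [("The", "DT"), ("crane", "NN"), ("lifted", "VBD"), ("heavy", "JJ"),
          ("machinery", "NN"), ("yesterday", "NN")]
print(sorted(context_vector(tokens, target_index=1, window=3)))
# ['lifted', 'machinery']
```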

  34. BoW classifier (2) • Seen words are classified by calculating the inner product between their context vector and the vectors for each possible category • Inner product calculated as: • Binary vectors – the number of matching terms • Weighted vectors: • Leacock’s measure – favour concepts that occur frequently in exactly one category • Take into account the polysemy of concepts in the vectors
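A minimal sketch of the binary-vector case (the inner product reduces to the number of matching terms); the category vectors are toy examples rather than the trained ones:

```python
# Sketch: score each category vector against a context bag by the size of
# the overlap (binary-vector inner product), then pick the best category.
CATEGORY_VECTORS = {
    "Animal": {"species", "captivity", "disease"},
    "Mov.Solid": {"worker", "disaster", "machinery"},
}

def classify(context_bag):
    scores = {cat: len(context_bag & vec)
              for cat, vec in CATEGORY_VECTORS.items()}
    return max(scores, key=scores.get), scores

print(classify({"machinery", "worker", "yesterday"}))
# ('Mov.Solid', {'Animal': 0, 'Mov.Solid': 2})
```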

  35. Current performance measures • The baseline frequency tagger on its own – 91% on the test (blind) set • Bag-of-words tagger on its own – 92.7% • Combined architecture – 93.2% (window size 50, using only nouns, binary vectors)

  36. Future work on the architecture • Integrate syntactic information, subject codes, and document topics • Experiment with cosine similarity • Implement [Yarowsky’92] WSD algorithm • Implement the weighted function module • Experiment with integrating the ME tools as one of the taggers supplying preferences for the weighting module

  37. Overview • Bag of words (Kalina) • Evaluation (Kris) • Supervised methods using maximum entropy (Klaus) • Incorporating context preferences (Jerry) • Experiments with Adjective Classes and Subject (David, Jia, Martin) • Structuring the context using syntax and semantics (Cassia, Fabio) • Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia) • Unsupervised Re-estimation (Roberto) • Student Proposals (Jia, Dave, Marco) • Conclusion

  38. Accuracy Measurements (Kris Haralambiev) • How to measure the accuracy • How to distinguish “correct”, “almost correct”, and “wrong”

  39. Exact Match Measurements • W = (w1, w2, …, wn) – vector of the annotated words • X = (x1, x2, …, xn) – categories assigned by the annotators • Y = (y1, y2, …, yn) – categories assigned by a program • Exact match (default) measurement – 1 for match and 0 for mismatch of each (xi,yi) pair: accuracy(X,Y) = |{i : xi = yi}|
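A minimal sketch of the exact-match measurement; the slide gives the raw match count, and dividing by n (as done here) turns it into a percentage:

```python
# Sketch: exact-match accuracy between annotator categories X and system
# categories Y, as a fraction of the n annotated words.
def exact_match_accuracy(X, Y):
    matches = sum(1 for x, y in zip(X, Y) if x == y)
    return matches / len(X)

print(exact_match_accuracy(["H", "T", "A"], ["H", "T", "M"]))  # 0.666...
```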

  40. The Hierarchy (diagram of the semantic classes with their one-letter codes): Abstract T, Concrete C, Animate Q, Inanimate I, PhysQual 4, Organic 5, Plant P, Animal A, Human H, Liquid L, Gas G, Solid S, Non-movable J, Movable N, plus the codes B, D, F, M for the animal and human female/male subclasses.

  41. Ancestor Relation Measurement (tree fragment: Animate Q → Human H, Animal A, with subclasses F and M) • The exact match will assign 0 for the pairs (H,M), (H,F), (A,Q), … • Give a partial score for two categories in an ancestor relation • weight(Cat) = |{i : xi ∈ tree with root Cat}| • score(xi, yi) = min(weight(xi)/weight(yi), weight(yi)/weight(xi)) • accuracy(X,Y) = Σi score(xi, yi)
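A minimal sketch of the ancestor-relation score; the hierarchy fragment and annotation list are toy assumptions, and weight(Cat) follows the slide's definition as the number of annotated instances falling in the subtree rooted at Cat:

```python
# Sketch: partial credit for two categories in an ancestor relation.
# PARENT is a toy fragment of the hierarchy (codes as on the slide);
# X is a toy list of annotator categories used to compute the weights.
PARENT = {"Q": None, "H": "Q", "A": "Q", "F": "H", "M": "H"}
X = ["H", "M", "F", "A", "Q"]

def _has_ancestor(cat, anc):
    p = PARENT[cat]
    while p is not None:
        if p == anc:
            return True
        p = PARENT[p]
    return False

def weight(cat):
    """Number of annotated instances in the subtree rooted at cat."""
    return sum(1 for x in X if x == cat or _has_ancestor(x, cat))

def score(x, y):
    if x == y:
        return 1.0
    if _has_ancestor(x, y) or _has_ancestor(y, x):
        return min(weight(x) / weight(y), weight(y) / weight(x))
    return 0.0  # no ancestor relation, e.g. (M, F)

print(score("H", "M"))  # partial credit (1/3 with these toy counts), not 0
```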

  42. Edge Distance Measurement (same tree fragment: Animate Q → Human H, Animal A, with subclasses F and M) • The ancestor relation will assign some score for pairs like (H,M), (A,Q), but will assign 0 for pairs like (M,F), (A,H) • Going further, we want to compute the similarity (distance) between X and Y • distance(xi, yi) = the length of the simple path from xi to yi • each edge can be given an individual length, or all edges have length 1 (we prefer the latter)

  43. Edge Distance Measurement (cont'd) • distance(X,Y) = Σi distance(xi, yi) • Accuracy scales linearly with distance: 100% corresponds to distance 0, 0% corresponds to max_possible_distance, and distance(X,Y) falls in between • max_possible_distance = Σi maxcat distance(xi, cat) • it might be reasonable to use the average instead of the max
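A minimal sketch of the edge-distance accuracy under those definitions, with unit edge lengths; the hierarchy fragment is the same hypothetical one as above:

```python
# Sketch: edge-distance accuracy with unit edge lengths, interpolated
# linearly between distance 0 (accuracy 100%) and max_possible_distance
# (accuracy 0%). PARENT is a toy hierarchy fragment.
PARENT = {"Q": None, "H": "Q", "A": "Q", "F": "H", "M": "H"}

def ancestors(cat):
    """cat followed by its chain of ancestors up to the root."""
    chain = [cat]
    while PARENT[chain[-1]] is not None:
        chain.append(PARENT[chain[-1]])
    return chain

def distance(x, y):
    """Length of the simple path between x and y in the tree."""
    ax, ay = ancestors(x), ancestors(y)
    lca = next(c for c in ax if c in ay)   # lowest common ancestor
    return ax.index(lca) + ay.index(lca)

def edge_distance_accuracy(X, Y):
    total = sum(distance(x, y) for x, y in zip(X, Y))
    max_possible = sum(max(distance(x, c) for c in PARENT) for x in X)
    return 1.0 - total / max_possible

print(edge_distance_accuracy(["H", "M"], ["M", "F"]))  # 0.4 on this toy tree
```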

  44. Some Baselines (table of baseline accuracies on the training + held-out data and on the blind data).

  45. Overview • Bag of words (Kalina) • Evaluation (Kris) • Supervised methods using maximum entropy (Klaus) • Incorporating context preferences (Jerry) • Experiments with Adjective Classes and Subject (David, Jia, Martin) • Structuring the context using syntax and semantics (Cassia, Fabio) • Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia) • Unsupervised Re-estimation (Roberto) • Student Proposals (Jia, Dave, Marco) • Conclusion

  46. Supervised Methods using Maximum Entropy Jia Cui, David Guthrie, Martin Holub, Jerry Liu, Klaus Macherey

  47. Overview • Maximum Entropy Approach • Feature Functions • Word Classes • Experimental Results

  48. Maximum Entropy Approach • Principle: • Define suitable features (constraints) on training data • Find the maximum entropy distribution that satisfies the constraints (GIS) • Properties: • Easy to integrate information from several knowledge sources • Always converges to the global optimum on training data • Usage of: YASMET toolkit (by F. J. Och) & JME (by J. Cui)
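For reference, a sketch of the standard conditional maximum-entropy model and the GIS update in textbook form; the slide does not spell these out, so the notation here is an assumption:

```latex
% Conditional maximum-entropy (log-linear) model over semantic categories c
% given a context x, with feature functions f_i and weights \lambda_i:
p_\lambda(c \mid x) = \frac{\exp\big(\sum_i \lambda_i f_i(x, c)\big)}
                           {\sum_{c'} \exp\big(\sum_i \lambda_i f_i(x, c')\big)}

% Generalized Iterative Scaling (GIS) update, where
% F = \max_{x,c} \sum_i f_i(x, c) is the feature-sum constant:
\lambda_i^{(t+1)} = \lambda_i^{(t)}
  + \frac{1}{F}\,\log\frac{E_{\tilde p}[f_i]}{E_{p_{\lambda^{(t)}}}[f_i]}
```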

  49. Feature Functions • Prior features • Use the unigram probabilities P(c) of the semantic categories c as a feature • Lexical features • Use the lexical information directly as a feature • Reduce the number of features by using the following definition
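A minimal sketch of what such feature functions might look like; the definition used to reduce the feature count is not shown in the transcript, so the log-prior and indicator forms below are illustrative assumptions only:

```python
import math

# Sketch: prior and lexical features for a maximum-entropy tagger.
# These forms are illustrative assumptions, not the project's definitions.
def prior_feature(category, unigram_p):
    """Prior feature: the (log) unigram probability of the category."""
    return math.log(unigram_p[category])

def lexical_feature(word, category, target_word, target_category):
    """Lexical indicator feature: fires when both word and category match."""
    return 1.0 if (word == target_word and category == target_category) else 0.0

unigram_p = {"Abstract": 0.59, "Human": 0.1}
print(prior_feature("Abstract", unigram_p))
print(lexical_feature("crane", "Animal", "crane", "Animal"))  # 1.0
```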
