Predicting the Semantic Orientation of Adjectives Vasileios Hatzivassiloglou and Kathleen R. McKeown Presenter: Gabriel Nicolae
Introduction • Orientation/polarity = direction of deviation from the norm • Nearly synonymous: simple vs. simplistic • Antonyms: hot vs. cold
Introduction • In linguistic constructs such as conjunctions, the choice of arguments and the choice of connective are mutually constrained. • Example: The tax proposal was (a) simple and well-received, (b) simplistic but well-received, (c) *simplistic and well-received by the public.
Goals • Automatically identify antonyms • Distinguish near synonyms How? • by retrieving semantic orientation information from indirect evidence collected from a large corpus Why? • dictionaries and similar sources (thesauri, WordNet) do not explicitly include semantic orientation information • links between antonyms and synonyms are missing when they depend on the domain of the discourse
Overview of their approach • Correlation between indicators and semantic orientation • direct indicators: affixes (in-, un-) • mostly negatives • exceptions: independent, unbiased • indirect indicators: conjunctions • conjoined adjectives usually are of the same orientation for most connectives • the situation is reversed for but • from corpus: fair and legitimate, corrupt and brutal • vs. semantically anomalous: fair and brutal, corrupt and legitimate
General algorithm • Extract conjunctions of adjectives and morphological relations • Label each pair of conjoined adjectives as being of the same or different orientation using a log-linear regression model • Separate adjectives into two subsets of different orientation using a clustering algorithm • The group with the higher average frequency is labeled as positive
Data collection • Corpus: 1987 Wall Street Journal, 21 million words • Training data: a set of adjectives with predetermined (hand-annotated) orientation labels (+ or -) • 1,336 adjectives (657 +, 679 -) • The training set was validated by four other people • 500 adjectives: 89.15% agreement • Test data: • 15,048 conjunction tokens • 9,296 distinct pairs of conjoined adjectives (types)
Data collection (cont.) • Each conjunction token is classified according to three variables: • conjunction used • and, or, but, either-or, neither-nor • type of modification • attributive, predicative, appositive, resultative • number of the modified noun • singular, plural
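The three classification variables above can be sketched as a small record type. This is an illustrative Python sketch only, not the authors' code: the names `ConjunctionToken` and `count_by_category` are hypothetical, and the sample tokens are made up. The attribute values are the ones listed on the slide.

```python
from collections import Counter
from dataclasses import dataclass

# Attribute values as listed on the slide; the identifiers are hypothetical.
CONJUNCTIONS = ("and", "or", "but", "either-or", "neither-nor")
MODIFICATIONS = ("attributive", "predicative", "appositive", "resultative")
NUMBERS = ("singular", "plural")


@dataclass(frozen=True)
class ConjunctionToken:
    adj1: str
    adj2: str
    conjunction: str
    modification: str
    number: str

    def __post_init__(self):
        # Reject values outside the three attribute sets.
        if self.conjunction not in CONJUNCTIONS:
            raise ValueError(f"unknown conjunction: {self.conjunction}")
        if self.modification not in MODIFICATIONS:
            raise ValueError(f"unknown modification: {self.modification}")
        if self.number not in NUMBERS:
            raise ValueError(f"unknown number: {self.number}")

    @property
    def pair(self):
        # Order-independent key, so (fair, legitimate) == (legitimate, fair).
        return tuple(sorted((self.adj1, self.adj2)))

    @property
    def category(self):
        return (self.conjunction, self.modification, self.number)


def count_by_category(tokens):
    """Map each adjective pair (type) to a Counter over its token categories."""
    counts = {}
    for tok in tokens:
        counts.setdefault(tok.pair, Counter())[tok.category] += 1
    return counts


# Made-up tokens, for illustration only.
tokens = [
    ConjunctionToken("fair", "legitimate", "and", "attributive", "singular"),
    ConjunctionToken("legitimate", "fair", "and", "predicative", "plural"),
    ConjunctionToken("simplistic", "well-received", "but", "predicative", "singular"),
]
counts = count_by_category(tokens)

# Number of fully cross-classified cells: 5 * 4 * 2.
n_cells = len(CONJUNCTIONS) * len(MODIFICATIONS) * len(NUMBERS)
```

Keying the counts by an order-independent pair reflects the token/type distinction from the data collection slide: several tokens can collapse into one distinct pair.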
Validation of the conjunction hypothesis Results • Their conjunction hypothesis is validated overall and for almost all individual cases • There are small differences in the behavior of conjunctions between linguistic environments (as represented by the three attributes) • Conjoined antonyms appear far more frequently than expected by chance in conjunctions other than but
Prediction of link type • Baseline 1: always guessing that a link is of the same orientation type => 77.84% accuracy • Baseline 2: Baseline 1 + but exhibits the opposite pattern => 80.82% accuracy • Morphological relationships: • Adjectives related in form almost always have different semantic orientations • Highly accurate (97.06%), but applies only to 1,336 labeled adjectives (891,780 possible pairs) • E.g. adequate-inadequate, thoughtful-thoughtless • Baseline 1 + Morphology => 78.86% accuracy • Baseline 2 + Morphology => 81.75% accuracy
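The two baselines above are simple enough to write down directly. The following is a minimal Python sketch under stated assumptions: the function names and the toy `links` list are hypothetical and are not the paper's data, so the accuracies it produces are unrelated to the reported 77.84% and 80.82%. The last line checks the pair count quoted on the slide, which is the number of unordered pairs among 1,336 adjectives.

```python
from math import comb


def baseline1(conjunction):
    # Baseline 1: every link is assumed to be a same-orientation link.
    return "same"


def baseline2(conjunction):
    # Baseline 2: as Baseline 1, except that "but" signals different orientation.
    return "different" if conjunction == "but" else "same"


def accuracy(predict, links):
    """Fraction of (conjunction, gold_label) links predicted correctly."""
    correct = sum(predict(conj) == gold for conj, gold in links)
    return correct / len(links)


# Toy links, made up for illustration: (conjunction, gold link type).
links = [
    ("and", "same"),
    ("and", "same"),
    ("or", "same"),
    ("and", "different"),
    ("but", "different"),
    ("but", "different"),
    ("but", "same"),
]

acc1 = accuracy(baseline1, links)
acc2 = accuracy(baseline2, links)

# Unordered pairs among the 1,336 labeled adjectives: C(1336, 2).
labeled_pairs = comb(1336, 2)
```

On this toy list Baseline 2 scores higher than Baseline 1 only because the made-up "but" links are mostly of different orientation, which mirrors the direction of the effect reported on the slide.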
Prediction of link type (cont.) • Log-linear regression model • x: the vector of the observed counts in the various conjunction categories • w: the vector of weights to be learned • y: the response of the system • Using the method of iterative stepwise refinement, they selected 9 predictor variables from all 90 possible predictor variables. • Small improvement: 80.97% accuracy (82.05% accuracy using Morphology), but now each prediction is rated between 0 and 1
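The model formula itself is not present in the slide text. A common form that fits the definitions of x, w, and y above, and that yields a response between 0 and 1, is the logistic one: eta = w · x and y = e^eta / (1 + e^eta). The Python sketch below assumes that form. The function name, the optional bias term, the three predictors, and all numbers are hypothetical; reading y as the predicted likelihood of a same-orientation link, and 1 - y as a dissimilarity, is also an assumption chosen to match the clustering slide.

```python
import math


def predict_same_orientation(w, x, bias=0.0):
    """Logistic response y in (0, 1) for weights w and observed counts x.

    Assumed form: eta = bias + w . x, y = e^eta / (1 + e^eta).
    """
    eta = bias + sum(wi * xi for wi, xi in zip(w, x))
    # Evaluate the logistic function without overflowing for large |eta|.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


# Hypothetical predictors for one adjective pair (counts per category),
# e.g. [tokens joined by "and", tokens joined by "but", all other tokens].
x = [4, 1, 0]
# Made-up weights; in the paper these are learned from the training data.
w = [0.6, -1.5, 0.2]

y = predict_same_orientation(w, x)

# Assumed conversion to the dissimilarity used as clustering input.
dissimilarity = 1.0 - y
```

With these made-up numbers the "and" evidence outweighs the single "but" token, so y lands above 0.5 and the pair gets a small dissimilarity.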
Clustering • Input: a graph of adjectives connected by dissimilarity links • Small dissimilarity value => same-orientation link • High dissimilarity value => different-orientation link • Method used: apply an iterative optimization procedure on each connected component, based on the exchange method, a non-hierarchical clustering algorithm • Idea: find the partition P such that the objective function Φ is minimized
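The definition of Φ is not present in the slide text. The Python sketch below assumes an objective of the form Φ(P) = sum over the two clusters C of (1/|C|) times the sum of d(x, y) over distinct pairs x, y in C, that is, within-cluster dissimilarity scaled by cluster size. It also assumes a neutral dissimilarity of 0.5 for pairs with no link. The exchange step moves, at each iteration, the single word whose move lowers Φ the most and stops when no move helps, so it reaches a local minimum only. Function names and the toy dissimilarities are hypothetical.

```python
from itertools import combinations

NEUTRAL = 0.5  # assumed dissimilarity for pairs that have no link


def objective(clusters, d):
    """Assumed Phi: per-cluster sum of pairwise dissimilarities / cluster size."""
    total = 0.0
    for c in clusters:
        if c:
            pair_sum = sum(d.get(frozenset(p), NEUTRAL)
                           for p in combinations(sorted(c), 2))
            total += pair_sum / len(c)
    return total


def exchange_partition(words, d, initial):
    """Greedy exchange: repeatedly apply the single best improving move."""
    a = set(initial)
    b = set(words) - a
    best = objective((a, b), d)
    while True:
        best_move = None
        for w in words:
            src, dst = (a, b) if w in a else (b, a)
            if len(src) == 1:
                continue  # keep both clusters non-empty
            # Try the move, score it, then undo it.
            src.remove(w)
            dst.add(w)
            score = objective((a, b), d)
            dst.remove(w)
            src.add(w)
            if score < best - 1e-12 and (best_move is None or score < best_move[0]):
                best_move = (score, w)
        if best_move is None:
            return a, b
        best, w = best_move
        src, dst = (a, b) if w in a else (b, a)
        src.remove(w)
        dst.add(w)


# Toy graph, made-up dissimilarities (small = same orientation).
words = ["fair", "legitimate", "corrupt", "brutal"]
d = {
    frozenset(("fair", "legitimate")): 0.1,
    frozenset(("corrupt", "brutal")): 0.1,
    frozenset(("fair", "brutal")): 0.9,
    frozenset(("corrupt", "legitimate")): 0.9,
}
group1, group2 = exchange_partition(words, d, initial={"fair", "brutal"})
```

Because the result depends on the starting partition, a practical run would usually repeat the procedure from several random starts and keep the partition with the lowest Φ.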
Labeling the clusters as + or - • In oppositions of gradable adjectives where one member is semantically unmarked, the unmarked member is the more frequent one about 81% of the time • Unmarked => positive orientation almost always • So, label as positive the group that has the higher average word frequency.
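The labeling rule above amounts to comparing two averages. A minimal Python sketch follows, assuming corpus frequencies are available as a word-to-count mapping; the function name `label_clusters` and the frequency values are hypothetical.

```python
def label_clusters(cluster_a, cluster_b, freq):
    """Label the cluster with the higher average word frequency as positive."""

    def mean_freq(cluster):
        # Words missing from the frequency table count as zero occurrences.
        return sum(freq.get(w, 0) for w in cluster) / len(cluster)

    if mean_freq(cluster_a) >= mean_freq(cluster_b):
        return {"positive": set(cluster_a), "negative": set(cluster_b)}
    return {"positive": set(cluster_b), "negative": set(cluster_a)}


# Made-up corpus frequencies, for illustration only.
freq = {"fair": 120, "legitimate": 80, "corrupt": 40, "brutal": 30}

labels = label_clusters({"corrupt", "brutal"}, {"fair", "legitimate"}, freq)
```

The rule is a heuristic: it relies on the frequency tendency stated on the slide, so it can mislabel a component in which the negative words happen to be more frequent.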
Graph connectivity and performance • They tested how graph connectivity affects the overall performance