
Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon


Presentation Transcript


  1. Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon

  2. Problem • Semi-supervised sarcasm identification using SASI (Semi-supervised Algorithm for Sarcasm Identification) • Sarcasm: the activity of saying or writing the opposite of what you mean, or of speaking in a way intended to make someone else feel stupid or to show them that you are angry

  3. Datasets • Twitter Dataset: • Tweets are 140 characters or fewer • Tweets can contain URLs, references to other tweeters (@<user>), or hashtags (#<tag>) • Slang, abbreviations, and emoticons are common • 5.9 million tweets • 14.2 words per tweet on average • 18.9% include a URL, 35.3% contain a @<user> reference • 6.9% contain one or more hashtags

  4. Datasets • Amazon Dataset: • 66,000 reviews of 120 products • 953 characters per review on average • Usually structured and grammatical • Reviews have fields including writer, date, rating, and summary • Amazon reviews offer a great deal of context compared to tweets

  5. Classification • The algorithm is semi-supervised • Seeded with a small group of labeled sentences • The seed is annotated with a sarcasm ranking in [1,5] (1 = not sarcastic, 5 = clearly sarcastic) • Syntactic and pattern-based features are used to build a classifier

  6. Data Preprocessing • Specific information was replaced with general tags to facilitate pattern matching: • ‘[PRODUCT]’, ‘[COMPANY]’, ‘[TITLE]’, ‘[AUTHOR]’, ‘[USER]’, ‘[LINK]’, and ‘[HASHTAG]’ • All HTML tags were removed
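A minimal sketch of this preprocessing step for tweets. The tag names come from the slide, but the regular expressions and sample input are illustrative assumptions; entity tags such as [PRODUCT] or [COMPANY] would additionally require lists of product and company names.

```python
import re

def preprocess(text):
    """Replace tweet-specific tokens with general tags and strip HTML."""
    text = re.sub(r"https?://\S+", "[LINK]", text)   # URLs
    text = re.sub(r"@\w+", "[USER]", text)           # @<user> references
    text = re.sub(r"#\w+", "[HASHTAG]", text)        # #<tag> hashtags
    text = re.sub(r"<[^>]+>", "", text)              # remove HTML tags
    return text

print(preprocess('Great phone! <b>Thanks</b> @apple http://t.co/x #fail'))
# -> 'Great phone! Thanks [USER] [LINK] [HASHTAG]'
```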

  7. Pattern Extraction and Selection • Words are classified into high-frequency words (HFWs) and content words (CWs) • A pattern is an ordered sequence of HFWs and slots for CWs • Example: “[COMPANY] CW does not CW much” • Generated patterns were removed if they appeared in seed sentences with both ranking 1 and ranking 5 (i.e., patterns that fail to discriminate) • Patterns were also removed if they appeared only in reference to a single product
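A sketch of the HFW/CW split and pattern templating. The frequency thresholds (1,000 and 100 occurrences per million words) are illustrative assumptions, and tag tokens like [COMPANY] would typically be forced into the HFW set regardless of frequency.

```python
from collections import Counter

# Frequency thresholds in occurrences per million words; these exact
# values are assumptions for illustration and should be tuned.
F_HIGH = 1000
F_LOW = 100

def word_classes(corpus_tokens):
    """Split the vocabulary into high-frequency words (HFWs) and
    content words (CWs) by corpus frequency."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    per_million = {w: c * 1_000_000 / total for w, c in counts.items()}
    hfws = {w for w, f in per_million.items() if f > F_HIGH}
    cws = {w for w, f in per_million.items() if f < F_LOW}
    return hfws, cws

def to_pattern(tokens, hfws):
    """Render a sentence as a pattern string: HFWs are kept verbatim,
    all other words become generic 'CW' slots."""
    return " ".join(w if w in hfws else "CW" for w in tokens)

# e.g. to_pattern("[COMPANY] phone does not impress much".split(), hfws)
# -> "[COMPANY] CW does not CW much"  (given suitable corpus frequencies)
```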

  8. Pattern Matching
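Following the description in the original paper, each selected pattern contributes a real-valued feature per sentence: 1 for an exact match, a small constant α for a sparse match (all components appear in order with extra words in between), γ·n/N for an incomplete match of n out of N components, and 0 otherwise. A simplified sketch, with the α and γ values chosen purely for illustration:

```python
# Match-score weights; the paper uses small constants for partial
# matches -- the exact values here are assumptions for illustration.
ALPHA = 0.1   # sparse match: all components present, in order, with gaps
GAMMA = 0.1   # incomplete match: only some components present

def match_score(pattern, sentence):
    """Score a sentence against a pattern in the spirit of the paper's
    exact / sparse / incomplete matching. 'CW' slots match any one word."""
    p, s = pattern.split(), sentence.split()

    def matches(pw, sw):
        return pw == "CW" or pw == sw

    # Exact match: the pattern aligns with a contiguous sentence window.
    for i in range(len(s) - len(p) + 1):
        if all(matches(pw, sw) for pw, sw in zip(p, s[i:i + len(p)])):
            return 1.0

    # Greedy in-order alignment, allowing extra words between components.
    n, j = 0, 0
    for pw in p:
        while j < len(s) and not matches(pw, s[j]):
            j += 1
        if j < len(s):
            n, j = n + 1, j + 1

    if n == len(p):
        return ALPHA                            # sparse match
    return GAMMA * n / len(p) if n else 0.0     # incomplete or no match

print(match_score("[COMPANY] CW does not CW much",
                  "[COMPANY] phone really does not impress much"))  # -> 0.1
```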

  9. Other Features • (1) Sentence length in words • (2) Number of “!” characters in the sentence • (3) Number of “?” characters in the sentence • (4) Number of quotes in the sentence • (5) Number of capitalized/all-capitals words in the sentence
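These five features are straightforward to compute; a sketch, where the quote-pairing and capitalization tests are simple assumptions:

```python
def punctuation_features(sentence):
    """Compute the five surface features listed on the slide."""
    words = sentence.split()
    return {
        "length_in_words": len(words),
        "exclamations": sentence.count("!"),
        "questions": sentence.count("?"),
        "quotes": sentence.count('"') // 2,   # paired double quotes
        "capitalized": sum(1 for w in words if w.isupper() or w.istitle()),
    }

print(punctuation_features('WOW, this "amazing" product broke in a day!!!'))
# -> {'length_in_words': 8, 'exclamations': 3, 'questions': 0,
#     'quotes': 1, 'capitalized': 1}
```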

  10. Data Enrichment • Assumption: sentences near a sarcastic sentence are similarly sarcastic • Using the Amazon seed set, perform a Yahoo search for text snippets containing the seed sentences; add the surrounding sentences to the training set, annotated with the same ranking as the seed sentence that retrieved them

  11. Classification • Similar to kNN • The score for a new instance is the weighted average of the labels of its k nearest training set vectors, measured using Euclidean distance
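A minimal sketch of this kNN-like scoring. The inverse-distance weighting is an assumption; the slide only specifies a weighted average over the k nearest vectors.

```python
import numpy as np

def knn_score(x, train_vectors, train_labels, k=5):
    """Score a new feature vector as a weighted average of the labels of
    its k nearest training vectors under Euclidean distance."""
    dists = np.linalg.norm(train_vectors - x, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-9)   # closer neighbors weigh more
    return float(np.average(train_labels[nearest], weights=weights))

# Usage: labels are sarcasm rankings in [1, 5], like the seed annotations.
X = np.array([[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [0.0, 1.0, 0.9]])
y = np.array([5.0, 4.0, 1.0])
print(knn_score(np.array([0.8, 0.2, 0.1]), X, y, k=2))  # ~4.4
```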

  12. Baseline • Assume sarcasm implies saying the opposite of what you mean • Identify reviews with few stars and decide that sarcasm is present if strongly positive words appear in the review
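A sketch of this baseline; the sentiment lexicon and the star-rating cutoff are illustrative assumptions, since neither is given in the transcript.

```python
# Illustrative lexicon of strongly positive words.
POSITIVE_WORDS = {"amazing", "great", "excellent", "perfect", "wonderful", "best"}

def baseline_is_sarcastic(review_text, star_rating, max_stars=2):
    """Naive baseline: a low-rated review is flagged as sarcastic if it
    contains strongly positive words."""
    tokens = {w.strip('.,!?";:').lower() for w in review_text.split()}
    return star_rating <= max_stars and bool(tokens & POSITIVE_WORDS)

print(baseline_is_sarcastic("Best. Purchase. Ever. It broke after a day.", 1))  # True
```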

  13. Training Sets • Amazon: • 80 positive and 505 negative examples • (expanded to 471/5,020 by data enrichment) • Twitter: • 1,500 tweets tagged with the #sarcasm hashtag (noisy labels) • Because the hashtag labels were unreliable, positive examples were taken from the Amazon dataset instead, and negative examples were manually selected from the Twitter dataset

  14. Test Sets • 90 positive and 90 negative examples each for Amazon and Twitter • Only sentences containing a named entity or a reference to one were sampled (such sentences are more likely to express sentiment, and hence to be relevant) • Non-sarcastic sentences were drawn only from negative reviews, increasing the chance that they contain negative sentiment • Mechanical Turk was used to create a gold standard for the test set; each sentence was annotated by 3 annotators

  15. Inter-Annotator Agreement • Amazon: κ = 0.34 • Twitter: κ = 0.41 • The higher agreement on Twitter is attributed to the lack of context in the medium, which forces sarcasm to be explicit
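For reference, κ here is a chance-corrected agreement statistic; with three annotators per sentence, a Fleiss-style kappa applies. A sketch (the paper's exact kappa variant is not stated in the transcript):

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa for an (items x categories) matrix of annotator
    counts: ratings[i][j] = how many annotators put item i in category j.
    Assumes the same number of annotators for every item."""
    ratings = np.asarray(ratings, dtype=float)
    n = ratings.sum(axis=1)[0]                         # annotators per item
    p_i = (np.sum(ratings ** 2, axis=1) - n) / (n * (n - 1))
    p_o = p_i.mean()                                   # observed agreement
    p_e = np.sum((ratings.sum(axis=0) / ratings.sum()) ** 2)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Two categories (sarcastic / not sarcastic), 3 annotators per sentence:
print(fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2], [3, 0]]))  # ~0.44
```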

  16. Tables

  17. Baseline Intuitions • The baseline has high precision but low recall • It cannot recognize subtly sarcastic sentences • These results imply that the definition “saying the opposite of what you mean” is not a good indicator of sarcasm

  18. Reasons for Good Twitter Results • Robustness of sparse and incomplete pattern matching • SASI learns a model with a feature space spanning over 300 dimensions • Sarcasm may be easier to detect in tweets because tweeters have to go out of their way to make sarcasm explicit in an environment with no context

  19. Notes • #sarcasm tags were unreliable • Punctuation marks were the weakest predictors, in contrast to the findings of Tepperman et al. (2006) • The exception to this is the use of ellipses, which was a strong predictor in combination with other features
