Multi-Label Classification of Yelp Reviews: Insights and Approaches

CS 277 Data Mining Project Presentation
Instructor: Prof. Dave Newman
Team: Hitesh Sajnani, Vaibhav Saini, Kusum Kumar
Donald Bren School of Information and Computer Science
University of California, Irvine

Problem Statement • Classify a given Yelp review text into one or more relevant categories

Dataset
• Reviews
  • Reviews from the Food and Restaurant category
  • # Useful votes > 1
  • Total 10,000 reviews
• Classification categories
  • Identified categories using a sample set of 400 random reviews
  • Refined categories using 200 more reviews
  • Final categories: 5
    • Food, Ambience, Service, Deals/Discounts, Worthiness

Data Annotation
• 10,000 reviews divided into 5 bins (w/ repetition)
• 6 researchers manually annotated reviews
• 225 man-hours of work!
• Discrepancy in 981 ambiguous reviews -- removed from analysis
• Total 9,019 reviews: split into 80% train and 20% test

Features – unigrams/bigrams/trigrams
• Total 703 textual features: 375 unigrams, 208 bigrams, 120 trigrams
• [Table: frequency of unigrams/bigrams/trigrams]
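The n-gram frequency features above can be illustrated with a minimal sketch. It assumes lowercasing and whitespace tokenization, which the slides do not specify, and the three vocabulary entries are hypothetical stand-ins for the 703 selected n-grams.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=3):
    """Count unigrams, bigrams and trigrams (assumed tokenization: lowercase + split)."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def to_feature_vector(text, vocabulary):
    """Project the n-gram counts of one review onto a fixed vocabulary."""
    counts = ngram_counts(text)
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary entries, for illustration only.
vocabulary = ["great", "great food", "happy hour deal"]
vector = to_feature_vector("Great food and a great happy hour deal", vocabulary)
```

Each review becomes a fixed-length vector with one frequency per selected n-gram, so the same vocabulary can be reused for the train and test splits.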

Features – User ratings • 3 nominal features: Good, Moderate, Bad

Approach • Reviews can be classified into more than one category • Not a binary classification problem: it is a multi-label classification problem!

Binary classifiers for each category
• Learns one binary classifier for each category
• Output is the union of predictions of all binary classifiers
• [Figure: original dataset and transformed datasets]
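The per-category transformation can be sketched as follows. This is an illustrative outline, not the team's code: the function names are hypothetical, and the four category names are taken from the later slide on subset classifiers.

```python
CATEGORIES = ["Food", "Service", "Ambience", "Deals"]


def binary_relevance_datasets(reviews, label_sets, categories=CATEGORIES):
    """Build one binary dataset per category: a list of (review, is_positive) pairs."""
    return {
        category: [(review, category in labels)
                   for review, labels in zip(reviews, label_sets)]
        for category in categories
    }


def union_of_predictions(binary_predictions):
    """Combine the yes/no outputs of all binary classifiers into one label set."""
    return {category for category, is_positive in binary_predictions.items()
            if is_positive}


reviews = ["review 1", "review 2"]
label_sets = [{"Food", "Deals"}, {"Ambience"}]
datasets = binary_relevance_datasets(reviews, label_sets)

# Pretend outputs of four trained binary classifiers for a single review.
predicted = union_of_predictions(
    {"Food": True, "Service": False, "Ambience": True, "Deals": False}
)
```

Any binary learner can then be trained on each transformed dataset independently of the others.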

Classifier for each subset of categories
• Categories = {Food, Service, Ambience, Deals}
• We consider each different “subset of categories” as a single category and learn a multi-class classifier

Transformed dataset
Reviews     Categories
Review 1    “1001”
Review 2    “0011”
Review 3    “1000”
Review 4    “0111”
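A small sketch of how label sets might be turned into codes such as “1001”. The bit order (Food, Service, Ambience, Deals) is an assumption based on the order in which the categories are listed; the slides do not state it.

```python
# Assumed bit order: position 0 = Food, 1 = Service, 2 = Ambience, 3 = Deals.
CATEGORIES = ["Food", "Service", "Ambience", "Deals"]


def encode_label_set(labels, categories=CATEGORIES):
    """Turn a set of categories into a bit string used as a single class label."""
    return "".join("1" if category in labels else "0" for category in categories)


def decode_label_set(code, categories=CATEGORIES):
    """Turn a predicted bit string back into the set of categories it stands for."""
    return {category for category, bit in zip(categories, code) if bit == "1"}


code = encode_label_set({"Food", "Deals"})
labels = decode_label_set("0111")
```

Under this assumed order, “1001” stands for a review labeled Food and Deals. A multi-class classifier trained on these codes can only predict label combinations that appear in the training data.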

Ensemble of subset classifiers
• Train one classifier per subset of categories; each classifier predicts only the categories in its subset
  • Classifier 1 for (Food, Service)
  • Classifier 2 for (Food, Ambience)
  • Classifier 3 for (Food, Deals)
  • Classifier 4 for (Service, Ambience)
  • Classifier 5 for (Service, Deals)
  • Classifier 6 for (Ambience, Deals)
• Total 6 classifiers for subsets of size 2 out of 4 categories (4C2 = 6)

Ensemble of classifiers: Prediction • Ask each classifier to vote!

Ensemble of classifiers: Prediction • Final prediction: Majority vote (>= 2 classifiers)
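The voting step can be sketched as below. The per-pair predictions are made-up inputs standing in for the outputs of six trained classifiers, and `majority_vote` is a hypothetical helper name; only the pair structure and the threshold of at least 2 votes come from the slides.

```python
from collections import Counter
from itertools import combinations

CATEGORIES = ["Food", "Service", "Ambience", "Deals"]
PAIRS = list(combinations(CATEGORIES, 2))  # the 6 category pairs (4C2)


def majority_vote(pair_predictions, min_votes=2):
    """Keep every category that at least `min_votes` pair classifiers voted for.

    `pair_predictions` maps each category pair to the subset of that pair
    which its classifier predicted for one review.
    """
    votes = Counter()
    for pair, predicted in pair_predictions.items():
        # A classifier may only vote for categories in its own pair.
        votes.update(set(predicted) & set(pair))
    return {category for category, count in votes.items() if count >= min_votes}


# Made-up classifier outputs for a single review.
pair_predictions = {
    ("Food", "Service"): {"Food"},
    ("Food", "Ambience"): {"Food", "Ambience"},
    ("Food", "Deals"): {"Food"},
    ("Service", "Ambience"): set(),
    ("Service", "Deals"): {"Service"},
    ("Ambience", "Deals"): {"Ambience"},
}
final_labels = majority_vote(pair_predictions)
```

Each category appears in 3 of the 6 pairs, so a threshold of 2 votes corresponds to a majority of the classifiers that can vote for it.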

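Evaluation measures
Notations:
• Let (x, Y) be a multi-label example, Y ⊆ L, where L is the set of all categories
• Let h be a multi-label classifier
• Let Z = h(x) be the set of labels predicted by h for (x, Y)
• Precision: [formula not preserved in the transcript]
• Recall: [formula not preserved in the transcript]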
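A sketch of how precision and recall are commonly computed in this notation. The definitions used here, |Y ∩ Z| / |Z| for precision and |Y ∩ Z| / |Y| for recall, averaged over examples, are the usual example-based ones and are an assumption; the slide's own formulas are not available. Scoring an empty Z or Y as 0 is likewise an assumed convention.

```python
def example_precision(true_labels, predicted_labels):
    """|Y ∩ Z| / |Z| for one example; 0.0 if nothing was predicted (assumed convention)."""
    if not predicted_labels:
        return 0.0
    return len(true_labels & predicted_labels) / len(predicted_labels)


def example_recall(true_labels, predicted_labels):
    """|Y ∩ Z| / |Y| for one example; 0.0 if the example has no labels (assumed convention)."""
    if not true_labels:
        return 0.0
    return len(true_labels & predicted_labels) / len(true_labels)


def average(metric, examples):
    """Average a per-example metric over (Y, Z) pairs."""
    return sum(metric(Y, Z) for Y, Z in examples) / len(examples)


# Two made-up (Y, Z) pairs: true label set and predicted label set.
examples = [
    ({"Food", "Service"}, {"Food"}),
    ({"Deals"}, {"Deals", "Ambience"}),
]
precision = average(example_precision, examples)
recall = average(example_recall, examples)
```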

Precision & Recall (Train)

Precision & Recall (Test)

Observation 1: Ensemble gave the best results

Observation 2: Data Skew • Normalized skew in training data by adding selective data
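One way the skew could be measured before adding data is to count how often each category occurs in the training labels. This is only a sketch: the slides do not describe how the skew was detected or which data was added, and the 0.5 ratio threshold and helper names are hypothetical.

```python
from collections import Counter


def label_counts(label_sets):
    """Count how many training reviews carry each category."""
    counts = Counter()
    for labels in label_sets:
        counts.update(labels)
    return counts


def underrepresented(counts, ratio=0.5):
    """Categories with fewer than `ratio` times the count of the most frequent one.

    The ratio threshold is a hypothetical choice for illustration.
    """
    top = max(counts.values())
    return sorted(category for category, n in counts.items() if n < ratio * top)


# Made-up training labels, heavily skewed toward Food.
train_labels = [{"Food"}, {"Food", "Service"}, {"Food", "Deals"}, {"Food", "Service"}]
counts = label_counts(train_labels)
rare = underrepresented(counts)
```

Categories flagged this way would be candidates for additional training reviews.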

Precision & Recall (w & w/o category normalization)

Thanks! Check out our Yelp submission: http://www.ics.uci.edu/~vpsaini/ Feedback welcome!
