This project focuses on classifying Yelp reviews into multiple relevant categories, including Food, Ambience, Service, and Deals/Discounts. We utilized a dataset of 10,000 restaurant reviews, which were manually annotated by a team of researchers. Our approach employs binary classifiers for each category, combined through an ensemble voting method. The results show that the ensemble method yields superior performance compared to single classifiers. We highlight the importance of addressing data skew and optimizing feature selection in our classification model.
CS 277 Data Mining Project Presentation • Instructor: Prof. Dave Newman • Team: Hitesh Sajnani, Vaibhav Saini, Kusum Kumar • Donald Bren School of Information and Computer Sciences, University of California, Irvine
Problem Statement • Classify a given Yelp review text into one or more relevant categories
Dataset • Reviews • Reviews from the Food and Restaurant category • # Useful votes > 1 • Total 10,000 reviews • Classification categories • Identified categories using a sample set of 400 random reviews • Refined categories using 200 more reviews • Final categories: 5 • Food, Ambience, Service, Deals/Discounts, Worthiness
Data Annotation • 10,000 reviews divided into 5 bins (with repetition) • 6 researchers manually annotated the reviews • 225 man-hours of work! • Annotators disagreed on 981 ambiguous reviews, which were removed from the analysis • Remaining 9,019 reviews split into 80% train and 20% test
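The arithmetic behind the split can be sketched as follows. This is a minimal illustration, assuming a simple shuffled split; the slides do not say how the 80/20 split was drawn, and `train_test_split` and the seed are hypothetical names and values introduced here.

```python
import random

def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of `items` and cut it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 10,000 annotated reviews minus the 981 ambiguous ones leaves 9,019.
review_ids = range(10_000 - 981)
train, test = train_test_split(review_ids)
```

With 9,019 reviews, this gives 7,215 training and 1,804 test reviews.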
Features – unigrams/bigrams/trigrams • Total 703 textual features • 375 unigrams, 208 bigrams, 120 trigrams • Frequency of unigrams/bigrams/trigrams
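A minimal sketch of n-gram frequency features, assuming lowercased whitespace tokenization (the slides do not describe the tokenizer). The three-term `vocabulary` is hypothetical; the actual 703 terms are not listed in the slides.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(text, max_n=3):
    """Count unigrams, bigrams and trigrams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

def featurize(text, vocabulary):
    """Project a review onto a fixed n-gram vocabulary (one frequency per term)."""
    counts = ngram_counts(text)
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary for illustration only.
vocabulary = ["service", "great food", "happy hour deal"]
vector = featurize("Great food and great service", vocabulary)
```

Each review becomes a fixed-length vector with one frequency per vocabulary term, which is the shape a standard classifier expects.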
Features – User ratings 3 nominal features – Good, Moderate, Bad
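One way the three nominal rating features could be encoded is as indicator values. The star cut-offs below are assumptions made for illustration; the slides do not state how ratings map to Good, Moderate, and Bad.

```python
def rating_features(stars):
    """Return [good, moderate, bad] indicator features for a star rating."""
    if stars >= 4:        # assumed cut-off, not given in the slides
        level = "Good"
    elif stars >= 3:      # assumed cut-off, not given in the slides
        level = "Moderate"
    else:
        level = "Bad"
    return [int(level == name) for name in ("Good", "Moderate", "Bad")]
```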
Approach • A review can belong to more than one category • So this is not a binary classification problem; it is a multi-label classification problem!
Binary classifiers for each category • Learn one binary classifier for each category • Output is the union of the predictions of all binary classifiers • [Figure: original dataset and transformed datasets]
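The transformation can be sketched as below: one multi-label dataset becomes one yes/no dataset per category, and the per-category predictions are merged back into a label set. The function names and sample reviews are hypothetical, and the category list follows the four names used on the later slides.

```python
CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

def binary_relevance_datasets(reviews):
    """Turn one multi-label dataset into one binary dataset per category.

    `reviews` is a list of (text, set_of_labels) pairs.
    """
    return {
        category: [(text, category in labels) for text, labels in reviews]
        for category in CATEGORIES
    }

def union_of_predictions(binary_outputs):
    """Merge per-category yes/no predictions into one predicted label set."""
    return {category for category, is_positive in binary_outputs.items() if is_positive}

# Made-up reviews for illustration.
reviews = [
    ("Great pasta, slow waiter", {"Food", "Service"}),
    ("Lovely patio", {"Ambience"}),
]
datasets = binary_relevance_datasets(reviews)
predicted = union_of_predictions(
    {"Food": True, "Service": False, "Ambience": True, "Deals": False}
)
```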
Classifier for each subset of categories • Categories = {Food, Service, Ambience, Deals} • Treat each distinct subset of categories as a single class and learn a multi-class classifier
Transformed dataset:
Review | Categories
Review 1 | “1001”
Review 2 | “0011”
Review 3 | “1000”
Review 4 | “0111”
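The bit-string class labels can be produced as sketched here. The bit order (Food, Service, Ambience, Deals) is an assumption taken from the order of the category set on the slide; `encode` and `decode` are hypothetical helper names.

```python
CATEGORIES = ["Food", "Service", "Ambience", "Deals"]  # assumed bit order

def encode(labels):
    """Map a set of categories to a single class label such as '1001'."""
    return "".join("1" if category in labels else "0" for category in CATEGORIES)

def decode(code):
    """Map a class label such as '0111' back to its set of categories."""
    return {category for category, bit in zip(CATEGORIES, code) if bit == "1"}
```

Under this bit order, "1001" stands for a review labelled Food and Deals. A multi-class classifier trained on these codes can only predict label combinations that appeared in the training data.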
Ensemble of subset classifiers • Train one classifier per subset of categories, predicting only the categories in that subset • Classifier 1 for (Food, Service) • Classifier 2 for (Food, Ambience) • Classifier 3 for (Food, Deals) • Classifier 4 for (Service, Ambience) • Classifier 5 for (Service, Deals) • Classifier 6 for (Ambience, Deals) • Total 6 classifiers for subsets of size 2: C(4,2) = 6
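The six category pairs can be enumerated with the standard library, as a quick check of the count.

```python
from itertools import combinations
from math import comb

CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

# Every unordered pair of categories gets its own classifier.
subsets = list(combinations(CATEGORIES, 2))
num_classifiers = comb(len(CATEGORIES), 2)
```

Each category appears in exactly three of the six pairs, which matters when the classifiers later vote.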
Ensemble of classifiers: Prediction • Ask each classifier to vote!
Ensemble of classifiers: Prediction • Final prediction: Majority vote (>= 2 classifiers)
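A minimal sketch of the voting step, assuming each pair classifier outputs the subset of its own two categories that it believes applies. The `predictions` values are made up for illustration, and `majority_vote` is a hypothetical name.

```python
from collections import Counter

CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

def majority_vote(subset_predictions, threshold=2):
    """Return the categories predicted by at least `threshold` classifiers.

    `subset_predictions` maps each category pair to the set of labels
    its classifier predicted for one review.
    """
    votes = Counter()
    for pair, predicted in subset_predictions.items():
        votes.update(predicted & set(pair))  # ignore labels outside the pair
    return {category for category in CATEGORIES if votes[category] >= threshold}

# Hypothetical classifier outputs for a single review.
predictions = {
    ("Food", "Service"): {"Food"},
    ("Food", "Ambience"): {"Food", "Ambience"},
    ("Food", "Deals"): {"Food"},
    ("Service", "Ambience"): {"Service"},
    ("Service", "Deals"): set(),
    ("Ambience", "Deals"): {"Ambience"},
}
final = majority_vote(predictions)
```

Since each category is covered by three classifiers, a threshold of 2 is a strict majority of the votes that category can receive.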
Evaluation measures • Notation: let (x, Y) be a multi-label example with $Y \subseteq L$, where L is the set of all labels • Let h be a multi-label classifier • Let Z = h(x) be the set of labels predicted by h for (x, Y) • Precision: $|Y \cap Z| / |Z|$ • Recall: $|Y \cap Z| / |Y|$
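In this set notation, the usual example-based measures are precision = |Y ∩ Z| / |Z| and recall = |Y ∩ Z| / |Y|, averaged over the test examples. The sketch below assumes those standard definitions; returning 0.0 for an empty denominator is a convention chosen here, not something the slides specify.

```python
def precision(true_labels, predicted_labels):
    """Fraction of predicted labels Z that are in the true label set Y."""
    if not predicted_labels:
        return 0.0  # assumed convention for an empty prediction
    return len(true_labels & predicted_labels) / len(predicted_labels)

def recall(true_labels, predicted_labels):
    """Fraction of true labels Y that were predicted in Z."""
    if not true_labels:
        return 0.0  # assumed convention for an example with no labels
    return len(true_labels & predicted_labels) / len(true_labels)

def average(metric, examples):
    """Average a per-example metric over (Y, Z) pairs."""
    return sum(metric(y, z) for y, z in examples) / len(examples)

Y = {"Food", "Service"}
Z = {"Food", "Ambience", "Deals"}
```

For this Y and Z, one of three predicted labels is correct and one of two true labels is found.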
Observation 2: Data Skew • Reduced skew in the training data by selectively adding examples
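The slides do not say how the extra data was selected. One plausible reading, sketched here purely as an assumption, is to count how often each label occurs and add pool examples only while they carry a label that is still under-represented. `add_selective`, `target`, and the toy data are all hypothetical.

```python
from collections import Counter

def label_counts(dataset):
    """Count how many examples carry each label."""
    counts = Counter()
    for _, labels in dataset:
        counts.update(labels)
    return counts

def add_selective(train, pool, target):
    """Add pool examples that carry a label still below `target` occurrences."""
    counts = label_counts(train)
    balanced = list(train)
    for text, labels in pool:
        if any(counts[label] < target for label in labels):
            balanced.append((text, labels))
            counts.update(labels)
    return balanced

# Toy data: Deals is under-represented relative to Food.
train_set = [("a", {"Food"}), ("b", {"Food"}), ("c", {"Deals"})]
pool = [("d", {"Deals"}), ("e", {"Food"}), ("f", {"Deals", "Food"})]
balanced = add_selective(train_set, pool, target=2)
```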
Thanks! • Check out our Yelp submission: http://www.ics.uci.edu/~vpsaini/ • Feedback welcome!