Opinion Spam Detection Using Data Mining and Other Techniques By Ben Reback
What is opinion spam?
Opinion spam is the practice of posting opinions, either positive or negative, to publicly help or hurt some person or organization without disclosing a relationship with that person or organization. Long definition, simple idea: lying about a service to help or hurt its provider, at the request of the provider or of a competitor.
The Power of Opinion
A Cornell group studied the effect of online reviews on hotel pricing and revenue:
- a one-point increase in Travelocity score (on a five-point scale) increases room rates by 11%
- a 1% increase in reputation across several sites boosts revenue per available room by roughly 1%
- more than 50% of potential hotel guests reported using online reviews when choosing a hotel
How to detect opinion spam?
Previous approach by another Cornell group:
- crowd-sourced creation of both positive and negative opinion spam
- analyzed the dataset, comparing positive to negative opinions based on the authors' lexical choices
- compared the accuracy of computational detection against human detection
Lexical analysis is a strong metric, but is it enough?
• What about more effective, less constrained opinion authors?
• What about computer-generated opinions?
We must combine word-choice analysis with other tools to effectively detect false reviews:
- user activity (hard)
- style analysis and deviation (harder)
- IP tracking and statistics (really hard)
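As a minimal sketch of the kind of lexical cue such analysis relies on, the snippet below computes the rate of first-person pronouns per review. The pronoun list, the sample reviews, and the idea that this single cue matters on its own are illustrative assumptions, not results from the Cornell study or from this project:

```java
import java.util.*;

// Sketch of one lexical cue: the fraction of tokens in a review that are
// first-person pronouns. The pronoun set and sample texts are invented
// for illustration; a real detector would combine many such features.
public class PronounRate {
    static final Set<String> FIRST_PERSON =
        new HashSet<>(Arrays.asList("i", "me", "my", "we", "our"));

    static double firstPersonRate(String review) {
        // Lowercase, strip punctuation, split on whitespace.
        String[] tokens = review.toLowerCase()
                                .replaceAll("[^a-z ]", " ")
                                .trim()
                                .split("\\s+");
        int hits = 0;
        for (String t : tokens) {
            if (FIRST_PERSON.contains(t)) hits++;
        }
        return tokens.length == 0 ? 0.0 : hits / (double) tokens.length;
    }

    public static void main(String[] args) {
        System.out.println(firstPersonRate("I loved my stay and I will return."));
        System.out.println(firstPersonRate("The room was clean and quiet."));
    }
}
```

A single cue like this is only a feature; classification would come from combining many such features over a labeled dataset.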
My strategy:
• Create a web trawler to gather reviews
• Use a database to store large data sets
• Create functions to analyze the dataset
• Hopefully learn something about the way people review a service
The Web Trawler
Since I wouldn't be able to easily crowd-source data, I had to collect my own set programmatically. Initially, I wanted to get reviews from multiple sites to potentially cross-list similar reviews, reviewers, and review styles. However….
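The extraction step of such a trawler might look like the sketch below. The HTML structure (the `review` class and `data-rating` attribute) is invented for illustration; every real site needs its own parsing rules, and a proper HTML parser would be more robust than a regex:

```java
import java.util.*;
import java.util.regex.*;

// Sketch of the extraction step of a review trawler: pull (rating, text)
// pairs out of fetched HTML. The markup shape here is hypothetical.
public class TrawlerSketch {
    static final Pattern REVIEW = Pattern.compile(
        "<div class=\"review\" data-rating=\"(\\d)\">(.*?)</div>",
        Pattern.DOTALL);

    // Returns a list of {rating, reviewText} pairs found in the page.
    static List<String[]> extractReviews(String html) {
        List<String[]> out = new ArrayList<>();
        Matcher m = REVIEW.matcher(html);
        while (m.find()) {
            out.add(new String[] { m.group(1), m.group(2).trim() });
        }
        return out;
    }

    public static void main(String[] args) {
        String page =
            "<div class=\"review\" data-rating=\"4\">Clean room, friendly staff.</div>"
          + "<div class=\"review\" data-rating=\"1\">Never again.</div>";
        for (String[] r : extractReviews(page)) {
            System.out.println(r[0] + " stars: " + r[1]);
        }
    }
}
```

This brittleness is exactly the "site changes require code changes" problem noted under Future Work.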
The Database
How does one store and operate on large datasets?
- used Postgres 9.2, an open-source database with a Java driver
I would need to store the initial data as well as any analysis I did, such as:
- word-type usage (similar to the Cornell study)
- rating-based word choice
- user-specific style choices (enthusiasm, spelling…)
- etc.
Tables
To start, I created the tables where the trawler would dump data:
- Review
- Hotel
- User
These tables held the basic attributes needed to store data straight from the web page.
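A possible shape for those three tables is sketched below. Only the table names come from the project; every column is an assumption about what a trawler would plausibly capture, not the actual schema:

```sql
-- Hypothetical DDL sketch for Postgres 9.2; column choices are assumptions.
-- "user" is a reserved word in Postgres, so the user table is named "users".
CREATE TABLE hotel (
    hotel_id   SERIAL PRIMARY KEY,
    name       TEXT NOT NULL,
    url        TEXT
);

CREATE TABLE users (
    user_id    SERIAL PRIMARY KEY,
    username   TEXT NOT NULL
);

CREATE TABLE review (
    review_id  SERIAL PRIMARY KEY,
    hotel_id   INTEGER REFERENCES hotel(hotel_id),
    user_id    INTEGER REFERENCES users(user_id),
    rating     SMALLINT,
    body       TEXT,
    posted_on  DATE
);
```

Derived analysis results (word counts, style scores) would go in additional tables keyed on these ids.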
Analysis
Types of analysis range from relatively basic to very complex and computationally intensive:
Basic: types of words associated with a review level, e.g. four-star reviews are more likely to report 'good' or 'met expectations'
Complex: determining a reviewer's style by comparing all of his reviews across different levels
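The basic analysis can be sketched as a word count grouped by star rating. The sample reviews below are invented; real input would come from the Review table:

```java
import java.util.*;

// Sketch of the "basic" analysis: how often each word appears at each
// star rating. Sample data is invented for illustration.
public class RatingWords {
    static Map<Integer, Map<String, Integer>> wordCountsByRating(
            Map<Integer, List<String>> reviewsByRating) {
        Map<Integer, Map<String, Integer>> result = new HashMap<>();
        for (Map.Entry<Integer, List<String>> e : reviewsByRating.entrySet()) {
            Map<String, Integer> counts = new HashMap<>();
            for (String review : e.getValue()) {
                for (String token : review.toLowerCase()
                                         .replaceAll("[^a-z ]", " ")
                                         .split("\\s+")) {
                    if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
                }
            }
            result.put(e.getKey(), counts);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> sample = new HashMap<>();
        sample.put(4, Arrays.asList("Good stay, met expectations.", "Good value."));
        sample.put(1, Arrays.asList("Terrible service."));
        System.out.println(wordCountsByRating(sample));
    }
}
```

Comparing these per-rating distributions is what surfaces words like 'good' clustering at four stars.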
More Analysis
Basic: review outliers – how similar are reviews at a given rating, and which express drastically different opinions? Do the reviews mention similar things?
Complex: determining whether a user has actually used the service – is a review too generic or too specific? To what degree does this indicate falsehood?
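One simple way to make "how similar are reviews" concrete is Jaccard similarity over word sets: a review with near-zero similarity to every other review at the same rating could be flagged as an outlier. The metric choice and the example texts are illustrative assumptions, not the project's actual method:

```java
import java.util.*;

// Sketch of an outlier check: Jaccard similarity between the word sets
// of two reviews (|intersection| / |union|). Example texts are invented.
public class ReviewSimilarity {
    static Set<String> wordSet(String review) {
        Set<String> words = new HashSet<>();
        for (String t : review.toLowerCase()
                              .replaceAll("[^a-z ]", " ")
                              .split("\\s+")) {
            if (!t.isEmpty()) words.add(t);
        }
        return words;
    }

    static double jaccard(String a, String b) {
        Set<String> sa = wordSet(a), sb = wordSet(b);
        if (sa.isEmpty() && sb.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return inter.size() / (double) union.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard("great pool great staff", "great staff nice pool"));
        System.out.println(jaccard("great pool", "terrible parking"));
    }
}
```

Word-set overlap also gives a rough answer to "do the reviews mention similar things?", though it ignores word order and meaning.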
Future Work
- Significant amount of parsing required – site changes require code changes
- Cross-site data mining and user matching
- Adjusted user ratings based on user riskiness
  - reliability of a new user vs. an established user
- More types of analysis beyond what is available from user data and review text
  - IP activity indicating bot usage
  - ability to generate realistic reviews
Works Cited
1. http://www.travelweekly.com/Travel-News/Hotel-News/Cornell-study-links-hotel-reviews-and-room-revenue/
2. http://www.cs.cornell.edu/~myleott/op_spamACL2011.pdf