Opinion Spam Detection

Opinion Spam Detection Using Data Mining and Other Techniques
By Ben Reback

What is opinion spam?
Opinion spam is the act of publicly posting an opinion, positive or negative, to help or hurt some person or organization without disclosing a relationship with that person or organization.
Long definition, simple idea: lying about a service to help or hurt its provider, at the request of the provider or a competitor.

The Power of Opinion
A Cornell group studied the effect of online reviews on hotel pricing and revenue[1]:
-a one-point increase in Travelocity score (on a five-point scale) increases room rates by 11%
-a 1% increase in reputation across several sites boosts revenue per available room by roughly 1%
-more than 50% of potential hotel guests reported using online reviews when choosing a hotel

How to detect opinion spam?
Previous approach by another Cornell group[2]:
-crowd-sourced the creation of deceptive reviews (opinion spam)
-analyzed the dataset, comparing deceptive to truthful reviews based on the authors' lexical choices
-compared the accuracy of computation-based detection against human detection

Lexical analysis is a strong metric, but is it enough?
• What about more effective, unconstrained opinion authors?
• What about computer-generated opinions?
We must use word choice analysis as well as other tools to effectively detect false reviews:
-user activity (hard)
-style analysis and deviation (harder)
-IP tracking and statistics (really hard)
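One way to make "style analysis and deviation" concrete is to reduce each review to a few numeric style features and measure how far a new review sits from the same user's earlier reviews. The sketch below is a hypothetical illustration, not the method used in this project: the feature set (exclamation rate, uppercase ratio, average word length), the function names, and the 0.05 floor on the spread are all assumptions.

```python
import re
from statistics import mean, pstdev


def style_features(text):
    """Return a small, hypothetical set of style features for one review."""
    words = re.findall(r"[A-Za-z']+", text)
    letters = [c for c in text if c.isalpha()]
    return {
        "exclaim_per_word": text.count("!") / max(len(words), 1),
        "upper_ratio": sum(c.isupper() for c in letters) / max(len(letters), 1),
        "avg_word_len": mean(len(w) for w in words) if words else 0.0,
    }


def style_deviation(history, candidate, min_spread=0.05):
    """Mean absolute z-score of `candidate` against a user's earlier reviews.

    `min_spread` is an assumed floor that keeps the score finite when the
    user's history shows no variation in a feature.
    """
    if not history:
        raise ValueError("need at least one earlier review")
    past = [style_features(t) for t in history]
    new = style_features(candidate)
    scores = []
    for name, value in new.items():
        values = [f[name] for f in past]
        spread = max(pstdev(values), min_spread)
        scores.append(abs(value - mean(values)) / spread)
    return mean(scores)


history = [
    "The room was clean and quiet.",
    "Staff were friendly and the breakfast was fine.",
    "Good location, fair price.",
]
calm = "The bed was comfortable and check-in was quick."
loud = "BEST HOTEL EVER!!! AMAZING!!! LOVED IT!!!"
print(style_deviation(history, calm), style_deviation(history, loud))
```

A high score only says that a review is unlike the user's usual style; on its own it is not evidence of spam.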

My strategy:
• Create a web trawler to gather reviews
• Use a database to store large data sets
• Create functions to analyze the dataset
• Hopefully learn something about the way people review a service

The Web Trawler
Since I couldn’t easily crowd-source data, I had to collect my own set programmatically. Initially, I wanted to gather reviews from multiple sites so I could cross-reference similar reviews, reviewers, and review styles. However...

Despite useful tools, HTML and JavaScript parsing is still a mess (partially due to anti-trawler scripting).
I decided to focus on TripAdvisor hotels.
Trawler components:
-Selenium 2.0 WebDriver
-Java program
It was more difficult to collect data than I anticipated.
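To give a feel for the parsing step, here is a minimal sketch that pulls reviews out of a page's HTML once the browser has rendered it. It is not the original trawler, which was a Java program driving Selenium; this stand-in uses Python's standard-library HTML parser so that it runs on its own. The markup (a `div` with class `review`, a `data-rating` attribute, and `user` and `text` child elements) is invented for illustration and does not reflect TripAdvisor's real pages.

```python
from html.parser import HTMLParser


class ReviewParser(HTMLParser):
    """Collect reviews from a page that uses the hypothetical markup below."""

    def __init__(self):
        super().__init__()
        self.reviews = []      # finished reviews, one dict each
        self._current = None   # review being built, or None outside a review
        self._field = None     # "user" or "text" while inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css = attrs.get("class") or ""
        if tag == "div" and css == "review":
            self._current = {"rating": int(attrs.get("data-rating") or 0),
                             "user": "", "text": ""}
        elif self._current is not None and css in ("user", "text"):
            self._field = css

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if self._field and tag in ("span", "p"):
            self._field = None
        elif tag == "div" and self._current is not None:
            # Assumes no nested <div> inside a review block.
            self._current["text"] = self._current["text"].strip()
            self.reviews.append(self._current)
            self._current = None


def parse_reviews(html):
    parser = ReviewParser()
    parser.feed(html)
    return parser.reviews


SAMPLE = """
<div class="review" data-rating="4">
  <span class="user">alice</span>
  <p class="text">Clean room, met expectations.</p>
</div>
<div class="review" data-rating="1">
  <span class="user">bob</span>
  <p class="text">Noisy and overpriced.</p>
</div>
"""
for review in parse_reviews(SAMPLE):
    print(review)
```

Because the parser is tied to specific class names, any change to the site's markup means changing this code, which is the maintenance burden the slide describes.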

The Database
How does one store and operate on large datasets?
-used Postgres 9.2
-open-source database with a Java driver
I would need to store the initial data as well as any analysis I did, such as:
-word type usage (similar to the Cornell study)
-rating-based word choice
-user-specific style choices (enthusiasm, spelling…)
-etc.

Tables
To start, I created the tables where the trawler would dump data:
-Review
-Hotel
-User
These tables had basic columns that stored the data straight from the web page.
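The three tables might look roughly like the sketch below. The slides do not list the actual columns, so every column name here is an assumption, and an in-memory SQLite database stands in for Postgres 9.2 so the example runs without a server. The user table is named `site_user` because `user` is a reserved word in PostgreSQL and would need quoting there.

```python
import sqlite3

# Hypothetical schema: column names are illustrative, not the project's own.
SCHEMA = """
CREATE TABLE hotel (
    hotel_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
CREATE TABLE site_user (
    user_id  INTEGER PRIMARY KEY,
    username TEXT NOT NULL
);
CREATE TABLE review (
    review_id INTEGER PRIMARY KEY,
    hotel_id  INTEGER NOT NULL REFERENCES hotel(hotel_id),
    user_id   INTEGER NOT NULL REFERENCES site_user(user_id),
    rating    INTEGER NOT NULL CHECK (rating BETWEEN 1 AND 5),
    body      TEXT NOT NULL
);
"""


def create_database():
    """Open an in-memory database and create the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def _get_or_create(conn, table, id_col, name_col, value):
    """Return the id of the row whose name matches, inserting it if missing."""
    row = conn.execute(
        f"SELECT {id_col} FROM {table} WHERE {name_col} = ?", (value,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(f"INSERT INTO {table} ({name_col}) VALUES (?)", (value,))
    return cur.lastrowid


def store_review(conn, hotel, username, rating, body):
    """Insert one scraped review, creating hotel and user rows as needed."""
    hotel_id = _get_or_create(conn, "hotel", "hotel_id", "name", hotel)
    user_id = _get_or_create(conn, "site_user", "user_id", "username", username)
    conn.execute(
        "INSERT INTO review (hotel_id, user_id, rating, body) VALUES (?, ?, ?, ?)",
        (hotel_id, user_id, rating, body),
    )
    conn.commit()


conn = create_database()
store_review(conn, "Hotel A", "alice", 4, "Good, met expectations.")
store_review(conn, "Hotel A", "bob", 2, "Not great.")
print(conn.execute("SELECT AVG(rating) FROM review").fetchone()[0])
```

Keeping hotels and users in their own tables means later analysis can group reviews by either one with a plain join.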

Analysis
Types of analysis range from relatively basic to very complex and computationally intensive:
Basic: types of words associated with a review level, e.g. four-star reviews are more likely to say ‘good’ or ‘met expectations’
Complex: determining a reviewer’s style by comparing all of their reviews across different levels
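The basic case, finding the words associated with a rating level, can be sketched as a frequency comparison: count how many reviews at one rating contain a word, and compare that with how many reviews at every other rating contain it. This is an illustrative sketch with made-up sample reviews; the add-one smoothing and the function names are assumptions rather than anything taken from the project.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a review and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def words_by_rating(reviews):
    """Map each rating to a Counter of word -> number of reviews containing it."""
    counts = defaultdict(Counter)
    for rating, text in reviews:
        counts[rating].update(set(tokenize(text)))
    return counts


def distinctive_words(reviews, rating, top_n=3):
    """Rank words by how much more often they appear at `rating` than elsewhere."""
    counts = words_by_rating(reviews)
    n_in = sum(1 for r, _ in reviews if r == rating)
    n_out = len(reviews) - n_in
    other = Counter()
    for r, c in counts.items():
        if r != rating:
            other.update(c)

    def score(word):
        # Smoothed share of reviews containing the word, inside vs. outside.
        p_in = (counts[rating][word] + 1) / (n_in + 2)
        p_out = (other[word] + 1) / (n_out + 2)
        return p_in / p_out

    ranked = sorted(counts[rating], key=lambda w: (-score(w), w))
    return ranked[:top_n]


reviews = [
    (4, "Good hotel, met expectations"),
    (4, "Good value and good location"),
    (5, "Amazing stay, perfect location"),
    (1, "Terrible service, dirty room"),
    (1, "Terrible, never again"),
]
print(distinctive_words(reviews, 4, top_n=1))
print(distinctive_words(reviews, 1, top_n=1))
```

On a real dataset, common words such as "the" score near 1 and drop out of the ranking, since they appear at every rating level.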

More Analysis
Basic: review outliers, i.e. how similar the reviews of a certain rating are and whether opinions vary drastically. Do the reviews mention similar things?
Complex: determining whether a user has actually used the service. Is a review too generic or too specific? To what degree does this indicate falsehood?
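A simple way to ask whether reviews "mention similar things" is to compare their vocabularies. The sketch below scores each review by its average Jaccard similarity (shared words divided by total distinct words) to the other reviews in the same group, and reports the least similar one as a candidate outlier. Jaccard similarity is a stand-in chosen for illustration; the slides do not say which similarity measure was used, and the sample reviews are invented.

```python
import re


def word_set(text):
    """Return the set of lowercase words in a review."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_similarity(reviews):
    """For each review, average Jaccard similarity to all the others."""
    sets = [word_set(t) for t in reviews]
    result = []
    for i, s in enumerate(sets):
        others = [jaccard(s, o) for j, o in enumerate(sets) if j != i]
        result.append(sum(others) / len(others) if others else 0.0)
    return result


def least_similar(reviews):
    """Index of the review that shares the least vocabulary with the rest."""
    scores = mean_similarity(reviews)
    return min(range(len(reviews)), key=scores.__getitem__)


five_star = [
    "The pool and breakfast were great",
    "Great breakfast and a nice pool",
    "Loved the pool, breakfast was great",
    "My flight was delayed for hours",
]
print(least_similar(five_star))
```

An outlier found this way is only a prompt for a closer look: a review can differ from its peers because it is fake, or simply because the guest cared about different things.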

Future Work
-Significant amount of parsing required: site changes require code changes.
-Cross-site data mining, user matching.
-Adjusted user ratings based on user riskiness
-Reliability of new user vs. established user
-More types of analysis beyond what is available from user data and review text
-IP activity indicating bot usage
-Ability to generate realistic reviews
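As a rough illustration of the "adjusted ratings" idea, the sketch below weights each rating by how established its reviewer is, so that a burst of five-star reviews from brand-new accounts moves a hotel's score less than a few ratings from long-standing users. The trust formula `n / (n + k)` and the constant `k = 5` are assumptions made up for this example; the slides leave open how riskiness would actually be measured.

```python
def user_trust(review_count, k=5):
    """Hypothetical trust weight in [0, 1) that grows with review history."""
    return review_count / (review_count + k)


def adjusted_rating(ratings, k=5):
    """Trust-weighted mean of (rating, reviewer_review_count) pairs.

    Returns None when no reviewer has any history to weight by.
    """
    weights = [user_trust(n, k) for _, n in ratings]
    total = sum(weights)
    if total == 0:
        return None
    return sum(r * w for (r, _), w in zip(ratings, weights)) / total


# Three new accounts give five stars; two established users are less impressed.
ratings = [(5, 1), (5, 1), (5, 1), (2, 45), (3, 20)]
plain = sum(r for r, _ in ratings) / len(ratings)
print(plain, adjusted_rating(ratings))
```

With these sample numbers the adjusted score falls below the plain average, because the established reviewers carry most of the weight.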

Works Cited
1. http://www.travelweekly.com/Travel-News/Hotel-News/Cornell-study-links-hotel-reviews-and-room-revenue/
2. http://www.cs.cornell.edu/~myleott/op_spamACL2011.pdf
Pictures:
1. http://www.blogcdn.com/www.engadget.com/media/2011/07/opinion-spam-1311794211.jpg
2. http://masterthenewnet.com/wp-content/uploads/2011/12/quick-sand.png
