Opinion Mining Dr. Alaa El-Halees Faculty of Information Technology Islamic University of Gaza Seminar 9/9/2008
Outline • Definition • Applications • Challenges • Model • Arabic • Conclusion
Definition • Opinion mining (sentiment mining, opinion/sentiment extraction) is the area of research that attempts to build automatic systems that determine human opinion from text written in natural language. • It seeks to identify the viewpoint(s) underlying a text span; an example application is classifying a movie review as thumbs up or thumbs down.
Definition • Consider, for instance, the following scenario. A major computer manufacturer, disappointed with unexpectedly low sales, finds itself confronted with this question: Why aren't consumers buying our laptop? While concrete data such as the laptop's weight or the price of a competitor's model are obviously relevant, answering this question requires focusing more on people's personal views of such objective characteristics. Moreover, subjective judgments regarding intangible qualities --- e.g., "the design is tacky" or "customer service was condescending" --- or even misperceptions --- "updated device drivers aren't available" --- must be taken into account as well.
Definition • What other people think has always been an important piece of information for most of us during the decision-making process. • Opinion mining draws on computational linguistics, information retrieval, text mining, natural language processing, machine learning, statistics, and predictive analysis.
Definition • Two main types of textual information: facts and opinions • Most current information processing techniques (e.g., search engines) work with facts (and assume they are true) • Facts can be expressed with topic keywords
Definition • In real life, facts are important, but opinion also plays a crucial role. • A computer manufacturer, disappointed with low sales, asks itself: Why aren’t consumers buying our laptop? • The Democratic National Committee, disappointed with the last election, wants to know on an ongoing basis: What is the reaction in the press, newsgroups, chat rooms, and blogs to Bush’s latest policy decision?
Definition • The main advantage is speed: on average, humans process six articles per hour, against a machine throughput of 10 per second (36,000 per hour).
Applications • Applications as a Sub-Component Technology: • Recommendation systems • Summarization • Question answering: Q: What is the international reaction to the reelection of Robert Mugabe as President of Zimbabwe? A: African observers generally approved of his victory while Western governments denounced it.
Applications • Applications in Business • Marketing intelligence • Product and service benchmarking and improvement • Understanding the voice of the customer as expressed in everyday communications
Applications • Politics • As is well known, opinions matter a great deal in politics. Some work has focused on understanding what voters are thinking
Applications • Blog analysis • Perform subjectivity and polarity classification on blog posts • Discover irregularities in temporal mood patterns (fear, excitement, etc.) appearing in a large corpus of blogs • Use link polarity information to model trust and influence in the blogosphere • Analyze blog sentiment about movies and correlate it with their sales
Applications • Human Computer Interaction • Affect sensing • Human Robot Interaction
Challenges • Determine whether a document or portion (e.g. paragraph or statement) is subjective. • Example: “the battery lasts 2 hours” vs. “the battery only lasts 2 hours”
Challenges • The difficulty lies in the richness of human language use. Example: 1. This is a great camera. 2. A great amount of money was spent on promoting this camera. 3. One might think this is a great camera. Well, think again, because... • A single keyword (“great”) can convey three different opinions: positive, neutral, and negative, respectively.
Challenges • To arrive at sensible conclusions, sentiment analysis has to understand context. For example, “fighting” and “disease” are negative in a war context but positive in a medical one. • Different domains call for different mining approaches.
Challenges • Even humans do not always agree on the same document: there is an 82% chance of two or more human analysts agreeing with each other.
Data Preparation • The data preparation step performs the necessary preprocessing and cleaning of the dataset for the subsequent analysis. Commonly used preprocessing steps include removing non-textual content and markup tags (for HTML pages), and removing information about the reviews that is not required for sentiment analysis, such as review dates and reviewers’ names. • Balance the class distribution of the training dataset.
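The preparation steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the record layout (`reviewer`, `date`, `body`) and the helper names `clean_review` and `balance` are assumptions made for the example, and downsampling is just one possible way to balance a training set.

```python
import html
import random
import re

# Hypothetical raw review record; the field names are illustrative assumptions.
raw_review = {
    "reviewer": "jane_doe",
    "date": "2008-08-14",
    "body": "<p>The battery <b>only</b> lasts 2&nbsp;hours.</p><br/>",
}

TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


def clean_review(record):
    """Keep only the review text, without markup, date, or reviewer name."""
    text = TAG_RE.sub(" ", record["body"])   # drop HTML tags
    text = html.unescape(text)               # decode entities such as &nbsp;
    return SPACE_RE.sub(" ", text).strip()   # collapse runs of whitespace


def balance(labelled, seed=0):
    """Downsample so every class has as many examples as the rarest class."""
    by_label = {}
    for text, label in labelled:
        by_label.setdefault(label, []).append(text)
    n = min(len(texts) for texts in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label, texts in by_label.items():
        balanced.extend((t, label) for t in rng.sample(texts, n))
    return balanced


cleaned = clean_review(raw_review)
toy = [("a", "pos"), ("b", "pos"), ("c", "pos"), ("d", "neg")]
balanced = balance(toy)
```

Real review pages usually need a proper HTML parser rather than a regular expression; the regex here only keeps the sketch short.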
Review Analysis • The review analysis step analyzes the linguistic features of reviews so that interesting information, including opinions and/or product features, can be identified. • This step often applies various computational linguistics tasks to reviews first, and then extracts opinions and product features from the processed reviews. • Two commonly adopted tasks for review analysis are part-of-speech (POS) tagging and negation tagging.
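Negation tagging can be illustrated with a small sketch. The version below assumes one common convention: every word between a negation word and the next punctuation mark is prefixed with `NOT_`. The negation list, the tokenizer, and the function name `tag_negation` are assumptions for illustration, not taken from the slides. POS tagging is not shown because it needs a trained tagger.

```python
import re

# Illustrative negation cues and clause boundaries (assumed, not exhaustive).
NEGATIONS = {"not", "no", "never", "cannot"}
CLAUSE_END = {".", ",", ";", ":", "!", "?"}


def tag_negation(sentence):
    """Prefix words that follow a negation with NOT_ until the clause ends."""
    # Words (optionally with an apostrophe part, e.g. "isn't") or punctuation.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?|[.,;:!?]", sentence.lower())
    tagged, negated = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            negated = False            # punctuation closes the negation scope
            tagged.append(tok)
        elif tok in NEGATIONS or tok.endswith("n't"):
            negated = True             # open a negation scope
            tagged.append(tok)
        else:
            tagged.append("NOT_" + tok if negated else tok)
    return tagged


tagged = tag_negation("I did not like the plot, but the acting was great.")
```

With this tagging, "like" and "NOT_like" become distinct features, so a later classifier can treat them differently.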
Sentiment Classification • There are two main techniques for sentiment classification: • The symbolic technique uses manually crafted rules and lexicons. • The machine learning approach uses unsupervised or supervised learning to construct a model from a large training corpus.
What? • Find relevant words, phrases, patterns that can be used to express subjectivity • Determine the polarity of subjective expressions
Words • Adjectives • positive: honest, important, mature, large, patient • Ron Paul is the only honest man in Washington. • Kitchell’s writing is unbelievably mature and is only likely to get better. • To humour me, my patient father agrees yet again to my choice of film. • negative: harmful, hypocritical, inefficient, insecure • It was a macabre and hypocritical circus. • Why are they being so inefficient?
Words • Verbs • positive: praise, love • negative: blame, criticize • Nouns • positive: pleasure, enjoyment • negative: pain, criticism
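A word list like the one above is the core of the symbolic technique. The following is a minimal sketch of a lexicon-based polarity scorer built only from the example adjectives, verbs, and nouns on these slides; the counting rule and the function name `polarity` are assumptions for illustration, and a real lexicon would be far larger.

```python
import re

# Lexicon built from the example words on the slides.
POSITIVE = {"honest", "important", "mature", "large", "patient",
            "praise", "love", "pleasure", "enjoyment"}
NEGATIVE = {"harmful", "hypocritical", "inefficient", "insecure",
            "blame", "criticize", "pain", "criticism"}


def polarity(text):
    """Count positive minus negative lexicon hits and map the sign to a label."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


examples = [
    "Ron Paul is the only honest man in Washington.",
    "It was a macabre and hypocritical circus.",
    "The battery lasts 2 hours.",
]
labels = [polarity(e) for e in examples]
```

Plain word counting ignores context, so it cannot separate the three uses of "great" shown on the Challenges slide.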
Phrases • Phrases containing adjectives and adverbs • positive: high intelligence, low cost • negative: little variation, many troubles
Patterns • way with <np>: … to ever let China use force to have its way with … • expense of <np>: at the expense of the world’s security and stability • underlined <dobj>: Jiang’s subdued tone … underlined his desire to avoid disputes …
Machine Learning • Studies showed that standard machine learning techniques definitively outperform human-produced baselines.
Machine Learning • The basic idea is to treat sentiment classification simply as a special case of topic-based categorization (with the two “topics” being positive sentiment and negative sentiment).
Supervised Methods • To train a classifier for sentiment recognition in text, classic supervised learning techniques (e.g. Support Vector Machines, naive Bayes, Maximum Entropy) can be used. A supervised approach uses a labelled training corpus to learn a classification function. In the literature, the method that most often yields the highest accuracy is a Support Vector Machine classifier.
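To make the supervised setting concrete, here is a minimal sketch of one of the techniques named above, naive Bayes (multinomial, with add-one smoothing), written with the standard library only. The four training sentences are invented toy data and the class name `NaiveBayes` is an assumption for the example; a realistic system needs a large labelled corpus.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over whitespace tokens, add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.word_counts = defaultdict(Counter)      # word counts per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)    # log prior
            for w in doc.lower().split():
                if w in self.vocab:                  # skip unseen words
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data, for illustration only.
train_docs = ["great camera love it", "honest and mature writing",
              "terrible battery painful to use", "inefficient and expensive"]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("love the great writing")
```

A Support Vector Machine would be trained on the same kind of labelled word features; naive Bayes is shown here only because it fits in a few lines.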
Unsupervised Learning • A clustering algorithm partitions the adjectives into two subsets (positive and negative). • Example adjectives from the figure: slow, scenic, nice, terrible, handsome, painful, fun, expensive, comfortable
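One simple way to picture such a partition is sketched below. It assumes, beyond what the slide states, that pairwise evidence is available saying whether two adjectives share an orientation (for example from conjunctions such as "nice and comfortable" versus "scenic but slow"). The links are invented for illustration, and the breadth-first two-colouring is a simplification standing in for a real clustering algorithm.

```python
from collections import deque

# Hypothetical links: (adjective, adjective, same_orientation).
links = [
    ("nice", "comfortable", True),
    ("nice", "scenic", True),
    ("fun", "nice", True),
    ("handsome", "nice", True),
    ("scenic", "slow", False),
    ("fun", "expensive", False),
    ("terrible", "painful", True),
    ("expensive", "painful", True),
]


def partition(links):
    """Split adjectives into two subsets that respect the same/opposite links."""
    graph = {}
    for a, b, same in links:
        graph.setdefault(a, []).append((b, same))
        graph.setdefault(b, []).append((a, same))
    side = {}
    for start in graph:
        if start in side:
            continue
        side[start] = 0                    # arbitrary side for each component
        queue = deque([start])
        while queue:
            word = queue.popleft()
            for other, same in graph[word]:
                if other not in side:      # conflicting links are ignored
                    side[other] = side[word] if same else 1 - side[word]
                    queue.append(other)
    first = {w for w, s in side.items() if s == 0}
    second = {w for w, s in side.items() if s == 1}
    return first, second


group_a, group_b = partition(links)
```

The sketch only separates the two subsets; deciding which subset is the positive one needs additional evidence.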
Arabic • Work of Yousif Almas and Khurshid Ahmad • “A note on extracting ‘sentiments’ in financial news in English, Arabic & Urdu” • Used a pattern-based approach on financial data
Conclusion • An important field of study • A new field • Many applications • Suitable for Arabic language research • Almost no work in this area
References • Pang, B. and Lee, L. (2008). “Opinion Mining and Sentiment Analysis”, Foundations and Trends® in Information Retrieval, Vol. 2, Nos. 1–2, pp. 1–135. Ebook from http://www.cs.cornell.edu/home/llee/omsa/omsa.pdf • Wiebe, J., Cardie, C. and Riloff, E. (2007). “Manual and Automatic Subjectivity and Sentiment Analysis”, Center for Extraction and Summarization of Events and Opinions in Text, University of Utah.
References • Almas, Y. and Ahmad, K. (2007). “A note on extracting ‘sentiments’ in financial news in English, Arabic & Urdu”. The Second Workshop on Computational Approaches to Arabic Script-based Languages, LSA 2007 Linguistic Institute, July 21, 2007, Stanford University. • Leung, C. and Chan, S. (2008). “Sentiment Analysis of Product Reviews”. Encyclopedia of Data Warehousing and Mining, 2nd Edition, Information Science Reference, August 2008.