Multi-variate Outliers in Data Cubes 2012-03-05 JongHeum Yeon
Contents • Sentiment Analysis and Opinion Mining • Materials from AAAI-2011 Tutorial • Multi-variate Outliers in Data Cubes • Motivation • Technologies • Issues
Sentiment Analysis and Opinion Mining • Opinion mining or sentiment analysis • Computational study of opinions, sentiments, subjectivity, evaluations, attitudes, appraisal, affects, views, emotions, etc., expressed in text. • Opinion mining ~= sentiment analysis • Sources: Global Scale • Word-of-mouth on the Web • Personal experiences and opinions about anything in reviews, forums, blogs, Twitter, micro-blogs, etc. • Comments about articles, issues, topics, reviews, etc. • Postings at social networking sites, e.g., Facebook. • Organization internal data • News and reports • Applications • Businesses and organizations • Individuals • Ad placements • Opinion retrieval
Problem Statement • (1) Opinion Definition • Id: Abc123 on 5-1-2008 “I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive, …” • document level, i.e., is this review + or -? • sentence level, i.e., is each sentence + or -? • entity and feature/aspect level • Components • Opinion targets: entities and their features/aspects • Sentiments: positive and negative • Opinion holders: persons who hold the opinions • Time: when opinions are expressed • (2) Opinion Summarization
Two main types of opinions • Regular opinions: Sentiment/opinion expressions on some target entities • Direct opinions: • “The touch screen is really cool.” • Indirect opinions: • “After taking the drug, my pain has gone.” • Comparative opinions: Comparisons of more than one entity. • e.g., “iPhone is better than Blackberry.” • Opinion (a restricted definition) • An opinion (or regular opinion) is simply a positive or negative sentiment, view, attitude, emotion, or appraisal about an entity or an aspect of the entity (Hu and Liu 2004; Liu 2006) from an opinion holder (Bethard et al 2004; Kim and Hovy 2004; Wiebe et al 2005). • Sentiment orientation of an opinion • Positive, negative, or neutral (no opinion) • Also called opinion orientation, semantic orientation, sentiment polarity.
Entity and Aspect • Definition (entity) • An entity e is a product, person, event, organization, or topic. e is represented as • a hierarchy of components, sub-components, and so on. • Each node represents a component and is associated with a set of attributes of the component. • An opinion is a quintuple (e_j, a_jk, so_ijkl, h_i, t_l), as defined on the next slides.
Goal of Opinion Mining • Id: Abc123 on 5-1-2008 “I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive, …” • Quintuples • (iPhone, GENERAL, +, Abc123, 5-1-2008) • (iPhone, touch_screen, +, Abc123, 5-1-2008) • Goal: Given an opinionated document, • Discover all quintuples (e_j, a_jk, so_ijkl, h_i, t_l), • Or, solve some simpler forms of the problem (sentiment classification at the document or sentence level) • Unstructured Text → Structured Data
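The extraction goal is easiest to see as a target data structure. The following is a minimal sketch, not taken from the tutorial; the class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Opinion:
    """One extracted quintuple (e_j, a_jk, so_ijkl, h_i, t_l)."""
    entity: str     # e_j: target entity, e.g. "iPhone"
    aspect: str     # a_jk: aspect of the entity; "GENERAL" for the entity as a whole
    sentiment: int  # so_ijkl: +1 positive, -1 negative, 0 neutral
    holder: str     # h_i: opinion holder
    time: str       # t_l: when the opinion was posted

# The two quintuples extracted from the review above, as structured records.
quintuples = [
    Opinion("iPhone", "GENERAL", +1, "Abc123", "5-1-2008"),
    Opinion("iPhone", "touch_screen", +1, "Abc123", "5-1-2008"),
]
```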
Sentiment, subjectivity, and emotion • Sentiment ≠ Subjectivity ≠ Emotion • Sentence subjectivity: An objective sentence presents some factual information, while a subjective sentence expresses some personal feelings, views, emotions, or beliefs. • Emotion: Emotions are people’s subjective feelings and thoughts. • Most opinionated sentences are subjective, but objective sentences can imply opinions too. • Emotion • Rational evaluation: Many evaluation/opinion sentences express no emotion • e.g., “The voice of this phone is clear” • Emotional evaluation • e.g., “I love this phone” • “The voice of this phone is crystal clear” • Sentiment ⊄ Subjectivity • Emotion ⊂ Subjectivity • Sentiment ⊄ Emotion
Opinion Summarization • With a lot of opinions, a summary is necessary. • A multi-document summarization task • For factual texts, summarization is to select the most important facts and present them in a sensible order while avoiding repetition • 1 fact = any number of repetitions of the same fact • But for opinion documents, it is different because opinions have a quantitative side & have targets • 1 opinion ≠ a number of opinions (how many people hold each opinion matters) • Aspect-based summary is more suitable • Quintuples form the basis for opinion summarization
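Since opinions have a quantitative side, an aspect-based summary is essentially an aggregation over the quintuples. A minimal sketch (reusing the hypothetical Opinion record above) that counts positive and negative opinions per entity/aspect pair:

```python
from collections import defaultdict

def aspect_summary(quintuples):
    """Count positive and negative opinions per (entity, aspect) pair."""
    summary = defaultdict(lambda: {"pos": 0, "neg": 0})
    for q in quintuples:
        key = (q.entity, q.aspect)
        if q.sentiment > 0:
            summary[key]["pos"] += 1
        elif q.sentiment < 0:
            summary[key]["neg"] += 1
    return dict(summary)

# e.g. {("iPhone", "touch_screen"): {"pos": 12, "neg": 3}, ...} over a whole corpus
```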
Aspect-based Opinion Summary • Id: Abc123 on 5-1-2008 “I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive, …” • Feature Based Summary of iPhone • Opinion Observer
Aspect-based Opinion Summary • Opinion Observer
Aspect-based Opinion Summary • Bing • Google Product Search
Aspect-based Opinion Summary • OpinionEQ • Detail opinion sentences
OpinionEQ • % of +ve opinion and # of opinions • Aggregate opinion trend
Opinion Mining Problem • (e_j, a_jk, so_ijkl, h_i, t_l) • e_j – a target entity: Named Entity Extraction (more) • a_jk – an aspect of e_j: Information Extraction • so_ijkl – the sentiment: Sentiment Identification • h_i – an opinion holder: Information/Data Extraction • t_l – the time: Information/Data Extraction • 5 pieces of information must match • Coreference resolution • Synonym match (voice = sound quality) • …
Opinion Mining Problem • Tweets from Twitter are the easiest • short and thus usually straight to the point • Reviews are next • entities are given (almost) and there is little noise • Discussions, comments, and blogs are hard. • Multiple entities, comparisons, noise, sarcasm, etc. • Determining sentiments seems to be easier. • Extracting entities and aspects is harder. • Combining them is even harder.
Opinion Mining Problem in the Real World • Source the data, e.g., reviews, blogs, etc. • (1) Crawl all data, store and search them, or • (2) Crawl only the target data • Extract the right entities & aspects • Group entity and aspect expressions • Moto = Motorola, photo = picture, etc. … • Aspect-based opinion mining (sentiment analysis) • Discover all quintuples • (Store the quintuples in a database) • Aspect-based opinion summary
Problems • Document sentiment classification • Sentence subjectivity & sentiment classification • Aspect-based sentiment analysis • Aspect-based opinion summarization • Opinion lexicon generation • Mining comparative opinions • Some other problems • Opinion spam detection • Utility or helpfulness of reviews
Approaches • Knowledge-based approach • Uses background knowledge of linguistics to identify the sentiment polarity of a text • Background knowledge is generally represented as dictionaries capturing the sentiment of lexicon entries • Learning-based approach • Based on supervised machine learning techniques • Formulates sentiment identification as a text classification problem, typically with a bag-of-words model
Document Sentiment Classification • Classify a whole opinion document (e.g., a review) based on the overall sentiment of the opinion holder • It is basically a text classification problem • Assumption: The doc is written by a single person and expresses opinion/sentiment on a single entity. • Goal: discover (_, _, so, _, _), where e, a, h, and t are ignored • Reviews usually satisfy the assumption. • Almost all papers use reviews • Positive: 4 or 5 stars, negative: 1 or 2 stars
Document Unsupervised Classification • Data: reviews from epinions.com on automobiles, banks, movies, and travel destinations. • Three steps • Step 1 • Part-of-speech (POS) tagging • Extracting two consecutive words (two-word phrases) from reviews if their tags conform to some given patterns, e.g., the first word an adjective (JJ) and the second a noun (NN). • Step 2: Estimate the sentiment orientation (SO) of the extracted phrases • Pointwise mutual information (PMI) • Semantic orientation (SO) • Step 3: Compute the average SO of all phrases
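The SO of a phrase in this approach (Turney 2002) is estimated from how strongly it co-occurs with a positive reference word versus a negative one: SO(phrase) = PMI(phrase, “excellent”) − PMI(phrase, “poor”). Below is a minimal sketch under the assumption that some corpus or search engine supplies co-occurrence counts; the hits function is a placeholder, not a real API.

```python
import math

def so_pmi(phrase, hits, total):
    """Approximate semantic orientation via PMI against reference words.
    hits(query) must return a count from some corpus; here it is a stub."""
    def pmi(a, b):
        co = hits(f"{a} NEAR {b}") + 0.01          # smoothing to avoid log(0)
        return math.log2((co * total) / (hits(a) * hits(b) + 0.01))
    return pmi(phrase, "excellent") - pmi(phrase, "poor")

def classify_review(phrases, hits, total):
    """Step 3: average the SO of all extracted two-word phrases."""
    avg = sum(so_pmi(p, hits, total) for p in phrases) / max(len(phrases), 1)
    return "positive" if avg > 0 else "negative"
```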
Document Supervised Learning • Directly apply supervised learning techniques to classify reviews into positive and negative • Three classification techniques were tried • Naïve Bayes • Maximum entropy • Support vector machines • Pre-processing • Features: negation tag, unigram (single words), bigram, POS tag, position. • Training and test data • Movie reviews with star ratings • 4-5 stars as positive • 1-2 stars as negative • Neutral is ignored. • SVM gives the best classification accuracy based on balanced training data • 83% • Features: unigrams (bag of individual words)
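A minimal sketch of the supervised formulation with unigram (bag-of-words) features and an SVM, using scikit-learn; the toy reviews and labels stand in for star-rated movie reviews (4-5 stars positive, 1-2 negative).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data; in practice these are reviews labeled via their star ratings.
reviews = ["great phone, love the screen", "terrible battery, waste of money"]
labels = ["pos", "neg"]

clf = make_pipeline(
    CountVectorizer(binary=True),  # unigram presence features (bag of words)
    LinearSVC(),                   # SVM reported the best accuracy (~83%)
)
clf.fit(reviews, labels)
print(clf.predict(["the screen is great"]))  # -> ['pos'] on this toy data
```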
Aspect-based Sentiment Analysis • (e_j, a_jk, so_ijkl, h_i, t_l) • Aspect extraction • Goal: Given an opinion corpus, extract all aspects • A frequency-based approach (Hu and Liu, 2004) • nouns (NN) that are frequently talked about are likely to be true aspects (called frequent aspects) • Infrequent aspect extraction • To improve recall due to loss of infrequent aspects. It uses opinion words to extract them • Key idea: opinions have targets, i.e., opinion words are used to modify aspects and entities. • “The pictures are absolutely amazing.” • “This is an amazing software.” • The modifying relation was approximated with the nearest noun to the opinion word.
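A minimal sketch of the frequency-based idea and the nearest-noun heuristic; it assumes NLTK (with its tokenizer and POS-tagger models installed) purely for part-of-speech tagging, which is not prescribed by the slide.

```python
from collections import Counter
import nltk  # assumes the punkt tokenizer and averaged_perceptron_tagger are installed

def frequent_aspects(sentences, min_support=0.01):
    """Nouns that appear in at least min_support of the sentences are candidate aspects."""
    counts = Counter()
    for sent in sentences:
        tags = nltk.pos_tag(nltk.word_tokenize(sent))
        counts.update({w.lower() for w, t in tags if t.startswith("NN")})
    return {w for w, c in counts.items() if c / len(sentences) >= min_support}

def infrequent_aspects(sentences, opinion_words):
    """For sentences containing a known opinion word, take the nearest noun as its target."""
    aspects = set()
    for sent in sentences:
        tags = nltk.pos_tag(nltk.word_tokenize(sent))
        for i, (w, _) in enumerate(tags):
            if w.lower() in opinion_words:
                nouns = [(abs(j - i), wj) for j, (wj, tj) in enumerate(tags)
                         if tj.startswith("NN")]
                if nouns:
                    aspects.add(min(nouns)[1].lower())  # nearest noun approximates the target
    return aspects
```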
Aspect-based Sentiment Analysis • Using the part-of relationship and the Web • Improved (Hu and Liu, 2004) by removing those frequent noun phrases that may not be aspects: better precision (a small drop in recall). • It identifies the part-of relationship • Each noun phrase is given a pointwise mutual information score between the phrase and part discriminators associated with the product class, e.g., a scanner class. • e.g., “of scanner”, “scanner has”, etc., which are used to find parts of scanners by searching on the Web • Extract aspects using DP (Qiu et al. 2009; 2011) • A double propagation (DP) approach proposed • Based on the definition earlier, an opinion should have a target, entity or aspect. • Use the dependency of opinions & aspects to extract both aspects & opinion words. • Knowing one helps find the other. • E.g., “The rooms are spacious” • It extracts both aspects and opinion words. • A domain-independent method.
Aspect-based Sentiment Analysis • DP is a bootstrapping method • Input: a set of seed opinion words, • no aspect seeds needed • Based on dependency grammar (Tesniere 1959). • “This phone has good screen”
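Below is a heavily reduced sketch of the double-propagation idea, not the full rule set of Qiu et al.: starting from seed opinion words only, dependency links are used alternately to add new aspects (targets of known opinion words) and new opinion words (modifiers of known aspects). The triple representation and the relation names (amod, nsubj) are assumptions; a real system would take them from a dependency parser.

```python
def double_propagation(sentences_deps, seed_opinions):
    """sentences_deps: one list of (head, relation, dependent) triples per sentence.
    Bootstraps aspect and opinion lexicons starting from seed opinion words only."""
    opinions, aspects = set(seed_opinions), set()
    changed = True
    while changed:
        changed = False
        for deps in sentences_deps:
            for head, rel, dep in deps:
                # "good screen": amod(screen, good) -> the modified noun is an aspect
                if rel == "amod" and dep in opinions and head not in aspects:
                    aspects.add(head); changed = True
                # "The rooms are spacious": nsubj(spacious, rooms) -> rooms is an aspect
                if rel == "nsubj" and head in opinions and dep not in aspects:
                    aspects.add(dep); changed = True
                # a word modifying a known aspect is likely a new opinion word
                if rel == "amod" and head in aspects and dep not in opinions:
                    opinions.add(dep); changed = True
    return aspects, opinions

# "This phone has good screen": amod(screen, good) is among the dependencies
deps = [[("screen", "amod", "good"), ("has", "dobj", "screen"), ("has", "nsubj", "phone")]]
print(double_propagation(deps, {"good"}))  # -> ({'screen'}, {'good'})
```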
Aspect-based Sentiment Analysis • [Dependency-graph example from the slide: opinion words “Easy”, “Fast”, “Quite” linked to aspect words “Keeping” and “Delivery” (product “iKnow”) via subject/modify relations]
Aspect Sentiment Classification • For each aspect, identify the sentiment or opinion expressed on it. • Almost all approaches make use of opinion words and phrases. • But notice: • Some opinion words have context-independent orientations, e.g., “good” and “bad” (almost) • Some other words have context-dependent orientations, e.g., “small” and “sucks” (+ve for a vacuum cleaner) • Supervised learning • Sentence-level classification can be used, but … • Need to consider the target and thus to segment a sentence (e.g., Jiang et al. 2011) • Lexicon-based approach (Ding, Liu and Yu, 2008) • Need parsing to deal with: simple sentences, compound sentences, comparative sentences, conditional sentences, questions; different verb tenses, etc. • Negation (not), contrary (but), comparisons, etc. • A large opinion lexicon, context dependency, etc. • Easy: “Apple is doing well in this bad economy.”
Aspect Sentiment Classification • A lexicon-based method (Ding, Liu and Yu 2008) • Input: A set of opinion words and phrases. A pair (a, s), where a is an aspect and s is a sentence that contains a. • Output: whether the opinion on a in s is +ve, -ve, or neutral. • Two steps • Step 1: split the sentence if needed based on BUT words (but, except that, etc.). • Step 2: work on the segment s_f containing a. Let the set of opinion words in s_f be w_1, …, w_n. Sum up their orientations (1, -1, 0) weighted by distance, score(a, s_f) = Σ_i w_i.o / d(w_i, a), and assign the orientation to (a, s) accordingly, • where w_i.o is the opinion orientation of w_i • d(w_i, a) is the distance from a to w_i.
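A minimal sketch of the distance-weighted scoring step, assuming a tokenized segment and an opinion lexicon mapping words to +1/−1; it implements only the sum of w_i.o / d(w_i, a) described above (no BUT-clause splitting, negation, or parsing).

```python
def aspect_orientation(tokens, aspect, lexicon):
    """Score aspect `aspect` within a tokenized segment using a +1/-1 opinion lexicon."""
    a_pos = tokens.index(aspect)
    score = 0.0
    for i, w in enumerate(tokens):
        if w in lexicon and i != a_pos:
            score += lexicon[w] / abs(i - a_pos)  # w_i.o / d(w_i, a)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

tokens = "the voice quality is clear and the screen is amazing".split()
print(aspect_orientation(tokens, "voice", {"clear": +1, "amazing": +1, "terrible": -1}))
# -> 'positive'
```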
Previous Work • Jongheum Yeon, Dongjoo Lee, Junho Shim, Sang-goo Lee, Modeling of Product Review Data and Sentiment Analysis Processing, The Journal of Society for e-Business Studies, Korea, 2011 • Jongheum Yeon, Dongjoo Lee, Jaehui Park and Sang-goo Lee, A Framework for Sentiment Analysis on Smartphone Application Stores, AITS, 2012
On-Line Sentiment Analytical Processing • As opinion data grows, there is an increasing need to analyze it from multiple angles and use it for decision support, in the style of OLAP (On-Line Analytical Processing) • However, existing opinion mining techniques produce results in a fixed form, which makes it hard to analyze the data from multiple angles • Summarizing reviews into feature-level scores for prospective buyers • Judging the opinion orientation associated with a specific keyword • OLSAP: On-Line Sentiment Analytical Processing • Store opinion data in a data warehouse for decision support • Processing techniques that dynamically analyze and integrate opinion data on-line • We present a modeling approach for opinion data for OLSAP
Opinion Data Model • OLSAP models opinion data in the following form: • the target about which the opinion is expressed, e.g., “iPhone” • a detailed feature of the target, e.g., “LCD” • the opinion word for each feature, e.g., “good” • a word expressing the intensity of the opinion, e.g., “quite” • real-valued scores for the feature opinion and the opinion intensity, respectively • negative for negative opinions, positive for positive opinions • the user who expressed the opinion • the time at which the opinion was written • the location at which the opinion was written
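A minimal sketch of one record under this model; the slide does not reproduce the original symbols, so the field names below are illustrative and the example values are made up.

```python
from dataclasses import dataclass

@dataclass
class OpinionFact:
    target: str             # entity the opinion is about, e.g. "iPhone"
    feature: str            # detailed feature of the target, e.g. "LCD"
    opinion_word: str       # opinion vocabulary item, e.g. "good"
    intensity_word: str     # intensity modifier, e.g. "quite"
    feature_score: float    # real value for the feature: negative = negative opinion
    intensity_score: float  # real value for the intensity modifier
    user: str               # who expressed the opinion
    time: str               # when the opinion was written
    location: str           # where the opinion was written

fact = OpinionFact("iPhone", "LCD", "good", "quite", +1.0, +0.5,
                   "user42", "2012-03-05", "Seoul")
```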
OLSAP Modeling • OLSAP database schema • Tables associating opinion data
Related Work • Integration of Opinion into Customer Analysis Model, Eighth IEEE International Conference on e-Business Engineering, 2011
Motivation • Opinion Mining on top of Data Cubes • On-line analysis of data to provide “clients” with the “right” reviews • Interaction is the key between • the analysis of review data and the clients • Let the client decide how to view the result of analyzing the reviews • 1. No opinion mining can be perfect. • 2. The mined data itself can contain “malicious” outliers. • Data Warehousing • Data Cubes, Multidimensional Aggregation • A practical, systematic platform that gave birth to data mining. • Focus: • A more system-like approach, towards an integrated algorithm & data structure and its performance, in order to integrate OLAP with opinion mining. • In other words, no interest in traditional opinion mining issues such as natural language processing and polarity classification
Motivation • Avg_GroupBy(User=Anti1, Product=Samsung, …) means the (average) value grouped by (user, product, …) where the values of user and product are the given literals, namely Anti1 and Samsung, respectively. • ALL represents a don't-care (wildcard) value.
Motivation • To find out whether Anti1's reviews need to be considered or disregarded, we are interested in the following values: • 1. Avg_GroupBy(User=Anti1, Product=Samsung) • 2. Avg_GroupBy(User=Anti1, Product=~Samsung), • where Product=~Samsung means U − {Samsung}, i.e., all products except Samsung. • 3. Avg_GroupBy(User=Anti1, Product=ALL) • = Avg_GroupBy(User=Anti1) • 4. Avg_GroupBy(User=~Anti1, Product=Samsung) • 5. Avg_GroupBy(User=ALL, Product=Samsung) • = Avg_GroupBy(Product=Samsung)
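A minimal sketch of how these Avg_GroupBy values could be computed from a flat review table with pandas; the frame, column names, and scores are made up, and the ~value / ALL conventions follow the definitions above (ALL = simply omit the condition).

```python
import pandas as pd

reviews = pd.DataFrame({
    "user":    ["Anti1", "Anti1", "User-3", "User-4", "User-4"],
    "product": ["Samsung", "Samsung", "Samsung", "Samsung", "LG"],
    "score":   [1, 1, 3, 3, 4],
})

def avg_groupby(df, **conds):
    """Avg_GroupBy with literal values, ALL (omit the column) and ~value (complement)."""
    mask = pd.Series(True, index=df.index)
    for col, val in conds.items():
        if isinstance(val, str) and val.startswith("~"):
            mask &= df[col] != val[1:]   # Product=~Samsung means U - {Samsung}
        else:
            mask &= df[col] == val
    sel = df[mask]["score"]
    return sel.mean() if len(sel) else None  # NULL when no matching rows

print(avg_groupby(reviews, user="Anti1", product="Samsung"))   # 1.0
print(avg_groupby(reviews, user="Anti1", product="~Samsung"))  # None (NULL)
print(avg_groupby(reviews, user="Anti1"))                      # Product=ALL
print(avg_groupby(reviews, product="Samsung"))                 # User=ALL
```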
Motivation • Look into the behavior of Anti1 & Anti2 • Anti1 provides values only for Samsung, while Anti2 does for others as well. • 1) Avg_GroupBy(User=Anti1, Product=ALL) = Avg_GroupBy(User=Anti1, Product=Samsung) • i.e., Avg_GroupBy(User=Anti1, Product=~Samsung) = NULL • && • 2) Avg_GroupBy(User=Anti1, Product=Samsung) = 1 turns out to be an outlier, considering that Avg_GroupBy(User=~Anti1, Product=Samsung) = 2.85 • In this case, should only Anti1 be excluded, should Anti2 be excluded as well, or should we use the average that includes all of them, i.e., Avg_GroupBy(User=ALL, Product=Samsung) = 2.18? • In this case, User-3 also rated only Samsung; why is User-3 not an outlier? • The example is small, but assume User-3's average is close enough to the others' that it does not count as an outlier.
Look into the behavior of Anti1 & Anti2 • Anti2 provides values not only for Samsung, but for others as well. • 1) Avg_GroupBy(User=Anti2, Product=ALL) != Avg_GroupBy(User=Anti2, Product=Samsung) • i.e., Avg_GroupBy(User=Anti2, Product=~Samsung) is NOT NULL • && • 2) Avg_GroupBy(User=Anti2, Product=Samsung) and Avg_GroupBy(User=Anti2, Product=~Samsung) differ too much • && • 3) Avg_GroupBy(User=Anti2, Product=Samsung) turns out to be an outlier, considering Avg_GroupBy(User=ALL, Product=Samsung) • In this case, User-2 also rated both Samsung and other products; why is User-2 not an outlier? • Because condition 2) above is violated: User-2's scores are low across the board, i.e., a harsh rater whose score distribution is not biased toward one product.
Motivation • Summary • 1) A user reviews only a specific product group, and that user's average review score for the group differs greatly from other users' average for the same group. • 2) A user reviews both a specific product group and other groups, the user's own averages differ greatly between groups, and the user's average for the specific group differs greatly from other users' average for that group. • If the condition that the user's own group averages differ greatly (the underlined part in 2)) is not satisfied, the user is simply a harsh rater overall. • User3 should be okay! • Why? User3 reviewed only one group, but that average does not differ much from other users' average for the group. • User2 should be okay! • A harsh rater by nature. • User4 should be okay! • User4 reviewed several groups, and each group average does not differ much from other users' averages for those groups.
Research Perspectives -1 • Outlier Conditions • Most likely, we must consider some heuristics to suit the domain; here, opinion (review) data. • Condition 1 • Condition 2 • … • Condition n, in a form such as the following • Multi-variate Outlier Detection • Avg_groupby(X1=x1, X2=x2, …, Xn=xn) is an outlier only if, for Xi_c = X − {Xi}, its value exceeds Chisq[Avg_groupby(X1=x1, X2=x2, …, Xi-1=xi-1, Xi+1=xi+1, …, Xn=xn)] * Skew • Roughly speaking … • Can these conditions be input interactively by the user? (rule-based approach) • For users who are not likely to explore the interactive outlier-detection features, can a default rule be applied to give them some hints w.r.t. potential outliers?
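The Chisq/Skew condition above is only sketched, so the following is one hedged interpretation rather than the proposed rule: a cell Avg_groupby(X1=x1, …, Xn=xn) is flagged when it deviates too far from the distribution of sibling cells obtained by varying one dimension Xi, with a z-score threshold standing in for the Chisq * Skew cutoff.

```python
import statistics

def is_outlier_cell(cell_avg, sibling_avgs, z_threshold=2.5):
    """Flag a cube cell whose average deviates too far from its sibling cells.
    sibling_avgs: averages of cells sharing all group-by values except one dimension."""
    if len(sibling_avgs) < 2:
        return False  # not enough evidence to judge
    mu = statistics.mean(sibling_avgs)
    sigma = statistics.stdev(sibling_avgs)
    if sigma == 0:
        return cell_avg != mu
    return abs(cell_avg - mu) / sigma > z_threshold

# e.g. Anti1's average for Samsung vs. other users' averages for Samsung
print(is_outlier_cell(1.0, [2.9, 2.8, 3.0, 2.7]))  # -> True with this threshold
```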
Research Perspectives -2 • Outlier-conscious Aggregation – Aggregation Construction Algorithm (& Data Structures) • Data cubes are constructed to contain Avg_groupby(X1=x1, X2=x2, …, Xn=xn) for each dimension X1, …, Xn. • However, after either an interactive (manual) or a batch (heuristic, automatic) process of eliminating outliers, the cube also needs to be effectively and efficiently reconstructed to contain Avg_groupby values without those outliers. • Most likely, cubes need to maintain not only the Avg_groupby value; they also need count, sum, max, and min values. • Meanwhile: 1. Multi-variate Outlier Detection: Avg_groupby(X1=x1, X2=x2, …, Xn=xn) is an outlier only if, for Xi_c = X − {Xi}, its value exceeds Chisq[Avg_groupby(X1=x1, X2=x2, …, Xi-1=xi-1, Xi+1=xi+1, …, Xn=xn)] * Skew. 2. Check whether lower-variate aggregates also cause the outlier: |Xn-1|. In other words, Anti1 can enter outlier scores for all individual Samsung products; then avg_groupby(Anti1, Samsung) will be an outlier while avg_groupby(Anti1, Samsung, Samsung_prod1) is an outlier as well. => So find the lowest-dimension outlier and remove all the outlier elements it contains.
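A minimal sketch of why keeping count and sum (plus min/max) matters: with distributive measures, an outlier sub-cell can be subtracted from a higher-level aggregate and the remaining average recomputed without rescanning the base data. The structure, names, and numbers are assumptions.

```python
from dataclasses import dataclass

@dataclass
class CellAgg:
    count: int
    total: float
    lo: float
    hi: float

    @property
    def avg(self):
        return self.total / self.count if self.count else None

def remove_outlier(parent: CellAgg, outlier: CellAgg) -> CellAgg:
    """Recompute a higher-level aggregate after dropping an outlier sub-cell.
    count and sum are distributive, so no rescan of the base rows is needed;
    min/max may require a rescan in general, which is one reason to keep them."""
    return CellAgg(parent.count - outlier.count,
                   parent.total - outlier.total,
                   parent.lo, parent.hi)  # bounds kept only as loose estimates

# Hypothetical aggregates: all Samsung reviews vs. Anti1's Samsung reviews
samsung_all   = CellAgg(count=10, total=22.0, lo=1.0, hi=5.0)
samsung_anti1 = CellAgg(count=3,  total=3.0,  lo=1.0, hi=1.0)
print(remove_outlier(samsung_all, samsung_anti1).avg)  # ~2.71 without Anti1
```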
Research Perspectives -3 • Outlier-conscious Aggregation – Visualization of the aggregation, possible outliers, and their effects • Instead of showing only the averages • not only the average • but also the median, or • the distribution
Research Perspectives -3 • Show the distribution containing possible outliers • Show not only the mean but also the median (and mode)