807 - TEXT ANALYTICS

807 - TEXT ANALYTICS Massimo PoesioLecture 2: Text Classification

TEXT CLASSIFICATION From: "" <takworlld@hotmail.com> Subject: real estate is the only way... gem oalvgkay Anyone can buy real estate with no money down Stop paying rent TODAY ! There is no need to spend hundreds or even thousands for similar courses I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW ! ================================================= Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm =================================================

TAGS FOR TEXT MANAGEMENT

HOW DOCUMENT TAGS ARE PRODUCED • Traditionally: by HAND by EXPERTS (curators) • E.g., Yahoo!, Looksmart, about.com, ODP, Medline • Also: Bridgeman, UK Data Archive • very accurate when job is done by experts • consistent when the problem size and team is small • difficult and expensive to scale • Now: • AUTOMATICALLY • By CROWDS

CLASSIFICATION AND CATEGORIZATION • Given: • A description of an instance, xX, where X is the instance language or instance space. • Issue: how to represent text documents. • A fixed set of categories: C ={c1, c2,…, cn} • Determine: • The category of x: c(x)C, where c(x) is a categorization function whose domain is X and whose range is C. • We want to know how to build categorization functions (“classifiers”).

AUTOMATIC TEXT CLASSIFICATION • Supervised • Training set • Test set • Clustering: next week

Text Categorization: attributes • Representations of text are very high dimensional (one feature for each word). • High-bias algorithms that prevent overfitting in high-dimensional space are best. • For most text categorization tasks, there are many irrelevant and many relevant features. • Methods that combine evidence from many or all features (e.g. naive Bayes, kNN, neural-nets) tend to work better than ones that try to isolate just a few relevant features (standard decision-tree or rule induction)* *Although one can compensate by using many rules

Bayesian Methods • Learning and classification methods based on probability theory. • Bayes theorem plays a critical role in probabilistic learning and classification. • Build a generative model that approximates how data is produced • Uses prior probability of each category given no information about an item. • Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Bayes’ Rule

Maximum a posteriori Hypothesis

Maximum likelihood Hypothesis If all hypotheses are a priori equally likely,we only need to consider the P(D|h) term:

Naive Bayes Classifiers Task: Classify a new instance based on a tuple of attribute values

Naïve Bayes Classifier: Assumptions • P(cj) • Can be estimated from the frequency of classes in the training examples. • P(x1,x2,…,xn|cj) • O(|X|n•|C|) • Could only be estimated if a very, very large number of training examples was available. Conditional Independence Assumption:  Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.

Flu X1 X2 X3 X4 X5 runnynose sinus cough fever muscle-ache The Naïve Bayes Classifier • Conditional Independence Assumption: features are independent of each other given the class:

C X1 X2 X3 X4 X5 X6 Learning the Model • Common practice:maximum likelihood • simply use the frequencies in the data

Flu X1 X2 X3 X4 X5 runnynose sinus cough fever muscle-ache Problem with Max Likelihood • What if we have seen no training cases where patient had no flu and muscle aches? • Zero probabilities cannot be conditioned away, no matter the other evidence!

Smoothing to Avoid Overfitting # of values of Xi overall fraction in data where Xi=xi,k • Somewhat more subtle version extent of “smoothing”

Using Naive Bayes Classifiers to Classify Text: Basic method • Attributes are text positions, values are words. • Naive Bayes assumption is clearly violated. • Example? • Still too many possibilities • Assume that classification is independent of the positions of the words • Use same parameters for each position

Text Classification Algorithms: Learning • From training corpus, extract Vocabulary • Calculate required P(cj)and P(xk | cj)terms • For each cj in Cdo • docsjsubset of documents for which the target class is cj • Textj single document containing all docsj • for each word xkin Vocabulary • nk number of occurrences ofxkin Textj

Text Classification Algorithms: Classifying • positions  all word positions in current document which contain tokens found in Vocabulary • Return cNB, where

Underflow Prevention • Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow. • Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. • Class with highest final un-normalized log probability score is still the most probable.

Naïve Bayes Posterior Probabilities • Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate. • However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not. • Output probabilities are generally very close to 0 or 1.

Feature Selection • Text collections have a large number of features • 10,000 – 1,000,000 unique words – and more • Make using a particular classifier feasible • Some classifiers can’t deal with 100,000s of feat’s • Reduce training time • Training time for some methods is quadratic or worse in the number of features (e.g., logistic regression) • Improve generalization • Eliminate noise features • Avoid overfitting

Recap: Feature Reduction • Standard ways of reducing feature space for text • Stemming • Laugh, laughs, laughing, laughed -> laugh • Stop word removal • E.g., eliminate all prepositions • Conversion to lower case • Tokenization • Break on all special characters: fire-fighter -> fire, fighter

Feature Selection • Yang and Pedersen 1997 • Comparison of different selection criteria • DF – document frequency • IG – information gain • MI – mutual information • CHI – chi square • Common strategy • Compute statistic for each term • Keep n terms with highest value of this statistic

Feature selection via Mutual Information • We might not want to use all words, but just reliable, good discriminators • In training set, choose k words which best discriminate the categories. • One way is in terms of Mutual Information: • For each word w and each category c

(Pointwise) Mutual Information

Feature selection via MI (contd.) • For each category we build a list of k most discriminating terms. • For example (on 20 Newsgroups): • sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, … • rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, … • Greedy: does not account for correlations between terms • In general feature selection is necessary for binomial NB, but not for multinomial NB

Information Gain

Chi-Square X^2 = N(AD-BC)^2 / ( (A+B) (A+C) (B+D) (C+D) ) Use either maximum or average X^2 Value for complete independence?

Document Frequency • Number of documents a term occurs in • Is sometimes used for eliminating both very frequent and very infrequent terms • How is document frequency measure different from the other 3 measures?

Yang&Pedersen: Experiments • Two classification methods • kNN (k nearest neighbors; more later) • Linear Least Squares Fit • Regression method • Collections • Reuters-22173 • 92 categories • 16,000 unique terms • Ohsumed: subset of medline • 14,000 categories • 72,000 unique terms • Ltc term weighting

Yang&Pedersen: Experiments • Choose feature set size • Preprocess collection, discarding non-selected features / words • Apply term weighting -> feature vector for each document • Train classifier on training set • Evaluate classifier on test set

Discussion • You can eliminate 90% of features for IG, DF, and CHI without decreasing performance. • In fact, performance increases with fewer features for IG, DF, and CHI. • Mutual information is very sensitive to small counts. • IG does best with smallest number of features. • Document frequency is close to optimal. By far the simplest feature selection method. • Similar results for LLSF (regression).

Results Why is selecting common terms a good strategy?

IG, DF, CHI Are Correlated.

Information Gain vs Mutual Information • Information gain is similar to MI for random variables • Independence? • In contrast, pointwise MI ignores non-occurrence of terms • E.g., for complete dependence, you get: • P(AB)/ (P(A)P(B)) = 1/P(A) – larger for rare terms than for frequent terms • Yang&Pedersen: Pointwise MI favors rare terms

Feature Selection:Other Considerations • Generic vs Class-Specific • Completely generic (class-independent) • Separate feature set for each class • Mixed (a la Yang&Pedersen) • Maintainability over time • Is aggressive features selection good or bad for robustness over time? • Ideal: Optimal features selected as part of training

Yang&Pedersen: Limitations • Don’t look at class specific feature selection • Don’t look at methods that can’t handle high-dimensional spaces • Evaluate category ranking (as opposed to classification accuracy)

Feature Selection: Other Methods • Stepwise term selection • Forward • Backward • Expensive: need to do n^2 iterations of training • Term clustering • Dimension reduction: PCA / SVD

FEATURES IN SPAM ASSASSIN • SpamAssassin is a spam filtering program written in Perl, developed by Justin Mason in 1997 and entirely rewritten by Mark Jeftovic in 2001 when it went on Sourceforge – now maintained by the Apache foundation • It uses Naïve Bayes and a variety of other methods • It uses features coming both from content and from other properties of the email message

SpamAssassin Features 100 From: address is in the user's black-list 4.0 Sender is on www.habeas.com Habeas Infringer List 3.994 Invalid Date: header (timezone does not exist) 3.970 Written in an undesired language 3.910 Listed in Razor2, see http://razor.sf.net/ 3.801 Subject is full of 8-bit characters 3.472 Claims compliance with Senate Bill 1618 3.437 exists:X-Precedence-Ref 3.371 Reverses Aging 3.350 Claims you can be removed from the list 3.284 'Hidden' assets 3.283 Claims to honor removal requests 3.261 Contains "Stop Snoring" 3.251 Received: contains a name with a faked IP-address 3.250 Received via a relay in list.dsbl.org 3.200 Character set indicates a foreign language

SpamAssassin Features 3.198 Forged eudoramail.com 'Received:' header found 3.193 Free Investment 3.180 Received via SBLed relay, seehttp://www.spamhaus.org/sbl/ 3.140 Character set doesn't exist 3.123 Dig up Dirt on Friends 3.090 No MX records for the From: domain 3.072 X-Mailer contains malformed Outlook Expressversion 3.044 Stock Disclaimer Statement 3.009 Apparently, NOT Multi Level Marketing 3.005 Bulk email software fingerprint (jpfree) found inheaders 2.991 exists:Complain-To 2.975 Bulk email software fingerprint (VC_IPA) found inheaders 2.968 Invalid Date: year begins with zero 2.932 Mentions Spam law "H.R. 3113" 2.900 Received forged, contains fake AOL relays 2.879 Asks for credit card details

Measuring Performance • Classification accuracy: What % of messages were classified correctly? • Is this what we care about? • Which system do you prefer?

Measuring Performance • Precision = good messages kept all messages kept • Recall =good messages kept all good messages Trade off precision vs. recall by setting threshold Measure the curve on annotated dev data (or test data) Choose a threshold where user is comfortable

F-measure = 1 / (average(1/precision, 1/recall)) OK for search engines (maybe) would prefer to be here! point where precision=recall (sometimes reported) OK for spam filtering and legal search Measuring Performance high threshold: all we keep is good, but we don’t keep much low threshold: keep all the good stuff,but a lot of the bad too 600.465 - Intro to NLP - J. Eisner

Micro- vs. Macro-Averaging • If we have more than one class, how do we combine multiple performance measures into one quantity? • Macroaveraging: Compute performance for each class, then average. • Microaveraging: Collect decisions for all classes, compute contingency table, evaluate.

Micro- vs. Macro-Averaging: Example Class 1 Class 2 Micro.Av. Table • Macroaveraged precision: (0.5 + 0.9)/2 = 0.7 • Microaveraged precision: 100/120 = .83 • Why this difference?

Confusion matrix • Function of classifier, topics and test docs. • For a perfect classifier, all off-diagonal entries should be zero. • For a perfect classifier, if there are n docs in category j than entry (j,j) should be n. • Straightforward when there is 1 category per document. • Can be extended to n categories per document.

807 - TEXT ANALYTICS