SIMS 290-2: Applied Natural Language Processing Marti Hearst Oct 23, 2006 (Slides developed by Preslav Nakov)
Today • Feature selection • TF.IDF Term Weighting • Weka Input File Format
Features for Text Categorization • Linguistic features • Words • lowercase? (should we convert everything to lowercase?) • normalized? (e.g. “texts” → “text”) • Phrases • Word-level n-grams • Character-level n-grams • Punctuation • Part of speech • Non-linguistic features • document formatting • informative character sequences (e.g. “<”)
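As a concrete illustration of these feature types, here is a minimal Python sketch (not part of the original slides); the example document, the tokenizer regex, and the feature choices are all illustrative assumptions.

import re
from collections import Counter

def word_ngrams(tokens, n):
    # word-level n-grams as tuples
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    # character-level n-grams
    return [text[i:i + n] for i in range(len(text) - n + 1)]

doc = "Texts are classified by their features. <html> tags can be informative!"
tokens = re.findall(r"\w+|[^\w\s]", doc.lower())     # lowercased words plus punctuation

features = Counter()
features.update(tokens)                              # word unigrams (and punctuation)
features.update(word_ngrams(tokens, 2))              # word bigrams
features.update(char_ngrams(doc.lower(), 3))         # character trigrams
features["has_angle_bracket"] = int("<" in doc)      # a non-linguistic feature
print(features.most_common(5))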
When Do We Need Feature Selection? • If the algorithm cannot handle all possible features • e.g. language identification for 100 languages using all words • text classification using n-grams • Good features can result in higher accuracy • What if we just keep all features? • Even unreliable features can be helpful • But we need to weight them: • In the extreme case, bad features get a weight of 0 (or very close to it), which is itself a form of feature selection!
Why Feature Selection? • Not all features are equally good! • Bad features: best to remove • Infrequent • unlikely to be seen again • co-occurrence with a class can be due to chance • Too frequent • mostly function words • Uniform across all categories • Good features: should be kept • Co-occur with a particular category • Do not co-occur with other categories • The rest: good to keep
Types Of Feature Selection? • Feature selection reduces the number of features • Usually: • Eliminating features • Weighting features • Normalizing features • Sometimes by transforming parameters • e.g. Latent Semantic Indexing using Singular Value Decomposition • Method may depend on problem type • For classification and filtering, may want to use information from example documents to guide selection.
Feature Selection • Task-independent methods • Document Frequency (DF) • Term Strength (TS) • Task-dependent methods • Information Gain (IG) • Mutual Information (MI) • χ² statistic (CHI) • Empirically compared by Yang & Pedersen (1997)
Yang & Pedersen Experiments • Compared feature selection methods for text categorization • 5 feature selection methods: • DF, MI, CHI, (IG, TS) • Features were just words, not phrases • 2 classifiers: • kNN: k-Nearest Neighbor • LLSF: Linear Least Squares Fit • 2 data collections: • Reuters-22173 • OHSUMED: subset of MEDLINE (1990 & 1991 used)
Document Frequency (DF) DF: number of documents a term appears in • Based on Zipf’s Law • Remove the rare terms (seen only 1–2 times): • Spurious • Unreliable – can be just noise • Unlikely to appear in new documents • Plus • Easy to compute • Task independent: do not need to know the classes • Minus • Ad hoc criterion • For some applications, rare terms can be good discriminators (e.g., in IR)
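A minimal sketch of DF-based selection, assuming a toy three-document corpus and an illustrative min_df threshold of 2 (neither is from the slides):

from collections import Counter

corpus = [
    ["the", "jaguar", "is", "a", "fast", "car"],
    ["the", "auto", "show", "features", "a", "new", "jaguar"],
    ["stocks", "fell", "on", "the", "news"],
]

# DF = number of documents a term appears in (count each term once per document)
df = Counter(term for doc in corpus for term in set(doc))

min_df = 2
kept = {t for t, n in df.items() if n >= min_df}   # drop the rare terms
print(sorted(kept))                                # ['a', 'jaguar', 'the']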
Stop Word Removal • Common words from a predefined list • Mostly from closed-class categories: • unlikely to have a new word added • include: auxiliaries, conjunctions, determiners, prepositions, pronouns, articles • But also some open-class words like numerals • Bad discriminators • uniformly spread across all classes • can be safely removed from the vocabulary • Is this always a good idea? (e.g. author identification)
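A tiny stop-word filter sketch; the stop-word list here is a small illustrative sample, not a real predefined list:

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "on"}

def remove_stop_words(tokens):
    # drop closed-class words that are spread uniformly across classes
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "jaguar", "is", "a", "fast", "car"]))   # ['jaguar', 'fast', 'car']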
χ² statistic (CHI) • The χ² statistic (pronounced “kai square”) • A commonly used method of comparing proportions • Measures the lack of independence between a term and a category (Yang & Pedersen)
χ² statistic (CHI) Is “jaguar” a good predictor for the “auto” class? Observed document counts: • jaguar & auto: 2 • jaguar & ¬auto: 3 • ¬jaguar & auto: 500 • ¬jaguar & ¬auto: 9500 We want to compare: • the observed counts above; and • the null hypothesis: that jaguar and auto are independent
χ² statistic (CHI) Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect? • If independent: Pr(j,a) = Pr(j) · Pr(a) • So the expected count is N · Pr(j,a) = N · Pr(j) · Pr(a) • Pr(j) = (2+3)/N; Pr(a) = (2+500)/N; N = 2+3+500+9500 = 10005 • Expected count = N · (5/N) · (502/N) = 2510/N = 2510/10005 ≈ 0.25
χ² statistic (CHI) Under the null hypothesis (jaguar and auto independent), the expected counts fe alongside the observed counts fo: • jaguar & auto: fo = 2, fe ≈ 0.25 • jaguar & ¬auto: fo = 3, fe ≈ 4.75 • ¬jaguar & auto: fo = 500, fe ≈ 501.75 • ¬jaguar & ¬auto: fo = 9500, fe ≈ 9498.25
χ² statistic (CHI) χ² sums (fo − fe)²/fe over all table entries: χ² = (2−0.25)²/0.25 + (3−4.75)²/4.75 + (500−501.75)²/501.75 + (9500−9498.25)²/9498.25 ≈ 12.9 The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the critical value for .999 confidence).
χ² statistic (CHI) There is a simpler closed-form formula for χ²: χ²(t,c) = N · (AD − CB)² / ((A+C)(B+D)(A+B)(C+D)) where A = # documents containing t in class c, B = # documents containing t outside c, C = # documents in c without t, D = # documents with neither, and N = A + B + C + D
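A quick check of this closed-form formula on the jaguar/auto counts from the worked example (A=2, B=3, C=500, D=9500); the sketch itself is not from the slides:

def chi_square(A, B, C, D):
    # A: docs with term and class, B: term without class,
    # C: class without term,       D: neither
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

print(round(chi_square(2, 3, 500, 9500), 2))   # ~12.85; the slide's 12.9 uses the rounded fe = 0.25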
χ² statistic (CHI) How do we use χ² for multiple categories? Compute χ² for each category and then combine: • To require a feature to discriminate well across all categories, take the expected value of χ²: χ²_avg(t) = Σi Pr(ci) · χ²(t, ci) • Or, to weight for a single category, take the maximum: χ²_max(t) = maxi χ²(t, ci)
χ² statistic (CHI) • Pluses • normalized and thus comparable across terms • χ²(t,c) is 0 when t and c are independent • can be compared to the χ² distribution with 1 degree of freedom • Minuses • unreliable for low-frequency terms
Information Gain • A measure of the importance of a feature for predicting the presence of a class • Has an information-theoretic justification • Defined as: the number of “bits of information” gained by knowing whether the term is present or absent • Based on Information Theory • We won’t go into this in detail here.
Information Gain (IG) IG: number of bits of information gained by knowing whether the term is present or absent; t is the term being scored, ci is a class variable IG(t) = H(c) − Pr(t) · H(c|t) − Pr(¬t) · H(c|¬t) • H(c): entropy of the class distribution • H(c|t): specific conditional entropy given the term is present • H(c|¬t): specific conditional entropy given the term is absent
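A sketch of this decomposition computed from a 2×2 term/class table, reusing the illustrative jaguar/auto counts (A=2, B=3, C=500, D=9500); not from the slides:

from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def information_gain(A, B, C, D):
    N = A + B + C + D
    H_c = entropy([(A + C) / N, (B + D) / N])        # H(c): class entropy
    H_c_t = entropy([A / (A + B), B / (A + B)])      # H(c|t): term present
    H_c_not = entropy([C / (C + D), D / (C + D)])    # H(c|~t): term absent
    p_t = (A + B) / N
    return H_c - p_t * H_c_t - (1 - p_t) * H_c_not

print(information_gain(2, 3, 500, 9500))   # a small positive number of bits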
Mutual Information (MI) MI compares: • the probability of seeing x and y together, vs. • the probability of seeing x anywhere times the probability of seeing y anywhere (independently) MI = log( P(x,y) / (P(x)P(y)) ) = log P(x,y) − log( P(x)P(y) ) From the product rule, P(x,y) = P(x|y)P(y), so: MI = log( P(x|y)P(y) ) − log( P(x)P(y) ) = log P(x|y) − log P(x)
Mutual Information (MI) Rare terms get higher scores. Document-count approximation: I(t,c) ≈ log( A · N / ((A+C)(A+B)) ) where A = # documents with both t and c, B = # documents with t but not c, C = # documents with c but not t, and N = A + B + C + D • Does not use term absence
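A sketch of this approximation on the same illustrative counts (natural log is used here; the slides do not fix a base, and the base only rescales scores); not from the slides:

from math import log

def mutual_information(A, B, C, D):
    # I(t,c) ~ log(A*N / ((A+C)(A+B))); term absence (D) only enters through N
    N = A + B + C + D
    return log(A * N / ((A + C) * (A + B)))

print(round(mutual_information(2, 3, 500, 9500), 2))   # ~2.08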
Using Mutual Information • Compute MI for each category and then combine • To discriminate well across all categories, take the expected value of MI: MI_avg(t) = Σi Pr(ci) · I(t, ci) • To discriminate well for a single category, take the maximum: MI_max(t) = maxi I(t, ci)
Mutual Information • Pluses • I(t,c) is 0, when t and c are independent • Has a sound information-theoretic interpretation • Minuses • Small numbers produce unreliable results • Does not use term absence
[Figure from Yang & Pedersen ’97: classification performance as the number of selected features varies, with curves for CHI max, IG, and DF, for term strength, and for mutual information.]
Feature Comparison DF, IG and CHI are good and strongly correlated • thus using DF is good, cheap and task independent • can be used when IG and CHI are too expensive • MI is bad • favors rare terms (which are typically bad)
Term Weighting • In the study just shown, terms were (mainly) treated as binary features • If a term occurred in a document, it was assigned 1 • Else 0 • Often it is useful to weight the selected features • Standard technique: tf.idf
TF.IDF Term Weighting • TF: term frequency • definition: TF = tij • frequency of term i in document j • purpose: makes the frequent words of the document more important • IDF: inverse document frequency • definition: IDF = log(N/ni) • ni: number of documents containing term i • N: total number of documents • purpose: makes words that are rare across documents more important • TF.IDF (for term i in document j) • definition: tij × log(N/ni)
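A minimal tf.idf sketch over a toy corpus following the definitions above (the corpus and the log base are illustrative assumptions):

from collections import Counter
from math import log

corpus = [
    ["the", "jaguar", "is", "a", "fast", "car"],
    ["the", "auto", "show", "features", "a", "new", "jaguar"],
    ["stocks", "fell", "on", "the", "news"],
]

N = len(corpus)
df = Counter(t for doc in corpus for t in set(doc))   # n_i

def tf_idf(doc):
    tf = Counter(doc)                                 # t_ij
    return {t: tf[t] * log(N / df[t]) for t in tf}

print(tf_idf(corpus[0]))   # 'the' gets weight 0: it appears in every document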
Term Normalization • Combine different words into a single representation • Stemming/morphological analysis • bought, buy, buys -> buy • General word categories • $23.45, 5.30 Yen -> MONEY • 1984, 10,000 -> DATE, NUM • PERSON • ORGANIZATION • (Covered in Information Extraction segment) • Generalize with lexical hierarchies • WordNet, MeSH • (Covered later in the semester)
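A sketch of two of these normalizations: regex-based word classes (MONEY, NUM) and verb lemmatization. The NLTK lemmatizer is an assumption (it requires nltk and the downloaded WordNet data), and the regexes are only illustrative, not a full tokenizer:

import re
from nltk.stem import WordNetLemmatizer   # assumed available: nltk + WordNet data

lemmatizer = WordNetLemmatizer()

def normalize(token):
    # collapse money and number expressions into word classes
    if re.fullmatch(r"\$?\d[\d,]*(\.\d+)?(\s*(Yen|USD))?", token):
        return "MONEY" if "$" in token or "Yen" in token else "NUM"
    # morphological normalization: bought, buys -> buy
    return lemmatizer.lemmatize(token.lower(), pos="v")

print([normalize(t) for t in ["bought", "buys", "$23.45", "1984"]])   # ['buy', 'buy', 'MONEY', 'NUM']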
What Do People Do In Practice? • Feature selection: • infrequent term removal • infrequent across the whole collection (i.e. DF) • seen in only a single document • most frequent term removal (i.e. stop words) • Normalization: • Stemming (often) • Word classes (sometimes) • Feature weighting: TF.IDF or IDF • Dimensionality reduction (sometimes)
Weka • Java-based tool for large-scale machine-learning problems • Tailored towards text analysis • http://weka.sourceforge.net/wekadoc/
Weka Input Format • Expects a particular input file format • Called ARFF: Attribute-Relation File Format • Consists of a Header and a Data section http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6)
WEKA File Format: ARFF
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male }
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes }
@attribute class { present, not_present }
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
• age and cholesterol are numerical attributes • sex, chest_pain_type, exercise_induced_angina, and class are nominal attributes • Other attribute types: String, Date • “?” marks a missing value
http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6) Slide adapted from Eibe Frank's
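For concreteness, a small Python sketch (not from the slides) that writes a file in this format; the rows simply mirror two lines of the example above:

rows = [
    (63, "male", "typ_angina", 233, "no", "not_present"),
    (38, "female", "non_anginal", "?", "no", "not_present"),   # '?' = missing value
]

header = (
    "@relation heart-disease-simplified\n"
    "@attribute age numeric\n"
    "@attribute sex { female, male }\n"
    "@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }\n"
    "@attribute cholesterol numeric\n"
    "@attribute exercise_induced_angina { no, yes }\n"
    "@attribute class { present, not_present }\n"
    "\n@data\n"
)

with open("heart.arff", "w") as f:
    f.write(header)
    for row in rows:
        f.write(",".join(str(v) for v in row) + "\n")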
WEKA Sparse File Format • The value 0 is not represented explicitly • Same header (i.e. @relation and @attribute tags) • Only the @data section is different • Instead of:
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
• We have:
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
• This saves LOTS of space for text applications. • Why?
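A sketch of emitting sparse @data lines from dense value lists (indices are 0-based); the two rows reproduce the example above:

def sparse_line(values):
    # keep only the non-zero attributes as "index value" pairs
    pairs = ["%d %s" % (i, v) for i, v in enumerate(values) if v != 0]
    return "{" + ", ".join(pairs) + "}"

print(sparse_line([0, "X", 0, "Y", '"class A"']))   # {1 X, 3 Y, 4 "class A"}
print(sparse_line([0, 0, "W", 0, '"class B"']))     # {2 W, 4 "class B"}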
Next Time • Wed: Guest lecture by Peter Jackson: • Pure and Applied Research in NLP: The Good, the Bad, and the Lucky. • Following week: • Text Categorization Algorithms • How to use Weka