
PARTIALLY SUPERVISED CLASSIFICATION OF TEXT DOCUMENTS


Presentation Transcript


  1. PARTIALLY SUPERVISED CLASSIFICATION OF TEXT DOCUMENTS authors: B. Liu, W.S. Lee, P.S. Yu, X. Li presented by Rafal Ladysz

  2. WHAT IT IS ABOUT • the paper shows: • document classification with one class of positively labeled documents • accompanied by a set of unlabeled, mixed documents • the above enables building accurate classifiers • using an EM algorithm based on NB classification • strengthening the EM with so-called “spy documents” • experimental results for illustration • we will browse through the paper and • emphasize/refresh some of its theoretical aspects • try to understand the methods described • look at the results obtained and interpret them

  3. AGENDA (informally) • problem described • document classification • PSC - general assumptions • PSC - some theory • Bayes basics • EM in general • I-EM algorithm • introducing spies • I-S-EM algorithm • selecting classifier • experimental data • results and conclusions • references

  4. KEY PROBLEM – a big picture • no labeled negative training data (text documents) • only a (small) set of relevant (positive) documents • necessity to classify unlabeled text documents • importance: • finding relevant text on the web or in digital libraries

  5. DOCUMENT CLASSIFICATION – some techniques used • kNN (Nearest Neighbors) • Linear Least Squares Fit • SVM • Naive Bayes: utilized here

  6. PARTIALLY SUPERVISED CLASSIFICATION (PSC) – theoretical foundations • fixed distribution D over the space X × Y, where Y = {0, 1} • X, Y: sets of possible documents and classes (positive and negative), respectively • an “example” is a labeled document • two sets of documents: • a set P of size n1 labeled as positive, drawn from D_{X|Y=1} • an unlabeled set M of size n2 drawn independently from D_X • remark: there might be some relevant documents in M (but we don’t know which they are!)

  7. PSC cont. • Pr_D[A]: probability of an event A ⊆ X × Y when examples are chosen randomly according to D • T: a finite sample, a subset of our dataset • Pr_T[A]: probability of A when examples are chosen randomly from T ⊆ X × Y • learning algorithm: deals with F, a class of functions F: X → {0, 1}, and selects a function f from F to be used by the classifier • probability of error: Pr[f(X) ≠ Y] = Pr[(f(X) = 1) ∧ (Y = 0)] + Pr[(f(X) = 0) ∧ (Y = 1)] • i.e. the sum of the “false positive” and “false negative” cases

  8. PSC: approximations (1) • after transforming the expression for the probability of error: Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1] • notice: Pr[Y = 1] = const (the class prior does not depend on f) • approximation 1: keeping Pr[f(X) = 0 | Y = 1] small, the learning error becomes Pr[f(X) = 1] − Pr[Y = 1] = Pr[f(X) = 1] − const, so minimizing the error amounts to minimizing Pr[f(X) = 1]
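For completeness, a short derivation of the decomposition above (my own reconstruction, using the definitions from slide 7):

```latex
\begin{align*}
\Pr[f(X)\neq Y] &= \Pr[f(X)=1,\,Y=0] + \Pr[f(X)=0,\,Y=1] \\
                &= \bigl(\Pr[f(X)=1] - \Pr[f(X)=1,\,Y=1]\bigr) + \Pr[f(X)=0,\,Y=1] \\
                &= \Pr[f(X)=1] - \bigl(\Pr[Y=1] - \Pr[f(X)=0,\,Y=1]\bigr) + \Pr[f(X)=0,\,Y=1] \\
                &= \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0 \mid Y=1]\,\Pr[Y=1]
\end{align*}
```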

  9. PSC: approximations (2) • error: Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1] • approximation 2: keeping Pr[f(X) = 0 | Y = 1] small AND minimizing Pr[f(X) = 1] ≈ minimizing Pr_M[f(X) = 1] (assumption: most documents in M are irrelevant) AND keeping Pr_P[f(X) = 1] ≥ r, where r is the recall: (relevant retrieved) / (all relevant) • this holds for large enough sets P (positive) and M (unlabeled)

  10. CONSTRAINED OPTIMIZATION • simply summarizing what has just been said: good learning results are achievable if: • the learning algorithm minimizes the number of unlabeled examples labeled as positive • the constraint that the fraction of errors on the positive examples is ≤ 1 − recall (declared upfront) is satisfied

  11. COMPLEXITY FUNCTION (CF) • VC-dimension: a complexity measure of F (the class of functions) • meaning: cardinality of the largest sample set T ⊆ X such that |F|_T| = 2^|T| • thus the larger such T, the more functions in F; conversely, the higher the VC-dimension, the richer the class F • Naive Bayes: VC-dimension 2m + 1, where m is the cardinality of the classifier’s vocabulary

  12. CF – two cases • no noise: ∃ f_t ∈ F such that ∀ (X, Y) ~ D: Y = f_t(X) (a “perfect” function) • it can be shown that selecting f^ ∈ F which minimizes Σ_{i=1..n2} f(X_i) over M AND has total recall on the set of positives P results in a function with small expected error • noise: Y may or may not equal f_t(X) • F may or may not contain the target function f_t • labels are noisy • specifying the target expected recall is required

  13. CF in noise – modus operandi • the learning algorithm tries to output f^ ∈ F such that: • E[recall(f^)] ≥ r (that’s why a target recall is required) • E[precision(f^)] ≥ the best available for f ∈ F with recall(f) ≥ r • how the algorithm achieves that: • it draws a set of positive examples from D_{X|Y=1} and unlabeled examples from D_X • it searches for a function f which minimizes Σ_{i=1..n2} f(Z_i) over the unlabeled examples • under the constraint: the fraction of errors on positives ≤ 1 − r

  14. PROBABILITY vs. LIKELIHOOD • in the Webster dictionary: apparently synonyms • from a probabilistic point of view: • {s_i}: some mutually exclusive states of nature • assuming the prior probabilities P(s_i) are known • observing experimental outcomes {o_j} gives more information • supposing P(o_j|s_i) is known: it is the likelihood of the outcome o_j given state s_i • Bayes’ theorem combines the prior probabilities with the likelihoods • and determines the posterior probability of each s_i • likelihood: probability of the observed experimental outcome under a given state

  15. NAIVE BAYES in general • formally, Bayes’ theorem can be formulated as P(S_i|O_j) = P(O_j|S_i) P(S_i) / Σ_{k=1..n} P(O_j|S_k) P(S_k) and is called the Inverse Probability Law • NB model assumptions: • words are randomly selected from the lexicon, with replacement • word independence (words as components of a feature vector) • even though simplistic, it works pretty well • NB together with EM will be employed here
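A minimal numeric illustration of the inverse probability law above; the priors and likelihoods are made-up values for a single observed outcome, not data from the paper:

```python
# Bayes' theorem: posterior P(s_i | o) from priors P(s_i) and likelihoods P(o | s_i).
# The numbers are illustrative assumptions only.
priors = {"s1": 0.7, "s2": 0.3}          # prior probabilities of the states
likelihoods = {"s1": 0.2, "s2": 0.9}     # P(o | s_i) for one observed outcome o

evidence = sum(priors[s] * likelihoods[s] for s in priors)           # denominator
posteriors = {s: priors[s] * likelihoods[s] / evidence for s in priors}
print(posteriors)   # {'s1': 0.341..., 's2': 0.658...}
```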

  16. NB-based text classification – formalism • D: training set of documents, each an ordered list of words • V = <w_1, w_2, ..., w_|V|>: the vocabulary used • w_{d_i,k} is the word ∈ V in position k of document d_i • C = {c_1, c_2, ..., c_|C|}: predefined classes, here only c_1 and c_2 • Pr[c_j|d_i]: the posterior probability needed • total probability: Pr[c_j] = Σ_i Pr[c_j|d_i] / |D| (indeed: Pr[d_i] = 1/|D|) • in the NB model: the class with the highest Pr[c_j|d_i] is assigned to the document
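A hedged sketch of the standard multinomial NB estimates these slides refer to as eq. 3–5: Laplace-smoothed word probabilities, class priors from probabilistic labels, and the posterior used for classification. Function and variable names are my own, not the paper's; later sketches in this transcript reuse these two helpers.

```python
import math
from collections import Counter

def nb_train(docs, labels, vocab):
    """docs: list of word lists; labels[i] = Pr[c1|d_i] (probabilistic label);
    vocab must contain every word occurring in docs."""
    pr_c = {"c1": sum(labels) / len(docs),                 # eq. 4: class priors
            "c2": sum(1.0 - l for l in labels) / len(docs)}
    counts = {"c1": Counter(), "c2": Counter()}
    for words, l in zip(docs, labels):
        for w in words:
            counts["c1"][w] += l          # word counts weighted by Pr[c1|d_i]
            counts["c2"][w] += 1.0 - l    # ... and by Pr[c2|d_i] = 1 - Pr[c1|d_i]
    pr_w = {}
    for c in ("c1", "c2"):                # eq. 3 with Laplace smoothing
        total = sum(counts[c].values())
        pr_w[c] = {w: (1.0 + counts[c][w]) / (len(vocab) + total) for w in vocab}
    return pr_c, pr_w

def nb_posterior_c1(words, pr_c, pr_w):
    """eq. 5: Pr[c1|d], computed in log space and normalized over the two classes."""
    scores = {c: math.log(pr_c[c]) + sum(math.log(pr_w[c][w]) for w in words)
              for c in ("c1", "c2")}
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    return exps["c1"] / (exps["c1"] + exps["c2"])
```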

  17. ITERATIVE EXPECTATION-MAXIMIZATION ALGORITHM (I-EM): the concept • a general method of estimating the maximum-likelihood parameters of an underlying distribution when the data is incomplete • two main applications of the EM algorithm: • when the data has missing values due to problems with the observation process • when optimizing the likelihood function is analytically hard • but the likelihood function can be simplified by assuming values for additional, hidden parameters

  18. I-EM – mathematically • θ^(i+1) = argmax_θ Σ_z P(Z = z | x, θ^(i)) L(x, Z = z | θ), where x is the observable data, Z represents all hidden (unknown, missing) data, and θ stands for all (sought-after) parameters • problem: determine the parameters θ on the basis of the observations x alone, i.e. without knowledge of the complete data set (x, Z) • solution: exploit x and iteratively estimate Z and θ

  19. I-EM properties • simple but computationally demanding • convergence behavior: • no guarantee of reaching the global optimum • the initial point θ^(0) determines whether the global optimum is reachable (or the algorithm gets stuck in a local optimum) • stable: the likelihood function increases in every iteration until a (local if not global) optimum is reached • maximum-likelihood estimates are fixed points of EM

  20. I-EM ALGORITHM – why and how • for the classification (the main objective) the posterior probability Pr[c_j|d_i] is needed • these probabilities converge during the iterations • EM: an iterative algorithm for maximum-likelihood estimation with incomplete data (it fills in the incomplete part) • two steps: 1. expectation: filling in the missing data 2. maximization: estimating the parameters, after which the next iteration is launched

  21. I-EM: symbols used • D: training set of documents • each document: an ordered list of words • w_{d_i,k}: the k-th word in the i-th document • each w_{d_i,k} ∈ V = {w_1, w_2, ..., w_|V|} (the vocabulary) • vocabulary: all words occurring in the documents to be classified • C = {c_1, c_2}: predefined classes (only 2)

  22. I-EM – application • initial labeling: • d_i ∈ P → c_1, i.e. Pr[c_1|d_i] = 1, Pr[c_2|d_i] = 0 • d_j ∈ M → c_2, i.e. Pr[c_1|d_j] = 0, Pr[c_2|d_j] = 1 • an NB-C is created, then applied to the dataset M: • computing the posterior probabilities Pr[c_1|d_j] in M (eq. 5) • assigning the newly computed probabilistic label to d_j • Pr[c_1|d_i] = 1 for d_i ∈ P is not affected during the process • in each iteration: • Pr[c_1|d_j] is revised, then • a new NB-C is built based on the new Pr[c_1|d_j] for M and on P • iterating continues until convergence occurs

  23. I-EM pseudocode
  I-EM(M, P)
  1. build an initial NB classifier NB-C using the M and P sets
  2. loop while the NB-C parameters keep changing (i.e. as long as convergence is still taking place)
  3.   for each document d_j ∈ M
  4.     compute Pr[c_1|d_j] using the current NB-C (eq. 5)
         // Pr[c_2|d_j] = 1 − Pr[c_1|d_j]: c_1 and c_2 are mutually exclusive and exhaustive
         // if Pr[c_1|d_j] > Pr[c_2|d_j] then d_j is classified as c_1
  5.   update Pr[w_t|c_1] and Pr[c_1] (eq. 3, 4)
       // given the probabilistically assigned classes for d_j (Pr[c_1|d_j]) and the set P,
       // a new NB-C is built during this step
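A hedged Python sketch of the loop above, reusing the nb_train and nb_posterior_c1 helpers sketched after slide 16 (names are mine, not the paper's):

```python
def i_em(M, P, vocab, max_iter=20, tol=1e-4):
    """I-EM sketch: start with P labeled c1 and M labeled c2, then alternate
    building an NB-C and relabeling M until the labels stop changing.
    Returns the final probabilistic labels Pr[c1|d_j] for the documents in M."""
    docs = P + M
    labels = [1.0] * len(P) + [0.0] * len(M)          # initial labeling (slide 22)
    for _ in range(max_iter):
        pr_c, pr_w = nb_train(docs, labels, vocab)    # M-step: eq. 3, 4
        new_labels = [1.0] * len(P) + [nb_posterior_c1(d, pr_c, pr_w) for d in M]  # E-step: eq. 5
        delta = max(abs(a - b) for a, b in zip(labels, new_labels))
        labels = new_labels
        if delta < tol:                               # NB-C parameters stopped changing
            break
    return labels[len(P):]
```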

  24. I-EM – benefits and limitations • the EM algorithm helps assign probabilistic class labels to each d_j in the mixed set of documents: Pr[c_1|d_j] and Pr[c_2|d_j] • all of the above probabilities converge over the iterations • the final result is sensitive to the assumed initial conditions • conclusion: • good handling of “easy” data (positives/negatives easily separable) • a niche for improvement on “hard” data • source of the limitation: the initialization is strongly biased towards the positive data (documents) • solution: • balanced initialization (positive/negative) • find reliable negative documents for initializing c_2 in EM

  25. I-EM: extension • I-EM helps identify the (most likely) negatives in M • issue: how to get data (documents) as reliable as possible to do so • idea: use “spy” documents from P in M • approach: • select s ≈ 10% of the documents from P; denote this set S • add the set S to the set M • the spies in S behave just as the unknown positive documents in M do • enabling inference within M • I-EM is still in use • but instead of M it operates on M ∪ S
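A minimal sketch of the spy injection (a random ~10% sample of P moved into M before running I-EM); the function name and interface are assumptions, not the paper's:

```python
import random

def add_spies(P, M, spy_ratio=0.10, seed=0):
    """Move roughly spy_ratio of P into M as spies; return the reduced positive
    set, the enlarged mixed set M + S, and the spy set S (so that the posteriors
    the spies receive can be inspected afterwards)."""
    rng = random.Random(seed)
    spy_idx = set(rng.sample(range(len(P)), max(1, int(round(spy_ratio * len(P))))))
    S = [d for i, d in enumerate(P) if i in spy_idx]
    P_rest = [d for i, d in enumerate(P) if i not in spy_idx]
    return P_rest, M + S, S
```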

  26. SPIES – determining the threshold • set of spy documents S = {s_1, s_2, ..., s_k} • a probabilistic label Pr[c_1|s_i] is assigned to each spy s_i • in the noiseless case: t = min{Pr[c_1|s_i]}, i = 1, 2, ..., k • equivalent to retrieving all spy documents • in a more realistic scenario noise and outliers exist • the minimum probability might be unreliable, because e.g. for an outlier s_i in S the posterior Pr[c_1|s_i] might be << Pr[c_1|d_j] for d_j ∈ M • setting t: • sort the s_i in S according to Pr[c_1|s_i] • set a noise level l (e.g. 15%) so that l% of the spy documents have probability < t • thus, the Step-1 objective is: • identifying a set of reliable negative documents from the unlabeled set • the unlabeled set is treated as negative data (documents)
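A sketch of the threshold selection just described: sort the posteriors assigned to the spies and place t so that the chosen noise level l of the spies falls below it (the 15% default is the slide's example value):

```python
def spy_threshold(spy_posteriors, noise_level=0.15):
    """Choose t so that about noise_level of the spies receive Pr[c1|s_i] < t.
    With noise_level = 0 this reduces to t = min(spy_posteriors), the noiseless case."""
    ranked = sorted(spy_posteriors)
    cut = int(noise_level * len(ranked))
    return ranked[cut]      # the `cut` lowest-scoring spies are written off as noise/outliers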

  27. SPY DOCUMENTS and the Step-1 algorithm • the threshold t is used for decision making: • if Pr[c_1|d_j] < t for d_j ∈ M: d_j is denoted as N (likely negative) • if Pr[c_1|d_j] ≥ t for d_j ∈ M: d_j remains U (unlabeled) • the Step-1 algorithm identifies the most likely negatives N within the unlabeled set (see the sketch below)
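Putting the previous sketches together, Step-1 can be outlined as follows (again a hedged sketch under the same assumptions, with i_em, add_spies and spy_threshold as defined earlier):

```python
def step1(P, M, vocab, spy_ratio=0.10, noise_level=0.15):
    """Step-1 sketch: inject spies, run I-EM on P and M ∪ S, set the threshold t
    from the spies' posteriors, and split M into likely negatives N and unlabeled U."""
    P_rest, M_plus_S, S = add_spies(P, M, spy_ratio)
    posteriors = i_em(M_plus_S, P_rest, vocab)      # Pr[c1|d] for every d in M ∪ S
    spy_post = posteriors[len(M):]                  # the spies were appended after M
    t = spy_threshold(spy_post, noise_level)
    N = [d for d, p in zip(M, posteriors) if p < t]     # likely negative documents
    U = [d for d, p in zip(M, posteriors) if p >= t]    # still unlabeled documents
    return N, U
```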

  28. STEP-1 effect • [diagram: BEFORE – the mixed set M (class c_2) with spies from P added, and the positive set P (class c_1); AFTER – a likely-negative set LN (c_2), a still-unlabeled set U, and the positives with the spies back in P] • initial situation: M = P ∪ N, with no clue which documents are positive and which negative; spies from P are added to M • with the help of the spies: most positives in M end up in the unlabeled set U, while most negatives end up in LN; the purity of LN is higher than that of M

  29. STEP-2: building and selecting the final classifier • EM is still in use, but now with P, LN and U • the algorithm proceeds as follows: • put all spies S back into P (where they were before) • d_i ∈ P: → c_1 (i.e. Pr[c_1|d_i] = 1); fixed through the iterations • d_j ∈ LN: → c_2 (i.e. Pr[c_2|d_j] = 1); allowed to change during EM • d_k ∈ U: initially assigned no label (it gets one after EM(1)) • run EM using P, LN and U until it converges • the final classifier is produced when EM stops • all of this constitutes S-EM (spy EM)
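A hedged sketch of the re-initialized EM run in Step-2: P is held at Pr[c_1] = 1, LN starts at Pr[c_2] = 1, and U stays unlabeled until after EM(1). It reuses the NB helpers from the slide 16 sketch and is not the paper's exact pseudocode:

```python
def step2(P, LN, U, vocab, max_iter=20, tol=1e-4):
    """Step-2 sketch: each intermediate NB-C is kept so Step-2b can pick one later."""
    # EM(1): build NB-C from P and LN only; U has no label yet
    pr_c, pr_w = nb_train(P + LN, [1.0] * len(P) + [0.0] * len(LN), vocab)
    classifiers = [(pr_c, pr_w)]
    # from EM(2) on, LN and U carry probabilistic labels that may change; P stays fixed
    docs = P + LN + U
    labels = [1.0] * len(P) + [nb_posterior_c1(d, pr_c, pr_w) for d in LN + U]
    for _ in range(max_iter):
        pr_c, pr_w = nb_train(docs, labels, vocab)
        classifiers.append((pr_c, pr_w))
        new_labels = [1.0] * len(P) + [nb_posterior_c1(d, pr_c, pr_w) for d in LN + U]
        delta = max(abs(a - b) for a, b in zip(labels, new_labels))
        labels = new_labels
        if delta < tol:
            break
    return classifiers          # Step-2b (slide 33) selects one of these
```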

  30. STEP-2: comments • the probabilities of the documents in U and LN are allowed to change • the set U participates in EM from EM(2) on, with its documents assigned probabilistic labels Pr[c_1|d_k] • experimenting with a = 5%, 10% or 20% gave similar results; why? • for the parameter a (the % used for creating LN), when it is within a range of approximately 5%–20%: if too many positives end up in LN, then EM corrects this by slowly adding them back to the positives

  31. STEP-1 AND STEP-2 SUMMARY • Step 1: identifying a set of reliable negative documents from the unlabeled set; the unlabeled set is treated as negative data. • Step 2: building and selecting a classifier; it consists of two sub-steps: a) building a set of classifiers by iteratively applying a classification algorithm (the EM algorithm is used again) b) selecting a good classifier from the set of classifiers constructed above; this sub-step may be called "catching a good classifier".

  32. SELECTING A CLASSIFIER • as said, EM is prone to the local-maxima trap • if a local maximum separates the two classes well: no problem (or rather, problem solved) • otherwise (i.e. when the positives and the negatives each consist of many clusters) the data may not be separable • remedy: stop iterating EM at some point • at what point?

  33. SELECTING A CLASSIFIER continued • eq. (2) can be helpful: the error probability Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1] • it can be shown that knowing the component Pr_M[Y = c_1] allows us to estimate the error • method: estimate the change Δ_i of the probability of error between iterations i and i+1 • Δ_i can be computed (formula in section 4.5 of the paper) • if Δ_i > 0 for the first time, then the classifier produced at iteration i is the last one to keep (no need to proceed beyond i)
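The selection rule can be sketched as below; estimate_delta stands in for the paper's section 4.5 formula for Δ_i, which is not reproduced here, so treat it as a placeholder supplied by the caller:

```python
def select_classifier(classifiers, estimate_delta):
    """Step-2b sketch: keep the classifier built just before the estimated error
    first increases. estimate_delta(i) must return the estimated change of the
    error probability between iterations i and i+1 (assumed given, not derived here)."""
    for i in range(len(classifiers) - 1):
        if estimate_delta(i) > 0:       # error estimated to grow beyond iteration i
            return classifiers[i]
    return classifiers[-1]              # estimated error never increased: keep the last one
```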

  34. EXPERIMENTAL DATA described • two large document corpora, from which 30 datasets were created • e.g. 20 Newsgroups, subdivided into 4 groups, with all headers removed • e.g. WebKB (CS departments), subdivided into 7 categories • objective: recovering the positive documents placed into the mixed sets • no need to separate a test set from the training set: the unlabeled mixed set serves as the test set

  35. DATA description cont. • for each experiment: • the full positive set is divided into two subsets: P and R • P: the positive set used in the algorithm, containing a% of the full positive set • R: the set of remaining positive documents, of which b% have been put into the mixed set M (not all of R goes into M) • belief: in reality M is large and has only a small proportion of positive documents • the parameters a and b have been varied to cover different scenarios

  36. EXPERIMENTAL RESULTS • techniques used: • NB-C: applied directly to P (c_1) and M (c_2) to build a classifier, which is then applied to classify the data in the set M • I-EM: applies the EM algorithm to P and M until convergence (no spies yet); the final classifier is applied to M to identify its positives • S-EM: spies are used to re-initialize I-EM and build the final classifier; the threshold t is used

  37. RESULTS cont. • Table 1: 30 results for different parameters a, b • Table 2: summary of averages for other a, b settings • F-score F = 2pr/(p + r), where p and r are precision and recall, respectively • S-EM outperforms NB and I-EM in F dramatically • accuracy (of a classifier) A = c/(c + i), where c and i are the numbers of correct and incorrect decisions, respectively • S-EM outperforms NB and I-EM in A as well • comment: the datasets are skewed (positives are only a small fraction), thus A is not a reliable measure of a classifier’s performance
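For reference, the two evaluation measures as small Python helpers (my own wrappers around the formulas above):

```python
def f_score(p, r):
    """F = 2pr / (p + r): the harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def accuracy(c, i):
    """A = c / (c + i), with c correct and i incorrect decisions."""
    return c / (c + i)
```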

  38. RESULTS cont. • Table 3: F-score and accuracy A • the results in this table show the great effect of re-initialization with spies: S-EM outperforms the best I-EM • re-initialization is not, however, the only factor of improvement: S-EM outperforms S-EM4 • conclusions: both Step-1 (re-initializing) and Step-2 (selecting the best model) are needed!

  39. REFERENCES other than in the paper
  http://www.cs.uic.edu/~liub/LPU/LPU-download.html
  http://www.ant.uni-bremen.de/teaching/sem/ws02_03/slides/em_mud.pdf
  http://www.mcs.vuw.ac.nz/~vignaux/docs/Adams_NLJ.html
  http://plato.stanford.edu/entries/bayes-theorem/
  http://www.math.uiuc.edu/~hildebr/361/cargoat1sol.pdf
  http://jimvb.home.mindspring.com/monthall.htm
  http://www2.sjsu.edu/faculty/watkins/mhall.htm
  http://www.aei-potsdam.mpg.de/~mpoessel/Mathe/3door.html
  http://216.239.37.104/search?q=cache:aKEOiHevtE0J:ccrma-www.stanford.edu/~jos/bayes/Bayesian_Parameter_Estimation.html+bayes+likelihood&hl=pl&ie=UTF-8&inlang=pl

  40. THANK YOU
