
Chapter 7: Text mining



Presentation Transcript


  1. Chapter 7: Text mining Bing Liu

  2. Text mining • It refers to data mining using text documents as data. • There are many special techniques for pre-processing text documents to make them suitable for mining. • Most of these techniques are from the field of “Information Retrieval”. Bing Liu

  3. Information Retrieval (IR) • Conceptually, information retrieval (IR) is the study of finding needed information. I.e., IR helps users find information that matches their information needs. • Historically, information retrieval is about document retrieval, emphasizing document as the basic unit. • Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information. • IR has become a center of focus in the Web era. Bing Liu

  4. The IR process (diagram): the user translates information needs into queries; the system matches the queries against the stored information; query result evaluation asks whether the information found matches the user’s information needs. Bing Liu

  5. Text Processing • Word (token) extraction • Stop words • Stemming • Frequency counts Bing Liu

  6. Stop words • Many of the most frequently used words in English are worthless in IR and text mining – these words are called stop words. • the, of, and, to, … • Typically about 400 to 500 such words • For an application, an additional domain-specific stop word list may be constructed • Why do we need to remove stop words? • Reduce indexing (or data) file size • stop words account for 20-30% of total word counts • Improve efficiency • stop words are not useful for searching or text mining • stop words always have a large number of hits Bing Liu
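
As an illustration that is not in the original slides, here is a minimal Python sketch of stop word removal; the STOP_WORDS set is a tiny made-up list, not the 400-500 word lists used in practice.

```python
# Minimal sketch of stop word removal; STOP_WORDS is a tiny illustrative
# list, not a full 400-500 word stop word list.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "it"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the study of information retrieval and text mining".split()
print(remove_stop_words(tokens))
# ['study', 'information', 'retrieval', 'text', 'mining']
```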

  7. Stemming • Techniques used to find the root/stem of a word • E.g., users, used, using → stem: use; engineering, engineered → stem: engineer • Usefulness: • improving effectiveness of IR and text mining • matching similar words • reducing indexing size • combining words with the same root may reduce indexing size by as much as 40-50%. Bing Liu

  8. Basic stemming methods • remove ending • if a word ends with a consonant other than s, followed by an s, then delete s. • if a word ends in es, drop the s. • if a word ends in ing, delete the ing unless the remaining word consists only of one letter or of th. • If a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter. • …... • transform words • if a word ends with “ies” but not “eies” or “aies” then “ies --> y.” Bing Liu
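
The toy stemmer below is a rough sketch of only the suffix rules listed on this slide (it is not the Porter stemmer, and some words, e.g. “using”, will not reduce to the same stem a full stemmer would give).

```python
# Toy stemmer implementing only the rules on this slide; real systems use a
# complete algorithm such as Porter's stemmer.
VOWELS = set("aeiou")

def simple_stem(word):
    w = word.lower()
    # "ies" -> "y", but not "eies" or "aies"
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # drop "ing" unless what remains is a single letter or "th"
    if w.endswith("ing") and len(w) > 4 and w[:-3] != "th":
        return w[:-3]
    # drop "ed" when preceded by a consonant, unless a single letter remains
    if w.endswith("ed") and len(w) > 3 and w[-3] not in VOWELS:
        return w[:-2]
    # word ends in "es": drop the s
    if w.endswith("es"):
        return w[:-1]
    # trailing "s" preceded by a consonant other than "s": drop the s
    if w.endswith("s") and len(w) > 1 and w[-2] not in VOWELS and w[-2] != "s":
        return w[:-1]
    return w

print([simple_stem(w) for w in ["users", "engineering", "engineered", "studies"]])
# ['user', 'engineer', 'engineer', 'study']
```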

  9. Frequency counts • Count the number of times a word occurs in a document. • Count the number of documents in a collection that contain a word. • Use occurrence frequencies to indicate the relative importance of a word in a document: • if a word appears often in a document, the document likely “deals with” subjects related to the word. Bing Liu
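
A small sketch of the two counts described above, on an invented three-document collection:

```python
from collections import Counter

# Term frequency within each document and document frequency across the
# collection, on a toy corpus.
docs = [
    "text mining uses data mining techniques",
    "information retrieval helps users find information",
    "data mining and text mining overlap",
]
tokenized = [d.split() for d in docs]

tf = [Counter(tokens) for tokens in tokenized]                 # per-document term counts
df = Counter(w for tokens in tokenized for w in set(tokens))   # documents containing each word

print(tf[0]["mining"])   # 2: "mining" occurs twice in the first document
print(df["mining"])      # 2: "mining" appears in two of the three documents
```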

  10. Vector Space Representation • A document is represented as a vector: • (W1, W2, … , Wn) • Binary: • Wi = 1 if the corresponding term i (often a word) is in the document • Wi = 0 if the term i is not in the document • TF (Term Frequency): • Wi = tfi, where tfi is the number of times term i occurred in the document • TF*IDF (Term Frequency * Inverse Document Frequency): • Wi = tfi*idfi = tfi*log(N/dfi), where dfi is the number of documents containing term i, and N is the total number of documents in the collection. Bing Liu
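
A minimal sketch of the TF*IDF weighting above; the natural logarithm is used here since the slide does not fix a log base.

```python
import math
from collections import Counter

# TF*IDF weights: wi = tfi * log(N / dfi), computed per document.
def tfidf_vectors(tokenized_docs):
    n_docs = len(tokenized_docs)
    df = Counter(w for doc in tokenized_docs for w in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors

docs = [["text", "mining", "mining"], ["data", "mining"], ["information", "retrieval"]]
for vec in tfidf_vectors(docs):
    print(vec)   # e.g. "mining" gets a lower idf because it appears in two documents
```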

  11. Vector Space and Document Similarity • Each indexing term is a dimension. An indexing term is normally a word. • Each document is a vector • Di = (ti1, ti2, ti3, ti4, ..., tin) • Dj = (tj1, tj2, tj3, tj4, ..., tjn) • Document similarity is defined as the cosine of the angle between the two vectors: Sim(Di, Dj) = (Σk tik*tjk) / (sqrt(Σk tik^2) * sqrt(Σk tjk^2)) Bing Liu
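
A short sketch of this cosine similarity over two term-weight vectors:

```python
import math

# Cosine similarity between two equal-length document vectors.
def cosine(di, dj):
    dot = sum(a * b for a, b in zip(di, dj))
    norm_i = math.sqrt(sum(a * a for a in di))
    norm_j = math.sqrt(sum(b * b for b in dj))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

print(round(cosine([1, 1, 0], [1, 1, 1]), 2))  # 0.82
```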

  12. Query formats • Query is a representation of the user’s information needs • Normally a list of words. • Query as a simple question in natural language • The system translates the question into executable queries • Query as a document • “Find similar documents like this one” • The system defines what the similarity is Bing Liu

  13. An Example • A document space is defined by three terms: • hardware, software, users • A set of documents are defined as: • A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1) • A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1) • A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1) • If the query is “hardware and software” • what documents should be retrieved? Bing Liu

  14. An Example (cont.) • In Boolean query matching: • document A4, A7 will be retrieved (“AND”) • retrieved:A1, A2, A4, A5, A6, A7, A8, A9 (“OR”) • In similarity matching (cosine): • q=(1, 1, 0) • S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0 • S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5 • S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5 • Document retrieved set (with ranking)= • {A4, A7, A1, A2, A5, A6, A8, A9} Bing Liu
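
The sketch below reproduces this slide’s numbers (Boolean AND matches and the cosine ranking) on the nine toy documents:

```python
import math

# Reproduces the Boolean and cosine results on this slide; cosine() is the
# same similarity used in the sketch after slide 11.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

docs = {"A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
        "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
        "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1)}
q = (1, 1, 0)  # query "hardware and software" over (hardware, software, users)

print([d for d, v in docs.items() if v[0] and v[1]])       # Boolean AND: ['A4', 'A7']
scores = {d: round(cosine(q, v), 2) for d, v in docs.items()}
print(sorted(scores, key=scores.get, reverse=True))        # ranking: A4, A7, A1, A2, ...
```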

  15. Relevance judgment for IR • A measurement of the outcome of a search or retrieval • The judgment on what should or should not be retrieved. • There is no simple answer to what is relevant and what is not relevant: we need human users. • difficult to define • subjective • depending on knowledge, needs, time, etc. • The central concept of information retrieval Bing Liu

  16. Precision and Recall • Given a query: • Are all retrieved documents relevant? • Have all the relevant documents been retrieved? • Measures for system performance: • The first question is about the precision of the search • The second is about the completeness (recall) of the search. Bing Liu

  17. Precision and Recall (cont.) • Contingency table: a = relevant and retrieved, b = not relevant but retrieved, c = relevant but not retrieved, d = not relevant and not retrieved • P = a / (a + b), R = a / (a + c) Bing Liu

  18. Precision and Recall (cont.) • Precision = (number of relevant documents retrieved) / (total number of documents retrieved) • Recall = (number of relevant documents retrieved) / (number of all the relevant documents in the database) • Precision measures how precise a search is: • the higher the precision, • the fewer unwanted documents. • Recall measures how complete a search is: • the higher the recall, • the fewer missing documents. Bing Liu

  19. Relationship of R and P • Theoretically, • R and P do not depend on each other. • Practically, • high recall is achieved at the expense of precision. • high precision is achieved at the expense of recall. • When will P = 0? • Only when none of the retrieved documents is relevant. • When will P = 1? • Only when every retrieved document is relevant. • Depending on the application, you may want a higher precision or a higher recall. Bing Liu

  20. P-R diagram (figure): precision-recall curves for Systems A, B, and C, with precision on the vertical axis and recall on the horizontal axis. Bing Liu

  21. Alternative measures • Combining recall and precision, the F score: F = 2PR / (P + R) • Breakeven point: when P = R • These two measures are commonly used in text mining: classification and clustering. • Accuracy is not normally used in the text domain because the set of relevant documents is almost always very small compared to the set of irrelevant documents. Bing Liu
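
A minimal sketch of precision, recall, and the F score, computed from invented retrieved/relevant document-id sets:

```python
# Precision, recall, and F score over retrieved / relevant id sets.
def precision_recall_f(retrieved, relevant):
    hits = len(retrieved & relevant)               # a: relevant documents retrieved
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0      # F = 2PR / (P + R)
    return p, r, f

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
print(precision_recall_f(retrieved, relevant))     # approx (0.5, 0.67, 0.57)
```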

  22. Web Search as a huge IR system • A Web crawler (robot) crawls the Web to collect all the pages. • Servers establish a huge inverted indexing database and other indexing databases • At query (search) time, search engines conduct different types of vector query matching Bing Liu
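
As a rough, not-from-the-slides sketch of the inverted index idea: map each term to the set of documents that contain it, and answer a two-term query by intersecting the two sets.

```python
from collections import defaultdict

# Minimal inverted index: term -> set of document ids containing the term.
def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.split():
            index[term].add(doc_id)
    return index

docs = ["web search engines", "text mining on the web", "search and retrieval"]
index = build_inverted_index(docs)
print(sorted(index["web"] & index["search"]))   # [0]: only doc 0 contains both terms
```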

  23. Different search engines • The real differences among different search engines are • their indexing weight schemes • their query processing methods • their ranking algorithms • None of these are published by any of the search engine firms. Bing Liu

  24. Vector Space Based Document Classification Bing Liu

  25. Vector Space Representation • Each doc j is a vector, one component for each term (= word). • Have a vector space • terms are attributes • n docs live in this space • even with stop word removal and stemming, we may have 10000+ dimensions, or even 1,000,000+ Bing Liu

  26. Classification in Vector Space (figure: regions labeled Government, Science, Arts) • Each training doc is a point (vector) labeled by its topic (= class) • Hypothesis: docs of the same topic form a contiguous region of space • Define surfaces to delineate topics in space Bing Liu

  27. (figure) A test document falls inside the Government region among the Government, Science, and Arts regions, so Test doc = Government. Bing Liu

  28. Rocchio Classification Method • Given training documents compute a prototype vector for each class. • Given test doc, assign to topic whose prototype (centroid) is nearest using cosine similarity. Bing Liu

  29. Rocchio Classification • Combine the training document vectors into a prototype vector cj for each class Cj: cj = (α/|Cj|) Σd∈Cj d/||d|| − (β/|D−Cj|) Σd∈D−Cj d/||d|| • α and β are parameters that adjust the relative impact of relevant and irrelevant training examples. Normally, α = 16 and β = 4. Bing Liu
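
A small NumPy sketch of Rocchio classification on invented vectors, using the α = 16 and β = 4 values quoted on the slide (NumPy itself is an assumption, not mentioned in the slides):

```python
import numpy as np

# Rocchio: one prototype per class from normalized training vectors, then
# assign a test vector to the prototype with the highest cosine similarity.
def rocchio_prototypes(X, y, alpha=16.0, beta=4.0):
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalize each document vector
    y = np.array(y)
    prototypes = {}
    for c in set(y):
        in_class = X[y == c]
        out_class = X[y != c]
        prototypes[c] = alpha * in_class.mean(axis=0) - beta * out_class.mean(axis=0)
    return prototypes

def rocchio_classify(x, prototypes):
    x = x / np.linalg.norm(x)
    return max(prototypes,
               key=lambda c: np.dot(x, prototypes[c]) / np.linalg.norm(prototypes[c]))

X = np.array([[2.0, 0.1, 0.0], [1.5, 0.0, 0.2],     # "hardware"-like documents
              [0.1, 2.0, 0.3], [0.0, 1.8, 0.1]])    # "software"-like documents
y = ["hardware", "hardware", "software", "software"]
protos = rocchio_prototypes(X, y)
print(rocchio_classify(np.array([1.0, 0.2, 0.0]), protos))   # hardware
```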

  30. Naïve Bayesian Classifier • Given a set of training documents D, • each document is considered an ordered list of words. • wdi,k denotes the word wt in position k of document di, where each word is from the vocabulary V = <w1, w2, … , w|V|>. • Let C = {c1, c2, … , c|C|} be the set of pre-defined classes. • There are two naïve Bayesian models, • one based on the multi-variate Bernoulli model (a word occurs or does not occur in a document). • one based on the multinomial model (the number of word occurrences is considered) Bing Liu

  31. Naïve Bayesian Classifier (multinomial model) • Word probabilities with Laplace smoothing: P(wt|cj) = (1 + Σi N(wt, di) P(cj|di)) / (|V| + Σs Σi N(ws, di) P(cj|di)) (1) • Class priors: P(cj) = Σi P(cj|di) / |D| (2) • Classification: P(cj|di) ∝ P(cj) Πk P(wdi,k|cj) (3) • N(wt, di) is the number of times the word wt occurs in document di. P(cj|di) is in {0, 1}. Bing Liu
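
A minimal sketch of the multinomial model with Laplace smoothing, following equations (1)-(3); the tiny corpus and the "hardware"/"software" labels are invented for illustration.

```python
import math
from collections import Counter

# Multinomial naive Bayes: documents are lists of words, labels are hard
# assignments, so P(cj|di) is 0 or 1.
def train_nb(docs, labels):
    vocab = {w for d in docs for w in d}
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}        # eq. (2)
    word_counts = {c: Counter() for c in classes}
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    cond = {c: {w: (1 + word_counts[c][w]) /
                   (len(vocab) + sum(word_counts[c].values()))
                for w in vocab}
            for c in classes}                                           # eq. (1)
    return priors, cond, vocab

def classify_nb(doc, priors, cond, vocab):
    scores = {c: math.log(priors[c]) +
                 sum(math.log(cond[c][w]) for w in doc if w in vocab)
              for c in priors}                                          # eq. (3), in log space
    return max(scores, key=scores.get)

docs = [["cpu", "disk", "memory"], ["cpu", "cache"], ["compiler", "code"], ["code", "bug"]]
labels = ["hardware", "hardware", "software", "software"]
priors, cond, vocab = train_nb(docs, labels)
print(classify_nb(["cpu", "memory"], priors, cond, vocab))   # hardware
```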

  32. k Nearest Neighbor Classification • To classify document d into class c • Define k-neighborhood N as k nearest neighbors of d • Count number of documents n in N that belong to c • Estimate P(c|d) as n/k • No training is needed (?). Classification time is linear in training set size. Bing Liu
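
A short sketch of kNN classification with cosine similarity, estimating P(c|d) as the fraction of the k nearest training documents labeled c (toy binary vectors with invented labels):

```python
import math
from collections import Counter

# kNN text classification: score a test document by the share of its k
# cosine-most-similar training documents in each class.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(x, X_train, y_train, k=6):
    neighbors = sorted(zip(X_train, y_train),
                       key=lambda p: cosine(x, p[0]), reverse=True)[:k]
    votes = Counter(c for _, c in neighbors)
    return {c: n / k for c, n in votes.items()}    # estimates of P(c | d)

X_train = [(1, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
y_train = ["hw", "hw", "hw", "sw", "sw", "sw", "hw"]
print(knn_classify((1, 1, 0), X_train, y_train, k=3))   # {'hw': 1.0}
```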

  33. Example Government Science Arts Bing Liu

  34. Example: k=6 (6NN) P(science| )? Government Science Arts Bing Liu

  35. Linear classifiers: Binary Classification • Consider 2-class problems • Assume linear separability for now: • in 2 dimensions, can separate by a line • in higher dimensions, need hyperplanes • Can find a separating hyperplane by linear programming (e.g., perceptron): • separator can be expressed as ax + by = c Bing Liu

  36. Linear programming / Perceptron • Find a, b, c such that • ax + by ≥ c for red points • ax + by < c for green points. Bing Liu

  37. Linear Classifiers (cont.) • Many common text classifiers are linear classifiers • Despite this similarity, large performance differences • For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose? • What to do for non-separable problems? Bing Liu

  38. Which hyperplane? In general, lots of possible solutions for a,b,c. Support Vector Machine (SVM) finds an optimal solution Bing Liu

  39. Support Vector Machine (SVM) (figure: support vectors; maximize margin) • SVMs maximize the margin around the separating hyperplane. • The decision function is fully specified by a subset of training samples, the support vectors. • Quadratic programming problem. • SVM: very good for text classification Bing Liu

  40. Optimal hyperplane • Let the training examples be (xi, yi), i = 1, 2, …, n, where xi is an n-dimensional vector and yi is its class, -1 or 1. • The class represented by the subset with yi = -1 and the class represented by the subset with yi = +1 are linearly separable if there exists (w, b) such that • wTxi + b ≥ 0 for yi = +1 • wTxi + b < 0 for yi = -1 • The margin of separation m is the separation between the hyperplane wTx + b = 0 and the closest data points (support vectors). • The goal of an SVM is to find the optimal hyperplane with the maximum margin of separation. Bing Liu

  41. A Geometrical Interpretation (figure: Class 1 and Class 2 separated by a hyperplane with margin m) • The decision boundary should be as far away from the data of both classes as possible • We maximize the margin, m Bing Liu

  42. SVM formulation: separable case • Thus, support vector machines (SVM) are linear functions of the form f(x) = wTx + b, where w is the weight vector and x is the input vector. • To find the linear function: • Minimize: (1/2) wTw • Subject to: yi(wTxi + b) ≥ 1, i = 1, …, n • Quadratic programming. Bing Liu

  43. Non-separable case: Soft margin SVM • To deal with cases where there may be no separating hyperplane due to noisy labels of both positive and negative training examples, the soft margin SVM is proposed: • Minimize: (1/2) wTw + C Σi ξi • Subject to: yi(wTxi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, …, n • where C > 0 is a parameter that controls the amount of training errors allowed. Bing Liu

  44. Illustration: Non-separable case (figure) • Support vectors: • 1: margin support vectors, ξi = 0, correct • 2: non-margin support vectors, ξi < 1, correct (inside the margin) • 3: non-margin support vectors, ξi > 1, error Bing Liu

  45. Extension to Non-linear Decision Surface (figure: a mapping f(.) from input space to feature space) • In general, complex real world applications may not be expressed with linear functions. • Key idea: transform xi into a higher dimensional space to “make life easier” • Input space: the space the xi are in • Feature space: the space of f(xi) after transformation Bing Liu

  46. Kernel Trick • The mapping function φ(.) is used to project data into a higher dimensional feature space: x = (x1, …, xn) → φ(x) = (φ1(x), …, φN(x)) • With a higher dimensional space, the data are more likely to be linearly separable. • In SVM, the projection can be done implicitly, rather than explicitly, because the optimization does not actually need the explicit projection. • It only needs a way to compute inner products between pairs of training examples (e.g., x, z). Kernel: K(x, z) = <φ(x) · φ(z)> • If you know how to compute K, you do not need to know φ. Bing Liu
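
A tiny numeric illustration (not from the slides) that computing a kernel directly agrees with taking the inner product after an explicit projection φ, using the degree-2 polynomial kernel K(x, z) = (x · z)^2 in two dimensions:

```python
import math

# Kernel trick illustration (not a full SVM): for the degree-2 polynomial
# kernel, K(x, z) computed directly equals <phi(x), phi(z)> after the
# explicit projection phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2).
def phi(x):
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def kernel(x, z):
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(explicit, kernel(x, z))   # both 16.0
```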

  47. Comments on SVM • SVMs are seen as the best-performing method by many. • Statistical significance of most results is not clear. • Kernels are an elegant and efficient way to map data into a better representation. • SVMs can be expensive to train (quadratic programming). • For text classification, a linear kernel is common and often sufficient. Bing Liu
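
A hedged end-to-end sketch of linear-kernel SVM text classification, assuming scikit-learn is available; the four-document corpus and its labels are invented for illustration.

```python
# Linear SVM over TF-IDF features; scikit-learn is an assumption here,
# not something the slides prescribe.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = ["cpu cache and memory benchmarks", "new gpu hardware released",
        "python compiler bug fixed", "refactoring legacy software code"]
labels = ["hardware", "hardware", "software", "software"]

model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
model.fit(docs, labels)
print(model.predict(["memory and cpu upgrade"]))   # ['hardware'] expected
```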

  48. Document clustering • We can still use the normal clustering techniques, e.g., partition and hierarchical methods. • Documents can be represented using vector space model. • For distance function, cosine similarity measure is commonly used. Bing Liu
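
A short sketch of document clustering with k-means over TF-IDF vectors, again assuming scikit-learn; the four documents are invented.

```python
# k-means on TF-IDF vectors; TfidfVectorizer L2-normalizes rows by default,
# so Euclidean distance tracks cosine similarity reasonably well here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cpu memory hardware", "gpu hardware benchmark",
        "software compiler code", "code review software"]
X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # the two hardware docs and the two software docs should pair up
```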

  49. Summary • Text mining applies and adapts data mining techniques to text domain. • A significant amount of pre-processing is needed before mining, using information retrieval techniques. Bing Liu
