
ICS 278: Data Mining Lecture 12: Text Mining


  1. ICS 278: Data Mining, Lecture 12: Text Mining. Padhraic Smyth, Department of Information and Computer Science, University of California, Irvine. Data Mining Lectures, Lecture 12: Text Mining, Padhraic Smyth, UC Irvine

  2. Text Mining • Information Retrieval • Text Classification • Text Clustering • Information Extraction

  3. Text Mining Applications • Information Retrieval • Query-based search of large text archives, e.g., the Web • Text Classification • Automated assignment of topics to Web pages, e.g., Yahoo, Google • Automated classification of email into spam and non-spam • Text Clustering • Automated organization of search results in real-time into categories • Discovering clusters and trends in the technical literature (e.g., CiteSeer) • Information Extraction • Extracting standard fields from free text • extracting names and places from reports and newspapers (e.g., military applications) • automatically extracting information from resumes • Extracting protein-interaction information from biology papers

  4. Text Mining • Information Retrieval • Text Classification • Text Clustering • Information Extraction

  5. General concepts in Information Retrieval • Representation language • typically a vector of p attribute values, e.g., • color, intensity, and texture features characterizing images • word counts for text documents • Data set D of N objects • Typically represented as an N x p matrix • Query Q: • User poses a query to search D • Query is typically expressed in the same representation language as the data, e.g., • each text document is represented by the set of words that occur in it • Query Q is also expressed as a set of words, e.g., "data" and "mining"
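As an illustration of this shared representation, here is a minimal bag-of-words sketch in Python; the two-document corpus and the query are invented for this example:

```python
from collections import Counter

# Hypothetical mini-corpus; the vocabulary is built from the documents.
docs = ["data mining finds patterns in data",
        "text mining is mining applied to text"]
vocab = sorted({w for d in docs for w in d.split()})

def to_vector(text):
    """Represent a document (or a query) as a vector of word counts
    over the shared vocabulary -- the standard IR representation."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

X = [to_vector(d) for d in docs]   # the N x p data matrix (N=2, p=len(vocab))
Q = to_vector("data mining")       # the query, in the same representation
```

Because documents and query live in the same vector space, any vector distance can now compare them directly.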

  6. Query by Content • traditional DB query: exact matches • e.g., query Q = [level = MANAGER] & [age < 30] • query-by-content query: more general / less precise • e.g., Q = which historical record is most similar to a new one? • for text data, often called "information retrieval" (IR) • Goal • Match query Q to the N objects in the database • Return a ranked list (typically) of the most similar/relevant objects in the data set D given Q

  7. Issues in Query by Content • What representation language to use • How to measure similarity between Q and each object in D • How to compute the results in real-time (for interactive querying) • How to rank the results for the user • Allowing user feedback (query modification) • How to evaluate and compare different IR algorithms/systems

  8. The Standard Approach • fixed-length (d-dimensional) vector representation • for query (d-by-1 Q) and database (d-by-n X) objects • use domain-specific higher-level features (vs. raw) • image • color (e.g., RGB), texture (e.g., Gabor, Fourier coeffs), … • text • "bag of words": frequency count for each word in each document, … • compute distances between vectorized representations • use k-NN to find the k vectors in X closest to Q
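The distance-plus-k-NN step above can be sketched as a brute-force search; the 2-D feature vectors below are made up for illustration, not real image or text features:

```python
import math

def knn_retrieve(X, Q, k):
    """Return the indices of the k database vectors closest to query Q
    under Euclidean distance (brute-force k-NN, the baseline scheme)."""
    dists = [(math.dist(x, Q), i) for i, x in enumerate(X)]
    return [i for _, i in sorted(dists)[:k]]

# Toy 2-D feature vectors; rows 1 and 2 lie closest to the origin query.
X = [[10.0, 0.0], [1.0, 1.0], [0.0, 2.0], [8.0, 8.0]]
print(knn_retrieve(X, [0.0, 0.0], 2))   # -> [1, 2]
```

Real systems avoid the brute-force scan with inverted indices or spatial data structures, but the retrieval logic is the same.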

  9. Evaluating Retrieval Methods • for predictive models (classification/regression) the objective is clear • score = accuracy on unseen test data • evaluation is more complex for query by content • real score = how "useful" the retrieved information is (subjective) • e.g., how would you define a real score for Google's top 10 hits? • towards objectivity, assume: • 1) each object is "relevant" or "irrelevant" • simplification: binary and the same for all users (e.g., committee vote) • 2) each object is labelled by an objective/consistent oracle • these assumptions suggest a classifier approach is possible • but the goals are rather different: we want the objects nearest to Q, not separability per se • and it would require learning a classifier at query time (Q = positive class) • which is why a k-NN type approach seems so appropriate …

  10. Precision versus Recall • DQ = Q's ranked retrievals (smallest distance first) • DQT = those with distance < threshold • threshold ~0: few false positives (FP) (declared relevant, but not), many false negatives (FN) • large threshold: few false negatives (FN), many false positives (FP) • precision = TP / (TP + FP) • fraction of retrieved objects that are relevant • recall = TP / (TP + FN) • fraction of relevant objects that are retrieved • Tradeoff: high precision -> low recall, and vice versa • For multiple queries, precision for specific ranges of recall can be averaged (so-called "interpolated precision").
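Under the binary relevant/irrelevant assumption, precision and recall for one retrieved set reduce to a few set operations; the object ids below are arbitrary:

```python
def precision_recall(retrieved, relevant):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN), given the set of
    retrieved object ids and the set of truly relevant object ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)       # relevant objects we retrieved
    precision = tp / len(retrieved)      # TP / (TP + FP)
    recall = tp / len(relevant)          # TP / (TP + FN)
    return precision, recall

# 3 of the 4 retrieved objects are relevant; 3 of the 6 relevant are found.
p, r = precision_recall([1, 2, 3, 4], [2, 3, 4, 7, 8, 9])
print(p, r)   # -> 0.75 0.5
```

Raising the distance threshold enlarges the retrieved set, which typically raises recall and lowers precision, which is exactly the tradeoff described above.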

  11. Precision-Recall Curve (a form of ROC curve) • alternative single-point summaries: precision where recall = precision, or precision for a fixed number of retrievals, or average precision over multiple recall levels • in the slide's figure, curve C is universally worse than curves A and B

  12. TREC evaluations • Text Retrieval Conference (TREC) • Web site: trec.nist.gov • Annual impartial evaluation of IR systems • e.g., D = 1 million documents • TREC organizers supply contestants with several hundred queries Q • Each competing system provides its ranked list of documents for each query • The union of the top 100 or so documents from each system is then manually judged relevant or non-relevant for each query Q • Precision, recall, etc., are then calculated and the systems compared

  13. Text Retrieval • document: book, paper, WWW page, ... • term: word, word pair, phrase, … (often 50,000+) • query Q = set of terms, e.g., "data" + "mining" • full NLP (natural language processing) is too hard, so … • we want a (vector) representation for text that • retains maximum useful semantics • supports efficient distance computations between documents and Q • term weights • Boolean (e.g., term in document or not); "bag of words" • real-valued (e.g., frequency of term in document; relative to all documents) ... • notice: this loses word order, sentence structure, etc.

  14. Toy example of a document-term matrix

  15. Distances between Documents • Measuring distance between 2 documents: • wide variety of distance metrics: • Euclidean (L2) = sqrt( Σ_i (x_i - y_i)^2 ) • L1 = Σ_i |x_i - y_i| • ... • weighted L2 = sqrt( Σ_i (w_i x_i - w_i y_i)^2 ) • Cosine distance between docs Di = (d_i1, …, d_iT) • dc(Di, Dj) = Σ_{k=1…T} d_ik d_jk / sqrt( Σ_{k=1…T} d_ik^2 · Σ_{k=1…T} d_jk^2 ) • Can give better results than Euclidean • because it normalizes relative to document length
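A direct implementation of the cosine measure, using two rows of the lecture's toy doc-term matrix, demonstrates the length-invariance property that motivates it:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two term-count vectors; unlike raw
    Euclidean distance, it normalizes away document length."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

d1 = [24, 21, 9, 0, 0, 3]   # rows d1 and d2 of the toy doc-term matrix
d2 = [32, 10, 5, 0, 3, 0]

# Doubling every count in d2 (a "longer" document with the same content
# profile) leaves the cosine similarity unchanged:
print(cosine_similarity(d1, d2) == cosine_similarity(d1, [2 * c for c in d2]))
```

A Euclidean distance, by contrast, would treat the doubled document as far from d1 even though its term profile is identical.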

  16. Distance matrices for toy document-term data

  TF doc-term matrix:
        t1  t2  t3  t4  t5  t6
  d1    24  21   9   0   0   3
  d2    32  10   5   0   3   0
  d3    12  16   5   0   0   0
  d4     6   7   2   0   0   0
  d5    43  31  20   0   3   0
  d6     2   0   0  18   7  16
  d7     0   0   1  32  12   0
  d8     3   0   0  22   4   2
  d9     1   0   0  34  27  25
  d10    6   0   0  17   4  23

  [The Euclidean and cosine distance matrices for these ten documents are shown as figures on the slide.]

  17. TF-IDF Term Weighting Schemes • binary weights favor larger documents, so ... • TF (term frequency): term weight = number of times the term occurs in that document • problem: a term common to many documents => low discrimination • IDF (inverse document frequency of a term) • nj documents contain term j, N documents in total • IDF = log(N/nj) • favors terms that occur in relatively few documents • TF-IDF: TF(term) * IDF(term) • no real theoretical basis, but works well empirically and is widely used
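Applied to the 10-document matrix from the toy example, TF-IDF weighting is a few lines of Python; it reproduces the TF-IDF(t1 in d1) ≈ 2.5 value worked out on the next slide (natural log assumed, as the slide's numbers imply):

```python
import math

def tfidf(tf_matrix):
    """TF-IDF: weight(doc i, term j) = TF_ij * log(N / n_j), where n_j is
    the number of documents containing term j and N the number of docs."""
    N = len(tf_matrix)
    idf = [math.log(N / sum(1 for row in tf_matrix if row[j] > 0))
           for j in range(len(tf_matrix[0]))]
    return [[tf * idf[j] for j, tf in enumerate(row)] for row in tf_matrix]

# The 10 x 6 TF doc-term matrix from the lecture's toy example.
tf = [[24, 21, 9, 0, 0, 3], [32, 10, 5, 0, 3, 0], [12, 16, 5, 0, 0, 0],
      [6, 7, 2, 0, 0, 0],   [43, 31, 20, 0, 3, 0], [2, 0, 0, 18, 7, 16],
      [0, 0, 1, 32, 12, 0], [3, 0, 0, 22, 4, 2],   [1, 0, 0, 34, 27, 25],
      [6, 0, 0, 17, 4, 23]]
w = tfidf(tf)
print(round(w[0][0], 1))   # 24 * log(10/9) -> 2.5
```

Term t1 occurs in 9 of the 10 documents, so its IDF (about 0.1) shrinks its large raw counts, while rarer terms keep more of their weight.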

  18. TF-IDF Example

  TF doc-term matrix: the same 10 x 6 matrix as on slide 16.

  IDF weights (one per term): (0.1, 0.7, 0.5, 0.7, 0.4, 0.7)

  Example: TF-IDF(t1 in d1) = TF * IDF = 24 * log(10/9) ≈ 2.5

  TF-IDF doc-term matrix (first five rows):
        t1    t2    t3    t4   t5   t6
  d1    2.5  14.6   4.6   0    0    2.1
  d2    3.4   6.9   2.6   0    1.1  0
  d3    1.3  11.1   2.6   0    0    0
  d4    0.6   4.9   1.0   0    0    0
  d5    4.5  21.5  10.2   0    1.1  0
  ...

  19. Typical Document Querying System • Queries Q = binary term vectors • Documents represented by TF-IDF weights • Cosine distance used for retrieval and ranking

  TF doc-term matrix: the same 10 x 6 matrix as on slide 16. Query Q = (1, 0, 1, 0, 0, 0).

  Cosine similarity of Q to each document:
        TF     TF-IDF
  d1    0.70   0.32
  d2    0.77   0.51
  d3    0.58   0.24
  d4    0.60   0.23
  d5    0.79   0.43
  ...
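Putting the pieces together, a small sketch of this querying pipeline: cosine similarity of the binary query Q = (1, 0, 1, 0, 0, 0) against the TF-weighted documents reproduces the TF column of scores on this slide (the same code would run unchanged on TF-IDF weights):

```python
import math

def cosine(x, y):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

# First five rows (d1..d5) of the toy TF doc-term matrix.
tf = [[24, 21, 9, 0, 0, 3], [32, 10, 5, 0, 3, 0], [12, 16, 5, 0, 0, 0],
      [6, 7, 2, 0, 0, 0], [43, 31, 20, 0, 3, 0]]
Q = [1, 0, 1, 0, 0, 0]          # binary query: terms t1 and t3

scores = [round(cosine(doc, Q), 2) for doc in tf]
print(scores)                    # -> [0.7, 0.77, 0.58, 0.6, 0.79]

# Rank documents by similarity, best first (d5 comes out on top).
ranking = sorted(range(len(tf)), key=lambda i: -scores[i])
```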

  20. Synonymy and Polysemy • Synonymy • the same concept can be expressed using different sets of terms • e.g. bandit, brigand, thief • negatively affects recall • Polysemy • identical terms can be used in very different semantic contexts • e.g. bank • repository where important material is saved • the slope beside a body of water • negatively affects precision

  21. Latent Semantic Indexing • Approximate data in the original d-dimensional space by data in a k-dimensional space, where k << d • Find the k linear projections of the data that contain the most variance • Principal components analysis or SVD • Also known as “latent semantic indexing” when applied to text • Captures dependencies among terms • In effect represents original d-dimensional basis with a k-dimensional basis • e.g., terms like SQL, indexing, query, could be approximated as coming from a single “hidden” term • Why is this useful? • Query contains “automobile”, document contains “vehicle” • can still match Q to the document since the 2 terms will be close in k-space (but not in original space), i.e., addresses synonymy problem

  22. Toy example of a document-term matrix

  23. SVD • M = U S VT • M = n x d = original document-term matrix (the data) • U = n x d, each row = vector of weights for each document • S = d x d diagonal matrix of singular values • Columns of V (rows of VT) = new orthogonal basis for the data • Each singular value indicates how much of the variation in the data is captured by the corresponding basis vector
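The decomposition can be demonstrated with NumPy's SVD routine; the small matrix M below is invented for illustration, not the lecture's data:

```python
import numpy as np

# A toy 4-document x 3-term matrix (hypothetical counts).
M = np.array([[2.0, 1.0, 0.0],
              [4.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 1.0, 2.0]])

# M = U S V^T; with full_matrices=False, U is n x d and Vt is d x d.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Singular values come out sorted largest-first; each one measures how
# much of the data the corresponding basis vector (row of Vt) accounts for.
print(np.allclose(M, U @ np.diag(s) @ Vt))   # reconstruction holds: True
```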

  24. Example of SVD

  25. v1 = [0.74, 0.49, 0.27, 0.28, 0.18, 0.19] v2 = [-0.28, -0.24, -0.12, 0.74, 0.37, 0.31] D1 = database x 50 D2 = SQL x 50

  26. Another LSI Example • A collection of documents: d1: Indian government goes for open-source software d2: Debian 3.0 Woody released d3: Wine 2.0 released with fixes for Gentoo 1.4 and Debian 3.0 d4: gnuPOD released: iPOD on Linux… with GPLed software d5: Gentoo servers running at open-source mySQL database d6: Dolly the sheep not totally identical clone d7: DNA news: introduced low-cost human genome DNA chip d8: Malaria-parasite genome database on the Web d9: UK sets up genome bank to protect rare sheep breeds d10: Dolly’s DNA damaged

  27. LSI Example (continued) • The term-document matrix X

                d1  d2  d3  d4  d5  d6  d7  d8  d9  d10
  open-source    1   0   0   0   1   0   0   0   0   0
  software       1   0   0   1   0   0   0   0   0   0
  Linux          0   0   0   1   0   0   0   0   0   0
  released       0   1   1   1   0   0   0   0   0   0
  Debian         0   1   1   0   0   0   0   0   0   0
  Gentoo         0   0   1   0   1   0   0   0   0   0
  database       0   0   0   0   1   0   0   1   0   0
  Dolly          0   0   0   0   0   1   0   0   0   1
  sheep          0   0   0   0   0   1   0   0   0   0
  genome         0   0   0   0   0   0   1   1   1   0
  DNA            0   0   0   0   0   0   2   0   0   1

  28. LSI Example • The reconstructed term-document matrix after projecting onto a subspace of dimension K=2 • S = diag(2.57, 2.49, 1.99, 1.90, 1.68, 1.53, 0.94, 0.66, 0.36, 0.10)

                 d1     d2     d3     d4     d5     d6     d7     d8     d9    d10
  open-source   0.34   0.28   0.38   0.42   0.24   0.00   0.04   0.07   0.02   0.01
  software      0.44   0.37   0.50   0.55   0.31  -0.01  -0.03   0.06   0.00  -0.02
  Linux         0.44   0.37   0.50   0.55   0.31  -0.01  -0.03   0.06   0.00  -0.02
  released      0.63   0.53   0.72   0.79   0.45  -0.01  -0.05   0.09  -0.00  -0.04
  Debian        0.39   0.33   0.44   0.48   0.28  -0.01  -0.03   0.06   0.00  -0.02
  Gentoo        0.36   0.30   0.41   0.45   0.26   0.00   0.03   0.07   0.02   0.01
  database      0.17   0.14   0.19   0.21   0.14   0.04   0.25   0.11   0.09   0.12
  Dolly        -0.01  -0.01  -0.01  -0.02   0.03   0.08   0.45   0.13   0.14   0.21
  sheep        -0.00  -0.00  -0.00  -0.01   0.03   0.06   0.34   0.10   0.11   0.16
  genome        0.02   0.01   0.02   0.01   0.10   0.19   1.11   0.34   0.36   0.53
  DNA          -0.03  -0.04  -0.04  -0.06   0.11   0.30   1.70   0.51   0.55   0.81
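The rank-2 reconstruction shown on this slide can be reproduced with NumPy's SVD; X below is the term-document matrix from the LSI example, and the function is a generic sketch of the projection, not the original lecture code:

```python
import numpy as np

def lsi_reconstruct(X, k):
    """Rank-k LSI approximation: keep the top-k singular directions
    of X and map back, X_k = U_k S_k V_k^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Term-document matrix X from the LSI example (11 terms x 10 documents).
X = np.array([
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],   # open-source
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],   # software
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],   # Linux
    [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],   # released
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],   # Debian
    [0, 0, 1, 0, 1, 0, 0, 0, 0, 0],   # Gentoo
    [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],   # database
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 1],   # Dolly
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],   # sheep
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0],   # genome
    [0, 0, 0, 0, 0, 0, 2, 0, 0, 1],   # DNA
], dtype=float)

X2 = lsi_reconstruct(X, 2)   # the reconstructed matrix on this slide
```

Note how the originally zero entry for "genome" in d10 becomes clearly positive after projection: the two-dimensional subspace ties the biology terms together.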

  29. Further Reading • Text: Chapter 14 • Web-related document search: • An excellent resource is Chapter 3, "Web Search and Information Retrieval," in Mining the Web: Discovering Knowledge from Hypertext Data, S. Chakrabarti, Morgan Kaufmann, 2003. • Information on how real Web search engines work: • http://searchenginewatch.com/ • Latent Semantic Analysis • Applied to the grading of essays: "The debate on automated grading," IEEE Intelligent Systems, September/October 2000. Online at http://www.k-a-t.com/papers/IEEEdebate.pdf

  30. Next up … • Information Retrieval • Text Classification • Text Clustering • Information Extraction
