
Introduction to Biomedical Informatics Text Mining


Presentation Transcript


  1. Introduction to Biomedical Informatics Text Mining

  2. Outline • Introduction and Motivation • Techniques • Document classification • Document clustering • Topic discovery from text • Information extraction • Additional Resources and Recommended Reading

  3. Motivation for Text Mining • PubMed adds approximately 1 million new articles per year • Human annotation cannot keep up – increased demand for automation • Problem is even greater in other domains (e.g., Web search in general)

  4. From Jensen, Saric, Bork, Nature Reviews Genetics, 2006

  5. Text Mining Problems • Classification: automatically assign a document to one or more categories • “easy” problem: is an email spam or non-spam? • “hard” problem: assign MeSH terms to new PubMed articles • Clustering: group a set of documents into clusters • e.g., automatically group docs in search results by theme • Topic Discovery: discover themes in docs and index docs by theme • e.g., discover new scientific concepts not yet covered by MeSH terms • Information Extraction: extract mentions of entities from documents • “easy” problem: extract all gene names mentioned in a document • “hard” problem: extract a set of facts relating genes in a document, e.g., statements such as “gene A activates gene B”

  6. Classification and Clustering • We already discussed these methods in the context of general data mining in earlier lectures. Now we want to apply these techniques specifically to text data • Recall: • Given a vector of features x, a classifier maps x to a target variable y, where y is categorical, e.g., y = {has cancer, does not have cancer} • Learning a classifier consists of being given a training data set of pairs of x’s and y’s, and learning how to map x to y • Clustering is similar, but our data doesn’t have any target y values – we have to discover the target values (the “clusters”) automatically

  7. Classification and Clustering for Text Documents • Document representation • Most classification and clustering algorithms assume that each object (here a document) to be classified can be represented as a fixed-length vector of variable/feature/attribute values • So how do we convert documents into fixed-length vectors? • “Bag of Words” representation • Each vector entry represents whether term j occurred in a document, or how often it occurred • Same idea as for information retrieval • Ignores word order and document structure… but found to work well in practice and has considerable computational advantages over working with the document directly • Once we have our vector (bag of words) representation we can use any classification or clustering method on our documents (see the sketch below)
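
  As a concrete illustration of the bag-of-words idea, here is a minimal Python sketch using scikit-learn's CountVectorizer; the example documents are invented placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three toy documents (invented for illustration)
docs = [
    "gene A activates gene B in tumor cells",
    "tumor suppressor genes and cancer risk",
    "spam filters classify email messages",
]

# Each document becomes a fixed-length vector of term counts;
# word order and document structure are ignored.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix, shape (3, vocabulary size)

print(vectorizer.get_feature_names_out())   # the vocabulary (one column per term)
print(X.toarray())                          # the bag-of-words count vectors
```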

  8. Document Classification

  9. Document Classification • Document classification has many applications • Spam email detection • Automated tagging of biomedical articles (e.g., in PubMed) • Automated creation of Web-page taxonomies • Data Representation • “Bag of words” most commonly used: either counts or binary • Can also use “phrases” for commonly occurring combinations of words • Classification Methods • Naïve Bayes widely used (e.g., for spam email) • Fast and reasonably accurate • Support vector machines (SVMs) • Often the most accurate method in research studies • But more complex computationally than other methods • Logistic Regression (regularized) • Widely used in industry, often competitive with SVMs
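
  A minimal sketch of how the three classifier families listed above could be trained on bag-of-words vectors with scikit-learn; the documents and labels are placeholders standing in for a labeled corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Placeholder labeled corpus: two classes, "spam" and "ham"
docs   = ["win money now", "meeting agenda attached",
          "cheap pills online", "project status report"]
labels = ["spam", "ham", "spam", "ham"]

X = CountVectorizer().fit_transform(docs)

# Naive Bayes: fast; training is essentially smoothed frequency counting
nb = MultinomialNB().fit(X, labels)

# Linear SVM: often the most accurate in research studies, but costlier to train
svm = LinearSVC().fit(X, labels)

# Regularized logistic regression: widely used, often competitive with SVMs
lr = LogisticRegression(max_iter=1000).fit(X, labels)
```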

  10. Trimming the Vocabulary • Stopword removal: • remove “non-content” words • very frequent “stop words” such as “the”, “and”…. • remove very rare words, e.g., that only occur a few times in 1 million documents • Often results in removal of 30% or more of the original unique words • Stemming: • Reduce all variants of a word to a single term • e.g., {draw, drawing, drawings} -> “draw” • Can use Porter stemming algorithm (1980) • This still often leaves us with 10000 to 1 million unique terms => a very high-dimensional classification problem!
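
  A small sketch of the two trimming steps above, assuming NLTK's Porter stemmer and scikit-learn's built-in English stopword list; the token list is invented and the stemmed output shown in the comment is only indicative.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

tokens = ["the", "drawings", "and", "drawing", "draw", "interest", "rates"]

stemmer = PorterStemmer()

# Drop very frequent "stop words", then reduce the remaining words to their stems
trimmed = [stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]
print(trimmed)   # e.g. ['draw', 'draw', 'draw', 'interest', 'rate']
```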

  11. Classifying Term Vectors • Typically multiple different terms or words may be helpful • Class = “finance” • Words = “stocks”, “return”, “interest”, “rate”, etc. • Thus, classifiers that combine multiple features often do well, e.g., • naïve Bayes, logistic regression, SVMs, etc. (compared to decision trees, for example, which would branch on one word at a time) • Linear classifiers often perform well in high dimensions • Typically we have a large number of features/dimensions in text classification • Theory and experiments tell us linear classifiers do well in high dimensions • So naïve Bayes, logistic regression, and linear SVMs are all useful • Main questions in practice are: • which terms to use in the classifier? • which linear classifier to select?

  12. Feature Selection • Performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms • See classification results later in these slides • Greedy search • Start from empty set or full set and add/delete one at a time • Heuristics for adding/deleting • Information gain (mutual information of term with class) • Chi-square • Other ideas • Methods tend not to be particularly sensitive to the specific heuristic used for feature selection, but some form of feature selection often improves performance
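
  A minimal sketch of one of the heuristics above, using scikit-learn's chi-square scorer to keep only the most discriminative terms; the corpus, labels, and value of k are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs   = ["stocks and interest rates rise", "gene expression in tumor cells",
          "bank lowers interest rate", "protein binding and gene regulation"]
labels = ["finance", "biology", "finance", "biology"]

X = CountVectorizer().fit_transform(docs)

# Keep only the k terms with the highest chi-square score against the class label
selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X, labels)
print(X_reduced.shape)   # (number of documents, k selected terms)
```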

  13. Example of the Role of Feature Selection (from Chakrabarti, Chapter 5): 9,600 documents from the US Patent database, 20,000 raw features (terms)

  14. Types of Classifiers Let c be the class label and let x be a vector of features • Generative/Probabilistic • Model p(x | c) for each class, then estimate p(c | x) • e.g., naïve Bayes model • Conditional Probability/Regression • Model p(c | x) directly • e.g., logistic regression • Discriminative • Look for decision boundaries in input space x directly • No probabilities • e.g., perceptron, linear discriminants, SVMs, etc.

  15. Probabilistic “Generative” Classifiers • Model p( x | ck ) for each class and perform classification via Bayes rule: c = arg max { p( ck | x ) } = arg max { p( x | ck ) p( ck ) } • How to model p( x | ck )? • p( x | ck ) = probability of a “bag of words” x given a class ck • Two commonly used approaches (for text): • Naïve Bayes: treat each term xj as being conditionally independent, given ck • Multinomial: model a document with N words as N tosses of a p-sided die

  16. Naïve Bayes Classifier for Text • Naïve Bayes classifier = conditional independence model • Assumes conditional independence given the class: p( x | ck ) = ∏j p( xj | ck ) • Note that we model each term xj as a discrete random variable • Binary terms (Bernoulli): p( x | ck ) = ∏{j: xj = 1} p( xj = 1 | ck ) ∏{j: xj = 0} p( xj = 0 | ck )

  17. Multinomial Classifier for Text • Multinomial classification model • Assume that the data are generated by a p-sided die (multinomial model): p( x | ck ) = N! / (n1! ··· np!) ∏j p( term j | ck )^nj where N = total number of terms in the document and nj = number of times term j occurs in the document • Here we have a single multinomial random variable for each class, and the p( term j | ck ) probabilities sum to 1 over the p terms in the vocabulary
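
  A minimal sketch contrasting the two generative models with scikit-learn: MultinomialNB works from term counts, BernoulliNB from binary term presence (binarize=0.0 thresholds the counts). The documents and labels are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs   = ["rate rise rate cut", "gene gene protein",
          "interest rate policy", "tumor gene expression"]
labels = ["finance", "biology", "finance", "biology"]

counts = CountVectorizer().fit_transform(docs)

# Multinomial model: uses how often each term occurs (N die tosses per document)
multinomial = MultinomialNB().fit(counts, labels)

# Bernoulli model: only uses whether each term occurs at all
bernoulli = BernoulliNB(binarize=0.0).fit(counts, labels)
```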

  18. Highest Probability Terms in Multinomial Distributions

  19. Common Data Sets used for Evaluation • Reuters • 10700 labeled documents • 10% documents with multiple class labels • Yahoo! Science Hierarchy • 95 disjoint classes with 13,598 pages • 20 Newsgroups data • 18800 labeled USENET postings • 20 leaf classes, 5 root level classes • WebKB • 8300 documents in 7 categories such as “faculty”, “course”, “student”. • Industry • 6449 home pages of companies partitioned into 71 classes

  20. Comparing Naïve Bayes and Multinomial models • McCallum and Nigam (1998) found that the multinomial model outperformed naïve Bayes with binary (Bernoulli) features in text classification experiments

  21. Comparing Multinomial and Bernoulli on Reuters Data (from McCallum and Nigam, 1998)

  22. Comparing Bernoulli and Multinomial (slide from Chris Manning, Stanford) • Results from classifying 13,589 Yahoo! Web pages in the Science subtree of the hierarchy into 95 different classes

  23. WebKB Data Set • Train on ~5,000 hand-labeled web pages • Cornell, Washington, U.Texas, Wisconsin • Crawl and classify a new site (CMU) • Results:

  24. Comparing Bernoulli and Multinomial on Web KB Data

  25. Comments on Generative Models for Text (Comments applicable to both naïve Bayes and multinomial classifiers) • Simple and fast => popular in practice • e.g., linear in p, n, M for both training and prediction • Training = “smoothed” frequency counts • e.g., easy to use in situations where the classifier needs to be updated regularly (e.g., for spam email) • Numerical issues • Typically work with log p( ck | x ), etc., to avoid numerical underflow • Useful trick: • when computing Σj log p( xj | ck ) for sparse data, it may be much faster to • precompute Σj log p( xj = 0 | ck ) • and then, for the few terms that are present, subtract off the log p( xj = 0 | ck ) terms and add the log p( xj = 1 | ck ) terms • Note: both models are “wrong”, but for classification they are often sufficient
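
  A small sketch of the numerical trick described above, assuming a Bernoulli naive Bayes model for a single class; the probability values are invented for illustration.

```python
import numpy as np

# Invented model parameters for one class c: p(x_j = 1 | c) for each term j
log_p1 = np.log(np.array([0.2, 0.05, 0.7, 0.01]))
log_p0 = np.log(1.0 - np.exp(log_p1))

# Precompute the "all terms absent" log-likelihood once per class
base = log_p0.sum()

def log_likelihood(present_idx):
    """Log p(x | c) for a sparse document: start from the precomputed
    all-absent total, then swap in log p(x_j = 1 | c) for the few
    terms that are actually present."""
    return base + (log_p1[present_idx] - log_p0[present_idx]).sum()

print(log_likelihood(np.array([0, 2])))   # document containing terms 0 and 2 only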

  26. optional Beyond independence • Naïve Bayes and multinomial both assume conditional independence of words given the class • Alternative approaches try to account for higher-order dependencies • Bayesian networks: • p( x | c ) = ∏j p( xj | parents(xj), c ) • Equivalent to a directed graph where edges represent direct dependencies • Various algorithms search for a good network structure • Useful for improving the quality of the distribution model • …however, this does not always translate into better classification • Maximum entropy models • p( x | c ) = 1/Z ∏S fS( xS | c ), a product of functions defined on subsets S of the terms • Equivalent to an undirected graph model • Estimation is equivalent to a maximum entropy assumption • Feature selection is crucial (which fS terms to include) • can provide high-accuracy classification • …however, tends to be computationally complex to fit (estimating Z is difficult)

  27. Basic Concepts of Support Vector Machines Circles = support vectors = points on convex hull that are closest to hyperplane M = margin = distance of support vectors from hyperplane Goal is to find weight vector that maximizes M

  28. Reuters Data Set • 21578 documents, labeled manually • 9603 training, 3299 test articles • 118 categories • An article can be in more than one category • Learn 118 binary category distinctions • Example “interest rate” article: 2-APR-1987 06:35:19.50 west-germany b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052 FRANKFURT, March 2 The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct. • Common categories (#train, #test): • Earn (2877, 1087) • Acquisitions (1650, 179) • Money-fx (538, 179) • Grain (433, 149) • Crude (389, 189) • Trade (369, 119) • Interest (347, 131) • Ship (197, 89) • Wheat (212, 71) • Corn (182, 56)

  29. Dumais et al. 1998: Reuters - Accuracy

  30. Precision-Recall for SVM (linear), Naïve Bayes, and NN (from Dumais 1998) using the Reuters data set

  31. Comparing Text Classifiers • Naïve Bayes or Multinomial models • Low time complexity (training = single linear pass through the data) • Generally good, but not always best performance • Widely used for spam email filtering • Linear SVMs, Logistic Regression • Often produce best results in research studies • But more computationally complex to train (particularly SVMs) • Others • decision trees: less widely used, but can be useful

  32. optional Learning with Labeled and Unlabeled documents • In practice, obtaining labels for documents is time-consuming, expensive, and error-prone • Typical application: a small number of labeled docs and a very large number of unlabeled docs • Idea: • Build a probabilistic model on the labeled docs • Classify the unlabeled docs, getting p(class | doc) for each class and doc • This is equivalent to the E-step in the EM algorithm • Now relearn the probabilistic model using the new “soft labels” • This is equivalent to the M-step in the EM algorithm • Continue to iterate until convergence (e.g., class probabilities do not change significantly) • This EM approach shows that unlabeled data can improve classification performance compared to using labeled data alone (see the sketch below)
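
  A compact sketch of the EM loop described above, using scikit-learn's MultinomialNB as the probabilistic model. The documents, labels, and the use of per-class sample weights to carry the soft labels are all illustrative choices, not the exact Nigam et al. algorithm.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled_docs   = ["interest rates and stocks", "gene expression in cells"]
labeled_y      = np.array([0, 1])              # 0 = finance, 1 = biology (invented)
unlabeled_docs = ["bank lowers the rate", "protein and gene regulation",
                  "stock market report"]

vec = CountVectorizer().fit(labeled_docs + unlabeled_docs)
X_lab, X_unl = vec.transform(labeled_docs), vec.transform(unlabeled_docs)

model = MultinomialNB().fit(X_lab, labeled_y)  # initial model: labeled docs only

for _ in range(10):                            # iterate until (approximate) convergence
    soft = model.predict_proba(X_unl)          # E-step: p(class | doc) for unlabeled docs

    # M-step: refit on labeled docs (weight 1) plus each unlabeled doc counted
    # once per class, weighted by its soft label probability
    X_all = vstack([X_lab, X_unl, X_unl])
    y_all = np.concatenate([labeled_y,
                            np.zeros(X_unl.shape[0]), np.ones(X_unl.shape[0])])
    w_all = np.concatenate([np.ones(len(labeled_y)), soft[:, 0], soft[:, 1]])
    model = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
```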

  33. Learning with Labeled and Unlabeled Data (from “Semi-supervised text classification using EM”, Nigam, McCallum, and Mitchell, 2006)

  34. Other issues in text classification • Real-time constraints: • Being able to update classifiers as new data arrive • Being able to make predictions very quickly in real time • Document length • Varying document length can be a problem for some classifiers • The multinomial model tends to handle this better than Bernoulli, for example • Multi-labels and multiple classes • Text documents can have more than one label • SVMs, for example, are inherently binary classifiers, so multi-class and multi-label problems are typically handled by training one binary classifier per label (see the sketch below)
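
  A small sketch of the usual one-vs-rest workaround for multi-label classification, in the spirit of the Reuters setup with 118 binary category distinctions; the documents and label sets here are invented placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

docs = ["bundesbank leaves credit policies unchanged",
        "wheat and corn exports rise",
        "company acquires rival in cash deal"]
labels = [["interest", "money-fx"], ["grain", "wheat", "corn"], ["acq"]]

Y = MultiLabelBinarizer().fit_transform(labels)   # one binary column per category
X = TfidfVectorizer().fit_transform(docs)

# One linear SVM per category; each decides "in this category or not"
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
print(clf.predict(X))
```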

  35. Other issues in text classification (continued) • Feature selection • Experiments have shown that feature selection (e.g., by greedy algorithms using information gain) can often improve results • Linked documents • Can view Web documents as nodes in a directed graph • Classification can then be performed in a way that leverages the link structure • Heuristic: class labels of linked pages are more likely to be the same • The optimal solution is to classify all documents jointly rather than individually • The resulting “global classification” problem is typically computationally complex

  36. Background Resources: Document Classification • S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003. • See chapter 5 for discussion of text classification • C. D. Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008 • Chapters 13 to 15 on text classification • (and chapters 16 and 17 on text clustering) • http://nlp.stanford.edu/IR-book/information-retrieval-book.html • SVMs for text classification • T. Joachims, Learning to Classify Text using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002

  37. Document Clustering

  38. Document Clustering • For clustering we can use either: • vectors to represent each document (e.g., bag of words) • useful for clustering algorithms such as k-means or probabilistic clustering • or, for N documents, an N x N similarity matrix • doc–doc similarity can be defined in different ways (e.g., cosine similarity of TF-IDF vectors) • useful for clustering methods such as hierarchical clustering • Unlike classification, there is typically less concern with selecting the “best” vocabulary for clustering • remove common stop words and infrequent words • Both representations are illustrated in the sketch below
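
  A minimal sketch of both options above: k-means on TF-IDF vectors, and hierarchical clustering on a document-document cosine-similarity matrix. The documents and cluster counts are invented; recent scikit-learn versions use the metric= argument shown here (older ones use affinity=).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

docs = ["gene expression in tumor cells", "tumor suppressor genes",
        "interest rates and stock markets", "bank lowers interest rate"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Option 1: cluster the TF-IDF vectors directly (e.g. k-means)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Option 2: build an N x N similarity matrix and cluster it hierarchically
dist = 1.0 - cosine_similarity(X)          # convert similarity to a distance
np.fill_diagonal(dist, 0.0)
hc = AgglomerativeClustering(n_clusters=2, metric="precomputed", linkage="average")
hc_labels = hc.fit_predict(dist)

print(km_labels, hc_labels)
```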

  39. Case Study: Clustering of 2 Million PubMed Articles Reference: Boyack et al., Clustering More than Two Million Biomedical Publications…, PLoS ONE, 6(3), March 2011 • Data set: 2.15 million articles in PubMed • all articles published between 2004 and 2008 with at least 5 MeSH terms • Data for each document • MeSH terms • Words from titles and abstracts • Preprocessing • MEDLINE stopword list of 132 words + 300 words commonly used at NIH • Terms appearing in fewer than 4 documents were removed • 272,926 unique terms and 175 million document–term pairs

  40. Methodology • Methods compared • Data sources: MeSH only versus Title/Abstract words only • Similarity metrics • Tf-idf cosine (see earlier lectures) • Latent semantic indexing/analysis (see earlier lectures) • Topic modeling (discussed later in these slides) • Self-organizing map (neural network method) • Poisson-based model • 25 million similarity pairs computed • Approximately top-12 most similar documents for each document • 9 sets of clusters compared • 9 combinations of clustering data-source+similarity metric evaluated • Hierarchical (single-link) clustering applied to each of the 9 similarity sets • Heuristics used to determine when clusters can no longer be merged
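
  As a small illustration of the similarity computation described above (roughly the top-12 most similar documents per document), the following sketch uses the TF-IDF cosine metric only; the documents and the value of top_k are placeholders, whereas the study compared several metrics at far larger scale.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["tumor gene expression profiles", "gene regulation in tumor cells",
        "interest rate policy statement", "randomized clinical trial of aspirin"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
sim = cosine_similarity(X)
np.fill_diagonal(sim, -1.0)            # exclude a document from its own list

top_k = 2                              # the study used roughly the top 12
nearest = np.argsort(-sim, axis=1)[:, :top_k]
print(nearest)                         # indices of the most similar docs per document
```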

  41. Evaluation Methods and Results • Evaluation metrics • Textual coherence within a cluster (see paper) • Coherence of papers within a cluster in terms of funding source (Question: how reliable are these metrics?) • Conclusions (from the paper) • PubMed’s own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. • Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.

  42. Textual Coherence Results

  43. Two-dimensional map of the highest-scoring cluster solution, representing nearly 29,000 clusters and over two million articles.

  44. Background Resources: Document Clustering • Papers • Boyack KW, Newman D, Duhon RJ, Klavans R, Patek M, et al. (2011) Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 6(3): e18029. doi:10.1371/journal.pone.0018029 • Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey, Scatter/Gather: a cluster-based approach to browsing large document collections, Proceedings of ACM SIGIR '92. • Ying Zhao and George Karypis (2005), Hierarchical clustering algorithms for document data sets, Data Mining and Knowledge Discovery, Vol. 10, No. 2, pp. 141–168. • MALLET (Software) • Java-based package for classification, clustering, topic modeling, and more… • http://mallet.cs.umass.edu/

  45. Topic Modeling Some slides courtesy of David Newman, UC Irvine

  46. Topics = Multinomial Distributions

  47. What is Topic Modeling? • Topic = probability distribution (multinomial) over words • A document is assumed to be a mixture of topics • Each document is represented by a probability distribution over topics • Note that this is different from clustering, which assigns each doc to a single cluster • The learning algorithm is completely unsupervised • No labels required • Output of the learning algorithm • T topics, each represented by a probability distribution over words • For each document, its mixture of topics • For each word in each document, the topic it is assigned to • A minimal example is sketched below
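
  A minimal sketch of fitting a topic model with scikit-learn's LatentDirichletAllocation; the documents and the number of topics are invented placeholders. It prints the top words of each topic (the per-topic word distributions) and the per-document topic mixtures.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["gene expression in tumor cells", "tumor suppressor gene regulation",
        "interest rates and stock markets", "central bank interest rate policy"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# Completely unsupervised: no labels, just a chosen number of topics T
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for t, weights in enumerate(lda.components_):   # one word distribution per topic
    top = weights.argsort()[::-1][:3]
    print("topic", t, [terms[i] for i in top])

print(lda.transform(X))                         # per-document topic mixtures
```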

  48. What is Topic Modeling useful for? • Summarize large document collections • Retrieve documents • Automatically index documents • Analyze text • Measure trends • Generate topic-based networks

  49. Topic Modeling vs. Other Approaches • Clustering (summarization) • Topic model typically outperforms clustering • Clustering: document belongs to single cluster • Topic modeling: document is composed of multiple topics • Latent semantic indexing (theme discovery) • Topics tend to be more interpretable than LSI “vectors” • TF-IDF (information retrieval) • Keywords are often too specific. Topics (or a combination of topics and keywords) are usually better

  50. Topic Modeling vs. Clustering • The topic model has the advantage over clustering that it can assign multiple topics per document (clustering: one cluster per document; topic model: multiple topics per document)
