
An Overview of Topic Modeling




  1. An Overview of Topic Modeling. Weifeng Li (1,2) and Hsinchun Chen (1). (1) Artificial Intelligence Laboratory, The University of Arizona; (2) University of Georgia

  2. Acknowledgements • Many of the pictures, results, and other materials are taken from: • David Blei, Princeton University • The Stanford Natural Language Processing Group

  3. Outline • Introduction and Motivation • Latent Dirichlet Allocation • Probabilistic Modeling Overview • LDA Assumptions • Inference • Evaluation • Research Example: LDA Application in Profiling Underground Economy Sellers • LDA Variants • Relaxing the Assumptions of LDA • Incorporating Metadata • Coupling with Deep Learning • Generalizing to Other Kinds of Data • Future Directions • Tools & Implementation Details

  4. Outline • Introduction and Motivation • Latent Dirichlet Allocation • Probabilistic Modeling Overview • LDA Assumptions • Inference • Evaluation • Research Example: LDA Application in Profiling Underground Economy Sellers • LDA Variants • Relaxing the Assumptions of LDA • Incorporating Metadata • Coupling with Deep Learning • Generalizing to Other Kinds of Data • Future Directions • Tools & Implementation Details

  5. Introduction and Motivation • As more information becomes easily available, it is difficult to find and discover what we need. • Topic models are a suite of algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents. • Among these algorithms, Latent Dirichlet Allocation (LDA), a technique based on Bayesian modeling, is the most commonly used nowadays. • Topic models can be applied to massive collections of documents to automatically organize, understand, search, and summarize large electronic archives. • Especially relevant in today’s “Big Data” environment.

  6. Introduction and Motivation • Each topic is a distribution over words; each document is a mixture of corpus-wide topics; and each word is drawn from one of those topics.

  7. Introduction and Motivation • In reality, we only observe the documents. The other structures are hidden variables. Our goal is to infer the hidden variables.

  8. Introduction and Motivation: 100-topic LDA, 17,000 Science articles. The resulting output from an LDA model is a set of topics, each containing keywords, which are then manually labeled. On the left are the inferred topic proportions for the example article from the previous figure.

  9. Use Cases of Topic Modeling • Topic models have been used to: • Annotate documents and images • Organize and browse large corpora • Model topic evolution • Categorize source code archives • Discover influential articles

  10. Outline • Introduction and Motivation • Latent Dirichlet Allocation • Probabilistic Modeling Overview • LDA Assumptions • Inference • Evaluation • Research Example: LDA Application in Profiling Underground Economy Sellers • LDA Variants • Relaxing the Assumptions of LDA • Incorporating Metadata • Coupling with Deep Learning • Generalizing to Other Kinds of Data • Future Directions • Tools & Implementation Details

  11. Probabilistic Modeling Overview • Modeling: treat the data as arising from a generative process that includes hidden variables. This defines a joint distribution over both the observed and the hidden variables. • Inference: infer the conditional distribution (posterior) of the hidden variables given the observed variables. • Analysis: check the fit of the model; make predictions based on new data; explore the properties of the hidden variables. (Pipeline: Modeling → Inference → Analysis.)

  12. Latent Dirichlet Allocation: Assumptions • LDA is a generative Bayesian model for topic modeling, built on the following assumptions: • Assumptions on the variables: • Word: the basic unit of discrete data • Document: a collection of words (exchangeability assumption) • Corpus: a collection of documents • Topic (hidden): a distribution over words; the number of topics K is assumed known. • Assumptions on how texts are generated (Dirichlet distribution: next slide): • For each topic k, draw a multinomial over words β_k ~ Dir(η) • For each document d: • Draw a document topic proportion θ_d ~ Dir(α) • For each word n: • Draw a topic z_{d,n} ~ Mult(θ_d) • Draw a word w_{d,n} ~ Mult(β_{z_{d,n}})
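
As a concrete illustration of the generative assumptions above, the following short Python sketch simulates a tiny synthetic corpus from the LDA generative process; the vocabulary size, hyperparameters, and corpus sizes are illustrative assumptions, not values from the slides.

import numpy as np

rng = np.random.default_rng(0)
V, K, D, N = 1000, 5, 20, 50           # vocab size, topics, docs, words per doc (illustrative)
alpha, eta = 0.1, 0.01                 # Dirichlet hyperparameters (illustrative)

beta = rng.dirichlet(np.full(V, eta), size=K)      # one word distribution per topic
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))       # per-document topic proportions
    z = rng.choice(K, size=N, p=theta)             # topic assignment for each word slot
    words = [rng.choice(V, p=beta[k]) for k in z]  # draw each word from its assigned topic
    docs.append(words)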

  13. Dirichlet Distribution: Dir(α) • Named after Johann Peter Gustav Lejeune Dirichlet and often denoted Dir(α); a family of continuous multivariate probability distributions parameterized by a vector α of positive reals. • Dir(α) is the multivariate generalization of the beta distribution. Dirichlet distributions are often used as prior distributions in Bayesian statistics. • The Dirichlet distribution is the conjugate prior of the categorical and multinomial distributions. (Conjugate distributions: the posterior distribution is in the same family as the prior distribution.)
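
To make the role of the concentration parameter α concrete, this small sketch (not from the slides; values are illustrative) draws samples from a 3-dimensional Dirichlet for two symmetric α settings:

import numpy as np

rng = np.random.default_rng(1)
# Small, symmetric alpha: samples concentrate near the corners of the simplex (sparse vectors)
print(rng.dirichlet([0.1, 0.1, 0.1], size=3))
# Large, symmetric alpha: samples concentrate near the uniform point (1/3, 1/3, 1/3)
print(rng.dirichlet([10.0, 10.0, 10.0], size=3))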

  14. LDA: Probabilistic Graphical Model. The per-document topic proportion θ_d is a multinomial (probability vector) generated from a Dirichlet distribution parameterized by α. Similarly, each topic β_k is a multinomial over words generated from a Dirichlet distribution parameterized by η. For each word w_{d,n}, its topic z_{d,n} is drawn from the document topic proportions θ_d. Then, the word is drawn from the topic β_k, where k = z_{d,n}.

  15. The Graphical Model for LDA: Joint Distribution • The joint distribution specifies a number of dependencies that define LDA (as shown in the plate diagram).
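
The formula itself is not visible in the transcript; in the notation of the previous slides, the standard LDA joint distribution is:

p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{k=1}^{K} p(\beta_k \mid \eta)\ \prod_{d=1}^{D}\Big[\, p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \Big]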

  16. Inference • Objective: computing the conditional distribution (posterior) of the topic structure given the observed documents. • The numerator, p(β, θ, z, w), is the joint distribution of all the random variables, which is easy to compute. • The denominator, p(w), is the marginal probability of the observations (the probability of seeing the observed corpus under any topic structure), which is intractable. • In theory, p(w) is computed by summing the joint distribution over every possible combination of topic assignments, which is exponentially large. • Approximation methods search over the topic structure: • Sampling-based algorithms attempt to collect samples from the posterior to approximate it with an empirical distribution. • Variational algorithms posit a parameterized family of distributions over the hidden structure and then find the member of that family that is closest to the posterior.
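
In the notation above, the posterior this slide refers to is:

p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}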

  17. More on Approximation Methods • Among sampling-based algorithms, Gibbs sampling is the most commonly used: • Approximate the posterior with samples. • Construct a Markov chain—a sequence of random variables, each dependent on the previous—whose limiting distribution is the posterior. • The Markov chain is defined on the hidden topic variables for a particular corpus; the algorithm runs the chain for a long time, collects samples from the limiting distribution, and then approximates the posterior with the collected samples (see Steyvers & Griffiths, 2006). • Variational algorithms are a deterministic alternative to sampling-based algorithms. • Posit a parameterized family of distributions over the hidden structure and then find the member of that family that is closest to the posterior. • The inference problem is transformed into an optimization problem. • Coordinate ascent variational inference algorithm for LDA (see Blei, Ng, and Jordan, 2003)
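
For illustration, here is a minimal collapsed Gibbs sampler for LDA in plain NumPy — a sketch of the general approach described by Steyvers & Griffiths, not the authors' implementation. It assumes documents are lists of integer word ids in [0, V); the function name and default hyperparameters are illustrative.

import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=0.1, eta=0.01, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))            # document-topic counts
    nkw = np.zeros((K, V))            # topic-word counts
    nk = np.zeros(K)                  # total words assigned to each topic
    z = []                            # topic assignment for every token
    for d, doc in enumerate(docs):    # random initialization
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iters):          # resample every token's topic assignment
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # conditional p(z = k | everything else), up to a constant
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    # posterior mean estimates of document topic proportions and topics
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)
    beta = (nkw + eta) / (nkw.sum(axis=1, keepdims=True) + V * eta)
    return theta, beta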

  18. Model Evaluation: Perplexity • Perplexity is the most typical evaluation of LDA models (Bao & Datta, 2014; Blei et al., 2003). • Perplexity measures modeling power through the likelihood of unobserved (held-out) documents: it is a decreasing function of the held-out log-likelihood, so higher likelihood means lower perplexity and a better model. • Better models have lower perplexity, indicating less uncertainty about the unobserved documents. • Notation: w_d denotes the words in held-out document d and N_d the length of document d; perplexity exponentiates the negative average per-word log-likelihood over all held-out documents. • The accompanying figure compares LDA with other topic modeling approaches: LDA is consistently better than all benchmark approaches, and as the number of topics goes up, the LDA model improves (i.e., the perplexity decreases).
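
In LaTeX, the standard form of this measure (Blei et al., 2003), for a test set of M held-out documents, is:

\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\!\left\{ -\, \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right\}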

  19. Model Evaluation: Topic Coherence • Topic coherence evaluates the semantic nature of the learned topics. • Specifically, it measures the semantic similarity among the top keywords of a topic. Topic coherence has been shown to correlate with human evaluations of topic quality. • The coherence of a topic is calculated by summing a pairwise score over the topic's top keywords. • Two score metrics are commonly used: • The extrinsic UCI metric (Newman et al., 2010), which scores a word pair by its pointwise mutual information, where the co-occurrence probability of the pair and the probability of each word are estimated from an external corpus (e.g., Wikipedia). • The intrinsic UMass metric (Mimno et al., 2011), which scores a word pair by the (smoothed) number of training documents in which the pair co-occurs, normalized by the number of documents containing the conditioning word. Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010). Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 100-108). ACL. Mimno, D., Wallach, H. M., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 262-272). ACL.
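
In LaTeX, with W^{(t)} = (w_1^{(t)}, \dots, w_M^{(t)}) denoting the M top keywords of topic t, the coherence score and the two pairwise metrics take their standard forms:

C\big(t; W^{(t)}\big) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \mathrm{score}\big(w_m^{(t)}, w_l^{(t)}\big)

\mathrm{score}_{\mathrm{UCI}}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}, \qquad \mathrm{score}_{\mathrm{UMass}}(w_i, w_j) = \log \frac{D(w_i, w_j) + 1}{D(w_j)}

where p(\cdot) is estimated from the external corpus and D(\cdot) counts documents in the training corpus.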

  20. Model Selection: How Many Topics to Choose • The author of LDA suggests selecting the number of topics from 50 to 150 (Blei 2012); however, the optimal number usually depends on the size of the dataset. • Cross-validation on perplexity is often used for selecting the number of topics. • Specifically, we first propose candidate numbers of topics, evaluate the average perplexity using cross-validation, and pick the number of topics with the lowest perplexity; a sketch of this procedure appears below. • The following plot illustrates the selection of the optimal number of topics for 4 datasets.
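
One way to implement this selection with gensim is sketched below; this is an assumption-laden illustration, not the authors' code. The function name select_num_topics and the candidate values are hypothetical, a single hold-out split stands in for full cross-validation, and the corpus/dictionary objects are those built in the gensim example later in this deck.

import numpy as np
from gensim.models import LdaModel

def select_num_topics(corpus, dictionary, candidates=(25, 50, 75, 100, 125, 150),
                      holdout_frac=0.2, seed=42):
    # Hold out a fraction of documents and pick the topic count with lowest held-out perplexity.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(corpus))
    cut = int(len(corpus) * (1 - holdout_frac))
    train = [corpus[i] for i in idx[:cut]]
    heldout = [corpus[i] for i in idx[cut:]]
    best_k, best_perplexity = None, float("inf")
    for k in candidates:
        lda = LdaModel(corpus=train, id2word=dictionary, num_topics=k,
                       passes=5, random_state=seed)
        # log_perplexity returns the per-word variational bound; exponentiate its negative.
        perplexity = np.exp(-lda.log_perplexity(heldout))
        if perplexity < best_perplexity:
            best_k, best_perplexity = k, perplexity
    return best_k, best_perplexity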

  21. LDA Research Example – Profiling Underground Economy Sellers • The underground economy is the online black market for exchanging products/services that relate to cybercrime. • Cybercrime activities have been largely commoditized in the underground economy, so sellers pose a growing threat to cybersecurity. • Sellers advertise their products/services by giving details about their resources, payments, contacts, etc. • Objective: to profile underground economy sellers to reflect their specialties (characteristics). Li, W., Chen, H., & Nunamaker Jr, J. F. (2016). Identifying and profiling key sellers in cyber carding community: AZSecure text mining system. Journal of Management Information Systems, 33(4), 1059-1086.

  22. LDA Research Example – Profiling Underground Economy Sellers • Input: original threads from hacker forums • Preprocessing: • Thread retrieval: identifying threads related to the underground economy by conducting snowball sampling-based keyword search • Thread classification: identifying advertisement threads using a MaxEnt classifier • Focus on malware advertisements and stolen card advertisements • Can be generalized to other advertisements.

  23. LDA Research Example – Profiling Underground Economy Sellers • To profile a seller, we seek to identify the major topics in its advertisements. • Example input: an advertisement by Rescator, a seller of stolen data, containing a description of the stolen data/service, prices of the stolen data, payment options, and contact information (a dedicated shop and ICQ).

  24. LDA Research Example – Profiling Underground Economy Sellers • For LDA model selection, we use perplexity to choose the optimal number of topics for the advertisement corpus. • Output: • LDA gives the probability of each topic associated with the seller. • We pick the top topics to profile the seller. • For each topic, we pick the top keywords to interpret the topic. • The following table helps us profile Rescator based on its characteristics in terms of the product, the payment, and the contact.

  25. Outline • Introduction and Motivation • Latent Dirichlet Allocation • Probabilistic Modeling Overview • LDA Assumptions • Inference • Evaluation • Research Example: LDA Application in Profiling Underground Economy Sellers • LDA Variants • Relaxing the Assumptions of LDA • Incorporating Metadata • Coupling with Deep Learning • Generalizing to Other Kinds of Data • Future Directions • Tools & Implementation Details

  26. LDA Variants: Relaxing the Assumptions of LDA • Consider the order of the words: words in a document cannot be exchanged • Conditioning on the previous word (Wallach 2006) • Hidden Markov Model (Griffiths et al. 2005) • Consider the order of the documents • Dynamic LDA (Blei & Lafferty 2006) • Consider previously unseen topics: the number of topics is not fixed • Bayesian Nonparametrics (Blei et al. 2010)

  27. Dynamic LDA • Motivation: • LDA assumes the order of documents does not matter (not appropriate for sequential corpora). • We want to capture how language changes over time. • In Dynamic LDA, topics evolve over time. • Dynamic LDA uses a logistic normal distribution to model topics evolving over time. Example: topics drift through time. Blei, D. M., and Lafferty, J. D. 2006. “Dynamic topic models,” in Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), pp. 113–120 (doi: 10.1145/1143844.1143859).

  28. Bayesian Nonparametric Topic Modeling: Hierarchical Dirichlet Process • Bayesian nonparametrics: the parameter space has infinite dimension. • Nonparametric topic model: a topic model whose topic space has infinite dimension (i.e., an infinite number of topics). • Less vulnerable to model overfitting or underfitting caused by misspecifying the number of topics. • A prominent nonparametric topic model is the hierarchical Dirichlet process (HDP) (Teh et al. 2006). • HDP is increasingly chosen over LDA for modeling topics because of its reliability and flexibility. Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 1566-1581.

  29. Dirichlet Process: the Major Building Block of HDP • Dirichlet process (DP): a distribution over distributions on the topic space; each draw is discrete with probability one. • α (concentration parameter): controls how concentrated the discrete distributions drawn from the DP are. • G_0 (base distribution): determines the topic space and the expectation of the discrete distributions drawn from the DP. • Two properties of DP samples that make them suitable for modeling topics: • Clustering property: topics previously drawn from the distribution are likely to be drawn again, allowing the words within a document to be clustered under certain topics. • Infinity property: one can draw as many topics as needed because DP samples are distributions over the topic space; therefore, the number of topics is unbounded.

  30. Example: three samples each drawn from a DP for concentration parameters α = 1, 10, 100, and 1000. • Each sample G is a discrete distribution with probability one. • G can take an infinite number of values sampled from the base distribution G_0. • Values sampled from G are often repeated, thus forming clusters (each spike in the figure represents one cluster).
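
As an illustration of these properties, here is a minimal sketch of drawing an approximate sample G ~ DP(α, G_0) via the standard (truncated) stick-breaking construction; the function name, truncation level, and base distribution are illustrative assumptions, not material from the slides.

import numpy as np

def dp_stick_breaking(alpha, base_sampler, n_atoms=500, seed=0):
    # Truncated stick-breaking construction of G ~ DP(alpha, G0).
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=n_atoms)                        # stick-breaking proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    weights = v * remaining                                       # atom weights (sum to ~1)
    atoms = base_sampler(rng, n_atoms)                            # atom locations drawn from G0
    return atoms, weights

# Example: G0 is a standard normal; a small alpha concentrates mass on a few atoms (clusters)
atoms, weights = dp_stick_breaking(alpha=1.0, base_sampler=lambda rng, n: rng.normal(size=n))
print(np.sort(weights)[::-1][:5])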

  31. Hierarchical Dirichlet Process (HDP) • For the corpus: draw a corpus topic distribution G_0 ~ DP(γ, H). • For each document d: draw a document topic distribution G_d ~ DP(α_0, G_0); for each word, draw a per-word topic from G_d and then draw the observed word from that topic. • The Dirichlet process (DP) allows for modeling an infinite number of topics. • H: base topic distribution (e.g., a Dirichlet distribution Dir(η)); γ: corpus topic concentration parameter; α_0: document topic concentration parameter.

  32. HDP Model Specifications: Hierarchical Dirichlet Process
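
The specification shown on this slide is not recoverable from the transcript; for reference, the standard HDP generative process (Teh et al., 2006), in the notation of the previous slide, is:

G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H)

G_d \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0), \quad d = 1, \dots, D

\theta_{d,n} \mid G_d \sim G_d, \quad n = 1, \dots, N_d

w_{d,n} \mid \theta_{d,n} \sim \mathrm{Mult}(\theta_{d,n})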

  33. Latent Dirichlet Allocation vs. Hierarchical Dirichlet Process • LDA: a fixed set of corpus-level topics; each document d (d = 1, …, D) is represented by its topic proportions over that fixed set. • HDP: corpus-level topics are drawn from a corpus topic distribution; each document d has its own topic distribution, so the set of topics is unbounded.

  34. Hierarchical Dirichlet Process Performance: Document Modeling on Academic Papers • (a): HDP performed as well as the best LDA model, doing so without any form of model selection procedure. • (b): The posterior over the number of topics obtained under the HDP model is consistent with the range of the best-fitting LDA models. (Figure panels: (a) perplexity of LDA versus HDP; (b) posterior number of topics in HDP.) Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 1566-1581.

  35. LDA Variants: Incorporating Metadata • Account for metadata of the documents (e.g., author, title, geographic location, links, etc.) • Author-topic model (Rosen-Zvi et al. 2004) • Assumption: the topic proportions are attached to authors. • Allows for inferences about authors, for example, author similarity. • Relational topic model (Chang & Blei 2010) • Documents are linked (e.g., citation, hyperlink). • Assumption: links between documents depend on the distance between their topic proportions. • Takes into account node attributes (the words of the document) in modeling the network links. • Supervised topic model (Blei & McAuliffe 2007) • A general-purpose method for incorporating metadata into topic models.

  36. From Topic Model to Supervised Topic Model (STM) • Documents (e.g., reviews) are modeled with topics (descriptive, unsupervised) and linked to responses (e.g., service quality) (predictive, supervised). • Example topics from the figure (top keywords with weights): {cvv 0.03, vbv 0.03, ssn 0.02}, {price 0.05, delivery 0.04, service 0.01}, {model 0.03, device 0.01, machine 0.01}, {pos 0.04, skimmer 0.02, encrypt 0.01}, {picture 0.04, video 0.02, tutorial 0.01}. • STM can simultaneously: • Explore the underlying topics • Make predictions using the extracted topics • Prediction is more accurate because the model captures the underlying topics shared across documents (Blei & McAuliffe 2008).

  37. Supervised LDA • Supervised LDA (sLDA) is a topic model of documents paired with response variables; it is fit to find topics predictive of the response variable. Figure: a 10-topic sLDA model on movie reviews (Pang and Lee, 2005), identifying the topics that correspond to ratings. Blei, D. M., and Mcauliffe, J. D. 2008. “Supervised Topic Models,” in Advances in Neural Information Processing Systems, pp. 121–128 (doi: 10.1002/asmb.540).
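
For reference, the response model in standard sLDA (Blei & McAuliffe; not shown explicitly in the transcript) regresses each document's response on its empirical topic frequencies, with regression coefficients η and noise variance σ² (notation local to this slide):

\bar{z}_d = \frac{1}{N_d} \sum_{n=1}^{N_d} z_{d,n}, \qquad y_d \mid z_{d,1:N_d}, \eta, \sigma^2 \sim \mathcal{N}\!\left(\eta^{\top} \bar{z}_d,\; \sigma^2\right)

where z_{d,n} is the indicator vector of word n's topic assignment.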

  38. Parametric vs. Nonparametric • Parametric approach (e.g., supervised LDA): each document (e.g., a review) is represented by its proportions over a predefined number of topics, which are then linked to responses (e.g., service quality). • Nonparametric approach (e.g., supervised HDP): each document is represented by a distribution over an unlimited number of topics, which is then linked to responses. • The example topics (top keywords and weights) are the same as on the previous slide.

  39. Supervised HDP: Nonparametric Supervised Topic Model • Plate-diagram components: corpus topic distribution, document topic distribution, per-word topic, observed word, response, and response coefficient. • The topic distributions and per-word topics are latent variables; the words and responses are observed data.

  40. Supervised HDP: Experimental Results • Evaluation metrics: predictive R-squared, mean absolute error (MAE), and root mean square error (RMSE). • The proposed model achieved at least a 17% improvement over the closest technique and performed better on each metric; several benchmark techniques failed to converge. Li, W., Yin, J., & Chen, H. (2018). Supervised Topic Modeling Using Hierarchical Dirichlet Process-Based Inverse Regression: Experiments on E-Commerce Applications. IEEE Transactions on Knowledge and Data Engineering, 30(6), 1192-1205.

  41. LDA Variants: Coupling with Deep Learning • Neural topic model (Miao et al. 2017) • Variational autoencoder (VAE): a deep learning technique for approximating the posterior distribution using the autoencoder framework. • Building on deep learning’s improved capability of approximating non-linear relationships, VAE generally provides better model inference than traditional variational inference algorithms. • Neural topic model: leveraging VAE for model inference. • Word embedding-based topic modeling (Das et al. 2015) • Word embedding: representing word semantics in a continuous vector space. Semantically related words tend to be closer to each other in the vector space. • Word embedding-based topic: a distribution over the vector space (instead of a multinomial distribution over words). • This encourages the model to group words that are a priori known to be semantically related into topics. Das, R., Zaheer, M., & Dyer, C. (2015). Gaussian LDA for Topic Models with Word Embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Vol. 1, pp. 795–804). Miao, Y., Grefenstette, E., & Blunsom, P. (2017). Discovering Discrete Latent Topics with Neural Variational Inference. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th International Conference on Machine Learning (Vol. 70, pp. 2410–2419).

  42. LDA Variants: Generalizing to Other Kinds of Data • LDA is a mixed-membership model of grouped data. • Rather than associating each group of data with one topic, each group exhibits multiple topics in different proportions. • Hence, LDA can be adapted to other kinds of observations with only small changes to the corresponding inference algorithms. • Population genetics • Application: finding ancestral populations • Intuition: each individual’s genotype descends from one or more of the ancestral populations (topics) • Computer vision • Application: classifying images, connecting images and captions, building image hierarchies, etc. • Intuition: each image exhibits a combination of visual patterns, and the same visual patterns recur throughout a collection of images.

  43. Outline • Introduction and Motivation • Latent Dirichlet Allocation • Probabilistic Modeling Overview • LDA Assumptions • Inference • Evaluation • Research Example: LDA Application in Profiling Underground Economy Sellers • LDA Variants • Relaxing the Assumptions of LDA • Incorporating Metadata • Coupling with Deep Learning • Generalizing to Other Kinds of Data • Future Directions • Tools & Implementation Details

  44. Future Directions • Evaluation and model checking • Held-out accuracy may not correspond to better organization or easier interpretation (Amazon Mechanical Turk experiment; see Chang et al., 2009: perplexity is not strongly correlated with human judgment, and is sometimes even slightly anti-correlated). • Which topic model should I use? • How can I decide which of the many modeling assumptions are important for my goals? • Visualization and user interfaces • How do we display the topics? • How do we best display a document with a topic model? • How can we best display document connections?

  45. Future Directions • Topic models for data discovery • How can topic models help us form hypotheses about the data? • What can we learn about the language based on the topic model posterior? • Topic interpretation • Topics are distributions over words; therefore, interpreting topic semantics from these distributions becomes important. • How do we properly interpret a topic? • Multilingual topic modeling • Can the same topic appear in different languages? • How do we find common topics across different languages?

  46. Outline • Introduction and Motivation • Latent Dirichlet Allocation • Probabilistic Modeling Overview • LDA Assumptions • Inference • Evaluation • Research Example: LDA Application in Profiling Underground Economy Sellers • LDA Variants • Relaxing the Assumptions of LDA • Incorporating Metadata • Coupling with Deep Learning • Generalizing to Other Kinds of Data • Future Directions • Tools & Implementation Details

  47. Topic Modeling - Tools

  48. gensim (Python) Example: Topic Modeling News Articles • We use the news article dataset from the Lee corpus.

  import os, re
  import gensim
  from gensim.corpora import Dictionary
  from gensim.utils import lemmatize
  from nltk.corpus import stopwords
  stops = set(stopwords.words('english'))  # English stopword list used for filtering

  # Data import & preprocessing
  test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])
  lee_train_file = test_data_dir + os.sep + 'lee_background.cor'

  def build_texts(fname):
      with open(fname) as f:
          for line in f:
              yield gensim.utils.simple_preprocess(line, deacc=True, min_len=3)

  train_texts = list(build_texts(lee_train_file))

  # Bigram collocation detection
  bigram = gensim.models.Phrases(train_texts)

  # Stopword removal, bigram merging, and noun-only lemmatization
  def process_texts(texts):
      texts = [[word for word in line if word not in stops] for line in texts]
      texts = [bigram[line] for line in texts]
      texts = [[word.split('/')[0] for word in lemmatize(' '.join(line), allowed_tags=re.compile('(NN)'), min_length=3)] for line in texts]
      return texts

  train_texts = process_texts(train_texts)

  # Creating a bag-of-words model
  dictionary = Dictionary(train_texts)
  corpus = [dictionary.doc2bow(text) for text in train_texts]

  Tutorial retrieved from: https://markroxor.github.io/gensim/static/notebooks/gensim_news_classification.html

  49. gensim (Python) Example: Topic Modeling News Articles

  # Executing LDA & visualizing the topics (specifying the number of topics)
  ldamodel = gensim.models.LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)
  ldatopics = ldamodel.show_topics(formatted=False)

  import pyLDAvis.gensim
  pyLDAvis.enable_notebook()
  pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)

  Example output: an interactive pyLDAvis visualization of the learned topics.
  Tutorial retrieved from: https://markroxor.github.io/gensim/static/notebooks/gensim_news_classification.html

  50. gensim (Python) Example: Topic Modeling News Articles

  # Executing HDP & showing the learned topics
  # Each topic is returned with its most probable keywords and their associated weights.
  hdpmodel = gensim.models.HdpModel(corpus=corpus, id2word=dictionary)
  hdptopics = hdpmodel.show_topics(formatted=False)
  hdpmodel.show_topics()

  Example output:
  [u'topic 0: 0.004*collapse + 0.004*afghanistan + 0.004*troop + 0.003*force + 0.003*government + 0.002*benefit + 0.002*operation + 0.002*taliban + 0.002*time + 0.002*today + 0.002*ypre + 0.002*tourism + 0.002*person + 0.002*help + 0.002*wayne + 0.002*fire + 0.002*peru + 0.002*day + 0.002*united_state + 0.002*hih', u'topic 1: 0.003*group + 0.003*government + 0.002*target + 0.002*palestinian + 0.002*end + 0.002*terrorism + 0.002*cease + 0.002*memorandum + 0.002*radio + 0.002*call + 0.002*official + 0.002*path + 0.002*security + 0.002*wayne + 0.002*attack + 0.002*human_right + 0.001*four + 0.001*gunman + 0.001*sharon + 0.001*subsidiary', u'topic 2: 0.003*rafter + 0.003*double + 0.003*team + 0.002*reality + 0.002*manager + 0.002*cup + 0.002*australia + 0.002*abc + 0.002*nomination + 0.002*user + 0.002*freeman + 0.002*herberton + 0.002*lung + 0.002*believe + 0.002*injury + 0.002*steve_waugh + 0.002*fact + 0.002*statement + 0.002*mouth + 0.002*alejandro', …
  Tutorial retrieved from: https://markroxor.github.io/gensim/static/notebooks/gensim_news_classification.html
