
Statistical Approaches to Joint Modeling of Text and Network Data


Presentation Transcript


  1. Statistical Approaches to Joint Modeling of Text and Network Data Arthur Asuncion, Qiang Liu, Padhraic Smyth UC Irvine MURI Project Meeting August 25, 2009

  2. Outline
  • Models:
    • The “topic model”: Latent Dirichlet Allocation (LDA)
    • Relational topic model (RTM)
  • Inference techniques:
    • Collapsed Gibbs sampling
    • Fast collapsed variational inference
    • Parameter estimation, approximation of non-edges
  • Performance on document networks:
    • Citation network of CS research papers
    • Wikipedia pages of Netflix movies
    • Enron emails
  • Discussion:
    • RTM’s relationship to latent-space models
    • Extensions

  3. Motivation
  • In (online) social networks, nodes/edges often have associated text (e.g. blog posts, emails, tweets)
  • Topic models are suitable for high-dimensional count data, such as text or images
  • Jointly modeling text and network data can be useful:
    • Interpretability: Which “topics” are associated with each node/edge?
    • Link prediction and clustering, based on topics

  4. What is topic modeling?
  • Learning “topics” from a set of documents in a statistical, unsupervised fashion
  • Many useful applications:
    • Improved web search
    • Automatic indexing of digital historical archives
    • Specialized search browsers (e.g. medical applications)
    • Legal applications (e.g. email forensics)
  [Diagram: “bag-of-words” counts and the number of topics are input to the topic model algorithm, which outputs a list of “topics” and a topical characterization of each document.]

  5. Latent Dirichlet Allocation (LDA) [Blei, Ng, Jordan, 2003]
  • History:
    • 1988: Latent Semantic Analysis (LSA)
      • Singular Value Decomposition (SVD) of the word-document count matrix
    • 1999: Probabilistic Latent Semantic Analysis (PLSA)
      • Equivalent to non-negative matrix factorization (NMF) in the version that minimizes KL divergence
    • 2003: Latent Dirichlet Allocation (LDA)
      • Bayesian version of PLSA
  [Matrix view: the W × D matrix P(word | doc) is approximated by the product of a W × K matrix P(word | topic) and a K × D matrix P(topic | doc), where K is the number of topics, W the vocabulary size, and D the number of documents.]
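
  In equation form, this matrix view is the standard mixture decomposition underlying PLSA and LDA (written out here for reference; the notation is ours, not copied from the slide):

  ```latex
  P(w \mid d) \;=\; \sum_{k=1}^{K} P(w \mid k)\, P(k \mid d),
  \qquad
  \underbrace{\mathbf{P}}_{W \times D} \;\approx\;
  \underbrace{\boldsymbol{\Phi}}_{W \times K}\,
  \underbrace{\boldsymbol{\Theta}}_{K \times D}.
  ```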

  6. Graphical model for LDA
  • Each document d has a distribution over topics: θd ~ Dirichlet(α)
  • Each topic k is a distribution over words: φk ~ Dirichlet(β)
  • The topic assignment for each word is drawn from the document’s mixture: zid ~ Discrete(θd)
  • The specific word is drawn from its assigned topic: xid ~ Discrete(φzid)
  • Hidden/observed variables are in unshaded/shaded circles; parameters are in boxes; plates denote replication across indices (Nd words per document, K topics, D documents).
  Demo
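
  A minimal NumPy sketch of this generative process (sizes, hyperparameter values, and function names are illustrative assumptions, not the authors' code):

  ```python
  import numpy as np

  def generate_lda_corpus(D=100, K=10, W=1000, alpha=0.1, beta=0.01, mean_len=50, seed=0):
      """Sample a toy corpus from the LDA generative process on this slide."""
      rng = np.random.default_rng(seed)
      phi = rng.dirichlet(beta * np.ones(W), size=K)     # topic k: distribution over W words
      theta = rng.dirichlet(alpha * np.ones(K), size=D)  # document d: distribution over K topics
      corpus = []
      for d in range(D):
          n_d = rng.poisson(mean_len)                    # document length (not modeled by LDA itself)
          z = rng.choice(K, size=n_d, p=theta[d])        # topic assignment z_id ~ Discrete(theta_d)
          x = np.array([rng.choice(W, p=phi[k]) for k in z])  # word x_id ~ Discrete(phi_{z_id})
          corpus.append((z, x))
      return phi, theta, corpus
  ```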

  7. What if the corpus has network structure? CORA citation network. Figure from [Chang, Blei, AISTATS 2009]

  8. Relational Topic Model (RTM) [Chang, Blei, 2009]
  • Same setup as LDA, except now we also observe network structure across documents (an adjacency matrix)
  • A “link probability function” ties the two together: documents with similar topics are more likely to be linked.

  9. Link probability functions
  • Exponential: ψe(yd,d′ = 1) = exp( η ⋅ (z̄d ∘ z̄d′) + ν )
  • Sigmoid: ψσ(yd,d′ = 1) = σ( η ⋅ (z̄d ∘ z̄d′) + ν )
  • Normal CDF: ψΦ(yd,d′ = 1) = Φ( η ⋅ (z̄d ∘ z̄d′) + ν )
  • Normal: ψN(yd,d′ = 1) = exp( −η ⋅ ((z̄d − z̄d′) ∘ (z̄d − z̄d′)) − ν )
  • where z̄d = (1/Nd) Σn zdn, ∘ denotes the element-wise (Hadamard) product, and each zdn is a 0/1 indicator vector of size K
  Note: The formulation above is similar to cosine similarity, but since we don’t divide by the magnitudes, this is not a true notion of “distance”.
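
  A small Python sketch of these link functions (the 'normal' variant is omitted; argument names and the scalar-pair interface are our assumptions):

  ```python
  import numpy as np
  from math import erf, sqrt

  def link_probability(zbar_d, zbar_dp, eta, nu, kind="exponential"):
      """Probability of a link between documents d and d', given their mean
      topic-assignment vectors (length K), weights eta, and bias nu."""
      s = float(eta @ (zbar_d * zbar_dp) + nu)     # eta . (zbar_d ∘ zbar_d') + nu
      if kind == "exponential":
          return np.exp(s)                         # a valid probability only when s <= 0
      if kind == "sigmoid":
          return 1.0 / (1.0 + np.exp(-s))
      if kind == "normal_cdf":
          return 0.5 * (1.0 + erf(s / sqrt(2.0)))  # standard normal CDF
      raise ValueError(f"unknown link function: {kind}")
  ```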

  10. Approximate inference techniques (because exact inference is intractable)
  • Collapsed Gibbs sampling (CGS):
    • Integrate out Θ and Φ
    • Sample each zid from its conditional distribution
    • CGS for LDA: [Griffiths, Steyvers, 2004]
  • Fast collapsed variational Bayesian inference (“CVB0”):
    • Integrate out Θ and Φ
    • Update the variational distribution for each zid using the conditional
    • CVB0 for LDA: [Asuncion, Welling, Smyth, Teh, 2009]
  • Other options: ML/MAP estimation, non-collapsed GS, non-collapsed VB, etc.

  11. Collapsed Gibbs sampling for RTM
  • The conditional distribution of each zid factorizes into three parts: the usual LDA term, an “edge” term over documents linked to d, and a “non-edge” term over documents not linked to d.
  • Using the exponential link probability function, the “edge” term is computationally efficient to calculate.
  • Computing the “non-edge” term exactly is very costly.
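
  To make the structure concrete, here is a minimal NumPy sketch of this conditional for the exponential link, with the non-edge term dropped (as in the first approximation on the next slide). Array names and shapes are assumptions, not the authors' implementation:

  ```python
  import numpy as np

  def rtm_gibbs_conditional(d, w, Ndk, Nwk, Nk, Nd, zbar, neighbors, alpha, beta, eta, W):
      """Normalized conditional p(z_id = k | everything else) for a token of word type w
      in document d, assuming:
        Ndk[d, k]    tokens in document d assigned to topic k (current token removed)
        Nwk[w, k]    tokens of word type w assigned to topic k (current token removed)
        Nk[k]        total tokens assigned to topic k (current token removed)
        Nd[d]        length of document d
        zbar[d]      mean topic-assignment vector of document d (length K)
        neighbors[d] indices of documents linked to d."""
      # LDA term: the standard collapsed Gibbs update for LDA.
      lda_term = (Ndk[d] + alpha) * (Nwk[w] + beta) / (Nk + W * beta)
      # Edge term: with the exponential link, the factors that depend on z_id reduce to
      # exp((eta_k / N_d) * sum over linked d' of zbar[d', k]); constants in k cancel.
      s = zbar[list(neighbors[d])].sum(axis=0)
      edge_term = np.exp(eta * s / Nd[d])
      p = lda_term * edge_term
      return p / p.sum()
  ```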

  12. Approximating the non-edges
  • Assume non-edges are “missing” and ignore the term entirely (Chang/Blei)
  • Make a fast approximation to the full non-edge term
  • Subsample non-edges and exactly calculate the term over the subset
  • Subsample non-edges, but instead of recalculating statistics for every zid token, calculate statistics once per document and cache them over each Gibbs sweep
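
  A hedged sketch of the subsampling idea in the third bullet: evaluate the non-edge log-probability on a random subset of non-linked documents and rescale by the inverse sampling fraction. The exact approximation and caching scheme used in the talk may differ; names and the exponential-link assumption are ours:

  ```python
  import numpy as np

  def approx_log_nonedge_term(zbar_d, zbar, nonlinked, eta, nu, frac=0.2, rng=None):
      """Subsampled estimate of sum over non-linked d' of log(1 - psi_e(d, d')),
      assuming eta . (zbar_d ∘ zbar_d') + nu < 0 so each psi_e is a valid probability."""
      rng = rng or np.random.default_rng(0)
      nonlinked = np.asarray(nonlinked)
      if nonlinked.size == 0:
          return 0.0
      m = max(1, int(frac * nonlinked.size))
      subset = rng.choice(nonlinked, size=m, replace=False)
      s = zbar[subset] @ (eta * zbar_d) + nu   # link-function argument for each sampled d'
      log_no_link = np.log1p(-np.exp(s))       # log(1 - psi_e) for each sampled non-edge
      return (nonlinked.size / m) * log_no_link.sum()
  ```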

  13. Variational inference
  • Minimize the Kullback-Leibler (KL) divergence between the true posterior and a “variational” posterior, which is equivalent to maximizing the “evidence lower bound” obtained from Jensen’s inequality.
  • The gap between the log evidence and the lower bound is KL[q, p(h|y)], so by maximizing this lower bound we implicitly minimize KL(q, p).
  • Typically we use a factorized variational posterior for computational reasons.
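
  Written out, the standard identity behind these bullets (our notation, with h the hidden variables and y the observations) is:

  ```latex
  \log p(y)
    \;=\;
    \underbrace{\mathbb{E}_{q(h)}\!\left[\log \frac{p(y,h)}{q(h)}\right]}_{\text{evidence lower bound}}
    \;+\;
    \mathrm{KL}\!\left(q(h)\,\middle\|\,p(h \mid y)\right),
  \qquad
  q(h) \;=\; \prod_{i,d} q(z_{id}).
  ```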

  14. CVB0 inference for topic models [Asuncion, Welling, Smyth, Teh, 2009]
  • Collapsed Gibbs sampling draws each zid from its conditional; collapsed variational inference with a 0th-order approximation (CVB0) instead updates the variational distribution q(zid) using the same conditional form.
  • Statistics affected by q(zid): the counts in the LDA term and the counts entering the Hadamard product of the link term.
  • CVB0 is a “soft” Gibbs update: it is deterministic and very similar to ML/MAP estimation.
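
  A minimal sketch of one CVB0 token update for plain LDA (for the RTM, the edge factor from slide 11 would multiply into the update as well). Array names and shapes are assumptions:

  ```python
  import numpy as np

  def cvb0_update(d, w, q_id, Ndk, Nwk, Nk, alpha, beta, W):
      """One CVB0 update for a token of word type w in document d.
      q_id is the token's current variational distribution q(z_id);
      Ndk (D,K), Nwk (W,K), Nk (K,) hold *expected* counts under q."""
      # Remove this token's current soft contribution from the expected counts.
      Ndk[d] -= q_id
      Nwk[w] -= q_id
      Nk -= q_id
      # 0th-order collapsed update: the Gibbs conditional applied to expected counts ("soft" Gibbs).
      q_new = (Ndk[d] + alpha) * (Nwk[w] + beta) / (Nk + W * beta)
      q_new /= q_new.sum()
      # Add the updated soft assignment back in; the update is deterministic.
      Ndk[d] += q_new
      Nwk[w] += q_new
      Nk += q_new
      return q_new
  ```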

  15. Parameter estimation
  • We learn the parameters of the link function (γ = [η, ν]) via gradient ascent with a chosen step size.
  • We learn the Dirichlet hyperparameters (α, β) via a fixed-point algorithm [Minka, 2000]. It is also possible to Gibbs sample α and β.
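
  A sketch of one gradient-ascent step on γ = [η, ν], written for the sigmoid link, where the observed-link likelihood reduces to logistic regression on the Hadamard features z̄d ∘ z̄d′. The exponential link and the non-edge approximations would change the gradient; the step size and names are illustrative:

  ```python
  import numpy as np

  def gradient_ascent_step(eta, nu, pairs, labels, zbar, step_size=0.1):
      """pairs: list of (d, d') document index pairs; labels: 1 for an edge,
      0 for a (sampled) non-edge; zbar[d] is the mean topic vector of document d."""
      g_eta = np.zeros_like(eta)
      g_nu = 0.0
      for (d, dp), y in zip(pairs, labels):
          feat = zbar[d] * zbar[dp]                        # z̄_d ∘ z̄_d'
          p = 1.0 / (1.0 + np.exp(-(eta @ feat + nu)))     # sigmoid link probability
          g_eta += (y - p) * feat                          # gradient w.r.t. eta
          g_nu += (y - p)                                  # gradient w.r.t. nu
      return eta + step_size * g_eta, nu + step_size * g_nu
  ```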

  16. Document networks

  17. Link rank
  • We use “link rank” on held-out data as our evaluation metric; lower is better.
  • How to compute link rank for RTM:
    • Run the RTM Gibbs sampler on {dtrain} and obtain {Φ, Θtrain, η, ν}
    • Given Φ, fold in dtest to obtain Θtest
    • Given {Θtrain, Θtest, η, ν}, calculate the probability that dtest would link to each dtrain, and rank {dtrain} according to these probabilities
    • For each observed link between dtest and {dtrain}, find its rank, and average all these ranks to obtain the “link rank”
  [Diagram: edges among {dtrain} and edges between dtest and {dtrain} feed a black-box predictor that produces a ranking over {dtrain}, from which the link ranks are read off.]
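
  A short sketch of the metric itself, once a link-probability score for every training document has been computed from {Θtrain, Θtest, η, ν} (function and argument names are ours):

  ```python
  import numpy as np

  def link_rank(scores, true_links):
      """Average rank of a held-out document's true links among all training documents,
      given one link-probability score per training document (higher = more probable).
      Lower is better; random guessing gives roughly half the number of training documents."""
      order = np.argsort(-np.asarray(scores))                      # training docs, best first
      rank_of = {int(doc): r + 1 for r, doc in enumerate(order)}   # 1-based rank per doc
      return float(np.mean([rank_of[d] for d in true_links]))
  ```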

  18. Results on CORA data We performed 8-fold cross-validation. Random guessing gives link rank = 2000.

  19. Results on CORA data • Model does better with more topics • Model does better with more words in each document

  20. Timing Results on CORA “Subsampling (20%) without caching” not shown since it takes 62,000 seconds for D=1000 and 3,720,150 seconds for D=4000

  21. CGS vs. CVB0 inference
  Total time: CGS = 5285 seconds, CVB0 = 4191 seconds.
  CVB0 converges more quickly. Also, each iteration is faster due to clumping of data points.

  22. Results on Netflix Baseline does very well! Needs more investigation…

  23. Some Netflix topics
  POLICE: [t2] police agent kill gun action escape car film
  DISNEY: [t4] disney film animated movie christmas cat animation story
  AMERICAN: [t5] president war american political united states government against
  CHINESE: [t6] film kong hong chinese chan wong china link
  WESTERN: [t7] western town texas sheriff eastwood west clint genre
  SCI-FI: [t8] earth science space fiction alien bond planet ship
  AWARDS: [t9] award film academy nominated won actor actress picture
  WAR: [t20] war soldier army officer captain air military general
  FRENCH: [t21] french film jean france paris fran les link
  HINDI: [t24] film hindi award link india khan indian music
  MUSIC: [t28] album song band music rock live soundtrack record
  JAPANESE: [t30] anime japanese manga series english japan retrieved character
  BRITISH: [t31] british play london john shakespeare film production sir
  FAMILY: [t32] love girl mother family father friend school sister
  SERIES: [t35] series television show episode season character episodes original
  SPIELBERG: [t36] spielberg steven park joe future marty gremlin jurassic
  MEDIEVAL: [t37] king island robin treasure princess lost adventure castle
  GERMAN: [t38] film german russian von germany language anna soviet
  GIBSON: [t41] max ben danny gibson johnny mad ice mel
  MUSICAL: [t42] musical phantom opera song music broadway stage judy
  BATTLE: [t43] power human world attack character battle earth game
  MURDER: [t46] death murder kill police killed wife later killer
  SPORTS: [t47] team game player rocky baseball play charlie ruth
  KING: [t48] king henry arthur queen knight anne prince elizabeth
  HORROR: [t49] horror film dracula scooby doo vampire blood ghost

  24. Some movie examples
  • 'Sholay'
    • Indian film, 45% of words belong to topic 24 (Hindi topic)
    • Top 5 most probable movie links in training set: 'Laawaris', 'Hote Hote Pyaar Ho Gaya', 'Trishul', 'Mr. Natwarlal', 'Rangeela'
  • 'Cowboy'
    • Western film, 25% of words belong to topic 7 (western topic)
    • Top 5 most probable movie links in training set: 'Tall in the Saddle', 'The Indian Fighter', 'Dakota', 'The Train Robbers', 'A Lady Takes a Chance'
  • 'Rocky II'
    • Boxing film, 40% of words belong to topic 47 (sports topic)
    • Top 5 most probable movie links in training set: 'Bull Durham', '2003 World Series', 'Bowfinger', 'Rocky V', 'Rocky IV'

  25. Directed vs. Undirected RTM on ENRON emails
  • Undirected: aggregate incoming & outgoing emails into 1 document
  • Directed: aggregate incoming emails into 1 “receiver” document and outgoing emails into 1 “sender” document
  • Directed RTM performs better than undirected RTM (random guessing gives link rank = 500)

  26. Discussion
  • RTM is similar to latent-space models such as the projection model [Hoff, Raftery, Handcock, 2002] and the multiplicative latent factor model [Hoff, 2006]:
    • Topic mixtures (the “topic space”) can be combined with the other dimensions (the “social space”) to create a combined latent position z.
  • Other extensions:
    • Include other attributes in the link probability (e.g. timestamp of an email, language of a movie)
    • Use a non-parametric prior over the dimensionality of the latent space (e.g. Dirichlet processes)
    • Place a hierarchy over {θd} to learn clusters of documents, similar to the latent position cluster model [Handcock, Raftery, Tantrum, 2007]

  27. Conclusion
  • Relational topic modeling provides a useful start for combining text and network data in a single statistical framework
  • RTM can improve over simpler approaches for link prediction
  • Opportunities for future work:
    • Faster algorithms for larger data sets
    • Better understanding of non-edge modeling
    • Extended models

  28. Thank you!
