
Multilinear Algebra for Analyzing Data with Multiple Linkages


Presentation Transcript


  1. Multilinear Algebra for Analyzing Data with Multiple Linkages. Danny Dunlavy, Tammy Kolda, Philip Kegelmeyer, Sandia National Laboratories. DOE ASCR Applied Mathematics PI Meeting, October 16, 2008. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under Contract DE-AC04-94AL85000. SAND2008-6760C

  2. Goals and Motivation • Goal • Analyze complex relationships between multiple types of data objects • Motivating Tasks • Information retrieval • Clustering • Disambiguation and resolution • Motivating Applications • Trend analysis and prediction in scientific discovery • Mathematics for Analysis of Petascale Data (MAPD) • Analysis of simulation data (ensembles of runs)

  3. Linear Algebra, Graph Analysis, Informatics [Figure: LSA on a bipartite term-document graph; terms car, service, military, repair linked to documents d1, d2, d3] • Network analysis and bibliometrics • PageRank [Brin & Page, 1998] • Transition matrix of a Markov chain on the Web graph: M • Ranks: the stationary vector x satisfying x = M^T x • HITS [Kleinberg, 1998] • Adjacency matrix of the Web graph: A • Hubs: h = A a • Authorities: a = A^T h • Latent Semantic Analysis (LSA) [Dumais, et al., 1988] • Vector space model of documents (term-document matrix): A • Truncated SVD: A ≈ U_k Σ_k V_k^T • Maps terms and documents to the "same" k-dimensional space

  4. Multi-Link Graphs [Figure: authors, terms, and documents connected by multiple link types] • Nodes (one type) connected by multiple types of links • Node x Node x Connection • Two types of nodes connected by multiple types of links • Node A x Node B x Connection • Multiple types of nodes connected by multiple types of links • Node A x Node B x Node C x Connection • Directed and undirected links

  5. Tensors [Figure: an I x J matrix A with entries a_ij and an I x J x K tensor X with entries x_ijk] • Other names for tensors • Multidimensional array • N-way array • Tensor order • Number of dimensions • Other names for dimension • Mode • Way • Example • The matrix A (in the figure) has order 2. • The tensor X (in the figure) has order 3 and its 3rd mode is of size K. • Notation: scalars are lowercase letters (a), vectors are bold lowercase (a), matrices are bold capitals (A), tensors are calligraphic capitals (X)

  6. Tensors [Figure: an I x J x K tensor, showing column (mode-1) fibers, row (mode-2) fibers, and tube (mode-3) fibers, plus horizontal, lateral, and frontal slices] • 3rd-order tensor • mode 1 has dimension I • mode 2 has dimension J • mode 3 has dimension K
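To make the fiber and slice terminology concrete, here is a small NumPy sketch (not from the presentation) that extracts each kind of fiber and slice from a 3rd-order array:

```python
import numpy as np

# A small I x J x K tensor (I=2, J=3, K=4) with consecutive integer entries.
X = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Fibers: fix every index but one.
col_fiber  = X[:, 1, 2]   # column (mode-1) fiber x_{:jk}
row_fiber  = X[0, :, 2]   # row (mode-2) fiber x_{i:k}
tube_fiber = X[0, 1, :]   # tube (mode-3) fiber x_{ij:}

# Slices: fix exactly one index.
horizontal_slice = X[0, :, :]   # horizontal slice (fixed mode-1 index)
lateral_slice    = X[:, 1, :]   # lateral slice (fixed mode-2 index)
frontal_slice    = X[:, :, 2]   # frontal slice (fixed mode-3 index)
```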

  7. Tensor Unfolding: Converting a Tensor to a Matrix • Unfolding (matricize): maps each tensor element (i,j,k) to a matrix element (i′,j′) • Folding (reverse matricize): maps matrix elements (i′,j′) back to tensor elements (i,j,k) • In the mode-n unfolding, the mode-n fibers are rearranged to be the columns of a matrix [Figure: a small numeric example with entries 1-8]
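A minimal NumPy sketch of unfolding and folding (not from the slides; it uses NumPy's row-major ordering of the remaining modes, which differs from the Fortran-order column ordering used in some papers, but fold is the exact inverse of unfold):

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization: the mode-n fibers become the columns of a matrix."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of unfold: reshape the matrix back into a tensor of the given shape."""
    full_shape = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape(full_shape), 0, mode)

X = np.arange(24).reshape(2, 3, 4)
assert np.array_equal(fold(unfold(X, 1), 1, X.shape), X)
```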

  8. Tensor n-Mode Multiplication • Tensor times vector (mode 1): compute the dot product of a and each column (mode-1) fiber • Tensor times matrix (mode 2): multiply each row (mode-2) fiber by B
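A small NumPy illustration of both operations, assuming an I x J x K tensor X, a length-I vector a applied in mode 1, and a matrix B applied in mode 2 (names and sizes are illustrative, not from the slides):

```python
import numpy as np

X = np.random.rand(2, 3, 4)   # an I x J x K tensor
a = np.random.rand(2)         # vector for the mode-1 product
B = np.random.rand(5, 3)      # matrix for the mode-2 product

# Tensor times vector (mode 1): dot product of a with each column (mode-1) fiber.
# The result has shape (3, 4), one mode fewer than X.
Y1 = np.tensordot(a, X, axes=(0, 0))

# Tensor times matrix (mode 2): multiply each row (mode-2) fiber by B.
# The result has shape (2, 5, 4); in unfolded form, Y_(2) = B @ X_(2).
Y2 = np.einsum('sj,ijk->isk', B, X)
```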

  9. Outer, Kronecker, & Khatri-Rao Products • Matrix Kronecker product: an M x N matrix and a P x Q matrix give an MP x NQ matrix • Matrix Khatri-Rao product (columnwise Kronecker): an M x R matrix and an N x R matrix give an MN x R matrix • 3-way outer product of vectors: a rank-1 tensor • Observe: for two vectors a and b, a ∘ b and a ⊗ b have the same elements, but one is shaped into a matrix and the other into a vector.
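The three products in NumPy, as a sketch with arbitrary sizes; the final assertion checks the slide's observation that a ∘ b and a ⊗ b contain the same elements:

```python
import numpy as np

a = np.random.rand(4)
b = np.random.rand(3)
c = np.random.rand(2)

# Three-way outer product: a rank-1 tensor of shape (4, 3, 2).
rank1 = np.einsum('i,j,k->ijk', a, b, c)

# Matrix Kronecker product: (4 x 5) kron (3 x 2) -> (12 x 10).
A = np.random.rand(4, 5)
B = np.random.rand(3, 2)
K = np.kron(A, B)

# Matrix Khatri-Rao product (columnwise Kronecker): (4 x 6), (3 x 6) -> (12 x 6).
U = np.random.rand(4, 6)
V = np.random.rand(3, 6)
KR = np.einsum('ir,jr->ijr', U, V).reshape(-1, 6)

# a o b (outer product matrix) and a (x) b (Kronecker vector) share their entries.
assert np.allclose(np.outer(a, b).ravel(), np.kron(a, b))
```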

  10. Specially Structured Tensors • Tucker tensor: an I x J x K tensor formed from a core tensor of size R x S x T multiplied in each mode by factor matrices U (I x R), V (J x S), W (K x T) • Kruskal tensor: an I x J x K tensor formed as a sum of R rank-1 terms, u_1 ∘ v_1 ∘ w_1 + … + u_R ∘ v_R ∘ w_R, with factor matrices U (I x R), V (J x R), W (K x R); equivalently a Tucker tensor with a superdiagonal R x R x R core

  11. Specially Structured Tensors • Tucker tensor, elementwise: x_ijk = Σ_r Σ_s Σ_t g_rst u_ir v_js w_kt • Tucker tensor in matrix form: X_(1) = U G_(1) (W ⊗ V)^T • Kruskal tensor, elementwise: x_ijk = Σ_r u_ir v_jr w_kr • Kruskal tensor in matrix form: X_(1) = U (W ⊙ V)^T, where ⊗ is the Kronecker product and ⊙ is the Khatri-Rao product
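A hedged NumPy sketch constructing both structured forms from random factors and checking a mode-1 matricized identity; note that with NumPy's row-major reshape the Kronecker factors appear in the opposite order from the W ⊗ V convention written above:

```python
import numpy as np

I, J, K = 4, 3, 2
R, S, T = 5, 4, 3

# Kruskal tensor: sum of R rank-1 terms u_r o v_r o w_r.
U = np.random.rand(I, R); V = np.random.rand(J, R); W = np.random.rand(K, R)
X_kruskal = np.einsum('ir,jr,kr->ijk', U, V, W)

# Tucker tensor: core G (R x S x T) multiplied by A, B, C in each mode.
G = np.random.rand(R, S, T)
A = np.random.rand(I, R); B = np.random.rand(J, S); C = np.random.rand(K, T)
X_tucker = np.einsum('rst,ir,js,kt->ijk', G, A, B, C)

# Mode-1 matrix form with row-major (C-order) reshapes: X_(1) = A G_(1) (B kron C)^T.
X1 = A @ G.reshape(R, S * T) @ np.kron(B, C).T
assert np.allclose(X1, X_tucker.reshape(I, J * K))
```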

  12. Tucker Decomposition • Three-mode factor analysis [Tucker, 1966] • Three-mode PCA [Kroonenberg et al., 1980] • N-mode PCA [Kapteyn et al., 1986] • Higher-order SVD (HO-SVD) [De Lathauwer et al., 2000] • X (I x J x K) ≈ G ×_1 A ×_2 B ×_3 C with core G (R x S x T) and factors A (I x R), B (J x S), C (K x T) • Given A, B, C with orthonormal columns, the optimal core is G = X ×_1 A^T ×_2 B^T ×_3 C^T • A, B, and C may be orthonormal (generally, full column rank) • G is not diagonal • Decomposition is not unique
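A minimal HO-SVD sketch in NumPy (assuming the mode-n factor is taken as the leading left singular vectors of the mode-n unfolding, and the core is the tensor multiplied by the factor transposes in every mode):

```python
import numpy as np

def unfold(X, mode):
    # Mode-n matricization (row-major ordering of the remaining modes).
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def hosvd(X, ranks):
    """Truncated HO-SVD sketch for a 3rd-order tensor."""
    factors = []
    for n, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, n), full_matrices=False)
        factors.append(U[:, :r])
    A, B, C = factors
    # Core G = X x_1 A^T x_2 B^T x_3 C^T, written with einsum.
    G = np.einsum('ijk,ir,js,kt->rst', X, A, B, C)
    return G, factors

X = np.random.rand(6, 5, 4)
G, (A, B, C) = hosvd(X, (3, 3, 2))
X_approx = np.einsum('rst,ir,js,kt->ijk', G, A, B, C)
```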

  13. CANDECOMP/PARAFAC (CP) Decomposition • X (I x J x K) ≈ Σ_r λ_r a_r ∘ b_r ∘ c_r with factors A (I x R), B (J x R), C (K x R) and an R x R x R diagonal core • CANDECOMP = Canonical Decomposition [Carroll & Chang, 1970] • PARAFAC = Parallel Factors [Harshman, 1970] • Core is diagonal (specified by the vector λ) • Columns of A, B, and C are not orthonormal

  14. CP Alternating Least Squares (CP-ALS) • Find all the vectors in one direction simultaneously • Fix A, B, and λ, and solve for C: X_(3) ≈ C (B ⊙ A)^T • Optimal C is the least squares solution C = X_(3) [(B ⊙ A)^T]† (⊙ = Khatri-Rao product, † = pseudoinverse) [Smilde et al., 2004] • Successively solve for each component (C, B, A)
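A bare-bones CP-ALS sketch in NumPy for a 3rd-order tensor, following the update written above. This omits the normalization/λ bookkeeping and convergence checks of a real implementation such as the Tensor Toolbox's cp_als, and the Khatri-Rao factor ordering is arranged to match the row-major unfolding used here:

```python
import numpy as np

def khatri_rao(U, V):
    # Columnwise Kronecker product: (I x R), (J x R) -> (IJ x R).
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, U.shape[1])

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cp_als(X, R, n_iters=50):
    """Cyclically fix two factor matrices and solve least squares for the third."""
    I, J, K = X.shape
    A = np.random.rand(I, R)
    B = np.random.rand(J, R)
    C = np.random.rand(K, R)
    for _ in range(n_iters):
        # With row-major unfolding, X_(1) ~ A (B kr C)^T, and similarly below.
        A = unfold(X, 0) @ np.linalg.pinv(khatri_rao(B, C).T)
        B = unfold(X, 1) @ np.linalg.pinv(khatri_rao(A, C).T)
        C = unfold(X, 2) @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C

A, B, C = cp_als(np.random.rand(10, 8, 6), R=3)
```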

  15. Analyzing SIAM Publication Data • 1999-2004 SIAM journal data: 5022 documents • Tensor Toolbox (MATLAB) [T. Kolda, B. Bader, 2007]

  16. SIAM Data: Why Tensors?

  17. SIAM Data: Tensor Construction (the tensor is not symmetric) • For terms in the abstracts, the entry x_ijk is built from f_ij, the frequency of term i in document j, and n_i, the number of documents that term i appears in • Analogous entries for terms in the titles • Analogous entries for terms in the author-supplied keywords
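Because the weighting formulas on this slide did not survive extraction, the following is only a hypothetical sketch of building a term x document x link-type tensor, with a tf-idf-style weight for abstract terms and binary entries for titles and keywords; the weights actually used in the paper differ in detail:

```python
import numpy as np
from collections import Counter

# Toy documents with three text fields per document (illustrative data only).
docs = [
    {"abstract": "sparse matrix factorization", "title": "sparse factorization", "keywords": "matrix"},
    {"abstract": "graph partitioning methods",  "title": "graph methods",        "keywords": "partitioning"},
]
link_types = ["abstract", "title", "keywords"]

vocab = sorted({w for d in docs for field in link_types for w in d[field].split()})
term_index = {w: i for i, w in enumerate(vocab)}

N = len(docs)
# Document frequency of each term over abstracts (for the idf-style factor).
df = Counter(w for d in docs for w in set(d["abstract"].split()))

X = np.zeros((len(vocab), N, len(link_types)))
for j, d in enumerate(docs):
    for k, field in enumerate(link_types):
        for w, f in Counter(d[field].split()).items():
            i = term_index[w]
            if field == "abstract":
                X[i, j, k] = f * np.log(N / df[w])   # tf-idf-style weight
            else:
                X[i, j, k] = 1.0                     # binary indicator for titles/keywords
```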

  18. SIAM Data: CP Applications • Community Identification • Communities of papers connected by different link types • Latent Document Similarity • Using tensor decompositions for LSA-like analysis • Analysis of a Body of Work via Centroids • Body of work defined by query terms • Body of work defined by authorship • Author Disambiguation & Resolution • List of most prolific authors in the collection changes • Multiple author disambiguation • Journal Prediction • Co-reference information to define journal characteristics

  19. SIAM Data: Document Similarity • Query document: Link Analysis: Hubs and Authorities on the World Wide Web, C.H.Q. Ding, H. Zha, X. He, P. Husbands, and H.D. Simon, SIREV, 2004. • CP decomposition, R = 10 • Top-scoring similar papers: Interior Point Methods (not related), Sparse Approximate Inverses (arguably distantly related), Graph Partitioning (related)

  20. SIAM Data: Document Similarity • Query document: Link Analysis: Hubs and Authorities on the World Wide Web, C.H.Q. Ding, H. Zha, X. He, P. Husbands, and H.D. Simon, SIREV, 2004. • CP decomposition, R = 30 • Top-scoring similar papers: Graphs (related)
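One plausible way to compute such similarity scores from a CP model, sketched in NumPy: assume B is the document-mode factor matrix (documents x R) and lam the component weights, and compare documents by the cosine similarity of their weighted factor rows. The scoring used in the talk may differ in detail:

```python
import numpy as np

def document_scores(B, lam, query_doc):
    """Cosine similarity of every document to a query document in CP factor space."""
    D = B * lam                                       # weight each component
    D = D / np.linalg.norm(D, axis=1, keepdims=True)  # normalize rows
    return D @ D[query_doc]

B = np.random.rand(5022, 30)   # placeholder document-mode factor, e.g. R = 30
lam = np.random.rand(30)       # placeholder component weights
scores = document_scores(B, lam, query_doc=0)
top10 = np.argsort(scores)[::-1][:10]   # indices of the most similar documents
```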

  21. Papers Most Similar to Authors' Centroids [Table: papers most similar to the centroid of Sven Leyffer's 3 articles in the data]

  22. SIAM Data: Author Disambiguation • CP decomposition, R = 20 • Computed centroids of the top 20 authors' papers • Disambiguation scores: scored all authors with the same last name and first initial • Examples: T. Chan (2), T.M. Chan (4); T. Manteufel (3); S. McCormick (3); G. Golub (1)

  23. SIAM Data: Author Disambiguation Disambiguation Scores for Various CP Tensor Decompositions (+ = correct; o = incorrect) R = 15

  24. SIAM Data: Author Disambiguation Disambiguation Scores for Various CP Tensor Decompositions (+ = correct; o = incorrect) R = 20

  25. SIAM Data: Author Disambiguation Disambiguation Scores for Various CP Tensor Decompositions (+ = correct; o = incorrect) R = 25

  26. SIAM Data: Journal Prediction • CP decomposition, R = 30, vectors from B • Bagged ensemble of decision trees (n = 100) • 10-fold cross validation • 43% overall accuracy
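A hedged scikit-learn sketch of this classification setup, with placeholder data standing in for the CP document factors and the journal labels:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Placeholder inputs: rows of the CP document-mode factor B (R = 30 as on the
# slide) as features, and a journal label per document.
B = np.random.rand(5022, 30)
journal_labels = np.random.randint(0, 11, size=5022)

# Bagged ensemble of 100 decision trees, evaluated with 10-fold cross validation.
clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)
accuracy = cross_val_score(clf, B, journal_labels, cv=10).mean()
print(f"10-fold CV accuracy: {accuracy:.2f}")
```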

  27. SIAM Data: Journal Prediction • Graph of confusion matrix • Layout via Barnes-Hut • Node color = cluster • Node size = % correct (0-66)

  28. Ongoing Work at Sandia • Temporal Analysis [Bader, Kolda, Acar, Dunlavy] • Language analysis [Chew, Bader, Dunlavy] • Choice of rank [Acar, Dunlavy, Kolda] • Parallel computation [Sears, Bader, Kolda] • Weighting [Dunlavy, Kolda, Acar, Dhillon (UT)] • Validation [Dunlavy, Kolda, Acar, Dhillon (UT)] • Data fitting via optimization [Kolda, Acar, Dunlavy] • Large-scale analysis [Hendrickson, Kegelmeyer]

  29. Thank You Tech report: Multilinear Algebra for Analyzing Data with Multiple Linkages (via my web page) Book chapter: Tensor Decompositions for Analyzing Similarity Graphs with Multiple Linkages (to appear) Danny Dunlavy Email: dmdunla@sandia.gov Web Page: http://www.cs.sandia.gov/~dmdunla

  30. Extra Slides

  31. High-Order Analogue of the Matrix SVD • Matrix SVD: A = U Σ V^T = Σ_r σ_r u_r ∘ v_r, a sum of rank-1 matrices • Tucker tensor (finding bases for each subspace): X = G ×_1 A ×_2 B ×_3 C • Kruskal tensor (sum of rank-1 components): X = Σ_r λ_r a_r ∘ b_r ∘ c_r
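A quick NumPy check of the matrix case: the SVD writes A as a sum of rank-1 outer products, which is exactly the 2-way version of a Kruskal tensor:

```python
import numpy as np

A = np.random.rand(6, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A as a sum of rank-1 terms sigma_r * u_r v_r^T.
A_rank1_sum = sum(s[r] * np.outer(U[:, r], Vt[r, :]) for r in range(len(s)))
assert np.allclose(A, A_rank1_sum)
```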

  32. Other Tensor Applications at Sandia • Tensor Toolbox (MATLAB) [Kolda/Bader, 2007] http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox/ • TOPHITS (Topical HITS) [Kolda, 2006] • HITS plus terms in dimension 3 • Decomposition: CP • Cross Language Information Retrieval [Chew et al., 2007] • Different languages in dimension 3 • Decomposition: PARAFAC2 • Temporal Analysis of E-mail Traffic [Bader et al., 2007] • Directed e-mail graph with time in dimension 3 • Decomposition: DEDICOM

  33. Tensors and HPC • Sparse tensors • Data partitioning and load balancing issues • Partially sorted coordinate format • Optimize unfolding for single mode • Scalable decomposition algorithms • CP, Tucker, PARAFAC2, DEDICOM, INDSCAL, … • Different operations, data partitioning issues • Applications • Network analysis, web analysis, multi-way clustering, data fusion, cross-language IR, feature vector generation • Different operations, data partitioning issues • Leverage Sandia’s Trilinos packages • Data structures, load balancing, SVD, Eigen.

  34. CP Alternating Least Squares (CP-ALS)
