
ParCube: Sparse Parallelizable Tensor Decompositions

Evangelos E. Papalexakis¹, Christos Faloutsos¹, Nikos Sidiropoulos²

¹ Carnegie Mellon University, School of Computer Science

² University of Minnesota, ECE Department

European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Bristol, UK, September 24–28, 2012.



Outline

  • Introduction

    Problem Statement

    Method

    Experiments

    Conclusions


Introduction

  • Facebook has ~800 million users

    • Evolves over time

    • How do we spot interesting patterns & anomalies in this very large network?


Introduction

  • Suppose we have Knowledge Base data

    • E.g. Read the Web Project at CMU

      • Subject – verb – object triplets, mined from the web

    • Many gigabytes or terabytes of data!

    • How do we find potential new synonyms to a word using this knowledge base?


Introduction to Tensors

  • Tensors are multidimensional generalizations of matrices

    • The previous problems can be formulated as tensor problems!

    • Time-evolving graphs/social networks, multi-aspect data (e.g. subject, object, verb)

  • Focus on 3-way tensors

    • Can be viewed as data cubes

    • Indexed by 3 variables (I×J×K)

[Figure: a data cube indexed by subject × object × verb]
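To make this concrete, here is a minimal sketch (ours, not from the talk) of building such a 3-way tensor from (subject, object, verb) triplets; the sizes and triplets are hypothetical:

```python
import numpy as np

# Build a 3-way tensor X (I x J x K) from hypothetical
# (subject, object, verb) triplets: X[i, j, k] counts how often
# subject i and object j co-occur with verb k.
I, J, K = 4, 4, 3
X = np.zeros((I, J, K))
triplets = [(0, 1, 0), (0, 1, 0), (2, 3, 1), (1, 2, 2)]
for i, j, k in triplets:
    X[i, j, k] += 1
```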


Introduction to Tensors

  • PARAFAC decomposition

    • Decompose a tensor into a sum of outer products (rank-1 tensors)

    • Each rank-1 tensor is a different group/“concept”

    • “Similar” to the Singular Value Decomposition in the matrix case

Store the factor vectors a_i, b_i, c_i as columns of matrices A, B, C.

[Figure: PARAFAC decomposes the subject × object × verb tensor into a sum of rank-1 tensors, each a “concept” such as “products” or “leaders/CEOs”]
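As a sketch of the model (ours, not the authors' code): with the factor vectors stored as columns of A, B, C, the tensor is approximated by a sum of rank-1 outer products:

```python
import numpy as np

def parafac_reconstruct(A, B, C):
    """Rebuild the tensor approximation: X ≈ sum_r a_r ∘ b_r ∘ c_r,
    where a_r, b_r, c_r are the r-th columns of A (IxR), B (JxR), C (KxR)."""
    I, J, K = A.shape[0], B.shape[0], C.shape[0]
    X = np.zeros((I, J, K))
    for r in range(A.shape[1]):
        # Each outer product is one rank-1 tensor, i.e. one "concept".
        X += np.einsum('i,j,k->ijk', A[:, r], B[:, r], C[:, r])
    return X
```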


Outline

Introduction

  • Problem Statement

    Method

    Experiments

    Conclusions


Why not PARAFAC?

  • Today’s datasets are on the order of terabytes

    • e.g. Facebook has ~800 million users!

  • Explosive complexity/run time for truly large datasets!

  • Also, data is very sparse

    • We need the decomposition factors to be sparse

      • Better interpretability / less noise

      • Can do multi-way soft co-clustering this way!

    • Plain PARAFAC yields dense factors!


Problem Statement

  • Wish-list:

    • Significantly drop the dimensionality

      • Ideally 1 or more orders of magnitude

    • Parallelize the computation

      • Ideally split the problem into independent parts and run in parallel

    • Yield sparse factors

    • Don’t lose much accuracy in the process


Previous work

  • A.H. Phan et al., Block decomposition for very large-scale nonnegative tensor factorization

    • Partition & merge parallel algorithm for non-negative PARAFAC

    • No sparsity

  • Q. Zhang et al., A parallel nonnegative tensor factorization algorithm for mining global climate data

  • D. Nion et al., Adaptive algorithms to track the PARAFAC decomposition of a third-order tensor & J. Sun et al., Beyond streams and graphs: dynamic tensor analysis

    • Tensor is a stream; both methods seek to track the decomposition

  • C.E. Tsourakakis, MACH: Fast randomized tensor decompositions & J. Sun et al., MultiVis: Content-based social network exploration through multi-way visual analysis

    • Sampling-based TUCKER models

  • E.E. Papalexakis et al., Co-clustering as multilinear decomposition with sparse latent factors

    • Sparse PARAFAC algorithm applied to co-clustering

None combines all requirements!


Our proposal

  • We introduce ParCube and set the following goals:

  • Goal 1: Fast

    • Scalable & parallelizable

  • Goal 2: Sparse

    • Ability to yield sparse latent factors and a sparse tensor approximation

  • Goal 3: Accurate

    • Provable correctness in merging partial results, under appropriate conditions


Outline

Introduction

Problem Statement

  • Method

    Experiments

    Conclusions


ParCube: The big picture

Step 1: Break up the tensor into small pieces using sampling.

Step 2: Fit a dense PARAFAC decomposition on each small sampled tensor.

Step 3: Match columns and distribute non-zero values to the appropriate indices in the original (non-sampled) space.

  • Sampling selects a small portion of the indices

    • The PARAFAC vectors a_i, b_i, c_i will be sparse by construction


The ParCube method

  • Key ideas:

    • Use biased sampling to sample rows, cols & fibers

    • Sampling weight of an index: the marginal sum of the tensor’s absolute values along that index (see the sketch after this list)

    • During sampling, always keep a common portion of indices across samples

    • For each smaller tensor, do the PARAFAC decomposition.

    • Need to specify 2 parameters:

      • Sampling rate: s

        • Initial dimensions I, J, K → I/s, J/s, K/s

      • Number of repetitions / different sampled tensors: r
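A minimal sketch of the biased-sampling step, under the assumption that an index's sampling weight is its marginal sum of absolute tensor values; the function and variable names are ours, not the authors':

```python
import numpy as np

def biased_sample(X, s, rng):
    """Return an (I/s x J/s x K/s) subtensor of X plus the sampled indices,
    drawing each mode's indices with probability proportional to the
    marginal sum of absolute values along that mode."""
    kept = []
    for mode in range(3):
        other_axes = tuple(ax for ax in range(3) if ax != mode)
        weights = np.abs(X).sum(axis=other_axes)      # marginal sums
        n = max(1, X.shape[mode] // s)                # shrink dimension by s
        idx = rng.choice(X.shape[mode], size=n, replace=False,
                         p=weights / weights.sum())
        kept.append(np.sort(idx))
    i, j, k = kept
    return X[np.ix_(i, j, k)], (i, j, k)

# Repeat r times (optionally in parallel) to obtain r sampled tensors.
```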


Putting the pieces together

Details

  • Say we have matrices A_s from each sample

    • Possibly with re-ordering of the factors

    • Each matrix corresponds to a different sampled index set of the original index space

    • All factors share the “upper” (common) part, by construction


Proposition: Under mild conditions, the algorithm will stitch components correctly & output what exact PARAFAC would

Proof in the paper (a sketch of the stitching step follows below)
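A sketch of the stitching idea for one factor matrix (our interpretation; the paper's algorithm handles scaling and collisions more carefully). The names `factors`, `index_sets`, and `common` are hypothetical:

```python
import numpy as np

def _common_rows(A_s, idx, common):
    # Rows of the sampled factor that live at the shared (common) indices.
    pos = [int(np.where(idx == c)[0][0]) for c in common]
    return A_s[pos, :]

def merge_factors(factors, index_sets, common, I):
    """Stitch sampled factor matrices back into one I x R matrix.
    factors[s] has rows at positions index_sets[s] of the original space;
    `common` lists the indices kept in every sample by construction."""
    R = factors[0].shape[1]
    A = np.zeros((I, R))
    ref = _common_rows(factors[0], index_sets[0], common)
    for A_s, idx in zip(factors, index_sets):
        cur = _common_rows(A_s, idx, common)
        for r in range(R):
            # Match column r to the reference column it agrees with most
            # on the common rows, then scatter it back to original indices.
            match = int(np.argmax(np.abs(ref.T @ cur[:, r])))
            A[idx, match] = A_s[:, r]
    return A
```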


Outline

Introduction

Problem Statement

Method

  • Experiments

    Conclusions


Experiments

  • We use the PARAFAC implementation from the Tensor Toolbox for MATLAB as the baseline and as the core of our implementation

  • Evaluation of performance

    • Algorithm correctness

    • Execution speedup

    • Factor sparsity


Experiments – Correctness for multiple repetitions

  • Relative cost = ParCube approximation cost / PARAFAC approximation cost

  • The more samples we get, the closer we are to exact PARAFAC

  • Experimental validation of our theoretical result.
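For concreteness, the metric can be computed as below, assuming "approximation cost" means the Frobenius norm of the residual tensor (a sketch; the paper gives the exact definition):

```python
import numpy as np

def approximation_cost(X, X_hat):
    # Frobenius norm of the residual tensor.
    return np.linalg.norm(X - X_hat)

def relative_cost(X, X_parcube, X_parafac):
    # Values near 1.0 mean ParCube matches exact PARAFAC's fit.
    return approximation_cost(X, X_parcube) / approximation_cost(X, X_parafac)
```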


Experiments - Correctness & Speedup for 1 repetition

  • Relative cost = ParCube approximation cost / PARAFAC approximation cost

  • Speedup = PARAFAC execution time / ParCube execution time

  • Extrapolation to parallel execution for 4 repetitions yields 14.2x speedup (and improves accuracy)


Experiments – Correctness & Sparsity

  • Output size = NNZ(A) + NNZ(B) + NNZ(C)

  • 90% sparser than PARAFAC while maintaining the same approximation error

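The output-size metric from the slide, written out as a trivial sketch using the slide's own definition:

```python
import numpy as np

def output_size(A, B, C):
    # NNZ(A) + NNZ(B) + NNZ(C): total non-zeros across the factor matrices.
    return sum(int(np.count_nonzero(M)) for M in (A, B, C))
```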


Experiments

  • Knowledge Discovery

    • Enron email/social network: 186 × 186 × 44

    • Network traffic data (LBNL): 65170 × 65170 × 65327

    • Facebook Wall posts: 63891 × 63890 × 1847

    • Knowledge Base data (Never-Ending Language Learner – NELL): 14545 × 14545 × 28818


Discovery – Enron

  • Who-emailed-whom data from the ENRON email dataset.

    • Spans 44 months

    • 184 × 184 × 44 tensor

    • We picked s = 2, r = 4

  • We were able to identify social cliques and spot spikes that correspond to actual important events in the company’s timeline


Discovery – LBNL Network Data


  • Network traffic data of form (src IP, dst IP, port #)

    • 65170 × 65170 × 65327 tensor

    • We picked s = 5, r = 10

  • We were able to identify a possible Port Scanning Attack

[Figure: the suspicious component concentrates on 1 src IP, 1 dst IP, and many destination ports]


Discovery – Facebook Wall posts


  • Small portion of Facebook’s users

    • 63890 users for 1847 days

    • Picked s = 100, r = 10

  • Data in the form (Wall owner, poster, timestamp)

  • Downloaded from http://socialnetworks.mpi-sws.org/data-wosn2009.html

  • We were able to identify a birthday-like event.

[Figure: the discovered component concentrates on 1 wall and 1 day]


Discovery – NELL

  • Knowledge base data

  • Taken from the Read The Web project at CMU

    • http://rtw.ml.cmu.edu/rtw/

    • Special thanks to Tom Mitchell for the data

  • Noun phrase × context × noun phrase triplets

    • e.g. ‘Obama’ – ‘is’ – ‘the president of the United States’

  • Discover words that may be used in the same context

  • We picked s = 500, r = 10.


Outline

Introduction

Problem Statement

Method

Experiments

  • Conclusions


Conclusions

  • Goal 1: Fast

    • Scalable & parallelizable

  • Goal 2: Sparse

    • Ability to yield sparse latent factors and a sparse tensor approximation

  • Goal 3: Accurate

    • Provable correctness in merging partial results, under appropriate conditions

    • Experiments also demonstrate this

  • Enables processing of tensors that don’t fit in memory

  • Interesting findings in diverse Knowledge Discovery settings


The End

Thank you! Any questions?

Evangelos E. Papalexakis
Email: [email protected]
Web: http://www.cs.cmu.edu/~epapalex

Christos Faloutsos
Email: [email protected]
Web: http://www.cs.cmu.edu/~christos

Nicholas Sidiropoulos
Email: [email protected]
Web: http://www.ece.umn.edu/users/nikos/
