
The Gamma Operator for Big Data Summarization on an Array DBMS

This presentation explores the use of the Gamma operator to summarize large matrices beyond RAM size on a parallel shared-nothing array DBMS such as SciDB. SciDB stores and processes multidimensional arrays efficiently, and the operator's results can be combined with R and LAPACK for further processing. The presentation discusses the properties of the Gamma operator and its application in big data analytics.

Presentation Transcript


  1. The Gamma Operator for Big Data Summarization on an Array DBMS Carlos Ordonez

  2. Acknowledgments: Michael Stonebraker, MIT. My PhD students: Yiqun Zhang, Wellington Cabrera. SciDB team: Paul Brown, Bryan Lewis, Alex Polyakov

  3. Why SciDB? Large matrices beyond RAM size. Storage by row or by column is not good enough. Matrices are natural in statistics, engineering, and science. Multidimensional arrays map to matrices, but they are not the same thing. Parallel shared-nothing architectures are best for big data analytics. Closer to DBMS technology, but with some similarity to Hadoop. Feasible to create array operators taking matrices as input and returning a matrix as output. Combines processing with R and LAPACK.

  4. Old: separate sufficient statistics
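The classical approach computes each sufficient statistic as a separate aggregation. A sketch of the three statistics for a d x n data set X, using standard definitions reconstructed from context rather than taken verbatim from the slide:

    n = |X|, \quad L = \sum_{i=1}^{n} x_i, \quad Q = X X^T = \sum_{i=1}^{n} x_i x_i^T

Here n is a count, L is a d x 1 vector, and Q is a d x d matrix; computing them separately requires multiple aggregation operators or multiple passes over X.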

  5. New: generalizing and unifying sufficient statistics: Z=[1,X,Y]
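Expanding this definition: z_i = [1, x_i, y_i]^T stacks a constant 1, the point x_i, and its output y_i, so a single matrix product captures all the statistics above. A sketch of the resulting Gamma matrix, reconstructed from the definition Z = [1, X, Y] (the block layout is implied by the definition, not spelled out on the slide):

    \Gamma = Z Z^T = \sum_{i=1}^{n} z_i z_i^T =
    \begin{bmatrix}
      n        & L^T            & \sum y_i     \\
      L        & Q              & \sum x_i y_i \\
      \sum y_i & \sum y_i x_i^T & \sum y_i^2
    \end{bmatrix}

Thus n, L, and Q all appear as sub-blocks of one (d+2) x (d+2) matrix \Gamma, together with the cross-product terms involving Y.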

  6. Equivalent equations with projections from Γ
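One worked example of such projections, using the block layout above (these are standard model formulas, not copied from the slide). The mean and covariance of X come from the n, L, and Q blocks,

    \mu = L / n, \qquad V = Q / n - \mu \mu^T,

and the least-squares coefficients of a linear regression with intercept come from the blocks involving Y:

    \hat{\beta} = \begin{bmatrix} n & L^T \\ L & Q \end{bmatrix}^{-1} \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}.

Every quantity on the right-hand side is read directly off \Gamma, so fitting the model needs no further pass over X.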

  7. Properties of Γ

  8. Further properties, in detail: non-commutative and distributive
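The distributive property is what enables parallel computation: if the points of X are split into disjoint partitions X_1 and X_2, then

    \Gamma(X) = \Gamma(X_1) + \Gamma(X_2),

so partial Gamma matrices can be computed independently and added. Non-commutativity is the usual matrix fact that Z Z^T \neq Z^T Z: the order of the product matters. (The equation above restates the property named on the slide; it is not a formula taken from it.)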

  9. Storage in array chunks

  10. In SciDB we store the points of X as a 2D array, which the workers scan in parallel.

  11. Array storage and processing in SciDB. Assuming d << n, it is natural to hash-partition X by i = 1..n. Gamma computation is fully parallel, with each instance maintaining a local Gamma version in RAM. X can be read with a fully parallel scan. There is no need to write Gamma from RAM to disk during the scan, unless fault tolerance is required. A sketch of the per-worker accumulation follows.
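A minimal sketch of the local accumulation, assuming each point arrives as a dense vector z_i = [1, x_i, y_i] of length p = d+2; the function name and flat row-major layout are illustrative, not SciDB's actual operator API:

    // Accumulate Gamma += z_i * z_i^T for every point in this worker's chunk.
    // gamma is a flat row-major p x p matrix kept resident in RAM.
    #include <cstddef>
    #include <vector>

    void accumulate_gamma(const std::vector<std::vector<double>>& chunk, // local z_i vectors
                          std::vector<double>& gamma, std::size_t p) {   // p = d + 2
      for (const std::vector<double>& z : chunk)
        for (std::size_t a = 0; a < p; ++a)
          for (std::size_t b = 0; b < p; ++b)
            gamma[a * p + b] += z[a] * z[b];  // one vector outer product, O(p^2) per point
    }

Since Gamma is symmetric, a real implementation could accumulate only the upper triangle and mirror it at the end, roughly halving the work.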

  12. Each point must fit in one chunk; otherwise, a join is needed (slow). [Slide diagram contrasts a chunking that splits a point across workers (NO!) with one that keeps each point whole (OK).]

  13. Parallel computation: each worker computes its local Gamma and sends it to the coordinator, which adds them up; a merge sketch follows.
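A minimal sketch of the coordinator-side merge, relying on the distributive property above; again, the names and flat matrix layout are assumptions for illustration:

    // Global Gamma is the element-wise sum of the per-worker Gamma matrices.
    #include <cstddef>
    #include <vector>

    void merge_gammas(const std::vector<std::vector<double>>& worker_gammas,
                      std::vector<double>& global_gamma) {  // flat (d+2)^2 matrices
      for (const std::vector<double>& g : worker_gammas)
        for (std::size_t k = 0; k < global_gamma.size(); ++k)
          global_gamma[k] += g[k];
    }

Each worker ships only O(d^2) numbers to the coordinator, independent of n.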

  14. Dense matrix operator: O(d^2 n)

  15. Sparse matrix operator: O(d n) for hyper-sparse matrix
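A minimal sketch of the sparse variant, assuming each point arrives as (position, value) pairs for its nonzero entries only; the SparsePoint layout is an assumption for illustration:

    // Only nonzero pairs contribute to the outer product, so a point with k
    // nonzero entries costs O(k^2) instead of O(d^2).
    #include <cstddef>
    #include <utility>
    #include <vector>

    using SparsePoint = std::vector<std::pair<std::size_t, double>>;  // (position in z_i, value)

    void accumulate_gamma_sparse(const std::vector<SparsePoint>& chunk,
                                 std::vector<double>& gamma, std::size_t p) {  // p = d + 2
      for (const SparsePoint& z : chunk)
        for (const auto& [a, va] : z)
          for (const auto& [b, vb] : z)
            gamma[a * p + b] += va * vb;
    }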

  16. Pros: Algorithm evaluation with physical array operators • Since x_i fits in one chunk, joins are avoided (a hash or merge join costs at least 2x the I/O) • Since x_i x_i^T can be computed in RAM, we avoid an aggregation that would require sorting the points by i • No need to store X twice (X and X^T): half the I/O, half the RAM space • No need to transpose X, a costly reorganization even in RAM, especially if X spans several RAM segments • The operator runs as compiled C++ code: fast; each vector is accessed once; direct assignment (bypassing C++ function calls)

  17. System issues and limitations • Gamma is not efficiently computable in AQL or AFL: hence an operator is required • Arrays of tuples in SciDB are more general, but cumbersome for matrix manipulation: we use arrays of a single attribute (double) • Points must be stored completely inside a chunk: wide rectangular chunks, which may not be I/O optimal • Slow: arrays must be pre-processed to SciDB load format, loaded to a 1D array, and re-dimensioned => the load must be optimized • Multiple SciDB instances per node improve I/O speed: interleaving CPU • Larger chunks are better: 8 MB, especially for dense matrices; avoid shuffling; avoid joins • Dense (alpha) and sparse (beta) versions

  18. Benchmark: scale-up emphasis • Small: cluster with 2 Intel Quad-core servers, 4 GB RAM, 3 TB disk • Large: Amazon cloud

  19. Why is Gamma faster than SciDB+LAPACK?

  20. Combination: SciDB + R

  21. Can Gamma operator beat LAPACK?

  22. [outdated] SciDB in the Cloud: massive parallelism

  23. Comparing systems to compute Γ on local server

  24. Comparing systems to compute Γ on local server

  25. Vertica vs. SciDB for sparse matrices

  26. Running on the cloud

  27. Running on the cloud

  28. Conclusions • One pass summarization matrix operator: parallel, scalable • Optimization of outer matrix multiplication as sum (aggregation) of vector outer products • Dense and sparse matrix versions required • Operator compatible with any parallel shared-nothing system, but better for arrays • Gamma matrix must fit in RAM, but n unlimited • Summarization matrix can be exploited in many intermediate computations (with appropriate projections) in linear models • Simplifies many methods to two phases: • Summarization • Computing model parameters • Requires arrays, but can work with SQL or MapReduce

  29. Future work: Theory • Use Gamma in other models like logistic regression, clustering, factor analysis, HMMs • Connection to frequent itemsets • Sampling • Higher expected moments, covariates • Unlikely: numeric stability with unnormalized sorted data

  30. Future work: Systems • DONE: Sparse matrices: layout, compression • DONE: Beat LAPACK on high d • Online model learning (a cursor interface is needed, incompatible with a DBMS) • Unlimited d (currently d>8000); is a join required for high d? Parallel processing of high d is more complicated, chunked • Interface with BLAS and MKL: not worth it? • DONE: Faster than a column DBMS for sparse?
