
Scalable Data-intensive Analytics



Presentation Transcript


  1. Scalable Data-intensive Analytics
Meichun Hsu, Intelligent Information Management Lab, HP Labs
August 24, 2008
Joint work with Qiming Chen, Bin Zhang, Ren Wu

  2. Outline Introduction Illustrated Computation Patterns Groupby with User-defined Aggregate Table Function over Graph-structured data Summary and other on-going work

  3. Challenges in BI Analytics
[Architecture figure: BI services delivered to a broad base of users / applications; OLTP systems, sensors, external feeds, and web content feed data transformation analytics and a massively parallel data warehouse (files / tables) that support business operational analytics]
• Scaling of the data-intensive analytics components has not kept pace
• Plus new challenges: bigger and bigger data sets; more and more complex transformation and analysis; demand for near real-time responses to enable Operational BI (OpBI)
• Data-intensive transformation and operational analytics are increasingly recognized as the bottleneck: “In fact, it’s THE bottleneck in most VLDW/VLDB and very large data integration systems.” -

  4. Challenges in BI Analytics
(regarding a media mix problem) “The result of a non-linear model of promotional lift as a function of media spend by channel, some coupon-related variables for each store is outstanding in terms of fit. The bad news is that generating the coefficients using our application server and SPSS takes about two weeks of CPU time… is this the type of problem we can throw at a parallel DB on…?” - Director, Research and Analysis, BonTon, April 2008
“With the vast amounts of data growing, we have realized the fact that we often have to move data across networks for analysis. It's actually going to be better if we can stay inside the database and move some of our computations down to the individual nodes on a (parallel data warehouse) box.” - Jim Goodnight, founder and CEO of SAS, October 2007

  5. Available Parallelism Grows Exponentially
[Chart: number of hardware threads per processor, 1990 to 2005, from the 80486 to Power4 to UltraSPARC T2. Courtesy: Anastasia Ailamaki, 2008]
• How will trends in multicores ease or exacerbate the bottlenecks in current transformation/analytics components?
• Will 100s of cores in a server, and 10,000s of cores in a scale-out parallel data warehouse, present an opportunity?

  6. Implications
Opportunity to design a Massively Data-Parallel Analytics Layer to dramatically improve end-to-end BI performance with:
• enhanced software parallelism, to take better advantage of the explosion of hardware threads
• enhanced data locality, to better optimize the utilization of limited memory and data bandwidth

  7. Parallel Query Engine vs Google’s Map Reduce
• Both are elegant and successful parallel processing models
• Parallel query engine: rich patterns of execution (pipelining, composition, multiple source sets, integration with schema management, to name a few); focused on built-in query operators, with UDFs as an exception
• Google’s Map Reduce: limited patterns of execution; focused more on supporting user-supplied programs

  8. Approach to Scalable Analytics for BI
• Integrate high-performance parallel computation with parallel query processing
• Leverage SQL’s schema management and declarative query language
• Fuse declarative data access with computation functions in a scale-out, shared-nothing parallel processing infrastructure
• Create a highly parallel, data flow-oriented infrastructure for data-intensive analytics

  9. Research Issues
• Richer dependency semantics for UDFs and flexibility for UDF optimization, e.g. GroupBy with a user-defined aggregate
• Structuring the computation, taking into account derivation and side effects
• High-performance implementation of parallel processing primitives: efficient management of the memory hierarchy in new architectures (e.g. multicore) for high-performance analytics; the UDF execution environment, process boundaries, and data flow considerations in new hybrid cluster environments
• Enhance the composability of User Defined Functions (UDFs): express a “process flow” using UDFs for ETL, information extraction, and information derivation
[Plan figure: a query plan/spec with a root, independent parallelism via ESP exchanges, a UDF group-by over a file scan, and partitioned parallel execution over ODBC with ESP and UDF instances per data partition (DP21, DP22)]

  10. Outline Introduction Illustrated Computation Patterns GroupBy with User-defined Aggregate Table Function over Graph-structured data Summary and other on-going work

  11. UDFs with GroupBy Example – K-Means Clustering
Given data points x, find the positions of the k cluster centers C_1, …, C_k that minimize Σ_x min_k (x − C_k)².
[Figure: data points and the centers of their clusters]
K-means algorithm:
• Init: start from an initial position of the centers
• Assign_Center: assign each data point to the closest center
• Recalculate_Centers: recompute each center as the geometric center of its cluster
• Iterate until no change happens
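To make the loop concrete, here is a minimal standalone sketch of this k-means algorithm in Python with NumPy; the function name and parameters (kmeans, max_iters, seed) are illustrative and not part of the original system.

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means sketch: assign each point to its closest center,
    then recompute each center as the geometric center (mean) of its cluster."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign_Center: index of the closest center for every data point
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Recalculate_Centers: geometric center of each cluster
        new_centers = np.array([points[assign == j].mean(axis=0)
                                if (assign == j).any() else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):   # iterate until no change
            break
        centers = new_centers
    return centers, assign
```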

  12. Map-Reduce K-means (push down to a parallel database?)
Execution, one iteration over the current centroids:
(1) Map: assign a center per data point, C(x) = argmin_k (x − C_k)²; the intermediate key-value pair is [cluster_id, x]
(2) Hash-distribute by cluster_id
(3) Reduce: aggregate per cluster_id to get the sufficient statistics S_p[k] = Σ x, Q_p[k] = Σ x², N_p[k] = Σ 1
(4) Recalculate the centroids per cluster, C[k] = Σ_p S_p[k] / Σ_p N_p[k]; repeat from (1) until done
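The map and reduce roles sketched above might look roughly as follows in Python; this is a sketch under the assumption that S, Q and N denote the per-cluster linear sum, sum of squares and count, and all function names are illustrative.

```python
import numpy as np
from collections import defaultdict

def map_phase(points, centroids):
    """Map: emit [cluster_id, x] where cluster_id = argmin_k (x - C_k)^2."""
    for x in points:
        cid = int(((x - centroids) ** 2).sum(axis=1).argmin())
        yield cid, x

def reduce_phase(pairs, dim):
    """Reduce per cluster_id: S[k] = sum of x, Q[k] = sum of x^2, N[k] = count."""
    stats = defaultdict(lambda: (np.zeros(dim), np.zeros(dim), 0))
    for cid, x in pairs:
        S, Q, N = stats[cid]
        stats[cid] = (S + x, Q + x * x, N + 1)
    return stats

def recalc_centroids(stats, centroids):
    """Step (4): C[k] = S[k] / N[k]; a centroid is unchanged if its cluster is empty."""
    new = centroids.copy()
    for cid, (S, _, N) in stats.items():
        new[cid] = S / N
    return new
```

One iteration is then recalc_centroids(reduce_phase(map_phase(points, centroids), points.shape[1]), centroids), repeated until the centroids stop moving.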

  13. K-Means by UDFs
Execution: (1) the UDF assign_cluster assigns a center per data point, (2) tuples are hash-distributed by cluster_id, (3) the UDFs ssl, ssq, count aggregate per cluster_id, and (4) the centroids are recalculated from the answer.
• UDF assign_cluster(datapoint, k_centroids): for each data point, compute its distances to all centroids in k_centroids and assign it to the cluster with the closest centroid
• UDFs ssl(datapoint), ssq(datapoint): aggregate the data points to produce the sufficient statistics ssl and ssq

SELECT cluster_id, s = ssl(x), q = ssq(x), count(*)
FROM (SELECT INTO temp x, cluster_id = assign_cluster(x, kc) FROM r)
GROUP BY cluster_id
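For illustration only (a real deployment would implement this against the engine's UDF API rather than in Python), the per-tuple assign_cluster UDF amounts to:

```python
import numpy as np

def assign_cluster(x, k_centroids):
    """Per-tuple UDF sketch: compute the distance from data point x to every
    centroid in k_centroids and return the id of the closest one."""
    x = np.asarray(x, dtype=float)
    k_centroids = np.asarray(k_centroids, dtype=float)
    return int(((k_centroids - x) ** 2).sum(axis=1).argmin())
```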

  14. Parallelism in Aggregate UDFs
ssl( ) is an aggregate function with the contract {init(); iterate(); final(); merge();}

SELECT oid, ssl(x) FROM r

[Plan: the data is partitioned; each partition applies ssl.iterate() tuple-wise as a local aggregate; the partial results are then combined with ssl.merge() to assemble the final result]
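A minimal Python sketch of an aggregate UDF that follows this init / iterate / final / merge contract, assuming ssl computes the per-group linear sum; the class and method names mirror the slide but are otherwise illustrative.

```python
import numpy as np

class Ssl:
    """Aggregate-UDF sketch: each partition runs init() and then iterate()
    over its local tuples; the engine combines partial states with merge()
    and obtains the group's result from final()."""

    def init(self, dim):
        self.s = np.zeros(dim)      # running linear sum for one group

    def iterate(self, x):
        self.s += np.asarray(x)     # tuple-wise apply on the local partition

    def merge(self, other):
        self.s += other.s           # combine partial (local) aggregates

    def final(self):
        return self.s
```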

  15. Parallel Aggregate UDF Plan in K-means
[Plan: tuples are hash-distributed by cluster_id; on each node ssl.iterate() is applied tuple-wise, the partial results are assembled with ssl.merge(), and the centroids are recalculated to produce the answer]
But: this plan is very high in communication overhead.

  16. Efficient Parallel Computation Plan?
[Comparison figure: the map-reduce K-means plan ((1) map assigns a center per data point, (2) hash-distribute by cluster_id, (3) reduce aggregates per cluster_id, (4) recalculate centroids) vs. the parallel aggregate-UDF plan, where the “iterate” stage of the reduce carries only small {assign, Sums} states per partition into the “merge” stage]

  17. Pushing a UDF with GroupBy down to the Partition Level

SELECT cluster_id, s = ssl(x), q = ssq(x), count(*)
FROM (SELECT INTO temp x, cluster_id = assign_cluster(x, kc) FROM r)
GROUP BY cluster_id

• For each partition: assign each data point to a cluster and aggregate locally, per partition and per cluster
• Each local aggregate returns a table with one row per group-by value; these sufficient statistics are much smaller than the data set itself
• The local rows are combined at the global level per group-by value (combine-global per cluster) to calculate the new centroids; iterate until done
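A Python sketch of this local/global split: each partition emits one small row of sufficient statistics per cluster, and only those rows are combined globally. The data layout and function names are assumptions for illustration.

```python
import numpy as np
from collections import defaultdict

def local_aggregate(partition, centroids):
    """Per-partition pass: assign each local point to a cluster and keep only
    one (S, Q, N) row per cluster_id, far smaller than the data itself."""
    dim = centroids.shape[1]
    groups = defaultdict(lambda: [np.zeros(dim), np.zeros(dim), 0])
    for x in partition:
        cid = int(((x - centroids) ** 2).sum(axis=1).argmin())
        S, Q, N = groups[cid]
        groups[cid] = [S + x, Q + x * x, N + 1]
    return groups

def global_combine(local_results, centroids):
    """Combine the small per-partition tables and derive the new centroids."""
    totals = defaultdict(lambda: [np.zeros(centroids.shape[1]), 0])
    for groups in local_results:
        for cid, (S, _, N) in groups.items():
            totals[cid][0] += S
            totals[cid][1] += N
    new = centroids.copy()
    for cid, (S, N) in totals.items():
        new[cid] = S / N
    return new
```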

  18. Outline Introduction Illustrated Computation Patterns Groupby with User-defined Aggregate Table Function over Graph-structured data Summary and other on-going work

  19. Analytics over a Structured Data Set
An example: a prediction system for water resources.
[Figure: a river network and the hydraulic-dynamics model of a river segment with its upstream segments and its downstream segment]
The condition at a segment at time t depends on the segment’s own properties at time t and on the conditions at its upstream segments at time t-1, calculated based on hydraulic dynamics.

  20. Computation involves multi-dimensional dependencies (spatial, temporal)
[Figure: a river segment tree in which segment a is downstream of b and c, which are in turn downstream of d, e, f; the state a(t2) at time t2 depends on the upstream states at time t1]
• Input – a table of all river segments (millions of river segments): geometric & environmental parameters, the topology tree, and rainfall / weather sensor data (precipitation, evaporation, runoff yield, soil erosion)
• Output – the predicted properties of all river segments as time series over tens of thousands of time intervals: water level, volume, flow velocity, flow and sand discharge

  21. Parallelization of hydro()
hydro( ) is a table function:

SELECT * FROM river CROSS APPLY hydro(*)

• UDFs generally cannot be applied to tuples structured as a graph: each tuple is processed independently of the other tuples in the set (as in the Bar.iterate() / Bar.merge() aggregate pattern)
• For a tree-structured data set, the UDF needs to be applied in a specific order
[Figure: the river segment tree P0 with children P1, P2; P1 with children C11, C12, C13; P2 with children C21, C22; hydro() applied per segment]

  22. Graph traversal in SQL
Pre-order traversal (starting from the root P0):

SELECT * FROM river
CONNECT BY PRIOR sid = parent_sid
START WITH sid = ‘P0’

Post-order traversal (starting from the leaf C12):

SELECT name, sid, parent_sid FROM river
CONNECT BY sid = PRIOR parent_sid
START WITH sid = ‘C12’

[Figures: the visiting order over the segment tree P0, P1, P2, C11, C12, C13, C21, C22]
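To show what the first query computes, here is a small Python sketch of a depth-first (pre-order) expansion over a (sid, parent_sid) table; it illustrates the traversal only, not the engine's CONNECT BY implementation, and the example tree mirrors the one on the slide.

```python
from collections import defaultdict

def connect_by_prior(rows, start_sid):
    """Pre-order expansion of the segment tree from start_sid.
    rows is a list of (sid, parent_sid) pairs."""
    children = defaultdict(list)
    for sid, parent in rows:
        children[parent].append(sid)
    out, stack = [], [start_sid]
    while stack:
        sid = stack.pop()
        out.append(sid)
        stack.extend(reversed(children[sid]))   # keep left-to-right child order
    return out

river = [("P1", "P0"), ("P2", "P0"), ("C11", "P1"), ("C12", "P1"),
         ("C13", "P1"), ("C21", "P2"), ("C22", "P2")]
print(connect_by_prior(river, "P0"))
# ['P0', 'P1', 'C11', 'C12', 'C13', 'P2', 'C21', 'C22']
```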

  23. Extend UDF with Traversal-Control Forms for graph-structured computation
• Apply a UDF f() to tree-structured data objects in post order; hydro( ) is a table function:

SELECT * FROM river CROSS APPLY hydro(*)
CONNECT BY sid = PRIOR ALL parent_sid
START WITH is_leaf = ‘yes’

Semantics: apply hydro() starting with the leaf river segments, then apply it to a non-leaf segment only when all of its upstream segments have been applied.
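A Python sketch of these intended semantics, assuming a per-segment function hydro(sid, upstream_results) that returns the computed state of one segment; the helper names and the (sid, parent_sid) row layout are hypothetical.

```python
from collections import defaultdict

def apply_post_order(rows, hydro):
    """Apply hydro() to leaf segments first, and to any other segment only
    once every one of its upstream (child) segments has been processed."""
    children, sids = defaultdict(list), set()
    for sid, parent in rows:
        children[parent].append(sid)
        sids.add(sid)
        if parent is not None:
            sids.add(parent)
    parent_of = {sid: parent for sid, parent in rows}
    remaining = {s: len(children[s]) for s in sids}   # unprocessed upstream count
    ready = [s for s in sids if remaining[s] == 0]    # the leaf segments
    results = {}
    while ready:
        sid = ready.pop()
        results[sid] = hydro(sid, [results[c] for c in children[sid]])
        p = parent_of.get(sid)
        if p is not None:
            remaining[p] -= 1
            if remaining[p] == 0:                     # all upstream segments done
                ready.append(p)
    return results
```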

  24. Parallel processing strategy for graph-structured computation
[Figure: a river segment tree annotated with processing levels P-Level 0 through P-Level 8, and its decomposition into connected sub-graph partitions, e.g. partition ‘0’ at level 3 and partitions ‘000’, ‘001’, ‘00000’, ‘000000’ at lower levels]
1. Leveling: assign each segment a processing level (see the sketch after this list)
2. Partition the tree into connected subgraphs, keeping track of the metadata of each partition
3. Distribution: distribute the partitions to servers, with load balancing based on the sizes and levels of the partitions
4. Compute in parallel: each server sorts its tuples properly and processes them in sort order, recording metadata for parent firing and for transmitting computed tuples to other servers
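A minimal sketch of the leveling step (step 1), assuming the river table is available as (sid, parent_sid) pairs; the recursive helper is illustrative and would need an iterative formulation for millions of segments.

```python
from collections import defaultdict

def processing_levels(rows):
    """Leveling sketch: a leaf segment gets P-Level 0, and every other segment
    gets 1 + the maximum P-Level of its upstream (child) segments, so all
    segments at one level can be processed once the levels below are done."""
    children, nodes = defaultdict(list), set()
    for sid, parent in rows:
        children[parent].append(sid)
        nodes.add(sid)
        if parent is not None:
            nodes.add(parent)

    level = {}
    def lvl(n):
        if n not in level:
            level[n] = 0 if not children[n] else 1 + max(lvl(c) for c in children[n])
        return level[n]

    for n in nodes:
        lvl(n)
    return level
```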

  25. Summary and on-going work
• Illustrated how the parallel query processing and map-reduce paradigms can be enriched for advanced scalable analytics
• Primitives that allow explicit declaration of the semantics and dependencies in analytic computations have potential: discover the important patterns, and devise efficient parallelization support in the infrastructure
• Additional on-going investigations: the importance of the shared-nothing principle in shared-memory (multicore) architectures; hybrid clusters and paradigms for data flow among heterogeneous clusters
• Goal: combine a general data flow-driven computing paradigm with the data management infrastructure to achieve data-intensive analytics

  26. Q&A Thank You!
