

Introduction to Spark

Matei Zaharia

spark-project.org



What is Spark?

  • Fast and expressive cluster computing system interoperable with Apache Hadoop

  • Improves efficiency through:

    • In-memory computing primitives

    • General computation graphs

  • Improves usability through:

    • Rich APIs in Scala, Java, Python

    • Interactive shell

Up to 100× faster (2-10× on disk)

Often 5× less code



Project History

  • Started in 2009, open sourced 2010

  • 25 companies now contributing code

    • Yahoo!, Intel, Adobe, Quantifind, Conviva, Bizo, …

  • Entered Apache incubator in June



A General Stack

  • Spark is the basis of a wide set of projects in the Berkeley Data Analytics Stack (BDAS)

[Diagram: the BDAS stack, with Spark as the common engine beneath Shark (SQL), Spark Streaming (real-time), GraphX (graph), and MLbase (machine learning)]

More details: amplab.berkeley.edu



This Talk

  • Spark introduction & use cases

  • Implementation

  • Other stack projects

  • The power of unification



Why a New Programming Model?

  • MapReduce greatly simplified big data analysis

  • But as soon as it got popular, users wanted more:

    • More complex, multi-pass analytics (e.g. ML, graph)

    • More interactive ad-hoc queries

    • More real-time stream processing

  • All 3 need faster data sharing across parallel jobs



Data Sharing in MapReduce

[Diagram: iterative jobs write intermediate results to HDFS and read them back between iterations; interactive queries each re-read the same input from HDFS]

Slow due to replication, serialization, and disk IO



Data Sharing in Spark

[Diagram: the input is loaded into distributed memory once; iterations and queries then share the in-memory data]

10-100× faster than network and disk



Spark Programming Model

  • Key idea: resilient distributed datasets (RDDs)

    • Distributed collections of objects that can be cached in memory across cluster

    • Manipulated through parallel operators

    • Automatically recomputed on failure

  • Programming interface

    • Functional APIs in Scala, Java, Python

    • Interactive use from Scala shell



Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                     # base RDD
errors = lines.filter(lambda x: x.startswith("ERROR"))   # transformed RDD
messages = errors.map(lambda x: x.split("\t")[2])
messages.cache()

messages.filter(lambda x: "foo" in x).count()            # action triggers computation
messages.filter(lambda x: "bar" in x).count()

[Diagram: the driver ships tasks to workers; each worker reads a block of the file (Block 1-3), caches its partition of messages (Cache 1-3), and returns results to the driver]

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)



Fault Tolerance

RDDs track lineage info to rebuild lost data

file.map(lambda rec: (rec.type, 1))
    .reduceByKey(lambda x, y: x + y)
    .filter(lambda (type, count): count > 10)

[Lineage graph: Input file → map → reduce → filter]
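For reference, a minimal Python 3 rendering of the same chain (tuple-unpacking lambdas were removed in Python 3; file is assumed to be an RDD of records with a type field, as on the slide):

counts = (file.map(lambda rec: (rec.type, 1))       # one (type, 1) pair per record
              .reduceByKey(lambda x, y: x + y)      # count per type
              .filter(lambda tc: tc[1] > 10))       # tc = (type, count)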





Demo



Example: Logistic Regression

[Chart: Hadoop takes ~110 s per iteration; Spark takes ~80 s for the first iteration, then ~1 s for each further iteration, since the data is already cached in memory]
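A sketch of how such a job might be written against the RDD API (readPoint, D, and ITERATIONS are assumed placeholders; each point is assumed to carry a label y and a feature vector x):

import numpy as np

# Parse the input once and keep the points in cluster memory.
points = sc.textFile("hdfs://...").map(readPoint).cache()

w = np.random.rand(D)  # initial weights
for i in range(ITERATIONS):
    gradient = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p.y * w.dot(p.x))) - 1.0) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient

Because points is cached, only the first iteration pays the cost of reading and parsing the input; later iterations reuse the in-memory data.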



Spark in Scala and Java

// Scala:

val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java:

JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();
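And the same query in Python, for comparison (a minimal sketch; the path is elided as on the slide):

lines = sc.textFile(...)
lines.filter(lambda x: "ERROR" in x).count()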



Supported Operators

  • map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin

  • reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip

  • sample, take, first, partitionBy, mapWith, pipe, save, ...
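A few of these in action, as a minimal PySpark sketch (tiny in-memory dataset assumed; result order may vary):

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

pairs.reduceByKey(lambda x, y: x + y).collect()       # [("a", 3), ("b", 1)]
pairs.groupByKey().mapValues(list).collect()          # [("a", [1, 2]), ("b", [1])]
pairs.join(sc.parallelize([("a", "x")])).collect()    # [("a", (1, "x")), ("a", (2, "x"))]
pairs.sample(False, 0.5).collect()                    # random subset, without replacement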



Spark Community

  • 1200+ meetup members

  • 90+ devs contributing

  • 25 companies contributing



This Talk

  • Spark introduction & use cases

  • Implementation

  • Other stack projects

  • The power of unification



Software Components

  • Spark client is a library in the user program (1 instance per app)

  • Runs tasks locally or on cluster

    • Mesos, YARN, standalone mode

  • Accesses storage systems via Hadoop InputFormat API

    • Can use HBase, HDFS, S3, …

[Diagram: your application creates a SparkContext, which talks to a cluster manager or runs local threads; the cluster manager launches Spark executors on workers, which read from HDFS or other storage]
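A minimal sketch of embedding the client in an application (the cluster URL and input path are placeholders):

from pyspark import SparkContext

sc = SparkContext("local[4]", "MyApp")      # or e.g. "spark://host:7077" for a cluster
data = sc.textFile("hdfs://.../input.txt")  # storage accessed via the Hadoop InputFormat API
print(data.count())
sc.stop()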



Task Scheduler

  • General task graphs

  • Automatically pipelines functions

  • Data locality aware

  • Partitioning aware to avoid shuffles

[Diagram: a DAG of RDDs (A-F) split into three stages at shuffle boundaries (groupBy, join); map and filter are pipelined within a stage, and cached partitions are skipped. Legend: boxes = RDDs, shaded = cached partitions]
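A small sketch of code that would produce a multi-stage graph like this (dataset contents assumed):

a = sc.parallelize(range(100)).map(lambda x: (x % 10, x))
b = a.groupByKey()                         # shuffle for groupBy: stage boundary
c = sc.parallelize(range(10)).map(lambda x: (x, x * x))
d = b.join(c)                              # shuffle for join: stage boundary
d.count()                                  # the map calls above are pipelined within their stages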



Advanced Features

  • Controllable partitioning

    • Speed up joins against a dataset

  • Controllable storage formats

    • Keep data serialized for efficiency, replicate to multiple nodes, cache on disk

  • Shared variables: broadcasts, accumulators

  • See online docs for details!
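Minimal, illustrative sketches of these features (toy dataset assumed):

from pyspark import StorageLevel

pairs = sc.parallelize([("a", 1), ("b", 2)])

# Controllable partitioning: hash-partition so later joins on these keys avoid a shuffle;
# controllable storage: spill cached partitions to disk when memory is short
partitioned = pairs.partitionBy(8).persist(StorageLevel.MEMORY_AND_DISK)

# Shared variables
lookup = sc.broadcast({"a": 10, "b": 20})   # read-only value shipped once per worker
misses = sc.accumulator(0)                  # counter aggregated back at the driver

def score(kv):
    k, v = kv
    if k not in lookup.value:
        misses.add(1)
    return v * lookup.value.get(k, 0)

print(partitioned.map(score).sum())   # 1*10 + 2*20 = 50
print(misses.value)                   # 0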



This Talk

  • Spark introduction & use cases

  • Implementation

  • Other stack projects

  • The power of unification



Shark: Hive on Spark

  • Columnar SQL analytics engine for Spark

    • Support both SQL and complex analytics

    • Up to 100× faster than Apache Hive

  • Compatible with Apache Hive

    • HiveQL, UDF/UDAF, SerDes, Scripts

    • Runs on existing Hive warehouses

  • In use at Yahoo! for fast in-memory OLAP



Hive Architecture

[Diagram: a client (CLI, JDBC) talks to a Driver containing the SQL Parser, Query Optimizer, Physical Plan, and Execution layers, with a Metastore holding table metadata; execution runs as MapReduce over HDFS]



Shark Architecture

[Diagram: the same architecture as Hive (client, Metastore, Driver with SQL Parser, Query Optimizer, Physical Plan), plus a Cache Manager; execution runs on Spark over HDFS]

[Engle et al, SIGMOD 2012]



Performance

[Chart: query performance on 1.7 TB of warehouse data on 100 EC2 nodes]



Spark Integration

  • Unified system for SQL, graph processing, machine learning

  • All share the same set of workers and caches



Other Stack Projects

  • Spark Streaming: stateful, fault-tolerant stream processing (out since Spark 0.7)

sc.twitterStream(...)
  .flatMap(_.getText.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

  • MLlib: Library of high-quality machine learning algorithms (out since 0.8)
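As an illustration, a toy MLlib call in Python (the Python bindings landed slightly after 0.8, so treat this as a sketch; the two-point dataset is made up):

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
])
model = LogisticRegressionWithSGD.train(data, iterations=10)
print(model.predict([1.0, 0.0]))  # predicted class for a new point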



This Talk

  • Spark introduction & use cases

  • Implementation

  • Other stack projects

  • The power of unification



The Power of Unification

  • Spark’s speed and programmability are great, but what makes the project unique?

  • Unification:multiple programming models (SQL, stream, graph, …) on the same engine

  • This has powerful benefits:

    • For the engine

    • For users

[Diagram: Shark, Streaming, GraphX, and MLbase all built on the Spark engine]


Code Size

[Chart, built up over four slides: non-test, non-example source lines, highlighting Shark, Streaming, and GraphX in turn as small codebases layered on Spark]



Performance

[Charts: performance of the SQL, streaming, and graph components]



Performance

  • One benefit of unification is that optimizations we make for one project affect all the others!

  • E.g. scheduler optimizations in Spark Streaming (0.6) resulted in 2x faster Shark queries



What it Means for Users

  • Separate frameworks:

[Diagram: with separate frameworks, each step (ETL, train, query) does its own HDFS read and HDFS write]

Spark:

[Diagram: one HDFS read, then ETL, train, and query run in the same engine, sharing data in memory]



Getting Started

  • Visit spark-project.org for videos & tutorials

  • Easy to run in local mode, private clusters, EC2

  • Built-in ML library since Spark 0.8

  • Conference December 2nd: www.spark-summit.org



Conclusion

  • Big data analytics is evolving to include:

    • More complex analytics (e.g. machine learning)

    • More interactive ad-hoc queries

    • More real-time stream processing

  • Spark is a fast platform that unifies these apps

  • More info: spark-project.org

