Data cloud frameworks
This presentation is the property of its rightful owner.
Sponsored Links
1 / 33

Data Cloud Frameworks PowerPoint PPT Presentation


  • 100 Views
  • Uploaded on
  • Presentation posted in: General

Data Cloud Frameworks. Author - Shailendra Mishra Head Data Architecture ( Paypal ). Data at Paypal. Enable Online, Offline, and Mobile payment 128M customers worldwide $160B payment volume processed annually Major retail locations accepting PayPal 20K today  2M end of 2013

Download Presentation

Data Cloud Frameworks

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Data cloud frameworks

Data Cloud Frameworks

Author - Shailendra Mishra

Head Data Architecture (Paypal)


Data at paypal

Data at Paypal

  • Enable Online, Offline, and Mobile payment

  • 128M customers worldwide

  • $160B payment volume processed annually

  • Major retail locations accepting PayPal

    20K today  2M end of 2013

  • PayPal Here launching in US and international markets

  • PetaBytes of high value data and growing


The data landscape

The Data Landscape


Enterprise data painpoints

Enterprise Data - Painpoints

  • Transformation logic in BTEQ scripts in TD

  • Inefficient handshake between CDC & ETL

  • Landing data into files multiple times

  • Limited visibility into impact / lineage

  • Limited to data movement

  • Enterprise integration platform

    • Cost effective, scalable ETL server grid

    • Comprehensive capabilities

  • Eliminate need for DB Standbys

  • Minimize multiple versions / copies of data

  • Enterprise agreement on cross network architecture for data integration backend.

OPPORTUNITIES


Enterprise data landscape ii

Enterprise Data Landscape (ii)


Achievements opportunities

Achievements & Opportunities

  • Transformation logic in BTEQ scripts in TD

  • Inefficient handshake between CDC & ETL

  • Landing data into files multiple times

  • Limited visibility into impact / lineage

  • Limited to data movement

  • Enterprise integration platform

    • Cost effective, scalable ETL server grid

    • Comprehensive capabilities

  • Eliminated need for DB Standbys

  • Minimized multiple versions / copies of data

  • Enterprise agreement on cross network architecture for data integration backend.

achievements

opportunities


Business view

Business view

Real time

Decision

Data

governance

Analytical data

Stores &

Warehousing

Operational

reporting

Data

analysts

profiling

Cloud

Data stores

Quality

customers

Analytical

tools

metadata

Information

delivery

Pipe

Mask & subset

files

Source

Of record

databases

message bus

Data cloud

services

Users

Enterprise data

Modeling

  • Unified data and metadata dictionary

  • Data lineage and comprehensive data quality functions

  • PD-DM Partnership for Data Centric Solutions

  • Streaming Analytics & Real time dashboards

  • Enterprise approach across all DM disciplines

  • Machine Learning

  • Real time integration

  • Self-service data delivery

  • Enterprise Data Governance

  • Interactive querying

  • Analytics & Information Products Lifecycle

  • Text search & Analytics

GOALS


Cloud computing

Cloud computing


Data cloud architecture

Data cloud Architecture

Development Environment

IDE

Dashboard Builder

Web Services

Data Cloud

Applications

Streaming Analytics

Analytics

Reporting

Interactive Query

Natural.Lang-Proc

Txt Search

Graph Algorithms

M/c Learning

Data Acquisition & Indexing Services

ETL

QoSLatency,Uptime services

Adapters

Bulk Data Acquisition

IndexingSvcs

Infrastructure Services

Monitoring

Scheduling

Orchestration

Core Services

Distributed

File System

DW DB

Distributed

Memory Store

Stream Processing

Big Table DB

OLTPDB

Runtime

JVM

App.Containers

OCC


Data storage

Data Storage

HBase


Data storage1

Data Storage

  • Improved handshake between CDC & ETL

  • Ability to process some event data

  • Data services

    • Metadata

    • Quality

    • Profiling

    • Mask

    • Subset

    • Near real-time data

  • All data transformations in centralized grid

  • Hadoop capability ( Read / Write / Process)

  • Data interchange (external)


  • Data integration grid real time

    Data Integration Grid (Real time)

    GG User Exits

    ETL

    Distrib.Cache

    Feeds

    GG

    HBase


    Data integration grid batch

    Data Integration Grid (Batch)

    Flume

    PIG

    Hive

    GG + ETL

    GG

    HBase

    MR (Index)

    MR (Load)


    Data stream processing

    Data Stream Processing

    Stream

    Processor

    HBase


    Data stream processing1

    Data Stream Processing

    • Data stream is defined as sequence of elements (”events”) of the form (K, A) where K, and A are the tuple-valued keys and attributes respectively

    • Objective is to create a stream computing platform which consumes such a stream, computes intermediate values, and possibly emits other streams

    • The Design Elements of the stream computing platform are:

    • Processing Elements (PEs) – Basic computational element or building blocks

    • Processing Nodes (PNs) - These are logical hosts to PEs

    • Communication layer (CL) - Provides cluster management and automatic failover to standby nodes and maps physical nodes to logical nodes


    Processing elements

    Processing Elements

    • Processing Elements (PEs) are basic computational element which are identified by following properties:

      • Functionality as defined by a PE (i.e.) class and associated configuration

      • Types of events that it consumes

      • Keyed attribute in those events

      • Value of the keyed attribute in events which it consumes

      • A library of PEs is available for standard tasks

      • PE objects are accorded a TTL, so if no event arrives at a PE for the TTL, the PE is reaped

    • Special PEs

      • Keyless PEs can consume all events of the type that they are associated. These normally are used as input PEs, where the key is still being assigned

      • Abstract PE has only three components of its identity (functionality, event type, keyed attribute);

        • Attribute value is unassigned

        • It is configuredon initialization and, for any value V, it is capable of cloning itself to create fully qualified PEof that class with identical configuration and value V for the keyed attribute


    Processing nodes

    Processing Nodes

    • Processing Nodes (PNs) are logical hosts to PEs

    • Responsible for listening to events, executing operations on incoming events, dispatching events with the assistance of the communication layer, and emitting output events

    • Each event is routed to PNs based on hash function of values of known keyed attributes in that event

    • Single event may be routed to multiple PNs.

    • Set of all possible keying attributes is known from the configuration of the cluster

    • Event listener in the PN passes incoming events to the processing element container (PEC) which invokes the appropriate PEs

    • All events with a particular value of a keyed attribute are guaranteed to arrive at a particular corresponding PN, and be routed to the corresponding PE instances within it.

    • Every keyed PE can be mapped to exactly one PN based on the value of the hash function applied to the value of the keyed attribute of that PE

    • Keyless PEs may be instantiated on every PN.


    Processing node

    Processing Node

    Processing Node

    Processing Element Container

    PE1

    PEn

    PE2

    Event

    Listener

    Communication Layer

    Emitter

    Dispatcher

    Routing and Load Balancing

    Failover Management

    Transport Protocols

    Zookeeper


    Programming model

    Programming Model

    • The high-level programming paradigm is to write generic, reusable and configurablePEs that can be used across various applications

    • PEs are assembled into applications

    • The PE API is fairly simple and flexible consisting of handlers such as onCreate, onTime, onEventetc, setDownstream and a facility to define state variables

      • onEvent is invoked for each incoming event of the types the PE has subscribed to. It implements the logic for input event handling, typically an update of the internal PE state.

      • onTime method is invoked by the PE timer. By default it is synchronized with onEvent, onTrigger methods

      • onTrigger method is used for count based windows. It adds a new slot when the current slot reaches capacity


    Communication layer

    Communication Layer

    Communication layer :

    • Provides cluster management and automatic failover to standby nodes and maps physical nodes to logical nodes.

    • Automatically detects hardware failures and accordingly updates the mapping

    • Emitters specify only logical nodes when sending messages

    • Emitters are unaware of physical nodes or when logical nodes are re-mapped due to failures

    • API can be used to send input events in a round-robin fashion to nodes in an S4 cluster. These input events are then processed by keyless Pes

    • Uses a pluggable architecture to select network protocol. Events may be sent with or without a guarantee

      • Control messages may require guaranteed delivery while data may be sent without a guarantee to maximize throughput

    • Uses ZooKeeper to help coordinate between PNs in a cluster


    G raph processing

    Graph processing

    Apache

    Giraph

    HBase


    Large scale graph processing

    Large scale graph processing

    • Giraph provides libraries for large scale graph processing

      • Modeled after Google Pregel

      • Bulk synchronous parallel execution model

      • Fault tolerant using checkpointing

      • Computation is executed in memory

      • Can be a job in a map-reduce pipeline (Hadoop, Hive)

      • Uses Zookeeper for synchronization


    Example usage

    Example usage

    • User rank (page rank)

      • Can be personalized for a user or “type” of user

      • Determining popular users, news, jobs, etc.

    • Shortest paths

      • Many variants single-source, s-t shortest paths, all-to-all shortest (memory/storage prohibitive)

      • How are users, groups connected?

    • Clustering, semi-clustering

      • Max clique, triangle closure, label propagation algorithms

      • Finding related people, groups, interests

    • Hidden inferences in communities

      • Discover inferences through graph approach extract them


    Application client giraph

    Application Client - Giraph

    Fjsfsfjsf;sdfjsfjsj;sjsjsfjs

    Processor-2

    Processor-4

    Processor-5

    Processor-1

    Processor-3

    Local Computation

    Superstep

    Communication

    Barrier Synchronization


    Giraph basics

    Giraph basics

    • Deployment on big data processing infrastructure (no need to create/maintain separate graph processing cluster)

    • Dynamic resource management

      • Handle failures gracefully

      • Integrate new resources when available

    • Based on Bulk synchronous parallel model

    • 3 main attributes

      • Components that process and/or provide storage

      • Router to deliver point-to-point messages

      • Synchronization of all or a subset of components through regular intervals (supersteps)

    • Computation is done when all components are done

    • Parallelization of computation/messaging during a superstep

      • Components can only communicate by messages delivered out-of-order in the next superstep

    • Fault-tolerant/dynamic resource utilization

      • Supersteps are atomic units of parallel computation

      • Any superstep can be restarted from a checkpoint (need not be user defined)

      • A new superstep provides an opportunity for rebalancing of components among available resources


    Data science

    Data Science

    HBase

    Cloudera ML


    Apache mahout

    Apache Mahout

    • Mahout provides scalable machine learning libraries.

    • Oldest product amongst machine learning Algos widely deployed – an incomplete list is as under:

      • User and item based recommenders

      • k-means and fuzzy k-means clustering

      • Means shift clustering

      • Dirichlet process clustering

      • Latent Dirichlet allocation

      • Singular value decomposition

      • Parallel frequent pattern mining

      • Random forest decision tree based qualifier

    • Challenge - Delta between latest ML and Mahout implementations


    Mrv2 and yarn

    MRV2 and Yarn

    • Eliminates Job tracker bottlenecks

    • Separates Resource tracker and scheduler

    • Moves map/reduce to user

    • Allows Hadoop to run all sorts of jobs

      • Native BSP (Giraph)

      • AllReduce, Graflab


    Apache crunch

    Apache Crunch

    • Apache Crunch provides a framework for writing, testing and running MapReduce pipelines

    • It makes tasks like joining and data aggregation that are tedious to implement on plain MapReduce

    • The APIs are useful when processing data that does not fit naturally into relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns


    Apache crunch api

    Apache crunch API

    • Crunch API is centered around three interfaces that represent distributed datasets:

      • PCollection<T> represents a distributed, unordered collection of elements of type T e.g. file is represented as Pcollection of strings

        • Pcollection provides a parallelDo operation that applies a function to each element in PCollection

      • PTable<K, V> is a sub-interface of Pcollection which represents unordered multimap.

        • Ptable in addition to parallelDoprovides groupByKey operation

        • groupByKey triggers the sort phase of a map-reduce job

        • Result of groupByKey is PGroupedTable<K, V> which is a sorted distributed map of type K to iterable collection of values of type V

      • PCollection, PTable, and PGroupedTable all support a union operation, which takes a series of distinct PCollections and treats them as a single, virtual Pcollection

        • Required by operations to combine multiple inputs


    Example

    Example

    public class WordCount {

    public static void main(String[] args) throws Exception {

    Pipeline pipeline = new MRPipeline(WordCount.class);

    PCollection<String> lines = pipeline.readTextFile(args[0]);

    PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {

    public void process(String line, Emitter<String> emitter) {

    for (String word : line.split("\\s+")) {

    emitter.emit(word);

    }

    }

    }, Writables.strings());

    PTable<String, Long> counts = Aggregate.count(words);

    pipeline.writeTextFile(counts, args[1]);

    pipeline.run();

    }

    }


    Cloudera ml

    Cloudera ML

    • Cloudera ML provides a command line tool for running data preparation and model evaluation tasks

      • summary

      • sample

      • pivot

      • header

      • normalize

      • showvec

      • ksketch

      • kmeans

      • lloyds

      • Kassign

    • Cloudera ML is just a start but coupled with Apache crunch one can try out Java ML algos


    Data cloud frameworks

    Q

    A

    &


  • Login