Real-time PMML Scoring over Spark Streaming and Storm




Real-time PMML Scoring over Spark Streaming and Storm. Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D, Innovation Labs, Impetus. Contents. Big Data Computations.


Presentation Transcript



Real-time PMML Scoring over Spark Streaming and Storm

Dr. Vijay Srinivas Agneeswaran,

Director and Head, Big-data R&D,

Innovation Labs, Impetus



Contents



Big Data Computations

[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.

[2] Richter, Yossi; Yom-Tov, Elad; Slonim, Noam. Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.



Berkeley Big-data Analytics Stack (BDAS)



BDAS: Spark

[MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.



BDAS: Discretized Streams

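// Read page-view events from the source in 1-second batches
pageViews = readStream("http://...", "1s")

// Map every event to a (url, 1) pair
ones = pageViews.map(event => (event.url, 1))

// Keep a running count per URL across batches
counts = ones.runningReduce((a, b) => a + b)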
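The running count in the pseudocode above can be imitated with plain Scala collections, which may help when reasoning about what state is carried from one batch interval to the next. This is only a sketch with no Spark dependency; `PageView`, `countBatch` and `runningCounts` are hypothetical names chosen for illustration, and a stream is modelled as a finite sequence of batches.

```scala
// Plain-Scala sketch of a running count over micro-batches (not Spark code).
object RunningCountSketch {
  // Hypothetical event type: one page view of a URL.
  final case class PageView(url: String)

  // Stateless step: count views per URL inside a single batch.
  def countBatch(batch: Seq[PageView]): Map[String, Int] =
    batch.map(e => (e.url, 1)).groupBy(_._1).map { case (url, ps) => url -> ps.map(_._2).sum }

  // Stateful step: merge each batch's counts into the totals seen so far,
  // producing one cumulative snapshot per batch interval.
  def runningCounts(batches: Seq[Seq[PageView]]): Seq[Map[String, Int]] =
    batches
      .scanLeft(Map.empty[String, Int]) { (totals, batch) =>
        countBatch(batch).foldLeft(totals) { case (acc, (url, n)) =>
          acc.updated(url, acc.getOrElse(url, 0) + n)
        }
      }
      .tail // drop the empty initial state

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq(PageView("/a"), PageView("/b"), PageView("/a")),
      Seq(PageView("/a"))
    )
    runningCounts(batches).foreach(println)
  }
}
```

Each snapshot depends only on the previous snapshot and the current batch, which is the property that lets a running reduce be computed incrementally.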



BDAS: D-Streams Streaming Operators

words = sentences.flatMap(s => s.split(" "))

pairs = words.map(w => (w, 1))

counts = pairs.reduceByKey((a, b) => a + b)
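As a rough illustration of how these stateless operators act on a discretized stream, the sketch below applies the same three steps to each batch independently, using only the Scala standard library. The object and method names are hypothetical, and `groupBy` followed by `reduce` merely stands in for `reduceByKey`; it is not the Spark Streaming API.

```scala
// Plain-Scala sketch: stateless operators applied batch by batch (not Spark code).
object StreamingOperatorsSketch {
  // The three operators from the slide, applied to one batch of sentences.
  def wordCount(sentences: Seq[String]): Map[String, Int] = {
    val words = sentences.flatMap(s => s.split(" ").toSeq).filter(_.nonEmpty)
    val pairs = words.map(w => (w, 1))
    // reduceByKey analogue: group the pairs by word, then reduce the ones.
    pairs.groupBy(_._1).map { case (w, ps) => w -> ps.map(_._2).reduce((a, b) => a + b) }
  }

  // A stream modelled as a sequence of batches; each batch is transformed on its own.
  def wordCountPerBatch(stream: Seq[Seq[String]]): Seq[Map[String, Int]] =
    stream.map(wordCount)

  def main(args: Array[String]): Unit =
    wordCountPerBatch(Seq(Seq("to be or", "not to be"), Seq("be quick"))).foreach(println)
}
```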



BDAS: Use Cases



Real-time Analytics: R over Storm



Real-time Analytics UC 1: Internet Traffic Analysis



Real-time Analysis UC2: Arrhythmia Detection



PMML Primer



Naïve Bayes Primer

Likelihood

Prior

Normalization Constant
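The three labels above name the parts of Bayes' rule. The formula itself did not survive in the transcript, so the following is the standard textbook form rather than a copy of the slide; the symbols $C$ (class) and $x_1,\dots,x_n$ (predictor values) are notation assumed here.

```latex
\underbrace{P(C \mid x_1,\dots,x_n)}_{\text{posterior}}
  = \frac{\overbrace{P(x_1,\dots,x_n \mid C)}^{\text{likelihood}}\;
          \overbrace{P(C)}^{\text{prior}}}
         {\underbrace{P(x_1,\dots,x_n)}_{\text{normalization constant}}},
\qquad
P(x_1,\dots,x_n \mid C) \approx \prod_{i=1}^{n} P(x_i \mid C)
```

The second expression is the "naïve" assumption: predictors are treated as conditionally independent given the class, so the likelihood factors into per-predictor terms.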



PMML Scoring for Naïve Bayes



PMML Scoring for Naïve Bayes

<DataDictionary numberOfFields="4">
  <DataField name="Class" optype="categorical" dataType="string">
    <Value value="democrat"/>
    <Value value="republican"/>
  </DataField>
  <DataField name="V1" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V2" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V3" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
</DataDictionary>

(ctd on the next slide)



PMML Scoring for Naïve Bayes

<NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">

<MiningSchema>

<MiningField name="Class" usageType="predicted"/>

<MiningField name="V1" usageType="active"/>

<MiningField name="V2" usageType="active"/>

<MiningField name="V3" usageType="active"/>

</MiningSchema>

<Output>

<OutputField name="Predicted_Class" feature="predictedValue"/>

<OutputField name="Probability_democrat" optype="continuous" dataType="double" feature="probability" value="democrat"/>

<OutputField name="Probability_republican" optype="continuous" dataType="double" feature="probability" value="republican"/>

</Output>

<BayesInputs>

(ctd on the next page)



PMML Scoring for Naïve Bayes

<BayesInputs>

<BayesInput fieldName="V1">

<PairCounts value="n">

<TargetValueCounts>

<TargetValueCount value="democrat" count="51"/>

<TargetValueCount value="republican" count="85"/>

</TargetValueCounts>

</PairCounts>

<PairCounts value="y">

<TargetValueCounts>

<TargetValueCount value="democrat" count="73"/>

<TargetValueCount value="republican" count="23"/>

</TargetValueCounts>

</PairCounts>

</BayesInput>

<BayesInput fieldName="V2">

*

<BayesInput fieldName="V3">

*

</BayesInputs>

<BayesOutput fieldName="Class">

<TargetValueCounts>

<TargetValueCount value="democrat" count="124"/>

<TargetValueCount value="republican" count="108"/>

</TargetValueCounts>

</BayesOutput>



PMML Scoring for Naïve Bayes

Definition of Elements:

DataDictionary: definitions for fields as used in mining models (Class, V1, V2, V3)

NaiveBayesModel: indicates that this is a Naïve Bayes PMML model

MiningSchema: lists the fields used in that model. Class is the "predicted" field; V1, V2 and V3 are "active" predictor fields

Output: describes a set of result values that can be returned from a model



PMML Scoring for Naïve Bayes

Definition of Elements (ctd.):

BayesInputs: for each input field and each of its values, contains the counts of the target (output) values

BayesOutput: contains the counts associated with the values of the target field
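To make the role of these counts concrete, here is a minimal Scala sketch that turns the V1 and Class counts from the excerpt above into class probabilities. It uses the textbook naïve Bayes estimate and does not claim to reproduce the PMML specification's exact scoring rules. Only V1 is included because the V2 and V3 counts are elided in the slides. The object name, the `score` and `predict` helpers, and the choice to substitute the model's `threshold` (0.003) for a zero or absent count are illustrative assumptions.

```scala
// Textbook naive Bayes scoring from count tables (a sketch, not a PMML engine).
object NaiveBayesScoringSketch {
  // BayesOutput: counts per target value, copied from the excerpt.
  val classCounts: Map[String, Double] =
    Map("democrat" -> 124.0, "republican" -> 108.0)

  // BayesInput/PairCounts: pairCounts(field)(inputValue)(targetValue).
  // Only V1 appears here; V2 and V3 are elided in the source.
  val pairCounts: Map[String, Map[String, Map[String, Double]]] = Map(
    "V1" -> Map(
      "n" -> Map("democrat" -> 51.0, "republican" -> 85.0),
      "y" -> Map("democrat" -> 73.0, "republican" -> 23.0)
    )
  )

  // Value of the NaiveBayesModel threshold attribute.
  val threshold = 0.003

  // Returns normalized class probabilities for one record (field -> value).
  def score(record: Map[String, String]): Map[String, Double] = {
    val unnormalized = classCounts.map { case (target, classCount) =>
      val conditionals = record.collect {
        case (field, value) if pairCounts.contains(field) =>
          val count = pairCounts(field).get(value).flatMap(_.get(target)).getOrElse(0.0)
          // Assumption: fall back to the threshold when no count is available.
          if (count == 0.0) threshold else count / classCount
      }
      // classCount plays the role of the prior (up to a common factor).
      target -> classCount * conditionals.product
    }
    val total = unnormalized.values.sum // normalization constant
    unnormalized.map { case (t, v) => t -> v / total }
  }

  def predict(record: Map[String, String]): String =
    score(record).maxBy(_._2)._1

  def main(args: Array[String]): Unit = {
    println(score(Map("V1" -> "y")))
    println(predict(Map("V1" -> "n")))
  }
}
```

With V1 = "y" the unnormalized scores reduce to 73 and 23, so the sketch assigns roughly 0.76 to democrat; with V1 = "n" it favours republican (85 against 51).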



PMML Scoring for Naïve Bayes

Sample Input

Eg1 - n y y n y y n n n n n n y y y y

Eg2 - n y n y y y n n n n n y y y n y

  • 1st, 2nd and 3rd columns: predictor variables (attribute "name" in element MiningField)

  • Using these, we predict whether the output is Democrat or Republican (PMML element BayesOutput)



PMML Scoring for Naïve Bayes

  • 3-node Storm cluster of Xeon machines (8 quad-core CPUs, 32 GB RAM, 32 GB swap space; 1 Nimbus, 2 Supervisors)



PMML Scoring for Naïve Bayes

  • 3-node Spark cluster of Xeon machines (8 quad-core CPUs, 32 GB RAM and 32 GB swap space)



Thank You!



Back up slides



Representation of an RDD



Logistic Regression: Spark VS Hadoop

http://spark-project.org



Some Spark(ling) examples

Scala code (serial)

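// Count random points that land inside the unit circle
var count = 0
for (i <- 1 to 100000) {
  // Draw a point uniformly from the square [-1, 1] x [-1, 1]
  val x = math.random() * 2 - 1
  val y = math.random() * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count / 100000.0)

Sample random points in the square [-1, 1] x [-1, 1] and count how many fall inside the unit circle; that fraction is roughly PI/4, which gives an approximate value for PI.

This is based on PS/PC = AS/AC = 4/PI, where PS and PC are the numbers of points in the square and in the circle and AS and AC are their areas, so PI = 4 * (PC/PS).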



Some Spark(ling) examples

Spark code (parallel)

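// Early Spark API, as shown on the slide; <Mesos master> is a placeholder
val spark = new SparkContext(<Mesos master>)
// Shared counter that tasks can only add to
var count = spark.accumulator(0)
for (i <- spark.parallelize(1 to 100000, 12)) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
// Read the accumulated total back on the driver
println("Pi is roughly " + 4 * count.value / 100000.0)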

Notable points:

  • A Spark context is created; it talks to the Mesos (1) master.

  • count becomes a shared variable: an accumulator.

  • The for loop iterates over an RDD: parallelize breaks the Scala range object (1 to 100000) into 12 slices.

  • The for loop over that RDD is translated into a call to the RDD's foreach method.

(1) Mesos is an Apache-incubated clustering system: http://mesosproject.org



Logistic Regression in Spark: Serial Code

// parsePoint, Vector, D and ITERATIONS are defined elsewhere (not shown on the slide)

// Read data file and convert it into Point objects
val lines = scala.io.Source.fromFile("data.txt").getLines()
// Materialize the points: an Iterator could only be traversed once
val points = lines.map(x => parsePoint(x)).toList

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = Vector.zeros(D)
  for (p <- points) {
    val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
    gradient += scale * p.x
  }
  w -= gradient
}
println("Result: " + w)
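The serial listing above depends on helpers that the slides do not show. For readers who want to run the same update rule, the following is a self-contained sketch that replaces the `Vector` type with `Array[Double]`. The `Point` case class, the `dot`, `gradient` and `train` helpers, and the caller-supplied starting weights (instead of a random start) are assumptions made for reproducibility, not part of the original code.

```scala
// Self-contained sketch of the slide's gradient update using plain arrays.
object LogisticRegressionSketch {
  // Hypothetical stand-in for the slide's Point: features x, label y in {-1, +1}.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Sum of per-point gradients, with the same scale factor as on the slide.
  def gradient(w: Array[Double], points: Seq[Point]): Array[Double] = {
    val g = Array.fill(w.length)(0.0)
    for (p <- points) {
      val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
      for (j <- g.indices) g(j) += scale * p.x(j)
    }
    g
  }

  // Repeats w -= gradient for a fixed number of iterations (step size 1, as on the slide).
  def train(points: Seq[Point], iterations: Int, w0: Array[Double]): Array[Double] = {
    var w = w0.clone()
    for (_ <- 1 to iterations) {
      val g = gradient(w, points)
      w = w.zip(g).map { case (wi, gi) => wi - gi }
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val points = Seq(Point(Array(1.0), 1.0), Point(Array(-1.0), -1.0))
    println("Result: " + train(points, 10, Array(0.0)).mkString("[", ", ", "]"))
  }
}
```

Keeping `gradient` as a pure function of `w` and the points mirrors the structure that the Spark version parallelizes: each point contributes an independent term that is then summed.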



Logistic Regression in Spark

// Read data file and transform it into Point objects
val spark = new SparkContext(<Mesos master>)
val lines = spark.hdfsTextFile("hdfs://.../data.txt")
val points = lines.map(x => parsePoint(x)).cache()

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = spark.accumulator(Vector.zeros(D))
  for (p <- points) {
    val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Result: " + w)

