Real-time PMML Scoring over Spark Streaming and Storm



Dr. Vijay Srinivas Agneeswaran,

Director and Head, Big-data R&D,

Innovation Labs, Impetus


Contents


Big Data Computations

[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.

[2] Richter, Yossi; Yom-Tov, Elad; Slonim, Noam. Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.


Berkeley Big-data Analytics Stack (BDAS)


BDAS: Spark

[MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.


BDAS: Discretized Streams

pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)
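The page-view snippet above keeps a running count per URL across one-second batches. A minimal sketch of that idea in plain Scala collections follows; it is not the D-Streams API. `RunningCount`, `PageView`, `countBatch`, `merge`, and `runningCounts` are hypothetical names introduced here, and each inner `Seq` is assumed to stand for one batch interval.

```scala
// Plain-Scala sketch of a running per-URL count over micro-batches.
// No Spark dependency; all names here are hypothetical, for illustration only.
object RunningCount {
  final case class PageView(url: String)

  // Per-batch step: map each event to (url, 1), then sum the ones per URL.
  def countBatch(batch: Seq[PageView]): Map[String, Int] =
    batch
      .map(event => (event.url, 1))
      .groupBy(_._1)
      .map { case (url, ones) => (url, ones.map(_._2).sum) }

  // Merge two count maps by adding the values of matching keys.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (url, n)) =>
      acc.updated(url, acc.getOrElse(url, 0) + n)
    }

  // Running reduce: one cumulative count map per batch interval.
  def runningCounts(batches: Seq[Seq[PageView]]): Seq[Map[String, Int]] =
    batches.map(countBatch).scanLeft(Map.empty[String, Int])(merge).tail
}
```

Calling `RunningCount.runningCounts` on two batches returns two maps, the second of which includes everything counted in the first.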


BDAS: D-Streams Streaming Operators

words = sentences.flatMap(s => s.split(" "))
pairs = words.map(w => (w, 1))
counts = pairs.reduceByKey((a, b) => a + b)


BDAS: Use Cases


Real-time Analytics: R over Storm


Real-time Analytics UC 1: Internet Traffic Analysis


Real-time Analysis UC2: Arrhythmia Detection


PMML Primer


Naïve Bayes Primer

Likelihood

Prior

Normalization Constant
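The three labels above name the parts of Bayes' rule. Written out for a class C and observed attribute values x_1, ..., x_n, under the naive assumption that the attributes are conditionally independent given the class, it can be sketched as follows (the symbols are chosen here for illustration and do not come from the slide):

```latex
P(C \mid x_1, \dots, x_n)
  = \frac{\overbrace{\prod_{i=1}^{n} P(x_i \mid C)}^{\text{likelihood}}
          \;\cdot\; \overbrace{P(C)}^{\text{prior}}}
         {\underbrace{P(x_1, \dots, x_n)}_{\text{normalization constant}}}
```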


PMML Scoring for Naïve Bayes


PMML Scoring for Naïve Bayes

<DataDictionary numberOfFields="4">
  <DataField name="Class" optype="categorical" dataType="string">
    <Value value="democrat"/>
    <Value value="republican"/>
  </DataField>
  <DataField name="V1" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V2" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V3" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
</DataDictionary>

(ctd on the next slide)


PMML Scoring for Naïve Bayes

<NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">
  <MiningSchema>
    <MiningField name="Class" usageType="predicted"/>
    <MiningField name="V1" usageType="active"/>
    <MiningField name="V2" usageType="active"/>
    <MiningField name="V3" usageType="active"/>
  </MiningSchema>
  <Output>
    <OutputField name="Predicted_Class" feature="predictedValue"/>
    <OutputField name="Probability_democrat" optype="continuous" dataType="double" feature="probability" value="democrat"/>
    <OutputField name="Probability_republican" optype="continuous" dataType="double" feature="probability" value="republican"/>
  </Output>
  <BayesInputs>

(ctd on the next slide)


PMML Scoring for Naïve Bayes

  <BayesInputs>
    <BayesInput fieldName="V1">
      <PairCounts value="n">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="51"/>
          <TargetValueCount value="republican" count="85"/>
        </TargetValueCounts>
      </PairCounts>
      <PairCounts value="y">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="73"/>
          <TargetValueCount value="republican" count="23"/>
        </TargetValueCounts>
      </PairCounts>
    </BayesInput>
    <BayesInput fieldName="V2">
      *
    <BayesInput fieldName="V3">
      *
  </BayesInputs>
  <BayesOutput fieldName="Class">
    <TargetValueCounts>
      <TargetValueCount value="democrat" count="124"/>
      <TargetValueCount value="republican" count="108"/>
    </TargetValueCounts>
  </BayesOutput>


PMML Scoring for Naïve Bayes

Definition of elements:

DataDictionary: definitions of the fields used in mining models (Class, V1, V2, V3).

NaiveBayesModel: indicates that this is a Naïve Bayes PMML model.

MiningSchema: lists the fields used in this model. Class is the “predicted” field; V1, V2 and V3 are “active” predictor fields.

Output: describes the set of result values that the model can return.


PMML Scoring for Naïve Bayes

Definition of elements (ctd.):

BayesInputs: for each input field and each value of that field, contains the counts of each target value.

BayesOutput: contains the counts associated with the values of the target field.
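To make the role of these counts concrete, here is a minimal Scala sketch of scoring from the BayesInputs and BayesOutput counts shown in the PMML above. It is an illustration under stated assumptions, not a PMML evaluator and not the authors' implementation. Only V1 is included because the slides elide the counts for V2 and V3. The `threshold` value is taken from the model element and is assumed to replace a conditional probability whose count is zero. `NaiveBayesSketch`, `conditional`, and `score` are hypothetical names.

```scala
// Sketch of Naive Bayes scoring from PMML-style counts.
// All object and method names are hypothetical, for illustration only.
object NaiveBayesSketch {
  // BayesOutput: count of each target value.
  val targetCounts: Map[String, Double] =
    Map("democrat" -> 124.0, "republican" -> 108.0)

  // BayesInputs: field -> input value -> target value -> count.
  // Only V1 is filled in; the slides elide V2 and V3.
  val pairCounts: Map[String, Map[String, Map[String, Double]]] = Map(
    "V1" -> Map(
      "n" -> Map("democrat" -> 51.0, "republican" -> 85.0),
      "y" -> Map("democrat" -> 73.0, "republican" -> 23.0)
    )
  )

  // Assumption: used in place of a conditional probability whose count is zero.
  val threshold = 0.003

  // Estimate P(field = value | target) from the counts.
  def conditional(field: String, value: String, target: String): Double = {
    val byValue = pairCounts(field)
    val total = byValue.values.map(_.getOrElse(target, 0.0)).sum
    val count = byValue.get(value).flatMap(_.get(target)).getOrElse(0.0)
    if (count == 0.0) threshold else count / total
  }

  // Target count times the conditionals of the known fields, then normalized.
  def score(record: Map[String, String]): Map[String, Double] = {
    val unnormalized = targetCounts.map { case (target, n) =>
      val likelihood = record.collect {
        case (field, value) if pairCounts.contains(field) =>
          conditional(field, value, target)
      }.product
      target -> n * likelihood
    }
    val z = unnormalized.values.sum
    unnormalized.map { case (target, l) => target -> l / z }
  }
}
```

With only V1 known, `NaiveBayesSketch.score(Map("V1" -> "y"))` reduces to 73/96 for democrat and 23/96 for republican, because the target counts cancel against the per-target totals.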


PMML Scoring for Naïve Bayes

Sample Input

Eg1 - n y y n y y n n n n n n y y y y

Eg2 - n y n y y y n n n n n y y y n y

  • 1st, 2nd and 3rd columns: predictor variables (attribute “name” in element MiningField)

  • Using these, we predict whether the output is democrat or republican (PMML element BayesOutput)


PMML Scoring for Naïve Bayes

  • 3-node Storm cluster of Xeon machines (8 quad-core CPUs, 32 GB RAM, 32 GB swap space; 1 Nimbus, 2 Supervisors)


PMML Scoring for Naïve Bayes

  • 3-node Spark cluster of Xeon machines (8 quad-core CPUs, 32 GB RAM, 32 GB swap space)


Thank You!


Back up slides


Representation of an RDD


Logistic Regression: Spark VS Hadoop

http://spark-project.org


Some Spark(ling) examples

Scala code (serial)

var count = 0
for (i <- 1 to 100000) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count / 100000.0)

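Sample random points in the square [-1, 1] x [-1, 1] and count how many fall inside the unit circle. That fraction is roughly PI/4, which gives an approximate value for PI.

This is based on PS/PC = AS/AC = 4/PI, where PS and PC are the numbers of points in the square and in the circle, and AS and AC are the areas of the square and the circle. So PI = 4 * (PC/PS).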


Some Spark(ling) examples

Spark code (parallel)

val spark = new SparkContext(<Mesos master>)
var count = spark.accumulator(0)
for (i <- spark.parallelize(1 to 100000, 12)) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count.value / 100000.0)

Notable points:

  • A Spark context is created; it talks to the Mesos¹ master.

  • count becomes a shared variable, an accumulator.

  • spark.parallelize breaks the Scala range object (1 to 100000) into 12 slices and returns an RDD.

  • The for loop over that RDD invokes the RDD's foreach method.

¹ Mesos is an Apache-incubated cluster manager: http://mesosproject.org


Logistic Regression in Spark: Serial Code

// Read data file and convert it into Point objects
// (toList, so the points can be traversed again on every iteration)
val lines = scala.io.Source.fromFile("data.txt").getLines().toList
val points = lines.map(x => parsePoint(x))

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = Vector.zeros(D)
  for (p <- points) {
    val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
    gradient += scale * p.x
  }
  w -= gradient
}
println("Result: " + w)
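The serial listing above relies on helpers the slide does not define (`parsePoint`, `D`, `ITERATIONS`, and a `Vector` type with `dot`). A self-contained sketch of the same update rule with plain arrays could look like the following. `LogisticRegressionSketch`, `Point`, `dot`, and `train` are hypothetical stand-ins. Labels are assumed to be +1.0 or -1.0. The weights start at zero rather than at random values so that runs are repeatable, and the four sample points are made up for illustration.

```scala
// Self-contained sketch of the serial logistic regression loop using plain arrays.
// All names here are hypothetical stand-ins for helpers the slide assumes.
object LogisticRegressionSketch {
  // x holds the features; y is the label, assumed to be +1.0 or -1.0.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Batch gradient descent with the same update as the slide (step size 1).
  def train(points: Seq[Point], dims: Int, iterations: Int): Array[Double] = {
    val w = Array.fill(dims)(0.0) // zeros instead of random start, for repeatability
    for (_ <- 1 to iterations) {
      val gradient = Array.fill(dims)(0.0)
      for (p <- points) {
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        for (j <- 0 until dims) gradient(j) += scale * p.x(j)
      }
      for (j <- 0 until dims) w(j) -= gradient(j)
    }
    w
  }

  def main(args: Array[String]): Unit = {
    // Made-up, linearly separable sample data.
    val points = Seq(
      Point(Array(1.0, 0.2), 1.0),
      Point(Array(0.8, -0.3), 1.0),
      Point(Array(-1.0, 0.1), -1.0),
      Point(Array(-0.7, -0.4), -1.0)
    )
    println("Result: " + train(points, dims = 2, iterations = 10).mkString(", "))
  }
}
```

Because the step size is fixed at 1, as in the slide, this sketch is only expected to behave well on small, well-scaled inputs like the sample above; a learning-rate parameter would be a natural addition for real data.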


Logistic Regression in Spark

// Read data file and transform it into Point objects
val spark = new SparkContext(<Mesos master>)
val lines = spark.hdfsTextFile("hdfs://.../data.txt")
val points = lines.map(x => parsePoint(x)).cache()

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = spark.accumulator(Vector.zeros(D))
  for (p <- points) {
    val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Result: " + w)

