Real-time PMML Scoring over Spark Streaming and Storm

Real-time PMML Scoring over Spark Streaming and Storm

Dr. Vijay Srinivas Agneeswaran,

Director and Head, Big-data R&D,

Innovation Labs, Impetus



Big Data Computations

[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.

[2] Richter, Yossi; Yom-Tov, Elad; Slonim, Noam: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.



BDAS: Spark

[MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.


BDAS: Discretized Streams

pageViews = readStream("http://...", "1s")

ones = pageViews.map(event => (event.url, 1))

counts = ones.runningReduce((a, b) => a + b)


BDAS: D-Streams Streaming Operators

words = sentences.flatMap(s => s.split(" "))

pairs = words.map(w => (w, 1))

counts = pairs.reduceByKey((a, b) => a + b)
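The same three operators can be tried locally with plain Scala collections – a minimal sketch, no Spark cluster needed (groupBy plus sum stands in for reduceByKey, and the sample sentences are made up for illustration):

```scala
object WordCountSketch {
  // Word count with the same operator chain as the D-Streams slide:
  // flatMap to split sentences, map to (word, 1) pairs, then a
  // local stand-in for reduceByKey (groupBy + per-group sum).
  def wordCounts(sentences: Seq[String]): Map[String, Int] = {
    val words = sentences.flatMap(s => s.split(" "))
    val pairs = words.map(w => (w, 1))
    pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }
  }

  def main(args: Array[String]): Unit =
    wordCounts(Seq("spark streaming on spark", "storm and spark"))
      .toSeq.sorted.foreach(println)
}
```

On a real D-Stream the reduceByKey step runs per batch across the cluster; the shape of the computation is the same.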







Naïve Bayes Primer

Bayes' rule: Posterior = Likelihood × Prior / Normalization Constant

i.e. P(Class | X) = P(X | Class) · P(Class) / P(X)
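As a minimal numeric sketch of the rule above (plain Scala, with made-up likelihoods and priors):

```scala
object BayesPrimer {
  // posterior(c) = likelihood(x | c) * prior(c) / normalizer,
  // where the normalizer P(x) is the sum of likelihood * prior
  // over all classes.
  def posterior(likelihoods: Map[String, Double],
                priors: Map[String, Double]): Map[String, Double] = {
    val joint = likelihoods.map { case (c, l) => (c, l * priors(c)) }
    val z = joint.values.sum // normalization constant P(x)
    joint.map { case (c, j) => (c, j / z) }
  }
}
```

The normalizer only rescales the scores, so for picking the most probable class it can be skipped – which is what a Naïve Bayes scorer typically does.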


PMML Scoring for Naïve Bayes


PMML Scoring for Naïve Bayes

<DataDictionary numberOfFields="4">

<DataField name="Class" optype="categorical" dataType="string">

<Value value="democrat"/>

<Value value="republican"/>

</DataField>

<DataField name="V1" optype="categorical" dataType="string">

<Value value="n"/>

<Value value="y"/>

</DataField>

<DataField name="V2" optype="categorical" dataType="string">

<Value value="n"/>

<Value value="y"/>

</DataField>

<DataField name="V3" optype="categorical" dataType="string">

<Value value="n"/>

<Value value="y"/>

</DataField>

</DataDictionary>

(ctd on the next slide)


PMML Scoring for Naïve Bayes

<NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">

<MiningSchema>

<MiningField name="Class" usageType="predicted"/>

<MiningField name="V1" usageType="active"/>

<MiningField name="V2" usageType="active"/>

<MiningField name="V3" usageType="active"/>

</MiningSchema>

<Output>

<OutputField name="Predicted_Class" feature="predictedValue"/>

<OutputField name="Probability_democrat" optype="continuous" dataType="double" feature="probability" value="democrat"/>

<OutputField name="Probability_republican" optype="continuous" dataType="double" feature="probability" value="republican"/>

</Output>

<BayesInputs>

(ctd on the next slide)


PMML Scoring for Naïve Bayes

<BayesInputs>

<BayesInput fieldName="V1">

<PairCounts value="n">

<TargetValueCounts>

<TargetValueCount value="democrat" count="51"/>

<TargetValueCount value="republican" count="85"/>

</TargetValueCounts>

</PairCounts>

<PairCounts value="y">

<TargetValueCounts>

<TargetValueCount value="democrat" count="73"/>

<TargetValueCount value="republican" count="23"/>

</TargetValueCounts>

</PairCounts>

</BayesInput>

<BayesInput fieldName="V2">

*

<BayesInput fieldName="V3">

*

</BayesInputs>

<BayesOutput fieldName="Class">

<TargetValueCounts>

<TargetValueCount value="democrat" count="124"/>

<TargetValueCount value="republican" count="108"/>

</TargetValueCounts>

</BayesOutput>


PMML Scoring for Naïve Bayes

Definition of Elements:

DataDictionary:

Definitions for the fields used in the mining model

(Class, V1, V2, V3)

NaiveBayesModel:

Indicates that this is a Naïve Bayes PMML model

MiningSchema: lists the fields used in the model.

Class is the “predicted” field;

V1, V2, V3 are “active” predictor fields

Output:

Describes the set of result values that can be returned

from the model


PMML Scoring for Naïve Bayes

Definition of Elements (ctd.):

BayesInputs:

For each input field, contains the counts of the

target-field values

BayesOutput:

Contains the counts associated with the values of the target field


PMML Scoring for Naïve Bayes

Sample Input

Eg1 - n y y n y y n n n n n n y y y y

Eg2 - n y n y y y n n n n n y y y n y

  • 1st, 2nd and 3rd columns:

    Predictor variables (attribute “name” in element MiningField)

  • Using these we predict whether the output is democrat or republican (PMML element BayesOutput)
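Using only the counts that actually appear in the PMML above (class counts 124/108 and the V1 PairCounts; the V2/V3 counts are elided on the slides), a hand-rolled scoring sketch in plain Scala might look like:

```scala
object NaiveBayesScore {
  // Counts copied from the PMML on the previous slides.
  val classCounts = Map("democrat" -> 124.0, "republican" -> 108.0)
  val v1Counts = Map( // (V1 value, class) -> count
    ("n", "democrat") -> 51.0, ("n", "republican") -> 85.0,
    ("y", "democrat") -> 73.0, ("y", "republican") -> 23.0)

  // Score a single predictor V1: for each class, compute
  // likelihood * prior and pick the class with the larger score.
  // (The normalizer P(V1) is the same for both classes, so it is skipped.)
  def predictFromV1(v1: String): String = {
    val total = classCounts.values.sum
    val scores = classCounts.map { case (c, n) =>
      (c, (v1Counts((v1, c)) / n) * (n / total)) // likelihood * prior
    }
    scores.maxBy(_._2)._1
  }
}
```

A full scorer would multiply in one likelihood factor per active field (V1, V2 and V3), exactly as a PMML engine does when it reads the BayesInputs element.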


PMML Scoring for Naïve Bayes

  • 3-node Xeon machines Storm cluster (8 quad-core CPUs, 32 GB RAM, 32 GB swap space, 1 Nimbus, 2 Supervisors)


PMML Scoring for Naïve Bayes

  • 3-node Xeon machines Spark cluster (8 quad-core CPUs, 32 GB RAM and 32 GB swap space)





Logistic Regression: Spark VS Hadoop

http://spark-project.org


Some Spark(ling) examples

Scala code (serial)

var count = 0

for (i <- 1 to 100000) {

val x = Math.random * 2 - 1

val y = Math.random * 2 - 1

if (x*x + y*y < 1) count += 1

}

println("Pi is roughly " + 4 * count / 100000.0)

Sample random points in the square [-1, 1] × [-1, 1] and count how many fall inside the unit circle (roughly a fraction π/4 of them). Hence you get an approximate value for π.

Based on area(circle) / area(square) = π/4, so π = 4 × (points in circle / points sampled).


Some Spark(ling) examples

Spark code (parallel)

val spark = new SparkContext(<Mesos master>)

var count = spark.accumulator(0)

for (i <- spark.parallelize(1 to 100000, 12)) {

val x = Math.random * 2 - 1

val y = Math.random * 2 - 1

if (x*x + y*y < 1) count += 1

}

println("Pi is roughly " + 4 * count.value / 100000.0)

Notable points:

  • Spark context created – talks to Mesos1 master.

  • count becomes a shared variable – an accumulator.

  • The for loop runs over an RDD – parallelize breaks the Scala range object (1 to 100000) into 12 slices.

  • The for syntax invokes the foreach method of the RDD.

1Mesos is an Apache incubated clustering system – http://mesosproject.org


Logistic Regression in Spark: Serial Code

// Read data file and convert it into Point objects

val lines = scala.io.Source.fromFile("data.txt").getLines()

val points = lines.map(x => parsePoint(x))

// Run logistic regression

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {

val gradient = Vector.zeros(D)

for (p <- points) {

val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y

gradient += scale * p.x

}

w -= gradient

}

println("Result: " + w)
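The slide's code leaves parsePoint, Vector, D and ITERATIONS undefined. A self-contained plain-Scala sketch of the same gradient loop (arrays instead of the Vector class, and zero initialization instead of random, purely for repeatability) could be:

```scala
object LogisticRegressionSketch {
  // A labeled point: feature vector x and label y in {-1, +1},
  // matching the p.x / p.y fields used on the slide.
  case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Batch gradient descent for logistic regression, the same update
  // as the slide: scale = (1/(1+exp(-y*(w.x))) - 1) * y, then
  // gradient += scale * x and w -= gradient.
  def train(points: Seq[Point], d: Int, iterations: Int): Array[Double] = {
    val w = Array.fill(d)(0.0)
    for (_ <- 1 to iterations) {
      val gradient = Array.fill(d)(0.0)
      for (p <- points) {
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        for (j <- 0 until d) gradient(j) += scale * p.x(j)
      }
      for (j <- 0 until d) w(j) -= gradient(j)
    }
    w
  }
}
```

The distributed version on the next slide keeps exactly this loop, but caches the points as an RDD and accumulates the gradient across the cluster.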


Logistic Regression in Spark

// Read data file and transform it into Point objects

val spark = new SparkContext(<Mesos master>)

val lines = spark.hdfsTextFile("hdfs://.../data.txt")

val points = lines.map(x => parsePoint(x)).cache()

// Run logistic regression

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {

val gradient = spark.accumulator(Vector.zeros(D))

for (p <- points) {

val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y

gradient += scale * p.x

}

w -= gradient.value

}

println("Result: " + w)

