

A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs

Ph.D. Showcase, Dept. of Computer Science

Sasi Kumar Pitchaimalai

Ph.D. Candidate

Database Systems Group, Department of Computer Science

University of Houston

Advisor: Dr. Carlos Ordonez


Motivation

  • Naïve Bayes Classifier (NB)

    • One of the most popular and important classifiers in Machine Learning

    • Robust, Powerful, Fast to Compute And Easy to Understand

  • Programming Inside A DBMS

    • SQL can easily handle complex computations

    • UDFs can use arrays and are processed in memory


Data Mining Inside A DBMS

  • Avoids exporting the data outside the DBMS

    • Major overhead

    • Data security

  • Scales linearly with large data sets

    • Exploit parallelism provided by a DBMS

    • Use optimized queries with simple database operations

  • Objective: Push computations involving large data sets inside the DBMS


Bayesian Classifier Based On K-Means (BKM)

  • A Generalization of Naïve Bayes (NB)

  • The Algorithm

    • Initialization: Randomly initialize k clusters per class from the data set.

    • E-Step: Compute Euclidean distance, find nearest cluster and then compute sufficient statistics.

    • M-Step: Re-compute cluster centers and radii. Check Convergence.

  • The E-Step and M-Step are repeated until the model converges, i.e., the clusters no longer move
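The three steps above can be sketched in plain Python. This is a minimal illustration, not the SQL/UDF implementation the talk describes. The function names, the use of per-dimension variance as the cluster "radius", and the handling of empty clusters are all assumptions made for the example.

```python
import random
from collections import defaultdict

def nearest_cluster(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    dists = [sum((x - c) ** 2 for x, c in zip(point, center)) for center in centers]
    return dists.index(min(dists))

def kmeans_for_class(points, k, max_iter=100, seed=0):
    """Cluster the points of ONE class into k clusters (hypothetical helper)."""
    rng = random.Random(seed)
    d = len(points[0])
    # Initialization: pick k random points of this class as starting centers.
    centers = [[float(x) for x in p] for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # E-step: find the nearest cluster of each point and accumulate the
        # sufficient statistics N (count), L (linear sum), Q (sum of squares).
        N = [0] * k
        L = [[0.0] * d for _ in range(k)]
        Q = [[0.0] * d for _ in range(k)]
        for p in points:
            j = nearest_cluster(p, centers)
            N[j] += 1
            for a in range(d):
                L[j][a] += p[a]
                Q[j][a] += p[a] * p[a]
        # M-step: recompute centers and radii from N, L, Q.
        new_centers, radii = [], []
        for j in range(k):
            if N[j] == 0:  # assumption: an empty cluster keeps its old center
                new_centers.append(centers[j])
                radii.append([0.0] * d)
                continue
            c = [L[j][a] / N[j] for a in range(d)]
            new_centers.append(c)
            # Assumption: radius = per-dimension variance, Q/N - C^2.
            radii.append([max(Q[j][a] / N[j] - c[a] ** 2, 0.0) for a in range(d)])
        converged = new_centers == centers  # clusters did not move
        centers = new_centers
        if converged:
            break
    weights = [n / len(points) for n in N]
    return {"centers": centers, "radii": radii, "weights": weights}

def train_bkm(X, y, k):
    """Run k-means separately inside each class (class decomposition)."""
    by_class = defaultdict(list)
    for point, label in zip(X, y):
        by_class[label].append(point)
    return {label: kmeans_for_class(pts, k) for label, pts in by_class.items()}

X = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 5), (6, 6)]
y = ["a", "a", "a", "a", "b", "b"]
model = train_bkm(X, y, k=2)
print(sorted(model["a"]["centers"]))
```

With k = 1 each class is summarized by a single mean and variance per dimension, which is the sense in which the per-class clustering generalizes Naïve Bayes.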


BKM: Finding the clusters per class


Database Optimizations

  • Five different query optimization techniques for distance computation were introduced.

  • User Defined Functions (UDFs) – Computing distance and nearest cluster in a single UDF.

  • Using a CASE statement instead of aggregations.

  • Sufficient Statistics of the clusters were computed in a single table scan.
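A rough illustration of the last three points, using SQLite through Python's standard `sqlite3` module as a stand-in for the DBMS. The table `X`, its columns, the fixed two-cluster centers, and the Python scalar UDF `nearest_cluster` are all assumptions for this sketch. The slides do not show the actual schema, queries, or UDF code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
conn.executemany(
    "INSERT INTO X (i, x1, x2) VALUES (?, ?, ?)",
    [(1, 0, 0), (2, 0, 1), (3, 10, 10), (4, 10, 11)],
)

# Hypothetical current cluster centers (k = 2, d = 2).
CENTERS = [(0.0, 0.0), (10.0, 10.0)]

def nearest_cluster(x1, x2):
    """Scalar UDF: compute all k distances and return the nearest cluster (1-based)."""
    dists = [(x1 - c1) ** 2 + (x2 - c2) ** 2 for c1, c2 in CENTERS]
    return dists.index(min(dists)) + 1

conn.create_function("nearest_cluster", 2, nearest_cluster)

# Variant 1: distance and nearest cluster in a single UDF call per row;
# the sufficient statistics N, L, Q of every cluster come from one pass over X.
udf_stats = conn.execute("""
    SELECT nearest_cluster(x1, x2) AS j,
           COUNT(*)     AS N,
           SUM(x1)      AS L1, SUM(x2)      AS L2,
           SUM(x1 * x1) AS Q1, SUM(x2 * x2) AS Q2
    FROM X
    GROUP BY j
    ORDER BY j
""").fetchall()

# Variant 2: pure SQL. Distances are computed side by side in one row and a
# CASE expression picks the nearest cluster, so no MIN() aggregation is needed.
case_stats = conn.execute("""
    SELECT CASE WHEN d1 <= d2 THEN 1 ELSE 2 END AS j,
           COUNT(*), SUM(x1), SUM(x2), SUM(x1 * x1), SUM(x2 * x2)
    FROM (SELECT x1, x2,
                 (x1 - 0.0) * (x1 - 0.0) + (x2 - 0.0) * (x2 - 0.0)     AS d1,
                 (x1 - 10.0) * (x1 - 10.0) + (x2 - 10.0) * (x2 - 10.0) AS d2
          FROM X)
    GROUP BY j
    ORDER BY j
""").fetchall()

for row in udf_stats:
    print(row)
```

Both queries return one row per cluster with its count, linear sums, and sums of squares, which is all the M-step needs to recompute centers and radii.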


Comparing Accuracy – NB Vs BKM Vs DT

  • Global Accuracy: BKM is better than NB and worse than DT (Decision Tree) in most cases

  • Class Breakdown Accuracy:

    • BKM is better than NB in all but 2 cases, showing that class decomposition is a positive step towards increasing NB accuracy. DT performs poorly here, and much worse in the case of bscale.
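The slide does not define the two metrics. The sketch below assumes that global accuracy is the fraction of all points classified correctly and that class breakdown accuracy is the same fraction computed separately for each true class. The function names and the toy labels are made up for illustration and are not results from the talk.

```python
from collections import Counter

def global_accuracy(y_true, y_pred):
    """Fraction of all points whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def class_breakdown_accuracy(y_true, y_pred):
    """Per-class fraction of correctly classified points (assumed definition)."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / total[c] for c in total}

# Toy example: a classifier that always predicts the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
print(global_accuracy(y_true, y_pred))
print(class_breakdown_accuracy(y_true, y_pred))
```

In this toy case the global figure looks good while the minority class is never predicted correctly, which is why a classifier can rank differently on the two measures.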


BKM Scalability: Varying n, d, k

Times per iteration. Defaults: d = 4, k = 4, n = 100k


Comparing DBMS with MapReduce

MapReduce: A distributed non-transactional high performance data intensive processing framework.


Incremental Mining

  • A UDF performing incremental data mining, exploiting data parallelism

  • Minimizing the number of scans (1-3) on the data set

  • Provides an approximation of the model before we scan through the complete data set

  • Requires thread safe sharing of the model without affecting performance
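One way to picture these points is a shared model that accumulates sufficient statistics block by block and can be read at any time. The sketch below uses Python threads and a lock as a stand-in for the in-DBMS UDF. The class `IncrementalModel`, the helper `scan`, and the block sizes are hypothetical, and the model is reduced to a single mean and variance per dimension.

```python
import threading

class IncrementalModel:
    """Shared sufficient statistics (n, L, Q), updated block by block."""

    def __init__(self, d):
        self._lock = threading.Lock()
        self.n = 0
        self.L = [0.0] * d  # linear sum per dimension
        self.Q = [0.0] * d  # sum of squares per dimension

    def update(self, block):
        # Summarize the block without holding the lock, then merge under the
        # lock so the shared model is locked only for a few additions.
        d = len(self.L)
        L = [sum(row[a] for row in block) for a in range(d)]
        Q = [sum(row[a] ** 2 for row in block) for a in range(d)]
        with self._lock:
            self.n += len(block)
            for a in range(d):
                self.L[a] += L[a]
                self.Q[a] += Q[a]

    def snapshot(self):
        """Current approximation of the model: (n, mean, variance)."""
        with self._lock:
            if self.n == 0:
                return None
            mean = [l / self.n for l in self.L]
            var = [q / self.n - m * m for q, m in zip(self.Q, mean)]
            return self.n, mean, var

def scan(model, rows, block_size):
    """Feed one partition of the data set to the model, block by block."""
    for start in range(0, len(rows), block_size):
        model.update(rows[start:start + block_size])

rows = [(1.0,), (3.0,), (5.0,), (7.0,)]

# Sequential scan: an approximate model is available after the first block.
model = IncrementalModel(d=1)
model.update(rows[:2])
early = model.snapshot()
model.update(rows[2:])
full = model.snapshot()

# Data parallelism: two threads scan disjoint partitions into one shared model.
parallel = IncrementalModel(d=1)
threads = [threading.Thread(target=scan, args=(parallel, part, 1))
           for part in (rows[:2], rows[2:])]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(early, full, parallel.snapshot())
```

Because n, L, and Q are plain sums, partial results from different partitions can be merged in any order, and the snapshot taken after the full scan matches the single-threaded result.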


Papers

  • Carlos Ordonez, Sasi K. Pitchaimalai: One-pass data mining algorithms in a DBMS with UDFs. SIGMOD Conference 2011: 1217-1220

  • Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado: Comparing SQL and MapReduce to compute Naïve Bayes in a Single Table Scan. CloudDB, CIKM 2010

  • Carlos Ordonez, Sasi K. Pitchaimalai: Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling. DKE 2010

  • Carlos Ordonez, Sasi K. Pitchaimalai: Bayesian Classifiers Programmed in SQL. TKDE 2008

  • Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado: Efficient Distance Computation Using SQL Queries and UDFs. ICDM 2008

