A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs



Ph.D Showcase, Dept. of Computer Science

Sasi Kumar Pitchaimalai

Ph.D Candidate

Database Systems Group, Department of Computer Science

University of Houston

Advisor: Dr. Carlos Ordonez

Motivation
  • Naïve Bayes Classifier (NB)
    • One of the most popular and important classifiers in Machine Learning
    • Robust, Powerful, Fast to Compute and Easy to Understand
  • Programming Inside a DBMS
    • SQL can easily handle complex computations
    • UDFs can use arrays and process data in memory

Data Mining Inside A DBMS

  • Avoids exporting the data outside the DBMS
    • Major overhead
    • Data security
  • Scales linearly with large data sets
    • Exploits parallelism provided by the DBMS
    • Uses optimized queries with simple database operations
  • Objective: Push computations involving large data sets inside the DBMS

Bayesian Classifier Based On K-Means (BKM)
  • A Generalization of Naïve Bayes (NB)
  • The Algorithm
    • Initialization: Randomly initialize k clusters per class from the data set.
    • E-Step: Compute Euclidean distances, find the nearest cluster, and then compute sufficient statistics.
    • M-Step: Re-compute cluster centers and radii. Check convergence.
  • The E-Step and M-Step are repeated until the model converges, i.e., the clusters no longer move.
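
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation (the slides describe SQL queries and UDFs). The function names `fit_class_clusters` and `fit_bkm` are hypothetical. The sufficient statistics are assumed to be the per-cluster count `N`, linear sum `L`, and sum of squares `Q`, and the "radii" are assumed to be per-dimension variances derived from them.

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between a point and a cluster center."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def fit_class_clusters(points, k, max_iter=100, seed=0):
    """Run k-means on the points of ONE class.

    Returns (centers, radii, weights); radii are per-dimension variances.
    """
    rng = random.Random(seed)
    d = len(points[0])
    # Initialization: pick k random points of this class as starting centers.
    centers = [list(p) for p in rng.sample(points, k)]
    radii = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for _ in range(max_iter):
        # E-step: assign each point to its nearest cluster and accumulate
        # the assumed sufficient statistics N, L, Q.
        N = [0] * k
        L = [[0.0] * d for _ in range(k)]
        Q = [[0.0] * d for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda c: sq_dist(x, centers[c]))
            N[j] += 1
            for h in range(d):
                L[j][h] += x[h]
                Q[j][h] += x[h] * x[h]
        # M-step: recompute centers and radii from N, L, Q only.
        new_centers = []
        for j in range(k):
            if N[j] == 0:
                new_centers.append(centers[j])  # empty cluster stays put
                continue
            c = [L[j][h] / N[j] for h in range(d)]
            radii[j] = [max(Q[j][h] / N[j] - c[h] ** 2, 0.0) for h in range(d)]
            new_centers.append(c)
        converged = new_centers == centers  # clusters did not move
        centers, counts = new_centers, N
        if converged:
            break
    weights = [n / len(points) for n in counts]
    return centers, radii, weights


def fit_bkm(X, y, k):
    """Fit k clusters per class; with k=1 each class gets one mean/variance."""
    model = {}
    for label in sorted(set(y)):
        pts = [x for x, lab in zip(X, y) if lab == label]
        model[label] = fit_class_clusters(pts, k)
    return model


model = fit_bkm([[0, 0], [2, 2], [10, 10], [12, 14]], ["a", "a", "b", "b"], k=1)
```

With `k=1` the sketch reduces to one mean and one variance vector per class, which is the sense in which a cluster-per-class model generalizes a Gaussian Naïve Bayes model.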
Database Optimizations
  • Five different query optimization techniques for distance computation were introduced.
  • User Defined Functions (UDFs) – Computing the distance and the nearest cluster in a single UDF.
  • Using CASE statements instead of aggregations.
  • Computing the sufficient statistics of the clusters in a single table scan.
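
To make these ideas concrete, here is a small sketch using Python's built-in `sqlite3` module as a stand-in DBMS. It is an illustration only: the table `X`, its columns, the two hard-coded cluster centers, and the UDF name `nearest_cluster` are all assumptions, and the queries are not the ones from the authors' papers. The first query picks the nearest of two clusters with a CASE expression over distances laid out as columns, then computes count, linear sums, and sums of squares per cluster in one scan of `X`. The second query does the same with a scalar UDF that computes the distances and the nearest cluster in one call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
conn.executemany(
    "INSERT INTO X VALUES (?, ?, ?)",
    [(1, 0.0, 0.0), (2, 0.0, 1.0), (3, 10.0, 10.0), (4, 10.0, 11.0)],
)

# Hypothetical centers for k=2 clusters in d=2 dimensions.
centers = {"c11": 0.0, "c12": 0.5, "c21": 10.0, "c22": 10.5}

# CASE-based variant: distances are columns of the same row, so the nearest
# cluster is chosen with CASE rather than a MIN() aggregation plus a join.
case_sql = """
SELECT j, COUNT(*), SUM(x1), SUM(x2), SUM(x1 * x1), SUM(x2 * x2)
FROM (
    SELECT x1, x2, CASE WHEN d1 <= d2 THEN 1 ELSE 2 END AS j
    FROM (
        SELECT x1, x2,
               (x1 - :c11) * (x1 - :c11) + (x2 - :c12) * (x2 - :c12) AS d1,
               (x1 - :c21) * (x1 - :c21) + (x2 - :c22) * (x2 - :c22) AS d2
        FROM X
    )
)
GROUP BY j
ORDER BY j
"""
case_stats = conn.execute(case_sql, centers).fetchall()


def nearest_cluster(x1, x2):
    """Scalar UDF: compute both distances and return the nearest cluster id."""
    d1 = (x1 - centers["c11"]) ** 2 + (x2 - centers["c12"]) ** 2
    d2 = (x1 - centers["c21"]) ** 2 + (x2 - centers["c22"]) ** 2
    return 1 if d1 <= d2 else 2


conn.create_function("nearest_cluster", 2, nearest_cluster)

udf_sql = """
SELECT j, COUNT(*), SUM(x1), SUM(x2), SUM(x1 * x1), SUM(x2 * x2)
FROM (SELECT x1, x2, nearest_cluster(x1, x2) AS j FROM X)
GROUP BY j
ORDER BY j
"""
udf_stats = conn.execute(udf_sql).fetchall()
```

Both queries return one row per cluster with its count, linear sums, and sums of squares, which is all an M-step needs to recompute centers and radii without reading the data again.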
Comparing Accuracy – NB vs. BKM vs. DT
  • Global Accuracy: BKM is better than NB and worse than DT (Decision Tree) in most cases.
  • Class Breakdown Accuracy:
    • BKM is better than NB in all but 2 cases, suggesting that class decomposition is a positive step towards increasing NB accuracy. DT performs poorly here, and considerably worse in the case of bscale.
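
The difference between the two measures can be shown with a short sketch. It assumes that "global accuracy" means the fraction of all points classified correctly and that "class breakdown accuracy" means the same fraction computed separately for each true class; the slides do not spell out the definitions, and the labels below are made-up toy data, not results from the talk.

```python
from collections import Counter


def global_accuracy(y_true, y_pred):
    """Fraction of all points whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def class_accuracy(y_true, y_pred):
    """Accuracy computed separately over the points of each true class."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / total[c] for c in total}


# Toy example: a classifier that always predicts the majority class.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a"]
overall = global_accuracy(y_true, y_pred)
per_class = class_accuracy(y_true, y_pred)
```

Here the global accuracy is 0.75 while the accuracy on class `b` is 0.0, which is why a classifier can look good globally and still do poorly in the class breakdown.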
BKM Scalability: Varying n, d, k

Times per iteration. Defaults: d = 4, k = 4, n = 100k

Comparing DBMS with MapReduce

MapReduce: A distributed, non-transactional, high-performance framework for data-intensive processing.

Incremental Mining
  • A UDF performing incremental data mining that exploits data parallelism
  • Minimizes the number of scans (1-3) on the data set
  • Provides an approximation of the model before the complete data set has been scanned
  • Requires thread-safe sharing of the model without affecting performance
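
The points above can be illustrated with a simplified Python sketch. It is not the authors' UDF: the class `SharedModel`, the function `scan_partition`, and the batch size are hypothetical, and the "model" is reduced to a single mean and variance per dimension. Each worker thread scans its own partition, accumulates local sums, and merges them into the shared model in small batches under a lock, so the lock is taken rarely and a reader can ask for an approximate model at any time during the scan.

```python
import threading


class SharedModel:
    """Shared sufficient statistics (n, L, Q) guarded by a lock."""

    def __init__(self, d):
        self._lock = threading.Lock()
        self.n = 0
        self.L = [0.0] * d
        self.Q = [0.0] * d

    def merge(self, n, L, Q):
        """Add a worker's locally accumulated statistics to the shared ones."""
        with self._lock:
            self.n += n
            for h in range(len(L)):
                self.L[h] += L[h]
                self.Q[h] += Q[h]

    def snapshot(self):
        """Return (n, mean, variance) from the rows merged so far, or None."""
        with self._lock:
            if self.n == 0:
                return None
            mean = [l / self.n for l in self.L]
            var = [q / self.n - m * m for q, m in zip(self.Q, mean)]
            return self.n, mean, var


def scan_partition(model, rows, d, batch=10):
    """One pass over a partition, merging into the model every `batch` rows."""
    n, L, Q = 0, [0.0] * d, [0.0] * d
    for x in rows:
        n += 1
        for h in range(d):
            L[h] += x[h]
            Q[h] += x[h] * x[h]
        if n == batch:
            model.merge(n, L, Q)
            n, L, Q = 0, [0.0] * d, [0.0] * d
    if n:
        model.merge(n, L, Q)  # flush the final partial batch


data = [[float(v)] for v in range(1, 101)]   # toy 1-dimensional data set
partitions = [data[i::4] for i in range(4)]  # 4 disjoint partitions
model = SharedModel(d=1)
threads = [
    threading.Thread(target=scan_partition, args=(model, part, 1))
    for part in partitions
]
for t in threads:
    t.start()
for t in threads:
    t.join()
final = model.snapshot()
```

Calling `model.snapshot()` while the threads are still running would return an estimate based on the batches merged so far; a larger `batch` means less lock contention but a coarser intermediate approximation.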
Papers
  • Carlos Ordonez, Sasi K. Pitchaimalai: One-pass data mining algorithms in a DBMS with UDFs. SIGMOD Conference 2011: 1217-1220
  • Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado: Comparing SQL and MapReduce to compute Naïve Bayes in a Single Table Scan. CloudDB, CIKM 2010
  • Carlos Ordonez, Sasi K. Pitchaimalai: Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling. DKE 2010
  • Carlos Ordonez, Sasi K. Pitchaimalai: Bayesian Classifiers Programmed in SQL. TKDE 2008
  • Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado: Efficient Distance Computation Using SQL Queries and UDFs. ICDM 2008