A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs

Ph.D Showcase, Dept. of Computer Science

Sasi Kumar Pitchaimalai

Ph.D Candidate

Database Systems Group, Department of Computer Science

University of Houston

Advisor: Dr. Carlos Ordonez

- Naïve Bayes Classifier (NB)
- One of the most popular and important classifiers in machine learning
- Robust, powerful, fast to compute, and easy to understand

- Programming Inside A DBMS
- SQL can easily handle complex computations
- UDFs can use arrays and run in memory

Data Mining Inside A DBMS

- Avoids exporting the data outside the DBMS
- Major overhead
- Data security
- Scales linearly with large data sets
- Exploits parallelism provided by the DBMS
- Uses optimized queries with simple database operations

Objective: Push computations involving large data sets inside the DBMS

- Bayesian classifier based on K-means (BKM): a generalization of Naïve Bayes (NB)
- The Algorithm
- Initialization: Randomly initialize k clusters per class from the data set.
- E-Step: Compute Euclidean distance, find nearest cluster and then compute sufficient statistics.
- M-Step: Re-compute cluster centers and radii. Check Convergence.

- The E-step and M-step are repeated until the model converges, i.e., the clusters no longer move
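The initialization, E-step and M-step listed above can be sketched in plain Python. This is a minimal illustration outside the DBMS, not the SQL/UDF implementation from the talk. The function names (`sq_dist`, `bkm_fit`), the `N`, `L`, `Q` names for the sufficient statistics (count, linear sum, sum of squares), and the reading of "radii" as per-dimension variances are all assumptions made for this example.

```python
import random


def sq_dist(p, c):
    """Squared Euclidean distance between a point and a cluster center."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def bkm_fit(points, labels, k, max_iter=100, seed=0):
    """Fit k clusters per class. Returns {class: {"C": centers, "R": radii, "W": weights}}."""
    rng = random.Random(seed)
    model = {}
    for g in sorted(set(labels)):
        xs = [p for p, y in zip(points, labels) if y == g]
        d = len(xs[0])
        # Initialization: k random points of this class become the initial centers.
        centers = [list(p) for p in rng.sample(xs, k)]
        for _ in range(max_iter):
            # E-step: assign each point to its nearest cluster and accumulate
            # sufficient statistics N (count), L (linear sum), Q (sum of squares).
            N = [0] * k
            L = [[0.0] * d for _ in range(k)]
            Q = [[0.0] * d for _ in range(k)]
            for p in xs:
                j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
                N[j] += 1
                for a in range(d):
                    L[j][a] += p[a]
                    Q[j][a] += p[a] * p[a]
            # M-step: recompute centers; an empty cluster keeps its old center.
            new_centers = [
                [L[j][a] / N[j] for a in range(d)] if N[j] else centers[j]
                for j in range(k)
            ]
            moved = new_centers != centers
            centers = new_centers
            if not moved:  # converged: the clusters did not move
                break
        # Radii as per-dimension variances Q/N - (L/N)^2 (assumed interpretation).
        radii = [
            [max(0.0, Q[j][a] / N[j] - centers[j][a] ** 2) if N[j] else 0.0
             for a in range(d)]
            for j in range(k)
        ]
        model[g] = {"C": centers, "R": radii, "W": [n / len(xs) for n in N]}
    return model
```

Because the centers and radii are derived only from `N`, `L` and `Q`, the M-step never needs to revisit the data, which is what makes a one-scan-per-iteration formulation possible.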

- Five different query optimization techniques for distance computation were introduced.
- User Defined Functions (UDFs) – Computing distance and nearest cluster in a single UDF.
- Using a CASE statement instead of aggregations.
- Sufficient Statistics of the clusters were computed in a single table scan.
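Three of the techniques above can be illustrated with Python's built-in `sqlite3` module standing in for the DBMS. The sketch below is an assumption-laden toy, not the queries from the talk: the table names `X` and `C`, the column names, the horizontal one-row layout of the centers, and the Python function `nearest` (registered as an application-defined SQL function) are all hypothetical, and the class column is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE X (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL);
    INSERT INTO X VALUES (1, 0, 0), (2, 0, 1), (3, 10, 10), (4, 10, 11);
    -- Hypothetical horizontal layout: one row holding the centers of k=2 clusters.
    CREATE TABLE C (c1_1 REAL, c1_2 REAL, c2_1 REAL, c2_2 REAL);
    INSERT INTO C VALUES (0, 0, 10, 10);
""")

# Variant 1: squared distances as columns d1, d2; the nearest cluster is picked
# with CASE instead of a MIN() aggregation; N, L, Q come from the same pass over X.
CASE_QUERY = """
    SELECT j, COUNT(*) AS n,
           SUM(x1) AS l1, SUM(x2) AS l2,
           SUM(x1 * x1) AS q1, SUM(x2 * x2) AS q2
    FROM (
        SELECT x1, x2, CASE WHEN d1 <= d2 THEN 1 ELSE 2 END AS j
        FROM (
            SELECT x1, x2,
                   (x1 - c1_1) * (x1 - c1_1) + (x2 - c1_2) * (x2 - c1_2) AS d1,
                   (x1 - c2_1) * (x1 - c2_1) + (x2 - c2_2) * (x2 - c2_2) AS d2
            FROM X CROSS JOIN C
        ) AS dist
    ) AS assigned
    GROUP BY j
    ORDER BY j
"""


def nearest(x1, x2, c1_1, c1_2, c2_1, c2_2):
    """Distance computation and nearest-cluster lookup in one function call."""
    d1 = (x1 - c1_1) ** 2 + (x2 - c1_2) ** 2
    d2 = (x1 - c2_1) ** 2 + (x2 - c2_2) ** 2
    return 1 if d1 <= d2 else 2


# Variant 2: the same logic as a scalar UDF (here a Python function registered
# with SQLite; a real DBMS UDF would be written in its own UDF language).
conn.create_function("nearest", 6, nearest)
UDF_QUERY = """
    SELECT nearest(x1, x2, c1_1, c1_2, c2_1, c2_2) AS j, COUNT(*) AS n,
           SUM(x1) AS l1, SUM(x2) AS l2,
           SUM(x1 * x1) AS q1, SUM(x2 * x2) AS q2
    FROM X CROSS JOIN C
    GROUP BY j
    ORDER BY j
"""

case_rows = conn.execute(CASE_QUERY).fetchall()
udf_rows = conn.execute(UDF_QUERY).fetchall()
```

Both queries return one row per cluster with its count, linear sums and sums of squares, so the next set of centers and radii can be derived without touching `X` again.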

- Global accuracy: BKM is better than NB and worse than DT (decision tree) in most cases
- Class breakdown accuracy:
- BKM is better than NB in all but 2 cases, indicating that class decomposition is a positive step towards increasing NB accuracy. DT performs poorly here, and considerably worse in the case of bscale.

Times per iteration. Defaults: d=4, k=4, n=100k

MapReduce: a distributed, non-transactional, high-performance framework for data-intensive processing.

- A UDF performing incremental data mining that exploits data parallelism
- Minimizes the number of scans (1-3) on the data set
- Provides an approximation of the model before the complete data set has been scanned
- Requires thread-safe sharing of the model without affecting performance
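One way to picture the requirements above is a shared accumulator of sufficient statistics that several workers update while a reader takes model snapshots mid-scan. The sketch below is only an illustration under assumptions: the class `IncrementalStats`, the helper `parallel_scan`, the single numeric column, and the use of Python threads with a lock are all hypothetical choices, not the UDF design proposed in the talk.

```python
import threading


class IncrementalStats:
    """Running sufficient statistics (n, L, Q) for one numeric column."""

    def __init__(self):
        self._lock = threading.Lock()
        self._n, self._l, self._q = 0, 0.0, 0.0

    def update(self, chunk):
        # Aggregate the chunk locally so the lock is held only for three additions.
        n = len(chunk)
        l = sum(chunk)
        q = sum(v * v for v in chunk)
        with self._lock:
            self._n += n
            self._l += l
            self._q += q

    def snapshot(self):
        """Approximate model (mean, variance) from the rows seen so far."""
        with self._lock:
            n, l, q = self._n, self._l, self._q
        if n == 0:
            return None
        mean = l / n
        return mean, max(0.0, q / n - mean * mean)


def parallel_scan(stats, partitions, chunk_size=1000):
    """Scan each partition once in its own thread, feeding chunks to `stats`."""

    def worker(rows):
        for start in range(0, len(rows), chunk_size):
            stats.update(rows[start:start + chunk_size])

    threads = [threading.Thread(target=worker, args=(p,)) for p in partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling `snapshot()` at any point during the scan returns an estimate based on the rows processed so far; after `parallel_scan` returns, it reflects the full data set after a single pass.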

- Carlos Ordonez, Sasi K. Pitchaimalai: One-pass data mining algorithms in a DBMS with UDFs. SIGMOD Conference 2011: 1217-1220
- Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado: Comparing SQL and MapReduce to compute Naïve Bayes in a single table scan. CloudDB, CIKM 2010
- Carlos Ordonez, Sasi K. Pitchaimalai: Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling. DKE 2010
- Carlos Ordonez, Sasi K. Pitchaimalai: Bayesian classifiers programmed in SQL. TKDE 2008
- Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado: Efficient distance computation using SQL queries and UDFs. ICDM 2008