
Classifiers in Atlas






Presentation Transcript


  1. Classifiers in Atlas CS240B Class Notes UCLA

  2. Data Mining
  • Classifiers:
    • Bayesian classifiers
    • Decision trees
  • The Apriori Algorithm
  • DBSCAN Clustering
  Examples: http://wis.cs.ucla.edu/atlas/examples.html

  3. The Classification Task
  • Input: a training set of tuples, each labelled with one class label
  • Output: a model (classifier) which assigns a class label to each tuple based on its other attributes
  • The model can be used to predict the class of new tuples, for which the class label is missing or unknown
  • Some natural applications:
    • credit approval
    • medical diagnosis
    • treatment effectiveness analysis

  4. Train & Test • The tuples (observations, samples) are partitioned in training set + test set. • Classification is performed in two steps: • training - build the model from training set • Testing (for accuracy, etc.)

  5. Classical example: play tennis? Training set from Quinlan's book; a sequence number (Seq) could have been used to generate the RID column. The table is sketched below.
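The training table itself did not survive the transcript. The following sketch reconstructs the commonly reprinted 14-tuple play-tennis relation; the column names (RID, Outlook, Temp, Humidity, Wind, Dec) and the P/N class labels are assumptions chosen to match the later slides.

    CREATE TABLE PlayTennis (RID INT, Outlook VARCHAR(10), Temp VARCHAR(10),
                             Humidity VARCHAR(10), Wind VARCHAR(10), Dec CHAR(1));
    -- Quinlan's weather data: 9 P (play) tuples and 5 N (don't play) tuples.
    INSERT INTO PlayTennis VALUES
      ( 1, 'sunny',    'hot',  'high',   'weak',   'N'),
      ( 2, 'sunny',    'hot',  'high',   'strong', 'N'),
      ( 3, 'overcast', 'hot',  'high',   'weak',   'P'),
      ( 4, 'rain',     'mild', 'high',   'weak',   'P'),
      ( 5, 'rain',     'cool', 'normal', 'weak',   'P'),
      ( 6, 'rain',     'cool', 'normal', 'strong', 'N'),
      ( 7, 'overcast', 'cool', 'normal', 'strong', 'P'),
      ( 8, 'sunny',    'mild', 'high',   'weak',   'N'),
      ( 9, 'sunny',    'cool', 'normal', 'weak',   'P'),
      (10, 'rain',     'mild', 'normal', 'weak',   'P'),
      (11, 'sunny',    'mild', 'normal', 'strong', 'P'),
      (12, 'overcast', 'mild', 'high',   'strong', 'P'),
      (13, 'overcast', 'hot',  'normal', 'weak',   'P'),
      (14, 'rain',     'mild', 'high',   'strong', 'N');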

  6. Bayesian classification
  • The classification problem may be formalized using a-posteriori probabilities:
  • P(C|X) = probability that the sample tuple X = <x1,…,xk> is of class C
  • E.g. P(class=N | outlook=sunny, windy=true, …)
  • Idea: assign to sample X the class label C such that P(C|X) is maximal

  7. Estimating a-posteriori probabilities
  • Bayes theorem: P(C|X) = P(X|C)·P(C) / P(X)
  • P(X) is constant for all classes
  • P(C) = relative frequency of class C samples
  • C such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum

  8. Naïve Bayesian Classification
  • Naïve assumption: attribute independence, P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
  • For categorical attributes: P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
  • Computationally this is a count with grouping (see the sketch below)
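A minimal sketch of that count-with-grouping step in standard SQL rather than ATLaS syntax, for one attribute of the PlayTennis relation sketched earlier; the same pattern repeats for every categorical attribute.

    -- Relative frequency of each Outlook value within each class:
    -- a counting-only estimate of P(outlook = x | C).
    SELECT Outlook, Dec,
           COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (PARTITION BY Dec) AS rel_freq
    FROM   PlayTennis
    GROUP BY Outlook, Dec;

Running one such query per attribute (or a single query with grouping sets, as the later slide suggests) yields the statistics the classifier needs.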

  9. Play-tennis example: estimating P(xi|C)

  10. Bayesian Classifiers
  • The training can be done with SQL count and grouping sets (but that might require many passes through the data)
  • If the results are stored in a table called SUMMARY, then the testing is a simple SQL query on SUMMARY (see the sketch below)
  • The first operation is to verticalize the table
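A hedged sketch of that test query in standard SQL rather than ATLaS syntax. It assumes SUMMARY(Col, Val, Dec, prob) holds the estimates of P(xi|C), that a companion table PRIORS(Dec, prob) holds the class priors P(C), and that the tuple to classify has been verticalized into Vnew(Col, Val); PRIORS and Vnew are illustrative names, not from the slides.

    -- Pick the class C that maximizes P(X|C)·P(C); the product of the
    -- conditional probabilities is computed as EXP(SUM(LN(...))).
    -- (Zero counts would need smoothing before taking logarithms.)
    SELECT   p.Dec
    FROM     PRIORS p, SUMMARY s, Vnew v
    WHERE    s.Dec = p.Dec AND s.Col = v.Col AND s.Val = v.Val
    GROUP BY p.Dec
    ORDER BY EXP(SUM(LN(s.prob))) * MAX(p.prob) DESC
    FETCH FIRST 1 ROW ONLY;   -- the top row is the predicted class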

  11. Decision tree obtained with ID3 (Quinlan 86). In the original figure, the root tests outlook: the sunny branch continues with a humidity test (high → N, normal → P), the overcast branch predicts P directly, and the rain branch continues with a windy test (strong → N, weak → P). The figure labels the tree nodes [0] through [4].
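As an illustration only (not from the slides), a tree this small can be applied with a single SQL CASE expression; NewTuples is a hypothetical relation with the same attributes as PlayTennis but no class label.

    -- Classify unlabelled tuples by walking the ID3 tree above.
    SELECT RID,
           CASE
             WHEN Outlook = 'overcast'                    THEN 'P'
             WHEN Outlook = 'sunny' AND Humidity = 'high' THEN 'N'
             WHEN Outlook = 'sunny'                       THEN 'P'  -- normal humidity
             WHEN Outlook = 'rain'  AND Wind = 'strong'   THEN 'N'
             ELSE 'P'                                               -- rain, weak wind
           END AS PredictedDec
    FROM NewTuples;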

  12. Decision Tree Classifiers
  • Computed in a recursive fashion
  • Various ways to split and to compute the splitting function
  • The first operation is to verticalize the table (see the sketch below)
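A minimal sketch of that verticalization in standard SQL rather than ATLaS syntax: every PlayTennis tuple becomes one (RID, Col, Val, Dec) row per attribute, so all candidate split columns can be processed uniformly; the view name Vtrain is an assumption.

    CREATE VIEW Vtrain(RID, Col, Val, Dec) AS
      SELECT RID, 'Outlook',  Outlook,  Dec FROM PlayTennis
      UNION ALL
      SELECT RID, 'Temp',     Temp,     Dec FROM PlayTennis
      UNION ALL
      SELECT RID, 'Humidity', Humidity, Dec FROM PlayTennis
      UNION ALL
      SELECT RID, 'Wind',     Wind,     Dec FROM PlayTennis;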

  13. Classical example

  14. Initial state: the node column. Training set from Quinlan's book.

  15. First Level (Outlook will then be deleted)

  16. Gini index
  • E.g., two classes, Pos and Neg, and dataset S with p Pos-elements and n Neg-elements:
    fp = p/(p+n)   fn = n/(p+n)
    gini(S) = 1 − fp² − fn²
  • If dataset S is split into S1, S2, S3 then
    ginisplit(S1,S2,S3) = gini(S1)·(p1+n1)/(p+n) + gini(S2)·(p2+n2)/(p+n) + gini(S3)·(p3+n3)/(p+n)
  • These computations can be easily expressed in ATLaS (see the sketch below)
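A sketch of that computation in standard SQL rather than ATLaS syntax, over the verticalized view Vtrain(RID, Col, Val, Dec) sketched above and assuming the class labels are 'P' and 'N':

    WITH part AS (      -- one row per (candidate column, attribute value) partition
      SELECT Col, Val,
             COUNT(*)                                         AS tot,
             SUM(CASE WHEN Dec = 'P' THEN 1 ELSE 0 END) * 1.0 AS p,
             SUM(CASE WHEN Dec = 'N' THEN 1 ELSE 0 END) * 1.0 AS n
      FROM   Vtrain
      GROUP BY Col, Val
    ), whole AS (       -- total number of tuples per candidate column
      SELECT Col, SUM(tot) * 1.0 AS grand FROM part GROUP BY Col
    )
    SELECT part.Col,
           SUM( (1 - (p/tot)*(p/tot) - (n/tot)*(n/tot))   -- gini(S_i)
                * tot / grand )                           -- weighted by (p_i+n_i)/(p+n)
             AS gini_split
    FROM   part JOIN whole ON part.Col = whole.Col
    GROUP BY part.Col
    ORDER BY gini_split;   -- the column with the lowest gini_split is the best split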

  17. Programming in ATLaS
  • Table-based programming is powerful and natural for data-intensive applications
  • SQL can be awkward, and many extensions are possible
  • But even SQL `as is' is adequate

  18. The ATLaS System
  • The system compiles ATLaS programs into C programs, which execute on the Berkeley DB record manager
  • The 100-line Apriori program compiles into 2,800 lines of C
  • Other data structures (R-trees, in-memory tables) have been added using the same API
  • The system is now 54,000 lines of C++ code

  19. ATLaS: Conclusions
  • A native extensibility mechanism for SQL, and a simple one; more efficient than Java or PL/SQL
  • Effective for data mining applications
  • Also OLAP applications, recursive queries, and temporal database applications
  • Complements current mechanisms based on UDFs and Data Blades
  • Supports and favors streaming aggregates (SQL's implicit default is blocking)
  • A good basis for determining program properties, e.g. (non)monotonic and blocking behavior
  • These are lessons that future query languages cannot easily ignore

  20. The Future
  • Continuous queries on Data Streams
  • Other extensions and improvements
  • Stay tuned: www.wis.ucla.edu
