
Classifiers in Atlas






Presentation Transcript


  1. Classifiers in Atlas CS240B Class Notes UCLA

  2. Data Mining
  • Classifiers:
    • Bayesian classifiers
    • Decision trees
  • The Apriori Algorithm
  • DBSCAN Clustering
  Examples: http://wis.cs.ucla.edu/atlas/examples.html

  3. The Classification Task
  • Input: a training set of tuples, each labelled with one class label
  • Output: a model (classifier) which assigns a class label to each tuple based on its other attributes
  • The model can be used to predict the class of new tuples, for which the class label is missing or unknown
  • Some natural applications:
    • credit approval
    • medical diagnosis
    • treatment effectiveness analysis

  4. Train & Test • The tuples (observations, samples) are partitioned in training set + test set. • Classification is performed in two steps: • training - build the model from training set • Testing (for accuracy, etc.)

  5. Classical example: play tennis? Training set from Quinlan's book; a sequence number (Seq) could have been used to generate the RID column. The table is sketched below.
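The training table itself did not survive the transcript. The following sketch reconstructs the commonly reprinted 14-tuple play-tennis relation; the column names (RID, Outlook, Temp, Humidity, Wind, Dec) and the P/N class labels are assumptions chosen to match the later slides.

    CREATE TABLE PlayTennis (RID INT, Outlook VARCHAR(10), Temp VARCHAR(10),
                             Humidity VARCHAR(10), Wind VARCHAR(10), Dec CHAR(1));
    -- Quinlan's weather data: 9 P (play) tuples and 5 N (don't play) tuples.
    INSERT INTO PlayTennis VALUES
      ( 1, 'sunny',    'hot',  'high',   'weak',   'N'),
      ( 2, 'sunny',    'hot',  'high',   'strong', 'N'),
      ( 3, 'overcast', 'hot',  'high',   'weak',   'P'),
      ( 4, 'rain',     'mild', 'high',   'weak',   'P'),
      ( 5, 'rain',     'cool', 'normal', 'weak',   'P'),
      ( 6, 'rain',     'cool', 'normal', 'strong', 'N'),
      ( 7, 'overcast', 'cool', 'normal', 'strong', 'P'),
      ( 8, 'sunny',    'mild', 'high',   'weak',   'N'),
      ( 9, 'sunny',    'cool', 'normal', 'weak',   'P'),
      (10, 'rain',     'mild', 'normal', 'weak',   'P'),
      (11, 'sunny',    'mild', 'normal', 'strong', 'P'),
      (12, 'overcast', 'mild', 'high',   'strong', 'P'),
      (13, 'overcast', 'hot',  'normal', 'weak',   'P'),
      (14, 'rain',     'mild', 'high',   'strong', 'N');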

  6. Bayesian classification
  • The classification problem may be formalized using a-posteriori probabilities:
  • P(C|X) = probability that the sample tuple X = <x1,…,xk> is of class C
  • E.g. P(class=N | outlook=sunny, windy=true, …)
  • Idea: assign to sample X the class label C such that P(C|X) is maximal

  7. Estimating a-posteriori probabilities
  • Bayes theorem: P(C|X) = P(X|C)·P(C) / P(X)
  • P(X) is constant for all classes
  • P(C) = relative frequency of class C samples
  • C such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum

  8. Naïve Bayesian Classification
  • Naïve assumption: attribute independence, P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
  • For categorical attributes: P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
  • Computationally this is a count with grouping (see the sketch below)
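A minimal sketch of that count-with-grouping step in standard SQL rather than ATLaS syntax, for one attribute of the PlayTennis relation sketched earlier; the same pattern repeats for every categorical attribute.

    -- Relative frequency of each Outlook value within each class:
    -- a counting-only estimate of P(outlook = x | C).
    SELECT Outlook, Dec,
           COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (PARTITION BY Dec) AS rel_freq
    FROM   PlayTennis
    GROUP BY Outlook, Dec;

Running one such query per attribute (or a single query with grouping sets, as the later slide suggests) yields the statistics the classifier needs.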

  9. Play-tennis example: estimating P(xi|C)

  10. Bayesian Classifiers
  • The training can be done with SQL count and grouping sets (but that might require many passes through the data)
  • If the results are stored in a table called SUMMARY, then the testing is a simple SQL query on SUMMARY (see the sketch below)
  • The first operation is to verticalize the table
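A hedged sketch of that test query in standard SQL rather than ATLaS syntax. It assumes SUMMARY(Col, Val, Dec, prob) holds the estimates of P(xi|C), that a companion table PRIORS(Dec, prob) holds the class priors P(C), and that the tuple to classify has been verticalized into Vnew(Col, Val); PRIORS and Vnew are illustrative names, not from the slides.

    -- Pick the class C that maximizes P(X|C)·P(C); the product of the
    -- conditional probabilities is computed as EXP(SUM(LN(...))).
    -- (Zero counts would need smoothing before taking logarithms.)
    SELECT   p.Dec
    FROM     PRIORS p, SUMMARY s, Vnew v
    WHERE    s.Dec = p.Dec AND s.Col = v.Col AND s.Val = v.Val
    GROUP BY p.Dec
    ORDER BY EXP(SUM(LN(s.prob))) * MAX(p.prob) DESC
    FETCH FIRST 1 ROW ONLY;   -- the top row is the predicted class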

  11. Decision tree obtained with ID3 (Quinlan 86). In the original figure, the root tests outlook: the sunny branch continues with a humidity test (high → N, normal → P), the overcast branch predicts P directly, and the rain branch continues with a windy test (strong → N, weak → P). The figure labels the tree nodes [0] through [4].
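As an illustration only (not from the slides), a tree this small can be applied with a single SQL CASE expression; NewTuples is a hypothetical relation with the same attributes as PlayTennis but no class label.

    -- Classify unlabelled tuples by walking the ID3 tree above.
    SELECT RID,
           CASE
             WHEN Outlook = 'overcast'                    THEN 'P'
             WHEN Outlook = 'sunny' AND Humidity = 'high' THEN 'N'
             WHEN Outlook = 'sunny'                       THEN 'P'  -- normal humidity
             WHEN Outlook = 'rain'  AND Wind = 'strong'   THEN 'N'
             ELSE 'P'                                               -- rain, weak wind
           END AS PredictedDec
    FROM NewTuples;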

  12. Decision Tree Classifiers
  • Computed in a recursive fashion
  • Various ways to split and to compute the splitting function
  • The first operation is to verticalize the table (see the sketch below)
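A minimal sketch of that verticalization in standard SQL rather than ATLaS syntax: every PlayTennis tuple becomes one (RID, Col, Val, Dec) row per attribute, so all candidate split columns can be processed uniformly; the view name Vtrain is an assumption.

    CREATE VIEW Vtrain(RID, Col, Val, Dec) AS
      SELECT RID, 'Outlook',  Outlook,  Dec FROM PlayTennis
      UNION ALL
      SELECT RID, 'Temp',     Temp,     Dec FROM PlayTennis
      UNION ALL
      SELECT RID, 'Humidity', Humidity, Dec FROM PlayTennis
      UNION ALL
      SELECT RID, 'Wind',     Wind,     Dec FROM PlayTennis;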

  13. Classical example

  14. Initial state: the node column. Training set from Quinlan's book.

  15. First Level (Outlook will then be deleted)

  16. Gini index
  • E.g., two classes, Pos and Neg, and dataset S with p Pos-elements and n Neg-elements:
    fp = p/(p+n)   fn = n/(p+n)
    gini(S) = 1 − fp² − fn²
  • If dataset S is split into S1, S2, S3 then
    ginisplit(S1,S2,S3) = gini(S1)·(p1+n1)/(p+n) + gini(S2)·(p2+n2)/(p+n) + gini(S3)·(p3+n3)/(p+n)
  • These computations can be easily expressed in ATLaS (see the sketch below)
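A sketch of that computation in standard SQL rather than ATLaS syntax, over the verticalized view Vtrain(RID, Col, Val, Dec) sketched above and assuming the class labels are 'P' and 'N':

    WITH part AS (      -- one row per (candidate column, attribute value) partition
      SELECT Col, Val,
             COUNT(*)                                         AS tot,
             SUM(CASE WHEN Dec = 'P' THEN 1 ELSE 0 END) * 1.0 AS p,
             SUM(CASE WHEN Dec = 'N' THEN 1 ELSE 0 END) * 1.0 AS n
      FROM   Vtrain
      GROUP BY Col, Val
    ), whole AS (       -- total number of tuples per candidate column
      SELECT Col, SUM(tot) * 1.0 AS grand FROM part GROUP BY Col
    )
    SELECT part.Col,
           SUM( (1 - (p/tot)*(p/tot) - (n/tot)*(n/tot))   -- gini(S_i)
                * tot / grand )                           -- weighted by (p_i+n_i)/(p+n)
             AS gini_split
    FROM   part JOIN whole ON part.Col = whole.Col
    GROUP BY part.Col
    ORDER BY gini_split;   -- the column with the lowest gini_split is the best split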

  17. Programming in ATLaS
  • Table-based programming is powerful and natural for data-intensive applications
  • SQL can be awkward, and many extensions are possible
  • But even SQL `as is' is adequate

  18. The ATLaS System
  • The system compiles ATLaS programs into C programs, which execute on the Berkeley DB record manager
  • The 100-line Apriori program compiles into 2,800 lines of C
  • Other data structures (R-trees, in-memory tables) have been added using the same API
  • The system is now 54,000 lines of C++ code

  19. ATLaS: Conclusions
  • A native extensibility mechanism for SQL, and a simple one; more efficient than Java or PL/SQL
  • Effective for data mining applications
  • Also OLAP applications, recursive queries, and temporal database applications
  • Complements current mechanisms based on UDFs and Data Blades
  • Supports and favors streaming aggregates (SQL's implicit default is blocking)
  • A good basis for determining program properties, e.g. (non)monotonic and blocking behavior
  • These are lessons that future query languages cannot easily ignore

  20. The Future
  • Continuous queries on Data Streams
  • Other extensions and improvements
  • Stay tuned: www.wis.ucla.edu
