CS240B Spring 2014: Work Plan.
Monday, April 1—First Week
Monday 14: Homework 1.
WD 24: Homework 2.
WD 1st: Homework 3
May 5 (Monday 6th week): MIDTERM
May 6: Project assigned.
May 14: Students’ presentations begin.
May 19: Project due.
June 4. Last day of class.
June 13: Deadline for turning in the final (take-home project and report).
* The maximum cumulative delay on projects and report must not exceed 3 days; each extra day incurs a 10% penalty.
Motivation: Gain experience with current DSMSs and with the limitations that make it hard to support KDD applications on data streams.
Case Study: Naïve Bayesian Classifiers (NBCs), arguably the simplest mining algorithm, which is doable in SQL/DBMS. Thus the question is: can we support it using a DSMS and its SQL-like query language?
A slightly more general question is whether the NBC can be supported by various CEP systems, which claim to be powerful (e.g., they support rules). Could they be extended to support generic versions of NBC, and perhaps other data stream mining methods?
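To make the "doable in SQL/DBMS" claim concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in DBMS. The table name `train`, the toy rows, the attribute names, and the Laplace smoothing are all illustrative assumptions, not part of the course material. The point is only that training reduces to GROUP BY counts, which is the kind of aggregate a DSMS would have to maintain continuously.

```python
import math
import sqlite3

# Hypothetical toy training set: (outlook, windy, label). Not a course test set.
ROWS = [
    ("sunny", "no", "play"),
    ("sunny", "yes", "stay"),
    ("rainy", "yes", "stay"),
    ("overcast", "no", "play"),
    ("rainy", "no", "play"),
    ("sunny", "no", "play"),
]
ATTRS = ["outlook", "windy"]  # assumed attribute names


def train(conn):
    """Build the NBC count tables with plain GROUP BY queries."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE train (outlook TEXT, windy TEXT, label TEXT)")
    cur.executemany("INSERT INTO train VALUES (?, ?, ?)", ROWS)

    # Class counts: label -> number of training tuples.
    class_counts = dict(
        cur.execute("SELECT label, COUNT(*) FROM train GROUP BY label")
    )

    # Conditional counts: (attribute, value, label) -> number of tuples.
    cond_counts = {}
    for attr in ATTRS:
        # attr comes from the fixed ATTRS list, so formatting it in is safe here.
        query = f"SELECT label, {attr}, COUNT(*) FROM train GROUP BY label, {attr}"
        for label, value, n in cur.execute(query):
            cond_counts[(attr, value, label)] = n

    # Number of distinct values per attribute, needed for smoothing.
    domain_sizes = {
        attr: cur.execute(f"SELECT COUNT(DISTINCT {attr}) FROM train").fetchone()[0]
        for attr in ATTRS
    }
    return class_counts, cond_counts, domain_sizes


def classify(sample, class_counts, cond_counts, domain_sizes):
    """Return the label with the highest smoothed log-posterior score."""
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)
        for attr, value in sample.items():
            n = cond_counts.get((attr, value, label), 0)
            # Laplace smoothing so unseen combinations do not zero the product.
            score += math.log((n + 1) / (n_label + domain_sizes[attr]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train(sqlite3.connect(":memory:"))
    print(classify({"outlook": "sunny", "windy": "yes"}, *model))
```

The classification step is done in Python here for brevity; it could also be written as a join between the unclassified tuples and the count tables.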
Download a DSMS or a CEP system of your choice and (after explaining why you have selected this and not the others) explore how you can implement the following tasks:
1. Testing of a Naïve Bayesian Classifier: you can assume that the NBC has already been trained and that you can read it from the input, a DB, a file, or memory.
2. Assume now that you also have a stream of pre-classified samples. Use it to determine the accuracy of your current classifier at periodic intervals. Output the accuracy and, if it falls below a certain threshold, execute the next step.
3. Periodically retrain a new NBC from the stream of pre-classified tuples; then use the newly built classifier to predict the class of unclassified tuples (Step 1).
4. See if you can generalize your software, e.g., by designing and developing generic NBCs, ensemble methods, other classifiers, etc.
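As a reference point for what the first three tasks compute, the following is a minimal plain-Python sketch, not a DSMS or CEP implementation. The class `NaiveBayes`, the function `monitor`, the tumbling window of 4 tuples, and the 0.7 threshold are all hypothetical choices made for illustration. The actual project is to express the same logic in the query or rule language of the system you download.

```python
import math
from collections import Counter


class NaiveBayes:
    """Categorical NBC stored as count tables, with Laplace smoothing."""

    def __init__(self):
        self.class_counts = Counter()  # label -> count
        self.cond_counts = Counter()   # (attr_index, value, label) -> count
        self.values = {}               # attr_index -> set of values seen

    def learn(self, features, label):
        """Update the counts with one pre-classified tuple."""
        self.class_counts[label] += 1
        for i, v in enumerate(features):
            self.cond_counts[(i, v, label)] += 1
            self.values.setdefault(i, set()).add(v)

    def predict(self, features):
        """Task 1: return the most probable label for an unclassified tuple."""
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for i, v in enumerate(features):
                k = len(self.values.get(i, ())) or 1
                score += math.log((self.cond_counts[(i, v, label)] + 1) / (n + k))
            if score > best_score:
                best, best_score = label, score
        return best


def monitor(model, labeled_stream, window=4, threshold=0.7):
    """Tasks 2 and 3: yield (accuracy, model) after every `window` tuples.

    If the accuracy over a window falls below `threshold`, a fresh NBC is
    trained from that same window and replaces the current model.
    """
    buffer = []
    for features, label in labeled_stream:
        buffer.append((features, label))
        if len(buffer) == window:
            hits = sum(model.predict(f) == y for f, y in buffer)
            accuracy = hits / window
            if accuracy < threshold:
                model = NaiveBayes()
                for f, y in buffer:
                    model.learn(f, y)
            yield accuracy, model
            buffer = []


if __name__ == "__main__":
    nbc = NaiveBayes()
    for sample in [(("a",), "pos"), (("b",), "neg")] * 2:
        nbc.learn(*sample)
    # The second half of this toy stream flips the concept to force a retrain.
    stream = [(("a",), "pos"), (("b",), "neg")] * 2 + [(("a",), "neg"), (("b",), "pos")] * 2
    for acc, current in monitor(nbc, stream):
        print(acc, current.predict(("a",)))
```

A tumbling window is used here only because it is the simplest to state; sliding windows or decayed counts are equally plausible designs, and which of them a given DSMS can express is part of what the report should discuss.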
It is understood that the limitations of DSMS and CEP systems will probably prevent you from completing all these tasks (listed in order of increasing difficulty). So, you should make sure that you (1) download a good system, and (2) write a clear report explaining your efforts and the reasons that prevented you from going further. (For test sets, see the CS240A project: http://www.cs.ucla.edu/classes/winter14/cs240A/DMproject.html)
You and your contribution to the discussion will be evaluated too:
(i) Remind me of the questions you asked during the discussion
(ii) I will later record any answer/comment that you have provided via email.