CSI5388 Practical Recommendations


Context for our Recommendations I
This discussion will take place in the context of the following three questions:
• I have created a new classifier for a specific problem. How does it compare to other existing classifiers on this particular problem?
• I have designed a new classifier. How does it compare to existing classifiers on benchmark data?
• How do various classifiers fare on benchmark data or on a single new problem?

Context for our Recommendations II
These three questions can be translated into four different situations:
• Situation 1: Comparison of a new classifier to generic ones for a specific problem
• Situation 2: Comparison of a new classifier to generic ones on generic problems
• Situation 3: Comparison of generic classifiers on generic domains
• Situation 4: Comparison of generic classifiers on a specific problem

Selecting learning algorithms I
• The general strategy is to select classifiers that are likely to succeed on the task at hand.
• Situation 1: Select generic classifiers with a good chance of success at the particular task.
  • E.g., for a high-dimensionality problem: use an SVM as a generic classifier.
  • E.g., for a class-imbalanced problem: use a generic classifier combined with SMOTE oversampling, etc.
• Situation 2: Different from Situation 1 in that no specific problem is targeted. So, choose generic classifiers that are generally accurate and stable across domains.
  • E.g., Random Forests, SVMs, Bagging.

Selecting learning algorithms II
• Situation 3: Different from Situations 1 and 2. This time, we are interested in finding the strengths and weaknesses of various algorithms on different problems. So, select various well-known and widely used algorithms, not necessarily the best algorithms overall.
  • E.g., Decision Trees, Neural Networks, Naïve Bayes, Nearest Neighbours, SVMs, etc.
• Situation 4: Reduces to Situation 1, where what matters is the search for an optimal classifier, or to Situation 3, where the purpose is of a more general and scientific nature.

Selecting Data Sets I
• The selection of data sets differs between Situations 1 and 4 on the one hand and Situations 2 and 3 on the other.
• Situations 1 and 4: We distinguish between two cases:
  • Case 1: There is just one data set of interest. Just use this data set.
  • Case 2: We are considering a class of data sets (e.g., data sets for text categorization). In this case, we should follow the recommendations for Situations 2 and 3, since data sets in the same class can have different characteristics (e.g., noise, class imbalances, etc.). The only difference is that the domains in this class will be more closely related than those in a wider study of the kind considered in Situations 2 and 3.

Selecting Data Sets II
• Situations 2 and 3: The first thing to do is determine the exact purpose of the study.
  • Case 1: To test a specific characteristic of a new algorithm or of various algorithms (e.g., their resilience to noise): select domains presenting the same characteristics.
  • Case 2: To test the general performance of a new algorithm or of various algorithms on a variety of domains with different characteristics: select varied domains, but watch the way in which you report the results. There may be a lot of variance from classifier to classifier and from one type of domain to another. It is best to cluster the kinds of domains on which classifiers excel or do poorly and to report the results cluster by cluster.

Selecting Data Sets III
• Situations 2 and 3 (Cont’d): Three questions remain:
  • Question 1: How many data sets are necessary / desirable?
  • Question 2: Where can we get these data sets?
  • Question 3: How do we select data sets from those available?

Selecting Data Sets IV
Situations 2 & 3: How many data sets?
• The number of domains necessary depends on the variance in the performance of the classifiers. As a rule of thumb, 3 to 5 domains within the same category of domains are desirable to begin with.
• Note: As domains get added, the multiplicity effect raised by [Salzberg, 1997] and [Jensen, 2001] should be considered (the more comparisons are run, the more likely it is that one of them appears significant by chance).
Situations 2 & 3: Where can we get these data sets?
• UCI Repository for machine learning or other repositories (but the collections may not be representative of reality).
• Directly from the Web (but gathering and cleaning a data collection is extremely time consuming).
• Artificial data sets (easy to build, unlimited in size, but too far removed from reality).
• Real-world inspired artificial data (real-world data sets artificially augmented; easy to build, closer to reality).
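
The multiplicity effect noted above can be made concrete with a small calculation. The sketch below assumes independent tests run at a common significance level alpha, which is an idealisation (tests on related domains are seldom independent). The function names are illustrative, and the Bonferroni adjustment is one common, conservative remedy that the slide itself does not name.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test level that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # Show how quickly the chance of a spurious "win" grows with the
    # number of comparisons, and the per-test level that compensates.
    for n in (1, 5, 10, 20):
        print(n, round(familywise_error_rate(0.05, n), 3), bonferroni_alpha(0.05, n))
```

With ten comparisons at alpha = 0.05, the chance of at least one spurious significant result is already about 40% under these assumptions.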

Selecting Data Sets V
Situations 2 & 3: How do we select data sets from those available?
• Select all those that are available and meet the constraints of the algorithms under study. For example, the UCI repository contains many data sets, but only a subset of these are multi-class, only a subset have nominal attributes only, only a subset have no missing attribute values, and so on.
• In order to increase the number of domains available for use by researchers or practitioners of Data Mining, some amendments can be made to the data sets so that as many of them as possible conform to the requirements of the classifiers.
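
The slide does not say which amendments are meant. As an illustration only, the sketch below shows two plausible ones: filling missing nominal values with the most frequent value, and reducing a multi-class problem to a binary one. The `"?"` missing-value marker, the row layout (one dict per example), and the function names are all assumptions.

```python
from collections import Counter

MISSING = "?"  # assumed missing-value marker


def impute_mode(rows, column):
    """Replace missing values in one nominal column by its most frequent value."""
    observed = [row[column] for row in rows if row[column] != MISSING]
    mode = Counter(observed).most_common(1)[0][0]
    return [
        {**row, column: mode if row[column] == MISSING else row[column]}
        for row in rows
    ]


def binarize_labels(labels, positive):
    """Turn a multi-class labelling into one class versus all the others."""
    return [1 if label == positive else 0 for label in labels]


if __name__ == "__main__":
    rows = [{"colour": "red"}, {"colour": "?"}, {"colour": "red"}, {"colour": "blue"}]
    print(impute_mode(rows, "colour"))
    print(binarize_labels(["a", "b", "c", "a"], positive="a"))
```

Any such amendment changes the learning problem, so it should be reported alongside the results.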

Selecting performance measures
• Situations 2 and 3: Caruana and Niculescu-Mizil, 2004 suggest that the root mean squared error (RMSE) is the best general-purpose metric, since it is the one that is best correlated with the other eight measures that they use. Researchers are, however, encouraged to use a variety of different metrics in order to discover the various strengths and shortcomings of each classifier and each domain more specifically.
• Situations 1 and 4: We distinguish between the following cases:
  • Balanced versus imbalanced domains: ROC
  • Certainty of the decision matters: B & K
  • All the classes matter: RMSE
  • The problem is binary but one class matters more than the other: Precision, Recall, F-measure, Sensitivity, Specificity, Likelihood Ratios.
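
Several of the measures listed above follow directly from a binary confusion matrix. The sketch below uses the standard textbook definitions and assumes that no denominator is zero; the function names and the example counts are illustrative. RMSE is computed here on predicted probabilities against 0/1 labels.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Standard measures from confusion-matrix counts (positive = class of interest)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": 2 * precision * recall / (precision + recall),
        "lr_positive": recall / (1 - specificity),
        "lr_negative": (1 - recall) / specificity,
    }


def rmse(y_true, p_pred):
    """Root mean squared error between 0/1 labels and predicted probabilities."""
    squared = [(y - p) ** 2 for y, p in zip(y_true, p_pred)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical imbalanced problem: 60 positives among 1000 examples.
    for name, value in binary_metrics(tp=40, fp=10, fn=20, tn=930).items():
        print(f"{name}: {value:.3f}")
    print("rmse:", round(rmse([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.1]), 3))
```

In this hypothetical example the accuracy is 0.97 even though a third of the positive examples are missed, which is why the class-specific measures matter when one class is more important than the other.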

Selecting an error estimation method and statistical test I
• If the data set is large enough (every testing set contains at least 30 examples) and if the statistic of interest to the user is parameterizable: cross-validation can be tried (but see the next slide).
• If the data set is particularly small, i.e., if some of the testing sets contain fewer than 30 or so examples: Bootstrapping or Randomization.
• If the statistic of interest does not have statistical tests associated with it: Bootstrapping or Randomization.
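
As a rough illustration of the two resampling options, the sketch below implements a paired randomization (sign-flip) test and a percentile bootstrap confidence interval for the mean difference between two classifiers. The accuracy values are made up, the function names are hypothetical, and other variants of both procedures exist.

```python
import random


def paired_randomization_test(a, b, n_resamples=10000, seed=0):
    """Two-sided Monte Carlo p-value for the mean paired difference.

    Under the null hypothesis the sign of each difference is arbitrary,
    so signs are flipped at random and the statistic is recomputed.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed - 1e-12:
            extreme += 1
    return (extreme + 1) / (n_resamples + 1)


def bootstrap_ci(values, n_resamples=10000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((1 - level) / 2 * n_resamples)]
    upper = means[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


# Hypothetical accuracies of two classifiers on the same ten test sets.
ACC_A = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81]
ACC_B = [0.78, 0.77, 0.80, 0.79, 0.80, 0.77, 0.79, 0.81, 0.78, 0.79]

if __name__ == "__main__":
    print("p-value:", paired_randomization_test(ACC_A, ACC_B))
    print("95% CI:", bootstrap_ci([x - y for x, y in zip(ACC_A, ACC_B)]))
```

Neither procedure relies on a normality assumption, which is what makes them usable for small test sets or for statistics with no associated parametric test.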

Selecting an error estimation method and statistical test II
• Question: How can one see whether cross-validation is appropriate for one's purposes?
• 2 ways:
  • Visual: plot the distribution and check its shape visually.
  • Apply a hypothesis test designed to see if the distribution is normal or not (e.g., Chi-squared goodness of fit, Kolmogorov-Smirnov goodness of fit, etc.).
• Since no practical distribution will be exactly normal, we must also look into the robustness of the various statistical methods considered. The t-test is quite robust to departures from the normality assumption.
• If the distribution is far from normal, non-parametric tests must be used.
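
A Kolmogorov-Smirnov style check can be sketched with the standard library alone. The code below computes the largest gap between the empirical distribution and a normal distribution fitted to the sample. The threshold 1.36 / sqrt(n) is the usual large-sample 5% critical value for a fully specified distribution; because the mean and standard deviation are estimated from the same data here, that threshold is only a rough, conservative guide (the Lilliefors variant of the test addresses this). The function names are illustrative.

```python
import math
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(sample):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, (i + 1) / n - cdf, cdf - i / n)
    return gap


def looks_normal(sample):
    """Rough screen only: compares the gap with the asymptotic 5% threshold."""
    return ks_statistic_normal(sample) < 1.36 / math.sqrt(len(sample))


if __name__ == "__main__":
    import random

    rng = random.Random(1)
    print(looks_normal([rng.gauss(0.0, 1.0) for _ in range(200)]))
    print(looks_normal([rng.expovariate(1.0) for _ in range(200)]))
```

Such a screen complements the visual check rather than replacing it.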

Selecting an error estimation method and statistical test III
• The robustness of a procedure is important, since it ensures that the reported significance level is close to the true one.
• However, robustness does not answer the question of whether efficient use is made of the data so that a false null hypothesis can be rejected.
• Power should therefore also be considered.
• The power of a test depends on the intrinsic nature of that test, but also on the shape and size of the population to which it is applied.
• Example: Parametric tests based on the normality assumption are generally as powerful as, or more powerful than, non-parametric tests based on ranks when the distribution has lighter tails than the normal distribution.

Selecting an error estimation method and statistical test IV
• But: Parametric tests based on the normality assumption are less powerful than non-parametric ones when the tails of the distribution are heavier than those of the normal distribution (an important kind of data presenting such distributions is data containing outliers).
• Note that the relative power of parametric and non-parametric tests does not change as a function of sample size, even if a test is asymptotically distribution free (i.e., if it becomes more and more robust as the sample size increases).
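
The effect of heavy tails on power can be checked with a small simulation. The sketch below compares a one-sample t-test with the Wilcoxon signed-rank test (normal approximation) on samples of 30 paired differences, first drawn from a normal distribution as a baseline and then from a normal distribution contaminated with 10% high-variance outliers. The shift of 0.5, the contamination model, the number of trials, and all names are assumptions chosen for illustration.

```python
import math
import random

N = 30                # sample size; the t critical value below is tied to it
T_CRIT_DF29 = 2.045   # two-sided 5% critical value of Student's t, 29 d.f.
Z_CRIT = 1.96         # two-sided 5% critical value of the standard normal
SHIFT = 0.5           # true location shift, so the null hypothesis is false


def t_test_rejects(xs):
    """One-sample t-test of 'mean = 0' (assumes len(xs) == N)."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return abs(m / math.sqrt(var / len(xs))) > T_CRIT_DF29


def wilcoxon_rejects(xs):
    """Wilcoxon signed-rank test of 'median = 0', normal approximation."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: abs(xs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if xs[i] > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return abs((w_plus - mu) / sigma) > Z_CRIT


def normal_draw(rng):
    return rng.gauss(SHIFT, 1.0)


def outlier_draw(rng):
    # 10% of the observations come from a much wider distribution.
    scale = 10.0 if rng.random() < 0.1 else 1.0
    return rng.gauss(SHIFT, scale)


def estimate_power(draw, trials=2000, seed=0):
    """Fraction of simulated samples on which each test rejects."""
    rng = random.Random(seed)
    t_hits = w_hits = 0
    for _ in range(trials):
        xs = [draw(rng) for _ in range(N)]
        t_hits += t_test_rejects(xs)
        w_hits += wilcoxon_rejects(xs)
    return t_hits / trials, w_hits / trials


if __name__ == "__main__":
    print("normal   (t, Wilcoxon):", estimate_power(normal_draw))
    print("outliers (t, Wilcoxon):", estimate_power(outlier_draw))
```

Under these assumptions the two tests should have similar power on the normal data, while the rank-based test should keep much more of its power once outliers are present.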
