
The basic notions related to machine learning



  1. The basic notions related to machine learning

2. Feature extraction
• It is a vital step before the actual learning: we have to create the input feature vector
• Obviously, the optimal feature set is task-dependent
• Ideally, the features are recommended by an expert of the given domain
• In practice, however, we (engineers) have to solve it ourselves
• A good feature set contains few, relevant features
• In many practical tasks it is not clear which features are relevant
  • E.g. influenza: fever is relevant, eye color is irrelevant, age is questionable
• When we are unsure, we might simply include the feature
• It is not that simple: including irrelevant features makes learning more difficult for two reasons
  • The curse of dimensionality
  • It introduces noise into the data that many algorithms have difficulty handling

3. Curse of dimensionality
• Too many features make learning more difficult
• The number of features = the number of dimensions of the feature space
• Learning becomes harder in higher-dimensional spaces
• Example: let's consider the following simple algorithm
  • Learning: we divide the feature space into little hypercubes and count the examples falling into each of them. We label each cube by the class that has the most examples in it
  • Classification: a new test case is always labeled by the label of the cube it falls into
• The number of cubes increases exponentially with the number of dimensions! (see the sketch below)
  • With a fixed number of examples, more and more cubes remain empty
  • More and more examples are required to reach a certain density of examples
• Real learning algorithms are cleverer, but the problem is the same
• More features require many more training examples
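
To make the cube-counting argument concrete, here is a minimal sketch (the uniform data, the number of bins per axis, and the sample size are assumptions chosen only for illustration); it measures what fraction of the grid cells contain at least one sample as the number of dimensions grows:

```python
# Minimal sketch of the cube-counting idea: uniform random samples in the unit
# hypercube, each axis cut into `bins` intervals.
import numpy as np

def occupied_cell_fraction(n_samples=1000, n_dims=1, bins=10, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.random((n_samples, n_dims))        # training points in [0, 1]^n_dims
    cells = np.floor(X * bins).astype(int)     # hypercube index along each dimension
    n_occupied = len({tuple(row) for row in cells})
    n_total = bins ** n_dims                   # grows exponentially with the dimension
    return n_occupied / n_total

for d in (1, 2, 3, 5, 8):
    print(d, occupied_cell_fraction(n_dims=d))  # the fraction of non-empty cubes collapses
```

Even with 1000 samples, the fraction of occupied cubes collapses after a handful of dimensions, which is exactly the sparsity problem described above.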

4. The effect of irrelevant features
• Irrelevant features may make the learning algorithms less efficient
• Example: the nearest neighbor method
  • Learning: we simply store the training examples
  • Classification: a new example gets the label of its nearest stored neighbor
• With good features, the points of the same class fall close to each other
• What if we include a noise-like feature? The points are scattered randomly along the new dimension, and the distance relations fall apart
• Most learning algorithms are cleverer than this, but their operation is also disturbed by an irrelevant (noise-like) feature
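
A small illustrative sketch of this effect, using a hand-rolled 1-nearest-neighbour classifier (the class layout, noise scale, and sample sizes are assumptions for demonstration only):

```python
# 1-nearest-neighbour accuracy with and without an irrelevant, noise-like feature.
import numpy as np

rng = np.random.default_rng(1)

def one_nn_accuracy(X_train, y_train, X_test, y_test):
    # label each test point by the class of its closest training point
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return np.mean(y_train[np.argmin(dists, axis=1)] == y_test)

n = 200
y = rng.integers(0, 2, size=2 * n)
informative = np.where(y == 0, -1.0, 1.0) + 0.3 * rng.standard_normal(2 * n)
noise = 5.0 * rng.standard_normal(2 * n)       # random scatter along an extra dimension

X_good = informative[:, None]
X_noisy = np.column_stack([informative, noise])

print(one_nn_accuracy(X_good[:n], y[:n], X_good[n:], y[n:]))    # close to 1.0
print(one_nn_accuracy(X_noisy[:n], y[:n], X_noisy[n:], y[n:]))  # noticeably lower
```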

5. Optimizing the feature space
• We usually try to pick the best features manually
• But of course, there are also automatic methods for this
• Feature selection algorithms
  • They retain M < N features from the original set of N features
• We can reduce the feature space not only by throwing away less relevant features, but also by transforming the feature space
• Feature space transformation methods
  • The new features are obtained by some combination of the old features
  • We usually also reduce the number of dimensions at the same time (the new feature space has fewer dimensions than the old one)
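
As a hedged illustration of one simple filter-style feature selection method (ranking by absolute correlation with the label; this is just one of many possible scores, not necessarily the one meant on the slide):

```python
# Keep the M features whose values correlate most strongly with the class label.
import numpy as np

def select_top_m(X, y, m):
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    keep = np.argsort(scores)[::-1][:m]       # indices of the M highest-scoring features
    return X[:, keep], keep

# X_reduced, kept_indices = select_top_m(X, y, m=10)   # retains M < N columns of X
```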

6. Evaluating the trained model
• Based on the training examples, the algorithm constructs a model (hypothesis) of the function (x1,…,xN) → c
• This model can guess the value of the function for any (x1,…,xN)
• Our main goal is not to perfectly learn the labels of the training samples, but to generalize to examples not seen during training
• How can we estimate the generalization ability?
  • We leave out a subset of the examples during training → the test set
• Evaluation (see the sketch below):
  • We run the model on the test set → estimated class labels
  • We compare the estimated labels with the true labels
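
A minimal sketch of the hold-out procedure described above (the 80/20 split ratio is arbitrary, and `model` below stands for any hypothetical trained classifier, not a specific library object):

```python
# Shuffle the data, cut off a test portion, train on the rest, evaluate on the held-out part.
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_ratio)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]

# X_train, y_train, X_test, y_test = train_test_split(X, y)
# model.fit(X_train, y_train)                            # training sees only the training part
# accuracy = np.mean(model.predict(X_test) == y_test)    # compare estimates with true labels
```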

7. Evaluating the trained model 2
• How can we quantify the estimation error for a regression task?
• Example: the algorithm outputs a straight line; the error is shown by the yellow arrows in the figure
• Summarizing the error indicated by the yellow arrows:
  • Mean squared error or root-mean-squared error
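
Written out, with y_i denoting the true target value and ŷ_i the model's estimate for the i-th of the n test samples (notation introduced here):

```latex
\[
\mathrm{MSE}  = \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat{y}_i - y_i\bigr)^2,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\hat{y}_i - y_i\bigr)^2}
\]
```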

8. Evaluating the trained model 3
• Quantifying the error for a classification task:
• Simplest solution: the classification error rate
  • The number of incorrectly classified test samples divided by the number of all test samples
• More detailed error analysis: with the help of the confusion matrix
  • It helps us understand which classes are missed by the algorithm
  • It also allows defining an error function that weights different mistakes differently
  • For this we can define a weight matrix over the cells of the confusion matrix
  • "0-1 loss": it weights the elements of the main diagonal by 0 and all other cells by 1
  • This is the same as the classification error rate
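
A small sketch of the confusion matrix and the cell-weighted error it enables (the tiny label lists below are made up for illustration):

```python
# Rows of the confusion matrix are true classes, columns are predicted classes.
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def weighted_error(cm, weights):
    # with the "0-1 loss" weights (0 on the diagonal, 1 elsewhere) this is
    # exactly the classification error rate
    return (cm * weights).sum() / cm.sum()

cm = confusion_matrix([0, 0, 1, 2, 2], [0, 1, 1, 2, 0], n_classes=3)
zero_one = 1 - np.eye(3)
print(weighted_error(cm, zero_one))           # 0.4 = 2 mistakes out of 5 test samples
```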

9. Evaluating the trained model 4
• We can also weight the different mistakes differently
• This is most common when we have only two classes
• Example: diagnosing an illness
  • The cost matrix is of size 2x2
  • Error 1, false negative: the patient is ill, but the machine said no
  • Error 2, false positive: the machine said yes, but the patient is not ill
  • These have different costs!
• Metrics preferred by doctors (see the figure):
  • Sensitivity: tp/(tp+fn)
  • Specificity: tn/(tn+fp)
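
For the two-class case, a tiny sketch that computes these metrics together with a cost-weighted error, where a false negative is penalized more heavily (all counts and cost values are purely illustrative):

```python
def two_class_metrics(tp, fn, fp, tn, cost_fn=10.0, cost_fp=1.0):
    sensitivity = tp / (tp + fn)              # ill patients correctly recognized
    specificity = tn / (tn + fp)              # healthy patients correctly cleared
    total_cost = cost_fn * fn + cost_fp * fp  # a missed illness costs more than a false alarm
    return sensitivity, specificity, total_cost

print(two_class_metrics(tp=80, fn=20, fp=50, tn=850))   # (0.8, ~0.944, 250.0)
```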

10. The "No Free Lunch" theorem
• There exists no universal learning algorithm that would outperform all other algorithms on all possible tasks
• The optimal learning algorithm is always task-dependent
• For every learning algorithm one can find a task on which it performs well, and a task on which it performs poorly
• Demonstration: method 1 and method 2 produce different hypotheses from the same examples. Which hypothesis is correct? It depends on the real distribution (see the figure)

11. The "No Free Lunch" theorem 2
• Put another way: the average performance of all training algorithms over "all possible tasks" is the same
• OK, but then what is the sense in constructing machine learning algorithms?
• We should concentrate on just one type of task rather than trying to solve all tasks with one algorithm!
• It makes sense to look for a good algorithm for, e.g., speech recognition or face recognition
• We should be very careful when making claims like "algorithm A is better than algorithm B"
• Machine learning databases exist for the objective evaluation of machine learning algorithms over a broad range of tasks
  • E.g. the UCI Machine Learning Repository

12. Generalization vs. overfitting
• No Free Lunch theorem: we can never be sure that the trained model generalizes correctly to the cases not seen during training
• But then, how should we choose from the possible hypotheses?
• Experience: increasing the complexity of the model increases its flexibility, so it becomes more and more accurate on the training examples
• However, at some point its performance starts dropping on the test examples!
• This phenomenon is called overfitting: after learning the general properties, the model starts to learn the peculiarities of the given finite training set

13. The "Occam's razor" heuristic
• Experience: usually the simpler model generalizes better
• But of course, a model that is too simple is not good either
• Einstein: "Things should be explained as simply as possible, but no simpler." This is practically the same as the Occam's razor heuristic
• The optimal model complexity is different for each task
• How can we find the optimum point shown in the figure?
• Theoretical approach: we formalize the complexity of a hypothesis
  • Minimum Description Length principle: we seek the hypothesis h for which K(h,D) = K(h) + K(D|h) is minimal
  • K(h): the complexity of hypothesis h
  • K(D|h): the complexity of representing the data set D with the help of hypothesis h
  • K(): Kolmogorov complexity

14. Bias and variance
• Another formalism for a model being "too simple" or "too complex"
• Here shown for the case of regression
• Example: we fit the red polynomial to the blue points; green is the optimal solution (see the figure)
• A polynomial of too low a degree cannot fit the examples → bias
• A polynomial of too high a degree fits the examples, but oscillates between them → variance
• Formally:
  • Let's select a random training set D with n elements, and run the training on it
  • Repeat this many times, and analyze the expectation of the squared error between the approximation g(x;D) and the original function F(x) at a given point x (written out below)
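
The expectation mentioned in the last bullet splits into the two terms discussed on the next slide; this is the standard bias-variance identity, written here with the slide's notation:

```latex
\[
\mathbb{E}_D\!\left[\bigl(g(x;D) - F(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}_D[g(x;D)] - F(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(g(x;D) - \mathbb{E}_D[g(x;D)]\bigr)^2\right]}_{\text{variance}}
\]
```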

15. Bias-variance trade-off
• Bias: the difference between the average of the estimates and F(x)
  • If it is not 0, then the model is biased: it has a tendency to over- or under-estimate F(x)
  • By increasing the model complexity (in our example, the degree of the polynomial) the bias decreases
• Variance: the variance of the estimates (their average difference from the average estimate)
  • A large variance is not good (we get quite different estimates depending on the choice of D)
  • Increasing the model complexity increases the variance
• Optimum: somewhere in between
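
A hedged simulation sketch of the trade-off: the target function F(x) = sin(2πx), the noise level, the evaluation point, and the candidate degrees are all assumptions chosen only to make the effect visible.

```python
# Estimate the bias and variance of polynomial regression at a single point x0
# by re-fitting the model on many random training sets D.
import numpy as np

def bias_variance_at_point(degree, x0=0.25, n_train=30, n_repeats=200, seed=0):
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_repeats):
        x = rng.random(n_train)                                  # a fresh training set D
        y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n_train)
        coeffs = np.polyfit(x, y, degree)                        # fit g(x; D)
        estimates.append(np.polyval(coeffs, x0))                 # its prediction at x0
    estimates = np.array(estimates)
    bias = estimates.mean() - np.sin(2 * np.pi * x0)             # E_D[g(x0;D)] - F(x0)
    variance = estimates.var()                                   # spread of the estimates
    return bias, variance

for d in (1, 3, 7):
    print(d, bias_variance_at_point(d))   # low degree: large bias; high degree: larger variance
```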

16. Finding the optimal complexity – a practical approach
• (Almost) all machine learning algorithms have meta-parameters
  • These allow us to tune the complexity of the model
  • E.g. polynomial fitting: the degree of the polynomial
  • They are called meta-parameters (or hyperparameters) to separate them from the real parameters (e.g. for polynomials: the coefficients)
• Different meta-parameter values result in slightly different models
• How can we find the optimal meta-parameters?
  • We separate a small validation (also called development) set from the training set
  • Altogether, our data is divided into train-dev-test sets
  • We repeat training on the train set several times with several meta-parameter values
  • We evaluate the obtained models on the dev set (to estimate the red curve of the figure)
  • Finally, we evaluate the model that performed best on the dev set on the test set (see the sketch below)
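
A minimal sketch of this procedure using polynomial regression, where the degree plays the role of the meta-parameter (the stand-in data set, the split sizes, and the candidate degrees are all illustrative assumptions):

```python
import numpy as np

def split_three_way(X, y, dev_ratio=0.15, test_ratio=0.15, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_dev, n_test = int(len(X) * dev_ratio), int(len(X) * test_ratio)
    dev, test, train = idx[:n_dev], idx[n_dev:n_dev + n_test], idx[n_dev + n_test:]
    return (X[train], y[train]), (X[dev], y[dev]), (X[test], y[test])

def mse(coeffs, X, y):
    return np.mean((np.polyval(coeffs, X) - y) ** 2)

rng = np.random.default_rng(42)
X = rng.random(300)
y = np.sin(2 * np.pi * X) + 0.2 * rng.standard_normal(300)   # stand-in data set

(tr_X, tr_y), (dev_X, dev_y), (te_X, te_y) = split_three_way(X, y)

# train once per candidate meta-parameter, pick the degree with the lowest dev error
models = {d: np.polyfit(tr_X, tr_y, d) for d in range(1, 10)}
best_degree = min(models, key=lambda d: mse(models[d], dev_X, dev_y))

# only the single selected model is evaluated on the untouched test set
print(best_degree, mse(models[best_degree], te_X, te_y))
```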

17. Finding the optimal complexity – example
• Let's assume that we have two classes, and we want to separate them with a polynomial
• The coefficients of the polynomial are the parameters of the model
• The degree of the polynomial is the meta-parameter
• What happens if we increase the degree?
• The optimal degree can be estimated with the help of the independent development set
