
Review Midterm Exam



Presentation Transcript


  1. Review Midterm Exam

  2. Introduction Definition of a learning system: a computer learning algorithm learns from experience E with respect to a class of tasks T, as measured by performance P.

  3. Introduction Elements in designing a learning system: • Define the knowledge to learn • Define the representation of the target knowledge • Define the learning mechanism

  4. Concept Learning The input-output space: X: the space of all possible examples (input space). Y: the space of classes (output space), here Y = {0, 1}. Only a small subset of X is contained in a given training database.

  5. Concept Learning The hypothesis space: the space of all hypotheses is represented by H. Let h be a hypothesis in H and let X be an example in the dataset. If h(X) = 1 then X is positive, otherwise X is negative. Our goal is to find the hypothesis h* that is very “close” to the target concept c. A hypothesis is said to “cover” those examples it classifies as positive.

  6. Concept Learning General-to-specific ordering in the hypothesis space: in the diagram, h1 is more general than h2 and h3; h2 and h3 are neither more specific nor more general than each other.

  7. Concept Learning Let hj and hk be two hypotheses mapping examples into {0,1}. We say hj is more general than or equal to hk iff for all examples X, hk(X) = 1 implies hj(X) = 1. We represent this fact as hj >= hk. The >= relation imposes a partial ordering over the hypothesis space H (it is reflexive, antisymmetric, and transitive).
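The more-general-than relation can be checked by brute force over a small finite example space. A minimal sketch, where hypotheses are plain Python predicates and the toy space and hypotheses are illustrative:

```python
def more_general_or_equal(hj, hk, X):
    """True iff hk(x) = 1 implies hj(x) = 1 for every example x in X."""
    return all(hj(x) == 1 for x in X if hk(x) == 1)

# Toy example space: integers 0..9.
# h1 covers even numbers; h2 covers multiples of 4.
X = range(10)
h1 = lambda x: 1 if x % 2 == 0 else 0
h2 = lambda x: 1 if x % 4 == 0 else 0

print(more_general_or_equal(h1, h2, X))  # True: h1 covers everything h2 covers
print(more_general_or_equal(h2, h1, X))  # False: h1 covers 2, but h2 does not
```

Note this only verifies the relation on a finite sample; over an infinite X the implication must be argued from the hypothesis representation itself.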

  8. Concept Learning The Version Space: the subset of hypotheses from the hypothesis space H that are consistent with the training set D.

  9. Concept Learning Candidate elimination algorithm: • Initialize G to the set of maximally general hypotheses in H • Initialize S to the set of maximally specific hypotheses in H • For each training example X do • If X is positive: generalize S if necessary* • If X is negative: specialize G if necessary* • Output {G,S} What are some properties of the candidate elimination algorithm? *understand details in these steps
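The full candidate elimination algorithm maintains both boundaries G and S; as a simplified sketch, the S-boundary update alone (essentially the Find-S generalization step) can be written for conjunctive hypotheses over discrete attributes, where each slot holds a value, '?' (anything), or '0' (nothing yet). The attribute values below are illustrative:

```python
def covers(h, x):
    """A conjunctive hypothesis covers x iff every slot matches or is '?'."""
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def generalize(h, x):
    """Minimally generalize h so that it covers positive example x."""
    return tuple(xv if hv == '0' else hv if hv == xv else '?'
                 for hv, xv in zip(h, x))

def find_s(examples):
    """examples: list of (attribute_tuple, label), label 1 = positive."""
    h = ('0',) * len(examples[0][0])   # maximally specific start
    for x, y in examples:
        if y == 1:
            h = generalize(h, x)       # generalize S on positives only
    return h

data = [(('sunny', 'warm', 'normal'), 1),
        (('sunny', 'warm', 'high'), 1),
        (('rainy', 'cold', 'high'), 0)]
print(find_s(data))  # ('sunny', 'warm', '?')
```

The G-boundary update (specializing on negatives) is the symmetric step the full algorithm adds on top of this.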

  10. Decision Trees What is a decision tree? A decision-tree learning algorithm approximates a target function using a tree representation, where each internal node corresponds to an attribute, and every terminal node corresponds to a class. What are appropriate problems for decision trees?

  11. Decision Trees Mechanism: There are different ways to construct trees from data. We will concentrate on the top-down, greedy search approach: Basic idea: 1. Choose the best attribute a* to place at the root of the tree. 2. Separate training set D into subsets {D1, D2, .., Dk} where each subset Di contains examples having the same value for a* 3. Recursively apply the algorithm on each new subset until examples have the same class or there are few of them.
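The three steps above can be sketched recursively. In this minimal version the "best attribute" choice is stubbed to the first remaining attribute for brevity (a real learner would rank attributes by a splitting function), and the dataset is an illustrative toy:

```python
from collections import Counter

def build_tree(D, attrs):
    """D: list of (attribute_tuple, label); attrs: attribute indices left."""
    labels = [y for _, y in D]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    a = attrs[0]                       # stand-in for the best attribute a*
    tree = {}
    for v in {x[a] for x, _ in D}:     # separate D into subsets by value of a*
        Dv = [(x, y) for x, y in D if x[a] == v]
        tree[v] = build_tree(Dv, attrs[1:])  # recurse on each subset
    return (a, tree)

data = [(('sunny', 'hot'), 'no'),
        (('rain', 'mild'), 'yes'),
        (('sunny', 'mild'), 'no')]
print(build_tree(data, [0, 1]))
```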

  12. Decision Trees Splitting Functions. Information Gain: IG(A) = H(S) − Σv (|Sv|/|S|) H(Sv), where H(S) is the entropy of all examples and H(Sv) is the entropy of the subset Sv of examples taking value v for attribute A.
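The splitting function above can be computed directly. A small sketch where S is a list of (example, label) pairs and the attribute A is an index into the example tuple:

```python
import math
from collections import Counter

def entropy(S):
    """H(S) = -sum over classes of p * log2(p)."""
    n = len(S)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(y for _, y in S).values())

def info_gain(S, A):
    """IG(A) = H(S) - sum over values v of (|Sv|/|S|) * H(Sv)."""
    remainder = 0.0
    for v in {x[A] for x, _ in S}:
        Sv = [(x, y) for x, y in S if x[A] == v]
        remainder += len(Sv) / len(S) * entropy(Sv)
    return entropy(S) - remainder
```

For a two-class sample split perfectly by an attribute, the gain equals the full entropy H(S) = 1 bit, the maximum possible.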

  13. Decision Trees Practical issues while building a decision tree can be enumerated as follows: • How deep should the tree be? • How do we handle continuous attributes? • What is a good splitting function? • What happens when attribute values are missing? • How do we improve the computational efficiency?

  14. Decision Trees Overfitting: Assume a hypothesis space H. We say a hypothesis h in H overfits a dataset D if there is another hypothesis h’ in H where h has better classification accuracy than h’ on the training data D but worse classification accuracy than h’ on unseen data D’. (Figure: accuracy vs. size of the tree; training-data accuracy keeps rising while testing-data accuracy falls once overfitting sets in.)

  15. Decision Trees Solutions to Overfitting: There are two main classes of solutions: 1) Stop growing the tree early, before it begins to overfit the data. In practice this solution is hard to implement because it is not clear what a good stopping point is. 2) Grow the tree until the algorithm stops, even if the overfitting problem shows up, and then prune the tree as a post-processing step. This method has found great popularity in the machine learning community.

  16. Decision Trees Tree Validation: Training and Validation Set Approach • Divide dataset D into a training set TR and a validation set TE • Build a decision tree on TR • Test pruned trees on TE to decide the best final tree.
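The first step of this approach is the split itself. A minimal sketch using a shuffled index split (the 70/30 fraction and the seed are illustrative choices, not prescribed by the slides):

```python
import random

def split(D, frac=0.7, seed=0):
    """Divide dataset D into a training set TR and a validation set TE."""
    D = list(D)
    random.Random(seed).shuffle(D)   # shuffle so the split is random
    cut = int(len(D) * frac)
    return D[:cut], D[cut:]          # (TR, TE)

TR, TE = split(range(10))
print(len(TR), len(TE))  # 7 3
```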

  17. Decision Trees • Understand: • Reduced-error pruning • Rule post-pruning • Discretization of numeric attributes • What to do with missing attributes.

  18. Neural Networks Perceptron: Definition: a step function based on a linear combination of real-valued inputs. If the combination is above a threshold it outputs 1, otherwise it outputs −1. (Figure: inputs x1..xn with weights w1..wn, plus x0 = 1 with weight w0, summed by Σ and thresholded to {1 or −1}.)
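The unit in the figure can be written in a few lines; the bias enters as the weight w0 on the constant input x0 = 1. The AND-like weights below are an illustrative choice:

```python
def perceptron(weights, x):
    """weights = [w0, w1, ..., wn]; x = [x1, ..., xn]. Outputs 1 or -1."""
    s = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if s > 0 else -1  # step function on the linear combination

# With w0 = -1.5 and unit weights, the unit behaves like logical AND.
print(perceptron([-1.5, 1.0, 1.0], [1, 1]))  # 0.5 > 0  -> 1
print(perceptron([-1.5, 1.0, 1.0], [1, 0]))  # -0.5 <= 0 -> -1
```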

  19. Neural Networks Representational Power: A perceptron can learn only examples that are called “linearly separable”. These are examples that can be perfectly separated by a hyperplane. (Figure: a cloud of + and − points split cleanly by a line, labeled linearly separable, next to an interleaved cloud labeled non-linearly separable.)

  20. Neural Networks What is the perceptron rule? What is the delta rule? What are their differences? • The perceptron rule is based on the output of a step function, whereas the delta rule uses the linear combination of inputs directly. • The perceptron rule is guaranteed to converge to a consistent hypothesis assuming the data is linearly separable. The delta rule converges in the limit but does not need the condition of linearly separable data.
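The difference between the two rules is exactly one line: whether the output fed into the update is thresholded or not. A side-by-side sketch for a single training example, where x includes x0 = 1 and the learning rate eta is an illustrative value:

```python
def step(s):
    return 1 if s > 0 else -1

def perceptron_update(w, x, t, eta=0.1):
    o = step(sum(wi * xi for wi, xi in zip(w, x)))  # thresholded output
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]

def delta_update(w, x, t, eta=0.1):
    o = sum(wi * xi for wi, xi in zip(w, x))        # unthresholded output
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
```

From zero weights with x = [1, 1] and target t = 1, the perceptron rule sees error t − o = 2 (since step(0) = −1) while the delta rule sees error 1, so the two produce different step sizes on the same example.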

  21. Neural Networks Backpropagation algorithm: The idea is to again use gradient descent over the space of weights to find a minimum of the error (with no guarantee of reaching the global minimum). • Create a network with nin input nodes, nhidden internal nodes, and nout output nodes. • Initialize all weights to small random numbers. • Until error is small do: for each example X, propagate example X forward through the network, then propagate errors backward through the network.
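The forward and backward propagation steps above can be sketched compactly for a network with one hidden layer of sigmoid units. Sizes, initial weights, and the learning rate below are illustrative, not prescribed:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(W_h, W_o, x):
    """Propagate x forward: hidden activations h, then outputs o."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W_h]
    o = [sigmoid(sum(w * hi for w, hi in zip(row, h))) for row in W_o]
    return h, o

def backprop_step(W_h, W_o, x, t, eta=0.5):
    h, o = forward(W_h, W_o, x)
    # propagate errors backward: output deltas, then hidden deltas
    d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    d_h = [hj * (1 - hj) * sum(d_o[k] * W_o[k][j] for k in range(len(W_o)))
           for j, hj in enumerate(h)]
    # gradient-descent weight updates
    W_o = [[w + eta * d_o[k] * h[j] for j, w in enumerate(row)]
           for k, row in enumerate(W_o)]
    W_h = [[w + eta * d_h[j] * x[i] for i, w in enumerate(row)]
           for j, row in enumerate(W_h)]
    return W_h, W_o
```

Repeating `backprop_step` on an example drives the network output toward the target t, illustrating the descent on the error surface.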

  22. Evaluating Hypotheses Bias vs Variance: • Bias in the estimate. Normally overoptimistic; to avoid it we use a separate set of data. • Variance in the estimate. The estimate varies from sample to sample; the smaller the sample, the larger the variance. (Figure: estimated accuracy vs. true accuracy, with the gap labeled bias and the spread labeled variance.)

  23. Evaluating Hypotheses Sample Error and True Error: Sample error: errorS(h) = 1/n Σ δ(f(X), h(X)), where f is the true target function, h is the hypothesis, and δ(a,b) = 1 if a ≠ b, 0 otherwise. True error: errorD(h) = P[f(X) ≠ h(X)] over the distribution D. How good is errorS(h) as an estimate of errorD(h)?
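The sample error is just the fraction of examples on which h disagrees with f. A tiny sketch with an illustrative target (parity) and a deliberately poor hypothesis:

```python
def sample_error(f, h, X):
    """Fraction of examples in X where hypothesis h disagrees with target f."""
    return sum(1 for x in X if f(x) != h(x)) / len(X)

f = lambda x: x % 2                 # target: is x odd?
h = lambda x: 1 if x > 4 else 0     # poor hypothesis: is x large?
print(sample_error(f, h, range(10)))  # 0.4 (disagree on 1, 3, 6, 8)
```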

  24. Evaluating Hypotheses Confidence Intervals: Assume the following conditions are present: • The sample has n examples drawn according to probability distribution D. • n > 30. • Hypothesis h has made r errors in the n examples. Then with probability of 95%, the true error lies in the interval: errorS(h) ± 1.96 √( errorS(h) (1 − errorS(h)) / n )
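Plugging the slide's quantities into this formula is a one-liner; the r = 12, n = 40 values below are illustrative:

```python
import math

def confidence_interval(r, n, z=1.96):
    """95% interval for the true error, assuming n > 30 (normal approx.)."""
    e = r / n                                  # observed sample error
    half = z * math.sqrt(e * (1 - e) / n)      # half-width of the interval
    return e - half, e + half

lo, hi = confidence_interval(r=12, n=40)
print(f"({lo:.3f}, {hi:.3f})")  # (0.158, 0.442)
```

Note how wide the interval is at n = 40: an observed error of 0.3 is consistent with true errors anywhere from about 0.16 to 0.44.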

  25. Evaluating Hypotheses The binomial distribution: The sampling error can be modeled using a binomial distribution. For example, suppose we have a dataset of size n = 40 and the true probability of error is 0.3. Then the expected number of errors is np = 40(0.3) = 12. (Figure: the binomial distribution plotted over the number of errors, 0 to 40.)
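The distribution in the plot can be reproduced from the binomial probability mass function, using the slide's n = 40 and p = 0.3:

```python
from math import comb

def binom_pmf(r, n, p):
    """P(exactly r errors in n trials with per-trial error rate p)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 40, 0.3
print(n * p)                # expected number of errors: 12.0
print(binom_pmf(12, n, p))  # probability mass at the expected count
```

The mass peaks at r = 12, matching the expected value np computed on the slide.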

  26. Evaluating Hypotheses Understand: How to estimate differences in error. How to compare learning algorithms.
