
Learning From Observation Part II




  1. Learning From Observation Part II KAIST Computer Science 20013221 박 명 제

  2. Contents • Using Information Theory • Learning General Logical Descriptions

  3. Using Information Theory 1. Introduction 2. Noise and Over-fitting 3. Issues related to decision trees

  4. Introduction • Information in the flip of a coin: "The less you know, the more valuable the information."

  5. History of Information Theory • C. E. Shannon's 1948 and 1949 papers • A Mathematical Theory of Communication • Provides a probabilistic theory of encoding, decoding, and transmission in communication systems • Provides a mathematical basis for measuring the information content of a message • Now used in cryptography, learning theory, etc.

  6. Amount of Information • Information content is measured in bits • Learning which of N equally likely cases occurred conveys log₂ N bits of information • More generally, if an event has probability P, learning that it occurred conveys log₂(1/P) = -log₂ P bits
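As a quick numeric check of the -log₂ P rule, here is a minimal Python sketch (the function name is mine, not from the slides):

```python
import math

def information_bits(p: float) -> float:
    """Bits of information conveyed by observing an event of probability p."""
    return -math.log2(p)

print(information_bits(1 / 2))   # fair coin flip: 1.0 bit
print(information_bits(1 / 6))   # one face of a fair die: ~2.585 bits
print(information_bits(1 / 4))   # one of 4 equally likely cases: 2.0 bits = log2(4)
```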

  7. Information Content (Entropy) • The average information content of the various events (the -log₂ P terms) weighted by the probabilities of the events: H = -Σᵢ pᵢ log₂ pᵢ • Called entropy, H • A measure of disorder, randomness, information, uncertainty, and complexity of choice • Maximized when all probabilities are equal to 1/n (the uniform distribution)
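A minimal sketch of entropy as the probability-weighted average of the -log₂ pᵢ terms (the function name is my own); the last line illustrates that the uniform distribution maximizes H:

```python
import math

def entropy(probs):
    """H = -sum(p * log2(p)), skipping zero-probability events."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # fair coin: 1.0 bit
print(entropy([0.99, 0.01]))   # nearly certain outcome: ~0.081 bits
print(entropy([1 / 6] * 6))    # fair die: ~2.585 bits, the maximum for 6 outcomes
```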

  8. Information Gain (1/2) • Using the restaurant problem (page 534) • Information needed for a training set with p positive and n negative examples: I(p/(p+n), n/(p+n)) • Remainder(A): the information still needed after testing attribute A, whose values split the examples into subsets with pᵢ positive and nᵢ negative examples: Remainder(A) = Σᵢ (pᵢ+nᵢ)/(p+n) · I(pᵢ/(pᵢ+nᵢ), nᵢ/(pᵢ+nᵢ))

  9. Information Gain (2/2) • Definition: the difference between the original information requirement and the new requirement, Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A) • Select the attribute with the maximum value of Gain(A) • Example: Patrons has the highest gain of any of the attributes and would be chosen by the decision-tree learning algorithm as the root
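Putting I and Remainder together, a sketch of Gain(A) for a Boolean classification; the Patrons-style counts below are meant to be illustrative of the restaurant data rather than an exact copy of the page-534 table:

```python
import math

def I(p, n):
    """Information needed for a set with p positive and n negative examples."""
    if p == 0 or n == 0:
        return 0.0
    pp, pn = p / (p + n), n / (p + n)
    return -(pp * math.log2(pp) + pn * math.log2(pn))

def gain(p, n, subsets):
    """subsets: one (p_i, n_i) pair of counts per value of attribute A."""
    remainder = sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in subsets)
    return I(p, n) - remainder

# 6 positive / 6 negative examples split three ways by a Patrons-like attribute
print(gain(6, 6, [(0, 2), (4, 0), (2, 4)]))   # ~0.541 bits: informative split
print(gain(6, 6, [(3, 3), (3, 3)]))           # 0.0 bits: useless split
```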

  10. Noise and Over-fitting (1/3) • Noise • Two or more examples with the same descriptions but different classifications • Over-fitting • Example: rolling a die (p. 542) • The algorithm makes spurious distinctions • Be careful not to use the resulting freedom to find meaningless "regularity" in the data

  11. Noise and Over-fitting (2/3) • To prevent over-fitting • Decision tree pruning • Prevent splitting on irrelevant attributes • How to find irrelevant attributes? • Attributes with very small information gain • Chi-squared pruning • Measure how much the split deviates from what a clearly irrelevant attribute would produce, by comparing the actual numbers of positive and negative examples in each subset with the numbers expected under irrelevance • The probability that the attribute is really irrelevant can be calculated with the help of standard chi-squared tables
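A hedged sketch of the chi-squared deviation used for pruning (variable names are my own): compare the observed positive/negative counts in each subset with the counts expected if the attribute carried no information.

```python
def chi_squared_deviation(p, n, subsets):
    """Deviation D of the observed split from the counts expected under
    irrelevance: expected positives in a subset are p * (p_i + n_i) / (p + n)."""
    total = 0.0
    for pi, ni in subsets:
        expected_p = p * (pi + ni) / (p + n)
        expected_n = n * (pi + ni) / (p + n)
        if expected_p > 0:
            total += (pi - expected_p) ** 2 / expected_p
        if expected_n > 0:
            total += (ni - expected_n) ** 2 / expected_n
    return total

# Under irrelevance, D follows a chi-squared distribution with (v - 1) degrees
# of freedom, where v is the number of attribute values; a standard table (or
# scipy.stats.chi2) then gives the probability that the attribute is irrelevant.
print(chi_squared_deviation(6, 6, [(0, 2), (4, 0), (2, 4)]))   # ~6.67: likely relevant
print(chi_squared_deviation(6, 6, [(3, 3), (3, 3)]))           # 0.0: prune this split
```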

  12. Noise and Over-fitting (3/3) • To prevent over-fitting • Cross-validation • Estimate how well the current hypothesis will predict unseen data • Set aside some fraction of the known data and use it to test the prediction performance of a hypothesis induced from the rest of the known data
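A minimal holdout sketch of the idea (the 80/20 split, the names, and the toy learner are my choices, not from the slides):

```python
import random

def holdout_accuracy(examples, induce, test_fraction=0.2, seed=0):
    """Set aside a fraction of the examples, induce a hypothesis from the rest,
    and return its accuracy on the held-out examples."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    test, train = shuffled[:cut], shuffled[cut:]
    hypothesis = induce(train)
    return sum(1 for x, label in test if hypothesis(x) == label) / len(test)

# Toy usage: `induce` would normally be a decision-tree learner; here it is a
# stand-in that ignores the training data.
examples = [({"patrons": "Some" if i % 2 else "Full"}, bool(i % 2)) for i in range(20)]
induce = lambda train: (lambda x: x["patrons"] == "Some")
print(holdout_accuracy(examples, induce))   # 1.0 on this trivially separable toy data
```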

  13. Broadening the Applicability • Missing data • In many domains, not all attribute values will be known for every example • Multi-valued attributes • When an attribute has a large number of possible values, the information gain measure gives an inappropriate indication of the attribute's usefulness • Continuous-valued attributes • Discretize the attribute • Example: Price in the restaurant problem (page 534)
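One common way to discretize a continuous attribute such as Price is to consider split points midway between consecutive sorted values whose classifications differ; a sketch with illustrative numbers (not the textbook's):

```python
def candidate_split_points(values_and_labels):
    """Midpoints between consecutive sorted values whose labels differ."""
    data = sorted(values_and_labels)
    return [(v1 + v2) / 2
            for (v1, l1), (v2, l2) in zip(data, data[1:])
            if l1 != l2 and v1 != v2]

# Illustrative prices paired with WillWait labels
prices = [(5, True), (8, True), (12, False), (20, False), (25, True)]
print(candidate_split_points(prices))   # [10.0, 22.5]
```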

  14. Learning General Logical Descriptions • Introduction • Current-best-hypothesis Search • Least-commitment Search

  15. Introduction (1/3) • Steps to find hypotheses • Start out with a goal predicate (generically called Q) • Q will be a unary predicate • Find an equivalent logical expression that we can use to classify examples correctly

  16. Introduction (2/3) • Hypothesis = candidate definition + goal predicate • Hypothesis space • The set of all hypotheses • H denotes the hypothesis space • The learning algorithm believes that one of the hypotheses is correct, that is, it believes the sentence H1 ∨ H2 ∨ … ∨ Hn

  17. Introduction (3/3) • Ways of being inconsistent with an example • An example can be a false negative for the hypothesis • The hypothesis says negative, but in fact it is positive • An example can be a false positive for the hypothesis • The hypothesis says positive, but in fact it is negative • The goal is to make the hypothesis consistent with the entire example set
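These two failure modes are easy to pin down in code; here is a small sketch with the hypothesis represented as a Python predicate (the representation and names are my own):

```python
def check(hypothesis, example, actual_label):
    """Return how an example relates to a hypothesis (a boolean predicate)."""
    predicted = hypothesis(example)
    if predicted == actual_label:
        return "consistent"
    return "false negative" if actual_label else "false positive"

will_wait = lambda x: x["patrons"] == "Some"         # candidate definition
print(check(will_wait, {"patrons": "Some"}, True))   # consistent
print(check(will_wait, {"patrons": "Full"}, True))   # false negative: says no, actually yes
print(check(will_wait, {"patrons": "Some"}, False))  # false positive: says yes, actually no
```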

  18. Current-best-hypothesis Search(1/6) • Main Idea • Maintain a single hypothesis, and adjust it as new examples arrive in order to maintain consistency

  19. Current-best-hypothesis Search (2/6) • When a new example e arrives • If e is consistent with the hypothesis h • Do nothing • If e is a false negative for h • Generalize h to include e • If e is a false positive for h • Specialize h to exclude e
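A deliberately naive sketch of this update rule, with hypotheses as Python predicates; here generalization simply disjoins a clause accepting the new example and specialization conjoins one rejecting it, which is only the crudest of the adjustments the real algorithm might choose:

```python
def generalize(h, example):
    """Weaken h so that it also accepts `example` (minimal, memorizing fix)."""
    return lambda x: h(x) or x == example

def specialize(h, example):
    """Strengthen h so that it rejects `example` (minimal, memorizing fix)."""
    return lambda x: h(x) and x != example

def update(h, example, actual_label):
    predicted = h(example)
    if predicted == actual_label:
        return h                              # consistent: do nothing
    if actual_label:
        return generalize(h, example)         # false negative
    return specialize(h, example)             # false positive

h = lambda x: x["alternate"]                                   # cf. H1
h = update(h, {"alternate": True, "patrons": "Full"}, False)   # false positive -> specialize
h = update(h, {"alternate": False, "patrons": "Some"}, True)   # false negative -> generalize
print(h({"alternate": False, "patrons": "Some"}))              # True: now consistent
```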

  20. Current-best-hypothesis Search (3/6) • Generalization & specialization • Describe the logical relationship between hypotheses • If C implies D, then D is a generalization of C • If D implies C, then D is a specialization of C • In the hypothesis space, a generalization therefore covers a superset of the examples covered by the original hypothesis, and a specialization covers a subset

  21. Current-best-hypothesis Search (4/6) • Examples from the restaurant problem (page 534) • The first example x1 is positive. • H1: ∀x WillWait(x) ⇔ Alternate(x) • The second example x2 is negative. • H2: ∀x WillWait(x) ⇔ Alternate(x) ∧ Patrons(x, Some) • H1 predicts it to be positive, so it is a false positive -> specialization of H1 • The third example x3 is positive. • H3: ∀x WillWait(x) ⇔ Patrons(x, Some) • H2 predicts it to be negative, so it is a false negative -> generalization of H2 • The fourth example x4 is positive. • H4: ∀x WillWait(x) ⇔ Patrons(x, Some) ∨ (Patrons(x, Full) ∧ Fri/Sat(x)) • H3 predicts it to be negative, so it is a false negative -> generalization of H3
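The sequence H1-H4 can also be written directly as predicates over an example's attributes; the dictionary below is an illustrative stand-in for the fourth restaurant example, not an exact row of the page-534 table:

```python
H1 = lambda x: x["alternate"]
H2 = lambda x: x["alternate"] and x["patrons"] == "Some"
H3 = lambda x: x["patrons"] == "Some"
H4 = lambda x: x["patrons"] == "Some" or (x["patrons"] == "Full" and x["fri_sat"])

# A full restaurant on a Friday night where the agent did wait:
x4 = {"alternate": False, "patrons": "Full", "fri_sat": True}
print(H3(x4), H4(x4))   # False True -> H3 was a false negative; H4 generalizes it away
```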

  22. Current-best-hypothesis Search (5/6) • Simple, but defined nondeterministically • There may be several possible specializations or generalizations that can be applied • Does not necessarily lead to the simplest hypothesis • May lead to an unrecoverable dead end, in which case the program must backtrack to an earlier choice

  23. Current-best-hypothesis Search (6/6) • With a large number of instances and a large hypothesis space, other difficulties arise • Checking all the previous instances over again for each modification is very expensive • It is difficult to find good search heuristics, and backtracking all over the place can take forever because the hypothesis space is so large

  24. Least-commitment search (1/7) • Main idea • Keep around all and only those hypotheses that are consistent with all the data so far • Remove hypotheses that are inconsistent with each new example • Version space • The set of hypotheses remaining after elimination • Algorithm • Known as version space learning or the candidate elimination algorithm
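A deliberately naive sketch of this idea: enumerate a small, hand-picked hypothesis space and keep only the hypotheses consistent with every example seen so far. (The boundary-set representation on the following slides exists precisely because realistic hypothesis spaces are far too large for such enumeration.)

```python
def filter_version_space(version_space, example, actual_label):
    """Keep only the hypotheses that classify `example` correctly."""
    return [(name, h) for name, h in version_space if h(example) == actual_label]

# Tiny illustrative hypothesis space: (name, predicate) pairs
version_space = [
    ("always-wait",   lambda x: True),
    ("never-wait",    lambda x: False),
    ("some-patrons",  lambda x: x["patrons"] == "Some"),
    ("has-alternate", lambda x: x["alternate"]),
]
version_space = filter_version_space(version_space, {"patrons": "Some", "alternate": False}, True)
version_space = filter_version_space(version_space, {"patrons": "Full", "alternate": True}, False)
print([name for name, _ in version_space])   # ['some-patrons']
```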

  25. Least-commitment search (2/7) • Properties • Incremental algorithm • Never has to go back and re-examine old examples • Least-commitment algorithm • Makes no arbitrary choices • Problem • The hypothesis space is enormous • Solution: use a boundary-set ("interval") representation that specifies only the boundaries of the set of consistent hypotheses

  26. Least-commitment search (3/7) • Partial ordering on the hypothesis space • Given by the generalization/specialization relationship • Boundary sets • G-set: the most general boundary • No consistent hypotheses are more general • S-set: the most specific boundary • No consistent hypotheses are more specific • Everything in between is guaranteed to be consistent with the examples

  27. Least-commitment search(4/7) • Learning strategy • Needs the initial version space to represent all possible hypotheses • G-set : contains only ‘True’ • S-set : contains only ‘False’ • Two properties to show that the representation is sufficient • Every consistent hypothesis is more specific than some member of the G-set, and more general than some member of the S-set • Every hypothesis more specific than some member of the G-set and more general than some member of the S-set is a consistent hypothesis

  28. Least-commitment search (5/7) • Updating S and G for a new example • False positive for s • s is too general, and has no consistent specializations, so throw it out of the S-set • False negative for s • s is too specific, so replace it by its immediate generalizations • False positive for g • g is too general, so replace it by its immediate specializations • False negative for g • g is too specific, and has no consistent generalizations, so throw it out of the G-set

  29. Least-commitment search(6/7) • Algorithm termination • Exactly one concept left in version space • Return it as unique hypothesis • The version space collapses – either S or G becomes empty • No consistent hypothesis for the training set • Learning failed • Run out of examples with several hypotheses remaining in the version space • The remaining version space represents a disjunction of hypotheses

  30. Least-commitment search (7/7) • Discussion • Noise or insufficient attributes: the version space will always collapse • No completely successful solution to this problem has been found • Disjunction problem • Handled by allowing limited forms of disjunction • Or by including a generalization hierarchy of more general predicates • Example: WaitEstimate(x, 30-60) ∨ WaitEstimate(x, >60) ⇒ LongWait(x)
