
Machine Learning CMPT 726 Simon Fraser University

Presentation Transcript


  1. Machine Learning CMPT 726 Simon Fraser University CHAPTER 1: INTRODUCTION

  2. Outline • Comments on general approach. • Probability Theory. • Joint, conditional and marginal probabilities. • Random Variables. • Functions of R.V.s • Bernoulli Distribution (Coin Tosses). • Maximum Likelihood Estimation. • Bayesian Learning With Conjugate Prior. • The Gaussian Distribution. • Maximum Likelihood Estimation. • Bayesian Learning With Conjugate Prior. • More Probability Theory. • Entropy. • KL Divergence.

  3. Our Approach • The course generally follows a statistical approach, but the field is very interdisciplinary. • Emphasis on predictive models: guess the value(s) of target variable(s) (“Pattern Recognition”). • Generally a Bayesian approach, as in the text. • Compared to standard Bayesian statistics: • more complex models (neural nets, Bayes nets) • more discrete variables • more emphasis on algorithms and efficiency

  4. Things Not Covered • Within statistics: hypothesis testing, frequentist theory, learning theory. • Other types of data (not random samples): relational data, scientific data (automated scientific discovery). • Action + learning = reinforcement learning. Could be optional – what do you think?

  5. Probability Theory Apples and Oranges

  6. Probability Theory Marginal Probability Conditional Probability Joint Probability
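The defining formulas on this slide were images and did not come through in the transcript. With n_ij the number of observations having X = x_i and Y = y_j, c_i the number having X = x_i, and N the total, the standard definitions are:

    Joint:        p(X = x_i, Y = y_j) = n_ij / N
    Marginal:     p(X = x_i) = c_i / N = Σ_j p(X = x_i, Y = y_j)
    Conditional:  p(Y = y_j | X = x_i) = n_ij / c_i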

  7. Probability Theory Sum Rule Product Rule

  8. The Rules of Probability Sum Rule Product Rule
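In symbols (reconstructed here, since the slide equations were images), the two rules read:

    Sum rule:     p(X) = Σ_Y p(X, Y)
    Product rule: p(X, Y) = p(Y | X) p(X)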

  9. Bayes’ Theorem posterior ∝ likelihood × prior

  10. Bayes’ Theorem: Model Version • Let M be model, E be evidence. • P(M|E) proportional to P(M) x P(E|M) • Intuition • prior = how plausible is the event (model, theory) a priori before seeing any evidence. • likelihood = how well does the model explain the data?
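Spelled out (a standard restatement, not shown on the slide), Bayes’ theorem in this model/evidence notation is:

    P(M | E) = P(E | M) P(M) / P(E),   where P(E) = Σ_M P(E | M) P(M) is a normalizing constant.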

  11. Probability Densities
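The defining equations (standard forms; the slide content was an image): a probability density p(x) satisfies

    p(x ∈ (a, b)) = ∫_a^b p(x) dx,   p(x) ≥ 0,   ∫ p(x) dx = 1,

and the cumulative distribution is P(z) = ∫_{-∞}^z p(x) dx.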

  12. Transformed Densities

  13. Expectations Conditional Expectation (discrete) Approximate Expectation (discrete and continuous)
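The standard definitions behind this slide (the originals were images):

    Discrete:     E[f] = Σ_x p(x) f(x)
    Continuous:   E[f] = ∫ p(x) f(x) dx
    Conditional:  E_x[f | y] = Σ_x p(x | y) f(x)
    Approximate:  E[f] ≈ (1/N) Σ_{n=1}^N f(x_n), using samples x_1, ..., x_N drawn from p(x)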

  14. Variances and Covariances
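Written out (standard definitions, not captured in the transcript):

    var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]²
    cov[x, y] = E_{x,y}[x y] − E[x] E[y]
    cov[x, y] = E_{x,y}[x yᵀ] − E[x] E[yᵀ]   (vector case)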

  15. The Gaussian Distribution
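The density itself (standard form; it appeared on the slide as an image):

    N(x | μ, σ²) = 1/√(2πσ²) · exp(−(x − μ)² / (2σ²))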

  16. Gaussian Mean and Variance
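For completeness (standard results, reconstructed):

    E[x] = μ,   E[x²] = μ² + σ²,   var[x] = E[x²] − E[x]² = σ²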

  17. The Multivariate Gaussian
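The D-dimensional density (standard form, reconstructed):

    N(x | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)),

where μ is the mean vector and Σ is the D × D covariance matrix.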

  18. Reading exponential probability formulas • In an infinite space we cannot simply assign equal probability to every outcome: the sum Σ_x p(x) would grow to infinity. • Instead, use an exponential decay, e.g. p(n) = (1/2)^n. • Suppose there is a relevant feature f(x) and we want to express that “the greater f(x) is, the less probable x is”: use p(x) = exp(−f(x)).
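As a quick check (not on the slide): the exponential decay really does define a valid distribution over an infinite space, since the geometric series sums to one:

    Σ_{n=1}^∞ (1/2)^n = 1.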

  19. Example: exponential form and sample size Fair coin: the larger the sample size n, the less likely any particular sample is: p(n) = 2^(−n). (Plot: ln p(n) against sample size n.)

  20. Exponential Form: Gaussian mean The further x is from the mean μ, the less likely it is. (Plot: ln p(x) against (x − μ)².)

  21. Smaller variance decreases probability The smaller the variance σ², the less likely x is (away from the mean). (Plot: ln p(x) against σ².)

  22. Minimal energy = max probability The greater the energy E(x) of the joint state, the less probable the state is. (Plot: ln p(x) against E(x).)
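Written out (a standard form; the slide itself shows only the plot), “lower energy = higher probability” corresponds to

    p(x) = (1/Z) exp(−E(x)),   Z = Σ_x exp(−E(x)),

where Z normalizes the distribution.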

  23. Gaussian Parameter Estimation Likelihood function
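The likelihood of i.i.d. data x = (x_1, ..., x_N) under a Gaussian (standard form, reconstructed):

    p(x | μ, σ²) = Π_{n=1}^N N(x_n | μ, σ²)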

  24. Maximum (Log) Likelihood
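Maximizing ln p(x | μ, σ²) gives the closed-form solutions μ_ML = (1/N) Σ_n x_n and σ²_ML = (1/N) Σ_n (x_n − μ_ML)². Below is a minimal numerical sketch (not from the slides; the function name and test values are made up):

    # Illustrative sketch: closed-form ML estimates for a univariate Gaussian.
    import numpy as np

    def gaussian_mle(x):
        """Return (mu_ML, sigma2_ML); note sigma2_ML divides by N, not N - 1."""
        x = np.asarray(x, dtype=float)
        mu_ml = x.mean()
        sigma2_ml = ((x - mu_ml) ** 2).mean()   # biased estimator
        return mu_ml, sigma2_ml

    rng = np.random.default_rng(0)
    samples = rng.normal(loc=1.5, scale=2.0, size=1000)
    print(gaussian_mle(samples))   # roughly (1.5, 4.0)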

  25. Properties of μ_ML and σ²_ML • E[μ_ML] = μ (unbiased). • E[σ²_ML] = ((N − 1)/N) σ², i.e. the ML variance is biased low; rescaling by N/(N − 1) gives an unbiased estimate.

  26. Curve Fitting Re-visited

  27. Maximum Likelihood Determine w_ML by minimizing the sum-of-squares error E(w).
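The error being minimized is the usual sum-of-squares (reconstructed; it appeared on the slide as an image):

    E(w) = (1/2) Σ_{n=1}^N ( y(x_n, w) − t_n )²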

  28. Predictive Distribution

  29. Frequentism vs. Bayesianism Frequentists: probabilities are measured as the frequencies of repeatable events, e.g., coin flips, snowfalls in January. Bayesians: in addition, allow probabilities to be attached to parameter values (e.g., P(μ = 0)). Frequentist model selection: give performance guarantees (e.g., 95% of the time the method is right). Bayesian model selection: choose a prior distribution over parameters, maximize the resulting cost function (the posterior).

  30. MAP: A Step towards Bayes Determine w_MAP by minimizing the regularized sum-of-squares error Ẽ(w).

  31. Bayesian Curve Fitting

  32. Bayesian Predictive Distribution

  33. Model Selection Cross-Validation
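As a concrete illustration (not from the slides; the function name and data are made up), here is S-fold cross-validation used to compare polynomial degrees M, with NumPy's polyfit doing the least-squares fit:

    # Illustrative sketch: pick polynomial degree M by cross-validated error.
    import numpy as np

    def cv_error(x, t, degree, n_folds=5, seed=0):
        """Mean held-out sum-of-squares error of a degree-`degree` polynomial,
        averaged over n_folds random folds."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(x))
        folds = np.array_split(idx, n_folds)
        errors = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            w = np.polyfit(x[train], t[train], degree)      # fit on training folds
            residual = np.polyval(w, x[test]) - t[test]     # evaluate on held-out fold
            errors.append(0.5 * np.sum(residual ** 2))
        return float(np.mean(errors))

    x = np.linspace(0.0, 1.0, 30)
    t = np.sin(2 * np.pi * x) + np.random.default_rng(1).normal(0.0, 0.2, size=30)
    for M in (1, 3, 9):
        print(M, cv_error(x, t, M))   # the middle degree typically wins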

  34. Curse of Dimensionality Rule of Thumb: 10 datapoints per parameter.

  35. Curse of Dimensionality Polynomial curve fitting, M = 3 Gaussian Densities in higher dimensions

  36. Decision Theory Inference step: determine either p(x, t) or p(t | x). Decision step: for given x, determine the optimal t.

  37. Minimum Misclassification Rate
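In formulas (standard decomposition, reconstructed): with decision regions R_1, R_2 for classes C_1, C_2,

    p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1) = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx,

which is minimized by assigning each x to the class with the larger posterior p(C_k | x).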

  38. Minimum Expected Loss Example: classify medical images as ‘cancer’ or ‘normal’. (Loss matrix L_kj: rows indexed by the true class, columns by the decision.)

  39. Minimum Expected Loss Regions are chosen to minimize
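The quantity being minimized (standard form, reconstructed) is the expected loss

    E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, C_k) dx,

where L_kj is the loss for deciding class j when the truth is class k; it is minimized by assigning each x to the class j that minimizes Σ_k L_kj p(C_k | x).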

  40. Why Separate Inference and Decision? • Minimizing risk (the loss matrix may change over time). • Unbalanced class priors. • Combining models.

  41. Decision Theory for Regression Inference step: determine p(t | x). Decision step: for given x, make the optimal prediction y(x) for t. Loss function: L(t, y(x)).

  42. The Squared Loss Function
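For the squared loss (standard result, reconstructed):

    E[L] = ∫∫ ( y(x) − t )² p(x, t) dx dt,

and the minimizing prediction is the conditional mean, y(x) = E[t | x] = ∫ t p(t | x) dt.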

  43. Generative vs Discriminative Generative approach: model the class-conditional densities p(x | C_k) and priors p(C_k), then use Bayes’ theorem to obtain p(C_k | x). Discriminative approach: model the posterior p(C_k | x) directly.

  44. Entropy • Important quantity in • coding theory • statistical physics • machine learning

  45. Entropy
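The definition (reconstructed; the slide content was an image): for a discrete random variable x,

    H[x] = − Σ_x p(x) log₂ p(x)   (in bits; use the natural log for nats).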

  46. Entropy Coding theory: x is discrete with 8 possible states; how many bits are needed to transmit the state of x? Suppose all states are equally likely.
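Worked answer (not spelled out in the transcript): with 8 equally likely states,

    H[x] = − 8 × (1/8) log₂(1/8) = 3 bits,

so 3 bits suffice to transmit the state.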

  47. Entropy

  48. The Maximum Entropy Principle Commonly used principle for model selection: maximize entropy. Example: in how many ways can N identical objects be allocated to M bins? Entropy is maximized when the objects are spread evenly, i.e. all bins are equally likely (p_i = 1/M for every bin).

  49. Differential Entropy and the Gaussian Put bins of width Δ along the real line. Differential entropy is maximized (for a fixed variance σ²) when p(x) is Gaussian, in which case H[x] = (1/2)(1 + ln(2πσ²)).

  50. Conditional Entropy
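The key identities (standard, reconstructed):

    H[y | x] = − ∫∫ p(y, x) ln p(y | x) dy dx,   and   H[x, y] = H[y | x] + H[x],

i.e. the information needed to specify x and y is the information for x plus the additional information for y given x.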
