
Boosting (Part II) and Self-Organizing Maps



  1. Boosting (Part II) and Self-Organizing Maps By Marc Sobel

  2. Generalized Boosting • Consider the problem of analyzing surveys. A large variety of people are surveyed to determine how likely they are to vote for conviction on juries. It is advantageous to design surveys that link gender, political affiliation, etc. to conviction. It is also advantageous to divide conviction ordinally into 5 categories corresponding to how strongly people feel about conviction.

  3. Generalized Boosting example • For the response variable, we have 5 separate values; the higher the response, the greater the tendency to convict. We would like to predict how likely participants are to vote for conviction based on their sex, participation in sports, etc. Logistic discrimination does not work in this example because it does not capture the complicated relationships between the predictor and response variables.

  4. Generalized boosting for the conviction problem • We assign a score h(x,y)=ηφ{|y−ycorrect|/σ} which increases in proportion to how close y is to the correct response ycorrect. • We put weights on all possible responses (xi,y), both for y=yi and for y≠yi. We update not only the former but also the latter weights in a two-stage procedure: first, we update the weights across cases; second, we update the weights within each single case.
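
  Written out, the score is (the transcript does not specify the kernel φ; a Gaussian kernel is assumed here only for illustration):

      h(x_i, y) \;=\; \eta\,\varphi\!\left(\frac{|y - y_{\mathrm{correct}}|}{\sigma}\right),
      \qquad \varphi(z) = e^{-z^2/2},

  so the score is largest when y equals the correct response and decays as y moves away from it.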

  5. Generalized boosting (part 1) • We update weights via:
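
  The update formula is not reproduced in the transcript; a plausible reconstruction, following the AdaBoost.M2 rule for a matrix of weights w_t(i, y) over cases i and labels y ≠ yi, is:

      w_{t+1}(i, y) \;=\; \frac{w_t(i, y)\,\beta_t^{\,\frac{1}{2}\left(1 + h_t(x_i, y_i) - h_t(x_i, y)\right)}}{Z_t},
      \qquad \beta_t = \frac{\varepsilon_t}{1 - \varepsilon_t},

  where ε_t is the pseudo-loss of the t-th weak learner (defined two slides below) and Z_t normalizes the weights. Incorrect labels that the weak learner scores almost as highly as the correct one keep large weight and therefore get more attention in later rounds.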

  6. Generalized boosting (explained) • The error incorporates all possible mistaken labels rather than just a single one. The algorithm differs from 2-valued boosting in that it updates a matrix of weights rather than a vector. This algorithm works much better than the comparable one which gives weight 1 to mistakes and weight 0 to correct responses.

  7. Why does generalized boosting work? • The pseudo-loss of the weak learner (WL) on training case (xi,yi) defines the error made by the weak learner in case i. • The goal is to minimize the (weighted) average of the pseudo-losses.
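
  The defining formula is not reproduced in the transcript; the standard AdaBoost.M2 pseudo-loss, which matches the description above, is:

      \varepsilon_t(i) \;=\; \frac{1}{2}\left(1 - h_t(x_i, y_i) + \sum_{y \neq y_i} q_t(i, y)\, h_t(x_i, y)\right),

  where q_t(i, y) are the within-case weights on the incorrect labels. The pseudo-loss for case i is 1/2 when the weak learner rates the correct label no higher than the weighted incorrect ones, and approaches 0 when the correct label clearly wins.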

  8. Another way to do Generalized Boosting (GB II) • Run 5 AdaBoosts A[1],…,A[5]. AdaBoost A[i] distinguishes whether the correct response is ‘i’ or not (a one-versus-rest scheme). At the final stage we choose between the weighted averages by maximizing the scores:

  9. Generalized Boosting • We choose the label that gives the highest weighted average.
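
  A minimal sketch of this one-versus-rest scheme, assuming a response y taking values 1,…,5 and decision stumps as weak learners; the names (X, y, adaboost_binary, gb2_predict, n_rounds) are illustrative, and the per-round weight is the log ratio derived near the end of the deck:

      import numpy as np
      from sklearn.tree import DecisionTreeClassifier

      def adaboost_binary(X, z, n_rounds=50):
          # Plain AdaBoost with decision stumps on labels z in {-1, +1};
          # returns a function giving the weighted-sum score on new data.
          n = len(z)
          w = np.full(n, 1.0 / n)                    # case weights
          stumps, alphas = [], []
          for _ in range(n_rounds):
              stump = DecisionTreeClassifier(max_depth=1)
              stump.fit(X, z, sample_weight=w)
              pred = stump.predict(X)
              eps = np.clip(np.sum(w * (pred != z)), 1e-10, 1 - 1e-10)  # weighted error
              alpha = 0.5 * np.log((1 - eps) / eps)  # the "log ratio" weight
              w *= np.exp(-alpha * z * pred)         # up-weight mistakes, down-weight correct cases
              w /= w.sum()
              stumps.append(stump)
              alphas.append(alpha)
          return lambda Xnew: sum(a * s.predict(Xnew) for a, s in zip(alphas, stumps))

      def gb2_predict(X, y, Xnew, labels=(1, 2, 3, 4, 5)):
          # Fit one AdaBoost A[i] per label ("i" versus the rest), then choose the
          # label whose weighted-average score is highest.
          scores = np.column_stack([adaboost_binary(X, np.where(y == lab, 1, -1))(Xnew)
                                    for lab in labels])
          return np.asarray(labels)[np.argmax(scores, axis=1)]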

  10. Percentage predicted correctly among the training data for GB II

  11. Histogram showing the probability of being incorrectly chosen for each item

  12. Outliers • Note that in the previous slide there are 2000 (out of 2800) items which are almost never incorrectly predicted. But there are also 400 outliers which are almost always predicted incorrectly. We would like to downweight these outliers.

  13. Boosting in the presence of outliers: Hedge algorithms • Devise a loss vector at each time t, and update the weights via:
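
  The update is not reproduced in the transcript; the standard Hedge(β) rule of Freund and Schapire, with loss vector ℓ^{(t)} = (ℓ_1^{(t)}, …, ℓ_n^{(t)}) over the n training items, is:

      w_i^{(t+1)} \;=\; \frac{w_i^{(t)}\,\beta^{\,\ell_i^{(t)}}}{\sum_{j} w_j^{(t)}\,\beta^{\,\ell_j^{(t)}}},
      \qquad 0 < \beta < 1,

  so items that keep incurring large losses (the persistent outliers above) have their weights shrunk geometrically rather than inflated, as they would be under the plain AdaBoost update.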

  14. Stochastic Gradient Boosting • Assume training data (Xi,Yi), i=1,…,n, with responses taking, e.g., real values. We want to estimate the relationship between the X’s and Y’s. • Assume a model of the form,
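
  The model is not reproduced in the transcript; given the β’s and Θ’s referenced on the next slides, it is presumably the usual additive expansion:

      Y_i \;=\; F(X_i) + \epsilon_i,
      \qquad
      F(x) \;=\; \sum_{l=1}^{L} \beta_l\, h(x; \Theta_l),

  where each h(·; Θ_l) is a weak learner (e.g., a small regression tree) with parameters Θ_l and coefficient β_l.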

  15. Stochastic Gradient Boosting (continued) • We use a two-stage procedure. • Define Fl(x) to be the sum of the first l fitted terms. • First, given β1,…,βl+1 and Θ1,…,Θl, we minimize
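
  A squared-error version of this first stage (the loss function is not specified in the transcript; squared error is assumed here) is:

      \hat{\Theta}_{l+1} \;=\; \arg\min_{\Theta}\; \sum_{i=1}^{n}\Bigl(Y_i - F_l(X_i) - \beta_{l+1}\, h(X_i; \Theta)\Bigr)^2,
      \qquad
      F_l(x) = \sum_{j=1}^{l} \beta_j\, h(x; \Theta_j),

  i.e., the next weak learner is fit to the current residuals Y_i − F_l(X_i).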

  16. Stochastic Gradient Boosting • Second, we fit the beta’s via the bootstrap:

  17. Stochastic Gradient Boosting • Note that the new weights (i.e., the beta’s) are proportional to the residual error. Bootstrapping the estimates of the new parameters has the effect of making them robust.
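
  A minimal end-to-end sketch under the assumptions above (squared-error loss, regression-tree weak learners; the names X, y, n_stages, subsample are illustrative), in which each beta is fit by least squares on a bootstrap resample of the residuals as a stand-in for the bootstrap fit described on the slides:

      import numpy as np
      from sklearn.tree import DecisionTreeRegressor

      def stochastic_gradient_boost(X, y, n_stages=100, subsample=0.5, seed=0):
          rng = np.random.default_rng(seed)
          n = len(y)
          F = np.full(n, y.mean())                  # start from the mean response
          terms = []
          for _ in range(n_stages):
              resid = y - F                         # current residuals
              idx = rng.choice(n, size=int(subsample * n), replace=True)  # bootstrap resample
              tree = DecisionTreeRegressor(max_depth=3)
              tree.fit(X[idx], resid[idx])          # Theta_{l+1}: fit to the resampled residuals
              pred = tree.predict(X[idx])
              beta = resid[idx] @ pred / (pred @ pred + 1e-12)  # beta_{l+1}: least squares on the resample
              F += beta * tree.predict(X)
              terms.append((beta, tree))

          def predict(Xnew):
              out = np.full(len(Xnew), y.mean())
              for beta, tree in terms:
                  out += beta * tree.predict(Xnew)
              return out
          return predict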

  18. Self-Organizing Maps • Create maps of large data sets using an ordered set of map vectors M1,…,Mk. • We view each map vector Mi as controlling a neighborhood Ni of the data (with ni members, i=1,…,k), i.e., those points closest to it. • From a statistical standpoint we can view the map vectors as parameters: the idea is that they will provide a simple ordered mapping of the data.
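
  A minimal sketch of the classical online Kohonen update for a one-dimensional map (the names n_map, n_iters, lr, width are illustrative assumptions, not taken from the slides):

      import numpy as np

      def som_1d(X, n_map=10, n_iters=20, lr=0.5, width=2.0, seed=0):
          # X: (n, d) data array; returns an ordered set of map vectors M of shape (n_map, d).
          rng = np.random.default_rng(seed)
          M = X[rng.choice(len(X), n_map, replace=False)].astype(float)  # initialize from the data
          grid = np.arange(n_map)
          for t in range(n_iters):
              alpha = lr * (1 - t / n_iters)                     # learning rate decays over time
              for x in X[rng.permutation(len(X))]:
                  best = np.argmin(np.sum((M - x) ** 2, axis=1))        # winning (closest) map vector
                  h = np.exp(-((grid - best) ** 2) / (2 * width ** 2))  # neighborhood weights on the grid
                  M += alpha * h[:, None] * (x - M)              # pull the winner and its neighbors toward x
          return M

  The Bayesian and gradient-descent views on the following slides interpret each such pull-toward-the-data step as re-estimating the mean of the neighborhood Ni.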

  19. Kohonen in a typical pose

  20. Example from politics • Self-Organizing Map showing US Congress voting patterns visualized in Synapse. The boxes show clustering and distances. Red means a yes vote while blue means a no vote in the component planes.

  21. Bayesian View • Assume that for each map vector Mi there are ni copies distributed according to a Gaussian prior with mean a and variance τ². These vectors are the means of the members of the neighborhood Ni. The posterior distribution of these means is:

  22. Bayesian View of SOM • Now, estimate the prior mean ‘a’ by the ‘old’ map vector Mi[old], and use the posterior estimate for the ‘new’ map vector. We then get the equation:
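
  The equation is not reproduced in the transcript; one standard reconstruction, with a Gaussian N(a, τ²) prior, Gaussian observations with variance σ², and the ni copies treated as sharing a common mean (all assumptions here), is the conjugate posterior-mean update:

      M_i^{\mathrm{new}} \;=\; \frac{n_i \tau^2\, \bar{x}_{N_i} + \sigma^2\, M_i^{\mathrm{old}}}{n_i \tau^2 + \sigma^2}
      \;=\; (1 - \lambda_i)\, M_i^{\mathrm{old}} + \lambda_i\, \bar{x}_{N_i},
      \qquad
      \lambda_i = \frac{n_i \tau^2}{n_i \tau^2 + \sigma^2},

  where x̄_{N_i} is the mean of the data points in neighborhood Ni. This has the familiar SOM form: the old map vector is moved part of the way toward its neighborhood mean.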

  23. Bayesian Extension: Possible Project • The Bayes update should really incorporate a random factor associated with the posterior standard deviation (e.g., adding Gaussian noise whose scale is the posterior standard deviation).
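
  One way such a randomized update might look (an assumption, not taken from the slides):

      M_i^{\mathrm{new}} \;=\; (1 - \lambda_i)\, M_i^{\mathrm{old}} + \lambda_i\, \bar{x}_{N_i} + s_i\, Z_i,
      \qquad
      Z_i \sim N(0, I),
      \quad
      s_i = \sqrt{\frac{\tau^2 \sigma^2}{n_i \tau^2 + \sigma^2}},

  where s_i is the posterior standard deviation corresponding to the update on the previous slide.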

  24. Gradient Descent Viewpoint • We can also view the SOM algorithm as part of a Newton-Raphson algorithm. Here we view the likelihood as based on independent Gaussians whose means are the map vectors. • The mean vectors in a given neighborhood are all the same. The Newton-Raphson updating algorithm operates sequentially on parts of the likelihood.

  25. Gradient Descent • We have
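
  The display is not reproduced in the transcript; for the Gaussian likelihood just described, the piece of the negative log-likelihood attached to map vector M_i, and its gradient, are (a reconstruction under those assumptions):

      -\log L_i(M_i) \;=\; \frac{1}{2\sigma^2} \sum_{j \in N_i} \lVert x_j - M_i \rVert^2 + \mathrm{const},
      \qquad
      -\nabla_{M_i} \log L_i \;=\; -\frac{1}{\sigma^2} \sum_{j \in N_i} (x_j - M_i),

  so a gradient step of size α gives M_i^{new} = M_i^{old} + α Σ_{j∈N_i} (x_j − M_i^{old}), and the full Newton-Raphson step (the Hessian is n_i/σ² times the identity) lands exactly on the neighborhood mean x̄_{N_i}, matching the SOM update.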

  26. SOM applied to the Conviction Data

  27. Correlation chart showing the relationship between conviction and the first four questions • 1.0000 0.9860 0.9843 0.9853 0.8018 • 0.9860 1.0000 0.9839 0.9841 0.7708 • 0.9843 0.9839 1.0000 0.9927 0.7883 • 0.9853 0.9841 0.9927 1.0000 0.7782 • 0.8018 0.7708 0.7883 0.7782 1.0000

  28. Proof of log ratio result • Recall that we had the risk function, which we would like to minimize in βl+1. • First divide the sum into two parts: the first where φl+1 correctly predicts y, the second where it does not:

  29. Conclusion • We have that:
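
  The displays on this slide and the previous one are not reproduced in the transcript; the standard exponential-loss derivation they describe (a reconstruction, assuming the usual AdaBoost risk) runs as follows:

      R(\beta_{l+1}) \;=\; \sum_{i:\, \varphi_{l+1}(x_i) = y_i} w_i\, e^{-\beta_{l+1}}
      \;+\; \sum_{i:\, \varphi_{l+1}(x_i) \neq y_i} w_i\, e^{\beta_{l+1}}
      \;=\; (1 - \varepsilon)\, e^{-\beta_{l+1}} + \varepsilon\, e^{\beta_{l+1}},

  where ε is the weighted error of φl+1 under the normalized weights w_i. Setting dR/dβl+1 = 0 gives

      \beta_{l+1} \;=\; \frac{1}{2} \log \frac{1 - \varepsilon}{\varepsilon},

  which is the log ratio of the slide title.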

  30. Bounding the error made by boosting • Theorem: with the quantities defined as below, the training error of the final hypothesis satisfies the stated bound. Proof: We have, for the weights associated with incorrect classification, that:
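
  Neither the theorem statement nor the first display on this slide is reproduced in the transcript; the standard bound of Freund and Schapire that this proof follows is, with ε_t the weighted error of the t-th weak learner and γ_t = 1/2 − ε_t:

      \frac{1}{n}\,\#\{i : H(x_i) \neq y_i\}
      \;\le\; \prod_{t=1}^{T} 2\sqrt{\varepsilon_t (1 - \varepsilon_t)}
      \;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2}
      \;\le\; \exp\!\Bigl(-2 \sum_{t=1}^{T} \gamma_t^2\Bigr).

  For the first step, the normalizer of the weight update at round t is

      Z_t \;=\; \sum_i w_t(i)\, e^{-\alpha_t y_i h_t(x_i)}
      \;=\; (1 - \varepsilon_t)\, e^{-\alpha_t} + \varepsilon_t\, e^{\alpha_t}
      \;=\; 2\sqrt{\varepsilon_t (1 - \varepsilon_t)},

  using α_t = (1/2) log((1 − ε_t)/ε_t) from the log ratio result; the weights associated with incorrect classification are the ones multiplied by e^{+α_t}.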

  31. Boosting error bound (continued) • We have that: • Putting this together over all the iterations:

  32. Lower bounding the error • We can lower bound the weights by: • The final hypothesis makes a mistake in predicting yi if (see Bayes): • The final weight on an instance is:

  33. Lower Bound on weights • Putting together the last slide:
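
  The displays on this slide and the previous one are not reproduced in the transcript; the standard argument they correspond to (a reconstruction) is: the final hypothesis errs on (x_i, y_i) only if y_i f(x_i) ≤ 0, where f(x) = Σ_t α_t h_t(x), so that

      H(x_i) \neq y_i \;\Longrightarrow\; e^{-y_i f(x_i)} \ge 1,
      \qquad
      w_{T+1}(i) \;=\; \frac{w_1(i)\, e^{-y_i f(x_i)}}{\prod_{t=1}^{T} Z_t},

  and therefore

      \frac{1}{n}\,\#\{i : H(x_i) \neq y_i\}
      \;\le\; \frac{1}{n} \sum_{i=1}^{n} e^{-y_i f(x_i)}
      \;=\; \Bigl(\sum_i w_{T+1}(i)\Bigr) \prod_{t=1}^{T} Z_t
      \;=\; \prod_{t=1}^{T} Z_t.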

  34. Conclusion of Proof: • Putting the former result together with the bound on Zt established at the start of the proof, we get the conclusion.
