
Presentation Transcript


  1. Announcements • Project proposal is due on 03/11 • Three seminars this Friday (EB 3105) • Dealing with Indefinite Representations in Pattern Recognition (10:00 am - 11:00 am) • Computational Analysis of Drosophila Gene Expression Pattern Image (11:00 am - 12:00 pm) • 3D General Lesion Segmentation in CT (3:00 pm - 4:00 pm)

  2. Hierarchical Mixture Expert Model Rong Jin

  3. Good Things about Decision Trees • Decision trees introduce nonlinearity through the tree structure • Viewing the conjunction A∧B∧C as the product A·B·C • Compared to kernel methods • Less ad hoc • Easier to understand
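
To make the A∧B∧C ↔ A·B·C analogy concrete: a root-to-leaf path in a decision tree is a conjunction of node tests, and for 0/1 indicator tests a conjunction equals a product, which is nonlinear in the inputs. An illustrative example (the thresholds a, b, c are made up):

    f(x) = \mathbf{1}[x_1 > a] \cdot \mathbf{1}[x_2 > b] \cdot \mathbf{1}[x_3 > c]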

  4. Generalized Tree: Example [figure: decision boundaries at x = 0 separating + and - regions] • In general, mixture models are powerful at fitting complex decision boundaries, as in stacking, boosting, bagging, and kernel methods

  5. Generalize Decision Trees • Each node of a decision tree depends on only a single feature. Is this the best idea? (From slides of Andrew Moore)

  6. Partition Datasets • The goal of each node is to partition the dataset into disjoint subsets such that each subset is easier to classify [figure: original dataset partitioned by a single attribute, e.g. cylinders = 4 / 5 / 6 / 8]

  7. Partition Datasets (cont’d) • More complicated partitions: partition by multiple attributes, e.g. Cylinders < 6 and Weight > 4 tons vs. Cylinders ≥ 6 and Weight < 3 tons vs. other cases, using a classification model for each node [figure: original dataset split by multiple attributes] • How to accomplish such a complicated partition? • Each partition → a class • Partitioning a dataset into disjoint subsets → classifying the dataset into multiple classes

  8. A More General Decision Tree • Each node is a linear classifier [figure: a decision tree with simple data partition vs. a decision tree using classifiers over Attribute 1 and Attribute 2 for data partition]

  9. General Schemes for Decision Trees • Each node within the tree is a linear classifier • Pros: • Usually results in shallow trees • Introduces nonlinearity into linear classifiers (e.g. logistic regression) • Overcomes overfitting through the regularization mechanism within the classifier • Partitions datasets with soft memberships • A better way to deal with real-valued attributes • Examples: • Neural networks • Hierarchical Mixture Expert Model

  10. Hierarchical Mixture Expert Model (HME) • Router: decides which classifier x should be routed to • Classifier: determines the class for input x [HME diagram: input x → router r(x) → Group Layer: Group 1 g1(x), Group 2 g2(x) → Expert Layer: m1,1(x), m1,2(x), m2,1(x), m2,2(x)]

  11. Hierarchical Mixture Expert Model (HME) • Which group should be used for classifying x? [HME diagram: x → r(x) → Group Layer (g1(x), g2(x)) → Expert Layer (m1,1(x), m1,2(x), m2,1(x), m2,2(x))]

  12. Hierarchical Mixture Expert Model (HME) • r(x) = +1: the router sends x to Group 1 [HME diagram with r(x) = +1 highlighted]

  13. Hierarchical Mixture Expert Model (HME) • Which expert should be used for classifying x? [HME diagram]

  14. Hierarchical Mixture Expert Model (HME) • g1(x) = -1: Group 1 routes x to expert m1,2 [HME diagram with g1(x) = -1 highlighted]

  15. Hierarchical Mixture Expert Model (HME) • m1,2(x) = +1: the class label predicted for x is +1 [HME diagram with m1,2(x) = +1 highlighted]

  16. Hierarchical Mixture Expert Model (HME): More Complicated Case • Which group should be used for classifying x? [HME diagram]

  17. Hierarchical Mixture Expert Model (HME): More Complicated Case • The router now gives soft memberships: r(+1|x) = ¾, r(-1|x) = ¼ [HME diagram]

  18. Hierarchical Mixture Expert Model (HME): More Complicated Case • r(+1|x) = ¾, r(-1|x) = ¼ • Which expert should be used for classifying x? [HME diagram]

  19. Hierarchical Mixture Expert Model (HME): More Complicated Case • r(+1|x) = ¾, r(-1|x) = ¼ • g1(+1|x) = ¼, g1(-1|x) = ¾ • g2(+1|x) = ½, g2(-1|x) = ½ • How do we compute the probabilities p(+1|x) and p(-1|x)? [HME diagram]

  20. HME: Probabilistic Description • Random variable g ∈ {1, 2} selects the group: r(+1|x) = p(g=1|x), r(-1|x) = p(g=2|x) • Random variable m ∈ {11, 12, 21, 22} selects the expert: g1(+1|x) = p(m=11|x, g=1), g1(-1|x) = p(m=12|x, g=1), g2(+1|x) = p(m=21|x, g=2), g2(-1|x) = p(m=22|x, g=2) [HME diagram]
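
Putting this description together, the HME output is a mixture over the hidden routing variables; in the two-level, two-way notation above it reads

    p(y|x) = \sum_{g \in \{1,2\}} p(g|x) \sum_{j \in \{1,2\}} p(m = gj \mid x, g)\, m_{g,j}(y|x)

where m_{g,j}(y|x) denotes expert m_{g,j}'s predicted probability of label y.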

  21. HME: Probabilistic Description • r(+1|x) = ¾, r(-1|x) = ¼ • g1(+1|x) = ¼, g1(-1|x) = ¾ • g2(+1|x) = ½, g2(-1|x) = ½ • Compute p(+1|x) and p(-1|x) [HME diagram]

  22. HME: Probabilistic Description • r(+1|x) = ¾, r(-1|x) = ¼ • g1(+1|x) = ¼, g1(-1|x) = ¾ • g2(+1|x) = ½, g2(-1|x) = ½ [HME diagram]

  23. HME: Probabilistic Description • r(+1|x) = ¾, r(-1|x) = ¼ • g1(+1|x) = ¼, g1(-1|x) = ¾ • g2(+1|x) = ½, g2(-1|x) = ½
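
Plugging these numbers into the mixture formula (the experts' outputs m_{i,j}(+1|x) are not specified here, so they stay symbolic):

    p(+1|x) = ¾ [ ¼ m_{1,1}(+1|x) + ¾ m_{1,2}(+1|x) ] + ¼ [ ½ m_{2,1}(+1|x) + ½ m_{2,2}(+1|x) ]

and p(-1|x) = 1 - p(+1|x), since each expert's two output probabilities sum to one.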

  24. Hierarchical Mixture Expert Model (HME) • Is HME more powerful than a simple majority-vote approach? [HME diagram: input x → r(x) → Group Layer → Expert Layer → output y]

  25. Problem with Training HME • Use logistic regression to model r(x), g(x), and m(x) • There are no direct training examples for r(x) and g(x): for each training example (x, y), we don’t know its group ID or expert ID • So we can’t apply the standard logistic regression training procedure to r(x) and g(x) directly • The random variables g, m are called hidden variables since they are not exposed in the training data • How do we train a model with incomplete data?

  26. Start with a Random Guess … • Iteration 1: random guess • Randomly assign points to groups and experts • Training data: +: {1, 2, 3, 4, 5}, -: {6, 7, 8, 9} [HME diagram: r(x) → g1(x), g2(x) → m1,1(x), m1,2(x), m2,1(x), m2,2(x)]

  27. Start with a Random Guess … • Iteration 1: random guess • Randomly assign points to groups and experts • Training data: +: {1, 2, 3, 4, 5}, -: {6, 7, 8, 9} [HME diagram with the random assignment: Group 1: +{1,2} -{6,7}, Group 2: +{3,4,5} -{8,9}; m1,1: +{1} -{6}, m1,2: +{2} -{7}, m2,1: +{3} -{9}, m2,2: +{4,5} -{8}]

  28. Start with a Random Guess … • Iteration 1: random guess • Randomly assign points to groups and experts • Learn r(x), g1(x), g2(x), m11(x), m12(x), m21(x), m22(x) from these assignments • Now, what should we do? • Training data: +: {1, 2, 3, 4, 5}, -: {6, 7, 8, 9} [HME diagram with the random assignment: Group 1: +{1,2} -{6,7}, Group 2: +{3,4,5} -{8,9}; m1,1: +{1} -{6}, m1,2: +{2} -{7}, m2,1: +{3} -{9}, m2,2: +{4,5} -{8}]

  29. Refine the HME Model • Iteration 2: regroup data points • Reassign the group membership of each data point • Reassign the expert membership of each data point • But, how? • Training data: +: {1, 2, 3, 4, 5}, -: {6, 7, 8, 9} [HME diagram with the new grouping: Group 1: +{1,5} -{6,7}, Group 2: +{2,3,4} -{8,9}]

  30. Determine Group Memberships • Consider an example (x, +1) • r(+1|x) = ¾, r(-1|x) = ¼ • g1(+1|x) = ¼, g1(-1|x) = ¾ • g2(+1|x) = ½, g2(-1|x) = ½ • Compute the posterior on your own sheet! [HME diagram]

  31. Determine Group Memberships • Consider an example (x, +1) • r(+1|x) = ¾, r(-1|x) = ¼ • g1(+1|x) = ¼, g1(-1|x) = ¾ • g2(+1|x) = ½, g2(-1|x) = ½ [HME diagram]

  32. Determine Expert Memberships • Consider an example (x, +1) • r(+1|x) = ¾, r(-1|x) = ¼ • g1(+1|x) = ¼, g1(-1|x) = ¾ • g2(+1|x) = ½, g2(-1|x) = ½ [HME diagram]
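
The posteriors asked for on slides 30-32 follow from Bayes' rule applied to the mixture; the expert outputs m_{i,j}(y|x) are needed as well and are left symbolic here:

    p(g=1 \mid x, y) = \frac{ r(+1|x)\,[\, g_1(+1|x)\, m_{1,1}(y|x) + g_1(-1|x)\, m_{1,2}(y|x) \,] }{ p(y|x) }

    p(m=11 \mid x, y, g=1) = \frac{ g_1(+1|x)\, m_{1,1}(y|x) }{ g_1(+1|x)\, m_{1,1}(y|x) + g_1(-1|x)\, m_{1,2}(y|x) }

with p(y|x) the full mixture sum from slide 20, and analogous expressions for group 2 and the other experts.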

  33. Refine the HME Model • Iteration 2: regroup data points • Reassign the group membership of each data point • Reassign the expert membership of each data point • Compute the posteriors p(g|x,y) and p(m|x,y,g) for each training example (x,y) • Retrain r(x), g1(x), g2(x), m11(x), m12(x), m21(x), m22(x) using the estimated posteriors • But, how? • Training data: +: {1, 2, 3, 4, 5}, -: {6, 7, 8, 9} [HME diagram with the new grouping: Group 1: +{1,5} -{6,7}, Group 2: +{2,3,4} -{8,9}]

  34. Logistic Regression with Soft Memberships • Example: train r(x) using soft memberships

  35. Logistic Regression with Soft Memberships • Example: train m11(x) using soft memberships
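
The training objectives sketched on slides 34-35 amount to weighted maximum likelihood; a hedged reconstruction in the deck's notation is: for the router r, maximize

    \sum_i \big[ p(g=1 \mid x_i, y_i) \log r(+1|x_i) + p(g=2 \mid x_i, y_i) \log r(-1|x_i) \big]

and for expert m11, maximize

    \sum_i p(g=1 \mid x_i, y_i)\, p(m=11 \mid x_i, y_i, g=1) \log m_{1,1}(y_i|x_i)

i.e. each example enters the usual logistic regression log-likelihood weighted by its soft membership.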

  36. Start with a Random Guess … • Iteration 2: regroup data points • Reassign the group membership of each data point • Reassign the expert membership of each data point • Compute the posteriors p(g|x,y) and p(m|x,y,g) for each training example (x,y) • Retrain r(x), g1(x), g2(x), m11(x), m12(x), m21(x), m22(x) • Repeat the above procedure until it converges (it is guaranteed to converge to a local optimum) • This is the famous Expectation-Maximization (EM) algorithm! • Training data: +: {1, 2, 3, 4, 5}, -: {6, 7, 8, 9} [HME diagram: Group 1: +{1,5} -{6,7}, Group 2: +{2,3,4} -{8,9}; m1,1: +{1} -{6}, m1,2: +{5} -{7}, m2,1: +{2,3} -{9}, m2,2: +{4} -{8}]

  37. Formal EM Algorithm for HME • Unknowns: the logistic regression models r(x;θr), {gi(x;θg)}, {mi(x;θm)} and the group/expert memberships p(g|x,y), p(m|x,y,g) • E-step: fix the logistic regression models and estimate the memberships • Estimate p(g=1|x,y) and p(g=2|x,y) for all training examples • Estimate p(m=11, 12|x, y, g=1) and p(m=21, 22|x, y, g=2) for all training examples • M-step: fix the memberships and learn the logistic regression models • Train r(x;θr) using the soft memberships p(g=1|x,y) and p(g=2|x,y) • Train g1(x;θg) and g2(x;θg) using the soft memberships p(m=11, 12|x, y, g=1) and p(m=21, 22|x, y, g=2) • Train m11(x;θm), m12(x;θm), m21(x;θm), and m22(x;θm) using the soft memberships p(m=11, 12|x, y, g=1) and p(m=21, 22|x, y, g=2)

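As an illustration of this E-step/M-step loop, here is a minimal numpy sketch of EM for the two-group, two-expert HME described above, with binary labels y ∈ {+1, -1} and every router, gate, and expert modeled as a plain logistic regression. All names, the gradient-ascent fitting routine, and the random-parameter initialization (which plays the role of the random assignment on slides 26-28) are illustrative choices, not code from the original slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_weighted_logistic(X, pos_w, neg_w, n_iter=200, lr=0.5):
    """Maximize sum_i pos_w[i]*log p(+1|x_i) + neg_w[i]*log p(-1|x_i),
    where p(+1|x) = sigmoid(w . x), by plain gradient ascent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (pos_w * (1.0 - p) - neg_w * p)
        w += lr * grad / len(X)
    return w

def e_step(X, y, w_r, w_g, w_m):
    """Posterior p(g, m | x, y) of the hidden group/expert assignment."""
    post = np.zeros((len(X), 2, 2))
    p_r1 = sigmoid(X @ w_r)                          # p(g=1 | x)
    for i in range(2):
        p_group = p_r1 if i == 0 else 1.0 - p_r1
        p_gi1 = sigmoid(X @ w_g[i])                  # p(first expert | x, group i)
        for j in range(2):
            p_expert = p_gi1 if j == 0 else 1.0 - p_gi1
            p_pos = sigmoid(X @ w_m[i][j])           # expert's p(+1 | x)
            p_y = np.where(y == 1, p_pos, 1.0 - p_pos)
            post[:, i, j] = p_group * p_expert * p_y
    post /= post.sum(axis=(1, 2), keepdims=True)     # normalize over (g, m)
    return post

def m_step(X, y, post):
    """Refit router, gates, and experts with the soft memberships as weights."""
    w_r = fit_weighted_logistic(X, post[:, 0].sum(axis=1), post[:, 1].sum(axis=1))
    w_g = [fit_weighted_logistic(X, post[:, i, 0], post[:, i, 1]) for i in range(2)]
    is_pos = (y == 1).astype(float)
    w_m = [[fit_weighted_logistic(X, post[:, i, j] * is_pos,
                                  post[:, i, j] * (1.0 - is_pos))
            for j in range(2)] for i in range(2)]
    return w_r, w_g, w_m

def train_hme(X, y, n_em_iter=20, seed=0):
    """EM for a 2-group x 2-expert HME; X should already include a bias column."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    w_r = rng.normal(scale=0.1, size=d)              # random initial guess
    w_g = [rng.normal(scale=0.1, size=d) for _ in range(2)]
    w_m = [[rng.normal(scale=0.1, size=d) for _ in range(2)] for _ in range(2)]
    for _ in range(n_em_iter):
        post = e_step(X, y, w_r, w_g, w_m)           # E-step
        w_r, w_g, w_m = m_step(X, y, post)           # M-step
    return w_r, w_g, w_m

Calling train_hme(X, y) with y ∈ {+1, -1} runs the loop; different seeds correspond to different random initial guesses, which (as slide 41 notes) can lead to different local optima.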

  39. What Are We Doing? • What is the objective of Expectation-Maximization? • It is still simple maximum likelihood! • The Expectation-Maximization algorithm actually tries to maximize the log-likelihood function • Most of the time it converges to a local maximum, not the global one • Improved version: annealing EM
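
Concretely, the quantity being maximized is the incomplete-data (marginal) log-likelihood, with the hidden group and expert variables summed out:

    L(\theta) = \sum_i \log \sum_{g} p(g \mid x_i; \theta_r) \sum_{m} p(m \mid x_i, g; \theta_g)\, p(y_i \mid x_i, m; \theta_m)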

  40. Annealing EM
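
A common formulation of annealing EM (a hedged sketch of the general technique, not necessarily the exact variant intended here) tempers the E-step posteriors with a temperature T that is gradually lowered toward 1:

    p_T(g, m \mid x, y) \propto \big[\, p(g|x)\, p(m|x, g)\, p(y|x, m) \,\big]^{1/T}

A large T flattens the posteriors (softer assignments, less sensitivity to the initial guess); T = 1 recovers standard EM.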

  41. Improving HME • HME is sensitive to the initial assignments • How can we reduce the risk from bad initial assignments? • Binary tree → K-way trees • Logistic regression → conditional exponential model • Tree structure: can we determine the optimal tree structure for a given dataset?

  42. Comparison of Classification Models • The goal of a classifier • Predict the class label y for an input x • Estimate p(y|x) • Gaussian generative model • p(y|x) ∝ p(x|y) p(y): posterior = likelihood × prior • Difficulty in estimating p(x|y) if x comprises multiple elements • Naïve Bayes: p(x|y) ≈ p(x1|y) p(x2|y) … p(xd|y) • Linear discriminative model • Estimate p(y|x) directly • Focus on finding the decision boundary
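
Written out, the generative route classifies via Bayes' rule, and the Naïve Bayes factorization is what makes p(x|y) tractable when x has many elements:

    p(y|x) = \frac{p(x|y)\, p(y)}{p(x)}, \qquad p(x|y) \approx \prod_{k=1}^{d} p(x_k|y)

so the predicted label is \arg\max_y \; p(y) \prod_{k} p(x_k|y).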

  43. Comparison of Classification Models • Logistic regression model • A linear decision boundary: w·x + b • A probabilistic model for p(y|x) • Maximum likelihood approach for estimating the weights w and the threshold b
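
The standard form of this model, consistent with the deck's ±1 labels, is

    p(y|x) = \frac{1}{1 + \exp\!\big(-y\,(w \cdot x + b)\big)}, \qquad y \in \{+1, -1\}

and maximum likelihood picks w and b to maximize \sum_i \log p(y_i|x_i).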

  44. Comparison of Classification Models • Logistic regression model • Overfitting issue: in text classification, a word that appears in only one document will be assigned an infinitely large weight • Solution: regularization • Conditional exponential model • Maximum entropy model • A dual problem of the conditional exponential model
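
A common way to write the regularized fit (an L2 penalty is assumed here purely as an illustrative choice):

    \max_{w, b} \; \sum_i \log p(y_i|x_i) \; - \; \frac{\lambda}{2} \|w\|^2

The penalty keeps the weights of rare words finite instead of letting them grow without bound.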

  45. Comparison of Classification Models • Support vector machine • Classification margin • Maximum margin principle: two objectives • Minimize the classification error over the training data • Maximize the classification margin • Support vectors: only the support vectors affect the location of the decision boundary [figure: margin plot, with +1 and -1 examples and the support vectors marked]

  46. Comparison of Classification Models • Support vector machine training: the separable case and the noisy case • Both are quadratic programming problems!
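
The two cases correspond to the standard SVM formulations (written here from the usual definitions). Separable case:

    \min_{w, b} \; \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 \;\; \forall i

Noisy (soft-margin) case, with slack variables \xi_i:

    \min_{w, b, \xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0

Both have a quadratic objective with linear constraints, hence quadratic programming.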

  47. Comparison of Classification Models • Similarity between the logistic regression model and the support vector machine • The two objectives share identical terms; the log-likelihood can be viewed as a measure of accuracy • The logistic regression model is almost identical to the support vector machine, except for a different expression for the classification error
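
Side by side (regularized logistic regression vs. soft-margin SVM, written from the usual definitions), both minimize a regularizer plus a per-example error term, and only the error term differs:

    \text{Logistic regression:} \quad \frac{\lambda}{2}\|w\|^2 + \sum_i \log\!\big(1 + e^{-y_i (w \cdot x_i + b)}\big)

    \text{SVM:} \quad \frac{1}{2}\|w\|^2 + C \sum_i \max\!\big(0,\; 1 - y_i (w \cdot x_i + b)\big)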

  48. Comparison of Classification Models • Generative models have trouble at the decision boundary [figure: the classification boundary that achieves the least training error vs. the classification boundary that achieves a large margin]
