
Sum-Product Networks: A New Deep Architecture


Presentation Transcript


  1. Sum-Product Networks: A New Deep Architecture Pedro Domingos Dept. Computer Science & Eng., University of Washington Joint work with Hoifung Poon

  2. Graphical Models: Challenges [Figure: example models: a Bayesian network over Sprinkler, Rain, Grass Wet; a Markov network; a Restricted Boltzmann Machine (RBM)] • Advantage: Compactly represent probability distributions • Problem: Inference is intractable • Problem: Learning is difficult

  3. Deep Learning • Stack many layers, e.g.: DBN [Hinton & Salakhutdinov, 2006], CDBN [Lee et al., 2009], DBM [Salakhutdinov & Hinton, 2010] • Potentially much more powerful than shallow architectures [Bengio, 2009] • But … • Inference is even harder • Learning requires extensive effort

  4. [Venn diagram: Graphical Models] Learning: Requires approximate inference. Inference: Still approximate.

  5. [Venn diagram: Existing Tractable Models within Graphical Models] E.g., hierarchical mixture models, thin junction trees, etc. Problem: Too restricted

  6. This Talk: Sum-Product Networks [Venn diagram: Sum-Product Networks alongside Existing Tractable Models and Graphical Models] Compactly represent the partition function using a deep network

  7. [Same diagram] Exact inference: linear time in network size

  8. [Same diagram] Can compactly represent many more distributions

  9. [Same diagram] Learn the optimal way to reuse computation, etc.

  10. Outline • Sum-product networks (SPNs) • Learning SPNs • Experimental results • Conclusion

  11. Why Is Inference Hard? • Bottleneck: Summing out variables • E.g., the partition function Z = Σ_X Π_k φ_k(X) is a sum of exponentially many products

  12. Alternative Representation P(X) = 0.4 × I[X1=1] I[X2=1] + 0.2 × I[X1=1] I[X2=0] + 0.1 × I[X1=0] I[X2=1] + 0.3 × I[X1=0] I[X2=0]

  13. Alternative Representation P(X) = 0.4 × I[X1=1] I[X2=1] + 0.2 × I[X1=1] I[X2=0] + 0.1 × I[X1=0] I[X2=1] + 0.3 × I[X1=0] I[X2=0]

  14. Shorthand for Indicators P(X) = 0.4 × X1 X2 + 0.2 × X1 X̄2 + 0.1 × X̄1 X2 + 0.3 × X̄1 X̄2 (writing X1 for I[X1=1] and X̄1 for I[X1=0])

  15. Sum Out Variables e: X1 = 1 P(e) = 0.4 × X1 X2 + 0.2 × X1 X̄2 + 0.1 × X̄1 X2 + 0.3 × X̄1 X̄2 Set X1 = 1, X̄1 = 0, X2 = 1, X̄2 = 1 Easy: Set both indicators of the summed-out variable to 1
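A minimal sketch (mine, not from the talk) of the "network polynomial" view above: the distribution is a weighted sum of products of indicators, and evidence is handled purely by how the indicators are set.

```python
def network_polynomial(x1, nx1, x2, nx2):
    """Evaluate 0.4*X1*X2 + 0.2*X1*X̄2 + 0.1*X̄1*X2 + 0.3*X̄1*X̄2,
    where each argument is the 0/1 value of the corresponding indicator."""
    return (0.4 * x1 * x2 +
            0.2 * x1 * nx2 +
            0.1 * nx1 * x2 +
            0.3 * nx1 * nx2)

# Complete evidence X1=1, X2=1: indicators of the observed values are 1, the rest 0.
print(network_polynomial(1, 0, 1, 0))   # 0.4 = P(X1=1, X2=1)

# Partial evidence e: X1=1. X2 is summed out by setting BOTH of its indicators to 1.
print(network_polynomial(1, 0, 1, 1))   # 0.6 = P(X1=1) = 0.4 + 0.2
```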

  16. Graphical Representation [Figure: a sum node with weights 0.4, 0.2, 0.1, 0.3 over four product nodes, each multiplying one indicator from {X1, X̄1} with one from {X2, X̄2}]

  17. But … Exponentially Large Example: Parity (uniform distribution over states with an even number of 1's). The flat representation needs 2^(N-1) product terms and N·2^(N-1) edges. [Figure: flat sum-of-products network over X1 … X5]

  18. But … Exponentially Large Can we make this more compact? Example: Parity (uniform distribution over states with an even number of 1's). [Same figure]

  19. Use a Deep Network A network of size O(N) suffices. Example: Parity (uniform distribution over states with an even number of 1's).

  20. Use a Deep Network Induce many hidden layers. Example: Parity (uniform distribution over states with an even number of 1's).

  21. Use a Deep Network Reuse partial computation. Example: Parity (uniform distribution over states with an even number of 1's).

  22. Sum-Product Networks (SPNs) • Rooted DAG • Nodes: Sum, product, input indicator • Weights on edges from a sum node to its children [Figure: example SPN over X1 and X2: a root sum with weights 0.7 and 0.3 over two product nodes; the first product multiplies a sum over X1's indicators (0.6 on X1, 0.4 on X̄1) with a sum over X2's indicators (0.3 on X2, 0.7 on X̄2); the second multiplies sums with weights 0.9/0.1 and 0.2/0.8 over the same indicators]

  23. Distribution Defined by SPN P(X) ∝ S(X) [Example SPN from the previous slide]

  24. Distribution Defined by SPN P(X) ∝ S(X) [Same example SPN]
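A minimal sketch of an SPN as a rooted DAG of sum, product, and indicator nodes, evaluated bottom-up. The structure and weights mirror the example figure as reconstructed above (two sums over X1's indicators, two over X2's, two products, one root sum); treat the class names and exact layout as illustrative, not the talk's implementation.

```python
class Indicator:
    def __init__(self, var, val): self.var, self.val = var, val
    def value(self, assignment):
        # assignment maps var -> 0/1; a missing variable is summed out (indicator = 1)
        v = assignment.get(self.var)
        return 1.0 if v is None or v == self.val else 0.0

class Product:
    def __init__(self, children): self.children = children
    def value(self, assignment):
        out = 1.0
        for c in self.children:
            out *= c.value(assignment)
        return out

class Sum:
    def __init__(self, weighted_children): self.weighted_children = weighted_children
    def value(self, assignment):
        return sum(w * c.value(assignment) for w, c in self.weighted_children)

x1, nx1 = Indicator("X1", 1), Indicator("X1", 0)
x2, nx2 = Indicator("X2", 1), Indicator("X2", 0)
a = Sum([(0.6, x1), (0.4, nx1)]); b = Sum([(0.9, x1), (0.1, nx1)])
c = Sum([(0.3, x2), (0.7, nx2)]); d = Sum([(0.2, x2), (0.8, nx2)])
root = Sum([(0.7, Product([a, c])), (0.3, Product([b, d]))])

print(root.value({"X1": 1, "X2": 0}))   # 0.51, matching the inference slides below
```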

  25. Can We Sum Out Variables? P(e) ∝ Σ_{X ∈ e} S(X) e: X1 = 1, so the indicators are set to X1 = 1, X̄1 = 0, X2 = 1, X̄2 = 1 [Example SPN as above]

  26. Can We Sum Out Variables? P(e) = Σ_{X ∈ e} P(X) ∝ Σ_{X ∈ e} S(X)

  27. Can We Sum Out Variables? Is the network value S(e), computed with these indicator settings, equal to Σ_{X ∈ e} S(X)?

  28. Can We Sum Out Variables? Question: S(e) =? Σ_{X ∈ e} S(X)

  29. Valid SPN • An SPN is valid if S(e) = Σ_{X ∈ e} S(X) for all evidence e • Valid ⇒ can compute marginals efficiently • The partition function Z can be computed by setting all indicators to 1

  30. Valid SPN: General Conditions Theorem: An SPN is valid if it is complete & consistent • Complete: Under each sum node, the children cover the same set of variables • Consistent: Under each product node, no variable appears negated in one child and non-negated in another • If the SPN is incomplete or inconsistent, S(e) and Σ_{X ∈ e} S(X) can differ
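A sketch (mine, not from the talk) of checking these structural conditions on the node classes from the earlier sketch. Completeness is checked exactly; for product nodes it checks decomposability (children have disjoint scopes), a stricter, commonly used condition that implies consistency.

```python
def scope(node):
    """Set of variables appearing under a node."""
    if hasattr(node, "var"):
        return {node.var}
    children = ([c for _, c in node.weighted_children]
                if hasattr(node, "weighted_children") else node.children)
    s = set()
    for c in children:
        s |= scope(c)
    return s

def is_valid_structure(node):
    if hasattr(node, "var"):
        return True
    if hasattr(node, "weighted_children"):            # sum node: complete?
        children = [c for _, c in node.weighted_children]
        ok = all(scope(c) == scope(children[0]) for c in children)
    else:                                             # product node: decomposable?
        children = node.children
        seen, ok = set(), True
        for c in children:
            s = scope(c)
            ok = ok and not (seen & s)
            seen |= s
    return ok and all(is_valid_structure(c) for c in children)

print(is_valid_structure(root))   # True for the example SPN
```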

  31. Semantics of Sums and Products • Product ⇒ feature ⇒ form a feature hierarchy • Sum ⇒ mixture (with the hidden variable summed out): a sum node i with weights w_ij corresponds to summing out a hidden variable Y_i; the weight w_ij acts as P(Y_i = j), and the j-th child is implicitly multiplied by the indicator I[Y_i = j]

  32. Inference Probability: P(X) = S(X) / Z Example (SPN above), X: X1 = 1, X2 = 0: the indicators are X1 = 1, X̄1 = 0, X2 = 0, X̄2 = 1; the sums over X1 evaluate to 0.6 and 0.9, the sums over X2 to 0.7 and 0.8, the products to 0.42 and 0.72, and the root to 0.7 × 0.42 + 0.3 × 0.72 = 0.51

  33. Inference If the weights sum to 1 at each sum node, then Z = 1 and P(X) = S(X) [Same example: S(X) = 0.51 for X: X1 = 1, X2 = 0]

  34. Inference Marginal: P(e) = S(e) / Z Example, e: X1 = 1: set X1 = 1, X̄1 = 0, X2 = 1, X̄2 = 1; the sums evaluate to 0.6, 0.9, 1, 1, the products to 0.6 and 0.9, and the root to 0.7 × 0.6 + 0.3 × 0.9 = 0.69 = 0.51 + 0.18
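With the node classes from the earlier sketch, marginal inference needs no extra machinery: unobserved variables are simply left out of the assignment (both of their indicators then evaluate to 1), and Z is the value of the network with all indicators set to 1.

```python
Z = root.value({})                 # 1.0 here, since each sum node's weights sum to 1
S_e = root.value({"X1": 1})        # evidence X1 = 1; X2 is summed out automatically
print(S_e / Z)                     # 0.69 = 0.51 + 0.18, as on the slide
```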

  35. Inference MAP: Replace sums with maxes e: X1 = 1: with indicators X1 = 1, X̄1 = 0, X2 = 1, X̄2 = 1, the max nodes over X1 give 0.6 and 0.9, those over X2 give 0.7 and 0.8, the products give 0.42 and 0.72, and the root takes max(0.7 × 0.42, 0.3 × 0.72) = max(0.294, 0.216) = 0.294

  36. Inference MAX: Pick the child with the highest value and backtrack down the network MAP state: X1 = 1, X2 = 0 [Same example: the root picks its left child (0.294), whose sums pick X1 and X̄2]
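A sketch of this MAP procedure on the earlier node classes: evaluate with sums replaced by maxes, then backtrack from the root, following the maximizing child at each (former) sum node and all children at product nodes. The function names are mine.

```python
def max_value(node, assignment):
    """Upward pass with sum nodes treated as (weighted) max nodes."""
    if hasattr(node, "var"):
        return node.value(assignment)
    if hasattr(node, "weighted_children"):
        return max(w * max_value(c, assignment) for w, c in node.weighted_children)
    out = 1.0
    for c in node.children:
        out *= max_value(c, assignment)
    return out

def backtrack(node, assignment, state):
    """Downward pass recovering the MAP completion of the evidence."""
    if hasattr(node, "var"):
        state[node.var] = node.val if assignment.get(node.var) is None else assignment[node.var]
    elif hasattr(node, "weighted_children"):
        _, best = max(node.weighted_children,
                      key=lambda wc: wc[0] * max_value(wc[1], assignment))
        backtrack(best, assignment, state)
    else:
        for c in node.children:
            backtrack(c, assignment, state)

evidence = {"X1": 1}
print(max_value(root, evidence))   # 0.294 = max(0.7*0.42, 0.3*0.72)
state = {}
backtrack(root, evidence, state)
print(state)                       # {'X1': 1, 'X2': 0}, the MAP completion
```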

  37. Handling Continuous Variables • Sum ⇒ integral over the input • Simplest case: Indicator ⇒ Gaussian • An SPN then compactly defines a very large mixture of Gaussians
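A sketch of that simplest continuous case: replace an indicator leaf by a univariate Gaussian density, and treat an unobserved variable as having its density integrated out, which contributes a factor of 1. This leaf drops into the sum/product nodes from the earlier sketch.

```python
import math

class GaussianLeaf:
    def __init__(self, var, mean, std):
        self.var, self.mean, self.std = var, mean, std
    def value(self, assignment):
        x = assignment.get(self.var)
        if x is None:                       # variable summed (integrated) out
            return 1.0
        z = (x - self.mean) / self.std
        return math.exp(-0.5 * z * z) / (self.std * math.sqrt(2 * math.pi))

# Plugging such leaves under the sum and product nodes above yields a deep,
# compactly parameterized mixture of Gaussians.
```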

  38. SPNs Everywhere • Graphical models: Existing tractable models, e.g., hierarchical mixture models, thin junction trees, etc. SPNs can compactly represent many more distributions

  39. SPNs Everywhere • Graphical models: Inference methods, e.g., the junction-tree algorithm, message passing, … SPNs can represent, combine, and learn the optimal way

  40. SPNs Everywhere • Graphical models: SPNs can be more compact by leveraging determinism, context-specific independence, etc.

  41. SPNs Everywhere • Graphical models • Models for efficient inference, e.g., arithmetic circuits, AND/OR graphs, case-factor diagrams. SPN: First approach for learning such models directly from data

  42. SPNs Everywhere • Graphical models • Models for efficient inference • General, probabilistic convolutional network: Sum: Average-pooling; Max: Max-pooling

  43. SPNs Everywhere • Graphical models • Models for efficient inference • General, probabilistic convolutional network • Grammars in vision and language, e.g., object detection grammars, probabilistic context-free grammars: Sum: Non-terminal; Product: Production rule

  44. Outline • Sum-product networks (SPNs) • Learning SPNs • Experimental results • Conclusion

  45. General Approach • Start with a dense SPN • Find the structure by learning weights: zero weights signify absent connections • Can also learn with EM: each sum node is a mixture over its children

  46. The Challenge • In principle, can use gradient descent • But … the gradient quickly dilutes • Similar problem with EM • Hard EM overcomes this problem

  47. Our Learning Algorithm • Online learning ⇒ Hard EM • Each sum node maintains a count for each child • For each example: find the MAP instantiation with the current weights; increment the count for each chosen child; renormalize the counts to set the new weights • Repeat until convergence
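A compact sketch of this online hard-EM loop, reusing the node classes and the max_value routine from the earlier MAP sketch. The count initialization (a simple add-one smoothing), the fixed number of epochs, and the decision to reuse the current weights during the MAP pass are my assumptions; the talk only gives the outer loop.

```python
def hard_em_step(node, assignment, counts):
    """Follow the MAP (max-product) choice at each sum node along the selected
    sub-circuit and increment the count of the chosen child."""
    if hasattr(node, "var"):
        return
    if hasattr(node, "weighted_children"):
        best_i, _ = max(enumerate(node.weighted_children),
                        key=lambda iw: iw[1][0] * max_value(iw[1][1], assignment))
        counts.setdefault(id(node), [1.0] * len(node.weighted_children))  # add-one smoothing
        counts[id(node)][best_i] += 1.0
        hard_em_step(node.weighted_children[best_i][1], assignment, counts)
    else:
        for c in node.children:
            hard_em_step(c, assignment, counts)

def renormalize(node, counts):
    """Set each sum node's weights to its children's normalized counts."""
    if hasattr(node, "var"):
        return
    if hasattr(node, "weighted_children"):
        cs = counts.get(id(node))
        if cs:
            total = sum(cs)
            node.weighted_children = [(cnt / total, child)
                                      for cnt, (_, child) in zip(cs, node.weighted_children)]
        for _, c in node.weighted_children:
            renormalize(c, counts)
    else:
        for c in node.children:
            renormalize(c, counts)

data = [{"X1": 1, "X2": 0}, {"X1": 1, "X2": 1}, {"X1": 0, "X2": 0}]
counts = {}
for epoch in range(20):            # "repeat until convergence"
    for example in data:
        hard_em_step(root, example, counts)
    renormalize(root, counts)
```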

  48. Outline • Sum-product networks (SPNs) • Learning SPNs • Experimental results • Conclusion

  49. Task: Image Completion • Very challenging • Good for evaluating deep models • Methodology: Learn a model from training images; complete unseen test images; measure mean squared error

  50. Datasets • Main evaluation: Caltech-101 [Fei-Fei et al., 2004]: 101 categories, e.g., faces, cars, elephants; each category has 30 – 800 images • Also: Olivetti [Samaria & Harter, 1994] (400 faces) • In each category, the last third of the images is held out for testing; test images show unseen objects
