Advanced topics
Outline • Self-taught learning • Learning feature hierarchies (Deep learning) • Scaling up
Supervised learning Labeled training images: Cars, Motorcycles. Testing: What is this?
Semi-supervised learning Labeled training images: Car, Motorcycle; plus unlabeled images (all cars/motorcycles). Testing: What is this?
Self-taught learning Labeled training images: Car, Motorcycle; plus unlabeled images (random internet images). Testing: What is this?
Self-taught learning Learn features f1, f2, …, fk from the unlabeled images with sparse coding, LCC, etc. Use the learned f1, f2, …, fk to represent the labeled training/test sets (Car, Motorcycle) as activations a1, a2, …, ak. If the labeled training set is small, this can give a huge performance boost.
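A minimal sketch of this pipeline in Python, assuming scikit-learn's MiniBatchDictionaryLearning as the sparse coder and random placeholder arrays standing in for the image data (array names and sizes are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import LogisticRegression

# Hypothetical data: rows are feature vectors (e.g., small image patches).
X_unlabeled = np.random.rand(1000, 64)                                   # random internet images
X_train, y_train = np.random.rand(50, 64), np.random.randint(0, 2, 50)   # small labeled set
X_test = np.random.rand(10, 64)

# Step 1: learn features f1..fk from the unlabeled data (sparse coding).
coder = MiniBatchDictionaryLearning(n_components=32, transform_algorithm='lasso_lars')
coder.fit(X_unlabeled)

# Step 2: represent the labeled training/test sets by their activations a1..ak.
A_train = coder.transform(X_train)
A_test = coder.transform(X_test)

# Step 3: train a standard classifier on the new representation.
clf = LogisticRegression(max_iter=1000).fit(A_train, y_train)
predictions = clf.predict(A_test)
```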
Why feature hierarchies pixels → edges → object parts (combination of edges) → object models
Deep learning algorithms • Stack sparse coding algorithm • Deep Belief Network (DBN) (Hinton) • Deep sparse autoencoders (Bengio) [Other related work: LeCun, Lee, Yuille, Ng …]
Deep learning with autoencoders • Logistic regression • Neural network • Sparse autoencoder • Deep autoencoder
Logistic regression Logistic regression has a learned parameter vector θ. On input x = (x1, x2, x3), it outputs hθ(x) = 1 / (1 + exp(−θᵀx)). Draw a logistic regression unit as a single node with inputs x1, x2, x3 and a +1 bias term.
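As a small illustrative sketch (the numbers are arbitrary), one logistic regression unit in numpy:

```python
import numpy as np

def logistic_unit(x, theta):
    # x: input vector with a constant 1 appended for the +1 bias term
    # theta: learned parameter vector
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

x = np.array([0.5, -1.2, 0.3, 1.0])       # x1, x2, x3, +1 bias
theta = np.array([0.8, 0.1, -0.4, 0.2])   # made-up learned parameters
print(logistic_unit(x, theta))            # output lies in (0, 1)
```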
Neural Network String a lot of logistic units together. Example 3-layer network: inputs x1, x2, x3 plus a +1 bias (Layer 1), hidden units a1, a2, a3 plus a +1 bias (Layer 2), and an output unit (Layer 3).
Neural Network Example 4-layer network with 2 output units: inputs x1, x2, x3 (Layer 1), two hidden layers (Layer 2 and Layer 3), and 2 output units (Layer 4); each non-output layer also has a +1 bias unit.
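A minimal sketch of the forward pass for such a network, with made-up layer sizes and random weights (sigmoid units throughout):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    # weights: list of (W, b) pairs, one per layer; each layer computes a = f(W a_prev + b)
    a = x
    for W, b in weights:
        a = sigmoid(W @ a + b)
    return a

# toy 4-layer network: 3 inputs -> 3 hidden -> 3 hidden -> 2 outputs (illustrative sizes)
rng = np.random.default_rng(0)
weights = [(rng.standard_normal((3, 3)), np.zeros(3)),
           (rng.standard_normal((3, 3)), np.zeros(3)),
           (rng.standard_normal((2, 3)), np.zeros(2))]
print(forward(np.array([1.0, 0.5, -0.2]), weights))   # 2 output activations
```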
Neural Network example [Courtesy of Yann LeCun]
Training a neural network Given training set (x1, y1), (x2, y2), (x3, y3), …. Adjust parameters θ (for every node) to make hθ(xi) ≈ yi, e.g. by minimizing ∑i ||hθ(xi) − yi||². (Use gradient descent. "Backpropagation" algorithm. Susceptible to local optima.)
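A minimal sketch of gradient descent with backpropagation on the squared-error objective, using a tiny made-up dataset and a 1-hidden-layer sigmoid network (all names and sizes are illustrative):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# toy training set: 4 examples, 3 inputs, 1 target each
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 4)) * 0.5, np.zeros(4)
W2, b2 = rng.standard_normal((4, 1)) * 0.5, np.zeros(1)
lr = 0.5

for step in range(5000):
    # forward pass
    a1 = sigmoid(X @ W1 + b1)          # hidden activations
    a2 = sigmoid(a1 @ W2 + b2)         # network output h(x)
    # backpropagation of the squared-error gradient, layer by layer
    d2 = (a2 - y) * a2 * (1 - a2)
    d1 = (d2 @ W2.T) * a1 * (1 - a1)
    # gradient descent step
    W2 -= lr * a1.T @ d2; b2 -= lr * d2.sum(0)
    W1 -= lr * X.T @ d1;  b1 -= lr * d1.sum(0)

print(a2.round(2))   # outputs approach the targets (up to local optima)
```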
Unsupervised feature learning with a neural network • Autoencoder: inputs x1, …, x6 (Layer 1), hidden units a1, a2, a3 (Layer 2), outputs x1, …, x6 (Layer 3). • Network is trained to output the input (learn the identity function). • Trivial solution unless: • Constrain the number of units in Layer 2 (learn a compressed representation), or • Constrain Layer 2 to be sparse.
Unsupervised feature learning with a neural network Training a sparse autoencoder. Given unlabeled training set x1, x2, …, minimize (reconstruction error term) + (L1 sparsity term on the activations a1, a2, a3), e.g. ∑i ||x̂i − xi||² + λ ∑j |aj|.
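A sketch of this objective in numpy, assuming a one-hidden-layer autoencoder with encoder weights W1, b1 and decoder weights W2, b2; the sigmoid nonlinearity and the value of λ are illustrative choices:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_loss(X, W1, b1, W2, b2, lam=0.1):
    # hidden activations a = f(W1 x + b1); reconstruction x_hat = f(W2 a + b2)
    A = sigmoid(X @ W1 + b1)
    X_hat = sigmoid(A @ W2 + b2)
    reconstruction = np.sum((X_hat - X) ** 2)   # reconstruction error term
    sparsity = lam * np.sum(np.abs(A))          # L1 sparsity term on the activations
    return reconstruction + sparsity
```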
Unsupervised feature learning with a neural network The hidden activations a1, a2, a3 (Layer 2) give a new representation for the input.
Unsupervised feature learning with a neural network Train a second layer of features b1, b2, b3 on top of the a's: train parameters so that the b's reconstruct the a's, subject to the bi's being sparse.
Unsupervised feature learning with a neural network The b1, b2, b3 give a new representation for the input. Repeat to train a third layer of features c1, c2, c3 on top of the b's.
Unsupervised feature learning with a neural network The c1, c2, c3 give a new representation for the input. Use [c1, c2, c3] as the representation to feed to the learning algorithm.
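A sketch of this greedy layer-wise procedure, assuming a hypothetical helper train_sparse_autoencoder(data, n_hidden) that fits one sparse autoencoder and returns its encoding function and parameters:

```python
def train_stack(X, layer_sizes, train_sparse_autoencoder):
    # train_sparse_autoencoder is a hypothetical helper: it fits one sparse
    # autoencoder on the given data and returns (encode_fn, params).
    representation = X
    encoders = []
    for n_hidden in layer_sizes:
        encode, params = train_sparse_autoencoder(representation, n_hidden)
        encoders.append(encode)
        representation = encode(representation)   # the a's, then the b's, then the c's
    # final representation is fed to the downstream learning algorithm
    return encoders, representation
```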
Deep Belief Net Deep Belief Net (DBN) is another algorithm for learning a feature hierarchy. Building block: 2-layer graphical model (Restricted Boltzmann Machine). Can then learn additional layers one at a time.
Restricted Boltzmann machine (RBM) Layer 2: [a1, a2, a3] (binary-valued). Input: [x1, x2, x3, x4]. MRF with joint distribution P(x, a) = (1/Z) exp(∑i,j xi Wij aj) (bias terms omitted). Use Gibbs sampling for inference. Given observed inputs x, want maximum likelihood estimation: maximize ∑ log P(x) over the parameters W.
Restricted Boltzmann machine (RBM) Gradient ascent on log P(x): ∂/∂Wij log P(x) = [xi aj]obs − [xi aj]prior, where [xi aj]obs comes from fixing x to its observed value and sampling a from P(a|x), and [xi aj]prior comes from running Gibbs sampling to convergence. Adding a sparsity constraint on the ai's usually improves results.
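A minimal numpy sketch of one such update, using the common contrastive-divergence (CD-1) shortcut in place of running Gibbs sampling to convergence; biases and the sparsity term are omitted, and all shapes/names are illustrative:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, X, lr=0.01, rng=np.random.default_rng(0)):
    # W: (n_visible, n_hidden) weights; X: (n_examples, n_visible) observed inputs.
    prob_a = sigmoid(X @ W)                               # P(a=1 | x) for observed x
    a = (rng.random(prob_a.shape) < prob_a).astype(float) # sample hidden units
    prob_x = sigmoid(a @ W.T)                             # one Gibbs step back to x
    prob_a_recon = sigmoid(prob_x @ W)
    pos = X.T @ prob_a              # ~ [x_i a_j]_obs
    neg = prob_x.T @ prob_a_recon   # ~ [x_i a_j]_prior, approximated after one step
    return W + lr * (pos - neg) / X.shape[0]
```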
Deep Belief Network Similar to a sparse autoencoder in many ways. Stack RBMs on top of each other to get a DBN: Layer 3 [b1, b2, b3], Layer 2 [a1, a2, a3], Input [x1, x2, x3, x4]. Train each layer with approximate maximum likelihood (often with a sparsity constraint on the ai's).
Deep Belief Network Layer 4. [c1, c2, c3] Layer 3. [b1, b2, b3] Layer 2. [a1, a2, a3] Input [x1, x2, x3, x4]
Convolutional DBN for audio Spectrogram → detection units → max pooling unit.
Probabilistic max pooling Convolutional neural net: the pooling unit computes max{x1, x2, x3, x4}, where the xi are real numbers. Convolutional DBN: the xi are {0,1} and mutually exclusive, so within a pooling block there are only 5 possible cases (all four off, or exactly one on), and the pooling unit is 1 exactly when one of the xi is 1. This collapses 2^n configurations into n+1 configurations and permits bottom-up and top-down inference.
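A sketch of how the n+1 configurations of one pooling block can be normalized, given bottom-up inputs to its detection units (an illustrative simplification, not the full CDBN inference procedure):

```python
import numpy as np

def prob_max_pool(bottom_up):
    # bottom_up: bottom-up inputs I_1..I_n to the n detection units in one block.
    # The n "exactly one unit on" configurations plus the all-off configuration
    # give n+1 cases; normalize over them (a softmax with an extra "off" option).
    e = np.exp(np.append(bottom_up, 0.0))   # last entry: all detection units off
    p = e / e.sum()
    p_unit_on = p[:-1]    # P(x_i = 1); mutually exclusive within the block
    p_pool_off = p[-1]    # P(max-pooling unit = 0)
    return p_unit_on, p_pool_off

print(prob_max_pool([2.0, -1.0, 0.5, 0.0]))
```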
Convolutional DBN for audio Stack a second CDBN layer (detection units + max pooling) on top of the first CDBN layer (detection units + max pooling).
CDBNs for speech Learned first-layer bases
Convolutional DBN for Images Max-pooling layer P: "max-pooling" nodes (binary). Detection layer H: hidden nodes (binary), computed from the input with shared "filter" weights Wk. Within each pooling block, at most one hidden node is active. Input data V: visible nodes (binary or real).
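A simplified, deterministic (conv-net style) sketch of one such layer in numpy/scipy: convolve shared filters over the input, apply a sigmoid to get the detection layer H, then pool over non-overlapping blocks. The probabilistic version would replace the hard max with the probabilistic max pooling described above; names and sizes here are illustrative:

```python
import numpy as np
from scipy.signal import convolve2d

def cdbn_layer_forward(V, filters, pool=2):
    # V: 2-D input image; filters: list of shared "filter" weights W^k.
    H, P = [], []
    for Wk in filters:
        h = 1.0 / (1.0 + np.exp(-convolve2d(V, Wk, mode='valid')))   # detection layer H^k
        # pooling layer P^k: max over non-overlapping pool x pool blocks
        r, c = (h.shape[0] // pool) * pool, (h.shape[1] // pool) * pool
        p = h[:r, :c].reshape(r // pool, pool, c // pool, pool).max(axis=(1, 3))
        H.append(h); P.append(p)
    return H, P

rng = np.random.default_rng(0)
H, P = cdbn_layer_forward(rng.random((16, 16)), [rng.standard_normal((3, 3)) for _ in range(4)])
```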
Convolutional DBN on face images pixels → edges → object parts (combination of edges) → object models. Note: Sparsity is important for these results.
Learning of object parts Examples of learned object parts from object categories Faces Cars Elephants Chairs
Training on multiple objects Trained on 4 classes (cars, faces, motorbikes, airplanes). Second layer: shared features and object-specific features (second-layer bases learned from the 4 object categories; plot of H(class | neuron active)). Third layer: more class-specific features (third-layer bases learned from the 4 object categories).
Hierarchical probabilistic inference Generating posterior samples from faces by "filling in" experiments (cf. Lee and Mumford, 2003). Combine bottom-up and top-down inference. Figure: input images; samples from feedforward inference (control); samples from full posterior inference.
Key issue in feature learning: Scaling up
Scaling up with graphics processors Chart: peak GFlops, 2003–2008, for a US$250 NVIDIA GPU vs. an Intel CPU. (Source: NVIDIA CUDA Programming Guide)
Scaling up with GPUs Chart: approx. number of parameters (millions) learnable using GPUs (Raina et al., 2009).
State-of-the-art task performance Audio, Images, Video, Multimodal (audio/video).