
An overview of deep learning


Presentation Transcript


  1. An overview of deep learning. Geoffrey Hinton, University of Toronto, Canadian Institute for Advanced Research, and Google.

  2. A brief history of deep neural networks • How to recognize patterns: First convert raw input into feature activations. Then learn how to weight the feature activations to decide which class wins. • But how do we decide what features to use? • Use hand-engineering? • Define a feature as the similarity to a carefully selected training input? (Support Vector Machine) • Learn the features to optimize discrimination? • Learn a multilayer generative model?

  3. Deep neural networks (~1985) [Diagram of a feed-forward net: an input vector, hidden layers, and outputs. Compare the outputs with the correct answer to get an error signal, then back-propagate the error signal to get derivatives for learning.]
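
A minimal sketch (not from the talk) of what the diagram describes: a forward pass through one hidden layer, an error signal from comparing the outputs with the target, and back-propagated derivatives. The sizes, learning rate, sigmoid units, and squared-error loss are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: 4 inputs, 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 3))
W2 = rng.normal(scale=0.1, size=(3, 2))

def backprop_step(x, target, lr=0.1):
    """One gradient step of back-propagation on a single training case."""
    global W1, W2
    # Forward pass: input vector -> hidden layer -> outputs.
    h = sigmoid(x @ W1)
    y = sigmoid(h @ W2)
    # Compare outputs with the correct answer to get an error signal.
    err = y - target
    # Back-propagate the error signal to get derivatives for learning.
    delta_out = err * y * (1 - y)                  # through the output nonlinearity
    delta_hid = (delta_out @ W2.T) * h * (1 - h)   # through the hidden nonlinearity
    W2 -= lr * np.outer(h, delta_out)
    W1 -= lr * np.outer(x, delta_hid)

backprop_step(np.array([1.0, 0.0, 1.0, 0.0]), np.array([1.0, 0.0]))
```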

  4. What is wrong with back-propagation? (a plausible story, but false) • It requires labeled training data. • Almost all data is unlabeled. • The learning time does not scale well. • It is very slow in networks with multiple hidden layers. • It can get stuck in poor local optima. • These are often quite good, but for deep nets they are far from optimal.

  5. What was actually wrong with back-propagation? • We didn’t collect enough labeled data. • We didn’t have fast enough computers. • We didn’t initialize the weights correctly. • If we fix these three problems, it works really well.

  6. Overcoming the lack of labeled data • Keep the efficiency and simplicity of using a gradient method for adjusting the weights, but use it for modeling the structure of the sensory input. • Adjust the weights to maximize the probability that a generative model would have produced the sensory input. • Learn p(image) not p(label | image) • If you want to do computer vision, first learn computer graphics

  7. Why unsupervised pre-training makes sense [Diagram: in one generative story, hidden “stuff” produces the image through a high-bandwidth pathway and the label through a low-bandwidth pathway; in the other, the label is produced directly from the image.] If image-label pairs are generated the first way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway. If image-label pairs were generated the second way, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?

  8. What kind of generative model should we learn? • Presumably it should be like the models that people use in computer graphics.

  9. Belief Nets A belief net is a directed acyclic graph composed of stochastic variables. [Diagram: stochastic hidden causes with directed connections to visible effects.] We get to observe some of the variables and we would like to solve two problems: The inference problem: infer the states of the unobserved variables. The learning problem: adjust the interactions between variables to make the network more likely to generate the observed data. We will use nets composed of layers of stochastic binary variables with weighted connections. This is crazy!

  10. Why inverse computer graphics is hard • The pixels have lost the depth information. • So buy a Kinect box. • We have lost the information about which parts belong to which wholes. • Parts can be highly ambiguous. We need a powerful way of assigning parts to wholes. • The idea that running a general purpose inference algorithm in a generative model is the right way to solve the assignment problem may be wrong. • But a generative model is still a great way to check that a solution is good or to define an objective function for learning.

  11. The learning rule for sigmoid belief nets • Learning is easy if we can get an unbiased sample from the posterior distribution over hidden states given the observed data. • For each unit, maximize the log probability that its binary state in the sample from the posterior would be generated by the sampled binary states of its parents. • Given sampled parent states s_j, unit i turns on with probability p_i = σ(Σ_j s_j w_ji), and the resulting weight update is Δw_ji = ε s_j (s_i − p_i), where ε is the learning rate.
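
As a rough illustration (not Hinton's code), the rule above amounts to a delta rule applied to each layer of the belief net; the shapes and learning rate below are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sbn_update(W, s_parents, s_child, eps=0.01):
    """Delta rule for one layer of a sigmoid belief net.

    W[j, i] connects parent j to child i; s_parents and s_child are sampled
    binary states (ideally an unbiased sample from the posterior).
    """
    p_child = sigmoid(s_parents @ W)   # probability each child turns on given its parents
    W += eps * np.outer(s_parents, s_child - p_child)
    return W

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 3))
W = sbn_update(W, s_parents=rng.integers(0, 2, 5).astype(float),
               s_child=rng.integers(0, 2, 3).astype(float))
```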

  12. Explaining away (Judea Pearl) • Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence. • If we learn that there was an earthquake it reduces the probability that the house jumped because of a truck. [Diagram: two hidden causes, “truck hits house” and “earthquake”, each with bias -10, both connected with weight +20 to the visible effect “house jumps”, which has bias -20. Given that the house jumped, the posterior over (truck, earthquake) is p(1,1)=.0001, p(1,0)=.4999, p(0,1)=.4999, p(0,0)=.0001.]
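
A quick numerical check (not from the slides) of the posterior shown in the diagram, treating the network as a sigmoid belief net with the biases and weights listed above; the code simply enumerates the four joint states of the two causes.

```python
import numpy as np
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Biases and weights read off the diagram (assumed to define a sigmoid belief net).
bias_truck, bias_quake, bias_jump = -10.0, -10.0, -20.0
w_truck, w_quake = 20.0, 20.0

# Unnormalized posterior over (truck, earthquake) given that the house jumped.
post = {}
for t, e in product([0, 1], repeat=2):
    prior = sigmoid(bias_truck) ** t * (1 - sigmoid(bias_truck)) ** (1 - t) \
          * sigmoid(bias_quake) ** e * (1 - sigmoid(bias_quake)) ** (1 - e)
    likelihood = sigmoid(bias_jump + w_truck * t + w_quake * e)   # p(jump = 1 | t, e)
    post[(t, e)] = prior * likelihood

Z = sum(post.values())
for k in post:
    # (1,0) and (0,1) get essentially all the mass; (1,1) and (0,0) get almost none.
    print(k, round(post[k] / Z, 4))
```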

  13. Why it is usually very hard to learn sigmoid belief nets one layer at a time. To learn W, we need the posterior distribution in the first hidden layer. Problem 1: The posterior is typically complicated because of “explaining away”. Problem 2: The posterior depends on the prior as well as the likelihood. So to learn W, we need to know the weights in higher layers, even if we are only approximating the posterior. All the weights interact. Problem 3: We need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk! [Diagram: a stack of hidden-variable layers above the data; the higher layers define the prior, and the bottom weights W define the likelihood.]

  14. Two types of generative neural network • If we connect binary stochastic neurons in a directed acyclic graph we get a Sigmoid Belief Net (Radford Neal 1992). • If we connect binary stochastic neurons using symmetric connections we get a Boltzmann Machine (Hinton & Sejnowski, 1983). • If we restrict the connectivity in a special way, it is easy to learn a Boltzmann machine.

  15. Restricted Boltzmann Machines (Smolensky, 1986, called them “harmoniums”) [Diagram: one layer of hidden units j above one layer of visible units i.] • We restrict the connectivity to make learning easier. • Only one layer of hidden units. • We will deal with more layers later. • No connections between hidden units. • In an RBM, the hidden units are conditionally independent given the visible states. • So we can quickly get an unbiased sample from the posterior distribution when given a data-vector. • This is a big advantage over directed belief nets.

  16. The Energy of a joint configuration (ignoring terms to do with biases): E(v, h) = − Σ_{i,j} v_i h_j w_{ij}, where v_i is the binary state of visible unit i, h_j is the binary state of hidden unit j, and w_{ij} is the weight between units i and j.

  17. Weights → Energies → Probabilities • Each possible joint configuration of the visible and hidden units has an energy. • The energy is determined by the weights and biases (as in a Hopfield net). • The energy of a joint configuration of the visible and hidden units determines its probability: p(v, h) ∝ e^{−E(v, h)}. • The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it.

  18. Using energies to define probabilities. The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations: p(v, h) = e^{−E(v, h)} / Z, where the partition function is Z = Σ_{u,g} e^{−E(u, g)}. The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it: p(v) = Σ_h e^{−E(v, h)} / Z.
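
A small sketch (illustrative, not from the slides) that computes these probabilities by brute force for a tiny RBM; the weights are random and the enumeration is only feasible because there are very few units.

```python
import numpy as np
from itertools import product

# Tiny RBM: 3 visible units, 2 hidden units, biases ignored as in the energy above.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 2))     # w_ij between visible i and hidden j

def energy(v, h):
    # E(v, h) = - sum_ij v_i h_j w_ij
    return -v @ W @ h

configs = [(np.array(v), np.array(h))
           for v in product([0, 1], repeat=3)
           for h in product([0, 1], repeat=2)]

# Partition function: sum of e^{-E} over every joint configuration.
Z = sum(np.exp(-energy(v, h)) for v, h in configs)

def p_joint(v, h):
    return np.exp(-energy(v, h)) / Z

def p_visible(v):
    # Sum the probabilities of all joint configurations that contain this visible vector.
    return sum(p_joint(v, np.array(h)) for h in product([0, 1], repeat=2))

print(p_visible(np.array([1, 0, 1])))
```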

  19. A picture of the maximum likelihood learning algorithm for an RBM [Diagram: an alternating Gibbs chain with hidden units j and visible units i at t = 0, t = 1, t = 2, ..., t = infinity, ending in a “fantasy” drawn from the model’s equilibrium distribution.] Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel. The maximum likelihood learning rule changes each weight in proportion to <v_i h_j>^0 − <v_i h_j>^∞, the difference between the pairwise statistics measured on the data and at equilibrium.

  20. A quick way to learn an RBM Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a “reconstruction”. Update the hidden units again. [Diagram: data at t = 0, reconstruction at t = 1.] The weight update is Δw_ij = ε ( <v_i h_j>^data − <v_i h_j>^reconstruction ). This is not following the gradient of the log likelihood. But it works well. It is approximately following the gradient of another objective function (Carreira-Perpinan & Hinton, 2005).
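
A compact sketch (illustrative, not Hinton's code) of this contrastive-divergence (CD-1) procedure in NumPy, with biases omitted to match the energy definition above; the layer sizes, learning rate, and random training vectors are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, v_data, rng, eps=0.05):
    """One CD-1 update for an RBM with weights W (visible x hidden), biases ignored."""
    # Up pass: sample hidden states given the data.
    p_h0 = sigmoid(v_data @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Down pass: a "reconstruction" of the visible units.
    p_v1 = sigmoid(h0 @ W.T)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    # Up pass again on the reconstruction.
    p_h1 = sigmoid(v1 @ W)
    # Approximate gradient: data statistics minus reconstruction statistics.
    W += eps * (np.outer(v_data, p_h0) - np.outer(v1, p_h1))
    return W

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(6, 4))        # 6 visible, 4 hidden (illustrative)
for _ in range(100):
    v = (rng.random(6) < 0.5).astype(float)    # stand-in for real training vectors
    W = cd1_step(W, v, rng)
```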

  21. Three ways to combine probability density models (an underlying theme of the tutorial) • Mixture: Take a weighted average of the distributions. • It can never be sharper than the individual distributions. It’s a very weak way to combine models. • Product: Multiply the distributions at each point and then renormalize. • Exponentially more powerful than a mixture. The normalization makes maximum likelihood learning difficult, but approximations allow us to learn anyway. • Composition: Use the values of the latent variables of one model as the data for the next model. • Works well for learning multiple layers of representation, but only if the individual models are undirected.

  22. Training a deep network (the main reason RBMs are interesting) • First train a layer of features that receive input directly from the pixels. • Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer. • It can be proved that each time we add another layer of features we improve a variational lower bound on the log probability of the training data. • The proof is slightly complicated. • But it is based on a neat equivalence between an RBM and a deep directed model (described later).
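
A sketch (illustrative only) of the greedy layer-by-layer procedure, reusing a CD-1 style update like the one above; the layer sizes, epochs, and random stand-in data are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng, epochs=10, eps=0.05):
    """Train one RBM with CD-1 (biases omitted) and return its weights."""
    W = rng.normal(scale=0.01, size=(data.shape[1], n_hidden))
    for _ in range(epochs):
        for v in data:
            p_h0 = sigmoid(v @ W)
            h0 = (rng.random(n_hidden) < p_h0).astype(float)
            v1 = (rng.random(data.shape[1]) < sigmoid(h0 @ W.T)).astype(float)
            p_h1 = sigmoid(v1 @ W)
            W += eps * (np.outer(v, p_h0) - np.outer(v1, p_h1))
    return W

rng = np.random.default_rng(0)
pixels = (rng.random((200, 16)) < 0.5).astype(float)   # stand-in for image data

# Greedy stacking: treat each layer's feature activations as "data" for the next RBM.
layer_sizes = [12, 8]                                  # illustrative
weights, layer_input = [], pixels
for n_hidden in layer_sizes:
    W = train_rbm(layer_input, n_hidden, rng)
    weights.append(W)
    layer_input = sigmoid(layer_input @ W)             # features of features for the next layer
```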

  23. The generative model after learning 3 layers [Diagram: hidden layers h3, h2, h1 above the data; the top two layers are joined by undirected connections, with directed top-down connections from h2 to h1 and from h1 to the data.] To generate data: Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time. Perform a top-down pass to get states for all the other layers. So the lower-level bottom-up connections are not part of the generative model. They are just used for inference.

  24. Why does greedy learning work? Each RBM converts its data distribution into an aggregated posterior distribution over its hidden units. [Diagram: Task 2 sits above the aggregated posterior distribution on the hidden units; Task 1 sits above the data distribution on the visible units.] This divides the task of modeling its data into two tasks: Task 1: Learn generative weights that can convert the aggregated posterior distribution over the hidden units back into the data distribution. Task 2: Learn to model the aggregated posterior distribution over the hidden units. The RBM does a good job of task 1 and a moderately good job of task 2. Task 2 is easier (for the next RBM) than modeling the original data because the aggregated posterior distribution is closer to a distribution that an RBM can model perfectly.

  25. Why does greedy learning work? The weights, W, in the bottom-level RBM define p(v|h) and they also, indirectly, define p(h). So we can express the RBM model as p(v) = Σ_h p(h) p(v|h). If we leave p(v|h) alone and improve p(h), we will improve p(v). To improve p(h), we need it to be a better model of the aggregated posterior distribution over hidden vectors produced by applying W to the data.

  26. The first big success for deep neural nets • Speech recognition is improved by using deep neural nets that look at multiple frames of coefficients and predict the states of HMMs that model phonemes (Mohamed, Dahl & Hinton, 2009). • This approach works better than the blurry table look-up used for the previous 30 years. • Refinements of these deep neural nets are now used by Google and Microsoft for doing voice search in the cloud, and also by IBM. See the 2012 review paper by Hinton et al. in IEEE Signal Processing Magazine.

  27. Word error rates from MSR, IBM, and the Google speech group (early 2012) [Table of word error rates across speech tasks; one entry: 16% for a system trained on >>5,870 hours of data.]

  28. Finding roads in high-resolution images: Another victory for big, deep nets. • Vlad Mnih (ICML 2012) used a non-convolutional net with local fields and multiple layers of rectified linear units to find roads in cluttered aerial images. • It takes a large image patch and predicts a binary road label for the central 16x16 pixels. • There is lots of labeled training data available for this task.

  29. Why finding roads is hard • The task is hard for many reasons: • Occlusion by buildings, trees, cars. • Shadows, lighting changes. • Minor viewpoint changes. • The worst problems are incorrect labels: • Badly registered maps. • Arbitrary decisions about what counts as a road. • Big neural nets trained on big image patches with millions of examples are the only hope.

  30. The best road-finder on the planet?

  31. Is there anything we cannot do with very big, deep neural networks? • It appears to be hard to do massive model averaging: • Each net takes a long time to learn. • At test time we don’t want to run lots of different large neural nets.

  32. Averaging many models • To win a machine learning competition (e.g. Netflix) you need to use many different types of model and then combine them to make predictions at test time. • Decision trees are not very powerful models, but they are easy to fit to data and very fast at test time. • Averaging many decision trees works really well. It’s called random forests (Breiman). • We make the individual trees different by giving them different training sets. That’s called bagging.

  33. Two ways to average models (columns are per-class output probabilities for three classes) • We can combine models by taking the arithmetic means of their output probabilities: Model A: .3 .2 .5; Model B: .1 .8 .1; combined: .2 .5 .3. • We can combine models by taking the geometric means of their output probabilities: Model A: .3 .2 .5; Model B: .1 .8 .1; combined: .03 .16 .05, each divided by their sum to renormalize.
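
A tiny check (not from the slides) of the two combinations, using the numbers above; following the slide, the "geometric" combination is computed by multiplying the distributions pointwise and renormalizing.

```python
import numpy as np

# Per-class output probabilities from the slide's two models.
model_a = np.array([0.3, 0.2, 0.5])
model_b = np.array([0.1, 0.8, 0.1])

# Arithmetic mean (mixture): simple average of the distributions.
arithmetic = (model_a + model_b) / 2
print(arithmetic)                      # [0.2 0.5 0.3]

# Product of experts: multiply pointwise, then renormalize.
product = model_a * model_b            # [0.03 0.16 0.05]
combined = product / product.sum()
print(combined.round(3))               # sharper than either model alone
```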

  34. Dropout: An efficient way to average many large neural nets. • Consider a neural net with one hidden layer. • Each time we present a training example, we randomly omit each hidden unit with probability 0.5. • So we are randomly sampling from 2^H different architectures. • All architectures share weights.

  35. Dropout as a form of model averaging • We sample from 2^H models. So only a few of the models ever get trained, and they only get one training example. • This is as extreme as bagging can get. • The sharing of the weights means that every model is very strongly regularized. • It’s a much better regularizer than L2 or L1 penalties that pull the weights towards zero.

  36. But what do we do at test time? • We could sample many different architectures and take the geometric mean of their output distributions. • It is better to use all of the hidden units, but to halve their outgoing weights. • With one hidden layer, this exactly computes the geometric mean of the predictions of all 2^H models.
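
A minimal sketch (illustrative, not the original experiments) of dropout with probability 0.5 in one hidden layer, together with the test-time rule just described: keep every hidden unit but halve its outgoing weights. The sizes and the rectified-linear hidden units are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_train(x, W1, W2):
    """Training-time forward pass: randomly omit each hidden unit with probability 0.5."""
    h = np.maximum(0.0, x @ W1)                    # hidden activations (ReLU, illustrative)
    mask = (rng.random(h.shape) < 0.5).astype(float)
    return (h * mask) @ W2                         # only the surviving units send output

def forward_test(x, W1, W2):
    """Test time: use all hidden units but halve their outgoing weights."""
    h = np.maximum(0.0, x @ W1)
    return h @ (0.5 * W2)

W1 = rng.normal(scale=0.1, size=(8, 32))
W2 = rng.normal(scale=0.1, size=(32, 3))
x = rng.random(8)
print(forward_train(x, W1, W2), forward_test(x, W1, W2))
```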

  37. What if we have more hidden layers? • Use dropout of 0.5 in every layer. • At test time, use the “mean net” that has all the outgoing weights halved. • This is not exactly the same as averaging all the separate dropped-out models, but it’s a pretty good approximation, and it’s fast.

  38. What about the input layer? • It helps to use dropout there too, but with a higher probability of keeping an input unit. • This trick is already used by the “denoising autoencoders” developed in Yoshua Bengio’s group. • It was derived by a different route.

  39. A familiar example of dropout • Do logistic regression, but for each training case, drop out all but one of the inputs. • At test time, use all of the inputs. • It’s better to divide the learned weights by the number of features, but if we just want the best class it’s unnecessary. • This is called “Naïve Bayes”. • Why keep just one input?

  40. Another way to think about dropout • If a hidden unit knows which other hidden units are present, it can co-adapt to them on the training data. • But complex co-adaptations are likely to go wrong on new test data. • Big, complex conspiracies are not robust. • If a hidden unit has to work well with combinatorially many sets of co-workers, it is more likely to do something that is individually useful, but also marginally useful given what its co-workers achieve.

  41. A simple example for a linear model • Training data: inputs (1, 1, 0, 0) with target 6, and inputs (1, 1, 1, 1) with target 4. • Co-adapted weights: -5, +11, +4, -6. • Less co-adapted weights: +3, +3, -2, -2. • The co-adapted weights fit both training cases exactly, but only because large weights cancel each other; the less co-adapted weights fit almost as well without relying on such cancellation.

  42. How dropout differs from L2 regularization (Wager, Wang and Liang 2013) • In a net with no hidden layers, dropout on the inputs can be analysed when the output is a logistic unit. • It’s equivalent to L2 regularization but with a very important extra factor. • The strength of the L2 regularizer on each input is proportional to the sum over all training cases of p(1-p), so it does not depend on the targets. • Big weights are not penalized when they lead to confident outputs. • Big weights leading to uncertain decisions are heavily penalized. This is just what we want.

  43. How well does dropout work? • If your deep neural net is significantly overfitting, it will reduce the number of errors by a lot. • Any net that uses “early stopping” can do better by using dropout (at the cost of taking quite a lot longer to train). • If your deep neural net is not overfitting you should be using a bigger one. • Your brain uses about a hundred thousand parameters per second of experience.

  44. Experiments on TIMIT (Nitish Srivastava) • First pre-train a deep neural network one layer at a time on unlabeled windows of acoustic coefficients. • Then fine-tune to discriminate between the classes using a small learning rate. • Standard fine-tuning: 22.7% error on the test set. • Dropout fine-tuning: 19.7% error on the test set. • This was a record for speaker-independent methods.

  45. Experiment on TIMIT (Nitish Srivastava) [Figure of results.]

  46. Using weight constraints • In neural nets, it is standard to use an L2 penalty on the weights (called weight-decay). • This improves generalization by keeping the weights small. • It is generally better to constrain the length of the incoming weight vector of each hidden unit. • If the weight vector becomes longer than allowed, the weights are renormalized by division. • Weight constraints make it possible to use a very big initial learning rate which then decays.
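
A small sketch (with an illustrative maximum length) of the constraint described above: if a hidden unit's incoming weight vector grows longer than allowed, renormalize it by division.

```python
import numpy as np

def apply_max_norm(W, max_length=3.0):
    """Constrain the incoming weight vector of each hidden unit (a column of W).

    If a column's L2 length exceeds max_length, scale it back down by division.
    """
    lengths = np.linalg.norm(W, axis=0)                  # one length per hidden unit
    too_long = lengths > max_length
    W[:, too_long] *= max_length / lengths[too_long]
    return W

rng = np.random.default_rng(0)
W = rng.normal(scale=2.0, size=(100, 10))   # incoming weights for 10 hidden units
W = apply_max_norm(W)
print(np.linalg.norm(W, axis=0).max())      # no column longer than 3.0
```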

  47. Object recognition: The next big success of deep neural nets • Recognizing objects in images appears to be a much bigger problem than recognizing phonemes. • It requires a huge amount of knowledge about what things look like (no separate dictionary) • The computer vision community was convinced that deep neural nets were only good for relatively simple problems like hand-written character recognition. • Until recently, computer vision used tiny datasets. • Small datasets are OK for tuning a few parameters of hand-engineered features.

  48. The ILSVRC-2012 competition on ImageNet • 1.2 million high-resolution training images. • Each image has a label from 1000 classes. • The classification task is to get the “correct” class in your top 5 bets. • There is some randomness in which object is labeled and in what label it is given • Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups: • U. Tokyo, Oxford, INRIA, XRCE, …

  49. Error rates on the ILSVRC-2012 competition • University of Tokyo: 26.1% • Oxford University Vision Group: 26.9% • INRIA + XRCE: 27.0% • University of Amsterdam: 29.5% • Krizhevsky et al.: 16.4% (a much bigger gap than for acoustic models)

  50. Some examples from ImageNet
