### Welcome back to DATA 442! #SLIDE As a brief recap, last lecture we first discussed the mechanics of a net, specifically how each step of the net is implemented in code, and then covered a few ideas about how different types of bugs in your code could be identified and isolated. #SLIDE We then discussed the limitations of SGD, including getting stuck at local minima, saddle points, and poor conditioning. #SLIDE We discussed a number of different optimization techniques, leading up to the ADAM approach. #SLIDE And, we began a discussion on how to identify an optimal learning rate - and some of the signs you would see if what you found was suboptimal. This is where we'll start today. One important thing to consider is that the question itself - what learning rate should we choose? - is a bit misleading. #SLIDE That's because learning rates can change - i.e., there is no reason why you have to use the same learning rate at each iteration of your weights calculations. This can let you capture the best elements of multiple learning rate strategies. One of the most common approaches to changing the rate is to implement some form of decay - shifting the learning rate down as the number of iterations increases. #SLIDE The most common approaches to this have a parameter - k - which determines the rate of decay; this is generally implemented in three different ways. First of these is step decay. In step decay, every k iterations you cut the learning rate in half; this can be modified to use other reduction factors, and frequently is. #SLIDE In exponential decay, the learning rate decays according to an exponential function, $\alpha_i = \alpha_0 e^{-ki}$, approaching zero at a speed defined by the rate k. In the exponential case, higher values of k result in quicker decay of the learning rate. #SLIDE And, in inverse decay the learning rate is reduced based on an inverse function, $\alpha_i = \alpha_0 / (1 + ki)$, in which the denominator increments by k each iteration. #SLIDE Another approach is to try to omit the need for learning rates in the first place, by solving for minima using second-order derivatives. #SLIDE To date, we've focused on calculating the gradient for a given set of parameters in order to form a linear approximation of our loss function; we then move our parameters in a step to minimize that linear approximation. All of the learning we've done to date has followed this generic approach. One of the big challenges of this is that, if you step too far (as shown here), the approximation doesn't hold very well in most cases - necessitating a smaller learning rate. #SLIDE However, there is nothing stopping us from going beyond a first-order approximation of our function (i.e., a line) to a second-order approximation. In this case, we can use the gradient *and* the Hessian matrix to form a quadratic approximation of the function - basically, instead of representing the function based on the first-order derivative alone, we can represent it quadratically, giving a (potentially) more accurate step. Another very helpful characteristic of this approach is that it has a single solution that doesn't require a step size - i.e., the quadratic approximation has one minimum, and we can step directly to it. This step is referred to as the "Newton step". #SLIDE While very powerful, in practice calculating the quadratic approximation of our loss function is limited in a few key ways. First and foremost, the Hessian matrix, which holds the second derivative with respect to every pair of parameters, has O(N^2) elements; inverting that matrix requires on the order of O(N^3) operations.
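To make the Newton step concrete, here is a minimal, self-contained sketch on a toy loss; the loss function, the `target` values, and the iteration count are all illustrative choices of mine, not from the lecture. The point to notice is that the update combines the gradient and the Hessian, and no learning rate appears anywhere.

```python
import numpy as np

# Toy, non-quadratic loss with easy analytic derivatives: f(w) = sum((w - target)^4).
target = np.array([1.0, -2.0, 0.5])

def grad_and_hessian(w):
    d = w - target
    g = 4.0 * d ** 3              # gradient of f
    H = np.diag(12.0 * d ** 2)    # Hessian of f (diagonal for this toy loss)
    return g, H

w = np.zeros(3)
for _ in range(25):
    g, H = grad_and_hessian(w)
    # The "Newton step": jump to the minimum of the local quadratic approximation.
    # Note that no step size / learning rate is chosen anywhere.
    w = w - np.linalg.solve(H, g)

print(w)  # converges toward `target`
```

Solving against the Hessian in that last line is exactly the operation that becomes expensive as the parameter count grows, which is the problem we turn to next.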
Even in a small network, the Hessian is enormous: with just 100k parameters it already has on the order of 10^10 entries - tens of gigabytes of memory simply to store, let alone invert - and modern networks have millions or billions of parameters, which makes this obviously unsuitable for realistic application. #SLIDE So, what do we do? There are two common implementations of second-order approximations, BFGS and L-BFGS, that you will encounter. Both are quasi-Newton methods: rather than forming the full Hessian, they build an approximation of it from the gradients, which minimizes the total amount of information that has to be held on disk or in memory at any given time (L-BFGS, the "limited-memory" variant, keeps only a short history of recent updates). Of particular note here is that L-BFGS does work well in cases where you don't need to do any stochastic manipulations - i.e., your function is deterministic. In our application, that means L-BFGS can be a powerful tool if we can solve our network in one large batch (i.e., no minibatches). However, in practice it is rare that you will be fitting a network without the use of batching, so the use of L-BFGS is limited. #SLIDE Let's take a step back and remember what we're trying to do with these models. To this point, we've really focused on our loss for our training data - i.e., during the modeling process, we're trying to minimize the blue line on this figure (some measure of loss), which will increase the accuracy we're looking for (the red line here, our training accuracy). However, that's only half the battle - what we REALLY want is for the yellow line to be as high as possible - our validation accuracy. #SLIDE So, another focus for us is how to minimize the gap between the yellow line - our independent validation accuracy - and our red line. This difference ultimately comes down to patterns that our model captures in our training data that are not reflected in our validation data, or vice-versa. So, how do we build models that help us close this gap? #SLIDE First, we're going to talk about model ensembles - by putting multiple models together, we can hypothetically capture more variance in our training data, and do a better job predicting to our validation data. Ensembles are a common approach in imagery analysis, where you essentially build multiple models, train them all, and then combine their results into a single output. You can imagine any number of different ensemble approaches - even if you're simply ensembling the same model numerous times, you still might see gains of around 1-2%, simply because each model will, by random chance, end up with a somewhat different set of parameters, and combining their outputs smooths over individual errors. In this example ensemble, you see each image being replicated 4 times, input into a convolutional model (ResNet - more on this later), and then integrated together to produce the final estimate. #SLIDE Another type of model ensemble is a snapshot ensemble. Consider your normal model fitting routine, in which you do a forward pass, a backward pass, and then update the weights. Normally, you would repeat iteratively until you find the single best set of weights, and use those in your "final" model. #SLIDE In a snapshot ensemble, we instead first fit our model until it reaches convergence (i.e., the values of our weights aren't changing very much). We then save these weights, and they become the first member of our ensemble. We then dramatically increase the learning rate, and repeat the process once for each ensemble member we want. #SLIDE These two images, from Huang, Li, Pleiss and colleagues (2017), illustrate the fundamental idea of a snapshot ensemble. On the top is an example of how the learning rate would change while building the snapshots (in red).
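Before walking through the figure in detail, here is a minimal sketch of the snapshot-ensemble loop. Everything below is illustrative: the problem is a tiny convex least-squares fit (so every snapshot actually lands near the same solution, unlike a real non-convex network), and the learning-rate values, cycle length, and number of cycles are arbitrary. The cyclic cosine schedule is one common way of decaying and then restarting the learning rate, in the spirit of the paper.

```python
import numpy as np

# Toy setting: linear least squares, purely to make the loop concrete.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=200)

def grad(w):
    return 2.0 * X.T @ (X @ w - y) / len(y)

lr_max, cycle_len, n_cycles = 0.3, 100, 6      # 6 snapshots, as in the figure
w = np.zeros(5)
snapshots = []

for i in range(cycle_len * n_cycles):
    t = (i % cycle_len) / cycle_len                  # position within the current cycle
    lr = 0.5 * lr_max * (1.0 + np.cos(np.pi * t))    # cosine decay toward ~0 ...
    w -= lr * grad(w)
    if (i + 1) % cycle_len == 0:                     # ... then snapshot and restart at lr_max
        snapshots.append(w.copy())

# At test time, each saved snapshot acts as one ensemble member;
# the ensemble prediction is the average of their individual predictions.
ensemble_pred = np.mean([X @ s for s in snapshots], axis=0)
```

In a real network the gradient would come from minibatches, and each restart could settle into a different local minimum - which is exactly the behavior the figure illustrates.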
Looking at that top panel: as the learning rate approaches a very small value at the end of each cycle, we save the state of the weights, and then dramatically increase the learning rate again (the red curve). In this example, we repeat that process 6 times, creating a 6-model ensemble. On the bottom, you can see the intuition behind this approach. The first image (the one on the left) shows a normal SGD optimization, in which the learning rate is gradually decreased. It searches across the loss function until it finds the minimum value, and saves the weights at that point. On the right is an example of a snapshot ensemble. Here, we're building multiple snapshots for an ensemble, and so the algorithm - at least hypothetically - is going to find multiple minima; each time the learning rate is increased, we hope to land in a new local minimum; thus, each model would be representative of a different local minimum in our loss surface. This can be very helpful in capturing heterogeneity across your input data. #SLIDE The next class of approaches to reducing the gap between our training and validation accuracy is regularization. We've already discussed regularization in a few different contexts - you'll recall these are essentially penalty terms added to your loss function that penalize a model for having more complex weights (i.e., bigger values across more weights). On the slide now is the idea of L2 regularization. In the spirit of Occam's Razor, we hope that a simpler model will perform better than a more complex model. #SLIDE In the context of neural networks we have other options for how we implement regularization. One of the most common of these is dropout. The concept of dropout is that - in any given layer - you set a parameter which defines the probability of any neuron's activation value being set to 0 (0.5 is a common choice). Thus, in the forward stage of the network, roughly 50% of the neurons will randomly be 'dropped'. In implementation - for example, in Keras - dropout is typically exposed as its own layer, which zeroes a random fraction of the previous layer's activations. #SLIDE Important to note is that during your training, the dropout probabilities are applied randomly - i.e., every forward pass, you drop out different neurons. This is great for training, but would be really bad for actual use of a model - i.e., if you're designing an algorithm to prevent a robot from knocking a toddler down, you want the same image of a toddler to be recognized as a toddler 100% of the time; it would be bad news if tomorrow, due to random chance, it were identified as something different. So, how do we handle this? #SLIDE The most common solution is to train your weights with dropout turned on - every connection still receives a weight, it simply isn't used on every iteration - and then, in the final product, multiply the outputs of each neuron by the probability that the neuron was kept (with a dropout rate of 0.5 the drop and keep probabilities happen to be equal, but in general it is the keep probability you scale by). This matches each neuron's expected output under dropout, while ensuring a deterministic fit that will not change due to random chance. #SLIDE Another version of this concept is called DropConnect. In DropConnect, instead of zeroing out activations, random weights are set to 0, essentially removing individual connections between neurons. #SLIDE A hugely popular technique for improving how well models generalize is data augmentation; before we turn to it, the short sketch below recaps the dropout train/test behavior we just described.
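This is a minimal sketch only - the drop probability, array sizes, and use of plain NumPy are my illustrative choices. The key contrast is the random mask during training versus the deterministic scaling by the keep probability at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
drop_prob = 0.5                    # chance that any given activation is zeroed in training
keep_prob = 1.0 - drop_prob

def dropout(h, train=True):
    """Apply dropout to a layer's activations h."""
    if train:
        mask = rng.random(h.shape) >= drop_prob   # randomly keep ~50% of neurons
        return h * mask
    # Test time: no randomness. Scale by the keep probability so each neuron's
    # expected output matches what downstream layers saw during training.
    return h * keep_prob

h = np.maximum(0.0, rng.normal(size=(4, 10)))     # toy ReLU activations
train_out = dropout(h, train=True)                # a different mask every call
test_out = dropout(h, train=False)                # deterministic
```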
Back to data augmentation: what you see now is our basic training strategy, in which we have our input dataset (made up of images and classes), a model which we apply to the images, and then a loss function which takes in the model's score estimates and compares them to the true class name (i.e., stop sign). #SLIDE Data augmentation relies on adding a new step into this pipeline: an image transformation step. Essentially, a transformation (or set of transformations) is defined which changes the makeup of the image, with the goal of creating new views of what the image might look like. I.e., in this case, after the transformation the stop sign is still a stop sign, but in black and white and at a slightly different angle. The basic idea here is that we can synthetically create more data - i.e., augment our data - with additional examples of a class (here, a stop sign) by applying transformations that might approximate other, unseen images. #SLIDE There are a number of very common augmentation strategies that you will come across. Flipping images is one of the most frequently seen - i.e., a horse is still a horse, no matter what direction it's facing. If the original horse was facing left, our network might be biased towards detecting horses only if they are facing left. By augmenting our data with a flipped, right-facing version of the image, we may be able to better detect images of horses beyond what we had in our data. #SLIDE The same basic principles apply for a range of other transformations - i.e., increasing the contrast and brightness of the input images, or cropping. During training, you can also apply any combination of these modifications, or others. The fundamental thing to keep in mind when applying augmentations is that you want the network to learn from examples like those you might expect to see in your test dataset, even though you won't be able to train on it! #SLIDE Transfer learning builds on the idea of image augmentation, using networks trained in one substantive domain (i.e., 'Farm Animals') as a starting point in another domain (i.e., 'Household Pets'). This not only helps with regularization (as your initial weights will be more general by definition), but also can help mitigate the need for absolutely massive datasets to train a network for a specific purpose. #SLIDE Imagine, for example, you're training a network on CIFAR-10. The input of this net - P - is the 3072 pixel values of each image; the output is s (our 10 scores, one for each class). Fitting this network would require 40,821,000 parameters; obviously, we want a lot of data to do this - i.e., we would want to use all of CIFAR-10. #SLIDE Now imagine that you have another database of images that have the same input dimensions (3072 pixel values), but represent two classes not present in CIFAR-10: desks and bathtubs. You want to be able to automatically classify whether an input image belongs to one of those two classes. Unfortunately, you only have 1000 examples of each, which is far too few to fit the nearly 41 million parameters in your network. So, what are you to do? #SLIDE In transfer learning, you first freeze all of the earlier layers of your network - in this example, you would save your network and retain all of the weights in the beta, alpha, and gamma layers. You would then delete the last layer of weights, and re-initialize it for your new case - taking in the same 100 values from the earlier layers, but outputting only 2 scores. This final layer would have unfrozen weights, in this case a total of 200 parameters.
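The slide's example is a small fully connected network, but the same freeze-and-replace-the-head pattern is easy to express in Keras. In this sketch, a Keras-provided ImageNet ResNet50 simply stands in for the frozen beta/alpha/gamma layers (so the layer sizes differ from the 100-to-2 example on the slide), and the new Dense layer plays the role of the unfrozen delta weights; the optimizer and loss choices are illustrative.

```python
from tensorflow import keras

# Pre-trained base standing in for the frozen earlier layers.
base = keras.applications.ResNet50(
    include_top=False, weights="imagenet",
    pooling="avg", input_shape=(224, 224, 3))
base.trainable = False                               # freeze every pre-trained layer

# New, unfrozen head: 2 outputs for the desks-vs-bathtubs problem.
model = keras.Sequential([
    base,
    keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Calling model.fit(...) now updates only the new head's weights.
```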
You would then fit this net, preventing any changes from occurring in the earlier layers, and seeking to fit only the 200 parameters in the delta weights to provide the two scores for your particular problem. If you have a larger amount of data available, you can unfreeze additional layers and train further back into the network. Transfer learning is very common across a huge range of domains - i.e., a network trained on ImageNet is very commonly used as a baseline, and then the deeper layers of the network are retrained for a specific case. #SLIDE Very practically, if you have fewer than 1 million images in your dataset, you may want to consider training on a larger dataset in a similar domain, and then applying a transfer learning approach. Most major software packages (Keras, PyTorch, etc.) provide easy mechanisms to load pre-trained weights. #SLIDE That's it for this lecture! To briefly summarize, in this lecture we covered strategies for setting or varying a learning rate, alternatives to linear approximations of the gradient which don't require learning rates, a range of strategies designed to reduce the difference between our training and validation accuracy, and finally we covered the concept of transfer learning. I look forward to next time!