==== Welcome back to DATA 442! In this lecture, we're going to be focusing heavily on optimization - how to use the loss functions to actually find the best set of weights for our models. ==== SLIDE You'll recall this slide from last lecture, which gives an overview of where we are today. We have covered the process of taking some arbitrary dataset (in this example, with 3 observations), using a function to predict which class each image belongs to in the dataset, and then calculating a loss function, or measure of how bad our model is. Optimization algorithms use this measure of badness to help us select better weights. To briefly note, this process is broadly referred to as supervised learning, as the model takes as input data that a human has labeled. ==== SLIDE Thus, our final goal is more like this - we want to find the "W" that ultimately minimizes our loss function, using the value of our total loss to tell our algorithm how good or bad a given model is. This is the core loop of training a machine learning model - when we talk about "training", this is the loop we are generally referring to. ==== SLIDE So, how do we take the measure of badness in the loss function and use it to find better weights, W? This is the process of optimization. The way that optimization is most commonly conceptualized is to use the metaphor of mountains, hills, and valleys. Imagine you're walking around in a mountain range and trying to find the bottom. Each time you take a step, you are generally going to be going up or down in elevation. It's generally easier to go downhill, so if you were naively trying to find the lowest point in a valley, you would just always take steps that resulted in you moving down until you can't go down any further. Our optimization algorithms are - very broadly - trying to do the same thing. They start at a random place on the landscape, but instead of an elevation, they have a loss function value. They want to move somewhere in the space that results in a lower loss value - instead of walking downhill, the algorithm adjusts the "W" parameters. Ultimately, if a new set of "W" parameters results in a lower loss, the algorithm knows it walked 'downhill'. For those of you taking this course with a deep background in mathematics or physics, you may be wondering why we don't just solve for the analytic solution - why bother with all of this optimization nonsense if you can solve for the minimum of a function directly? The challenge is the complexity of most functions we practically use - a neural network can easily exceed millions of parameters, inhibiting or precluding practical analytic solutions. ==== SLIDE We're going to go over a wide range of optimization techniques, but let's start with a silly one: random search. If you were to completely naively build a function that sought to determine the best weights, you could simply test random sets of weights over and over again, and then select the best one that you find. Obviously, this is... not a great idea. Not only is it arbitrarily expensive computationally (you have to choose how many iterations to run), but you're also phenomenally unlikely to identify a set of weights that is anywhere near what you could achieve with a more advanced optimization strategy.
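To make that concrete, here is a minimal sketch of what random search could look like in numpy-style Python for a linear classifier. The placeholder data, the multiclass SVM-style loss, and the specific sizes below are assumptions for illustration, not code from the slides.

import numpy as np

# Placeholder data standing in for flattened CIFAR-10 images: 100 examples,
# 3073 values each (3072 pixels plus a bias dimension), 10 classes.
X = np.random.randn(100, 3073)
y = np.random.randint(10, size=100)

def total_loss(W):
    # A multiclass SVM-style loss; any loss function would work here.
    scores = X.dot(W)
    correct = scores[np.arange(len(y)), y][:, None]
    margins = np.maximum(0, scores - correct + 1)
    margins[np.arange(len(y)), y] = 0
    return margins.sum() / len(y)

best_loss = float("inf")
best_W = None
for i in range(1000):                          # we have to pick how long to keep guessing
    W = np.random.randn(3073, 10) * 0.0001     # a completely fresh random guess
    loss = total_loss(W)
    if loss < best_loss:                       # keep whichever guess happens to be best so far
        best_loss, best_W = loss, W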
==== SLIDE So - again, imagine you're the individual trying to find the bottom of a valley. You could start randomly anywhere in the valley - and, that might be the bottom, but it probably won't be. However, from that random location you could figure out pretty quickly which way is downhill and which way is uphill, and begin taking steps in the downhill direction. Every time you take a step, you can look around and establish what the next-best step to take might be, based on which step would take you farther down. ==== SLIDE So, how do we implement this type of approach mathematically? Let's assume that all of you are going to be drinking a lot of coffee (or tea) in this course. One thing you might try to establish is the right temperature for that coffee - hot, cold, or just right. If you get it too cold, you suffer - visualized on our y-axis - as it's, well, gross. If you get it too hot, you burn your tongue. But right in the middle there is going to be that perfect temperature. We want to find it! ==== SLIDE We could do it randomly - but that would probably be pretty uncomfortable. Alternatively, we could do something called a "grid search", or exhaustive search. In this case, we subdivide our space into a finite number of tests that are equally spaced along the x-axis (temperature). We then pour ourselves a cup of coffee at every one of these temperatures, and measure how much suffering we go through. Then we pick the best temperature at the end - great, but our tongue is badly burned and we had to drink a LOT of coffee. ==== SLIDE So, instead we can do what the hiker did - start somewhere random, and then walk downhill. This process is called gradient descent. First, we drop *two* random points and pour ourselves coffee twice - once at each temperature. Once we do that, we can draw a straight line between those two points, as illustrated on the slide now. Once we know what that line is, we can calculate the slope - and walk down the slope! ===== SLIDE We then repeat this process until there is little or no slope - i.e., once the slope is 0, we hypothetically would be at the bottom of the valley, or - in this case - have minimized our suffering and found the best temperature. ====== SLIDE This is the same goal that we have in the context of loss functions - except instead of suffering, we're trying to minimize our total loss; instead of the right temperature for coffee, we're trying to find the best parameters W. In practice, we're doing this in multi-dimensional space with lots and lots of Ws! ========= NOTES So - let's formalize this a bit. You'll hopefully remember back to your calculus courses, and how to take derivatives. In one dimension (i.e., the coffee example, where we have just one value - temperature - we're trying to find), we can define the derivative of a function following the equation on the screen: \frac{df(x)}{dx} = \lim_{h\rightarrow 0}\frac{f(x+h) - f(x)}{h}. As a very brief refresher, the derivative at any point along a function can be understood as the slope of the line between that point and another point a distance "h" away, as "h" shrinks toward zero. ===== SLIDE Going back to our suffering example. Look at the empty dot - this dot represents when we drank a cup of coffee that was 45 degrees, and in that case we had a suffering of "2" on a 3 point scale. Let's say we want to calculate a derivative of the function at this point. Remember, we don't actually know the underlying curve, so in order to estimate the derivative we need at least one other point - we'll define that point as being a distance "h" away. In this case, we'll arbitrarily set h to 5 - i.e., we're going to calculate a derivative based on a second measurement that is 5 degrees away towards 0 (i.e., we'll subtract 5). So, we wait until the coffee cools to 40 degrees, and then pour ourselves another cup. We then measure the suffering - turns out it's a "2.5" on a 3 point scale! That gives us the shaded-in green dot. ===== SLIDE Thus, for this very simple example, we can estimate the derivative of our function as the value of our function at 45 (2.0) minus the value of our function at 40 (2.5); i.e., f(40+5) - f(40) in the equation at the upper-right would resolve to -.5. Our slope would then be -.5 / 5 (the distance we moved), or -.1. Thus, we would want to guess a temperature value that is higher - i.e., move in the direction opposite the sign of the slope - as higher values of temperature seem to indicate lower values of suffering.
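Here is a small sketch of that estimate in code. The suffering() function below is a made-up stand-in for the unknown curve on the slide (in real life we can only measure it one cup at a time), so the specific numbers are assumptions for illustration.

# Hypothetical stand-in for the unknown suffering-vs-temperature curve.
def suffering(temp):
    return ((temp - 60.0) / 20.0) ** 2          # pretend 60 degrees is the sweet spot

temp = 45.0                                     # our current guess
h = 5.0                                         # the small distance used to estimate the slope

# Finite estimate of the derivative: (f(x + h) - f(x)) / h
slope = (suffering(temp + h) - suffering(temp)) / h

# Walk "downhill": move in the direction opposite the sign of the slope.
step_size = 10.0
temp = temp - step_size * slope                 # a negative slope pushes temp higher

Repeating that update over and over is exactly the loop we will formalize as gradient descent below.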
===== SLIDE Now - in practice we don't just have one temperature X; instead, we have up to millions of weight parameters W. While not perfectly accurate, you can imagine calculating a slope like we did for temperature for every single one of those weight parameters - so you would end up with one slope for every single parameter in W. This vector of slopes is called the gradient, and the slope for each individual W is a partial derivative. This gradient is absolutely key for most machine learning, as it provides the information that allows us to update our weights in ways that are likely to result in a better prediction. In fact, most of the computational cost of deep learning specifically is in calculating gradients over and over - hundreds, thousands, or millions of times - to iteratively identify better and better weights across vast numbers of parameters. ===== SLIDE Ok - let's look at a more practical example. The simplest way you might imagine calculating a gradient is finite differences - on the top, you have an arbitrary weights vector, let's say for a linear model attempting to classify our CIFAR10 dataset. When you run the weights through, you would get a total loss of 1.25347. Remember, our goal is to find some set of weights that reduces this total loss. To do that, we are going to first define our gradient vector dW, which is going to be the same length as our W. Each element of this vector will tell us how much we could expect the total loss to change if we changed its associated weight by some small value. ====== SLIDE A finite differences approach would allow you to calculate this by changing one weight by a very small value - say, .0001. We would then recalculate the loss value for this new weights vector (using our loss function, classifier, etc.), in which only the first weight has been changed. ===== SLIDE In this hypothetical, let's say that the small change of .0001 in the first element of vector W resulted in a new loss, 1.25322 - slightly lower than our original loss of 1.25347. ===== SLIDE So, now we can use our limit definition just like before, and calculate a finite differences approximation of our gradient for this first dimension (i.e., our first W): (1.25322 - 1.25347) / .0001, or -2.5. In other words, we would expect a decrease in our loss function if the first weight increased. ===== SLIDE We would then need to repeat this process for each element of vector W, resetting all other elements back to their original values and adding .0001 to each in turn. So, in this example we would be calculating the gradient for the second dimension; the first weight (0.34) is reset back to its original value, and we add .0001 to the second weight (-1.11). We then repeat the process of calculating the loss, and use the finite differences approximation to compute the gradient for this second dimension.
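Pulling those steps together, a rough numpy sketch of this element-by-element procedure might look like the function below; the loss_fn argument is a stand-in for whatever classifier-plus-loss-function combination you are evaluating, not code from the slides.

import numpy as np

def numerical_gradient(loss_fn, W, h=0.0001):
    # Finite-differences estimate of dW: nudge one weight at a time by h.
    dW = np.zeros_like(W)
    base_loss = loss_fn(W)                      # loss at the original, unperturbed weights
    it = np.nditer(W, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        old_value = W[idx]
        W[idx] = old_value + h                  # change just this one weight
        dW[idx] = (loss_fn(W) - base_loss) / h  # slope along this single dimension
        W[idx] = old_value                      # reset it before moving on to the next weight
        it.iternext()
    return dW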
===== SLIDE We would then repeat that for every element, until we have solved for the full dW. Needless to say, this approach is generally not a very good idea. First, it is very, very, very computationally intensive - even computing your prediction function and the related loss for a single set of weights W can be quite slow, especially for large sets of weights like those you would find in a convolutional neural network. Further, we aren't just talking about ~9 or 10 repeats, one for each weight. We're talking about millions or tens of millions of weights in deep convolutional nets, so you would have to re-evaluate the loss millions or tens of millions of times. In some cases, we're even talking hundreds of millions! So, finite differences would simply take too long as a way to evaluate gradients. ===== SLIDE Instead, we can compute an analytic gradient - an exact, and much faster, approach. We know that the vector dW (i.e., our gradient) is actually just some function of our weights and data. So, if we can solve for this function, we can solve for our gradient in one step (instead of one step for every weight!). Thank goodness for calculus - this is multiple orders of magnitude faster, and it really enables us to fit these weights in a reasonable period of time. ===== SLIDE Alright, let's discuss how all of this math gets translated into our machine learning algorithms. Gradient descent is not only one of the most popular approaches to training our weights vector W, but also a pretty darn effective one for many problems. It's also quite simple - here, you can see a basic implementation of how a gradient descent algorithm can be written. Let's walk through this - first, we define W as a completely random set of weights - it doesn't matter where these start for this example, just know it's a vector of some random weights. We then define some maximum number of iterations - in this case, 1000 - and a basic counter. We then loop 1000 times, each time doing two things. First we calculate the gradient dW, just like we discussed in the last few slides. We save this gradient as W_gradient_dw. Then, we add the negative of this gradient, scaled by the step size, to our current weights vector; the reason for the negative sign is that we want to *minimize* our total loss. One more new concept to introduce here is the hyperparameter step size - this is sometimes referred to as a learning rate. This hyperparameter is extremely important in how well our weights can be fit - if you set it too high, you risk skipping over the best solutions; if you set it too low, training crawls along and you risk getting stuck in a small local "valley". We'll talk a bit more about these challenges later, but in practice getting your learning rate or step size right is one of the first things to check in many cases. ===== SLIDE In practice, that small snippet of code does some seriously heavy lifting. What you're looking at in these videos is a few examples of gradient descent. In each case, we have two parameters - W1 and W2. Additionally, we have a third dimension (the vertical axis), which shows the value of the loss function at any given point. So, mountains represent combinations of W1 and W2 where the loss function was high - i.e., our model was bad. The valleys are what we're after - the points where the model performs well. As you can see from these examples, gradient descent starts from an initial random location, and then works its way down to the solution - almost literally "rolling downhill".
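The exact snippet from the slide isn't reproduced in these notes, but a self-contained sketch in the same spirit might look like the following; the softmax loss, the placeholder random data, and the particular step size are assumptions for illustration rather than the slide's example.

import numpy as np

# Placeholder data standing in for flattened CIFAR-10 images and labels.
X = np.random.randn(100, 3073)
y = np.random.randint(10, size=100)

def loss_and_gradient(W):
    # Softmax (cross-entropy) loss for a linear classifier, plus its analytic gradient.
    scores = X.dot(W)
    scores -= scores.max(axis=1, keepdims=True)             # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(y)), y]).mean()
    dscores = probs
    dscores[np.arange(len(y)), y] -= 1
    dW = X.T.dot(dscores) / len(y)                           # the analytic gradient dW
    return loss, dW

step_size = 1e-3                        # the learning rate / step size hyperparameter
W = np.random.randn(3073, 10) * 0.0001  # start from a completely random set of weights

for i in range(1000):                   # maximum number of iterations
    loss, W_gradient_dw = loss_and_gradient(W)
    W = W - step_size * W_gradient_dw   # step against the gradient to reduce the loss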
===== SLIDE Gradient descent will form the backbone of most optimization techniques we use, but it has one big problem - if you have a very large training dataset, the gradient of the weights becomes very expensive to compute, because the loss is summed over every individual piece of data - i.e., ultimately you have to compute the gradient contribution of every single training example and then average them, which is quite costly at every single update. ====== SLIDE Because of this limitation, the most common implementation of gradient descent you will likely encounter is Stochastic Gradient Descent, or SGD. In SGD, a small batch of the data is taken at each step - i.e., in this example we would be randomly sampling 256 examples (a "minibatch") at each iteration; a minimal sketch of one such minibatch update is included after the recap below. Because we sample our data randomly, we expect the eventual solution we find to converge towards the true solution. While this replaces the actual gradient with an estimate of the gradient, the stochastic approximations you get with SGD are suitable for a very wide range of applications. ====== SLIDE That's it for this lecture - to recap, we covered the core concepts of optimization. First, we talked about our fundamental goal of reducing our loss function, and a few different - rather silly - ways to solve for the best weights vector (random guessing and exhaustive searches). Then we worked through how gradient descent works, how to compute the gradients it needs efficiently, and how to apply stochastic gradient descent. I hope you enjoyed the lecture, and will see you next time!
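As promised above, here is a minimal sketch of a single-minibatch version of the gradient descent loop; the batch size of 256 comes from the lecture example, while the data shapes and the softmax loss are placeholder assumptions.

import numpy as np

# Placeholder training set standing in for a larger dataset (e.g. CIFAR-10).
X = np.random.randn(5000, 3073)
y = np.random.randint(10, size=5000)

def loss_and_gradient(W, X_batch, y_batch):
    # Same softmax loss and analytic gradient as the earlier sketch, restricted to a minibatch.
    scores = X_batch.dot(W)
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(y_batch)), y_batch]).mean()
    dscores = probs
    dscores[np.arange(len(y_batch)), y_batch] -= 1
    return loss, X_batch.T.dot(dscores) / len(y_batch)

step_size = 1e-3
W = np.random.randn(3073, 10) * 0.0001

for i in range(1000):
    batch = np.random.choice(len(X), 256, replace=False)     # randomly sample 256 examples
    loss, dW = loss_and_gradient(W, X[batch], y[batch])      # gradient estimated on the minibatch only
    W = W - step_size * dW                                   # a noisy but cheap gradient step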