In this lecture, we're going to be continuing our discussion about convolutional neural networks, going into a bit more depth on how neural networks are built. #SLIDE Last week we covered a number of the building blocks - we discussed what tensors are, filters, and activation layers. We chatted about strides and pooling approaches to convolution, and how the convolutional elements of a network are then propagated into the "fully connected" portion of the network, which outputs the final score estimates for each class. You'll recall there are four hyperparameters we specifically discussed must be chosen for convolutions - the number of filters, filter dimensions, stride, and zero padding. #SLIDE Ok - let's head back to our CIFAR-10 example. You'll recall that we first technically implement our image as 3 matrices, i.e., a tensor. We then pass some number of filters over that tensor, which will result in one activation surface for each filter. #SLIDE The entire process looks something like this. You start with your input image tensor, and define some number of filters. Then, based on your strides and zero padding, you convolve each filter over the tensor to create one activation surface per filter. In this example, the activation surface is slightly smaller than the input tensor - representing the number of valid convolutions possible with a 5x5x3 filter given the 32x32x3 tensor. Finally, we aggregate all of the output activation surfaces into a new tensor, with a depth equal to the number of filters. Of note here, the terms "activation surface", "activation map", and "activation layer" are used somewhat interchangeably to refer to these surfaces; you may also hear the term "activation tensor" used to refer to the collated group of all activation surfaces. #SLIDE Once we have our activation layers, we can then use them as inputs into other models instead of the image itself, to generate our final scores. 
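As a quick aside, the surface-size arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides; the helper name and the choice of six filters are mine:

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Number of valid filter positions along one spatial dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# CIFAR-10 example: a 32x32x3 tensor convolved with 5x5x3 filters,
# stride 1, no zero padding
side = conv_output_size(32, 5)          # each activation surface is 28x28
n_filters = 6                           # illustrative filter count
activation_tensor_shape = (side, side, n_filters)  # (28, 28, 6)
```

Adding zero padding of 2 with the same 5x5 filter would keep the output at 32x32, which is exactly why padding is often used to preserve spatial size.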
While this is a very simplified linear model, as we have discussed, activation layer information is most commonly passed forward into another neural network. #SLIDE When we're building our activation surfaces, the filters that we convolve across the tensor are made up of a set of dimensions and weights. If we blow up the filter, we would see something like this - a stack of matrices with numeric weights. These weights are used when convolving: the dot product between the weights and the image tensor values is taken as we convolve to generate the activation surfaces (you'll recall we did a deeper dive on this in our first lecture on convolutional nets). #SLIDE A few lectures ago, we also discussed the idea of optimization - i.e., identifying the best weights to minimize a loss function. We also chatted about backpropagation and the chain rule as approaches to calculating gradients. Specific flavors of optimization we've discussed have been Gradient Descent, Stochastic Gradient Descent, and Mini-batch Stochastic Gradient Descent. We also chatted about a few techniques that simply aren't used in practice - i.e., randomly guessing weights, or full searches of all values. #SLIDE Recall, for example, Mini-batch SGD. In this approach, we have a six-step process we follow to optimize our network. First, we sample our data, taking some number of observations equal to a hyperparameter batch size. We then take that data and run a forward propagation, and calculate the loss (i.e., how bad the current set of weights is). That loss is then used to backpropagate across the network to calculate the gradients of the weights with respect to the loss, and we update the weights using those gradients. We then repeat this process until we reach some hyperparameter-defined threshold - the number of loops, the change in loss over the last few iterations, or other related metrics. 
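The steps above can be sketched as a toy loop. This is a hedged illustration using a simple linear model with made-up data and hyperparameter values, not the network from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))               # 1000 observations, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)                              # initial weights
batch_size, lr = 32, 0.1                     # hyperparameters

for step in range(200):                      # repeat until a threshold (here: a fixed loop count)
    idx = rng.choice(len(X), size=batch_size, replace=False)  # 1. sample a batch
    Xb, yb = X[idx], y[idx]
    preds = Xb @ w                           # 2. forward propagation
    loss = np.mean((preds - yb) ** 2)        # 3. calculate the loss
    grad = 2 * Xb.T @ (preds - yb) / batch_size  # 4. gradient of loss w.r.t. weights
    w -= lr * grad                           # 5. update the weights
```

In a real network, step 4 would be the full backpropagation pass through the computational graph rather than this closed-form mean-squared-error gradient.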
#SLIDE There are a number of steps that you must go through to build and optimize a network - starting with network architecture, then optimization, and finally evaluation. We're going to start at the top - the choices you need to make regarding the overall architecture of your network, i.e., how to design your computational graph - and work our way through the entire process. #SLIDE First, let's start out with some fundamentals. What you're looking at now is the basic computational graph we've been using throughout this course - we have two inputs (a pixel value in red and a weight in teal), and one computational node (multiplication). #SLIDE A common practice in neural networks is to build a computational graph that looks like this - multiplications between weights and values fed into an additive function. #SLIDE This sum is then passed into some activation function (in this example, ReLU), which then passes the signal on. #SLIDE A more common way to visualize networks incorporates the weights into the outputs of each node. For example, here we have the same two data inputs - pixel 1 (p1) and pixel 2 (p2). We still have two weights - w1 and w2 - but now they are visualized along the path between nodes. #SLIDE Within this receiving computation node, you can visualize two different processes - the process used to aggregate data coming in, and then the activation function used to pass a signal on. The aggregation algorithm in neural networks is nearly always assumed to be the sum of the weighted inputs - i.e., in this example, w1*p1 + w2*p2. Frequently this aggregation algorithm will not be visualized, and is instead assumed. #SLIDE The more commonly explored process is the activation function - i.e., the function applied to the value the aggregation function calculates. There are many different activation functions, and we'll get into them in just a moment. Here, the output of that activation function is notated as o_1, and is the value that would be sent to the next node. 
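A minimal sketch of this single node in Python (the pixel and weight values here are made up for illustration, and ReLU stands in for the generic activation function f):

```python
def relu(x):
    """ReLU activation: pass positive values through, zero otherwise."""
    return max(x, 0.0)

p1, p2 = 0.5, -1.0     # incoming pixel values
w1, w2 = 0.8, 0.3      # weights along the paths into the node

s = w1 * p1 + w2 * p2  # aggregation: the weighted sum of inputs
o_1 = relu(s)          # activation function applied to that sum
```

The value o_1 is what gets weighted again (by w3) on its way to the next node.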
Also, note w3 - just like the other two inputs, another weight is applied to any outputs from this node. #SLIDE This can also be written like this - i.e., the output of this computation is equal to some activation function f applied to the weighted sum of inputs to the node. So, how do we define this activation function? What are the pros and cons of different approaches? Ultimately, when we code our networks, we will have to write some function that takes in the weighted values from earlier in our computational graph, does something, and returns a value to pass on to the next nodes. Here is a very, very simple function - i.e., one that simply returns the input value. In this example, if the activation function f were the activationFunction defined in the dummy Python code, only the raw values themselves would be passed throughout the network. #SLIDE There has been a tremendous amount of research into the best-performing activation functions for different purposes. I've discussed ReLU a few times before, which you can see here at the lower-left. All ReLU does is take in the weighted value, and then return the maximum of either that value or 0. In code, this would be as simple as... #SLIDE This short function. So - why would you pick a given activation function? Let's chat about a number of the most popular functions, and the various reasons to use one over another. #SLIDE First, we have the sigmoid activation function - we've already discussed this in this course in the context of loss functions. The sigmoid activation has two key features as it pertains to neural networks. First, all values are rescaled to a range between 0 and 1, irrespective of how large the inputs are. Second - and this is why sigmoid has been used historically - it roughly approximates how human neurons work, i.e., the neuron "fires" only once a certain value is reached. 
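In the same dummy-function style as before, a sigmoid activation might look like this (a sketch, not code from the course materials):

```python
import math

def sigmoid(x):
    """Sigmoid activation: squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# All outputs land between 0 and 1, no matter how extreme the input:
sigmoid(-100)  # very close to 0
sigmoid(0)     # exactly 0.5
sigmoid(100)   # very close to 1
```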
#SLIDE Sigmoid activation functions have two challenges that you should be aware of, related to gradient decay and zero centering. Let's dig into the challenge of gradient decay first. #SLIDE Imagine for a moment that we are trying to solve for the gradients of weight 1 and weight 2. We just initialized our model, and they both randomly received a weight of 10. Now, look at the sigmoid function - if both weights changed from 10 to 9 (i.e., a decrease of 1), what would the change in the function be? Pause for a moment if you want to think about this. #SLIDE The answer: nothing! A change from 10 to 9 in the incoming values would make no difference - i.e., the output of the sigmoid would remain (very, very nearly) identical. You would need to change your weights to something around 0.25 in this example before you would start to see meaningful gradients. This is due to the nature of the sigmoid function - for inputs greater than around 4 or less than around -4, the output simply doesn't change very much (as you can see in the graph). So, if you have sigmoid functions, they can effectively decay your gradient to 0, which can be problematic for finding optimal solutions. #SLIDE The second issue with sigmoid is that it is not zero centered - because every sigmoid output falls between 0 and 1, every value passed on to the next layer is positive. While this sounds fine in principle, in practice it causes some big problems with gradient solutions. The most important of these arises when all inputs to a neuron are positive (or all negative). In these cases, the gradients on that neuron's weights will be all positive or all negative - there is no mixing. So, imagine you're trying to optimize two weights (w1 and w2 in our example on the slide). The best answer is when w1 = 1000, and w2 = -1000. 
Sigmoid activation functions would be very, very inefficient at finding such a solution, because at each iteration they can only adjust both weights in a positive direction or both in a negative direction - no mixing allowed. #SLIDE Tanh has some of the same benefits as sigmoid - specifically, it has a functional form that roughly approximates how neurons work (with a rapid increase at a certain "activation" value). The big difference is that its outputs are between -1 and positive 1; i.e., it's zero centered. So, the problems a sigmoid function has when all inputs are positive or negative do not apply to tanh. However, gradient decay due to saturation is still an issue, just as in the sigmoid case. #SLIDE Next, let's chat about what is probably the most common activation function across models today - the Rectified Linear Unit, or ReLU. ReLU has been shown to be exceptionally powerful for a wide range of uses, and is commonly employed in networks today. One of the biggest reasons for this is that it is computationally exceptionally simple - taking a max is a nearly instantaneous operation - and thus it can be used to fit very deep networks with less compute power than might otherwise be required. However, it does have some familiar drawbacks - it is not zero centered, which can lead to inefficiencies, and it saturates for values below 0. Across networks built primarily on ReLU nodes, the issue of "dead ReLUs" is prominent, as entire regions of a network can stop updating due to the negative saturation. Some of the most efficient networks we have identified still have around 20% of nodes as "dead ReLUs". This doesn't mean the network is bad, but it does indicate large inefficiencies in model architecture and concomitant optimization. Of note, because ReLUs have been shown to be powerful and are still widely used, one common approach is to initialize network weights as weight + 0.01, rather than simply random weights. 
This positive direction bias may help mitigate the percentage of neurons that end up "dead". #SLIDE There have been other approaches, however, to try and mitigate the issue of ReLU death. One of these is called a "Leaky ReLU". This function simply takes the max of either 0.01*X or X. So, in the case of a negative value, instead of 0, the output is 0.01X - something close to 0. This has a number of benefits, but the most important is that there is no saturation - i.e., a change in X will always result in a change in the output. It's also still very computationally efficient, and retains our approximation of "firing" neurons. #SLIDE The Parametric Rectifier - PReLU - is a generalization of the Leaky ReLU. Here, instead of hard-coding 0.01 in the max equation, we replace it with a parameter (alpha on this slide). Because it is parameterized, it can effectively be treated as a weight and optimized as a part of the backpropagation procedure. A PReLU with an alpha of 0.01 would be identical to a Leaky ReLU. #SLIDE Next up - the Exponential Linear Unit (ELU) function. Here, the primary goal is to create a function which has many of ReLU's qualities, but is also closer to zero centered. It also uses the parameter alpha, which can be fit during backpropagation. In this function, you'll note that saturation also happens fairly slowly - i.e., if we compare to the last few cases (TOGGLE BACK TO RELU AND COMPARE), you can see that while saturation might be an issue, it is less likely in the ELU case (however, it is still there!). It is also much closer to zero centered, but not *actually* zero centered, which has a number of benefits; however, all of these benefits come at a relatively high computational cost due to the use of an exponent. #SLIDE Ok! You've picked an activation function; now it's time to think about how you want to manipulate your data inputs (i.e., data preprocessing). 
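The three ReLU variants just discussed can be sketched as follows - a hedged illustration, with the specific alpha values chosen only as examples:

```python
import math

def leaky_relu(x):
    """Leaky ReLU: negative inputs leak through at 0.01x instead of flattening to 0."""
    return max(0.01 * x, x)

def prelu(x, alpha):
    """PReLU: like Leaky ReLU, but the slope alpha is a learnable parameter."""
    return max(alpha * x, x)

def elu(x, alpha=1.0):
    """ELU: identity for positive inputs; saturates only slowly toward -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

leaky_relu(-10.0)   # -0.1 rather than 0: no flat region, so no dead neurons
prelu(-10.0, 0.01)  # identical to the Leaky ReLU when alpha = 0.01
```

Note the exponential in elu - that call to math.exp is exactly the extra computational cost mentioned above.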
Data preprocessing is the set of steps you will take to manipulate your inputs - in our case, images, but this is generally applicable to any problem. Here, you want to consider the type of activation function you're using, as different activation functions will perform better (or worse) depending on the strategy you choose for standardization. Some fairly straightforward manipulations are things like zero-centering your data (i.e., subtracting the mean) or dividing by the standard deviation of your observations. Zero centering in particular is very important, as... #SLIDE You'll recall that activation functions that are not zero centered will return only positive or negative - not both - gradients if all inputs are positive or negative. So, if you zero-center your data, it makes it much less likely that all inputs will be in one direction, mitigating this limitation of many functions. In practice, you'll generally want to apply zero centering in most cases, as it won't hurt and it can help with activation functions that are not zero centered; more complex approaches (i.e., whitening, PCA, standardization) are only necessary if you are working with data that has issues with outliers or very large changes in magnitude, which is relatively uncommon in the image processing space. #SLIDE The next choice we have to consider is how we want to initialize our weights - i.e., what is the first set of weights we want to guess? Consider the network visualized on the right - here, we have some input image - say, the letter A - and we're going to pass that image data through the network to predict 26 scores. If all of the weights were the same across the entire network, what would the output scores be? Pause here if you want to think about it. If you guessed "the same" - you're right! All of the scores would be identical if all of the weights in the network are identical. 
And if all scores are equal, then the backpropagation will be the same across the entire network, so all updates will be identical, and you'll never have a network that improves! #SLIDE So, what do we do? In your assignments to date, you've done something like this - initializing all weights as very small random numbers. This does prevent the problem of initializing all weights with the same value, and will work well for relatively small networks. However, as we go deeper, this approach will begin to fail. Very intuitively, imagine your first iteration with randomly generated, small weights. At every layer of the network you're going to multiply a small number by a small number - with even 10 layers, you'll end up with a number that is around .00000000000000000... something very, very small. Thus, the gradients you send back through the network will be similarly small, essentially resulting in almost no change. #SLIDE Recognizing that small numbers don't work because you essentially zero out your gradients, the next obvious idea would be to use... big numbers! This has a new problem, though - think about what we talked about when it comes to ReLU, sigmoid, and tanh functions... #SLIDE Using tanh as an example, as numbers get big, tanh will saturate - i.e., all outputs will be equal to 1 or -1. So, you end up with the same problem - all the scores are the same, the gradient is 0, and you're out of luck during backpropagation! #SLIDE Bottom line is - weight initialization is a problem that is frequently overlooked, and it's really tricky to get right. If you go too big, you saturate your activations; too small, and you zero out your gradients. Thankfully, we aren't the only ones that have run into this, and the best and brightest have come up with a number of approaches to get it right. Here, we're just going to focus on one - Xavier initialization. #SLIDE Xavier initialization came around in 2010, and was designed for linear activation functions. 
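In code, the idea can be sketched like this - a hedged NumPy illustration with made-up layer sizes, shown alongside the He et al. variant for ReLU that comes up in a moment:

```python
import numpy as np

# fan_in = number of incoming connections to the layer; shapes are illustrative
fan_in, fan_out = 512, 256
rng = np.random.default_rng(0)

# Xavier (Glorot, 2010): scale so the weight variance is ~ 1 / fan_in
W_xavier = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

# He et al. (2015): divide fan_in by 2 to account for ReLU zeroing half the nodes
W_he = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in / 2)
```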
I definitely recommend digging into Xavier Glorot's 2010 piece on this if you want to see exactly how and why this approach works, but in practice the fundamental idea is that, depending on the total complexity of your network - i.e., the number of inputs and outputs - you need to adjust your weights so as to ensure that the variance of your initialized weights is equal to 1 over the number of incoming connections. So, in practice, solving for W looks like... #SLIDE This. However, this approach is still limited when non-linear activation functions are used to define the network. In 2015, He et al. recognized that this was largely because, in the ReLU paradigm, many of the nodes are getting saturated (half, to be precise), and so to fix that implemented... #SLIDE This small change - dividing the number of incoming connections by 2 in the denominator, to account for node saturation in ReLU. Again - these choices have huge implications, and can be the difference between a good and a bad fit for your network (or being able to solve it at all!). #SLIDE Weight initialization is key, but there are other approaches to ensuring resilience of the model to saturation and gradient decay. One of the most popular of these is batch normalization. The idea of batch normalization is fairly simple: when you're optimizing a network, you're generally selecting some number of images - i.e., your batch size - to feed through the network. #SLIDE Let's say our hypothetical network looks like this - all we are doing is taking the first two pixels in both images, and passing them forward into a single computational node with weights. The activation function we choose is tanh, which is susceptible to saturation. Normally, we would do our forward and backward passes one image at a time - i.e., the forward pass for the truck can be done independently of the forward pass of the boat. #SLIDE In batch normalization, we instead run both the truck and the boat at the same time, and record the input value for each node. 
Importantly, no activation function is applied before batch normalization - i.e., the weighted sums of the inputs are passed directly to the batch normalization (...most of the time). #SLIDE What batch normalization does is take in all of these values and simply normalize them in a zero-centered way - i.e., using standard deviations. This guarantees that there will be some negative and some positive values, and that a certain distribution (most commonly Gaussian) is followed. By forcing the input weighted values to follow this distribution, batch normalization can then pass the values on to other activations... #SLIDE just like before, except now the risk of saturation is far lower, and gradient decay is less likely to matter as well (as the standardization will re-inflate your values). I'm not going to go into the details of exactly how the distribution of values is normalized, but on the website you'll find a link to Ioffe and Szegedy, 2015; I highly recommend reading through it to get a depth of understanding of why these batch approaches work. Also, briefly note that there are many different approaches to normalization, including some that allow you to re-scale values and recover the identity mapping of your inputs (and everything in between). This range of approaches allows you to enable *some* saturation in your network, which can be valuable in some cases. #SLIDE Alright - to briefly recap where we are now in the process of building a practical neural network. First, we defined our architecture - our nodes, inputs, outputs, whether we're using batch normalization, activations - the whole thing. We know what our data preprocessing will look like - probably just a zero-mean shift for image data. And, we've defined how we're going to initialize our weights - probably a Xavier initialization. 
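The core normalization step from a few moments ago can be sketched in a few lines (the pre-activation values are made up; a full implementation, per Ioffe and Szegedy, also adds the learnable re-scaling parameters mentioned above):

```python
import numpy as np

# Pre-activation (weighted-sum) values for one node across the whole batch,
# e.g. truck, boat, and two other images - values are illustrative
batch_pre_activations = np.array([4.2, -7.9, 6.1, 5.5])

mean = batch_pre_activations.mean()
std = batch_pre_activations.std()
normalized = (batch_pre_activations - mean) / (std + 1e-8)  # zero mean, unit variance

# The normalized values now pass into tanh with far less risk of saturation
outputs = np.tanh(normalized)
```

Without the normalization, raw inputs like 6.1 would land deep in tanh's saturated region; after it, every value sits near the steep middle of the curve.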
With all of that, we're ready to click "go" - i.e., we can start our network and see how it performs, and start the debugging process - which we'll start to cover next lecture!
#####
\sigma(x) = \frac{1}{1 + e^{-x}} \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}