A quick note before I get started - part four of assignment 2 follows along with much of this lecture, so I suggest you engage with both at the same time! OK - now we're going to talk about the actual process of optimization. At this point, we've defined our network architecture - i.e., the number of layers, nodes, inputs, batch normalizations, activations - everything. We have a data processing pipeline - normally just a zero-mean shift for image data - and we've defined some strategy to initialize our weights (probably Xavier). That is - we're ready to click "go". So what do we do now? Chances are your network is going to fail to fit on your first try, so how do you diagnose the issues?
#SLIDE To recap - last lecture, we first discussed the most common approach to visualizing network structure. We also discussed the two different computations done within each node - aggregating the incoming values, and then feeding that aggregation forward into an activation function.
#SLIDE There are many different types of activation functions, with a range of pros and cons. Saturation (i.e., increases or decreases in your inputs no longer change the function's output) and a lack of zero centering (i.e., outputs are all unidirectional if inputs are unidirectional) can both pose major challenges for optimizing your network.
#SLIDE In most cases, these issues can be mitigated through three processes. The first of these is preprocessing - most commonly, zero centering our image data by subtracting the mean value of each band from every image.
#SLIDE Weight initialization strategies can also help by ensuring the distribution of weights is balanced with respect to the complexity of your network.
#SLIDE And, finally, we discussed batch normalization - a computational stage in the network that forces the outputs of a layer to a given distribution.
#SLIDE Now, we're going to move on to some practical pointers on how to actually optimize. We've already picked preprocessing and weight initialization strategies, and we're going to use ReLU activations for this example. Our network architecture is shown here as well - we have CIFAR-10 as an input (3072 pixels), 50 hidden nodes, and 10 outputs - one per CIFAR-10 class.
#SLIDE In code, we will implement this type of network in stages. First, we would have some function that defines our data preprocessing. In most cases, this reduces down to two things. First, we decide if we are going to zeroShift. The top of the code to the right is what does this - first, we calculate our mean image value, and then we subtract that value from the train and test inputs. Here, I also have a visualization element so you can see what's happening. Second, we have to choose if we are going to reshape our array. For non-convolutional approaches that require one long vector (i.e., the 3072-element vector on the left), we need to reshape. Later on, when we get to convolutions, we'll disable this option. The output of this code would be "P" in the figure to the left - i.e., our 3072 x 1 CIFAR arrays.
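To make this concrete, here is a minimal sketch of what that preprocessing step might look like in NumPy. This is not the assignment's exact code - the function name preprocess_cifar and its arguments (including the zeroShift and reshape flags) are illustrative - but it captures the two choices described above: zero-centering against the training-set mean and flattening each image into a 3072-element vector.

```python
import numpy as np

def preprocess_cifar(X_train, X_test, zeroShift=True, reshape=True):
    """Illustrative preprocessing: zero-center image data and optionally
    flatten each image to a long vector ("P" in the figure)."""
    X_train = X_train.astype(np.float64)
    X_test = X_test.astype(np.float64)

    if zeroShift:
        # Mean image computed on the training set, then subtracted from both splits
        mean_image = X_train.mean(axis=0)
        X_train -= mean_image
        X_test -= mean_image

    if reshape:
        # Flatten each 32x32x3 image into a single 3072-element vector
        X_train = X_train.reshape(X_train.shape[0], -1)
        X_test = X_test.reshape(X_test.shape[0], -1)

    return X_train, X_test
```

Note that the mean image is computed on the training set only and then applied to both splits - computing it on the test set would leak information.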
#SLIDE Next, we need to define our layer algorithm - i.e., something that takes in P and weights beta, and then outputs h. In a fully connected case, this would be equal to 3072 x 50 weights - i.e., every input P is connected to every hidden node h. The dot product between inputs P and weights W is what gives us our outputs h.
#SLIDE This type of fully connected layer is also referred to as an "Affine" layer. In code, we need to define two separate operations here. First, we need the outputs h when we do our forward pass; second, we need the gradients for the backward pass. Here is the forward pass, which is fairly straightforward - let's go line by line. In the portion above the red box, we're doing some reshaping so that our weights and inputs are as expected. In the red box is where the real action happens. First, we're taking the dot product between our x inputs (the 3072 pixels) and w weights (weights beta in this case). For the first time in this course, we're also adding the term "b" - a bias term which we've been omitting to simplify our examples. This bias term provides the network with more flexibility regarding systematic biases in your data - you can conceptualize it just like you would an intercept in a normal linear model. On the next line, we have our cache - this is a variable we are going to save, because we'll need this information to solve for the gradient on our backward pass. Finally, we return our output (in this case, a 50x1 hidden layer) and our cache of inputs.
#SLIDE Now, we take the outputs from the affine layer and pass them through the activation function we defined for hidden layer h. In this case we're going to use a ReLU, which just takes the max of each value and 0!
#SLIDE We then pass the outputs of that ReLU to our second affine layer. This is done in the exact same way - and with the same function. All we have to do is change our weight initializations to account for the fact that our inputs and outputs are different sizes (50 and 10).
#SLIDE The scores that are output - all 10 scores - are then fed into a loss function. Here we solve for the loss and for the gradient to send back as per the chain rule. Let's briefly walk through this - it's the same SVM loss function we've worked with throughout the course. First, we take the number of observations (N). We then identify, for each observation, the correct class score (i.e., if an observation was truly a Cat, we grab the score for "Cat" for that row). We then calculate the SVM loss function and save it into margin. This is corrected on the next line by setting all "y" cases to 0 - i.e., we don't calculate loss for the correct class. We take the sum of these for the total loss. Now, we turn to solving for the gradient of the loss with respect to changes in our scores s (remembering that those scores are the input into this function). Nothing new here - we build an empty matrix, set cases with positive margins to 1, and count the number of positive-margin cases to fill in the gradient for the correct class. Finally, we divide by the sample size and, voila! We have dx we can use to back-propagate through the network.
#SLIDE Now we're ready for our backward pass! The first pass back requires us to solve for the changes in our loss function when our weights W-alpha (and associated biases) are changed. The code for that is on the screen now - we first reshape our inputs in the same way as in our forward pass, and then we take our upstream gradient and solve for each of dx, dw, and db. Head back a few lectures if this looks like nonsense to you - you've seen all of this before in our lectures on gradients!
#SLIDE Now we head back again - this time we're solving for the change in the loss function given a change in our hidden node values. You'll recall we use a ReLU activation for these hidden nodes, and so we need a backward ReLU function. This is nearly as simple as the forward pass - we just pass the gradient straight on unless the incoming value was less than 0, in which case we pass on a 0.
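For reference, here is a compact sketch of the four layer functions described above - the affine forward and backward passes and the ReLU forward and backward passes. The names follow the affine_forward / relu_backward convention, but treat this as an illustrative NumPy version rather than the exact code on the slides.

```python
import numpy as np

def affine_forward(x, w, b):
    # Flatten each input to a row, then compute the affine transform x.w + b
    N = x.shape[0]
    x_flat = x.reshape(N, -1)            # e.g., (N, 3072)
    out = x_flat.dot(w) + b              # e.g., (N, 50) for the first hidden layer
    cache = (x, w, b)                    # saved for the backward pass
    return out, cache

def affine_backward(dout, cache):
    # dout is the upstream gradient; cache holds the forward-pass inputs
    x, w, b = cache
    N = x.shape[0]
    x_flat = x.reshape(N, -1)
    dx = dout.dot(w.T).reshape(x.shape)  # gradient w.r.t. inputs
    dw = x_flat.T.dot(dout)              # gradient w.r.t. weights
    db = dout.sum(axis=0)                # gradient w.r.t. biases
    return dx, dw, db

def relu_forward(x):
    # ReLU just takes the elementwise max of 0 and the input
    out = np.maximum(0, x)
    cache = x
    return out, cache

def relu_backward(dout, cache):
    # Pass the gradient straight through, except where the input was <= 0
    x = cache
    dx = dout * (x > 0)
    return dx
```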
#SLIDE Finally, we do one last propagation back through our weights in the first affine layer. And with that, we're done! We now have everything we would need to update our weights.
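To tie the pieces together, here is an illustrative sketch of one full training step - forward through both affine layers and the ReLU, the SVM loss, the backward pass, and a plain gradient-descent update. It reuses the layer functions from the sketch above; svm_loss, train_step, the weight names (W_beta for the first layer, W_alpha for the second), and the learning rate are my own illustrative choices, not the assignment's exact implementation.

```python
import numpy as np

def svm_loss(scores, y):
    # Illustrative multiclass SVM (hinge) loss and its gradient w.r.t. the scores
    N = scores.shape[0]
    correct = scores[np.arange(N), y].reshape(N, 1)   # score of the true class per row
    margins = np.maximum(0, scores - correct + 1.0)   # margin of 1
    margins[np.arange(N), y] = 0                      # no loss for the correct class
    loss = margins.sum() / N

    dscores = np.zeros_like(scores)
    dscores[margins > 0] = 1                          # gradient for positive-margin classes
    dscores[np.arange(N), y] = -(margins > 0).sum(axis=1)  # correct-class gradient
    dscores /= N
    return loss, dscores

def train_step(P, y, W_beta, b_beta, W_alpha, b_alpha, lr=1e-3):
    # Forward pass: affine -> ReLU -> affine -> SVM loss
    h, cache1 = affine_forward(P, W_beta, b_beta)
    a, cache_relu = relu_forward(h)
    scores, cache2 = affine_forward(a, W_alpha, b_alpha)
    loss, dscores = svm_loss(scores, y)

    # Backward pass, applying the chain rule layer by layer
    da, dW_alpha, db_alpha = affine_backward(dscores, cache2)
    dh = relu_backward(da, cache_relu)
    dP, dW_beta, db_beta = affine_backward(dh, cache1)

    # Simple gradient descent update on both layers' weights and biases
    W_beta -= lr * dW_beta
    b_beta -= lr * db_beta
    W_alpha -= lr * dW_alpha
    b_alpha -= lr * db_alpha
    return loss
```

In practice you would call train_step in a loop over mini-batches and watch the returned loss to confirm it is actually decreasing - exactly the kind of diagnostic that matters when a network fails to fit on the first try.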