=== SLIDE 1
Welcome back to DATA 442.  Today, we're going to be discussing many of the basic building blocks of neural networks.

=== SLIDE 
First, let's briefly remind ourselves where we are.  Last week we covered a few key things - starting with the idea of the semantic gap, or that the ways humans describe an image - for example, Bird - and the ways a computer describes an image - that is, a matrix of numbers - can be challenging to overcome.  

=== SLIDE
Much of the challenge is in identifying invariant features across objects that belong to the same class, even if those objects are slightly different. So, a beak remains a beak irrespective of the bird that has it - and a computer can use these invariant features to identify the correct class for an image.

=== SLIDE
This intra-class difference challenge is just one of many in the challenge of computer vision.  You'll recall that we covered a number of different factors that can challenge our ability to cross the semantic gap, as small changes in images to our human eye can result in very large differences in the numeric matrices representing these images.  

=== SLIDE
These include viewpoint, or the idea that a bird remains a bird irrespective of how you're looking at it.  The example we gave last time was of multiple tourists taking pictures of the Eifel Tower - the tower doesn't change, but the angle of the image does.

=== SLIDE
Next up, background can cause the object we're trying to recognize to be more or less distinguishable.  While a computer may not see the colors in the same way as us, the matrix would have less variance if all of the colors are similar.  This has the potential to challenge our ability to recognize important, invariant features - like a beak.

=== SLIDE
Lighting can similarly make a bird appear different; in this lighting image you also get a bit of viewpoint, as the image has very little evident wing (but still a beak!).  

===SLIDE
And then we have deformation - or our favorite cat activity; a cat in a bowl is still, unfortunately, a cat. 

=== SLIDE
And good old occlusion - when we can only see a small part of an object, but our human mind can quickly classify it.  How do we train a computer to overcome all of these challenges?

=== SLIDE
You'll recall that during our last lecture we talked about the K-Nearest Neighbor classifier as a simple example algorithm that gives you an idea of how we approach training a computer using large sets of data.  We walked through a detailed example of text identificaiton, and then a real-world example using the CIFAR dataset.  

=== SLIDE
We also talked about different hyperparameter selection strategies, and the idea of breaking your dataset into training, validation, and testing data to ensure that you get meaningful tests of the external validity of your modeling approach.

=== SLIDE
Today, we're going to move on to linear classification techniques.  This will be one of the most important building blocks for neural nets, so pay attention!  We'll first cover the difference between parametric and non parametric models in the context of computer vision, and introduce the basics of linear classifiers.  Second, we'll look at a few examples and discuss interpretation.  Third we'll look into the (many!) limiations of linear classifiers, and then we'll segway into a discussion on optimization of linear funnctions.

=== SLIDE
As a motivating example, we'll return to the CIFAR10 dataset.  Another set of random examples is shown on the slide, generated from the lab code (you'll be making images just like this).  Remember the goal we have - given a new image, we want to be able to identify which of these classes (plane, car, bird, cat) it belongs to.  The KNN approach was simple - record everything, and then compare a new image to everything and select the most similar one.  As you'll see in lab 1, this is a very costly excersize, and can be slow for even small datasets.  

=== SLIDE
The nearest neighbor approach can generally be considered as "Nonparametric" - that is, when you pass information into the algorithm, you're only passing the image.  The algorithm then goes through all the images in it's database, and gives probabilities that a given image belongs to a given class (with larger values indicating a higher probability of class inclusion).

=== SLIDE
A parametric model is one where we don't just pass the image - rather, we're going to take in the image and some vector of parameters, generally referred to as Weights (represented by W).  Our computer code will now be able to take into consideration two things - the image itself, as well as the weights we provide.  These weights can then adjust the probabilities that we predict.  Remember for a minute the KNN classifier - in that case, we keep all of the training data in memory to compare to.  The goal of a parametric model is to summarize all of that training data in our weights parameters.  The problem then becomes how to best summarize information in W!  If we can do that well, we can get rid of the training data and only provide W to the algorithm.  If we can do a really good job, and make W really small, we could hypothetically even enable small computers - like cellphones - to run our algorithms.

=== SLIDE
Linear Classifiers are a very simple example of a parametric approach.  In this case, all we do is read in the information in the image (remember, it's just a matrix of numbers!), multiple it by weights, and that produces our probabilistic estimates. 

=== SLIDE
Let's break this down a bit, though.  Here is a real example of a bird from CIFAR 10.  It's very blurry, as the resolution of these images is very low - 32 pixels by 32 pixels.  That results in 1024 pixels of information describing "bird" - but, 

=== SLIDE
that's only in one color.  Nearly all images we will work with will have at least 3 colors - red, green and blue - and sometimes more.  So, ultimately the challenge in this case boils down to how to take in 3,072 values, and use those values to predict what's in the image.

=== SLIDE
So, let's go back to our equation.  When we look at this linear function, the first input is our image.  In practice, this is a vector of length 3,072, where each value represents a pixel in one of our three color bands.  In python, this would be represented as a numpy array - or a 1 dimensional list.

=== SLIDE
W - our weights - are represented by a 10 x 3072 matrix.  You can imagine a matrix with 10 columns and 3072 rows.  The first column might represent "Bird"; each row in that column would be the parameter you multiply by a given pixel value.  When you sum all of those up, it gives you your overall probability the image is a bird.  You repeat that for the other 9 matrix columns.

=== SLIDE
Let's think about this using a simplified example.  Imagine if CIFAR 10 was even worse - it was a 2 x 2 image, instead of a 32 x 32 image.  We've also reduced it to a single greyscale - i.e., there is only 1 band, so no blue, green or red.  So, we now have four total pixels!  Practically, this would be too little data to distinguish between different classes, but this simplified example will help us walk through linear models.

=== SLIDE
Let's say we want to solve for what this image is (i.e., a bird or dog), and someone has already solved for and provided us with the Weights for three classes - cats, birds and planes.  First, we would take all the pixels and translate them into a single array.  This is our input "image" in our function.

=== SLIDE
Second, we have our weights matrix.  In this case, we have three classes we want to choose between - cats, birds and planes.  Each of these three classes has a weight for each of our four pixels.  

=== SLIDE
Now, we can calculate the probability that this image belongs in each class.  For example, the cat score will be the inner product between the image pixels and the first row of the weights matrix.  This is sometimes refered to as the dot product.  In this example, you can see the number 56 is highlighted - this represents the greyscale pixel value in the upper left of the image of the bird.  We multiply this by the weight for this pixel for cat - 

=== SLIDE
in this case, 0.2.  We then repeat for each pixel-weight combination, and the sum of these values represents our final weight for the cat class.

=== SLIDE
We repeat this for each of our three clases - and, in this case you can see the bird score would be the highest, and thus the class we would choose. 

=== SLIDE
One neat thing about linear classifiers is our ability to understand what the algorithm is matching against.  Just like we could stretch the bird image out, we can also go the other way - transforming our weights matrices into images. Think briefly about what the numbers in the weights matrix mean - the bigger the value (above 0), the more weight it gives to a class.  So, if the upper-left hand pixel in the cats weights matrix has a large positive value, that means that images with large values in that pixel will be more likely to be a cat; and vice-versa for negative weights.  This means that the visuals of these weights matrices can be loosely interpreted as the generic "average" object the algorithm is comparing each image against.

=== SLIDE
Here is an example of visualizing those averages, generated for each class of CIFAR.  This shows some of the advantages and limitations of linear classifiers.  The biggest challenge with a linear classifier is that it is reliant on one, single "average" - that is, all "cars" are compared to the image of the car you see on the screen now; if they don't look like that car, then they likely would not be classified as a car.  The deer example is another good one - the average deer tends to be on a field of green (or maybe a forest.)  Of course, deer are not always against a green background.  In other cases, it would be harder pressed to identify an object as a deer.  My favorite example is the horse - here, if you look closely you can see the horse appears to have two heads.  This is because there are some number of images with the horse facing right, and some number to the left; thus, the average composite tends to be a two-headed horse.  This is reflective of the broader challenge of using linear classifiers for image recognition: they just aren't very good at dealing with heterogeneity.  If you have lots of different types of cars, planes, cats, deer, and they're against different backgrounds - as you would find in the real world - a linear classifier is inherently limited in it's ability to discriminate them.

=== SLIDE
Let's take a step back for a second, though, and talk about the weight parameters.  Up until now, we've talked about how they operate, but not how you can solve for what the weight parameters should be.  You'll recall that in the CIFAR case, each class (i.e., Bird) required 3,072 weights - i.e., one weight for every pixel of information in the red, green and blue bands.  We want pixels more reflective of a bird to have higher weights, and those less reflective to have a lower weight (or, even negative). 

=== SLIDE
So, how do you select the best set of weights?  There are two core concepts in machine learning that we build on to try and pick the best weights to achieve the highest accuracy classifications we can.  The first of these is the Loss Function.  Essentially, the loss function is a quantitative attempt to capture how bad a set of weights are - for example, we might consider a set of weights bad if they don't result in findings that reflect our human interpretation of an image.  As a quick case, here is our bird from CIFAR - if our weights resulted in the scores on the slide, we would ultimately predict this bird was a Plane.  Obviously, we want weights that won't make errors like that!  However, we have to teach an algorithm some way to judge "good" from "bad", and then teach it a way to select weights that are "good".  The loss function is the way the algorithm judges, and...

=== SLIDE
The optimzation strategy is how the algorithm selects weights to test.  Imagine if you are trying to find the 3,072 weights that best discriminate between "birds" and "not birds" for CIFAR.  You can set those 3,072 weights to just about anything you want - so, how do you even pick where to begin, and what parameters to change to improve your estimate?  That is what the optimization function is all about. 

=== SLIDE
Alright - so, first let's dig a bit into the loss function.  The images you see on the lower right are going to serve as our examples, with the scores above them providing a hypothetical to help us work through how this all works.  Briefly, take the first column, which starts with a 3.2.  The 3.2 represents the score a hypothetical linear classifier gave to "Cat" for the image of a cat, given a certain set of input Weights, W.  The 5.1 is the same score for Car, and -1.7 for Frog.  So, this particular classifier didn't do very well in this case - we would predict that this image is an image of a car.  In the second case, we do better - the Car is correctly predicted to be a car.  In the third case, we do very poorly - the Frog is not only not correctly classified, but the score for frog is by far the worst of the three options.  As humans, we can look at this classifier and recognize it isn't very good - the only thing it got right was Car, and it missed both Cat and Frog by a fair margin.  The loss function is all about how to quantify that human impression in a way a computer can use.

=== SLIDE
First, just as a reminder so the rest of our notation makes sense, remember that each of the scores in this table are the product of some linear function, where we put the image in, and then take the dot product of it and a vector of weights, represented by W.  We want to teach the algorithm that this set of weights (W) was bad - and, that's what the loss function is attempting to do.

=== SLIDE
Let's take our three examples here and call it our dataset - we want to calculate a loss function for how poorly our weights did for this set of images.  To do that, we will define our dataset as being made up of 3 pairs of Xi and Yi, where X is the image data, y is the label of that image, and i is the index (from 0 to 2, in this case, representing the three images). So, x1, y1 would be equal to the image of the cat labeled as a cat, and so on.

=== SLIDE
Given this dataset, we can then calculate a total loss for a given set of weights.  This equation is fairly straight forward - 

=== SLIDE
first, we're taking as input an image (i.e., image x_1 would be the cat), and passing our weights into our linear function to generate a set of scores.  The maximum score would represent our predicted class.

=== SLIDE
We then compare this to our true / known class label, represented by y_i.  For example, we might say that if yi is not equal to the predicted class, then the loss is 900.  Or 10. Or 20.  It's our choice!  

=== SLIDE
That choice is the loss function itself - i.e., how you decide to measure "wrongness" in your algorithm.  There are many different loss functions designed for different purposes, and we'll cover a few later in this lecture.

=== SLIDE
Finally, total loss - i.e., the loss for one set of weights - is the average of the loss function across all images (in this case).  You can write more complicated total loss functions that might - for example - bias your results towards accurately classifying cats.  This notion of a loss function is highly generic across machine learning - writ large, we are passing in some data we call "X", and using it to predict "Y", and the loss function captures how well our algorithm performs.  In our case, X happens to be images with multiple bands of pixels, and Y is a label.  Ultimately, we want to find the set of weights "W" that minimizes our selected loss function.

=== SLIDE
Let's work through one example of loss - a multiclass SVM loss.  We're going to go through this equation slowly!  What you're looking at here is the loss function - *not* the total loss - i.e., this would be how we calculate loss for any single image.  As noted on the previous slide, we could take the average of all images loss functions to generate our total loss.  

=== SLIDE
First, let's talk about what we're summing over.  Imagine we want to quantify how "bad" our algorithm is at predicting if our first image - the image of a cat - is actually a cat.  A multiclass SVM loss function operates by iterating over all possible things we *could* have called the image, denoted by a capital J.  Lowecase j is the index for ever class.  So, for example, when lowercase j equals 1, it would be cat.  When lowercase j equals 2, it would be car; 3 would be frog.  Importantly, we are going to sum over every *incorrect* case - when we say "j does not equal y_i", we mean to sum every case when the class is not equal to the true class.  In the case of the Cat, this means we would solve and sum twice - once for Car and once for Frog.

=== SLIDE
Second, let's look at what we're doing with each class.  First, let's focus on s_j - s_yi.  Here, we're subtracting the score of the correct class from the score of the incorrect class.  So, in the case of our Cat, s_2 would be 5.1 - the score we gave to "Car".  We would subtract 3.2 - the cat score - from 5.1, which would give us 1.9.  Skipping epsilon for the moment, that gives us max(0, 1.9), which resolves to 1.9.  

=== SLIDE
We then repeat this for every class - but, rather than walk through that in notation, I'm going to expand on our example.  Before we go there, though, I want to talk a little bit about the Epsilon term.

=== SLIDE
The fundamental idea of support vector machines is that we want to make sure we're as sure as we can be about our estimates - i.e., it's not just enough to classify correctly, but when we're correct we want our algorithm to be really sure.  This is reflected in the scores - for example, take the scores for Car.  The 4.9 for "Car" is way above the 2.0 for Frog or 1.3 for Cat; we like that.  The Multiclass SVM Loss includes the Epsilon term to push the weights vectors we identify towards solutions with these more concrete delineations.  A higher epsilon is a more stringent test - essentially, we're going to be more likely to punish correct cases if they aren't confident in their scores.  Also of note is the figure at the lower-left, which illustrates two different loss functions.  In green is an example of a binary loss, where it's either wrong (all values below 0) or right (all values above 0).  In a hinge loss - what the multiclass SVM loss is - you don't reduce down to 0 loss until 1.0; this is reflective of the epsilon penalty term.  Predictions in the triangle between the green line and blue line would all technically be correct, but still incur a loss because the confidence of the algorithm is not high.  We'll explore this term more in just a minute as we walk through the full example.

=== SLIDE
Let's walk through our example now, which will use these three cases.  First, we'll comput the loss for our estimate of Cat.  Remember, we got Cat wrong - we predicted Car.  First, we take the maximum of either 0 or 5.1 (our Car score) minus 3.2 (our cat score), plus 1 (our epsilon term - more on how to choose this later).  This resolves to max(0, 2.9), or 2.9.

=== SLIDE
We then repeat this process again, this time for Frog.  Remember, we only look at the INCORRECT classes, skipping the "true" case of cat.  For frog, we have the score of -1.7 minus 3.2, plus 1.  This resolves to max(0, -3.9), or 0.  

=== SLIDE
So, our two cases for the first image resolved to 2.9 and 0, respectively.  This makes intuitive sense - the Car guess was wrong (i.e., 5.1 is bigger than 3.2, and 3.2 is the cat score!); thus, we penalize the algorithm for confidently stating that "Car" was the correct class.  Covnersely, in the case of the Frog the model got it right, and by a fair margin - and so there is no penalty (the loss function is 0).  Adding these two values together gives us 2.9, which would be the loss for X_1.

=== SLIDE
Now let's take the example of the car, or image X_2.  This time, we compare to Cat and Frog (the two incorrect classes).  Note that the algorithm was very confident that "Car" was the correct class - as indicated by the high score of 4.9.  This is reflected in our loss function - both cases resolve to 0, for a total loss of 0 in this case.  Good job, algorithm!

=== SLIDE
Finally, we get to the frog.  Remember the frog was very badly missclassified - by far the lowest score - and so we expect a large loss for this case.  Solving for both equations, we get a loss of 6.3 plus 6.6, or 12.9.  Remember, we're playing by golf rules and so higher is worse, so this large value would indicate a bad set of weights!

=== SLIDE
So, we can see all three losses for each case here, in the table.  Essentially, the loss function tells us that the algorithm did a very poor job with frog, a great job with car, and a bad but still better than Frog job with Cat.  To bring your attention to the two equations on this slide, the first equation at the top is our original formula for total loss; below it is the loss function we just solved for.  The teal highlighted portion is equivalent - that is, we just solved for the loss of every individual image X.  So, to get the total loss for this one set of scores, we simply need to take the average, which results in...

=== SLIDE
approximately 5.27.

=== SLIDE
And, that's it for todays lecture.  To briefly recap, we started with a discussion of parametric modeling approaches, or models where we pass some set of Weights - generally denoted by capital W - along with the image we want to make a prediction for.  These weights can be used to summarize our knowledge of what different classes look like, and help overcome the need to save all of our data (as is done in KNN).  We then covered a linear classifier, which is one example of a parametric approach, showing you both how they are solved as well as discussing visualization techniques.  Finally, we covered loss functions, with a heavy focuss on multiclass SVM loss.  I hope you enjoyed the lecture, and will see you in lecture 4.

==============Notes
^{N=3}_{i=1}{\left \{ (x_{i},y_{i}) \right \}}

\sum_{j\neq y_{i}}^{J}max(0,s_{j}-s_{y_{i}} + \varepsilon )