Welcome back to DATA 442 - glad to have you all join us. Last lecture we had a chance to discuss the overall history of computer vision, and broadly introduce the themes and topics this course will be covering. From here forward, we'll start our deep dives - going into the inner mechanics of just how these nets work.
=== SLIDE Before we get started, there are a few reminders I wanted to make regarding the course. First is Piazza - I received a number of emails from students since lecture 1, and just wanted to remind the entire course that we really want Piazza to be the primary mode of communication between the course faculty, staff, and students. As such, please note I may not be checking email very regularly - if you have a question, make sure to ask it there. This includes any sort of question you might have regarding the midterm, final, assignments, attendance - really, anything. Remember, you can always post anonymously on Piazza, and don't hesitate to answer questions from other students! Second, please note we are doing a live Q&A for each lecture through Piazza. This is a scheduled block of time where you can optionally choose to join us, watch the lecture in a group environment, and ask any questions you might have in real time. More information is available on Piazza. Third, I wanted to briefly note that Lab 1 will be available tonight at midnight; you'll be able to find all of the submission links and information you need in the "Assignment 1" folder on Piazza. Please note that this assignment is fairly challenging, and doubly so if you have not worked with tools such as numpy or matplotlib before. I highly, highly recommend you get an early start.
=== SLIDE You'll remember last lecture, when we discussed Image Classification - an idea that will form the core of most of this course's content. When we talk about Image Classification, we are talking about a process in which an algorithm takes in data - in the form of image data - for example, the bird on the left. The algorithm is additionally provided with a set of pre-determined labels - for example, person, bird, sky, house. Ultimately, the goal of these algorithms is to pick the correct pre-determined label for a given image.
=== SLIDE So, what makes this so hard? After all, my 3-year-old is very good at informing me as to the differences between a dog and a cat. Unlike us squishy humans, computers have not had the benefit of millions of years of evolution of their visual capabilities - and what they "see" is very different from what we see. This is because - at the end of the day - all the computer takes in is a big matrix of numbers. The dimensions of the matrix are defined by the size and resolution of the image; the range of each value is defined by the way color is being measured. So, a computer doesn't "see" the scene like you or I would - rather, it sees data that is reflective of the content of the image.
=== SLIDE The most common type of image is a 3-band RGB image, in which the computer actually takes in three different matrices at the same time. One of these reflects how much "blue" is present at each pixel, one red, and one green. These values inform - for example - how brightly three LED lights should shine in a given pixel of a display. When they turn on at the specified intensities, the three LED lights (Red, Green, Blue) merge to form composite images - like the picture of the bird.
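To make that concrete, here is a minimal sketch of what "a big matrix of numbers" looks like in practice, using numpy and matplotlib's image reader. The filename bird.png is a hypothetical stand-in for any RGB image file - nothing here is tied to the slides:

    import numpy as np
    import matplotlib.image as mpimg

    # "bird.png" is a hypothetical stand-in for any RGB image file on disk.
    img = mpimg.imread("bird.png")

    print(img.shape)  # (height, width, 3) for an RGB image -- one matrix per color band
    print(img.dtype)  # often uint8 (0-255) or float32 (0.0-1.0), depending on the file type

    # The three bands are just three stacked matrices of numbers.
    red, green, blue = img[:, :, 0], img[:, :, 1], img[:, :, 2]
    print(red[0, 0], green[0, 0], blue[0, 0])  # intensity values for the top-left pixel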
So - the challenge becomes, how do we figure out that this particular arrangement of values within a matrix is actually reflective of a bird? This problem is generally referred to as the "semantic gap" - we (as humans) are choosing to label this particular object as a "bird", and the gap is the difference between our description of the data ("bird") and the data's description of the image (the matrix of numbers).
=== SLIDE This challenge is made even more difficult by human semantic groupings - i.e., we intuitively group similar types of objects under the same label. From the standpoint of computer code, the data description of a given "bird" may be very different across two birds - even two of the same species. This is referred to as intraclass variation.
=== SLIDE You can also imagine a range of other challenges that our human eye quickly generalizes over, but which dramatically change the way the data describes a given object. One common challenge is viewpoint - a bird can easily appear in many different ways simply depending on where the camera was held relative to where the bird is. A more direct example would be a tourist taking pictures of the Eiffel Tower from different angles - the tower won't move, but the viewpoint does.
=== SLIDE Lighting and illumination can also cause the data-based description of a "bird" to change dramatically. A wren set against a black background with perfect lighting by National Geographic will appear significantly different from a wren in the wild with little control over the conditions. A wren after dark would similarly appear very different. However, it's still a wren - and we need to be able to label it as such. Birds don't become cars simply because it's dark outside!
=== SLIDE The background against which an object is photographed can also pose a challenge, as similarities between the background and the object can challenge both our human and computational ability to recognize the object.
=== SLIDE This provides a rare counter-case where the data description (the matrix) may provide a more robust route to object identification than our human eye, as camouflage has specifically evolved to counter our capabilities - not a computer's.
=== SLIDE Deformation is when the same object can be represented by numerous different geometric arrangements. A bird isn't very deformable - though this is one example. However....
=== SLIDE Other types of objects can deform into a huge array of different shapes. No matter how far a cat climbs into a couch, it's still a cat. Or if it's sitting in a bowl. Or if it's leaping through the air. However, the underlying data matrices which represent each of these images would be dramatically different - thus, again, we run into the fundamental challenge of the semantic gap. While it is very evident to our human eyes and intuition that these cats are all ... well, cats, the data matrices that describe them are going to have very little resemblance to one another.
=== SLIDE Occlusion is also an enormous challenge - while the human eye can gather what objects are in an image based on very little data, occluded objects limit the amount of relevant data within a matrix for a computer to assess. As is only fair, here are a few examples of dogs this time - it is likely that in no case are you left with any confusion (these are, clearly, dogs), but think about how little data is actually *relevant* to your mind's intuition.
=== SLIDE So - the premise of this course is simple, then.
We need to write algorithms that can overcome all of the aforementioned challenges, in a reasonable period of time, and in a way that all of you develop a deep understanding of. It's easy to forget just how difficult the challenge we face is - our brains are simply well-tuned to visual tasks, and so they come naturally to us. Similar to a toddler first learning to recognize the world around it, computers have to learn as well - and we're just now starting to understand how to teach them. But the scope of the challenge is huge - it's not just figuring out how to overcome each of the challenges to visual recognition in code, but how to do it for ... well, everything. Buildings, cats, tables, hair, eyebrows - you name it. That's the problem we're engaging with here. So, let's get started. Let's pretend we're writing our first classifier - a function that takes in a matrix of data representing an image, and outputs a class. For our hypothetical, let's say we're classifying a black and white image of a capital letter "T" - one of the earliest types of image recognition challenges. What you see here is a very simplistic representation of what the letter T might look like in code - pixels that intersect with the letter T are represented by a "1", and white pixels are represented by a "0".
=== SLIDE What do we want to do with the letter T? Teach a computer how to recognize it, of course! Think about how you might code an algorithm that does that - first, you would define an image classification function, then you would pass the matrix "T" in. From there, you would need to identify a deterministic way to choose "T" from your set of possible labels. For example, maybe you could do something like this:
=== SLIDE Because we know the letter "T" results in a matrix with 32 "1" values, we can simply hard-code it so that any time there are 32 pixels with "1"s, we return the letter T. There we go - problem solved, course is over! This is of course a terrible solution, as it would only work for a narrow definition of "T", would likely overlap with other characters (leading to errors), and would not be able to handle any of the classes of challenges we discussed before - e.g., occlusion. This contrasts with other classes of computer algorithms, which frequently have a deterministic solution - e.g., sorting a list (everyone's favorite activity). However, hard-coding a deterministic solution to "finding the letter T" would require a near-infinite number of exceptions and tests to ensure it is not confused with other letters, can be detected in all cases, and is never, ever confused with a cat.
=== SLIDE The impossibility of coding all such exceptions and rules has not stopped us from trying. Back in 1986, Canny put together an algorithm which sought to detect edges, and posited that the relationships between those edges could be used to identify what an object was. As a very simple example in our case of "T", instead of deterministically selecting the letter based on the number of pixels that are black (32), we could instead do it based on the number of intersections between lines (in the case of T, 1). While deterministic - the same image fed into the same algorithm will always receive the same result - this approach is exceptionally brittle. For example,
=== SLIDE what if the font changes? Further, if you want to consider any other type of image - say, a plane, or evil villains from Dr. Who - you need to start a completely new algorithm.
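To see just how brittle this style of rule is, here is a toy sketch of the pixel-counting classifier described a moment ago. This is not code from the slides - the 32-pixel rule is specific to our hand-built "T" matrix:

    import numpy as np

    def classify_by_pixel_count(image):
        # Toy hard-coded classifier: call an image a "T" if it contains exactly
        # 32 black (value 1) pixels -- the count in our hand-built training "T".
        black_pixels = int(np.sum(image == 1))
        return "T" if black_pixels == 32 else "unknown"

    # This "works" for our one toy matrix, but breaks the moment the font, size,
    # or stroke thickness changes -- or another character happens to use 32 pixels.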
=== SLIDE So, we don't want to manually write rules for every conceivable object in the world, under all lighting conditions, irrespective of how many blankets they are hiding under. So, what is a human that wants to automatically classify billions of frames of video and images to do? This is where machine learning (or AI) comes into play. Rather than write the algorithm ourselves, we instead write a classifier - a loose framework of possible rules for an algorithm to parameterize - and let the AI use that framework to essentially write its own function to distinguish between classes. Practically, in the case of computer vision, this is a three-step process. First, we need a very large database of labeled imagery - e.g., ImageNet - that provides millions of examples of what a "Cat", "Dog", or "Canary" is. This is used to help cross the semantic gap - each one of these images has been labeled by a human, and thus can be used to "crosswalk" between the computer description (matrices) and the human descriptors (words). Once we have our input data, we need a function that can produce a model we can use to classify each image. This function is where the "machine" actually "learns" - i.e., tries to identify the best set of rules to determine whether an image is the letter T or a wren. As you can imagine, there are many, many different ways to conduct this step of machine learning, and we'll be going into some depth on them throughout this course. In the final step, we actually apply the model to unseen images - i.e., put a new image in, and get a classification out. This lets us understand how accurate our training was, as well as operationalize the model for use. This results in a more complex structure for our code - rather than one function that takes an image in and puts a class out, we now have two different functions: one takes in a very large set of data to learn from and constructs a classifier, and the second takes in an image and the output of the first function to produce a predicted class (we'll sketch this two-function structure in code in a moment).
=== SLIDE We're going to start with a few simple algorithms to illustrate how machine learning works at a fundamental level, before we graduate to the neural network based approaches. The first of these is essentially the simplest approach to training an algorithm the field has come up with - "Nearest Neighbor". Most of you have probably already encountered Nearest Neighbor, but it will be helpful to conceptualize it in the context of image recognition.
=== SLIDE The first step of Nearest Neighbor is to feed some kind of training data into the model. This data has to be labeled. As a very, very simple example, I have constructed three matrices that will represent our training data - a capital T, a lowercase l, and a capital I. While we'll show the outcome of a more serious case in a few minutes, these three matrices will be used to illustrate the basics of how Nearest Neighbor works, and serve as a nice illustrative example of machine learning as a whole.
=== SLIDE Here is an example of the letter "T" from our training data, and on the right the letter "T" we seek to recognize. You'll note it's slightly different - the font (Times New Roman) results in a slightly wider base and edges at the sides. Thus, our old approach (32 black pixels) would fail here - we need something that can recognize similarity, which Nearest Neighbor will help us with.
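Before we get into the mechanics, here is the two-function structure mentioned a moment ago, written as a skeleton. This is only a sketch - the names train and predict and the memorizing "model" are placeholders, not code from the slides - but it is the shape every classifier in this course will follow:

    def train(images, labels):
        # Step 1: learn from a large set of labeled images and return a "model"
        # object encoding whatever was learned. Here the model simply memorizes
        # the data -- the simplest possible choice, and exactly what Nearest
        # Neighbor will do.
        return {"images": images, "labels": labels}

    def predict(model, image):
        # Step 2: apply the learned model to a new, unseen image and return a
        # predicted class label. The stub below always guesses the first training
        # label; a real classifier's decision rule goes here.
        return model["labels"][0]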
=== SLIDE For reference in later slides, here are the two letter Ts in pixel form, where each pixel that touches the letter is a 1 (black), and otherwise it is a 0 (white).
=== SLIDE The basics of nearest neighbor are simple - for every single image we want to predict, we compare it to the training data and pick the most similar case according to a distance metric. The first distance metric we'll highlight is L1 (also called Manhattan distance). This is one of the simplest ways you could imagine comparing two images - essentially overlaying them and measuring the differences. Let's start with our letter T. After the algorithm "learns" (i.e., memorizes) our training data, we're going to contrast the T we want to recognize against every letter we trained with. In this slide, the letter T on the lower-left is the T from our training data - the example T we want to match to. In the middle is the T we want to recognize. On the right is the pixel-wise difference between the two images. In red are the negative cases (where the T we want to recognize had a pixel, but the training case did not). As you can see on this slide, we take the sum of the absolute values of these differences to calculate the "L1 distance" between the two images.
=== SLIDE On this slide, you'll see a simple programmatic example of implementing an L1-norm based distance function. All this code does is take in a pair of our example matrices, convert them to numpy arrays, and sum up the pixel-wise differences. This is identical to what we did manually on the previous slide.
=== SLIDE Now that we have our function, we can write a very simple loop that contrasts the T we want to classify against each of our three training letters (capital T, lowercase l, and capital I). The code will produce the distance measures shown in the lower-right - an L1 distance of 10 for the letter T, 14 for the lowercase l, and 18 for the capital I. Based on this, we would classify our letter as a T! Wonderful.
=== SLIDE Of course, we want to start developing much better coding habits throughout this course, so let's translate that code into a class that we can re-use and modify more readily. We're going to be diving right in - if you feel behind or lost, I recommend you check out the python tutorial we have up on Piazza and then return here. First, we're going to be using numpy for all of our arrays from here on. This is fairly straightforward to implement, and will give us big speed gains later on. Here is one quick example of converting the training letter "T" into a numpy array; you'll note that while the structure changes, the underlying data does not.
=== SLIDE What you're looking at now is a simple implementation of the Nearest Neighbor algorithm, this time using numpy. Let's spend just a little bit of time breaking down what's going on here.
=== SLIDE First, we're saving all of our training observations into memory. As noted before, this training is about as simple as it gets!
=== SLIDE Second, we're running the exact same algorithm as before, comparing every one of our training cases (saved as self.Xtr) to the case we want to predict (X[0] in this case). The class we're calling is on the left (NearestNeighborSinglePrediction), and on the right you'll see an example of how we would call the code. Noting that all of our letters are now numpy arrays, we would first build a list of all of our training data. New in this approach, and following the general paradigm of the rest of the field, we additionally create a second array, our "trainingy".
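Here is a minimal sketch consistent with what's being described on the slide. The class and variable names (NearestNeighborSinglePrediction, trainingX, trainingy) follow the lecture, while the tiny 5x5 letter matrices and the exact implementation details are stand-ins rather than the code on the slide:

    import numpy as np

    # Tiny 5x5 stand-ins for the letter matrices on the slides (hypothetical values).
    T_train = np.array([[1, 1, 1, 1, 1],
                        [0, 0, 1, 0, 0],
                        [0, 0, 1, 0, 0],
                        [0, 0, 1, 0, 0],
                        [0, 0, 1, 0, 0]])
    l_train = np.array([[0, 1, 0, 0, 0],
                        [0, 1, 0, 0, 0],
                        [0, 1, 0, 0, 0],
                        [0, 1, 0, 0, 0],
                        [0, 1, 0, 0, 0]])
    I_train = np.array([[1, 1, 1, 1, 1],
                        [0, 0, 1, 0, 0],
                        [0, 0, 1, 0, 0],
                        [0, 0, 1, 0, 0],
                        [1, 1, 1, 1, 1]])

    class NearestNeighborSinglePrediction:
        def train(self, Xtr, ytr):
            # "Training" is just memorizing: keep references to the images and labels.
            self.Xtr = Xtr
            self.ytr = ytr

        def predict(self, X):
            # L1 (Manhattan) distance between the test image X[0] and every training
            # image: absolute pixel-wise differences, summed over each image.
            l1Distances = np.sum(np.abs(self.Xtr - X[0]), axis=(1, 2))
            # The label of the closest (smallest-distance) training image wins.
            return self.ytr[np.argmin(l1Distances)]

    # trainingX holds the images; trainingy holds the labels, in the same order.
    trainingX = np.stack([T_train, l_train, I_train])
    trainingy = np.array(["T", "l", "I"])

    # A slightly different "T" (one extra pixel at the base), standing in for the
    # Times New Roman T we want to recognize.
    T_test = T_train.copy()
    T_test[4, 1] = 1

    nn = NearestNeighborSinglePrediction()
    nn.train(trainingX, trainingy)
    print(nn.predict(np.array([T_test])))   # -> 'T' (L1 distances: 1, 11, 3)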
This array is entered in the same order as our "trainingX", and provides the labels. We then provide our training data to the algorithm, and solve for a prediction.
=== SLIDE I want to draw your attention to the one-liner where we calculate l1Distances. This takes the place of the for loop we wrote earlier, and is far, far more computationally efficient while providing the same solution. This is one example of many where we will use numpy array operations to replace what would otherwise be implemented as a for loop.
=== SLIDE We're almost there, but we want to make this generic algorithm a little more versatile so we can test multiple letters at once - in practice you'll need this to validate how well your algorithm is working, as you'll want to test it against thousands of cases not used in your training. To facilitate this, I have created a new letter - a lowercase l - in a different font than our training data. In this iteration of the code, we'll seek to simultaneously identify the letters these two examples look most like based on our training data.
=== SLIDE Alright! Just a few small changes to our code let us output predictions for multiple inputs. We now have a capital T and a lowercase l, correctly classified. Before we dig into the outcome, though, I want to highlight one quick thing.
=== SLIDE Think for a moment about the code - our training would be lightning fast, as all we're doing is creating in-memory pointers, which is nearly instantaneous. However, our prediction would be quite slow - we have to compare every single case in our training data to a given image to find the smallest distance. This is, needless to say, very time- and processing-intensive. It's also the opposite of what we generally seek: it's OK if a model takes a long time to train, because we can train it ahead of time on a large farm of computers. However, we want our models to be able to predict very quickly, or run on low-power devices like cellphones. So, this is no good - and the more complex approaches we'll cover later will provide us with the opposite relationship.
=== SLIDE To extend this example, let's apply the same logic to CIFAR10. This dataset is made up of 60,000 images, split into 50,000 training and 10,000 testing images. The images are equally distributed across 10 different categories, including airplane, bird, frog, and truck. You can go to the website at the bottom of this slide to take a deeper dive into the dataset, as well as see many more examples of the images it contains.
=== SLIDE What you're looking at on this slide is a set of example images taken from a nearest neighbor implementation on the CIFAR10 dataset done by Justin Johnson in 2017. The first column is the image we are seeking to predict - one of the 1,000 testing images from each class. To the right are the images found to be most similar according to the nearest neighbor algorithm, drawn from all 50,000 training images. Highlighted in red are the cases where the class of the first image matched would have been incorrect. For example, in the first row you can see that the image we're testing is of a boat - the class predicted in this case would have been a bird, as the first image matched is a bird. Conversely, in the second row you can see a case where nearest neighbor gets it right - the test image of a dog would have been matched to another dog.
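If you'd like to poke at CIFAR10 yourself, the dataset is available from the website on the slide; one convenient shortcut - assuming you happen to have TensorFlow/Keras installed, which this course does not require - is its built-in loader:

    import numpy as np
    from tensorflow.keras.datasets import cifar10  # one convenient loader; not the only route

    # Downloads CIFAR10 on first use: 50,000 training and 10,000 testing images.
    (x_train, y_train), (x_test, y_test) = cifar10.load_data()

    print(x_train.shape)        # (50000, 32, 32, 3) -- tiny 32x32 RGB images
    print(x_test.shape)         # (10000, 32, 32, 3)
    print(np.unique(y_train))   # 10 class ids (airplane, automobile, bird, ..., truck)

    # Flattening each image into one long vector makes the L1 distance a single
    # subtraction, absolute value, and sum.
    x_train_flat = x_train.reshape(len(x_train), -1).astype(np.int64)
    x_test_flat = x_test.reshape(len(x_test), -1).astype(np.int64)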
Because these images are very low resolution, it can sometimes be a bit challenging to see what's going on, but they provide a good illustration of the type of approach we're testing.
=== SLIDE One simple extension of Nearest Neighbors is K Nearest Neighbors, an algorithm that allows for a vote among multiple similar cases, rather than only taking the single best match. Here, we have an example where we have multiple training samples for the letters "A" and "T", represented by red (A) and blue (T) dots. We also have an example of a hand-written letter "T", represented by a yellow dot.
=== SLIDE Each point is positioned so that the farther it is from 0, the more different the letter was according to the L1 distance. So, for example, the first red point was a letter "A" that looked a lot like the letter "T" represented by the yellow dot. This would result in an erroneous classification in an unmodified nearest neighbor algorithm, as we would select A as the most likely class.
=== SLIDE Conversely, with a K=3 Nearest Neighbor, we would expand our search radius to include the 2nd and 3rd most similar letters to our hand-written T. Each case would get a vote, and because 2 of the 3 cases are T, we would correctly assign the letter T. You can imagine any number of expansions of this voting technique (e.g., distance weighting).
=== SLIDE Let's go back to our CIFAR10 example. If you look at the fourth row, you can see the example of the frog, which was misclassified as a cat, as a cat was the most similar image. If we expand to K=2, K=3, K=4, or K=10, we can get an idea of how expanding K can help (or hurt) the accuracy of this class of model.
=== SLIDE Here is us doing just that - remember, the correct answer for this estimate would have been a frog. If we have k=1, we only use the first (most similar) image for our prediction - in this case, a cat - and we get it wrong. If we expand k to 2, we move to the second column of this table. As you can kind-of, sort-of make out at the top of the table, the second-most-similar image was a frog. Thus, we now have one example out of 2 - 50% - being a cat, and one - 50% - being a frog, so the algorithm would be tied as to which class to assign. If we did k=3, 2 of the three examples are now cats, and 1 of 3 a frog. If we then expand all the way out to k=10, you can see that there would be a wide range of classes estimated, with 3 of the 10 points belonging to each of the cat and frog cases, resulting in a tie.
=== SLIDE Many different machine learning algorithms - including KNN - require the selection of a distance metric. The example we just walked through used an L1 distance metric. Contrasting with this is the L2 (Euclidean) distance metric. While it may seem bland, the choice of distance metric has a number of implications. A few important things to note. First, L1 distance is predicated on the idea that differences are linear - i.e., a one-unit change in difference between 0 and 1 matters as much as a one-unit change in difference from 10 to 11. L2 biases towards large differences - i.e., a one-unit change in difference between 0 and 1 matters less than a one-unit change in difference from 10 to 11. However, L1 distance is sensitive to changes in your underlying coordinate system, which inter-relates with the standardization strategy you choose.
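To make the voting idea and the metric choice concrete, here is a minimal k-nearest-neighbor sketch over flattened image vectors. It is not the implementation from the slides - the function name and the (N, D) input layout are assumptions - but it shows both distance metrics and the majority vote in one place:

    import numpy as np
    from collections import Counter

    def knn_predict(Xtr, ytr, x, k=3, metric="L1"):
        # Classify one flattened image x by a majority vote among its k nearest
        # training images, under either the L1 or L2 distance.
        diffs = Xtr.astype(np.float64) - x.astype(np.float64)
        if metric == "L1":
            dists = np.sum(np.abs(diffs), axis=1)        # sum of absolute differences
        else:
            dists = np.sqrt(np.sum(diffs ** 2, axis=1))  # Euclidean (L2) distance
        nearest = np.argsort(dists)[:k]                  # indices of the k closest images
        votes = Counter(ytr[i] for i in nearest)
        return votes.most_common(1)[0][0]                # ties fall to the earlier-counted label

    # Hypothetical usage, with Xtr an (N, D) array of flattened training images
    # and ytr an (N,) array of their labels:
    #   label = knn_predict(Xtr, ytr, x_new_flat, k=3, metric="L2")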
As a broad rule of thumb, L2 distance is generally preferable if you are working in a generic feature space, but L1 might be preferable if you know the specific data you have is highly relevant to your classification (i.e., the precision of the measurement matters a lot). The two figures on this slide present a visual of the difference between the L1 and L2 norms. In both cases, imagine you are trying to distinguish between two images - an image represented by the yellow dot in the center, and an image represented by the green dot at the lower-right. In both cases, you took the value of the upper-left-hand pixel and plotted it on the Y-axis, and the value of the lower-right-hand pixel and plotted it on the X-axis. So, the yellow image had a value of 0 in both pixels; the image represented by the green point had positive values in both pixels. The green dot and yellow dot are in the same spot in both figures.
=== SLIDE Now, let's add a second image - we want to identify whether the image represented by the yellow dot is more similar to the image represented by the green dot, or to the image represented by the purple dot. In the left figure - the L1 norm - both images would have an *identical distance*, represented by the blue diamond. That is to say, the measured distance between the yellow image and the green image would be exactly equivalent to the distance between the yellow image and the purple image. Conversely, under the L2 norm, the exact same images would result in different distances - in this example, the green image would be measured as more similar to the yellow image. To reiterate: in general, the L1 norm will be important if the precision of your underlying measurements is well known and important, as all differences between values are treated the same irrespective of their magnitude. L2 is more generalizable, and more readily useful in a generic feature space.
=== SLIDE So you may be asking yourself - outside of these rules of thumb - how do we select the best distance metric? This is a broad class of problem, and not just about distance metrics - think about KNN so far. There are two different choices we have to make to implement the algorithm: what value of K to use, and what distance metric to use. Both of these are referred to as "hyperparameters" - choices we make that influence the algorithm, but are not themselves learned from the data. A big challenge is that the correct hyperparameters are very problem-dependent - the underlying nature of your data, the target you're trying to predict, sparsity, and hundreds of other things come into play in driving the best choice. So, generally, we seek to try them all out and see what works best!
=== SLIDE We're going to dig into this notion of hyperparameters a little further. When we talk about "what works best", there are a huge number of ways to define "best", and the research community has explored a wide range of them. The first idea you might consider is simply choosing hyperparameters on the basis of whatever provides the most accurate estimates for your dataset. While intuitive, this is a terrible idea - let's walk through why. In this approach, we would take all of our data ("The Data") and feed it forward into 10 different models - 5 different values of K crossed with two different distance definitions. We would then look at the accuracy of every model and pick the one with the lowest overall error. So, why is this a terrible idea?
At the end of the day, we don't actually want a model that can predict only the data we have on hand - the entire point of the vast majority of machine learning is to predict for cases we *don't* have. So, this approach would give us the best hyperparameters for our data, but no insight into how effective our model is for other cases. Further, we are highly likely to dramatically overfit to our data, resulting in a model that can't be generalized to any other cases.
=== SLIDE So, a natural extension of this would be to take our data and split it into two pieces - a set of data we're going to use for training, and a completely independent set we're using to validate. As before, we fit our 10 models, but this time with only the training data. We then use each of the ten models to predict the *validation* data, and choose the one that works best to establish the best hyperparameters. Yet again - this is a bad idea. The point of this validation set is to give us some notion of how well the algorithm will work on new data. Because our validation set is driving which model is chosen - i.e., the error on the validation data helps us choose the "best" model - we are left without a truly external test, where the data used to judge the model in no way influenced which model was chosen. Bottom line: don't report a model's accuracy based on data that in any way influenced how that model was chosen!
=== SLIDE So, what is a budding computer vision student to do? A third approach - and a fairly good one - is to split your input data into three discrete buckets. The first two are the training and validation data - these are the datasets that will inform our modeling. The third, and new, one is a test dataset. This test dataset will be left completely aside until the very, very end of our analysis - after we've done all of our training and validation, and selected the best model based on our validation data. Only then do we evaluate the final set of hyperparameters by contrasting the chosen model against the test data. That result provides the numbers we ultimately report in academic papers, in reports on the accuracy of your model, and in the speech you use to illustrate how smart you are to your colleagues. If you see other approaches to accuracy assessment which do not utilize a completely independent test set, be VERY dubious!
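Here is a rough sketch of that workflow, re-using the knn_predict sketch from a moment ago and assuming X (an N x D array of flattened images) and y (an N-length array of labels) are already loaded. The split proportions and candidate hyperparameter values are placeholders, not prescriptions:

    import numpy as np

    def accuracy(Xtr, ytr, Xeval, yeval, k, metric):
        # Fraction of the evaluation images that knn_predict labels correctly.
        preds = [knn_predict(Xtr, ytr, x, k=k, metric=metric) for x in Xeval]
        return np.mean(np.array(preds) == yeval)

    # Shuffle once, then carve out 80% train, 10% validation, 10% test.
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))
    n_train, n_val = int(0.8 * len(X)), int(0.1 * len(X))
    train_idx, val_idx, test_idx = np.split(idx, [n_train, n_train + n_val])
    Xtr, ytr = X[train_idx], y[train_idx]
    Xval, yval = X[val_idx], y[val_idx]
    Xte, yte = X[test_idx], y[test_idx]

    # Hyperparameters are chosen using the validation set only.
    candidates = [(k, metric) for k in [1, 3, 5, 7, 10] for metric in ["L1", "L2"]]
    best_k, best_metric = max(candidates,
                              key=lambda km: accuracy(Xtr, ytr, Xval, yval, *km))

    # The test set is touched exactly once, to report the final number.
    print("chosen hyperparameters:", best_k, best_metric)
    print("test accuracy:", accuracy(Xtr, ytr, Xte, yte, best_k, best_metric))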
=== SLIDE While an independent validation and test dataset is the most common approach in deep learning, cross-validation is another important tool. It is especially important if you are fitting a model with a relatively small amount of data - say, a few thousand observations instead of millions. Essentially, it is possible that your model will be biased due to the random subset of data you chose for testing and validation - imagine trying to predict the contents of an image, and by purely random chance you choose 100 images of the Eiffel Tower out of a database of 1,000 for your testing set; your model might be REALLY good on paper, but in practice it can't predict anything but French architecture. Cross-validation (sometimes referred to as "K-fold" validation, where K is the number of folds) is an approach to mitigating this challenge. In cross-validation, we still set aside a testing dataset, and then split the remaining data into some number of folds - in this slide's example, 4 folds. Then, we set aside a single fold to use as validation during the modeling process, and use the remaining three folds for training. This process is repeated for each permutation of the folds - i.e., in this example, we use fold A for validation, then B, then C, then D. Finally, we choose the model with the lowest overall error across the folds (e.g., through a voting mechanism, though there are many approaches to choosing the best hyperparameters across all folds).
=== SLIDE While we've been using KNN as an example - and it's a great example to help walk through a number of basic machine learning concepts - it is not a very strong algorithm when it comes to image classification. We already covered one reason for this - it's very slow, as most of the computational cost occurs during prediction rather than during training; this makes it unsuitable for applications that require quick decisions (which is most of the time). Further, we have a decent body of evidence suggesting that KNNs are simply not good at capturing the perceptual differences we see with our eyes - largely because they operate on a pixel-by-pixel basis, so the context around each pixel is not retained effectively. This leads us into the biggest challenge that most traditional machine learning approaches face when it comes to images - the curse of dimensionality.
=== SLIDE Most data analysis problems have focused on traditional datasets, in which a single row contains all of the information for a single observation. For example, a dataset might contain millions of rows of data with personal characteristics of individuals, such as Height and Weight - though the toy table on this slide has just three rows. A dataset like this can be described by the number of dimensions it has (here, 2 - Height and Weight) and the number of observations it has (here, 3). To represent this sample dataset would require 3 points on a 2-dimensional plane - a simple scatterplot. This matters because, as the number of dimensions increases, the number of training samples we need also increases - you need an example case "nearby" (i.e., a neighbor) to get a reasonable estimate. In traditional, low-dimensional cases this is quite feasible to achieve, but in high-dimensional cases it becomes increasingly difficult. The figure on the right shows an example of this. Each observation from the table is plotted, and the region in green very loosely represents the region in which we might expect to make a reasonable estimate - i.e., where there are neighbors nearby. In this simple example, the green box covers around ~20% of the chart - the area in red is where we might not expect to make a very good estimate, because there are no samples nearby.
=== SLIDE Now, let's imagine adding a third dimension - Age - while keeping the same three observations. The region of space we have to estimate across grows exponentially, but the green region stays the same size - at least in this simple example. For each dimension we add, we need a huge number of additional samples in order to cover the new region. This is an enormous problem in the case of images, as ...
=== SLIDE in the case of an image, in its raw form every pixel can be considered a dimension. This means that methods which require a dense sampling of multidimensional space are inherently ill-suited to image recognition tasks - especially considering that additional dimensions are frequently added as images are permuted to add contextual information. Bottom line: don't use KNN for image recognition - it's just a bad idea.
=== SLIDE Let's take a moment to recap where we are now.
We've been exploring the topic of image classification, in which we seek to use a large body of past examples of classified images (i.e., images that have been labeled by humans) to train an algorithm to classify other images (i.e., "This image contains a cat"). We've been exploring this in the context of a Nearest Neighbors algorithm, which simply memorizes all of our training data, and for any new image selects the most similar-looking image in the training data and assigns its class. Just to reiterate: nearest neighbor is a good example for learning the general mechanics of our machine learning approach, but a terrible choice for image recognition in general. There are two hyperparameters you must choose for KNN - the distance metric (L1 vs L2) and K (the number of neighbors to look at). A good way to select appropriate hyperparameters is to create a validation set and a test set, where the test set allows for a completely independent assessment of the algorithm on data it has never seen before. For small datasets, cross-validation is a good addition that improves the robustness of your results, as it mitigates the chance that a single validation split will lack the variation representative of external data.
=== SLIDE That's it for today - just a few brief reminders for everyone. First, check in on Piazza with any questions you might have! We're excited to answer anything that comes up on Assignment 1, and there is a lot of information about it up now. Also, remember that group study is definitely something we encourage - but you must submit your own work. Good luck, and we'll talk again soon!