Welcome back to DATA 442, lecture 15! #SLIDE As usual, we'll start off with a brief recap. Last lecture we introduced the idea of generative models, which are a class of models that seek to create new data points by drawing from the distribution of your inputs. Or, a cat-generating machine. You pick. #SLIDE We discussed three different types - two that use explicit density functions, and one in which we sample implicitly. #SLIDE PixelRNN and PixelCNN were examples of solving explicit density functions, using different approaches to capturing the probability at each pixel. #SLIDE We then introduced Variational Autoencoders (VAEs), which leverage an encoding/decoding process to generate new samples from our inputs. #SLIDE And, finally, we discussed the Generative Adversarial Network (GAN) approach, in which two nets - a generator and a discriminator - compete with each other to either create fake images or - in the case of the discriminator - tell the difference between fake and real images. #SLIDE Today we're going to talk about one of my favorite topics - visualization of neural networks. You'll hear a lot of discussion about how neural networks are "black boxes", and how we can't identify what's going on within them. This is, put simply, silly. While it's hard to identify the underlying patterns occurring in convolutional - or, neural - networks, it is possible to get a very good idea of why predictions are being made by visualizing just what a network is doing at any given layer or computation. This is also a fun lecture, because I get to show you a whole bunch of pictures that really let us go under the hood of the networks you've been implementing in your lab! As a teaser and to kick us off, you're looking at a project by one of our Ph.D. students in this slide, in which we are trying to distinguish between high- and low-quality roads in satellite images. The method (SHAP) you see here helps unpack which features in the image are voting for or against a given classification. If you take a look at the upper-left image, under "High Quality", you'll see - as an example to kick us off - that the well-painted road centerlines are highlighted in red pixels. This means that, when the model sees clearly painted road lines, it votes in favor of high quality. Pretty cool, huh? #SLIDE Throughout this course we've talked about a lot of different neural network architectures for different purposes, like this one - this is a representation of VGG16, showing the activation surfaces as convolutions occur throughout the network. However, #SLIDE we haven't talked much (at all) about what the activation surfaces themselves are actually identifying. As we parameterize our network, what types of patterns are the layers actually finding? What exactly is the network identifying that gives it all of this explanatory power for our predictions? To date, we've really just seen convolutions as filters - i.e., we put some image in, we convolve our filter, and some vector comes out. The question is, what meaning can be attributed to these vectors throughout the net? This matters a lot: if we can build an intuition for what types of layers are working and what types of features they're detecting, we can start to make more intelligent decisions about model architectures. #SLIDE One of the first things you can do is visualize the actual filter weights themselves - especially in the first layer.
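If you want to try this yourself, here is a minimal sketch of pulling out and plotting first-layer filters. It assumes PyTorch, torchvision, and matplotlib are available, and uses torchvision's pretrained AlexNet (which we'll talk about in a second) as a stand-in for whatever network you've trained in lab:

```python
# Minimal sketch: visualizing first-layer convolutional filters.
# Assumes PyTorch, torchvision, and matplotlib; the pretrained AlexNet
# is a stand-in for whatever network you actually trained.
import torch
import torchvision
import matplotlib.pyplot as plt

model = torchvision.models.alexnet(weights="IMAGENET1K_V1")

# The first conv layer in torchvision's AlexNet holds 64 filters of shape 3x11x11.
filters = model.features[0].weight.data.clone()        # (64, 3, 11, 11)

# Rescale each filter to [0, 1] so it can be displayed as a small RGB patch.
filters -= filters.amin(dim=(1, 2, 3), keepdim=True)
filters /= filters.amax(dim=(1, 2, 3), keepdim=True)

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(filters[i].permute(1, 2, 0))              # CHW -> HWC for imshow
    ax.axis("off")
plt.show()
```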
To take AlexNet as an example, these first-layer filters are 11x11x3 in shape, and they get slid across the image as the convolution is computed - in AlexNet you have 64 of them. In the first layer, the output is the direct inner product between the weights in the filters and the image. So, we can render each filter as an 11x11x3 image in its own right, with all 64 filters displayed side by side. These are the weights of the convolutional filters on the left - you can start to see which filters are meaningful for a given project. What you're looking at here are the weights solved for in the context of ImageNet. So, you can see the types of things the filters are detecting - vertical lines, horizontal lines, opposing colors, diagonals, and other abstract types of shapes. Remember all the way back to the first lecture in this class - you'll recall that the inspiration behind convolutional neural networks largely came from the hierarchical nature of the human visual system and the way its neurons interconnect, starting with very basic neurons that build up to achieve pattern recognition. The same idea is present here - the convolutional neural network is identifying very basic patterns, which it can then combine into more complex patterns at later stages in the network. #SLIDE One really cool thing is that, almost no matter what network, filter design, or even input imagery you use, you end up with a pretty similar set of filter weights for the first layer - i.e., the baseline features a network picks up on (straight lines, horizontal lines, strongly contrasting colors, etc.) are relatively similar, and seem to resemble the types of patterns the human eye is designed to pick up on. Here you see four different architectures and the types of features detected in the first layer in each case. It would be quite difficult to tell which was which on this basis alone - all of our architectures seem to settle on the same fundamental building blocks to extract information from imagery. #SLIDE Another type of visualization we can learn from sits at the other end of the network - the fully connected (affine) layers used to calculate the final class scores (i.e., for ImageNet that's 1,000 class scores; for CIFAR-10 it's 10). Just before those scores, we have the feature vectors that represent the sum total of information the CNN has reduced our input down to. So, we can run our network across all of the layers, record these vectors, and then try to visualize how they change based on the input. #SLIDE Let's make this a little more explicit - when we put an image into our network (say, VGG16), we are going to have some amount of data at each layer. In the case of VGG16, one of those layers is a dense (fully connected) layer, which takes as input the roughly 25,000 activations coming out of the convolutional stack and feeds them into 4,096 outputs. Those 4,096 outputs represent a point in 4,096-dimensional space; that point is our representation of the image. #SLIDE We can do that for any number of images, passing them through the network and identifying where in our multidimensional space each image ends up. #SLIDE Once we've done that for enough images, we can then cluster them together - i.e., try to figure out which images sit near each other in this space. Here you have an example of this, where on the left is an image that is selected as an input, and to the right are the six nearest neighbors of that image in feature space.
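To make the "point in 4,096-dimensional space" idea concrete, here is a minimal sketch of extracting those feature vectors and finding an image's nearest neighbors in that space. It assumes PyTorch and torchvision, uses torchvision's pretrained VGG16, and `image_batch` is a hypothetical tensor of preprocessed images:

```python
# Minimal sketch: using the 4096-d fully connected activations as a feature
# space and finding nearest neighbors in it. `image_batch` is a hypothetical
# tensor of preprocessed images with shape (N, 3, 224, 224).
import torch
import torchvision

model = torchvision.models.vgg16(weights="IMAGENET1K_V1").eval()

# Stop the forward pass at the first 4096-d fully connected layer.
feature_extractor = torch.nn.Sequential(
    model.features, model.avgpool, torch.nn.Flatten(),
    *list(model.classifier.children())[:2],    # Linear(25088 -> 4096) + ReLU
)

with torch.no_grad():
    feats = feature_extractor(image_batch)      # (N, 4096): one point per image

# Nearest neighbors of image 0 in feature space (small distance = similar).
dists = torch.cdist(feats[0:1], feats).squeeze(0)
print(dists.argsort()[:7])                      # image 0 itself plus its 6 neighbors
```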
You can see that the feature space allows us to capture all sorts of variation - remember back to our original discussions about... #SLIDE two-headed horses. If we were looking at these images in pixel space - i.e., just the pixel values in the image - we would struggle to identify cases where the image was semantically correct but the orientation or occlusion changed. #SLIDE There are many examples you can pick out here where feature space is clearly not susceptible to the same issues pixel space was. Our input elephant is looking right - but the example elephants nearby in feature space are looking in all sorts of directions. Similarly, the aircraft carrier is heavily occluded in some images, and facing different directions throughout. This is a good way to understand whether the feature space you're identifying is helpful in capturing the differences you're trying to find. #SLIDE You can also apply dimensionality reduction to this space - i.e., instead of having 4,096 values in multidimensional space represent each image, we can reduce those down to the 2 dimensions that represent the data most completely. If you've taken an earlier DATA course (e.g., DATA 310), you probably learned concepts like PCA, t-SNE, kernel PCA and other dimensionality reduction techniques; you could theoretically apply any of these. This gives you a direct way to visualize where different classes sit in the multidimensional space being output by your CNN, and thus a good understanding of which classes may be getting confused (or not) with one another. #SLIDE Here is a quick example of this technique being applied to a CNN designed to classify the MNIST digits (0 to 9) - you can see the dispersion of all cases, but if you look closely you'll see confusion between (for example) 0 and 9 in the upper-right. #SLIDE Another cool thing you can do is actually place the images themselves in this 2D space. So, here is an example of a display of images that would be considered similar to each other in 2D space. I encourage you to go to the website at the bottom of the screen to take a look at these images in all of their glory, as... #SLIDE some really interesting and neat patterns begin to emerge as you do this with more and more images. For example, here you can see the images in the upper-right tend to have blue. Lots of blue. So, a filter must be picking up on that color, or patterns associated with blue, because those images all land in a similar part of the multidimensional space. If you poke around the images, you'll see that that blue smudge is clearly "boats"; the green at the lower-left is flowers, and so on. #SLIDE We can also get very specific about unpacking individual pieces of our networks. Envision, for example, selecting a given activation in a given layer - i.e., we want to know, for the 1st filter of the 4th convolutional layer, what is causing that filter to trigger? This is often called the maximally activating patches approach. First, we run thousands of images through our network and identify the images that caused the maximum activations for the filter we're interested in (i.e., the largest output values). We can then look at the activation surface itself and identify where the largest values are. Because we know that each position in the convolution output corresponds to a specific area of the original image, we can essentially work our way backwards to identify what that patch is, and then visualize it.
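A minimal sketch of that last idea - finding the images, and the positions within them, that most strongly activate one particular filter - might look like the following. It assumes PyTorch and torchvision; `dataset` is a hypothetical iterable of preprocessed (image, label) pairs, and AlexNet's `features[8]` layer stands in for "the 4th convolutional layer":

```python
# Minimal sketch: finding the images that most strongly activate one filter.
# `dataset` is a hypothetical iterable of preprocessed (image, label) pairs;
# AlexNet's features[8] (its 4th conv layer) and filter 0 are just examples.
import torch
import torchvision

model = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()
layer, filter_idx = model.features[8], 0
captured = {}

def hook(module, inputs, output):
    # output: (N, channels, H, W); keep the activation map for our filter.
    captured["act"] = output[:, filter_idx].detach()

layer.register_forward_hook(hook)

scores = []
with torch.no_grad():
    for i, (image, _) in enumerate(dataset):
        model(image.unsqueeze(0))
        act = captured["act"][0]
        # Record the strongest response and where it occurred on the activation map.
        scores.append((act.max().item(), i, divmod(act.argmax().item(), act.shape[1])))

# Top handful of (activation, image index, (row, col)) triples.
print(sorted(scores, reverse=True)[:6])
```

From the recorded (row, col) positions you would then crop the corresponding input patches; the exact receptive-field arithmetic depends on the strides and kernel sizes of the layers in between.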
#SLIDE Another approach is an occlusion experiment - the basic idea is that we want to identify which part of the image actually results in the network making its decision. So, what we do is take our input image - here, a picture of a cat and a dog - and then start to block out parts of the image, replacing them with a mean pixel value. We then run this through the network and record the probability for each class in each case. We slide this occlusion window across the image, repeat the process, and draw a heat map that records - for each location we occluded - the probability the image would be classified as a cat or a dog. The idea is that if we block out something that is key to the network's interpretation, then the output should change a lot - i.e., the classification decision would change. On the right you'll see an example of this, where different pixel groups on the image of the cat and the dog have been systematically occluded. The top represents the "Cat" classification case. When the pixels of the cat are occluded, we see that the probability that it is a Cat goes down - as represented in blue. You can see from the intensity of blue that things like the cat's tail mattered just a little, but things like its neck and, apparently, underbelly were quite important. #SLIDE Another approach is to create a saliency map - i.e., trying to understand which pixels in the image are important for the classification. The approach is to compute the gradient of the class scores themselves with respect to the image pixels. So, remember back to our backpropagation examples - we were always trying to solve for the gradients of our weights, and left X (our input image data) alone. Instead, this essentially asks: if you changed a pixel value by 1, by how much would the resulting score estimate for a given class change? This gradient-based approach is visualized on this slide, where you can see the importance of different pixel values for each of the classifications. #SLIDE This same idea can be applied to any output - i.e., in the saliency case, we're computing a gradient based on the class score, but we can compute the gradient for any value in the network. Say I want to know what causes filter 3 of my third layer to go up - I could calculate the gradient of filter 3 in my third layer with respect to my X pixel inputs. Zeiler and Fergus (in 2014) showed that you could do this efficiently using what is called "guided" backpropagation, in which only positive gradients are backpropagated through ReLU networks. The neat thing about this is that you can actually understand every single feature that a network is examining, irrespective of the depth (or lack thereof) of the operation that you're interested in exploring. You can see an example of this in the second column on the slide now - in the top image you can make out the cat with a fair amount of detail, suggesting that the features being looked for in the "Cat" class are things like the stripes, and maybe the shapes of the eyes and nose. In contrast to that, the Dog case clearly shows the top of the dog's head, and things like the shape of the nose and ears.
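Here is what the basic, unguided version of this gradient-with-respect-to-the-pixels idea looks like in code - a minimal saliency-map sketch, assuming PyTorch and torchvision, where `image` is a hypothetical preprocessed input tensor and `class_idx` is the class you care about:

```python
# Minimal sketch of a saliency map: the gradient of one class score with
# respect to the input pixels. `image` is a hypothetical preprocessed tensor
# of shape (1, 3, 224, 224), and `class_idx` a hypothetical class index;
# torchvision's pretrained VGG16 stands in for any classification network.
import torch
import torchvision

model = torchvision.models.vgg16(weights="IMAGENET1K_V1").eval()

image = image.clone().requires_grad_(True)   # ask autograd to track the pixels
score = model(image)[0, class_idx]           # the (unnormalized) class score
score.backward()                             # d(score) / d(pixels)

# Collapse the 3 color channels: a pixel's saliency is the largest absolute
# gradient across its channels.
saliency = image.grad.abs().amax(dim=1).squeeze(0)   # (224, 224) heat map
```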
#SLIDE Yet another approach is gradient ascent, in which we generate an image that will result in a given output (i.e., a class score) reaching its maximum value. Just like we can optimize the weights in our network, if we freeze the network we can essentially "optimize" the pixels in an input image, with the goal of the optimizer being to maximize (hence gradient ascent) some value in the network. What you're looking at here are 9 examples drawn from work by Simonyan and Zisserman in 2015, using ImageNet and an approximately 19-layer net. These are the generated images that would result in the maximum class score for each class. #SLIDE A fun example of how this works can be seen on the slide now, which shows the process of a net optimizing the input image to maximize the likelihood of predicting "Spider". Straight out of Lord of the Rings, or your favorite Halloween movie, you can see legs everywhere, along with patterns that clearly seem to represent webs. #SLIDE With some minor image augmentation, such as Gaussian blurring and clipping, and better specification of regularizing hyperparameters, the images that can be generated this way can be pretty incredible. Here, you're looking at images generated to maximize ImageNet class scores using an 8-layer network. I highly encourage you to take a look at the website at the bottom of the screen, where you can interact with the tools built to diagnose networks using gradient ascent approaches, as well as learn more about them and see more examples. #SLIDE We're not quite done yet, but because we're going to be switching gears a bit I do want to provide a summary of where we are so far. We've covered a range of different approaches to understanding CNNs, including dimensionality reduction of our output feature vectors; identifying the patches of images that cause individual filters to activate; intentionally occluding parts of the image to determine the areas that are most important for a given classification; using backpropagation to establish how much changing a single pixel matters; and, finally, gradient ascent to determine the image that would maximize a given class score.
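Before we switch gears, here is a minimal sketch of that gradient ascent idea - freeze the weights, then repeatedly nudge the pixels to push one class score up. It assumes PyTorch and torchvision; the step size, regularization strength, iteration count, and target class index are made-up placeholder values, and real versions add the blurring and clipping tricks mentioned above:

```python
# Minimal sketch of gradient ascent on the input: the weights are frozen and
# we repeatedly nudge the pixels to increase one class score. The step size,
# L2 penalty, iteration count, and class index are placeholder choices.
import torch
import torchvision

model = torchvision.models.vgg16(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)                   # freeze the network

class_idx, step, l2_reg = 76, 1.0, 1e-4       # pick whichever class you want to maximize
image = (0.1 * torch.randn(1, 3, 224, 224)).requires_grad_(True)

for _ in range(200):
    score = model(image)[0, class_idx] - l2_reg * (image ** 2).sum()
    score.backward()
    with torch.no_grad():
        image += step * image.grad            # ascend: move pixels to raise the score
        image.grad.zero_()
# `image` now holds a pattern the network scores highly for the chosen class;
# periodic Gaussian blurring and clipping make the result look much nicer.
```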
#SLIDE Alright! Now we're going to change gears a bit and discuss deep reinforcement learning. The fundamental idea of reinforcement learning is to update our model as it learns more about the behavior of the system it's trying to engage with, as that system changes into different states. Taking checkers as an example, you can imagine an algorithm we design that has three characteristics. First, we have a state, which is - for each turn of the game - the positions of the pieces. Second, we have the action the model can choose between - i.e., where to move a piece on the board. And, finally, we have some reward - a binary 1 or 0 for whether you win or lose the game. The idea is to calibrate a model that can adapt to different states, and then make decisions to reach a given outcome. This is the same fundamental strategy used across all sorts of applications, the most prominent being robotics. #SLIDE We can formalize a reinforcement learning problem as an iterative relationship between an agent - i.e., some algorithm that makes a decision about which actions to take - and an environment - i.e., something else (it can even be a black box) that responds to those actions. This formalization allows us to determine the state of the environment (s) at any given time step, as well as any reward (positive or negative) the environment might give in response to a given action (a). You can imagine any number of different ways to construct this agent-environment relationship, ranging from giving rewards every time something specific changes (i.e., a checker is taken; a pawn is removed) to updating the environmental state due to elements outside of the action (i.e., for a robot, a new rock has moved into the field of vision after a step was taken). #SLIDE The most common way to formalize this is a Markov decision process. This is predicated on the Markov property. A process is said to have the Markov property if the conditional probability distribution of the future state of the process depends only on the current state. A simple example of this is checkers - once a piece is moved, all future possibilities are constrained by that fact, i.e., the way the game is played after a piece is moved is always dependent on how that piece moved. You don't need to know anything about the state of the board before the piece moved to predict what possibilities still exist after the piece moved. This is true of most games we play - chess, and even things like Texas Hold'em. This contrasts with processes in which some information about past states is needed to predict the next state - a common example is trying to solve for how fast a runner is going to move over the next 10 seconds. This requires information not just about the state of the runner (i.e., where they are, how fast they are moving), but also information about - for example - how long the runner has already been running, and thus how tired they are likely to be over the coming 10 seconds. #SLIDE A Markov decision process can be defined by a tuple of five objects: S, the set of possible states; A, the possible actions; R, our distribution of rewards for each state/action pair; P, the transition probability distribution over next states given a chosen state/action pair; and gamma, the discount factor, which captures how much we value rewards now (i.e., at this step) versus later. By combining these objects, we are able to fully define how our system behaves and reacts to different sequences of actions. #SLIDE We implement a Markov decision process through a five-step loop, in which we repeat the latter four steps until some convergence or objective function is satisfied (i.e., a game is won). First, and only once, the environment is defined as some initial state s_0. While s_0 can be treated as a sample drawn from some probability distribution over possible initial states, the easiest way to think about this is a chessboard - the pieces all arranged and ready for war. #SLIDE Once the environment is defined, the agent must choose an action to perform, defined as a_t. This can be pushing the "jump" button in a Mario game, or moving a pawn forward. #SLIDE Once an action is chosen, the environment then identifies the reward that the action returns, given the current state of the environment. Again, this is defined as sampling from the distribution of all possible rewards, given a state and action. This could be receiving a reward for taking a piece from the opponent. In the case of Super Mario, it is generally defined as moving forward along the level. Essentially, the reward provides the algorithm with information about how to optimize its play at any given state. #SLIDE Now, the environment samples the next state, s_{t+1}, given the action. This again is a probability distribution we're sampling from - i.e., given that white moved a pawn forward, there are 20 valid responses black could make (each of the 8 pawns can advance one or two squares, and each of the 2 knights has two legal moves). So, we sample from that according to some probability function, and the environment returns the black player's move.
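As a toy illustration of the (S, A, R, P, gamma) tuple and the sampling steps we just walked through, here is a minimal sketch with two made-up states and two made-up actions. All of the probabilities and rewards here are invented purely for illustration:

```python
# Minimal sketch of the (S, A, R, P, gamma) tuple for a toy two-state problem,
# plus one pass through the sampling steps described above. Every number here
# is made up purely for illustration.
import random

S = ["studying", "procrastinating"]
A = ["keep_going", "take_break"]
gamma = 0.9                                    # discount factor

# P[(s, a)]: distribution over next states; R[(s, a)]: reward for that pair.
P = {
    ("studying", "keep_going"):        {"studying": 0.8, "procrastinating": 0.2},
    ("studying", "take_break"):        {"studying": 0.3, "procrastinating": 0.7},
    ("procrastinating", "keep_going"): {"studying": 0.6, "procrastinating": 0.4},
    ("procrastinating", "take_break"): {"studying": 0.1, "procrastinating": 0.9},
}
R = {("studying", "keep_going"): 1.0, ("studying", "take_break"): 0.2,
     ("procrastinating", "keep_going"): 0.5, ("procrastinating", "take_break"): 0.0}

s = "studying"                                 # the initial state s_0
a = random.choice(A)                           # the agent picks an action a_t
r = R[(s, a)]                                  # the environment returns reward r_t
dist = P[(s, a)]                               # ...and samples the next state s_{t+1}
s_next = random.choices(list(dist), weights=list(dist.values()))[0]
print(s, a, r, s_next)
```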
#SLIDE Finally, the environment sends the information about r_t and s_{t+1} back to the agent, and the process is repeated. #SLIDE Looking at this loop for a moment, we have three other important terms. First is our policy, pi, which determines what action to take given a state and a selection of possible actions. Second is our constraints, omega, which identify actions that cannot be taken given a specific state. And third is the objective function on the screen - the discounted sum of rewards, \sum_{t} \gamma^{t} r_{t}, in which the reward at each time step t is multiplied by our discount rate gamma raised to the power t (with no reward at time step 0, before any actions are taken). The goal is to find the policy pi, given constraints omega, that results in the highest total reward. #SLIDE The most common way to find an optimal pi is through something called Q-Learning, which is effectively the use of a function to estimate the action-value function: i.e., given a state and an action, what total discounted reward do we anticipate? Because a deep neural network is a natural way to build this sort of model, this frequently takes the form of Deep Q-Learning. We're going to walk through this with an example of everyone's favorite Italian plumber, Mario! #SLIDE Let's break our problem down into state, action and reward. In this case, our state is going to be the raw pixels of the game - i.e., a given frame frozen in time, so we'll process every frame. Our actions available at any time are to jump, push the left button, or push the right button (again, we make this decision once per frame). Finally, our reward increases if we move farther right (towards the end flag), and decreases if we move left. #SLIDE This translates into the need for a network architecture that looks something like this - first, some convolutional filters and activation functions to extract the information from a frame, and then an affine or fully connected layer that assigns three scores - one for each of our actions. The goal of our network is to identify the weights that are most helpful for making the decision between the three possible actions, based only on the pixel information in the state. #SLIDE Once we've made our action, we then record the next frame of our game - i.e., the state at t+1. This is used to compute the reward, r_t, for that action. In the Mario case, this would simply be the number of pixels advanced to the right (a positive reward), or a negative reward if we're moving left. #SLIDE Now, we calculate loss. Loss in the case of a Q-Learning architecture is different, because we don't have any kind of "truth" to compare to - that is, we don't know in advance what the correct action (or sequence of actions) was. So, instead, we estimate the best total score we *could* still achieve from a given time step, and contrast that with our current estimate; we seek to minimize the difference between the two. Because we're minimizing this loss, we can then backpropagate through our network just like we would with any other loss function! #SLIDE Finally, we update our state, and repeat the process until the finish line (...or we fall into a pit and die). While there is a tremendous amount more to deep reinforcement learning, these methods all follow the basic principles you've learned throughout this course, with the added challenge of determining a correct action.
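Putting the Mario discussion together, here is a minimal sketch of one deep Q-learning update: a small convolutional net scoring the three actions, and a loss comparing our current Q-value against the "best score we could have" target (the observed reward plus the discounted best next-state Q-value). The frame size, reward, network shape, and discount factor are all placeholder values, and a real implementation would add things like a replay buffer and a separate target network:

```python
# Minimal sketch of a single deep Q-learning update for the Mario example.
# Frame size, network shape, reward, and gamma are placeholder values.
import torch
import torch.nn as nn

q_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 9 * 9, 3),                # one Q-value per action (jump / left / right)
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
gamma = 0.99

state = torch.rand(1, 3, 84, 84)             # current frame (placeholder pixels)
next_state = torch.rand(1, 3, 84, 84)        # frame recorded after the action
action = torch.tensor([2])                   # say we pushed "right"
reward = torch.tensor([1.0])                 # we moved right, so a positive reward

q_current = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)
with torch.no_grad():
    # "Best score we could have": reward now plus the discounted best Q-value
    # achievable from the next state.
    target = reward + gamma * q_net(next_state).max(dim=1).values

loss = nn.functional.mse_loss(q_current, target)   # minimize the gap...
optimizer.zero_grad()
loss.backward()                                    # ...by backpropagating as usual
optimizer.step()
```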
RNNs, for example, are very commonly used in Q-Learning models, as are attention-based models. #SLIDE And, that's it! Today was a busy day, with an introduction to different techniques for visualizing network behavior, Markov decision processes, and deep Q-Learning. But, as a reward, I'll leave you with this: Mario, as played by a computer. Until next time!