Welcome back to DATA 442!  As usual, let's get started with a brief recap.
#SLIDE
Last week, we introduced the idea of a recurrent neural network, or RNN.  Unlike the networks we've looked at before in the course, an RNN takes in sequences of data - e.g., frames in a video - and makes a prediction on the basis of that sequence.  This lets us understand processes - e.g., a balloon popping - that aren't evident in a single frame.
#SLIDE
We also talked in detail about the limits of an RNN, and the most common solution to those (the LSTM, or Long Short-Term Memory network).  The big thing to remember is that a vanilla RNN, due to the nature of its architecture, will be highly susceptible to vanishing or exploding gradients.  The LSTM helps mitigate the vanishing gradients issue under a range of circumstances.
#SLIDE
Alright - today, we're going to be introducing Generative Models, but before we start down that road let's take a few steps back.  To date in this course, we've focused nearly entirely on classification models, in which we have some set of training data (e.g., pictures of cute, cuddly cats) that we use to calibrate model weights by trying to minimize some loss function.
#SLIDE
This then allows us to - at least if we have a good model - put a test image of a cat into our model, feed it forward, and get the output of the word "Cat". 
#SLIDE
Generative Models are after something fundamentally different.  
#SLIDE
Rather than seek to classify input images, in the case of generative models we're instead trying to generate new samples that look as if they were drawn from the same distribution as our inputs.  So, if our input images are cats, we would seek to generate outputs that... look like cats!  Generative models are not just used for imagery, however; you can apply the same concepts to, for example, written language.
#SLIDE
So - why would we want to generate fake samples of things?  There are plenty of practical, and plenty of rather scary, reasons why.  On the left you see the outputs of one of the most advanced generative networks available today.  Developed by NVIDIA and released in 2020, this network creates faces of people that don't exist, by learning the distribution of real faces and sampling from it to generate new faces.  They have also created various knobs you can use to sample the distribution intentionally - e.g., imagine being able to see exactly what you would look like if you shaved your head, dyed your hair, or made any other change.  This has real implications for everything from commercial companies all the way to law enforcement.
#SLIDE
The last few years have also shown the increased power of so-called "Deep Fakes", generated videos that seem to show individuals doing things that they never did.  While this technology is still *very* young, it's already becoming believable.  Generative networks are the functional building blocks that enable this type of information warfare.
#SLIDE
And then, of course, we have the wonderful meme of "enhancing" images.  Generative networks are very good at enhancing an image by up-sampling based on a wide range of other samples - essentially teaching a computer what pixels would have looked like if the camera taking the picture had a better lens.
#SLIDE
There is a wide range of different generative model types, but broadly they fall into two categories: those that explicitly try to model the distribution you're sampling from, and those that implicitly try to recreate it.  For example, in your lab, you'll work with a Generative Adversarial Network (GAN), which reads in data and then attempts to create images that fool a secondary model.  In this approach, you're never explicitly defining any probability distribution; rather, it's implicit.  This contrasts with approaches like variational autoencoders, which seek to approximate the probability distributions.
#SLIDE
Today, we're going to go over one of each class of generative model to better elucidate the differences between the approaches to generating new data.
#SLIDE
First we're going to talk about PixelRNN and PixelCNN.  PixelRNNs and PixelCNNs are a type of fully visible belief network, which means that we are explicitly modeling our probability density.  In this case, we have our image data (i.e., a matrix of pixels), and we want to establish the probability that that image data would occur.  To do this, we decompose the likelihood into a product of 1-d distributions - i.e., for each pixel, the probability that the pixel takes a certain value, conditioned on the values of all the pixels that come before it in some ordering. This decomposition is represented on the right-hand side of the equation on this slide now.
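To make the decomposition concrete, here is a minimal sketch in Python. It assumes a tiny binary image flattened into a list, and it uses a made-up conditional rule (`conditional_prob_one`, a hypothetical stand-in for what a PixelRNN or PixelCNN would learn) purely for illustration. The only thing it is meant to show is that the likelihood of the whole image is the product of the per-pixel conditionals, or equivalently the sum of their logs.

```python
import math


def conditional_prob_one(previous):
    """Hypothetical toy model of P(x_i = 1 | x_1, ..., x_{i-1}).

    Assumption for illustration only: a pixel is more likely to be "on"
    when the pixel just before it is "on". A real PixelRNN/PixelCNN would
    learn this function with a neural network.
    """
    if not previous:
        return 0.5
    return 0.8 if previous[-1] == 1 else 0.2


def log_likelihood(pixels):
    """log p(image) = sum over i of log p(x_i | x_1, ..., x_{i-1})."""
    total = 0.0
    for i, value in enumerate(pixels):
        p_one = conditional_prob_one(pixels[:i])
        total += math.log(p_one if value == 1 else 1.0 - p_one)
    return total


# A 2x2 binary image, flattened row by row starting at the upper-left.
image = [1, 1, 0, 0]
print(math.exp(log_likelihood(image)))  # 0.5 * 0.8 * 0.2 * 0.8, about 0.064
```

Because every conditional is a proper distribution, the probabilities of all sixteen possible 2x2 binary images add up to one, which is what makes this an explicit density model.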
#SLIDE
One challenge with this approach is that, as you can imagine, the probability distribution we're trying to model is very complex - i.e., for every pixel in the image, we're conditioning on all of the pixels that come before it.  A PixelRNN or PixelCNN essentially recognizes that complexity, and uses a neural network to attempt to represent these distributions. To think about this very broadly - in your introductory statistics course, you learned about the so-called 'normal' distribution, which is a curve that is centered around some mean, with some area under the curve falling within a certain prescribed number of standard deviations.  This curve is represented as a function, and you can evaluate it at any point.  We're doing the same thing here - except, because we have such a complex distribution, we're using a neural network to capture the distribution.
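As a quick reminder of what "a distribution represented as a function" means, here is a small Python sketch of the normal curve. The function names are my own for illustration; the formulas are the standard normal density and the standard result for the area within k standard deviations of the mean.

```python
import math


def normal_pdf(x, mean=0.0, std=1.0):
    """Height of the normal curve at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def prob_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


print(normal_pdf(0.0))   # the peak of the standard normal curve
print(prob_within(1.0))  # roughly 68% of the area
print(prob_within(2.0))  # roughly 95% of the area
```

A PixelRNN or PixelCNN plays the same role as `normal_pdf` here: you hand it a value (and the pixels it is conditioned on), and it returns a probability. The difference is that the function is a trained neural network rather than a two-parameter formula.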
#SLIDE
PixelRNN, introduced in 2016, defines one approach for modeling this probability density.  Take the example pixels of an image on the right. In the PixelRNN, an RNN is fit with every pixel as an input.  Remember that in an RNN, we have to have some notion of ordering - e.g., the frames in a video.  Of course, an image doesn't have an inherent order - it's just a 2D surface.  So, the PixelRNN uses a model architecture in which an LSTM is fed an initial point of data - the upper-left-hand corner - and then every subsequent pixel is fed all prior pixels.
#SLIDE
So, the second and third pixels would take as input the first pixel...
#SLIDE
And the fourth, fifth, and sixth pixels would take in their neighbors' history, and so on until all pixels are fit.  Needless to say, this would generate a VERY long, multi-layer RNN, and so is a very slow approach to generating images.
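To see why generation is slow, here is a rough Python sketch of pixel-by-pixel sampling. Two loud assumptions: it visits pixels in simple row-by-row order rather than the diagonal order just described, and `conditional_prob_one` is a hypothetical toy rule standing in for the trained LSTM. The point is only the shape of the loop: each pixel has to wait for every pixel before it.

```python
import random


def conditional_prob_one(previous):
    """Hypothetical stand-in for the trained network: P(next pixel = 1 | history)."""
    if not previous:
        return 0.5
    return 0.8 if previous[-1] == 1 else 0.2


def sample_image(height, width, rng):
    """Generate a binary image one pixel at a time, starting at the upper-left."""
    pixels = []
    for _ in range(height * width):
        # Each new pixel needs the full history, so this loop cannot be
        # run in parallel across pixels.
        p_one = conditional_prob_one(pixels)
        pixels.append(1 if rng.random() < p_one else 0)
    # Reshape the flat list back into rows.
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


rng = random.Random(0)
img = sample_image(4, 4, rng)
for row in img:
    print(row)
```

For a 4x4 image that is 16 sequential model evaluations; for a 256x256 color image it is on the order of hundreds of thousands, each one a full pass through the network.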
#SLIDE
A PixelCNN seeks to estimate the same type of probability - i.e., for the red X here, what is the probability of it being a given value, conditional on all previous pixels?  In this case, we're still starting from the upper-left-hand corner, but for each X we're passing a filter across the pixels that it is being conditioned on, and using that information to generate the probability distribution. So, thinking about our probability function at the upper-right, what we seek to do is identify the filter weights that provide the maximum likelihood of the input training images.  Put another way - what convolutional weights, when plugged into our probability function, result in values as close as possible to our input matrices?  In practice, this is fit via a softmax loss function within a PixelCNN, where every pixel estimate receives a loss which is backpropagated through the CNN.  While this approach, similar to the PixelRNN, can produce fairly strong results, generating an image is still relatively slow due to the sequential nature of the calculations.
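One common way to make a filter look only at the pixels it is allowed to condition on is to zero out part of the filter with a mask. The sketch below is my own illustration of that masking idea (the function name and the `include_center` flag are assumptions, not something from the slides): it keeps the filter cells above the center row and to the left of the center in the center row, and drops everything else.

```python
def causal_mask(kernel_size, include_center=False):
    """Build a 0/1 mask for a square filter.

    A cell is kept (1) only if it lies above the center row, or in the
    center row strictly to the left of the center. Those are the pixels
    that come earlier when scanning from the upper-left corner.
    """
    center = kernel_size // 2
    mask = []
    for r in range(kernel_size):
        row = []
        for c in range(kernel_size):
            earlier = r < center or (r == center and c < center)
            is_center = r == center and c == center
            row.append(1 if earlier or (include_center and is_center) else 0)
        mask.append(row)
    return mask


for row in causal_mask(3):
    print(row)
# [1, 1, 1]
# [1, 0, 0]
# [0, 0, 0]
```

Multiplying the filter weights by this mask before each convolution means the prediction for the red X never sees the X itself or anything after it, which is what keeps the product of conditionals a valid likelihood.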
#SLIDE
To briefly recap the PixelRNN and PixelCNN - this is our example where we are explicitly calculating a likelihood for each pixel, given the pixels that precede it.  This is powerful in helping us understand model performance, but can be relatively slow.  This is a path of extensive inquiry right now, with a wide range of models being created.  One note that we won't cover here but you may encounter is an idea called 'attention', which essentially limits the number of pixels considered in the probability calculation to a smaller, local neighborhood, and then ascribes some number of pixels to a long-term memory.  This and many other explicit probability approaches are proving very powerful.
#SLIDE
Now on to Variational Autoencoders (VAE)! PixelCNNs defined a density function we could solve for using maximum likelihood.  With variational autoencoders, we're going to use an intractable density function with an additional latent variable z (more on that in a second), and so now our data likelihood is this integral - the expectation over all possible values of z.  This is, obviously, a problem - we can't optimize this directly, so instead we derive and optimize a lower bound on the likelihood. 
#SLIDE
So, one step back. Let's talk a little about what an encoder is in the first place.  Essentially, the idea of an encoder is some function that takes in your data, X, does ... something ... to it, and then extracts features z.  
#SLIDE
The encoder itself can take a wide range of forms - the one you'll be most familiar with in this course is a convolutional neural network.  For example, if you do a forward pass through a CNN, you could output one maxpool value per filter.  That would be one type of encoder, where you're extracting features from some input.  Z is almost always smaller than X, with the goal of summarizing the most relevant features in our input.
#SLIDE
The way an autoencoder is designed is with the goal of creating some smaller set of features (z) that can be fed into a second algorithm - a decoder - that can recreate the original input.  If it is possible to recreate the original input from some reduced-dimension set of Z features, then the logic is that the representation Z is effective.
#SLIDE
So, in an autoencoder we would do something like this - pass our X estimates into a loss function, compare how similar they are to the true X, and then use backpropagation to update our decoder and encoder.  Note that in this approach there are no labels - just an input X.
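Here is a deliberately tiny sketch of that training loop in plain Python. Everything in it is an assumption made for illustration: the data are 2-D points that all sit on one line, the encoder and decoder are single linear maps rather than CNNs, and the learning rate and step count are arbitrary. What it does show is the loop itself: encode X to Z, decode Z back to an estimate of X, measure reconstruction error, and update both sets of weights.

```python
# Toy data: 2-D points on the line x1 = 2 * x0, so one latent number z
# is enough to describe each point. Note there are no labels.
data = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]

enc = [0.5, 0.1]  # encoder weights: z = enc[0]*x[0] + enc[1]*x[1]
dec = [0.1, 0.5]  # decoder weights: x_hat[j] = dec[j] * z


def reconstruction_loss(enc, dec):
    """Mean squared error between each input and its reconstruction."""
    total = 0.0
    for x in data:
        z = enc[0] * x[0] + enc[1] * x[1]
        total += sum((dec[j] * z - x[j]) ** 2 for j in range(2))
    return total / len(data)


initial_loss = reconstruction_loss(enc, dec)
learning_rate = 0.05
for _ in range(500):
    grad_enc = [0.0, 0.0]
    grad_dec = [0.0, 0.0]
    for x in data:
        z = enc[0] * x[0] + enc[1] * x[1]            # encode
        err = [dec[j] * z - x[j] for j in range(2)]  # decode and compare
        dz = 2.0 * (err[0] * dec[0] + err[1] * dec[1])
        for j in range(2):
            grad_dec[j] += 2.0 * err[j] * z / len(data)
            grad_enc[j] += dz * x[j] / len(data)     # chain rule through z
    enc = [enc[j] - learning_rate * grad_enc[j] for j in range(2)]
    dec = [dec[j] - learning_rate * grad_dec[j] for j in range(2)]

final_loss = reconstruction_loss(enc, dec)
print(initial_loss, final_loss)
```

The reconstruction loss should drop from its starting value toward zero, because this particular data really can be summarized by a single number per point.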
#SLIDE
The point of all of this is once you have a well trained encoder, you can remove the decoder and simply use the features Z as input into any model you might want - i.e., a classification model.  Essentially, we have a better input than X, as we've reduced the dimensionality without losing (hopefully) any meaningful information.
#SLIDE
So - the question, insofar as generative networks are concerned, is whether we can use this process to generate new samples - essentially, can we turn the decoder into a probabilistic process that allows us to sample from possible outcomes to generate new data?  Hence a "Variational" autoencoder.
#SLIDE
The first thing we're going to do is assume that all of our data (X) is generated from some unknown representation (i.e., latent) z.  Think back a slide - imagine you built an autoencoder that was able to perfectly re-create a face.  You would input the whole image of the face, with every value (X) being represented.  This would be reduced down to some number of Z values, which you then put into a decoder to re-build the face.  Those Z values would have to represent important parts of the face - i.e., hair color, hair style, etc.  So, to create a new face, we can imagine sampling different "Z" parameters to capture different types of features.
#SLIDE
Let's pick this apart a bit.  On this slide we have step 1, which is about our assumptions.  In this case, we're assuming our data is generated from some underlying process based on an input of Z; we don't actually know what this decoder is.  We're assuming it exists, but would need to solve for it.  This is the algorithm that tells us - when Z changes - how our output X might change.  The second part of this is our Z.  To solve for the decoder, we're going to assume Z is drawn from some prior and sample from it; generally, this is a Gaussian.  Finally, in part 3, we assume there is some conditional probability of X being output, given a Z - this is obviously a much more complicated probability distribution, so we'll end up representing it as a neural network.  The trick is, given these assumptions, how do we solve for the correct parameters for our decoder (which we'll refer to as Theta)?
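The three assumptions can be written out as a short sampling procedure. This is only a sketch: `toy_decoder` and its weights are hypothetical placeholders for the decoder network we would actually have to learn, and I'm assuming a 2-D latent Z and four binary "pixels" to keep it readable.

```python
import math
import random


def toy_decoder(z):
    """Hypothetical stand-in for the decoder network.

    Maps a 2-D latent z to the probability that each of 4 pixels is "on".
    The weights here are made up; in a VAE they are the parameters theta.
    """
    weights = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 1.0)]
    return [1.0 / (1.0 + math.exp(-(w0 * z[0] + w1 * z[1]))) for w0, w1 in weights]


def generate(rng):
    # Draw the latent z from a simple Gaussian prior p(z).
    z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    # The decoder turns z into the conditional distribution p(x | z).
    probs = toy_decoder(z)
    # Sample an output x from that conditional distribution.
    x = [1 if rng.random() < p else 0 for p in probs]
    return z, x


rng = random.Random(0)
z, x = generate(rng)
print(z, x)
```

Generating new data is the easy direction. The hard part, which the next few slides deal with, is finding decoder parameters that make the real training data likely under this process.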
#SLIDE
Ok - so how do we actually train this network?  We want our decoder to be able to take in any Z values, and then output an X that looks reasonably like our input data.  In our PixelCNN and PixelRNN, the way we did this was by solving for the maximum likelihood of our probability function; unfortunately, the integral here is intractable as we're trying to integrate over a continuous value (z).  Let's briefly walk through why this is the case.
#SLIDE
First, we have p(z), which we can assume is a simple distribution, something like a Gaussian prior.  Nothing to see here.
#SLIDE
The second thing is the probability of an output X, given the inputs Z.  We'll seek to represent this with our decoder neural network, so again - all good, move along.
#SLIDE
The problem is the integral - to calculate this, we would need to calculate p(x|z) for every Z, which is an intractable problem.  This means that the posterior density calculation is also intractable - i.e., we can't solve for the probability of a given Z given X.
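To get a feel for what that integral is asking, here is a one-dimensional toy version in Python where we can actually check the answer. The setup is an assumption chosen so the math works out by hand: Z is a single number drawn from a standard normal, and the "decoder" is just p(x|z) = Normal(z, 1). In that special case the integral has a known closed form, p(x) = Normal(0, sqrt(2)), so we can compare it against a brute-force average over sampled Z values.

```python
import math
import random


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def marginal_likelihood_mc(x, num_samples, rng):
    """Approximate p(x) = integral of p(z) p(x|z) dz by averaging over z ~ p(z)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)          # z drawn from the prior p(z) = N(0, 1)
        total += normal_pdf(x, z, 1.0)   # toy decoder: p(x | z) = N(z, 1)
    return total / num_samples


rng = random.Random(0)
estimate = marginal_likelihood_mc(1.0, 20000, rng)
exact = normal_pdf(1.0, 0.0, math.sqrt(2.0))  # closed form for this toy case only
print(estimate, exact)
```

With one latent dimension and a trivial decoder, twenty thousand samples gets close. With a high-dimensional Z and a neural network decoder, almost every sampled Z explains a given image terribly, so this kind of brute-force averaging stops being practical, and there is no closed form to fall back on.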
#SLIDE
So, what are we to do?  The solution lies in creating a second neural network.  Just like we represent the probability of X given Z with a decoder neural net, we can also represent an estimate of the posterior density, p(z|x), using a neural net which takes X as an input - this is our encoder.  This allows the derivation of a lower bound on the image likelihood at the upper-right of this slide, and that bound has a tractable solution.  I won't go into the details of this derivation in this lecture, but if you're interested, the article Auto-Encoding Variational Bayes in the lecture notes has full details on how this posterior inference can be made.  Importantly, once you have this encoder neural net, it allows you to optimize the parameters theta.
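For reference, the lower bound in question is usually written as below. The notation is an assumption on my part: theta for the decoder parameters as on the slides, and phi for the encoder parameters, with q standing for the encoder's estimate of the posterior.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\;
D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

The first term rewards the decoder for reconstructing x from Z values the encoder proposes; the second keeps the encoder's proposals close to the prior p(z). Both terms can be estimated from samples, which is what makes the bound something we can train against.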
#SLIDE
Going back to the purpose of the autoencoder, once theta is optimized against a given set of data, manipulating Z will allow you to create arbitrary outputs (X), representing different views of the data. One of the neat things about VAEs is that, as you change the Z inputs in different dimensions, you can see the features that a given Z was measuring.  Here's an example of two dimensions - the vertical axis represents a Z value that is changing hair color; the horizontal axis represents a Z value that is controlling sunglasses.  You can see as the dimensions change, slightly different faces are generated.  You can arbitrarily sample the Z space to generate any number of images once your VAE is parameterized.  One other neat thing: unlike our PixelRNN and PixelCNN, these images are all generated at once (i.e., not pixel-by-pixel), so generation is multiple orders of magnitude faster.
#SLIDE
Ok - let's move on to talk about Generative Adversarial Networks, or GANs.  GANs are fundamentally different from the PixelCNN, PixelRNN, and VAE approaches, in that they do not seek to solve for the probability distributions.  Instead, they seek to sample the space created by Z through a 2-player game, where both of our players are neural networks.
#SLIDE
The way we'll actually train a GAN relies on our two players.  The first player is our generator network, which is a neural network we architect to generate real-looking images (with an input of Z, our random noise).  The second network is our discriminator network, which tries to distinguish between real and fake images. 
#SLIDE
The intuition of the setup of a GAN is fairly straightforward - we want to take some random noise in (Z), run it through a generator network that learns to take random noise and transform it into something similar to our input training data, and finally output an image.  These outputs would represent a sample from our training distribution - not the training dataset itself, but the distribution of the data. 
#SLIDE
The complete architecture of a GAN looks like this.  First (at the lower-left), we have our input Z - which is, again, just random noise.  We have some generator network that takes that noise and transforms it into (eventually) pictures that look something like a cat.  These fake images are then passed into the discriminator network, alongside a batch of real images of cats.  The discriminator network then attempts to correctly classify each cat as real or fake, which is the final output of the GAN.  The idea is if we have a very good discriminator, it can do a good job of determining real vs. fake - then, if our generator network does well and can fool the discriminator, then the discriminator will change more, causing the generator to change more... and so on.
#SLIDE
Ok - so now let's think about how this would actually be trained.  Let's start with the first case - loss is low, indicating that the discriminator network is doing a good job distinguishing between fake and real cats.  This is good news for our discriminator, but bad news for our generator!  So - if loss is low, we want our discriminator to change relatively little, but our generator to change a lot.  Conversely, if our loss is high, we want our discriminator to change more rapidly, and our generator can change more slowly.  This is done following a minimax objective function, in which we alternate between a gradient ascent (yes, ascent!) on the discriminator, and a gradient ascent on the generator, where we intentionally are seeking to maximize the likelihood of the discriminator being wrong.  While it's very interesting, I won't go into the details of the loss functions or backpropagation here, but you'll get a hands-on chance to work with these implementations in your assignment, and can also read through the Goodfellow piece shared in the lecture notes.  I definitely encourage those of you interested in a really clever implementation of a minimax function to dig deeper.
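Without going into backpropagation, it can still help to see the two quantities being pushed uphill. The sketch below assumes the discriminator outputs a probability that an image is real, and uses made-up output values; the function names are mine. The discriminator objective is the standard one from the Goodfellow paper, and the generator objective is the "maximize the chance the discriminator is wrong" form described above.

```python
import math


def discriminator_objective(d_real, d_fake):
    """Average of log D(x) on real images plus log(1 - D(G(z))) on fakes.

    The discriminator takes gradient ascent steps on this value.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_objective(d_fake):
    """Average of log D(G(z)): how strongly the fakes are judged to be real.

    The generator takes gradient ascent steps on this value.
    """
    return sum(math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs (probability that an image is real).
real_scores = [0.9, 0.8]
caught_fakes = [0.1, 0.2]      # discriminator is not fooled
convincing_fakes = [0.7, 0.6]  # discriminator is mostly fooled

print(discriminator_objective(real_scores, caught_fakes))
print(discriminator_objective(real_scores, convincing_fakes))
print(generator_objective(caught_fakes))
print(generator_objective(convincing_fakes))
```

When the fakes are caught, the discriminator's objective is high and the generator's is low; when the fakes are convincing, the situation flips. Alternating ascent steps on the two objectives is the 2-player game.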
#SLIDE
OK - let's talk a little more about the GAN architecture itself.  Remember, we have two networks: the generator and the discriminator.  It's very easy to reach some really implausible solutions in this case, so taking a lot of care to ensure your generator and discriminator are both well specified is important.  While a discriminator is essentially going to be a 'traditional' CNN, our generator is going to do some tricks we haven't seen before.
#SLIDE
The biggest new element is the idea of a transposed convolution.  Remember back to when we were discussing convolutional neural nets.  In the example on the screen, we have a 2x2 filter; the four values in the 2x2 filter are being summed (i.e., they all have equal weights of 1), and a value of 7 (the sum of all values) is added to our activation surface. 
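As a refresher, here is that kind of convolution in a few lines of Python. The pixel values are made up for illustration (they are not the ones on the slide); I chose them so the upper-left 2x2 patch sums to 7.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


image = [
    [1, 2, 0],
    [3, 1, 1],
    [0, 2, 4],
]
ones_filter = [[1, 1], [1, 1]]  # all weights equal to 1, so each output is a plain sum

print(conv2d_valid(image, ones_filter))  # [[7, 4], [6, 8]]
```

Four input pixels collapse into one number on the activation surface, so the output is smaller than the input.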
#SLIDE
A transposed convolution does exactly the opposite.  Here, we have the number 7 in our activation surface, and we want to distribute that number 7 across the four pixels in red.  
#SLIDE
This is done by ascribing a weight to each cell of our filter, and then using those weights to transform (in this case, multiply) the input value on the activation surface, "upsampling" from one pixel to four pixels.  The challenge in the context of a GAN is to find the weights that best result in an image as similar as possible to the input training data.
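A minimal Python sketch of that upsampling step is below. The filter weights are made-up numbers, and the stride of 2 is an assumption so that neighboring outputs don't overlap; in a GAN the weights are exactly what gets learned.

```python
def conv_transpose2d(activation, kernel, stride=2):
    """Spread each activation value over a kernel-sized patch of the output."""
    kh, kw = len(kernel), len(kernel[0])
    in_h, in_w = len(activation), len(activation[0])
    out_h = (in_h - 1) * stride + kh
    out_w = (in_w - 1) * stride + kw
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(in_h):
        for c in range(in_w):
            for i in range(kh):
                for j in range(kw):
                    # Each output cell receives (input value) * (filter weight);
                    # overlapping patches, if any, are added together.
                    out[r * stride + i][c * stride + j] += activation[r][c] * kernel[i][j]
    return out


weights = [[0.1, 0.2], [0.3, 0.4]]  # hypothetical filter weights
upsampled = conv_transpose2d([[7.0]], weights)
for row in upsampled:
    print(row)
```

The single value 7 becomes a 2x2 patch of roughly 0.7, 1.4, 2.1, and 2.8. Feeding in a 2x2 activation surface with the same filter and stride would produce a 4x4 output, which is how a generator grows a small grid of numbers into a full-size image layer by layer.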
#SLIDE
There are a number of rules-of-thumb for how many layers to introduce into your networks (discriminating and generating), but effectiveness is very, very task-dependent, and identifying good architectures for GANs is a very active area of inquiry.  A few important ones are to use batch normalization, avoid fully connected layers if you have a deep net, and to use LeakyReLU in your discriminator and ReLU in your generator.  We know relatively little about the exact interactions between these approaches, but a deep dive into this topic is available in the lecture notes in the reading "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks".
#SLIDE
There are hundreds of GANs out there tailored to different purposes, with some of the most impressive over the last year coming out of the NVIDIA and Google research groups.  Increasingly, transfer learning potential appears to be emerging with GANs, suggesting that a GAN designed for one topic area may be able to be extended to others, even if they are apparently far afield.  This process of improvement is mirroring the evolution of convolutional classification algorithms so far, to the point where today you can download the trained network that generated the faces you see here off-the-shelf, in either PyTorch or TensorFlow implementations!
#SLIDE
That's it for today!  Just as a brief recap, today we touched on the general notion of a generative network, introducing the idea and value of creating synthetic data that is similar to an input training dataset.  We then discussed the implementation of a PixelRNN and a PixelCNN, along with the pros/cons of the approach.  I introduced the Variational Autoencoder, and finally closed with an example of how a GAN is implemented.  Hope you enjoyed it, and I'll see you next time!


p(image) = \prod_{i=1}^{n}p(x_{i}|x_{1},...,x_{i-1})
p(image) = \int p(z)p(x|z)dz