Welcome back to DATA 442! Today we're going to start talking about some specific software for deep learning, and related architectures. #SLIDE To briefly recap from last lecture, we first covered a range of methods for how a learning rate can be selected, including step and exponential decay, and moved on to some optimization techniques that did not require step size, focusing on BFGS. #SLIDE We then discussed a range of techniques that can help improve the performance of our model on data it hasn't seen yet, including regularization, image augmentation, transfer learning and more. #SLIDE As you engage with assignment 2, one of the big things you'll notice is that our models are starting to get a lot more computationally demanding. So, let's talk a little bit about just how some of those challenges can be overcome, and frameworks that can help. This is a complex topic, and changes very quickly every year as new hardware, software, and related techniques for optimization are all moving at a rapid clip. #SLIDE Let's start with some very, very basic building blocks. First and foremost, let's talk about the distinction between a CPU and a GPU, and why they perform differently for image recognition. Here, you can see an example of a CPU (upper-left), a GPU (upper-right), and how these components sit in a traditional desktop computer at the bottom. As you can see here, CPUs are fairly small - they tend to take up just a tiny fraction of the inside of any computer case, and in some cases (like this one) don't even have dedicated cooling. Contrast that to GPUs, which take up a ton of real estate - the card in this image is relatively small, and GPUs can easily take up 30 to 40% of the available space on many motherboards. They have their own fans most of the time, and also have dedicated power rails. So, you can imagine that GPUs can - from a hardware perspective - simply put a lot more power (literally) towards solving a problem. #SLIDE So, what is this GPU, or Graphics Processing Unit? Mostly, GPUs are associated with video games - and with good reason. For most of the 1990s and early 2000s, the vast majority of the development of this type of processor was aimed at facilitating our ability to display images on screens at higher and higher levels of resolution. At the time, most people working in the space wouldn't have dreamed that the same underlying architecture used to render - say, the Zerg putting a Protoss army down - could also be used to dramatically improve our capability to fit machine learning models. #SLIDE Through the 1990s and 2000s, two major players developed cards in the GPU space: AMD and NVIDIA. While the debate over which is better for gaming is an ongoing .. civil discussion .. on Reddit, NVIDIA has put extremely large resources into making their cards better suited to deep learning, while AMD has lagged significantly behind as of this recording. #SLIDE So, for most practical purposes, if you're looking to purchase a desktop or laptop (or server shard) to do deep learning, you're looking for the most cutting-edge NVIDIA cards. This might change if AMD decides to catch up, but for now - if you're running an AMD card at home, you're going to struggle a bit to get it working with deep learning model architectures. #SLIDE So, why do we care so much about a GPU? Let's do a comparison of why these different computer components can do such different things. Looking at a CPU, in most consumer cases you will have just a few cores - i.e., the elements of a CPU that can follow a set of instructions.
In most laptops today, you'll likely have between 4 and 8 cores, which means that your computer can perform, at most, 4 to 8 different tasks at the same time (8 to 16 if you have specialized Intel chips with hyperthreading). This isn't a lot, but the extremely fast clock speed of CPUs means these tasks can be executed extremely quickly - at least for tasks in which there isn't much data moving around. However, CPUs also (aside from a small cache) rely on physical memory that is generally located elsewhere on your motherboard (i.e., shared with the rest of the system), which can lead to bottlenecks when you need to move lots of data through processing instructions. #SLIDE Contrast that to a GPU. A GPU is designed to update what colors should be displayed on a monitor - i.e., if your monitor has a resolution of 800 x 600, there are 480,000 pixel colors that need to be updated. As you might guess, many of these updates are handled through some form of matrix multiplication - and, thus, matrix multiplication is something that GPUs are really, really good at. #SLIDE But, why? Think back to your matrix multiplication - in the first matrix in this example, we have 4 rows of information, with (for example) 5 pieces of information in each row. In our second matrix, we have 5 columns of information - again with 5 pieces of information in each column. If we multiply these matrices together, we get the third matrix, which is a 4x5 matrix. #SLIDE To get the value for each element of our 4x5 matrix - i.e., the one value in green here - we need to take the dot product between one of the rows in the first matrix and one of the columns in the second matrix. #SLIDE Importantly, each of these dot products is independent - i.e., the values in our final matrix represented in green, orange, red, and blue could all be calculated at the same time. So, you could easily imagine calculating every element of this output matrix in parallel. #SLIDE So, let's think back to our GPU superlatives. In this example, it would be trivial to assign one calculation - i.e., one dot product - to its own core, and simultaneously solve the entire matrix. As these examples grow - i.e., to matrices of thousands of entries - you can imagine how important having lots of cores at your disposal would be. Additionally, the two input matrices hold data that has to be sent to the processing units of the GPU for them to compute each dot product. Because the memory these matrices are stored in is local to the GPU, the bottlenecks associated with data moving from shared memory (as with a CPU) can be avoided. As you'll recall from earlier lectures, this type of matrix operation is key to solving many operations within neural networks (e.g., affine layers). #SLIDE So, GPUs are better, but how do we interact with them? This requires some mechanism to write code that runs on a GPU - and, right now, there are two different interfaces for this. The first - the most common, the fastest, and what I generally recommend using as of today - is CUDA. CUDA has been developed by NVIDIA over the last decade or so, and is exceptionally performant on NVIDIA graphics cards. A lot of higher-level APIs and primitives for deep learning have also been implemented by NVIDIA on top of CUDA - these include, for example, cuDNN, which implements a large number of the functions we've explored in this course (affine layers, convolutions, forward/backward propagation for layers) directly in CUDA, which can lead to huge performance gains.
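To make that concrete, here's a minimal sketch - using PyTorch, which we'll get to properly in a few minutes - of exactly this idea: every element of the output matrix is an independent dot product, and the whole multiplication can be shipped off to a CUDA device in a single line if an NVIDIA card happens to be available. The 4x5 and 5x5 shapes just mirror the toy example on the slide; nothing else here is specific to our course materials.

```python
import torch

# Toy matrices mirroring the slide example: (4 x 5) @ (5 x 5) -> (4 x 5)
A = torch.randn(4, 5)
B = torch.randn(5, 5)

# Each output element is an independent dot product of one row of A
# with one column of B - there's no reason they can't all run in parallel.
manual = torch.empty(4, 5)
for i in range(4):
    for j in range(5):
        manual[i, j] = torch.dot(A[i, :], B[:, j])

# The built-in matrix multiply computes exactly the same thing, all at once.
assert torch.allclose(manual, A @ B, atol=1e-6)

# If an NVIDIA card (and CUDA) is available, the same operation runs on the
# GPU, where thousands of cores handle those dot products in parallel and
# the data lives in the GPU's own memory rather than system RAM.
device = "cuda" if torch.cuda.is_available() else "cpu"
A_gpu, B_gpu = A.to(device), B.to(device)
C_gpu = A_gpu @ B_gpu  # on a CUDA device this dispatches to NVIDIA's optimized
                       # kernels (cuBLAS for matmuls; the conv analogues live in cuDNN)
print(device, C_gpu.shape)
```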
Contrast this with an implementation called OpenCL, which can run on any type of GPU hardware. However, unlike CUDA, which NVIDIA has spent - roughly - a bajillion dollars improving, the amount of optimization in OpenCL is still lacking. For some hard numbers about the relative efficiency of each case, you can refer to the article linked in the lecture notes and on this slide, but very broadly OpenCL can be between 16 and 67% slower, depending on the application. Perhaps more important in practice is that most software libraries have very strong support for CUDA, while general support for OpenCL is still much more limited. Things are likely to change as more players enter the arena of creating cards explicitly for machine learning, but today NVIDIA is really the core player. #SLIDE So, what does this all mean in practice? In short: GPUs absolutely dominate CPUs when it comes to processing huge amounts of imagery through networks. What you're looking at here is a study done by Microsoft that contrasted a variety of network architectures and implementations in their Azure cloud. Keras is the interface that we are using to define layers, and it is itself an interface into TensorFlow (so, writing code directly in TensorFlow is faster than Keras; however, Keras has a lot of helper functions that make life easier). If you look at ResNet50, you'll see results from one of the most common network architectures used today. If you were to implement a ResNet50 model on your own computer - say, when you're trying to solve a question on Assignment 2 - and you only had a CPU, you would expect to be able to forward and backward propagate about 10 images each second. So, for a dataset like the UC Merced dataset (2,100 images), you would expect a runtime for ResNet50 of around 3.5 minutes before all images were seen (i.e., one epoch finished). Contrast that to the 1 GPU case in orange - here, you would expect to see around 30 images processed a second, for a runtime of about one minute. It's easy to see how quickly - for a large dataset - you would need to integrate GPUs in order to have a feasible chance of fitting your model before key landmarks, like a presentation for your boss, or the heat death of the universe. #SLIDE On the hardware front, there are a few other considerations to be aware of. First, you'll note that while your video card holds - for example - your model weights, the actual data that you want to apply the model to lives on your hard drive. This is challenging for two reasons. First, you need to be able to access information on that hard drive quickly; second, you need your CPU to be able to keep up with the requests to transfer data from the hard drive to the video card (as your CPU handles system operations - including data transfer!). These types of issues are most commonly handled by ensuring the disk holding imagery is very fast - i.e., 10,000 RPM (rotations per minute) or a solid-state drive. Couple that with a multi-core processor capable of prefetching data off of the hard drive, which most modern chips can do, and you can avoid data input/output bottlenecks in most cases. #SLIDE Alright, enough of the hardware - let's start talking about the frameworks that most people (including us) use to interact with deep learning models. This is a rapidly evolving landscape, and in this course we're primarily going to focus on Torch and TensorFlow.
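Since we're about to spend the rest of the lecture on these frameworks anyway, here's a small, hedged sketch of what dodging that disk-to-GPU bottleneck can look like with TensorFlow's tf.data pipeline - the directory path and image size below are placeholders I've made up for illustration, not anything tied to the assignment.

```python
import tensorflow as tf

# Hypothetical folder of images organized into one sub-folder per class.
DATA_DIR = "path/to/images"  # placeholder - substitute your own dataset

# Build a dataset that reads and decodes images off disk in batches.
# (In older TF versions this helper lives under tf.keras.preprocessing.)
train_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR,
    image_size=(224, 224),
    batch_size=32,
)

# Tell the pipeline to keep fetching and decoding the *next* batches on the
# CPU while the GPU is busy with the current one. This overlap is what keeps
# a fast GPU from sitting idle waiting on the disk and CPU.
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
```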
There are a *ton* of frameworks out there - while they all originated in academia, major industry players (Baidu, Microsoft, Amazon) have all now started developing their own frameworks; however, most of these are much less broadly adopted. #SLIDE Most of these frameworks really serve two key purposes, which can save us a lot of time versus writing all of our own implementations (like on Assignment 2). One of the most important things they provide is a simpler way to build large computational graphs, which may have hundreds, thousands, or millions of computations. This happens in two ways - first, by providing a formalized way to string computational elements together in a graph (i.e., code to define a graph), and second, once the graph is established, by automatically solving for the gradient in each step of backpropagation. The second big thing they provide is efficient integration of this code with GPUs (or other dedicated cards) - i.e., you can largely ignore the intricacies of integrating CUDA code with your numpy arrays, and just reap the benefits! #SLIDE So - ultimately, what should you choose for your implementations? Both Torch/PyTorch and TensorFlow/Keras have most of the core implementations and capabilities you might want; most of the information on the internet about how Keras is "higher level" and PyTorch "lower level" is a bit misleading, as you can get under the hood of either using Python. However, there are some key distinctions to be aware of. First, TensorFlow has a well-established website - TensorFlow Hub - which has hundreds of pre-trained models you can pull off the shelf for transfer learning. While PyTorch has something nascent up and coming (PyTorch Hub), it - as of this writing - only has 38 models. #SLIDE A second differentiating factor is in how models are constructed. In PyTorch, you can change your computational graphs during the fitting process itself, potentially having your network architecture change; i.e., dynamic computational graphs. TF/Keras assume a hard-coded architecture that is defined before your model fit begins (i.e., static). While it seems like a dynamic graph would give a huge advantage to Torch, this can cut both ways - the big advantage of static graphs is that you can optimize your computational graph to be very efficient at run time. #SLIDE Next up is general usability. Writ large, Torch/PyTorch is a bit harder to read - i.e., you have to define nearly everything explicitly, up to and including layer structures. This can be wonderful when you want to dig into how everything works, but makes debugging a bit more challenging. In contrast, TensorFlow - and especially Keras - is designed to be much more human-readable, with condensed code that allows you to define entire layers - or even sets of layers - in one line. Recognizing that this is a bit of a stereotype in both cases, and that both can get the job done, the general consensus - and my own - is that Keras does a better job of producing readable code, while PyTorch requires you to define more elements of your network up front. Both have similar levels of power at the end of the day; they just choose to reveal it to the developer in different ways. #SLIDE Finally - on speed, across the board, it's mostly a wash. Torch, PyTorch, and TensorFlow will all give relatively similar performance, with some exceptions depending on the specific task. However, of note, Keras is simply a little slower in every respect, which is the price paid for a simplified, higher-level coding environment. Personally, I use both!
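Before I explain when I reach for each one, here's a rough side-by-side sketch of the same tiny two-layer classifier in both, just to make the readability contrast - and the "gradients for free" point from a few slides ago - concrete. The layer sizes and batch shapes are arbitrary choices for illustration, not anything from our assignments.

```python
import tensorflow as tf
import torch
import torch.nn as nn

# --- Keras: whole layers in single, readable lines ----------------------
keras_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(3072,)),
    tf.keras.layers.Dense(10),
])
keras_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# --- PyTorch: the same network, spelled out more explicitly -------------
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(3072, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

torch_model = TinyNet()
x = torch.randn(8, 3072)          # a fake batch of 8 flattened images
y = torch.randint(0, 10, (8,))    # fake integer labels
loss = nn.functional.cross_entropy(torch_model(x), y)
loss.backward()                   # the framework fills in every gradient for us
print(torch_model.fc1.weight.grad.shape)  # gradients appear automatically
```

Both snippets land in the same place; Keras just hides more of the plumbing, while PyTorch makes you spell it out - and hands you the gradients the moment you call backward().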
If I'm building a prototype of a simpler network, or doing a deep learning project with a new dataset for the first time, I generally dig in with TensorFlow or Keras. If I need to build new network architectures or layers, or engage with some sort of novel approach to classification, I hop over to PyTorch. #SLIDE Alright, switching gears now to model architectures! This is a big, constantly evolving topic, but at its core it comes down to understanding what types of neural nets work when it comes to image classification. Ultimately, a lot of our knowledge here is seated way back in the first lecture - in how the cellular hierarchies of humans, cats, and other organisms work. We're going to take a few dives into different approaches, starting with something called AlexNet, which was released back in 2012. #SLIDE AlexNet was the best network we had for the large ImageNet challenge back in 2012, and was trained on a pair of state-of-the-art GPUs with a whopping 3 *gigabytes* of RAM. The architecture is fairly straightforward - starting with an input image (in this case, 224 x 224), first 96 filters of 11x11 were applied (with a stride of 4), followed by additional convolutions with smaller filter sizes, leading into a max pool and finally a dense net. This architecture was notable for a number of reasons - it was the first implementation of a ReLU-based CNN architecture, it featured an adaptive learning rate - which was manually tweaked! - and it integrated a 7-CNN ensemble, in which 7 models were trained on the same data (with lots of data augmentation) and then averaged to get the final result. AlexNet was an 8-layer model, so very shallow by contemporary standards. #SLIDE Things remained fairly similar in 2013, and then in 2014 two new, deeper nets came onto the scene. The theme of these networks - of which VGGNet is the first we'll discuss - was to have smaller filters and more layers. VGGNet used either 16 (in the case of VGG16) or 19 (VGG19) layers, and never had a filter size larger than 3x3; this was shown to improve accuracy over AlexNet, with the best models getting error all the way down to 7.3%. #SLIDE But, why? Think back to our original discussion of convolutions, where we used a 2x2 filter. If you consider this first convolution, the information going into the activation surface is representative of a total of 16 pixels at this point. #SLIDE At this second convolution, we're now considering another 16 pixels, of which 6 (highlighted in green) are unique - i.e., pixels that were not included in the first convolution. #SLIDE The same thing happens when we move vertically - i.e., the value calculated here would have 6 new pixels, and then its neighbor... #SLIDE ...would have 3 more - i.e., the size of the cube we draw information from in this example is 3x3x3. #SLIDE Now bring your attention to the activation surface at the bottom - if you convolve over this activation surface with the same 2x2x2 filter, we know that the actual window we're drawing information from - on the image - is 3x3x3. This is referred to as the "effective receptive field". As you stack filters, this continues to grow; i.e., in VGGNet, a stack of three 3x3 convolutional layers gives an effective receptive field of 7x7 on the image, in pixel space. The advantage a set of three 3x3 layers gives you, as contrasted with one 7x7 layer, is that you stack three non-linear activations instead of one, which lets the network capture more complex, non-linear structure in the images you're observing; this is borne out in the improvements we see in the VGG architecture.
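To put some quick numbers on that stacking argument - and on the parameter savings I'll mention in just a second - here's a little back-of-the-envelope sketch. The receptive field formula is the standard one for stride-1 convolutions, and the channel count of 64 is just an assumption for illustration.

```python
# Effective receptive field of a stack of stride-1 convolutions:
# each extra layer with a k x k filter adds (k - 1) pixels to the window
# it "sees" on the original image.
def effective_rf(kernel_sizes):
    rf = 1
    for k in kernel_sizes:
        rf += (k - 1)
    return rf

print(effective_rf([3, 3, 3]))   # 7 -> three 3x3 layers see a 7x7 patch
print(effective_rf([7]))         # 7 -> same receptive field as one 7x7 layer

# Weight counts for the same comparison, assuming 64 channels in and out
# (and ignoring biases): three 3x3 layers use roughly half the weights
# of a single 7x7 layer.
C = 64
print(3 * (3 * 3 * C * C))       # 110,592 weights for three 3x3 convs
print(7 * 7 * C * C)             # 200,704 weights for one 7x7 conv
```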
Another nice benefit is that you can capture this with fewer parameters in the net! #SLIDE The other strong performer in 2014 was GoogleNet, which is also sometimes referred to as Inception v1. This net outperformed VGG in some cases, and underperformed it in others, but what sets it apart is its heavy focus on computational efficiency. Across all of GoogleNet, there are only a total of 5 million parameters that need to be fit - this contrasts with over 60 million for AlexNet, and 138 million for VGG16! One of the ways this was accomplished was through the use of so-called inception modules, an example of which is shown on the slide and which we'll dig into on the next slide. Another surprising element of this net: it has very few affine layers. #SLIDE So, what made this Inception approach so special? The first goal was to build the smallest unit - the "inception layer" - which is essentially a small network in and of itself. Then, once the best inception layer is identified, stack these layers on top of each other to achieve high levels of depth. What they identified was a structure in which there are multiple receptive field sizes - i.e., we are convolving using 3x3, 5x5, and 1x1 filters, all simultaneously, as well as keeping a 3x3 max pool of our inputs. Very importantly, they include 1x1 convolutional layers as precursors to the 3x3 and 5x5 convolutions - in the case of Inception v1, these 1x1 layers have 64 filters; thus, the input into the 3x3 and 5x5 convolutions is always limited to 64 channels from the previous layer. This dramatically decreases the number of parameters required to fit an inception layer, even though it's conducting a wider range of convolutions. #SLIDE Here, you can see the full GoogleNet architecture, with a red box around one of the inception modules (which you can see stacked here). #SLIDE GoogleNet further has a stem network, which serves as the input into the first set of inception modules. This is a standard network, which serves to reduce the dimensionality of the inputs before any inception occurs. #SLIDE In purple is where the final classification layer is - i.e., these classifications are fed into your test function to establish the final accuracy of your model, and used with a loss function to pass back gradients. #SLIDE One of the most interesting elements of the Inception net is these auxiliary classification branches. Essentially, these branches do the same thing as the final classification layer, but are only fed into the loss function - i.e., they are used to boost the gradient signal in the lower levels of the network, but are not used in establishing the final classification at test time. This is another solution for trying to prevent gradient decay or saturation in deep networks. #SLIDE In 2015, another key architecture emerged that has served as the baseline for a huge array of contemporary models - ResNet. ResNet took depth to the extreme, increasing from the roughly 22 layers of GoogleNet to 152 layers, and implementing a new type of architecture built around residual blocks. The payoff of this was huge: they knocked error down to 3.57%, and - equally impressively - swept every single category. #SLIDE One of the key rationales behind the development of ResNet was the fact that, if you just stack convolutional blocks on top of each other forever, you hit a saturation point pretty quickly at which it becomes challenging (if not impossible) to find a reasonable set of optimal parameters.
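Quick aside before we see how ResNet tackles that: it's worth putting one concrete number on GoogleNet's 1x1 bottleneck trick, because it's easy to under-appreciate. The channel counts below are illustrative assumptions on my part, not the exact figures from the paper.

```python
# Weights needed for a 5x5 convolution producing 128 output channels
# (ignoring biases), with and without a 1x1 "bottleneck" in front.
C_in, C_mid, C_out = 256, 64, 128   # illustrative channel counts

direct = 5 * 5 * C_in * C_out                                   # 5x5 straight on the input
bottleneck = (1 * 1 * C_in * C_mid) + (5 * 5 * C_mid * C_out)   # 1x1 first, then 5x5

print(direct)      # 819,200 weights
print(bottleneck)  # 221,184 weights - about a 3.7x reduction
```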
That saturation point can lead to dramatic underperformance of deep networks relative to shallow networks, as the number of parameters simply becomes unmanageable too quickly. Working under this hypothesis, He et al. in 2015 sought to construct a model architecture in which a deep model could always perform *at least as well* as a shallow model. In practice, the concept He et al. came up with is fairly simple - rather than trying to fit a function on the basis of the input data alone, as would be the case in a standard net, the function you seek to optimize is F(x) + x - i.e., the input x is added back onto the block's output (the final set of 'scores', or outputs of any kind). Imagine, for example, that you have an exceptionally deep network with over 50 residual blocks, all strung together, and you identify an optimal solution after just three of those blocks. Because each block computes F(x) + x, the remaining blocks can simply drive F(x) toward 0 and pass their inputs through unchanged, precluding the need to optimize the deeper layers. #SLIDE As an example of the full ResNet architecture, here is ResNet34. I show you ResNet34 because 34 layers can fit on a single PowerPoint slide, and the only difference between this and ResNet152 is - you guessed it - more residual layers! The architectures are, as you can see, very straightforward - you just keep stacking, and then throw an affine layer at the end with a number of nodes equal to the number of classes in your case. The flexibility of ResNet - which can handle both simple and complex problems thanks to the residual architecture - has made it an immensely powerful "go-to" for a very wide range of problems. #SLIDE And, that's it for today! As a brief recap, in this lecture we gave a broad overview of GPU vs CPU hardware, introduced a few coding frameworks for deep learning and chatted about their pros and cons, and finally wrapped up with a discussion of popular network architectures today. Hope you enjoyed, and more next time!
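P.S. - since residual blocks will almost certainly show up in your own projects, here's one last minimal sketch of a single F(x) + x block written with Keras. The filter counts and input shape are arbitrary, and a real ResNet adds batch normalization and downsampling variants that I'm leaving out here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """A minimal F(x) + x block: two 3x3 convolutions plus a skip connection.
    Assumes `x` already has `filters` channels so the shapes line up."""
    shortcut = x                                                    # the "+ x" path
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)                # this is F(x)
    y = layers.Add()([y, shortcut])                                 # F(x) + x
    return layers.Activation("relu")(y)

# Tiny demo model: if the two conv layers learn weights near zero, the block
# simply passes its input through - which is exactly the "a deep net can always
# match a shallow net" property discussed above.
inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()
```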