Welcome everyone to DATA 442. I'm very excited to offer this course, and thrilled to see all of you here. A few notes on the course - first, all lectures (and these videos) will be up on the ICSS website for you to view. If for any reason you aren't able to attend class, you will be able to check out the lecture within a few hours of recording. The URL is icss.wm.edu/data442/, and you can also find the URL in your syllabus. Second, you will also find a number of resources on that website. This includes links for the submission of your assignments, course policies, and details on office hours. We also provide you with links to the many courses that this course is built on top of, and encourage you to explore them to learn even more! Third, all assignments in this course are automatically graded by - you guessed it - AI. Given that robots aren't perfect, and some of you may dislike being judged by a machine intelligence, you can of course request that your grades be reviewed by a human. Please note that if you make such a request, your grade can go up *or* down, as the assignment will be completely regraded, so be cautious! ====== SLIDE 2 With that, let's get started. This course is heavily focused on computer vision - how we train algorithms to interpret data collected by sensors and represented in a graphical form. There has been an enormous increase in the amount of graphical data - from smartphones, to cameras, to satellites, tremendous amounts of graphical data are being produced every minute. We can look at the hard numbers to see just how incredible the scale of graphical data production is - according to Sandvine, in 2019 over 60% of the data downloaded over the internet was for video streaming; this contrasts with the second-place contender, websites, which came in at 13.1%. Think about what that includes - every picture on your phone, the video of your dog rolling in the mud, and the latest episode of 'The Office' that you chose to stream. This poses an enormous challenge for algorithms that seek to understand this information - sorting through what is important and what is not from thousands of satellite images after a natural disaster, identifying illegal or threatening activities, ===SLIDE or enabling you to search for a "ridiculous coffee maker" on Google image search. Making the challenge of image data even bigger is the fact that computers - as they have been coded over the last 100 or so years - are not well equipped to analyze it. Instead, nearly all of the algorithms we have derived have focused on analyzing discrete "units of observation" - think rows in a spreadsheet. Further, the human mind has a hard time analyzing imagery at scale - when every frame of this massive amount of data could contain key information, and the differences between those frames could be important, we are simply ill-equipped to capture the changes. In fact, if we wanted humans to watch every minute of footage uploaded to YouTube in real time, it would require nearly 18,000 of them! This has critical implications for industry - if an advertiser is willing to pay to guarantee they won't have an ad placed alongside - for example - violent footage, we need to have an economically viable way to do that. ===SLIDE Because of the scope of the challenge, computer vision has tremendous overlap with a wide range of disciplines.
When we think about computer vision, we're really thinking about the algorithms required to interpret image content - for example, whether a picture is of a dog or a cat. This is closely related to both image processing (i.e., engineering to make satellite imagery usable by correcting for the altitude of a sensor) and machine vision (i.e., teaching a robot how to interpret signals to minimize the chance it will fall over). If you consider the types of knowledge needed to replicate the process of human data intake through our eyes, processing and transport to our brains, and then cognition of that data - it's really quite a lot! We draw on fields like neurobiology to understand how the human eye works and sends information - much of what you will learn in this class seeks to replicate biophysical processes. We rely on math, computer science, and related disciplines to implement the algorithms required. And, we rely on a range of physical sciences to help design and interpret the information from sensors. Increasingly, we also rely on physical engineering to help build new hardware that allows us to implement algorithms for computer vision more efficiently than is possible today. ===SLIDE To give you a little background on where your instructional team is coming from, we (myself and the TAs) are based out of the Initiative for Computational Societal and Security Research at William & Mary (the ICSS). In this group, we focus on a range of projects related to computer vision, ranging from predicting road quality from satellite imagery, to estimating school test scores from street view imagery, to using computer vision to parse archival documents. ===SLIDE This course is interrelated with a number of other courses offered at William & Mary, and assumes some baseline knowledge from those courses. First and foremost, we assume a baseline proficiency in Python; this can be achieved by completing DATA 141 or CSCI 140. If you are sitting in this room and are uncomfortable with basic Python, you may want to reconsider! DATA 310 provides baseline knowledge on the ways that machine learning algorithms work, and you should be comfortable with concepts like optimization, hyperparameter tuning, and neural network backpropagation. Finally, this course places a very heavy focus on computer vision specifically, with a big dose of deep learning (as one of the most popular CV techniques today). ===SLIDE One of the first topics we'll be exploring is the history of computer vision, to give you some breadth of background on how we got to where we are today. We'll start 2 billion years ago, with the Protists - creatures that continue to evolve on Earth today. The image you see here is Erythropsidinium (we'll call her 'E'), which is about 50 microns in length. It has what we might consider one of the first examples of an "eye", though we don't know exactly when it evolved; further, 'E' has no brain, and so little is known about how the information captured by this "eye" is actually used. ===SLIDE Approximately 635 million years ago, we believe Jellies and Sponges began evolving structures similar to eyes, though with extreme limits in what wavelengths they could capture, and relatively little is known about how the information taken in was processed. ===SLIDE Starting approximately 543 million years ago, we saw a massive explosion of species with eyes, with the trilobite Olenellus fowleri being the first we can identify in the fossil record. While it likely was not the first, it is the first we currently have physical evidence of.
This led to a massive rush of evolution, with the fundamental nature of life changing to adapt to the new capabilities that vision brought on. I encourage you to read more in the article at the bottom of this slide (by Dr. Schwab of UC Davis's Department of Ophthalmology and Vision Science), but it is sufficient to say that we now observe vision as a core element of nearly all capabilities of intelligent creatures - from manipulation, to transportation, to mating, nearly all species (including humans) primarily rely on visual data to interact with the world. ===SLIDE In many ways, computers have been living much as organic animals did before vision evolved; just as organic animals evolved the physical sensors to take in light (eyes) and the means to process and understand the data generated (neurons, the brain), so too are innovations in computational capabilities advancing computers toward a new era of capability. The physical "eyes" of computers are largely sensors that we have created to capture light artificially; we can trace these innovations all the way back to Abū ʿAlī al-Ḥasan ibn al-Ḥasan ibn al-Haytham, known in the West as "Alhazen". He created a pinhole camera ("Camera Obscura"), in which light enters through a small opening and is projected onto a screen inside a container. In 1839, this was married with the idea of film when Louis Daguerre used silver iodide-based solutions that would produce an image when exposed to light. The evolution of film continued rapidly through 1885, when George Eastman began the production of paper film and the "Kodak Camera", which laid the groundwork for much of the mass photography we have today. ===SLIDE As the evolution of sensors continued, so too did our understanding of how visual data is interpreted by the mind. In 1959, Hubel and Wiesel conducted studies in which the electrical responses of cats' brains (i.e., when neurons were firing) were measured as the cats were shown different types of shapes on a screen. Much of computer vision was inspired by this work - it was one of our first opportunities to understand how mammals process data that comes in through the eye. The key thing they learned was that in the visual cortex of the cat's brain, simple structures - relatively simple groups of neurons - directly responded to stimuli, especially simple stimuli like changes in the orientation of edges in an image. These simple structures then passed their findings (electrical impulses) on to more complex structures, allowing for an integration of information and conceptualization of visual inputs. This general approach is what forms the basis of the deep learning strategies you'll be learning about in this course. ===SLIDE Alongside developments in physical sensors and our understanding of human intuition, the computer science community was also evolving to be able to represent the world around us in computer code. In 1963, Larry Roberts began this process by conceptualizing features in the world as "blocks", which are defined by feature points delineating the difference between objects. This provided the fundamental logic required to enable a computer to save, and then reconstruct, three-dimensional shapes. However, this method proved limited in that not all things can be represented with blocks! Thus, in the 1970s a push occurred to try and improve our ability to represent objects in the "real world", with scholars such as Brooks, Binford, Fischler, and Elschlager leading the charge.
The big idea these works contributed was that complicated geometrical objects can be represented by a collection of simple primitives. David Lowe built on this idea in the 1980s, seeking to identify objects based on their outlines (in other words, edges - sound familiar?). These few cases formed the basis of the field of object recognition, but were heavily limited in their capabilities. ===SLIDE Another paradigm of imagery recognition being explored in parallel to computational definitions of objects was the computational *identification* of objects. While it can be very hard to figure out what an object is (e.g., a ball, a razor, an umbrella), it should - hypothetically - be easier to identify where distinct objects are in a scene. Initial attempts at segmentation began as early as 1968, with Muerle and Allen exploring strategies for defining areas within images as belonging to the same "object"; this was built on extensively by the remote sensing community, who were interested in using segmentation to take advantage of another emerging technology: satellite data. Today, object segmentation is a common first step in object recognition, where first the unique objects in an image are identified, and second the appropriate labels are applied to those objects. ===SLIDE An enormous leap was made in 2001, when Viola and Jones showed it was possible to localize where faces were in images in near-real-time, with relatively little processing power, using an AdaBoost-based algorithm. This work was not only important for its illustration of the capability of machine learning for imagery recognition, but it also boosted computer vision into the forefront of industrial innovation - today, nearly every camera shows the bounding box of faces! Simultaneously, research was probing into how objects could be identified even though they may regularly change - for example, a sunflower may appear different due to a change in petals; a logo may appear differently if it is on a curved object; or, generally, objects may appear different because of the angle a picture was taken from, shadows, or other challenges. In 1999, Lowe came to the realization that even under these variable contexts, there were some features that remained "invariant" - i.e., they would not change regardless of the angle from which they were viewed. Thus, he reduced the challenge of object recognition down to a first stage: seeking to identify what these invariant features were. ===SLIDE Starting in 2006, many of the algorithms engaging with object detection, segmentation, and recognition came together in the first of a series of large-scale challenges posed by the computer science community. Dr. Mark Everingham of the University of Leeds compiled a large dataset (PASCAL), containing consumer photographs from Flickr labeled according to the objects they contain (e.g., sheep, dog, potted plant). This provided the first large-scale dataset of images, with 500,000 labeled data points that could be used to establish the accuracy of computer vision algorithms in real-world settings. From 2006 until 2012, an annual competition was held in which training data was provided to different groups, and a held-out testing set was then used to ascertain the accuracy of their models. This competition was predicated on the 20 classes of objects PASCAL identified - and naturally led to a bigger question: what about everything else?
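As a quick aside (this isn't required for the course): descendants of the Viola-Jones detector are easy to try yourself. Below is a minimal sketch, assuming you have the opencv-python package installed and a local image file - "photo.jpg" here is just a hypothetical placeholder - that uses OpenCV's pretrained frontal-face Haar cascade, an AdaBoost-style detector in the Viola-Jones lineage, to draw bounding boxes around faces.

```python
# A minimal face-localization sketch in the spirit of Viola & Jones (2001).
# Assumes the opencv-python package is installed; 'photo.jpg' is a placeholder path.
import cv2

# Load OpenCV's pretrained frontal-face Haar cascade (an AdaBoost-style detector).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

image = cv2.imread("photo.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Scan the image at multiple scales; each detection is a bounding box (x, y, w, h).
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("photo_with_faces.jpg", image)
print(f"Found {len(faces)} face(s)")
```

That handful of lines is essentially the same near-real-time, low-compute face localization that put bounding boxes in every consumer camera.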
===SLIDE Enter ImageNet, which was a collaboration across numerous groups of computer vision researchers and is currently hosted by the Stanford Vision Lab. ImageNet is still being improved today, and represents nearly 22,000 categories of objects, with over 14 million images. This posed an all-new challenge: how accurately could computer vision algorithms predict from the entire world of data? This was a critical breakthrough for the computer vision community: not only was there a broad dataset available to test algorithms, but a dataset large enough to mitigate the challenge of overfitting for imagery data. Because imagery data is highly complex - both in terms of the total amount of data and the types of features that might be detected - it is particularly prone to challenges associated with overfitting, and it is very difficult to detect those challenges with small datasets. ImageNet - the product of a massive effort built on top of Amazon's Mechanical Turk - was a key step forward in overcoming these challenges. ===SLIDE ImageNet spurred a number of competitions, as well as extensive development of different computer vision architectures for the classification, localization, and object detection of information in images. One of the best summaries of these was prepared by Olga Russakovsky in 2014, which showed the improvement of these models from 2012 through 2014, as well as the challenges that still remained at that time. This slide shows the error of algorithms in identifying the class of an entire image - i.e., when only the class of the item in the image needs to be established. As this shows, over the two-year period from 2012 to 2014 our ability to discern between classes increased dramatically, from approximately 83% accuracy to 93% accuracy. ===SLIDE Here, you see the ability of models to identify the location of a specific object within an image. As you may notice, the cases where your own eye has an easier time discerning the object in question were also easier for the algorithms in question. Similar to the image classification challenge, accuracy for localization improved by approximately 10% over this two-year period, but still remained around 75%. While I do not show the results (I encourage you to go read the paper at the bottom of this slide!), the accuracy for the most challenging task - object detection across the entire image - resulted in a precision of around 37% (indicating many objects were still being missed, misclassified, or erroneously identified). ===SLIDE There were two big moments in the evolution of ImageNet modeling efforts. First, in 2012 AlexNet (previously known as SuperVision) provided an enormous gain in accuracy over past models - nearly a 10% improvement over anything that had come before. This network was predicated on a "Convolutional Neural Network", or CNN - the style of network that will form the basis of nearly everything we do this semester. The second big moment came in 2015, when He et al. introduced a new architecture - ResNet - which for the first time was able to achieve an image classification error rate lower than a human classifier. In this study, ResNet classified nearly 97% of images correctly, as contrasted with the approximately 95% classified correctly by a human in Russakovsky's earlier work. The big innovation in He's work was depth - the number of layers used in the network - which is the "deep" in "Deep Learning." Of note - the human in Russakovsky's paper was one poor, lonely Ph.D. student at Stanford.
Aren't you glad you aren't at Stanford? ===SLIDE Now, we're going to shift our focus back to today and - more importantly - DATA 442. In this course, we will explicitly focus on a few key approaches within the field of computer vision: image classification, image segmentation, object detection, and image captioning. ===SLIDE When we talk about image classification, we're talking about classifying the entire image - "Does this image contain a dog?" would be a common example of a classification challenge. ===SLIDE When we talk about image segmentation, we're focusing on breaking the image into its constituent components - which pixels represent the same objects in the image? ===SLIDE Object detection is an even bigger challenge - can we localize where objects are, and label them? ===SLIDE Finally, image captioning focuses on the intersection between natural language processing and image recognition - writing sentences that describe images based on their contents. ===SLIDE In this course, we'll be heavily focusing on Convolutional Neural Networks - not just for fun, but because since 2012 they have reigned supreme over nearly every class of image recognition challenge. Fundamentally, the architecture of CNNs has not changed tremendously since their inception - AlexNet and ResNet both still share a number of similarities with the initial model you see here (a model designed to automatically read checks!). Core concepts - how convolutions are done, feature maps, subsampling, pooling, and fully connected layers - are all still present today, and we'll be doing a deep dive into each of these. ===SLIDE Of particular note for our course is the importance of data and hardware evolution over the last few decades. While the basic concepts for CNNs go back to the late 90s, limitations in computation and data had very dramatic implications for the types of theories and networks we were able to test. Unlike many other fields of inquiry, Data Science as a whole is predicated on algorithms with so many free parameters - parameters that are calibrated by stochastic processes - that it is infeasible, if not impossible, to provide theoretical solutions as to what will or will not work. While we can make really good guesses - based on examples from the natural world and past experience - the field differs significantly from disciplines such as Physics, in which single, deterministic solutions can be proven. Advances such as quantum computing - which would enable a massive leap forward in our ability to parameterize models - could change this dramatically over the coming decades, and reshape the entire field. This slide illustrates two of the huge driving forces behind the improvements in computer vision from the late 90s to today. First, we have physical hardware - since the late 1990s, we have seen a literal order-of-magnitude improvement in the clock speed of processors, and multiple orders of magnitude improvement in the number of transistors available on our infrastructure. These numbers are reflective of (though not directly a measurement of) the number of parameters we could meaningfully estimate before the heat death of the sun, and thus the depth of networks we could feasibly test. Similarly, our ability to test has increased dramatically as data has become more plentiful - we've moved from testing our models on around 12 megabytes of hand-written characters to over 35 terabytes of thousands of classes of objects today.
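To ground those core concepts before we dive in, here is a minimal sketch of a small LeNet-style network - the same family as the check-reading model on the slide, though not that exact model - written in PyTorch purely for illustration (the use of PyTorch here is an assumption for this sketch, not a course requirement). You can see convolutions producing feature maps, pooling (subsampling), and fully connected layers, all in a few dozen lines.

```python
# A minimal LeNet-style CNN sketch (illustrative only; assumes PyTorch is installed).
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # convolution: 6 feature maps from a grayscale input
            nn.ReLU(),
            nn.MaxPool2d(2),                  # pooling (subsampling) halves the spatial resolution
            nn.Conv2d(6, 16, kernel_size=5),  # deeper convolution: 16 higher-level feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 120),       # fully connected layers integrate the feature maps
            nn.ReLU(),
            nn.Linear(120, num_classes),      # one score (logit) per class
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One fake 28x28 grayscale image, batch size of 1.
x = torch.randn(1, 1, 28, 28)
logits = TinyConvNet()(x)
print(logits.shape)  # torch.Size([1, 10])
```

Every architecture we discuss this semester - from AlexNet to ResNet - is, at its heart, a deeper and wider elaboration of this same basic pattern.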
===SLIDE Advances have not all been about quantity - we've also seen a large increase in the diversity of image datasets available for use. From labeled satellite imagery, to video, to emotions, there is now a wide variety of different types of data on which one can train and test networks. This has been critical in enabling us to apply computer vision to an ever-increasing range of real-world problems. ===SLIDE Now, on to a few logistical items for the course, and how we hope you'll take advantage of the opportunity to learn all about computer vision. First, we'll be using Piazza for communication - all students will be enrolled (you'll receive a link), and you can choose to post anonymous questions on Piazza, or respond to any items that come up. Both the TAs and I will be responding regularly to the content you post on Piazza, and it is one of the best (if not the best) resources for this course. Further, Piazza will have all of the content and links you need to succeed in this course - our most recent syllabus, how to sign up for office hours, and how to submit assignments. We heavily encourage you to check in with Piazza on a regular basis. This course is also built on the shoulders of many giants, and in particular we seek to provide a "liberal arts" take on material coming out of MIT and Stanford on the topic of computer vision. Just a few years ago, one of the first textbooks on this topic was released by MIT Press, and it will provide an excellent primer for those of you seeking a bit more depth on any topics we cover. While totally optional for this course, it is freely available online, and will provide a good background for many of our activities. Additionally, the Stanford Computer Vision Lab has provided their lectures and course materials for free online; again optional, these materials will provide a good opportunity for you to do a deeper dive on topics of interest to you. ===SLIDE At the end of this course, we hope that you will have the confidence to apply convolutional neural networks to a vast range of challenges. We'll really be focusing on three things. First, we hope to instill in you a deep understanding of how convolutional networks work - not just how to plug data into them. You should be able to write a network from scratch by the end of this course, and have a deep understanding of how to debug and tweak networks for your own problem cases. Second, we'll be focusing on practical applications - i.e., this won't just be a course about theory, but also one about providing the knowledge you'll need to select the right software and hardware tools to build your own solutions. Finally, we'll also be setting aside a little bit of time to really take a step back and visualize all of the incredible things going on in a convolutional network. Unlike the human brain, which we (at least, today) can't plug right into to see what it's seeing, we *can* do that with a CNN. This opens the doors to a number of beautiful insights, and also a few nightmares along the way. ===SLIDE Grading in this course comes down to five moments - three assignments which will be released throughout the course, a midterm, and a final. In each case, you will be given a grade by an algorithm - after all, if you're going to be creating algorithms that may impact the lives of others, it's only fair that you are on the receiving end at least once!
Throughout the course, you can always challenge your grade and opt for a human review of your score; however, please note that this will be a completely blind review - the human grader will not be aware of the grade the AI gave you, and thus your score could go up or down. We'll have a deeper discussion about grading once the first assignment is released. Please note that late submissions are not accepted. The deadlines in this course are already intentionally set a few days later than absolutely necessary - so imagine you've been given an extension by default, and submit your work early! Please note that we highly encourage collaboration, but the work you submit should be your own. We do not have any group projects in this course. ===SLIDE Finally, I want to close today with a few baseline assumptions about you, so you are aware of them as you choose to either stay in or drop this course! First and foremost, I am assuming you have a strong proficiency in Python - every assignment in this course is going to be submitted in Python, so you will need to be prepared. While we will be available to help with some issues, most debugging will be up to you. Second, this course will require some fairly basic knowledge about how to take derivatives and what they are, matrix algebra (multiplication), and a few other basic concepts. Even if you haven't taken upper-level math coursework that covers these topics, you should be fine, but you may have a little catching up to do here and there. Finally, I assume you have knowledge of basic machine learning concepts such as optimization, classifiers, cost functions, and related ideas. You should have learned these in the prerequisites for this course. ===SLIDE That's all for today - I encourage you all to head over to the course website and explore the resources we have available, and I wish you all good luck in the course!