Summary
The MIT 6.S191 Introduction to Deep Learning course, taught by Alexander Amini, is a fast-paced, one-week boot camp covering the foundations of deep learning. The course is structured to offer hands-on experience through software labs using TensorFlow. It shows how deep neural networks are built and trained to extract patterns from large data sets for prediction and decision-making tasks. The course emphasizes both the theoretical underpinnings and the practical applications of neural networks, leading up to a competition where top submissions get the opportunity to deploy their models on a full-scale autonomous vehicle. Students taking the course for credit can complete either a project proposal or a one-page review of a deep learning paper to fulfill the requirement.
Highlights
Kick-off with a futuristic presentation on deep learning's potential.
Deep learning surpasses traditional machine learning by learning hierarchical data features.
The course utilizes innovative simulations for autonomous vehicle training.
Gain hands-on TensorFlow experience in software labs.
Students' final projects allow practical, creative deep learning exploration.
Key Takeaways
Deep learning leverages neural networks to extract patterns from data for informed predictions.
The course offers rapid exposure to deep learning in a boot camp style.
Modules include TensorFlow labs and project-based learning.
Exciting competition: Students can deploy models on a real autonomous vehicle.
Adaptive learning rates optimize model training efficiently.
Overview
The kick-off session of MIT's Introduction to Deep Learning course set an exciting tone with a virtual presentation showcasing the transformative power of deep learning. Alexander Amini introduced the interactive and intensive nature of the course, promising to condense complex and expansive neural network concepts into an engaging week of learning.
This fast-paced boot camp will dive into neural network fundamentals, starting from single neurons to multi-layered networks designed to extract hierarchical features from data sets. Through TensorFlow labs and lectures, students will acquire practical and theoretical skills essential for navigating deep learning landscapes, including tackling issues like overfitting with techniques such as dropout and early stopping.
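As a rough illustration of the two regularization ideas mentioned above, a minimal TensorFlow/Keras sketch might look like the following. The layer sizes, dropout rate, optimizer, and training setup are illustrative assumptions, not the course's own lab code.

```python
import tensorflow as tf

# Illustrative sketch: a small classifier with a dropout layer, trained with early stopping.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),            # randomly zero 50% of activations during training
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# x_train and y_train stand in for whatever dataset the labs provide:
# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```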
A unique highlight of the course is the chance to work on live simulations for autonomous vehicle training. Students can enhance their models to race cars virtually, with the top designs being implemented on actual self-driving cars, providing a thrilling capstone for a week of dynamic and immersive learning.
Chapters
00:00 - 01:30: Introduction and Overview The chapter introduces the course 6.S191 - Introduction to Deep Learning. The instructor, Alexander Amini, welcomes everyone. The course is described as fun and fast-paced, co-taught with Ava Soleimani.
01:30 - 10:00: Deep Learning Basics and Applications Introduction to deep learning basics, an overview of what the field encompasses and its potential applications. The class is designed to cover substantial material within a short period of one week, geared towards providing foundational knowledge and practical experience in deep learning.
10:00 - 21:00: Neural Network Fundamentals The chapter titled 'Neural Network Fundamentals' begins with the author referencing software labs using TensorFlow. It is described as a brief yet intensive course, akin to 'a one-week boot camp in deep learning,' highlighting the vast information presented within a short time frame. The chapter is introduced by posing a fundamental question about the nature of deep learning. Instead of delving immediately into a technical definition or elaborating on the power and appeal of deep learning, it promises to demonstrate the topic, suggesting a practical, example-driven approach.
21:00 - 36:00: Training Neural Networks In this chapter, the focus is on training neural networks, a pivotal aspect of deep learning. The lecture is part of MIT 6.S191, the official introductory deep learning course at MIT, which surveys applications of deep learning across fields such as robotics and medicine. The discussion highlights the transformational impact deep learning is having on technology and its broader implications for various industries. By the end of this chapter, students should have a foundational understanding of how neural networks are trained and their significance in the field of deep learning.
36:00 - 49:00: Optimization and Learning Rates This chapter discusses the role of optimization and learning rates in the development of deep learning algorithms. It includes examples of how deep learning and artificial intelligence can be used to accomplish impressive tasks, such as creating realistic speech and video content. The speaker emphasizes the transformative impact of these technologies and expresses hope that the course will be enjoyable and educational for learners.
49:00 - 66:00: Backpropagation and Practical Insights The chapter discusses the power of deep learning in creating realistic videos and the responsible use of such technology. A particular example is provided where a video and audio are purposely degraded by the authors to prevent misuse, demonstrating the capabilities and potential pitfalls of deep learning technologies.
66:00 - 77:00: Overfitting and Regularization This chapter delves into the concepts of overfitting and regularization within the context of deep learning. It highlights deep learning's significant progress in recent years, particularly its capacity to generate realistic data, including realistic videos. The enthusiasm and feedback from students underscore the excitement in mastering these advanced techniques and algorithms, which drive such remarkable advancements in the field.
77:00 - 81:00: Summary and Next Lecture The chapter discusses the use of deep learning to create fully simulated environments that resemble the real world. It includes examples of these virtual worlds, which are generated using real data and enhanced through deep learning and computer vision. The focus is on a data-driven simulator developed at MIT, which facilitates the placement of virtual simulated cars within these environments. This technology is particularly useful for training autonomous vehicles.
MIT Introduction to Deep Learning (2022) | 6.S191 Transcription
00:00 - 00:30 okay good afternoon everyone and thank you all for joining today i'm super excited to welcome you all to mit 6s191 introduction to deep learning my name is alexander amini and i'm going to be your instructor this year along with ava soleimani now 6s191 is a really fun and fast-paced class and for
00:30 - 01:00 those of you who are not really familiar i'll start by giving you a bit of background on on what deep learning is and what this class is all about just because i think we're going to cover a ton of material in today's class and only one week this class is in total and in just that one week you're going to learn about the foundations of this really remarkable field of deep learning and get hands-on experience and practical knowledge and practical guides through these
01:00 - 01:30 software labs using tensorflow now i like to tell people that 6s 191 is like a one week boot camp in deep learning and that's because of the amount of information that you're going to learn over the course of this one week and i'll start by just asking a very simple question and what is deep learning right so instead of giving you some boring technical answer and description of what deep learning is and the power of deep learning and why this class is so amazing i'll start by actually showing you a
01:30 - 02:00 video of someone else doing that instead so let's take a look at this first hi everybody and welcome to mit 6.s191 the official introductory course on deep learning taught here at mit deep learning is revolutionizing so many fields from robotics to medicine and everything in between you'll learn the fundamentals of this
02:00 - 02:30 field and how you can build some of these incredible algorithms in fact this entire speech and video are not real and were created using deep learning and artificial intelligence and in this class you'll learn how it has been an honor to speak with you today and i hope you enjoy the course
02:30 - 03:00 so in case you couldn't tell that video was actually not real at all that was not real video or real audio and in fact the audio you heard was actually even purposely degraded even further just by us to make it look and sound not as real and avoid any potential misuse now this is really a testament to the power of deep learning to create such high quality and highly realistic videos and quality models for generating those videos so even with this purposely
03:00 - 03:30 degraded audio that intro that we always show we always get a ton of really exciting feedback from our students on how excited they are to learn about the techniques and the algorithms that drive forward that type of progress and the progress in deep learning is really remarkable especially in the past few years the ability of deep learning to generate these very realistic data and data sets extends far beyond generating realistic videos of people like you saw in this
03:30 - 04:00 example now we can use deep learning to generate full simulated environments of the real world so here's a bunch of examples of fully simulated virtual worlds generated using real data and powered by deep learning and computer vision so this simulator is actually fully data driven we call it and within these virtual worlds you can actually place virtual simulated cars for training autonomous vehicles for example this simulator was actually designed here at mit and when we created it we
04:00 - 04:30 actually showed the first occurrence of using a technique called end-to-end training using reinforcement learning and training an autonomous vehicle entirely in simulation using reinforcement learning and having that vehicle controller deployed directly onto the real world on real roads on a full-scale autonomous car now we're actually releasing this simulator open source this week so all of you as students in 6.s191 will have first access to not only use this type
04:30 - 05:00 of simulator as part of your software labs and generate these types of environments but also to train your own autonomous controllers to drive in these types of environments that can be directly transferred to the real world and in fact in software lab three you'll get the ability to do exactly this and this is a super exciting addition to 6.s191 this year because all of you as students will be able to actually enter this competition where you can propose or submit your best deep learning models to drive in
05:00 - 05:30 these simulated environments and the winners will actually be invited and given the opportunity to deploy their models on board a full-scale self-driving car in the real world so we're really excited about this and i'll talk more about that in the software lab section so now hopefully all of you are super excited about what this class will teach you so hopefully let's start now by taking a step back and answering or defining some of these terminologies that you've probably been hearing a lot about so i'll start with the word intelligence
05:30 - 06:00 intelligence is the ability to process information take as input a bunch of information and make some informed future decision or prediction so the field of artificial intelligence is simply the ability for computers to do that to take as input a bunch of information and use that information to inform some future situations or decision making now machine learning is a subset of ai or artificial intelligence specifically focused on teaching a computer or
06:00 - 06:30 teaching an algorithm how to learn from experiences how to learn from data without being explicitly programmed how to process that input information now deep learning is simply a subset of machine learning as a whole specifically focused on the use of neural networks which you're going to learn about in this class to automatically extract useful features and patterns in the raw data and use those patterns or features to inform the learning tasks so to inform those decisions you're going to try to first learn the features and
06:30 - 07:00 learn the inputs that determine how to complete that task and that's really what this class is all about it's how we can teach algorithms teach computers how to learn a task directly from raw data so just by giving a data set of a bunch of examples how can we teach a computer to also complete that task like we see in the data set now this course is split between technical lectures and software labs and we'll have several new updates in this year's edition of the class
07:00 - 07:30 especially in some of the later lectures in this first lecture we'll cover the foundations of deep learning and neural networks starting with the building block of neural networks which is just a single neuron and finally we'll conclude with some really exciting guest lectures and student projects from all of you as part of the final prize competition where you'll be eligible to win a bunch of exciting prizes and awards so for those of you who are taking this class for credit
07:30 - 08:00 you'll have two options to fulfill your credit requirement the first option is a project proposal where you'll get to work either individually or in groups of up to four people and develop some cool new deep learning idea doing so will make you eligible for some of these uh awesome sponsored prizes now we realize that one week is a super short and condensed amount of time to make any tangible code progress on a
08:00 - 08:30 deep learning project so what we're actually going to be judging you here on is not your results but rather the novelty of your ideas and the ability that we believe that you could actually execute these ideas in practice given the state of the art today now on the last day of class we'll give you all a three-minute presentation where your group can present your idea and potentially win an award and there's actually an art i think to
08:30 - 09:00 presenting an idea in such a short amount of time that we're also going to be kind of judging you on to see how quickly and effectively you can convey those ideas now the second option to fulfill your credit requirement is just to write a one-page review of any deep learning paper and this will be due on the last thursday of the class now in addition to the final project prizes we'll also be awarding prizes for the top lab submissions for each of the three labs and like i mentioned before this year we're also holding a special prize for lab 3
09:00 - 09:30 where students will be able to deploy their results onto a full-scale self-driving car in the real world for support in this class please post all of your questions to piazza check out the course website for announcements the course canvas also for announcements and digital recordings of the lectures and labs will be available on canvas shortly after each of the classes so this course has an incredible team that you can reach out to if you ever have any questions either through canvas
09:30 - 10:00 or through the email list at the bottom of the slide feel free to reach out and we really want to give a huge shout out and thanks to all of our sponsors who without their support this class would not be possible this is our fifth year teaching the class and we're super excited to be back again and teaching such a remarkable field and exciting content so now let's start with some of the exciting stuff now that we've covered all of the logistics of the class right so let's start by asking ourselves a question
10:00 - 10:30 why do we care about this and why did all of you sign up to take this class why do you care about deep learning well traditional machine learning algorithms typically operate by defining a set of rules or features in the environment in the data right so usually these are hand engineered right so a human will look at the data and try to extract some hand engineered features from the data now in deep learning we're actually trying to do something a little bit different the key idea of deep
10:30 - 11:00 learning is that these features are going to be learned directly from the data itself in a hierarchical manner so this means that given a data set let's say a task to detect faces for example can we train a deep learning model to take as input a face and start to detect the face by first detecting edges for example very low level features building up those edges to build eyes and noses and mouths and then building up some of those smaller components of faces into larger
11:00 - 11:30 facial structure features so as you go deeper and deeper into a neural network architecture you'll actually see its ability to capture these types of hierarchical features and that's the goal of deep learning compared to machine learning is actually the ability to learn and extract these features to perform machine learning on them now actually the fundamental building blocks of deep learning and their underlying algorithms have actually existed for decades so why are we studying this now well for one data has become much more prevalent so data is
11:30 - 12:00 really the driving power of a lot of these algorithms and today we're living in the world of big data where we have more data than ever before now second these models and these algorithms neural networks are extremely and massively parallelizable they can benefit tremendously from and they have benefited tremendously from modern advances in gpu architectures that we have experienced over the past decade right and these these advances
12:00 - 12:30 these types of gpu architectures simply did not exist when these algorithms were created for example the idea for the foundational neuron dates back to around 1960. so when you think back to 1960 we simply did not have the compute that we have today and finally due to amazing open source toolboxes like tensorflow we're able to actually build and deploy these algorithms and these models have become extremely
12:30 - 13:00 streamlined so let's start with the fundamental building block of a neural network and that is just a single neuron now the idea of a single neuron or let's call this a perceptron is actually extremely intuitive let's start by defining how a single neuron takes as input information and it outputs a prediction okay so just looking at its forward pass its forward prediction call from inputs on the left to outputs on the right
13:00 - 13:30 so we define a set of inputs let's call them x1 to xm now each of these numbers on the left in the blue circles are multiplied by their corresponding weight and then added all together we take this single number that comes out of this addition and pass it through a nonlinear activation function we call this the activation function and we'll see why in a few slides and the output of that function is going to give us our prediction y well this is actually not entirely correct i forgot one piece of detail
13:30 - 14:00 here we also have a bias term which here i'm calling w0 sometimes you also see it as the letter b and the bias term allows us to shift the input to our activation function to the left or to the right now on the right side here you can actually see this diagram on the left illustrated and written out in mathematical equation form as a single equation and we can actually rewrite this equation using linear algebra in terms of vectors and dot products
14:00 - 14:30 so let's do that here now we're going to collapse x1 to xm into a single vector called capital x and capital w will denote the vector of the corresponding weights w1 to wm the output here is obtained by taking their dot product adding a bias and applying this non-linearity and that's our output y
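Written out as an equation, the forward pass just described is:

$$\hat{y} \;=\; g\!\left(w_0 + \sum_{i=1}^{m} x_i\, w_i\right) \;=\; g\!\left(w_0 + X^{T} W\right)$$

where $X = [x_1, \dots, x_m]$ is the input vector, $W = [w_1, \dots, w_m]$ is the weight vector, $w_0$ is the bias, and $g$ is the nonlinear activation function.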
14:30 - 15:00 so now you might be wondering the only missing piece here is what is this activation function well i said it's a nonlinear function but what does that actually mean here's an example of one common function that people use as an activation function on the bottom right this is called the sigmoid function and it's defined mathematically above its plot here in fact there are many different types of nonlinear activation functions used in neural networks here are some common ones and throughout this entire presentation you'll also see these tensorflow code blocks on the bottom part of the screen just to briefly illustrate how you can take the concepts
15:00 - 15:30 the technical concepts that you're learning as part of this lecture and extend them into practical software right so these tensorflow code blocks are going to be extremely helpful for some of your software labs to kind of bridge the connection between the foundations set up in the lectures and the practical side of the labs now the sigmoid activation function which you can see on the left hand side is popular like i said largely because it's one of the few functions
15:30 - 16:00 in deep learning that outputs values between zero and one right so this makes it extremely suitable for modeling things like probabilities because probabilities also live in the range between zero and one so if we want to output a probability we can simply pass it through a sigmoid function and that will give us something that resembles a probability that we can use to train with now in modern deep learning neural networks it's also very common to use what's called the relu function and you can see an example of this on the right and this is extremely popular it's a piecewise function with a single non-linearity at x equals 0.
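A small illustrative check of these two activations in TensorFlow (the sample inputs are arbitrary values chosen for demonstration):

```python
import tensorflow as tf

# Evaluate the two activation functions mentioned above on a few arbitrary points.
z = tf.constant([-6.0, 0.0, 2.0])
print(tf.math.sigmoid(z).numpy())  # squashed into (0, 1) -- handy for probabilities
print(tf.nn.relu(z).numpy())       # zero for negative inputs, identity for positive ones
```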
16:00 - 16:30 now i hope all of you are kind of asking this question to yourselves why do you even need activation functions what's the point what's the importance of an activation function why can't we just directly pass our linear combination of the inputs with our weights through to the output well the point of an activation function is to introduce a non-linearity into our system now imagine i told you to separate the green points from the red points and that's the thing that you want to train and you only have access
16:30 - 17:00 to one line it's not non-linear so you only have access to a line how can you do this well it's an extremely hard problem right and in fact if you can only use a linear activation function in your network no matter how many neurons you have or how deep the network is you will only be able to produce a result that is one line because when you add a line to a line you still get a line as output non-linearities allow us to approximate arbitrarily complex functions and that's
17:00 - 17:30 what makes neural networks extremely powerful let's understand this with a simple example so imagine i give you a trained network now here i'm giving you the weights and the weights w are on the top right so w0 is going to be set to 1 that's our bias and the w vector the weights of our input dimension is going to be a vector with the values 3 and negative 2. this network only has two inputs right x1 and x2 and if we want to get the
17:30 - 18:00 output of it we simply do the same step as before and i want to keep drilling in this message to get the output all we have to do is take our inputs multiply them by our corresponding weights w add the bias and apply a non-linearity it's that simple but let's take a look at what's actually inside that non-linearity when i do that multiplication and addition what comes out it's simply a weighted combination of the inputs in the form of a 2d line right so we take our inputs x
18:00 - 18:30 transpose multiply it as a dot product with our weights add a bias and if we look at what's inside this parentheses here what is getting passed to g this is simply a two dimensional line because we have two inputs x1 and x2 so we can actually plot this line in feature space or input space we'll call it because along the x-axis is x1 and along the y-axis is x2 and we can plot the decision boundary we call it of the input to this activation function
18:30 - 19:00 this is actually the line that defines our perceptron neuron now if i give you a new data point let's say x equals negative 1 2 we can plot this data point in this space in this two-dimensional space and we can also see where it falls with respect to that line now if i want to compute its weighted combination i simply follow the perceptron equation to get 1 minus 3 minus 4 which equals
19:00 - 19:30 minus 6. and when i put that into a sigmoid activation function we get a final output of approximately 0.002 now why is that the case so assume we have this input negative 1 and 2 and this is just going through the math again we pass that through our equations and we get this output from g let's dive in a little bit more to this feature graph well remember if the sigmoid function is defined in the standard way it's
19:30 - 20:00 actually outputting values between 0 and 1 and the middle is actually at 0.5 right so anything on the left hand side of this line in feature space is going to correspond to the input z being less than 0 and the output being less than 0.5 and on the other side is the opposite that's corresponding to our activation z being greater than 0 and our output y being greater than 0.5 right so this is just following all of the sigmoid math but illustrating it in pictorial form and schematics
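A quick numeric check of the worked example above, using the weights and input given in the lecture:

```python
import math

# Perceptron from the example: bias w0 = 1, weights W = [3, -2], input x = [-1, 2].
w0, W, x = 1.0, [3.0, -2.0], [-1.0, 2.0]

z = w0 + sum(wi * xi for wi, xi in zip(W, x))   # 1 + (3)(-1) + (-2)(2) = -6
y = 1.0 / (1.0 + math.exp(-z))                  # sigmoid(-6) is roughly 0.0025
print(z, y)
```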
20:00 - 20:30 now in practice neural networks don't have just two weights w1 w2 they're composed of millions and millions of weights in practice so you can't really draw these types of plots for the types of neural networks that you'll be creating but this is to give you an example of a single neuron with a very small number of weights and we can actually visualize these types of things to gain some more intuition about what's going on under the hood so now that we have an idea about the perceptron let's start by building
20:30 - 21:00 neural networks from this foundational building block and seeing how all of this story starts to come together so let's revisit our previous diagram of the perceptron if there's a few things i want you to take away from this class in this lecture today i want it to be this thing here so i want you to remember how a perceptron works and i want to remember three steps the first step is dot product your inputs with your weights dot product add a bias and apply a non-linearity and that
21:00 - 21:30 defines your entire perceptron forward propagation all the way down into these three operations now let's simplify the diagram a little bit now that we got the foundations down i'll remove all of the weight labels so now it's assumed that every line every arrow has a corresponding weight associated to it now i'll remove the bias term for simplicity as well here you can see right here and note that z the result of our dot product plus our bias
21:30 - 22:00 is before we apply the non-linearity right so g of z is our output our prediction of the perceptron our final output is simply our activation function g taking as input that state z if we want to define a multi-output neural network so now we don't have one output y let's say we have two outputs y one and y two we simply add another perceptron to this diagram now we have two outputs each one
22:00 - 22:30 is a normal perceptron just like we saw before each one is taking inputs from x1 to xm from the x's multiplying them by the weights and they have two different sets of weights because they're two different neurons right they're two different perceptrons they're going to add their own biases and then they're going to apply the activation function so you'll get two different outputs because the weights are different for each of these neurons if we want to define let's say this entire
22:30 - 23:00 system from scratch now using tensorflow we can do this very very simply just by following the operations that i outlined in the previous slide so let's start with a single dense layer a dense layer just corresponds to a layer of these neurons so not just one neuron or two neurons but an arbitrary number let's say n neurons in our dense layer we're going to have two sets of variables one is the weight vector and one is the bias so we can define both of these types of variables
23:00 - 23:30 and weights as part of our layer the next step is to define the forward pass right and remember we talked about the operations that defined this forward pass of a perceptron and of a dense layer now it's composed of the steps that we talked about first we compute matrix multiplication of our inputs with our weight matrix our weight vector so inputs multiplied by w add the bias plus b and feed it through our activation
23:30 - 24:00 function here i'm choosing a sigmoid activation function and then we return the output and that defines a dense layer of a neural network now that we have this dense layer we can implement it from scratch like we see in the previous slide but we're pretty lucky because tensorflow has already implemented this dense layer for us so we don't have to write that additional code instead let's just call it here we can see an example of calling a dense layer with the number of output units set equal to 2.
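A sketch of the from-scratch dense layer just described, alongside the built-in call; the class name, variable names, and initializers here are my own choices and not necessarily the slide's exact code:

```python
import tensorflow as tf

# From-scratch dense layer: a weight matrix, a bias vector, then matmul + bias + sigmoid.
class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.W = self.add_weight(shape=(input_dim, output_dim),
                                 initializer="random_normal")
        self.b = self.add_weight(shape=(1, output_dim), initializer="zeros")

    def call(self, inputs):
        z = tf.matmul(inputs, self.W) + self.b   # dot product of inputs and weights, plus bias
        return tf.math.sigmoid(z)                # nonlinear activation

# The built-in equivalent that the lecture calls with two output units:
layer = tf.keras.layers.Dense(units=2)
```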
24:00 - 24:30 now let's dive a little bit deeper and see how we can make a full single layer neural network not just a single layer but also an output layer as well this is called a single hidden layer neural network and we call this a hidden layer because these states in the middle shown in red are not directly observable or enforceable like the inputs which we feed into the model and the outputs which we know we want to predict right so since we now have this transformation from the inputs to the hidden layer and
24:30 - 25:00 from the hidden layer to the output layer we need now two sets of weight matrices w1 for the input layer and w2 for the output layer now if we look at a single unit in this hidden layer let's take this second unit for example z2 it's just the same perceptron that we've been seeing over and over in this lecture already so we saw before that it's obtaining its output by taking a dot product with those x's its inputs multiplying them via the dot product
25:00 - 25:30 adding a bias and passing that through in the form of z2 if we took a different hidden node like z3 for example it would have a different output value just because the weights leading to z3 are probably going to be different than the weights leading to z2 and we basically initialize them to be different so we have diversity in the neurons now this picture looks a little bit messy so let me clean it up a little bit more and from now on i'll just use this symbol in the middle to denote what we're calling a dense layer dense is
25:30 - 26:00 called dense because every input is connected to every output like in a fully connected way so sometimes you also call this a fully connected layer to define this fully connected network or dense network in tensorflow you can simply stack your dense layers one after another in what's called a sequential model a sequential model is something that feeds your inputs sequentially from inputs to outputs so here we have two layers the hidden layer first defined with n hidden units and our
26:00 - 26:30 output layer with two output units and if we want to create a deep neural network it's the same thing we just keep stacking these hidden layers on top of each other in a sequential model and we can create more and more hierarchical networks and this network for example is one where the final output in purple is actually computed by going deeper and deeper into the layers of this network and if we want to create a deep neural network in software all we need to do is stack those software blocks over and over and create more hierarchical models
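In code, that stacking might look like the following; the hidden-layer sizes are placeholders, not values from the lecture:

```python
import tensorflow as tf

# Stacking dense layers in a sequential model to build a deeper network.
n1, n2 = 32, 32   # hidden-layer sizes are arbitrary placeholders
model = tf.keras.Sequential([
    tf.keras.layers.Dense(n1, activation="relu"),   # hidden layer 1
    tf.keras.layers.Dense(n2, activation="relu"),   # hidden layer 2 -- keep stacking for deeper models
    tf.keras.layers.Dense(2),                       # output layer with two units
])
```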
26:30 - 27:00 okay so this is awesome now we have an idea and we've seen an example of how we can take a very simple and intuitive mechanism of a single neuron a single perceptron and build that all into the form of layers and complete complex neural networks let's take a look at how we can apply them in a very real and practical problem that maybe some of you have
27:00 - 27:30 thought about before coming to today's class now here's the problem that i want to train an ai to solve if i was a student in this class so will i pass this class that's the problem that we're going to ask our machine or a deep learning algorithm to answer for us and to do that let's start by defining some input features to the ai model one feature let's use to learn from is the number of lectures that you attend as part of
27:30 - 28:00 today as part of this course and the second feature is the number of hours that you're going to spend developing your final project and we can collect a bunch of data because this is our fifth year teaching this amazing class we can collect a bunch of data from past years on how previous students performed here so each dot corresponds to a student who took this class we can plot each student in this two-dimensional feature space where on the x-axis is the number of lectures they attended and on the y-axis is the number of hours that they spent on the final project the green points
28:00 - 28:30 are the students who pass and the red points are those who failed and then there's you you lie right here right here at the point four five so you've attended four lectures and you've spent five hours on your final project you want to build now a neural network to determine given everyone else's standing in the class will i pass or fail this class now let's do it so we have these two inputs one is four one is five this is your inputs and we're going to feed these into a single layered neural network with three hidden
28:30 - 29:00 units and we'll see that when we feed it through we get a predicted value of probability of you passing this class as 0.1 or 10 percent so that's pretty bad because well you're not going to fail the class you're actually going to succeed so the actual value here is going to be one you do pass the class so why did the network get this answer incorrect well to start with the network was never trained right we just started the network it has no idea what 6.s191 is what it
29:00 - 29:30 takes for a student to pass or fail a class or what these inputs four and five mean right so it has no idea it's never been trained it's basically like a baby that's never seen anything before and you're feeding some random data to it and we have no reason to expect that it's going to get this answer correct that's because we never told it how to train itself how to update itself so that it can learn how to predict such an outcome or to predict such a task of passing or failing a class
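A sketch of the kind of untrained network being described; the hidden activation here is an assumption, and the 0.1 in the lecture comes from that network's particular random weights, so this toy version will produce some other arbitrary probability:

```python
import tensorflow as tf

# Two inputs (lectures attended, hours on the project), three hidden units, one probability out.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(3, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
x = tf.constant([[4.0, 5.0]])   # you: 4 lectures attended, 5 hours on the project
print(float(model(x)))          # untrained random weights, so an essentially arbitrary probability
```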
29:30 - 30:00 now to do this we have to actually define to the network what it means to get a wrong prediction or what it means to incur some error now the closer our prediction is to our actual value the lower this error or our loss function will be and the farther apart they are the more error we will incur and the closer they are together the less error we will incur now let's assume we have data not just from one student but for many students
30:00 - 30:30 now we care about how the model did on average across all of the students in our data set and this is called the empirical loss function it's just simply the mean of all of the individual loss functions from our data set and when training a network to solve this problem we want to minimize the empirical loss so we want to minimize the loss that the network incurs on the data set that it has access to between our predictions and our outputs so if we look at the problem of binary
30:30 - 31:00 classification for example passing or failing a class we can use a loss function called the softmax cross-entropy loss and we'll go into more detail and you'll get some experience implementing this loss function as part of your software labs but i'll just give it as a quick aside right now as part of this slide
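As a rough illustration, the softmax cross-entropy loss mentioned here can be computed in TensorFlow like this; the labels and logits are made-up numbers for two hypothetical students:

```python
import tensorflow as tf

# Softmax cross-entropy on made-up predictions for two students (classes: [fail, pass]).
labels = tf.constant([[0.0, 1.0], [1.0, 0.0]])    # one-hot true outcomes
logits = tf.constant([[2.0, -1.0], [0.5, 1.5]])   # raw (pre-softmax) network outputs
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))
print(float(loss))   # mean loss across the two examples
```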
31:00 - 31:30 now let's suppose instead of predicting pass or fail a binary classification output i want to predict a numeric output for example the grade that i'm going to get in this class that's going to be any real number so we might want to use a different loss function because we're not doing a classification problem anymore we might want to use something like a mean squared error loss function or maybe something else that takes as input continuous real valued numbers okay so now that we have this loss function we're able to tell our network when it makes a mistake now we've got to put that together with the actual model that we defined in the last part to actually see now how we can
31:30 - 32:00 train our model to update and optimize itself given that error function so how can it minimize the error given a data set so remember that we want the objective here is that we want to identify a set of weights let's call them w star that will give us the minimum loss function on average throughout this entire data sets that's the gold standard of what we want to accomplish
32:00 - 32:30 here in training a neural network right so the whole goal of this class really is how can we identify w star right so how can we train our the weights all of the weights in our network such that the loss that we get as an output is as small as it can possibly be right so that means that we want to find the w's that minimize j of w so that's our empirical loss our average empirical loss remember that w is just a group of all of the ws from our from every layer in the model right so we just concatenate
32:30 - 33:00 them all together and we want to minimize the we want to find the weights that give us the lowest loss and remember that our loss function is just a is a function right that takes us input all of our weights so given some set of weights our loss function will output a single value right that's the error if we only have two weights for example we might have a loss function that looks like this we can actually plot the loss function because it's it's relatively low dimensional we can visualize it right so on the x on the horizontal axis x and y axis we're having the two
33:00 - 33:30 weights w0 and w1 and on the vertical axis we're having the loss so higher loss is worse and we want to find the weights w0 and w1 that will bring us to the lowest part of this loss landscape so how do we do that this is a process called optimization and we're going to start by picking an initial w0 and w1 start anywhere you want on this graph and we're going to compute the gradient
33:30 - 34:00 remember our loss function is simply a mathematical function so we can compute the derivatives and compute the gradients of this function and the gradient tells us the direction that we need to go to maximize j of w to maximize our loss so let's take a small step now in the opposite direction right because we want to find the lowest loss for a given set of weights so we're going to step in the opposite direction of our gradient and we're going to keep repeating this process we're going to compute gradients again at the new point and keep stepping and stepping and
34:00 - 34:30 stepping until we converge to a local minimum eventually the gradients will converge and we'll stop at the bottom it may not be the global bottom but we'll find some bottom of our loss landscape so we can summarize this whole algorithm known as gradient descent using the gradients to descend into our loss function in pseudocode so here's the algorithm written out as pseudocode we're going to start by initializing weights randomly and we're going to repeat the two steps until we converge so first we're going to compute our
34:30 - 35:00 gradients and then we're going to step in the opposite direction a small step in the opposite direction of our gradients to update our weights right now the amount that we step here eta the greek character next to our gradients determines the magnitude of the step that we take in the direction of our gradients and we're going to talk about that later that's a very important part of this problem but before i do that i just want to show you the analog of this algorithm written out in tensorflow
35:00 - 35:30 again which may be helpful for your software labs right so this whole algorithm can be replicated using automatic differentiation using platforms like tensorflow so with tensorflow you can actually randomly initialize your weights and you can actually compute the gradients and do these differentiations automatically so it will actually take care of the definitions of all of these gradients using automatic differentiation and it will return the gradients that you can directly use to step with and optimize and train your weights
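A minimal runnable sketch of that loop, using a toy one-parameter loss so the behavior is easy to see; the quadratic loss, learning rate, and step count are purely illustrative:

```python
import tensorflow as tf

def compute_loss(w):
    return (w - 3.0) ** 2                     # toy loss J(w) with its minimum at w = 3

w = tf.Variable(tf.random.normal(shape=()))   # initialize the weight randomly
lr = 0.1                                      # the learning rate eta

for _ in range(100):                          # "repeat until convergence" (fixed steps here)
    with tf.GradientTape() as tape:
        loss = compute_loss(w)
    grad = tape.gradient(loss, w)             # dJ/dw via automatic differentiation
    w.assign_sub(lr * grad)                   # small step opposite to the gradient

print(float(w))                               # approaches 3.0
```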
35:30 - 36:00 but now let's take a look at this term here the gradient so i mentioned to you that tensorflow and your software packages will compute this for you but how does it actually do that i think it's important for you to understand how the gradient is computed for every single weight in your neural network so this is actually a process called back propagation in deep learning and neural networks and we'll start with a very simple network and this is probably the simplest network in existence because it only contains one hidden neuron right so it's the smallest
36:00 - 36:30 possible neural network now the goal here is that we're going to try and do back propagation manually ourselves by hand so we're going to try and compute the gradient of our loss j of w with respect to our weight w2 for example this tells us how much a small change in w2 will affect our loss function right so if i change and perturb w2 a little bit how does my error change as a result so if we write this out as a derivative we
36:30 - 37:00 start by applying the chain rule backwards from the loss function through the output okay so we start with the loss function here and we specifically decompose dj dw2 into two terms we're going to decompose that into dj dy multiplied by dy dw2 right so we're just applying the chain rule to decompose the left hand side into two gradients that we do have access to
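Written out, that decomposition, together with the recursive step for $w_1$ that the next passage walks through, is:

$$\frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial \hat{y}}\cdot\frac{\partial \hat{y}}{\partial w_2},
\qquad
\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}}\cdot\frac{\partial \hat{y}}{\partial z_1}\cdot\frac{\partial z_1}{\partial w_1}$$

where $z_1$ is the hidden unit's pre-activation state sitting between $w_1$ and the output $\hat{y}$.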
37:00 - 37:30 now this is possible because y is only dependent on the previous layer now let's suppose we want to compute the gradients of the weight before w2 which in this case is w1 well now we've replaced w2 with w1 on the left hand side and then we need to apply the chain rule one more time recursively right so we take this equation again and we need to apply the chain rule to the right hand side on the the red highlighted portion and split that part into two parts again so now we propagate our gradient
37:30 - 38:00 our old gradient through the hidden unit now all the way back to the weight that we're interested in which in this case is w1 right so remember again this is called back propagation and we repeat this process for every single weight in our neural network and if we repeat this process of propagating gradients all the way back to the input then we can determine how every single weight in our neural network needs to change and how they need to change in order to decrease our loss on the next iteration so then
38:00 - 38:30 we can apply those small little changes so that our losses a little bit better on the next trial and that's the backpropagation algorithm in theory it's a very simple algorithm just compute the gradients and step in the opposite direction of your gradient but now let's touch on some insights from training these networks in practice which is very different than the simple example that i gave before so optimizing neural networks in practice can be extremely difficult it does not look like the loss function landscape that i
38:30 - 39:00 gave you before in practice it might look something like this where your loss landscape is super non-convex and very complex right so here's an example from a paper that came out a year ago where the authors tried to actually visualize what deep neural network loss landscapes actually look like and recall this update equation that we defined during gradient descent i didn't talk much about this parameter i alluded to it it's called the learning rate and in practice it determines a lot about how big a step we take and how
39:00 - 39:30 much trust we place in our gradients so if we set our learning rate to be very small then we may have a model that gets stuck in local minima right because we're only taking small steps along our gradient so we're going to converge very slowly we may even get stuck if it's too small if the learning rate is too large we might follow the gradient again but we might overshoot and actually diverge and our training may kind of explode and
39:30 - 40:00 it's not a stable training process so in reality we want to use learning rates that are neither too small nor too large to avoid these local minima and still converge right so we want to kind of use medium-sized learning rates and what medium means is totally arbitrary you're going to see that later on but the idea is to skip over these local minima and still find global or hopefully more global optima in our loss landscape so how do we actually find our learning
40:00 - 40:30 rate well you set this as part of the definition of your learning algorithm so you have to actually input your learning rate and one way to do it is you could try a bunch of different learning rates and see which one works the best that's actually a very common technique in practice even though it sounds very unsatisfying another idea is maybe we could do something a little bit smarter and use what are called adaptive learning rates so these are learning rates that can kind of observe the loss landscape and adapt themselves to kind of tackle some of these challenges and maybe escape some local minima or speed
40:30 - 41:00 up when it's on a local minimum so this means that the learning rate because it's adaptive may increase or decrease depending on how large our gradient is and how fast we're learning or many other options right so in fact these have been widely explored in the deep learning literature and heavily published on as part of software packages like tensorflow as well so during your labs we encourage you to try out some of these different types of optimizers and algorithms and see how they can actually adapt their own learning rates to stabilize training much better
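A few of the TensorFlow optimizers you might try in the labs; the learning-rate values below are common illustrative defaults, not recommendations from the lecture:

```python
import tensorflow as tf

# Fixed-rate gradient descent versus optimizers that adapt their learning rates.
sgd     = tf.keras.optimizers.SGD(learning_rate=0.01)      # plain gradient descent, fixed rate
adam    = tf.keras.optimizers.Adam(learning_rate=0.001)    # adaptive per-parameter rates
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)
```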
41:00 - 41:30 now let's put all of this together now that we've learned how to create the model how to define the loss function and how to actually perform back propagation using an optimization algorithm and it looks like this so we define our model on the top we define our optimizer here you can try out a bunch of different tensorflow optimizers we feed the output of our model grab its gradient and apply its gradient to the optimizer so we can update our weights so on the next iteration we're having a better prediction
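Putting those pieces into code, a training step along the lines described above might look like this; the model architecture, loss, and data are placeholders, not the slide's exact code:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(2),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x)                    # forward pass through the model
        loss = loss_fn(y, logits)            # how wrong the predictions were
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # update the weights
    return loss

# In a loop: loss = train_step(x_batch, y_batch) for each batch of data.
```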
41:30 - 42:00 now i want to continue to talk about tips for training these networks in practice very briefly towards the end of this lecture because there's a very powerful idea of batching your data into mini batches to stabilize your training even further and to do this let's first revisit our gradient descent algorithm the gradient is actually very very computationally expensive to compute because it's computed as a summation
42:00 - 42:30 over your entire data set now imagine your data set is huge right it's not going to be feasible in many real life problems to compute on every training iteration let's define a new gradient function that instead of computing it on the entire data set it just computes it on a single random example from our data set so this is going to be a very noisy estimate of our gradient right so just from one example we can compute an estimate it's not going to be the true gradient but an estimate and this is much easier to compute
42:30 - 43:00 because it's it's very small so just one data point is used to compute it but it's also very noisy and stochastic since it was used also with this one example right so what's the middle ground instead of computing it from the whole data set and instead of computing it from just one example let's pick a random set of a small subset of b examples we'll call this a batch of examples and we'll feed this batch through our model and compute the gradient with respect to this batch this gives us a much better estimate in
43:00 - 43:30 practice than using a single gradient it's still an estimate because it's not the full data set but still it's much more computationally attractive for computers to do this on a small batch usually we're talking about batches of maybe 32 or up to 100 sometimes people use larger batches with larger neural networks and larger gpus but even using something smaller like 32 can have a drastic improvement on your performance now the increase in gradient estimation accuracy actually allows us to converge much quicker in practice so it allows us to more smoothly and accurately estimate our gradients and ultimately that leads to faster training and more parallelizable computation because over each of the elements in our batch we can kind of parallelize the gradients and then take the average of all of the gradients
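One common way to set up those mini-batches in TensorFlow; the toy tensors below just stand in for a real training set:

```python
import tensorflow as tf

# A toy "dataset" of 1000 two-feature examples, shuffled and split into batches of 32.
x_all = tf.random.normal((1000, 2))
y_all = tf.random.uniform((1000,), maxval=2, dtype=tf.int32)

dataset = tf.data.Dataset.from_tensor_slices((x_all, y_all))
dataset = dataset.shuffle(buffer_size=1000).batch(32)

for x_batch, y_batch in dataset:
    pass   # each iteration would compute gradients on this batch only, e.g. train_step(x_batch, y_batch)
```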
43:30 - 44:00 now this last topic i want to address is that of overfitting this is also a problem that is very general to all of machine learning not just deep learning but especially in deep learning which is why i want to talk about it in today's lecture it's a fundamental problem and challenge
44:00 - 44:30 of machine learning and ideally in machine learning we're given a data set like these red dots and we want to learn a model like the blue line that can approximate our data right said differently we want to build models that learn representations of our data that can generalize to new data so assume we want to build this line to fit our red dots we can do this by using a single linear line on the left hand side but this is not going to really well capture all of the intricacies of
44:30 - 45:00 our red points and of our data or we can go on the other far extreme and overfit we can really capture all the details but this one on the far right is not going to generalize to a new data point that it sees from a test set for example ideally we want to wind up with something in the middle that is still small enough to maintain some of those generalization capabilities and large enough to capture the overall trends so to address this problem we can employ what's called a technique called
45:00 - 45:30 regularization regularization is simply a method that you can introduce into your training to discourage complex models so to encourage more simple types of models to be learned and as we've seen before it's actually critical and crucial for our models to be able to generalize past our training data right so we can fit our models to our training data and actually we can minimize our loss to almost zero in most cases but that's not
45:30 - 46:00 what we really care about we always want to train on a training set but then have that model be deployed and generalized to a test set which we don't have access to so the most popular regularization technique for deep learning is a very simple idea of dropout and let's revisit this picture of a neural network that we started with in the beginning of this class and in dropout during training what we're going to do is we're going to randomly drop and set some of the activations in this neural network in the hidden layer to zero with some probability let's say we drop out 50 percent of
46:00 - 46:30 the neurons we randomly pick 50 percent of the neurons that means that their activations now are all set to zero and we force the network to not rely on those neurons too much so this forces the model to kind of identify different types of pathways through the network on this iteration we pick some random 50 percent to drop out and on the next iteration we may pick a different random 50 percent and this is going to encourage these different pathways and encourage the network to identify different forms of processing its information to accomplish
46:30 - 47:00 its decision making capabilities another regularization technique is a technique called early stopping now the idea here is that we all know the definition of overfitting is when our model starts to have very bad performance on our test set we don't have a test set but we can kind of create an example test set using our training set so we can split up our training set into two parts one that we'll use for training and one that we will
47:00 - 47:30 not show to the training algorithm but we can use to start to identify when we start to overfit a little bit so on the x-axis we can actually see training iterations and as we start to train we can see that both the training loss and the testing loss go down and they keep going down until they start to diverge and this pattern of divergence actually continues for the rest of training and what we want to do here is actually identify the place where the testing accuracy or the testing loss is minimized and that's going to be the model that we're going to use
47:30 - 48:00 and that's going to be the best kind of model in terms of generalization that we can use for deployment so when we actually have a brand new test data set that's going to be the model that we're going to use so we're going to employ this technique called early stopping to identify it and as we can see anything that kind of falls on the left side of this line are models that are underfitting and anything on the right side of this line are going to be models that are considered to be overfit right because this divergence has occurred now i'll conclude this lecture by
48:00 - 48:30 first summarizing the three main points that we've covered so far so first we learned about the fundamental building blocks of neural networks the perceptron a single neuron we learned about stacking and composing these types of neurons together to form layers and full networks and then finally we learned about how to actually complete the whole puzzle and train these neural networks end to end using some loss function and using gradient descent and back propagation
48:30 - 49:00 so in the next lecture we'll hear from ava on a very exciting topic taking a step forward and actually doing deep sequence modeling so not just one input but now a series a sequence of inputs over time using rnns and also a really new and exciting type of model called the transformer and attention mechanism so let's resume the class in about five minutes once we have a chance for ava to just get set up and bring up her presentation so thank you very much