Reading the source code for DETR (object detection with transformers)

Estimated read time: 1:20


    Summary

    In this video, Mak Gaiduk walks viewers through setting up and training a DETR (DEtection TRansformer) model, originally developed by Facebook researchers. The walkthrough covers acquiring the dataset and the repository, configuring the setup, training the model, and reading and understanding the source code. Gaiduk offers insight into distinctive aspects of the model's architecture, such as the transformer encoder-decoder mechanism, positional encoding, and the role of datasets with non-canonical object representations. He also explains technical details like the training loop, optimizer selection, and a debugging setup built around Jupyter notebooks.

      Highlights

      • The video provides a step-by-step guide on setting up and training the DETR model. 🛠️
      • Mak explains the significance of using different datasets for training, highlighting COCO's advantages. 🎯
      • Technical aspects like positional encoding and transformer mechanisms are discussed in layman's terms. 📊
      • The creator demonstrates using Jupyter notebooks for efficient debugging and understanding source code. 🐞
      • Understanding the training loop and model evaluation enhances your deep learning toolkit. 📚

      Key Takeaways

      • Learn how to train DETR models and replicate Facebook's results. 🚀
      • Understand the importance of different datasets like ImageNet and COCO. 🧠
      • Explore the intricacies of transformer architectures and their practical applications. 🤖
      • Discover debugging tools and efficient coding practices. 💡
      • Gain insights into model evaluation and optimization techniques. 🔄

      Overview

      Mak Gaiduk opens the video by introducing the DETR model and outlining the session's goals, which include training a model from scratch and understanding its code. The setup involves acquiring the COCO dataset (the ImageNet-pretrained ResNet backbone weights come from the Torch Hub) and configuring the environment to run the model efficiently.

        As the video proceeds, Mak goes into the nitty-gritty of the source code, explaining the architecture of the DETR model with a focus on transformers, including encoder-decoder mechanisms and positional encoding. He discusses the process of setting up Jupyter notebooks for debugging, which allows for real-time troubleshooting and code evaluation.

          To conclude, the video highlights the significance of choosing the right optimizer and explains the step-by-step training process. Mak emphasizes regularization techniques like dropout and batch normalization and introduces upcoming content on advanced topics like deformable DETR and generative AI techniques.

            Chapters

            • 00:00 - 01:00: Introduction and Overview The chapter titled 'Introduction and Overview' introduces the DETR object detection model developed by Facebook researchers, focusing on detection with transformers. While there is a previous video explaining the algorithm, this chapter emphasizes training a model independently, reimplementing the benchmark, and interpreting the code. The plan includes guiding the viewer through training their own model and sourcing the initial data.
            • 01:00 - 03:00: Training Your Own Model The chapter titled 'Training Your Own Model' provides guidance on replicating the research results by setting up the repository correctly. It also delves into debugging techniques using tools like Jupyter notebooks and the Python debugger. These tools assist in examining model code by stepping through it methodically and printing intermediate variables to enhance understanding.
            • 03:00 - 05:00: Debugging Setup The chapter titled 'Debugging Setup' focuses on understanding and debugging the setup required for the model. It outlines a comprehensive walkthrough of the components involved in model preparation: understanding the code, the datasets and data loaders, and the image transformations, with an emphasis on reading and interpreting the data-handling source code. The chapter also delves into the model architecture, including the source code for the loss and the training loop, and divides the model into two main parts, highlighting the backbone, which reuses a pretrained ResNet, with a focus on practical implementation details.
            • 05:00 - 09:00: Data and Data Loaders The chapter titled 'Data and Data Loaders' discusses the backbone, ResNet-50 (trained in 2015), and the ImageNet dataset it was trained on. ImageNet is highlighted as being significantly larger, with 150 gigabytes of data, thousands of classes, and 1.2 million images. A video is mentioned that explores the ResNet model, the ImageNet dataset, and their source code.
            • 09:00 - 15:00: Model Architecture This chapter discusses the challenges of downloading ImageNet and training on its high-resolution, mostly canonical images of objects. It explains that instead of doing this, the final weights for ResNet-50 will be obtained from the Torch Hub, while the transformer and the queries will be trained from scratch.
            • 15:00 - 21:00: Transformer Components The chapter discusses the use of the COCO dataset for training, highlighting its compact size of only 20 gigabytes, comprising 90 classes and about 100,000 images. Despite its smaller size compared to datasets like ImageNet, COCO proves to be more effective for object detection tasks. The primary reason for this effectiveness is the presence of non-canonical representations of objects in the COCO dataset, setting it apart from ImageNet.
            • 21:00 - 27:00: Loss Calculation This chapter discusses loss calculation in the context of image processing and object detection. It starts with an explanation of canonical representation, illustrated by a picture of a dog placed centrally, unobstructed, and facing the viewer. This is contrasted with images in the COCO dataset, which contain multiple objects that may be obstructed, partially visible, or facing away. It emphasizes the complexity and variability of real-life image scenarios, suggesting that complete visibility is not necessary for effective detection.
            • 27:00 - 32:00: Training Process and Conclusion This chapter goes through the process of training machine learning models for object detection. It explains the importance of using diverse and challenging datasets to better prepare models for real-world applications. It also provides guidance on accessing the relevant code, specifically the Facebook Research DETR repository, from which the necessary code can be cloned and run.
            • 32:00 - 37:00: Future Plans and Video Series In this chapter, the focus is on preparing for training a model by downloading the necessary dataset. The speaker emphasizes the importance of acquiring the full dataset, which includes training images, validation images, and annotations, totaling about 20 GB. The dataset in question is the COCO dataset, and instructions are provided to download it from the official website.

            Reading the source code for DETR (object detection with transformers) Transcription

            • 00:00 - 00:30 Hello again. Today I will talk about the DETR object detection model, authored by Facebook researchers, for detection with transformers, and here's a brief overview of it. I have a previous video about the actual algorithm and how everything works; this one will be more about training one yourself, reimplementing the benchmark and reading the code. So here is the plan for today: first, how to train your own model, how to get the initial data, how to
            • 00:30 - 01:00 set up the repository, run it and get the same results as the Facebook researchers reported on the COCO dataset. Second, I'll talk about my debugging setup, about the Jupyter notebook and Python debugger that help me navigate source code for models like these: to run the code, debug it, step through it line by line and maybe print out intermediate variables, which helps to read the
            • 01:00 - 01:30 code and understand what's actually going on there. Then we'll talk about the data: which datasets are used, data loaders, image transformations and so on, and we'll read the actual source code for that. Then we'll talk about model architecture, again looking at the source code, and at the loss and the training loop. All right, first, how to do it yourself. The model itself can be broadly divided into two parts. The first part is the so-called backbone, which just uses convolutions, similar to
            • 01:30 - 02:00 previous models like ResNet-50 that was trained back in 2015. That one is actually trained on a different dataset called ImageNet. I also have a video about that, about this specific ResNet model and the ImageNet dataset and reading the source code for that. Basically it's an older dataset and it's much bigger: it has 150 gigabytes, thousands of classes and 1.2 million images, so it
            • 02:00 - 02:30 is a bit challenging to download and to train something on, and the images themselves are high resolution, mostly canonical representations of objects. But we'll not do this today; we just get the final weights for ResNet-50 from the Torch Hub, and then the rest of the parts, the transformer and the queries and so on, we will train from scratch.
            • 02:30 - 03:00 They are trained using the COCO dataset. It is much smaller: only 20 gigabytes, 90 classes, about 100K images. It's curious, because this sort of dataset turns out to be enough for object detection tasks, and it actually works better than ImageNet. The main explanation for that is that it has non-canonical representations of the objects. In ImageNet, if you want
            • 03:00 - 03:30 a picture of a dog, you would have it right in the middle of the image, facing towards you, and this is the so-called canonical representation: it's unobstructed, it's facing towards you, it's in the middle. Now in the COCO dataset you would have many objects; objects might be obstructed, or only part of the object will be present in the picture, they might face away, you might only see a really small part of the object, and this is how it is in real life. And it turns out we don't need that
            • 03:30 - 04:00 much data to learn to detect objects, and this sort of more challenging data works better for real-world applications. Right, so first let's go get the code. It is located in this Facebook Research DETR repository, so you can just go ahead and clone it somewhere.
            • 04:00 - 04:30 The second thing you will need is the actual dataset. Since we are training the proper model, I recommend downloading it in full, so here you would need the train images, the validation images and then the annotations; it will be about 20 GB in total. Once again, this is just the COCO dataset download website, so you would need to download it,
            • 04:30 - 05:00 unpack it, and you would have the directory structure ready for training. Here I have it downloaded and unpacked: it has three folders, annotations with the labels, and train and val with the actual images. Then I also have the DETR repository that contains the code to run it. The repository is pretty much self-explanatory: they have a lot of
            • 05:00 - 05:30 instructions on how to set up for detection, they have some notebooks with some cool demonstrations, like this one where they actually visualize attention masks and so on, and finally how to train the model using just the COCO dataset. In here they use the torch.distributed.launch module to launch it; it is only needed if you want multiple
            • 05:30 - 06:00 GPUs. If not, you can just run the actual main.py script with plain Python. Another thing you might need to do if you're just getting started with this is to install the requirements. This is how I do it: just type python -m pip install -r requirements.txt and it should install all the requirements. The only thing worth
            • 06:00 - 06:30 noting here is that you need to pay close attention to what version of Python you use, because you might have Jupyter notebook, you might have Visual Studio Code, you might have pip installed, and then the actual Python interpreter, and they might all refer to different versions of Python. To help with that, I just always invoke the Python interpreter itself and then ask it for the pip module to install something, which
            • 06:30 - 07:00 makes sure that Python 3.9, the one I usually use for training, is the one I always invoke for pip, and I will use the same one in Visual Studio Code and in the Jupyter notebook. It will work for a little bit, and that's it: everything is now installed, I can run it. So this is the actual launch command.
            • 07:00 - 07:30 I just run main.py, I need to provide the path to the dataset where we store the train, validation and annotations folders, and I also provide device mps, since I'm currently doing this on my Mac just because I'm recording the video here; on a Mac it will be mps, on everything else with a GPU it will be cuda. So let's go ahead and run it and just wait until some iterations start to go. And it has started, so we can see that we
            • 07:30 - 08:00 started the first epoch. It says that it will take 21 days on a Mac to train just the first epoch, and I believe we need like 300 of those, but with a proper GPU it will take maybe about a week to train the whole thing, depending on how many of them you have or how powerful they are, maybe even less. So this is basically it, this was the easy part, thanks to the fact that the GitHub
            • 08:00 - 08:30 repository is really well supported: it has all the instructions, and it mainly uses native PyTorch code, meaning that it will work on every device, on Macs, on GPUs, on CPUs. This is not the case for many other models, where they would have some custom layer only implemented for CUDA, for example. So this is quite easy. I'm going to interrupt it now, and let's talk about how to actually read the code. First I will
            • 08:30 - 09:00 talk about my debugging setup, how I set things up to make it easier. First I'm going to Visual Studio Code; this just opens the root of the repository, so you can read the code, launch it, read the modules and so on. The first thing we need to do is to select the Python interpreter. If I just click here, you might notice that I have something
            • 09:00 - 09:30 like eight different Pythons installed on my Mac, so I need to make sure that I use the same Python for pip, for dependency installation, and in Visual Studio Code. In my case it's the one from Homebrew, the Python 3.9, so let's go with this one. Now I've selected it, and that will actually help you solve problems if it says something like torch is not
            • 09:30 - 10:00 installed, for example. So that's one thing. The second thing: I want the debugger to work, so I've set a breakpoint here at the very beginning of the main function, and then I'm going here to the Run and Debug tab and press Add Configuration. It opens this launch.json file, which basically says what extra arguments and options you might
            • 10:00 - 10:30 want. Most of this is just created by default, and then this line was typed by me to provide the proper arguments: here I have to specify the COCO path and the device, and that's it. Now it should work, so we can try to launch it again and just see if it works. To do that I go back to this Run and Debug tab, and I have
            • 10:30 - 11:00 to open the proper mode. This launch.json configuration was called Python: Current File, and this is the name of the configuration I have to choose here in Run and Debug. So I'm going back to main, I click on it and just push run. Here in the terminal I can see that it runs and it has stopped on this
            • 11:00 - 11:30 breakpoint. I can see here the controls to step over, to continue execution until the next breakpoint and so on. Oh, now it's stopped, yeah. I can go to the debug console for some additional controls. Right now I want to print out the content of this args argument, for example. Here I only provided a couple of arguments, the dataset path and the device; most of
            • 11:30 - 12:00 them are defaults, but it still helps to just print out the exact content and copy it and save it somewhere, to be able to reproduce it later in the Jupyter notebook, so it is really useful. Another thing that I want to do is create a Jupyter notebook right in the root of the repository. I'm doing that by just typing touch sandbox.ipynb, and this will
            • 12:00 - 12:30 create an empty file, but in my case it already exists, and I will provide the link to the actual Jupyter notebook that I use, so you can either write it from scratch or just use mine. So this is it. I placed it in the root of the repository, which means that I can now import the same stuff that is used in main.py. Some of that is just third-party libraries, that's okay, and
            • 12:30 - 13:00 then some of these are actually referring to other files within the repo, so for them to be importable I need the notebook itself to be in the root of the repository. So now I can just copy-paste that, paste it here, and it will work. Then I also have some helper functions: I define the class names to be able to match the label to the name of the class, like person, bicycle and so on, and the first one here is the
            • 13:00 - 13:30 extra class for "no object". I also define the plotting function, since I need to be able to plot results to make it more obvious what is going on. To do that I have the unnormalize class, because the actual tensors used in training are going to be normalized, which means they have zero mean and one standard deviation.
            • 13:30 - 14:00 That's not really good for plotting, because the plotting engine assumes that a pixel value is always above zero, so if you just plot the normalized tensor you'll see a lot of black spots, all the colors will be off, you would see a blackened image. So I do the reverse process of unnormalization to make it pretty, and here I just use the well-known mean and standard deviation values for the different color channels of
            • 14:00 - 14:30 the COCO dataset; you can see these values often in the code. Then just some helper converter functions: the COCO dataset provides bounding box coordinates in the form of the coordinates of the center plus height and width, and I convert that to just the coordinates of two corner points. And then the actual plotting function: it will use the actual image tensor, then the actual
            • 14:30 - 15:00 labels, the ones with the class for the object and the bounding box. So I convert everything needed (the unnormalization and the box conversion are roughly sketched below) and then just add a patch on top of that to draw the rectangle, and we'll see how that works later. There is also a thing for the mask here; I'll explain later why we need that.
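            A minimal sketch of what those two notebook helpers can look like, assuming the standard ImageNet mean/std that DETR's COCO pipeline normalizes with; the repo itself has an equivalent conversion, box_cxcywh_to_xyxy, in util/box_ops.py.

                import torch

                IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406])
                IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225])

                def unnormalize(img: torch.Tensor) -> torch.Tensor:
                    """Undo the (x - mean) / std normalization so a [3, H, W] image plots correctly."""
                    return img * IMAGENET_STD[:, None, None] + IMAGENET_MEAN[:, None, None]

                def box_cxcywh_to_xyxy(box: torch.Tensor) -> torch.Tensor:
                    """Convert (center_x, center_y, width, height) boxes to two-corner (x0, y0, x1, y1) form."""
                    cx, cy, w, h = box.unbind(-1)
                    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)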
            • 15:00 - 15:30 And finally I define the args class and variable. The values (learning rate, backbone learning rate and so on) I just copy-pasted from the debugger, and I define it in a special way, as a class inheriting from the Namespace class, and I just instantiate it like this. Thanks to this I can now call different functions copy-pasted from main.py, so now I can use
            • 15:30 - 16:00 this args variable the same way it's used here with the actual command line parsing, so I don't deal with the command line at all; instead I just create this variable that contains all the same stuff, so now I can copy-paste everything from here and it will still work (a rough sketch of this trick is below).
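            A minimal sketch of that trick, assuming the handful of values shown here; the real object has many more fields, copied verbatim from the debugger printout, and the field names follow DETR's argument parser.

                from argparse import Namespace

                class Args(Namespace):
                    # Hedged example values; in the video they are copied from the debugger output.
                    coco_path = "/path/to/coco"   # wherever train2017/val2017/annotations live
                    device = "mps"                # "cuda" on a machine with an NVIDIA GPU
                    lr = 1e-4
                    lr_backbone = 1e-5
                    batch_size = 2
                    num_queries = 100
                    # ...plus every other argument printed out in the debugger

                args = Args()
                print(args.device)                # behaves like the parsed args object in main.py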
            • 16:00 - 16:30 Now let's talk about the first part of the training, which is loading the data: which dataset is used, how we load the data, how we transform images, which augmentation techniques are used. Before reading the source code I usually just imagine the main components that need to be there, because the source code of a real model might be quite complex (it is not that complex in this case, but sometimes it is), and it helps to have these reference points that help you structure the reading. For a model they are usually quite simple: we load the data, and in PyTorch that means
            • 16:30 - 17:00 that we have a dataset, which deals with the actual loading of images from disk and so on; we have the data loader, which deals with doing that loading with some buffering, to make sure that you don't wait for image processing during your training, and it handles shuffling and stuff like that; the data loader is something that you then iterate on. Then you have the model, which is
            • 17:00 - 17:30 something that you can call: feed in an image tensor and get the predictions out. Then you'd have the optimizer; sometimes you'd have a learning rate scheduler attached to the optimizer, and sometimes you would have parameter groups that allow you to control the optimizer behavior for different parameters. You'd have the main training loop and the loss, and in the training loop you'd probably iterate over epochs; then every epoch
            • 17:30 - 18:00 you would iterate over the training dataset, then do some reporting of validation metrics, deal with checkpoints, and that's it. If you compare this code to some of the other models, like YOLO for example, it is much simpler, partly because it was done by Facebook Research, the same company that created PyTorch, so a lot of what is used here just refers to native PyTorch code. For
            • 18:00 - 18:30 example, we don't have custom logic for the scheduler, we just use the torch scheduler called StepLR; we don't have a special data loader, we just use the native PyTorch data loader, and so on. So the entire code is quite clean and simple, it's only about 100 lines if you don't count the argument parsing, and you can just read it down and see all the main components we've talked about, roughly the loop sketched below.
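            A condensed sketch of that outer loop, assuming the objects built in the surrounding lines (data loaders, optimizer, scheduler) and the train_one_epoch / evaluate helpers from DETR's engine.py with the signatures main.py uses; checkpointing and logging are omitted.

                from engine import train_one_epoch, evaluate

                for epoch in range(args.start_epoch, args.epochs):
                    train_stats = train_one_epoch(model, criterion, data_loader_train,
                                                  optimizer, device, epoch, args.clip_max_norm)
                    lr_scheduler.step()
                    # periodically: save a checkpoint and run COCO evaluation on the val split
                    test_stats, coco_evaluator = evaluate(model, criterion, postprocessors,
                                                          data_loader_val, base_ds, device,
                                                          args.output_dir)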
            • 18:30 - 19:00 So we define the model, and the criterion, which is like a loss function. We sometimes deal with DDP, which stands for distributed data parallel; not in our case, we just skip that. We have parameter groups, and here we just differentiate between backbone and not backbone; we have the optimizer, in this case Adam, the scheduler, the dataset, the data loader (the parameter-group and optimizer setup is sketched below).
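            A sketch of that parameter-group and optimizer setup, following the way main.py does it (reconstructed from memory, so treat the exact argument names as assumptions); the backbone gets a smaller learning rate than the rest of the model, and the repo pairs an AdamW optimizer with a StepLR scheduler.

                import torch

                param_dicts = [
                    {"params": [p for n, p in model.named_parameters()
                                if "backbone" not in n and p.requires_grad]},
                    {"params": [p for n, p in model.named_parameters()
                                if "backbone" in n and p.requires_grad],
                     "lr": args.lr_backbone},
                ]
                optimizer = torch.optim.AdamW(param_dicts, lr=args.lr,
                                              weight_decay=args.weight_decay)
                lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, args.lr_drop)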
            • 19:00 - 19:30 Sometimes we load from checkpoints if we resume the training, and this is the main loop. So this is the overall code structure that we'll read in the next maybe one hour. We start with the dataset, so let's just see where it is. Again we have dataset train and dataset val, so you can just copy dataset train; we can see that it's only using the args variable, so we don't need
            • 19:30 - 20:00 anything else, and that's what I did here: I build a dataset and it just works, and now I can get stuff from the dataset. The dataset API is simple, it's like a Python list: you provide it with an index and it returns you something, and in our case that something is two variables, one with an image and the second with a label, as in the snippet below.
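            A small sketch of that, assuming the notebook sits in the repo root and args is defined as above; build_dataset is the factory that main.py itself calls.

                from datasets import build_dataset   # DETR's dataset factory

                dataset_train = build_dataset(image_set='train', args=args)
                img, target = dataset_train[0]                  # indexable, like a Python list
                print(img.shape)                                # e.g. torch.Size([3, H, W]), varies per image
                print(target['boxes'].shape, target['labels'])  # bounding boxes and class ids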
            • 20:00 - 20:30 Here I'm printing the shapes of the images, the first one and the second one, and an example of a label. You can see that the images are different every time, as any real-world images tend to be; they come from different sources, from cameras, from the internet and so on, so they will all have different sizes. And then for the labels we are mainly interested in just the boxes, which are bounding boxes, and the
            • 20:30 - 21:00 labels, which are numbers corresponding to objects, so every number corresponds to a specific class like person, car, train and so on. Everything else we're not that interested in, but the COCO dataset actually has a lot of stuff going on: apart from just object detection it might be used for segmentation, which is when you have actual per-pixel labels for every object (which pixel belongs to this object and which does not), you would have
            • 21:00 - 21:30 the keypoints for pose estimation, for example, and so on, so sometimes you might see stuff like that in the labels here; again, right now we are not that interested in that. And then I plot the image and annotate it with the actual bounding boxes using the function that I've talked about before. So this is the first image in the dataset; as you can see it's quite a large image, high
            • 21:30 - 22:00 resolution, a lot of objects clumped together, so it's a challenging task. After that I want to deal with the data loader. If I go and just try to find the data loader in the code, I'll see that it uses a batch sampler, this one, and a random sampler, and that is needed because every time we iterate
            • 22:00 - 22:30 over the dataset in the next epoch, we want all the images to be in a different order, we want different images to appear in the same batch at each epoch, which basically prevents us from maybe going in circles because we just randomly happened to get stuck somewhere, or from the model just remembering the batch setup and the actual images. So it's always better to just shuffle stuff. And also we have this collate
            • 22:30 - 23:00 function, which is also quite interesting, so we'll look at that in a minute. First of all, I create the sampler, I create the data loader, and I can iterate over the data loader like this: just type in a for loop, break after the first iteration, and then x is now available and I can just print what it is, which is quite interesting (the loader setup is sketched below).
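            Roughly what that looks like, mirroring main.py (a hedged sketch; utils.collate_fn is the helper in util/misc.py that builds the padded batch discussed next).

                import util.misc as utils
                from torch.utils.data import DataLoader, RandomSampler, BatchSampler

                sampler_train = RandomSampler(dataset_train)
                batch_sampler_train = BatchSampler(sampler_train, batch_size=2, drop_last=True)
                data_loader_train = DataLoader(dataset_train, batch_sampler=batch_sampler_train,
                                               collate_fn=utils.collate_fn, num_workers=2)

                for x, targets in data_loader_train:   # x is a NestedTensor: padded images + mask
                    break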
            • 23:00 - 23:30 The type of x is now NestedTensor, and the NestedTensor, let's find it here, is not a PyTorch object, it's defined right here in the utilities file. It basically allows us to load images of different sizes in a batch, because tensors have a tough time
            • 23:30 - 24:00 working with objects of different size: we like tensors to look like a cube or hypercube, where every dimension has an equal length of elements for every index. Because the actual images that we get are all of different sizes, we deal with that with nested tensors and masks. So the NestedTensor is basically a wrapper
            • 24:00 - 24:30 around a tensor that takes in various images of different sizes and then determines which size is the largest. All of that, by the way, is happening in this collate function: the collate function in the data loader is usually responsible for the custom code that unites the different objects into a batch. In our case the batch size is just two, so we got two images
            • 24:30 - 25:00 with different sizes, and then we call the collate function to create a batch out of that, and here we call this nested_tensor_from_tensor_list function. What it does is it looks at the input images, determines the maximum along every dimension, so what is the max height and max width, then it pads all the images to this max height and width, but it also saves the mask, so that we know which part of the
            • 25:00 - 25:30 image content is padding and which is the real image. This is how it looks in reality: this is, I guess, the first image, so let's plot the second one and see how it looks. So this is the second image; the first one was probably the bigger one, and that's why it didn't have this gray area, but this one is slightly smaller, so we pad it with zeros.
            • 25:30 - 26:00 After unnormalization the zero, which usually corresponds to black, changed to gray, so now we have this image padded with gray zones on each side. But we also have the mask: if I print the content of x.mask[0], it will say False for those parts of the image that were not
            • 26:00 - 26:30 padded, and it will say True for those that are padded, so not real content. You can see here that the upper-left corner is all False, so it's real image, and then all the right side and all the bottom side is just padding. This will be important once we start to deal with the transformer and its masks; the padding logic is sketched below.
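            A stripped-down sketch of what nested_tensor_from_tensor_list does with the padding and the mask (the real helper in util/misc.py also handles dtype and device).

                import torch

                def pad_to_batch(images):
                    """images: list of [3, H, W] tensors of different sizes."""
                    max_h = max(img.shape[1] for img in images)
                    max_w = max(img.shape[2] for img in images)
                    batch = torch.zeros(len(images), 3, max_h, max_w)
                    mask = torch.ones(len(images), max_h, max_w, dtype=torch.bool)
                    for i, img in enumerate(images):
                        _, h, w = img.shape
                        batch[i, :, :h, :w] = img   # copy the real pixels into the top-left corner
                        mask[i, :h, :w] = False     # False = real content, True = padding
                    return batch, mask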
            • 26:30 - 27:00 What else can we see here? We can see the actual image sizes: it's 608 by 834. We can compare that, for example, to what we had in ImageNet. In ImageNet I think we just resized all the images to something like 240 by 240, so the actual images here are about three times bigger along every dimension, which is good: it allows the model to see more and to react better to smaller objects. It's also
            • 27:00 - 27:30 curious that the backbone of our model is just the ResNet. The ResNet was trained on the actual ImageNet, so it only saw those 240 by 240 images. However, the actual ResNet model, which we call the backbone here, took the input image, then applied a series of convolutions to basically scale down the size of the image and increase the number of channels, and then at the very last step it had a fully connected layer. As
            • 27:30 - 28:00 long as you have a fully connected layer in your network, you make it work only on fixed-size images, so it would not be possible to apply the actual ResNet to an image that is not 240 by 240. However, for the backbone here we just throw away the final part, the fully connected layer and the actual output, and we just care about this intermediate representation right after
            • 28:00 - 28:30 the convolutions, and convolutions can be applied to anything. Every pixel in this intermediate representation can be seen as a sort of receptive area that is a function of the input image, and the actual process of applying a convolution looks like sliding a filter along every possible position, so you can enlarge your image infinitely, in theory, and it will still work. You'll enlarge your image,
            • 28:30 - 29:00 and your representation, the downscaled version of the image, will also enlarge, and you are not confined to any specific size. So after you throw away all the fully connected layers from the backbone, you can now apply it to input of any size; the idea is roughly sketched below.
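            A rough sketch of that idea, using torchvision's ResNet-50 with the classification head cut off; DETR's actual backbone wrapper in models/backbone.py does essentially this, plus frozen batch norm and mask handling.

                import torch
                from torchvision.models import resnet50
                from torchvision.models._utils import IntermediateLayerGetter

                # Keep ResNet-50 up to its last conv stage, drop the average pool + fc head.
                body = IntermediateLayerGetter(resnet50(pretrained=True),
                                               return_layers={"layer4": "0"})

                x = torch.randn(2, 3, 640, 832)        # two images of an arbitrary size
                features = body(x)["0"]
                print(features.shape)                  # torch.Size([2, 2048, 20, 26]): H and W shrink ~32x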
            • 29:00 - 29:30 Still, different networks deal with different sizes differently: YOLO just resizes and sort of pads everything together to have images of the same size during training. That's not what we do here; here we actually train on images of different sizes, without resizing and stuff like that, but we have padding to make sure that we train correctly. Also, let's look at the image transforms that we had. Transforms are part of the dataset, so let's go to the code here. This is the build dataset code, it says build
            • 29:30 - 30:00 coco, and it instantiates this CocoDetection dataset. Again, there's only a bunch of lines here; we can see that it mainly just inherits from the torchvision CocoDetection dataset, which is really cool: this class can be used to just load the image-folder format that we had for COCO. They have just a giant folder with a lot of images for train, a
            • 30:00 - 30:30 giant folder with a lot of images for validation, and then a separate folder with annotations for all the images. This is the format, and the torchvision CocoDetection class already handles all of that really well: it deals with loading the images, caching, validating and so on. In here, to define our own custom dataset class we can inherit from this PyTorch class, and we also reimplement two methods,
            • 30:30 - 31:00 __init__ and __getitem__. The init is the one called during construction, and getitem is called every time we actually request an image by specifying an index, and here we do it to apply our custom transforms. We also have this ConvertCocoPolysToMask, which is mainly used just for the segmentation task, so it's not really
            • 31:00 - 31:30 doing anything for object detection, and we can see here that this convert class also has some stuff regarding the keypoints, which might be used for pose estimation and so on, but for actual object detection it's not doing much: we basically get the labels and we just return them. The class itself is rather simple: we just use the actual loading-from-disk functionality from the
            • 31:30 - 32:00 native PyTorch dataset class. And then we have the transforms; let's see the transforms. They are built here, again quite simple, and they mostly rely on PyTorch-native stuff, unlike YOLO, which had some stuff from the albumentations library and some custom transform code. In here it's all clean and easy, just torchvision-style
            • 32:00 - 32:30 transformations, and you can see basically resizes and a horizontal flip. This is how the resize works: the main method is usually a random resize-and-crop, but there are other alternatives that do just a resize or just a crop. The idea is we have an input picture, we select a pretty much random area of it, we crop it out (that leaves some of the stuff out), and we also then rescale it a little bit, so
            • 32:30 - 33:00 we change the aspect ratio slightly, and we do that to make sure that the task is more challenging for the model, so it doesn't remember the exact images and is ready for the real world, where you might be looking at objects from different positions and so on. And then the random horizontal flip is basically just flipping the image, getting the mirrored version of it: you can notice that the hand used to be on the left, and now it's on the right. The rough shape of this transform pipeline is sketched below.
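            Roughly what make_coco_transforms in datasets/coco.py builds for training (reconstructed from memory, so treat the exact scales and crop sizes as assumptions): a random horizontal flip, then either a plain multi-scale resize or a resize, random crop, resize chain, followed by normalization.

                import datasets.transforms as T   # DETR's own transforms module, not torchvision's

                normalize = T.Compose([
                    T.ToTensor(),
                    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
                ])
                scales = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]

                train_transforms = T.Compose([
                    T.RandomHorizontalFlip(),
                    T.RandomSelect(                                  # pick one of two branches at random
                        T.RandomResize(scales, max_size=1333),
                        T.Compose([
                            T.RandomResize([400, 500, 600]),
                            T.RandomSizeCrop(384, 600),              # crop a random region...
                            T.RandomResize(scales, max_size=1333),   # ...then rescale it again
                        ]),
                    ),
                    normalize,
                ])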
            • 33:00 - 33:30 These are all the transforms that are used in DETR; again, this is much simpler than YOLO, which also had different color augmentations, mosaic and so on. Now that we've talked about the data, we have loaded it and we have the actual tensor that we can inspect and call the model on, let's discuss the model architecture. This is the picture from the article with the overview of the architecture. We start with the backbone: we feed in the
            • 33:30 - 34:00 image here, and the output of the backbone is a condensed representation of the image; depending on the initial image size it can be something like 20 by 20 in height and width, and then something like 2048 channel dimensions. We then project that down to 256, because 2048 is too much for the transformer, it's really expensive. Then we deal with positional encoding, because the first step in here is
            • 34:00 - 34:30 basically to flatten out the image, so we lose the position information, and because of that we explicitly encode it into the embedding itself. The encoder is just self-attention: we have this flattened image representation of like 20 by 20, which got flattened to 400 sequence elements, so the output of the encoder is the same 400 by 256 elements, basically creating a
            • 34:30 - 35:00 more refined image representation. Then we have these queries coming from the bottom, which correspond to basically our experts who are supposed to judge which objects are where. So we have this cross-attention that we'll look at; the output again is 100 by 256, so now we went from the image dimensions to the number of experts, which is always 100 in this case, which also means that DETR will not be
            • 35:00 - 35:30 able to detect more than 100 objects. Then we add a couple of independent heads, which are just dense layers, to project that to class probabilities and to bounding box predictions. So this is the code I've copied from main.py; again, to build the model and the criterion and so on I just need the args, so I execute it, and it will load the
            • 35:30 - 36:00 backbone part, the ResNet, from the Torch Hub using pretrained weights; the rest will be just initialized randomly. So here is the overview of the model, let's take a quick look at it: it will have a transformer (transformer encoder and decoder), then embeddings for the queries and for position, and then the backbone. For the backbone, once again, I have a detailed video about ResNet, about how to
            • 36:00 - 36:30 train it. It has a bunch of convolutions, the so-called bottleneck blocks, that use convolutions with a kernel size of one, which basically don't propagate any information between pixels but still apply some transformation between channels; then we have the bottleneck, which is the union of a bunch of convolutions with a residual connection between them, and that's it. So we are only interested
            • 36:30 - 37:00 in the final layer: we are cutting off the fully connected layer and we get this sort of condensed image representation out of it. And now we have our x; I place it on the same device, MPS (Metal Performance Shaders, Apple's analog to CUDA and GPUs), and I can just call the model. It will work for some time, and then I get the output.
            • 37:00 - 37:30 The output is a dictionary. It will have the logits, just class probabilities; the size of that is two (for the two images in the batch) by 100, since I have 100 experts, each of them trying to detect something, by 92, so 91 classes in COCO and then an extra class for "no object". And the bounding box prediction is just the four numbers for the four coordinates of the bounding box rectangle; the call and the output shapes are sketched below.
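            A small sketch of that call, assuming the model and the NestedTensor x built earlier in the notebook walkthrough; the key names follow DETR's output dictionary.

                device = "mps"                     # "cuda" on an NVIDIA machine, "cpu" otherwise
                model = model.to(device)
                out = model(x.to(device))

                print(out["pred_logits"].shape)    # torch.Size([2, 100, 92]): batch x queries x (91 classes + no-object)
                print(out["pred_boxes"].shape)     # torch.Size([2, 100, 4]):  normalized (cx, cy, w, h) per query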
            • 37:30 - 38:00 Now, thanks to the fact that I launch it from the same folder, with the same everything, I can go to the model code. Let me find the code for that: go to definition, build_model is defined here, and the model is a DETR. It will be multi-layered, so
            • 38:00 - 38:30 the outer layer is called DETR, which handles everything, and then it has big components like the backbone and the transformer; the transformer itself has components like the encoder and decoder, and the encoder has components like self-attention, some fully connected layers and so on. So let's just start with the very top, with the most generic part, and then sort of dive deeper and deeper. I
            • 38:30 - 39:00 can just put a breakpoint here, then return here and just say debug cell. No, not this one, I need the one where I call the model. So I have a breakpoint on the model call: it should stop if I am calling the model, and it should stop at that forward method. And it did, so let's just print out intermediate shapes now and see how it
            • 39:00 - 39:30 looks. So the call to the backbone returns features and position: positional encodings are considered part of the backbone in this code, so we'll take a closer look at that a bit later. We can see that features also contains the mask, so features is a nested tensor; let's just print out what it has. The type of it is list, so the API is a bit weird: this function call receives
            • 39:30 - 40:00 a list of nested tensors, so we might have many of them, even though one nested tensor usually represents a batch; but the length of features is just one, so we take features[0], and that's going to be a nested tensor. So I'm typing .tensors to get access to the actual images, and then .mask will give access to
            • 40:00 - 40:30 the mask, and I just print out the shape. So it's 2 by 2048 by 34 by 26: two stands for the batch dimension, just two images, 2048 is the channel dimension of this intermediate representation in the ResNet, and then the compressed picture is 34 by 26. You can actually print the initial size as
            • 40:30 - 41:00 well: we had a batch that got resized to something like 1,063 by 813, the dimensions of the biggest picture, and then after passing through the backbone we have 34 by 26. We'll see a lot of these numbers later, 34 by 26, because we'll keep passing that forward. Right, now we
            • 41:00 - 41:30 have these features passed through the backbone, and then the next call is just the transformer call, and that's it: the transformer represents the encoder and the decoder at the same time here. The output of that should be the actual queries, so let's print it. Now this is an actual tensor, not a nested tensor, so you can just print its shape, and
            • 41:30 - 42:00 it's 6 by 2 by 100 by 256. We know what two is: two stands for just the batch dimension, two images. Then we have 100 queries, like 100 experts trying to detect some objects in the picture, and this again means that DETR can only predict up to 100 objects. Then 256 is the dimension of the transformer, so we had to downsample our initial ResNet
            • 42:00 - 42:30 output a little bit to pass it into that. But I also have six here, and the six stands for the output of all the intermediate layers. This is the scheme of it: the encoder also has several layers, but I'm not drawing it here; this is the decoder, so we get queries passed from the bottom, and then the output of the encoder passed in from the side to do the cross-attention, and the queries represent a
            • 42:30 - 43:00 sequence of length 100, and then I do a series of cross-attention and self-attention steps with some fully connected layers and residuals, and I do it several times: I have six layers in total. But another interesting thing is that all the intermediate outputs are used in the loss as well. You can think of these queries as
            • 43:00 - 43:30 basically being experts, and after every step of cross-attention and self-attention you expose the experts to the image, so you let them sort of see the image and decide what they're seeing; you do self-attention, so you let them discuss and refine their own representations. But you should expect that every intermediate layer already has some understanding of what is going on in the picture and what the
            • 43:30 - 44:00 objects are, so when we connect that to the loss we are basically asking the intermediate layers to make some good predictions as well, and it turns out it helps the model. I'm not sure if the initial queries also do that; maybe they do, we might be able to see it in the code later. So we are back with our output. Right now the embedding dimension of that is 256, and then we just do a couple of fully connected layers on top of that. So the class embed, let's see
            • 44:00 - 44:30 what its type is: it is just a linear layer, so a fully connected layer or dense layer. And then this one is the same... oh, it's actually an MLP, which stands for multilayer perceptron, which basically means a bunch of dense layers stacked together. So let's step through that, and the output class
            • 44:30 - 45:00 shape: the outer shapes are the same, so six layers times two images per batch times 100 experts times 92. We had a 256-dimensional embedding and we want class probabilities out of that, so 92: that's one for the no-object class and 91 for the other classes in the COCO dataset. And then the coordinate outputs, the bounding box outputs, only have four predictions per expert; the two heads are sketched below.
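            A sketch of those two heads under the shapes quoted here (hidden dimension 256, 91 COCO classes plus "no object"); DETR's own MLP class is equivalent to the small stack of linear layers shown.

                import torch.nn as nn

                hidden_dim, num_classes = 256, 91
                class_embed = nn.Linear(hidden_dim, num_classes + 1)   # per-query class logits (92 of them)

                # Box head: 256 -> 256 -> 256 -> 4, with a sigmoid applied afterwards so the
                # (cx, cy, w, h) outputs land in [0, 1].
                bbox_embed = nn.Sequential(
                    nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                    nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                    nn.Linear(hidden_dim, 4),
                )
                # hs has shape [6, 2, 100, 256]; the heads are applied to every decoder layer's output:
                # outputs_class = class_embed(hs)            -> [6, 2, 100, 92]
                # outputs_coord = bbox_embed(hs).sigmoid()   -> [6, 2, 100, 4]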
            • 45:00 - 45:30 And that's it, so this one returns. This is the auxiliary loss I was talking about: the out here contains just the final outputs, but to get the intermediate outputs I also call this set_aux_loss helper as well. What it is going to do is apply the same fully connected layers to the intermediate layer
            • 45:30 - 46:00 outputs as well, to get the predictions for the objects from the intermediate layers. So if we come back here, the embedding dimension of all of these is 256, and to get the class predictions we apply a fully connected layer here, and we also need to apply it to the intermediate layers here as well, and that outputs the 92 numbers
            • 46:00 - 46:30 for class probabilities and then four numbers for bounding box coordinates. That's it, I can keep stepping through, and yeah, I'm in some weird place in Python right now, so I just press continue, and that's it: I have returned to my Jupyter notebook cell. So basically what I did here is: I have my model in the Jupyter notebook, I have my image,
            • 46:30 - 47:00 I applied the model to the image, I was able to step through that in the debugger, and I got the predictions out. So this way we can basically debug PyTorch cells. But we only went through the top level, so let's go through the same thing a bit deeper; I'm going to press debug cell again. The first thing that might be interesting is the positional embedding,
            • 47:00 - 47:30 which happens inside the backbone. The backbone, let's see where it is constructed; it should be constructed somewhere here, right when we are building the model. When we are building the model, we construct the positional embeddings, we construct the actual backbone, and then the Joiner. The Joiner, as we can see here, applies the
            • 47:30 - 48:00 backbone and then just looks up the required positional embedding and appends it to the output. Basically, the output of the backbone is this image, like 34 by 26 or whatever, and the positional embedding is basically a mapping from every possible position to a tensor, to an embedding with 256 dimensions. We'll see how it
            • 48:00 - 48:30 looks in a moment. And we need the lookup because the actual image is smaller, so we look up the corresponding parts, like which parts of the image are actually real, and we look those parts up in the overall positional matrix and then just sort of append both of them; both are returned, and we'll see how they are added together later. So let's go to this build-position-encoding
            • 48:30 - 49:00 function. This one is going to use PositionEmbeddingSine; let's see the forward pass here. This is the image from my previous video about sine-cosine positional encodings: we have this input sequence here, which basically corresponds to small pieces of the original image, one for every pixel (sorry, not of the original image but of the condensed representation). For the one-
            • 49:00 - 49:30 dimensional case, this axis will stand for, say, x, and this axis will stand for the embedding dimension, so you can think of it as every element in the sequence getting its own 256-long vector. The first element in that vector will depend on the position as a sine function with quite a high frequency, and then, I don't
            • 49:30 - 50:00 know, the hundredth element within that will correspond to a sine as well, but with a much longer period, and then the last element is also just a sine, but the frequency is so low that you can barely notice that it's a sine, it just sort of grows slightly. So this is for the one-dimensional case, and these sine-cosine coordinates
            • 50:00 - 50:30 within the embedding vector can be thought of as knobs of different sensitivity. Think of, I don't know, a machine that controls the volume, but it has a lot of knobs. The first knob, you slide it just a little bit and the volume goes up by a lot, so it's really coarse: you can get a lot of range from it, from really low volume to really high volume, but it's hard to set a precise value. Then you have the second knob, which is more sensitive,
            • 50:30 - 51:00 so you have to rotate it much more to get the same range, but now it's really precise, so you can really control the exact level of volume that you need within a narrow interval. And then another knob is even more sensitive, and so on. So this combination of knobs basically allows you to achieve a great
            • 51:00 - 51:30 range of possible values you can represent, as well as sensitivity and exactness of the representation. This is a bit excessive for the actual position, because we just have like 100 elements within the sequence, or less, but it still works. And the reason we want sine and cosine: so this is the graph for the sine, and it will actually have 128 coordinates, and we have
            • 51:30 - 52:00 the same 128-dimensional cosine encoding, and we just concatenate them. The reason we want sine and cosine is that it's easy for the network to manipulate them: if you think about how the network might represent relative position, for example, the formula becomes really easy when you have sine and cosine, something like you just square both of them and take the sum. So it makes it easy
            • 52:00 - 52:30 for the network to use this encoding. Then, if you have the two-dimensional case, you have the x axis and the y axis, and both of them will have their own 1D positional encoding: the y encoding will have 64 coordinates for sine and 64 for cosine, and the same for x, and then every pixel in the image will get its own x-axis encoding and y-axis
            • 52:30 - 53:00 encoding, and we concatenate them all together to get the 256 out. So let's try and see that in the code. I'm going to break here, just press continue and see what I have. Let's take a quick look at it: we have our positional embedding object, it has some temperature, 10,000, and then this is the forward pass.
            • 53:00 - 53:30 So let's step a little bit forward. First we initialize the embedding: the not_mask is the inverse of the mask; the mask was a tensor of False and True values, where False stands for the real part of the image and True stands for the gray area that we added. Let's scroll down here. So we have
            • 53:30 - 54:00 an image down below: it is basically padded with zeros, and we are not really interested in positional encodings for that part, and this is the real image. For the not_mask it's the reverse: it now says True for the real image and False for the padding. When I take cumsum, the cumulative sum, I basically convert this boolean tensor to numbers, so True will stand for one and
            • 54:00 - 54:30 False will stand for zero, and take the cumulative sum. After this, the y_embed, let's print the shape, is 2 by 34 by 26, so we have some per-coordinate stuff happening as well. I basically have this 1D sequence of cumulative sums that should look like 0, 1, 2, 3, 4 and so on, and I would have that repeated over columns, and then
            • 54:30 - 55:00 the x_embed will also be this 1D sequence, but now along a different axis, again repeated, but now over rows. Let me just print the content of that for the first image, and here we can see that the first row is 1 1 1 1 1 1 1 and then zeros where the mask is supposed to be, then 2 2 2 2 2 2 and again zeros where the mask is supposed to be, and then the final one is 34 34 34 and then
            • 55:00 - 55:30 again a couple of zeros, and that's for the y_embed; the x_embed would be the same but transposed. And we sometimes normalize it; let's see what happens afterwards. I print out the same embedding, and now instead of ones, twos and 34s it is normalized in such a way that the last number is 6.28, which stands for exactly 2 pi. So that's 2 pi, and that corresponds to
            • 55:30 - 56:00 basically one period of the sine function: we start, we go up, we go down, we come back up, and then we stop. But this is going to be the highest frequency of sine that we use, so when I showed this picture I basically lied to you (and this picture keeps being copied from website to website, so sorry about that): here it shows that the highest frequency is really high, so you get like 10 periods of the sine
            • 56:00 - 56:30 within the image, but actually it is just 2 pi, so the highest frequency you'd see here is just going to have one period; it's probably pictured correctly on this layer, something like that. Right, so back to our code. I step through, and now dim_t is in the embedding space, so dim_t.shape is just 128.
            • 56:30 - 57:00 Let's just print out the entire tensor: it's 0, 1, 2, 3, 4 and so on up to 128. And then I use my temperature, so it used to run from 0 up to 128 and now it is duplicated: you can see 1, 1, 1.15, 1.15, 1.33, 1.33, and we will use that later to get the sine and cosine, so the first number,
            • 57:00 - 57:30 the even one, will be used in the sine and the odd one will be used in the cosine, or vice versa, not sure. And now it goes from one to roughly 10,000, since the temperature was about 10,000, and now we divide by that. So now we get our pos_x; the dim tensor is in the denominator, which basically means that the temperature
            • 57:30 - 58:00 controls how the frequency falls. In here it will fall by roughly 10,000: we started with a period of exactly 2 pi, and then the frequency fell by so much that the period is really long. You can imagine that if you had a smaller temperature, like exactly two for example, then it would not fall that much; it would fall to something like pi, which corresponds to half of the sine wave. So let me step through that, and
            • 58:00 - 58:30 let's try to view this pos_y tensor. It's basically now the final one before applying the sine, so it will contain the angle that we pass into the sine function, in radians. Let's first try the shape of that: it's 2 by 34 by 26 by 128. So I'm going to try to print out some rows from it. First I'm
            • 58:30 - 59:00 interested in the first image only, and then I want to say: let's print out the first row, the entire first row, and just the first coordinate. Okay, so it's all equal, so I messed up: I need the first column, not the first row. Let's try it like this, and now I see it. So this is the zeroth element within the positional
            • 59:00 - 59:30 embedding, which is going to correspond to this lowest positional embedding with the highest frequency, which we determined to be 2 pi, and here we can see that it starts at zero and then it goes up, up, up and ends at 2 pi, so it doesn't change that much; again, just exactly one period of the sine function. Then if I look at the
            • 59:30 - 60:00 second one, it's almost the same, then it starts to go down a bit; now it's from 0 to three, so it's almost pi, so it's now like half of the period. And then for 127, the last one, it's almost zero, so it's barely changing. Then we apply the sine and cosine to it; let's print out the same tensor, the first coordinate,
            • 60:00 - 60:30 sorry, once again. Yep, so it used to just denote the angle, rising from zero up to 2 pi, and now it should be the actual sine function: it goes up until about one, then it starts to go down, so we went up and then down, then it went down and then up again, and it stops at zero. So it's half a period, as promised... no, actually
            • 60:30 - 61:00 two entire periods, sorry, like this. Then again, if I look at coordinate number 10, it's still the same pattern, but now the frequency is slightly lower. And then the final coordinate of that is a cosine, and for the sine, yeah, it's barely changing. And that's it, we return that: we concatenated position y and position x and we got a
            • 61:00 - 61:30 256-dimensional tensor out of that; let's just confirm that. Yep, it's 256 by 34 by 26. And this is the positional encoding layer: it does not use any trainable parameters, it just initializes all the numbers it needs to represent position, depending not on the actual image that we've seen but on the positions within the image, so it's not actually using any pixel information. So that was the deep dive into the positional encoding; a condensed sketch of it is below.
            • 61:30 - 62:00 into the first thing we wanted that the positional encoding now let's do the same for other parts of the model uh so starting with this plus sign here where we sum up the output of the backbone with the positional coding and then the actual Transformer layers so Transformer encoder Transformer decoder and the final output um so I had this break point in the beginning of the dater so let me just remove that then I also want to go back to the build model
            • 62:00 - 62:30 code um it calls build and here I can see that dat is the model we just looked at the outermost layer and it just receives this Transformer uh as an input so let's dig into that and this Transformer Returns the model called Transformer so this is the one that actually defines Transformer
            • 62:30 - 63:00 encoder decoder so encoder consists of several encoder layers and then decoder consists of several decoder layers and so on so again let's just break on the forward path here and see what we can see um right so still at the call function for the model itself and I am in the beginning of the
            • 63:00 - 63:30 forward method for Transformer so now we've received the source tensor that should be the actual tensor let's confirm that um The Mask positional embedding uh and then the query embedding let me come back to data see where was the query embedding right so the query embedding comes from the data
            • 63:30 - 64:00 model um and it's called just embedding here so it's pytorch nn. embedding so this is the um doc string for by torch embedding and it basically just defines um a lookup table so input to that is just a number from 0 to 100 and then you provide Prov it with the number from 0 to 100 and it outputs unique set of weights and these weights are
            • 64:00 - 64:30 trainable now we are back to Transformer and let me just print out the shape of the query embedding so it's 100 by 256 so we have 100 queries 256 embedding size so then we step over and we do some data transformation to change change the the shape not change the content then we do this repeat function because we might
            • 64:30 - 65:00 have several images in the batch so we repeat the qu embedding um for every image then we initialize the target um in the same shape and then we call encoder so this is still quite a high level code now let's just look at what the output is so the input to this is just
            • 65:00 - 65:30 SRC which was um the initial image so we had something like 34 by 26 and I believe that yields the 884 um size oh yeah that's that's how we did it so we had um source and we flattened it um then we pass in the mask and the
            • 65:30 - 66:00 positional embedding and so the output of that should be of similar shape this is just the refined representation of the image so let's steps through that and print out the shape for this right so it's still the same shape 884 by 2 by 256 and this is the quick reminder of what happens in the encoder so as input we have this sequence of flatten part of
            • 66:00 - 66:30 the image each uh being represented by a sort an embedding of size 256 then we project each of them individually through the query layer and the key layer then we calculate dot product between each possible pair this one and this one this one and this one this one in itself and so on so every possible pair we get this attention mask then we do this projection step
            • 66:30 - 67:00 again for every element in the sequence using the third um Den layer called value and then we multiply that with the tension mask which basically means that for a certain Row in the output for the certain element of the sequence so I we look up the corresponding Row in the tension Matrix um so it will have just a bunch of numbers with the same uh lengths at the
            • 67:00 - 67:30 sequence so 884 in our case and then we use those numbers for weighted sum so we do a weighted sum of all the values we had um to get this one output element and then we repeat the same for all the output elements so result of that is just and that's just one in quod layer we repeat that several times result of that is is the same shape basically refined uh representation of the image and then we apply the
            • 67:30 - 68:00 decoder so decoder now will have a different shape 6 by 100x 2 by 256 so now this is how decoder looks so incoder just outputed this memory um which is like the final sequence from the encod after this The Memory Remains unchanged it is passed as as a side input which means it's passed as key and
            • 68:00 - 68:30 value to every cross attention layer in the decoder and the decoder itself starts with our randomly initialized Vector of queries 100 of them and then they go through this series of cross attention self attention dense layers residual connections and so on so the input to this one the normal input not the sign input as the sequence lengths of 100 and the output is the same also 100 but we also
            • 68:30 - 69:00 have this extra addition to the loss where we output every intermediate layer as um auxiliary output to use that in the loss as well and then we return all of that okay I've returned to my cell with just calling the model again so again this was quite quick because we just had incode layer decod layer but we've seen the shapes of input and output now let's
            • 69:00 - 69:30 try to dig deeper deeper into the actual layers as well um so let's take a quick look on how they are constructed so um we have this encoder here which gets incoder layer as an argument so let's remove this break point and look at the inorder itself so when it is built it receives the incord layer and number of layers which is six in our case and then just clones them right and
            • 69:30 - 70:00 then it does the forward step so let's break at the Transformer inorder forward step um somewhere here and see what happens right so now we have this this is encod so input to that is the sequence with lengths 884 corresponding to just flattened pieces of the image so let's confirm that 884 by 2 by 256 then we initialize output as the
            • 70:00 - 70:30 current Source the image and then we just call the layer right so everything important um happens in the layer but we can see that we just rewrite the output for eyeball so output of the layer is it the same shape of the input so we just compute the output and save it to the same variable and repeat the process six times so let's now bring break at the in quarter layer forward
            • 70:30 - 71:00 path so it has two forward methods one where we normalize um after the attention and one when we normalize before the attention so let's just break it both of them and see which one actually happens and this forward po post is the one that happens right and so input to this is um the source so the tensor with the actual image uh images and the positional
            • 71:00 - 71:30 embedding and then we call this with positional embedding function um right and you can see here that all it does is just sum them up which is curious so we are not projecting the positional embedding um with anything extra we're not we don't have any trainable layer on top of that so we can print out the entire First Column for example in the first coordinate of
            • 71:30 - 72:00 that and this is the exactly the sign function that we've seen before uh only now it's a bit flattened I believe yeah because the source is flattened as well and we just sum that with the image representation uh which is a bit weird I guess so if we come here now we had our backbone that outputs some good representation of the image uh we have a projector layer on top of that to project it down to
            • 72:00 - 72:30 256 um so we give our model the chance to decide how it is going to use this embedding space so it might use the first 254 coordinates for just the image all right and then focus on the rest uh and the sleeve two for the positional encodings but the positional encodings are not passed through any other layer so basically you have this embedding for the image and then every position here gets spoiled in some way with this
            • 72:30 - 73:00 positional encoding um so I guess it's good that the last coordinates don't have that high numbers um maybe you would you wouldn't need that high temperature if you just allowed an extra trainable layer here but I don't know so now we added this positional encoding and then we just call the self attention block here self attention is defined as pytorch multi-ad attention so again this
            • 73:00 - 73:30 is the native P function so this is the definition for it so it receives basically um query key and value so three vectors three embedding sequences and then we say which embedding Dimension we need and how many heads during initialization and in here here we call it with query key and
            • 73:30 - 74:00 value right and we can see that query and key are the same as source with positional embedding but then value is just the source which is interesting so for calculating the attention Matrix we use this position information but then to get the actual sequence of outputs the actual value has no positional qu which is interesting um I didn't know that before reading the
            • 74:00 - 74:30 code um we could also try to look into the actual multi-ad attention layer as well it is going to be quite tough because this is now this is not our scientific code this is the actual pych code you can see here a lot of ugly implementation details so this is the forward function and it is going to be really really long so we can see that sometimes they have this fast path uh
            • 74:30 - 75:00 when some conditions are met and you might just do that in Native Cuda C++ code sometimes You' call and like this one the native multi- header tension and there just some C++ bending I believe and then sometimes you would have this non-native code code that now calls the functional multiat tension method right and again it is really really long
            • 75:00 - 75:30 like a thousand of lines um but the actual piece where the calculation happens um is somewhere here right this bmm bmm stands for batch matrix multiplication so if I come back to slide where the attention happens what you basically do here when I'm saying that you need to do this dot product
            • 75:30 - 76:00 with for every possible pair this operation is actually really well represented by just matrix multiplication so you have this huge Matrix of size bch times sequence length times number of heads times um embedding per head which is like 32 I believe and you have another Matrix like this again um batch times sequence length
            • 76:00 - 76:30 times uh head times embedding oops it's getting deleted but basically when you do the matrix multiplication the badge matrix multiplication is basically going to remove the last Dimension so it will calculate the dot product um for every possible pair Within These batches and within the heads and the
            • 76:30 - 77:00 output is just what we need so it's basically this dot product of query and key calculated over all possible pairs um within the sequence and then for every batch for every um head unfortunately in my case um the actual execution goes only to the native code so we will not be able to just go there um in the debugger uh but at least the code is
            • 77:00 - 77:30 here so you can actually look at what is happening here um so let's just see what happens so again this is the beginning of in quar layer so we added the positional embedding to some sources we computed this self attention output of that is just the same as uh the output of um as the input then we do Dropout normalization um then this
            • 77:30 - 78:00 src2 which basically is needed for residual connections so now src2 was the output of self attention and then we compute the result of that as initial Source plus drop out out of src2 which means that we had this output of attention we now have this residual connection and Dropout so we apply Dropout to the output of the self
            • 78:00 - 78:30 attention layer and then we also add residual connection and then the same for linear so we have an extra linear layer on top of that just to be sure and again we do this residual connection trick and we also have the norm so uh two interesting things here for those that um stayed in computer vision and object detention for quite a long time so in other models that we've
            • 78:30 - 79:00 seen like Yello like resonet 50 and so on we use batch norm and we don't use Dropout uh and here we can see that it uses layer Norm instead and Dropout so let's discuss that U so someone actually uh answered this question on stack overflow with a really cool image that helps you visualize that so bch Norm let's say you have a layer with let's
            • 79:00 - 79:30 say 256 dimension of embedding so you have 256 outputs of uh of this layer the coordinates um then you would have the batch Dimension so You' have 256 outputs for every image in the batch and then in our complex case of Transformers you also have this third sequence d mention so what batch normalization does is it computes the standard deviation and the
            • 79:30 - 80:00 mean for the badge Dimension so you look only at one particular coordinate within the layer but you look at all the images within the batch to compute this um mean and standard deviation and to normalize the layer this is called the batch norm and then the layer Norm is when you only look at the current image so you are not looking to add other images in the batch when Computing the mean and standard deviation but you look
            • 80:00 - 80:30 at all the coordinates within the layer um and there are some reasons actually explained here why that happened for language Transformers because you might do next token prediction where you're not supposed to look at what happens next in the sentence and then if you also look at different examples in a batch it becomes weird they all might have different sequence length uh they all might have information that you were
            • 80:30 - 81:00 not supposed to look at so it was easier to just use lorm um I'm not sure if that is actually the case for images it looks pretty safe to use for images but we just use the same because it's just the common transformal block um so we use layer norm and another thing here is that when you use the BGE dimension for normalization um you basically introduce some Randomness into the process because
            • 81:00 - 81:30 the batch will be different every time you do shuffling and because of that these Norms the mean and standard deviation will have some noise in them and because of that you basically have The regularization Happening Here so you add some noise to this layer and that makes the job harder for the model so it cannot just remember the picture it's harder for the model to overfit and that's and that was the reason we didn't
            • 81:30 - 82:00 use Dropout before in in resonant 50 for example because bch Norm in itself is a good regularizer and now when we don't use batch Norm when we use layer Norm we don't have that anymore so we have to use something extra as a regularizer so here we use drop drop out Dropout basically zeros out random coordinates in the output something like 10% of them and
            • 82:00 - 82:30 that again basically prevents the model from overfitting and that's it so we can see Dropout linear layer and that is the output of this in quar layer another thing I wanted to mention here is just the um number of trainable parameters in the multi header tension so as we've discussed this is like the juiciest part of the entire model and unfortunately it's all implemented in
            • 82:30 - 83:00 the white orch code itself so it's quite hard to read but we can at least look at it from the outside and try to determine what is going on there so this is saved as self. self attention um let's read the code one more time so during initialization our parameter weights are initialized here so in case we have different query key in value we'll have
            • 83:00 - 83:30 different tensors for that uh in case of self attention we'll just one we have one big tensor with three times the dimension so let me print that out so it is 768 by 56 so basically um it doesn't depend on the sequence length it's basically this 256
            • 83:30 - 84:00 by 256 so we project an element uh of length um 256 to the same embedding size and these query key and value weights are then reused for every element in the sequence so what we were doing here when when doing the projections when I said that we're going to project some element to the um value representation of it and then add that
            • 84:00 - 84:30 with some attention weight uh to the output and then another element will be projected to this value all these projections to Value from the initial element in the sequence they all happen using the same set of Weights the same Den layer basically um right now let's look at the decoder so let me remove the break points here first a quick look at the decoder
            • 84:30 - 85:00 itself so now again decoder is like a wrapper around individual layers that are passed in from the outside and here it is slightly different so again we iterate over all the layers in the decoder there are going to be six of them so we just initialize output as the target at first and if we look at the code for transformal itself the target basically starts with
            • 85:00 - 85:30 just zeros so we have these Zer coming from the bottom as um basically our query input it is going to be added um to the actual query eddings at some point and then we compute this output of of the decorder layer six times and then we also have this Logic for intermediate results
            • 85:30 - 86:00 so return intermediate is going to be true which means that when we do the decod layer forward pass we save outputs from all the individual layers as well uh but the most interesting part happens in the decoder layer so let's make a break point here um again it has just forward and forward pre so let's break it both of them press
            • 86:00 - 86:30 continue and we can see that the normal forward um is the one that happens this one so I'm going to forward post so this is the one that happens by default um now this instead of positional embedding I have my query embeddings so again at the very beginning for the first layer the target initialized is
            • 86:30 - 87:00 initialized as zero and then at for every layer I'm going to add the query in coding um then I do self attention drop out normalization then multi-ad attention again drop out normalization and then return that so couple of interesting pieces here so self attention is
            • 87:00 - 87:30 basically the same but now instead of operating over this image sequence with like 884 sequence length I have my query input with just 100 elements um and I just do self attention so let's see where it this defined Yep this is just multi-ad attention um and then we have a second attention which will be used for cross attention again it is defined using the same P torch module called multi-ad attention
            • 87:30 - 88:00 but now when I apply that I have my query I have my key uh and I have my value and so memory is this fixed Inc coding output so the final representation of the image itself and it is pass into multi-ad attention uh to cross attention as a side input as key uh and as value but then key is used
            • 88:00 - 88:30 with positional coding um and value is just used as is so let's print out the shape for position in coding for example yeah it's 884 by 2x 256 now this is my slide made for cross attention so we use the actual image sequence actual pieces of the image as
            • 88:30 - 89:00 query and sorry s key and value so we first use it as a part of attention Mass calculation where the key comes from pieces of the image and then a query comes from um from the input experts let's call it that and we also have seen in the code that when we we do this key calculation we also add the actual positional embedding from the image and then when we do the query calculation we
            • 89:00 - 89:30 use our query embeddings as as this extra step and then when when we do the second step of cross attention we now computed the attention mask in the previous step and then again we take the value from the image but now without positional embedding and we use this attention mask to project like that from whatever it it was our um image sequence length down to 100 which stands for
            • 89:30 - 90:00 sequence length for the query uh so other than that the code is pretty similar um so basically again the majority of the calculation happens in this self attention block and multi-ad attention block all of them are actually the same function but in case of self potention we basically pass uh the same tensors as input almost the same right so here the key and query are equal and
            • 90:00 - 90:30 then value is the same but without this query embedding uh and then during multi-ad attention query key and value will all be different and that's it so this is the majority of the encoder decoder code let me just remove the bra point and return from here and now let's also discuss loss so here's a quick
            • 90:30 - 91:00 reminder of the problem we trying to solve so let's say we had two birds on the picture and we have two ground with objects which with classes and bounding boxes and then the model outputs something so in the very beginning it will be just random um then at some point it will be quite good but still not entirely correct all the time and we need to deal with this situation where the model has predicted more objects
            • 91:00 - 91:30 that there are and we're not sure like should we match this bounding box that seems like a good match for this bird but it says it's a cat or maybe this bounding box instead uh that has the class correct but then the bounding box prediction is worse and so on so how do you do that we do that using the BART matching law so basically we had our set of ground trth objects uh and the set of model predictions and we need to find
            • 91:30 - 92:00 such connections such that every one of those nodes is used only once and it optimizes some sort of loss so the actual connection is done using the Hungarian matching algorithm also called matcher in the code and the actual loss consists of three terms so we have classification basically just cross entropy between predicted class and the actual class where the class can be person train dog Etc or No Object so
            • 92:00 - 92:30 no object is a special kind of class that is also taking part in crossentropy and then the two terms for bounding boxes so L1 loss which basically just measured the distance between predicted four points for the bounding box and the ground truth four points and then G IU which stands for generalized intersection of Union this is just differentiable alternative for IU which
            • 92:30 - 93:00 is the measure of overlap between the boxes so let's come back to our code right so we just were looking at the predictions now we also constructed this Criterion as part of our build model call over here um so now let's save the outputs of the model somewhere so so we can use them and this is again how they look like so outputs is the dictionary with a lot of stuff inside so it will
            • 93:00 - 93:30 have um final outputs for class probabilities for B boxes and it will also have the set of intermediate outputs of intermediate layers as well and then we just pass that dictionary onto the loss to the Criterion um and that's it right so we can just run this cell it will calculate all the losses so here we have 25 of them some of those like this cardinality error is actually only used for
            • 93:30 - 94:00 debugging it's not taking part in final loss calculation then loss C is the crossentropy for the final layer same for um guu and bounding box and then we have the same three alternatives for every intermediate layer again uh as well so first layer second layer third layer and so on so now um let's see where it is constructed so here is our build model code and so it has the set
            • 94:00 - 94:30 Criterion which is the actual loss and then it receives a matcher from this build matcher code so let's first look at the matcher so the matcher is the Hungarian matcher let's just put one break point here and then one in the forward pass of the set Criterion as well right in the beginning right so now I can just click
            • 94:30 - 95:00 debug cell and something will happen all right so the first thing we notice here so we start we get this dictionary of outputs we have our targets let's print what that is so targets is just a list the first element of that is just objects for the first
            • 95:00 - 95:30 image um and then second element is object for the second image and each objects knows about ping boxes class labels and so on so first thing we do here is we actually construct a different output without Oaks so without intermediate layers and we pass that one into match so we've promised that Hungarian matcher will optimize the final loss uh in the code that's not entirely correct so the
            • 95:30 - 96:00 final loss takes um all the intermediate layer outputs as well then the M the Hungarian matcher only looks at the final output so that's the first difference let's just step through that and um see what happens inside the matcher so let's just iterate a bit forward to understand what it is right so first another difference
            • 96:00 - 96:30 between the mat and the final loss that will be used is this class um class um loss classification loss so in the final loss that is supposed to be cross entropy in here we just approximated with one minus probability so this one minus probability function should look similar to the um negative log likelihood so at least it sort of changes in the same direction but the actual curve itself is different
            • 96:30 - 97:00 so we don't really have the guarantee that the minimum of this criteria in the matcher will be the same as the minimum of the final loss but we still sort of use that because it seems to work and it's easier to construct that way um then we calculate these three cost terms and then add them all up um with some weights um into the final
            • 97:00 - 97:30 Matrix so let's print the shape of that so it's 2 by 100 by 9 so we have batch of two images for every image the model outputs 100 um expert predictions and then we have nine objects in total for two images so this Matrix actually contains all the intermediate cross predictions so it also knows the loss for predictions of the experts from the second image with respect to ground Earth objects in the first image uh
            • 97:30 - 98:00 which is a bit weird but then it is fixed in this line where you do this split here so during split we actually split those batches so that um only the exports from the same image are calculated in in the loss and then the main part of that is this linear suum assignment which is imported from scipi so here is the doc for it this is basically the CPI function that receives
            • 98:00 - 98:30 a matrix of as an input Matrix corresponding to these graph Connections in the borted graph with the cost function associated with every Edge and then it does this Hungarian matching algorithm that finds the optimal match so all of that stuff is happening inside this function will will not read the actual implementation uh and then after that we just have the indices so let's print it out this is just a small array that
            • 98:30 - 99:00 actually shows which ground Earth objects match to which expert predictions and we return that then we come back to the actual loss function and here we can see that um we basically iterate over all our losses defined here they basically calculate the loss that I've talked about before so the classification loss bounding box loss L1 bounding box G loss uh and then we do the same for all
            • 99:00 - 99:30 auxiliary outputs so not much interesting going on here uh and then we return that um now I also want to talk a little bit about how training is done so if I come back to main. Pi again this is quite clean code it's more or less easy to read so we have our model we have our data loader then we have our Optimizer um someone who had studied object
            • 99:30 - 100:00 detection models this might become interesting so other object detection models like yellow like resnet um they usually use stochastic radi in descent so SGD Optimizer um in here we use adom um it work Works slightly differently but the overall concept is still the same it basically calculates these running momentums for the gradients and uses
            • 100:00 - 100:30 them to dynamically adjust the um learning rate as well as to capture the overall direction of the gradient descent um and that seems to work well in some cases better than just SGD with momentum uh and then not so good in other cases so we tend to use SGD with convolutional based models but then as soon as you have a Transformer you'll see that most of the actual models start to use adom instead then we have this
            • 100:30 - 101:00 step alert scheduler that basically decreases learning rate by something like 10x um every few dozen steps so the actual learning rate looks like this lad that just goes down little by little and then if you look at the actual metrix like um um average Precision it will go up in these steps as well so it will saturate for a given learning rate and then after adjustment it will go up a little bit
            • 101:00 - 101:30 further and so on then we also have the sampler um we construct our data loader the actual training happens here so we iterate our EPO for every EPO we call this train one EPO function uh the function itself iterates over the entire data loader it gets the images puts them to device then we call
            • 101:30 - 102:00 the model to get model predictions for the given images then we calculate the loss um using our loss function or Criterion as we call it here so the loss is a dictionary then we just sum them up um to get one final scalar number um sometimes we do this scaling um then [Music] we do zero grad to Zer out previous duration do the backward pass and
            • 102:00 - 102:30 optimize a step to update model parameters so all of that I think by now everyone is pretty much used to that um so this is all I wanted to talk about today um I hope it was useful again today we had an introduction to the code so I have my previous video with more U straightforward explanation of the actual model algorithm um so if you don't understand some stuff that's happening right now you better return to
            • 102:30 - 103:00 that video but I hope that it's useful because um you might want to implement a model like that yourself you might want to just um make sure that the Benchmark reported is correct you might want to consult the ground truth and again source code is the ground truth so it is better than the paper it is better than different YouTube bloggers so it's good if you can just go ahead and read it on your own and I'm hoping I can help with that
            • 103:00 - 103:30 um next up um I'm going to talk about more stuff for object detection so one of the things I want in my channel apart from just explaining some algorithms and helping to read code is to set out the learning path so you can start from somewhere and you're not sure how the overall field looks like and then I will guide you through different aspects of that and help you get somewhere so for
            • 103:30 - 104:00 now I've made a bunch of videos about image classification and object detection so we started with just simple introduction to P torch with digit classifications that you can just sit down and train and code down yourself in half an hour then we threw in a lot of deep learning Concepts like like optimizers L functions um activation functions all that stuff that was actually needed to be able to train proper models then we
            • 104:00 - 104:30 had our first big project for training the actual reset 152 on the imet data set they pick 150 g g gigabytes data set and I have a video about that about reading the code for that as well then we moved on to object detection that is detecting bounding boxes for multiple objects in picture um I have videos about YOLO I have videos about dat um and then I wanted to sort of split Out Future um videos into two directions so
            • 104:30 - 105:00 the first one will continue exploring various models in object detection so the very next one is going to be called deformable Data so this is the papers with code website that basically reports the current leaderboard Bo for Coco data set so the actual dator is somewhere here at 40 so it's not the best one at the
            • 105:00 - 105:30 moment um YOLO that we've seen before is somewhere here and then stateof the art is this one codet and this model actually has a bunch of New Concept in that that were introduced over the past 3 years maybe um so deformable deformable attend um Dynamic anchor boxes Den noising all that interesting stuff so I will make a bunch of videos
            • 105:30 - 106:00 about those to sort of set up the foundation for everything we need to know in order to discuss the actual current state-of-the-art model and the very next one will be about deformable dat at the same time I wanted to start the um the branch about generative AI um so the very first video in that Branch will be about neural sty transfer uh this is quite an old technology I think it originates back in
            • 106:00 - 106:30 2015 uh and it sort of branches out directly from resnet so we will not need advanced concepts like Transformers and so on that we've talked about in recent videos um but it's still really interesting and it sets some foundations for future generative AI videos I'll have and then we'll move move on to more interesting stuff like an additional image generation where you just generate an image similar to what you've seen before in the data set and that can be
            • 106:30 - 107:00 done using variational Auto encoders diffusion guns and then there's a long road towards image generation with text prompt for example and stuff like that so I hope that sounds interesting to you um hope to see you soon