
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Estimated read time: 1:20


    Summary

    In this video, Aarohi introduces the Swin Transformer, a significant evolution in the field of vision transformers. Published in 2021, the Swin Transformer architecture addresses the limitations of traditional vision transformers, especially when processing high-resolution images. With a focus on computational efficiency, Swin Transformer uses a shifted window approach to manage the complexity of high-res images, differentiating it from its predecessors. Aarohi explains the intricacies of its architecture, including patchification, linear embedding, and multi-layer perceptron layers. By employing a unique window-based attention mechanism, the Swin Transformer balances precision and efficiency, making it a powerful tool in the realm of image processing. The video concludes with a brief overview of its application in tasks like image classification and segmentation.

      Highlights

      • Swin Transformer is designed to handle high-resolution images efficiently.
      • Patchification turns images into smaller patches for easier processing.
      • Shifted windows give the Swin Transformer an efficient attention mechanism over images.
      • The overview simplifies complex Transformer blocks for viewers.
      • Different layers in Swin Transformers offer precision and flexibility.

      Key Takeaways

      • Swin Transformer excels at handling high-resolution images without hefty computational costs.
      • Patchification breaks down images into manageable pieces, crucial for processing.
      • Aarohi covers the complex architecture of Swin Transformers in an easy-to-digest manner.
      • Shifted windows enhance computational efficiency, making Swin Transformers stand out.
      • Swin Transformers offer versatile applications in image classification and segmentation.

      Overview

      Welcome to Aarohi's channel, where today we dive into the Swin Transformer, a 2021 innovation transforming how vision transformers handle high-resolution images. Unlike its predecessors, the Swin Transformer uses a shifted window method to better manage computational burden while maintaining performance.

        The Swin Transformer's architecture begins with breaking down images into patches, a critical step for handling large, high-resolution images effectively. These patches undergo linear embedding, preparing them for processing by Transformer models. Aarohi explains how two types of window attention, regular and shifted, play a crucial role in the working mechanism of these transformers.
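
        As a concrete illustration of the patchification and linear embedding steps described above, here is a minimal PyTorch sketch (not code from the video): a strided convolution whose kernel size equals the patch size both cuts the image into non-overlapping patches and applies a shared linear projection to each one. The class name PatchEmbed and the embedding dimension of 96 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify an image and linearly embed each patch (illustrative sketch)."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # kernel_size == stride == patch_size: each output position sees exactly
        # one non-overlapping patch, so the conv acts as a per-patch linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, embed_dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96]) -> 56 x 56 = 3136 patch tokens
```

        For a 224x224 input with 4x4 patches this yields the 3136 tokens (56 per row and 56 per column) that the video walks through.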

          Aarohi wraps up by highlighting the Swin Transformer's versatility in performing image classification and segmentation, delving into how it achieves precision at different resolution stages. She also touches upon the various Swin Transformer models available, catering to different needs with varied parameters such as the number of layers. The session promises future insights into practical applications, setting the stage for new learning avenues.
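
          If you want to try the classification use case before the follow-up video, recent torchvision releases (0.13 and later) ship Swin image classification models. The snippet below is a minimal usage sketch under that assumption; it is not code from the video.

```python
import torch
from torchvision.models import swin_t, Swin_T_Weights  # assumes torchvision >= 0.13

# Tiny variant with random weights; pass weights=Swin_T_Weights.IMAGENET1K_V1
# instead of None to load the pretrained ImageNet checkpoint.
model = swin_t(weights=None)
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # (1, 1000) class scores
print(logits.shape)
```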

            Chapters

            • 00:00 - 00:30: Introduction and Overview The chapter begins with an introduction by the speaker, Aarohi, who presents the topic of Swin Transformers on her channel. She notes that the Swin Transformer architecture was introduced in a paper titled 'Swin Transformer: Hierarchical Vision Transformer using Shifted Windows', published in 2021, and explains that it was developed to address some of the limitations of Vision Transformer models.
            • 00:30 - 01:30: Swin Transformer Architecture The chapter introduces the Swin Transformer architecture, emphasizing its advantage over the Vision Transformer in processing high-resolution images with lower computational complexity, and previews the roles of its different layers. The architecture begins with a patchification step applied to the input image.
            • 01:30 - 02:30: Patchification and Linear Embedding The chapter explains patchification, in which the input image is divided into patches. For a 224x224 image with a 4x4 patch size, this yields 56 patches per row and 56 patches per column, for a total of 3136 patches.
            • 02:30 - 05:30: Swin Transformer Blocks and Attention Mechanisms The chapter discusses the Swin Transformer blocks, focusing on their attention mechanisms. It introduces the linear embedding layer, which transforms the input image into a format a Transformer model can process: images are composed of pixels, while Transformers operate on sequences of tokens, so the image must be converted into a token sequence.
            • 05:30 - 13:00: Patch Merging and Hierarchical Process The chapter covers patch merging and the hierarchical structure of the Swin Transformer. It explains how linear embedding converts image pixels into a numerical representation in the form of vectors, which are then fed into the Transformer layers as input.
            • 13:00 - 15:00: Task-Specific Applications and Variants The chapter describes how the embedded vectors are processed by the main components of the architecture, the Swin Transformer blocks. Each block consists of two subunits, each containing a normalization layer followed by an attention module, and the attention module in the first subunit differs from the one in the second.
            • 15:00 - 16:00: Conclusion The chapter summarizes the two attention modules: after each attention module there is another normalization layer and a multi-layer perceptron layer, with the first subunit using window multi-head self-attention (W-MSA) and the second using shifted window multi-head self-attention (SW-MSA). A minimal code sketch of this two-subunit block structure follows the chapter list.
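
            As referenced at the end of the chapter list, here is a simplified, self-contained PyTorch sketch of one pair of Swin subunits: the first applies window multi-head self-attention (W-MSA), the second the shifted variant (SW-MSA). It deliberately omits the relative position bias and the attention mask the paper uses for shifted windows, and the names SwinBlock, window_partition and the dimension 96 are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    """(B, H, W, C) -> (num_windows*B, ws*ws, C)"""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    """(num_windows*B, ws*ws, C) -> (B, H, W, C)"""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.reshape(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class SwinBlock(nn.Module):
    """One Swin subunit: LN -> (S)W-MSA -> residual, then LN -> MLP -> residual.

    Simplified: plain multi-head attention inside each window, no relative
    position bias and no attention mask for the shifted case.
    """
    def __init__(self, dim, num_heads, window_size=7, shift=0):
        super().__init__()
        self.ws, self.shift = window_size, shift
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                       # x: (B, H, W, C)
        B, H, W, C = x.shape
        shortcut = x
        x = self.norm1(x)
        if self.shift:                          # cyclic shift for SW-MSA
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        w = window_partition(x, self.ws)        # attention within each window
        w, _ = self.attn(w, w, w)
        x = window_reverse(w, self.ws, H, W)
        if self.shift:                          # undo the cyclic shift
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        x = shortcut + x
        return x + self.mlp(self.norm2(x))

# The two subunits of one Swin Transformer block: W-MSA then SW-MSA.
blk1 = SwinBlock(dim=96, num_heads=3, window_size=7, shift=0)
blk2 = SwinBlock(dim=96, num_heads=3, window_size=7, shift=3)
out = blk2(blk1(torch.randn(1, 56, 56, 96)))
print(out.shape)  # torch.Size([1, 56, 56, 96])
```

            Stacking such block pairs, with patch merging between stages, gives the hierarchical structure discussed in the transcript below.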

            Swin Transformer: Hierarchical Vision Transformer using Shifted Windows Transcription

            • 00:00 - 00:30 hello everyone, my name is Aarohi and welcome to my channel. So guys, in today's video I'll talk about Swin Transformers. So the Swin Transformer architecture was introduced in a paper titled Swin Transformer: Hierarchical Vision Transformer using Shifted Windows; this paper was published in 2021. So the Swin Transformer built upon the success of the Vision Transformer architecture, but the Vision Transformer models suffer from limitations when it comes
            • 00:30 - 01:00 to handling high resolution images. So the Swin Transformer outperforms the Vision Transformer due to its ability to effectively handle large images with lower computational complexity. So today we'll understand what a Swin Transformer is, we'll see the architecture of the Swin Transformer, and we'll see how the different layers work in it, okay, so let's start. So this is the architecture of the Swin Transformer. The first step is patchification, which means the input image will
            • 01:00 - 01:30 go to the first layer, where we are performing the patchification: we are dividing our input image into patches. So suppose the input image size is 224 by 224 and the patch size is 4 by 4, then we are going to have a total number of patches equal to 3136, which means you will have 56 patches per row and 56 patches per column, so the total number of patches will be 3136, as you can see here,
            • 01:30 - 02:00 okay. So once you're done with the patchification, then we will go to the linear embedding layer. The linear embedding layer in the Swin Transformer is a technique used to convert the input data into a format that can be processed by a Transformer model, okay. So images are made up of pixels, but Transformers work with sequences of tokens, all right. So therefore what we need is to convert the images
            • 02:00 - 02:30 into a sequence of tokens, and this is where linear embedding comes into play. Linear embedding is a process for converting the image pixels into a numerical representation, or a vector, okay. So these vectors are then fed to the Transformer layers, okay, the Transformer model layers, as input. Okay, so once the image is converted into a sequence of
            • 02:30 - 03:00 vectors using linear embedding, it can be processed by the essential layers of the Swin Transformer blocks, okay. Now let's talk about the Swin Transformer blocks. So the Swin Transformer block consists of two subunits; you can see this is the first unit and this is the second unit, okay. So each subunit consists of a normalization layer followed by an attention module, and you can see that in the first unit the attention module is different, and in the second unit we are
            • 03:00 - 03:30 using a different attention module, okay, we'll discuss them in a minute, okay. So after the attention module we again have another normalization layer and then we have a multi-layer perceptron layer, okay. So now the first subunit uses a window multi-head self-attention layer, okay, while the second subunit is using a shifted window multi-head
            • 03:30 - 04:00 self-attention layer, okay. Now let's understand both the shifted window and the window multi-head self-attention layers in detail, okay. So before moving to this, let's see how this multi-head self-attention layer works in Vision Transformers: what happens in the Vision Transformer, how does this attention mechanism work in the Vision Transformer? So basically it calculates
            • 04:00 - 04:30 the relationship between each patch and all the other patches in the image, okay, and this approach has a quadratic computational complexity, meaning it becomes inefficient when we are going to have high resolution images, okay, because we are comparing each patch with all the other patches in the image, okay. So when we are working with high resolution images this will be difficult to handle because the
            • 04:30 - 05:00 computational complexity will be there, okay. So to address this issue, the Swin Transformer uses a window-based multi-head self-attention layer, okay. Now a window is simply a collection of patches, and attention is computed only within each window, okay. So for example, take a window size of two by two patches and window-
            • 05:00 - 05:30 based self-attention: if the window size is two by two patches, that means we will compute the attention on the basis of those four patches which are there in that window, okay. So let's take another example: suppose we have a feature map of size 8 by 8, okay, which is evenly partitioned into two by two windows, and the size of each window is 4 by 4, okay, M is equal to 4, the window size
            • 05:30 - 06:00 is 4. So each window contains 16 patches, so we'll calculate the attention inside these windows, which means one window will have 16 patches, so we'll calculate the attention on the basis of those 16 patches per window, okay. So this window-based self-attention cannot provide connections between the windows; you can see it is only calculating the attention on the basis of the patches which are there
            • 06:00 - 06:30 in one window; but how will we see the connections between the patches across windows, okay? So for that, we will use the shifted window multi-head self-attention layer, okay. So to provide the connections between the windows while maintaining the computational efficiency, a shifted window partitioning approach is proposed in the paper, okay. So in this shifted window partitioning approach we displace the
            • 06:30 - 07:00 window from the regularly partitioned window, okay, so we are displacing it. So how much displacement will there be? A shift of size two: for example, if M is equal to 4, okay, that means the shift size will be M/2 = 2, okay. So after shifting the window by two by two,
            • 07:00 - 07:30 we are having a total of nine windows, okay, but previously, where we used the window self-attention layer, we were having only four windows, and now we are having nine windows. Now what we want is to have only four windows; for that, the authors suggested another technique called cyclic shift to reassemble these nine windows back into the four windows. [A small code sketch of this window partition and cyclic shift follows the transcript.] So now let's understand what cyclic
            • 07:30 - 08:00 shift is, okay. So again we are assuming that we are having a feature map of size eight by eight, okay, and we have a window of size four by four; we shift it two positions to the top and two positions to the left, and we will have something like this. Now we have a lot of blank space, right, you can see here; we could zero-pad it, but we will not
            • 08:00 - 08:30 do that; instead of shifting the windows, let's shift the patches. We will shift the patches using a cyclic shift, okay. So to understand this cyclic shift concept, let's consider one example and then understand it. Suppose you have four areas, A, B, C and D, like this, okay. Now let's see where these areas would be after shifting, okay. So for the shifted area D, what we
            • 08:30 - 09:00 will do is shift the area D two positions up and two positions left, and this is how it will look after shifting, okay. Now we want to shift the area A two positions up and two positions left, but there is no space at the top, okay, so it will appear at the bottom right of the image, okay. In the same way, the remaining areas will be shifted like this. So this was our
            • 09:00 - 09:30 original arrangement of patches, and after the cyclic shift these are our shifted patches, okay. Now that we have the shifted patches, we can define the windows on them, okay. So we have some patches within each window, and we can pass these windows through MSA, which is masked MSA in this case, to produce
            • 09:30 - 10:00 an output, okay. So why are we using this masked MSA over here? The reason is that if you just look at this second window over here, okay, there are some patches that are adjacent to each other here, while in the original arrangement they are at different locations; they are adjacent only because of the shifting, okay. But what we want is that such patches should not attend to each other as if
            • 10:00 - 10:30 they were real neighbors; we want the attention to respect how the patches sit in their real positions, okay. To solve this we define a masking mechanism, okay, and in the end we will shift everything back to the original patch locations, okay, and once we shift the patches back to their original locations we can use those
            • 10:30 - 11:00 patches for the rest of the network, okay. So this is how the window multi-head self-attention module and the shifted window multi-head self-attention module work, okay. So now the other thing is patch merging. Instead of processing all the patches individually throughout the entire network, the Swin Transformer selectively merges adjacent patches to capture global
            • 11:00 - 11:30 information effectively, okay. So in the Swin Transformer, patch merging is performed hierarchically in multiple stages, okay. Each stage consists of Transformer blocks followed by a patch merging operation, and the patch merging layer merges four patches, okay. So with every merge, both the height and the width of the feature map are further reduced by a factor of two, so you can see in stage
            • 11:30 - 12:00 one the input resolution is the height of the image divided by four and the width of the image divided by four, okay, but after patch merging the resolution will change to height by 8 and width by 8, okay, which will be the input to the next stage; and for stage 3 the input resolution will be height by 16 and width by 16, and for the fourth stage the
            • 12:00 - 12:30 input we are providing to the fourth stage is height by 32 and width by 32. So how does this size get reduced? Let's see that. We split the patches into two-by-two groups, stack the patches in each group depth-wise, and then combine the stacked groups, okay. So the merging operation takes four neighboring patches and concatenates them along the channel dimension, so this effectively downsamples the input by a factor of n,
            • 12:30 - 13:00 and in our case n is equal to 2, which means each group consists of 2 by 2 neighboring patches, okay. So this is how it reduces the size of the feature map every time, okay. [A small code sketch of this patch merging step follows the transcript.] So finally, now that you are clear about patch merging, we can have a task-specific head. Suppose we want to perform image classification: for image classification we just look at the final
            • 13:00 - 13:30 output at the last stage of the Swin Transformer, okay, and pass it through a linear layer, as in the Vision Transformer, which is just a simple multi-layer perceptron, okay. So it provides you a class score for image classification, okay. But guys, in case you want to do an object detection or segmentation task, then you need the output of all the stages of the Transformer, okay. So you can
            • 13:30 - 14:00 say that we will be having features from different stages with different resolutions, okay. Why different resolutions? Because in the first stage the effective patch size is 4, then we have patch size 8, then patch size 16, and then patch size 32; every time the resolution is reduced, okay. So we can say that the algorithm will learn to work with different scales. So for object detection and image segmentation we provide the
            • 14:00 - 14:30 output of each stage to our task-specific head, whether it is an object detection task or a segmentation task, okay. So this is how it works. And guys, there are four variants of the Swin Transformer: tiny, small, base and large, and the difference between all these variants lies in parameters like the channel dimension C and the number of layers, okay. So you can see here that all these models
            • 14:30 - 15:00 have different numbers of layers and parameters. And that's it guys, this is the basic idea of the Swin Transformer that I have explained to you today. So in my next class I'll show you how to perform image classification, and then we'll see how to perform object detection using the Swin Transformer. So I hope this video is helpful, thank you for watching. [Small code sketches of the classification head and the variant configurations follow below.]
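
            As noted in the transcript, here is a small numeric sketch (assuming PyTorch) of the window partition and cyclic shift example from roughly 04:30-09:00: an 8x8 feature map with window size M = 4 gives four windows of 16 patches each, and torch.roll shifts the patches by M/2 = 2 so that regular partitioning still yields four windows. The helper name partition is illustrative, not from the video.

```python
import torch

# The 8x8 feature map from the example, with one channel whose value is just
# the patch index so the moves are easy to see.
fmap = torch.arange(64).view(1, 8, 8, 1)        # (B, H, W, C)
M = 4                                           # window size -> 2x2 = 4 windows

def partition(x, ws):
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

print(partition(fmap, M).shape)                 # (4, 16, 1): 4 windows, 16 patches each

# Shifted window step: instead of handling 9 irregular windows, cyclically
# shift the patches by M//2 = 2 positions up and to the left ...
shifted = torch.roll(fmap, shifts=(-2, -2), dims=(1, 2))
print(partition(shifted, M).shape)              # still (4, 16, 1) regular windows

# ... compute (masked) attention inside those windows, then roll back so every
# patch returns to its original location.
restored = torch.roll(shifted, shifts=(2, 2), dims=(1, 2))
print(torch.equal(restored, fmap))              # True
```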
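
            The patch merging step from roughly 10:30-12:30 concatenates each 2x2 group of neighboring patch vectors along the channel dimension and reduces them with a linear layer, halving the spatial resolution. Below is a minimal sketch, assuming the usual 4C to 2C reduction; the class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge each 2x2 group of neighboring patches (illustrative sketch).

    The four C-dim patch vectors are concatenated (4C) and reduced to 2C,
    e.g. (H/4, W/4, C) -> (H/8, W/8, 2C) between stage 1 and stage 2.
    """
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                 # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)

out = PatchMerging(dim=96)(torch.randn(1, 56, 56, 96))
print(out.shape)  # torch.Size([1, 28, 28, 192])
```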
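
            For the image classification head described around 13:00-13:30, the final-stage features are normalized, globally average pooled and passed through a single linear layer to get class scores. The sketch below assumes Swin-T-like dimensions (a 7x7 final stage with 768 channels for a 224x224 input) and 1000 classes; these numbers are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical final-stage output of a Swin backbone for a 224x224 input:
# stage 4 has resolution H/32 x W/32 = 7 x 7 and channel dimension 768.
features = torch.randn(1, 7, 7, 768)

norm = nn.LayerNorm(768)
head = nn.Linear(768, 1000)                      # 1000 ImageNet classes

pooled = norm(features).mean(dim=(1, 2))         # global average pool -> (1, 768)
logits = head(pooled)                            # (1, 1000) class scores
print(logits.shape)
```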
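
            Finally, the four variants mentioned around 14:00-14:30 differ mainly in the stage-1 channel dimension C and the number of blocks per stage. The dictionary below lists the configurations as reported in the Swin Transformer paper, written here as a quick reference sketch (verify against the paper before relying on them).

```python
# Stage-1 channel dimension C and number of blocks per stage for each variant.
swin_variants = {
    "Swin-T": {"embed_dim": 96,  "depths": (2, 2, 6, 2)},
    "Swin-S": {"embed_dim": 96,  "depths": (2, 2, 18, 2)},
    "Swin-B": {"embed_dim": 128, "depths": (2, 2, 18, 2)},
    "Swin-L": {"embed_dim": 192, "depths": (2, 2, 18, 2)},
}

for name, cfg in swin_variants.items():
    print(name, cfg)
```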