Experimenting with Vision RAG

Multimodal RAG: ColPali + Byaldi + Vision AI Models ⚡

Estimated read time: 1:20


    Summary

    In this video by AI Anytime, the host explores how to combine vision-language models with a document retrieval framework. The focus is on using open-source technologies like ColPali together with models like Qwen 2.5 VL to build a multimodal retrieval-augmented generation (RAG) system. The tutorial emphasizes how easily these models can run on compute-limited devices with the help of Unsloth, a key tool for running resource-intensive models on modest hardware. Using Google Colab and a few common utilities, the video provides a step-by-step guide for creating a vision RAG pipeline capable of extracting data from image-heavy documents. The speaker also covers common challenges, such as runtime crashes, and offers ways to work around them.

      Highlights

      • ColPali and Qwen 2.5 VL are the stars of this Vision RAG tutorial. 🌟
      • Unsloth helps leverage powerful models on limited hardware. 🏗️
      • The Byaldi library simplifies working with multimodal models. 💡
      • Watch out for unexpected crashes in Google Colab, especially during setup. 💥
      • Create an index from images and run queries over image-derived data. 🔍

      Key Takeaways

      • Explore the world of Vision RAG using open-source tools like ColPali and Qwen 2.5 VL. 🚀
      • Unsloth empowers users with limited GPU resources to run complex vision models. 💻
      • Multimodal models make it easy to retrieve and analyze data from images. 🖼️
      • Beware of runtime crashes in Google Colab; it's a common hiccup but solvable! ⚙️
      • Building Vision RAG systems can revolutionize how we interact with image-rich documents! 📄

      Overview

      In this video, AI Anytime walks us through an innovative way to integrate vision and text data retrieval using open-source technologies. At the heart of this approach are the ColPali retrieval framework and the Qwen 2.5 VL model, which together let users pull information out of images effectively. The tutorial provides a comprehensive yet straightforward guide to setting up this system with readily available tools and platforms like Google Colab.
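For orientation, here is a minimal sketch of the retrieval side in Python. It assumes Byaldi's RAGMultiModalModel API and the vidore/colqwen2-v1.0 checkpoint referenced later in the video; treat it as an illustration rather than the exact notebook code.

```python
# Sketch: load a ColPali-family retriever through Byaldi.
# The checkpoint name follows the one referenced in the video; any
# Byaldi-supported ColPali/ColQwen model should work the same way.
from byaldi import RAGMultiModalModel

retriever = RAGMultiModalModel.from_pretrained("vidore/colqwen2-v1.0")
```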

      One of the standout features highlighted is the use of Unsloth, a tool that makes it practical to run large models on minimal GPU resources. This is particularly valuable for users without access to high-performance computing setups. The speaker also delves into practical aspects, including potential issues with Google Colab environments, and shares useful tips to overcome them.
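As a rough illustration of the 4-bit loading step described in the video, here is a sketch using Unsloth's FastVisionModel. The exact checkpoint name and flags are assumptions; adapt them to whichever Qwen 2.5 VL build you use.

```python
# Sketch: load an instruction-tuned 7B vision-language model in 4-bit with
# Unsloth so it fits on a single Colab GPU. Checkpoint name is assumed.
from unsloth import FastVisionModel

model, processor = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct",      # instruction-tuned 7B checkpoint (assumed name)
    load_in_4bit=True,                     # 4-bit quantization for compute-limited devices
    use_gradient_checkpointing="unsloth",  # Unsloth's checkpointing for longer contexts
)
FastVisionModel.for_inference(model)       # put the model into inference mode
```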

      Ultimately, this Vision RAG setup offers a powerful way to analyze data in image-centric documents. The combination of ColPali-style retrieval and vision-language models promises to transform traditional data extraction, making it more interactive and versatile. Whether for academia, business, or personal projects, the possibilities of this technology are vast and exciting.
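Putting the pieces together, the retrieval loop looks roughly like the following sketch. The folder, index name, and query are illustrative; the calls mirror Byaldi's documented index/search API.

```python
# Sketch: index a folder of images with Byaldi and retrieve the page most
# relevant to a question. Paths, index name, and query are illustrative.
retriever.index(
    input_path="images/",                 # folder of downloaded document images
    index_name="image_index",
    store_collection_with_index=False,
    overwrite=True,
)

results = retriever.search("What is the net profit for Nike?", k=1)
best = results[0]
print(best.doc_id, best.page_num, best.score)  # which image matched, and how strongly
```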

            Chapters

            • 00:00 - 00:30: Introduction to Vision RAG. This chapter introduces experimentation with Vision RAG using ColPali and the Qwen 2.5 VL model. The focus is on discovering, asking about, or extracting information from images and media files, something traditionally done only with text documents. The chapter is introduced on the AI Anytime channel.
            • 00:30 - 01:30: Overview of ColPali and the Qwen 2.5 VL Model. The chapter introduces ColPali, a document retrieval framework, used in conjunction with open-source vision-language models like Qwen 2.5 VL. The objective is to explore building a vision RAG (retrieval-augmented generation) capability with these tools. Emphasis is placed on using a completely open-source stack without proprietary components.
            • 01:30 - 03:00: Using the GitHub Repository and Unsloth. The chapter covers the ColPali GitHub repository, which focuses on efficient document retrieval using vision-language models and lists the supported models along with comprehensive information about them. It also introduces Unsloth, a tool that makes heavy models workable on devices with limited compute. The tutorial is run in Google Colab on an L4 GPU, and a GPU is required to follow along.
            • 03:00 - 05:00: Details on Model Setup and Configuration. The chapter discusses the essentials for setting up a vision RAG project, emphasizing the importance of having a GPU. The speaker regards Unsloth as one of the most significant innovations of the last five years, because it enables users with limited GPU resources to run advanced models on a single GPU.
            • 05:00 - 07:00: Challenges with Google Colab. The chapter discusses challenges encountered while using Google Colab for machine learning tasks, particularly compute limitations, and credits Unsloth for addressing them. The speaker loads Unsloth's Qwen 2.5 VL build from Hugging Face, the 7B instruction-tuned version, in 4-bit precision (adjustable as needed) and uses Unsloth's gradient checkpointing for longer contexts. The model weights download successfully, showing how these choices work around the compute constraints.
            • 07:00 - 09:00: Creating and Indexing Images. In this chapter, the focus is on preparing the model for inference once its safetensors weights are downloaded, and on Byaldi, a tool for using late-interaction multimodal models like ColPali in just a few lines of code. This lets the user apply these models with minimal code, and the broader pipeline is available in the accompanying notebook for a seamless workflow.
            • 09:00 - 10:30: Demo: Image Search and Queries. The chapter describes Byaldi as a high-level abstraction that reduces the amount of code needed, pointing to the Byaldi repository for additional information. It also highlights the need for poppler-utils when working with images; installation varies by operating system, being easier on Unix and Linux than on Windows and Mac.
            • 10:30 - 11:30: Use Cases for Education and More. The chapter shows loading Byaldi's RAGMultiModalModel with the vidore/colqwen2-v1.0 checkpoint, and points to the documentation and GitHub repository for installation details and supported features (a sketch of the follow-on query step appears after this chapter list).
            • 11:30 - 13:00: Wrapping Up and Future Use Cases. In this chapter, the discussion turns to concluding thoughts and potential future use cases. There is a demonstration involving a small utilities helper and images pulled from the Substack CDN, featuring brands like Tesla, Netflix, Nike, and Google. The process includes navigating to the image folder, creating the images, and downloading them, all of which run successfully.
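The query step demoed in the chapters above (retrieve the best-matching image, then ask the vision model a question about it) might look like the following sketch. The message layout follows the Qwen VL chat-template convention; the image path and question are illustrative, and model/processor refer to the objects loaded in the earlier sketches.

```python
# Sketch: answer a question about a retrieved document image with the
# Qwen 2.5 VL model loaded earlier. Paths and prompt text are illustrative.
from PIL import Image
from transformers import TextStreamer

image = Image.open("images/nike.png")            # image returned by the retrieval step
question = "What is the net profit for Nike?"

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=[image], text=[prompt], return_tensors="pt").to("cuda")

streamer = TextStreamer(processor.tokenizer, skip_prompt=True)  # stream the answer as it decodes
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=256)
```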

            Multimodal RAG: ColPali + Byaldi + Vision AI Models ⚡ Transcription

            • 00:00 - 00:30 Hello everyone, welcome to the AI Anytime channel. In this video, we are going to go through a simple vision RAG experiment with ColPali and the Qwen 2.5 VL model. Most of the time you have these questions about how we can discover, ask for, or extract information from images or media files, right? We have been using documents to kind
            • 00:30 - 01:00 of retrieve information and build RAG on top of it, but how can we build vision RAG? That's the question. So there's a very good framework, and this is going to be completely open source; we're not using any closed-source stack. We're going to use ColPali, which is basically a document retrieval framework that combines with open-source vision-language models like Qwen 2.5 VL, or any other models, to build a vision RAG capability. So let's do that right now. If you look here on my screen, I'm on
            • 01:00 - 01:30 their GitHub repository. It's called ColPali: efficient document retrieval with vision language models. So you can go through it; they have support for these models, and you can see the models and find out everything over here. I'm also going to use Unsloth for this tutorial, because Unsloth helps on compute-limited devices. So that's the agenda of this video. If you look here, I am using Google Colab with an L4 GPU. You need some sort of GPU to run this tutorial. If
            • 01:30 - 02:00 you want to build a vision RAG project, you need to have a GPU. Now, I'm using Unsloth. Unsloth is one of my favorites; I'll not say invention, it's an innovation in my view, one of the most innovative things that has happened in the last 5 years, because Unsloth has empowered people like us who are GPU poor to run these kinds of capabilities on a single GPU
            • 02:00 - 02:30 or on very compute-limited devices, so all credit to Unsloth for this. Now I'm using Qwen 2.5 VL by Unsloth; you can see it's on their Hugging Face repository, the 7B instruction-tuned model. I have load_in_4bit set; you can make it true if you want to load in 4-bit, and for gradient checkpointing you can pass either true or "unsloth" for longer context, so I'm just using the "unsloth" gradient checkpointing over here. And you can see the model has been downloaded, the weights, the
            • 02:30 - 03:00 safetensors, and then I have also made this model ready for inference, so you can find the entire pipeline over here, right? And of course this notebook will be available. And then we are using Byaldi, which is very important. So let me just show you here. If you go to Byaldi on GitHub, it says: use late-interaction multimodal models such as ColPali in just a few lines of code. So basically it empowers you to use ColPali and those multimodal models in a few lines
            • 03:00 - 03:30 of code. So basically it's a high-level abstraction over such code, so we don't have to write a lot of code. That's what it is. You can go to Byaldi and read about it. Now let me come back, and you can see we need poppler-utils because we're going to use images, so poppler-utils is required. It depends on what kind of operating system you are using; poppler-utils is a bit difficult to install on Windows and Mac machines, and easier to work with on Unix and Linux distros. And
            • 03:30 - 04:00 of course, sometimes I have seen a problem with this; I'm going to show you in a bit. Now you can see I'm using RAGMultiModalModel from Byaldi, and we are using vidore/colqwen2-v1.0, this particular model. If you come here you can find more details about how to install it and everything it supports, how you can create an index, and so on and so forth, so you can go through the documentation and GitHub repository
            • 04:00 - 04:30 to understand it better. Now, this is going to take a bit; okay, we're going to wait for it. But I think that should be fine. That should be done. Yeah, now this is fine. Some utilities library. And I'm using some images from the Substack CDN; you can see it's Tesla, Netflix, Nike, Google. And of course, once we go to the image folder, we can see it. It's going to create an image and it downloads the image. You can see it has downloaded the
            • 04:30 - 05:00 image, right? And now we're going to create the index. And this is a moment where it sometimes breaks the runtime; I have seen it in Colab. You can see it says your session restarted after a crash. It has happened to me. I'm going to take a pause; I wanted to show you this because it has always happened to me. So I'm going to pause here and come back after this indexing step. All right, as you can see, now the index has been created. I have always noticed that when I run it for the first time with ColPali and Byaldi, it asks me to restart. Maybe because we have to
            • 05:00 - 05:30 restart the server, because the runtime might need to restart after the Byaldi installation. So please keep that in mind if you're using Google Colab. Now you can see the indexing output: added page one of document zero to the index, document one to the index, and so on, and the index has been created. It gets created over here, and you can find your embeddings and collections and everything over here. Now we have a search function. Of course, this converts
            • 05:30 - 06:00 the bytes to images using the Pillow (PIL) library. We are using TextStreamer from transformers. I'm just going to go next. And we have a simple prompt template here; we're using apply_chat_template from the tokenizer. I'm just going to go next here. You can see we are taking this Nike image. So if I just open this image over here, let's say open this Nike image, you can see it has an FY25 Q3 income statement. That's what we are
            • 06:00 - 06:30 looking at here. I'm asking about net profit in this question: what is the net profit for Nike? It's going to respond with something; let's see if it responds correctly. Perfect, it's $0.8 billion. You can see it over here on the right-hand side where my cursor is right now. This is really interesting because this is completely open source; you're not using any closed-source models. And if you are building a multimodal RAG or a vision RAG where you have these kinds of images, what you can also
            • 06:30 - 07:00 do, guys, is extract all the images, save them in a folder, create the index using this particular method, and then of course use that through RAG. So I think that should be very interesting as well, in my view, because you are not using any OpenAI APIs or models that support images, like GPT-4o, or even Claude 3.7 or Gemini 2, which support them. So you
            • 07:00 - 07:30 can use those models if you want to, if you're okay using them, but for highly sensitive data you might not want to use a closed-source model; you might want to use open-source models. In that scenario you have to go with this particular architecture. Now if you look here on my screen, right now I have opened a Gradio application; I built a simple Gradio app over here that shows you what kind of capabilities it has. So, let me just
            • 07:30 - 08:00 probably download this. I don't know if I can download it, or if I already have it. Okay, see? Yeah, that's fine. Nike, we have this Nike image over here. Let's just upload it and let me ask the question. So you can see we asked about net profit, right? What kind of questions can we ask here? What was the gross profit? Let's ask this question here. Now, if I ask it, you can see this is a Gradio application,
            • 08:00 - 08:30 right? Of course, we didn't use the chat template over here, but that is fine. You can see it says the gross profit for Nike in Q3 FY25 was $4.7 billion. You can see it over here, right? $4.7 billion with a margin of 41%. Fantastic, I liked it. So this kind of thing we can build, and we can build a lot of things, not only this. We can build things in the education sector as well, where you want to create questions and answers by looking at some kind of images
            • 08:30 - 09:00 or you want to explain some image or advertisement and whatnot. We can create this particular architecture over here where you can upload an image and ask a question. So you can combine ColPali and Byaldi with any vision model you want; there are a lot of open-source vision models nowadays that support these frameworks. So give this code a look, try it out, and let me know what you are building with this architecture and what kind of use cases you are solving with
            • 09:00 - 09:30 vision RAG. This is one of the most prominent use cases I see, where you have a lot of images within PDFs as well; let's say you have scanned documents, these are nothing but images, so you can use an open-source stack to discover or extract information. That's what I wanted to show in this video, guys. I'll share this notebook; have a look, try it out. Let me know if you have any questions, thoughts, or feedback in the comment box. You can also reach out to me through my social media channels; find that information on the channel banner and
            • 09:30 - 10:00 channel about page. If you like this video, please hit the like icon. If you haven't subscribed to the channel yet, please do subscribe, guys; that motivates me to create more such videos in the near future. That's all for this video. Thank you so much for watching. See you in the next one.
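The demo around 07:00 - 08:30 in the transcription uses a small Gradio app to upload an image and ask questions about it. As a rough, hedged sketch of what such a wrapper could look like (component names and labels are illustrative, and it reuses the model and processor from the earlier sketches):

```python
# Sketch: a minimal Gradio demo in the spirit of the one shown in the video.
# Upload a document image, ask a question, get the model's answer back.
import gradio as gr
from PIL import Image

def ask_image(image: Image.Image, question: str) -> str:
    messages = [{
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": question}],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(images=[image], text=[prompt], return_tensors="pt").to("cuda")
    output_ids = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, not the echoed prompt.
    return processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]

demo = gr.Interface(
    fn=ask_image,
    inputs=[gr.Image(type="pil", label="Document image"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="Vision RAG demo",
)
demo.launch()
```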