Summary
In this tutorial, hosted by Will from LangChain, you'll learn how to build an autonomous AI agent that can surf the web using vision capabilities. The demonstration uses LangGraph, an open-source framework designed to balance expressivity and control when building complex LLM workflows, particularly vision-enabled AI agents. The session covers creating a browser agent that can reason, decide on actions, and execute them through a series of interactions on web pages, navigating via bounding boxes and interfacing with various web tools. The tutorial walks through setting up the necessary tools, defining the state machine, constructing prompts, and finally compiling and testing the agent's ability to perform multiple web tasks.
Highlights
Create an autonomous AI agent that uses vision to surf the web.
LangGraph helps balance control and expressivity in AI workflows.
Build agents with reasoning loops for efficient task execution.
Use bounding boxes for precise interaction with web elements.
Navigate challenges in AI development with hands-on debugging tips.
Key Takeaways
Learn how to build AI agents that use vision to navigate web pages efficiently.
Utilize LangGraph, an open-source framework from LangChain, to create controlled and expressive AI workflows.
Understand concepts like state machines, bounding boxes, and tool integration to enhance AI capabilities.
Explore the practical implementation of compiling and testing an AI agent with real-time web interactions.
Stay updated with LangChain developments for continuous learning and advancement in AI technology.
Overview
In this lesson from LangChain, learn to build an autonomous web-surfing AI agent equipped with vision capabilities. Leveraging LangGraph, the tutorial balances control with expressivity, building workflows for AI agents that see the web through annotated bounding boxes and execute tasks intelligently.
The tutorial delves into essential components: stateful graphs, defining the agent state, and creating browser interactions through tools. It demonstrates constructing a reasoning-action loop for precise task execution and addresses real-world challenges such as handling complex page structures. By integrating various tools and techniques, you will steadily improve the agent's effectiveness.
Finally, practical tests put the agent into action: navigating web tasks, understanding complex prompts, and creatively working around issues, all demonstrated with real-time browser interactions. Tune in for a session that merges educational content with engaging AI development for your next AI venture.
Chapters
00:00 - 01:00: Introduction and Overview This chapter introduces building AI agents with vision capabilities, focusing on an autonomous web-surfing agent that performs complex tasks using vision alone. Will, from LangChain, introduces LangGraph, an open-source framework from the LangChain team that balances expressivity and control when building cyclic LLM workflows of the kind commonly used for AI agents.
01:00 - 02:30: Building a Vision-Enabled Web Agent The chapter introduces the concept of building a vision-enabled web agent with LangGraph. Readers unfamiliar with LangGraph can check out the introductory videos on YouTube and the examples in the LangGraph repository. The goal is to build understanding of the workflows LangGraph enables through the example of a vision-enabled agent based on the WebVoyager paper.
02:30 - 04:00: Using the WebVoyager Paper as a Reference The chapter explores the agent described in the WebVoyager paper by He et al. of Zhejiang University; since the code had not been released at the time, the discussion is based on the paper's description. The agent is a basic reasoning-and-action loop, similar to ReAct-style agents: it first generates a chain of thought justifying its upcoming action to improve accuracy, then specifies the action and arguments for a tool (see the loop sketch after this chapter list).
04:00 - 05:30: Considerations and Challenges in Building Web Agents The chapter discusses the loop in which the agent keeps calling tools until the LLM determines it has achieved its goal, at which point the answer is returned to the user. A key consideration is that raw page text contains a lot of extraneous information that must be filtered out so it does not distract the model.
05:30 - 07:00: Mark-Style Bounding Box Annotation This chapter discusses building resilient agents by reducing complexity and distractions to improve the likelihood of achieving the goal. Raw page markup was written for machines, and even a full screenshot makes it hard for a model to act with pixel-level precision, so the chapter introduces a more effective approach: mark-style bounding-box annotation.
07:00 - 09:00: Agent Structure and Workflow The chapter covers using bounding-box annotations so the agent can interact with user interfaces more effectively: elements are numbered, and the agent selects one by number and acts on it with the mouse. This creates UI affordances that improve the agent's success rate on tasks. A diagram illustrating the basic structure of the agent's workflow is also introduced.
09:00 - 10:30: Using the LangGraph Framework This chapter walks through the components of the agent: the web browser, annotated with bounding boxes for interaction; the trajectory, which tracks past actions so the agent does not repeat itself; and the LLM prompt, represented by the "brain" in the diagram. It then covers the tools, which serve as the API between the LLM and the web browser, enabling actions like scrolling, typing, clicking, waiting, and navigation.
10:30 - 12:00: Defining the Agent and Tools The chapter notes that the system was assembled quickly and is not optimized; suggested improvements include using a larger model, enhancing the prompts, and adding more tools to make it more resilient. LangSmith is highlighted as a tool that makes such optimization easier, offering traces that assist in debugging and further development.
12:00 - 13:00: Creating the Web Browsing Agent This chapter covers the first setup steps: creating a LangSmith account and API key and sending traces to a project called Web Voyager. Signing up is optional but highly recommended, and support contacts are provided for getting off the waiting list. The required packages are then installed, with application development revisited later.
13:00 - 15:00: Initializing the Browser and Testing The chapter explains setting up the browser layer for the agent. It mentions the langchain-openai package, which provides a runnable API over the OpenAI SDK, and suggests installing Playwright, the browser-automation technology that lets the LLM control a browser. Alternatives like Selenium are interchangeable depending on your needs. A final step allows the async browser to run inside a Jupyter notebook.
15:00 - 18:30: Task Execution and Evaluation The chapter discusses creating a stateful graph, which functions like a state machine: the next step is determined by the current state and the connections within the graph. The state includes elements like the web page, the user input, the annotated image, and the bounding boxes. The notebook-specific setup is noted as useful for demonstrations but not needed in production.
18:30 - 20:30: Inspecting and Debugging with LangSmith This chapter shows how to observe and debug the interactions between the model and its tools. When a tool clicks or types, a simple string records the observation, and additional information can be stored in the message log. The chapter suggests experimenting with these settings to beat the pre-configured performance. It then details defining the state before creating the tools, which act as the API between the LLM and the browser, as demonstrated in the LangGraph state definitions.
20:30 - 23:00: Adjusting Prompts and Testing The chapter describes how the graph's nodes are plain functions that receive the state. For the Click operation, the page and the bounding-box argument are pulled from the state; the bounding-box list is then used to look up pixel-level x/y coordinates for a precise click. The other tool functions follow the same pattern.
23:00 - 24:00: Enhancements and Conclusions This chapter covers the typing tool, which takes a bounding box and text content so the agent can search and fill out forms; it overwrites any existing text in the field. Some parts still need work, such as the scroll operation, which is deliberately kept simple and does not let the model specify how far to scroll. Improving it would allow better navigation when the content does not fit in a single browser view.
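The reasoning-and-action loop referenced in the chapters above can be summarized in a short sketch. Everything here is illustrative: annotate_page, llm_step, and the tool registry are placeholders standing in for the components built later in the tutorial, not the notebook's actual names.

```python
# Minimal sketch of the WebVoyager-style reason-act loop.
# annotate_page, llm_step, and tools are placeholders for the pieces
# built in the tutorial; names and signatures are illustrative.
def run_agent(question, page, annotate_page, llm_step, tools, max_steps=150):
    scratchpad = []  # numbered record of past actions and observations
    for _ in range(max_steps):
        screenshot, bboxes = annotate_page(page)        # mark clickable elements
        thought, action, args = llm_step(question, screenshot, bboxes, scratchpad)
        if action == "ANSWER":                          # the LLM decided it is done
            return args
        observation = tools[action](page, bboxes, args) # e.g. Click, Type, Scroll
        scratchpad.append(f"{thought} -> {action}: {observation}")
    return "Ran out of steps before answering."
```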
WebVoyager Transcription
00:00 - 00:30 Ever wondered how to build an AI agent that can see? In this tutorial you will learn how to create an autonomous agent that can surf the web and assist you in performing complex tasks using vision alone. Hi, it's Will from LangChain. Today I'm going to teach you how to build a vision-enabled web-browsing agent using LangGraph. LangGraph is an open-source framework from the LangChain team that makes it easy to balance expressivity and control when building cyclic LLM workflows, such as those
00:30 - 01:00 commonly used for AI agents. If you're not familiar with LangGraph, we've got a nice introductory series of videos on YouTube that you can check out, as well as a number of examples in this repository showing more basic ways to build with it. Hopefully, through the course of building this vision-enabled agent, you'll understand a little more about the types of workflows it empowers you to build. For our example, we'll be building on the WebVoyager
01:00 - 01:30 paper by He et al. of Zhejiang University. They haven't released the code yet, so this is my best impression of how the agent operates based on the description in the paper. The agent itself is a basic reasoning-and-action loop, if you're familiar with ReAct-style agents: it generates a chain of thought saying why it's going to take the next action, to improve its accuracy, and then it gives the action for the tool. The agent
01:30 - 02:00 is then going to parse that, call the corresponding API, and decide whether to continue to loop, until the LLM decides it has the answer and has accomplished the goal it was set out to accomplish. Once that's reflected in the state, it returns that output or response to the user. When building a web-browser agent, there are a number of considerations. For one, if you're just feeding it raw text, a lot of useless information goes into
02:00 - 02:30 the machine that isn't actually helpful in getting to the end state. In general, when building agents, you want to reduce complexity and distractions as much as possible, so that the agent is as resilient as possible and has a better likelihood of accomplishing its end goal. The web was designed for human eyes, but even then, if you pass in a big screenshot like this, it's hard for the model to reach that level of precision just by generating text. So the paper has a nice approach that uses a set of
02:30 - 03:00 mark-style bounding-box annotations: it generates bounding boxes around the interactive elements, and the model can then select those elements by number and use the mouse to make clicks and other types of actions. This is a nice way of creating UI affordances for the agent so that it can interact in a more successful way. If we look at this diagram I made, the basic structure is as follows: you have the state of the
03:00 - 03:30 graph, which contains the web browser that we annotate with those bounding boxes; the question; the trajectory, which is just the past series of actions it has taken, so it knows what it has tried before and doesn't repeat itself; the LLM prompt, encoded here by the brain; and then the tools. Tools in general are just the API between the LLM and the outside world. In our case, the outside world is the web browser, so we have things like scrolling, typing, clicking, waiting, going back, and going to search.
03:30 - 04:00 I put this together pretty quickly and haven't optimized it at all, so a lot of these examples could probably be made more resilient by giving the agent more tools, using a larger model, improving the prompts, etc. You can probably optimize this further for your own use case. One thing that makes it really easy to do that is LangSmith. You can click here and sign up for an account (I already have one), and it gives us a nice set of traces that you can look through to help debug and
04:00 - 04:30 develop your application. We'll get back to this a little later once we actually start running code. Signing up is optional but highly recommended, and if you need to get an account or get off the waitlist, feel free to follow up with us offline; you can email us at support@langchain.dev. We'll set up the API key here, and we'll be sending all of these traces to a little project called Web Voyager. Then we install the required packages: LangGraph,
04:30 - 05:00 LangSmith for tracing, and the langchain-openai wrapper, which gives a runnable API over the OpenAI SDK. If you haven't already, install Playwright; Playwright is the browser technology that the LLM will be given control of. You could also use Selenium or other tools like that; the choice isn't super important for this case. And then there's a step that allows the async browser to operate inside a Jupyter notebook.
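If you're following along, the setup just described might look like this in a notebook cell. The environment variables are the standard LangSmith ones; treat the exact values as illustrative.

```python
# %pip install -U langgraph langsmith langchain-openai playwright
# After installing, download Playwright's browser binaries once from a shell:
#   playwright install

import os
import nest_asyncio

# Let Playwright's async API run inside the notebook's event loop.
nest_asyncio.apply()

# Optional but recommended: send traces to LangSmith under a named project.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "Web Voyager"
# os.environ["LANGCHAIN_API_KEY"] = "..."   # your LangSmith key
# os.environ["OPENAI_API_KEY"] = "..."      # your OpenAI key
```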
05:00 - 05:30 You wouldn't need that in production, but for the sake of the notebook we do it here. All right, on to the agent state. When you're creating a graph in LangGraph, it's a stateful graph, and you can think of it as a state machine: what dictates the next step is both the state and the series of connections within the graph. In our case, the state has a number of elements. You've got the web page itself (a page within Playwright), the user input, and then things like the annotated image and the bounding boxes, as well as
05:30 - 06:00 the model's prediction. Then there are a few other things: the observation from the tool once it actually clicks or types on something, which is just a simple string, and the list of messages, so if the agent saves additional information in the messages we can keep track of it there. You can try tweaking this to see if you can get better performance than what I have configured. Once the state is defined, we'll start creating the tools, which again are the API between the LLM and the browser.
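A sketch of what that state could look like as typed dictionaries. The field names follow the narration; the exact definitions in the notebook may differ.

```python
from typing import List, Optional
from typing_extensions import TypedDict
from langchain_core.messages import BaseMessage
from playwright.async_api import Page

class BBox(TypedDict):
    x: float            # pixel-level center coordinates of the element
    y: float
    text: str           # visible text, if any
    type: str           # element type, e.g. "button" or "input"
    ariaLabel: str

class Prediction(TypedDict):
    action: str                   # e.g. "Click", "Type", "ANSWER", "retry"
    args: Optional[List[str]]     # arguments parsed from the LLM's output

class AgentState(TypedDict):
    page: Page                    # the Playwright page the agent controls
    input: str                    # the user's question
    img: str                      # base64-encoded annotated screenshot
    bboxes: List[BBox]            # bounding boxes from the annotation step
    prediction: Prediction        # the LLM's most recent decision
    observation: str              # string result of the latest tool call
    scratchpad: List[BaseMessage] # running log of past steps
```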
06:00 - 06:30 In LangGraph, each of these tool functions, which become nodes in the graph, accepts the state. In the case of the Click operation, we get the page from the state, along with the arguments passed by the LLM, which here are just the bounding-box number. We can look that number up in the list of bounding boxes above and get the x/y coordinates, the pixel-level information about where we're going to click. The other functions are defined
06:30 - 07:00 similarly. For typing text, you again have a bounding box, plus the text content that's going to be typed into it, so if the agent needs to search or fill out forms it can use this; it also overwrites any existing content in the field. We've got this scroll operation; right now I'm keeping it simple and not even letting the model tell the tool how far to scroll. This could definitely be improved, but it lets the agent navigate a page whenever the content doesn't fit in a single browser view.
07:00 - 07:30 It can wait for things to load, it can go back a page if it traverses somewhere unhelpful, and it can jump back to Google, which is kind of a get-out-of-jail-free card. I'll run this cell (and the cell before it, which I forgot to run first).
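Here's roughly what two of those tool functions might look like, sketched from the narration using Playwright's async API. Signatures and return strings are illustrative; the notebook's versions may differ.

```python
import platform

async def click(state: AgentState) -> str:
    """Click the numbered bounding box the LLM selected."""
    page = state["page"]
    args = state["prediction"]["args"]
    if not args or len(args) != 1:
        return f"Failed to click: bad arguments {args}"
    bbox = state["bboxes"][int(args[0])]        # look up pixel coordinates
    await page.mouse.click(bbox["x"], bbox["y"])
    return f"Clicked element {args[0]}"

async def type_text(state: AgentState) -> str:
    """Click a text box, overwrite its contents, type, and submit."""
    page = state["page"]
    bbox_id, text = state["prediction"]["args"]
    bbox = state["bboxes"][int(bbox_id)]
    await page.mouse.click(bbox["x"], bbox["y"])
    # Select-all then delete, to overwrite whatever is already in the field.
    select_all = "Meta+A" if platform.system() == "Darwin" else "Control+A"
    await page.keyboard.press(select_all)
    await page.keyboard.press("Backspace")
    await page.keyboard.type(text)
    await page.keyboard.press("Enter")          # submit the search or form
    return f"Typed '{text}' and submitted"
```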
07:30 - 08:00 Now that we have the tools, we can actually define the agent, and this is kind of the fun part. As mentioned before, the agent is composed of a couple of pieces. There's the annotation part, the thing that takes the screenshots and generates the bounding boxes; that image gets passed to the LLM. And then there's the prompt and the LLM itself. Here I have a JavaScript file, also in this repository, that generates the bounding boxes so we can screenshot them; afterwards we unmark the page so those boxes don't follow us around as we navigate the web. This will be the annotate function. Then we have this parsing function: when the LLM generates raw text, we look for an action prefix, and then
08:00 - 08:30 whatever arguments follow are semicolon-separated, and we parse them out. Again, you can try playing around with different formats and other approaches to make this more robust if you need to.
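The parsing function just described might look something like this sketch. The "Action:" prefix and the retry convention follow the narration; the exact strings are illustrative.

```python
def parse(text: str) -> dict:
    """Extract the tool action and its semicolon-separated arguments
    from the last line of the LLM's raw text output."""
    action_prefix = "Action: "
    last_line = text.strip().split("\n")[-1]
    if not last_line.startswith(action_prefix):
        # Malformed output: route back to the agent for a retry.
        return {"action": "retry", "args": f"Could not parse: {text}"}
    action_str = last_line[len(action_prefix):]
    parts = action_str.split(";")
    action = parts[0].strip()
    args = [p.strip() for p in parts[1:]] or None
    return {"action": action, "args": args}
```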
08:30 - 09:00 We're storing the prompt on the LangSmith Hub, and you can see what it looks like there. Once you have the prompt ready, you can compose it with the LLM. This pipe syntax, if you're not familiar with it, is part of the LangChain Expression Language (LCEL). All it does is create a new object by composing these pieces together: the output of the prompt goes into the LLM, which then goes into the output parser that takes the text from the overall output message, and then into the parse function we just defined above. This is a single unit, and we're assigning it to the prediction key in the output, and then composing that with annotate. This
09:00 - 09:30 annotate function, again, is what takes the browser and generates the bounding boxes. Once you define this, if you want to see it in a more visual format, you can even call get_graph and print the ASCII rendering, and you can see basically what I was saying. It's all piping, similar to bash piping: the data flows through these different pipes, and then
09:30 - 10:00 the output of this will be assigned to the prediction value, and whatever keys go in will be joined in the output. (My Jupyter was messing up there for a second.) If you want a quick view of what the bounding boxes look like, I pulled up a screenshot here: we basically look for interactive elements and draw numbered boxes around them. The LLM can then say "I'm going to click on number 18," and we resolve that number to pixel coordinates using the bounding boxes, which again makes it easy for the agent to interact with this world.
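Composed with the pipe syntax, the agent might look roughly like this. I'm assuming the Hub prompt path and vision model used in the video's notebook; annotate and parse are the functions described above, and there may be an intermediate formatting step in the actual notebook.

```python
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

prompt = hub.pull("wfh/web-voyager")            # prompt from the LangSmith Hub
llm = ChatOpenAI(model="gpt-4-vision-preview", max_tokens=4096)

# annotate adds the screenshot and bounding boxes to the state; the
# prompt -> llm -> parser chain produces the {action, args} dict, which
# is stored under the "prediction" key of the output.
agent = annotate | RunnablePassthrough.assign(
    prediction=prompt | llm | StrOutputParser() | parse
)

# agent.get_graph().print_ascii()   # visualize the pipeline (needs grandalf)
```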
10:00 - 10:30 Now it's time to start creating the actual graph. We've got one more function to define in order to finish our workflow: update_scratchpad. Any time this agent loops, we want to format the observations from the tools
10:30 - 11:00 into a structure that we can put back into the prompt. What this does is take the string output of the tool functions we defined above, process it a bit, and append it to a numbered list of previous actions stored in the scratchpad variable. When we investigate the traces in LangSmith later, you'll get a better sense of how this is working. Now it's time to actually compose the graph.
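Based on that description, a sketch of the update_scratchpad node (the numbering format is illustrative):

```python
from langchain_core.messages import SystemMessage

def update_scratchpad(state: AgentState) -> dict:
    """Append the latest tool observation to a numbered list of past steps
    that gets fed back into the prompt on the next loop iteration."""
    old = state.get("scratchpad")
    if old:
        txt = old[0].content
        # Recover the last step number from the final line, e.g. "3. Clicked 7".
        last_step = int(txt.rsplit("\n", 1)[-1].split(".", 1)[0])
    else:
        txt = "Previous action observations:"
        last_step = 0
    txt += f"\n{last_step + 1}. {state['observation']}"
    return {"scratchpad": [SystemMessage(content=txt)]}
```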
11:00 - 11:30 If you're not as familiar with LangGraph, I'll take a little longer to describe these operations for you; if you are very familiar with it, you can probably skip forward a few seconds. The basic steps are: you create a graph builder object from StateGraph, which takes in the agent state type we defined above. That is the state of the state machine, so any time a node returns, it updates the state. We then start adding nodes and defining edges. Edges define how the work is
11:30 - 12:00 routed between each of the different units. We add the agent we built above and set it as the entry point, so when the user calls the graph it first goes to the agent. Next we add the update_scratchpad function we just created, and we say that every time update_scratchpad returns, pass the state back to the agent. This completes the loop: any time a tool is called, control goes back to the agent to decide what to do next.
12:00 - 12:30 After that, we've got this list of tools. We give each a name and add them to the graph, and then we add an edge that says any time a tool completes, go back to that update_scratchpad node we defined. And before I get too far, notice this piping expression-language syntax again: all it does is say, run the tool function defined above first, then wrap its string result in a key
12:30 - 13:00 called observation. Again, this is because we defined the observation key in the LangGraph state, so it gets updated after the node is called. Finally, we create the single conditional edge that determines how the whole logic is routed: after the agent runs, this function is called, and it takes the output of the agent and
13:00 - 13:30 decides which of the other nodes to go to. If the action is ANSWER, we end, and the user sees the final result the agent returned. If the action is retry (which, if you look back above, is the output when there's a parsing error), the agent gets to handle its own formatting issues. Otherwise we return the action itself; since we named each tool node the same value the prompt tells the LLM to output, we can route directly back to that tool node.
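Put together, that graph-building code might look like this sketch. Node names and routing follow the narration; scroll, wait, go_back, and to_google are the remaining tools, defined in the notebook along the same lines as click and type_text above.

```python
from langchain_core.runnables import RunnableLambda
from langgraph.graph import StateGraph, END

graph_builder = StateGraph(AgentState)

graph_builder.add_node("agent", agent)                  # the LCEL pipeline
graph_builder.set_entry_point("agent")

graph_builder.add_node("update_scratchpad", update_scratchpad)
graph_builder.add_edge("update_scratchpad", "agent")    # close the loop

tools = {"Click": click, "Type": type_text, "Scroll": scroll,
         "Wait": wait, "GoBack": go_back, "Google": to_google}
for node_name, tool in tools.items():
    # Wrap each tool so its string result lands under the "observation" key.
    graph_builder.add_node(
        node_name,
        RunnableLambda(tool) | (lambda observation: {"observation": observation}),
    )
    graph_builder.add_edge(node_name, "update_scratchpad")

def select_tool(state: AgentState):
    action = state["prediction"]["action"]
    if action == "ANSWER":
        return END          # surface the final answer to the user
    if action == "retry":
        return "agent"      # parsing failed; let the agent try again
    return action           # otherwise route to the tool node by name

graph_builder.add_conditional_edges("agent", select_tool)

graph = graph_builder.compile()
```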
13:30 - 14:00 The compile step just takes this builder and creates the graph itself, and then it's time to play around with it. Let's see how we did. First we create the actual browser object, and then there's a helper function that prints everything out to the notebook so we can watch the agent running in real time. You'll notice it opens the browser and navigates to Google. For our first question, we're going to ask: can you explain the WebVoyager paper on arXiv?
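Running the compiled graph might look like this sketch, using Playwright's async API and LangGraph's streaming. The helper is simplified from the one in the video, and details like the event shape are assumptions.

```python
from playwright.async_api import async_playwright

async def call_agent(question: str, page, max_steps: int = 150):
    """Stream the graph and print each predicted action as it happens."""
    event_stream = graph.astream(
        {"page": page, "input": question, "scratchpad": []},
        {"recursion_limit": max_steps},
    )
    final_answer = None
    async for event in event_stream:
        if "agent" not in event:        # only report the agent node's updates
            continue
        pred = event["agent"].get("prediction") or {}
        print(f"{pred.get('action')}: {pred.get('args')}")
        if pred.get("action") == "ANSWER" and pred.get("args"):
            final_answer = pred["args"][0]
    return final_answer

# In the notebook:
# pw = await async_playwright().start()
# browser = await pw.chromium.launch(headless=False)
# page = await browser.new_page()
# await page.goto("https://www.google.com")
# answer = await call_agent("Explain the WebVoyager paper on arXiv.", page)
```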
14:00 - 14:30 We'll give it a little guidance here. As you remember, today's LLMs haven't been trained on very recent data, so the model certainly doesn't have this paper baked into its parameters; it needs to use the web browser to see it. You can see that first it types in "WebVoyager paper arXiv" (it knows how to Google, good), and then it clicks number 27, this little paper here. Once it decides to move on from there, you can see it opens it up,
14:30 - 15:00 and you can follow along in the browser as it opens. Then it responds with the answer: the paper presents a study on building an end-to-end web agent with large multimodal models, an advancement built on GPT-4V-style models, and it "likely details" the architecture and training. Okay, so it's a little lazy; it says "likely details" without actually scrolling through, but the gist was there, so it was able to get there. Let's give it another question: maybe, explain today's XKCD comic for me, and why is it funny? I'm not sure how much of a sense of
15:00 - 15:30 humor these models have, but we'll see if it can explain it. It's starting from the same point we left off at before, because the browser is just sitting there. It knows to go back to the start, and then it searches for today's XKCD comic, finds the search bar, and then gets all these popups, but it's able to stay on task and ignore them for now. We'll see if it can click number
15:30 - 16:00 24. Which one is that? It actually takes me a while to find it; oh yeah, it just clicks the link there. This is on a slight delay, so you can see it actually did open it up, and what I like about this little display is that it shows what the page looks like to the agent as it goes through. It gets the final response: the comic shows an image making a curious observation about the greenhouse effect, with a timeline of two significant historical events, and talks about James Watt, who kicked off the
16:00 - 16:30 Industrial Revolution. The humor is in the early recognition of the seriousness of the problem: it's been a while, and we haven't made that much progress. A little bit of a depressing comic, but a classic for XKCD. Next, let's see if it knows how to find the latest blog posts from LangChain. Again, it may or may not know what LangChain is, but it knows how to use these tools because we've given it instructions. It starts at XKCD, looks up Google, and
16:30 - 17:00 we'll see what it does after that. These first couple of questions follow a similar pattern: go to Google, do a search, do a couple of clicks, relatively straightforward, so maybe we can see if it can do a more complicated task next. It goes through, opens this up, and it's able to read that the latest posts cover OpenGPTs, multi-agent workflows, and LangGraph. So look at that: it seems like we're actually getting the right stuff.
17:00 - 17:30 Again, this model is only taking in the image plus the instructions we're giving it, so it's kind of impressive that it's able to read all of this from the browser. Let's give it a more challenging one: search for a one-way flight from New York to Reykjavik for one adult, and analyze the price graph for the next two months. Lots of steps. If you remember, above we implicitly gave it only 150 steps to complete the task, so we'll see if it can finish in time. It's starting from the LangChain blog, and
17:30 - 18:00 you can see it goes back to Google there to start out. Next we'll see if the agent can use Google Maps. This is a little more challenging, since it's less of a strict, old-fashioned web page, and the agent actually has to navigate different elements within it. We'll ask what time I should leave to get to SFO by 7 o'clock, starting from downtown San Francisco, and then see what it does. It'll take a little bit of time, so again, it goes to Google, looks at Google
18:00 - 18:30 Maps, and we can even follow it in real time in the browser. Looks like it's opened it up; what is it going to do next? All right, for our last one we'll ask it something a little bit different:
18:30 - 19:00 we'll see if it can navigate Google Maps. So, what time should I leave to get to SFO by 7:00, starting from downtown San Francisco? This one will probably take a little bit of time, and we'll see how it does. It starts by searching for "SF downtown," so it doesn't even start by going to Google Maps directly; not sure where it's going, actually. Oh, there it goes: Google Maps. It got there, and it looks like it's assuming 7:00 a.m., because I haven't
19:00 - 19:30 passed in the current time, so that's another thing I could probably improve. It X'es out of a dialog there. A little interesting: this is what it's seeing here
19:30 - 20:00 as it goes through and clicks a few things. Looks like the driver had an error here, but it's going to try to work through it, and it's taking a little bit. One thing we can do now, and this is as good a time as any, is to look
20:00 - 20:30 at all these operations in LangSmith.
20:00 - 20:30 You have this whole record that you can share with your team to go and debug exactly the steps and inputs that led to each situation. You can look at all these LLM calls, which correspond to those agent calls, to see the series of steps it took. So again, we asked it the question; you can check here, and this is what was fed in, and then you can see what it thinks: fine, I need to search for this information
20:30 - 21:00 on Google Maps first, so I'm going to type "SF downtown." It goes through and makes some decisions. You can check things like "Click 10" here. You'll notice that we're only passing in the most recent screenshot; we're not passing previous ones to show the earlier observations. Maybe that's something you could try later to see if it improves things. You can go to some of these decisions, and it looks like it's off here: it says, you asked for this, I should initially search for directions on the map, and then it outputs
21:00 - 21:30 "Click 0," which is a little bit interesting. One of the things I like to do here is open it up in the Playground and test whether updated instructions would fix it, so that it's not trying to click box 0 when it should be typing into box 0. So maybe I can go in and change some of these instructions, and under web browsing say, "remember to
21:30 - 22:00 type when needed, as this will also select and submit the request for you," or something like that. Again, I am not an expert prompt engineer, so we'll see how that goes. Once you have that (I've already added my API key), you can hit Start; GPT-4V takes a little bit of time, but then you can start
22:00 - 22:30 seeing what it looks like. And yes: the search query here is already entered, so you can make some updates and keep changing things around; maybe you'd note where it clicked, etc. I think one other issue is that the bounding boxes aren't quite correct, so you could improve that affordance layer too. But again, all of this lets you see the exact steps that were taken and then play around with improvements in the UI, without initially having to change any of your code.
22:30 - 23:00 You can step through each of these runs and see exactly what was done. Going back here, it seems like it's getting a little bit closer, but it ran out of steps. This task is a little too challenging for the agent with the existing setup, so we'd have to make some improvements if we wanted it to handle something like this. Anyway, I think that's enough for today. Hopefully I've shown you a couple of things. You've learned how to build something that looks like the WebVoyager paper, which I think is a pretty cool set of ideas, and I've
23:00 - 23:30 taught you how to build something with LangGraph, which lets you write regular Python code (or JavaScript as well) to compose everything together in a nice little graph that you can then build on and improve, balancing some of that control with the more magical AI type of experience. And you've also got something that can go and browse the internet for you. So again, if you want access to LangSmith or you want to try building this out, you can check out the LangGraph repo to see the examples we've shared, or
23:30 - 24:00 you can sign up for LangSmith at smith.langchain.com. And finally, don't forget to subscribe to the LangChain channel. It will help you stay on top of the most recent AI advancements and give you nice tutorials such as this one on building different types of agents, RAG, evaluations, and all those types of things. We'll make sure to keep the content fresh and up to date. Thanks again, and until next time.