Xingyao Wang - OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
Estimated read time: 1:20
Summary
The presentation by Xingyao Wang delves into the principles and architecture behind OpenDevin, a platform designed to empower AI software developers as generalist agents. It emphasizes the platform's ability to handle software development tasks by utilizing tools such as web browsers, terminals, and file editors. By integrating these tools into an AI framework, OpenDevin aims to automate various software engineering tasks and establish a foundation for digital transformation. This open-source project encourages community collaboration, fostering an environment where everyone can contribute to the development of AI-driven software solutions.
Highlights
Xingyao Wang introduces OpenDevin, focusing on empowering developers through AI tools.
The project emphasizes automation in software engineering to revolutionize digital systems.
OpenDevin is an open-source initiative, inviting community participation and collaboration.
The platform's agents use runtime sandboxing for secure and efficient task execution.
It's highlighted how these AI agents can detect errors and self-correct during software tasks.
Key Takeaways
OpenDevin empowers AI software developers to become generalist agents by providing essential tools like web browsers, terminals, and file editors.
The platform can potentially automate all software engineering tasks, transforming digital interactions.
Community collaboration is a cornerstone of OpenDevin, encouraging open-source development and innovation.
The ability to self-correct and adapt makes these AI agents highly efficient for software development tasks.
OpenDevin's architecture includes innovative features like runtime sandboxing to protect and streamline operations.
Overview
In this presentation, Xingyao Wang delves into OpenDevin, a platform aimed at making AI software developers into generalist agents. The talk highlights how, by integrating AI with fundamental development tools like terminals and browsers, OpenDevin sets the stage for automating a myriad of software tasks.
OpenDevin is not just a platform but a vision to bring AI-driven solutions into mainstream software development. By leveraging the collective power of the developer community, this open-source project pushes the boundaries of what AI can do. It's an invitation to coders everywhere to step into a new era where machines and humans collaborate seamlessly to achieve incredible feats in digital environments.
Adding a fascinating layer to the technical discussion is the system's ability to self-correct. OpenDevin's architecture allows AI agents to recognize and amend mistakes autonomously, a critical feature for maintaining efficiency and relevance in complex software environments. This underlines the platform's potential in not only easing the workload of developers but also ensuring higher accuracy and reliability in their work.
Chapters
00:00 - 01:00: Introduction and Speaker Background The chapter opens by welcoming attendees and sets the stage for a discussion with a special guest. The host briefly apologizes for an earlier mispronunciation of the guest's name and introduces him as Xingyao Wang, co-founder and Chief AI Officer at All Hands AI. Wang is noted for his work on building open-source AI agents for software developers. Prior to his current role, he was pursuing a PhD in computer science.
01:00 - 10:30: OpenDevin Platform Overview This chapter covers the academic and professional background of the speaker, who builds interactive language agents powered by LLMs and VLMs. Associated with the University of Illinois and the University of Michigan, he has published research at leading conferences such as ICLR and ICML, received an outstanding paper award in 2024, and worked at Google Research, underscoring his credentials and influence in the field.
10:30 - 23:00: Architecture and Agent Details In this chapter, the speaker introduces himself, mentioning his experience at Microsoft and ByteDance as both an intern and a full-time employee. He is happy to be invited to speak and share insights about the OpenHands framework. The chapter then delves into the details of the framework's architecture and its agents.
23:00 - 31:00: Evaluation and Benchmarks In this chapter, the speaker introduces the open-source repository, previously known as OpenDevin and now called OpenHands. The project was inspired by Cognition's demo of Devin, the first AI software engineer. OpenHands serves as a platform for AI software developers as generalist agents, giving them the essential tools a human developer would have: a web browser, a terminal, and a file system. The goal is to enhance and facilitate the work of AI software developers.
31:00 - 43:30: Future Directions and Community The chapter titled 'Future Directions and Community' discusses the potential future advancements and applications of AI in the field of software engineering. The focus is on providing AI agents with a fundamental set of tools or actions that can theoretically enable them to perform almost all tasks associated with software engineering. The chapter suggests that, much like a programmer uses a limited set of essential tools to complete their work, AI too can be optimized to use a core set of capabilities to achieve similar results. This could pave the way for more efficient and autonomous software development processes.
43:30 - 58:20: Q&A Session The chapter discusses the overarching influence and importance of software engineering in the contemporary world. It implies that solving software engineering challenges can lead to advances in digital agents and improve many aspects of human-technology interaction, especially in the digital realm. The speaker explains how this motivates the technical architecture underpinning OpenHands, before offering to demonstrate its operation through a video.
58:20 - 60:00: Wrap-up and Closing Remarks In the final chapter, the speaker demonstrates the functionality and user interface of the tool, which lets users prompt it with natural-language instructions to write scripts, such as one that displays the top stories on Hacker News, and shows the agent interacting with a file system. It also features a VS Code plugin that opens files in Visual Studio Code with a single button click. The traces show the agent noticing its own mistakes and responding to them.
Xingyao Wang - OpenDevin: An Open Platform for AI Software Developers as Generalist Agents Transcription
00:00 - 00:30 Okay, so I think people are going to join in, so I'll give you a quick intro and then we'll get started. Welcome everyone. Our guest today is, and I'm not going to mispronounce your name again, Xingyao Wang. He's the co-founder and Chief AI Officer at All Hands AI, where he's building open-source AI agents for software developers. Before joining All Hands AI he was a PhD candidate in computer
00:30 - 01:00 science at the University of Illinois. His research interest is in building interactive language agents powered by LLMs and VLMs, and his research has been published in top venues like ICLR and ICML. He received an outstanding paper award in 2024, congrats for that. He received his BS in computer science and data science from the University of Michigan, and he has also worked at Google
01:00 - 01:30 Research, Microsoft, and ByteDance as an intern and then full-time. Welcome to the Cohere For AI community; we can get started. Thanks for accepting our invite. Yeah, thanks, and I'm really happy to be invited to speak here and share how the entire OpenHands framework works. So maybe I'll get started.
01:30 - 02:00 The paper, or technically the open-source repository, I'm presenting today is called OpenHands, formerly OpenDevin. The repo started when Cognition released their demo of Devin, the first AI software engineer, and we are an open-source platform for AI software developers as generalist agents. The fundamental motivation is this: say you are a human developer, and I give you some tools like a web browser, a terminal, a file system, and
02:00 - 02:30 a text or program editor. What could you do? If I reflect on my everyday work as a programmer myself, almost all of it is performed using these three tools. So we think that by giving agents these fundamental tools, this fundamental action space, they should in theory be able to perform almost all software engineering tasks.
02:30 - 03:00 And our world is written in software now, so if you can solve software engineering, that possibly means you can solve digital agents and automate almost all human tasks, at least in the digital world. That was the initial motivation, and it shapes the technical architecture that forms the pillar of OpenHands. Before I dive into the details, I can share a video of how OpenHands
03:00 - 03:30 works today. I recorded this one yesterday. We have a front-end UI, and you can simply prompt it with a natural-language instruction telling it to write a bash script that displays the top stories on Hacker News. You'll see the agent writing files to the file system. We also have a VS Code plugin, where you can click a button and be redirected to a browser-based VS Code. And if you see the traces here, the agent itself will notice it's
03:30 - 04:00 making some mistake, is able to self-correct, and eventually finishes the task. We also have a couple of examples about browsing. This is an earlier example we recorded: a really simple browsing task, which is to go to our GitHub page and report how many stars our repo currently has. And
04:00 - 04:30 in this interface you can see there's a browser view; the agent is browsing the website and reading the content to find how many stars it has. So the goal of OpenHands is to create a platform that lets users apply these agents to modeling and manipulating complex systems. Basically, if you want to use AI to automate your software engineering tasks, we offer you a way to do it. We
04:30 - 05:00 also want to build infrastructure that enables AI to perform software development in the broadest sense. Almost all of our work in the digital world involves some level of programming, and we hope OpenHands will be the infrastructure to support that: general enough to be helpful for the broadest set of tasks we perform today. Finally, and most importantly, we want to do it all in an open, collaborative way, in contrast to many
05:00 - 05:30 closed-source alternatives. That's why we call our company All Hands AI: we want everybody to help build it and contribute to its future. So now let me talk about the platform. The main architecture of OpenHands consists of three primary components. First, we have an agent abstraction, where the community can contribute different implementations of agents. So what is an agent? We consider an agent something that takes in an
05:30 - 06:00 event history, which is the past actions and observations, and then outputs an action. That's all the agent needs to do, and that's all an agent developer needs to implement inside OpenHands: they just take care of how to convert a history into an action. The second component is the event stream. The event stream is the piece of software architecture that lets us track the history of actions and observations; it's simply a stream of messages.
06:00 - 06:30 For example, the user sends a message through the user interface, and that message gets pushed into the event stream with source "user". The message could be something like "can you create a list of numbers from one to 10 and create a web page to display them?" The agent then starts acting on this instruction: it could create a file, and once the file is created, start writing Python code,
06:30 - 07:00 see the observation from running that Python code, and then try to run a server. Finally, the most important part, and I think the most challenging part in developing OpenHands, is the agent runtime. It's the complement to the agent component: the agent takes a history and produces an action, whereas the agent runtime takes an action, executes it, and produces an observation based on that action.
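Put together, the three pieces form a simple loop. Here is a minimal sketch of that control loop in Python; the class names, method names, and event shape are illustrative stand-ins, not the real OpenHands API:

```python
# Minimal sketch of the OpenHands-style control loop described above.
# All class and method names here are illustrative, not the real API.
from dataclasses import dataclass, field

@dataclass
class Event:
    source: str   # "user", "agent", or "runtime"
    content: str  # message text, action, or observation

@dataclass
class EventStream:
    events: list[Event] = field(default_factory=list)

    def push(self, event: Event) -> None:
        self.events.append(event)

def run_task(agent, runtime, stream: EventStream, user_message: str, max_steps: int = 20):
    # The user's instruction enters the stream first.
    stream.push(Event(source="user", content=user_message))
    for _ in range(max_steps):
        # Agent: full event history in, one action out.
        action = agent.step(stream.events)
        stream.push(Event(source="agent", content=action))
        if action == "finish":
            break
        # Runtime: action in, observation out, appended for the next step.
        observation = runtime.execute(action)
        stream.push(Event(source="runtime", content=observation))
```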
07:00 - 07:30 Our implementation uses a Docker sandbox, and I'll talk about that in detail in the coming slides. Any questions so far? I'm happy to take questions during the presentation if you have any. Okay, then I'll talk about runtime sandboxing now. One of the biggest motivations for runtime sandboxing is that we want to do really interesting things: we want the agent to
07:30 - 08:00 be able to run its own commands and just do things, without us worrying about it destroying our own computer. Otherwise there are cases, which we have observed, where in order to solve a task the agent will stumble and try to delete an entire directory on your computer, which is not a desirable thing to happen on your local machine. So we definitely need sandboxing here. What OpenHands does is execute every action in a Docker sandbox, what we call the runtime, to isolate the environment. And
08:00 - 08:30 OpenHands supports running agents on arbitrary Docker images. Essentially, we maintain an action execution API that runs inside the Docker sandbox, and given an arbitrary Docker image, OpenHands rebuilds it into an OpenHands-compatible image by installing this action execution API.
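As a rough sketch of what such an action-execution service could look like, here is a toy HTTP server that would run inside the container, execute a shell command, and return the observation. The endpoint, payload shape, and port are made up for illustration; the real OpenHands server is considerably more involved:

```python
# Toy action-execution server of the kind installed into the sandbox image.
# Endpoint name and payload shape are illustrative only.
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class ActionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        # Execute the requested shell command inside the container.
        result = subprocess.run(
            body["command"], shell=True, capture_output=True, text=True, timeout=60
        )
        observation = {
            "exit_code": result.returncode,
            "output": result.stdout + result.stderr,
        }
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(observation).encode())

if __name__ == "__main__":
    # The OpenHands client outside the container would POST actions here.
    HTTPServer(("0.0.0.0", 8000), ActionHandler).serve_forever()
```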
08:30 - 09:00 So if you are a Python developer, you can bring your own Python base image, OpenHands will install the action executor inside it, and you can let the agent work in whatever environment you want. That was the most challenging part, in my opinion. As for action spaces: as we talked about earlier, we as humans have three primary tools. We have a terminal with access to the file system, we have a web browser, and we have a file editor. In OpenHands we allow the agent to interact with the
09:00 - 09:30 terminal and file system through, for example, a command-run action, and we also have an IPython run action. These two are similar to how you interact with the terminal on your computer, or with an IPython notebook in your browser. I'll talk about the browser in a coming slide. Each of these actions has a corresponding observation built from the output of the command
09:30 - 10:00 execution. Oh, sorry, there is a typo in the title here; this slide is about the file editor. For code editing, we currently have two primary ways of modifying files, and right now we rely mainly on the second one: we give the agent a tool that edits a file by replacing a specific string with another string. That's it. The agent can look at a file by itself and replace some part of the string, for example to update the content of a file.
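A minimal sketch of a string-replacement editor of this kind; the function name and error messages are hypothetical, and the real OpenHands editor tool adds more validation:

```python
# Toy string-replacement file editor, as described above.
# A real editor tool would add path checks, diffs, and undo support.
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> str:
    text = Path(path).read_text()
    count = text.count(old)
    if count == 0:
        return f"ERROR: string not found in {path}"
    if count > 1:
        # Refusing ambiguous edits forces the agent to pick a unique anchor.
        return f"ERROR: string occurs {count} times in {path}; be more specific"
    Path(path).write_text(text.replace(old, new))
    return f"OK: edited {path}"
```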
10:00 - 10:30 And finally, for web environments, we allow the agent to interact with browsers through a domain-specific language. We basically have a browser action that lets the agent perform web-browsing operations, for example navigating, clicking, scrolling, and entering text. Here's a screenshot
10:30 - 11:00 of a subset of the functions the agent can use: the agent can scroll through web pages, select certain buttons or checkboxes, and fill in forms based on their element IDs. The web functionality is primarily implemented through BrowserGym, another open-source library specialized for browsing agents.
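The actions in such a browsing DSL read like small function calls emitted as strings. A toy dispatcher, assuming a Playwright-style `page` object and a made-up element-id selector scheme, might look like this:

```python
# Toy dispatcher for BrowserGym-style browsing actions. The real DSL is
# richer; element ids come from the accessibility tree shown to the agent.
def execute_browser_action(page, action: str) -> None:
    # The agent emits actions as strings, e.g. 'click("a51")'.
    name, _, raw_args = action.partition("(")
    args = [a.strip().strip('"') for a in raw_args.rstrip(")").split(",") if a.strip()]
    if name == "goto":
        page.goto(args[0])                             # navigate to a URL
    elif name == "click":
        page.click(f'[data-bid="{args[0]}"]')          # click element by id
    elif name == "fill":
        page.fill(f'[data-bid="{args[0]}"]', args[1])  # type into a form field
    elif name == "scroll":
        page.mouse.wheel(0, int(args[0]))              # scroll vertically
    else:
        raise ValueError(f"unknown action: {name}")
```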
11:00 - 11:30 BrowserGym lets us render HTML and PDFs as screenshots, and it can summarize HTML as a text accessibility tree. It also supports viewport filtering, which basically means you can filter a web page down to only the content you are interested in. Before we dive into the interface, any questions so far about the architecture side? Yeah, I had a quick question: what about when it's two-way? For example, you fetch data from a database, and
11:30 - 12:00 that's the action, and it has a dependency for a follow-on thing, say "open the URL column value" in the next action. How does that get played out in your event stream and runtime setup? Yeah, definitely, that's a very interesting question. I do think our architecture can currently support that. For example, if you're connecting to your database, there are two ways the agent can do it. The agent can use the file editor to create a new file, and in that new
12:00 - 12:30 file, say it's SQLite, it can just import sqlite3 and create a database connection by generating Python code, and then the agent will run that script. The action to run the script gets pushed into the event stream, and that gets picked up by the agent runtime. Let me just go back to this
12:30 - 13:00 slide. For example, say the task is "connect to a database": the agent would write the file, execute the file, the execution request would be picked up by the agent runtime, and you'd end up with a command-run observation like this one. This is actually the exact type of example you were asking for. You might see something like "you have started a development server that's
13:00 - 13:30 running at this HTTP address", and what the agent can do is simply issue a browser-interaction action that goes to that page, and it will retrieve the HTML observation for you. This observation is a shortened version of the real observation the agent sees, just for illustration purposes. Does that make sense? Yeah, can I ask a follow-up? The context that I'm coming from is execution depending on the
13:30 - 14:00 output of a previous execution, and it's related to the concept of MCP and how they're trying to make context available to other agents from outputs. So is the point that the browser action can have some file it depends on, and that file is created by a previous action? Yeah, so basically it comes back to the definition of the agent: the
14:00 - 14:30 agent has access to the entire event history, basically everything it has done before. So when the agent is running a browser interaction, it can look at all the previous things, all the previous context, and decide by itself what it wants to do next. I think what I really like about OpenHands is that we try to make as few assumptions as possible in terms of agent design. We hope to
14:30 - 15:00 let the agents do all the work, instead of us pointing them in different directions. Any other questions? Thank you. No? Then I can go to the interface. As you all just saw through the UI, we have an interface like this; this is a somewhat outdated screenshot, since the open-source library evolves extremely
15:00 - 15:30 fast. I think this is a screenshot from a few weeks ago, and we just recently shipped a brand-new UI. We also have a hosted version, currently available in beta; feel free to go to our website if you want to join the waitlist, or you can just use the open-source version from our GitHub. Next I'll talk about agents. We briefly touched on how we implement agents, and here is a more concrete example: a minimal agent requires you to write a reset
15:30 - 16:00 function as well as a step function that takes a state. The state is just the action history: all the prior actions, observations, and messages. Once you have all those messages, you throw them at the LLM, you get a response, and then you decide how to parse that response. You can parse it into, for example, a command-run action, an IPython run action, or a browser-interaction action.
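A minimal sketch of such an agent, using the `litellm` package for the model call. The class shape, the state as chat-message dictionaries, the model name, and the `<execute_bash>`/`<execute_ipython>` markers are simplified illustrations of the pattern he describes, not the exact OpenHands code:

```python
# Minimal agent sketch: reset(), then step(state) -> action.
# Parsing here is deliberately naive; real agents use structured prompts.
import litellm

class MinimalAgent:
    def reset(self) -> None:
        self.system_prompt = "You are a software agent. Reply with one action."

    def step(self, state: list[dict]) -> dict:
        # state is the event history, formatted as chat messages.
        messages = [{"role": "system", "content": self.system_prompt}] + state
        response = litellm.completion(model="gpt-4o", messages=messages)
        text = response.choices[0].message.content
        # Route the response to an action type based on simple markers.
        if text.startswith("<execute_bash>"):
            return {"type": "cmd_run", "command": text.removeprefix("<execute_bash>")}
        if text.startswith("<execute_ipython>"):
            return {"type": "ipython_run", "code": text.removeprefix("<execute_ipython>")}
        return {"type": "message", "content": text}
```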
16:00 - 16:30 We interface with LLMs through LiteLLM, which is another open-source package that standardizes LLM calls across different providers; it basically lets us support essentially every LLM in the world, some of which work well and some of which don't. We also have an open-source agent hub where people can contribute their different agent implementations if they want. The default agent in OpenHands is the CodeAct agent, the default and strongest coding agent in OpenHands. The
16:30 - 17:00 gist of this method is that it lets the agent run arbitrary actions through Python or bash code. Before this, most agent tools required you to define a list of tools as JSON, and then you would ask the agent, or the LLM, to call those tools by emitting JSON. But the problem
17:00 - 17:30 there is that if you want to do more things, the tool list keeps expanding. For example, if I just want to look up the weather today, all I need is a weather API; but the next day I suddenly want to read an Excel file, so I also need to add an Excel tool. The more you ask of the agent, the more quickly the list of tools grows. The idea of CodeAct is this: we
17:30 - 18:00 humans, as software engineers and developers, have already built 99% of the tools out there, so why reinvent all those tools by defining them over again? Instead we say: here is a Python interpreter, here is a bash shell, go ahead and import whatever package you need, install whatever package you need, because these models were trained on the web corpus, which is basically a condensed version of all existing human knowledge. For example,
18:00 - 18:30 if you ask it to read an Excel file, it can install pandas, read the Excel file with that library, and do all the post-processing. That way we save a lot of API definitions: it spares human developers the hassle of manually defining lots of APIs, and it lets the agent potentially do far more things. The
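The contrast with JSON tool calling can be made concrete. In a CodeAct-style loop, the runtime simply extracts and executes whatever code the model writes; the tag format and helper below are illustrative, and in OpenHands the code would run inside the Docker sandbox rather than locally:

```python
# CodeAct-style execution sketch: the model's "tool" is code itself.
# The <execute_ipython> tag mirrors the style described above; details vary.
import re
import subprocess
import sys
import tempfile

def run_codeact_response(llm_text: str) -> str:
    match = re.search(r"<execute_ipython>(.*?)</execute_ipython>", llm_text, re.S)
    if match is None:
        return llm_text  # plain message, nothing to execute
    code = match.group(1)
    # In OpenHands this runs inside the Docker sandbox; here we run locally.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    result = subprocess.run([sys.executable, f.name], capture_output=True, text=True)
    return result.stdout + result.stderr

# e.g. the model can write `pip install pandas`, then read an Excel file,
# rather than requiring a purpose-built "excel_tool" to be defined upfront.
```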
18:30 - 19:00 third point about OpenHands agents is that we have a micro-agent architecture that lets you customize a generalist agent like CodeAct with certain local context. For example, in our daily development workflow GitHub is obviously very important, because ours is an open-source project on GitHub, but the agent might not have a good sense of how it should interact with GitHub: it's pre-trained on so much knowledge, and there are so many ways to interact with GitHub. You can use the GitHub command line, you can use the GitHub
19:00 - 19:30 web API, so there's a lot you can do. What we implemented is essentially that you can write a markdown file and set a few triggers, and when one of the triggers is fired by a user message, say "hey, go to GitHub and pull me something", we automatically inject the relevant context into the agent's observation, so the agent sees a relevant guide for how it should interact with GitHub.
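A sketch of that trigger mechanism: a markdown micro-agent with frontmatter triggers, whose body is injected whenever a keyword matches the user message. The file format and parsing below are a plausible simplification, not the exact OpenHands micro-agent schema:

```python
# Micro-agent sketch: keyword-triggered context injection from markdown.
# Frontmatter parsing is simplified; the real format may differ.
MICROAGENT_MD = """---
triggers: [github, pull request]
---
When interacting with GitHub, prefer the `gh` CLI over the web API.
Always create a branch before committing changes.
"""

def load_microagent(md: str) -> tuple[list[str], str]:
    _, frontmatter, body = md.split("---", 2)
    triggers_line = frontmatter.strip().removeprefix("triggers:").strip()
    triggers = [t.strip() for t in triggers_line.strip("[]").split(",")]
    return triggers, body.strip()

def inject_context(user_message: str, microagents: list[tuple[list[str], str]]) -> str:
    extra = [body for triggers, body in microagents
             if any(t in user_message.lower() for t in triggers)]
    return "\n\n".join(extra)  # appended to the agent's observation

triggers, body = load_microagent(MICROAGENT_MD)
print(inject_context("hey, go to GitHub and pull me something", [(triggers, body)]))
```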
19:30 - 20:00 This is user-defined and quite flexible. Any questions before evaluation? No? Then let's go to evaluation. I think this is the part where we poured most of our effort in the paper. Why do we need an eval harness,
20:00 - 20:30 and why do we include so many benchmarks? We recognize that if you want to develop a general software agent, it should not excel only at code editing; it also needs to be able to perform web browsing and a bunch of auxiliary tasks, for example answering questions about a code repository or conducting online research. One use case I run into when using OpenHands is: I'm thinking about implementing a new feature but I'm not sure how to get
20:30 - 21:00 started, so I just ask it to browse all the files and summarize how things work now, going back and forth with it. So it's not just about solving GitHub issues, and that's why we wanted to incorporate as broad a list of evaluations as possible. The three main categories we evaluate agents on are software engineering, web browsing, and miscellaneous assistance. For software engineering we have benchmarks like SWE-bench, which fixes GitHub issues, as well as,
21:00 - 21:30 for example, machine learning coding and bioinformatics coding. For web browsing, the agent basically navigates a realistic website to browse, gather information, and interact with the site. Finally, miscellaneous assistance is more like traditional LLM benchmarks, where you need to answer a question through multi-turn interaction or by using a tool to browse the web and gather information. The highlight for
21:30 - 22:00 software engineering is that we currently evaluate primarily on SWE-bench. SWE-bench is essentially a GitHub issue-fixing benchmark: the input to the agent is a GitHub issue describing the problem a developer is having, plus the codebase. The agent needs to take in both and interact with the codebase: it can edit code, run code, and run existing unit tests in order to fix the issue. Eventually we
22:00 - 22:30 score the agent's output using unit tests. These unit tests were written by humans to validate whether the issue has been successfully fixed, so this is very representative of the everyday work human developers do. The second category is web browsing, which we mainly evaluate on WebArena. WebArena hosts a bunch of self-contained, fully functional web applications: Reddit,
22:30 - 23:00 GitLab, a web shop, knowledge sources. Given a task, for example "tell me how much I spent on food purchases in March 2023", the agent needs to take actions and get observations from these realistic websites, and keep going until it can answer the question. Eventually we evaluate the success of the agent based on functional
23:00 - 23:30 correctness: for example, if I ask the agent to help me purchase something, at the end I just check the state of the website to see whether that item was really purchased. This lets us evaluate with better accuracy. The most representative category for miscellaneous assistance is the GAIA benchmark. Here is an example of one GAIA question, like the Excel example I used before. The instruction might be: the attached
23:30 - 24:00 Excel file contains sales of menu items for a local fast-food chain; answer this question about it. That's also something we ask OpenHands to do pretty frequently nowadays: I'll just have it go out on GitHub, dump me an Excel or CSV file, and then I can analyze it. Beyond these benchmarks, in the past couple of months we've been spending a lot of time building a parallel evaluation infrastructure. The challenge is
24:00 - 24:30 that because we use an agent runtime, and we run the runtime inside a Docker sandbox, you cannot easily parallelize, which makes evaluation very, very slow. For example, in the past, running a SWE-bench evaluation using OpenHands took about two days. What we did was implement cloud infrastructure on cloud providers, where you can start,
24:30 - 25:00 for example, a few hundred Docker images at the same time and evaluate in parallel. That reduced the time of an eval run from around two days to around 1.5 hours. This functionality is currently in closed beta, but feel free to join our Slack channel if you're interested in trying it, and we can provision an API key.
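A sketch of that kind of parallel harness: fan the benchmark instances out over many sandboxed containers at once. The image name, entrypoint, and instance ids below are placeholders, and the real remote runtime is a managed cloud service rather than local `docker run` calls:

```python
# Parallel evaluation sketch: one sandboxed container per benchmark instance.
# Image name and entrypoint are placeholders, not the real harness.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_instance(instance_id: str) -> tuple[str, bool]:
    # Each instance gets its own isolated container, so runs can't interfere.
    result = subprocess.run(
        ["docker", "run", "--rm", "my-eval-image",
         "python", "-m", "eval.run_one", "--instance", instance_id],
        capture_output=True, text=True, timeout=3600,
    )
    return instance_id, result.returncode == 0

instance_ids = [f"instance-{i}" for i in range(300)]
with ThreadPoolExecutor(max_workers=100) as pool:  # hundreds of sandboxes at once
    results = dict(pool.map(run_instance, instance_ids))
print(f"resolved {sum(results.values())}/{len(results)}")
```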
25:00 - 25:30 July so the same Coda agent without any modification to it system prompt it it demonstrat like competitive performance across three major task categories which are like software engineer um software engineer web browsing and resell system and then yeah that that that's that's basically the takeaway it's not the best every at everything it was not the best at everything but it's able to work on everything without significant modification which which is like one of like our private motivation we want to
25:30 - 26:00 build a real General agent based on software that that can work on any software and I can work on a lot of different tasks and then the most recent result we got is like is October I mean like a few month up paper publishing recently we get into like the best score on sbench verified and lighted for example right now open hands score about 53% which is like I think is like the first system AI system that surpress like 50% uh mark on sweet bench verified and and
26:00 - 26:30 it's open source and it's like open source in MIT license so that's a good thing and then we also have like some results about like different models like open source models versus like pring model so the Stace for now is like this this is from October so before we get this state of the art so but like the takeaway here is like prepar model currently still working works best like we have couple research collaboration going on on our community that aiming to bridge the gap here and making open
26:30 - 27:00 source more useful for well yeah and and and open hand is like not a static system it's not a static software engineer Library where it just go gone sale and everything that that that that's it we are continuously adding more and more exciting agent benchmarks into open handy bage Partners just in the last couple of weeks we have like science agent bench which evaluates langage agents on datadriven scientific
27:00 - 27:30 discovery we have ml B from OPI which allows agent to work on Caro competitions and soften and we also have a commit zero Benchmark just the pr just merged like two days ago or something it's it's basically a fromont scratch code generation challenge that it's like the you have all the test cases and the agent just need to implement all the functionality to pass all the test cases yeah any questions so far
27:30 - 28:00 Finally, I guess this is one of the last slides I have. We are really appreciative of our OpenHands community. OpenHands started with a tweet and a README shortly after Devin came out, and right now we are an open-source community of more than 200 developers who joined us in the last six months or so. We now have 37k stars, and we are one of the GitHub top 100 Python projects. We have a lot of researchers from many institutes, for example CMU and UIUC, and we
28:00 - 28:30 also have members of several companies who are excited about agents, excited about software-writing agents, and excited to use OpenHands and help improve it going forward. Any questions before I go into the development roadmap? So, there are several things we've been working on: on the agent quality side, on the agent capability side, as well as on the interface
28:30 - 29:00 side. On the quality side, the remaining problems are things like: how do we better localize files in a very large repository with tens of thousands of files, and how should the agent interact with the user when the user provides underspecified questions? If you provide a really vague instruction, the agent could just do whatever it thinks is right without ever asking you. And then: can we bridge the gap between
29:00 - 29:30 open-source and closed-source models by training models on agent trajectories? In terms of agent capability: how do we perform browsing better, and how do we support multimodality better? Right now you can simply paste screenshots into OpenHands and it will read them, but it would also be nice to point OpenHands at a local code repository whose README contains a link to an image, and have it read
29:30 - 30:00 those images. How do we enable that? In terms of interface: how can we further improve our UI to make it more intuitive for users and more useful as a product? And I guess this is the conclusion page. As we discussed, OpenHands serves three different user profiles. For agent users, we want to provide an easy-to-use interface for finishing coding and web tasks.
30:00 - 30:30 For agent researchers, we hope OpenHands provides the infrastructure, with sandboxing, evaluation, and a bunch of baselines, that allows you to do research quickly. And for open-source developers, OpenHands is an interesting project that helps deliver advanced AI capability to more users. We are an open project, so we encourage people to join our GitHub, join our Slack, or join
30:30 - 31:00 our Discord if you're interested in discussing or seeking collaboration on any of these ideas. I guess that's all for my presentation today; I'm happy to take any questions. Thanks for this great presentation. I'll open the floor for a few Q&A. If anyone has questions, feel free to unmute yourself and ask, or drop them in the chat and we can read
31:00 - 31:30 them. Maybe until then I can start with my question. I know you're using external LLMs for the main part, right, but in the sandbox, is there any implementation to prevent, for example, unauthorized access or malicious activities that the
31:30 - 32:00 model might attempt? Does it have any measures to prevent such unauthorized access in the sandbox? Yeah, that's a very good question. Currently, we allow the agent to run in different modes. We can run the agent as root, where it has almost every access inside the sandbox but not outside, or you can
32:00 - 32:30 give the agent an ordinary user account, where it can modify files and install Python packages but not system-level packages. For now we primarily let the agent do everything: it's root, so it can do everything inside the sandbox. But because of the sandbox mechanism, the folder provided by the user, for example your GitHub repository, is cloned into the sandbox, so even if the agent
32:30 - 33:00 messes up, it doesn't really break anything outside. And for the security guardrail, we have a security analyzer. This is in collaboration with Invariant, another startup focused on agent safety. We have a security analyzer with an action-confirmation mode: if you have that mode turned on, the security analyzer will analyze every action the agent initiates and evaluate the security risk of the
33:00 - 33:30 action. For example, if an action would remove the entire root directory, that is obviously a no, and we prompt the user: do you want to wipe out your computer or not? If the agent just executes some harmless bash command, like listing a folder, the security analyzer will simply let it go ahead and execute. Not sure if that answers your question. Yeah, that's a good answer, thanks.
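A toy version of such a confirmation gate: flag commands that match risky patterns and require explicit approval, while letting harmless ones through. The risk rules below are simple keyword heuristics for illustration, nothing like Invariant's actual analyzer:

```python
# Toy action-confirmation gate, as described above. Real analyzers use
# much richer policies than this pattern check.
import re

RISKY_PATTERNS = [
    r"rm\s+-rf\s+/",      # wipe the filesystem
    r"mkfs\.",            # format a disk
    r"curl[^|]*\|\s*sh",  # pipe a remote script into a shell
]

def confirm_action(command: str, ask_user) -> bool:
    if any(re.search(p, command) for p in RISKY_PATTERNS):
        # High risk: pause and require explicit human approval.
        return ask_user(f"Agent wants to run: {command!r}. Allow?")
    return True  # low risk: execute without interruption

assert confirm_action("ls -la", ask_user=lambda q: False) is True
assert confirm_action("rm -rf /", ask_user=lambda q: False) is False
```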
33:30 - 34:00 Any other questions? Sure, let me jump back in. For people starting out, OpenHands is probably a good one to check out, but I think I've heard of MetaGPT or something like that, just agent orchestration libraries. What are some of the simpler, most popular
34:00 - 34:30 ones? Yeah, I think MetaGPT and maybe AutoGen are also popular software libraries for agents, but most of the existing libraries are primarily focused on multi-agent setups. Personally, I think the most challenging part is a general runtime, and existing libraries have relatively limited support for an agent runtime. But it really depends on the task
34:30 - 35:00 you want it to work on. There are a lot of great libraries out there, and we actually include a related-work comparison table in the paper, where we compare OpenHands against different existing libraries. That's true. If you were to try and build sort of a browser-use example, where somebody has a set of tasks,
35:00 - 35:30 and I don't know if you'd want to use Puppeteer or something, how would you go about building it within the OpenHands framework? I would assume you'd want to set up a web-use agent, and then maybe some of the user interface is just piping in the chat and asking an LLM to generate a bigger instruction for the agent. Is that the general shape of how you'd tackle something like "find all the
35:30 - 36:00 prices on Amazon.com for store products"? Yeah, I think that could be the way to go. Actually, there are probably two ways. If you ask a general agent to find all the prices on Amazon, the agent could start writing a software program that crawls those websites. That could be one way to do it, and if it's able to get it done,
36:00 - 36:30 it's probably more efficient than having the LLM drive a browser, unless the agent gets blocked. On the other hand, if you just want an LLM-driven web-browsing agent, what you can do in OpenHands is essentially remove, for example, the Python server and the bash shell from the action space, and only let the agent interact with the browser. Then you can just say something like
36:30 - 37:00 "go to some URL and find me a list of prices on these web pages", and the agent will complete that task. But I still think that, for parsing a structured website, the first route could be useful as well. Oh, that's a good point, actually. If other people have questions, feel free to jump in; I'm just excited to have him here and don't want to use up
37:00 - 37:30 too much time. Okay, there's a question. Yeah, that's a great question: what's happening inside the model, and how do we understand how the agent makes decisions? I think the major challenge here is that this frontier capability right now only exists in frontier models, like those provided by OpenAI or Anthropic, and those models are
37:30 - 38:00 closed-source, which means we have little ability to run mechanistic interpretability methods on them. But we are really working on getting open-source models to that level, and maybe after that we'll be able to do more of this deep analysis into how the model actually behaves when the agent runs tasks. That's a good question. But because we are open source, we actually have the entire event stream, basically all the
38:00 - 38:30 actions and observations passed into the agent, available, and we also upload them to a Hugging Face space where we host our evaluation results as well as our SWE-bench submissions. So currently we mainly analyze the actions and observations themselves, rather than analyzing something that happens deep inside the model's forward pass. Not sure if that answers your question; feel free to speak up if you have
38:30 - 39:00 any follow-ups. Until then, I also have one question. I saw that one of the limitations would be integrating VLMs, which would otherwise leave some information missing from visual context, and also longer context, for example complex, huge code. So is there, and I don't know if you have implemented it
39:00 - 39:30 here or have plans for the future, any way of chunking this complex code to fit the context length, or of using external VLM sources to interpret visual inputs, things like that? Yeah, for the first question, visual input: we do support it right now. It's similar to how you interact with ChatGPT: you can simply take a screenshot, paste it into the chat box, and
39:30 - 40:00 type whatever you want, and the agent, or the underlying LLM if it supports vision input, will see your screenshot and continue working from it. This capability primarily exists in frontier models like Claude or GPT-4; for open-source models we currently don't have good support, and that's definitely something we want to work on. Eventually we want an open-source model that can do, for example, 60 to 80% of the things a frontier
40:00 - 40:30 model can do in terms of OpenHands features. The second question is about the context window, and that's actually a very good question, something we've been actively working on. One of the simpler solutions we've implemented is: if your message history gets too long, we truncate the earliest events. Say the context threshold is 128k tokens: if the history ever exceeds the context window,
40:30 - 41:00 we delete the earliest messages and truncate back under 128k tokens. We're also experimenting with smarter approaches, for example selectively summarizing some of the observations. If you install a Python package and pip install gives you a thousand lines of output that just tell you that you have successfully
41:00 - 41:30 installed the package, then in the follow-up turns what you can easily do is summarize that to one line, "you successfully installed this", and that's it. That's another thing we're actively working on. But in general, I think agent memory, both short-term memory and how the agent can interact with potential long-term memory, is still an active research area where a lot of people are making exciting progress. Got it. Any other
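The two context-management tricks he describes, truncation to a token budget and one-line summaries of long, low-information observations, can be sketched as follows. The token counting here is a crude character heuristic and the condensation rule is a toy example:

```python
# Sketch of the two context-management tricks described above:
# (1) collapse long, low-information observations to one line, and
# (2) drop the oldest events once past a token budget.
MAX_TOKENS = 128_000

def approx_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic: ~4 chars per token

def condense_observation(obs: str) -> str:
    # e.g. a successful thousand-line `pip install` log only needs one line.
    if "Successfully installed" in obs and len(obs.splitlines()) > 50:
        return "[summarized] package installed successfully"
    return obs

def truncate_history(events: list[str]) -> list[str]:
    events = [condense_observation(e) for e in events]
    # Drop the earliest events until the history fits the budget again.
    while sum(approx_tokens(e) for e in events) > MAX_TOKENS and len(events) > 1:
        events.pop(0)
    return events
```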
41:30 - 42:00 questions? So, back to my browser-use kind of question. The ideal use I've found, and I'm trying to get a better understanding of it, is to just log in everywhere and update my home address, for example. If I was to think through an event stream, it would
42:00 - 42:30 be: get a list of accounts or websites, and maybe a password that's stored in a CSV or something, then open the URL, then find settings. Would the "find settings" instruction hit the settings page, or authenticate if it's not authenticated? How do you dispatch two different kinds of instruction sets?
42:30 - 43:00 Would the agent open a page and then respond back? Yeah, I think that would be the ideal workflow. You could tell it "here is a list of URLs, go out there and update my home address to XYZ", and what would happen is the agent goes to a website, tries to find the settings, clicks the settings, observes the settings page, and sees that you are not logged in. The ideal pattern would be that the
43:00 - 43:30 agent comes back to you and says "hey, you're unauthenticated, can I have your password?" or something. Interesting. Is there a way to instrument the authenticate flow, for example as an agent action that I fine-tune or train? Sorry, what do you mean by authenticate flow? Are you asking whether we can train a specialized model just for the authentication process? More
43:30 - 44:00 like, I suspect, between captchas and needing to ask for a password and things like that, is there a way to deploy a browser-acting agent that I have fine-tuned by showing it how to log in on, say, 50 websites, and then that's the action it can invoke: browser use, but a specific one that's been trained? Yeah, I think that's definitely possible and doable. I even suspect that if
44:00 - 44:30 you just prompt an off-the-shelf model like Claude, it's already able to do this type of login flow. But I think the trick here is that you want to separate the browsing sessions. Basically, you divide the task: if you want to visit 50 websites and change your home address on those 50 websites, you could dispatch 50 instances of a browsing agent, and each browsing agent only needs to update the address on
44:30 - 45:00 one website. That makes the problem slightly easier for the agent, and I think that would just work. If you were to compare this to the computer-use version, which basically takes screenshots and asks Claude, and after it gets to a certain stage Claude gives instructions, what would be the key difference here? Is the multi-turn user-agent interaction
45:00 - 45:30 the missing piece? Yeah, I think on a high level, if you're not considering any technical detail, we're actually pursuing the same thing: the user just needs to say anything and it does everything for you. That's the goal. And by the way, we're also adding support for computer use into this browsing capability. From a technical perspective, I think the main difference is that we
45:30 - 46:00 are currently using a text-only accessibility tree, which is basically a simplified version of the HTML, where the agent essentially reads the source code of the website, whereas computer use from Anthropic directly reads the screenshot. Going forward, we do feel that reading the screenshot itself can be a more general approach, because the web is really messy, and when parsing HTML you get a lot of weird things. That is not
46:00 - 46:30 entirely the agent's fault, but it's weird for agents. So yeah, we're also moving in that direction; hopefully we can incorporate that in the coming weeks. You almost need a browser like a neural engine that represents the neural effects of the code but doesn't really spit out pixels, so you can easily scope into the action. This is exciting stuff, man, thank you for answering my questions. Yeah, thank you for your
46:30 - 47:00 questions. We got one question in the chat. Oh yeah, let me read it: wouldn't it be highly expensive to run full rounds of the benchmark? Yeah, so I think one of the limitations right now is that OpenHands currently only supports frontier models well, because they really have the frontier capabilities. But as we've seen in the past few years,
47:00 - 47:30 the cost of LLMs has gone down drastically; right now it's a hundred times or more cheaper than it was before, so we're actually pretty optimistic about cost if you're relying on frontier model developers to keep optimizing. On the other hand, we do have an open-source research collaboration working on fine-tuning an open-source model that can power OpenHands, and we're making good progress there; hopefully
47:30 - 48:00 we'll be able to share something by the end of this year, so that everyone can run the smaller model on their laptop. It's probably not as good as a frontier model, but it will be able to support simpler tasks. And the second part, about requiring a lot of clusters: I'm not sure if you're referring to the remote runtime, but the remote runtime is only required when we are developing agents and running large-scale evaluations,
48:00 - 48:30 to make sure that, for example, the changes we're introducing into agents make the system perform better. If you're just using it locally, you can simply go to our repo and pull the Docker images, and it can run on your own laptop without requiring a huge cloud infrastructure setup. You can also try the hosted version; it's as simple as going to a URL and starting to interact with the
48:30 - 49:00 agents. Not sure if that answers your question. Yeah, thanks for the question. Another question: do you think computer use can be an agent like this, when you have screenshots? Yeah, so right now that functionality is part of the computer-use capability from Anthropic's Claude,
49:00 - 49:30 and the Anthropic computer use is more about the agent implementation: basically, when I take in actions and observations, I choose to implement my agent using the computer-use functionality, something like that. Not sure if that makes sense.
49:30 - 50:00 Gotcha. Okay, awesome. So maybe as the last round of questions, I want to ask you a few things. How did the OpenDevin project get started? Because I remember Graham Neubig from CMU sharing about it, and then it quickly got to a point where it was this big thing comparable to Devin, which nobody expected, and then some people even dropped out of their PhDs to pursue it full-time. So how was the
50:00 - 50:30 process of getting there, if you can tell us about it? And how was turning it into a startup, maybe as a concluding point? Yeah, that's an exciting journey, an exciting ride, and by the way, I'm the one who dropped out of the PhD to join OpenHands full-time. Anyway, I think it all started with a tweet and a README after Devin came out. People were really excited about the capability Devin was showing, and because someone had really shown the demo, people realized it's not
50:30 - 51:00 impossible to build it now, and the entire open-source community was super excited about that. Once that tweet and README came out, the repo, with only one README file, instantly got a thousand GitHub stars, and everybody was super excited. I think Graham was among the first to say: okay, we should put up a really bad-looking, not-quite-working front-end interface, just mimicking Devin, and put it out there. And then the amazing thing just
51:00 - 51:30 happened with the entire open-source community. There was no central coordinator of any sort; everybody just started working on their own, saying "I want this feature to be added", and people just started contributing code through PRs. The first month or so was very chaotic: you had PRs opening and closing everywhere on GitHub, and sometimes even
51:30 - 52:00 two people submitting two PRs that tried to implement the exact same feature. That was the first, chaotic month. Then Robert, who is currently our CTO, joined in, and he mainly focused on making the architecture of OpenHands, which was called OpenDevin at that time, usable, and on building the first agent, so the system could run end to end: the front end could send requests to the back end, and the back end could connect to the LLM,
52:00 - 52:30 dispatch the agent, and have it work on some really toy tasks. Then I jumped in and brought my past work, CodeAct, asking: can I create an agent from my past experience and integrate it into OpenHands? One of the big motivations, one thing I realized when I was doing research at the university, is that to do agent research with real, useful impact, you need to write a lot, a lot of infrastructure; otherwise
52:30 - 53:00 you're just playing with toy problems. Take real-world software engineering problems: I don't know about you all, but I typically spend, say, 80% of my first day setting up a software project; it takes a lot of time, and a similar thing applies to agents. To allow the agent to solve this type of complex problem, you need to write a lot of complex code around it. I was already halfway
53:00 - 53:30 through writing code for my next research project, and I thought: okay, OpenDevin has come out, maybe I can take all this existing code I've written and put it into OpenHands, making it an exciting open-source project that people want to use. That way I'd get a robust software infrastructure I could use for my own research, while people also got a very useful product that improves their productivity,
53:30 - 54:00 something like that, and I was super excited about it. At this point we've already hit some of those milestones: we have a pretty useful codebase for researching different agent ideas, and even our own staff uses it pretty extensively on a daily basis. I think in November about 20% of the code in the OpenHands codebase was written, authored or
54:00 - 54:30 co-authored, by OpenHands itself, which is a great signal that it's actually very useful in our everyday development process. And at some point we decided to start a company, because I'm really passionate about this open-source project and I don't want to see it die. If I had other obligations and had to work as a researcher, I probably wouldn't have that much time to contribute to this project, and the same was true for others. To
54:30 - 55:00 keep a project alive, you need a few people working on it full-time, really diving into it and maintaining the codebase, so it keeps running and keeps adding the newest features when new things come out. That was one of the reasons we decided to found a company: we want to keep this project rolling forward, bringing more value to people, to researchers like myself, and to open-source users. In our
55:00 - 55:30 opinion, agentic technology is too important to be built behind closed doors and owned only by a few big corporations. We hope everybody has a sense of what's going on and can contribute to it. Awesome, let's give it up for Xingyao. Thanks a lot for accepting our invite. I'll be uploading this on YouTube maybe tonight or tomorrow, so we'll have the recording. Thanks a lot for accepting our invitation to
55:30 - 56:00 present here. And for all who attended, thanks a lot, and keep up with the Cohere For AI Discord channel for more guests like this. Thanks everybody. Yeah, thanks for the invite, thanks everyone for the questions, great questions. All right, thanks.