Advancing AI in Software Engineering

From Code Generation Towards Software Engineering–Yangruibo Ding (Columbia University)


    Summary

    In this talk, Yangruibo (Robin) Ding of Columbia University examines how AI, particularly large language models (LLMs), is changing software engineering beyond code generation. Ding argues that while LLMs have had a large impact on code completion and generation, they fall short on broader software engineering tasks such as debugging and symbolic reasoning about program semantics. He emphasizes that LLMs need to engage with the deeper structural and functional aspects of software, including the security and reliability of AI-generated code. His research covers data collection, training, and evaluation methods intended to improve LLMs' software reasoning while keeping security and privacy in view.

      Highlights

      • Robin Ding highlights AI's pivotal role in revolutionizing code generation but stresses its current limitations. 🎯
      • Ding argues for more comprehensive training to improve AI's software reasoning and security checks. 🔍
      • Improvement in LLMs could lead to full-stack automation of the software lifecycle. 🔄

      Key Takeaways

      • LLMs are transforming coding but need more abilities in software engineering. 🚀
      • Current AI models excel in code generation but lag in software reasoning tasks. 🤔
      • Future AI systems should manage entire software lifecycles with security in mind. 🔒

      Overview

      The talk explores the transformative impact of AI in software engineering through the lens of Robin Ding from Columbia University. Ding, in his research, focuses on how language models can be better trained not just to generate code but to handle complex software engineering tasks which entail understanding deeper software semantics, debugging, and ensuring security.

        Ding illustrates that while language models like ChatGPT have succeeded in automating code generation, there are still many challenges they face when it comes to understanding and reasoning about software structure and security, crucial for complete automation in software engineering. He discusses how AI models can be improved and evaluated to better handle these tasks, emphasizing the critical role of security and reliability in future AI developments.

          The future envisioned by Ding involves comprehensive AI systems that manage entire software lifecycles, from development to maintenance, with a strong focus on security and privacy. This vision anticipates AI-driven automation that can efficiently and securely aid developers across all software engineering disciplines, ensuring robust and reliable software systems.

            Chapters

            • 00:00 - 01:30: Introduction The introduction sets the stage for a presentation at the colloquium, where Robin Ding from Columbia University is introduced as the speaker. The chapter highlights Robin's work at the intersection of AI and software engineering. The focus is on improving large language models (LLMs) for complex software engineering tasks, going beyond code completion and generation and incorporating richer context in model training.
            • 01:30 - 04:00: Challenges in AI-Assisted Coding The chapter titled 'Challenges in AI-Assisted Coding' discusses the complexities involved in AI-assisted software coding and design. It highlights the importance of understanding both the source code and the high-level reasoning needed for tackling complex questions related to software design and cross-cutting concerns. The chapter underscores the need for richer context as it is not always available within the local source code. Additionally, it emphasizes the significance of thorough evaluations, with particular mention of Robin's work, which aims to ensure that benchmark results are reflective of real-world deployments. This underscores the effort to push boundaries in evaluating AI in coding contexts.
            • 04:00 - 07:00: Limitations in Current Systems The chapter opens with an introduction by the speaker, who is honored to share insights on advancing language models from code generation towards software engineering. The speaker begins by posing a question to the audience about their current coding practices, specifically whether they rely solely on their own skills or utilize large models such as ChatGPT. Notably, only a few individuals are identified as 'hardcore programmers' who still write all their code independently, earning them the speaker's respect.
            • 07:00 - 11:30: Research Approaches The chapter discusses the integration of AI coding assistants into software development. The speaker describes keeping ChatGPT open while programming and prompting it whenever a coding problem comes up, writing only about 50% of their own code by hand. This reflects a broader industry shift towards integrating AI-generated code into mainstream software development.
            • 11:30 - 15:00: Advanced Software Reasoning The chapter discusses the emergence of AI-assisted coding and its significant impact on daily software development practices. While automation in code generation has been successful and is a cause for celebration among software engineers, there is also a recognition that this advancement alone is insufficient. The chapter argues that code generation does not encompass the entirety of software engineering. Therefore, to regard software development in a holistic manner, it is essential to look beyond just generating code.
            • 15:00 - 17:30: Enabling Global Software Reasoning The chapter discusses the deeper aspects of software engineering beyond code generation, likening it to an iceberg where the visible portion (code generation) is just a small part of the whole. For advanced automation, understanding and managing deeper layers, such as debugging and execution reasoning, is essential. The aim is to achieve automation that supports a wider range of tasks in software engineering.
            • 17:30 - 23:00: Monologue Reasoning The chapter 'Monologue Reasoning' discusses the goal of building reliable and secure automated tools that protect user privacy while supporting software engineering. By constructing compound AI systems around LLMs, the aim is full-stack automation spanning development to maintenance, although challenges remain.
            • 23:00 - 27:30: Experimental Results and Tools The chapter discusses the limitations of LLMs in realistic software engineering tasks. While LLMs can automatically generate code, they cannot fully understand or debug their own generated code, so issues are hard to resolve without human intervention. LLMs can also hallucinate code packages and misuse APIs. As a result, the burden of fixing LLM-generated issues falls on human developers.
            • 27:30 - 32:00: Future Research Directions The chapter discusses the security concerns raised when LLMs hallucinate packages and misuse APIs. The use of such unauthorized packages and APIs could be exploited by attackers. Industry users have also raised concerns that AI-generated code may contain subtle security flaws that compromise the security of their software products.
            • 32:00 - 35:00: Conclusion and Acknowledgements The chapter discusses the limitations of AI models like ChatGPT in handling symbolic tasks. While the model can solve a simple problem such as a bitwise operation cleanly, it struggles with tasks that require symbolic reasoning about program semantics, such as generating unit test cases. This highlights areas where AI still falls short compared to human software engineers.

            From Code Generation Towards Software Engineering–Yangruibo Ding (Columbia University) Transcription

            • 00:00 - 00:30 All right, let's start. Welcome to the colloquium. It's my great pleasure to introduce our speaker today, Robin Ding from Columbia University. Robin has done a lot of exciting work at the intersection of AI and software engineering, specifically focusing on improving LLMs' capabilities for complex software engineering tasks that go well beyond code completion and code generation. In that space, Robin is thinking about suitable model training with more enriched context, well beyond
            • 00:30 - 01:00 source code, and also high-level reasoning about more complex questions like software design and cross-cutting concerns, where this richer context is really needed because it's not local in the source code itself. And if that wasn't enough, there's a lot more on evaluations: Robin's work is really pushing the boundaries there on proper evaluations, making sure that all the benchmark results we're getting are indeed good proxies for what real-world deployments would look like. All right, so without further ado, let's welcome Robin. Thank you. All right, thanks a lot for that great
            • 01:00 - 01:30 introduction. I'm very honored to be here to share with you my research on advancing language models from code generation towards software engineering. Let me start with this question: how many of you are still writing 100% of your code on your own, without the help of large models such as ChatGPT? Oh, there are a few. You guys are really hardcore programmers nowadays, and I give you my respect. As for myself, I
            • 01:30 - 02:00 always keep ChatGPT open on the side, and I prompt it whenever I have a coding problem, which is quite amazing. Thanks to all these great AI coding assistants, I only write 50% of my own programs, and for the other 50% I either prompt or just hit Tab. And this seems to be not only my personal situation: as leaders from the big tech companies have also noted, it seems more and more AI-generated code will eventually be integrated into realistic software
            • 02:00 - 02:30 as well. So probably at this moment we're all on the same page that AI-assisted coding is drastically revolutionizing how we write code every day. While we are celebrating the success of automated code generation, as software engineers we probably still have to ask: is this enough, though? Is code generation all we need? And the answer is obviously no, because code generation is not exactly software engineering. So if we regard software
            • 02:30 - 03:00 engineering as an iceberg, code generation is just the most obvious tip of it. So if we hope to achieve more advanced automation in software engineering, we probably have to look at what is deeper in the sea. We definitely hope to achieve automation very similar to what we have seen in code generation, but supporting a broader scope of software engineering tasks such as debugging, execution reasoning, program [unclear], etc. And we also hope such automation to
            • 03:00 - 03:30 be realized in a trustworthy way, where the deployment of these automated tools is reliable and secure and also helps users protect their private information. With all these advancements we achieve with LLMs, we would construct compound AI systems around these models to eventually approach full-stack automation that covers the whole cycle of software engineering, from development to maintenance. Unfortunately, however, when
            • 03:30 - 04:00 people directly apply LLMs to realistic software engineering tasks, they actually complain about these models. For example, people notice that while LLMs can automatically generate code, they cannot fully understand their own generated code, so whenever there's an issue they cannot effectively self-debug, which actually leaves the burden on human developers to fix the LLM-generated issues. Also, LLMs can hallucinate code
            • 04:00 - 04:30 packages and misuse APIs, which raises significant security concerns, since the usage of these unauthorized packages and APIs could potentially be exploited by attackers. Besides, industry users have also complained that AI-generated code can have very subtle security flaws that actually compromise the security of their software products. We are able to find some
            • 04:30 - 05:00 concrete evidence regarding these limitations. For example, if we give ChatGPT a very simple LeetCode problem like a bitwise operation, it can generate a clean and concise solution. Good. But then, even with this simple problem, if we ask it to do a slightly more symbolic task that software engineers probably need to do every day, such as unit test case generation, it fails. And the model we test here is actually
            • 05:00 - 05:30 o1, which has been trained with advanced general reasoning capabilities, but it seems it is not yet effective enough at symbolic reasoning about program semantics. And this is not only my observation: several other researchers have also identified such weaknesses of ChatGPT, namely that it struggles to reason about the symbolic functional properties of programs, such as code invariants, as well as their runtime behaviors. And beyond such local
            • 05:30 - 06:00 symbolic reasoning for isolated programs, more global software reasoning is also required, since software is actually designed in modules, which maintain very complicated hierarchies and modular interactions. Here is an example that requires such global software reasoning, where we prompt ChatGPT to validate the correctness of a very simple Python program, checking whether the return value of this function is always a positive number, using a
            • 06:00 - 06:30 third-party library called CrossHair. CrossHair is basically an open-source program analysis tool for Python. Unfortunately, however, we realize that ChatGPT cannot even generate an executable program for it, and the main reason for such a failure is that the third-party library we're using, CrossHair, is a very low-resource PyPI package, not as dominant as, say, NumPy, so ChatGPT does not know much about it. And without global reasoning about
            • 06:30 - 07:00 the whole CrossHair library, such as its codebase and documentation, ChatGPT tends to hallucinate a lot, including the modules to be imported, the APIs to be used, as well as the syntax required to trigger the backend symbolic tools. We actually studied this issue of ChatGPT systematically, and it turns out that ChatGPT can correctly predict only 10% of the API usages for those cases that require such global software
            • 07:00 - 07:30 reasoning. Last but not least, as we discussed earlier, LLMs also tend to have a very weak sense of security. For example, here are two functions from the Linux kernel, and as you can see, while these two functions share tens of lines of code in common, missing a very subtle if check in the middle leads to a security vulnerability, a null pointer dereference. However, when we visualize the LLM's embeddings of these two functions
            • 07:30 - 08:00 using the OpenAI embedding model, we realize that the LLM cannot effectively distinguish between these two functions because they're textually similar. This is definitely not what we want, since such confusion makes LLMs struggle to tell security vulnerabilities apart from benign code, simply because they cannot effectively draw a classification boundary between two overlapping embeddings. I mean, this is actually very confusing.
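The confusion described above can be pictured with a small sketch. It is not from the talk: the C snippet is hypothetical (it is not one of the Linux kernel functions shown on the slide), and line-level `difflib` similarity is only a crude textual stand-in for embedding similarity. The point is that removing one NULL check leaves the two versions almost identical as text while changing their safety.

```python
import difflib

# Hypothetical C function, used only for illustration (NOT the kernel code from the talk).
PATCHED = """\
int dev_set_mode(struct dev *d, int mode)
{
    struct dev_ops *ops = d->ops;
    if (!ops)
        return -EINVAL;
    d->mode = mode;
    ops->apply(d);
    d->dirty = 1;
    return 0;
}
""".splitlines()

# Same function with the NULL check removed: ops->apply(d) may now dereference NULL.
VULNERABLE = [
    line for line in PATCHED
    if "if (!ops)" not in line and "-EINVAL" not in line
]


def line_similarity(a, b):
    """difflib ratio over lines: 2 * matched_lines / total_lines, in [0, 1]."""
    return difflib.SequenceMatcher(None, a, b).ratio()


similarity = line_similarity(VULNERABLE, PATCHED)
missing = [line.strip() for line in PATCHED if line not in VULNERABLE]
print(f"textual similarity: {similarity:.2f}")
print(f"lines missing from the vulnerable version: {missing}")
```

A model that mostly tracks surface text would place these two versions close together, which is the overlap the speaker describes; telling them apart requires attention to the one check that guards the dereference.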
            • 08:00 - 08:30 So why are LLMs so capable at generating code while they have significant limitations in this kind of software reasoning? To answer this question, we probably need to revisit how LLMs are pre-trained on code. The first step of this pipeline is collecting data, where we collect tons of open-source source code files from GitHub repositories. Then we flatten these code files into code tokens, and then we randomly group
            • 08:30 - 09:00 several files to construct a concatenated sequence that fills up the model's context length. During training, we train these models to autoregressively predict the next token in this concatenated sequence, from left to right. When the training is done, we prompt the model with an incomplete program or natural language instructions, and the model completes the prompt accordingly. Finally, the LLMs are systematically evaluated with the popular benchmarks,
            • 09:00 - 09:30 where the models are prompted to complete a code function or code file, and their solutions are evaluated against a set of predefined test suites. While this pipeline has successfully produced several code generation models, such as Codex from OpenAI and Code Llama from Meta, it has limitations by design for comprehensive software reasoning. First, for data, the online code files are
            • 09:30 - 10:00 typically static code text and lack annotations of their symbolic functional properties, such as invariant code properties and the corresponding execution traces. Without seeing a lot of such signals, LLMs struggle to really connect the static code text with its deeper program semantics. Also, the online code data can be rather noisy, where malicious code could be directly memorized by the model during
            • 10:00 - 10:30 training without even noticing the danger, so later the code generated by such models could be malicious as well. And during training, because of this random concatenation of code files, which are typically irrelevant to each other, predicting the next token in this concatenated sequence does not really teach LLMs to reason about module interactions and global software dependencies. In addition, because the model is trained to predict the code text,
            • 10:30 - 11:00 it also struggles to tell apart textually similar sequences, while in code a very subtle difference in a comparison operator could change the control flow and trigger security vulnerabilities. And during inference, without access to the most up-to-date references, like the current codebases and documentation, the model could hallucinate and make very outdated predictions based on its internal but stale knowledge.
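The pre-training recipe the speaker criticizes (flatten files into tokens, randomly group files, fill the context window, predict the next token left to right) can be sketched in a few lines. This is a toy illustration under stated assumptions, not any model's actual data loader: whitespace splitting stands in for a real tokenizer, and `pack_files`, `next_token_examples`, and the `<eos>` separator are hypothetical names.

```python
import random


def pack_files(files, context_length, seed=0, sep="<eos>"):
    """Shuffle files, flatten them into one token stream, and cut fixed-size windows."""
    rng = random.Random(seed)
    order = list(files)
    rng.shuffle(order)  # random grouping: neighbouring files are usually unrelated
    stream = []
    for text in order:
        stream.extend(text.split())  # whitespace split stands in for a real tokenizer
        stream.append(sep)           # separator token between files
    full = len(stream) // context_length * context_length  # drop the incomplete tail
    return [stream[i:i + context_length] for i in range(0, full, context_length)]


def next_token_examples(window):
    """Left-to-right objective: predict token i from tokens 0..i-1."""
    return [(window[:i], window[i]) for i in range(1, len(window))]


files = [
    "def add ( a , b ) : return a + b",
    "import os",
    "class Stack : pass",
]
windows = pack_files(files, context_length=8)
examples = next_token_examples(windows[0])
print(windows)
print(examples[0])
```

Because the files in a window are chosen at random, a prefix that ends in one file gives no useful signal about the file that follows it, which is why this objective alone does little to teach cross-file dependencies.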
            • 11:00 - 11:30 What is even more unfortunate is that these popular benchmarks focus completely on function- or file-level code completion, failing to expose the weaknesses we have discussed so far, so LLM developers could keep overlooking all these important software engineering perspectives when they work on the next generation of LLMs. So, to advance LLMs towards realistic software engineering tasks, my research focuses on enhancing their
            • 11:30 - 12:00 fundamental software reasoning capabilities through each step in this pipeline. To start with, I definitely hope to collect more code-specific symbolic signals, rather than relying only on static code text to train LLMs. So I perform program analysis to annotate the code syntax and underlying graph structures, like the data and control flow, as well as the deeper functional properties, including both the static and the dynamic perspectives of program
            • 12:00 - 12:30 semantics. Beyond the local program features, I also target global reasoning about software structures, where I curate project-level data by performing dependency analysis over thousands of software projects to expose the software hierarchies and extract the modular interactions. Besides the general reasoning about software programs, security is another focus, where we crawl publicly reported security flaws and
            • 12:30 - 13:00 analyze them for fine-grained annotations regarding their vulnerable patterns and evolution histories, like when and where they were reported and how they were patched. Then, as the next step, I customize the training to capture these enriched signals, where I propose training strategies and augment the learning objectives for LLMs to reason about symbolic program features both explicitly in the token space and implicitly in the latent space.
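The difference between static and dynamic annotations mentioned above can be made concrete with Python's standard library. This is a minimal sketch and not the speaker's tooling: `clamp_sum` is a hypothetical sample program, `ast` supplies the static view (facts read off the syntax tree), and `sys.settrace` supplies the dynamic view (which lines ran, and with which variable values, for one input).

```python
import ast
import sys

# Hypothetical sample program, used only for illustration.
SOURCE = """
def clamp_sum(a, b):
    total = a + b
    if total < 0:
        total = 0
    return total
"""


def static_annotations(source):
    """Static view: facts read off the syntax tree without running the code."""
    nodes = list(ast.walk(ast.parse(source)))
    assigned = sorted({n.id for n in nodes
                       if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)})
    branches = sum(isinstance(n, (ast.If, ast.For, ast.While)) for n in nodes)
    return {"assigned": assigned, "branches": branches}


def dynamic_trace(source, func_name, *args):
    """Dynamic view: executed line numbers and local variables for one input."""
    namespace = {}
    exec(compile(source, "<sample>", "exec"), namespace)
    events = []

    def tracer(frame, event, arg):
        if frame.f_code.co_filename != "<sample>":
            return None  # ignore frames that do not belong to the sample
        if event == "line":
            events.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = namespace[func_name](*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, events


print(static_annotations(SOURCE))
result, events = dynamic_trace(SOURCE, "clamp_sum", -5, 2)
print(result, [line for line, _ in events])
```

The static view is the same for every input, while the trace changes with the arguments: the clamping line only appears when the sum is negative. Plain source text carries neither signal explicitly, which is the gap the annotations are meant to fill.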
            • 13:00 - 13:30 In addition, for better global software reasoning, I also optimize the model architecture to incorporate broader software context in a flexible and efficient way, so that LLMs can effectively capture cross-file dependencies and module interactions. I further optimize the prompting strategy and construct systems around these models to boost their inference-time performance, encouraging them to statically reflect after generation and keep improving the
            • 13:30 - 14:00 solution iteratively, and also equipping them to dynamically interact with symbolic tools to retrieve information and validate the correctness of their own solutions. Finally, I construct new benchmarks and design new metrics to evaluate LLMs beyond just file- or function-level code generation, comprehensively assessing their consistency, global software reasoning capabilities, and security awareness
            • 14:00 - 14:30 under realistic software development scenarios. This is a very high-level overview of the research I have done so far. As a result, my interdisciplinary research has been published in varied domains, including software engineering, programming languages, machine learning, and natural language processing, and the outcomes of this research have contributed to both open-source communities and companies. For example, my latest vulnerability detection benchmark, PrimeVul, has been
            • 14:30 - 15:00 integrated into the evaluation system of Gemini from Google DeepMind, and another benchmark of mine, CrossCodeEval, drew very early attention from LLM developers to the importance of global software reasoning. For example, DeepSeek was the first to utilize CrossCodeEval as their primary resource to evaluate and enhance their code models, DeepSeek-Coder, and then CrossCodeEval eventually became more popular, helping with the development of more open-source LLMs such as StarCoder
            • 15:00 - 15:30 2 from BigCode and Qwen2.5 from Alibaba. Additionally, our approaches to incorporating security awareness and broad code context have been directly integrated into Amazon Q, improving the reliability and robustness of the product's backbone LLMs. In the rest of this talk, I will focus on discussing our major efforts in teaching LLMs to reason about program
            • 15:30 - 16:00 semantics and software dependencies. Specifically, we will start by introducing how we train LLMs not only to generate code but also to reason about its symbolic and functional properties. Then we'll go beyond such local reasoning and train LLMs to capture cross-file dependencies and modular interactions for better global software reasoning. OK, now let's start with the first part. Recall that while LLMs are pretty capable
            • 16:00 - 16:30 at generating code, they struggle to reason about symbolic program semantics, indicating there is still a gap between the neural models, which are pretty effective at capturing the statistical distribution of data, and the symbolic reasoning that is required for software engineering practice. One of the main reasons for such a gap is, as we discussed earlier, that there is not much data online for neural models to learn these symbolic
            • 16:30 - 17:00 reasoning signals. Traditional PL/SE techniques have actually been very effective at these specialized analyses of program semantics, but they also suffer from several problems. For example, dynamic analysis does not scale well, since test coverage is always a challenge, and such execution-based approaches always have very high latency and cannot be directly applied to partial code. Static analysis, on the other hand, typically
            • 17:00 - 17:30 requires significant human effort to write rules, which are typically hard-coded and not flexible enough to capture fuzzy patterns. So my research aims to bridge the gap by training LLMs to perform symbolic reasoning, so that we can scale such specialized program analysis without additional latency from the symbolic tools, while the analysis itself is also more flexible in the online setting, to
            • 17:30 - 18:00 tolerate noise from user requests, and can be immediately applied to partial or even inexecutable code. To achieve this goal, we made several efforts in the LLM training paradigm, where we propose new training strategies, augment the learning objectives, and customize the model design to perform different types of symbolic reasoning for programs. For the sake of time, we will only discuss the first approach in detail and briefly cover the effectiveness of the
            • 18:00 - 18:30 other two approaches. Recall this LeetCode problem on the bitwise operation. When we execute this ChatGPT-generated solution, we realize it is actually buggy, since the real output turns out to be 12 while the expected output is actually six, because of this buggy line in the middle. This actually reminds us of how humans program: they not only write code, but whenever there's an issue they will
            • 18:30 - 19:00 engage in an iterative process of reflection: they review their implementation to identify the symbolic properties that are expected to hold true, and analyze the execution traces against these symbolic properties to identify unexpected behaviors and understand why the bug happens. So if we train LLMs to imitate such a symbolic reasoning process, they will also be able to self-debug and iteratively improve their own
            • 19:00 - 19:30 solutions. To do this, we propose SemCoder. The first key design of SemCoder is that we specify four types of semantics to be learned by LLMs, enabling them not only to generate code but also to reason about its symbolic properties. Approximate semantics indicates the high-level intents that the program is meant to achieve; one simple example is the functional specification of a
            • 19:30 - 20:00 binary search. Structural semantics refers to the source code itself, which implements the approximate semantics with algorithms and data structures. Abstract semantics includes the properties and constraints identified during code review, which are expected to hold true regardless of the program inputs. For example, for binary search, the rightmost pointer of the search should always be no larger than
            • 20:00 - 20:30 the length of the array, and the array must be sorted before the search begins. And operational semantics captures the runtime behaviors of programs, such as the execution traces with specific unit test cases, which can be checked against the abstract semantics. To collect all these semantics for training, we need a lot of fully executable code, so that we can perform dynamic analysis to extract the
            • 20:30 - 21:00 comprehensive signals. Therefore, we synthesize such a high-quality dataset. You might wonder why we don't directly use open-source code for training. The reason is that while crawling this code is cheap, scaling up its execution is both expensive and practically challenging. For example, here's a human-written class that implements some search algorithms, but this class cannot be executed or traced at all, since open-source code typically
            • 21:00 - 21:30 requires individual configurations for testing and does not always have unit test cases for every random code block. And the quality of these random code blocks is actually difficult to assess, since in many cases we cannot get fine-grained specifications to really judge their correctness. In this case, for instance, the class does not even have any docstrings to specify what exactly it does, and without unit test cases we cannot even validate the correctness
            • 21:30 - 22:00 either. Therefore, we implement a pipeline to synthesize high-quality, executable code samples. The main idea here is that we use open-source code as a seed to inspire the LLM to synthesize problems on realistic topics but with a narrower scope, to ensure executability. For example, the inexecutable search class we just saw could inspire the synthesis of a binary search problem, where the latter is much easier to be
            • 22:00 - 22:30 executed. Then we keep using LLMs to synthesize solutions for the problem and use symbolic tools to generate unit test cases to validate the correctness of these solutions, keeping only the high-quality ones that pass all the test cases as our training data. With this pipeline, we collect PyX, a high-quality dataset with sufficient executable code samples.
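The filtering step described above (keep only candidate solutions that pass every test case) can be sketched as follows. This is an illustration under stated assumptions, not the PyX pipeline itself: the two binary search functions stand in for LLM-synthesized candidates, the test cases are written by hand rather than generated by symbolic tools, and `passes_all` and `keep_verified` are hypothetical helper names.

```python
def binary_search_ok(arr, key):
    """Stand-in for a correct synthesized candidate."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == key:
            return mid
        if arr[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


def binary_search_buggy(arr, key):
    """Stand-in for a flawed synthesized candidate."""
    lo, hi = 0, len(arr) - 1
    while lo < hi:  # bug: never inspects the last remaining element
        mid = (lo + hi) // 2
        if arr[mid] == key:
            return mid
        if arr[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


# Hand-written (arguments, expected result) pairs; the talk generates tests with tools.
TESTS = [
    (([1, 3, 5, 7], 7), 3),
    (([1, 3, 5, 7], 4), -1),
    (([], 1), -1),
    (([2], 2), 0),
]


def passes_all(candidate, tests):
    """True only if the candidate returns the expected value on every test."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failed test
    return True


def keep_verified(candidates, tests):
    """Discard every candidate that fails at least one test."""
    return [c for c in candidates if passes_all(c, tests)]


kept = keep_verified([binary_search_ok, binary_search_buggy], TESTS)
print([c.__name__ for c in kept])
```

Narrow, self-contained problems like this one are what make the filter cheap: each candidate runs without project configuration, so correctness can be checked by execution alone.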
            • 22:30 - 23:00 However, while we are able to extract all the semantics with PyX, our preliminary study shows that simply dumping all these program semantics in raw format actually makes the training very inefficient and difficult to scale up. As you can see, these varied semantics have varied data formats and granularities, so LLMs really struggle to align the relevant information across these multimodal
            • 23:00 - 23:30 program semantics, especially when dealing with the formal annotations of the abstract semantics and the execution traces, which have a lot of irrelevant and redundant details. To address this issue, we propose monologue reasoning, which aligns the semantics using only natural language and source code, the two most dominant types of data in LLM pre-training. The first key design here is to teach LLMs to connect the
            • 23:30 - 24:00 varied code semantics with the source code itself through different granularities of alignment. Specifically, we start from the most coarse-grained alignment to construct the approximate semantics, by summarizing the overall objective of the whole program, followed by a detailed explanation of how the objective is concretely implemented. Then we move to a more fine-grained alignment, where we identify the symbolic properties and constraints for
            • 24:00 - 24:30 each code block and explain them in natural language: for instance, the entry code block of this binary search requires certain preconditions on the input, the while loop maintains invariant properties to ensure the search works as expected, and the return block defines the postconditions when the program terminates. We'll explain how we automatically annotate these monologues for training in a minute. Finally, we move to the most fine-grained alignment, at the line level,
            • 24:30 - 25:00 where we construct the operational semantics by explaining the execution effects line by line, but again in natural language, similar to the mental execution that humans perform in their minds during rubber-duck debugging. We further construct monologues to reason about the reversed effects of execution, similar to the reverse-debugging features we have seen in debuggers like GDB and LLDB. But this is trying to teach LLMs about the
            • 25:00 - 25:30 abstract perspectives of execution. For example, when the final state of the binary search is minus one, which means the key is not found, the model needs to understand that there are potentially infinitely many inputs that can lead to this final state: basically, any combination of an array and a key that is not part of it will lead to this final state of a missing key. So the model needs to summarize the initial state as abstract
            • 25:30 - 26:00 constraints rather than concrete values. With this bidirectional explanation, we teach LMs to reason about code execution both concretely and abstractly. The second key design of monologue reasoning is to train LMs to effectively reason about these aligned program semantics in order, from the high-level intent to the low-level details. To start with, we still train LMs
            • 26:00 - 26:30 to write code, to maintain this very fundamental capability. Then we further train the model to review its own implementation by summarizing the overall objective of the whole program and explaining the details, followed by identifying the key properties and constraints for each code block, and finally to self-debug by reasoning about the execution effects in a step-by-step manner, in both the forward and
            • 26:30 - 27:00 backward directions. Okay, so to train LMs to perform such monologue reasoning, we need to collect a lot of high-quality monologues. So we design a rejection sampling framework to perform automatic data annotation with strict quality control, where we start by prompting LMs themselves to annotate the monologues, using carefully crafted instructions and few-shot examples to ensure the monologues follow our
            • 27:00 - 27:30 design, and then we implement a strict selection strategy for each type of annotated monologue to validate its correctness, and only keep the validated ones as our high-quality training data. Since the whole system requires significant engineering effort, we will not go into too much detail here, but feel free to ask questions at the end if you are curious about a specific technology. Okay, now let's talk about the results. We start by evaluating SemCoder,
            • 27:30 - 28:00 an LM with 6.7 billion parameters, on performing symbolic reasoning to predict the execution effects purely statically on two popular benchmarks. As we can see, SemCoder beats all the baseline models of similar size by a large margin and can even outperform industry models with 10 times more parameters in predicting code runtime behavior. We also compare our monologue reasoning
            • 28:00 - 28:30 with other reasoning methods, such as chain-of-thought, scratchpad reasoning, and raw execution traces, where we replace the monologues in our training data with each of these reasoning formats and retrain SemCoder for each of them as a fair comparison. Since the proposed monologue reasoning specifies the important perspectives of the program semantics and smoothly aligns them at different granularities, we can see that it reasons
            • 28:30 - 29:00 about code execution much more effectively than the other reasoning baselines. We are also excited to see that SemCoder can effectively self-debug completely statically with monologue reasoning. Recall the buggy solution to the bitwise operation generated by ChatGPT; now let's prompt SemCoder to debug this code. We notice that SemCoder is able to accurately predict the unexpected return
            • 29:00 - 29:30 value of this buggy program by itself, without really executing the code. More interestingly, with its reverse-debugging capability, it carries the expected value of six and tries to deduce the expected previous program states all the way back, and it successfully localizes the discrepancy between the expected behavior and the buggy execution. This self-debugging also makes SemCoder more effective at statically refining its own solutions
            • 29:30 - 30:00 than baseline models, which is especially useful when real execution is not an option, for example when users only provide partial code for analysis when we serve a chatbot, or when we have practical restrictions or security concerns about frequent execution during online deployment. Finally, we also noticed that training with monologue reasoning, which was originally designed for code analysis, also results in outstanding code
            • 30:00 - 30:30 generation performance, which highlights the general value of learning symbolic reasoning about program semantics. Besides training LMs to perform symbolic reasoning explicitly in the token space, as we have seen in monologue reasoning, which explains everything verbally, we have also explored training LLMs to learn symbolic code properties in the latent space for non-generative tasks. Due to the time restriction, I cannot go into the details of these methods, but I hope to
            • 30:30 - 31:00 quickly show you the results. We evaluated these latent symbolic reasoning approaches on a broad scope of non-generative software engineering tasks, including branch coverage and program state prediction, as well as clone retrieval, code search, and vulnerability detection. In general, we observe that learning to reason about program semantics largely helps LMs perform better in software engineering tasks, significantly outperforming those LMs that are only trained to
            • 31:00 - 31:30 predict code text. Okay, so far we have introduced how we enable LMs to reason about individual programs locally. Now let's take the next step and train LMs for more global software reasoning, to understand the interactions and dependencies among software modules. Recall the cross-file failure we discussed at the very beginning, where ChatGPT hallucinates due to the lack of global software context. This
            • 31:30 - 32:00 kind of programming, which needs to reason about external modules, is also known as modular programming. So, to systematically quantify LMs' limitations in modular programming, we start by constructing a benchmark for it, where we first implement static analysis tools and perform dependency analysis over thousands of open-source repositories to identify samples that require global software reasoning, such as code snippets with
            • 32:00 - 32:30 external dependencies. Then we further identify the usage of such dependencies and create the prompt and target accordingly. We also retrieve the ground-truth cross-file context for each of these created samples; in this case, the definition of the API to be used in the target will be retrieved, so that we can evaluate how LMs perform differently with and without the ground-truth context.
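The sample-creation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual benchmark pipeline: the function names `imported_names` and `make_sample`, the `local_modules` argument, and the rule "cut at the first line that uses an imported name" are all hypothetical simplifications.

```python
import ast


def imported_names(source: str, local_modules: set[str]) -> dict[str, str]:
    """Map each name bound by an import of a project-local module to that module."""
    names: dict[str, str] = {}
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.ImportFrom) and node.level == 0
                and node.module in local_modules):
            for alias in node.names:
                names[alias.asname or alias.name] = node.module
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name in local_modules:
                    names[alias.asname or alias.name] = alias.name
    return names


def make_sample(source: str, local_modules: set[str]):
    """Split a file into prompt/target at the first use of a cross-file name.

    Hypothetical rule: the target is the first line that references a name
    imported from another project file; everything before it is the prompt.
    """
    names = imported_names(source, local_modules)
    # Import statements bind names through ast.alias, not ast.Name,
    # so only real usages are collected here.
    use_lines = sorted(
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Name) and node.id in names
    )
    if not use_lines:
        return None  # no cross-file usage, so not a modular-programming sample
    lines = source.splitlines()
    cut = use_lines[0]
    return {
        "prompt": "\n".join(lines[: cut - 1]),
        "target": lines[cut - 1],
        "dependencies": sorted(set(names.values())),
    }
```

A file with no usage of a project-local import yields `None`, which mirrors the idea of keeping only samples that need global software reasoning.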
            • 32:30 - 33:00 With this pipeline we collect CrossCodeEval, a multilingual benchmark with over 10,000 modular programming samples. Now we evaluate ChatGPT with CrossCodeEval and see how it performs. Not surprisingly, ChatGPT cannot handle such modular programming without referring to the cross-file context. But what is really surprising is that even when the ground-truth context is retrieved and prepended to the prompt,
            • 33:00 - 33:30 ChatGPT still tends to ignore it and keeps hallucinating predictions in many cases. So, to explain why such ignorance happens, let's recall the standard pipeline for training LLMs on source code, where the training code files are typically randomly ordered and concatenated for next-token prediction without dependency analysis. So the cross-file dependencies can be separated pretty far away from
            • 33:30 - 34:00 each other, and due to the limited context length of the LMs, such long-range cross-file dependencies will not be directly learned during next-token prediction because of sequence truncation. On the other hand, irrelevant code files are typically concatenated together, which misleads LMs into learning the wrong signal: that they should frequently ignore the surrounding files, since they are typically irrelevant. This is why, even if the cross-file
            • 34:00 - 34:30 context is retrieved and prepended, LMs still don't understand how to leverage it. To this end, we propose CoCoMIC to redesign the LM training strategy for modular programming. We first perform dependency analysis for each training file and retrieve its cross-file context individually, and then we propose to keep this relevant information for the model's reference during training to
            • 34:30 - 35:00 capture the global software dependencies. Now the question is, how do we perform such training exactly? The most naive approach is directly concatenating all the retrieved context and prepending it to the current file being completed. However, practically this is very expensive, because the computational cost... Yes? Um, how do you know what is relevant? Right, the cross-file
            • 35:00 - 35:30 context, the potential cross-file context, is enormous. Yes, so in this case we do the dependency analysis only for the import statements; this is specific to Python. Yes, and we retrieve whatever they are importing as the cross-file context. Won't the list of transitive imports be too... Yes, it is true, and that's why I will introduce the next limitation, and I will introduce our... yeah.
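As a rough illustration of the import-based retrieval just described, the following Python sketch resolves a file's direct `import` statements to source files inside the same repository. It is a minimal sketch under assumptions, not the speaker's tooling: the function name `cross_file_context` and the two-candidate resolution rule (`module.py` or `module/__init__.py`) are hypothetical, relative imports are skipped, and transitive imports are deliberately not followed.

```python
import ast
from pathlib import Path


def cross_file_context(file_path: Path, repo_root: Path) -> dict[str, str]:
    """Return {module name: source text} for direct imports found under repo_root."""
    tree = ast.parse(file_path.read_text())
    modules: list[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules += [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.append(node.module)

    context: dict[str, str] = {}
    for name in modules:
        rel = Path(*name.split("."))
        # Hypothetical resolution rule: a plain module file, or a package directory.
        for candidate in (repo_root / rel.with_suffix(".py"),
                          repo_root / rel / "__init__.py"):
            if candidate.is_file():
                context[name] = candidate.read_text()
                break
    return context
```

Standard-library and third-party imports simply fail to resolve under `repo_root` and are left out, so the returned context contains only files from the project itself.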
            • 35:30 - 36:00 Right, so basically the first issue is that, practically, this kind of prepending is very expensive because the computational cost grows super-linearly with the context length, and there is simply so much information in the retrieved context. And conceptually, the information in the cross-file context can be quite sparse. Imagine you import a module that implements hundreds of APIs with thousands of lines of code, but
            • 36:00 - 36:30 what you really need to complete the current file is just a few API names. So it would be a waste of compute if we directly prepended all the cross-file context as concrete tokens. To address these problems, we propose a neural architecture that extends the transformer's self-attention to jointly model the in-file and cross-file context. Our architecture starts with a cross-file context fusion, where the retrieved references
            • 36:30 - 37:00 are summarized into a compact embedding via a special summary token, and this allows the model to efficiently integrate the relevant cross-file information without overwhelming the computation. This summary embedding is then mapped back to key vectors and value vectors at each transformer layer for later joint attention during the code generation process. So during code generation, we extend
            • 37:00 - 37:30 the self-attention mechanism of the generation model so that at each layer the model still attends to the keys and values from the concrete in-file tokens, while also jointly attending to the cross-file keys and values, which are equivalent to just one token per reference. During fusion, the model learns to extract the relevant information into these summarized representations, so the token generation model does not need to explode its
            • 37:30 - 38:00 compute by redundantly processing the full content with tens of thousands of tokens, and this joint attention allows for more accurate modular programming in a compute-efficient way. Yes? So then the cross-file tokens, how are those, um... much larger? They will basically summarize, because there are so many tokens in, for example, the retrieved modules, and
            • 38:00 - 38:30 then we don't really know which one is more important than the others during this kind of dynamic code generation. So the next slide is: we will actually train both the fusion module and the code generation module by connecting the two models, and their weights will be optimized towards next-token prediction, so that they can flexibly figure out what the important information is to be fused into this summarized representation. Does that make
            • 38:30 - 39:00 sense? Yeah. All right. Okay, so to evaluate our approach, we pre-train an LM with 350 million parameters on code data from scratch with three different settings: without the cross-file context, which is the standard LM training strategy; prepending the cross-file context to the file being completed as
            • 39:00 - 39:30 concrete tokens, which is the most naive approach and wastes a lot of compute; and CoCoMIC. We can see that CoCoMIC is significantly more effective than the other two baselines. Together with my collaborators at Amazon, we are currently scaling up CoCoMIC to tens of billions of parameters to improve the modular programming capabilities of the product LM at Amazon Q. Finally, I would note that CoCoMIC has three notable differences when compared
            • 39:30 - 40:00 with a standard encoder-decoder architecture. First, our fusion and generation are basically done by the same model rather than having two separate sets of weights for the encoder and decoder, making CoCoMIC more effective in practice when scaling up the model size. Second, we fuse and jointly attend at each layer of the transformer to incorporate heterogeneous information, since different layers in the LM learn different knowledge,
            • 40:00 - 40:30 while the original encoder-decoder architecture refers only to the representations from the last layer of the encoder when generating tokens with the decoder. Third, we get rid of the complicated encoder-decoder attention and directly extend the self-attention instead, and empirically the latter is not only simpler but also more effective. Throughout this talk, I have introduced my major contributions to teaching LLMs to reason about software
            • 40:30 - 41:00 programs both locally and globally, and overall my research aims to improve each step in the LM pipeline to advance its capabilities for software engineering. Okay, now let's recall the iceberg of software engineering we discussed at the beginning. In general, LM-empowered automation is still on the very surface of this iceberg. My research has made early efforts to push these models' capabilities further, to touch upon more
            • 41:00 - 41:30 involved software engineering tasks. In the future, while we will keep improving LMs' capabilities to support more tasks, we also hope to go deeper into this iceberg to explore the trustworthy deployment of these models, ensuring security, reliability, and privacy. And looking further ahead, my long-term vision is to achieve full-stack automation with compound AI systems, which can autonomously manage the entire software life cycle, from
            • 41:30 - 42:00 development to maintenance. Let's first talk about our short-term goal. There have been many challenges in the trustworthy deployment of LMs, where these models not only tend to generate insecure code but also fail to reliably provide consistent responses to different forms of the same question. In addition, very large language models such as ChatGPT have raised significant concerns from industry users regarding the data
            • 42:00 - 42:30 leakage of their commercial code and information, since these very large models cannot be deployed privately, while small models are much preferred here since they can be easily customized for private usage. All these challenges have motivated me to start exploring these different perspectives on the trustworthiness of code LMs. For example, we have collected a vulnerability detection dataset to analyze LMs' weaknesses in security analysis,
            • 42:30 - 43:00 and I am also currently leading a blue team called Align Coder in the Amazon Trusted AI Challenge, where we explore how to train LLMs with advanced security awareness and how to rigorously align them to minimize the risk of generating malicious code. Moving forward, I will keep exploring trustworthy code generation as well as training LMs to perform security analysis more efficiently, where we plan to teach these models to interact with
            • 43:00 - 43:30 security analysis tools to learn from their feedback and from known vulnerable patterns. For the reliability perspective, we have explored specialized code reasoning, as we discussed with SemCoder, to improve the model's robustness against different types of perturbations. Interestingly, as we discussed before, we realized that training models to reason can make them parameter-efficient, where a small model can
            • 43:30 - 44:00 perform even better than larger models on certain tasks. So this actually makes me very interested in training much smaller but domain-expert models in the future that can reliably solve certain tasks, so that users can serve these models locally at a very affordable cost, which ensures effectiveness while providing guarantees to protect their private code and data. With the advanced capabilities and
            • 44:00 - 44:30 trustworthiness of LLMs, we could achieve this in the near future. The long-term goal of my research lies in constructing compound AI systems to achieve full-stack automation that covers the whole life cycle of software development and maintenance. Inspired by the definition of automation levels in self-driving cars, let's also think about what we would achieve in software engineering to eventually approach full-stack automation. I would define the first
            • 44:30 - 45:00 level of automation as file- or function-level code completion, which can be achieved by LMs that are trained only on predicting the next code token. The next level goes one step further: the model will not only generate code as a one-shot prediction but keep self-improving, and my research has proposed several strategies to equip LMs with these advanced reasoning capabilities. Then we expect larger-scale
            • 45:00 - 45:30 automation, where AI systems can interact with the environment, use tools, and retrieve external information to perform project-level analysis. Nowadays, agentic systems are roughly at this stage, but in very narrow domains, like GitHub issue solving, as we have seen in SWE-bench. So in the future I am aiming to extend these agentic prototypes to reliably support broader tasks in a more
            • 45:30 - 46:00 general way. The fourth level of automation that I'm targeting is something we haven't reached yet, where the AI system can both implement and maintain a whole software feature, which probably requires more than one LM agent with varied expertise, and the system trajectories should be carefully designed to ensure effectiveness and efficiency. Ultimately, we hope to achieve full-stack automation, where the AI system will
            • 46:00 - 46:30 autonomously manage the whole software life cycle. This will be extremely challenging. From the AI side, we need to construct hierarchical and collaborative agents, like a group of software developers working together, where they have distinct roles and need to communicate to figure out how to solve very complex tasks. From the system-building side, we need to handle nearly all the complexity that any large system would have, like how to optimize distributed computing, how to handle
            • 46:30 - 47:00 concurrency, how to design workload routing with networking, etc. In the coming maybe 10 years or more, I would be very excited to collaborate extensively with researchers from all the related domains, like AI, PL, SE, security, networking, and systems, to eventually approach this ambitious goal together. Finally, I would like to thank my mentors and collaborators, and thank you so much for taking the time to listen to my
            • 47:00 - 47:30 talk. [Applause] We have time for a couple of questions. I think the ceiling microphones will pick it up nicely; if you want the microphone, let me know and I'll bring it over. Yeah, go ahead. So I have two questions. The first is a more low-level question. In the example projects you listed, I think both of them propose lots of
            • 47:30 - 48:00 new techniques, and since there are lots of moving parts, do you know which of those techniques, or which of those parts, yields a larger gain compared to the other parts? Yes, so we actually have a very detailed ablation study in the paper on the different modules that we created as part of the novelty. For example, in the SemCoder paper we really evaluate the different semantics,
            • 48:00 - 48:30 how they have different impacts on the final code generation, by removing one of the semantics from the monologue trajectories, etc. I didn't cover that here, but you can check the paper for more details about the story there. So if you remove one of them, do you always see a degradation? Yes, there's always a decrease in general performance, but the magnitude, how much it decreases, depends on the task. For example, in code generation
            • 48:30 - 49:00 the execution reasoning is probably not that important, but for self-debugging, removing the operational semantics would be a disaster; it would basically not perform anymore. Okay, so my second question is: there is a lot of existing research in all the fields you have touched on, so could you sketch a little bit how you are going to... you said connecting to all those fields, but how are you going to leverage
            • 49:00 - 49:30 the existing research in those fields in your future research? Yeah, so I guess the easiest way of combining all these technologies is really to construct these agentic systems. For example, we will teach these models how to interact with symbolic tools, for example for analysis or for verification, rather than relying on the models themselves to do that, and try to enable them, pretty similar to developers, to really interact with all these off-the-shelf tools.
            • 49:30 - 50:00 That's one example. And as I mentioned at the very end, we definitely need all the insights from system building to really construct very large agentic systems, which could potentially be deployed in, say, the cloud or somewhere else. So are you mostly saying that LM agents should be able to interact with the existing tools in the way that a
            • 50:00 - 50:30 human user interacts with those tools? It might not necessarily be the human way. The reason is that maybe machines can figure out a more efficient way themselves that is more friendly for themselves. But from our experiments, bootstrapping these models to use tools in the human way is always more effective than training them from scratch to interact with these tools. Yeah, and probably that's also the reason why, for example,
            • 50:30 - 51:00 if you read the DeepSeek-R1 paper, they have two settings: one is to directly train with reinforcement learning from scratch; the other is that you have a cold-start phase where you bootstrap these models, teaching them how to reason with human-annotated data. Okay, thank you. Yes? It came up at various points in your talk, sometimes explicitly, sometimes a little bit in the background, that LLM
            • 51:00 - 51:30 generated code can sometimes introduce subtle bugs, right? Human-generated code oftentimes also contains subtle bugs. True. Do we have any evidence yet on whether the kinds of subtle bugs are similar? That's a very interesting question. I don't have any concrete study on that, but my assumption would be that they are similar. The reason is that these models are mostly trained on
            • 51:30 - 52:00 human-written code, so mostly the errors made by humans will be directly learned by the LM during pre-training. Yeah. Yes? So do you anticipate that a human will have to read every line of code written by an LLM in order to look for those subtle problems, maybe particularly in high-assurance scenarios? Right, so I think a human looking through each piece of code generated by LLMs would be tedious;
            • 52:00 - 52:30 it would be a lot of effort, right? So I guess the better way is that if the models could validate the correctness of their own code automatically, that would be more than ideal. We're actually exploring different ways, for example whether we could train these models to self-debug, but with the assumption that there are some unit test cases. Also, it will be very interesting to see whether the model could, say, even verify its own generations by using symbolic tools. Yeah, but I guess
            • 52:30 - 53:00 the part I'm covering is mostly self-debugging with the assumption that test cases are there. But if, say, we could generate capne code, then the compiler can directly give you some feedback on whether your specifications really align with the programs. Yeah. Yes? Some newly developed domain-specific languages might introduce some specific security issues, and we also don't have enough training data for those newly developed domain
            • 53:00 - 53:30 specific languages. That's true. Well, does LLM-aided code generation do pretty well for those domain-specific languages? And if not, does each domain-specific language need a unique training corpus? Yes, so I guess there are two questions. One is about the general capabilities of code generation for low-resource programming languages, especially the very new ones; for example, probably one year ago,
            • 53:30 - 54:00 ROS is very new, right? In these cases we do have some recipes for how to really train these models for low-resource programming languages. But from the security perspective it will be very difficult; even for the most dominant languages, like C, it is very difficult to rigorously align these models to only generate secure code. That's something we are working on. But I would distinguish between the general capabilities of code generation, which I believe we can achieve for low-resource
            • 54:00 - 54:30 languages, and secure code generation, which will be very difficult even for the most dominant languages. Yes? So in the first part of the talk you talked about training the code understanding model on a very high-quality, synthetically generated dataset that was LLM-generated. I was wondering if that makes the model very reliant on seeing very high-quality data. In reality, a lot of human-generated code is not that high quality and has different coding styles, so did you take a look at the impact of that? So
            • 54:30 - 55:00 you're asking, does the high quality of our data make a difference from the baseline models? Yeah, I guess, does it make it harder to apply to human-generated code that might have a different level of quality or style? Yes, so I guess one of the most important things nowadays is not only the quantity of data but also the quality of
            • 55:00 - 55:30 data in terms of data annotation. So definitely, if we always train larger models on high-quality data, they should always outperform models trained on more data of average quality. That's actually something we are expecting. But I guess in monologue reasoning the difference is not only the high-quality data but also the new semantics that we are adding back into the training: the approximate semantics, abstract semantics, and
            • 55:30 - 56:00 operational semantics. I think that will be the key part beyond the high-quality data curation. Sorry, I also have a follow-up. You mentioned you spend 50% of your time relying on LMs when coding and write 50% of the code yourself. Since you are an expert in LM-driven coding, which parts do you think LMs cannot help you with? Why not rely 100% on LMs? That's very interesting, right? So the short answer is that if
            • 56:00 - 56:30 we regard programming as having two modes, one is acceleration, where you have very concrete specifications about what you want to do, and LMs are pretty capable of accelerating, speeding up the whole programming practice. But the other mode is called exploration, where you probably need to design the software and do abstractions at different levels, and in this exploration mode of programming LMs cannot do that well. One of the main reasons is that we
            • 56:30 - 57:00 don't have a lot of data to train the model on software design and different layers of abstraction. So for me, I will be in charge of most of this software design, abstractions, this kind of more creative stuff, and the LMs will mostly help me speed up what I'm sure about. Another thing I would mention is that there are two extremes of LM capability. One is that they are pretty capable at high-level programming, like this kind of data
            • 57:00 - 57:30 science related Python code generation; in this kind of data science code the model is pretty capable. The other extreme is kernel-related or system-level coding, like the Linux kernel, CUDA kernels, or even compiler development, where the model cannot do that well. So I guess that really distinguishes when I will use large models and when I will not. Yeah. Thank you. If you have more questions, I
            • 57:30 - 58:00 would encourage you to find Robin and talk more about his work. And so with that, thanks, Robin, again. Okay.
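The binary-search monologue described around 24:00 to 25:30 can be illustrated with a small Python sketch. It is a hand-written illustration, not training data or code from the SemCoder work: the comments stand in for the block-level properties (precondition, invariant, postcondition), the `trace` list stands in for the forward, line-level explanation, and the hypothetical helper `backward_constraint` shows the reverse direction, where a final state is summarized as an abstract constraint rather than concrete inputs.

```python
def binary_search(arr, key):
    """Return (index or -1, forward trace of the loop states)."""
    # Precondition (entry block): arr is sorted in ascending order.
    lo, hi = 0, len(arr) - 1
    trace = []
    while lo <= hi:
        # Invariant (loop): if key is in arr, its index lies in [lo, hi].
        mid = (lo + hi) // 2
        trace.append(f"lo={lo}, hi={hi}, mid={mid}, arr[mid]={arr[mid]}")
        if arr[mid] == key:
            return mid, trace
        if arr[mid] < key:
            lo = mid + 1  # key can only be to the right of mid
        else:
            hi = mid - 1  # key can only be to the left of mid
    # Postcondition (return block): -1 means key does not occur in arr.
    return -1, trace


def backward_constraint(result):
    """Describe, abstractly, which inputs can lead to `result` (reverse reasoning)."""
    if result == -1:
        # Infinitely many inputs end here, so the answer is a constraint, not a value.
        return "arr is any sorted array and key is any value not in arr"
    return f"arr is sorted and arr[{result}] == key"
```

Calling `binary_search([1, 3, 5], 4)` returns `-1` together with a two-step trace, and `backward_constraint(-1)` gives the abstract description of every input that could have produced that result.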
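The joint attention idea described around 36:00 to 37:30 can also be sketched in plain Python. This is a toy, single-query, single-head illustration under strong assumptions and not the CoCoMIC implementation: there are no learned projections, no causal mask, and no layers, and the names `attend` and `joint_attend` are hypothetical. It only shows the shape of the computation, where each retrieved reference contributes a single summary key/value pair next to the keys and values of the in-file tokens.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector (lists of floats)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


def joint_attend(query, infile_kv, crossfile_kv):
    """Attend over in-file token (key, value) pairs plus one summary pair per reference."""
    keys = [k for k, _ in crossfile_kv] + [k for k, _ in infile_kv]
    values = [v for _, v in crossfile_kv] + [v for _, v in infile_kv]
    return attend(query, keys, values)
```

With one cross-file reference and two in-file tokens, the attention weights have three entries: the reference costs one extra position, however many tokens the retrieved file originally contained.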