Instructlab in 20 minutes! (Or: How to train your LLMs in style?)


Summary

In this video, Maximilian Jesch looks at the capabilities of small language models and at InstructLab, an open-source project that helps manage training data and create synthetic training data. Jesch explains how smaller models are closing the gap with larger models like GPT-3.5. InstructLab is presented as a tool that organizes data through a taxonomy approach, creates synthetic data using a teacher model, and improves training through a multiphase method. Jesch gives a hands-on demo that generates synthetic data and then trains a small model on knowledge about the 2024 Oscars.

Highlights

Small language models are growing in demand as they become more capable and efficient. 🚀
InstructLab provides a structured way to manage training data, using the categories knowledge, foundational skills, and compositional skills. 🗂️
Creating synthetic data with a teacher model can streamline the process and make model training more cost-effective. 💡
The demo illustrated training a small language model with up-to-date Oscar information, leveraging synthetic data. 🎬
This innovative approach is tailored for addressing specific workflow challenges and fine-tuning language models for niche applications. 🎯

Key Takeaways

Small language models are becoming increasingly efficient and are closing the gap with larger models. 📈
InstructLab helps manage and create data for training smaller models, using a taxonomy approach. 📚
The project is based on a research paper, Large-Scale Alignment for Chatbots (LAB), so its methods are grounded in science. 🧠
Generating synthetic Q&A pairs using a teacher model can save substantial time, making training more efficient. ⏱️
The demo showcases practical application of these concepts, teaching a model about the 2024 Oscars. 🏆

Overview

Maximilian Jesch introduces small language models and the open-source project InstructLab. The tool is designed to help developers manage and create training data, making small models more effective for specific jobs. Enormous models like GPT-4 will keep their place, but as applications based on large language models go mainstream they run into challenges of scalability, price, and speed, and smaller models are proving their worth there.

InstructLab rests on three pillars: managing data through a taxonomy approach, generating synthetic data, and multiphase training. The methodology comes from the LAB research paper and offers a systematic route to training models. Training data is sorted into structured categories, a large teacher model generates synthetic data from it, and the small model is then trained on knowledge first and skills second.
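The training order mentioned above (knowledge first, then skills) can be sketched in a few lines. This is only an illustration of the ordering idea, not InstructLab's training code: the function name plan_phases, the category labels, and the sample dictionary keys are assumptions made for this example.

```python
# Phase order as described in the talk: knowledge first, then skills.
# The category labels mirror the three taxonomy categories; the sample
# keys "category" and "question" are hypothetical.
PHASES = [
    ("knowledge", {"knowledge"}),
    ("skills", {"foundational_skills", "compositional_skills"}),
]


def plan_phases(samples):
    """Group samples into training phases and return them in phase order."""
    known = set().union(*(categories for _, categories in PHASES))
    for sample in samples:
        if sample["category"] not in known:
            raise ValueError(f"unknown category: {sample['category']}")

    plan = []
    for phase_name, categories in PHASES:
        batch = [s for s in samples if s["category"] in categories]
        if batch:  # skip empty phases so a knowledge-only run still works
            plan.append((phase_name, batch))
    return plan


samples = [
    {"category": "compositional_skills", "question": "Quote the source for this claim."},
    {"category": "knowledge", "question": "When did the 2024 Oscars happen?"},
    {"category": "knowledge", "question": "Who hosted the 2024 Oscars?"},
]

for phase_name, batch in plan_phases(samples):
    print(phase_name, len(batch))
```

A real training run would hand each batch to a trainer in turn; the sketch only shows why the category labels matter for scheduling.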

In a detailed demonstration, Jesch shows how to teach a model about the 2024 Oscars using InstructLab. Starting from a handful of seed questions and answers plus a context document, a teacher model creates synthetic Q&A pairs, which are then used to train a small model on the new information. Jesch notes that factual knowledge is usually better handled with retrieval-augmented generation, and that fine-tuning of this kind is most useful for the long tail of specific problems in large language model applications.
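To make the seed-example idea concrete, here is a minimal sketch of how a context document and a few seed Q&A pairs could be combined into a prompt for a teacher model, and how the reply could be parsed back into pairs. Everything here is an assumption for illustration: the prompt wording, the Q:/A: line format, and the fake_teacher stand-in are not InstructLab's actual prompts or API, and the real pipeline adds several evaluation steps.

```python
def build_teacher_prompt(context, seed_examples, n_new=3):
    """Assemble one prompt from the source text and the seed Q&A pairs."""
    lines = ["Use only the following text:", context.strip(), "", "Example pairs:"]
    for pair in seed_examples:
        lines.append(f"Q: {pair['question']}")
        lines.append(f"A: {pair['answer']}")
    lines.append("")
    lines.append(f"Write {n_new} new question and answer pairs in the same format.")
    return "\n".join(lines)


def parse_pairs(teacher_output):
    """Parse 'Q: ...' / 'A: ...' lines into dicts, dropping incomplete pairs."""
    pairs, question = [], None
    for raw in teacher_output.splitlines():
        line = raw.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question:
            pairs.append({"question": question, "answer": line[2:].strip()})
            question = None
    return pairs


def fake_teacher(prompt):
    """Hypothetical stand-in for a call to a real teacher model."""
    return (
        "Q: Who hosted the 2024 Oscars?\n"
        "A: Jimmy Kimmel hosted the ceremony.\n"
        "Q: A question the teacher never answered?"
    )


context = "The 2024 Oscars were held on March 10th, 2024. Jimmy Kimmel hosted."
seeds = [
    {
        "question": "When did the 2024 Oscars happen?",
        "answer": "The 2024 Oscars were held on March 10th, 2024.",
    }
]

prompt = build_teacher_prompt(context, seeds)
synthetic_pairs = parse_pairs(fake_teacher(prompt))
print(synthetic_pairs)
```

The parser silently drops the trailing unanswered question, which is one small example of why generated pairs still need a human review pass before training.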

Chapters

00:00 - 01:00: Introduction to Small Models and Instructlab. The chapter introduces the significance of small models, emphasizing their speed, affordability, and ability to run almost anywhere. It notes that the field of large language models is becoming more diverse: massive models like GPT-4 and Llama 405B remain relevant, but in the second half of 2024 large language model applications are becoming part of everyday life. This widespread adoption introduces challenges related to scalability, cost, and speed.
01:00 - 02:00: Capabilities of Small Models. This chapter focuses on the rapid advancement of smaller language models. While the largest language models are starting to plateau, smaller models are improving quickly and beginning to handle tasks that required a model like GPT-3.5 a year earlier. Models with as few as 7 billion parameters are becoming increasingly capable.
02:00 - 03:00: Managing Training Data with Instructlab. The chapter discusses the growing demand for small language models and how they can be specialized for specific tasks by training them on custom data. It introduces InstructLab, an open-source project that helps manage training data and generate synthetic data, which is then used to train small models.
03:00 - 04:00: Taxonomy Approach in Instructlab. The chapter presents the main InstructLab repository, which has 600 stars and 124 contributors, primarily from IBM and Red Hat. The repository features an extensive README that explains how to use the project.
04:00 - 05:00: Generating Synthetic Data. The chapter covers the origin of the project in a research paper titled 'Large-Scale Alignment for Chatbots', abbreviated 'LAB'. It notes that chaining large language models together to do impressive things is easy, while getting reproducibly good results is hard.
05:00 - 06:00: Demo Setup and Model Configuration. The chapter introduces InstructLab as a tool for handling personal and company-wide training data, based in hard science and engineered to be efficient and easy to use. The explanation of InstructLab is broken down into three main concepts, beginning with data management.
06:00 - 08:00: Adding Custom Data and Initial Model Evaluation. The chapter describes the common practice of gathering knowledge from sources such as a company wiki or online platforms and saving it in text or Markdown files. This works until the project grows or team members change, at which point the collection becomes hard to maintain.
08:00 - 10:00: Generating and Validating Synthetic Data. The chapter describes how the usual approach ends in a large pile of data that is of little use. To address this, InstructLab proposes a taxonomy approach that breaks training information into three categories: knowledge, foundational skills, and compositional skills. The differences between these categories are explained in detail in the paper.
10:00 - 12:00: Training the Model. This chapter explains how the taxonomy gives every piece of information a place where others can find it and check whether it has already been added. The approach makes it easier for teams to collect data together, although this was probably not the primary goal of its creators.
12:00 - 14:00: Evaluating the Trained Model. The chapter relates the taxonomy to the data-management pain of large customers and explains its original purpose: creating synthetic data and using it for training. Because the data is categorized, knowledge, foundational skills, and compositional skills can each be handled differently.
14:00 - 16:00: Applications and Benefits of Instructlab. The chapter describes the different handling of categories as the 'secret sauce' of InstructLab. It then makes the abstract picture more tangible with an example Q&A YAML file, noting that most things represented in InstructLab follow this question-and-answer format. The same example is used in the demo later in the video.
16:00 - 19:00: Conclusion and Contact Information. The chapter describes the taxonomy as a fancy term for a folder structure, in this example knowledge, textbooks, culture, movies, awards, Oscars, with a Q&A YAML file at the end of the path. The chapter wraps up by providing contact information for further queries or engagement.
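The idea that a taxonomy is a folder structure with a Q&A YAML file at the leaf can be sketched as follows. The folder names come from the example in the talk; the YAML field names and the helper functions are simplified assumptions for illustration and are not the authoritative InstructLab schema, which requires more fields.

```python
import json
import tempfile
from pathlib import Path

# Folder names taken from the example in the talk.
TAXONOMY_PATH = ["knowledge", "textbooks", "culture", "movies", "awards", "oscars"]


def render_qna(seed_examples):
    """Render seed pairs as YAML text (illustrative fields only).

    json.dumps produces double-quoted strings, which are also valid
    YAML scalars, so no third-party YAML library is needed here.
    """
    lines = ["seed_examples:"]
    for pair in seed_examples:
        lines.append(f"  - question: {json.dumps(pair['question'])}")
        lines.append(f"    answer: {json.dumps(pair['answer'])}")
    return "\n".join(lines) + "\n"


def write_qna(root, seed_examples):
    """Create the taxonomy folders under root and write qna.yaml at the leaf."""
    leaf = Path(root).joinpath(*TAXONOMY_PATH)
    leaf.mkdir(parents=True, exist_ok=True)
    qna_file = leaf / "qna.yaml"
    qna_file.write_text(render_qna(seed_examples), encoding="utf-8")
    return qna_file


seeds = [
    {
        "question": "When did the 2024 Oscars happen?",
        "answer": "The 2024 Oscars were held on March 10th, 2024.",
    }
]

with tempfile.TemporaryDirectory() as root:
    path = write_qna(root, seeds)
    print(path.relative_to(root).as_posix())
    # prints knowledge/textbooks/culture/movies/awards/oscars/qna.yaml
```

Because the location encodes the topic, a teammate can check whether the Oscars are already covered by looking for the folder rather than searching through a pile of files.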

Instructlab in 20 minutes! (Or: How to train your LLMs in style?) Transcription

00:00 - 00:30 Small models are amazing. They're fast, they're cheap, they run pretty much anywhere, and they get a lot of jobs done pretty well. The tech world is definitely becoming a lot more diverse when it comes to large language models. There will always be a place for those enormous models, for those GPT-4s and Llama 405Bs, but right now, in the second half of 2024, large language model based applications are becoming more and more mainstream. They are actually becoming part of our daily lives, and with that they start facing the challenges of scalability, of price, of speed, and of simply having to deal
00:30 - 01:00 with all the different things that people want from them. And while the super large language models are slowly but surely starting to plateau, the small large language models are closing the gap really quickly; they are getting better really fast. At this point you've got 7 billion parameter models, which are really tiny, that can do the job you needed GPT-3.5 for just a year ago. Because they are so small, it actually is
01:00 - 01:30 feasible to train them on your own data and make them really good at the job that you need them for. All of that taken together makes the demand for small large language models grow super fast, and that's where InstructLab comes in. InstructLab is an open-source project that helps you manage the data that you need for training, create new synthetic data, which is super powerful, and then use that data to train a small model. So let's check it out real
01:30 - 02:00 quick. There is this project, and it's got its main repo, which is the main InstructLab repo. It's pretty popular by now: 600 stars and a lot of people working on it, 124 contributors, mainly from IBM and Red Hat. You can see that there is a pretty extensive README there, and that pretty much explains everything you need to do. It's important to know that this
02:00 - 02:30 does not just come out of the blue; it's actually the manifestation of a research paper called Large-Scale Alignment for Chatbots, and somehow they came up with the acronym LAB for that, I don't know. As everyone who's working with large language models knows, it's really simple to just throw together a lot of chains and make large language models do cool stuff, but doing it in a way that reproducibly produces good results is really
02:30 - 03:00 challenging, and you really need a scientific method for that. That's what InstructLab is: it is based in hard science, and it's engineered to be efficient and easy to use. So now let me explain to you what InstructLab does in my own words. I want to break it down into three main concepts. Number one is a way to handle your personal or company-wide training data in a very opinionated way
03:00 - 03:30 that makes it easy to keep track of it. The common way it's done right now is you just take a bunch of knowledge from, I don't know, your company wiki or just some online sources, dump it into a bunch of .txt or Markdown files, and throw it on a big pile. That works pretty well as long as you, the person who did this, are still in the project and still kind of remember what happened. But as soon as your project grows, as people come in and out,
03:30 - 04:00 this concept simply collapses and you just end up with a huge pile of more or less useless data. To do that differently, InstructLab proposes what they call a taxonomy approach, and they break every piece of training information up into three categories: knowledge, foundational skills, and compositional skills. I don't want to go into a lot of detail about what the differences are; that's explained in great detail here in the paper or in the
04:00 - 04:30 other repos. But the idea is that every piece of information finds its place in this taxonomy, and somebody else can find it again and check if this knowledge has already been added. It also gives you, out of the box, a Git-like approach to dealing with data. And actually, the way that it makes it easier for teams to work together on collecting data has probably been an afterthought for the authors of this paper. That's
04:30 - 05:00 something that comes to my mind because I work with large customers all day, and that's the pain that everyone is fighting with. But the original reason why it's done like that is because it's super powerful when you want to go to the second and third step of the system that I'm explaining to you, and that is to create synthetic data and to use that synthetic data for training. Because then you can do different things for different categories: you can deal with knowledge differently than with foundational skills, and differently than
05:00 - 05:30 with compositional skills, and that turns out to be kind of the secret sauce here. So this picture is nice, but it's kind of abstract. There's a really nice example; I will actually do a demo at the end of this video, and we'll be using that very example, and that's a bit more tangible. This is an example of a Q&A YAML, and most of the things you represent in InstructLab are in this Q&A format: you've always got a question
05:30 - 06:00 and an answer. And this Q&A snippet lives in this folder hierarchy, so at the end of the day taxonomy is a really fancy term for a folder structure. In this case the folder structure is knowledge, textbooks, culture, movies, awards, Oscars, and this is our Q&A YAML here. Additionally, in the case of knowledge, you always have
06:00 - 06:30 a repo here that gives the context. If we check this out, this is a very large Markdown file; I think it's simply taken from Wikipedia. This will be used in the generation of synthetic data, together with the Q&A parts, to create a pretty large amount of synthetic Q&A pairs. But I'm getting ahead of myself here, because now I'm already talking about the second pillar of those three
06:30 - 07:00 pillars, and that is the synthetic data generation. The main idea is you take a corpus of knowledge, in this case the Markdown file that we've just seen, and then you manually create a set of so-called seed examples or seed Q&A pairs: just some questions and answers about this corpus of text that represent the kind of knowledge that you want to get out of it. And then InstructLab uses this
07:00 - 07:30 text and those questions, together with a model, preferably a really large one, what they call a teacher model, to create a large number of synthetic question and answer pairs. This is a really powerful trick. Running Llama 405B is super expensive; it's a gigantic model and it costs a lot to run. But that way you only have to run it once, for like an hour or two, and create the synthetic data that you want, data that
07:30 - 08:00 would otherwise take weeks or months to create manually, and use that to train your small model. It's actually pretty genius. We will see what that means in practice in a few minutes when I show you the demo. The third pillar is what the authors of the paper call multiphase training, and it's a pretty clever training approach that's mainly supposed to be very efficient, and particularly effective against catastrophic forgetting, that is,
08:00 - 08:30 this phenomenon where you learn something new and forget something old for it. I've been told that's not actually a thing in humans, but it really is a thing in large language models. I'm not going to go into detail here, but the key idea is that you leverage this taxonomy approach and you first train the knowledge and then train the skills, plus a lot of fine details to it. That is the secret trick, which is actually not secret because it's here in the paper, but
08:30 - 09:00 that is a lot more efficient than just throwing the whole pile onto QLoRA, which is actually what this is under the hood, and hoping for the best. This actually works really well. So let's jump to the demo. At this point you pretty much need an Apple machine with an M1, M2, or M3 chip, or a Linux system, preferably with a more or less powerful GPU. I have a Windows system with a crappy GPU, so what I ended up doing is to
09:00 - 09:30 simply rent a VM and use that as a demo environment. You can very well do that on Google Colab; I just really like Lambda Cloud. I'm not affiliated with them in any way, but it's just a really good service for a really reasonable price. At this point I think I have an A6000, which costs me like 80 cents an hour. I have a repo here where I have a very simple notebook,
09:30 - 10:00 and there we go. So first things first, let's do some installs. This is specifically 0.17; obviously, when you watch this video in the future, try to use the newest version available. For some reason I had to reinstall llama.cpp to a specific version, otherwise I couldn't get the GPU to work. This is not exactly supported, but well, it works for me. The last command here in the list is ilab system info,
10:00 - 10:30 that gives us a summary of the configuration, and if you look here it says GPU offload: true, so we have full GPU support, which is pretty relevant since we are running a pretty substantial A6000 GPU. Next we configure the models that we want to use, and I highly recommend going with a beefy model. The defaults suggest that you use Granite or Merlinite, I don't know exactly, as a teacher model, but I don't
10:30 - 11:00 think that's a very reasonable choice. I mean, do you want to use a preschooler as a teacher to teach another preschooler? I don't know. It definitely works a lot better when you use a college graduate model, so a big one, Mixtral 8x7B, as a teacher, and I would highly recommend that. Here we do some configuration of the InstructLab CLI, nothing too exciting; you can go into detail if you want, but this works for now. And here we are actually
11:00 - 11:30 downloading the models. This ilab model download, kind of the default, downloads Merlinite 7B. That is a version of Mistral 7B that has already been trained using the InstructLab method, so this is already an improved version of Mistral 7B, and this will be the model that we will be further training with our specific knowledge. And here we download Mixtral 8x7B. We download that from the Hugging Face repository; we can
11:30 - 12:00 actually use the InstructLab CLI for that, which is kind of convenient, but it's a pretty substantial 20 GB, so this will take a moment. So much for the generic setup; now let's get specific. We add our own data. What I've chosen, simply because there is a really good example of it in the InstructLab taxonomy repo, are the 2024 Oscars. They are obviously way beyond the knowledge horizon of pretty much any model, just because they're pretty
12:00 - 12:30 new: March 10th, just a few months old. They're definitely not within the knowledge corpus that model was trained on, so if we check the raw model, which we will do in a moment, it performs miserably on those questions. As I've mentioned before, the way we present data to InstructLab is highly opinionated. You don't just dump information in there; you have to explicitly provide seed examples, which are always Q&A pairs. For
12:30 - 13:00 example: 'When did the 2024 Oscars happen?' The answer: 'The 2024 Oscars were held on March 10th, 2024.' Those are the questions; in that example there are 12 of them. But the real knowledge is hidden down here. I've shown you this before, and it is simply the Wikipedia article about the 2024 Oscars. This will be the base, the context that will be handed to our
13:00 - 13:30 teacher model. Together with some of those example questions, our teacher model will come up with a lot of additional Q&A pairs to cover this whole topic, and those Q&A pairs will be the main thing that will be used to actually train the model later. I just created two terminal windows and started a model server in one terminal and a chat window in the other, as described here. And this is using the very raw,
13:30 - 14:00 naked Merlinite model that does not know anything about the 2024 Oscars at this point. So let's just try and see what output we get. It's going to be horrible, I promise. 'When did the 2024 Oscars happen?' OK: 'took place on February 24th.' Yeah, that's just hallucination; unusable, unfortunately, actually worse than unusable. 'What film had the most Oscar nominations in 2024?' Let's try
14:00 - 14:30 that one. Also going to be horrible: 'The 2024 Oscars are yet to take place.' Well, that's actually kind of true from the perspective of that model, so I would count that as a semi-win. 'What film won the Oscar for Best Picture in 2024?' 'Best Picture has not been announced yet.' Well, again, that's actually pretty true from the perspective of the model. So you get the point: it's not able to answer those questions, but we are going to change that. Let's run this
14:30 - 15:00 cell. Oh, and there it goes: generating synthetic data, 500 iterations. We do have a bit of a version issue here, but it works fine. So here is our first example: 'Who presented the awards for Best Visual Effects and Best Film Editing at the 2024 Oscars?' 'Arnold Schwarzenegger presented the awards.' And there we go, a lot of warnings, and here's the next one,
15:00 - 15:30 and the next one, and that way we're going to generate 500 Q&A pairs. It's going to take a while. The reason is that there's actually some pretty sophisticated prompting happening behind the scenes here. You can dive deeper when you look into the paper, but it's actually generating suggestions, then evaluating those suggestions, then creating answers, then evaluating those answers and their factuality. There are quite a lot of steps happening before this Q&A pair pops out. It's just not trivial to create high-quality
15:30 - 16:00 synthetic data, and InstructLab does that for us. So this has been running for like an hour, and those are some of the examples that came out of it. We've already seen a few of them. One thing I want to point out: that one's actually incorrect. Generating synthetic data is not a hands-off process; it's a human-in-the-loop process. You have to check those synthetic data points. It's not guaranteed that they will be correct; most of them are, but some of them are not, and it's important that you catch them. But for us that's good enough, so
16:00 - 16:30 we're going to run the training. Oh, this might take some time. One thing I want to point out here: for the number of epochs the default is one, and for iterations the default is 100. In my experience those defaults are way too low; these numbers worked pretty well for me in this example. This is going to go on for quite some time, even though the training part is actually surprisingly fast; it's probably going to take, I don't know, 20 minutes or something. After about 30 minutes we are done with our training, and at the end it
16:30 - 17:00 saved our trained model here, and now we can evaluate it. We've got our server here again, and here we run our chat window. I limit it to 100 tokens because the model for some reason developed the habit of just talking on and on. It's kind of weird, but if you limit it to 100 tokens it's really good. So let's try a few things. 'Who won the most awards at the 2024...' The film
17:00 - 17:30 Oppenheimer, five awards. Well, that sounds pretty good. 'Who hosted the...' let's see... Jimmy Kimmel. Oh yeah, there we go. It has kind of developed the habit of citing sources; I don't know exactly where that comes from. It's probably due to the way that the synthetic data is presented to the model, so there are a lot of ways to improve this. Let's ask one more question: 'When were...'
17:30 - 18:00 March 10th, 2024, Dolby Theatre in Los Angeles. That sounds pretty good. Again it's trying to quote some sources, which is kind of nice, but you might want to add a stop sequence here. So yeah, we just successfully taught our model about the 2024 Oscars, and it actually was not all too hard. That was a fun toy problem, and well, it worked. But in reality you would not solve a problem like that with fine-tuning. For getting factual knowledge into a large language model based system,
18:00 - 18:30 you use RAG, retrieval-augmented generation. It's easier, and you can easily change the information that the system is retrieving. Where this fine-tuning approach really shines is solving the long tail of problems that you have with large language model applications. If you have an agentic workflow and there are things that workflow gets stuck at, and you know the problems, you can generate data to solve that exact problem and create a model that is really good at
18:30 - 19:00 solving your workflow. Or, for example, if you've got German language; I'm German, and large language models always suck at German. It's always the same. Other languages are even worse, but it's just way harder to get a large language model to do anything in German. And, for example, if you want the model to quote properly; we've seen that model do quotations even though we didn't want it to, but that's something you could teach a model with a skill, a compositional skill, and make it capable of doing this task really well. This might be crucial for your application. I
19:00 - 19:30 hope you learned something, and as always, if any of this was interesting, reach out to me on LinkedIn, Twitter, YouTube, wherever you want to reach me. I'm always happy to chat. Thanks, bye.
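The demo mentions two ways to tame the trained model's habit of talking on and appending sources: a 100-token limit and a stop sequence. Below is a minimal post-processing sketch of that idea, assuming the reply is already available as a string. The function name, the "Source:" marker, and the use of whitespace-separated words as a rough stand-in for tokens are all assumptions for this example; in practice these limits are usually set on the generation request itself.

```python
def truncate_reply(text, stop_sequences, max_words=100):
    """Cut a reply at the earliest stop sequence, then cap its length.

    Words are used as a rough stand-in for tokens.
    """
    cut = len(text)
    for stop in stop_sequences:
        if not stop:  # an empty stop string would match at position 0
            continue
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    words = text[:cut].split()
    return " ".join(words[:max_words])


reply = (
    "The 2024 Oscars were held on March 10th, 2024. "
    "Source: a citation the model added on its own."
)

print(truncate_reply(reply, ["Source:"]))
# prints: The 2024 Oscars were held on March 10th, 2024.
```

Trimming after the fact hides the symptom; the talk suggests the underlying habit probably comes from how the synthetic data is presented during training, so cleaning that data would be the more durable fix.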