Master the Art of LLM Fine-tuning
Advanced Data Prep and Visualisation Techniques for Fine-tuning LLMs
Estimated read time: 1:20
Summary
Fine-tuning large language models requires a comprehensive approach to data preparation. This video by Trelis Research provides a detailed guide on creating high-quality synthetic data for LLM fine-tuning. Starting with document ingestion and conversion to text, the process moves through chunking, generating question-answer pairs, and visualizing datasets to ensure comprehensive coverage. The video underscores the importance of contextualization, consistent grading, and representative evaluation data. Techniques like document summarization help to enhance context, while visualization tools are used to compare models' performance in question generation, culminating in the creation of a balanced evaluation dataset.
Highlights
- Document ingestion and conversion to markdown: Evaluate different tools for accuracy and speed.
- Importance of chunking: Balancing context with specificity in generating QA pairs.
- Visualizing datasets: Use tags and embeddings to assess model coverage and question diversity.
- Creating evaluation datasets: Strategies for balancing and rephrasing to prevent overfitting.
- Iterative QA generation ensures comprehensive coverage of content.
Key Takeaways
- High-quality data is essential for fine-tuning large language models.
- Document ingestion and conversion are the first steps, followed by chunking and question-answer generation.
- Contextualization and consistent grading ensure the data is usable for training.
- Visualization plays a critical role in understanding data distribution and model performance.
- Balanced evaluation datasets help avoid overfitting in model training.
Overview
If you're diving into the world of fine-tuning large language models, having top-notch data is your golden ticket. Trelis Research's latest video walks you through transforming ordinary documents into extraordinary training datasets. From extracting and converting documents into markdown, to breaking them into manageable chunks, every step is methodically explained.
The fun doesn't stop at data conversion. The video delves into the dreamy world of visualization, where you learn about the wonders of embedding plots and tag-based comparisons. It's about spotting trends, catching inconsistencies, and ultimately ensuring your model isn't just learning, it's flourishing!
Finally, we delve into the secret sauce of crafting an evaluation dataset that's as balanced as a tightrope walker. The discussion covers innovative strategies like rephrasing and mirroring questions to ensure that your model's training doesn't just skim the surface but truly dives deep without drowning in overfitting.
Chapters
- 00:00 - 00:30: Introduction to Fine-tuning LLMs The chapter "Introduction to Fine-tuning LLMs" discusses the initial steps required to fine-tune a large language model, emphasizing the necessity of high-quality data. The process starts with document ingestion followed by converting these into text. Subsequent steps include chunking the text, generating questions and answers from it, and visualizing the data to ensure it is comprehensive. The chapter also promises to cover techniques for efficiently creating the necessary datasets.
- 00:30 - 01:00: Data Preparation Theory The chapter introduces the topic of data preparation theory with a focus on generating synthetic data. It discusses the use of scripts from an advanced fine-tuning repository and emphasizes understanding each step in the data preparation pipeline. The narrator promises to explain the process in detail to enable building from scratch.
- 01:00 - 02:00: Document Ingestion Techniques The chapter 'Document Ingestion Techniques' covers various stages and approaches in document processing. It includes performance comparisons of converting PDFs to markdown, different chunking techniques, and the trade-offs involved. The chapter also explores question-answer pair generation methods. A significant portion of the chapter is dedicated to visualization, emphasizing its importance in understanding the distribution of questions to ensure comprehensive subject matter coverage.
- 02:00 - 03:20: Chunking Approaches In the chapter titled "Chunking Approaches," the discussion focuses on comparing the performance of cutting-edge models in question-answer generation. The chapter highlights the importance of the creation of evaluation datasets, a crucial topic for effective fine-tuning of models. Although fine-tuning is mentioned, it's indicated that the detailed process will be covered in a subsequent video. Key elements of a high-quality synthetic dataset, such as coverage, are briefly touched upon.
- 03:20 - 06:00: Question-Answer Pair Generation The chapter 'Question-Answer Pair Generation' emphasizes the importance of generating comprehensive questions for fine-tuning and evaluation purposes. It highlights the need for coverage to reflect various topics, formats, and levels of difficulty in questions. Additionally, it underscores the necessity of contextualization, illustrating that questions need to provide enough context to avoid ambiguity, such as specifying what type of 'field' is being referred to in a question.
- 06:00 - 08:00: Visualization Techniques The chapter emphasizes the importance of context in question formulation and the use of a representative evaluation dataset in training models to ensure meaningful answers and prevent overfitting. Ensuring that the evaluation dataset mirrors the training set in scope and topics is crucial for accurate performance assessment.
- 08:00 - 10:00: Creating an Evaluation Data Set The chapter discusses the importance of creating a representative evaluation data set for testing purposes. It highlights various design choices one must consider during the setup of the evaluation set. Additionally, it emphasizes the need for consistent grading when assessing generated questions and answers, advocating for a standard rubric to ensure judges grade answers reliably and avoid inconsistency in assessment.
- 10:00 - 11:00: Conclusion and Next Steps The chapter discusses the importance of consistent grading in pipelines to avoid inconsistencies. It underscores the problem where correct answers are marked incorrect due to selectivity in marking only certain 'verbatim' answers as correct. The chapter outlines a step-by-step process beginning with document handling, starting with text extraction from documents and proceeding with various approaches, including utilizing vision technology.
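The grading problem described in the last chapter entry (correct answers marked wrong because only verbatim matches count) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the example question values, and the keyword check, which is only a toy stand-in for whatever judge the pipeline actually uses.

```python
def exact_match(candidate: str, ground_truth: str) -> bool:
    """Verbatim grading: correct only if the strings match after normalising case and spaces."""
    return " ".join(candidate.lower().split()) == " ".join(ground_truth.lower().split())


def rubric_score(candidate: str, criteria: list[list[str]]) -> float:
    """Fraction of criteria met; a criterion is met if any of its keywords appears."""
    text = candidate.lower()
    met = sum(any(keyword.lower() in text for keyword in keywords) for keywords in criteria)
    return met / len(criteria)


# Made-up example values, for illustration only.
ground_truth = "The field is 70 metres long and 50 metres wide."
candidate = "It is 50 m wide by 70 m long."
criteria = [["70"], ["50"]]  # one keyword list per rubric bullet

print(exact_match(candidate, ground_truth))  # False
print(rubric_score(candidate, criteria))     # 1.0
```

The reordered answer fails the verbatim check but satisfies both rubric bullets, which is the inconsistency a rubric is meant to avoid.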
Advanced Data Prep and Visualisation Techniques for Fine-tuning LLMs Transcription
- 00:00 - 00:30 If you're going to fine-tune a large language model, what you need is some high-quality data. This means starting off probably with a set of documents and converting them into some kind of questions and answers. Now, there are various steps in this process, each of which I'll go through in this video. Starting off with document ingestion, converting documents into text, then chunking, then generation of questions and answers, then visualization of that data set to ensure that it's comprehensive, and last of all, I'll cover some techniques for creating an
- 00:30 - 01:00 evaluation data set. Now, throughout this video, I will be using scripts from the advanced fine-tuning repository, and I'll be working in the data prep folder. However, you should be able to follow along. I'll describe everything in detail if you want to build it from scratch yourself. I'm going to start off with a little bit of theory. I want to talk about the goals for generating data, what it means to generate high-quality synthetic data. I'll walk through the pipeline, just a graphic showing you the various stages. And then
- 01:00 - 01:30 I'll talk about each of those stages. So document ingestion, I want to show you the performance of different approaches to conversion of PDFs to markdown. I'll talk about chunking approaches and the trade-offs there, and about question-answer pair generation approaches. I'll then spend quite a bit of time on visualization. This is probably underappreciated and I haven't covered it enough on the channel before, but being able to see the distribution of your questions is going to be critical to ensure you can comprehensively cover your subject matter. I'm going to
- 01:30 - 02:00 compare the performance of different models, the cutting-edge models, on QA generation. And last of all, I'll talk a bit about evaluation data set creation, which is a specific but pretty important topic if you want to effectively do fine-tuning. Now, I will go further and actually do the fine-tuning, but I'm going to reserve that for a follow-on video. So, what are the goals and what does it mean to have a high-quality synthetic data set? There are a few things we're trying to achieve. The first is coverage. If I have a document,
- 02:00 - 02:30 I want to make sure that the questions I generate for fine-tuning or for evaluation cover every aspect of that document. So coverage is going to be important and it needs to reflect the topics, formats, and difficulties, different difficulties of questions that might be posed. The second goal is contextualization. It's useless if you just ask a question like how long is the field because there is no way to know what field that is referring to. Is it a rugby field? A soccer field? This is what I refer to as contextualization.
- 02:30 - 03:00 The questions must have correct context within them so that they are correctly posed and there's a meaningful answer that can come back. A third thing is to have a representative evaluation data set. If you have a big training data set, you're going to use the evaluation data set during fine-tuning to check the loss and confirm that you're not overfitting. And you're also going to check the answers on the evaluation data set. And you need this to be representative of the training set. If it just has a subset of topics
- 03:00 - 03:30 or categories, it's not going to be representative. So I'll talk about some of the different design choices in setting up your evaluation set. And last of all, and this is very related, is consistent grading. When we generate questions and answers, we then need a way to grade answers. And we want to make sure that our judge is going to be consistently marking the answers correct based on some kind of rubric. If it's inconsistent and the
- 03:30 - 04:00 judge is maybe only marking certain verbatim answers correct, then you're going to find that actual answers that are correct are being marked incorrect and this is going to give inconsistencies in your pipeline. So having consistent grading is going to be pretty important. Okay, here is the pipeline. I'll just lay it down. If you don't get something, we're going through each step. So don't worry too much. We're going to start off with some documents, extract the text from it. We'll use a few approaches, one of which will literally be to send it into a vision
- 04:00 - 04:30 supporting LLM. Then we'll have a body of text which we'll split into chunks. And we're also going to take the text from each document and generate a summary. So we'll also have a set of summaries. There'll be one summary for each document. And given these chunks and document summaries, we're going to generate pairs of questions and answers for each chunk. And when we prompt a language model for questions and answers, we're going to pass in a chunk, but also the document summary, which will help give context to the LLM. This is going to give us a
- 04:30 - 05:00 training set of questions and answers, which we're then going to visualize. And we'll visualize them in two ways. One will be by tagging each of the questions. We'll get an LLM to assign some keyword tags. And then we will show a plot of the different questions organized by tags. And the second approach is embeddings. So here we'll calculate the embedding of each of the questions and then we will again do a plot that will cluster everything so we can see the spread of data and we'll be able to visualize that and see how different language models compare in
- 05:00 - 05:30 generating question answer data sets. Then we're going to move to create an evaluation set. And the way I'll do this uh is by uniform sampling. So I'm not going to take a random subset of the training set. I'm going to make sure that it reflects the categories that are in that training set and I'm then going to rephrase those questions that are taken from the training set to avoid overfitting and that's the way I'm going to create an eval set. Now there are some other ways you can do that too
- 05:30 - 06:00 which I'll talk about but this is the main one I'm going to use. So let's get started. We'll walk through now each of these different sections. The first of which is document ingestion. So this is where you convert the documents into text. We're not even converting to chunks yet, just trying to get a document into text. And there are three methods that I'll cover. The first is a library called marker-pdf by VikP. It is probably the most accurate as you'll see and I'm going to show a demo. The second is MarkItDown by Microsoft. Very fast
- 06:00 - 06:30 and cheap, and you can just run it on your CPU. And the third option is to send in the PDFs page by page and have Gemini Flash create markdown. And this is surprisingly accurate, versatile, and also pretty cost effective. So I'll show you each of those. In fact, I'll go over now to Windsurf where I've cloned advanced fine-tuning and I'm in the data prep folder. So actually I've just cd'd into
- 06:30 - 07:00 data prep and I have a test script here. It's in the test scripts folder (I know my fonts are small, let me just increase those) and it's called PDF to markdown. And this is a simple demo script that's going to take a PDF file and it's going to use these three approaches I mentioned to convert it and then it'll calculate the time it takes for conversion. So let's see the time it takes to convert and let's see the quality of the results we get from this. So I'll cd into test scripts and I'm
- 07:00 - 07:30 going to uv run this script which is PDF to markdown and I need to pass in the path of this script here. So I'm just going to copy this and I need to wrap it in inverted commas because it has spaces in it. And it's not able to find PDF to markdown because I need to put in .py. So, it's now installing everything I've set as dependencies at the top. And it's going
- 07:30 - 08:00 to take the PDF and pass it through marker, pass it through MarkItDown, and then pass it through Gemini Flash. And we'll see what we're going to get back. Now, to run this, you do need to add in an API key because we're going to have to call Gemini Flash. You could call it directly, but I'm calling it through OpenRouter. And so, I've put in an OpenRouter API key. Actually, I think I'm reading it from a .env file which is located within the data prep folder. You can check out the
- 08:00 - 08:30 sample.env. Have we got a sample env? It's basically just OpenRouter API key is equal to that key. And you can see I actually have the syntax wrong here. So I need to paste this and I need to put PDF path and then I'll copy in the PDF path. Just this here. And finally I've put in the right command here. And I need to put in the path. It's an underscore. And we should be running now. Okay, great. So, we're running with marker. Marker does use OCR. It uses
- 08:30 - 09:00 Surya and it uses a variety of models to identify layouts and even to identify equations. So, it's very targeted in how it converts a PDF into text. But it does mean that there are quite a few models that need to be downloaded and then run. And the first time you run it, if you're running on a Mac, which I am, or on a CPU, it's going to take a bit longer because it has to download the models. So, I'd expect to see that the time for processing here is going to be a bit longer for marker just because it's got to do those downloads. Okay, so we've
- 09:00 - 09:30 got marker complete. MarkItDown was almost instantaneous and then Gemini Flash was also pretty fast. Now, in my code here, when I call Gemini Flash, I am using an asynchronous method to send the pages in parallel. So actually I have a default of 32 as a batch size. You could actually increase that. You can have a higher concurrency with Gemini. And this means that each page is being processed in parallel which will speed things up by quite a bit. Now I've got complete
- 09:30 - 10:00 parallelization because the PDF I'm converting is just 24 pages. So this is the fastest that you would be able to get. If your document has more pages than the concurrency limit, then it's going to add time because some of those conversion steps will have to run in series. But what you can see here is the time for each of these methods. Marker takes about 19 seconds. MarkItDown takes 0.8, so extremely fast. And then Gemini Flash is about 20 seconds. And this is running with marker-pdf locally. If you decided to run marker via their API
- 10:00 - 10:30 endpoint, then you would be able to get I think a few seconds in terms of response time. So it won't be as fast as MarkItDown, but it would probably be about 5 to 10x faster than using Gemini Flash. Now let's take a look at the quality that we're getting. So first we have MarkItDown. And what you can do is open the markdown as a preview. And you can see some of the problems already. There are spaces that are missing between words and that's going to be annoying when it comes to chunking because it's going to reduce the quality. MarkItDown will split
- 10:30 - 11:00 things out by page. So you can see it's getting the text from each page. But here again, for example, it's putting in this added character before the field of play. So, it's kind of getting the raw text. It is getting the raw text, but it's certainly uh making some errors in terms of the correct breakdown of paragraphs, new lines, and even just adding things together without having a space or a new line between. By contrast, if you look at Gemini, and if you take a look in the
- 11:00 - 11:30 preview mode, it's doing a slightly better job. You can see that here, for example, it's got some new lines that are more correctly being displayed. Let's scroll down a little bit more. In this section on page six, it's correctly bolding and then giving a description. So, we don't have the same issue that we saw with MarkItDown where there are characters adjacent to each other. And overall I would say the Gemini approach is pretty accurate. Probably will be a good
- 11:30 - 12:00 approach for many applications. Also it will sometimes convert images. So here in the appendix there was some text on the image. So if you look here at the very last page, sorry second last page, you can see it's the field and it's got some text on it and that text has been converted. So it's available here. Now, it's maybe not in a very useful format, but it's doing better than what, say, the MarkItDown library is doing. Now, we'll go to the gold standard, which is
- 12:00 - 12:30 looking at marker. And look how tidily this is organized. If I open up the preview now, you can see everything is very neat. Like look at this table here. It's beautifully parsed. We have a beautifully parsed table here. Another beautifully parsed table. So generally, yeah, look at this. not very nice indentation. So you can see the difference. Marker is just of much higher quality. Now it's not um
- 12:30 - 13:00 it's not going to convert the image. That's just because it's it's not designed to convert the image. So that's something to take into account. It will optionally if you set a flag, it will optionally allow you to inject images in here. So, if you built a more advanced pipeline, you could actually forward the markdown text and you could forward the images um as part of chunks when you're generating questions. For now, I'm just dealing with text. But anyway, this should give you a clear picture of how much better it is to use uh marker versus the other ones in terms of quality and just highlight how Gemini is
- 13:00 - 13:30 a good option. It's very versatile. You can send in any document and it will do a better job than just using MarkItDown. Now, I've summarized here. Sorry, this is very blurred. These are the same results that I showed you. Actually, marker will be a little bit faster if you run it again because all the models will have been downloaded. As I mentioned, marker via API, I think would be 2.4 seconds. They state 10 pages per second. You can also consider the cost. So, with marker-pdf via the API, though of course it's free if you run it on your
- 13:30 - 14:00 own computer, it's $3 per thousand pages. Gemini Flash I calculated works out to about half a dollar per thousand pages. So, looks a bit cheaper. And MarkItDown, since you're running locally, is going to be basically free. That brings us to chunking approaches. So, we have now our text in the form that I just showed you. It's a big markdown file, and we want to break it into chunks so that we can ask questions or so that we can generate questions for those chunks. Now, there are a few ways that you can break up into chunks. And the basic principle
- 14:00 - 14:30 is that you would like the chunks to kind of reflect natural groupings that a human would think exist, like paragraphs or at least sentences. Sentences are the first level of detection that you typically want to use in making chunks. And there are broadly two ways to do that. One is a regular expression where you just search effectively for text with a period at the end. It's a little more complicated than that, but that's basically it. And the other is to use a library with a small neural net. There's an NLTK library I've used in previous
- 14:30 - 15:00 videos. It's a bit slower, uses more compute. They're very small models though, so it's really not that slow, and it's a little bit more accurate in identifying sentence boundaries. So, if you look here in the pipeline, and I'm going now to my full pipeline, in fact, I've got a few scripts. I've got a first script for ingesting that's going to ingest any documents I put in my data folder and it's going to use the marker library. The next one is then chunking. And for chunking, I want to show you now
- 15:00 - 15:30 how I do the regular expression based extraction. And here it is. So split into sentences. And you can see it's looking for this pattern here. And that's how we're detecting sentences. So I'm using this. It's a little bit quicker and seems to work fairly well. Now, if you want to go a step further than just recognizing sentences, it can be good to also recognize tables. So you can do that also on a regular expression basis for tables, for CSV, probably for other formats, too. I'll show you here uh
- 15:30 - 16:00 recognizing tables. You might have seen me glance past it, but you can set up some kind of regular expression that's going to extract tables. However, note that this is only going to work if you've got good conversion to markdown in the first place. So, it probably won't work with MarkItDown. It probably won't work with Gemini, but it probably will work if you use marker-pdf because you'll then be able to cleanly extract the tables. And I'll show you how that works nicely. In fact, I'll run that script in just a moment.
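The regular-expression approach to sentences and tables described above can be sketched roughly as follows. This is not the script from the repository: the pattern, the function names, and the rule that a table line starts with a pipe character are all assumptions chosen to keep the example small, and real markdown will need more care (abbreviations, decimals at sentence ends, tables without leading pipes).

```python
import re

# Assumed pattern: a sentence ends at ., ! or ? followed by whitespace
# and then an uppercase letter, digit, or opening quote.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9\"'])")


def split_into_sentences(text: str) -> list[str]:
    """Split plain text into sentences using the boundary pattern above."""
    return [part.strip() for part in SENTENCE_BOUNDARY.split(text) if part.strip()]


def split_text_and_tables(markdown: str) -> list[tuple[str, str]]:
    """Split markdown into ("text", body) and ("table", body) segments.

    A line counts as part of a table if it starts with "|" (an assumption
    that holds for cleanly converted pipe tables).
    """
    segments: list[tuple[str, str]] = []
    current_kind = None
    current_lines: list[str] = []
    for line in markdown.splitlines():
        kind = "table" if line.lstrip().startswith("|") else "text"
        if kind != current_kind and current_lines:
            segments.append((current_kind, "\n".join(current_lines).strip()))
            current_lines = []
        current_kind = kind
        current_lines.append(line)
    if current_lines:
        segments.append((current_kind, "\n".join(current_lines).strip()))
    # Drop segments that were only blank lines.
    return [(kind, body) for kind, body in segments if body]


sample = "Intro text.\n\n| Item | Size |\n|---|---|\n| Field | 70 m |\n\nClosing text."
for kind, body in split_text_and_tables(sample):
    print(kind, repr(body))
```

Keeping each table as its own segment means it can later be placed in a single chunk, while the text segments can be passed through `split_into_sentences` before grouping.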
- 16:00 - 16:30 Now, something else to think about on chunking is what size of chunks to use. So, you've got the sentences, you've got the tables, and what's your strategy now for defining the chunks? And a typical strategy is to set a minimum length in terms of tokens or characters and then a maximum length and group the neighboring sentences together. Now, the bigger the chunks you make, the better the context. If you have a chunk that's the full document when you're generating questions, then the model's going to know that full context for any question it poses, and it's less likely to be
- 16:30 - 17:00 missing key information. On the other hand, it's less targeted for question generation. So, with a very large chunk you're more likely to get questions that cover some part of it, while the questions generated don't cover the entire chunk. So, this is the high level trade-off in your size of chunks. Now you can improve contextualization regardless of chunk size by feeding in a summary of the entire document so that at least you have some kind of context and that is something I recommend and it's something
- 17:00 - 17:30 I also do and I have a script here that's called summarize and quite simply it takes in a full document and it uses Gemini which is a very long context model. So you can see I'm using Gemini Flash 2. 2.5, I think, is out in preview mode. I've set a temperature and this here is basically just going to create a summary. So, it's going to get the files to summarize and then generate a summary. Here's my prompt. Capture the main ideas and key points. Preserve important details and examples.
- 17:30 - 18:00 Maintain the logical flow. Be concise while retaining the essential information. And here's the text to summarize. And it will return that summary. And that summary then is going to be used when we move to question generation, which I'll talk about in a little bit. But first, I just want to talk about uh chunking and I want to show you how to run that chunking script. Uh in fact, if you like, what I can do is go ahead and run all of the scripts to date. So, I'll run ingestion, I'll run summarization, and then I'll run chunking. First of all, I'll need to
- 18:00 - 18:30 initialize an environment and add my dependencies. I've already done that. So, all the dependencies are listed in the TOML file here. You won't need to do it if you're cloning this repo, but you will if you're creating this from scratch. Then I'm going to run my ingestion script. And optionally I could pass force. That will just reingest anything because I've actually already run this before. I need to be in the right folder. Run ingestion. And probably it's just going to say that the document's already been
- 18:30 - 19:00 ingested. And yeah, it has. So it's skipping using marker to convert. And now we've got the extracted text which we can take a look at. It's in data and it's under text. And you can see here it should be the high quality text that was extracted by marker. And indeed it is because it's all very well formatted. Now we're going to generate a summary. I'm going to make sure to add in python-dotenv because I need to call OpenRouter which is going to
- 19:00 - 19:30 make use of my Gemini model. By the way, OpenRouter is kind of a cool service. You can call any model from it. It does charge a 5% premium, but one of the benefits is you can create a distinct API key that has a spend limit, which is pretty nice, especially if you're working with agents. So, just a tip on OpenRouter there, or open router, I suppose I should say, in Ireland. And so, after setting my OpenRouter key in my environment variables here, and actually I said sample.env,
- 19:30 - 20:00 but actually I have a .env.example. And here's what that looks like in my .env file. And now I'm going to configure the summarization settings. So what I've done is I've set up this whole repo so everything can be configured through a configuration file. And that configuration file is located in config. By default it will run config.json but you can also, when you're running any script, pass in a custom configuration. So if you look here, here's my default config and I'm able to control a few
- 20:00 - 20:30 things for summarization. I have a fairly low temperature. I've got a max tokens, the top p, top k, and min p. When it comes to chunking, then I've got the min length, and I've got the max length of chunks. And it's going to try and generate longer lengths of chunks. So, it's really going to end up with chunks closer to this 5,000. Except I am going to allow each table to take up its own chunk. So yeah, we'll see how that forms when we run the
- 20:30 - 21:00 script. And yeah, I'm in a position now where I can run the summary. So I can do uv run utils summarize. And if I want, I could pass in a config file. And here, yeah, I've just run with the default configuration file. And it's creating a nice summary. It's got about 2,000 output tokens for that summary there. So that's the summarization. And next, we're going to move to the chunking. And for chunking, I'm going to go back and just show you again
- 21:00 - 21:30 what's in the configuration file. I've got a min length, a max length, and yeah, I've already talked about that. So, I think I can go ahead and run the chunking. Note that you may need to add a dependency here for regular expressions. And then we can run uh this chunking here. And chunking does not use a model, by the way, because it's just using uh those regular expressions. So, I could run force if I wanted. and it's created 11 uh segments into 12
- 21:30 - 22:00 chunks. So, because I'm identifying tables, it's able to identify text segments and then it's able to identify tables. So, in total, that's coming out to well, it's 11 if it's text segments plus tables, but it's basically splitting at least one of those, well, one of those exactly into two chunks to give 12 chunks, probably because it's over 5,000 uh tokens in length. So, let's take a look at the chunks. We've got chunk one, which looks to be just uh some text. Chunk two, some text. Uh
- 22:00 - 22:30 chunk three is some text here. And you can see it's neatly finished off with the end of a sentence. Chunk four, chunk five. Chunk five is quite short, probably. Um yeah, chunk five. Let's see. It may not be that short. Yeah, it's not all that short. It's just it was spilling over to the right. Chunk six, seven. And if we move on, we can see that the tables uh the table here has got its own chunk. So for the field
- 22:30 - 23:00 of play and then here these tables have got their own chunk as well. So I like the idea of keeping uh the tables within the same chunk because typically stuff within a given table is going to be related to itself. So I like that aspect of design. All right. So we've got our chunks generated now. And the next thing we're going to do is question answer pair generation. Now let me describe the high level of how this works and then I'll describe some different approaches. So the prompt used is something like this. This is a
- 23:00 - 23:30 pseudo-prompt. Given the following document summary, we inject the summary. And the following chunk of text from that doc, we inject the chunk. Generate a series of questions and answers relying solely on knowledge from that chunk. So we don't want the questions and answers to refer to other knowledge. Otherwise, it's not going to be well grounded and could hallucinate. Now ensure that the question describes the context. This is also very important because if the question does not give the context which is touch rugby in this case then it may be difficult to know
- 23:30 - 24:00 what the answer should be for the length of the field or the shape of the ball. Then I recommend giving the format required for response and giving a one-shot example. Now there's one more tip that I recommend as well which is putting in a line here for reference. the context is and then brief context and this would just be some manual context that you specify maybe in the configuration file and I think I can show you that if I go to the configuration file right here you can
- 24:00 - 24:30 see that in QA I have this manual field which is provided for generating each question in addition to the document summary. So you can imagine if you had multiple documents, this is going to be provided for all documents whereas the summary will only be provided for chunks from its own document. Okay, so this is the brief outline of how we will generate questions. But there are a few things to think about in the nitty-gritty here. The first thing to
- 24:30 - 25:00 think about is judging. Ultimately, even though we're just generating data here, we are going to partly judge the quality or determine the quality of the data by actually running evals by running evaluations and then judging the answers. And so we need a robust judging approach. And there are two broad ways of doing this. One is to just have questions and answers or ground truths. But the better approach I recommend is to have questions, evaluation criteria and then answers. Now, the answers will
- 25:00 - 25:30 be used for fine-tuning, but the evaluation criteria, this will be the bullets or the rubric that are used for grading the answer. And the reason I like this is because having a rubric is generally more robust than trying to match a ground truth answer. The problem with matching a ground truth answer is that often a generated answer could be correct. It's just formatted differently or in a different order to the ground truth. It's difficult to consistently grade exactly the same if you're just relying on a ground truth answer. So instead of just asking for questions and answers, we're going to ask for
- 25:30 - 26:00 questions, evaluation criteria, and answers. Second point is something I've said before. Contextualization is key. That's why we add in the optional uh well, the required context phrase. I guess it's optional if you put it in as blank. Now, a big question here or a big matter in generating QA is to figure out how many questions you want per chunk. And one approach is to just give a fixed number uh ask for a fixed number of QA per chunk. The problem is this is kind of inflexible because if you have a very
- 26:00 - 26:30 big chunk you will need more questions and if you have a smaller chunk you will need less and this is not very dynamic. So a more dynamic approach is to ask for as many questions as the LLM thinks it needs to cover the meaning and content of that chunk. And this is a good approach. This is what we're going to do. But actually, if you do this and you take the questions and feed them back and ask for more questions, you will often get more questions, which indicates that there often are still gaps even if you ask for all of the content to be covered. So, as a
- 26:30 - 27:00 solution, we're going to iterate and we're going to iterate until the language model is not able to find any more QA pairs because it deems the content to have been covered. Okay. Now I've talked about generating questions and answers whereas actually what I meant was generating questions, eval criteria and answers. But actually what I mean is we're going to generate not just questions, eval criteria and answers, but also a question category, which is going to help us determine
- 27:00 - 27:30 whether it's a reasoning or a fact-based question and we're also going to have the LLM try and estimate the difficulty of the question. Now I would be very skeptical of the quality of the difficulty. I haven't run a test to measure actual difficulty versus perceived or projected difficulty by the LLM, but it's something maybe you can include and I've got it in there for now. Probably doesn't do much harm and I would need to test it out further to say that it's really meaningful. But ideally, this would be nice to have if it is meaningful because then you can use it for grouping your questions and
- 27:30 - 28:00 grouping your eval set a bit later on. Okay, so we're in a position now to run some of this QA and I'll just go through some of the parameters. So we have to define a model, a temperature, max tokens, top P, top K, min P. These sampling limits make sure that when you've got a non-zero temperature, you're not able to pick wild tokens that could throw the answer off. We've got our context and I've also set this up so that we have batching so that we'll run in parallel with many requests to speed things up. You can
- 28:00 - 28:30 choose to skip tables, not generate questions for tables, but that seems uh a bad idea. So, we're not going to skip the tables. All right. So, we'll move now past the chunking on to QA generation. So, I'm going to go down here and find a script. So, here we are. And there are a few options for generating questions. First is we can pass a custom configuration file. We can force regeneration. We can run in test mode. We can only process a given document.
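The prompt and the iterate option described here can be sketched roughly as follows. This is a minimal illustration, not the script from the video: `build_prompt`, `iterate_qa` and `make_stub_generator` are hypothetical names, the prompt wording is a paraphrase of the pseudo prompt, and the LLM call is replaced by a stub so the loop can run on its own.

```python
from typing import Callable

# Assumed shape of one generated item (keys taken from the transcript):
# question, evaluation_criteria, answer, category, difficulty.
QAPair = dict


def build_prompt(summary: str, chunk: str, context: str, existing: list[QAPair]) -> str:
    """Assemble a pseudo prompt: manual context, summary, chunk, questions so far."""
    covered = "\n".join(f"- {qa['question']}" for qa in existing) or "(none yet)"
    return (
        f"For reference, the context is: {context}\n\n"
        f"Document summary:\n{summary}\n\n"
        f"Chunk:\n{chunk}\n\n"
        f"Questions already generated:\n{covered}\n\n"
        "Generate further questions, evaluation criteria and answers, relying "
        "solely on knowledge from the chunk. Each question must state its context. "
        "Return nothing if the chunk is already fully covered."
    )


def iterate_qa(
    summary: str,
    chunk: str,
    context: str,
    generate: Callable[[str], list[QAPair]],
    max_iterations: int = 10,
) -> list[QAPair]:
    """Keep asking for more pairs until none come back or the cap is reached."""
    pairs: list[QAPair] = []
    for _ in range(max_iterations):
        new_pairs = generate(build_prompt(summary, chunk, context, pairs))
        if not new_pairs:  # the model deems the chunk covered
            break
        pairs.extend(new_pairs)
    return pairs


def make_stub_generator(batches: list[list[QAPair]]) -> Callable[[str], list[QAPair]]:
    """Stand-in for a real LLM call: returns the prepared batches one by one."""
    remaining = list(batches)

    def generate(prompt: str) -> list[QAPair]:
        return remaining.pop(0) if remaining else []

    return generate


# Placeholder content only; a real run would return model-written pairs.
stub = make_stub_generator([
    [{"question": "In touch rugby, what does this chunk say about the field of play?"}],
    [{"question": "In touch rugby, what does this chunk say about the ball?"}],
    [],
])
pairs = iterate_qa("<document summary>", "<chunk text>", "touch rugby", stub)
print(len(pairs))
```

Passing the already generated questions back into the prompt is what lets each new iteration look for gaps rather than repeat itself, and the cap plays the role of the max iterations setting.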
- 28:30 - 29:00 And we can also iterate. This is what I talked about. But iterate means we will generate questions, pass them back and see if the language model wants to generate more. So let's run in test mode. And let's run with iteration like this. And it's not iteration, it's iterate. And everything has been generated already. So I'm going to add force. So what's happening here is it's found a document to process and it's now generating for iteration one and it's
- 29:00 - 29:30 generated five new pairs and now it's going to uh move to the next iteration iteration two and it didn't generate any pairs so it's going to stop iterating. Now it's going to move to chunk uh two. So it's moving to chunk two. It's doing one iteration generated 11 pairs. Uh it's now going to move to iteration two generated seven. Iteration three generated two pairs. And yeah, the number of iterations that we have is currently set to uh three at max. But
- 29:30 - 30:00 what I don't like is uh we didn't have enough iterations. And the reason for that is because if I go up to the config file here and if I check within QA, we can actually add in a flag that will set the max iterations. So we can add in this uh flag here max iterations of 10. Let's just uh paste here. Save. And if we run again and we run with
- 30:00 - 30:30 iterate and then force it should do something similar. Now there is some stochasticity here. So we may find that it runs out of questions earlier. But for the first chunk it's generating um six questions. It has no new questions on the second iteration. For the second chunk, it's generated 13 questions on the first iteration. On the second iteration, it's generated three. On the third iteration, it's generated three. So, it's clearly finding a lot more room for
- 30:30 - 31:00 questions. The fourth iteration is generating three. On the fifth iteration, it's generating two. The sixth iteration is generating three. So, it's clearly finding a lot of pockets. This is quite a large chunk. If we check chunk number two, so potentially, especially if I expand it like this, potentially quite a lot of uh room for questions to be asked. And yeah, it's even getting as far as uh 10 iterations. Now, generally, I find that for a given chunk with a max length of 5,000, within 10 iterations,
- 31:00 - 31:30 you're likely to get to the end and it will have zero on a given iteration. So, that's why I recommend a config value of about 10. We're going to go ahead and take a look at the data. Take a look at the QA here. And you can see the generations all combined for the first chunk here. So here are our six questions. You can see them combined then for the second chunk here of which there are
- 31:30 - 32:00 many more. And notice how each one has got questions eval criteria, answer uh difficulty and then a category uh for the question as well. Okay. So now we have a series of uh questions. We would want to run that then for the full data set. So I've run it just on test which will only run two chunks but we would need to run that on the full data set in order to get questions and answers uh that are covering our full document. So now we have the question and answer pairs and what we want to do is maybe
- 32:00 - 32:30 visualize what they look like and compare if we generate question answer pairs with different documents because uh not with different documents but different models because we want to see well should you use o3, should you use Gemini Flash, what's the best model to use for question generation. And there are two ways that we can visualize the questions. The first is an embeddings-based approach where we calculate the embedding for each question and we plot them. And the second is a tags-based approach. So I'm going to go now to the embed viz
- 32:30 - 33:00 folder which is right here. So cd into embed viz. And within this folder there's another readme. Let me just close down all the other ones. And embed viz allows me to do a few different types of visualization. It's going to let me create embeddings visualize them within a single data set. Visualize train versus eval splits. uh compare different uh splits, different data sets rather and then also take a tag based approach where we generate tags with
- 33:00 - 33:30 Gemini flash and then compare the data sets based on tag distributions and visualizations. Now you might be wondering why didn't we generate tags when we generated the questions and answers. And the reason is because you want the same model to generate tags across all the data sets so that the tagging is consistent because different models will assign tags differently. And if you have a model tag itself, it's probably not going to categorize them in the same way as the other models would. So you're going to have tag
- 33:30 - 34:00 distributions that are not uh consistent. So that's why I recommend using a single model for embeddings across all of the data sets and a single model for tagging across all of the data sets. Now there are scripts here that you can run if you want to create some embeddings. You can run the generate embedding script. It's just going to use the Nomic ModernBERT model. It will take each question, embed it locally actually here, and it will save that embedding and then it will be able to use the embeddings to generate a plot of
- 34:00 - 34:30 the comparison. Now, if you want to compare embeddings, the comparison script will actually run the generation of the embeddings in the background. Now, the second set of scripts here is around visualization. So, if you've generated embeddings, you can then visualize them by passing in a folder uh that contains the embeddings. But in practice, it's easier rather than run these scripts separately, it's easier to
- 34:30 - 35:00 just run a comparison script that wraps the generation and the visualization. So the script I have for that is this uh compare embeddings here. And I'm able to just run this script and actually what I can do is point it to a Hugging Face data set. So I'm going to go over to Hugging Face. I'm just going to search here for Trelis touch rugby and I'll sort by recently created and what I've done is create a number
- 35:00 - 35:30 of data sets using different models. So I've created them with Flash, with Pro (Gemini Pro), and one with o4-mini. So here for example if I look at o4-mini I can copy that data set and I can paste it here. And why not compare one more data set as well. So I'll go back. Let's take a look at the Gemini Pro. And by the way, you can see the two chunks at the end. This just means that I'm only
- 35:30 - 36:00 generating questions for uh two chunks, which is a restricted data set. It just makes it easier to visualize by taking a smaller data set. So let's copy paste this one and let's try visualize. Now, this actually will work, but uh there's an extra flag I want to add in, which is interactive. And by adding the interactive flag, I'm going to be able to run or open an HTML file. And the HTML file is going to
- 36:00 - 36:30 allow us to interactively inspect the results. So, here's the embedding comparison results. I'm going to copy the path to the HTML file, and I'll open that up. Okay, so this is the comparison based on embeddings. You can see we've got the o4-mini model and we've got the touch rugby Pro model. And this is a comparison done using t-SNE. It's um a method built on the Student's t-distribution, and it's
- 36:30 - 37:00 basically collapsing the multi-dimensional data down into two dimensions uh so that we can visualize it. And what you can see broadly is two things here. There are more green dots than purple dots. Um, so the Pro model is generating more questions than the o4-mini model in this case. But you can see that roughly speaking the dots are kind of covering the same kind of space at least in these two dimensions here. And what this tells us is that basically the models are getting a reasonable amount of
- 37:00 - 37:30 coverage uh across the data sets across the documents rather which is a good sign. If you had a particularly weak or a biased model, what you'd see is just the dots appear within a certain part of the graph. And that would be a concern because it might indicate that your coverage uh is not great. So ideally, what you would want to see is when you run a large number of models, you want to see that uh you want to pick a model that's covering basically what any model is able to cover. So if you wanted to go
- 37:30 - 38:00 a step further, you could show here uh flash. You could show here some of the other models like R1. In fact, maybe I should just go ahead and do that. Let's take a look at R1. Let's take a look at flash. Let's take a look at sonnet. So, what I'll do is I'll rerun my script. And I'm going to put on here R1. I'll put on here
- 38:00 - 38:30 flash. I'll put on here sonnet. And I'll make it interactive. And that should show five sets of dots. I think I've got the script supporting at most five right now. So if I open up that file, you can see here the five different uh models. So this is answering the question of which model should you choose to generate questions and answers. And I would say looking on this basically all of the models are fairly
- 38:30 - 39:00 performant. Uh the flash model definitely generates fewer questions. There are fewer yellow dots, so potentially giving you less dense coverage. You can see that the Gemini Pro model has got very good coverage. The o4-mini model has also got a pretty good coverage. Maybe down here, there aren't any o4-mini points. So, you could say that's maybe a little bit of a drawback. Also, Gemini Pro doesn't get down here. You can check. So, these questions, who holds the copyright for touch rugby rules? um what copyright
- 39:00 - 39:30 restrictions apply to the touch rugby. So you can see those questions are related. Well, they're close to each other. Also, you can see that within a given data set, there aren't too many overlapping questions or overlapping points, which indicates that you aren't duplicating information. So here you have: what range of expertise and perspectives were represented in the group that developed uh the rules, and here it's a question on process and timeline. So even though they're close, they're definitely not a replica.
- 39:30 - 40:00 So basically I would say any of these models, rather, are probably going to be fairly strong. If anything, probably uh include the Pro model or probably make use of maybe the o4-mini model. They may give the best performance in terms of question generation. Now that was the demo in terms of embeddings. You can also run a comparison with uh with tags instead of embeddings. So let's take a quick look at how that works. So if we're looking at tags, it's going to be a pretty
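The eyeball check for overlapping points can also be approximated numerically. The sketch below is an assumption-laden illustration rather than anything from the video's scripts: it compares embeddings with cosine similarity in the original embedding space (not the 2D t-SNE projection), the function names are hypothetical, the 0.95 threshold is arbitrary, and the three-dimensional vectors are toy values standing in for real embeddings.

```python
import math
from itertools import combinations


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def near_duplicates(
    questions: list[str],
    embeddings: list[list[float]],
    threshold: float = 0.95,  # assumed cut-off; tune per embedding model
) -> list[tuple[str, str, float]]:
    """Return every pair of questions whose embeddings are at least `threshold` similar."""
    flagged = []
    for (i, emb_a), (j, emb_b) in combinations(enumerate(embeddings), 2):
        similarity = cosine_similarity(emb_a, emb_b)
        if similarity >= threshold:
            flagged.append((questions[i], questions[j], similarity))
    return flagged


# Toy data: the first two vectors point almost the same way, the third does not.
toy_questions = ["question A", "question A, reworded", "question B"]
toy_embeddings = [[1.0, 0.0, 0.0], [0.99, 0.05, 0.0], [0.0, 1.0, 0.0]]
duplicates = near_duplicates(toy_questions, toy_embeddings)
for first, second, similarity in duplicates:
    print(f"{first!r} ~ {second!r}: {similarity:.3f}")
```

A short list of flagged pairs supports the observation that questions sitting close together on the plot are related without being replicas; a long list would suggest deduplicating before training.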
- 40:00 - 40:30 similar procedure. We have a script for generating tags. And once we have run the tags, we're then able to um run a comparison script, which I think also will call the script for making the tags. So what I'm going to do is let's just copy paste this here. And let's copy paste our list of models cuz I want to run all of the models. So yeah, I actually need to copy this piece
- 40:30 - 41:00 here. And then I'm going to copy from my previous command the names of all of these uh models. And you can control as well the minimum or maximum number of tags that you want to include. Um I will just put uh the default for max tags is five. Let's put that a bit bigger. And this is the max tags that are being shown in the plot. So let's run it like this. And just while this is running here, and note
- 41:00 - 41:30 that you can set which model is generating the tags. If you go to the config file, and if you scroll down to the tagging section, you can set the temperature, which I've set low, and I'm using GPT-4o mini. Um, you can use any model. You could use Gemini as well if you wish. So, it's going to run the generate tags on each of the data sets. It'll pass in each of the questions. We can take a look briefly at what that script looks like. Generate tags. And we should just
- 41:30 - 42:00 quickly check what the prompt involves. So, here's the system prompt. You're a helpful assistant that generates concise, descriptive tags. Generate exactly max tags. So, that's where the number of max tags comes into play that capture the key topics, concepts, and skills. Testing the question. Each tag should be one to three words long, lower case with hyphens between words. So we've now generated the tags and you can see that the comparison will generate two plots. It'll generate one which is
- 42:00 - 42:30 going to be very much like the plot we checked earlier. It's a reduced dimensionality plot down to two dimensions. And you can see the five different models here. So again, we're seeing a pretty good spread in the tags. There is this one area maybe where Gemini is creating some tags that other models aren't. And if you want to visualize it a different way, it's a little bit more intuitive. What you can do is check uh you can sort the tags according to the you can basically
- 42:30 - 43:00 do a histogram of the tags. So if I copy the path here and open this, you see now I have a histogram across the data sets. And this is an easy way to check whether you have uniformity in tags and it might indicate you want to generate more questions or tweak your question generation prompt to cover more on certain uh on certain topics. So here for example football rules sports equipment and you can see in green we've got pretty good coverage of all the tags that's in um pro 2.5.
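The histogram comparison boils down to counting tags per data set and looking for tags that one model never produces. Here is a small sketch under stated assumptions: the helper names are hypothetical, the tag lists are invented toy data (the video's real tags come from the tagging model), and plotting is left out so the example stays dependency-free.

```python
from collections import Counter


def tag_counts(tags_per_question: list[list[str]]) -> Counter:
    """Count how often each tag appears across all questions of one data set."""
    return Counter(tag for tags in tags_per_question for tag in tags)


def coverage_table(datasets: dict[str, list[list[str]]], max_tags: int = 10) -> dict[str, dict[str, int]]:
    """For the most frequent tags overall, list the count contributed by each data set."""
    counts = {name: tag_counts(rows) for name, rows in datasets.items()}
    overall = Counter()
    for counter in counts.values():
        overall.update(counter)
    top_tags = [tag for tag, _ in overall.most_common(max_tags)]
    return {tag: {name: counts[name].get(tag, 0) for name in datasets} for tag in top_tags}


def missing_tags(table: dict[str, dict[str, int]]) -> dict[str, list[str]]:
    """Map each tag to the data sets that never produced it."""
    gaps = {}
    for tag, row in table.items():
        absent = [name for name, count in row.items() if count == 0]
        if absent:
            gaps[tag] = absent
    return gaps


# Toy input: one list of tags per question, grouped by the model that wrote the question.
toy_datasets = {
    "model-a": [["touch-rugby", "game-rules"], ["touch-rugby", "fit-rules"]],
    "model-b": [["touch-rugby", "game-rules"], ["sports-equipment"]],
}
table = coverage_table(toy_datasets)
gaps = missing_tags(table)
for tag, row in table.items():
    print(tag, row)
print("not covered:", gaps)
```

The point of the comparison is not a flat distribution but spotting tags that some models reach and others miss, which is what `missing_tags` surfaces.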
- 43:00 - 43:30 Now we don't necessarily want this to be a flat distribution because we want it to reflect whatever the reality of the document is. So, I'm not looking for it to be flat. What I'm comparing is the ability of the different models to cover all of these tags. For example, if you look at FIT rules, um that particular tag is not appearing so often for the flash model and it's also not appearing so often for the sonnet model. So, just a few examples where those two models are not creating that tag, but
- 43:30 - 44:00 all of the models are including tags on touch rugby, sports regulations, game rules, touch football. So that is pretty much it on tags. Uh you have two ways now to visualize embeddings tags. What you're checking for is when you run multiple models. You ideally want the model you choose to cover pretty much what any model could reach. It's kind of like saying well imagine I asked five people to generate questions. They all might come up with slightly different questions. And ideally your full data set you want to cover that. You could even consider combining the data sets
- 44:00 - 44:30 from the different models, uh deduplicate if needed, but even the deduplication needed is not going to be that drastic because you can see because of sampling there actually aren't any very closely matching questions. Like here's what are the key uniform requirements and this one is what specific requirement applies to the numbering system on player uniforms. Uh so even these two are not particularly close even though they are related here on the graph. So with that I'm going to cover the last
- 44:30 - 45:00 topic which is related to data set evaluation or rather evaluation data set and this is important for two reasons. One is if you do training you want to know whether your training has worked and you want to know whether it works on a data set that is not an exact replica. So there are a few approaches I'll cover here and then I'll show you the scripts. The first is to create a random split from training data. And this is the typical way that people generate eval splits. It's often what I
- 45:00 - 45:30 recommend as well. But it does have some drawbacks. And the key drawback is that if you remove examples in the training set that was aiming to cover a comprehensive set of data, you are going to remove data that you really want for training. And that's a problem because you're creating holes uh in the training set. Another problem is if you just randomly take a split from the training data like 10 or 20%. that random split probably won't be balanced across topics or difficulties or question categories in the same way as the training split
- 45:30 - 46:00 is. So you could perform well on the eval split but you're still performing badly on some category of the training split. So an alternative approach which is maybe a little bit better is to clone a balanced subset. So you take your training set and you take a subset of that maybe 10 or 20%. But you balance it across the tags for example or you balance it across the embeddings. Now if you're cloning it you are not going to be measuring uh overfitting
- 46:00 - 46:30 because you are literally taking uh some of the examples from the training set. But actually, that's kind of good because if you're trying to teach the model some knowledge, it's natural that there should be some questions that are going to appear in the training that you actually care about it getting right in that verbatim form. So doing this cloned balanced subset, you're going to be able to measure verbatim learning, which is good. The drawback is you're not going to do a good job at measuring overfitting. So the question is how can you get a balanced eval set that also
- 46:30 - 47:00 measures examples that generalize a little bit more than just verbatim copies and one way to do that is to clone but then rephrase the questions and answers. So think now about having a data set taking a balanced subsplit balanced across embeddings or tags and now get an LLM to rephrase the questions and answers and that's going to give you a data set that is not going to be verbatim and should allow you to measure if there is overfitting or not. And what I mean specifically there I'll talk
- 47:00 - 47:30 about in the finetuning video but broadly as your loss goes down for training you want to see your eval loss goes down too. If your eval loss um is basically staying flat and your training is going down, that means you're probably just overfitting to the exact examples rather than training the model more broadly. Now, there's one more alternative uh which is that instead of cloning and rephrasing a subset, you could just generate an entirely new data set. You could even generate it maybe with another language model. Although a
- 47:30 - 48:00 problem there is that you probably won't have the same category balances. So using a different model for eval split maybe isn't the best way to do it. But you could just create a second data set with the same main model which is a model that does well on covering comprehensive questions and then take a balanced subsplit of that. Now if you're generating a very large synthetic data set you might not want to do that because it means you have to generate as many training questions as eval questions and ultimately you're only going to use a subset of those eval questions. But it is uh just an option that would provide the properties of
- 48:00 - 48:30 being balanced and also not overfitting because when you regenerate a big subset or a big set you're going to have uh because the temperature is a factor it's going to be slightly different because of that. Next um I'm going to show you a script just to implement this. So if I go back and we're going to move back outside of visualization into the main data prep folder and um sorry I did rename this folder I think since earlier in the video I've renamed it visualization instead of embed viz. It's just for clarity because the
- 48:30 - 49:00 visualization is not just embeddings it's of tags as well. So I'm back in the data prep folder and now I want to run a script to generate a final data set. So, I'll go back to my main readme, not the readme in the visualization folder, the readme that's in the data prep folder, and I'll go to data set creation. And this is going to allow me to create a data set, save it locally, but also push it up to hugging face. And I'm going to do that first. I need to make sure to
- 49:00 - 49:30 add some dependencies here like huggingface_hub, um scikit-learn, matplotlib, and then uh this is already done if you're running it as a clone because I've got it added to the TOML file. You want to log into Hugging Face. So you can do the Hugging Face Hub login via uv run, and then we're going to run create data set, pass in a configuration file and we can pass in this flag here to create an eval split and the eval split is going to be um it's just going to be cloned but it's
- 49:30 - 50:00 going to be balanced based on embeddings. Now there are some even more advanced uh features here. If you do use eval split, it will naturally select 20% of the data set or 32 examples whichever is lower. It will sample proportionally from each cluster identified by the elbow method using uh using embeddings. I I'll briefly mention how the elbow method works in a moment. And there's one more flag that you can create which is eval mirror. And what this does is it
- 50:00 - 50:30 allows you to create a second eval set that is not rephrased. It is just literally a subset that is uh verbatim copied. And that's going to leave you with a training set, an eval set where it's been rephrased, and then an eval set where it's been mirrored. And by doing this, you'll see it in the fine-tuning video, we'll be able to distinguish whether the learning is overfitting or whether it's generalizing. If it's overfitting, we'll see the model performing well on the
- 50:30 - 51:00 mirror set, the eval mirror set, but it will not perform as well on the eval set that's been rephrased. Whereas if it's learning in a generalized way, then you should see the performance of the eval mirror and the eval split improve together. And basically the two different eval splits by running them separately, we're going to be able to tell whether we've kind of overtrained. If we've overtrained, we're going to see that the eval mirror performance will continue to increase, but the eval split where there's rephrasing will not be increasing. Uh so
- 51:00 - 51:30 that will kind of allow us to tell if we've trained for too many epochs, for example. All right. So I'm going to run this here and I am going to include this flag uh for rephrasing QA which means that the eval split we generate will actually be rephrased. If I further add eval mirror, it will just add an extra data set split that will be called eval mirror. So let's in fact run this uh create data set. Let's run it with the eval split. Let's make sure the eval split is rephrased and then let's
- 51:30 - 52:00 also run an eval mirror. And you do need to be logged into hugging face with a token that has write permissions if you want to push up to hugging face hub. Also you need to specify here your organization and the name of the data set you want to push. Um further you can decide whether you want it to be public and you can also set a seed here. that seed is used in determining uh the split. So we're just pushing up the data set now. You can see there are three
- 52:00 - 52:30 different splits. Train, eval, eval mirror. And if we push that up uh to Hugging Face Hub, we can take a look. So let's go to Hugging Face Hub. I know I still have to explain the elbow method, but here we have a training set. Then we have an eval set. Now it'll be maybe hard for me to search but every one of these uh rows here has been rephrased. So if we look at the first row actually and we look at the question, what's the rationale behind FIT's
- 52:30 - 53:00 recommendations for members to incorporate specific elements, and then we look at eval mirror and we look at this question, why does FIT encourage its members to offer features in local competition rules. So that's actually the same question but it's been rephrased. Um in fact this one here is verbatim from the training set and the one that we have in the eval set has been rephrased. So that's what these uh different splits do. They should each be representative. The eval set should be roughly representative in distribution of the training set but
- 53:00 - 53:30 it's rephrased to avoid overfitting. And then the eval mirror set is also representative of the train set but it's a verbatim copy. So you can measure verbatim learning. Um and if performance on it improves a lot more than on the eval split with rephrasing, you know you're probably overfitting. So just very briefly on the elbow method, how does the clustering algorithm work? Well, broadly it will create a 2D map like I've been showing you. Let's actually go to the 2D map here. And it will start then to create
- 53:30 - 54:00 clusters. So it will cluster these points here for the data set. This is showing five data sets but for one data set it will cluster them according to distances and it will start off with a low number of clusters like maybe two and then it will repeat the process for three, four, five. And what you will see is that the overall, it's called uh inertia, but it's basically a sum of squared distances, the inertia or the deviation from a perfect representation, that deviation
- 54:00 - 54:30 will fall. So as you've more clusters, you're able to better represent the data. So you're getting a falling deviation, but at some point by adding more clusters, the rate of improvement of that deviation starts to kind of flatten off. So you start off, you add more clusters and your representation of the data gets better and better, but at some point beyond a certain amount of clusters, adding more clusters doesn't actually help that much more. And that's called an elbow. There's basically an elbow in the improvement. And that's the point where you typically
- 54:30 - 55:00 pick your say optimal number of clusters and you use that as a basis then for determining clusters that are going to provide balance to your data set. So that is an overview of data preparation. I am going to follow up with a fine-tuning video where I will show you how using a comprehensive data set like this outperforms using a naive data set with simpler chunking and simpler question and answer generation. I will go through the full training with
- 55:00 - 55:30 Unsloth on uh one of the more recent models, Gemma. I'll also cover a bit on the Mistral model and the script will work too with uh Llama models. Um it should even work with Llama 4 if you have enough uh GPU memory. I'll put all of the links below in the description. You can find the scripts here, accessed via the repo that's called advanced-finetuning. It's on Trelis.com. And in the meantime, if you have any questions, let me know down below in the comments. Cheers.
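The eval split selection described earlier (pick a number of clusters at the elbow, then take 20% of the data or 32 examples, whichever is lower, sampled proportionally from each cluster) can be sketched as below. This is an illustration under assumptions, not the create data set script: the function names are hypothetical, the inertia values are made-up numbers standing in for a real clustering run, and the "largest bend" rule is just one simple way to locate an elbow.

```python
import random
from collections import defaultdict


def pick_elbow(inertia_by_k: dict[int, float]) -> int:
    """Pick the k where the drop in inertia slows down the most (largest bend)."""
    ks = sorted(inertia_by_k)
    best_k, best_bend = ks[0], float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        gain_before = inertia_by_k[prev] - inertia_by_k[k]
        gain_after = inertia_by_k[k] - inertia_by_k[nxt]
        bend = gain_before - gain_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


def eval_size(n_examples: int, fraction: float = 0.2, cap: int = 32) -> int:
    """20% of the data set or `cap` examples, whichever is lower."""
    return min(int(n_examples * fraction), cap)


def balanced_eval_indices(cluster_labels: list[int], size: int, seed: int = 0) -> list[int]:
    """Sample `size` row indices, proportionally to each cluster's share of the data."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        by_cluster[label].append(index)

    total = len(cluster_labels)
    shares = {label: size * len(rows) / total for label, rows in by_cluster.items()}
    quotas = {label: int(share) for label, share in shares.items()}
    # Hand out the seats lost to rounding, largest fractional remainder first.
    leftover = size - sum(quotas.values())
    by_remainder = sorted(shares, key=lambda label: shares[label] - quotas[label], reverse=True)
    for label in by_remainder[:leftover]:
        quotas[label] += 1

    chosen = []
    for label, rows in by_cluster.items():
        chosen.extend(rng.sample(rows, quotas[label]))
    return sorted(chosen)


# Made-up inertia values: big gains up to k=4, small gains afterwards.
toy_inertia = {2: 100.0, 3: 60.0, 4: 30.0, 5: 27.0, 6: 25.0}
chosen_k = pick_elbow(toy_inertia)

# Made-up cluster assignment for 50 questions: 30, 15 and 5 per cluster.
toy_labels = [0] * 30 + [1] * 15 + [2] * 5
eval_rows = balanced_eval_indices(toy_labels, eval_size(len(toy_labels)), seed=42)
print(chosen_k, len(eval_rows))
```

With the same indices, a mirror split would copy those rows verbatim and a rephrased split would pass them through an LLM first; the fixed seed keeps the selection reproducible between runs.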