Why Testing Large Language Models with Collision Simulations Is Flawed
Sfere che si scontrano: un modo errato per testare gli LLM (Colliding Spheres: A Wrong Way to Test LLMs)
Estimated read time: 1:20
Summary
In this video, Salvatore Sanfilippo critiques the recent trend of using collision simulations to evaluate large language models (LLMs). He explains how viral comparison videos, like Flavio's test of O3 mini against DeepSeek R1, fail to provide accurate assessments of LLM capabilities. Salvatore points out that judging models on a single prompt that asks for a collision simulation is misleading: such benchmarks do not reflect the models' real abilities, including their mathematical and logical potential. Although simulating physics is fun, he stresses that more structured and comprehensive testing is needed to gauge LLM performance accurately.
Highlights
- Viral benchmark comparisons between models like O3 mini and R1 are flawed. 🚫
- Who gets credit for the viral video matters less than showing why such benchmarks are useless. ✍️
- Understanding collision physics requires more than a basic simulation setup. ⚙️
- Testing LLMs demands more than one-off prompts – it requires extended evaluation and adapted contexts. 🔍
- Mathematics is crucial in programming, especially in simulations and rendering tasks. 📊
- Even failed tests illustrate the usefulness of LLMs in expediting coding processes. 🔧
Key Takeaways
- Collision simulations are a poor metric for evaluating LLMs' true capabilities. 🙅‍♂️
- Viral benchmark videos are often reposted without crediting their creators, and they offer incomplete assessments. 📹
- Proper LLM evaluation requires structured, long-term testing of varied problems. ⏳
- Physics and geometry tests highlight the need for mathematical understanding in programming. 🧮
- Flawed tests can still demonstrate LLMs' ability to save time and assist in coding tasks. 💻
Overview
In the video, Salvatore Sanfilippo criticizes the newfound approach of testing large language models with geometric collision simulations. He takes issue with viral videos like Flavio's, which compare models such as the O3 mini and R1 through simplistic tests that ultimately miss the mark on genuine evaluation. Salvatore stresses the point that while these videos may be entertaining, they do not offer a valid measure of an LLM's abilities.
Salvatore delves into the geometric and mathematical principles used in collision detection, illustrating the complexities involved in accurate simulations. He argues that testing these models under such limited conditions, usually with a single prompt, fails to reflect the models' real-world capabilities and adaptability. True evaluation would require multiple days of diverse problem-solving to tease out an LLM’s strengths and weaknesses.
Despite the flawed nature of these benchmarks, Salvatore acknowledges a positive aspect: LLMs can still be an asset in programming, aiding in quick implementations and in exploring potential solutions. Their role in learning and coding assistance is clear, even if the benchmark tests themselves are inadequate gauges of competency. Ultimately, the progress of LLMs is undeniable, but measuring it requires realistic and thorough testing.
Chapters
- 00:00 - 01:00: Introduction and Video Context Salvatore introduces Flavio, a fellow Italian, and remarks in passing that he himself feels more Sicilian than Italian. Flavio presents himself as the creator of the viral comparison video between the O3 mini and DeepSeek R1.
- 01:00 - 02:30: Flaws in Current Benchmarking Methods The chapter opens with the issue of people reposting the video online without crediting its original creator. Salvatore agrees that giving credit is good, but sets the question aside: his real concern is why this kind of benchmark is useless, regardless of who gets the credit, and he promises to demonstrate it.
- 02:30 - 04:30: Understanding Sphere Collision in Computer Science This chapter examines the two implementations from the viral video. The video presents the R1 implementation as terrible and the O3 mini one as good; Salvatore notes that the O3 mini version is better, but its ball ends up almost orthogonal to the floor without going anywhere, so both are broken.
- 04:30 - 07:00: Testing Collision Detection Models The chapter compares the collision simulations produced by the two models. Neither works: even in the better one the ball ends up almost orthogonal to 'the floor and still is not going anywhere', so which of the two is better matters little.
- 07:00 - 09:00: Evaluating Programming Prompt Strategies The chapter presents the simulation obtained with Sonnet as the first one that works. The result is less spectacular precisely because it is correct: the ball follows the hexagon.
- 09:00 - 10:30: Conclusion on Model Benchmarking The chapter notes that the working solution is a very clean implementation built on clean classes. Setting a 'restitution of one' makes the simulation nicer to look at without changing how it works. It also argues that chain of thought, while important for some problems, is not the only factor in judging whether large language models work.
Sfere che si scontrano: un modo errato per testare gli LLM Transcription
- 00:00 - 00:30 Flavio, who is Italian like me (I am more Sicilian than Italian, but I am partially Italian; you can have multiple identities), says: "Hello everyone, I am the creator of the comparison video between O3 mini and DeepSeek R1,
- 00:30 - 01:00 which is currently going viral on the internet." He says that other people basically posted the video without giving credit, which is unfortunate. I think that giving credit is great, but I don't care about these questions that they have. I just want to analyze why these kinds of benchmarks are totally useless. So whoever gets the credit, they are useless, and I will show you.
- 01:00 - 01:30 For starters, here the idea is that the R1 implementation is terrible and the O3 mini implementation is good, which is not the case. The O3 mini implementation is better, but as you can see, the ball ends up almost completely orthogonal to
- 01:30 - 02:00 the floor and still is not going anywhere. So both are broken. Closing this: it doesn't matter much if one is better than the other. For example, they tested these two, but if I run the simulation I obtained
- 02:00 - 02:30 with Sonnet, this is the first one that works. So that's what should happen, okay? It's less spectacular because it works - the ball follows the hexagon. And if you want, Sonnet provided a
- 02:30 - 03:00 very clean implementation with clean classes, it's amazing. Let's use a restitution of one, for example: it becomes nicer to look at, but it's the same thing, and this is a much better implementation. So it's not like chain of thought is the only consideration that we have to evaluate if large
- 03:00 - 03:30 language models work or not. There are problems where it is very important, and this is, to be honest, one of them. But still, we are at a stage where models not doing much chain of thought, doing just a limited chain of thought like Sonnet, can perform better. But still, the point of this video is that anyway, these kinds of comparisons are useless, and I will show you. To start, since a bit of computer science culture will not be a problem I guess,
- 03:30 - 04:00 here what we are trying to do is to understand if a line and a ball are colliding. What you do in this case is to find the perpendicular to the line passing through the center of the ball, and if
- 04:00 - 04:30 the distance between the crossing point on the line and the center of the ball - the distance between this point and this point - is more or less the radius of the sphere, then there is contact. Then based on the velocity vector of the ball, you calculate the new velocity.
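The check described above can be sketched roughly as follows. This is a minimal illustration, not the code produced by any of the models in the video: the function names are hypothetical, the wall is assumed to be a static segment (the moving walls of a spinning hexagon would also need the wall's own velocity), and `restitution` is assumed to scale the bounce, with 1.0 meaning no energy loss.

```python
import math

def closest_point_on_segment(p, a, b):
    """Foot of the perpendicular from p onto segment a-b, clamped to its ends."""
    abx, aby = b[0] - a[0], b[1] - a[1]
    length_sq = abx * abx + aby * aby
    if length_sq == 0.0:          # degenerate segment: a and b coincide
        return a
    t = ((p[0] - a[0]) * abx + (p[1] - a[1]) * aby) / length_sq
    t = max(0.0, min(1.0, t))     # stay inside the segment
    return (a[0] + t * abx, a[1] + t * aby)

def bounce_off_segment(center, velocity, radius, a, b, restitution=1.0):
    """Return the ball's new velocity, or the old one if there is no contact."""
    qx, qy = closest_point_on_segment(center, a, b)
    nx, ny = center[0] - qx, center[1] - qy
    dist = math.hypot(nx, ny)
    if dist > radius or dist == 0.0:
        return velocity           # too far away (or centered exactly on the wall)
    nx, ny = nx / dist, ny / dist # unit normal pointing from the wall to the ball
    vn = velocity[0] * nx + velocity[1] * ny
    if vn >= 0.0:
        return velocity           # already moving away from the wall
    # Reverse the normal component; the component along the wall is untouched.
    k = (1.0 + restitution) * vn
    return (velocity[0] - k * nx, velocity[1] - k * ny)

floor_a, floor_b = (0.0, 0.0), (10.0, 0.0)
print(bounce_off_segment((5.0, 0.9), (2.0, -3.0), 1.0, floor_a, floor_b))
```

With a horizontal floor, only the vertical component of the velocity is flipped, which is the behavior the geometric description implies.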
- 04:30 - 05:00 If instead you have two balls and you want to understand if they are colliding or not, you take the two centers and use the distance formula, which will be like this: the square root of (x2 minus x1)
- 05:00 - 05:30 squared plus (y2 minus y1) squared. Okay, so it's super easy - you get this distance here and what you do is to check if this distance is more or less the sum of the two radii, the first radius and the second radius. Then if it's within a given epsilon, the spheres are colliding. If it is greater than the sum, the spheres are apart. If it
- 05:30 - 06:00 is less than the sum of the radii, the spheres are one inside the other. This is trivial geometry, but it is useful in order to understand what you are asking the model to do. And after you understand if two spheres are colliding like this, and based
- 06:00 - 06:30 on their normal, basically you decompose - if this is the velocity vector of this sphere, you decompose it in two components: one is perpendicular and one parallel to the segment connecting the two centers. You just use basically the parallel component, which is what will change the velocity between the two, and the orthogonal one is completely irrelevant.
- 06:30 - 07:00 You can easily see this if you imagine two spheres - this is this velocity vector, this is this kind of velocity vector. Here we will have the normal, and even if they are barely touching, no transfer will happen of motion because there is no parallel component.
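The two-ball case described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: both balls have the same mass, the collision is perfectly elastic, and the function names and the `eps` tolerance are hypothetical choices, not taken from the video's code.

```python
import math

def classify(c1, r1, c2, r2, eps=1e-6):
    """Compare the distance between centers with the sum of the radii."""
    d = math.hypot(c2[0] - c1[0], c2[1] - c1[1])
    total = r1 + r2
    if abs(d - total) <= eps:
        return "touching"
    return "apart" if d > total else "overlapping"

def collide_equal_masses(c1, v1, c2, v2):
    """Exchange the velocity components along the line joining the centers."""
    nx, ny = c2[0] - c1[0], c2[1] - c1[1]
    d = math.hypot(nx, ny)
    if d == 0.0:
        return v1, v2                 # coincident centers: direction undefined
    nx, ny = nx / d, ny / d           # unit vector from ball 1 to ball 2
    a1 = v1[0] * nx + v1[1] * ny      # component of v1 along the center line
    a2 = v2[0] * nx + v2[1] * ny      # component of v2 along the center line
    if a1 - a2 <= 0.0:
        return v1, v2                 # not approaching: nothing is transferred
    # Swap the components along the center line; the orthogonal parts stay as they are.
    new_v1 = (v1[0] + (a2 - a1) * nx, v1[1] + (a2 - a1) * ny)
    new_v2 = (v2[0] + (a1 - a2) * nx, v2[1] + (a1 - a2) * ny)
    return new_v1, new_v2

print(classify((0.0, 0.0), 1.0, (2.0, 0.0), 1.0))
print(collide_equal_masses((0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 0.0)))  # head-on
print(collide_equal_masses((0.0, 0.0), (0.0, 1.0), (2.0, 0.0), (0.0, 0.0)))  # grazing
```

In the grazing call the first ball moves at a right angle to the line between the centers, so its velocity has no component along that line and both velocities come back unchanged.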
- 07:00 - 07:30 Okay, so yesterday I was at least okay with my cough, so I thought "okay, let's go out in Catania and stay on the balcony watching the fireworks and the Sant'Agata festivities and drink wine" and now I am sick again, but I deserve it.
- 07:30 - 08:00 I deserve much worse I believe, but anyway, that's what we are asking the model to do. Moreover, there is a problem with doing this test, or these kinds of tests, just once - it's very unlikely to produce the same result because the temperature of the model is greater than zero. So sometimes it will write an implementation, sometimes it will write
- 08:00 - 08:30 another implementation. However, I tried to implement it again with R1, and for example, this is much worse than what Flavio obtained (I'm giving you full credit).
- 08:30 - 09:00 I tried with V3 vanilla and it kind of works for a second, but it's no worse than R1.
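Why a single run says little when the temperature is above zero can be illustrated with a toy sampler. This is only a sketch: the three "logits" are made-up scores for three candidate continuations, the function name is hypothetical, and real models sample token by token over a huge vocabulary, so treat it as an analogy rather than a description of any specific model.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Pick an index from softmax(logits / temperature); greedy when temperature is 0."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                               # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.5, 1.0]   # made-up scores for three possible implementations

greedy = {sample_with_temperature(toy_logits, 0.0, rng) for _ in range(200)}
warm = {sample_with_temperature(toy_logits, 1.0, rng) for _ in range(200)}
print("temperature 0 picks:", sorted(greedy))
print("temperature 1 picks:", sorted(warm))
```

At temperature zero the same candidate is chosen every time, while at temperature one repeated runs land on different candidates, which is why the same prompt can yield a working program once and a broken one the next time.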
- 09:00 - 09:30 So what I thought was - it's already very limited as a test, and you may believe that, okay, perhaps it's limited, but still it allows me to understand if the model is good at doing these basic geometry checks and at writing the equations, the mathematical formulas needed to detect collisions and to compute the velocity. But it's not even that.
- 09:30 - 10:00 So look at what I did - I wrote another similar prompt, a bit more complex in some way: "Write a Python program that shows a configurable number of balls bouncing inside the box. The balls should be affected by gravity and friction. The program should detect collisions both with the box and among the balls. The balls will be of different sizes and the bigger the ball,
- 10:00 - 10:30 the greater its mass, and this should play a role when balls interact." Why did I write this? Let's silence this phone... I wrote this because when the balls are exactly the same, as I showed you,
- 10:30 - 11:00 basically you can just transfer the perpendicular forces between the two. Instead, if the masses are different, you have to exchange proportionally to the masses. So I asked Sonnet to do that, and the result is that - also he thought it was fun to let
- 11:00 - 11:30 me click - and as you can see, it's not working very well. You see here they are like attracted, it's completely unrealistic, there is some error of some kind. I think I even tried to tell him that there was something wrong: "There is something wrong when calculating the velocity vector after the ball collision because the fact was different in the first version.
- 11:30 - 12:00 Force is added from time to time for some reason, since the balls accelerate a lot when colliding." So I wrote a new version which is the one that I showed you now. This first bug is no longer there, but still it doesn't work. So I tried what should be the winning model according to
- 12:00 - 12:30 the test we did - O3 mini should be the best, right? I also used the O3 mini high, so the one that thinks a lot. Let's check - it's completely wrong as you can see, doesn't work.
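For reference, the mass-dependent exchange that the prompt above asks for can be sketched like this. It is a textbook impulse-based formulation offered as an assumption about what a correct answer might contain, not the code any of the models wrote; the function names are hypothetical, and how mass is derived from ball size is left to the caller.

```python
import math

def collide(c1, v1, m1, c2, v2, m2, restitution=1.0):
    """Impulse-based response for two balls with different masses."""
    nx, ny = c2[0] - c1[0], c2[1] - c1[1]
    d = math.hypot(nx, ny)
    if d == 0.0:
        return v1, v2                       # coincident centers: direction undefined
    nx, ny = nx / d, ny / d                 # unit vector from ball 1 to ball 2
    closing = (v1[0] - v2[0]) * nx + (v1[1] - v2[1]) * ny
    if closing <= 0.0:
        return v1, v2                       # moving apart: no impulse
    # Impulse magnitude along the center line; each ball's velocity
    # changes in inverse proportion to its mass.
    j = (1.0 + restitution) * closing / (1.0 / m1 + 1.0 / m2)
    new_v1 = (v1[0] - j / m1 * nx, v1[1] - j / m1 * ny)
    new_v2 = (v2[0] + j / m2 * nx, v2[1] + j / m2 * ny)
    return new_v1, new_v2

def kinetic_energy(m, v):
    return 0.5 * m * (v[0] ** 2 + v[1] ** 2)

# A light ball (mass 1) hits a heavier ball (mass 3) that is at rest.
u1, u2 = collide((0.0, 0.0), (2.0, 0.0), 1.0, (2.0, 0.0), (0.0, 0.0), 3.0)
print(u1, u2)
before = kinetic_energy(1.0, (2.0, 0.0)) + kinetic_energy(3.0, (0.0, 0.0))
after = kinetic_energy(1.0, u1) + kinetic_energy(3.0, u2)
print(before, after)
```

Comparing total kinetic energy before and after a collision is a cheap sanity check: with a restitution of one it should stay the same, so balls that speed up after colliding, as in the bug reported above, point to an error in the response step.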
- 12:30 - 13:00 Instead, guess what? R1 wrote the only correct implementation of that. So the same domain, the same kind of problems, the same language, the same domain where it looked like O3 mini had an edge, and now instead Sonnet,
- 13:00 - 13:30 even if it's very powerful, cannot nail it. O3 mini can't, and R1 is working very well. It's very nice, this kind of simulation. I did them a lot when I was learning to program. It's a lot of fun, and it's also a great way to show, especially to young programmers, why it is important to know some mathematics. "Oh, mathematics is boring, I don't need to know
- 13:30 - 14:00 mathematics, I can just code" - but without mathematics, you cannot do a lot of things, a lot of cool things. You cannot do shading, rendering, 2D simulations. So all this to say that I believe you should either do some testing, like testing the model
- 14:00 - 14:30 for two or three days with many different problems against another model and build a score matrix or something like that, and then you have done a proper, if limited, evaluation of the model - but you did something. But just throwing the first prompt that crosses your mind... Also, I will be very curious to know if just using a different prompt would produce a better result,
- 14:30 - 15:00 because if you check this prompt, it's the kind of prompt you would use in image diffusion models in order to create an image. Stay focused on the exact wording Flavio used: "Write a program, a Python program that shows a ball bouncing inside the spinning hexagon. The ball should
- 15:00 - 15:30 be affected by gravity and friction and it must bounce off the rotating walls realistically." It's a prompt more for a diffusion model. There are not enough clues. I would write like "represent the hexagon as a number of segments, then implement functions in order to find the collision between the ball and the segment" and so forth, to give maybe a bit more
- 15:30 - 16:00 context. Probably reasoning models should be able to rewrite the prompt in order to make it more suitable, but still they fail sometimes. But anyway, it's better to write a coding prompt not like you are imagining what you see on the screen but more in implementation-near terms,
- 16:00 - 16:30 assuming of course you have some idea about how it should work on the implementation side. Okay, just to say - if you want to benchmark, be prepared to do a lot of work, a lot of hard work, otherwise the results are just fun to show. It's nice, interesting to see what these models... well, you know, even these two failures are a testimony to the fact that
- 16:30 - 17:00 these new LLMs are very strong and are great from a time-saving point of view, because you can easily take this implementation and then start with a model and say "Show me how you implemented collision detection, how the force was transferred" and you get to the point that the implementation is great. It's not a super easy task, and it's not very complicated either, but in general these models are making progress. Bye!
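The "score matrix" idea mentioned above can be sketched with the only outcomes reported in this transcript. This is bookkeeping for illustration, not a benchmark: each cell is a single run as described in the video, the prompt labels and function names are hypothetical, and a real evaluation would need many more problems and repeated runs.

```python
# True means the simulation behaved correctly in the run shown in the video.
results = {
    "ball in spinning hexagon": {"O3 mini": False, "R1": False, "Sonnet": True},
    "many balls with different masses": {"O3 mini": False, "R1": True, "Sonnet": False},
}

def pass_counts(matrix):
    """Count how many prompts each model solved."""
    counts = {}
    for outcomes in matrix.values():
        for model, passed in outcomes.items():
            counts[model] = counts.get(model, 0) + int(passed)
    return counts

def winners(matrix, prompt):
    """Models that solved one specific prompt."""
    return sorted(model for model, passed in matrix[prompt].items() if passed)

for prompt in results:
    print(prompt, "->", winners(results, prompt))
print(pass_counts(results))
```

Looking at either prompt alone would crown a different model, and the aggregate over two cells still says very little, which is the point about needing many problems before drawing conclusions.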