Why Testing Large Language Models with Collision Simulations Is Flawed

Sfere che si scontrano: un modo errato per testare gli LLM (Colliding spheres: a flawed way to test LLMs)


    Summary

    In this video, Salvatore Sanfilippo critiques the recent trend of using collision simulations to evaluate large language models (LLMs). He explains how viral comparison videos, like Flavio's test of O3 mini against DeepSeek R1, fail to provide accurate assessments of LLM capabilities. Salvatore points out that judging models on a single task, such as writing a collision simulation from one prompt, is misleading. The video's primary argument is that these kinds of benchmarks do not truly reflect the models' abilities, especially when considering their mathematical and logical potential. Despite the fun aspect of simulating physics, he emphasizes the importance of more structured and comprehensive testing methods to accurately gauge LLM performance.

      Highlights

      • Viral benchmark comparisons between models like O3 mini and R1 are flawed. 🚫
      • Who gets credit for the video matters less than showing why these benchmarks are useless. ✍️
      • Understanding collision physics requires more than a basic simulation setup. ⚙️
      • Testing LLMs demands more than one-off prompts – it requires extended evaluation and adapted contexts. 🔍
      • Mathematics is crucial in programming, especially in simulations and rendering tasks. 📊
      • Even failed tests illustrate the usefulness of LLMs in expediting coding processes. 🔧

      Key Takeaways

      • Collision simulations are a poor metric for evaluating LLMs' true capabilities. 🙅‍♂️
      • Viral benchmark videos often fail to credit creators and offer incomplete assessments. 📹
      • Proper LLM evaluation requires structured, long-term testing of varied problems. ⏳
      • Physics and geometry tests highlight the need for mathematical understanding in programming. 🧮
      • Flawed tests can still demonstrate LLMs' ability to save time and assist in coding tasks. 💻

      Overview

      In the video, Salvatore Sanfilippo criticizes the newfound approach of testing large language models with geometric collision simulations. He takes issue with viral videos like Flavio's, which compare models such as the O3 mini and R1 through simplistic tests that ultimately miss the mark on genuine evaluation. Salvatore stresses the point that while these videos may be entertaining, they do not offer a valid measure of an LLM's abilities.

        Salvatore delves into the geometric and mathematical principles used in collision detection, illustrating the complexities involved in accurate simulations. He argues that testing these models under such limited conditions, usually with a single prompt, fails to reflect the models' real-world capabilities and adaptability. True evaluation would require multiple days of diverse problem-solving to tease out an LLM’s strengths and weaknesses.

          Despite the flawed nature of these benchmarks, Salvatore acknowledges a positive aspect – LLMs can still be an asset in programming, aiding in quick implementations and exploring potential solutions. Their role in learning and coding assistance is undeniable, even if the benchmark tests themselves are inadequate gauges of competency. Ultimately, the progress of LLMs is undeniable, but requires realistic and thorough testing approaches.

            Chapters

            • 00:00 - 01:00: Introduction and Video Context The chapter introduces Flavio, who is Italian like Salvatore, although Salvatore describes himself as more Sicilian than Italian. Flavio is quoted introducing himself as the creator of a comparison video between O3 mini and DeepSeek R1.
            • 01:00 - 02:30: Flaws in Current Benchmarking Methods The chapter opens with the complaint that people reposted the video online without crediting its creator. Salvatore agrees that giving credit is good but sets the issue aside: whoever gets the credit, his concern is that this kind of benchmark is useless, and he promises to demonstrate why.
            • 02:30 - 04:30: Understanding Sphere Collision in Computer Science This chapter examines the premise of the viral video, namely that the R1 implementation is terrible and the O3 mini implementation is good. Salvatore notes that the O3 mini implementation is better, but the ball ends up almost completely orthogonal to the floor without going anywhere, so both implementations are broken.
            • 04:30 - 07:00: Testing Collision Detection Models The chapter focuses on evaluating and comparing collision detection implementations produced by different models. In the simulation test of the two models, neither works correctly: the ball reaches 'the floor and still is not going anywhere,' which shows that both approaches are broken.
            • 07:00 - 09:00: Evaluating Programming Prompt Strategies The chapter discusses the effectiveness of various programming prompt strategies, using Sonnet as a successful example. It highlights the importance of functionality over spectacle, showing how the ball correctly follows the hexagon.
            • 09:00 - 10:30: Conclusion on Model Benchmarking The chapter discusses the importance of clean implementation and code organization in model benchmarking. It uses the example of 'restitution of one' to illustrate how a structured approach not only makes the code more aesthetically pleasing but also improves its effectiveness. The text suggests that while 'chain of thoughts' is a factor in evaluating large models, it should not be the sole consideration, emphasizing the need for a broader evaluation framework.

            Sfere che si scontrano: un modo errato per testare gli LLM Transcription

            • 00:00 - 00:30 Flavio, who is Italian like me (but I am more Sicilian than Italian, but I am partially Italian - you can have multiple identities) says: "Hello everyone, I am the creator of the comparison video between O3 mini and DeepSeek R1,
            • 00:30 - 01:00 which is currently going viral on the internet." He says that other people basically posted the video without giving credit, which is unfortunate. I think that giving credit is great, but I don't care about these questions that they have. I just want to analyze why these kinds of benchmarks are totally useless. So whoever gets the credit, they are useless, and I will show you.
            • 01:00 - 01:30 For starters, here the idea is that the R1 implementation is terrible and the O3 mini implementation is good, which is not the case. The O3 mini implementation is better, but as you can see, the ball arrives almost to be completely orthogonal to
            • 01:30 - 02:00 the floor and still is not going anywhere. So both are broken. Closing this, but it's not much if one is better than the other. For example, they tested these two, but if I run the simulation I obtained
            • 02:00 - 02:30 with Sonnet, this is the first one that works. So that's what should happen, okay? It's less spectacular because it works - the ball follows the hexagon. And if you want, Sonnet provided a
            • 02:30 - 03:00 very clean implementation with clean classes, it's amazing. Let's use restitution of one for example, it becomes nicer to look at, but it's the same thing and this is a much better implementation. So it's not like chain of thoughts is the only consideration that we have to evaluate if large
            • 03:00 - 03:30 language models work or not. There are problems where it is very important, and this is, to be honest, one of them. But still, we are at a stage where models not doing much chain of thoughts, doing just a limited chain of thoughts like Sonnet, can perform better. But still, the point of this video is that anyway, this kind of comparisons are useless, and I will show you. To start, since a bit of culture of computer science will not be a problem I guess,
            • 03:30 - 04:00 here what we are trying to do is to understand if a line and a ball are colliding. What you do in this case is to find the perpendicular of the line passing from the center of the ball, and if
            • 04:00 - 04:30 the distance between the crossing point with the line - the distance between this point and this point - is more or less the radius, this distance here, the radius of the sphere, then there is contact. Then based on the velocity vector of the ball, you calculate the new velocity.
            • 04:30 - 05:00 If instead you had two balls and you want to understand if they are colliding or not, you take the two centers, use the distance formula that will be like this: X2 minus X1
            • 05:00 - 05:30 squared plus Y2 minus Y1 squared [under a square root]. Okay, so it's super easy - you get this distance here and what you do is to check if this distance is more or less the sum of the two radiuses, the first radius and the second radius. Then if it's between a given Epsilon, the spheres are colliding. If it is greater than the sum, the spheres are apart. If it
            • 05:30 - 06:00 is less than the sum of the radius, the spheres are one inside the other. This is trivial geometry, but it is useful in order to understand what you are asking the model to do. And after you understand if two spheres are colliding like this, and based
            • 06:00 - 06:30 on their normal, basically you decompose - if this is the velocity vector of this sphere, you decompose it in two components: one is perpendicular and one parallel to the segment connecting the two centers. You just use basically the parallel component, which is what will change the velocity between the two, and the orthogonal one is completely irrelevant.
            • 06:30 - 07:00 You can easily see this if you imagine two spheres - this is this velocity vector, this is this kind of velocity vector. Here we will have the normal, and even if they are barely touching, no transfer will happen of motion because there is no parallel component.
            • 07:00 - 07:30 Okay, so yesterday was at least okay with my cough, so I thought "okay, let's go out in Catania and stay in the balcony seeing the fireworks and the Sant'Agata festivities and drink wine" and now I am sick again, but I deserve it.
            • 07:30 - 08:00 I deserve much worse I believe, but anyway, that's what we are asking the model to do. Moreover, there is a problem doing this test or this kind of tests just once - it's very unlikely to produce the same result because the temperature of the model is greater than zero. So sometimes it will write an implementation, sometimes it will write
            • 08:00 - 08:30 another implementation. However, I tried to implement it again with R1, and for example, this is much worse than obtained by Flavio (I'm giving you full credits).
            • 08:30 - 09:00 I tried with V3 vanilla and it kind of works for a second, but it's not worse than the R1.
            • 09:00 - 09:30 So what I thought was - it's already very limited as a test, and you can believe maybe perhaps it's limited, but still this allows me to understand if the model is good at doing these basic geometry checks and to understand how to write the equations, the mathematical formulas in order to detect collisions, in order to detect the velocity. But it's never not even that.
            • 09:30 - 10:00 So look at what I did - I wrote another similar prompt, a bit more complex in some way: "Write a Python program that shows a configurable number of balls bouncing inside the box. The balls should be affected by gravity and friction. The program should detect collisions both with the box and among the balls. The balls will be of different sizes and the bigger the ball,
            • 10:00 - 10:30 the greater its mass, and this should play a role when balls interact." Why I wrote this? Let's silence this phone... I wrote this because when the balls are exactly the same, as I showed you,
            • 10:30 - 11:00 basically you can just transfer the perpendicular forces between the two. Instead, if the masses are different, you have to exchange proportionally to the masses. So I asked Sonnet to do that, and the result is that - also he thought it was fun to let
            • 11:00 - 11:30 me click - and as you can see, it's not working very well. You see here they are like attracted, it's completely unrealistic, there is some error of some kind. I think I even tried to tell him that there was something wrong: "There is something wrong when calculating the velocity vector after the ball collision because the fact was different in the first version.
            • 11:30 - 12:00 Forces added from time to time for some reason since balls accelerate a lot when colliding." So I wrote a new version which is the one that I showed you now. This first bug is no longer there, but still it doesn't work. So I tried what should be the winning model according to
            • 12:00 - 12:30 the test we did - O3 mini should be the best, right? I also used the O3 mini high, so the one that thinks a lot. Let's check - it's completely wrong as you can see, doesn't work.
            • 12:30 - 13:00 Instead, guess what? R1 wrote the only correct implementation of that. So the same domain, the same kind of problems, the same language, the same domain where it looked like O3 mini had an edge, and now instead Sonnet,
            • 13:00 - 13:30 even if it's very powerful, cannot nail it. O3 mini can't, and R1 is working very well. It's very nice, this kind of simulations. I did them a lot when I was learning to program. It's a lot of fun, also it's a great way to show, especially to young programmers, why it is important to know some mathematics. "Oh, mathematics is boring, I don't need to know
            • 13:30 - 14:00 mathematics, I just can code" - but without mathematics, you cannot do a lot of things, a lot of cool things. You cannot do shading, rendering, 2D simulations. So all this to say that I believe you either should do some testing like test the model
            • 14:00 - 14:30 for two-three days with many different problems against another model and take a score matrix or something like that, and then you did a proper and limited evaluation of the model - but you did something. But just throwing the first prompt that crosses your mind... Also, I will be very curious to know if just using a different prompt would produce a better result,
            • 14:30 - 15:00 because if you check this prompt, it's the kind of prompt you would use in image diffusion models in order to create an image. Stay focused on the exact wording Flavio used: "Write a program, a Python program that shows a ball bouncing inside the spinning hexagon. The ball should
            • 15:00 - 15:30 be affected by gravity and friction and it must bounce off the rotating walls realistically." It's a prompt more for a diffusion model. There are not enough clues. I would write like "represent the hexagon as a number of segments, then implement functions in order to find the collision between the ball and the segment" and so forth, to give maybe a bit more
            • 15:30 - 16:00 context. Probably reasoning models should be able to rewrite the prompt in order to be more adapt, but still they fail sometimes. But anyway, it's better to use, to write a coding prompt not like you are imagining what you see in the screen but more in implementation-near terms,
            • 16:00 - 16:30 assuming you have of course some idea about how it should work in the implementation side. Okay, just to say - if you want to benchmark, be prepared to do a lot of work, lot of hard work, otherwise the results are funny to show. It's nice, interesting to see what these models... well, you know, even these two failures are a testimony to the fact that
            • 16:30 - 17:00 these new LLMs are very strong, are very great from save-time point of view because you can easily take this implementation and then start with a model to say "Let me show how you implemented collision detection, how the force was transferred" and you get to the point that the implementation is great. It's not a super easy task, it's not very complicated as well, but in general these models are making progress. Bye!
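The line-versus-ball check described between 03:30 and 04:30 can be sketched roughly as follows. This is only an illustration, not code from the video: the function names are made up for this example, "more or less the radius" is read here as "distance less than or equal to the radius", and the `restitution` parameter is an assumed way to model the "restitution of one" mentioned at 02:30.

```python
import math


def closest_point_on_line(px, py, ax, ay, bx, by):
    """Foot of the perpendicular from P onto the infinite line through A and B."""
    dx, dy = bx - ax, by - ay
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    return ax + t * dx, ay + t * dy


def ball_touches_line(cx, cy, radius, ax, ay, bx, by):
    """True when the distance from the ball center to the line is at most the radius."""
    qx, qy = closest_point_on_line(cx, cy, ax, ay, bx, by)
    return math.hypot(cx - qx, cy - qy) <= radius


def bounce(vx, vy, nx, ny, restitution=1.0):
    """Reflect a velocity about the unit normal (nx, ny) of the line.

    restitution=1.0 keeps all the speed along the normal (assumed meaning of
    "restitution of one"); smaller values lose some of it at each bounce.
    """
    dot = vx * nx + vy * ny
    scale = (1.0 + restitution) * dot
    return vx - scale * nx, vy - scale * ny


# A ball falling onto a horizontal floor (normal pointing up) bounces back up.
print(bounce(1, -2, 0, 1))  # (1.0, 2.0)
```

A real simulation would also check that the ball is moving toward the line before reflecting, otherwise a ball that is still in contact on the next frame gets reflected back into the wall.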
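The two-ball test from 04:30 to 06:00 (distance between the centers compared with the sum of the two radii, within a given epsilon) might look like this in Python. The function name, the returned labels, and the default `eps` value are assumptions made for the sketch.

```python
import math


def classify_spheres(x1, y1, r1, x2, y2, r2, eps=1e-6):
    """Compare the center distance with r1 + r2, as described in the transcript."""
    distance = math.hypot(x2 - x1, y2 - y1)  # sqrt((x2 - x1)^2 + (y2 - y1)^2)
    total = r1 + r2
    if abs(distance - total) <= eps:
        return "touching"      # distance is the sum of the radii, within eps
    if distance > total:
        return "apart"         # centers farther apart than r1 + r2
    return "overlapping"       # one sphere is partly inside the other


print(classify_spheres(0, 0, 1, 2, 0, 1))  # touching
print(classify_spheres(0, 0, 1, 3, 0, 1))  # apart
print(classify_spheres(0, 0, 1, 1, 0, 1))  # overlapping
```

In a frame-based simulation the exact "touching" case is rare, so the overlapping case is usually treated as a collision too.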
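The decomposition explained between 06:00 and 07:00 can be illustrated with a small sketch for two balls of equal mass. It assumes a perfectly elastic collision, in which the components parallel to the segment connecting the centers are simply swapped and the orthogonal components are left alone. All names below are hypothetical.

```python
import math


def split_velocity(vx, vy, nx, ny):
    """Split a velocity into parts parallel and orthogonal to the unit vector (nx, ny)."""
    dot = vx * nx + vy * ny
    parallel = (dot * nx, dot * ny)
    orthogonal = (vx - parallel[0], vy - parallel[1])
    return parallel, orthogonal


def collide_equal_masses(c1, v1, c2, v2):
    """New velocities of two equal-mass balls in contact (elastic, assumed)."""
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    dist = math.hypot(dx, dy)
    nx, ny = dx / dist, dy / dist  # unit vector along the segment joining the centers
    p1, o1 = split_velocity(v1[0], v1[1], nx, ny)
    p2, o2 = split_velocity(v2[0], v2[1], nx, ny)
    # Swap the parallel components, keep each ball's own orthogonal component.
    return (p2[0] + o1[0], p2[1] + o1[1]), (p1[0] + o2[0], p1[1] + o2[1])


# Head-on hit: the moving ball stops and the resting ball takes its velocity.
print(collide_equal_masses((0, 0), (1, 0), (2, 0), (0, 0)))
# Grazing contact: no parallel component, so no motion is transferred.
print(collide_equal_masses((0, 0), (0, 1), (2, 0), (0, 0)))
```

The second call reproduces the case from the transcript where the balls are barely touching and nothing is exchanged because the velocity has no parallel component.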
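For the second prompt (10:00 to 11:00), where bigger balls have greater mass, the exchange along the line of centers has to be weighted by the masses. One common way to do that, assumed here and not confirmed by the video, is the standard one-dimensional elastic collision formula applied to the components along the normal. The function name and argument order are illustrative.

```python
import math


def collide_with_masses(c1, v1, m1, c2, v2, m2):
    """Elastic collision of two balls with different masses (a sketch)."""
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    dist = math.hypot(dx, dy)
    nx, ny = dx / dist, dy / dist           # unit normal along the line of centers
    u1 = v1[0] * nx + v1[1] * ny            # speed of ball 1 along the normal
    u2 = v2[0] * nx + v2[1] * ny            # speed of ball 2 along the normal
    # 1D elastic collision: conserves momentum and kinetic energy.
    w1 = ((m1 - m2) * u1 + 2 * m2 * u2) / (m1 + m2)
    w2 = ((m2 - m1) * u2 + 2 * m1 * u1) / (m1 + m2)
    # Only the normal component changes; the orthogonal part is untouched.
    new_v1 = (v1[0] + (w1 - u1) * nx, v1[1] + (w1 - u1) * ny)
    new_v2 = (v2[0] + (w2 - u2) * nx, v2[1] + (w2 - u2) * ny)
    return new_v1, new_v2


# A heavy ball (mass 3) hits a light resting ball (mass 1) head-on.
heavy, light = collide_with_masses((0, 0), (2, 0), 3, (2, 0), (0, 0), 1)
print(heavy, light)
```

With equal masses the formula reduces to swapping the two normal components, which matches the simpler case described earlier in the transcript.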
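The more implementation-near prompt suggested around 15:00 ("represent the hexagon as a number of segments, then implement functions in order to find the collision between the ball and the segment") could translate into something like the sketch below. The function names, the tuple-based data layout, and the use of a rotation angle in radians are assumptions for this example.

```python
import math


def hexagon_segments(cx, cy, radius, angle):
    """The six walls of a hexagon rotated by `angle` radians, as (start, end) pairs."""
    points = [
        (cx + radius * math.cos(angle + k * math.pi / 3),
         cy + radius * math.sin(angle + k * math.pi / 3))
        for k in range(6)
    ]
    return [(points[k], points[(k + 1) % 6]) for k in range(6)]


def closest_point_on_segment(p, a, b):
    """Closest point to P on segment AB (projection clamped to the endpoints)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return a[0] + t * dx, a[1] + t * dy


def colliding_walls(center, radius, segments):
    """Walls whose closest point lies within `radius` of the ball center."""
    hits = []
    for a, b in segments:
        q = closest_point_on_segment(center, a, b)
        if math.hypot(center[0] - q[0], center[1] - q[1]) <= radius:
            hits.append((a, b))
    return hits


walls = hexagon_segments(0, 0, 100, 0.0)
print(len(colliding_walls((0, 0), 10, walls)))    # ball in the middle: no wall touched
print(len(colliding_walls((0, -80), 10, walls)))  # ball close to one wall
```

For a spinning hexagon, `hexagon_segments` would be called again each frame with an updated `angle`. A realistic bounce would also need to account for the velocity of the moving wall at the contact point, which this sketch leaves out.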