Valdemaras Girštautas, Jr, JavaScript Software Engineer
AI Benchmarking: Understanding Test Categories That Actually Matter
A few months ago, I was running a small benchmarking exercise for an LLM feature. I noticed something that seems obvious in retrospect but puzzled me at the time: two of the models I was evaluating had roughly the same leaderboard scores. On paper, they looked more or less equivalent.
However, when I started testing them on the task at hand, the differences became apparent. One performed well on structured data extraction but struggled with multi-step reasoning. The other did the exact opposite.
My initial thought was that there was something off with my testing. But as I dug in further, it became clear that the issue was not the models – it was my interpretation of the benchmarks. So I set out to understand how to read them properly.
Here is an overview of what I found. Most benchmarks fall into four broad categories of tests: reasoning, classification, code generation, and structured output. I will walk you through each category, give examples, and describe how models typically behave in each.

Reasoning tasks test the model’s ability to think, follow multiple steps, track intermediate states, apply logic, and perform simple arithmetic. Performance on these tasks is sensitive to both the model and the prompt you use.
For example, in a math word problem, models can perform quite differently depending on how the prompt is structured. The key takeaway here is that models need space to think. If you constrain the prompt too much, for instance by demanding only the final number, reasoning performance tends to drop noticeably.
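One way to check this on your own data is to run the same questions through two prompt templates and compare accuracy. The following is a minimal Python sketch, not the setup used in the exercise above: the two templates, the `model` callable (any function that takes a prompt string and returns the reply text), and the "last number in the reply" extraction rule are all assumptions made for illustration.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical templates: one forces a bare answer, one leaves room to reason.
CONSTRAINED = "Answer with a single number only.\n\n{question}"
ROOM_TO_THINK = (
    "Work through the problem step by step, then give the final result "
    "on the last line as 'Answer: <number>'.\n\n{question}"
)

def extract_number(text: str) -> Optional[str]:
    """Return the last number that appears in the reply, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

def accuracy(
    model: Callable[[str], str],
    template: str,
    items: List[Tuple[str, str]],
) -> float:
    """Fraction of (question, expected_answer) pairs answered correctly under one template."""
    correct = 0
    for question, expected in items:
        reply = model(template.format(question=question))
        if extract_number(reply) == expected:
            correct += 1
    return correct / len(items)
```

Calling `accuracy(model, CONSTRAINED, items)` and `accuracy(model, ROOM_TO_THINK, items)` on the same `items` gives two numbers for the same model, which makes it visible how much of a benchmark score belongs to the prompt rather than the model.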
Classification tasks test the model’s ability to choose a label, pick the best option from a set of choices, or decide between predefined categories.
These tasks tend to be more stable than reasoning tasks. Most models perform well on them, which can lead to the false conclusion that a model is “perfect” if you only evaluate this category. In reality, these tasks primarily test pattern recognition, consistency, and the ability to stay within constraints.
For example, I initially tested a model that performed flawlessly on classification tasks, which led me to believe it was excellent. However, when I later used it for more complex tasks, issues began to surface.
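Scoring a classification run is simple, which is part of why the results look so clean. Below is a small Python sketch under stated assumptions: the function name `score_classification`, the three outcome buckets, and the lowercase/strip normalisation are illustrative choices, not part of any specific benchmark. It separates plain wrong answers from replies that leave the allowed label set, since staying within constraints is one of the things this category measures.

```python
from collections import Counter
from typing import Dict, Iterable, List

def score_classification(
    predictions: List[str],
    gold: List[str],
    allowed_labels: Iterable[str],
) -> Dict[str, float]:
    """Share of correct, wrong, and out-of-set predictions."""
    allowed = set(allowed_labels)
    counts = Counter()
    for pred, truth in zip(predictions, gold):
        label = pred.strip().lower()  # assumption: labels are lowercase strings
        if label not in allowed:
            counts["out_of_set"] += 1  # the model left the predefined categories
        elif label == truth:
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    total = len(gold)
    return {key: counts[key] / total for key in ("correct", "wrong", "out_of_set")}
```

A high `correct` share here says the model recognises patterns and respects the label set. It says nothing about multi-step reasoning, which needs its own test.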
Code generation tasks test the model’s ability to generate code in a specific programming language. They are highly sensitive to the model itself.
One thing I noticed is how much small changes to prompts matter. Even minor adjustments can lead to very different outputs. For example, I tested two models using the same code-generation prompt, and the results were vastly different. Interestingly, the model that performed well on reasoning tasks performed quite poorly on code generation.

Structured output tasks test the model’s ability to generate well-structured outputs, such as filling out a table or producing a summary from a set of points.
These tasks fall somewhere between reasoning and classification in terms of stability. Performance is sensitive to both the model and the prompt. For example, when filling out a table, results varied depending on how the prompt was phrased.
One key takeaway from this exercise is to think carefully about the types of tasks you plan to solve and to evaluate benchmarks accordingly. If your use case involves extensive reasoning, you should prioritise benchmarks such as math word problems. If it is mostly classification, focus on classification benchmarks instead.
Having excellent classification performance does not mean a model can handle reasoning tasks.
Code tasks are a completely different beast. There’s no “mostly right” – the code either runs correctly or it doesn’t. That makes evaluation simpler, but also more sensitive: small variations in the prompt can produce drastically different outputs. This makes benchmarking harder because you’re not just evaluating the model – you’re also evaluating the prompt.
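The all-or-nothing nature of code evaluation can be sketched in a few lines of Python. This is an illustrative check, not a real benchmark harness: `passes` is a hypothetical helper, and the candidate and test snippets are made up. It also runs generated code with `exec` in the current process, which is only acceptable for trusted toy input; real model output belongs in a sandboxed process with a timeout.

```python
def passes(candidate_source: str, test_source: str) -> bool:
    """True only if the candidate code loads and every assert in the tests holds."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the generated function(s)
        exec(test_source, namespace)       # run plain assert statements against them
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as a fail.
        return False
    return True

# Made-up candidates standing in for model output.
good = "def add(a, b):\n    return a + b\n"
subtly_wrong = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

results = [passes(src, tests) for src in (good, subtly_wrong, broken)]
print(results)  # [True, False, False]
```

There is no partial credit: a candidate that is one character away from correct scores the same as one that does not parse.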
Structured output tasks sound simple: “Just return some JSON.” But in practice, this is where things start to get complicated. In one of the experiments I conducted, I compared natural language answers with JSON-formatted answers.
For simple cases, JSON worked great. But as soon as I introduced multiple fields or marginally more complex structures, issues started to appear: malformed formatting, missing fields, or simplified answers. What’s interesting is that the model often knew the correct answer but couldn’t represent it within the structure.
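Two of these failure modes, malformed formatting and missing fields, can be detected mechanically. The Python sketch below uses only the standard `json` module. The function name `check_json_reply` and the outcome labels are assumptions for illustration. It does not catch the third failure mode, simplified answers, because a reply can be valid JSON with every field present and still say less than the model knows.

```python
import json
from typing import Set

def check_json_reply(reply: str, required_fields: Set[str]) -> str:
    """Classify a model reply as 'ok', 'malformed', 'not_an_object', or 'missing_fields'."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return "malformed"          # the reply is not valid JSON at all
    if not isinstance(data, dict):
        return "not_an_object"      # valid JSON, but e.g. a list or a bare string
    if required_fields - set(data):
        return "missing_fields"     # valid object, but some expected keys are absent
    return "ok"
```

Counting these outcomes separately from answer correctness helps distinguish "the model did not know" from "the model could not fit the answer into the structure".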

After going through these categories, one insight stood out: Benchmarks don’t tell you which model is best – they tell you what a model is good at.
This completely changed how I think about benchmarks. Instead of asking, “Which model scored highest?” I started asking, “What tasks does this model excel at?”
This becomes especially relevant in real-world applications. I’ve seen cases where a model chosen based on classification benchmarks failed in a reasoning-heavy pipeline, or a model that scored well overall struggled with structured outputs in production.
These failures often come down to a mismatch between the benchmark type and the actual task.
What works better is to think in categories first. Before you choose a model, first ask: Do I need reasoning? Is this mostly classification? Do I need structured outputs? Do I need code generation? Then evaluate models based on those needs. In some cases, you may need to combine categories – for example, a system that requires both reasoning and structured output.
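Thinking in categories first can be expressed as a weighted score: decide how much each category matters for your system, then rank models by that instead of by a single overall number. The sketch below is one possible way to do it in Python. The category keys, the weighting scheme, and every score in `MODELS` are made-up values for illustration, not measurements of any real model.

```python
CATEGORIES = ("reasoning", "classification", "code", "structured_output")

def weighted_score(scores, weights):
    """Average of per-category scores, weighted by how much each category matters."""
    total_weight = sum(weights.get(c, 0.0) for c in CATEGORIES)
    if total_weight <= 0:
        raise ValueError("at least one category needs a positive weight")
    return sum(scores[c] * weights.get(c, 0.0) for c in CATEGORIES) / total_weight

def rank_models(models, weights):
    """Model names sorted from best to worst fit for the given weights."""
    return sorted(models, key=lambda name: weighted_score(models[name], weights), reverse=True)

# Made-up numbers for illustration only, not measurements of any real model.
MODELS = {
    "model_a": {"reasoning": 0.5, "classification": 0.9, "code": 0.7, "structured_output": 0.9},
    "model_b": {"reasoning": 0.9, "classification": 0.9, "code": 0.7, "structured_output": 0.5},
}

EQUAL = {c: 1.0 for c in CATEGORIES}                          # a plain overall average
REASONING_HEAVY = {"reasoning": 3.0, "structured_output": 1.0}  # a reasoning-heavy pipeline

print(rank_models(MODELS, REASONING_HEAVY))  # ['model_b', 'model_a']
```

With `EQUAL` weights the two hypothetical models come out essentially tied, which is the leaderboard view. With `REASONING_HEAVY` weights the difference between them becomes visible.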
Before this exercise, I treated benchmarks as a shortcut. Now, I see them as a diagnostic tool. They don’t give you final answers – they give you signals. If you interpret those signals correctly, they are incredibly valuable. If you don’t, they can be misleading.
So the next time you see a benchmark result, ask: What is this actually measuring? That question is far more important than the result itself.
Volha Khudzinskaya, Head of QM, and Dzmitry Mikhailouski, Lead SDET