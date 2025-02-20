Advances in artificial intelligence (AI) could be less significant than they appear. This is the main conclusion of a study by researchers from the National University of Distance Education (UNED) in Spain, who suggest that the capabilities of models such as OpenAI o3-mini or DeepSeek-R1 depend more on memorization than on genuine reasoning.

The development of AI systems with reasoning skills has become the new focus of competition within the sector. Most of these models have been trained to respond to requests through “private chains of thought”, a procedure that allows them to “reflect” before generating an answer, according to companies such as OpenAI. The systems can break the request into parts and link it with prior information to give a more precise response.

The industry maintains that this is an advanced form of reasoning that resembles that of humans, and that it is evaluated through reference tests known as benchmarks. Models with better scores on these tests are usually considered the most powerful. However, specialists warn that these tests have reliability problems, a situation aggravated by intense competition in the sector.

Julio Gonzalo, co-author of the study and professor of Computer Languages and Systems at UNED, told El País that “if there is a lot of competitive pressure, too much attention is paid to benchmarks, and it would be easy and convenient for companies to manipulate them, so we cannot completely trust the numbers they report to us.”

To evaluate the reliability of these tests, Gonzalo and fellow UNED researchers Eva Sánchez Salido and Guillermo Marco designed a simple but effective experiment. Their objective was to determine whether the models answer the tests through real reasoning or simply pick the most likely option based on their training data.

The abilities of AI depend on memory, not on reasoning

The experiment consisted of modifying traditional benchmarks by introducing a generic answer option: “none of the above.” The aim was to force the AI to reason instead of identifying previously learned patterns.
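As a rough illustration, the modification could be sketched in Python as follows. This is not the researchers' code: the `Item` class, the `with_none_of_the_above` function and the sample question are hypothetical, and the sketch assumes that the generic option takes the place of the originally correct answer, a detail the article does not spell out.

```python
from dataclasses import dataclass, replace

NONE_OPTION = "None of the above"


@dataclass(frozen=True)
class Item:
    """A hypothetical multiple-choice benchmark question."""
    question: str
    options: tuple[str, ...]
    answer_index: int  # position of the correct option


def with_none_of_the_above(item: Item) -> Item:
    """Return a variant whose correct option text is replaced by the generic option.

    Assumption: the generic option substitutes the right answer, so the
    correct index stays the same but the memorized answer text disappears.
    """
    options = tuple(
        NONE_OPTION if i == item.answer_index else text
        for i, text in enumerate(item.options)
    )
    return replace(item, options=options)


# Made-up example question, for illustration only.
original = Item(
    question="What is the capital of France?",
    options=("Madrid", "Paris", "Rome", "Lisbon"),
    answer_index=1,
)
variant = with_none_of_the_above(original)
print(variant.options)  # ('Madrid', 'None of the above', 'Rome', 'Lisbon')
```

Under this assumption, a model that has only memorized the pairing between the question and “Paris” no longer finds that text among the options; it has to rule out the remaining answers to reach the generic one.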

The tests were applied to 16 large language models (LLMs), including DeepSeek-R1, OpenAI o3, Gemma-2-27B, Claude 3.5, Llama 3, GPT-4 and Mistral 7B. The findings were revealing. “The results show that all models lose accuracy in a remarkable way with our proposed variation, with an average drop of 57% and 50% [on two traditional reference benchmarks], and ranging between 10% and 93% depending on the model,” the authors point out in their article.
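For readers who want to see how such a drop could be computed, here is a minimal sketch. The article does not say whether the reported percentages are absolute (percentage points) or relative to the original score, so the hypothetical `accuracy_drop` function returns both. The answer lists are invented for illustration and are not the study's data.

```python
def accuracy(predictions: list[int], answers: list[int]) -> float:
    """Fraction of questions where the predicted option matches the correct one."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and the same length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def accuracy_drop(original: float, modified: float) -> tuple[float, float]:
    """Return (absolute drop, drop relative to the original accuracy)."""
    absolute = original - modified
    relative = absolute / original if original else 0.0
    return absolute, relative


# Invented data: five questions, answered on the original and modified benchmark.
answers = [1, 0, 2, 3, 1]
acc_original = accuracy([1, 0, 2, 3, 0], answers)  # 4 of 5 correct
acc_modified = accuracy([1, 0, 1, 2, 0], answers)  # 2 of 5 correct

absolute, relative = accuracy_drop(acc_original, acc_modified)
print(f"absolute drop: {absolute:.0%}, relative drop: {relative:.0%}")
```

In this made-up case the same result reads as a 40-point absolute drop or a 50% relative drop, which is why the definition matters when comparing figures across models.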

The researchers also indicated that language influences the performance of AI models. Tests in English usually yield better results, while performance decreases in Spanish and drops drastically in less common languages. Gonzalo explains that the difference between languages is more noticeable in models with more limited neural processing structures. The compact versions of LLMs, which can run on devices and offer greater privacy, tend to show more linguistic bias depending on the language used.

The study, conducted within the framework of the ODESIA project in collaboration with the Red.es platform, concludes that AI models depend largely on memorization rather than genuine reasoning. Guillermo Marco notes that this type of variation had already been tested on the wording of benchmark questions. However, he emphasizes that modifying the answer options “makes it possible to evaluate more precisely the real progress in the systems' approximate reasoning capabilities, without successes due to memorization distorting the results.”

Despite the limitations found, the study makes clear that AI developers are experimenting with new techniques to improve the reasoning of their algorithms. One example is the OpenAI o3-mini model, which, although it loses accuracy on the modified tests, is the only one that manages to pass one of the benchmarks. The study also acknowledges that DeepSeek-R1 was the model that showed the smallest performance drop in the modified evaluations.

