AI agents deliver results – but do they reason scientifically?

Corral makes visible how AI agents arrive at their results: the benchmark breaks down agent runs into hypotheses, tests, evidence, judgments and corrections. This makes it possible to identify both productive forms of scientific reasoning and problematic patterns, such as when evidence is not taken into account. © HIPOLE Jena / Corral

A research team co-led by Kevin Maik Jablonka from the Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena) and N. M. Anoop Krishnan from the Indian Institute of Technology Delhi has developed Corral, a new benchmark for AI agents in science. The preprint “AI scientists produce results without reasoning scientifically” has been published on arXiv (https://doi.org/10.48550/arXiv.2604.18805). The analysis shows that current systems can execute scientific workflows and deliver results; however, they often do not follow the basic principles of scientific testing and reasoning.

Artificial intelligence is expected not only to write texts or analyse data, but also to plan scientific experiments, analyse results and generate new knowledge. But when can an AI system truly be said to be doing science? Is it enough for the final result to be correct – or must the path to that result also meet scientific standards? This question is addressed in a new preprint by Jablonka’s team.

With Corral, the researchers developed a benchmark that evaluates AI-based scientific agents not only by their results, but also by how they arrive at them. To do this, the team analysed more than 25,000 agent runs across eight scientific domains – ranging from molecular simulations and materials data analysis to spectroscopic structure elucidation and hypothesis-driven chemical tests. The evaluation examined not only whether a task was solved, but also whether the systems took evidence into account, generated and tested hypotheses, and revised their assumptions when confronted with contradictory results.
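To make the idea of judging the process rather than only the outcome concrete, here is a minimal Python sketch. It is not Corral's actual schema or scoring method: the `Step` class, the `contradicts_hypothesis` flag and the `unaddressed_contradictions` function are hypothetical names invented for illustration. The sketch assumes an agent run has already been broken down into labelled steps (hypothesis, test, evidence, judgment, correction) and flags contradicting evidence that is never followed by a correction.

```python
from dataclasses import dataclass

# Step categories named in the article; everything else here is an assumption.
STEP_KINDS = {"hypothesis", "test", "evidence", "judgment", "correction"}


@dataclass
class Step:
    kind: str                              # one of STEP_KINDS
    text: str                              # what the agent wrote or did
    contradicts_hypothesis: bool = False   # hypothetical flag, used for evidence steps

    def __post_init__(self):
        if self.kind not in STEP_KINDS:
            raise ValueError(f"unknown step kind: {self.kind!r}")


def unaddressed_contradictions(trace):
    """Return indices of contradicting evidence steps with no later correction."""
    flagged = []
    for i, step in enumerate(trace):
        if step.kind == "evidence" and step.contradicts_hypothesis:
            # The evidence counts as addressed if any later step is a correction.
            if not any(later.kind == "correction" for later in trace[i + 1:]):
                flagged.append(i)
    return flagged


# Toy run: the agent sees contradicting evidence but goes straight to a judgment.
run = [
    Step("hypothesis", "The sample is an ester."),
    Step("test", "Check the spectrum for a carbonyl band."),
    Step("evidence", "No carbonyl band observed.", contradicts_hypothesis=True),
    Step("judgment", "The sample is an ester."),
]
print(unaddressed_contradictions(run))
```

In this toy run the final answer could still happen to be right, yet the trace shows that contradicting evidence was never acted on. A result-only evaluation would miss that kind of pattern.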

“We need to be clearer about what kind of scientific reasoning we expect from such AI systems,” says Jablonka. “When it comes to epistemic rigor, better training procedures may help. But in areas where we need reliable guarantees about the reasoning process, we will probably need different systems – for example, systems with symbolic and formally verifiable components.”

HIPOLE Jena is an institute of the Helmholtz-Zentrum Berlin für Materialien und Energie (HZB), operated in cooperation with Friedrich Schiller University Jena and the Center for Energy and Environmental Chemistry Jena (CEEC). Helmholtz AI has published a detailed background article on the preprint: https://www.helmholtz.ai/detail/do-ai-scientists-actually-do-science-new-benchmark-probes-the-reasoning-behind-the-results-featuring-dr-kevin-maik-jablonka-helmholtz-ai-associate/
