AI agents deliver results – but do they reason scientifically?

Corral makes visible how AI agents arrive at their results: the benchmark breaks down agent runs into hypotheses, tests, evidence, judgments and corrections. This makes it possible to identify both productive forms of scientific reasoning and problematic patterns, such as when evidence is not taken into account. © HIPOLE Jena / Corral

A research team co-led by Kevin Maik Jablonka from the Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena) and N. M. Anoop Krishnan from the Indian Institute of Technology Delhi has developed Corral, a new benchmark for AI agents in science. The preprint “AI scientists produce results without reasoning scientifically” has been published on arXiv (https://doi.org/10.48550/arXiv.2604.18805). The analysis shows that current systems can execute scientific workflows and deliver results; however, they often do not follow the basic principles of scientific testing and reasoning.

Artificial intelligence is expected not only to write texts or analyse data, but also to plan scientific experiments, analyse results and generate new knowledge. But when can an AI system truly be said to be doing science? Is it enough for the final result to be correct – or must the path to that result also meet scientific standards? These questions are addressed in a new preprint by Jablonka’s team.

With Corral, the researchers developed a benchmark that evaluates AI-based scientific agents not only by their results, but also by how they arrive at them. To do this, the team analysed more than 25,000 agent runs across eight scientific domains – ranging from molecular simulations and materials data analysis to spectroscopic structure elucidation and hypothesis-driven chemical tests. The evaluation examined not only whether a task was solved, but also whether the systems took evidence into account, generated and tested hypotheses, and revised their assumptions when confronted with contradictory results.

“We need to be clearer about what kind of scientific reasoning we expect from such AI systems,” says Jablonka. “When it comes to epistemic rigor, better training procedures may help. But in areas where we need reliable guarantees about the reasoning process, we will probably need different systems – for example, systems with symbolic and formally verifiable components.”

HIPOLE Jena is an institute of the Helmholtz-Zentrum Berlin für Materialien und Energie (HZB), operated in cooperation with Friedrich Schiller University Jena and the Center for Energy and Environmental Chemistry Jena (CEEC). Helmholtz AI has published a detailed background article on the preprint: https://www.helmholtz.ai/detail/do-ai-scientists-actually-do-science-new-benchmark-probes-the-reasoning-behind-the-results-featuring-dr-kevin-maik-jablonka-helmholtz-ai-associate/


