AI in Chemistry: Study Highlights Strengths and Weaknesses
How well does artificial intelligence perform compared to human experts? A research team at HIPOLE Jena set out to answer this question in the field of chemistry. Using a newly developed evaluation method called “ChemBench,” the researchers compared the performance of modern language models such as GPT-4 with that of experienced chemists.
The study has recently been published in the journal Nature Chemistry (DOI 10.1038/s41557-025-01815-x).
More than 2,700 chemistry tasks from research and education were tested—ranging from fundamental knowledge to complex problems. In areas such as reaction prediction or the analysis of large datasets, AI models often excelled with high efficiency. However, a critical weakness became apparent: the models also produced confident answers even when they were factually incorrect. Human chemists, by contrast, were more cautious and questioned their own assessments.
“Our study shows that AI can be a valuable tool—but it is no substitute for human expertise,” says Dr. Kevin M. Jablonka, lead author of the study. The findings offer important insights for the responsible use of AI in chemical research and education.
HIPOLE Jena (Helmholtz Institute for Polymers in Energy Applications Jena) is an institute of HZB in cooperation with Friedrich Schiller University Jena (FSU Jena).