Scientists rely on a vast and ever-growing body of published research to advance their fields. As the volume of scientific literature explodes, large language models (LLMs) are emerging as potential tools to help researchers navigate this complexity. But how trustworthy are these AI systems when it comes to providing accurate and nuanced answers to specialized questions? A new study from Cornell University and Google researchers puts LLMs to the test, revealing both their promise and current limitations in understanding complex scientific concepts.
The research, published March 10 in the Proceedings of the National Academy of Sciences, focused on the field of high-temperature cuprates – a class of superconducting materials. Researchers evaluated the ability of six LLM systems – including ChatGPT, Claude and Google’s NotebookLM – to comprehend scientific papers at an expert level. The findings highlight the critical need for curated data sources and improved visual reasoning capabilities within these models.
“This study is about testing out LLMs’ ability to read the literature the way an expert would read,” said Eun-Ah Kim, the Hans A. Bethe Professor of physics in the College of Arts and Sciences at Cornell, and the study’s corresponding author. “This paper is key now because everyone is very curious about what LLMs can and cannot do, especially in the context of artificial general intelligence (AGI). You’ll see critical gaps in what LLMs can do right now, which is clearly showing that they are not at AGI.”
The team, led by Haoyu Guo, a Bethe/KIC postdoctoral fellow with Cornell’s Laboratory of Atomic and Solid State Physics (LAASP), created a database of 1,726 scientific papers curated by human experts. This database, covering decades of research on high-temperature cuprates, was paired with 67 questions designed to probe deep understanding of the material. Experts then anonymously graded the responses provided by each LLM.
The Power of Curated Data
The study found that LLMs performed best when operating on trusted data sources – specifically, the papers collected by the researchers themselves, rather than relying on information scraped from the internet. “LLMs operating on trusted data sources – papers we collected ourselves, not from the LLM searching the Internet – tend to perform better,” Guo explained. Among the systems tested, NotebookLM, a Google product designed to answer questions based on provided documents, consistently outperformed others when provided with the curated dataset.
This suggests that the quality and reliability of the data fed into an LLM are paramount to its accuracy. Researchers also found that a custom retrieval-augmented generation (RAG) system, capable of retrieving both text and images from the curated documents, showed improved performance. This highlights the importance of multimodal information processing in scientific understanding.
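The paper does not detail how the custom RAG system was built, but the general idea — index curated document chunks, rank them against a question, and hand the top matches (text plus any attached figures) to the LLM — can be illustrated with a minimal sketch. The `Chunk` structure, the word-overlap scoring, and the names below are illustrative assumptions, not the authors' implementation; real systems use learned embeddings rather than keyword counts.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """A retrievable unit: a passage of text plus any figures attached to it."""
    doc_id: str
    text: str
    figure_refs: list = field(default_factory=list)  # e.g. figure captions or image paths

def tokenize(s):
    return [w.lower().strip(".,") for w in s.split()]

def score(query, chunk):
    # Toy relevance score: count of overlapping words between query and chunk text.
    q, c = Counter(tokenize(query)), Counter(tokenize(chunk.text))
    return sum(min(q[w], c[w]) for w in q)

def retrieve(query, chunks, k=2):
    # Return the k highest-scoring chunks; because figures travel with their
    # chunk, the retrieved context is multimodal (text and images together).
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    return ranked[:k]
```

In this sketch, pairing each passage with its figure references is what lets the downstream model see plots alongside the surrounding text — the multimodal retrieval the study found helpful.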
Visual Reasoning: A Major Weakness
While LLMs demonstrated surprising proficiency in extracting text-based information, they struggled significantly with data visualization. All the LLMs were surprisingly good at pulling out text-based information, but "totally incapable" of engaging with data visualization, Kim noted. She emphasized that critically analyzing data visualizations is a fundamental skill for scientists, and the inability of current LLMs to do so represents a serious limitation.
The custom RAG system, with its ability to retrieve images alongside text, offered a notable improvement in handling data visualization, underscoring the need for LLMs to better integrate and interpret visual information. Researchers identified several key areas for improvement, including more accurate attribution of claims (LLMs sometimes fabricate references), enhanced ability to synthesize complex information, and improved comprehension of plots and figures.
Implications for Future Research
The findings suggest that trusted LLM systems could be valuable tools for young researchers entering new fields, helping them quickly grasp the foundational knowledge and identify key areas for investigation. “Knowing the facts used to be brandished as a ticket to the table. Holding a fact in your head should not be the ticket. The ticket should be: Do you know how to think in a creative way? Can you approach problems from a creative angle?” Kim said.
This study represents the first output from the Cornell-led National Science Foundation AI-Materials Institute, directed by Kim. The researchers acknowledge that LLMs have improved significantly in the year since the benchmark was initially performed, but visual reasoning remains a significant challenge. Further development in this area is crucial to unlock the full potential of LLMs in scientific discovery.
As LLMs continue to evolve, their ability to accurately process and synthesize scientific information will be critical for accelerating research and fostering innovation. The ongoing work at Cornell and Google, and elsewhere, is paving the way for a future where AI can serve as a powerful partner to scientists, augmenting their capabilities and driving new breakthroughs.