World modeling—the process by which an artificial intelligence model simulates and predicts how actions play out within a virtual environment—is an important step in developing many AI applications because it enables more efficient and better-informed decision-making. However, such simulations are expensive to build by hand, requiring weeks or even months of effort from multiple human experts—which is why researchers are exploring whether large language models, or LLMs, can act as automated world simulators and correctly predict how actions change the state of a given environment.
But a multi-institutional research team including Ziang Xiao, an assistant professor of computer science at the Whiting School of Engineering, has determined that even state-of-the-art LLMs like OpenAI’s GPT-4 cannot yet serve as reliable world simulators, more often than not failing to handle state changes that require math, reasoning, or common sense. The team recently presented its findings at the 62nd Annual Meeting of the Association for Computational Linguistics.
To test LLMs’ capabilities in predicting world states, the team first turned to text-based games—interactive environments with text interfaces that dominated computer gaming in the late ’70s and ’80s—following the precedent of LLM-powered games like AI Dungeon.
In work presented at the 2023 Conference on Empirical Methods in Natural Language Processing, the team demonstrated that GPT-4 could, in fact, use a set of 32 human-authored text-based games as templates for in-context learning and could generate new simulations for unseen basic tasks, such as boiling water or freezing ice cubes.
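As a rough illustration of that in-context-learning setup (a hypothetical sketch, not the team's actual code), a prompt might concatenate a few human-authored game programs as examples and then ask the model to write a new simulation for an unseen task; the file names, task, and API usage below are assumptions.

```python
# Minimal sketch of few-shot prompting for simulation generation.
# File names, the target task, and the prompt wording are hypothetical;
# the research team's actual pipeline may differ.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A few human-authored text-game programs serve as in-context examples.
example_games = [Path(p).read_text() for p in ["boil_water.py", "wash_dishes.py"]]

prompt = (
    "Here are example text-based simulation programs:\n\n"
    + "\n\n".join(example_games)
    + "\n\nWrite a new simulation program for the task: freeze ice cubes."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```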
Based on the gameplay recordings in this work, the team then created a new dataset with over 75,000 state transitions—such as “Current State: Room temperature water is in a pot on the stove” + “Action: Turn on stove” → “New State: The water is boiling”—and tested whether LLMs like GPT-4 could accurately and logically simulate such state transitions by themselves.
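In broad strokes, each record in such a dataset pairs a current state and an action with the resulting state, and evaluation checks whether the model's predicted next state matches the recorded one. The sketch below is a hypothetical illustration of that check; the dataset's real schema and evaluation code may differ.

```python
# Hypothetical illustration of a state-transition record and an accuracy check.
from dataclasses import dataclass

@dataclass
class Transition:
    state: str       # e.g., "Room temperature water is in a pot on the stove"
    action: str      # e.g., "Turn on stove"
    next_state: str  # e.g., "The water is boiling"

def accuracy(transitions, predict_next_state):
    """Fraction of transitions where the model's predicted next state matches."""
    correct = sum(
        predict_next_state(t.state, t.action) == t.next_state for t in transitions
    )
    return correct / len(transitions)
```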
Across models and conditions, the researchers found that the best recorded performance was 59.9% accuracy—and because simulation errors compound over longer runs, even a model that gets 60% of individual steps right has less than a 1% chance of simulating 10 consecutive steps without error. Since even a simple task like boiling water involves well over 10 state-transition steps—finding a pot, bringing it to the sink, turning on the faucet, filling the pot, turning off the faucet, and so on—this poses a significant problem for LLMs simulating the real world.
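The arithmetic behind that drop-off is straightforward: if each step is simulated correctly with probability 0.6 and errors compound independently, the chance of getting 10 consecutive steps right is 0.6 raised to the 10th power, or roughly 0.6%.

```python
# Probability that a 10-step simulation is error-free when each step
# is predicted correctly 60% of the time (independence assumed).
per_step_accuracy = 0.6
steps = 10
print(per_step_accuracy ** steps)  # ≈ 0.006, i.e., about 0.6%
```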
“Our results indicate that LLMs are not yet able to reliably act as text world simulators,” says Xiao. “Although they show some capabilities in world modeling, LLMs require substantial improvement and much more future research to be able to generate the reliable, complex, long-term simulations we’re looking for.”
He and his collaborators note that LLMs specialized for specific tasks may perform better than general-purpose models like GPT-4, and they plan to continue investigating how to train LLMs to generate more accurate abstract simulators.
“Through realistic and interactive simulation, we can not only improve AI capabilities in critical applications, such as creating synthetic training environments, but also advance scientific discovery—for example, by simulating realistic environments for studying human behaviors,” says Xiao.
Additional authors of this work include lead author Ruoyao Wang, a PhD student at the University of Arizona; Graham Todd, a PhD student at New York University; Xingdi Yuan and Marc-Alexandre Côté, researchers at Microsoft Research Montréal; Peter Clark, the senior research director of the Allen Institute for AI (AI2); and Peter Jansen, an associate professor at the University of Arizona currently on sabbatical at AI2.