Author: Jaimie Patterson

Reinforcement learning, or RL, in which algorithms learn to make decisions through trial and error, typically requires interaction with the environment so that an AI model can learn from its mistakes. In real-world applications, like medical treatment or autonomous driving, those mistakes can be expensive or even deadly, leading AI researchers to train these sorts of models in simulated environments instead.
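For readers unfamiliar with the setup, the sketch below shows that trial-and-error loop in Python. It uses the open-source Gymnasium toolkit and its CartPole task purely as a stand-in; neither is specific to this study, and the random action choice stands in for a learned policy.

```python
# Minimal sketch of the RL trial-and-error loop, using Gymnasium's CartPole
# task as a stand-in environment (not one used in the study).
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

for step in range(200):
    action = env.action_space.sample()  # a real agent would pick actions from its learned policy
    obs, reward, terminated, truncated, info = env.step(action)  # the reward signals whether the action helped
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```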

To ensure that machine learning algorithms trained in simulation still work in the physical world, a Johns Hopkins-led team proposes an imitation learning approach—AI training typically based on expert demonstrations—that can achieve optimal, real-world results. Their work appeared at the 38th Annual Conference on Neural Information Processing Systems in December.

The study of transferring RL policies—the strategies that AI agents use to choose their actions—from a simulated (source) environment to a real-world (target) environment is called off-dynamics reinforcement learning.

“The difference between the two environments is the transition probability, also known as dynamics—meaning that taking the same action at the same state in the two environments might result in a different end state,” explains senior author Anqi “Angie” Liu, an assistant professor of computer science.
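As a toy illustration of that mismatch (not drawn from the paper), the snippet below steps a one-dimensional point mass in two environments that differ only in a made-up friction parameter. The same action taken from the same state produces different next states, which is exactly the dynamics gap Liu describes.

```python
# Illustrative only: two toy environments sharing states and actions but with
# different transition dynamics. The friction coefficient is an invented
# parameter used to show how outcomes diverge.
import numpy as np

def step(state, action, friction):
    """One transition of a 1D point mass: state = (position, velocity)."""
    pos, vel = state
    vel = (1.0 - friction) * vel + 0.1 * action   # dynamics depend on friction
    pos = pos + 0.1 * vel
    return np.array([pos, vel])

state, action = np.array([0.0, 1.0]), 1.0
next_source = step(state, action, friction=0.05)  # simulated (source) dynamics
next_target = step(state, action, friction=0.20)  # real-world (target) dynamics
print(next_source, next_target)  # same state and action, different end states
```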

The team started by identifying limitations of existing off-dynamics RL methods, finding that they often produce subpar results in a target environment. And while imitation learning can improve an RL method’s success, the researchers observed that it’s still unstable when there are significant differences between the source and target environments.

To address this issue, the team proposes using an imitation from observation process and a “reward-augmented estimator.” This involves introducing the source environment’s true reward signal—a measure of whether the agent’s actions were helpful or harmful—to help stabilize the learning process.
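Schematically, and only as one reading of the idea rather than the paper's actual estimator, the augmentation can be thought of as blending the imitation signal with the source environment's true reward; the function name and weighting below are illustrative assumptions.

```python
# Schematic only: the paper defines its own reward-augmented estimator; this
# blend is a simplified illustration of "introducing the true reward signal."
def augmented_reward(imitation_reward, source_true_reward, weight=0.5):
    """Combine the imitation signal with the source environment's true reward
    to give the learner a more stable training signal."""
    return imitation_reward + weight * source_true_reward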

First, the team obtains action-state sequences from the source environment that resemble ideal ones for the target domain—for example, by generating data using Environment A that mimics the optimal behavior data in Environment B. Then they transfer the policy’s behavior from the source to the target domain through imitation learning from observation. But unlike existing approaches, this one doesn’t require an expert demonstration for learning to occur.

For example, take an uncooked egg (beginning state) and a scrambled one (end state). In regular imitation learning, an algorithm would have to be trained on the actual cooking of the egg. In contrast, an imitation from observation algorithm must infer that the cooking process took place sometime between the beginning and end states to achieve the outcome of “scrambled egg.”

“This is a more realistic option for training RL methods because it doesn’t require direct supervision, which can be expensive and time-consuming,” says Liu.
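One common way to learn from observations alone, shown here only as a hedged sketch rather than the team's exact algorithm, is GAIfO-style: train a discriminator on (state, next state) pairs, with no actions, and use its output as a learned reward for the RL agent. The state dimensionality and network sizes below are assumptions for illustration.

```python
# Sketch of the general "imitation from observation" idea (GAIfO-style), not
# the paper's exact method: a discriminator scores (state, next_state) pairs,
# and its output serves as a learned reward. No expert actions are needed.
import torch
import torch.nn as nn

state_dim = 8  # assumed dimensionality for illustration

discriminator = nn.Sequential(
    nn.Linear(2 * state_dim, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(demo_transitions, policy_transitions):
    """Each input: tensor of shape (batch, 2 * state_dim), a concatenated
    (state, next_state) pair per row. Labels: 1 for demonstration-like data,
    0 for the learner's own transitions."""
    logits_demo = discriminator(demo_transitions)
    logits_policy = discriminator(policy_transitions)
    loss = bce(logits_demo, torch.ones_like(logits_demo)) + \
           bce(logits_policy, torch.zeros_like(logits_policy))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def learned_reward(state, next_state):
    """Reward for the RL learner: higher when its transition looks like the
    demonstration data to the discriminator."""
    with torch.no_grad():
        pair = torch.cat([state, next_state], dim=-1)
        return -torch.nn.functional.logsigmoid(-discriminator(pair))
```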

The team tested their method in a broken environment, in which certain action dimensions were disabled—for example, removing the ability of a robot to perform specific maneuvers—as well as an environment with modified parameters, in which some important physics settings were changed, such as gravity. According to the researchers, their method outperforms previous baseline methods without imitation learning.
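A rough sketch of what such test environments can look like in code, using Gymnasium wrappers and the HalfCheetah task as assumed examples; the paper's actual benchmarks and settings may differ.

```python
# Hypothetical sketch of the two kinds of test environments described above:
# a "broken" environment with disabled action dimensions and a
# modified-parameter environment with altered gravity.
import gymnasium as gym
import numpy as np

class BrokenActionEnv(gym.ActionWrapper):
    """Disable the listed action dimensions by zeroing them before each step."""
    def __init__(self, env, disabled_dims):
        super().__init__(env)
        self.disabled_dims = disabled_dims

    def action(self, action):
        action = np.array(action, dtype=np.float32, copy=True)
        action[self.disabled_dims] = 0.0
        return action

source_env = gym.make("HalfCheetah-v4")                        # default dynamics
broken_env = BrokenActionEnv(gym.make("HalfCheetah-v4"), [0])  # joint 0 disabled

# Modified-parameter environment: change an important physics setting such as
# gravity (attribute path assumes the MuJoCo-based Gymnasium environments).
modified_env = gym.make("HalfCheetah-v4")
modified_env.unwrapped.model.opt.gravity[:] = [0.0, 0.0, -14.7]  # stronger gravity, illustrative value
```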

Their next steps involve investigating off-dynamics RL under additional safety constraints and extending their method to an offline setting, in which they only have pre-collected data from the source domain and no access to the target environment.

“Our broad goal is to make robust reinforcement learning using very limited resources possible,” says Liu. “And of course, avoiding directly training policies in high-risk environments like the operating room or a city street, where learning from mistakes isn’t an option.”

Additional authors of this work include Yihong Guo and Yixuan Wang, Johns Hopkins graduate students in computer science and biomedical engineering, respectively; Yuanyuan Shi, an assistant professor of electrical and computer engineering at the University of California, San Diego; and Pan Xu, an assistant professor of biostatistics and bioinformatics at Duke University.

This work was partially supported by Liu’s Amazon Research and Johns Hopkins Discovery Awards, a seed grant from the Institute of Assured Autonomy, and the Center for Digital Health and Artificial Intelligence.