PWM: Policy Learning with Large World Models
Abstract
Reinforcement Learning (RL) has achieved impressive results on complex tasks but struggles in
multi-task settings with different embodiments. World models offer scalability by learning a
simulation of the environment, yet they often rely on inefficient gradient-free optimization
methods. We introduce Policy learning with large World Models (PWM), a novel model-based RL
algorithm that learns continuous control policies from large multi-task world models. By
pre-training the world model on offline data and using it for first-order gradient policy
learning, PWM effectively solves tasks with up to 152 action dimensions and outperforms methods
using ground-truth dynamics. Additionally, PWM scales to an 80-task setting, achieving up to
27% higher rewards than existing baselines without the need for expensive online planning.
Method overview
We introduce Policy learning with large World Models (PWM), a novel Model-Based RL (MBRL) algorithm
and framework aimed at deriving effective continuous control policies from large, multi-task world
models. We use pre-trained TD-MPC2 world models to learn control policies efficiently with first-order
gradients in under 10 minutes per task. Our empirical evaluations on complex locomotion tasks indicate that PWM not
only achieves higher reward than baselines but also outperforms methods that use ground-truth
simulation dynamics.
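To make this concrete, the sketch below shows one way a first-order gradient policy update through a frozen world model can look. The world_model interface (with encode, dynamics, and reward heads), the policy, the critic, the horizon, and all hyperparameters are illustrative assumptions rather than the released PWM implementation, and critic training is omitted.

import torch

# Minimal sketch: backpropagate through a frozen, pre-trained world model to
# obtain first-order policy gradients. All interfaces below are assumed.
def policy_loss(world_model, policy, critic, obs, horizon=16, gamma=0.99):
    z = world_model.encode(obs)               # latent state from an offline batch
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(z)                         # action stays in the autograd graph
        ret = ret + discount * world_model.reward(z, a)  # differentiable reward head
        z = world_model.dynamics(z, a)        # differentiable latent dynamics
        discount *= gamma
    ret = ret + discount * critic(z)          # bootstrap with a terminal value estimate
    return -ret.mean()                        # maximize the imagined rollout return

def policy_update(optimizer, world_model, policy, critic, obs):
    # The optimizer holds only policy parameters, so gradients flow *through*
    # the world model but never update it after offline pre-training.
    loss = policy_loss(world_model, policy, critic, obs)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
    optimizer.step()

Because the learned dynamics and reward are smooth neural networks, the gradient of this rollout objective tends to be better behaved than the gradient of the raw simulator, which is the property the single-task results below rely on.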
We evaluate PWM on high-dimensional continuous control tasks (left figure) and find that it not only
outperforms model-free baselines SAC and PPO but also achieves higher rewards than SHAC, a method using the
dynamics and reward functions of the simulator directly. In an 80-task setting (right figure) with a
large 48M-parameter world model, PWM consistently outperforms TD-MPC2, an MBRL method that uses
the same world model, without the need for online planning.
Single-task results
The figure shows the 50% interquartile mean (IQM) with solid lines, the mean with dashed lines, and
95% confidence intervals over all 5 tasks and 5 random seeds. PWM achieves higher reward than the
model-free baselines PPO and SAC, than TD-MPC2, which uses the same world model as PWM, and than
SHAC, which uses the ground-truth dynamics and reward functions of the simulator. These results indicate that well-regularized
world models can smooth out the optimization landscape, allowing for better first-order gradient
optimization.
Multi-task results
The figure shows the performance of PWM and TD-MPC2 on the 30-task and 80-task multi-task benchmarks,
with results over 10 random seeds. PWM outperforms TD-MPC2 while using the same world model
and no online planning (see the sketch below), making it the more scalable approach to large world models.
The right figure compares PWM, a single multi-task policy, with the single-task experts SAC and DreamerV3.
Notably, PWM matches their performance despite being multi-task and trained only on offline data.
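To illustrate the scalability point, here is a rough, assumption-laden contrast between PWM-style inference and a sampling-based planner of the kind TD-MPC2 uses (its actual planner is an MPPI-style procedure and considerably more involved); all interfaces, shapes, and numbers below are hypothetical.

import torch

# PWM-style inference: one policy forward pass per control step, no planning.
@torch.no_grad()
def pwm_act(world_model, policy, obs):
    return policy(world_model.encode(obs))

# Heavily simplified sampling-based planner: roll out many candidate action
# sequences in the world model and execute the first action of the best one.
@torch.no_grad()
def planner_act(world_model, critic, obs, horizon=3, samples=512, act_dim=6):
    z = world_model.encode(obs).expand(samples, -1)          # assumes shape (1, latent_dim)
    actions = torch.rand(samples, horizon, act_dim) * 2 - 1  # candidates in [-1, 1]
    returns = torch.zeros(samples)
    for t in range(horizon):
        returns += world_model.reward(z, actions[:, t]).squeeze(-1)
        z = world_model.dynamics(z, actions[:, t])
    returns += critic(z).squeeze(-1)                         # value bootstrap
    return actions[returns.argmax(), 0]

The planner pays samples × horizon world-model evaluations per control step, a cost that grows with model size, whereas the distilled policy pays a single forward pass, which is why avoiding online planning matters as world models get larger.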
Citation
@misc{georgiev2024pwm,
  title={PWM: Policy Learning with Large World Models},
  author={Georgiev, Ignat and Giridhar, Varun and Hansen, Nicklas and Garg, Animesh},
  eprint={2407.02466},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  year={2024}
}