Abstract
Reinforcement Learning (RL) has made significant strides in complex tasks but struggles in multi-task
settings with different embodiments. World model methods offer scalability by learning a simulation of the
environment, but often rely on inefficient gradient-free optimization methods for policy extraction. In
contrast, gradient-based methods exhibit lower variance but fail to handle discontinuities. Our work reveals
that well-regularized world models can generate smoother optimization landscapes than the actual dynamics,
facilitating more effective first-order optimization. We introduce Policy learning with multi-task World
Models (PWM), a novel model-based RL algorithm for continuous control. Initially, the world model is
pre-trained on offline data, and then policies are extracted from it using first-order optimization in less
than 10 minutes per task. PWM effectively solves tasks with up to 152 action dimensions and outperforms
methods that use ground-truth dynamics. Additionally, PWM scales to an 80-task setting, achieving up to 27%
higher rewards than existing baselines, without relying on costly online planning.
Method overview
We introduce Policy learning with multi-task World Models (PWM), a novel Model-Based RL (MBRL) algorithm
and framework aimed at deriving effective continuous control policies from large, multi-task world
models. We use pre-trained TD-MPC2 world models to efficiently learn control policies with first-order
gradients in under 10 minutes per task. Our empirical evaluations on complex locomotion tasks indicate that PWM not
only achieves higher rewards than baselines but also outperforms methods that use ground-truth
simulation dynamics.
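To make the policy-extraction step concrete, the following is a minimal sketch of first-order policy learning through a frozen world model. The module names, network sizes, rollout horizon, and the simple critic bootstrap are illustrative assumptions, not the released PWM or TD-MPC2 code; the point is that gradients flow from imagined rewards back to the policy through the learned, smooth dynamics.

# Minimal sketch (PyTorch): first-order policy optimization through a frozen world model.
# Shapes, architectures, and hyperparameters are illustrative assumptions, not the PWM release.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, LATENT_DIM, HORIZON, GAMMA = 24, 6, 64, 16, 0.99

class WorldModel(nn.Module):
    # Stand-in for a pre-trained latent world model (encoder, dynamics, reward head).
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(OBS_DIM, LATENT_DIM)
        self.dynamics = nn.Sequential(nn.Linear(LATENT_DIM + ACT_DIM, 128), nn.ELU(),
                                      nn.Linear(128, LATENT_DIM))
        self.reward = nn.Sequential(nn.Linear(LATENT_DIM + ACT_DIM, 128), nn.ELU(),
                                    nn.Linear(128, 1))

world_model = WorldModel()
for p in world_model.parameters():    # the world model stays frozen during policy learning
    p.requires_grad_(False)

policy = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ELU(),
                       nn.Linear(128, ACT_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ELU(), nn.Linear(128, 1))
opt = torch.optim.Adam(policy.parameters(), lr=2e-3)

def policy_loss(obs_batch):
    # Short-horizon rollout inside the world model; first-order gradients flow
    # through the learned dynamics and reward heads back to the policy parameters.
    z = world_model.encoder(obs_batch)
    ret, discount = 0.0, 1.0
    for _ in range(HORIZON):
        a = policy(z)
        ret = ret + discount * world_model.reward(torch.cat([z, a], dim=-1))
        z = world_model.dynamics(torch.cat([z, a], dim=-1))
        discount *= GAMMA
    ret = ret + discount * critic(z)  # terminal value bootstrap (critic training omitted here)
    return -ret.mean()                # maximize imagined return

obs = torch.randn(256, OBS_DIM)       # placeholder batch of offline observations
loss = policy_loss(obs)
opt.zero_grad(); loss.backward(); opt.step()

In this scheme the imagined rollouts replace gradient-free planning at decision time, which is why policy extraction can complete in minutes per task without online planning.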
We evaluate PWM on high-dimensional continuous control tasks (left figure) and find that it not only
outperforms the model-free baselines SAC and PPO but also achieves higher rewards than SHAC, a method that uses the
simulator's dynamics and reward functions directly. In an 80-task setting (right figure) using a
large 48M-parameter world model, PWM consistently outperforms TD-MPC2, an MBRL method that uses
the same world model, without the need for online planning.
Single-task results
The figure shows the 50% interquartile mean (IQM) with solid lines, the mean with dashed lines, and 95% confidence
intervals over all 5 tasks and 5 random seeds. PWM achieves higher reward than the model-free baselines PPO and
SAC, than TD-MPC2, which uses the same world model as PWM, and than SHAC, which uses the ground-truth
dynamics and reward functions of the simulator. These results indicate that well-regularized
world models can smooth out the optimization landscape, enabling more effective first-order gradient
optimization.
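The smoothing effect can be seen in a self-contained toy example (hypothetical, not from the paper): a step-like "true" reward has zero gradient almost everywhere, while a smooth surrogate of the kind a well-regularized world model might learn provides an informative first-order gradient.

# Toy illustration (hypothetical, not from the paper): first-order gradients through a
# discontinuous objective vs. a smooth learned surrogate of the same objective.
import torch

def true_reward(a):
    # Step function, e.g. a contact-like discontinuity: reward jumps from 0 to 1 at a = 0.3.
    return (torch.sign(a - 0.3) + 1.0) / 2.0

def surrogate_reward(a):
    # Smooth approximation a well-regularized world model might learn.
    return torch.sigmoid(20.0 * (a - 0.3))

a = torch.tensor(0.1, requires_grad=True)
g_true = torch.autograd.grad(true_reward(a), a)[0]        # 0.0: no useful learning signal
g_model = torch.autograd.grad(surrogate_reward(a), a)[0]  # > 0: pushes a toward the reward
print(g_true.item(), g_model.item())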
Multi-task results
The figure shows the performance of PWM and TD-MPC2 on the 30-task and 80-task benchmarks, with
results over 10 random seeds. PWM outperforms TD-MPC2 while using the same world model
and without any form of online planning, making it the more scalable approach for large world models.
The right figure compares PWM, a single multi-task policy, with the single-task experts SAC and DreamerV3.
Notably, PWM matches their performance despite being multi-task and trained only on offline data.
Citation
@misc{georgiev2024pwm,
title={PWM: Policy Learning with Multi-task World Models},
author={Georgiev, Ignat and Giridhar, Varun and Hansen, Nicklas and Garg, Animesh},
eprint={2407.02466},
archivePrefix={arXiv},
primaryClass={cs.LG},
year={2024}
}