Paper Link

Summary of Contributions

This paper presents a method for learning long-horizon behaviors by latent imagination, with the following contributions:

  1. The authors propose a new method that enables learning values, policies, and rewards in latent space, without being restricted to the limited planning horizon faced by online planning methods.
  2. The authors present a comparison of different representation learning methods to drive latent imagination.
  3. Dreamer is shown to achieve state-of-the-art results on a variety of domains such as the DeepMind Control Suite, Atari, and DeepMind Lab environments.

Detailed Comments

This paper presents a simple and fast method for deriving high-performing agents using latent imagination. The algorithm follows a typical reinforcement learning setup in which the agent collects environment interactions and learns value and reward functions, with two major differences:

  1. The observations are assumed to be non-Markovian and high dimensional. To deal with this, the authors use a recurrent network that maps the observations to latent states that are low dimensional and Markovian; these latent states serve as inputs to the value function, reward model, and policy. The authors present three ways to learn these latent representations: (a) directly predict rewards from states, (b) use a reconstruction-based loss that decodes states back to observations, along with predicting rewards and transitions, and (c) use a contrastive NCE loss, along with predicting rewards and transitions.
  2. The second part of the method is value function learning in the latent state space. Since the learned dynamics, reward, and value functions are all differentiable, a multi-step objective is proposed in which the current policy is rolled out for N steps in imagination and a TD-lambda-style backup is used to train the value function. The policy is updated to increase the N-step return using analytic gradients backpropagated through the dynamics and the value function, which efficiently reduces the variance of the policy improvement step. A minimal sketch of this behavior-learning step is given below.
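
To make the behavior-learning step concrete, here is a minimal PyTorch-style sketch of imagining a trajectory with the learned dynamics, computing TD-lambda-style targets, and backpropagating analytic gradients into the policy. The module names (`dynamics`, `reward_model`, `value_model`, `actor`) and the hyperparameter values are illustrative assumptions, not the authors' implementation.

```python
# Sketch of Dreamer-style behavior learning in a learned latent space (not the authors' code).
import torch

def imagine_rollout(dynamics, actor, start_state, horizon):
    """Roll the current policy forward in the learned, differentiable latent dynamics."""
    states, state = [start_state], start_state
    for _ in range(horizon):
        # Assumes actor(state) returns a distribution with a reparameterized sample,
        # so gradients can flow back through the sampled actions.
        action = actor(state).rsample()
        state = dynamics(state, action)          # learned transition model
        states.append(state)
    return torch.stack(states)                   # (horizon + 1, batch, state_dim)

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """TD-lambda-style targets computed backwards over the imagined trajectory."""
    returns, next_return = [], values[-1]
    for t in reversed(range(rewards.shape[0])):
        next_return = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_return)
        returns.append(next_return)
    return torch.stack(returns[::-1])

def behavior_losses(dynamics, reward_model, value_model, actor, start_state, horizon=15):
    states = imagine_rollout(dynamics, actor, start_state, horizon)
    rewards = reward_model(states[:-1])           # predicted rewards for imagined steps
    values = value_model(states)                  # bootstrap values, including the final state
    targets = lambda_returns(rewards, values)

    # Actor: maximize the lambda-return by backpropagating analytic gradients
    # through the differentiable dynamics, reward, and value functions.
    actor_loss = -targets.mean()

    # Value: regress onto detached targets, with states detached so the value loss
    # does not push gradients into the world model or the policy.
    value_loss = 0.5 * (value_model(states[:-1].detach()) - targets.detach()).pow(2).mean()
    return actor_loss, value_loss
```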

The authors experiment on a wide range of domains: the DeepMind Control Suite, Atari (discrete actions), and DeepMind Lab. Dreamer is seen to be the most sample efficient method, learning behaviors competitive with model-free baselines using far fewer environment interactions while outperforming the model-based baseline PlaNet. Dreamer trains in 3 hours on the control suite tasks, compared to PlaNet's 11 hours. The authors also compare different representation learning methods and show that reconstruction-based representation learning works best, suggesting that improved representation learning could further boost Dreamer's performance. Overall, the paper is clearly written, presents a wide array of experiments to validate the approach, and describes a method that is state-of-the-art in pixel-based control.
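
For reference, a rough sketch of what the two main representation objectives being compared look like, under the assumption of hypothetical `decoder` and `obs_encoder` modules (again, not the authors' code):

```python
# Illustrative sketch of the reconstruction-based and contrastive representation objectives.
import torch
import torch.nn.functional as F

def reconstruction_loss(decoder, states, observations):
    """Decode latent states back to observations and penalize the reconstruction error
    (a simplified stand-in for the likelihood term trained alongside reward and transition prediction)."""
    return F.mse_loss(decoder(states), observations)

def contrastive_nce_loss(obs_encoder, states, observations):
    """InfoNCE-style objective: each latent state should score highest against its own
    observation embedding, with the other observations in the batch acting as negatives."""
    obs_emb = obs_encoder(observations)                    # (batch, dim)
    logits = states @ obs_emb.t()                          # pairwise similarity scores
    labels = torch.arange(states.shape[0], device=states.device)
    return F.cross_entropy(logits, labels)                 # positive pairs lie on the diagonal
```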