Paper Link

Summary of Contributions

This paper introduces a new off-policy reinforcement learning algorithm called Soft Actor-Critic (SAC), with the following contributions:

  1. A new model-free RL algorithm that is significantly more robust and sample efficient than previous off-policy and on-policy model-free methods.
  2. A method to automatically tune the entropy regularization coefficient, which allows SAC to run with a single set of hyperparameters across a wide variety of environments (see the sketch after this list).
  3. State-of-the-art results on simulated MuJoCo locomotion tasks and on real-world robotic experiments.
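To make contribution 2 concrete: SAC maximizes expected return plus a temperature-weighted entropy term, J(pi) = sum_t E[r(s_t, a_t) + alpha * H(pi(.|s_t))], and the temperature alpha is adjusted so that the policy entropy stays near a target value. The snippet below is a minimal PyTorch sketch of that dual-gradient-descent update, assuming a batch of log-probabilities of actions sampled from the current policy is available; the names (log_alpha, target_entropy, update_temperature) are illustrative, not taken from the authors' code.

```python
import torch

# Hypothetical sketch of the automatic temperature update (contribution 2).
# Assumes log_pi holds log pi(a|s) for a batch of actions sampled from the
# current policy; all names below are illustrative.

action_dim = 6                                  # e.g. a 6-dimensional action space
target_entropy = -float(action_dim)             # heuristic target: -|A|

log_alpha = torch.zeros(1, requires_grad=True)  # optimize log(alpha) so alpha stays positive
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_temperature(log_pi: torch.Tensor) -> float:
    """One dual-gradient-descent step on the temperature alpha.

    The loss pushes alpha up when the policy entropy (-log_pi) drops below
    the target and pushes it down when the entropy exceeds the target.
    """
    alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()
```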

Detailed Comments

The paper presents a new method called Soft Actor-Critic (SAC), a model-free off-policy algorithm. Off-policy algorithms can reuse experience collected by previous policies and therefore have the potential to be more sample efficient than on-policy methods. The method is based on a modified RL objective: instead of maximizing expected return alone, it maximizes expected return plus the entropy of the policy. This encourages the algorithm to learn exploratory policies, removing the need for hand-coded exploration routines such as epsilon-greedy or additive Gaussian noise, while still favoring policies with high return. The paper also analyzes the method in the tabular setting and proves the following properties:

  1. Repeated application of the soft Bellman backup operator converges to the soft Q-value of the current policy (soft policy evaluation).
  2. The policy update step in SAC yields a policy whose soft value is no worse than before (soft policy improvement).
  3. Alternating soft policy evaluation and soft policy improvement converges to the soft-optimal Q-function.

In practice the learned policies are stochastic and are parameterized by function approximators as Gaussians. Using a Gaussian policy in the policy improvement step lets the authors compute the stochastic policy gradient via the reparameterization trick, which reduces variance. Tuning the entropy weight (temperature) by hand is cumbersome, so the authors formulate a constrained variant of SAC in which the policy entropy must stay above a predetermined threshold; the resulting temperature is adjusted by dual gradient descent (see the sketch after the contributions list above). A minimal sketch of the reparameterized policy update and the soft Bellman backup appears below.
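The following is a minimal sketch of how the reparameterized tanh-squashed Gaussian policy, the actor update, and the soft Bellman backup described above fit together. It assumes a critic q_net(s, a) and a temperature alpha are available; the network sizes and helper names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Tanh-squashed Gaussian policy with reparameterized sampling."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def sample(self, obs: torch.Tensor):
        h = self.body(obs)
        mean, log_std = self.mean(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mean, log_std.exp())
        pre_tanh = dist.rsample()                 # reparameterized: mean + std * noise
        action = torch.tanh(pre_tanh)             # squash into [-1, 1]
        # change-of-variables correction for the tanh squashing
        log_prob = dist.log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(-1, keepdim=True)

def actor_loss(policy, q_net, obs, alpha):
    """Soft policy improvement: maximize E[ Q(s,a) - alpha * log pi(a|s) ]."""
    action, log_prob = policy.sample(obs)
    return (alpha * log_prob - q_net(obs, action)).mean()

def critic_target(q_target_net, policy, next_obs, reward, done, alpha, gamma=0.99):
    """Soft Bellman backup: r + gamma * E[ Q_target(s',a') - alpha * log pi(a'|s') ]."""
    with torch.no_grad():
        next_action, next_log_prob = policy.sample(next_obs)
        soft_v = q_target_net(next_obs, next_action) - alpha * next_log_prob
        return reward + gamma * (1.0 - done) * soft_v
```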

The authors present experiments on simulated MuJoCo domains as well as on a real-world robotic claw and a quadruped robot. On the MuJoCo benchmarks, their method outperforms previous on-policy and off-policy methods and trains very stably. It also scales to complex high-dimensional environments such as Humanoid, where other methods fail. On the real-world quadruped task, the robot learns to walk in about two hours of real-world training time, which is an impressive achievement, and the learned policy is robust to mild variations in terrain. In the claw environment, a dexterous manipulation task, SAC learns in 20 hours from RGB images and in 3 hours from state observations, roughly twice as fast as PPO with state observations. Overall, the paper is clearly written and has wide-reaching implications for improving the stability and sample efficiency of reinforcement learning. I also appreciate that the experiments are comprehensive.