Paper Link

Summary of Contributions

This paper presents a new approach to learning goal-reaching behavior from offline datasets, with the following contributions:

  1. A negative sampling method to mitigate the distribution shift problem when learning from offline datasets in the goal-conditioned RL setting.
  2. A goal-chaining method that stitches trajectories together, enabling the agent to reach temporally extended goals that may not be achieved by any single trajectory in the dataset.
  3. A demonstration of the utility of actionable models as a strong prior for downstream tasks.

Detailed Comments

This paper presents an approach to learn goal-reaching behaviors purely from offline datasets where the observations are images. To do this, the authors build on SAC+HER, a standard goal-reaching method commonly used in online RL. When applied directly to an offline dataset, SAC+HER suffers from Q-value overestimation, since the dataset contains no negative examples. Their method, Actionable Models, consists of two components:

  1. Learning a goal-reaching policy with negative sampling: The authors take a pessimistic perspective and force all actions not present in the dataset to not lead to the trajectory's goal, assuming that no recovery to the goal is possible once a different action is taken. They achieve this by penalizing the Q-function on out-of-distribution actions.
  2. Goal chaining: SAC+HER can only reach goals seen within a single trajectory. To reach goals that may span two or more trajectories, the authors propose sampling goals randomly from the dataset and relabelling the reward of the last state of the current trajectory with the terminal Q-value for that goal. This allows the agent to reason over extended horizons.
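To make the two components concrete, the following is a minimal PyTorch-style sketch of how I understand them; it is not the authors' code, and the names (q_net, target_q_net, policy, the batch keys, the simple squared penalty) are hypothetical placeholders for illustration only.

```python
# A minimal sketch (not the authors' implementation) of the two training-time
# modifications described above: (1) negative sampling for out-of-distribution
# actions and (2) goal chaining via the terminal Q-value.
import torch
import torch.nn.functional as F

GAMMA = 0.99


def critic_loss(q_net, target_q_net, policy, batch, num_negatives=4):
    """Critic loss for the goal-conditioned Q-function on one relabelled batch.

    `batch` is assumed to hold state s, action a, next state s_next, goal g
    (relabelled in hindsight), and a `reached` flag that is 1 only on the
    transition whose next state matches the goal.
    """
    s, a, s_next, g, reached = (
        batch["state"], batch["action"], batch["next_state"],
        batch["goal"], batch["reached"],
    )

    # Standard bootstrapped TD target: reward is 1 only when the goal is reached.
    with torch.no_grad():
        a_next = policy(s_next, g)
        target = reached + (1.0 - reached) * GAMMA * target_q_net(s_next, a_next, g)
    td_loss = F.mse_loss(q_net(s, a, g), target)

    # (1) Negative sampling: actions sampled from the current policy (rather than
    # the action recorded in the dataset) are treated pessimistically, i.e. they
    # are assumed not to lead to the goal, so their Q-values are pushed toward 0.
    with torch.no_grad():
        a_neg = [policy(s, g) for _ in range(num_negatives)]
    q_neg = torch.stack([q_net(s, a_i, g) for a_i in a_neg])
    negative_loss = (q_neg ** 2).mean()

    return td_loss + negative_loss


def relabel_with_goal_chaining(trajectory, random_goal, q_net):
    """(2) Goal chaining: relabel a trajectory with a goal drawn from elsewhere
    in the dataset. Instead of a hard 0/1 reward, the final transition's target
    is the terminal Q-value for the random goal, so value can propagate across
    trajectories that pass through similar states.
    """
    s_last, a_last = trajectory[-1]["state"], trajectory[-1]["action"]
    with torch.no_grad():
        terminal_value = q_net(s_last, a_last, random_goal)
    relabelled = [dict(t, goal=random_goal, reached=torch.zeros(1)) for t in trajectory]
    relabelled[-1]["terminal_target"] = terminal_value  # used in place of a reward
    return relabelled
```

In both pieces the same design choice shows up: Q-values are trusted only where the data provides evidence, either by pushing unseen actions toward zero or by bootstrapping across trajectories through the terminal Q-value.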

In the experiments, the authors demonstrate that this method can be used to learn general-purpose, arbitrary goal-reaching skills that are useful in downstream applications such as generalization to new goals, using actionable models as an auxiliary loss, and fine-tuning actionable models with a task-specific loss. The authors present experiments on both simulated environments and real robots. On the set of four simulated tasks, actionable models surpass previous baselines built for learning from online datasets by a wide margin on goal-reaching tasks. In the real-world experiments, the authors demonstrate a high success rate and ablate the goal-chaining component of their method; the experiments show goal chaining to be necessary for learning long-horizon behavior. On the real-world object-grasping task, pretraining with actionable models is shown to increase the success rate of the final policy, and using actionable models as an auxiliary loss is shown to speed up learning. Overall, the paper is clearly written, with extensive experiments, including ablations, spanning both simulated and real-world settings.