Paper Link

Summary of Contributions

This paper proposes a new way to learn hierarchical policies from non-expert (but meaningful) human demonstrations combined with reinforcement learning, with the following contributions:

  1. A hierarchical learning method that bootstraps from non-expert human data and is amenable to fine-tuning by reinforcement learning.
  2. A demonstration that hierarchical imitation learning can outperform flat imitation-learning alternatives.
  3. Quantitative results showing that their method outperforms baselines on long-horizon, multi-stage tasks.

Detailed Comments

In this paper, the authors present a novel way to learn policies for long-horizon tasks by decomposing the policy into a simple hierarchical architecture and bootstrapping it with human data before fine-tuning with RL. In the hierarchy, the high-level policy outputs a subgoal for the low-level policy, and that subgoal remains fixed for T timesteps (T = 30). Their relay policy learning (RPL) method has two stages: (1) relay imitation learning and (2) relay reinforcement learning. In stage 1, the authors leverage human demonstrations in which the human takes meaningful, though not necessarily expert, actions in the environment. They learn goal-reaching policies by relabelling the demonstration dataset into time chunks, using different window sizes for the high-level and low-level policies. In stage 2, the authors fine-tune this initialized policy with RL. Because there are two levels of hierarchy, the optimization is challenging; the authors therefore use decoupled updates, where one level is kept fixed while the other is updated via natural policy gradient, and an additional behavior-cloning loss keeps the policy close to the demonstration data.
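To make the relabelling step concrete, the sketch below shows how a single demonstration trajectory could be sliced into goal-conditioned training pairs for the two levels. This is a minimal sketch based on the description above; the function name, window sizes, and the particular choice of subgoal index are illustrative assumptions, not the authors' exact implementation.

```python
def relay_relabel(states, actions, low_window=30, high_window=150):
    """Sketch of relay-style goal relabelling for one demonstration.

    The low-level policy is trained on (state, nearby goal, action)
    tuples using a short window, while the high-level policy is trained
    on (state, distant goal, subgoal) tuples using a longer window.
    Window sizes here are illustrative placeholders.
    """
    low_data, high_data = [], []
    T = len(states)
    for t in range(T - 1):
        # Low level: a goal a short horizon ahead, supervised by the logged action.
        g_low = min(t + low_window, T - 1)
        low_data.append((states[t], states[g_low], actions[t]))

        # High level: a goal a long horizon ahead, supervised by an
        # intermediate state that serves as the subgoal label.
        g_high = min(t + high_window, T - 1)
        subgoal = states[min(t + low_window, g_high)]
        high_data.append((states[t], states[g_high], subgoal))
    return low_data, high_data
```

Both relabelled datasets would then be used for supervised, behavior-cloning-style training of the goal-conditioned low-level and high-level policies before the RL fine-tuning stage.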

The authors experiment on a kitchen manipulation domain with two interesting results. Each task in the kitchen domain consists of four subtasks, and success is measured by how many subtasks the policy completes. First, they show that hierarchical imitation learning improves over traditional flat imitation learning, owing to the additional data generated by relabelling and to the generalization gained from training on a variety of goals. Second, they show that baselines fail to learn in this long-horizon domain, achieving very low success rates, whereas RPL solves 3.5 of the 4 subtasks on average. They also ablate the window size and propose variations of their method. Overall, the paper's readability could be improved. The kitchen environment as the sole benchmark for long-horizon control may be too restrictive; these results might not carry over to unstable control domains such as Hopper. It would also help to include more experimental details and clarifications in the appendix.