This paper presents a new way to learn from expert demonstrations, with the following contributions:
This paper proposes a new method to minimize the performance gap between the agent and the expert. It is assumed that the reward function is linear in known features of the state. In this setting, the value of a policy \pi can be written as w \cdot \mu_\pi, where \mu_\pi is the expected discounted feature visitation (the "feature expectations") of the policy. The algorithm works as follows:
1. Start with a random policy.
2. Find a reward weight vector that maximizes the margin by which the expert outperforms all policies seen so far in training.
3. Learn the optimal policy for the obtained reward weight vector.
4. Repeat from step 2 until the margin is sufficiently small.
The weight-vector optimization in step 2 resembles an SVM problem. The authors prove that the algorithm converges in a small, finite number of iterations and also characterize its sample complexity. They also propose a second variant, the "projection method", which replaces the max-margin optimization in step 2 with a projection step that is arguably simpler (see the sketch below).
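To make the iteration concrete, here is a minimal sketch of the projection variant in NumPy. It assumes two hypothetical helpers not provided by the paper: solve_mdp(w), an RL solver that returns an (approximately) optimal policy for reward weights w, and feature_expectations(policy), an estimator of a policy's discounted feature visitation; the stopping threshold eps and iteration cap are illustrative choices.

```python
import numpy as np

def projection_method(mu_expert, solve_mdp, feature_expectations,
                      n_iters=50, eps=1e-3):
    """Sketch of the projection variant.

    mu_expert: estimated feature expectations of the expert (length-k vector).
    solve_mdp, feature_expectations: hypothetical helpers standing in for an
    RL solver and a feature-visitation estimator.
    """
    # Step 1: start from an arbitrary policy and its feature expectations.
    policy = solve_mdp(np.random.randn(len(mu_expert)))
    mu = feature_expectations(policy)
    mu_bar = mu.copy()  # running projection of mu_expert onto the set spanned so far

    for _ in range(n_iters):
        # Step 2 (projection version): reward weights point from the current
        # projection toward the expert's feature expectations.
        w = mu_expert - mu_bar
        if np.linalg.norm(w) <= eps:  # performance-gap bound is small enough
            break
        # Step 3: compute the optimal policy for this reward and its feature expectations.
        policy = solve_mdp(w)
        mu = feature_expectations(policy)
        # Step 4: orthogonally project mu_expert onto the line through mu_bar and mu,
        # then repeat.
        d = mu - mu_bar
        mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / (d @ d) * d
    return policy, w
```

The projection update avoids solving a quadratic program at each iteration, which is why the authors describe it as the simpler of the two variants.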
The authors test their algorithms on a gridworld domain where the state features correspond to grid locations and rewards are sparsely distributed over a few cells. The expert demonstrations are generated by running RL on a reward function that is itself sampled randomly. Both the projection and max-margin methods are shown to converge in a small number of iterations to a low performance gap. Compared with previous baselines based on supervised learning, their method learns from fewer expert trajectories. In a simulated driving experiment, several driving scenarios are considered with about 2 minutes of expert demonstration, and the method achieves qualitatively good results that mimic the expert's driving style. Overall, the paper is clearly written and provides an imitation learning algorithm with strong theoretical guarantees. The experiments are on small, contrived environments but nonetheless demonstrate the utility of the method.