This paper presents a new way to learn from expert demonstrations, with the following contributions:
This paper proposes a new method to minimize the performance gap between the agent and the expert. It is assumed that the reward function is linear in known features of the state. In this setting, the value of a policy \pi can be written as w \cdot \mu_\pi, where \mu_\pi is the expected discounted feature visitation (the "feature expectations") of the policy. The algorithm works as follows:
1. Start with a random policy.
2. Find a reward weight vector that maximizes the margin by which the expert outperforms all policies seen so far in training.
3. Learn the optimal policy for the obtained reward weight vector.
4. Repeat from step 2 until the margin is sufficiently small.
The weight-vector optimization in step 2 resembles an SVM problem. The authors prove that the algorithm converges in a small, finite number of iterations and also characterize its sample complexity. They also propose a second variant, the "projection method", which replaces the max-margin optimization in step 2 with a projection step that is arguably simpler (see the sketch below).
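To make the iteration concrete, here is a minimal sketch of the projection variant in NumPy. It assumes two hypothetical helpers not provided by the paper: solve_mdp(w), an RL solver that returns an (approximately) optimal policy for reward weights w, and feature_expectations(policy), an estimator of a policy's discounted feature visitation; the stopping threshold eps and iteration cap are illustrative choices.

```python
import numpy as np

def projection_method(mu_expert, solve_mdp, feature_expectations,
                      n_iters=50, eps=1e-3):
    """Sketch of the projection variant.

    mu_expert: estimated feature expectations of the expert (length-k vector).
    solve_mdp, feature_expectations: hypothetical helpers standing in for an
    RL solver and a feature-visitation estimator.
    """
    # Step 1: start from an arbitrary policy and its feature expectations.
    policy = solve_mdp(np.random.randn(len(mu_expert)))
    mu = feature_expectations(policy)
    mu_bar = mu.copy()  # running projection of mu_expert onto the set spanned so far

    for _ in range(n_iters):
        # Step 2 (projection version): reward weights point from the current
        # projection toward the expert's feature expectations.
        w = mu_expert - mu_bar
        if np.linalg.norm(w) <= eps:  # performance-gap bound is small enough
            break
        # Step 3: compute the optimal policy for this reward and its feature expectations.
        policy = solve_mdp(w)
        mu = feature_expectations(policy)
        # Step 4: orthogonally project mu_expert onto the line through mu_bar and mu,
        # then repeat.
        d = mu - mu_bar
        mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / (d @ d) * d
    return policy, w
```

The projection update avoids solving a quadratic program at each iteration, which is why the authors describe it as the simpler of the two variants.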
The authors test their algorithms on a gridworld domain where the state features correspond to grid locations and rewards are sparsely distributed over a few cells. The expert demonstrations are generated by running RL on a reward function that is itself sampled randomly. Both the projection and max-margin methods are shown to converge in a small number of iterations to a low performance gap. Compared with previous baselines based on supervised learning, their method learns from fewer expert trajectories. In a simulated driving experiment, several driving scenarios are considered with about 2 minutes of expert demonstration, and the method achieves qualitatively good results that mimic the expert's driving style. Overall, the paper is clearly written and provides an imitation learning algorithm with strong theoretical guarantees. The experiments are on small, contrived environments but nonetheless demonstrate the utility of the method.