This paper proposes a new way to learn from expert feedback in interactive settings, with the following contributions:
The paper motivates the problem of imitation learning by pointing out that supervised learning (behavior cloning) suffers from compounding errors: the expected cost of the learned policy can grow quadratically in the task horizon. This paper presents a new algorithm, DAGGER, which essentially does the following: 1. Collect trajectories with the current policy, occasionally mixing in actions from the expert policy; 2. Relabel the states visited by the current policy with the expert's actions; 3. Optimize the classification loss on the aggregated dataset and repeat from step 1. The authors show that this procedure is no-regret, in the sense that the average regret on the surrogate 0-1 classification loss goes to 0 as the number of iterations goes to infinity. They also show that the performance gap between the imitation policy and the expert policy grows only linearly in the horizon, as opposed to quadratically as in behavior cloning.
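To make the three steps above concrete, here is a minimal sketch of the DAGGER training loop in Python. This is my own illustration, not code from the paper: the environment interface (`env.reset`, `env.step`), `expert_policy`, `train_classifier`, and the decaying mixing schedule `beta_schedule` are assumed placeholders.

```python
import numpy as np

def dagger(env, expert_policy, train_classifier, n_iters=20,
           episodes_per_iter=10, beta_schedule=lambda i: 0.5 ** i):
    """Sketch of the DAGGER loop; all interfaces are hypothetical placeholders."""
    dataset_states, dataset_actions = [], []
    policy = None  # no learned policy yet, so the first iteration follows the expert

    for i in range(n_iters):
        beta = 1.0 if policy is None else beta_schedule(i)
        for _ in range(episodes_per_iter):
            state, done = env.reset(), False
            while not done:
                # Step 1: roll out a mixture of the current policy and the expert.
                if policy is None or np.random.rand() < beta:
                    action = expert_policy(state)
                else:
                    action = policy(state)
                # Step 2: relabel the visited state with the expert's action.
                dataset_states.append(state)
                dataset_actions.append(expert_policy(state))
                state, _, done = env.step(action)  # assumed (state, reward, done) interface
        # Step 3: retrain the classifier on the aggregated dataset and repeat.
        policy = train_classifier(dataset_states, dataset_actions)
    return policy
```

The key design choice is that training data is always aggregated across iterations and labeled by the expert at the states the learner actually visits, which is what gives the no-regret reduction described above.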
The authors present experiments in two imitation domains: Super Tux Kart and Mario. In both domains they compare against supervised learning, SMILe, and SEARN. They show that supervised learning does not improve with more data, and that DAGGER converges to a better policy faster than SMILe and SEARN. The authors also present an experiment on a handwriting recognition task, which is an instance of structured prediction; they show that DAGGER is competitive with SEARN on this task and outperforms the other baselines. Overall the idea of the paper is interesting, but the proofs are hard to follow and required me to spend a lot of time. I encourage the authors to add a more detailed explanation of the proofs in an appendix. The experiments seem convincing of the benefit of their approach.