Paper Link

Summary of Contributions

The paper presents a way to combine vision and touch modalities for control with the following primary contributions:

  1. A new method for effectively fusing vision and haptic signals, trained with self-supervised learning on hand-designed auxiliary tasks.
  2. Presentation of a new set of tasks that require both haptic and visual reasoning.
  3. Demonstration that the representations learned with SSL generalize to other similar tasks.

Detailed Comments

In this work, the authors present a new way to combine vision and haptic modalities for robot control. They achieve this by combining three feature streams: vision features extracted by a CNN, force-torque features extracted with causal convolutions, and proprioceptive features extracted with an MLP, where the proprioceptive input is the state of the robot and the force-torque readings are obtained from the haptic sensor. Using data collected with a random policy plus a heuristic policy, they form three auxiliary tasks: predicting the optical flow from the combined features given the next action, predicting whether contact will be made given the next action, and predicting whether the different modality inputs are time-aligned. The authors are encouraged to provide a more thorough explanation of the heuristic policy used for data collection and to present ablations indicating its importance. It would be interesting to see a discussion of how the data-collection method affects data fusion. It is also not clear how the modality-specific inductive biases in this architecture contribute to the results.
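
For concreteness, below is a minimal sketch of how I understand the fusion architecture and the three self-supervised heads. All module names, sizes, and input dimensions are my own illustrative assumptions rather than details from the paper, and plain 1-D convolutions stand in for the causal convolutions used on the force-torque stream.

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Sketch of the fusion encoder: CNN for vision, 1-D convolutions over the
    force-torque history (standing in for the paper's causal convolutions), and
    an MLP for proprioception; the three embeddings are concatenated."""

    def __init__(self, embed_dim=64):
        super().__init__()
        # RGB image encoder (input assumed to be 3-channel images).
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Force-torque time series: 6 channels (3 force + 3 torque) x T steps.
        self.force_torque = nn.Sequential(
            nn.Conv1d(6, 16, 3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Proprioception: robot state vector (dimension 14 is an assumption).
        self.proprio = nn.Sequential(
            nn.Linear(14, 64), nn.ReLU(), nn.Linear(64, embed_dim),
        )

    def forward(self, image, ft_seq, proprio):
        # Fused multimodal feature: concatenation of the three embeddings.
        return torch.cat(
            [self.vision(image), self.force_torque(ft_seq), self.proprio(proprio)],
            dim=-1,
        )


class AuxiliaryHeads(nn.Module):
    """Sketch of the three self-supervised heads: optical-flow prediction and
    contact prediction conditioned on the next action, and a binary head that
    predicts whether the modality inputs are time-aligned."""

    def __init__(self, feat_dim=3 * 64, action_dim=4, flow_hw=32):
        super().__init__()
        self.flow = nn.Linear(feat_dim + action_dim, flow_hw * flow_hw * 2)
        self.contact = nn.Linear(feat_dim + action_dim, 1)
        self.alignment = nn.Linear(feat_dim, 1)

    def forward(self, z, next_action):
        za = torch.cat([z, next_action], dim=-1)
        return self.flow(za), self.contact(za), self.alignment(z)
```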

The features learned via SSL are concatenated, frozen, and used for policy learning on a peg-in-hole insertion task. They use TRPO for policy learning and show results in simulation as well as on a real robot. In the simulation experiments, we can see the benefit of fusion across modalities, which achieves a success rate of 80%, whereas leaving out any of the modalities leads to a stark drop in performance to less than 5%. I encourage the authors to present a comparison with "baselines + SSL", since their method has the combined advantage of SSL features and data fusion. On the real robot the fusion method has a high success rate, although it is surprising that the triangular peg has a higher success rate than the circular one; intuitively, the circular peg should be easier to insert due to its symmetry. The authors also experiment with transferring the representations learned with the triangular peg to hexagonal and square pegs and find that transferring representations is a preferable alternative to transferring policies. The authors should clarify how the peg is treated in the input, i.e., whether it appears in the visual input and in the optical-flow prediction; I assume that including the peg in the optical-flow prediction might hinder transfer to other pegs. The authors should also present an analysis of each auxiliary task and its importance, as it is not clear which auxiliary task is most relevant to robot control. Overall, the paper is clearly written but lacks sufficient analysis and ablations to establish the necessity of each of its submodules. I appreciate the real-robot experiments, as they help showcase the benefit of the proposed method.
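
To make the policy-learning stage concrete, here is a small sketch (again with assumed names and sizes) of how the frozen SSL features could feed a policy head; the paper trains this head with TRPO, which the sketch does not implement.

```python
import torch
import torch.nn as nn

class FrozenFeaturePolicy(nn.Module):
    """Sketch of the policy stage: the SSL encoder (e.g. the MultimodalEncoder
    above) is frozen, and only a small MLP head on top of the concatenated
    features is trained; the paper uses TRPO for that training loop."""

    def __init__(self, encoder, feat_dim=3 * 64, action_dim=4):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # representation stays fixed during RL
        self.policy_head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.Tanh(),
            nn.Linear(128, action_dim),
        )

    def forward(self, image, ft_seq, proprio):
        with torch.no_grad():
            z = self.encoder(image, ft_seq, proprio)  # frozen fused features
        return self.policy_head(z)  # action mean; TRPO would add exploration noise
```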