Paper Link

Summary of Contributions

This paper studies Siamese networks and brings some of their interesting properties to light. Its contributions are as follows:

  1. Siamese networks can avoid collapse with a simple modification, rather than the more complicated mechanisms employed by previous papers.

  2. They present a stop-gradient operation that improves representation learning in Siamese networks, and they interpret the stop-gradient operation as an EM-like alternating optimization.

  3. They show that their simple modification to Siamese networks can outperform state-of-the-art techniques that rely on more complicated machinery, demonstrating that negative samples, large batches, and momentum encoders are not necessary for learning good representations.

Detailed Comments

The paper presents a simple modification to the original Siamese networks that addresses the issue of collapse, where the network outputs a constant. Siamese networks take in two transformations of an image and predict how close they are, and outputting a constant is a valid global minimum of this objective. Previous works prevented such degenerate solutions by using (1) negative samples (contrastive learning), (2) large batch sizes with advanced optimizers, or (3) momentum encoders. This work instead presents a modification, supported by a good amount of analysis, that prevents collapse for Siamese networks.
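To make the collapse mode concrete (a small worked example in my own notation, not taken from the paper): if the encoder ignores its input and outputs a fixed vector $c$ for every view, then both branches produce $c$, the cosine similarity between them is $1$ for every image, and the objective is trivially at its optimum even though the representation carries no information about the input.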

The Siamese network takes two transformations of the same image and encodes them. The first encoding is passed through an MLP that predicts the encoding of the second transformation, and the negative cosine similarity between the resulting vectors is minimized. The authors introduce a stop-gradient operator that stops the gradient from flowing through the second embedding (see the sketch below). They conduct an extensive ablation study of the architectural units used in Siamese networks and show that the stop-gradient operator is necessary to prevent collapse. They also demonstrate the importance of the prediction MLP and study other elements such as batch normalization and batch size. They find batch normalization to be particularly important, as it raises the linear-evaluation accuracy on ImageNet with the unsupervised learned features from 34.6 to 67.4 percent.
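For clarity, here is a minimal PyTorch-style sketch of the symmetrized loss with stop-gradient as I understand it (my own rendering, not the authors' code; `f` denotes the backbone plus projection MLP and `h` the prediction MLP, both placeholders):

```python
import torch.nn.functional as F

def neg_cosine(p, z):
    # Stop-gradient: detach() treats the target encoding as a constant,
    # so no gradient flows back through the second branch.
    z = z.detach()
    return -F.cosine_similarity(p, z, dim=-1).mean()

def simsiam_loss(f, h, x1, x2):
    # x1, x2: two random augmentations of the same batch of images
    z1, z2 = f(x1), f(x2)   # encodings from the shared encoder
    p1, p2 = h(z1), h(z2)   # predictions from the MLP head
    # Symmetrized: each view predicts the (detached) encoding of the other.
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```

The key point illustrated is that the target in each term is detached, which is the only change relative to a plain Siamese objective.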

The authors present a discussion of how the stop-gradient operator might change the optimization process; the argument is plausible, but it is not a proof of why collapse is prevented. The stop-gradient operator can be viewed as implementing an EM-like procedure in which the network outputs for the two transformations are treated as two sets of variables that are optimized alternately (a sketch of this formulation is given at the end of this review). The authors argue that this alternating procedure might change the solution path and prevent collapse, but I encourage them to provide further evidence supporting this hypothesis.

Finally, the authors compare their method, SimSiam, on two tasks: ImageNet linear evaluation of the unsupervised learned features and transfer to other datasets. In both cases, their simple method outperforms the baselines by a small margin. Overall, I think this paper is clear to read, and its strength is its simplicity, providing a baseline for future papers to compare against.
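Sketch of the alternating-optimization view referenced above, in my own notation (a paraphrase of the paper's hypothesis, with encoder $\mathcal{F}_\theta$, augmentation $\mathcal{T}$, and per-image latent targets $\eta_x$; any inaccuracies are mine):

$$\mathcal{L}(\theta, \eta) \;=\; \mathbb{E}_{x,\mathcal{T}}\!\left[\big\|\mathcal{F}_\theta(\mathcal{T}(x)) - \eta_x\big\|_2^2\right],$$

minimized alternately over the two sets of variables: $\theta^t \leftarrow \arg\min_\theta \mathcal{L}(\theta, \eta^{t-1})$, approximated by one SGD step with $\eta^{t-1}$ held fixed (which is where the stop-gradient enters), and $\eta_x^t \leftarrow \mathbb{E}_{\mathcal{T}}\!\big[\mathcal{F}_{\theta^t}(\mathcal{T}(x))\big]$, approximated by a single augmentation, with the prediction MLP argued to approximate the expectation over augmentations.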