Paper Link

Summary of Contributions

This paper presents an extensive study of the Transformer architecture, with minor modifications, applied to the image domain. The main contributions are as follows:

  1. The paper demonstrates that Transformers can completely replace convolutions given a large amount of data, or, as the paper puts it, "large scale training trumps inductive bias".

  2. It presents a modification of an existing approach, applying Transformers, which are prevalent in NLP, to vision tasks.

  3. It demonstrates state-of-the-art performance and shows that Transformers can be trained with fewer compute resources than comparable ResNet-based architectures.

Detailed Comments

The paper presents an in-depth, large-scale study of Transformers applied to vision tasks. Since applying self-attention to every pixel is too memory intensive, the authors extend the approach of Cordonnier et al. by using image patches as individual tokens; a typical patch size is 16x16. Each patch is flattened, linearly projected, and has a positional embedding added (a rough sketch of this step is given below). The authors state and demonstrate that they find no benefit from using 2D-aware positional embeddings and instead stick to simple 1D embeddings. Their architecture uses LayerNorm and residual connections, inspired by previous work; thus they completely replace convolutions with an attention-based mechanism. They additionally test a baseline called "Hybrid", which uses the feature maps of a CNN, rather than raw image patches, as the input tokens.
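To make the tokenization concrete, here is a minimal sketch of the patchify-and-embed step as I understand it (my own PyTorch illustration, not the authors' code; the class and argument names are mine):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Sketch: split an image into fixed-size patches, flatten each patch,
    project it to the embedding dimension, prepend a [class] token, and add
    learnable 1D positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # Linear projection of flattened patches: (p*p*C) -> embed_dim
        self.proj = nn.Linear(patch_size * patch_size * in_chans, embed_dim)
        # One learnable 1D position embedding per patch plus the [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # Cut the image into non-overlapping p x p patches
        x = x.unfold(2, p, p).unfold(3, p, p)                # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        x = self.proj(x)                                     # (B, N, embed_dim)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)                       # prepend [class] token
        return x + self.pos_embed                            # add 1D position embeddings
```

The hybrid baseline would replace the flatten-and-project step above with the spatial feature map of a CNN backbone, keeping the rest of the pipeline unchanged.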

They pretrain the Transformers on a large dataset and fine-tune on smaller datasets. Since image resolution may differ across datasets, the authors keep the patch size constant and instead interpolate the learned positional embeddings in 2D according to the image size (sketched below). I think this is a heuristic way to reuse the learned positional embeddings and wonder whether the authors could provide more analysis to show its validity.
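A minimal sketch of this fine-tuning trick, assuming bilinear interpolation over the 2D patch grid (the function name and arguments are my own illustration):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Reshape the learned patch position embeddings to their 2D grid,
    interpolate to the grid implied by the higher-resolution image, and
    flatten back. `pos_embed` has shape (1, 1 + old_grid**2, dim); index 0
    is the [class] token and is kept unchanged."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bilinear", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pe, patch_pe], dim=1)

# e.g. going from 224x224 input (14x14 patches of size 16) to 384x384 (24x24 patches):
# new_pe = resize_pos_embed(old_pe, old_grid=14, new_grid=24)
```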

The authors compare on a variety of transfer tasks, pretraining on ImageNet, ImageNet-21k, and JFT and transferring to ImageNet, ReaL, CIFAR-10/100, Oxford-IIIT Pets, and Oxford Flowers. They also evaluate on the VTAB benchmark, which measures low-data transfer. ViT (Vision Transformer) outperforms all other methods when pretrained on a large dataset, demonstrating that ViT with learned biases can outperform CNNs; with smaller pretraining datasets, CNNs still perform better, as shown in Figure 3. The authors additionally compare the compute needed by ViT and ResNet-based architectures and find that ViT requires 2-4x less compute to attain the same performance; they also find that hybrid models perform similarly to ViT once the dataset is large enough. Inspecting ViT leads to some interpretable findings: (a) the principal components of the learned patch embeddings look like plausible basis functions, though I think this is more qualitative than quantitative; (b) ViT learns interpretable (2D) position embeddings; and (c) attention in ViT's last layers spans the entire image. The final experiment tests ViT's utility in self-supervised learning, which gives a 2% performance boost over training from scratch, which I found quite impressive.
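As an illustration of finding (b), one could probe the learned position embeddings with cosine-similarity maps, roughly as follows (my own sketch, not the authors' analysis code; `pos_embed` and `grid` are assumed names):

```python
import torch
import torch.nn.functional as F

def pos_embed_similarity(pos_embed, grid):
    """For each patch position, compute the cosine similarity between its
    learned embedding and the embeddings of all other positions. In a trained
    ViT these maps reportedly show clear row/column (2D) structure.
    `pos_embed` is (1, 1 + grid*grid, dim), with index 0 the [class] token."""
    patch_pe = F.normalize(pos_embed[0, 1:], dim=-1)   # (grid*grid, dim)
    sim = patch_pe @ patch_pe.t()                      # (grid*grid, grid*grid)
    # sim[i] reshaped to (grid, grid) is the similarity map for patch i
    return sim.reshape(grid * grid, grid, grid)
```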

Overall, I think this paper presents a modest extension of an existing approach (Cordonnier et al.) for using Transformers in vision, but it offers a very extensive study establishing that Transformers can indeed replace convolutions, which have been the de facto standard in the field. This also paves the way for future research into more efficient and better uses of Transformers (beyond patches) that might work better for vision tasks. I encourage the authors to add a discussion of the reproducibility of the experiments, since they are quite compute intensive, with resource requirements unavailable to most of the public.