This paper, under review at ICLR, shows that given enough data, a standard Transformer can outperform Convolutional Neural Networks on image recognition tasks, a domain where CNNs classically excel. In this video, I explain the architecture of the Vision Transformer (ViT), the reasons why it works better, and rant about why double-blind peer review is broken.
0:00 – Introduction
0:30 – Double-Blind Review is Broken
5:20 – Overview
6:55 – Transformers for Images
10:40 – Vision Transformer Architecture
16:30 – Experimental Results
18:45 – What does the Model Learn?
21:00 – Why Transformers are Ruining Everything
27:45 – Inductive Biases in Transformers
29:05 – Conclusion & Comments
Paper (Under Review): https://openreview.net/forum?id=YicbFdNTTy
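The core trick of ViT is to treat an image as a sequence: cut it into fixed-size patches, linearly embed each flattened patch, and prepend a class token before feeding the sequence to a standard Transformer encoder. A minimal numpy sketch of that tokenization step (the projection matrix and class token here are random stand-ins for the learned parameters):

```python
import numpy as np

def image_to_patch_tokens(image, patch_size, embed_dim, rng):
    """Split an image (H, W, C) into non-overlapping patches and
    linearly project each flattened patch to an embedding vector."""
    H, W, C = image.shape
    p = patch_size
    # rearrange into (num_patches, p * p * C) flattened patch vectors
    patches = image.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    # random stand-in for the learned linear patch embedding
    W_embed = rng.standard_normal((p * p * C, embed_dim)) * 0.02
    tokens = patches @ W_embed                        # (num_patches, embed_dim)
    # random stand-in for the learnable [class] token
    cls = rng.standard_normal((1, embed_dim)) * 0.02
    return np.concatenate([cls, tokens], axis=0)      # sequence for the Transformer

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))
seq = image_to_patch_tokens(img, patch_size=8, embed_dim=64, rng=rng)
print(seq.shape)  # (17, 64): 16 patches + 1 class token
```

Position embeddings would be added to this sequence before the encoder; from that point on, the model is a vanilla Transformer.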
In this episode, Mandy from deeplizard builds on what we've learned about MobileNet, combining it with the fine-tuning techniques we've covered, to fine-tune MobileNet on a custom image data set using TensorFlow's Keras API.
By imposing an objectness prior, this paper proposes a module that recognizes permutation-invariant sets of objects from pixels in both supervised and unsupervised settings. It does so by introducing a slot attention module that combines an attention mechanism with dynamic routing.
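The distinctive step in slot attention is that the attention weights are normalized over the slots rather than over the inputs, so slots compete for input features; each slot then updates toward the weighted mean of the features it wins. A simplified numpy sketch of that iteration (omitting the GRU update, layer norms, and learned key/query/value projections of the actual module):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots, iters, rng):
    """Simplified slot attention: slots compete for input features via
    attention normalized over slots, then update toward the weighted
    mean of the features they attend to."""
    n, d = inputs.shape
    slots = rng.standard_normal((num_slots, d))       # random slot initialization
    for _ in range(iters):
        logits = slots @ inputs.T / np.sqrt(d)        # (num_slots, n)
        attn = softmax(logits, axis=0)                # normalize over slots -> competition
        attn = attn / attn.sum(axis=1, keepdims=True) # weighted-mean aggregation per slot
        slots = attn @ inputs                         # slot update (GRU omitted)
    return slots

rng = np.random.default_rng(0)
feats = rng.standard_normal((10, 8))                  # e.g. CNN feature vectors from pixels
slots = slot_attention(feats, num_slots=4, iters=3, rng=rng)
print(slots.shape)  # (4, 8): one vector per slot
```

Because slots are initialized randomly and updated symmetrically, the resulting set of slot vectors does not depend on the order of the input features, which is what makes the output a permutation-invariant set.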