In this talk from Xinlei Chen, Facebook AI Research, covers three of their recent explorations on pre-training.
First is an analysis on object/attribute detection pre-training, which produces bottom-attention features extensively used in vision and language research. The main finding is that plain grid features can work equally well without object proposals, while being significantly faster. Next is an approach for self-supervised visual representation learning. The main message is that a simple Siamese network can learn competitive representations, without commonly believed essential components such as contrastive pairs, or momentum encoders. Last is an architecture extension of major frameworks in self-supervised learning from convolutional networks to transformers. We find vision transformers can work out-of-box, subject to instability issues which we call out for awareness.