In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve.
We gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way.
Representative inductive biases of ConvNets
1. The sliding-window manner is intrinsic to visual processing, particularly when working with high-resolution images.
2. Translation equivariance is desirable for tasks such as object detection.
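The translation-equivariance point above can be sketched numerically: shifting the input of a convolution shifts the output by the same amount. A minimal NumPy illustration (the 1-D signal and kernel are made up for demonstration):

```python
import numpy as np

def conv1d_valid(x, k):
    """Naive 1-D 'valid' convolution (cross-correlation, as in CNNs)."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

x = np.array([0., 1., 3., 2., 0., 0., 0., 0.])
k = np.array([1., -1.])                     # simple edge-detecting kernel

y = conv1d_valid(x, k)
y_shift = conv1d_valid(np.roll(x, 2), k)    # shift the input right by 2

# Away from the borders, conv(shift(x)) == shift(conv(x)):
print(np.allclose(np.roll(y, 2)[2:], y_shift[2:]))  # True
```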
In contrast, the Transformer was originally devised for NLP, so ViT carries no image-specific inductive biases.
Characteristics of ViT
1. The scaling behavior: performance keeps improving with larger models and more data.
Limitations of ViT
1. The global attention design has quadratic complexity with respect to the input size (computationally heavy).
2. Advanced approaches such as cyclic shifting are required, and
3. while the speed can be optimized, the system itself ends up very intricately designed.
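The quadratic-cost point in item 1 can be made concrete by counting the entries of the attention score matrix alone (ignoring projections). With H×W tokens, global self-attention forms an (HW)×(HW) matrix, whereas window attention with window size M forms an M²×M² matrix per window. A rough count, assuming square feature maps and a Swin-style window size of 7:

```python
# Rough count of attention-score entries: global vs. windowed attention.

def global_attn_scores(h, w):
    n = h * w          # number of tokens
    return n * n       # entries in the (HW) x (HW) score matrix

def window_attn_scores(h, w, m=7):
    n_windows = (h // m) * (w // m)
    return n_windows * (m * m) ** 2   # one (M^2 x M^2) matrix per window

for side in (14, 28, 56):
    g = global_attn_scores(side, side)
    l = window_attn_scores(side, side)
    print(f"{side}x{side}: global={g:,}  windowed={l:,}  ratio={g / l:.0f}x")
```

Global attention grows quadratically in the number of tokens, while windowed attention grows only linearly, so the ratio itself quadruples each time the spatial side doubles (4x, 16x, 64x above).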
Overcoming ViT's limitations
1. Hierarchical Transformers, e.g., a sliding-window strategy (attention within local windows), much like a ConvNet.
Swin Transformer adopted this approach and is arguably the first case showing that a Transformer can serve as a proper general-purpose backbone for computer vision.
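A minimal sketch of the window partitioning behind this local-attention strategy, written as an illustrative NumPy reimplementation of a Swin-style `window_partition` helper (not the official code; attention would then be computed independently inside each window):

```python
import numpy as np

def window_partition(x, m):
    """Split an (H, W, C) feature map into non-overlapping (m, m, C) windows.

    Returns an array of shape (num_windows, m, m, C). H and W are assumed
    to be divisible by m for simplicity.
    """
    h, w, c = x.shape
    x = x.reshape(h // m, m, w // m, m, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, m, m, c)

feat = np.arange(8 * 8 * 3).reshape(8, 8, 3).astype(np.float32)
wins = window_partition(feat, 4)
print(wins.shape)  # (4, 4, 4, 3): four 4x4 windows with 3 channels each
```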
Conclusion: the existing approaches (sliding windows, etc.) were not wrong after all.
Swin Transformer revealed one thing: the essence of convolution is not becoming irrelevant; rather, it remains much desired and has never faded.
ConvNets and hierarchical vision Transformers become different and similar at the same time: they are both equipped with similar inductive biases, but differ significantly in the training procedure and macro/micro-level architecture design. In this work, we investigate the architectural distinctions between ConvNets and Transformers and try to identify the confounding variables when comparing the network performance. Our research is intended to bridge the gap between the pre-ViT and post-ViT eras for ConvNets, as well as to test the limits of what a pure ConvNet can achieve.
Key Point: How do design decisions in Transformers impact ConvNets’ performance?