On Going/Deep Learning

[FastVit] Vision Transformer from APPLE

에아오요이가야 2023. 8. 18. 14:09

https://github.com/apple/ml-fastvit?s=09 

 

https://arxiv.org/pdf/2303.14189.pdf

 

In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off.

 

To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural reparameterization to lower the memory access cost by removing skip-connections in the network.
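The core idea behind RepMixer is RepVGG-style structural reparameterization: during training the block computes `x + DWConv(x)` with an explicit skip connection, but at inference the identity branch is folded into the convolution kernel itself, so the skip (and its memory traffic) disappears. Below is a minimal numpy sketch of that fusion on a single-channel map; the function name `dwconv3x3` and the omission of the norm layer (which the real RepMixer also folds in) are simplifications for illustration, not the paper's implementation.

```python
import numpy as np

def dwconv3x3(x, k):
    # "same"-padded 3x3 depthwise conv (cross-correlation) on one 2D channel
    H, W = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))
k = rng.standard_normal((3, 3))

# train-time branch: skip connection kept explicitly
y_train = x + dwconv3x3(x, k)

# inference-time: fold the identity skip into the kernel
# (identity == conv with a delta kernel: 1 at the center, 0 elsewhere)
k_rep = k.copy()
k_rep[1, 1] += 1.0
y_infer = dwconv3x3(x, k_rep)

assert np.allclose(y_train, y_infer)  # same output, no skip branch at inference
```

The two paths are numerically identical, which is why the reparameterized network can drop the skip connection without any accuracy change.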

 

We further apply train-time overparametrization and large kernel convolutions to boost accuracy, and empirically show that these choices have minimal effect on latency.
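Train-time overparametrization means adding extra linear branches during training that can be algebraically collapsed into a single weight at inference, so the deployed model pays no extra cost. A minimal numpy sketch of the idea (the branch names `W_main` / `W_extra` are illustrative, not the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(4)

# train-time: two parallel linear branches (overparameterized)
W_main = rng.standard_normal((4, 4))
W_extra = rng.standard_normal((4, 4))
y_train = W_main @ x + W_extra @ x

# inference: collapse both branches into one weight matrix
W_fused = W_main + W_extra
y_infer = W_fused @ x

assert np.allclose(y_train, y_infer)  # identical output, half the matmuls
```

Because the branches are linear, the fusion is exact; the extra capacity only helps optimization during training.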