https://github.com/apple/ml-fastvit?s=09
https://arxiv.org/pdf/2303.14189.pdf
In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off.
To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural reparameterization to lower the memory access cost by removing skip-connections in the network.
We further apply train-time overparametrization and large kernel convolutions to boost accuracy, and empirically show that these choices have minimal effect on latency.
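The core idea behind RepMixer's reparameterization can be illustrated with a minimal sketch: at train time the mixer computes `y = x + DWConv(x)` (a skip connection around a depthwise convolution), and at inference the skip is folded into the kernel by adding an identity (centered delta) kernel, leaving a single branch with lower memory access cost. This is a simplified toy in NumPy, not the paper's implementation: the real RepMixer also folds BatchNorm statistics into the kernel, which is omitted here.

```python
import numpy as np

def depthwise_conv3x3(x, k):
    # x: (C, H, W) feature map, k: (C, 3, 3) per-channel kernels, zero padding 1
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((C, H, W))
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + 3, j:j + 3] * k[c])
    return out

def reparameterize(k):
    # Fold the identity skip connection into the kernel:
    # adding 1 at the kernel center makes conv(x, k') == x + conv(x, k).
    k2 = k.copy()
    k2[:, 1, 1] += 1.0
    return k2

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 4))
k = rng.standard_normal((2, 3, 3))

train_out = x + depthwise_conv3x3(x, k)          # two-branch train-time form
infer_out = depthwise_conv3x3(x, reparameterize(k))  # single-branch inference form
assert np.allclose(train_out, infer_out)
```

The equivalence holds exactly because convolving with a centered delta kernel is the identity map, so the skip-connection branch can be absorbed into the conv weights after training with no change in the function computed.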