[Paper Review] ConvNeXt - A ConvNet for the 2020s (3/n) ConvNet : Evaluations on classification

by 에아오요이가야 2023. 11. 28.

We construct different ConvNeXt variants, ConvNeXt-T/S/B/L, to be of similar complexities to Swin-T/S/B/L.

ConvNeXt-T/B is the end product of the “modernizing” procedure on ResNet-50/200 regime, respectively.

In addition, we build a larger ConvNeXt-XL to further test the scalability of ConvNeXt.


The variants differ only in the number of channels C and the number of blocks B in each stage.
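For reference, the per-stage values of C and B listed in the ConvNeXt paper can be written down as a small lookup table. This is only a sketch for reading convenience: the names `CONVNEXT_VARIANTS` and `total_blocks` are made up for this post and are not part of any library.

```python
# Per-stage channel widths C and block counts B, as listed in the ConvNeXt paper.
# CONVNEXT_VARIANTS and total_blocks are illustrative names, not a library API.
CONVNEXT_VARIANTS = {
    "T":  {"C": (96, 192, 384, 768),    "B": (3, 3, 9, 3)},
    "S":  {"C": (96, 192, 384, 768),    "B": (3, 3, 27, 3)},
    "B":  {"C": (128, 256, 512, 1024),  "B": (3, 3, 27, 3)},
    "L":  {"C": (192, 384, 768, 1536),  "B": (3, 3, 27, 3)},
    "XL": {"C": (256, 512, 1024, 2048), "B": (3, 3, 27, 3)},
}


def total_blocks(variant: str) -> int:
    """Sum the number of blocks over the four stages of one variant."""
    return sum(CONVNEXT_VARIANTS[variant]["B"])


if __name__ == "__main__":
    for name, cfg in CONVNEXT_VARIANTS.items():
        print(name, cfg["C"], cfg["B"], total_blocks(name))
```

Note that T and S share the same channel widths and differ only in the depth of the third stage, while S/B/L/XL share the same depths and differ only in width.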

1. Settings

The models are pre-trained on ImageNet-22K (21,841 classes, about 14M images) and then fine-tuned on ImageNet-1K.


Training on ImageNet-1K

300 epochs, AdamW, lr = 4e-3.

There is a 20-epoch linear warmup and a cosine decaying schedule afterward.

batch size = 4096, weight decay = 0.05

Data Augmentation = Mixup, Cutmix, RandAugment and Random Erasing

Regularization = Stochastic Depth, Label smoothing.

Layer Scale of initial value 1e-6

Exponential Moving Average (EMA) to prevent overfitting
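The learning rate schedule listed above (20-epoch linear warmup to lr = 4e-3, then cosine decay until epoch 300) can be sketched as a small function. This is a minimal illustration, not the authors' training code: the function name `lr_at_epoch`, the warmup starting from 0, the final learning rate `min_lr = 0.0`, and the per-epoch (rather than per-step) granularity are all assumptions made here.

```python
import math


def lr_at_epoch(epoch: float,
                base_lr: float = 4e-3,
                warmup_epochs: int = 20,
                total_epochs: int = 300,
                min_lr: float = 0.0) -> float:
    """Linear warmup followed by cosine decay.

    Assumptions (not stated in the post): warmup starts at 0 and the
    cosine phase ends at min_lr, which defaults to 0.0 here.
    """
    if epoch < warmup_epochs:
        # Linear ramp from 0 up to base_lr over the warmup epochs.
        return base_lr * epoch / warmup_epochs
    # Fraction of the cosine phase completed, between 0 and 1.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for e in (0, 10, 20, 160, 300):
        print(e, lr_at_epoch(e))
```

With these defaults the learning rate peaks at epoch 20 and reaches half of the peak value at epoch 160, the midpoint of the cosine phase.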


Pre-training on ImageNet-22K.

90 epochs with a warmup of 5 epochs. No EMA; the rest is the same as above.


Fine-tuning on ImageNet-1K.

30 epochs, AdamW, lr = 5e-5, cosine learning rate schedule, layer-wise learning rate decay,

no warmup, batch size = 512, weight decay = 1e-8
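Layer-wise learning rate decay gives layers closer to the input a smaller learning rate than layers closer to the output, so that low-level pre-trained features change less during fine-tuning. A rough sketch of the idea follows. The function name `layerwise_lrs`, the grouping of the network into `num_layers` groups, and the decay factor 0.8 are hypothetical values chosen for illustration; the post does not state them.

```python
from typing import List


def layerwise_lrs(num_layers: int,
                  base_lr: float = 5e-5,
                  decay: float = 0.8) -> List[float]:
    """Return one learning rate per layer group (index 0 = closest to the input).

    The last group trains at base_lr; each earlier group is scaled down
    by another factor of `decay`. The decay value is an assumption.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


if __name__ == "__main__":
    for i, lr in enumerate(layerwise_lrs(4)):
        print(f"group {i}: lr = {lr:.3e}")
```

In practice each value would be attached to the parameter group of the corresponding layers when building the optimizer.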

Training is first done with 224x224 images, and then the models are fine-tuned again at a larger resolution of 384x384 (for both the ImageNet-22K and ImageNet-1K pre-trained models).

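This last step is easier than with Transformers: since ConvNeXt is fully convolutional, there is no need to adjust the input patch size or interpolate position biases when the resolution changes.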



Our results demonstrate that properly designed ConvNets are not inferior to vision Transformers when pre-trained with large dataset — ConvNeXts still perform on par or better than similarly-sized Swin Transformers, with slightly higher throughput.



With a similar number of parameters, ConvNeXt achieves results comparable to ViT.
