The authors construct ConvNeXt-T/S/B/L to be of similar complexities to Swin-T/S/B/L.
ConvNeXt-T/B are the end products of the "modernizing" procedure in the ResNet-50/200 regime, respectively.
In addition, we build a larger ConvNeXt-XL to further test the scalability of ConvNeXt.
The variants differ only in the number of channels C and the number of blocks B in each stage.
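For concreteness, the per-stage widths and depths reported in the paper can be written down as a small mapping (a plain data sketch; the dictionary name is just a label):

```python
# (channels C per stage, blocks B per stage), as listed in the ConvNeXt paper.
convnext_variants = {
    "ConvNeXt-T":  (( 96, 192,  384,  768), (3, 3,  9, 3)),
    "ConvNeXt-S":  (( 96, 192,  384,  768), (3, 3, 27, 3)),
    "ConvNeXt-B":  ((128, 256,  512, 1024), (3, 3, 27, 3)),
    "ConvNeXt-L":  ((192, 384,  768, 1536), (3, 3, 27, 3)),
    "ConvNeXt-XL": ((256, 512, 1024, 2048), (3, 3, 27, 3)),
}
```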
1. Settings
Pre-training is done on ImageNet-22K (21,841 classes, ~14M images), followed by fine-tuning on ImageNet-1K.
Training on ImageNet-1K
300 epochs, AdamW, lr = 4e-3.
There is a 20-epoch linear warmup and a cosine decaying schedule afterward.
batch size = 4096, weight decay = 0.05
Data augmentation = Mixup, CutMix, RandAugment, and Random Erasing
Regularization = Stochastic Depth, Label smoothing.
Layer Scale with an initial value of 1e-6.
Exponential Moving Average (EMA) to mitigate overfitting (these components are sketched in code below).
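These components map onto standard PyTorch/timm building blocks. A minimal sketch of the recipe follows; the toy model, `steps_per_epoch`, the Mixup/CutMix strengths, the stochastic-depth rate, and the EMA decay are illustrative assumptions, not values taken from the notes above.

```python
import math

import torch
from timm.data import Mixup
from timm.utils import ModelEmaV2
from torchvision.ops import StochasticDepth

# Toy stand-in for the real ConvNeXt model (architecture not shown here).
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 1000))

# Mixup + CutMix with label smoothing folded in (timm's Mixup collator).
# RandAugment and Random Erasing would come from timm's create_transform (not shown).
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, label_smoothing=0.1, num_classes=1000)

# Stochastic depth, applied to residual branches inside each block; rate is illustrative.
drop_path = StochasticDepth(p=0.1, mode="row")

# Layer Scale: learnable per-channel scale on each block's output, initialized to 1e-6.
class LayerScale(torch.nn.Module):
    def __init__(self, dim, init_value=1e-6):
        super().__init__()
        self.gamma = torch.nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):
        return self.gamma * x

# AdamW with the reported lr and weight decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)

# 20-epoch linear warmup, then cosine decay to zero, stepped once per iteration.
epochs, warmup_epochs, steps_per_epoch = 300, 20, 313  # 313 ~= 1.28M images / 4096

def lr_lambda(step):
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch
    if step < warmup_steps:
        return step / max(1, warmup_steps)          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# EMA of the weights (call ema.update(model) after each optimizer step).
ema = ModelEmaV2(model, decay=0.9999)  # decay is an assumed typical value
```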
Pre-training on ImageNet-22K.
90 epochs with a 5-epoch warmup; no EMA. Otherwise the settings match the ImageNet-1K training above.
Fine-tuning on ImageNet-1K.
30 epochs, AdamW, lr = 5e-5, cosine learning rate schedule, layer-wise learning rate decay (sketched below),
no warmup, batch size = 512, weight decay = 1e-8.
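Layer-wise learning rate decay scales the learning rate down geometrically from the head toward the stem, so earlier (more general) layers move less during fine-tuning. A minimal sketch follows; the `ToyConvNeXt` layout and the 0.8 decay factor are illustrative assumptions, not the paper's exact grouping:

```python
import torch

class ToyConvNeXt(torch.nn.Module):
    # Hypothetical stand-in mimicking ConvNeXt's stem / stages / head layout.
    def __init__(self):
        super().__init__()
        self.stem = torch.nn.Conv2d(3, 96, kernel_size=4, stride=4)
        self.stages = torch.nn.ModuleList(
            [torch.nn.Conv2d(96, 96, kernel_size=7, padding=3, groups=96) for _ in range(4)]
        )
        self.head = torch.nn.Linear(96, 1000)

def layerwise_lr_groups(model, base_lr=5e-5, decay=0.8, weight_decay=1e-8):
    """One param group per layer, lr shrinking geometrically toward the input."""
    layers = [model.stem, *model.stages, model.head]
    n = len(layers)
    return [
        {
            "params": list(layer.parameters()),
            "lr": base_lr * decay ** (n - 1 - i),  # the head keeps base_lr
            "weight_decay": weight_decay,
        }
        for i, layer in enumerate(layers)
    ]

optimizer = torch.optim.AdamW(layerwise_lr_groups(ToyConvNeXt()))
```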
Fine-tuning is first done with 224x224 images, then repeated at the larger 384x384 resolution (for both the ImageNet-1K and ImageNet-22K pre-trained models).
This last step is simpler than with Transformers, since a fully convolutional network has no positional embeddings to interpolate or attention window sizes to adjust when the input resolution changes.
Key takeaway
Our results demonstrate that properly designed ConvNets are not inferior to vision Transformers when pre-trained on large datasets: ConvNeXts still perform on par with or better than similarly sized Swin Transformers, with slightly higher throughput.
ConvNeXt achieves results comparable to ViT with a similar number of parameters.