Starting point: training a ResNet-50 model with the training procedure used for Vision Transformers already yields better performance.
Design Decisions (network modernization: modernizing the CNN)
1. Macro design
2. ResNeXt
3. Inverted bottleneck
4. Large kernel size
5. Various layer-wise micro designs
Training Techniques
0. Train a baseline model (ResNet-50/200) with the Vision Transformer training procedure
Training uses techniques similar to those introduced in DeiT and Swin Transformer, so the schedule is extended from the original 90 epochs to 300 epochs.
The AdamW optimizer is used.
Data augmentation uses Mixup, CutMix, RandAugment, and Random Erasing.
Regularization uses Stochastic Depth and Label Smoothing.
The full recipe is described in Appendix A.1.
-> These changes alone improve ResNet-50 performance by 2.7%; this training recipe is then fixed for all subsequent experiments.
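Below is a minimal PyTorch/torchvision sketch of this kind of recipe; the hyperparameter values are illustrative rather than the paper's exact settings, and Mixup/CutMix and Stochastic Depth are only indicated in comments.

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet50

# Data augmentation: RandAugment + Random Erasing
# (Mixup/CutMix would additionally be applied on batches, e.g. torchvision.transforms.v2.MixUp/CutMix).
train_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandAugment(),
    T.ToTensor(),
    T.RandomErasing(p=0.25),
])

model = resnet50(num_classes=1000)

# AdamW instead of SGD, with a long (300-epoch) cosine schedule.
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

# Label smoothing as regularization; Stochastic Depth would be added inside
# the residual blocks (e.g. torchvision.ops.StochasticDepth), omitted here.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
```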
1. Macro Design
A multi-stage design is adopted, where each stage has a different feature map resolution; two aspects are examined:
1-1. Compute ratio (changing the stage compute ratio)
- Swin Transformers use stage compute ratios of 1:1:3:1 (Swin-T) and 1:1:9:1 (larger models), so the ResNet-50 blocks per stage are changed from (3, 4, 6, 3) to (3, 3, 9, 3), which again improves performance.
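As a sketch (not the paper's code), with torchvision's generic ResNet constructor this change amounts to passing different block counts:

```python
# Block counts per stage: ResNet-50 vs. the Swin-T-aligned ratio.
resnet50_blocks   = (3, 4, 6, 3)   # original ResNet-50
modernized_blocks = (3, 3, 9, 3)   # stage compute ratio ~1:1:3:1, as in Swin-T

# torchvision's generic ResNet constructor takes the block counts as `layers`.
from torchvision.models.resnet import ResNet, Bottleneck
model = ResNet(Bottleneck, layers=list(modernized_blocks), num_classes=1000)
```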
1-2. Stem cell structure (changing stem to 'patchify')
- The ResNet stem cell is a 7x7 convolution layer with stride 2 followed by a max pool, giving a 4x downsampling of the input image. ViT instead uses a large kernel size (e.g., 14 or 16) with non-overlapping convolutions, a 'patchify' strategy that plays the same role as the stem cell; Swin Transformer uses a small patch size of 4 to accommodate its multi-stage design.
- Replacing the ResNet stem cell with a 'patchify' layer, a 4x4 convolution with stride 4, improves performance (see the sketch below).
-> This suggests that the stem cell in a ResNet may be substituted with a simpler “patchify” layer à la ViT which will result in similar performance.
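A shape-level PyTorch sketch of the two stems, assuming the 64/96 channel widths mentioned above and omitting the normalization that follows the patchify conv:

```python
import torch.nn as nn

# ResNet stem: 7x7 conv with stride 2 + 3x3 max pool with stride 2 -> 4x downsampling.
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# "Patchify" stem: a single non-overlapping 4x4 conv with stride 4,
# achieving the same 4x downsampling in one step.
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)
```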
2. ResNeXt-ify
The core component is grouped convolution, where the convolutional filters are separated into different groups.
ResNeXt employs grouped convolution for the 3×3 conv layer in a bottleneck block. As this significantly reduces the FLOPs, the network width is expanded to compensate for the capacity loss.
We note that depthwise convolution is similar to the weighted sum operation in self-attention, which operates on a per-channel basis, i.e., only mixing information in the spatial dimension.
The combination of depthwise conv and 1×1 convs leads to a separation of spatial and channel mixing, a property shared by vision Transformers, where each operation either mixes information across spatial or channel dimension, but not both.
-> FLOPs increase (the network width is expanded from 64 to 96 channels to match Swin-T), but accuracy again improves.
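A PyTorch sketch of the difference; the 32-group count is illustrative, and dim = 96 follows the width change noted above:

```python
import torch.nn as nn

dim = 96  # width expanded from 64 to match Swin-T's channel count

# ResNeXt-style grouped convolution: the 3x3 filters are split into groups (32 here).
grouped_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=32)

# Depthwise convolution is the special case groups == channels: each filter mixes
# information only spatially, within its own channel (like the per-channel weighted
# sum in self-attention).
depthwise_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

# 1x1 (pointwise) convs then mix information only across channels, giving the
# spatial/channel separation described above.
pointwise_conv = nn.Conv2d(dim, dim, kernel_size=1)
```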
3. Inverted Bottleneck (one of the key features of every Transformer block)
The hidden dimension of the MLP block is four times wider than the input dimension.
Interestingly, this Transformer design is connected to the inverted bottleneck design with an expansion ratio of 4 used in ConvNets.
Despite the increased FLOPs for the depthwise convolution layer, this change reduces the whole network FLOPs to 4.6G, due to the significant FLOPs reduction in the downsampling residual blocks' shortcut 1×1 conv layer.
-> FLOPs are reduced and performance improves again.
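A shape-only sketch of the two block layouts (normalization and activations omitted; dim = 96 as above):

```python
import torch.nn as nn

dim = 96

# (a) ResNe(X)t bottleneck: wide -> narrow -> wide.
bottleneck = nn.Sequential(
    nn.Conv2d(4 * dim, dim, kernel_size=1),                     # 384 -> 96: squeeze
    nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # depthwise 3x3 at the narrow width
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # 96 -> 384: expand back
)

# (b) Inverted bottleneck with expansion ratio 4: narrow -> wide -> narrow,
# mirroring the Transformer MLP whose hidden dimension is 4x the input.
inverted_bottleneck = nn.Sequential(
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                                   # 96 -> 384: expand
    nn.Conv2d(4 * dim, 4 * dim, kernel_size=3, padding=1, groups=4 * dim),    # depthwise 3x3, now at 384 (more FLOPs)
    nn.Conv2d(4 * dim, dim, kernel_size=1),                                   # 384 -> 96: project back
)
```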
4. Larger Kernel Sizes
One of the most distinguishing aspects of vision Transformers is their non-local self-attention, which enables each layer to have a global receptive field.
Although Swin Transformers reintroduced the local window to the self-attention block, the window size is at least 7×7, significantly larger than the ResNe(X)t kernel size of 3×3. Here we revisit the use of large kernel-sized convolutions for ConvNets.
4-1. Moving up the depthwise conv layer
4-2. Increasing the kernel size
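Combining these two steps, a shape-only sketch of the block with the depthwise conv moved to the top and its kernel enlarged to 7x7 (dim = 96 as before):

```python
import torch.nn as nn

dim = 96

# Depthwise conv moved to the top of the block (analogous to attention coming
# before the MLP in a Transformer) and enlarged from 3x3 to 7x7.
large_kernel_block = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # spatial mixing first, 7x7
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # expand (inverted bottleneck)
    nn.Conv2d(4 * dim, dim, kernel_size=1),                     # project back
)
```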
5. Micro Design
5-1. Replacing ReLU with GELU
5-2. Fewer activation functions
5-3. Fewer normalization layers
5-4. Substituting BN with LN
5-5. Separate downsampling layers
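Putting these micro-design changes together, a simplified PyTorch sketch of the resulting block; LayerScale and Stochastic Depth used in the paper are omitted, and the separate downsampling layer is only noted in a comment:

```python
import torch.nn as nn

class ConvNeXtLikeBlock(nn.Module):
    """Simplified block reflecting the micro-design choices above."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # a single LN instead of several BNs
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 convs written as Linear in channels-last
        self.act = nn.GELU()                    # a single GELU instead of several ReLUs
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                       # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)               # to channels-last for LN / Linear
        x = self.norm(x)
        x = self.pwconv2(self.act(self.pwconv1(x)))
        x = x.permute(0, 3, 1, 2)               # back to channels-first
        return shortcut + x

# Downsampling is no longer inside the block: a separate 2x2 conv with stride 2
# (preceded by a normalization layer) sits between stages.
```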
Closing remarks