본문 바로가기
On Going/Deep Learning[Paper Review] ConvNeXt - A ConvNet for the 2020s (4/n) ConvNet : Other tasks 1. object detection and segmentation on COCO도 잘하고 2. Semantic segmentation on ADE20K 도 잘하고 3. model efficiency도 낫다 In addition, Hybrid model = CNN + self attention이 연구되고 있다. Prior to ViT, the focus was on augmenting a ConvNet with self-attention/non-local modules [8, 55, 66, 79] to capture long-range dependencies The original ViT [20] first studied a hybrid configuration, and a large body of fol.. 2023. 12. 4.
[Paper Review] ConvNeXt - A ConvNet for the 2020s (3/n) ConvNet : Evaluations on classification ConvNeXtT/S/B/L, to be of similar complexities to Swin-T/S/B/L. ConvNeXt-T/B is the end product of the “modernizing” procedure on ResNet-50/200 regime, respectively. In addition, we build a larger ConvNeXt-XL to further test the scalability of ConvNeXt. The variants only differ in the number of channels C, and the number of blocks B in each stage 1. settings ImageNet-22K(21841개의 class, 14M개의 이미지.. 2023. 11. 28.
[Paper Review] ConvNeXt - A ConvNet for the 2020s (2/n) ConvNet : a Roadmap Starting Point is a ResNet-50 model을 Vision transformer가 학습하는 방식으로 학습시켰더니 performance가 더 좋았다. Deisign Decision (network modernization : CNN의 현대화) 1. Macro design 2. ResNeXt 3. Inverted bottleneck 4. Large kernel size 5. various layer-wise micro design Training Techniques 0. train a baseline model(resnet 50/200) with the vision transformer training procedure ~ DeiT와 Swin Trnasfomer에서 소개된 것들과 유사한 .. 2023. 11. 27.
[Paper Review] ConvNeXt - A ConvNet for the 2020s (1/n) Introduction In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. 대표적인 inductive bias 1. sliding-window manner = is intrinsic to visual processing, particularly when working with h.. 2023. 11. 27.
[FastVit] Vision Transformer from APPLE https://github.com/apple/ml-fastvit?s=09 https://arxiv.org/pdf/2303.14189.pdf In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural reparameterization to lower the memory access cost by removing s.. 2023. 8. 18.