[ViT] Vision Transformer Review The Vision Transformer (ViT) takes an image as input and extracts features using only a Transformer encoder. ViT first splits the input image into fixed-size patches. For example, dividing a 224×224 RGB image into 16×16 patches gives patches of 16×16×3 pixels each, for a total of 14×14 = 196 patches. Each patch is flattened from 2D pixels into a 1D vector and passed through a linear projection to obtain a patch embedding vector. 1. ViT model architecture overview: the image is split into fixed-size patches, each patch is embedded as a vector, positional encodings are added, and the resulting sequence is fed into the Transformer encoder. The encoder consists of several layers of multi-head self-attention and MLPs. Finally, .. 2025. 3. 24.
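As a minimal sketch of the patch embedding step described above (the class and hyperparameter names are illustrative, not the post's actual code), the following PyTorch snippet splits a 224×224 image into 16×16 patches and projects each flattened patch to an embedding vector:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly project each patch.

    Illustrative sketch: a 224x224 RGB input with 16x16 patches -> 14*14 = 196 tokens.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 196 for the example above
        # A conv with kernel_size = stride = patch_size is equivalent to
        # "flatten each patch and apply one shared linear layer".
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                   # x: (B, 3, 224, 224)
        x = self.proj(x)                    # (B, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)    # (B, 196, embed_dim), one token per patch
        return x

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

Position embeddings (and a class token, in the original ViT) are then added to these 196 tokens before the Transformer encoder.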
[Paper Review] ConvNeXt - A ConvNet for the 2020s (4/n) ConvNet : Other tasks 1. It also performs well on object detection and segmentation on COCO, 2. it also performs well on semantic segmentation on ADE20K, and 3. its model efficiency is better. In addition, hybrid models (CNN + self-attention) are being studied. Prior to ViT, the focus was on augmenting a ConvNet with self-attention/non-local modules [8, 55, 66, 79] to capture long-range dependencies. The original ViT [20] first studied a hybrid configuration, and a large body of fol.. 2023. 12. 4.
[Paper Review] ConvNeXt - A ConvNet for the 2020s (3/n) ConvNet : Evaluations on classification ConvNeXt-T/S/B/L are built to be of similar complexities to Swin-T/S/B/L. ConvNeXt-T/B is the end product of the "modernizing" procedure on the ResNet-50/200 regime, respectively. In addition, we build a larger ConvNeXt-XL to further test the scalability of ConvNeXt. The variants differ only in the number of channels C and the number of blocks B in each stage. 1. Settings: ImageNet-22K (21,841 classes, 14M images.. 2023. 11. 28.
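For reference, since the variants differ only in the per-stage channels C and block counts B, they can be summarized in a small Python table (values as reported in the ConvNeXt paper; treat this as a sketch rather than the post's own listing):

```python
# (blocks B per stage, channels C per stage) for each ConvNeXt variant.
# Only these two settings change across T/S/B/L/XL.
convnext_variants = {
    "T":  ((3, 3,  9, 3), ( 96, 192,  384,  768)),
    "S":  ((3, 3, 27, 3), ( 96, 192,  384,  768)),
    "B":  ((3, 3, 27, 3), (128, 256,  512, 1024)),
    "L":  ((3, 3, 27, 3), (192, 384,  768, 1536)),
    "XL": ((3, 3, 27, 3), (256, 512, 1024, 2048)),
}
```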
[Paper Review] ConvNeXt - A ConvNet for the 2020s (2/n) ConvNet : a Roadmap The starting point is a ResNet-50 model trained the way vision Transformers are trained, which already improves performance. Design decisions (network modernization of the CNN): 1. Macro design 2. ResNeXt 3. Inverted bottleneck 4. Large kernel size 5. Various layer-wise micro designs. Training techniques: 0. Train a baseline model (ResNet-50/200) with the vision Transformer training procedure, similar to those introduced in DeiT and Swin Transformer .. 2023. 11. 27.
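To make the "inverted bottleneck" and "large kernel size" steps concrete, here is a minimal PyTorch sketch of a block in the modernized design (following the publicly described ConvNeXt block layout; names and omitted details such as layer scale and stochastic depth are my simplifications, not the post's code):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Sketch of a ConvNeXt-style block: large-kernel depthwise conv + inverted bottleneck.

    dim -> dwconv 7x7 -> LayerNorm -> 1x1 expand (4*dim) -> GELU -> 1x1 reduce -> residual
    """
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # large kernel, depthwise
        self.norm = nn.LayerNorm(dim)            # applied in channels-last layout
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # inverted bottleneck: expand 4x
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # project back to dim

    def forward(self, x):                        # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # (B, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to (B, C, H, W)
        return shortcut + x

y = ConvNeXtBlock(96)(torch.randn(1, 96, 56, 56))
print(y.shape)  # torch.Size([1, 96, 56, 56])
```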
[Paper Review] ConvNeXt - A ConvNet for the 2020s (1/n) Introduction In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. Representative inductive biases: 1. the sliding-window manner is intrinsic to visual processing, particularly when working with h.. 2023. 11. 27.
[FastVit] Vision Transformer from APPLE https://github.com/apple/ml-fastvit?s=09 https://arxiv.org/pdf/2303.14189.pdf In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural reparameterization to lower the memory access cost by removing s.. 2023. 8. 18.
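The preview doesn't show code, but the structural reparameterization idea that RepMixer relies on can be sketched in a few lines of PyTorch (simplified: a plain depthwise conv with no normalization, and the variable names are mine, not the ml-fastvit repo's API):

```python
import torch
import torch.nn as nn

# Structural reparameterization sketch: at training time the token mixer is
# "x + dwconv(x)"; at inference the identity branch is folded into the conv
# kernel, so the skip connection (and its memory access cost) disappears.
dim = 64
dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim, bias=True)

x = torch.randn(1, dim, 32, 32)
train_time = x + dwconv(x)                 # two branches during training

# Fold the identity into the depthwise kernel: add 1 at each kernel's center.
fused = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim, bias=True)
with torch.no_grad():
    fused.weight.copy_(dwconv.weight)
    fused.weight[:, 0, 1, 1] += 1.0        # identity = per-channel delta kernel
    fused.bias.copy_(dwconv.bias)

inference_time = fused(x)                  # single branch, same output
print(torch.allclose(train_time, inference_time, atol=1e-5))  # True
```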