1. Overview of multi-modal learning (multiple sensory modalities)
Challenges
(1) - Different representations across modalities
(2) - Imbalance between heterogeneous feature spaces - possibility of 1:N matching
(3) - A model may become biased toward a specific modality
Despite these challenges, multi-modal learning is fruitful and important
Matching : map different data types into a common shared space
Translating : translate one data type into another data type
Referencing : different data types reference each other
These are the approaches currently used to tackle the challenges above.
2. Multi-modal tasks(1) - Visual data & Text
Text embedding - Example
- Characters are hard to use in machine learning
- Map to dense vectors
- Surprisingly, generalization power is obtained by learning dense representations
Basics of text representation
word2vec uses the Skip-gram model
- Trained to learn \(W\) and \(W'\)
- Rows in \(W\) represent word embedding vectors
- Learning to predict neighboring \(N\) words for understanding relationships between words
- Given a model with a window of size 5, each center word is related to its 4 neighboring words
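As a rough illustration of the Skip-gram setup, here is a minimal PyTorch sketch: the rows of the input embedding matrix \(W\) serve as word vectors, and the model is trained to predict context words from the center word. The vocabulary size, dimensions, and the toy (center, context) indices are all made-up placeholders.

```python
import torch
import torch.nn as nn

# Toy Skip-gram: rows of in_embed.weight (W) become the word embedding vectors.
vocab_size, embed_dim = 1000, 64                          # assumed toy sizes
in_embed = nn.Embedding(vocab_size, embed_dim)            # W
out_embed = nn.Linear(embed_dim, vocab_size, bias=False)  # W'

center = torch.tensor([3, 3, 3, 3])    # the center word, repeated per context word
context = torch.tensor([1, 2, 4, 5])   # its 4 neighbors (window of size 5)

logits = out_embed(in_embed(center))   # predict each neighbor from the center word
loss = nn.functional.cross_entropy(logits, context)
loss.backward()                        # gradients update both W and W'
```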
Joint embedding (tackling the Matching problem) - Image tagging (combining pre-trained unimodal models)
- Can generate tags of a given image, and retrieve images by a tag keyword as well
- Use the pre-trained model for each modality together to build a joint embedding
- When a text and an image form a pair, train to reduce the distance between their embeddings; when they do not, push the embeddings apart (metric learning) - see the sketch after this list
-> Surprisingly, the learned embeddings hold analogy relationships between visual and text data
A cosine similarity loss (for the joint embedding) and a semantic regularization loss (a loss incorporating high-level semantics) are used together
Application : food image ~ its recipe
recipe : the ingredients and the instructions are each encoded separately into a vector embedding
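A hedged sketch of the metric-learning objective above: paired image/text embeddings are pulled together and non-pairs are pushed at least a margin apart under cosine similarity. The random tensors stand in for the outputs of the pre-trained unimodal encoders; the margin value is an arbitrary placeholder.

```python
import torch
import torch.nn.functional as F

def pairwise_matching_loss(img_emb, txt_emb, margin=0.2):
    """Matched rows of img_emb and txt_emb are (image, text) pairs."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    sim = img_emb @ txt_emb.t()          # cosine similarity matrix
    pos = sim.diag().unsqueeze(1)        # similarity of the true pairs
    # hinge: every non-pair should be at least `margin` less similar than its pair
    cost = (margin + sim - pos).clamp(min=0)
    off_diag = ~torch.eye(len(sim), dtype=torch.bool)
    return cost[off_diag].mean()

# toy usage with random features standing in for the image / text encoder outputs
print(pairwise_matching_loss(torch.randn(8, 256), torch.randn(8, 256)))
```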
Cross modal translation (tackling the Translating problem)
Application - Image captioning
Show and Tell : captioning as image-to-sentence ~ CNN for image & RNN for sentence
Encoder : a CNN model pre-trained on ImageNet
Decoder : LSTM module
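A rough encoder-decoder sketch in the spirit of Show and Tell: a CNN (ImageNet-pretrained in practice) encodes the image into a feature that is fed to an LSTM as the first step, and the LSTM then predicts the caption token by token. All sizes and the vocabulary are placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18()                                    # pretrained weights in practice
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])   # drop the classifier head
        self.img_proj = nn.Linear(512, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feat = self.encoder(images).flatten(1)          # (B, 512) image feature
        img_token = self.img_proj(feat).unsqueeze(1)    # image fed as the first "word"
        seq = torch.cat([img_token, self.word_embed(captions)], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                         # per-step vocabulary logits

logits = CaptionModel()(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 10)))
```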
Show, Attend and Tell (paper)
1. Input Image
2. Convolutional feature extraction - intermediate feature maps are used
3. RNN with attention over the image
4. Word by word generation
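A minimal sketch of the soft attention step in this pipeline: at each decoding step the RNN hidden state scores every spatial location of the convolutional feature map and takes a weighted sum as the context vector. The dimensions (7x7 grid, 512 channels) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Attend over L spatial locations of a CNN feature map, given the RNN state."""
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_fc = nn.Linear(feat_dim, attn_dim)
        self.hidden_fc = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, L, feat_dim) flattened spatial grid, hidden: (B, hidden_dim)
        e = self.score(torch.tanh(self.feat_fc(feats) + self.hidden_fc(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)        # (B, L, 1) attention weights over locations
        context = (alpha * feats).sum(dim=1)   # weighted sum = context vector for the next word
        return context, alpha.squeeze(-1)

ctx, weights = SoftAttention()(torch.randn(2, 49, 512), torch.randn(2, 512))  # 7x7 grid -> 49
```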
Text-to-image by generative models
Cross modal reasoning (tackling the Referencing problem)
Visual question answering - multiple streams, joint embedding, end-to-end training
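A hedged sketch of that two-stream baseline: one stream takes (precomputed) image features, the other encodes the question with an LSTM, the two are fused into a joint embedding (elementwise product here, one common choice), and a classifier predicts the answer end-to-end. Every size below is a placeholder.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, img_dim=2048, q_vocab=10000, hidden=1024, n_answers=1000):
        super().__init__()
        self.img_fc = nn.Linear(img_dim, hidden)              # visual stream
        self.q_embed = nn.Embedding(q_vocab, 300)
        self.q_rnn = nn.LSTM(300, hidden, batch_first=True)   # question stream
        self.classifier = nn.Linear(hidden, n_answers)

    def forward(self, img_feat, question):
        v = torch.tanh(self.img_fc(img_feat))
        _, (h, _) = self.q_rnn(self.q_embed(question))
        q = torch.tanh(h[-1])
        joint = v * q                      # fuse the two streams into a joint embedding
        return self.classifier(joint)      # answer logits

answers = SimpleVQA()(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))
```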
3. Multi-modal tasks(2) - Visual data & Audio
Sound representation (spectrogram) - acoustic feature extraction from waveform to spectrogram
- STFT (short-time Fourier transform) : converts waveform data into spectrum form (power spectrum)
- FT decomposes an input signal into constituent frequencies
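A small example of going from a waveform to a power spectrogram via the STFT; scipy.signal.stft is used here just as one convenient implementation, and the 440 Hz tone is a synthetic stand-in for real audio.

```python
import numpy as np
from scipy.signal import stft

fs = 16000                                  # sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)
wave = np.sin(2 * np.pi * 440 * t)          # toy 440 Hz tone instead of real audio

# Windowed FFTs over short frames: frequency on one axis, time on the other.
freqs, times, Z = stft(wave, fs=fs, nperseg=512, noverlap=384)
power_spectrogram = np.abs(Z) ** 2          # power spectrum per frame
print(power_spectrogram.shape)              # (n_freq_bins, n_frames)
```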
Joint embedding [SoundNet]
- Learn an audio representation from synchronized RGB frames in the same videos
- Trained in the teacher-student manner (see the sketch after this list)
- Transfer visual knowledge from pre-trained visual recognition models into sound modality
- For a target task, the pre-trained internal representation (pool5) is used as features
- Training a classifier with the pool5 feature
- Instead of the output layer, the pool5 feature possesses more generalizable semantic info.
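A hedged sketch of the teacher-student transfer above: a frozen pre-trained visual network would produce class distributions for the synchronized RGB frames, and the audio network is trained (e.g. with a KL divergence) to predict the same distribution from the sound alone. The tiny 1-D conv net and the random teacher outputs below are only stand-ins for SoundNet and the real visual teachers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Student: a tiny 1-D conv net over raw waveforms (stand-in for the SoundNet architecture).
audio_net = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(16, 1000),                    # predict the teacher's 1000 classes
)

waveform = torch.randn(4, 1, 16000)         # a batch of 1-second audio clips
with torch.no_grad():
    # In practice this comes from a frozen visual recognition CNN applied to the
    # synchronized video frames; random logits stand in for the teacher here.
    teacher_probs = torch.softmax(torch.randn(4, 1000), dim=1)

student_log_probs = F.log_softmax(audio_net(waveform), dim=1)
loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
loss.backward()
```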
Cross modal translation
Speech2Face - module networks
VGG-Face Model
- Training by a feature matching loss (self-supervised manner) to make the features compatible (see the sketch below)
- Natural co-occurrence of speaker's speech and facial images
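A sketch of that feature-matching idea under assumed shapes: a feature from the (frozen) VGG-Face model on the speaker's face frame is the target, and a voice encoder is trained so its output from the co-occurring speech spectrogram matches it. Both the encoder architecture and the 4096-d feature size are placeholders.

```python
import torch
import torch.nn as nn

face_feat_dim = 4096                               # assumed VGG-Face feature size

# Placeholder voice encoder: speech spectrogram in, face-like feature out.
voice_encoder = nn.Sequential(
    nn.Flatten(), nn.Linear(257 * 100, 1024), nn.ReLU(),
    nn.Linear(1024, face_feat_dim),
)

spectrogram = torch.randn(4, 1, 257, 100)          # batch of speech spectrograms
with torch.no_grad():
    face_feature = torch.randn(4, face_feat_dim)   # would come from the frozen VGG-Face model

loss = nn.functional.l1_loss(voice_encoder(spectrogram), face_feature)  # feature matching loss
loss.backward()
```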
Application : image-to-speech synthesis - generating speech from an image alone
- Like image captioning, but generates sub-word units rather than natural language
Associate speech with contextually relevant visual inputs using a triplet loss
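A small example of that triplet objective: an anchor speech embedding should end up closer to the embedding of its contextually relevant image than to a mismatched image, by a margin. PyTorch's built-in TripletMarginLoss is used for brevity; the embeddings are random stand-ins.

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)

speech_emb = torch.randn(8, 512)     # anchor: speech segment embeddings
matched_img = torch.randn(8, 512)    # positive: embeddings of the relevant images
other_img = torch.randn(8, 512)      # negative: embeddings of unrelated images

loss = triplet(speech_emb, matched_img, other_img)
```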
Cross modal reasoning
Application - sound source localization - locating where a sound is coming from in the scene
1. Looking to Listen at the Cocktail Party (paper) - audio-visual fusion
Training data : synthetically generated by combining two clean speech videos
Loss : L2 loss between the clean spectrogram and the enhanced spectrogram (see the sketch after this list)
2. Lip movements generation
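A hedged sketch of the training signal for the speech separation case above: the audio-visual network would predict a mask over the mixed spectrogram, and the loss is the L2 distance between the masked (enhanced) spectrogram and the clean target. The random mask below just stands in for the fusion network's output.

```python
import torch
import torch.nn.functional as F

clean_spec = torch.rand(2, 257, 300)           # clean speaker's spectrogram (target)
other_spec = torch.rand(2, 257, 300)
mixed_spec = clean_spec + other_spec           # synthetic mixture of two clean speech tracks

# In the real model this mask comes from the audio-visual fusion network.
predicted_mask = torch.sigmoid(torch.randn(2, 257, 300))
enhanced_spec = predicted_mask * mixed_spec

loss = F.mse_loss(enhanced_spec, clean_spec)   # L2 between clean and enhanced spectrograms
```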
Conclusion
Beyond image, text and audio
Autopilot - Tesla self-driving