
Multi-modal learning

by 에아오요이가야 2023. 12. 7.

1. Overview of multi-modal learning (multiple modalities)

 

Challenge

(1) - Different representations between modalities

(2) - Imbalance between heterogeneous feature spaces - possibility of 1:N matching

(3) - A model may become biased toward a specific modality

 

Despite the challenges, multi-modal learning is fruitful and important

Matching : map different data types into a common space

Translating : translate one data type into another data type

Referencing : different data types reference each other

These are the approaches currently used to tackle the challenges.

 

2. Multi-modal tasks(1) - Visual data & Text

Text embedding - Example

- Characters are hard to use in machine learning 

- Map to dense vectors

- Surprisingly, generalization power is obtained by learning dense representations

 

Basics of text representation

word2vec uses the Skip-gram model

- Trained to learn \(W\) and \(W'\)

- Rows in \(W\) represent word embedding vectors

- Learning to predict the neighboring \(N\) words to capture relationships between words

  - With a window of size 5, each center word is trained against its 4 neighboring words (see the sketch below)
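
A minimal Skip-gram sketch (PyTorch assumed; the vocabulary size, embedding dimension, and word ids are illustrative): the rows of \(W\) are the dense word vectors, and \(W'\) scores every vocabulary word as a possible neighbor of the center word.

```python
import torch
import torch.nn as nn

# Minimal Skip-gram sketch: W maps word ids to dense vectors,
# W' scores every vocabulary word as a context (neighbor) word.
class SkipGram(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.W = nn.Embedding(vocab_size, embed_dim)                # rows = word embedding vectors
        self.W_prime = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, center_ids):
        h = self.W(center_ids)       # (batch, embed_dim)
        return self.W_prime(h)       # (batch, vocab_size) logits over context words

# Toy usage: predict each of the 4 neighbors of one center word (window of size 5).
model = SkipGram(vocab_size=10000, embed_dim=128)
center = torch.tensor([42, 42, 42, 42])     # same center word, repeated per neighbor
context = torch.tensor([7, 13, 99, 256])    # its 4 neighboring word ids (illustrative)
loss = nn.functional.cross_entropy(model(center), context)
loss.backward()
```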

 

Joint embedding (solving problems with the Matching approach) - Image tagging (combining pre-trained unimodal models)

- Can generate tags of a given image, and retrieve images by a tag keyword as well

- The models used for each modality are used together to build a joint embedding

- When a text and an image are a matching pair, training pulls their embeddings closer; when they are not a pair, it pushes them apart (metric learning, see the sketch below)
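
A minimal sketch of that pair/non-pair objective (PyTorch assumed; `image_encoder`/`text_encoder` are placeholders for the pre-trained unimodal models, and the margin value is illustrative): matched pairs sit on the diagonal of the similarity matrix and every mismatched pair is pushed at least a margin below them.

```python
import torch
import torch.nn.functional as F

# Metric-learning sketch for a joint embedding:
# img_emb = image_encoder(images), txt_emb = text_encoder(tags)  # placeholder encoders
def matching_loss(img_emb, txt_emb, margin=0.2):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()              # cosine similarities, shape (B, B)
    pos = sim.diag().unsqueeze(1)            # matched (paired) similarities on the diagonal
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    # hinge: each mismatched pair should score at least `margin` below its matched pair
    return (F.relu(sim - pos + margin) * mask).mean()
```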

 

-> Surprisingly, the learned embeddings hold analogy relationships between visual and text data

A cosine similarity loss (for the joint embedding) and a semantic regularization loss (a loss incorporating high-level semantics) are used together

Application : a food image ~ its recipe

recipe : the ingredients and the instructions are each encoded into vector embeddings

 

Cross-modal translation (solving problems with the Translating approach)

Application - Image captioning

Show and Tell : captioning as image-to-sentence ~ CNN for image & RNN for sentence

 

Encoder : CNN pre-trained model on ImageNet

Decoder : LSTM module
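
A sketch of this encoder-decoder captioner (PyTorch/torchvision assumed; ResNet-18 stands in for the ImageNet pre-trained CNN, and the vocabulary/hidden sizes are illustrative). The image feature is fed as the first token, then the LSTM predicts the caption word by word.

```python
import torch
import torch.nn as nn
from torchvision import models

class ShowAndTell(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights="IMAGENET1K_V1")              # ImageNet pre-trained encoder
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])    # drop the classifier head
        self.img_proj = nn.Linear(512, embed_dim)                   # image feature -> embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feat = self.encoder(images).flatten(1)        # (B, 512) pooled CNN feature
        img_tok = self.img_proj(feat).unsqueeze(1)    # image acts as the first "word"
        seq = torch.cat([img_tok, self.embed(captions)], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                       # next-word logits at every step
```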

 

Show, Attend and Tell (paper)

1. Input Image

2. Convolutional feature extraction - uses an intermediate feature map

3. RNN with attention over the image

4. Word by word generation
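
A minimal soft-attention sketch for step 3 (PyTorch assumed; the feature, hidden, and attention dimensions are illustrative): at each decoding step the RNN hidden state scores every spatial location of the convolutional feature map, and the context vector is their weighted sum.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_fc = nn.Linear(feat_dim, attn_dim)
        self.hid_fc = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, L, feat_dim) spatial locations of the feature map, hidden: (B, hidden_dim)
        e = self.score(torch.tanh(self.feat_fc(feats) + self.hid_fc(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)          # attention weights over the L locations
        context = (alpha * feats).sum(dim=1)     # (B, feat_dim) attended image feature
        return context, alpha.squeeze(-1)
```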

 

Text-to-image by generative models

 

Cross-modal reasoning (solving problems with the Referencing approach)

Visual question answering - multiple streams, joint embedding, end-to-end training
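
A minimal two-stream VQA sketch (PyTorch assumed; the pooled 2048-d image feature, vocabulary size, and answer set size are illustrative): the image stream and the question stream are fused by element-wise product in a joint space and classified end-to-end over a fixed answer vocabulary.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size=10000, num_answers=1000, dim=512):
        super().__init__()
        self.q_embed = nn.Embedding(vocab_size, 300)
        self.q_rnn = nn.GRU(300, dim, batch_first=True)        # question stream
        self.img_fc = nn.Linear(2048, dim)                     # image stream (pooled CNN feature)
        self.classifier = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    def forward(self, img_feat, question_ids):
        _, q = self.q_rnn(self.q_embed(question_ids))          # q: (1, B, dim) final hidden state
        joint = torch.tanh(self.img_fc(img_feat)) * torch.tanh(q.squeeze(0))   # joint embedding
        return self.classifier(joint)                          # answer logits, trained end-to-end
```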

 

3. Multi-modal tasks(2) - Visual data & Audio

Sound representation (Spectrogram) - Acoustic feature extraction from waveform to spectrogram

- STFT (short-time Fourier transform) : converts waveform data into spectrum form [power spectrum]

- FT decomposes an input signal into constituent frequencies
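
A minimal NumPy sketch of the waveform-to-power-spectrogram conversion (the FFT size and hop length are illustrative choices): slide a window over the waveform, take the FFT of each frame, and keep the squared magnitudes.

```python
import numpy as np

def stft_power(waveform, n_fft=512, hop=128):
    window = np.hanning(n_fft)
    frames = [
        np.fft.rfft(window * waveform[start:start + n_fft])     # FT of one windowed frame
        for start in range(0, len(waveform) - n_fft + 1, hop)
    ]
    return np.abs(np.array(frames)) ** 2                        # (num_frames, n_fft // 2 + 1)

# Toy usage: a 1-second 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = stft_power(np.sin(2 * np.pi * 440 * t))
print(spec.shape)    # frames x frequency bins
```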

 

Joint embedding [SoundNet]

- Learns audio representations from synchronized RGB frames in the same videos

- Trained in a teacher-student manner

  - Transfers visual knowledge from pre-trained visual recognition models into the sound modality

- For a target task, the pre-trained internal representation (pool5) is used as features

- Training a classifier with the pool5 feature

  - Compared to the output layer, the pool5 feature possesses more generalizable semantic information
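
A sketch of the teacher-student transfer (PyTorch assumed; `visual_net` and `audio_net` are placeholders for the frozen visual teacher and the audio student, and the temperature is illustrative): the student is trained to match the teacher's soft predictions on the synchronized frames.

```python
import torch
import torch.nn as nn

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between softened teacher and student distributions
    teacher_prob = torch.softmax(teacher_logits / T, dim=-1)
    student_logprob = torch.log_softmax(student_logits / T, dim=-1)
    return nn.functional.kl_div(student_logprob, teacher_prob, reduction="batchmean")

# Only the audio student receives gradients; the visual teacher stays frozen:
# with torch.no_grad():
#     teacher_logits = visual_net(rgb_frames)
# loss = distillation_loss(audio_net(waveform), teacher_logits)
```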

 

Cross-modal translation

Speech2Face - module networks

VGG-Face Model 

- Trained with a feature matching loss (in a self-supervised manner) to make the voice features compatible with the face features

  - Natural co-occurrence of speaker's speech and facial images
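
A sketch of such a feature matching loss (PyTorch assumed; `voice_feat` and `face_feat` are placeholder encoder outputs, and the plain L2 distance on normalized features is an illustrative simplification of the paper's loss): the voice feature is pushed toward the VGG-Face feature of the co-occurring face.

```python
import torch
import torch.nn.functional as F

def feature_matching_loss(voice_feat, face_feat):
    voice_feat = F.normalize(voice_feat, dim=-1)   # output of the voice encoder (trained)
    face_feat = F.normalize(face_feat, dim=-1)     # output of the frozen VGG-Face model
    return F.mse_loss(voice_feat, face_feat)
```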

 

Application : Image-to-speech synthesis - generating speech from an image alone

- Similar to image captioning, but outputs sub-word units rather than natural language

 

Associate speech with contextually relevant visual inputs using a triplet loss
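
A minimal triplet-loss sketch (PyTorch assumed; the margin is an illustrative value): the speech embedding (anchor) should be closer to a contextually relevant image embedding (positive) than to an irrelevant one (negative) by at least the margin.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    pos_dist = (anchor - positive).pow(2).sum(dim=-1)    # anchor vs. relevant visual input
    neg_dist = (anchor - negative).pow(2).sum(dim=-1)    # anchor vs. irrelevant visual input
    return F.relu(pos_dist - neg_dist + margin).mean()
```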

 

Cross-modal reasoning

Application - sound source localization - finding where a sound comes from in the visual scene

 

1. Looking to Listen at the Cocktail Party (paper) - Audio-visual fusion

Training data : synthetically generated by combining two clean speech videos

Loss : L2 loss between clean spectrogram and enhanced spectrogram
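
A sketch of this training setup (PyTorch assumed; the spectrogram shapes and random tensors are illustrative stand-ins for real data): two clean spectrograms are mixed to form the input, and the loss is the L2 distance between each enhanced output and its clean target.

```python
import torch
import torch.nn.functional as F

def separation_loss(enhanced_specs, clean_specs):
    # both: (num_speakers, freq_bins, time_frames)
    return F.mse_loss(enhanced_specs, clean_specs)

# Illustrative usage with placeholder tensors:
clean = torch.rand(2, 257, 100)       # two clean speech spectrograms
mixed = clean.sum(dim=0)              # synthetic mixture fed to the model
enhanced = torch.rand(2, 257, 100)    # model output (placeholder)
loss = separation_loss(enhanced, clean)
```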

 

2. Lip movements generation

 

Conclusion

Beyond image, text and audio

Autopilot - Tesla self-driving
