
Multi-modal learning

by 에아오요이가야 2023. 12. 7.

1. Overview of multi-modal learning (multiple modalities)

 

Challenge

(1) - Different representations between modalities

(2) - Imbalance between heterogeneous feature spaces - possibility of 1:N matching

(3) - A model may become biased toward a specific modality

 

Despite the challenges, multi-modal learning is fruitful and important

Matching : map different data types into a common space

Translating : translate one data type into another data type

Referencing : different data types reference each other

These are the approaches currently used to tackle the challenges.

 

2. Multi-modal tasks(1) - Visual data & Text

Text embedding - Example

- Characters are hard to use in machine learning 

- Map to dense vectors

- Surprisingly, generalization power is obtained by learning dense representations

 

Basics of text representation

word2vec uses the Skip-gram model

- Trained to learn \(W\) and \(W'\)

- Rows in \(W\) represent word embedding vectors

- Learning to predict the neighboring \(N\) words to capture relationships between words

  - With a window of size 5, each center word is trained against its 4 neighboring words (see the sketch below)
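
A minimal Skip-gram sketch (PyTorch assumed; the vocabulary size, embedding dimension, and word ids are illustrative): the rows of \(W\) are the dense word vectors, and \(W'\) scores every vocabulary word as a possible neighbor of the center word.

```python
import torch
import torch.nn as nn

# Minimal Skip-gram sketch: W maps word ids to dense vectors,
# W' scores every vocabulary word as a context (neighbor) word.
class SkipGram(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.W = nn.Embedding(vocab_size, embed_dim)                # rows = word embedding vectors
        self.W_prime = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, center_ids):
        h = self.W(center_ids)       # (batch, embed_dim)
        return self.W_prime(h)       # (batch, vocab_size) logits over context words

# Toy usage: predict each of the 4 neighbors of one center word (window of size 5).
model = SkipGram(vocab_size=10000, embed_dim=128)
center = torch.tensor([42, 42, 42, 42])     # same center word, repeated per neighbor
context = torch.tensor([7, 13, 99, 256])    # its 4 neighboring word ids (illustrative)
loss = nn.functional.cross_entropy(model(center), context)
loss.backward()
```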

 

Joint embedding (solving problems with the Matching approach) - Image tagging (combining pre-trained unimodal models)

- Can generate tags of a given image, and retrieve images by a tag keyword as well

- The models used for each modality are used together to build a joint embedding

- When a text and an image are a matching pair, training pulls their embeddings closer; when they are not a pair, it pushes them apart (metric learning, see the sketch below)
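
A minimal sketch of that pair/non-pair objective (PyTorch assumed; `image_encoder`/`text_encoder` are placeholders for the pre-trained unimodal models, and the margin value is illustrative): matched pairs sit on the diagonal of the similarity matrix and every mismatched pair is pushed at least a margin below them.

```python
import torch
import torch.nn.functional as F

# Metric-learning sketch for a joint embedding:
# img_emb = image_encoder(images), txt_emb = text_encoder(tags)  # placeholder encoders
def matching_loss(img_emb, txt_emb, margin=0.2):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()              # cosine similarities, shape (B, B)
    pos = sim.diag().unsqueeze(1)            # matched (paired) similarities on the diagonal
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    # hinge: each mismatched pair should score at least `margin` below its matched pair
    return (F.relu(sim - pos + margin) * mask).mean()
```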

 

-> Surprisingly, the learned embeddings hold analogy relationships between visual and text data

A cosine similarity loss (for the joint embedding) and a semantic regularization loss (a loss incorporating high-level semantics) are used together

Application : a food image ~ its recipe

recipe : the ingredients and the instructions are each encoded into vector embeddings

 

Cross-modal translation (solving problems with the Translating approach)

Application - Image captioning

Show and Tell : captioning as image-to-sentence ~ CNN for image & RNN for sentence

 

Encoder : CNN pre-trained model on ImageNet

Decoder : LSTM module
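
A sketch of this encoder-decoder captioner (PyTorch/torchvision assumed; ResNet-18 stands in for the ImageNet pre-trained CNN, and the vocabulary/hidden sizes are illustrative). The image feature is fed as the first token, then the LSTM predicts the caption word by word.

```python
import torch
import torch.nn as nn
from torchvision import models

class ShowAndTell(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights="IMAGENET1K_V1")              # ImageNet pre-trained encoder
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])    # drop the classifier head
        self.img_proj = nn.Linear(512, embed_dim)                   # image feature -> embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feat = self.encoder(images).flatten(1)        # (B, 512) pooled CNN feature
        img_tok = self.img_proj(feat).unsqueeze(1)    # image acts as the first "word"
        seq = torch.cat([img_tok, self.embed(captions)], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                       # next-word logits at every step
```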

 

Show, Attend and Tell (paper)

1. Input Image

2. Convolutional feature extraction - uses an intermediate feature map

3. RNN with attention over the image

4. Word by word generation
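
A minimal soft-attention sketch for step 3 (PyTorch assumed; the feature, hidden, and attention dimensions are illustrative): at each decoding step the RNN hidden state scores every spatial location of the convolutional feature map, and the context vector is their weighted sum.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_fc = nn.Linear(feat_dim, attn_dim)
        self.hid_fc = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, L, feat_dim) spatial locations of the feature map, hidden: (B, hidden_dim)
        e = self.score(torch.tanh(self.feat_fc(feats) + self.hid_fc(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)          # attention weights over the L locations
        context = (alpha * feats).sum(dim=1)     # (B, feat_dim) attended image feature
        return context, alpha.squeeze(-1)
```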

 

Text-to-image by generative models

 

Cross-modal reasoning (solving problems with the Referencing approach)

Visual question answering - multiple streams, joint embedding, end-to-end training
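
A minimal two-stream VQA sketch (PyTorch assumed; the pooled 2048-d image feature, vocabulary size, and answer set size are illustrative): the image stream and the question stream are fused by element-wise product in a joint space and classified end-to-end over a fixed answer vocabulary.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size=10000, num_answers=1000, dim=512):
        super().__init__()
        self.q_embed = nn.Embedding(vocab_size, 300)
        self.q_rnn = nn.GRU(300, dim, batch_first=True)        # question stream
        self.img_fc = nn.Linear(2048, dim)                     # image stream (pooled CNN feature)
        self.classifier = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    def forward(self, img_feat, question_ids):
        _, q = self.q_rnn(self.q_embed(question_ids))          # q: (1, B, dim) final hidden state
        joint = torch.tanh(self.img_fc(img_feat)) * torch.tanh(q.squeeze(0))   # joint embedding
        return self.classifier(joint)                          # answer logits, trained end-to-end
```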

 

3. Multi-modal tasks(2) - Visual data & Audio

Sound representation (Spectrogram) - Acoustic feature extraction from waveform to spectrogram

- STFT (short-time Fourier transform) : converts waveform data into spectrum form [power spectrum]

- FT decomposes an input signal into constituent frequencies
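
A minimal NumPy sketch of the waveform-to-power-spectrogram conversion (the FFT size and hop length are illustrative choices): slide a window over the waveform, take the FFT of each frame, and keep the squared magnitudes.

```python
import numpy as np

def stft_power(waveform, n_fft=512, hop=128):
    window = np.hanning(n_fft)
    frames = [
        np.fft.rfft(window * waveform[start:start + n_fft])     # FT of one windowed frame
        for start in range(0, len(waveform) - n_fft + 1, hop)
    ]
    return np.abs(np.array(frames)) ** 2                        # (num_frames, n_fft // 2 + 1)

# Toy usage: a 1-second 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = stft_power(np.sin(2 * np.pi * 440 * t))
print(spec.shape)    # frames x frequency bins
```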

 

Joint embedding [SoundNet]

- Learns audio representations from synchronized RGB frames in the same videos

- Trained in a teacher-student manner

  - Transfers visual knowledge from pre-trained visual recognition models into the sound modality

- For a target task, the pre-trained internal representation (pool5) is used as features

- Training a classifier with the pool5 feature

  - Compared to the output layer, the pool5 feature possesses more generalizable semantic information
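
A sketch of the teacher-student transfer (PyTorch assumed; `visual_net` and `audio_net` are placeholders for the frozen visual teacher and the audio student, and the temperature is illustrative): the student is trained to match the teacher's soft predictions on the synchronized frames.

```python
import torch
import torch.nn as nn

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between softened teacher and student distributions
    teacher_prob = torch.softmax(teacher_logits / T, dim=-1)
    student_logprob = torch.log_softmax(student_logits / T, dim=-1)
    return nn.functional.kl_div(student_logprob, teacher_prob, reduction="batchmean")

# Only the audio student receives gradients; the visual teacher stays frozen:
# with torch.no_grad():
#     teacher_logits = visual_net(rgb_frames)
# loss = distillation_loss(audio_net(waveform), teacher_logits)
```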

 

Cross-modal translation

Speech2Face - module networks

VGG-Face Model 

- Trained with a feature matching loss (in a self-supervised manner) to make the voice features compatible with the face features

  - Natural co-occurrence of speaker's speech and facial images
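
A sketch of such a feature matching loss (PyTorch assumed; `voice_feat` and `face_feat` are placeholder encoder outputs, and the plain L2 distance on normalized features is an illustrative simplification of the paper's loss): the voice feature is pushed toward the VGG-Face feature of the co-occurring face.

```python
import torch
import torch.nn.functional as F

def feature_matching_loss(voice_feat, face_feat):
    voice_feat = F.normalize(voice_feat, dim=-1)   # output of the voice encoder (trained)
    face_feat = F.normalize(face_feat, dim=-1)     # output of the frozen VGG-Face model
    return F.mse_loss(voice_feat, face_feat)
```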

 

Application : Image-to-speech synthesis - generating speech from an image alone

- Similar to image captioning, but outputs sub-word units rather than natural language

 

Associate speech with contextually relevant visual inputs using a triplet loss
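
A minimal triplet-loss sketch (PyTorch assumed; the margin is an illustrative value): the speech embedding (anchor) should be closer to a contextually relevant image embedding (positive) than to an irrelevant one (negative) by at least the margin.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    pos_dist = (anchor - positive).pow(2).sum(dim=-1)    # anchor vs. relevant visual input
    neg_dist = (anchor - negative).pow(2).sum(dim=-1)    # anchor vs. irrelevant visual input
    return F.relu(pos_dist - neg_dist + margin).mean()
```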

 

Cross-modal reasoning

Application - sound source localization - finding where a sound comes from in the visual scene

 

1. Looking to Listen at the Cocktail Party (paper) - Audio-visual fusion

Training data : synthetically generated by combining two clean speech videos

Loss : L2 loss between clean spectrogram and enhanced spectrogram
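
A sketch of this training setup (PyTorch assumed; the spectrogram shapes and random tensors are illustrative stand-ins for real data): two clean spectrograms are mixed to form the input, and the loss is the L2 distance between each enhanced output and its clean target.

```python
import torch
import torch.nn.functional as F

def separation_loss(enhanced_specs, clean_specs):
    # both: (num_speakers, freq_bins, time_frames)
    return F.mse_loss(enhanced_specs, clean_specs)

# Illustrative usage with placeholder tensors:
clean = torch.rand(2, 257, 100)       # two clean speech spectrograms
mixed = clean.sum(dim=0)              # synthetic mixture fed to the model
enhanced = torch.rand(2, 257, 100)    # model output (placeholder)
loss = separation_loss(enhanced, clean)
```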

 

2. Lip movements generation

 

Conclusion

Beyond image, text and audio

Autopilot - Tesla self-driving
