Attention Is All You Need

Abstract

The dominant sequence transduction models are based on complex RNN or CNN architectures that include an encoder and a decoder. The best-performing models also connect the encoder and decoder through an attention mechanism. The Transformer: an architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

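Introduction

RNNs (LSTM, GRU) have shown strong performance on sequence modeling and transduction problems [language modeling, machine translation]. RNNs typically factor computation along the symbol positions of the input and output sequences. Aligning positions to steps in computation time, they generate a sequence of hidden states h_t, each a function of the previous hidden state h_{t-1} and the input for position t. This inherently sequential nature precludes parallelization within training examples. That becomes critical at longer sequence lengths, because memory constraints limit batching across examples. Factorization tricks and conditional computation have improved computational efficiency, and conditional computation has also improved model performance. But the fundamental constraint of sequential computation remains.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks. They allow dependencies to be modeled without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with an RNN.
The Transformer is a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. It parallelizes well, and the performance is great too~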
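To make the sequential bottleneck concrete, here is a minimal sketch of the recurrence h_t = f(h_{t-1}, x_t). The scalar tanh cell, the function name `rnn_hidden_states`, and the weights `w_x`, `w_h`, `b` are all assumptions chosen for illustration; the notes above only say that h_t is a function of h_{t-1} and the input at position t.

```python
import math


def rnn_hidden_states(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Toy scalar RNN (illustrative assumption): h_t = tanh(w_x*x_t + w_h*h_{t-1} + b)."""
    h = 0.0  # initial hidden state h_0
    states = []
    for x_t in inputs:
        # Step t needs h from step t-1, so the loop cannot be
        # parallelized across positions of the same sequence.
        h = math.tanh(w_x * x_t + w_h * h + b)
        states.append(h)
    return states


if __name__ == "__main__":
    print(rnn_hidden_states([1.0, 0.0, -1.0]))
```

Each iteration reads the `h` written by the previous one, which is exactly the dependency that prevents parallelization within a training example.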

Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet, and ConvS2S. All three use CNNs as the basic building block and compute hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between the positions: linearly for ConvS2S and logarithmically for ByteNet. This makes it harder to learn dependencies between distant positions.
In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions. That effect is counteracted with Multi-Head Attention, described in section 3.2.
Self-attention (= intra-attention) is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence.
Self-attention has been used successfully in a variety of tasks, including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations.
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence, and they perform well on simple-language question answering and language modeling tasks.
However, the Transformer is the first transduction model that relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution.
The following sections describe the Transformer, motivate self-attention, and lay out its advantages.
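The linear, logarithmic, and constant growth mentioned above can be sketched by counting how many stacked layers are needed before one position can "see" another that is n positions away. This is a simplified receptive-field model, not the actual ConvS2S or ByteNet architectures: it assumes a kernel width `k >= 2`, contiguous kernels that widen the receptive field by k-1 per layer, and dilated kernels whose receptive field multiplies by k per layer. All function names are hypothetical.

```python
def contiguous_conv_layers(n, k):
    """Layers of width-k contiguous convolutions until the receptive field spans n positions."""
    layers, field = 0, 1
    while field < n:
        field += k - 1  # each layer widens the receptive field by k - 1
        layers += 1
    return layers


def dilated_conv_layers(n, k):
    """Layers of width-k convolutions with dilation 1, k, k^2, ... until n positions are spanned."""
    layers, field = 0, 1
    while field < n:
        field *= k  # the receptive field grows geometrically
        layers += 1
    return layers


def self_attention_layers(n):
    """Self-attention connects every pair of positions within a single layer."""
    return 1


if __name__ == "__main__":
    for n in (16, 128, 1024):
        print(n, contiguous_conv_layers(n, 3), dilated_conv_layers(n, 3), self_attention_layers(n))
```

Under these assumptions the contiguous count grows roughly as n/(k-1), the dilated count as log_k(n), and the self-attention count stays at 1 regardless of n.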

Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure. Here, the encoder maps an input sequence of symbol representations (x1,...,xn) to a sequence of continuous representations z=(z1,...,zn). Given z, the decoder then generates an output sequence (y1,...,ym) of symbols one element at a time. At each step the model is auto-regressive: it consumes the previously generated symbols as additional input when generating the next one.
The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder. [See Figure 1]

3.1 Encoder and Decoder Stacks

Encoder: the encoder is composed of a stack of 6 identical layers. Each layer has two sub-layers.
sub-layer1: multi-head self-attention mechanism
sub-layer2: position-wise fully connected feed-forward network
A residual connection is applied around each of the two sub-layers, followed by layer normalization. In other words, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To make these residual connections possible, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension 512.
Decoder: the decoder is also composed of a stack of 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder adds a third sub-layer that performs multi-head attention over the output of the encoder stack. As in the encoder, residual connections are applied around each sub-layer, followed by layer normalization. The self-attention sub-layer in the decoder is also modified to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

3.2 Attention

An attention function maps a query and a set of key-value pairs to an output. [The query, keys, values, and output are all vectors.] The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

3.2.1 Scaled Dot-Product Attention

The particular attention used here is called "Scaled Dot-Product Attention". The input consists of queries and keys of dimension d_k, and values of dimension d_v. Compute the dot products of the query with all keys, divide each by sqrt(d_k), and apply a softmax function to obtain the weights on the values.
In practice, the attention function is computed on a set of queries simultaneously, packed together into a matrix Q. The keys and values are likewise packed into matrices K and V, so everything is computed in one shot: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
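The Scaled Dot-Product Attention described in these notes, together with the decoder-side masking of subsequent positions, can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: Q, K, and V are lists of row vectors, the `causal` flag and the function names are hypothetical, and masking is done by setting disallowed scores to negative infinity before the softmax. The causal option assumes self-attention, where Q and K have the same number of rows.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, causal=False):
    """Return softmax(Q K^T / sqrt(d_k)) V, one output row per query row."""
    d_k = len(K[0])
    d_v = len(V[0])
    outputs = []
    for i, q in enumerate(Q):
        # Compatibility of query i with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Position i may attend only to positions j <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(d_v)])
    return outputs


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0]]
    print(scaled_dot_product_attention(X, X, X))
    print(scaled_dot_product_attention(X, X, X, causal=True))
```

With `causal=True`, the first position can attend only to itself, so its output is simply the first value vector.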
