Vision Transformer with Deformable Attention

AI/Vision 2023. 9. 14. 15:49

CVPR 2022

Vision Transformer with Deformable Attention

Tsinghua Univ, AWS AI (Amazon), Beijing Academy of Artificial Intelligence

This paper proposes an attention method that uses a deformable mechanism to find important regions easily while reducing computational complexity.
While running experiments with ViT, I ran into the problem that training took too long and needed too much memory, and I found this paper while looking for a way around it. What caught my eye is that the deformable mechanism captures the target object more efficiently while keeping the computational cost close to that of Swin instead of raising it significantly.
DOI: https://doi.org/10.48550/arXiv.2201.00520


1. Introduction

< Conventional Vision Transformer (ViT) >

ViT stacks transformer blocks that process a sequence of non-overlapping image patches, giving a convolution-free model for image classification.
It has a large receptive field and excels at modeling long-range dependencies, and it has been shown to outperform CNNs when training data and model parameters are plentiful.

However, ViT also has clear drawbacks.

The excessive number of keys attended per query patch leads to a high computation cost.
Convergence is slow, which increases the risk of overfitting.
Nearly identical self-attention maps are learned even as the layers get deeper.

To address these problems, the excessive attention has to be reduced.
For this reason, earlier work tried to lower the computational complexity by using efficient attention patterns.

< Efficient Attention >

Swin Transformer and Pyramid Vision Transformer (PVT) are the most commonly adopted examples.
Swin Transformer: adopts window-based local attention to restrict attention to local windows.
Pyramid Vision Transformer: down-samples the key and value feature maps to save computation.

These methods achieve good performance, but they can suffer from the following problems.

Hand-crafted attention patterns are data-agnostic and may not be optimal.
Relevant keys/values may be dropped while less important keys/values are kept.
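
To make the trade-off concrete, the snippet below counts how many keys a single query attends to under the three patterns. The 56×56 feature map, the 7×7 window, and the reduction ratio of 8 are illustrative assumptions, not values taken from this post.

```python
# Keys attended per query under three attention patterns.
# All sizes below are illustrative assumptions.
H = W = 56       # feature map height and width (assumed)
window = 7       # Swin-style local window side (assumed)
reduction = 8    # PVT-style spatial reduction ratio for keys/values (assumed)

keys_global = H * W                                  # ViT: every patch is a key
keys_window = window * window                        # Swin: keys inside one window
keys_reduced = (H // reduction) * (W // reduction)   # PVT: down-sampled key map

print(keys_global, keys_window, keys_reduced)  # 3136 49 49
```

Both efficient patterns cut the key count by the same factor here, but which keys survive is fixed in advance rather than chosen from the data.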

< Re-Attention >

Re-Attention is the attention method proposed in DeepViT to address the problem that a plain ViT, unlike a CNN, gains little performance as more layers are added.
As layers get deeper, attention collapse causes nearly identical self-attention maps to be learned; to counter this, DeepViT multiplies the self-attention maps by a learnable matrix that mixes them across heads.

The figure shows that query-dependent attention maps are learned in the Re-attention structure.
However, Re-attention still has the problem that the attention maps become uniform as the layers get deeper.

2. Deformable Attention Transformer (DAT)

Before looking at DAT in detail, let us briefly go over the deformable mechanism and the deformable convolutional network (DCN).

2.1. Deformable and DCN

< Deformable ? >

Ideally, the key/value set for a given query should be flexible and adapt to each individual input, which alleviates the problems of hand-crafted sparse attention patterns.
In CNNs, learning a deformable receptive field for the convolution filters is effective at selectively attending to more informative regions in a data-dependent way.
For this reason, the deformable convolutional network (DCN) performs strongly on many vision tasks.

< DCN ? >

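As shown in the figure, DCN proposes two modules.

Deformable Convolution
Deformable RoI pooling

A conventional CNN extracts features only from a fixed receptive field. This fixed sampling geometry, together with the translation invariance it brings, limits how well the network handles object detection and segmentation, where the scale, pose, and position of an object matter.
To improve on this, DCN extracts features from a more flexible region instead of a fixed receptive field.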

< Problems when applying DCN >

If DCN is applied to attention naively, the cost of the deformable offsets grows quadratically with the number of patches, so memory and computation complexity become prohibitively high.
For this reason, although DCN is a powerful network, recent work has not used the deformable mechanism as a building block of transformer backbone networks.

Instead, the deformable mechanism has been used only in detection heads or in a preprocessing layer that samples patches for the subsequent backbone.

2.2. Deformable Attention Transformer

This paper addresses the problems that arise when applying DCN and proposes a simple and efficient deformable self-attention module.
The original DCN learns different offsets for different pixels over the whole feature map.
DAT instead learns a few groups of sampling offsets shared by all queries, which shift the keys/values toward important regions. This design is based on the observation that global attention usually yields almost the same attention pattern for different queries.

For each attention module, DAT first generates reference points as a uniform grid that is the same for all input data.
The candidate keys/values are then shifted toward important regions, which augments the original self-attention module with more flexibility and efficiency.
This lets the model capture more informative features.

2.3. Preliminaries

Standard self-attention takes a flattened feature map x∈R^(N×C) as input and applies Multi-Head Self-Attention (MHSA) with M heads.

σ: softmax function
d = C/M: dimension of each head
z^(m): embedding output from the m-th attention head
q^(m), k^(m), v^(m)∈R^(N×d): query, key, and value embeddings, respectively
W_q, W_k, W_v, W_o∈R^(C×C): projection matrices
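
The equation images from the original post are not reproduced here. Using only the symbols listed above, the standard multi-head self-attention they describe can be written as the following reconstruction:

```latex
\begin{aligned}
q &= xW_q,\quad k = xW_k,\quad v = xW_v,\\
z^{(m)} &= \sigma\!\left(\frac{q^{(m)}\,{k^{(m)}}^{\top}}{\sqrt{d}}\right) v^{(m)},\quad m = 1,\dots,M,\\
z &= \mathrm{Concat}\!\left(z^{(1)},\dots,z^{(M)}\right) W_o .
\end{aligned}
```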

Finally, the transformer block can be written as follows.
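
A reconstruction of the block equations, assuming the usual pre-norm layout with layer normalization (LN), an MLP, and residual connections; z_l denotes the output of the l-th block, and z'_l is an intermediate symbol introduced here:

```latex
\begin{aligned}
z'_{l} &= \mathrm{MHSA}\bigl(\mathrm{LN}(z_{l-1})\bigr) + z_{l-1},\\
z_{l} &= \mathrm{MLP}\bigl(\mathrm{LN}(z'_{l})\bigr) + z'_{l}.
\end{aligned}
```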

2.4. Deformable Attention

Existing hierarchical vision transformers, notably PVT and Swin Transformer, try to address the problem of excessive attention.
Their approaches, however, have two problems.

The down-sampling technique of the former (PVT) causes severe information loss.
The shifted-window attention of the latter (Swin) makes the receptive field grow slowly, which limits the ability to model large objects.

Modeling relevant features flexibly requires data-dependent sparse attention, which naturally leads to the deformable mechanism proposed in DCN.

< Problems with naive application >

As mentioned earlier, naively applying DCN to the attention module raises the computational complexity. Let us look at this problem in more detail.
In DCN, every element of the feature map learns its own offsets, so a 3×3 deformable convolution on an H×W×C feature map has a space complexity of 9HWC.
If this mechanism is applied directly to the attention module, the space complexity rises sharply to N_q × N_k × C.
Here N_q and N_k are the numbers of queries and keys, which usually have the same scale as the feature map size HW.
The result is therefore an approximately biquadratic complexity, on the order of (HW)^2 × C.
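
A quick calculation shows the size of the gap. The feature map size below (56×56 with 96 channels) is an assumed example, not a value from the post.

```python
# Space complexity of 3x3 deformable convolution vs. naive deformable attention.
H, W, C = 56, 56, 96          # assumed feature map size

dcn_3x3 = 9 * H * W * C       # one offset set per element, 3x3 kernel
n_q = n_k = H * W             # queries and keys both scale with HW
naive_attn = n_q * n_k * C    # one offset set per (query, key) pair

print(dcn_3x3)                # 2709504
print(naive_attn)             # 944111616
print(naive_attn / dcn_3x3)   # ratio equals HW / 9
```

The ratio grows linearly with HW, so the gap widens quickly at the high resolutions used for detection and segmentation.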

< Deformable DETR >

Deformable DETR was proposed to deal with this problem.
It reduces the overhead by using a small number of keys, N_k = 4, at each scale, and it works well as a detection head.

However, attending to so few keys in a backbone network is not viable, because it would cause an unacceptable loss of information.

< Deformable Attention >

Based on the finding that different queries have similar attention maps in visual attention models, the authors choose a simpler solution as an efficient trade-off: shifted keys and values that are shared across all queries.

Given an input x∈R^(H×W×C), a uniform grid of points p∈R^(H_G×W_G×2) is generated as the reference.
The grid size is down-sampled from the input feature map size by a factor r (H_G = H / r, W_G = W / r).
The values of the reference points are linearly spaced 2D coordinates {(0,0), ..., (H_G - 1, W_G - 1)}, which are then normalized to the range [-1, +1] according to the grid shape H_G × W_G, so that (-1, -1) is the top-left corner and (+1, +1) is the bottom-right corner.
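
The reference grid can be sketched as below. The function name `reference_points`, the (y, x) ordering of the last axis, and the exact normalization formula are assumptions made for illustration; the sketch also assumes H_G and W_G are larger than 1.

```python
import numpy as np

def reference_points(H, W, r):
    """Return an (H_G, W_G, 2) grid of (y, x) reference points in [-1, 1]."""
    H_G, W_G = H // r, W // r
    ys, xs = np.meshgrid(np.arange(H_G), np.arange(W_G), indexing="ij")
    grid = np.stack([ys, xs], axis=-1).astype(np.float64)   # (0, 0) ... (H_G-1, W_G-1)
    grid[..., 0] = grid[..., 0] / (H_G - 1) * 2.0 - 1.0     # rows    -> [-1, 1]
    grid[..., 1] = grid[..., 1] / (W_G - 1) * 2.0 - 1.0     # columns -> [-1, 1]
    return grid

p = reference_points(H=56, W=56, r=8)
print(p.shape)            # (7, 7, 2)
print(p[0, 0], p[-1, -1]) # top-left and bottom-right corners
```

The last axis has size 2 because each reference point is a 2D coordinate, independent of the channel count C.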

To obtain an offset for each reference point, the feature map is linearly projected to the query tokens q = xW_q, which are fed into a lightweight sub-network θ_offset(∙) to generate the offsets ∆p = θ_offset(q).
To stabilize training, the offsets must be kept from growing too large, so the amplitude of ∆p is scaled by a predefined factor s: ∆p ← s tanh(∆p).

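The features are then sampled at the deformed points to serve as keys and values, followed by the projection matrices.
The sampling function is set to bilinear interpolation so that it is differentiable.
g(a, b) = max(0, 1 - |a - b|)
Because g is nonzero only at the four integer points closest to (p_x, p_y), the sampling formula reduces to a weighted average over those four locations.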
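
The simplification to four points can be checked numerically. In this sketch `sample_full` sums over every integer location while `sample_four` visits only the four nearest ones; both function names are hypothetical, and p_x, p_y are assumed to be given in pixel coordinates rather than the normalized [-1, +1] range.

```python
import numpy as np

def g(a, b):
    return np.maximum(0.0, 1.0 - np.abs(a - b))

def sample_full(z, px, py):
    """Weighted sum over every integer location (r_y, r_x) of z with shape (H, W, C)."""
    H, W, _ = z.shape
    wy = g(py, np.arange(H))                 # (H,) weights along rows
    wx = g(px, np.arange(W))                 # (W,) weights along columns
    return np.einsum("y,x,yxc->c", wy, wx, z)

def sample_four(z, px, py):
    """Same value computed from only the four nearest integer points."""
    H, W, _ = z.shape
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    out = np.zeros(z.shape[-1])
    for ry in (y0, y0 + 1):
        for rx in (x0, x0 + 1):
            if 0 <= ry < H and 0 <= rx < W:  # skip neighbours outside the map
                out += g(px, rx) * g(py, ry) * z[ry, rx]
    return out

rng = np.random.default_rng(0)
z = rng.standard_normal((6, 8, 3))
a = sample_full(z, px=2.3, py=4.75)
b = sample_four(z, px=2.3, py=4.75)
print(np.allclose(a, b))   # True
```

Because the weights are piecewise linear in p_x and p_y, gradients can flow back to the sampling positions and hence to the offsets.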

Written as formulas, this is as follows.

q: query
k̃, ṽ: deformed key and value
ϕ: sampling function
r_x, r_y: indices over all locations of z∈R^(H×W×C)
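
The equation images are missing from this copy of the post. A reconstruction from the definitions in this section, where x̃ denotes the features sampled at the deformed points (a symbol not spelled out in the list above):

```latex
\begin{aligned}
q &= xW_q,\quad \tilde{k} = \tilde{x}W_k,\quad \tilde{v} = \tilde{x}W_v,\\
\Delta p &= \theta_{\mathrm{offset}}(q),\quad \tilde{x} = \phi(x;\, p + \Delta p),\\
\phi\bigl(z;(p_x,p_y)\bigr) &= \sum_{(r_x,r_y)} g(p_x,r_x)\,g(p_y,r_y)\,z[r_y,r_x,:].
\end{aligned}
```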

As in existing approaches, multi-head attention is performed on q, k̃, ṽ, and relative position offsets R are adopted (Equation 8).

The outputs of all heads are concatenated and projected through W_o to obtain the final output z.

< Offset generation >

A sub-network that takes the query features as input is adopted to generate the offsets, producing one offset value for each reference point. Since each reference point covers a local s × s region, the generation network needs some perception of local features to learn reasonable offsets.
The sub-network is therefore built as follows.

(two convolution modules with a nonlinear activation)

It takes the form 5x5 conv -> GELU -> 1x1 conv.
The bias of the 1x1 conv is dropped so that it does not impose the same compulsory shift on every location.
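
A rough NumPy sketch of such an offset sub-network follows. Several details are assumptions made for illustration and are not stated in this post: the 5x5 convolution is taken to be depthwise and strided by r so that the output lands on the H_G × W_G grid, GELU uses its tanh approximation, the weights are random, and s = 2.0 is an arbitrary choice. The name `offset_network` is hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def offset_network(q, w_dw, b_dw, w_pw, stride, s):
    """q: (H, W, C) query features.
    w_dw: (5, 5, C) depthwise kernel, b_dw: (C,) bias of the 5x5 conv.
    w_pw: (C, 2) weights of the 1x1 conv, which has NO bias.
    Returns offsets of shape (H // stride, W // stride, 2) bounded by s."""
    H, W, C = q.shape
    qp = np.pad(q, ((2, 2), (2, 2), (0, 0)))          # zero padding for a 5x5 kernel
    H_G, W_G = H // stride, W // stride
    feat = np.empty((H_G, W_G, C))
    for i in range(H_G):
        for j in range(W_G):
            patch = qp[i * stride:i * stride + 5, j * stride:j * stride + 5, :]
            feat[i, j] = (patch * w_dw).sum(axis=(0, 1)) + b_dw
    delta = gelu(feat) @ w_pw                          # 1x1 conv is a per-location linear map
    return s * np.tanh(delta)                          # keep every offset inside (-s, s)

rng = np.random.default_rng(0)
H, W, C, r = 16, 16, 8, 4                              # assumed sizes
q = rng.standard_normal((H, W, C))
w_dw = 0.1 * rng.standard_normal((5, 5, C))
w_pw = 0.1 * rng.standard_normal((C, 2))
offsets = offset_network(q, w_dw, np.zeros(C), w_pw, stride=r, s=2.0)
print(offsets.shape)                                   # (4, 4, 2)
```

Since the last layer has no bias, the output depends on the location only through the local features and cannot add a constant shift shared by all positions.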

< Offset groups >

To promote diversity among the deformed points, a paradigm similar to MHSA is followed.
The feature map channels are split into G groups, and the features of each group use the shared sub-network to generate their own offsets.
The number of attention heads M is set to a multiple of the number of offset groups G, so that multiple attention heads are assigned to each group of deformed keys/values.
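
The grouping can be illustrated with a small example. The values C = 96, G = 3, M = 6 and the 14×14 map are assumptions chosen only so that the divisions come out evenly.

```python
import numpy as np

C, G, M = 96, 3, 6                 # channels, offset groups, attention heads (assumed)
assert C % G == 0 and M % G == 0   # M must be a multiple of G

x = np.zeros((14, 14, C))                      # dummy feature map (H, W, C)
groups = np.split(x, G, axis=-1)               # G tensors of shape (H, W, C // G)
heads_per_group = M // G
head_to_group = [m // heads_per_group for m in range(M)]

print(groups[0].shape)    # (14, 14, 32)
print(head_to_group)      # [0, 0, 1, 1, 2, 2]
```

Each group gets its own set of deformed points, and the heads mapped to that group all attend to the same deformed keys/values.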

< Deformable relative position bias >

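Relative position bias encodes the relative position between every pair of query and key, which augments vanilla attention with spatial information.
For a feature map of shape H × W, the relative coordinate displacements lie in the ranges [-H, H] and [-W, W] along the two dimensions, respectively.

In Swin Transformer, a relative position bias table B̂∈R^((2H-1)×(2W-1)) is constructed, and the relative position bias is obtained by indexing the table with the relative displacements in the two directions.
In deformable attention the key positions are continuous, so the relative displacements are computed in the normalized range [-1, +1], and ϕ(B̂; R) is then interpolated from the parameterized bias table B̂∈R^((2H-1)×(2W-1)) at the continuous relative displacements in order to cover all possible offset values.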

< Computational complexity >

Deformable multi-head attention (DMHA) has a computation cost similar to its counterparts in PVT or Swin Transformer.
The only additional overhead comes from the sub-network added to generate the offsets.

N_s = H_G W_G = HW / r^2: number of sampled points

The sub-network has linear complexity with respect to the channel size, which is minor compared to the computational cost of the attention itself.
Taking Swin-T as the reference, the additional overhead is 5.08M FLOPs, only 6.0% of the whole module.
Choosing a large down-sampling factor r reduces the complexity further, which makes the method suitable for tasks with high-resolution inputs such as object detection and instance segmentation.
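
The 6.0% figure can be sanity-checked with a back-of-the-envelope count. The cost model here (two attention matrix products plus the four projections) and the stage settings H = W = 14, C = 384, r = 2 are assumptions for a Swin-T-like third stage; the 5.08M overhead is taken from the text as given, not derived.

```python
# Attention cost of one block with HW queries and N_s sampled keys.
H = W = 14                         # assumed stage-3 feature map size
C = 384                            # assumed channel width
r = 2                              # assumed down-sampling factor
N_s = (H // r) * (W // r)          # number of sampled points, HW / r^2

HW = H * W
attn_flops = 2 * HW * N_s * C      # q k^T and the attention-weighted sum of v
attn_flops += 2 * HW * C * C       # query and output projections
attn_flops += 2 * N_s * C * C      # key and value projections on sampled points

offset_flops = 5.08e6              # overhead of the offset sub-network quoted above
share = offset_flops / (attn_flops + offset_flops)
print(N_s, attn_flops, round(100 * share, 1))   # 49 79629312 6.0
```

Under these assumptions the attention part costs about 79.6M FLOPs, so the quoted overhead is roughly 6% of the combined module.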

2.5. Model Architectures

The model uses the pyramid structure validated by PVT and Swin Transformer.
Window-based local attention and shifted-window attention are used, and deformable attention is applied only in stages 3 and 4 to limit the extra computation it introduces.

3. Experiment
