Practical Stand-Alone Self-Attention in CV: Add It to Your Bag of Tricks | NeurIPS 2019

晓飞的算法工程笔记 2020-04-07

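> The paper proposes a stand-alone self-attention layer and uses it to build a fully attentional model, verifying that content-based interactions can serve as the primary feature-extraction primitive of a vision model. On image classification and object detection, the model reaches accuracy comparable to conventional convolutional baselines while using substantially fewer parameters and FLOPs, which makes the work a useful reference.
>
> Source: 【晓飞的算法工程笔记】 WeChat official account

**Paper: Stand-Alone Self-Attention in Vision Models**

![](https://upload-images.jianshu.io/upload_images/20428708-5a6827e98004304b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

* **Paper link: [https://arxiv.org/abs/1906.05909](https://arxiv.org/abs/1906.05909)**

# Introduction
***

The design of convolutional networks has been key to progress on image tasks, and the translation equivariance of the convolution operation has made it the workhorse of image analysis. Because its receptive field is fixed in size, however, convolution struggles to capture long-range relationships between pixels, a problem that sequence models already handle well with attention.

Attention modules have started to appear in conventional convolutional networks, for example the channel-based attention of Squeeze-Excite and the spatially-aware attention of Non-local Network. These works all bolt global attention layers onto existing convolutional modules. This global form attends to every spatial position of the input, which restricts it to small inputs that typically require heavy downsampling of the original image.

The paper therefore proposes a simple local self-attention layer that uses content-based interactions as the primary feature-extraction tool rather than as an augmentation of convolution, and that works for both small and large inputs. The same stand-alone attention layer is also used to build a fully attentional vision model, which outperforms the fully convolutional baseline on image classification and object detection.

# Background
***

### Convolution

![](https://upload-images.jianshu.io/upload_images/20428708-aec5ec2899461e85.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Convolutional neural networks (CNNs) usually learn local features over a small neighborhood (the kernel size). For an input $x\in \mathbb{R}^{h\times w\times d_{in}}$, define the local neighborhood $\mathcal{N}_k$ as the pixels within spatial extent $k$ around pixel $x_{ij}$, a patch of shape $k\times k\times d_{in}$, as shown in Figure 1.

![](https://upload-images.jianshu.io/upload_images/20428708-e35252fe71005649.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Given learned weights $W\in \mathbb{R}^{k\times k\times d_{out}\times d_{in}}$, the output $y_{ij}\in \mathbb{R}^{d_{out}}$ at position $ij$ is computed with Equation 1, where $\mathcal{N}_k(i,j)=\{a,b|\ |a-i|\le k/2,|b-j|\le k/2\}$. CNNs use weight sharing: the same $W$ produces the output at every pixel position $ij$. Weight sharing gives the features translation equivariance and reduces the number of parameters. Several variants of convolution have been proposed to improve predictive performance, such as depthwise-separable convolution.

### Self-Attention

Unlike conventional attention, self-attention is applied within a single context rather than across multiple contexts, so it can directly model long-range interactions inside that context. The paper proposes a stand-alone self-attention layer as a replacement for convolution and uses it to build a fully attentional model. The layer is largely a simplification of earlier work.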
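Since the attention layer is meant to replace the convolution of Equation 1, it may help to see that operation written out for a single output pixel. The following is a minimal sketch in plain Python, not the paper's implementation: the function name `conv2d_at`, the nested-list layout, the odd kernel size, and the choice to skip neighbors outside the image (equivalent to zero padding) are all assumptions made for illustration.

```python
def conv2d_at(x, W, i, j):
    """Output vector y_ij of Equation 1 for one pixel (illustrative sketch).

    x: input as nested lists, shape [h][w][d_in]
    W: weights as nested lists, shape [k][k][d_out][d_in], with k odd (assumed)
    Neighbours outside the image are skipped, i.e. zero padding (assumed).
    """
    h, w = len(x), len(x[0])
    k = len(W)
    d_out, d_in = len(W[0][0]), len(W[0][0][0])
    half = k // 2
    y = [0.0] * d_out
    for a in range(i - half, i + half + 1):
        for b in range(j - half, j + half + 1):
            if not (0 <= a < h and 0 <= b < w):
                continue
            # The weight is selected by the offset (a - i, b - j) only,
            # never by (i, j) itself: this is weight sharing.
            # (The paper indexes W by (i - a, j - b), which is the same
            # up to flipping the kernel.)
            W_off = W[a - i + half][b - j + half]
            for o in range(d_out):
                y[o] += sum(W_off[o][c] * x[a][b][c] for c in range(d_in))
    return y
```

Note that `W` holds $k\times k\times d_{out}\times d_{in}$ numbers, so the parameter count grows quadratically with the spatial extent $k$.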
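![](https://upload-images.jianshu.io/upload_images/20428708-a0ba59b386188712.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

As with convolution, for a pixel $x_{ij}\in \mathbb{R}^{d_{in}}$ the layer first extracts the local region of pixels at positions $ab\in \mathcal{N}_k(i,j)$ with spatial extent $k$ around $x_{ij}$, called the *memory block*. Unlike earlier all-to-all attention, attention here is computed only within this local region. Global attention is usable only after the feature map has been heavily downsampled, because otherwise its computational cost is too high.

![](https://upload-images.jianshu.io/upload_images/20428708-532dfcc931e0163e.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Single-headed attention is computed as in Equation 2 and produces an output pixel $y_{ij}\in \mathbb{R}^{d_{out}}$. The inputs first go through three linear transformations: the *queries* $q_{ij}=W_Qx_{ij}$, the *keys* $k_{ab}=W_Kx_{ab}$ and the *values* $v_{ab}=W_Vx_{ab}$ are linear transformations of pixel $ij$ and of its neighboring pixels. $\mathrm{softmax}_{ab}$ is applied over all logits $q_{ij}^\top k_{ab}$ in the neighborhood, and $W_Q,W_K,W_V\in \mathbb{R}^{d_{out}\times d_{in}}$ are learned transforms. Like the convolution of Equation 1, local self-attention produces its output by combining the value vectors with mixing weights ($\mathrm{softmax}_{ab}(\cdot)$), and the procedure is repeated for every position $ij$.

In practice, multiple attention heads are used to learn several distinct representations of the input. The pixel features $x_{ij}$ are split into $N$ groups $x_{ij}^n\in \mathbb{R}^{d_{in}/N}$, each head applies single-headed attention with its own transforms $W_Q^n,W_K^n,W_V^n\in \mathbb{R}^{d_{out}/N\times d_{in}/N}$, and the results are concatenated into the final output $y_{ij}\in \mathbb{R}^{d_{out}}$.

![](https://upload-images.jianshu.io/upload_images/20428708-a4e96d7e8fc902b8.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Equation 2 uses no positional information, whereas earlier work has shown that relative positional encodings noticeably improve self-attention. The paper therefore uses two-dimensional relative position embeddings (*relative attention*): the embeddings of the row offset $a-i$ and the column offset $b-j$ are concatenated into $r_{a-i,b-j}$, as shown in Figure 4. Equation 2 then becomes the spatial-relative attention of Equation 3, which accounts for both the content similarity between query and key and their relative position. Because it depends on relative positions, self-attention also gains the translation equivariance that convolution has. In addition, the parameter count of attention does not depend on the spatial extent; it depends only on $d_{in}$ and $d_{out}$ and grows slowly, whereas the parameter count of a convolution grows quadratically with $k$.

# Fully Attentional Vision Model
***

### Replacing Spatial Convolution

A spatial convolution is a convolution with spatial extent $k>1$. The paper replaces every spatial convolution with an attention layer and, where downsampling is needed, follows the layer with a stride-2 $2\times 2$ average pooling. The overall model is based on the ResNet family: the $3\times 3$ convolution in each bottleneck block is replaced with the self-attention layer of Equation 3, and everything else is left unchanged.

### Replacing the Convolutional Stem

The first few layers of a convolutional network are called the *stem*. They mainly learn local features such as edges, which later layers use to identify whole objects. The stem is structured differently from the core blocks and usually consists of lightweight downsampling operations. In ResNet, the stem is a stride-2 $7\times 7$ convolution followed by a stride-2 $3\times 3$ max pooling. At the stem, the content consists of RGB pixels, which are highly spatially correlated and individually uninformative, so they carry little content information. Replacing the stem convolution with the content-based Equation 3, whose softmax weights are computed from content, is therefore difficult.

![](https://upload-images.jianshu.io/upload_images/20428708-b97b7eb35418b488.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

A convolution kernel has different weights at different positions, which helps it learn specific edge features. To close this gap, distance-related information is injected into the pointwise $1\times 1$ convolution ($W_V$) through a spatially varying linear transformation, giving $\tilde{v}_{ab}=(\sum_m p(a,b,m)W_V^m)x_{ab}$. Here the multiple value matrices $W_V^m$ are combined through a convex combination whose factors $p(a,b,m)$ depend on the position of the neighboring pixel, so $p(a,b,m)$ can be read as the weights of the value matrices. Since different relative positions have different embeddings, the same pixel yields different values at different relative distances. This resembles the behavior of convolution and helps with learning edge features.

# Experiments
***

### ImageNet Classification

![](https://upload-images.jianshu.io/upload_images/20428708-e3edd85316e411c9.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

![](https://upload-images.jianshu.io/upload_images/20428708-fbf2804a6966a73f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The multi-head self-attention uses spatial extent $k=7$ and 8 attention heads. The stem applies self-attention over $4\times 4$ regions of the original image, followed by a batch normalization and a $4\times 4$ max pool. Compared with ResNet-50, the full attention model is 0.7% more accurate while using 12% fewer FLOPs and 29% fewer parameters.

### COCO Object Detection

![](https://upload-images.jianshu.io/upload_images/20428708-8f2dfe46a4f1a793.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The experiments are based on RetinaNet, replacing the backbone and the FPN. With an attention-based backbone the accuracy is about the same while the parameter count drops by 22%. Replacing both the backbone and the FPN with attention layers brings the reduction to 34% fewer parameters and 39% fewer FLOPs than the baseline.

### Where is stand-alone attention most useful?

* ##### Stem

  Tables 1 and 2 and Figure 5 show that for classification the convolution stem performs better. For object detection, the convolution stem performs better when the FPN is convolutional, while the two stems perform about the same when the rest of the network is fully attentional.

* ##### Full network

  ![](https://upload-images.jianshu.io/upload_images/20428708-0b3c318f15159aef.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

  The paper runs an interesting ablation here: starting from the convolution stem, it chooses between convolution and attention at the granularity of individual layer groups. Using convolution in the earlier groups improves performance, while using it in the later groups instead hurts it. The paper attributes this to convolution being better at extracting low-level features, but the features involved here should have the same dimensionality, so that explanation is open to debate.

### Which components are important in attention?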
* ##### Effect of spatial extent of self-attention

  ![](https://upload-images.jianshu.io/upload_images/20428708-6ec7b304e85bddcd.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

* ##### Importance of positional information

  ![](https://upload-images.jianshu.io/upload_images/20428708-b7da1dbf1e28f220.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

  The paper compares different positional encodings for $r_{a-i,b-j}$ in Equation 3, and the relative positional encoding gives the highest accuracy.

  ![](https://upload-images.jianshu.io/upload_images/20428708-755f493212e8ff36.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

  The results in Table 6 show that the content-relative interaction $(q\cdot r)$ is the more important term.

* ##### Importance of spatially-aware attention stem

  ![](https://upload-images.jianshu.io/upload_images/20428708-3c894490083d642c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

# CONCLUSION
***

The paper proposes a stand-alone self-attention layer and uses it to build a fully attentional model, verifying that content-based interactions can serve as the primary feature-extraction primitive of a vision model. On image classification and object detection, the model reaches accuracy comparable to conventional convolutional baselines while using substantially fewer parameters and FLOPs, which makes the work a useful reference.

> Writing takes effort; please do not repost without permission. For more content, follow the WeChat official account 【晓飞的算法工程笔记】.

![work-life balance.](https://upload-images.jianshu.io/upload_images/20428708-7156c0e4a2f49bd6.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
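As a companion to the Self-Attention section, the spatial-relative attention of Equation 3 can be sketched for a single head and a single output pixel. This is only an illustration in plain Python under stated assumptions, not the authors' code: the names `local_attention_at`, `row_emb` and `col_emb`, the nested-list layout, the odd spatial extent, and the handling of image borders (out-of-image neighbors are simply dropped) are all hypothetical choices.

```python
import math


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def local_attention_at(x, W_Q, W_K, W_V, row_emb, col_emb, i, j, k):
    """Single-head output y_ij of Equation 3 for one pixel (illustrative sketch).

    x:       input as nested lists, shape [h][w][d_in]
    W_Q/K/V: learned projections, each of shape [d_out][d_in]
    row_emb: dict mapping a row offset (a - i) to an embedding vector
    col_emb: dict mapping a column offset (b - j) to an embedding vector
             (the two are concatenated into r, so their lengths must
             add up to d_out)
    k:       spatial extent of the memory block, assumed odd here
    """
    h, w = len(x), len(x[0])
    half = k // 2
    q = matvec(W_Q, x[i][j])
    logits, values = [], []
    for a in range(i - half, i + half + 1):
        for b in range(j - half, j + half + 1):
            if not (0 <= a < h and 0 <= b < w):
                continue  # assumption: neighbours outside the image are dropped
            key = matvec(W_K, x[a][b])
            rel = row_emb[a - i] + col_emb[b - j]  # list concatenation
            # content term q.k plus content-relative position term q.r
            logits.append(dot(q, key) + dot(q, rel))
            values.append(matvec(W_V, x[a][b]))
    # Numerically stable softmax over the memory block
    top = max(logits)
    weights = [math.exp(s - top) for s in logits]
    total = sum(weights)
    d_out = len(values[0])
    return [sum(wt * v[o] for wt, v in zip(weights, values)) / total
            for o in range(d_out)]
```

In contrast to a convolution kernel, the projection matrices here have shapes that do not involve $k$; only the small offset embeddings are indexed by position.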
