ICLR 2020 | Leaving Convolution Behind: Multi-Head Self-Attention Can Express Any Convolution

晓飞的算法工程笔记 2020-03-30

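> In recent years, many studies have brought the attention mechanism from NLP into computer vision with good results. This paper therefore sets out to verify, both theoretically and experimentally, that self-attention can stand in for convolutional layers and perform convolution-like operations on its own, laying a foundation for the use of self-attention in the image domain.

**Paper: On the Relationship between Self-Attention and Convolutional Layers**

![](https://upload-images.jianshu.io/upload_images/20428708-9d587ea0e40c1328.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

* **Paper URL: [https://arxiv.org/abs/1911.03584](https://arxiv.org/abs/1911.03584)**
* **Code: [https://github.com/epfml/attention-cnn](https://github.com/epfml/attention-cnn)**

# Introduction
***

The transformer has greatly advanced NLP research. It owes this to the attention mechanism, and to self-attention in particular, which weights the output for the current word by its similarity to the other words. Inspired by this way of learning relationships between words, self-attention has also been applied to vision tasks, though mostly in combination with convolution. In 2019, Ramachandran et al. matched the accuracy of a ResNet baseline with a fully attentional model while using considerably fewer parameters and less computation than the convolutional network. The paper therefore investigates whether a self-attention layer can achieve the effect of a convolutional layer when processing images. Its contributions are:

* On the theoretical side, the paper gives a constructive proof that self-attention layers can express any convolutional layer.
* On the practical side, experiments with multi-head self-attention layers show that the first few layers of an attention-only architecture do learn to attend to a grid-shaped region around the query pixel.

# Background on attention mechanisms for vision
***

### The multi-head self-attention layer

Let $X\in \mathbb{R}^{T\times D_{in}}$ be the input matrix holding $T$ tokens of dimension $D_{in}$. In NLP a token corresponds to a word in a sequence; it can equally correspond to a pixel of a serialized image.

![](https://upload-images.jianshu.io/upload_images/20428708-4ae132f9d1fa5fd6.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

![](https://upload-images.jianshu.io/upload_images/20428708-b3734c90e2cb9539.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Equations 1 and 2 give the computation of a self-attention layer from $D_{in}$ to $D_{out}$. $A$ holds the attention scores, and softmax turns the scores into attention probabilities. The layer's parameters are the query matrix $W_{qry}\in \mathbb{R}^{D_{in}\times D_k}$, the key matrix $W_{key}\in \mathbb{R}^{D_{in}\times D_k}$, and the value matrix $W_{val}\in \mathbb{R}^{D_{in}\times D_{out}}$. Each of them linearly transforms the input, essentially as in NLP self-attention.

![](https://upload-images.jianshu.io/upload_images/20428708-1e06e9722163d728.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)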
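The single-head layer described above can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: it assumes unscaled scores $A = XW_{qry}W_{key}^\top X^\top$ followed by a row-wise softmax, uses plain Python lists to stay dependency-free, and the helper names (`matmul`, `transpose`, `softmax`, `self_attention`) and the toy weights are made up for the example.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_qry, W_key, W_val):
    """Single-head self-attention; returns (output, attention probabilities)."""
    queries = matmul(X, W_qry)                  # T x D_k
    keys = matmul(X, W_key)                     # T x D_k
    scores = matmul(queries, transpose(keys))   # attention scores A, T x T
    probs = [softmax(row) for row in scores]    # each row sums to 1
    values = matmul(X, W_val)                   # T x D_out
    return matmul(probs, values), probs

# Toy input (hypothetical values): T = 3 tokens, D_in = 2, D_k = 1, D_out = 3.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_qry = [[1.0], [0.5]]
W_key = [[0.5], [1.0]]
W_val = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]

out, probs = self_attention(X, W_qry, W_key, W_val)
print(len(out), len(out[0]))   # T rows, D_out columns
```

Each output row is a convex combination of the value vectors, weighted by how strongly that token's query matches every key.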
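Because it only looks at pairwise similarity, self-attention has an important property: it is equivariant to reordering. Shuffling the input tokens only shuffles the outputs in the same way, so the layer cannot tell one ordering from another. This is a problem in cases where order should affect the result, so a positional encoding is learned for each token on top of self-attention. $P\in \mathbb{R}^{T\times D_{in}}$ holds the embedding vectors that carry position information, and it can take several forms.

![](https://upload-images.jianshu.io/upload_images/20428708-ef33900a2f9b47fc.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The paper uses the multi-head version of self-attention. Each head has its own parameter matrices and can extract different features. The $D_h$-dimensional outputs of the $N_h$ heads are concatenated and projected to the final $D_{out}$-dimensional output, which adds two parameters: the projection matrix $W_{out}\in \mathbb{R}^{N_hD_h\times D_{out}}$ and the bias $b_{out}\in \mathbb{R}^{D_{out}}$.

### Attention for images

![](https://upload-images.jianshu.io/upload_images/20428708-e4a2b63587ebda82.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

![](https://upload-images.jianshu.io/upload_images/20428708-8b9d96170ecc202d.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Convolutional layers are the de facto choice for neural networks that operate on images. Given an image $X\in \mathbb{R}^{W\times H\times D_{in}}$, Equation 5 gives the convolution at pixel $(i,j)$, where $W\in \mathbb{R}^{K\times K\times D_{in}\times D_{out}}$ is the weight tensor, $b\in \mathbb{R}^{D_{out}}$ is the bias, and $K$ is the kernel size.

![](https://upload-images.jianshu.io/upload_images/20428708-b56cbcb06dc8d9dc.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

To apply self-attention to an image, define the query pixel and the key pixel $q,k\in[W]\times [H]$, with input tensor $X\in \mathbb{R}^{W\times H\times D_{in}}$. To keep the notation consistent with the 1D case, a single index stands for a 2D coordinate: for $p=(i,j)$, $X_p$ denotes $X_{ij}$ and $A_p$ denotes $A_{ij}$.

### Positional encoding for images

There are two main kinds of positional encoding: absolute and relative.

![](https://upload-images.jianshu.io/upload_images/20428708-6f5e035c02fba956.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

With absolute encoding, each pixel has a position vector $P_p$ (learned or fixed), so the attention scores can be rewritten as Equation 7.

![](https://upload-images.jianshu.io/upload_images/20428708-92953465bb6fa134.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The core idea of relative encoding is to consider only the position difference between the query pixel and the key pixel, as in Equation 8, which roughly replaces the absolute position parameters in each term of Equation 7 with relative ones. In this way the attention scores depend on position only through the offset $\delta:=k-q$. $u$ and $v$ are learnable parameters that differ for each head, while the relative positional encoding $r_\delta\in \mathbb{R}^{D_p}$ of each offset is shared across heads. The key weights are split into two parts: $W_{key}$ applies to the input and $\widehat {W}_{key}$ applies to the offset encoding.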
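Both the convolution of Equation 5 and the relative encoding are indexed by a pixel offset, which is what makes the two comparable. As a rough sketch, the convolution at one pixel can be written as a sum over offsets, with one `D_in x D_out` weight slice per offset. The function name `conv_at`, the nested-list layout, the odd kernel size, and the zero padding at the border are all assumptions made for this illustration.

```python
def conv_at(X, W, b, i, j):
    """Output channels of a K x K convolution at pixel (i, j).

    X: image stored as X[i][j][c], i.e. width x height x D_in.
    W: kernel stored as W[a][c2], each entry a D_in x D_out matrix;
       entry (a, c2) belongs to the offset (a - K // 2, c2 - K // 2).
    b: bias, a list of D_out values.
    Assumes K is odd and treats pixels outside the image as zeros.
    """
    K = len(W)
    half = K // 2
    width, height = len(X), len(X[0])
    d_in, d_out = len(W[0][0]), len(b)
    out = list(b)                                  # start from the bias
    for d1 in range(-half, half + 1):
        for d2 in range(-half, half + 1):
            p, q = i + d1, j + d2
            if not (0 <= p < width and 0 <= q < height):
                continue                           # zero padding (assumption)
            kernel = W[d1 + half][d2 + half]       # weight slice for this offset
            for c in range(d_in):
                for o in range(d_out):
                    out[o] += X[p][q][c] * kernel[c][o]
    return out

# Toy data (hypothetical): 3 x 3 single-channel image and an all-ones 3 x 3 kernel.
X = [[[1.0], [2.0], [3.0]],
     [[4.0], [5.0], [6.0]],
     [[7.0], [8.0], [9.0]]]
W = [[[[1.0]] for _ in range(3)] for _ in range(3)]
b = [0.0]

print(conv_at(X, W, b, 1, 1))   # centre pixel: sums all nine values
print(conv_at(X, W, b, 0, 0))   # corner pixel: only four neighbours lie inside
```

Written this way, every kernel offset contributes one shifted copy of the image multiplied by its own weight slice.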
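![](https://upload-images.jianshu.io/upload_images/20428708-33abae4f6eceb7c7.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Equation 9 is called the quadratic encoding. The parameters $\Delta^{(h)}=(\Delta_1^{(h)},\Delta_2^{(h)})$ and $\alpha^{(h)}$ give the center and the width of each head's attention region respectively, and both are learned. $\delta=(\delta_1,\delta_2)$ is fixed and denotes the relative shift between the query pixel and the key pixel.

# Self-attention as a convolutional layer
***

![](https://upload-images.jianshu.io/upload_images/20428708-5b4bde3549b09128.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Theorem 1. A multi-head self-attention layer with $N_h$ heads, each producing $D_h$ dimensions, a final output of dimension $D_{out}$, and a relative positional encoding of dimension $D_p\ge 3$ can express any convolutional layer with kernel size $\sqrt{N_h}\times \sqrt{N_h}$ and $\min(D_h,D_{out})$ output channels.

As for the number of output channels not being fixed at $D_{out}$, the paper argues that when $D_h$ …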
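The intuition behind the quadratic encoding and Theorem 1 can be checked numerically. The sketch below is an illustration under simplifying assumptions, not the paper's proof or code: the content terms of the score are set to zero so that each head's score is `-alpha * ||(k - q) - Delta||^2`, the image has a single channel, each of the K*K heads is given one kernel offset as its center, and only an interior pixel is compared. The names `head_probs`, `attention_as_conv`, and `conv_at` are hypothetical.

```python
import math

def head_probs(q, center, alpha, width, height):
    """Attention probabilities of one head under the quadratic encoding.

    The score of key pixel k is -alpha * ||(k - q) - center||^2;
    content-dependent terms are assumed to be zero.
    """
    scores = {}
    for k1 in range(width):
        for k2 in range(height):
            d1 = k1 - q[0] - center[0]
            d2 = k2 - q[1] - center[1]
            scores[(k1, k2)] = -alpha * (d1 * d1 + d2 * d2)
    peak = max(scores.values())
    exps = {k: math.exp(s - peak) for k, s in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def attention_as_conv(X, kernel, q, alpha):
    """Sum of K*K heads, one per kernel offset, on a single-channel image."""
    half = len(kernel) // 2
    width, height = len(X), len(X[0])
    out = 0.0
    for a in range(-half, half + 1):
        for c in range(-half, half + 1):
            probs = head_probs(q, (a, c), alpha, width, height)
            attended = sum(p * X[k[0]][k[1]] for k, p in probs.items())
            out += attended * kernel[a + half][c + half]   # per-head value weight
    return out

def conv_at(X, kernel, q):
    """Reference convolution at an interior pixel q (no padding needed)."""
    half = len(kernel) // 2
    return sum(X[q[0] + a][q[1] + c] * kernel[a + half][c + half]
               for a in range(-half, half + 1)
               for c in range(-half, half + 1))

# Toy data (hypothetical): a 5 x 5 ramp image and a 3 x 3 edge-like kernel.
X = [[float(5 * i + j) for j in range(5)] for i in range(5)]
kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
q = (2, 2)

exact = conv_at(X, kernel, q)
sharp = attention_as_conv(X, kernel, q, alpha=50.0)   # nearly one-hot heads
soft = attention_as_conv(X, kernel, q, alpha=0.1)     # blurry heads
print(exact, sharp, soft)
```

With a large `alpha` each head puts almost all of its probability on the single pixel at offset `center` from the query, so the sum over heads matches the convolution; with a small `alpha` the heads average over a neighbourhood and the match is lost. The three numbers per head (the squared norm of the offset and its two coordinates) are what the condition on the encoding dimension in Theorem 1 accounts for.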
