## A simple attention module
$
s_i = \sum_j w_j x_{ij}
$
$
w_j = \frac{e^{z_j}}{\sum_k e^{z_k}}
$
Effectively, the attention mechanism can be seen as a multiplicative module that switches the components of a variable on or off. The weight vector $w$ can be interpreted as probabilities that mask out or attenuate parts of the input.
This module can be applied to a sequence of variables, as illustrated below:
![[attention_module.excalidraw.svg|The attention module, as presented in LeCun's slides on Deep Learning]]
Here, $w$ is computed from $z$ through the softmax function. Using the softmax (a generalisation of the logistic function) allows the module to *smoothly* blend variables by disabling or enabling specific components of each.
In short, attention within a neural network can be used to switch on or off parts of the network, or components of a variable, via multiplicative interaction.
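A minimal NumPy sketch of this gating module (shapes and variable names are my own, chosen to mirror the equations above):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Two input variables x_1, x_2 (the columns of X), each of dimension 4.
X = np.random.randn(4, 2)   # x_{ij}: component i of variable j
z = np.random.randn(2)      # one score z_j per variable

w = softmax(z)              # w_j = exp(z_j) / sum_k exp(z_k)
s = X @ w                   # s_i = sum_j w_j x_{ij}
```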
## The attention mechanism
The **Attention mechanism**, the central part of many transductive models, consists in modeling the relationships between points in a space of any dimension, without regard to their distance. This is particularly attractive because it enables long-distance dependencies between points in a sequence or set.
> An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.[^1]
[^1]: [Attention Is All You Need (Vaswani et al.)](http://arxiv.org/abs/1706.03762)
There are two main variants of attention: *additive attention* and *multiplicative (or dot-product) attention*. The former computes the compatibility function with a 1-hidden-layer MLP, which is rather costly, while the latter is a simple dot product that can be implemented with highly optimized matrix multiplication. The Transformer architecture uses a *scaled* dot-product attention function.
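As a rough illustration of the difference (the shapes and weight names below are my own, not the paper's), the two compatibility functions for a single query-key pair might look like this:

```python
import numpy as np

d_k, d_h = 8, 16                        # key dim and MLP hidden dim (arbitrary)
q, k = np.random.randn(d_k), np.random.randn(d_k)

# Additive attention: a 1-hidden-layer MLP scores the (query, key) pair.
W_q = np.random.randn(d_h, d_k)
W_k = np.random.randn(d_h, d_k)
v = np.random.randn(d_h)
score_additive = v @ np.tanh(W_q @ q + W_k @ k)

# Dot-product attention: a single dot product, easy to batch as Q @ K.T.
score_dot = q @ k
```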
## The scaled dot-product
As stated above, the inputs are queries and key-value pairs, where the queries and keys are vectors of dimension $d_k$ and the values are vectors of dimension $d_v$.
The *scaled* dot-product attention is defined by Vaswani et al.[^1] as:
$
\text{Attention}(Q,K,V) =
\text{softmax}(\frac{QK^T}{\sqrt{d_k}})V
$
Here $Q$, $K$ and $V$ are the query, key and value vectors packed into matrices for more efficient computation, so the softmax produces a whole matrix of weights at once. Each weight represents how relevant a key (and therefore its associated value) is to a given query.
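A minimal NumPy sketch of this equation (shapes and names are illustrative):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # one compatibility score per (query, key) pair
    weights = softmax(scores)         # each row sums to 1
    return weights @ V                # weighted sum of the values

Q = np.random.randn(5, 64)   # 5 queries of dimension d_k = 64
K = np.random.randn(7, 64)   # 7 keys of dimension d_k = 64
V = np.random.randn(7, 32)   # 7 values of dimension d_v = 32
out = attention(Q, K, V)     # shape (5, 32)
```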
A vector version of the attention function, written per query $q_i^h$ (the superscript $h$ indexes the attention head), is defined as:
$
y_i^h = \text{Att}(q_i^h, K^h)\,V^h
$
$
\text{Att}(q_i^h, K^h) = \text{softmax}\left(\frac{q_i^h \cdot K^h}{\sqrt{d_k}}\right)
$
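Continuing the sketch above, the per-query version is simply one row of the matrix computation:

```python
# Per-query output y_i is row i of the matrix form (reusing Q, K, V above).
i = 0
y_i = softmax(Q[i] @ K.T / np.sqrt(Q.shape[-1])) @ V
assert np.allclose(y_i, attention(Q, K, V)[i])
```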
#### *So what are those queries, keys and values exactly?*
The idea is that **what attention should be paid to and where the information should be extracted from are two different things**. This is the reasoning behind the key-value pairing: we pay attention to the keys, and we extract information from the values.
In the end, for NLP, the queries, keys and values are all derived from the same words of the input sequence (in the encoder part of the Transformer at least). But the attention function is flexible: you could very well decide to learn the queries directly, or take the key-value pairs from a different source.
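For instance, in self-attention the queries, keys and values are obtained from the same sequence through separate learned projections. A sketch, reusing the `attention` function above, with random matrices standing in for the learned projections $W^Q$, $W^K$, $W^V$ and shapes borrowed from the paper's base model:

```python
d_model, d_k, d_v, n = 512, 64, 64, 10
X = np.random.randn(n, d_model)       # one row per token of the input sequence

W_Q = np.random.randn(d_model, d_k)   # learned in a real model
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_v)

# Self-attention: queries, keys and values all come from the same sequence X.
out = attention(X @ W_Q, X @ W_K, X @ W_V)   # shape (n, d_v)
```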
### The dot-product
![[Dot product diagram.svg|The angle between two vectors.]]
The dot-product between two vectors $\mathbf{a}$ and $\mathbf{b}$ is defined as:
$
\mathbf{a} \cdot \mathbf{b} = \lVert \mathbf{a} \rVert \lVert\mathbf{b}\rVert \cos \theta
$
where $\theta$ is the angle between $\mathbf{a}$ and $\mathbf{b}$. This means that vectors pointing in similar directions have a large dot product relative to their magnitudes.

The scaling factor of $\frac{1}{\sqrt{d_k}}$ keeps the dot products normalised: since $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, for components with roughly unit variance the variance of the dot product grows linearly with $d_k$, so its typical magnitude grows with $\sqrt{d_k}$. Large values of $d_k$ therefore push the softmax into regions where it has extremely small gradients, and the scaling acts as a countermeasure to the vanishing gradient problem.
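A quick numerical sketch of this effect (array sizes are arbitrary, reusing the `softmax` helper from the sketches above):

```python
# Unit-variance components give the dot product a variance of roughly d_k,
# i.e. a typical magnitude of sqrt(d_k).
d_k = 512
q = np.random.randn(100, d_k)
k = np.random.randn(100, d_k)
scores = (q * k).sum(axis=-1)
print(scores.std())                        # roughly sqrt(512) ≈ 22.6
print(softmax(scores[:5]))                 # unscaled: essentially one-hot
print(softmax(scores[:5] / np.sqrt(d_k)))  # scaled: a much smoother distribution
```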
![[The Softmax function]]