# Induction vs Transduction
In supervised learning, where training samples are labeled with ground truth, *induction* infers a general predictive rule, or statistical model, from the training set alone, which can then be applied to any unseen test sample. In contrast, *transduction* targets a specific, finite set of test samples: it exploits the patterns in both the training set and the test set, even though the test samples are unlabeled.
The major difference between the two approaches is that *induction* produces a predictive model that is trained once on a finite data set and then applied to novel test samples, while *transduction* requires revisiting the entire collection of data points every time new test samples are introduced.
In the context of sequence-to-sequence or language modeling, the problem is *transductive* by nature, because every point in the sequence, or every token in a sentence, is needed to compute the output. The Transformer is described as *transductive* because each input attends to every other input, rather than being processed sequentially in a manner that would allow *induction*, although it can abstractly be viewed as an *inductive* model because of the long-term memory induced by the MLP components of the Transformer-encoder blocks.
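As a rough sketch of the two regimes outside of sequence modeling, the example below contrasts an inductive classifier, fit once on labeled data only, with scikit-learn's transductive `LabelPropagation`, which has to see the unlabeled test points at fit time. The toy data and variable names are illustrative assumptions, not anything from the papers cited here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(0)
# Toy 2D data: two Gaussian blobs (purely illustrative).
X_train = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
X_test = np.vstack([rng.normal(-2, 1, (10, 2)), rng.normal(2, 1, (10, 2))])

# Induction: learn a general rule from the training set alone,
# then apply it to any unseen sample without refitting.
inductive = LogisticRegression().fit(X_train, y_train)
pred_inductive = inductive.predict(X_test)

# Transduction: the unlabeled test points (marked with -1) are part of the
# fit itself; introducing new test samples means refitting on everything.
X_all = np.vstack([X_train, X_test])
y_all = np.concatenate([y_train, -np.ones(len(X_test), dtype=int)])
transductive = LabelPropagation().fit(X_all, y_all)
pred_transductive = transductive.transduction_[len(X_train):]
```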
# Attention, attention!
![[The Scaled Dot-Product Attention function]]
## Multi-head Attention
In practice, the Transformer architecture does not perform a single attention function over the full queries, keys and values. Instead, it applies a linear projection to those vectors, such that
$$
Q, K, V = QW^Q, KW^K, VW^V.
$$
However, this projection is done $h$ times, each time with different weight matrices, so that the model can learn different representations. The scaled dot-product attention is then computed in parallel for each resulting set of $Q$, $K$, $V$ matrices, and the $h$ attention outputs are concatenated.
As defined by Vaswani et al.[^1],
$$
\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$
$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$
> Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.[^1]
We can use the analogy of the layered structure of CNNs to motivate Multi-Head attention: as we go deeper in the network, the receptive field increases.
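To make the equations above concrete, here is a minimal NumPy sketch of multi-head scaled dot-product attention; the dimensions and the random projection matrices are illustrative assumptions, not trained weights from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v: (h, d_model, d_k); W_o: (h * d_k, d_model)
    heads = [attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
             for i in range(W_q.shape[0])]
    # Concatenate the h heads, then apply the output projection W^O.
    return np.concatenate(heads, axis=-1) @ W_o

# Illustrative shapes: sequence length 5, d_model = 16, h = 4 heads of size 4.
rng = np.random.default_rng(0)
seq, d_model, h = 5, 16, 4
d_k = d_model // h
X = rng.normal(size=(seq, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)  # self-attention
print(out.shape)  # (5, 16)
```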
[^1]: [Attention Is All You Need (Vaswani et al.)](http://arxiv.org/abs/1706.03762)
# The Transformer architecture
Now that we understand attention, and more specifically the **Multi-Head Attention mechanism**, we can insert it into its context in the following Transformer encoder-decoder architecture:
![[transformer.png|The Transformer architecture.]][^1]
However, there are a few components we have not covered yet:
- **Positional Encoding:** it is helpful for the Transformer to have a notion of position and order for the tokens in the input sequence, so a (possibly learned) position embedding is added to the input/output embeddings. The original Transformer architecture uses fixed sine and cosine functions of different frequencies, whereas the Vision Transformer[^3], for instance, adds learned position embeddings to the patch embeddings (a sketch of the sinusoidal variant follows this list).
- **Feed Forward Layer:** these fully-connected ReLU feed-forward networks are applied to each position separately, as a two-layer MLP with different parameters in each layer of the Transformer (also sketched after this list). *They implement a long-term memory version of attention.* In normal attention, the keys and values are a function of the receptive field (the current inputs); this long-term attention, however, is independent of the particular inputs and stores longer-term memories across the whole of training.[^2]
- **Layer Norm:** important for the cosine-similarity interpretation of attention, though it could be improved with an L2 norm.[^2]
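As a minimal sketch of the first two items, assuming illustrative dimensions and random weights, the snippet below implements the fixed sinusoidal positional encoding and the position-wise feed-forward layer:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def position_wise_ffn(x, W1, b1, W2, b2):
    # Applied to each position independently: max(0, x W1 + b1) W2 + b2
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Illustrative shapes: sequence length 5, d_model = 16, inner dimension d_ff = 64.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 16, 64
X = rng.normal(size=(seq_len, d_model)) + sinusoidal_positional_encoding(seq_len, d_model)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = position_wise_ffn(X, W1, b1, W2, b2)  # shape (5, 16)
```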
[^2]: [Attention Approximates Sparse Distributed Memory](https://www.youtube.com/watch?v=THIIk7LR9_8)
[^3]: [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al.)](http://arxiv.org/abs/2010.11929)
## Auto-associativity vs hetero-associativity
What is the difference between auto-associative and hetero-associative transformers?
Basically, in an autoregressive, auto-associative transformer the keys point to other keys or even to themselves (so the queries are the keys as well), whereas in a hetero-associative setting, where you want to associate A with B or predict the next token in a sequence, the keys point to values in a different set.[^2]
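As a minimal sketch of this distinction (the shapes and the `attention` helper are assumptions of this note, not code from the referenced talk), self-attention draws queries, keys and values from the same sequence, while cross-attention takes keys and values from a different one:

```python
import numpy as np

def attention(Q, K, V):
    # softmax(QK^T / sqrt(d_k)) V, as in the multi-head sketch above
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))   # one sequence (e.g. the decoder input)
Y = rng.normal(size=(9, 8))   # a different sequence (e.g. the encoder output)

# Auto-associative (self-attention): queries, keys and values all come
# from the same set, so each token attends to the others and to itself.
self_attn = attention(X, X, X)   # shape (6, 8)

# Hetero-associative (cross-attention): queries come from one set (X),
# while keys and values come from a different set (Y), associating A to B.
cross_attn = attention(X, Y, Y)  # shape (6, 8)
```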