# Induction vs Transduction
In supervised learning, where training samples are labeled with ground truth, *induction* infers a general predictive rule, or statistical model, from the training set alone, which can then be applied to any unseen test sample. In contrast, *transduction* targets a specific, finite set of test samples: it exploits the patterns in both the training set and the test set, even though the test samples are unlabeled.
The major difference between the two approaches is that *induction* produces a predictive model that is trained once on a finite data set and then applied to novel test samples, while *transduction* requires revisiting the entire collection of data points every time new test samples are introduced.
In the context of sequence-to-sequence or language modeling, the problem is *transductive* by nature, because every point in the sequence, or every token in a sentence, is needed to compute the output. The Transformer is described as *transductive* because each input attends to every other input, rather than being processed sequentially in a manner that would allow *induction*, although it can abstractly be viewed as an *inductive* model because of the long-term memory induced by the MLP components of the Transformer-encoder blocks.
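As a rough sketch of the two regimes outside of sequence modeling, the example below contrasts an inductive classifier, fit once on labeled data only, with scikit-learn's transductive `LabelPropagation`, which has to see the unlabeled test points at fit time. The toy data and variable names are illustrative assumptions, not anything from the papers cited here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(0)
# Toy 2D data: two Gaussian blobs (purely illustrative).
X_train = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
X_test = np.vstack([rng.normal(-2, 1, (10, 2)), rng.normal(2, 1, (10, 2))])

# Induction: learn a general rule from the training set alone,
# then apply it to any unseen sample without refitting.
inductive = LogisticRegression().fit(X_train, y_train)
pred_inductive = inductive.predict(X_test)

# Transduction: the unlabeled test points (marked with -1) are part of the
# fit itself; introducing new test samples means refitting on everything.
X_all = np.vstack([X_train, X_test])
y_all = np.concatenate([y_train, -np.ones(len(X_test), dtype=int)])
transductive = LabelPropagation().fit(X_all, y_all)
pred_transductive = transductive.transduction_[len(X_train):]
```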
# Attention, attention!
![[The Scaled Dot-Product Attention function]]
## Multi-head Attention
In practice, the Transformer architecture does not perform a single attention function over the full queries, keys and values. Instead, it applies a linear projection to those vectors, such that
$$
Q, K, V = QW^Q, KW^K, VW^V.
$$
However, this projection is done $h$ times, each time with different weight matrices, so that the model can learn different representations. The scaled dot-product attention is then computed in parallel for each resulting set of $Q$, $K$, $V$ matrices, and the $h$ attention outputs are concatenated.
As defined by Vaswani et al.[^1],
$$
\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$
$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$
> Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.[^1]
We can use the analogy of the layered structure of CNNs to motivate Multi-Head attention: as we go deeper in the network, the receptive field increases.
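To make the equations above concrete, here is a minimal NumPy sketch of multi-head scaled dot-product attention; the dimensions and the random projection matrices are illustrative assumptions, not trained weights from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v: (h, d_model, d_k); W_o: (h * d_k, d_model)
    heads = [attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
             for i in range(W_q.shape[0])]
    # Concatenate the h heads, then apply the output projection W^O.
    return np.concatenate(heads, axis=-1) @ W_o

# Illustrative shapes: sequence length 5, d_model = 16, h = 4 heads of size 4.
rng = np.random.default_rng(0)
seq, d_model, h = 5, 16, 4
d_k = d_model // h
X = rng.normal(size=(seq, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)  # self-attention
print(out.shape)  # (5, 16)
```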
[^1]: [Attention Is All You Need (Vaswani et al.)](http://arxiv.org/abs/1706.03762)
# The Transformer architecture
Now that we understand attention, and more specifically the **Multi-Head Attention mechanism**, we can insert it into its context in the following Transformer encoder-decoder architecture:
![[transformer.png|The Transformer architecture.]][^1]
However, there are a few components we have not covered yet:
- **Positional Encoding:** it is helpful for the Transformer to have a notion of position and order for the tokens in the input sequence, so a (possibly learned) position embedding is added to the input/output embeddings. The original Transformer architecture uses fixed sine and cosine functions of different frequencies, whereas the Vision Transformer[^3], for instance, adds learned position embeddings to the patch embeddings (a sketch of the sinusoidal variant follows this list).
- **Feed Forward Layer:** these fully-connected ReLU feed-forward networks are applied to each position separately, as a two-layer MLP with different parameters in each layer of the Transformer (also sketched after this list). *They implement a long-term memory version of attention.* In normal attention, the keys and values are a function of the receptive field (the current inputs); this long-term attention, however, is independent of the particular inputs and stores longer-term memories across the whole of training.[^2]
- **Layer Norm:** important for the cosine-similarity interpretation of attention, though it could be improved with an L2 norm.[^2]
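As a minimal sketch of the first two items, assuming illustrative dimensions and random weights, the snippet below implements the fixed sinusoidal positional encoding and the position-wise feed-forward layer:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def position_wise_ffn(x, W1, b1, W2, b2):
    # Applied to each position independently: max(0, x W1 + b1) W2 + b2
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Illustrative shapes: sequence length 5, d_model = 16, inner dimension d_ff = 64.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 16, 64
X = rng.normal(size=(seq_len, d_model)) + sinusoidal_positional_encoding(seq_len, d_model)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = position_wise_ffn(X, W1, b1, W2, b2)  # shape (5, 16)
```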
[^2]: [Attention Approximates Sparse Distributed Memory](https://www.youtube.com/watch?v=THIIk7LR9_8)
[^3]: [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al.)](http://arxiv.org/abs/2010.11929)
## Auto-associativity vs hetero-associativity
What is the difference between auto-associative and hetero-associative transformers?
Basically, in an autoregressive, auto-associative transformer the keys point to other keys or even to themselves (so the queries are the keys as well), whereas in a hetero-associative setting, where you want to associate A with B or predict the next token in a sequence, the keys point to values in a different set.[^2]
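As a minimal sketch of this distinction (the shapes and the `attention` helper are assumptions of this note, not code from the referenced talk), self-attention draws queries, keys and values from the same sequence, while cross-attention takes keys and values from a different one:

```python
import numpy as np

def attention(Q, K, V):
    # softmax(QK^T / sqrt(d_k)) V, as in the multi-head sketch above
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))   # one sequence (e.g. the decoder input)
Y = rng.normal(size=(9, 8))   # a different sequence (e.g. the encoder output)

# Auto-associative (self-attention): queries, keys and values all come
# from the same set, so each token attends to the others and to itself.
self_attn = attention(X, X, X)   # shape (6, 8)

# Hetero-associative (cross-attention): queries come from one set (X),
# while keys and values come from a different set (Y), associating A to B.
cross_attn = attention(X, Y, Y)  # shape (6, 8)
```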