Softmax is defined as: $ \sigma(\vec{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K}e^{z_j}} $ where $z_i$ are the elements of the input vector $\vec{z}$ of length $K$, and $e$ is Euler's number, the base of the exponential function. The denominator is a normalization term that ensures the outputs sum to $1$, so they form a valid probability distribution. Softmax normalizes the weights, but the exponential amplifies the larger values relative to the smaller ones, as the following figure[^3] illustrates:

![The Softmax function in the scaled dot-product attention.](softmax.png)

[^3]: [Attention Approximates Sparse Distributed Memory](https://www.youtube.com/watch?v=THIIk7LR9_8)
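As a concrete illustration, here is a minimal NumPy sketch of the definition above. The subtraction of the maximum is a standard numerical-stability trick (it cancels out in the ratio), not part of the formula itself, and the example inputs are arbitrary:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Softmax over the last axis: exp(z_i) / sum_j exp(z_j)."""
    # Subtracting the max avoids overflow in exp(); the result is unchanged
    # because the common factor cancels between numerator and denominator.
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

scores = np.array([1.0, 2.0, 5.0])
print(softmax(scores))        # ~[0.017 0.047 0.936] -- the largest value dominates
print(softmax(scores).sum())  # 1.0 -- a valid probability distribution
```

Note how the gap between the outputs is much larger than the gap between the inputs: the exponential turns moderate differences in the scores into near-one-hot weights, which is exactly the sharpening effect the figure shows.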