## The right way to measure information content
Claude Shannon claimed that the right way to define *the information content of an outcome* $x=a_i$ for an ensemble $X$ is
$
\begin{aligned}
h(x=a_i) &= \log_2 \frac{1}{P(x=a_i)}\\
&= -\log_2 P(x=a_i).
\end{aligned}
$
An ensemble is defined as $X = \{x, A_x, P_x\}$ where:
- $x$ is a random variable
- $A_x = \{a_1, a_2, a_3, \ldots, a_I\}$ is the set of possible outcomes
- $P_x = \{p_1, p_2, p_3, \ldots, p_I\}$ is the set of their probabilities
such that $P(x=a_i) = p_i$, with $\sum_i p_i=1$ and $p_i \ge 0$.
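A minimal Python sketch of this definition (the helper name `info_content` is my own, just for illustration):

```python
import math

def info_content(p: float) -> float:
    """Shannon information content, in bits, of an outcome with probability p."""
    return -math.log2(p)

print(info_content(0.5))       # 1.0 bit   -> a fair coin landing heads
print(info_content(1 / 1024))  # 10.0 bits -> a much rarer, more surprising outcome
```

The less probable the outcome, the larger $h(x)$: surprise and information content go hand in hand.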
>[!note] The information content is additive
> If we consider the joint distribution of two independent variables $P(X,Y)$, the information content is **additive**.
For an ensemble $XY$ in which an outcome is a pair of independent random variables $x, y$ with probability $P(x,y) = P(x)P(y)$, the information content of one outcome is defined as:
$
\begin{aligned}
h(x=a,y=b) &= \log_2 \frac{1}{P(x=a,y=b)}\\
&= \log_2 \frac{1}{P(x=a)} + \log_2 \frac{1}{P(y=b)}
\end{aligned}
$
This means that the information content of the joint outcome of two independent events is the sum of the information content of each event. For example, the toss of a coin and the weather outside are independent: the information content of observing both outcomes is the sum of the information content of each.
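A small numeric check of this additivity, assuming a fair coin and a made-up rain probability of $1/8$, with the two treated as independent:

```python
import math

def info_content(p: float) -> float:
    return -math.log2(p)

p_heads = 1 / 2  # fair coin
p_rain = 1 / 8   # hypothetical probability of rain, for illustration only

# For independent events, P(x, y) = P(x) * P(y) ...
h_joint = info_content(p_heads * p_rain)
# ... so the information content of the pair is the sum of the parts.
assert math.isclose(h_joint, info_content(p_heads) + info_content(p_rain))
print(h_joint)  # 4.0 bits = 1 bit (coin) + 3 bits (rain)
```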
## The information content of an ensemble is its Entropy
In *Information Theory*, the notion of *entropy* is used all over the place. This quantity measures **the average information content** (or element of surprise) in the transmission of information. For a probability distribution, it is defined as
$
H(X) = \sum_{x_i} P(X=x_i) \log_2 \frac{1}{P(X=x_i)}
$
where $P(X=x_i)$ is the probability that the event $x_i$ occurs, given the PDF/PMF $p(X)$ (see [[Probabilities 101]] for the PDF and PMF).
A distribution whose probability mass is spread out over many outcomes has a higher entropy than one with all its mass concentrated on a single point: the uniform distribution maximizes the entropy, while a distribution with a single certain outcome has zero entropy.
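A minimal sketch of the formula for a discrete PMF given as a list of probabilities (the `entropy` helper is my own; terms with $p_i = 0$ are skipped, following the convention $0 \log 0 = 0$):

```python
import math

def entropy(pmf: list[float]) -> float:
    """Shannon entropy H(X), in bits, of a discrete distribution."""
    return sum(p * math.log2(1 / p) for p in pmf if p > 0)

# Spread-out probability mass: high entropy.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits
# Mass concentrated on one outcome: low entropy, almost no surprise.
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.24 bits
```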
### How is it the information content?
If a system $S$ has $N$ possible states, then its state $s$ can be represented in $\log_2 N$ bits. In fact, if all its states are uniformly distributed, we can compute the system's entropy using $\log_2$ as:
$
\begin{aligned}
H(S) &= - \sum_{i=1}^{N} p(s_i)\log_2 p(s_i) \\
&= - \sum_{i=1}^{N} \frac{1}{N}\log_2 \frac{1}{N} \\
& = -\frac{1}{N}\log_2(\prod_{i=1}^N\frac{1}{N})\\
& = -\frac{1}{N}\log_2(N^{-N})\\
& = -\frac{1}{N} (-N) \log_2 N\\
& = \log_2 N\\
\end{aligned}
$
Therefore, we can use $\log_2$ to measure the entropy of a uniform distribution over $N$ possible states. Then, we can interpret the result as the minimum number of bits needed to encode a state sampled from that distribution. Using $\log_2$ is well suited to computer science and information encoding, but the definition generalizes to any base. Hence, **Shannon's entropy is a measure of information content**.
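As a quick check of this derivation, here is the same computation for a hypothetical system with $N = 64$ equally likely states:

```python
import math

N = 64  # arbitrary number of equally likely states, for illustration
uniform = [1 / N] * N

H = sum(p * math.log2(1 / p) for p in uniform)
print(H)             # 6.0 bits
print(math.log2(N))  # 6.0 -> the minimum number of bits needed to encode a state
```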
When considering binary files, $h(x)$ is **the compressed file length to which we should aspire**[^1].
[^1]: David MacKay, [Course on Information Theory, Pattern Recognition, and Neural Networks](http://videolectures.net/mackay_course_02/)
### Why the logarithm?
The information content is the same thing as a measure of the uncertainty associated with an outcome. The logarithm is a natural fit for such a measure because the logarithm of a product is the sum of the logarithms. Moreover, adding parameters to a system typically makes its number of possible states grow exponentially, so the number of parameters varies roughly linearly with the logarithm of the number of states. This gives a more intuitive scale and plays well with the requirement that the information content of independent events be additive.
![[entropy plot.svg|Shannon's entropy function.]]
In effect, as the probability of an event approaches 1, its information content $\log_2 \frac{1}{p_i}$ drops to zero, and as its probability approaches 0, its information content grows without bound. From this, we can imagine two general cases:
1. A distribution with a spike at one event and *very low* probabilities for the others will have a *very low* entropy: each term $p_i \log_2 \frac{1}{p_i}$ is close to zero, both when $p_i \approx 1$ and when $p_i \approx 0$. There is **almost no surprise** to the outcome!
2. A more uniform distribution with more or less equal probabilities will have a *very high* entropy: there is **a high surprise** as to the outcome!
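To tie this back to the plot above, a short sketch of the binary (Bernoulli) entropy curve covers both cases, assuming a single event with probability $p$ and its complement with probability $1-p$:

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy, in bits, of a two-outcome distribution {p, 1 - p}."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.999, 0.9, 0.5):
    print(f"p = {p}: H = {binary_entropy(p):.3f} bits")
# p = 0.999: H = 0.011 bits -> spiked distribution, almost no surprise
# p = 0.9:   H = 0.469 bits
# p = 0.5:   H = 1.000 bits -> uniform, maximum surprise
```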
The entropy is an essential measure for Information Theory, the field of Computer Science that studies how information can be compressed and [[The information bottleneck|extracted]]. In Deep Learning, it is the foundation of [[The Variational Autoencoder]] and is at the core of the [[Kullback-Leibler Divergence]].