# The Evolution of Information Theory: From Basics to Applications

The caves of Borneo, Indonesia, provide a glimpse into humanity's earliest recorded communication methods. Approximately 40,000 years ago, long before written language emerged, the use of *physical illustrations* on cave walls represented humanity's most effective means of recording and conveying information. Over time, the approach to documentation evolved from cave art to *sophisticated alphabets*, enabling comprehensive expression through the intricate structure of *language*. In English, concepts as simple as "tree" and as complex as "love" are articulated through the 26 letters, which possess no inherent value apart from what society attributes to them.

In the discipline that would become information theory, *information* is defined as the resolution of uncertainty (or surprise). For instance, when communicating the existence of a tree, a preserved image can convey details about its shape, colors, and relative size compared to other trees. However, this representation is not entirely precise. Written language, with its vast vocabulary, provides a more detailed account, including where the tree is planted, who planted it, and the type of soil it thrives in. Different modes of communication inherently resolve varying amounts of uncertainty.

Claude Shannon, widely regarded as the Father of Information Theory, introduced a logical and mathematical framework for quantifying the differences in communication methods. He termed this concept *entropy*, which serves as a mathematical tool to quantify the relationship between disorder and uncertainty. A higher entropy value signifies greater uncertainty regarding the information conveyed or, more practically, an increased number of potential outputs from a given function. To grasp the modern formula for measuring information, we must first revisit the 1920s, when Ralph Hartley expanded upon Henry Nyquist's foundational work.

## Early Entropy — A Measure Of Uncertainty

Information is characterized as the resolution of uncertainty; if a value can be determined without posing any questions, it lacks informational content. Entropy directly correlates with the information within a system. The greater the entropy, the more uncertainty is associated with the meaning of a symbol (such as a number or letter). Entropy is maximized when a symbol's possible interpretations are all *equally likely* (a uniform distribution). The simplest unit of entropy arises when a symbol can take on *two* meanings. A coin flip, for instance, embodies this binary characteristic—resulting in either heads or tails.

Consider the graph of binary entropy: the y-axis, H(x), signifies *entropy* for a symbol with two outcomes (binary), plotted against the probability of one of those outcomes. Entropy, or *uncertainty*, peaks (equal to 1) at the center of the graph, where both outcomes are equally likely. This aligns intuitively; when flipping a fair coin, the uncertainty of the outcome is highest while it's airborne. Conversely, entropy reaches its minimum (0) at the extremes of the x-axis, where the value of the binary symbol is certain. Thus, coin flips, yes/no questions, Booleans, and binary digits (0 or 1) are all *mathematically* equivalent, tracing this same curve.
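That curve can be reproduced numerically. A minimal Python sketch (the helper name `binary_entropy` is our own, not from the original):

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a two-outcome symbol where P(heads) = p."""
    if p in (0.0, 1.0):
        return 0.0  # the outcome is certain: no uncertainty, no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # fair coin: peak of the curve, 1.0 bit
print(binary_entropy(0.9))  # biased coin: less than 1 bit of uncertainty
print(binary_entropy(1.0))  # known outcome: zero entropy
```

Evaluating the function across 0 ≤ p ≤ 1 traces the symmetric hump described above, with its maximum of 1 bit at p = 0.5.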

*A symbol with this binary property, referred to as a "bit", serves as the fundamental unit of entropy in Information Theory.*

While the term “bit” was not coined until later, we adopt it here for clarity as we delve into modern information theory. An interesting note: "bit" is a contraction of *bi*nary digi*t*.

Logically, a message requiring a single-step solution (a single bit, coin flip, yes/no question, or Boolean uncertainty) has a *lower* entropy value compared to one needing multiple steps. For example, in a fair coin flip, to ascertain whether the outcome is heads or tails, a single bit suffices: "Was it heads?" The answer can be determined with one response. This scenario carries an entropy value of *one* since the answer can be resolved in just *one* step.

When considering more intricate problems, additional questions will likely be necessary to ascertain the answer with certainty. We now recognize that conveying a single symbol—the most basic one, a bit—yields an entropy value of one.

### What If I Send A Message With Three Symbols (All Bits)?

Continuing from the previous example, if we aim to send a three-symbol message using three coin flips, the total number of possible messages is now eight (2³), and the entropy is *three* bits (log₂ 8 = 3). It’s crucial to remember that entropy does not reflect the number of possible messages but rather the minimum number of yes/no questions needed to uncover the message.
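A quick sketch makes the relationship concrete (illustrative Python, not from the original):

```python
import math
from itertools import product

# Every distinct message that three coin flips (H/T) can spell out
messages = list(product("HT", repeat=3))

print(len(messages))             # 8 possible messages (2 ** 3)
print(math.log2(len(messages)))  # 3.0 bits: three yes/no questions pin one down
```

Doubling the message space (a fourth flip) would add exactly one more bit of entropy.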

### What If I Send A Message With A Single Non-Binary Symbol (Character)?

Now, let's consider sending a single character, which could be any one of twenty-six letters (1/26). What is the entropy value in this case? Essentially, how many yes/no questions must we answer to identify our letter, say *J*?

A logical starting point is to ask, in alphabetical order, one character at a time: “Is X the intended letter?” For instance, is A the letter we aim to encode? How about B? Or C? Using this method, a single symbol would require 13 "yes/no" questions on average to pinpoint.

However, a significant caveat exists: measuring information presumes the *most effective way* of assigning value to symbols. Rather than searching character by character, we can use a binary search. When searching for the letter *J*, we first ask whether *J* is in the first *half* of the alphabet. If yes, we narrow further, asking whether our character is among the first six of those letters (A-F). This process continues until we identify our symbol, *J*. This method of assigning value to a symbol using Booleans is far more efficient; instead of an average of thirteen yes/no questions, we need no more than five (sometimes only four).
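The halving strategy can be simulated directly. A sketch (the helper name `questions_to_find` is our own):

```python
import string

def questions_to_find(target):
    """Count the yes/no questions a binary search needs to identify a letter."""
    candidates = list(string.ascii_uppercase)
    questions = 0
    while len(candidates) > 1:
        half = len(candidates) // 2
        first_half = candidates[:half]
        questions += 1  # "Is the letter in the first half of what remains?"
        candidates = first_half if target in first_half else candidates[half:]
    return questions

print(questions_to_find("J"))  # 5 questions
print(max(questions_to_find(c) for c in string.ascii_uppercase))  # worst case: 5
print(min(questions_to_find(c) for c in string.ascii_uppercase))  # best case: 4
```

Every letter resolves in four or five questions, matching the claim above and hinting at the log₂(26) ≈ 4.7 figure derived next.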

Through this insight, each additional yes/no question doubles the number of symbols that can be distinguished, so the entropy of a symbol grows with the base-2 logarithm of the message space. With this understanding, calculating the most efficient method to encode an alphabet character given a uniform distribution becomes clear.

Taking the base-2 logarithm of the number of possible characters, H = log₂(26) ≈ 4.7 bits, estimates the entropy associated with sending a single, random alphabet character. As Ralph Hartley noted in his influential paper, *Transmission of Information*:

> What we have done then is to take as our practical measure of information *the logarithm of the number of possible symbol sequences*.

Hartley's early definition of information varied slightly in notation. He expressed it as *H = n log(s)*, where *H* represents information, *n* denotes the number of symbols in the message (its length), and *s* is the number of different symbols available at each selection (the size of the symbol set, e.g., 26 for the English alphabet).
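Hartley left the logarithm base open; using base 2 keeps the units in bits. A minimal sketch of his measure (the function name `hartley_information` is our own):

```python
import math

def hartley_information(n, s):
    """Hartley's H = n * log(s): n selections from an s-symbol alphabet.

    Base-2 logarithm chosen here so that H is measured in bits."""
    return n * math.log2(s)

print(hartley_information(1, 26))  # one alphabet character: ~4.7 bits
print(hartley_information(3, 2))   # three coin flips: 3.0 bits
```

Both earlier examples, the three-flip message and the single letter, fall out of the same formula.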

Thus far, we've gleaned a solid understanding:

- Information Theory quantifies the degree of surprise inherent in a communication event.
- Entropy, or uncertainty, measures the minimum number of yes/no questions required to determine the value of a symbol.
- It has been established that the binary digit, or bit, carries an entropy value of 1, serving as the base unit within this mathematical domain.
- An early function of information has been derived, assuming a set of uniformly distributed symbol values.

Up to this point, we've presumed that each value within a symbol set is equally likely, a convenient simplification. In reality, communication does not behave this neatly, as not all symbol values are created equal. *There are discernible, measurable patterns in our language and other forms of communication*. Consider a quick thought experiment: count how many times the letter “e” appears in the previous sentence. Does that character distribution look like a uniform distribution of 1/26?

## When Symbols Aren’t Random — Markov Chains

Fast forward to 1948, when Claude Shannon, the father of modern information theory, proposed in his groundbreaking paper, *A Mathematical Theory of Communication*, that patterns in communication could be leveraged to convey the same message (value) in fewer steps (bits).

*Written language reveals patterns that render the next value in a sequence more predictable based on prior values.*

In simpler terms, the preceding value reduces the uncertainty (entropy) of the following value. A prime example is the predictability of a ‘U’ following a ‘Q’ in English. If ‘Q’ is typically succeeded by ‘U’ 90% of the time, the subsequent letter's potential outcome is no longer evenly distributed; it skews toward ‘U’ at a rate of 90%.

This creates a system where the next value depends on the previous one. Russian mathematician Andrey Markov demonstrated this in his foundational work on *Markov Chains*, showing that the probability of the next value can depend only on the current one, with transition probabilities that remain *fixed* over time. He established that, over the long run, the observed frequencies of outcomes converge to those fixed probabilities.

Recall the dependency of ‘U’ on ‘Q’. If ‘U’ follows ‘Q’ 90% of the time, the ‘U’ outcome contributes only p · log₂(1/p) = 0.9 · log₂(1/0.9) ≈ 0.14 bits to the uncertainty about the next letter. For a completely randomized alphabet, one with total independence, the entropy measures H(X) = 4.7 bits; establishing this dependency collapses most of that uncertainty. Instead of halving the alphabet on the initial inquiry, the most informative question becomes, “Is the value ‘U’?” Given the high likelihood, this single question usually resolves the symbol outright, eliminating unnecessary questions and lowering the total entropy of the system.

By analyzing millions of written works, standard probabilities for each letter in the English language have been established. These probabilities apply to independent events; for dependent value outputs (Markov Chains), relative frequencies for letter sequences have also been tabulated.
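These numbers are easy to check. A sketch assuming, as above, a 90% chance of ‘U’ after ‘Q’ with the remaining 10% spread evenly over the other 25 letters (that even spread is our simplifying assumption):

```python
import math

# Hypothetical distribution for the letter following 'Q'
p_u = 0.9           # chance the next letter is 'U'
p_other = 0.1 / 25  # remaining probability, split evenly (our assumption)

# Contribution of the 'U' outcome alone: p * log2(1/p)
u_term = p_u * math.log2(1 / p_u)
print(round(u_term, 2))  # ~0.14 bits

# Full entropy of the letter following 'Q' under this distribution
h_after_q = u_term + 25 * (p_other * math.log2(1 / p_other))
print(round(h_after_q, 2))  # well below the 4.7 bits of a uniform alphabet
```

However the leftover 10% is divided, the dependency slashes the uncertainty of the post-‘Q’ letter far below the independent-alphabet figure.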

### The Information/Entropy Formula Revisited

With this insight, Shannon enhanced information theory by refining Hartley’s function. For a set of *n* random, uniform values *X*, the entropy of encoding a single symbol is the log (base 2) of *n*. Written as a sum of equal contributions, one per symbol in the set:

H(X) = Σ (1/n) · log₂(n)

However, the above formula assumes a uniform distribution, which we now recognize as incorrect given the dependencies Markov Chains reveal. To adjust for this, we replace each 1/n weight with the probability frequency of each symbol value *x* (*p(x)*):

H(X) = Σ p(x) · log₂(n)

Finally, we must replace *n* inside the logarithmic function. We are calculating the number of yes/no questions needed to arrive at *each individual character*, rather than an aggregate outcome. Returning to the alphabet example, we understand from the previous Markov chain that determining whether a character is *e* versus *z* requires differing amounts of bits. Therefore, each term of the summation needs the occurrence rate of that specific outcome—it's not 26, but rather (1 / *p(x)*):

H(X) = Σ p(x) · log₂(1/p(x))

The formula we derived aligns with the one that Claude Shannon formulated, establishing him as a pioneer in information theory. Though he slightly rearranged the variables, this equation forms the foundation of modern information theory:

H(X) = − Σ p(x) · log₂ p(x)

H(x) represents entropy, the measure of uncertainty tied to variable *X*. P(x) denotes the probability of outcome *x* within variable *X*, and the base-2 Log(1/p(x)) signifies the number of bits needed to decode the outcome *x* from variable *X*. The fundamental unit expressed here, what H(x) equates to, is defined in bits.
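The formula translates almost line-for-line into code. A minimal sketch (the name `shannon_entropy` is our own):

```python
import math

def shannon_entropy(probabilities):
    """H(X) = sum over outcomes x of p(x) * log2(1 / p(x)), in bits."""
    # Outcomes with zero probability contribute nothing to the sum.
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

print(shannon_entropy([0.5, 0.5]))     # fair coin: 1.0 bit
print(shannon_entropy([1 / 26] * 26))  # uniform alphabet: ~4.7 bits
```

Note that for a uniform distribution the formula collapses back to log₂(n), recovering Hartley's measure as a special case.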

Theoretically, Shannon’s refined formula, which employs the principles of Markov chains, should decrease the entropy of a set of symbol values, since we are moving away from a uniform distribution. To confirm this, recall the earlier example where we calculated *H(x) = 4.7* for a single alphabet character, and compare it to H(x) computed with the updated formula using real letter frequencies.

Carrying out that sum with published English letter frequencies validates our intuition, yielding a final entropy value of *H(x) = 4.18*. Since the 1950s, this general formula for entropy has fostered growth in information theory and its applications.
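The 4.18 figure can be approximated with commonly published English letter frequencies (the table below is approximate, and the exact result depends on the corpus used):

```python
import math

# Approximate relative frequencies of English letters a-z, in percent
freq_percent = [8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97,
                0.15, 0.77, 4.03, 2.41, 6.75, 7.51, 1.93, 0.10, 5.99,
                6.33, 9.06, 2.76, 0.98, 2.36, 0.15, 1.97, 0.07]
total = sum(freq_percent)
probs = [f / total for f in freq_percent]  # normalize to probabilities

# Shannon entropy: sum of p(x) * log2(1 / p(x))
h = sum(p * math.log2(1 / p) for p in probs)
print(round(h, 2))  # close to the 4.18 bits cited above, below uniform 4.7
```

The skewed distribution (note how much weight ‘e’ and ‘t’ carry) is exactly what pulls the entropy below the uniform-alphabet value.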

## Onward To Applications

The concepts established here, built from the foundational bit upward, are both recognizable and powerful in today's digital landscape. One of the most prevalent and impactful applications of information theory is lossless data compression, used for database records, word documents, image files, and video files. In this compression format, data can be perfectly reconstructed to its original form from the compressed file: by exploiting Shannon’s entropy and the statistical patterns Markov Chains describe, frequent symbols are assigned shorter codes without discarding any information.

This capability has facilitated the mass production of devices capable of storing immense amounts of data, particularly evident in the realm of music. In the early days of recorded music, a vinyl record held an uncompressed recording of a single album in its entirety. With contemporary compression technology, music files are reduced to a compact stream of bits that still captures pitch, volume, resonance, and more. The steady enhancement of hardware storage capability is another outcome driven by the mathematical principles discussed here.
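The pattern-exploiting nature of lossless compression is easy to demonstrate with Python's standard `zlib` module (an illustrative sketch, not an example from the original):

```python
import os
import zlib

# Text with heavy statistical patterns vs. the same amount of pattern-free noise
patterned = b"the quick brown fox jumps over the lazy dog " * 25
random_ish = os.urandom(len(patterned))

compressed = zlib.compress(patterned)
print(len(patterned), "->", len(compressed))     # shrinks dramatically
print(len(zlib.compress(random_ish)))            # barely shrinks, if at all
print(zlib.decompress(compressed) == patterned)  # True: perfectly lossless
```

Low-entropy input compresses well; high-entropy (near-random) input does not, just as Shannon's formula predicts.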

Without Shannon’s contributions to communication, the digital age would be merely a distant dream. Just like many other underappreciated mathematical principles, information theory plays a crucial role in our daily lives.