
Shannon:

The beginning of information theory

1. Context
 
A word about Shannon


Claude Shannon (1916-2001) was an American mathematician and electrical engineer. His education started at the University of Michigan; then, in 1938, he wrote his master’s thesis at MIT, ”A Symbolic Analysis of Relay and Switching Circuits”. This paper laid the foundation of digital circuit design theory. Indeed, he demonstrated that electrical implementations of Boolean algebra could construct any logical numerical relationship.

      In 1941, this brilliant engineer joined Bell Labs. During World War II, he worked on various problems including communication, cryptography and fire control, together with many renowned scientists. His fame chiefly comes from a landmark paper, A Mathematical Theory of Communication [106], published in 1948.


Introduction


Claude Shannon is well known as the father of the Digital Age. Earlier pioneers such as Nyquist, Hartley and Markov had explored the communication field. Their influence was significant, but their work was limited and focused on their own particular applications. In 1948, Shannon came up with a unifying vision that completely changed communication in [106]. He provided the tools that gave rise to a new field: information theory. It is one of the few scientific fields with an identifiable beginning in a single paper.

 

      Previously, analog signals were used to communicate. The message to be sent was translated into a varying voltage along a wire. This varying signal could be measured at the other end and interpreted back into words. This principle is sufficient for short distances, but an analog electrical signal deteriorates along the wire due to noise. As the length increases, the message becomes unintelligible.

 

      How has this problem been overcome?

The answer lies in Shannon’s paper, which contains four major breakthroughs. Understanding these concepts is essential to grasping information theory.


2. A Mathematical Theory of Communication [106]


The four major breakthroughs to highlight are the general communication system, the probabilistic framework, the entropy and the capacity of a channel.

A general communication system

 

Shannon was the first to model a communication system with the simple and very useful diagram shown in Figure 14.


In this model, the communication process may be understood as a source communicating with a destination. The source provides the message to a transmitter. This transmitter communicates with the receiver through a channel. Finally, the receiver delivers the message to the destination. Shannon introduces noise in the channel between the transmitter and the receiver, which alters the original message.

      That noise has an obvious impact which has to be taken into account: the received signal is not necessarily the same as the one sent by the transmitter. Before Shannon, scientists missed this fundamental element or did not know how to integrate it into their studies. Ever since, all communication systems have been based on this model.


Fig. 14 Schematic diagram of a general communication system (in [106] page 2)

A probabilistic framework


The fundamental task in communication is the transmission of a message from one point to another. This message obviously carries meaning. However, as discussed by Hartley in [48] and confirmed by Shannon, the significant aspect is that the actual message is selected from a set of possible messages. This setting naturally calls for probabilities (1) and thus for mathematical reasoning.

      Instead of studying a source as a generator of a continuous message such as a text or dashes and dots, the source is studied as generating the message symbol by symbol. These symbols are chosen according to certain probabilities. Moreover, the source is represented by a stochastic process, and more precisely by a Markov process (see here).

      At this stage, Shannon seeks to quantify how much information is produced by such a process, at what rate the information is produced, and how much information is required to encode a particular symbol.
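As an illustration, here is a minimal sketch of such a source (not taken from Shannon’s paper: the two symbols and the transition probabilities are invented for the example). Each new symbol is drawn according to probabilities that depend only on the previous symbol.

import random

# Toy two-symbol Markov source: the probability of the next symbol depends
# only on the current one (transition values invented for illustration).
TRANSITIONS = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.4, "B": 0.6},
}

def generate(n, state="A"):
    """Generate n symbols from the Markov source, starting in `state`."""
    symbols = []
    for _ in range(n):
        probs = TRANSITIONS[state]
        state = random.choices(list(probs), weights=list(probs.values()))[0]
        symbols.append(state)
    return "".join(symbols)

print(generate(30))  # e.g. 'AAAAAAABBBAAAABBAAAAAAAAABBAAA'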


The entropy


In order to answer the important questions above, the definition of information must be made precise. Therefore, Shannon seeks a universal quantity which measures the uncertainty of any event occurring, e.g. in communication. He is thus looking for a measure, say H, which exhibits the following characteristics [106]:

        - H should be continuous in the probabilities of the possible events or outcomes (2).

        - If all possibilities are equally likely, then H should be a monotonic increasing function of the number of possibilities.

        - When decomposing a choice into successive sub-choices (treelike structure, see Figure 15), the original H must be the weighted sum of the values of H of the sub-choices.

Figure 15 gives an example.

Fig. 15 The third requirement of the measure H defined by Shannon: H is the weighted sum of the successive values of H. On the left, three possibilities p1 = 1/2, p2 = 1/3 and p3 = 1/6. On the right, the first choice is between two possibilities each with probability 1/2, and the second choice leads to two possibilities with probabilities respectively equal to 2/3 and 1/3 [106]

The only H satisfying these requirements takes a logarithmic form, given by

H = -\sum_{i=1}^{n} p_i \log p_i

where p_i is the probability of the i-th outcome among the n possible outcomes, and the base of the logarithm fixes the unit (base 2 for bits). Inherited from thermodynamics, H is called the entropy. Incidentally, the word ”entropy” was not coined by Shannon, as explained on that page.
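For instance, applying this formula to the probabilities of Figure 15 verifies the third requirement (a worked check, with logarithms taken to base 2):

H\left(\tfrac{1}{2},\tfrac{1}{3},\tfrac{1}{6}\right) = \tfrac{1}{2}\log_2 2 + \tfrac{1}{3}\log_2 3 + \tfrac{1}{6}\log_2 6 \approx 1.459 \text{ bits}

H\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{1}{2}\,H\left(\tfrac{2}{3},\tfrac{1}{3}\right) = 1 + \tfrac{1}{2}\times 0.918 \approx 1.459 \text{ bits}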


      In order to understand the sum more intuitively, the word entropy can be read as the average amount of information provided by messages of the same set. This average amount of information is the sum over all messages, where each message of probability p carries information log(1/p) and is weighted by p. Figure 16 illustrates the entropy formula.

Fig. 16 Illustration of the entropy formula (5)

      This concept plays a central role in information theory as a measure of information, choice and uncertainty. In order to understand it in a communication process, Shannon gives a simple example.

”If a source can produce only one particular message its entropy is zero and no channel is required. For example, the successive digits of π produces a definite sequence with no chance element. ” [106]

So if there is only one possible message (n = 1), its probability is equal to 1. Substituting into the previous equation, its entropy is zero and no information is given to the receiver. This is intuitive: since there is only one message, nothing unexpected is transmitted, thus no information is sent.
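A minimal sketch of this computation (illustrative only), evaluated on the probabilities of Figure 15 and on the single-message case:

import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), ignoring zero terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/3, 1/6]))  # ~1.459 bits (the example of Figure 15)
print(entropy([1.0]))            # 0.0 bits: a single certain message carries no information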

 

      At this stage, the concept is mathematically stated, and the transmission from a source to the receiver can be studied. Taking the noise into account, the rate of transmission R is obtained by subtracting from the entropy of the source H(x) the entropy of the source given the output, Hy(x):

R = H(x) - H_y(x)

And so the capacity of a noisy channel is the maximum possible rate of transmission:

C = \max\big( H(x) - H_y(x) \big)

This formula can be understood by ”translating” each expression. H(x) is the average amount of information generated by the source. This value has to be maximized in order to provide as much information as possible at the input. Hy(x), also written H(x|y), is the conditional entropy of x given y, i.e. the average uncertainty about the input that remains once the output is known. With a noiseless channel, the input is immediately known from the output and this conditional entropy is zero. Since this is the ideal case, in reality this value has to be minimized. Finally, the difference between H(x) and H(x|y) is maximized (3). It corresponds to the average amount of information passing through the channel, and its maximum is the channel capacity.
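To make this concrete, here is a small sketch (not from the paper: the binary channel, its 10% error probability and the equiprobable inputs are assumptions chosen for the example). It computes H(x), H(x|y) and the resulting rate R = H(x) - H(x|y) from the joint distribution of input and output.

import math
from collections import defaultdict

def h(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(x, y) of a binary channel: equiprobable inputs,
# each bit flipped with probability e (values chosen for illustration).
e = 0.1
joint = {(x, y): 0.5 * (e if x != y else 1 - e) for x in (0, 1) for y in (0, 1)}

p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

H_x = h(p_x.values())
H_x_given_y = h(joint.values()) - h(p_y.values())  # H(x|y) = H(x,y) - H(y)

print(H_x, H_x_given_y, H_x - H_x_given_y)
# ~1.000, ~0.469 and ~0.531 bits per symbol: the noisier the channel,
# the larger H(x|y) and the lower the achievable rate.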

      This notion of conditional entropy is fundamental. A simpler example studied by Shannon [106] is a sequence of symbols such as English text. He noticed that the conditional entropy of a letter given the previous one is lower than the individual entropy. Indeed, if a word starts with an ”r”, it is more likely that the next letter will be a vowel.


The capacity of a channel


Shannon extends his mathematical formulas to continuous signals and continuous channels (4). He wants to provide a simple and usable formula for the channel capacity, and considers the case where the noise is white thermal noise with average power N. He demonstrates that the capacity of a channel of band W affected by white thermal noise of power N, when the average power of the transmitter is limited to P, is given by

C = W \log_2 \frac{P + N}{N}

The signal-to-noise ratio P/N can be highlighted in the equivalent expression

C = W \log_2 \left( 1 + \frac{P}{N} \right)

With the logarithm taken to base 2, this capacity is expressed directly in binary digits (bits) per second.

To summarize, a communication consists in sending symbols through the channel. This channel can carry a limited amount of information every second, which is called its capacity. The main implication of these channel capacity formulas is that at any rate below this value C, the received message can be retrieved entirely with an arbitrarily low probability of error. This is achieved by adding redundancy (5). The drawback is that the number of bits required for encoding increases and the communication becomes slower. Conversely, it is impossible to obtain error-free communication above the limit given by C. Further explanations of the impacts are presented in the section The Information Age.
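As a rough numerical illustration (the bandwidth and signal-to-noise ratio below are invented for the example, not taken from the text):

import math

def shannon_capacity(bandwidth_hz, snr_linear):
    """Channel capacity C = W * log2(1 + P/N), in bits per second."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# Hypothetical channel: 3 kHz of bandwidth and a 30 dB signal-to-noise ratio.
snr = 10 ** (30 / 10)               # 30 dB -> a factor of 1000
print(shannon_capacity(3000, snr))  # ~29,900 bits per second at most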


3. Consequences
 
Moving away from physics and implementation issues


All these concepts seem mathematical, while sending a message from one point to another is physical. Shannon’s theory is indeed mathematical, that is, decoupled from the implementation and physical considerations of the problem. The following example helps to understand those concepts intuitively.

      Before Shannon, the message was sent as a wave along the channel, and the amplitude of this wave varied over a large range of values in order to represent the message. Because of the distance travelled, the wave suffers from noise and attenuation, as illustrated in Figure 17.


Fig. 17 The transmission of a message before Shannon [52]

Even if an amplifier is added, both the signal and the noise are amplified, as in Figure 18.

Fig. 18 The transmission of a message before Shannon with an amplifier added across the channel [52]

The solution is based on the conversion of the message into bits, as shown in Figure 19. The transmitted wave is therefore no longer a signal with continuously varying amplitude but a wave with two distinct levels. Along the channel, regenerative amplifiers are added. Since 0 and 1 are each represented by one of the two levels, they are easily identifiable. The message is then regenerated and amplified. When it arrives at the other point, the receiver can perform the inverse operation and decode it from bits back into understandable words.

Fig. 19 The transmission of a message thanks to Shannon, with regenerative amplifiers [52]
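A toy sketch of this regeneration principle (purely illustrative: the voltage levels, the noise amplitude and the decision threshold are arbitrary choices):

import random

LOW, HIGH, THRESHOLD = 0.0, 1.0, 0.5

def transmit(bits, noise=0.3):
    """Send the bits as two voltage levels and add bounded random noise."""
    return [(HIGH if b else LOW) + random.uniform(-noise, noise) for b in bits]

def regenerate(samples):
    """A regenerative repeater: decide each bit and re-emit a clean level."""
    return [1 if s > THRESHOLD else 0 for s in samples]

message = [1, 0, 1, 1, 0, 0, 1]
received = regenerate(transmit(message))
print(received == message)  # True as long as the noise stays below the decision margin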

Information concept in practice

The idea of bits comes from Tukey. But, in a sense, this digitization stems from one of Shannon’s concepts: the quantification of information. According to the theory, the context strongly influences the amount of information.

For instance, finding a man called Paul Dupont based only on his first name is different in France and in China. Indeed, the name Paul carries less information in France than in China. The context, in this example, is related to the probability of the name in a given country: finding a Paul is much more likely in France than in China.


      If p is the probability of the message, its information is related to 1/p. But the brilliant idea of Shannon is to define the information as the number of bits required to encode it: log2(1/p). It implies that the higher the probability, the lower the number of bits. So, Paul requires fewer bits to be encoded in France than in China. The choice of the logarithm is mathematically explained by Shannon at the beginning of point b.
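A small illustrative computation (the two probabilities are made up to mimic the Paul example):

import math

def information_bits(p):
    """Self-information of a message of probability p, in bits: log2(1/p)."""
    return math.log2(1 / p)

# Hypothetical probabilities of meeting someone called Paul in each country.
print(information_bits(0.01))      # France: ~6.6 bits
print(information_bits(0.000001))  # China: ~19.9 bits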


      This notion of information in a message led Shannon to define the entropy. As explained mathematically at point c, it is equivalent to the average amount of information provided by messages from the set of possible messages. The entropy may seem very theoretical, but in fact it has an important impact, for example in data compression: it provides the theoretical limit on the average number of bits needed to code a message.
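For example, the following sketch (with a made-up four-symbol source) compares the entropy bound with the average length of a simple prefix code:

import math

# A made-up source with four symbols and their probabilities,
# and a prefix code matched to those probabilities.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

entropy = -sum(p * math.log2(p) for p in probs.values())
avg_length = sum(probs[s] * len(code[s]) for s in probs)

print(entropy, avg_length)  # 1.75 bits and 1.75 bits/symbol: the code reaches the entropy limit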

      Then Shannon goes even further by providing the number of bits per second that can be transmitted through the channel. This channel capacity is obtained without considering the physical implementation, such as the wires or the distance between the transmitter and the receiver.


4. Conclusion

Shannon’s 1948 paper addresses the problem of long-distance communication by providing an abstract view of a communication system and especially of a unit of information. The idea is simple: the message is converted into a code of 0s and 1s, which is then translated into low and high voltage levels. The signal still suffers from noise. However, it now has only two states that are easy to distinguish, so the message can be recovered easily.

      The mathematical tools developed in his paper are powerful and are still applied today in many fields. Incidentally, this recalls the work of Hodgkin and Huxley concerning the nervous system. The response of the neuron follows an all-or-none model, as explained on that page. The parallel is obvious: both models have two distinct levels, 0 or 1.


      To conclude, the ”Mathematical Theory of Communication” marks the break between the Analog Age and the Digital Age. After 1948, engineers in the communication field, and in other fields as well, had a new way of thinking and new tools to solve problems. Indeed, the transition from an analog to a digital approach appears in different domains. Turing, in fact, based his test on the exploitation of ”digital computers” (see this page). That led computer science to develop along this new approach.


Footnotes
  1. Since an event (the actual message translated into symbols) is chosen from a set of possible events, probability formulas can be applied, for instance by assigning a weight to each event.

  2. In other words, H should be a continuous function which characterizes a set of events with their given probabilities.

  3. The first term is maximized, the second is minimized and so the difference between them is maximized.  

  4. Continuous channels imply that the probabilities are expressed as continuous distributions, so the discrete sums are replaced by integrals.

  5. Redundancy means repeating the symbol’s bits in order to increase the probability of retrieving the message. For example, if the letter D is represented as 101, applying redundancy (repeating each bit) transforms it into 110011.
