• If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

View

# Measuring Information: Shannon versus Popper

last edited by 12 years, 10 months ago

# Measuring Information: Shannon versus Popper

Extracted and adapted for the Web from "Value and Belief", a PhD thesis accepted by the University of Bristol, 2003.

Topics: information theory, inductive logic, epistemology, logical probability

### Abstract

Philosophers have a notion of the epistemic "strength" or "boldness" of a proposition, or rather its information content, and perhaps have an idea from Popper or Wittgenstein that it can be measured using probability. This short note explains the advantage of the Shannon information measure used in information science, in terms of logical consistency and with a minimum of formalism.

The issue of how to quantify information has come up frequently in the literature on inductive logic (e.g. Hempel & Oppenheim (1948), Carnap and Bar-Hillel (1952)). What is agreed is that information content is a quantity attaching to propositions. When you receive the message "Supper's ready", we say that strictly the information content attaches not to that utterance but to the proposition that you have received that utterance. As such, information content can be represented as a mathematical function over sentences of a logical language, much like probability or utility functions. The common theme between different proposed measures is the principle, found in Popper and in Wittgenstein, that a proposition is informative according to how many possible situations it excludes. Popper and others have insisted that the information content of H is measured by 1-P(H) where P(H) is the logical probability of H . This means that the information content is just the ratio of possibilities excluded by H to all logical possibilities. This measure meets a basic requirement of a measure of information: namely that if B is a proposition which has a non-negligible probability given A, then A&B is more informative than A, because it is true in fewer situations. AvB, on the other hand, has less content.

However, the question of how to measure information has been decisively solved by Shannon (Shannon and Weaver (1949)) in a paper that is crucial to what is now called information technology. To show what is at stake, I will explain how Shannon derived his measure and then show why Popper's measure is unacceptable.

Shannon based his measure of information on requirements of logical consistency. Indeed his work is very similar to the Cox proof of Bayesian probability. Like Cox, Shannon set out consistency requirements on a type of formal system as mathematical constraints on a function, then showed that the functions satisfying these constraints differ only trivially from each other, and hence that there is really only one consistent measure.

To illustrate what is meant by a consistency constraint in this context, imagine that you receive two successive messages through the same channel each consisting of one letter of the alphabet. Imagine separately that you receive a single message consisting of two letters of the alphabet. It should be clear that these are different descriptions of the same situation, hence any truly propositional measure should give them the same value. Put another way, measures of information content should give the same value to "You receive 'A' followed by 'B'" as to "You receive 'AB'."

At the moment, we are concerned with measuring the information content of the message 'AB', not in the sense of how much is tells us about a particular issue, but in the sense of how much information would be required to unambiguously transmit the message down a hypothetical communication channel. This intrinsic complexity or information content is referred to in the theory as its self-information, whereas the extent to which a message is informative about whether or not H is called its cross-information on H.

With Popper, let us take 1-P(H) to measure information content, where each letter is taken as equally probable. In the first situation, the information content of the first message, whichever letter it turns out to be, is 25/26. Since there are two individual messages, the total information received is 50/26. In the second situation, the total number of possible messages (two-letter sequences) is 676. Whatever message you receive will logically exclude 675 of these messages, so the total information received is 675/676. Thus we have reached two entirely different values depending on how a particular message was described, and this serves to illustrate the problem with using a non-Shannonian measure.

Shannon's measure itself uses logarithms. The information content of a particular message A, called its surprisal, is -logP(A). It does not matter which base we use for the logarithm so long as we are consistent: this is the sense in which there are different mathematically allowable measures, but they differ so trivially that we can consider them to be one measure. When base two is used, the resulting unit of information is called a 'bit' (short for "binary digit"), a bit being the maximum amount of information that can be conveyed by the answer to a yes/no question.

In the above example, each one-letter message has a surprisal of -log21/26 = 4.7 bits, and a two-letter message has a surprisal of -log21/676 = 9.4 bits. Hence we see that the additivity requirement (that the content of two one-letter messages is that same as that of the one two-letter message) is satisfied.

Like probability and utility, information content is a propositional measure which obeys the expectation principle. If we do not know what a particular message is, but that it is the answer to a question whose possible answers are A1, A2, A3,..., An then the information content is the expectation of the information content over all possible messages, in other words the sum of -P(Ai)logP(Ai).

An information source or communications channel can be thought of as a question with one of a (possibly very large) set of possible answers.

This defines a crucial term in information theory: entropy. Calculating the expected information content for the set of possible answers to an inquiry gives us the entropy for that inquiry, which can informally be regarded as a measure of uncertainty attached to it. If a subject is irrevocably certain about an issue, in that one answer is given probability one while all others have probability zero, then the entropy is zero. When we have a finite set of mutually exclusive hypotheses with no information to discriminate between them, then entropy is at its maximum when all are given the same probability./p>

### Information versus Probability

Since information content measures are simply descriptions of probability functions, it may seem that we do not gain anything by talking of information content that can not be expressed in terms of probability. However, information theory gives us a perspective on inferential tasks that we can miss if we talk entirely in terms of probability. To illustrate this I will consider a standard example. In a particular city, it rains three days out of four. The local weatherman makes a forecast for rain half the time, no rain the rest of the time. His predictions are such that he correctly predicts rain half the time, correctly predicts no rain a quarter of the time and incorrectly predicts no rain the remaining quarter of the time. This can be expressed in the following table of probabilities.

Rain No Rain 50% 0% 25% 25%

Here is the problem. Someone who predicts rain for every day will be right 75% of the time. Someone who accepts the weatherman's forecast also has a probability of 75% of being right on any given day. So why is the weatherman any use? The answer is that we do not simply have to accept the weatherman's forecast but use it as an information source. In other words, rather than taking the "No Rain" forecast uncritically, we can conditionalise on it to get a new probability of rain (in this case 50%).

We can evaluate how informative the forecaster is about the weather by measuring the reduction in entropy: a perfectly reliable forecaster would reduce the entropy to zero. The entropy resulting from consulting the weather forecaster is zero if the forecast is for rain and one bit if the forecast is no rain. Since these are equally likely, the overall entropy is half a bit. If we do not consult the weatherman, then given just the 75% chance of rain on any one day, the entropy is .811 . So the benefit of this forecaster is a .311 bit reduction in entropy.

By measuring the information content of the predictions in this way, we have a basis for comparison of weather forecasters (or other predictors) which is more meaningful than merely taking the probability of them being correct.

### References

Carnap, R. and Y. Bar-Hillel., 1953. "An outline of a theory of semantic information." British Journal for the Philosophy of Science, 4, 147-157.

Hempel, C. G. and P, Oppenheim., 1948. "Studies in the logic of explanation." Philosophy of Science, 15: 135-175.

Shannon, C. E. and W. Weaver, 1949. The mathematical theory of communication. Urbana, Illinois: University of Illinois Press.