Wednesday, April 13, 2011

Information

I just finished reading James Gleick's new book, The Information (link).

The book features Claude Shannon prominently. It's an unusual book, with long discussions of (among other topics) African drum language, Charles Babbage, Shannon's information theory, Turing, quantum computing, and Wikipedia. It's broader and less technical than I'd hoped, but very enjoyable and highly recommended. Gleick is a talented writer (I loved his book about Richard Feynman, and his biography of Newton is also excellent).

Gleick introduces the famous equation from information theory for entropy (the sum runs over the possible messages i, each with probability pᵢ)

H = −Σ pᵢ log₂ pᵢ

and quotes Shannon:

"The resulting units may be called binary digits, or more briefly, bits."


Gleick:

"As the smallest possible quantity of information, a bit represents the amount of uncertainty that exists in the flipping of a coin. The coin toss makes a choice between two possibilities of equal likelihood: in this case p1 and p2 each equal 1/2; the base 2 logarithm of 1/2 is -1; so H = 1 bit. A single character chosen randomly from an alphabet of 32 conveys more information: 5 bits, to be exact, because there are 32 possible messages and the logarithm of 32 is 5."


He goes on to describe a test Shannon performed:

".. in point of fact his test subject was his wife, Betty. He pulled a book from the shelf (it was a Raymond Chandler detective novel, Pickup on Noon Street), put his finger on a short passage at random, and asked Betty to start guessing the letter, then the next letter, then the next. The more text she saw, of course, the better her chances of guessing right. After 'A SMALL OBLONG READING LAMP ON THE' she got the next letter wrong. But once she knew it was D, she had no trouble guessing the next three letters. .. Quantifying predictability and redundancy in this way is a backward way of measuring information content."


Later in the book Gleick describes measuring complexity or information content by asking how far we can compress a message. For example, the message:

DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD

might be given as

Dx50

So it has very little information.
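A crude run-length encoder along these lines takes only a few lines of Python (the Dx50 shorthand is Gleick's illustration; the sketch below just mimics it):

import itertools

def run_length(message):
    # encode each run of a repeated character as, e.g., Dx50
    return "".join(f"{ch}x{len(list(group))}"
                   for ch, group in itertools.groupby(message))

print(run_length("D" * 50))   # Dx50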

But this example leads to something counterintuitive: the least compressible message is a completely random string of the allowed characters, and by this argument such a string has the highest possible information content. In a way that makes sense, I suppose. If we happen to possess just the right one-time pad, such a message could be very informative. But I still find it pretty confusing.
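A general-purpose compressor makes the point concrete. In this quick Python check (the length 1000 is arbitrary), the repetitive message collapses to almost nothing, while the random one stays near its entropy limit:

import random
import string
import zlib

n = 1000
repetitive = ("D" * n).encode()
rand_msg = "".join(random.choices(string.ascii_uppercase, k=n)).encode()

print(len(zlib.compress(repetitive)))   # a dozen or so bytes
print(len(zlib.compress(rand_msg)))     # hundreds of bytes: roughly
                                        # log2(26) bits per character,
                                        # with nothing more to squeeze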