Thursday, 10 January 2013

Intro to Information Theory

Information is everywhere, but that's obvious and everybody is saying it.  In a certain sense, at the very heart of experience, there is nothing other than information.  But that's a story for another day.  So what do we mean by information?  Certainly, there is a sense of relationship, of correlation, of geometry.  But can we make it precise?

We can.  Information Theory is extremely well developed and mathematically rigorous.  Here, we will briefly explore the basic intuitions that lie behind it.  In the future, we will explore its many consequences.

Everything starts with probability theory, more specifically with its main object, the random variable.  A random variable is a variable which, lo and behold, takes on a random value from some predefined set of values.  The number of hours a day a person spends in traffic is a random variable, with values taken from the real numbers between 0 and 24.  A more abstract random variable is the configuration of human civilization at a particular point in space and time.  If we 'measure' this configuration, then we find that the random variable takes on a value of, say, the exact current configuration of our society today, measured by some macroscopic qualities (population density, lifespan, distribution of wealth, velocity of goods, etc.).  But this is only one of the many possible values that the random variable 'configuration of human civilization' could take on.  Indeed, many an author has made their fortune exploring other possible values of this wretched-though-magnificent random variable.

So the neat thing about a random variable is that it comes with a probability distribution.  That is, if we measure the random variable repeatedly, some values may be more likely to appear than others, and so we can talk about the probability of the random variable taking on any one of its many possible values.  We call this set of probabilities a probability distribution over the random variable.
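
If you like seeing things in code, here is a minimal sketch (in Python, with a completely made-up set of values and weights) of what 'measuring a random variable repeatedly' amounts to: sample it many times and estimate its probability distribution from the observed frequencies.

```python
import random
from collections import Counter

# A hypothetical random variable: hours per day spent in traffic,
# crudely modelled as a choice among a few values with fixed weights.
# (The values and weights are invented purely for illustration.)
values  = [0.5, 1.0, 2.0, 4.0]
weights = [0.4, 0.3, 0.2, 0.1]

# 'Measure' the variable many times and estimate its probability
# distribution from the observed frequencies.
samples = random.choices(values, weights=weights, k=100_000)
counts = Counter(samples)
distribution = {v: counts[v] / len(samples) for v in values}
print(distribution)  # should come out close to the weights we put in
```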

Consider a coin toss.  The result of a toss is a random variable, whose possible values are 'heads' and 'tails'.  If it is a fair coin, then there is an equal probability for both possible outcomes (its probability distribution is uniform).  On any given toss, you literally have no idea what is going to happen.  It's either heads or tails, but you can't do any better than that.  So naturally, when you flip the coin and find out that it has landed heads or tails, you will be pleasantly surprised, because you had no inkling beforehand as to which it would be.  We might say, "the probability distribution of a fair coin exhibits maximal surprise."

Consider now a coin which is not fair, but which, by a clever trick of its design, comes up heads 90% of the time when flipped.  If you knew this, then most of the time you wouldn't be surprised at all when you find out the outcome of the flip.  So this probability distribution has considerably less surprise, because it's almost always heads.
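
We can even put a number on this surprise.  The standard move is to say that an outcome with probability p carries -log2(p) bits of surprise.  A quick sketch of what that gives for the two coins:

```python
from math import log2

def surprise(p):
    """Surprise (in bits) of observing an outcome that has probability p."""
    return -log2(p)

# Fair coin: every toss is equally surprising.
print(surprise(0.5))   # 1.0 bit for heads, and 1.0 bit for tails

# Biased coin that lands heads 90% of the time:
print(surprise(0.9))   # ~0.15 bits -- heads barely surprises you
print(surprise(0.1))   # ~3.32 bits -- tails, when it happens, is a shock
```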


Now this surprise-quantity more commonly goes by the name of entropy.  That is, the entropy of a probability distribution over some random variable is a measure of how little one knows about the outcome of any measurement on the variable.  So it's like an uncertainty.  But it is simultaneously a measure of how much knowledge one can gain by measuring the variable, given how little one knew beforehand.  And so in an interesting sense, the entropy of a variable, its surprise content, is a measure of its information content.  A probability distribution which is uniform is most uncertain, having maximal entropy, and the most information to gain from measurement - the most surprise.
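
For the curious, here is a small sketch of that quantity, the Shannon entropy: the average surprise over a distribution, -sum of p*log2(p), applied to the two coins from above.

```python
from math import log2

def entropy(probs):
    """Shannon entropy (in bits): the average surprise of a distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit, maximal for two outcomes
print(entropy([0.9, 0.1]))   # biased coin: ~0.47 bits, much less uncertain
```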

But that's not the whole story.  When we talk about information, we almost always refer to information between two random variables.  We are not so much interested in what someone says as we are in how much of what is said is heard and understood properly by the receiving party.  Suppose there was a mysterious light on the wall that flashed colours when you flipped your coin.  You soon notice that every time it flashes blue, you flip a heads.  It's flashing blue now.  Will you be surprised when you flip heads?  Not at all.  The flashing light, by being correlated with the outcome of the toss, reduces the entropy of your random variable.  It used to be a surprise, but now that you have the light, it's not.  So there is mutual information between the light and the coin.

Mutual information, then, is defined in precisely this way: as the reduction in entropy (uncertainty) of one variable, given the value of another.  We can theoretically measure the information content between any two variables, or more, by sampling their respective probability distributions and calculating how much each variable reduces the entropy of the others.  In the ideal case, where there is maximal information, the entropy of a variable is reduced completely when given the other variable.  So the maximum mutual information between any two variables is the entropy of the variables themselves!  (Actually, it is the lesser of the entropies of the two variables.)  That is to say, something which has a low entropy to begin with (i.e. you have some degree of certainty as to its outcome) is necessarily restricted in how much information it can share with another variable, compared to something which has a higher entropy.
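
Here is a rough sketch of that calculation for the light and the coin, using the standard identity that the mutual information equals H(light) + H(coin) minus the entropy of their joint distribution (which is just another way of writing the 'reduction in entropy' definition).  The joint probabilities are made up for illustration.

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), from a joint distribution
    given as a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return entropy(px.values()) + entropy(py.values()) - entropy(joint.values())

# The light and the fair coin, perfectly correlated:
perfect = {('blue', 'heads'): 0.5, ('red', 'tails'): 0.5}
print(mutual_information(perfect))      # 1.0 bit: all of the coin's entropy

# A light that flashes at random, independent of the coin:
independent = {(l, c): 0.25 for l in ('blue', 'red') for c in ('heads', 'tails')}
print(mutual_information(independent))  # 0.0 bits: no information shared
```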

In most cases, this result, that the maximal mutual information between two random variables is the lesser of the entropies of the variables, has a lot to do with the number of possible outcomes of a variable in the first place.  A coin toss, regardless of its probability distribution, has only two possible outcomes, and so is obviously limited in the amount of information it can share.  But similarly in large systems, with many possible outcomes, a distribution which maintains a low entropy necessarily restricts the information sharing capacity of the system.
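
As a made-up illustration of that cap: pair an eight-sided die with a coin that merely reports whether the roll came up even or odd.  The coin tracks the die as closely as it possibly can, yet the mutual information cannot exceed the coin's single bit of entropy.

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# An eight-sided die, and a coin that simply reports whether the roll
# was even or odd -- the coin follows the die as closely as it can.
joint = {(face, 'heads' if face % 2 == 0 else 'tails'): 1 / 8
         for face in range(1, 9)}

h_die   = entropy([1 / 8] * 8)        # 3.0 bits
h_coin  = entropy([1 / 2] * 2)        # 1.0 bit
h_joint = entropy(joint.values())     # 3.0 bits
print(h_die + h_coin - h_joint)       # 1.0 bit: capped by the coin's entropy
```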

So there is an interesting trade-off.  Something which is well organized, about which there is little uncertainty, which is accurate and precise and efficient, i.e. something which has low entropy, is fundamentally restricted in the amount of information it can share with another variable, by virtue of the fact that it is so organized and predictable!  On the other hand, something which is initially more uncertain, which may be less reliable and effective but certainly more opportunistic and versatile, i.e. something which has higher entropy, is able to share far more information with another variable, by virtue of the fact that it can be found in its other states with greater probability.  That sort of trade-off permeates everything.

I think about these ideas most often in the context of protein signalling in the cell, where proteins take on particular shapes and are sometimes well ordered and sometimes totally disordered, and one way or another these structural differences relate to their functionality and give rise to the marvels of our existence.  But on the other hand, consider our political organization, and the chains of bureaucracy, clamping down on entropy everywhere they can with enormous energy expenditure.  The reduction in entropy is catastrophic in the face of how much information our governments need to handle if they are to do their jobs.  We need far more disorganization in our society, because only with a higher internal entropy can we properly encode the enormous wealth of information about ourselves and our environment that is necessary to keep track of in the maintenance and evolution of the (supposedly) most complex and intelligent species on the planet.  Libertarians have been saying this for a while, but I think I just grounded it in physics.

Information theory is cool.