Thursday, November 07, 2019

2.1 Playing Defense / 2.1.1 Entropy

2.1.1 Entropy

The single most important concept in cryptography -- indeed, in all of information theory, and inarguably one of the most important in all of computer science -- is Claude Shannon's entropy.  The entropy is, roughly, the amount of disorder in a sequence of data, or the amount of information that must be transmitted to reconstruct the data.  When the entropy of a cleartext message is high, it is hard to predict the next symbol (bit or byte or letter or floating-point number); purely random data has very high entropy.  When it is low, the cleartext has a skewed probability distribution such that, e.g., the letter 'e' or the number 0 is more common than other values.

Of course, because encryption tries to hide information, encrypted data looks as random as possible: it has very high entropy.  Thus, an easy way to automate a test to see if we have successfully decrypted a message is to calculate the entropy of our provisionally decrypted plaintext; low entropy indicates a high probability that we have (at least partially) succeeded.

The entropy is defined as
$H_{2}(X)=-\sum_{i=1}^{n} \frac{\operatorname{count}_{i}}{N}\log_{2}\left(\frac{\operatorname{count}_{i}}{N}\right)$
where $\operatorname{count}_i$ is the number of times that the $i$th distinct value appears in our data, $n$ is the number of different values that show up, and $N$ is the total number of symbols in the sequence, so that $\operatorname{count}_i/N$ is the observed probability of the $i$th value.
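As a quick worked example, take the same ten-character test string used in the code below, "1223334444", which contains one 1, two 2s, three 3s, and four 4s:
$H_{2}=-\left(\tfrac{1}{10}\log_{2}\tfrac{1}{10}+\tfrac{2}{10}\log_{2}\tfrac{2}{10}+\tfrac{3}{10}\log_{2}\tfrac{3}{10}+\tfrac{4}{10}\log_{2}\tfrac{4}{10}\right)\approx 1.846\ \text{bits per symbol.}$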

From the fantastic website RosettaCode.org, we have a simple example in Python of calculating the entropy in bits per symbol (bits per byte, when the input is a byte string):


import math
from collections import Counter

def entropy(s):
    p, lns = Counter(s), float(len(s))
    return -sum(count/lns * math.log(count/lns, 2) for count in p.values())

entropy("1223334444")
(This function can calculate the entropy of sequences of other types besides character strings as well, since Counter only requires that the elements be hashable.)
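To make the earlier point about testing a provisional decryption concrete, here is a minimal sketch reusing the entropy function above; the looks_decrypted helper and its 6.0 bits-per-byte cutoff are illustrative choices, not standard values:

import os

def looks_decrypted(candidate, threshold=6.0):
    # English text typically comes in well under 5 bits per byte, while
    # encrypted or random data sits very close to 8 bits per byte; the
    # 6.0 cutoff is an illustrative choice, not a standard value.
    return entropy(candidate) < threshold

print(entropy(b"To be, or not to be, that is the question."))  # low: about 3.6 bits per symbol
print(entropy(os.urandom(4096)))                                # high: very close to 8 bits per byte
print(looks_decrypted(os.urandom(4096)))                        # False -- still looks encrypted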

Note that this definition of entropy does not take into account any long-range structure; it is based purely on the overall frequencies with which the individual symbols appear.  Thus, to this function, the strings 0101...01 and 000...01...111 have a bit-level entropy just as high as a string of the same length made of randomly chosen 0s and 1s in equal proportion, even though by applying a little intelligence we can easily reconstruct the first two exactly but not the random one.  Many lossless data compression programs, by contrast, do maintain state across a _sequence_ of data in order to predict the next symbol.  For example, the common Unix utilities gzip and compress use Lempel-Ziv algorithms (LZ77-based DEFLATE and LZW, respectively), which build a dictionary of previously seen sequences as they go, so that repetitive data and long runs are compressed effectively; this works especially well on sequences such as 0101...01 or 000...01...111.
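A quick experiment makes the distinction concrete.  The following sketch reuses the entropy function above together with Python's standard zlib module (which implements DEFLATE); the three test strings are invented for illustration:

import random
import zlib

# Three 10,000-symbol strings of 0s and 1s with identical symbol frequencies.
alternating = "01" * 5000                                  # 0101...01
blocked = "0" * 5000 + "1" * 5000                          # 000...0111...1
shuffled = "".join(random.sample(blocked, len(blocked)))   # same symbols, random order

for name, s in [("alternating", alternating), ("blocked", blocked), ("shuffled", shuffled)]:
    print(name, round(entropy(s), 3), "bits/symbol,",
          len(zlib.compress(s.encode())), "bytes compressed")

# All three report 1.0 bit per symbol, but zlib shrinks the first two to a few
# dozen bytes, while the shuffled version stays near 10000/8 = 1250 bytes.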

(Taking still further advantage of brainpower, the ultimate in compression is measured by the _Kolmogorov complexity_: the length of the shortest program that will reproduce the sequence.  This is very low for anything highly regular, as well as for seemingly irregular sequences such as the output of a pseudo-random number generator or the digits of $\pi$.  I am not aware of any automated way to calculate the Kolmogorov complexity of an arbitrary sequence of data; indeed, it is not computable in general.)

This, obviously, is one of the motivations for the substitution phase in a substitution-permutation network; with permutation alone, the symbol frequencies -- and therefore the entropy -- of the original cleartext are preserved in the ciphertext, and in some cases reveal a great deal about the original data.
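The following sketch, again reusing the entropy function above on an invented toy message, confirms that a pure transposition leaves the entropy exactly unchanged:

import random

message = b"attack the castle gate at dawn"

# A toy transposition cipher: shuffle the byte positions with a fixed seed,
# so the "ciphertext" is a pure permutation of the plaintext bytes.
rng = random.Random(42)
positions = list(range(len(message)))
rng.shuffle(positions)
ciphertext = bytes(message[i] for i in positions)

# Same multiset of symbols, so the same frequency histogram and the same
# entropy (rounded to guard against floating-point summation order).
print(round(entropy(message), 10) == round(entropy(ciphertext), 10))  # True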


