Information Theory: Shannon Entropy to Channel Capacity

Information theory, founded by Claude Shannon in 1948, provides the mathematical framework for quantifying information, establishing fundamental limits on data compression, and characterizing the capacity of communication channels. This guide covers entropy, mutual information, source and channel coding theorems, error-correcting codes, differential entropy, rate-distortion theory, and Fisher information.

1. Shannon Entropy

Shannon entropy quantifies the average unpredictability of a random variable. It answers the question: how many binary questions must we ask on average to identify the outcome of an experiment? Shannon showed in 1948 that entropy is the unique measure satisfying four natural axioms — continuity, maximality for uniform distributions, symmetry, and a grouping rule — up to a positive multiplicative constant.

Definition and Units

For a discrete random variable X taking values in a finite alphabet with probability mass function p(x), the Shannon entropy is:

Shannon Entropy

  • H(X) = -sum_x p(x) log_b p(x)
  • H(X) = E[ -log_b p(X) ]
  • Convention: 0 * log(0) = 0 (limit as p approaches 0)

Base b = 2 gives bits; base b = e gives nats; base b = 10 gives hartleys (or bans).
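The definition above translates directly into a few lines of code. This is a minimal sketch (the function name and examples are illustrative, not from a library) that applies the 0 * log(0) = 0 convention by skipping zero-probability outcomes:

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy H(X) = -sum_x p(x) log_b p(x), with 0*log(0) = 0."""
    return sum(-p * math.log(p, base) for p in probs if p > 0.0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([1/6] * 6))    # fair die: log2(6) ~ 2.585 bits
print(entropy([1.0]))        # certain event: 0.0 bits
```

Passing base=math.e gives the same entropy in nats.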

Self-Information

The self-information (surprisal) of a single event x with probability p(x) is I(x) = -log_2 p(x) bits. Rare events carry high information; a certain event carries zero information. Shannon entropy is the expected value of self-information: H(X) = E[I(X)].

Examples: Coin Flip and Dice

Fair Coin (Bernoulli 1/2)

H = -(1/2)*log2(1/2) - (1/2)*log2(1/2)

H = 1 bit

One binary question suffices: heads or tails?

Fair Six-Sided Die

H = -6*(1/6)*log2(1/6)

H = log2(6) approx 2.585 bits

About 2.585 binary questions needed on average.

Properties of Entropy

  • Non-negativity: H(X) is greater than or equal to 0 for all distributions
  • Maximum: H(X) is maximized by the uniform distribution, giving H = log_2 n for n outcomes
  • Minimum: H(X) = 0 if and only if one outcome has probability 1 (no uncertainty)
  • Concavity: H is a concave function of the probability vector
  • Symmetry: H is invariant under permutation of the alphabet

Binary Entropy Function

For a Bernoulli random variable with P(X=1) = p, the binary entropy function is:

H_b(p) = -p*log2(p) - (1-p)*log2(1-p)

H_b(0) = H_b(1) = 0; H_b(1/2) = 1 bit (maximum). The function is symmetric around p = 1/2 and concave on [0,1]. It appears throughout information theory, for example in the capacity of the binary symmetric channel: C = 1 - H_b(p).

Nats vs. Bits

The choice of logarithm base is a matter of convention. Using natural logs gives entropy in nats, natural for theoretical work and connections to thermodynamics. Using log base 2 gives bits, natural for data compression and communication theory. The conversion is: 1 nat = 1/ln(2) approx 1.4427 bits. Shannon himself used bits in most applied work.

2. Joint and Conditional Entropy

When multiple random variables are involved, joint and conditional entropy describe how uncertainty is distributed across them. The chain rule for entropy is the fundamental decomposition theorem that relates these quantities.

Joint Entropy

For a pair (X, Y) with joint distribution p(x, y), the joint entropy measures total uncertainty in the pair:

  • H(X,Y) = -sum_(x,y) p(x,y) * log2 p(x,y)
  • H(X,Y) is less than or equal to H(X) + H(Y)
  • Equality holds iff X and Y are independent

Conditional Entropy

The conditional entropy H(Y|X) measures the expected remaining uncertainty in Y after observing X:

  • H(Y|X) = sum_x p(x) * H(Y|X=x)
  • H(Y|X) = -sum_(x,y) p(x,y) * log2 p(y|x)
  • H(Y|X) = H(X,Y) - H(X)

Conditioning never increases entropy: H(Y|X) is less than or equal to H(Y), with equality if and only if X and Y are independent. Knowing X can only reduce or maintain uncertainty about Y.

Chain Rule for Entropy

The chain rule decomposes joint entropy into a sum of conditional entropies. For two variables:

H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

The general chain rule for n random variables:

H(X_1, X_2, ..., X_n) = sum_(i=1 to n) H(X_i | X_1, ..., X_(i-1))

Each term conditions on all previously revealed variables. For i.i.d. variables, each conditional entropy equals H(X), giving H = n * H(X).
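The chain rule can be checked numerically from any joint table. A small sketch with an illustrative (correlated) toy distribution:

```python
import math
from collections import defaultdict

def H(probs):
    """Entropy in bits of a probability vector."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(x, y); X and Y are dependent here.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}

px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

H_XY = H(joint.values())
H_X, H_Y = H(px.values()), H(py.values())
H_Y_given_X = H_XY - H_X      # chain rule: H(X,Y) = H(X) + H(Y|X)
H_X_given_Y = H_XY - H_Y      # and symmetrically H(X,Y) = H(Y) + H(X|Y)

print(H_XY)          # 1.5 bits total uncertainty
print(H_X_given_Y)   # 0.5 bits remain about X after observing Y
```

Note that H(X|Y) = 0.5 < H(X) ~ 0.811: conditioning reduced the uncertainty, as expected for dependent variables.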

Entropy Diagram (Venn-Diagram View)

Conceptual Summary

H(X) and H(Y) are represented as overlapping regions. Their intersection is I(X;Y) (mutual information). The non-overlapping portions are H(X|Y) and H(Y|X) respectively. The union is H(X,Y). This diagram makes the identity H(X,Y) = H(X) + H(Y) - I(X;Y) visually transparent.

3. Mutual Information

Mutual information is the central quantity in information theory. It measures how much one random variable tells us about another, appearing in channel capacity, feature selection, and the information bottleneck principle for deep learning.

Definition

  • I(X;Y) = H(X) + H(Y) - H(X,Y)
  • I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
  • I(X;Y) = sum_(x,y) p(x,y) * log2[ p(x,y) / (p(x)*p(y)) ]
  • I(X;Y) = D_KL( p(x,y) || p(x)*p(y) )

The last equality shows mutual information as the KL divergence from the joint distribution to the product of marginals — a natural measure of statistical dependence.

Properties

  • Non-negativity: I(X;Y) is greater than or equal to 0
  • Equality with zero iff X and Y are independent
  • Symmetry: I(X;Y) = I(Y;X)
  • Self-information: I(X;X) = H(X)
  • Bound: I(X;Y) is less than or equal to min(H(X), H(Y))

Data Processing Inequality

If X, Y, Z form a Markov chain (Z depends on X only through Y), written X to Y to Z, then:

I(X;Z) is less than or equal to I(X;Y)

Processing data can only destroy information, never create it. This is the information-theoretic justification for why feature engineering and data augmentation must be done carefully: any deterministic processing Z = f(Y) satisfies I(X;Z) is less than or equal to I(X;Y). Equality holds when f is invertible.

Fano's Inequality

Fano's inequality lower-bounds the error probability when estimating X from Y. Let X-hat be any estimate of X based on Y, with error probability P_e = P(X-hat is not equal to X):

H(X|Y) is less than or equal to H_b(P_e) + P_e * log2(|X| - 1)

Equivalently: P_e is greater than or equal to (H(X|Y) - 1) / log2 |X|. This is the key step in the converse of Shannon's channel coding theorem, bounding how low error probability can be when rate exceeds capacity.

4. Kullback-Leibler Divergence

The Kullback-Leibler divergence (relative entropy) is a fundamental measure of how one probability distribution differs from another. It appears throughout machine learning, Bayesian inference, variational methods, and the theory of hypothesis testing.

Definition and Properties

KL Divergence (Relative Entropy)

  • D_KL(P||Q) = sum_x p(x) * log( p(x) / q(x) )
  • D_KL(P||Q) is greater than or equal to 0 (Gibbs inequality)
  • D_KL(P||Q) = 0 iff P = Q almost everywhere
  • D_KL(P||Q) is NOT equal to D_KL(Q||P) in general

Asymmetry and Its Consequences

KL divergence is not symmetric: D_KL(P||Q) measures expected log-likelihood ratio under P, while D_KL(Q||P) measures it under Q. When used in machine learning, the choice matters significantly:

Forward KL: D_KL(P||Q)

Also called inclusive (mass-covering) divergence. Minimizing this forces Q to cover all modes of P. If P has mass where Q does not, the divergence is infinite. Used in maximum likelihood estimation.

Reverse KL: D_KL(Q||P)

Also called exclusive divergence. Minimizing this allows Q to ignore modes of P, producing mode-seeking behavior. Used in variational inference (ELBO optimization).

Cross-Entropy

The cross-entropy H(P, Q) between distributions P and Q is:

  • H(P, Q) = -sum_x p(x) * log q(x)
  • H(P, Q) = H(P) + D_KL(P||Q)

In classification tasks, minimizing the cross-entropy loss H(P, Q) where P is the true label distribution and Q is the model's predicted distribution is equivalent to minimizing D_KL(P||Q), since H(P) is fixed with respect to the model parameters.

Jensen-Shannon Divergence

The Jensen-Shannon divergence is a symmetrized, smoothed version of KL divergence that is always finite and bounded:

JSD(P||Q) = (1/2) * D_KL(P||M) + (1/2) * D_KL(Q||M)

where M = (P + Q)/2 is the mixture distribution

JSD is bounded between 0 and 1 bit (ln 2 nats). Its square root is a true metric. It appears in generative adversarial networks (GANs): at the optimal discriminator, the original GAN generator objective reduces to minimizing the JSD between real and generated distributions.
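The three divergences of this section fit in a short sketch (function names are illustrative; the distributions match Problem 4 below). Note the identity H(P, Q) = H(P) + D_KL(P||Q) in the checks:

```python
import math

def kl(p, q):
    """D_KL(P||Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum p log2 q, the standard classification loss."""
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, finite, at most 1 bit."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = [0.5, 0.25, 0.25]
Q = [0.25, 0.25, 0.5]
print(kl(P, Q), kl(Q, P))    # both 0.25 bits here, but unequal in general
print(cross_entropy(P, Q))   # 1.75 = H(P) + D_KL(P||Q) = 1.5 + 0.25
print(jsd(P, Q))             # ~0.061 bits, symmetric and bounded
```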

5. Source Coding: Compression Theory

Source coding theory establishes the fundamental limits of lossless data compression. Shannon's source coding theorem shows that entropy is both a lower bound on compression and an achievable rate.

Shannon's Source Coding Theorem

Theorem (First Shannon Theorem, 1948)

For a discrete memoryless source X with entropy H(X) bits per symbol, it is impossible to losslessly compress n independent samples to fewer than n * H(X) bits on average. Conversely, for any rate R greater than H(X), there exist codes using at most n * R bits per sample with vanishing error probability as n grows.

The proof relies on the asymptotic equipartition property (AEP): for large n, nearly all probability is concentrated on approximately 2^(n*H(X)) typical sequences, each with probability approximately 2^(-n*H(X)). Assigning equal-length codes to the typical set suffices for near-optimal compression.

Asymptotic Equipartition Property

The AEP is the information-theoretic law of large numbers. For i.i.d. samples X_1, ..., X_n from distribution p:

-(1/n) * log p(X_1, ..., X_n) converges in probability to H(X)

The epsilon-typical set A_eps^n contains sequences whose sample entropy is within epsilon of H(X). Its probability approaches 1 as n grows, and it contains between (1-eps) * 2^(n*(H-eps)) and 2^(n*(H+eps)) elements.

Huffman Coding

Huffman coding constructs the optimal prefix-free code for a known symbol distribution. Algorithm:

  1. Create a leaf node for each symbol weighted by its probability
  2. Merge the two nodes with lowest probability into a new internal node with probability equal to their sum
  3. Repeat until one node remains; assign 0/1 to left/right branches

Example: Symbols A (p=0.5), B (p=0.25), C (p=0.125), D (p=0.125)

  • Step 1: merge C and D into CD (p=0.25)
  • Step 2: merge B and CD into BCD (p=0.5)
  • Step 3: merge A and BCD into root (p=1.0)
  • Codes: A=0, B=10, C=110, D=111
  • Expected length = 0.5(1) + 0.25(2) + 0.125(3) + 0.125(3) = 1.75 bits
  • H(X) = 1.75 bits exactly — Huffman is optimal here

Huffman codes satisfy H(X) is less than or equal to L is less than H(X) + 1 bit per symbol. Codes on blocks of k symbols achieve H(X) is less than or equal to L/k is less than H(X) + 1/k, approaching entropy as k grows.
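The merge procedure above maps naturally onto a priority queue. A minimal sketch (the integer tiebreak keeps heap comparisons away from the tree tuples; the example is the A/B/C/D source from the text):

```python
import heapq

def huffman_code(probs):
    """Build a prefix-free Huffman code for a {symbol: probability} dict."""
    # Heap entries: (probability, tiebreak, tree); a tree is a symbol
    # or a (left, right) pair of subtrees.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)   # two lowest-probability nodes
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, count, (t1, t2)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")   # 0 on the left branch
            walk(tree[1], prefix + "1")   # 1 on the right branch
        else:
            codes[tree] = prefix or "0"   # single-symbol alphabet edge case
    walk(heap[0][2], "")
    return codes

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
codes = huffman_code(probs)
avg_len = sum(probs[s] * len(codes[s]) for s in probs)
print(codes)
print(avg_len)   # 1.75 bits, equal to H(X) for these dyadic probabilities
```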

Arithmetic Coding

Arithmetic coding encodes an entire message as a single fractional number in [0, 1), partitioning the interval proportionally to symbol probabilities. For a message of n symbols, its expected length per symbol lies between H(X) and H(X) + 2/n bits, far better than Huffman for small alphabets or highly skewed distributions. Arithmetic coding is the basis for modern compressors including LZMA, PAQ, and parts of JPEG 2000.

Lempel-Ziv Algorithms

The Lempel-Ziv family (LZ77, LZ78, LZW) achieves universal lossless compression without knowing the source distribution. LZ77 (basis of DEFLATE, used in gzip and PNG) replaces repeated substrings with back-references (distance, length) pairs. LZ78 and LZW build an explicit dictionary of encountered phrases.

Universality Theorem

For any stationary ergodic source with entropy rate H, the LZ78 compression ratio converges to H bits per symbol as the sequence length grows. The LZ complexity c(x^n) counts distinct phrases; (c(x^n) * log n) / n converges to the entropy rate. ZIP uses DEFLATE (LZ77 + Huffman); JPEG uses DCT followed by Huffman or arithmetic coding of quantized coefficients; MP3 uses psychoacoustic masking followed by Huffman coding.

6. Channel Capacity

Channel capacity is the maximum rate at which information can be transmitted reliably over a noisy communication channel. Shannon's noisy channel coding theorem establishes capacity as both an upper bound and an achievable rate.

Discrete Memoryless Channels

A discrete memoryless channel (DMC) is defined by input alphabet X, output alphabet Y, and transition probabilities p(y|x). Each channel use is independent. The capacity is:

Channel Capacity

C = max over p(x) of I(X;Y) bits per channel use

The maximum is over all input distributions p(x). Since I(X;Y) is concave in p(x) for fixed p(y|x), this is a concave maximization (convex optimization) problem whose optimal value is unique. A capacity-achieving input distribution can be found via the Blahut-Arimoto algorithm.

Binary Symmetric Channel

The binary symmetric channel (BSC) with crossover probability p flips each transmitted bit independently with probability p. The uniform input distribution achieves capacity:

  • C_BSC = 1 - H_b(p) = 1 + p*log2(p) + (1-p)*log2(1-p)
  • At p=0 (no noise): C = 1 bit
  • At p=0.5 (pure noise): C = 0 bits
  • At p=1 (inverted): C = 1 bit (channel is invertible)

Binary Erasure Channel

The binary erasure channel (BEC) with erasure probability alpha delivers each bit correctly with probability 1 - alpha or erases it (outputs ?) with probability alpha. The capacity is:

C_BEC = 1 - alpha bits per channel use

The BEC is simpler to analyze than the BSC and is the canonical channel for studying capacity-achieving codes. LDPC codes achieve BEC capacity with message-passing decoding.
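The BSC and BEC capacity formulas above are one-liners; a small sketch (function names illustrative) checking the boundary cases from the bullets:

```python
import math

def h_b(p):
    """Binary entropy function H_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """C = 1 - H_b(p) for the binary symmetric channel."""
    return 1.0 - h_b(p)

def bec_capacity(alpha):
    """C = 1 - alpha for the binary erasure channel."""
    return 1.0 - alpha

print(bsc_capacity(0.0))    # 1.0: noiseless
print(bsc_capacity(0.5))    # 0.0: output independent of input
print(bsc_capacity(1.0))    # 1.0: deterministic inversion is invertible
print(bsc_capacity(0.1))    # ~0.531
print(bec_capacity(0.2))    # 0.8
```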

AWGN Channel and the Shannon-Hartley Theorem

The additive white Gaussian noise (AWGN) channel model is Y = X + Z where Z is Gaussian with mean 0 and variance N, and input X satisfies power constraint E[X^2] is less than or equal to P. The Gaussian input distribution is capacity-achieving:

Shannon-Hartley Theorem

  • C = (1/2) * log2(1 + P/N) bits per channel use
  • C = W * log2(1 + P / (N_0 * W)) bits per second

W is bandwidth in Hz, N_0 is noise power spectral density. SNR = P / (N_0 * W). Capacity grows logarithmically in SNR but linearly in bandwidth.

The Shannon-Hartley theorem explains modern wireless design: doubling bandwidth doubles capacity, while doubling SNR adds only about 1 bit per channel use. OFDM divides bandwidth into subchannels; MIMO adds spatial dimensions. Both increase effective bandwidth rather than relying solely on power.
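The bandwidth-versus-power tradeoff is easy to see numerically. A sketch of the Shannon-Hartley formula (the 20 MHz / 20 dB operating point is a made-up example, not a standard's figure):

```python
import math

def shannon_hartley(bandwidth_hz, snr_linear):
    """C = W * log2(1 + SNR) bits per second for an AWGN channel."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)

# Hypothetical 20 MHz channel at 20 dB SNR (linear SNR = 100):
c = shannon_hartley(20e6, 100.0)
print(c / 1e6)   # ~133 Mbit/s

# Doubling bandwidth doubles capacity...
print(shannon_hartley(40e6, 100.0) / c)          # 2.0
# ...while doubling SNR adds only ~1 bit per channel use:
print(math.log2(201) - math.log2(101))           # ~0.99 bits
```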

7. Channel Coding and Error-Correcting Codes

Shannon's channel coding theorem guarantees that reliable communication is possible at any rate below capacity, but the proof is existential — it shows good codes exist without constructing them. Practical error-correcting codes achieve rates close to capacity with efficient encoding and decoding.

Shannon's Noisy Channel Coding Theorem

Theorem (Second Shannon Theorem, 1948)

For a DMC with capacity C: if R is less than C, there exist block codes with rate R and block error probability approaching zero as block length n grows. If R exceeds C, the error probability of any code is bounded below by a positive constant, regardless of block length.

The achievability proof uses random coding: generate M = 2^(nR) codewords i.i.d. from the capacity-achieving input distribution. Joint typicality decoding declares the transmitted codeword to be the unique codeword jointly typical with the received sequence. The converse uses Fano's inequality to show that high reliability forces R to be less than or equal to C.

Hamming Codes

Hamming codes are perfect linear codes with parameters (2^r - 1, 2^r - 1 - r, 3) for any integer r is greater than or equal to 2. The (7, 4, 3) Hamming code encodes 4 data bits into 7 bits by appending 3 parity-check bits. The parity-check matrix H has columns equal to all nonzero 3-bit binary vectors.

Decoding via Syndrome

  • Receive r = c + e (codeword c plus error e)
  • Compute syndrome s = H * r (mod 2)
  • s = 0: no detectable error
  • s nonzero: s equals the binary position of the flipped bit
  • Correct by flipping that bit

Minimum Hamming distance d = 3 allows detection of up to 2 errors and correction of 1 error. The code is perfect: the sphere-packing bound is achieved with equality.
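Syndrome decoding for the (7,4) code fits in a few lines when the columns of H are arranged as the binary representations of positions 1 through 7, as described above. A sketch (the all-zero codeword is used because it trivially satisfies every parity check):

```python
# Parity-check matrix H for the (7,4) Hamming code: column i is the
# 3-bit binary representation of i, for positions i = 1..7.
H = [[(i >> b) & 1 for i in range(1, 8)] for b in (2, 1, 0)]

def syndrome(r):
    """s = H r (mod 2), read as an integer: the position of a flipped bit."""
    bits = [sum(h * ri for h, ri in zip(row, r)) % 2 for row in H]
    return bits[0] * 4 + bits[1] * 2 + bits[2]

def correct(r):
    """Correct at most one bit error in a received 7-bit word."""
    pos = syndrome(r)
    out = list(r)
    if pos:                 # nonzero syndrome names the error position
        out[pos - 1] ^= 1
    return out

codeword = [0, 0, 0, 0, 0, 0, 0]   # the all-zero word is always a codeword
received = list(codeword)
received[4] ^= 1                   # channel flips bit at position 5
print(syndrome(received))          # 5
print(correct(received) == codeword)   # True
```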

Turbo Codes

Turbo codes, invented by Berrou, Glavieux, and Thitimajshima in 1993, combine two recursive systematic convolutional codes with an interleaver. They approach Shannon capacity within about 0.5 dB at practical block lengths through iterative belief-propagation decoding. Turbo codes are used in 3G/4G cellular standards (UMTS, LTE).

LDPC Codes

Low-density parity-check (LDPC) codes, invented by Gallager in 1962 and rediscovered in the 1990s, use sparse parity-check matrices and message-passing decoding on a bipartite Tanner graph. They provably achieve capacity on the binary erasure channel and perform within 0.01 dB of capacity in practice. LDPC codes are used in 5G NR, DVB-S2 (satellite TV), and Wi-Fi (802.11n/ac/ax).

Linear Codes and Key Parameters

Code          | Parameters (n, k, d) | Application             | Corrections
Hamming (7,4) | (7, 4, 3)            | ECC RAM, simple storage | 1-bit correction
Reed-Solomon  | (255, k, 255-k+1)    | CDs, DVDs, QR codes     | Burst errors
Turbo         | Rate 1/3 typical     | 3G/4G cellular          | Near-capacity
LDPC          | Variable, sparse H   | 5G, Wi-Fi, satellite    | Near-capacity
Polar         | (N, k, d)            | 5G NR control           | Provably optimal

8. Differential Entropy for Continuous Distributions

Differential entropy extends the entropy concept to continuous random variables. It behaves differently from discrete entropy in several important ways: it can be negative, is not invariant under reparameterization, and lacks direct interpretation as a bit count.

Definition

Differential Entropy

  • h(X) = -integral f(x) * log f(x) dx
  • h(X) = E[ -log f(X) ]

The integral is over the support of f. Unlike discrete entropy, h(X) can take any real value including negative values.

Differential Entropy of Common Distributions

Distribution             | Parameters                | Differential Entropy h(X)
Gaussian N(mu, sigma^2)  | mean mu, variance sigma^2 | (1/2) log(2*pi*e*sigma^2) nats
Uniform U(a, b)          | interval [a, b]           | log(b - a) nats
Exponential Exp(lambda)  | rate lambda               | 1 - log(lambda) nats
Laplace Lap(mu, b)       | location mu, scale b      | 1 + log(2b) nats
Cauchy Cauchy(x_0, gamma)| location x_0, scale gamma | log(4*pi*gamma) nats

Gaussian Maximizes Differential Entropy

Maximum Entropy Theorem

Among all continuous distributions with fixed variance sigma^2, the Gaussian N(mu, sigma^2) uniquely maximizes differential entropy: h(X) is less than or equal to (1/2) log(2*pi*e*sigma^2), with equality if and only if X is Gaussian.

This is why Gaussian noise is the worst-case noise for the AWGN channel: it maximizes entropy for a given power. It also explains why the Gaussian input achieves capacity for the AWGN channel — it maximizes output entropy subject to the power constraint.
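The maximum entropy theorem can be checked against the closed forms from the table above. A sketch comparing three distributions at the same variance (the variance 4 matches Problem 5 below):

```python
import math

def h_gaussian(var):
    """(1/2) ln(2*pi*e*sigma^2), in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def h_laplace(var):
    b = math.sqrt(var / 2)      # Laplace variance is 2b^2
    return 1 + math.log(2 * b)  # nats

def h_uniform(var):
    width = math.sqrt(12 * var)  # uniform variance is width^2 / 12
    return math.log(width)       # nats

v = 4.0
print(h_gaussian(v))   # ~2.112 nats: the maximum at this variance
print(h_laplace(v))    # ~2.040 nats
print(h_uniform(v))    # ~1.936 nats
```

The ordering Gaussian > Laplace > Uniform holds at every fixed variance, as the theorem requires.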

Properties of Differential Entropy

  • Translation invariance: h(X + c) = h(X) for any constant c
  • Scaling: h(aX) = h(X) + log|a| for any nonzero constant a
  • Chain rule: h(X, Y) = h(X) + h(Y|X)
  • Conditioning reduces entropy: h(Y|X) is less than or equal to h(Y)
  • Not invariant under reparameterization (unlike discrete entropy)

Relative Entropy for Continuous Distributions

The KL divergence for continuous distributions replaces the sum with an integral:

D_KL(f||g) = integral f(x) * log(f(x)/g(x)) dx

D_KL(f||g) is still non-negative and equals zero iff f = g almost everywhere. Mutual information for continuous (X, Y) is I(X;Y) = D_KL(f(x,y) || f(x)*f(y)) = h(X) + h(Y) - h(X,Y).

9. Rate-Distortion Theory

Rate-distortion theory addresses lossy compression: what is the minimum bit rate needed to represent a source within a given distortion level? Shannon's rate-distortion theorem characterizes the optimal tradeoff between compression rate and reconstruction quality.

Distortion Measures

A distortion measure d(x, x-hat) quantifies the cost of representing source symbol x by reconstruction symbol x-hat. Common distortion measures include:

Squared Error

d(x, x-hat) = (x - x-hat)^2

Used for speech and image compression. MSE averages this over many samples. Minimized by the conditional mean estimator E[X|Y].

Hamming Distortion

d(x, x-hat) = 0 if x = x-hat, 1 otherwise

Used for discrete sources. Distortion equals error probability. Rate-distortion with Hamming distortion recovers the channel coding converse.

The Rate-Distortion Function

Definition: R(D)

R(D) = min of I(X; X-hat), taken over all test channels p(x-hat|x) satisfying E[d(X, X-hat)] less than or equal to D

R(D) is the minimum mutual information between source X and reconstruction X-hat subject to the distortion constraint. It is non-increasing and convex in D.

Rate-Distortion for Gaussian Sources

For a Gaussian source X with variance sigma^2 under squared error distortion, the rate-distortion function has the elegant closed form:

  • R(D) = (1/2) * log2(sigma^2 / D) bits, for 0 less than D less than or equal to sigma^2
  • R(D) = 0, for D greater than or equal to sigma^2

This means to achieve distortion D, the source must be coded at rate at least (1/2) log2(sigma^2/D). Halving the distortion requires 1 extra bit. This is the information-theoretic foundation for transform coding in JPEG and MP3.
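The Gaussian rate-distortion curve and its inverse D(R) = sigma^2 * 2^(-2R) take two lines each; the sigma^2 = 16 example matches Problem 7 below:

```python
import math

def rate_gaussian(var, D):
    """R(D) for a Gaussian source under squared error, in bits per sample."""
    return 0.0 if D >= var else 0.5 * math.log2(var / D)

def distortion_gaussian(var, R):
    """Inverse: minimum distortion at rate R, D = sigma^2 * 2^(-2R)."""
    return var * 2.0 ** (-2.0 * R)

sigma2 = 16.0
print(distortion_gaussian(sigma2, 2))   # 1.0 at 2 bits/sample
print(distortion_gaussian(sigma2, 1))   # 4.0: halving the rate quadruples D
print(rate_gaussian(sigma2, 1.0))       # 2.0 bits: the round trip
print(rate_gaussian(sigma2, 20.0))      # 0.0: D >= sigma^2 needs no bits
```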

Shannon's Rate-Distortion Theorem

Theorem

For a memoryless source X with rate-distortion function R(D): if rate R is greater than R(D), there exists a sequence of source codes with rate at most R achieving expected distortion approaching D as block length grows. If R is less than R(D), no sequence of rate-R codes can achieve expected distortion D.

Blahut-Arimoto Algorithm

The Blahut-Arimoto algorithm computes R(D) numerically for discrete sources using an alternating optimization procedure. Starting from an initial test channel p(x-hat|x), it alternates between:

  1. Update the output distribution q(x-hat) = sum_x p(x) p(x-hat|x)
  2. Update the test channel using the Lagrangian condition with parameter lambda controlling the rate-distortion tradeoff

The algorithm converges geometrically and computes the entire rate-distortion curve by varying lambda from 0 to infinity. The same algorithm (with source and channel roles swapped) computes channel capacity, making it a unified computational tool for information-theoretic optimization.
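The channel-capacity variant of this alternating scheme can be sketched in a few lines. This is a minimal illustration, not a production implementation: the iteration count is a fixed assumption rather than a convergence test, and the two example channels (a BSC and a Z-channel) are chosen because their capacities are known in closed form:

```python
import math

def blahut_arimoto_capacity(W, iters=500):
    """Capacity of a DMC with transition matrix W[x][y], in bits.
    Alternates the induced output distribution r with the multiplicative
    input update q(x) proportional to q(x) * exp(D(x)), where D(x) is the
    KL divergence (nats) between row x of W and r."""
    nx, ny = len(W), len(W[0])
    q = [1.0 / nx] * nx
    for _ in range(iters):
        r = [sum(q[x] * W[x][y] for x in range(nx)) for y in range(ny)]
        D = [sum(W[x][y] * math.log(W[x][y] / r[y])
                 for y in range(ny) if W[x][y] > 0) for x in range(nx)]
        z = sum(q[x] * math.exp(D[x]) for x in range(nx))
        q = [q[x] * math.exp(D[x]) / z for x in range(nx)]
    # Mutual information of the final input distribution, in bits.
    r = [sum(q[x] * W[x][y] for x in range(nx)) for y in range(ny)]
    return sum(q[x] * W[x][y] * math.log2(W[x][y] / r[y])
               for x in range(nx) for y in range(ny) if W[x][y] > 0)

bsc = [[0.9, 0.1], [0.1, 0.9]]      # BSC(0.1): C = 1 - H_b(0.1) ~ 0.531
print(blahut_arimoto_capacity(bsc))

z = [[1.0, 0.0], [0.5, 0.5]]        # Z-channel, flip prob 0.5: C = log2(1.25)
print(blahut_arimoto_capacity(z))   # ~0.322 bits
```

The BSC case starts at its optimum (uniform input); the Z-channel case is asymmetric, so the iteration visibly moves the input distribution before converging.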

10. Fisher Information and Statistical Estimation

Fisher information connects information theory to classical statistics. It measures how much information an observation carries about an unknown parameter, establishing fundamental limits on estimation accuracy through the Cramer-Rao bound.

Definition of Fisher Information

Fisher Information

  • I(theta) = E[ (d/d_theta log p(X; theta))^2 ]
  • I(theta) = -E[ d^2/d_theta^2 log p(X; theta) ]
  • I(theta) = Var[ d/d_theta log p(X; theta) ]

The score function s(X; theta) = d/d_theta log p(X; theta) has mean zero under mild regularity conditions. Fisher information is the variance of the score.

The Cramer-Rao Bound

Cramer-Rao Inequality

Var(theta-hat) is greater than or equal to 1 / I(theta)

For any unbiased estimator theta-hat of theta based on a single observation from p(x; theta). For n i.i.d. observations, the bound becomes Var(theta-hat) is greater than or equal to 1 / (n * I(theta)), since Fisher information adds for independent samples.

The maximum likelihood estimator (MLE) achieves the Cramer-Rao bound asymptotically as n grows — it is asymptotically efficient. For the Gaussian model X is distributed as N(mu, sigma^2) with known sigma^2, I(mu) = 1/sigma^2 and the MLE mu-hat = sample mean achieves variance sigma^2/n, exactly matching the bound.

Sufficient Statistics

A statistic T(X) is sufficient for parameter theta if the conditional distribution of X given T(X) does not depend on theta. Sufficiency means T captures all the information about theta contained in the data.

Fisher-Neyman Factorization Theorem

T(X) is sufficient for theta if and only if the likelihood factors as p(x; theta) = g(T(x), theta) * h(x), where g depends on x only through T(x) and h does not depend on theta. Example: for the Gaussian family, the sample mean and sample variance are jointly sufficient for (mu, sigma^2).

Fisher Information and KL Divergence

Fisher information is the second-order local approximation of KL divergence. For nearby distributions parametrized by theta and theta + d_theta:

D_KL( p(x; theta) || p(x; theta + d_theta) ) approx (1/2) * I(theta) * (d_theta)^2

This connection motivates information geometry: the Fisher information matrix defines a Riemannian metric on the manifold of probability distributions. Natural gradient methods in machine learning use this geometry to perform gradient descent in the space of distributions rather than parameter space.
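The local-curvature relation can be verified for a family where KL has a closed form. A sketch using the Bernoulli family, where I(theta) = 1/(theta(1-theta)); the agreement with (1/2) I(theta) (d_theta)^2 improves as d_theta shrinks:

```python
import math

def kl_bernoulli(p, q):
    """D_KL(Ber(p) || Ber(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

theta = 0.3
fisher = 1.0 / (theta * (1.0 - theta))   # Fisher information of Bernoulli(theta)
for d in (0.05, 0.005):
    exact = kl_bernoulli(theta, theta + d)
    approx = 0.5 * fisher * d ** 2       # second-order approximation
    print(d, exact, approx)              # relative error shrinks with d
```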

11. Applications Across Science and Engineering

Information theory provides foundational tools across data compression, communications, machine learning, and biology. Its concepts quantify uncertainty, channel efficiency, and learning capacity in a unified mathematical language.

Data Compression

ZIP / gzip

DEFLATE = LZ77 (dictionary compression) followed by Huffman coding. LZ77 removes repeated substrings; Huffman encodes the resulting symbols near-optimally. Achieves entropy rate for stationary sources.

JPEG

Discrete cosine transform (DCT) decorrelates image blocks; quantization reduces precision (lossy step); Huffman or arithmetic coding compresses quantized coefficients. Exploits rate-distortion tradeoff.

MP3

Perceptual audio coding uses psychoacoustic masking to allocate bits: inaudible frequencies get fewer bits. Modified DCT followed by Huffman coding. Operates near the perceptual rate-distortion limit.

Machine Learning and Neural Networks

Information theory is central to modern deep learning:

  • Cross-entropy loss: the standard classification loss H(P, Q) = -sum_y P(y) log Q(y) equals H(P) + D_KL(P||Q), so training minimizes KL divergence from predictions to labels
  • Decision trees (ID3, C4.5): split on the feature with maximum information gain = I(feature; label), a conditional mutual information criterion
  • Variational autoencoders (VAEs): maximize the evidence lower bound (ELBO) = E[log p(x|z)] - D_KL(q(z|x) || p(z)), trading reconstruction quality against posterior-prior divergence
  • Information bottleneck: deep learning can be interpreted as compressing X into a representation Z that maximizes I(Z; Y) while minimizing I(Z; X), a rate-distortion tradeoff in representation space
  • Natural gradient descent: replaces Euclidean gradient with Fisher-information-weighted gradient, performing optimization on the manifold of distributions rather than flat parameter space

Communications Engineering

Shannon's theorems set the engineering targets for telecommunications. The Shannon-Hartley theorem guides spectrum allocation and power budget design. Modern systems approach capacity limits:

  • 5G NR uses LDPC codes for data and polar codes for control, both approaching AWGN capacity within 0.1 dB
  • OFDM divides the channel into parallel narrowband subchannels, each modeled as a flat AWGN channel for simpler capacity analysis
  • MIMO with m transmit and n receive antennas can achieve capacity up to min(m, n) times the single-antenna AWGN capacity

Biology and DNA Information

DNA stores genetic information using a 4-letter alphabet (A, C, G, T). Each base position carries log2(4) = 2 bits of information in the maximum entropy case. The human genome contains approximately 3 * 10^9 base pairs, storing about 750 MB of information at maximum entropy (though actual information content is lower due to repetition and regulatory structure).

Information-Theoretic Analysis of Genetic Sequences

Mutual information between positions i and j in a protein sequence, I(X_i; X_j), detects correlated mutations indicating structural or functional coupling — a statistical signal of co-evolution. Sequence logos represent information content H_max - H(position) in bits, where taller letters indicate more conserved (lower entropy) positions in a multiple sequence alignment.

Cryptography

Shannon's 1949 paper "Communication Theory of Secrecy Systems" founded information-theoretic cryptography. A cipher with key K encrypts plaintext M to ciphertext C. Shannon defined:

  • Perfect secrecy: I(M; C) = 0, meaning the ciphertext reveals no information about the plaintext. The one-time pad achieves perfect secrecy.
  • Unicity distance: the minimum ciphertext length at which the key becomes uniquely determinable, approximately H(K) / D, where D is the redundancy per symbol of the plaintext language. English has D approximately 3.5 bits/letter.

12. Practice Problems with Solutions

Work through these problems to solidify your understanding of information theory.

Problem 1: Entropy of a Biased Coin

A biased coin lands heads with probability p = 0.8 and tails with probability 0.2. Compute the entropy H(X) in bits. Compare to the entropy of a fair coin and explain the difference intuitively.

Solution

H(X) = -0.8 * log2(0.8) - 0.2 * log2(0.2)

= -0.8 * (-0.3219) - 0.2 * (-2.3219)

= 0.2575 + 0.4644 = 0.7219 bits

The fair coin has H = 1 bit (maximum). The biased coin has H = 0.7219 bits, about 28% less. Intuitively, if we know the coin lands heads 80% of the time, we can often predict the outcome correctly without any information — so the actual information per flip is less. The more biased the coin, the lower the entropy.

Problem 2: Chain Rule for Entropy

Two random variables X and Y have joint distribution: P(X=0, Y=0) = 1/4, P(X=0, Y=1) = 1/4, P(X=1, Y=0) = 1/4, P(X=1, Y=1) = 1/4. Compute H(X), H(Y), H(X,Y), H(Y|X), and I(X;Y). Interpret the result.

Solution

Marginals: P(X=0) = P(X=1) = 1/2, P(Y=0) = P(Y=1) = 1/2

H(X) = H(Y) = 1 bit each

H(X,Y) = -4 * (1/4) * log2(1/4) = 2 bits

H(Y|X) = H(X,Y) - H(X) = 2 - 1 = 1 bit

I(X;Y) = H(X) + H(Y) - H(X,Y) = 1 + 1 - 2 = 0 bits

The joint distribution is uniform, so X and Y are independent. Knowing X tells us nothing about Y: I(X;Y) = 0. The chain rule gives H(X,Y) = H(X) + H(Y|X) = 1 + 1 = 2 bits, matching H(X,Y) = 2 bits directly.

Problem 3: Binary Symmetric Channel Capacity

A binary symmetric channel has crossover probability p = 0.1 (10% chance of a bit flip). Compute the channel capacity C in bits per channel use. What is the maximum reliable transmission rate if the channel is used 1,000,000 times per second?

Solution

C = 1 - H_b(0.1) = 1 - (-0.1*log2(0.1) - 0.9*log2(0.9))

H_b(0.1) = 0.1*3.3219 + 0.9*0.1520 = 0.3322 + 0.1368 = 0.469 bits

C = 1 - 0.469 = 0.531 bits per channel use

Max rate = 0.531 * 1,000,000 = 531,000 bits/second = 531 kbps

Shannon guarantees that reliable transmission at any rate below 531 kbps is achievable using appropriate error-correcting codes, while rates above 531 kbps cannot be reliably transmitted regardless of how sophisticated the code is.

Problem 4: KL Divergence Calculation

Let P = (1/2, 1/4, 1/4) and Q = (1/4, 1/4, 1/2) be distributions over three outcomes. Compute D_KL(P||Q), D_KL(Q||P), and verify that they are unequal. Also compute the Jensen-Shannon divergence JSD(P, Q).

Solution

D_KL(P||Q) = (1/2)log2(2) + (1/4)log2(1) + (1/4)log2(1/2)

= 0.5(1) + 0.25(0) + 0.25(-1) = 0.5 - 0.25 = 0.25 bits

D_KL(Q||P) = (1/4)log2(1/2) + (1/4)log2(1) + (1/2)log2(2)

= 0.25(-1) + 0 + 0.5(1) = -0.25 + 0.5 = 0.25 bits

Here D_KL(P||Q) = D_KL(Q||P) = 0.25 bits — a coincidence arising because Q is the reversal of P; in general the two divergences differ.

M = (3/8, 1/4, 3/8); JSD = (1/2)*D_KL(P||M) + (1/2)*D_KL(Q||M)

D_KL(P||M) = (1/2)log2(4/3) + (1/4)log2(1) + (1/4)log2(2/3) approx 0.2075 - 0.1462 = 0.0613 bits; by symmetry D_KL(Q||M) is the same, so JSD approx 0.0613 bits.

Problem 5: Gaussian Differential Entropy

A random variable X is Gaussian with mean 0 and variance 4. Compute the differential entropy h(X) in nats and in bits. By the maximum entropy theorem, no distribution with variance 4 can have higher differential entropy. What differential entropy does the Laplace distribution with the same variance achieve?

Solution

Gaussian: h(X) = (1/2) ln(2*pi*e*4) = (1/2) ln(8*pi*e) nats

= (1/2)(ln 8 + ln pi + 1) = (1/2)(2.079 + 1.145 + 1) = 2.112 nats

= 2.112 / ln(2) = 3.047 bits

Laplace with variance sigma^2 = 4 has scale b = sqrt(2) (since Var = 2b^2 = 4). Laplace entropy: 1 + ln(2b) = 1 + ln(2*sqrt(2)) = 1 + ln(2.828) = 1 + 1.040 = 2.040 nats = 2.943 bits. The Laplace distribution has lower differential entropy (2.040 nats) than the Gaussian (2.112 nats) with the same variance, confirming the maximum entropy theorem.
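The closed-form entropies used in this comparison can be evaluated with a short script (function names are our own):

```python
import math

def gaussian_entropy_nats(var):
    """h = (1/2) ln(2*pi*e*sigma^2) for a Gaussian, in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def laplace_entropy_nats(var):
    """h = 1 + ln(2b) for a Laplace with Var = 2b^2, in nats."""
    b = math.sqrt(var / 2)
    return 1 + math.log(2 * b)

hg = gaussian_entropy_nats(4)
hl = laplace_entropy_nats(4)
print(f"Gaussian: {hg:.3f} nats = {hg / math.log(2):.3f} bits")
print(f"Laplace:  {hl:.3f} nats = {hl / math.log(2):.3f} bits")
assert hl < hg   # maximum entropy theorem: Gaussian wins at fixed variance
```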

Problem 6: Cramer-Rao Bound for Poisson Parameter

Let X_1, ..., X_n be i.i.d. Poisson random variables with unknown mean lambda. Compute the Fisher information I(lambda) for a single observation, state the Cramer-Rao bound for any unbiased estimator of lambda, and verify that the sample mean X-bar achieves this bound.

Solution

Poisson PMF: p(x; lambda) = lambda^x * e^(-lambda) / x!

log p = x*ln(lambda) - lambda - ln(x!)

Score: d/dlambda log p = x/lambda - 1

I(lambda) = E[(X/lambda - 1)^2] = Var(X)/lambda^2 = lambda/lambda^2 = 1/lambda

Cramer-Rao bound: Var(lambda-hat) is greater than or equal to 1/(n * I(lambda)) = lambda/n

Sample mean: E[X-bar] = lambda (unbiased), Var(X-bar) = lambda/n

The sample mean achieves the Cramer-Rao bound exactly — it is the uniformly minimum variance unbiased estimator (UMVUE) for the Poisson mean, confirmed by the fact that the sum of observations is a complete sufficient statistic.
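A Monte Carlo check makes the bound concrete: simulate many samples of size n, estimate the variance of the sample mean, and compare it to lambda/n. The sketch below uses Knuth's product-of-uniforms Poisson sampler and illustrative values lambda = 3, n = 50 (both our own choices):

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate via Knuth's product-of-uniforms method."""
    threshold = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= threshold:
            return k
        k += 1

lam, n, trials = 3.0, 50, 5000
rng = random.Random(0)
means = [sum(poisson_sample(lam, rng) for _ in range(n)) / n
         for _ in range(trials)]
mu = sum(means) / trials
var = sum((m - mu) ** 2 for m in means) / trials
print(f"Var(X-bar) = {var:.4f}  vs  CR bound lambda/n = {lam / n:.4f}")
```

The empirical variance lands on top of lambda/n (up to sampling noise), exactly as the efficiency argument predicts.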

Problem 7: Rate-Distortion for Gaussian Source

A Gaussian source has variance sigma^2 = 16. You are allowed R = 2 bits per sample for compression. What is the minimum achievable mean squared error distortion D? If you reduce the rate to R = 1 bit per sample, how does D change? Explain the tradeoff.

Solution

Rate-distortion: R(D) = (1/2)*log2(sigma^2/D)

Solving for D: D = sigma^2 * 2^(-2R)

At R = 2 bits: D = 16 * 2^(-4) = 16/16 = 1

At R = 1 bit: D = 16 * 2^(-2) = 16/4 = 4

Halving the bit rate from 2 to 1 bit/sample quadruples the minimum distortion from 1 to 4. Each additional bit per sample reduces distortion by a factor of 4 (6 dB improvement in SNR). This matches the standard "6 dB per bit" rule in audio and image compression. No lossy compression scheme can achieve D less than 1 at R = 2 bits for this Gaussian source.
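The distortion-rate function D(R) = sigma^2 * 2^(-2R) is a one-liner to evaluate; a small sketch (function name is our own):

```python
def gaussian_distortion(var, rate):
    """Minimum MSE distortion D = sigma^2 * 2^(-2R) for a Gaussian source."""
    return var * 2 ** (-2 * rate)

for R in (1, 2, 3):
    print(f"R = {R} bit(s): D = {gaussian_distortion(16, R)}")
# Each extra bit divides D by 4, i.e. a 6 dB gain per bit.
```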

Quick Reference: Key Formulas

Entropy Identities

  • H(X) = -sum_x p(x) log p(x)
  • H(X,Y) = H(X) + H(Y|X)
  • H(Y|X) = H(X,Y) - H(X)
  • I(X;Y) = H(X) + H(Y) - H(X,Y)
  • I(X;Y) = D_KL(p(x,y) || p(x)p(y))
  • H(Y|X) is less than or equal to H(Y)

Channel Capacity

  • C = max_p(x) I(X;Y)
  • C_BSC = 1 - H_b(p)
  • C_BEC = 1 - alpha
  • C_AWGN = (1/2) log2(1 + P/N)
  • C_SH = W*log2(1 + SNR)

Divergences and Cross-Entropy

  • D_KL(P||Q) = sum p(x) log(p(x)/q(x))
  • H(P,Q) = H(P) + D_KL(P||Q)
  • D_KL(P||Q) is greater than or equal to 0
  • JSD = (1/2)[D_KL(P||M) + D_KL(Q||M)]

Fisher and Rate-Distortion

  • I(theta) = E[(d/dtheta log p)^2]
  • Var(theta-hat) is greater than or equal to 1/I(theta)
  • h_Gauss = (1/2) log(2*pi*e*sigma^2)
  • R(D) = (1/2) log2(sigma^2/D) [Gaussian]
  • D = sigma^2 * 2^(-2R) [Gaussian]
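The entropy and mutual-information identities above can be sanity-checked on any small joint distribution; a sketch using a hypothetical 2x2 joint pmf of our own choosing:

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# hypothetical joint pmf p(x, y) over two binary variables
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = [0.5, 0.5]                        # marginal of X
py = [0.6, 0.4]                        # marginal of Y

Hxy = H(joint.values())
I = H(px) + H(py) - Hxy                # I(X;Y) = H(X) + H(Y) - H(X,Y)
Hy_given_x = Hxy - H(px)               # H(Y|X) = H(X,Y) - H(X)
assert I >= 0                          # mutual information is non-negative
assert Hy_given_x <= H(py) + 1e-12     # conditioning never increases entropy
print(f"I(X;Y) = {I:.4f} bits, H(Y|X) = {Hy_given_x:.4f} bits")
```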

Frequently Asked Questions

What is Shannon entropy and how is it measured?

Shannon entropy H(X) = -sum_x p(x) log2 p(x) quantifies the average unpredictability of a random variable in bits. A fair coin has H = 1 bit; a fair six-sided die has H = log2(6) approximately 2.585 bits. Entropy is maximized for the uniform distribution and is zero when one outcome has probability 1.

Why is KL divergence not a true distance?

KL divergence is asymmetric: D_KL(P||Q) computes expectations under P while D_KL(Q||P) computes them under Q, giving different values in general. It also fails the triangle inequality. The Jensen-Shannon divergence is a symmetrized version whose square root is a true metric.

What makes a code optimal for lossless compression?

A prefix-free code is optimal if its expected length equals H(X) bits per symbol (achievable when probabilities are powers of 2). Huffman coding is optimal among all prefix-free codes; arithmetic coding approaches entropy arbitrarily closely for blocks of symbols. No lossless code can compress below H(X) bits per symbol on average (Shannon's source coding theorem).
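A compact way to see Huffman optimality in action is to compute just the codeword lengths: each merge of the two least-probable subtrees adds one bit to every symbol beneath them. The sketch below (our own implementation, not a library API) uses a dyadic distribution so the average length hits the entropy exactly:

```python
import heapq

def huffman_lengths(probs):
    """Return codeword lengths of an optimal Huffman code for the given pmf."""
    # heap entries: (probability, tiebreak id, symbol indices in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    uid = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:      # one more bit for every symbol below the merge
            lengths[sym] += 1
        heapq.heappush(heap, (p1 + p2, uid, s1 + s2))
        uid += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]     # dyadic: H(X) = 1.75 bits
lengths = huffman_lengths(probs)
avg = sum(p * l for p, l in zip(probs, lengths))
print(lengths, avg)                    # lengths [1, 2, 3, 3], average 1.75
```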

Why does the Gaussian distribution maximize differential entropy?

Among all continuous distributions with a fixed variance sigma^2, the Gaussian N(mu, sigma^2) has the highest differential entropy h = (1/2) log(2*pi*e*sigma^2). This follows from the non-negativity of KL divergence: D_KL(f || g_Gaussian) is greater than or equal to 0, which rearranges to h(f) is less than or equal to h(g_Gaussian) for any f with the same variance.

How does information theory connect to neural network training?

The cross-entropy loss H(P, Q) = H(P) + D_KL(P||Q) used to train classifiers is equivalent to minimizing KL divergence from predictions to true labels. VAEs maximize the ELBO = E[log p(x|z)] - D_KL(q(z|x)||p(z)), a rate-distortion tradeoff between reconstruction and latent code compression. The information bottleneck principle interprets hidden layers as compressing input while preserving label information.
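For a one-hot label P, H(P) = 0, so minimizing cross-entropy is literally minimizing D_KL(P||Q); a tiny illustration with made-up softmax outputs:

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum p(x) log2 q(x), in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [1.0, 0.0, 0.0]          # one-hot true label
q = [0.7, 0.2, 0.1]          # hypothetical classifier output
# With a one-hot P, the loss reduces to -log2 q(true class)
print(f"loss = {cross_entropy(p, q):.4f} bits")   # -log2(0.7)
```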

Related Topics