Graduate Mathematics

Advanced Probability Theory

A rigorous, graduate-level treatment of probability theory built on measure-theoretic foundations. This guide covers sigma-algebras, probability spaces, random variables as measurable functions, all four convergence modes, the strong law of large numbers, the central limit theorem, characteristic functions, conditional expectation, martingales, Brownian motion, Poisson processes, Markov chains, large deviations, and concentration inequalities.

Table of Contents

  1. Measure-Theoretic Foundations
  2. Random Variables as Measurable Functions
  3. Expectation as Lebesgue Integral
  4. Convergence of Random Variables
  5. Laws of Large Numbers
  6. Central Limit Theorem
  7. Characteristic Functions and MGFs
  8. Conditional Expectation and Tower Property
  9. Martingales and Optional Stopping
  10. Brownian Motion
  11. Poisson Processes
  12. Markov Chains: Detailed Balance and Mixing Times
  13. Large Deviations and Cramer's Theorem
  14. Concentration Inequalities
  15. Frequently Asked Questions

1. Measure-Theoretic Foundations

Sigma-Algebras

The starting point of rigorous probability is the concept of a sigma-algebra. Given a set Omega (the sample space), a sigma-algebra F on Omega is a collection of subsets satisfying three rules: the empty set belongs to F; if a set A belongs to F then its complement also belongs to F; and if A_1, A_2, A_3, ... is a countable collection of sets each belonging to F, then their union also belongs to F. The elements of F are called measurable sets or events.

The smallest sigma-algebra on Omega is the trivial one containing only the empty set and Omega itself. The largest is the power set of all subsets. The most important example in probability is the Borel sigma-algebra on the real line, generated by all open intervals — it contains all open sets, closed sets, compact sets, and countable intersections and unions thereof.

Why restrict to a sigma-algebra instead of taking all subsets? Because not every subset of the real line can be assigned a consistent notion of length. The Vitali set, constructed using the axiom of choice, is a classic non-measurable set: if it had a well-defined Lebesgue measure, that measure would have to simultaneously be 0 and positive. By working only within a sigma-algebra, we avoid such paradoxes entirely.
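On a finite sample space the closure axioms can be checked mechanically. The sketch below (the function name and the helper logic are ours, purely for illustration) generates the smallest sigma-algebra containing a given collection by repeatedly closing under complements and pairwise unions, which on a finite space exhausts countable unions:

```python
def generate_sigma_algebra(omega, generators):
    """Smallest sigma-algebra on a finite omega containing the generator sets.

    Close under complements and unions until nothing new appears; on a
    finite space this also yields closure under intersections.
    """
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(family)
        for a in current:
            c = omega - a                  # closure under complement
            if c not in family:
                family.add(c); changed = True
        for a in current:
            for b in current:
                u = a | b                  # closure under union
                if u not in family:
                    family.add(u); changed = True
    return family

# The sigma-algebra generated by the single set {1, 2} on {1, 2, 3, 4}
# is {empty, {1,2}, {3,4}, omega}: four elements.
F = generate_sigma_algebra({1, 2, 3, 4}, [{1, 2}])
```

Generating from the two singletons {1} and {2} on {1, 2, 3} instead produces all eight subsets, since the atoms {1}, {2}, {3} become separable.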

Probability Spaces

A probability space is a triple (Omega, F, P) where Omega is the sample space, F is a sigma-algebra on Omega, and P is a probability measure. A probability measure P is a function from F to the closed interval [0, 1] satisfying: non-negativity P(A) is at least 0 for all A in F; normalization P(Omega) = 1; and countable additivity — for any countable sequence of pairwise disjoint events A_1, A_2, ..., P of their union equals the sum of P(A_i).

From these three axioms (Kolmogorov's axioms) all familiar probability rules follow as theorems. The complement rule: P(A-complement) = 1 minus P(A). The monotone rule: if A is a subset of B then P(A) is at most P(B). The inclusion-exclusion principle for two events: P(A union B) = P(A) + P(B) minus P(A intersection B). Countable additivity is what distinguishes a probability measure from a finitely additive set function and is essential for the interchange of limits and probabilities.

Filtrations and Information

In stochastic processes, information accumulates over time. A filtration is an increasing family of sigma-algebras F_0 subset F_1 subset F_2 subset ... subset F. The sigma-algebra F_n represents the information available at time n. A process X_0, X_1, X_2, ... is adapted to the filtration if each X_n is F_n-measurable, meaning the value of X_n is determined by the information available at time n. The natural filtration of a process is the filtration generated by that process itself: F_n is generated by X_0, X_1, ..., X_n.

2. Random Variables as Measurable Functions

In the measure-theoretic framework, a random variable X is not a vague notion of a quantity that "takes random values." It is a precise mathematical object: a measurable function from the probability space (Omega, F, P) to the real numbers R equipped with the Borel sigma-algebra B(R). Measurability means that for every Borel set B in R, the preimage X-inverse(B) = the set of all omega in Omega with X(omega) in B belongs to F.

This condition ensures that events like "X is at most 3" or "X is in the interval (2, 5]" are genuine events with well-defined probabilities. In practice, a sufficient condition for measurability is continuity: every continuous function from R to R is Borel measurable. Monotone functions, indicator functions of Borel sets, and pointwise limits of sequences of measurable functions are all measurable.

Distribution of a Random Variable

The distribution (or law) of a random variable X is the pushforward measure mu_X defined on B(R) by mu_X(B) = P(X-inverse(B)) = P(X is in B). It is a probability measure on R that captures all statistical properties of X. Two random variables with the same distribution are called identically distributed, even if they live on different probability spaces. The cumulative distribution function (CDF) F_X(x) = P(X is at most x) determines the distribution completely.

A random variable is called discrete if its distribution is concentrated on a countable set, in which case it is described by a probability mass function. It is called absolutely continuous (or simply continuous) if its distribution is absolutely continuous with respect to Lebesgue measure, in which case the Radon-Nikodym theorem guarantees the existence of a probability density function f_X satisfying P(X is in B) = the integral of f_X over B.

Independence

Events A_1, ..., A_n are mutually independent if for every subset S of their indices, the probability of the intersection of A_i for i in S equals the product of P(A_i) for i in S. Random variables X_1, ..., X_n are independent if for all Borel sets B_1, ..., B_n, P(X_1 in B_1 and ... and X_n in B_n) = P(X_1 in B_1) times ... times P(X_n in B_n). Equivalently, the joint distribution equals the product of the marginal distributions. Independence is a condition far stronger than pairwise independence.
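The gap between pairwise and mutual independence already appears on a four-point space. In this minimal sketch (event names are ours), two fair bits and their XOR are pairwise independent, yet the three events fail mutual independence:

```python
from itertools import product

# Sample space: two independent fair bits, four equally likely outcomes.
outcomes = list(product([0, 1], repeat=2))

def prob(event):
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

def A(o): return o[0] == 1             # first bit is 1
def B(o): return o[1] == 1             # second bit is 1
def C(o): return (o[0] ^ o[1]) == 1    # XOR of the bits is 1

def joint(*events):
    return prob(lambda o: all(e(o) for e in events))

# Each pair multiplies: P(A and C) = P(A) P(C) = 1/4, and likewise for the others.
pairwise = all(joint(e, f) == prob(e) * prob(f)
               for e, f in [(A, B), (A, C), (B, C)])

# But the triple intersection is empty: X = 1, Y = 1 forces X xor Y = 0.
triple = joint(A, B, C)                # 0, while P(A) P(B) P(C) = 1/8
```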

3. Expectation as Lebesgue Integral

The expectation E[X] is defined as the Lebesgue integral of X with respect to the probability measure P. This definition is built in stages. First, for a simple random variable S = sum over i of c_i times the indicator of A_i (a finite linear combination of indicator functions of disjoint events), the integral is defined as the sum over i of c_i times P(A_i). Then for a non-negative random variable X, the integral is defined as the supremum of integrals of simple random variables that are at most X. Finally, for a general X, write X = X-plus minus X-minus and define E[X] = E[X-plus] minus E[X-minus] when at least one of these is finite.

The Lebesgue integral enjoys powerful convergence theorems not available for Riemann integrals. The Monotone Convergence Theorem: if 0 is at most X_1 is at most X_2 is at most ... and X_n converges to X pointwise, then E[X_n] converges to E[X]. The Dominated Convergence Theorem: if X_n converges to X almost surely and |X_n| is at most Y for all n with E[Y] finite, then E[X_n] converges to E[X]. These theorems justify many interchange of limit and integral operations that appear throughout probability and analysis.

Variance and Higher Moments

The k-th moment of X is E[X^k] when it exists. The variance is Var(X) = E[(X minus E[X])^2] = E[X^2] minus (E[X])^2. The standard deviation is the square root of the variance. For independent random variables, Var(X + Y) = Var(X) + Var(Y). The covariance of X and Y is Cov(X,Y) = E[(X minus E[X])(Y minus E[Y])] = E[XY] minus E[X] times E[Y], which vanishes when X and Y are independent (but not conversely). Jensen's inequality states that for a convex function phi, E[phi(X)] is at least phi(E[X]).

4. Convergence of Random Variables

Almost-Sure Convergence

X_n converges to X almost surely (a.s.) if there exists a set N of probability 0 such that for every omega not in N, the sequence X_n(omega) converges to X(omega) in the usual sense. Equivalently, P(lim X_n = X) = 1. This is the strongest of the four main convergence modes.

Convergence in Probability

X_n converges in probability to X if for every epsilon greater than 0, P(|X_n minus X| greater than epsilon) approaches 0 as n grows. A.s. convergence implies convergence in probability, but not vice versa. The classical counterexample is the "typewriter sequence" of indicator functions on [0,1].

L^p Convergence

X_n converges to X in L^p (p at least 1) if E[|X_n minus X|^p] approaches 0. L^2 convergence (mean-square convergence) is most common in applications. L^p convergence implies convergence in probability by Markov's inequality, and implies L^q convergence for all q at most p.

Convergence in Distribution

X_n converges in distribution to X if the CDF of X_n evaluated at x approaches the CDF of X at every continuity point x of the CDF of X. This is the weakest mode: it says only that the distributions (not the random variables themselves) converge. It is implied by convergence in probability and is equivalent to pointwise convergence of characteristic functions (Levy's continuity theorem).

Borel-Cantelli Lemmas

The Borel-Cantelli lemmas are fundamental tools for almost-sure statements. The first Borel-Cantelli lemma says: if the sum of P(A_n) over n is finite, then with probability 1 only finitely many of the events A_n occur. The second Borel-Cantelli lemma (requiring independence) says: if the events A_n are independent and the sum of P(A_n) is infinite, then with probability 1 infinitely many of the events A_n occur. Together these lemmas give a precise criterion for "infinitely often" events and are used in the proof of the strong law of large numbers.
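A quick simulation contrasts the two regimes. This is a sketch under a deliberate coupling (one uniform draw decides both events at each n, so the summable events are a subset of the non-summable ones); with P(A_n) = 1/n^2 the sum converges and only a handful of events ever occur, while with P(A_n) = 1/n occurrences keep accumulating like log N:

```python
import random
random.seed(0)

N = 1_000_000
count_sq = count_harm = 0      # occurrences of A_n under the two choices of P(A_n)
for n in range(1, N + 1):
    u = random.random()
    if u < 1 / n**2:           # sum of 1/n^2 is finite: first Borel-Cantelli
        count_sq += 1
    if u < 1 / n:              # sum of 1/n diverges: second Borel-Cantelli
        count_harm += 1

# Typical run: count_sq stays near sum 1/n^2 (about 1.6), count_harm near log N.
```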

5. Laws of Large Numbers

The laws of large numbers formalize the intuition that averaging many independent observations produces a reliable estimate of the true mean. Let X_1, X_2, ... be i.i.d. random variables with common mean mu. Define S_n = X_1 + ... + X_n and the sample mean X-bar_n = S_n divided by n.

Weak Law of Large Numbers

The WLLN states that X-bar_n converges in probability to mu. For every epsilon greater than 0, P(|X-bar_n minus mu| greater than epsilon) approaches 0 as n grows to infinity. A clean proof using Chebyshev's inequality requires only that Var(X_1) is finite: P(|X-bar_n minus mu| greater than epsilon) is at most Var(X_1) divided by (n times epsilon-squared), which approaches 0. With additional truncation arguments, the WLLN holds under the weaker assumption that E[|X_1|] is finite.
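The Chebyshev bound in this proof can be checked numerically. A minimal sketch for Uniform(0,1) samples (the sample sizes and tolerance are arbitrary choices):

```python
import random
random.seed(0)

n, eps, trials = 1000, 0.05, 2000
mu, var = 0.5, 1 / 12                  # Uniform(0,1): mean 1/2, variance 1/12

deviations = 0
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    if abs(xbar - mu) > eps:
        deviations += 1

freq = deviations / trials             # empirical P(|Xbar_n - mu| > eps)
chebyshev_bound = var / (n * eps**2)   # = 1/30; valid, though quite loose here
```

The empirical frequency is essentially 0 while the bound is about 0.033: Chebyshev is correct but far from tight, which is exactly why the exponential bounds of Section 14 matter.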

Strong Law of Large Numbers

The SLLN states almost-sure convergence: P(lim X-bar_n = mu) = 1. This is a dramatically stronger statement — the sample mean is exactly equal to mu with probability 1 in the limit, not just close to mu with high probability. The SLLN holds under the assumption that E[|X_1|] is finite; no assumption on variance is needed. The classic proof by Etemadi uses the Borel-Cantelli lemma applied to the event that |X-bar_n minus mu| exceeds epsilon, combined with a truncation argument to reduce to bounded variables.

The SLLN has profound consequences: it justifies Monte Carlo simulation (the empirical average of many samples converges a.s. to the true expectation), validates relative frequency as a model for probability in the long run, and provides the foundation for consistent statistical estimators.
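As a concrete instance of the Monte Carlo consequence, the sketch below estimates E[U^2] = 1/3 for U uniform on [0, 1]; by the SLLN the sample average converges almost surely to 1/3 as the number of samples grows (the sample sizes are arbitrary):

```python
import random
random.seed(42)

def mc_estimate(n):
    # average of g(U) = U^2 over n i.i.d. Uniform(0,1) samples
    return sum(random.random() ** 2 for _ in range(n)) / n

estimates = {n: mc_estimate(n) for n in (100, 10_000, 1_000_000)}
# SLLN: the estimates approach E[U^2] = 1/3 almost surely as n grows
```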

6. Central Limit Theorem

While the law of large numbers tells us that X-bar_n converges to mu, the central limit theorem (CLT) describes the fluctuations around that limit. Suppose X_1, X_2, ... are i.i.d. with mean mu and variance sigma-squared (both finite, sigma greater than 0). Define the standardized sum Z_n = (S_n minus n times mu) divided by (sigma times the square root of n). Then Z_n converges in distribution to a standard normal N(0,1).

This means: for any real numbers a less than b, P(a is at most Z_n is at most b) converges to the integral from a to b of the standard normal density. The CLT is remarkable because it does not depend on the shape of the distribution of X_1 — only on its mean and variance. A binomial, exponential, Poisson, or uniform distribution all produce the same limiting normal distribution after standardization.

Berry-Esseen Theorem

The CLT gives a qualitative statement (convergence in distribution) but does not say how fast. The Berry-Esseen theorem gives a quantitative bound: if X_1 has finite third moment rho = E[|X_1 minus mu|^3], then the sup over x of |P(Z_n is at most x) minus Phi(x)| (where Phi is the standard normal CDF) is at most C times rho divided by (sigma-cubed times the square root of n), where C is a universal constant (approximately 0.5). This provides the error rate of the normal approximation and shows it decays like 1 over the square root of n.
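For Bernoulli(1/2) summands everything is computable exactly, so the Berry-Esseen bound can be verified with no simulation. The sketch below uses the conservative constant C = 0.56 (a value known to be valid; sharper constants exist) and measures the Kolmogorov distance at both sides of every jump of the binomial CDF:

```python
import math

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n, p = 100, 0.5
sigma = math.sqrt(p * (1 - p))                  # std of one Bernoulli: 0.5
rho = p * (1 - p) * ((1 - p)**2 + p**2)         # E|X - p|^3 = 0.125

cdf = 0.0
sup_err = 0.0
for k in range(n + 1):
    z = (k - n * p) / (sigma * math.sqrt(n))    # standardized jump location
    sup_err = max(sup_err, abs(cdf - normal_cdf(z)))   # just below the jump at k
    cdf += math.comb(n, k) * p**k * (1 - p)**(n - k)
    sup_err = max(sup_err, abs(cdf - normal_cdf(z)))   # just above the jump at k

berry_esseen = 0.56 * rho / (sigma**3 * math.sqrt(n))  # = 0.056 here
```

The actual distance (about 0.04, dominated by the jump at the mean) sits below the theoretical bound, and both scale like 1 over the square root of n.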

Generalizations of the CLT

The Lindeberg CLT generalizes to triangular arrays of independent (but not necessarily identically distributed) random variables, requiring only the Lindeberg condition: the contribution of large terms is negligible. The multivariate CLT states that the properly standardized vector of sample means of a multivariate distribution converges in distribution to a multivariate normal. The functional CLT (Donsker's theorem) states that the linearly interpolated partial-sum process (scaled by 1 over the square root of n) converges in distribution to Brownian motion in the Skorokhod space of right-continuous functions with left limits.

7. Characteristic Functions and Moment Generating Functions

Characteristic Functions

The characteristic function of a random variable X is phi_X(t) = E[exp(i times t times X)] for t in R, where i is the imaginary unit. This is the Fourier transform of the distribution of X. The characteristic function always exists (since |exp(itX)| equals 1), uniquely determines the distribution, and is uniformly continuous. For independent random variables X and Y, the characteristic function of X + Y is phi_X(t) times phi_Y(t): sums of independent variables correspond to products of characteristic functions.

Levy's continuity theorem: X_n converges in distribution to X if and only if the characteristic function of X_n evaluated at t converges to that of X for every t. This equivalence makes characteristic functions the main tool for proving the CLT: expand the characteristic function of (X_1 minus mu) divided by sigma in a Taylor series around t = 0 to get 1 minus t^2/2 plus o(t^2); substitute t divided by the square root of n and raise to the n-th power to get (1 minus t^2/(2n) + o(1/n))^n, which converges to exp(minus t^2/2) — the characteristic function of N(0,1).
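The key limit in this argument, (1 minus t^2/(2n))^n converging to exp(minus t^2/2), is easy to check numerically at a fixed t (the value t = 1.7 is arbitrary):

```python
import math

t = 1.7
target = math.exp(-t * t / 2)          # characteristic function of N(0,1) at t
errors = [abs((1 - t * t / (2 * n)) ** n - target)
          for n in (10, 1_000, 100_000)]
# the approximation error shrinks roughly like 1/n
```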

Moment Generating Functions

The moment generating function (MGF) of X is M_X(t) = E[exp(t X)] for t in some neighborhood of 0 (when this expectation is finite). The MGF does not always exist — the Cauchy distribution has no MGF — but when it does, it determines all moments: the k-th derivative of M_X at t = 0 equals E[X^k]. The cumulant generating function is the log of the MGF: K_X(t) = log M_X(t). The first cumulant is the mean, the second cumulant is the variance, and higher cumulants measure non-Gaussianity. For independent X and Y, the MGF of (X+Y) equals the MGF of X times the MGF of Y, which translates to additivity of cumulants.

8. Conditional Expectation and Tower Property

Conditional expectation in the measure-theoretic sense is far more general than the elementary notion of conditioning on a single event. Given a probability space (Omega, F, P), an integrable random variable X, and a sub-sigma-algebra G of F, the conditional expectation E[X | G] is defined as the unique G-measurable random variable Y satisfying: for every G-measurable event A, the integral of Y over A equals the integral of X over A (with respect to P).

Existence follows from the Radon-Nikodym theorem: the function A maps to the integral of X over A is a finite signed measure on G, and it is absolutely continuous with respect to P restricted to G, so its Radon-Nikodym derivative exists. Uniqueness is almost-sure: any two versions of E[X | G] agree with probability 1.

Key Properties

The tower property (iterated conditioning): if H is a sub-sigma-algebra of G, then E[E[X | G] | H] = E[X | H]. In particular, E[E[X | G]] = E[X]. Taking out what is known: if Z is G-measurable, then E[Z X | G] = Z E[X | G]. Independence: if X is independent of G (meaning X is independent of every event in G), then E[X | G] = E[X] — knowing G gives no information about X.

In practice, E[X | G] is the best prediction of X given the information in G, in the sense that it minimizes E[(X minus Z)^2] over all G-measurable random variables Z (the projection interpretation in L^2). This connects conditional expectation to regression in statistics.
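On a finite space with G generated by a partition, E[X | G] is simply the cell-wise average, and both the tower property and the L^2 projection property can be verified exactly. A sketch in rational arithmetic (the six-point space and the values of X are arbitrary choices):

```python
from fractions import Fraction as Fr

# Omega = {0,...,5} with the uniform measure; G is generated by the
# partition {0,1,2} | {3,4,5}.
omega = list(range(6))
partition = [{0, 1, 2}, {3, 4, 5}]
X = {0: Fr(1), 1: Fr(2), 2: Fr(6), 3: Fr(0), 4: Fr(4), 5: Fr(5)}

def cond_exp(X, partition):
    # E[X|G] is constant on each cell, equal to the cell average.
    Y = {}
    for cell in partition:
        avg = sum(X[w] for w in cell) / len(cell)
        for w in cell:
            Y[w] = avg
    return Y

Y = cond_exp(X, partition)
EX = sum(X[w] for w in omega) / len(omega)
EY = sum(Y[w] for w in omega) / len(omega)     # tower: E[E[X|G]] = E[X]

def mse(Z):
    return sum((X[w] - Z[w]) ** 2 for w in omega) / len(omega)

# Any other G-measurable Z (constant on cells) does strictly worse in L^2.
Z = {w: Fr(2) for w in partition[0]} | {w: Fr(4) for w in partition[1]}
```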

9. Martingales and Optional Stopping

A stochastic process M_0, M_1, M_2, ... adapted to a filtration (F_n) is called a martingale if E[|M_n|] is finite for all n and E[M_{n+1} | F_n] = M_n for all n. The defining condition says that given current information, the best prediction of tomorrow's value is today's value — there is no drift. A supermartingale satisfies E[M_{n+1} | F_n] at most M_n (a favorable game for the house); a submartingale satisfies E[M_{n+1} | F_n] at least M_n.

Classic examples of martingales: a symmetric random walk S_n = X_1 + ... + X_n where each X_i is plus 1 or minus 1 with equal probability; the process S_n^2 minus n (the square of the walk minus time); the exponential martingale exp(lambda S_n) divided by (cosh lambda)^n, since E[exp(lambda X_i)] = cosh lambda; and conditional expectations E[Z | F_n] for any integrable random variable Z.

Martingale Convergence Theorem

Doob's martingale convergence theorem: a non-negative supermartingale (or more generally, an L^1-bounded martingale) converges almost surely to an integrable limit M_infinity. For example, the fortune of a gambler who plays a fair or unfavorable game and is not allowed to go into debt is a non-negative supermartingale, so with probability 1 it settles down to a limiting value; it cannot oscillate forever.

Optional Stopping Theorem

A stopping time T with respect to a filtration (F_n) is a random variable taking values in the non-negative integers (or infinity) such that the event T equals n belongs to F_n for every n — the decision to stop at time n depends only on information available at time n, not on the future.

Doob's optional stopping theorem (OST): let M be a martingale and T a stopping time. Under sufficient conditions (for example, T is bounded, or M is bounded, or the stopped process M_{min(n, T)} is uniformly integrable), we have E[M_T] = E[M_0]. Intuitively: no clever stopping strategy can beat a fair game. Applications:

  • In a symmetric random walk starting at position x, the expected time to first exit the interval [0, N] is x times (N minus x). Proof: apply OST to S_n^2 minus n.
  • The gambler's ruin probability (probability of reaching N before 0 starting from x) equals x divided by N for a symmetric walk. Proof: apply OST to S_n.
  • Wald's identity: for a stopping time T with finite expected value, E[S_T] = E[T] times E[X_1] when X_i are i.i.d. with finite mean.
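The first two applications can be checked by simulation. A sketch of the symmetric walk on [0, 10] started at x = 3 (the state space and starting point are arbitrary), where OST predicts a hitting probability of 3/10 and an expected exit time of 3 times 7 = 21:

```python
import random
random.seed(1)

def exit_trial(x, N):
    """Run a symmetric +-1 walk from x until it hits 0 or N."""
    steps = 0
    while 0 < x < N:
        x += random.choice((-1, 1))
        steps += 1
    return x == N, steps

x0, N, trials = 3, 10, 5000
wins = total_steps = 0
for _ in range(trials):
    won, steps = exit_trial(x0, N)
    wins += won
    total_steps += steps

p_hat = wins / trials           # OST on S_n:       predicts x/N      = 0.3
t_hat = total_steps / trials    # OST on S_n^2 - n: predicts x(N - x) = 21
```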

10. Brownian Motion

Standard Brownian motion (or the Wiener process) is the continuous-time analogue of the symmetric random walk. It is a stochastic process W_t indexed by t at least 0 characterized by four properties: W_0 = 0 almost surely; for any 0 at most s less than t, the increment W_t minus W_s is independent of all information up to time s (independent increments); W_t minus W_s has a normal distribution with mean 0 and variance t minus s (stationary Gaussian increments); and the paths t maps to W_t are continuous almost surely.

The existence of Brownian motion (that there is a probability space on which such a process exists) requires proof. One construction uses the Kolmogorov consistency theorem combined with Kolmogorov's continuity criterion. Another approach uses the Haar wavelet expansion of the paths.

Key Properties of Brownian Motion

  • Martingale: W_t is a martingale with respect to its natural filtration (the sigma-algebra generated by W_s for s at most t).
  • Quadratic variation: over any partition of [0, t] with mesh going to 0, the sum of (W_{t_{k+1}} minus W_{t_k})^2 converges in probability to t. This is written as the quadratic variation [W, W]_t = t.
  • Self-similarity: the process c times W evaluated at t divided by c-squared has the same distribution as W_t for any c greater than 0 (Brownian scaling).
  • Nowhere differentiable: with probability 1, the path W is not differentiable at any point, even though it is continuous everywhere.
  • The process W_t^2 minus t is a martingale; more generally, exp(theta W_t minus theta^2 t / 2) is a martingale for any real theta.

Brownian motion is the foundational process for stochastic calculus (Ito calculus), which provides the mathematical framework for stochastic differential equations arising in finance (Black-Scholes model), physics (Langevin equation), and biology (diffusion models).
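The quadratic variation property shows up directly in a simulation with Gaussian increments on a fine partition (a sketch; the step count is an arbitrary choice):

```python
import random
random.seed(2)

t_end, m = 1.0, 10_000
dt = t_end / m
# Brownian increments over a uniform partition: W_{t_{k+1}} - W_{t_k} ~ N(0, dt)
increments = [random.gauss(0.0, dt ** 0.5) for _ in range(m)]

qv = sum(dw * dw for dw in increments)    # quadratic variation concentrates at t = 1
W_t = sum(increments)                     # terminal value: W_1 ~ N(0, 1)
```

The sum of squared increments lands near t = 1 with fluctuations of order 1 over the square root of m, while for any differentiable path the same sum would vanish as the mesh shrinks.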

11. Poisson Processes

The Poisson process is the canonical model for random arrivals over time. A Poisson process with rate lambda greater than 0 is a counting process N_t (the number of arrivals in the interval [0, t]) satisfying: N_0 = 0; the process has independent increments; and for any small interval of length h, the probability of exactly one arrival is lambda times h plus o(h) and the probability of two or more arrivals is o(h). From these conditions it follows that N_t minus N_s has a Poisson distribution with mean lambda times (t minus s) for any s less than t.

An equivalent characterization: define the inter-arrival times T_1, T_2, ... as the times between successive arrivals. The Poisson process is characterized by the property that T_1, T_2, ... are i.i.d. exponential random variables with mean 1 divided by lambda. The exponential distribution is the unique continuous distribution with the memoryless property: P(T greater than s + t | T greater than s) = P(T greater than t).
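The interarrival characterization gives an immediate way to simulate the process and check the Poisson signature that mean and variance of the count agree (a sketch; the rate and time horizon are arbitrary choices):

```python
import random
random.seed(3)

lam, horizon = 2.0, 5.0    # E[N_horizon] = lam * horizon = 10

def poisson_count(lam, horizon):
    # Accumulate i.i.d. Exponential(lam) inter-arrival times up to the horizon.
    t, count = 0.0, 0
    while True:
        t += random.expovariate(lam)
        if t > horizon:
            return count
        count += 1

counts = [poisson_count(lam, horizon) for _ in range(4000)]
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
# Poisson signature: sample mean and sample variance both near 10
```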

Superposition and Thinning

Superposition: the sum of two independent Poisson processes with rates lambda_1 and lambda_2 is a Poisson process with rate lambda_1 + lambda_2. Thinning (splitting): if each arrival of a Poisson(lambda) process is independently kept with probability p and discarded with probability 1 minus p, the kept arrivals form a Poisson(lambda times p) process and the discarded arrivals form an independent Poisson(lambda times (1 minus p)) process.

The Poisson process also arises as a limit of binomial point processes: divide [0, t] into n subintervals and let each independently contain an arrival with probability lambda times t divided by n; as n grows to infinity, the resulting counting process converges to a Poisson(lambda) process. This Poisson limit theorem underlies the use of Poisson distributions for modeling rare events (insurance claims, radioactive decay, server request arrivals).

12. Markov Chains: Detailed Balance and Mixing Times

A discrete-time Markov chain is a sequence of random variables X_0, X_1, X_2, ... taking values in a countable state space S, satisfying the Markov property: P(X_{n+1} = j | X_n = i, X_{n-1}, ..., X_0) = P(X_{n+1} = j | X_n = i). The conditional distribution of the future given the present is independent of the past. A time-homogeneous chain is described by a transition matrix P with entries P_ij = P(X_{n+1} = j | X_n = i).

Stationary Distributions

A probability distribution pi on S is stationary (or invariant) for the chain if pi times P = pi, meaning the sum over i of pi(i) times P_ij equals pi(j) for all j. If the chain starts in the stationary distribution, it stays in it forever.

Detailed Balance

A distribution pi satisfies detailed balance with respect to P if pi(i) times P_ij equals pi(j) times P_ji for all states i and j. Detailed balance is also called the reversibility condition because it is equivalent to the chain being reversible: the process run forward in time and the process run backward in time have the same distribution. Any distribution satisfying detailed balance is stationary (summing over i gives stationarity), but the converse is false — a stationary distribution need not satisfy detailed balance. Detailed balance is the key condition verified in the Metropolis-Hastings MCMC algorithm: the algorithm constructs a chain whose stationary distribution is the target, by design.
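For a Metropolis chain with a symmetric proposal, detailed balance can be verified exactly in rational arithmetic. The sketch below builds the transition matrix for an arbitrary unnormalized target on four states with a plus-or-minus-1 proposal (all names and the choice of weights are illustrative):

```python
from fractions import Fraction as Fr

weights = [Fr(1), Fr(2), Fr(4), Fr(1)]     # unnormalized target distribution
Z = sum(weights)
pi = [w / Z for w in weights]
n = len(pi)

# Metropolis transition matrix with a symmetric +-1 proposal; the chain stays
# put on rejection or when the proposal falls outside the state space.
P = [[Fr(0)] * n for _ in range(n)]
for i in range(n):
    for j in (i - 1, i + 1):
        if 0 <= j < n:
            P[i][j] = Fr(1, 2) * min(Fr(1), pi[j] / pi[i])
    P[i][i] = 1 - sum(P[i])

# detailed balance pi_i P_ij = pi_j P_ji holds exactly, hence pi is stationary
db = all(pi[i] * P[i][j] == pi[j] * P[j][i]
         for i in range(n) for j in range(n))
stationary = all(sum(pi[i] * P[i][j] for i in range(n)) == pi[j]
                 for j in range(n))
```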

Convergence and Mixing Times

An irreducible, aperiodic Markov chain on a finite state space has a unique stationary distribution pi, and P to the n-th power at entry (i,j) converges to pi(j) as n grows for every starting state i. The mixing time is a quantitative measure of how quickly this convergence occurs. Define the total variation distance between two distributions mu and nu on S as the maximum over all events A of |mu(A) minus nu(A)|, equivalently half the sum over states i of |mu(i) minus nu(i)|. The mixing time t_mix(epsilon) is the smallest n such that the total variation distance from P^n(x, dot) to pi is at most epsilon for all starting states x.

The spectral gap (1 minus the second-largest eigenvalue modulus of P) controls the mixing time: for reversible chains, t_mix(epsilon) is of order (1 over the spectral gap) times log(1 over epsilon), up to a factor of log(1 over the minimum stationary probability) in the upper bound. A large spectral gap means fast mixing (the chain converges quickly to its stationary distribution); a small spectral gap indicates slow mixing and poor performance for MCMC methods.
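For a small chain the convergence of P^n to pi and its geometric rate are directly computable. In this two-state sketch (the transition probabilities are arbitrary choices) the second eigenvalue is 1 minus 0.1 minus 0.2 = 0.7, so the total variation distance decays like 0.7^n:

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def tv(mu, nu):
    # total variation distance: half the l1 distance between distributions
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2 / 3, 1 / 3]        # solves pi P = pi

Pn = P
dists = []                 # worst-case TV distance to pi after n steps
for n in range(1, 51):
    dists.append(max(tv(row, pi) for row in Pn))
    Pn = matmul(Pn, P)
```

After 50 steps the distance is below 10^-6, matching the geometric rate 0.7^n predicted by the spectral gap 0.3.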

13. Large Deviations and Cramer's Theorem

The law of large numbers says that X-bar_n converges to mu; the CLT describes fluctuations of order 1 over the square root of n. Large deviation theory addresses a fundamentally different regime: the probability that X-bar_n deviates from mu by a fixed amount x greater than mu. This probability goes to 0 exponentially fast in n, and large deviation theory identifies the exponent precisely.

Cramer's Theorem

Assume X_1, X_2, ... are i.i.d. with a cumulant generating function that is finite in a neighborhood of 0. The rate function is I(x) = sup over all real t of (t times x minus log E[exp(t X_1)]). This is the Legendre-Fenchel transform (convex conjugate) of the cumulant generating function K(t) = log E[exp(t X_1)].

Cramer's theorem states: for any x greater than mu, lim as n grows of (1/n) times log P(X-bar_n is at least x) = negative I(x). For x less than mu, lim of (1/n) times log P(X-bar_n is at most x) = negative I(x). The rate function I is non-negative, convex, and equals 0 only at x = mu. Thus P(X-bar_n is at least x) is approximately exp(minus n times I(x)) for large n — the probability decays exponentially with rate I(x).

The rate function has a clear interpretation: I(x) is the minimum amount of "work" the sample mean must do to reach x. The tilted distribution that makes x the typical value has log-likelihood ratio exactly I(x) per observation — this is the change-of-measure (importance sampling) interpretation. Cramer's theorem extends to the sample path level via the Sanov theorem (large deviations for empirical measures) and Schilder's theorem (large deviations for Brownian motion paths).
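For fair coins everything in Cramer's theorem is explicit: I(x) = log 2 + x log x + (1 minus x) log(1 minus x), and the exact binomial tail can be compared against it (a sketch; the choice x = 0.7 is arbitrary). The Chernoff bound guarantees the empirical rate always sits above I(x), and it approaches I(x) as n grows:

```python
import math

def rate_hat(n, x):
    """Empirical rate -(1/n) log P(S_n >= x n) for n fair coin flips."""
    k0 = round(x * n)                  # x n is an integer for these choices
    tail = sum(math.comb(n, k) for k in range(k0, n + 1))
    return -(math.log(tail) - n * math.log(2)) / n

x = 0.7
# rate function for Bernoulli(1/2): about 0.082 at x = 0.7
I = math.log(2) + x * math.log(x) + (1 - x) * math.log(1 - x)

r100, r400 = rate_hat(100, x), rate_hat(400, x)
# Chernoff gives P <= exp(-n I), so r_n >= I for every n, with r_n -> I.
```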

14. Concentration Inequalities

Concentration inequalities provide explicit upper bounds on the probability that a random variable deviates from its mean or median. They are essential in probability, statistics, and theoretical computer science for analyzing algorithms, estimators, and stochastic systems.

Markov's Inequality

For any non-negative random variable X and any a greater than 0, P(X is at least a) is at most E[X] divided by a. This is the simplest and weakest bound, requiring only finite mean. It is proved in one line: a times P(X is at least a) is at most E[X times the indicator that X is at least a] is at most E[X]. Markov's inequality is the starting point from which all other concentration inequalities are derived.

Chebyshev's Inequality

For any random variable X with finite variance and any k greater than 0, P(|X minus E[X]| is at least k times sigma) is at most 1 divided by k-squared, where sigma is the standard deviation. More generally, P(|X minus mu| is at least t) is at most Var(X) divided by t-squared. Chebyshev's inequality follows immediately from Markov's inequality applied to (X minus mu)-squared. It gives polynomial decay in t — much weaker than the exponential decay achievable under stronger assumptions.

Chernoff Bounds

The Chernoff method applies Markov's inequality to exp(t X) for a free parameter t greater than 0 to get P(X is at least a) is at most E[exp(t X)] divided by exp(t a) = exp(log M_X(t) minus t a), then optimizes over t. The resulting bound is exp(minus I(a)) where I(a) = sup over t of (t a minus log M_X(t)) is the Legendre transform of the cumulant generating function — exactly the Cramer rate function. For a sum S_n of independent Bernoulli(p) variables and delta greater than 0: P(S_n is at least (1 + delta) n p) is at most exp(minus n p times ((1 + delta) log(1 + delta) minus delta)), giving exponentially sharp bounds for binomial tails.

Hoeffding's Inequality

Let X_1, ..., X_n be independent random variables with X_i in [a_i, b_i] almost surely. Let S_n = X_1 + ... + X_n. Then for any t greater than 0, P(S_n minus E[S_n] is at least t) is at most exp(minus 2 t-squared divided by the sum of (b_i minus a_i)-squared), and similarly for the lower tail. Hoeffding's inequality requires only that variables are bounded — no moment assumptions beyond this. It is particularly clean because the bound depends on the variables only through their ranges. For identically distributed variables bounded in [0, 1]: P(X-bar_n minus mu is at least t) is at most exp(minus 2 n t-squared).
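The hierarchy of the bounds so far is concrete for a fair-coin sum, where the exact tail is computable (a sketch; n = 100 and t = 20 are arbitrary choices):

```python
import math

n, t = 100, 20     # S_n ~ Binomial(100, 1/2); deviation t above the mean n/2 = 50
exact = sum(math.comb(n, k) for k in range(n // 2 + t, n + 1)) / 2**n

markov = (n / 2) / (n / 2 + t)           # Markov on S_n itself: E[S_n] / a
chebyshev = (n / 4) / t**2               # Var(S_n) / t^2 (two-sided bound)
hoeffding = math.exp(-2 * t**2 / n)      # all ranges b_i - a_i = 1

# exact <= Hoeffding <= Chebyshev <= Markov: each extra moment assumption
# buys orders of magnitude in the bound.
```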

McDiarmid's Inequality (Bounded Differences)

Let X_1, ..., X_n be independent random variables and let f(X_1, ..., X_n) be a real-valued function satisfying the bounded differences condition: changing any single coordinate X_i while fixing all others changes f by at most c_i. Then P(f minus E[f] is at least t) is at most exp(minus 2 t-squared divided by the sum of c_i-squared). McDiarmid's inequality generalizes Hoeffding's inequality from linear functions (sums) to arbitrary functions with bounded influence per variable. It is widely used in learning theory (generalization bounds) and combinatorics (concentration of the chromatic number).

Frequently Asked Questions

What is a probability space in the measure-theoretic sense?

A probability space is a triple (Omega, F, P) where Omega is the sample space (all possible outcomes), F is a sigma-algebra of events (the measurable subsets), and P is a probability measure satisfying countable additivity and P(Omega) = 1. The sigma-algebra specifies which subsets can be assigned a probability, resolving paradoxes from naive set theory. This framework is due to Kolmogorov (1933) and provides the rigorous foundation for all of modern probability.

How is a random variable defined in measure-theoretic probability?

A random variable X on (Omega, F, P) is a measurable function from Omega to R: for every Borel set B in R, the preimage X-inverse(B) belongs to F. Measurability ensures that events like "X is at most 5" are genuine events with well-defined probabilities. The distribution of X is the pushforward measure P(X-inverse(B)). This definition unifies discrete and continuous random variables: both are special cases of measurable functions, differing only in whether the distribution is concentrated on a countable set or is absolutely continuous with respect to Lebesgue measure.

What is expectation as a Lebesgue integral and when does it exist?

For a random variable X on (Omega, F, P), E[X] is the Lebesgue integral of X with respect to P. The integral is built in stages: simple functions (finite linear combinations of indicator functions) have integral equal to the sum of coefficients times probabilities; non-negative X has integral equal to the supremum over simple functions at most X; general X is split into X-plus minus X-minus. E[X] exists and is finite exactly when E[|X|] is finite (X in L^1). For discrete X this gives the usual sum; for continuous X with density f it gives the usual integral of x times f(x) dx — both are special cases. The Dominated Convergence Theorem and Monotone Convergence Theorem enable interchange of limits and expectations in ways the Riemann integral cannot.
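
The first stage of the construction can be mirrored in a tiny sketch (finite case only, with made-up values): the integral of a simple function is the sum of its coefficients weighted by the probabilities of the corresponding events:

```python
from fractions import Fraction

def integral_simple(terms):
    # Lebesgue integral of a simple function sum_k a_k * 1_{A_k}
    # with disjoint events A_k: sum_k a_k * P(A_k).
    return sum(a * p for a, p in terms)

# X takes value 1 with probability 1/2, 2 with probability 1/3, 5 with probability 1/6.
ex = integral_simple([(Fraction(1), Fraction(1, 2)),
                      (Fraction(2), Fraction(1, 3)),
                      (Fraction(5), Fraction(1, 6))])
# For a simple function this coincides with the elementary formula sum x * P(X = x).
```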

What are the four main modes of convergence and how are they related?

(1) Almost-sure (a.s.) convergence: the sequence converges pointwise on a set of probability 1. (2) Convergence in probability: for every epsilon greater than 0, P(|X_n minus X| greater than epsilon) goes to 0. (3) L^p convergence: E[|X_n minus X|^p] goes to 0. (4) Convergence in distribution: CDFs converge at all continuity points of the limit CDF. The hierarchy: a.s. implies probability; L^p implies probability; probability implies distribution. None of the converses hold in general. A.s. convergence and L^p convergence are incomparable in general — neither implies the other without additional conditions (such as uniform integrability for L^1 convergence from a.s. convergence).
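
The textbook example X_n = n times the indicator of [0, 1/n], on [0, 1] with the uniform measure, separates the modes: X_n converges to 0 almost surely and in probability, yet E[|X_n|] = 1 for every n, so there is no L^1 convergence. A minimal sketch of the two quantities:

```python
def prob_exceeds(n, eps=0.5):
    # P(|X_n| > eps) for X_n = n * 1_{[0, 1/n]} and any eps in (0, n):
    # the uniform measure of [0, 1/n], which tends to 0 as n grows
    # (convergence in probability, and in fact almost surely).
    return 1.0 / n

def l1_norm(n):
    # E[|X_n|] = n * P([0, 1/n]) = 1 for every n: the mass escapes
    # upward on a shrinking set, so L^1 convergence fails.
    return n * (1.0 / n)
```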

What is the difference between the weak and strong law of large numbers?

Both concern the sample mean X-bar_n = (X_1 + ... + X_n) / n for i.i.d. variables with mean mu. The weak law (WLLN) says X-bar_n converges in probability to mu: for every epsilon, P(|X-bar_n minus mu| greater than epsilon) goes to 0. The strong law (SLLN) says X-bar_n converges almost surely to mu: P(X-bar_n converges to mu) = 1, meaning that outside a set of outcomes of probability 0, the sequence of sample means genuinely converges to mu. The SLLN implies the WLLN. Both laws require only a finite mean; in particular, no variance assumption is needed for the SLLN in Etemadi's proof. The distinction matters for simulations and algorithms where one needs guarantees that hold simultaneously over all large n, not just for each fixed n.
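
A one-path simulation (seed and checkpoints are arbitrary choices) illustrating the strong law: along a single sequence of fair-coin flips, the running sample mean settles down to mu = 0.5:

```python
import random

random.seed(2)
mu = 0.5
n_max = 100_000
s = 0
running = {}  # running sample mean recorded at a few checkpoints
for i in range(1, n_max + 1):
    s += random.random() < 0.5
    if i in (100, 10_000, 100_000):
        running[i] = s / i
# Along this single path the deviation |running[i] - mu| shrinks,
# roughly on the 1/sqrt(i) scale suggested by the CLT.
```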

How does the CLT work and what is the role of characteristic functions?

The CLT says the standardized sum Z_n = (S_n minus n mu) divided by (sigma times sqrt(n)) converges in distribution to N(0,1) when X_i are i.i.d. with finite mean mu and variance sigma-squared. The characteristic function proof: the characteristic function of Z_n at t equals the n-th power of the characteristic function of (X minus mu) divided by sigma, evaluated at t over sqrt(n). Expand phi of a single centered, unit-variance variable in a Taylor series: phi(s) = 1 minus s-squared over 2 plus o(s-squared). With s = t / sqrt(n): phi(t/sqrt(n)) = 1 minus t-squared/(2n) plus o(1/n). Raise to the n-th power: the limit as n grows is exp(minus t-squared / 2), which is the characteristic function of N(0,1). By the Levy continuity theorem, convergence of characteristic functions implies convergence in distribution.
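
This limit can be checked numerically (a sketch using Rademacher, i.e. plus-or-minus 1, variables, which are already centered with unit variance): their characteristic function is cos(s), so the standardized sum has characteristic function cos(t/sqrt(n)) raised to the n-th power, which should approach exp(minus t-squared/2):

```python
import math

def phi_Zn(t, n):
    # Characteristic function of Z_n = S_n / sqrt(n) for n i.i.d. Rademacher
    # variables: phi_X(s) = cos(s), hence phi_{Z_n}(t) = cos(t / sqrt(n))^n.
    return math.cos(t / math.sqrt(n)) ** n

t = 1.3
limit = math.exp(-t * t / 2)  # characteristic function of N(0, 1) at t
errors = [abs(phi_Zn(t, n) - limit) for n in (10, 100, 10_000)]
# The errors shrink as n grows, consistent with the Levy continuity argument.
```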

What is conditional expectation and what is the tower property?

Given a probability space (Omega, F, P), an integrable X, and a sub-sigma-algebra G, the conditional expectation E[X | G] is the unique G-measurable random variable Y such that for every A in G, the integral of Y over A equals the integral of X over A. Existence follows from the Radon-Nikodym theorem. The tower property (iterated conditioning): if H is a sub-sigma-algebra of G, then E[E[X | G] | H] = E[X | H]. In particular, E[E[X | G]] = E[X]. This says that coarser conditioning "wins": averaging first over fine information G and then over coarse information H gives the same result as conditioning directly on H. The tower property underlies martingale theory, sequential statistical decisions, and the computation of expectations by conditioning.
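
On a finite sample space, conditional expectation given a partition-generated sigma-algebra is just block averaging, and the tower property can be verified exactly (a sketch with made-up values, using exact fraction arithmetic):

```python
from fractions import Fraction

# Finite probability space: Omega = {0, 1, 2, 3} with the uniform measure.
P = {w: Fraction(1, 4) for w in range(4)}
X = {0: 1, 1: 3, 2: 2, 3: 6}  # an integrable random variable

def cond_exp(Z, partition):
    # E[Z | G] for the sigma-algebra generated by a finite partition:
    # constant on each block, equal to the probability-weighted block average.
    Y = {}
    for block in partition:
        pb = sum(P[w] for w in block)
        avg = sum(Fraction(Z[w]) * P[w] for w in block) / pb
        for w in block:
            Y[w] = avg
    return Y

G = [{0, 1}, {2, 3}]     # finer partition (more information)
H = [{0, 1, 2, 3}]       # trivial sigma-algebra (no information)

tower = cond_exp(cond_exp(X, G), H)  # E[E[X | G] | H]
direct = cond_exp(X, H)              # E[X | H], here the constant E[X]
```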

What is a martingale and what does the optional stopping theorem say?

A martingale M_n adapted to a filtration satisfies E[M sub n+1 given the filtration at n] equals M sub n: given current information, the best prediction of the next value is the current value. It is a fair game. The optional stopping theorem: if T is a stopping time and one of three conditions holds — (i) T is bounded, (ii) T is almost surely finite and the stopped process is bounded, or (iii) the stopped process is uniformly integrable — then E[M_T] = E[M_0]. This says no clever stopping rule (based on the history of the game) can change the expected value from the starting value in a fair game. Classic applications: gambler's ruin probabilities, expected hitting times for random walks, and sequential hypothesis testing (sequential probability ratio test).
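
Gambler's ruin makes the theorem concrete (a simulation sketch with arbitrary seed and trial count): the simple symmetric walk stopped on first hitting 0 or N is a bounded stopped martingale, so E[M_T] = start, which forces P(hit N before 0) = start / N:

```python
import random

random.seed(3)

def ruin_prob(start, N, trials=20_000):
    # Estimate P(the symmetric walk started at `start` hits N before 0).
    # The optional stopping identity E[M_T] = start gives the exact answer
    # 0 * P(ruin) + N * P(win) = start, i.e. P(win) = start / N.
    wins = 0
    for _ in range(trials):
        pos = start
        while 0 < pos < N:
            pos += 1 if random.random() < 0.5 else -1
        wins += (pos == N)
    return wins / trials

p_hat = ruin_prob(3, 10)  # theory predicts 3/10 = 0.3
```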

What is Cramer's theorem and what does it tell us about rare events?

Cramer's theorem characterizes the exponential rate of decay of P(X-bar_n is at least x) for x greater than mu. The rate function is I(x) = sup over t of (t x minus log E[exp(t X_1)]), the Legendre-Fenchel transform of the cumulant generating function. The theorem says lim (1/n) log P(X-bar_n is at least x) = negative I(x). So P(X-bar_n is at least x) is approximately exp(negative n I(x)) — exponential decay with rate I(x). The rate function is non-negative and equals 0 only at mu. For Bernoulli(p) variables, I(x) is the Kullback-Leibler divergence from Bernoulli(x) to Bernoulli(p). Cramer's theorem is the model result for all of large deviation theory, which also covers continuous-time processes, Markov chains, and empirical measures.
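
For Bernoulli(p), the Legendre-Fenchel transform can be checked against the closed-form KL divergence (a numerical sketch; the grid search over t is a crude stand-in for the exact supremum):

```python
import math

def rate_bernoulli(x, p):
    # Closed form of the Cramer rate function for Bernoulli(p):
    # the KL divergence D(Bernoulli(x) || Bernoulli(p)).
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

def legendre(x, p, t_max=10.0, step=0.001):
    # sup over t of (t x - log E[exp(t X_1)]) for X_1 ~ Bernoulli(p),
    # approximated on a grid; the cumulant generating function is
    # log(1 - p + p e^t).
    best = float("-inf")
    steps = int(2 * t_max / step)
    for i in range(steps + 1):
        t = -t_max + i * step
        cgf = math.log(1 - p + p * math.exp(t))
        best = max(best, t * x - cgf)
    return best

i_exact = rate_bernoulli(0.7, 0.5)  # about 0.0823
i_grid = legendre(0.7, 0.5)         # should agree to grid accuracy
```

So for a fair coin, P(X-bar_n is at least 0.7) decays like exp(minus 0.0823 n), far faster than any Chebyshev-style polynomial bound suggests.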

What are the main concentration inequalities and when is each appropriate?

Markov's inequality: P(X is at least a) is at most E[X] / a for non-negative X. Use when you have only mean information. Gives weak (polynomial) bounds. Chebyshev's inequality: P(|X minus mu| is at least t) is at most Var(X) / t-squared. Use when you have mean and variance. Still polynomial. Chernoff bounds: P(X is at least a) is at most exp(minus I(a)) where I is the rate function from the MGF. Use when you have the full MGF. Gives exponentially decaying bounds whose exponent is tight, by Cramer's theorem. Hoeffding's inequality: for independent bounded variables, P(S_n minus E[S_n] is at least t) is at most exp(minus 2 t-squared / sum of (b_i minus a_i)-squared). Use when variables are bounded but distributions may differ. McDiarmid's inequality: for arbitrary functions with bounded differences in each coordinate, gives the same exp(minus 2 t-squared / sum of c_i-squared) form. Use for non-linear functions of independent variables — graph properties, empirical risk, etc.
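
The gap between the polynomial and exponential bounds is easy to see on one concrete tail event (numbers are illustrative only): the mean of n = 100 fair Bernoulli variables deviating upward by t = 0.1:

```python
import math

n, t = 100, 0.1
mu, var = 0.5, 0.25 / n               # mean and variance of the sample mean

markov = mu / (mu + t)                # Markov on the non-negative sample mean
chebyshev = var / (t * t)             # Chebyshev, two-sided deviation
hoeffding = math.exp(-2 * n * t * t)  # Hoeffding, variables bounded in [0, 1]
# markov ~ 0.833, chebyshev = 0.25, hoeffding = exp(-2) ~ 0.135;
# as n grows, only the Hoeffding bound improves exponentially.
```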
