
Probability Theory

A rigorous treatment of modern probability: from Kolmogorov axioms and sigma-algebras through distributions, the law of large numbers, the central limit theorem, Markov chains, and convergence theory.

1. Foundations: Sample Spaces and Sigma-Algebras

Sample Space

The sample space (denoted Omega) is the set of all possible outcomes of a random experiment. Every probability model begins here.

  • Discrete sample space: Omega is finite or countably infinite. Example: rolling a die, Omega = {1, 2, 3, 4, 5, 6}.
  • Continuous sample space: Omega is uncountably infinite. Example: measuring a voltage, Omega = the real line R.

Events

An event is a subset A of Omega to which we wish to assign a probability. Not every subset is necessarily an event — this is where sigma-algebras become essential.

Sigma-Algebras (Sigma-Fields)

A sigma-algebra (or sigma-field) F on Omega is a collection of subsets satisfying three properties:

  1. Non-empty: Omega is in F.
  2. Closed under complementation: If A is in F, then the complement of A (denoted A^c) is also in F.
  3. Closed under countable unions: If A_1, A_2, ... are in F, then the union of all A_i is also in F.

The trivial sigma-algebra is just {empty set, Omega}. The power set of Omega — all subsets — is always a sigma-algebra. For continuous sample spaces the most important sigma-algebra is the Borel sigma-algebra on R, generated by all open intervals.
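For a finite Omega the closure properties can be checked mechanically, since closure under countable unions reduces to closure under pairwise unions. A minimal Python sketch (the helper is_sigma_algebra is illustrative, not a library function):

```python
omega = {1, 2, 3, 4}

def is_sigma_algebra(omega, family):
    """Check the sigma-algebra axioms on a finite sample space.

    For finite Omega, closure under countable unions reduces to
    closure under pairwise unions.
    """
    sets = {frozenset(s) for s in family}
    if frozenset(omega) not in sets:
        return False                      # must contain Omega
    for a in sets:
        if frozenset(omega) - a not in sets:
            return False                  # closed under complementation
    for a in sets:
        for b in sets:
            if a | b not in sets:
                return False              # closed under (finite) unions
    return True

trivial = [set(), omega]                     # the trivial sigma-algebra
generated = [set(), {1, 2}, {3, 4}, omega]   # generated by the event {1, 2}
not_closed = [set(), {1}, omega]             # missing the complement {2, 3, 4}
```

Here trivial and generated pass all three checks, while not_closed fails closure under complementation.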

Probability Space

Definition

A probability space is a triple (Omega, F, P) where Omega is the sample space, F is a sigma-algebra on Omega, and P is a probability measure on F satisfying Kolmogorov's axioms.

2. Probability Axioms and Measure Theory

Kolmogorov's Axioms (1933)

Andrey Kolmogorov placed probability on a rigorous axiomatic foundation in 1933. Given a probability space (Omega, F, P):

Axiom 1 — Non-negativity

P(A) >= 0 for all A in F

Axiom 2 — Normalization

P(Omega) = 1

Axiom 3 — Countable Additivity (sigma-additivity)

If A_1, A_2, ... are pairwise disjoint events in F, then P(union of A_i) = sum of P(A_i)

Derived Properties

All other standard probability rules follow from these three axioms:

  • Complement Rule: P(A^c) = 1 - P(A)
  • Empty Set: P(empty set) = 0
  • Monotonicity: if A is a subset of B, then P(A) <= P(B)
  • Addition Rule: P(A union B) = P(A) + P(B) - P(A intersect B)
  • Bonferroni Inequality: P(A union B) <= P(A) + P(B)
  • Total Probability: P(B) = sum over i of P(B|A_i) * P(A_i)
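Each derived rule can be verified directly on a small model. A Python sketch using a fair die with the uniform measure P(A) = |A| / 6 (the events A and B are arbitrary examples):

```python
from fractions import Fraction

# Fair die with the uniform measure: P(A) = |A| / 6
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}   # "roll is even"
B = {4, 5, 6}   # "roll is at least four"

complement_rule = P(omega - A) == 1 - P(A)
addition_rule = P(A | B) == P(A) + P(B) - P(A & B)
monotonicity = P({4, 6}) <= P(A)          # {4, 6} is a subset of A
bonferroni = P(A | B) <= P(A) + P(B)
```

Exact rational arithmetic (Fraction) avoids any floating-point fuzz in the comparisons.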

Probability as a Measure

A probability measure is a special case of a general measure (from measure theory) with the added constraint that the total measure is 1. This connection to Lebesgue measure theory allows powerful analytic tools — Lebesgue integration, dominated convergence, monotone convergence — to be applied directly to probability problems.

3. Conditional Probability, Bayes, and Independence

Conditional Probability

The conditional probability of A given B is defined whenever P(B) > 0:

P(A | B) = P(A intersect B) / P(B)

Conditional probability satisfies all three Kolmogorov axioms — it is itself a valid probability measure on F, restricted to the "universe" B.

Chain Rule (Multiplication Rule)

P(A intersect B) = P(A | B) * P(B)

P(A_1 intersect ... intersect A_n) = P(A_1) * P(A_2|A_1) * P(A_3|A_1,A_2) * ...

Law of Total Probability

If (A_1, ..., A_n) is a partition of Omega (pairwise disjoint, union = Omega, all with positive probability), then for any event B:

P(B) = sum from i=1 to n of P(B | A_i) * P(A_i)

Bayes' Theorem

P(A_i | B) = P(B | A_i) * P(A_i) / (sum over j of P(B | A_j) * P(A_j))

Where P(A_i) is the prior, P(B | A_i) is the likelihood, and P(A_i | B) is the posterior. Bayes' theorem is the mathematical basis for Bayesian inference.

Example — Medical Testing

A disease affects 1% of the population. A test is 99% sensitive (true positive rate) and 95% specific (true negative rate, so 5% false positive). If a patient tests positive, what is the probability they actually have the disease?

P(Disease | Positive) = (0.99 * 0.01) / (0.99 * 0.01 + 0.05 * 0.99) = 0.0099 / (0.0099 + 0.0495) = 0.0099 / 0.0594 approx 0.167

Only about 16.7% — despite a seemingly accurate test, low prevalence dominates.
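The computation above can be wrapped in a short Python helper (the function name posterior and its parameter names are illustrative):

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive) via Bayes' theorem.

    The denominator is the law of total probability over
    the partition {disease, no disease}.
    """
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

# 1% prevalence, 99% sensitivity, 95% specificity
p = posterior(prior=0.01, sensitivity=0.99, specificity=0.95)
```

With these numbers p is about 0.167, matching the hand calculation.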

Independence

Events A and B are independent if and only if:

P(A intersect B) = P(A) * P(B)

Equivalently (when P(B) > 0): P(A | B) = P(A) — knowing B gives no information about A.

Caution: Mutual vs Pairwise Independence

A collection of events A_1, ..., A_n is mutually independent if every finite subcollection satisfies the product rule. Pairwise independence (every pair is independent) does NOT imply mutual independence.

For sigma-algebras: sub-sigma-algebras F_1, ..., F_n are independent if for every choice A_i in F_i the events A_1, ..., A_n are mutually independent.
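The standard counterexample uses two fair coin flips: let A be "first flip heads", B "second flip heads", and C "the two flips agree". Each pair is independent, yet the triple is not. A Python sketch verifying this by enumeration (event names are illustrative):

```python
from fractions import Fraction
from itertools import product

# Two fair coin flips: four equally likely outcomes
omega = list(product("HT", repeat=2))

def P(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def A(w): return w[0] == "H"      # first flip is heads
def B(w): return w[1] == "H"      # second flip is heads
def C(w): return w[0] == w[1]     # the two flips agree

pairwise = (P(lambda w: A(w) and B(w)) == P(A) * P(B)
            and P(lambda w: A(w) and C(w)) == P(A) * P(C)
            and P(lambda w: B(w) and C(w)) == P(B) * P(C))
mutual = P(lambda w: A(w) and B(w) and C(w)) == P(A) * P(B) * P(C)
```

Here every pairwise product rule holds (each intersection has probability 1/4 = 1/2 * 1/2), but P(A and B and C) = 1/4, not 1/8, so mutual independence fails.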

4. Random Variables, Distributions, and Expectations

Random Variables

A random variable X is a measurable function X: Omega to R — it maps outcomes to real numbers, and the preimage of every Borel set belongs to F. Random variables transform the abstract probability space into a numeric one we can compute with.

Probability Mass Function (PMF)

For a discrete random variable X (taking countably many values), the PMF is:

p(x) = P(X = x), with sum over all x of p(x) = 1

Probability Density Function (PDF)

For a continuous random variable X, the PDF f(x) satisfies:

P(a <= X <= b) = integral from a to b of f(x) dx

f(x) >= 0 for all x, and integral over all reals of f(x) dx = 1

Note: P(X = x) = 0 for any single point when X is continuous. Probability lives in intervals, not points.

Cumulative Distribution Function (CDF)

F(x) = P(X <= x)

  • F is non-decreasing: if a < b then F(a) <= F(b)
  • Right-continuous: lim as y decreases to x of F(y) = F(x)
  • Limits: F(-infinity) = 0 and F(+infinity) = 1
  • For continuous X: f(x) = F prime (x) wherever F is differentiable

Expected Value

The expected value (mean) E[X] is the probability-weighted average:

Discrete: E[X] = sum over x of x * p(x)

Continuous: E[X] = integral over all reals of x * f(x) dx

Variance and Standard Deviation

Var(X) = E[(X - mu)^2] = E[X^2] - (E[X])^2

SD(X) = sqrt(Var(X))

Moment Generating Functions

The moment generating function (MGF) of X is:

M_X(t) = E[e^(tX)], defined for t in a neighborhood of 0

The k-th moment is E[X^k] = M_X^(k)(0) — the k-th derivative of M_X evaluated at 0. If two distributions share the same MGF on an open interval around 0, they are identical. MGFs are powerful tools for proving the CLT and working with sums of independent random variables.

Key Inequalities

  • Markov: P(X >= a) <= E[X] / a, for X >= 0 and a > 0
  • Chebyshev: P(|X - mu| >= k*sigma) <= 1 / k^2
  • Jensen: phi(E[X]) <= E[phi(X)] for convex phi
  • Cauchy-Schwarz: |E[XY]|^2 <= E[X^2] * E[Y^2]
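Markov's and Chebyshev's inequalities can be checked exactly on a small discrete distribution. A Python sketch using exact rational arithmetic (the example distribution and the thresholds a and k are arbitrary):

```python
from fractions import Fraction

# A small non-negative discrete random variable: value -> probability
dist = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 4)}

E = sum(x * p for x, p in dist.items())                 # E[X] = 5/4
var = sum((x - E) ** 2 * p for x, p in dist.items())    # Var(X)

# Markov: P(X >= a) <= E[X] / a
a = 2
markov_holds = sum(p for x, p in dist.items() if x >= a) <= E / a

# Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2, stated via squares
# to stay in exact rational arithmetic
k = Fraction(3, 2)
cheb_lhs = sum(p for x, p in dist.items() if (x - E) ** 2 >= k ** 2 * var)
cheb_holds = cheb_lhs <= 1 / k ** 2
```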

5. Common Probability Distributions

Discrete Distributions

  • Bernoulli(p): PMF p(k) = p^k * (1-p)^(1-k) for k in {0,1}; mean p; variance p(1-p). Use: single yes/no trial.
  • Binomial(n,p): PMF p(k) = C(n,k) * p^k * (1-p)^(n-k); mean np; variance np(1-p). Use: number of successes in n trials.
  • Poisson(lambda): PMF p(k) = e^(-lambda) * lambda^k / k!; mean lambda; variance lambda. Use: counts of rare events per interval.
  • Geometric(p): PMF p(k) = (1-p)^(k-1) * p for k = 1, 2, ...; mean 1/p; variance (1-p)/p^2. Use: number of trials until first success.

Poisson as Limit of Binomial

When n is large and p is small with lambda = np held constant, Binomial(n, p) converges to Poisson(lambda). This is the Poisson limit theorem — useful for modeling rare events like radioactive decay, server arrivals, and insurance claims.
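The convergence can be observed numerically: holding lambda = np fixed while increasing n shrinks the gap between the two PMFs. A Python sketch (helper names and the choice lambda = 2 are illustrative):

```python
from math import comb, exp, factorial

def binom_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    return exp(-lam) * lam**k / factorial(k)

lam = 2.0
# Worst pointwise gap over k = 0..9 between Binomial(n, lam/n) and Poisson(lam)
err_small = max(abs(binom_pmf(50, lam / 50, k) - poisson_pmf(lam, k))
                for k in range(10))
err_large = max(abs(binom_pmf(5000, lam / 5000, k) - poisson_pmf(lam, k))
                for k in range(10))
```

The gap at n = 5000 is orders of magnitude smaller than at n = 50, consistent with Le Cam's bound of roughly n*p^2 on the total variation distance.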

Continuous Distributions

  • Uniform(a,b): PDF f(x) = 1/(b-a) on [a,b]; mean (a+b)/2; variance (b-a)^2 / 12
  • Normal(mu, sigma^2): PDF f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-(x-mu)^2 / (2*sigma^2)); mean mu; variance sigma^2
  • Exponential(lambda): PDF f(x) = lambda * e^(-lambda*x) for x >= 0; mean 1/lambda; variance 1/lambda^2
  • Gamma(alpha, beta): PDF f(x) = x^(alpha-1) * e^(-x/beta) / (Gamma(alpha) * beta^alpha) for x > 0; mean alpha*beta; variance alpha*beta^2
  • Beta(alpha, beta): PDF f(x) = x^(alpha-1) * (1-x)^(beta-1) / B(alpha, beta) on [0,1]; mean alpha/(alpha+beta); variance alpha*beta / ((alpha+beta)^2 * (alpha+beta+1))

The Normal Distribution in Depth

The normal (Gaussian) distribution N(mu, sigma^2) is central to probability and statistics. Key properties:

  • Symmetric about the mean mu
  • The 68-95-99.7 rule: 68% of data within 1 sigma, 95% within 2 sigma, 99.7% within 3 sigma
  • Linear combinations of independent normals are normal: if X ~ N(mu_1, s_1^2) and Y ~ N(mu_2, s_2^2) independently, then X + Y ~ N(mu_1 + mu_2, s_1^2 + s_2^2)
  • MGF: M_X(t) = exp(mu*t + sigma^2*t^2/2)
  • Standard normal Z = (X - mu) / sigma has distribution N(0,1)

Memoryless Property of the Exponential

The exponential distribution is the unique continuous distribution with the memoryless property: P(X > s + t | X > s) = P(X > t). This makes it the natural model for waiting times in Poisson processes and for component lifetimes in reliability theory.
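The memoryless property can be checked by simulation: among samples that survive past s, the fraction surviving an additional t should match the unconditional survival probability P(X > t). A Python sketch (the sample size, seed, and the values of s and t are arbitrary choices):

```python
import random

random.seed(0)
lam = 1.0
samples = [random.expovariate(lam) for _ in range(200_000)]

s, t = 0.5, 1.0
survivors = [x for x in samples if x > s]

# P(X > s + t | X > s), estimated among samples that exceeded s
cond = sum(1 for x in survivors if x > s + t) / len(survivors)
# P(X > t), estimated unconditionally; theory: e^(-lam*t) = e^-1 here
uncond = sum(1 for x in samples if x > t) / len(samples)
```

The two estimates agree up to sampling noise, illustrating P(X > s + t | X > s) = P(X > t).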

6. Limit Theorems: Law of Large Numbers and CLT

Weak Law of Large Numbers (WLLN)

Let X_1, X_2, ... be i.i.d. random variables with E[X_i] = mu and Var(X_i) = sigma^2 (finite). Define the sample mean X-bar_n = (X_1 + ... + X_n) / n. Then:

For every epsilon > 0: P(|X-bar_n - mu| > epsilon) approaches 0 as n to infinity

This is convergence in probability. The WLLN follows immediately from Chebyshev's inequality: P(|X-bar_n - mu| > epsilon) <= sigma^2 / (n * epsilon^2), which goes to 0 as n grows.
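The shrinking tail probability can be observed by simulation with fair-coin flips (mu = 0.5). A Python sketch (epsilon, the sample sizes, and the trial count are arbitrary choices):

```python
import random

random.seed(1)

def deviation_prob(n, epsilon=0.05, trials=2000):
    """Estimate P(|X_bar_n - 0.5| > epsilon) for n fair-coin flips."""
    bad = 0
    for _ in range(trials):
        xbar = sum(random.random() < 0.5 for _ in range(n)) / n
        if abs(xbar - 0.5) > epsilon:
            bad += 1
    return bad / trials

p_small = deviation_prob(n=20)    # deviations beyond epsilon still common
p_large = deviation_prob(n=500)   # deviations beyond epsilon now rare
```

The estimated probability drops sharply as n grows, as the Chebyshev bound sigma^2 / (n * epsilon^2) predicts.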

Strong Law of Large Numbers (SLLN)

Under the same conditions (in fact only requiring E[|X|] finite):

P(lim as n to infinity of X-bar_n = mu) = 1

This is almost-sure convergence — a strictly stronger statement. The event "the sample mean converges to mu" has probability exactly 1. The SLLN is what justifies the frequency interpretation of probability over the long run.

Central Limit Theorem (CLT)

Central Limit Theorem

Let X_1, X_2, ... be i.i.d. with mean mu and variance sigma^2, both finite and with sigma^2 > 0. Then:

sqrt(n) * (X-bar_n - mu) / sigma converges in distribution to N(0,1)

Equivalently, for large n: X-bar_n is approximately N(mu, sigma^2/n), and S_n = X_1 + ... + X_n is approximately N(n*mu, n*sigma^2).

The CLT is one of the most remarkable results in mathematics — regardless of the underlying distribution (uniform, exponential, Poisson, etc.), the normalized sum converges to a Gaussian. It explains why normal distributions appear so frequently in nature and is the foundation for much of classical statistics.

CLT in Practice

A rule of thumb: the normal approximation is reliable for n >= 30 when the distribution is not severely skewed. For symmetric distributions, even n = 10 may suffice. For the Poisson, the normal approximation is reasonable when lambda >= 5.
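The approximation can be checked by simulation: standardized means of n = 30 uniforms should behave like N(0,1) draws. A Python sketch comparing an empirical proportion against Phi(1) (the parameters and seed are arbitrary choices):

```python
import random
from math import sqrt, erf

random.seed(2)
n, trials = 30, 5000
mu, sigma = 0.5, sqrt(1 / 12)     # mean and sd of Uniform(0, 1)

# Standardized sample means: sqrt(n) * (X_bar - mu) / sigma
z = [(sum(random.random() for _ in range(n)) / n - mu) / (sigma / sqrt(n))
     for _ in range(trials)]

# Fraction at or below 1.0 should approach Phi(1) = 0.8413...
frac = sum(1 for v in z if v <= 1.0) / trials
phi1 = 0.5 * (1 + erf(1 / sqrt(2)))
```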

Berry-Esseen Theorem (Rate of Convergence)

The Berry-Esseen theorem quantifies how fast the CLT converges. If E[|X|^3] is finite with rho = E[|X - mu|^3], then the error in the normal approximation to the CDF is bounded by C * rho / (sigma^3 * sqrt(n)), where C is an absolute constant (best known value approximately 0.4748). Faster convergence occurs for distributions closer to normal.

7. Markov Chains

The Markov Property

A stochastic process (X_n) indexed by n = 0, 1, 2, ... and taking values in a (discrete) state space S is a Markov chain if it satisfies the Markov property — the future is independent of the past given the present:

P(X_(n+1) = j | X_n = i, X_(n-1), ..., X_0) = P(X_(n+1) = j | X_n = i)

Transition Matrix

A (time-homogeneous) Markov chain on a finite state space S = {1, ..., N} is fully described by its transition probability matrix P, where:

P_ij = P(X_(n+1) = j | X_n = i)

Each row sums to 1: sum over j of P_ij = 1

The n-step transition probabilities P^(n)_ij = P(X_n = j | X_0 = i) are given by the (i,j) entry of the matrix P^n (P raised to the n-th power).
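Computing P^n is plain matrix multiplication. A self-contained Python sketch for a hypothetical two-state weather chain (the states and probabilities are made up for illustration):

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, n):
    """n-th power of a transition matrix (n >= 1)."""
    result = P
    for _ in range(n - 1):
        result = mat_mul(result, P)
    return result

# Hypothetical weather chain: state 0 = sunny, state 1 = rainy
P = [[0.9, 0.1],
     [0.5, 0.5]]

P3 = mat_pow(P, 3)   # P3[i][j] = P(X_3 = j | X_0 = i)
```

Each row of P3 still sums to 1, since P^n is again a transition matrix.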

Classification of States

  • Accessible: state j is accessible from state i if P^(n)_ij > 0 for some n >= 0
  • Communicating: i and j communicate if each is accessible from the other; written i <-> j
  • Irreducible: all states communicate, forming a single communicating class
  • Recurrent: P(return to i eventually | start at i) = 1; equivalently, sum over n of P^(n)_ii = infinity
  • Transient: P(return to i eventually | start at i) < 1; the chain visits i only finitely many times
  • Period: d(i) = gcd of {n : P^(n)_ii > 0}; state i is aperiodic if d(i) = 1

Stationary Distribution

A distribution pi on S is stationary (invariant) for the chain if:

pi = pi * P (pi is a left eigenvector with eigenvalue 1)

For an irreducible, positive recurrent Markov chain, a unique stationary distribution exists and pi_i = 1 / E[T_i | X_0 = i] (the reciprocal of the expected return time).
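For an ergodic chain, repeatedly applying pi -> pi * P from any starting distribution converges to the stationary distribution. A Python sketch using a hypothetical two-state chain (the matrix values and iteration count are illustrative):

```python
def step(pi, P):
    """One application of pi -> pi * P (row vector times matrix)."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical two-state chain
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi = [1.0, 0.0]          # start deterministically in state 0
for _ in range(200):     # iterate until numerically invariant
    pi = step(pi, P)
```

For this chain the exact stationary distribution is pi = (5/6, 1/6), which solves pi = pi * P directly, and the iteration converges to it.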

Ergodic Theorem for Markov Chains

For an irreducible, aperiodic, positive recurrent (ergodic) Markov chain, the long-run distribution converges to pi regardless of the starting state: P^(n)_ij approaches pi_j as n approaches infinity for all i and j. Time averages equal space averages (ensemble averages) — a probability analogue of ergodicity in dynamical systems.

8. Convergence of Random Variables

Four Modes of Convergence

Let (X_n) be a sequence of random variables on (Omega, F, P) and let X be a random variable. There are four standard notions of convergence, ordered roughly from strongest to weakest (almost-sure and L^p convergence are not comparable in general):

1. Almost-Sure Convergence (a.s.) — Strongest

P(lim as n to infinity of X_n = X) = 1

The set of outcomes where X_n does NOT converge to X has probability zero. This is the mode used in the SLLN.

2. Convergence in Probability

For every epsilon > 0: P(|X_n - X| > epsilon) approaches 0 as n to infinity

Weaker than a.s. convergence. The WLLN gives this mode. Almost-sure implies convergence in probability but not vice versa.

3. Convergence in L^p (Mean-Square for p=2)

E[|X_n - X|^p] approaches 0 as n to infinity

L^p convergence implies convergence in probability (by Markov's inequality). L^2 convergence (mean-square) is especially important in signal processing and statistics.

4. Convergence in Distribution (Weak Convergence) — Weakest

F_n(x) approaches F(x) for every continuity point x of F

Only the distribution functions converge, not the random variables themselves. This is the mode in the CLT. Implies nothing about the joint behavior of X_n and X.

Implications Between Modes

a.s. ==> in probability ==> in distribution

L^p ==> in probability ==> in distribution

None of the reverse implications hold in general. Convergence in distribution to a constant implies convergence in probability.

Skorokhod's Representation Theorem

If X_n converges in distribution to X, then there exist random variables Y_n and Y (defined on a common probability space) with Y_n having the same distribution as X_n, Y having the same distribution as X, and Y_n converging to Y almost surely. This theorem is a powerful tool for proving distributional convergence results.

Continuous Mapping Theorem

If X_n converges in distribution (or probability, or a.s.) to X, and g is a continuous function, then g(X_n) converges in the same mode to g(X). This is used extensively when applying transformations to converging sequences — for example, if X_n converges in distribution to N(0,1) then X_n^2 converges to chi-squared(1).
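The chi-squared example can be checked by simulation, using CLT-standardized means of uniforms as approximate N(0,1) draws and comparing the empirical CDF of their squares with the chi-squared(1) CDF, which for one degree of freedom equals erf(sqrt(x/2)). A Python sketch (the parameters and seed are arbitrary choices):

```python
import random
from math import sqrt, erf

random.seed(3)
n, trials = 40, 4000

def standardized_mean():
    """Approximately N(0,1) by the CLT: standardized mean of n uniforms."""
    s = sum(random.random() for _ in range(n))
    return (s - n * 0.5) / sqrt(n / 12)

# g(x) = x^2 is continuous, so the squares converge in
# distribution to chi-squared(1)
squares = [standardized_mean() ** 2 for _ in range(trials)]

# Chi-squared(1) CDF at x: P(Z^2 <= x) = 2*Phi(sqrt(x)) - 1 = erf(sqrt(x/2))
x = 1.0
empirical = sum(1 for v in squares if v <= x) / trials
theoretical = erf(sqrt(x / 2))
```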

Frequently Asked Questions

What is a sigma-algebra and why does it matter in probability theory?
A sigma-algebra on a sample space Omega is a collection of subsets closed under complementation and countable unions. Together with a probability measure P, it defines the probability space (Omega, F, P). Sigma-algebras specify exactly which events are measurable and can be assigned probabilities, preventing paradoxes that arise from assigning probabilities to all possible subsets of an uncountable space.
What are Kolmogorov's three axioms of probability?
The three axioms are: (1) Non-negativity: P(A) >= 0 for all events A. (2) Normalization: P(Omega) = 1, the total probability is 1. (3) Countable additivity: for pairwise disjoint events, the probability of their union equals the sum of their individual probabilities. Every standard probability rule (complement rule, addition rule, etc.) is a consequence of these three axioms.
How does Bayes' theorem work and when do you use it?
Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B). It lets you reverse conditional probabilities — computing the probability of a hypothesis given evidence, from the probability of that evidence given the hypothesis. Use it in Bayesian inference, medical diagnosis, spam detection, and any situation where you update beliefs as new information arrives.
What is the difference between the weak and strong law of large numbers?
The weak law (WLLN) gives convergence in probability: P(|X-bar_n - mu| > epsilon) approaches 0 for every epsilon > 0. The strong law (SLLN) gives almost-sure convergence: P(lim X-bar_n = mu) = 1. The strong law is a strictly stronger statement — every sequence of coin flips, for example, will have its running average converge to the true probability with probability 1.
What does the Central Limit Theorem say?
The CLT states that for i.i.d. random variables with finite mean and variance, the standardized sample mean sqrt(n)(X-bar_n - mu)/sigma converges in distribution to the standard normal N(0,1). This means: regardless of the underlying distribution, sums of many independent random variables are approximately normally distributed for large n. It is the theoretical basis for Z-tests, confidence intervals, and much of classical statistics.
What is a Markov chain and what is the Markov property?
A Markov chain is a stochastic process where the probability of moving to the next state depends only on the current state, not on the history. Formally: P(X_(n+1) = j | X_n = i, X_(n-1), ..., X_0) = P(X_(n+1) = j | X_n = i). Markov chains are described by a transition matrix and are used to model queues, genetics, PageRank, economic systems, and much more.
What are the four modes of convergence for random variables?
From strongest to weakest: (1) Almost-sure (a.s.): P(lim Xn = X) = 1. (2) In probability: P(|Xn - X| > epsilon) to 0 for every epsilon > 0. (3) In L^p: E[|Xn - X|^p] to 0. (4) In distribution: F_n(x) to F(x) at continuity points of F. Almost-sure and L^p both imply convergence in probability, which implies convergence in distribution. The CLT gives convergence in distribution; the SLLN gives almost-sure convergence.

Key Terms Glossary

Sample Space (Omega)

The set of all possible outcomes of a random experiment

Sigma-Algebra (F)

A collection of subsets of Omega closed under complementation and countable unions

Probability Measure (P)

A function P: F to [0,1] satisfying Kolmogorov's three axioms

Event

A measurable subset A of Omega belonging to F

Random Variable

A measurable function X: Omega to R

PMF

Probability mass function — gives P(X=x) for discrete X

PDF

Probability density function — integrates to give probabilities for continuous X

CDF

Cumulative distribution function F(x) = P(X <= x)

Expected Value

Probability-weighted average of X; the mean mu = E[X]

Variance

E[(X - mu)^2] — measures spread around the mean

MGF

Moment generating function M(t) = E[e^(tX)]

Independence

P(A intersect B) = P(A)*P(B); knowing one event gives no info about the other

WLLN

Weak law of large numbers — sample mean converges in probability to mu

SLLN

Strong law of large numbers — sample mean converges almost surely to mu

CLT

Central limit theorem — normalized sum converges in distribution to N(0,1)

Markov Chain

Stochastic process satisfying the Markov (memoryless) property

Stationary Distribution

Distribution pi satisfying pi = pi*P; long-run distribution of ergodic chain

a.s. Convergence

X_n to X with probability 1; the strongest mode of convergence
