Numerical Analysis

Numerical Stability & Loading Calculations

Numerical stability is the study of how errors propagate through computational algorithms. A stable method keeps small errors small; an unstable one amplifies them until they swamp the true answer. This guide covers the mathematical theory behind stable ODE solvers, matrix factorizations, iterative methods, and error analysis for scientific and engineering computation.

1. Numerical Stability Fundamentals

Every numerical computation introduces errors. The central question of numerical stability is whether those errors grow or stay bounded as the computation proceeds. Two distinct sources of error must be understood: round-off error from finite-precision arithmetic and truncation error from replacing exact mathematical operations with finite approximations.

Round-Off Error and Floating-Point Arithmetic

IEEE 754 double-precision floating-point represents numbers as ±m × 2^e, where the significand m carries 53 bits of precision (52 stored fraction bits plus an implicit leading 1) and e is an 11-bit exponent. Any real number that is not exactly representable is rounded to the nearest floating-point value. The machine epsilon eps_mach is approximately 2.22 × 10^-16 for double precision — the gap between 1 and the next larger representable number.

Floating-Point Rounding Model

  • fl(x) = x(1 + delta) where |delta| ≤ eps_mach
  • fl(a op b) = (a op b)(1 + delta), |delta| ≤ eps_mach
  • eps_mach ≈ 2.22e-16 (double precision)
  • eps_mach ≈ 1.19e-7 (single precision)
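
These model constants can be measured directly. A minimal Python sketch (the helper name measure_eps is ours) that finds eps_mach by repeated halving and checks the rounding model on a familiar case:

```python
def measure_eps():
    # Halve eps until 1 + eps/2 rounds back to exactly 1: the last
    # surviving eps is the gap between 1 and the next double, 2^-52.
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps

eps_mach = measure_eps()   # 2^-52 ≈ 2.22e-16

# Rounding model in action: 0.1 and 0.2 are not exact binary fractions,
# so fl(0.1 + 0.2) differs from 0.3 by a few parts in 10^16.
rel_err = abs((0.1 + 0.2) - 0.3) / 0.3
```

Each input is within eps_mach of its decimal value, and the sum picks up one more rounding — exactly the fl(a op b) = (a op b)(1 + delta) model above.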

Truncation Error

Truncation error arises from approximating continuous mathematics with discrete or finite operations. The forward-difference approximation to a derivative is a canonical example: using f'(x) ≈ [f(x+h) - f(x)] / h truncates the Taylor series f(x+h) = f(x) + hf'(x) + (h²/2)f''(x) + ..., leaving a local truncation error of order h. Halving h halves the truncation error but doubles the round-off contribution, which scales as eps_mach/h. The optimal h balances these two competing effects.

Optimal Step Size for Finite Differences

Forward difference error ≈ (h/2)|f''(x)| + (eps_mach/h)|f(x)|
Minimizing over h: h_opt = sqrt(2 * eps_mach * |f(x)| / |f''(x)|)
h_opt ~ sqrt(eps_mach) ≈ 1.5 × 10^(-8) for double precision (taking |f| ≈ |f''|)
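
The trade-off is easy to observe numerically. A quick sketch (function and evaluation point chosen arbitrarily) comparing a near-optimal h against one that is too large (truncation dominates) and one that is too small (round-off dominates):

```python
import math

def fwd_diff(f, x, h):
    # Forward difference: truncation error ~ (h/2)|f''|, round-off ~ eps/h
    return (f(x + h) - f(x)) / h

# f = exp at x = 0, so the exact derivative is 1.
errors = {h: abs(fwd_diff(math.exp, 0.0, h) - 1.0)
          for h in (1e-2, 1e-8, 1e-12)}
# h = 1e-8, near sqrt(eps_mach), beats both extremes: at h = 1e-2 the
# Taylor remainder dominates; at h = 1e-12 cancellation in f(x+h) - f(x)
# leaves only a few significant digits.
```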

Condition Numbers and Ill-Conditioning

The condition number of a problem measures its inherent sensitivity to input perturbations, independent of the algorithm used. For a linear system Ax = b, the condition number kappa(A) = ||A|| ||A^-1|| bounds the relative error amplification:

Perturbation Bound for Linear Systems

  • ||delta_x|| / ||x|| ≤ kappa(A) * ||delta_b|| / ||b||
  • kappa(A) = sigma_max / sigma_min (ratio of singular values)
  • kappa(A) = 1 for orthogonal matrices (perfectly conditioned)
  • kappa(A) → infinity as A approaches singularity

An ill-conditioned problem loses approximately log10(kappa(A)) digits of accuracy in the computed solution. If kappa(A) = 10^10 and we work in double precision with ~16 significant digits, we may only get 6 correct digits in the answer — even with a perfectly stable algorithm.

Stability vs. Conditioning

Conditioning is a property of the problem; stability is a property of the algorithm. A stable algorithm applied to an ill-conditioned problem still gives a poor answer — but it gives the best answer the problem permits. An unstable algorithm applied to a well-conditioned problem gives a poor answer unnecessarily.

2. Stability of ODE Solvers

To analyze the stability of numerical ODE methods, we use Dahlquist's test equation: dy/dt = lambda*y, where lambda is a complex number with Re(lambda) < 0 so the true solution decays to zero. The region of absolute stability of a method is the set of complex values z = h*lambda for which the numerical solution also decays to zero.

A-Stability and L-Stability

A method is A-stable if its region of absolute stability contains the entire left half-plane {Re(z) < 0} where z = h*lambda. This means the method is unconditionally stable for any decaying test equation regardless of step size. A method is L-stable if it is A-stable and additionally the amplification factor R(z) → 0 as z → -infinity. L-stability ensures that very stiff components (large negative eigenvalues) are immediately damped rather than being slowly reduced.

Explicit Euler

R(z) = 1 + z
Stable: |1 + z| ≤ 1
Disk of radius 1
centered at z = -1

Not A-stable

Implicit Euler

R(z) = 1/(1 - z)
Stable: |1 - z| ≥ 1
Exterior of unit disk centered at z = +1
(contains the entire left half-plane)

A-stable, L-stable

Trapezoidal (CN)

R(z) = (1 + z/2) /
(1 - z/2)
Entire left half-plane

A-stable, not L-stable
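
The three stability functions above can be compared at a stiff sample point. A small sketch (function names are ours) evaluating |R(z)| at z = -3, which lies outside explicit Euler's stability disk:

```python
# Amplification factors for one step applied to y' = lambda*y, z = h*lambda.
def R_explicit_euler(z):
    return 1 + z

def R_implicit_euler(z):
    return 1 / (1 - z)

def R_trapezoidal(z):
    return (1 + z / 2) / (1 - z / 2)

z = -3.0
# |1 + z| = 2 > 1: explicit Euler amplifies the mode (unstable step).
# |1/(1 - z)| = 0.25, |R_trap(z)| = 0.2: both implicit methods damp it.
# As z -> -infinity, R_implicit -> 0 (L-stable) while R_trap -> -1:
# the trapezoidal rule barely damps the stiffest modes.
```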

Stiff Equations and the Stiffness Ratio

A system y' = f(y) is stiff if the Jacobian J = df/dy has eigenvalues lambda_i with widely varying magnitudes. The stiffness ratio is max|Re(lambda_i)| / min|Re(lambda_i)|. For a stiff system with stiffness ratio 10^6, an explicit method with stability region |h*lambda| ≤ C must use h ≤ C / max|lambda| — a step size 10^6 times smaller than the accuracy requirement would dictate. This makes explicit methods prohibitively expensive for stiff problems.

Classic Stiff Example: Robertson Chemistry

dy1/dt = -0.04 y1 + 1e4 y2 y3
dy2/dt = 0.04 y1 - 1e4 y2 y3 - 3e7 y2^2
dy3/dt = 3e7 y2^2

Eigenvalues: ~0, ~-0.04, ~-3×10^7
Stiffness ratio ≈ 10^9

Consistency and Convergence (Lax Equivalence Theorem)

For a well-posed linear initial value problem and a consistent numerical method, stability is equivalent to convergence. This is the Lax-Richtmyer equivalence theorem. Consistency means the local truncation error goes to zero as h → 0. Stability means bounded error amplification. Together they guarantee that the global error vanishes as h → 0 — the method converges to the true solution.

Key Relationships

  • Consistency: local truncation error = O(h^p) for some p ≥ 1
  • Stability: numerical solution bounded for bounded right-hand side
  • Convergence: global error → 0 as h → 0
  • Lax: Consistency + Stability ⟺ Convergence

3. Runge-Kutta Methods

Runge-Kutta (RK) methods are one-step methods that achieve high-order accuracy by evaluating the derivative function at multiple intermediate points within each step. They are defined by a Butcher tableau that specifies the stage values and quadrature weights.

The Classical RK4 Method

The classical fourth-order Runge-Kutta method is the most widely used ODE solver in practice. It requires four function evaluations per step and achieves local truncation error of O(h^5) and global error of O(h^4).

RK4 Algorithm for y' = f(t, y)

k1 = h * f(t_n,         y_n)
k2 = h * f(t_n + h/2,   y_n + k1/2)
k3 = h * f(t_n + h/2,   y_n + k2/2)
k4 = h * f(t_n + h,     y_n + k3)

y_{n+1} = y_n + (k1 + 2k2 + 2k3 + k4) / 6

Weights (1, 2, 2, 1)/6 come from Simpson's rule applied to the integral formulation of the ODE.
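
The four-stage update is only a few lines of code. A minimal sketch (names are ours) integrating the test problem y' = -y to t = 1:

```python
import math

def rk4_step(f, t, y, h):
    # One classical RK4 step: four stage derivatives, Simpson-weighted.
    k1 = h * f(t, y)
    k2 = h * f(t + h/2, y + k1/2)
    k3 = h * f(t + h/2, y + k2/2)
    k4 = h * f(t + h, y + k3)
    return y + (k1 + 2*k2 + 2*k3 + k4) / 6

# Integrate y' = -y, y(0) = 1 over [0, 1]; exact answer is exp(-1).
f = lambda t, y: -y
y, t, h = 1.0, 0.0, 0.1
for _ in range(10):
    y = rk4_step(f, t, y, h)
    t += h
err = abs(y - math.exp(-1))   # O(h^4) global error: well below 1e-6 at h = 0.1
```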

Embedded Pairs and Adaptive Step Control

Production ODE solvers use embedded Runge-Kutta pairs: two methods of different orders sharing the same function evaluations. The difference between the two solutions estimates the local error, which drives adaptive step size control. The Dormand-Prince pair (DOPRI5, used in MATLAB's ode45) embeds a 4th-order method within a 5th-order method using seven stages — effectively six function evaluations per step, since the last stage doubles as the first stage of the next step (the FSAL property).

Adaptive Step Size Formula

  • err = ||y_5th - y_4th|| / (atol + rtol * ||y||)
  • h_new = h * min(fac_max, max(fac_min, fac * (1/err)^(1/5)))
  • Typical: fac = 0.9, fac_min = 0.1, fac_max = 5.0
  • Accept step if err ≤ 1, reject and retry if err > 1
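
A sketch of this controller (parameter names follow the bullets above; the helper name propose_step is ours):

```python
def propose_step(h, err, fac=0.9, fac_min=0.1, fac_max=5.0, order=5):
    # err is already scaled: err <= 1 means the step met the tolerance.
    accept = err <= 1.0
    # Clamp the growth/shrink factor so one bad estimate cannot make
    # the step size explode or collapse.
    h_new = h * min(fac_max, max(fac_min, fac * (1.0 / err) ** (1.0 / order)))
    return accept, h_new

# A step with scaled error 32 is rejected, and the step size roughly
# halves: fac * (1/32)^(1/5) = 0.9 * 0.5 = 0.45.
accept, h_new = propose_step(0.1, 32.0)
```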

Implicit Runge-Kutta Methods

Explicit RK methods have limited stability regions. Implicit RK methods (like Gauss-Legendre collocation and SDIRK — Singly Diagonally Implicit Runge-Kutta) solve a system of nonlinear equations at each step but achieve A-stability or even L-stability. SDIRK methods have a lower triangular Butcher tableau with a constant diagonal entry gamma, so the stages can be solved sequentially, and every stage's Newton iteration reuses the same factored matrix (I - h*gamma*J).

2-Stage Gauss-Legendre (order 4)

c1 = 1/2 - sqrt(3)/6
c2 = 1/2 + sqrt(3)/6
Symplectic, A-stable
Optimal order for 2 stages

2-Stage SDIRK (order 3)

gamma = (3 + sqrt(3)) / 6
Diagonal entries = gamma
A-stable (the L-stable 2-stage SDIRK instead takes gamma = 1 - sqrt(2)/2, at order 2)
Each stage: solve with (I - h*gamma*J)

Order Conditions and Rooted Trees

Butcher's theory of rooted trees provides a systematic way to derive the algebraic conditions that a Runge-Kutta method must satisfy to achieve a given order. For order p, all rooted trees with up to p nodes must satisfy corresponding order conditions. The number of conditions grows rapidly: 1 condition for order 1, 2 for order 2, 4 for order 3, 8 for order 4, and 17 for order 5. This is why constructing high-order explicit RK methods requires solving large systems of polynomial equations.

4. Linear Multistep Methods

Linear multistep methods (LMMs) use information from multiple previous steps to advance the solution. They can achieve high order with fewer function evaluations per step than Runge-Kutta, but require special starting procedures and are subject to the Dahlquist barrier theorem.

Adams-Bashforth Methods (Explicit)

Adams-Bashforth (AB) methods interpolate the derivative history with a polynomial and integrate it exactly to advance the solution. They are explicit (no implicit solve required) and widely used for non-stiff problems.

Adams-Bashforth Formulas

  • AB1 (=Explicit Euler): y_{n+1} = y_n + h*f_n
  • AB2 (order 2): y_{n+1} = y_n + h*(3f_n - f_{n-1})/2
  • AB3 (order 3): y_{n+1} = y_n + h*(23f_n - 16f_{n-1} + 5f_{n-2})/12
  • AB4 (order 4): y_{n+1} = y_n + h*(55f_n - 59f_{n-1} + 37f_{n-2} - 9f_{n-3})/24
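
Multistep methods need starting values before the history is full. A minimal AB2 sketch (the bootstrap choice — one explicit midpoint step to supply y_1 — and all names are ours):

```python
import math

def ab2(f, y0, h, n):
    # Two-step Adams-Bashforth: y_{n+1} = y_n + h*(3f_n - f_{n-1})/2.
    t, y = 0.0, y0
    f_prev = f(t, y)
    # Bootstrap: one second-order midpoint (RK2) step supplies y_1, so the
    # starter does not degrade the method's O(h^2) global accuracy.
    y = y + h * f(t + h/2, y + (h/2) * f_prev)
    t += h
    for _ in range(n - 1):
        f_curr = f(t, y)
        y = y + h * (3*f_curr - f_prev) / 2
        f_prev = f_curr
        t += h
    return y

# y' = -y, y(0) = 1 over [0, 1] with h = 0.05: error is O(h^2).
err = abs(ab2(lambda t, y: -y, 1.0, 0.05, 20) - math.exp(-1))
```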

Adams-Moulton Methods (Implicit)

Adams-Moulton (AM) methods include the current step's derivative in the interpolation, making them implicit. They achieve one higher order than the corresponding Adams-Bashforth method with the same number of prior steps and have larger stability regions. In practice, AB and AM methods are paired as Predictor-Corrector (PECE) schemes.

Adams-Moulton Formulas

  • AM1 (=Implicit Euler): y_{n+1} = y_n + h*f_{n+1}
  • AM2 (=Trapezoidal): y_{n+1} = y_n + h*(f_{n+1} + f_n)/2
  • AM3 (order 3): y_{n+1} = y_n + h*(5f_{n+1} + 8f_n - f_{n-1})/12
  • AM4 (order 4): y_{n+1} = y_n + h*(9f_{n+1} + 19f_n - 5f_{n-1} + f_{n-2})/24

BDF Methods for Stiff ODEs

Backward Differentiation Formulas (BDF) are the method of choice for stiff ODEs. Rather than interpolating the derivative, BDF methods differentiate the interpolating polynomial of the solution history and set it equal to the derivative. BDF1 through BDF6 are zero-stable; BDF1 and BDF2 are A-stable, while BDF3–BDF6 are A(alpha)-stable with the stability angle alpha shrinking as the order rises. BDF7 and higher are not zero-stable and are unusable.

BDF Formulas

  • BDF1: y_{n+1} - y_n = h*f_{n+1} (Implicit Euler)
  • BDF2: (3y_{n+1} - 4y_n + y_{n-1})/2 = h*f_{n+1}
  • BDF3: (11y_{n+1} - 18y_n + 9y_{n-1} - 2y_{n-2})/6 = h*f_{n+1}
  • BDF4: (25y_{n+1} - 48y_n + 36y_{n-1} - 16y_{n-2} + 3y_{n-3})/12 = h*f_{n+1}
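
For the linear test equation y' = lam*y the BDF2 implicit solve has a closed form, which makes the method's stiff stability easy to demonstrate. A sketch (BDF1 supplies the starting value; all names are ours):

```python
import math

def bdf2_linear(lam, y0, h, n):
    # BDF2 for y' = lam*y: (3y_{n+1} - 4y_n + y_{n-1})/2 = h*lam*y_{n+1}
    # rearranges to y_{n+1} = (4y_n - y_{n-1}) / (3 - 2*h*lam).
    y_prev = y0
    y = y0 / (1 - h * lam)          # implicit Euler (BDF1) starter
    for _ in range(n - 1):
        y_prev, y = y, (4 * y - y_prev) / (3 - 2 * h * lam)
    return y

# Stiff decay lam = -1000 with the "large" step h = 0.1: explicit Euler
# would amplify each step by |1 + h*lam| = 99, but BDF2 damps it.
y_stiff = bdf2_linear(-1000.0, 1.0, 0.1, 10)
# Mild case lam = -1: second-order accurate approximation to exp(-1).
y_mild = bdf2_linear(-1.0, 1.0, 0.1, 10)
```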

Dahlquist Barrier Theorem

No explicit linear multistep method is A-stable. Among implicit LMMs, A-stable methods have order at most 2 (the second Dahlquist barrier), and among A-stable LMMs of order 2 the trapezoidal rule has the smallest error constant. BDF methods sacrifice full A-stability for higher order but remain A(alpha)-stable (stable in a wedge of the left half-plane).

5. Matrix Stability: Eigenvalue Analysis

The stability of iterative algorithms, linear recurrences, and dynamical systems depends critically on the eigenvalue structure of the associated matrices. The spectral radius and Gershgorin's theorem provide algebraic tools for bounding eigenvalues without full diagonalization.

Spectral Radius and Matrix Norms

The spectral radius of a matrix A is rho(A) = max|lambda_i|, the largest absolute value of any eigenvalue. For a linear recurrence x_{k+1} = Ax_k, the sequence converges to zero for every starting vector if and only if rho(A) < 1. The spectral radius bounds induced matrix norms from below: rho(A) ≤ ||A|| for any induced norm. For symmetric matrices, rho(A) = ||A||_2 (the spectral norm equals the largest singular value).

Spectral Radius and Convergence

  • rho(A) < 1: x_k → 0 exponentially (convergent recurrence)
  • rho(A) = 1: bounded but not convergent (marginal stability)
  • rho(A) > 1: x_k grows without bound (instability)
  • A^k → 0 if and only if rho(A) < 1; more precisely rho(A) = lim ||A^k||^(1/k) (Gelfand's formula)
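
The dominant eigenvalue can be estimated without full diagonalization. A power-iteration sketch (dense lists, names ours) for a 2×2 recurrence matrix:

```python
def spectral_radius_estimate(A, iters=200):
    # Power iteration: repeated multiplication aligns x with the dominant
    # eigenvector, so the max-norm of A*x (with x normalized) approaches
    # rho(A) for a generic nonnegative starting vector.
    n = len(A)
    x = [1.0] * n
    for _ in range(iters):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = max(abs(v) for v in y)
        x = [v / norm for v in y]
    return norm

# x_{k+1} = A x_k converges iff rho(A) < 1:
A = [[0.5, 0.4], [0.1, 0.3]]
rho = spectral_radius_estimate(A)
# Eigenvalues are (0.8 ± sqrt(0.2))/2, so rho ≈ 0.624 < 1: convergent.
```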

The Gershgorin Circle Theorem

The Gershgorin circle theorem localizes eigenvalues using only the matrix entries. For a matrix A, define the ith Gershgorin disk as: D_i = { z in C : |z - A_{ii}| ≤ R_i } where R_i is the sum of absolute values of off-diagonal entries in row i. Every eigenvalue of A lies in the union of these disks.

Gershgorin Example

A = [ 4   1  -1 ]
    [ 0   3   0.5]
    [-0.5  0   2 ]

D1: center 4,  radius 1 + 1 = 2  → eigenvalue in [2, 6]
D2: center 3,  radius 0 + 0.5 = 0.5 → eigenvalue in [2.5, 3.5]
D3: center 2,  radius 0.5 + 0 = 0.5 → eigenvalue in [1.5, 2.5]

All disks lie in the right half-plane, so every eigenvalue has positive real part — hence -A is Hurwitz stable (dx/dt = -Ax decays).
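
The disks above take one line to compute. A sketch (helper name ours) reproducing the example:

```python
def gershgorin_disks(A):
    # Disk i: center A[i][i], radius = sum of |off-diagonal| in row i.
    return [(A[i][i], sum(abs(v) for j, v in enumerate(A[i]) if j != i))
            for i in range(len(A))]

A = [[4, 1, -1], [0, 3, 0.5], [-0.5, 0, 2]]
disks = gershgorin_disks(A)        # [(4, 2), (3, 0.5), (2, 0.5)]
# Every eigenvalue lies in the union of these disks; the leftmost point
# of the union bounds the smallest possible Re(lambda).
leftmost = min(c - r for c, r in disks)   # 1.5 > 0
```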

Lyapunov Stability for Continuous Systems

For the continuous system dx/dt = Ax, the origin is asymptotically stable if and only if all eigenvalues of A have strictly negative real parts (A is Hurwitz stable). This can be tested without computing eigenvalues using the Routh-Hurwitz criterion for the characteristic polynomial, or by solving the Lyapunov equation A^T P + PA = -Q for a positive definite P, given any positive definite Q.

Stability Criteria Summary

System Type                              Stability Condition
Discrete: x_{k+1} = Ax_k                 rho(A) < 1
Continuous: x' = Ax                      All Re(lambda_i) < 0
Iterative method: x_{k+1} = Mx_k + c     rho(M) < 1
ODE method (test equation)               h*lambda in stability region

6. LU Decomposition with Pivoting

Gaussian elimination factorizes a matrix A as A = LU where L is unit lower triangular and U is upper triangular. This factorization allows efficient solution of multiple right-hand sides by forward and backward substitution. Pivoting strategies are essential for numerical stability.

Partial Pivoting

Partial pivoting interchanges rows before each elimination step so that the pivot element has the largest absolute value in its column. This bounds all multipliers: |L_{ij}| ≤ 1 for all i > j. The factorization produces PA = LU where P is a permutation matrix.

Partial Pivoting Algorithm

for k = 1 to n-1:
  p = argmax_{i>=k} |A[i,k]|      // find pivot row
  swap rows k and p of A           // and of the computed columns of L
  P[k] = p                         // record permutation

  for i = k+1 to n:
    L[i,k] = A[i,k] / A[k,k]       // multiplier, |L[i,k]| <= 1
    A[i,k:n] -= L[i,k] * A[k,k:n]  // eliminate below the pivot

Growth Factor and Stability Guarantees

The growth factor g_n = max|U_{ij}| / max|A_{ij}| measures how much the matrix entries grow during elimination. Large growth factors indicate potential instability. With partial pivoting, the theoretical bound is g_n ≤ 2^(n-1), but in practice the growth factor is rarely larger than n. With complete pivoting (permuting both rows and columns), Wilkinson's bound is g_n ≤ n^(1/2) * (2 * 3^(1/2) * 4^(1/3) * ... * n^(1/(n-1)))^(1/2), which grows subexponentially — far more slowly than 2^(n-1).

Partial Pivoting

  • Growth bound: 2^(n-1)
  • Cost: O(n^2) comparisons
  • Practical growth: usually O(n)
  • Standard for most applications

Complete Pivoting

  • Growth bound: subexponential in n (Wilkinson)
  • Cost: O(n^3) comparisons
  • Proven safe for all matrices
  • Rarely needed in practice

Cholesky Decomposition for Symmetric Positive Definite Matrices

When A is symmetric positive definite (SPD), the Cholesky decomposition A = LL^T is always stable without pivoting, and costs half as much as LU (roughly n^3/6 flops vs n^3/3 flops). The growth factor is exactly 1: each row of L satisfies sum_k L_{ik}^2 = A_{ii}, so no entry of L can exceed sqrt(max_i A_{ii}) — entries never grow during factorization. Cholesky is the preferred method for SPD systems arising in finite element analysis, statistics (normal equations), and optimization.

Cholesky Algorithm

for j = 1 to n:
  L[j,j] = sqrt(A[j,j] - sum(L[j,k]^2, k=1..j-1))
  for i = j+1 to n:
    L[i,j] = (A[i,j] - sum(L[i,k]*L[j,k], k=1..j-1)) / L[j,j]

// Solve Ax = b via forward sub (Ly = b) then back sub (L^T x = y)
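
The pseudocode above translates directly to Python. A minimal sketch (dense lists, names ours) on a 2×2 SPD matrix:

```python
import math

def cholesky(A):
    # Lower-triangular L with A = L L^T; stable without pivoting for SPD A.
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(s)   # s > 0 exactly when A is positive definite
        for i in range(j + 1, n):
            L[i][j] = (A[i][j]
                       - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

# [[4, 2], [2, 5]] = L L^T with L = [[2, 0], [1, 2]]:
L = cholesky([[4.0, 2.0], [2.0, 5.0]])
```

A failed sqrt (s ≤ 0) is also the cheapest practical test that a symmetric matrix is not positive definite.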

7. Classical Iterative Solvers

For large sparse systems where direct factorization is too expensive due to fill-in, iterative methods build a sequence of approximations that converge to the solution. The simplest are the splitting methods: Jacobi, Gauss-Seidel, and SOR.

Matrix Splitting Framework

A matrix splitting writes A = M - N where M is easy to invert. The iterative scheme is Mx_{k+1} = Nx_k + b, or equivalently x_{k+1} = M^-1Nx_k + M^-1b. The iteration converges for every starting vector if and only if rho(M^-1N) < 1. The asymptotic convergence rate is -log(rho(M^-1N)) per iteration.

Common Splittings

  • Split A = D + L + U (diagonal, strictly lower, strictly upper). Jacobi: M = D, N = -(L + U)
  • Gauss-Seidel: M = D + L (lower triangle), N = -U
  • SOR: M = (D + omega*L)/omega, uses relaxation parameter omega

Jacobi Method

In the Jacobi method, each component of x is updated using only values from the previous iteration. The update for component i is: x_i^{(k+1)} = (b_i - sum_{j ≠ i} A_{ij} x_j^{(k)}) / A_{ii}. Because all components are updated simultaneously, Jacobi is easily parallelized. It converges if A is strictly diagonally dominant: |A_{ii}| > sum_{j≠i} |A_{ij}|.
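
The component update translates to a one-line list comprehension. A sketch (names ours) on a strictly diagonally dominant 2×2 system:

```python
def jacobi(A, b, x0, iters):
    # Every component is updated from the *previous* iterate only,
    # which is why the method parallelizes trivially.
    n = len(A)
    x = x0[:]
    for _ in range(iters):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i))
             / A[i][i] for i in range(n)]
    return x

# 2x + y = 3, x + 3y = 4 has solution (1, 1); A is strictly diagonally
# dominant, so convergence is guaranteed (here rho_J = 1/sqrt(6) ≈ 0.41).
x = jacobi([[2.0, 1.0], [1.0, 3.0]], [3.0, 4.0], [0.0, 0.0], 50)
```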

Gauss-Seidel Method

Gauss-Seidel uses the most recently computed component values immediately: x_i^{(k+1)} = (b_i - sum_{j<i} A_{ij} x_j^{(k+1)} - sum_{j>i} A_{ij} x_j^{(k)}) / A_{ii}. This typically converges roughly twice as fast as Jacobi for the same problem class. For symmetric positive definite systems, Gauss-Seidel always converges.

SOR and Optimal Relaxation

Successive Over-Relaxation (SOR) accelerates Gauss-Seidel by combining the current Gauss-Seidel update with the previous iterate: x_i^{new} = (1-omega)*x_i^{old} + omega*x_i^{GS}. For 0 < omega < 1 (under-relaxation) convergence is guaranteed for more problems; for 1 < omega < 2 (over-relaxation) convergence is accelerated. For the model problem (Poisson equation on a grid), the optimal omega is known analytically.

Optimal SOR Parameter (Poisson on n×n grid)

  • rho_J = cos(pi/(n+1)) (Jacobi spectral radius)
  • omega_opt = 2 / (1 + sqrt(1 - rho_J^2))
  • rho_SOR = omega_opt - 1
  • Iterations: O(n) for SOR vs O(n^2) for GS

8. Krylov Subspace Methods

Krylov subspace methods are the state of the art for large sparse linear systems. They build an approximation to the solution from the Krylov subspace K_k(A, r_0) = span{r_0, Ar_0, A^2r_0, ..., A^{k-1}r_0}, where r_0 = b - Ax_0 is the initial residual. Each iteration requires only one matrix-vector product, making them feasible for matrices with millions of unknowns.

Conjugate Gradient (CG) for Symmetric Positive Definite Systems

The Conjugate Gradient method is optimal for SPD systems: it minimizes the A-norm of the error ||e_k||_A = sqrt(e_k^T A e_k) over all vectors in x_0 + K_k. CG converges in at most n iterations (in exact arithmetic), but in practice converges much faster due to eigenvalue clustering.

CG Algorithm

r_0 = b - A*x_0,  p_0 = r_0
for k = 0, 1, 2, ...:
  alpha_k = (r_k^T r_k) / (p_k^T A p_k)
  x_{k+1} = x_k + alpha_k * p_k
  r_{k+1} = r_k - alpha_k * A*p_k
  beta_k  = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
  p_{k+1} = r_{k+1} + beta_k * p_k
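
The pseudocode needs only a stopping test to be usable. A runnable sketch (dense lists for clarity — in practice only the product A·p must be available; names ours):

```python
def conjugate_gradient(A, b, tol=1e-12, max_iter=100):
    # Plain CG for SPD A, starting from x_0 = 0 so r_0 = b.
    n = len(b)
    x = [0.0] * n
    r = b[:]
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol ** 2:          # stop on small residual norm
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

# 4x + y = 1, x + 3y = 2 has solution (1/11, 7/11); a 2x2 SPD system
# finishes in at most two iterations in exact arithmetic.
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```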

CG Convergence Rate

  • ||e_k||_A / ||e_0||_A ≤ 2 * ((sqrt(kappa)-1)/(sqrt(kappa)+1))^k
  • kappa = kappa(A) = lambda_max / lambda_min
  • Iterations for eps accuracy: O(sqrt(kappa) * log(2/eps))
  • vs. O(kappa * log(1/eps)) for Gauss-Seidel

GMRES for Non-Symmetric Systems

GMRES (Generalized Minimal Residual) handles non-symmetric systems by minimizing ||b - Ax_k||_2 over x_0 + K_k. It builds an orthonormal basis for K_k using the Arnoldi process and solves a small least-squares problem at each step. The residual is monotonically non-increasing, but memory and work grow with the iteration count, so restarted GMRES(m) is used in practice — at the risk of stagnation.

GMRES vs. BiCGSTAB Comparison

Property                 GMRES                    BiCGSTAB
Memory                   O(k*n) (grows)           O(n) (fixed)
Residual                 Monotone decrease        Non-monotone
Matrix-vector products   1 per iteration          2 per iteration
Best for                 Well-conditioned, FEM    Mildly non-symmetric

Preconditioning Strategies

A preconditioner M approximates A^-1 cheaply. Left preconditioning solves M^-1Ax = M^-1b; right preconditioning solves AM^-1(Mx) = b. The goal is to cluster the eigenvalues of M^-1A (or AM^-1) near 1, dramatically reducing the number of Krylov iterations required.

Common Preconditioners

  • Diagonal (Jacobi): M = diag(A). Cheap but weak. Effective when diagonal dominates.
  • Incomplete LU (ILU): Partial LU factorization with sparsity pattern of A. ILU(0) keeps only the original sparsity; ILU(k) allows k levels of fill-in.
  • Algebraic Multigrid (AMG): Builds a hierarchy of coarser problems. Near-optimal for elliptic PDEs; O(n) solve cost.
  • Domain Decomposition: Splits domain into subdomains, solves local problems independently. Naturally parallel (Schwarz methods, FETI).

9. Stiff Systems in Detail

Stiff ODEs appear throughout science and engineering wherever multiple processes operate on widely different time scales. Solving them efficiently requires specialized implicit methods and careful linearization strategies.

Stiffness Detection and Diagnosis

A system y' = f(y) is stiff near a solution if the Jacobian J = df/dy has eigenvalues lambda_i satisfying: some |Re(lambda_i)| is large while the solution of interest varies slowly. The stiffness ratio S = max|Re(lambda_i)| / min|Re(lambda_i)| quantifies severity. Practically, stiffness is detected when an explicit solver takes many more steps than accuracy would require.

Stiff Examples

  • Chemical kinetics (fast/slow reactions)
  • Circuit simulation (RC time constants vary 10^9×)
  • Atmospheric chemistry models
  • Combustion (ms ignition, ns radical reactions)
  • Biological networks with fast/slow dynamics

Recommended Solvers by Stiffness

  • Non-stiff: DOPRI5 (RK45), Adams-PC
  • Mildly stiff: SDIRK, RK23s
  • Stiff: VODE/BDF, RADAU (implicit RK)
  • Very stiff/DAE: IDA, DASSL

Linearization and Newton Iteration for Implicit Methods

Implicit methods require solving a nonlinear system G(Y) = 0 at each step. Newton's method is used: Y^(k+1) = Y^(k) - [G'(Y^(k))]^(-1) G(Y^(k)). For BDF methods, G'(Y) = I - h*beta_s*J where J is the Jacobian of f. The Jacobian is typically computed once at the start of the step and reused for several steps (modified Newton or Jacobian lagging).

BDF2 Implicit Solve at Each Step

// BDF2: (3y_{n+1} - 4y_n + y_{n-1})/(2h) = f(y_{n+1})
// Rewrite as G(y_{n+1}) = 0:
G(y) = y - (2h/3)*f(y) - (4y_n - y_{n-1})/3

// Newton iteration:
J_G = I - (2h/3) * J_f
for k = 0, 1, ...:
  J_G * delta = -G(y^(k))   // solve linear system
  y^(k+1) = y^(k) + delta
  if ||delta|| < tol: break
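
For a scalar ODE the linear solve collapses to a division, and the whole scheme fits in a short function. A sketch (test problem and all names are ours; a full Newton solve per step rather than a lagged Jacobian, for simplicity):

```python
import math

def bdf2_newton(f, dfdy, y0, h, n, tol=1e-12):
    # Scalar BDF2 with a Newton solve per step; BDF1 (implicit Euler,
    # also solved by Newton) supplies the starting value y_1.
    def newton(G, dG, y):
        for _ in range(50):
            delta = -G(y) / dG(y)
            y += delta
            if abs(delta) < tol:
                break
        return y
    t = h
    y_prev = y0
    y = newton(lambda v: v - h * f(t, v) - y0,
               lambda v: 1 - h * dfdy(t, v), y0)
    for _ in range(n - 1):
        t += h
        G = lambda v: v - (2*h/3) * f(t, v) - (4*y - y_prev) / 3
        dG = lambda v: 1 - (2*h/3) * dfdy(t, v)
        y_prev, y = y, newton(G, dG, y)
    return y

# Stiff test: y' = -50*(y - cos(t)), y(0) = 1; the solution hugs cos(t)
# after a thin initial layer. h = 0.05 gives h*lambda = -2.5, far outside
# any explicit method's comfort zone, yet BDF2 tracks it accurately.
f = lambda t, y: -50.0 * (y - math.cos(t))
dfdy = lambda t, y: -50.0
y_end = bdf2_newton(f, dfdy, 1.0, 0.05, 20)   # value at t = 1
```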

Differential-Algebraic Equations (DAEs)

DAEs combine differential equations with algebraic constraints: F(t, y, y') = 0. The index of a DAE measures how many differentiations are needed to convert it to a pure ODE. Index-1 DAEs (like those from circuit simulation or incompressible Navier-Stokes) can be handled by BDF methods. Higher-index DAEs require reformulation or specialized solvers to avoid index reduction instabilities.

Pendulum DAE Example

x'' = -lambda*x, y'' = -lambda*y - g
x^2 + y^2 = L^2 (algebraic constraint)
Index 3 DAE — must be reduced to index 1 before solving

10. Error Analysis: Forward, Backward, and Mixed

Numerical analysts distinguish three fundamental types of error analysis. Each gives different insight into algorithm behavior and is appropriate in different contexts.

Forward Error Analysis

Forward error analysis tracks errors as they propagate through each arithmetic operation. Starting from input errors (round-off in representing data), it bounds the final output error by accumulating error bounds at each step. The forward error bound is ||computed_x - true_x|| / ||true_x||. Forward analysis tends to produce pessimistic (overly large) bounds because it assumes worst-case error accumulation at every step.

Backward Error Analysis (Wilkinson's Approach)

Backward error analysis asks: what perturbation to the original problem would produce the computed answer exactly? For Gaussian elimination with partial pivoting, Wilkinson showed that the computed solution x_computed exactly solves (A + delta_A) x_computed = b where ||delta_A||_inf / ||A||_inf ≤ c(n) * g_n * eps_mach for a modest polynomial c(n). This is a small backward error when the growth factor g_n is small — meaning the algorithm is backward stable even if the forward error could be large.

Backward Stability Definition

An algorithm is backward stable if the computed solution is the exact solution to a nearby problem. Precisely: fl(alg(x)) = alg(x + delta_x) where ||delta_x|| / ||x|| = O(eps_mach). A backward stable algorithm applied to a well-conditioned problem gives a small forward error. A backward stable algorithm applied to an ill-conditioned problem gives an answer as good as the problem permits.

Mixed Stability and the Error Relationship

The fundamental relationship connecting backward error, forward error, and conditioning is:

forward error ≤ condition_number × backward_error

||delta_x|| / ||x|| ≤ kappa(A) × ||delta_A|| / ||A||

For backward stable algorithm: backward error = O(eps_mach)
Therefore: forward error = O(kappa(A) × eps_mach)

Global vs. Local Truncation Error for ODEs

For ODE methods, the local truncation error (LTE) measures the defect introduced in a single step assuming exact initial data. An s-stage Runge-Kutta method of order p has LTE = O(h^{p+1}). The global truncation error (GTE) accumulates LTE over n = T/h steps. Since each step introduces O(h^{p+1}) error and there are O(1/h) steps, the GTE is O(h^p). This is why order-p methods achieve O(h^p) global accuracy.

Error Summary for ODE Methods

Method           LTE       GTE       Stiff?
Explicit Euler   O(h^2)    O(h)      No
Implicit Euler   O(h^2)    O(h)      Yes
Trapezoidal      O(h^3)    O(h^2)    Yes
Classic RK4      O(h^5)    O(h^4)    No
BDF2             O(h^3)    O(h^2)    Yes
DOPRI5 (RK45)    O(h^6)    O(h^5)    No

11. Applications in Science and Engineering

Numerical stability analysis is not purely theoretical — it determines whether large-scale simulations give trustworthy results. The following applications illustrate the real-world stakes of choosing stable algorithms.

Climate and Weather Models

Global climate models integrate atmospheric and oceanic PDEs for decades of simulated time. The equations include both fast acoustic waves (requiring small time steps for stability) and slow climate dynamics. Semi-implicit methods treat the fast terms implicitly (allowing large time steps) and the slow nonlinear terms explicitly. Spectral methods in the horizontal direction require careful aliasing control to prevent numerical instability from nonlinear interactions. Energy-conserving discretizations are essential: small energy errors per step accumulate over millions of steps to produce unphysical climate drift.

Key stability concern: Courant-Friedrichs-Lewy (CFL) condition for explicit advection; symplectic integrators for long-time energy conservation

Structural Analysis (Finite Element Method)

Structural finite element analysis assembles large sparse SPD linear systems Ku = f where K is the stiffness matrix. For 3D problems with millions of elements, K may have dimension 10^7 × 10^7. Direct solvers (sparse LU) are feasible for 2D but scale poorly for 3D due to fill-in. Condition numbers can be enormous for near-singular structures (thin shells, nearly incompressible materials). Preconditioning is critical: AMG preconditioners reduce CG iteration counts from O(n) to O(1) for many structural problems.

Key stability concern: ill-conditioning from mesh distortion, locking phenomena in incompressible elements

Computational Fluid Dynamics (CFD)

Navier-Stokes simulations involve both stiffness (from viscous terms at high Reynolds number) and saddle-point structure (from the incompressibility constraint). Operator-splitting methods (fractional step, pressure-correction) decouple velocity and pressure updates but require careful treatment to maintain stability and accuracy. High-order Runge-Kutta methods in time are combined with spectral or high-order finite difference/element methods in space. Numerical diffusion from upwind schemes damps instabilities but can corrupt accuracy at coarse resolutions.

Key stability concern: CFL condition, pressure-velocity coupling, spurious pressure modes in incompressible formulations

Circuit Simulation (SPICE-class solvers)

Electronic circuit simulation involves DAEs from Kirchhoff's laws combined with device model ODEs (transistors, diodes). The time scales span many orders of magnitude: GHz digital switching alongside slow thermal drift. SPICE uses BDF methods (Gear's method) with variable order and step size. The Jacobian (modified nodal admittance matrix) is sparse but can be ill-conditioned when circuit elements have very different impedances. Adaptive step size and order control are essential for efficiency.

Key stability concern: stiffness ratio up to 10^12, DAE index issues with ideal switches, ill-conditioned sparse Jacobians

12. Practice Problems with Solutions

Work through these problems to consolidate your understanding of numerical stability. Attempt each problem before reading its solution.

Problem 1: Stability Region of Explicit RK2

The explicit midpoint method (RK2) applied to y' = lambda*y gives y_{n+1} = (1 + z + z^2/2) * y_n where z = h*lambda. Find all z on the real axis such that |1 + z + z^2/2| ≤ 1. Is this method stable for the heat equation time-stepping with h*lambda = -3?

Solution:

Let R(z) = 1 + z + z^2/2. For real z we need |R(z)| ≤ 1, so the boundary of the stability interval requires R(z) = ±1.

R(z) = -1 gives z^2/2 + z + 2 = 0, whose discriminant is negative — no real solution. Indeed R attains its minimum R(-1) = 1/2, so R never drops below 1/2 on the real axis. R(z) = +1 gives z^2/2 + z = 0, i.e. z = 0 or z = -2 (check: R(-2) = 1 - 2 + 2 = 1).

The real stability interval is therefore -2 ≤ z ≤ 0. For z = -3: R(-3) = 1 - 3 + 4.5 = 2.5 > 1, so the method is UNSTABLE for h*lambda = -3. To stabilize heat-equation time-stepping with this method we need h ≤ 2/|lambda_max|.

Problem 2: Condition Number Calculation

For the Hilbert matrix H_n where H_{ij} = 1/(i+j-1), the condition number grows roughly as kappa(H_n) ≈ e^{3.5n}. For H_5 (a 5×5 Hilbert matrix), estimate how many digits of accuracy you expect when solving H_5 x = b in double precision. If the true solution is x = (1,1,1,1,1)^T and you compute x_computed with ||delta_x|| / ||x|| = 5 × 10^{-5}, is this consistent with the condition number estimate?

Solution:

kappa(H_5) ≈ e^{3.5 × 5} = e^17.5 ≈ 4 × 10^7.


Double precision has ~16 significant digits (eps_mach ≈ 10^-16). Expected digits lost: log10(kappa) ≈ log10(4 × 10^7) ≈ 7.6 digits. Expected accuracy: ~16 - 8 = 8 significant digits.


Forward error bound: ||delta_x|| / ||x|| ≤ kappa(H_5) × eps_mach ≈ 4 × 10^7 × 10^-16 = 4 × 10^-9.


An observed error of 5 × 10^(-5) is about four orders of magnitude larger than the theoretical bound of ~4 × 10^(-9), so it is not consistent with the estimate. (In fact the true kappa_2(H_5) ≈ 4.77 × 10^5 — the asymptotic e^(3.5n) overestimates at n = 5 — which makes the bound tighter still.) An error this large signals a source beyond backward-stable round-off: for example, right-hand side data b already perturbed at the ~10^(-10) relative level, or a computation carried out in lower precision.

Problem 3: Gershgorin Disk Analysis

Apply the Gershgorin circle theorem to the matrix A = [[5, -1, 0.5], [-1, 4, -1], [0.5, -1, 3]]. Find all three Gershgorin disks and determine whether the matrix is guaranteed to be positive definite.

Solution:

Row 1: center = 5, R1 = |-1| + |0.5| = 1.5 → D1 = [3.5, 6.5]
Row 2: center = 4, R2 = |-1| + |-1| = 2.0 → D2 = [2.0, 6.0]
Row 3: center = 3, R3 = |0.5| + |-1| = 1.5 → D3 = [1.5, 4.5]


All three intervals lie on the positive real axis (leftmost point: min(3.5, 2.0, 1.5) = 1.5 > 0). Since A is symmetric, its eigenvalues are real, and they all lie in [1.5, 6.5] — all positive — so the matrix is guaranteed positive definite. The minimum eigenvalue is at least 1.5 and the maximum at most 6.5, giving kappa(A) ≤ 6.5/1.5 ≈ 4.3: a well-conditioned SPD matrix suitable for Cholesky factorization.

Problem 4: Jacobi vs. Gauss-Seidel Convergence

For the 2×2 system Ax = b with A = [[2, 1], [1, 3]], find the iteration matrices for the Jacobi and Gauss-Seidel methods. Compute the spectral radius of each and determine which converges faster.


Solution:

D = [[2,0],[0,3]], L = [[0,0],[1,0]], U = [[0,1],[0,0]]

Jacobi: M_J = -D^(-1)(L+U) = [[0,-1/2],[-1/3,0]]
Eigenvalues of M_J: lambda^2 = (1/2)(1/3) = 1/6
lambda = ±1/sqrt(6) ≈ ±0.408
rho(M_J) = 1/sqrt(6) ≈ 0.408

Gauss-Seidel: M_GS = -(D+L)^(-1)U
(D+L) = [[2,0],[1,3]], (D+L)^(-1) = [[1/2,0],[-1/6,1/3]]
M_GS = -[[1/2,0],[-1/6,1/3]] * [[0,1],[0,0]] = [[0,-1/2],[0,1/6]]
Eigenvalues: 0 and 1/6
rho(M_GS) = 1/6 ≈ 0.167


Gauss-Seidel converges faster: rho = 1/6 ≈ 0.167 vs. Jacobi rho = 0.408. Note that rho(M_GS) = rho(M_J)^2, which is the classical result for consistently ordered matrices.
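The iteration matrices and spectral radii above can be reproduced directly; a minimal NumPy sketch using the same D, L, U splitting as the solution:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
D = np.diag(np.diag(A))
L = np.tril(A, -1)
U = np.triu(A, 1)

M_J  = -np.linalg.solve(D, L + U)   # Jacobi iteration matrix
M_GS = -np.linalg.solve(D + L, U)   # Gauss-Seidel iteration matrix

def rho(M):
    """Spectral radius: largest eigenvalue magnitude."""
    return max(abs(np.linalg.eigvals(M)))

print(f"rho(Jacobi)       = {rho(M_J):.4f}")   # 1/sqrt(6) ~ 0.4082
print(f"rho(Gauss-Seidel) = {rho(M_GS):.4f}")  # 1/6 ~ 0.1667
```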

Problem 5: Adaptive Step Size Selection

An adaptive RK45 solver attempts a step of h = 0.1 and estimates an error of err_est = 3.2 × 10^{-6} against a tolerance of tol = 10^{-7}. The safety factor is 0.9 and the order is 5. Should the step be accepted? What step size should be used next?


Solution:

err_ratio = err_est / tol = 3.2e-6 / 1e-7 = 32
err_ratio = 32 > 1 → REJECT the step

New step: h_new = h × fac × (1/err_ratio)^(1/(p+1))
= 0.1 × 0.9 × (1/32)^(1/6)
= 0.1 × 0.9 × (0.03125)^(0.1667)
= 0.1 × 0.9 × 0.5612
= 0.0505


The step is rejected. Retry with h_new ≈ 0.0505. Since the error estimate of a 5th-order method scales as h^6, roughly halving the step cuts the error by a factor of about (0.505)^6 ≈ 1/60, bringing err_est from 3.2 × 10^-6 down to ~5 × 10^-8 — safely below the tolerance. (The safety factor 0.9 deliberately targets a value under tol rather than exactly at it.)
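The accept/reject logic generalizes into a small controller. A sketch in Python; the clamping bounds fac_min and fac_max are illustrative additions (common in production solvers) not taken from the problem statement:

```python
def next_step(h, err_est, tol, order=5, safety=0.9, fac_min=0.2, fac_max=5.0):
    """Elementary step-size controller: accept when err_est <= tol and
    propose h_new = safety * h * (tol / err_est)^(1 / (order + 1))."""
    accept = err_est <= tol
    factor = safety * (tol / err_est) ** (1.0 / (order + 1))
    factor = min(fac_max, max(fac_min, factor))  # clamp against wild jumps
    return accept, h * factor

accept, h_new = next_step(h=0.1, err_est=3.2e-6, tol=1e-7)
print(accept, round(h_new, 4))  # False 0.0505
```

The same formula also grows the step after an unusually accurate accepted step, which is why the upper clamp matters.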

Problem 6: Partial Pivoting Growth Factor

Perform LU factorization with partial pivoting on the matrix A = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]. Identify the permutation matrix P and factorization LU. Compute the growth factor g = max|U_{ij}| / max|A_{ij}|.


Solution:

Step 1: Pivot column 1. Max |entry| in col 1 is 7 (row 3). Swap rows 1 and 3:
A → [[7, 8, 10], [4, 5, 6], [1, 2, 3]], P records swap (1↔3)
Multipliers: l_21 = 4/7, l_31 = 1/7
After elimination:
[[7, 8, 10], [0, 5-32/7, 6-40/7], [0, 2-8/7, 3-10/7]]
= [[7, 8, 10], [0, 3/7, 2/7], [0, 6/7, 11/7]]

Step 2: Pivot column 2 (rows 2-3). Max is |6/7| (row 3). Swap rows 2 and 3:
[[7,8,10], [0,6/7,11/7], [0,3/7,2/7]]
l_32 = (3/7)/(6/7) = 1/2
U[3,3] = 2/7 - (1/2)(11/7) = 2/7 - 11/14 = 4/14 - 11/14 = -7/14 = -1/2

U = [[7, 8, 10], [0, 6/7, 11/7], [0, 0, -1/2]]
max|U| = 10, max|A| = 10
Growth factor g = 10/10 = 1.0


The growth factor is 1.0 — the elements did not grow at all. This is typical of partial pivoting in practice: the worst-case bound is 2^(n-1) = 4 for a 3×3 matrix, but the actual growth is almost always much smaller.
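The elimination above can be replayed in code. A hand-rolled Gaussian-elimination sketch in NumPy (`scipy.linalg.lu` would produce the same factors), returning P, L, U with P A = L U:

```python
import numpy as np

def lu_partial_pivot(A):
    """Gaussian elimination with partial pivoting; returns P, L, U with PA = LU."""
    A = A.astype(float).copy()
    n = len(A)
    P = np.eye(n)
    L = np.eye(n)
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))    # row with largest pivot candidate
        if p != k:
            A[[k, p]] = A[[p, k]]
            P[[k, p]] = P[[p, k]]
            L[[k, p], :k] = L[[p, k], :k]      # keep earlier multipliers aligned
        for i in range(k + 1, n):
            L[i, k] = A[i, k] / A[k, k]
            A[i, k:] -= L[i, k] * A[k, k:]
    return P, L, A

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 10]])
P, L, U = lu_partial_pivot(A)
g = np.abs(U).max() / np.abs(A).max()
print("U =\n", U)
print("growth factor =", g)  # 1.0 for this matrix
```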

Problem 7: CG Convergence Estimate

A symmetric positive definite matrix A has smallest eigenvalue lambda_min = 0.5 and largest eigenvalue lambda_max = 200. Using the conjugate gradient method, how many iterations are needed to reduce the A-norm error by a factor of 10^{-8}? If we use a preconditioner that reduces the effective condition number to 10, how many iterations are needed?


Solution:

kappa(A) = lambda_max / lambda_min = 200 / 0.5 = 400
sqrt(kappa) = 20
rho = (sqrt(kappa)-1)/(sqrt(kappa)+1) = 19/21 ≈ 0.905

Need: 2 * rho^k ≤ 10^-8
rho^k ≤ 5 × 10^-9
k ≥ log(5e-9) / log(0.905) ≈ -19.1 / (-0.0998) ≈ 191 iterations

With preconditioner (kappa_eff = 10):
sqrt(10) ≈ 3.162
rho_pc = (3.162-1)/(3.162+1) = 2.162/4.162 ≈ 0.519
k ≥ log(5e-9) / log(0.519) ≈ -19.1 / (-0.655) ≈ 30 iterations


Preconditioning reduces the iteration count from ~191 to ~30 — a 6× speedup. The extra cost per iteration (one application of the preconditioner M on top of the matrix-vector product with A) is typically small compared with the savings from shrinking sqrt(kappa), which makes preconditioning extraordinarily valuable for large-scale problems.
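Both iteration counts follow from one formula. A minimal Python helper (the ceiling convention here is a choice — real solvers stop on a residual test, not on this a priori bound):

```python
import math

def cg_iters(kappa, reduction=1e-8):
    """Iterations k needed so that 2 * rho^k <= reduction, where
    rho = (sqrt(kappa) - 1) / (sqrt(kappa) + 1) is the CG error-bound rate."""
    s = math.sqrt(kappa)
    rho = (s - 1.0) / (s + 1.0)
    return math.ceil(math.log(reduction / 2.0) / math.log(rho))

print(cg_iters(400))  # 191
print(cg_iters(10))   # 30
```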

Quick Reference: Methods and Stability Properties

| Method | Order | Stability | Cost/Step | Best For |
| --- | --- | --- | --- | --- |
| Explicit Euler | 1 | Stability disk \|z + 1\| ≤ 1 | 1 f-eval | Demos only |
| Classic RK4 | 4 | Bounded, explicit | 4 f-evals | Non-stiff ODEs |
| DOPRI5 (RK45) | 5(4) | Adaptive, explicit | 6 f-evals | Non-stiff, adaptive |
| Implicit Euler | 1 | A-stable, L-stable | 1 Newton solve | Very stiff |
| Trapezoidal (CN) | 2 | A-stable, not L-stable | 1 Newton solve | Parabolic PDEs |
| BDF2 | 2 | A(alpha)-stable | 1 Newton solve | Stiff ODEs |
| RADAU IIA | 5 | A-stable, L-stable | 3-stage Newton | Very stiff, DAEs |
| Adams-Bashforth 4 | 4 | Explicit, bounded | 1 f-eval | Non-stiff, cheap f |
| LU + Partial Pivot | — | Backward stable | O(n^3) | Dense systems |
| Cholesky | — | Unconditionally stable | O(n^3/6) | SPD systems |
| CG | — | Convergent (SPD) | O(sqrt(kappa)·n) | Large sparse SPD |
| GMRES | — | Convergent (general) | O(k·n) memory | Large sparse general |
