Scientific Computing

Advanced Numerical Methods

Numerical methods turn mathematical problems into algorithms that computers can solve. This guide covers the theory and practice of floating-point arithmetic, linear systems, eigenvalues, nonlinear equations, interpolation, quadrature, ODEs, PDEs, optimization, and fast algorithms — everything you need for graduate-level scientific computing.

Learning Objectives

  • Understand IEEE 754, machine epsilon, and catastrophic cancellation
  • Apply Gaussian elimination, LU decomposition, and pivoting strategies
  • Analyze condition numbers and understand ill-conditioned systems
  • Implement and compare iterative solvers: Jacobi, Gauss-Seidel, CG
  • Compute eigenvalues with the power method and QR algorithm
  • Solve nonlinear equations with Newton-Raphson and understand convergence rates
  • Construct interpolating polynomials and cubic splines
  • Apply Gaussian quadrature and adaptive integration
  • Implement RK4 and understand stiffness and implicit methods
  • Discretize PDEs with finite differences and understand stability via CFL
  • Apply gradient descent, Newton, and BFGS for optimization
  • Understand the FFT algorithm and its O(n log n) complexity

1. Floating-Point Arithmetic and IEEE 754

Every numerical computation runs on hardware that represents real numbers in floating-point format. Understanding how this representation works — and where it fails — is the foundation of numerical analysis.

IEEE 754 Double Precision

The IEEE 754 standard defines the most widely used floating-point format. Double precision (64-bit) stores a number as: (-1)^s * m * 2^e, where s is a 1-bit sign, e is an 11-bit biased exponent, and m is a 52-bit mantissa (significand). This gives roughly 15-16 significant decimal digits and an exponent range of approximately 10^(-308) to 10^(308).

IEEE 754 Double Precision Format

  • Sign bit: 1 bit
  • Exponent: 11 bits (biased by 1023)
  • Mantissa: 52 bits (with implicit leading 1)
  • Precision: about 15.9 significant decimal digits
  • Machine epsilon: eps = 2^(-52) approximately 2.22e-16
  • Smallest positive normal: 2^(-1022) approximately 2.23e-308
  • Largest finite: (2 - eps) * 2^1023 approximately 1.80e308

Machine Epsilon

Machine epsilon eps = 2^(-52) is the gap between 1 and the next larger representable double. (It is often loosely described as the smallest positive number for which fl(1 + eps) > 1; under round-to-nearest that distinction matters, since 1 + eps/2 already rounds back to exactly 1.) It bounds the relative rounding error: for any real number x in the representable range, the floating-point representation fl(x) satisfies |fl(x) - x| / |x| at most eps/2. This means every floating-point operation introduces a relative error of at most eps/2, a quantity called the unit roundoff.
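A minimal sketch of these facts in Python (double precision is IEEE 754 binary64 on all mainstream platforms):

```python
import sys

# Machine epsilon for IEEE 754 double precision.
eps = 2.0 ** -52

# 1 + eps is the next representable double after 1.0.
print(1.0 + eps > 1.0)        # True

# 1 + eps/2 lies exactly halfway between 1.0 and 1.0 + eps;
# round-to-nearest-even rounds it back down to 1.0.
print(1.0 + eps / 2 == 1.0)   # True

# Python exposes the same constant directly.
print(sys.float_info.epsilon == eps)  # True
```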

Practical Implication

Never compare two floating-point numbers with ==. Instead, check |a - b| < tol where tol is chosen relative to the magnitude of the numbers, e.g., tol = eps * max(|a|, |b|). Equality testing fails because the same mathematical value may round differently depending on the order of operations.
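A short sketch of the comparison pitfall; the `approx_equal` helper and its `rtol` default are illustrative choices (a small multiple of eps), not a library API:

```python
# Naive equality fails: 0.1 + 0.2 rounds differently from 0.3.
a = 0.1 + 0.2
b = 0.3
print(a == b)  # False

eps = 2.0 ** -52

def approx_equal(x, y, rtol=4 * eps):
    # Relative-tolerance test, scaled by the magnitude of the operands.
    return abs(x - y) <= rtol * max(abs(x), abs(y))

print(approx_equal(a, b))  # True
```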

Catastrophic Cancellation

When two nearly equal floating-point numbers are subtracted, most significant digits cancel, leaving a result dominated by rounding error. This is catastrophic cancellation — not a bug, but an inherent limitation of finite-precision arithmetic.

Unstable Formulation

sqrt(x + h) - sqrt(x) for small h

Both terms round to nearly the same value; the difference loses almost all significant digits.

Stable Formulation

h / (sqrt(x + h) + sqrt(x))

Algebraically equivalent but avoids subtraction of nearly equal numbers.

Other classic examples requiring reformulation include: the quadratic formula (use the numerically stable form for small discriminants), computing log(1 + x) for small x (use the built-in log1p function), and evaluating exp(x) - 1 for small x (use expm1).
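The square-root example above can be demonstrated directly; the values of x and h below are hypothetical, chosen so the two terms nearly cancel:

```python
import math

x, h = 1.0e8, 1.0e-4   # true answer is very close to 5.0e-9

# Unstable: both square roots round to nearly the same double,
# so the subtraction wipes out most significant digits.
unstable = math.sqrt(x + h) - math.sqrt(x)

# Stable: algebraically equivalent, no subtraction of close numbers.
stable = h / (math.sqrt(x + h) + math.sqrt(x))

print(unstable)  # accurate to only a few digits
print(stable)    # accurate to nearly full precision
```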

Condition Numbers of Functions

The condition number of a scalar function f at input x measures how sensitive the output is to small perturbations in the input. For a smooth function, the relative condition number is |x * f'(x) / f(x)|. A large condition number means the problem is inherently sensitive — even an exact algorithm will give inaccurate results when the input has rounding error.

Example: Condition Number of Subtraction

For f(x, y) = x - y, the relative condition number (with respect to x) is |x / (x - y)|. When x is approximately equal to y, the denominator is tiny and the condition number is enormous. This confirms that subtraction of nearly equal numbers is inherently ill-conditioned — catastrophic cancellation is not a flaw in the algorithm but a consequence of problem sensitivity.

2. Solving Linear Systems: Direct Methods

Solving Ax = b for x is the central computational problem of numerical linear algebra. Direct methods produce an exact solution (in exact arithmetic) through systematic elimination.

Gaussian Elimination

Gaussian elimination transforms A into upper triangular form U through a sequence of row operations, then solves by back substitution. For an n x n system it requires about n^3/3 multiplications and n^3/3 additions, so O(n^3) work overall.

Algorithm: Forward Elimination

  • for k = 1 to n-1:
  • for i = k+1 to n:
  • m = A[i,k] / A[k,k] (multiplier)
  • for j = k to n:
  • A[i,j] = A[i,j] - m * A[k,j]
  • b[i] = b[i] - m * b[k]
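The pseudocode above translates directly into Python. This is a minimal sketch without pivoting (it assumes every pivot is nonzero), applied to a small hypothetical test system:

```python
def gauss_solve(A, b):
    """Solve Ax = b by forward elimination and back substitution.

    Minimal sketch, no pivoting: assumes every pivot A[k][k] is nonzero.
    A (list of lists) and b (list) are modified in place.
    """
    n = len(A)
    # Forward elimination: zero out column k below the pivot.
    for k in range(n - 1):
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]          # multiplier
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

A = [[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]]
b = [8.0, -11.0, -3.0]
x = gauss_solve(A, b)
print(x)  # [2.0, 3.0, -1.0]
```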

Partial Pivoting

Without pivoting, Gaussian elimination fails when a pivot element A[k,k] is zero, and becomes numerically unstable when it is small. Partial pivoting selects the row with the largest absolute value in column k at each step and swaps it to the pivot position. This ensures multipliers satisfy |m| at most 1, bounding error growth.

Why Pivoting Matters

Without pivoting, the system [0.001, 1; 1, 1] x = [1; 2] leads to a multiplier of 1000, and rounding errors are amplified by that factor. With partial pivoting, the rows are swapped so the multiplier is 0.001, and rounding errors remain small. In practice, always use partial pivoting.

LU Decomposition

LU decomposition factors A = LU where L is lower triangular with ones on the diagonal and U is upper triangular. With partial pivoting, PA = LU where P is a permutation matrix. Once the factorization is computed in O(n^3) time, each new right-hand side b can be solved in O(n^2) time via forward and back substitution. This makes LU ideal when solving Ax = b for many different b vectors with the same A.

Forward Substitution (Ly = b)

y[1] = b[1]
y[i] = b[i] - sum(L[i,j] * y[j], j=1..i-1)

Back Substitution (Ux = y)

x[n] = y[n] / U[n,n]
x[i] = (y[i] - sum(U[i,j] * x[j], j=i+1..n)) / U[i,i]
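The two substitution formulas can be sketched as follows; the 2 x 2 factors L and U below are a hypothetical example (their product is the SPD matrix [[4, 2], [2, 4]]):

```python
def forward_sub(L, b):
    # Solve L y = b, with L lower triangular and unit diagonal.
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    return y

def back_sub(U, y):
    # Solve U x = y, with U upper triangular.
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x

L = [[1.0, 0.0], [0.5, 1.0]]
U = [[4.0, 2.0], [0.0, 3.0]]
b = [6.0, 6.0]            # A = LU = [[4, 2], [2, 4]]; true solution (1, 1)
x = back_sub(U, forward_sub(L, b))
print(x)  # [1.0, 1.0]
```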

Condition Number of a Matrix

The 2-norm condition number of A is kappa(A) = sigma_max / sigma_min, where sigma_max and sigma_min are the largest and smallest singular values. It measures sensitivity of the solution x to perturbations in A or b. The bound on relative error in x due to relative perturbation delta_b in b satisfies:

norm(delta_x) / norm(x) at most kappa(A) * norm(delta_b) / norm(b)

Rule of thumb: if kappa(A) is approximately 10^k, expect to lose about k decimal digits of accuracy in the solution. For double precision with 16 digits, a system with kappa = 10^10 will give only about 6 correct digits.
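The rule of thumb can be observed on a Hilbert matrix, a classic ill-conditioned family; this sketch assumes NumPy is available:

```python
import numpy as np

# 8 x 8 Hilbert matrix: H[i][j] = 1 / (i + j + 1).
n = 8
H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])

kappa = np.linalg.cond(H)     # 2-norm condition number
x_true = np.ones(n)
b = H @ x_true                # right-hand side with known solution
x = np.linalg.solve(H, b)
error = np.max(np.abs(x - x_true))

# With kappa ~ 10^10, expect only ~6 of 16 digits to survive.
print(f"kappa = {kappa:.1e}, max error = {error:.1e}")
```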

Cholesky Decomposition

For symmetric positive definite (SPD) matrices, the Cholesky factorization A = L L^T is twice as efficient as general LU decomposition, requiring only about n^3/6 multiplications. It is also numerically more stable — no pivoting is needed because the diagonal entries of L are always positive. SPD systems arise constantly in optimization, finite element methods, and statistics.

3. Iterative Methods for Linear Systems

For large sparse systems, direct methods are impractical: LU decomposition can destroy sparsity through fill-in, requiring far more storage and work than the original system. Iterative methods instead produce a sequence of approximate solutions converging to the true solution, exploiting sparsity at every step.

Jacobi and Gauss-Seidel Methods

Both methods split A = D + L + U where D is diagonal, L is strictly lower triangular, and U is strictly upper triangular.

Jacobi Method

x^(k+1) = D^(-1) (b - (L+U) x^(k))

Each component of x^(k+1) uses only values from the previous iterate x^(k). Naturally parallelizable.

Gauss-Seidel Method

x^(k+1) = (D+L)^(-1) (b - U x^(k))

Uses the most recently updated components immediately. Typically converges about twice as fast as Jacobi.

Both methods converge when A is strictly diagonally dominant (|A[i,i]| greater than sum over j not equal to i of |A[i,j]|). Gauss-Seidel also converges for SPD matrices. The convergence rate is determined by the spectral radius of the iteration matrix: the method converges if and only if all eigenvalues of the iteration matrix have magnitude less than 1.
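The Jacobi update can be sketched as follows; the 3 x 3 system is a hypothetical strictly diagonally dominant example (true solution (1, 1, 1)), so convergence is guaranteed:

```python
def jacobi(A, b, iters=100):
    # Jacobi iteration: each new component uses only the previous iterate.
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
    return x

# Strictly diagonally dominant: |A[i][i]| > sum of other |A[i][j]| in each row.
A = [[4.0, 1.0, 1.0], [1.0, 5.0, 2.0], [1.0, 2.0, 6.0]]
b = [6.0, 8.0, 9.0]
x = jacobi(A, b)
print(x)  # converges to (1, 1, 1)
```

Gauss-Seidel differs only in that the comprehension would read already-updated entries of x in place.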

Successive Over-Relaxation (SOR)

SOR accelerates Gauss-Seidel by introducing a relaxation parameter omega in (0, 2). The update is:

x_i^(k+1) = (1 - omega) x_i^(k) + omega * (b_i - sum over j<i of A[i,j]*x_j^(k+1) - sum over j>i of A[i,j]*x_j^(k)) / A[i,i]

For omega = 1, SOR reduces to Gauss-Seidel. For the optimal omega (which depends on the spectral radius of the Jacobi iteration matrix), SOR can achieve dramatically faster convergence. For a model problem (Poisson equation on a uniform grid), the optimal omega approaches 2 as the grid is refined, and SOR converges in O(n) iterations versus O(n^2) for Gauss-Seidel.

Conjugate Gradient Method

The Conjugate Gradient (CG) method is the gold standard for large SPD systems. It is a Krylov subspace method: the iterate x^(k) is the best approximation to x in the Krylov subspace K_k(A, r_0) = span(r_0, A r_0, A^2 r_0, ..., A^(k-1) r_0), where r_0 = b - A x^(0) is the initial residual.

CG Algorithm (simplified)

  • r = b - A*x; p = r; rsold = r^T * r
  • for each iteration:
  • Ap = A * p
  • alpha = rsold / (p^T * Ap)
  • x = x + alpha * p
  • r = r - alpha * Ap
  • rsnew = r^T * r
  • beta = rsnew / rsold
  • p = r + beta * p
  • rsold = rsnew

CG terminates in at most n iterations in exact arithmetic (where n is the matrix dimension), but in practice converges much faster. The convergence rate depends on the condition number: the error after k steps satisfies a bound involving ((sqrt(kappa) - 1) / (sqrt(kappa) + 1))^k. Preconditioning transforms the system to reduce the effective condition number, dramatically accelerating convergence. Common preconditioners include incomplete LU (ILU), incomplete Cholesky, and algebraic multigrid.
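The simplified algorithm above is a direct transcription away from runnable code; this sketch assumes NumPy and uses a 1D discrete Laplacian as a hypothetical SPD test matrix:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    # CG for SPD A, following the simplified algorithm above.
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rsold = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rsold / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rsnew = r @ r
        if np.sqrt(rsnew) < tol:          # converged
            break
        p = r + (rsnew / rsold) * p       # beta = rsnew / rsold
        rsold = rsnew
    return x

# SPD test system: tridiagonal 1D Laplacian.
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = conjugate_gradient(A, b)
print(np.max(np.abs(A @ x - b)))  # residual near the tolerance
```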

4. Eigenvalue Problems

Computing eigenvalues and eigenvectors of a matrix arises in structural mechanics, quantum chemistry, graph theory, data science (PCA), and stability analysis of dynamical systems.

The Power Method

The power method finds the largest-magnitude eigenvalue (dominant eigenvalue) and its eigenvector. Starting from a random vector v_0, iterate: v_(k+1) = A v_k / norm(A v_k). The iterates converge to the dominant eigenvector at a rate determined by the ratio |lambda_2 / lambda_1|, where lambda_1 is dominant and lambda_2 is the second largest. Slow convergence occurs when |lambda_2 / lambda_1| is close to 1.
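A sketch of the iteration, assuming NumPy; the 2 x 2 test matrix is hypothetical, with eigenvalues (5 ± sqrt(5))/2:

```python
import numpy as np

def power_method(A, iters=200):
    # Returns estimates of the dominant eigenvalue and a unit eigenvector.
    rng = np.random.default_rng(0)
    v = rng.standard_normal(A.shape[0])
    for _ in range(iters):
        w = A @ v
        v = w / np.linalg.norm(w)     # normalize each step
    lam = v @ A @ v                   # Rayleigh quotient estimate
    return lam, v

A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, v = power_method(A)
print(lam)  # ~3.618, the dominant eigenvalue (5 + sqrt(5)) / 2
```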

Inverse Iteration and Shift-Invert

To find the eigenvalue closest to a target sigma, apply the power method to (A - sigma I)^(-1). This matrix has eigenvalues 1 / (lambda_i - sigma), so the eigenvalue of A closest to sigma becomes the dominant eigenvalue of the shifted-inverted matrix. This requires solving (A - sigma I) y = v at each step, which can be done efficiently with a precomputed LU factorization.

The QR Algorithm

The QR algorithm is the standard method for computing all eigenvalues of a dense matrix. The basic iteration:

  • A_0 = A
  • for k = 0, 1, 2, ...:
  • A_k = Q_k R_k (QR factorization)
  • A_(k+1) = R_k Q_k (reverse product)

The matrices A_k are orthogonally similar to A (same eigenvalues) and converge to quasi-upper triangular (Schur) form, with eigenvalues on the diagonal (real case) or 2x2 blocks (complex conjugate pairs).

The practical QR algorithm incorporates two crucial refinements. First, reduce A to upper Hessenberg form (zero below the first subdiagonal) in O(n^3) time before iterating — each QR step on Hessenberg form costs only O(n^2). Second, use shifts: replace A_k by A_k - sigma_k I before the QR step, restoring the shift afterward, to accelerate convergence. The Francis double-shift strategy achieves cubic convergence and is the basis of LAPACK routines.
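The basic unshifted iteration (without the Hessenberg reduction or shifts) can be sketched with NumPy on the same small symmetric example:

```python
import numpy as np

def qr_iteration(A, iters=100):
    # Basic unshifted QR iteration; each A_k is orthogonally similar to A.
    Ak = A.copy()
    for _ in range(iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                 # reverse the product
    return Ak

A = np.array([[2.0, 1.0], [1.0, 3.0]])   # eigenvalues (5 ± sqrt(5)) / 2
Ak = qr_iteration(A)
print(np.diag(Ak))   # converges to the eigenvalues, largest first
print(Ak[1, 0])      # subdiagonal entry decays toward zero
```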

Lanczos Method for Large Sparse Matrices

For large sparse symmetric matrices, the Lanczos algorithm builds an orthonormal basis for the Krylov subspace K_k(A, v_1) and reduces A to a symmetric tridiagonal matrix T_k of dimension k. The eigenvalues of T_k (called Ritz values) approximate the extreme eigenvalues of A remarkably well, often converging in far fewer than n steps. The Lanczos method requires only matrix-vector products with A, making it ideal for structured matrices. Numerical issues (loss of orthogonality) require periodic re-orthogonalization in practice.

5. Nonlinear Equations: Root-Finding Methods

Finding x such that f(x) = 0 arises everywhere in science and engineering. The key concepts are existence (intermediate value theorem), uniqueness, and rate of convergence.

Bisection Method

Bisection exploits the intermediate value theorem: if f is continuous and f(a) and f(b) have opposite signs, there is a root in (a, b). Bisect the interval, test the midpoint, and keep the half containing a sign change. The error halves each iteration, giving linear convergence with rate 1/2. After n iterations, the error is at most (b - a) / 2^n.

Bisection Advantages

Guaranteed to converge (for continuous f with a sign change). Does not require derivatives. Predictable convergence rate. The number of iterations needed to achieve tolerance tol is ceiling(log2((b-a)/tol)). For 10^(-10) accuracy on an interval of length 1, about 33 iterations suffice.

Newton-Raphson Method

Newton-Raphson linearizes f at the current iterate x_k and solves for the root of the linear approximation:

x_(k+1) = x_k - f(x_k) / f'(x_k)

Near a simple root (f'(r) not equal to 0), Newton-Raphson converges quadratically: the error satisfies e_(k+1) is approximately |f''(r) / (2 f'(r))| * e_k^2. This means the number of correct digits roughly doubles each iteration — from 2 to 4 to 8 to 16. Starting within the basin of attraction is critical; far from the root, Newton may diverge or cycle.
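The digit-doubling behavior is easy to see on the hypothetical example f(x) = x^2 - 2, whose root is sqrt(2):

```python
import math

f = lambda x: x * x - 2.0
fp = lambda x: 2.0 * x

x = 1.0
for k in range(5):
    x = x - f(x) / fp(x)                    # Newton-Raphson update
    print(k + 1, x, abs(x - math.sqrt(2)))  # error roughly squares each step
```

The printed errors fall roughly as 8e-2, 2e-3, 2e-6, 2e-12, machine precision — the quadratic convergence pattern described above.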

Secant Method

The secant method approximates the derivative in Newton's formula using a finite difference:

x_(k+1) = x_k - f(x_k) * (x_k - x_(k-1)) / (f(x_k) - f(x_(k-1)))

The secant method converges superlinearly with order phi = (1 + sqrt(5)) / 2 approximately 1.618 (the golden ratio). Requires two initial points and one function evaluation per iteration (versus Newton's one, plus a derivative evaluation). For problems where derivatives are expensive or unavailable, the secant method is often preferred.

Fixed-Point Iteration

To solve f(x) = 0, rewrite as x = g(x) and iterate x_(k+1) = g(x_k). The Banach fixed-point theorem guarantees convergence if g is a contraction on a closed interval: |g'(x)| at most L less than 1 for all x in the interval. The convergence is linear with rate L. Newton's method is a special case of fixed-point iteration with g(x) = x - f(x)/f'(x), whose derivative at a root is 0, explaining the quadratic convergence.

Convergence Rate Summary

Method                  Order         Evaluations/Step  Notes
Bisection               1 (rate 1/2)  1                 Guaranteed; slow
Secant                  1.618         1                 No derivative needed
Newton-Raphson          2             2 (f and f')      Fastest near root
Newton (repeated root)  1 (linear)    2                 Degrades to linear
Halley's method         3             3 (f, f', f'')    Cubic convergence

6. Interpolation and Approximation

Interpolation constructs a function passing exactly through given data points. The choice of interpolant affects accuracy, computational cost, and qualitative behavior.

Lagrange Interpolation

Given n+1 data points (x_0, y_0), ..., (x_n, y_n) with distinct nodes, there exists a unique polynomial p of degree at most n passing through all points. The Lagrange form expresses this polynomial explicitly:

p(x) = sum over i=0..n of y_i * L_i(x)

L_i(x) = product over j not equal to i of (x - x_j) / (x_i - x_j)

Each basis polynomial L_i satisfies L_i(x_i) = 1 and L_i(x_j) = 0 for j not equal to i.
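A direct (if not the most efficient) evaluation of the Lagrange form; the three data points below are a hypothetical sample of y = x^2, which a degree-2 interpolant reproduces exactly:

```python
def lagrange_eval(xs, ys, x):
    # Evaluate the Lagrange interpolating polynomial at x.
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        Li = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                Li *= (x - xj) / (xi - xj)   # basis polynomial L_i(x)
        total += yi * Li
    return total

xs, ys = [0.0, 1.0, 2.0], [0.0, 1.0, 4.0]   # samples of y = x^2
print(lagrange_eval(xs, ys, 1.5))  # 2.25
```

For many evaluation points, the barycentric form of the same polynomial is preferred in practice for speed and stability.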

Runge's Phenomenon

High-degree polynomial interpolation on equally spaced nodes suffers from Runge's phenomenon: large oscillations near the endpoints, even for smooth functions like f(x) = 1 / (1 + 25x^2) on [-1, 1]. The remedy is to use Chebyshev nodes x_k = cos(k*pi/n) for k = 0..n, which cluster near the endpoints and eliminate the oscillation.

Newton Divided Differences

Newton's divided difference form expresses the interpolating polynomial recursively and is computationally efficient for adding new data points without recomputing from scratch:

  • f[x_i] = y_i (zeroth divided differences)
  • f[x_i, x_(i+1)] = (f[x_(i+1)] - f[x_i]) / (x_(i+1) - x_i)
  • f[x_i,...,x_(i+k)] = (f[x_(i+1),...,x_(i+k)] - f[x_i,...,x_(i+k-1)]) / (x_(i+k) - x_i)
  • p(x) = f[x_0] + f[x_0,x_1](x-x_0) + f[x_0,x_1,x_2](x-x_0)(x-x_1) + ...

Cubic Splines

A cubic spline on nodes x_0 < x_1 < ... < x_n is a piecewise cubic polynomial S(x) that:

  • Interpolates: S(x_i) = y_i for all i
  • Is continuous in value, first derivative, and second derivative at interior nodes
  • Satisfies two additional boundary conditions (natural, clamped, or periodic)

The natural spline adds the condition S''(x_0) = S''(x_n) = 0. The unknown second derivatives at interior nodes are found by solving a tridiagonal linear system, which is inexpensive (O(n) with the Thomas algorithm). Cubic splines are the gold standard for smooth interpolation in practice because they minimize oscillation while fitting all data points.

Hermite Interpolation

Hermite interpolation matches both function values and derivatives at the nodes. Given (x_i, y_i, y'_i) for i = 0..n, the cubic Hermite polynomial on each subinterval [x_i, x_(i+1)] matches the value and derivative at both endpoints, giving 4 conditions and determining a unique cubic. Piecewise cubic Hermite (PCHIP) interpolation preserves monotonicity and positivity of the data, making it preferable to splines when those properties matter.

7. Numerical Integration (Quadrature)

Quadrature approximates the definite integral of f over an interval by a weighted sum: integral from a to b of f(x) dx is approximately sum over i of w_i * f(x_i). The art is choosing nodes x_i and weights w_i for maximum accuracy.

Newton-Cotes Rules

Newton-Cotes rules use equally spaced nodes. The trapezoidal rule uses two nodes (the endpoints) and is exact for polynomials of degree at most 1. Simpson's rule uses three nodes (endpoints and midpoint) and is exact for polynomials of degree at most 3 (one order higher than expected, due to symmetry). For composite Simpson's rule with n subintervals (n even), the error is -(b-a) * h^4 * f''''(xi) / 180 for some xi in (a,b), where h = (b-a)/n.
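A sketch of composite Simpson's rule; integrating sin over [0, pi] (exact value 2) shows the O(h^4) error decay — doubling n shrinks the error by about 16:

```python
import math

def simpson(f, a, b, n):
    # Composite Simpson's rule; n must be even.
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)   # 4 at odd, 2 at even nodes
    return s * h / 3.0

for n in (4, 8, 16):
    print(n, abs(simpson(math.sin, 0.0, math.pi, n) - 2.0))
```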

Gaussian Quadrature

Gaussian quadrature optimally chooses both nodes and weights to maximize the degree of exactness. An n-point Gauss-Legendre rule is exact for polynomials of degree up to 2n-1. The nodes are the roots of the n-th Legendre polynomial, and the weights are computed from those roots. No quadrature rule with n points can integrate polynomials of degree 2n or higher exactly.

5-Point Gauss-Legendre on [-1, 1]

  • x_1 = 0.0, w_1 = 128/225 approximately 0.5689
  • x_(2,3) = +/- 0.5385, w_(2,3) approximately 0.4786
  • x_(4,5) = +/- 0.9062, w_(4,5) approximately 0.2369
  • Exact for polynomials up to degree 9.
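The nodes and weights above can be generated with NumPy's `numpy.polynomial.legendre.leggauss`, and the degree-9 exactness checked on a monomial:

```python
import numpy as np

# 5-point Gauss-Legendre nodes and weights on [-1, 1].
nodes, weights = np.polynomial.legendre.leggauss(5)
print(nodes)    # 0 and ±0.5385, ±0.9062
print(weights)  # 128/225 and the two symmetric weight pairs

# Exact (up to rounding) for polynomials of degree <= 2n - 1 = 9:
approx = np.sum(weights * nodes ** 8)
exact = 2.0 / 9.0                       # integral of x^8 over [-1, 1]
print(abs(approx - exact))              # near machine precision
```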

Romberg Integration

Romberg integration applies Richardson extrapolation to the trapezoidal rule. By computing T(h), T(h/2), T(h/4), ... and combining them, the leading error terms are cancelled successively. The result is a triangular array R(n,m) where R(n,0) is the trapezoidal rule and R(n,m) extrapolates from R(n,m-1) and R(n-1,m-1). The diagonal entries R(n,n) converge like O(h^(2n+2)), giving spectral-like convergence for smooth integrands.

Adaptive Integration

Adaptive methods estimate the local error and refine only where needed. Adaptive Simpson's rule computes Simpson's rule on [a,b] and on the two halves [a,c] and [c,b] where c = (a+b)/2. If the difference is less than 15 * tolerance, accept the result; otherwise recurse on each half. This concentrates effort near singularities and rapid variations while using few evaluations in smooth regions.

Monte Carlo Integration

Monte Carlo integration estimates the integral by randomly sampling the integrand: take N uniform samples x_1, ..., x_N in [a,b] and approximate the integral as (b-a) * mean over i of f(x_i). The error decreases as 1 / sqrt(N), independent of dimension d. For high-dimensional integrals (d greater than 4 or 5), Monte Carlo outperforms deterministic methods whose error scales as h^p and require N = (1/h)^d points to achieve step size h.

Quasi-Monte Carlo

Replacing random samples with low-discrepancy sequences (Halton, Sobol, Faure sequences) gives quasi-Monte Carlo integration with error O((log N)^d / N), better than the O(1/sqrt(N)) of true Monte Carlo for moderate dimensions. Low-discrepancy sequences fill space more uniformly than random points, reducing clustering and gaps.

8. Numerical Methods for ODEs

Solving ordinary differential equations numerically means stepping forward in time from an initial condition, approximating the continuous trajectory with a discrete sequence of points.

Euler's Method

For the initial value problem y' = f(t, y), y(t_0) = y_0, the forward Euler method steps as:

y_(n+1) = y_n + h * f(t_n, y_n)

Euler's method is first-order accurate: the global error is O(h). It is easy to implement but often not competitive for accuracy. The local truncation error (error per step) is h^2 * y''(t_n) / 2. For stability on the scalar test equation y' = lambda*y with lambda real and negative, forward Euler requires h * |lambda| at most 2 (more generally, |1 + h*lambda| at most 1).

Runge-Kutta Methods: RK4

The classical fourth-order Runge-Kutta method (RK4) is the workhorse explicit solver for smooth, non-stiff ODEs:

  • k1 = h * f(t_n, y_n)
  • k2 = h * f(t_n + h/2, y_n + k1/2)
  • k3 = h * f(t_n + h/2, y_n + k2/2)
  • k4 = h * f(t_n + h, y_n + k3)
  • y_(n+1) = y_n + (k1 + 2*k2 + 2*k3 + k4) / 6

RK4 is fourth-order accurate (global error O(h^4)), uses 4 function evaluations per step, and has excellent stability properties for smooth problems. Cutting h in half reduces the error by a factor of 16.
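The tableau above maps directly to code; this sketch integrates the hypothetical test problem y' = y, y(0) = 1 (exact solution e^t) to t = 1 with h = 0.1:

```python
import math

def rk4_step(f, t, y, h):
    # One classical RK4 step for y' = f(t, y).
    k1 = h * f(t, y)
    k2 = h * f(t + h / 2, y + k1 / 2)
    k3 = h * f(t + h / 2, y + k2 / 2)
    k4 = h * f(t + h, y + k3)
    return y + (k1 + 2 * k2 + 2 * k3 + k4) / 6

f = lambda t, y: y
h, t, y = 0.1, 0.0, 1.0
while t < 1.0 - 1e-12:          # take 10 steps to reach t = 1
    y = rk4_step(f, t, y, h)
    t += h
print(abs(y - math.e))          # small global error, consistent with O(h^4)
```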

Stiffness and Implicit Methods

A stiff ODE has solutions with components decaying at widely different rates (e.g., eigenvalues of the Jacobian with very different magnitudes). Explicit methods require step sizes smaller than the fastest timescale for stability, even if accuracy only requires resolving the slowest timescale — making them prohibitively expensive.

Backward Euler

y_(n+1) = y_n + h * f(t_(n+1), y_(n+1))

First-order, unconditionally stable for linear problems. Requires solving a nonlinear system at each step.

Crank-Nicolson

y_(n+1) = y_n + h/2 * (f(t_n, y_n) + f(t_(n+1), y_(n+1)))

Second-order, A-stable (all eigenvalues in left half-plane are stable). The standard method for diffusion-type problems.

Adams Multistep Methods

Multistep methods reuse past function evaluations to achieve high order with few new evaluations per step. The Adams-Bashforth family is explicit; Adams-Moulton is implicit. The k-step Adams-Bashforth method achieves order k and requires only one new function evaluation per step (after startup). Predictor-corrector pairs (Adams-Bashforth to predict, Adams-Moulton to correct) are popular for non-stiff problems requiring high accuracy.

9. Numerical Methods for PDEs

Partial differential equations describe physical phenomena on continuous domains. Numerical methods discretize the domain into a grid or mesh and approximate the PDE with algebraic equations on those discrete points.

Finite Difference Methods

Finite differences replace derivatives with difference quotients on a uniform grid. For the second derivative, the central difference approximation is:

u''(x) is approximately (u(x+h) - 2u(x) + u(x-h)) / h^2

This is second-order accurate: the error is O(h^2). The truncation error is derived from Taylor expansion: u(x plus-or-minus h) = u(x) plus-or-minus h u'(x) + (h^2/2) u''(x) plus-or-minus (h^3/6) u'''(x) + O(h^4). Adding the two expansions and solving for u''(x) gives the formula above with error h^2 u''''(x) / 12.

Stability: The CFL Condition

For the advection equation u_t + c u_x = 0 discretized with forward Euler in time and upwind differences in space, the Courant-Friedrichs-Lewy (CFL) condition for stability is:

CFL number = |c| * dt / dx at most 1

The physical interpretation: numerical information can travel at most one grid cell per time step. If the time step is too large relative to the wave speed and grid spacing, the numerical domain of dependence does not contain the physical domain of dependence, and the scheme is unstable. Violations cause the numerical solution to grow exponentially.

Von Neumann stability analysis provides a systematic way to determine stability conditions: substitute a Fourier mode u_j^n = xi^n * exp(ij*theta) into the difference scheme and require the amplification factor |xi| at most 1 for all wavenumbers theta. For the heat equation u_t = alpha u_xx with forward Euler, the stability condition is alpha * dt / dx^2 at most 1/2.
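The heat-equation stability limit can be demonstrated numerically. In this hypothetical sketch, r = alpha * dt / dx^2 parameterizes the scheme directly; r = 0.4 decays while r = 0.6 blows up, exactly as the von Neumann analysis predicts:

```python
def heat_max(r, nx=21, steps=200):
    # Forward Euler for u_t = u_xx with zero boundary values.
    # r = dt / dx^2; von Neumann analysis requires r <= 1/2 for stability.
    u = [0.0] * nx
    u[nx // 2] = 1.0                     # spike initial condition
    for _ in range(steps):
        new = u[:]
        for j in range(1, nx - 1):
            new[j] = u[j] + r * (u[j + 1] - 2 * u[j] + u[j - 1])
        u = new
    return max(abs(v) for v in u)

print(heat_max(0.4))   # stable: the spike diffuses and decays
print(heat_max(0.6))   # unstable: the highest modes grow exponentially
```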

Finite Element Methods

Finite element methods (FEM) formulate the PDE in weak (variational) form and approximate the solution in a finite-dimensional function space. For the Poisson equation -u'' = f on (0,1) with homogeneous Dirichlet boundary conditions, the weak form is: find u in H_0^1 such that integral of u' v' dx = integral of f v dx for all test functions v in H_0^1.

Choosing piecewise linear basis functions phi_1, ..., phi_(n-1) on a uniform mesh converts this to the linear system K u = f, where K is the stiffness matrix with K[i,j] = integral of phi_i' phi_j' dx. FEM handles complex geometries, variable coefficients, and unstructured meshes naturally, making it the dominant method in structural mechanics and fluid dynamics.

Method of Lines

The method of lines (MOL) discretizes a PDE in space only, converting it to a large system of ODEs in time, which is then solved with a standard ODE solver. For the heat equation u_t = u_xx on a spatial grid, replacing u_xx with the central difference approximation gives a system of ODEs for the grid values u_i(t). Stiff ODE solvers (like the trapezoidal method or SDIRK methods) are needed because spatial discretization of diffusion equations produces stiff systems with eigenvalues proportional to -1/dx^2.

10. Numerical Optimization

Optimization methods minimize (or maximize) a function f(x) over a feasible domain. Gradient-based methods exploit derivative information to find critical points efficiently.

Gradient Descent

Gradient descent moves in the direction of steepest descent:

x_(k+1) = x_k - alpha_k * gradient(f)(x_k)

The step size alpha_k (learning rate) is critical. Too large causes divergence; too small gives slow convergence. Line search methods (backtracking Armijo, Wolfe conditions) choose alpha_k adaptively. For a strongly convex quadratic f with condition number kappa, gradient descent converges linearly: the error decreases by a factor of (kappa-1)/(kappa+1) per step. When kappa is large (ill-conditioned problem), convergence is extremely slow.
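A sketch of gradient descent with a backtracking Armijo line search on a hypothetical ill-conditioned quadratic (the Armijo constant 0.5 and the halving factor are illustrative choices):

```python
def grad_descent(f, grad, x, iters=500):
    for _ in range(iters):
        g = grad(x)
        gg = sum(gi * gi for gi in g)        # squared gradient norm
        alpha = 1.0
        # Backtracking: halve alpha until the sufficient-decrease test holds.
        while f([xi - alpha * gi for xi, gi in zip(x, g)]) > f(x) - 0.5 * alpha * gg:
            alpha *= 0.5
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# Ill-conditioned quadratic f(x, y) = x^2 + 10 y^2, condition number 10.
f = lambda v: v[0] ** 2 + 10.0 * v[1] ** 2
grad = lambda v: [2.0 * v[0], 20.0 * v[1]]
x = grad_descent(f, grad, [1.0, 1.0])
print(x)  # approaches the minimizer (0, 0)
```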

Newton's Method for Optimization

Newton's method for minimizing f uses the quadratic Taylor model:

x_(k+1) = x_k - H(x_k)^(-1) * gradient(f)(x_k)

H is the Hessian matrix of second derivatives. Near a local minimum with positive definite Hessian, Newton's method converges quadratically. The drawbacks: requires computing and inverting the n x n Hessian (O(n^2) storage, O(n^3) per step), and may not converge from far away. Practical implementations add safeguards: trust region methods or line search.

Conjugate Gradient for Optimization

The nonlinear Conjugate Gradient method extends CG to general (non-quadratic) objective functions. Instead of Hessian-based search directions, it uses conjugate directions computed from gradient information alone. The Fletcher-Reeves and Polak-Ribiere formulas compute the conjugation coefficient beta_k from consecutive gradients. NCG avoids storing the Hessian and requires only O(n) storage per step, making it ideal for large-scale problems.

Quasi-Newton Methods: BFGS

BFGS (Broyden-Fletcher-Goldfarb-Shanno) approximates the inverse Hessian using only gradient information, updating the approximation at each step with a rank-2 correction:

p_k = -B_k^(-1) * gradient(f)(x_k) (search direction)

s_k = x_(k+1) - x_k, y_k = gradient(f)(x_(k+1)) - gradient(f)(x_k)

B_(k+1)^(-1) = BFGS update using s_k, y_k

BFGS achieves superlinear convergence (better than linear, less than quadratic) without computing the full Hessian. L-BFGS (limited-memory BFGS) stores only the last m vector pairs (s_k, y_k), typically m = 5 to 20, reducing storage to O(mn) and making it the standard method for large-scale unconstrained optimization in machine learning and scientific computing.

11. Fast Algorithms: FFT and Hierarchical Methods

Some of the most important algorithms in numerical computing are "fast" algorithms that reduce seemingly O(n^2) or O(n^3) computations to O(n log n) or O(n) by exploiting mathematical structure.

The Fast Fourier Transform (FFT)

The Discrete Fourier Transform (DFT) of a sequence x_0, ..., x_(n-1) is:

X_k = sum over j=0..n-1 of x_j * exp(-2*pi*i*j*k/n), k = 0,...,n-1

The naive computation requires O(n^2) complex multiplications. The Cooley-Tukey FFT computes this in O(n log n) by recursively decomposing the DFT of size n into two DFTs of size n/2 (when n is a power of 2), exploiting the periodicity of the complex exponential. For n = 10^6, this reduces operations from 10^12 to about 2 * 10^7 — a factor of 50,000 speedup.

Cooley-Tukey Radix-2 Decomposition

  • X_k = sum(j even) x_j * omega^(jk) + sum(j odd) x_j * omega^(jk)
  • = DFT(x_(0,2,4,...)) + omega^k * DFT(x_(1,3,5,...))

where omega = exp(-2*pi*i/n). Each half-DFT has size n/2. Recursing to base case n=1 gives T(n) = 2T(n/2) + O(n), solved by the master theorem as T(n) = O(n log n).
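The radix-2 decomposition above translates into a short recursive implementation; the length-8 pure tone below is a hypothetical test signal whose DFT is a single nonzero bin:

```python
import cmath

def fft(x):
    # Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of 2.
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])                  # DFT of even-indexed samples
    odd = fft(x[1::2])                   # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

# A pure tone at frequency 2: all energy lands in bin k = 2.
n = 8
signal = [cmath.exp(2j * cmath.pi * 2 * j / n) for j in range(n)]
X = fft(signal)
print([round(abs(v), 6) for v in X])   # nonzero only at k = 2
```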

Applications of the FFT span signal processing (spectrum analysis, filtering), image processing (JPEG compression via DCT), solving PDEs spectrally (pseudospectral methods), fast polynomial multiplication (used in computer algebra systems), and fast convolution of long sequences.

The Fast Multipole Method (FMM)

Computing the pairwise interactions among N particles (gravitational N-body problem, electrostatics) requires O(N^2) operations naively. The Fast Multipole Method (Greengard and Rokhlin, 1987) reduces this to O(N) or O(N log N) using a hierarchical decomposition of space:

  • Build a hierarchical tree (octree in 3D) partitioning the domain
  • Represent the potential of clusters of far-away particles by multipole expansions (compact approximations valid far from the source)
  • Translate multipole expansions to local expansions near the target
  • Evaluate interactions exactly only for nearby particles

FMM was named one of the top 10 algorithms of the 20th century. It is fundamental to molecular dynamics simulations, astrophysical N-body codes, and solving integral equations. The key insight is separating near-field interactions (handled directly) from far-field interactions (handled via compressed multipole representations).

Hierarchical Matrices (H-matrices)

Many matrices arising from integral equations and PDE solvers have blocks that are numerically low-rank (well-approximated by a product of skinny matrices) when the corresponding source and target clusters are well-separated. Hierarchical matrices exploit this structure to represent and manipulate such matrices in O(n log^2 n) storage and O(n log^2 n) arithmetic, versus O(n^2) for dense storage. H-matrix arithmetic enables fast direct solvers for problems where iterative methods converge slowly.

Practice Problems

Problem 1 — Floating-Point Cancellation

Compute 1 - cos(x) for x = 0.001 using double precision. The naive formula loses significant digits. Find an algebraically equivalent expression that avoids cancellation.

Show Solution

For small x, both 1 and cos(x) are close to 1, so their difference suffers catastrophic cancellation. The exact value for x = 0.001 is approximately 5.0 * 10^(-7).

Stable reformulations:

  • Use the identity 1 - cos(x) = 2 * sin^2(x/2). For x = 0.001, sin(0.0005) is approximately 5.0 * 10^(-4); squaring gives 2.5 * 10^(-7), and doubling gives 5.0 * 10^(-7). No cancellation occurs: the subtraction has been eliminated entirely, and squaring a small number loses no relative accuracy.
  • Alternatively, use the Taylor series 1 - cos(x) approximately x^2/2 - x^4/24, which is extremely accurate for small x, or take the real part of -expm1(i*x) in complex arithmetic (since 1 - e^(ix) = (1 - cos(x)) - i*sin(x)).
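A quick check of the formulations in double precision (a minimal sketch):

```python
import math

x = 0.001
naive = 1.0 - math.cos(x)              # suffers catastrophic cancellation
stable = 2.0 * math.sin(x / 2) ** 2    # algebraically identical, no subtraction
taylor = x**2 / 2 - x**4 / 24          # truncated series, valid for small x

# All three agree near 5.0e-7, but the naive form has lost roughly half
# of its significant digits to cancellation; the other two have not.
```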

Problem 2 — LU Decomposition and Pivoting

Apply Gaussian elimination with partial pivoting to solve the system: [2, 1, -1; -3, -1, 2; -2, 1, 2] x = [8; -11; -3]. Show the pivoting steps and give the LU factorization PA = LU.

Show Solution

Step 1: Column 1 pivot. The largest absolute value in column 1 is |-3| = 3, so swap rows 1 and 2. After swap, augmented matrix is [-3,-1,2|-11; 2,1,-1|8; -2,1,2|-3].

Step 2: Eliminate column 1 below the pivot. Multipliers: l_21 = 2/(-3) = -2/3, l_31 = (-2)/(-3) = 2/3. Subtracting l_i1 times row 1 from row i gives: [-3, -1, 2 | -11; 0, 1/3, 1/3 | 2/3; 0, 5/3, 2/3 | 13/3].

Step 3: Column 2 pivot. |5/3| > |1/3|, so swap rows 2 and 3: [-3, -1, 2 | -11; 0, 5/3, 2/3 | 13/3; 0, 1/3, 1/3 | 2/3]. Eliminate with l_32 = (1/3)/(5/3) = 1/5: [-3, -1, 2 | -11; 0, 5/3, 2/3 | 13/3; 0, 0, 1/5 | -1/5].

Back substitution: x_3 = (-1/5)/(1/5) = -1; x_2 = (13/3 - (2/3)(-1)) / (5/3) = (15/3)/(5/3) = 3; x_1 = (-11 + x_2 - 2*x_3)/(-3) = (-11 + 3 + 2)/(-3) = 2. So x = (2, 3, -1).

The factorization PA = LU has P placing the original rows in the order (2, 3, 1), L = [1, 0, 0; 2/3, 1, 0; -2/3, 1/5, 1], and U = [-3, -1, 2; 0, 5/3, 2/3; 0, 0, 1/5].

Verify: 2(2) + 1(3) + (-1)(-1) = 4+3+1 = 8. Check. -3(2) + (-1)(3) + 2(-1) = -6-3-2 = -11. Check. -2(2) + 1(3) + 2(-1) = -4+3-2 = -3. Check.
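The hand computation can be verified with NumPy. This sketch checks PA = LU for the factors produced when partial pivoting leaves the original rows in the order (2, 3, 1), and checks the solution vector:

```python
import numpy as np

A = np.array([[2., 1., -1.], [-3., -1., 2.], [-2., 1., 2.]])
b = np.array([8., -11., -3.])

# Permutation placing the original rows in the order (2, 3, 1).
P = np.array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]])
L = np.array([[1., 0., 0.], [2/3, 1., 0.], [-2/3, 1/5, 1.]])
U = np.array([[-3., -1., 2.], [0., 5/3, 2/3], [0., 0., 1/5]])

assert np.allclose(P @ A, L @ U)          # PA = LU holds
x = np.linalg.solve(A, b)
assert np.allclose(x, [2., 3., -1.])      # matches the classic solution
```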

Problem 3 — Newton-Raphson Convergence

Apply Newton-Raphson to find the root of f(x) = x^3 - 2 starting from x_0 = 1. Compute 3 iterations and observe the quadratic convergence.

Show Solution

f(x) = x^3 - 2, f'(x) = 3x^2. The root is 2^(1/3) approximately 1.2599210...

x_0 = 1.000000. f(x_0) = -1. f'(x_0) = 3. x_1 = 1 - (-1)/3 = 1.333333. Error = 0.073412.

x_1 = 1.333333. f(x_1) = 2.370370 - 2 = 0.370370. f'(x_1) = 3 * 1.777778 = 5.333333. x_2 = 1.333333 - 0.370370/5.333333 = 1.333333 - 0.069444 = 1.263889. Error = 0.003968.

x_2 = 1.263889. f(x_2) approximately 0.018955. f'(x_2) approximately 4.792246. x_3 approximately 1.259934. Error approximately 0.000013.

Errors: 0.259921, 0.073412, 0.003968, 0.000013. Ratios: 0.073412/0.259921^2 approximately 1.09; 0.003968/0.073412^2 approximately 0.74; 0.000013/0.003968^2 approximately 0.8. The ratios approach a constant, confirming quadratic convergence (asymptotic constant C = f''(r)/(2f'(r)) = 6r/(2*3r^2) = 1/r approximately 0.794).
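The iteration is easy to reproduce (a minimal sketch):

```python
def newton(f, fp, x0, steps):
    """Run Newton-Raphson for a fixed number of steps, returning all iterates."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / fp(x))
    return xs

root = 2 ** (1 / 3)
xs = newton(lambda x: x**3 - 2, lambda x: 3 * x**2, 1.0, 4)
errors = [abs(x - root) for x in xs]
# Each error is roughly C ~ 1/root times the square of the previous one.
```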

Problem 4 — RK4 vs Euler Accuracy

Solve y' = -y, y(0) = 1 on [0, 1] with step size h = 0.25 using forward Euler and RK4. Compare the errors at t = 1 against the exact solution y(1) = e^(-1) approximately 0.367879.

Show Solution

Forward Euler with h = 0.25:

y_(n+1) = y_n + 0.25 * (-y_n) = 0.75 * y_n

  • y_0 = 1.000000
  • y_1 = 0.750000
  • y_2 = 0.562500
  • y_3 = 0.421875
  • y_4 = 0.316406

Euler error at t=1: |0.316406 - 0.367879| = 0.051473 (about 14%)

RK4 with h = 0.25:

For y' = -y: k1 = f(y_n) = -y_n, k2 = f(y_n + (h/2)*k1), k3 = f(y_n + (h/2)*k2), k4 = f(y_n + h*k3), and y_(n+1) = y_n + (h/6)*(k1 + 2*k2 + 2*k3 + k4). For this linear problem the update reduces to multiplying y_n by 1 - h + h^2/2 - h^3/6 + h^4/24, which matches e^(-h) through fourth order.

Amplification: 1 - 0.25 + 0.03125 - 0.002604 + 0.000163 = 0.778809

y_RK4(1) = 0.778809^4 approximately 0.367882. Error approximately 3e-6.

Euler error: 0.051 (5.1%). RK4 error: 0.000003 (0.0003%). RK4 is about 17,000 times more accurate with the same number of steps.

Problem 5 — CFL Stability Analysis

For the advection equation u_t + u_x = 0 (wave speed c = 1) on a grid with dx = 0.1, what is the maximum time step dt for stability using the upwind explicit scheme? If you use dt = 0.15 instead, what happens?

Show Solution

The CFL condition for the upwind scheme is: c * dt / dx at most 1. With c = 1, dx = 0.1: dt at most dx/c = 0.1/1 = 0.1.

Maximum stable time step: dt_max = 0.1. CFL number = c * dt / dx = 1 * 0.1 / 0.1 = 1.0 (critical value).

With dt = 0.15: CFL = 1 * 0.15 / 0.1 = 1.5, greater than 1. Von Neumann analysis of the upwind scheme gives the amplification factor |g|^2 = 1 - 2*CFL*(1 - CFL)*(1 - cos(theta)), which exceeds 1 for every theta not a multiple of 2*pi once CFL > 1.

With dt = 0.15, the scheme is unstable. The numerical solution will grow exponentially, producing large oscillations that bear no resemblance to the true solution, which just translates the initial profile to the right at speed 1.
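The blow-up is easy to demonstrate numerically. This sketch runs the upwind scheme on a periodic grid, advecting a single spike; the grid size and step count are arbitrary choices:

```python
def upwind_max(nu, steps=50, n=100):
    """Advect a unit spike with the upwind scheme at CFL number nu,
    returning the maximum absolute value after `steps` time steps."""
    u = [0.0] * n
    u[n // 2] = 1.0
    for _ in range(steps):
        # u[j-1] with j = 0 reads u[-1], giving periodic boundaries.
        u = [u[j] - nu * (u[j] - u[j - 1]) for j in range(n)]
    return max(abs(v) for v in u)

stable = upwind_max(1.0)     # CFL = 1: exact shift, amplitude preserved
unstable = upwind_max(1.5)   # CFL = 1.5: exponential blow-up
```

At CFL = 1 the update degenerates to u_j^(n+1) = u_(j-1)^n, an exact translation; at CFL = 1.5 the amplitude grows by orders of magnitude within tens of steps.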

Problem 6 — Gaussian Quadrature Exactness

Verify that 2-point Gauss-Legendre quadrature on [-1, 1] (nodes x = +/- 1/sqrt(3), weights w = 1 each) is exact for the polynomial f(x) = x^3. Then compute the approximation of integral from -1 to 1 of sin(x) dx.

Show Solution

Exactness for x^3:

Exact integral: integral from -1 to 1 of x^3 dx = [x^4/4] from -1 to 1 = 0 (odd function).

2-point GL approximation: w * f(1/sqrt(3)) + w * f(-1/sqrt(3)) = 1 * (1/sqrt(3))^3 + 1 * (-1/sqrt(3))^3 = 1/(3*sqrt(3)) - 1/(3*sqrt(3)) = 0. Exact!

This is guaranteed: 2-point GL is exact for polynomials up to degree 2*2-1 = 3.

Approximation of sin integral:

Exact: integral from -1 to 1 of sin(x) dx = [-cos(x)] from -1 to 1 = -cos(1) + cos(-1) = 0. (sin is odd, so the integral over a symmetric interval is 0.)

GL approximation: 1 * sin(1/sqrt(3)) + 1 * sin(-1/sqrt(3)) = sin(0.5774) - sin(0.5774) = 0. Exact for odd functions by symmetry.

Both integrands give 0. For a non-trivial example, the integral of e^x on [-1, 1] is e - 1/e approximately 2.3504. GL-2 gives e^(0.5774) + e^(-0.5774) approximately 1.7813 + 0.5614 = 2.3427. Error approximately 0.008. GL-3 would reduce the error to about 10^(-5).
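Both computations take a couple of lines (a minimal sketch of the 2-point rule):

```python
import math

def gauss2(f):
    # 2-point Gauss-Legendre rule on [-1, 1]: nodes +/- 1/sqrt(3), weights 1.
    x = 1.0 / math.sqrt(3.0)
    return f(-x) + f(x)

cubic = gauss2(lambda t: t**3)        # exact: 0 (rule is exact to degree 3)
approx = gauss2(math.exp)             # approximation of integral of e^x
exact = math.e - 1.0 / math.e         # true value, about 2.3504
```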

Exam Tips and Common Pitfalls

Floating-Point: Always State Assumptions

On exams, specify the floating-point format (double precision, machine epsilon approximately 10^(-16)) when discussing rounding errors. State whether you're computing relative or absolute error. Know the difference between the condition number of the problem (inherent sensitivity) and the stability of the algorithm (sensitivity of computed result to rounding).

Linear Systems: Always Pivot

When asked to demonstrate Gaussian elimination, always use partial pivoting unless told otherwise. Exams often test whether you know to swap rows. Remember: the growth factor for partial pivoting is at most 2^(n-1) in theory, but rarely large in practice. Full pivoting (swapping rows and columns) gives better bounds but is rarely used.

Order of Convergence: Know the Definitions

A sequence e_k converges with order p if |e_(k+1)| / |e_k|^p approaches a nonzero constant C. Order p = 1 is linear, p = 2 is quadratic, p = 1.618 is superlinear (secant method). To verify on an exam: compute the ratio e_(k+1) / e_k (for linear) or e_(k+1) / e_k^2 (for quadratic) and check that it approaches a constant.

ODE Methods: Distinguish Accuracy from Stability

Accuracy (order of the method) and stability (what happens as the step size grows) are independent. Euler is first-order accurate but conditionally stable. Backward Euler is also first-order accurate but unconditionally stable (A-stable). Crank-Nicolson is second-order and A-stable. For stiff problems, stability dominates: choose implicit methods.

Quadrature: Know Degree of Exactness

Memorize: the n-point Gauss-Legendre rule is exact for polynomials of degree at most 2n-1. Composite Simpson's rule has error O(h^4) for smooth f. Romberg integration combines trapezoidal rules with Richardson extrapolation to achieve high-order convergence. For high-dimensional integrals, Monte Carlo is O(N^(-1/2)) regardless of dimension.

FFT: Complexity and Applications

FFT is O(n log n); naive DFT is O(n^2). The speedup is n/log(n), which is enormous for large n. Know that FFT works best when n is a power of 2 (radix-2 Cooley-Tukey). For the exam, be ready to apply the decimation-in-time or decimation-in-frequency decomposition to a small example, and to state how FFT enables O(n log n) polynomial multiplication (convert to frequency domain, multiply pointwise, convert back).
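FFT-based polynomial multiplication takes only a few lines (a sketch using NumPy; the two small polynomials are arbitrary examples):

```python
import numpy as np

# Multiply p(x) = 1 + 2x + 3x^2 by q(x) = 4 + 5x via FFT convolution.
p = [1, 2, 3]
q = [4, 5]
n = len(p) + len(q) - 1              # number of product coefficients
size = 1 << (n - 1).bit_length()     # next power of 2 for radix-2 FFT

# Transform (zero-padded), multiply pointwise, transform back.
prod = np.fft.ifft(np.fft.fft(p, size) * np.fft.fft(q, size)).real[:n]
# prod recovers the coefficients of 4 + 13x + 22x^2 + 15x^3
```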

Quick Reference: Key Formulas

Floating-Point

  • Machine eps (double): 2^(-52) approx 2.22e-16
  • Relative rounding error: |fl(x)-x|/|x| at most eps/2
  • Condition number: kappa = norm(A)*norm(A^-1)

Root Finding

  • Newton: x_(k+1) = x_k - f(x_k)/f'(x_k)
  • Secant: x_(k+1) = x_k - f(x_k)*(x_k-x_(k-1))/(f(x_k)-f(x_(k-1)))
  • Bisection: n = ceiling(log2((b-a)/tol)) iterations

ODE Methods

  • Euler: y_(n+1) = y_n + h*f(t_n, y_n), O(h)
  • RK4: 4 stages, O(h^4) accuracy
  • Crank-Nicolson: 2nd order, A-stable

Quadrature

  • Simpson: h/3 * (f_0 + 4f_1 + f_2), error O(h^4)
  • Gauss-Legendre n-point: exact for degree 2n-1
  • Monte Carlo: error O(N^(-1/2))

Stability

  • CFL: c*dt/dx at most 1 for advection
  • Heat (explicit): alpha*dt/dx^2 at most 1/2
  • Euler stability: h*|lambda| at most 2

Fast Algorithms

  • FFT: O(n log n) vs DFT O(n^2)
  • FMM: O(N) for N-body interactions
  • CG convergence: O(sqrt(kappa) * log(1/eps)) iters

Summary

Advanced numerical methods provide the algorithmic bridge between continuous mathematics and computational solutions. The key themes throughout are:

  • Accuracy vs. Stability. A method's order of convergence tells you how fast it converges with decreasing step size; stability tells you whether it stays bounded. Both must be controlled.
  • Condition is the Problem's Fault. A large condition number means the problem is inherently sensitive. No algorithm can reliably solve an ill-conditioned problem with finite-precision input.
  • Exploit Structure. Sparse matrices, symmetry, low-rank blocks, and periodicity all enable faster algorithms. The FFT, FMM, and hierarchical methods show that the right structure can reduce O(n^2) to O(n log n) or O(n).
  • Trade-offs are Inevitable. Implicit methods cost more per step but allow larger steps for stiff problems. Dense solvers are simple but slow for large systems. Higher-order methods are more accurate but require smoother solutions.