Mathematics of Machine Learning

A comprehensive derivation-first guide to every mathematical tool powering modern ML: linear algebra, multivariable calculus, probability theory, optimization theory, and statistical learning theory. Equations are shown in full — nothing hand-waved.

1. Linear Algebra for Machine Learning

Machine learning is built on linear algebra: datasets are matrices, model parameters are vectors, predictions are matrix-vector products, and almost every learning algorithm is phrased as an optimization problem over a vector space.

1.1 Matrix Operations and Notation

An m x n matrix A has m rows and n columns. The (i, j) entry is written A_ij or [A]_ij. Transposing swaps rows and columns: (A^T)_ij = A_ji. Key identities used constantly in ML:

(AB)^T = B^T A^T
(ABC)^T = C^T B^T A^T
(A^-1)^T = (A^T)^-1 [for invertible square A]
Tr(AB) = Tr(BA) [cyclic trace property]
d/dA Tr(BA) = B^T
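These identities are easy to sanity-check numerically; the NumPy sketch below (random matrices, shapes chosen so every product is defined) verifies the transpose, inverse, and trace rules:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 2))

assert np.allclose((A @ B).T, B.T @ A.T)            # (AB)^T = B^T A^T
assert np.allclose((A @ B @ C).T, C.T @ B.T @ A.T)  # (ABC)^T = C^T B^T A^T

S = rng.standard_normal((4, 4)) + 5 * np.eye(4)     # shift to keep S invertible
assert np.allclose(np.linalg.inv(S).T, np.linalg.inv(S.T))  # (A^-1)^T = (A^T)^-1

X = rng.standard_normal((3, 4))
Y = rng.standard_normal((4, 3))
assert np.allclose(np.trace(X @ Y), np.trace(Y @ X))        # Tr(AB) = Tr(BA)
```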

1.2 Inner Products, Norms, and Distance

The standard inner product on R^n is x . y = x^T y = sum_i x_i y_i. This gives the Euclidean norm ||x||_2 = sqrt(x^T x) and the cosine similarity cos(theta) = x^T y / (||x|| ||y||).

ML uses several norms regularly:

L1 norm: ||x||_1 = sum_i |x_i| (sparsity-inducing)
L2 norm: ||x||_2 = sqrt(sum_i x_i^2) (ridge regularization)
Linf norm: ||x||_inf = max_i |x_i|
Frobenius: ||A||_F = sqrt(sum_ij A_ij^2) = sqrt(Tr(A^T A))
Nuclear: ||A||_* = sum_i sigma_i(A) (sum of singular values)
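All of these norms are available through NumPy's `ord` argument; a quick sketch, including the two Frobenius identities:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

l1 = np.linalg.norm(x, 1)         # |3| + |-4| + |0| = 7
l2 = np.linalg.norm(x, 2)         # sqrt(9 + 16) = 5
linf = np.linalg.norm(x, np.inf)  # max |x_i| = 4
fro = np.linalg.norm(A, 'fro')
nuc = np.linalg.norm(A, 'nuc')

assert np.isclose(fro, np.sqrt(np.trace(A.T @ A)))  # ||A||_F = sqrt(Tr(A^T A))
assert np.isclose(nuc, np.linalg.svd(A, compute_uv=False).sum())
```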

1.3 Eigendecomposition

For a square matrix A, a scalar lambda and nonzero vector v satisfying Av = lambda v are called an eigenvalue and eigenvector. The characteristic polynomial det(A - lambda I) = 0 gives the eigenvalues. A symmetric matrix A = A^T has all real eigenvalues and an orthonormal eigenbasis:

A = Q Lambda Q^T
where Q = [q_1 | q_2 | ... | q_n] (orthonormal columns)
Lambda = diag(lambda_1, ..., lambda_n)
Q^T Q = I

A positive semi-definite (PSD) matrix has all eigenvalues lambda_i >= 0. Covariance matrices and kernel matrices are always PSD. This fact is exploited throughout ML — PSD guarantees convexity of quadratic forms x^T A x >= 0.

Worked Example: 2x2 Eigendecomposition

A = [[3, 1], [1, 3]]
det(A - lambda I) = (3 - lambda)^2 - 1 = 0
=> lambda^2 - 6 lambda + 8 = 0 => lambda = 4, 2
lambda = 4: (A - 4I)v = 0 => [[-1,1],[1,-1]]v = 0 => v_1 = [1,1]^T / sqrt(2)
lambda = 2: (A - 2I)v = 0 => [[1,1],[1,1]]v = 0 => v_2 = [1,-1]^T / sqrt(2)
A = Q diag(4,2) Q^T, Q = (1/sqrt(2))[[1,1],[1,-1]]
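The worked example can be confirmed with `numpy.linalg.eigh`, which handles symmetric matrices and returns eigenvalues in ascending order:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 3.0]])
eigvals, Q = np.linalg.eigh(A)   # ascending: [2, 4]

assert np.allclose(eigvals, [2.0, 4.0])
assert np.allclose(Q @ np.diag(eigvals) @ Q.T, A)  # A = Q Lambda Q^T
assert np.allclose(Q.T @ Q, np.eye(2))             # orthonormal eigenbasis
```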

1.4 Matrix Calculus

Gradients of scalar functions with respect to vectors and matrices are the engine of optimization. The gradient of a scalar f with respect to vector x is a vector with the same shape as x. Key identities (using denominator-layout convention):

d/dx (a^T x) = a
d/dx (x^T A x) = (A + A^T) x = 2Ax [if A symmetric]
d/dA (x^T A y) = x y^T
d/dA Tr(A^T B) = B
d/dA log det(A) = A^(-T) [A invertible]
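Matrix-calculus identities are easy to get wrong by a transpose; the sketch below checks d/dx (x^T A x) = (A + A^T) x against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))   # deliberately non-symmetric
x = rng.standard_normal(4)

analytic = (A + A.T) @ x
h = 1e-6
numeric = np.array([
    ((x + h * e) @ A @ (x + h * e) - (x - h * e) @ A @ (x - h * e)) / (2 * h)
    for e in np.eye(4)            # perturb one coordinate at a time
])
assert np.allclose(analytic, numeric, atol=1e-5)
```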

2. Singular Value Decomposition (SVD)

SVD is the most important matrix factorization in applied mathematics. Every real m x n matrix A can be written as:

A = U Sigma V^T

where U is m x m orthogonal (columns are left singular vectors), Sigma is m x n with non-negative diagonal entries sigma_1 >= sigma_2 >= ... >= sigma_r > 0 (singular values, r = rank(A)), and V is n x n orthogonal (columns are right singular vectors). Existence is guaranteed for any real matrix.

2.1 Geometric Interpretation

Any linear map can be decomposed into three steps: V^T rotates/reflects the input, Sigma stretches along coordinate axes by the singular values, and U rotates/reflects into the output space. The singular values measure how much A stretches in each direction. ||A||_2 (spectral norm) = sigma_1 (largest singular value).

2.2 Eckart-Young Theorem (Best Low-Rank Approximation)

A_k = sum_(i=1 to k) sigma_i u_i v_i^T
||A - A_k||_F^2 = sigma_(k+1)^2 + ... + sigma_r^2
||A - A_k||_2 = sigma_(k+1)

No rank-k matrix is closer to A in either the Frobenius or spectral norm. This theorem justifies PCA, latent semantic analysis, matrix completion, and any ML approach that approximates a full matrix with a low-rank model.
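Both Eckart-Young error formulas can be verified directly on a random matrix (a NumPy sketch; rank and sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] * s[:k] @ Vt[:k]          # rank-k truncated SVD

fro_err = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(fro_err**2, np.sum(s[k:]**2))   # Frobenius error formula

spec_err = np.linalg.norm(A - A_k, 2)
assert np.isclose(spec_err, s[k])                 # spectral error = sigma_(k+1)
```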

2.3 SVD and Least Squares

When A is not invertible or not square, the pseudoinverse A^+ = V Sigma^+ U^T gives the minimum-norm least-squares solution to Ax = b: x* = A^+ b. Sigma^+ replaces each nonzero sigma_i with 1/sigma_i and leaves zeros as zeros.
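A sketch with a deliberately rank-deficient matrix: `np.linalg.pinv` and `np.linalg.lstsq` both return the minimum-norm least-squares solution, and the residual is orthogonal to the column space of A:

```python
import numpy as np

rng = np.random.default_rng(3)
# 5x4 matrix of rank at most 3 (product of 5x3 and 3x4 factors)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))
b = rng.standard_normal(5)

x_pinv = np.linalg.pinv(A) @ b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(x_pinv, x_lstsq)

# residual orthogonal to col(A): A^T (b - A x*) = 0
assert np.allclose(A.T @ (b - A @ x_pinv), 0, atol=1e-8)
```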

3. PCA and Dimensionality Reduction

3.1 PCA Derivation

Given n data points in R^d stored as rows of X (n x d), center the data: X_c = X - 1 mu^T where mu = (1/n) X^T 1. The sample covariance matrix is:

C = (1/(n-1)) X_c^T X_c [d x d, symmetric PSD]

PCA seeks the unit vector w_1 maximizing variance of projected data:

maximize w^T C w
subject to ||w||_2 = 1
Lagrangian: L = w^T C w - lambda (w^T w - 1)
dL/dw = 2Cw - 2 lambda w = 0 => Cw = lambda w

The optimal w_1 is the eigenvector of C with the largest eigenvalue lambda_1. The objective value equals lambda_1. Each subsequent principal component is the eigenvector with the next largest eigenvalue, constrained to be orthogonal to all previous components. Retaining k components captures fraction (lambda_1 + ... + lambda_k) / (lambda_1 + ... + lambda_d) of total variance.

3.2 PCA via SVD

X_c = U Sigma V^T [thin SVD with n >= d: U is n x d, Sigma is d x d, V is d x d]
C = X_c^T X_c / (n-1) = V (Sigma^2 / (n-1)) V^T
Principal components: columns of V (right singular vectors)
PC scores (projections): X_c V = U Sigma
Top-k projection: X_c V_k = U_k Sigma_k [n x k matrix]
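The equivalence of the two routes can be checked numerically; the NumPy sketch below runs PCA via covariance eigendecomposition and via SVD on the same random data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))  # correlated data
X_c = X - X.mean(axis=0)
n = X_c.shape[0]

# Route 1: eigendecomposition of the sample covariance
C = X_c.T @ X_c / (n - 1)
eigvals, V_eig = np.linalg.eigh(C)
eigvals, V_eig = eigvals[::-1], V_eig[:, ::-1]   # sort descending

# Route 2: SVD of the centered data
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)

assert np.allclose(eigvals, s**2 / (n - 1))      # lambda_i = sigma_i^2 / (n-1)
# eigenvectors match right singular vectors up to sign
assert np.allclose(np.abs(V_eig.T @ Vt.T), np.eye(5), atol=1e-6)
# PC scores: X_c V = U Sigma
assert np.allclose(X_c @ Vt.T, U * s)
```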

3.3 Linear Discriminant Analysis (LDA)

LDA is a supervised dimensionality reduction method. For C classes with means mu_c, it finds projections maximizing between-class variance relative to within-class variance. Define scatter matrices:

S_W = sum_c sum_(x in class c) (x - mu_c)(x - mu_c)^T [within-class]
S_B = sum_c n_c (mu_c - mu)(mu_c - mu)^T [between-class]
Objective: maximize w^T S_B w / w^T S_W w (Rayleigh quotient)
Solution: eigenvectors of S_W^(-1) S_B with largest eigenvalues

3.4 t-SNE

t-SNE (t-distributed Stochastic Neighbor Embedding) is a nonlinear method for 2D/3D visualization. For high-dimensional points x_i, it defines similarity as a conditional Gaussian probability p_j|i = exp(-||x_i - x_j||^2 / 2 sigma_i^2) / normalization. In the low-dimensional embedding y_i, it uses a Student-t distribution (heavier tails): q_ij = (1 + ||y_i - y_j||^2)^(-1) / sum_(k != l) (1 + ||y_k - y_l||^2)^(-1). The cost function is the KL divergence KL(P || Q) = sum_ij p_ij log(p_ij / q_ij), minimized by gradient descent. The t-distribution prevents crowding that would occur with a Gaussian in the low-dimensional space.

4. Calculus for Machine Learning

4.1 Gradients and Directional Derivatives

For f: R^n -> R, the gradient grad f(x) is the vector of partial derivatives. The directional derivative in direction v (||v|| = 1) is D_v f(x) = grad f(x) . v, maximized when v = grad f(x) / ||grad f(x)||. Hence the gradient points in the direction of steepest ascent.

The Hessian H(x) = d^2 f / dx^2 is the matrix of second partial derivatives: H_ij = d^2 f / (dx_i dx_j). At a critical point where grad f = 0, the point is a local minimum if H is positive definite (PSD is only a necessary condition), a local maximum if H is negative definite, and a saddle point if H has both positive and negative eigenvalues.

4.2 Chain Rule in Multiple Dimensions

If z = f(y), y = g(x), then:
dz/dx_i = sum_j (dz/dy_j)(dy_j/dx_i)
In matrix form: Jacobian(z wrt x) = Jacobian(z wrt y) * Jacobian(y wrt x)

4.3 Backpropagation Derivation

Consider a fully-connected network with L layers. Forward pass for layer l:

z^(l) = W^(l) a^(l-1) + b^(l) [pre-activation]
a^(l) = sigma(z^(l)) [activation, applied elementwise]
a^(0) = x [input]
a^(L) = y_hat [output]

Define the error signal at layer l as delta^(l) = dL / dz^(l). For the output layer L:

delta^(L) = dL/da^(L) * sigma'(z^(L)) [elementwise product]

Backprop recurrence (chain rule applied layer by layer, moving backward):

delta^(l) = (W^(l+1))^T delta^(l+1) * sigma'(z^(l)) [elementwise *]
Gradients for parameters at layer l:
dL/dW^(l) = delta^(l) (a^(l-1))^T
dL/db^(l) = delta^(l)

This achieves O(parameters) computation by storing intermediate activations during the forward pass and reusing them during the backward pass — a classic dynamic programming argument. Memory cost is O(depth x width) for storing all activations.
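The recurrences above can be checked on a tiny two-layer network. The sketch below (NumPy; tanh hidden layer, linear output, squared loss, with all names illustrative) compares the backprop gradient for one weight against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = np.tanh
dsigma = lambda z: 1 - np.tanh(z)**2

x = rng.standard_normal(3)
y = rng.standard_normal(2)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

def loss(W1):
    a1 = sigma(W1 @ x + b1)
    y_hat = W2 @ a1 + b2                 # linear output layer
    return 0.5 * np.sum((y_hat - y)**2)

# forward pass (store activations), then backward pass
z1 = W1 @ x + b1
a1 = sigma(z1)
y_hat = W2 @ a1 + b2
delta2 = y_hat - y                       # dL/dz2 for squared loss, identity output
delta1 = (W2.T @ delta2) * dsigma(z1)    # backprop recurrence
grad_W1 = np.outer(delta1, x)            # dL/dW1 = delta1 (a0)^T

# finite-difference check on one entry of W1
i, j, h = 1, 2, 1e-6
E = np.zeros_like(W1); E[i, j] = h
numeric = (loss(W1 + E) - loss(W1 - E)) / (2 * h)
assert np.isclose(grad_W1[i, j], numeric, atol=1e-5)
```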

4.4 Computational Graphs and Automatic Differentiation

Modern frameworks (PyTorch, JAX) build a computational graph of operations. Forward-mode AD computes Jv (Jacobian-vector products) in one forward pass, which is efficient when inputs are few. Reverse-mode AD (backprop) computes v^T J (vector-Jacobian products) in one backward pass, which is efficient when outputs are few. Since an ML loss is a scalar function of many parameters, reverse mode is almost always preferred.

5. Probability Theory for ML

5.1 Bayes' Theorem

P(A | B) = P(B | A) P(A) / P(B)

In the ML context: P(theta | data) = P(data | theta) P(theta) / P(data). The posterior P(theta | data) combines the likelihood P(data | theta) and the prior P(theta). P(data) = integral P(data | theta) P(theta) d theta is the marginal likelihood (evidence), which normalizes the posterior but is often intractable.

5.2 Common Distributions

Distribution | PDF / PMF | Mean | Variance
Bernoulli(p) | p^x (1-p)^(1-x) | p | p(1-p)
Gaussian N(mu, sigma^2) | (1/sqrt(2 pi sigma^2)) exp(-(x-mu)^2 / (2 sigma^2)) | mu | sigma^2
Categorical(p) | prod_k p_k^(x_k) | p | diag(p) - p p^T
Beta(a, b) | x^(a-1) (1-x)^(b-1) / B(a,b) | a/(a+b) | ab / ((a+b)^2 (a+b+1))
Dirichlet(alpha) | prod_k x_k^(alpha_k - 1) / B(alpha) | alpha_k / alpha_0 | alpha_k (alpha_0 - alpha_k) / (alpha_0^2 (alpha_0 + 1)), where alpha_0 = sum_k alpha_k

5.3 Gaussian Distribution in ML

The multivariate Gaussian N(mu, Sigma) has density:

p(x) = (2pi)^(-d/2) |Sigma|^(-1/2) exp( -1/2 (x-mu)^T Sigma^(-1) (x-mu) )

The term (x-mu)^T Sigma^(-1) (x-mu) is the Mahalanobis distance squared. If Sigma = I, it reduces to squared Euclidean distance. The Mahalanobis distance accounts for correlations and different scales — crucial for Gaussian discriminant analysis, Gaussian processes, and Kalman filters.

5.4 Information Theory

Entropy, cross-entropy, and KL divergence appear throughout ML as loss functions and theoretical measures:

Entropy: H(p) = -sum_x p(x) log p(x) [bits of uncertainty]
Cross-entropy: H(p,q) = -sum_x p(x) log q(x) [loss for classification]
KL divergence: KL(p||q)= sum_x p(x) log(p(x)/q(x)) >= 0 [always non-negative]
Relation: H(p,q) = H(p) + KL(p||q)
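The relation H(p,q) = H(p) + KL(p||q) and the non-negativity of KL can be checked on any pair of discrete distributions:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

H_p = -np.sum(p * np.log(p))        # entropy of p (in nats)
H_pq = -np.sum(p * np.log(q))       # cross-entropy
KL = np.sum(p * np.log(p / q))      # KL divergence

assert KL >= 0                      # Gibbs' inequality
assert np.isclose(H_pq, H_p + KL)   # H(p,q) = H(p) + KL(p||q)
```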

6. MLE, MAP Estimation, and Conjugate Priors

6.1 Maximum Likelihood Estimation

Given i.i.d. data D = (x_1, ..., x_n) drawn from p(x | theta), the likelihood is:

L(theta) = p(D | theta) = prod_(i=1 to n) p(x_i | theta)
Log-likelihood (maximize this — products become sums):
ell(theta) = sum_(i=1 to n) log p(x_i | theta)

MLE is asymptotically efficient: it achieves the Cramer-Rao lower bound as n grows. For Gaussian data with unknown mean, the MLE is the sample mean; for Bernoulli data, it is the empirical proportion of successes. MLE can overfit severely when n is small.

Worked Example: MLE for Gaussian Mean

Data: x_1, ..., x_n ~ N(mu, sigma^2), sigma^2 known
ell(mu) = sum_i [ -1/2 log(2 pi sigma^2) - (x_i - mu)^2 / (2 sigma^2) ]
d ell / d mu = sum_i (x_i - mu) / sigma^2 = 0
=> mu_MLE = (1/n) sum_i x_i = x_bar

6.2 Maximum A Posteriori Estimation

MAP adds a prior over parameters:

theta_MAP = argmax_theta log p(theta | D)
= argmax_theta [ ell(theta) + log p(theta) ]
Gaussian prior p(theta) = N(0, tau^2 I):
log p(theta) = -||theta||^2 / (2 tau^2) + const
=> MAP adds L2 penalty (lambda/2)||theta||^2 with lambda = 1/tau^2 (ridge regression)
Laplace prior p(theta) proportional to exp(-|theta|/b):
=> MAP adds L1 penalty lambda ||theta||_1 (lasso)

6.3 Conjugate Priors

A prior is conjugate to a likelihood if the posterior is in the same family as the prior. Conjugate pairs allow closed-form Bayesian updates — no numerical integration required:

Likelihood | Conjugate Prior | Posterior Update
Bernoulli(p) | Beta(a, b) | Beta(a + successes, b + failures)
Categorical(p) | Dirichlet(alpha) | Dirichlet(alpha + counts)
Gaussian (mu unknown, sigma^2 known) | N(mu_0, tau_0^2) | Gaussian with precision-weighted mean
Poisson(lambda) | Gamma(a, b) | Gamma(a + sum_i x_i, b + n)
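The Beta-Bernoulli row is a one-line update in code. The sketch below (NumPy; the true coin bias 0.7 and the Beta(2, 2) prior are arbitrary illustrative choices) also shows that the posterior mean always lies between the prior mean and the sample proportion:

```python
import numpy as np

rng = np.random.default_rng(6)
a, b = 2.0, 2.0                    # Beta(2, 2) prior
flips = rng.random(50) < 0.7       # 50 Bernoulli(0.7) draws
successes = int(flips.sum())
failures = len(flips) - successes

a_post, b_post = a + successes, b + failures    # closed-form conjugate update

posterior_mean = a_post / (a_post + b_post)
prior_mean = a / (a + b)
mle = successes / len(flips)
# posterior mean is a convex combination of prior mean and MLE
assert min(prior_mean, mle) <= posterior_mean <= max(prior_mean, mle)
```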

7. Optimization Theory

7.1 Convexity

A function f is convex if for all x, y and t in [0,1]: f(tx + (1-t)y) <= t f(x) + (1-t) f(y). Equivalently (for twice-differentiable f): the Hessian H(x) is PSD everywhere. For convex f, every local minimum is a global minimum. Logistic regression loss, SVM hinge loss, and L2-regularized objectives are all convex.

7.2 Gradient Descent and Convergence

theta_(t+1) = theta_t - alpha * grad L(theta_t)

For L-smooth convex f (||grad f(x) - grad f(y)|| <= L ||x-y||), gradient descent with step size alpha = 1/L converges as:

f(theta_T) - f* <= ||theta_0 - theta*||^2 / (2 alpha T) [O(1/T) rate]

For strongly convex f (mu-strongly convex: f(y) >= f(x) + grad f(x)^T (y-x) + (mu/2)||y-x||^2), gradient descent converges geometrically: f(theta_T) - f* <= (1 - mu/L)^T (f(theta_0) - f*). The condition number kappa = L/mu determines convergence speed.
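The geometric rate is easy to observe on a small quadratic. In the sketch below, f(x) = (1/2) x^T A x with A = diag(1, 10), so mu = 1, L = 10, and the bound predicts contraction by at least a factor (1 - mu/L) = 0.9 per step:

```python
import numpy as np

A = np.diag([1.0, 10.0])          # eigenvalues give mu = 1, L = 10
mu_, L_ = 1.0, 10.0
f = lambda x: 0.5 * x @ A @ x     # f* = 0 at x* = 0

x = np.array([1.0, 1.0])
alpha = 1 / L_                    # step size 1/L
gaps = [f(x)]
for _ in range(50):
    x = x - alpha * (A @ x)       # grad f(x) = A x
    gaps.append(f(x))

# f(x_T) - f* <= (1 - mu/L)^T (f(x_0) - f*)
assert all(gaps[t] <= (1 - mu_ / L_)**t * gaps[0] + 1e-12 for t in range(51))
```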

7.3 Stochastic Gradient Descent (SGD)

SGD estimates the gradient using a random mini-batch B of size b:

g_t = (1/b) sum_(i in B_t) grad L_i(theta_t) [unbiased: E[g_t] = grad L]
theta_(t+1) = theta_t - alpha_t g_t

SGD requires a decaying learning rate schedule (e.g., alpha_t = alpha_0 / sqrt(t)) for convergence guarantees. In practice, SGD with momentum and learning rate warmup-then-decay often reaches the same or better solutions than full-batch GD due to implicit regularization from gradient noise.

7.4 Momentum and Adam

SGD with Momentum (beta = 0.9 typical):
v_(t+1) = beta v_t + (1 - beta) grad L(theta_t)
theta_(t+1) = theta_t - alpha v_(t+1)
Adam (beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8):
m_t = beta_1 m_(t-1) + (1 - beta_1) g_t [1st moment]
v_t = beta_2 v_(t-1) + (1 - beta_2) g_t^2 [2nd moment]
m_hat_t = m_t / (1 - beta_1^t) [bias correction]
v_hat_t = v_t / (1 - beta_2^t) [bias correction]
theta_(t+1) = theta_t - alpha * m_hat_t / (sqrt(v_hat_t) + epsilon)
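The Adam equations above translate directly into a loop; a sketch on the toy objective f(theta) = (theta - 3)^2 (an illustrative choice, not from the text):

```python
import numpy as np

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta, m, v = 0.0, 0.0, 0.0

for t in range(1, 2001):
    g = 2 * (theta - 3)                   # gradient of (theta - 3)^2
    m = beta1 * m + (1 - beta1) * g       # 1st moment estimate
    v = beta2 * v + (1 - beta2) * g**2    # 2nd moment estimate
    m_hat = m / (1 - beta1**t)            # bias correction
    v_hat = v / (1 - beta2**t)
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)

assert abs(theta - 3.0) < 0.05            # converges near the minimum at 3
```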

7.5 Saddle Points in Non-Convex Optimization

Deep network loss surfaces are non-convex and have many saddle points where grad L = 0 but the Hessian has both positive and negative eigenvalues. Empirically, most local minima in deep networks have similar loss to the global minimum — the "proliferation of local minima" problem is less severe than feared. Saddle points are the real obstacle: gradient descent can slow near them. Perturbation methods and stochastic noise in SGD help escape saddle points.

7.6 Second-Order Methods

Newton's method uses the Hessian for quadratic convergence: theta_(t+1) = theta_t - H^(-1) grad L(theta_t). Computing H^(-1) costs O(d^3) — infeasible for large models. Quasi-Newton methods (L-BFGS) approximate H^(-1) using a limited history of gradient differences: O(md) cost per step for m stored vectors. Natural gradient descent uses the Fisher information matrix (the Riemannian metric in parameter space) instead of the Euclidean Hessian — invariant to reparametrization.

8. Linear Regression

8.1 Ordinary Least Squares (OLS)

Given design matrix X (n x d) and targets y (n x 1), minimize:

L(w) = ||y - Xw||^2 = (y - Xw)^T (y - Xw)
grad_w L = -2 X^T (y - Xw) = 0
=> X^T X w = X^T y [normal equations]
=> w_OLS = (X^T X)^(-1) X^T y [if X^T X invertible]

The hat matrix H = X(X^T X)^(-1) X^T projects y onto the column space of X: y_hat = Hy. The residuals e = y - y_hat lie in the null space of X^T and are orthogonal to all columns of X: X^T e = 0. OLS is BLUE (Best Linear Unbiased Estimator) by the Gauss-Markov theorem when errors are homoscedastic and uncorrelated.

8.2 Ridge Regression (L2 Regularization)

L_ridge(w) = ||y - Xw||^2 + lambda ||w||^2
grad L = -2 X^T (y - Xw) + 2 lambda w = 0
w_ridge = (X^T X + lambda I)^(-1) X^T y

Ridge adds lambda to every diagonal entry of X^T X, making it invertible for any lambda > 0 (resolving multicollinearity). Ridge shrinks coefficients toward zero, with low-variance directions of X shrunk most: in the SVD basis, each component is scaled by s_i^2 / (s_i^2 + lambda). Via the Bayesian interpretation, w_ridge is the MAP estimate with prior w ~ N(0, (1/lambda) I) (taking unit noise variance).
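A minimal NumPy sketch of the closed form (random data; names illustrative), checking the stationarity condition and the shrinkage of ||w|| as lambda grows:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)

def ridge(lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ols = ridge(0.0)
w_small, w_large = ridge(1.0), ridge(1000.0)
# larger lambda => smaller coefficient norm
assert np.linalg.norm(w_large) < np.linalg.norm(w_small) < np.linalg.norm(w_ols)

# stationarity: -X^T (y - X w) + lam w = 0
lam = 1.0
w = ridge(lam)
assert np.allclose(-X.T @ (y - X @ w) + lam * w, 0, atol=1e-8)
```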

8.3 Lasso (L1 Regularization)

L_lasso(w) = ||y - Xw||^2 + lambda ||w||_1

Lasso has no closed-form solution because ||w||_1 is not differentiable at zero. Coordinate descent solves it: for each coordinate j (with unit-norm columns X_j), the update has a "soft-thresholding" form: w_j = sign(rho_j) max(|rho_j| - lambda/2, 0), where rho_j = X_j^T (y - X_(-j) w_(-j)). The L1 penalty drives many coefficients to exactly zero, producing sparse models — automatic feature selection. Elastic Net combines L1 and L2: lambda_1 ||w||_1 + lambda_2 ||w||^2.

8.4 Probabilistic Interpretation

Linear regression with Gaussian noise y = Xw + epsilon, epsilon ~ N(0, sigma^2 I) gives likelihood p(y | X, w) = N(Xw, sigma^2 I). Maximizing the log-likelihood is equivalent to minimizing ||y - Xw||^2 — OLS is the MLE under Gaussian noise. Adding a Gaussian prior on w gives ridge regression MAP. The posterior over w is also Gaussian, enabling full Bayesian linear regression with uncertainty quantification.

9. Logistic Regression

9.1 Sigmoid and Log-Odds

For binary classification y in {0, 1}, logistic regression models:

P(y=1 | x, w) = sigma(w^T x) where sigma(z) = 1 / (1 + e^(-z))
Properties of sigma:
sigma(-z) = 1 - sigma(z) [symmetry]
sigma'(z) = sigma(z)(1 - sigma(z)) [derivative in terms of itself]
log(sigma/(1-sigma)) = z [log-odds = linear predictor]

9.2 Cross-Entropy Loss Derivation

The negative log-likelihood for n i.i.d. binary outcomes:

-ell(w) = -sum_i [ y_i log sigma(w^T x_i) + (1-y_i) log(1 - sigma(w^T x_i)) ]
Gradient:
grad_w (-ell) = sum_i (sigma(w^T x_i) - y_i) x_i = X^T (sigma_hat - y)

The gradient is elegantly the sum of (prediction - label) times features — same structure as linear regression. The Hessian is H = X^T S X where S = diag(sigma_i (1-sigma_i)), which is PSD, confirming the cross-entropy loss is convex in w. Newton's method (iteratively reweighted least squares) converges quadratically.
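The compact gradient formula X^T (sigma_hat - y) can be checked against finite differences of the negative log-likelihood (a NumPy sketch on random data):

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.standard_normal((30, 3))
y = (rng.random(30) < 0.5).astype(float)
w = rng.standard_normal(3)
sig = lambda z: 1 / (1 + np.exp(-z))

def nll(w):
    p = sig(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

analytic = X.T @ (sig(X @ w) - y)        # grad of -ell(w)
h = 1e-6
numeric = np.array([(nll(w + h*e) - nll(w - h*e)) / (2*h) for e in np.eye(3)])
assert np.allclose(analytic, numeric, atol=1e-4)
```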

9.3 Multiclass: Softmax Regression

P(y=k | x, W) = softmax(Wx)_k = exp(w_k^T x) / sum_j exp(w_j^T x)
Cross-entropy loss (K classes):
L = -sum_i sum_k y_ik log p(y=k | x_i, W)

The softmax is the gradient of the log-sum-exp: d/dz log sum_k exp(z_k) = softmax(z). A key numerical trick: subtract max(z) before exponentiating to prevent overflow — the result is mathematically identical but numerically stable.
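Both facts are demonstrable in a few lines: the max-subtraction trick avoids overflow on large logits, and the softmax matches a finite-difference gradient of log-sum-exp:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)        # shift-invariance: subtract max for stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])   # naive exp(1000) would overflow
p = softmax(z)
assert np.all(np.isfinite(p)) and np.isclose(p.sum(), 1.0)
assert np.allclose(p, softmax(z - 500.0))  # shifting logits changes nothing

# softmax is the gradient of log-sum-exp (finite-difference check)
lse = lambda z: np.log(np.sum(np.exp(z)))
z2 = np.array([0.3, -0.7, 1.2])
h, k = 1e-6, 1
e_k = np.eye(3)[k]
assert np.isclose((lse(z2 + h*e_k) - lse(z2 - h*e_k)) / (2*h),
                  softmax(z2)[k], atol=1e-6)
```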

10. Neural Networks

10.1 Universal Approximation Theorem

The Universal Approximation Theorem (Cybenko 1989, Hornik 1991) states: a feedforward network with a single hidden layer of sufficient width can approximate any continuous function on a compact subset of R^n to arbitrary precision, for any non-constant, bounded, monotonically increasing continuous activation function. The theorem guarantees existence but says nothing about learnability or required width. Modern variants show that depth can exponentially reduce the required width: there are functions that shallow networks need exponentially many units to approximate, yet deep networks represent with only polynomial width.

10.2 Activation Functions

Activation | Formula | Derivative | Notes
Sigmoid | 1/(1 + e^(-z)) | sigma(z)(1 - sigma(z)) | Vanishing gradients for large |z|
Tanh | (e^z - e^(-z)) / (e^z + e^(-z)) | 1 - tanh^2(z) | Zero-centered; still saturates
ReLU | max(0, z) | 1(z > 0) | Dead neurons if z always < 0
Leaky ReLU | max(alpha z, z) | 1(z > 0) + alpha 1(z <= 0) | Prevents dead neurons (alpha ~ 0.01)
GELU | z Phi(z) | Phi(z) + z phi(z) | Smooth; used in Transformers
Swish | z sigma(z) | sigma(z) + z sigma(z)(1 - sigma(z)) | Non-monotone; often outperforms ReLU

10.3 Vanishing and Exploding Gradients

In a network of depth L with weight matrices W^(1), ..., W^(L), the gradient at layer 1 involves the product of L Jacobians:

dL/dz^(1) = (prod_(l=2 to L) W^(l) diag(sigma'(z^(l)))) * dL/dz^(L)
If ||W^(l) diag(sigma')|| < 1 for all l: gradient vanishes exponentially
If ||W^(l) diag(sigma')|| > 1 for all l: gradient explodes exponentially

Solutions: (1) ReLU activation (derivative 1 on positive half, no saturation); (2) Batch normalization normalizes z^(l) before activation to N(0,1) then applies learnable scale and shift; (3) Residual connections y = F(x) + x allow gradients to flow through identity path: dL/dx = dL/dy (1 + dF/dx) — the 1 prevents vanishing; (4) He initialization: W ~ N(0, 2/n_in) keeps activation variance stable through layers.

10.4 Batch Normalization

mu_B = (1/m) sum_i z_i [batch mean]
sigma_B^2 = (1/m) sum_i (z_i - mu_B)^2 [batch variance]
z_hat_i = (z_i - mu_B) / sqrt(sigma_B^2 + eps) [normalize]
y_i = gamma z_hat_i + beta [scale and shift: learnable gamma, beta]
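A sketch of this forward pass for a single feature across a mini-batch (NumPy; gamma and beta are held fixed here rather than learned, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(9)
z = rng.standard_normal(32) * 5 + 10     # mini-batch of pre-activations
gamma, beta, eps = 2.0, -1.0, 1e-5

mu_B = z.mean()                          # batch mean
var_B = z.var()                          # batch variance (1/m normalization)
z_hat = (z - mu_B) / np.sqrt(var_B + eps)
y = gamma * z_hat + beta                 # learnable scale and shift

assert np.isclose(z_hat.mean(), 0.0, atol=1e-10)   # normalized to mean 0
assert np.isclose(z_hat.std(), 1.0, atol=1e-3)     # ...and std 1 (up to eps)
assert np.isclose(y.mean(), beta, atol=1e-10)      # shift controls output mean
```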

11. Support Vector Machines

11.1 Hard-Margin SVM: Margin Maximization

For linearly separable data (x_i, y_i) with y_i in {-1, +1}, the decision boundary is w^T x + b = 0. The distance from a correctly classified x_i to the boundary is y_i (w^T x_i + b) / ||w||. Under the canonical scaling min_i y_i (w^T x_i + b) = 1, the margin width is 2/||w||. Maximizing the margin is equivalent to:

minimize (1/2) ||w||^2
subject to y_i (w^T x_i + b) >= 1 for all i

The support vectors are the training points where the constraint is tight: y_i (w^T x_i + b) = 1. The decision boundary depends only on the support vectors.

11.2 Soft-Margin SVM (C-SVM)

minimize (1/2) ||w||^2 + C sum_i xi_i
subject to y_i (w^T x_i + b) >= 1 - xi_i, xi_i >= 0

Slack variables xi_i allow misclassification with penalty C. Large C gives a narrow margin (low bias, high variance); small C gives a wide margin (high bias, low variance). The hinge loss formulation: min_w (1/2)||w||^2 + C sum_i max(0, 1 - y_i(w^T x_i + b)).

11.3 Dual Formulation

Forming the Lagrangian and applying KKT conditions converts the primal SVM to the dual:

maximize sum_i alpha_i - (1/2) sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
subject to 0 <= alpha_i <= C, sum_i alpha_i y_i = 0
Primal solution: w = sum_i alpha_i y_i x_i [only SVs have alpha_i > 0]
Prediction: f(x) = sum_i alpha_i y_i (x_i . x) + b

11.4 Kernel Trick

The dual only uses x_i . x_j inner products. Replace with K(x_i, x_j) = phi(x_i) . phi(x_j) to implicitly map to feature space phi without computing phi:

Linear: K(x,z) = x^T z
Polynomial: K(x,z) = (x^T z + c)^d
RBF: K(x,z) = exp(-||x-z||^2 / (2 sigma^2))
Sigmoid: K(x,z) = tanh(kappa x^T z + c) [not always PSD]

Mercer's theorem: K is a valid kernel iff for any finite set (x_1, ..., x_n), the Gram matrix K_ij = K(x_i, x_j) is symmetric and positive semi-definite. The RBF kernel corresponds to an infinite-dimensional feature space — yet SVM with RBF kernel computes in O(n^2) kernel evaluations rather than infinite-dimensional dot products.
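Mercer's condition can be checked empirically for the RBF kernel; the NumPy sketch below builds the Gram matrix for random points and confirms it is symmetric with non-negative eigenvalues (up to roundoff):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((20, 3))
sigma = 1.0

# pairwise squared distances, then the RBF Gram matrix
sq_dists = np.sum((X[:, None] - X[None, :])**2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

assert np.allclose(K, K.T)              # symmetric
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-10           # PSD up to floating-point roundoff
```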

12. Statistical Learning Theory

12.1 Bias-Variance Decomposition

For regression with squared loss, taking expectation over training sets:

E[ (y - f_hat(x))^2 ] = Bias^2(f_hat(x)) + Var(f_hat(x)) + sigma^2
Bias(f_hat(x)) = E[f_hat(x)] - f*(x) [systematic error]
Var(f_hat(x)) = E[(f_hat(x) - E[f_hat(x)])^2] [sensitivity to training data]
sigma^2 = irreducible noise variance in y

Regularization reduces variance at the cost of increased bias. Cross-validation selects model complexity to minimize the sum. Ensemble methods (bagging) reduce variance by averaging many high-variance models; boosting reduces bias by sequentially correcting residuals.

12.2 VC Dimension

The Vapnik-Chervonenkis (VC) dimension of a hypothesis class H is the size of the largest set of points that H can shatter (realize all 2^n labelings of n points). For linear classifiers in R^d, VC dim = d + 1.

Sauer's Lemma: growth function m_H(n) <= sum_(k=0 to d) C(n,k) <= (en/d)^d
VC generalization bound (with probability 1 - delta):
L_test(h) <= L_train(h) + sqrt( (d log(2en/d) + log(2/delta)) / (2n) )

The bound grows with VC dimension d and shrinks with n. To generalize to within epsilon error with probability 1 - delta, roughly n = O((d log(1/epsilon) + log(1/delta)) / epsilon) samples suffice, which is the fundamental sample complexity result.

12.3 PAC Learning

Probably Approximately Correct (PAC) learning: a hypothesis class H is PAC-learnable if there exists an algorithm A such that for any target concept c in H, any distribution D, and any epsilon, delta > 0, if A receives n = poly(1/epsilon, 1/delta, size) samples, it outputs h with L(h) <= epsilon with probability at least 1 - delta.

Finite hypothesis class: n = O(log(|H|/delta) / epsilon) samples suffice (union bound argument). Agnostic PAC (realizable assumption dropped): generalization error increases by the approximation error (best achievable in H). The fundamental theorem of statistical learning: H is PAC-learnable iff H has finite VC dimension.

12.4 Rademacher Complexity

Rademacher complexity measures the ability of a hypothesis class to fit random noise:

R_n(H) = E_S E_sigma [ sup_(h in H) (1/n) sum_i sigma_i h(x_i) ]
where sigma_i are iid Rademacher RVs: P(sigma_i=+1)=P(sigma_i=-1)=1/2
Generalization bound: L(h) <= L_train(h) + 2 R_n(H) + sqrt(log(1/delta)/(2n))

Rademacher complexity provides tighter bounds than VC theory for specific function classes (e.g., norm-bounded linear classifiers, neural networks). For linear classifiers with ||w|| <= B and ||x|| <= C: R_n <= BC/sqrt(n) — independent of dimension, explaining why high-dimensional linear classifiers can generalize well.

12.5 Double Descent and Modern Overparameterized Models

Classical bias-variance theory predicts a U-shaped test error curve as model size grows. Modern deep learning reveals double descent: test error decreases, then peaks at the interpolation threshold (enough parameters to fit training data exactly), then decreases again into the overparameterized regime. Implicit regularization from gradient descent — which finds minimum-norm solutions — explains why overparameterized models generalize. The minimum-norm interpolant for linear regression is w_MN = X^T (X X^T)^(-1) y (right pseudoinverse).

Algorithm Pseudocode Reference

PCA Algorithm

Input: Data matrix X (n x d), number of components k
Output: k-dimensional projections Z (n x k), principal components V_k (d x k)

1. Compute mean: mu = mean(X, axis=0)
2. Center data: X_c = X - mu  (broadcast)
3. Compute SVD: U, s, Vt = SVD(X_c, full_matrices=False)
   -- OR covariance eigendecomposition:
   -- C = X_c.T @ X_c / (n-1)
   -- eigenvalues, V = eigh(C)  [sorted descending]
4. Top-k components: V_k = Vt.T[:, :k]     (d x k matrix)
5. Project: Z = X_c @ V_k                  (n x k matrix)
6. Explained variance ratio: s[:k]^2 / sum(s^2)
Return Z, V_k

Backpropagation Algorithm

Input: Network params (W^(l), b^(l) for l=1..L), input x, target y
Output: Gradients dL/dW^(l), dL/db^(l) for all l

--- FORWARD PASS ---
a^(0) = x
for l = 1 to L:
    z^(l) = W^(l) @ a^(l-1) + b^(l)
    a^(l) = activation(z^(l))     [elementwise]
y_hat = a^(L)
Compute loss L = loss_fn(y_hat, y)

--- BACKWARD PASS ---
delta^(L) = dL/da^(L) * activation'(z^(L))   [elementwise multiply]
for l = L-1 down to 1:
    dL/dW^(l+1) = delta^(l+1) @ a^(l).T
    dL/db^(l+1) = delta^(l+1)
    delta^(l) = W^(l+1).T @ delta^(l+1) * activation'(z^(l))
dL/dW^(1) = delta^(1) @ a^(0).T
dL/db^(1) = delta^(1)
Return all gradients

Adam Optimizer

Input: Initial params theta_0, learning rate alpha=0.001,
       beta_1=0.9, beta_2=0.999, epsilon=1e-8, num_steps T
Output: Optimized params theta_T

m_0 = 0  (initialize 1st moment vector)
v_0 = 0  (initialize 2nd moment vector)

for t = 1 to T:
    g_t = grad L(theta_{t-1})           [compute gradient]
    m_t = beta_1 * m_{t-1} + (1 - beta_1) * g_t
    v_t = beta_2 * v_{t-1} + (1 - beta_2) * g_t^2   [elementwise square]
    m_hat = m_t / (1 - beta_1^t)        [bias correction]
    v_hat = v_t / (1 - beta_2^t)        [bias correction]
    theta_t = theta_{t-1} - alpha * m_hat / (sqrt(v_hat) + epsilon)

Return theta_T

Frequently Asked Questions

What is the chain rule and why is it central to backpropagation?

The chain rule states that if y = f(g(x)), then dy/dx = (df/dg)(dg/dx). In a neural network the loss L is a composition of many functions — layer activations, weight matrices, bias additions. Backpropagation applies the chain rule recursively from the output layer back to the input layer to compute the gradient of L with respect to every weight. The error signal delta^(l) = dL/dz^(l) propagates backward as delta^(l) = (W^(l+1))^T delta^(l+1) * sigma'(z^(l)). This avoids recomputing repeated sub-expressions and runs in O(parameters) time — the same as the forward pass.

What is singular value decomposition (SVD) and how is it used in machine learning?

SVD decomposes any real matrix A (m x n) as A = U Sigma V^T, where U is m x m orthogonal (left singular vectors), Sigma is diagonal with non-negative singular values, and V is n x n orthogonal (right singular vectors). In ML, SVD powers PCA (top-k right singular vectors are principal components), latent semantic analysis, collaborative filtering, and low-rank approximation. The Eckart-Young theorem guarantees the rank-k truncated SVD is the best possible rank-k approximation in both Frobenius and spectral norm — no other rank-k matrix is closer to A.

What is the difference between MLE and MAP estimation?

MLE finds theta maximizing the likelihood P(data | theta) — treating parameters as fixed unknowns with no prior belief. MAP maximizes the posterior P(theta | data), which combines the likelihood with a prior P(theta). With a Gaussian prior N(0, tau^2 I) on parameters, MAP is exactly equivalent to L2 (ridge) regularized MLE. With a Laplace prior, MAP gives L1 (lasso) regularization. As the dataset grows, both converge because the likelihood dominates. MAP shrinks estimates toward the prior on small data, reducing overfitting.

How does gradient descent work and what makes Adam different?

Gradient descent updates parameters by moving opposite the gradient: theta_(t+1) = theta_t - alpha * grad L. Adam maintains exponential moving averages of gradients (m_t) and squared gradients (v_t), bias-corrects both, then divides the gradient update by sqrt(v_hat) — this adapts the effective learning rate per parameter. Parameters with historically large gradients get smaller updates; parameters with small gradients get relatively larger updates. Adam is robust to sparse gradients and works well without careful learning rate tuning, making it the default for deep learning.

What is the kernel trick in support vector machines?

The SVM dual problem only requires inner products x_i^T x_j between training points. Replacing x_i^T x_j with K(x_i, x_j) = phi(x_i)^T phi(x_j) implicitly computes dot products in a high- (or infinite-) dimensional feature space without ever computing phi explicitly. This allows SVMs to learn non-linear boundaries with the same computational cost as linear SVMs (in terms of kernel evaluations). The RBF kernel K(x,z) = exp(-||x-z||^2 / 2sigma^2) corresponds to infinite-dimensional feature space but can be evaluated in O(d) time.

What is the bias-variance tradeoff?

The expected mean squared error of any estimator decomposes as MSE = Bias^2 + Variance + irreducible noise. High bias (underfitting) means the model is systematically wrong regardless of training data — the model is too simple. High variance (overfitting) means predictions vary wildly across different training sets — the model is too sensitive to noise. As model complexity increases, bias decreases and variance increases. Regularization reduces variance at the cost of some bias. Ensemble methods like bagging reduce variance by averaging many models.

What is PCA and how is it derived mathematically?

PCA finds orthogonal directions of maximum variance in centered data. The first principal component maximizes w^T C w subject to ||w|| = 1, where C = X_c^T X_c / (n-1) is the sample covariance. By Lagrange multipliers, the solution satisfies Cw = lambda w, so w must be an eigenvector of C with the largest eigenvalue. Each subsequent PC is the eigenvector with the next largest eigenvalue. Equivalently via SVD: X_c = U Sigma V^T, and the principal components are the columns of V. PCA is unsupervised; for supervised dimensionality reduction, use LDA which maximizes between-class variance relative to within-class variance.

What is the vanishing gradient problem and how is it addressed?

During backpropagation, gradients involve a product of L Jacobians for a depth-L network. If activation derivatives are consistently less than 1 (e.g., sigmoid saturates near 0 or 1 with derivative approaching 0), the product decays exponentially, making early layers receive near-zero gradient and learn very slowly. Solutions include: ReLU activations (derivative is 1 for positive inputs); batch normalization (keeps pre-activation distributions stable); residual skip connections (gradient can bypass layers via identity path); and He initialization W ~ N(0, 2/n_in) which keeps forward activation variance constant.
