
Statistical Inference

A rigorous treatment of statistical inference: frequentist and Bayesian frameworks, point and interval estimation, hypothesis testing, ANOVA, regression, nonparametric methods, bootstrap, multiple testing corrections, and Bayesian computation.

1. Frequentist vs. Bayesian Frameworks

Statistical inference is the process of drawing conclusions about a population or data-generating process from a sample. Two dominant philosophical frameworks govern how we interpret probability and what "inference" means.

Frequentist Framework

In the frequentist view, parameters are fixed (unknown) constants. Probability is defined as the limiting relative frequency of an event in an infinite sequence of independent repetitions of an experiment. A procedure is evaluated by its long-run behavior across all possible datasets that could have been observed.

Key concepts: estimators, sampling distributions, confidence intervals (which trap the true parameter in a stated fraction, e.g. 95%, of repeated experiments), significance levels, and p-values. Inference statements are about the procedure, not about the probability of specific parameter values.

Bayesian Framework

In the Bayesian view, parameters are treated as random variables with a prior distribution pi(theta) encoding beliefs before observing data. After observing data x, the posterior is computed via Bayes' theorem:

pi(theta | x) = L(theta; x) * pi(theta) / m(x)

where m(x) = integral L(theta; x) pi(theta) dtheta (marginal likelihood)

The posterior pi(theta | x) summarizes all inference about theta given x. Point estimates use the posterior mean, median, or mode. Interval estimates are credible intervals with direct probabilistic interpretations.

Comparison at a Glance

Aspect            | Frequentist                               | Bayesian
Parameter status  | Fixed unknown constant                    | Random variable with prior
Probability       | Long-run frequency                        | Degree of belief
Interval          | Confidence interval (procedure guarantee) | Credible interval (posterior probability)
Prior information | Not incorporated formally                 | Incorporated via prior pi(theta)
Output            | Estimate + sampling-based interval        | Full posterior distribution

2. Point Estimation

A point estimator is a statistic T(X1, ..., Xn) used to estimate a scalar or vector parameter theta. Key properties of estimators include unbiasedness, consistency, efficiency, and sufficiency.

Bias, Variance, and MSE

The bias of an estimator T is Bias(T) = E[T] - theta. An estimator is unbiased if its bias is zero for all theta. The mean squared error decomposes as:

MSE(T) = Var(T) + [Bias(T)]^2

A biased estimator can have lower MSE than an unbiased one if its variance is sufficiently smaller — this is the bias-variance tradeoff. The sample variance S^2 = (1/(n-1)) * sum(Xi - X-bar)^2 is unbiased for sigma^2, while (1/n) times the same sum is the MLE but biased.
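The bias-variance tradeoff for the two variance estimators can be checked by simulation. The sketch below (seed, sample size, and sigma^2 are arbitrary choices) confirms that the 1/(n-1) version is unbiased while the 1/n version has bias -sigma^2/n but lower MSE for normal data:

```python
import numpy as np

# Monte Carlo comparison of the unbiased (1/(n-1)) and MLE (1/n) variance
# estimators described above. Parameters are illustrative.
rng = np.random.default_rng(0)
sigma2, n, reps = 4.0, 10, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2_unbiased = x.var(axis=1, ddof=1)   # divisor n - 1
s2_mle = x.var(axis=1, ddof=0)        # divisor n

bias_unbiased = s2_unbiased.mean() - sigma2   # should be near 0
bias_mle = s2_mle.mean() - sigma2             # theory: -sigma^2/n = -0.4
mse_unbiased = ((s2_unbiased - sigma2) ** 2).mean()
mse_mle = ((s2_mle - sigma2) ** 2).mean()     # smaller despite the bias
```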

Maximum Likelihood Estimation (MLE)

For i.i.d. data x1, ..., xn with density f(x | theta), the likelihood function is:

L(theta) = product from i=1 to n of f(xi | theta)

log L(theta) = sum from i=1 to n of log f(xi | theta) [log-likelihood]

Score: s(theta) = d/dtheta log L(theta); the MLE solves the likelihood equation s(theta) = 0

The MLE theta-hat maximizes log L(theta). Under regularity conditions the MLE is: (1) consistent, (2) asymptotically normal with variance 1/[n*I(theta)], and (3) asymptotically efficient.

Worked Example: MLE for Exponential Distribution

Let X1, ..., Xn be i.i.d. Exponential(lambda). The density is f(x|lambda) = lambda * e^(-lambda*x) for x > 0.

log L(lambda) = n log(lambda) - lambda * sum(xi)

d/dlambda log L = n/lambda - sum(xi) = 0

lambda-hat = n / sum(xi) = 1/X-bar

With n = 10 observations summing to 25, the MLE is lambda-hat = 10/25 = 0.4. The mean lifetime estimate is 1/lambda-hat = 2.5 units.
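The closed form can be verified numerically. In the sketch below the data vector is hypothetical, chosen to have n = 10 and sum 25 so it matches the worked example; the numerical optimizer should agree with lambda-hat = 0.4:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Exponential MLE: closed form n/sum(x) vs. numerical maximization of the
# log-likelihood n*log(lambda) - lambda*sum(x). Data are hypothetical.
x = np.array([1.0, 2.0, 3.0, 1.5, 2.5, 4.0, 2.0, 3.5, 2.5, 3.0])  # sums to 25
n, s = len(x), x.sum()

lam_closed = n / s  # = 10/25 = 0.4

def neg_log_lik(lam):
    return -(n * np.log(lam) - lam * s)

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
lam_numeric = res.x
```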

Method of Moments (MOM)

The method of moments sets theoretical moments equal to sample moments. The k-th population moment is mu_k = E[X^k]. The k-th sample moment is m_k = (1/n)*sum(Xi^k). Solve the system mu_k(theta) = m_k for theta. MOM estimators are consistent but generally less efficient than MLEs.

Worked Example: MOM for Gamma Distribution

Let Xi ~ Gamma(alpha, beta) so E[X] = alpha/beta and Var(X) = alpha/beta^2. The sample mean is X-bar = 5 and sample variance is S^2 = 2.

mu_1 = alpha/beta = X-bar = 5

mu_2 - mu_1^2 = alpha/beta^2 = S^2 = 2

Dividing: 1/beta = S^2/X-bar = 2/5 = 0.4 so beta-hat = 2.5

alpha-hat = X-bar * beta-hat = 5 * 2.5 = 12.5
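The two-step solution above is mechanical enough to script. This sketch reproduces the Gamma MOM estimates from the given summary statistics (rate parameterization, as in the example):

```python
# Method-of-moments for Gamma(alpha, beta) with E[X] = alpha/beta and
# Var(X) = alpha/beta^2, using the worked example's X-bar = 5, S^2 = 2.
xbar, s2 = 5.0, 2.0

beta_hat = xbar / s2          # from alpha/beta^2 = S^2 and alpha/beta = X-bar
alpha_hat = xbar * beta_hat   # alpha-hat = X-bar * beta-hat
```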

3. Sufficiency, Completeness, and the Rao-Blackwell Theorem

Sufficient Statistics

A statistic T(X) is sufficient for theta if the conditional distribution of X given T(X) does not depend on theta. Intuitively, T captures all information about theta contained in the data. The Factorization Theorem (Neyman-Fisher) states T is sufficient if and only if the joint density factors as:

f(x | theta) = g(T(x), theta) * h(x)

where g depends on x only through T(x), and h(x) does not depend on theta.

For Exponential(lambda) data: the joint density factors as lambda^n * exp(-lambda * sum(xi)), so T = sum(Xi) is sufficient for lambda. For Normal(mu, sigma^2) data with both unknown, T = (sum Xi, sum Xi^2) or equivalently (X-bar, S^2) is a sufficient statistic.

Minimal Sufficient Statistics

A sufficient statistic T is minimal sufficient if it is a function of every other sufficient statistic — it achieves the maximum data reduction consistent with sufficiency. By the Lehmann-Scheffé criterion, T is minimal sufficient when T(x) = T(y) if and only if the ratio f(x|theta)/f(y|theta) does not depend on theta.

Completeness

A statistic T is complete for a family of distributions if E_theta[g(T)] = 0 for all theta implies g(T) = 0 almost surely. Completeness eliminates the possibility of unbiased estimators of zero (other than zero itself). In exponential families, the natural sufficient statistic is complete.

Rao-Blackwell Theorem

If W is an unbiased estimator of theta and T is sufficient for theta, then the conditional expectation phi(T) = E[W | T] satisfies:

E[phi(T)] = theta (unbiased)

Var(phi(T)) <= Var(W) for all theta

This means conditioning on a sufficient statistic never increases variance. Combined with completeness (Lehmann-Scheffé theorem), if T is a complete sufficient statistic and phi(T) is unbiased, then phi(T) is the unique minimum variance unbiased estimator (UMVUE).

Worked Example: UMVUE for Bernoulli p

For X1, ..., Xn i.i.d. Bernoulli(p), T = sum(Xi) is complete sufficient. The sample proportion p-hat = T/n is already a function of T and is unbiased for p, so it is the UMVUE for p.

For estimating p^2, consider W = X1*X2 (unbiased since E[X1*X2] = E[X1]E[X2] = p^2 by independence). The UMVUE is E[X1*X2 | T] = T(T-1) / (n(n-1)), which is the sample analogue of p^2.

4. Fisher Information and the Cramer-Rao Lower Bound

Fisher Information

The Fisher information I(theta) measures the amount of information that an observable random variable X carries about the unknown parameter theta. For a single observation:

I(theta) = E[(d/dtheta log f(X|theta))^2]

= -E[d^2/dtheta^2 log f(X|theta)] (under regularity)

For n i.i.d. observations: I_n(theta) = n * I(theta)

The score function s(theta; x) = d/dtheta log f(x|theta) has E[s(theta;X)] = 0 under regularity conditions. The Fisher information is the variance of the score.

Cramer-Rao Lower Bound (CRLB)

For any unbiased estimator T(X) of theta based on n i.i.d. observations:

Var(T) >= 1 / [n * I(theta)]

For estimating g(theta) (not just theta itself):

Var(T) >= [g'(theta)]^2 / [n * I(theta)]

An unbiased estimator achieving equality is called efficient. Under regularity conditions, the MLE is asymptotically efficient: its asymptotic variance equals 1/[n*I(theta)].

Worked Example: CRLB for Normal Mean

Let Xi ~ N(mu, sigma^2) with sigma^2 known. The log-likelihood is:

log f(x|mu) = -1/2 * log(2*pi*sigma^2) - (x-mu)^2 / (2*sigma^2)

d/dmu log f = (x-mu)/sigma^2

I(mu) = E[(X-mu)^2/sigma^4] = 1/sigma^2

CRLB = sigma^2 / n = Var(X-bar)

The sample mean X-bar achieves the CRLB exactly, confirming it is the efficient unbiased estimator for mu.
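A quick simulation confirms that Var(X-bar) matches the CRLB sigma^2/n. The choices of sigma, n, and replication count below are arbitrary:

```python
import numpy as np

# Monte Carlo check that the sample mean attains the Cramer-Rao bound
# sigma^2/n for a normal mean, as derived above.
rng = np.random.default_rng(1)
sigma, n, reps = 2.0, 25, 200_000

xbar = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)
crlb = sigma**2 / n          # = 0.16
var_xbar = xbar.var()        # empirical variance of X-bar, should match CRLB
```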

5. Interval Estimation

Confidence Intervals via Pivots

A pivot is a function Q(X, theta) whose distribution does not depend on theta (or any other unknown parameter). We find a pivot, compute tail probabilities, and invert the inequalities to get bounds for theta.

Example: Xi ~ N(mu, sigma^2) with sigma^2 known

Pivot: Q = sqrt(n)(X-bar - mu)/sigma ~ N(0,1)

P(-z_(alpha/2) <= Q <= z_(alpha/2)) = 1 - alpha

Invert: mu in [X-bar - z_(alpha/2)*sigma/sqrt(n), X-bar + z_(alpha/2)*sigma/sqrt(n)]

When sigma^2 is unknown, replace sigma with S and N(0,1) with t_(n-1), giving the t-interval: X-bar +/- t_(n-1, alpha/2) * S/sqrt(n).
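The t-interval can be computed directly from summary statistics. As an illustration, the sketch below uses the calcium-intake summaries that appear in the t-test example later (n = 25, X-bar = 940, S = 120):

```python
import numpy as np
from scipy import stats

# 95% t-interval X-bar +/- t_(n-1, alpha/2) * S/sqrt(n) from summary stats.
n, xbar, s, alpha = 25, 940.0, 120.0, 0.05

tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # t_{24, 0.025} ≈ 2.064
half = tcrit * s / np.sqrt(n)
ci = (xbar - half, xbar + half)                # ≈ (890.5, 989.5)
```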

Interpreting Confidence Intervals Correctly

Correct: "If we repeated this study many times, 95% of the confidence intervals we construct would contain the true parameter."

Incorrect: "There is a 95% probability that the true parameter lies in this specific interval." (Once the interval is computed from the data, it either does or does not contain theta — no probability is involved for that specific interval.)

Likelihood Ratio Confidence Intervals

The likelihood ratio confidence region inverts the likelihood ratio test. Define:

LR(theta) = L(theta; x) / L(theta-hat; x)

By Wilks' theorem: -2 log LR(theta) ~ chi^2_1 asymptotically

95% LR interval: (theta : -2 log LR(theta) <= chi^2_1,0.95 = 3.841)

Likelihood ratio intervals have several advantages over Wald intervals (based on normal approximation): they are invariant to reparameterization, tend to be more accurate in small samples, and can naturally handle bounded parameters.

Common Confidence Intervals

Parameter        | Assumptions           | CI Formula
Mean mu          | Normal, sigma known   | X-bar +/- z_(a/2) * sigma/sqrt(n)
Mean mu          | Normal, sigma unknown | X-bar +/- t_(n-1,a/2) * S/sqrt(n)
Proportion p     | Large n (Wald)        | p-hat +/- z_(a/2)*sqrt(p-hat(1-p-hat)/n)
Variance sigma^2 | Normal data           | [(n-1)S^2/chi^2_(n-1,a/2), (n-1)S^2/chi^2_(n-1,1-a/2)]

6. Hypothesis Testing Foundations

Hypothesis testing provides a formal decision framework for evaluating evidence against a null hypothesis H0. A test partitions the sample space into a rejection region C (reject H0 when the test statistic falls in C) and its complement.

Types of Error

Decision          | H0 True                     | H0 False
Reject H0         | Type I error (prob = alpha) | Correct (power = 1 - beta)
Fail to reject H0 | Correct (prob = 1 - alpha)  | Type II error (prob = beta)

Power Function

The power function beta(theta) = P_(theta)(reject H0) gives the probability of rejection as a function of the true parameter value. For theta in H0, we want beta(theta) <= alpha (size constraint). For theta in H1, we want beta(theta) to be large (high power).

Size of a test = sup_(theta in H0) beta(theta)

Power at theta_1 = beta(theta_1) = 1 - P(Type II error at theta_1)

For z-test: beta(mu) = 1 - Phi(z_(a/2) - delta) + Phi(-z_(a/2) - delta)

where delta = sqrt(n)(mu - mu_0)/sigma is the noncentrality parameter

P-values

The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming H0 is true:

p-value = P_(theta_0)(T(X) >= T(x_obs)) [right-tailed]

= 2 * P_(theta_0)(T(X) >= |T(x_obs)|) [two-tailed for symmetric T]

A p-value is NOT the probability that H0 is true, and it is NOT the probability of making a wrong decision. It measures only how surprising the data are under H0. The p-value is a uniform(0,1) random variable under the null, so thresholding at alpha gives a valid test of size alpha.

Worked Example: One-Sample t-Test

A nutritionist claims average daily calcium intake mu_0 = 1000 mg. A sample of n = 25 adults gives X-bar = 940 mg, S = 120 mg. Test H0: mu = 1000 vs H1: mu < 1000 at alpha = 0.05.

t = (X-bar - mu_0) / (S/sqrt(n))

= (940 - 1000) / (120/sqrt(25))

= -60 / 24 = -2.5

df = n - 1 = 24

p-value = P(t_24 <= -2.5) = 0.0099

Since p-value = 0.0099 < 0.05, reject H0.
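The test above can be reproduced from the summary statistics alone; a sketch:

```python
from math import sqrt
from scipy import stats

# One-sample t-test, H0: mu = 1000 vs H1: mu < 1000, from the summary
# statistics in the example (n = 25, X-bar = 940, S = 120).
n, xbar, s, mu0 = 25, 940.0, 120.0, 1000.0

t_stat = (xbar - mu0) / (s / sqrt(n))       # = -2.5
p_value = stats.t.cdf(t_stat, df=n - 1)     # left-tail probability under t_24
reject = p_value < 0.05                     # True
```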

7. Neyman-Pearson Lemma and UMP Tests

Neyman-Pearson Lemma

For testing H0: theta = theta_0 against H1: theta = theta_1 (both simple), the most powerful level-alpha test rejects H0 when:

Lambda(x) = L(theta_1; x) / L(theta_0; x) > k

where k is chosen so P_(theta_0)(Lambda(X) > k) = alpha

For continuous distributions there is no randomization needed: the critical value k corresponds directly to a quantile of the null distribution of the test statistic.

Uniformly Most Powerful (UMP) Tests

A test is uniformly most powerful (UMP) at level alpha if it is most powerful against every specific alternative theta_1 in H1. UMP tests exist for one-sided hypotheses in one-parameter exponential families.

For exponential families with natural parameter eta(theta) monotone in theta and sufficient statistic T(x), the UMP test for H0: theta <= theta_0 vs H1: theta > theta_0 rejects when T(x) > c, where c is chosen to achieve size alpha. The key property is that the rejection region does not depend on the specific theta_1 chosen.

UMP tests generally do not exist for two-sided alternatives H1: theta ≠ theta_0. The best achievable in that case is a UMP unbiased (UMPU) test — one that is unbiased (power >= alpha for all theta in H1) and most powerful within the class of unbiased tests.

8. Likelihood Ratio Tests

For composite hypotheses, the generalized likelihood ratio test (GLRT) statistic is:

Lambda(x) = sup_(theta in Theta_0) L(theta; x) / sup_(theta in Theta) L(theta; x)

= L(theta-hat_0; x) / L(theta-hat; x)

Note: 0 <= Lambda(x) <= 1. Reject H0 when Lambda(x) <= c.

Wilks' Theorem

Under H0 and regularity conditions, as n approaches infinity:

-2 log Lambda(X) --> chi^2_r in distribution

where r = dim(Theta) - dim(Theta_0) = number of restrictions under H0

This allows us to find critical values from chi-squared tables without knowing the exact distribution of the test statistic. The LRT is asymptotically equivalent to the Wald test and the score test (Rao test) — all three converge to chi^2_r under H0.

Worked Example: LRT for Exponential Mean

Testing H0: lambda = 1 vs H1: lambda ≠ 1 with n = 20 Exponential observations giving sum(xi) = 25 (so lambda-hat = 20/25 = 0.8).

L(lambda; x) = lambda^n * exp(-lambda * sum(xi))

L(1; x) = 1^20 * exp(-25) = e^(-25)

L(0.8; x) = 0.8^20 * exp(-0.8*25) = 0.8^20 * e^(-20)

-2 log Lambda = 2[log L(0.8) - log L(1)]

= 2[20 log(0.8) + (-20) - (-25)] = 2[-4.463 + 5] = 1.07

chi^2_1, 0.95 = 3.841. Since 1.07 < 3.841, fail to reject H0.
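The LRT statistic can be computed exactly from the exponential log-likelihood, as a sketch:

```python
import numpy as np
from scipy import stats

# Exponential LRT from the worked example: H0: lambda = 1, n = 20,
# sum(x) = 25, so lambda-hat = n/sum(x) = 0.8.
n, sx = 20, 25.0
lam0, lam_hat = 1.0, n / sx

def loglik(lam):
    return n * np.log(lam) - lam * sx       # exponential log-likelihood

stat = 2 * (loglik(lam_hat) - loglik(lam0))  # -2 log Lambda ≈ 1.07
crit = stats.chi2.ppf(0.95, df=1)            # ≈ 3.841 (Wilks, r = 1)
reject = stat > crit                         # False: fail to reject H0
```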

9. Chi-Squared, t, and F Distributions

Chi-Squared Distribution

If Z1, ..., Zk are i.i.d. N(0,1), then V = Z1^2 + ... + Zk^2 follows a chi-squared distribution with k degrees of freedom, written V ~ chi^2_k.

PDF: f(v) = v^(k/2-1) * e^(-v/2) / (2^(k/2) * Gamma(k/2)), v > 0

E[V] = k, Var(V) = 2k

Key result: (n-1)S^2/sigma^2 ~ chi^2_(n-1) for normal data

Student's t Distribution

If Z ~ N(0,1) and V ~ chi^2_k independently, then T = Z / sqrt(V/k) follows a t-distribution with k degrees of freedom, t_k.

PDF: f(t) = Gamma((k+1)/2) / (sqrt(k*pi) * Gamma(k/2)) * (1 + t^2/k)^(-(k+1)/2)

E[T] = 0 (k > 1), Var(T) = k/(k-2) (k > 2)

Key result: sqrt(n)(X-bar - mu)/S ~ t_(n-1) for normal data

As k -> infinity, t_k -> N(0,1)

F Distribution

If U ~ chi^2_m and V ~ chi^2_n independently, then F = (U/m) / (V/n) follows an F-distribution with (m, n) degrees of freedom, F_(m,n).

E[F] = n/(n-2) (n > 2)

If T ~ t_n, then T^2 ~ F_(1,n) (the F distribution with 1 numerator df)

Key result: S1^2/S2^2 ~ F_(n1-1, n2-1) for two independent normal samples under H0: sigma1^2 = sigma2^2

ANOVA uses F = MS_between / MS_within to test equality of group means

10. Analysis of Variance (ANOVA)

ANOVA tests whether the means of three or more groups are equal by decomposing total variation into between-group and within-group components.

One-Way ANOVA

The one-way model has k groups with n_i observations each (total N = sum n_i):

Model: Y_ij = mu + alpha_i + epsilon_ij, epsilon_ij ~ N(0, sigma^2) i.i.d.

SST = SS_between + SS_within

SS_between = sum_i n_i * (Y-bar_i - Y-bar)^2

SS_within = sum_i sum_j (Y_ij - Y-bar_i)^2

F = (SS_between / (k-1)) / (SS_within / (N-k)) = MS_between / MS_within

Under H0 (all group means equal): F ~ F_(k-1, N-k)

Worked Example: One-Way ANOVA

Three teaching methods, n = 5 students each, test scores: Group 1: Y-bar_1 = 78, Group 2: Y-bar_2 = 82, Group 3: Y-bar_3 = 85. Grand mean Y-bar = 81.67. SS_within = 120.

SS_between = 5*(78-81.67)^2 + 5*(82-81.67)^2 + 5*(85-81.67)^2

= 5*13.44 + 5*0.11 + 5*11.11 = 123.33

MS_between = 123.33 / (3-1) = 61.67

MS_within = 120 / (15-3) = 10

F = 61.67 / 10 = 6.17

F_(2,12), 0.95 = 3.89. Since 6.17 > 3.89, reject H0.
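The F statistic can be computed exactly from the summary quantities above (group means, common n = 5, SS_within = 120), as a sketch:

```python
from scipy import stats

# One-way ANOVA from summary statistics: three groups of size 5.
means = [78.0, 82.0, 85.0]
n_per, k = 5, 3
N = n_per * k
grand = sum(means) / k                    # valid here since group sizes are equal

ss_between = sum(n_per * (m - grand) ** 2 for m in means)  # ≈ 123.33
ss_within = 120.0

ms_between = ss_between / (k - 1)
ms_within = ss_within / (N - k)           # = 10
F = ms_between / ms_within                # ≈ 6.17
p_value = stats.f.sf(F, k - 1, N - k)     # upper tail of F_(2,12)
reject = p_value < 0.05                   # True
```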

Two-Way ANOVA

Two-way ANOVA includes two factors A (a levels) and B (b levels) plus their interaction. The balanced model with n observations per cell (n >= 2 so the interaction can be estimated) is:

Y_ijk = mu + alpha_i + beta_j + (alpha*beta)_ij + epsilon_ijk

SS_total = SS_A + SS_B + SS_AB + SS_error

F_A = MS_A / MS_error ~ F_(a-1, ab(n-1))

F_B = MS_B / MS_error ~ F_(b-1, ab(n-1))

F_AB = MS_AB / MS_error ~ F_((a-1)(b-1), ab(n-1))

A significant interaction means the effect of factor A depends on the level of factor B and vice versa. If the interaction is significant, main effects must be interpreted cautiously — examining the cell means (profile plots) is essential.

Post-Hoc Testing

A significant ANOVA F-test only indicates that some means differ. Post-hoc tests identify which pairs differ while controlling error rates. Common methods include Tukey's HSD (controls familywise error for all pairwise comparisons), Bonferroni correction (conservative, adjusts alpha for the number of comparisons), and Scheffe's method (valid for all linear contrasts, most conservative).

11. Regression Inference

In linear regression Y = X*beta + epsilon with epsilon ~ N(0, sigma^2 * I_n), the OLS estimator is beta-hat = (X'X)^(-1) X'Y. Under normality, exact inference is available.

Sampling Distribution of beta-hat

beta-hat ~ N(beta, sigma^2 * (X'X)^(-1))

(n - p - 1) * S^2 / sigma^2 ~ chi^2_(n-p-1)

where S^2 = RSS/(n-p-1) is the residual mean square, p = number of predictors

t-stat for beta_j: t_j = (beta-hat_j - beta_j^0) / (S * sqrt((X'X)^(-1)_(jj))) ~ t_(n-p-1)

Confidence Bands for the Regression Line

At a specific predictor value x_0, the confidence interval for the mean response E[Y | X = x_0] = x_0' * beta is:

x_0' * beta-hat +/- t_(n-p-1, a/2) * S * sqrt(x_0' * (X'X)^(-1) * x_0)

This is the confidence interval for the mean of Y at x_0 — it is narrowest at the mean of X and widens as x_0 moves away. Pointwise confidence bands join these intervals for all x_0; simultaneous Working-Hotelling bands use an F-critical value to cover all x_0 simultaneously.

Prediction Intervals

A prediction interval for a new observation Y_new at x_0 is wider than the confidence interval for the mean because it must also account for the individual error epsilon_new:

x_0' * beta-hat +/- t_(n-p-1, a/2) * S * sqrt(1 + x_0' * (X'X)^(-1) * x_0)

The extra "1" under the square root accounts for individual variation in Y_new.
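Both intervals can be computed with basic linear algebra. The sketch below fits a simple linear regression on a small hypothetical dataset and builds the mean-response CI and the prediction interval at a chosen x_0; the PI is necessarily wider because of the extra "1":

```python
import numpy as np
from scipy import stats

# CI for the mean response and PI for a new observation at x0, following
# the two formulas above. Data are simulated for illustration.
rng = np.random.default_rng(2)
n = 30
x = np.linspace(0, 10, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, n)   # true line plus noise

X = np.column_stack([np.ones(n), x])        # design matrix with intercept
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s = np.sqrt(resid @ resid / (n - 2))        # S = sqrt(RSS/(n - p - 1)), p = 1

x0 = np.array([1.0, 5.0])                   # predict at x = 5
tcrit = stats.t.ppf(0.975, df=n - 2)
h = x0 @ XtX_inv @ x0                       # leverage term x0'(X'X)^{-1}x0
fit = x0 @ beta_hat

ci = (fit - tcrit * s * np.sqrt(h), fit + tcrit * s * np.sqrt(h))
pi = (fit - tcrit * s * np.sqrt(1 + h), fit + tcrit * s * np.sqrt(1 + h))
```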

Overall F-Test for Regression

H0: beta_1 = beta_2 = ... = beta_p = 0 (no predictors useful)

F = (SS_reg / p) / (SS_res / (n-p-1)) = MS_reg / MS_res ~ F_(p, n-p-1) under H0

R^2 = SS_reg / SS_total and F = (R^2/p) / ((1-R^2)/(n-p-1))

12. Nonparametric Tests

Nonparametric tests make minimal distributional assumptions. They are based on ranks or order statistics rather than the actual values, making them robust to outliers and valid when normality cannot be assumed.

Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test is the nonparametric alternative to the one-sample or paired t-test. It tests whether the median of a symmetric distribution equals a hypothesized value mu_0.

1. Compute Di = Xi - mu_0

2. Rank |Di| as R_i (discard zeros; assign average ranks to ties)

3. W+ = sum of ranks with Di > 0

4. W- = sum of ranks with Di < 0

Under H0: E[W+] = n(n+1)/4, Var(W+) = n(n+1)(2n+1)/24

For large n: Z = (W+ - n(n+1)/4) / sqrt(n(n+1)(2n+1)/24) ~ N(0,1)
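Steps 1-4 can be traced on a small hypothetical sample and cross-checked against scipy (whose reported statistic is min(W+, W-) for a two-sided test):

```python
import numpy as np
from scipy import stats

# Wilcoxon signed-rank test of H0: median = 10 on hypothetical data.
x = np.array([8.2, 9.5, 11.1, 10.4, 7.9, 9.0, 12.3, 8.8, 9.7, 10.9])
mu0 = 10.0
d = x - mu0                                  # step 1: differences

ranks = stats.rankdata(np.abs(d))            # step 2: rank |Di|
W_plus = ranks[d > 0].sum()                  # step 3
W_minus = ranks[d < 0].sum()                 # step 4; W+ + W- = n(n+1)/2 = 55

res = stats.wilcoxon(d)                      # exact test for small n
W, p_value = res.statistic, res.pvalue       # W = min(W+, W-)
```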

Mann-Whitney U Test (Wilcoxon Rank-Sum)

The Mann-Whitney U test is the nonparametric alternative to the two-sample t-test. It tests whether two populations have the same distribution (location shift).

Combine n1 + n2 observations and rank all of them.

W = rank sum for group 1 observations

U = W - n1(n1+1)/2

Under H0: E[U] = n1*n2/2, Var(U) = n1*n2*(n1+n2+1)/12

For large samples: Z = (U - n1*n2/2) / sqrt(n1*n2*(n1+n2+1)/12) ~ N(0,1)

Kolmogorov-Smirnov Test

The KS test compares a sample empirical CDF Fn(x) to either a theoretical CDF F0(x) (one-sample) or another empirical CDF (two-sample). The test statistic is the maximum absolute deviation:

One-sample: D_n = sup_x |Fn(x) - F0(x)|

Two-sample: D_(m,n) = sup_x |Fm(x) - Gn(x)|

By the Glivenko-Cantelli theorem, D_n -> 0 a.s. as n -> infinity.

Under H0: sqrt(n)*D_n -> Kolmogorov distribution (tabulated).

The KS test is distribution-free under H0 (when F0 is continuous and fully specified). Its power is often lower than parametric tests when the parametric assumptions hold, but it is more broadly applicable.

Kruskal-Wallis Test

The Kruskal-Wallis test is the nonparametric analogue of one-way ANOVA, testing whether k groups have the same distribution. Rank all N observations jointly, compute mean rank per group, and compute:

H = (12 / (N(N+1))) * sum_i [n_i * (R-bar_i - (N+1)/2)^2]

Under H0: H ~ chi^2_(k-1) approximately for large n_i

13. Bootstrap Methods

The bootstrap approximates the sampling distribution of a statistic by resampling from the observed data. It is invaluable when analytical formulas for standard errors or confidence intervals are unavailable or unreliable.

Nonparametric Bootstrap

1. Observe x = (x1, ..., xn)

2. For b = 1, ..., B:

a. Draw x*^(b) = (x*1, ..., x*n) by sampling WITH replacement from x

b. Compute theta-hat*^(b) = T(x*^(b))

3. Bootstrap SE estimate: sqrt((1/(B-1)) sum_b (theta-hat*^(b) - theta-hat*-bar)^2)
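The three steps above translate directly into a short resampling loop. This sketch (simulated data, illustrative B) bootstraps the sample median, a statistic with no simple analytic SE, and also computes the percentile interval described next:

```python
import numpy as np

# Nonparametric bootstrap SE and percentile CI for the sample median.
rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=50)   # step 1: the observed sample
B = 5000

boot = np.empty(B)
for b in range(B):                         # step 2: resample WITH replacement
    xb = rng.choice(x, size=len(x), replace=True)
    boot[b] = np.median(xb)                # theta-hat*^(b)

se_boot = boot.std(ddof=1)                 # step 3: bootstrap SE
ci_percentile = (np.quantile(boot, 0.025), np.quantile(boot, 0.975))
```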

Bootstrap Confidence Intervals

Several bootstrap CI methods exist with different accuracy levels:

Normal Bootstrap CI:

theta-hat +/- z_(a/2) * SE_boot

Simple but assumes symmetry and normality of sampling distribution.

Percentile Bootstrap CI:

[Q_(a/2)(theta-hat*), Q_(1-a/2)(theta-hat*)]

Uses empirical quantiles of bootstrap replicates. Simple and transformation-invariant.

Pivotal (Basic) Bootstrap CI:

[2*theta-hat - Q_(1-a/2)(theta-hat*), 2*theta-hat - Q_(a/2)(theta-hat*)]

Often has better coverage than percentile method.

BCa (Bias-Corrected and Accelerated) CI:

Adjusts for bias and skewness; has second-order accuracy and is transformation-invariant. Most recommended when computational budget allows.

Parametric Bootstrap

In the parametric bootstrap, instead of resampling from the data, we fit a parametric model with estimated parameters theta-hat and then draw bootstrap samples from this fitted distribution. This is more efficient when the model is correct and is used to calibrate test statistics or assess goodness of fit.

14. Multiple Testing Corrections

When conducting m simultaneous hypothesis tests each at level alpha, the probability of at least one false rejection (familywise error rate, FWER) inflates dramatically. For m independent tests: FWER = 1 - (1-alpha)^m. For m = 100, alpha = 0.05: FWER = 1 - 0.95^100 = 0.994. Multiple testing corrections address this.

Bonferroni Correction

The Bonferroni correction controls FWER at level alpha by testing each of m hypotheses at level alpha/m. It is valid for any dependence structure:

Reject H_i if p_i <= alpha/m

FWER = P(any false rejection) <= sum_i P(false rejection of H_i) = m * (alpha/m) = alpha

Conservative when tests are positively correlated. Power decreases as m increases.

Holm-Bonferroni (Step-Down) Procedure

Holm's procedure also controls FWER but is uniformly more powerful than Bonferroni:

1. Sort p-values: p_(1) <= p_(2) <= ... <= p_(m)

2. Starting from smallest, reject H_(i) while p_(i) <= alpha/(m-i+1)

3. Stop at the first non-rejection; retain all remaining

False Discovery Rate: Benjamini-Hochberg

The FDR is the expected proportion of rejected hypotheses that are false discoveries. Controlling FDR is less stringent than controlling FWER but typically offers much higher power — it is the standard approach in genomics, neuroimaging, and other high-dimensional settings.

Benjamini-Hochberg at FDR level q:

1. Sort p-values: p_(1) <= p_(2) <= ... <= p_(m)

2. Find k = max(i : p_(i) <= i*q/m)

3. Reject H_(1), ..., H_(k)

Guarantees FDR <= q * m_0/m <= q, where m_0 = number of true nulls.

Valid under independence and positive dependence (PRDS condition).

Worked Example: BH Procedure

m = 6 tests, FDR level q = 0.05. Sorted p-values: 0.001, 0.008, 0.039, 0.041, 0.210, 0.450.

BH thresholds i*q/m = i*0.05/6:

i=1: 0.0083. p_(1)=0.001 <= 0.0083 [reject]

i=2: 0.0167. p_(2)=0.008 <= 0.0167 [reject]

i=3: 0.025. p_(3)=0.039 > 0.025

i=4: 0.0333. p_(4)=0.041 > 0.0333

i=5: 0.0417. p_(5)=0.210 > 0.0417

i=6: 0.05. p_(6)=0.450 > 0.05

k = 2 (the largest i with p_(i) <= i*q/m). Reject H_(1) and H_(2).
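The BH steps can be sketched in a few lines; run on the six p-values from the worked example, the procedure recovers exactly two rejections:

```python
import numpy as np

# Benjamini-Hochberg at FDR level q = 0.05 on the example's p-values.
p = np.array([0.001, 0.008, 0.039, 0.041, 0.210, 0.450])
m, q = len(p), 0.05

order = np.argsort(p)                          # step 1: sort p-values
thresholds = np.arange(1, m + 1) * q / m       # i*q/m
passed = np.nonzero(p[order] <= thresholds)[0]
k = passed.max() + 1 if passed.size else 0     # step 2: largest i passing

reject = np.zeros(m, dtype=bool)               # step 3: reject H_(1..k)
reject[order[:k]] = True
```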

15. Bayesian Inference

Posterior and Point Estimates

The posterior distribution combines the likelihood and prior:

pi(theta | x) proportional to L(theta; x) * pi(theta)

Common point estimates from the posterior:

Posterior mean (MMSE): E[theta | x] = integral theta * pi(theta|x) dtheta

Posterior median: minimizes expected absolute loss

MAP estimate: argmax_theta pi(theta|x) (maximizes posterior, minimizes 0-1 loss)

Credible Intervals

A (1-alpha)*100% credible interval [a,b] satisfies P(a <= theta <= b | x) = 1-alpha. Two standard forms:

Equal-tailed: P(theta < a | x) = P(theta > b | x) = alpha/2

HPD (Highest Posterior Density): shortest interval with posterior probability 1-alpha.

HPD interval satisfies: pi(theta_1 | x) = pi(theta_2 | x) at endpoints.

For symmetric unimodal posteriors, equal-tailed and HPD coincide.

Conjugate Priors

A prior is conjugate for a likelihood if the posterior is in the same family as the prior. This enables closed-form Bayesian updating.

Likelihood                         | Prior               | Posterior
Binomial(n, p)                     | Beta(a, b)          | Beta(a + x, b + n - x)
Poisson(lambda)                    | Gamma(a, b)         | Gamma(a + sum(xi), b + n)
Normal(mu, sigma^2), sigma^2 known | Normal(mu_0, tau^2) | Normal(posterior mean, posterior var)
Exponential(lambda)                | Gamma(a, b)         | Gamma(a + n, b + sum(xi))
Multinomial                        | Dirichlet(alpha)    | Dirichlet(alpha + x)

Worked Example: Beta-Binomial

We observe x = 7 successes in n = 10 Bernoulli trials. Prior: Beta(2, 2) (mild prior centered at 0.5). Compute the posterior and a 95% credible interval.

Posterior: Beta(2 + 7, 2 + 10 - 7) = Beta(9, 5)

Posterior mean = 9/(9+5) = 9/14 = 0.643

MLE (no prior) = 7/10 = 0.70

Prior pulls toward 0.5; posterior is a weighted compromise.

95% equal-tailed credible interval: Beta(9,5) quantiles

= approximately [0.385, 0.874]
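The conjugate update and the credible interval can be checked numerically; the quantiles come straight from the Beta(9, 5) posterior:

```python
from scipy import stats

# Beta-Binomial posterior from the worked example: Beta(2, 2) prior,
# x = 7 successes in n = 10 trials.
a0, b0, x, n = 2, 2, 7, 10
a_post, b_post = a0 + x, b0 + n - x             # Beta(9, 5)

post_mean = a_post / (a_post + b_post)          # 9/14 ≈ 0.643
ci = stats.beta.ppf([0.025, 0.975], a_post, b_post)  # equal-tailed 95% CI
```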

Normal-Normal Conjugate Analysis

For X1, ..., Xn ~ N(mu, sigma^2) with sigma^2 known and prior mu ~ N(mu_0, tau^2):

Posterior precision: 1/tau_n^2 = 1/tau^2 + n/sigma^2

Posterior mean: mu_n = tau_n^2 * (mu_0/tau^2 + n*X-bar/sigma^2)

= weighted average of prior mean and data mean, weights = precisions

As n -> infinity, posterior mean -> X-bar (data dominates prior)

As tau^2 -> infinity (diffuse prior), posterior mean -> X-bar
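The precision-weighting formulas above are a one-liner in code. In this sketch the prior and data values are illustrative; the posterior mean should land between mu_0 and X-bar, much closer to the data:

```python
# Normal-Normal conjugate update with sigma^2 known.
mu0, tau2 = 0.0, 4.0            # prior N(mu0, tau^2)
sigma2, n, xbar = 1.0, 20, 2.5  # known variance, sample size, sample mean

prec_post = 1.0 / tau2 + n / sigma2                 # posterior precision
tau2_post = 1.0 / prec_post                         # posterior variance
mu_post = tau2_post * (mu0 / tau2 + n * xbar / sigma2)  # precision-weighted mean
```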

Markov Chain Monte Carlo (MCMC)

When conjugate priors are unavailable or the model is complex, analytical posteriors are intractable. MCMC methods generate samples from the posterior by constructing a Markov chain whose stationary distribution equals the posterior.

Metropolis-Hastings Algorithm

1. Initialize theta^(0)

2. For t = 1, 2, ..., T:

a. Propose theta* from q(theta* | theta^(t-1))

b. Compute acceptance ratio:

r = [pi(theta*|x) * q(theta^(t-1)|theta*)] /

[pi(theta^(t-1)|x) * q(theta*|theta^(t-1))]

c. Accept: theta^(t) = theta* with prob min(1, r)

Reject: theta^(t) = theta^(t-1) otherwise

After burn-in, (theta^(t)) are approximately draws from the posterior.
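A minimal random-walk Metropolis sketch of the algorithm, targeting a standard normal "posterior" so the output is easy to check (for a symmetric proposal the q terms in the acceptance ratio cancel; the proposal scale and chain length are arbitrary tuning choices):

```python
import numpy as np

# Random-walk Metropolis-Hastings targeting N(0, 1).
rng = np.random.default_rng(4)

def log_post(theta):
    return -0.5 * theta**2          # log N(0,1) density, up to a constant

T, burn = 20_000, 2_000
theta = 0.0                          # step 1: initialize
chain = np.empty(T)
for t in range(T):                   # step 2
    prop = theta + rng.normal(0, 1.0)        # 2a: symmetric proposal
    log_r = log_post(prop) - log_post(theta) # 2b: log acceptance ratio
    if np.log(rng.uniform()) < log_r:        # 2c: accept with prob min(1, r)
        theta = prop
    chain[t] = theta

draws = chain[burn:]                 # approximate posterior draws
```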

Gibbs Sampling

Gibbs sampling is a special case of MH for multivariate parameters where we cycle through each component, sampling from its full conditional distribution given all other components. It requires no acceptance-rejection step:

For theta = (theta_1, ..., theta_k):

Sample theta_1^(t) ~ pi(theta_1 | theta_2^(t-1), ..., theta_k^(t-1), x)

Sample theta_2^(t) ~ pi(theta_2 | theta_1^(t), theta_3^(t-1), ..., theta_k^(t-1), x)

...

Sample theta_k^(t) ~ pi(theta_k | theta_1^(t), ..., theta_(k-1)^(t), x)

Converges when the chain is irreducible and aperiodic.

MCMC diagnostics include trace plots (to assess mixing and convergence), autocorrelation plots, Gelman-Rubin R-hat statistic (compares between-chain and within-chain variance; values near 1.0 indicate convergence), and effective sample size (accounts for autocorrelation among draws).

16. Frequently Asked Questions

What is the difference between frequentist and Bayesian statistical inference?

Frequentist inference treats parameters as fixed unknowns and interprets probability as long-run frequency. A 95% confidence interval means 95% of such intervals (across repeated experiments) contain the true parameter. Bayesian inference treats parameters as random with a prior distribution; the posterior pi(theta|x) is updated via Bayes' theorem. A 95% credible interval directly means P(theta in [a,b] | x) = 0.95 given the observed data.

What is maximum likelihood estimation (MLE) and what are its properties?

MLE maximizes the likelihood L(theta; x) = f(x | theta) over theta. Under regularity conditions it is consistent, asymptotically normal (sqrt(n)(theta-hat - theta) converges to N(0, 1/I(theta))), asymptotically efficient (achieves the Cramer-Rao bound), and equivariant (g(theta-hat) is the MLE of g(theta)).

What does the Cramer-Rao lower bound say?

For any unbiased estimator T of theta: Var(T) >= 1/[n*I(theta)], where I(theta) is the Fisher information. An estimator achieving this bound is efficient. The MLE is asymptotically efficient. For estimating g(theta), the bound becomes [g'(theta)]^2 / [n*I(theta)].

What is the Neyman-Pearson lemma and why is it important?

The NP lemma states that for simple H0 vs simple H1, the most powerful level-alpha test rejects when the likelihood ratio L(theta_1;x)/L(theta_0;x) exceeds a threshold k. It is foundational because it proves the LR test is optimal among all tests at the given size, motivating UMP tests for one-sided composite hypotheses via monotone likelihood ratios.

What is the difference between a confidence interval and a credible interval?

A 95% confidence interval is a random interval such that 95% of such intervals (across repeated experiments) contain the fixed true parameter. Once observed, the specific interval either does or does not contain theta — no probability applies. A 95% Bayesian credible interval satisfies P(theta in [a,b] | data) = 0.95 — a direct probability statement about the parameter given the observed data.

What are Type I and Type II errors in hypothesis testing?

Type I error (false positive): rejecting a true H0, with probability alpha (the significance level). Type II error (false negative): failing to reject a false H0, with probability beta. Power = 1 - beta. Increasing sample size reduces both alpha and beta simultaneously; for fixed n, reducing alpha increases beta. Common targets are alpha = 0.05 and power >= 0.80.

What is the Benjamini-Hochberg procedure for controlling the false discovery rate?

Sort m p-values as p_(1) <= ... <= p_(m). Find the largest k with p_(k) <= k*q/m. Reject H_(1) through H_(k). BH controls FDR (expected proportion of false discoveries among rejections) at level q under independence and positive dependence. It is less conservative than Bonferroni and has much higher power in large-scale testing (genomics, neuroimaging).

What are conjugate priors in Bayesian inference and why are they useful?

A conjugate prior yields a posterior in the same family: Beta prior + Binomial likelihood = Beta posterior. Gamma prior + Poisson = Gamma posterior. Normal prior + Normal (known variance) = Normal posterior. Conjugate priors give closed-form posteriors with interpretable hyperparameters (equivalent to prior pseudo-observations), avoiding the need for MCMC in simple models.
