Overview
A statistical hypothesis test is a structured method for assessing whether observed data are compatible with a specified claim about a population or process. Analysts state a null hypothesis, which typically represents no effect or no difference, and an alternative hypothesis, which represents the effect of interest. The test computes a test statistic from the sample data and compares it with the sampling distribution that the statistic would follow if the null hypothesis were true, quantifying how surprising the result would be under that hypothesis. This quantification is commonly reported as a p-value: the probability, under the null model, of observing data at least as extreme as the data actually observed.
Key components
Most hypothesis tests follow a standard sequence: specify hypotheses, choose a test statistic, determine its sampling distribution under the null, compute the statistic from the sample, and apply a decision rule based on a preselected significance level (often denoted alpha). Important concepts include:
- Test statistic: a numeric summary (e.g., t, z, chi-square) used to assess evidence.
- Sampling distribution: the distribution the test statistic would follow across repeated samples if the null were true.
- P-value: the probability, computed under the null, of data at least as extreme as those observed; smaller values indicate stronger evidence against the null.
- Significance level: threshold for rejecting the null (commonly 0.05).
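The sequence above can be sketched with a two-sided one-sample z-test, using only the Python standard library. This is a minimal illustration, not a recommended general-purpose procedure: it assumes the population standard deviation is known, and the function name one_sample_z_test, the sample values, the null mean mu0, and sigma are all made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: mean == mu0, assuming a known population sd `sigma`."""
    n = len(sample)
    # Test statistic: standardized distance of the sample mean from mu0.
    # Under H0 (and the normality assumption) it follows a standard normal distribution.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # P-value: probability under H0 of a statistic at least as extreme as z,
    # counting both tails.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Decision rule: reject H0 when the p-value falls below the preselected alpha.
    return z, p_value, p_value < alpha


# Hypothetical measurements; the null hypothesis is that the true mean is 5.0.
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.7, 5.2, 5.5]
z, p, reject = one_sample_z_test(sample, mu0=5.0, sigma=0.4)
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {reject}")
```

When the population standard deviation has to be estimated from the sample, a t-test is the usual replacement, and the reference distribution changes accordingly.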
Errors, power, and interpretation
Decisions from hypothesis tests are probabilistic and can be wrong. A Type I error occurs when the null is rejected although it is true; its long-run rate is controlled by the significance level. A Type II error occurs when a false null is not rejected. The complement of the Type II error rate is the power, the probability that a test detects a true effect of a given size. Interpreting p-values and significant results requires care: statistical significance does not by itself imply practical importance, and results depend on study design, sample size, and the validity of the test's assumptions.
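The relationship between significance level, effect size, sample size, and power can be made concrete for a two-sided one-sample z-test. The sketch below is illustrative only: it assumes normally distributed data with a known standard deviation, and the function name z_test_power and the numbers in the usage lines are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test.

    `effect` is the true difference between the population mean and the
    null value; `sigma` is the (assumed known) population sd.
    """
    std_normal = NormalDist()
    # Critical value: reject H0 when |Z| exceeds this.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative, Z is normal with this mean and unit variance.
    shift = effect * sqrt(n) / sigma
    # Probability that Z lands in either rejection region.
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# With no true effect, the rejection probability is the Type I error rate.
print(z_test_power(effect=0.0, sigma=1.0, n=32))
# With a true effect, power grows with sample size.
print(z_test_power(effect=0.5, sigma=1.0, n=10))
print(z_test_power(effect=0.5, sigma=1.0, n=32))
```

The Type II error rate for a given effect size is one minus the returned value.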
Common tests and categories
Tests are often classified as parametric or nonparametric. Parametric tests assume specific distributional forms (e.g., normality) and include the t-test, ANOVA, and inference for linear regression. Nonparametric tests, such as the Mann–Whitney or Kruskal–Wallis tests, make fewer assumptions and operate on ranks or other robust summaries. Categorical data are frequently analyzed with chi-square tests or logistic regression. The choice between one-tailed and two-tailed testing reflects whether only departures in a specific direction are of interest or departures in either direction.
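To illustrate how a rank-based test avoids distributional assumptions, the following sketch computes the Mann–Whitney U statistic and an exact two-sided p-value by enumerating every way of splitting the pooled observations into two groups. It is a sketch for very small samples only, since the number of splits grows quickly. The function names, the data, and the choice to define "as extreme" by distance of U from its null expectation are assumptions made for this example; statistical libraries may use different conventions, especially with ties.

```python
from itertools import combinations


def u_statistic(x, y):
    """Count (x_i, y_j) pairs with x_i > y_j; a tie counts as half a pair."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def exact_mann_whitney(x, y):
    """Return (U, exact two-sided p-value) by full enumeration of group labels."""
    pooled = list(x) + list(y)
    n, m = len(x), len(y)
    center = n * m / 2  # expected value of U when the null is true
    u_obs = u_statistic(x, y)
    observed_distance = abs(u_obs - center)
    as_extreme = total = 0
    # Under the null, every assignment of n pooled values to group x is equally likely.
    for idx in combinations(range(n + m), n):
        chosen = set(idx)
        group_x = [pooled[i] for i in idx]
        group_y = [pooled[i] for i in range(n + m) if i not in chosen]
        total += 1
        if abs(u_statistic(group_x, group_y) - center) >= observed_distance:
            as_extreme += 1
    return u_obs, as_extreme / total


# Hypothetical measurements from two independent groups.
group_a = [1.1, 2.3, 2.9, 3.4]
group_b = [4.0, 4.8, 5.5, 6.1]
u, p = exact_mann_whitney(group_a, group_b)
print(f"U = {u}, exact two-sided p = {p:.4f}")
```

Because only the ordering of the observations enters the statistic, any monotone transformation of the data (for example, taking logarithms) leaves the result unchanged.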
History, usage, and best practices
Modern hypothesis testing developed through the contributions of Sir Ronald Fisher and, later, J. Neyman and E.S. Pearson. Fisher's approach emphasized p-values as a measure of evidence against the null, while the Neyman–Pearson approach emphasized long-run control of error rates. Today hypothesis tests are ubiquitous across the sciences, business, and policy. Best practices include pre-specifying hypotheses and analysis plans, reporting effect sizes and confidence intervals alongside p-values, checking assumptions, and considering reproducibility. When communicating results, clarify that a small p-value indicates inconsistency with the null under the assumed model, not definitive proof: random variation and study limitations must still be considered. Related topics include experimental design and the foundational ideas of hypotheses and chance.