Statistical significance is a formal criterion used in statistics to judge whether an observed result is unlikely to have arisen by random variation alone. It does not by itself measure practical importance or truth, but instead quantifies how surprising an outcome is under a specific baseline assumption. The concept centers on comparing observed data to what would be expected if a designated baseline — the null hypothesis — were true.
Definition and interpretation
At the core of statistical significance is the p-value, which is the probability, computed under the null hypothesis, of obtaining the observed result or one more extreme. This expresses how compatible the data are with the null model: a small p-value indicates low compatibility. Investigators compare the p-value to a predetermined threshold called the significance level, denoted α. If p ≤ α, the result is called "statistically significant," and researchers may reject the null hypothesis in favor of an alternative explanation.
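As a minimal sketch of the definition above, the following computes an exact one-sided p-value for a hypothetical coin-flip experiment: 15 heads in 20 flips, with a null hypothesis of a fair coin. The function name `binomial_p_value`, the data, and the choice α = 0.05 are illustrative assumptions, not part of any standard API.

```python
from math import comb

def binomial_p_value(k, n, p0=0.5):
    """One-sided p-value: P(X >= k) when X ~ Binomial(n, p0) under the null."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))

alpha = 0.05                      # significance level, fixed before seeing the data
p = binomial_p_value(15, 20)      # hypothetical data: 15 heads in 20 flips
print(f"p = {p:.4f}; statistically significant at alpha = {alpha}: {p <= alpha}")
# prints: p = 0.0207; statistically significant at alpha = 0.05: True
```

Here "more extreme" means "at least as many heads as observed"; a two-sided version would also count outcomes equally far below the expected 10 heads.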
Tests of significance require a test statistic, which summarizes sample information into a single number; its distribution under the null hypothesis is used to obtain the p-value. Tests can be one-sided or two-sided depending on whether the alternative hypothesis specifies a direction for the effect. The threshold α controls the long-run frequency of false positive conclusions (Type I errors), that is, rejections of a null hypothesis that is in fact true. The conventional choice α = 0.05 is common but arbitrary.
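The role of the test statistic and the one-sided versus two-sided distinction can be illustrated with a z-test for a mean. This is a sketch under stated assumptions: the population standard deviation is treated as known, the sample mean is taken to be normally distributed, and all numbers (null mean 100, σ = 15, n = 36, sample mean 104.5) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n):
    """Return the z statistic with upper-tail (one-sided) and two-sided p-values."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))        # test statistic
    std_normal = NormalDist()                          # null distribution of z
    p_one_sided = 1 - std_normal.cdf(z)                # alternative: mean > mu0
    p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))     # alternative: mean != mu0
    return z, p_one_sided, p_two_sided

alpha = 0.05
z, p_one, p_two = z_test(sample_mean=104.5, mu0=100, sigma=15, n=36)
print(f"z = {z:.2f}, one-sided p = {p_one:.4f}, two-sided p = {p_two:.4f}")
print(f"one-sided significant: {p_one <= alpha}; two-sided significant: {p_two <= alpha}")
```

With these numbers the same data are significant under the one-sided test but not under the two-sided test, which is why the direction of the alternative should be chosen before the data are examined.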
History and development
The phrase and formal practice emerged in the early 20th century. Ronald Fisher popularized the idea of significance testing and introduced the p-value as a tool for interpreting data. Later, Jerzy Neyman and Egon Pearson framed hypothesis testing as a decision procedure that emphasized pre-specifying an α level and balancing Type I and Type II errors. Both perspectives are still reflected in modern practice: Fisher's treatment of the p-value as a measure of evidence, and the Neyman–Pearson focus on long-run error rates.
Uses, examples and common misunderstandings
Statistical significance is widely used across experimental science, medicine, social sciences, and industry to screen findings, guide decisions, and support regulatory thresholds. For example, clinical trials often require statistically significant differences before labeling a treatment effective. However, significance is frequently misinterpreted: a small p-value is not the probability that the null hypothesis is true, nor does it measure effect size or practical relevance. Multiple testing, selective reporting, and data dredging ("p-hacking") can inflate false positive rates unless addressed.
- Typical misconceptions: treating p as the probability that the null hypothesis is true, or treating nonsignificance as evidence of no effect.
- Typical problems: multiple comparisons, optional stopping, and publication bias.
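The multiple-comparisons problem can be made concrete with a short calculation. The sketch below assumes m independent tests in which every null hypothesis is true and every p-value is uniformly distributed on [0, 1] (which holds for continuous test statistics). Under those assumptions the chance of at least one false positive is 1 − (1 − α)^m. The function names and the simulation settings are illustrative.

```python
import random

def familywise_error_rate(alpha, m):
    """Probability of at least one false positive among m independent true nulls."""
    return 1 - (1 - alpha) ** m

def simulate_fwer(alpha, m, trials=10_000, seed=0):
    """Monte Carlo estimate of the same quantity, drawing null p-values as Uniform(0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One "study" runs m tests; count it if any test is falsely significant.
        if any(rng.random() <= alpha for _ in range(m)):
            hits += 1
    return hits / trials

analytic = familywise_error_rate(0.05, 20)
simulated = simulate_fwer(0.05, 20)
print(f"analytic: {analytic:.3f}, simulated: {simulated:.3f}")
```

With α = 0.05 and 20 tests, the analytic value is about 0.64, so a study that runs 20 independent tests on pure noise is more likely than not to report at least one "significant" result.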
Best practices and alternatives
Responsible use of statistical significance includes pre-specifying α, reporting exact p-values rather than only "significant" or "not significant," and complementing significance tests with effect sizes and confidence intervals so readers can assess magnitude and uncertainty. Corrections for multiple comparisons (for example, the Bonferroni correction, which compares each p-value to α divided by the number of tests) reduce false positives. Increasingly, researchers combine frequentist testing with other approaches, such as transparent pre-registration, replication, and Bayesian methods, to provide a fuller inferential picture.
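The reporting practices above can be sketched in a few lines. The example computes a standardized effect size (Cohen's d with a pooled standard deviation), an approximate 95% confidence interval for a difference in means, and a Bonferroni adjustment. The data and function names are hypothetical, and the interval uses a normal approximation for simplicity; for samples this small a t-based interval would usually be preferred.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

def mean_diff_ci(a, b, level=0.95):
    """Normal-approximation confidence interval for mean(a) - mean(b)."""
    diff = mean(a) - mean(b)
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    return diff - z * se, diff + z * se

def bonferroni(p_values, alpha=0.05):
    """Flag each p-value as significant at the adjusted threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Hypothetical measurements for two groups.
treated = [5.1, 4.9, 5.6, 5.8, 5.3, 5.7]
control = [4.8, 4.6, 5.0, 5.2, 4.7, 4.9]

d = cohens_d(treated, control)
low, high = mean_diff_ci(treated, control)
print(f"Cohen's d = {d:.2f}, 95% CI for the difference = ({low:.2f}, {high:.2f})")

# Four hypothetical p-values from one study; the adjusted threshold is 0.05 / 4 = 0.0125.
flags = bonferroni([0.001, 0.02, 0.04, 0.3])
print(flags)
```

Reporting the interval and effect size alongside the p-value lets a reader see both how large the estimated difference is and how precisely it was measured.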
In summary, statistical significance is a useful tool when applied with care and interpreted alongside measures of effect and study design. It signals that observed data are unlikely under a specific null model, but it is only one component of responsible scientific inference and decision making.