Overview

Information entropy is a numerical measure of unpredictability or uncertainty associated with a set of possible outcomes. It is a central concept in information theory and gives a way to quantify how much information is produced by an event or random variable. Roughly speaking, rare or surprising events carry more information than highly predictable ones. Entropy translates this intuition into a single number that guides optimal coding, inference, and decision-making.

Definition and key properties

The entropy H of a discrete random variable X with outcomes x and probabilities p(x) is H(X) = - Σ p(x) log p(x), where the sum runs over all outcomes and terms with p(x) = 0 are taken to contribute zero. When the logarithm is base 2 the units are bits; base e gives nats and base 10 gives bans (also called hartleys). For a variable with n possible outcomes, entropy is maximized at log n when all outcomes are equally likely, and it is minimized (zero) when the outcome is certain. Important related quantities include conditional entropy (uncertainty remaining about one variable given knowledge of another), mutual information (information shared between variables), and cross-entropy and Kullback–Leibler divergence (measures used to compare distributions).

  • Nonnegativity: H(X) ≥ 0, with equality for deterministic variables.
  • Additivity and chain rule: H(X, Y) = H(X) + H(Y | X), which reduces to H(X) + H(Y) when X and Y are independent.
  • Concavity: entropy is a concave function of the probability distribution.
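The definition and the chain rule above can be checked numerically. The following is a minimal Python sketch, not a reference implementation: the function names (entropy, marginal, conditional_entropy) and the small weather-style joint distribution are illustrative assumptions that do not come from any particular library or dataset.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution given as probabilities."""
    probs = list(probs)
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Sum of p * log(1/p) equals -sum of p * log p; terms with p == 0 are skipped.
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

def marginal(joint, axis):
    """Marginal distribution of one coordinate of a joint keyed by (x, y)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

def conditional_entropy(joint, base=2.0):
    """H(Y | X): average over x of the entropy of Y given X = x."""
    total = 0.0
    for x, p_x in marginal(joint, 0).items():
        cond = [p / p_x for (xi, _), p in joint.items() if xi == x]
        total += p_x * entropy(cond, base)
    return total

# Hypothetical joint distribution over (X, Y), chosen only for illustration.
joint = {
    ("sun", "dry"): 0.5, ("sun", "wet"): 0.1,
    ("rain", "dry"): 0.1, ("rain", "wet"): 0.3,
}

h_xy = entropy(joint.values())
h_x = entropy(marginal(joint, 0).values())
h_y = entropy(marginal(joint, 1).values())
h_y_given_x = conditional_entropy(joint)

# Chain rule: H(X, Y) = H(X) + H(Y | X), up to floating-point error.
print(abs(h_xy - (h_x + h_y_given_x)) < 1e-9)

# Mutual information: I(X; Y) = H(X) + H(Y) - H(X, Y).
mutual_info = h_x + h_y - h_xy
print(mutual_info)
```

Passing base=math.e to entropy would report the same quantities in nats instead of bits.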

Simple examples

The classic illustration is a coin flip. A fair coin, with 50-50 probability for heads or tails, has H = 1 bit: the two outcomes are equally likely, so each flip conveys exactly one bit of information. If the coin is biased so that one side is more likely, entropy drops below 1 bit; for example, a coin that lands heads 90% of the time has an entropy of about 0.47 bits. In the extreme where the outcome is certain, entropy is 0. A fair six-sided die has higher entropy than a fair coin, log2 6 ≈ 2.58 bits, because more equally likely outcomes increase unpredictability. These examples show that entropy reflects average surprise, not the value of any individual outcome.
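The coin and die figures can be reproduced with a few lines of Python. This is an illustrative sketch; the helper name entropy_bits and the 90/10 bias are choices made here for the example.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes are skipped."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

fair_coin = entropy_bits([0.5, 0.5])      # 1.0 bit
biased_coin = entropy_bits([0.9, 0.1])    # about 0.469 bits
certain = entropy_bits([1.0, 0.0])        # 0.0 bits
fair_die = entropy_bits([1 / 6] * 6)      # about 2.585 bits, i.e. log2(6)

for name, h in [("fair coin", fair_coin), ("biased coin", biased_coin),
                ("certain outcome", certain), ("fair die", fair_die)]:
    print(f"{name}: {h:.3f} bits")
```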

History and development

The modern notion of information entropy was introduced by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication". Shannon chose the entropy formula because it satisfied several reasonable axioms for measuring uncertainty and because it connected directly to the minimum average length of a lossless code. Over the ensuing decades the idea was extended, reinterpreted, and applied across fields, with formal links to statistical mechanics and probability theory.

Applications and notable distinctions

Entropy has many practical and conceptual uses. In data compression it sets a lower bound on the average code length of any lossless code; in cryptography it helps quantify the unpredictability of keys; in machine learning, loss functions such as cross-entropy guide model fitting. Researchers also apply entropy-based methods in biology, for sequence analysis and diversity metrics, and in physics, to connect information with thermodynamic disorder. Distinguishing information entropy from thermodynamic entropy is important: the two are related mathematically and conceptually, but they arise in different contexts and are expressed in different units.
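Two of these uses can be illustrated with a short Python sketch. The first part compares the entropy of a source with the average length of a prefix code; the second computes cross-entropy and Kullback–Leibler divergence against a mismatched model. The source distribution, the code words, and the uniform model q are hypothetical values picked so that the arithmetic is exact, and the function names are assumptions of this example.

```python
import math

def entropy(p):
    """Shannon entropy in bits."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Compression: a source whose probabilities are powers of 1/2.
p = [0.5, 0.25, 0.125, 0.125]
code = ["0", "10", "110", "111"]  # a prefix code matched to p
avg_len = sum(pi * len(word) for pi, word in zip(p, code))
print(entropy(p), avg_len)  # both are 1.75 bits per symbol

# Model fitting: score a uniform model q against the true distribution p.
q = [0.25, 0.25, 0.25, 0.25]
print(cross_entropy(p, q))  # 2.0 bits
print(kl_divergence(p, q))  # 0.25 bits

# Cross-entropy splits into entropy plus divergence: H(p, q) = H(p) + D(p || q).
print(cross_entropy(p, q) == entropy(p) + kl_divergence(p, q))
```

In this example the code meets the entropy bound exactly because every probability is a power of 1/2; for general distributions the average length of an optimal symbol code lies within one bit above the entropy. The 0.25-bit divergence is the average cost per symbol of modelling p with q.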

Readers seeking a deeper technical treatment can consult standard textbooks and surveys in information theory and probability, which give formal definitions, derivations, and coding theorems, along with fuller accounts of conditional entropy, mutual information, cross-entropy and divergences.

Entropy remains a foundational tool for reasoning about uncertainty, optimal representation and inference in many disciplines.
