Overview

Reinforcement learning (RL) is a branch of machine learning in which an agent learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties. Rather than being given explicit correct answers, the agent must discover through trial and error which actions lead to desirable outcomes. A central goal is learning a policy, a mapping from states to actions, that maximizes cumulative reward over time.
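
As a small illustration of "cumulative reward over time", the sketch below sums the rewards of one episode, weighting later rewards less. The discount factor gamma, the function name discounted_return, and the sample rewards are assumptions made for illustration; the text above does not specify how rewards are aggregated.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each weighted by gamma ** (number of steps into the future)."""
    total = 0.0
    # Work backwards so every earlier step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: no reward for two steps, then a reward of 1.0.
episode_rewards = [0.0, 0.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.5))  # 0.25
```

With gamma set to 1.0 the function reduces to a plain sum; smaller values make the agent favor rewards that arrive sooner.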

Core concepts and components

RL systems are commonly described using a few basic elements. The agent observes the environment, selects actions, and receives scalar rewards. The environment responds to actions and produces new observations and rewards. Key theoretical notions include:

  • Policy: how the agent chooses actions (deterministic or stochastic).
  • Reward signal: the scalar feedback guiding learning.
  • Value function: an estimate of the expected cumulative future reward from a state or state–action pair.
  • Model: an internal prediction of environment dynamics used by model-based methods.
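
The interaction between agent and environment described above can be sketched as a simple loop. Everything named here is hypothetical and chosen only for illustration: the Corridor environment, its reset/step interface, the two example policies, and the reward of 1.0 at the last cell.

```python
import random


class Corridor:
    """Hypothetical toy environment: cells 0..4, reward 1.0 on reaching cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def always_right(observation):
    """Deterministic policy: the same observation always yields the same action."""
    return 1


def make_coin_flip_policy(seed=0):
    """Stochastic policy: samples left or right uniformly at random."""
    rng = random.Random(seed)
    return lambda observation: rng.choice([0, 1])


def run_episode(env, policy, max_steps=100):
    """Agent observes, acts, and receives a scalar reward until the episode ends."""
    observation = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


print(run_episode(Corridor(), always_right))  # (1.0, 4)
```

Passing make_coin_flip_policy() instead of always_right runs the same loop with a stochastic policy, which typically needs more steps to reach the goal and may hit the step cap without reaching it.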

Algorithms and distinctions

Methods are often classified as model-free (learning a policy or value function directly from experience) or model-based (learning a model of the environment's dynamics and using it to plan). Popular approaches include temporal-difference learning and Q-learning, policy gradient methods, and actor-critic architectures. Practical systems must balance exploration (trying new actions to discover better outcomes) against exploitation (choosing actions already known to be rewarding).
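
To make one of these approaches concrete, here is a minimal sketch of tabular Q-learning, a model-free temporal-difference method, with epsilon-greedy action selection as one common way to trade off exploration and exploitation. The corridor task, the function names, and the hyperparameter values (alpha, gamma, epsilon, the episode count, the step cap) are all assumptions chosen for illustration, not recommended settings.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical corridor: cells 0..4, goal at cell 4
ACTIONS = (0, 1)        # 0 = move left, 1 = move right


def step(state, action):
    """Toy dynamics: move one cell, clamp at the walls, reward 1.0 at the goal."""
    nxt = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def epsilon_greedy(q, state, epsilon, rng):
    # Explore with probability epsilon; otherwise exploit, breaking ties randomly.
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # step cap so an unlucky episode still ends
            action = epsilon_greedy(q, state, epsilon, rng)
            nxt, reward, done = step(state, action)
            # TD target: immediate reward plus discounted best next value
            # (no bootstrapping from a terminal state).
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


q = q_learning()
# Greedy action in each non-terminal cell after training.
greedy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]
print(greedy)
```

After training, the greedy action in each non-terminal cell should be 1 (move right), since that is the only route to the reward. No model of the dynamics is learned here: the table is updated purely from sampled transitions.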

History and intellectual roots

RL draws on the behaviorist psychology of reward and punishment, on control theory, and on dynamic programming from operations research. Over decades it evolved from theoretical foundations to scalable algorithms through advances in function approximation, especially deep neural networks, which enabled RL to handle high-dimensional sensory inputs.

Applications and examples

Reinforcement learning has been applied to robotics, autonomous control, game playing, resource allocation, recommendation systems, and simulated scientific discovery. Practical deployments often combine RL with supervised learning components or human-derived priors to improve safety and efficiency. For accessible introductions to agents and learning paradigms, see overviews of agents and of machine learning more broadly.

Challenges and notable considerations

Important challenges include sample efficiency (how much experience is needed), reward design (specifying objectives without unwanted side effects), stability of learning with function approximators, and ensuring safety and interpretability. For contrasts with other paradigms, see comparisons of RL with supervised learning and accounts of its behavioral foundations in behaviorist psychology. Surveys and tutorials offer further reading.