Overview
A regular expression (often shortened to regex or regexp) is a compact notation for describing patterns in text. At its simplest, a regex specifies a sequence of characters to match, but the notation also supports character classes, repetition, alternation and grouping, so it can express complex text patterns. Implementations are widely available in text-processing tools and programming environments for searching, validating and transforming string data. The theoretical foundation of the notation is the study of formal languages.
Core elements and examples
Typical building blocks found in most regex dialects include literal characters, character classes (e.g. [a-z]), quantifiers that control repetition (*, +, ?, {n,m}), anchors that mark positions (^ for start, $ for end), alternation (|), and grouping with parentheses. Many engines also provide escape sequences (\d for digits, \w for word characters, \s for whitespace) and special constructs such as word boundaries (\b).
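The building blocks listed above can be tried out directly. The following is a minimal sketch assuming Python's standard re module; the sample string and variable names are invented for illustration, and other dialects may spell some constructs differently.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "cat 42 dogs, 7 cats"

# Character class plus quantifier: runs of one or more lowercase ASCII letters.
lowercase_words = re.findall(r"[a-z]+", text)

# Escape sequence: \d+ matches runs of digits.
numbers = re.findall(r"\d+", text)

# Anchors: ^ ties the match to the start of the string, $ to the end.
starts_with_cat = re.search(r"^cat", text) is not None
ends_with_cats = re.search(r"cats$", text) is not None

# Alternation, grouping, an optional "s", and word boundaries combined.
animals = re.findall(r"\b(?:cat|dog)s?\b", text)

# Bounded repetition: between two and four digits, and nothing else.
is_short_number = re.fullmatch(r"\d{2,4}", "2024") is not None
```

The non-capturing group (?:...) is used so that findall returns whole matches rather than only the group contents.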
Simple examples: the literal pattern "car" matches that substring anywhere it occurs, including inside longer words such as "carpet". The pattern \bcar\b matches "car" only as a separate word. The pattern \$\d+(?:\.\d{2})? matches a dollar amount like "$10" or "$245.99" (the backslash escapes the dollar sign, and the decimal part is optional). More advanced constructs include lookahead/lookbehind assertions and backreferences; backreferences in particular let a pattern describe languages that are not regular in the formal sense.
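The example patterns can be checked with a short script. This is a sketch assuming Python's re module; the sentence and the other sample strings are made up for the demonstration.

```python
import re

# Hypothetical sentence containing "car" both as a word and inside other words.
sentence = "The car carried a scarred carpet; price $245.99 or $10."

# Literal pattern: matches "car" wherever it appears, even inside other words.
substring_hits = re.findall(r"car", sentence)

# Word boundaries restrict the match to the standalone word.
word_hits = re.findall(r"\bcar\b", sentence)

# Escaped dollar sign, digits, and an optional two-digit decimal part.
prices = re.findall(r"\$\d+(?:\.\d{2})?", sentence)

# Backreference: \1 must repeat exactly what group 1 captured (a doubled word).
doubled = re.search(r"\b(\w+) \1\b", "it is is fine")

# Lookahead: match digits only when followed by " dollars", without consuming it.
dollar_counts = re.findall(r"\d+(?= dollars)", "pay 30 dollars and 5 euros")
```

Note that "$10." yields "$10": the trailing period is not followed by two digits, so the optional decimal group simply matches nothing.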
History and theoretical background
The idea of regular expressions originates in formal language theory and automata theory, where Stephen Kleene introduced the notation in the 1950s as a way to describe regular languages. Practical implementations were popularized by early text-processing software and scripting languages. Introductory material on pattern syntax usually covers both the syntactic rules and how a regex engine parses patterns. Many modern programming environments incorporate regex support; consult each language's documentation for engine-specific differences.
Common uses and limitations
Regexes are used for searching and replacing text, validating formats (emails, phone numbers, simple dates), tokenizing input, and extracting data from logs or documents. They are particularly effective for pattern-based tasks that do not require full grammatical parsing. However, they are not a substitute for a proper parser when a language has nested or recursive structure. Tasks like compiling source code or interpreting complex nested grammars are better handled by parser generators and similar tools that build a syntax tree.
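Three of these uses (extraction, validation, and replacement) might look as follows. This is a minimal sketch assuming Python's re module; the log-line layout, the function names, and the sample inputs are all hypothetical.

```python
import re

# Assumed log layout for illustration: "YYYY-MM-DD LEVEL message".
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.*)$"
)

def parse_line(line):
    """Extract named fields from one log line, or return None if it does not fit."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None

def looks_like_iso_date(value):
    """Check the shape NNNN-NN-NN only; this does not verify a real calendar date."""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", value) is not None

def redact_digits(text):
    """Search and replace: mask every digit with '#'."""
    return re.sub(r"\d", "#", text)

record = parse_line("2024-01-15 ERROR disk full")
```

The validator illustrates a typical limit of format checks: "2024-13-99" has the right shape and passes, even though it is not a valid date, so semantic checks belong in ordinary code after the regex.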
Practical considerations and notable distinctions
There are many dialects of regular expressions: POSIX, Perl-compatible regular expressions (PCRE), Java, .NET, JavaScript and others each differ slightly in features and escaping rules. Some engine features (for example backreferences and certain lookaround constructs) make matching more powerful, but engines that support them usually rely on backtracking, which gives up the linear-time guarantee of automaton-based matching and can cause performance problems such as catastrophic backtracking. When performance matters, prefer specific, anchored patterns, avoid nested quantifiers, and consider finite-state or streaming approaches for very large inputs.
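One common source of catastrophic backtracking is a quantifier nested inside another quantifier. The sketch below, assuming Python's backtracking re engine, compares a nested pattern with a flat one that accepts the same strings; the pattern names and sample inputs are illustrative only.

```python
import re

# Nested quantifiers: many ways to split a run of "a"s between the inner
# and outer repetition, which a backtracking engine may try one by one.
NESTED = re.compile(r"^(a+)+$")

# A single quantifier accepts exactly the same strings without that ambiguity.
FLAT = re.compile(r"^a+$")

def same_verdict(s):
    """Return True when both patterns agree on whether s matches."""
    return bool(NESTED.match(s)) == bool(FLAT.match(s))

# Inputs are kept deliberately short: with the nested pattern, the work on a
# failing input such as "aaaa...ab" grows roughly exponentially with the
# number of "a"s on a backtracking engine.
samples = ["a", "aaaa", "aaaab", "", "b"]
verdicts = [same_verdict(s) for s in samples]

# The flat pattern rejects a long failing input quickly.
long_failure = FLAT.match("a" * 50 + "b")
```

Rewriting to remove the nesting keeps the accepted language the same while avoiding the blow-up; the nested pattern should not be run on the long input here.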
Further reading
- Introductory tutorials and quick-reference tables for common tokens and constructs are useful starting points; look for a trusted tutorial or a language-specific reference.
- For theoretical depth on pattern expressiveness and its limits, consult resources on formal languages and automata.
- Practical engine comparisons and examples across tools can be found in many language ecosystems and online guides; see the vendor or community documentation for each programming language.
Regex remains a compact and powerful tool for many text-processing tasks. When used with awareness of its dialect-specific features and limits, it can dramatically simplify searches, validations and transformations in code and command-line workflows.