AWK
awk is a programming language for processing and evaluating arbitrary text data, including CSV files. Its interpreter behaves somewhat like a compiler, because it first translates the entire program text and only then executes it. awk was designed primarily as a report generator and was first distributed with Version 7 Unix. It can be seen as a further development of, or a complement to, the stream editor sed, with which it shares certain syntactic elements such as regular expressions. Unlike sed, however, awk provides C-like constructs (if ... else, various loops, printf-style formatting and so on) that make programs considerably easier to write. In its simplest use, awk serves as a filter in shell scripts, for example to assemble file names. More elaborate programs can edit, transform or evaluate text files. In addition to the usual string functions, basic mathematical functions are available. The name "awk" is formed from the initials of the surnames of its three authors, Alfred V. Aho, Peter J. Weinberger and Brian W. Kernighan.
A version of awk is available on almost every Unix-like system today and is often preinstalled. Comparable implementations also exist for almost all other operating systems.
The language works almost exclusively with strings as its data type. In addition, associative arrays (arrays indexed with strings, also called hashes) and regular expressions are fundamental components of the language.
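As a small illustration of associative arrays, the following sketch counts how often each value occurs in the first field. The sample input and the array name count are made up for this example, and the output is piped through sort because the iteration order of an awk array is not specified.

```shell
# Count occurrences of the first field using an associative array.
# The input lines and the array name "count" are illustrative only.
out=$(printf 'apple\npear\napple\n' |
  awk '{ count[$1]++ } END { for (w in count) print w, count[w] }' |
  sort)
echo "$out"
```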
The power and compactness of awk and sed scripts, but also their limitations, inspired Larry Wall to develop the Perl language.
Structure of a program
A typical awk program performs operations, such as substitutions, on an input text. The text is read line by line, and each line is split into fields at a chosen separator, by default a run of spaces and/or tab characters. The awk statements are then applied to that line.
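The splitting into fields can be observed with a short sketch. It assumes colon-separated sample data invented for this example and uses the -F option to choose the separator; NF is the built-in count of fields in the current line.

```shell
# Each input line is split into fields; -F sets the separator (here a colon).
# NF is the number of fields in the current line. The sample data is made up.
out=$(printf 'root:x:0\ndaemon:x:1\n' | awk -F: '{ print NF, $1 }')
echo "$out"
```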
awk statements have the following structure:
condition { action }
For each line read, awk determines whether it satisfies the condition (often a regular expression). If it does, the code in the statement block enclosed by curly braces is executed. Alternatively, a statement may consist of an action only,
{ action }
or of a condition only:
condition
If the condition is missing, the action is executed for every line. If the action is missing, the default action is performed for every line that satisfies the condition: the entire line is printed.
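The three statement forms described above can be tried out as follows. This is only a sketch: the sample data and the shell variable names are assumptions made for the example.

```shell
# Sample input, invented for illustration.
data='alpha 1
beta 22
gamma 3'

# Condition and action: print the first field of lines whose second field exceeds 2.
both=$(printf '%s\n' "$data" | awk '$2 > 2 { print $1 }')

# Action only: the block runs for every line.
action_only=$(printf '%s\n' "$data" | awk '{ print $2 }')

# Condition only: matching lines are printed unchanged (the default action).
cond_only=$(printf '%s\n' "$data" | awk '/^b/')

printf '%s\n' "$both" "$action_only" "$cond_only"
```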
Variables and functions
Variables come into existence when they are first referenced inside a statement block; no explicit declaration is needed. Variables have global scope. The exception is function parameters, which are visible only within the function that defines them.
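A minimal sketch of implicit, global variables: total and n (names chosen for this example) are never declared, start out as zero, and remain visible in the END block, which runs after the last input line.

```shell
# "total" and "n" are never declared; they start out uninitialized (zero here)
# and are global, so the END block can still read them.
out=$(printf '10\n20\n30\n' | awk '{ total += $1; n++ } END { print total, n, total / n }')
echo "$out"
```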
Functions can be defined anywhere in the program; the definition does not have to precede the first use. Scalar arguments are passed by value, arrays by reference. A call does not have to supply every parameter named in the function definition: surplus parameters serve as local variables, and omitted arguments receive the special uninitialized value, which is zero in a numeric context and the empty string in a string context.
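The parameter-passing rules can be illustrated with a short sketch. The function names total, append and bump are hypothetical and exist only for this example; separating the extra parameters with additional spaces is a common convention, not a language requirement.

```shell
out=$(awk '
# "i" and "sum" are extra parameters used as local variables;
# callers pass only "arr" and "n".
function total(arr, n,    i, sum) {
    for (i = 1; i <= n; i++) sum += arr[i]
    return sum
}
# Arrays are passed by reference, so the caller sees the new element.
function append(arr, n, value) {
    arr[n + 1] = value
    return n + 1
}
# Scalars are passed by value, so the variable in the caller is unchanged.
function bump(x) { x = x + 1; return x }

BEGIN {
    n = split("1 2 3", nums, " ")   # nums[1..3] = 1, 2, 3
    n = append(nums, n, 4)          # nums now has four elements
    x = 5
    y = bump(x)                     # y is 6, x stays 5
    print total(nums, n), x, y
}')
echo "$out"
```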
Functions and variables of all kinds use the same namespace, so identical naming leads to undefined behavior.
In addition to user-defined variables and functions, standard variables and standard functions are also available, for example the variables $0 for the entire line, $1, $2, ... for the i-th field of the line and FS (from "field separator") for the field separator, as well as the functions gsub(), split() and match().
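The following sketch exercises FS together with split(), gsub() and match(). The input line and the variable names n, changed, pos and parts are assumptions made for this example.

```shell
out=$(printf 'key=foo bar; boo\n' | awk '
BEGIN { FS = "=" }                   # split input lines at "="
{
    n = split($2, parts, "; ")       # parts[1] = "foo bar", parts[2] = "boo"
    changed = gsub(/o/, "0", $2)     # replace every "o" in the second field
    pos = match($1, /y/)             # position of the first "y" in "key"
    print n, changed, pos, $2
}')
echo "$out"
```

Here split() returns the number of pieces, gsub() the number of replacements made, and match() the 1-based position of the match (or 0 if there is none).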