Text file

In information technology, a text file is a file that contains displayable characters. These can be structured by control characters such as line breaks and page breaks. The counterpart to the text file is the binary file. Strictly speaking, text files are also stored in binary form, but the two terms are used as opposites because what matters is how the binary content is interpreted: in a text file, the content is interpreted as a sequence of characters from a character set; in a binary file, any other interpretation of the content is possible. Consequently, unlike a binary file, a text file can be read without special programs and can be viewed and edited with a simple text editor, such as Notepad under Windows or vi or nano under Unix.
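The point about interpretation can be illustrated with a minimal Python sketch. The four byte values and the choice of a 32-bit big-endian integer as the "binary" reading are assumptions made for illustration only; any other binary interpretation would serve equally well.

```python
import struct

# Four bytes as they might be stored in a file (values chosen for illustration).
raw = bytes([0x54, 0x65, 0x78, 0x74])

# Text interpretation: every byte is a character code from a character set.
as_text = raw.decode("ascii")

# One possible binary interpretation: a single 32-bit big-endian unsigned integer.
(as_number,) = struct.unpack(">I", raw)

print(as_text)    # the characters T, e, x, t
print(as_number)  # the same bits read as one number
```

The stored bytes are identical in both cases; only the rule used to read them differs.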

In contrast to this technical definition, in which the file format is decisive, colloquial usage is often guided by the content visible to the end user: here, the term "text file" is used rather loosely for any file created in order to present readable text, regardless of the form in which it is stored. However, the files that common word processing or publishing software produces when saving are often complex formats that contain, in addition to the text, meta-information describing the layout, structure, and fonts used; images or graphics may also be embedded. They are therefore not text files in the technical sense, because the file formats are often binary and require special software to display them.

In a text file in the technical sense, the number of available characters is determined by the underlying encoding. The most common encodings are ASCII and UTF-8, an encoding of Unicode. Such a text file does not necessarily have to contain text: it can, for example, also contain ASCII art, i.e. pictures composed of the available characters. If the content is text, and neither special processing steps nor knowledge of a special notation is required to understand its meaning, it is referred to as plain text. In addition, the set of characters used is often constrained by a natural or formal language. Text files that require a specific notation, such as HTML files, can be edited with a simple text editor, but there are often special programs for this purpose that make editing easier, for example through syntax highlighting or automatic formatting.
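How the encoding limits the available characters can be checked with a short Python sketch. The helper name `can_encode` and the sample string are hypothetical and exist only for this illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sample = "naïve café"  # contains characters outside the ASCII repertoire

print(can_encode(sample, "ascii"))  # False: ï and é are not ASCII characters
print(can_encode(sample, "utf-8"))  # True: UTF-8 can encode any Unicode character
```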

The 95 printable characters of the original ASCII
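The 95 printable characters can be listed with a short Python sketch, assuming the usual convention that they occupy the code values 0x20 (space) to 0x7E (tilde).

```python
# Printable ASCII is assumed here to run from 0x20 (space) to 0x7E (~) inclusive.
printable = [chr(code) for code in range(0x20, 0x7F)]

print(len(printable))      # 95
print("".join(printable))  # the characters themselves, starting with a space
```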

History

In the early days of electronic data processing, the distinction between text and binary files was simpler than it is today. In a text file, each character corresponded directly to a specific bit pattern. The file could be sent to a terminal, printer or teleprinter directly, that is, character by character and without conversion by a special program. The Baudot code used for transmission between teleprinters is also the origin of the control characters "line feed" and "carriage return" found in text files.

A character encoding is used to convert the physically stored bit sequences into text. In the past, a character was almost always stored as exactly one byte, as a rule a group of 8 bits, which made 256 (that is, 2^8) different characters possible. ASCII in its original definition actually used only 7 of these bits.
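The one-byte-per-character model can be sketched in Python as follows. Python's `latin-1` codec is used here only as an example of an 8-bit encoding; it is not mentioned in the text above, and the helper `is_valid_ascii` is hypothetical.

```python
# An 8-bit encoding can distinguish 2**8 = 256 byte values.
all_bytes = bytes(range(256))

# Python's latin-1 codec maps every byte value to exactly one character.
latin1_text = all_bytes.decode("latin-1")
print(len(latin1_text))  # 256


def is_valid_ascii(data: bytes) -> bool:
    """7-bit ASCII only defines the values 0 to 127."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_ascii(bytes(range(128))))  # True: all 2**7 = 128 values are defined
print(is_valid_ascii(bytes([128])))       # False: the eighth bit is set
```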

With 7- or 8-bit character sets, only one script can be used in a file at a time, so mixing different languages is possible only to a limited extent. East Asian writing systems, such as Japanese, Chinese and Korean, can hardly be represented at all. In 1986, ISO 2022 became the first standard to allow several scripts in one text file, and it also provided for character sets with more than 256 characters. However, this standard achieved significant adoption only in East Asia and was superseded by Unicode, which was first published in 1991 and is intended, in the long term, to represent all existing writing systems.

With the introduction of Unicode, if not earlier, converting a character into its binary representation became more complicated, because there are several encoding variants and a character is not always represented by the same number of bytes.
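The varying number of bytes per character can be seen in a small Python sketch. The three encodings and the four sample characters are chosen for illustration; the helper name `encoded_sizes` is hypothetical.

```python
def encoded_sizes(ch: str) -> dict:
    """Number of bytes needed for one character in three Unicode encodings."""
    # The "-le" variants are used so that no byte order mark is counted.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for ch in ["A", "ä", "€", "😀"]:
    print(repr(ch), encoded_sizes(ch))
```

In UTF-8 the four characters take 1, 2, 3 and 4 bytes respectively, in UTF-16 they take 2, 2, 2 and 4 bytes, and in UTF-32 every character takes 4 bytes.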

As the exchange of files between different computer systems has become more important, not least because of the Internet, and text files are easier than binary files to process independently of the system, the text format has gained in importance. However, particularly because text files are now used in so many different ways, the term itself has become less precise and more blurred.

Distinction between binary and text files

Many operating systems have conventions regarding the extension of file names to identify the file type. Under Windows and macOS, the extension .txt is usually appended to the name of a text file, and other operating systems such as Linux also sometimes use this file extension.

The Multipurpose Internet Mail Extensions (MIME), which were designed to standardize the technical format of e-mails, define so-called media types that are now also used in many other areas besides e-mail traffic to identify the file type. The media type text identifies text. The complete type specification is supplemented by a subtype that specifies the intended use of the text. For text files that directly contain the "actual" text that is not intended for specific machine processing, the complete type specification is text/plain.
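The link between file extensions and media types can be tried out with Python's standard `mimetypes` module, which guesses a media type from a file name. The file names below are made up for illustration, and the result depends on the extension alone, not on the file's actual content.

```python
import mimetypes


def is_text_media_type(filename: str) -> bool:
    """True if the media type guessed from the extension has the top-level type 'text'."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type is not None and media_type.startswith("text/")


for name in ["notes.txt", "page.html", "photo.png"]:
    print(name, "->", mimetypes.guess_type(name)[0], is_text_media_type(name))
```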

No special formatting, such as bold emphasis, can be specified for the text contained in a text file. Some encodings do, however, allow diacritical marks to be stacked or bidirectional text to be displayed.

A file created with a word processor (such as Microsoft Word or LibreOffice Writer) is normally not a text file, even if only text was entered, because the text can only be viewed and edited again with a suitable word processor. Text stored as PostScript (.ps), Portable Document Format (PDF, .pdf), or TeX DVI (.dvi) is also not a text file, because it contains encoded formatting information, which may be binary. Similarly, text captured by a scanner is not a text file. Such scans are image files unless they are converted into a text file afterwards by text recognition software (OCR, optical character recognition).
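Whether a given file is a text file in the technical sense cannot be decided with certainty from its bytes, but a rough heuristic can be sketched in Python. The rule used here (no NUL bytes, and the content decodes cleanly as UTF-8) is an assumption chosen for illustration, and the function name `looks_like_text` is hypothetical.

```python
def looks_like_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Heuristic only: no NUL bytes, and the data decodes in the given encoding."""
    if b"\x00" in data:
        return False
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_text("Hello, wörld\n".encode("utf-8")))  # True
print(looks_like_text(b"\x89PNG\r\n\x1a\n"))              # False: start of a PNG image
```

A heuristic like this can misclassify files, for example text stored in UTF-16, which routinely contains NUL bytes.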

When data is compressed, a significantly larger saving in storage space can usually be achieved for text files than for binary files. This is because text files have a lower information density than most binary files, and common compression algorithms exploit this, for example by using Huffman coding.
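The difference in compressibility can be observed with Python's standard `zlib` module, whose DEFLATE format combines LZ77 with Huffman coding. This is only a sketch: the repeated sample sentence exaggerates the effect, and random bytes from `os.urandom` stand in for dense binary data. Both are assumptions for illustration.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means better compression)."""
    return len(zlib.compress(data)) / len(data)


sentence = "In a text file, the content is interpreted as a sequence of characters. "
text = (sentence * 200).encode("utf-8")
noise = os.urandom(len(text))  # stand-in for high-density binary data

print(f"text:  {compression_ratio(text):.3f}")
print(f"noise: {compression_ratio(noise):.3f}")
```

The text shrinks to a small fraction of its size, while the random bytes do not shrink at all and may even grow slightly because of the format's overhead.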


AlegsaOnline.com - 2020 / 2023 - License CC3