Corpus (plural corpora) is a Latin word for "body" that has been adopted in several fields to denote an assembled body of material. In general usage it refers to any coherent collection treated as a single unit: a body of text, a body of law, or the physical body in anatomy and medicine. The term is broad, and its precise meaning depends on context.

Core senses and common uses

Typical senses of corpus include: a structured collection of texts used for linguistic analysis; a compilation of an author's works; an anatomical structure named for its bodily form (for example, the corpus callosum); and legal phrases such as corpus juris or corpus delicti. When the word refers to collections of language data, the plural is usually corpora.

Characteristics of textual corpora

In linguistics and natural language processing, a corpus is a carefully selected and often annotated set of written or spoken language. Important characteristics include size (number of words or tokens), representativeness (how well it reflects the language variety of interest), metadata about sources, and levels of annotation such as part-of-speech tags, syntactic trees, or semantic labels.
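As a rough illustration of the size and metadata characteristics described above, the following Python sketch counts tokens and distinct word types in a tiny corpus and breaks the token count down by source. The two sample documents, the field names "source" and "text", and the helper names tokenize and corpus_stats are all hypothetical, and the tokenizer is deliberately naive; real corpora are usually processed with dedicated tokenizers.

```python
import re
from collections import Counter

# Hypothetical toy corpus: each document has source metadata and raw text.
corpus = [
    {"source": "news", "text": "The cat sat on the mat."},
    {"source": "fiction", "text": "The dog chased the cat."},
]

def tokenize(text):
    """Lowercase the text and split it into word tokens (naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def corpus_stats(docs):
    """Return token count, type count, type-token ratio, and tokens per source."""
    counts = Counter()      # frequency of each word type
    by_source = Counter()   # number of tokens contributed by each source
    for doc in docs:
        tokens = tokenize(doc["text"])
        counts.update(tokens)
        by_source[doc["source"]] += len(tokens)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "ttr": n_types / n_tokens if n_tokens else 0.0,
        "tokens_by_source": dict(by_source),
    }

stats = corpus_stats(corpus)
print(stats["tokens"], stats["types"])  # 11 tokens, 7 types
```

The per-source breakdown is one simple way to inspect balance: if a single source contributes most of the tokens, the corpus may not be representative of the language variety of interest.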

Types and examples

  • Monolingual corpora: text from a single language.
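  • Parallel corpora: texts paired with their translations in one or more other languages, often aligned sentence by sentence, used for translation studies and machine translation. (Comparable corpora, by contrast, contain similar but independently written texts in different languages.)
  • Annotated corpora: texts enriched with linguistic markup; treebanks are annotated corpora whose sentences carry syntactic parse trees.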
  • Historical or diachronic corpora: texts spanning time periods for studying language change.

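Well-known examples from computational linguistics include the Brown Corpus, a sample of roughly one million words of American English compiled in the 1960s, and large national corpora such as the British National Corpus, compiled for research and reference.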

History and development

The practice of assembling bodies of text dates back to early philologists and editors who collected an author's complete works. The modern scientific use of corpora grew in the 20th century with computerized storage and concordancing tools, which made quantitative studies of language practical and led to the rise of corpus linguistics as a discipline. Advances in digital text availability and annotation methods have since expanded corpus-based research across the humanities and sciences.
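To give a sense of what a concordancing tool does, here is a minimal keyword-in-context (KWIC) sketch in Python. It is an illustration under simple assumptions, not a reconstruction of any historical tool: the function name kwic, the width parameter, and the sample sentence are all hypothetical, and tokenization is reduced to a single regular expression.

```python
import re

def kwic(text, keyword, width=2):
    """Return (left context, keyword, right context) for each match.

    Contexts contain up to `width` tokens on each side of the keyword.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

sample = "The cat sat on the mat. The dog chased the cat."
hits = kwic(sample, "cat")

# Print each hit with the keyword aligned in a central column.
for left, word, right in hits:
    print(f"{left:>12} | {word} | {right}")
```

Lining up every occurrence of a word with its neighbours in this way lets a researcher scan usage patterns at a glance, which is the basic service a concordance provides.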

Importance and distinctions

Corpora underpin empirical study in linguistics, lexicography, translation, and language technology. They differ from generic databases in purpose and structure: a corpus is designed to be a sample for analysis rather than merely an archive. The same term also appears in anatomy and law with separate, discipline-specific meanings that share only the underlying sense of "body". For the word's origin see Latin.