The phrase "most common words in English" usually refers to frequency lists compiled from large collections of real language use. Such lists are based on tokens counted in a corpus and are often reported as lemmas (headwords) rather than as every inflected form. For example, the lemma "be" is counted together with forms like "is", "are", "was" and "were". Major publishers and research groups compile these lists from very large collections of texts; in one notable study the analysis drew on a corpus of over a billion words.
How the lists are compiled
Creating a frequency list requires choices that affect the result. Compilers must decide whether to count tokens (every running occurrence of a word) or types (each distinct form, counted once), whether to group forms under a single lemma, how to treat contractions and punctuation, and which texts to include. A representative corpus aims to include diverse registers, from formal writing and journalism to informal chat, emails and blogs, so that the counts reflect broad usage rather than a single genre. Dictionaries and lexicographers typically treat headwords as the unit of analysis.
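The effect of these choices can be illustrated with a minimal Python sketch. It is not how any particular corpus project works: the `LEMMAS` table, the `tokenize` rule and the sample sentence are all hypothetical and invented for illustration, whereas real projects rely on full lemmatizers and tagged corpora.

```python
import re
from collections import Counter

# Hypothetical, hand-made lemma table for illustration only.
LEMMAS = {
    "is": "be", "are": "be", "was": "be", "were": "be", "am": "be",
    "has": "have", "had": "have",
}

def tokenize(text):
    """Lowercase the text and return word tokens; punctuation is dropped
    and contractions such as "don't" are kept as single tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def frequency_lists(text):
    """Return the token list, wordform counts, and lemma counts."""
    tokens = tokenize(text)
    wordforms = Counter(tokens)                         # each form separately
    lemmas = Counter(LEMMAS.get(t, t) for t in tokens)  # forms grouped by lemma
    return tokens, wordforms, lemmas

sample = "The cat is here. The dogs were there, and the bird was not."
tokens, wordforms, lemmas = frequency_lists(sample)
print(len(tokens), "tokens,", len(wordforms), "types")
print("wordform 'is':", wordforms["is"], "| lemma 'be':", lemmas["be"])
```

In this toy sample "is", "were" and "was" each occur once as wordforms, but together they give the lemma "be" a count of three, which shows how the decision to lemmatize can move an item up the ranking.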
Historical background and sources
Frequency studies are not new. Earlier influential corpora include the Brown Corpus and the British National Corpus; more recent efforts build much larger electronic collections and use automated processing. Oxford University Press and associated projects have produced widely cited lists derived from large online corpora and the Oxford English Corpus (OEC). These modern corpora benefit from automated tagging and lemmatization but remain subject to editorial choices about which texts to include.
Patterns and notable facts
Several robust patterns appear across corpora. Function words (articles, prepositions, pronouns, auxiliary verbs) dominate the top ranks, and a small number of words account for a large share of tokens. For example, pedagogical sources have long noted that the first 25 words can make up roughly one-third of printed English, and the first 100 can approach one-half. Frequency distributions also follow predictable mathematical shapes such as Zipf's law, under which a word's frequency is roughly inversely proportional to its rank: the most frequent item is about twice as common as the second and about three times as common as the third.
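Both ideas, the share of tokens covered by the top-ranked words and the idealized Zipf curve, can be sketched in a few lines of Python. The function names `coverage` and `zipf_expected` and the counts below are invented for illustration and do not come from any real corpus; the exponent `s=1.0` is the textbook idealization, and real corpora only approximate it.

```python
from collections import Counter

def coverage(counts, n):
    """Share of all tokens accounted for by the n most frequent items."""
    total = sum(counts.values())
    top = sum(c for _, c in counts.most_common(n))
    return top / total

def zipf_expected(top_count, rank, s=1.0):
    """Idealized Zipf frequency at a given rank: top_count / rank**s."""
    return top_count / rank ** s

# Toy counts, invented for illustration (not from a real corpus).
counts = Counter({"the": 600, "of": 300, "and": 200, "to": 150, "cat": 50})

print(f"top-2 coverage: {coverage(counts, 2):.1%}")
for rank, (word, count) in enumerate(counts.most_common(), start=1):
    print(rank, word, count, zipf_expected(600, rank))
```

Comparing observed counts with `zipf_expected` rank by rank is a quick way to see where a given list departs from the idealized curve.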
Practical uses and examples
- Language teaching: prioritizing high-frequency vocabulary gives learners early communicative payoff.
- NLP and search: frequency informs language models, stopword lists, and text-compression schemes.
- Lexicography and reading research: lists guide basic dictionaries and graded readers.
- Stylistics and corpus linguistics: comparing frequencies highlights register and genre differences.
Typical high-ranking lemmas are short function words and common verbs and nouns — for example, words such as "the", "be", "and", "of", "a", "in", "to", "have", "it" and pronouns like "I" and "you" frequently occur near the top of many lists. Exact order and percentages vary by corpus composition and whether counts collapse inflected forms into lemmas or treat each form separately.
Distinctions and caveats
When using frequency lists remember: (1) Type vs. token — a list of types does not show how often each type appears; (2) Lemma vs. wordform — combining forms can inflate the apparent importance of a headword; (3) Corpus composition — spoken language and social media change rankings compared with formal writing. For reliable interpretation, consult the documentation for the corpus used and, where possible, examine frequency by register or medium rather than relying on a single overall ranking.
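One way to examine frequency by register is to normalize counts to occurrences per million tokens, so that sub-corpora of different sizes can be compared directly. The sketch below assumes counts have already been collected per register; the helper names and the toy "spoken" and "written" figures are hypothetical.

```python
from collections import Counter

def per_million(counts):
    """Convert raw counts to frequency per million tokens."""
    total = sum(counts.values())
    return {w: c * 1_000_000 / total for w, c in counts.items()}

def compare_registers(registers, word):
    """Per-million frequency of one word in each register (0.0 if absent)."""
    return {name: per_million(c).get(word, 0.0) for name, c in registers.items()}

# Toy counts, invented for illustration only.
registers = {
    "spoken": Counter({"i": 40, "you": 30, "the": 30}),
    "written": Counter({"the": 70, "of": 25, "i": 5}),
}

for word in ("i", "the", "you"):
    print(word, compare_registers(registers, word))
```

In these invented figures "I" is far more frequent in the spoken sample than in the written one, the kind of contrast a single overall ranking would hide.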
Further reading and detailed lists are available from corpus projects and lexicographic resources, which also provide introductory definitions of terms such as "lemma" and "corpus" and descriptions of individual corpora such as the OEC.