Overview
Mojibake (Japanese: 文字化け, mojibake) is the term for garbled, unreadable characters that appear when text is decoded or rendered with the wrong character encoding. It typically looks like sequences of strange letters, question marks, replacement glyphs (such as �), or boxes. Mojibake occurs when the sequence of bytes that represents text is interpreted using a different character set mapping than the one originally used to encode it.
How it happens
Computers store text as sequences of bytes; those bytes have meaning only in the context of a character encoding. If the decoder assumes the wrong encoding, byte values map to incorrect characters. Common contributing factors include missing or incorrect metadata (for example HTTP Content-Type headers or <meta> tags in HTML), mismatched system locales, legacy encodings in files or databases, transfer through software that alters bytes, and double-encoding errors. Fonts or rendering systems that lack glyphs may show empty boxes ("tofu"), which is a separate problem from mojibake.
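As a minimal sketch of the mechanism described above, the following Python snippet decodes one fixed byte sequence under three encodings. The helper name decode_as and the sample text are illustrative choices, not part of any standard API.

```python
def decode_as(raw: bytes, encoding: str) -> str:
    """Decode raw bytes under `encoding`, substituting U+FFFD for undecodable bytes."""
    return raw.decode(encoding, errors="replace")


text = "文字化け"
raw = text.encode("utf-8")  # 12 bytes: each of these characters takes 3 bytes in UTF-8

right = decode_as(raw, "utf-8")          # the encoding actually used: original text
wrong_latin1 = decode_as(raw, "latin-1")  # every byte maps to some character: 12 wrong ones
wrong_sjis = decode_as(raw, "shift_jis")  # bytes regrouped under Shift_JIS rules: garbage

for label, value in [("utf-8", right), ("latin-1", wrong_latin1), ("shift_jis", wrong_sjis)]:
    # repr() keeps control characters from disturbing the terminal
    print(f"{label:>10}: {value!r}")
```

The bytes never change; only the mapping applied to them does. Note that Latin-1 assigns a character to every byte value, so that decoder never reports an error even when the result is nonsense.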
Typical causes and examples
- Legacy encodings: files created with encodings such as ISO-8859-1, Windows-1252, Shift_JIS, or EUC-JP may appear garbled when read as UTF-8.
- Incorrect headers: a web server or email client that omits or mislabels the charset can lead browsers or mail readers to choose the wrong interpretation.
- Byte-order and BOM issues: a UTF-16 or UTF-8 byte order mark (BOM) can be misread by software expecting a different format; a UTF-8 BOM read as Windows-1252, for instance, appears as "ï»¿" at the start of the text.
- Double encoding: UTF-8 bytes mistakenly decoded as Latin-1 or Windows-1252 and then encoded as UTF-8 a second time produce layered corruption, in which each original non-ASCII character becomes two or more characters.
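Two of the causes listed above, double encoding and a misread BOM, can be reproduced in a few lines. This is an illustrative sketch using Python's standard codecs; the sample strings and variable names are arbitrary.

```python
import codecs

original = "café"

# Double encoding: UTF-8 bytes are wrongly decoded as Windows-1252,
# and the already-garbled string is then encoded as UTF-8 again.
once = original.encode("utf-8")   # b'caf\xc3\xa9'
garbled = once.decode("cp1252")   # 'cafÃ©'
twice = garbled.encode("utf-8")   # b'caf\xc3\x83\xc2\xa9'

# BOM: the three UTF-8 BOM bytes (EF BB BF) become visible characters
# when the data is read as Windows-1252.
with_bom = codecs.BOM_UTF8 + "hi".encode("utf-8")
bom_as_cp1252 = with_bom.decode("cp1252")    # 'ï»¿hi'
bom_stripped = with_bom.decode("utf-8-sig")  # 'hi' (this codec drops a leading BOM)

print(repr(garbled), twice)
print(repr(bom_as_cp1252), repr(bom_stripped))
```

After the second encoding step the data is valid UTF-8 again, which is why double-encoded text often passes validation checks while still displaying incorrectly.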
Common visible patterns
Some sequences are widely recognized: accented letters turn into two-character combinations such as "é" or "ñ" when UTF-8 bytes are interpreted as ISO-8859-1 or Windows-1252. Em dashes, curly quotes, and other punctuation that UTF-8 encodes in three bytes may appear under Windows-1252 as three-character combinations like "—" or "’". When a decoder meets bytes that are invalid in the assumed encoding, it typically substitutes the Unicode replacement character U+FFFD (�); empty rectangles, by contrast, indicate missing glyphs rather than a decoding error.
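The patterns quoted above can be regenerated rather than memorized. The sketch below assumes the common case of UTF-8 bytes read as Windows-1252; the function name as_seen_in_cp1252 is a hypothetical helper for illustration.

```python
def as_seen_in_cp1252(s: str) -> str:
    """Show how UTF-8 encoded text looks when its bytes are read as Windows-1252."""
    return s.encode("utf-8").decode("cp1252")


# Two-byte UTF-8 characters yield two wrong characters; three-byte ones yield three.
patterns = {ch: as_seen_in_cp1252(ch) for ch in ["é", "ñ", "—", "’"]}
for ch, seen in patterns.items():
    print(f"{ch!r} -> {seen!r}")

# The opposite mismatch: a Latin-1 byte for "é" (0xE9) is not valid UTF-8 on its own,
# so a lenient UTF-8 decoder substitutes U+FFFD.
replaced = b"caf\xe9".decode("utf-8", errors="replace")
print(repr(replaced))
```

A lookup table built this way can help recognize which mismatch produced a given garbled string, although it only covers this one pair of encodings.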
History and standards
Before the widespread adoption of Unicode, many incompatible single-byte and multi-byte encodings existed, often specific to languages or operating systems. Unicode and its popular encodings (notably UTF-8, which uses one to four bytes per code point) were developed to provide a universal mapping for characters from many writing systems and to reduce mismatches. Even so, legacy data and systems still in use mean mojibake remains a practical problem.
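To make the "one to four bytes per code point" property concrete, this short sketch measures the UTF-8 length of four sample characters. The samples are arbitrary picks, one from each length class.

```python
# One character from each UTF-8 length class.
samples = ["A", "é", "文", "😀"]

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} {ch!r}: {n} byte(s)")
```

The variable length is what makes the garbling patterns distinctive: a character stored in several bytes turns into several wrong characters under a single-byte decoder.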
Detecting and fixing mojibake
Practical remedies focus on ensuring a single, correct interpretation of bytes: save and serve text as Unicode (preferably UTF-8), include explicit charset declarations in HTTP headers and HTML (<meta charset="utf-8">), configure databases and clients to use the intended encoding, and use conversion tools (iconv, recode) to repair legacy files. When diagnosing, it helps to know the provenance of the data: which program created the file, which platform or locale was used, and whether any intermediate systems might have altered bytes. Many editors and browsers can reinterpret a file under different encodings to reveal the original text.
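The diagnosis and repair steps above might be sketched in Python as follows. All three function names are hypothetical helpers, the candidate encoding list is an assumption to adjust per data source, and the repair function only handles the single case of UTF-8 text that was decoded as Windows-1252.

```python
def candidate_decodings(raw: bytes, encodings=("utf-8", "shift_jis", "euc_jp", "cp1252")) -> dict:
    """Return {encoding: text} for each candidate that decodes without error.

    A successful decode does not prove the encoding is right; a person (or
    knowledge of the data's provenance) still has to pick the sensible result.
    """
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            pass
    return results


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Transcode bytes from a known legacy encoding to UTF-8 (roughly what iconv does)."""
    return raw.decode(source_encoding).encode("utf-8")


def repair_utf8_read_as_cp1252(garbled: str) -> str:
    """Undo one layer of 'UTF-8 bytes decoded as Windows-1252'.

    Returns the input unchanged when it does not fit that pattern. Python's
    cp1252 codec leaves a few byte values unmapped, so some garbled strings
    cannot be round-tripped this way.
    """
    try:
        return garbled.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled


legacy = "文字化け".encode("shift_jis")
print(candidate_decodings(legacy))
print(to_utf8(legacy, "shift_jis").decode("utf-8"))
print(repair_utf8_read_as_cp1252("cafÃ©"))
```

Guessing from the bytes alone is unreliable, which is why the first function returns every candidate that decodes cleanly instead of choosing one; recorded metadata about the source encoding is the dependable fix.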
Distinctions and notable facts
Mojibake refers specifically to incorrect characters produced by decoding errors. This differs from missing glyphs (tofu), which occur when the correct character is known but the font lacks a shape for it. The Japanese term itself combines moji (character) and bake (from bakeru, to change form), reflecting the appearance of characters changing into unintelligible forms. For further technical background, see the documentation for the software and encodings involved, along with general Unicode and UTF-8 references.
Understanding and preventing mojibake is an important part of software internationalization, data interchange, web publishing, and digital preservation. Consistent use of modern encodings and clear metadata dramatically reduces the chance that text will become corrupted in storage or transit.