Bytes become text only when a character encoding maps byte sequences to characters.
Learning Question
How can byte values become visible text such as letters, Korean characters, punctuation, or source code?
The key distinction is that bytes are not characters by themselves.
Characters are produced when bytes are decoded according to an encoding rule.
That rule may be simple for common English letters and more complex for the full range of modern text.
What Text Encoding Does
A text encoding is a rule for mapping between byte sequences and characters.
When saving text, software encodes characters into bytes.
When opening text, software decodes bytes into characters.
The two directions are related:
characters -> encoding -> bytes
bytes -> decoding -> characters
If the same encoding rule is used consistently, the text can round-trip correctly.
If the wrong decoding rule is used, the visible text can break.
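The two directions can be sketched in a few lines of Python. This is a minimal illustration, not part of the original material: the sample string and the variable names are assumptions, and Latin-1 stands in for "some other decoding rule."

```python
# Sample text chosen for illustration (an assumption, not from the chapter).
text = "Hello, 가"

# characters -> encoding -> bytes
encoded = text.encode("utf-8")

# bytes -> decoding -> characters, using the same rule
decoded = encoded.decode("utf-8")

print(encoded.hex(" "))   # the byte values, in hex
print(decoded == text)    # True: the text round-trips

# Decoding the same bytes with a different rule gives different characters.
wrong = encoded.decode("latin-1")
print(wrong == text)      # False: the bytes are unchanged, the rule is wrong
```

The bytes are identical in both decode calls. Only the rule applied to them differs.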
ASCII As A Simple Historical Example
ASCII is a historical character encoding that maps common English letters, digits, punctuation, and control characters to numeric values.
For example:
| Character | Decimal | Hex |
|---|---|---|
| A | 65 | 41 |
| H | 72 | 48 |
| e | 101 | 65 |
| newline | 10 | 0A |
In ASCII, each character is represented by one byte value in the range 0x00 through 0x7F.
This simple mapping is why Hello looked straightforward in the earlier hex example:
48 65 6C 6C 6F
Those byte values are printable ASCII characters.
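The ASCII table and the Hello bytes can be checked with a short Python sketch. The variable names are illustrative assumptions.

```python
# Decode the five byte values from the hex example as ASCII.
hello_bytes = bytes.fromhex("48 65 6C 6C 6F")
hello_text = hello_bytes.decode("ascii")
print(hello_text)  # Hello

# Reproduce the table rows: character, decimal value, hex value.
rows = [(repr(ch), ord(ch), format(ord(ch), "02X")) for ch in "AHe\n"]
for row in rows:
    print(*row)
```

Each row printed by the loop matches one row of the table: 65 and 41 for A, 72 and 48 for H, 101 and 65 for e, 10 and 0A for newline.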
UTF-8
UTF-8 is a widely used encoding for Unicode text.
It is variable-length.
That means one visible character may use one byte, two bytes, three bytes, or four bytes.
For common ASCII characters, UTF-8 uses the same single-byte values as ASCII.
For many non-ASCII characters, UTF-8 uses multiple bytes.
This is why the question:
How many characters are in this text?
is not the same as:
How many bytes are in this file?
One visible character can take more than one byte.
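A small Python sketch can show the four possible lengths. The sample characters other than A and 가 are assumptions picked for illustration, written as escapes so the code points are unambiguous.

```python
# One sample character for each UTF-8 length.
samples = [
    "A",           # U+0041, ASCII letter
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\uac00",      # U+AC00, the Korean character 가
    "\U0001F600",  # U+1F600, an emoji
]

byte_lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_lengths.append(len(encoded))
    # Print the code point and bytes rather than the glyph itself.
    print(f"U+{ord(ch):04X}", len(encoded), encoded.hex(" "))

print(byte_lengths)  # [1, 2, 3, 4]
```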
Byte Count Versus Character Count
Consider this text:
A가
It has two visible characters:
A가
But in UTF-8, it uses four bytes:
41 ea b0 80
The A uses one byte:
41
The Korean character 가 uses three bytes:
ea b0 80
So the file can contain two characters but four bytes.
That is not an error.
It is how UTF-8 represents that text.
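The same count can be reproduced in Python as a quick check. Note one assumption: Python's `len` on a string counts Unicode code points, which for this text happens to equal the number of visible characters.

```python
text = "A가"
data = text.encode("utf-8")

char_count = len(text)  # code points in the string
byte_count = len(data)  # bytes after UTF-8 encoding

print(char_count, byte_count, data.hex(" "))  # 2 4 41 ea b0 80
```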
Why Text Breaks Under The Wrong Encoding
Text can break when software decodes bytes using the wrong encoding.
The bytes are still there.
The decoding rule is wrong for those bytes.
Common symptoms include:
- replacement characters such as �
- wrong-looking characters
- question marks
- missing characters
- unreadable mixed symbols
This broken-looking output is often called mojibake.
The practical lesson is:
To recover text, bytes and encoding must match.
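The following Python sketch shows two of these symptoms on the same four bytes. Latin-1 and ASCII are assumed here as examples of "wrong" decoders. Real-world mismatches often involve other encodings.

```python
# The UTF-8 bytes for "A가": 41 ea b0 80
data = "A가".encode("utf-8")

# Wrong rule 1: Latin-1 maps every byte to some character, so it never
# fails. It just produces wrong-looking characters.
as_latin1 = data.decode("latin-1")

# Wrong rule 2: ASCII cannot decode bytes above 0x7F. With
# errors="replace", each such byte becomes U+FFFD, the replacement character.
as_ascii = data.decode("ascii", errors="replace")

# Matching rule: the original text comes back.
recovered = data.decode("utf-8")

# ascii() shows escapes, so the output does not depend on terminal settings.
print(ascii(as_latin1))  # 'A\xea\xb0\x80'
print(ascii(as_ascii))   # 'A\ufffd\ufffd\ufffd'
print(recovered == "A가")  # True
```

In all three cases `data` is unchanged. Only the decoding rule differs.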
Source Files Are Text Files
Source files such as .java, .c, .js, and .md are usually text files.
That means their first interpretation layer is character decoding.
After decoding, another tool applies higher-level rules:
| File | First Layer | Next Interpreter |
|---|---|---|
| .java | bytes decoded into characters | Java compiler |
| .c | bytes decoded into characters | C compiler |
| .js | bytes decoded into characters | JavaScript engine or tooling |
| .md | bytes decoded into characters | Markdown renderer |
This distinction matters because source code is not directly executable machine behavior.
It is text that programming tools interpret, compile, transform, or render.
For Java-specific compilation, see From Java Source Code to Class Files.
For C-specific compilation, see From Source Code to Executable File.
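The two layers in the table above can be imitated in Python, used here only as a stand-in for the Java and C toolchains. The one-line source text and the names are hypothetical. The point is the order of the steps: decode first, then hand the characters to the next interpreter.

```python
# Layer 0: a source file is bytes on disk (simulated here in memory).
source_bytes = "greeting = '가' * 2\n".encode("utf-8")

# First layer: bytes decoded into characters.
source_text = source_bytes.decode("utf-8")

# Next interpreter: Python's compiler turns the characters into a code object.
code = compile(source_text, "<example>", "exec")

# Running the compiled code is yet another step after that.
namespace = {}
exec(code, namespace)

print(len(source_bytes), len(source_text))  # 21 19
print(namespace["greeting"] == "가가")      # True
```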
Small Experiment
These commands assume a Unix-like shell such as WSL Ubuntu, and they assume the shell writes the text as UTF-8.
Create a file containing one ASCII character and one Korean character:
printf 'A가' > utf8.txt
xxd -g 1 utf8.txt
The exact output layout may vary, but the important byte values should be:
41 ea b0 80
What To Observe
The visible text has two characters:
A가
The bytes are:
41 ea b0 80
The first byte, 41, represents A under ASCII-compatible UTF-8.
The next three bytes, ea b0 80, represent 가 under UTF-8.
The text editor shows two characters because it decodes the four bytes as UTF-8.
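If `xxd` is not available, a rough Python equivalent of the experiment is sketched below. It is an alternative under stated assumptions, not the chapter's own procedure: it writes to a temporary directory and names the encoding explicitly instead of relying on the shell.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "utf8.txt")

    # Write the text, stating the encoding explicitly.
    with open(path, "w", encoding="utf-8") as f:
        f.write("A가")

    # Read the raw bytes back, with no decoding.
    with open(path, "rb") as f:
        raw = f.read()

    # Read the same file as text, decoding the bytes as UTF-8.
    with open(path, "r", encoding="utf-8") as f:
        shown = f.read()

print(raw.hex(" "))          # 41 ea b0 80
print(len(raw), len(shown))  # 4 2
```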
What This Proves
Text is not the same thing as “one byte per visible symbol.”
Text is byte sequences interpreted by an encoding.
ASCII makes common English characters look simple because one byte maps to one character.
UTF-8 keeps that simplicity for ASCII-compatible characters while also supporting many more characters through multi-byte sequences.
What Encoding Explains
This chapter does not teach the full Unicode standard, every encoding family, normalization, locale behavior, fonts, grapheme clusters, or shell portability.
Those topics matter in real systems.
The representation-layer boundary here is:
A text file is still bytes. It becomes text when decoded through a character encoding.
Encoding Rule To Carry Forward
Bytes become text through an encoding rule.
ASCII is a simple one-byte mapping for common English letters, digits, punctuation, and control characters.
UTF-8 is a variable-length encoding that can use multiple bytes for one visible character.
Source files are usually text files first. Compilers and other tools interpret the decoded text later.
When text appears wrong, ask:
Are these bytes being decoded with the encoding they were written in?
Text As Decoded Bytes
Text is an interpretation layer over bytes.
A file containing source code, Markdown, or ordinary prose still starts as byte contents.
Those bytes become characters through encoding, and only after that can other tools interpret them as source code, markup, configuration, or documentation.