Bytes become text only when a character encoding maps byte sequences to characters.

Learning Question

How can byte values become visible text such as letters, Korean characters, punctuation, or source code?

The key distinction is that bytes are not characters by themselves.

Characters are produced when bytes are decoded according to an encoding rule.

That rule may be simple for common English letters and more complex for the full range of modern text.

What Text Encoding Does

A text encoding is a rule for mapping between byte sequences and characters.

When saving text, software encodes characters into bytes.

When opening text, software decodes bytes into characters.

The two directions are related:

characters -> encoding -> bytes
bytes -> decoding -> characters

If the same encoding rule is used consistently, the text can round-trip correctly.

If the wrong decoding rule is used, the visible text can break.
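
The two directions can be sketched in a few lines of Python. This is only an illustration: the sample string and the choice of Latin-1 as the "wrong" rule are assumptions made for the example, not part of the chapter.

```python
# Encode characters into bytes, then decode them back with the same rule.
text = "Hello, 가"

encoded = text.encode("utf-8")       # characters -> encoding -> bytes
decoded = encoded.decode("utf-8")    # bytes -> decoding -> characters

round_trip_ok = decoded == text      # same rule in both directions

# Decoding the same bytes with a different rule (Latin-1, chosen only for
# illustration) raises no error here, but yields different characters.
wrong = encoded.decode("latin-1")
mismatch = wrong != text

print(encoded.hex(" "))
print(round_trip_ok, mismatch)
```

The bytes never change in this sketch. Only the rule used to read them does.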

ASCII As A Simple Historical Example

ASCII is a historical character encoding that maps common English letters, digits, punctuation, and control characters to numeric values.

For example:

Character   Decimal   Hex
A           65        41
H           72        48
e           101       65
newline     10        0A

In ASCII, each character is represented by one byte value in the range 0x00 through 0x7F.

This simple mapping is why Hello looked straightforward in the earlier hex example:

48 65 6C 6C 6F

Those byte values are printable ASCII characters.
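
As a quick check, the same byte values can be decoded in Python. The variable names are illustrative only.

```python
# The hex byte values for "Hello".
raw = bytes.fromhex("48 65 6C 6C 6F")

# Every value falls inside the ASCII range 0x00 through 0x7F.
all_ascii = all(b <= 0x7F for b in raw)

# Decode the bytes using the ASCII rule.
text = raw.decode("ascii")

# Show decimal value, hex value, and character for each byte.
for b in raw:
    print(f"{b:3d}  0x{b:02X}  {chr(b)}")
print(text)
```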

UTF-8

UTF-8 is a widely used encoding for Unicode text.

It is variable-length.

That means one character (strictly, one Unicode code point) is encoded as one, two, three, or four bytes.

For common ASCII characters, UTF-8 uses the same single-byte values as ASCII.

For many non-ASCII characters, UTF-8 uses multiple bytes.

This is why the question:

How many characters are in this text?

is not the same as:

How many bytes are in this file?

One visible character can take more than one byte.
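
A small sketch can make the variable length visible. The four sample characters are assumptions chosen so that each UTF-8 width appears once, and each is a single Unicode code point.

```python
# One sample character for each UTF-8 width (written as escapes so the
# exact code points are unambiguous).
samples = ["A", "\u00e9", "\uac00", "\U0001F600"]   # A, é, 가, 😀

# Map each character to the number of bytes UTF-8 uses for it.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in widths.items():
    hex_bytes = ch.encode("utf-8").hex(" ")
    print(f"U+{ord(ch):04X} {ch!r}: {n} byte(s) -> {hex_bytes}")
```

Only the first sample is in the ASCII range, and it is the only one that fits in a single byte.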

Byte Count Versus Character Count

Consider this text:

A가

It has two visible characters:

  • A
  • 가

But in UTF-8, it uses four bytes:

41 ea b0 80

The A uses one byte:

41

The Korean character uses three bytes:

ea b0 80

So the file can contain two characters but four bytes.

That is not an error.

It is how UTF-8 represents that text.
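
For readers who want to see where the three bytes come from, the following sketch unpacks ea b0 80 by hand. It assumes the standard three-byte UTF-8 bit layout and is not a general decoder.

```python
# The three bytes that represent the Korean character in the example.
b0, b1, b2 = bytes.fromhex("ea b0 80")

# A three-byte UTF-8 sequence has the bit pattern
#   1110xxxx 10xxxxxx 10xxxxxx
# where the x bits, joined together, form the Unicode code point.
assert b0 >> 4 == 0b1110    # lead byte announces a three-byte sequence
assert b1 >> 6 == 0b10      # continuation byte
assert b2 >> 6 == 0b10      # continuation byte

# Strip the marker bits and join the remaining payload bits.
code_point = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
char = chr(code_point)

print(f"U+{code_point:04X} {char}")
```

In practice a library call such as bytes.decode("utf-8") does this work. The manual version only shows that the byte values follow a rule rather than being arbitrary.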

Why Text Breaks Under The Wrong Encoding

Text can break when software decodes bytes using the wrong encoding.

The bytes are still there.

The decoding rule is wrong for those bytes.

Common symptoms include:

  • replacement characters such as � (U+FFFD)
  • wrong-looking characters
  • question marks
  • missing characters
  • unreadable mixed symbols

This broken-looking output is often called mojibake.

The practical lesson is:

To recover text, bytes and encoding must match.
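
The sketch below reproduces a few of these symptoms in Python. The choice of Windows-1252 and ASCII as the mismatched rules is an assumption made for illustration. Real cases depend on which encodings the software involved actually uses.

```python
# The bytes were written as UTF-8: 41 ea b0 80
data = "A가".encode("utf-8")

# Wrong rule, no error: Windows-1252 maps each byte to some character,
# so the result is wrong-looking text.
wrong_chars = data.decode("cp1252")

# Wrong rule, lenient error handling: each byte outside the ASCII range
# becomes the replacement character U+FFFD.
replaced = data.decode("ascii", errors="replace")

# Wrong rule, strict error handling: decoding fails instead of guessing.
try:
    data.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The bytes are intact, so the matching rule still recovers the text.
recovered = data.decode("utf-8")

print(repr(wrong_chars), repr(replaced), strict_failed, recovered)
```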

Source Files Are Text Files

Source files such as .java, .c, .js, and .md are usually text files.

That means their first interpretation layer is character decoding.

After decoding, another tool applies higher-level rules:

File    First Layer                     Next Interpreter
.java   bytes decoded into characters   Java compiler
.c      bytes decoded into characters   C compiler
.js     bytes decoded into characters   JavaScript engine or tooling
.md     bytes decoded into characters   Markdown renderer

This distinction matters because source code is not itself machine-executable.

It is text that programming tools interpret, compile, transform, or render.
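
The two layers can be sketched with Python standing in for the other languages. The one-line program and the name total are hypothetical, and Python's built-in compile() plays the role of the next interpreter here.

```python
# What is stored on disk is bytes.
source_bytes = b"total = 40 + 2\n"

# First layer: character decoding turns the bytes into text.
source_text = source_bytes.decode("utf-8")

# Next interpreter: a language tool applies its own rules to the text.
code = compile(source_text, "<example>", "exec")
namespace = {}
exec(code, namespace)

print(namespace["total"])
```

If the first layer used the wrong encoding, the second layer would receive different characters and might reject or misread the program.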

For Java-specific compilation, see From Java Source Code to Class Files.

For C-specific compilation, see From Source Code to Executable File.

Small Experiment

These commands assume a Unix-like shell such as WSL Ubuntu, and they assume the shell writes the text as UTF-8.

Create a file containing one ASCII character and one Korean character:

printf 'A가' > utf8.txt
xxd -g 1 utf8.txt

The exact output layout may vary, but the important byte values should be:

41 ea b0 80
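
If the shell's encoding is uncertain, a Python sketch of the same experiment names the encoding explicitly. The temporary directory is an assumption made to keep the example self-contained.

```python
import os
import tempfile

# Write the text with an explicit encoding, so the result does not depend
# on shell or locale settings.
path = os.path.join(tempfile.mkdtemp(), "utf8.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("A가")

# Read the file back as raw bytes, without decoding.
with open(path, "rb") as f:
    raw = f.read()

byte_count = len(raw)                    # bytes stored in the file
char_count = len(raw.decode("utf-8"))    # characters after decoding

print(raw.hex(" "))
print(byte_count, char_count)
```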

What To Observe

The visible text has two characters:

A가

The bytes are:

41 ea b0 80

The first byte, 41, represents A in UTF-8, exactly as it does in ASCII.

The next three bytes, ea b0 80, represent 가 under UTF-8.

The text editor shows two characters because it decodes the four bytes as UTF-8.

What This Proves

Text is not the same thing as “one byte per visible symbol.”

Text is byte sequences interpreted by an encoding.

ASCII makes common English characters look simple because one byte maps to one character.

UTF-8 keeps that simplicity for ASCII-compatible characters while also supporting many more characters through multi-byte sequences.

What Encoding Explains

This chapter does not teach the full Unicode standard, every encoding family, normalization, locale behavior, fonts, grapheme clusters, or shell portability.

Those topics matter in real systems.

The representation-layer boundary here is:

A text file is still bytes. It becomes text when decoded through a character encoding.

Encoding Rule To Carry Forward

Bytes become text through an encoding rule.

ASCII is a simple one-byte mapping for common English letters, digits, punctuation, and control characters.

UTF-8 is a variable-length encoding that can use multiple bytes for one visible character.

Source files are usually text files first. Compilers and other tools interpret the decoded text later.

When text appears wrong, ask:

Are these bytes being decoded with the encoding they were written in?

Text As Decoded Bytes

Text is an interpretation layer over bytes.

A file containing source code, Markdown, or ordinary prose still starts as byte contents.

Those bytes become characters through encoding, and only after that can other tools interpret them as source code, markup, configuration, or documentation.