LLMs don’t actually work directly with the words in a text, but with tokens, which can represent whole words, subwords, individual characters, or occasionally multi-word expressions, depending on the tokenizer and on how frequently character sequences occur in the training data. For example, the word “unhappiness” might be broken into the tokens “un”, “happi”, and “ness”.
During both training and inference (interacting with a model through prompts), text is converted into tokens by a preprocessing component called a tokenizer, using a specific tokenization algorithm. It is tokens—not words—that an LLM uses when learning statistical relationships between pieces of training data, and later when generating new text based on probability from those learned relationships.
Example: Human vs. Tokenized View
Let’s look at a very simplified, hypothetical, and non-standard example where we have a tokenizing scheme that breaks words into syllables:
A human may see this sentence as six words:
A cat is a furry animal.
However, an LLM using our hypothetical tokenizer will see 10 tokens (counting the final period as its own token):
A cat is a fur ry an i mal.
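The hypothetical tokenizer above can be sketched in a few lines of code. This is purely a toy illustration — the syllable table, the punctuation handling, and the `toy_tokenize` name are all made up for this example, not a real tokenization algorithm:

```python
# Toy sketch of the hypothetical syllable tokenizer above. A hand-built
# lookup table maps each word to its "syllable" tokens, and trailing
# punctuation is split off as its own token.
SYLLABLES = {
    "A": ["A"], "cat": ["cat"], "is": ["is"], "a": ["a"],
    "furry": ["fur", "ry"], "animal": ["an", "i", "mal"],
}

def toy_tokenize(sentence: str) -> list[str]:
    tokens = []
    for word in sentence.split():
        punct = ""
        if word[-1] in ".,!?":
            word, punct = word[:-1], word[-1]
        tokens.extend(SYLLABLES.get(word, [word]))
        if punct:
            tokens.append(punct)
    return tokens

print(toy_tokenize("A cat is a furry animal."))
# ['A', 'cat', 'is', 'a', 'fur', 'ry', 'an', 'i', 'mal', '.']  -> 10 tokens
```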
In reality, tokenization algorithms such as Byte Pair Encoding (BPE) and WordPiece are more complex: they build a vocabulary by iteratively merging frequently co-occurring characters or subword units into larger tokens. This produces a token vocabulary that balances size against broad linguistic coverage. The example above is only meant to illustrate how different the model’s view of language is from a human’s.
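The BPE merge loop can be sketched minimally. This assumes a tiny made-up corpus of word frequencies; real implementations also record the merge rules for later use, operate on bytes, and scale to enormous corpora:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: word frequencies, each word starting as a character sequence.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):                     # run three merge steps
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair, "->", list(words))
```

Each pass merges the most frequent adjacent pair, so common sequences like “low” quickly become single tokens while rarer endings stay split.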
Why the Token vs. Word Distinction Matters
It’s important to make the distinction between words and tokens for a few reasons:
- When we talk about context windows and maximum context (the amount of data a model can process during a single prompt—e.g., an 8K context window), we are talking about tokens, not characters or words.
- Understanding that LLMs work with tokens and not words helps solidify the concept that LLMs generate text based on statistical patterns rather than comprehension.
- Tokenization is not without complications. It is one of several reasons LLMs have difficulty counting words, counting the occurrences of a letter in a word, and similar character-level tasks.
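The letter-counting difficulty is easy to see once you look at what the model actually receives. In this sketch, the split of “strawberry” and the integer IDs are invented for illustration:

```python
# Hypothetical token split and IDs, made up for illustration only.
vocab = {"straw": 412, "berry": 1793}      # token -> integer ID
tokens = ["straw", "berry"]                # a toy split of "strawberry"
ids = [vocab[t] for t in tokens]
# The model receives only these IDs; the individual letters (and how
# many times "r" appears) are not directly visible in its input.
print(ids)
```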
Why Use Tokenization Instead of Words?
Tokenization has actually been used for decades (before LLMs) and for some of the same reasons:
- Better granularity—in this case, the ability to learn statistical associations between smaller components of language, rather than treating whole words as indivisible units. For example, if you ask an LLM “What is a cat?” the LLM can start to predict “A cat is a fur-” and from there choose completions like “-ry,” “-red,” or “-covered,” depending on how those parts were tokenized during training.
- Higher efficiency by grouping parts of words that are similar, reducing model sizes and computational loads.
- Benefits across languages, translation tasks, or when conceptualizing unknown words. For instance, if a model hasn’t seen the word “de-extinction,” it can still process it by breaking it into known components like “de,” “extinct,” and “-ion.”
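The “de-extinction” idea can be sketched as a greedy longest-match split against a small hypothetical vocabulary. Real subword tokenizers (BPE, WordPiece) work differently and may split the word slightly differently; this only illustrates how an unseen word can be covered by known pieces:

```python
# Greedy longest-match split against a tiny hypothetical vocabulary.
VOCAB = {"de", "-", "extinct", "ion"}

def greedy_subword_split(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        # Try the longest substring starting at position i in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

print(greedy_subword_split("de-extinction"))
# ['de', '-', 'extinct', 'ion']
```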
On the downside, tokenization adds complexity and, as noted earlier, contributes to problem areas for LLMs such as counting words or counting the letters within them.
In Short
LLMs don’t use or understand language the way humans do. They break down language into tokens, learn statistical relationships between those pieces during training, and then break down user input in the same way to generate new text based on probability. This is also why token limits, prompt phrasing, and structure matter—because the model is seeing a version of your input that’s been broken into pieces, not the original surface text.