Tokenizer
Definition
A component that splits text into smaller units called tokens (words, subwords, or characters) that can be processed by a language model.
Tokenizers are the critical first step in any NLP pipeline, converting raw text into a sequence of numerical token IDs that the model can process. Modern tokenizers such as Byte-Pair Encoding (BPE), WordPiece, and SentencePiece operate at the subword level, balancing vocabulary size against the ability to represent any text, including rare words and text in multiple languages. For example, "unhappiness" might be split into ["un", "happiness"] or ["un", "happ", "iness"]. The tokenizer determines how many tokens a piece of text consumes, which directly affects cost (most APIs charge per token) and whether text fits within a model's context window. Different models use different tokenizers, so the same text may tokenize differently across systems.
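The splitting described above can be sketched with a greedy longest-match subword tokenizer (the matching strategy WordPiece-style tokenizers use). The vocabulary below is a toy example invented for illustration, not any real model's vocabulary, and real tokenizers add details (byte fallback, continuation markers like "##") omitted here:

```python
# Toy subword vocabulary mapping pieces to token IDs (illustrative only).
TOY_VOCAB = {"un": 0, "happ": 1, "iness": 2, "happiness": 3}

def tokenize(text, vocab):
    """Greedy longest-match subword tokenization: at each position,
    consume the longest vocabulary piece that matches."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest piece first
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append("[UNK]")  # no piece matched this character
            i += 1
    return tokens

pieces = tokenize("unhappiness", TOY_VOCAB)
print(pieces)                              # ['un', 'happiness']
print([TOY_VOCAB[p] for p in pieces])      # [0, 3] -- the IDs the model sees
```

Because "happiness" is in this toy vocabulary, the greedy match prefers it over "happ" + "iness"; a model whose vocabulary lacked the full word would fall back to the smaller pieces, which is why the same string can consume a different number of tokens under different tokenizers.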
Related Terms
Large Language Model
A neural network with billions of parameters trained on massive text datasets, capable of understand...
Embedding
A learned dense vector representation that maps discrete data like words, tokens, or items into cont...
Context Window
The maximum number of tokens (input plus output) that a language model can process in a single inter...
Token
The basic unit of text that language models process, typically representing a word, subword, or char...