
Tokenizer

Definition

A component that splits text into smaller units called tokens (words, subwords, or characters) that can be processed by a language model.

Tokenizers are the critical first step in any NLP pipeline, converting raw text into a sequence of numerical token IDs that the model can process. Modern tokenizers like Byte-Pair Encoding (BPE), WordPiece, and SentencePiece operate at the subword level, balancing vocabulary size with the ability to represent any text including rare words and multiple languages. For example, "unhappiness" might be split into ["un", "happiness"] or ["un", "happ", "iness"]. The tokenizer determines how many tokens a piece of text consumes, directly affecting cost (most APIs charge per token) and whether text fits within a model's context window. Different models use different tokenizers, so the same text may tokenize differently across systems.
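The "unhappiness" split above can be illustrated with a toy greedy longest-match subword tokenizer. This is a sketch, not any real model's tokenizer: the vocabulary below is hypothetical, chosen to reproduce the example, whereas production tokenizers like BPE or WordPiece learn their vocabularies from training data.

```python
def tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking until a match.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a single character for out-of-vocabulary pieces.
            tokens.append(word[i])
            i += 1
    return tokens

# Hypothetical vocabulary for illustration only.
vocab = {"un", "happ", "iness", "happiness"}
print(tokenize("unhappiness", vocab))  # ['un', 'happiness']
```

With a different vocabulary (say, one lacking "happiness" as a whole piece), the same word would split as ["un", "happ", "iness"], which mirrors why the same text tokenizes differently across models.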