Tokens
The smallest units, or chunks, of text that a model processes
What is a token?
In the context of large language models (LLMs), a token is the smallest unit, or chunk, of text that a model processes. LLMs use tokens to process and generate language. A token can be as short as a single character, as long as a whole word, or an even larger chunk of text such as a phrase, depending on the model and its configuration.
Tokens serve as a bridge between human language and a representation that AI models can work with. Many modern language models, such as the GPT models, are token-based. Each model is designed to handle a limited number of tokens at a time, a limit often called the context window.
Each input provided to the model is broken down into tokens, which the model analyzes in order to produce a response. The response is built the same way: the model generates one token at a time, each one based on the tokens that came before it.
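The two steps described here, breaking input into tokens and generating a response one token at a time, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the splitting rule, the function names (tokenize, build_vocab, make_toy_model, generate), and the canned reply are all hypothetical, and a real LLM uses a trained tokenizer and a neural network in place of the toy model.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens (toy rule, not a real LLM tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    """Give each distinct token an integer ID, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def make_toy_model(prompt_len, reply, eos="<eos>"):
    """Hypothetical stand-in for a model: returns a canned reply, one token per call."""
    def next_token(context):
        # A real model would compute this from the whole context;
        # here the context length only tells us how far along the reply we are.
        i = len(context) - prompt_len
        return reply[i] if i < len(reply) else eos
    return next_token

def generate(prompt_tokens, next_token_fn, max_new_tokens=10, eos="<eos>"):
    """Generate tokens one at a time, feeding each new token back into the context."""
    context = list(prompt_tokens)
    output = []
    for _ in range(max_new_tokens):
        tok = next_token_fn(context)
        if tok == eos:
            break
        output.append(tok)
        context.append(tok)  # the next step sees everything generated so far
    return output

prompt = tokenize("What is a token?")
ids = [build_vocab(prompt)[t] for t in prompt]
model = make_toy_model(len(prompt), ["A", "chunk", "of", "text", "."])
print(prompt, ids)
print(generate(prompt, model))
```

The point of the loop in generate is that every new token is appended to the context before the next one is chosen, which is why the response depends on all the preceding tokens and not only the most recent one.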
Types of tokens:
Here are some types of tokens used in large language models:
- Word Tokens: These represent individual words or phrases in the text, like "house."
- Sub-word Tokens: Words can be divided into smaller sub-word units. For instance, "speaking" can be segmented into "speak" and "ing."
- Punctuation Tokens: Tokens that signify various punctuation marks, such as commas (","), periods ("."), and others.
- Special Tokens: Unique symbols like "[CLS]" (classification token), "[SEP]" (separator token), or "[MASK]" (mask token) have specific roles within the model.
- Number Tokens: Numbers that appear in text are converted into tokens of their own. For example, "10" might be represented as a single number token or, depending on the tokenizer, split into separate digit tokens.