Key Takeaways
- What is a transformer? A neural network architecture that processes entire sequences simultaneously using "attention" — letting each word consider its relationship to every other word at once.
- The key innovation: Attention replaced sequential processing, enabling much faster training and far better handling of long-range dependencies in text.
- GPT and Claude: Both are decoder-only transformers trained to predict the next token. The "language" they speak is probability distributions over a vocabulary.
- Why it matters: Understanding the architecture explains the capabilities and limitations — including why hallucination happens and why context window size matters.
Transformers are the architecture that made modern AI possible. Every major language model you have used — GPT-4, Claude, Gemini, LLaMA, Mistral — is a transformer. Understanding what a transformer is, even at a conceptual level, gives you a fundamentally better model of what these systems can and cannot do.
You do not need linear algebra or calculus to understand transformers at a useful level. You need a few core concepts. This guide covers those concepts clearly, without shortcuts that sacrifice accuracy for simplicity.
Why Transformers Matter
The transformer, introduced in the 2017 paper "Attention Is All You Need" by Google researchers, is arguably the most consequential innovation in the history of AI development. It replaced the dominant architectures of the day, enabled models to be scaled to previously impossible sizes, and directly led to the current generation of AI capabilities.
Before transformers, the dominant approach in natural language processing was the recurrent neural network (RNN). After transformers, essentially all significant progress in language AI moved to transformer-based architectures. The transition was that decisive.
Before Transformers: RNNs and Their Limits
Recurrent Neural Networks (RNNs) processed text sequentially — one word at a time, left to right — maintaining a "hidden state" that carried information from previous words forward. This design had a fundamental problem: information from early in a sequence had to travel through every subsequent step to reach the end. Over long sequences, it got diluted and lost.
Think of it like playing a long game of telephone. By the time the original message reaches the twentieth person, the details have changed. RNNs had the same problem with long texts — context from early in a document would be largely forgotten by the time the model reached the end.
RNNs also could not be parallelized easily. To process word 10, you had to finish processing words 1-9 first. This made training on modern GPUs (which excel at parallel computation) inefficient. Training large models was slow and expensive.
The Key Insight: Attention
The key insight of the transformer is attention: instead of processing text sequentially and passing information through a bottleneck, let every position in the sequence look directly at every other position simultaneously.
With attention, every word can directly "attend to" any other word in the sequence regardless of distance. The word at position 500 can directly access information from the word at position 2 without it having to travel through 498 intermediate steps. Long-range dependencies — where the meaning of a word depends on something said far earlier — become much easier to capture.
And because all positions are processed simultaneously (rather than one at a time), transformers parallelize extremely well on GPU hardware. Training large models becomes dramatically faster.
How Attention Works in Plain Terms
Attention computes, for each token in the sequence, a weighted average of all other tokens — where the weights represent "how relevant is this other token to understanding me."
Consider the sentence: "The river bank was flooded after the storm." When the model processes "bank," attention asks: which other words are most relevant to understanding what this "bank" means? The word "river" should receive high attention weight, pulling the representation of "bank" toward riverbank rather than financial institution.
This happens via three learned vectors for each token: Query (what am I looking for?), Key (what do I contain?), and Value (what do I contribute?). The attention score between two tokens is computed as the dot product of one token's Query with another token's Key. High scores mean high relevance. The scores are normalized (via a softmax) and used to compute a weighted sum of Value vectors, producing a rich representation of each token that incorporates context from the entire sequence.
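The Query/Key/Value computation can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention, not production code: the learned projection matrices that produce Q, K, and V from token embeddings are omitted, and the input is random toy data.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q, K, V: (seq_len, d) arrays of query, key, and value vectors.
    Returns a (seq_len, d) array where each row is a weighted
    average of the value vectors, weighted by relevance.
    """
    d = Q.shape[-1]
    # Relevance of every token pair: dot product of Query and Key,
    # scaled by sqrt(d) to keep scores in a stable range.
    scores = Q @ K.T / np.sqrt(d)
    # Softmax turns each row of scores into weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each token's output is a weighted sum of all Value vectors.
    return weights @ V

# Tiny example: 3 tokens, 4-dimensional vectors.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = attention(x, x, x)  # self-attention: Q, K, V from the same tokens
print(out.shape)          # (3, 4)
```

Note that every row of the output depends on every input token at once; nothing is passed step by step, which is exactly the property that fixes the RNN bottleneck.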
Multi-head attention runs this process multiple times in parallel with different learned projections — allowing different "heads" to capture different types of relationships simultaneously. Some heads might attend to syntactic relationships, others to semantic relationships, others to positional patterns.
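The "multiple heads in parallel" idea can be sketched by splitting the model dimension into independent slices. As before, this is a simplified illustration: real models apply separate learned Q/K/V projections per head rather than just slicing the input.

```python
import numpy as np

def multi_head_attention(x, num_heads):
    """Simplified multi-head self-attention (per-head learned
    projections omitted; each head just gets a slice of x)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Split the model dimension into num_heads independent heads.
    heads = x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    outputs = []
    for h in heads:  # each head attends independently, in parallel
        scores = h @ h.T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        outputs.append(w @ h)
    # Concatenate head outputs back into the full model dimension.
    return np.concatenate(outputs, axis=-1)

x = np.random.default_rng(1).normal(size=(5, 8))
y = multi_head_attention(x, num_heads=2)
print(y.shape)  # (5, 8): same shape in, same shape out
```

Because each head works on its own slice with its own attention weights, different heads are free to specialize in different relationship types.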
The Full Transformer Architecture
The original transformer consists of an encoder stack and a decoder stack, each made of multiple layers. Each layer contains a multi-head attention block followed by a feed-forward neural network, with layer normalization and residual connections throughout.
- Encoder: Reads the input sequence and builds rich contextual representations of each token. Used in models like BERT for understanding tasks.
- Decoder: Generates output tokens one at a time, attending to both previous output tokens and the encoder's representations. Used in sequence-to-sequence models like translation systems.
- Decoder-only: Many modern LLMs (GPT, Claude, LLaMA) drop the encoder entirely and use only the decoder, trained to generate the next token given previous tokens. This is called "causal" or "autoregressive" language modeling.
- Feed-forward layers: After attention computes relationships, feed-forward layers process each token's updated representation independently, adding capacity for complex transformations.
- Residual connections: Each sub-layer adds its output to its input ("skip connection"), helping gradients flow during training and enabling very deep networks.
- Layer normalization: Stabilizes training by normalizing activations at each layer.
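The "causal" constraint in decoder-only models is implemented with a simple mask added to the attention scores before the softmax. A minimal sketch:

```python
import numpy as np

# Causal mask for a decoder-only model: position i may attend only to
# positions 0..i. Future positions get -inf added to their attention
# scores, so after the softmax they receive exactly zero weight.
seq_len = 4
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
print(mask)
# Row 0 can "see" only token 0; row 3 can see tokens 0 through 3.
```

Encoder models like BERT simply omit this mask, which is what lets them read bidirectionally.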
Tokens: How Text Becomes Numbers
Transformers do not process words or characters directly — they process tokens, which are sub-word units produced by a tokenizer. A tokenizer splits text into chunks that balance vocabulary size, coverage, and efficiency.
Common words become single tokens. Rare words are split into multiple tokens. "Unbelievable" might be three tokens: "un", "believ", "able". Numbers and code follow their own tokenization patterns. Each token is mapped to a vector (embedding) that represents its meaning — these embeddings are learned during training and encode semantic relationships, so that similar tokens have similar vectors in the high-dimensional embedding space.
A transformer model's vocabulary is typically 50,000-100,000+ tokens. The model's task is, at each step, to predict which of these tokens comes next — and the output is a probability distribution over the entire vocabulary.
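The final step — turning the model's raw outputs into that probability distribution — is a softmax over the vocabulary. A toy sketch with a hypothetical 5-token vocabulary and made-up logit values:

```python
import numpy as np

# The model's final layer produces one raw score ("logit") per
# vocabulary entry; softmax converts logits into probabilities.
# Toy 5-token vocabulary and hypothetical logits for illustration.
vocab = ["the", "river", "bank", "flood", "storm"]
logits = np.array([1.0, 3.5, 4.0, 0.5, 2.0])

probs = np.exp(logits - logits.max())  # subtract max for stability
probs /= probs.sum()

best = vocab[int(np.argmax(probs))]
print(best)                   # most likely next token: "bank"
print(round(probs.sum(), 6))  # probabilities sum to 1.0
```

Generation is then a loop: sample (or pick) a token from this distribution, append it to the input, and predict again.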
How Transformers Are Trained
Language model transformers are pre-trained on a simple task: given preceding text, predict the next token. This is called autoregressive language modeling. Trained on trillions of tokens of text from the internet, books, and code, the model learns to predict next tokens so well that it internalizes vast amounts of world knowledge, linguistic patterns, and reasoning structure.
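The training objective itself is compact: penalize the model by the negative log-probability it assigned to each true next token. A miniature sketch with a toy sequence and hypothetical model probabilities:

```python
import numpy as np

# Autoregressive training in miniature: for a sequence t0 t1 t2 t3,
# the model is scored on predicting t1 from t0, t2 from t0 t1, and
# t3 from t0 t1 t2. The loss is the negative log-probability the
# model assigned to each true next token (cross-entropy).
token_ids = [7, 2, 9, 4]            # a toy tokenized sequence
# Hypothetical probabilities the model gave the true next tokens:
p_true = np.array([0.5, 0.1, 0.8])  # P(t1|t0), P(t2|t0,t1), P(t3|t0..t2)

loss = -np.log(p_true).mean()       # average cross-entropy
print(round(float(loss), 3))        # 1.073
```

Lowering this loss across trillions of tokens is the entire pre-training recipe; everything the model "knows" falls out of getting better at this one prediction task.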
After pre-training, models are fine-tuned to be more helpful and safe. This typically involves Reinforcement Learning from Human Feedback (RLHF), where human raters score model outputs and the model is updated to produce outputs that receive higher ratings. This is what transforms a raw language model into a useful assistant.
GPT vs BERT: Decoder vs Encoder
GPT and Claude are decoder-only transformers optimized for generation. BERT is an encoder-only transformer optimized for understanding. The architectural difference drives different strengths.
BERT reads text bidirectionally — it can see words to the left and right simultaneously. This makes it excellent for classification, sentiment analysis, named entity recognition, and other tasks where understanding is the goal. BERT-style models power many search ranking and document classification systems.
GPT-style models generate text causally — each token can only attend to previous tokens, not future ones. This is necessary for generation (you cannot predict what comes next while looking at what comes next), and it enables the autoregressive generation of coherent, contextually rich text. Claude, GPT-4, and LLaMA are all in this category.
Context Windows
The context window is the maximum number of tokens a transformer can process at once. It determines how much text the model can "see" simultaneously.
Early GPT-2 had a 1,024-token context window. GPT-3 expanded to 2,048. Claude's 1M-token context window represents a roughly 1,000x increase over those early models. This expansion has been driven by architectural improvements (Flash Attention, ring attention), hardware advances, and training innovations.
A larger context window is not just about longer documents. It fundamentally changes what applications are possible. With a 1M-token context, a model can process an entire codebase, a year of meeting notes, or every document in a legal case at once — enabling qualitatively different applications than a 4,096-token model could support.