Transformers use self-attention to let every token in a sequence selectively gather information from other tokens. A key mental bridge is the progression from similarity scores (how much two tokens relate) to contextualized representations produced by self-attention. This article explains that journey in a neutral, step-by-step way, with emphasis on how softmax weights become the mechanism that decides what each token “pays attention to.”
1) From similarity scores to attention intent
At an early stage, the model produces similarity signals between tokens. In transformer terminology, each token yields three learned representations: Query (Q), Key (K), and Value (V). The attention mechanism starts by comparing Queries against Keys, typically via dot products scaled by the square root of the key dimension. The result is a matrix of scores that indicates how relevant each token is to every other token.
These raw scores can be interpreted as “attention intent.” For example, if the token “Let’s” is much more compatible with itself than with the token “go,” then its attention intent should strongly favor itself.
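The score computation can be sketched in a few lines of NumPy. The embeddings and projection matrices below are random stand-ins for illustration, not values from a real model:

```python
import numpy as np

# Toy embeddings for the two tokens "Let's" and "go" (illustrative values).
X = np.array([[1.0, 0.0, 1.0, 0.0],   # "Let's"
              [0.0, 1.0, 0.0, 1.0]])  # "go"

rng = np.random.default_rng(0)
d_k = 4
# Learned Query and Key projections (random stand-ins here).
W_q = rng.normal(size=(4, d_k))
W_k = rng.normal(size=(4, d_k))

Q = X @ W_q
K = X @ W_k

# Raw similarity scores: one row per query token, one column per key token,
# scaled by sqrt(d_k) as in standard scaled dot-product attention.
scores = Q @ K.T / np.sqrt(d_k)
print(scores.shape)  # (2, 2)
```

Each entry `scores[i, j]` is the raw “attention intent” of token `i` toward token `j`, before any normalization.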
2) Softmax converts raw scores into weights
Raw similarity scores are not yet directly usable as contribution rules. They must be converted into a probability distribution so that contributions are normalized and stable. This is where the softmax function is used.
Softmax takes a set of scores and outputs weights that sum to 1. Conceptually, each weight answers: “How much should the model contribute information from this token to the current token’s representation?”
In the illustrative case where “Let’s” is far more similar to itself than to “go,” softmax produces a distribution where:
- “Let’s” gets a weight close to 1 (near 100%)
- “go” gets a weight close to 0 (near 0%)
As a result, the model effectively decides that most of the enriched representation for “Let’s” should come from “Let’s” itself, while “go” contributes very little in that specific context.
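This normalization step can be made concrete with a small sketch. The raw scores here are made-up numbers chosen so that “Let’s” is far more similar to itself than to “go”:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Illustrative raw scores for the "Let's" query: strong match with itself,
# weak match with "go".
scores = np.array([6.0, 0.5])
weights = softmax(scores)

# The weights sum to 1; "Let's" gets a weight near 1, "go" near 0.
print(weights)
```

The output distribution is the mixing rule: it says almost all of the enriched representation for “Let’s” should come from “Let’s” itself.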
3) Why dynamic attention replaces static similarity
A major limitation of using only static similarity (for example, cosine similarity between fixed embeddings) is that it is context-independent. The similarity between two tokens does not change based on surrounding tokens. Self-attention removes that limitation by recomputing attention weights per sequence and per token position, driven by the current Q and K produced for that specific input.
This dynamic behavior is the core shift from “similarity scores” to “self-attention.” Similarity becomes a mechanism that is repeatedly recalculated, normalized, and then used to mix information.
4) Creating Value representations (V)
Once the softmax weights are known for a given token, the model must determine what information is available to mix. This comes from Value (V) vectors.
Each token has its own V representation created by a learned linear transformation. Importantly, the value vectors contain the information content that will be aggregated. Using the earlier example:
- A value vector for the token “Let’s” is generated (two components in this running example, though the exact dimensionality depends on the model).
- A value vector for the token “go” is generated as well.
These values are not yet combined. They are scaled by the attention weights derived from softmax.
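A minimal sketch of the value projection, again with random stand-ins for the learned weights:

```python
import numpy as np

# Toy embeddings for "Let's" and "go" (illustrative values).
X = np.array([[1.0, 0.0, 1.0, 0.0],   # "Let's"
              [0.0, 1.0, 0.0, 1.0]])  # "go"

rng = np.random.default_rng(1)
# Learned value projection (random stand-in); here each value vector
# has two components, matching the running example.
W_v = rng.normal(size=(4, 2))

V = X @ W_v  # one value vector per token
print(V.shape)  # (2, 2)
```

Row 0 of `V` is the value vector for “Let’s” and row 1 is the value vector for “go”; these are the information carriers that the attention weights will mix.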
5) Scaling values by attention weights
The attention weight distribution decides how much each token’s value contributes. If “Let’s” receives weight near 1, then its value vector is scaled by approximately 1. If “go” receives weight near 0, its value vector is scaled by approximately 0.
In the simplified intuition:
- Scaled value(“Let’s”) ≈ value(“Let’s”)
- Scaled value(“go”) ≈ 0
This scaling is the practical meaning of “attention.” It turns relevance scores into an actual mixing coefficient.
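The scaling step itself is a simple element-wise multiplication. The weights and value vectors below are illustrative numbers, not model outputs:

```python
import numpy as np

# Attention weights for the "Let's" query (near 1 for itself, near 0 for "go").
weights = np.array([0.99, 0.01])

# Illustrative value vectors.
V = np.array([[2.0, -1.0],   # value("Let's")
              [5.0,  3.0]])  # value("go")

# Scale each token's value vector by its attention weight.
scaled = weights[:, None] * V
# scaled[0] is close to value("Let's"); scaled[1] is close to zero.
print(scaled)
```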
6) Combining scaled values to form the self-attention output
After scaling, the model sums all of the scaled value vectors, i.e., it computes a weighted sum. The result is a new representation for the token of interest (for example, a context-enriched representation for “Let’s”).
This final vector is often described as the token’s self-attention output or the self-attention values after weighted aggregation. The output encodes information collected from the entire sequence, but in a way that is weighted by token-to-token relevance.
In other words, the model constructs the representation for each token by summing contributions from every token, where the contributions are determined by the softmax-normalized attention scores.
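The weighted sum can be written as a single matrix-vector product. Using the same illustrative weights and value vectors as before:

```python
import numpy as np

weights = np.array([0.99, 0.01])   # attention weights for the "Let's" query
V = np.array([[2.0, -1.0],         # value("Let's")
              [5.0,  3.0]])        # value("go")

# Self-attention output for "Let's": the weighted sum of all value vectors.
output = weights @ V  # equivalent to (weights[:, None] * V).sum(axis=0)
print(output)
```

Because the weight for “Let’s” is near 1, the output stays close to value(“Let’s”) while still absorbing a small contribution from “go.”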
7) Repeating attention across tokens
The mechanism described above is repeated for each token position. After producing the self-attention output for “Let’s,” the same steps are applied to produce the output for “go.” Depending on the query-key relationships for that token, the attention weights can shift, allowing different tokens to emphasize different parts of the sequence.
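In practice, this per-token repetition is done for all positions at once in matrix form: softmax is applied row-wise to the score matrix, giving one weight distribution per query token. A minimal sketch, with random stand-ins for the learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    # Row-wise softmax with max-subtraction for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
X = rng.normal(size=(2, 4))  # embeddings for "Let's" and "go" (stand-ins)
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
weights = softmax(Q @ K.T / np.sqrt(4))  # one row of weights per token
outputs = weights @ V                    # one output vector per token
```

Row `h` of `weights` sums to 1, and row `h` of `outputs` is the self-attention output for token `h`; the two rows can weight the sequence very differently.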
8) What this implies for full transformers (including multi-head attention)
Production transformer models typically use multi-head attention. Each head independently learns its own Q, K, V projections and therefore its own attention weight patterns. One head might emphasize syntactic relations, another might emphasize semantic similarity, and another might capture positional or structural effects. The heads are then combined to produce richer representations than a single attention pass.
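A minimal multi-head sketch, assuming the common convention that each head works in a slice of the model dimension and the concatenated heads pass through a final output projection (all matrices below are random stand-ins):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for one head.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(2)
seq_len, d_model, n_heads = 2, 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings

heads = []
for h in range(n_heads):
    # Each head learns its own Q, K, V projections (random stand-ins here),
    # so each head produces its own attention weight pattern.
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

# Concatenate the per-head outputs and apply a final output projection.
W_o = rng.normal(size=(d_model, d_model))
out = np.concatenate(heads, axis=-1) @ W_o
print(out.shape)  # (2, 8)
```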
Key takeaway
Self-attention can be understood as a pipeline: compute similarity (Q vs K), normalize with softmax to obtain attention weights, scale Value vectors (V) by those weights, and sum the results to produce a context-aware representation. This is how similarity scores become the operational mechanism of self-attention, enabling transformers to tailor information mixing to each token and to the specific sentence context.
