
3 Large Language Models #


We now turn to the transformer architecture that revolutionized NLP and forms the basis of all LLMs that are of relevance today.

3.1 Transformers: overall structure #

Let us start with the original transformer architecture as introduced in 2017 by Vaswani et al.

In their paper “Attention is all you need”, the Google group proposed a setup with an overall two-column structure and internal structures that are partially repeated both within and between the columns:

Figure 3.1: Transformer architecture (Vaswani et al., Google, 2017) with encoder (left) and decoder (right).

The “Nx” next to the grey boxes indicates that these are repeated; taken together, the grey parts are called “encoder” (left column) and “decoder” (right column), respectively. Before we look at the internal structures, we will focus on the general structure:

Figure 3.2: Simplified view on transformer architecture (encoder-decoder).

Here, we have neglected the structures labelled as “Positional Encoding”; they correspond to the so-called “Absolute Positional Encoding (APE)”, which is no longer used; instead, modern LLMs use “Rotary Positional Encoding (RoPE)”, to be discussed later.

The original application of the Vaswani paper was translation between English and German or French. In such applications, input (“prompt”) and output (“completion”) are of quite different character (here: English and German words, respectively).

The input touches only the encoder; the target language is processed only in the decoder. Translation is performed on the embedding level, using attention (which is “masked” only on the decoder side), to be discussed later.


Figure 3.3: Application of the encoder-decoder transformer to a translation task English (EN) to German (DE).

One advantage of the encoder-decoder architecture is that sections of the input text can already be processed (i.e. translated) before the rest has been read. Such an algorithm can work even when the input text is (much) longer than the context length L of the model (essentially its memory).

Similar situations arise, e.g., for

  1. Coding (to some degree)
  2. Summarization
  3. Pure question / answer tasks.

However, many tasks involve input and output of the same type, as, e.g. in 4. Text continuation and 5. Generic chat.

In the 8 years since the publication of the original transformer paper, language models have become much larger, both in parameter count and in context size; context lengths have increased from about 1,000 to 100,000 - 1 million tokens.

All current (general purpose) LLMs are decoder-only: OpenAI GPT-x, Meta Llama, Anthropic Claude, DeepSeek R1, Google Gemini, … Therefore, we will, from now on, focus on decoder-only transformers.

Figure 3.4: Simplified view on transformer architectures: encoder-decoder (left) vs. decoder-only (right).

In a first step, we treat the decoder as a “black box” and discuss the broad-scale structure. All tokens of the input text are first transformed into embeddings by the “Input Embedding” block. The decoder then processes the embedding vectors in some way (to be discussed later). Finally, the embedding vectors are transformed back into the token vector space by the “Linear + Softmax” block.

Note that, on this overall level, the decoder takes the function of the hidden layer in the Word2vec method (in the skip-gram or CBoW variants, cf. section 2.3), replicated L times, once for each token in the context window:


Figure 3.5: Decoder-only transformer vs. Word2vec algorithm.

3.2 Next-token prediction and perplexity #

In general, a (decoder) LLM takes n tokens as input and yields n tokens as output, where the output is shifted one position to the left, i.e. it misses the first input token and, instead, yields the next token in the top-right position. This is shown here for the input “The dog barks”:

Figure 3.6: Next-word prediction of an LLM

Here and in the following, we assume, for simplicity, a context length of only \(L=3\) and also assume that all words used in our examples are represented as single tokens (which is almost true).

Let us now look at the data flow in more detail: Each input word/token is represented by a one-hot-encoded token vector (where, without loss of generality, we assume that the first component encodes “The”, the second “_dog”, the third “_barks”, and the last components encode “_loudly” and “_again”, respectively; in all cases, “_” stands for space).

Figure 3.7: Flow of data through a decoder-only LLM (assuming embedding dimension d=3 and omitting the final embedding layer)

The overall data flow through transformers is organized in “lanes”; each of these starts with an input token (represented by a definite token vector, i.e. a token vector with exactly one non-zero component) and ends with another token vector (which is, in general, not definite, i.e. may contain several non-zero components). By training, each lane tries to predict the input token of the next lane (for lanes 1 to L-1); the last lane predicts the next token. Obviously, these lanes must be coupled within the decoder; the underlying self-attention mechanism will be discussed later.

In our example, the model correctly predicts that none of the tokens “The”, “_dog”, or “_barks” can appear as a fourth token in the sentence; therefore, the first three components of the output token vector of the last lane are zero. According to the model, the next token is “_loudly” with probability 0.3 and “_again” with probability 0.6; the remaining probability of 0.1 must represent other tokens of the vocabulary that are not explicitly shown:


Figure 3.8: Probabilities for the next token

In “greedy” prediction, an LLM would always choose the most probable next token. This corresponds to applying a hard maximum (argmax) instead of the softmax to the next-token logits.

Note that information about the tokens considered in a given step and their probability distribution can be obtained via APIs (in the case of OpenAI/ChatGPT, for up to the 20 most probable tokens).

Here is a specific example using an OpenAI endpoint (which requires an OpenAI API account); logprobs are just the logarithms of the token probabilities, i.e. prob = exp(logprob):

Accessing logprobs and top_logprobs of GPT-4o via curl on the command line
rza025@MacBook-Pro-NB-2 Work % curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Please answer with only one word, yes or no, to the following question: Is a tree an animal?"
      }
    ],
    "temperature": 0,
    "logprobs": true,
    "top_logprobs": 10
  }'
{
  "id": "chatcmpl-CkvGGiXyc62C8UEwIx7LTbnmTOeRQ",
  "object": "chat.completion",
  "created": 1765299236,
  "model": "gpt-4o-2024-08-06",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "No.",
        "refusal": null,
        "annotations": []
      },
      "logprobs": {
        "content": [
          {
            "token": "No",
            "logprob": 0.0,
            "bytes": [
              78,
              111
            ],
            "top_logprobs": [
              {
                "token": "No",
                "logprob": 0.0,
                "bytes": [
                  78,
                  111
                ]
              },
              {
                "token": "no",
                "logprob": -19.5,
                "bytes": [
                  110,
                  111
                ]
              },
              {
                "token": " No",
                "logprob": -19.875,
                "bytes": [
                  32,
                  78,
                  111
                ]
              },
              {
                "token": "Yes",
                "logprob": -23.125,
                "bytes": [
                  89,
                  101,
                  115
                ]
              },
              {
                "token": "-",
                "logprob": -23.375,
                "bytes": [
                  45
                ]
              },
              {
                "token": "_No",
                "logprob": -23.625,
                "bytes": [
                  95,
                  78,
                  111
                ]
              },
              {
                "token": "NO",
                "logprob": -23.875,
                "bytes": [
                  78,
                  79
                ]
              },
              {
                "token": "\"No",
                "logprob": -23.875,
                "bytes": [
                  34,
                  78,
                  111
                ]
              },
              {
                "token": "-No",
                "logprob": -24.0,
                "bytes": [
                  45,
                  78,
                  111
                ]
              },
              {
                "token": ".No",
                "logprob": -24.125,
                "bytes": [
                  46,
                  78,
                  111
                ]
              }
            ]
          },
          {
            "token": ".",
            "logprob": -0.0486002042889595,
            "bytes": [
              46
            ],
            "top_logprobs": [
              {
                "token": ".",
                "logprob": -0.0486002042889595,
                "bytes": [
                  46
                ]
              },
              {
                "token": "<|end|>",
                "logprob": -3.048600196838379,
                "bytes": null
              },
              {
                "token": "<|end|>",
                "logprob": -11.923600196838379,
                "bytes": null
              },
              {
                "token": "。",
                "logprob": -12.423600196838379,
                "bytes": [
                  227,
                  128,
                  130
                ]
              },
              {
                "token": "।",
                "logprob": -14.298600196838379,
                "bytes": [
                  224,
                  165,
                  164
                ]
              },
              {
                "token": ".\n",
                "logprob": -14.548600196838379,
                "bytes": [
                  46,
                  10
                ]
              },
              {
                "token": ".\n\n",
                "logprob": -15.173600196838379,
                "bytes": [
                  46,
                  10,
                  10
                ]
              },
              {
                "token": "<|end|>",
                "logprob": -16.048601150512695,
                "bytes": null
              },
              {
                "token": "!",
                "logprob": -16.423601150512695,
                "bytes": [
                  33
                ]
              },
              {
                "token": "۔",
                "logprob": -16.423601150512695,
                "bytes": [
                  219,
                  148
                ]
              }
            ]
          }
        ],
        "refusal": null
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 39,
    "completion_tokens": 2,
    "total_tokens": 41,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "service_tier": "default",
  "system_fingerprint": "fp_83554c687e"
}

On the GWDG platform, anybody with a German university account can also create API keys using the self-service at https://saia.gwdg.de/dashboard

Using such an API account, similar investigations can be performed for any of the hosted models, probably in a much less restricted way (not tested yet):

Accessing logprobs and top_logprobs of Llama 3.1 8b via curl on the command line (line breaks added)
rza025@MacBook-Pro-NB-2 Work % curl -i -X POST \
  --url https://chat-ai.academiccloud.de/v1/completions \
  --header 'Accept: application/json' \
  --header "Authorization: Bearer $GWDG_API_KEY" \
  --header 'Content-Type: application/json'\
  --data '{
  "model": "meta-llama-3.1-8b-instruct",
  "prompt": "San Francisco is a",
  "max_tokens": 20,
  "temperature": 0, "logprobs": true, "top_logprobs": 10
}'
HTTP/2 200 
content-type: application/json
x-ratelimit-limit-hour: 200
x-ratelimit-limit-day: 1000
x-ratelimit-limit-month: 3000
x-ratelimit-remaining-minute: 29
x-ratelimit-remaining-hour: 197
x-ratelimit-remaining-day: 997
x-ratelimit-remaining-month: 2997
ratelimit-reset: 52
ratelimit-remaining: 29
ratelimit-limit: 30
x-ratelimit-limit-minute: 30
date: Tue, 09 Dec 2025 17:32:08 GMT
server: uvicorn
x-kong-upstream-latency: 519
x-kong-proxy-latency: 0
via: kong/3.6.1
x-kong-request-id: 0624fe7a1831765644f24277d15f8758
access-control-allow-origin: *

{"id":"cmpl-30c845c3364e49ba9228a7bb8d4ad570","object":"text_completion","created":1765301528,"model":"meta-llama-3.1-8b-instruct",
"choices":[{"index":0,"text":" top tourist destination, and for good reason. The city is known for its iconic landmarks, vibrant neighborhoods",
"logprobs":{"text_offset":[0,4,12,24,25,29,33,38,45,46,50,55,58,64,68,72,79,89,90,98],
"token_logprobs":[-0.5904034376144409,-0.30229291319847107,-9.131014667218551e-05,-0.5512350797653198,-0.7526198029518127,-0.35508328676223755,-0.007136213127523661,-0.003330044448375702,-0.18498247861862183,-0.5632458925247192,
-0.37957966327667236,-0.9599982500076294,-0.6791214942932129,-0.0011485177092254162,-0.0014234182890504599,-0.6429902911186218,-0.6409937739372253,-0.37631481885910034,-0.41895774006843567,-0.862339973449707],
"tokens":[" top"," tourist"," destination",","," and"," for"," good"," reason","."," The"," city"," is"," known"," for"," its"," iconic"," landmarks",","," vibrant"," neighborhoods"],
"top_logprobs":[{" top":-0.5904034376144409},{" tourist":-0.30229291319847107},{" destination":-9.131014667218551e-05},{",":-0.5512350797653198},{" and":-0.7526198029518127},{" for":-0.35508328676223755},
{" good":-0.007136213127523661},{" reason":-0.003330044448375702},{".":-0.18498247861862183},{" The":-0.5632458925247192},{" city":-0.37957966327667236},{" is":-0.9599982500076294},{" known":-0.6791214942932129},
{" for":-0.0011485177092254162},{" its":-0.0014234182890504599},{" iconic":-0.6429902911186218},{" landmarks":-0.6409937739372253},{",":-0.37631481885910034},{" vibrant":-0.41895774006843567},{" neighborhoods":-0.862339973449707}]},
"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"service_tier":null,"system_fingerprint":null,
"usage":{"prompt_tokens":5,"total_tokens":25,"completion_tokens":20,"prompt_tokens_details":null},"kv_transfer_params":null}%     
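The logprobs contained in such a response can be converted into readable token/probability pairs with a few lines of Python. The following minimal sketch assumes that the JSON body shown above has been saved to a file named response.json (a hypothetical name, e.g. obtained by piping the curl output into a file):

```python
import json
import math

# Read the completions response shown above (the file name is an assumption)
with open("response.json") as f:
    response = json.load(f)

# Pair each generated token with its probability: prob = exp(logprob)
logprobs = response["choices"][0]["logprobs"]
for token, logprob in zip(logprobs["tokens"], logprobs["token_logprobs"]):
    print(f"{token!r:>18}  p = {math.exp(logprob):.3f}")
```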

For regular use, such raw output obviously requires further processing in order to be readily digested. Given API access, interfaces can be built with a large range of frameworks. Here is an implementation using Flask (generated by Gemini 3) that shows the probability for each generated token:


Figure 3.8a: Exploration platform using the GWDG API backend (using just the input text as prompt) - continuation example.


Figure 3.8b: Exploration platform using the GWDG API backend (using just the input text as prompt) - question example.

Both responses are a bit peculiar (i.e., they differ from the responses one would get using, e.g., the interface https://chat-ai.academiccloud.de with the same model, the same temperature, etc.). The reason is that the input to the LLM does not have the usual structure learned in instruction tuning, i.e. keywords such as “system:”, “user:” etc. are missing - as is the entire system prompt (“… you are a helpful assistant …”).

An interesting property of token probability distributions is the corresponding perplexity PP:

$$ PP[{p}] = \exp \left[-\sum_i p_i \log(p_i)\right] $$ In the special case of \(N\) equally probable tokens, \(p_i=1/N\) for \(1\le i \le N\), which leads to \(PP=N\). Therefore, a perplexity value of N can roughly be interpreted as the number of different tokens that the model can choose between for the next completion. High perplexity can arise even when the top token has a relatively high probability (say: 0.5) - if the tail of the distribution is long enough.


Figure 3.8c: Function p log (p) appearing in the perplexity computation.

As one specific long-tail example, we consider \(p_n = n^{-3/2}/\zeta(3/2)\); even though \(p_1\approx 0.38\) is not particularly small, the corresponding perplexity is 25, i.e., relatively large.
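This value can be checked numerically; the following minimal sketch truncates the distribution at a finite cutoff and renormalizes, so the result comes out slightly below the exact value of 25:

```python
import numpy as np

def perplexity(p):
    """Perplexity PP = exp(-sum_i p_i log p_i) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.exp(-np.sum(p * np.log(p)))

# Uniform case: N equally probable tokens yield PP = N
print(perplexity(np.full(50, 1 / 50)))        # 50.0

# Long-tail example p_n ~ n^(-3/2), truncated and renormalized
n = np.arange(1, 2_000_001)
p = n**-1.5 / np.sum(n**-1.5)
print(p[0], perplexity(p))                    # ~0.38 and ~24.7 (exact: 1/zeta(3/2) and 25)
```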

The average perplexity is a measure of the quality of the LLM: Early language models had typical values of 20 and above - they did not really know what they wanted to say. Good models typically have perplexity values in the range 5-10.

3.3 Sampling and temperature #

In practice, it is important to retain at least some of the inherent stochastic nature of the next-token prediction in LLMs. However, the sampling process can and should be tuned, as we will discuss now.

The “raw” or intrinsic probabilities \(p_i\) for token \(i\) to appear as the next token (shown in the above figure) arise from the application of the softmax function to the logits \(z_i\), i.e. the components of the unembedded final embedding vector (with vocabulary size V):

$$ p_i=\frac{e^{z_i}}{\sum_{i=1}^{V}e^{z_i}} $$

This agrees with the thermal distribution for the occupation of states with energy \(\epsilon_i\) (here for \(k_\text{B}=1\)):

$$ p_i=\frac{e^{-\epsilon_i/T}}{\sum_{i=1}^{V}e^{-\epsilon_i/T}}    $$

when setting \(\epsilon_i= -z_i\) and \(T=1\).

In analogy, one introduces the temperature for LLM sampling as an overall scaling factor for the logits (i.e. the unembedded final embedding vector):

$$ p_i=\frac{e^{z_i/T}}{\sum_{i=1}^{V}e^{z_i/T}} $$

It is easy to see that the limit \(T\to 0\) transforms the softmax distribution into the maximum (argmax) distribution, while \(T\to\infty\) transforms the softmax distribution into the uniform distribution.

Intermediate choices for T can continuously tweak the probability distribution for the next token. Assuming, for simplicity, that the “(other)” possible next tokens in our examples are just one token (denoted as “*”), we get, e.g., the following probabilities:


Figure 3.9: Probabilities for the next token at various choices of temperature T

In addition to changing T from its natural value 1 (which is the value used in all training!), one can restrict the occurrence of relatively unlikely tokens using the following parameters (a minimal sampling sketch is shown after the list):

  • top_k = k: choose only among k most probable tokens
  • top_p = p: choose among most likely tokens until cumulative probability p is reached
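The effect of these parameters can be illustrated with a minimal numpy sketch (the exact ordering and tie-breaking conventions differ between real implementations, so this is only meant to convey the idea):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sketch of temperature / top_k / top_p sampling from raw logits."""
    z = np.asarray(logits, dtype=float) / temperature      # T -> 0: argmax, T -> inf: uniform
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                        # tokens by decreasing probability
    keep = np.ones_like(probs, dtype=bool)
    if top_k is not None:                                  # keep only the k most probable tokens
        keep[order[top_k:]] = False
    if top_p is not None:                                  # keep tokens until cumulative prob >= p
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        keep[order[cutoff:]] = False
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Token probabilities of Fig. 3.8: "The", "_dog", "_barks", "_loudly", "_again", "*" (rest)
logits = np.log([1e-9, 1e-9, 1e-9, 0.3, 0.6, 0.1])
print([sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9) for _ in range(10)])
```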

3.4 Logit lens + Layer normalization #

We will now start looking at the internals of the (decoder-only) transformer.

Figure 3.10: Decoder-only transformer model.

First note that (for vocabulary size \(V\) and embedding dimension \(d_\text{model}\)), the embedding and unembedding matrices (used in the blocks “Input Embedding” and “Linear + Softmax”, respectively) have dimensions \(V\times d_\text{model}\) and \(d_\text{model} \times V\), respectively. In theory, their entries could be determined independently. In practice, most models (and usual transformer implementations) employ tied embedding (a.k.a. “parameter tying”), i.e., restrict the unembedding matrix to be the transpose of the embedding matrix:

$$ W_\text{unembed} = W_\text{embed}^T $$ This generally regularizes the model. Furthermore, it implies a consistent relationship between token vectors and embedding vectors both at the input and on the output side.
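In PyTorch, tied embedding amounts to sharing a single weight tensor between the embedding and the unembedding layer; here is a minimal sketch (with arbitrary illustrative sizes, not those of a specific model):

```python
import torch
import torch.nn as nn

V, d_model = 50_000, 512                      # illustrative vocabulary and embedding sizes

embed = nn.Embedding(V, d_model)              # token id -> embedding vector
unembed = nn.Linear(d_model, V, bias=False)   # embedding vector -> logits
unembed.weight = embed.weight                 # tie the parameters: W_unembed = W_embed^T
                                              # (nn.Linear stores the transposed matrix internally)

tokens = torch.tensor([[3, 17, 42]])          # a dummy token sequence
logits = unembed(embed(tokens))
print(logits.shape)                           # torch.Size([1, 3, 50000])
```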

Residual property of transformers #

A second important point for our discussion is valid for all transformers (including encoder-decoder variants): They are examples of residual neural networks, a class of DNNs, where the input is always retained and only (slightly; see below) modified by the output, not fully replaced. This is symbolized by the lines going around the attention and feed-forward layers in the above figure.

We will now look at very interesting implications of the two points made above. For reasons that should become clear later, only the direction of embedding vectors is important in the transformers considered here. Therefore and for simplicity, we can neglect all normalization steps for the following considerations.

Let us make the residual property explicit by following the lanes of the example discussed in the previous section:

Figure 3.11: Lanes in a decoder model.

After each of the three input tokens is mapped (independently using the same matrix \(W_\text{embed}\)) to the corresponding initial embedding vectors \(E_l^{(0)}\) (for \(1\le l\le L\)), the self-attention mechanism (to be discussed in the next section) kicks in. The important point is that this perturbation is small:

$$ E_l^{(0)} \stackrel{\text{self-attention}}{\longrightarrow} E_l^{(1)} = E_l^{(0)} + \Delta E_l^{(1)} $$ where $$ |\Delta E_l^{(1)}| \ll |E_l^{(0)}| $$

Similarly, also the subsequent feed-forward layer (which acts independently on each lane) only perturbs each embedding vector moderately.

As mentioned above, the elementary decoder structure with self-attention, feed-forward (and normalization) is repeated several times before the final unembedding + softmax step is reached. Given that the full transformer has to be able to map each token to all tokens (that have a significant probability of following the input token) within each lane, the requirement of individual steps being small obviously implies a quite large number of layers n. GPT-3, for example, employs 96 layers.

Logit lens #

Due to the tied embedding/unembedding matrix, each embedding vector throughout the transformer can be unembedded and the result interpreted as logits; application of the softmax then yields probabilities for each token. By means of this logit lens, one can look inside the transformer:


Figure 3.12: Logit lens view on a 36-layer LLM (source: https://tuned-lens.readthedocs.io/en/latest/)

While, in this example, the predictions are far from perfect (input is not fully reproduced in the output), the quasi-continuous evolution of the embedding vectors is well illustrated.
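The logit lens is easy to try out on a small open model; the following sketch uses the Hugging Face transformers library and GPT-2 (a 12-layer model, not the 36-layer model of the figure). Applying the final layer norm before the tied unembedding is a common convention for the logit lens; the output should be read as qualitative only:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

# Unembed the hidden state of the last lane after every layer
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(f"layer {layer:2d}: {tok.decode(logits.argmax().item())!r}")
```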

Excursion: token autoencoders #

Let us now turn to two related questions:

  1. Is it really possible for an LLM to confidently predict each token of the vocabulary even when the embedding dimension is much smaller than the vocabulary size? How?
  2. Why do LLMs need normalization steps for stability?

The first question comes up in the following way: In order for a token to appear as a strong prediction, the corresponding component of the token vector has to be close to one (say: larger than 0.9). However, the unembedding maps from a space of much lower dimension \(d_\text{model}\) into the vocabulary vector space (dimension V); this mapping can, at most, reach a \(d_\text{model}\)-dimensional subspace and certainly not produce V linearly independent vectors (as the unembedding matrix is fixed during inference).

In order to demonstrate point 1, we will consider the extreme case of a 2-dimensional embedding. For our question, the inner workings of the decoder are irrelevant. Therefore, we condense the decoder block to a single layer:


Figure 3.13: 3-layer NN for our autoencoder model (linear middle layer, softmax activation in the final layer)

We keep the tied embedding used in most LLMs. Then the model takes the following form: $$ \text{token vector} \stackrel{W_\text{embed}}{\longrightarrow}\text{embedding vector}\stackrel{W_\text{embed}^T}{\longrightarrow} \text{logits} \stackrel{\text{softmax}}{\longrightarrow} \text{token vector} $$ We now consider the case of a vocabulary with \(V=10\) different tokens in an autoencoder setup: The model is trained repeatedly with each possible input token to predict the same token as output. At first sight, this seems an impossible task: linear combinations of two 10-dimensional vectors must (in the unembedding step) produce sharp maxima at each of 10 positions.

Let us look at a numerical experiment. For reasons to be discussed in a moment, we employ a (minimalistic) normalization after each update: the Frobenius norm of the embedding/unembedding matrix is set to a constant.

Figure 3.14: Evolution of embedding vectors in d=2 in an autoencoder NN: optimal solution

We can see that an initially random arrangement of the embedding vectors (one for each of the 10 tokens) evolves towards a circular arrangement pretty rapidly; more precisely, the optimal configuration is a polygon with V corners.

This shows:

  1. An LLM can choose between an arbitrary number of tokens using any embedding dimension \(d\ge 2\) by arranging the embedding vectors of all tokens on a hypersphere and pointing the final embedding vector in the appropriate direction.
  2. In terms of the cosine similarity, the maximum cannot be sharp in extreme cases such as shown here.

Upon rerunning the training, we see that the model can get stuck (in extreme cases such as the one considered here) in suboptimal solutions, where some token embeddings end up in the origin:

Figure 3.15: Evolution of embedding vectors in d=2 in an autoencoder NN: suboptimal solution 1

Figure 3.16: Evolution of embedding vectors in d=2 in an autoencoder NN: suboptimal solution 2

At this point, we can also see why the algorithm needs a normalization in order to remain stable: For any of the final solutions, the loss could be reduced further by scaling up the embedding matrix (since the logits, and thus the difference between the maximum logit and the next-lower logit, scale up correspondingly, leading to more pronounced probability distributions). The corresponding terms in the gradients must be offset in order for convergence to become possible.

Given the insight that embedding vectors must (at least approximately) occupy hyperspheres in order to become/remain accessible for predictions, LLMs usually enforce this property by normalizing each individual embedding vector after each modification by self-attention or feed-forward blocks.

More specifically, in the so-called “layer normalization” step, LLMs usually calculate mean and variance of the components of each embedding vector; they then shift the mean to zero and scale the variance to 1, but finally reintroduce a scale factor and an offset as learnable parameters. It is not obvious why one would want to restrict the mean of the vector components (which, in the case d=2 considered above, would force the embedding vectors onto one diagonal). Indeed, many modern LLMs use “RMS normalization”, which only rescales the embedding vectors.
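A minimal numpy sketch of the two normalization variants (with the learnable scale and offset reduced to scalars for brevity):

```python
import numpy as np

def layer_norm(E, gamma=1.0, beta=0.0, eps=1e-6):
    """Layer normalization: zero mean and unit variance per embedding vector,
    followed by a learnable scale (gamma) and offset (beta)."""
    mu = E.mean(axis=-1, keepdims=True)
    var = E.var(axis=-1, keepdims=True)
    return gamma * (E - mu) / np.sqrt(var + eps) + beta

def rms_norm(E, gamma=1.0, eps=1e-6):
    """RMS normalization: only rescales each embedding vector to unit RMS."""
    rms = np.sqrt((E**2).mean(axis=-1, keepdims=True) + eps)
    return gamma * E / rms

E = np.random.randn(4, 8)                  # four embedding vectors with d = 8
print(layer_norm(E).mean(axis=-1))         # ~0 for each vector
print((rms_norm(E)**2).mean(axis=-1))      # ~1 for each vector (directions unchanged)
```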

Note that high-dimensional vector spaces have some unfamiliar properties: e.g., almost all vectors within a hypersphere are very close to its surface; one can also arrange a much larger number of vectors than d so that they are nearly perpendicular. This implies that our autoencoder example could be extended, e.g., to \(d=100\) and \(V=10{,}000\); already the cosine similarity would then yield a rather strong distinction between the embedding vector of a selected token and all other vectors.

Code used for autoencoder experiments (developed with Gemini 3)
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np
import os

# --- Configuration ---
NUM_TOKENS = 10
EMBED_DIM = 2
NUM_EPOCHS = 140       # Adjusted to fit exactly 8 snapshots (0 to 140)
SNAPSHOT_INTERVAL = 20
LEARNING_RATE = 0.05
OUTPUT_FILE = "token_evolution_final_grid.png"

# Global Constraint: Fixed Frobenius Norm
TARGET_MATRIX_NORM = np.sqrt(NUM_TOKENS * 9.0)

torch.manual_seed(42)

class TiedAutoencoder(nn.Module):
    def __init__(self, num_tokens, embed_dim):
        super().__init__()
        # 1. Initialize random Gaussian cloud
        self.embeddings = nn.Parameter(torch.randn(num_tokens, embed_dim))
        # 2. IMMEDIATELY enforce the constraint (Start as a cloud, not a dot)
        self.apply_global_constraint()

    def forward(self, x):
        h = torch.matmul(x, self.embeddings)
        logits = torch.matmul(h, self.embeddings.T)
        return logits, self.embeddings

    def apply_global_constraint(self):
        """Scale matrix to fixed global norm."""
        with torch.no_grad():
            current_norm = self.embeddings.norm(p='fro')
            scale_factor = TARGET_MATRIX_NORM / (current_norm + 1e-8)
            self.embeddings.mul_(scale_factor)

# --- Setup ---
inputs = torch.eye(NUM_TOKENS)
targets = torch.arange(NUM_TOKENS).long()

model = TiedAutoencoder(NUM_TOKENS, EMBED_DIM)
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss()

snapshots = []
losses = []

print(f"Training 10 tokens in 2D (Epochs 0-140)...")

# --- Training Loop ---
for epoch in range(NUM_EPOCHS + 1):

    # 1. Take Snapshot
    if epoch % SNAPSHOT_INTERVAL == 0:
        with torch.no_grad():
             logits_snap, current_embeddings = model(inputs)
             emb_data = current_embeddings.detach().cpu().numpy().copy()

             if len(losses) > 0:
                 curr_loss = losses[-1]
             else:
                 curr_loss = criterion(logits_snap, targets).item()

             snapshots.append((epoch, emb_data, curr_loss))
             print(f"Epoch {epoch:4d} | Loss: {curr_loss:.4f}")

    # 2. Update Steps
    optimizer.zero_grad()
    logits, _ = model(inputs)
    loss = criterion(logits, targets)
    loss.backward()
    optimizer.step()
    model.apply_global_constraint()
    losses.append(loss.item())

# --- Visualization ---
print(f"Generating 4x2 Grid Plot...")
num_snaps = len(snapshots) # Should be 8
cols = 4
rows = 2 # Fixed for 0-140 range

# Adjust figure height to accommodate the row spacing
fig = plt.figure(figsize=(16, 9))

colors = plt.cm.rainbow(np.linspace(0, 1, NUM_TOKENS))
limit = (TARGET_MATRIX_NORM / np.sqrt(NUM_TOKENS)) * 1.8
target_radius = TARGET_MATRIX_NORM / np.sqrt(NUM_TOKENS)

for i, (epoch, emb, loss_val) in enumerate(snapshots):
    ax = fig.add_subplot(rows, cols, i + 1)

    # Reference circle
    circle = plt.Circle((0, 0), target_radius, color='gray', fill=False, linestyle=':', alpha=0.3)
    ax.add_artist(circle)

    for t_idx in range(NUM_TOKENS):
        ax.scatter(emb[t_idx, 0], emb[t_idx, 1], color=colors[t_idx], s=80)

        # Offset label slightly
        offset_x = np.sign(emb[t_idx, 0]) * 0.25
        offset_y = np.sign(emb[t_idx, 1]) * 0.25
        if offset_x == 0 and offset_y == 0: offset_x, offset_y = 0.25, 0.25

        ax.text(emb[t_idx, 0] + offset_x, emb[t_idx, 1] + offset_y, 
                str(t_idx), fontsize=9, color=colors[t_idx], ha='center', va='center')

        ax.plot([0, emb[t_idx, 0]], [0, emb[t_idx, 1]], color=colors[t_idx], alpha=0.2)

    ax.set_xlim(-limit, limit)
    ax.set_ylim(-limit, limit)
    ax.set_aspect('equal')

    # Title configuration: closer to graph (pad=6)
    ax.set_title(f"Epoch {epoch} (Loss: {loss_val:.2f})", fontsize=11, fontweight='bold', pad=6)
    ax.axis('off')

# Single Title
plt.suptitle(f"Evolution of Tokens (d=2)", fontsize=18, y=0.98)

# Layout adjustments
# hspace=0.4 adds the requested vertical space between rows
# wspace=0.1 keeps columns relatively tight
plt.subplots_adjust(top=0.90, bottom=0.05, left=0.05, right=0.95, hspace=0.4, wspace=0.1)

plt.savefig(OUTPUT_FILE, dpi=150)
plt.close()

print(f"Done! Plot saved to: {os.path.abspath(OUTPUT_FILE)}")

3.5 Self-attention #

Let us now turn to the heart of the transformer approach, the self-attention mechanism. In general, this approach allows earlier tokens to impact later tokens - which is obviously essential for generating meaningful text.

Note that some relationships between nearest-neighbor tokens can and will be learned via the feed-forward networks within each lane; if a certain token b never follows some token a within the training data, even a model with all attention blocks switched off could probably learn to not violate this rule. However, less trivial correlations and correlations beyond the nearest-neighbor distance must rely on attention.

Among the relationships that an LLM must learn in order to be useful are the following:

| No | Class / Property | Example |
|----|------------------|---------|
| 1 | negation | this is … not useful |
| 2 | word from token | _b + arks (in “The dog barks again”) |
| 3 | subject - pronoun | Henry is a criminal. He |
| 4 | adjective - noun | … is a cruel man. The monster … |
| 5 | given name - surname | Michael Jordan |
| 6 | topic - specific term | music festival … Queen was greeted with applause |
| 7 | grammar rules | … end of a sentence._Then follows … (space after full stop) |
| 8 | language consistency | (question in German) … (answer in German) |

Note that relationships or connotations between words flow in both directions. While example 1 is mostly left-to-right, example 2 is essentially symmetric (since none of the subword tokens has a relevant meaning by itself). While in example 5 “Jordan” is specified by “Michael” (left-to-right), the converse connotation is probably even more relevant (if subsequent sentences refer to “Michael”). In example 6, we learn that Queen is not a monarch, but a music group (which the lack of the definite article “the” also hints at); on the other hand, we can conclude that the music festival was probably a major event.

In general, self-attention is a mechanism for adding connotations to each token that reflect its context. In the following, we will first discuss the original formulation and then an improved variant (with Rotary Positional Encoding, RoPE).

Figure 3.17: Self-attention mechanism using Query (Q), Key (K), and Value (V) matrices (from Vaswani et al., Google, 2017).

Let us start by noting that LLMs employ large numbers of attention heads within each decoder (or encoder) block in parallel. This multi-head attention mechanism allows the models to implement a large variety of possible token-token relationships at the same time. For simplicity, we focus on one attention head and assume (following 3Blue1Brown) that it encodes adjective-noun relationships (our example 4).

Each attention head is defined by matrices (where, for clarity, we have renamed \(d_{\text{model}}\to d_{\text{embed}} \))

\[ \begin{aligned} W_Q &\in \mathbb{R}^{d_{\text{query}}\times d_{\text{embed}}} && \text{query matrix}\\ W_K &\in \mathbb{R}^{d_{\text{query}}\times d_{\text{embed}}} && \text{key matrix}\\ W_V &\in \mathbb{R}^{d_{\text{embed}}\times d_{\text{embed}}} && \text{value matrix} \end{aligned} \]

Note that usually a lower-rank factorization is used for the value matrix: $$ W_V = W_{V\uparrow} W_{V\downarrow}, \quad \text{where}\quad W_{V\downarrow} \in \mathbb{R}^{d_{\text{value}}\times d_{\text{embed}}}, \quad W_{V\uparrow} \in \mathbb{R}^{d_{\text{embed}}\times d_{\text{value}}}, $$ and where, in practice, the reduced value dimension is chosen equal to the query dimension: \(d_{\text{value}}= d_{\text{query}}\). Unfortunately, in the literature, all matrices denoted here as \(W_{V\uparrow}\), stacked together across all attention heads, are often referred to as a single “output matrix”, while only the matrices denoted here as \(W_{V\downarrow}\) are referred to as value matrices. For understanding the attention mechanism, these details are not important; we will therefore treat \(W_V\) simply as a mapping within the embedding space.

Let us initially consider the very first attention block, which acts on the initial embedding vectors that represent the input tokens. We consider the text

a big man enters the tiny sports car ; he

and assume that each space+word or punctuation mark corresponds to one of the embedding vectors \(E_1\) to \(E_{10}\).

The attention mechanism should reflect (at least) the following relationships:

  1. big → man
  2. tiny → car
  3. sports → car
  4. (big) man → he
  5. man → enters → car

Let us, for the moment, focus on adjective - noun relationships and assume that one attention head addresses only this context aspect by enriching nouns (here embedding vectors \(E_3\) and \(E_8\)) with the appropriate flavor of the adjective.

The attention mechanism then essentially works as follows:

  1. Query: Each noun asks the embedding vectors on the left: which of you is an adjective?
  2. Key: Each adjective answers: here I am!
  3. According to the overlap between query and key, a pair score \(0 \le s_{mn}\le 1\) is computed.
  4. Value: Each adjective computes a value vector
  5. Update step: Each noun adds the appropriate value vectors, scaled with the pair score

Obviously, the above description is oversimplified. In reality, embedding vectors usually do not have definite identities (e.g. as noun or adjective) and the key-query mechanism is richer. The actual process is the following (\(1\le m,n \le L\) for context length \(L\)):

  1. Query: For each embedding vector \(E_n\), a query vector is calculated: \(Q_n = W_Q E_n\)
  2. Key: For each embedding vector \(E_m\), a key vector is calculated: \(K_m = W_K E_m\)
  3. An attention logit matrix is calculated: \(z_{mn} = K_m \cdot Q_n / \sqrt{d_\text{query}}\)
  4. Masking: entries violating causality (key right of query) are suppressed: \(z_{mn} = z^-\) for \(m>n\)
  5. Attention scores via softmax: \(p_{mn} = \exp(z_{mn}) / \sum_{o=1}^L \exp(z_{on})\)
  6. Value: For each embedding vector \(E_m\), a value vector is calculated: \(V_m = W_V E_m\)
  7. Modify embedding vectors: \(E_n \to E_n + \sum_{m=1}^n p_{mn} V_m\)

Here, \(z^-\) represents a strongly negative logit (an approximation to \(z=-\infty\)), so that the corresponding attention scores vanish after the softmax.
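The seven steps above can be condensed into a few lines of numpy. The following single-head sketch uses random matrices, so the resulting update is not small (unlike in a trained model); embedding vectors are stored as rows, and the matrices act on column vectors, hence the transposes:

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_self_attention(E, W_Q, W_K, W_V):
    """One masked attention head, following steps 1-7 above."""
    L = E.shape[0]
    d_query = W_Q.shape[0]
    Q = E @ W_Q.T                              # step 1: query vectors Q_n
    K = E @ W_K.T                              # step 2: key vectors K_m
    z = (K @ Q.T) / np.sqrt(d_query)           # step 3: logits z_mn = K_m . Q_n / sqrt(d_query)
    z[np.tril_indices(L, k=-1)] = -1e9         # step 4: mask keys to the right of the query (m > n)
    p = np.exp(z - z.max(axis=0, keepdims=True))
    p /= p.sum(axis=0, keepdims=True)          # step 5: softmax over m, so that sum_m p_mn = 1
    V = E @ W_V.T                              # step 6: value vectors V_m
    return E + p.T @ V, p                      # step 7: E_n -> E_n + sum_m p_mn V_m

d_embed, d_query, L = 8, 4, 10                 # toy dimensions, cf. the 10-token example below
E = rng.normal(size=(L, d_embed))
W_Q, W_K = rng.normal(size=(d_query, d_embed)), rng.normal(size=(d_query, d_embed))
W_V = rng.normal(size=(d_embed, d_embed))

E_new, p = masked_self_attention(E, W_Q, W_K, W_V)
print(np.round(p, 2))                          # each column sums to 1; entries with m > n vanish
```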

For our example, we would expect attention scores (focusing only on adjective - noun relationships, after masking + softmax) similar to the following:

| key \(m\) ↓ / query \(n\) → | 1: a | 2: big | 3: man | 4: enters | 5: the | 6: tiny | 7: sports | 8: car | 9: ; | 10: he |
|---|---|---|---|---|---|---|---|---|---|---|
| 1: a | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2: big | | 1.0 | 0.9 | 0 | 0 | 0 | 0 | 0.4 | 0 | 0 |
| 3: man | | | 0.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4: enters | | | | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5: the | | | | | 1.0 | 0 | 0 | 0 | 0 | 0 |
| 6: tiny | | | | | | 1.0 | 0 | 0.4 | 0 | 0 |
| 7: sports | | | | | | | 1.0 | 0.1 | 0 | 0 |
| 8: car | | | | | | | | 0.1 | 0 | 0 |
| 9: ; | | | | | | | | | 1.0 | 0 |
| 10: he | | | | | | | | | | 1.0 |

Here, we have omitted entries that are zero due to masking / causality. Note that \(p_{11}=1.0\) due to the normalization constraint inherent in the softmax, which enforces that each target embedding vector attends to something (cf. Miller, “Attention is off by one”, 2023). We have assumed that this sum-rule-enforced attention usually gets allocated to the diagonal; for consistency, some diagonal contribution is retained also in the cases where off-diagonal attention is wanted.

Note that, in our example, we can reproduce the attention patterns big - man (with score 0.9) and tiny - car (with score 0.4) that we were aiming for. Since, however, the attention mechanism discussed so far depends only on the token embeddings, not on their distance, we have to expect a similar, unwanted pattern big - car (score 0.4).

Note that, in practical implementations, the computations stated above are performed in optimized form:

  1. All computations are performed on the matrix level: Instead of computing query, key, and value vectors for each token/embedding position individually, all are computed as matrices.
  2. Since information flows in a transformer only from bottom to top and left to right, extending the input by the generated token in position n+1 after a step of inference does not change any of the embedding vectors in lanes 1 to n (at least for \(n<L\)). With \(W_{Q/K/V}\) being fixed at inference time, this implies that also the key and value vectors in these positions do not change (this is also true for the query vectors, but those are no longer needed).
  3. As a consequence, LLMs employ key-value caching (see the sketch below).
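Here is a minimal sketch of the idea behind key-value caching (single head, same notation as in the sketch above; not an excerpt from any real implementation): only the query of the newest position has to be computed, while keys and values of all earlier positions are reused from the cache.

```python
import numpy as np

def attend_with_cache(E_new, K_cache, V_cache, W_Q, W_K, W_V):
    """Process one newly generated position; K_cache/V_cache are lists of earlier key/value vectors."""
    d_query = W_Q.shape[0]
    q = W_Q @ E_new                           # query of the new position n
    K_cache.append(W_K @ E_new)               # its key and value are appended to the caches
    V_cache.append(W_V @ E_new)
    z = np.array([k @ q for k in K_cache]) / np.sqrt(d_query)
    p = np.exp(z - z.max())
    p /= p.sum()                              # causal masking is automatic: the cache only holds m <= n
    return E_new + sum(p_m * v for p_m, v in zip(p, V_cache))

# Usage: start with K_cache, V_cache = [], [] and call the function once per generated token.
```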

It is clear that the setup discussed so far, with a position-independent mapping from tokens to embedding vectors and a distance-independent attention mechanism, is not sufficient for a working language model.

3.6 Positional encoding #

Originally, the transformer architecture was set up with absolute positional encoding (APE) on the level of the embedding vectors. Let us define vectors \(P_n\) at position n in the embedding space (i.e. with components \(0\le l < d_{embed}\)):

\[ \begin{array}{lll} (P_n)_{2s} &=& \sin(n \,\theta_{2s}) \\ (P_n)_{2s+1} &=& \cos(n \,\theta_{2s})\end{array} \qquad \text{where } \theta_{s} = N^{-s/d_{embed}}. \]

Here, \(N\) is a free parameter that should be significantly larger than all \(n\), i.e., larger than the context length L. The original paper uses \(N=10000\). This construction introduces oscillations with a high frequency, i.e. short wavelength, in the lowest embedding dimensions (period \(2\pi\approx 6\) for \(l=0,1\)) and longer wavelengths in the higher embedding dimensions:

Figure 3.17: Absolute positional encoding (N=10000, d=100; illustration currently used in the Wikipedia transformer article).

Code used for positional encoding illustration (developed with Gemini 3)
import numpy as np
import matplotlib.pyplot as plt

def generate_figure():
    # 1. Configuration
    N = 10000       # Denominator base (standard is 10000)
    d = 100.0       # Scaling factor (dimension size proxy)

    # 2. Define Ranges
    # Embedding Index (k)
    ks = np.arange(0, 100)
    # Position (pos)
    poss = np.arange(0, 1001)

    # 3. Create Grid
    data = np.zeros((len(ks), len(poss)))

    # 4. Calculate Sine/Cosine values
    for i, k in enumerate(ks):
        if k % 2 == 1:
            # Odd indices: sin(pos / N^((k-1)/d))
            theta = poss / (N ** ((k - 1) / d))
            data[i, :] = np.sin(theta)
        else:
            # Even indices: cos(pos / N^(k/d))
            theta = poss / (N ** (k / d))
            data[i, :] = np.cos(theta)

    # 5. Plotting
    fig = plt.figure(figsize=(16, 8), dpi=200) # High resolution

    # Heatmap
    # cmap='RdBu_r': Red=Positive, Blue=Negative, White=Zero
    plt.imshow(data, aspect='auto', cmap='RdBu_r', origin='lower', vmin=-1, vmax=1)

    # 6. Custom Ticks (Linear Scale)
    # Positions: 0, 100, 200, ... 
    plt.xticks(np.arange(0, 1001, 100))
    # Embeddings: 0, 10, 20, ...
    plt.yticks(np.arange(0, 100, 10))

    # 7. Labels
    plt.ylabel("Embedding Index")
    plt.xlabel("Position")
    plt.colorbar()

    # 8. Save and Show
    plt.tight_layout()
    plt.savefig("positional_encoding.png")
    print("Figure saved as 'positional_encoding.png'")
    plt.show()

if __name__ == "__main__":
    generate_figure()

These positional encoding vectors are added to the initial embedding vectors before they enter the decoder blocks:

$$ E_n^{(0)} \longrightarrow E_n^{(0)} + P_n $$

As a consequence, all initial embedding vectors can be expected to be different and, thus, be associated with different query, key and value vectors. Therefore, the model has the chance to learn position-dependent attention patterns. At the same time, the symmetry in the embedding space is broken: embedding directions are no longer equivalent in the model architecture (remember that in DNNs, inherent symmetries have to be broken by random initialization in order for nontrivial solutions to be accessible by gradient descent).

However, this mechanism is quite indirect and makes variation of the context length (after training) difficult.

In contrast, rotary positional encoding (RoPE) acts within the attention mechanism, specifically in the query space. Identical rotations are applied to the query and key vectors; the associated angles depend linearly on the positions. Consequently, the relative rotation angles between query and key vector only depend on their positional distance.

Specifically, rotations are applied in two-dimensional subspaces,

\[ R_m = \begin{pmatrix} \cos{m\theta_1}& -\sin{m\theta_1}&0&0&\cdots&0&0\\ \sin{m\theta_1}&\cos{m\theta_1}&0&0&\cdots&0&0 \\ 0&0&\cos{m\theta_2}& -\sin{m\theta_2}&\cdots&0&0\\ 0&0&\sin{m\theta_2}&\cos{m\theta_2}&\cdots&0&0 \\ \vdots&\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&0&\cdots&\cos{m\theta_{d/2}}& -\sin{m\theta_{d/2}}\\ 0&0&0&0&\cdots&\sin{m\theta_{d/2}}&\cos{m\theta_{d/2}} \end{pmatrix} \]

where the angles \(\theta_i=10000^{-2(i-1)/d}, i \in [1, 2, …, d/2]\) are chosen in exactly the same way as in APE.

This leads to the following algorithm (Self-attention with RoPE):

  1. Query: For each embedding vector \(E_n\), a query vector is calculated: \(Q_n = R_n W_Q E_n\)
  2. Key: For each embedding vector \(E_m\), a key vector is calculated: \(K_m = R_m W_K E_m\)
  3. An attention logit matrix is calculated: \(z_{mn} = K_m \cdot Q_n / \sqrt{d_\text{query}}\)
  4. Masking: entries violating causality (key right of query) are suppressed: \(z_{mn} = z^-\) for \(m>n\)
  5. Attention scores via softmax: \(p_{mn} = \exp(z_{mn}) / \sum_{o=1}^L \exp(z_{on})\)
  6. Value: For each embedding vector \(E_m\), a value vector is calculated: \(V_m = W_V E_m\)
  7. Modify embedding vectors: \(E_n \to E_n + \sum_{m=1}^n p_{mn} V_m\)

Taken together, the attention scores using RoPE can be written as follows:

\[ p_{mn} = \text{softmax} \left[ \text{mask} \left[ E_m^T \, W_K^T \, R_{n-m} \, W_Q \, E_n \right] \right]_n \]
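The rotations can be implemented compactly by treating each two-dimensional subspace as a complex number; the following minimal numpy sketch (not taken from any specific model implementation) also verifies that the resulting attention logits depend only on the positional distance \(n-m\):

```python
import numpy as np

def rope(x, n, N=10000):
    """Rotate the vector x at position n in d/2 two-dimensional subspaces (sketch)."""
    d = x.shape[0]
    theta = N ** (-2.0 * np.arange(d // 2) / d)                 # theta_i = 10000^(-2(i-1)/d)
    pairs = (x[0::2] + 1j * x[1::2]) * np.exp(1j * n * theta)   # rotation by angle n * theta_i
    out = np.empty_like(x)
    out[0::2], out[1::2] = pairs.real, pairs.imag
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The attention logit K_m . Q_n depends only on the distance n - m (here: 4 in both cases)
print(rope(k, 3) @ rope(q, 7), rope(k, 13) @ rope(q, 17))
```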

3.7 Feed-forward layers #

The feed-forward units, which are applied within each transformer block (after the attention step), contain two mappings (i.e. input layer, hidden layer, and output layer); only the hidden layer has an activation:

\[ \text{FFN}(E_n)= W_2 \,\sigma(W_1\, E_n + b_1)+ b_2 \] Here \(E_n\) is the embedding vector; the dimension \(d_\text{ff}\) of the hidden layer is usually chosen as \(d_\text{ff} = 4 d_\text{embed}\).

Instead of ReLU, a smooth activation function is usually chosen for the hidden layer. GPT-2, for example, uses GELU:

\[ \mathrm{GELU}(x) \approx 0.5 x \left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715 x^{3}\right)\right)\right) \]
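As a minimal illustration (with toy dimensions rather than those of a real model), the feed-forward block and the GELU approximation can be written as:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU, as used e.g. in GPT-2."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(E_n, W1, b1, W2, b2):
    """Position-wise feed-forward block: FFN(E_n) = W2 gelu(W1 E_n + b1) + b2."""
    return W2 @ gelu(W1 @ E_n + b1) + b2

rng = np.random.default_rng(0)
d_embed = 8
d_ff = 4 * d_embed                           # usual choice d_ff = 4 d_embed
W1, b1 = rng.normal(size=(d_ff, d_embed)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_embed, d_ff)), np.zeros(d_embed)

E_n = rng.normal(size=d_embed)
print(ffn(E_n, W1, b1, W2, b2).shape)        # (8,) - same dimension as the input embedding vector
```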

Due to their large sizes, the matrices \(W_{1/2}\) (one pair for each decoder layer) contain a large fraction of all parameters of an LLM (e.g. about half of the parameters for GPT-2). It is assumed that they store a large part of the factual information learned by an LLM.

As a specific example, we look at the parameters of Gemma 2 (see https://arxiv.org/abs/2408.00118):

| Parameters | 2B | 9B | 27B |
|---|---|---|---|
| \(d_{model}\) | 2304 | 3584 | 4608 |
| Layers | 26 | 42 | 46 |
| Pre-norm | yes | yes | yes |
| Post-norm | yes | yes | yes |
| Non-linearity | GeGLU | GeGLU | GeGLU |
| Feedforward dim | 18432 | 28672 | 73728 |
| Head type | GQA | GQA | GQA |
| Num heads | 8 | 16 | 32 |
| Num KV heads | 4 | 8 | 16 |
| Head size | 256 | 256 | 128 |
| Global att. span | 8192 | 8192 | 8192 |
| Sliding window | 4096 | 4096 | 4096 |
| Vocab size | 256128 | 256128 | 256128 |
| Tied embedding | yes | yes | yes |

| Model | Embedding Parameters | Non-embedding Parameters |
|---|---|---|
| 2B | 590,118,912 | 2,024,517,888 |
| 9B | 917,962,752 | 8,324,201,984 |
| 27B | 1,180,237,824 | 26,047,480,320 |

3.8 Training of LLMs #

As described above, the revolutionary transformer architecture with self-attention was introduced in 2017 by the global corporation Google. However, the breakthrough was achieved by the startup OpenAI five years later.

The success of ChatGPT (i.e., the GPT-3, GPT-4, GPT-4o models, etc.) is not based solely on model architectures, but especially on the training strategies and the effort invested in them.

Training of Base / Foundation Models #

All generative LLMs can be derived from base models (also called foundation models) that are initially trained—as described above—using unsupervised learning on massive text corpora. These typically consist of some or all of the following sub-corpora:

  1. Wikipedia
  2. Books of all kinds
  3. Newspaper articles
  4. Scientific articles
  5. Social media
  6. Computer programs
  7. Q&A forums (especially for IT)
  8. Other websites

For each individual training step, a sequence of tokens of length L (for context length L) is chosen from parts of the training corpus as the input, and the sequence shifted by one token is used as the output; in other words, the model is trained to predict the next token.
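Schematically, one such pretraining step amounts to a cross-entropy loss over the shifted sequence. The following PyTorch sketch treats the model as a black box that returns logits of shape (batch, L, V); it is only meant to illustrate the shift by one token, not any particular training setup:

```python
import torch.nn.functional as F

def training_step(model, optimizer, token_ids):
    """One next-token-prediction step; token_ids has shape (batch, L+1)."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]   # targets = inputs shifted by one token
    logits = model(inputs)                                  # (batch, L, V)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```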

This process requires enormous resources (especially a large number of NVIDIA GPUs with lots of memory); for large models, a complete training typically costs around 100 million USD.

Important: A base model trained in this way initially only has the ability to continue text coherently.

Example: For the input (prompt) “What is the meaning of life?”, an LLM might continue (similar to a randomly selected webpage of MDR) with the (teaser) text “We are born into this world without being asked and are supposed to shape our lives. But how exactly, and above all, why? Why are we here?”—that is, mostly with further questions.

Note: To some extent, base models can be guided toward more constructive behavior using one-shot prompting (a variant of prompt engineering).

Instruction Tuning #

To ensure models respond constructively, they must undergo instruction tuning (a special case of fine-tuning), an application of supervised learning.

A large number of (hopefully representative) input examples, each paired with a desired output, is used as training data, for example:

User: What is the capital of France?

Agent: The capital of France is Paris.

It is obviously highly nontrivial to generate training datasets that sufficiently cover all conceivable applications of an LLM without overly restricting linguistic variability. However, fixed keywords (such as “System”, “User”, “Agent”) are often used, which the model learns and which imply what an optimal prompt should look like.

The parameters learned in the (then called pretrained) base model are usually only slightly modified through instruction fine-tuning. The resulting instruction model generally reacts far more constructively to user prompts.

Problem: In some cases, pure instruction models are too constructive and reproduce learned stereotypes without filtering. Such models might readily help with building bombs or committing suicide, and they would reproduce racist and sexist attitudes.

Alignment #

The models used for ChatGPT (in fact, almost all LLMs that are available to the general public) therefore undergo an additional fine-tuning step, usually Reinforcement Learning from Human Feedback (RLHF). In this process, the LLM being trained generates multiple responses (e.g., 2 or 4) to a given input, which are evaluated or ranked by test users. Again, the previously learned parameters are modified (usually minimally).

Meanwhile, methods have also been established in which LLMs essentially monitor themselves (to suppress or delete harmful output), and in some cases specialized models are used as overseers.


Figure: Training stages of LLMs (generated using ChatGPT)
