2 Elementary Natural Language Processing #
We will now discuss methods for natural language processing (NLP). Before we get to the highly sophisticated (and complicated) transformer-based architectures, we will cover more elementary NLP methods.
In this short lecture, we cannot dive too deeply into technical details. In particular, we will not show code of actual implementations. Such details can be found, e.g., in “The Hundred-Page Language Models Book: hands-on with PyTorch” by Andriy Burkov (open access via https://thelmbook.com), which served as inspiration for parts of this lecture.
2.1 Bag of Words (BoW) – Text Classification #
We begin with Bag of Words (BoW), a conceptually simple algorithm. It can be used to automatically classify documents or text fragments (e.g., sentences) according to a given set of topics.
As an example, consider the following 9 (short) sentences:
- Soccer is exciting.
- I like playing tennis.
- Sports are fun.
- Today it is sunny.
- Tomorrow it will rain.
- The weather is nice.
- I like eating pizza.
- Bread is tasty.
- Fruit is healthy.
For us as humans, it is obvious that these sentences can be grouped by topic based on certain keywords:
| No. | Sentence | Topic |
|---|---|---|
| 1 | Soccer is exciting. | sports |
| 2 | I like playing tennis. | sports |
| 3 | Sports are fun. | sports |
| 4 | Today it is sunny. | weather |
| 5 | Tomorrow it will rain. | weather |
| 6 | The weather is nice. | weather |
| 7 | I like eating pizza. | food |
| 8 | Bread is tasty. | food |
| 9 | Fruit is healthy. | food |
Other words are less clear: “playing” could refer not only to sports but also to an instrument or a board game; “healthy” may refer to food but also to sports; “is”, “I”, “it”, etc. are quite generic (whereas words such as “nice” or “gladly” tend to indicate moods rather than topics).
The Bag of Words method employs supervised machine learning. One starts with labeled training data: $$ \{(S_i, t_i)\}_{i=1}^n $$
Here, each of the n entries consists of a sentence \(S_i\) and a classification label \(t_i\in\{1, \dots, s\}\) for s classes.
In the Bag of Words method, one first extracts the vocabulary from the corpus (the text in the training data). For our 9 sentences, this vocabulary consists of 25 words (all lowercased):
[‘are’, ‘bread’, ‘eating’, ‘exciting’, ‘fruit’, ‘fun’, ‘healthy’, ‘i’, ‘is’, ‘it’, ‘like’, ‘nice’, ‘pizza’, ‘playing’, ‘rain’, ‘soccer’, ‘sports’, ‘sunny’, ‘tasty’, ‘tennis’, ‘the’, ‘today’, ‘tomorrow’, ‘weather’, ‘will’]
Using this defined vocabulary—an alphabetically sorted list of words—each of the sentences (more generally: each document) can be mapped to a feature vector whose j-th entry indicates whether the j-th vocabulary word appears in it (a binary indicator vector; for a single word, this reduces to a one-hot vector).
For sentence 1 (S1) this mapping has the following form (feature vector in row notation): $$ S_1 = \text{“Soccer is exciting.”} \longrightarrow \mathbf{x}_1 =[0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0] $$
Explanation of vector entries: the entries equal 1 at the positions of “exciting” (4th), “is” (9th), and “soccer” (16th) in the alphabetically sorted vocabulary, and 0 everywhere else.
A more readable representation is the document-term matrix (DTM); apart from the header row and the first column (which lists the vocabulary), the columns correspond to the feature vectors (in column notation):
|  | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 |
|---|---|---|---|---|---|---|---|---|---|
| are | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| bread | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| eating | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| exciting | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| fruit | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| fun | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| healthy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| i | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| is | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 |
| it | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| like | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| nice | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| pizza | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| playing | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| rain | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| soccer | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sports | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| sunny | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| tasty | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| tennis | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| the | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| today | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| tomorrow | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| weather | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| will | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
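As a minimal sketch (plain Python; the sentence list and helper names are chosen purely for illustration), the vocabulary and the binary feature vectors above could be computed as follows:

```python
import re

# The nine example sentences (labels omitted here).
sentences = [
    "Soccer is exciting.", "I like playing tennis.", "Sports are fun.",
    "Today it is sunny.", "Tomorrow it will rain.", "The weather is nice.",
    "I like eating pizza.", "Bread is tasty.", "Fruit is healthy.",
]

def tokenize(text):
    """Lowercase and split into words, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

# Alphabetically sorted vocabulary (25 words for the corpus above).
vocabulary = sorted({word for s in sentences for word in tokenize(s)})

def bow_vector(sentence, vocabulary):
    """Binary feature vector: 1 if the vocabulary word occurs in the sentence."""
    words = set(tokenize(sentence))
    return [1 if w in words else 0 for w in vocabulary]

X = [bow_vector(s, vocabulary) for s in sentences]
print(len(vocabulary))   # 25
print(X[0])              # feature vector for S1 ("Soccer is exciting.")
```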
The Bag of Words method is based on the idea that by mapping sentences (documents) to feature vectors, generic algorithms for classifying vector-valued data can be applied. Specifically, a neural network (NN) is used, typically with softmax activation in the output layer.
During a training phase, parameters of the neural network (weights) are adjusted (“learned”), and these weights then allow classification when the model is applied to a (usually previously unseen) document.
As a minimal approach, one could employ multinomial logistic regression (i.e., an NN consisting only of an input layer and an output layer).
To illustrate this, we construct a simplified, “by hand” procedure that captures the essential aspects. Let’s assume we assign scores to keywords on a scale of 0 to 5 for each possible class (sports/weather/food): $$ \text{“tennis”}\longrightarrow [5,0,0], \quad \text{“playing”}\longrightarrow [2,0,0], \quad \text{“food”}\longrightarrow [0,0,5], \quad \text{“healthy”}\longrightarrow [2,0,3],\dots $$ Given a sentence, one can then obtain a score vector z by summing the partial scores, here for example sentence 2:
$$ S_2 = \text{“I like playing tennis.”} \longrightarrow \mathbf{z}_2 = [5+2,0,0] = [7,0,0] $$
In this case, the raw result (logit) has a nonzero entry only in the component corresponding to the “Sports” class. Thus, the classification is 100% sports.
In a more complex case, the situation may be less clear:
$$ S^* = \text{“After sports I like to eat some pizza.”} \longrightarrow \mathbf{z}^* = [5,0,5+5] = [5,0,10] $$
Here the sentence (which contains some out-of-vocabulary words) refers to both sports and food. If a single decision is required, food would be the classification result (because the 3rd component has the highest score). As the final activation, one could use a hard arg-max (mapping [5,0,10] to [0,0,1]) or a softmax activation to obtain probability scores for the classes.
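The hand-made scoring and the softmax step can be written down in a few lines; the score table below is just the toy example from this section, not a trained model:

```python
import math

CLASSES = ["sports", "weather", "food"]

# Toy keyword scores per class, as in the text; all other words score [0, 0, 0].
SCORES = {
    "tennis":  [5, 0, 0],
    "playing": [2, 0, 0],
    "food":    [0, 0, 5],
    "healthy": [2, 0, 3],
}

def logits(sentence):
    """Sum the per-word score vectors (the 'z' vector of the text)."""
    z = [0, 0, 0]
    for word in sentence.lower().replace(".", "").split():
        for k, score in enumerate(SCORES.get(word, [0, 0, 0])):
            z[k] += score
    return z

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = logits("I like playing tennis.")
print(z)                                    # [7, 0, 0]
print(CLASSES[z.index(max(z))])             # 'sports'
print([round(p, 3) for p in softmax(z)])    # heavily weighted towards 'sports'
```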
In practice, the BoW method is usually implemented using a 3-layer NN (e.g., with ReLU activation for the hidden layer):

Figure: 3-layer neural network
Note that, for binary input data, a hidden layer with ReLU activation can (learn to) implement the logical AND function (starting with the linear function \(z=a+b-1\) for inputs a and b).

Figure: Realization of AND function for binary data
Thus, given enough neurons in the middle layer, the NN can learn both scores for keywords and scores for specific combinations of keywords (such as “neural” + “network”), as well as more complicated patterns.
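A minimal sketch of the AND construction mentioned above, for binary inputs a and b:

```python
def relu(x):
    return max(0, x)

def and_unit(a, b):
    """ReLU(a + b - 1) equals 1 exactly if both binary inputs are 1."""
    return relu(a + b - 1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_unit(a, b))   # 0 0 0 / 0 1 0 / 1 0 0 / 1 1 1
```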
Obviously, the Bag of Words method has several severe limitations:
- Order of words ignored
- Semantics (word forms, synonyms) not learned
- Huge sparse vectors (dim = # words)
In principle, it is possible to address some of these limitations, e.g., by normalizing word forms (“play”, “plays”, “played”, “playing”, … ⟶ “play”). However, true progress requires the new concepts introduced in the following sections.
2.2 N-gram Methods – Including Word Order #
In the Bag of Words method, the order of words is completely ignored: the sentences “man bites dog” and “dog bites man” are mapped to the same feature vector. Obviously, BoW is not suitable for text generation.
In N-gram methods, one considers tuples of words (more generally, of tokens or characters), such as sequences of 2 or 3 words. Specifically, one extracts from a training corpus the vocabulary of words (unigrams) as well as the pairs (bigrams) and triplets (trigrams), along with their frequencies.
Example: Consider the following two original sentences (with 3 and 4 words):
- Soccer is exciting.
- I like eating pizza.
From these we obtain the 7 unigrams, i.e., the vocabulary (listed in order of appearance for clarity):
[soccer, is, exciting, i, like, eating, pizza]
the 5 bigrams
[soccer is, is exciting, i like, like eating, eating pizza]
and the 3 trigrams
[soccer is exciting, i like eating, like eating pizza]
Remark on sentence boundaries: n-grams are usually not formed across sentence boundaries; often, special start-of-sentence and end-of-sentence markers are added so that the beginnings and ends of sentences can be modeled as well.
If such an analysis is carried out on a sufficiently large corpus, one obtains (possibly topic-specific) statistics of word sequences that can be used to determine likely sentence continuations. Suppose the analysis yields the following trigrams beginning with “like eating”:
| Trigram | Frequency |
|---|---|
| like eating pizza | 5 |
| like eating Japanese | 2 |
| like eating out | 4 |
| like eating nothing | 1 |
An n-gram text generator would therefore complete the phrase “I like eating” with the most likely continuation, “pizza”; an n-gram suggestion system might offer “pizza” and “out” as options for the user. If the beginning of a text does not match any known trigram, the system falls back on bigrams (or, secondarily, unigrams).
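A minimal sketch of such a trigram-based suggestion system (the tiny corpus below is made up for illustration; a real system would use large counts as in the table above):

```python
from collections import Counter, defaultdict

corpus = [
    "i like eating pizza",
    "i like eating nothing",
    "i like eating pizza and bread",
]

# Count trigrams: (w1, w2) -> Counter of possible next words.
continuations = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        continuations[(w1, w2)][w3] += 1

def suggest(w1, w2, k=2):
    """Most likely continuations of the bigram (w1, w2)."""
    return [w for w, _ in continuations[(w1, w2)].most_common(k)]

print(suggest("like", "eating"))   # ['pizza', 'nothing'] ('pizza' occurs twice)
```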
The modern language models studied later are far better suited for generating text. Nevertheless, variants of the n-gram method are still used, e.g., for autocomplete during text entry on smartphones.
2.3 Embeddings, Word2vec #
Recall the main drawback of the BoW feature vectors: they are sparse and of huge dimensionality (one component per vocabulary word).
A central concept of modern language models is embeddings: words (more generally: tokens, see below) are mapped from an initial sparse representation (one-hot-encoding in the vocabulary space, cf. BoW method) to a new (learned) representation as dense vectors in a space of much lower dimensionality (typically: \(d \sim 300\)); the vector components in this lower-dimensional embedding space are roughly analogous to the topic scores used for manual classification in Section 2.1.
In 2013, Mikolov et al. (Google) introduced the Word2vec method for computing embeddings for a vocabulary from a training corpus using unsupervised learning. In general, one considers n-tuples of words (or tokens) from the corpus: a central word plus the context words up to some distance; we will assume \(n=5\), e.g.:
[ the | employee | travels | by | train ]
In the Continuous Bag of Words (CBoW) variant, the training task is to predict the central word given the (here 4) context words. Conversely, in the skip-gram variant, which we will consider, context words are predicted given the central word.
In practice, each of the context words is considered separately (and independently of its relative position); for our example, we would obtain four training pairs:
|  | Input | Output |
|---|---|---|
| 1 | travels | the |
| 2 | travels | employee |
| 3 | travels | by |
| 4 | travels | train |
Similar training pairs are obtained from all 5-tuples occurring in the documents or individual sentences of the corpus and are used to train a 3-layer feed-forward NN whose input and output layers have the (huge) size of the vocabulary, while the hidden layer has a much smaller dimension:

Figure: NN used in the skip-gram algorithm (here for embedding dimension 4)
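A sketch of how the (input, output) training pairs could be generated from a tokenized sentence (window of two context words on each side, i.e., 5-tuples as above; the function name is chosen for illustration):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center_word, context_word) training pairs for the skip-gram task."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "employee", "travels", "by", "train"]
for center, context in skipgram_pairs(sentence):
    if center == "travels":
        print(center, "->", context)
# travels -> the / employee / by / train
```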
After training (typically on millions of sentences), the model predicts which words can appear in the context of a given input word, i.e., it yields probabilities for each word of the vocabulary (most of which will be very close to zero). This is shown here for the word “travels” of our example:

Figure: Skip-gram predictions for context words of the central word “travels“.
Note that the final probabilities depend on the input word only via the narrow hidden layer. This dense word representation is called a word embedding. Since this is the property of interest, the original output layer of the full NN can be discarded; the (former) hidden layer then becomes the output layer of the resulting embedding model.
Since true synonyms have very similar context probabilities, it can be expected that word2vec methods also yield very similar embedding vectors for synonymous word pairs. It was a positive surprise that much less trivial semantic relationships emerge in the trained embeddings. In some cases, one can even perform arithmetic with embedding vectors: “king” − “man” + “woman” ≈ “queen”:

Figure: PCA projection of embeddings for the terms “king“, “queen“, “man“, and “woman“, illustrating interesting semantic relationships.
Similar relationships have been observed, e.g., between countries and capitals (such as “Germany” − “Berlin” ≈ “France” − “Paris”).
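As an illustration (not part of the lecture material), such relationships can be explored with pretrained embeddings, e.g., via the gensim library; this sketch assumes the downloadable "word2vec-google-news-300" vectors (a large download on first use):

```python
import gensim.downloader as api

# Pretrained Google News word2vec embeddings (300 dimensions).
wv = api.load("word2vec-google-news-300")

# "king" - "man" + "woman" ~ "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Cosine similarity between a related word pair.
print(wv.similarity("Germany", "Berlin"))
```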
Note:
- By construction, the mapping word (one-hot vector) → embedding is not directly invertible: the word best matching a given embedding vector must, in general, be determined via a nearest-neighbor search (or another classification step) over the vocabulary.
- Word embeddings can be directly extended to sentence or document embeddings by simply adding or averaging the embedding vectors of their constituent parts (see the sketch after this list); this concept is used in RAG (see below).
- In the Word2vec method, the dimensions of the embedding vectors are equivalent at the model level; this inherent symmetry must be broken during initialization (as in most deep learning algorithms).
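A minimal sketch of the averaging idea from the second note above, with a toy embedding table standing in for a trained model:

```python
import numpy as np

# Toy embeddings (dimension 4 for readability); real models use d ~ 300.
embeddings = {
    "i":     np.array([0.1, 0.0, 0.2, 0.0]),
    "like":  np.array([0.0, 0.3, 0.1, 0.0]),
    "pizza": np.array([0.0, 0.1, 0.0, 0.9]),
}

def sentence_embedding(words, embeddings):
    """Average the embedding vectors of all known words of a sentence."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vectors, axis=0)

print(sentence_embedding(["i", "like", "pizza"], embeddings))
```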
Question: How could one address deficiencies of the Bag of Words method and construct an improved text classification algorithm using embeddings?
2.4 Tokens #
So far, we have considered words as the basic unit of text processing. Due to the enormous vocabularies, this is not efficient even for a single language; language models that “understand” multiple languages (such as the LLMs used for ChatGPT) would hardly be conceivable this way. In practice, it is also important to be able to encode misspelled words. Of course, there already exists a compact universal basis for text: the letters (e.g., the Latin alphabet), which, however, lack individual meaning.
Instead, so-called tokens—i.e., word fragments—are used in practice as the basis for representing written language in modern NLP: The language models do not use letters or words, but exclusively a model-specific set of tokens (as well as internal embeddings, etc.). Input and output each have to be translated by an appropriate tokenizer.
If we were to manually define tokens, we would likely choose syllables, word stems, etc. as candidates—for example, splitting the word “launches” into the stem “launch” and the ending “es”. A language model could then learn the meaning of the word stems on the one hand and the possible combinations between stems and endings on the other. However, this procedure does not scale, cannot be fully automated, and is hardly applicable to multi-language models.
Basic BPE algorithm #
By contrast, the algorithm used in practice to generate a token vocabulary of fixed size, (modified) Byte-Pair Encoding (BPE), was originally introduced for text compression and is conceptually strikingly simple:
1. Choose a text corpus as a training dataset (e.g., English Wikipedia).
2. Define the target size of the token vocabulary (e.g., 10,000 tokens).
3. Initialize the vocabulary with all letters and punctuation symbols occurring in the corpus.
4. Count the frequency of all token pairs in the corpus.
5. Define the most frequent token pair as a new token; replace this pair in the corpus with the new token.
6. Iterate steps 4 and 5 until the vocabulary reaches the target size.
Example: Let us illustrate the BPE process with a reduced vocabulary. We will consider the sentence “Another reason honest hearts are lost here.”; if we ignore, for the moment, capitalization and punctuation, it contains only 9 different characters: a, e, h, l, n, o, r, s, t.
Initial tokenization: characters only
a n o t h e r | r e a s o n | h o n e s t | h e a r t s | a r e | l o s t | h e r e (36 tokens)
Initial vocabulary (characters):
{a, e, h, l, n, o, r, s, t} (9 symbols)
Pair frequency analysis (most common adjacent pair)
- h e → 3× (in “hearts”, “here”, “another”)
- r e → 3× (in “reason”, “are”, “here”); we select “h e” as the first merge
- e r, e a, o n, s t, a r → 2× each
Merge 1:
h e → he
Updated tokenization:
a n o t he r | r e a s o n | h o n e s t | he a r t s | a r e | l o s t | he r e (33 tokens)
Updated vocabulary:
{a, e, h, l, n, o, r, s, t, he} (10 symbols)
Most frequent pair now:
r e → 3× (in “reason”, “are”, “here”)
Merge 2:
r e → re
Updated tokenization:
a n o t he r | re a s o n | h o n e s t | he a r t s | a re | l o s t | he re (30 tokens)
Updated vocabulary:
{a, e, h, l, n, o, r, s, t, he, re} (11 symbols)
Next most frequent pairs:
o n → 2×, s t → 2×
Both pairs have the same frequency; we choose one of them (in practice, ties are broken by a fixed rule, e.g., order of first occurrence):
Merge 3:
o n → on
Updated tokenization:
a n o t he r | re a s on | h on e s t | he a r t s | a re | l o s t | he re (28 tokens)
Updated vocabulary:
{a, e, h, l, n, o, r, s, t, he, re, on} (12 symbols)
In this example, the next merge would again involve only single characters (i.e., original tokens). At a later stage, and in longer texts, larger tokens built from previously merged tokens would emerge as well.
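The merge loop of the basic BPE algorithm can be reproduced with a short sketch (plain Python, treating words individually as in the example above; ties between pairs are broken by order of first occurrence, which here reproduces the merges “he”, “re”, “on”):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Greedy BPE on a list of words; returns the learned merges and the tokenization."""
    # Start from the character-level tokenization of every word.
    tokenized = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs across the corpus.
        pairs = Counter()
        for tokens in tokenized:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair (first seen wins ties)
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair by the merged token.
        new_tokenized = []
        for tokens in tokenized:
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            new_tokenized.append(out)
        tokenized = new_tokenized
    return merges, tokenized

words = "another reason honest hearts are lost here".split()
merges, tokenized = bpe_merges(words, num_merges=3)
print(merges)      # [('h', 'e'), ('r', 'e'), ('o', 'n')]
print(tokenized)
```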
BPE for LLMs in more detail #
Note that we have cheated a bit in the above example, i.e., we have deviated from the BPE rules stated initially:
- We have treated all words individually (instead of encoding the full string).
- We have not encoded spaces.
- We have not encoded punctuation.
- We have not encoded capitalization.
The first point is very sensible, since word boundaries (at least in Western languages) carry important meaning; it also greatly simplifies generating the pair statistics. In contrast, it is absolutely essential to fully include spacing, punctuation, and capitalization in the encoding. Here are two additional rules (7 and 8, complementing rules 1-6 above) that respect word boundaries (they are observed, e.g., by SentencePiece tokenizers and the BPE tokenizer for GPT-2):
7. Spaces can only appear at the start of tokens (e.g., <space> + “he” can be merged, but not “he” + <space>).
8. No punctuation symbol is merged with other characters.
Since (in Western languages, when written correctly) words are preceded by a space, except for the first word of a line or document or within brackets or quotation marks (such as “hell”), many tokens start with a space. Let us, for illustration, look at the tokenization of this paragraph:

Figure: Tokenized text using the gpt-2 (or gpt-4o) tokenizer; generated using https://tiktokenizer.vercel.app/?model=gpt2
We can see that only the words “Since”, “in”, “such” and “hell” appear in tokens that don’t start with a space; conversely, the token " at" used in the text obviously cannot be used for encoding, e.g., “spat”. Also note that almost all words have been encoded using a single token (since they are very common), with the sole exception of the word “tokenization”.
Many large company names even have their own tokens in all relevant variants (e.g., with and without a leading space):

Figure: Tokenized text using the gpt-4o tokenizer; generated using https://tiktokenizer.vercel.app/?model=gpt-4o
As shown in the following table (composed using ChatGPT), typical vocabulary sizes employed for LLMs have increased from about 30,000 to about 100,000 - 200,000 within the last few years:
| Year | Model / Family | Vocabulary size | Notes |
|---|---|---|---|
| 2018 | BERT (bert-base-uncased) | 30,522 | WordPiece vocab used in the original BERT models. (Hugging Face) |
| 2019 | GPT-2 | 50,257 | Byte-level BPE; 50k merges + base bytes + special EOT. (Hugging Face) |
| 2019 | BART (facebook/bart-base/large) | 50,265 | Shared encoder–decoder vocab; same size for base & large. (Hugging Face) |
| 2019 | T5 (T5-base etc.) | 32,128 | 32k “core” tokens + extra IDs → 32,128 embeddings in HF config. (Hugging Face) |
| 2022 | GPT-3.5 (gpt-3.5-turbo; tokenizer cl100k_base) | ≈100,000 | cl100k_base tokenizer used by GPT-3.5 & GPT-4, documented as “about 100k” tokens. (alignmentforum.org) |
| 2024 | GPT-4o (tokenizer o200k_base) | ≈200,000 | o200k_base doubles vocab vs. cl100k_base for better multilingual and code efficiency. (njkumarr) |
| 2024 | Llama 3 | 128,000 | Meta’s Llama 3 uses a 128k-token vocabulary (TikToken-style tokenizer). (ai.meta.com) |
| 2025 | DeepSeek R1 (Deepseek-V3 config) | 129,280 | Hugging Face config lists vocab_size = 129,280 for the underlying DeepSeek-V3 model used in R1. (Hugging Face) |
LLMs usually also employ special tokens, such as (for GPT-4o) <|im_start|>, <|im_end|>, and <|im_sep|> marking the start or end of, or the separation between, message blocks; these are added after completion of the BPE algorithm. Note that decoder-only models such as GPT do not employ start-of-text tokens (which are present in the original encoder-decoder transformer architecture).
BPE for UTF-8 #
One complication that we have not covered so far: many languages use characters outside the ASCII alphabet. While moderate extensions such as ISO 8859-1 suffice for languages such as German or French, Chinese alone comprises tens of thousands of characters. Unicode, the current world standard for text, defines 1,112,064 possible so-called code points; UTF-8 encodes each code point using a variable-width scheme of one to four one-byte (8-bit) code units. Note that these code points are not all characters (of some sort) but include, e.g., emojis such as 😂 (U+1F602).
Problem: How can one tokenize if the start vocabulary size (the number of characters = code points) is already much larger than the target vocabulary size?
Answer: There are two straightforward solutions:
- One works strictly on the level of code points, but encodes only a subset of the most common characters and symbols (of a size much smaller than the target vocabulary size, maybe 1/4); all other code points are replaced by a special unknown token (UNK). This approach has been used, e.g., for the BERT model, in early T5 models, as well as in Llama 1 and 2 (WordPiece/SentencePiece). Google still uses SentencePiece (with a byte fallback) for its LLMs, and keeps digits separated.
- BPE is not employed on the level of code points (= characters) but on the level of their constituents, the individual bytes; the starting vocabulary size is then 256 (see the sketch below). This approach has been used in models such as BART and is used in all GPT models (tiktoken); Meta switched to tiktoken starting with Llama 3.
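As an illustration of the byte-level view (using only Python's built-in string/bytes handling):

```python
# One character can occupy one to four bytes in UTF-8.
for ch in ["a", "ß", "好", "😂"]:
    encoded = ch.encode("utf-8")
    print(ch, "->", list(encoded), f"({len(encoded)} byte(s))")

# Byte-level BPE therefore starts from a fixed base vocabulary of 256 byte values.
```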
Early byte-level BPE tokenizers (such as the one used for GPT-2) included special treatment of punctuation, i.e., they respected rules 7 and 8 stated above. Recent OpenAI tokenizers do merge punctuation with letters (producing tokens such as “.T”), violating rule 8, but still obey rule 7.
Tokenizers can be tested online at https://tiktokenizer.vercel.app/ and https://platform.openai.com/tokenizer.
Note that LLMs using BPE on the byte level (such as GPT models, Llama 3, DeepSeek R1) could, in principle, produce token sequences that are not valid UTF-8. In practice, several mechanisms ensure that the generated output is (almost always) valid UTF-8:
| Mechanism | Ensures valid UTF-8? | Strength / remarks |
|---|---|---|
| Model architecture | ❌ No | None |
| Training data | ✔ Yes | Extremely strong |
| Tokenizer vocab | ✔ Helps | Encourages whole-character tokens |
| Decoder fallback | ✔ Optional | Absolute guarantee |
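The “decoder fallback” in the last row refers to replacing invalid byte sequences when the model's byte output is converted back to text; a minimal sketch:

```python
# A truncated UTF-8 sequence: only 2 of the 4 bytes of the emoji U+1F602.
broken = bytes([240, 159])

# Strict decoding would raise UnicodeDecodeError; the fallback guarantees valid text.
print(broken.decode("utf-8", errors="replace"))   # prints the replacement character '�'
```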
Special Tokens #
A final remark: early LLMs, in particular early variants of ChatGPT, suffered from interesting problems involving strange tokens such as SolidGoldMagikarp (see, e.g., a blog post). This is a good example of the dangers of undertrained tokens: tokens that made it into the vocabulary but occurred (almost) never in the model's training data end up with essentially arbitrary behavior.